For a primer on BeautifulSoup4, see 《Python爬虫扩展库BeautifulSoup4用法精要》; for a Scrapy crawler example, see 《Python使用Scrapy爬虫框架爬取天涯社区小说“大宗师”全文》; for the underlying crawling principles, see 《Python不使用scrapy框架而编写的网页爬虫程序》.
The code in this article was run under Python 3.6.1 + Scrapy 1.3.0.
>>> import scrapy
# Sample HTML used for testing
>>> html = '''
<html>
<head>
<base href='http://example.com/'/>
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
<a href='test.html'>This is a test.</a>
</div>
</body>
</html>
'''
# Create a Selector object from the HTML text
>>> sel = scrapy.selector.Selector(text=html)
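# In a real spider the downloaded Response object offers the same .xpath()/.css() API,
# and a Selector can also be built from a response; a minimal sketch using a locally
# constructed HtmlResponse (not an actual download) for illustration:
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url='http://example.com/', body=html, encoding='utf-8')
>>> scrapy.selector.Selector(response=response).xpath('//title/text()').extract()
['Example website']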
# Select the title tag
>>> sel.xpath('//title').extract()
['<title>Example website</title>']
# Extract the text of the title tag
>>> sel.xpath('//title/text()').extract()
['Example website']
# The equivalent CSS selector
>>> sel.css('title::text').extract()
['Example website']
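# When a single result is expected, extract_first() returns the first match as a string
# (or None if nothing matches) instead of a one-element list:
>>> sel.xpath('//title/text()').extract_first()
'Example website'
>>> sel.css('title::text').extract_first()
'Example website'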
# List all href and src attribute values
>>> sel.xpath('//@href').extract()
['http://example.com/', 'image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html', 'test.html']
>>> sel.xpath('//@src').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']
# Get the URL from the base tag
>>> sel.xpath('//base/@href').extract()
['http://example.com/']
>>> sel.css('base::attr(href)').extract()
['http://example.com/']
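# The relative links in the page can be joined with this base address to get absolute URLs;
# a small sketch using the standard-library urljoin:
>>> from urllib.parse import urljoin
>>> base = sel.xpath('//base/@href').extract_first()
>>> [urljoin(base, href) for href in sel.xpath('//a/@href').extract()]
['http://example.com/image1.html', 'http://example.com/image2.html', 'http://example.com/image3.html', 'http://example.com/image4.html', 'http://example.com/image5.html', 'http://example.com/test.html']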
# Select all a tags
>>> sel.xpath('//a').extract()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>', '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>', '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>', '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>', '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>', '<a href="test.html">This is a test.</a>']
# Only a tags whose href attribute contains the string "image"
>>> sel.xpath('//a[contains(@href, "image")]').extract()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>', '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>', '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>', '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>', '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
# Extract the text of the matching a tags
>>> sel.xpath('//a[contains(@href, "image")]/text()').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
# Apply a regular expression to the extracted text
>>> sel.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']
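# The captured names above keep a trailing space (the space before the <br /> tag);
# anchoring the pattern at the end of the string drops it:
>>> sel.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*?)\s*$')
['My image 1', 'My image 2', 'My image 3', 'My image 4', 'My image 5']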
>>> sel.xpath('//a[contains(@href, "image")]/@href').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> sel.xpath('//a/@href').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html', 'test.html']
>>> sel.css('a[href*=image]::attr(href)').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> sel.css('a[href*=image] img::attr(src)').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']
>>> sel.xpath('//img/@src').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']
>>> sel.xpath('//a[contains(@href, "image")]/img/@src').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']
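# Selectors can be iterated and queried with relative XPath (note the leading dot),
# which keeps each href paired with its thumbnail; a rough sketch:
>>> for a in sel.xpath('//a[contains(@href, "image")]'):
...     print(a.xpath('./@href').extract_first(), a.xpath('./img/@src').extract_first())
...
image1.html image1_thumb.jpg
image2.html image2_thumb.jpg
image3.html image3_thumb.jpg
image4.html image4_thumb.jpg
image5.html image5_thumb.jpg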
# src of the img inside the a tag whose href contains "image1"
>>> sel.xpath('//a[contains(@href, "image1")]/img/@src').extract()
['image1_thumb.jpg']