Setting up a Scrapy project
Create the project:
scrapy startproject taobao
Open the project in PyCharm.
Create a spider at the project root:
scrapy genspider <spider_name> <allowed_domain>
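For reference, startproject generates a standard layout like this (shown for a project named taobao):

```text
taobao/
    scrapy.cfg            # deploy configuration file
    taobao/               # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py
```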
Debugging tool:
scrapy shell http://www.taobao.com
# Select by tag (a class selector also works); ::text takes the tag's text,
# extract() pulls the data, and extract_first() returns the first match —
# roughly equivalent to extract()[0]
response.css('title::text').extract_first()
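It's worth noting that extract_first() and extract()[0] differ on an empty result: the former returns None (or a default) while the latter raises IndexError. A small stdlib sketch of that behaviour (the helper is illustrative, not Scrapy's actual implementation):

```python
def extract_first(matches, default=None):
    # Mimic SelectorList.extract_first(): return the first result,
    # or a default instead of raising IndexError on an empty list.
    return matches[0] if matches else default

print(extract_first(['Taobao']))  # 'Taobao'
print(extract_first([]))          # None
```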
Run the spider:
scrapy crawl quotes
Breakpoint debugging in PyCharm:
Create a file main.py at the project root with the following content (just change the spider name):
from scrapy.cmdline import execute
import os
import sys

if __name__ == '__main__':
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(['scrapy', 'crawl', 'spider_name'])
The spider is as follows:
Spider file:
import scrapy
from ..items import MyscrapyItem

# Crawl logic
class QuotesSpider(scrapy.Spider):
    # spider name
    name = 'quotes'
    # allowed domains
    allowed_domains = ['lab.scrapyd.cn']
    # initial set of URLs
    start_urls = ['http://lab.scrapyd.cn/page/1/']

    # handle the response
    def parse(self, response):
        # select all the quote nodes with a CSS selector
        quotes = response.css('div.quote')
        # iterate over the node set
        for quote in quotes:
            item = MyscrapyItem()
            item["text"] = quote.css('span.text::text').extract_first()
            item["author"] = quote.css('span small.author::text').extract_first()
            item["tags"] = quote.css('div.tags a.tag::text').extract()
            yield item
        # if there is a next page, keep crawling recursively
        next_page = response.css('.next a::attr(href)').extract_first()
        if next_page is not None:
            # urljoin resolves relative hrefs against the current page
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Item file:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class MyscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # the data fields to scrape
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
Pipelines:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class MyscrapyPipeline:
    def process_item(self, item, spider):
        # handle the scraped result
        print(item)
        # return the item so any later pipelines receive it
        return item
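As a slightly more useful variant, a pipeline can persist items instead of printing them. A minimal sketch that writes each item as a line of JSON (the class name and output filename are illustrative, not part of the generated project):

```python
import json

class JsonLinesPipeline:
    # Hypothetical pipeline: appends each item to a JSON Lines file.
    def open_spider(self, spider):
        self.file = open('quotes.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # always return the item for any later pipelines
```

It is registered in ITEM_PIPELINES the same way as above, with a priority number (0-1000, lower runs first).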
To activate the pipeline, you also need to edit settings.py
and uncomment this block:
ITEM_PIPELINES = {
    'myscrapy.pipelines.MyscrapyPipeline': 300,
}
A note on XPath usage:
def parse(self, response, **kwargs):
    # locate the tags with an absolute path
    elements = response.xpath('//div[@class="ui relaxed divided items explore-repo__list"]//div[@class="item"]')
    for element in elements:
        # Note: when running XPath again on a node, the path is relative —
        # prefix the // with a dot, i.e. .// rather than //
        link = self.allow_domains + element.xpath('.//h3/a/@href').get()
        print(link)
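The relative-vs-absolute point can be checked without Scrapy at all: the stdlib's ElementTree supports the same .// syntax (the HTML snippet below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Two "item" divs, mirroring the selector above.
doc = ET.fromstring(
    '<root>'
    '<div class="item"><h3><a href="/repo1">r1</a></h3></div>'
    '<div class="item"><h3><a href="/repo2">r2</a></h3></div>'
    '</root>'
)
# .// searches relative to the node it is called on,
# so each find only sees that item's own descendants.
links = [item.find('.//a').get('href')
         for item in doc.findall(".//div[@class='item']")]
print(links)  # ['/repo1', '/repo2']
```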
