XSpider Crawler Framework
Author: IntoHole | Reposting is permitted, but you must credit the original source, author, and copyright notice with a hyperlink.
URL: http://www.buyiker.com/2017/04/20/xspider-intro.html
Project Background
- Single-threaded crawling
- Simple API
- XPath/CSS/JSON extractors
- Multiple queue implementations
- Clear architecture and code logic, so the spider's crawl process is easy to follow
- It's easy to crawl and extract web pages
- Project repository: xspider
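The single-threaded crawl loop described above can be sketched with the standard library alone. This is an illustration of the general idea, not xspider's actual implementation; the `fetch` and `extract` callables here are stand-ins for the framework's downloader and page processor:

```python
import re
from collections import deque

def crawl(start_urls, url_filters, extract, fetch, max_pages=100):
    """Minimal single-threaded crawl loop: a FIFO queue of URLs,
    regex-based URL filtering, and a pluggable extractor."""
    queue = deque(start_urls)
    seen = set(start_urls)
    results = []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)                    # download the page
        results.append(extract(url, html))   # run the page processor
        for link in re.findall(r'href="([^"]+)"', html):
            # follow only links matching at least one filter pattern
            if link not in seen and any(re.search(p, link) for p in url_filters):
                seen.add(link)
                queue.append(link)
    return results
```

The queue here is a plain in-memory deque; swapping in another queue type (one of the framework's advertised features) would only change how URLs are stored, not the loop itself.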
main.py:
```python
from xspider.spider.spider import BaseSpider
from xspider.filters import urlfilter

from kuailiyu import KuaiLiYu

if __name__ == "__main__":
    # Create a spider restricted to the kuailiyu.cyzone.cn site
    spider = BaseSpider(name="kuailiyu",
                        page_processor=KuaiLiYu(),
                        allow_site=["kuailiyu.cyzone.cn"],
                        start_urls=["http://kuailiyu.cyzone.cn/"])
    # Follow only article pages and index pages
    spider.url_filters.append(urlfilter.UrlRegxFilter([
        r"kuailiyu.cyzone.cn/article/[0-9]*\.html$",
        r"kuailiyu.cyzone.cn/index_[0-9]+.html$",
    ]))
    spider.start()
```
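To see which URLs those two patterns accept, you can exercise them with the standard `re` module (this assumes `UrlRegxFilter` keeps a URL when any pattern matches it via a regex search, which is the usual behavior for this kind of filter):

```python
import re

# The two patterns passed to UrlRegxFilter in main.py above
patterns = [
    r"kuailiyu.cyzone.cn/article/[0-9]*\.html$",
    r"kuailiyu.cyzone.cn/index_[0-9]+.html$",
]

def accepts(url):
    """True if any pattern matches the URL (assumed filter semantics)."""
    return any(re.search(p, url) for p in patterns)

print(accepts("http://kuailiyu.cyzone.cn/article/12345.html"))  # article page
print(accepts("http://kuailiyu.cyzone.cn/index_2.html"))        # index page
print(accepts("http://kuailiyu.cyzone.cn/about.html"))          # neither
```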
kuailiyu.py:
```python
from xspider import processor
from xspider.selector import xpath_selector
from xspider import model

class KuaiLiYu(processor.PageProcessor.PageProcessor):
    def __init__(self):
        super(KuaiLiYu, self).__init__()
        # Extract the page title via XPath
        self.title_extractor = xpath_selector.XpathSelector(path="//title/text()")

    def process(self, page, spider):
        # Fileds is xspider's result-item container (spelled this way in the library)
        items = model.fileds.Fileds()
        items["title"] = self.title_extractor.find(page)
        items["url"] = page.url
        return items
```
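What the `//title/text()` selector does can be reproduced with the standard-library HTML parser, without xspider installed. This is a stand-alone sketch of the extraction step only; `TitleExtractor` and `extract_title` are illustrative names, not part of xspider:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text content of <title>, mimicking the
    //title/text() XPath used by the KuaiLiYu processor above."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_title(html):
    parser = TitleExtractor()
    parser.feed(html)
    return parser.title

print(extract_title("<html><head><title>XSpider Demo</title></head></html>"))
```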
References