

news 2025/8/7 13:51:16


1. Introduction to CrawlSpider

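Among the spider classes that Scrapy provides, the two most commonly used are Spider and CrawlSpider.
This example implements its crawler with CrawlSpider.

CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines a set of rules (Rule objects) that give it a convenient mechanism for following links. That makes it the better fit for jobs that extract links from crawled pages and keep crawling them.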

Create the project:

scrapy startproject baidu

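Generate the spider from the crawl template (the spider is named tieba here because Scrapy refuses to create a spider with the same name as its project):

scrapy genspider -t crawl tieba tieba.baidu.com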

CrawlSpider inherits from Spider. In addition to the inherited attributes (name, allowed_domains), it provides new attributes and methods:

LinkExtractors

class scrapy.linkextractors.LinkExtractor

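The purpose of a link extractor is simple: to extract links.
Each LinkExtractor has a single public method, extract_links(), which takes a Response object and returns a list of scrapy.link.Link objects.
A link extractor is instantiated once, and its extract_links() method is then called many times, once per response, to extract links.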

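Main parameters:

allow: a regular expression (or list of regular expressions); only URLs that match are extracted. If empty, every link matches.
deny: a regular expression (or list of regular expressions); URLs that match are never extracted. It takes precedence over allow.
allow_domains: the domains whose links will be extracted.
deny_domains: the domains whose links will never be extracted.
restrict_xpaths: an XPath expression that limits which regions of the page links are taken from; it works together with allow to filter links.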

rules

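rules contains one or more Rule objects, each of which defines a particular behavior for crawling the site. If several rules match the same link, the first one, in the order they are defined in this attribute, is used.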

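Parameters:

link_extractor: a LinkExtractor object that defines which links to extract.
callback: the callback to run for each link obtained from link_extractor (a callable, or the name of a spider method given as a string). The callback receives the response as its first argument.
Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so overriding parse will break the crawl spider.
follow: a boolean that specifies whether links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
process_links: names the spider method to call with the list of links obtained from link_extractor. It is mainly used for filtering.
process_request: names the spider method to call for every request extracted by this rule. It is used to filter requests.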

2. Building the example

a. Start a project

scrapy startproject wxapp

b. Generate the spider from the template

scrapy genspider -t crawl wxapp_spider "wxapp-union.com"

c.settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for wxapp project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'wxapp'

SPIDER_MODULES = ['wxapp.spiders']
NEWSPIDER_MODULE = 'wxapp.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'wxapp (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'wxapp.middlewares.WxappSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'wxapp.middlewares.WxappDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

wxapp_spider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # Listing pages: follow the pagination links, no callback needed
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),
        # Article pages: parse them, but do not follow the links found on them
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback="parse_detail", follow=False),
    )

    def parse_detail(self, response):
        title = response.xpath('//h1[@class="ph"]/text()').get()
        author_p = response.xpath('//p[@class="authors"]')
        author = author_p.xpath('.//a/text()').get()
        pub_time = author_p.xpath('.//span/text()').get()
        # getall() returns every text node of the article body as a list of strings
        content = response.xpath('//td[@id="article_content"]//text()').getall()
        item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
        yield item
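The two allow patterns in rules can be checked offline before starting a crawl. The sketch below only mimics the allow check with Python's re module (a regex search against the URL) and does not run Scrapy. Apart from the start URL, the sample URLs are made up for illustration, including the article URL.

```python
import re

# The two patterns used in the spider's rules, in the same order.
LIST_PATTERN = re.compile(r".+mod=list&catid=2&page=\d")
DETAIL_PATTERN = re.compile(r".+article-.+\.html")

# Only the first URL comes from the spider; the others are hypothetical.
urls = [
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1",
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=12",
    "http://www.wxapp-union.com/article-4527-1.html",
    "http://www.wxapp-union.com/portal.php?mod=list&catid=3&page=1",
]


def classify(url):
    """Return which rule would pick up the URL, checking the rules in order."""
    if LIST_PATTERN.search(url):
        return "list"
    if DETAIL_PATTERN.search(url):
        return "detail"
    return "ignored"


results = {url: classify(url) for url in urls}
for url, kind in results.items():
    print(kind, url)
```

Because the pattern is searched rather than matched against the whole URL, page=\d also accepts multi-digit page numbers such as page=12.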

items.py

import scrapy


class WxappItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()

pipelines.py

from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline(object):
    def __init__(self):
        # The exporter writes bytes, so the file is opened in binary mode
        self.fp = open("wxapp.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
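JsonLinesItemExporter writes one JSON object per line, so despite its .json name the output file is not a single JSON document and json.load() on the whole file would fail. A minimal sketch of reading such a file back with the standard library; the sample records and the temporary path are made up for illustration and stand in for real scraped items.

```python
import json
import os
import tempfile

# Hypothetical records with the same fields as WxappItem.
sample_items = [
    {"title": "示例标题", "author": "someone", "pub_time": "2018-09-11", "content": ["line 1", "line 2"]},
    {"title": "Second", "author": "another", "pub_time": "2018-09-12", "content": ["text"]},
]

# Lay the records out the way a JSON Lines file looks: one object per line.
path = os.path.join(tempfile.mkdtemp(), "wxapp.json")
with open(path, "w", encoding="utf-8") as fp:
    for item in sample_items:
        fp.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_json_lines(file_path):
    """Parse every non-empty line of the file as its own JSON object."""
    with open(file_path, encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]


loaded = read_json_lines(path)
print(len(loaded), loaded[0]["title"])
```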

Create start.py in the wxapp directory:

from scrapy import cmdline

cmdline.execute("scrapy crawl wxapp_spider".split())

Then run start.py.

Reposted from: https://www.cnblogs.com/hbxZJ/p/9629101.html
