Table of Contents
- Disguising the Crawler
- Dynamic IP Access Guide
- Writing the IP Proxy Middleware
- Configuring the Middleware in Settings
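Disguising the Crawler
If the crawler is not disguised, every request is sent from the same IP address, so the target server's firewall can identify and block it. There are two ways to disguise it: configuring a proxy IP and writing a User-Agent middleware. The proxy approach described here requires registering an Abuyun (阿布云) account first.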
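The User-Agent half of the disguise is not shown in this post, so here is a minimal sketch of what such a middleware might look like. The `USER_AGENTS` pool and its contents are assumptions for illustration, not the author's actual list; the class name only mirrors the `UserAgentMiddleware` entry referenced in the settings.

```python
import random

# Hypothetical pool of User-Agent strings (an assumption for illustration);
# replace these with values that suit the target site.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class UserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent per request."""

    def process_request(self, request, spider):
        # Overwrite the header so each request may present a different browser.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to continue processing the request.
        return None
```

The class needs no Scrapy import because Scrapy only requires a downloader middleware to expose `process_request`; rotating the header per request is one simple choice, and a fixed per-spider value would work the same way.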
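Dynamic IP Access Guide
After registering with Abuyun, you can buy one hour of dynamic IP service for 1 yuan for testing. Once the purchase succeeds, open the corresponding access guide, which shows the Scrapy configuration: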
import base64

# Proxy server
proxyServer = "http://http-dyn.abuyun.com:9020"

# Proxy tunnel credentials
proxyUser = "H01234567890123D"
proxyPass = "0123456789012345"

# for Python2
# proxyAuth = "Basic " + base64.b64encode(proxyUser + ":" + proxyPass)

# for Python3
proxyAuth = "Basic " + base64.urlsafe_b64encode(bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
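Writing the IP Proxy Middleware
Following the access guide above, create a ProxyMiddleware that holds the relevant settings to complete the dynamic IP configuration: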
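import base64

# Proxy server
proxyServer = "http://http-dyn.abuyun.com:9020"

# Proxy tunnel credentials, in the form b"user:password"
proxy_name_pass = b"HH59908195O5720D:4B4748D2DBD1B53D"

# Base64-encode the credentials (bytes in, bytes out)
proxyAuth = base64.b64encode(proxy_name_pass)


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route the request through the proxy tunnel
        request.meta["proxy"] = proxyServer
        # Authenticate against the tunnel with HTTP Basic credentials
        request.headers["Proxy-Authorization"] = "Basic " + proxyAuth.decode()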
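Hard-coding the tunnel credentials in the middleware module makes them easy to leak. As a hedged alternative, the sketch below reads them from the project settings through Scrapy's `from_crawler` hook. The setting names `ABUYUN_PROXY_SERVER`, `ABUYUN_PROXY_USER`, and `ABUYUN_PROXY_PASS`, the helper `build_proxy_auth`, and the class name `SettingsProxyMiddleware` are all hypothetical names chosen for this example.

```python
import base64


def build_proxy_auth(user, password):
    """Return the value of a Proxy-Authorization header for HTTP Basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("ascii")).decode("ascii")
    return "Basic " + token


class SettingsProxyMiddleware:
    """Proxy middleware that takes its configuration from settings.py."""

    def __init__(self, server, user, password):
        self.server = server
        self.auth = build_proxy_auth(user, password)

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting names; define them in settings.py yourself.
        settings = crawler.settings
        return cls(
            settings.get("ABUYUN_PROXY_SERVER"),
            settings.get("ABUYUN_PROXY_USER"),
            settings.get("ABUYUN_PROXY_PASS"),
        )

    def process_request(self, request, spider):
        # Same two assignments as in ProxyMiddleware, using instance state.
        request.meta["proxy"] = self.server
        request.headers["Proxy-Authorization"] = self.auth
```

With this variant, the three values would live in `settings.py` (or be injected from the environment there), and the middleware path registered in `DOWNLOADER_MIDDLEWARES` would point at this class instead.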
Configuring the Middleware in Settings
DOWNLOADER_MIDDLEWARES = {
    # Reserved for AJAX loading, to be completed later
    'douban.middlewares.DoubanDownloaderMiddleware': 544,
    # IP disguise
    'douban.proxymiddlewares.ProxyMiddleware': 542,
    # User-Agent disguise
    'douban.user_agent_middlewares.UserAgentMiddleware': 543,
}
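The numbers in `DOWNLOADER_MIDDLEWARES` control ordering: Scrapy calls `process_request` in increasing order of these values, and a value of `None` disables an entry. The small sketch below reproduces that ordering for the dictionary above; `request_order` is a hypothetical helper written for this illustration, not part of Scrapy.

```python
# Same mapping as in settings.py above.
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.DoubanDownloaderMiddleware': 544,
    'douban.proxymiddlewares.ProxyMiddleware': 542,
    'douban.user_agent_middlewares.UserAgentMiddleware': 543,
}


def request_order(middlewares):
    """List enabled middleware paths in the order process_request runs."""
    # Entries mapped to None are disabled, so drop them before sorting.
    enabled = {path: order for path, order in middlewares.items() if order is not None}
    return sorted(enabled, key=enabled.get)


for path in request_order(DOWNLOADER_MIDDLEWARES):
    print(path)
```

Under this configuration, the proxy middleware handles each request first, then the User-Agent middleware, and finally `DoubanDownloaderMiddleware`.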