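Many websites require submitting a form to log in or perform other actions. The POST method of the requests library handles this: by inspecting the form's source and reverse-engineering the request, you can fill in the form fields and retrieve the page data. As an exercise, this code fetches Python-related job postings from Lagou (拉勾网). Open Lagou and press F12 to open the browser's developer tools; the site loads its data via Ajax. Click the Network tab and select the XHR filter: the Headers pane shows the request URL, and the Response pane shows that the returned data is JSON. Because the JSON string is long and complex, the Preview pane is easier to read, and it contains exactly the job information shown on the page. All postings sit under content → positionResult → result. When you turn the page, the request URL does not change, but the request method is POST, and one of the submitted form fields, pn, changes with the page number. A crawler can be built on this basis. The code is as follows: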
import json
import time

import pymongo
import requests

# Connect to a local MongoDB instance and select the target collection
client = pymongo.MongoClient('localhost', 27017)
mydb = client['mydb']
lagou = mydb['lagou']

cookie = 'replace this with your own cookie'
# Full set of request headers copied from the browser
headers = {
    'cookie': cookie,
    'origin': "https://www.lagou.com",
    'x-anit-forge-code': "0",
    'accept-encoding': "gzip, deflate, br",
    'accept-language': "zh-CN,zh;q=0.8,",
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    'content-type': "application/x-www-form-urlencoded; charset=UTF-8",
    'accept': "application/json, text/javascript, */*; q=0.01",
    'referer': "https://www.lagou.com/jobs/list_Pyhon?labelWords=&fromSearch=true&suginput=",
    'x-requested-with': "XMLHttpRequest",
    'connection': "keep-alive",
    'x-anit-forge-token': "None"
}


def get_page(url, params):
    # Request the first page to read the total number of postings
    html = requests.post(url, data=params, headers=headers)
    json_data = json.loads(html.text)
    total_count = json_data['content']['positionResult']['totalCount']
    # 15 postings per page, capped at 30 pages
    page_number = min(total_count // 15, 30)
    get_info(url, page_number)


def get_info(url, page):
    for pn in range(1, page + 1):
        # Only the pn field changes from page to page
        params = {'first': 'true', 'pn': str(pn), 'kd': 'Python'}
        try:
            html = requests.post(url, data=params, headers=headers)
            json_data = json.loads(html.text)
            results = json_data['content']['positionResult']['result']
            for result in results:
                infos = {
                    'businessZones': result['businessZones'],
                    'city': result['city'],
                    'companyFullName': result['companyFullName'],
                    'companyLabelList': result['companyLabelList'],
                    'companySize': result['companySize'],
                    'district': result['district'],
                    'education': result['education'],
                    'financeStage': result['financeStage'],
                    'firstType': result['firstType'],
                    'formatCreateTime': result['formatCreateTime'],
                    'gradeDescription': result['gradeDescription'],
                    'imState': result['imState'],
                    'industryField': result['industryField'],
                    'positionAdvantage': result['positionAdvantage'],
                    'salary': result['salary'],
                    'workYear': result['workYear'],
                }
                lagou.insert_one(infos)
            # Pause between page requests
            time.sleep(2)
        except requests.exceptions.ConnectionError:
            # Skip pages that fail to connect
            pass


if __name__ == '__main__':
    url = 'https://www.lagou.com/jobs/positionAjax.json'
    params = {'first': 'true', 'pn': '1', 'kd': 'Python'}
    get_page(url, params)
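The page-number calculation in get_page and the form fields built in get_info can be pulled out into small helpers that are easy to check without touching the network. The following is only a sketch: the names page_count and build_form are hypothetical, and the page size of 15 and the cap of 30 pages are taken from the code above. Unlike the original, which rounds down, this variation rounds up so that a partial last page is still requested.

```python
import math

PAGE_SIZE = 15  # postings per page, as assumed by the code above
MAX_PAGES = 30  # page cap used by the code above


def page_count(total_count, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Number of pages to request, rounded up and capped at max_pages."""
    if total_count <= 0:
        return 0
    return min(math.ceil(total_count / page_size), max_pages)


def build_form(pn, keyword='Python'):
    """Form fields for page pn; only the pn value changes between pages."""
    return {'first': 'true', 'pn': str(pn), 'kd': keyword}
```

With these helpers, the loop in get_info could iterate over range(1, page_count(total_count) + 1) and pass build_form(pn) as the data argument of requests.post.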
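Lagou uses anti-scraping measures, so requests sent through a simple proxy or with only basic headers get blocked with the message "您的操作过于频繁,请稍后再试" ("Your operations are too frequent, please try again later"). In my tests, sending the complete set of headers avoids the problem. The scraped data is stored in a MongoDB database.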
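When a request is blocked, the returned JSON will most likely not contain the content → positionResult → result path, and indexing into it directly raises an exception. The sketch below shows one way to guard against that. The helper name extract_results is hypothetical, and the exact shape of a blocked response is an assumption; the function only checks that the expected path exists.

```python
def extract_results(json_data):
    """Return the list of postings, or None if the expected path is missing.

    A missing path is treated as a sign that the request was blocked
    (an assumption; the real blocked payload may look different).
    """
    try:
        results = json_data['content']['positionResult']['result']
    except (KeyError, TypeError):
        return None
    return results if isinstance(results, list) else None
```

In get_info, a None return value could be used to stop the loop or to wait longer before retrying, instead of letting the KeyError propagate.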