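Python crawlers often need the request headers copied from the browser. I used to convert them with PyCharm's batch find-and-replace. Today I went looking for a more convenient way and found something that exceeded my expectations: Chrome's Copy as cURL, plus a curl-to-Python converter.
The screenshot below shows Copy as cURL (in Chrome DevTools, right-click a request in the Network panel and choose Copy > Copy as cURL). When scraping a dynamic page with Python, you often have to find the real API endpoint behind the page and then build the request from its parameters.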
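Take the site https://fr.news.yahoo.com/politique/ as an example.
What gets copied is one big blob like this (note that the command captured here is for a www.wpbeginner.com page):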
curl 'https://www.wpbeginner.com/wp-tutorials/how-to-display-recently-registered-users-in-wordpress/' \
  -H 'authority: www.wpbeginner.com' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'accept-language: zh-CN,zh;q=0.9' \
  -H 'cache-control: max-age=0' \
  -H 'cookie: _gcl_au=1.1.436465112.1662469028; _omappvp=epBIA5cd0ZWUxNENwH3BIB7xc0aYZhHXWJ8due9UpGESG981kQNkxcwclZNZXf1cwASDIMJA4EkKa90SXkXN9sMKM4ovHclf; PushSubscriberStatus=CLOSED; peclosed=true; omSeen-wswleymcr7lvrnemcb77=1662491983243; _omra=%7B%22wswleymcr7lvrnemcb77%22%3A%22view%22%7D; om-wswleymcr7lvrnemcb77=1662492524006; _gid=GA1.2.1414703957.1662651389; PHPSESSID=usov6qpk0v7b18l0n0u1vieuqu; _ga_YFDKLJ5Q0T=GS1.1.1662717815.8.1.1662719411.43.0.0; _ga=GA1.2.1930480014.1662469028' \
  -H 'if-modified-since: Fri, 09 Sep 2022 10:14:20 GMT' \
  -H 'sec-ch-ua: "Google Chrome";v="105", "Not)A;Brand";v="8", "Chromium";v="105"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Windows"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: none' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36' \
  --compressed
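This is where https://curlconverter.com/ comes in: it converts a curl command directly into equivalent code in a range of languages. For Python, you can choose output that uses the requests library.
Once it is converted, there is no need to debug the parameters by hand; you can use the code as is, which is far more convenient. Postman's Import feature can do the same thing, but in my test Postman converted this particular request incorrectly.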
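To give a rough idea of what such a conversion involves, here is a minimal sketch that pulls the URL and the `-H` headers out of a bash-style Copy as cURL command using only the standard library. The function name `parse_curl` and the example.com URL are made up for illustration; this is not how curlconverter itself is implemented, and it ignores every flag other than `-H`/`--header`.

```python
import shlex


def parse_curl(command: str) -> dict:
    """Extract the URL and -H headers from a bash-style curl command (sketch)."""
    # Join backslash-continued lines, then split the way a POSIX shell would.
    tokens = shlex.split(command.replace("\\\n", " "))
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")

    url = None
    headers = {}
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-H", "--header") and i + 1 < len(tokens):
            # The next token is 'name: value'; split on the first colon only.
            i += 1
            name, _, value = tokens[i].partition(":")
            headers[name.strip()] = value.strip()
        elif not tok.startswith("-") and url is None:
            # Assumption: the first bare argument is the URL,
            # which is where Chrome puts it.
            url = tok
        # Any other flag (e.g. --compressed) is ignored in this sketch.
        i += 1
    return {"url": url, "headers": headers}


if __name__ == "__main__":
    # Placeholder command, not the real request from this post.
    example = (
        "curl 'https://example.com/page' \\\n"
        "  -H 'accept: text/html' \\\n"
        "  --compressed"
    )
    print(parse_curl(example))
```

The resulting `headers` dict can be passed to `requests.get(url, headers=headers)`. For anything beyond plain GET requests with headers, the online converter handles far more cases than this sketch does.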
import requests

cookies = {
    '_gcl_au': '1.1.436465112.1662469028',
    '_omappvp': 'epBIA5cd0ZWUxNENwH3BIB7xc0aYZhHXWJ8due9UpGESG981kQNkxcwclZNZXf1cwASDIMJA4EkKa90SXkXN9sMKM4ovHclf',
    'PushSubscriberStatus': 'CLOSED',
    'peclosed': 'true',
    'omSeen-wswleymcr7lvrnemcb77': '1662491983243',
    '_omra': '%7B%22wswleymcr7lvrnemcb77%22%3A%22view%22%7D',
    'om-wswleymcr7lvrnemcb77': '1662492524006',
    '_gid': 'GA1.2.1414703957.1662651389',
    'PHPSESSID': 'usov6qpk0v7b18l0n0u1vieuqu',
    '_ga_YFDKLJ5Q0T': 'GS1.1.1662717815.8.1.1662719411.43.0.0',
    '_ga': 'GA1.2.1930480014.1662469028',
}

headers = {
    'authority': 'www.wpbeginner.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'max-age=0',
    # Requests sorts cookies= alphabetically
    # 'cookie': '_gcl_au=1.1.436465112.1662469028; _omappvp=epBIA5cd0ZWUxNENwH3BIB7xc0aYZhHXWJ8due9UpGESG981kQNkxcwclZNZXf1cwASDIMJA4EkKa90SXkXN9sMKM4ovHclf; PushSubscriberStatus=CLOSED; peclosed=true; omSeen-wswleymcr7lvrnemcb77=1662491983243; _omra=%7B%22wswleymcr7lvrnemcb77%22%3A%22view%22%7D; om-wswleymcr7lvrnemcb77=1662492524006; _gid=GA1.2.1414703957.1662651389; PHPSESSID=usov6qpk0v7b18l0n0u1vieuqu; _ga_YFDKLJ5Q0T=GS1.1.1662717815.8.1.1662719411.43.0.0; _ga=GA1.2.1930480014.1662469028',
    'if-modified-since': 'Fri, 09 Sep 2022 10:14:20 GMT',
    'sec-ch-ua': '"Google Chrome";v="105", "Not)A;Brand";v="8", "Chromium";v="105"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
}

response = requests.get('https://www.wpbeginner.com/wp-tutorials/how-to-display-recently-registered-users-in-wordpress/', cookies=cookies, headers=headers)
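The generated code comments out the raw `cookie` header and passes a `cookies` dict instead. If you ever need to do that mapping by hand, for example for a cookie string copied straight from DevTools, a minimal sketch might look like the following. The helper name `cookie_header_to_dict` is made up for illustration and is not part of requests or curlconverter.

```python
def cookie_header_to_dict(cookie_header: str) -> dict:
    """Split a raw 'name=value; name2=value2' Cookie header into a dict (sketch)."""
    cookies = {}
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip empty segments such as a trailing ';'
        # Split on the first '=' only, since cookie values may contain '='.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Two of the cookies from the request above.
    print(cookie_header_to_dict("PushSubscriberStatus=CLOSED; peclosed=true"))
```

The result can be passed as `cookies=` to `requests.get`, in the same way as the dict in the generated code.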