Python Crawler: Scraping ITJuzi (IT桔子网) with Scrapy

Reposted from 成长之路丶 on 简书 (Jianshu)

Goal:

This crawl targets the investment-event module of ITJuzi and stores the scraped data in a MySQL database.

 

[Screenshots: the ITJuzi investment-event module]

Target analysis:

Browsing the site shows that the event module can only be accessed after logging in, so we need to log in first. Capture the login request:

[Screenshot: the login request captured in the browser's Network panel]

The login endpoint is https://www.itjuzi.com/api/authorizations. It is a POST request and the data is submitted as a request payload in JSON format (payload-style submissions are usually JSON). Now look at the response:

[Screenshot: the login request appears to have no response body in DevTools]

At first glance there is no response data, but there actually is; it simply does not show up in the F12 DevTools. We can use Postman to inspect the response body:

[Screenshot: the login response body viewed in Postman]

The response body is JSON. Set it aside for now and turn to the event module itself. Capturing traffic with F12 shows that the event module's data is actually loaded by an AJAX request:

[Screenshot: the AJAX request that returns the event data]

The AJAX request returns JSON as well. Now look at its headers:

[Screenshots: request headers of the event-data AJAX request, including the Authorization field]

The headers contain an Authorization parameter whose value is exactly the token field of the JSON returned by the login endpoint, so this parameter is very likely the credential the site uses to decide whether we are logged in. Let's simulate the request with Postman:

[Screenshot: simulating the event-data request in Postman]

The Postman simulation confirms our guess: as long as we add this parameter to the request headers, we get the data.
With data access solved, the remaining question is pagination. Comparing the JSON submitted for page 1 and page 2 reveals a few key parameters: page, pagetotal and per_page, which are the requested page number, the total number of records, and the number of records per page respectively. From pagetotal and per_page we can therefore compute the total number of pages. That completes the analysis, and we can start writing the program.
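
Before building the Scrapy project, the whole flow can be sanity-checked with a short script. This is a minimal sketch using the requests library purely for verification; the endpoints, payload keys and the data.token path come from the analysis above, while the account and password values are placeholders:

import math
import requests

LOGIN_URL = "https://www.itjuzi.com/api/authorizations"
EVENTS_URL = "https://www.itjuzi.com/api/investevents"

# Log in and pull the token out of the JSON response (placeholder credentials).
resp = requests.post(LOGIN_URL, json={"account": "your_account", "password": "your_password"})
token = resp.json()["data"]["token"]

# Request page 1 of the event list with the token in the Authorization header.
payload = {"pagetotal": 0, "total": 0, "per_page": 20, "page": 1, "type": 1,
           "scope": "", "sub_scope": "", "round": [], "valuation": [], "valuations": "",
           "ipo_platform": "", "equity_ratio": [""], "status": "", "prov": "", "city": [],
           "time": [], "selected": "", "location": "", "currency": [], "keyword": ""}
events = requests.post(EVENTS_URL, json=payload, headers={"Authorization": token}).json()

# per_page records per page, so round up to get the number of pages.
total = int(events["data"]["page"]["total"])
pages = math.ceil(total / payload["per_page"])
print(total, pages)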

 

Writing the Scrapy code

1. Create the Scrapy project and spider:

E:\>scrapy startproject itjuzi
E:\>cd itjuzi
E:\itjuzi>scrapy genspider juzi itjuzi.com
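
After these two commands the project layout should look roughly like this; items.py, the spider, pipelines.py and settings.py edited in the following steps all live inside the inner itjuzi package:

itjuzi/
    scrapy.cfg
    itjuzi/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            juzi.py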

 

2. Write items.py:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ItjuziItem(scrapy.Item):
    # Fields match the keys written by the spider and read by the MySQL pipeline.
    invse_des = scrapy.Field()
    invse_title = scrapy.Field()
    money = scrapy.Field()
    com_name = scrapy.Field()
    com_des = scrapy.Field()
    prov = scrapy.Field()
    round = scrapy.Field()
    invse_time = scrapy.Field()
    city = scrapy.Field()
    com_registered_name = scrapy.Field()
    com_scope = scrapy.Field()
    invse_company = scrapy.Field()

 

3. Write the spider:

import json

import scrapy

from itjuzi.items import ItjuziItem
from itjuzi.settings import JUZI_PWD, JUZI_USER


class JuziSpider(scrapy.Spider):
    name = 'juzi'
    allowed_domains = ['itjuzi.com']

    def start_requests(self):
        """
        Log in to ITJuzi first.
        """
        url = "https://www.itjuzi.com/api/authorizations"
        payload = {"account": JUZI_USER, "password": JUZI_PWD}
        # A JSON payload cannot be sent with scrapy.FormRequest; use scrapy.Request
        # with explicit method, body and Content-Type header instead.
        yield scrapy.Request(url=url,
                             method="POST",
                             body=json.dumps(payload),
                             headers={'Content-Type': 'application/json'},
                             callback=self.parse)

    def parse(self, response):
        # The login response carries the value for the Authorization header.
        token = json.loads(response.text)
        url = "https://www.itjuzi.com/api/investevents"
        payload = {
            "pagetotal": 0, "total": 0, "per_page": 20, "page": 1, "type": 1, "scope": "", "sub_scope": "",
            "round": [], "valuation": [], "valuations": "", "ipo_platform": "", "equity_ratio": [""],
            "status": "", "prov": "", "city": [], "time": [], "selected": "", "location": "", "currency": [],
            "keyword": ""
        }
        yield scrapy.Request(url=url,
                             method="POST",
                             body=json.dumps(payload),
                             meta={'token': token},
                             # Put the Authorization parameter into the headers.
                             headers={'Content-Type': 'application/json', 'Authorization': token['data']['token']},
                             callback=self.parse_info)

    def parse_info(self, response):
        # The Authorization token passed along via meta.
        token = response.meta["token"]
        # Total number of records.
        total = int(json.loads(response.text)["data"]["page"]["total"])
        # 20 records per page, so round up to get the number of pages.
        page = (total + 19) // 20
        url = "https://www.itjuzi.com/api/investevents"
        for i in range(1, page + 1):
            payload = {
                "pagetotal": total, "total": 0, "per_page": 20, "page": i, "type": 1, "scope": "", "sub_scope": "",
                "round": [], "valuation": [], "valuations": "", "ipo_platform": "", "equity_ratio": [""],
                "status": "", "prov": "", "city": [], "time": [], "selected": "", "location": "", "currency": [],
                "keyword": ""
            }
            yield scrapy.Request(url=url,
                                 method="POST",
                                 body=json.dumps(payload),
                                 headers={'Content-Type': 'application/json', 'Authorization': token['data']['token']},
                                 callback=self.parse_detail)

    def parse_detail(self, response):
        infos = json.loads(response.text)["data"]["data"]
        for i in infos:
            item = ItjuziItem()
            item["invse_des"] = i["invse_des"]
            item["com_des"] = i["com_des"]
            item["invse_title"] = i["invse_title"]
            item["money"] = i["money"]
            item["com_name"] = i["name"]
            item["prov"] = i["prov"]
            item["round"] = i["round"]
            item["invse_time"] = str(i["year"]) + "-" + str(i["month"]) + "-" + str(i["day"])
            item["city"] = i["city"]
            item["com_registered_name"] = i["com_registered_name"]
            item["com_scope"] = i["com_scope"]
            invse_company = []
            for j in i["investor"]:
                invse_company.append(j["name"])
            item["invse_company"] = ",".join(invse_company)
            yield item

 

4. Write the pipeline:

import pymysql

from itjuzi.settings import DATABASE_DB, DATABASE_HOST, DATABASE_PORT, DATABASE_PWD, DATABASE_USER


class ItjuziPipeline(object):
    def __init__(self):
        host = DATABASE_HOST
        port = DATABASE_PORT
        user = DATABASE_USER
        passwd = DATABASE_PWD
        db = DATABASE_DB
        try:
            self.conn = pymysql.Connect(host=host, port=port, user=user, passwd=passwd, db=db, charset='utf8')
        except Exception as e:
            print("Failed to connect to the database: %s" % e)
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        params = [item['com_name'], item['com_registered_name'], item['com_des'], item['com_scope'],
                  item['prov'], item['city'], item['round'], item['money'], item['invse_company'],
                  item['invse_des'], item['invse_time'], item['invse_title']]
        try:
            self.cur.execute(
                'insert into juzi(com_name, com_registered_name, com_des, com_scope, prov, city, round, money, '
                'invse_company, invse_des, invse_time, invse_title) '
                'values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)', params)
            self.conn.commit()
        except Exception as e:
            print("Failed to insert record: %s" % e)
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
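
The article never shows the schema of the juzi table the pipeline writes to. Below is a minimal one-off sketch that creates a matching table; the connection parameters are placeholders (reuse the DATABASE_* values from settings.py), and the column types are assumptions, so adjust them to your data:

import pymysql

# One-off helper: create the `juzi` table used by ItjuziPipeline (placeholder connection values).
conn = pymysql.Connect(host='localhost', port=3306, user='root', passwd='password', db='itjuzi', charset='utf8')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS juzi (
        id INT AUTO_INCREMENT PRIMARY KEY,
        com_name VARCHAR(255),
        com_registered_name VARCHAR(255),
        com_des TEXT,
        com_scope VARCHAR(255),
        prov VARCHAR(64),
        city VARCHAR(64),
        `round` VARCHAR(64),
        money VARCHAR(64),
        invse_company VARCHAR(255),
        invse_des TEXT,
        invse_time VARCHAR(32),
        invse_title VARCHAR(255)
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
cur.close()
conn.close()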

 

5. Write settings.py:

# -*- coding: utf-8 -*-
# Scrapy settings for itjuzi project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'itjuzi'

SPIDER_MODULES = ['itjuzi.spiders']
NEWSPIDER_MODULE = 'itjuzi.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'itjuzi (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.25
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#    'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'itjuzi.middlewares.ItjuziSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'itjuzi.middlewares.ItjuziDownloaderMiddleware': 543,
    'itjuzi.middlewares.RandomUserAgent': 102,
    'itjuzi.middlewares.RandomProxy': 103,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'itjuzi.pipelines.ItjuziPipeline': 100,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# ITJuzi account credentials
JUZI_USER = "1871111111111"
JUZI_PWD = "123456789"

# MySQL connection settings
DATABASE_HOST = 'database host'
DATABASE_PORT = 3306
DATABASE_USER = 'database user'
DATABASE_PWD = 'database password'
DATABASE_DB = 'database name'

# User-Agent pool used by the RandomUserAgent downloader middleware
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.29 Safari/537.36"
]

# Proxy pool used by the RandomProxy downloader middleware
PROXIES = [
    {'ip_port': 'proxy ip:proxy port', 'user_passwd': 'proxy user:proxy password'},
    {'ip_port': 'proxy ip:proxy port', 'user_passwd': 'proxy user:proxy password'},
    {'ip_port': 'proxy ip:proxy port', 'user_passwd': 'proxy user:proxy password'},
]
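
Note that settings.py registers two custom downloader middlewares, itjuzi.middlewares.RandomUserAgent and itjuzi.middlewares.RandomProxy, but the article does not include middlewares.py. Below is a minimal sketch of what they could look like, driven by the USER_AGENTS and PROXIES lists above; the class and setting names follow the references in settings.py, everything else is an assumption:

# middlewares.py (sketch)
import base64
import random


class RandomUserAgent(object):
    """Pick a random User-Agent from USER_AGENTS for every outgoing request."""

    def process_request(self, request, spider):
        ua = random.choice(spider.settings.getlist('USER_AGENTS'))
        request.headers['User-Agent'] = ua


class RandomProxy(object):
    """Route every request through a randomly chosen proxy from PROXIES."""

    def process_request(self, request, spider):
        proxy = random.choice(spider.settings.get('PROXIES'))
        request.meta['proxy'] = 'http://' + proxy['ip_port']
        if proxy.get('user_passwd'):
            # HTTP proxies expect base64-encoded "user:password" credentials.
            creds = base64.b64encode(proxy['user_passwd'].encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + creds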

 

6. Run the project:

E:\itjuzi>scrapy crawl juzi
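
If you just want a quick look at the items before the MySQL pipeline is set up, Scrapy's built-in feed export can also dump them to a file:

E:\itjuzi>scrapy crawl juzi -o events.json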

7. Results:

[Screenshot: scraped records in the MySQL table]

PS: The detail pages are not crawled here. Detail data can be fetched using each company's id from the list-page JSON above, and the detail endpoints can be accessed without logging in, e.g. https://www.itjuzi.com/api/investevents/10262327 and https://www.itjuzi.com/api/get_investevent_down/10262327. One important caveat: if your account is not a VIP member, you can only crawl the first 3 pages of data, which is a bit of a pain. The other information modules on the site can be analyzed and crawled with the same approach if you need them.

https://blog.csdn.net/sandorn/article/details/104284233