Installing Scrapy

Make sure the system has a Python 3 environment and an up-to-date pip3.

Install Scrapy with:

pip3 install scrapy

If your network connection is poor, use the Aliyun mirror:

pip3 install scrapy -i https://mirrors.aliyun.com/pypi/simple

After installation, run scrapy on the command line to view the help information and confirm the install.

Crawling Tencent Job Postings

Scrape the job postings from the Tencent careers site.

Job listings page:

http://careers.tencent.com/search.html

Fields to scrape:

Job title, detail URL, job category, country, city, job responsibilities, publish time

Initializing the project

Create a new project:

scrapy startproject Tencent

Enter the Tencent directory and generate a basic spider:

scrapy genspider position "tencent.com"
# position is the spider name; tencent.com limits the crawl scope

Running the command creates a position.py file in the spiders folder; the spider code below goes in this file.
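For orientation, the generated project follows the standard Scrapy template, roughly like this (details vary slightly across Scrapy versions):

Tencent/
    scrapy.cfg            # deploy configuration
    Tencent/
        __init__.py
        items.py          # item field definitions
        middlewares.py
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            position.py   # the spider generated by genspider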

Initializing the URL

Tencent's job postings are delivered as JSON. Open the browser developer tools (F12) and you can find the real request URL in the network headers:

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1631023581057&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn

You can open this link directly and see a blob of JSON data; for easier reading, paste it into a viewer such as json.cn.
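To sanity-check the endpoint before writing any Scrapy code, here is a minimal sketch with the requests library (an extra dependency, installable via pip3 install requests; the empty query parameters are dropped here for brevity):

import json
import requests

# Fetch one page of postings from the same API the spider will use
url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
       '?timestamp=1631023581057&pageIndex=1&pageSize=10&language=zh-cn&area=cn')

resp = requests.get(url, timeout=10)
data = resp.json()

# Pretty-print for inspection instead of pasting into json.cn
print(json.dumps(data, ensure_ascii=False, indent=2))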

Assign this link to start_urls in the spider file.

position.py

Initialize the spider's start_urls:

start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1631023581057&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn']

Import json: import json

Import jsonpath: import jsonpath

If jsonpath is not installed, install it with pip3 (same procedure as the Scrapy installation above).
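As a quick illustration of the $.. recursive-descent syntax, here is a tiny example on made-up data shaped like the API's Data.Posts structure:

import jsonpath

doc = {'Data': {'Posts': [{'RecruitPostName': 'Backend Engineer'},
                          {'RecruitPostName': 'Data Analyst'}]}}

# '$..KeyName' matches the key at any depth and returns a list of values
print(jsonpath.jsonpath(doc, '$..RecruitPostName'))
# -> ['Backend Engineer', 'Data Analyst']

# When nothing matches, jsonpath returns False rather than an empty list
print(jsonpath.jsonpath(doc, '$..NoSuchKey'))
# -> False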

Add LOG_LEVEL = 'WARN' to settings.py to suppress Scrapy's INFO-level log output.

Printing the fields we need

def parse(self, response):
    json_text = json.loads(response.text)
    Postname = jsonpath.jsonpath(json_text, '$..RecruitPostName')        # job title
    CountryName = jsonpath.jsonpath(json_text, '$..CountryName')         # country
    LocationName = jsonpath.jsonpath(json_text, '$..LocationName')       # city
    CategoryName = jsonpath.jsonpath(json_text, '$..CategoryName')       # job category
    Responsibility = jsonpath.jsonpath(json_text, '$..Responsibility')   # responsibilities
    LastUpdateTime = jsonpath.jsonpath(json_text, '$..LastUpdateTime')   # last update time
    PostURL = jsonpath.jsonpath(json_text, '$..PostURL')                 # detail URL

    for Postname, CountryName, LocationName, CategoryName, Responsibility, LastUpdateTime, PostURL in zip(
            Postname, CountryName, LocationName, CategoryName, Responsibility, LastUpdateTime, PostURL):
        print(Postname)
        print(CountryName)
        print(LocationName)
        print(CategoryName)
        print(Responsibility)
        print(LastUpdateTime)
        print(PostURL)
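One subtlety in the loop above: it reuses the list names as loop variables. This works because zip(...) is evaluated once, before the names are rebound, though distinct names would be clearer. A tiny demonstration of the pattern with made-up values:

titles = ['Backend Engineer', 'Data Analyst']
cities = ['Shenzhen', 'Beijing']

# zip captures both lists up front, so rebinding the names inside the
# loop body does not disturb the iteration
for titles, cities in zip(titles, cities):
    print(titles, cities)
# -> Backend Engineer Shenzhen
# -> Data Analyst Beijing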

Run scrapy crawl position from the command line inside the project directory to execute the spider.

Storing the data

Modify items.py to declare the item fields (the full file is shown below).

In position.py, import the TencentItem class from items.py:

from ..items import TencentItem

Initialize a TencentItem:

item = TencentItem()

Yield each filled-in item so Scrapy hands it to the item pipeline for storage:

yield item

Tencent's postings span many pages; observing the URL shows the page is controlled by the pageIndex= parameter. Assign the link to a variable url with {} in place of the page number:

url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1631023581057&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'

Increment pageIndex and request the next page. Note that this snippet alone never terminates; the full spider below adds a guard that stops the crawl once a page returns no posts:

self.pageIndex += 1
url = self.url.format(self.pageIndex)
yield scrapy.Request(url=url, callback=self.parse)

Saving the data locally

In settings.py, uncomment the ITEM_PIPELINES setting.

Add methods to write out the scraped data:

pipelines.py

import json


class TencentPipeline:
    # open the output file once, when the spider starts
    def __init__(self):
        self.file = open('Tencent_Job.json', 'wb')

    # write each item as one JSON object per line
    def process_item(self, item, spider):
        json_text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(json_text.encode('utf-8'))
        return item

    # close the file when the spider finishes
    def close_spider(self, spider):
        self.file.close()
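Each record lands in Tencent_Job.json as one JSON object per line (the JSON Lines convention), so it can be read back line by line; a small sketch:

import json

# Iterate over the JSON Lines file produced by the pipeline
with open('Tencent_Job.json', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        print(record['Postname'], record['PostURL'])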

Crawl results

Full file contents

position.py

import scrapy
import json
import jsonpath
from ..items import TencentItem

class PositionSpider(scrapy.Spider):
    name = 'position'
    url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1631023581057&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    allowed_domains = ['tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1631023581057&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn']

    pageIndex = 1

    def parse(self, response):
        json_text = json.loads(response.text)
        Postname = jsonpath.jsonpath(json_text, '$..RecruitPostName')        # job title
        CountryName = jsonpath.jsonpath(json_text, '$..CountryName')         # country
        LocationName = jsonpath.jsonpath(json_text, '$..LocationName')       # city
        CategoryName = jsonpath.jsonpath(json_text, '$..CategoryName')       # job category
        Responsibility = jsonpath.jsonpath(json_text, '$..Responsibility')   # responsibilities
        LastUpdateTime = jsonpath.jsonpath(json_text, '$..LastUpdateTime')   # last update time
        PostURL = jsonpath.jsonpath(json_text, '$..PostURL')                 # detail URL

        # jsonpath returns False when nothing matches; guard added here to
        # stop paging once the API runs out of postings
        if not Postname:
            return

        for Postname, CountryName, LocationName, CategoryName, Responsibility, LastUpdateTime, PostURL in zip(
                Postname, CountryName, LocationName, CategoryName, Responsibility, LastUpdateTime, PostURL):
            item = TencentItem()
            print(Postname)
            print(CountryName)
            print(LocationName)
            print(CategoryName)
            print(Responsibility)
            print(LastUpdateTime)
            print(PostURL)
            print('===' * 35)  # separator between records
            item['Postname'] = Postname
            item['CountryName'] = CountryName
            item['LocationName'] = LocationName
            item['CategoryName'] = CategoryName
            item['Responsibility'] = Responsibility
            item['LastUpdateTime'] = LastUpdateTime
            item['PostURL'] = PostURL

            yield item

        self.pageIndex += 1
        url = self.url.format(self.pageIndex)
        yield scrapy.Request(url=url, callback=self.parse)
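As an aside, the spider can also be launched from a plain Python script instead of scrapy crawl; a minimal sketch using Scrapy's CrawlerProcess, assuming it is run from the project root so the project settings are found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from Tencent.spiders.position import PositionSpider

# Load settings.py (LOG_LEVEL, ITEM_PIPELINES, ...) and run the spider
process = CrawlerProcess(get_project_settings())
process.crawl(PositionSpider)
process.start()  # blocks until the crawl finishes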

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    Postname = scrapy.Field()        # job title
    CountryName = scrapy.Field()     # country
    LocationName = scrapy.Field()    # city
    CategoryName = scrapy.Field()    # job category
    Responsibility = scrapy.Field()  # responsibilities
    LastUpdateTime = scrapy.Field()  # last update time
    PostURL = scrapy.Field()         # detail URL
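A TencentItem behaves like a dict restricted to the declared fields, so a typo in a field name fails loudly instead of silently creating a new key. A quick illustration (run inside the project so the import resolves):

from Tencent.items import TencentItem

item = TencentItem()
item['Postname'] = 'Backend Engineer'
print(dict(item))          # {'Postname': 'Backend Engineer'}

# item['Salary'] = '30k'   # would raise KeyError: 'Salary' is not a declared Field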

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import json
from itemadapter import ItemAdapter


class TencentPipeline:
    def __init__(self):
        self.file = open('Tencent_Job.json', 'wb')

    def process_item(self, item, spider):
        json_text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(json_text.encode('utf-8'))
        return item

    def close_spider(self, spider):
        self.file.close()

settings.py

# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Tencent'

SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Tencent (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
LOG_LEVEL = 'WARN'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'