Installing Scrapy

Make sure the system has a Python 3 environment and an up-to-date pip3.

Install Scrapy with:

pip3 install scrapy

If your network connection is poor, use the Aliyun mirror:

pip3 install scrapy -i https://mirrors.aliyun.com/pypi/simple

After installation, run scrapy on the command line to view the help information and confirm the install.

Crawling Tencent Job Postings

Scrape the job postings from the Tencent careers site.

Job listings page:

http://careers.tencent.com/search.html

Fields to scrape:

Job title, detail URL, job category, country, city, job responsibilities, publish time

Initializing the project

Create a new project:

scrapy startproject Tencent

Enter the Tencent directory and generate a basic spider:

scrapy genspider position "tencent.com"
# position is the spider name; tencent.com limits the crawl scope

Running the command creates a position.py file in the spiders folder; the spider code below goes in this file.
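For orientation, the generated project follows the standard Scrapy template, roughly like this (details vary slightly across Scrapy versions):

Tencent/
    scrapy.cfg            # deploy configuration
    Tencent/
        __init__.py
        items.py          # item field definitions
        middlewares.py
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            position.py   # the spider generated by genspider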

Initializing the URL

Tencent's job postings are delivered as JSON. Open the browser developer tools (F12) and you can find the real request URL in the network headers:

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1631023581057&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn

You can open this link directly and see a blob of JSON data; for easier reading, paste it into a viewer such as json.cn.
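To sanity-check the endpoint before writing any Scrapy code, here is a minimal sketch with the requests library (an extra dependency, installable via pip3 install requests; the empty query parameters are dropped here for brevity):

import json
import requests

# Fetch one page of postings from the same API the spider will use
url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
       '?timestamp=1631023581057&pageIndex=1&pageSize=10&language=zh-cn&area=cn')

resp = requests.get(url, timeout=10)
data = resp.json()

# Pretty-print for inspection instead of pasting into json.cn
print(json.dumps(data, ensure_ascii=False, indent=2))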

Assign this link to start_urls in the spider file.

position.py

Initialize the spider's start_urls:

start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1631023581057&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn']

Import json: import json

Import jsonpath: import jsonpath

If jsonpath is not installed, install it with pip3 (same procedure as the Scrapy installation above).
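As a quick illustration of the $.. recursive-descent syntax, here is a tiny example on made-up data shaped like the API's Data.Posts structure:

import jsonpath

doc = {'Data': {'Posts': [{'RecruitPostName': 'Backend Engineer'},
                          {'RecruitPostName': 'Data Analyst'}]}}

# '$..KeyName' matches the key at any depth and returns a list of values
print(jsonpath.jsonpath(doc, '$..RecruitPostName'))
# -> ['Backend Engineer', 'Data Analyst']

# When nothing matches, jsonpath returns False rather than an empty list
print(jsonpath.jsonpath(doc, '$..NoSuchKey'))
# -> False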

Add LOG_LEVEL = 'WARN' to settings.py to suppress Scrapy's INFO-level log output.

Printing the fields we need

def parse(self, response):
    json_text = json.loads(response.text)
    Postname = jsonpath.jsonpath(json_text, '$..RecruitPostName')        # job title
    CountryName = jsonpath.jsonpath(json_text, '$..CountryName')         # country
    LocationName = jsonpath.jsonpath(json_text, '$..LocationName')       # city
    CategoryName = jsonpath.jsonpath(json_text, '$..CategoryName')       # job category
    Responsibility = jsonpath.jsonpath(json_text, '$..Responsibility')   # responsibilities
    LastUpdateTime = jsonpath.jsonpath(json_text, '$..LastUpdateTime')   # last update time
    PostURL = jsonpath.jsonpath(json_text, '$..PostURL')                 # detail URL

    for Postname, CountryName, LocationName, CategoryName, Responsibility, LastUpdateTime, PostURL in zip(
            Postname, CountryName, LocationName, CategoryName, Responsibility, LastUpdateTime, PostURL):
        print(Postname)
        print(CountryName)
        print(LocationName)
        print(CategoryName)
        print(Responsibility)
        print(LastUpdateTime)
        print(PostURL)
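One subtlety in the loop above: it reuses the list names as loop variables. This works because zip(...) is evaluated once, before the names are rebound, though distinct names would be clearer. A tiny demonstration of the pattern with made-up values:

titles = ['Backend Engineer', 'Data Analyst']
cities = ['Shenzhen', 'Beijing']

# zip captures both lists up front, so rebinding the names inside the
# loop body does not disturb the iteration
for titles, cities in zip(titles, cities):
    print(titles, cities)
# -> Backend Engineer Shenzhen
# -> Data Analyst Beijing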

Run scrapy crawl position from the command line inside the project directory to execute the spider.

Storing the data

Modify items.py to declare the item fields (the full file is shown below).

In position.py, import the TencentItem class from items.py:

from ..items import TencentItem

Initialize a TencentItem:

item = TencentItem()

Yield each filled-in item so Scrapy hands it to the item pipeline for storage:

yield item

Tencent's postings span many pages; observing the URL shows the page is controlled by the pageIndex= parameter. Assign the link to a variable url with {} in place of the page number:

url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1631023581057&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'

Increment pageIndex and request the next page. Note that this snippet alone never terminates; the full spider below adds a guard that stops the crawl once a page returns no posts:

self.pageIndex += 1
url = self.url.format(self.pageIndex)
yield scrapy.Request(url=url, callback=self.parse)

Saving the data locally

In settings.py, uncomment the ITEM_PIPELINES setting.

Add methods to write out the scraped data:

pipelines.py

import json


class TencentPipeline:
    # open the output file once, when the spider starts
    def __init__(self):
        self.file = open('Tencent_Job.json', 'wb')

    # write each item as one JSON object per line
    def process_item(self, item, spider):
        json_text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(json_text.encode('utf-8'))
        return item

    # close the file when the spider finishes
    def close_spider(self, spider):
        self.file.close()
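Each record lands in Tencent_Job.json as one JSON object per line (the JSON Lines convention), so it can be read back line by line; a small sketch:

import json

# Iterate over the JSON Lines file produced by the pipeline
with open('Tencent_Job.json', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        print(record['Postname'], record['PostURL'])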

Crawl results

Full file contents

position.py

import scrapy
import json
import jsonpath
from ..items import TencentItem

class PositionSpider(scrapy.Spider):
    name = 'position'
    url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1631023581057&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    allowed_domains = ['tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1631023581057&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn']

    pageIndex = 1

    def parse(self, response):
        json_text = json.loads(response.text)
        Postname = jsonpath.jsonpath(json_text, '$..RecruitPostName')        # job title
        CountryName = jsonpath.jsonpath(json_text, '$..CountryName')         # country
        LocationName = jsonpath.jsonpath(json_text, '$..LocationName')       # city
        CategoryName = jsonpath.jsonpath(json_text, '$..CategoryName')       # job category
        Responsibility = jsonpath.jsonpath(json_text, '$..Responsibility')   # responsibilities
        LastUpdateTime = jsonpath.jsonpath(json_text, '$..LastUpdateTime')   # last update time
        PostURL = jsonpath.jsonpath(json_text, '$..PostURL')                 # detail URL

        # jsonpath returns False when nothing matches; guard added here to
        # stop paging once the API runs out of postings
        if not Postname:
            return

        for Postname, CountryName, LocationName, CategoryName, Responsibility, LastUpdateTime, PostURL in zip(
                Postname, CountryName, LocationName, CategoryName, Responsibility, LastUpdateTime, PostURL):
            item = TencentItem()
            print(Postname)
            print(CountryName)
            print(LocationName)
            print(CategoryName)
            print(Responsibility)
            print(LastUpdateTime)
            print(PostURL)
            print('===' * 35)  # separator between records
            item['Postname'] = Postname
            item['CountryName'] = CountryName
            item['LocationName'] = LocationName
            item['CategoryName'] = CategoryName
            item['Responsibility'] = Responsibility
            item['LastUpdateTime'] = LastUpdateTime
            item['PostURL'] = PostURL

            yield item

        self.pageIndex += 1
        url = self.url.format(self.pageIndex)
        yield scrapy.Request(url=url, callback=self.parse)
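As an aside, the spider can also be launched from a plain Python script instead of scrapy crawl; a minimal sketch using Scrapy's CrawlerProcess, assuming it is run from the project root so the project settings are found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from Tencent.spiders.position import PositionSpider

# Load settings.py (LOG_LEVEL, ITEM_PIPELINES, ...) and run the spider
process = CrawlerProcess(get_project_settings())
process.crawl(PositionSpider)
process.start()  # blocks until the crawl finishes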

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    Postname = scrapy.Field()        # job title
    CountryName = scrapy.Field()     # country
    LocationName = scrapy.Field()    # city
    CategoryName = scrapy.Field()    # job category
    Responsibility = scrapy.Field()  # responsibilities
    LastUpdateTime = scrapy.Field()  # last update time
    PostURL = scrapy.Field()         # detail URL
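A TencentItem behaves like a dict restricted to the declared fields, so a typo in a field name fails loudly instead of silently creating a new key. A quick illustration (run inside the project so the import resolves):

from Tencent.items import TencentItem

item = TencentItem()
item['Postname'] = 'Backend Engineer'
print(dict(item))          # {'Postname': 'Backend Engineer'}

# item['Salary'] = '30k'   # would raise KeyError: 'Salary' is not a declared Field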

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import json
from itemadapter import ItemAdapter


class TencentPipeline:
    def __init__(self):
        self.file = open('Tencent_Job.json', 'wb')

    def process_item(self, item, spider):
        json_text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(json_text.encode('utf-8'))
        return item

    def close_spider(self, spider):
        self.file.close()

settings.py

# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Tencent'

SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Tencent (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
LOG_LEVEL = 'WARN'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'