Climbing to the Peak of Web Scraping, Step by Step: Advanced Part 1 (Requests, XPath Syntax, and the lxml Module)

bigegpt 2024-11-04 12:11

The Requests library

Requests: the only Non-GMO HTTP library for Python, safe for human consumption.

Sending GET requests

Call requests.get directly:

import requests

response = requests.get('https://www.baidu.com')

Attributes of the response

import requests

params = {'wd': 'python'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4090.0 Safari/537.36 Edg/83.0.467.0'}
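# params: accepts a dict or a string of query parameters; a dict is URL-encoded automatically, so urlencode() is not needed
response = requests.get('https://www.baidu.com', params=params, headers=headers)
# Response body as str: response.text is the body decoded with response.encoding
print(response.text)
# Response body as raw bytes: response.content is undecoded, so decode it yourself
print(response.content.decode('utf-8'))
# The full URL that was requested, including the encoded query string
print(response.url)
# The character encoding requests uses to decode response.text
print(response.encoding)
# The HTTP status code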
print(response.status_code)
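
The automatic URL encoding of params can be checked without sending anything over the network. The following is a minimal sketch, assuming the requests package is installed: it uses requests.Request(...).prepare() to build the final URL locally. The /s search path and the Chinese keyword are illustrative assumptions and are not taken from the example above.

```python
import requests

# Build (but do not send) a GET request so the encoded URL can be inspected.
# The '/s' path and both keywords are illustrative assumptions.
ascii_request = requests.Request(
    'GET', 'https://www.baidu.com/s', params={'wd': 'python'}
).prepare()
print(ascii_request.url)

# Non-ASCII values are percent-encoded as UTF-8 automatically.
chinese_request = requests.Request(
    'GET', 'https://www.baidu.com/s', params={'wd': '爬虫'}
).prepare()
print(chinese_request.url)
```

The same encoding is applied when requests.get(url, params=...) is called; response.url then reports the URL that was actually requested.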

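The difference between response.text and response.content

response.content: the data exactly as it was received from the network, without any decoding, so its type is bytes (strings stored on disk or sent over the network are always bytes).
response.text: the str obtained by decoding response.content. Decoding needs an encoding, and requests picks one by guessing, so the guess is sometimes wrong and the decoded text comes out garbled. When that happens, decode manually with response.content.decode('utf-8'), using whatever encoding the page really uses in place of 'utf-8'.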
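
The effect of a wrong encoding guess can be reproduced with plain Python, without any network access. This sketch uses a made-up byte string standing in for response.content, and ISO-8859-1 as an example of a wrong guess.

```python
# Bytes as they would arrive from the network for a UTF-8 page
# (a stand-in for response.content).
raw_bytes = '爬虫教程'.encode('utf-8')

# Decoding with the wrong codec does not fail; it silently yields garbled text.
garbled = raw_bytes.decode('iso-8859-1')

# Decoding with the codec the page really uses restores the text.
correct = raw_bytes.decode('utf-8')

print(type(raw_bytes).__name__, type(correct).__name__)
print(repr(garbled))
print(correct)
```

With a real response, assigning response.encoding = 'utf-8' before reading response.text has the same effect as decoding response.content by hand.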

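Sending POST requests

Call requests.post directly. If the server returns JSON, call response.json() to convert the JSON string into a dict or a list.

Below is an example that scrapes Lagou (lagou.com). Remember to add a Cookie to the request headers, otherwise the request will not succeed. The Cookie value shown here was captured from a browser session, so replace it with one from your own session.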

import requests

data = {'first': 'true',
        'pn': '1',
        'kd': 'python'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/83.0.4090.0 Safari/537.36 Edg/83.0.467.0',
           'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
           'Cookie': 'user_trace_token=20200331183800-0c1f510a-ae9a-4f04-b70d-e17f9edec031; '
                     'LGUID=20200331183800-b8eca414-b7b2-479d-8100-71fff41d8087; _ga=GA1.2.17010052.1585651081; '
                     'index_location_city=%E5%85%A8%E5%9B%BD; lagou_utm_source=B; _gid=GA1.2.807051168.1585805257; '
                     'sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22171302b7caa67e-0dedc0121b2532-255e0c45'
                     '-2073600-171302b7cabb9c%22%2C%22%24device_id%22%3A%22171302b7caa67e-0dedc0121b2532-255e0c45'
                     '-2073600-171302b7cabb9c%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B'
                     '%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22'
                     '%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5'
                     '%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; '
                     'JSESSIONID=ABAAABAABFIAAAC7D7CECCAFCFFA1FCBF3CB10D8EA6A189; '
                     'WEBTJ-ID=20200403095906-1713dc35e58a75-0b564b9cba1732-23580c45-2073600-1713dc35e598e; PRE_UTM=; '
                     'PRE_HOST=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; '
                     'LGSID=20200403095905-8201da05-4bb8-4e93-97bf-724ea6f758af; '
                     'PRE_SITE=https%3A%2F%2Fwww.lagou.com; _gat=1; '
                     'Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1585651082,1585879146; TG-TRACK-CODE=index_search; '
                     'X_HTTP_TOKEN=0b356dc3463713117419785851e40fa7a09468f3f0; '
                     'Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1585879149; '
                     'LGRID=20200403095908-4f1711b9-3e7e-4d54-a400-20c76b57f327; '
                     'SEARCH_ID=b875e8b91a764d63a2dc98d822ca1f85'}
response = requests.post('https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false',
                         headers=headers,
                         data=data)
print(response.json())
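
To see what data= actually sends, the request can be prepared locally instead of being posted. This is a minimal sketch, assuming the requests package is installed; the JSON reply at the end is a made-up stand-in for a server response, used only to show what response.json() does conceptually.

```python
import json

import requests

# Build (but do not send) a POST request to inspect what data= produces.
form = {'first': 'true', 'pn': '1', 'kd': 'python'}
prepared = requests.Request(
    'POST', 'https://www.lagou.com/jobs/positionAjax.json', data=form
).prepare()
print(prepared.headers['Content-Type'])
print(prepared.body)

# response.json() parses the response body as JSON, much like json.loads().
# The payload below is invented for illustration, not a real Lagou reply.
sample_reply = '{"success": true, "positions": ["python engineer"]}'
parsed = json.loads(sample_reply)
print(parsed['positions'][0])
```

A dict passed as data= is form-encoded; to send a JSON body instead, requests.post also accepts a json= argument.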

Using a proxy IP

Proxies were already covered in the basics part. With requests it takes only two lines of code, which is very convenient.

import requests

proxy = {'http': 'http://59.44.78.30:54069'}  # proxy URLs should include the scheme
response = requests.get('http://httpbin.org/ip', proxies=proxy)
print(response.text)
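
The proxies dict is keyed by the scheme of the target URL, so a proxy registered only under 'http' is not used for https:// sites. The sketch below, assuming the requests package is installed, registers the same proxy for both schemes on a session. The address is the one from the example above and is very likely no longer reachable, so treat it as a placeholder.

```python
import requests

# Placeholder proxy address reused from the example above; replace it
# with a proxy you are allowed to use.
proxy_url = 'http://59.44.78.30:54069'

# One entry per scheme of the target URL.
proxies = {'http': proxy_url, 'https': proxy_url}

session = requests.Session()
session.proxies.update(proxies)  # applied to every request made via this session
print(session.proxies)
```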

Cookies

Use a session to share cookies across multiple requests. Compared with urllib, the code becomes remarkably concise.

import requests

headers = {'User-Agent': ''}
data = {'email': '',
        'password': ''}
login_url = 'http://www.renren.com/PLogin.do'
profile_url = 'http://www.renren.com/880151247/profile'
session = requests.Session()
session.post(login_url, data=data, headers=headers)
response = session.get(profile_url)
with open('renren.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
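
How a session carries cookies from one request to the next can be observed offline. In this sketch, assuming the requests package is installed, a cookie is placed in the session by hand to stand in for the one a login response would set; the cookie name, value and the example.com domain are all made up.

```python
import requests

session = requests.Session()
# Stand-in for the cookie a login response would set.
# The name, value and domain are invented for illustration.
session.cookies.set('sessionid', 'abc123', domain='example.com', path='/')

# Prepare (but do not send) a follow-up request through the same session.
request = requests.Request('GET', 'http://example.com/profile')
prepared = session.prepare_request(request)
print(prepared.headers.get('Cookie'))
```

In the real flow, session.post(login_url, ...) stores the server's cookies in session.cookies, and every later session.get sends them back in the same way.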

Handling untrusted SSL certificates

For websites whose SSL certificate is not trusted, pass verify=False to requests.get or requests.post to access them anyway.
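
A minimal sketch of such a call, assuming the requests and urllib3 packages are installed. The helper name fetch_without_verification and its timeout default are hypothetical choices for this illustration.

```python
import requests
import urllib3


def fetch_without_verification(url, timeout=10):
    """Fetch url while skipping TLS certificate verification.

    Only reasonable for hosts you already trust, because skipping
    verification removes the protection against impersonated servers.
    """
    # Silence the InsecureRequestWarning that urllib3 emits for verify=False.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    return requests.get(url, verify=False, timeout=timeout)
```

Calling fetch_without_verification with the URL of a site that uses a self-signed certificate returns the usual response object instead of raising an SSL error.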

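XPath syntax and the lxml module

XPath

XPath (XML Path Language) is a language for finding information in XML and HTML documents; it can be used to navigate the elements and attributes of such documents.

XPath development tools

Chrome extension: XPath Helper
Firefox extension: XPath Checker

XPath syntax

XPath uses path expressions to select a node or a set of nodes in an XML document. These path expressions look very much like the paths used in an ordinary computer file system.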

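Expression	Description	Example	Result
nodename	Selects the child elements of the current node that are named nodename	bookstore	Selects the bookstore elements that are children of the current node
/	At the start of a path, selects from the root node; elsewhere, selects a direct child of the preceding node	/bookstore	Selects the root element bookstore
//	Selects matching nodes anywhere in the document, at any depth	//book	Selects all book nodes in the whole document
@	Selects an attribute of a node	//book[@price]	Selects all book nodes that have a price attribute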
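
The expressions in the table can be tried out with lxml. This is a minimal sketch, assuming lxml is installed; the bookstore document is made up for illustration.

```python
from lxml import etree

# A made-up bookstore document used only for illustration.
xml = b"""
<bookstore>
  <book price="10"><title>Python Basics</title></book>
  <book><title>Web Scraping</title></book>
  <magazine><title>Data Weekly</title></magazine>
</bookstore>
"""
root = etree.fromstring(xml)

# '/' at the start: an absolute path from the document root.
root_matches = root.xpath('/bookstore')
# '//': match anywhere in the document, at any depth.
all_books = root.xpath('//book')
all_titles = root.xpath('//title/text()')
# '@' as a filter (books that have a price) and as the value to return.
priced_books = root.xpath('//book[@price]')
prices = root.xpath('//book/@price')
# A bare node name is relative: children of the context node with that name.
child_books = root.xpath('book')

print(len(all_books), all_titles, prices)
```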

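Predicates

A predicate finds a specific node, or a node that contains a specific value. It is written inside square brackets.

Path expression	Description
/bookstore/book[1]	Selects the first book element under bookstore
/bookstore/book[last()]	Selects the last book element under bookstore
/bookstore/book[position()<3]	Selects the first two book elements under bookstore
//book[@price]	Selects all book nodes that have a price attribute
//book[@price=10]	Selects all book elements whose price attribute equals 10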
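
Each predicate in the table can be checked against a small document. The sketch below assumes lxml is installed; the four-book document and the titles helper are invented for this illustration.

```python
from lxml import etree

# A made-up document: four books, three of them with a price attribute.
xml = b"""
<bookstore>
  <book price="10"><title>A</title></book>
  <book price="25"><title>B</title></book>
  <book><title>C</title></book>
  <book price="10"><title>D</title></book>
</bookstore>
"""
root = etree.fromstring(xml)


def titles(path):
    """Return the title text of every book matched by the XPath expression."""
    return [book.findtext('title') for book in root.xpath(path)]


first = titles('/bookstore/book[1]')
last = titles('/bookstore/book[last()]')
first_two = titles('/bookstore/book[position()<3]')
with_price = titles('//book[@price]')
price_is_10 = titles('//book[@price=10]')

print(first, last, first_two, with_price, price_is_10)
```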

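Wildcards

Wildcard	Description	Example	Result
*	Matches any element node	/bookstore/*	Selects all child elements of bookstore
@*	Matches any attribute of a node	//book[@*]	Selects all book elements that have at least one attribute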
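
A short sketch of both wildcards, assuming lxml is installed and using a made-up document:

```python
from lxml import etree

# A made-up document: one book with an attribute, one without, and a cd.
xml = b"""
<bookstore>
  <book lang="en"><title>A</title></book>
  <book><title>B</title></book>
  <cd id="7"><title>C</title></cd>
</bookstore>
"""
root = etree.fromstring(xml)

# '*' matches every child element, whatever its tag name.
child_tags = [el.tag for el in root.xpath('/bookstore/*')]
# '@*' as a filter: books that carry at least one attribute.
books_with_attrs = root.xpath('//book[@*]')
# '@*' as the selected value: every attribute value in the document.
all_attr_values = root.xpath('//@*')

print(child_tags, len(books_with_attrs), all_attr_values)
```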

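Selecting multiple paths

Use the | operator in a path expression to select several paths at once.

//bookstore/book | //book/title
# Selects all book elements under bookstore, plus all title elements under book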
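
Run through lxml, the union returns the matches of both paths in a single list. A minimal sketch, assuming lxml is installed and using a made-up document:

```python
from lxml import etree

# A made-up document with two books, each holding one title.
xml = b"""
<bookstore>
  <book><title>A</title></book>
  <book><title>B</title></book>
</bookstore>
"""
root = etree.fromstring(xml)

# One query, two paths: the result holds both the book and the title elements.
nodes = root.xpath('//bookstore/book | //book/title')
tags = [node.tag for node in nodes]
print(tags)
```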

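Operators

Operator	Description	Example	Result
|	Computes the union of two node sets	//book | //cd	Returns a node set containing all book and cd elements
+, -, *, div	Addition, subtraction, multiplication, division	6 + 1, 6 - 1, 6 * 1, 6 div 1	7, 5, 6, 6
=, !=, <, <=, >, >=	Comparison	-	Returns true or false
or, and	Logical or, logical and	-	Returns true or false
mod	Remainder of a division	5 mod 2	1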
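
The operators can be evaluated directly with lxml, which returns Python floats for numeric results and bools for comparisons. The following is a sketch under the assumption that lxml is installed; the document with price child elements is made up.

```python
from lxml import etree

# A made-up document in which price is a child element, not an attribute.
xml = b"""
<bookstore>
  <book><title>A</title><price>29.5</price></book>
  <book><title>B</title><price>40</price></book>
  <cd><title>C</title><price>12</price></cd>
</bookstore>
"""
root = etree.fromstring(xml)

# Arithmetic: numeric results come back as floats.
total = root.xpath('6 + 1')
quotient = root.xpath('7 div 2')
remainder = root.xpath('5 mod 2')
# Comparison: the result is a bool.
has_two_books = root.xpath('count(//book) = 2')
# Comparison and 'and' inside predicates filter nodes by their values.
expensive = root.xpath('//book[price > 35]/title/text()')
mid_range = root.xpath('//*[price > 10 and price < 30]/title/text()')
# Union of two node sets.
books_and_cds = root.xpath('//book | //cd')

print(total, quotient, remainder, has_two_books, expensive, mid_range)
```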

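Notes

1. The difference between / and //: / selects only direct children, while // selects descendants at any depth.
2. contains: when an attribute holds several values, use the contains function to match one of them:

//div[contains(@class, 'job_detail')]

3. Indexes in predicates start at 1.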
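
The three notes can be verified on a small HTML fragment. This is a minimal sketch, assuming lxml is installed; the markup and class names are made up.

```python
from lxml import etree

# Made-up markup: the outer div carries two class values.
html = """
<div class="job_detail main">
  <p>direct child</p>
  <div class="inner"><p>nested descendant</p></div>
</div>
"""
root = etree.HTML(html)

# An exact comparison fails because the attribute value is 'job_detail main'.
exact = root.xpath("//div[@class='job_detail']")
# contains() does a substring match on the attribute value.
partial = root.xpath("//div[contains(@class, 'job_detail')]")
outer = partial[0]

# '/' reaches direct children only, '//' reaches descendants at any depth.
direct = outer.xpath('./p/text()')
descendants = outer.xpath('.//p/text()')

# Positions start at 1, so [0] never matches anything.
first_p = outer.xpath('./p[1]/text()')
zeroth_p = outer.xpath('./p[0]')

print(len(exact), len(partial), direct, descendants, first_p, zeroth_p)
```

Because contains() matches substrings, it would also match a class such as job_detail_extra, so it is best used with a value that is distinctive enough.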

The lxml library

lxml is an HTML/XML parser whose main job is parsing HTML/XML documents and extracting data from them.

Basic usage

1. Parsing an HTML string: use lxml.etree.HTML

from lxml import etree
text = '<div><p>hello</p></div>'  # sample HTML for illustration; use your own string
htmlElement = etree.HTML(text)
print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))

2. Parsing an HTML file: use lxml.etree.parse

htmlElement = etree.parse('tencent.xml')
print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))

This function uses the XML parser by default, so malformed HTML causes a parse error. In that case, create an HTML parser and pass it in:

parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse('tencent.xml', parser=parser)
print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))
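
The difference between the two parsers shows up on markup that is valid HTML in practice but not well-formed XML. The sketch below assumes lxml is installed; it uses an in-memory byte stream in place of a file, and the unclosed li tags are a made-up example of sloppy HTML.

```python
import io

from lxml import etree

# Made-up sloppy HTML: the <li> tags are never closed.
broken = '<ul><li>first<li>second</ul>'

# The XML parser rejects it.
try:
    etree.fromstring(broken)
    xml_parse_succeeded = True
except etree.XMLSyntaxError:
    xml_parse_succeeded = False

# etree.HTML uses the HTML parser, which repairs the markup.
root = etree.HTML(broken)
items_from_string = root.xpath('//li/text()')

# etree.parse accepts file-like objects too; here a BytesIO stands in for a file.
parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse(io.BytesIO(broken.encode('utf-8')), parser=parser)
items_from_file = tree.xpath('//li/text()')

print(xml_parse_succeeded, items_from_string, items_from_file)
```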

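Using XPath together with lxml

To run an XPath expression, call the xpath method of an element (or of the parsed tree); it returns a list.
To get an attribute value, select it with @, for example //a/@href.
To get text, use the text() function in the XPath expression.
To run XPath again inside a tag and get that tag's descendants, put a dot before the slash (./ or .//), which means "search from the current element".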
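
The last point is the one that most often causes bugs, so here is a minimal sketch of it, assuming lxml is installed. The two-row table is made up.

```python
from lxml import etree

# A made-up table with one link per row.
html = """
<table>
  <tr><td><a href="a.html">A</a></td></tr>
  <tr><td><a href="b.html">B</a></td></tr>
</table>
"""
root = etree.HTML(html)

rows = root.xpath('//tr')  # xpath() always returns a list
first_row = rows[0]

# Without the dot, '//' still searches the whole document.
hrefs_global = first_row.xpath('//a/@href')
# With the dot, the search is limited to the current element.
hrefs_local = first_row.xpath('.//a/@href')
text_local = first_row.xpath('.//a/text()')

print(hrefs_global, hrefs_local, text_local)
```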

from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('tencent.html', parser=parser)

Get all tr tags

trs = html.xpath('//tr')
for tr in trs:
    print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))

Get the 2nd tr tag

tr = html.xpath('//tr[2]')[0]
print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))
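
One subtlety: //tr[2] means "every tr that is the second tr child of its parent", which is why the code above takes element [0] of the result. When a page has several tables, the two readings differ. The sketch below assumes lxml is installed and uses made-up markup to contrast //tr[2] with (//tr)[2].

```python
from lxml import etree

# Made-up markup with two tables of two rows each.
html = """
<table id="t1"><tr><td>1a</td></tr><tr><td>1b</td></tr></table>
<table id="t2"><tr><td>2a</td></tr><tr><td>2b</td></tr></table>
"""
root = etree.HTML(html)

# The second tr within each parent: one match per table.
second_in_each_parent = root.xpath('//tr[2]/td/text()')
# The second tr of the whole document: parentheses apply the index afterwards.
second_in_document = root.xpath('(//tr)[2]/td/text()')

print(second_in_each_parent, second_in_document)
```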

Get all tr tags whose class equals even

trs = html.xpath("//tr[@class='even']")
for tr in trs:
    print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))

Get the href attribute of every a tag

aList = html.xpath('//a/@href')
for a in aList:
    print('http://hr.tencent.com/' + a)
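
String concatenation works when every href is a relative path without a leading slash. As a hedged alternative, the standard library's urljoin also handles root-relative and absolute links. The href values below are made up for illustration.

```python
from urllib.parse import urljoin

base = 'http://hr.tencent.com/'
# Made-up href values: relative, root-relative, and already absolute.
hrefs = ['position_detail.php?id=1', '/about.html', 'https://example.com/x']

full_urls = [urljoin(base, href) for href in hrefs]
for url in full_urls:
    print(url)
```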

Get all job postings

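# Skip the first tr (position()>1), presumably the table's header row
trs = html.xpath('//tr[position()>1]')
positions = []
for tr in trs:
    # Leading dot: search inside this tr only, not the whole document
    href = tr.xpath('.//a/@href')[0]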
    fullurl = 'http://hr.tencent.com/' + href
    title = tr.xpath('./td[1]//text()')[0]
    category = tr.xpath('./td[2]//text()')[0]
    nums = tr.xpath('./td[3]/text()')[0]
    address = tr.xpath('./td[4]/text()')[0]
    pubtime = tr.xpath('./td[5]/text()')[0]

    position = {
        'url': fullurl,
        'title': title,
        'category': category,
        'nums': nums,
        'address': address,
        'pubtime': pubtime
    }
    positions.append(position)
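
Since tencent.html is not included with the article, the same extraction can be tried on an inline table. The following is a sketch, assuming lxml is installed: the HTML is a made-up stand-in shaped like the table the code above expects, and first_or_default is a hypothetical helper that avoids the IndexError that [0] raises when a cell is empty.

```python
from lxml import etree

# A made-up table shaped like the job list the code above expects.
html = """
<table>
  <tr><th>Title</th><th>Category</th><th>Openings</th><th>Location</th><th>Published</th></tr>
  <tr class="even">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td><td>2</td><td>Shenzhen</td><td>2020-04-01</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">Data Analyst</a></td>
    <td></td><td>1</td><td>Beijing</td><td>2020-04-02</td>
  </tr>
</table>
"""


def first_or_default(nodes, default=''):
    """Return the first XPath result, or default when nothing matched."""
    return nodes[0] if nodes else default


root = etree.HTML(html)
positions = []
for tr in root.xpath('//tr[position()>1]'):  # skip the header row
    positions.append({
        'url': 'http://hr.tencent.com/' + first_or_default(tr.xpath('.//a/@href')),
        'title': first_or_default(tr.xpath('./td[1]//text()')),
        'category': first_or_default(tr.xpath('./td[2]//text()')),
        'nums': first_or_default(tr.xpath('./td[3]/text()')),
        'address': first_or_default(tr.xpath('./td[4]/text()')),
        'pubtime': first_or_default(tr.xpath('./td[5]/text()')),
    })

for position in positions:
    print(position)
```

The second row has an empty category cell, which the helper turns into an empty string instead of an exception.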
