Python Page Parsing: How to Use the Beautiful Soup Library

河源大盗 · Posted on 2023-2-10 22:43:25

1. Introduction to the Beautiful Soup Library

<strong>Beautiful Soup</strong>, abbreviated <strong>BS4</strong> (the 4 is the version number), is a widely used Python page-parsing library that can quickly extract specified data from HTML or XML documents.

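Compared with the [code]lxml[/code] library covered earlier, Beautiful Soup is simpler and easier to use. It does not require memorizing a lot of special syntax the way regular expressions and XPath do, although those approaches are more efficient and more direct.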

For most Python users, <strong>ease of use</strong> matters more than <strong>efficiency</strong>.

<strong>Beautiful Soup</strong> is a third-party library, so it has to be installed with the [code]pip[/code] command:

pip install beautifulsoup4


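<strong>BS4</strong> relies on a document parser to parse pages, so a parser is needed as well. Python ships with a built-in parser, [code]html.parser[/code], but it is somewhat slow, so, following the previous article (Python document parsing: using the lxml library), we install [code]lxml[/code] as the parser: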

pip install lxml


2. Beautiful Soup Library Methods

To initialize <strong>bs4</strong>, create a [code]BeautifulSoup[/code] object from the text and specify the document parser:

from bs4 import BeautifulSoup
html_str = '''
<div>
<ul>
<li class="web" id="0"><a href="www.python.org" rel="external nofollow">Python</a></li>
<li class="web" id="1"><a href="www.java.com" rel="external nofollow">Java</a></li>
<li class="web" id="2"><a href="www.csdn.net" rel="external nofollow">CSDN</a></li>
</ul>
</div>
'''
soup = BeautifulSoup(html_str, 'lxml')
# prettify() formats the HTML/XML document for output
print(soup.prettify())

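If lxml may not be installed on every machine that runs the script, one option is to fall back to the built-in parser. The sketch below assumes a hypothetical helper named make_soup (it is not part of bs4) and relies on bs4 raising FeatureNotFound when the requested parser is unavailable:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: parse with lxml if available, else html.parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # lxml is not installed, so use the parser bundled with Python
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p>Hello, <b>soup</b></p>')
print(soup.b.get_text())
print(soup.p.get_text())
```

The two parsers can build slightly different trees for fragments (lxml wraps a fragment in html and body tags, html.parser does not), so this kind of fallback is safest when the searches do not depend on those wrapper tags.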

<strong>bs4</strong> provides two commonly used search methods, [code]find_all()[/code] and [code]find()[/code]. They are used as follows:
2.1 find_all()

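The [code]find_all()[/code] method searches all descendants of the current [code]tag[/code], checks whether each one matches the filter conditions, and returns the matching items as a list. Its syntax is: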

find_all(name, attrs, recursive, text, limit)

复制代码

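<strong>Parameters:</strong>
name: finds all tags whose name is name; string objects (text nodes) are skipped automatically.
attrs: searches tags by attribute name and value, passed as a dictionary. Attributes can also be given as keyword arguments; because class is a Python keyword, the class attribute is written as "class_".
recursive: by default find_all() searches all descendants of a tag; set recursive=False to search only its direct children.
text: searches the string content of the document; accepts a string, a regular expression, a list, or True. Since Beautiful Soup 4.4.0 this parameter is named string, and text is kept as an older alias.
limit: find_all() returns every match, which can be slow on large documents; limit caps the number of results returned.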


from bs4 import BeautifulSoup
html_str = '''
<div>
<ul>
<li class="web" id="0"><a href="www.python.org" rel="external nofollow">Python</a></li>
<li class="web" id="1"><a href="www.java.com" rel="external nofollow">Java</a></li>
<li class="web" id="2"><a href="www.csdn.net" rel="external nofollow">CSDN</a></li>
</ul>
</div>
'''
soup = BeautifulSoup(html_str, 'lxml')
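# All <li> tags
print(soup.find_all("li"))
# All <a> tags
print(soup.find_all("a"))
# All strings equal to "Python" (newer bs4 versions prefer string= over text=)
print(soup.find_all(text="Python"))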


The program above uses [code]find_all()[/code] to find every [code]<li></li>[/code] tag, every [code]<a></a>[/code] tag, and the string [code]"Python"[/code] in the page.
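The remaining parameters can be tried out on the same list. The following is a minimal sketch, assuming a trimmed copy of the sample HTML and the built-in html.parser so that it runs without lxml; the variable names are only illustrative:

```python
import re

from bs4 import BeautifulSoup

html_str = '''
<div>
<ul>
<li class="web" id="0"><a href="www.python.org">Python</a></li>
<li class="web" id="1"><a href="www.java.com">Java</a></li>
<li class="web" id="2"><a href="www.csdn.net">CSDN</a></li>
</ul>
</div>
'''
soup = BeautifulSoup(html_str, 'html.parser')

# class_ filters on the class attribute (class is a Python keyword)
by_class = soup.find_all('li', class_='web')

# attrs takes a dictionary of attribute names and values
by_id = soup.find_all('li', attrs={'id': '1'})

# limit stops the search after the given number of matches
first_two = soup.find_all('a', limit=2)

# recursive=False only looks at direct children of the tag being searched
ul = soup.find('ul')
direct_li = ul.find_all('li', recursive=False)  # <li> are children of <ul>
direct_a = ul.find_all('a', recursive=False)    # <a> are grandchildren

# string accepts a regular expression as well as a plain string
names = soup.find_all(string=re.compile('^(Python|Java)$'))

print(len(by_class), len(first_two), len(direct_li), len(direct_a))
print([tag['id'] for tag in by_id])
print([str(s) for s in names])
```

Here direct_a is empty because recursive=False stops the search one level below the ul tag, while the same call without it would return all three links.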
2.2 find()

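The [code]find()[/code] method is very similar to [code]find_all()[/code]. The difference is that [code]find()[/code] returns only the first matching result (or None when nothing matches), so it has no [code]limit[/code] parameter. Its syntax is: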

find(name, attrs, recursive, text)


Besides being used in the same way as [code]find_all()[/code], bs4 offers a shorthand for [code]find()[/code]:

soup.find("li")
soup.li


These two lines do the same thing: both return the first [code]<li></li>[/code] tag. The complete program:

from bs4 import BeautifulSoup
html_str = '''
<div>
<ul>
<li class="web" id="0"><a href="www.python.org" rel="external nofollow">Python</a></li>
<li class="web" id="1"><a href="www.java.com" rel="external nofollow">Java</a></li>
<li class="web" id="2"><a href="www.csdn.net" rel="external nofollow">CSDN</a></li>
</ul>
</div>
'''
soup = BeautifulSoup(html_str, 'lxml')
print(soup.li)
print(soup.a)


The program above prints the first [code]<li></li>[/code] tag and the first [code]<a></a>[/code] tag.
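Since find() returns a single tag rather than a list, its result can be used directly to read attributes and text. A small sketch, again assuming a trimmed copy of the sample HTML and the built-in html.parser:

```python
from bs4 import BeautifulSoup

html_str = '''
<ul>
<li class="web" id="0"><a href="www.python.org">Python</a></li>
<li class="web" id="1"><a href="www.java.com">Java</a></li>
</ul>
'''
soup = BeautifulSoup(html_str, 'html.parser')

# Keyword arguments filter on attributes, just as with find_all()
second = soup.find('li', id='1')
print(second.a['href'])   # www.java.com
print(second.get_text())  # Java

# When nothing matches, both find() and the shorthand give None
missing = soup.find('table')
print(missing)            # None
print(soup.table)         # None
```

Because a failed search yields None, it is worth checking the result before indexing into it; otherwise an expression such as soup.find('table')['id'] raises a TypeError.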
2.3 select()

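<strong>bs4</strong> supports most CSS selectors, such as the common tag, class, and id selectors, as well as hierarchical selectors. <strong>Beautiful Soup</strong> provides a [code]select()[/code] method: pass a selector to it, and it returns the matching content in the HTML document.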

An example:

from bs4 import BeautifulSoup
html_str = '''
<div>
<ul>
<li class="web" id="web0"><a href="www.python.org" rel="external nofollow">Python</a></li>
<li class="web" id="web1"><a href="www.java.com" rel="external nofollow">Java</a></li>
<li class="web" id="web2"><a href="www.csdn.net" rel="external nofollow">CSDN</a></li>
</ul>
</div>
'''
soup = BeautifulSoup(html_str, 'lxml')
# Find by tag name
print(soup.select('body'))
# Find by attribute selector
print(soup.select('a[href]'))
# Find by class
print(soup.select('.web'))
# Find descendant nodes
print(soup.select('div ul'))
# Find by id
print(soup.select('#web1'))

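Selectors can also be combined, and select_one() returns just the first match in the same way find() does. The sketch below assumes a trimmed copy of the sample HTML and the built-in html.parser; the selectors themselves are only examples:

```python
from bs4 import BeautifulSoup

html_str = '''
<div>
<ul>
<li class="web" id="web0"><a href="www.python.org">Python</a></li>
<li class="web" id="web1"><a href="www.java.com">Java</a></li>
<li class="web" id="web2"><a href="www.csdn.net">CSDN</a></li>
</ul>
</div>
'''
soup = BeautifulSoup(html_str, 'html.parser')

# Child combinator: <a> tags that are direct children of li.web
names = [a.get_text() for a in soup.select('li.web > a')]
print(names)

# select_one() returns the first match, or None when there is none
java_link = soup.select_one('#web1 a')
print(java_link['href'])
print(soup.select_one('#web9'))

# Attribute selector with a suffix match on the href value
org_links = soup.select('a[href$=".org"]')
print([a['href'] for a in org_links])
```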

For more methods and detailed usage notes, see the documentation: https://beautiful-soup-4.readthedocs.io/en/latest/


3. Code Example

Now that we have learned <strong>Beautiful Soup</strong>, let's try rewriting the crawler from last time:


import os
import sys
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.csdn.net/'
x = requests.get(base_url)
soup = BeautifulSoup(x.text, 'lxml')
img_list = soup.select('img[src]')

# Create the img folder next to this script
# (abspath avoids an empty directory name when the script is run by file name)
os.chdir(os.path.dirname(os.path.abspath(sys.argv[0])))
if not os.path.exists('img'):
    os.mkdir('img')
    print('Folder created')
else:
    print('Folder already exists')

# Download the images
for i, tag in enumerate(img_list):
    # Resolve relative and protocol-relative src values against the page URL
    item = urljoin(base_url, tag['src'])
    # Check the extension first so unsupported images are never downloaded
    if item.endswith('jpg'):
        ext = 'jpg'
    elif item.endswith('jpeg'):
        ext = 'jpeg'
    elif item.endswith('png'):
        ext = 'png'
    else:
        print(f'Image {i + 1} has an unsupported format')
        continue
    img = requests.get(item).content
    with open(f'./img/{i}.{ext}', 'wb') as f:
        f.write(img)
    print(f'Image {i + 1} downloaded successfully')

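The extraction step of the crawler can be checked without any network access. The following sketch assumes a hypothetical helper named image_targets, a made-up sample page, and the example.com base URL; it reads the extension from the URL path, so a query string after the file name does not hide it:

```python
import os
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

ALLOWED = {'.jpg', '.jpeg', '.png'}


def image_targets(html, base_url):
    """Hypothetical helper: return (absolute_url, file_name) pairs."""
    soup = BeautifulSoup(html, 'html.parser')
    targets = []
    for i, tag in enumerate(soup.select('img[src]')):
        # Turn relative and protocol-relative src values into full URLs
        url = urljoin(base_url, tag['src'])
        # Take the extension from the path only, ignoring any query string
        ext = os.path.splitext(urlparse(url).path)[1].lower()
        if ext in ALLOWED:
            targets.append((url, f'{i}{ext}'))
    return targets


sample = '''
<img src="/static/logo.png">
<img src="//cdn.example.com/a/banner.JPG?v=3">
<img src="icon.gif">
<img alt="no src">
'''
targets = image_targets(sample, 'https://www.example.com/blog/')
for url, name in targets:
    print(name, url)
```

Each pair could then be passed to requests.get() and written to the img folder as in the crawler above.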

That is all for this article. Go ahead and try it yourself!
