Python web page parser: using the third-party lxml library with XPath

This article covers another third-party library, lxml, for parsing web pages. Likewise, lxml can parse both HTML and XML documents, it copes with large documents, and it is comparatively fast.

To use lxml effectively you need to know XPath, because lxml locates elements through XPath expressions. The focus of this chapter is therefore XPath syntax.
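As a quick illustration of the two points above (lxml handles XML as well as HTML, and elements are selected with XPath), here is a minimal sketch. The XML snippet and the names `xml_doc`, `titles` and `ids` are made up for this example and do not come from the original article.

```python
from lxml import etree

# Hypothetical XML snippet, used only for illustration.
xml_doc = "<books><book id='b1'>Alice</book><book id='b2'>Bob</book></books>"

# etree.fromstring() parses well-formed XML and returns the root element.
root = etree.fromstring(xml_doc)

# XPath queries are evaluated against the parsed tree.
titles = root.xpath('/books/book/text()')   # text of every <book>
ids = root.xpath('//book/@id')              # id attribute of every <book>

print(titles)
print(ids)
```

For HTML, which is often not well-formed, `etree.HTML()` is the more forgiving entry point; it is the one used in the rest of this article.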

1. Import the lxml library and create an object

# -*- coding: UTF-8 -*-

# Import etree from lxml
from lxml import etree

# First obtain the page source that the web page downloader has already fetched.
# Here the sample document from the official documentation is used directly.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Parse the html_doc string; etree.HTML() returns the root element of the parsed tree
html = etree.HTML(html_doc)
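The sample document never closes its body and html tags, yet it parses without complaint. The sketch below shows why: the HTML parser repairs incomplete markup. The `fragment` string is a shortened, hypothetical variant of the sample, and `etree.tostring()` is used only to inspect the result.

```python
from lxml import etree

# Deliberately incomplete markup: no <html>/<body> wrapper and an unclosed <p>.
fragment = '<p class="title"><b>The Dormouse\'s story</b>'

tree = etree.HTML(fragment)

# The parser wraps the fragment in <html><body> and closes the open <p>.
print(tree.tag)

# Serialize the repaired tree back to a string to see what was built.
serialized = etree.tostring(tree, encoding='unicode')
print(serialized)

# The repaired tree can be queried with an absolute path.
title_text = tree.xpath('/html/body/p/b/text()')
print(title_text)
```

Printing the serialized tree is a convenient way to check which structure your XPath expressions will actually run against.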

2. Extract page elements with XPath syntax

Selecting elements by node path

# xpath() selects elements by following tag names from the root
print(html.xpath('/html/body/p'))
# [<Element p at 0x2ebc908>, <Element p at 0x2ebc8c8>, <Element p at 0x2eb9a48>]
print(html.xpath('/html'))
# [<Element html at 0x34bc948>]
# '//' searches all descendants: select every a element in the document
print(html.xpath('//a'))
# '/' selects direct children only: select the root html element
print(html.xpath('/html'))
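The calls above print Element objects rather than readable content. A minimal sketch of how those elements might be inspected follows; the markup is a trimmed copy of the sample, and the variable names `summary` and `relative_ids` are illustrative.

```python
from lxml import etree

html_doc = """
<html><body>
<p class="story">
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
</p>
</body></html>
"""
html = etree.HTML(html_doc)

# xpath() returns a list of Element objects; each one exposes its
# tag name (.tag), its attributes (.get) and its leading text (.text).
links = html.xpath('//a')
summary = [(a.tag, a.get('id'), a.text) for a in links]
print(summary)

# xpath() can also be called on any element. A path starting with '.'
# is evaluated relative to that element instead of the document root.
paragraph = html.xpath('/html/body/p')[0]
relative_ids = [a.get('id') for a in paragraph.xpath('./a')]
print(relative_ids)
```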

Selecting elements with filters

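'''
Select elements by a single attribute
'''
# Select a elements anywhere in the document whose class attribute equals "bro"
# (no element in the sample has this class, so the result is an empty list)
print(html.xpath('//a[@class="bro"]'))

# Select a elements anywhere in the document whose id attribute equals "link3"
print(html.xpath('//a[@id="link3"]'))

'''
Select elements by several attributes
'''
# Select a elements whose class contains "sister" and whose id contains "link1"
print(html.xpath('//a[contains(@class,"sister") and contains(@id,"link1")]'))

# Select a elements whose class contains "bro" or whose id contains "link1"
print(html.xpath('//a[contains(@class,"bro") or contains(@id,"link1")]'))

# last() selects the last a element among the a children of each parent
print(html.xpath('//a[last()]'))
# The index 1 selects the first a element among the a children of each parent
print(html.xpath('//a[1]'))
# position() filter: select the first two a elements
print(html.xpath('//a[position() < 3]'))

'''
Select several elements with a computed condition
'''
# position() filter: select the first and the third a element
# Operators that can be used: >, <, =, >=, <=, +, -, and, or
print(html.xpath('//a[position() = 1 or position() = 3]'))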
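One detail of the positional filters is easy to miss. In standard XPath 1.0, a predicate such as `[last()]` after `//a` is applied per parent element, not to the whole result. In the sample all three links share one parent, so the difference does not show. The sketch below uses a hypothetical two-paragraph document to make it visible; the helper `ids()` is introduced here only for illustration.

```python
from lxml import etree

# Hypothetical document: the links are split across two parents.
html_doc = """
<html><body>
<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>
<p><a id="link3">Tillie</a></p>
</body></html>
"""
html = etree.HTML(html_doc)


def ids(expression):
    """Return the id attributes of the elements matched by an XPath expression."""
    return [a.get('id') for a in html.xpath(expression)]


# Applied per parent: the last <a> inside each <p>.
per_parent_last = ids('//a[last()]')

# Parentheses group the whole result first, then take its last item.
overall_last = ids('(//a)[last()]')

# The same grouping works with position() and arithmetic operators such as mod.
first_two = ids('(//a)[position() < 3]')
odd_positions = ids('(//a)[position() mod 2 = 1]')

print(per_parent_last, overall_last, first_two, odd_positions)
```

Wrapping the path in parentheses is the usual choice when you want "the n-th match in the page" rather than "the n-th match under each parent".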

Getting element attributes and text

'''
Use @ to get an attribute value and text() to get an element's text
'''
# Get an attribute value
print(html.xpath('//a[position() = 1]/@class'))
# ['sister']
# Get the element's text
print(html.xpath('//a[position() = 1]/text()'))
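Attribute and text extraction are often combined in practice. The following sketch, on a trimmed copy of the sample, builds a mapping from each link's id to its text, and also shows that `text()` returns only the text nodes directly under an element, while the XPath `string()` function includes text from nested tags. The variable names are illustrative.

```python
from lxml import etree

html_doc = """
<html><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
</p>
</body></html>
"""
html = etree.HTML(html_doc)

# Attribute and text queries return lists of strings, not elements.
all_ids = html.xpath('//a/@id')

# Combine both: map each link's id to its text.
names = {a.get('id'): a.text for a in html.xpath('//a')}

# text() only sees direct text children; the title text sits inside <b>,
# so nothing is returned for the <p> itself.
direct_text = html.xpath('//p[@class="title"]/text()')

# string() concatenates the text of the element and all of its descendants.
full_text = html.xpath('string(//p[@class="title"])')

print(all_ids, names, direct_text, full_text)
```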
