HTML (HyperText Markup Language) is used to create web pages, and its tags describe the structure and content of a page. Parsing HTML means reading a document’s markup into a structure from which data can be extracted. Python offers several libraries for parsing HTML documents, such as BeautifulSoup, lxml, and html5lib.
This post covers three methods for parsing HTML in Python:
- Method 1: Using BeautifulSoup Module
- Method 2: Using PyQuery Module
- Method 3: Using lxml Library
Method 1: Using BeautifulSoup Module
Python’s BeautifulSoup library extracts data from HTML and XML documents and is commonly used for web scraping. To use it, install the package with the following command in a terminal:
pip install beautifulsoup4
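The following example parses HTML with beautifulsoup4. It also uses the requests library to download the page, which can be installed with pip install requests:
import requests
from bs4 import BeautifulSoup

# Download the page and keep the raw response body
data = requests.get("http://www.itslinuxfoss.com")
html_content = data.content

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(html_content, 'html.parser')

# Print the <title> tag, then every <p> tag
print(soup.title)
paragraphs = soup.find_all('p')
print(paragraphs)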
- The “requests” module and “BeautifulSoup” class from the bs4 library are imported.
- The “requests.get()” function is used to send a GET request to the URL “http://www.itslinuxfoss.com”.
- The content of the response is then stored in a variable called html_content.
- The BeautifulSoup object is created by passing “html_content” and the parser name “html.parser” as arguments.
- The “soup.title” attribute (not a function call) returns the page’s <title> tag, which is then printed.
- The “soup.find_all()” method is used to find all the paragraph tags in the HTML content.
Running this code prints the page’s <title> tag, followed by a list of every <p> tag found in the parsed HTML.
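The same steps can be tried without a network connection. The sketch below is only an illustration: SAMPLE_HTML is made-up markup (not the site used above), and the variable names are assumptions. It shows how to pull plain text and attribute values out of the tags that BeautifulSoup returns.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of a live page.
SAMPLE_HTML = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph with a <a href="https://example.com/docs">link</a>.</p>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# soup.title is the <title> tag; .string gives the text inside it.
title_text = soup.title.string

# get_text() flattens each <p>, including text inside nested tags.
paragraph_texts = [p.get_text() for p in soup.find_all("p")]

# Attributes are read with dictionary-style access on a tag.
link_targets = [a["href"] for a in soup.find_all("a", href=True)]

print(title_text)
print(paragraph_texts)
print(link_targets)
```

Using get_text() instead of printing the tags directly returns only the readable text, which is usually what is needed after scraping a page.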
Method 2: Using PyQuery Module
Similar to BeautifulSoup, PyQuery is a Python library that provides convenient and flexible ways to extract data from HTML and XML documents. We must install this module in Python before we can use it.
The following command is used to install the “pyquery” module in Python:
pip install pyquery
The following example parses HTML with PyQuery:
import requests
from pyquery import PyQuery as pq

# Download the page
response_data = requests.get('https://itslinuxfoss.com')

# Build a PyQuery document from the response text
document = pq(response_data.text)

# Select the <title> element and print its text
page_title = document('title')
print(page_title.text())
- The “requests” module and the “PyQuery” class (aliased as “pq”) from the pyquery library are imported.
- The “requests.get()” function is used to send a GET request to the URL ‘https://itslinuxfoss.com’.
- The PyQuery object is created using the HTML content of the “response_data” object.
- The “document('title')” call selects the page’s <title> element.
- The text of the page title is printed using the “page_title.text()” method.
The title of the specified website has been extracted and printed.
Method 3: Using lxml Library
The lxml library processes and manipulates XML and HTML documents. Install it with the following command before using it in Python:
pip install lxml
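This example code uses lxml to parse HTML:
from lxml import html

# Build an element tree from an HTML string
my_tree = html.fromstring('<html><body><h1>Python Guide!</h1></body></html>')

# XPath returns a list with the text of every matching <h1>
my_header = my_tree.xpath('//h1/text()')
print(my_header)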
- The “html” module is imported from the “lxml” library.
- The “html.fromstring()” function is used to create an HTML tree object from a string.
- The string contains an HTML document with a single h1 tag that has the text “Python Guide!”.
- The “xpath()” method evaluates the expression “//h1/text()” and returns a list containing the text of every h1 tag in the tree, not only the first.
The code prints ['Python Guide!'], a list holding the extracted heading text.
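Because “xpath()” returns a list, it helps to see how it behaves when several elements match. The sketch below assumes a slightly larger, made-up document (SAMPLE_HTML and its links are hypothetical) and shows text, attribute, and element queries side by side.

```python
from lxml import html

# Hypothetical sample markup with more than one match per query.
SAMPLE_HTML = """
<html>
  <body>
    <h1>Python Guide!</h1>
    <h1>Second Heading</h1>
    <a href="/intro">Intro</a>
    <a href="/install">Install</a>
  </body>
</html>
"""

tree = html.fromstring(SAMPLE_HTML)

# text() returns one string per matching element.
headings = tree.xpath("//h1/text()")

# Guard against an empty result before indexing into the list.
first_heading = headings[0] if headings else None

# @href returns the attribute values directly.
link_targets = tree.xpath("//a/@href")

# Without text() or @attr, xpath() returns element objects.
link_labels = [a.text_content() for a in tree.xpath("//a")]

print(headings)
print(first_heading)
print(link_targets)
print(link_labels)
```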
The BeautifulSoup, PyQuery, and lxml libraries all provide functions for parsing HTML in Python. BeautifulSoup is widely used and offers an easy, intuitive way to navigate and search the HTML tree, while lxml and PyQuery can also parse HTML and extract data from it. This guide presented three ways to parse HTML using Python.