List of the Best Python HTML Parsers

Discover the top Python HTML parsers—Beautiful Soup, HTMLParser, lxml, PyQuery, and Scrapy—to simplify and speed up your web scraping projects.

Python is a popular choice for web scraping, thanks to the number of HTML parsers available. In this article, you’ll explore the most widely used parsers: Beautiful Soup, HTMLParser, lxml, PyQuery, and Scrapy. These parsers are favored for their ease of use, speed, support for modern HTML standards, documentation, and community support.

Let’s jump right in!

Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML documents. It creates a parse tree that mirrors the structure of each page, making it easy to extract data automatically. This tree represents the hierarchy of elements within the document, allowing you to navigate and search through it efficiently to locate specific nodes.

Features and Ease of Use

Beautiful Soup is useful for organizing HTML documents into structured information. It works with various parsers, including html.parser, lxml, and html5lib, that help you handle different types of markup, such as standard HTML, malformed or broken HTML, XHTML, HTML5, and XML. This gives you the flexibility to choose the best balance between speed and accuracy. For instance, if you’re working with a web page that has missing tags or improperly nested elements, you can use html5lib to parse the HTML content just like a web browser would.
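
For example, here’s a minimal sketch that parses the same broken snippet with two of those parsers (it assumes you’ve also installed html5lib with pip3 install html5lib):

from bs4 import BeautifulSoup

# A deliberately malformed snippet: the <li> tags are never closed
broken_html = "<ul><li>First item<li>Second item"

# html.parser keeps the markup close to what it was given
print(BeautifulSoup(broken_html, "html.parser").prettify())

# html5lib repairs the document the way a browser would,
# adding the missing <html>, <head>, and <body> elements
print(BeautifulSoup(broken_html, "html5lib").prettify())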

Beautiful Soup can also help when it comes to web scraping tasks where the HTML structure is unpredictable or unorganized. Once a document is parsed, you can easily search the tree to locate nodes. Search methods such as find(), find_all(), and select() provide ways to access elements based on identifiers, classes, text content, or attributes. Whether you’re looking for all instances of a tag or targeting a specific element, using the right selector ensures quick access to the necessary data with minimal coding effort.
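
As a quick illustration, here’s a minimal sketch (using a small, hypothetical HTML snippet) of how those three search methods differ:

from bs4 import BeautifulSoup

html_doc = """
<div class="book" id="first">
  <h2>Title One</h2>
  <p class="price">£10.00</p>
</div>
<div class="book">
  <h2>Title Two</h2>
  <p class="price">£12.50</p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first element that matches
print(soup.find("div", id="first").h2.text)

# find_all() returns every element that matches
for book in soup.find_all("div", class_="book"):
    print(book.h2.text)

# select() accepts CSS selectors
for price in soup.select("div.book p.price"):
    print(price.text)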

Speed

Beautiful Soup isn’t the fastest parser, but it offers flexible parsing strategies that give you adaptability. By default, it uses Python’s html.parser, which is best suited to simple tasks, such as extracting data from a small document like a blog post. If you want to scrape and process a large amount of data, consider pairing Beautiful Soup with a faster parser like lxml.

Support for Up-to-Date HTML Standards

If you want to analyze HTML5 elements and attributes from static web pages, then Beautiful Soup is a great choice. Its compatibility with parsers such as html5lib ensures support for the most recent HTML standards.

Documentation and Support

Beautiful Soup has extensive documentation, and it’s used by more than 850,000 users on GitHub. Its documentation offers examples, tutorials, and references that make it easy to get started.

Learn more about web scraping with Beautiful Soup here.

Code Example

To install Beautiful Soup, run the following command from your shell or terminal:

pip3 install beautifulsoup4

The following code snippet uses Beautiful Soup to parse data from the Books to Scrape website:

import requests
from bs4 import BeautifulSoup

# URL of the webpage to scrape
books_page_url = "https://books.toscrape.com/"

# Fetch the webpage content
response = requests.get(books_page_url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the page
    soup_parser = BeautifulSoup(response.text, 'html.parser')

    # Find all articles that contain book information
    book_articles = soup_parser.find_all('article', class_='product_pod')

    # Loop through each book article and extract its title and price
    for book_article in book_articles:
        # Extract the title of the book
        book_name = book_article.h3.a['title']
        
        # Extract the price of the book
        book_cost = book_article.find('p', class_='price_color').text
        
        # Print the title and price of the book
        print(f"Title: {book_name}, Price: {book_cost}")
else:
    # Print an error message if the page could not be retrieved
    print("Failed to retrieve the webpage")

If you’d like to test this code out, save it in a file named beautifulsoup_books_scrape.py and run it using the following command:

python3 beautifulsoup_books_scrape.py

You should see all the titles and prices of books from the first page printed on your terminal or shell:

…output omitted…
Title: Soumission, Price: £50.10
Title: Sharp Objects, Price: £47.82
Title: Sapiens: A Brief History of Humankind, Price: £54.23
Title: The Requiem Red, Price: £22.65
Title: The Dirty Little Secrets of Getting Your Dream Job, Price: £33.34
…output omitted…

If you’re new to web scraping, Beautiful Soup’s simplicity and ability to navigate through the HTML tree make it a good choice for your web scraping projects.

HTMLParser

HTMLParser is a parsing class that ships with Python’s standard library (in the html.parser module) and allows you to parse and extract data from HTML documents.

Features and Ease of Use

Although HTMLParser lacks some of the features provided by other parsing libraries like lxml and html5lib, its simplicity and standard library integration make it a good choice for projects with simple data structures where the HTML content is consistent (e.g., scraping static web pages). However, if you’re dealing with malformed HTML content, HTMLParser is not the best option.

Speed

HTMLParser’s speed is adequate for most HTML parsing use cases where you have small to modestly sized documents (i.e., a few kilobytes to a couple of megabytes) and minimal preprocessing needs. However, for more complex HTML documents, parsers like lxml are preferred.

Support for Up-to-Date HTML Standards

HTMLParser supports basic HTML parsing, but it can struggle with very complex or poorly formed HTML documents. Moreover, it doesn’t fully support the latest HTML5 standard.

Documentation and Support

Because HTMLParser is part of Python’s standard library, it has reliable documentation and support. It’s also easy to find help through platforms like Stack Overflow, GitHub, and Python-related forums.

Code Example

As previously stated, the HTMLParser module is included with the Python standard library, and no additional installation is required.

Following is a code example using html.parser to parse HTML data:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
        
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
        
    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()

html_data = """
<html>
  <head><title>Example</title></head>
  <body><h1>Heading</h1><p>Paragraph.</p></body>
</html>
"""

parser.feed(html_data)

In this script, you extend the HTMLParser class to create an HTML parser that handles start tags, end tags, and text data, printing each item as it’s encountered.

To use this code, save it in a file named htmlparser_example.py and run it with the following command from your terminal or shell:

python3 htmlparser_example.py

The output shows each tag and data:

…output omitted…
Encountered a start tag: html
Encountered some data  : 
  
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Example
Encountered an end tag : title
Encountered an end tag : head
…output omitted…
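
If you want to pull out specific data, such as link URLs, you can inspect the attrs argument that handle_starttag() receives. Here’s a minimal sketch (with hypothetical markup) that collects every href it encounters:

from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

link_parser = LinkParser()
link_parser.feed('<p><a href="/page-1">One</a> and <a href="/page-2">Two</a></p>')
print(link_parser.links)  # ['/page-1', '/page-2']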

lxml

lxml is a popular choice for web scraping and data extraction because it combines the speed and power of the C libraries libxml2 and libxslt with the ease of use of a Python API.

Features and Ease of Use

lxml is popular thanks to its efficient and versatile functions for navigating and analyzing HTML and XML documents. It offers advanced XML processing features, including XPath, XSLT, and XPointer, allowing you to precisely extract and transform data.

Like Beautiful Soup, lxml supports tree structures, making it easy to navigate and parse HTML content. If you’re working with diverse content, its ability to work well with both formatted and improperly formatted documents can be helpful.
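
As a minimal sketch, lxml’s HTML parser quietly recovers from markup problems like unclosed tags:

from lxml import html

# Unclosed <p> tags are common in real-world pages;
# the HTML parser recovers from them automatically
broken = "<div><p>First paragraph<p>Second paragraph</div>"

tree = html.fromstring(broken)
print([p.text for p in tree.findall(".//p")])
# ['First paragraph', 'Second paragraph']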

Speed

lxml is well known for its speed and efficiency, thanks to its utilization of C libraries like libxml2 and libxslt. This makes lxml faster than other parsing libraries, especially when handling extensive documents or complex parsing tasks, such as extracting deeply nested data from large HTML tables.

lxml is a great choice for projects with tight deadlines or those that require the processing of large amounts of data.
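
For instance, here’s a small sketch (with hypothetical markup) of pulling rows out of an HTML table with a couple of XPath calls:

from lxml import html

table_html = """
<table>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""

tree = html.fromstring(table_html)

# One XPath call selects every row; a relative call pulls each cell's text
for row in tree.xpath("//tr"):
    print([cell.text for cell in row.xpath("./td")])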

Support for Up-to-Date HTML Standards

lxml can handle the latest web technologies, including HTML5 files and poorly structured HTML. This makes lxml one of the best choices for web scraping tasks where HTML quality and structure can vary.

Documentation and Support

lxml has comprehensive documentation with detailed examples that cater to developers of all levels. Additionally, you can seek more information, troubleshooting tips, and best practices on platforms like Stack Overflow and GitHub.

Learn more about web scraping with lxml here.

Code Example

To install lxml, run the following:

pip3 install lxml

The following example shows you how to parse HTML data with lxml:

from lxml import html

html_content = """
<html>
  <body>
    <h1>Hello, world!</h1>
    <p>This is a paragraph.</p>
  </body>
</html>
"""

tree = html.fromstring(html_content)

h1_text = tree.xpath('//h1/text()')[0]
print("H1 text:", h1_text)

p_text = tree.xpath('//p/text()')[0]
print("Paragraph text:", p_text)

Here, you use lxml to parse HTML content, then you extract text from the HTML elements with XPath expressions.

If you want to test lxml out, save this code to a file called lxml_example.py and then run it with the following command from your shell or terminal:

python3 lxml_example.py

You should see the text from the <h1> and <p> elements printed out like this:

H1 text: Hello, world!
Paragraph text: This is a paragraph.

If you need a full-fledged, production-ready parser that can handle complex XPath queries (such as filtering on attribute values or selecting multiple items in a single expression), lxml is the way to go.
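
For example, here’s a brief sketch (with hypothetical markup) of the kind of predicate- and function-based XPath query that simpler selector APIs can’t express as directly:

from lxml import html

listing = """
<ul>
  <li class="book" data-price="12.99">Cheap Book</li>
  <li class="book" data-price="45.00">Expensive Book</li>
</ul>
"""

tree = html.fromstring(listing)

# Combine an attribute predicate with an XPath function to filter by price
titles = tree.xpath('//li[@class="book"][number(@data-price) > 20]/text()')
print(titles)  # ['Expensive Book']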

PyQuery

PyQuery is a jQuery-like library for Python that lets you query and manipulate HTML documents with jQuery-style syntax, making it quick to scrape whole web pages.

Features and Ease of Use

Because its syntax closely mirrors jQuery’s, PyQuery is user-friendly. You can easily select elements, loop over them, update their content, and manage HTML attributes. This is especially useful for tasks like web scraping, where you want to pull data from HTML pages and work with it.

PyQuery also supports CSS selectors, which makes it easy to get started if you’re already familiar with manipulating the DOM using jQuery.
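
Here’s a short sketch (with hypothetical markup) of selecting elements, looping over them, and updating an attribute, jQuery-style:

from pyquery import PyQuery as pq

doc = pq("""
<ul>
  <li class="item"><a href="/page-a">A</a></li>
  <li class="item"><a href="/page-b">B</a></li>
</ul>
""")

# Select with a CSS selector and loop over the matches
for link in doc("li.item a").items():
    print(link.text(), link.attr("href"))

# Update an attribute on every matched element
doc("li.item a").attr("target", "_blank")
print(doc)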

Speed

PyQuery uses the lxml library under the hood to parse HTML. This keeps it convenient to use, but it’s slower than calling lxml directly.

Support for Up-to-Date HTML Standards

PyQuery complies with the latest HTML5 standards, and since it uses lxml for parsing, PyQuery can handle both structured and unstructured HTML.

Documentation and Support

PyQuery provides thorough documentation that can help you get started quickly. While it has a smaller community than other libraries, it’s actively supported by over forty contributors. There are also other resources available, such as online forums, Stack Overflow, and various tutorials, that can help you if you run into issues.

Code Example

To install PyQuery, run the following:

pip3 install pyquery

Here’s a code snippet that uses pyquery to parse HTML data:

from pyquery import PyQuery as pq

html_content = """
<html>
  <body>
    <h1>Hello, from PyQuery!</h1>
    <p>This is a paragraph.</p>
  </body>
</html>
"""

doc = pq(html_content)

h1_text = doc('h1').text()
print("H1 text:", h1_text)

p_text = doc('p').text()
print("Paragraph text:", p_text)

In this snippet, you parse HTML content and then extract text from specific elements.

Save this code to a file called pyquery_example.py and run it using the following command from your shell or terminal:

python3 pyquery_example.py

Your output looks like this:

H1 text: Hello, from PyQuery!
Paragraph text: This is a paragraph.

If you already know how to use jQuery and are looking for similar features, then PyQuery is a great choice.

Scrapy

Scrapy is a flexible and open source web scraping framework that enables users to construct and operate spiders to collect information. It offers tools to handle every aspect of a scraping task, from managing HTTP requests to parsing, processing, and saving extracted data. The framework manages all the complexity involved in scraping tasks so that you can focus on collecting the desired information.

Features and Ease of Use

Scrapy is designed for ease of use and excels at parsing complex web data with a modular framework. It offers XPath and CSS selectors for navigating HTML and XML, and it provides utilities like request throttling and user agent configuration (with IP rotation typically added through proxy middleware), which are essential for large-scale scraping.
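
As a rough sketch, these kinds of behaviors are configured through Scrapy’s settings; note that IP rotation itself usually comes from a proxy service or a downloader middleware rather than a single built-in setting:

# settings.py (excerpt)

# Identify your crawler with a custom user agent
USER_AGENT = "my-scraper (+https://www.example.com)"

# Throttle requests politely based on server load
AUTOTHROTTLE_ENABLED = True
DOWNLOAD_DELAY = 0.5

# Cap how many requests run in parallel
CONCURRENT_REQUESTS = 8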

Speed

Scrapy is efficient. Built on an asynchronous networking engine, it processes requests concurrently, which is especially powerful when collecting large data sets or scraping commercial websites.

Support for Up-to-Date HTML Standards

Scrapy supports HTML5 standards and can handle complex websites, even those whose content is generated dynamically with JavaScript. While Scrapy itself doesn’t execute JavaScript, it works alongside tools like Selenium to handle JavaScript-driven pages.

Read more about how to scrape dynamic content here.

Documentation and Support

Scrapy has tons of documentation and a vibrant community backing it up. The official documentation covers everything you need to know about basic usage and advanced topics, and it includes plenty of examples, guides, and recommended practices to support developers of all levels.

Moreover, the Scrapy community actively engages through forums and GitHub repositories, ensuring that you can seek assistance and access resources for any issues you face.

Code Example

To install Scrapy, run the following:

pip3 install scrapy

Following is an example using a Scrapy spider to extract data:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

This script defines a spider class, sets the start URLs, and specifies how to parse the response to extract data.

Save this code in a file named quotes_spider.py and run it using the following command from your terminal or shell:

scrapy runspider quotes_spider.py -o quotes.json

When you execute this code, Scrapy crawls a Quotes to Scrape page and extracts the quotes from that page along with their respective authors and tags. Then, Scrapy saves the scraped data in a quotes.json file that looks like this:

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]}
…output omitted…
]
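
To crawl beyond the first page, a common pattern is to follow the “next” link and feed it back into the same callback. Here’s a hedged sketch extending the spider above:

import scrapy

class PagedQuotesSpider(scrapy.Spider):
    name = "paged_quotes"
    start_urls = ["http://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow the pagination link, if the page has one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)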

For complex web extraction projects where you have specific needs, Scrapy is a great option with its feature-rich tools and scalability.

Conclusion

In this article, you learned about five HTML parsers for Python: Beautiful Soup, HTMLParser, lxml, PyQuery, and Scrapy.

Beautiful Soup is great for beginners due to its simplicity. HTMLParser is a lightweight option for basic parsing needs. lxml is something to consider if you’re looking for better performance and XPath support. PyQuery brings a jQuery-like simplicity to Python. Scrapy is the framework to use if you’re tackling large-scale scraping projects.

Want to skip scraping and get the data? Check out our datasets by signing up and download a free sample now.
