In this guide on web scraping with Parsel in Python, you will learn:
- What Parsel is
- Why use it for web scraping
- A step-by-step tutorial that shows how to use Parsel for web scraping
- Advanced scraping scenarios with Parsel in Python
Let’s dive in!
What Is Parsel?
Parsel is a Python library for parsing and extracting data from HTML, XML, and JSON documents. It builds on top of lxml, providing a higher-level, more user-friendly interface with an intuitive API that simplifies extracting data from HTML and XML documents.
Why Use Parsel for Web Scraping
Parsel comes with interesting features for web scraping, such as:
- Support for XPath and CSS selectors: Use either XPath or CSS selectors to locate elements in HTML or XML documents. Find out more in our guide on XPath vs CSS selector for web scraping.
- Data extraction: Retrieve text, attributes, or other content from the selected elements.
- Chaining selectors: Chain multiple selectors to refine your data extraction.
- Scalability: The library works well with both small and large scraping projects.
Note that the library is tightly integrated into Scrapy, which uses it to parse and extract data from web pages. Still, Parsel can also be utilized as a standalone library.
How to Use Parsel in Python for Web Scraping: A Step-by-Step Tutorial
This section will guide you through the process of scraping the web with Parsel in Python. The target site will be "Hockey Teams: Forms, Searching and Pagination":
The Parsel scraper will extract all the data from the above table. Follow the steps below and see how to build it!
Prerequisites and Dependencies
To replicate this tutorial, you must have Python 3.10.1 or higher installed on your machine. In particular, note that Parsel has recently removed support for Python 3.8.
Suppose you call the main folder of your project `parsel_scraping/`. At the end of this step, the folder will have the following structure:
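```
parsel_scraping/
├── parsel_scraper.py
└── venv/
```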
Where:
- `parsel_scraper.py` is the Python file that contains the scraping logic.
- `venv/` contains the virtual environment.
You can create the `venv/` virtual environment directory like so:
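```bash
python -m venv venv
```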
To activate it, on Windows, run:
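```powershell
venv\Scripts\activate
```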
Equivalently, on macOS and Linux, execute:
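```bash
source venv/bin/activate
```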
In an activated virtual environment, install the dependencies with:
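```bash
pip install parsel requests
```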
These two dependencies are:
- `parsel`: A library for parsing HTML and extracting data.
- `requests`: Required because `parsel` is only an HTML parser. To perform web scraping, you also need an HTTP client like Requests to retrieve the HTML documents of the pages you want to scrape.
Wonderful! You now have what you need to perform web scraping with Parsel in Python.
Step 1: Define The Target URL and Parse The Content
As a first step of this tutorial, you need to import the libraries:
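```python
import requests
from parsel import Selector
```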
Then, define the target webpage, fetch the content with Requests, and parse it with Parsel:
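Here is a minimal sketch. The URL below assumes the public "Hockey Teams" demo page on scrapethissite.com is the target; adjust it if your target differs:

```python
# Assumed URL of the "Hockey Teams: Forms, Searching and Pagination" page
url = "https://www.scrapethissite.com/pages/forms/"

# Fetch the HTML document of the target page
response = requests.get(url)

# Parse the HTML content with Parsel
selector = Selector(text=response.text)
```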
The above snippet instantiates Parsel's `Selector()` class, which parses the HTML read from the response of the HTTP request made with `get()`.
Step 2: Extract All The Rows From The Table
If you inspect the table on the target web page in the browser, you will see the following HTML:
Since the table contains multiple rows, initialize a list in which to store the scraped data:
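```python
data = []
```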
Now, note that the HTML table has a `.table` class. To select all rows from the table, you can use the line of code below:
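```python
# Select all rows inside the table with the "table" class
rows = selector.css("table.table tr")
```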
The `css()` method applies the CSS selector to the parsed HTML structure.
Time to iterate over the selected rows and extract data from them!
Step 3: Iterate Over The Rows
Just like before, inspect a row inside the table:
What you can notice is that each row contains the following information in dedicated columns:
- Team name → inside the `.name` element
- Season year → inside the `.year` element
- Number of wins → inside the `.wins` element
- Number of losses → inside the `.losses` element
- Overtime losses → inside the `.ot-losses` element
- Winning percentage → inside the `.pct` element
- Goals scored (Goals For – GF) → inside the `.gf` element
- Goals conceded (Goals Against – GA) → inside the `.ga` element
- Goal difference → inside the `.diff` element
You can extract all that info with the following logic:
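Here is one possible implementation of that logic. The dictionary keys mirror the CSS classes listed above, the header row is skipped because it has no `td.name` cell, and empty cells fall back to an empty string before stripping:

```python
for row in rows:
    # "::text" selects the text node of a cell; get() returns None
    # for rows without that cell (e.g., the header row), which are skipped
    name = row.css("td.name::text").get()
    if name is None:
        continue

    # strip() removes leading and trailing whitespace from each value
    data.append({
        "name": name.strip(),
        "year": (row.css("td.year::text").get() or "").strip(),
        "wins": (row.css("td.wins::text").get() or "").strip(),
        "losses": (row.css("td.losses::text").get() or "").strip(),
        "ot-losses": (row.css("td.ot-losses::text").get() or "").strip(),
        "pct": (row.css("td.pct::text").get() or "").strip(),
        "gf": (row.css("td.gf::text").get() or "").strip(),
        "ga": (row.css("td.ga::text").get() or "").strip(),
        "diff": (row.css("td.diff::text").get() or "").strip(),
    })
```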
Here is what the above code does:
- The `::text` pseudo-element selects text nodes, and the `get()` method returns the first match.
- The `strip()` method removes any leading and trailing whitespace.
- The `append()` method adds the extracted content to the `data` list.
Great! The Parsel data scraping logic is complete.
Step 4: Print The Data and Run The Program
As a final step, print the scraped data in the CLI:
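```python
print(data)
```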
Run the program:
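```bash
python parsel_scraper.py
```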
This is the expected result:
Amazing! That is exactly the data on the page but in a structured format.
Step 5: Manage Pagination
So far, you have retrieved the data only from the first page of the target site. To retrieve it all, you need to handle pagination by making some changes to the code.
First, you have to encapsulate the previous code into a function like this one:
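Here is a compact sketch of such a function, reusing the logic from the previous steps (the `COLUMNS` list is a helper introduced here, mirroring the CSS classes identified earlier):

```python
# CSS classes of the table columns identified during inspection
COLUMNS = ["name", "year", "wins", "losses", "ot-losses", "pct", "gf", "ga", "diff"]

def scrape_page(url):
    # Fetch and parse the given page
    response = requests.get(url)
    selector = Selector(text=response.text)

    page_data = []
    for row in selector.css("table.table tr"):
        # Skip rows without a team name (e.g., the header row)
        if row.css("td.name::text").get() is None:
            continue
        # Extract and clean the text of each cell
        page_data.append({
            column: (row.css(f"td.{column}::text").get() or "").strip()
            for column in COLUMNS
        })
    return page_data
```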
Now, take a look at the HTML element that manages the pagination:
This includes a list of all pages, each with its URL embedded in an `<a>` element. Encapsulate the logic for retrieving all pagination URLs in a function:
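Below is one possible implementation. It assumes the pagination links live inside an element with the `pagination` class (as on the demo page) and reuses the base URL from Step 1:

```python
def get_all_page_urls(base_url="https://www.scrapethissite.com/pages/forms/"):
    # Fetch and parse the first page
    response = requests.get(base_url)
    selector = Selector(text=response.text)

    # Extract all pagination links
    page_links = selector.css("ul.pagination a::attr(href)").getall()

    # Remove duplicates and convert relative URLs into absolute ones
    unique_links = list(set(page_links))
    return [urljoin(base_url, link) for link in unique_links]
```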
This function does the following:
- The `getall()` method retrieves all the pagination links.
- The `list(set())` pattern removes duplicates to prevent visiting the same page twice.
- The `urljoin()` function, from the standard library's `urllib.parse` module, converts all relative URLs into absolute URLs so they can be used for further HTTP requests.
To make the above code work, ensure you import `urljoin` from the Python standard library:
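```python
from urllib.parse import urljoin
```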
You can now scrape all pages with:
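```python
# Retrieve the URLs of all pages
page_urls = get_all_page_urls()

# Scrape each page and aggregate the results
data = []
for page_url in page_urls:
    data.extend(scrape_page(page_url))

# Print the scraped data
print(data)
```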
The above snippet:
- Retrieves all page URLs by calling the `get_all_page_urls()` function.
- Scrapes data from each page by calling `scrape_page()`, then aggregates the results with the `extend()` method.
- Prints the scraped data.
Fantastic! The Parsel pagination logic is now implemented.
Step 6: Put It All Together
Below is what the `parsel_scraper.py` file should now contain:
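For reference, here is one possible version of the complete script, assembled from the sketches above (the target URL, the pagination class, and the column classes are assumptions based on the demo page):

```python
import requests
from parsel import Selector
from urllib.parse import urljoin

# Assumed URL of the "Hockey Teams" demo page
BASE_URL = "https://www.scrapethissite.com/pages/forms/"

# CSS classes of the table columns identified during inspection
COLUMNS = ["name", "year", "wins", "losses", "ot-losses", "pct", "gf", "ga", "diff"]


def scrape_page(url):
    # Fetch the HTML document of the given page and parse it with Parsel
    response = requests.get(url)
    selector = Selector(text=response.text)

    page_data = []
    for row in selector.css("table.table tr"):
        # Skip rows without a team name (e.g., the header row)
        if row.css("td.name::text").get() is None:
            continue
        # Extract and clean the text of each cell
        page_data.append({
            column: (row.css(f"td.{column}::text").get() or "").strip()
            for column in COLUMNS
        })
    return page_data


def get_all_page_urls(base_url=BASE_URL):
    # Fetch the first page and extract all pagination links
    response = requests.get(base_url)
    selector = Selector(text=response.text)
    page_links = selector.css("ul.pagination a::attr(href)").getall()

    # Remove duplicates and turn relative URLs into absolute ones
    return [urljoin(base_url, link) for link in list(set(page_links))]


# Retrieve the URLs of all pages
page_urls = get_all_page_urls()

# Scrape each page and aggregate the results
data = []
for page_url in page_urls:
    data.extend(scrape_page(page_url))

# Print the scraped data
print(data)
```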
Very well! You have completed your first scraping project with Parsel.
Advanced Web Scraping Scenarios with Parsel in Python
In the previous section, you learned how to use Parsel in Python to extract the data from a target web page using CSS selectors. Time to consider some more advanced scenarios!
Select Elements by Text
Parsel provides different query methods to retrieve text from HTML by using XPath. In particular, the `text()` function extracts the text content of an element.
Imagine you have an HTML document with an `<h1>` heading and a few `<p>` paragraphs. You can retrieve all their text like so (the sample HTML below is made up for illustration):
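```python
from parsel import Selector

# Hypothetical HTML document used for illustration
html = """
<html>
  <body>
    <h1>Welcome to my website</h1>
    <p>This is the first paragraph.</p>
    <p>This is the second paragraph.</p>
  </body>
</html>
"""

selector = Selector(text=html)

# Extract the text nodes of the <h1> and <p> elements
texts = selector.xpath("//h1/text() | //p/text()").getall()
print(texts)
# Output:
# ['Welcome to my website', 'This is the first paragraph.', 'This is the second paragraph.']
```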
This snippet locates the `<h1>` and `<p>` tags and extracts their text with `text()`, resulting in the list shown in the output comment.
Another useful function is `contains()`, which matches elements whose text contains a specific string. For example, suppose you want to extract the text only from the paragraphs that contain the word "test". You can do it with the following code (again, the sample HTML is made up for illustration):
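```python
from parsel import Selector

# Hypothetical HTML document used for illustration
html = """
<html>
  <body>
    <p>This is a test paragraph.</p>
    <p>Another paragraph.</p>
    <p>One more test here.</p>
  </body>
</html>
"""

selector = Selector(text=html)

# Select only the paragraphs whose text contains the word "test"
texts = selector.xpath("//p[contains(text(), 'test')]/text()").getall()
print(texts)
# Output:
# ['This is a test paragraph.', 'One more test here.']
```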
The XPath expression `p[contains(text(), 'test')]/text()` takes care of querying only the paragraphs containing "test", producing the result shown in the output comment.
But what if you want to match text that starts with a specific string? In that case, you can use the `starts-with()` function. For example, to retrieve the text from the paragraphs that start with the word "Start", use `p[starts-with(text(), 'Start')]/text()` like so (the sample HTML is made up for illustration):
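```python
from parsel import Selector

# Hypothetical HTML document used for illustration
html = """
<html>
  <body>
    <p>Start here to learn Parsel.</p>
    <p>This paragraph does not match.</p>
    <p>Start scraping with XPath.</p>
  </body>
</html>
"""

selector = Selector(text=html)

# Select only the paragraphs whose text starts with "Start"
texts = selector.xpath("//p[starts-with(text(), 'Start')]/text()").getall()
print(texts)
# Output:
# ['Start here to learn Parsel.', 'Start scraping with XPath.']
```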
The above snippet prints only the text of the paragraphs that begin with "Start".
Learn more about CSS vs. XPath selectors.
Using Regular Expressions
Parsel also lets you retrieve text under more advanced conditions by using regular expressions, via the EXSLT `re:test()` function in XPath. For example, to extract the text from the paragraphs containing only numeric values, you can use `re:test()` as follows (the sample HTML is made up for illustration):
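```python
from parsel import Selector

# Hypothetical HTML document used for illustration
html = """
<html>
  <body>
    <p>12345</p>
    <p>Hello world</p>
    <p>67890</p>
  </body>
</html>
"""

selector = Selector(text=html)

# Select only the paragraphs whose text consists of digits only
texts = selector.xpath(r"//p[re:test(text(), '^\d+$')]/text()").getall()
print(texts)
# Output:
# ['12345', '67890']
```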
As shown in the output comment, only the numeric paragraphs are returned.
Another typical use of regular expressions is matching email addresses, for example to extract the text of the paragraphs that contain only an email address. Below is how you can use `re:test()` to select those nodes (the sample HTML is made up for illustration):
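```python
from parsel import Selector

# Hypothetical HTML document used for illustration
html = """
<html>
  <body>
    <p>alice@example.com</p>
    <p>Contact us for more info.</p>
    <p>bob@example.org</p>
  </body>
</html>
"""

selector = Selector(text=html)

# Select only the paragraphs whose text is a single email address
texts = selector.xpath(
    r"//p[re:test(text(), '^[\w.+-]+@[\w-]+\.[\w.]+$')]/text()"
).getall()
print(texts)
# Output:
# ['alice@example.com', 'bob@example.org']
```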
As the output comment shows, only the paragraphs containing email addresses are selected.
Navigating the HTML Tree
Parsel lets you navigate the HTML tree with XPath, no matter how deeply nested it is.
For example, suppose a `<p>` node is nested inside a `<div>`. You can get the parent element of the `<p>` node like so (the sample HTML is made up for illustration):
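```python
from parsel import Selector

# Hypothetical HTML document used for illustration
html = """
<html>
  <body>
    <div id="content">
      <p>Nested paragraph</p>
    </div>
  </body>
</html>
"""

selector = Selector(text=html)

# Select the parent element of the <p> node
parent = selector.xpath("//p/parent::*").get()
print(parent)
# Prints the serialized <div id="content"> element, including the nested <p>
```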
This prints the serialized parent `<div>` element, including the nested `<p>` node.
Similarly, you can work with sibling elements. For instance, you can use the `following-sibling` axis to retrieve the siblings of a node as follows (the sample HTML is made up for illustration):
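```python
from parsel import Selector

# Hypothetical HTML document used for illustration
html = """
<html>
  <body>
    <h1>Title</h1>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

selector = Selector(text=html)

# Select the text of all <p> elements that follow the <h1> node
siblings = selector.xpath("//h1/following-sibling::p/text()").getall()
print(siblings)
# Output:
# ['First paragraph', 'Second paragraph']
```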
As the output comment shows, the expression returns the text of both paragraphs that follow the `<h1>` node.
Parsel Alternatives for HTML Parsing in Python
Parsel is one of the libraries available in Python for web scraping, but it is not the only one. Below are other well-known and widely used ones:
- Beautiful Soup: A Python library that makes it easy to scrape information from web pages. Learn how to use it in our guide on web scraping with Beautiful Soup.
- `lxml`: A Pythonic binding for the `libxml2` and `libxslt` libraries. See it in action in our tutorial on lxml for web data parsing.
- PyQuery: A library that enables you to make jQuery-like queries on XML documents, which makes it one of the best 5 Python HTML parsers.
- Scrapy: An open-source and collaborative framework for extracting the data you need from websites. See how to use Scrapy for web scraping.
- `html.parser`: A module from the Python standard library that provides a class for parsing HTML and XHTML text content.
- `html5-parser`: A fast implementation of HTML 5 parsing for Python.
Conclusion
In this article, you learned about Parsel in Python and how to use it for web scraping. You started with the basics and then explored more complex scenarios.
No matter which Python scraping library you use, the biggest hurdle is that most websites safeguard their data with anti-bot and anti-scraping measures. These defenses can identify and block automated requests, rendering traditional scraping techniques ineffective.
Fortunately, Bright Data offers a suite of solutions to avoid these issues:
- Web Unlocker: An API that bypasses anti-scraping protections and delivers clean HTML from any webpage with minimal effort.
- Scraping Browser: A cloud-based, controllable browser with JavaScript rendering. It automatically handles CAPTCHAs, browser fingerprinting, retries, and more for you. It integrates seamlessly with Playwright, Puppeteer, and Selenium.
- Web Scraper APIs: Endpoints for programmatic access to structured web data from dozens of popular domains.
Don’t want to deal with web scraping but are still interested in online data? Explore our ready-to-use datasets!
Sign up for Bright Data now and start your free trial to test our scraping solutions.
No credit card required