Web Scraping With Parsel in Python: 2025 Guide

Master web scraping with Parsel! Learn how to extract data using XPath & CSS selectors, handle pagination, and tackle advanced scraping scenarios.

In this guide on web scraping with Parsel in Python, you will learn:

  • What Parsel is
  • Why use it for web scraping
  • A step-by-step tutorial that shows how to use Parsel for web scraping
  • Advanced scraping scenarios with Parsel in Python

Let’s dive in!

What Is Parsel?

Parsel is a Python library for parsing and extracting data from HTML, XML, and JSON documents. It builds on top of lxml, providing a higher-level and more user-friendly interface for web scraping. In detail, it offers an intuitive API that simplifies the process of extracting data from HTML and XML documents.

Why Use Parsel for Web Scraping

Parsel comes with interesting features for web scraping, such as:

  • Support for XPath and CSS selectors: Use either XPath or CSS selectors to locate elements in HTML or XML documents. Find out more in our guide on XPath vs CSS selector for web scraping.
  • Data extraction: Retrieve text, attributes, or other content from the selected elements.
  • Chaining selectors: Chain multiple selectors to refine your data extraction.
  • Scalability: The library works well with both small and large scraping projects.

Note that the library is tightly integrated into Scrapy, which uses it to parse and extract data from web pages. Still, Parsel can also be utilized as a standalone library.
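
To get a feel for the API before diving into the tutorial, below is a minimal standalone sketch (the HTML snippet is made up purely for illustration):

from parsel import Selector

# A made-up HTML snippet, just to demonstrate the API
html = """
<div class="product">
  <h2>Laptop</h2>
  <span class="price">$999</span>
</div>
"""

selector = Selector(text=html)

# CSS and XPath selectors can be freely mixed and chained
product = selector.css("div.product")
name = product.xpath(".//h2/text()").get()
price = product.css("span.price::text").get()

print(name, price)  # Laptop $999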

How to Use Parsel in Python for Web Scraping: A Step-by-Step Tutorial

This section will guide you through the process of scraping the Web with Parsel in Python. The target site will be “Hockey Teams: Forms, Searching and Pagination”:

The tabular data from the target page

The Parsel scraper will extract all the data from the above table. Follow the steps below and see how to build it!

Prerequisites and Dependencies

To replicate this tutorial, you must have Python 3.10.1 or higher installed on your machine. In particular, note that Parsel has recently removed support for Python 3.8.

Suppose you call the main folder of your project parsel_scraping/. At the end of this step, the folder will have the following structure:

parsel_scraping/
    ├── parsel_scraper.py
    └── venv/

Where:

  • parsel_scraper.py is the Python file that contains the scraping logic.
  • venv/ contains the virtual environment.

You can create the venv/ virtual environment directory like so:

python -m venv venv

To activate it, on Windows, run:

venv\Scripts\activate

Equivalently, on macOS and Linux, execute:

source venv/bin/activate

In an activated virtual environment, install the dependencies with:

pip install parsel requests

These two dependencies are:

  • parsel: A library for parsing HTML and extracting data.
  • requests: Required because parsel is only an HTML parser. To perform web scraping, you also need an HTTP client like Requests to retrieve the HTML documents of the pages you want to scrape.

Wonderful! You now have what you need to perform web scraping with Parsel in Python.

Step 1: Define The Target URL and Parse The Content

As a first step of this tutorial, you need to import the libraries:

import requests
from parsel import Selector

Then, define the target webpage, fetch the content with Requests, and parse it with Parsel:

url = "https://www.scrapethissite.com/pages/forms/"
response = requests.get(url)
selector = Selector(text=response.text)

The above snippet instantiates Parsel’s Selector() class, which parses the HTML returned by the HTTP GET request made with requests.get().
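
Note that requests does not raise an error on 4xx/5xx responses by default, so a failed request would silently hand an error page to Parsel. If you want the scraper to fail fast (and optionally send a custom User-Agent), a hedged variant of the fetch step could look like this:

import requests
from parsel import Selector

url = "https://www.scrapethissite.com/pages/forms/"

# Optional: a custom User-Agent header; the exact value here is an arbitrary example
headers = {"User-Agent": "Mozilla/5.0 (compatible; parsel-tutorial)"}

response = requests.get(url, headers=headers, timeout=30)
# Raise an exception on 4xx/5xx status codes instead of parsing an error page
response.raise_for_status()

selector = Selector(text=response.text)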

Step 2: Extract All The Rows From The Table

If you inspect the table on the target web page in the browser, you will see the following HTML:

The inspected table

Since the table contains multiple rows, initialize a list in which to store the scraped data:

data = []

Now, note that the HTML table has a .table class and that each data row has a .team class. To select all rows from the table, you can use the line of code below:

rows = selector.css("table.table tr.team")

The css() method applies the given CSS selector to the parsed HTML and returns the matching nodes.
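
If you prefer XPath over CSS, an equivalent selection of the same rows would be:

# Equivalent row selection with XPath instead of CSS
rows = selector.xpath(
    "//table[contains(@class, 'table')]//tr[contains(@class, 'team')]"
)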

Time to iterate over the selected rows and extract data from them!

Step 3: Iterate Over The Rows

Just like before, inspect a row inside the table:

The inspected row

Notice that each row exposes the following information in dedicated columns:

  • Team name → inside the .name element
  • Season year → inside the .year element
  • Number of wins → inside the .wins element
  • Number of losses → inside the .losses element
  • Overtime losses → inside the .ot-losses element
  • Winning percentage → inside the .pct element
  • Goals scored (Goals For – GF) → inside the .gf element
  • Goals conceded (Goals Against – GA) → inside the .ga element
  • Goal difference → inside the .diff element

You can extract all that info with the following logic:

for row in rows:
    # Extract data from each column
    name = row.css("td.name::text").get()
    year = row.css("td.year::text").get()
    wins = row.css("td.wins::text").get()
    losses = row.css("td.losses::text").get()
    ot_losses = row.css("td.ot-losses::text").get()
    pct = row.css("td.pct::text").get()
    gf = row.css("td.gf::text").get()
    ga = row.css("td.ga::text").get()
    diff = row.css("td.diff::text").get()

    # Append the extracted data
    data.append({
        "name": name.strip(),
        "year": year.strip(),
        "wins": wins.strip(),
        "losses": losses.strip(),
        "ot_losses": ot_losses.strip(),
        "pct": pct.strip(),
        "gf": gf.strip(),
        "ga": ga.strip(),
        "diff": diff.strip()
    })

Here is what the above code does:

  1. The ::text pseudo-element selects the text nodes of each cell, and the get() method returns the first match.
  2. The method strip() removes any leading and trailing whitespace.
  3. The append() method appends the content to the data list.

Great! The Parsel data scraping logic is complete.
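
One caveat: get() returns None when a selector matches nothing, so calling .strip() on the result would raise an AttributeError for a missing cell. Parsel’s get() accepts a default value, which you can use to make the loop more defensive. Here is a sketch of the same extraction with that optional tweak:

for row in rows:
    # get(default="") avoids AttributeError when a cell is missing
    data.append({
        "name": row.css("td.name::text").get(default="").strip(),
        "year": row.css("td.year::text").get(default="").strip(),
        "wins": row.css("td.wins::text").get(default="").strip(),
        "losses": row.css("td.losses::text").get(default="").strip(),
        "ot_losses": row.css("td.ot-losses::text").get(default="").strip(),
        "pct": row.css("td.pct::text").get(default="").strip(),
        "gf": row.css("td.gf::text").get(default="").strip(),
        "ga": row.css("td.ga::text").get(default="").strip(),
        "diff": row.css("td.diff::text").get(default="").strip(),
    })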

Step 4: Print The Data and Run The Program

As a final step, print the scraped data in the CLI:

# Print the extracted data
print("Data from the page:")
for entry in data:
    print(entry)

Run the program:

python parsel_scraper.py

This is the expected result:

Amazing! That is exactly the data on the page but in a structured format.
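
If you would rather persist the results than print them, here is a minimal sketch using Python’s built-in csv module (the output.csv filename is just an example):

import csv

# Write the scraped records to a CSV file (the filename is arbitrary)
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["name", "year", "wins", "losses", "ot_losses", "pct", "gf", "ga", "diff"],
    )
    writer.writeheader()
    writer.writerows(data)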

Step 5: Manage Pagination

Up to the previous step, you retrieved data only from the first page of the target site. What if you now want to retrieve it all? To do so, you need to handle pagination by making some changes to the code.

First, you have to encapsulate the previous code into a function like this one:

def scrape_page(url):
    # Fetch the page content
    response = requests.get(url)
    # Parse the HTML content
    selector = Selector(text=response.text)

    # Scraping logic...

    return data

Now, take a look at the HTML element that manages the pagination:

This is a list of links to all the pages, each with its URL embedded in an <a> element. Encapsulate the logic for retrieving all pagination URLs in a function:

def get_all_page_urls(base_url="https://www.scrapethissite.com/pages/forms/"):
    # Fetch the first page to extract pagination links
    response = requests.get(base_url)
    # Parse the page
    selector = Selector(text=response.text)

    # Extract all page links from the pagination area
    page_links = selector.css("ul.pagination li a::attr(href)").getall()  # Adjust selector based on HTML structure

    unique_links = list(set(page_links))  # Remove duplicates if any

    # Construct full URLs for all pages
    full_urls = [urljoin(base_url, link) for link in unique_links]

    return full_urls

This function does the following:

  • The getall() method retrieves all the pagination links.
  • The set() call removes duplicate links so the same page is not visited twice, and list() converts the result back to a list.
  • The urljoin() function, from the standard urllib.parse module, converts relative URLs into absolute URLs so they can be used for further HTTP requests.

To make the above code work, ensure you import urljoin from the Python standard library:

from urllib.parse import urljoin
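
Keep in mind that set() does not preserve order, so the pages may be visited out of sequence. If order matters, you can sort the URLs before returning them. The sketch below assumes the site paginates with a page_num query parameter (which is the case at the time of writing):

from urllib.parse import parse_qs, urlparse

def page_number(url):
    # Read the page_num query parameter, defaulting to 1 when it is absent
    query = parse_qs(urlparse(url).query)
    return int(query.get("page_num", ["1"])[0])

# Sort the URLs numerically by page before returning them
full_urls.sort(key=page_number)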

You can now scrape all pages with:

# Where to store the scraped data
data = []

# Get all page URLs
page_urls = get_all_page_urls()

# Iterate over them and apply the scraping logic
for url in page_urls:
    # Scrape the current page
    page_data = scrape_page(url)
    # Add the scraped data to the list
    data.extend(page_data)

# Print the extracted data
print("Data from all pages:")
for entry in data:
    print(entry)

The above snippet:

  1. Retrieves all page URLs by calling the get_all_page_urls() function.
  2. Scrapes data from each page by calling scrape_page(), then aggregates the results with the extend() method.
  3. Prints the scraped data.

Fantastic! The Parsel pagination logic is now implemented.

Step 6: Put It All Together

Below is what the parsel_scraper.py file should now contain:

import requests
from parsel import Selector
from urllib.parse import urljoin

def scrape_page(url):
    # Fetch the page content
    response = requests.get(url)
    # Parse the HTML content
    selector = Selector(text=response.text)

    # Where to store the scraped data
    data = []

    # Select all rows in the table body
    rows = selector.css("table.table tr.team")

    # Iterate over each row and scrape data from it
    for row in rows:
        # Extract data from each column
        name = row.css("td.name::text").get()
        year = row.css("td.year::text").get()
        wins = row.css("td.wins::text").get()
        losses = row.css("td.losses::text").get()
        ot_losses = row.css("td.ot-losses::text").get()
        pct = row.css("td.pct::text").get()
        gf = row.css("td.gf::text").get()
        ga = row.css("td.ga::text").get()
        diff = row.css("td.diff::text").get()

        # Append the extracted data to the list
        data.append({
            "name": name.strip(),
            "year": year.strip(),
            "wins": wins.strip(),
            "losses": losses.strip(),
            "ot_losses": ot_losses.strip(),
            "pct": pct.strip(),
            "gf": gf.strip(),
            "ga": ga.strip(),
            "diff": diff.strip(),
        })

    return data

def get_all_page_urls(base_url="https://www.scrapethissite.com/pages/forms/"):
    # Fetch the first page to extract pagination links
    response = requests.get(base_url)
    # Parse the page
    selector = Selector(text=response.text)

    # Extract all page links from the pagination area
    page_links = selector.css("ul.pagination li a::attr(href)").getall()  # Adjust selector based on HTML structure

    unique_links = list(set(page_links))  # Remove duplicates if any

    # Construct full URLs for all pages
    full_urls = [urljoin(base_url, link) for link in unique_links]

    return full_urls

# Where to store the scraped data
data = []

# Get all page URLs
page_urls = get_all_page_urls()

# Iterate over them and apply the scraping logic
for url in page_urls:
    # Scrape the current page
    page_data = scrape_page(url)
    # Add the scraped data to the list
    data.extend(page_data)

# Print the extracted data
print("Data from all pages:")
for entry in data:
    print(entry)

Very well! You have completed your first scraping project with Parsel.

Advanced Web Scraping Scenarios with Parsel in Python

In the previous section, you learned how to use Parsel in Python to extract the data from a target web page using CSS selectors. Time to consider some more advanced scenarios!

Select Elements by Text

Parsel provides several XPath-based ways to retrieve text from HTML. The simplest is the text() function, which extracts the text content of an element.

Imagine you have HTML code such as this:

<html>
  <body>
    <h1>Welcome to Parsel</h1>
    <p>This is a paragraph.</p>
    <p>Another paragraph.</p>
  </body>
</html>

You can retrieve all the text like so:

from parsel import Selector

html = """
<html>
  <body>
    <h1>Welcome to Parsel</h1>
    <p>This is a paragraph.</p>
    <p>Another paragraph.</p>
  </body>
</html>
"""

selector = Selector(text=html)
# Extract text from the <h1> tag
h1_text = selector.xpath("//h1/text()").get()
print("H1 Text:", h1_text)
# Extract text from all <p> tags
p_texts = selector.xpath("//p/text()").getall()
print("Paragraph Text Nodes:", p_texts)

This snippet locates the <h1> and <p> tags and extracts their text with text(), resulting in:

H1 Text: Welcome to Parsel
Paragraph Text Nodes: ['This is a paragraph.', 'Another paragraph.']

Another useful function is contains(), which can be used to match elements that contain specific text. For example, suppose you have such an HTML code:

<html>
  <body>
    <p>This is a test paragraph.</p>
    <p>Another test paragraph.</p>
    <p>Unrelated content.</p>
  </body>
</html>

You now want to extract the text from the paragraphs that only contain the word “test.” You can do it with the following code:

from parsel import Selector

# html = """..."""

selector = Selector(text=html)
# Extract paragraphs containing the word "test"
test_paragraphs = selector.xpath("//p[contains(text(), 'test')]/text()").getall()
print("Paragraphs containing 'test':", test_paragraphs)

The XPath expression //p[contains(text(), 'test')]/text() queries only the paragraphs whose text contains “test”. The result will be:

Paragraphs containing 'test': ['This is a test paragraph.', 'Another test paragraph.']

But what if you want to match text that starts with a specific string? Well, you can use the starts-with() function! Consider this HTML:

<html>
  <body>
    <p>Start here.</p>
    <p>Start again.</p>
    <p>End here.</p>
  </body>
</html>

To retrieve the text from the paragraphs that start with the word “Start,” use //p[starts-with(text(), 'Start')]/text() like so:

from parsel import Selector

# html = """..."""

selector = Selector(text=html)
# Extract paragraphs where text starts with "Start"
start_paragraphs = selector.xpath("//p[starts-with(text(), 'Start')]/text()").getall()
print("Paragraphs starting with 'Start':", start_paragraphs)

The above snippet produces:

Paragraphs starting with 'Start': ['Start here.', 'Start again.']

Learn more about CSS vs. XPath selectors.

Using Regular Expressions

Parsel lets you filter elements with regular expressions directly in your XPath expressions via the re:test() function.

Consider this HTML:

<html>
  <body>
    <p>Item 12345</p>
    <p>Item ABCDE</p>
    <p>A paragraph</p>
    <p>2025 is the current year</p>
  </body>
</html>

To extract the text from the paragraphs that contain numeric values (at least one digit), you can use re:test() as follows:

from parsel import Selector

# html = """..."""

selector = Selector(text=html)
# Extract paragraphs where text matches a numeric pattern
numeric_items = selector.xpath("//p[re:test(text(), '\\d+')]/text()").getall()
print("Numeric Items:", numeric_items)

The result is:

Numeric Items: ['Item 12345', '2025 is the current year']

Another typical use of regular expressions is matching email addresses. For example, you can extract the text from paragraphs that contain an email address. Consider the following HTML:

<html>
  <body>
    <p>Contact us at support@example.com</p>
    <p>Send an email to info@domain.org</p>
    <p>No email here.</p>
  </body>
</html>

Below is how you can use re:test() to select nodes containing email addresses:

from parsel import Selector

selector = Selector(text=html)
# Extract paragraphs containing email addresses
emails = selector.xpath("//p[re:test(text(), '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}')]/text()").getall()
print("Email Matches:", emails)

That results in:

Email Matches: ['Contact us at support@example.com', 'Send an email to info@domain.org']
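
Besides re:test() in XPath, Parsel selectors also expose the re() and re_first() methods, which apply a regular expression to the extracted text and return the matches. For example, to pull just the email addresses out of the paragraphs above:

# Apply a regex directly to the selected text nodes and return the matches
email_addresses = selector.xpath("//p/text()").re(
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
)
print("Extracted emails:", email_addresses)
# Extracted emails: ['support@example.com', 'info@domain.org']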

Navigating the HTML Tree

Parsel lets you navigate the HTML tree with XPath, no matter how deeply nested it is.

Consider this HTML:

<html>
  <body>
    <div>
      <h1>Title</h1>
      <p>First paragraph</p>
    </div>
  </body>
</html>

You can get the parent element of the <p> node like so:

from parsel import Selector

selector = Selector(text=html)
# Select the parent of the <p> tag
parent_of_p = selector.xpath("//p/parent::*").get()
print("Parent of <p>:", parent_of_p)

Resulting in:

Parent of <p>: <div>
      <h1>Title</h1>
      <p>First paragraph</p>
    </div>

Similarly, you can manage sibling elements. Suppose you have the following HTML code:

<html>
  <body>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>

You can use following-sibling to retrieve sibling nodes as follows:

from parsel import Selector

selector = Selector(text=html)
# Select the next sibling of the first <li> element
next_sibling = selector.xpath("//li[1]/following-sibling::li[1]/text()").get()
print("Next Sibling of First <li>:", next_sibling)
# Select all siblings of the first <li> element
all_siblings = selector.xpath("//li[1]/following-sibling::li/text()").getall()
print("All Siblings of First <li>:", all_siblings)

Which results in:

Next Sibling of First <li>: Item 2
All Siblings of First <li>: ['Item 2', 'Item 3']
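
XPath also provides axes for moving in the opposite direction, such as preceding-sibling and ancestor. For instance, on the same list you can select every <li> that comes before the last one:

# Select all siblings that come before the last <li> element
previous_siblings = selector.xpath("//li[last()]/preceding-sibling::li/text()").getall()
print("Siblings before the last <li>:", previous_siblings)
# Siblings before the last <li>: ['Item 1', 'Item 2']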

Parsel Alternatives for HTML Parsing in Python

Parsel is one of the libraries available in Python for web scraping, but it is not the only one. Other well-known and widely used options include:

  • Beautiful Soup: A popular HTML/XML parsing library with a simple, beginner-friendly API that can sit on top of different underlying parsers.
  • lxml: The fast, C-based parser that Parsel itself builds on, with full XPath support.
  • PyQuery: A library that offers a jQuery-like syntax for querying documents with CSS selectors.
  • html5lib: A pure-Python parser that parses HTML the same way modern browsers do.

Conclusion

In this article, you learned about Parsel in Python and how to use it for web scraping. You started with the basics and then explored more complex scenarios.

No matter which Python scraping library you use, the biggest hurdle is that most websites safeguard their data with anti-bot and anti-scraping measures. These defenses can identify and block automated requests, rendering traditional scraping techniques ineffective.

Fortunately, Bright Data offers a suite of solutions to avoid any issue:

  • Web Unlocker: An API that bypasses anti-scraping protections and delivers clean HTML from any webpage with minimal effort.
  • Scraping Browser: A cloud-based, controllable browser with JavaScript rendering. It automatically handles CAPTCHAs, browser fingerprinting, retries, and more for you. It integrates seamlessly with Puppeteer, Playwright, and Selenium.
  • Web Scraper APIs: Endpoints for programmatic access to structured web data from dozens of popular domains.

Don’t want to deal with web scraping but are still interested in online data? Explore our ready-to-use datasets!

Sign up for Bright Data now and start your free trial to test our scraping solutions.

No credit card required