Web scraping is the process of automatically gathering data from websites for purposes such as analyzing data or fine-tuning AI models.
Python is a popular choice for web scraping due to its extensive array of scraping libraries, including lxml, which is used for parsing XML and HTML documents. lxml extends Python’s capabilities with a fast Python API built on the C libraries libxml2 and libxslt. It is also compatible with Python’s ElementTree API for working with hierarchical XML/HTML trees, making lxml a preferred tool for efficient and reliable web scraping.
In this article, you’ll learn how to use lxml for web scraping.
Bright Data Solutions as the Perfect Alternative
When it comes to web scraping, using lxml with Python is a powerful approach, but it can be time-consuming and costly, especially when dealing with complex websites or large volumes of data. Bright Data offers an efficient alternative with its ready-to-use datasets and Web Scraper APIs. These solutions significantly reduce the time and cost involved in data collection by providing pre-collected data from 100+ domains and easy-to-integrate scraping APIs.
With Bright Data, you can bypass the technical challenges of manual scraping, allowing you to focus on analyzing the data rather than retrieving it. Whether you need datasets tailored to your specific requirements or APIs that handle proxy management and CAPTCHA solving, Bright Data’s tools offer a streamlined, cost-effective solution for all your web scraping needs.
Using lxml for Web Scraping in Python
On the web, structured and hierarchical data can be represented in two formats—HTML and XML:
- XML is a bare-bones markup format that does not come with prebuilt tags or styles. The author creates the structure by defining their own tags. The format’s main purpose is to describe data in a standard structure that different systems can understand.
- HTML is a web markup language with predefined tags. These tags carry styling properties, such as font-size for <h1> tags or display for <img /> tags. HTML’s primary function is to structure web pages effectively.
lxml works with both HTML and XML documents.
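As a quick illustration (a minimal sketch, separate from the tutorial scripts that follow), the same library handles both formats: lxml.etree parses XML, while lxml.html parses HTML:

from lxml import etree, html

# Parse an XML snippet with user-defined tags
xml_doc = etree.fromstring("<books><book><title>Sample</title></book></books>")
print(xml_doc.findtext(".//title"))       # Sample

# Parse an HTML snippet with predefined tags
html_doc = html.fromstring("<html><body><h1>Hello</h1></body></html>")
print(html_doc.xpath("//h1/text()")[0])   # Hello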
Prerequisites
Before you can start web scraping with lxml, you need to install a few libraries on your machine:
pip install lxml requests cssselect
This command installs the following:
- lxml to parse XML and HTML
- requests for fetching web pages
- cssselect, which uses CSS selectors to extract HTML elements
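Before moving on, you can optionally run a quick sanity check to confirm the packages installed correctly:

# Optional: confirm the three packages are importable
import requests
import cssselect
from lxml import etree

print("lxml", etree.__version__)
print("requests", requests.__version__)
print("cssselect imported successfully")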
Parsing Static HTML Content
Two main types of web content can be scraped: static and dynamic. Static content is embedded in the HTML document when the web page initially loads, making it easy to scrape. In contrast, dynamic content is loaded continuously or triggered by JavaScript after the initial page load. Scraping dynamic content requires timing the scraping function to execute only after the content becomes available in the browser.
In this article, you start by scraping the Books to Scrape website, which has static HTML content designed for testing purposes. You extract the titles and prices of books and save that information as a JSON file.
To start, use your browser’s Dev Tools to identify the relevant HTML elements. Open Dev Tools by right-clicking the web page and selecting the Inspect option; in Chrome, you can also press F12. The right side of the screen displays the code responsible for rendering the page. To locate the specific HTML element that holds each book’s data, search through the code using the hover-to-select tool (the arrow icon in the top-left corner of the Dev Tools panel).
In Dev Tools, you should see the following code snippet:
<article class="product_pod">
  <!-- code omitted -->
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
    <!-- code omitted -->
  </div>
</article>
In this snippet, each book is contained within an <article> tag with the class product_pod. You target this element to extract the data. Create a new file named static_scrape.py and add the following code:
import requests
from lxml import html
import json

URL = "https://books.toscrape.com/"
content = requests.get(URL).text
This code imports the necessary libraries and defines a URL variable. It uses requests.get() to fetch the web page’s static HTML content by sending a GET request to the specified URL. The HTML code is then retrieved from the text attribute of the response.
Once the HTML content is obtained, your next step is to parse it using lxml and extract the necessary data. lxml offers two methods for extraction: XPath and CSS selectors. In this example, you use XPath to retrieve the book title and CSS selectors to fetch the book price.
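As a quick aside, here’s a minimal, self-contained comparison (not part of the tutorial script) showing that an XPath expression and a CSS selector can target the same element:

from lxml import html

snippet = html.fromstring('<div><p class="price_color">£51.77</p></div>')

# XPath: match <p> elements whose class attribute equals "price_color"
print(snippet.xpath('//p[@class="price_color"]/text()')[0])    # £51.77

# CSS selector: the same element, selected via cssselect
print(snippet.cssselect("p.price_color")[0].text_content())    # £51.77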
Append your script with the following code:
parsed = html.fromstring(content)
all_books = parsed.xpath('//article[@class="product_pod"]')
books = []
This code initializes the parsed variable using html.fromstring(content), which parses the HTML content into a hierarchical tree structure. The all_books variable uses an XPath expression to retrieve all <article> tags with the class product_pod from the web page. This attribute-matching syntax is specific to XPath expressions.
Next, add the following to your script to iterate through each book in all_books and extract its data:
for book in all_books:
    book_title = book.xpath('.//h3/a/@title')[0]
    price = book.cssselect("p.price_color")[0].text_content()
    books.append({"title": book_title, "price": price})
The book_title variable is defined using an XPath expression that retrieves the title attribute from an <a> tag within an <h3> tag. The dot (.) at the beginning of the expression specifies to start searching from the current <article> tag rather than the default starting point. Because xpath() returns a list of matches, indexing with [0] takes the single matching title string. The next line uses the cssselect method to extract the price from a <p> tag with the class price_color. Since cssselect also returns a list, indexing ([0]) accesses the first element, and text_content() retrieves the text inside the element. Each extracted title and price pair is then appended to the books list as a dictionary, which can be easily stored in a JSON file.
Now that you’ve completed the web scraping process, it’s time to save this data locally. Open your script file and input the following code:
with open("books.json", "w", encoding="utf-8") as file:
json.dump(books ,file)
In this code, a new file named books.json is created. The file is populated using the json.dump method, which takes the books list as the source and a file object as the destination.
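If you want books.json to be easier to read, you can optionally pass two standard json.dump arguments; this variation isn’t required for the rest of the tutorial:

with open("books.json", "w", encoding="utf-8") as file:
    # indent pretty-prints the file; ensure_ascii=False writes the £ symbol as-is
    json.dump(books, file, indent=2, ensure_ascii=False)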
You can test this script by opening the terminal and running the following command:
python static_scrape.py
This command generates a new books.json file in your directory.
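With json.dump’s default settings, the whole list is written on a single line and the pound sign is stored as a Unicode escape, so the file’s contents should look something like this excerpt:

[{"title": "A Light in the Attic", "price": "\u00a351.77"}, ...]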
All the code for this script is available on GitHub.
Parsing Dynamic HTML Content
Scraping dynamic web content is trickier than scraping static content because JavaScript renders the data continuously rather than all at once. To help scrape dynamic content, you use a browser automation tool called Selenium, which lets you create and run a browser instance and control it programmatically.
To install Selenium, open the terminal and run the following command:
pip install selenium
YouTube is a great example of content rendered using JavaScript. When you visit any channel, only a limited number of videos load initially, with more videos appearing as you scroll down. Here, you scrape data for the top hundred videos from the freeCodeCamp.org YouTube channel by emulating keyboard presses to scroll the page.
To begin, open Dev Tools and inspect the HTML code of the web page. The following snippet contains the elements responsible for displaying the video title and link:
<a id="video-title-link" class="yt-simple-endpoint focus-on-expand style-scope ytd-rich-grid-media" href="/watch?v=i740xlsqxEM">
<yt-formatted-string id="video-title" class="style-scope ytd-rich-grid-media">GitHub Advanced Security Certification – Pass the Exam!
</yt-formatted-string></a>
The video title is inside the yt-formatted-string tag with the ID video-title, and the video link is located in the href attribute of the <a> tag with the ID video-title-link.
Once you identify what you want to scrape, create a new file named dynamic_scrape.py and add the following code, which imports all the modules required for the script:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from lxml import html
from time import sleep
import json
Here, you begin by importing webdriver from selenium, which creates a browser instance that you can control programmatically. The next lines import By and Keys, which are used to select elements on the page and send keystrokes to them. The sleep function is imported to pause program execution while the JavaScript renders content on the page.
With all the imports sorted out, you can define the driver instance for the browser of your choice. This tutorial uses Chrome, but Selenium also supports Edge, Firefox, and Safari.
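For instance, if you prefer Firefox, only the driver line changes. The following one-liner is a sketch that assumes Firefox is installed; recent Selenium releases download the matching driver automatically via Selenium Manager:

# Use Firefox instead of Chrome; the rest of the script stays the same
driver = webdriver.Firefox()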
To define the driver instance for the browser, append the script with the following code:
URL = "https://www.youtube.com/@freecodecamp/videos"
videos = []
driver = webdriver.Chrome()
driver.get(URL)
sleep(3)
Similar to the previous script, you declare a URL variable containing the URL that you want to scrape and a videos variable that stores all the data as a list. Next, a driver variable is declared (i.e., a Chrome instance) that you use to interact with the browser. The get() function opens the browser instance and sends a request to the specified URL. After that, you call the sleep function to wait three seconds before accessing any element on the web page, making sure all the HTML code gets loaded in the browser.
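A fixed sleep is the simplest approach, but it either wastes time or can still be too short on a slow connection. As an optional alternative (a sketch that isn’t used in the rest of this tutorial), Selenium’s explicit waits block only until a specific element appears:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the first video title link to be present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "video-title-link"))
)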
As mentioned before, YouTube uses JavaScript to load more videos as you scroll to the bottom of the page. To scrape data from a hundred videos, you must programmatically scroll to the bottom of the page after opening the browser. You can do this by adding the following code to your script:
parent = driver.find_element(By.TAG_NAME, 'html')
for i in range(4):
    parent.send_keys(Keys.END)
    sleep(3)
In this code, the <html> tag is selected using the find_element function, which returns the first element matching the given criteria (in this case, the html tag). The send_keys method simulates pressing the END key to scroll to the bottom of the page, triggering more videos to load. This action is repeated four times within a for loop to ensure enough videos are loaded, and the sleep function pauses for three seconds after each scroll to allow the videos to load before scrolling again.
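An equivalent way to trigger the same loading behavior (shown here only as a sketch) is to scroll with JavaScript instead of keystrokes:

# Alternative: scroll to the bottom of the page with JavaScript instead of the END key
for i in range(4):
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    sleep(3)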
Now that you have all the data needed to begin the scraping process, it’s time to use lxml with cssselect to select the elements you want to extract:
html_data = html.fromstring(driver.page_source)
videos_html = html_data.cssselect("a#video-title-link")

for video in videos_html:
    title = video.text_content()
    link = "https://www.youtube.com" + video.get("href")
    videos.append({"title": title, "link": link})
In this code, you pass the HTML content from the driver’s page_source attribute to the fromstring method, which builds a hierarchical tree of the HTML. Then, you select all <a> tags with the ID video-title-link using a CSS selector, where the # sign indicates selection by ID. This selection returns a list of elements that satisfy the given criteria. The code then iterates over each element to extract the title and link: the text_content method retrieves the inner text (the video title), while the get method fetches the href attribute value (the video link). Finally, the data is stored in a list called videos.
At this point, you’re done with the scraping process. The next step involves storing this scraped data locally in your system. To store the data, append the following code in the script:
with open('videos.json', 'w') as file:
    json.dump(videos, file)

driver.close()
Here, you create a videos.json file and use the json.dump method to serialize the videos list into JSON format and write it to the file object. Finally, you call the close method on the driver object to close the browser window.
Now, you can test your script by opening the terminal and running the following command:
python dynamic_scrape.py
After running the script, a new file named videos.json is created in your directory.
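Each entry pairs a video title with its full URL. For example, the element inspected earlier would be stored roughly as follows (an excerpt, pretty-printed here for readability):

[
  {
    "title": "GitHub Advanced Security Certification – Pass the Exam!",
    "link": "https://www.youtube.com/watch?v=i740xlsqxEM"
  },
  ...
]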
All the code for this script is also available on GitHub.
Using lxml with Bright Data Proxy
While web scraping is a great technique for automating data collection from various sources, the process isn’t without its challenges. You have to deal with anti-scraping tools implemented by websites, rate-limiting, geoblocking, and a lack of anonymity. Proxy servers can help with these issues by acting as intermediaries that mask the user’s IP address, allowing scrapers to bypass restrictions and access targeted data without being detected. Bright Data is a top choice for reliable proxy services.
The following example highlights how easy it is to work with Bright Data proxies. It involves making some changes to the static_scrape.py file from the first example to scrape the Books to Scrape website.
To start, you need to obtain proxies from Bright Data by signing up for a free trial, which provides $5 USD worth of proxy resources. After creating a Bright Data account, you’ll land on your dashboard.
Navigate to the My Zones option and create a new residential proxy zone. Creating a new zone reveals your proxy username, password, and host, which you need in the next step.
Open the static_scrape.py file and add the following code below the URL variable:
URL = "https://books.toscrape.com/"
# new
username = ""
password = ""
hostname = ""
proxies = {
"http": f"https://{username}:{password}@{hostname}",
"https": f"https://{username}:{password}@{hostname}",
}
content = requests.get(URL, proxies=proxies).text
Replace the username, password, and hostname placeholders with your proxy credentials. This code instructs the requests library to route the request through the specified proxy. The rest of your script remains unchanged.
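To confirm that traffic is actually routed through the proxy, you can optionally compare your exit IP with and without it. This quick check is just an aside and uses httpbin.org, a public service that echoes the caller’s IP address:

# Optional check: compare the exit IP with and without the proxy
print(requests.get("https://httpbin.org/ip").json())                    # your own IP
print(requests.get("https://httpbin.org/ip", proxies=proxies).json())   # the proxy's IP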
Test your script by running the following command:
python static_scrape.py
After running this script, you’ll see a similar output to what you received in the previous example.
You can view this entire script on GitHub.
Conclusion
lxml is a robust and easy-to-use tool for extracting data from HTML documents. It is optimized for speed and supports both XPath and CSS selectors, allowing for efficient parsing of even large XML and HTML documents.
In this tutorial, you learned all about web scraping with lxml and scraping both dynamic and static content. You also learned how using Bright Data proxy servers can help you bypass restrictions imposed by websites against scrapers.
Bright Data is a one-stop solution for all your web scraping projects. It offers features like proxies, scraping browsers, and CAPTCHA solving that enable users to effectively tackle web scraping challenges. Bright Data also offers an in-depth blog with tutorials and best practices related to web scraping.
Interested in starting? Sign up now and test our products for free!
No credit card required