Web scraping is a technique that you can use to extract data from web pages. It’s especially useful when the target website doesn’t offer an API, the API can’t be used, or it doesn’t return the exact data that you want.
Regex, short for “regular expression,” is a powerful pattern-matching grammar for extracting data from text. Because a regex describes a pattern that can be matched against any text, it’s widely used in web scraping.
In this article, you’ll learn how to use regex in Python for web scraping. By the end of the article, you’ll know how to scrape static and dynamic sites, and you’ll have an understanding of some of the limitations you might face.
What Is Regex?
A regular expression is defined using tokens that match a particular pattern. Describing all the tokens in detail is outside the scope of this article, but the following table lists a few common tokens you’re likely to encounter:
| Token | Matches |
| --- | --- |
| Any non-special character | The character given |
| ^ | Start of a string |
| $ | End of a string |
| . | Any character except \n |
| * | Zero or more occurrences of the previous element |
| ? | Zero or one occurrence of the previous element |
| + | One or more occurrences of the previous element |
| {n} | Exactly n occurrences of the previous element |
| \d | Any digit |
| \s | Any whitespace character |
| \w | Any word character |
| \D | Inverse of \d |
| \S | Inverse of \s |
| \W | Inverse of \w |
To learn more about regex and get some hands-on experience, visit regexr.com. Additionally, this article shares some important tips for optimizing your regex performance.
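To see a few of these tokens in action, here’s a short Python snippet you can run as-is (the sample string and patterns are purely illustrative):

import re

text = "Order #42 shipped on 2023-04-01 to Alice"

# \d+ matches one or more digits
print(re.findall(r"\d+", text))  # ['42', '2023', '04', '01']

# \d{4}-\d{2}-\d{2} matches a date-like pattern
print(re.findall(r"\d{4}-\d{2}-\d{2}", text))  # ['2023-04-01']

# ^Order matches only at the start of the string
print(bool(re.match(r"^Order", text)))  # True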
Using Regex in Python for Web Scraping
In this tutorial, you’ll build a simple web scraper in Python using regex to extract data from web pages.
To start, create a directory for your project:
mkdir web_scraping_with_regex
cd web_scraping_with_regex
Then create a Python virtual environment:
python -m venv venv
And activate it (on Linux or macOS; on Windows, run venv\Scripts\activate instead):
source ./venv/bin/activate
To write the web scraper, you need to install two libraries:

- requests for fetching web pages
- beautifulsoup4 for parsing the HTML content and finding elements
Run the following command to install the libraries:
pip install beautifulsoup4 requests
Note: Before you scrape any website, check its terms and conditions to confirm that scraping is allowed. If it’s forbidden, don’t scrape the site.
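Many sites also publish a robots.txt file that states which paths automated clients may fetch. As a quick courtesy check, you can query it with Python’s built-in urllib.robotparser module (a minimal sketch, pointed at the demo site used below):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://books.toscrape.com/robots.txt")
rp.read()

# can_fetch() reports whether the given user agent may fetch the URL
print(rp.can_fetch("*", "https://books.toscrape.com/"))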
Scraping an E-commerce Site
In this section, you’ll build a web scraper to scrape a simple dummy e-commerce site. You’ll scrape the first page and extract the titles and prices of the books.
To do so, create a file named scraper.py and import the required modules:
import requests
from bs4 import BeautifulSoup
import re
Note: The re module is a built-in Python module for working with regex.
Next, make a GET request to the target web page to fetch its HTML content:
page = requests.get('https://books.toscrape.com/')
Pass this data to Beautiful Soup, which parses the HTML structure of the web page:
soup = BeautifulSoup(page.content, 'html.parser')
To figure out how the elements are structured in the HTML, use your browser’s Inspect Element tool. Open the web page in the browser and press Ctrl + Shift + I to open the inspector. You’ll see that the products are stored in li elements with the class col-xs-6 col-sm-4 col-md-3 col-lg-3, the book titles can be read from the title attribute of the a elements, and the prices are stored in p elements with the class price_color.
Use the find_all method of Beautiful Soup to find all li elements with the class col-xs-6 col-sm-4 col-md-3 col-lg-3:
books = soup.find_all("li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
content = str(books)
The content variable now holds the HTML text of the li elements, and you can use regex to extract the titles and prices.
The first step is to construct a regex that matches the titles and prices in the text. For that, use Inspect Element again.
Observe that the book titles are stored in the title attribute of the a elements, which look like this:
<a href="..." title="...">
To match the contents of the double quotes after title, use the classic .*? regex. The . matches any single character (except a newline), the * matches zero or more occurrences of the preceding element (in this case, whatever is matched by .), and the ? makes the * lazy (non-greedy) so that it matches as few characters as possible. Together, they match the contents of the double quotes in this complete expression:
<a href=".*?" title="(.*?)"
The parentheses around the .*? create a capturing group. Capturing groups remember information about the pattern match and, in complicated expressions, can be used to identify and refer back to already matched patterns. In this case, however, the capturing group simply extracts the matched text. Without it, the text would still match, but you wouldn’t be able to access the matched portion.
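To make the difference concrete, here’s a quick sketch on a hard-coded a element of the same shape (the href and title values are made up):

import re

html = '<a href="catalogue/sample_1/index.html" title="A Light in the Attic">'

# With a capturing group, findall returns just the group's contents
print(re.findall(r'<a href=".*?" title="(.*?)"', html))
# ['A Light in the Attic']

# Without a capturing group, findall returns the entire match
print(re.findall(r'<a href=".*?" title=".*?"', html))
# ['<a href="catalogue/sample_1/index.html" title="A Light in the Attic"']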
To extract the price, use the same (.*?) regex. The prices are stored in p elements with the class price_color, so the complete regex is <p class="price_color">(.*?)</p>.
Define the two patterns:
re_book_title = r'<a href=".*?" title="(.*?)"'
re_prices = r'<p class="price_color">(.*?)</p>'
Note: In case you’re wondering why the ? is needed after .*, this Stack Overflow answer explains the role of ? well.
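In short, .* is greedy and consumes as much text as possible, while the trailing ? makes it lazy so it stops at the first closing quote. A quick illustration on a made-up string:

import re

html = '<a href="a.html" title="First"> <a href="b.html" title="Second">'

# Greedy: .* runs to the last closing quote in the string
print(re.findall(r'title="(.*)"', html))
# ['First"> <a href="b.html" title="Second']

# Lazy: .*? stops at the first closing quote
print(re.findall(r'title="(.*?)"', html))
# ['First', 'Second']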
Now you can use re.findall() to find all the regex matches in the HTML string:
titles = re.findall(re_book_title, content)
prices = re.findall(re_prices, content)
Finally, iterate over the matches and print the results:
for i in zip(titles, prices):
    print(f"{i[0]}: {i[1]}")
You can run this code with python scraper.py. The output looks like this:
A Light in the Attic: £51.77
Tipping the Velvet: £53.74
Soumission: £50.10
Sharp Objects: £47.82
Sapiens: A Brief History of Humankind: £54.23
The Requiem Red: £22.65
The Dirty Little Secrets of Getting Your Dream Job: £33.34
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull: £17.93
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics: £22.60
The Black Maria: £52.15
Starving Hearts (Triangular Trade Trilogy, #1): £13.99
Shakespeare's Sonnets: £20.66
Set Me Free: £17.46
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1): £52.29
Rip it Up and Start Again: £35.02
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991: £57.25
Olio: £23.88
Mesaerion: The Best Science Fiction Stories 1800-1849: £37.59
Libertarianism for Beginners: £51.33
It's Only the Himalayas: £45.17
Scraping a Wikipedia Page
Now, let’s build a scraper that can scrape a Wikipedia page and extract information about all the links.
Create a new file named wiki_scraper.py. Just like before, start by importing the libraries, making a GET request, and parsing the content:
import requests
from bs4 import BeautifulSoup
import re
page = requests.get('https://en.wikipedia.org/wiki/Web_scraping')
soup = BeautifulSoup(page.content, 'html.parser')
To find all the links, use the find_all() method:
links = soup.find_all("a")
content = str(links)
The link texts are stored in the title attribute, and the link URLs are stored in the href attribute. You can use the same (.*?) regex to extract the information. The complete expression looks like this:
<a href="(.*?)" title="(.*?)">.*?</a>
Note that the third .*? is not in a capturing group because you aren’t interested in the content of the a tags.
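Also note that because this pattern contains two capturing groups, re.findall() returns a list of (href, title) tuples, which is why the loop below unpacks i[0] and i[1]. A quick sketch on a made-up link:

import re

html = '<a href="/wiki/Data_scraping" title="Data scraping">Data scraping</a>'

# Two capturing groups -> findall returns one (href, title) tuple per match
print(re.findall(r'<a href="(.*?)" title="(.*?)">.*?</a>', html))
# [('/wiki/Data_scraping', 'Data scraping')]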
As before, use findall() to find all the matches and print the result:
re_links = r'<a href="(.*?)" title="(.*?)">.*?</a>'
links = re.findall(re_links, content)
for i in links:
    print(f"{i[0]} => {i[1]}")
When you run this with python wiki_scraper.py, you get the following output:
OUTPUT TRUNCATED FOR BREVITY
/wiki/Category:Web_scraping => Category:Web scraping
/wiki/Category:CS1_maint:_multiple_names:_authors_list => Category:CS1 maint: multiple names: authors list
/wiki/Category:CS1_Danish-language_sources_(da) => Category:CS1 Danish-language sources (da)
/wiki/Category:CS1_French-language_sources_(fr) => Category:CS1 French-language sources (fr)
/wiki/Category:Articles_with_short_description => Category:Articles with short description
/wiki/Category:Short_description_matches_Wikidata => Category:Short description matches Wikidata
/wiki/Category:Articles_needing_additional_references_from_April_2023 => Category:Articles needing additional references from April 2023
/wiki/Category:All_articles_needing_additional_references => Category:All articles needing additional references
/wiki/Category:Articles_with_limited_geographic_scope_from_October_2015 => Category:Articles with limited geographic scope from October 2015
/wiki/Category:United_States-centric => Category:United States-centric
/wiki/Category:All_articles_with_unsourced_statements => Category:All articles with unsourced statements
/wiki/Category:Articles_with_unsourced_statements_from_April_2023 => Category:Articles with unsourced statements from April 2023
Scraping a Dynamic Site
So far, all the web pages you’ve scraped were static. Scraping dynamic web pages is a little more difficult because it requires a browser automation tool like Selenium. The following example uses Selenium and regex to scrape the current temperature from the OpenWeatherMap page for London:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
driver = webdriver.Firefox()
driver.get("https://openweathermap.org/city/2643743")
elem = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".current-temp"))
)
content = elem.get_attribute('innerHTML')
re_temp = r'<span .*?>(.*?)</span>'
temp = re.findall(re_temp, content)
print(repr(temp))
driver.close()
This code uses Selenium to launch an instance of Firefox and uses a CSS selector to locate the element that holds the current temperature. It then uses the regex <span .*?>(.*?)</span> to extract the temperature.
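Note that this opens a visible Firefox window. If you’d rather run the scraper without a window, for example on a server, you can pass Firefox options to run headless (a sketch, assuming Selenium 4):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Launch Firefox without opening a visible window
options = Options()
options.add_argument("-headless")
driver = webdriver.Firefox(options=options)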
If you’re looking for even more information to help you get started with scraping dynamic web pages with Selenium, check out this tutorial.
Limitations of Regex for Web Scraping
Regular expressions are a powerful tool for pattern matching and extracting information from text. Developers often learn regex and try to use it for web scraping. However, regex by itself is not well suited to web scraping. Regex works on plain text and has no understanding of HTML structure, which means the results depend heavily on how the HTML is written. For instance, in the Wikipedia example, you might have noticed that some links were not extracted correctly.
If you edit the Python code and add print(content) to print the HTML string returned by Beautiful Soup, you’ll see that the culprit a element looks like this:
<a href="#cite_ref-9">^</a>
Here, the title attribute is missing, but the regex assumed the structure <a href="(.*?)" title="(.*?)">.*?</a>. Because regex has no notion of HTML elements, instead of throwing an error or abandoning the match, the .*? pattern went on blindly consuming characters until it could match " title="(.*?)">.*?</a> and complete the pattern. It ended up gobbling the next few a tags, showing that regex can have unintended effects when the HTML is written in an unexpected way.
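You can reproduce this over-matching on a tiny hard-coded snippet (hypothetical links, shaped like the Wikipedia HTML):

import re

html = ('<a href="#cite_ref-9">^</a> '
        '<a href="/wiki/Data_scraping" title="Data scraping">Data scraping</a>')

# The first a tag has no title attribute, so .*? keeps consuming
# characters (including the next tag) until the pattern can complete
print(re.findall(r'<a href="(.*?)" title="(.*?)">.*?</a>', html))
# [('#cite_ref-9">^</a> <a href="/wiki/Data_scraping', 'Data scraping')]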
Additionally, HTML is not a regular language, which means regex alone can’t parse arbitrary HTML. This Stack Overflow answer is a cult classic among developers for taking a jab at attempts to parse HTML with regex. However, there are a few situations where you can use regex to parse and scrape HTML data.
For instance, if you have a known, limited set of HTML and you’re fully aware of how it’s structured, you can use regex. For example, if you know that all the a tags in the HTML have href and title attributes and conform to a fixed pattern, regex can extract the information reliably. However, a better and more robust solution is to use an HTML parser like Beautiful Soup to find elements and extract textual data from them.
Once you’ve extracted textual data, you can use regex to process it further. For example, here’s a modified version of the Wikipedia scraper that uses Beautiful Soup to extract the href and title attributes and then uses regex to keep only the links whose titles start with an alphanumeric character:
import requests
from bs4 import BeautifulSoup
import re
page = requests.get('https://en.wikipedia.org/wiki/Web_scraping')
soup = BeautifulSoup(page.content, 'html.parser')
links = soup.find_all("a")
for link in links:
    href = link.get('href')
    title = link.get('title')
    # Fall back to the link text when there's no title attribute
    if title is None:
        title = link.string
    if title is None:
        continue
    # Keep only titles that start with an alphanumeric character
    pattern = r"[a-zA-Z0-9]"
    if re.match(pattern, title):
        print(f"{href} => {title}")
Conclusion
Regex is a powerful tool for finding patterns in textual data. Thanks to its flexibility, it’s often used in web scraping to extract information.
In this article, you learned what regex is and how to use it with Beautiful Soup to scrape e-commerce websites, Wikipedia, and dynamic web pages. You also learned about some of the limitations of regex and how to best use it in conjunction with another tool.
Even if you make full use of regex, web scraping is full of challenges. Repeated scraping can get your scraper’s IP address blocked, and CAPTCHAs can prevent it from working correctly. Bright Data offers powerful proxies that can circumvent IP bans. Its worldwide proxy network includes data center, residential, ISP, and mobile proxies. With the Web Unlocker, you can bypass bot detection and solve CAPTCHAs without any hassle. Start a free trial today!