Perl is one of the most popular languages, and thanks to its extensive module collection, it’s a great choice for writing web scrapers.
In this article, we will discuss the following:
- How to web scrape with Perl using the following methods:
  - `LWP::UserAgent` and `HTML::TreeBuilder`
  - `Web::Scraper`
  - `Mojo::UserAgent` and `Mojo::DOM`
  - `XML::LibXML`
- Challenges of web scraping with Perl
- Conclusion
Web Scraping with Perl
To follow along with the article, make sure you have the latest version of Perl installed. The code in this article was tested with Perl 5.38.2. This article also assumes that you know how to install Perl modules using `cpanm`.
In this article, you’ll scrape the Quotes to Scrape website to extract the quotes. Before you can scrape data from the website, you need to understand how its HTML is structured. Open the website in the browser and press Ctrl + Shift + I (Windows) or Cmd + Option + I (Mac) to open the browser’s developer tools.
When you inspect the elements, you can see that each quote is stored in a `div` with the class `quote`. Each quote contains a `span` with class `text` and a `small` element, which store the quote text and the author’s name, respectively:
Using LWP::UserAgent and HTML::TreeBuilder
`LWP::UserAgent` is part of `LWP`, a group of modules for interacting with the web. The `LWP::UserAgent` module can be used to make an HTTP request to a web page and return the HTML content. You can then use the `HTML::TreeBuilder` module from `HTML::Tree` to parse the HTML and extract information.
To use `LWP::UserAgent` and `HTML::TreeBuilder`, install the modules with the following commands:
cpanm Bundle::LWP
cpanm HTML::Tree
Create a file named `lwp-and-tree-builder.pl`. This is where you’ll write the code. Then paste the following two lines in that file:
use LWP::UserAgent;
use HTML::TreeBuilder;
This code instructs the Perl interpreter to include the `LWP::UserAgent` and `HTML::TreeBuilder` modules.
Define an instance of `LWP::UserAgent` and set the `User-Agent` header to `Quotes Scraper`:
my $ua = LWP::UserAgent->new;
$ua->agent("Quotes Scraper");
Define the URL of the target website and create an instance of `HTML::TreeBuilder`:
my $url = "https://quotes.toscrape.com/";
my $root = HTML::TreeBuilder->new();
Now you can make the HTTP request:
my $request = $ua->get($url) or die "An error occurred $!\n";
Paste the following if-else statement that checks whether the request was successful or not:
if ($request->is_success) {
} else {
    print "Cannot parse the result. " . $request->status_line . "\n";
}
If the request is successful, you can start scraping.
Use the `parse` method of `HTML::TreeBuilder` to parse the HTML response. Paste the following code inside the `if` block:
$root->parse($request->content);
Now, use the `look_down` method to find the `div` elements with the class `quote`:
my @quotes = $root->look_down(
    _tag  => 'div',
    class => 'quote'
);
Iterate over the array of quotes, use `look_down` to find the text and the author, and print them:
foreach my $quote (@quotes) {
    my $text = $quote->look_down(
        _tag  => 'span',
        class => 'text'
    )->as_text;
    my $author = $quote->look_down(
        _tag  => 'small',
        class => 'author'
    )->as_text;
    print "$text: $author\n";
}
The complete code looks like this:
use LWP::UserAgent;
use HTML::TreeBuilder;

my $ua = LWP::UserAgent->new;
$ua->agent("Quotes Scraper");

my $url = "https://quotes.toscrape.com/";
my $root = HTML::TreeBuilder->new();

my $request = $ua->get($url) or die "An error occurred $!\n";

if ($request->is_success) {
    $root->parse($request->content);
    my @quotes = $root->look_down(
        _tag  => 'div',
        class => 'quote'
    );
    foreach my $quote (@quotes) {
        my $text = $quote->look_down(
            _tag  => 'span',
            class => 'text'
        )->as_text;
        my $author = $quote->look_down(
            _tag  => 'small',
            class => 'author'
        )->as_text;
        print "$text: $author\n";
    }
} else {
    print "Cannot parse the result. " . $request->status_line . "\n";
}
Run this code with `perl lwp-and-tree-builder.pl`, and you should see the following output:
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.": Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities.": J.K. Rowling
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.": Albert Einstein
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.": Jane Austen
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.": Marilyn Monroe
"Try not to become a man of success. Rather become a man of value.": Albert Einstein
"It is better to be hated for what you are than to be loved for what you are not.": André Gide
"I have not failed. I've just found 10,000 ways that won't work.": Thomas A. Edison
"A woman is like a tea bag; you never know how strong it is until it's in hot water.": Eleanor Roosevelt
"A day without sunshine is like, you know, night.": Steve Martin
Using Web::Scraper
`Web::Scraper` is a web scraping library inspired by Ruby’s ScrAPI. It provides a domain-specific language (DSL) for scraping HTML and XML documents. Check this article to learn more about web scraping with Ruby.
To use `Web::Scraper`, install the module with `cpanm Web::Scraper`.
Create a new file named `web-scraper.pl` and include the following required modules:
use URI;
use Web::Scraper;
use Encode;
Next, you need to define a `scraper` block using the module’s DSL. The DSL makes it easy to define a scraper in only a few lines. Start by defining a `scraper` block named `$quotes`:
my $quotes = scraper {
};
The `scraper` method defines the logic of the scraper, which is executed when the `scrape` method is called later. Inside the `scraper` block, you use the `process` method to find elements using CSS selectors and execute a function.
Start by finding all the `div` elements with the `quote` class:
# Parse all `div` with class `quote`
process 'div.quote', "quotes[]" => scraper {
};
This code finds all the `div` elements with the `quote` class and stores them in the `quotes` array. For each element, it runs the `scraper` method, which you define using the following:
# And, in each div, find `span` with class `text`
process_first "span.text", text => 'TEXT';
# get `small` with class `author`
process_first "small", author => 'TEXT';
The `process_first` method finds the first element matching the CSS selector. Here, you’re finding the first `span` element with the `text` class, extracting its text, and storing it in the `text` key. For the author name, you’re finding the first `small` element and extracting its text to store in the `author` key.
The complete `scraper` block looks like this:
my $quotes = scraper {
    # Parse all `div` with class `quote`
    process 'div.quote', "quotes[]" => scraper {
        # And, in each div, find `span` with class `text`
        process_first "span.text", text => 'TEXT';
        # get `small` with class `author`
        process_first "small", author => 'TEXT';
    };
};
Now, call the `scrape` method and pass the URL to start the scraping:
my $res = $quotes->scrape( URI->new("https://quotes.toscrape.com/") );
Finally, iterate over the `quotes` array and print the result:
# iterate over the array
for my $quote (@{$res->{quotes}}) {
    print Encode::encode("utf8", "$quote->{text}: $quote->{author}\n");
}
The complete code looks like this:
use URI;
use Web::Scraper;
use Encode;
my $quotes = scraper {
    # Parse all `div` with class `quote`
    process 'div.quote', "quotes[]" => scraper {
        # And, in each div, find `span` with class `text`
        process_first "span.text", text => 'TEXT';
        # get `small` with class `author`
        process_first "small", author => 'TEXT';
    };
};

my $res = $quotes->scrape( URI->new("https://quotes.toscrape.com/") );

# iterate over the array
for my $quote (@{$res->{quotes}}) {
    print Encode::encode("utf8", "$quote->{text}: $quote->{author}\n");
}
Run the previous code with `perl web-scraper.pl`, and you should get the following output:
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.": Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities.": J.K. Rowling
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.": Albert Einstein
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.": Jane Austen
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.": Marilyn Monroe
"Try not to become a man of success. Rather become a man of value.": Albert Einstein
"It is better to be hated for what you are than to be loved for what you are not.": André Gide
"I have not failed. I've just found 10,000 ways that won't work.": Thomas A. Edison
"A woman is like a tea bag; you never know how strong it is until it's in hot water.": Eleanor Roosevelt
"A day without sunshine is like, you know, night.": Steve Martin
Using Mojo::UserAgent and Mojo::DOM
`Mojo::UserAgent` and `Mojo::DOM` are part of Mojolicious, a real-time web framework for Perl. In terms of functionality, they’re similar to `LWP::UserAgent` and `HTML::TreeBuilder`.
To use `Mojo::UserAgent` and `Mojo::DOM`, install the modules using the following commands:
cpanm Mojo::UserAgent
cpanm Mojo::DOM
Create a new file named `mojo.pl` and include the `Mojo::UserAgent` and `Mojo::DOM` modules:
use Mojo::UserAgent;
use Mojo::DOM;
Define an instance of `Mojo::UserAgent` and make the HTTP request:
my $ua = Mojo::UserAgent->new;
my $res = $ua->get('https://quotes.toscrape.com/')->result;
Similar to `LWP::UserAgent`, use the following if-else block to check whether the request was successful:
if ($res->is_success) {
} else {
    print "Cannot parse the result. " . $res->message . "\n";
}
In the `if` block, initialize an instance of `Mojo::DOM`:
my $dom = Mojo::DOM->new($res->body);
Use the `find` method to find all the `div` elements with the `quote` class:
my @quotes = $dom->find('div.quote')->each;
Iterate over the `quotes` array and extract the text and author names:
foreach my $quote (@quotes) {
    my $text = $quote->find('span.text')->map('text')->join;
    my $author = $quote->find('small.author')->map('text')->join;
    print "$text: $author\n";
}
The following is the full code:
use Mojo::UserAgent;
use Mojo::DOM;
my $ua = Mojo::UserAgent->new;
my $res = $ua->get('https://quotes.toscrape.com/')->result;
if ($res->is_success) {
    my $dom = Mojo::DOM->new($res->body);
    my @quotes = $dom->find('div.quote')->each;
    foreach my $quote (@quotes) {
        my $text = $quote->find('span.text')->map('text')->join;
        my $author = $quote->find('small.author')->map('text')->join;
        print "$text: $author\n";
    }
} else {
    print "Cannot parse the result. " . $res->message . "\n";
}
Run this code with `perl mojo.pl`, and you should get the following output:
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.": Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities.": J.K. Rowling
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.": Albert Einstein
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.": Jane Austen
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.": Marilyn Monroe
"Try not to become a man of success. Rather become a man of value.": Albert Einstein
"It is better to be hated for what you are than to be loved for what you are not.": André Gide
"I have not failed. I've just found 10,000 ways that won't work.": Thomas A. Edison
"A woman is like a tea bag; you never know how strong it is until it's in hot water.": Eleanor Roosevelt
"A day without sunshine is like, you know, night.": Steve Martin
Using XML::LibXML
The Perl module `XML::LibXML` is a wrapper around the `libxml2` library. It provides a powerful XHTML parser with XPath capabilities.
Use `cpanm` to install the module:
cpanm XML::LibXML
Then create a new file named `xml-libxml.pl`. As is the case with `HTML::TreeBuilder`, you need a library like `LWP::UserAgent` to make the HTTP request to the website and fetch the HTML content, which you then pass to `XML::LibXML`.
Paste the following code, which sets up the `LWP::UserAgent` module and fetches the HTML content of the web page:
use LWP::UserAgent;
use XML::LibXML;
use open qw( :std :encoding(UTF-8) );
my $ua = LWP::UserAgent->new;
$ua->agent("Quotes Scraper");
my $url = "https://quotes.toscrape.com/";
my $request = $ua->get($url) or die "An error occurred $!\n";
if ($request->is_success) {
} else {
    print "Cannot parse the result. " . $request->status_line . "\n";
}
Inside the `if` block, start by parsing the HTML document using the `load_html` method:
my $dom = XML::LibXML->load_html(string => $request->content, recover => 1, suppress_errors => 1);
The `recover` option tells the parser to continue parsing the HTML in case of an error, and the `suppress_errors` option stops the parser from printing HTML parsing errors to the console. Since HTML documents are not as strictly validated as XHTML documents, you’re likely to encounter nonfatal parsing errors. These options keep the code working when those errors occur.
Once the HTML is parsed, you can use the `findnodes` method to find elements based on an XPath expression:
my $xpath = '//div[@class="quote"]';
foreach my $quote ($dom->findnodes($xpath)) {
    my ($text) = $quote->findnodes('.//span[@class="text"]')->to_literal_list;
    my ($author) = $quote->findnodes('.//small[@class="author"]')->to_literal_list;
    print "$text: $author\n";
}
The full code looks like this:
use LWP::UserAgent;
use XML::LibXML;
use open qw( :std :encoding(UTF-8) );
my $ua = LWP::UserAgent->new;
$ua->agent("Quotes Scraper");
my $url = "https://quotes.toscrape.com/";
my $request = $ua->get($url) or die "An error occurred $!\n";
if ($request->is_success) {
    my $dom = XML::LibXML->load_html(string => $request->content, recover => 1, suppress_errors => 1);
    my $xpath = '//div[@class="quote"]';
    foreach my $quote ($dom->findnodes($xpath)) {
        my ($text) = $quote->findnodes('.//span[@class="text"]')->to_literal_list;
        my ($author) = $quote->findnodes('.//small[@class="author"]')->to_literal_list;
        print "$text: $author\n";
    }
} else {
    print "Cannot parse the result. " . $request->status_line . "\n";
}
Run the code with `perl xml-libxml.pl`, and you should see the following output:
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.": Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities.": J.K. Rowling
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.": Albert Einstein
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.": Jane Austen
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.": Marilyn Monroe
"Try not to become a man of success. Rather become a man of value.": Albert Einstein
"It is better to be hated for what you are than to be loved for what you are not.": André Gide
"I have not failed. I've just found 10,000 ways that won't work.": Thomas A. Edison
"A woman is like a tea bag; you never know how strong it is until it's in hot water.": Eleanor Roosevelt
"A day without sunshine is like, you know, night.": Steve Martin
You can find all the code for this tutorial in this GitHub repo.
Challenges of Web Scraping in Perl
Although Perl makes it easy to scrape web pages using its powerful modules, developers often run into some common problems that can slow down or completely hinder web scraping. The following are a few of the challenges that you are likely to face.
Dealing with Pagination
Websites that deal with a large volume of data often don’t send all the data at once. Usually, the data is spread across multiple pages, and you need to handle the pagination to ensure you extract all of it. There are two steps to handling pagination:
- Check whether other pages exist. Typically, you can look for a Next page button on the page, or you can try to load the next page and look for an error.
- If other pages exist, load the next page and scrape it.
For static websites, where each page has its own URL, you can run a loop and load new pages by incrementing the page number parameter in the URL. Or, if you’re using a module like `WWW::Mechanize`, you can simply follow the Next page URL.
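For instance, the URL-increment approach can be sketched like this (this sketch assumes the site exposes pages at `/page/N/`, as Quotes to Scrape does, and that an out-of-range page contains no `quote` elements):

```perl
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Quotes to Scrape exposes pages at /page/1/, /page/2/, and so on
my $page = 1;
while (1) {
    my $res = $ua->get("https://quotes.toscrape.com/page/$page/");
    # Stop on an HTTP error or when the page has no quotes left
    last unless $res->is_success && $res->content =~ /class="quote"/;
    # ... parse $res->content with HTML::TreeBuilder as shown earlier ...
    $page++;
}
```

Note the content check: some sites (including Quotes to Scrape) return a normal 200 response for out-of-range pages, so checking `is_success` alone would never terminate the loop.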
Here’s the quotes scraper modified to handle pagination using `WWW::Mechanize`. Note the use of `follow_link`:
use WWW::Mechanize ();
use HTML::TreeBuilder;
use open qw( :std :encoding(UTF-8) );

my $mech = WWW::Mechanize->new();
my $url = "https://quotes.toscrape.com/";
$mech->get($url);

while (1) {
    # Parse the current page with a fresh tree
    my $root = HTML::TreeBuilder->new();
    $root->parse($mech->content);
    my @quotes = $root->look_down(
        _tag  => 'div',
        class => 'quote'
    );
    foreach my $quote (@quotes) {
        my $text = $quote->look_down(
            _tag  => 'span',
            class => 'text'
        )->as_text;
        my $author = $quote->look_down(
            _tag  => 'small',
            class => 'author'
        )->as_text;
        print "$text: $author\n";
    }
    $root->delete;

    # Stop when there is no Next page link
    my $next_page = $mech->find_link(text_regex => qr/Next/);
    last unless $next_page;
    $mech->follow_link(url => $next_page->url);
}
To handle dynamic websites that load the next page using JavaScript, check out our guide on Scraping Dynamic Websites With Python, or continue reading.
Rotating Proxies
Proxies are commonly used by web scrapers to protect their privacy and anonymity and to evade IP address bans. Modules like `LWP::UserAgent` can be configured to route requests through a proxy. However, using a single proxy server still runs the risk of getting its IP banned, which is why it’s recommended to use multiple proxy servers and rotate them. Here’s a very simple example of how to do it using `LWP::UserAgent`.
Begin by defining an array of proxies. Then choose one at random and set it using the `proxy` method:
my @proxies = ( 'https://proxy1.com', 'https://proxy2.com', 'http://proxy3.com' );

my $ua = LWP::UserAgent->new;
my $index = int(rand @proxies);
my $proxy = $proxies[$index];
$ua->proxy(['http', 'https'], $proxy);
Now, you can send a request as usual. If the request fails, it likely indicates that the proxy has been blocked, so you can remove that proxy from the list, choose a different proxy, and try again:
if ($request->is_success) {
    # Continue with the scraping
} else {
    # Remove the proxy from the list
    splice(@proxies, $index, 1);
    # Try again
}
Handling Honeypot Traps
Honeypot traps are a common technique employed by web admins to trap bots and scrapers. Usually, they use links with the `display` property set to `none`, which makes them invisible to human users. A bot, however, can pick up and follow the link, which leads it to a decoy page and away from the real content.
To tackle this issue, check the `display` property of links before following them. The following is one way to do so using `HTML::TreeBuilder`:
my @links = $root->look_down(
    _tag => 'a',
);

foreach my $link (@links) {
    my $style = $link->attr('style');
    if (defined $style && $style =~ /display:\s*none/) {
        # Honeypot detected!
    } else {
        # Safe to proceed
    }
}
Solving CAPTCHAs
CAPTCHAs help prevent unauthorized access to a website. However, they can also prevent web scrapers from scraping web pages.
To fight CAPTCHAs, you can use a service like the Bright Data Web Unlocker, which solves the CAPTCHAs for you.
The following is an example using the Bright Data Web Unlocker to make an HTTP request:
use LWP::UserAgent;
my $agent = LWP::UserAgent->new();
$agent->proxy(['http', 'https'], "http://brd-customer-hl_6d74fc42-zone-residential_proxy4:812qoxo6po44\@brd.superproxy.io:22225");
print $agent->get('http://lumtest.com/myip.json')->content();
When you make an HTTP request using Web Unlocker, it automatically solves CAPTCHAs, evades anti-bot measures, and handles proxy management for you.
Scraping Dynamic Websites
So far, all the examples you’ve learned about here scrape static websites. However, single-page applications (SPA) and other dynamic websites need more advanced techniques.
Dynamic websites use JavaScript to load page content, which means you need scraping tools that are capable of running JavaScript. Selenium is one such tool that can emulate a browser to run dynamic websites. The following is a very small example snippet of this module in action:
use Selenium::Remote::Driver;
my $driver = Selenium::Remote::Driver->new;
$driver->get('http://example.com');
my $elem = $driver->find_element_by_id('foo');
print $elem->get_text();
$driver->quit();
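Building on that, here’s a hedged sketch of scraping a JavaScript-rendered page, the `/js/` version of Quotes to Scrape. It assumes a Selenium server is already running on `localhost:4444` with Chrome available; the implicit wait gives JavaScript-rendered elements time to appear before `find_elements` gives up:

```perl
use Selenium::Remote::Driver;

# Assumes a Selenium server is running locally on port 4444
my $driver = Selenium::Remote::Driver->new(
    remote_server_addr => 'localhost',
    port               => 4444,
    browser_name       => 'chrome',
);

# Give JavaScript-rendered elements up to 5 seconds to appear
$driver->set_implicit_wait_timeout(5000);

$driver->get('https://quotes.toscrape.com/js/');
my @quotes = $driver->find_elements('div.quote', 'css');
print $_->get_text(), "\n" for @quotes;
$driver->quit();
```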
Conclusion
Perl, thanks to its robust collection of modules, is an excellent language for web scraping. In this article, you learned how to scrape web pages in Perl using the following:
- `LWP::UserAgent` and `HTML::TreeBuilder`
- `Web::Scraper`
- `Mojo::UserAgent` and `Mojo::DOM`
- `XML::LibXML`
However, as you saw, real-world web scraping comes with many challenges, especially when website owners are determined to keep scrapers out. This article shed light on some common scenarios and how to combat them. Still, solving those challenges by yourself can be tedious and error-prone. That’s where Bright Data can help. With premium proxy services, a Scraping Browser, Web Unlocker, and the Web Scraper API, Bright Data is an all-encompassing solution for scraping the web with ease. Start a free trial today!