Perl is one of the most popular languages, and thanks to its extensive module collection, it’s a great choice for writing web scrapers.
In this article, we will discuss the following:
- How to web scrape with Perl using the following methods:
  - `LWP::UserAgent` and `HTML::TreeBuilder`
  - `Web::Scraper`
  - `Mojo::UserAgent` and `Mojo::DOM`
  - `XML::LibXML`
- Challenges of web scraping with Perl
- Conclusion
Web Scraping with Perl
To follow along with the article, make sure you have the latest version of Perl installed. The code in this article was tested with Perl 5.38.2. This article also assumes that you know how to install Perl modules using `cpanm`.
In this article, you’ll scrape the Quotes to Scrape website to extract the quotes. Before you can scrape data from the website, you need to understand how its HTML is structured. Open the website in the browser and press Ctrl + Shift + I (Windows) or Cmd + Option + I (Mac) to open the browser’s developer tools.
When you inspect the elements, you can see that each quote is stored in a `div` with the class `quote`. Each quote contains a `span` with class `text` and a `small` element, which store the quote text and the author’s name, respectively:
Using LWP::UserAgent and HTML::TreeBuilder
`LWP::UserAgent` is part of `LWP`, a group of modules for interacting with the web. The `LWP::UserAgent` module can be used to make an HTTP request to a web page and return the HTML content. You can then use the `HTML::TreeBuilder` module from `HTML::Tree` to parse the HTML and extract information.
To use `LWP::UserAgent` and `HTML::TreeBuilder`, install the modules with the following commands:
cpanm Bundle::LWP
cpanm HTML::Tree
Create a file named `lwp-and-tree-builder.pl`. This is where you’ll write the code. Then paste the following two lines in that file:
use LWP::UserAgent;
use HTML::TreeBuilder;
This code instructs the Perl interpreter to include the `LWP::UserAgent` and `HTML::TreeBuilder` modules.
Define an instance of `LWP::UserAgent` and set the `User-Agent` header to `Quotes Scraper`:
my $ua = LWP::UserAgent->new;
$ua->agent("Quotes Scraper");
Define the URL of the target website and create an instance of `HTML::TreeBuilder`:
my $url = "https://quotes.toscrape.com/";
my $root = HTML::TreeBuilder->new();
Now you can make the HTTP request:
my $request = $ua->get($url) or die "An error occurred $!\n";
Paste the following if-else statement that checks whether the request was successful or not:
if ($request->is_success) {
} else {
    print "Cannot parse the result. " . $request->status_line . "\n";
}
If the request is successful, you can start scraping.
Use the `parse` method of `HTML::TreeBuilder` to parse the HTML response. Paste the following code inside the `if` block:
$root->parse($request->content);
Now, use the `look_down` method to find the `div` elements with the class `quote`:
my @quotes = $root->look_down(
    _tag  => 'div',
    class => 'quote'
);
Iterate over the array of quotes, use `look_down` to find the text and the author, and print them:
foreach my $quote (@quotes) {
    my $text = $quote->look_down(
        _tag  => 'span',
        class => 'text'
    )->as_text;
    my $author = $quote->look_down(
        _tag  => 'small',
        class => 'author'
    )->as_text;
    print "$text: $author\n";
}
The complete code looks like this:
use LWP::UserAgent;
use HTML::TreeBuilder;

my $ua = LWP::UserAgent->new;
$ua->agent("Quotes Scraper");

my $url = "https://quotes.toscrape.com/";
my $root = HTML::TreeBuilder->new();

my $request = $ua->get($url) or die "An error occurred $!\n";

if ($request->is_success) {
    $root->parse($request->content);
    my @quotes = $root->look_down(
        _tag  => 'div',
        class => 'quote'
    );
    foreach my $quote (@quotes) {
        my $text = $quote->look_down(
            _tag  => 'span',
            class => 'text'
        )->as_text;
        my $author = $quote->look_down(
            _tag  => 'small',
            class => 'author'
        )->as_text;
        print "$text: $author\n";
    }
} else {
    print "Cannot parse the result. " . $request->status_line . "\n";
}
Run this code with `perl lwp-and-tree-builder.pl`, and you should see the following output:
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.": Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities.": J.K. Rowling
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.": Albert Einstein
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.": Jane Austen
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.": Marilyn Monroe
"Try not to become a man of success. Rather become a man of value.": Albert Einstein
"It is better to be hated for what you are than to be loved for what you are not.": André Gide
"I have not failed. I've just found 10,000 ways that won't work.": Thomas A. Edison
"A woman is like a tea bag; you never know how strong it is until it's in hot water.": Eleanor Roosevelt
"A day without sunshine is like, you know, night.": Steve Martin
Using Web::Scraper
`Web::Scraper` is a web scraping library inspired by Ruby’s ScrAPI. It provides a domain-specific language (DSL) for scraping HTML and XML documents. Check this article to learn more about web scraping with Ruby.
To use `Web::Scraper`, install the module with `cpanm Web::Scraper`.
Create a new file named `web-scraper.pl` and include the following required modules:
use URI;
use Web::Scraper;
use Encode;
Next, you need to define a `scraper` block using the module’s DSL. The DSL makes it easy to define a scraper in only a few lines. Start by defining a `scraper` block named `$quotes`:
my $quotes = scraper {
};
The `scraper` method defines the logic of the scraper, which is executed when the `scrape` method is called later. Inside the `scraper` block, you use the `process` method to find elements using CSS selectors and execute a function.
Start by finding all the `div` elements with the `quote` class:
# Parse all `div` with class `quote`
process 'div.quote', "quotes[]" => scraper {
};
This code finds all the `div` elements with the `quote` class and stores them in the `quotes` array. For each element, it runs the `scraper` method, which you define using the following:
# And, in each div, find `span` with class `text`
process_first "span.text", text => 'TEXT';
# get `small` with class `author`
process_first "small", author => 'TEXT';
The `process_first` method finds the first element matching the CSS selector. Here, you’re finding the first `span` element with the `text` class, extracting its text, and storing it in the `text` key. For the author name, you’re finding the first `small` element and extracting its text to store in the `author` key.
The complete `scraper` block looks like this:
my $quotes = scraper {
    # Parse all `div` with class `quote`
    process 'div.quote', "quotes[]" => scraper {
        # And, in each div, find `span` with class `text`
        process_first "span.text", text => 'TEXT';
        # get `small` with class `author`
        process_first "small", author => 'TEXT';
    };
};
Now, call the `scrape` method and pass the URL to start the scraping:
my $res = $quotes->scrape( URI->new("https://quotes.toscrape.com/") );
Finally, iterate over the `quotes` array and print the result:
# iterate over the array
for my $quote (@{$res->{quotes}}) {
    print Encode::encode("utf8", "$quote->{text}: $quote->{author}\n");
}
The complete code looks like this:
use URI;
use Web::Scraper;
use Encode;
my $quotes = scraper {
    # Parse all `div` with class `quote`
    process 'div.quote', "quotes[]" => scraper {
        # And, in each div, find `span` with class `text`
        process_first "span.text", text => 'TEXT';
        # get `small` with class `author`
        process_first "small", author => 'TEXT';
    };
};

my $res = $quotes->scrape( URI->new("https://quotes.toscrape.com/") );

# iterate over the array
for my $quote (@{$res->{quotes}}) {
    print Encode::encode("utf8", "$quote->{text}: $quote->{author}\n");
}
Run the previous code with `perl web-scraper.pl`, and you should get the following output:
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.": Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities.": J.K. Rowling
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.": Albert Einstein
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.": Jane Austen
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.": Marilyn Monroe
"Try not to become a man of success. Rather become a man of value.": Albert Einstein
"It is better to be hated for what you are than to be loved for what you are not.": André Gide
"I have not failed. I've just found 10,000 ways that won't work.": Thomas A. Edison
"A woman is like a tea bag; you never know how strong it is until it's in hot water.": Eleanor Roosevelt
"A day without sunshine is like, you know, night.": Steve Martin
Using Mojo::UserAgent and Mojo::DOM
`Mojo::UserAgent` and `Mojo::DOM` are part of Mojolicious, a real-time web framework for Perl. In terms of functionality, they’re similar to `LWP::UserAgent` and `HTML::TreeBuilder`.
To use `Mojo::UserAgent` and `Mojo::DOM`, install the modules using the following commands:
cpanm Mojo::UserAgent
cpanm Mojo::DOM
Create a new file named `mojo.pl` and include the `Mojo::UserAgent` and `Mojo::DOM` modules:
use Mojo::UserAgent;
use Mojo::DOM;
Define an instance of `Mojo::UserAgent` and make the HTTP request:
my $ua = Mojo::UserAgent->new;
my $res = $ua->get('https://quotes.toscrape.com/')->result;
Similar to `LWP::UserAgent`, use the following if-else block to check whether the request was successful:
if ($res->is_success) {
} else {
    print "Cannot parse the result. " . $res->message . "\n";
}
In the `if` block, initialize an instance of `Mojo::DOM`:
my $dom = Mojo::DOM->new($res->body);
Use the `find` method to find all the `div` elements with the `quote` class:
my @quotes = $dom->find('div.quote')->each;
Iterate over the `quotes` array and extract the text and author names:
foreach my $quote (@quotes) {
    my $text = $quote->find('span.text')->map('text')->join;
    my $author = $quote->find('small.author')->map('text')->join;
    print "$text: $author\n";
}
The following is the full code:
use Mojo::UserAgent;
use Mojo::DOM;
my $ua = Mojo::UserAgent->new;
my $res = $ua->get('https://quotes.toscrape.com/')->result;
if ($res->is_success) {
    my $dom = Mojo::DOM->new($res->body);
    my @quotes = $dom->find('div.quote')->each;
    foreach my $quote (@quotes) {
        my $text = $quote->find('span.text')->map('text')->join;
        my $author = $quote->find('small.author')->map('text')->join;
        print "$text: $author\n";
    }
} else {
    print "Cannot parse the result. " . $res->message . "\n";
}
Run this code with `perl mojo.pl`, and you should get the following output:
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.": Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities.": J.K. Rowling
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.": Albert Einstein
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.": Jane Austen
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.": Marilyn Monroe
"Try not to become a man of success. Rather become a man of value.": Albert Einstein
"It is better to be hated for what you are than to be loved for what you are not.": André Gide
"I have not failed. I've just found 10,000 ways that won't work.": Thomas A. Edison
"A woman is like a tea bag; you never know how strong it is until it's in hot water.": Eleanor Roosevelt
"A day without sunshine is like, you know, night.": Steve Martin
Using XML::LibXML
The Perl module `XML::LibXML` is a wrapper around the `libxml2` library. It provides a powerful XHTML parser with XPath capabilities.
Use `cpanm` to install the module:
cpanm XML::LibXML
Then create a new file named `xml-libxml.pl`. As is the case with `HTML::TreeBuilder`, you need a library like `LWP::UserAgent` to make the HTTP request to the website and fetch the HTML content, which you then pass to `XML::LibXML`.
Paste the following code, which sets up the `LWP::UserAgent` module and fetches the HTML content of the web page:
use LWP::UserAgent;
use XML::LibXML;
use open qw( :std :encoding(UTF-8) );
my $ua = LWP::UserAgent->new;
$ua->agent("Quotes Scraper");
my $url = "https://quotes.toscrape.com/";
my $request = $ua->get($url) or die "An error occurred $!\n";
if ($request->is_success) {
} else {
    print "Cannot parse the result. " . $request->status_line . "\n";
}
Inside the `if` block, start by parsing the HTML document using the `load_html` method:
my $dom = XML::LibXML->load_html(string => $request->content, recover => 1, suppress_errors => 1);
The `recover` option tells the parser to continue parsing the HTML in case of an error, and the `suppress_errors` option stops the parser from printing HTML parsing errors to the console. Since HTML documents are not as strictly validated as XHTML documents, you’re likely to encounter nonfatal parsing errors. These options keep the code working when those errors occur.
Once the HTML is parsed, you can use the `findnodes` method to find elements based on an XPath expression:
my $xpath = '//div[@class="quote"]';
foreach my $quote ($dom->findnodes($xpath)) {
    my ($text) = $quote->findnodes('.//span[@class="text"]')->to_literal_list;
    my ($author) = $quote->findnodes('.//small[@class="author"]')->to_literal_list;
    print "$text: $author\n";
}
The full code looks like this:
use LWP::UserAgent;
use XML::LibXML;
use open qw( :std :encoding(UTF-8) );
my $ua = LWP::UserAgent->new;
$ua->agent("Quotes Scraper");
my $url = "https://quotes.toscrape.com/";
my $request = $ua->get($url) or die "An error occurred $!\n";
if ($request->is_success) {
    my $dom = XML::LibXML->load_html(string => $request->content, recover => 1, suppress_errors => 1);
    my $xpath = '//div[@class="quote"]';
    foreach my $quote ($dom->findnodes($xpath)) {
        my ($text) = $quote->findnodes('.//span[@class="text"]')->to_literal_list;
        my ($author) = $quote->findnodes('.//small[@class="author"]')->to_literal_list;
        print "$text: $author\n";
    }
} else {
    print "Cannot parse the result. " . $request->status_line . "\n";
}
Run the code with `perl xml-libxml.pl`, and you should see the following output:
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.": Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities.": J.K. Rowling
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.": Albert Einstein
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.": Jane Austen
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.": Marilyn Monroe
"Try not to become a man of success. Rather become a man of value.": Albert Einstein
"It is better to be hated for what you are than to be loved for what you are not.": André Gide
"I have not failed. I've just found 10,000 ways that won't work.": Thomas A. Edison
"A woman is like a tea bag; you never know how strong it is until it's in hot water.": Eleanor Roosevelt
"A day without sunshine is like, you know, night.": Steve Martin
You can find all the code for this tutorial in this GitHub repo.
Challenges of Web Scraping in Perl
Although Perl makes it easy to scrape web pages using its powerful modules, developers often run into some common problems that can slow down or completely hinder web scraping. The following are a few of the challenges that you are likely to face.
Dealing with Pagination
Websites that deal with a large volume of data often don’t send all the data at once. Usually, the data is spread across multiple pages, and you need to handle the pagination to ensure you extract all of it. There are two steps to handling pagination:
- Check whether other pages exist. Typically, you can look for a Next page button on the page, or you can try to load the next page and look for an error.
- If other pages exist, load the next page and scrape it.
For static websites, where each page has its own URL, you can run a loop and load new pages by incrementing the page number parameter in the URL. Or, if you’re using a module like `WWW::Mechanize`, you can simply follow the Next page URL.
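For instance, the URL-increment approach can be sketched like this (this sketch assumes the site exposes pages at `/page/N/`, as Quotes to Scrape does, and that an out-of-range page contains no `quote` elements):

```perl
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Quotes to Scrape exposes pages at /page/1/, /page/2/, and so on
my $page = 1;
while (1) {
    my $res = $ua->get("https://quotes.toscrape.com/page/$page/");
    # Stop on an HTTP error or when the page has no quotes left
    last unless $res->is_success && $res->content =~ /class="quote"/;
    # ... parse $res->content with HTML::TreeBuilder as shown earlier ...
    $page++;
}
```

Note the content check: some sites (including Quotes to Scrape) return a normal 200 response for out-of-range pages, so checking `is_success` alone would never terminate the loop.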
Here’s the quotes scraper modified to handle pagination using `WWW::Mechanize`. Note the use of `follow_link`:
use WWW::Mechanize ();
use HTML::TreeBuilder;
use open qw( :std :encoding(UTF-8) );

my $mech = WWW::Mechanize->new();
my $url = "https://quotes.toscrape.com/";
$mech->get($url);

while (1) {
    # Parse the current page with a fresh tree
    my $root = HTML::TreeBuilder->new();
    $root->parse($mech->content);
    my @quotes = $root->look_down(
        _tag  => 'div',
        class => 'quote'
    );
    foreach my $quote (@quotes) {
        my $text = $quote->look_down(
            _tag  => 'span',
            class => 'text'
        )->as_text;
        my $author = $quote->look_down(
            _tag  => 'small',
            class => 'author'
        )->as_text;
        print "$text: $author\n";
    }
    $root->delete;

    # Stop when there is no Next page link
    my $next_page = $mech->find_link(text_regex => qr/Next/);
    last unless $next_page;
    $mech->follow_link(url => $next_page->url);
}
To handle dynamic websites that load the next page using JavaScript, check out our guide on Scraping Dynamic Websites With Python, or continue reading.
Rotating Proxies
Proxies are commonly used by web scrapers to protect their privacy and anonymity and to evade IP address bans. Modules like `LWP::UserAgent` can be configured to route requests through a proxy. However, using a single proxy server still runs the risk of getting its IP banned, which is why it’s recommended to use multiple proxy servers and rotate them. Here’s a very simple example of how to do it using `LWP::UserAgent`.
Begin by defining an array of proxies. Then choose one at random and set it using the `proxy` method:
my @proxies = ( 'https://proxy1.com', 'https://proxy2.com', 'http://proxy3.com' );

my $ua = LWP::UserAgent->new;
my $index = int(rand @proxies);
my $proxy = $proxies[$index];
$ua->proxy(['http', 'https'], $proxy);
Now, you can send a request as usual. If the request fails, it likely indicates that the proxy has been blocked, so you can remove that proxy from the list, choose a different proxy, and try again:
if ($request->is_success) {
    # Continue with the scraping
} else {
    # Remove the proxy from the list
    splice(@proxies, $index, 1);
    # Try again
}
Handling Honeypot Traps
Honeypot traps are a common technique employed by web admins to trap bots and scrapers. Usually, they use links with the `display` property set to `none`, which makes them invisible to human users. A bot, however, can pick up and follow the link, which leads it to a decoy page and away from the real content.
To tackle this issue, check the `display` property of links before following them. The following is one way to do so using `HTML::TreeBuilder`:
my @links = $root->look_down(
    _tag => 'a',
);

foreach my $link (@links) {
    my $style = $link->attr('style');
    if (defined $style && $style =~ /display:\s*none/) {
        # Honeypot detected!
    } else {
        # Safe to proceed
    }
}
Solving CAPTCHAs
CAPTCHAs help prevent unauthorized access to a website. However, they can also prevent web scrapers from scraping web pages.
To fight CAPTCHAs, you can use a service like the Bright Data Web Unlocker, which solves the CAPTCHAs for you.
The following is an example using the Bright Data Web Unlocker to make an HTTP request:
use LWP::UserAgent;
my $agent = LWP::UserAgent->new();
$agent->proxy(['http', 'https'], "http://brd-customer-hl_6d74fc42-zone-residential_proxy4:812qoxo6po44\@brd.superproxy.io:22225");
print $agent->get('http://lumtest.com/myip.json')->content();
When you make an HTTP request using Web Unlocker, it automatically solves CAPTCHAs, evades anti-bot measures, and handles proxy management for you.
Scraping Dynamic Websites
So far, all the examples you’ve learned about here scrape static websites. However, single-page applications (SPA) and other dynamic websites need more advanced techniques.
Dynamic websites use JavaScript to load page content, which means you need scraping tools that are capable of running JavaScript. Selenium is one such tool that can emulate a browser to run dynamic websites. The following is a very small example snippet of this module in action:
use Selenium::Remote::Driver;
my $driver = Selenium::Remote::Driver->new;
$driver->get('http://example.com');
my $elem = $driver->find_element_by_id('foo');
print $elem->get_text();
$driver->quit();
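Building on that, here’s a hedged sketch of scraping a JavaScript-rendered page, the `/js/` version of Quotes to Scrape. It assumes a Selenium server is already running on `localhost:4444` with Chrome available; the implicit wait gives JavaScript-rendered elements time to appear before `find_elements` gives up:

```perl
use Selenium::Remote::Driver;

# Assumes a Selenium server is running locally on port 4444
my $driver = Selenium::Remote::Driver->new(
    remote_server_addr => 'localhost',
    port               => 4444,
    browser_name       => 'chrome',
);

# Give JavaScript-rendered elements up to 5 seconds to appear
$driver->set_implicit_wait_timeout(5000);

$driver->get('https://quotes.toscrape.com/js/');
my @quotes = $driver->find_elements('div.quote', 'css');
print $_->get_text(), "\n" for @quotes;
$driver->quit();
```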
Conclusion
Perl, thanks to its robust collection of modules, is an excellent language for web scraping. In this article, you learned how to scrape web pages in Perl using the following:
- `LWP::UserAgent` and `HTML::TreeBuilder`
- `Web::Scraper`
- `Mojo::UserAgent` and `Mojo::DOM`
- `XML::LibXML`
However, as you saw, real-world web scraping comes with many challenges, especially when website owners are determined to keep scrapers out. This article shed light on some common scenarios and how to combat them. Still, solving those challenges by yourself can be tedious and error-prone. That’s where Bright Data can help. With premium proxy services, a Scraping Browser, Web Unlocker, and the Web Scraper API, Bright Data is an all-encompassing solution for scraping the web with ease. Start a free trial today!