Web Scraping With Rust

Dive into the world of web scraping with Rust, a language celebrated for its performance and safety features. This concise guide will lead you through setting up your environment and crafting efficient scrapers, showcasing how Rust’s unique strengths elevate the art of web scraping.

In this guide, you will learn:

  • Whether Rust is a good language for web scraping.
  • The best Rust web scraping libraries.
  • How to build a web scraper in Rust, step by step.
  • How to keep your web scraping operation ethical and respectful.

Let’s dive in!

Is Rust a Good Language for Web Scraping?

Rust is a statically typed programming language known for its focus on security, performance, and concurrency. In recent years, it has gained popularity for its high efficiency. That makes it an excellent choice for a variety of applications, including web scraping.

Rust provides valuable features for online data scraping endeavors. Notably, its robust concurrency model facilitates the simultaneous execution of multiple web requests. This characteristic positions it as a versatile language adept at efficiently extracting substantial amounts of data from diverse websites.
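To make the concurrency point concrete, here is a minimal stdlib-only sketch that fans requests out across threads. The fetch_page() function is a hypothetical stand-in for a real HTTP call (e.g. with reqwest):

```rust
use std::thread;

// Hypothetical stand-in for a real HTTP request
fn fetch_page(url: &str) -> String {
    format!("<html>content of {url}</html>")
}

// Give each URL its own thread, then collect the results in order
fn scrape_concurrently(urls: Vec<String>) -> Vec<String> {
    let handles: Vec<_> = urls
        .into_iter()
        .map(|url| thread::spawn(move || fetch_page(&url)))
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    let urls = vec![
        "https://example.com/page/1".to_string(),
        "https://example.com/page/2".to_string(),
    ];
    for page in scrape_concurrently(urls) {
        println!("fetched {} bytes", page.len());
    }
}
```

In a real scraper you would cap the number of worker threads (or use an async runtime) rather than spawning one thread per URL.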

Moreover, the Rust ecosystem encompasses HTTP client and HTML parsing libraries that streamline the processes of web page retrieval and data extraction. Let’s take a look at some of the top ones!

Best Rust Web Scraping Libraries

The most popular and widely adopted Rust web scraping libraries include:

  • reqwest: A powerful HTTP client for Rust, enabling seamless web requests and interactions.
  • scraper: A flexible HTML parsing library in Rust, facilitating efficient extraction of data from HTML documents.
  • rust-headless-chrome: Offers headless Chrome browser automation using Rust, providing a robust solution for dynamic web scraping.
  • thirtyfour: Rust bindings for Selenium, allowing automated testing and web scraping by interacting with web browsers.

Prerequisites

Follow the instructions below and get ready to write some Rust code.

Set Up the Environment

Before getting started, you must have Rust installed on your computer. To verify if you already have it, open the terminal and type the following command:

rustc --version

If the result is similar to the one below, you are ready to go:

rustc 1.75.0 (82e1608df 2023-12-21)

Update Rust to the latest version with:

rustup update

If that command returns an error instead, you need to install Rust. Download the installer from the official website, launch it, and follow the wizard. That will set up:

  • rustup:  An installer and version manager for the Rust programming language, enabling easy installation and management of different toolchains.
  • cargo:  The official package manager and build tool for Rust. It streamlines the process of managing dependencies and building Rust projects.

Close all open terminal windows and repeat the command at the beginning of this section. This time you will get the desired result.

Wonderful! You now have Rust in place!

Create a Rust Project

Suppose you want to create a new Rust project called simple_rust_web_scraper. Open the terminal and execute the following cargo new command:

cargo new simple_rust_web_scraper

If everything goes as intended, you will receive the following message:

Created binary (application) `simple_rust_web_scraper` package

Specifically, that command will create a simple_rust_web_scraper folder. Open it and note that it includes:

  • Cargo.toml: The manifest file to specify the project’s dependencies.
  • src/: The folder where to place your Rust files. By default, it initializes a sample main.rs file for you.

Open simple_rust_web_scraper in your Rust IDE. For example, Visual Studio Code with the Rust extension will be perfect:

Navigate inside the src/ folder, open the main.rs file, and you will see these lines:

fn main() {
    println!("Hello, world!");
}

That is nothing more than a simple Rust script that prints “Hello, world!” in the terminal. In particular, the main() function represents the entry point of any Rust application and is where you will write the scraping logic.

Amazing! It only remains to verify that your new Rust project works! 

Open the terminal of your IDE, and run this command to compile your Rust application:

cargo build

A target/ folder storing some binary files will appear in the root folder of your project. 

Run the compiled binary executable associated with your code with:

cargo run

That should print in the terminal:

Finished dev [unoptimized + debuginfo] target(s) in 0.05s
     Running `target\debug\simple_rust_web_scraper.exe`
Hello, world!

The first two lines are just log information, so you can ignore them. Focus on the last line and see that the project produced the “Hello, world!” message as expected.

Perfect! You now have a Rust project. It is time to write some Rust web scraping logic!

How to Build a Web Scraper in Rust

In this step-by-step tutorial section, you will learn how to perform web scraping with Rust. In detail, you will build a Rust web scraper that automatically collects data from the Scrape This Site Country sandbox. This is what the target page looks like:

As you can see, it contains a list of all the countries in the world and some interesting information about them.

What the Rust web scraping script will do is:

  1. Connect to the destination page and parse its HTML.
  2. Select the country HTML elements from the page.
  3. Extract data from them and store it in a Rust data structure.
  4. Transform the collected data into a human-readable format, such as CSV.

Follow the steps below and achieve your scraping goal!

Step #1: Inspect the Target Site

You will need to install some libraries to do web scraping in Rust, but which ones are best suited for your specific scenario? To answer this, you need to figure out whether the destination site has static content pages or dynamic content pages. Thus, visit the site in your browser.

Navigate to the target page, right-click on a blank section, and select the “Inspect” option to open the DevTools. Reach the “Network” tab and reload the page. Focus on what you see in the “Fetch/XHR” section:

While the page is loading and rendering, that section will remain empty. This means that the web page does not make any AJAX requests. In other words, it does not retrieve data dynamically on the client via JavaScript. It is therefore a static content page, whose HTML document already contains all the data of interest.

As further confirmation, right-click and select the “View page source” option:

Explore the code and you will notice that all the data in the page is embedded into the HTML returned by the server.

On a site with multiple pages, repeat this procedure on all pages of interest.

Since the target pages do not use JavaScript, you do not require a browser automation library like rust-headless-chrome. You could still use it, but running Chrome takes time and resources, so it would only introduce a performance overhead and no real benefit.

Instead, you should employ an HTTP client library to retrieve the HTML document associated with a page and an HTML parser library to extract data from it. Thus, reqwest and scraper are the two Rust web scraping libraries that you need!

Step #2: Install the Scraping Libraries

Time to install reqwest and scraper. 

Open a terminal in your project’s root folder or use your IDE’s terminal. Run the following command to add reqwest and scraper to your project’s dependencies:

cargo add scraper reqwest --features "reqwest/blocking"

Note: The reqwest/blocking feature allows reqwest to perform synchronous HTTP calls that block the current thread. Learn more in the documentation.

The cargo add command will update the Cargo.toml file accordingly, making sure it contains:

[dependencies]
reqwest = { version = "0.11.23", features = ["blocking"] }
scraper = "0.18.1"

Also, it will install the two libraries and all their dependencies. 

Perfect! You now have everything you need to do web scraping with Rust!

Step #3: Connect to the Target Page

Use the get() method from reqwest::blocking to make a GET request to the given URL and download the associated HTML document:

let response = reqwest::blocking::get("https://www.scrapethissite.com/pages/simple/")?;

Keep in mind that this instruction is synchronous, so the script will block until the server responds.

Once you get a response, you can access the HTML code of the target page with:

let html = response.text()?;

Write those two lines in the main() function of main.rs:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // connect to the target page
    let response = reqwest::blocking::get("https://www.scrapethissite.com/pages/simple/")?;

    // extract the raw HTML and print it
    let html = response.text()?;
    println!("{html}");

    Ok(())
}

If you are wondering about the Result<(), Box<dyn std::error::Error>> return type, that is what allows the script to use the ? operator to propagate errors. Take also a look at the println!() macro at the end, which logs the retrieved HTML.
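The ? operator works with any function returning a compatible Result: on failure, it returns early with the error. Here is a minimal stdlib illustration (parse_number() is a made-up helper, not part of the scraper):

```rust
use std::error::Error;

// On a parse failure, `?` returns early with the boxed error
fn parse_number(s: &str) -> Result<i32, Box<dyn Error>> {
    let n: i32 = s.trim().parse()?;
    Ok(n)
}

fn main() -> Result<(), Box<dyn Error>> {
    let n = parse_number(" 42 ")?;
    println!("Parsed: {n}");
    Ok(())
}
```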

Execute the script, and it will print in the terminal:

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping</title>
    <!-- omitted for brevity... -->

Well done! That is exactly the HTML of the target page!

Step #4: Parse the HTML Document

You now have the source HTML of the desired page stored in a string variable. Feed it to the parse_document() function from scraper to parse it:

let document = scraper::Html::parse_document(&html);

The returned document object exposes the DOM exploration API you need to perform web scraping using Rust.

This is what your main.rs file should look like so far:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // connect to the target page
    let response = reqwest::blocking::get("https://www.scrapethissite.com/pages/simple/")?;

    // extract the raw HTML
    let html = response.text()?;

    // parse the HTML document
    let document = scraper::Html::parse_document(&html);

    Ok(())
}

You are ready to write the data parsing logic. But first, you have to study the structure of the target page!

Step #5: Inspect the Page

Web scraping involves selecting HTML nodes on a page and extracting data from them. CSS selectors are among the most popular methods for selecting HTML nodes. If you are a web developer, you are probably already familiar with them. If not, explore the documentation.

The only way to define effective CSS selectors is to inspect the HTML of the target page. Thus, open the Scrape This Site Country sandbox in the browser, right-click on a country element, and select “Inspect:”

There, you can see that each country info box is a .country HTML node that contains:

  • The country name in a .country-name element.
  • The name of the capital in a .country-capital element.
  • The population information in a .country-population element.
  • The area in km² occupied by the country in the .country-area element.

The list above contains all the CSS selectors required to select the desired HTML nodes. Test them on a single country info box before applying them to all elements on the page!

Step #6: Retrieve Data from a Single Element

The parse() function from scraper::Selector accepts a string representing a CSS selector and returns a selector object. Use it as below:

let html_country_info_box_selector = scraper::Selector::parse(".country")?;

You can then pass the selector to the select() method exposed by document:

let html_country_info_box_element = document
    .select(&html_country_info_box_selector)
    .next()
    .ok_or("Country info box element not found!")?;

That will apply the CSS selector on the page and return the selected HTML element. Since select() always returns an iterator, the .next() call is required to get the first country info box node.
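This pattern is plain Rust iterator handling, so you can see it in isolation with the standard library alone. The sketch below (first_or_err() is a hypothetical helper) mirrors the .next().ok_or() chain:

```rust
// Take the first element of an iterator and turn the Option into a
// Result, so `?` can propagate the "not found" case as an error
fn first_or_err(items: &[&'static str]) -> Result<&'static str, Box<dyn std::error::Error>> {
    items.iter().next().copied().ok_or_else(|| "Element not found!".into())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let matches = ["div.country (1)", "div.country (2)"];
    let first = first_or_err(&matches)?;
    println!("First match: {first}");
    Ok(())
}
```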

Note that the object returned by select() exposes the select() function as well. In this case, it will search for nodes only in the children of the current node. So, you can implement the entire Rust web scraping logic as follows:

let country_name_selector = scraper::Selector::parse(".country-name")?;
let name = html_country_info_box_element
    .select(&country_name_selector)
    .next()
    .map(|element| element.text().collect::<String>().trim().to_owned())
    .ok_or("Country name not found")?;

let country_capital_selector = scraper::Selector::parse(".country-capital")?;
let capital = html_country_info_box_element
    .select(&country_capital_selector)
    .next()
    .map(|element| element.text().collect::<String>().trim().to_owned())
    .ok_or("Country capital not found")?;

let country_population_selector = scraper::Selector::parse(".country-population")?;
let population = html_country_info_box_element
    .select(&country_population_selector)
    .next()
    .map(|element| element.text().collect::<String>().trim().to_owned())
    .ok_or("Country population not found")?;

let country_area_selector = scraper::Selector::parse(".country-area")?;
let area = html_country_info_box_element
    .select(&country_area_selector)
    .next()
    .map(|element| element.text().collect::<String>().trim().to_owned())
    .ok_or("Country area not found")?;

The text() method enables you to access the text contained in the selected HTML node. For other data extraction approaches, check out the docs. As the extracted text could contain unwanted spaces, remove them with trim().
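The collect-then-trim chain is worth understanding on its own: text() yields an iterator of text fragments (one per text node), collect::<String>() concatenates them, and trim() strips the surrounding layout whitespace. Here is a stdlib-only sketch of the same idea (clean_text() is a hypothetical helper):

```rust
// Concatenate text fragments and strip surrounding whitespace,
// like element.text().collect::<String>().trim().to_owned()
fn clean_text(fragments: &[&str]) -> String {
    fragments.concat().trim().to_owned()
}

fn main() {
    // HTML text nodes often carry the page's indentation with them
    let fragments = ["\n            ", "Andorra", "\n        "];
    println!("{}", clean_text(&fragments)); // prints: Andorra
}
```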

Print the scraped data to verify that the scraping logic works as expected:

println!("Country name: {name}");
println!("Country capital: {capital}");
println!("Country population: {population}");
println!("Country area: {area}");

That would produce:

Country name: Andorra
Country capital: Andorra la Vella
Country population: 84000
Country area: 468.0

Yes! You just performed web scraping in Rust!

Step #7: Scrape All Elements on the Page

This time, you will extend the code seen above to go through all the country information box nodes on the page. 

First, you need to define a custom data structure in which to store the collected data. To specify a new struct tailored for that, add the following lines on top of your main.rs file:

struct Country {
    name: String,
    capital: String,
    population: String,
    area: String,
}

Second, instantiate a Vec of Country objects in main():

let mut countries: Vec<Country> = Vec::new();

This vector will contain all your scraped data. 

Next, remove the .next() call to get all country info boxes, iterate over them, and populate countries:

// where to store the scraped data
let mut countries: Vec<Country> = Vec::new();

// select the country info box HTML elements
let html_country_info_box_selector = scraper::Selector::parse(".country")?;
let html_country_info_box_elements = document.select(&html_country_info_box_selector);

// iterate over the country HTML elements
// and scrape them all
for html_country_info_box_element in html_country_info_box_elements {
    // scraping logic for a single country info box HTML element...

    // create a new Country object and add it to the vector
    let country = Country {
        name,
        capital,
        population,
        area,
    };
    countries.push(country);
}

You can then print all scraped countries with:

// log the results

for country in countries {

    println!("Country name: {}", country.name);

    println!("Country capital: {}", country.capital);

    println!("Country name: {}", country.population);

    println!("Country area: {}", country.area);

    println!();

}

The new main.rs Rust web scraping file will contain:

// custom struct to store the scraping data
struct Country {
    name: String,
    capital: String,
    population: String,
    area: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // connect to the target page
    let response = reqwest::blocking::get("https://www.scrapethissite.com/pages/simple/")?;

    // extract the raw HTML
    let html = response.text()?;

    // parse the HTML document
    let document = scraper::Html::parse_document(&html);

    // where to store the scraped data
    let mut countries: Vec<Country> = Vec::new();

    // select the country info box HTML elements
    let html_country_info_box_selector = scraper::Selector::parse(".country")?;
    let html_country_info_box_elements = document.select(&html_country_info_box_selector);

    // iterate over the country HTML elements
    // and scrape them all
    for html_country_info_box_element in html_country_info_box_elements {
        // scraping logic for a single country info box HTML element
        let country_name_selector = scraper::Selector::parse(".country-name")?;
        let name = html_country_info_box_element
            .select(&country_name_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country name not found")?;

        let country_capital_selector = scraper::Selector::parse(".country-capital")?;
        let capital = html_country_info_box_element
            .select(&country_capital_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country capital not found")?;

        let country_population_selector = scraper::Selector::parse(".country-population")?;
        let population = html_country_info_box_element
            .select(&country_population_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country population not found")?;

        let country_area_selector = scraper::Selector::parse(".country-area")?;
        let area = html_country_info_box_element
            .select(&country_area_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country area not found")?;

        // create a new Country object and add it to the vector
        let country = Country {
            name,
            capital,
            population,
            area,
        };
        countries.push(country);
    }

    // log the results
    for country in countries {
        println!("Country name: {}", country.name);
        println!("Country capital: {}", country.capital);
        println!("Country population: {}", country.population);
        println!("Country area: {}", country.area);
        println!();
    }

    Ok(())
}

Launch it, and it will generate this output:

Country name: Andorra
Country capital: Andorra la Vella
Country population: 84000
Country area: 468.0

# omitted for brevity...

Country name: Zimbabwe
Country capital: Harare
Country population: 11651858
Country area: 390580.0

Mission complete! You just scraped all countries from the target page!

Step #8: Export the Extracted Data to CSV

The collected data is now stored in a Rust vector, which is not the best format if you want to share it with other people. That is why you need to export it to an easy-to-explore format, such as CSV.

To export data to a CSV file, you should use the csv library. Install it with this command:

cargo add csv

You can then use it to produce a CSV export file with:

// initialize the output CSV file
let mut writer = csv::Writer::from_path("countries.csv")?;

// write the CSV header
writer.write_record(&["name", "capital", "population", "area"])?;

// populate the file with each country
for country in countries {
    writer.write_record(&[
        country.name,
        country.capital,
        country.population,
        country.area,
    ])?;
}

// flush any remaining buffered data to the file
writer.flush()?;

This snippet creates a CSV file, initializes it with the header row, and finally populates it by iterating over the countries vector.
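The csv crate also takes care of escaping for you: a field containing commas, quotes, or newlines must be wrapped in double quotes, with inner quotes doubled (RFC 4180). If you are curious what that involves, here is a rough stdlib-only sketch (escape_csv_field() and to_csv_row() are hypothetical helpers, not part of the csv crate):

```rust
// Quote a field when it contains a delimiter, a quote, or a newline
fn escape_csv_field(field: &str) -> String {
    if field.contains(|c| c == ',' || c == '"' || c == '\n') {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_owned()
    }
}

// Join escaped fields into a single CSV row
fn to_csv_row(fields: &[&str]) -> String {
    fields
        .iter()
        .map(|field| escape_csv_field(field))
        .collect::<Vec<_>>()
        .join(",")
}

fn main() {
    println!("{}", to_csv_row(&["Andorra", "Andorra la Vella", "84000", "468.0"]));
    // prints: Andorra,Andorra la Vella,84000,468.0
}
```

In practice, stick with the csv crate: it also handles quoting configuration, headers, and serde integration.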

Step #9: Put It All Together

Here is the complete code of your web scraping Rust script:

// custom struct to store the scraping data
pub struct Country {
    name: String,
    capital: String,
    population: String,
    area: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // connect to the target page
    let response = reqwest::blocking::get("https://www.scrapethissite.com/pages/simple/")?;

    // extract the raw HTML
    let html = response.text()?;

    // parse the HTML document
    let document = scraper::Html::parse_document(&html);

    // where to store the scraped data
    let mut countries: Vec<Country> = Vec::new();

    // select the country info box HTML elements
    let html_country_info_box_selector = scraper::Selector::parse(".country")?;
    let html_country_info_box_elements = document.select(&html_country_info_box_selector);

    // iterate over the country HTML elements
    // and scrape them all
    for html_country_info_box_element in html_country_info_box_elements {
        // scraping logic for a single country info box HTML element
        let country_name_selector = scraper::Selector::parse(".country-name")?;
        let name = html_country_info_box_element
            .select(&country_name_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country name not found")?;

        let country_capital_selector = scraper::Selector::parse(".country-capital")?;
        let capital = html_country_info_box_element
            .select(&country_capital_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country capital not found")?;

        let country_population_selector = scraper::Selector::parse(".country-population")?;
        let population = html_country_info_box_element
            .select(&country_population_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country population not found")?;

        let country_area_selector = scraper::Selector::parse(".country-area")?;
        let area = html_country_info_box_element
            .select(&country_area_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country area not found")?;

        // create a new Country object and add it to the vector
        let country = Country {
            name,
            capital,
            population,
            area,
        };
        countries.push(country);
    }

    // initialize the output CSV file
    let mut writer = csv::Writer::from_path("countries.csv")?;

    // write the CSV header
    writer.write_record(&["name", "capital", "population", "area"])?;

    // populate the file with each country
    for country in countries {
        writer.write_record(&[
            country.name,
            country.capital,
            country.population,
            country.area,
        ])?;
    }

    // flush any remaining buffered data to the file
    writer.flush()?;

    Ok(())
}

Can you believe it? You can build a Rust data scraper in less than 100 lines of code.

Compile the application with the command below:

cargo build

Then, launch it with:

cargo run

When the script terminates, a countries.csv file will appear in the root folder of your project. Open it, and you should see the following data:

Et voilà! You now know the basics of Rust web scraping!

Keep Your Web Scraping Operation Ethical and Respectful

Automatically retrieving data from the Internet is an effective way to get useful information. However, you do not want to harm the target site while doing it. Thus, you must approach that operation with the right precautions. 

To perform responsible web scraping, consider these tips:

  • Comply with the robots.txt file: Every site has a robots.txt file that specifies the rules on how automated crawlers should access its pages. To maintain ethical scraping practices, you must adhere to those guidelines. Learn more in our robots.txt for web scraping guide.
  • Limit the frequency of your requests: Making too many requests in a short period will lead to a server overload, affecting site performance for all users. That might also trigger rate limiting measures and get you blocked. Thus, add random delays to your requests to avoid flooding the destination server.
  • Check and respect the site’s Terms of Service: Before scraping a website, review and abide by its Terms of Service. These may contain information on copyright, intellectual property rights, and guidelines on how and when to use their data. 
  • Scrape only publicly available information: Focus on extracting data that is publicly accessible on the site and not protected by login credentials or other forms of authorization. Scraping private or sensitive data without proper permission is unethical and may lead to legal consequences.
  • Rely on trustworthy and up-to-date scraping tools: Select reputable providers and opt for libraries and tools that are well-maintained and regularly updated. Only then can you ensure that they are in line with the latest ethical scraping principles and best practices. If you have any doubts, read our article on how to choose the best web scraping service.
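To make the "limit the frequency of your requests" advice concrete, here is a minimal sketch of a randomized delay between requests. It uses only the standard library, deriving jitter from the clock's nanoseconds (a real scraper would likely use the rand crate instead):

```rust
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Pick a pseudo-random delay in [min_ms, max_ms] from the clock's nanoseconds
fn polite_delay_ms(min_ms: u64, max_ms: u64) -> u64 {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos() as u64;
    min_ms + nanos % (max_ms - min_ms + 1)
}

fn main() {
    for url in ["https://example.com/page/1", "https://example.com/page/2"] {
        let delay = polite_delay_ms(500, 1500);
        println!("Waiting {delay} ms before requesting {url}");
        thread::sleep(Duration::from_millis(delay));
        // ...perform the request here...
    }
}
```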

Conclusion

In this tutorial, you saw why Rust is a good option for web scraping and what libraries you should use to perform it. Here, you learned how to use reqwest and scraper to build a Rust web scraper that can extract data from a real-world site. That takes only a few lines of code!

However, keep in mind that web scraping may not always be that easy. The reason is that anti-scraping and anti-bot solutions are becoming more common. These technologies can detect the automated nature of your script and block it, posing a serious challenge to your scraping operation.

Avoid that headache with the next-generation and advanced web scraping tool provided by Bright Data. If you want to find out more about how to avoid being blocked, adopt a web proxy from one of the several proxy services available or start using the advanced Web Unlocker.

Don’t want to deal with web scraping? Explore our datasets.

Not sure which product to choose? Contact sales and find the right web scraping solution for you.