The Definitive Guide to Rust Web Scraping
In this guide, you will learn:
- Whether Rust is a good language for web scraping.
- What the best Rust web scraping libraries are.
- How to build a web scraper in Rust.
- How to keep your scraping operation ethical and respectful.
Let’s dive in!
Is Rust a Good Language for Web Scraping?
Rust is a statically typed programming language known for its focus on security, performance, and concurrency. In recent years, it has gained popularity for its high efficiency. That makes it an excellent choice for a variety of applications, including web scraping.
Rust provides valuable features for online data scraping endeavors. Notably, its robust concurrency model facilitates the simultaneous execution of multiple web requests, making it well suited to efficiently extracting large amounts of data from many websites at once.
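As a taste of that concurrency, here is a minimal sketch of fetching two pages in parallel with async Rust. This is purely illustrative: it assumes tokio (with the "full" feature set) and reqwest as dependencies, and the URLs are placeholders:

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // launch both requests at the same time instead of one after the other
    let (page_a, page_b) = tokio::join!(
        reqwest::get("https://example.com/a"),
        reqwest::get("https://example.com/b")
    );
    println!("a: {} bytes", page_a?.text().await?.len());
    println!("b: {} bytes", page_b?.text().await?.len());
    Ok(())
}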
Moreover, the Rust ecosystem offers HTTP client and HTML parsing libraries that streamline web page retrieval and data extraction. Let’s take a look at the top ones!
Best Rust Web Scraping Libraries
The most popular and widely adopted Rust web scraping libraries include:
- reqwest: A powerful HTTP client for Rust, enabling seamless web requests and interactions.
- scraper: A flexible HTML parsing library in Rust, facilitating efficient extraction of data from HTML documents.
- rust-headless-chrome: Offers headless Chrome browser automation in Rust, providing a robust solution for scraping dynamic pages (see the sketch after this list).
- thirtyfour: Rust bindings for Selenium, allowing automated testing and web scraping by interacting with web browsers.
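To give you an idea of what browser automation looks like, below is a minimal rust-headless-chrome sketch. Treat it as a rough example based on the headless_chrome crate’s documented API, assuming the crate has been added with cargo add headless_chrome:

use headless_chrome::Browser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // launch a headless Chrome instance and open a new tab
    let browser = Browser::default()?;
    let tab = browser.new_tab()?;

    // navigate to the page and wait for it to load
    tab.navigate_to("https://example.com")?;
    tab.wait_until_navigated()?;

    // grab the rendered HTML, including JavaScript-generated content
    let html = tab.get_content()?;
    println!("{html}");

    Ok(())
}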
Prerequisites
Follow the instructions below and get ready to write some Rust code.
Set Up the Environment
Before getting started, you must have Rust installed on your computer. To verify if you already have it, open the terminal and type the following command:
rustc --version
If the result is similar to the one below, you are ready to go:
rustc 1.75.0 (82e1608df 2023-12-21)
Update Rust to the latest version with:
rustup update
If that command returns an error instead, you need to install Rust. Download the installer from the official website, launch it, and follow the wizard. That will set up:
- rustup: An installer and version manager for the Rust programming language, enabling easy installation and management of different toolchains.
- cargo: The official package manager and build tool for Rust. It streamlines the process of managing dependencies and building Rust projects.
Close all open terminal windows and repeat the command at the beginning of this section. This time you will get the desired result.
Wonderful! You now have Rust in place!
Create a Rust Project
Suppose you want to create a new Rust project called simple_rust_web_scraper. Open the terminal and execute the following cargo new command:
cargo new simple_rust_web_scraper
If everything goes as intended, you will receive the following message:
Created binary (application) `simple_rust_web_scraper` package
Specifically, that command will create a simple_rust_web_scraper folder. Open it and note that it includes:
- Cargo.toml: The manifest file to specify the project’s dependencies.
- src/: The folder where to place your Rust files. By default, it initializes a sample main.rs file for you.
Open simple_rust_web_scraper in your Rust IDE. For example, Visual Studio Code with the Rust extension is a great choice.
Navigate inside the src/ folder, open the main.rs file, and you will see these lines:
fn main() {
    println!("Hello, world!");
}
That is nothing more than a simple Rust script that prints “Hello, world!” in the terminal. In particular, the main() function represents the entry point of any Rust application and is where you will write the scraping logic.
Amazing! All that remains is to verify that your new Rust project works!
Open the terminal of your IDE, and run this command to compile your Rust application:
cargo build
A target/ folder storing some binary files will appear in the root folder of your project.
Run the compiled binary with:
cargo run
That should print in the terminal:
Finished dev [unoptimized + debuginfo] target(s) in 0.05s
Running `target\debug\simple_rust_web_scraper.exe`
Hello, world!
The first two lines are just log information, so you can ignore them. Focus on the last line: the project printed the “Hello, world!” message as expected.
Perfect! You now have a Rust project. It is time to write some Rust web scraping logic!
How to Build a Web Scraper in Rust
In this step-by-step tutorial section, you will learn how to perform web scraping with Rust. In detail, you will build a Rust web scraper that automatically collects data from the Scrape This Site Country sandbox.
The target page contains a list of all the countries in the world, along with some interesting information about them.
What the Rust web scraping script will do is:
- Connect to the destination page and parse its HTML.
- Select the country HTML elements from the page.
- Extract data from them and store it in a Rust data structure.
- Transform the collected data into a human-readable format, such as CSV.
Follow the steps below and achieve your scraping goal!
Step #1: Inspect the Target Site
You will need to install some libraries to do web scraping in Rust, but which ones are best suited for your specific scenario? To answer this, you need to figure out whether the destination site has static content pages or dynamic content pages. Thus, visit the site in your browser.
Navigate to the target page, right-click on a blank section, and select the “Inspect” option to open the DevTools. Reach the “Network” tab, reload the page, and focus on the “Fetch/XHR” section.
While the page is loading and rendering, that section will remain empty. This means that the web page does not make any AJAX requests. In other words, it does not retrieve data dynamically on the client via JavaScript. It is therefore a static content page, whose HTML document already contains all the data of interest.
As further confirmation, right-click and select the “View page source” option.
Explore the code and you will notice that all the data in the page is embedded into the HTML returned by the server.
On a site with multiple pages, repeat this procedure on all pages of interest.
Since the target pages do not use JavaScript, you do not require a browser automation library like rust-headless-chrome. You could still use it, but running Chrome takes time and resources, so it would only introduce a performance overhead and no real benefit.
Instead, you should employ an HTTP client library to retrieve the HTML document associated with a page and an HTML parser library to extract data from it. Thus, reqwest and scraper are the two Rust web scraping libraries that you need!
Step #2: Install the Scraping Libraries
Time to install reqwest and scraper.
Open a terminal in your project’s root folder or use your IDE’s terminal. Run the following command to add reqwest and scraper to your project’s dependencies:
cargo add scraper reqwest --features "reqwest/blocking"
Note: The reqwest/blocking feature allows reqwest to perform synchronous HTTP calls that block the current thread. Learn more in the documentation.
The cargo add command will update the Cargo.toml file accordingly, making sure it contains:
[dependencies]
reqwest = { version = "0.11.23", features = ["blocking"] }
scraper = "0.18.1"
Also, it will install the two libraries and all their dependencies.
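Optionally, instead of the reqwest::blocking::get() shortcut used later in this tutorial, you can build a reusable blocking Client, for example to set a realistic User-Agent header. Here is a minimal sketch (the User-Agent string is only an example):

// build a reusable client with a custom User-Agent header
let client = reqwest::blocking::Client::builder()
    .user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    .build()?;

// reuse the same client for every request
let response = client
    .get("https://www.scrapethissite.com/pages/simple/")
    .send()?;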
Perfect! You now have everything you need to do web scraping with Rust!
Step #3: Connect to the Target Page
Use the get() method from reqwest::blocking to make a GET request to the given URL and download the associated HTML document:
let response = reqwest::blocking::get("https://www.scrapethissite.com/pages/simple/")?;
Keep in mind that this instruction is synchronous, so the script will block until the server responds.
Once you get a response, you can access the HTML code of the target page with:
let html = response.text()?;
Write those two lines in the main() function of main.rs:
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // connect to the target page
    let response = reqwest::blocking::get("https://www.scrapethissite.com/pages/simple/")?;

    // extract the raw HTML and print it
    let html = response.text()?;
    println!("{html}");

    Ok(())
}
If you are wondering about the Result<(), Box<dyn std::error::Error>> return type, it is required because the ? operator used above propagates any error out of main(). Also take a look at the println!() call at the end, which logs the retrieved HTML.
Execute the script, and it will print in the terminal:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping</title>
<!-- omitted for brevity... -->
Well done! That is exactly the HTML of the target page!
Step #4: Parse the HTML Document
You now have the source HTML of the desired page stored in a string variable. Feed it to the parse_document() function from scraper to parse it:
let document = scraper::Html::parse_document(&html);
The returned document object exposes the DOM exploration API you need to perform web scraping using Rust.
This is what your main.rs file should look like so far:
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // connect to the target page
    let response = reqwest::blocking::get("https://www.scrapethissite.com/pages/simple/")?;

    // extract the raw HTML
    let html = response.text()?;

    // parse the HTML document
    let document = scraper::Html::parse_document(&html);

    Ok(())
}
You are ready to write the data parsing logic. But first, you have to study the structure of the target page!
Step #5: Inspect the Page
Web scraping involves selecting HTML nodes on a page and extracting data from them. CSS selectors are among the most popular methods for selecting HTML nodes. If you are a web developer, you are probably already familiar with them. If not, explore the documentation.
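For instance, here are a few common selector forms as they would be parsed in scraper. Only .country comes from the target page; the other selectors are hypothetical examples:

// select elements by class
let by_class = scraper::Selector::parse(".country")?;

// select an element by id (hypothetical id)
let by_id = scraper::Selector::parse("#main-content")?;

// select <h3> elements nested inside .country nodes
let nested = scraper::Selector::parse(".country h3")?;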
The only way to define effective CSS selectors is to inspect the HTML of the target page. Thus, open the Scrape This Site Country sandbox in the browser, right-click on a country element, and select “Inspect.”
In the DevTools, you will see that each country info box is a .country HTML node that contains:
- The country name in a .country-name element.
- The name of the capital in a .country-capital element.
- The population information in a .country-population element.
- The area in km² occupied by the country in the .country-area element.
The list above contains all the CSS selectors required to select the desired HTML nodes. Test them on a single country info box before applying them to all elements on the page!
Step #6: Retrieve Data from a Single Element
The parse() function from scraper::Selector accepts a string representing a CSS selector and returns a selector object. Use it as below:
let html_country_info_box_selector = scraper::Selector::parse(".country")?;
You can then pass the selector to the select() method exposed by document:
let html_country_info_box_element = document
    .select(&html_country_info_box_selector)
    .next()
    .ok_or("Country info box element not found!")?;
That will apply the CSS selector on the page and return the selected HTML element. Since select() always returns an iterator, the .next() call is required to get the first country info box node.
Note that the elements returned by select() expose a select() method as well. In that case, it searches for nodes only among the descendants of the current node. So, you can implement the entire Rust web scraping logic as follows:
let country_name_selector = scraper::Selector::parse(".country-name")?;
let name = html_country_info_box_element
    .select(&country_name_selector)
    .next()
    .map(|element| element.text().collect::<String>().trim().to_owned())
    .ok_or("Country name not found")?;

let country_capital_selector = scraper::Selector::parse(".country-capital")?;
let capital = html_country_info_box_element
    .select(&country_capital_selector)
    .next()
    .map(|element| element.text().collect::<String>().trim().to_owned())
    .ok_or("Country capital not found")?;

let country_population_selector = scraper::Selector::parse(".country-population")?;
let population = html_country_info_box_element
    .select(&country_population_selector)
    .next()
    .map(|element| element.text().collect::<String>().trim().to_owned())
    .ok_or("Country population not found")?;

let country_area_selector = scraper::Selector::parse(".country-area")?;
let area = html_country_info_box_element
    .select(&country_area_selector)
    .next()
    .map(|element| element.text().collect::<String>().trim().to_owned())
    .ok_or("Country area not found")?;
The text() method returns an iterator over the text nodes contained in the selected HTML element, which is why the code collects them into a String. For other data extraction approaches, check out the docs. As the extracted text could contain unwanted whitespace, remove it with trim().
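Note that text content is not the only thing you can extract. As a hypothetical example, here is how you could read the href attribute of the first anchor inside the info box via value().attr() (the country boxes on this page may not actually contain links):

// select the first <a> node inside the info box, if any,
// and read its href attribute (illustrative example)
let link_selector = scraper::Selector::parse("a")?;
if let Some(link_element) = html_country_info_box_element.select(&link_selector).next() {
    let href = link_element.value().attr("href").unwrap_or_default();
    println!("First link: {href}");
}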
Print the scraped data to verify that the scraping logic works as expected:
println!("Country name: {name}");
println!("Country capital: {capital}");
println!("Country name: {population}");
println!("Country area: {area}");
That would produce:
Country name: Andorra
Country capital: Andorra la Vella
Country population: 84000
Country area: 468.0
Yes! You just performed web scraping in Rust!
Step #7: Scrape All Elements on the Page
This time, you will extend the code seen above to go through all the country information box nodes on the page.
First, you need to define a custom data structure in which to store the collected data. To specify a new struct tailored for that, add the following lines on top of your main.rs file:
struct Country {
    name: String,
    capital: String,
    population: String,
    area: String,
}
Second, instantiate a Vec of Country objects in main():
let mut countries: Vec<Country> = Vec::new();
This vector will contain all your scraped data.
Next, remove the .next() call to get all country info boxes, iterate over them, and populate countries:
// where to store the scraped data
let mut countries: Vec<Country> = Vec::new();

// select the country info box HTML elements
let html_country_info_box_selector = scraper::Selector::parse(".country")?;
let html_country_info_box_elements = document.select(&html_country_info_box_selector);

// iterate over the country HTML elements
// and scrape them all
for html_country_info_box_element in html_country_info_box_elements {
    // scraping logic for a single country info box HTML element...

    // create a new Country object and add it to the vector
    let country = Country {
        name,
        capital,
        population,
        area,
    };
    countries.push(country);
}
You can then print all scraped countries with:
// log the results
for country in countries {
    println!("Country name: {}", country.name);
    println!("Country capital: {}", country.capital);
    println!("Country population: {}", country.population);
    println!("Country area: {}", country.area);
    println!();
}
The new main.rs Rust web scraping file will contain:
// custom struct to store the scraped data
struct Country {
    name: String,
    capital: String,
    population: String,
    area: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // connect to the target page
    let response = reqwest::blocking::get("https://www.scrapethissite.com/pages/simple/")?;

    // extract the raw HTML
    let html = response.text()?;

    // parse the HTML document
    let document = scraper::Html::parse_document(&html);

    // where to store the scraped data
    let mut countries: Vec<Country> = Vec::new();

    // select the country info box HTML elements
    let html_country_info_box_selector = scraper::Selector::parse(".country")?;
    let html_country_info_box_elements = document.select(&html_country_info_box_selector);

    // iterate over the country HTML elements
    // and scrape them all
    for html_country_info_box_element in html_country_info_box_elements {
        // scraping logic for a single country info box HTML element
        let country_name_selector = scraper::Selector::parse(".country-name")?;
        let name = html_country_info_box_element
            .select(&country_name_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country name not found")?;

        let country_capital_selector = scraper::Selector::parse(".country-capital")?;
        let capital = html_country_info_box_element
            .select(&country_capital_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country capital not found")?;

        let country_population_selector = scraper::Selector::parse(".country-population")?;
        let population = html_country_info_box_element
            .select(&country_population_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country population not found")?;

        let country_area_selector = scraper::Selector::parse(".country-area")?;
        let area = html_country_info_box_element
            .select(&country_area_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country area not found")?;

        // create a new Country object and add it to the vector
        let country = Country {
            name,
            capital,
            population,
            area,
        };
        countries.push(country);
    }

    // log the results
    for country in countries {
        println!("Country name: {}", country.name);
        println!("Country capital: {}", country.capital);
        println!("Country population: {}", country.population);
        println!("Country area: {}", country.area);
        println!();
    }

    Ok(())
}
Launch it, and it will generate this output:
Country name: Andorra
Country capital: Andorra la Vella
Country population: 84000
Country area: 468.0
# omitted for brevity...
Country name: Zimbabwe
Country capital: Harare
Country population: 11651858
Country area: 390580.0
Mission complete! You just scraped all countries from the target page!
Step #8: Export the Extracted Data to CSV
The collected data is now stored in a Rust vector, which is not the best format if you want to share it with other people. That is why you need to export it to an easy-to-explore format, such as CSV.
To export data to a CSV file, you should use the csv library. Install it with this command:
cargo add csv
You can then use it to produce a CSV export file with:
// initialize the output CSV file
let mut writer = csv::Writer::from_path("countries.csv")?;

// write the CSV header
writer.write_record(&["name", "capital", "population", "area"])?;

// populate the file with each country
for country in countries {
    writer.write_record(&[
        country.name,
        country.capital,
        country.population,
        country.area,
    ])?;
}

// flush any buffered records to the file
writer.flush()?;
This snippet creates a CSV file, initializes it with the header row, and populates it by iterating over the countries vector. The final flush() call makes sure any buffered records actually reach the file.
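As an alternative worth knowing, the csv crate also integrates with serde. Assuming you add serde with cargo add serde --features derive, you can derive Serialize on Country and let the writer map struct fields to CSV columns automatically:

// requires: #[derive(serde::Serialize)] on the Country struct
let mut writer = csv::Writer::from_path("countries.csv")?;
for country in countries {
    // serialize() also writes the header row before the first record
    writer.serialize(country)?;
}
writer.flush()?;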
Step #9: Put It All Together
Here is the complete code of your web scraping Rust script:
// custom struct to store the scraped data
struct Country {
    name: String,
    capital: String,
    population: String,
    area: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // connect to the target page
    let response = reqwest::blocking::get("https://www.scrapethissite.com/pages/simple/")?;

    // extract the raw HTML
    let html = response.text()?;

    // parse the HTML document
    let document = scraper::Html::parse_document(&html);

    // where to store the scraped data
    let mut countries: Vec<Country> = Vec::new();

    // select the country info box HTML elements
    let html_country_info_box_selector = scraper::Selector::parse(".country")?;
    let html_country_info_box_elements = document.select(&html_country_info_box_selector);

    // iterate over the country HTML elements
    // and scrape them all
    for html_country_info_box_element in html_country_info_box_elements {
        // scraping logic for a single country info box HTML element
        let country_name_selector = scraper::Selector::parse(".country-name")?;
        let name = html_country_info_box_element
            .select(&country_name_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country name not found")?;

        let country_capital_selector = scraper::Selector::parse(".country-capital")?;
        let capital = html_country_info_box_element
            .select(&country_capital_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country capital not found")?;

        let country_population_selector = scraper::Selector::parse(".country-population")?;
        let population = html_country_info_box_element
            .select(&country_population_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country population not found")?;

        let country_area_selector = scraper::Selector::parse(".country-area")?;
        let area = html_country_info_box_element
            .select(&country_area_selector)
            .next()
            .map(|element| element.text().collect::<String>().trim().to_owned())
            .ok_or("Country area not found")?;

        // create a new Country object and add it to the vector
        let country = Country {
            name,
            capital,
            population,
            area,
        };
        countries.push(country);
    }

    // initialize the output CSV file
    let mut writer = csv::Writer::from_path("countries.csv")?;

    // write the CSV header
    writer.write_record(&["name", "capital", "population", "area"])?;

    // populate the file with each country
    for country in countries {
        writer.write_record(&[
            country.name,
            country.capital,
            country.population,
            country.area,
        ])?;
    }

    // flush any buffered records to the file
    writer.flush()?;

    Ok(())
}
Can you believe it? You can build a Rust data scraper in less than 100 lines of code.
Compile the application with the command below:
cargo build
Then, launch it with:
cargo run
When the script terminates, a countries.csv file will appear in the root folder of your project. Open it, and you will find a header row followed by one row of data per country.
Et voilà! You now know the basics of Rust web scraping!
Keep Your Web Scraping Operation Ethical and Respectful
Automatically retrieving data from the Internet is an effective way to get useful information. However, you do not want to harm the target site while doing it. Thus, you must approach that operation with the right precautions.
To perform responsible web scraping, consider these tips:
- Comply with the robots.txt file: Every site has a robots.txt file that specifies the rules on how automated crawlers should access its pages. To maintain ethical scraping practices, you must adhere to those guidelines. Learn more in our robots.txt for web scraping guide.
- Limit the frequency of your requests: Making too many requests in a short period can overload the server, degrading site performance for all users. That might also trigger rate-limiting measures and get you blocked. Thus, add random delays between your requests to avoid flooding the destination server (see the sketch after this list).
- Check and respect the site’s Terms of Service: Before scraping a website, review and abide by its Terms of Service. These may contain information on copyright, intellectual property rights, and guidelines on how and when to use their data.
- Scrape only publicly available information: Focus on extracting data that is publicly accessible on the site and not protected by login credentials or other forms of authorization. Scraping private or sensitive data without proper permission is unethical and may lead to legal consequences.
- Rely on trustworthy and up-to-date scraping tools: Select reputable providers and opt for libraries and tools that are well-maintained and regularly updated. Only then can you ensure that they are in line with the latest ethical scraping principles and best practices. If you have any doubts, read our article on how to choose the best web scraping service.
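For example, here is a minimal sketch of a random delay between requests. It assumes the rand crate has been added with cargo add rand:

use rand::Rng;
use std::{thread, time::Duration};

fn polite_pause() {
    // sleep for a random interval between 1 and 3 seconds
    let millis = rand::thread_rng().gen_range(1000..=3000);
    thread::sleep(Duration::from_millis(millis));
}

Call polite_pause() before each new request to space out your traffic.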
Conclusion
In this tutorial, you saw why Rust is a good option for web scraping and what libraries you should use to perform it. Here, you learned how to use reqwest and scraper to build a Rust web scraper that can extract data from a real-world site. That takes only a few lines of code!
However, keep in mind that web scraping may not always be that easy. The reason is that anti-scraping and anti-bot solutions are becoming more common. These technologies can detect the automated nature of your script and block it, posing a serious challenge to your scraping operation.
Avoid that headache with the next-generation web scraping tools provided by Bright Data. If you want to find out more about how to avoid being blocked, adopt a web proxy from one of the several proxy services available or start using the advanced Web Unlocker.
Don’t want to deal with web scraping? Explore our datasets.
Not sure which product to choose? Sign up now and find the right solution for your business.