TL;DR: This tutorial will show how to extract data from a website in C++ and why it is one of the most efficient languages for scraping.
This guide will cover:
- Is C++ a good language for web scraping?
- Best C++ web scraping libraries
- How to build a web scraper in C++
Is C++ a Good Language for Web Scraping?
C++ is a statically typed programming language widely used to develop high-performance applications, thanks to its speed, efficiency, and memory management capabilities. It is a versatile language that comes in handy in a wide range of applications, including web scraping.
C++ is a compiled language and is inherently faster than interpreted languages, such as Python. This makes it an excellent choice for building fast scrapers. However, C++ is not designed for web development and there are not many libraries available for web scraping. While there are some third-party packages, the options are not as extensive as in Python, Ruby, or Java.
In summary, web scraping in C++ is possible and efficient but requires more low-level programming compared to other languages. Let’s find out what tools can make this process easier!
Best C++ Web Scraping Libraries
Here are some popular web scraping libraries for C++:
- CPR: A modern C++ HTTP client library inspired by the Python Requests project. It is a wrapper around libcurl that provides an easy-to-understand interface, built-in authentication capabilities, and support for asynchronous calls (see the short sketch after this list).
- libxml2: A powerful and full-featured library for parsing XML and HTML documents originally developed for Gnome. It supports DOM manipulation via XPath selectors.
- Lexbor: A fast and lightweight HTML parsing library entirely written in C with support for CSS selectors. It is only available for Linux.
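To give you a feel for CPR's Requests-like interface, here is a minimal sketch of a synchronous and an asynchronous GET request (the URL is a placeholder):

#include "cpr/cpr.h"
#include <iostream>

int main()
{
    // synchronous GET request
    cpr::Response response = cpr::Get(cpr::Url{"https://example.com/"});
    std::cout << response.status_code << std::endl;

    // asynchronous GET request: GetAsync() returns immediately
    cpr::AsyncResponse async_response = cpr::GetAsync(cpr::Url{"https://example.com/"});
    // block until the response is available
    cpr::Response resolved = async_response.get();
    std::cout << resolved.status_code << std::endl;
}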
For years, the most widely used HTML parser for C++ was Gumbo, but it has not been maintained since 2016 and even its official README now advises against using it.
Prerequisites
Before diving into coding, you need a C++ compiler, the vcpkg package manager, and CMake. Follow the guide below for your operating system to learn how to meet these prerequisites.
Set Up C++ on macOS
On macOS, the most popular C, C++, and Objective-C compiler is Clang. Keep in mind that many Macs come with Clang preinstalled. To verify that, open a terminal and launch the command below:
clang --version
If you get a `command not found: clang` error, it means Clang is not installed or configured correctly. In that case, you can install it via the Xcode command-line tools:
xcode-select --install
This may take a while, so be patient.
To set up `vcpkg`, you will need the macOS Developer Tools first. Add them to your Mac with:
xcode-select --install
Then, you have to install `vcpkg` globally. Create a `/dev` folder, enter it in the terminal, and run:
git clone https://github.com/microsoft/vcpkg
The directory will now contain the source code. Build the package manager with:
./vcpkg/bootstrap-vcpkg.sh
To run this command you may need elevated privileges.
Lastly, add `/dev/vcpkg` to your `$PATH` environment variable.
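For example, assuming the default zsh shell and the `/dev/vcpkg` location used above, appending the line below to `~/.zshrc` and restarting the terminal typically does the trick:

export PATH="/dev/vcpkg:$PATH"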
To install CMake, download the installer from the official site, launch it, and follow the installation wizard.
Set Up C++ on Windows
Download the MSYS2 installer, launch it, and follow the instructions. This package provides up-to-date native builds of GCC, Mingw-w64, and other helpful C++ tools and libraries.
In the MSYS2 terminal opened at the end of the installation process, run the command below to install the Mingw-w64 toolchain:
pacman -S --needed base-devel mingw-w64-x86_64-toolchain
Wait for the process to end and then add MinGW to the `PATH` environment variable.
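Assuming the default MSYS2 install location, the MinGW binaries live in `C:\msys64\mingw64\bin`. You can append that folder to your user `PATH` through the “Environment Variables” dialog; from PowerShell, the sketch below achieves the same (note that `setx` truncates values longer than 1024 characters, so the graphical dialog is the safer option):

setx PATH "$env:PATH;C:\msys64\mingw64\bin"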
Next, you need to install `vcpkg` globally. Create a `C:/dev` folder, open it in PowerShell, and execute:
git clone https://github.com/microsoft/vcpkg
Build the source code of the package manager contained in the `vcpkg` sub-folder with:
./vcpkg/bootstrap-vcpkg.bat
Now, add `C:/dev/vcpkg` to your `PATH` as done before.
It only remains to install CMake. Download the installer, double-click on it, and make sure to check the option to add CMake to the system `PATH` during the setup.
Set Up C++ on Linux
On Debian-based distributions, install GCC (GNU Compiler Collection), CMake, and other useful packages for development with:
sudo apt install build-essential cmake
This might take some time, so be patient.
Next, you need to globally install `vcpkg`. Create a `/dev` directory, open it in the terminal, and type:
git clone https://github.com/microsoft/vcpkg
The `vcpkg` sub-directory will now contain the source code of the package manager. Build the tool with:
./vcpkg/bootstrap-vcpkg.sh
Note that this command may require admin privileges.
Then, add `/dev/vcpkg` to your `$PATH` environment variable.
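As on macOS, a common approach (assuming bash and the `/dev/vcpkg` location used above) is to append the export line to `~/.bashrc` and reload it:

echo 'export PATH="/dev/vcpkg:$PATH"' >> ~/.bashrc
source ~/.bashrc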
Perfect! You now have everything you need to get started with C++ web scraping!
How to Build a Web Scraper in C++
In this chapter, you will learn how to code a C++ web spider. The target site will be the Bright Data home page and the script will take care of:
- Connecting to the webpage
- Selecting the HTML elements of interest from the DOM
- Retrieving data from them
- Exporting the scraped data to CSV
Keep in mind that the Bright Data home page changes frequently, so it may look different by the time you read this article. Some interesting data to extract from the page is the industry info contained in the industry cards.
The scraping goal for this step-by-step tutorial has been defined. Let’s see how to do web scraping with C++!
Step 1: Initialize a C++ scraping project
First, you need a folder to place your C++ project in. Open the terminal and create the project directory with:
mkdir c++-web-scraper
This will contain your scraping script.
When building software in C++, you should opt for an IDE. Specifically, you are about to see how to set up Visual Studio Code (VS Code) for C++ development with `vcpkg` as the package manager. Note that similar procedures can be applied to other C++ IDEs.
VS Code does not offer built-in support for C++, so you first have to add the C/C++ plugin. Launch Visual Studio Code, click on the “Extensions” icon in the left bar, and type “C++” in the search field at the top.
Click the “Install” button on the first element to add C++ development functionality to VS Code. Wait for the extension to be set up, and then open the `c++-web-scraper` folder with “File” > “Open Folder…”.
Right-click in the “EXPLORER” section, select “New File…”, and initialize a `scraper.cpp` file as follows:
#include <iostream>

int main()
{
    std::cout << "Hello World" << std::endl;
}
You now have a C++ project!
Step 2: Install the scraping libraries
C++'s cumbersome syntax and its limited web capabilities can represent an obstacle when building a web scraper. To make everything easier, you should adopt some C++ web scraping libraries. As mentioned before, the choice is pretty limited, so you should go for the most popular ones: `cpr` and `libxml2`.
You can install them on Windows through `vcpkg` with:
vcpkg install cpr libxml2 --triplet=x64-windows
On macOS, replace the triplet option with `x64-osx`. On Linux, use `x64-linux`.
In the Visual Studio Code terminal, you also need to run the following command in the root directory of your project:
vcpkg integrate install
This will enable the linking of `vcpkg` packages to the project.
Restart VS Code, and you can now import any installed library with `#include`. So, add the following three lines on top of your `scraper.cpp` file:
#include "cpr/cpr.h"
#include "libxml/HTMLparser.h"
#include "libxml/xpath.h"
Make sure that the IDE does not report any errors.
Step 3: Finalize the C++ project initialization
To build the C++ scraping script and complete the project initialization process, you have to add the CMake Tools extension to VS Code.
If your project does not have a `.vscode` folder, create it. That is where VS Code looks for configurations related to the current project.
Configure CMake Tools to use `vcpkg` as a toolchain by creating a `settings.json` file inside the `.vscode` folder as follows:
{
  "cmake.configureSettings": {
    "CMAKE_TOOLCHAIN_FILE": "c:/dev/vcpkg/scripts/buildsystems/vcpkg.cmake"
  }
}
On macOS and Linux, fix the `CMAKE_TOOLCHAIN_FILE` field according to the path you installed `vcpkg` in. If you followed the setup guide above, it should be `/dev/vcpkg/scripts/buildsystems/vcpkg.cmake`.
In the main search bar of VS Code, type “>cmake” and select the “CMake: Configure” option. This will allow you to select the target compilation platform. On Windows, opt for “Visual Studio Build Tools 2019 Release – x86_amd64”.
Add the `CMakeLists.txt` file in the root folder of your project to set up CMake:
cmake_minimum_required(VERSION 3.0.0)
project(main VERSION 0.1.0)
INCLUDE_DIRECTORIES(
    C:/dev/vcpkg/installed/x64-windows/include
)
LINK_DIRECTORIES(
    C:/dev/vcpkg/installed/x64-windows/lib
)
add_executable(main scraper.cpp)
target_compile_features(main PRIVATE cxx_std_20)
find_package(cpr CONFIG REQUIRED)
target_link_libraries(main PRIVATE cpr::cpr)
find_package(LibXml2 REQUIRED)
target_link_libraries(main PRIVATE LibXml2::LibXml2)
Note that it involves the two packages installed earlier. Make sure to update `INCLUDE_DIRECTORIES` and `LINK_DIRECTORIES` according to your `vcpkg` installation folder.
To allow Visual Studio Code to run the C++ program, you need a launch configuration file. In the `.vscode` folder, initialize `launch.json` as below:
{
  "configurations": [
    {
      "name": "C++ Launch (Windows)",
      "type": "cppvsdbg",
      "request": "launch",
      "program": "${workspaceFolder}/build/Debug/main.exe",
      "args": [],
      "stopAtEntry": false,
      "cwd": "${workspaceFolder}",
      "environment": []
    }
  ]
}
When launching the run or debug command, VS Code will now execute the file in the `program` path produced by CMake. Note that on macOS and Linux it will not be a `.exe` file.
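On Linux, for instance, an equivalent configuration would look something like the sketch below, assuming the default CMake output path and gdb as the debugger:

{
  "configurations": [
    {
      "name": "C++ Launch (Linux)",
      "type": "cppdbg",
      "request": "launch",
      "program": "${workspaceFolder}/build/main",
      "args": [],
      "stopAtEntry": false,
      "cwd": "${workspaceFolder}",
      "environment": [],
      "MIMode": "gdb"
    }
  ]
}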
The configuration is ready!
Every time you want to debug or build your app, type “>cmake: Build” in the top input field and select the “CMake: Build” option.
Wait for the build process to end and run the compiled program from the “Run & Debug” section or by pressing F5. You will see the result of your application in the VS Code debug console.
Great! It is time to start scraping some data in C++!
Step 4: Download the target page with CPR
If you want to extract data from a page, you first have to retrieve its HTML document through an HTTP `GET` request.
Use CPR to download the target page with:
cpr::Response response = cpr::Get(cpr::Url{"https://brightdata.com/"});
Behind the scenes, the `Get()` method performs a `GET` request to the URL passed as a parameter. `response.text` will contain the string representation of the HTML code returned by the server.
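Before moving on, it is good practice to verify that the request succeeded. A minimal sketch, using the `status_code` field exposed by `cpr::Response`:

if (response.status_code != 200)
{
    std::cerr << "Request failed with status " << response.status_code << std::endl;
    return 1;
}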
Note that performing automated HTTP requests can trigger anti-bot technologies. These may intercept your requests, preventing your script from accessing the target site. Specifically, the most basic anti-scraping solutions block incoming requests without a valid `User-Agent` HTTP header. Learn more in our guide on User-Agents for web scraping.
Just like any other HTTP client, CPR uses a placeholder value for `User-Agent`. Since this is very different from the agents used by popular browsers, anti-bot systems can easily spot you. To avoid getting blocked for that reason, you can set a valid `User-Agent` in CPR with:
cpr::Header headers = {{"User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"}};
cpr::Response response = cpr::Get(cpr::Url{"https://brightdata.com/"}, headers);
The HTTP request made through that `Get()` will now appear as coming from Google Chrome 113.
This is what `scraper.cpp` currently contains:
#include <iostream>
#include "cpr/cpr.h"
#include "libxml/HTMLparser.h"
#include "libxml/xpath.h"
int main()
{
    // define the user agent for the GET request
    cpr::Header headers = {{"User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"}};
    // make the HTTP request to retrieve the target page
    cpr::Response response = cpr::Get(cpr::Url{"https://brightdata.com/"}, headers);

    // scraping logic...
}
Step 5: Parse HTML content with libxml2
To make the HTML document returned by the server easily explorable, you should first parse it.
Pass its C string representation to the libxml2 `htmlReadMemory()` function to achieve that:
htmlDocPtr doc = htmlReadMemory(response.text.c_str(), response.text.length(), nullptr, nullptr, HTML_PARSE_NOWARNING | HTML_PARSE_NOERROR);
The `doc` variable now exposes the DOM exploration API offered by libxml2. In particular, you can retrieve HTML elements on the page through XPath selectors. At the time of writing, libxml2 does not support CSS selectors.
Step 6: Define the XPath selectors to get the desired HTML elements
To define an effective XPath selection strategy for the HTML nodes of interest, you need to analyze the DOM of the target page. Open the Bright Data home page in the browser, right-click on one of the industry cards, and choose “Inspect” to open the DevTools section.
Explore the HTML code and you will notice that each industry card is a `<div>` element that contains:
- A `<figure>` element with an `<img>` representing the image of the industry and an `<a>` containing the URL to the industry page.
- A `<div>` HTML element storing the industry name in an `<a>`.
For each card, the goal of the C++ scraper is to extract:
- The industry image URL
- The industry page URL
- The industry name
To define proper XPath selectors, shift your attention to the DOM structure of the elements of interest. You will notice that you can get all the industry cards with the XPath selector below:
//div[contains(@class, 'section_cases_row_col_item')]
If you have any doubts, test XPath instructions in the browser console with `$x()`.
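For instance, running the line below in the DevTools console should return the list of industry card nodes:

$x("//div[contains(@class, 'section_cases_row_col_item')]")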
Given a card, you can get the desired nodes with:
.//figure/a/img
.//figure/a
.//div[contains(@class, 'elementor-image-box-title')]/a
Step 7: Scrape data from a webpage with libxml2
You can now use libxml2 to apply the XPath selectors defined before and get the desired data from the target HTML webpage.
First, you need a data structure whose instances will store the scraped data:
struct IndustryCard
{
    std::string image;
    std::string url;
    std::string name;
};
In C++, a `struct` allows you to bundle several data attributes under the same name in a block of memory.
Then, initialize a `std::vector` of `IndustryCard`s in the `main()` function:
std::vector<IndustryCard> industry_cards;
This will store all the scraped data objects. Populate this `vector` with the following C++ web scraping logic:
// define an array to store all retrieved data
std::vector<IndustryCard> industry_cards;
// set the libxml2 context to the current document
xmlXPathContextPtr context = xmlXPathNewContext(doc);
// select all industry card HTML elements
// with an XPath selector
xmlXPathObjectPtr industry_card_html_elements = xmlXPathEvalExpression((xmlChar *)"//div[contains(@class, 'section_cases_row_col_item')]", context);

// iterate over the list of industry card elements
for (int i = 0; i < industry_card_html_elements->nodesetval->nodeNr; ++i)
{
    // get the current element of the loop
    xmlNodePtr industry_card_html_element = industry_card_html_elements->nodesetval->nodeTab[i];
    // set the libxml2 context to the current element
    // to limit the XPath selectors to its children
    xmlXPathSetContextNode(industry_card_html_element, context);

    xmlNodePtr image_html_element = xmlXPathEvalExpression((xmlChar *)".//figure/a/img", context)->nodesetval->nodeTab[0];
    std::string image = std::string(reinterpret_cast<char *>(xmlGetProp(image_html_element, (xmlChar *)"data-lazy-src")));
    xmlNodePtr url_html_element = xmlXPathEvalExpression((xmlChar *)".//figure/a", context)->nodesetval->nodeTab[0];
    std::string url = std::string(reinterpret_cast<char *>(xmlGetProp(url_html_element, (xmlChar *)"href")));
    xmlNodePtr name_html_element = xmlXPathEvalExpression((xmlChar *)".//div[contains(@class, 'elementor-image-box-title')]/a", context)->nodesetval->nodeTab[0];
    std::string name = std::string(reinterpret_cast<char *>(xmlNodeGetContent(name_html_element)));

    // instantiate an IndustryCard struct with the collected data
    IndustryCard industry_card = {image, url, name};
    // add the object with the scraped data to the vector
    industry_cards.push_back(industry_card);
}

// free up the resources allocated by libxml2
xmlXPathFreeObject(industry_card_html_elements);
xmlXPathFreeContext(context);
xmlFreeDoc(doc);
The snippet above selects the industry cards by applying the XPath selector defined earlier with `xmlXPathEvalExpression()`. Then, it iterates over them and implements a similar approach to get the child elements of interest from each card. Next, it scrapes the industry image URL, page URL, and name from them. Finally, it frees up the resources allocated by libxml2.
As you can see, web scraping using C++ with libxml2 is not that complex. Thanks to `xmlGetProp()` and `xmlNodeGetContent()`, you can get the value of an HTML attribute and the content of a node, respectively.
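One caveat worth knowing: `xmlGetProp()` and `xmlNodeGetContent()` return newly allocated buffers that the caller is expected to release with `xmlFree()`. The tutorial code above skips this for brevity, so it leaks a small amount of memory per card. A minimal sketch of a stricter pattern, using a hypothetical helper:

// hypothetical helper: copy an attribute value into a std::string
// and release the buffer allocated by libxml2
std::string get_prop(xmlNodePtr node, const char *name)
{
    xmlChar *value = xmlGetProp(node, (xmlChar *)name);
    if (value == nullptr)
    {
        return "";
    }
    std::string result(reinterpret_cast<char *>(value));
    xmlFree(value);
    return result;
}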
Now that you know how data scraping in C++ works, you have the tools to go one step further and scrape the industry pages as well. You only have to follow the links discovered here and devise new scraping logic. This is what web crawling and web scraping are all about!
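As a rough sketch of that next step (not part of the original script), you could loop over the collected URLs and download each industry page with CPR before parsing it the same way:

for (const IndustryCard &industry_card : industry_cards)
{
    // download the industry page discovered on the home page
    cpr::Response page_response = cpr::Get(cpr::Url{industry_card.url}, headers);
    // parse page_response.text with htmlReadMemory() and apply
    // new XPath selectors tailored to the industry page layout...
}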
Amazing! You just achieved your goals. The tutorial is not over yet, though.
Step 8: Export the scraped data to CSV
At the end of the `for` loop, `industry_cards` will store the scraped data in `struct` instances. As you can imagine, that is not the best format to provide data to other teams. That is why you should convert the retrieved data to CSV.
You can export a `vector` to a CSV file with built-in C++ functions as follows:
// initialize the CSV output file (this requires #include <fstream>)
std::ofstream csv_file("output.csv");
// write the CSV header
csv_file << "url,image,name" << std::endl;
// populate the CSV output file
for (IndustryCard industry_card : industry_cards)
{
    // transform each industry card record into a CSV record
    csv_file << industry_card.url << "," << industry_card.image << "," << industry_card.name << std::endl;
}
// free up the file resources
csv_file.close();
The code above creates an `output.csv` file and initializes it with the header record. Then, it iterates over the `industry_cards` array, converts each element to a string in CSV format, and appends it to the output file.
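Note that this simple writer assumes no scraped value contains a comma or a double quote. If that assumption may not hold, wrap each field with a small helper like the hypothetical one below before writing it:

// hypothetical helper: quote a CSV field and escape embedded double quotes
std::string csv_escape(const std::string &field)
{
    std::string escaped = "\"";
    for (char c : field)
    {
        if (c == '"')
        {
            escaped += "\"\"";
        }
        else
        {
            escaped += c;
        }
    }
    escaped += "\"";
    return escaped;
}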
Build your C++ scraping script and run it. You will find an `output.csv` file in the root directory of your project, with one record per industry card.
Well done! Now you know how to export scraped data to CSV in C++!
Step 9: Put it all together
Here is the entire C++ scraper:
// scraper.cpp
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "cpr/cpr.h"
#include "libxml/HTMLparser.h"
#include "libxml/xpath.h"

// define a struct where to store the scraped data
struct IndustryCard
{
    std::string image;
    std::string url;
    std::string name;
};

int main()
{
    // define the user agent for the GET request
    cpr::Header headers = {{"User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"}};
    // make an HTTP GET request to retrieve the target page
    cpr::Response response = cpr::Get(cpr::Url{"https://brightdata.com/"}, headers);

    // parse the HTML document returned by the server
    htmlDocPtr doc = htmlReadMemory(response.text.c_str(), response.text.length(), nullptr, nullptr, HTML_PARSE_NOWARNING | HTML_PARSE_NOERROR);

    // define an array to store all retrieved data
    std::vector<IndustryCard> industry_cards;
    // set the libxml2 context to the current document
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    // select all industry card HTML elements
    // with an XPath selector
    xmlXPathObjectPtr industry_card_html_elements = xmlXPathEvalExpression((xmlChar *)"//div[contains(@class, 'section_cases_row_col_item')]", context);

    // iterate over the list of industry card elements
    for (int i = 0; i < industry_card_html_elements->nodesetval->nodeNr; ++i)
    {
        // get the current element of the loop
        xmlNodePtr industry_card_html_element = industry_card_html_elements->nodesetval->nodeTab[i];
        // set the libxml2 context to the current element
        // to limit the XPath selectors to its children
        xmlXPathSetContextNode(industry_card_html_element, context);

        xmlNodePtr image_html_element = xmlXPathEvalExpression((xmlChar *)".//figure/a/img", context)->nodesetval->nodeTab[0];
        std::string image = std::string(reinterpret_cast<char *>(xmlGetProp(image_html_element, (xmlChar *)"data-lazy-src")));
        xmlNodePtr url_html_element = xmlXPathEvalExpression((xmlChar *)".//figure/a", context)->nodesetval->nodeTab[0];
        std::string url = std::string(reinterpret_cast<char *>(xmlGetProp(url_html_element, (xmlChar *)"href")));
        xmlNodePtr name_html_element = xmlXPathEvalExpression((xmlChar *)".//div[contains(@class, 'elementor-image-box-title')]/a", context)->nodesetval->nodeTab[0];
        std::string name = std::string(reinterpret_cast<char *>(xmlNodeGetContent(name_html_element)));

        // instantiate an IndustryCard struct with the collected data
        IndustryCard industry_card = {image, url, name};
        // add the object with the scraped data to the vector
        industry_cards.push_back(industry_card);
    }

    // free up the resources allocated by libxml2
    xmlXPathFreeObject(industry_card_html_elements);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);

    // initialize the CSV output file
    std::ofstream csv_file("output.csv");
    // write the CSV header
    csv_file << "url,image,name" << std::endl;
    // populate the CSV output file
    for (IndustryCard industry_card : industry_cards)
    {
        // transform each industry card record into a CSV record
        csv_file << industry_card.url << "," << industry_card.image << "," << industry_card.name << std::endl;
    }
    // free up the file resources
    csv_file.close();

    return 0;
}
Et voilà! In around 80 lines of code, you can create a data scraping script in C++!
Conclusion
In this tutorial, you learned why C++ is an efficient language for scraping the web. Although it does not offer as many scraping libraries as other languages, there are some, and here you had the opportunity to see the most popular ones. Next, you looked at how to use CPR and libxml2 to build a spider in C++ that can collect data from a real target.
However, many challenges come with web scraping. An increasing number of sites implement anti-bot and anti-scraping technologies to protect their data. These tools can detect the automated requests performed by your C++ scraping script and block them. Luckily, there are many automated solutions for your data collection needs. Contact us to find out the best solution for your use case.
Don’t want to deal with web scraping at all but are interested in web data? Explore our ready-to-use datasets.