Web Scraping in C++: A Step-by-Step Guide

Learn how to scrape websites using C++ in this step-by-step guide.

TL;DR: This tutorial will show how to extract data from a website in C++ and why it is one of the most efficient languages for scraping.

Is C++ a Good Language for Web Scraping?

C++ is a statically typed programming language widely used for developing high-performance applications, thanks to its speed, efficiency, and fine-grained memory management. It is a versatile language that comes in handy in a wide range of applications, including web scraping.

C++ is a compiled language and is inherently faster than interpreted languages, such as Python. This makes it an excellent choice for building fast scrapers. However, C++ is not designed for web development and there are not many libraries available for web scraping. While there are some third-party packages, the options are not as extensive as in Python, Ruby, or Java.

In summary, web scraping in C++ is possible and efficient but requires more low-level programming compared to other languages. Let’s find out what tools can make this process easier!

Best C++ Web Scraping Libraries

Here are some popular web scraping libraries for C++:

  • CPR: A modern C++ HTTP client library inspired by the Python Requests project. It is a wrapper of libcurl that provides an easy-to-understand interface, built-in authentication capabilities, and support for asynchronous calls (see the sketch after this list).
  • libxml2: A powerful and full-featured library for parsing XML and HTML documents originally developed for Gnome. It supports DOM manipulation via XPath selectors.
  • Lexbor: A fast and lightweight HTML parsing library entirely written in C with support for CSS selectors. It is only available for Linux.
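
As a taste of CPR's asynchronous API mentioned above, here is a minimal sketch (it assumes CPR is already installed and uses example.com as a placeholder URL):

#include "cpr/cpr.h"
#include <iostream>

int main()
{
    // launch the GET request without blocking the current thread
    cpr::AsyncResponse future_response = cpr::GetAsync(cpr::Url{"https://example.com"});
    // ... do other work here ...
    // block until the response is available and read it
    cpr::Response response = future_response.get();
    std::cout << response.status_code << std::endl;
}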

For years, the most widely used HTML parser for C++ was Gumbo. However, it has not been maintained since 2016, and even its official README now advises against using it.

Prerequisites

Before diving into coding, you need to:

  1. Have a C++ compiler
  2. Set up the vcpkg C++ package manager
  3. Install CMake

Follow the guide below for your operating system and learn how to meet those prerequisites.

Set Up C++ on macOS

On macOS, the most popular C, C++, and Objective-C compiler is Clang. Keep in mind that many Macs come with Clang preinstalled. To verify that, open a terminal and launch the command below:

clang --version

If you get a command not found: clang error, it means Clang is not installed or configured correctly. In that case, you can install it via the Xcode command-line tools:

xcode-select --install

This may take a while, so be patient.

To set up vcpkg, you first need the macOS Developer Tools. If you did not install the Xcode command-line tools above, add them to your Mac with:

xcode-select --install

Then, you have to install vcpkg globally. Create a /dev folder, enter it in the terminal, and run:

git clone https://github.com/microsoft/vcpkg 

The directory will now contain the source code. Build the package manager with:

./vcpkg/bootstrap-vcpkg.sh

To run this command you may need elevated privileges.

Lastly, add /dev/vcpkg to your $PATH environment variable.
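
For example, assuming you cloned vcpkg into /dev/vcpkg and use zsh (the default shell on recent macOS versions), you could append the line below to your ~/.zshrc file and restart the terminal:

export PATH="/dev/vcpkg:$PATH"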

To install CMake, download the installer from the official site, launch it, and follow the installation wizard.

Set Up C++ on Windows

Download the MinGW-x64 installer from MSYS2, launch it, and follow the instructions. This package provides up-to-date native builds of GCC, Mingw-w64, and other helpful C++ tools and libraries.

In the MSYS2 terminal opened at the end of the installation process, run the command below to install the Mingw-w64 toolchain:

pacman -S --needed base-devel mingw-w64-x86_64-toolchain

Wait for the process to end and then add the MinGW bin folder to your Windows PATH environment variable.

Next, you need to install vcpkg globally. Create a C:/dev folder, open it in PowerShell, and execute:

git clone https://github.com/microsoft/vcpkg 

Build the source code of the package manager contained in the vcpkg sub-folder with:

./vcpkg/bootstrap-vcpkg.bat

Now, add C:/dev/vcpkg to your PATH as done before.

It only remains to install CMake. Download the installer, double-click on it, and make sure to check the option to add CMake to the system PATH during the setup.

Set Up C++ on Linux

On Debian-based distributions, install GCC (GNU Compiler Collection), CMake, and other useful packages for development with:

sudo apt install build-essential cmake

This might take some time, so be patient.

Next, you need to globally install vcpkg. Create a /dev directory, open it in the terminal, and type:

git clone https://github.com/microsoft/vcpkg 

The vcpkg sub-directory will now contain the source code of the package manager. Build the tool with:

./vcpkg/bootstrap-vcpkg.sh

Note that this command may require admin privileges.

Then, add /dev/vcpkg to your $PATH environment variable, for example by appending the export line shown in the macOS section to your ~/.bashrc file.

Perfect! You now have everything you need to get started with C++ web scraping!

How to Build a Web Scraper in C++

In this chapter, you will learn how to code a C++ web spider. The target site will be the Bright Data home page and the script will take care of:

  • Connecting to the webpage
  • Selecting the HTML elements of interest from the DOM
  • Retrieving data from them
  • Exporting the scraped data to CSV

Keep in mind that the Bright Data home page changes frequently, so it may look different by the time you read this article.

Some interesting data to extract from the page is the industry info contained in the industry cards.

The scraping goal for this step-by-step tutorial has been defined. Let’s see how to do web scraping with C++!

Step 1: Initialize a C++ scraping project

First, you need a folder to place your C++ project in. Open the terminal and create the project directory with:

mkdir c++-web-scraper

This will contain your scraping script.

When building software in C++, an IDE makes your life easier. Here, you will see how to set up Visual Studio Code (VS Code) for C++ development with vcpkg as the package manager. Note that similar procedures can be applied to other C++ IDEs.

VS Code does not offer built-in support for C++, so you first have to add the C/C++ plugin. Launch Visual Studio Code, click on the “Extensions” icon in the left bar, and type “C++” in the search field at the top.

Click the “Install” button on the first result to add C++ development functionality to VS Code. Wait for the extension to be set up and then open the c++-web-scraper folder with “File” > “Open Folder...”.

Right-click in the “EXPLORER” section, select “New File…” and initialize a scraper.cpp file as follows:

#include <iostream>

int main()
{
    std::cout << "Hello World" << std::endl;
}

You now have a C++ project!

Step 2: Install the scraping libraries

C++'s cumbersome syntax and limited built-in web capabilities can represent an obstacle when building a web scraper. To make everything easier, you should adopt some C++ web scraping libraries. As mentioned before, the choice is pretty limited, so you should go for the two most popular ones: CPR and libxml2.

You can install them on Windows through vcpkg with:

vcpkg install cpr libxml2 --triplet=x64-windows

On macOS, replace the triplet option with x64-osx. On Linux, use x64-linux.

In the Visual Studio Code terminal, you also need to run the following command in the root directory of your project:

vcpkg integrate install

This will enable the linking of vcpkg packages to the project.

Restart VS Code and you can now import any installed library with #include. So, add the following three lines on top of your scraper.cpp file:

#include "cpr/cpr.h"
#include "libxml/HTMLparser.h"
#include "libxml/xpath.h"

Make sure that the IDE does not report any errors.

Step 3: Finalize the C++ project initialization

To build the C++ scraping script and complete the project initialization process, you have to add the CMake Tools extension to VS Code.

If your project does not have a .vscode folder, create it. That is where VS Code looks for configurations related to the current project.

Configure CMake Tools to use vcpkg as a toolchain by creating a settings.json file inside the .vscode folder as follows:

{
  "cmake.configureSettings": {
    "CMAKE_TOOLCHAIN_FILE": "c:/dev/vcpkg/scripts/buildsystems/vcpkg.cmake"
  }
}

On macOS and Linux, fix the CMAKE_TOOLCHAIN_FILE field according to the path you installed vcpkg in. If you followed the setup guide above, it should be /dev/vcpkg/scripts/buildsystems/vcpkg.cmake.
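
For instance, if you cloned vcpkg into /dev/vcpkg as described above, the file would contain:

{
  "cmake.configureSettings": {
    "CMAKE_TOOLCHAIN_FILE": "/dev/vcpkg/scripts/buildsystems/vcpkg.cmake"
  }
}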

In the main search bar of VS Code, type “>cmake” and select the “CMake: Configure” option.

This will allow you to select the target compilation platform. On Windows, opt for “Visual Studio Build Tools 2019 Release – x86_amd64”.

Add the CMakeLists.txt file in the root folder of your project to set up CMake:

cmake_minimum_required(VERSION 3.0.0)
project(main VERSION 0.1.0)

INCLUDE_DIRECTORIES(
  C:/dev/vcpkg/installed/x64-windows/include
)

LINK_DIRECTORIES(
  C:/dev/vcpkg/installed/x64-windows/lib
)

add_executable(main scraper.cpp)
target_compile_features(main PRIVATE cxx_std_20)

find_package(cpr CONFIG REQUIRED)
target_link_libraries(main PRIVATE cpr::cpr)

find_package(LibXml2 REQUIRED)
target_link_libraries(main PRIVATE LibXml2::LibXml2)

Note that it involves the two packages installed earlier. Make sure to update INCLUDE_DIRECTORIES and LINK_DIRECTORIES according to your vcpkg installation folder and triplet.
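
For example, on macOS with the x64-osx triplet used earlier, the two directory blocks would presumably become:

INCLUDE_DIRECTORIES(
  /dev/vcpkg/installed/x64-osx/include
)

LINK_DIRECTORIES(
  /dev/vcpkg/installed/x64-osx/lib
)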

To allow Visual Studio Code to run the C++ program, you need a launch configuration file. In the .vscode folder, initialize launch.json as below:

{
  "configurations": [
    {
      "name": "C++ Launch (Windows)",
      "type": "cppvsdbg",
      "request": "launch",
      "program": "${workspaceFolder}/build/Debug/main.exe",
      "args": [],
      "stopAtEntry": false,
      "cwd": "${workspaceFolder}",
      "environment": []
    }
  ]
}

When you launch the run or debug command, VS Code will now execute the file in the program path produced by CMake. Note that on macOS and Linux the compiled binary will not be a .exe file.
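
On those platforms, a possible starting point (not covered by this tutorial and to be adapted to your gdb/lldb setup) is a cppdbg configuration like this sketch:

{
  "configurations": [
    {
      "name": "C++ Launch (macOS/Linux)",
      "type": "cppdbg",
      "request": "launch",
      "program": "${workspaceFolder}/build/main",
      "args": [],
      "stopAtEntry": false,
      "cwd": "${workspaceFolder}",
      "environment": []
    }
  ]
}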

The configuration is ready!

Every time you want to debug or build your app, type “>cmake: Build” in the top input field and select the “CMake: Build” option.

Wait for the build process to end and run the compiled program from the “Run & Debug” section or by pressing F5. You will see the result of your application in the VS Code debug console.

Great! It is time to start scraping some data in C++!

Step 4: Download the target page with CPR

If you want to extract data from a page, you first have to retrieve its HTML document through an HTTP GET request.

Use CPR to download the target page with:

cpr::Response response = cpr::Get(cpr::Url{"https://brightdata.com/"});

Behind the scenes, the Get() method performs a GET request to the URL passed as a parameter. response.text will contain the string representation of the HTML code returned by the server.
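
Before extracting anything from response.text, it is also a good idea to verify that the request succeeded. For example, here is a minimal check based on the status_code field exposed by the CPR response object:

// make sure the server replied with an HTTP 200 before parsing
if (response.status_code != 200)
{
    std::cerr << "Request failed with status " << response.status_code << std::endl;
    return 1;
}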

Note that performing automated HTTP requests can trigger anti-bot technologies. These may intercept your requests, preventing your script from accessing the target site. Specifically, the most basic anti-scraping solutions block incoming requests without a valid User-Agent HTTP header. Learn more in our guide on User-Agents for web scraping.

Just like any other HTTP client, CPR uses a placeholder value for User-Agent. Since this is very different from the agents used by popular browsers, anti-bot systems can easily spot you. To avoid getting blocked because of that reason, you can set a valid User-Agent in CPR with:

cpr::Header headers = {{"User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"}};
cpr::Response response = cpr::Get(cpr::Url{"https://brightdata.com/"}, headers);

The HTTP request made through that Get() will now appear as coming from Google Chrome 113.

This is what scraper.cpp currently contains:

#include <iostream>
#include "cpr/cpr.h"
#include "libxml/HTMLparser.h"
#include "libxml/xpath.h"

int main()
{
    // define the user agent for the GET request
    cpr::Header headers = {{"User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"}};
    // make the HTTP request to retrieve the target page
    cpr::Response response = cpr::Get(cpr::Url{"https://brightdata.com/"}, headers);

    // scraping logic...
}

Step 5: Parse HTML content with libxml2

To make the HTML document returned by the server easily explorable, you should first parse it.
Pass its C string representation to the libxml2 htmlReadMemory() function to achieve that:

htmlDocPtr doc = htmlReadMemory(response.text.c_str(), response.text.length(), nullptr, nullptr, HTML_PARSE_NOWARNING | HTML_PARSE_NOERROR);

The doc variable now exposes the DOM exploration API offered by libxml2. Specifically, you can retrieve HTML elements on the page through XPath selectors. At the time of writing, libxml2 does not support CSS selectors.
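
Also, keep in mind that htmlReadMemory() returns a null pointer when parsing fails, so a defensive check like the sketch below does not hurt:

// stop the script if libxml2 could not parse the document
if (doc == nullptr)
{
    std::cerr << "Failed to parse the HTML document" << std::endl;
    return 1;
}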

Step 6: Define the XPath selectors to get the desired HTML elements

To define an effective XPath selection strategy for the HTML nodes of interest, you need to analyze the DOM of the target page. Open the Bright Data homepage in the browser, right-click on one of the industry cards, and choose “Inspect.” This will open the DevTools section of your browser.

Explore the HTML code and you will notice that each industry card is a <div> element that contains:

  1. A <figure> element with an <img> representing the image of the industry and an <a> containing the URL to the industry page.
  2. A <div> HTML element storing the industry name in an <a>.

For each card, the goal of the C++ scraper is to extract:

  • The industry image URL
  • The industry page URL
  • The industry name

To define proper XPath selectors, shift your attention to the DOM structure of the elements of interest. You will notice that you can get all the industry cards with the XPath selector below:

//div[contains(@class, 'section_cases_row_col_item')]

If you have any doubts, you can test XPath instructions in the browser console with the $x() function.
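
For example, assuming the page structure has not changed, pasting the line below into the DevTools console should return the list of card elements:

$x("//div[contains(@class, 'section_cases_row_col_item')]")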

Given a card, you can get the desired nodes with:

  1. .//figure/a/img
  2. .//figure/a
  3. .//div[contains(@class, 'elementor-image-box-title')]/a

Step 7: Scrape data from a webpage with libxml2

You can now use libxml2 to apply the XPath selectors defined before and get the desired data from the target HTML webpage.

First, you need a data structure whose instances will store the scraped data:

struct IndustryCard
{
    std::string image;
    std::string url;
    std::string name;
};

In C++, a struct allows you to bundle several data attributes under the same name in a block of memory.

Then, initialize a vector of IndustryCard instances in the main() function:

std::vector<IndustryCard> industry_cards;

This will store all scraping data objects.

Populate this vector with the following C++ web scraping logic:

// define an array to store all retrieved data
std::vector<IndustryCard> industry_cards;
// set the libxml2 context to the current document
xmlXPathContextPtr context = xmlXPathNewContext(doc);

// select all industry card HTML elements
// with an XPath selector
xmlXPathObjectPtr industry_card_html_elements = xmlXPathEvalExpression((xmlChar *)"//div[contains(@class, 'section_cases_row_col_item')]", context);

// iterate over the list of industry card elements
for (int i = 0; i < industry_card_html_elements->nodesetval->nodeNr; ++i)
{
    // get the current element of the loop
    xmlNodePtr industry_card_html_element = industry_card_html_elements->nodesetval->nodeTab[i];

    // set the libxml2 context to the current element
    // to limit the XPath selectors to its children
    xmlXPathSetContextNode(industry_card_html_element, context);

    xmlNodePtr image_html_element = xmlXPathEvalExpression((xmlChar *)".//figure/a/img", context)->nodesetval->nodeTab[0];
    std::string image = std::string(reinterpret_cast<char *>(xmlGetProp(image_html_element, (xmlChar *)"data-lazy-src")));

    xmlNodePtr url_html_element = xmlXPathEvalExpression((xmlChar *)".//figure/a", context)->nodesetval->nodeTab[0];
    std::string url = std::string(reinterpret_cast<char *>(xmlGetProp(url_html_element, (xmlChar *)"href")));

    xmlNodePtr name_html_element = xmlXPathEvalExpression((xmlChar *)".//div[contains(@class, 'elementor-image-box-title')]/a", context)->nodesetval->nodeTab[0];
    std::string name = std::string(reinterpret_cast<char *>(xmlNodeGetContent(name_html_element)));

    // instantiate an IndustryCard struct with the collected data
    IndustryCard industry_card = {image, url, name};
    // add the object with the scraped data to the vector
    industry_cards.push_back(industry_card);
}

// free up the resource allocated by libxml2
xmlXPathFreeObject(industry_card_html_elements);
xmlXPathFreeContext(context);
xmlFreeDoc(doc);

The snippet above selects the industry cards by applying the XPath selector defined earlier with xmlXPathEvalExpression(). Then, it iterates over them and implements a similar approach to get the child elements of interest from each card. Next, it scrapes the industry image URL, page URL, and name from them. Finally, it frees up the resources allocated by libxml2.

As you can see, web scraping using C++ with libxml2 is not that complex. Thanks to xmlGetProp() and xmlNodeGetContent(), you can get the value of an HTML attribute and the content of a node, respectively.
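
Note that, strictly speaking, xmlGetProp() and xmlNodeGetContent() return buffers allocated by libxml2 that the caller should release with xmlFree(). The snippet above skips that for brevity, but a stricter version of the image extraction would look like this sketch:

// get the attribute, copy it into a std::string, then release the libxml2 buffer
xmlChar *image_prop = xmlGetProp(image_html_element, (xmlChar *)"data-lazy-src");
std::string image = std::string(reinterpret_cast<char *>(image_prop));
xmlFree(image_prop);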

Now that you know how data scraping in C++ works, you have the tools to go one step further and scrape the industry pages as well. You only have to follow the links discovered here and devise new scraping logic. This is what web crawling and web scraping are all about!

Amazing! You just achieved your goals. The tutorial is not over yet, though.

Step 8: Export the scraped data to CSV

At the end of the for loop, industry_cards will store the scraped data in struct instances. As you can imagine, that is not the best format to share data with other teams. That is why you should convert the retrieved data to CSV.

You can export a vector to a CSV file with the C++ standard library (after adding #include <fstream> to the top of the file) as follows:

// initialize the CSV output file
std::ofstream csv_file("output.csv");
// write the CSV header
csv_file << "url,image,name" << std::endl;
// populate the CSV output file
for (IndustryCard industry_card : industry_cards)
{
    // transform each industry card record to a CSV record
    csv_file << industry_card.url << "," << industry_card.image << "," << industry_card.name << std::endl;
}
// free up the file resources
csv_file.close();

The code above creates an output.csv file and initializes it with the header record. Then, it iterates over the industry_cards vector, converts each element to a CSV record, and appends it to the output file.
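
Keep in mind that this naive serialization breaks if a scraped value contains a comma or a double quote. If that can happen on your target page, you could wrap each field with a small hypothetical helper like the one below before writing it to the file:

// quote a CSV field and escape any embedded double quotes
std::string to_csv_field(const std::string &value)
{
    std::string escaped;
    for (char c : value)
    {
        // CSV escapes a double quote by doubling it
        if (c == '"')
        {
            escaped += "\"\"";
        }
        else
        {
            escaped += c;
        }
    }
    return "\"" + escaped + "\"";
}

Each record line would then become csv_file << to_csv_field(industry_card.url) << "," << to_csv_field(industry_card.image) << "," << to_csv_field(industry_card.name) << std::endl;.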

Build your C++ scraping script, run it, and you will find an output.csv file in the root directory of your project containing the scraped records.

Well done! Now you know how to export scraped data to CSV in C++!

Step 9: Put it all together

Here is the entire C++ scraper:

// scraper.cpp

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "cpr/cpr.h"
#include "libxml/HTMLparser.h"
#include "libxml/xpath.h"

// define a struct where to store the scraped data
struct IndustryCard
{
    std::string image;
    std::string url;
    std::string name;
};

int main()
{
    // define the user agent for the GET request
    cpr::Header headers = {{"User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"}};
    // make an HTTP GET request to retrieve the target page
    cpr::Response response = cpr::Get(cpr::Url{"https://brightdata.com/"}, headers);
    
    // parse the HTML document returned by the server
    htmlDocPtr doc = htmlReadMemory(response.text.c_str(), response.text.length(), nullptr, nullptr, HTML_PARSE_NOWARNING | HTML_PARSE_NOERROR);
    
    // define an array to store all retrieved data
    std::vector<IndustryCard> industry_cards;
    // set the libxml2 context to the current document
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    // select all industry card HTML elements
    // with an XPath selector
    xmlXPathObjectPtr industry_card_html_elements = xmlXPathEvalExpression((xmlChar *)"//div[contains(@class, 'section_cases_row_col_item')]", context);

    // iterate over the list of industry card elements
    for (int i = 0; i < industry_card_html_elements->nodesetval->nodeNr; ++i)
    {
        // get the current element of the loop
        xmlNodePtr industry_card_html_element = industry_card_html_elements->nodesetval->nodeTab[i];
        // set the libxml2 context to the current element
        // to limit the XPath selectors to its children
        xmlXPathSetContextNode(industry_card_html_element, context);

        xmlNodePtr image_html_element = xmlXPathEvalExpression((xmlChar *)".//figure/a/img", context)->nodesetval->nodeTab[0];
        std::string image = std::string(reinterpret_cast<char *>(xmlGetProp(image_html_element, (xmlChar *)"data-lazy-src")));

        xmlNodePtr url_html_element = xmlXPathEvalExpression((xmlChar *)".//figure/a", context)->nodesetval->nodeTab[0];
        std::string url = std::string(reinterpret_cast<char *>(xmlGetProp(url_html_element, (xmlChar *)"href")));

        xmlNodePtr name_html_element = xmlXPathEvalExpression((xmlChar *)".//div[contains(@class, 'elementor-image-box-title')]/a", context)->nodesetval->nodeTab[0];
        std::string name = std::string(reinterpret_cast<char *>(xmlNodeGetContent(name_html_element)));

        // instantiate an IndustryCard struct with the collected data
        IndustryCard industry_card = {image, url, name};
        // add the object with the scraped data to the vector
        industry_cards.push_back(industry_card);
    }

    // free up the resource allocated by libxml2
    xmlXPathFreeObject(industry_card_html_elements);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);

    // initialize the CSV output file
    std::ofstream csv_file("output.csv");
    // write the CSV header
    csv_file << "url,image,name" << std::endl;

    // populate the CSV output file
    for (IndustryCard industry_card : industry_cards)
    {
        // transform each industry card record to a CSV record
        csv_file << industry_card.url << "," << industry_card.image << "," << industry_card.name << std::endl;
    }

    // free up the file resources
    csv_file.close();

    return 0;
}

Et voilà! In around 80 lines of code, you can create a data scraping script in C++!

Conclusion

In this tutorial, you learned why C++ is an efficient language for scraping the Web. Although there are not as many scraping libraries as in other languages, you saw the most popular options. Then, you looked at how to use CPR and libxml2 to build a spider in C++ that can collect data from a real target.

However, many challenges come with web scraping. An increasing number of sites have been implementing anti-bot and anti-scraping technologies to protect their data. These tools can detect the automated requests performed by your C++ scraping script and block them. Luckily, there are many automated solutions for your data collection needs. Contact us to find out the best solution for your use case.

Don’t want to deal with web scraping at all but are interested in web data? Explore our ready-to-use datasets.