Web Scraping With C# Guide

In this tutorial, you will learn how to build a web scraper in C#. In detail, you will see how to perform an HTTP request to download the web page you want to scrape, select HTML elements from its DOM tree, and extract data from them.

Top C# Web Scraping Libraries

Web scraping becomes easier when adopting the right tools. Let’s take a look at the best NuGet scraping libraries for C#:

  • HtmlAgilityPack: the most popular C# scraping library. HtmlAgilityPack provides you with the ability to download web pages, parse their HTML content, select HTML elements, and scrape data from them.
  • HttpClient: the most popular C# HTTP client. HttpClient is particularly useful for web crawling because it allows you to perform HTTP requests easily and asynchronously (see the short sketch after this list).
  • Selenium WebDriver is a library that supports several programming languages and allows you to write automated tests for web apps. You can also use it for web scraping purposes.
  • Puppeteer Sharp is the C# port of Puppeteer. Puppeteer Sharp provides headless browser capabilities and allows scraping dynamic content pages.
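
For instance, here is a minimal sketch of downloading a page's HTML with HttpClient. It assumes a .NET 6+ console app with top-level statements, and the URL is just an example:

using System;
using System.Net.Http;

// a single HttpClient instance can be reused across requests
using var client = new HttpClient();
// performing an asynchronous GET request to download the page's HTML
string html = await client.GetStringAsync("https://en.wikipedia.org/wiki/Web_scraping");
Console.WriteLine($"Downloaded {html.Length} characters of HTML");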

In this tutorial, you will see how to perform web scraping using C# with HtmlAgilityPack and Selenium.

Prerequisites for Web Scraping With C#

Before writing the first line of code of your C# web scraper, you need to meet some prerequisites:

  • Visual Studio: the free Community edition of Visual Studio 2022 will be fine.
  • .NET 6+: any LTS version greater than or equal to 6 will do.

If you do not meet one of these requirements, click the link above to download the tools and follow the installation wizard to set them up.

You are now ready to create a C# web scraping project in Visual Studio.

Setting Up a Project in Visual Studio

Open Visual Studio and click on the “Create a new project” option.

Creating a new project on VS

In the “Create a new project” window, select the “C#” option from the dropdown list. After specifying the programming language, select the “Console App” template, and click “Next”.

Selecting the Console App template

Then, call your project StaticWebScraping, click “Next”, and choose the .NET version. If you installed .NET 6.0, Visual Studio should already select it for you.

Selecting .NET version

Click the “Create” button to initialize your C# web scraping project. Visual Studio will initialize a StaticWebScraping folder containing an App.cs file for you. This file will store your web scraping logic in C#:

namespace StaticWebScraping {
    public class Program {
        public static void Main() {
            // scraping logic...
        }
    }
}

It is time to understand how you can build a web scraper in C#!

Scraping Static Content Websites in C#

In static content websites, the content of the web pages is already stored in the HTML documents returned by the server. This means that a static content web page does not perform XHR requests to retrieve data or require JavaScript to be rendered.

Scraping static websites is rather simple. All you have to do is:

  1. Install a web scraping C# library
  2. Download your target web page and parse its HTML document
  3. Use a web scraping library to select the HTML elements of interest
  4. Extract data from them

Let’s apply all these steps to the “List of SpongeBob SquarePants episodes” Wikipedia page:

List of SpongeBob SquarePants episodes on Wikipedia

The goal of the C# web scraper you are about to build is to automatically retrieve all episode data from that static content Wikipedia page.

Let’s get started!

Step 1: Install HtmlAgilityPack

HtmlAgilityPack is an open-source C# library that allows you to parse HTML documents, select elements from the DOM, and extract data from them. Basically, HtmlAgilityPack offers everything you need to scrape a static content website.

To install it, right-click on the “Dependencies” option under your project name in “Solution Explorer”. Then, select “Manage NuGet Packages”. In the NuGet package manager window, search for “HtmlAgilityPack”, and click the “Install” button in the right section of your screen.

Installing HtmlAgilityPack for C# web scraping

A popup window will ask you if you agree to make changes to your project. Click “OK” to install HtmlAgilityPack. You are now ready to use HtmlAgilityPack to perform web scraping in C# on a static website.

Now, add the following line at the top of your App.cs file to import HtmlAgilityPack:

using HtmlAgilityPack;

Step 2: Load an HTML Web Page

You can connect to the target web page with HtmlAgilityPack as follows:

// the URL of the target Wikipedia page
string url = "https://en.wikipedia.org/wiki/List_of_SpongeBob_SquarePants_episodes";

var web = new HtmlWeb();
// downloading the target page
// and parsing its HTML content
var document = web.Load(url);

The instance of the HtmlWeb class allows you to load a web page thanks to its Load() method. Behind the scenes, this method performs an HTTP GET request to retrieve the HTML document associated with the URL passed as a parameter. Then, Load() returns an HtmlAgilityPack HtmlDocument instance you can use to select HTML elements from the page.
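
If you already have the page's HTML as a string, for example because you downloaded it with HttpClient, HtmlAgilityPack can also parse it directly through the HtmlDocument class:

using HtmlAgilityPack;

// parsing an HTML string you already have in memory
var doc = new HtmlDocument();
doc.LoadHtml("<html><body><h1>Hello, world!</h1></body></html>");
// the parsed DOM is now reachable through doc.DocumentNode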

Step 3: Select HTML Elements

You can select HTML elements from a web page with XPath selectors. In detail, XPath allows you to select one or more specific DOM elements. To get the XPath selector of an HTML element, right-click on it in the page and open your browser's inspection tools. Make sure the DOM element of interest is highlighted, then right-click on it in the DOM view and select “Copy XPath”.

The goal of the C# web scraper is to extract the data associated with each episode. So, extract the XPath selector by applying the procedure described above to a <tr> episode element.

XPath selector extraction for the C# web scraper

This will return:

//*[@id="mw-content-text"]/div[1]/table[2]/tbody/tr[2]

Keep in mind that you want to select all <tr> elements, not just the one you copied. So, you have to change the index associated with the selected row element. Specifically, you do not want to scrape the first row of the table, since it only contains table headers. In XPath, indexes start from 1, so you can select all remaining <tr> elements of the first episode table by replacing tr[2] with tr[position()>1].

Also, you want to scrape data from all season tables. On the Wikipedia page, the tables containing episode data are the second through the fourteenth HTML tables in the document. So, this is what the final XPath string will look like:

//*[@id='mw-content-text']/div[1]/table[position()>1 and position()<15]/tbody/tr[position()>1]

Now, you can use the SelectNodes() method offered by HtmlAgilityPack to select the HTML elements of interest as follows:

var nodes = document.DocumentNode.SelectNodes("//*[@id='mw-content-text']/div[1]/table[position()>1 and position()<15]/tbody/tr[position()>1]");

Note that you can only call the SelectNodes() method on an HtmlNode instance. So, you need to get the root node of the HTML document with the DocumentNode property. Also, keep in mind that SelectNodes() returns null when no element matches the selector, so check the result before looping over it.

Also, do not forget that XPath selectors are just one of the many methods you have to select HTML elements from a web page. CSS selectors are another popular option.
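
For example, here is a sketch of what the same selection could look like with CSS selectors, assuming you install the HtmlAgilityPack.CssSelectors extension package (the class name in the selector is illustrative, not verified against the live page):

using HtmlAgilityPack;

var web = new HtmlWeb();
var document = web.Load("https://en.wikipedia.org/wiki/List_of_SpongeBob_SquarePants_episodes");
// QuerySelectorAll() is an extension method added by HtmlAgilityPack.CssSelectors
// and accepts CSS selectors instead of XPath expressions
var rows = document.QuerySelectorAll("table.wikiepisodetable tr");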

Step 4: Extract Data From HTML Elements

First, you need a custom class in which to store the scraped data. Create an Episode.cs file in the StaticWebScraping project and initialize it as follows:

namespace StaticWebScraping {
    public class Episode {
        public string OverallNumber { get; set; }        
        public string Title { get; set; }
        public string Directors { get; set; }
        public string WrittenBy { get; set; }
        public string Released { get; set; }
    }
}

As you can see, this class has five properties for storing the most important information about an episode. Note that OverallNumber is a string because SpongeBob overall episode numbers always contain a letter.

Now, you can implement the web scraping C# logic in your App.cs file as below:

using HtmlAgilityPack;
using System;
using System.Collections.Generic;                        

namespace StaticWebScraping {        
    public class Program {
        public static void Main() {
            // the URL of the target Wikipedia page
            string url = "https://en.wikipedia.org/wiki/List_of_SpongeBob_SquarePants_episodes";

            var web = new HtmlWeb();
            // downloading the target page
            // and parsing its HTML content
            var document = web.Load(url);
          
            // selecting the HTML nodes of interest  
            var nodes = document.DocumentNode.SelectNodes("//*[@id='mw-content-text']/div[1]/table[position()>1 and position()<15]/tbody/tr[position()>1]");
            
            // initializing the list of objects that will
            // store the scraped data
            List<Episode> episodes = new List<Episode>();           
            // looping over the nodes
            // and extracting data from them
            foreach (var node in nodes) {
                // add a new Episode instance to
                // the list of scraped data
                episodes.Add(new Episode() {
                    OverallNumber = HtmlEntity.DeEntitize(node.SelectSingleNode("th[1]").InnerText),
                    Title = HtmlEntity.DeEntitize(node.SelectSingleNode("td[2]").InnerText),
                    Directors = HtmlEntity.DeEntitize(node.SelectSingleNode("td[3]").InnerText),
                    WrittenBy = HtmlEntity.DeEntitize(node.SelectSingleNode("td[4]").InnerText),
                    Released = HtmlEntity.DeEntitize(node.SelectSingleNode("td[5]").InnerText)
                });
            }

            // converting the scraped data to CSV... 
            // storing this data in a db...
            // calling an API with this data...
        }
    }
}

This C# web scraper loops over the selected HTML nodes, creates an instance of the Episode class for each of them, and stores it in the episodes list. Keep in mind that the HTML nodes of interest are rows of a table. So, you need to select their child elements with the SelectSingleNode() method. Then, use the InnerText property to extract the data to scrape from them. Note the use of the HtmlEntity.DeEntitize() static method to replace special HTML character entities with their natural representations.
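
For instance, DeEntitize() turns HTML entities back into the characters they represent. The string below is just an example:

// "&amp;" is decoded back to "&", so decoded contains "Tom & Jerry"
string decoded = HtmlEntity.DeEntitize("Tom &amp; Jerry");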

Step 5: Export the Scraped Data to CSV

Now that you have learned how to do web scraping in C#, you can do whatever you want with the scraped data. One of the most common scenarios is to convert it to a human-readable format, such as CSV. By doing this, anyone on your team will be able to explore the scraped data directly in Excel.

Let’s now learn how to export scraped data to CSV with C#.

To make things easier, let’s use a library. CsvHelper is a fast, easy-to-use, powerful .NET library for reading and writing CSV files. To add the CsvHelper dependency, open the “Manage NuGet Packages” section in Visual Studio, search for “CsvHelper”, and install it.

You can use CsvHelper to convert the scraped data into CSV as below:

using CsvHelper;
using System.IO;
using System.Text;
using System.Globalization;

// scraping logic…

// initializing the CSV file
using (var writer = new StreamWriter("output.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
	// populating the CSV file
	csv.WriteRecords(episodes);
}

If you are not familiar with the using statement, it defines a scope at the end of which the objects created within it are disposed of. In other words, using is great for dealing with file resources. Then, the CsvHelper WriteRecords() method takes care of automatically converting the scraped data into CSV and writing it to the output.csv file.

As soon as your C# web scraper finishes running, you will see an output.csv file appear in the project root folder. Open it in Excel, and you will see the following data:

Excel with all the scraped data

Et voilà! You just learned how to perform web scraping with C# on static content websites!

Scraping Dynamic Content Websites in C#

Dynamic content websites use JavaScript to retrieve data dynamically via AJAX. The HTML documents associated with dynamic content pages can be basically empty; instead, they contain JavaScript code in charge of retrieving and rendering data at render time. This means that if you want to extract data from them, you need a browser to render their pages, because only a browser can run JavaScript.

Scraping dynamic websites can be tricky and is definitely more difficult than scraping static ones. Specifically, you need a headless browser to scrape such websites. If you are not familiar with this technology, a headless browser is a browser without a GUI. In other words, if you want to scrape dynamic content websites in C#, you need a library that provides headless browser capabilities, such as Selenium.

Follow the project setup steps presented at the beginning of the article to create a new C# project. This time, call it DynamicWebScraping.

Step 1: Install Selenium

Selenium is an open-source framework for automated testing that supports several programming languages. Selenium provides headless browser capabilities and allows you to instruct a web browser to perform specific actions.

To add Selenium to your project’s dependencies, go to the “Manage NuGet Packages” section again, search for “Selenium.WebDriver”, and install it.

Import Selenium by adding these two lines at the top of your App.cs file:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

Step 2: Connect to the Target Website

Since Selenium opens the target website in a browser, you do not have to manually perform an HTTP GET request. All you have to do is use the Selenium WebDriver as follows:

// the URL of the target Wikipedia page
string url = "https://en.wikipedia.org/wiki/List_of_SpongeBob_SquarePants_episodes";

// initializing the Chrome WebDriver in headless mode
var chromeOptions = new ChromeOptions();
chromeOptions.AddArguments("--headless");
var driver = new ChromeDriver(chromeOptions);

// connecting to the target web page
driver.Navigate().GoToUrl(url);

Here, you created a Chrome WebDriver instance in headless mode. If you are using another browser, adapt the code accordingly by using the right browser driver. Then, thanks to the Navigate() method on the driver variable, you can call the GoToUrl() method to connect to the target web page. Specifically, this function accepts a URL parameter and visits the associated web page in the headless browser.
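
For example, here is a sketch of the same connection logic with Firefox instead of Chrome, assuming the FirefoxDriver that ships with the Selenium.WebDriver package:

using OpenQA.Selenium.Firefox;

// initializing the Firefox WebDriver in headless mode
var firefoxOptions = new FirefoxOptions();
firefoxOptions.AddArgument("--headless");
var driver = new FirefoxDriver(firefoxOptions);

// connecting to the target web page
driver.Navigate().GoToUrl(url);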

Step 3: Scrape Data From HTML Elements

Just as seen before, you can use the following XPath selector to select the HTML elements of interest:

//*[@id='mw-content-text']/div[1]/table[position()>1 and position()<15]/tbody/tr[position()>1]

Use an XPath selector in Selenium with:

var nodes = driver.FindElements(By.XPath("//*[@id='mw-content-text']/div[1]/table[position()>1 and position()<15]/tbody/tr[position()>1]"));

Specifically, the Selenium By.XPath() method allows you to apply an XPath string to select HTML elements from the DOM of the page.

Now, let’s assume you already defined an Episode.cs class as before under the DynamicWebScraping namespace. You can now build a C# web scraper with Selenium as below:

using System;
using System.Collections.Generic;        
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace DynamicWebScraping {        
    public class Program {
        public static void Main() {
            // the URL of the target Wikipedia page
            string url = "https://en.wikipedia.org/wiki/List_of_SpongeBob_SquarePants_episodes";

            // initializing the Chrome WebDriver in headless mode
            var chromeOptions = new ChromeOptions();
            chromeOptions.AddArguments("--headless");
            var driver = new ChromeDriver(chromeOptions);

            // connecting to the target web page
            driver.Navigate().GoToUrl(url);

            // selecting the HTML nodes of interest 
            var nodes = driver.FindElements(By.XPath("//*[@id='mw-content-text']/div[1]/table[position()>1 and position()<15]/tbody/tr[position()>1]"));
                        
            // initializing the list of objects that will
            // store the scraped data
            List<Episode> episodes = new();           
            // looping over the nodes
            // and extracting data from them
            foreach (var node in nodes) {
                // add a new Episode instance to
                // the list of scraped data
                episodes.Add(new Episode() {
                    OverallNumber = node.FindElement(By.XPath("th[1]")).Text,
                    Title = node.FindElement(By.XPath("td[2]")).Text,
                    Directors = node.FindElement(By.XPath("td[3]")).Text,
                    WrittenBy = node.FindElement(By.XPath("td[4]")).Text,
                    Released = node.FindElement(By.XPath("td[5]")).Text
                });
            }

            // converting the scraped data to CSV... 
            // storing this data in a db...
            // calling an API with this data...
                        
        }
    }
}

As you can see, the web scraping logic does not change a lot compared to what is done with HtmlAgilityPack. In detail, thanks to the Selenium FindElements() and FindElement() methods, you can achieve the same scraping goal as before. What truly changes is that Selenium performs all these operations in a browser.

Note that on dynamic content websites, you may have to wait for the data to be retrieved and rendered. You can achieve this with WebDriverWait.
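
For example, here is a short sketch of an explicit wait, assuming you also install the Selenium.Support NuGet package (the XPath below targets the first episode row, as seen earlier):

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

// wait up to 10 seconds for the first episode row to appear in the DOM
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
var firstRow = wait.Until(d => d.FindElement(By.XPath("//*[@id='mw-content-text']/div[1]/table[2]/tbody/tr[2]")));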

Congrats! You now know how to perform C# web scraping on dynamic content websites. All that remains is to learn what to do with the scraped data.

What To Do With the Scraped Data

There are several things you can do with the scraped data. For example, you can:

  • Store it in a database to query it whenever you need.
  • Convert it into JSON and use it to call some APIs (see the sketch after this list).
  • Transform it into a human-readable format, such as CSV, to open it with Excel.
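
For instance, here is a minimal sketch of serializing the episodes list from the earlier examples to JSON with the built-in System.Text.Json library:

using System.IO;
using System.Text.Json;

// serializing the scraped episodes to a JSON string
string json = JsonSerializer.Serialize(episodes);
// writing the JSON to a file
File.WriteAllText("episodes.json", json);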

These are just a few examples. What really matters is that once you have the scraped data in the code, you can use it as you wish. Typically, scraped data is converted into a more useful format for your marketing, data analysis, or sales team.

But keep in mind that web scraping comes with several challenges!

Data Privacy With Proxies

If you want to avoid exposing your IP address, getting blocked, and revealing your identity, consider adopting web scraping proxies. A proxy server acts as a gateway between your application and the target website's server, hiding your IP address as a consequence.

As a result, a proxy service allows you to overcome IP blocks, collect data anonymously, and unlock content in all countries. There are different proxy types, and they all have different use cases and purposes. Make sure you are choosing the right proxy provider.
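
For instance, here is a sketch of routing HttpClient requests through a proxy; the proxy address below is a placeholder, not a real endpoint:

using System.Net;
using System.Net.Http;

// routing all HttpClient traffic through a proxy server
// "proxy.example.com:8080" is a placeholder address
var handler = new HttpClientHandler
{
    Proxy = new WebProxy("http://proxy.example.com:8080"),
    UseProxy = true
};
var client = new HttpClient(handler);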

Let’s now dig into what benefits web proxies can bring to your web scraping process.

Avoid IP Banning

When your web scraping application tries to reach a website on the Internet, the IP address the request comes from is public. This means that websites can keep track of it and block users who make too many requests. This is what bot detection is about. If you are using a web proxy, the target server will see the rotating proxy's IP, not yours. So, with proxies, you can easily bypass IP bans.

Rotating IP Addresses

Premium proxies generally offer rotating IP features. This means that every time you contact the proxy server, you are given a new IP address from a large pool of IPs. This is great for preventing anti-scraping systems from tracking you.

Regional Scraping

Many websites change the content they serve based on where the request comes from. Also, some of them are available only in certain regions. Scraping these websites to perform worldwide market research may be a problem. Luckily, you can use anonymous proxies to select the location of the exit IP address. This is a great way to collect valuable information about products from internationalized websites.

Conclusion

Here, you learned how to build a web scraper using C#. As you saw, this does not take too many lines of code. At the same time, when your target web pages change, you will have to update the scraper accordingly, since some websites change their structure on a daily basis. Bright Data offers a wide range of products to assist businesses with their web scraping needs. Talk to our sales representatives to see which product best suits your needs.