Guide to Web Scraping With Java

Not sure which supporting tools to download to create an ideal Java environment for data collection? Is it unclear how to extract and parse data points from HTML, and then convert them into CSV format? This post will help set the record straight.
Nadav Roiter - Bright Data content manager and writer
Nadav Roiter | Data Collection Expert

What is Java

Java is an open-source programming language developed under the motto “write once, run anywhere”: compiled Java code runs on any platform with a Java Virtual Machine (JVM), so programs do not need to be rewritten for each system. It is an especially approachable language for programmers who are used to C and C++, as Java’s syntax is very similar.

Some of the main advantages of using Java include:

  • It is an extremely popular language, which means there is a large online community that can support your efforts and help troubleshoot issues, as well as extensive documentation, which makes usage that much easier.
  • It has dynamic capabilities, specifically when it comes to the modification of runtime code.
  • It is cross-platform, which makes it compatible with a variety of systems and enables the use of many different APIs (Application Programming Interfaces).

For those of you who are interested in manually installing Java on your Windows computer, feel free to head over to the official manual download page.

Web scraping with Java 

While some people prefer scraping with Selenium or collecting data with Beautiful Soup, another popular option is using Java for web scraping. Here is a step-by-step guide to easily accomplishing this.

Before you begin, ensure that you have the following set up on your computer so that the environment is optimal for web scraping:

  • Java 11 – there are more recent versions, but this one remains by far the most popular among developers.
  • Maven – a build automation tool for dependency management and similar tasks.
  • IntelliJ IDEA – an integrated development environment for developing software written in Java.
  • HtmlUnit – a headless browser that simulates browser activity (e.g. form submission).

You can check installations with these commands:

  •    ‘java -version’
  •    ‘mvn -v’

Step One: Inspect your target page 

Head to the target site that you would like to collect data from, right-click anywhere on the page, and hit ‘Inspect’ to open the ‘Developer Console’, which gives you access to the site’s HTML.

Step Two: Start scraping the HTML

Open IntelliJ IDEA and create a Maven project:

Maven projects have a pom.xml file. Navigate to the pom.xml file, and first set up the JDK version for your project:
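For example, one common way to target Java 11 (the version recommended above) is with the standard Maven compiler properties:

```xml
<properties>
    <maven.compiler.source>11</maven.compiler.source>
    <maven.compiler.target>11</maven.compiler.target>
</properties>
```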


Then, since we will be using HtmlUnit, add the ‘htmlunit’ dependency to the pom.xml file as follows:
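The dependency goes inside the <dependencies> section of pom.xml. The group and artifact IDs below match the com.gargoylesoftware imports used later in this guide; the version shown is only an example, so check Maven Central for the latest release:

```xml
<dependencies>
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <!-- example version; check Maven Central for the latest release -->
        <version>2.63.0</version>
    </dependency>
</dependencies>
```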


Now you are ready to write your first Java class. Start by creating a new Java source file.

We need to create a main method for our application to start from. Create it like this:

   public static void main(String[] args) throws IOException {
   }

The app will start with this method; it is the application entry point. To send HTTP requests with HtmlUnit, add the following imports at the top of the file:

   import com.gargoylesoftware.htmlunit.*;
   import com.gargoylesoftware.htmlunit.html.*;
   import java.util.List;

Now create a WebClient and configure its options as follows:

	private static WebClient createWebClient() {
		WebClient webClient = new WebClient(BrowserVersion.CHROME);
		// Disabling CSS and JavaScript speeds up page loads for simple scraping
		webClient.getOptions().setCssEnabled(false);
		webClient.getOptions().setJavaScriptEnabled(false);
		return webClient;
	}

Step Three: Extract/parse the data from the HTML

Now let’s extract the target price data that we are interested in, using HtmlUnit’s built-in commands. Here is what that would look like for a ‘product price’ data point:

		WebClient webClient = createWebClient();
		try {
			String link = ""; // the target product page URL
			HtmlPage page = webClient.getPage(link);
			String xpath = "//*[@id=\"mm-saleDscPrc\"]";
			HtmlSpan priceDiv = (HtmlSpan) page.getByXPath(xpath).get(0);
			CsvWriter.writeCsvFile(link, priceDiv.asNormalizedText());
		} catch (FailingHttpStatusCodeException | IOException e) {
			e.printStackTrace();
		} finally {
			webClient.close();
		}

To get the XPath of the desired element, use the Developer Console: right-click the relevant section and click “Copy XPath”. This copies the selected element as an XPath expression.
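If you want to experiment with XPath expressions outside the browser, the JDK’s built-in javax.xml.xpath API evaluates the same kind of expression against well-formed markup. Here is a minimal, self-contained sketch; the class name, sample markup, and price value are illustrative, and HtmlUnit is not required:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathDemo {
    // Evaluate the same kind of id-based XPath used above against a small document
    public static String extractPrice(String markup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("//*[@id='mm-saleDscPrc']", doc);
    }

    public static void main(String[] args) throws Exception {
        String markup = "<html><body><span id='mm-saleDscPrc'>19.99</span></body></html>";
        System.out.println(extractPrice(markup)); // prints 19.99
    }
}
```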

Web pages contain links, text, graphics, and tables. If you select the XPath of a table, you can export it to CSV and perform further calculations and analysis with programs such as Microsoft Excel. In the next step, we will examine exporting a table as a CSV file.

Step Four: Exporting the data 

Now that the data has been parsed, we can export it to CSV format for further analysis. Some professionals may prefer this format, as it can then be easily opened and viewed in Microsoft Excel. Here is the code to use in order to accomplish this:

	public static void writeCsvFile(String link, String price) throws IOException {
		// Append to export.csv, writing the header row only when the file is new
		boolean newFile = !new java.io.File("export.csv").exists();
		try (FileWriter recipesFile = new FileWriter("export.csv", true)) {
			if (newFile) {
				recipesFile.write("link, price\n");
			}
			recipesFile.write(link + ", " + price + "\n");
		}
	}
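To see the CSV logic in isolation, here is a self-contained sketch that appends rows and writes the header only once; the class name, file name, and sample values are purely illustrative, and no HtmlUnit is needed:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CsvDemo {
    // Append a link/price row, writing the header row only when the file is new
    public static void writeCsvFile(String fileName, String link, String price) throws IOException {
        boolean newFile = !new File(fileName).exists();
        try (FileWriter out = new FileWriter(fileName, true)) {
            if (newFile) {
                out.write("link, price\n");
            }
            out.write(link + ", " + price + "\n");
        }
    }

    public static void main(String[] args) throws IOException {
        String fileName = "export_demo.csv";
        new File(fileName).delete(); // start fresh for the demo
        writeCsvFile(fileName, "https://example.com/item-1", "19.99");
        writeCsvFile(fileName, "https://example.com/item-2", "5.00");
        List<String> lines = Files.readAllLines(Paths.get(fileName));
        System.out.println(lines.size()); // prints 3 (header plus two rows)
    }
}
```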


The bottom line 

Java can be an effective way for programmers, data scientists, and expert teams to gain access to the target data points their business needs. But using Java for web scraping can be a very laborious task, which is why many companies have decided to fully automate their data collection operations. By utilizing a tool like Web Scraper IDE, any employee at a firm can collect the data they need with zero coding capabilities. All they need to do is choose their target site and dataset, and then select their desired ‘collection frequency’, ‘format’, and ‘method of delivery’.


Nadav Roiter is a data collection expert at Bright Data. Formerly the Marketing Manager at Subivi eCommerce CRM and Head of Digital Content at Novarize audience intelligence, he now dedicates his time to bringing businesses closer to their goals through the collection of big data.
