How to use Java for web scraping in order to unlock mission-critical data points

Not sure which supporting tools to download to create an ideal Java environment for data collection? Unclear how to extract and parse data points from HTML, and then convert them into CSV format? This post will help set the record straight.
Nadav Roiter | Data Collection Expert
14-Aug-2022

In this article we will discuss:

  • What is Java
  • Web scraping with Java
  • Step One: Inspect your target page
  • Step Two: Start scraping the HTML
  • Step Three: Extract/parse the data from the HTML
  • Step Four: Exporting the data
  • The bottom line

What is Java

Java is an open-source programming language that was developed with the motto “write once, run anywhere”. This means that as long as a platform supports Java, code will not need to be rewritten for it. Java is also an especially approachable language for programmers who are used to C and C++, as its syntax is very similar.
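As a quick, minimal illustration of that portability (the class name here is arbitrary), compile the file below once with ‘javac’ and the resulting bytecode runs unchanged on any platform with a JVM:

	// HelloWorld.java
	public class HelloWorld {
		public static void main(String[] args) {
			// Compiled once to HelloWorld.class, this runs on any JVM
			System.out.println("Hello from the JVM!");
		}
	}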

Some of the main advantages of using Java include:

  • It is an extremely popular language, which means that there is a large online community that can help support your efforts and troubleshoot issues, as well as extensive documentation, which makes usage that much easier.
  • It has dynamic capabilities, specifically when it comes to modifying code behavior at runtime, for example via reflection (see the short sketch after this list).
  • It is open-source and runs on a wide variety of platforms, enabling the use of many different APIs (Application Programming Interfaces).
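To make the ‘dynamic capabilities’ point concrete, here is a small illustrative sketch that uses reflection to look up and invoke a method at runtime; the method and strings are arbitrary examples:

	import java.lang.reflect.Method;

	public class ReflectionDemo {
		public static void main(String[] args) throws Exception {
			// Resolve a method by name at runtime, then invoke it dynamically
			Method toUpperCase = String.class.getMethod("toUpperCase");
			System.out.println(toUpperCase.invoke("java is dynamic")); // prints: JAVA IS DYNAMIC
		}
	}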

For those of you who are interested in manually installing Java on your Windows computer, feel free to head over to the official manual download page.

Web scraping with Java 

While some people prefer scraping with Selenium or collecting data with Beautiful Soup, another popular option is using Java for web scraping. Here is a step-by-step guide on how to easily accomplish this.

Before you begin, ensure that you have the following set up on your computer so that the environment is optimal for web scraping:

  • Java 11 – there are newer versions, but Java 11 remains by far the most popular among developers.
  • Maven – a build automation tool used for dependency management and the like.
  • IntelliJ IDEA – an integrated development environment for developing computer software written in Java.
  • HtmlUnit – a headless browser simulator for Java (it can, for example, simulate form submission).

You can check the installations with these commands, each of which should print the installed version:

  •    ‘java -version’
  •    ‘mvn -v’

Step One: Inspect your target page 

Head over to the target site that you would like to collect data from, right-click anywhere, and hit ‘Inspect’ in order to access the Developer Console, affording you access to the site’s HTML.

Step Two: Start scraping the HTML

Open IntelliJ IDEA and create a Maven project.
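If you prefer the command line, you can also generate a Maven project skeleton with Maven’s quickstart archetype; the groupId and artifactId below are placeholders to replace with your own:

   mvn archetype:generate -DgroupId=com.example.scraper -DartifactId=java-scraper -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false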

Maven projects have a pom.xml file. Navigate to the pom.xml file, and first set up the JDK version for your project:

	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
		<maven.compiler.source>11</maven.compiler.source>
		<maven.compiler.target>11</maven.compiler.target>
	</properties>
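The maven.compiler.source and maven.compiler.target properties tell Maven to compile the project for Java 11, matching the JDK installed earlier.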

And then, since we will be using htmlunit, add the “htmlunit” dependency to the pom.xml file as follows:

	<dependencies>
		<dependency>
			<groupId>net.sourceforge.htmlunit</groupId>
			<artifactId>htmlunit</artifactId>
			<version>2.63.0</version>
		</dependency>
	</dependencies>
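After editing pom.xml, reload the Maven project in IntelliJ IDEA (or run ‘mvn compile’) so that the HtmlUnit dependency is downloaded.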

Now everything is in place to begin writing the first Java class. Start by creating a new Java source file.
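A minimal sketch of that file, assuming the class is named Scraper (any name that matches the file name works):

	public class Scraper {
		// The methods written in the following steps will live in this class
	}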

We need to create a main method for our application to start. Create the main method like this:

   public static void main(String[] args) throws IOException {
   }

The app will start with this method; it is the application’s entry point. To send HTTP requests with HtmlUnit, first add the following imports:

   import com.gargoylesoftware.htmlunit.*;
   import com.gargoylesoftware.htmlunit.html.*;
   import java.io.IOException;
   import java.util.List;
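The first two imports provide HtmlUnit’s core classes (such as WebClient and BrowserVersion) and its typed page elements (such as HtmlPage and HtmlSpan), which are used in the steps below.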

Now create a WebClient by setting the options as follows:

	private static WebClient createWebClient() {
		// Emulate a Chrome browser
		WebClient webClient = new WebClient(BrowserVersion.CHROME);
		// Don't abort when a page's JavaScript throws an error
		webClient.getOptions().setThrowExceptionOnScriptError(false);
		// Skip CSS and JavaScript processing to speed up page loads
		webClient.getOptions().setCssEnabled(false);
		webClient.getOptions().setJavaScriptEnabled(false);
		return webClient;
	}

Step Three: Extract/parse the data from the HTML

Now let’s extract the target price data that we are interested in. We will use HtmlUnit’s built-in methods to accomplish this. Here is what that looks like for data points pertaining to ‘product price’:

	WebClient webClient = createWebClient();

	try {
		// The product page to scrape
		String link = "https://www.ebay.com/itm/332852436920?epid=108867251&hash=item4d7f8d1fb8:g:cvYAAOSwOIlb0NGY";
		HtmlPage page = webClient.getPage(link);

		System.out.println(page.getTitleText());

		// Select the price element by its id
		String xpath = "//*[@id=\"mm-saleDscPrc\"]";
		HtmlSpan priceDiv = (HtmlSpan) page.getByXPath(xpath).get(0);
		System.out.println(priceDiv.asNormalizedText());

		// Write the link and price to a CSV file (see Step Four)
		CsvWriter.writeCsvFile(link, priceDiv.asNormalizedText());

	} catch (FailingHttpStatusCodeException | IOException e) {
		e.printStackTrace();
	} finally {
		webClient.close();
	}
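The finally block ensures the WebClient is closed and its resources are released even if the request fails, while the XPath expression targets the element with the id mm-saleDscPrc, which holds the sale price on this particular eBay listing.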

To get the XPath of the desired element, use the Developer Console: right-click the element you are interested in and click “Copy XPath”. This copies the selected element as an XPath expression.

Web pages contain links, text, graphics, and tables. If you select the XPath of a table, you can export its contents to CSV for further calculation and analysis in programs such as Microsoft Excel. In the next step, we will examine exporting the data as a CSV file.
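As an illustrative sketch of that idea, here is how you might walk the first table on a page with HtmlUnit; it assumes the HtmlPage fetched earlier contains at least one table, and java.util.ArrayList would also need to be imported:

	// Illustrative: print the first table on the page, one CSV-style line per row
	HtmlTable table = page.getFirstByXPath("//table");
	for (HtmlTableRow row : table.getRows()) {
		List<String> cells = new ArrayList<>();
		for (HtmlTableCell cell : row.getCells()) {
			cells.add(cell.asNormalizedText());
		}
		System.out.println(String.join(", ", cells));
	}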

Step Four: Exporting the data 

Now that the data has been parsed, we can export it into CSV format for further analysis. This format may be preferred by certain professionals, as it can then be easily opened and viewed in Microsoft Excel. Here is the method to use in order to accomplish this (it belongs in the CsvWriter class referenced in Step Three):

	public static void writeCsvFile(String link, String price) throws IOException {

		File csvFile = new File("export.csv");
		// Write the header row only when the file is first created
		boolean writeHeader = !csvFile.exists();

		try (FileWriter writer = new FileWriter(csvFile, true)) {
			if (writeHeader) {
				writer.write("link, price\n");
			}
			// Append one record per call, ending with a newline
			writer.write(link + ", " + price + "\n");
		}
	}
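Note that export.csv is opened in append mode, so each run adds new rows, while the header row is written only once, when the file is first created. The CsvWriter class also needs imports for java.io.File, java.io.FileWriter, and java.io.IOException.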

The bottom line 

Java can be an effective way for programmers, data scientists, and expert teams to gain access to the target data points that their business needs. But using Java for web scraping can be a very laborious task. That is why many companies have decided to fully automate their data collection operations. By utilizing a tool like Data Collector, any employee at a firm can now collect the data they need with zero coding skills. All they need to do is choose their target site and dataset, and then select their desired ‘collection frequency’, ‘format’, and ‘method of delivery’.

Nadav Roiter | Data Collection Expert

Nadav Roiter is a data collection expert at Bright Data. Formerly the Marketing Manager at Subivi eCommerce CRM and Head of Digital Content at Novarize audience intelligence, he now dedicates his time to bringing businesses closer to their goals through the collection of big data.