Parsing HTML With Java and jsoup

Master HTML parsing with jsoup in Java. Learn DOM methods, handle pagination, and optimize your parsing workflow with this detailed guide.

HTML parsing is central to web scraping, whichever tools you use, and Java is no exception. In Python, the usual choices are Requests and BeautifulSoup. In Java, jsoup can both send our HTTP requests and parse the HTML that comes back. We’ll use Books to Scrape for this tutorial.

Getting Started

In this tutorial, we’re going to use Maven for dependency management. If you don’t already have it, you can download it from the Apache Maven website.

Once you’ve got Maven installed, you need to create a new Java project. The command below creates a new project, jsoup-scraper.

mvn archetype:generate -DgroupId=com.example -DartifactId=jsoup-scraper -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Next, you’ll need to add relevant dependencies. Replace the code in pom.xml with the code below. This is similar to dependency management in Rust with Cargo.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>jsoup-scraper</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>jsoup-scraper</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.16.1</version>
    </dependency>
  </dependencies>
  <properties>
    <maven.compiler.source>17</maven.compiler.source>
    <maven.compiler.target>17</maven.compiler.target>
  </properties>
</project>

Go ahead and paste the following code into App.java, which Maven generated under src/main/java/com/example/. It’s not much, but this is the basic scraper we’ll build from.

package com.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class App {
    public static void main(String[] args) {

        String url = "https://books.toscrape.com";
        int pageCount = 1;

        while (pageCount <= 1) {

            try {
                System.out.println("---------------------PAGE "+pageCount+"--------------------------");

                //connect to a website and get its HTML
                Document doc = Jsoup.connect(url).get();

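                //print the title
                System.out.println("Page Title: " + doc.title());

                //count this page so the loop can finish
                pageCount++;

            } catch (Exception e) {
                e.printStackTrace();
                //stop rather than retry the same URL forever
                break;
            }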
        }
        System.out.println("Total pages scraped: "+(pageCount-1));
    }
}
  • Jsoup.connect("https://books.toscrape.com").get(): This line fetches the page and returns a Document object that we can manipulate.
  • doc.title() returns the title in the HTML document, in this case: All products | Books to Scrape - Sandbox.
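
Parsing logic can also be tried out without a network call. The sketch below is an illustration, not part of the original scraper: it feeds an inline HTML string to Jsoup.parse() and reads the title the same way doc.title() does above. The class name TitleDemo and the sample markup are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleDemo {

    // Parse an HTML string (no HTTP request) and return the text of its <title>.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for a downloaded page.
        String html = "<html><head><title>All products | Books to Scrape - Sandbox</title></head>"
                + "<body></body></html>";
        System.out.println("Page Title: " + titleOf(html));
    }
}
```

Working from a string like this keeps experiments fast and repeatable, since nothing depends on the site being reachable.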

Using DOM Methods With jsoup

jsoup contains a variety of methods for finding elements in the DOM (Document Object Model). We can use any of the following to find page elements easily.

  • getElementById(): Find an element using its id.
  • getElementsByClass(): Find all elements using their CSS class.
  • getElementsByTag(): Find all elements using their HTML tag.
  • getElementsByAttribute(): Find all elements containing a certain attribute.
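
As a rough side-by-side of the four methods listed above, here is a minimal sketch that runs each one against a small inline document. The markup is a made-up fragment loosely modeled on a product listing, and DomMethodsDemo and summarize() are names invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DomMethodsDemo {

    // Hypothetical fragment, loosely modeled on a product listing.
    static final String HTML =
            "<div id=\"promotions_left\"></div>"
          + "<article><a href=\"a.html\">A</a><p class=\"price_color\">1.00</p></article>"
          + "<article><a href=\"b.html\">B</a><p class=\"price_color\">2.00</p></article>";

    // Run each of the four lookups and report what it found.
    static String summarize(String html) {
        Document doc = Jsoup.parse(html);
        boolean hasSidebar = doc.getElementById("promotions_left") != null;
        int prices = doc.getElementsByClass("price_color").size();
        int articles = doc.getElementsByTag("article").size();
        int links = doc.getElementsByAttribute("href").size();
        return "sidebar=" + hasSidebar + " prices=" + prices
                + " articles=" + articles + " links=" + links;
    }

    public static void main(String[] args) {
        // prints: sidebar=true prices=2 articles=2 links=2
        System.out.println(summarize(HTML));
    }
}
```

Note that getElementById() returns a single Element, while the other three return an Elements collection.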

getElementById

On our target site, the sidebar contains a div with an id of promotions_left. You can see this in the image below.

Inspect the sidebar

//get by Id
Element sidebar = doc.getElementById("promotions_left");

System.out.println("Sidebar: " + sidebar);

This code prints the same HTML element you see in the browser’s inspector.

Sidebar: <div id="promotions_left">
</div>
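
One detail worth keeping in mind: getElementById() returns null when no element has the requested id, so code that runs against pages you don't control may want a guard. The following is one possible way to wrap the lookup, assuming you prefer an Optional over a null check. FindById and findById() are hypothetical names, not part of jsoup.

```java
import java.util.Optional;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class FindById {

    // Wrap the lookup so callers get an empty Optional instead of null.
    static Optional<Element> findById(String html, String id) {
        return Optional.ofNullable(Jsoup.parse(html).getElementById(id));
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for the real page.
        String html = "<div id=\"promotions_left\"></div>";

        System.out.println("Sidebar found: " + findById(html, "promotions_left").isPresent());
        System.out.println("Footer found: " + findById(html, "footer").isPresent());
    }
}
```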

getElementsByTag

getElementsByTag() allows us to find all elements on the page with a certain tag. Let’s look at the books on this page.

Each book is contained in its own article tag.

Inspect books

The code below won’t print anything, but it returns an Elements collection containing every book on the page. These books provide the basis for the rest of our data.

//get by tag
Elements books = doc.getElementsByTag("article");
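
Because Elements behaves like a Java list, it can be walked with an ordinary for-each loop. The sketch below shows one way to pull a title out of each article, assuming (as on the target site) that the title sits in the alt attribute of the book's img. The sample markup and the names BookTitles and titles() are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BookTitles {

    // Collect the alt text of the first <img> inside every <article>.
    static List<String> titles(String html) {
        Elements books = Jsoup.parse(html).getElementsByTag("article");
        List<String> result = new ArrayList<>();
        for (Element book : books) {
            Element img = book.getElementsByTag("img").first();
            // first() returns null when the article has no image, so skip those.
            if (img != null) {
                result.add(img.attr("alt"));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical markup: two books with images, one without.
        String html = "<article><img alt=\"A Light in the Attic\"></article>"
                + "<article><img alt=\"Tipping the Velvet\"></article>"
                + "<article></article>";
        System.out.println(titles(html));
    }
}
```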

getElementsByClass

Let’s look at the price of a book. As you can see highlighted, its class is price_color.

Inspect price

This snippet runs inside a loop over books, where book is a single article element. We find all elements within the book that have the price_color class, then print the text of the first one using .first().text().

System.out.println("Price: " + book.getElementsByClass("price_color").first().text());
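
The text returned here is a string such as £51.77. If you want to sort or sum prices, it needs converting to a number first. Here is a minimal sketch of one way to do that using only the standard library. It assumes prices use a dot as the decimal separator and no thousands separators; PriceParser and parsePrice() are hypothetical names.

```java
import java.math.BigDecimal;

public class PriceParser {

    // Keep only digits and the decimal point, then parse what is left.
    static BigDecimal parsePrice(String text) {
        String numeric = text.replaceAll("[^0-9.]", "");
        if (numeric.isEmpty()) {
            throw new IllegalArgumentException("No number found in: " + text);
        }
        return new BigDecimal(numeric);
    }

    public static void main(String[] args) {
        // \u00a3 is the pound sign, written as an escape to avoid source-encoding issues.
        System.out.println(parsePrice("\u00a351.77")); // prints 51.77
    }
}
```

BigDecimal is used rather than double so that currency values keep their exact decimal representation.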

getElementsByAttribute

A link’s destination lives in the href attribute of its a element. In the code below, we use getElementsByAttribute("href") to find all elements inside the book that have an href, then .first().attr("href") to read the value from the first one.

//get by attribute
Elements hrefs = book.getElementsByAttribute("href");
System.out.println("Link: https://books.toscrape.com/" + hrefs.first().attr("href"));
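
Prefixing the site root by hand works when every href is relative to it. As an alternative, jsoup can resolve relative links itself through absUrl(), which uses the document's base URI. The sketch below passes that base URI explicitly to Jsoup.parse() because it works from a string; with Jsoup.connect(url).get() the base URI comes from the fetched URL. The markup and the names AbsoluteLinks and absoluteLinks() are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteLinks {

    // Resolve every href in the document against the given base URI.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        for (Element el : doc.getElementsByAttribute("href")) {
            links.add(el.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical link, as it might appear on a listing page.
        String html = "<a href=\"a-light-in-the-attic_1000/index.html\">A Light in the Attic</a>";
        System.out.println(absoluteLinks(html, "https://books.toscrape.com/catalogue/page-2.html"));
    }
}
```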

Advanced Techniques

CSS Selectors

When we want to use multiple criteria to find elements, we can pass a CSS selector to the select() method. It returns an Elements collection of every element matching the selector. Below, we use li[class='next'] to find every li whose class attribute is exactly next.

Elements nextPage = doc.select("li[class='next']");
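
Selectors can also combine several conditions in one expression. As a sketch of that idea, the example below uses li.next > a[href], which matches an a element with an href that is a direct child of an li carrying the next class, even when the li has other classes too. It uses selectFirst(), which returns the first match or null. The pager markup and the names NextLinkSelector and nextHref() are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class NextLinkSelector {

    // Return the href of the "next" link, or null when the page has none.
    static String nextHref(String html) {
        Element link = Jsoup.parse(html).selectFirst("li.next > a[href]");
        return link == null ? null : link.attr("href");
    }

    public static void main(String[] args) {
        // Hypothetical pager fragment.
        String html = "<ul class=\"pager\">"
                + "<li class=\"current\">Page 1 of 50</li>"
                + "<li class=\"next\"><a href=\"catalogue/page-2.html\">next</a></li>"
                + "</ul>";
        System.out.println(nextHref(html)); // prints catalogue/page-2.html
    }
}
```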

Handling Pagination

To handle pagination, we call nextPage.first() to take the first element in the collection, then getElementsByAttribute("href").attr("href") to extract the href of the link inside it. Interestingly, only the link to page 2 contains the word catalogue. The links on later pages drop it, so if it isn’t present in the href, we add it back in. We then append the result to our base URL to get the address of the next page.

if (!nextPage.isEmpty()) {
    String nextUrl = nextPage.first().getElementsByAttribute("href").attr("href");
    if (!nextUrl.contains("catalogue")) {
        nextUrl = "catalogue/"+nextUrl;
    }
    url = "https://books.toscrape.com/" + nextUrl;
    pageCount++;
}
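
The catalogue check is needed because each href is relative to the page it appears on. A different approach, sketched here as an alternative rather than as the tutorial's method, is to resolve the href against the URL of the current page with java.net.URI. This assumes the current URL includes a path such as /index.html, since a base URL with no path at all may not resolve the way you expect. NextPageUrl and resolve() are hypothetical names.

```java
import java.net.URI;

public class NextPageUrl {

    // Resolve a relative href against the URL of the page it was found on.
    // Assumes currentUrl has a path component (for example "/index.html").
    static String resolve(String currentUrl, String href) {
        return URI.create(currentUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // From the first page, the href already includes "catalogue/".
        System.out.println(resolve("https://books.toscrape.com/index.html", "catalogue/page-2.html"));
        // From later pages, the href is relative to the catalogue directory.
        System.out.println(resolve("https://books.toscrape.com/catalogue/page-2.html", "page-3.html"));
    }
}
```

Both calls produce a URL under https://books.toscrape.com/catalogue/, with no special case for the missing prefix.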

Putting Everything Together

Here is our final code. If you wish to scrape more than one page, simply change the 1 in while (pageCount <= 1) to your desired target. If you want to scrape 4 pages, use while (pageCount <= 4).

package com.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class App {
    public static void main(String[] args) {

        String url = "https://books.toscrape.com";
        int pageCount = 1;

        while (pageCount <= 1) {

            try {
                System.out.println("---------------------PAGE "+pageCount+"--------------------------");

                //connect to a website and get its HTML
                Document doc = Jsoup.connect(url).get();

                //print the title
                System.out.println("Page Title: " + doc.title());

                //get by Id
                Element sidebar = doc.getElementById("promotions_left");

                System.out.println("Sidebar: " + sidebar);

                //get by tag
                Elements books = doc.getElementsByTag("article");

                for (Element book : books) {
                    System.out.println("------Book------");
                    System.out.println("Title: " + book.getElementsByTag("img").first().attr("alt"));
                    System.out.println("Price: " + book.getElementsByClass("price_color").first().text());
                    System.out.println("Availability: " + book.getElementsByClass("instock availability").first().text());

                    //get by attribute
                    Elements hrefs = book.getElementsByAttribute("href");
                    System.out.println("Link: https://books.toscrape.com/" + hrefs.first().attr("href"));
                }

                //find the next button using its CSS selector
                Elements nextPage = doc.select("li[class='next']");
                if (!nextPage.isEmpty()) {
                    String nextUrl = nextPage.first().getElementsByAttribute("href").attr("href");
                    if (!nextUrl.contains("catalogue")) {
                        nextUrl = "catalogue/"+nextUrl;
                    }
                    url = "https://books.toscrape.com/" + nextUrl;
                    pageCount++;
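                } else {
                    //no next button: count this final page and stop
                    pageCount++;
                    break;
                }

            } catch (Exception e) {
                e.printStackTrace();
                //stop rather than retry the same URL forever
                break;
            }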
        }
        System.out.println("Total pages scraped: "+(pageCount-1));
    }
}

Before you run the code, remember to compile it.

mvn package

Then run it with the following command.

mvn exec:java -Dexec.mainClass="com.example.App"

Here is the output from the first page.

---------------------PAGE 1--------------------------
Page Title: All products | Books to Scrape - Sandbox
Sidebar: <div id="promotions_left">
</div>
------Book------
Title: A Light in the Attic
Price: £51.77
Availability: In stock
Link: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
------Book------
Title: Tipping the Velvet
Price: £53.74
Availability: In stock
Link: https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
------Book------
Title: Soumission
Price: £50.10
Availability: In stock
Link: https://books.toscrape.com/catalogue/soumission_998/index.html
------Book------
Title: Sharp Objects
Price: £47.82
Availability: In stock
Link: https://books.toscrape.com/catalogue/sharp-objects_997/index.html
------Book------
Title: Sapiens: A Brief History of Humankind
Price: £54.23
Availability: In stock
Link: https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html
------Book------
Title: The Requiem Red
Price: £22.65
Availability: In stock
Link: https://books.toscrape.com/catalogue/the-requiem-red_995/index.html
------Book------
Title: The Dirty Little Secrets of Getting Your Dream Job
Price: £33.34
Availability: In stock
Link: https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html
------Book------
Title: The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
Price: £17.93
Availability: In stock
Link: https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html
------Book------
Title: The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
Price: £22.60
Availability: In stock
Link: https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html
------Book------
Title: The Black Maria
Price: £52.15
Availability: In stock
Link: https://books.toscrape.com/catalogue/the-black-maria_991/index.html
------Book------
Title: Starving Hearts (Triangular Trade Trilogy, #1)
Price: £13.99
Availability: In stock
Link: https://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html
------Book------
Title: Shakespeare's Sonnets
Price: £20.66
Availability: In stock
Link: https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html
------Book------
Title: Set Me Free
Price: £17.46
Availability: In stock
Link: https://books.toscrape.com/catalogue/set-me-free_988/index.html
------Book------
Title: Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Price: £52.29
Availability: In stock
Link: https://books.toscrape.com/catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html
------Book------
Title: Rip it Up and Start Again
Price: £35.02
Availability: In stock
Link: https://books.toscrape.com/catalogue/rip-it-up-and-start-again_986/index.html
------Book------
Title: Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Price: £57.25
Availability: In stock
Link: https://books.toscrape.com/catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html
------Book------
Title: Olio
Price: £23.88
Availability: In stock
Link: https://books.toscrape.com/catalogue/olio_984/index.html
------Book------
Title: Mesaerion: The Best Science Fiction Stories 1800-1849
Price: £37.59
Availability: In stock
Link: https://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html
------Book------
Title: Libertarianism for Beginners
Price: £51.33
Availability: In stock
Link: https://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html
------Book------
Title: It's Only the Himalayas
Price: £45.17
Availability: In stock
Link: https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html
Total pages scraped: 1

Conclusion

Now that you’ve learned how to extract HTML data using jsoup, you can start building more advanced web scrapers. Whether you’re scraping product listings, news articles, or research data, handling dynamic content and avoiding blocks are key challenges.

To scale your scraping efforts efficiently, consider using Bright Data’s tools.

By combining jsoup with the right infrastructure, you can extract data at scale while minimizing detection risks. Ready to take your web scraping to the next level? Sign up now and start your free trial.
