How to Parse HTML With Golang?

Master HTML parsing in Go with Node Parser, Tokenizer, and top third-party tools like Goquery and Colly for efficient web scraping solutions.

There are all sorts of great parsing tools out there. In Python, your options feel almost unlimited. However, with Go, we don’t really get much to choose from.

Go is an excellent language for performance and memory management, but its parsing options are quite limited. Node Parser and Tokenizer are the two options we get from the golang.org/x/net/html package, which the Go team maintains alongside the standard library. If you’re completely unfamiliar with how web scraping works, take a look at this guide. Follow along with us and learn when to use these tools and when to choose a third-party library for a more complete scraping solution.

Prerequisites

A basic understanding of Go and web scraping is helpful here but not required. If you’re familiar with Go but would like to get to know the web scraping process, take a look at this guide.

To start, you need to make sure you have Go installed on your machine. You can find the latest release here. Download the latest release for your system and we’re off to the races!
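You can verify the installation with the following command.

go version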

Create a new project folder and cd into it.

mkdir goparser
cd goparser

Initialize a new Go project.

go mod init goparser

Testing Your Configuration

Go ahead and paste the following code into a new file, main.go.

package main

import "fmt"

func main() {
    fmt.Println("Hello, World!")
}

You can run the file with the following command.

go run main.go

If everything is working, you should receive the following output.

Hello, World!

Install our only dependency.

go get golang.org/x/net/html

Examining the Page

Quotes to Scrape is a site built specifically for the purpose of scraping tutorials. In this tutorial, we’re going to extract each quote and its author from the page.

To better understand the quote object, take a look at the screenshot below. Each quote is a span and its class is text.

Inspect quote

In this next screenshot, we inspect the author. It is a small element and its class is author.

Inspect author
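Putting the two together, each quote block in the page’s markup looks roughly like this (abbreviated, with attributes we don’t need trimmed out):

<div class="quote">
    <span class="text">“The world as we have created it is a process of our thinking...”</span>
    <span>by <small class="author">Albert Einstein</small></span>
</div>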

Both our Node Parser and Tokenizer examples will produce the same output you see below.

Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide
Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas A. Edison
Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Author: Eleanor Roosevelt
Quote: “A day without sunshine is like, you know, night.”
Author: Steve Martin

Extracting Data With Node Parser

Go’s Node Parser allows us to traverse and manipulate the DOM (Document Object Model). When we use it, html.Parse() converts the entire HTML page into a tree-like structure of Node objects that we can walk recursively.

In the code below, we create a recursive function: processNode(). It takes a pointer to an HTML node. If the node is a span and its class is text, we print the quote to the console. If the node is a small element and its class is author, we print the author to the console. These are the same attributes we discovered earlier when inspecting the page.

package main

import (
    "fmt"
    "log"
    "net/http"

    "golang.org/x/net/html"
)

func main() {
    resp, err := http.Get("http://quotes.toscrape.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    doc, err := html.Parse(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // processNode walks the tree depth-first, printing the text of any
    // <span class="text"> (quote) or <small class="author"> it finds.
    var processNode func(*html.Node)
    processNode = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "span" {
            for _, a := range n.Attr {
                if a.Key == "class" && a.Val == "text" {
                    fmt.Println("Quote:", n.FirstChild.Data)
                }
            }
        }
        if n.Type == html.ElementNode && n.Data == "small" {
            for _, a := range n.Attr {
                if a.Key == "class" && a.Val == "author" {
                    fmt.Println("Author:", n.FirstChild.Data)
                }
            }
        }
        // Recurse into every child of the current node.
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            processNode(c)
        }
    }
    processNode(doc)
}

The Node Parser API is a great fit when you need to process the entire document. Because processNode() receives a pointer to each node rather than a copy, traversal itself is cheap; the trade-off is that html.Parse() holds the whole document tree in memory at once.

Extracting Data With Tokenizer

Tokenizer processes the page a bit differently. html.NewTokenizer(resp.Body) creates a tokenizer from our response body. We then decide which tokens (start tags, text content, and their attributes) we’d like to act on as the page streams by.

While processing tokens, we keep two boolean flags: inQuote and inAuthor. When a text token arrives while one of these flags is set, we trim it and print it to the console. The output is identical to the Node Parser example, but the two approaches work very differently: Node Parser processes the data one node at a time as we walk a fully parsed tree, while Tokenizer processes it one token at a time as the page streams in.

In the code below, we watch for two start tags: span and small. If the tag is a span with class text, we set inQuote; if it is a small with class author, we set inAuthor. The next text token is then printed as a quote or an author, and every other token on the page is ignored.

package main

import (
    "fmt"
    "log"
    "net/http"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    resp, err := http.Get("http://quotes.toscrape.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    tokenizer := html.NewTokenizer(resp.Body)

    // Flags that remember whether the next text token belongs to a
    // quote or an author.
    inQuote := false
    inAuthor := false

    for {
        tt := tokenizer.Next()
        switch tt {
        case html.ErrorToken:
            // ErrorToken covers both io.EOF and real errors; either
            // way, we're done with the stream.
            return
        case html.StartTagToken:
            t := tokenizer.Token()
            if t.Data == "span" {
                for _, a := range t.Attr {
                    if a.Key == "class" && a.Val == "text" {
                        inQuote = true
                    }
                }
            }
            if t.Data == "small" {
                for _, a := range t.Attr {
                    if a.Key == "class" && a.Val == "author" {
                        inAuthor = true
                    }
                }
            }
        case html.TextToken:
            if inQuote {
                fmt.Println("Quote:", strings.TrimSpace(tokenizer.Token().Data))
                inQuote = false
            }
            if inAuthor {
                fmt.Println("Author:", strings.TrimSpace(tokenizer.Token().Data))
                inAuthor = false
            }
        }
    }
}

Tokenizer is lower level than Node Parser, but it’s also more memory efficient. It still scans every token in the stream, but it never builds a document tree, so memory use stays flat no matter how large the page is. This makes it the best fit for very large documents and streamed data, where you only want to act on the relevant tokens instead of holding the entire page in memory.

Third-Party Alternatives

Both Node Parser and Tokenizer are fairly low level compared to the tooling you get with Python and JavaScript. Here are some third-party tools that can make scraping a bit easier.

Goquery

Built as a Go alternative to jQuery, Goquery is an excellent choice if you’re looking for a more intuitive parser. With Goquery, you get support for DOM traversal and CSS selectors, which is much closer to the solutions you might be used to in other languages.
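As a sketch of what that looks like in practice, here’s the same quotes-and-authors scrape rewritten with Goquery (assuming the github.com/PuerkitoBio/goquery package is installed via go get):

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("http://quotes.toscrape.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Build a queryable document from the response body.
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Select each quote container, then pull out its text and author
    // with the same CSS classes we inspected earlier.
    doc.Find(".quote").Each(func(i int, s *goquery.Selection) {
        fmt.Println("Quote:", s.Find(".text").Text())
        fmt.Println("Author:", s.Find(".author").Text())
    })
}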

htmlquery

Similar to Goquery, htmlquery allows us to use both DOM traversal and selectors. However, with htmlquery, we use XPath selectors instead of CSS selectors. The choice between Goquery and htmlquery should really be based on which type of selector you prefer.
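For comparison, here’s a minimal sketch of the same scrape with htmlquery (assuming the github.com/antchfx/htmlquery package is installed; the XPath expressions target the same class attributes we inspected earlier):

package main

import (
    "fmt"
    "log"

    "github.com/antchfx/htmlquery"
)

func main() {
    // LoadURL fetches the page and parses it into a node tree in one step.
    doc, err := htmlquery.LoadURL("http://quotes.toscrape.com")
    if err != nil {
        log.Fatal(err)
    }

    // Select every quote container with XPath, then resolve the text and
    // author relative to each container so the output stays paired.
    for _, quote := range htmlquery.Find(doc, `//div[@class="quote"]`) {
        text := htmlquery.FindOne(quote, `.//span[@class="text"]`)
        author := htmlquery.FindOne(quote, `.//small[@class="author"]`)
        if text != nil && author != nil {
            fmt.Println("Quote:", htmlquery.InnerText(text))
            fmt.Println("Author:", htmlquery.InnerText(author))
        }
    }
}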

Colly

Colly is a full-fledged web scraping framework for Go. With Colly, we get support for CSS selectors, concurrency, and much more; you can think of it as a Go alternative to Scrapy. If you’re interested in using Colly, we have a great tutorial on it here.
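Here’s a rough sketch of the same scrape with Colly (assuming the github.com/gocolly/colly/v2 module path):

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Fire a callback for every quote block; ChildText resolves
    // CSS selectors relative to the matched element.
    c.OnHTML("div.quote", func(e *colly.HTMLElement) {
        fmt.Println("Quote:", e.ChildText("span.text"))
        fmt.Println("Author:", e.ChildText("small.author"))
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Println("request failed:", err)
    })

    // Visit blocks until the page has been fetched and all
    // OnHTML callbacks have run.
    if err := c.Visit("http://quotes.toscrape.com"); err != nil {
        log.Fatal(err)
    }
}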

Bright Data Web Scraper

Our Web Scraper allows you to bypass the scraping process entirely. With Web Scraper, we scrape the page and return its data to you in JSON format. This is an excellent choice if you just want to make an API request and get on with your day instead of traversing the DOM, handling tokens, or writing selectors. Our Web Scraper isn’t a Go library; it’s an API service. If you know how to work with a REST API, it’s a really simple way to automate your scraping process.
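Since Web Scraper is a REST service rather than a library, your Go code reduces to a plain HTTP request. The endpoint, token, and parameters below are placeholders, not Bright Data’s real API; consult the Web Scraper docs for the actual values:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    // Placeholder endpoint and token -- substitute the values from
    // your Bright Data dashboard and the Web Scraper docs.
    req, err := http.NewRequest("GET", "https://api.example.com/scrape?url=http://quotes.toscrape.com", nil)
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Authorization", "Bearer YOUR_API_TOKEN")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // The service returns the scraped page data as JSON.
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(body))
}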

Conclusion

Now you know how to parse HTML using Go. For a more complete skillset, take a look at our guide on proxy integration in Go. If you want to traverse an entire page, use Node Parser. If you only want to pull relevant data from a page, try the Tokenizer. If neither of these suits your needs, there is a variety of third-party tools like Bright Data’s Web Scrapers. Sign up now and start your free trial!
