Web Scraping with Google Sheets

Simplify web scraping with Google Sheets using IMPORTXML and IMPORTHTML to extract data from websites easily.
9 min read
Web Scraping With Google Sheets

Web scraping is the process of extracting content and data from websites using scripts or automated software tools. The scraped information is then usually exported to a more useful format, such as a raw file or CSV, for easier consumption.

If you’re looking to simplify your web scraping workflows, Google Sheets has you covered. It’s a popular data management tool that’s great for scraping structured or tabular data from websites and for analyzing or visualizing your data. For example, you can use it to pull product details and prices from e-commerce sites or grab contact info from business directories. It’s also helpful for tracking social media engagement or doing public sentiment analysis to measure campaign effectiveness.

In this tutorial, you’ll learn how to set up and use Google Sheets for web scraping.

Setting Up Your Google Sheets

To start web scraping with Google Sheets, you need to create a new Google Sheet by navigating to https://sheets.google.com and clicking the + button:

Creating a new Google Sheet

This tutorial showcases how to scrape book pricing information from the Books to Scrape website, but you can use a different website by modifying the following URL and queries.

Understanding Google Sheets Formulas

Google Sheets supports numerous cell formulas that can be used for a variety of operations, including web scraping. Let’s take a look at how some of these formulas work.

IMPORTXML

The IMPORTXML function lets you query and import structured data into Google Sheets. It supports XML, HTML, CSV, and TSV file formats. The syntax of the function is as follows:

=IMPORTXML(url, xpath_query)

The function imports data from the specified web URL and uses the XPath locator to find the relevant element on the web page. For example, you can fetch the H1 heading from the Books to Scrape website by adding the following formula in a Google Sheets cell:

=IMPORTXML("https://books.toscrape.com/catalogue/category/books/default_15/index.html", "//h1")

On the initial usage, Google Sheets prompts you to enable access before fetching data from third-party websites:

Enabling access before fetching data from 3rd party websites

Once you click Allow access, Google Sheets resolves the cell’s value to the H1 heading of the web page to Default.

IMPORTHTML

The IMPORTHTML function lets you import data from a table or a list on an HTML page. The syntax of the function is as follows:

=IMPORTHTML(url, query, index)

This function imports the data from the url to the sheet based on the specified query. The query attribute can be set to either a list or table based on the type of data you want to import. The index starts from 1 and determines which table or list should be imported. For example, you can fetch the list of books from Books to Scrape using this formula:

=IMPORTHTML("https://books.toscrape.com/catalogue/category/books/default_15/index.html", "list", 2)

This formula outputs the list of books in the current cell, as shown here:

List of books outputted from the `IMPORTHTML` formula

As you can see, the IMPORTXML and IMPORTHTML formulas are easy to use and enable you to start scraping data from a web page using simple queries. For more complex use cases, check out this guide that explains how to use VBA and Selenium for web scraping in Excel.

Extracting Data Using IMPORTXML

In the previous section, you learned the basic usage of IMPORTXML to fetch page headings by specifying the relevant XPath attribute. The XPath attribute is quite powerful and lets you traverse to any element in a web page, irrespective of its hierarchy. In the following section, you’ll use IMPORTXML to fetch the title, price, and rating of all the books on this Books to Scrape web page.

To start, add TitlePrice, and Rating columns in Google Sheets:

Add columns to Google Sheets

To fetch the book title from Books to Scrape, you need its XPath location, which can be found using the browser’s Inspect tool. To find the XPath for the book title, right-click the first book’s title and click Inspect. Then click Copy > XPath to copy its XPath locator:

Find XPath for the book title

The XPath for the first book’s title corresponds to an anchor tag (a) and looks something like this:

//*[@id="default"]/div/div/div/div/section/div[2]/ol/li[1]/article/h3/a

You need to make some adjustments to the XPath to ensure the book title is correctly imported for all books in the list:

  • The XPath contains li[1] in the path, indicating that the first book is selected. Replace it with li to retrieve all elements.
  • The inner content of the a tag contains a truncated book title, but the a tag contains a title attribute with the full book title. Modify the a in the XPath to a/@title to use the title attribute.
  • Replace any double quotes within the XPath with a single quote to avoid escaping issues in the formula.

Once you’ve adjusted the XPath, add the following formula with the updated XPath in the A2 cell of your Google Sheet:

=IMPORTXML("https://books.toscrape.com/catalogue/category/books/default_15/index.html", "//*[@id='default']/div/div/div/div/section/div[2]/ol/li/article/h3/a/@title")

The sheet imports data from the web page and updates the rows as follows:

Import book title using `IMPORTXML` formula

Next, construct the XPath for the price and add it to the B2 cell in the Google Sheet:

=IMPORTXML("https://books.toscrape.com/catalogue/category/books/default_15/index.html", "//*[@id='default']/div/div/div/div/section/div[2]/ol/li/article/div[2]/p[1]")

Finally, find the XPath for the rating and add it to the C2 cell in the Google Sheet:

=IMPORTXML("https://books.toscrape.com/", "//*[@id='default']/div/div/div/div/section/div[2]/ol/li/article/p/@class")

The final data in the sheet looks like this:

Scraped book information using `IMPORTXML`

Notice that the Rating column shows star-rating Three or star-rating Four. Since XPath 2.0 is not yet supported by Google Sheets, you cannot manipulate the data to simplify the output.

Handling Complex Web Pages

While Google Sheets is well suited for simple scraping tasks, scraping can become challenging if the target website contains dynamic content and pagination or if it requires click interactions. For instance, if your web page loads content asynchronously using JavaScript, Google Sheets IMPORTXML and IMPORTHTML formulas can’t extract data from it since they support only static web pages. Similarly, if the content relies on user interactions like clicking, typing, or scrolling, these formulas won’t be able to scrape the data. If you want to scrape dynamic content, you can write a script that uses a headless browser like Selenium.

Google Sheets also can’t handle paginated scraping tasks automatically. While you can manually add the IMPORTXML formula after the last row with an updated URL, this method isn’t scalable since it requires repeating the process for every page.

If you’re looking for more advanced use cases, such as handling dynamic content or large volumes of data, consider using Bright Data’s products for efficient data extraction. Bright Data provides a unified Web Scraping API for any data extraction tasks and handles the complexities of proxies, CAPTCHAs, and user agents under the hood. Its API handles bulk requests, parsing, and validation, allowing you to deploy and scale faster. Additionally, it provides a large collection of prebuilt datasets from popular websites, such as LinkedIn and Zillow, that can be integrated with your existing workflows, reducing the hassle of maintaining the scraping scripts.

Automating Data Refresh in Google Sheets

For some scraping tasks, like price or social-media engagement tracking, you need to automatically refresh the scraped data at periodic time intervals to ensure you have access to accurate data for analysis and decision-making.

To set the calculation interval in Google Sheets, all you need to do is click File > Settings and navigate to the Calculation tab:

Open Google Sheets Settings

Then, you can update the calculation interval to either one minute or one hour. For instance, here, the Recalculation setting is updated to On change and every minute to ensure that the data is auto-refreshed every minute:

Configure automatic data refresh in Google Sheets

Google Sheets automatic refresh options offer limited flexibility to configure refresh frequency or triggers as you can only choose between two values: hourly or every minute. If you’re looking for more flexibility, Bright Data offers clean, validated, and up-to-date data sets in multiple file formats, such as JSON, CSV, and Parquet. Therefore, it’s ideal for large-scale scraping tasks that would otherwise require maintaining a vast infrastructure.

Implementing Best Practices and Troubleshooting

If you want to improve the efficiency of your scraping, make sure you’re selective about the data you extract. Trying to scrape unnecessary data can slow down your process and increase the load on the target website.

If you’re looking to scrape large volumes of data, add artificial delays between requests and consider running tasks at off-peak hours to ensure that the website is not overloaded with unexpected traffic. High volumes of traffic can lead to IP bans or rate limiting, preventing you from continuing your scraping task. Learn more about scraping websites without getting blocked.

In addition to IP bans, presenting users with a CAPTCHA challenge is another common anti-bot technique used by websites to restrict access to content until the user verifies they are human. Consider using the Bright Data residential proxies for advanced scraping tasks that would benefit from IP rotation and automatic CAPTCHA solvers.

Before scraping any data, you must also review the website’s terms of service to ensure compliance. Your scripts should follow the robot.txt instructions for interacting with the website. Refer to this guide to learn more about how to use the robot.txt rules for web scraping.

Conclusion

Google Sheets is well-suited for scraping data from static websites that don’t involve dynamic content, hidden elements, or pagination. In this article, you learned how to easily automate data extraction tasks using IMPORTXML and IMPORTHTML formulas without prior scripting experience.

For complex scraping tasks that involve dynamic content or large volumes of data, Bright Data provides easy-to-use, flexible, scalable, and performant APIs for scraping web data in different formats, including JSON, CSV, or NDJSON. Under the hood, it handles the complexities of scraping by taking care of IP and user agent rotation, CAPTCHAs, and dynamic content. If you’re ready to take your web scraping to the next level, consider trying the best Web Scraper API

Sign up for a free trial today and start optimizing your data workflows!

No credit card required