How to Scrape Google Images With Python

A step-by-step guide to scraping Google Images with Python and Selenium, from setup to saving images.

Google Images is one of the more difficult sites to scrape on the web. Google doesn’t explicitly block scrapers, but it really makes you work for the data. You’ve got to want it!

From dynamic CSS selectors to Base64 encoding, scraping Google Images is a lot more like solving a puzzle than scraping regular HTML.

Prerequisites

To scrape Google Images with us, you should have a basic understanding of Python and Selenium. If you need a refresher, we suggest you learn more about web scraping with Python and Selenium first.

First, make sure you’ve got ChromeDriver and Chrome installed. You can download the most recent ChromeDriver here.

When downloading ChromeDriver, make sure you’re getting a version that matches your version of Chrome.

You can check your Chrome version with the following command.

google-chrome --version

The output should be similar to what you see below.

Google Chrome 131.0.6778.139 
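If you'd rather not compare version strings by eye, the check can be sketched in a few lines of Python. This is a minimal sketch that assumes matching the major version (the first number) is what matters. The helper names `major_version` and `versions_match` are ours, and the ChromeDriver string below is an illustrative value, not real output from your machine.

```python
import re


def major_version(version_output: str) -> int:
    """Return the major version from a line such as 'Google Chrome 131.0.6778.139'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"No version number found in: {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """Compare only the major versions of Chrome and ChromeDriver."""
    return major_version(chrome_output) == major_version(driver_output)


chrome_output = "Google Chrome 131.0.6778.139"
driver_output = "ChromeDriver 131.0.6778.85"  # illustrative value (assumption)
print(versions_match(chrome_output, driver_output))
```

In practice, you would paste in the output of your own version commands instead of the sample strings.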

Once you’ve got these, you can install Selenium with pip.

pip install selenium

What To Scrape

We can’t just plunge head first into code. We need to get a better idea of what we’re scraping and how we’ll extract it. Like we said earlier, scraping Google Images is like solving a puzzle.

Let’s examine one of the images from Google. This image is embedded in a custom HTML tag called g-img. We’ll need to find all of these g-img elements.

Inspecting an image on Google Images

Once we’ve found all the g-img tags, we need to extract their img elements. You can see one of those below.

Inspecting the img element

If you looked at the img closely, you should’ve noticed something extremely strange. The src is a bizarre string of seemingly random characters.



The beginning of this string holds the key to everything: data:image/jpeg tells us that this is a JPEG file, and base64 tells us that it’s encoded using Base64. When we decode this string, we get the binary of the image. We can’t trace the original source of the image, since its binary is embedded directly in the web page. However, we can write this binary to a file and recreate the image.
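To make the decoding step concrete, here is a minimal sketch of how a Base64 src string can be split into its MIME type and raw bytes. The helper name `decode_data_uri` is ours, and the sample string is built in the code from the eight-byte PNG signature rather than copied from Google, so treat it as a stand-in for a real thumbnail.

```python
import base64


def decode_data_uri(src: str) -> tuple[str, bytes]:
    """Split a 'data:<mime>;base64,<payload>' string into (mime type, decoded bytes)."""
    header, _, payload = src.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("Not a Base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Stand-in for a real src attribute: the 8-byte PNG signature, Base64 encoded
png_signature = b"\x89PNG\r\n\x1a\n"
sample_src = "data:image/png;base64," + base64.b64encode(png_signature).decode("ascii")

mime_type, image_bytes = decode_data_uri(sample_src)
print(mime_type, len(image_bytes))  # image/png 8
```

Writing `image_bytes` to a file opened in "wb" mode is all it takes to recreate the image on disk.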

Scraping Google Images With Python

Now that we know what we want, it’s time to start coding our scraper. In the next few sections, we’ll put the scraper together and go through exactly what the code does.

Getting Started

Go ahead and create a new Python file. We’ll start with just our basic imports and structure.

from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
import base64
from pathlib import Path

options = webdriver.ChromeOptions()

"""
Our actual scraping logic will go here
"""


if __name__ == "__main__":
    scrape_images("linux penguin", 100)
  • We import webdriver and By from Selenium. webdriver is used to control our browser. By is used for locating items on the page.
  • We’ll use sleep to pause our scraper for a period of time. For example, if we want the scraper to wait for one second, we’d use sleep(1).
  • As you might have guessed, base64 is going to decode our image binaries.
  • Path will be used to write our images to a folder containing our results.
  • options = webdriver.ChromeOptions() allows us to use custom settings with Selenium. Primarily, this is to run Selenium in headless mode. Headless mode allows us to run the scraper without rendering the actual browser on the machine. This saves valuable resources.

Scraping Google Images

Next, we’ll write our scraping function. The code below contains our entire scraper. Pay close attention to scrape_images().

from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
import base64
from pathlib import Path

options = webdriver.ChromeOptions()


def scrape_images(keyword, batch_size, headless=True):
    if headless:
        options.add_argument("--headless")

    formatted_keyword = keyword.replace(" ", "+")
    folder_name = keyword.replace(" ", "-")
    output_folder = Path(f"results-{folder_name}")
    output_folder.mkdir(parents=True, exist_ok=True)

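    result_count = 0
    seen_sources = set()  # srcs already handled, so re-scanned thumbnails are skipped
    idle_scrolls = 0      # consecutive scrolls that produced no new images

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(f"https://www.google.com/search?q={formatted_keyword}")
        sleep(1)

        list_items = driver.find_elements(By.CSS_SELECTOR, "div[role='listitem']")
        list_items[1].click()

        # Stop once we have enough images, or when scrolling stops yielding new ones
        while result_count < batch_size and idle_scrolls < 10:
            driver.execute_script("window.scrollBy(0, 300);")
            sleep(1)

            saved_before = result_count
            img_tags = driver.find_elements(By.CSS_SELECTOR, "g-img > img")
            for img_tag in img_tags:
                if result_count >= batch_size:
                    break

                src = img_tag.get_attribute("src")
                if not src or not src.startswith("data:image/") or src in seen_sources:
                    continue
                seen_sources.add(src)

                base64_binary = src.split("base64,")[-1]
                mime_type = src.split(";")[0].split(":")[1]
                file_extension = mime_type.split("/")[-1]
                if file_extension == "gif":
                    continue

                # Slashes in the alt text would be treated as path separators
                alt_text = (img_tag.get_attribute("alt") or "image").replace("/", "-")
                filename = f"{alt_text}-{result_count}.{file_extension}"

                image_binary = base64.b64decode(base64_binary)
                output_path = output_folder.joinpath(filename)

                with open(output_path, "wb") as file:
                    file.write(image_binary)
                result_count += 1
                print(f"Saved: {filename}")

            # Reset the idle counter whenever this scroll saved something new
            idle_scrolls = idle_scrolls + 1 if result_count == saved_before else 0
    finally:
        # Always close the browser, even if scraping fails partway through
        driver.quit()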

if __name__ == "__main__":
    scrape_images("linux penguin", 100)
  • We set headless to True by default. If the user sets it to False, this will launch an actual browser that you can see on screen. This is useful for debugging purposes.
  • We create a formatted_keyword and folder_name by replacing the spaces in our keyword: with + for the search URL and with - for the folder name. This keeps the URL valid and lets us store the files without any issues.
  • We launch our browser with webdriver.Chrome(options=options).
  • driver.get(f"https://www.google.com/search?q={formatted_keyword}") takes us to the Google search results for our keyword.
  • Now we need to click on the images tab. We do this by finding all div elements with the role listitem. list_items[1].click() clicks on the second item, the images tab.
  • We use a while loop to run our scraping code over and over until we’ve found all the images we want.
  • driver.execute_script("window.scrollBy(0, 300);") runs JavaScript to scroll the page down by 300 pixels. After scrolling, we sleep() for one second while the content loads.
  • driver.find_elements(By.CSS_SELECTOR, "g-img > img") is used to find all img tags that are nested inside a g-img.
  • Next, we iterate through the img items we found.
  • For each img, we pull its src attribute. If the src is missing or doesn’t start with data:image/, we use continue to skip it.
  • We use some basic string splitting to extract the encoded binary and the file extension (JPEG, PNG, etc.). If the extension is a GIF, we skip it. For some reason, GIFs don’t display when we write them to a file.
  • base64.b64decode(base64_binary) decodes our image into actual machine readable binary.
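The keyword formatting described in the list above relies on simple string replacement, which works for a keyword like "linux penguin" but not for keywords or alt text containing characters such as &, / or :. As a hedged alternative, the sketch below uses the standard library to percent-encode the query and to reduce arbitrary text to a filename-safe slug. The helper names `build_search_url` and `slugify` are ours and are not part of the scraper above.

```python
import re
from urllib.parse import urlencode


def build_search_url(keyword: str) -> str:
    """Build a Google search URL with the keyword percent-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": keyword})


def slugify(text: str, fallback: str = "image") -> str:
    """Reduce text to lowercase letters, digits, and hyphens for use in file names."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-").lower()
    return slug or fallback


print(build_search_url("linux penguin"))      # https://www.google.com/search?q=linux+penguin
print(slugify("Tux / Linux: the penguin!"))   # tux-linux-the-penguin
```

Note that `slugify` drops non-ASCII characters entirely, so you may want a more permissive pattern if your keywords aren't in English.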

If you run the code, you’ll see a new folder pop up inside your project folder. It should be full of images.

The results folder full of .png files

Consider Using Bright Data

Our SERP API parses Google Images so you don’t have to. It even finds the image metadata, so your images will have actual names. The API is fully scalable and can handle an enormous number of requests.

First, sign up for our SERP API.

When you’re ready, finish creating the zone.

Finishing creating the zone

Under Access Details, you’ll see your credentials.

Your SERP API credentials

Copy and paste the code below into a Python file. Replace the credentials in proxy_auth with your own and you’re good to go.

import requests
import base64
from pathlib import Path

proxy = "brd.superproxy.io:33335"
proxy_auth = "brd-customer-<your-customer-id>-zone-<your-zone-name>:<your-zone-password>"
proxy_url = f"http://{proxy_auth}@{proxy}"


def scrape_images(keyword):
    formatted_keyword = keyword.replace(" ", "+")
    folder_name = keyword.replace(" ", "-")
    output_folder = Path(f"serp-results-{folder_name}")
    output_folder.mkdir(parents=True, exist_ok=True)
    url = f"https://www.google.com/search?q={formatted_keyword}&tbm=isch&brd_json=1"

    response = requests.get(
        url,
        proxies={"http": proxy_url, "https": proxy_url},
        verify=False
    )

    # Fail early on HTTP errors instead of trying to parse an error page
    response.raise_for_status()
    images = response.json()["images"]

    for image in images:
        # The image arrives as a Base64 data URI; the payload follows "base64,"
        image_binary = base64.b64decode(image["source_logo"].split("base64,")[-1])
        title = image["title"].replace(" ", "-").replace("/", "").strip(".")
        file_extension = image["source_logo"].split(";")[0].split(":")[1].split("/")[-1]
        if file_extension == "gif":
            continue
        filename = f"{title}.{file_extension}"

        with open(output_folder.joinpath(filename), "wb") as file:
            file.write(image_binary)
        print(f"Saved: {filename}")

if __name__ == "__main__":
    scrape_images("linux penguin")

If you run the code, you’ll get a bunch of images again, but this time, they all have names.

The image results using the SERP API
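Because files are named after image titles, two results with the same title would overwrite each other, and an entry without an embedded image would raise an error. The sketch below shows one way to handle both cases, separated from the network call so it can be tried offline. The function name `extract_images` is ours, and the sample payload is made up: it only assumes the "images", "title", and "source_logo" keys used in the code above.

```python
import base64


def extract_images(payload: dict) -> list[tuple[str, bytes]]:
    """Turn a parsed JSON response into (filename, image bytes) pairs with unique names."""
    results = []
    used_names = set()
    for image in payload.get("images", []):
        logo = image.get("source_logo", "")
        if not logo.startswith("data:image/") or "base64," not in logo:
            continue  # skip entries without an embedded Base64 image
        extension = logo.split(";")[0].split("/")[-1]
        if extension == "gif":
            continue
        title = image.get("title", "image").replace(" ", "-").replace("/", "").strip(".")
        filename = f"{title}.{extension}"
        counter = 1
        while filename in used_names:
            # Append a counter so repeated titles don't overwrite earlier files
            filename = f"{title}-{counter}.{extension}"
            counter += 1
        used_names.add(filename)
        results.append((filename, base64.b64decode(logo.split("base64,")[-1])))
    return results


def fake_logo(data: bytes) -> str:
    """Build a made-up data URI for the sample payload below."""
    return "data:image/png;base64," + base64.b64encode(data).decode("ascii")


# Made-up payload for illustration only; real responses will contain more fields
sample_payload = {
    "images": [
        {"title": "Tux the penguin", "source_logo": fake_logo(b"fake-1")},
        {"title": "Tux the penguin", "source_logo": fake_logo(b"fake-2")},
        {"title": "No embedded image"},
    ]
}

for filename, data in extract_images(sample_payload):
    print(filename, len(data))
```

With the sample payload, the second result is renamed with a -1 suffix and the entry without an image is skipped.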

Conclusion

In conclusion, scraping images from Google is a bit like trying to solve a puzzle without all the pieces. Our Google Images API finds the metadata and cuts out the need for Selenium!

If you need to scrape images from other sources, we also have an Instagram Image API, Shutterstock Scraper, and different structured datasets. Sign up now and find the perfect product for your needs, including a free trial!
