In this tutorial, you will learn:
- Why scraping images from a site is useful
- How to scrape images from a website with Python using Selenium
Let’s dive in!
Why Scrape Images From a Site?
Web scraping is not just about extracting textual data: it can target any type of data, including multimedia files such as images. In particular, scraping images from a website is useful in several scenarios, including:
- Retrieving images for training machine learning and AI models: Train a model using images downloaded online to improve its accuracy and effectiveness.
- Studying how competitors approach visual communication: Understand trends and strategies by giving your marketing team access to the images competitors use to communicate key messages to their audience.
- Fetching visually appealing images from online providers automatically: Use high-quality images to drive engagement on your site and social media platforms, attracting and retaining your audience's attention.
Python Scrape Images: Step-by-Step Guide
To scrape images from a webpage, you need to perform the following operations:
- Connect to the target site
- Select all image HTML nodes of interest on the page
- Extract the image URLs from each of them
- Download the image files associated with those URLs
A good target site for this task is Unsplash, one of the most popular image providers on the Internet. This is what the results page for the search term “wallpaper,” filtered to free images, looks like:
As you can see, the page loads new images as the user scrolls down. In other words, it is an interactive site that requires a browser automation tool for scraping.
The URL of that page is:
https://unsplash.com/s/photos/wallpaper?license=free
Time to see how to scrape images from that site in Python!
Step #1: Getting Started
To follow this tutorial, make sure you have Python 3 installed on your machine. Otherwise, download the installer, double-click on it, and follow the instructions.
Initialize your Python scraping images project using the commands below:
mkdir image-scraper
cd image-scraper
python -m venv env
This creates an image-scraper folder and adds a Python virtual environment inside it.
Open the project folder in a Python IDE of your choice. PyCharm Community Edition or Visual Studio Code with the Python extension will do.
Create a scraper.py file in the project folder and initialize it as follows:
print('Hello, World!')
Right now, this file is a simple script that prints “Hello, World!” but it will soon contain the image scraping logic.
Verify that the script works by pressing the run button of your IDE or by running the command below:
python scraper.py
The following message should appear in your terminal:
Hello, World!
Great! You now have a Python project in place. In the next steps, you will implement the logic required to scrape images from a website.
Step #2: Install Selenium
Selenium is an excellent library for scraping images because it can handle sites with both static and dynamic content. As a browser automation tool, it can render pages even if they require JavaScript execution. Learn more in our guide on Selenium web scraping.
Compared to an HTML parser such as BeautifulSoup, Selenium can target more sites and cover more use cases. For example, it also works with image providers that rely on user interactions to load new images. That is exactly the case with Unsplash, the target site for this guide.
Before installing Selenium, you need to activate the Python virtual environment. On Windows, achieve that with this command:
env\Scripts\activate
On macOS and Linux, run instead:
source env/bin/activate
With the virtual environment activated, install the Selenium WebDriver package with the following pip command:
pip install selenium
The installation process will take a while, so be patient.
Awesome! You have everything you need to scrape images in Python.
Step #3: Connect to the Target Site
Import Selenium and the classes required to control a Chrome instance by adding the following lines to scraper.py:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
You can now initialize a headless Chrome WebDriver instance with this code:
# to run Chrome in headless mode
options = Options()
options.add_argument("--headless") # comment while developing
# initialize a Chrome WebDriver instance
# with the specified options
driver = webdriver.Chrome(
    service=ChromeService(),
    options=options
)
Comment out the --headless option if you want Selenium to launch a Chrome window with the GUI. That will allow you to follow what the script does on the page in real time, which is useful for debugging. In production, keep the --headless option enabled to save resources.
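Also note that recent Chrome releases (roughly version 109 and later) ship a rewritten headless implementation that behaves more like a regular headed browser. If your Chrome supports it, you can opt in with:

# opt into Chrome's newer headless mode (assumes Chrome 109+)
options.add_argument("--headless=new")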
Do not forget to close the browser window by adding this line at the end of your script:
# close the browser and free up its resources
driver.quit()
Some pages display images differently depending on the screen size of the user’s device. To avoid issues with responsive content, maximize the Chrome window with:
driver.maximize_window()
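Keep in mind that maximize_window() may have no effect when Chrome runs headless, since there is no real window to maximize. A commonly used alternative is to set a fixed window size through a Chrome option (the 1920x1080 resolution is just an assumption to tune):

# set a fixed viewport size, which also works in headless mode
options.add_argument("--window-size=1920,1080")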
You can now instruct Chrome to connect to the target page via Selenium by using the get() method:
url = "https://unsplash.com/s/photos/wallpaper?license=free"
driver.get(url)
Put it all together, and you will get:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
# to run Chrome in headless mode
options = Options()
options.add_argument("--headless")
# initialize a Chrome WebDriver instance
# with the specified options
driver = webdriver.Chrome(
    service=ChromeService(),
    options=options
)
# to avoid issues with responsive content
driver.maximize_window()
# the URL of the target page
url = "https://unsplash.com/s/photos/wallpaper?license=free"
# visit the target page in the controlled browser
driver.get(url)
# close the browser and free up its resources
driver.quit()
Launch the image scraping script in headed mode. It will show the following page for a fraction of a second before closing Chrome:
The message “Chrome is being controlled by automated test software” means that Selenium is operating on the Chrome window as desired.
Fantastic! Take a look at the HTML code of the page to learn how to extract images from it.
Step #4: Inspect the Target Site
Before digging into the Python scrape images logic, you must inspect the HTML source code of your target page. Only in this way can you understand how to define an effective node selection logic and figure out how to extract the desired data.
Thus, visit the target site in your browser, right-click on an image, and select the “Inspect” option to open the DevTools:
Here, you can notice a couple of interesting facts.
First, the image is contained in an <img> HTML element with a data-test="photo-grid-masonry-img" attribute. This means that the CSS selector to select the image nodes of interest is:
[data-test="photo-grid-masonry-img"]
Second, the image elements have both the traditional src attribute and the srcset attribute. If you are not familiar with the latter, srcset specifies several source images along with hints to help the browser pick the right one based on responsive breakpoints.
In detail, the value of a srcset attribute has the following format:
<image_source_1_url> <image_source_1_size>, <image_source_2_url> <image_source_2_size>, ...
Where:
- <image_source_1_url>, <image_source_2_url>, etc. are the URLs of the images at different sizes.
- <image_source_1_size>, <image_source_2_size>, etc. are the sizes of each image source. Allowed values are pixel widths (e.g., 200w) or pixel ratios (e.g., 1.5x).
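For example, a srcset value with three width-based candidates might look like this (the URLs are purely illustrative):

https://example.com/photo.jpg?w=400 400w, https://example.com/photo.jpg?w=800 800w, https://example.com/photo.jpg?w=2000 2000w

A browser on a small screen can pick the 400w candidate, while a large display gets the 2000w one.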
This scenario, where an image has both attributes, is pretty common on modern responsive sites. Targeting the image URL in src directly is not the best approach, as srcset may contain URLs to higher-quality images.
From the HTML above, you can also see that all image URLs are absolute. So, you do not need to concatenate the site base URL to them.
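The scraping logic in the next step will simply take the last srcset candidate, which assumes candidates are listed from smallest to largest, as they are on Unsplash. If you want to be defensive about ordering, a small helper like the one below (a sketch, not part of the final script) can parse the width descriptors and pick the widest candidate explicitly:

def largest_from_srcset(srcset):
    # parse "url1 400w, url2 800w, ..." into (url, width) pairs
    candidates = []
    for candidate in srcset.split(","):
        parts = candidate.strip().split(" ")
        url = parts[0]
        # treat candidates without a "w" descriptor as width 0
        width = 0
        if len(parts) > 1 and parts[1].endswith("w"):
            width = int(parts[1][:-1])
        candidates.append((url, width))
    # return the URL with the largest width descriptor
    return max(candidates, key=lambda c: c[1])[0]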
In the next step, you will learn how to extract the right images in Python using Selenium.
Step #5: Retrieve All Image URLs
Use the find_elements() method to select all the desired HTML image nodes on the page:
image_html_nodes = driver.find_elements(By.CSS_SELECTOR, "[data-test=\"photo-grid-masonry-img\"]")
To work, that instruction requires the following import:
from selenium.webdriver.common.by import By
Next, initialize a list that will contain the URLs extracted from the image elements:
image_urls = []
Iterate over the nodes in image_html_nodes, collect the URL in src or the URL of the largest image from srcset (if present), and add it to image_urls:
for image_html_node in image_html_nodes:
    try:
        # use the URL in the "src" as the default behavior
        image_url = image_html_node.get_attribute("src")

        # extract the URL of the largest image from "srcset",
        # if this attribute exists
        srcset = image_html_node.get_attribute("srcset")
        if srcset is not None:
            # get the last element from the "srcset" value
            srcset_last_element = srcset.split(", ")[-1]
            # get the first element of the value,
            # which is the image URL
            image_url = srcset_last_element.split(" ")[0]

        # add the image URL to the list
        image_urls.append(image_url)
    except StaleElementReferenceException:
        continue
Note that Unsplash is a pretty dynamic site, and by the time you execute this loop, some images may no longer be attached to the page. To protect against that error, catch the StaleElementReferenceException.
Again, do not forget to add this import:
from selenium.common.exceptions import StaleElementReferenceException
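Also keep in mind that find_elements() returns whatever is in the DOM at the moment it is called. If the script runs before the image grid renders, the list may be empty. A more robust option (a sketch using Selenium's explicit wait API) is to wait for the image nodes to appear:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one image node to be present
image_html_nodes = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, "[data-test=\"photo-grid-masonry-img\"]")
    )
)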
You can now print the scraped image URLs with:
print(image_urls)
The current scraper.py file should contain:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import StaleElementReferenceException
# to run Chrome in headless mode
options = Options()
options.add_argument("--headless")
# initialize a Chrome WebDriver instance
# with the specified options
driver = webdriver.Chrome(
    service=ChromeService(),
    options=options
)
# to avoid issues with responsive content
driver.maximize_window()
# the URL of the target page
url = "https://unsplash.com/s/photos/wallpaper?license=free"
# visit the target page in the controlled browser
driver.get(url)
# select the image nodes on the page
image_html_nodes = driver.find_elements(By.CSS_SELECTOR, "[data-test=\"photo-grid-masonry-img\"]")

# where to store the scraped image URLs
image_urls = []

# extract the URLs from each image
for image_html_node in image_html_nodes:
    try:
        # use the URL in the "src" as the default behavior
        image_url = image_html_node.get_attribute("src")

        # extract the URL of the largest image from "srcset",
        # if this attribute exists
        srcset = image_html_node.get_attribute("srcset")
        if srcset is not None:
            # get the last element from the "srcset" value
            srcset_last_element = srcset.split(", ")[-1]
            # get the first element of the value,
            # which is the image URL
            image_url = srcset_last_element.split(" ")[0]

        # add the image URL to the list
        image_urls.append(image_url)
    except StaleElementReferenceException:
        continue
# log the scraped data in the terminal
print(image_urls)
# close the browser and free up its resources
driver.quit()
Run the script to scrape images, and you will get an output similar to this:
[
'https://images.unsplash.com/photo-1707343843598-39755549ac9a?w=2000&auto=format&fit=crop&q=60&ixlib=rb-4.0.3&ixid=M3wxMjA3fDF8MHxzZWFyY2h8MXx8d2FsbHBhcGVyfGVufDB8fDB8fHwy',
# omitted for brevity...
'https://images.unsplash.com/photo-1507090960745-b32f65d3113a?w=2000&auto=format&fit=crop&q=60&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxzZWFyY2h8MjB8fHdhbGxwYXBlcnxlbnwwfHwwfHx8Mg%3D%3D'
]
Here we go! The above array contains the URLs of the images to retrieve. All that remains is to see how to download images in Python.
Step #6: Download the Images
The easiest way to download an image in Python is to use the urlretrieve() function from the urllib.request module of the Standard Library. That function copies a network object specified by a URL to a local file.
Import urllib.request by adding the following line on top of your scraper.py file:
import urllib.request
In the project folder, create an images directory:
mkdir images
This is where the script will write the image files.
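If you prefer the script to create the folder on its own, you can replace the manual mkdir step with a couple of lines of Python:

import os

# create the "images" folder if it does not already exist
os.makedirs("images", exist_ok=True)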
Now, iterate over the list of scraped image URLs. For each image, generate an incremental file name and download the image with urlretrieve():
image_name_counter = 1
# download each image and add it
# to the "/images" local folder
for image_url in image_urls:
    print(f"downloading image no. {image_name_counter} ...")

    file_name = f"./images/{image_name_counter}.jpg"
    # download the image
    urllib.request.urlretrieve(image_url, file_name)

    print(f"image downloaded successfully to \"{file_name}\"\n")

    # increment the image counter
    image_name_counter += 1
This is everything you need to download images in Python. The print() instructions are not required but are useful to understand what the script is doing.
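Note also that the loop above saves every file with a .jpg extension, while Unsplash image URLs carry no extension at all. If you want the extension to reflect the actual image format, one possible approach (a sketch, with extension_for() being a hypothetical helper) is to read the Content-Type response header and map it to an extension:

import mimetypes
import urllib.request

def extension_for(image_url):
    # read the Content-Type header (e.g., "image/jpeg")
    with urllib.request.urlopen(image_url) as response:
        content_type = response.headers.get_content_type()
    # map the MIME type to a file extension, defaulting to ".jpg"
    return mimetypes.guess_extension(content_type) or ".jpg"

Keep in mind that this opens each URL twice, once for the header and once for the download, so for large batches you may prefer to fetch the bytes a single time with urlopen() and write them to disk yourself.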
Wow! You just learned how to scrape images from a website in Python. It is time to see the full code of the image scraping script.
Step #7: Put It All Together
This is the code of the final scraper.py:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import StaleElementReferenceException
import urllib.request
# to run Chrome in headless mode
options = Options()
options.add_argument("--headless")
# initialize a Chrome WebDriver instance
# with the specified options
driver = webdriver.Chrome(
    service=ChromeService(),
    options=options
)
# to avoid issues with responsive content
driver.maximize_window()
# the URL of the target page
url = "https://unsplash.com/s/photos/wallpaper?license=free"
# visit the target page in the controlled browser
driver.get(url)
# select the image nodes on the page
image_html_nodes = driver.find_elements(By.CSS_SELECTOR, "[data-test=\"photo-grid-masonry-img\"]")

# where to store the scraped image URLs
image_urls = []

# extract the URLs from each image
for image_html_node in image_html_nodes:
    try:
        # use the URL in the "src" as the default behavior
        image_url = image_html_node.get_attribute("src")

        # extract the URL of the largest image from "srcset",
        # if this attribute exists
        srcset = image_html_node.get_attribute("srcset")
        if srcset is not None:
            # get the last element from the "srcset" value
            srcset_last_element = srcset.split(", ")[-1]
            # get the first element of the value,
            # which is the image URL
            image_url = srcset_last_element.split(" ")[0]

        # add the image URL to the list
        image_urls.append(image_url)
    except StaleElementReferenceException:
        continue

# to keep track of the images saved to disk
image_name_counter = 1

# download each image and add it
# to the "/images" local folder
for image_url in image_urls:
    print(f"downloading image no. {image_name_counter} ...")

    file_name = f"./images/{image_name_counter}.jpg"
    # download the image
    urllib.request.urlretrieve(image_url, file_name)

    print(f"image downloaded successfully to \"{file_name}\"\n")

    # increment the image counter
    image_name_counter += 1
# close the browser and free up its resources
driver.quit()
Terrific! You can build an automated script to download images from a site in Python in under 100 lines of code.
Execute it with the following command:
python scraper.py
The image scraping script will log output similar to this:
downloading image no. 1 ...
image downloaded successfully to "./images/1.jpg"
# omitted for brevity...
downloading image no. 20 ...
image downloaded successfully to "./images/20.jpg"
Explore the /images folder, and you will see the images automatically downloaded by the script:
Note that these images are different from those in the screenshot of the Unsplash page seen earlier because the site keeps receiving updated content.
Et voilà! Mission complete.
Step #8: Next Steps
Although we have achieved the goal, there are several ways to improve your Python script. The most important ones are:
- Export image URLs to CSV or store them in a database: This way, you will be able to download or use them in the future.
- Avoid downloading images already in the /images folder: This improvement saves network resources by skipping images that have already been downloaded.
- Also scrape the metadata: Retrieving tags and author information can be useful for getting complete information about downloaded images. Learn how in our guide on Python web scraping.
- Scrape more images: Simulate the infinite scrolling interaction, load more images, and download them all, as sketched below.
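For the last point, a minimal sketch of the scrolling logic might look like this, placed right after driver.get(url) (the number of scrolls and the delay are assumptions to tune for your connection):

import time

# scroll down a few times to trigger the lazy loading of new images
for _ in range(3):
    # jump to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # give the page time to fetch and render the new images
    time.sleep(2)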
Conclusion
In this guide, you learned why scraping images from a website is useful and how to do it in Python. In particular, you saw a step-by-step tutorial on building a Python script that can automatically download images from a site. As shown here, it is not complex and takes only a few lines of code.
At the same time, you should not overlook anti-bot systems. Selenium is a great tool, but it can do nothing against such advanced technologies, which can detect your script as a bot and prevent it from accessing the site's images.
To avoid that, you need a tool that can render JavaScript and is also able to handle fingerprinting, CAPTCHAs, and anti-scraping for you. That is exactly what Bright Data’s Scraping Browser is all about!
Talk to one of our data experts about our scraping solutions.
Note: This guide was thoroughly tested by our team at the time of writing, but as websites frequently update their code and structure, some steps may no longer work as expected.
FAQ
Is it legal to scrape images from a website?
Scraping images from a website is not an illegal activity in itself. At the same time, it is essential to download only public images, respect the site's robots.txt file, and comply with its Terms and Conditions. Many people think that web scraping is simply not legal, but this is a myth. Find out more in our article on the myths about web scraping.
What are the best libraries to download images with Python?
On static content sites, an HTTP client like requests and an HTML parser such as beautifulsoup4 will be enough. On dynamic content sites or highly interactive pages, you will need a browser automation tool like Selenium or Playwright. Check out the list of the best headless browser tools for web scraping.
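For comparison, this is what a minimal static-site version might look like with requests and beautifulsoup4 (the URL is a placeholder, and a real page would call for a more specific selector than a bare img):

import requests
from bs4 import BeautifulSoup

# download the page HTML (placeholder URL)
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# collect the "src" attribute of every <img> node on the page
image_urls = [img["src"] for img in soup.select("img") if img.get("src")]
print(image_urls)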
How to address the “HTTP Error 403: Forbidden” in urllib.request?
The HTTP 403 error occurs because the target site recognizes the request made with urllib.request as coming from an automated script. An effective way to avoid this issue is to set the User-Agent header to a real-world value. When using the urlretrieve() method, this is how you can do it:
# create a custom opener with a real-world User-Agent header
opener = urllib.request.build_opener()
user_agent_string = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
opener.addheaders = [("User-Agent", user_agent_string)]
# make it the default opener for urllib.request calls
urllib.request.install_opener(opener)
# urllib.request.urlretrieve(...)
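Once the opener is installed, every subsequent urllib.request call in the process, including urlretrieve(), will send the custom User-Agent header.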