How To Build a Zalando Scraper

This guide will cover:

  • Why scrape product details data from Zalando?
  • Libraries and tools for scraping Zalando
  • Scraping product data from Zalando with Selenium

Why Scrape Product Details Data From Zalando?

Zalando is one of the most popular online clothing retail platforms in Europe. With more than 50 million active users, it is Europe’s leading fashion e-commerce site. It offers a vast catalog of products, including footwear, clothing, and accessories from both well-established brands and emerging designers.

The top three reasons to scrape product details data from Zalando are:

  1. Market research: Gain valuable insights into current fashion trends. This info helps businesses make informed decisions, stay competitive, and tailor their offerings to meet customer demands effectively.
  2. Price monitoring: Track price fluctuations to take advantage of great deals and study the market.
  3. Brand popularity: Monitor the most popular products on Zalando to see which brands customers currently favor and study their strategies.

In short, Zalando scraping opens up a world of possibilities and is great for both companies and individual users.

Libraries and Tools for Scraping Zalando

To understand which of the many scraping tools available is best for scraping Zalando, open it in your browser. Inspect the DOM and compare it with the raw source code. You will notice that the DOM structure is slightly different from the HTML document produced by the server. This means that the site relies on JavaScript for rendering. To scrape a dynamic content site, you need a tool that can run JavaScript, such as Selenium!
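If you want to verify this from code as well, here is a quick experiment. It is just a sketch using the Python standard library, and Zalando may block non-browser clients, so the request can fail:

# fetch the server-produced HTML (no JavaScript executed)
from urllib.request import Request, urlopen

# the User-Agent value below is just an example browser-like header
request = Request(
    'https://www.zalando.co.uk/',
    headers={'User-Agent': 'Mozilla/5.0'}
)
raw_html = urlopen(request).read().decode('utf-8')

# if this is much shorter than the rendered DOM in DevTools,
# the missing elements are added client-side by JavaScript
print(len(raw_html))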

Next, you need to choose a programming language. When it comes to web scraping, the most popular one is Python. Its easy syntax and rich ecosystem of libraries make it perfect for this objective. So, let’s use Python!

Before getting started, check out these two guides:

Selenium renders sites in a controllable web browser that you can instruct to perform specific operations. By using it in Python, you will be able to build an effective Zalando scraper. Time to see how!

Scraping Product Data From Zalando With Selenium

Follow this step-by-step tutorial and learn how to create a Zalando scraper in Python.

Step 1: Set up a Python project

Before jumping into web scraping, make sure you meet the following prerequisites:

You now have everything required to set up a Python project and write some code!

Launch the terminal and run the commands below to:

  1. Create a zalando-scraper folder. 
  2. Enter it.
  3. Initialize it with a Python virtual environment.
mkdir zalando-scraper
cd zalando-scraper
python -m venv env

On Linux or macOS, execute the command below to activate the environment:

source ./env/bin/activate

On Windows, run:

env\Scripts\activate.ps1

Next, create a scraper.py file in the project folder and add the following line to it:

print("Hello, World!")

This is the easiest Python script you can write. Right now, it only prints “Hello, World!” but it will soon contain the Zalando scraping logic.

Launch it to verify that it works with:

python scraper.py

It should print this message in the terminal:

Hello, World!

Now that you are sure that the script works as expected, open the project folder in your Python IDE.

Awesome! Get ready to write the first lines of your scraper.

Step 2: Install the scraping libraries

As mentioned earlier, Selenium is the chosen tool to build a Zalando scraper. In the activated Python virtual environment, run the command below to add it to the project’s dependencies:

pip install selenium

The installation process might take a while, so be patient.

Note that this tutorial refers to Selenium 4.13.x, which comes with automatic driver detection functionality. If you have an older version of Selenium on your machine, update it with:

pip install selenium -U

Remove all the content from scraper.py and initialize a Selenium scraper with: 

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# set up a controllable Chrome instance
service = Service()
options = webdriver.ChromeOptions()
# your browser options...
driver = webdriver.Chrome(
    service=service,
    options=options
)

# maximize the window to avoid responsive rendering
driver.maximize_window()

# scraping logic...

# close the browser and free up its resources
driver.quit()

The above script imports Selenium and uses it to instantiate a WebDriver object. This allows you to programmatically control a Chrome browser instance. 

By default, the browser window will open, letting you monitor the actions performed on the page. That is useful in development.

To open Chrome in headless mode with no GUI, configure options as below:


options.add_argument('--headless=new')
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
options.add_argument(f'user-agent={user_agent}')

Note that the extra user-agent option is required because Zalando blocks requests coming from headless browsers without that header. The headless setup is more common in production.
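For example, a minimal sketch (assuming you want to switch modes without editing the script) is to enable the headless configuration only when an environment variable is set:

import os

# run headless (e.g., in production) only when HEADLESS=1 is set
if os.environ.get('HEADLESS') == '1':
    options.add_argument('--headless=new')
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
    options.add_argument(f'user-agent={user_agent}')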

Great! Time to build your Zalando web scraper in Python.

Step 3: Open the target page

In this guide, you will see how to scrape the details of a shoe product from Zalando UK. When targeting a different product type, you will have to make minor changes to the script you are about to build. The reason is that each product type can have a specific page structure with different information.

As of this writing, this is the URL of the target page:

https://www.zalando.co.uk/adidas-originals-3mc-trainers-footwear-whitegold-metallic-ad115o0da-a11.html

Connect to the target page in Selenium with:

driver.get('https://www.zalando.co.uk/adidas-originals-3mc-trainers-footwear-whitegold-metallic-ad115o0da-a11.html')

get() instructs the browser to visit the page specified by the URL passed as a parameter.
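Keep in mind that get() returns as soon as the initial document is ready, while Zalando renders part of the page with JavaScript. As an optional safeguard, sketched below with an arbitrary 10-second timeout, you can wait for a key element such as the <h1> product name before scraping:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the product name to appear in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'h1'))
)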

This is the Zalando scraping script so far:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# configure the Chrome instance
service = Service()
options = webdriver.ChromeOptions()
# your browser options...
driver = webdriver.Chrome(
    service=service,
    options=options
)

# maximize the window to avoid responsive rendering
driver.maximize_window()

# visit the target page in the controlled browser
driver.get('https://www.zalando.co.uk/adidas-originals-3mc-trainers-footwear-whitegold-metallic-ad115o0da-a11.html')

# scraping logic...

# close the browser and free up its resources
driver.quit()

Run the application. A Chrome window will open for less than a second before the script terminates.

The “Chrome is being controlled by automated software.” disclaimer confirms that Selenium is working as expected.

Step 4: Familiarize yourself with the page structure

To write effective scraping logic, you need to spend some time studying the DOM structure of the target page. That will help you understand how to select HTML elements and extract data from them.

Open your browser in incognito mode and visit the chosen Zalando product page. Right-click and select the “Inspect” option to open your browser’s DevTools.

Here, you will notice that most CSS classes appear to be randomly generated at build time. In other words, you should not base your selection strategy on them, as they will change at each deployment. At the same time, some elements have uncommon HTML attributes such as data-testid. Those will help you define effective selectors.

Interact with the page to study how the DOM changes after clicking on specific elements, such as the accordions. You will realize that some data is dynamically added to the DOM based on user actions.

Keep inspecting the target page and familiarize yourself with its HTML structure until you feel ready to move on.

Step 5: Start extracting the product data

First, initialize a data structure where to keep track of the scraped data. A Python dictionary will be perfect:

product = {}

Start selecting elements on the page and extract data from them!

Inspect the HTML element containing the shoe brand. Note that the brand is in an <h3> element and the product name is in an <h1>. Scrape this data with:

brand_element = driver.find_element(By.CSS_SELECTOR, 'h3')
brand = brand_element.text

name_element = driver.find_element(By.CSS_SELECTOR, 'h1')
name = name_element.text

find_element() is a Selenium method that returns the first element that matches the selection strategy passed as a parameter. In particular, By.CSS_SELECTOR instructs the driver to use a CSS selector strategy. Selenium also supports:

  • By.TAG_NAME: To search for elements based on their HTML tag.
  • By.XPATH: To search for elements through an XPath expression.

Similarly, there is also find_elements(), which returns the list of all nodes that match the selection query.
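For instance, here is how the three strategies look side by side. The selectors below are generic illustrations, not taken from the Zalando page:

# select the first <h1> on the page by tag name
heading = driver.find_element(By.TAG_NAME, 'h1')
# select all <p> elements inside <main> with a CSS selector
paragraphs = driver.find_elements(By.CSS_SELECTOR, 'main p')
# select a link whose href contains "zalando" with an XPath expression
link = driver.find_element(By.XPATH, '//a[contains(@href, "zalando")]')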

Remember to import By with:

from selenium.webdriver.common.by import By

Given an HTML element, you can then access its text content with the text attribute. When required, use the replace() Python method to clean out the text strings.

Extracting price information is a bit trickier. If you inspect the price section, you will see that there is no easy way to select these elements directly.

What you can do is:

  • Access the price <div> as the first following sibling of the <h1> name element.
  • Get all <p> nodes inside it.

Achieve that with:

price_elements = name_element \
    .find_element(By.XPATH, 'following-sibling::*[1]') \
    .find_elements(By.TAG_NAME, 'p')

Keep in mind that Selenium does not provide a utility method for accessing the siblings of a node. This is why you need to use the following-sibling::* XPath expression instead.

You can then get the product price data with:

discount = None
price = None
original_price = None

if len(price_elements) >= 3:
    discount = price_elements[0].text.replace(' off', '')
    price = price_elements[1].text
    original_price = price_elements[2].text
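Note that the scraped prices are strings such as '£51.00'. If you need numeric values for analysis, a small helper can take care of the conversion. This is just a sketch, assuming GBP prices with an optional thousands separator:

def parse_price(price_str):
    # turn a string like '£1,051.00' into a float, keeping None as-is
    if price_str is None:
        return None
    return float(price_str.replace('£', '').replace(',', '').strip())

price_value = parse_price(price)  # e.g., 51.0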

Now focus on the product image gallery.

This contains several images, so initialize an array to store them all:

images = []

Again, selecting the <img> is not easy, but you can achieve that by targeting the <li> elements inside the “Product media gallery” <ul>:

image_elements = driver.find_elements(By.CSS_SELECTOR, '[aria-label="Product media gallery"] li')
for image_element in image_elements:
    image = image_element.find_element(By.TAG_NAME, 'img').get_attribute('src')
    images.append(image)
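If the gallery markup ever repeats the same thumbnail (an assumption worth guarding against rather than a documented behavior), you can deduplicate the list while preserving the original order:

# dict.fromkeys() keeps only the first occurrence of each URL, in order
images = list(dict.fromkeys(images))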

Similarly, you can collect the shoe color options:

Just as before, each color element is an <li>. In detail, each color element contains:

  • An optional link.
  • An image.
  • A name, stored in the alt attribute of the image element.

Extract all colors with:

colors = []
color_elements = driver.find_elements(By.CSS_SELECTOR, '[aria-label="Available colours"] li')
for color_element in color_elements:
    # initialize a new color object
    color = {
        'color': None,
        'image': None,
        'link': None
    }

    # check if the color link is present and scrape its URL
    link_elements = color_element.find_elements(By.TAG_NAME, 'a')
    if len(link_elements) > 0:
        color['link'] = link_elements[0].get_attribute('href')

    # check if the color image is present and scrape its data
    image_elements = color_element.find_elements(By.TAG_NAME, 'img')
    if len(image_elements) > 0:
        color['image'] = image_elements[0].get_attribute('src')
        color['color'] = image_elements[0].get_attribute('alt') \
            .replace('Selected, ', '') \
            .replace('Unselected, ', '') \
            .strip()

    colors.append(color)

Perfect! You just implemented some scraping logic, but there is still more data to retrieve.

Step 6: Scrape the product details data

The product details are stored in cards placed under the color selection element.

First, focus on the delivery information.

That consists of three data fields, so initialize a delivery dictionary as below:

delivery = {
    'time': None,
    'type': None,
    'cost': None,
}

Again, there is no easy selector for those three elements. What you can do is:

  1. Select the node whose data-testid attribute is “pdp-delivery-info”.
  2. Move to its parent.
  3. Get all descendant <p> elements.

Implement this logic and extract the delivery data with:

delivery_elements = driver \
    .find_element(By.CSS_SELECTOR, '[data-testid="pdp-delivery-info"]') \
    .find_element(By.XPATH, 'parent::*[1]') \
    .find_elements(By.TAG_NAME, 'p')

if len(delivery_elements) == 3:
    delivery['time'] = delivery_elements[0].text
    delivery['type'] = delivery_elements[1].text
    delivery['cost'] = delivery_elements[2].text

Since Selenium does not expose a way to access the parent of a node, you need to use the parent::* XPath expression.

Next, focus your attention on the product details accordions:

This time, you can get all accordion elements by targeting nodes whose data-testid attribute starts with “pdp-accordion-“. Do so with the following CSS selector:

[data-testid^="pdp-accordion-"]
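For context, ^= is one of the standard CSS attribute operators: [attr="x"] matches the value exactly, [attr^="x"] matches values starting with "x", [attr$="x"] matches values ending with "x", and [attr*="x"] matches values containing "x".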

That section contains several fields, so create a dictionary to keep track of them:

info = {}

Then, apply the aforementioned CSS selector to select the product details accordions:

info_elements = driver.find_elements(By.CSS_SELECTOR, '[data-testid^="pdp-accordion-"]')[:2]

The “Size & fit” element does not contain relevant data, so you can ignore it. [:2] will reduce the list to the first two elements as desired. 

Those HTML elements are dynamic, and their content is added to the DOM only when they are opened. So, you need to simulate the click interaction with the click() method:

for info_element in info_elements:
    info_element.click()
    # scraping logic...
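If click() ever fails because an accordion sits outside the visible viewport, a common workaround (optional here, just a sketch) is to scroll the element into view via JavaScript before clicking:

for info_element in info_elements:
    # bring the accordion into the viewport, then expand it
    driver.execute_script('arguments[0].scrollIntoView({block: "center"});', info_element)
    info_element.click()
    # scraping logic...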

Next, inside the loop, programmatically populate the info object with:

info_section_name = info_element.find_element(By.CSS_SELECTOR, 'h5').text
info[info_section_name] = {}

for dt_element in info_element.find_elements(By.CSS_SELECTOR, 'dt'):
    info_section_detail_name = dt_element.text.replace(':', '')
    info[info_section_name][info_section_detail_name] = dt_element.find_element(By.XPATH, 'following-sibling::dd').text

The above logic dynamically extracts the information in the accordions and organizes it by name.
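An equivalent approach, sketched below under the assumption that each <dt> is always followed by exactly one matching <dd>, is to collect the two lists separately and zip them:

# pair each <dt> label with its <dd> value positionally
dt_texts = [dt.text.replace(':', '') for dt in info_element.find_elements(By.CSS_SELECTOR, 'dt')]
dd_texts = [dd.text for dd in info_element.find_elements(By.CSS_SELECTOR, 'dd')]
info[info_section_name] = dict(zip(dt_texts, dd_texts))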

To better understand how that code works, try printing info. You will see:

{'Material & care': {'Upper material': 'Imitation leather/ textile', 'Lining': 'Imitation leather/ textile', 'Insole': 'Textile', 'Sole': 'Synthetics', 'Padding type': 'No lining', 'Fabric': 'Canvas'}, 'Details': {'Shoe tip': 'Round', 'Heel type': 'Flat', 'Fastening': 'Laces', 'Shoe fastener': 'Laces', 'Pattern': 'Plain', 'Article number': 'AD115O0DA-A11'}}

Fantastic! Zalando product details scraped!

Step 7: Populate the product object

It only remains to populate the product dictionary with the scraped data:

# assign the scraped data to the dictionary
product['brand'] = brand
product['name'] = name
product['price'] = price
product['original_price'] = original_price
product['discount'] = discount
product['images'] = images
product['colors'] = colors
product['delivery'] = delivery
product['info'] = info

You can also add a log instruction to verify that the Zalando scraper works as expected:

print(product)

Run the script:

python scraper.py

This will produce an output similar to:

{'brand': 'adidas Originals', 'name': '3MC UNISEX - Trainers', 'price': '£51.00', 'original_price': '£59.99', 'discount': '15%', ... }

Et voilà! You just learned how to scrape product data from Zalando.

Step 8: Export scraped data to JSON

Right now, the scraped data is stored in a Python dictionary. Export it to JSON to make it easier to share and read:

with open('product.json', 'w', encoding='utf-8') as file:
    json.dump(product, file, indent=4, ensure_ascii=False)

The above snippet creates a product.json output file with open() and populates it with JSON data via json.dump(). Take a look at our guide to learn more about how to parse and serialize data to JSON in Python.

Remember to add the json import:

import json

This package comes from the Python Standard Library, so you do not even need to install it manually.
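As a quick sanity check, you can load the file back and print one of its fields:

# read the exported file back to confirm it contains valid JSON
with open('product.json', encoding='utf-8') as file:
    print(json.load(file)['brand'])  # e.g., 'adidas Originals'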

Amazing! You started from raw product data contained in a webpage and now have semi-structured JSON data. You are ready to check out the complete Zalando scraper.

Step 9: Put it all together

Here is the complete code of the scraper.py file:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import json

# configure the Chrome instance
service = Service()
options = webdriver.ChromeOptions()
# your browser options...
driver = webdriver.Chrome(
    service=service,
    options=options
)

# maximize the window to avoid responsive rendering
driver.maximize_window()

# visit the target page in the controlled browser
driver.get('https://www.zalando.co.uk/adidas-originals-3mc-trainers-footwear-whitegold-metallic-ad115o0da-a11.html')

# instantiate the object that will contain the scraped data
product = {}

# scraping logic
brand_element = driver.find_element(By.CSS_SELECTOR, 'h3')
brand = brand_element.text

name_element = driver.find_element(By.CSS_SELECTOR, 'h1')
name = name_element.text

price_elements = name_element \
    .find_element(By.XPATH, 'following-sibling::*[1]') \
    .find_elements(By.TAG_NAME, 'p')

discount = None
price = None
original_price = None

if len(price_elements) >= 3:
    discount = price_elements[0].text.replace(' off', '')
    price = price_elements[1].text
    original_price = price_elements[2].text

images = []
image_elements = driver.find_elements(By.CSS_SELECTOR, '[aria-label="Product media gallery"] li')
for image_element in image_elements:
    image = image_element.find_element(By.TAG_NAME, 'img').get_attribute('src')
    images.append(image)

colors = []
color_elements = driver.find_elements(By.CSS_SELECTOR, '[aria-label="Available colours"] li')
for color_element in color_elements:
    color = {
        'color': None,
        'image': None,
        'link': None
    }

    link_elements = color_element.find_elements(By.TAG_NAME, 'a')
    if len(link_elements) > 0:
        color['link'] = link_elements[0].get_attribute('href')

    image_elements = color_element.find_elements(By.TAG_NAME, 'img')
    if len(image_elements) > 0:
        color['image'] = image_elements[0].get_attribute('src')
        color['color'] = image_elements[0].get_attribute('alt') \
            .replace('Selected, ', '') \
            .replace('Unselected, ', '') \
            .strip()

    colors.append(color)

delivery = {
    'time': None,
    'type': None,
    'cost': None,
}

delivery_elements = driver \
    .find_element(By.CSS_SELECTOR, '[data-testid="pdp-delivery-info"]') \
    .find_element(By.XPATH, 'parent::*[1]') \
    .find_elements(By.TAG_NAME, 'p')

if len(delivery_elements) == 3:
    delivery['time'] = delivery_elements[0].text
    delivery['type'] = delivery_elements[1].text
    delivery['cost'] = delivery_elements[2].text

info = {}
info_elements = driver.find_elements(By.CSS_SELECTOR, '[data-testid^="pdp-accordion-"]')[:2]
for info_element in info_elements:
    info_element.click()

    info_section_name = info_element.find_element(By.CSS_SELECTOR, 'h5').text
    info[info_section_name] = {}

    for dt_element in info_element.find_elements(By.CSS_SELECTOR, 'dt'):
        info_section_detail_name = dt_element.text.replace(':', '')
        info[info_section_name][info_section_detail_name] = dt_element.find_element(By.XPATH, 'following-sibling::dd').text

# close the browser and free up its resources
driver.quit()

# assign the scraped data to the dictionary
product['brand'] = brand
product['name'] = name
product['price'] = price
product['original_price'] = original_price
product['discount'] = discount
product['images'] = images
product['colors'] = colors
product['delivery'] = delivery
product['info'] = info

print(product)

# export the scraped data to a JSON file
with open('product.json', 'w', encoding='utf-8') as file:
    json.dump(product, file, indent=4, ensure_ascii=False)

In just over 100 lines of code, you built a fully featured Zalando web scraper to retrieve product details data.

Execute it with:

python scraper.py

Wait a few seconds for the script to complete.

At the end of the scraping process, a product.json file will appear in the root folder of your project. Open it and you will see:

{
    "brand": "adidas Originals",
    "name": "3MC UNISEX - Trainers",
    "price": "£51.00",
    "original_price": "£59.99",
    "discount": "15%",
    "images": [
        "https://img01.ztat.net/article/spp-media-p1/637562911a7e36c28ce77c9db69b4cef/00373c35a7f94b4b84a4e070879289a2.jpg?imwidth=156",
        // omitted for brevity...
        "https://img01.ztat.net/article/spp-media-p1/7d4856f0e4803b759145755d10e8e6b6/521545d1286c478695901d26fcd9ed3a.jpg?imwidth=156"
    ],
    "colors": [
        {
            "color": "footwear white",
            "image": "https://img01.ztat.net/article/spp-media-p1/afe668d0109a3de0a5175a1b966bf0c9/c99c48c977ff429f8748f961446f79f5.jpg?imwidth=156&filter=packshot",
            "link": null
        },
        // omitted for brevity...
        {
            "color": "white",
            "image": "https://img01.ztat.net/article/spp-media-p1/87e6a1f18ce44e3cbd14da8f10f52dfd/bb1c3a8c409544a085c977d6b4bef937.jpg?imwidth=156&filter=packshot",
            "link": "https://www.zalando.co.uk/adidas-originals-3mc-unisex-trainers-white-ad115o0da-a16.html"
        }
    ],
    "delivery": {
        "time": "2-4 working days",
        "type": "Standard delivery",
        "cost": "free"
    },
    "info": {
        "Material & care": {
            "Upper material": "Imitation leather/ textile",
            "Lining": "Imitation leather/ textile",
            "Insole": "Textile",
            "Sole": "Synthetics",
            "Padding type": "No lining",
            "Fabric": "Canvas"
        },
        "Details": {
            "Shoe tip": "Round",
            "Heel type": "Flat",
            "Fastening": "Laces",
            "Shoe fastener": "Laces",
            "Pattern": "Plain",
            "Article number": "AD115O0DA-A11"
        }
    }
}

Congrats! You just learned how to scrape Zalando in Python!

Conclusion

In this tutorial, you saw why Zalando is a great e-commerce site to scrape and learned how to build a Zalando scraper that automatically retrieves data from a product page.

As shown here, scraping Zalando is not the easiest task for at least three reasons:

  1. The site implements some anti-scraping measures that might block your script.
  2. The web pages contain random CSS classes.
  3. Each product page has a specific structure and can involve different information.

To avoid the first issue and forget about getting blocked, try our new solution! Scraping Browser is a controllable browser that automatically handles CAPTCHAs, fingerprinting, automated retries, and more for you. However, you will still have to write the scraping code and keep maintaining it. To address the remaining two issues as well, check out our out-of-the-box Zalando scraper!

Not sure what tool is best for you? Talk to one of our data experts.