How to Scrape Zillow

Unlock the potential of Zillow’s data with this comprehensive guide. Learn how to use Beautiful Soup for scraping and how Bright Data can help you overcome Zillow’s anti-scraping techniques. Get valuable insights into the real estate market with these proven methods.

Zillow, an online real estate marketplace, offers valuable insights into the housing market, including market analytics, industry trends, and competitor overviews. By scraping Zillow, you can gather comprehensive information on property prices, locations, features, and historical trends, empowering you to perform market analysis, stay on top of housing trends, evaluate competitors’ strategies, and make data-driven decisions that align with your investment goals.

In this tutorial, you’ll learn how to scrape Zillow using Beautiful Soup. In addition to learning how to gather helpful data, you’ll also learn about anti-scraping techniques employed by Zillow and how Bright Data can help.

Want to skip scraping and just get the data? Check out our Zillow datasets.

Scraping Zillow

Whether you’re new to Python or already skilled in it, this tutorial will help you build a web scraper using free libraries such as Beautiful Soup and Requests. Let’s get started!

Prerequisites

Before you begin, it’s recommended that you have a basic understanding of web scraping and HTML. You also need to do the following:

  • Install Python: If you don’t already have Python installed, check the official documentation to install it now.
  • Install the Beautiful Soup, Requests, pandas, and Playwright libraries: Beautiful Soup helps you extract data from web pages by parsing HTML and XML documents. Requests simplifies making HTTP requests in Python, helping you communicate with web servers and retrieve web content. pandas is a powerful library for manipulating and analyzing structured data; it offers data structures and functions that make cleaning, transforming, and analyzing data much easier. Lastly, Playwright is a library for automating web browsers in Python; it lets you interact with browsers and automate tasks, offering a unified interface, headless-mode support, and powerful automation features. To download the libraries, open your shell or terminal and run the following commands:
   pip3 install beautifulsoup4
   pip3 install requests
   pip3 install pandas
   pip3 install playwright
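
You can also install all four libraries with a single command:

   pip3 install beautifulsoup4 requests pandas playwright

Note that this tutorial connects Playwright to a remote browser, so you don’t need to download local browser binaries. If you later want to drive a local browser with Playwright, you’d also run python3 -m playwright install.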

Understand the Zillow Website Structure

Before you begin scraping Zillow, it’s important to understand its structure. Notice that Zillow’s home page features a convenient search bar, enabling you to search for homes, apartments, and various real estate properties. Once you initiate a search, the results are displayed on a page presenting a list of properties, which includes their prices, addresses, and other relevant details. It’s worth mentioning that these search results can be sorted based on parameters such as price, number of bedrooms, and number of bathrooms.

If you want more search results beyond what is initially displayed, you can utilize the pagination buttons situated at the bottom of the page. Each page typically includes forty listings, allowing you to access additional properties. By leveraging the filters located on the left-hand side of the page, you can narrow down your search based on your preferences and requirements.
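
If you later want to scrape more than the first page of results, you can loop over the paginated URLs. Zillow doesn’t document its URL scheme, and the page-number segment shown below is an assumption based on how URLs looked at the time of writing, so verify the pattern by clicking through the pagination buttons in your browser first:

# Hypothetical pagination sketch - confirm the URL pattern in your browser
base_url = 'https://www.zillow.com/homes/for_sale/San-Francisco_rb/'
page_urls = [base_url] + [f'{base_url}{n}_p/' for n in range(2, 5)]  # pages 2-4 (assumed format)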

To gain an understanding of the HTML structure of the website, you should follow these steps:

  • Visit Zillow’s website: www.zillow.com.
  • Enter a city or ZIP code into the search bar and press Enter.
  • Right-click on a property card and click Inspect to open the browser’s developer tools.
  • Analyze the HTML structure to identify the tags and attributes containing the data you want to scrape.
Zillow Website
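
To see how these tags map to code, here’s a minimal Beautiful Soup example run against a simplified stand-in for a Zillow property card. The real markup is much larger, and the auto-generated class names change over time, so treat this as an illustration of the selectors rather than Zillow’s exact HTML:

from bs4 import BeautifulSoup

# Simplified stand-in for a Zillow property card
sample_html = '''
<div class="property-card-data">
  <address data-test="property-card-addr">19 Tehama St, San Francisco, CA 94105</address>
  <span data-test="property-card-price">$1,025,000</span>
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
card = soup.find('div', {'class': 'property-card-data'})
print(card.find('address', {'data-test': 'property-card-addr'}).get_text().strip())
print(card.find('span', {'data-test': 'property-card-price'}).get_text().strip())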

Identify Key Data Points

To effectively gather information from Zillow, you need to identify the exact content that you’re looking to scrape. This guide will show you how to extract information about a property, including the following key data points:

  • Address: The location of the property, including the street address, city, and state.
  • Price: The listed price of the property, which provides insights into its current market value.
  • Zestimate: Zillow’s estimated market value of the property. The Zestimate takes into account various factors and provides an approximate valuation based on market trends and comparable property data.
  • Bedrooms: The number of bedrooms in the property.
  • Bathrooms: The number of bathrooms in the property.
  • Square Footage: The total area of the property in square feet.
  • Year Built: The year in which the property was constructed.
  • Type: The type of property, which can include options such as a house, apartment, condo, or other relevant classifications.

Zillow provides you with an extensive range of information that enables you to easily evaluate and compare different listings, consider pricing trends in specific neighborhoods, assess the property’s condition, and identify any additional amenities. Moreover, by analyzing historical and current market data, you can stay updated on trends and make strategic decisions regarding buying, selling, or investing in real estate.

Build the Scraper

Now that you’ve identified what you want to scrape, it’s time to build the scraper. Here, you use the Requests library to make HTTP requests to Zillow, Beautiful Soup to parse the HTML, and plain Python to extract and structure the data.

Extract the Data

The first step is to extract the data you’re looking for. Create a new file named scraper.py and add the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.zillow.com/homes/for_sale/San-Francisco_rb/'

# A browser-like User-Agent makes the request look less like an automated script
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

listings = []

# Each search result is rendered in a div with the class 'property-card-data'.
# Note: class names such as 'dmDolk' and 'gxlfal' are auto-generated and may
# change over time; re-inspect the page if the scraper stops finding data.
for listing in soup.find_all('div', {'class': 'property-card-data'}):
    result = {}
    address = listing.find('address', {'data-test': 'property-card-addr'})
    result['address'] = address.get_text().strip() if address else ''
    price = listing.find('span', {'data-test': 'property-card-price'})
    result['price'] = price.get_text().strip() if price else ''
    # Bedrooms, bathrooms, and square footage are list items in the details ul
    details_list = listing.find('ul', {'class': 'dmDolk'})
    details = details_list.find_all('li') if details_list else []
    result['bedrooms'] = details[0].get_text().strip() if len(details) > 0 else ''
    result['bathrooms'] = details[1].get_text().strip() if len(details) > 1 else ''
    result['sqft'] = details[2].get_text().strip() if len(details) > 2 else ''
    # The property type appears after a hyphen in the card's footer text
    type_div = listing.find('div', {'class': 'gxlfal'})
    type_text = type_div.get_text() if type_div else ''
    result['type'] = type_text.split('-')[1].strip() if '-' in type_text else type_text.strip()

    listings.append(result)

print(listings)

This code makes an HTTP GET request to the Zillow search results page and then uses Beautiful Soup to parse the HTML. It extracts the data points for each property and then prints all the properties.

Run the Scraper

To run the scraper, you need to provide it with a URL for a Zillow search results page. The URL should look like this: https://www.zillow.com/homes/for_sale/{city-or-zip}_rb/, where {city-or-zip} is replaced with the name of the city or the ZIP code you want to scrape.

For instance, if you’re looking to collect information about houses for sale in San Francisco, the URL you’d use is https://www.zillow.com/homes/for_sale/San-Francisco_rb/.
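
If you’re generating these URLs in code, a simple f-string does the job (note that multiword city names are hyphenated):

city_or_zip = 'San-Francisco'  # e.g. 'New-York' or '94105'
url = f'https://www.zillow.com/homes/for_sale/{city_or_zip}_rb/'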

Once you’ve set the URL, it’s time to run your program and start scraping. Save the changes to scraper.py and run the following command in your shell or terminal:

 python3 scraper.py
…output…
[{'address': '19 Tehama St SUITE 3, San Francisco, CA 94105', 'price': '$1,025,000', 'bedrooms': '1 bd', 'bathrooms': '1 ba', 'sqft': '956 sqft', 'type': 'Condo for sale'}, {'address': '267A Chattanooga St, San Francisco, CA 94114', 'price': '$1,740,000', 'bedrooms': '2 bds', 'bathrooms': '3 ba', 'sqft': '2,114 sqft', 'type': 'Condo for sale'}, {'address': '998 Union St, San Francisco, CA 94133', 'price': '$1,650,000', 'bedrooms': '2 bds', 'bathrooms': '1 ba', 'sqft': '1,181 sqft', 'type': 'Condo for sale'}, {'address': '37-39 Mirabel Ave, San Francisco, CA 94110', 'price': '$2,395,000', 'bedrooms': '7 bds', 'bathrooms': '6 ba', 'sqft': '2,300 sqft', 'type': 'Multi'}, {'address': '304 Yale St, San Francisco, CA 94134', 'price': '$1,399,900', 'bedrooms': '3 bds', 'bathrooms': '4 ba', 'sqft': '1,764 sqft', 'type': 'New construction'}, {'address': '173 Coleridge St, San Francisco, CA 94110', 'price': '$745,000', 'bedrooms': '2 bds', 'bathrooms': '2 ba', 'sqft': '905 sqft', 'type': 'Condo for sale'}, {'address': '289 Sadowa St, San Francisco, CA 94112', 'price': '$698,000', 'bedrooms': '4 bds', 'bathrooms': '2 ba', 'sqft': '1,535 sqft', 'type': 'House for sale'}, {'address': '1739 19th Ave, San Francisco, CA 94122', 'price': '$475,791', 'bedrooms': '2 bds', 'bathrooms': '2 ba', 'sqft': '1,780 sqft', 'type': 'Townhouse for sale'}, {'address': '1725 Quesada Ave, San Francisco, CA 94124', 'price': '$600,000', 'bedrooms': '3 bds', 'bathrooms': '2 ba', 'sqft': '1,011 sqft', 'type': 'Condo for sale'}]

Please remember that web scraping should respect the website’s robots.txt file and terms of service, and excessive scraping may lead to your IP being blocked.
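
If you want to check a site’s robots.txt programmatically before fetching a page, Python’s standard library ships with urllib.robotparser. Here’s a minimal sketch:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.zillow.com/robots.txt')
rp.read()

# True only if the given user agent is allowed to fetch the URL
print(rp.can_fetch('*', 'https://www.zillow.com/homes/for_sale/San-Francisco_rb/'))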

Save Your Data

Now that you’ve extracted your data, you need to save it in a JSON or CSV file. Saving the data in a file allows you to process it and create analytics based on what you’ve collected.

To save the data, start by importing the pandas and json libraries at the top of your scraper.py file:

import pandas as pd
import json

Then add the following code at the end of your file:

# Write data to JSON file
with open('listings.json', 'w') as f:
    json.dump(listings, f)
print('Data written to Json file')

# Write data to CSV file
df = pd.DataFrame(listings)
df.to_csv('listings.csv', index=False)
print('Data written to CSV file')

This code writes the listings data, a list of dictionaries, to a JSON file named listings.json, using json.dump(). It then creates a pandas DataFrame from the listings data and writes it to a CSV file named listings.csv using the to_csv() method. The code prints messages indicating that the data has been successfully written to both the JSON and CSV files.

Next, run the code from your shell or terminal:

python3 scraper.py
…output…
[{'address': '19 Tehama St SUITE 3, San Francisco, CA 94105', 'price': '$1,025,000', 'bedrooms': '1 bd', 'bathrooms': '1 ba', 'sqft': '956 sqft', 'type': 'Condo for sale'}, {'address': '267A Chattanooga St, San Francisco, CA 94114', 'price': '$1,740,000', 'bedrooms': '2 bds', 'bathrooms': '3 ba', 'sqft': '2,114 sqft', 'type': 'Condo for sale'}, {'address': '998 Union St, San Francisco, CA 94133', 'price': '$1,650,000', 'bedrooms': '2 bds', 'bathrooms': '1 ba', 'sqft': '1,181 sqft', 'type': 'Condo for sale'}, {'address': '37-39 Mirabel Ave, San Francisco, CA 94110', 'price': '$2,395,000', 'bedrooms': '7 bds', 'bathrooms': '6 ba', 'sqft': '2,300 sqft', 'type': 'Multi'}, {'address': '304 Yale St, San Francisco, CA 94134', 'price': '$1,399,900', 'bedrooms': '3 bds', 'bathrooms': '4 ba', 'sqft': '1,764 sqft', 'type': 'New construction'}, {'address': '173 Coleridge St, San Francisco, CA 94110', 'price': '$745,000', 'bedrooms': '2 bds', 'bathrooms': '2 ba', 'sqft': '905 sqft', 'type': 'Condo for sale'}, {'address': '289 Sadowa St, San Francisco, CA 94112', 'price': '$698,000', 'bedrooms': '4 bds', 'bathrooms': '2 ba', 'sqft': '1,535 sqft', 'type': 'House for sale'}, {'address': '1739 19th Ave, San Francisco, CA 94122', 'price': '$475,791', 'bedrooms': '2 bds', 'bathrooms': '2 ba', 'sqft': '1,780 sqft', 'type': 'Townhouse for sale'}, {'address': '1725 Quesada Ave, San Francisco, CA 94124', 'price': '$600,000', 'bedrooms': '3 bds', 'bathrooms': '2 ba', 'sqft': '1,011 sqft', 'type': 'Condo for sale'}]
Data written to Json file
Data written to CSV file

If it works, you should find two new files created in your project directory: a listings.csv file and a listings.json file. These two files should have similar content to these GitHub repo files, respectively: listings.csv and listings.json.
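
To verify the export, you can load the files back and inspect a few rows, using the pandas import already at the top of scraper.py:

df = pd.read_csv('listings.csv')
print(df.head())        # first few listings
print(len(df), 'rows')  # total number of listings saved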

If you run the code several times, you’ll notice a high failure rate (around 50 percent). This is because Zillow sometimes returns a CAPTCHA page instead of the actual content when it detects automated scraping. To achieve a better success rate when scraping a website like Zillow, you need tools that can rotate between different IPs and bypass CAPTCHAs.
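
One simple heuristic for detecting a blocked run in the earlier script: a real results page contains property cards, while a CAPTCHA page doesn’t. This is a sketch, not a guaranteed detection method:

# After parsing the response with Beautiful Soup:
cards = soup.find_all('div', {'class': 'property-card-data'})
if not cards:
    # Likely a CAPTCHA or block page rather than real results
    print('No property cards found - possibly blocked; retry later or rotate IPs')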

Anti-scraping Techniques Employed by Zillow

To prevent unauthorized data collection, Zillow employs several methods to stop automated data extraction (aka scraping) from its website. These methods include using CAPTCHAs, blocking IP addresses, and setting up honeypot traps.

A CAPTCHA is a test to tell if a user is a human or a computer program. It’s typically easy for humans to solve but hard for programs and can slow down or even stop data scraping.

Another way Zillow stops scraping is by blocking IP addresses, which identify computers on a network much like street addresses identify houses. If a single address makes too many requests, which often happens with data scraping, Zillow can block it to stop any further requests. These blocks can be short term or long term, depending on how serious the situation is.
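
If you adapt the Requests-based script to fetch multiple pages, spacing out your requests reduces the chance of tripping rate limits. A minimal throttling sketch (the URL list here is illustrative):

import random
import time

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # use a realistic User-Agent, as shown earlier
urls = ['https://www.zillow.com/homes/for_sale/San-Francisco_rb/']  # pages to fetch

for url in urls:
    response = requests.get(url, headers=headers)
    # ...parse response.content with Beautiful Soup here...
    time.sleep(random.uniform(2, 5))  # pause 2-5 seconds between requests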

Zillow also uses honeypot traps. These traps are bits of data or links that can only be seen by programs, not humans. If a program interacts with a honeypot trap, Zillow knows it’s a bot and can block it.

All these methods make it hard to scrape data from Zillow: it can be time-consuming, difficult, and sometimes impossible. Anyone who wants to scrape data from Zillow not only needs to know about these methods but also needs to understand the legal and ethical issues around data scraping. And remember, Zillow may change these methods at any time without notifying the public.

A Better Alternative: Use Bright Data to Scrape Zillow

Bright Data provides a better alternative to scraping Zillow yourself by circumventing the website’s anti-scraping techniques with its Scraping Browser. The Scraping Browser lets you run Puppeteer or Playwright scripts on Bright Data’s network, which provides access to millions of IP addresses and helps prevent detection by Zillow’s anti-scraping techniques.

Scrape Zillow Using Bright Data’s Scraping Browser

To scrape Zillow using Bright Data’s Scraping Browser, follow these steps:

1. Create a Bright Data account

If you don’t already have a Bright Data account, visit Bright Data’s website, click on Start free trial, and follow the prompts.

Once you’re logged into your Bright Data account, navigate to Billing by clicking on the credit card icon on the bottom left of your navigation bar. Add a payment method based on your preferred option; otherwise, you won’t be able to activate your account:

Create a Bright Data account

Next, click on the pin icon, which opens the Proxies & Scraping Infrastructure page; then select Scraping Browser > Get started:

Proxies & Scraping Infrastructure

Next, specify your Solution name; then click on the Add button:

Add new proxy solution

Then click on Access parameters and take note of your username, host, and password, as they’ll be needed later in the tutorial:

Scraping browser

Once you have completed the previous steps, you are ready to proceed.

2. Write the Scraper

Create a new file named scraper-brightdata.py and add the following code:

import asyncio
from playwright.async_api import async_playwright
import json
import pandas as pd

username='YOUR_BRIGHTDATA_USERNAME'
password='YOUR_BRIGHTDATA_PASSWORD'
auth=f'{username}:{password}'
host = 'YOUR_BRIGHTDATA_HOST'
browser_url = f'wss://{auth}@{host}'

async def main():
    async with async_playwright() as pw:
        print('Connecting to a remote browser...')
        browser = await pw.chromium.connect_over_cdp(browser_url)
        print('Connected. Opening new page...')
        page = await browser.new_page()
        print('Navigating to Zillow...')
        await page.goto('https://www.zillow.com/homes/for_sale/San-Francisco_rb/', timeout=3600000)
        print('Scraping data...')
        listings = []
        properties = await page.query_selector_all('div.property-card-data')
        # 'prop' avoids shadowing Python's built-in property()
        for prop in properties:
            result = {}
            address = await prop.query_selector('address[data-test="property-card-addr"]')
            result['address'] = await address.inner_text() if address else ''
            price = await prop.query_selector('span[data-test="property-card-price"]')
            result['price'] = await price.inner_text() if price else ''
            details = await prop.query_selector_all('ul.dmDolk > li')
            result['bedrooms'] = await details[0].inner_text() if len(details) >= 1 else ''
            result['bathrooms'] = await details[1].inner_text() if len(details) >= 2 else ''
            result['sqft'] = await details[2].inner_text() if len(details) >= 3 else ''
            # The property type appears after a hyphen; fall back to '' if missing
            type_div = await prop.query_selector('div.gxlfal')
            type_text = await type_div.inner_text() if type_div else ''
            result['type'] = type_text.split('-')[1].strip() if '-' in type_text else type_text.strip()
            listings.append(result)
        await browser.close()
        return listings

# Run the asynchronous function
listings = asyncio.run(main()) 

# Print the listings
for listing in listings:
    print(listing)

# Write data to Json file
with open('listings-brightdata.json', 'w') as f:
    json.dump(listings, f)
print('Data written to Json file')

# Write data to csv
df = pd.DataFrame(listings)
df.to_csv('listings-brightdata.csv', index=False)
print('Data written to CSV file')

Make sure to replace YOUR_BRIGHTDATA_USERNAME, YOUR_BRIGHTDATA_PASSWORD, and YOUR_BRIGHTDATA_HOST with your actual Bright Data account credentials.
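
Rather than hard-coding credentials in the script, you may prefer to read them from environment variables (the variable names here are just examples):

import os

# Set these in your shell first, e.g. export BRIGHTDATA_USERNAME=...
username = os.environ['BRIGHTDATA_USERNAME']
password = os.environ['BRIGHTDATA_PASSWORD']
host = os.environ['BRIGHTDATA_HOST']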

3. Run the Scraper

Save the changes to scraper-brightdata.py and run the code from your shell or terminal:

python3 scraper-brightdata.py
…output…
Connecting to a remote browser...
Connected. Opening new page...
Navigating to Zillow...
Scraping data...
{'address': '1438 Green St UNIT 2B, San Francisco, CA 94109', 'price': '$995,000', 'bedrooms': '1 bd', 'bathrooms': '1 ba', 'sqft': '974 sqft', 'type': 'Condo for sale'}
{'address': '815 Tennessee St UNIT 504, San Francisco, CA 94107', 'price': '$1,195,000', 'bedrooms': '2 bds', 'bathrooms': '2 ba', 'sqft': '-- sqft', 'type': ''}
{'address': '455 27th Ave, San Francisco, CA 94121', 'price': '$1,375,000', 'bedrooms': '2 bds', 'bathrooms': '1 ba', 'sqft': '1,040 sqft', 'type': 'House for sale'}
{'address': '19 Tehama St SUITE 3, San Francisco, CA 94105', 'price': '$1,025,000', 'bedrooms': '1 bd', 'bathrooms': '1 ba', 'sqft': '956 sqft', 'type': 'Condo for sale'}
{'address': '267A Chattanooga St, San Francisco, CA 94114', 'price': '$1,740,000', 'bedrooms': '2 bds', 'bathrooms': '3 ba', 'sqft': '2,114 sqft', 'type': 'Condo for sale'}
{'address': '998 Union St, San Francisco, CA 94133', 'price': '$1,650,000', 'bedrooms': '2 bds', 'bathrooms': '1 ba', 'sqft': '1,181 sqft', 'type': 'Condo for sale'}
{'address': '37-39 Mirabel Ave, San Francisco, CA 94110', 'price': '$2,395,000', 'bedrooms': '7 bds', 'bathrooms': '6 ba', 'sqft': '2,300 sqft', 'type': 'Multi'}
{'address': '304 Yale St, San Francisco, CA 94134', 'price': '$1,399,900', 'bedrooms': '3 bds', 'bathrooms': '4 ba', 'sqft': '1,764 sqft', 'type': 'New construction'}
{'address': '173 Coleridge St, San Francisco, CA 94110', 'price': '$745,000', 'bedrooms': '2 bds', 'bathrooms': '2 ba', 'sqft': '905 sqft', 'type': 'Condo for sale'}
Data written to Json file
Data written to CSV file

This code connects to the Bright Data Scraping Browser, navigates to the Zillow search results page, and extracts the data. Next, the code prints the results and writes the listings data, a list of dictionaries, to a JSON file named listings-brightdata.json using json.dump(). Then it creates a pandas DataFrame from the listings data and writes it to a CSV file named listings-brightdata.csv using the to_csv() method. Finally, it prints messages indicating that the data has been successfully written to both files.

If it works, you should find two files: a listings-brightdata.csv file and a listings-brightdata.json file. These files should be similar to listings-brightdata.json and listings-brightdata.csv.

If you run this code several times and notice that no data is saved to your files, it means either that Zillow blocked your IP or that the browser closed before the scraping finished. If the browser closed too early, increase the timeout to a larger value in the page.goto() call, which in the earlier code is await page.goto('https://www.zillow.com/homes/for_sale/San-Francisco_rb/', timeout=3600000).
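
You can also wait explicitly for the property cards to render before scraping, which makes the script less sensitive to slow page loads. Playwright’s wait_for_selector is suited to this; a sketch of the line you’d add right after the page.goto() call:

# Wait up to 2 minutes for at least one property card to appear
await page.wait_for_selector('div.property-card-data', timeout=120000)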

If your IP got blocked by Zillow, you need to change your zone, and thankfully, Bright Data gives you access to multiple zones.

To switch between different zones, go to Proxies & Scraping Infrastructure by clicking on the pin icon, then select Scraping Browser and click on Access parameters. Next, click on </> Check out code and integration examples:

Check out code and integration examples

Select Python as the language; in the right-hand navigation, there’s a Country drop-down list. Select the country you want, and your zone is updated accordingly. You should see the auth variable change in the Python sample code. Grab the user for that zone from the auth variable; it’s the value before the colon, since auth holds the username and password in the form username:password:

proxy integration example

Each time you change the country, you get a different user for that specific country/zone. Take that user, put it in your code, and run the scraper again.
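
For illustration, the zone- and country-specific username typically looks something like the following; the exact format may differ, so always copy the real value from your Access parameters page:

# Illustrative shape only - copy the actual value from Access parameters
username = 'brd-customer-<customer_id>-zone-<zone_name>-country-us'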

Conclusion

In this tutorial, you learned how to scrape Zillow using Beautiful Soup, as well as which anti-scraping techniques Zillow employs and how to circumvent them. To address these issues, the Bright Data Scraping Browser was introduced, helping you bypass Zillow’s anti-scraping mechanisms and seamlessly extract the desired data.