This tutorial will cover:
- Why Scrape Yelp?
- Yelp Scraping Libraries and Tools
- Scraping Yelp Business Data With Beautiful Soup
Why Scrape Yelp?
There are several reasons businesses scrape Yelp. These include:
- Gain access to comprehensive business data: It provides a wealth of info about local businesses, including reviews, ratings, contact information, and more.
- Get insights into customer feedback: It is known for its user reviews, providing a treasure trove of insights into customer opinions and experiences.
- Perform competitive analysis and benchmarking: It offers valuable insights into your competitors’ performance, strengths, weaknesses, and customer sentiment.
There are similar platforms, but Yelp is the preferred choice for data scraping because of its:
- Extensive user base
- Diverse business categories
- Well-established reputation
The data scraped from Yelp can be valuable for market research, competitor analysis, sentiment analysis, and decision-making. Such information also helps you identify areas for improvement, fine-tune your offerings, and stay ahead of the competition.
Yelp Scraping Libraries and Tools
Python is widely regarded as an excellent language for web scraping due to its user-friendly nature, straightforward syntax, and extensive range of libraries. That is why it is the recommended programming language for scraping Yelp. To learn more about it, check out our in-depth guide on how to do web scraping with Python.
The next step involves selecting the appropriate scraping libraries from the vast array of options available. To make an informed decision, you should first explore the platform in a web browser. By inspecting the AJAX calls made by web pages, you will discover that the majority of data is embedded within the HTML documents retrieved from the server.
This implies that a simple HTTP client to make requests to the server, combined with an HTML parser, will be enough for the task. Here is what you should go for:
- Requests: The most popular HTTP client library for Python. It streamlines the process of sending HTTP requests and handling their corresponding responses.
- Beautiful Soup: A comprehensive HTML and XML parsing library extensively employed for web scraping. It provides robust methods for navigating and extracting data from the DOM.
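To see how the two libraries fit together, here is a minimal sketch that downloads a page and extracts its title (using example.com as a neutral stand-in for a real target):
import requests
from bs4 import BeautifulSoup

# download a page and parse its HTML
page = requests.get('https://example.com')
soup = BeautifulSoup(page.text, 'html.parser')
# print the content of the <title> tag
print(soup.title.text)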
Thanks to Requests and Beautiful Soup, you can effectively scrape Yelp using Python. Let’s jump into the details of how to accomplish this task!
Scraping Yelp Business Data With Beautiful Soup
Follow this step-by-step tutorial and learn how to build a Yelp scraper.
Step 1: Python project setup
Before getting started, you first need to make sure you have:
- Python 3+ installed on your computer: Download the installer, execute it, and follow the instructions.
- A Python IDE of your choice: Visual Studio Code with the Python extension or PyCharm Community Edition will both be fine.
First, create a yelp-scraper folder and initialize it as a Python project with a virtual environment:
mkdir yelp-scraper
cd yelp-scraper
python -m venv env
On Windows, run the command below to activate the environment:
env\Scripts\activate.ps1
While on Linux or macOS:
source env/bin/activate
Next, add a scraper.py file containing the line below in the project folder:
print('Hello, World!')
This is the simplest possible Python script. Right now, it only prints “Hello, World!” but it will soon contain the logic to scrape Yelp.
You can launch the scraper with:
python scraper.py
It should print in the terminal:
Hello, World!
Exactly what was expected. Now that you know that everything works, open the project folder in your Python IDE.
Great, get ready to write some Python code!
Step 2: Install the scraping libraries
You now have to add the libraries needed to perform web scraping to the project’s dependencies. In the activated virtual environment, run the following command to install Beautiful Soup and Requests:
pip install beautifulsoup4 requests
Clear the scraper.py file and then add these lines to import the packages:
import requests
from bs4 import BeautifulSoup
# scraping logic...
Make sure that your Python IDE does not report any errors. You may get some warnings because of unused imports, but you can ignore them. You are about to use those scraping libraries to extract data from Yelp.
Step 3: Identify and download the target page
Browse Yelp and identify the page you want to scrape. In this guide, you will see how to retrieve data from the list of New York’s top-rated Italian restaurants.
Assign the URL of the target page to a variable:
url = 'https://www.yelp.com/search?find_desc=Italian&find_loc=New+York%2C+NY'
Next, use requests.get() to make an HTTP GET request to that URL:
page = requests.get(url)
The variable page will now contain the response produced by the server.
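Note that Yelp may block requests that carry the default Requests User-Agent. If that happens, a common workaround is to send a browser-like header and fail fast on HTTP errors (the User-Agent string below is only an example value, not part of the original script):
# optional: spoof a browser-like User-Agent and stop early on errors
request_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
page = requests.get(url, headers=request_headers)
page.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses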
Specifically, page.text stores the HTML document associated with the target webpage. You can verify that by logging it:
print(page.text)
This should print:
<!DOCTYPE html><html lang="en-US" prefix="og: http://ogp.me/ns#" style="margin: 0;padding: 0; border: 0; font-size: 100%; font: inherit; vertical-align: baseline;"><head><script>document.documentElement.className=document.documentElement.className.replace(/no-js/,"js");</script><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta http-equiv="Content-Language" content="en-US" /><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><link rel="mask-icon" sizes="any" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b2bb2fb0ec9c/assets/img/logos/yelp_burst.svg" content="#FF1A1A"><link rel="shortcut icon" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/dcfe403147fc/assets/img/logos/favicon.ico"><script> window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;window.ygaPageStartTime=new Date().getTime();</script><script>
<!-- Omitted for brevity... -->
Perfect! Let’s learn how to parse it to retrieve data from it.
Step 4: Parse the HTML content
Feed the HTML content retrieved by the server to the BeautifulSoup() constructor to parse it:
soup = BeautifulSoup(page.text, 'html.parser')
The function takes two arguments:
- The string containing the HTML.
- The parser that Beautiful Soup will use to go through the content.
“html.parser” is the name of the Python built-in HTML parser.
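If you need faster parsing, Beautiful Soup also accepts third-party parsers such as lxml. An optional swap, assuming you install it with pip install lxml first:
# lxml is a faster drop-in alternative to the built-in parser
soup = BeautifulSoup(page.text, 'lxml')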
BeautifulSoup() will return the parsed content as an explorable tree structure. In particular, the soup variable exposes useful methods for selecting elements from the DOM tree. The most popular are:
- find(): Returns the first HTML element matching the selector strategy passed as a parameter.
- find_all(): Returns the list of HTML elements matching the input selector strategy.
- select_one(): Returns the first HTML element matching the CSS selector passed as a parameter.
- select(): Returns the list of HTML elements matching the input CSS selector.
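To make the difference concrete, here is how the four methods compare (a quick illustration, assuming soup was built as shown above):
first_heading = soup.find('h3')  # first <h3> element, or None
all_headings = soup.find_all('h3')  # list of every <h3> element
first_card = soup.select_one('[data-testid="serp-ia-card"]')  # first CSS match, or None
all_cards = soup.select('[data-testid="serp-ia-card"]')  # list of all CSS matches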
Fantastic! You will soon use them to extract the desired data from Yelp.
Step 5: Familiarize yourself with the page
To devise an effective selection strategy, you must first get familiar with the structure of the target webpage. Open it in your browser and begin to explore it.
Right-click on an HTML element on the page, and select “Inspect” to open the DevTools.
You will immediately notice that the site relies on CSS classes that appear to be randomly generated at build time. Since they might change with each deployment, you should not base your CSS selectors on them. This is essential to know when building an effective scraper.
If you dig into the DOM, you will also see that the most important elements have distinctive HTML attributes. Thus, your selector strategy should rely on them.
Keep inspecting the page in the DevTools until you feel ready to scrape it with Python!
Step 6: Extract the business data
The goal here is to extract business information from each card on the page. To keep track of this data, you will need a data structure to store it in:
items = []
First, inspect a card HTML element in the DevTools. Note that you can select them all with:
html_item_cards = soup.select('[data-testid="serp-ia-card"]')
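As a quick sanity check, you can print how many cards were matched (an optional line, not part of the final script):
print(len(html_item_cards))  # the number of business cards found on the page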
Iterate over them and prepare your script to:
- Extract data from each of them.
- Save it in a Python dictionary item.
- Add it to items.
for html_item_card in html_item_cards:
    item = {}
    # scraping logic...
    items.append(item)
Time to implement the scraping logic!
Inspect the image element, then retrieve the URL of the business image with:
image = html_item_card.select_one('[data-lcp-target-id="SCROLLABLE_PHOTO_BOX"] img').attrs['src']
After retrieving an element with select_one(), you can access its HTML attributes through the attrs member.
Other useful information to retrieve is the business name and the URL of its detail page. As you can see in the DevTools, you can get both data fields from the h3 a node:
name = html_item_card.select_one('h3 a').text
url = 'https://www.yelp.com' + html_item_card.select_one('h3 a').attrs['href']
The text attribute returns the text content within the current element and all its children. Because some links are relative, you may need to add the base URL to complete them.
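If you prefer not to concatenate strings by hand, urllib.parse.urljoin from the standard library handles both relative and absolute hrefs (an alternative to the plain concatenation shown above):
from urllib.parse import urljoin

# urljoin resolves relative URLs and leaves absolute ones untouched
url = urljoin('https://www.yelp.com', html_item_card.select_one('h3 a').attrs['href'])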
One of the most important pieces of data on Yelp is the user review rating. In this case, there is no easy way to get it, but you can still achieve the goal with:
html_stars_element = html_item_card.select_one('[class^="five-stars"]')
stars = html_stars_element.attrs['aria-label'].replace(' star rating', '')
reviews = html_stars_element.parent.parent.next_sibling.text
Notice the use of Python’s replace() string method to clean up the string and keep only the relevant data.
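Keep in mind that select_one() returns None when nothing matches. If you want the scraper to tolerate cards without a rating, a defensive variant (an addition, not part of the original logic) could look like this:
# skip the rating gracefully when the element is missing
html_stars_element = html_item_card.select_one('[class^="five-stars"]')
if html_stars_element is not None:
    stars = html_stars_element.attrs['aria-label'].replace(' star rating', '')
    reviews = html_stars_element.parent.parent.next_sibling.text
else:
    stars = None
    reviews = None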
Inspect the tags and the price range elements as well. To collect all tag strings, you need to select them all and iterate over them:
tags = []
html_tag_elements = html_item_card.select('[class^="priceCategory"] button')
for html_tag_element in html_tag_elements:
    tag = html_tag_element.text
    tags.append(tag)
Instead, retrieving the optional price range indication is easier. Since not every card shows it, initialize the variable first so it is always defined:
price_range = None
price_range_html = html_item_card.select_one('[class^="priceRange"]')
# since the price range info is optional
if price_range_html is not None:
    price_range = price_range_html.text
Finally, you should also scrape the services offered by the restaurant. Again, you need to iterate over every single node:
services = []
html_service_elements = html_item_card.select('[data-testid="services-actions-component"] p[class^="tagText"]')
for html_service_element in html_service_elements:
    service = html_service_element.text
    services.append(service)
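As a side note, both loops above can be written more compactly with list comprehensions, if you prefer that style:
# equivalent, more compact versions of the two loops
tags = [el.text for el in html_item_card.select('[class^="priceCategory"] button')]
services = [el.text for el in html_item_card.select('[data-testid="services-actions-component"] p[class^="tagText"]')]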
Well done! You just implemented the scraping logic.
Add the scraped data variables to the dictionary:
item['name'] = name
item['image'] = image
item['url'] = url
item['stars'] = stars
item['reviews'] = reviews
item['tags'] = tags
item['price_range'] = price_range
item['services'] = services
Use print(item) to make sure the data extraction process works as desired. On the first card, you will get:
{'name': 'Olio e Più', 'image': 'https://s3-media0.fl.yelpcdn.com/bphoto/CUpPgz_Q4QBHxxxxDJJTTA/348s.jpg', 'url': 'https://www.yelp.com/biz/olio-e-pi%C3%B9-new-york-7?osq=Italian', 'stars': '4.5', 'reviews': '4588', 'tags': ['Pizza', 'Italian', 'Cocktail Bars'], 'price_range': '$$', 'services': ['Outdoor seating', 'Delivery', 'Takeout']}
Awesome! You are closer to your goal!
Step 7: Implement the crawling logic
Do not forget that businesses are presented to users in a paginated list. You just saw how to scrape a single page, but what if you wanted to get all the data? To do so, you will have to integrate web crawling into the Yelp data scraper.
First, define some support data structures on top of your script:
visited_pages = []
pages_to_scrape = ['https://www.yelp.com/search?find_desc=Italian&find_loc=New+York%2C+NY']
visited_pages will contain the URLs of the pages already scraped, while pages_to_scrape contains the next ones to visit.
Create a while loop that terminates when there are no longer pages to scrape or after a specific number of iterations:
limit = 5 # in production, you can remove it
i = 0
while len(pages_to_scrape) != 0 and i < limit:
    # extract the first page from the array
    url = pages_to_scrape.pop(0)
    # mark it as "visited"
    visited_pages.append(url)
    # download and parse the page
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # scraping logic...
    # crawling logic...
    # increment the page counter
    i += 1
Each iteration will take care of removing one page from the list, scraping it, discovering new pages, and adding them to the queue. limit simply prevents the scraper from running forever.
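To be gentler on Yelp’s servers, you may also want to pause between page downloads. A simple throttle (an optional addition, with an assumed one-second delay):
import time

# place this inside the while loop, after each request
time.sleep(1)  # wait one second before fetching the next page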
It only remains to implement the crawling logic. If you inspect the HTML pagination element, you will see that it consists of several links. Collect them all and add the newly discovered ones to pages_to_scrape with:
pagination_link_elements = soup.select('[class^="pagination-links"] a')
for pagination_link_element in pagination_link_elements:
    pagination_url = pagination_link_element.attrs['href']
    # if the discovered URL is new
    if pagination_url not in visited_pages and pagination_url not in pages_to_scrape:
        pages_to_scrape.append(pagination_url)
Wonderful! Now your scraper will automatically go through all the pagination pages.
Step 8: Export scraped data to CSV
The final step is to make the collected data easier to share and read. The best way to do that is to export it to a human-readable format, such as CSV:
import csv

# ...

# extract the keys from the first scraped item
# to use them as the headers of the CSV
headers = items[0].keys()

# initialize the .csv output file
with open('restaurants.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=headers, quoting=csv.QUOTE_ALL)
    writer.writeheader()

    # populate the CSV file
    for item in items:
        # transform array fields from "['element1', 'element2', ...]"
        # to "element1; element2; ..."
        csv_item = {}
        for key, value in item.items():
            if isinstance(value, list):
                csv_item[key] = '; '.join(str(e) for e in value)
            else:
                csv_item[key] = value
        # add a new record
        writer.writerow(csv_item)
Create a restaurants.csv file with open(). Then, use DictWriter and some custom logic to populate it, with headers taken from the keys of the first scraped item. Since the csv package comes from the Python Standard Library, no additional dependencies need to be installed.
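If you would rather keep the list fields (tags, services) intact instead of flattening them, exporting to JSON is a compact alternative (not part of the original tutorial flow):
import json

# JSON preserves nested lists without any custom transformation
with open('restaurants.json', 'w', encoding='utf-8') as json_file:
    json.dump(items, json_file, ensure_ascii=False, indent=2)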
Great! You started from raw data contained in a webpage and now have semi-structured CSV data. It is time to take a look at the entire Yelp Python scraper.
Step 9: Put it all together
Here is what the complete scraper.py script looks like:
import requests
from bs4 import BeautifulSoup
import csv

# support data structures to implement the
# crawling logic
visited_pages = []
pages_to_scrape = ['https://www.yelp.com/search?find_desc=Italian&find_loc=New+York%2C+NY']

# to store the scraped data
items = []

# to avoid overwhelming Yelp's servers with requests
limit = 5
i = 0

# until all pagination pages have been visited
# or the page limit is hit
while len(pages_to_scrape) != 0 and i < limit:
    # extract the first page from the array
    url = pages_to_scrape.pop(0)
    # mark it as "visited"
    visited_pages.append(url)

    # download and parse the page
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    # select all item cards
    html_item_cards = soup.select('[data-testid="serp-ia-card"]')
    for html_item_card in html_item_cards:
        # scraping logic
        item = {}

        image = html_item_card.select_one('[data-lcp-target-id="SCROLLABLE_PHOTO_BOX"] img').attrs['src']
        name = html_item_card.select_one('h3 a').text
        url = 'https://www.yelp.com' + html_item_card.select_one('h3 a').attrs['href']

        html_stars_element = html_item_card.select_one('[class^="five-stars"]')
        stars = html_stars_element.attrs['aria-label'].replace(' star rating', '')
        reviews = html_stars_element.parent.parent.next_sibling.text

        tags = []
        html_tag_elements = html_item_card.select('[class^="priceCategory"] button')
        for html_tag_element in html_tag_elements:
            tag = html_tag_element.text
            tags.append(tag)

        # this HTML element is optional, so initialize the
        # variable to avoid a NameError when it is missing
        price_range = None
        price_range_html = html_item_card.select_one('[class^="priceRange"]')
        if price_range_html is not None:
            price_range = price_range_html.text

        services = []
        html_service_elements = html_item_card.select('[data-testid="services-actions-component"] p[class^="tagText"]')
        for html_service_element in html_service_elements:
            service = html_service_element.text
            services.append(service)

        # add the scraped data to the object
        # and then the object to the array
        item['name'] = name
        item['image'] = image
        item['url'] = url
        item['stars'] = stars
        item['reviews'] = reviews
        item['tags'] = tags
        item['price_range'] = price_range
        item['services'] = services
        items.append(item)

    # discover new pagination pages and add them to the queue
    pagination_link_elements = soup.select('[class^="pagination-links"] a')
    for pagination_link_element in pagination_link_elements:
        pagination_url = pagination_link_element.attrs['href']
        # if the discovered URL is new
        if pagination_url not in visited_pages and pagination_url not in pages_to_scrape:
            pages_to_scrape.append(pagination_url)

    # increment the page counter
    i += 1

# extract the keys from the first object in the array
# to use them as headers of the CSV
headers = items[0].keys()

# initialize the .csv output file
with open('restaurants.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=headers, quoting=csv.QUOTE_ALL)
    writer.writeheader()

    # populate the CSV file
    for item in items:
        # transform array fields from "['element1', 'element2', ...]"
        # to "element1; element2; ..."
        csv_item = {}
        for key, value in item.items():
            if isinstance(value, list):
                csv_item[key] = '; '.join(str(e) for e in value)
            else:
                csv_item[key] = value
        # add a new record
        writer.writerow(csv_item)
In around 100 lines of code, you can build a web spider to extract business data from Yelp.
Run the scraper with:
python scraper.py
Wait for the execution to complete, and you will find a restaurants.csv file in the root folder of your project.
Congrats! You just learned how to scrape Yelp in Python!
Conclusion
In this step-by-step guide, you saw why Yelp is one of the best scraping targets for user data about local businesses. In detail, you learned how to build a Python scraper that can retrieve Yelp data. As shown here, it takes only a few lines of code.
At the same time, sites keep evolving and adapting their UI and structure to the ever-changing expectations of users. The scraper built here works today but may no longer be effective tomorrow. To avoid spending time and money on maintenance, try out our Yelp scraper!
Also, keep in mind that most sites rely heavily on JavaScript. In these scenarios, a traditional approach based on an HTML parser will not work. Instead, you will have to use a tool that can render JavaScript and handle fingerprinting, CAPTCHAs, and automatic retries for you. This is exactly what our new Scraping Browser solution is all about!
Don’t want to deal with web scraping Yelp and just want the data? Purchase Yelp datasets.
Note: This guide was thoroughly tested by our team at the time of writing, but as websites frequently update their code and structure, some steps may no longer work as expected.