In this step-by-step guide, you will learn how to perform web scraping on YouTube using Python.
This tutorial will cover:
- YouTube API vs. YouTube scraping
- What data to scrape from YouTube
- Scraping YouTube with Selenium
YouTube API vs. YouTube Scraping
YouTube Data API is the official way to get data from the platform, including information about videos, playlists, and content creators. However, there are at least three good reasons why scraping YouTube is better than relying solely on its API:
- Flexibility and Customization: With a YouTube spider, you can tailor the code to select only the data you need. This level of customization helps you collect the exact information for your specific use case. In contrast, the API only gives you access to predefined data.
- Access to unofficial data: The API provides access to specific sets of data selected by YouTube. This means that some data you currently rely on may no longer be available in the future. Web scraping instead lets you obtain any additional information available on the YouTube website, even if it is not exposed through the API.
- No limitations: YouTube APIs are subject to rate limiting. This restriction determines the frequency and volume of requests you can make in a given time frame. By interacting directly with the platform, you can avoid such limitations.
What Data to Scrape From YouTube
Main data fields to scrape from YouTube
- Video metadata:
- Title
- Description
- Views
- Likes
- Duration
- Publication date
- Channel
- User profiles:
- Username
- User Description
- Subscribers
- Number of videos
- Playlists
- Other:
- Comments
- Related videos
As seen earlier, the best way to get this data is through a custom scraper. But which programming language should you choose?
Python is one of the most popular languages for web scraping thanks to its simple syntax and rich ecosystem of libraries. Its versatility, readability, and extensive community support make it an excellent option. Check out our in-depth guide to get started on web scraping with Python.
Scraping YouTube With Selenium
Follow this tutorial and learn how to build a YouTube web scraping Python script.
Step 1: Setup
Before coding, you need to meet the following prerequisites:
- Python 3+: Download the installer, double-click on it, and follow the instructions.
- A Python IDE: PyCharm Community Edition or Visual Studio Code with the Python extension are two great free options.
You can initialize a Python project with a virtual environment using the commands below:
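For example, assuming a Unix-like shell (the directory and environment names below follow the tutorial; adjust the activation command on Windows):

```shell
mkdir youtube-scraper
cd youtube-scraper
python -m venv env
# on Linux/macOS:
source env/bin/activate
# on Windows (PowerShell):
# env\Scripts\Activate.ps1
```
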
The `youtube-scraper` directory created above represents the project folder for your Python script. Open it in the IDE, create a `scraper.py` file, and initialize it as follows:
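A minimal starting version of `scraper.py`:

```python
print('Hello, World!')
```
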
Right now, this file is a sample script that only prints “Hello, World!” but it will soon contain the scraping logic.
Verify that the script works by pressing the run button of your IDE or with:
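From the project folder, with the virtual environment active:

```shell
python scraper.py
```
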
In the terminal, you should see:
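That is, the message printed by the sample script:

```
Hello, World!
```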
Perfect, you now have a Python project for your YouTube scraper.
Step 2: Choose and install the scraping libraries
If you spend some time visiting YouTube, you will notice that it is a highly interactive platform. Based on click and scroll operations, the site loads and renders data dynamically. This means that YouTube relies greatly on JavaScript.
Scraping YouTube requires a tool that can render web pages in a browser, just like Selenium! This tool makes it possible to scrape dynamic websites in Python, allowing you to perform automated tasks on websites in a browser.
Add Selenium and the Webdriver Manager packages to your project’s dependencies with:
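With `pip` in the activated virtual environment:

```shell
pip install selenium webdriver-manager
```
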
The installation task may take a while, so be patient.
`webdriver-manager` is not strictly necessary, but it makes it easier to manage web drivers in Selenium. Thanks to it, you do not have to manually download, install, and configure web drivers.
Get started with Selenium in `scraper.py`:
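A minimal starting point might look like this (the exact original snippet is not reproduced here):

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

# initialize a web driver instance to control a Chrome window
driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install())
)

# scraping logic...

# close the browser and free up its resources
driver.quit()
```
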
This script creates an instance of Chrome `WebDriver`, the object through which you programmatically control a Chrome window.
By default, Selenium starts the browser with its UI. Although this is useful for debugging, as you can watch live what the automated script is doing on the page, it takes a lot of resources. For this reason, you should configure Chrome to run in headless mode. Thanks to the `--headless=new` option, the controlled browser instance will be launched behind the scenes, with no UI.
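You can pass the option through `ChromeOptions` (this fragment assumes the imports from the previous snippet):

```python
options = webdriver.ChromeOptions()
options.add_argument('--headless=new')

driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install()),
    options=options
)
```
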
Perfect! Time to define the scraping logic!
Step 3: Connect to YouTube
To perform web scraping on YouTube, you must first select a video to extract data from. In this guide, you are going to see how to scrape the latest video from Bright Data’s YouTube channel. Keep in mind that any other video will do.
Here is the YouTube page chosen as a target:
It is a video on web scraping entitled “Introduction to Bright Data | Scraping Browser.”
Store the URL string in a Python variable:
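The exact video URL used in the tutorial is not reproduced here; `<VIDEO_ID>` below is a placeholder for the ID of whatever video you pick:

```python
# the URL of the target page (replace <VIDEO_ID> with a real video ID)
url = 'https://www.youtube.com/watch?v=<VIDEO_ID>'
```
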
You can now instruct Selenium to connect to the target page with:
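Assuming the `driver` and `url` variables defined earlier:

```python
driver.get(url)
```
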
The `get()` function tells the controlled browser to visit the page identified by the URL passed as a parameter.
This is what your YouTube scraper looks like so far:
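Reconstructed from the steps so far (the `<VIDEO_ID>` placeholder stands in for the tutorial's video URL; the headless option is left commented out so you can watch the browser window):

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

# the URL of the target page (replace <VIDEO_ID> with a real video ID)
url = 'https://www.youtube.com/watch?v=<VIDEO_ID>'

options = webdriver.ChromeOptions()
# uncomment the line below to run Chrome in headless mode
# options.add_argument('--headless=new')

driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install()),
    options=options
)

# visit the target page
driver.get(url)

driver.quit()
```
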
If you run the script, it will open the browser window below for a split second before closing it due to the `quit()` instruction:
Note the “Chrome is being controlled by automated test software” message, which confirms that Selenium is controlling Chrome as expected.
Step 4: Inspect the target page
Have a look at the previous screenshot. When you open YouTube for the first time, a consent dialog appears. To access the data on the page, you must first close it by clicking the “Accept all” button. Let’s learn how to do so!
To create a new browser session, open YouTube in incognito mode. Right-click on the consent modal, and select “Inspect.” This will open the Chrome DevTools section:
Note that the dialog has an `id` attribute. This is useful information for defining an effective selector strategy in Selenium.
Similarly, inspect the “Accept all” button:
It is the second button identified by the CSS selector below:
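At the time of writing, a selector along these lines matched it (YouTube's markup changes often, so verify it in DevTools before relying on it):

```
.eom-buttons button.yt-spec-button-shape-next
```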
Put it all together and use these lines of code to deal with the YouTube cookie policy in Selenium:
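A sketch of that logic, assuming the `id` and CSS selector observed at the time of writing (the imports used here are the ones added later in this step):

```python
try:
    # wait up to 15 seconds for the consent dialog to show up
    consent_overlay = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.ID, 'dialog'))
    )

    # click the "Accept all" button
    accept_all_button = consent_overlay.find_element(
        By.CSS_SELECTOR,
        '.eom-buttons button.yt-spec-button-shape-next'
    )
    accept_all_button.click()
except TimeoutException:
    print('Cookie modal missing')
```
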
The consent modal gets loaded dynamically and might take some time to show up. That is why you need `WebDriverWait` to wait for the expected condition to occur. If nothing happens within the specified timeout, it raises a `TimeoutException`. YouTube is pretty slow, so timeouts beyond 10 seconds are recommended.
Since YouTube keeps changing its policies, the dialog may not show up in specific countries or situations. Therefore, handle the exception with a `try ... except` block to prevent the script from failing when the modal is not present.
To make the script work, remember to add the following imports:
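These are the standard Selenium helpers used above:

```python
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
```
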
After pressing the “Accept all” button, YouTube takes a while to dynamically re-render the page:
During this time, you cannot interact with the page in Selenium. If you try to select an HTML element, you will get the “stale element reference” error. That happens because the DOM changes significantly in the process.
As you can see, the title element contains a gray line. If you inspect that element, you will see:
A good indicator of when the page has been loaded is to wait until the title element is visible:
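For example, with the title selector observed at the time of writing:

```python
WebDriverWait(driver, 15).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, 'h1.ytd-watch-metadata'))
)
```
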
You are ready to scrape YouTube in Python. Keep analyzing the target site in the DevTools and familiarize yourself with its DOM.
Step 5: Extract YouTube data
First, you need a data structure in which to store the scraped info. Initialize a Python dictionary with:
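An empty dictionary to be populated in the steps below:

```python
# dictionary that will contain the scraped data
video = {}
```
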
As you should have noticed in the previous step, some of the most interesting information is in the section under the video player:
With the `h1.ytd-watch-metadata` CSS selector, you can get the video title:
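Assuming the `driver` instance and `By` import from the previous steps:

```python
title = driver.find_element(By.CSS_SELECTOR, 'h1.ytd-watch-metadata').text
```
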
Just below the title, there is the HTML element containing the channel info:
This is identified by the “owner” `id` attribute, and you can get all the data from it with:
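A sketch of the channel-scraping logic; the child selectors and `id`s below reflect YouTube's markup at the time of writing and may have changed since:

```python
# dictionary where to store the channel info
channel = {}

# the element containing the channel info
channel_element = driver.find_element(By.ID, 'owner')

# scrape the channel info attributes
channel['url'] = channel_element \
    .find_element(By.CSS_SELECTOR, 'a.yt-simple-endpoint') \
    .get_attribute('href')
channel['name'] = channel_element.find_element(By.ID, 'channel-name').text
channel['image'] = channel_element.find_element(By.ID, 'img').get_attribute('src')
channel['subs'] = channel_element \
    .find_element(By.ID, 'owner-sub-count') \
    .text.replace(' subscribers', '')
```
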
Even further below is the video description. This component has tricky behavior, as it shows different data depending on whether it is closed or open. Click it to expand it and access the complete data:
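For instance, with the expander `id` observed at the time of writing:

```python
driver.find_element(By.ID, 'description-inline-expander').click()
```
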
You should have access to the expanded description info element:
Retrieve the video views and publication date with:
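A sketch based on the layout observed at the time of writing, where views and publication date are the first and third `span` inside the info container (indices may shift if YouTube changes the markup):

```python
info_container_elements = driver.find_elements(By.CSS_SELECTOR, '#info-container span')

views = info_container_elements[0].text.replace(' views', '')
publication_date = info_container_elements[2].text
```
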
The textual description associated with the video is contained in the following child element:
Scrape it with:
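Again, the selector below is a time-of-writing observation:

```python
description = driver \
    .find_element(By.CSS_SELECTOR, '#description-inline-expander .ytd-text-inline-expander span') \
    .text
```
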
Next, inspect the like button:
Collect the number of likes with:
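Assuming the like button keeps the `segmented-like-button` id it had at the time of writing:

```python
likes_element = driver.find_element(By.ID, 'segmented-like-button')
likes = likes_element.text
```
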
Finally, do not forget to insert the scraped data into the `video` dictionary:
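The keys below are illustrative; this fragment assumes the variables populated in the previous snippets:

```python
video['url'] = url
video['title'] = title
video['channel'] = channel
video['views'] = views
video['publication_date'] = publication_date
video['description'] = description
video['likes'] = likes
```
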
Wonderful! You just performed web scraping in Python!
Step 6: Export the scraped data to JSON
The data of interest is now stored in a Python dictionary, which is not the best format for sharing it with other teams. You can convert the collected info to JSON and export it to a file with just two lines of code:
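Namely (this assumes the `video` dictionary from the previous step and the `json` import shown below):

```python
with open('video.json', 'w') as file:
    json.dump(video, file)
```
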
This snippet initializes a `video.json` file with `open()`. Then, it uses `json.dump()` to write the JSON representation of the `video` dictionary to the output file. Take a look at our article to learn more about how to parse JSON in Python.
You do not need an extra dependency to achieve this. All you need is the Python Standard Library `json` package, which you can import with:
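At the top of `scraper.py`:

```python
import json
```
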
Fantastic! You started with raw data contained in a dynamic HTML page and now have semi-structured JSON data. It is time to see the entire YouTube scraper.
Step 7: Put it all together
Here is the complete `scraper.py` script:
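A reconstruction assembled from the snippets in the previous steps; all selectors and `id`s reflect YouTube's markup at the time of writing, and `<VIDEO_ID>` is a placeholder for the target video's ID:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import json

# the URL of the target page (replace <VIDEO_ID> with a real video ID)
url = 'https://www.youtube.com/watch?v=<VIDEO_ID>'

# run Chrome in headless mode
options = webdriver.ChromeOptions()
options.add_argument('--headless=new')

driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install()),
    options=options
)

# visit the target page
driver.get(url)

# deal with the cookie consent dialog, if present
try:
    consent_overlay = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.ID, 'dialog'))
    )
    accept_all_button = consent_overlay.find_element(
        By.CSS_SELECTOR, '.eom-buttons button.yt-spec-button-shape-next'
    )
    accept_all_button.click()
except TimeoutException:
    print('Cookie modal missing')

# wait until the page has been re-rendered
WebDriverWait(driver, 15).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, 'h1.ytd-watch-metadata'))
)

# dictionary that will contain the scraped data
video = {}

# scrape the video title
title = driver.find_element(By.CSS_SELECTOR, 'h1.ytd-watch-metadata').text

# scrape the channel info
channel = {}
channel_element = driver.find_element(By.ID, 'owner')
channel['url'] = channel_element \
    .find_element(By.CSS_SELECTOR, 'a.yt-simple-endpoint') \
    .get_attribute('href')
channel['name'] = channel_element.find_element(By.ID, 'channel-name').text
channel['image'] = channel_element.find_element(By.ID, 'img').get_attribute('src')
channel['subs'] = channel_element \
    .find_element(By.ID, 'owner-sub-count') \
    .text.replace(' subscribers', '')

# expand the video description
driver.find_element(By.ID, 'description-inline-expander').click()

# scrape views and publication date
info_container_elements = driver.find_elements(By.CSS_SELECTOR, '#info-container span')
views = info_container_elements[0].text.replace(' views', '')
publication_date = info_container_elements[2].text

# scrape the video description
description = driver \
    .find_element(By.CSS_SELECTOR, '#description-inline-expander .ytd-text-inline-expander span') \
    .text

# scrape the likes
likes = driver.find_element(By.ID, 'segmented-like-button').text

# populate the dictionary
video['url'] = url
video['title'] = title
video['channel'] = channel
video['views'] = views
video['publication_date'] = publication_date
video['description'] = description
video['likes'] = likes

driver.quit()

# export the scraped data to JSON
with open('video.json', 'w') as file:
    json.dump(video, file)
```
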
You can build a web scraper to get data from YouTube videos with only about 100 lines of code!
Launch the script, and the following `video.json` file will appear in the root folder of your project:
Congrats! You just learned how to scrape YouTube in Python!
Conclusion
In this guide, you learned why scraping YouTube is better than using its data APIs. In particular, you saw a step-by-step tutorial on how to build a Python scraper that can retrieve YouTube video data. As shown here, it is not complex and only takes a few lines of code.
At the same time, YouTube is a dynamic platform that keeps evolving, so the scraper built here might not work forever. Maintaining it to cope with changes in the target site is time-consuming and cumbersome. This is why we built YouTube Scraper, a reliable and easy-to-use solution to get all the data you want with no worries!
Also, do not overlook Google's anti-bot systems. Selenium is a great tool but cannot do anything against such advanced technologies. If Google decides to protect YouTube from bots, most automated scripts will be cut off. If that happens, you would need a tool that can render JavaScript and automatically handle fingerprinting, CAPTCHAs, and anti-scraping measures for you. Well, it exists and is called Scraping Browser!
Don’t want to deal with YouTube web scraping at all but are still interested in the data? Request a YouTube dataset.
Note: This guide was thoroughly tested by our team at the time of writing, but as websites frequently update their code and structure, some steps may no longer work as expected.