How To Scrape YouTube in Python

Learn how to scrape YouTube with Python in this step-by-step guide.
16 min read

In this step-by-step guide, you will learn how to perform web scraping on YouTube using Python.


YouTube API vs YouTube scraping

The YouTube Data API is the official way to get data from the platform, including information about videos, playlists, and content creators. However, there are at least three good reasons why scraping YouTube is better than relying solely on its API:

  • Flexibility and Customization: With a YouTube spider, you can tailor the code to select only the data you need. This level of customization helps you collect the exact information for your specific use case. In contrast, the API only gives you access to predefined data (see the sketch after this list).
  • Access to unofficial data: The API provides access to specific sets of data selected by YouTube. This means that some data you currently rely on may no longer be available in the future. Web scraping instead allows you to obtain any additional information available on the YouTube website, even if it is not exposed through the API.
  • No limitations: YouTube APIs are subject to rate limiting. These restrictions determine the frequency and volume of requests that you can make in a given time frame. By interacting directly with the platform, you can avoid such limits.
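
To make the predefined-data point concrete, here is a minimal sketch of querying the official YouTube Data API for video statistics with the requests package. The API key is a hypothetical placeholder, and the response only contains the fields the API chooses to expose:

import requests

# hypothetical placeholder: replace with your own API key
API_KEY = 'YOUR_API_KEY'
VIDEO_ID = 'kuDuJWvho7Q'

response = requests.get(
    'https://www.googleapis.com/youtube/v3/videos',
    params={
        'part': 'snippet,statistics',
        'id': VIDEO_ID,
        'key': API_KEY,
    },
)
data = response.json()

# only the predefined fields selected by YouTube are available
print(data['items'][0]['statistics'])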

What Data to Scrape From YouTube

Here are the main data fields you can scrape from YouTube (a possible record structure is sketched after the list):

  • Video metadata:
    • Title
    • Description
    • Views
    • Likes
    • Duration
    • Publication date
    • Channel
  • User profiles:
    • Username
    • User Description
    • Subscribers
    • Number of videos
    • Playlists
  • Other:
    • Comments
    • Related videos
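
For reference, here is a purely illustrative shape for the scraped record, mirroring the fields above:

video = {
    'url': '...',
    'title': '...',
    'views': '...',
    'likes': '...',
    'duration': '...',
    'publication_date': '...',
    'description': '...',
    'channel': {
        'name': '...',
        'subs': '...',
    },
}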

As seen earlier, the best way to get this data is through a custom scraper. But which programming language should you choose?

Python is one of the most popular languages for web scraping thanks to its simple syntax and rich ecosystem of libraries. Its versatility, readability, and extensive community support make it an excellent option. Check out our in-depth guide to get started on web scraping with Python.

Scraping YouTube With Selenium

Follow this tutorial and learn how to build a YouTube web scraping Python script.

Step 1: Setup

Before coding, make sure you have Python 3 and a Python IDE installed on your machine.

You can initialize a Python project with a virtual environment using the commands below:

mkdir youtube-scraper
cd youtube-scraper
python -m venv env

The youtube-scraper directory created above represents the project folder for your Python script.
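
Before installing any packages, activate the virtual environment. The command depends on your operating system:

# Linux/macOS
source env/bin/activate

# Windows (PowerShell)
env\Scripts\Activate.ps1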

Open the project folder in your IDE, create a scraper.py file, and initialize it as follows:

print('Hello, World!')

Right now, this file is a sample script that only prints “Hello, World!” but it will soon contain the scraping logic.

Verify that the script works by pressing the run button of your IDE or with:

python scraper.py

In the terminal, you should see:

Hello, World!

Perfect, you now have a Python project for your YouTube scraper.

Step 2: Choose and install the scraping libraries

If you spend some time visiting YouTube, you will notice that it is a highly interactive platform. Based on click and scroll operations, the site loads and renders data dynamically. This means that YouTube relies greatly on JavaScript.

Scraping YouTube therefore requires a tool that can render web pages in a browser, just like Selenium! Selenium makes it possible to scrape dynamic websites in Python by letting you run automated tasks in a real browser.

Add Selenium and the Webdriver Manager packages to your project’s dependencies with:

pip install selenium webdriver-manager

The installation task may take a while, so be patient.

webdriver-manager is not strictly necessary, but it makes it easier to manage web drivers in Selenium. Thanks to it, you do not have to manually download, install, and configure web drivers.
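
As a side note, Selenium 4.6 and later ships with Selenium Manager, which can also resolve the right driver automatically. Under that assumption, the driver setup reduces to the sketch below, although the rest of this tutorial sticks with webdriver-manager:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')

# Selenium Manager (bundled with Selenium 4.6+) downloads the
# matching chromedriver automatically
driver = webdriver.Chrome(options=options)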

Get started with Selenium in scraper.py:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options

# initialize a web driver instance to control a Chrome window
# in headless mode
options = Options()
options.add_argument('--headless=new')

driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install()),
    options=options
)

# scraping logic...

# close the browser and free up the resources
driver.quit()

This script creates an instance of Chrome WebDriver, the object through which you can programmatically control a Chrome window.

By default, Selenium starts the browser with its UI. Although this is useful for debugging, as you can watch live what the automated script is doing on the page, it takes a lot of resources. For this reason, you should configure Chrome to run in headless mode. Thanks to the --headless=new option, the controlled browser instance will be launched behind the scenes, with no UI.

Perfect! Time to define the scraping logic!

Step 3: Connect to YouTube

To perform web scraping on YouTube, you must first select a video to extract data from. In this guide, you are going to see how to scrape the latest video from Bright Data’s YouTube channel. Keep in mind that any other video will do.

Here is the YouTube page chosen as a target:

https://www.youtube.com/watch?v=kuDuJWvho7Q

It is a video on web scraping entitled “Introduction to Bright Data | Scraping Browser.”

Store the URL string in a Python variable:

url = 'https://www.youtube.com/watch?v=kuDuJWvho7Q'

You can now instruct Selenium to connect to the target page with:

driver.get(url)

The get() function tells the controlled browser to visit the page identified by the URL passed as a parameter.
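
If the network is slow, get() can hang for a long time. As an optional precaution, not part of the original script, you can cap the page load time with Selenium's standard timeout API:

# optional: raise an exception if the page takes longer
# than 30 seconds to load
driver.set_page_load_timeout(30)
driver.get(url)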

This is what your YouTube scraper looks like so far:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options

# initialize a web driver instance to control a Chrome window
# in headless mode
options = Options()
options.add_argument('--headless=new')

driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install()),
    options=options
)

# the URL of the target page
url = 'https://www.youtube.com/watch?v=kuDuJWvho7Q'
# visit the target page in the controlled browser
driver.get(url)

# close the browser and free up the resources
driver.quit()

If you run the script with headless mode disabled, Chrome will briefly open on the target page before the quit() instruction closes it. Note the "Chrome is being controlled by automated test software" message in the browser, which confirms that Selenium is operating Chrome as expected.

Step 4: Inspect the target page

When you open YouTube for the first time in a fresh browser session, a consent dialog appears. To access the data on the page, you must first close it by clicking the "Accept all" button. Let's learn how to do so!

To create a new browser session, open YouTube in incognito mode. Right-click on the consent modal, and select "Inspect." This will open the Chrome DevTools panel.

Note that the dialog element is identified by the "dialog" id attribute. This is useful information for defining an effective selector strategy in Selenium.

Similarly, inspect the "Accept all" button. It is the second button matched by the CSS selector below:

.eom-buttons button.yt-spec-button-shape-next

Put it all together and use these lines of code to deal with the YouTube cookie policy in Selenium:

try:
    # wait up to 15 seconds for the consent dialog to show up
    consent_overlay = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.ID, 'dialog'))
    )

    # select the consent option buttons
    consent_buttons = consent_overlay.find_elements(By.CSS_SELECTOR, '.eom-buttons button.yt-spec-button-shape-next')
    if len(consent_buttons) > 1:
        # retrieve and click the 'Accept all' button
        accept_all_button = consent_buttons[1]
        accept_all_button.click()
except TimeoutException:
    print('Cookie modal missing')

The consent modal gets loaded dynamically and might take some time to show up. That is why you need WebDriverWait: it waits for the expected condition to occur and raises a TimeoutException if nothing happens within the specified timeout. YouTube is pretty slow, so it is recommended to use timeouts beyond 10 seconds.

Since YouTube keeps changing its policies, the dialog may not show up in specific countries or situations. Therefore, handle the exception with a try-except block to prevent the script from failing in case the modal is not present.
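
As a variant sketch, assuming the selectors above stay valid, you could also wait directly for the consent buttons themselves instead of the dialog:

# variant: wait for the consent buttons to appear,
# then click the second one ('Accept all')
buttons = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, '.eom-buttons button.yt-spec-button-shape-next')
    )
)
if len(buttons) > 1:
    buttons[1].click()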

To make the script work, remember to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common import TimeoutException

After pressing the "Accept all" button, YouTube takes a while to dynamically re-render the page.

During this period, you cannot interact with the page in Selenium. If you try to select an HTML element, you will get a "stale element reference" error. That happens because the DOM changes a lot in this process.

While re-rendering, YouTube shows the title element as a gray placeholder line. A good indicator that the page has finished loading is therefore to wait until the title element is visible:

# wait for YouTube to load the page data
WebDriverWait(driver, 15).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, 'h1.ytd-watch-metadata'))
)

You are ready to scrape YouTube in Python. Keep analyzing the target site in the DevTools and familiarize yourself with its DOM.

Step 5: Extract YouTube data

First, you need a data structure in which to store the scraped info. Initialize a Python dictionary with:

video = {}

As you should have noticed in the previous step, some of the most interesting information is in the section under the video player.

With the h1.ytd-watch-metadata CSS selector, you can get the video title:

title = driver \
    .find_element(By.CSS_SELECTOR, 'h1.ytd-watch-metadata') \
    .text

Just below the title, there is the HTML element containing the channel info. It is identified by the "owner" id attribute, and you can get all the data from it with:

# dictionary where to store the channel info
channel = {}

# scrape the channel info attributes
channel_element = driver \
    .find_element(By.ID, 'owner')

channel_url = channel_element \
              .find_element(By.CSS_SELECTOR, 'a.yt-simple-endpoint') \
              .get_attribute('href')
channel_name = channel_element \
              .find_element(By.ID, 'channel-name') \
              .text
channel_image = channel_element \
              .find_element(By.ID, 'img') \
              .get_attribute('src')
channel_subs = channel_element \
              .find_element(By.ID, 'owner-sub-count') \
              .text \
              .replace(' subscribers', '')

channel['url'] = channel_url
channel['name'] = channel_name
channel['image'] = channel_image
channel['subs'] = channel_subs
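
Note that owner-sub-count exposes a shorthand string such as "4.65K". If you need a numeric value, a small hypothetical helper, not part of the original tutorial, can normalize it:

def parse_count(text):
    # convert shorthand counts like '4.65K' or '1.2M' to integers
    # (illustrative helper; adjust to the formats you encounter)
    multipliers = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}
    text = text.strip().replace(',', '')
    if text and text[-1].upper() in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1].upper()])
    return int(float(text))

channel['subs_count'] = parse_count(channel_subs)  # e.g. '4.65K' -> 4650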

Even further below, there is the video description. This component has tricky behavior, as it shows different data based on whether it is closed or open.

Click it to expand it and see the complete data:

driver.find_element(By.ID, 'description-inline-expander').click()

You now have access to the expanded description info element.

Retrieve the video views and publication date with:

info_container_elements = driver \
    .find_elements(By.CSS_SELECTOR, '#info-container span')

views = info_container_elements[0] \
    .text \
    .replace(' views', '')
publication_date = info_container_elements[2] \
    .text

The textual description associated with the video is contained in a child element of the expander. Scrape it with:

description = driver \
    .find_element(By.CSS_SELECTOR, '#description-inline-expander .ytd-text-inline-expander span') \
    .text

Next, inspect the like button and collect the number of likes with:

likes = driver \
    .find_element(By.ID, 'segmented-like-button') \
    .text

Finally, do not forget to insert the scraped data into the video dictionary:

video['url'] = url
video['title'] = title
video['channel'] = channel
video['views'] = views
video['publication_date'] = publication_date
video['description'] = description
video['likes'] = likes

Wonderful! You just performed web scraping in Python!

Step 6: Export the scraped data to JSON

The data of interest is now stored in a Python dictionary, which is not the best format for sharing it with other teams. You can convert the collected info to JSON and export it to a file with just two lines of code:

with open('video.json', 'w') as file:
    json.dump(video, file)

This snippet initializes a video.json file with open(). Then, it uses json.dump() to write the JSON representation of the video dictionary to the output file. Take a look at our article to learn more about how to parse JSON in Python.

You do not need any extra dependency to achieve this. All you need is the Python Standard Library's json package, which you can import with:

import json
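
If your team prefers spreadsheets, the standard library also covers CSV export. Here is a minimal sketch, assuming you flatten the nested channel dictionary into a single column first:

import csv

# flatten the nested channel info into a single row
row = {**video, 'channel': video['channel']['name']}

with open('video.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=row.keys())
    writer.writeheader()
    writer.writerow(row)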

Fantastic! You started with raw data contained in a dynamic HTML page and now have semi-structured JSON data. It is time to see the entire YouTube scraper.

Step 7: Put it all together

Here is the complete scraper.py script:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common import TimeoutException
import json

# enable the headless mode
options = Options()
options.add_argument('--headless=new')

# initialize a web driver instance to control a Chrome window
driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install()),
    options=options
)

# the URL of the target page
url = 'https://www.youtube.com/watch?v=kuDuJWvho7Q'
# visit the target page in the controlled browser
driver.get(url)

try:
    # wait up to 15 seconds for the consent dialog to show up
    consent_overlay = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.ID, 'dialog'))
    )

    # select the consent option buttons
    consent_buttons = consent_overlay.find_elements(By.CSS_SELECTOR, '.eom-buttons button.yt-spec-button-shape-next')
    if len(consent_buttons) > 1:
        # retrieve and click the 'Accept all' button
        accept_all_button = consent_buttons[1]
        accept_all_button.click()
except TimeoutException:
    print('Cookie modal missing')

# wait for YouTube to load the page data
WebDriverWait(driver, 15).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, 'h1.ytd-watch-metadata'))
)

# initialize the dictionary that will contain
# the data scraped from the YouTube page
video = {}

# scraping logic
title = driver \
    .find_element(By.CSS_SELECTOR, 'h1.ytd-watch-metadata') \
    .text

# dictionary where to store the channel info
channel = {}

# scrape the channel info attributes
channel_element = driver \
    .find_element(By.ID, 'owner')

channel_url = channel_element \
              .find_element(By.CSS_SELECTOR, 'a.yt-simple-endpoint') \
              .get_attribute('href')
channel_name = channel_element \
              .find_element(By.ID, 'channel-name') \
              .text
channel_image = channel_element \
              .find_element(By.ID, 'img') \
              .get_attribute('src')
channel_subs = channel_element \
              .find_element(By.ID, 'owner-sub-count') \
              .text \
              .replace(' subscribers', '')

channel['url'] = channel_url
channel['name'] = channel_name
channel['image'] = channel_image
channel['subs'] = channel_subs

# click the description section to expand it
driver.find_element(By.ID, 'description-inline-expander').click()

info_container_elements = driver \
    .find_elements(By.CSS_SELECTOR, '#info-container span')
views = info_container_elements[0] \
    .text \
    .replace(' views', '')
publication_date = info_container_elements[2] \
    .text

description = driver \
    .find_element(By.CSS_SELECTOR, '#description-inline-expander .ytd-text-inline-expander span') \
    .text

likes = driver \
    .find_element(By.ID, 'segmented-like-button') \
    .text

video['url'] = url
video['title'] = title
video['channel'] = channel
video['views'] = views
video['publication_date'] = publication_date
video['description'] = description
video['likes'] = likes

# close the browser and free up the resources
driver.quit()

# export the scraped data to a JSON file
with open('video.json', 'w') as file:
    json.dump(video, file, indent=4)

You can build a web scraper to get data from YouTube videos with only about 100 lines of code!

Launch the script, and the following video.json file will appear in the root folder of your project:

{
    "url": "https://www.youtube.com/watch?v=kuDuJWvho7Q",
    "title": "Introduction to Bright Data | Scraping Browser",
    "channel": {
        "url": "https://www.youtube.com/@BrightData",
        "name": "Bright Data",
        "image": "https://yt3.ggpht.com/_Q-FPPjoMEH_3ocfi1lTy1HBwdh7CqUfehS7G9silsQcPZX11yAGffubPO1haKyFtbxKBURT=s48-c-k-c0x00ffffff-no-rj",
        "subs": "4.65K"
    },
    "views": "116",
    "publication_date": "Jun 14, 2023",
    "description": "Welcome to our comprehensive guide on setting up and using Bright Data's Scraping Browser for efficient web data extraction. This video walks you through the process of setting up the Scraping Browser, highlighting its unique features and benefits.\n\n- Introduction to Bright Data's Scraping Browser\n- Navigating the 'Proxies and Scraping Infrastructure' page\n- Creating and Naming Your Scraping Browser\n- Explaining User Interaction, Geo-Restrictions, and IP Rate Limits\n- Breakdown of Costs for Using the Scraping Browser\n- Access Parameters and Their Importance\n- Integration Examples: Puppeteer in Node.js and Playwright in Python\n- Introduction to Web Scraping 'Today's Deals' from Amazon.com\n- Automated Data Extraction Process\n- Statistics of Data Usage\n- Benefits of Automated Web Scraping\n\nWhether you're looking to extract data behind user interactions, dealing with geo-restrictions, or IP rate limits, Bright Data's Scraping Browser provides comprehensive solutions for your needs. In this video, we also delve into a practical demonstration using Puppeteer and Python, illustrating how this browser can help you access and extract data efficiently.\n\n#BrightData #ScrapingBrowser #WebScraping #Puppeteer #Python #Nodejs #Playwright #DataExtraction",
    "likes": "3"
}

Congrats! You just learned how to scrape YouTube in Python!

Conclusion

In this guide, you learned why scraping YouTube is better than using its data APIs. In particular, you saw a step-by-step tutorial on how to build a Python scraper that can retrieve YouTube video data. As shown here, it is not complex and takes only about 100 lines of code.

At the same time, YouTube is a dynamic platform that keeps evolving, so the scraper built here might not work forever. Maintaining it to cope with changes in the target site is time-consuming and cumbersome. This is why we built YouTube Scraper, a reliable and easy-to-use solution to get all the data you want with no worries!

Also, do not overlook Google's anti-bot systems. Selenium is a great tool, but it cannot do anything against such advanced technologies. If Google decides to protect YouTube from bots, most automated scripts will be cut off. If that happened, you would need a tool that can render JavaScript and automatically handle fingerprinting, CAPTCHAs, and other anti-scraping measures for you. Well, it exists and is called Scraping Browser!


Don’t want to deal with YouTube web scraping at all but are still interested in the data? Request a YouTube dataset.

Note: This guide was thoroughly tested by our team at the time of writing, but as websites frequently update their code and structure, some steps may no longer work as expected.