In this step-by-step guide, you will learn how to scrape Reddit using Python.
This tutorial will cover:
- New Reddit API policy
- Reddit API vs. Reddit scraping
- Scraping Reddit with Selenium
New Reddit API Policy
In April 2023, Reddit announced new fees for its Data APIs, effectively pricing out smaller companies. At the time of writing, the API fee is set at $0.24 per 1,000 calls. As you can imagine, this figure adds up quickly even for modest usage. That is especially true considering the vast amount of user-generated content available on Reddit and the huge number of calls required to retrieve it. Apollo, one of the most popular third-party apps built on top of the Reddit API, was forced to shut down as a result.
Does this mean the end of Reddit as a source of sentiment analysis, user feedback, and trend data? Certainly not! There is a solution that is more effective, less expensive, and not subject to overnight corporate decisions. That solution is called web scraping. Let’s find out why!
Reddit API vs. Reddit Scraping
Reddit’s API is the official method for obtaining data from the site. Considering the recent policy changes and directions taken by the platform, there are good reasons why Reddit scraping is a better solution:
- Cost-effectiveness: In light of Reddit’s new API cost, scraping Reddit can be a much more affordable alternative. Building a Python Reddit scraper allows you to gather data without incurring additional expenses associated with API usage.
- Enhanced data collection: When scraping Reddit, you have the flexibility to customize the data extraction code to get only the information that matches your requirements. This customization helps you overcome the limitations on data format, rate limiting, and usage restrictions in the API.
- Access to unofficial data: While Reddit’s API only provides access to a curated selection of information, scraping provides access to any publicly accessible data on the site.
Now that you know why scraping is a more effective option than calling APIs, let’s see how to build a Reddit scraper in Python. Before moving on to the next chapter, consider exploring our in-depth guide on web scraping with Python.
Scraping Reddit With Selenium
In this step-by-step tutorial, you will see how to build a Reddit web scraping Python script.
Step 1: Project setup
First, make sure to meet the following prerequisites:
- Python 3+: Download the installer, double-click on it, and follow the installation instructions. You can verify the result as shown right after this list.
- A Python IDE: PyCharm Community Edition or Visual Studio Code with the Python extension will do.
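If you are not sure whether Python is already available on your machine, check its version from the terminal:
python --version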
Initialize a Python project with a virtual environment through the commands below:
mkdir reddit-scraper
cd reddit-scraper
python -m venv env
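Before installing any dependency, remember to activate the virtual environment. On Linux or macOS:
source env/bin/activate
On Windows, run instead:
env\Scripts\activate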
The reddit-scraper folder created here is the project folder for your Python script.
Open the directory in the IDE, create a scraper.py file, and initialize it as below:
print('Hello, World!')
Right now, this script simply prints “Hello, World!” but it will soon contain the scraping logic.
Verify that the program works by pressing the run button of your IDE or launching:
python scraper.py
In the terminal, you should see:
Hello, World!
Wonderful! You now have a Python project for your Reddit scraper.
Step 2: Select and install the scraping libraries
As you may already know, Reddit is a highly interactive platform. The site loads and renders new data dynamically based on how users interact with its pages through click and scroll operations. From a technical perspective, this means that Reddit relies heavily on JavaScript.
Thus, scraping Reddit in Python requires a tool that can render web pages in a browser. This is where Selenium comes in! It allows you to scrape dynamic websites in Python by performing automated operations on web pages in a browser.
You can add Selenium and the Webdriver Manager to your project’s dependencies with:
pip install selenium webdriver-manager
The installation process might take a while, so be patient.
The webdriver-manager package is not strictly necessary but is strongly recommended. It allows you to avoid manually downloading, installing, and configuring web drivers in Selenium. The library will take care of everything for you.
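Note that recent Selenium releases (4.6 and later) also bundle Selenium Manager, which can resolve a matching driver on its own. If you prefer to rely on it, you could drop webdriver-manager and initialize the driver with a minimal sketch like this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')
# Selenium Manager (bundled with Selenium 4.6+) fetches a matching ChromeDriver
driver = webdriver.Chrome(options=options)
The rest of this tutorial sticks with webdriver-manager, as in the original setup.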
Integrate Selenium into your scraper.py file:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
# enable the headless mode
options = Options()
options.add_argument('--headless=new')
# initialize a web driver to control Chrome
driver = webdriver.Chrome(
service=ChromeService(ChromeDriverManager().install()),
options=options
)
# set the controlled browser window to full-screen mode
driver.fullscreen_window()
# scraping logic...
# close the browser and free up the Selenium resources
driver.quit()
This script instantiates a Chrome WebDriver object to programmatically control a Chrome window.
By default, Selenium opens the browser in a new GUI window. This is useful for monitoring what the script does on the pages while debugging. At the same time, loading a web browser with its UI takes a lot of resources. So, it is recommended to configure Chrome to run in headless mode. Specifically, the --headless=new option instructs Chrome to start with no UI behind the scenes.
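Headless Chrome may start with a small default viewport, which can affect what the page renders. As a precaution, you can force a desktop-sized window; the 1920x1080 value below is just an assumed example you can adjust:
# optional: force a desktop-sized viewport in headless mode
options.add_argument('--window-size=1920,1080')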
Well done! Time to visit the target Reddit page!
Step 3: Connect to Reddit
Here, you are going to see how to extract data from the r/Technology subreddit. Keep in mind that any other subreddit will do.
In detail, assume you want to scrape the page with the top posts of the week. This is the URL of the target page:
https://www.reddit.com/r/technology/top/?t=week
Store that string in a Python variable:
url = 'https://www.reddit.com/r/technology/top/?t=week'
Then, use Selenium to visit the page with:
driver.get(url)
The get() function instructs the controlled browser to connect to the page identified by the URL passed as a parameter.
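Since Reddit renders much of its content with JavaScript, the data you need may not be in the DOM the instant get() returns. If you run into missing elements, you can add an explicit wait. The sketch below is optional and assumes the subreddit title is rendered in an <h1> tag, as used later in this tutorial:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the subreddit <h1> title to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'h1'))
)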
This is what your Reddit web scraper looks like so far:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
# enable the headless mode
options = Options()
options.add_argument('--headless=new')
# initialize a web driver to control Chrome
driver = webdriver.Chrome(
service=ChromeService(ChromeDriverManager().install()),
options=options
)
# set the controlled browser window to full-screen mode
driver.fullscreen_window()
# the URL of the target page to scrape
url = 'https://www.reddit.com/r/technology/top/?t=week'
# connect to the target URL in Selenium
driver.get(url)
# scraping logic...
# close the browser and free up the Selenium resources
driver.quit()
Test your script. It will open the browser window below for a split second before closing it because of the quit() instruction:
Take a look at the “Chrome is being controlled by automated test software.” message. Great! That confirms Selenium is controlling Chrome properly.
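When debugging, it can also help to keep the browser open after the script ends. A common trick is to temporarily disable headless mode and set ChromeDriver’s experimental detach option, as in this optional snippet:
# keep the browser open after the script finishes (debugging only)
options.add_experimental_option('detach', True)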
Step 4: Inspect the target page
Before jumping into the code, you need to explore the target page to see what info it offers and how you can retrieve it. In particular, you have to identify which HTML elements contain the data of interest and devise proper selection strategies.
To simulate the conditions under which Selenium operates, namely a “vanilla” browser session, open the Reddit page in an incognito window. Right-click on any section of the page and select “Inspect” to open the Chrome DevTools:
This tool helps you understand the DOM structure of the page. As you can see, the site relies on CSS classes that seem to be randomly generated at build time. In other words, you should not base your selection strategies on them.
Fortunately, the most important elements on the site have special HTML attributes. For example, the subreddit description node has the following attribute:
data-testid="no-edit-description-block"
This is useful information for building effective HTML element selection logic.
Keep analyzing the site in the DevTools and familiarize yourself with its DOM until you are ready to scrape Reddit in Python.
Step 5: Scrape the subreddit main info
First, create a Python dictionary in which to store the scraped data:
subreddit = {}
Then, note that you can get the name of the subreddit from the <h1> element at the top of the page:
Retrieve it as below:
from selenium.webdriver.common.by import By

name = driver \
.find_element(By.TAG_NAME, 'h1') \
.text
As you may have already noticed, some of the most interesting general info about the subreddit is in the sidebar on the right:
You can get the text description, creation date, and number of members with:
description = driver \
.find_element(By.CSS_SELECTOR, '[data-testid="no-edit-description-block"]') \
.get_attribute('innerText')
creation_date = driver \
.find_element(By.CSS_SELECTOR, '.icon-cake') \
.find_element(By.XPATH, "following-sibling::*[1]") \
.get_attribute('innerText') \
.replace('Created ', '')
members = driver \
.find_element(By.CSS_SELECTOR, '[id^="IdCard--Subscribers"]') \
.find_element(By.XPATH, "preceding-sibling::*[1]") \
.get_attribute('innerText')
In this case, you cannot use the text attribute because the text strings are contained in nested nodes. If you used .text, you would get an empty string. Instead, you need to call the get_attribute() method to read the innerText attribute, which returns the rendered text content of a node and its descendants.
If you look at the creation date element, you will notice that there is no easy way to select it. As it is the node following the cake icon, select the icon with .icon-cake first, and then use the following-sibling::*[1] XPath expression to get the next sibling. Clean the collected text to remove the “Created ” string by calling the Python replace() method.
When it comes to the member counter element, something similar happens. The main difference is that, in this case, you need to access the preceding sibling.
Do not forget to add the scraped data to the subreddit dictionary:
subreddit['name'] = name
subreddit['description'] = description
subreddit['creation_date'] = creation_date
subreddit['members'] = members
Print subreddit with print(subreddit), and you will see:
{'name': '/r/Technology', 'description': 'Subreddit dedicated to the news and discussions about the creation and use of technology and its surrounding issues.', 'creation_date': 'Jan 25, 2008', 'members': '14.4m'}
Perfect! You just performed web scraping in Python!
Step 6: Scrape the subreddit posts
Since a subreddit shows several posts, you will now need an array to store the collected data:
posts = []
Inspect a post HTML element:
Here, notice that you can select them all with the [data-testid="post-container"] CSS selector:
post_html_elements = driver \
.find_elements(By.CSS_SELECTOR, '[data-testid="post-container"]')
Iterate over them. For each element, create a post dictionary to keep track of that individual post’s data:
for post_html_element in post_html_elements:
    post = {}
    # scraping logic...
Inspect the upvote element:
You can retrieve that info inside the for loop with:
upvotes = post_html_element \
.find_element(By.CSS_SELECTOR, '[data-click-id="upvote"]') \
.find_element(By.XPATH, "following-sibling::*[1]") \
.get_attribute('innerText')
Again, it is best to get the upvote button, which is easy to select, and then point to the next sibling to retrieve the target info.
Inspect the post author and title elements:
Getting this data is a bit easier:
author = post_html_element \
.find_element(By.CSS_SELECTOR, '[data-testid="post_author_link"]') \
.text
title = post_html_element \
.find_element(By.TAG_NAME, 'h3') \
.text
Then, you can collect the number of comments and outbound link:
try:
    outbound_link = post_html_element \
        .find_element(By.CSS_SELECTOR, '[data-testid="outbound-link"]') \
        .get_attribute('href')
except NoSuchElementException:
    outbound_link = None
comments = post_html_element \
.find_element(By.CSS_SELECTOR, '[data-click-id="comments"]') \
.get_attribute('innerText') \
.replace(' Comments', '')
Since the outbound link element is optional, you need to wrap the selection logic in a try block. Note that NoSuchElementException comes from the selenium.common module, as imported in the final script.
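If you plan to extract more optional fields, repeating this try/except pattern quickly gets verbose. As a minimal sketch, not part of the original script, you could factor it into a small helper (the name find_optional_attribute is just an assumption):
from selenium.common import NoSuchElementException

def find_optional_attribute(parent, by, selector, attribute):
    # return the attribute of the first matching element, or None if it is missing
    try:
        return parent.find_element(by, selector).get_attribute(attribute)
    except NoSuchElementException:
        return None
You could then call it inside the loop, for example with find_optional_attribute(post_html_element, By.CSS_SELECTOR, '[data-testid="outbound-link"]', 'href').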
Add this data to post and append it to the posts array only if title is present. This extra check prevents special advertisement posts placed by Reddit from being scraped:
# populate the dictionary with the retrieved data
post['upvotes'] = upvotes
post['title'] = title
post['outbound_link'] = outbound_link
post['comments'] = comments
# to avoid adding ad posts
# to the list of scraped posts
if title:
    posts.append(post)
Lastly, add posts to the subreddit dictionary:
subreddit['posts'] = posts
Way to go! You now have all the desired Reddit data!
Step 7: Export the scraped data to JSON
The collected data is now inside a Python dictionary. This is not the best format for sharing it with other teams. To tackle that, you should export it to JSON:
import json
# ...
with open('subreddit.json', 'w') as file:
    json.dump(subreddit, file)
Import json from the Python Standard Library, create a subreddit.json file with open(), and populate it with json.dump(). Check out our guide to learn more about how to parse JSON in Python.
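If your team prefers spreadsheets, the post data can also be flattened to CSV with the standard csv module. This optional sketch assumes the posts list built earlier in the tutorial:
import csv

# write the scraped posts to a CSV file, one row per post
with open('posts.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(
        csv_file,
        fieldnames=['upvotes', 'title', 'outbound_link', 'comments']
    )
    writer.writeheader()
    writer.writerows(posts)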
Fantastic! You started with raw data contained in a dynamic HTML page and now have semi-structured JSON data. You are now ready to see the entire Reddit scraper.
Step 8: Put it all together
Here is the full scraper.py script:
from selenium import webdriver
from selenium.common import NoSuchElementException
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import json
# enable the headless mode
options = Options()
options.add_argument('--headless=new')
# initialize a web driver to control Chrome
driver = webdriver.Chrome(
service=ChromeService(ChromeDriverManager().install()),
options=options
)
# set the controlled browser window to full-screen mode
driver.fullscreen_window()
# the URL of the target page to scrape
url = 'https://www.reddit.com/r/technology/top/?t=week'
# connect to the target URL in Selenium
driver.get(url)
# initialize the dictionary that will contain
# the subreddit scraped data
subreddit = {}
# subreddit scraping logic
name = driver \
.find_element(By.TAG_NAME, 'h1') \
.text
description = driver \
.find_element(By.CSS_SELECTOR, '[data-testid="no-edit-description-block"]') \
.get_attribute('innerText')
creation_date = driver \
.find_element(By.CSS_SELECTOR, '.icon-cake') \
.find_element(By.XPATH, "following-sibling::*[1]") \
.get_attribute('innerText') \
.replace('Created ', '')
members = driver \
.find_element(By.CSS_SELECTOR, '[id^="IdCard--Subscribers"]') \
.find_element(By.XPATH, "preceding-sibling::*[1]") \
.get_attribute('innerText')
# add the scraped data to the dictionary
subreddit['name'] = name
subreddit['description'] = description
subreddit['creation_date'] = creation_date
subreddit['members'] = members
# to store the post scraped data
posts = []
# retrieve the list of post HTML elements
post_html_elements = driver \
.find_elements(By.CSS_SELECTOR, '[data-testid="post-container"]')
for post_html_element in post_html_elements:
    # to store the data scraped from the
    # post HTML element
    post = {}

    # subreddit post scraping logic
    upvotes = post_html_element \
        .find_element(By.CSS_SELECTOR, '[data-click-id="upvote"]') \
        .find_element(By.XPATH, "following-sibling::*[1]") \
        .get_attribute('innerText')
    author = post_html_element \
        .find_element(By.CSS_SELECTOR, '[data-testid="post_author_link"]') \
        .text
    title = post_html_element \
        .find_element(By.TAG_NAME, 'h3') \
        .text
    try:
        outbound_link = post_html_element \
            .find_element(By.CSS_SELECTOR, '[data-testid="outbound-link"]') \
            .get_attribute('href')
    except NoSuchElementException:
        outbound_link = None
    comments = post_html_element \
        .find_element(By.CSS_SELECTOR, '[data-click-id="comments"]') \
        .get_attribute('innerText') \
        .replace(' Comments', '')

    # populate the dictionary with the retrieved data
    post['upvotes'] = upvotes
    post['title'] = title
    post['outbound_link'] = outbound_link
    post['comments'] = comments

    # to avoid adding ad posts
    # to the list of scraped posts
    if title:
        posts.append(post)
subreddit['posts'] = posts
# close the browser and free up the Selenium resources
driver.quit()
# export the scraped data to a JSON file
with open('subreddit.json', 'w', encoding='utf-8') as file:
    json.dump(subreddit, file, indent=4, ensure_ascii=False)
Amazing! You can build a Python Reddit web scraper with a little more than 100 lines of code!
Launch the script, and the following subreddit.json file will appear in the root folder of your project:
{
"name": "/r/Technology",
"description": "Subreddit dedicated to the news and discussions about the creation and use of technology and its surrounding issues.",
"creation_date": "Jan 25, 2008",
"members": "14.4m",
"posts": [
{
"upvotes": "63.2k",
"title": "Mojang exits Reddit, says they '\"no longer feel that Reddit is an appropriate place to post official content or refer [its] players to\".",
"outbound_link": "https://www.pcgamer.com/minecrafts-devs-exit-its-7-million-strong-subreddit-after-reddits-ham-fisted-crackdown-on-protest/",
"comments": "2.9k"
},
{
"upvotes": "35.7k",
"title": "JP Morgan accidentally deletes evidence in multi-million record retention screwup",
"outbound_link": "https://www.theregister.com/2023/06/26/jp_morgan_fined_for_deleting/",
"comments": "2.0k"
},
# omitted for brevity ...
{
"upvotes": "3.6k",
"title": "Facebook content moderators in Kenya call the work 'torture.' Their lawsuit may ripple worldwide",
"outbound_link": "https://techxplore.com/news/2023-06-facebook-content-moderators-kenya-torture.html",
"comments": "188"
},
{
"upvotes": "3.6k",
"title": "Reddit is telling protesting mods their communities ‘will not’ stay private",
"outbound_link": "https://www.theverge.com/2023/6/28/23777195/reddit-protesting-moderators-communities-subreddits-private-reopen",
"comments": "713"
}
]
}
Congrats! You just learned how to scrape Reddit in Python!
Conclusion
Scraping Reddit is a better way to get data than using its API, especially after the new policies. In this step-by-step tutorial, you learned how to build a scraper in Python to retrieve subreddit data. As shown here, it requires only a few lines of code.
At the same time, just as it changed its API policies overnight, Reddit may implement strict anti-scraping measures at any moment. Extracting data from the site would then become a real challenge, but there is a solution! Bright Data’s Scraping Browser is a tool that can render JavaScript just like Selenium while automatically handling fingerprinting, CAPTCHAs, and anti-scraping measures for you.
Get reliable and complete Reddit data hassle-free with Bright Data’s Reddit Scraper API. Bypass restrictions, enjoy continuous data access, and focus on insights, not scraping. Start your free trial now! No credit card required.
Don’t want to deal with Reddit web scraping at all but are interested in subreddit data? Purchase a Reddit dataset.
Note: This guide was thoroughly tested by our team at the time of writing, but as websites frequently update their code and structure, some steps may no longer work as expected.