How to Scrape Social Media Data – 2023 Guide

In this guide, you will learn how to scrape social media data using Python, a powerful tool for web scraping.

Social media platforms offer a wealth of information that can benefit your market research, competitive analysis, reputation management, and customer service efforts. By collecting this data through web scraping, your business can make informed decisions, improve its brand reputation, and gain a competitive edge in the market.

With web scraping, you can efficiently navigate through web pages, select specific data, and extract it into a structured format, such as a CSV or JSON file.

In this guide, you’ll learn how to scrape social media data from multiple platforms using Python. Python is widely recognized as a powerful tool for web scraping due to its extensive range of libraries that facilitate the parsing, extraction, and collation of various types of data.

Scraping Social Media with Python

Before you begin this tutorial, you’ll need the following:

  • Python: the programming language used for this walkthrough.
  • pip: the package manager for Python. You’ll use it to install the Python libraries below.
  • Beautiful Soup: a helpful Python package for parsing and extracting data from HTML and XML files. Installation instructions are available in its documentation.
  • Selenium: a framework for carrying out operations in a web browser. It’s particularly helpful for simulating a real browser, allowing you to render and parse dynamic, JavaScript-heavy websites. You can find more information about how to install Selenium in the official docs.

In this tutorial, you’ll set up your Python environment by installing these packages and then run Python scripts to scrape the web. One complication is that social media platforms, such as Facebook and Instagram, often deploy blockers against scraping bots: they may require login credentials, hide useful data behind buttons and dynamic content, or render their data with JavaScript. You’ll use Selenium to work around these measures and limit your scraping to public pages, which reduces the need for login credentials.
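
All three libraries install with pip. Recent Selenium releases (4.6 and later) also bundle Selenium Manager, which downloads a matching ChromeDriver for you automatically:

pip install beautifulsoup4 selenium pandas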

All code for this tutorial is available in this GitHub repository.

Scraping Facebook with Python

Your first step when web scraping is to explore the page you need to parse. You need to pinpoint what data is needed, the paths your bot will take, and the specific selectors for the data you want.

For example, when viewing Tom Cruise’s public Facebook profile, you’ll find information about him, links to his other pages, the posts he makes publicizing his movies, and various media associated with them. All the information you see on this page can be accessed and scraped.

In order to scrape the contents of Tom Cruise’s posts, you need to inspect the page for selectors that are shared by the posts you’re interested in.

Once you find selectors that isolate the data you need, you can start writing your scraping code. Open a Python file (.py) and import the packages you’ll be using: Selenium to access the content of your HTML page, Beautiful Soup to parse and extract the specific data, and pandas to structure and clean your final data set:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd

Then define the page you want to scrape and add options to Selenium to mimic a real user and avoid getting your crawler blocked:


# Define the URL you want to scrape
url = 'https://www.facebook.com/officialtomcruise/'

# Define the options for the Chrome webdriver
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Spoof a realistic user agent; update this string now and then so it matches a current browser
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')

# Create a new instance of the Chrome webdriver with the defined options
driver = webdriver.Chrome(options=options)

# Load the Facebook page in the webdriver
driver.get(url)

At this point, you’ve loaded the entire page, but it isn’t in a human-readable format. To change this, use Beautiful Soup to extract the text you want with the selector you isolated earlier:

# Extract the HTML content of the page using BeautifulSoup
html_content = driver.page_source
soup = BeautifulSoup(html_content, 'html.parser')

# Facebook's class names (like 'x1l90r2v') are auto-generated and change over time,
# so re-inspect the page if this selector stops matching
posts = soup.find_all('div', class_='x1l90r2v')

# Extract the text content of each post
post_content = [post.text for post in posts]

# Save the scraped data in a CSV file
data = pd.DataFrame({'post_content': post_content})
data.to_csv('facebook_posts.csv', index=False)

# Print the scraped data
print(data)


# Close the webdriver
driver.quit()
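
Because Facebook’s generated class names churn frequently, you may want a more durable hook. Post containers often carry a role="article" attribute in the markup (an assumption worth verifying in your browser’s inspector before relying on it), which you could target instead:

# Alternative: select posts by the role attribute instead of a generated class name
posts = soup.select('div[role="article"]')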

Once you’ve isolated the data you need, you can structure it as a DataFrame (or whichever structure suits you). A DataFrame is a commonly used data structure because it provides a tabular view of your data.
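
Real pages often yield noise alongside the posts, such as stray whitespace, empty strings, and duplicates. A short optional cleanup pass before saving, sketched here against the data DataFrame above, keeps the CSV tidy:

# Optional cleanup: trim whitespace, then drop empty rows and duplicates
data['post_content'] = data['post_content'].str.strip()
data = data[data['post_content'] != '']
data = data.drop_duplicates().reset_index(drop=True)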

Scraping Twitter with Python

Now that you’ve scraped Facebook, use the same process to scrape Twitter.

Explore Tom Cruise’s Twitter page and isolate a selector that can be used for the data you want.

In this example, you’ll scrape the text of all his tweets. Inspecting the website code reveals that tweet text is marked with the attribute data-testid="tweetText", which you can use to extract it efficiently.

It’s important to note the different website behaviors you may encounter. For example, Twitter utilizes JavaScript to implement an infinite scroll feature. This means that more data appears as you scroll down the page, and if you attempt to scrape the content right after the page loads, you may not obtain all the necessary data or you might get an exception.

To overcome this, you can either configure your bot to wait a certain amount of time before scraping the content or ensure that the page is scrolled sufficiently to obtain all the required data.
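
The full script below scrolls once and waits, which is enough for a handful of tweets. If you need more, a common pattern is to keep scrolling until the page height stops growing. Here’s a sketch that assumes driver already has the page loaded:

import time

# Keep scrolling until no new content loads
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # give the JavaScript time to render more tweets
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no new content appeared, so stop scrolling
    last_height = new_height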

Create a Python file again and use the following code to scrape the desired content:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
import time 

# Define the URL you want to scrape
url = 'https://twitter.com/TomCruise'

# Define the options for the Chrome webdriver to mimic a real page
options = Options()
options.add_argument('--headless')
options.add_argument("--incognito")
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')
options.add_argument("--enable-javascript")

# Create a new instance of the Chrome webdriver with the defined options
driver = webdriver.Chrome(options=options)

# Load the Twitter page in the webdriver
driver.get(url)

# Wait for tweets to populate the page
try:
    WebDriverWait(driver, 60).until(expected_conditions.presence_of_element_located(
        (By.CSS_SELECTOR, '[data-testid="tweet"]')))
except WebDriverException:
    print("Something happened. No tweets loaded")

# Scroll down so more tweets load
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(10)

# Extract the HTML content of the page using BeautifulSoup
html_content = driver.page_source
soup = BeautifulSoup(html_content, 'html.parser')

posts = soup.find_all(attrs={"data-testid": "tweetText"})

# Extract the text content of each post
post_content = [post.text for post in posts]

# Save the scraped data in a CSV file
data = pd.DataFrame({'post_content': post_content})
data.to_csv('twitter_posts.csv', index=False)

# Print the scraped data
print(data)

# Close the webdriver
driver.quit()

Scraping Instagram with Python

Lastly, take a look at how to scrape Instagram.

Tom Cruise’s Instagram page presents as a media gallery, with only pictures and videos available if you’re not logged in. However, exploring the page shows that the media’s alt text usually contains the post content as well. This means you can scrape the URLs for the media files and their alt descriptions directly from this page.

To do this, you only need to find the selectors for your data and structure the results in a DataFrame:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd

# Define the URL you want to scrape
url = 'https://www.instagram.com/tomcruise'

# Define the options for the Chrome webdriver
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')

# Create a new instance of the Chrome webdriver with the defined options
driver = webdriver.Chrome(options=options)

# Load the Instagram page in the webdriver
driver.get(url)

# Extract the HTML content of the page using BeautifulSoup
html_content = driver.page_source
soup = BeautifulSoup(html_content, 'html.parser')

# Collect all the links on the page
links = soup.find_all('a', href=True)

# Limit the collected links to post links (Instagram post URLs contain '/p/')
posts = []
for link in links:
    if '/p/' in link['href']:
        posts.append(link['href'])

# Save the scraped data in a CSV file
data = pd.DataFrame({'post_links': posts})
data.to_csv('instagram_posts.csv', index=False)

# Print the scraped data
print(data)

# Close the webdriver
driver.quit()
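
The script above captures the post links. If you also want the media URLs and alt descriptions mentioned earlier, you can pull them from the page’s img tags. This is a sketch that reuses the soup and pd objects from the script above; Instagram’s markup changes often, so verify these attributes in your inspector first:

# Collect image URLs and their alt text from the same parsed page
images = soup.find_all('img', src=True)
media = [{'media_url': img['src'], 'alt_text': img.get('alt', '')} for img in images]

# Save the media data in a separate CSV file
media_data = pd.DataFrame(media)
media_data.to_csv('instagram_media.csv', index=False)
print(media_data)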

Web Scraping Social Media with Bright Data

Web scraping can be a time-consuming and tedious process, but it doesn’t have to be.

Bright Data is a web platform that gives companies easy access to massive amounts of publicly available, structured web data. This curated data is readily available in the form of social media datasets. These datasets contain a plethora of data, including user profiles, posts, and comments, that you can access without having to scrape the data yourself.

In addition, if you want to retrieve data from elsewhere, Bright Data’s Web Scraper IDE and Web Unlocker tools can help. They come with pre-built templates to reduce the amount of code you need to write, and feature built-in proxies that will help you access region-locked content and solve CAPTCHAs along the way.

Using Bright Data to scrape social media data can be more efficient and reliable than scraping it yourself. Moreover, Bright Data provides a vast proxy network that lets you evade rate restrictions and prevent IP blocks when scraping social media platforms.

Conclusion

In this article, you learned about the fundamentals of web scraping and how to leverage Python to extract social media data from platforms like Facebook, Twitter, and Instagram. You also saw the important factors to consider when manually scraping social media websites and explored Bright Data as a solution for web scraping, particularly for social media data extraction.

Web scraping can gather social media data for marketing research, sentiment analysis, and trend analysis. However, you must use web scraping ethically and follow the terms of service of the websites and social media networks you scrape. Make sure the data you’re scraping is public and doesn’t violate privacy laws. Tools offered by Bright Data are particularly useful because they help you navigate the legal and ethical concerns involved in data scraping.
