How to Scrape Twitter Data – 2023 Guide

Twitter has become an indispensable platform for gathering real-time information on trends, news, and user sentiment. In this article, we will discuss how to scrape Twitter data to uncover valuable insights and analyze them effectively.

As a major social media platform, Twitter is home to some of the most interesting content on the internet and has tons of useful data for businesses looking to understand and expand their markets.

While you can access this data programmatically via the Twitter API, it’s rate-limited and the application process is time-consuming. Additionally, Twitter recently announced the end of free API access and increased their API costs dramatically, making the API method inaccessible for a large number of small to mid-sized companies. However, web scraping can help you avoid these nuisances and extract what you need easily.

Web scraping is the process of capturing and storing large amounts of data from websites and web apps with the help of automated scripts or bots. In this article, you’ll learn how to scrape Twitter data using Python and Selenium, a popular combination for web scraping.

Scraping Twitter with Selenium

This tutorial will first help you understand what to scrape and then show you how to do it step-by-step.

Prerequisites

Before you begin, you’ll need a local copy of Python installed on your system. The latest stable distribution will work (which, at the time of writing this article, is 3.11.2).

Once you have Python installed, you need to install the following dependencies via pip, Python’s package installer:

  • Selenium
  • Webdriver Manager

You can run the following commands to install the dependencies:

pip install selenium
pip install webdriver_manager
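If you want to confirm that the packages installed correctly, you can print the installed Selenium version from the command line:

python -c "import selenium; print(selenium.__version__)"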

What You Will Scrape

Deciding what to scrape is as important as implementing the scraping script correctly. This is because Selenium gives you access to a complete web page of the Twitter app, which contains a lot of data, most of which probably isn’t useful to you. This means you need to make sure that you clearly understand and define what you’re looking for before you start writing a Python script.

For the purpose of this tutorial, you’ll extract the following data from a user profile:

  • Name
  • Username
  • Location
  • Website
  • Join date
  • Following count
  • Followers count
  • Tweets

Scraping a User Profile

To start scraping a user profile page, you need to create a new Python script file named profile-page.py. You can use the following command to create it on *nix systems:

touch profile-page.py

On non-*nix systems, you can simply create the file using your file manager application (such as Windows Explorer).

Setting Up Selenium

After creating a new Python script file, you need to import the following modules into your script:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

Then you need to set up a new Selenium WebDriver (which is basically an automated web browser that your script will control):

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
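By default, this opens a visible Chrome window. If you’d rather run the browser headless (for example, on a server), you can pass Chrome options when creating the driver instead of using the line above. A minimal sketch; depending on your Chrome version, you may need --headless instead of --headless=new:

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)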

Before you load the web page and scrape information, you need to define the URL of the web page. Since Twitter profile page URLs are dependent on usernames, you need to add the following code to your script to create a profile page URL from the given username:

username = "bright_data"
URL = "https://twitter.com/" + username + "?lang=en"

Then load the web page:

driver.get(URL)

Waiting for a Page to Load

You can’t proceed with scraping the data from this page until it loads completely. While there are a few deterministic methods for knowing if an HTML page is fully loaded (such as checking document.readyState), they aren’t useful in the case of a single-page application (SPA) like Twitter. In this instance, you need to wait for the client-side API calls to be completed and the data to be rendered on the web page before you can scrape it.
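For reference, a readyState check would look like the snippet below, but it only confirms that the static HTML document has finished loading; the tweets themselves may still be fetched and rendered afterwards. To wait for the rendered content, you can use Selenium’s explicit waits instead.

# returns "complete" once the static document has loaded,
# even though Twitter's client-side rendering may still be in progress
ready_state = driver.execute_script("return document.readyState")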

To do that, you need to add the following piece of code to your script:

try:
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[data-testid="tweet"]')))
except WebDriverException:
    print("Tweets did not appear! Proceeding after timeout")

This code will make the web driver wait for an element with an attribute data-testid="tweet" to be loaded on the web page before moving ahead. This particular attribute is only present on tweets within a profile, so if the tweets have loaded, it’s a good sign that the rest of the page has loaded as well.


Please note: You need to be careful when deciding how to mark the page as loaded. The previous code snippet would work for a public profile with at least one tweet. However, it will fail in all other cases, and a WebDriverException will be thrown. In such cases, the script will proceed after waiting for the given timeout duration (which, in this case, is ten seconds).
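If you expect to scrape restricted profiles or profiles with zero tweets, one option is to wait for an element that exists on every profile page instead, such as the container with data-testid="UserName" that you’ll use in the next section. A minimal sketch of that variant:

try:
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div[data-testid="UserName"]')))
except WebDriverException:
    print("Profile header did not appear! Proceeding after timeout")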

Extracting Information

At this point, you’re ready for the most important part of the tutorial: extracting information. However, to extract data from the loaded web page, you need to learn the structure of the web page you’re scraping:

Name

If you open Chrome DevTools and locate the source code for the name element on the page, you should see something like this:

<span class="css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0">Bright Data</span>

The name element is wrapped in a span tag and is assigned a set of randomly generated class names. This means you can’t rely on these class names to identify the container tag for the user’s name element on the profile page. You’ll have to look for something static.

If you go up the hierarchy in the HTML source for the name element, you’ll find a div tag that contains both the name and the username in it (below several layers of spans). The starting tag for the div container will look like this:

<div class="css-1dbjc4n r-6gpygo r-14gqq1x" data-testid="UserName">

While this div also has randomly generated class names, it carries another attribute called data-testid. data-testid is an HTML attribute that is mainly used in UI testing to identify and locate HTML elements for automated tests. You can use this attribute to select the div container that holds the user’s name. However, the same container also contains the username (i.e., the Twitter handle). This means you need to split the text at the line break and then extract the first item (which is the user’s name):

name = driver.find_element(By.CSS_SELECTOR,'div[data-testid="UserName"]').text.split('\n')[0]
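Since the same container also holds the Twitter handle on the next line, you can capture both values in one pass if you need them. A small sketch based on the same split:

user_name_parts = driver.find_element(By.CSS_SELECTOR,'div[data-testid="UserName"]').text.split('\n')
name = user_name_parts[0]    # the user's display name
handle = user_name_parts[1]  # the @-prefixed Twitter handle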
Bio, Location, Website, and Join Date

In the same way that you identified the right selector for the name element, you need to find the correct selectors for the other data points. You’ll notice that the bio, location, website, and join date elements all have data-testids attached to them. This makes it easy to write CSS selectors to find the elements and extract their data:

bio = driver.find_element(By.CSS_SELECTOR,'div[data-testid="UserDescription"]').text
location = driver.find_element(By.CSS_SELECTOR,'span[data-testid="UserLocation"]').text
website = driver.find_element(By.CSS_SELECTOR,'a[data-testid="UserUrl"]').text
join_date = driver.find_element(By.CSS_SELECTOR,'span[data-testid="UserJoinDate"]').text
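Keep in mind that not every profile sets a bio, location, or website; if an element is missing, find_element() raises a NoSuchElementException. If you want the script to tolerate that, you could wrap the lookups in a small helper (the helper name here is just for illustration):

from selenium.common.exceptions import NoSuchElementException

def get_text_or_empty(css_selector):
    # return the element's text, or an empty string if the element is absent
    try:
        return driver.find_element(By.CSS_SELECTOR, css_selector).text
    except NoSuchElementException:
        return ""

location = get_text_or_empty('span[data-testid="UserLocation"]')
website = get_text_or_empty('a[data-testid="UserUrl"]')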
Followers and Following Count

When you look at the followers and following counts, you’ll notice that they don’t have data-testids attached to them, which means you have to come up with something creative to identify and select them correctly.

Going up the hierarchy doesn’t help since none of their close parents have any static attribute values attached to them. In this case, you need to turn to XPath.

XPath (short for XML Path Language) is a query language for addressing nodes in XML and HTML documents. You can write an XPath selector that looks for a span containing the text 'Following', moves up to its ancestor link, and then selects the first span inside it, which holds the count (since the label 'Following' and the count value are each wrapped in their own container tags):

following_count = driver.find_element(By.XPATH, "//span[contains(text(), 'Following')]/ancestor::a/span").text

Similarly, you can write an XPath-based selector for the followers count as well:

followers_count = driver.find_element(By.XPATH, "//span[contains(text(), 'Followers')]/ancestor::a/span").text
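Note that these counts come back as display strings (for example, '3,769' or '10.5K'), so if you plan to do any arithmetic with them, you’ll need to normalize them first. A rough sketch; the K/M suffix handling is an assumption about how Twitter abbreviates large counts:

def parse_count(value):
    # convert strings like "3,769", "10.5K", or "1.2M" into integers
    value = value.replace(",", "").strip()
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = value[-1].upper() if value else ""
    if suffix in multipliers:
        return int(float(value[:-1]) * multipliers[suffix])
    return int(value)

followers_total = parse_count(followers_count)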
Tweets

Fortunately, each tweet has a parent container with a data-testid value “tweet” (which you used earlier to check if tweets had loaded). You can use the find_elements() method instead of the find_element() method from Selenium to collect all elements that satisfy the given selector:

tweets = driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweet"]')
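Keep in mind that Twitter only renders a handful of tweets initially and loads more as you scroll. If you want more than the first screenful, you could scroll the page a few times before collecting the elements. A minimal sketch; the scroll count and delay are arbitrary:

import time

for _ in range(3):
    # scroll to the bottom so that more tweets are fetched and rendered
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

tweets = driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweet"]')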

Printing Everything

To print everything you extracted to stdout, use the following code:

print("Name\t\t: " + name)
print("Bio\t\t: " + bio)
print("Location\t: " + location)
print("Website\t\t: " + website)
print("Joined on\t: " + join_date)
print("Following count\t: " + following_count)
print("Followers count\t: " + followers_count)

To print the content of tweets, you need to loop through all the tweets and extract the text from inside the text container of the tweet (a tweet has other elements, such as avatar, username, time, and action buttons, apart from the main content). Here’s how you can use a CSS selector to do that:

for tweet in tweets:
    tweet_text = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="tweetText"]').text
    print("Tweet text\t: " + tweet_text)
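Some tweets (for example, media-only posts) may not contain a tweetText container, in which case find_element() throws and the loop stops. A slightly more defensive version of the loop, as a sketch:

from selenium.common.exceptions import NoSuchElementException

for tweet in tweets:
    try:
        tweet_text = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="tweetText"]').text
    except NoSuchElementException:
        continue  # skip tweets that have no text container
    print("Tweet text\t: " + tweet_text)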

Run the script with the following command:

python profile-page.py

And you should receive an output like this:

Name            : Bright Data
Bio             : The World's #1 Web Data Platform
Location        : We're everywhere!
Website         : brdta.com/2VQYSWC
Joined on       : Joined February 2016
Following count : 980
Followers count : 3,769
Tweet text      : Happy BOO-RIM! Our offices transformed into a spooky "Bright Fright" wonderland today. The treats were to die for and the atmosphere was frightfully fun...
Check out these bone-chilling sights:
Tweet text      : Our Bright Champions are honored each month, and today we are happy to present February's! Thank you for your exceptional work. 
Sagi Tsaeiri  (Junior BI Developer)
Or Dinoor (Compliance Manager)
Sergey Popov (R&D DevOps)
Tweet text      : Omri Orgad, Chief Customer Officer at 
@bright_data
, explores the benefits of outsourcing public web data collections for businesses using AI tools.
#WebData #ArtificialIntelligence

Click the link below to find out more
.
.
.
<output truncated>

Here’s the complete code for the scraping script:

# import the required packages and libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# set up a new Selenium driver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# define the username of the profile to scrape and generate its URL
username = "bright_data"
URL = "https://twitter.com/" + username + "?lang=en"

# load the URL in the Selenium driver
driver.get(URL)

# wait for the webpage to be loaded
# PS: this considers a profile page to be loaded when at least one tweet has been loaded
#     it might not work well for restricted profiles or public profiles with zero tweets
try:
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[data-testid="tweet"]')))
except WebDriverException:
    print("Tweets did not appear! Proceeding after timeout")

# extract the information using either CSS selectors (and data-testid) or XPath
name = driver.find_element(By.CSS_SELECTOR,'div[data-testid="UserName"]').text.split('\n')[0]
bio = driver.find_element(By.CSS_SELECTOR,'div[data-testid="UserDescription"]').text
location = driver.find_element(By.CSS_SELECTOR,'span[data-testid="UserLocation"]').text
website = driver.find_element(By.CSS_SELECTOR,'a[data-testid="UserUrl"]').text
join_date = driver.find_element(By.CSS_SELECTOR,'span[data-testid="UserJoinDate"]').text
following_count = driver.find_element(By.XPATH, "//span[contains(text(), 'Following')]/ancestor::a/span").text
followers_count = driver.find_element(By.XPATH, "//span[contains(text(), 'Followers')]/ancestor::a/span").text
tweets = driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweet"]')

# print the collected information
print("Name\t\t: " + name)
print("Bio\t\t: " + bio)
print("Location\t: " + location)
print("Website\t\t: " + website)
print("Joined on\t: " + join_date)
print("Following count\t: " + following_count)
print("Followers count\t: " + followers_count)

# print each collected tweet's text
for tweet in tweets:
    tweet_text = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="tweetText"]').text
    print("Tweet text\t: " + tweet_text)
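Printing to stdout is fine for a quick check, but you’ll likely want to persist the results. As a sketch, you could append something like the following to the end of the script to save everything to a JSON file (the file name and structure are just examples):

import json
from selenium.common.exceptions import NoSuchElementException

tweet_texts = []
for tweet in tweets:
    try:
        tweet_texts.append(tweet.find_element(By.CSS_SELECTOR,'div[data-testid="tweetText"]').text)
    except NoSuchElementException:
        pass  # skip tweets that have no text container

profile_data = {
    "name": name,
    "bio": bio,
    "location": location,
    "website": website,
    "join_date": join_date,
    "following": following_count,
    "followers": followers_count,
    "tweets": tweet_texts,
}

with open(username + "-profile.json", "w", encoding="utf-8") as f:
    json.dump(profile_data, f, ensure_ascii=False, indent=2)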

Twitter Scraping with Bright Data

While web scraping gives you a lot of flexibility and control over how you extract data from web pages, it can be difficult to set up. When the target web app loads most of its page data via XHR calls after the static page is loaded and there are very few static identifiers in the HTML to locate elements (as you saw earlier in Twitter’s case), figuring out the right configuration can be tricky.

In these instances, Bright Data can help. Bright Data is a web data platform that can help you extract huge amounts of unstructured data from the internet. Bright Data offers a product for scraping Twitter data that can help you obtain a detailed collection of almost all the possible data points from Twitter web pages.

For instance, the following are the instructions for how you can scrape the same Twitter user profile using Bright Data.

Start by navigating to the Bright Data Control Panel. Click on the View data products button to view the web scraping solutions offered by Bright Data:

Bright Data Control Panel

Next, click on Get started on the Web Scraper IDE card:

Datasets & web data

Bright Data provides you with a Web Scraper IDE that you can use to create your own scraper from scratch or from a baseline template. Bright Data also offers auto-scaling infrastructure and built-in debug tools to help you get started quickly.

You will be prompted to either create a scraper from scratch or use an existing template. If you want to get started quickly, check out the Twitter hashtag search template (which is what you’ll use here to get the initial IDE setup). Click on the Twitter hashtag search option:

You should be able to see the complete IDE on the screen with some code already added to the editor so you can get started. To use this IDE to scrape Twitter profile pages, remove the existing code in the editor and paste the following code in it:

const start_time = new Date().getTime();

block(['*.png*', '*.jpg*', '*.mp4*', '*.mp3*', '*.svg*', '*.webp*', '*.woff*']);
// Set a US IP address
country('us');
// Save the response data from a browser request
tag_response('profile', /\/UserTweets/)
// Store the website's URL here
let url = new URL('https://twitter.com/' + input["Username"]);

// function initialization
async function navigate_with_wait() {
  navigate(url, { wait_until: 'domcontentloaded' });
  try {
    wait_network_idle({ ignore: [/accounts.google.com/, /twitter.com\/sw.js/, /twitter.com\/i\/jot/] })
  } catch (e) { }
}

// calling navigate_with_wait function
navigate_with_wait()

// sometimes the page does not load; if a "Try again" button appears, click it and wait for the results
let try_count = 0
while (el_exists('[value="Try again"]') && try_count++ <= 5) {
  //   wait_page_idle(4000)
  if (el_exists('[value="Try again"]')) {
    try { click('[value="Try again"]', { timeout: 1e3 }) } catch (e) { }
  } else {
    if (location.href.includes(url)) break
    else navigate_2()
  }
  if (el_exists('[data-testid="empty_state_header_text"]')) navigate_2()
}

const gatherProfileInformation = (profile) => {

  // Extract tweet-related information
  let tweets = profile.data.user.result.timeline_v2.timeline.instructions[1].entries.flatMap(entry => {
    if (!entry.content.itemContent)
      return [];

    let tweet = entry.content.itemContent.tweet_results.result

    return {
      "text": tweet.legacy.full_text,
      "time": tweet.legacy.created_at,
      "id": tweet.legacy.id_str,
      "replies": tweet.legacy.reply_count,
      "retweets": tweet.legacy.retweet_count,
      "likes": tweet.legacy.favorite_count,
      "hashtags": tweet.legacy.entities?.hashtags.toString(),
      "tagged_users": tweet.legacy.entities?.user_mentions.toString(),
      "isRetweeted": tweet.legacy.retweeted,
      "views": tweet.views.count
    }
  })

  // Extract profile information from first tweet
  let profileDetails = profile.data.user.result.timeline_v2.timeline.instructions[1].entries[0].content.itemContent.tweet_results.result.core.user_results.result;

  // Prepare the final object to be collected
  let profileData = {
    "profile_name": profileDetails.legacy.name,
    "isVerified": profileDetails.legacy.verified, // Might need to swap with profileDetails.isBlueVerified
    "bio": profileDetails.legacy.description,
    "location": profileDetails.legacy.location,
    "following": profileDetails.legacy.friends_count,
    "followers": profileDetails.legacy.followers_count,
    "website_url": profileDetails.legacy.entities?.url.urls[0].display_url || "",
    "posts": profileDetails.legacy.statuses_count,
    "media_count": profileDetails.legacy.media_count,
    "profile_background_image_url": profileDetails.legacy.profile_image_url_https,
    "handle": profileDetails.legacy.screen_name,
    "collected_number_of_posts": tweets.length,
    "posts_info": tweets
  }

  // Collect the data in the IDE
  collect(profileData)

  return null;
}

try {
  if (el_is_visible('[data-testid="app-bar-close"]')) {
    click('[data-testid="app-bar-close"]');
    wait_hidden('[data-testid="app-bar-close"]');
  }
  // Scroll to the bottom of the page for all tweets to load
  scroll_to('bottom');

  // Parse the webpage data
  const { profile } = parse();

  // Collect profile information from the page
  gatherProfileInformation(profile)

} catch (e) {
  console.error(`Interaction warning (1 stage): ${e.message}`);
}

There are inline comments in the preceding code to help you understand what’s happening. The basic structure is as follows:

  1. Navigate to the profile page
  2. Wait for the page to load
  3. Intercept the response from the /UserTweets/ API
  4. Parse the response and extract the information

You will need to delete the existing input parameters and add a single input parameter called “Username” in the input section at the bottom of the page. Next, give it an input value such as “bright_data”. Then run the code by clicking the preview button:

The results will look like this:

Here’s the detailed JSON response for reference:

{
  "profile_name": "Bright Data",
  "isVerified": false,
  "bio": "The World's #1 Web Data Platform",
  "location": "We're everywhere!",
  "following": 981,
  "followers": 3970,
  "website_url": "brdta.com/2VQYSWC",
  "posts": 1749,
  "media_count": 848,
  "profile_background_image_url": "https://pbs.twimg.com/profile_images/1372153221146411008/U_ua34Q5_normal.jpg",
  "handle": "bright_data",
  "collected_number_of_posts": 40,
  "posts_info": [
    {
      "text": "This week we will sponsor and attend @neudatalab's London Data Summit 2023. @omri_orgad, our CCO, will also participate in a panel discussion on the impact of artificial intelligence on the financial services industry. \nWe look forward to seeing you there! \n#ai #financialservices https://t.co/YtVOK4NuKY",
      "time": "Mon Mar 27 14:31:22 +0000 2023",
      "id": "1640360870143315969",
      "replies": 0,
      "retweets": 1,
      "likes": 2,
      "hashtags": "[object Object],[object Object]",
      "tagged_users": "[object Object],[object Object]",
      "isRetweeted": false,
      "views": "386"
    },
    {
      "text": "Is our Web Unlocker capable of bypassing multiple anti-bot solutions? That's the question that @webscrapingclub sought to answer! \nIn their latest blog post, they share their hands-on, step-by-step challenge and their conclusions.\nRead here: https://t.co/VwxcxGMLWm",
      "time": "Thu Mar 23 11:35:32 +0000 2023",
      "id": "1638867069587566593",
      "replies": 0,
      "retweets": 2,
      "likes": 3,
      "hashtags": "",
      "tagged_users": "[object Object]",
      "isRetweeted": false,
      "views": "404"
    }
  ]
}
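If you export the collected records from the IDE as JSON (for example, to a hypothetical file named profile.json), you can post-process them with a few lines of Python, such as computing simple engagement stats across the collected posts:

import json

with open("profile.json", encoding="utf-8") as f:
    profile = json.load(f)

posts = profile["posts_info"]
if posts:
    total_likes = sum(post["likes"] for post in posts)
    total_retweets = sum(post["retweets"] for post in posts)
    print(f"{profile['profile_name']} (@{profile['handle']})")
    print(f"Collected posts: {len(posts)}")
    print(f"Average likes per post: {total_likes / len(posts):.1f}")
    print(f"Total retweets: {total_retweets}")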

In addition to web scraping capabilities, Bright Data offers social media datasets that carry highly enriched information based on data collected from social media websites like Twitter. You can use these to learn more about your target audience, pick up on trends, identify rising influencers, and do more!

Conclusion

In this article, you learned how to scrape information from Twitter using Selenium. While it’s possible to scrape data this way, it’s not ideal since it can be complicated and time-consuming. That’s why you also learned how to use Bright Data, which is a simpler solution for scraping Twitter data.
