Text Scraping: A Step-By-Step Tutorial

This guide covers text scraping in Python, from setup to data storage, with tips on using proxies to avoid IP blocks.

Web scraping is the process of extracting data from web pages. Because data can take many forms, the term text scraping refers specifically to collecting textual data.

Relevant data is essential to every successful business decision. Scraping information from competitor websites can give you insight into their business logic, which can help you gain a competitive edge. In this tutorial, you’ll learn how to implement a text scraper in Python, making it easy to extract and use web data.

Prerequisites

Before you start this tutorial, you need the following prerequisites:

  • The latest version of Python and pip installed on your system.
  • A Python virtual environment with all the necessary packages installed: requests to fetch the HTML content of a web page, Beautiful Soup to parse and extract the desired text or data from the HTML, and pandas to organize and store the extracted data in a structured format, such as a CSV file (see the setup sketch after this list).
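
For reference, here’s one way to set that up from a terminal (a minimal sketch; adjust the activation command to your shell):

# Create and activate a virtual environment (Linux/macOS)
python3 -m venv venv
source venv/bin/activate

# On Windows, activate with venv\Scripts\activate instead

# Install the packages used in this tutorial
pip install requests beautifulsoup4 pandas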

If you’re looking for more information to help get you started web scraping with Python, check out this article.

Understanding the Website Structure

Before you start scraping, you need to analyze the structure of the website you’re targeting. Websites are built using HTML, which is a markup language that defines how the content is organized and displayed.

Each piece of content, whether it’s a headline, paragraph, or link, is enclosed within HTML tags. These tags help you identify where the data you want to scrape is located. In this tutorial, you scrape quotes from Quotes to Scrape, a mock website built for scraping practice. To view its structure, open the website in your browser and access the developer tools by right-clicking the page and selecting Inspect or Inspect Element. This brings up the HTML code of the page:

Inspect element in a web browser

Take some time to familiarize yourself with the structure. Look for tags like <div>, <span>, <p>, and <a>, as these often contain the text or links you might want to extract. Also, note that tags usually contain a class attribute. Its purpose is to define a specific class for the HTML element, allowing it to be styled with CSS or selected with JavaScript.

Note: The class attribute is particularly useful in text scraping because it helps you target specific elements on a page that share the same styling or structure, making it easier to extract the exact data you need.

Here, each quote is contained in a div element with the class quote. If you’re interested in the text and author of each quote, the text is contained within a span element with the class text, and the author within a small element with the class author:

HTML structure of a quote
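
In simplified form, the markup for a single quote looks roughly like this (abridged here; the live page includes extra attributes, such as itemprop):

<div class="quote">
    <span class="text">“The world as we have created it is a process of our thinking. ...”</span>
    <span>by <small class="author">Albert Einstein</small></span>
    <div class="tags">...</div>
</div>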

If you’re not familiar with how HTML works, check out this HTML web scraping article to learn more.

Text-Scraping a Website

With the structure of the website in mind, the next step is to write the code that you’ll use to scrape the Quotes to Scrape site.

Python is a popular choice for this task due to its ease of use and powerful libraries, including requests and BeautifulSoup. You use the requests library to fetch the HTML content of the page. This is necessary because you need to retrieve the raw data before you can analyze or extract it. Once you have the HTML content, you can break it down into a more manageable structure using BeautifulSoup.

To start, create a Python file for the text-scraping script named text-scraper.py. Then, import BeautifulSoup and requests:

import requests
from bs4 import BeautifulSoup

Specify the URL of the website you’re scraping and send a GET request:

# URL of the quotes website
url = 'https://quotes.toscrape.com/'

# Send a GET request to the URL
response = requests.get(url)
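
Optionally, you can make the script fail fast on HTTP errors before you try to parse anything; this safeguard isn’t required for the tutorial, but requests provides it out of the box:

# Raise an exception for 4xx/5xx responses
response.raise_for_status()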

After sending the GET request, you receive the HTML of the entire page. You have to parse it to extract only the data you need, which in this case is the text and author of each quote. To do so, create a BeautifulSoup object to parse the HTML:

soup = BeautifulSoup(response.text, 'html.parser')

Find all the div elements that contain quotes (that is, all elements with the quote class):

quotes = soup.find_all('div', class_='quote')

Create a list to store the quotes in:

data = []

Then, extract the text and author from each quote and store it in the data list:

for quote in quotes:
    text = quote.find('span', class_='text').text.strip()
    author = quote.find('small', class_='author').text.strip()

    data.append({
        'Text': text,
        'Author': author
    })

The full script, which uses pprint from the standard library for readable output, should look something like this:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

# URL of the quotes website
url = 'https://quotes.toscrape.com/'

# Send a GET request to the URL
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Find all quote containers
quotes = soup.find_all('div', class_='quote')

# Extract data from each quote
data = []
for quote in quotes:
    text = quote.find('span', class_='text').text.strip()
    author = quote.find('small', class_='author').text.strip()

    data.append({
        'Text': text,
        'Author': author
    })

# Pretty-print the extracted data
pprint(data)

Now it’s time to run the script from your terminal:

# For Linux and macOS
python3 text-scraper.py

# For Windows
python text-scraper.py

You should get a list of extracted quotes printed out:

[{'Author': 'Albert Einstein',
  'Text': '“The world as we have created it is a process of our thinking. It '
          'cannot be changed without changing our thinking.”'},
 {'Author': 'J.K. Rowling',
  'Text': '“It is our choices, Harry, that show what we truly are, far more '
          'than our abilities.”'},
 {'Author': 'Albert Einstein',
  'Text': '“There are only two ways to live your life. One is as though '
          'nothing is a miracle. The other is as though everything is a '
          'miracle.”'},
 {'Author': 'Jane Austen',
  'Text': '“The person, be it gentleman or lady, who has not pleasure in a '
          'good novel, must be intolerably stupid.”'},
 {'Author': 'Marilyn Monroe',
  'Text': "“Imperfection is beauty, madness is genius and it's better to be "
          'absolutely ridiculous than absolutely boring.”'},
 {'Author': 'Albert Einstein',
  'Text': '“Try not to become a man of success. Rather become a man of '
          'value.”'},
 {'Author': 'André Gide',
  'Text': '“It is better to be hated for what you are than to be loved for '
          'what you are not.”'},
 {'Author': 'Thomas A. Edison',
  'Text': "“I have not failed. I've just found 10,000 ways that won't work.”"},
 {'Author': 'Eleanor Roosevelt',
  'Text': '“A woman is like a tea bag; you never know how strong it is until '
          "it's in hot water.”"},
 {'Author': 'Steve Martin',
  'Text': '“A day without sunshine is like, you know, night.”'}]

While this text scraping seemed fairly straightforward, you’ll likely run into challenges during web scraping, such as IP blocking if the website detects too many requests, or CAPTCHAs meant to prevent automated access. To overcome these challenges, you can use proxies.

Using Proxies for Anonymous Scraping

Proxies help you avoid IP blocks and CAPTCHAs by rotating your IP address, making your requests appear to come from different locations. To use proxies, you need to configure the requests.get() method to route all requests through a proxy server.

In this scenario, you use the Bright Data rotating proxies, which give you access to more than 72 million IP addresses from over 195 countries. To start, create a free Bright Data account by selecting Start Free Trial in the top-right corner, filling out the registration form, and clicking Create Account:

Bright Data sign-up form

Create a Basic Residential Proxy

Once you have a Bright Data account, log in and navigate to the Proxies & Scraping section. Under Proxy networks, find Residential proxies and click Get Started:

Bright Data dashboard: Proxies & Scraping section

You are prompted to add a new zone for the residential proxy. Keep all the defaults, name the zone, and click Add:

Create a new residential proxy zone

And that’s all it takes to create a new residential proxy zone!

To use the proxy, you need your credentials (i.e., username, password, and host). To find them, go to the Proxies & Scraping section again and select the proxy zone you just created:

List of created proxy zones

Clicking the proxy zone opens its control panel. Under the Authorization section, you’ll find your credentials:

Bright Data proxy zone credentials

Update the Scraping Script

Now that you have your proxy credentials, it’s time to configure the proxy. To start, store your credentials as variables:

host = 'brd.superproxy.io'
port = 22225

username = 'brd-customer-<customer_id>-zone-<zone_name>'
password = '<zone_password>'

Then, compose a proxy URL out of the stored credentials:

proxy_url = f'http://{username}:{password}@{host}:{port}'

Create a proxy configuration for both HTTP and HTTPS requests:

proxies = {
    'http': proxy_url,
    'https': proxy_url
}

And add the proxy configuration to the existing requests.get() call:

response = requests.get(url, proxies=proxies)

At this point, your script should look like this:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

# BrightData credentials
host = 'brd.superproxy.io'
port = 22225

username = 'brd-customer-<customer_id>-zone-<zone_name>'
password = '<zone_password>'

# Compose a proxy URL
proxy_url = f'http://{username}:{password}@{host}:{port}'

# Create a proxy configuration
proxies = {
    'http': proxy_url,
    'https': proxy_url
}

# URL of the quotes website
url = 'https://quotes.toscrape.com/'

# Send a GET request to the URL via the specified proxy
response = requests.get(url, proxies=proxies)

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Find all quote containers
quotes = soup.find_all('div', class_='quote')

# Extract data from each quote
data = []
for quote in quotes:
    text = quote.find('span', class_='text').text.strip()
    author = quote.find('small', class_='author').text.strip()

    data.append({
        'Text': text,
        'Author': author
    })

# Pretty-print the extracted data
pprint(data)

Run and Test the Script

Running this script gives you the same result as the script with no proxies. The difference is that the website you’re scraping now thinks the request is coming from somewhere else, so your actual location remains private. You can illustrate this with a simple new script.

Import the necessary libraries and set the url to "http://lumtest.com/myip.json" in the script:

import requests
from bs4 import BeautifulSoup

url = "http://lumtest.com/myip.json"

Send a GET request to the url without a proxy configuration and create a BeautifulSoup object for the response:

# Send a GET request to the URL
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

Finally, print the soup object:

print(soup)

Run this script, and you get information about your IP address and location in response.
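
The response is a small JSON document that looks something like this (the values shown here are purely illustrative):

{"ip": "203.0.113.42", "country": "US", "asn": {"asnum": 64496, "org_name": "Example ISP"}, "geo": {"city": "New York", "region": "NY"}}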

To compare, configure the GET request to use a Bright Data proxy and leave everything else the same:

# BrightData credentials
host = 'brd.superproxy.io'
port = 22225

username = 'brd-customer-<customer_id>-zone-<zone_name>'
password = '<zone_password>'

proxy_url = f'http://{username}:{password}@{host}:{port}'

proxies = {
    'http': proxy_url,
    'https': proxy_url
}

# Send a GET request to the URL
response = requests.get(url, proxies=proxies)

When you run the updated script, you should see a different IP address in the response. That’s not your actual IP but the IP address of the proxy you set up. You’re essentially hiding your IP address behind one of the proxy servers.

Storing Data

Once you’ve successfully scraped data from a website, the next step is to store it in a structured format that allows for easy access and analysis. CSV is a popular choice because it’s widely supported by data analysis tools and programming languages.

To save scraped data to a CSV file, start by importing the pandas library (at the top of the scraping script) since it has methods for converting data into CSV format:

import pandas as pd

Then, create a pandas DataFrame object out of the scraped data you collected:

df = pd.DataFrame(data)

Finally, convert the DataFrame to a CSV file and give it a name (e.g., quotes.csv):

df.to_csv('quotes.csv', index=False)

After making these changes, run the script again. The scraped data is now stored in quotes.csv.
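
Based on the quotes scraped earlier, quotes.csv should look something like this (only the first rows are shown):

Text,Author
"“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",Albert Einstein
"“It is our choices, Harry, that show what we truly are, far more than our abilities.”",J.K. Rowling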

In this simple example, there’s not much you can do with the quotes. However, depending on the data you scrape, there are all kinds of ways you can analyze it to extract insights.

You could start by exploring descriptive statistics using the pandas describe() function. This function provides a quick overview of your numerical data, including mean, median, and standard deviation. You could visualize your data using Matplotlib or seaborn to create histograms, scatter plots, or bar charts, helping you identify patterns or trends visually. For textual data, consider using natural language processing techniques, like word frequency analysis or sentiment analysis, to understand common themes or overall sentiment in reviews or comments.
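
For example, here’s a minimal word-frequency sketch over the scraped quotes, assuming the quotes.csv file produced above (naive whitespace tokenization, with no stop-word removal):

import pandas as pd
from collections import Counter

df = pd.read_csv('quotes.csv')

# Naive tokenization: lowercase everything and split on whitespace
words = ' '.join(df['Text']).lower().split()

# Print the ten most frequent words across all quotes
print(Counter(words).most_common(10))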

To derive deeper insights, look for correlations between different variables in your data set. For example, you might examine the relationship between book ratings and review length, or analyze how ratings vary across different genres or authors. Use the pandas groupby() function to aggregate data and compare metrics across categories.
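
With the quotes data set, for instance, a simple aggregation can show how many quotes each author contributes (again assuming the quotes.csv file from above):

import pandas as pd

df = pd.read_csv('quotes.csv')

# Count quotes per author, most quoted first
quotes_per_author = df.groupby('Author').size().sort_values(ascending=False)
print(quotes_per_author)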

Don’t forget to consider the context of your data and the questions you’re trying to answer. For instance, if you’re analyzing book reviews, you might investigate which factors contribute most to high ratings or identify trends in popular genres over time. Always be critical of your findings and consider potential biases in your data collection process.

Conclusion

In this tutorial, you learned how to text scrape with Python, explored the benefits of using proxies, and discovered how the Bright Data rotating proxies can help you avoid IP blocks and maintain anonymity.

While developing your own scraping solutions can be rewarding, it often comes with challenges, like maintaining code, handling CAPTCHAs, and staying compliant with website policies. This is where the Bright Data scraping APIs can help. With features like automatic CAPTCHA solving, IP rotation, and robust data parsing, Bright Data simplifies the scraping process and allows you to focus on data analysis rather than infrastructure management.

Sign up for a Bright Data free trial (no credit card required) to see how Bright Data can enhance your web scraping projects, providing you with reliable, scalable, and efficient data collection solutions for your business needs.