BeautifulSoup Web Scraping Guide

Looking for a quick tutorial to help you get started web scraping using Python’s Beautiful Soup? You have come to the right place – read on and get started immediately.

In this article, we will discuss how web scraping works, what Beautiful Soup is, how to install it alongside Requests, and how to scrape a page step by step.

How does Web Scraping Work?

Scraping a web page means requesting specific data from a target webpage. When you scrape a page, the code you write sends a request to the server hosting that page, downloads the page, and extracts only the elements defined in the crawling job.

For example, let’s say we are looking to target data in H3 title tags. We would write code for a scraper that looks specifically for that information. The scraper will work in three stages:

Step 1: Send a request to the server to download the site’s content.

Step 2: Filter the page’s HTML to look for the desired H3 tags.

Step 3: Copy the text inside the target tags and produce the output in the format previously specified in the code.
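These three stages can be sketched with Beautiful Soup. In this sketch, the inline HTML string stands in for a downloaded page; in a real job, Stage 1 would fetch it from the server:

```python
from bs4 import BeautifulSoup

# Stage 1 stand-in: in a real job this HTML would come from
# requests.get(url).text
html = "<html><body><h3>First title</h3><h3>Second title</h3></body></html>"

# Stage 2: parse the HTML and filter for the desired H3 tags
soup = BeautifulSoup(html, "html.parser")
headings = soup.find_all("h3")

# Stage 3: copy the text inside the target tags
titles = [h.get_text() for h in headings]
print(titles)  # ['First title', 'Second title']
```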

It is possible to carry out web scraping tasks in many programming languages with different libraries, but using Python with the Beautiful Soup library is one of the most popular and effective methods. In the following sections, we will cover the basics for scraping in Python using Beautiful Soup.

What is Beautiful Soup?

Beautiful Soup provides simple methods for navigating, searching, and modifying a parse tree in HTML and XML files. It transforms a complex HTML document into a tree of Python objects. It also automatically converts the document to Unicode, so you don’t have to think about encodings. This tool helps you not only scrape but also clean the data. Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports several third-party Python parsers like lxml and html5lib.
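For instance, a tiny document parses into a tree whose tags and text you can reach as Python attributes; the HTML string here is just an illustration:

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='intro'>Hello, <b>world</b>!</p></body></html>"

# Uses the HTML parser from Python's standard library
soup = BeautifulSoup(html, "html.parser")

print(soup.p["class"])       # ['intro']
print(soup.p.b.get_text())   # world
print(soup.p.get_text())     # Hello, world!
```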

You can learn more about the full spectrum of its capabilities here: Beautiful Soup documentation.

Installing Requests and Beautiful Soup

To install Beautiful Soup, you need pip or another Python package installer. You can also work from Jupyter Lab. In this post, we will use pip, as it is the most convenient. Open your terminal or Jupyter Lab and use pip to install the beautifulsoup4, html5lib, and requests packages.
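With pip, the three installs look like this:

```shell
pip install beautifulsoup4
pip install html5lib
pip install requests
```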

Another method is to download the libraries manually by following these links :

1: Requests

2: html5lib

3: Beautifulsoup4

Easy steps for scraping in Python using Requests and Beautiful Soup

Step 1: You need to send an HTTP request to the server of the page you want to scrape. The server responds by sending the HTML content of the web page. Since we are using Python for our requests, we need a third-party HTTP library, and we will use Requests.

Start by importing the Requests library and making a simple GET request to a URL. We chose https://www.brainyquote.com/topics/motivational-quotes because it has a straightforward HTML structure and will allow us to demonstrate the potential of Beautiful Soup easily. If you get a response [200], the site responded with an HTTP 200 OK status code and sent us the HTML content of the page.

Let’s make the same request again, but this time we’ll store the result in a variable called r and print its content. The output should be the entire HTML code for this page. As you can see, it is unstructured, and Beautiful Soup will help us clean it up and get the data we need.
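A minimal sketch of the request, using the same URL as above; the try/except is just a safety net in case the network or the site is unavailable:

```python
import requests

url = "https://www.brainyquote.com/topics/motivational-quotes"

try:
    r = requests.get(url, timeout=10)
    print(r)                # <Response [200]> if the site responded OK
    print(r.content[:200])  # the first bytes of the raw, unstructured HTML
except requests.RequestException as exc:
    print(f"Request failed: {exc}")
```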

Step 2: Now that we have the HTML content, we need to parse the data. For that, we will use Beautiful Soup with the html5lib parser. We need to pass two values into BeautifulSoup():

#1: HTML string from the website; ‘r.content’

#2: Which HTML parser to use; ‘html5lib’
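A sketch of the call, with an inline byte string standing in for r.content (html5lib must be installed, as shown earlier):

```python
from bs4 import BeautifulSoup

# Stand-in for r.content returned by the earlier request
content = b"<html><head><title>Motivational Quotes</title></head><body></body></html>"

# #1: the HTML string from the website; #2: the parser to use
soup = BeautifulSoup(content, "html5lib")
print(soup.title.get_text())  # Motivational Quotes
```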

Step 3: At this point, you want to go to the site you are scraping. Open up Devtools (F12) and go to the Elements tab. We are going to be looking for the top table layer.


Let’s print the table with .prettify() to get a better idea of what we have so far.
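Since the original screenshots are missing, here is a self-contained sketch of selecting and printing the table; the tag and class names are placeholders for whatever you actually find in the Elements tab:

```python
from bs4 import BeautifulSoup

# Stand-in HTML mimicking the page; the class names are placeholders
html = """
<div class="quote-table">
  <div class="quote-row"><img alt="Stay curious. - A. Author"></div>
  <div class="quote-row"><img alt="Keep going. - B. Writer"></div>
</div>
"""
soup = BeautifulSoup(html, "html5lib")

# Grab the top table layer found in Devtools
table = soup.find("div", attrs={"class": "quote-table"})
print(table.prettify())
```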


Now we look for the data that we need. For this example, all we want is the quoted text and the author’s name. As you can see, all of this data sits inside the rows of the table, and every row shares the same class.

So let’s loop through all instances of this class and get all the quotes in our table. You should now have only a single row available to you in each loop instance. You can test this by running print(row) in the loop.

We are looking for the information under the “img alt” key, so let’s create a quote variable and assign this data to it.


As you can see, I wrapped it in a try statement. This way, if one of the rows does not have the data you are looking for, you will not get an error, and the loop will continue forward. I also split the results at ‘-’. As you saw earlier, the quote text and the author’s name are separated by a ‘-’, so let’s use that to split the two apart.
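Putting the loop, the try statement, and the split together, again with placeholder HTML and class names standing in for the real page:

```python
from bs4 import BeautifulSoup

# Placeholder HTML; the last row deliberately lacks the expected data
html = """
<div class="quote-table">
  <div class="quote-row"><img alt="Stay curious. - A. Author"></div>
  <div class="quote-row"><img alt="Keep going. - B. Writer"></div>
  <div class="quote-row"></div>
</div>
"""
soup = BeautifulSoup(html, "html5lib")
table = soup.find("div", attrs={"class": "quote-table"})

quotes = []
for row in table.find_all("div", attrs={"class": "quote-row"}):
    try:
        # The quote text and the author are separated by ' - '
        # inside the image's alt attribute
        text, author = row.img["alt"].split(" - ")
        quotes.append({"quote": text, "author": author})
    except (AttributeError, TypeError, ValueError):
        # A row without the expected data is skipped instead of
        # crashing the loop
        continue

print(quotes)
```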

That’s it, you are done. Your quote variable should now hold the quote text and the author’s name from each row.

At the end of this process, you can save your data in a file.
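As a recap, a complete script in the spirit of the walkthrough might look like the following; the URL is the one used above, while the tag and class names are assumptions you would replace with what Devtools shows you:

```python
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://www.brainyquote.com/topics/motivational-quotes"


def scrape_quotes(url):
    # Step 1: download the page's HTML
    r = requests.get(url, timeout=10)

    # Step 2: parse the HTML with html5lib
    soup = BeautifulSoup(r.content, "html5lib")

    # Step 3: select the table and collect quote/author pairs;
    # the tag and class names here are placeholders
    table = soup.find("div", attrs={"class": "quote-table"})
    quotes = []
    for row in table.find_all("div", attrs={"class": "quote-row"}):
        try:
            text, author = row.img["alt"].split(" - ")
            quotes.append({"quote": text, "author": author})
        except (AttributeError, TypeError, ValueError):
            continue
    return quotes


def save_quotes(quotes, path):
    # Save the results as a CSV file
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["quote", "author"])
        writer.writeheader()
        writer.writerows(quotes)


# To run the full job:
# save_quotes(scrape_quotes(URL), "quotes.csv")
```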

 
