Learn how to use Beautiful Soup for web scraping with Python in 3 minutes

Looking for a quick tutorial to get started with web scraping using Python’s Beautiful Soup? You have come to the right place. Read on and get started immediately.
Rafael Levi | Senior Business Manager
05-Jan-2021

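In this article, we will discuss:

How does Web Scraping Work?
What is Beautiful Soup?
Installing Requests and Beautiful Soup
Easy steps for scraping in Python using Requests and Beautiful Soup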

How does Web Scraping Work?

Scraping a web page means requesting specific data from a target webpage. When you scrape a page, the code you write sends a request to the server hosting the target page. The code then downloads the page and extracts only the elements you specified when you set up the scraping job.

For example, let’s say we are looking to target data in H3 title tags. We would write code for a scraper that looks specifically for that information. The scraper will work in three stages:

Step 1: Send a request to the server to download the site’s content.

Step 2: Filter the page’s HTML to look for the desired H3 tags.

Step 3: Copy the text inside the target tags and produce the output in the format specified in the code.
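
The three stages can be sketched in a few lines. This is a minimal illustration, not a full scraper: the download in Step 1 is replaced by a hard-coded sample page (SAMPLE_HTML, an invented stand-in), and the helper name extract_h3_titles is our own. It uses Python's built-in "html.parser" so that it runs without any extra parser installed.

```python
from bs4 import BeautifulSoup

# Stand-in for the page a real scraper would download in Step 1.
# The markup is invented for this illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Store</h1>
  <h3>First product</h3>
  <p>Description</p>
  <h3>Second product</h3>
</body></html>
"""


def extract_h3_titles(html):
    """Steps 2 and 3: find every <h3> tag and return its text as a list."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("h3")]


titles = extract_h3_titles(SAMPLE_HTML)
print(titles)  # ['First product', 'Second product']
```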

It is possible to carry out web scraping tasks in many programming languages with different libraries, but using Python with the Beautiful Soup library is one of the most popular and effective methods. In the following sections, we will cover the basics of scraping in Python using Beautiful Soup.

What is Beautiful Soup?

Beautiful Soup provides simple methods for navigating, searching, and modifying the parse tree of HTML and XML files. It transforms a complex HTML document into a tree of Python objects. It also automatically converts the document to Unicode, so you don’t have to think about encodings. This tool not only helps you scrape data but also helps you clean it. Beautiful Soup supports the HTML parser included in Python’s standard library, and it also supports several third-party Python parsers such as lxml and html5lib.
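
To make "navigating, searching, and modifying" concrete, here is a small sketch on an invented HTML snippet. The markup and variable names are ours, and the standard library parser "html.parser" is used so that nothing beyond Beautiful Soup is required.

```python
from bs4 import BeautifulSoup

# Invented snippet, used only to illustrate the three kinds of operations.
html = "<div><p class='intro'>Hello <b>world</b></p><p>Second</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Navigating: walk down the tree with attribute access.
first_p = soup.div.p
bold_text = first_p.b.string

# Searching: find tags by name or by CSS class.
paragraphs = soup.find_all("p")
intro = soup.find("p", class_="intro")

# Modifying: replace the text inside the <b> tag.
first_p.b.string = "reader"

print(bold_text)           # world
print(len(paragraphs))     # 2
print(intro.get_text())    # Hello reader
```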

You can learn more about the full spectrum of its capabilities here: Beautiful Soup documentation.

Installing Requests and Beautiful Soup

To install Beautiful Soup, you need pip or another Python package installer. You can also install it from Jupyter Lab. In this post, we will use pip as it is the most convenient. Open your terminal or Jupyter Lab and write:

pip install beautifulsoup4

Installing a parser

You may also want to install a parser that interprets the HTML, for example, ‘html5lib’. To do this, run the following command in the terminal:

pip install html5lib


Installing Requests

pip install requests

Another method is to download the libraries manually by following these links:

1: Requests

2: html5lib

3: Beautifulsoup4
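
Whichever method you use, you can check that Python can find the packages before moving on. The sketch below is our own helper (the function name installed is hypothetical), and it relies only on the standard library. Note that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util


def installed(packages=("requests", "bs4", "html5lib")):
    """Return a dict mapping each import name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


for name, ok in installed().items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```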

Easy steps for scraping in Python using Requests and Beautiful Soup

Step 1: You need to send an HTTP request to the server of the page you want to scrape. The server responds by sending the HTML content of the web page. Since we are using Python for our requests, we need a third-party HTTP library, and we will use Requests.

Start by importing the Requests library and making a simple GET request to the URL. We chose https://www.brainyquote.com/topics/motivational-quotes because it has a straightforward HTML structure and will allow us to demonstrate the potential of Beautiful Soup easily.

If you get <Response [200]>, the site responded with an HTTP 200 OK status code and sent us the HTML content of the page.

Let’s make the same request again, but this time we’ll store the result in a variable called r and print its content.

The output should be the entire HTML code for this page. As you can see, it’s unstructured, and Beautiful Soup will help us clean it up and get the data that we need.
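
Step 1 can be sketched as a small helper. The function name fetch_page is our own, and the timeout and the raise_for_status() call are additions that the steps above do not mention: they stop the script from hanging on a slow server or from parsing an error page by mistake.

```python
import requests


def fetch_page(url, timeout=10):
    """Send a GET request and return the Response object.

    The timeout and the status check are our additions, not part of the
    original walkthrough.
    """
    r = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses.
    r.raise_for_status()
    return r
```

Calling r = fetch_page("https://www.brainyquote.com/topics/motivational-quotes") and then print(r) should show the response object with its status code, while print(r.content) shows the raw HTML bytes of the page.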

Step 2: Now that we have the HTML content, we need to parse the data. For that, we will use Beautiful Soup with the html5lib parser.

We need to pass two values into BeautifulSoup():

#1: HTML string from the website; ‘r.content’

#2: What HTML parser to use; ‘html5lib’
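
In code, this step is a single call. The wrapper below is a sketch with a hypothetical name (make_soup). It defaults to ‘html5lib’ as described above, and you can pass "html.parser" instead if html5lib is not installed.

```python
from bs4 import BeautifulSoup


def make_soup(html, parser="html5lib"):
    """Parse raw HTML (str or bytes, e.g. r.content) into a BeautifulSoup tree."""
    return BeautifulSoup(html, parser)
```

With the response from Step 1 stored in r, the call would be soup = make_soup(r.content).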

Step 3: At this point, you want to go to the site you are scraping. Open up DevTools (F12) and go to the Elements tab. We are going to be looking for the top table layer.
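
Once you have spotted the container in DevTools, you can select it with find(). The sketch below runs on invented markup: the id "quotesList" and the class "quote-row" are placeholders, not necessarily what the live page uses, so substitute whatever you see in the Elements tab. It uses "html.parser" so that it runs without html5lib.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for the downloaded page.
# "quotesList" and "quote-row" are hypothetical names.
SAMPLE_HTML = """
<div id="header">Site navigation</div>
<div id="quotesList">
  <div class="quote-row"><img alt="Stay curious. - Jane Doe"></div>
  <div class="quote-row"><img alt="Keep going. - John Roe"></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
table = soup.find("div", attrs={"id": "quotesList"})
print(table is not None)  # True
```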


Let’s print the table with .prettify() to get a better idea of what we have so far.

Your output should be the table’s HTML with each tag and each piece of text on its own line, indented by nesting level.
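
Here is a minimal sketch of .prettify() on a one-line invented snippet (the id "quotesList" is a placeholder). The real page will produce far more output, but the shape is the same.

```python
from bs4 import BeautifulSoup

# Invented one-line snippet; "quotesList" is a hypothetical id.
soup = BeautifulSoup('<div id="quotesList"><p>Stay curious.</p></div>', "html.parser")
table = soup.find("div", attrs={"id": "quotesList"})

# prettify() returns a string with one tag or text node per line.
pretty = table.prettify()
print(pretty)
```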

Now we look for the data that we need. For this example, all we want is the quoted text and the author’s name. As you can see, all of this data is at

So let’s loop through all instances of this class and get all quotes in our table.

You should now only have the

in each loop instance available to you. You can test this by running print(row) in the loop.
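
The loop can be sketched as follows. The markup is invented and the class name "quote-row" is a placeholder for the class you found in DevTools. find_all() returns every tag whose class list contains that name.

```python
from bs4 import BeautifulSoup

# Invented markup; "quotesList" and "quote-row" are hypothetical names.
SAMPLE_HTML = """
<div id="quotesList">
  <div class="quote-row"><img alt="Stay curious. - Jane Doe"></div>
  <div class="ad-banner">Not a quote</div>
  <div class="quote-row"><img alt="Keep going. - John Roe"></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("div", attrs={"id": "quotesList"})

# Keep only the rows that carry the target class.
rows = table.find_all("div", attrs={"class": "quote-row"})
for row in rows:
    print(row)
```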

We are looking for the information under the “img alt” key, so let’s create a quote variable and assign this data to it.


As you can see, I wrapped it in a ‘try’ statement. This way, if one of the rows does not have the data you are looking for, you will not get an error, and the loop will continue. I also split the results at ‘-’. As you saw earlier, the text and the author name are separated by a ‘-’, so let’s use that to split the two.
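
A sketch of this extraction on invented markup is shown below. The class name "quote-row" and the sample quotes are placeholders. One small variation from the description above: the sketch uses rsplit(" - ", 1) so that a hyphen inside the quote text does not cause an extra split.

```python
from bs4 import BeautifulSoup

# Invented markup; the second row has no <img> and should be skipped.
SAMPLE_HTML = """
<div id="quotesList">
  <div class="quote-row"><img alt="Stay curious. - Jane Doe"></div>
  <div class="quote-row"><p>No image in this row</p></div>
  <div class="quote-row"><img alt="Keep going. - John Roe"></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
quotes = []
for row in soup.find_all("div", attrs={"class": "quote-row"}):
    try:
        # row.img is None when the row has no image (TypeError on lookup);
        # a missing alt attribute raises KeyError.
        alt_text = row.img["alt"]
        # Split on the last " - " only; ValueError if there is no separator.
        text, author = alt_text.rsplit(" - ", 1)
    except (TypeError, KeyError, ValueError):
        continue
    quotes.append({"text": text, "author": author})

print(quotes)
```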

That’s it, you are done. Your quote variable should now hold the quote text and the author’s name as separate values.


At the end of this process, you can save your data in a file.
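
Putting the pieces together, a complete script might look like the sketch below. It is not the original code from this tutorial: the function names, the class name "quote-row", and the CSV layout are our own assumptions, and the default parser is "html.parser" so that the sketch runs without html5lib.

```python
import csv

from bs4 import BeautifulSoup


def extract_quotes(html, parser="html.parser"):
    """Return a list of {'text', 'author'} dicts from the page HTML.

    Assumes each quote sits in a tag with the hypothetical class
    'quote-row' and stores "text - author" in an <img alt="..."> attribute.
    """
    soup = BeautifulSoup(html, parser)
    quotes = []
    for row in soup.find_all("div", attrs={"class": "quote-row"}):
        try:
            text, author = row.img["alt"].rsplit(" - ", 1)
        except (TypeError, KeyError, ValueError):
            continue  # skip rows without the expected data
        quotes.append({"text": text, "author": author})
    return quotes


def save_quotes(quotes, path):
    """Write the quotes to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "author"])
        writer.writeheader()
        writer.writerows(quotes)
```

With the response from Step 1 stored in r, you would call save_quotes(extract_quotes(r.content), "quotes.csv"), where quotes.csv is just an example file name.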

 


Rafael Levi is a senior business manager at Bright Data. He specializes in data collection automation and works closely with customers to help them achieve their goals. He firmly believes that the future of any e-business lies in data aggregation and automation.