Learn How To Use Beautiful Soup For Web Scraping With Python In 3 minutes
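In this article, we will discuss how web scraping works, what Beautiful Soup is, how to install Requests and Beautiful Soup, and the steps for scraping in Python with both libraries.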
How does Web Scraping Work?
Scraping a web page means requesting specific data from a target webpage. When you scrape a page, your code sends a request to the server hosting that page, downloads the page, and then extracts only the elements you specified when you defined the scraping job.
For example, let’s say we are looking to target data in H3 title tags. We would write code for a scraper that looks specifically for that information. The scraper will work in three stages:
Step 1: Send a request to the server to download the site’s content.
Step 2: Filter the page’s HTML to look for the desired H3 tags.
Step 3: Copy the text inside the target tags and produce the output in the format specified in the code.
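The three stages above can be sketched in a few lines. This is a minimal illustration, not a full scraper: the HTML string is a made-up stand-in for what a server would return in Step 1, and Python's built-in html.parser is used so that nothing beyond Beautiful Soup has to be installed.

```python
from bs4 import BeautifulSoup

# Step 1 stand-in: hypothetical HTML, as if it had just been downloaded.
html = """
<html><body>
  <h1>Site title</h1>
  <h3>First headline</h3>
  <p>Some text</p>
  <h3>Second headline</h3>
</body></html>
"""

# Step 2: parse the HTML and look for the desired H3 tags.
soup = BeautifulSoup(html, "html.parser")
h3_tags = soup.find_all("h3")

# Step 3: copy the text inside each target tag into the chosen
# output format (a plain Python list in this sketch).
headlines = [tag.get_text(strip=True) for tag in h3_tags]
print(headlines)
```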
It is possible to carry out web scraping tasks in many programming languages with different libraries, but using Python with the Beautiful Soup library is one of the most popular and effective methods. In the following sections, we will cover the basics for scraping in Python using Beautiful Soup.
What is Beautiful Soup?
Beautiful Soup provides simple methods for navigating, searching, and modifying the parse tree of HTML and XML files. It transforms a complex HTML document into a tree of Python objects. It also automatically converts the document to Unicode, so you don’t have to think about encodings. The tool helps you not only scrape data but also clean it. Beautiful Soup supports the HTML parser included in Python’s standard library, and it also supports several third-party Python parsers such as lxml and html5lib.
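To make "navigating, searching, and modifying" concrete, here is a small sketch on a made-up HTML snippet. The markup, ids, and class names are assumptions for illustration only, and the built-in html.parser is used to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to show the three kinds of operation.
html = (
    "<html><body><div id='main'>"
    "<p class='intro'>Hello <b>world</b></p><p>Bye</p>"
    "</div></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Navigating: walk down the tree by tag name.
first_p = soup.body.div.p

# Searching: find one element, or all matching elements.
intro = soup.find("p", class_="intro")
all_paragraphs = soup.find_all("p")

# Modifying: change text and attributes in place.
first_p.b.string = "reader"
first_p["class"] = ["greeting"]

intro_text = first_p.get_text()
print(intro_text)
```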
You can learn more about the full spectrum of its capabilities here: Beautiful Soup documentation.
Installing Requests and Beautiful Soup
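To install Beautiful Soup, you need pip or another Python package installer. You can also install it from JupyterLab. In this post, we will use pip, as it is the most convenient. Open your terminal and run pip install requests beautifulsoup4 html5lib (in a JupyterLab cell, run %pip install requests beautifulsoup4 html5lib instead).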
Easy steps for scraping in Python using Requests and Beautiful Soup
Step 1: You need to send an HTTP request to the server of the page you want to scrape. The server responds by sending the HTML content of the web page. Since we are using Python for our requests, we need a third-party HTTP library, and we will use Requests.
Start by importing the Requests library and making a simple GET request to the URL. We chose https://www.brainyquote.com/topics/motivational-quotes because it has a straightforward HTML structure and will allow us to demonstrate the potential of Beautiful Soup easily. If you get <Response [200]>, the site responded with an HTTP 200 OK response code and sent us the HTML content of the page.
Let’s make the same request again, but this time we’ll store the result in a variable called r and print its content. The output should be the entire HTML code for this page. As you can see, it’s unstructured, and Beautiful Soup will help us clean it up and get the data that we need.
Step 2: Now that we have the HTML content, we need to parse it. For that, we will use Beautiful Soup with the html5lib parser. We need to pass two values into BeautifulSoup():
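A minimal sketch of this request might look like the following. The helper name fetch_page and the 10-second timeout are assumptions added for illustration. The network call is left commented out because it needs internet access, and the site may answer with a status other than 200.

```python
import requests

URL = "https://www.brainyquote.com/topics/motivational-quotes"


def fetch_page(url, timeout=10):
    """Send a GET request and return the Response object.

    raise_for_status() raises requests.HTTPError for 4xx/5xx replies,
    so a returned response can be treated as a successful one.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response


# Usage (requires network access):
# r = fetch_page(URL)
# print(r)              # a successful request prints as <Response [200]>
# print(r.status_code)  # the numeric HTTP status code
```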
#1: HTML string from the website; ‘r.content’
#2: What HTML parser to use; ‘html5lib’
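As a rough sketch of this call, the snippet below passes both values to BeautifulSoup(). The bytes object is a made-up stand-in for r.content so the example runs without a network request, and the fallback to the standard-library parser is an addition for machines where html5lib is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Stand-in for r.content: the raw bytes of a response body (hypothetical page).
content = b"<html><head><title>Quotes</title></head><body><p>Hi</p></body></html>"

try:
    # #1: the HTML to parse, #2: the name of the parser to use.
    soup = BeautifulSoup(content, "html5lib")
except FeatureNotFound:
    # html5lib is not installed, so fall back to the built-in parser.
    soup = BeautifulSoup(content, "html.parser")

page_title = soup.title.string
print(page_title)
```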
Step 3: At this point, go to the site you are scraping. Open DevTools (F12) and go to the Elements tab. We are looking for the top table layer, that is, the outermost element that wraps all of the data we want.
Let’s print the table to get a better idea of what we have so far, and let’s use .prettify()
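Here is a hedged sketch of selecting a container and pretty-printing it. The HTML and the id quotes-container are invented for illustration, so use whatever id or class you actually see in the Elements tab.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the id "quotes-container" is an assumption, not the real site's markup.
html = """
<html><body>
<div id="header">Menu</div>
<div id="quotes-container"><div class="row"><img alt="Stay hungry - Jane Doe"></div></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Select only the outer element that wraps the data we care about.
table = soup.find("div", attrs={"id": "quotes-container"})

# prettify() returns the element as an indented string, one tag per line.
pretty = table.prettify()
print(pretty)
```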
Your output should look something like this:
Now we look for the data that we need. For this example, all we want is the quoted text and the author’s name. As you can see, all of this data is at
You should now only have the
We are looking for the information under the “img alt” key, so let’s create a quote variable and assign this data to it.
As you can see, I wrapped it in a ‘try’ statement. That way, if one of the rows does not have the data you are looking for, you will not get an error, and the loop will continue. I also split the results at ‘-’. As you saw earlier, the text and the author name are separated by a ‘-’, so we can use it to split the two.
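The loop described above could be sketched like this. The markup, the row class, and the sample quotes are made up for illustration. One deliberate difference from the description: the sketch splits on the last " - " (with spaces) rather than on every ‘-’, so a hyphenated word inside a quote does not break the split.

```python
from bs4 import BeautifulSoup

# Hypothetical rows; the second one has no <img>, to exercise the try block.
html = """
<div id="quotes-container">
  <div class="row"><img alt="Dream big - Jane Doe"></div>
  <div class="row"><p>No image in this row</p></div>
  <div class="row"><img alt="Keep going - John Roe"></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", attrs={"id": "quotes-container"})

quotes = []
for row in table.find_all("div", attrs={"class": "row"}):
    try:
        # row.img is None when the row has no <img> (TypeError on indexing);
        # a missing alt attribute raises KeyError; an alt text without
        # the separator raises ValueError when unpacking.
        text, author = row.img["alt"].rsplit(" - ", 1)
    except (TypeError, KeyError, ValueError):
        continue  # skip rows without usable data and keep looping
    quotes.append({"quote": text.strip(), "author": author.strip()})

print(quotes)
```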
That’s it, you are done. Here is what your quote variable should look like now:
At the end of this process, you can save your data in a file, and your code should look something like this:
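One possible way to save the results is a CSV file, sketched below. The sample rows, the save_quotes helper, and the output file name are assumptions for illustration. In a real run, the list would come from the scraping loop.

```python
import csv
import os
import tempfile

# Sample data in the shape produced by a scraping loop (hypothetical values).
quotes = [
    {"quote": "Dream big", "author": "Jane Doe"},
    {"quote": "Keep going", "author": "John Roe"},
]


def save_quotes(rows, path):
    """Write a list of {'quote', 'author'} dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["quote", "author"])
        writer.writeheader()
        writer.writerows(rows)


# Written to the system temp directory here; pick any path you like.
out_path = os.path.join(tempfile.gettempdir(), "quotes.csv")
save_quotes(quotes, out_path)
print(f"Saved {len(quotes)} quotes to {out_path}")
```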