Web scraping is an automated technique for extracting and collecting large amounts of data from websites, using different tools or programs. It’s commonly used to extract HTML tables, which contain data organized in columns and rows. Once collected, this data can be analyzed or used for research. For a more detailed guide, check out this article on HTML web scraping.
This tutorial will teach you how to scrape HTML tables from websites using Python.
Prerequisites
Before you begin this tutorial, you need to install Python version 3.8 or newer and create a virtual environment. If you’re new to web scraping with Python, this article is a helpful starting point.
After you’ve created the environment, install the following Python packages:
- Requests lets you send HTTP requests to interact with web services or APIs, retrieving data from or sending data to web servers.
- Beautiful Soup parses HTML documents and extracts specific information from the web page. It provides tools to navigate, search, and scrape data from web pages in a structured way.
- pandas analyzes, cleans, and organizes scraped data from an HTML table or any other HTML element, and saves it to files like CSV or XLSX documents.
You can install the packages with the following command:
pip install requests beautifulsoup4 pandas
Understanding the Web Page Structure
In this tutorial, you’ll scrape data from the Worldometer website. This web page contains up-to-date data on countries around the world, including their respective population figures for 2024:
To locate the HTML table structure, right-click the table (shown in the preceding screenshot) and select Inspect. This action opens the Developer Tools panel, which displays the HTML code of the page, with the selected element highlighted:
The <table> tag with the ID example2 defines the beginning of the table structure. This table's headers use <th> tags, and its rows are defined by <tr> tags, with each <tr> representing a new horizontal row in the table. Inside each <tr>, the <td> tag creates the individual cells that hold the data displayed in the row.
Note: Before doing any scraping, it’s important that you review and abide by the website’s privacy policy and terms of service to ensure you follow all restrictions on data usage and automated access.
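As a programmatic complement to that review, you can check the site's robots.txt with Python's standard library. Here's a minimal sketch; the URLs mirror the ones used later in this tutorial:

# Check robots.txt before scraping (optional, standard library only)
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://www.worldometers.info/robots.txt')
rp.read()
# can_fetch() returns True if the given user agent may access the URL
print(rp.can_fetch('*', 'https://www.worldometers.info/world-population/population-by-country/'))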
Send an HTTP Request to Access the Web Page
To send an HTTP request and access the web page, create a Python file (e.g., html_table_scraper.py) and import the requests, BeautifulSoup, and pandas packages:
# import packages
import requests
from bs4 import BeautifulSoup
import pandas as pd
Then, define the URL of the web page you want to scrape, in this case https://www.worldometers.info/world-population/population-by-country/:
# Send a request to the website to get the page content
url = 'https://www.worldometers.info/world-population/population-by-country/'
Next, send a GET request to the web page using the get() method from Requests and check whether the response is successful:
# Get the content of the URL
response = requests.get(url)
# Check the status of the response.
if response.status_code == 200:
    print("Request was successful!")
else:
    print(f"Error: {response.status_code} - {response.text}")
This code sends a GET request to the specified URL and then checks the status of the response. A 200 status code indicates the request was successful.
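If you prefer exceptions over manual status checks, Requests also provides raise_for_status(). Here's an alternative sketch (the 10-second timeout is an arbitrary choice, not something the main script requires):

# Alternative: raise an exception on HTTP errors instead of checking manually
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")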
Use the following command to run the Python script in your terminal:
python html_table_scraper.py
Your output should look like this:
Request was successful!
Since the GET request is successful, you now have the HTML content of the entire web page, including the HTML table.
Parse the HTML Using Beautiful Soup
Beautiful Soup can handle poorly formatted or broken HTML content, which is common when scraping web pages. Here, you use the Beautiful Soup package to do the following:
- Parse the HTML content from the web page to find the table that presents population data.
- Collect the table headers.
- Collect all data presented in the table’s rows.
To parse the content you collected, create a Beautiful Soup object:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
Next, locate the table element in the HTML with the id attribute "example2". This table contains the population of countries in 2024:
# Find the table containing population data
table = soup.find('table', attrs={'id': 'example2'})
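Keep in mind that find() returns None when no element matches, so if the site ever changes its structure, later calls on table would fail with an AttributeError. A small optional guard:

# Optional guard: stop early if the table isn't found
if table is None:
    raise SystemExit("Could not find the table with id 'example2'; the page structure may have changed.")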
Collect Table Headers
The table's header is located within the <thead> and <th> HTML tags. Use the find() method from the Beautiful Soup package to extract the <thead> tag and the find_all() method to collect all the headers:
# Collect the headers from the table
headers = []
# Locate the header row within the <thead> tag
header_row = table.find('thead').find_all('th')
for th in header_row:
    # Add header text to the headers list
    headers.append(th.text.strip())
This code creates an empty Python list called headers, locates the <thead> HTML tag to find all headers within <th> HTML tags, and then appends each collected header to the headers list.
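If you prefer a more compact idiom, the same header collection fits in a single list comprehension:

# Equivalent one-liner for collecting the headers
headers = [th.text.strip() for th in table.find('thead').find_all('th')]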
Collect Table Row Data
To collect the data in each row, create an empty Python list called data to store the scraped data:
# Initialize an empty list to store our data
data = []
Then, extract the data from each row in the table using the find_all() method and append it to the data list:
# Loop through each row in the table (skipping the header row)
for tr in table.find_all('tr')[1:]:
    # Create a list of the current row's data
    row = []
    # Find all data cells in the current row
    for td in tr.find_all('td'):
        # Get the text content of the cell and remove extra spaces
        cell_data = td.text.strip()
        # Add the cleaned cell data to the row list
        row.append(cell_data)
    # After getting all cells for this row, add the row to our data list
    data.append(row)
# Convert the collected data into a pandas DataFrame for easier handling
df = pd.DataFrame(data, columns=headers)
# Print the DataFrame to see the number of rows and columns
print(df.shape)
This code iterates through all <tr> HTML tags found within the table, starting from the second row (skipping the header row). For each row (<tr>), an empty list called row is created to store the data from that row's cells. Inside the row, the code finds all <td> HTML tags using the find_all() method, each representing an individual data cell in the row.

For each <td> HTML tag, the code extracts the text content using the .text attribute and applies the .strip() method to remove any leading or trailing whitespace. The cleaned cell data is appended to the row list. After processing all the cells in the current row, the entire row is appended to the data list. Finally, you convert the collected data to a pandas DataFrame with the column names defined by the headers list and print the shape of the data.
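One optional hardening step: if the table ever includes footer or summary rows, a row with the wrong number of cells would break the DataFrame construction. You could filter such rows out first (clean_data is a name introduced here for illustration):

# Keep only rows whose cell count matches the header count
clean_data = [row for row in data if len(row) == len(headers)]
df = pd.DataFrame(clean_data, columns=headers)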
The full Python script should look like this:
# Import packages
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send a request to the website to get the page content
url = 'https://www.worldometers.info/world-population/population-by-country/'

# Get the content of the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using Beautiful Soup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the table containing population data by its ID
    table = soup.find('table', attrs={'id': 'example2'})

    # Collect the headers from the table
    headers = []

    # Locate the header row within the <thead> HTML tag
    header_row = table.find('thead').find_all('th')
    for th in header_row:
        # Add header text to the headers list
        headers.append(th.text.strip())

    # Initialize an empty list to store our data
    data = []

    # Loop through each row in the table (skipping the header row)
    for tr in table.find_all('tr')[1:]:
        # Create a list of the current row's data
        row = []
        # Find all data cells in the current row
        for td in tr.find_all('td'):
            # Get the text content of the cell and remove extra spaces
            cell_data = td.text.strip()
            # Add the cleaned cell data to the row list
            row.append(cell_data)
        # After getting all cells for this row, add the row to our data list
        data.append(row)

    # Convert the collected data into a pandas DataFrame for easier handling
    df = pd.DataFrame(data, columns=headers)

    # Print the DataFrame's shape to see the number of rows and columns
    print(df.shape)
else:
    print(f"Error: {response.status_code} - {response.text}")
Use the following command to run the Python script in your terminal:
python html_table_scraper.py
Your output should look like this:
(234, 12)
At this point, you’ve successfully extracted 234 rows and 12 columns from the HTML table.
Next, use the head() method from pandas with print() to view the first ten rows of the extracted data:
print(df.head(10))
Clean and Structure the Data
When scraping data from an HTML table, it’s important to clean the data to ensure consistency, accuracy, and proper usability for analysis. Raw data extracted from an HTML table may contain various issues, such as missing values, formatting inconsistencies, unwanted characters, or incorrect data types. These issues can lead to inaccurate analyses and unreliable results. Proper cleaning helps standardize the data set and ensures it aligns with the intended structure for analysis.
In this section, the following data-cleaning tasks are performed:
- Rename column names
- Replace missing values in the row data
- Remove the percentage sign (%) and convert data types to the correct format
- Remove commas and convert data types to the correct format
- Change data types for numerical columns
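Before making any changes, it helps to inspect what pandas inferred about the raw data; a quick way is the info() method, which prints the column names, non-null counts, and data types:

# Inspect the raw DataFrame before cleaning
df.info()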
Rename Column Names
pandas has a method called rename() that changes the name of a specific column. This method is useful when column names are not descriptive or when you want to make them easier to work with.

To rename a specific column, you pass a dictionary to the columns parameter, where the keys are the current column names and the values are the new names you want to assign. Apply this method to change the following column names:
- # to Rank
- Yearly Change to Yearly Change %
- World Share to World Share %
# Rename columns
df.rename(columns={'#': 'Rank'}, inplace=True)
df.rename(columns={'Yearly Change': 'Yearly Change %'}, inplace=True)
df.rename(columns={'World Share': 'World Share %'}, inplace=True)
# Show the first 5 rows
print(df.head())
Your columns should now look like this:
Replace Missing Values
Missing values in the data can affect calculations, such as averages or sums, leading to inaccurate results and incorrect insights. You need to remove, replace, or fill them with particular values before doing any calculation or analysis on the data set.
The Urban Pop % column currently contains missing values labeled N.A. Replace N.A. with 0% using the replace() method from pandas like this:
# Replace 'N.A.' with '0%' in the 'Urban Pop %' column
df['Urban Pop %'] = df['Urban Pop %'].replace('N.A.', '0%')
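If you want to see how many cells are affected, you can count the N.A. entries before running the replacement:

# Optional: count 'N.A.' values (run this before the replacement above)
print((df['Urban Pop %'] == 'N.A.').sum())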
Remove Percentage Signs and Convert Data Types
The Yearly Change %, Urban Pop %, and World Share % columns contain numerical values followed by a percentage sign (e.g., 37.0%). This prevents you from performing mathematical operations, like calculating the average, maximum, or standard deviation, for analysis.

To fix this, you can apply the replace() method to remove the % sign and then apply the astype() method to convert the values to a float data type for analysis:
# Remove the '%' sign and convert to float
df['Yearly Change %'] = df['Yearly Change %'].replace('%', '', regex=True).astype(float)
df['Urban Pop %'] = df['Urban Pop %'].replace('%', '', regex=True).astype(float)
df['World Share %'] = df['World Share %'].replace('%', '', regex=True).astype(float)
# Show the first 5 rows
print(df.head())
This code removes the % sign from the values in the Yearly Change %, Urban Pop %, and World Share % columns using the replace() method with a regular expression. Then, it converts the cleaned values to a float data type using astype(float). Finally, it prints the first five rows of the DataFrame with df.head().
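A more defensive variant, if you're not sure every value is well formed, is pandas' to_numeric() function with errors='coerce', which turns unparseable values into NaN instead of raising an error. A sketch for a single column:

# Defensive alternative: unparseable values become NaN rather than raising
df['Urban Pop %'] = pd.to_numeric(df['Urban Pop %'].str.replace('%', ''), errors='coerce')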
Your output should look like this:
Remove Commas and Convert Data Types
Currently, the Population (2024), Net Change, Density (P/Km²), Land Area (Km²), and Migrants (net) columns contain numerical values with commas (e.g., 1,949,236). This makes it impossible to perform mathematical operations for analysis.

To fix this, you can apply the replace() and astype() methods to remove the commas and convert the numbers to an integer data type:
# Remove commas and convert to integers
columns_to_convert = [
'Population (2024)', 'Net Change', 'Density (P/Km²)', 'Land Area (Km²)',
'Migrants (net)'
]
for column in columns_to_convert:
    # Ensure the column is treated as a string first
    df[column] = df[column].astype(str)
    # Remove commas
    df[column] = df[column].str.replace(',', '')
    # Convert to integers
    df[column] = df[column].astype(int)
This code defines a list, columns_to_convert, containing the names of the columns that need processing. For each column in the list, it ensures that the column values are treated as strings using astype(str). It then removes any commas from the values using str.replace(',', ''), and it converts the cleaned values to integers with astype(int), making them suitable for mathematical operations.
Change Data Types for Numerical Columns
The Rank, Med. Age, and Fert. Rate columns are stored as an object data type but contain numerical values. Convert the data in these columns to integer or float data types to enable mathematical operations:
# Convert columns to integer or float data types
df['Rank'] = df['Rank'].astype(int)
df['Med. Age'] = df['Med. Age'].astype(int)
df['Fert. Rate'] = df['Fert. Rate'].astype(float)
This code converts the values in the Rank and Med. Age columns to an integer data type and the values in Fert. Rate to a float data type.
Finally, view the cleaned data using the head() method to confirm it looks as expected:
print(df.head(10))
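To confirm the conversions explicitly, you can also print the column data types:

# Verify the data types of each column
print(df.dtypes)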
Your output should look like this:
With the data now cleaned, you can begin applying different mathematical operations, like average and mode, as well as analytical methods, like correlation, to examine the data.
Export Cleaned Data to CSV
After cleaning your data, it’s important to save it for future use and analysis. You can export the cleaned data into a CSV file, which makes it easy to share with others or to process and analyze further with other tools and software.
The to_csv() method in pandas lets you export the data from a DataFrame into a CSV file, here named world_population_by_country.csv:
# Save the data to a file
filename = 'world_population_by_country.csv'
df.to_csv(filename, index=False)
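Since the introduction mentioned XLSX as another output format, the equivalent Excel export uses the to_excel() method. Note that this assumes you've installed the openpyxl package, which pandas uses as its engine for .xlsx files:

# Alternative: save the data to an Excel workbook instead
df.to_excel('world_population_by_country.xlsx', index=False)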
Conclusion
The Beautiful Soup Python package makes it possible for you to parse HTML documents and extract data from an HTML table. In this article, you learned how to scrape, clean, and export data into a CSV file.
Although this tutorial was straightforward, extracting data from complex websites can be difficult and time-consuming. For instance, working with paginated HTML tables or nested structures where data is embedded within parent and child elements requires careful analysis to understand the layout. Moreover, website structures may change over time, necessitating ongoing maintenance of your code and infrastructure.
To save you time and make things easier, consider using the Bright Data Web Scraper API. This powerful tool offers a prebuilt scraping solution, allowing you to extract data from complex websites with minimal technical knowledge. The API automates data collection, handling challenges like dynamic content, JavaScript-rendered pages, and CAPTCHA verification.
Sign up and start your free Web Scraper API trial!
No credit card required