In this guide, you will learn:
- What Jupyter Notebooks are
- Why you should use Jupyter Notebooks for web scraping
- How to use them in a step-by-step tutorial
- Use cases of Jupyter Notebooks for scraping online data
Let’s dive in!
What Are Jupyter Notebooks?
In the context of Jupyter, a notebook is “a shareable document that combines computer code, plain language descriptions, data, charts, graphs and figures, and interactive controls.”
Notebooks provide interactive environments for prototyping and explaining code, exploring and visualizing data, and sharing ideas. In particular, notebooks produced by the Jupyter Notebook App are called Jupyter Notebooks.
The Jupyter Notebook App is a server-client application that allows editing and running notebook documents via a web browser. It can be executed on a local desktop or can be installed on a remote server.
Jupyter Notebooks provide a so-called “kernel,” which is a “computational engine” that executes the code contained in a notebook document. Specifically, the ipython kernel executes Python code (but kernels for other languages exist):
The Jupyter Notebook App features a dashboard that supports typical operations like showing local files, opening existing notebook documents, managing documents’ kernels, and more:
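For example, once Jupyter is installed, you can check which kernels are available on your machine with this command (an optional check, not a required step of this tutorial):
jupyter kernelspec list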
Why Use Jupyter Notebooks for Web Scraping?
Jupyter Notebooks are specifically designed for data analysis and R&D, and they are also useful for web scraping thanks to their:
- Interactive development: Write and execute code in small, manageable chunks called cells. Each cell can work independently from the others, which simplifies testing and debugging.
- Organization: Use Markdown in cells to document the code, explain the logic, and add notes or instructions.
- Integration with data analysis tools: After scraping, you can immediately clean, process, and analyze the data in Python, as Jupyter Notebooks integrate with libraries such as pandas, matplotlib, seaborn, and more.
- Reproducibility and sharing: Jupyter Notebooks can be easily shared with others as .ipynb files (their standard format) or converted to other formats like ReST, Markdown, and more (for example, with the nbconvert command shown below).
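For example, assuming your notebook is called analysis.ipynb, you could convert it to Markdown with Jupyter’s built-in nbconvert tool:
jupyter nbconvert --to markdown analysis.ipynb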
Pros and Cons
These are the pros and cons of using Jupyter Notebooks for data scraping:
👍Pros:
- Step-by-step debugging: Since each cell can run independently, you can split your data extraction code across different cells and run them one at a time. This lets you debug small chunks of code and catch bugs at the cell level.
- Documentation: Use Markdown in cells to create files where you can document how your scraping code works, as well as describe the logic behind the choices you made.
- Flexibility: In Jupyter Notebooks, you can combine web scraping, data cleaning, and analysis in a single environment. This eliminates the need to switch between different environments, such as writing the scraping script in an IDE and analyzing the data elsewhere.
👎Cons:
- Not ideal for large-scale projects: Jupyter Notebooks tend to become long documents. That makes them not the ideal choice for large-scale data scraping projects.
- Performance limitations: Notebooks tend to become slow or unresponsive when working with large datasets or running long scripts. Find out more on how to make web scraping faster.
- Not ideal for automation: If you need to run your scraper on a schedule or deploy it as part of a larger system, Jupyter Notebooks are not the best choice, as they are primarily designed for interactive, manual execution of cells (see the workaround sketched below).
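If you do still need to run a notebook outside the interactive UI, one common workaround is to execute it headlessly or export it to a plain Python script with nbconvert, shown here with the analysis.ipynb file used later in this tutorial:
jupyter nbconvert --to notebook --execute analysis.ipynb
jupyter nbconvert --to script analysis.ipynb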
How to Use Jupyter Notebooks for Web Scraping: Step-By-Step Tutorial
Now that you know why you should use Jupyter Notebooks for web scraping, you are ready to see how to use them in a real-world scraping scenario!
Prerequisites
To replicate this tutorial, your system must match the following prerequisites:
- Python 3.6 or higher: Any Python version higher than 3.6 will do. Specifically, we will install the dependencies via pip, which is already installed with any Python version greater than 3.4.
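You can check both from the terminal:
python --version
pip --version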
Step 1: Setting Up the Environment and Installing Dependencies
Suppose you call the main folder of your project scraper/. At the end of this step, the folder will have the following structure:
scraper/
├── analysis.ipynb
└── venv/
Where:
- analysis.ipynb is the Jupyter Notebook that contains all the code.
- venv/ contains the virtual environment.
You can create the venv/ virtual environment directory like so:
python -m venv venv
To activate it, on Windows, run:
venv\Scripts\activate
Equivalently, on macOS/Linux, execute:
source venv/bin/activate
In the activated virtual environment, install all the needed libraries for this tutorial:
pip install requests beautifulsoup4 pandas jupyter seaborn
These libraries serve the following purposes:
- requests: To perform HTTP requests.
- beautifulsoup4: For parsing HTML and XML documents.
- pandas: A powerful data manipulation and analysis library, ideal for working with structured data like CSV files or tables.
- jupyter: A web-based interactive development environment for running and sharing Python code, great for analysis and visualization.
- seaborn: A Python data visualization library based on Matplotlib.
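As a quick sanity check, you can later run this snippet in a notebook cell to confirm that the libraries were installed correctly (the printed versions will differ on your machine):
# Print the installed versions of the scraping and analysis libraries
import requests, bs4, pandas, seaborn
print(requests.__version__, bs4.__version__, pandas.__version__, seaborn.__version__)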
To create the analysis.ipynb file, you first need to enter the scraper/ folder:
cd scraper
Then, initialize a new Jupyter Notebook with this command:
jupyter notebook
You can now access the Jupyter Notebook App in your browser at http://localhost:8888.
Create a new file by clicking on the “New > Python 3” option:
The new file will automatically be named Untitled.ipynb. You can rename it in the dashboard:
Great! You are now fully set up for web scraping with Jupyter Notebooks.
Step 2: Define the Target Page
In this tutorial, you will scrape data from the Worldometer website. In particular, the target page reports yearly CO2 emissions in the United States and provides tabular data like so:
Step 3: Retrieve the Data
You can retrieve the data from the target page and save them into a CSV file like so:
import requests
from bs4 import BeautifulSoup
import csv
# URL of the website
url = "https://www.worldometers.info/co2-emissions/us-co2-emissions/"
# Send a GET request to the website
response = requests.get(url)
response.raise_for_status()
# Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")
# Locate the table
table = soup.find("table")
# Extract table headers
headers = [header.text.strip() for header in table.find_all("th")]
# Extract table rows
rows = []
for row in table.find_all("tr")[1:]:  # Skip the header row
    cells = row.find_all("td")
    row_data = [cell.text.strip() for cell in cells]
    rows.append(row_data)
# Save the data to a CSV file
csv_file = "emissions.csv"
with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(headers)  # Write headers
    writer.writerows(rows)  # Write rows
print(f"Data has been saved to {csv_file}")
Here is what this code does:
- It uses the requests library to send a GET request to the target page via the requests.get() method and checks for request errors via the response.raise_for_status() method.
- It uses BeautifulSoup to parse the HTML content by instantiating the BeautifulSoup() class and locates the table containing the data with the soup.find() method. If you are not familiar with this syntax, read our guide on BeautifulSoup web scraping.
- It uses a list comprehension to extract the table’s headers.
- It uses a for loop to retrieve all the data from the table while skipping the header row.
- Finally, it opens a new CSV file and writes all the retrieved data to it.
You can paste this code into a cell and run it by pressing SHIFT+ENTER.
Another way to run the cell is to select it and press the “Run” button in the dashboard:
Amazing! The “Data has been saved to emissions.csv” message notifies you that the data extraction succeeded.
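Note that some sites block the default requests User-Agent or change their HTML structure over time. If that happens, a slightly more defensive variant of the request and table lookup might look like this (the User-Agent string is just an example value):
import requests
from bs4 import BeautifulSoup
# Send a browser-like User-Agent (example value) and a timeout with the GET request
url = "https://www.worldometers.info/co2-emissions/us-co2-emissions/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
# Parse the HTML and fail early with a clear message if the table is missing
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table")
if table is None:
    raise RuntimeError("No <table> found: the page structure may have changed")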
Step 4: Ensure Data Is Correct
Now that you have saved the data to a CSV file, open it to verify that everything went well, as you may sometimes face conversion issues. To do so, type the following code into a new cell:
import pandas as pd
# Load the CSV file into a pandas DataFrame
csv_file = "emissions.csv"
df = pd.read_csv(csv_file)
# Print the DataFrame
df.head()
This code does the following:
- Loads the CSV file into a pandas DataFrame with the pd.read_csv() method.
- Prints the first five rows of the DataFrame with the df.head() method.
Here is the expected result:
Fantastic! All that remains is to visualize the extracted data.
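Before plotting, you can also run a quick structural check on the DataFrame, a minimal sketch using standard pandas methods:
# Show column types and non-null counts, then count missing values per column
df.info()
print(df.isnull().sum())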
Step 5: Visualize the Data
Now you can perform any data analysis you prefer. For example, you can use seaborn to create a line chart that shows the trend of CO2 emissions over the years. Do it as follows:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the CSV file into a pandas DataFrame
csv_file = "emissions.csv"
df = pd.read_csv(csv_file)
# Clean column names by stripping and collapsing extra whitespace
df.columns = df.columns.str.strip().str.replace(r'\s+', ' ', regex=True)
# Convert 'Fossil CO2 Emissions (tons)' to numeric
df['Fossil CO2 Emissions (tons)'] = df['Fossil CO2 Emissions (tons)'].str.replace(',', '').astype(float)
# Ensure the 'Year' column is numeric
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')
df = df.sort_values(by='Year')
# Create the line plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=df, x='Year', y='Fossil CO2 Emissions (tons)', marker='o')
# Add labels and title
plt.title('Trend of Fossil CO2 Emissions Over the Years', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Fossil CO2 Emissions (tons)', fontsize=12)
plt.grid(True)
plt.show()
Here is what this code does:
- It uses pandas to:
  - Load the CSV file.
  - Clean the column names by removing extra whitespace with df.columns.str.strip().str.replace(r'\s+', ' ', regex=True) (if you skip this step, you will get errors later in this example).
  - Access the “Fossil CO2 Emissions (tons)” column and convert its values to numbers with df['Fossil CO2 Emissions (tons)'].str.replace(',', '').astype(float).
  - Access the “Year” column, convert its values to numbers with pd.to_numeric(), and sort them in ascending order with df.sort_values().
- It uses the matplotlib and seaborn libraries (seaborn is built on top of matplotlib, so matplotlib is installed automatically when you install seaborn) to create the actual plot.
Here is the expected result:
Wow! This is how powerful Jupyter Notebook scraping is.
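If you also want to keep a copy of the chart outside the notebook, you can save it to an image file before calling plt.show() (the file name here is just an example):
# Save the chart as a PNG file next to the notebook
plt.savefig("emissions_trend.png", dpi=300, bbox_inches="tight")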
Step 6: Put It All Together
This is what the final Jupyter Notebook web scraping document looks like:
Note the presence of different blocks, each with its output.
Use Cases of Jupyter Notebook Web Scraping
Time to explore some use cases for Jupyter Notebooks in web scraping!
Tutorials
Do not forget that each cell in a Jupyter Notebook can be executed independently. Combined with Markdown support, this makes Jupyter Notebooks an excellent tool for creating step-by-step tutorials.
For example, you can alternate between cells containing code and those explaining the logic and reasoning behind it. In the case of web scraping, Jupyter Notebooks are particularly useful. They can be used to create tutorials for junior developers, guiding them through each step of the process.
Science and Research (R&D)
Due to their interactive nature and the ability to be easily exported for collaboration, Jupyter Notebooks are ideal for research and R&D purposes. This is especially true for web scraping. For instance, when scraping websites that require multiple rounds of trial and error, you can keep all your tests in a single Notebook and use Markdown to highlight the tests that succeed.
Data Exploration
The Jupyter library has been specifically designed for data exploration and analysis. That also makes it a perfect tool for web scraping for machine learning.
This use case directly applies to the example you coded above. You retrieved the data from the website and immediately analyzed it, all within the same coding environment.
Conclusion
In this post, you learned how Jupyter Notebooks can be a powerful tool for web scraping, offering an interactive and flexible environment for data extraction and analysis. However, when it comes to scaling your web scraping operations or automating tasks, Jupyter Notebooks may not be the most efficient solution.
That’s where our Web Scrapers come in. Whether you’re a developer looking for API-based solutions or someone seeking a no-code option, our Web Scrapers are designed to simplify and enhance your data collection efforts. With features like dedicated endpoints for 100+ domains, bulk request handling, automatic IP rotation, and CAPTCHA solving, you can extract structured data effortlessly and at scale. Create a free Bright Data account today to try out our scraping solutions and test our proxies!
No credit card required