How to Scrape Wikipedia With Python

Step-by-step guide to scraping Wikipedia using Python and essential tools.

Wikipedia is an extensive and comprehensive source of information, containing millions of articles covering nearly every topic. For researchers, data scientists, and developers, this data opens up countless opportunities, from building machine learning datasets to conducting academic research. In this article, we’ll walk you through the process of scraping Wikipedia step by step.

Using Bright Data Wikipedia Scraper API

If you’re looking to efficiently extract data from Wikipedia, the Bright Data Wikipedia Scraper API is a great alternative to manual web scraping. This powerful API automates the process, making it much easier to gather large volumes of information.

Key Use Cases:

  • Collect explanations on a wide range of topics
  • Compare information from Wikipedia with other data sources
  • Conduct research using large datasets
  • Scrape images from Wikipedia Commons

You can get your data in formats like JSON, CSV, and .gz, and it supports various delivery options, including Amazon S3, Google Cloud Storage, and Microsoft Azure.

With just one API call, you can access a wealth of data quickly and easily!

How to Scrape Wikipedia Using Python

Follow this step-by-step tutorial to scrape Wikipedia using Python.

1. Setup and Prerequisites

Before you begin, ensure your development environment is properly configured:

  • Install Python: Download and install the latest version of Python from the official Python website.
  • Choose an IDE: Use an IDE like PyCharm, Visual Studio Code, or Jupyter Notebook for your development work.
  • Basic Knowledge: Make sure you’re familiar with CSS selectors and comfortable using browser DevTools to inspect page elements.

If you’re new to Python, read this guide on how to scrape with Python for detailed instructions.

Next, create a new project using Poetry, a dependency management tool that simplifies managing packages and virtual environments in Python.

poetry new wikipedia-scraper

This command will generate the following project structure:

wikipedia-scraper/
├── pyproject.toml
├── README.md
├── wikipedia_scraper/
│   └── __init__.py
└── tests/
    └── __init__.py

Navigate into the project directory and install the necessary dependencies:

cd wikipedia-scraper
poetry add requests beautifulsoup4 pandas lxml

Here’s what each dependency does:

  • beautifulsoup4 parses HTML and XML documents, making it easy to navigate and extract specific elements from web pages.
  • requests handles sending HTTP requests and retrieving the content of web pages.
  • pandas is a powerful tool for manipulating and analyzing the scraped data, particularly useful when working with tables.
  • lxml speeds up the parsing process, enhancing the performance of BeautifulSoup.

Next, activate the virtual environment and open the project folder in your preferred code editor (VS Code in this case):

poetry shell
code .

Open the pyproject.toml file to verify your project’s dependencies. It should look like this:

[tool.poetry.dependencies]
python = "^3.12"
requests = "^2.32.3"
beautifulsoup4 = "^4.12.3"
pandas = "^2.2.3"
lxml = "^5.3.0"

Finally, create a main.py file within the wikipedia_scraper folder where you’ll write your scraping logic. Your updated project structure should now look like this:

wikipedia-scraper/
├── pyproject.toml
├── README.md
├── wikipedia_scraper/
│   ├── __init__.py
│   └── main.py
└── tests/
    └── __init__.py

Your environment is now set up, and you’re ready to start writing the Python code to scrape Wikipedia.

2. Connecting to the Target Wikipedia Page

To begin, connect to the target Wikipedia page. In this example, we’ll scrape Cristiano Ronaldo’s page, shown below.

Cristiano Ronaldo's page on Wikipedia

Here’s a simple code snippet to connect to a Wikipedia page using Python:

import requests  # For making HTTP requests
from bs4 import BeautifulSoup  # For parsing HTML content

def connect_to_wikipedia(url):
    response = requests.get(url)  # Send a GET request to the URL

    # Check if the request was successful
    if response.status_code == 200:
        return BeautifulSoup(response.text, "html.parser")  # Parse and return the HTML
    else:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
        return None  # Return None if the request fails

wikipedia_url = "https://en.wikipedia.org/wiki/Cristiano_Ronaldo"
soup = connect_to_wikipedia(wikipedia_url)  # Get the soup object for the specified URL

In the code, the Python requests library allows you to send an HTTP request to the URL, and with BeautifulSoup, you can parse the HTML content of the page.
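
As a practical refinement, Wikipedia asks automated clients to identify themselves with a descriptive User-Agent header, and adding a request timeout keeps the script from hanging on a stalled connection. Here is a variant of the same helper with both; the User-Agent string is only an illustrative placeholder:

import requests
from bs4 import BeautifulSoup

# Illustrative User-Agent -- replace it with something that describes your project and a contact
HEADERS = {"User-Agent": "wikipedia-scraper-tutorial/0.1 (contact: you@example.com)"}

def connect_to_wikipedia(url):
    response = requests.get(url, headers=HEADERS, timeout=10)  # Identify the client, fail fast on stalls
    if response.status_code == 200:
        return BeautifulSoup(response.text, "html.parser")
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    return None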

3. Inspecting the Page

To scrape data effectively, you need to understand the structure of the webpage’s DOM (Document Object Model). For example, to extract all the links on the page, you can target the <a> tags, as shown below:

Inspecting links on a Wikipedia page

To scrape images, target the <img> tags and extract the src attribute to get the image URLs.

Inspecting images on a Wikipedia page

To extract data from tables, you can target the <table> tag with the class wikitable. This allows you to gather all the rows and columns of the table and extract the required data.

Inspecting tables on a Wikipedia page

To extract paragraphs, simply target the <p> tags that contain the main textual content of the page.

Inspecting paragraphs on a Wikipedia page

That’s it! By targeting these specific elements, you can extract the desired data from any Wikipedia page.
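
Before writing the extraction functions, you can quickly confirm these selectors from Python. The sketch below assumes the soup object created in step 2 and simply counts how many of each element the page contains:

# Quick sanity check of the selectors spotted in DevTools (assumes `soup` is not None)
print("Links:", len(soup.find_all("a", href=True)))
print("Images:", len(soup.find_all("img", src=True)))
print("Tables:", len(soup.find_all("table", {"class": "wikitable"})))
print("Paragraphs:", len(soup.find_all("p")))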

4. Extracting Links

Wikipedia articles contain internal and external links that direct users to related topics, references, or external resources. To extract all the links from a Wikipedia page, you can use the following code:

def extract_links(soup):
    links = []
    for link in soup.find_all("a", href=True):  # Find all anchor tags with href attribute
        url = link["href"]
        if not url.startswith("http"):  # Check if the URL is relative
            url = "<https://en.wikipedia.org>" + url  # Convert relative links to absolute URLs
        links.append(url)
    return links  # Return the list of extracted links

The soup.find_all('a', href=True) function retrieves all <a> tags on the page that contain an href attribute, which includes both internal and external links. The code also ensures relative URLs are properly formatted.

The result might look like:

https://en.wikipedia.org#Early_life
https://en.wikipedia.org#Club_career
https://en.wikipedia.org/wiki/Real_Madrid
https://en.wikipedia.org/wiki/Portugal_national_football_team
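
Note that plain string concatenation treats in-page anchors such as #Early_life and protocol-relative URLs the same way as /wiki/... paths. If you want more robust URL handling, a hedged alternative is to resolve each href against the page URL with urllib.parse.urljoin from the standard library:

from urllib.parse import urljoin

def extract_links(soup, base_url="https://en.wikipedia.org/wiki/Cristiano_Ronaldo"):
    links = []
    for link in soup.find_all("a", href=True):
        # urljoin resolves relative paths, in-page anchors, and protocol-relative URLs correctly
        links.append(urljoin(base_url, link["href"]))
    return links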

5. Extracting Paragraphs

To scrape textual content from a Wikipedia article, you can target the <p> tags, which hold the main body of text. Here’s how to extract paragraphs using BeautifulSoup:

def extract_paragraphs(soup):
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]  # Extract text from paragraph tags
    return [p for p in paragraphs if p and len(p) > 10]  # Return paragraphs longer than 10 characters

This function captures all paragraphs on the page, filtering out any empty or overly short ones to avoid irrelevant content like citations or single words.

An example result:

Cristiano Ronaldo dos Santos Aveiro GOIH ComM (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, most appearances (30), assists (8), goals in the European Championship (14), international goals (133) and international appearances (215). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 900 official senior career goals for club and country, making him the top goalscorer of all time.
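
As a quick check, you can combine the connection helper from step 2 with this function and print just the article’s opening paragraph:

soup = connect_to_wikipedia("https://en.wikipedia.org/wiki/Cristiano_Ronaldo")
if soup:
    paragraphs = extract_paragraphs(soup)
    print(f"Extracted {len(paragraphs)} paragraphs")
    print(paragraphs[0])  # The article's lead paragraph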

6. Extracting Tables

Wikipedia often includes tables with structured data. To extract these tables, use this code:

from io import StringIO
import pandas as pd

def extract_tables(soup):
    tables = []
    for table in soup.find_all("table", {"class": "wikitable"}):  # Find tables with the 'wikitable' class
        table_html = StringIO(str(table))  # Convert table HTML to string
        df = pd.read_html(table_html)[0]  # Read the HTML table into a DataFrame
        tables.append(df)
    return tables  # Return list of DataFrames

This function finds all tables with the class wikitable and uses pandas.read_html() to convert them into DataFrames for further manipulation.

Example result:

Table data that was found
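
Once the tables are loaded as DataFrames, you can inspect them with the usual pandas tools before saving anything:

tables = extract_tables(soup)
print(f"Found {len(tables)} wikitable(s)")
for i, df in enumerate(tables, start=1):
    print(f"Table {i}: {df.shape[0]} rows x {df.shape[1]} columns")
    print(df.head(3))  # Preview the first few rows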

7. Extracting Images

Images are another valuable resource that you can scrape from Wikipedia. The following function captures image URLs from the page:

def extract_images(soup):
    images = []
    for img in soup.find_all("img", src=True):  # Find all image tags with src attribute
        img_url = img["src"]
        if not img_url.startswith("http"):  # Prepend 'https:' for relative URLs
            img_url = "https:" + img_url
        if "static/images" not in img_url:  # Exclude static or non-relevant images
            images.append(img_url)
    return images  # Return the list of image URLs

This function finds all images (<img> tags) on the page, prepends https: to protocol-relative URLs, and filters out non-content images, ensuring only relevant images are extracted.

Example result:

https://upload.wikimedia.org/wikipedia/commons/d/d7/Cristiano_Ronaldo_2018.jpg
https://upload.wikimedia.org/wikipedia/commons/7/76/Cristiano_Ronaldo_Signature.svg
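
If you also want to download the image files rather than only collect their URLs, a minimal sketch along these lines works with the same requests library (the output folder name is an arbitrary choice):

import os
import requests

def download_images(image_urls, folder="wikipedia_images"):
    os.makedirs(folder, exist_ok=True)  # Create the output folder if it doesn't exist
    for url in image_urls:
        filename = url.split("/")[-1] or "image"  # Derive a file name from the URL
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            with open(os.path.join(folder, filename), "wb") as f:
                f.write(response.content)  # Save the raw image bytes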

8. Saving the Scraped Data

Once you’ve extracted the data, the next step is to save it for later use. Let’s save the data into separate files for links, images, paragraphs, and tables.

import json

def store_data(links, images, tables, paragraphs):
    # Save links to a text file
    with open("wikipedia_links.txt", "w", encoding="utf-8") as f:
        for link in links:
            f.write(f"{link}\n")

    # Save images to a JSON file
    with open("wikipedia_images.json", "w", encoding="utf-8") as f:
        json.dump(images, f, indent=4)

    # Save paragraphs to a text file
    with open("wikipedia_paragraphs.txt", "w", encoding="utf-8") as f:
        for para in paragraphs:
            f.write(f"{para}\n\n")

    # Save each table as a separate CSV file
    for i, table in enumerate(tables):
        table.to_csv(f"wikipedia_table_{i+1}.csv", index=False, encoding="utf-8-sig")

The store_data function organizes the scraped data:

  • Links are saved in a text file.
  • Image URLs are saved in a JSON file.
  • Paragraphs are stored in another text file.
  • Tables are saved in CSV files.

This organization makes it easy to access and work with the data later on.
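
To confirm everything was written correctly, you can load a couple of the files back in. The file names below match the ones used in store_data, and the CSV read assumes at least one table was found on the page:

import json
import pandas as pd

with open("wikipedia_images.json", encoding="utf-8") as f:
    image_urls = json.load(f)  # The list of image URLs saved earlier
print(len(image_urls), "image URLs saved")

first_table = pd.read_csv("wikipedia_table_1.csv")  # First table, if any were found
print(first_table.head())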

Check out our guide to learn more about how to parse and serialize data to JSON in Python.

Putting It All Together

Now, let’s combine all the functions to create a complete scraper that extracts and saves data from a Wikipedia page:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from io import StringIO
import json

# Extract all links from the page
def extract_links(soup):
    links = []
    for link in soup.find_all("a", href=True):
        url = link["href"]
        if not url.startswith("http"):
            url = "<https://en.wikipedia.org>" + url
        links.append(url)
    return links

# Extract image URLs from the page
def extract_images(soup):
    images = []
    for img in soup.find_all("img", src=True):
        img_url = img["src"]
        if not img_url.startswith("http"):
            img_url = "https:" + img_url
        if "static/images" not in img_url:  # Exclude unwanted static images
            images.append(img_url)
    return images

# Extract all tables from the page
def extract_tables(soup):
    tables = []
    for table in soup.find_all("table", {"class": "wikitable"}):
        table_html = StringIO(str(table))
        df = pd.read_html(table_html)[0]  # Convert HTML table to DataFrame
        tables.append(df)
    return tables

# Extract paragraphs from the page
def extract_paragraphs(soup):
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return [p for p in paragraphs if p and len(p) > 10]  # Filter out empty or short paragraphs

# Store the extracted data into separate files
def store_data(links, images, tables, paragraphs):
    # Save links to a text file
    with open("wikipedia_links.txt", "w", encoding="utf-8") as f:
        for link in links:
            f.write(f"{link}\\n")
    
    # Save images to a JSON file
    with open("wikipedia_images.json", "w", encoding="utf-8") as f:
        json.dump(images, f, indent=4)
    
    # Save paragraphs to a text file
    with open("wikipedia_paragraphs.txt", "w", encoding="utf-8") as f:
        for para in paragraphs:
            f.write(f"{para}\\n\\n")
    
    # Save each table as a CSV file
    for i, table in enumerate(tables):
        table.to_csv(f"wikipedia_table_{i+1}.csv", index=False, encoding="utf-8-sig")

# Main function to scrape a Wikipedia page and save the extracted data
def scrape_wikipedia(url):
    response = requests.get(url)  # Fetch the page content
    soup = BeautifulSoup(response.text, "html.parser")  # Parse the content with BeautifulSoup

    links = extract_links(soup)
    images = extract_images(soup)
    tables = extract_tables(soup)
    paragraphs = extract_paragraphs(soup)

    # Save all extracted data into files
    store_data(links, images, tables, paragraphs)

# Example usage: scrape Cristiano Ronaldo's Wikipedia page
if __name__ == "__main__":
    scrape_wikipedia("<https://en.wikipedia.org/wiki/Cristiano_Ronaldo>")

When you run the script, several files will be created in your directory:

  • wikipedia_images.json containing all the image URLs.
  • wikipedia_links.txt with all the links from the page.
  • wikipedia_paragraphs.txt holding the extracted paragraphs.
  • CSV files for each table found on the page (e.g., wikipedia_table_1.csv, wikipedia_table_2.csv).

The result might look like:

Final result files

That’s it! You’ve successfully scraped and stored data from Wikipedia into separate files.

Setting Up Bright Data Wikipedia Scraper API

Setting up and using Bright Data Wikipedia Scraper API is straightforward and can be done in just a few minutes. Follow these steps to quickly get started and begin collecting data from Wikipedia with ease.

Step 1: Create a Bright Data Account

Go to the Bright Data website and sign in to your account. If you don’t have one yet, it’s free to get started. Follow these steps:

  1. Go to the Bright Data website.
  2. Click on Start Free Trial and follow the prompts to create your account.
  3. Once you’re in your dashboard, locate the credit card icon in the left sidebar to access the Billing page.
  4. Add a valid payment method to activate your account.

Setting up a Bright Data account

Once your account is successfully activated, navigate to the Web Scraper API section in the dashboard. Here, you can search for any web scraper API you’d like to use. For our purposes, search for Wikipedia.

The Wikipedia articles scraper API

Click on the Wikipedia articles – Collect by URL option. It will allow you to collect Wikipedia articles simply by providing the URLs.

Step 2: Start Setting Up an API Call

Once you’ve clicked, you’ll be directed to a page where you can set up your API call.

Setting an API call

Before proceeding, you need to create an API token to authenticate your API calls. Click on the Create Token button and copy the generated token. Keep this token safe, as you’ll need it later.

Creating an API token

Step 3: Set Parameters and Generate the API Call

Now that you have your token, you’re ready to configure your API call. Provide the URLs of the Wikipedia pages you want to scrape, and on the right side, a cURL command will be generated based on your input.

Trigger of the data collection API

Copy the cURL command, replace API_Token with your actual token, and run it in your terminal. This will generate a snapshot_id, which you’ll use to retrieve the scraped data.

Step 4: Retrieve the Data

Using the snapshot_id you generated, you can now retrieve the data. Simply paste this ID into the Snapshot ID field, and the API will automatically generate a new cURL command on the right side. You can use this command to pull the data. Additionally, you can choose the file format for the data, such as JSON, CSV, or other available options.

Data delivery options

You also have the option to deliver the data to different storage services such as Amazon S3, Google Cloud Storage, or Microsoft Azure Storage.

Delivering the data to different storage services

Step 5: Run the Command

For this example, let’s assume you want the data as a JSON file. Choose JSON as the file format and copy the generated cURL command. To save the response directly to a file on your machine, append -o my_data.json to the end of the command.

Run it in your terminal, and you’ll have all the extracted data in just a few seconds!

curl.exe -H "Authorization: Bearer 50xxx52c-xxxx-xxxx-xxxx-2748xxxxx487" "https://api.brightdata.com/datasets/v3/snapshot/s_mxxg2xxxxx2g3nq?format=json" -o my_data.json
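
If you prefer to stay in Python, the same snapshot download can be done with requests. The endpoint and parameters below mirror the generated cURL command; the token and snapshot ID are placeholders you must replace with your own values:

import requests

API_TOKEN = "YOUR_API_TOKEN"      # Placeholder: the token created in step 2
SNAPSHOT_ID = "YOUR_SNAPSHOT_ID"  # Placeholder: returned when you triggered the collection

response = requests.get(
    f"https://api.brightdata.com/datasets/v3/snapshot/{SNAPSHOT_ID}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={"format": "json"},
)

with open("my_data.json", "w", encoding="utf-8") as f:
    f.write(response.text)  # Save the raw JSON response, just like curl's -o flag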

Don’t want to handle Wikipedia web scraping yourself but still need the data? Consider purchasing a Wikipedia dataset instead.

Wikipedia dataset search in Bright Data's control panel

Yes, it’s that simple!

Conclusion

This article covered everything you need to get started with scraping Wikipedia using Python. We’ve successfully extracted a variety of data, including image URLs, text content, tables, and internal and external links. However, for faster and more efficient data extraction, using Bright Data’s Wikipedia Scraper API is a straightforward solution.

Looking to scrape other websites? Register now and try our Web Scraper API. Start your free trial today!

No credit card required