Pagination in Web Scraping Guide

This article explores common pagination techniques and provides Python code examples to help you scrape data more effectively.

When scraping the web, you’ll often encounter pagination, where content is spread across multiple pages. Handling this pagination can be challenging because different websites use different pagination techniques.

In this article, I’ll explain the common pagination techniques and show how to handle them with a practical code example.

What is Pagination?

Websites like e-commerce platforms, job boards, and social media use pagination to manage large amounts of data. Displaying everything on one page would significantly increase load times and consume too much memory. Pagination splits content across multiple pages and provides navigation options like “Next,” page numbers, or auto-loading as you scroll. This makes browsing faster and more organized.

Types of Pagination

The complexity of pagination can vary, ranging from simple numbered pagination to more advanced techniques like infinite scrolling or dynamic content loading. In my experience, I’ve encountered three main types of pagination, which I believe are the most commonly used on websites:

  • Numbered Pagination: Users navigate through discrete pages using numbered links.
  • Click-to-Load Pagination: Users click a button (e.g., “Load More”) to load additional content.
  • Infinite Scrolling: Content loads automatically as users scroll down the page.

Let’s dive into each of these in more detail!

Numbered Pagination

This is the most common pagination technique, often called “Next and Previous Pagination”, “Arrow Pagination”, or “URL-Based Pagination”. Despite the different names, the core idea is the same—pages are linked using numbered links. You can navigate by changing the page number in the URL. To know when to stop pagination, you can check if the “Next” button is disabled or if no new data is available.

It usually looks like this:

pagination-in-web-scraping-screenshot-sample-numbered-pagination

Let’s take an example! We’ll navigate through all the pages on the website Scrape This Site. The pagination bar for this site shows a total of 24 pages.

pagination-in-web-scraping-screenshot-scrapethesite-pagination

You’ll notice that when you click the “>>” button, the URL changes: the page_num query parameter increments (for example, ?page_num=2).

Now, take a look at the HTML of this “Next” button. It’s an anchor (<a>) tag with an href attribute that links to the next page, and its aria-label attribute identifies it as the “Next” button. When there are no more pages, an anchor with this aria-label is no longer present, which signals the end of pagination.

pagination-in-web-scraping-screenshot-scrapethesite-pagination-html

Let’s start by writing a basic web scraper to navigate through these pages. First, set up your environment by installing the required packages. For a detailed guide on web scraping with Python, you can check out the in-depth blog post here.

pip install requests beautifulsoup4 lxml

Here’s the code to paginate through each page:

import requests
from bs4 import BeautifulSoup

base_url = "https://www.scrapethissite.com/pages/forms/?page_num="

# Start with page 1
page_num = 1

while True:
    url = f"{base_url}{page_num}"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "lxml")

    print(f"Currently on page: {page_num}")

    # Check if 'Next' button exists
    next_button = soup.find("a", {"aria-label": "Next"})

    if next_button:
        # Move to the next page
        page_num += 1
    else:
        # No more pages, exit loop
        print("Reached the last page.")
        break

This code navigates through the pages by checking if the “Next” button (with aria-label="Next") exists. If the button is present, it increments the page_num and makes a new request with the updated URL. The loop continues until the “Next” button is no longer found, indicating the last page.

Run the code, and you’ll see we’ve successfully navigated all pages.

pagination-in-web-scraping-screenshot-numbered-pagination-output
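As mentioned earlier, another way to know when to stop is to check whether a page still returns any data. Here’s a minimal sketch of that variant for the same site. The tr.team row selector is an assumption about the page’s table markup, so verify it in DevTools; MAX_PAGES is just a safety cap in case an out-of-range page never comes back empty.

import requests
from bs4 import BeautifulSoup

base_url = "https://www.scrapethissite.com/pages/forms/?page_num="
page_num = 1
MAX_PAGES = 50  # safety cap in case an out-of-range page never comes back empty

while page_num <= MAX_PAGES:
    response = requests.get(f"{base_url}{page_num}")
    soup = BeautifulSoup(response.content, "lxml")

    # "tr.team" is assumed to match one table row per team; verify it in DevTools
    rows = soup.select("tr.team")

    if not rows:
        # An empty page means we've gone past the last page
        print("No more data. Reached the end of pagination.")
        break

    print(f"Page {page_num}: found {len(rows)} rows")
    page_num += 1

This approach is handy when a site has no explicit “Next” control but still accepts a page number in the URL.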

Some websites have a ‘Next’ button that doesn’t change the URL but still loads new content on the same page. In such cases, plain HTTP scraping with Requests and BeautifulSoup may not work well. Tools like Selenium or Playwright are more suitable, as they can interact with the page and simulate actions like clicking buttons to retrieve the dynamically loaded content. For more on using Selenium for such tasks, you can read a detailed guide here.

You’ll encounter a similar situation when trying to scrape the NGINX blog page.

pagination-in-web-scraping-screenshot-nginx-pagination-html

Let’s use Playwright to handle dynamically loaded content. If you’re new to Playwright, check out this helpful getting started guide.

Now, before writing the code, run the following command to set up Playwright on your machine:

pip install playwright
playwright install

Here’s the code:

import asyncio
from playwright.async_api import async_playwright

# Define an asynchronous function
async def scrape_nginx_blog():
    async with async_playwright() as p:
        # Launch a Chromium browser instance in headless mode
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Navigate to the NGINX blog page
        await page.goto("https://www.f5.com/company/blog/nginx")

        page_num = 1
        while True:
            print(f"Currently on page {page_num}")

            # Locate the 'Next' button using a button locator with value "next"
            next_button = page.locator('button[value="next"]')

            # Check if the 'Next' button is enabled
            if await next_button.is_enabled():
                await next_button.click()  # Click the 'Next' button to go to the next page
                await page.wait_for_timeout(
                    2000
                )  # Wait for 2 seconds to allow new content to load
                page_num += 1
            else:
                print("No more pages. Scraping finished.")
                break  # Exit the loop if no more pages are available
        
        await browser.close()  # Close the browser


# Run the asynchronous scraping function
asyncio.run(scrape_nginx_blog())

The code uses asynchronous Playwright to navigate through all the pages. It enters a loop that checks for the ‘Next’ button. If the button is enabled, it clicks to go to the next page and waits for the content to load. This process repeats until no more pages are available. Finally, the browser is closed once the scraping is complete.

Run the code, and you’ll see we’ve successfully navigated all pages.

pagination-in-web-scraping-screenshot-numbered-pagination-nginx-output

Click-to-Load Pagination

On many websites, you’ve probably seen buttons like “Load More,” “Show More,” or “View More.” These are examples of click-to-load pagination, commonly used on modern sites. These buttons dynamically load content through JavaScript. The key challenge here is simulating user interaction—automating the process of clicking the button to load more content.

Let’s take the Bright Data blog section as an example. When you visit and scroll down, you’ll notice a “View More” button that loads blog posts as you click it.

pagination-in-web-scraping-screenshot-brightdata-load-more

You can use tools like Selenium or Playwright to automate this process by repeatedly clicking the “Load More” button until no more content is available. Let’s see how we can handle this easily with Playwright.

import asyncio
from playwright.async_api import async_playwright


async def scrape_brightdata_blog():
    async with async_playwright() as p:
        
        # Launch a headless browser
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Navigate to the Bright Data blog
        await page.goto("https://brightdata.com/blog")

        page_num = 1

        while True:
            print(f"Currently on page {page_num}")

            # Locate the "View More" button
            view_more_button = page.locator("button.load_more_btn")

            # Check if the button is visible and enabled
            if (
                await view_more_button.count() > 0
                and await view_more_button.is_visible()
            ):
                await view_more_button.click()
                await page.wait_for_timeout(2000)
                page_num += 1
            else:
                print("No more pages to load. Scraping finished.")
                break
        
        # Close the browser
        await browser.close()


# Run the scraping function
asyncio.run(scrape_brightdata_blog())

The code locates the “View More” button using the CSS selector button.load_more_btn. It then checks if the button exists and is visible by using count() > 0 and is_visible(). If the button is visible, it interacts with it using the click() method and waits for 2 seconds to allow new content to load. This process repeats in a loop until the button is no longer visible.

Run the code, and you’ll see we’ve successfully navigated all pages.

pagination-in-web-scraping-screenshot-load-more-pagination-output

We have successfully navigated through all 52 pages of the Bright Data blog section, a total we only discovered once the scraping finished. However, it is possible to know the total number of pages before scraping.

To do this, open Developer Tools, navigate to the “Network” tab, and filter the requests by selecting “Fetch/XHR.” Then, click the “View More” button again, and you’ll notice that an AJAX request is triggered.

pagination-in-web-scraping-screenshot-bright-data-blog

Click on this request and navigate to the “Preview” section, where you’ll see that the maximum number of pages is 52. Then, head over to the “Payload” section, and you’ll find that there are 6 blog posts per page, and we are currently on page 3.

pagination-in-web-scraping-screenshot-bright-data-blog-section

This is fantastic!
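If you’d rather read that page count programmatically instead of from DevTools, you can replay the same AJAX request in code. The sketch below is purely illustrative: the endpoint URL, payload fields, and response key are placeholders standing in for whatever you see in the request’s “Headers”, “Payload”, and “Preview” tabs, so copy the real values from your own Network panel.

import requests

# Placeholder endpoint and field names: copy the real ones from the request
# you see in the Network tab after clicking "View More"
ajax_url = "https://brightdata.com/wp-admin/admin-ajax.php"  # assumed endpoint
payload = {
    "action": "load_more_posts",  # assumed action name
    "page": 1,                    # page to fetch
    "posts_per_page": 6,          # the Payload tab showed 6 posts per page
}

response = requests.post(ajax_url, data=payload)
data = response.json()

# The Preview tab showed the maximum page count (52 at the time of writing);
# "max_pages" here is a placeholder key name
print("Total pages:", data.get("max_pages"))

Knowing the page count up front lets you stop the click loop deterministically instead of waiting for the button to disappear.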

Infinite Scroll Pagination

Instead of “previous/next” buttons, many websites now use infinite scrolling, which improves user experience by eliminating the need to click through multiple pages. This technique automatically loads new content as the user scrolls down. However, it presents unique challenges for web scrapers, as it requires monitoring DOM changes and handling AJAX requests.

Let’s take a real-life example. When you visit the Nike website, you’ll notice that shoes load automatically as you scroll down. With every scroll, a loading icon briefly appears and, in the blink of an eye, more shoes are displayed, as shown in the image below:

pagination-in-web-scraping-screenshot-nike-infinite-scroll

Open the Network tab and filter by Fetch/XHR to see the requests fired as you scroll. When you click on the request (d9a5bc), you can find all the data for the current page in the “Response” tab.

pagination-in-web-scraping-screenshot-infinite-scroll-response

Now, to handle the pagination, you need to keep scrolling down the page until you reach the end. As you scroll, the browser will make many requests, but only some of these Fetch/XHR requests will contain the actual data you need.

Here’s the code that handles pagination and extracts the titles of shoes:

import asyncio
from urllib.parse import parse_qs, urlparse
from playwright.async_api import async_playwright


async def scroll_to_bottom(page) -> None:
    """Scroll to the bottom of the page until no more content is loaded."""
    last_height = await page.evaluate("document.body.scrollHeight")
    scroll_count = 0
    while True:
        # Scroll down
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        await asyncio.sleep(2)  # Wait for new content to load

        scroll_count += 1
        print(f"Scroll iteration: {scroll_count}")

        # Check if scroll height has changed
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            print("Reached the bottom of the page.")
            break  # Exit if no new content is loaded
        last_height = new_height


async def extract_product_data(response, extracted_products) -> None:
    """Extract product data from the response."""
    parsed_url = urlparse(response.url)
    query_params = parse_qs(parsed_url.query)

    if "queryType" in query_params and query_params["queryType"][0] == "PRODUCTS":
        data = await response.json()
        for grouping in data.get("productGroupings", []):
            for product in grouping.get("products", []):
                title = product.get("copy", {}).get("title")
                extracted_products.append({"title": title})


async def scrape_shoes(target_url: str) -> None:
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=True)
        page = await browser.new_page()
        extracted_products = []

        # Set up listener for product data responses
        page.on(
            "response",
            lambda response: extract_product_data(
                response, extracted_products),
        )

        # Navigate to the page and scroll to the bottom
        print("Navigating to the page...")
        await page.goto(target_url, wait_until="domcontentloaded")
        await asyncio.sleep(2)
        await scroll_to_bottom(page)

        # Save product titles to a text file
        with open("product_titles.txt", "w") as title_file:
            for product in extracted_products:
                if product["title"]:  # skip entries with a missing title
                    title_file.write(product["title"] + "\n")
        print("Scraping completed!")
        await browser.close()


if __name__ == "__main__":
    asyncio.run(
        scrape_shoes(
            "https://www.nike.com/in/w/mens-running-shoes-37v7jznik1zy7ok")
    )

In the code, the scroll_to_bottom function continuously scrolls to the bottom of the page to load more content. It records the current scroll height, scrolls down, and after each scroll checks whether the scroll height has changed. If the height remains unchanged, it concludes that no more content is being loaded and exits the loop. Meanwhile, the “response” listener registered with page.on passes every network response to extract_product_data, which keeps only responses whose queryType query parameter is PRODUCTS and collects the shoe titles from their JSON payloads. Together, these ensure that all available products are loaded and captured before the titles are written to the text file.

Here’s what happens when you run the code:

pagination-in-web-scraping-screenshot-infinite-scroll-output

After the code executes successfully, a new text file will be created containing all the titles of the Nike shoes.

pagination-in-web-scraping-screenshot-text-file

Challenges in Pagination

The risk of getting blocked increases when dealing with paginated content, and some websites may block you after just one page. For instance, if you attempt to scrape Glassdoor, you may run into various web scraping challenges; one of them, which I’ve experienced firsthand, is Cloudflare’s CAPTCHA challenge.

Glassdoor Cloudflare CAPTCHA

Let’s make a request to the Glassdoor page and see what happens.

import requests

url = "https://www.glassdoor.com/"
response = requests.get(url)
print(f"Status code: {response.status_code}")

The result is a 403 status code.

This shows that Glassdoor has detected your request as coming from a bot or scraper, resulting in a CAPTCHA challenge. If you continue to send multiple requests, your IP could be blocked immediately.

To bypass these blocks and effectively extract the data you need, you can use proxies in Python Requests to avoid IP bans or mimic a real browser by rotating the User Agent. However, it’s important to note that none of these methods can guarantee avoiding advanced bot detection.
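For a rough idea of what that looks like in practice, here’s a minimal sketch that routes the request through a proxy and picks a random User-Agent on each run. The proxy address is a placeholder you’d replace with one from your provider, and even with both in place there’s no guarantee of getting past Cloudflare.

import random
import requests

# Example User-Agent strings to rotate through
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

# Placeholder proxy; replace with a working proxy from your provider
proxies = {
    "http": "http://username:password@proxy.example.com:8000",
    "https": "http://username:password@proxy.example.com:8000",
}

headers = {"User-Agent": random.choice(user_agents)}

response = requests.get(
    "https://www.glassdoor.com/",
    headers=headers,
    proxies=proxies,
    timeout=30,
)
print(f"Status code: {response.status_code}")

If this still returns 403, the site is likely fingerprinting more than the IP address and User-Agent, which is where the tools below come in.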

So, what’s the ultimate solution? Let’s dive into that next!

Incorporate Bright Data Solutions

Bright Data is an excellent solution for bypassing sophisticated anti-bot measures. It seamlessly integrates with your project using just a few lines of code and offers a range of solutions for any advanced anti-bot mechanisms.

One of its solutions is the Web Scraper API, which simplifies data extraction from any website by automatically handling IP rotation and CAPTCHA solving. This allows you to focus on data analysis rather than the intricacies of data retrieval.

For instance, in our case, we encountered challenges when trying to bypass the CAPTCHA on Glassdoor. To handle this, you can use Bright Data’s Glassdoor scraper API, which is specifically designed to bypass such obstacles and extract data seamlessly from the site.

To get started with the Glassdoor Scraper API, follow these steps:

First, create an account. Visit the Bright Data website, click on Start Free Trial, and follow the sign-up instructions. Once logged in, you’ll be redirected to your dashboard, where you will get some free credits.

Now, go to the Web Scraper API section and select Glassdoor under the B2B data category. You’ll find various data collection options, such as collecting companies by URL or collecting job listings by URL.

Web Scraper API on Bright Data's dashboard

Under “Glassdoor companies overview information”, get your API token and copy your dataset ID (e.g., gd_l7j0bx501ockwldaqf).

Getting the API token and dataset ID

Now, here’s a simple code snippet that shows how to trigger the collection of company data by providing the URL, API token, and dataset ID.

import requests
import json

def trigger_dataset(api_token, dataset_id, company_url):
    """
    Triggers a dataset using the Bright Data API.

    Args:
        api_token (str): The API token for authentication.
        dataset_id (str): The dataset ID to trigger.
        company_url (str): The URL of the company page to analyze.

    Returns:
        dict: The JSON response from the API.
    """
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
    }
    payload = json.dumps([{"url": company_url}])
    response = requests.post(
        "https://api.brightdata.com/datasets/v3/trigger",
        headers=headers,
        params={"dataset_id": dataset_id},
        data=payload,
    )
    return response.json()

api_token = "API_Token"
dataset_id = "DATASET_ID"
company_url = "https://www.glassdoor.com/"
response_data = trigger_dataset(api_token, dataset_id, company_url)
print(response_data)

Upon running the code, you will receive a snapshot ID as shown below:

Snapshot ID

Use the snapshot ID to retrieve the actual data of the company. Run the following command in your terminal. For Windows, use:

curl.exe -H "Authorization: Bearer API_TOKEN" "https://api.brightdata.com/datasets/v3/snapshot/s_m0v14wn11w6tcxfih8?format=json"

For Linux:

curl -H "Authorization: Bearer API_TOKEN" \
"https://api.brightdata.com/datasets/v3/snapshot/s_m0v14wn11w6tcxfih8?format=json"

After running the command (or the Python equivalent), you’ll get the desired data.

The final desired data

That’s all it takes!

Similarly, you can extract other types of data from Glassdoor by modifying the code. I’ve covered one method here, but there are five other ways to do it, each tailored to a specific data need, so I recommend exploring those options to get exactly the data you’re after.

Conclusion

This article discussed various pagination methods commonly used on modern websites, such as numbered pagination, “load more” buttons, and infinite scroll, and provided code examples for handling each technique. However, dealing with pagination is only one part of web scraping; overcoming anti-bot detection remains a significant challenge.

Evading advanced anti-bot detection can be complex and often yields mixed results. Bright Data’s tools offer a streamlined, cost-effective solution, including Web Unlocker, Scraping Browser, and Web Scraper APIs for all your web scraping needs. With just a few lines of code, you can achieve a higher success rate without the hassle of managing intricate anti-bot measures.

Not interested in being involved in the scraping process at all? Check out our Dataset Marketplace!

Sign up today for a free trial (no credit card required).