Guide to Web Scraping With SeleniumBase in 2024

Simplify web scraping with SeleniumBase thanks to its advanced features and this step-by-step guide.

In this guide, you will learn:

  • What SeleniumBase is and why it is useful for web scraping
  • How it compares to vanilla Selenium
  • The features and benefits SeleniumBase offers
  • How to use it to build a simple scraper
  • How to utilize it for more complex use cases

Let’s dive in!

What Is SeleniumBase?

SeleniumBase is a Python framework for browser automation. Built on top of the Selenium/WebDriver APIs, it provides a professional-grade toolkit for web automation. It supports a wide range of tasks, from testing to scraping.

SeleniumBase is an all-in-one library for testing web pages, automating workflows, and scaling web-based operations. It comes equipped with advanced features such as CAPTCHA bypassing, bot-detection avoidance, and productivity-enhancing tools.

SeleniumBase vs Selenium: Feature and API Comparison

To better understand the why behind SeleniumBase, it makes sense to compare it directly with the vanilla version of Selenium—the tool it is built upon.

For a quick Selenium vs SeleniumBase comparison, take a look at the summary table below:

| Feature | SeleniumBase | Selenium |
| --- | --- | --- |
| Built-in test runners | Integrates with pytest, pynose, and behave | Requires manual setup for test integration |
| Driver management | Automatically downloads the browser driver matching the browser version | Requires manual driver download and configuration |
| Web automation logic | Combines multiple steps into a single method call | Requires multiple lines of code for similar functionality |
| Selector handling | Automatically detects CSS or XPath selectors | Requires explicitly defining selector types in method calls |
| Timeout handling | Applies default timeouts to prevent failures | Methods fail immediately if timeouts are not explicitly set |
| Error outputs | Provides clean, readable error messages for easier debugging | Generates verbose and less interpretable error logs |
| Dashboards and reports | Includes built-in dashboards, reports, and failure screenshots | No built-in dashboards or reporting capabilities |
| Desktop GUI applications | Offers visual tools for test running | Lacks desktop GUI tools for test execution |
| Test recorder | Built-in test recorder for creating scripts from manual browser actions | Requires manual script writing |
| Test case management | Provides CasePlans for organizing tests and documenting steps directly in the framework | No built-in test case management tools |
| Data app support | Includes ChartMaker for generating JavaScript from Python to create data apps | No additional tools for building data apps |

Time to dig into the differences!

Built-in Test Runners

SeleniumBase integrates with popular test runners like pytest, pynose, and behave. These tools provide an organized structure, seamless test discovery and execution, test state tracking (e.g., passed, failed, or skipped), and command-line options for customizing settings such as browser selection.

With vanilla Selenium, you would need to manually implement an options parser or rely on third-party tools for configuring tests from the command line.
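
For instance, here is a minimal sketch of what that integration looks like (the file name test_quotes.py and the assertion are our own illustration):

from seleniumbase import BaseCase
BaseCase.main(__name__, __file__)


class QuotesTest(BaseCase):
    def test_homepage_title(self):
        # Open the sandbox site and verify its title
        self.open("https://quotes.toscrape.com/")
        self.assert_title("Quotes to Scrape")

Because this is a regular pytest test, you can pick the browser straight from the command line, e.g., pytest test_quotes.py --firefox.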

Enhanced Driver Management

By default, SeleniumBase downloads a compatible driver version that matches the major version of your browser. You can override this using the --driver-version=VER option in your pytest command. For example:

pytest my_script.py --driver-version=114

In contrast, Selenium requires you to manually download and configure the appropriate driver, leaving you responsible for ensuring compatibility with the browser version.
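
For reference, this is roughly what the manual wiring looks like in Selenium (the driver path is just a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point Selenium at a chromedriver binary you downloaded yourself,
# making sure its version matches your installed Chrome
service = Service(executable_path="/path/to/chromedriver")
driver = webdriver.Chrome(service=service)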

Multi-Action Methods

SeleniumBase combines multiple steps into single methods for simplified web automation. For example, the driver.type(selector, text) method performs the following:

  1. Waits for the element to be visible
  2. Waits for the element to be interactive
  3. Clears any existing text
  4. Types the provided text
  5. Submits if the text ends with "\n"

With raw Selenium, replicating the same logic would require a few lines of code.
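
As a rough sketch, replicating those five steps in raw Selenium would look something like this (the selector and text are just examples):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for the element to be visible and interactive
element = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "input#search"))
)
# Clear any existing text, then type the new text
element.clear()
element.send_keys("selenium\n")  # the trailing "\n" submits the field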

Simplified Selector Handling

SeleniumBase can automatically differentiate between CSS selectors and XPath expressions. That removes the need to explicitly specify selector types with By.CSS_SELECTOR or By.XPATH. However, you can still provide the type explicitly if preferred.

Example with SeleniumBase:

driver.click("button.submit")  # Automatically detects as CSS Selector
driver.click("//button[@class='submit']")  # Automatically detects as XPath

The vanilla Selenium equivalent code is:

driver.find_element(By.CSS_SELECTOR, "button.submit").click()
driver.find_element(By.XPATH, "//button[@class='submit']").click()

Default and Custom Timeout Values

SeleniumBase automatically applies a default timeout of 10 seconds to methods, ensuring elements have time to load. That prevents immediate failures, which are common in raw Selenium.

You can also set custom timeout values directly in method calls, as in the example below:

driver.click("button", timeout=20)

The equivalent Selenium code would be much more verbose and complex:

WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button"))).click()

Clear Error Outputs

SeleniumBase provides clean, easy-to-read error messages when scripts fail. Raw Selenium, in contrast, often generates verbose and less interpretable error logs, requiring additional effort to debug.

Dashboards, Reports, and Screenshots

SeleniumBase includes features for generating dashboards and reports for test runs. It also saves screenshots of failures in the ./latest_logs/ folder for easy debugging. Raw Selenium lacks these features out of the box.
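
For example, assuming a pytest-based suite, options like the following (documented by SeleniumBase) generate a live dashboard and an HTML report; my_tests.py is a placeholder:

pytest my_tests.py --dashboard --html=report.html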

Extra Features

Compared to Selenium, SeleniumBase includes:

  • Desktop GUI applications for running tests visually, such as SeleniumBase Commander for pytest and SeleniumBase Behave GUI for behave.
  • A built-in Recorder / Test Generator for creating test scripts based on manual browser actions. This significantly reduces the effort required to write tests for complex workflows.
  • Test case management software called CasePlans to organize tests and document step descriptions directly within the framework.
  • Tools like ChartMaker to build data apps by generating JavaScript code from Python. That makes it a versatile solution beyond standard test automation.

SeleniumBase: Features, Methods, and CLI Options

See what makes SeleniumBase special by exploring its capabilities and API.

Features

Here is a list of some of the most relevant SeleniumBase features:

  • Includes Recorder Mode for instantly generating browser tests in Python.
  • Supports multiple browsers, tabs, iframes, and proxies within the same test.
  • Features Test Case Management Software with Markdown technology.
  • Smart waiting mechanism automatically improves reliability and reduces flaky tests.
  • Compatible with pytest, unittest, nose, and behave for test discovery and execution.
  • Includes advanced logging tools for dashboards, reports, and screenshots.
  • Can run tests in Headless Mode to hide the browser interface.
  • Supports multithreaded test execution across parallel browsers.
  • Allows tests to run using Chromium’s mobile device emulator.
  • Supports running tests through a proxy server, even an authenticated one.
  • Customizes the browser’s user-agent string for tests.
  • Prevents detection by websites that block Selenium automation.
  • Integrates with selenium-wire for inspecting browser network requests.
  • Flexible command-line interface for custom test execution options.
  • Global configuration file for managing test settings.
  • Supports integrations with GitHub Actions, Google Cloud, Azure, S3, and Docker.
  • Supports executing JavaScript from Python.
  • Can interact with Shadow DOM elements by using ::shadow in CSS selectors, as in the sketch below.

For the entire list, check out the documentation.
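
As an example of that last feature, a Shadow DOM interaction might look like the following sketch (the URL and selectors are hypothetical):

from seleniumbase import SB

with SB() as sb:
    sb.open("https://example.com/shadow-page")  # hypothetical page
    # "my-widget" is assumed to host a shadow root containing a button
    sb.click("my-widget::shadow button.confirm")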

Methods

Below is a list of the most useful SeleniumBase methods:

  • driver.open(url): Navigate the browser window to the specified URL.
  • driver.go_back(): Navigate back to the previous URL.
  • driver.type(selector, text): Update the field identified by the selector with the specified text.
  • driver.click(selector): Click the element identified by the selector.
  • driver.click_link(link_text): Click the link containing the specified text.
  • driver.select_option_by_text(dropdown_selector, option): Select an option from a dropdown menu by visible text.
  • driver.hover_and_click(hover_selector, click_selector): Hover over an element and click another.
  • driver.drag_and_drop(drag_selector, drop_selector): Drag an element and drop it onto another element.
  • driver.get_text(selector): Get the text of the specified element.
  • driver.get_attribute(selector, attribute): Get the specified attribute of an element.
  • driver.get_current_url(): Get the current page’s URL.
  • driver.get_page_source(): Get the HTML source of the current page.
  • driver.get_title(): Get the title of the current page.
  • driver.switch_to_frame(frame): Switch into the specified iframe container.
  • driver.switch_to_default_content(): Exit the iframe container and return to the main document.
  • driver.open_new_window(): Open a new browser window in the same session.
  • driver.switch_to_window(window): Switch to the specified browser window.
  • driver.switch_to_default_window(): Return to the original browser window.
  • driver.get_new_driver(OPTIONS): Open a new driver session with the specified options.
  • driver.switch_to_driver(driver): Switch to the specified browser driver.
  • driver.switch_to_default_driver(): Return to the original browser driver.
  • driver.wait_for_element(selector): Wait until the specified element is visible.
  • driver.is_element_visible(selector): Check if the specified element is visible.
  • driver.is_text_visible(text, selector): Check if the specified text is visible within an element.
  • driver.sleep(seconds): Pause execution for the specified amount of time.
  • driver.save_screenshot(name): Save a screenshot in .png format with the given name.
  • driver.assert_element(selector): Verify that the specified element is visible.
  • driver.assert_text(text, selector): Verify that the specified text is present in the element.
  • driver.assert_exact_text(text, selector): Verify that the specified text matches exactly in the element.
  • driver.assert_title(title): Verify that the current page title matches the specified title.
  • driver.assert_downloaded_file(file): Verify that the specified file has been downloaded.
  • driver.assert_no_404_errors(): Verify there are no broken links on the page.
  • driver.assert_no_js_errors(): Verify there are no JavaScript errors on the page.

For the complete list, explore the documentation.
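
To see how a few of these methods fit together, here is a small sketch against the Quotes to Scrape sandbox used later in this guide:

from seleniumbase import SB

with SB() as sb:
    sb.open("https://quotes.toscrape.com/")
    sb.assert_title("Quotes to Scrape")
    # Follow the pagination link containing "Next"
    sb.click_link("Next")
    sb.wait_for_element(".quote")
    # Read the text of the first quote on page 2
    print(sb.get_text(".quote .text"))
    sb.save_screenshot("page2.png")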

CLI Options

SeleniumBase extends pytest with the following command-line options:

  • --browser=BROWSER: Set the web browser (default: “chrome”).
  • --chrome: Shortcut for --browser=chrome.
  • --edge: Shortcut for --browser=edge.
  • --firefox: Shortcut for --browser=firefox.
  • --safari: Shortcut for --browser=safari.
  • --settings-file=FILE: Override default SeleniumBase settings.
  • --env=ENV: Set the test environment, accessible via driver.env.
  • --account=STR: Set account, accessible via driver.account.
  • --data=STRING: Extra test data, accessible via driver.data.
  • --var1=STRING: Extra test data, accessible via driver.var1.
  • --var2=STRING: Extra test data, accessible via driver.var2.
  • --var3=STRING: Extra test data, accessible via driver.var3.
  • --variables=DICT: Extra test data, accessible via driver.variables.
  • --proxy=SERVER:PORT: Connect to a proxy server.
  • --proxy=USERNAME:PASSWORD@SERVER:PORT: Use an authenticated proxy server.
  • --proxy-bypass-list=STRING: Hosts to bypass (e.g., “*.foo.com”).
  • --proxy-pac-url=URL: Connect via PAC URL.
  • --proxy-pac-url=USERNAME:PASSWORD@URL: Authenticated proxy with PAC URL.
  • --proxy-driver: Use proxy for driver download.
  • --multi-proxy: Allow multiple authenticated proxies in multi-threading.
  • --agent=STRING: Modify the browser’s User-Agent string.
  • --mobile: Enable mobile device emulator.
  • --metrics=STRING: Set mobile metrics (e.g., “CSSWidth,CSSHeight,PixelRatio”).
  • --chromium-arg="ARG=N,ARG2": Set Chromium arguments.
  • --firefox-arg="ARG=N,ARG2": Set Firefox arguments.
  • --firefox-pref=SET: Set Firefox preferences.
  • --extension-zip=ZIP: Load Chrome Extension .zip/.crx files.
  • --extension-dir=DIR: Load Chrome Extension directories.
  • --disable-features="F1,F2": Disable features.
  • --binary-location=PATH: Set Chromium binary path.
  • --driver-version=VER: Set driver version.
  • --headless: Default headless mode.
  • --headless1: Use Chrome’s old headless mode.
  • --headless2: Use Chrome’s new headless mode.
  • --headed: Enable GUI mode on Linux.
  • --xvfb: Run tests with Xvfb on Linux.
  • --locale=LOCALE_CODE: Set the browser’s language locale.
  • --reuse-session: Reuse browser session for all tests.
  • --reuse-class-session: Reuse session for class tests.
  • --crumbs: Delete cookies between reused sessions.
  • --disable-cookies: Disable cookies.
  • --disable-js: Disable JavaScript.
  • --disable-csp: Disable Content Security Policy.
  • --disable-ws: Disable Web Security.
  • --enable-ws: Enable Web Security.
  • --log-cdp: Log Chrome DevTools Protocol (CDP) events.
  • --remote-debug: Sync to Chrome Remote Debugger.
  • --visual-baseline: Set visual baseline for layout tests.
  • --timeout-multiplier=MULTIPLIER: Multiply default timeout values.

See the full list of command-line option definitions in the documentation.
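
These options can be combined freely. For example, using the same placeholders as in the list above:

pytest my_script.py --firefox --headless --proxy=USERNAME:PASSWORD@SERVER:PORT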

Using SeleniumBase for Web Scraping: Step-By-Step Guide

Follow this step-by-step tutorial to learn how to build a SeleniumBase scraper to retrieve data from the Quotes to Scrape sandbox:

Quotes to Scrape sandbox to practice web scraping

For a similar tutorial using vanilla Selenium, check out our guide on web scraping with Selenium.

Step #1: Project Initialization

Before getting started, make sure you have Python 3 installed on your machine. If not, download and install it.

Open the terminal and launch the command below to create a directory for your project:

mkdir seleniumbase-scraper

seleniumbase-scraper will contain your SeleniumBase scraper.

Navigate into it and initialize a virtual environment:

cd seleniumbase-scraper
python -m venv env

Next, load the project folder in your favorite Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition will do.

Create a scraper.py file in the project’s directory, which should now contain this file structure:

Creating a new scraper.py file in the project's folder

scraper.py will soon contain your scraping logic.

Activate the virtual environment in the IDE’s terminal. On Linux or macOS, do that with the command below:

source ./env/bin/activate

Equivalently, on Windows, run:

env\Scripts\activate

In the activated environment, launch this command to install SeleniumBase:

pip install seleniumbase

Wonderful! You have a Python environment for SeleniumBase web scraping.

Step #2: SeleniumBase Test Setup

While SeleniumBase supports pytest syntax for building tests, a web scraping bot is not a test script. You can still take advantage of all the SeleniumBase pytest command-line extension options by using the SB syntax:

from seleniumbase import SB

with SB() as sb:
    # Scraping logic...
    pass

You can now execute your script with:

python3 scraper.py

Note: On Windows, replace python3 with python.

To execute it in headless mode, run:

python3 scraper.py --headless

Keep in mind that you can combine multiple command line options.
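
For instance, to run the scraper headless through a proxy (SERVER:PORT is a placeholder):

python3 scraper.py --headless --proxy=SERVER:PORT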

Step #3: Connect to the Target Page

Use the open() method to instruct the controlled browser to visit your target page:

sb.open("https://quotes.toscrape.com/")

If you execute the scraping script in headed mode, this is what you will see for a fraction of a second:

The window you'll see for a second if you are using a headed mode

Note that, compared to vanilla Selenium, you do not have to manually close the driver. SeleniumBase will take care of that for you.
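
For comparison, a minimal vanilla Selenium script would have to release the browser itself:

from selenium import webdriver

driver = webdriver.Chrome()
try:
    driver.get("https://quotes.toscrape.com/")
    # Scraping logic...
finally:
    # With vanilla Selenium, you must close the browser yourself
    driver.quit()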

Step #4: Select the Quote Elements

Open the target page in incognito mode in your browser and inspect a quote element:

inspecting a quote element in incognito

Since the page contains multiple quotes, create a quotes array to store the scraped data:

quotes = []

In the DevTools section above, you can see that all quotes can be selected using the .quote CSS selector. Use find_elements() to select them all:

quote_elements = sb.find_elements(".quote")

Next, prepare to iterate over the elements and scrape data from each quote element. Add the scraped data to an array:

for quote_element in quote_elements:
    # Scraping logic...

Great! The high-level scraping logic is now ready.

Step #5: Scrape Quote Data

Inspect a single quote element:

Inspecting a single quote element

Note that you can scrape:

  • The quote text from .text
  • The quote author from .author
  • The quote tags from .tag

Select each node and extract its data via the text attribute:

text_element = quote_element.find_element(By.CSS_SELECTOR, ".text")
text = text_element.text.replace("“", "").replace("”", "")

author_element = quote_element.find_element(By.CSS_SELECTOR, ".author")
author = author_element.text

tags = []
tag_elements = quote_element.find_elements(By.CSS_SELECTOR, ".tag")
for tag_element in tag_elements:
    tag = tag_element.text
    tags.append(tag)

Note that find_elements() returns vanilla Selenium WebElement objects. So, to select elements within them, you must use Selenium’s native methods. This is why you have to specify By.CSS_SELECTOR as the locator.

Make sure to import By at the beginning of your script:

from selenium.webdriver.common.by import By

Notice how scraping the tags requires a loop, as a single quote can have one or more tags. Also, observe the use of the replace() method to remove the special double quotes surrounding the text.

Step #6: Populate the Quotes Array

Populate a new quotes object with the scraped data and add it to quotes:

quote = {
    "text": text,
    "author": author,
    "tags": tags
}
quotes.append(quote)

Amazing! The SeleniumBase scraping logic is complete.

Step #7: Implement Crawling Logic

Remember, the target site contains multiple pages. To navigate to the next page, click the “Next →” button at the bottom:

Inspecting the "next" button of the pagination

On the last page, this button will not be present.

To implement web crawling and scrape all pages, wrap your scraping logic in a loop that scrapes each page, clicks the “Next →” button when it is present, and stops after the last page:

while True:
    # Scraping logic...

    # Visit the next page, if present
    if sb.is_element_present(".next"):
        sb.click(".next a")
    else:
        break

Note the use of the special SeleniumBase is_element_present() method to check whether the button is present.

Perfect! Your SeleniumBase scraper will now go through the entire site.

Step #8: Export the Scraped Data

Export the scraped data in quotes to a CSV file as follows:

with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
      writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
      writer.writeheader()
      # Flatten the quote objects for CSV writing
      for quote in quotes:
          writer.writerow({
            "text": quote["text"],
            "author": quote["author"],
            "tags": ";".join(quote["tags"])
          })

Do not forget to import csv from the Python standard library:

import csv

Step #9: Put It All Together

Your scraper.py file should now contain the following code:

from seleniumbase import SB
from selenium.webdriver.common.by import By
import csv

with SB() as sb:
    # Connect to the target page
    sb.open("https://quotes.toscrape.com/")

    # Where to store the scraped data
    quotes = []

    # Iterate over all quote pages
    while True:
        # Select all quote elements on the page
        quote_elements = sb.find_elements(".quote")

        # Iterate over them and scrape data for each quote element
        for quote_element in quote_elements:
            # Data extraction logic
            text_element = quote_element.find_element(By.CSS_SELECTOR, ".text")
            text = text_element.text.replace("“", "").replace("”", "")

            author_element = quote_element.find_element(By.CSS_SELECTOR, ".author")
            author = author_element.text

            tags = []
            tag_elements = quote_element.find_elements(By.CSS_SELECTOR, ".tag")
            for tag_element in tag_elements:
                tag = tag_element.text
                tags.append(tag)

            # Populate a new quote object with the scraped data
            quote = {
                "text": text,
                "author": author,
                "tags": tags
            }
            # Add it to the list of scraped quotes
            quotes.append(quote)

        # Visit the next page, if present
        if sb.is_element_present(".next"):
            sb.click(".next a")
        else:
            break

    # Export the scraped data to CSV
    with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
        writer.writeheader()
        # Flatten the quote objects for CSV writing
        for quote in quotes:
            writer.writerow({
                "text": quote["text"],
                "author": quote["author"],
                "tags": ";".join(quote["tags"])
            })

Execute the SeleniumBase scraper in headless mode with:

python3 scraper.py --headless

After a few seconds, a quotes.csv file will appear in the project folder.

Open it, and you will see:

The final result with all the data that you will see

Et voilà! Your SeleniumBase web scraping script works like a charm.

Advanced SeleniumBase Scraping Use Cases

Now that you have seen the basics of SeleniumBase, you are ready to explore some more complex scenarios.

Automate Form Filling and Submission

Note: Bright Data doesn’t scrape behind login.

SeleniumBase also allows you to interact with elements on a page as a human user would. For example, suppose you need to interact with a login form as shown below:

Quotes to Scrape login page

Your goal is to fill out the “Username” and “Password” fields, and then submit the form by clicking the “Login” button. You can achieve this with a SeleniumBase test as follows:

# login.py

from seleniumbase import BaseCase
BaseCase.main(__name__, __file__)


class LoginTest(BaseCase):
    def test_submit_login_form(self):
        # Visit the target page
        self.open("https://quotes.toscrape.com/login")

        # Fill out the form
        self.type("#username", "test")
        self.type("#password", "test")

        # Submit the form
        self.click("input[type=\"submit\"]")

        # Verify you are on the right page
        self.assert_text("Top Ten tags")

This example lends itself well to being a test, hence the use of the BaseCase class, which lets you create pytest tests.

Execute the test with this command:

pytest login.py

You will see the browser open, load the login page, fill out the form, submit it, and then check for the given text to appear on the page.

The output in the terminal will look something like this:

login.py .                                                                                     [100%]

======================================== 1 passed in 11.20s ========================================= 

Bypass Simple Anti-Bot Technologies

Many sites implement advanced anti-scraping measures to prevent bots from accessing their data. These techniques include CAPTCHA challenges, rate limits, browser fingerprinting, and others. To effectively scrape websites without getting blocked, you need to bypass these protections.

SeleniumBase provides a special feature called UC Mode (Undetected-Chromedriver Mode), which helps scraping bots appear more like human users. This allows them to evade detection by anti-bot services, which might otherwise block the scraping bot directly or trigger CAPTCHAs.

UC Mode is built on undetected-chromedriver and comes with several updates, fixes, and improvements, such as:

  • Automatic User-Agent rotation to avoid detection.
  • Automatic configuration of Chromium arguments as needed.
  • Special uc_*() methods for bypassing CAPTCHAs.

Now, let’s see how to use UC Mode in SeleniumBase to bypass anti-bot challenges.

For this demonstration, you will see how to access the anti-bot page from the Scraping Course site:

Basic Cloudflare CAPTCHA on the Scraping Course website

To bypass the anti-bot measures and handle the CAPTCHA, enable UC Mode and use the uc_open_with_reconnect() and uc_gui_click_captcha() methods:

from seleniumbase import SB

with SB(uc=True) as sb:
    # Target page with anti-bot measures
    url = "https://www.scrapingcourse.com/antibot-challenge"

    # Open the URL using UC Mode with a reconnect time of 4 seconds to avoid initial detection
    sb.uc_open_with_reconnect(url, reconnect_time=4)

    # Attempt to bypass the CAPTCHA
    sb.uc_gui_click_captcha()

    # Take a screenshot of the page
    sb.save_screenshot("screenshot.png")

Now, launch the script and verify that it works as expected. Since uc_gui_click_captcha() requires PyAutoGUI, SeleniumBase will install it for you on the first run:

PyAutoGUI required! Installing now...

You will see the browser automatically click the “Verify you are human” checkbox by moving the mouse. The screenshot.png file in your project folder will show:

Antibot challenge bypass succeeded

Wow! Cloudflare has been bypassed.

Bypass Complex Anti-Bot Technologies

Anti-bot solutions are becoming increasingly sophisticated, and UC Mode may not always be effective. This is why SeleniumBase also offers a special CDP Mode (Chrome DevTools Protocol Mode).

CDP Mode operates within UC Mode and allows bots to appear more human-like by controlling the browser through the CDP-Driver. While regular UC Mode cannot perform WebDriver actions when the driver is disconnected from the browser, the CDP-Driver can still interact with the browser, overcoming this limitation.

CDP Mode is built on python-cdp, trio-cdp, and nodriver. It is designed to bypass advanced anti-bot solutions on real-world sites, as in the example below:

from seleniumbase import SB

with SB(uc=True, test=True) as sb:
    # Target page with advanced anti-bot measures
    url = "https://gitlab.com/users/sign_in"
    # Visit the page in CDP Mode
    sb.activate_cdp_mode(url)

    # Handle the CAPTCHA
    sb.uc_gui_click_captcha()

    # Wait for 2 seconds for the page to reload and the driver to retake control
    sb.sleep(2)

    # Take a screenshot of the page
    sb.save_screenshot("screenshot.png")

The result will be:

The final result that you will see

Here we go! You are now a SeleniumBase scraping master.

Conclusion

In this article, you learned about SeleniumBase, the features and methods it offers, and how to use it for web scraping. You started with basic scenarios and then explored more complex use cases.

While UC Mode and CDP Mode are effective for bypassing certain anti-bot measures, they are not foolproof. Websites can still block your IP if you make too many requests or challenge you with more complex CAPTCHAs that require multiple actions. A more effective solution is to use a web browser automation tool like Selenium in combination with a scraping-dedicated, cloud-based, highly scalable browser like Scraping Browser from Bright Data.

Scraping Browser is a browser that works with Playwright, Puppeteer, Selenium, and others. It automatically rotates exit IPs with every request and can handle browser fingerprinting, retries, CAPTCHA resolution, and much more. Forget about getting blocked and streamline your scraping operation.

Sign up now and start your free trial!
