Web scraping with Selenium guide

This is the only step-by-step guide you will need to start collecting web data from target sites and saving it as CSV files in under 10 minutes.
Louis Ruggeri | Support Manager
26-May-2022

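In this article we will discuss:

  • Selenium: What it is, and how it is used
  • Puppeteer vs. Selenium
  • A step-by-step guide to scraping with Selenium
  • Integrating proxies with Selenium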

The corresponding GitHub repository for this article can be found here.

Selenium: What it is, and how it is used

Selenium is open-source software that includes a variety of tools and libraries for browser automation, including:

  • Web page-based element actions/retrieval (e.g. close, back, get_cookie, get_screenshot_as_png, get_window_size)
  • Site testing 
  • Managing alert prompts, and cookies (adding/removing)
  • Form element submission 
  • Data collection/web scraping 
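To make a few of the element actions listed above concrete, here is a minimal sketch that reads some facts from whatever page a driver currently has open. The function name `summarize_page` is my own (hypothetical), and it assumes `driver` is an already-started Selenium WebDriver; the calls it makes (`title`, `current_url`, `get_window_size`, `get_cookies`, `get_screenshot_as_png`) are standard WebDriver members.

```python
def summarize_page(driver):
    """Collect a few facts about the page currently loaded in `driver`.

    `driver` is assumed to be a running Selenium WebDriver session.
    """
    size = driver.get_window_size()              # dict with 'width' and 'height'
    screenshot = driver.get_screenshot_as_png()  # PNG image as raw bytes
    return {
        "title": driver.title,
        "url": driver.current_url,
        "window": (size["width"], size["height"]),
        "cookie_names": [cookie["name"] for cookie in driver.get_cookies()],
        "screenshot_bytes": len(screenshot),
    }
```

With a live session you would call it right after loading a page, for example `summarize_page(browser)` following `browser.get(...)`.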

It is compatible with most browsers, including Firefox, Chrome, Safari, and Internet Explorer, and can be used to write tests in a variety of programming languages such as Python, C#, JavaScript (Node.js), and PHP.
For your convenience, I have included a link to the official Selenium 4.1.5 documentation library.

Puppeteer vs. Selenium

For those of you weighing Puppeteer against Selenium: Puppeteer may serve you better if you plan on focusing mainly on JavaScript and Chrome. Selenium, on the other hand, may be the better choice if you need to work across multiple browsers in order to test browser applications and/or perform web data collection.

A step-by-step guide to scraping with Selenium 

Step One: Install Selenium 

If you have pip (the package installer for Python) on your computer, all you need to do is open a terminal and type in:

pip install -U selenium

Otherwise, you can download the Selenium source archive from PyPI, unarchive it, and run the following from the extracted folder:

python setup.py install
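Whichever route you take, it may be worth confirming that the package is actually visible to the Python interpreter you plan to use. The sketch below relies only on the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the Selenium version if it is installed, otherwise None
print(installed_version("selenium"))
```

If this prints `None`, the install most likely went to a different Python environment than the one running the script.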

Do note that you will need a driver so that Selenium can interface with your browser of choice. Here are links to some of the most popular browser drivers for your convenience:

Let’s use Firefox as an example browser. The script below opens Firefox, goes to a web page (Yahoo in this case), searches for “seleniumhq”, and then closes the browser. Here’s what that looks like in code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()

browser.get('http://www.yahoo.com')
assert 'Yahoo' in browser.title

elem = browser.find_element(By.NAME, 'p')  # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)

browser.quit()
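If a script like this fails to launch the browser, a missing driver is a common cause. Here is a small, hedged check that looks for the driver executable on your PATH. It assumes the usual executable names (`geckodriver` for Firefox, `chromedriver` for Chrome); the mapping and the function name `find_driver` are my own illustration. Selenium releases from 4.6 onward can also fetch a matching driver automatically, in which case a missing PATH entry is not necessarily a problem.

```python
import shutil

# Usual driver executable names for two popular browsers
DRIVER_EXECUTABLES = {
    "firefox": "geckodriver",
    "chrome": "chromedriver",
}


def find_driver(browser_name):
    """Return the full path of the driver for `browser_name`, or None if it is not on PATH."""
    executable = DRIVER_EXECUTABLES[browser_name.lower()]
    return shutil.which(executable)


print(find_driver("firefox"))
```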

Step Two: Importing supporting packages

Selenium is rarely used in isolation. It is typically paired with other packages such as pandas (an easy-to-use, open-source data analysis library). Here is what you should type in to import them:

from selenium import webdriver
from selenium.webdriver.common.by import By  # Locator strategies such as By.XPATH
import time
import pandas as pd

Step Three: Defining variables 

In this step we will define our target folder, search query, and target site. In this example we will be aiming to map different job opportunities as displayed by competing companies on LinkedIn. What you type in should look something like this:

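from selenium.webdriver.chrome.service import Service

FILE_PATH_FOLDER = 'F:....Competitive_Analysis'
search_query = 'https://www.linkedin.com/q-chief-financial-officer-jobs.html'
# Selenium 4 takes the driver path through a Service object; the executable_path argument is deprecated
driver = webdriver.Chrome(service=Service('C:/.../chromedriver_win32/chromedriver.exe'))
job_details = []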

Step Four: HTML tag inspection

Elements on an HTML page usually carry attributes (such as an id, a class, or a data attribute) that identify the information they display. The technique here is to use those attributes to locate the elements you want on the target site. You can find them by:

  • Right-clicking anywhere on the page and selecting ‘Inspect’
  • Then either clicking the arrow in the top left-hand corner of the developer tools panel or pressing Ctrl+Shift+C in order to inspect a specific element and obtain the desired HTML tag
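To see how an attribute found in the inspector turns into a selector, here is a small offline sketch. It runs XPath-style queries against a hand-written markup sample using Python's built-in ElementTree, which supports only a subset of XPath and requires well-formed markup. The sample HTML is invented for illustration and is not taken from any real site.

```python
import xml.etree.ElementTree as ET

# Invented sample markup: two organic job cards and one sponsored card
SAMPLE = """
<div id="results">
  <div data-tn-component="organicJob"><h2 class="title"><a>CFO</a></h2></div>
  <div data-tn-component="sponsoredJob"><h2 class="title"><a>Ad</a></h2></div>
  <div data-tn-component="organicJob"><h2 class="title"><a>Finance Director</a></h2></div>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# Select only the <div> elements whose data attribute marks them as organic jobs
jobs = root.findall(".//div[@data-tn-component='organicJob']")

# Inside each job card, follow the path h2.title > a to reach the title text
titles = [job.find(".//h2[@class='title']/a").text for job in jobs]
print(titles)  # prints ['CFO', 'Finance Director']
```

Selenium accepts the same style of expression, evaluated by the browser against the live page instead of a static sample.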

Here’s what that looks like:

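driver.get(search_query)
time.sleep(5)  # Give the page a few seconds to load
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which was deprecated in Selenium 4 and dropped in later releases
job_list = driver.find_elements(By.XPATH, "//div[@data-tn-component='organicJob']")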
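A fixed five-second sleep is either too long or too short, depending on how quickly the page loads. One alternative is to poll until the elements show up, giving up after a timeout. The sketch below hand-rolls that loop to show the idea; the function name `wait_for_elements` and its parameters are hypothetical. Selenium also ships its own `WebDriverWait` helper for the same purpose, which you may prefer in practice.

```python
import time


def wait_for_elements(driver, xpath, timeout=10.0, poll_interval=0.5):
    """Poll `driver` until `xpath` matches at least one element, then return the matches.

    Raises TimeoutError if nothing matches within `timeout` seconds.
    `driver` is assumed to be a Selenium WebDriver (anything with find_elements works).
    """
    deadline = time.monotonic() + timeout
    while True:
        # "xpath" is the string value behind Selenium's By.XPATH constant
        elements = driver.find_elements("xpath", xpath)
        if elements:
            return elements
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no elements matched {xpath!r} within {timeout} s")
        time.sleep(poll_interval)
```

In the scraper, this could stand in for the sleep and the lookup together, for example `job_list = wait_for_elements(driver, "//div[@data-tn-component='organicJob']")`.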

Step Five: Specific data point extraction

We will extract our target data points by running XPath queries against each job element with the Selenium web driver, then quit the driver, which also closes the browser, once the target data has been collected.

We will target data points as follows:

  • Job title
  • Company
  • Job location
  • Job description
  • Date job was uploaded 

Here’s what that looks like:

for each_job in job_list:
    # Getting job info (XPaths starting with "." are evaluated relative to the current job element)
    job_title = each_job.find_elements(By.XPATH, ".//h2[@class='title']/a")[0]
    job_company = each_job.find_elements(By.XPATH, ".//span[@class='company']")[0]
    job_location = each_job.find_elements(By.XPATH, ".//span[@class='location accessible-contrast-color-location']")[0]
    job_summary = each_job.find_elements(By.XPATH, ".//div[@class='summary']")[0]
    job_publish_date = each_job.find_elements(By.XPATH, ".//span[@class='date ']")[0]
    # Saving job info
    job_info = [job_title.text, job_company.text, job_location.text, job_summary.text, job_publish_date.text]
    # Saving into job_details
    job_details.append(job_info)
# Quit the driver, which also closes the browser
driver.quit()

Please note that the target site can change these selectors at any time, so confirm that they are still correct before running the scraper rather than assuming that they are.
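Because a changed selector makes the `[0]` lookups fail with an IndexError, it can help to fall back to a default value when nothing matches. Here is a minimal sketch of such a helper. The name `first_text` is hypothetical, and it assumes `parent` is a Selenium WebElement (or the driver itself).

```python
def first_text(parent, xpath, default=""):
    """Return the text of the first element under `parent` matching `xpath`.

    Returns `default` instead of raising when the selector matches nothing.
    """
    # "xpath" is the string value behind Selenium's By.XPATH constant
    matches = parent.find_elements("xpath", xpath)
    return matches[0].text if matches else default
```

Inside the loop, a call such as `first_text(each_job, ".//span[@class='company']")` would then yield an empty string for a job card that lacks a company element, and an unexpectedly empty column in the output is a good hint that a selector needs updating.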

Step Six: Saving the data in preparation for output

At this point you will want to name the columns of the data frame and use the ‘to_csv’ method in order to save all of the obtained data in CSV format as follows:

import os

job_details_df = pd.DataFrame(job_details, columns=['title', 'company', 'location', 'summary', 'publish_date'])
# Write the file into the target folder defined in Step Three
job_details_df.to_csv(os.path.join(FILE_PATH_FOLDER, 'job_details.csv'), index=False)

Your CSV file will be saved as job_details.csv in the following location: FILE_PATH_FOLDER
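If pandas is not available on your machine, roughly the same file can be produced with Python's built-in csv module. This is an alternative sketch, not the method used in the steps above; the function name `save_rows` is hypothetical, and the column names are the ones from Step Six.

```python
import csv
import os

COLUMNS = ['title', 'company', 'location', 'summary', 'publish_date']


def save_rows(rows, folder, filename='job_details.csv'):
    """Write `rows` (a list of 5-item lists) to a CSV file in `folder` and return its path."""
    os.makedirs(folder, exist_ok=True)  # Create the target folder if it does not exist yet
    path = os.path.join(folder, filename)
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)  # Header row
        writer.writerows(rows)    # One line per job
    return path
```

In the scraper you would call it as `save_rows(job_details, FILE_PATH_FOLDER)`.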

That’s it, you have just successfully completed your first web scraping job with Selenium.

Integrating proxies with Selenium 

By integrating proxies into your Selenium-built scraper you can:

  • Bypass site-specific geo-restrictions 
  • Avoid blocks, bans, and CAPTCHAs
  • Ensure you are not served misleading information

Get started by creating a Bright Data account and choosing a proxy network type. Then, in your Selenium script, enter the ‘Proxy IP:Port’ value (for example, zproxy.lum-superproxy.io:22225) in the ‘setProxy’ function for both HTTP and HTTPS.

Under ‘sendKeys’, enter your Bright Data account ID and proxy Zone name in the form lum-customer-CUSTOMER-zone-YOURZONE, followed by your Zone password, which can be found in the Zone settings.
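For readers following along in Python, here is a hedged sketch of how the same pieces could be assembled. The helper names are hypothetical. The username format and the host and port come from the instructions above, and the only Selenium calls used are `webdriver.ChromeOptions()` and its `add_argument` method with Chrome's `--proxy-server` flag. Note that this flag carries only the proxy address: Chrome does not read credentials from it, so the username and password still have to be supplied separately.

```python
def proxy_username(customer_id, zone):
    """Build the proxy username in the lum-customer-CUSTOMER-zone-YOURZONE form."""
    return f"lum-customer-{customer_id}-zone-{zone}"


def proxy_server_argument(host, port, scheme="http"):
    """Build the Chrome command-line flag that routes traffic through a proxy."""
    return f"--proxy-server={scheme}://{host}:{port}"


def chrome_options_with_proxy(host, port):
    """Return ChromeOptions that point Chrome at the given proxy (no credentials included)."""
    from selenium import webdriver  # Imported here so the helpers above work without Selenium

    options = webdriver.ChromeOptions()
    options.add_argument(proxy_server_argument(host, port))
    return options
```

A driver could then be started with `webdriver.Chrome(options=chrome_options_with_proxy('zproxy.lum-superproxy.io', 22225))`.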

Louis Ruggeri | Support Manager

Lou is a Support Manager at Bright Data with a voracious appetite for knowledge, especially when it comes to data. In his spare time, he enjoys a good book, the occasional taco, and blogging about data collection.