In this article we will discuss:
- Selenium: What it is, and how it is used
- A step-by-step guide to scraping with Selenium
- Integrating proxies with Selenium
The corresponding GitHub repository for this article can be found here.
Selenium: What It Is, and How It Is Used
Selenium is an open-source project that provides a suite of tools and libraries for browser automation, including:
- Performing browser and page-element actions and retrieving page information (e.g. close, back, get_cookie, get_screenshot_as_png, get_window_size)
- Site testing
- Handling alert prompts and managing cookies (adding/removing)
- Form element submission
- Data collection/web scraping
It is compatible with most major browsers, including Firefox, Chrome, Safari, Edge, and Internet Explorer, and it can be used to write tests in a variety of programming languages such as Python, JavaScript (Node.js), C#, Java, and PHP.
For your convenience, I have included a link to the official Selenium 4.1.5 documentation library.
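To make the list of actions above more concrete, here is a minimal sketch (not taken from the original article) that exercises a few of them in Python, assuming you already have a working Firefox/geckodriver setup (drivers are covered in the guide below); the target URL is just a placeholder:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.example.com')

print(driver.get_window_size())              # e.g. {'width': 1280, 'height': 720}
print(driver.get_cookies())                  # list of cookies set by the page
png_bytes = driver.get_screenshot_as_png()   # screenshot as raw PNG bytes

driver.back()                                # navigate back in browser history
driver.quit()                                # close the browser and end the session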
Puppeteer vs. Selenium
For those of you weighing Puppeteer vs. Selenium – Puppeteer may serve you better if you plan on focusing mainly on JavaScript and Chrome. Selenium, on the other hand, may be the better choice if you need to work across multiple browsers, whether to test browser applications or to perform web data collection.
A Step-By-Step Guide to Scraping With Selenium
Step One: Install Selenium
If you already have pip (i.e. the package installer for Python) on your computer, all you need to do is open a terminal and run:
pip install -U selenium
Otherwise, you can download the source distribution from PyPI, unarchive it, and run:
python setup.py install
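Either way, you can quickly confirm that the installation worked (this check is not part of the original article) by printing the installed version:
python -c "import selenium; print(selenium.__version__)"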
Do note that you will need a driver so that Selenium can interface with your browser of choice – for example, geckodriver for Firefox, ChromeDriver for Chrome, EdgeDriver for Microsoft Edge, and safaridriver for Safari. Links to each of these are available in the official Selenium documentation.
Let’s use Firefox as an example browser. The script below opens Firefox, navigates to a web page (Yahoo, in this case), searches for “seleniumhq”, and then closes the browser. Here’s what that looks like in code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get('http://www.yahoo.com')
assert 'Yahoo' in browser.title
elem = browser.find_element(By.NAME, 'p') # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)
browser.quit()
Step Two: Importing supporting packages
Selenium is rarely used in isolation; it is typically combined with other packages such as pandas (an easy-to-use, open-source data analysis library). We also import Selenium’s Service and By helpers here, since they are needed in the steps that follow. Here is what you should be typing in to accomplish this:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
Step Three: Defining variables
In this step we will define our target folder, search query, and browser driver. In this example we will be mapping the job opportunities that competing companies display on LinkedIn. What you type in should look something like this:
FILE_PATH_FOLDER = 'F:....Competitive_Analysis'
search_query = 'https://www.linkedin.com/q-chief-financial-officer-jobs.html'
driver = webdriver.Chrome(service=Service('C:/.../chromedriver_win32/chromedriver.exe'))
job_details = []
Step Four: HTML tag inspection
Web pages typically attach identifying attributes (IDs, classes, or data attributes) to the HTML tags that hold the information displayed on any given target site. The technique here is to leverage these attributes in order to locate and crawl the data on the target page. You can accomplish this by:
- Right-clicking anywhere on the page and selecting ‘Inspect’
- Then either clicking the arrow that appears at the top left-hand corner of the developer tools, or pressing Ctrl+Shift+C, in order to inspect a specific element and obtain the desired HTML tag
Here’s what that looks like:
driver.get(search_query)
time.sleep(5)
job_list = driver.find_elements(By.XPATH, "//div[@data-tn-component='organicJob']")
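The time.sleep(5) call simply gives the page a fixed amount of time to load. A more robust alternative (my suggestion, not part of the original article) is an explicit wait that blocks until the job cards are actually present:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one job card to appear before scraping
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//div[@data-tn-component='organicJob']"))
)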
Step Five: Specific data point extraction
We will extract our target data points by calling the driver’s find_elements method with By.XPATH locators, and then quit the driver (which also closes the browser) once the target data has been collected.
We will target data points as follows:
- Job title
- Company
- Job location
- Job description
- Date job was uploaded
Here’s what that looks like:
for each_job in job_list:
    # Getting job info
    job_title = each_job.find_elements(By.XPATH, ".//h2[@class='title']/a")[0]
    job_company = each_job.find_elements(By.XPATH, ".//span[@class='company']")[0]
    job_location = each_job.find_elements(By.XPATH, ".//span[@class='location accessible-contrast-color-location']")[0]
    job_summary = each_job.find_elements(By.XPATH, ".//div[@class='summary']")[0]
    job_publish_date = each_job.find_elements(By.XPATH, ".//span[@class='date ']")[0]
    # Saving job info
    job_info = [job_title.text, job_company.text, job_location.text, job_summary.text, job_publish_date.text]
    # Saving into job_details
    job_details.append(job_info)
driver.quit()
Please note that the target site can change these selectors at any time, so always confirm that the selectors in question are still correct rather than assuming they are.
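One simple way to guard against a changed selector (my own sketch, not part of the original code) is to check that each lookup actually returned something before indexing into it:
# Hypothetical helper: returns the element's text, or an empty string if the
# selector no longer matches anything on the page.
def safe_text(parent, xpath):
    matches = parent.find_elements(By.XPATH, xpath)
    return matches[0].text if matches else ''

# Example usage inside the loop above:
# job_title_text = safe_text(each_job, ".//h2[@class='title']/a")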
Step Six: Saving the data in preparation for output
At this point you will want to add column names to the data frame and use the ‘to_csv’ method in order to save all of the obtained data in CSV format, as follows:
job_details_df = pd.DataFrame(job_details)
job_details_df.columns = ['title', 'company', 'location', 'summary', 'publish_date']
job_details_df.to_csv(FILE_PATH_FOLDER + '/job_details.csv', index=False)
Your CSV file will be saved to the folder defined in FILE_PATH_FOLDER.
That’s it, you have just successfully completed your first web scraping job with Selenium.
Integrating proxies with Selenium
By integrating proxies into your Selenium-built scraper you can:
- Bypass site-specific geo-restrictions
- Avoid blocks, bans, and CAPTCHAs
- Ensure you are not served misleading information
Get started by creating a Bright Data account and choosing a proxy network type. Then head over to Selenium and enter the ‘Proxy IP:Port’ in the ‘setProxy’ function – for example, zproxy.lum-superproxy.io:22225 – for both HTTP and HTTPS.
Under ‘sendKeys’, input your Bright Data account ID and proxy zone name, in the format lum-customer-CUSTOMER-zone-YOURZONE, along with your zone password, which can be found in the zone settings.
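The exact fields above depend on the Bright Data dashboard. In Python code, one common way to route an authenticated proxy through a Selenium script (a sketch using the third-party selenium-wire package, not the original article’s approach; the credentials below are placeholders) looks like this:
# pip install selenium-wire
from seleniumwire import webdriver  # drop-in replacement for selenium's webdriver

# Placeholder credentials -- substitute your own zone username and password
proxy_user = 'lum-customer-CUSTOMER-zone-YOURZONE'
proxy_pass = 'YOUR_ZONE_PASSWORD'
proxy_host = 'zproxy.lum-superproxy.io:22225'

options = {
    'proxy': {
        'http': f'http://{proxy_user}:{proxy_pass}@{proxy_host}',
        'https': f'http://{proxy_user}:{proxy_pass}@{proxy_host}',
        'no_proxy': 'localhost,127.0.0.1',
    }
}

driver = webdriver.Chrome(seleniumwire_options=options)
driver.get('https://httpbin.org/ip')  # should report the proxy's IP, not yours
print(driver.page_source)
driver.quit()
selenium-wire handles the proxy authentication for you; plain Chrome options (--proxy-server) only accept an unauthenticated host:port.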