In this article we will discuss:
- Selenium: What it is, and how it is used
- A step-by-step guide to scraping with Selenium
- Integrating proxies with Selenium
The corresponding GitHub repository for this article can be found here.
Selenium: What It Is, and How It Is Used
Selenium is open-source software comprising a variety of tools and libraries that enable browser automation activities, including:
- Web page-based element actions/retrieval (e.g. close, back, get_cookie, get_screenshot_as_png, get_window_size)
- Site testing
- Managing alert prompts and cookies (adding/removing)
- Form element submission
- Data collection/web scraping
For your convenience, I have included a link to the official Selenium 4.1.5 documentation library.
A Step-By-Step Guide to Scraping With Selenium
Step One: Install Selenium
For those of you who have pip (i.e. the package installer for Python) on your computers, all you need to do is open a terminal and run:
pip install -U selenium
Otherwise, you can download the Selenium source archive from PyPI, unarchive it, and run:
python setup.py install
Do note that you will need a driver so that Selenium can interface with your browser of choice. Here are links to some of the most popular browser drivers for your convenience:
Let’s use Firefox as an example browser. In this example we will open Firefox, go to a web page (say, Yahoo), search for “seleniumhq”, and then close the browser. Here’s what that would look like in code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()
browser.get('http://www.yahoo.com')
assert 'Yahoo' in browser.title

elem = browser.find_element(By.NAME, 'p')  # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)

browser.quit()
Step Two: Importing Supporting Packages
Selenium is typically not used in isolation but rather in tandem with other programs, for example Pandas (an easy-to-use, open-source data analysis tool). To import the supporting packages, type:
from selenium import webdriver
import time
import pandas as pd
Step Three: Defining Variables
In this step we will define our target folder, search query, and target site. In this example we will be aiming to map different job opportunities as displayed by competing companies on LinkedIn. What you type in should look something like this:
FILE_PATH_FOLDER = 'F:....Competitive_Analysis'
search_query = 'https://www.linkedin.com/q-chief-financial-officer-jobs.html'
driver = webdriver.Chrome(executable_path='C:/.../chromedriver_win32/chromedriver.exe')
job_details = []
Step Four: HTML Tag Inspection
Sites typically attach identifying attributes to the HTML tags that wrap each piece of information displayed on the page. The technique here is to leverage this property in order to crawl the target site at hand. You can accomplish this by:
- Right-clicking anywhere on the page and hitting ‘Inspect’
- Then either clicking the arrow icon that appears in the top left-hand corner of the panel, or pressing Ctrl+Shift+C, to inspect a specific element and obtain the desired HTML tag
Here’s what that looks like:
driver.get(search_query)
time.sleep(5)
job_list = driver.find_elements(By.XPATH, "//div[@data-tn-component='organicJob']")
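Before wiring an XPath expression into Selenium, you can sanity-check it against a small HTML fragment using only the Python standard library. This is a browser-free sketch: the fragment below is invented for illustration, and ElementTree supports only a limited XPath subset, which happens to cover relative selectors like these:

```python
import xml.etree.ElementTree as ET

# An invented fragment resembling what DevTools might show for one job card:
fragment = """
<div data-tn-component="organicJob">
  <h2 class="title"><a>Chief Financial Officer</a></h2>
  <span class="company">Acme Corp</span>
</div>
"""

card = ET.fromstring(fragment)
# Mirrors the relative selectors used in the extraction step:
title = card.find(".//h2[@class='title']/a")
company = card.find(".//span[@class='company']")
print(title.text, "-", company.text)  # Chief Financial Officer - Acme Corp
```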
Step Five: Specific Data Point Extraction
We will extract our target data points by running an XPath query against each result in the list, and then quit the driver (which also closes the browser) once the target data has been collected.
We will target data points as follows:
- Job title
- Company name
- Job location
- Job description
- Date the job was posted
Here’s what that looks like:
for each_job in job_list:
    # Getting job info
    job_title = each_job.find_element(By.XPATH, ".//h2[@class='title']/a")
    job_company = each_job.find_element(By.XPATH, ".//span[@class='company']")
    job_location = each_job.find_element(By.XPATH, ".//span[@class='location accessible-contrast-color-location']")
    job_summary = each_job.find_element(By.XPATH, ".//div[@class='summary']")
    job_publish_date = each_job.find_element(By.XPATH, ".//span[@class='date ']")
    # Saving job info
    job_info = [job_title.text, job_company.text, job_location.text,
                job_summary.text, job_publish_date.text]
    # Saving into job_details
    job_details.append(job_info)

# Quit the driver (this also closes the browser) once all data is collected
driver.quit()
Please note that the target site can change these selectors at any time, so always verify that the selectors in question still match the page rather than assuming they do.
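One defensive pattern is to query with find_elements (plural), which returns a possibly empty list instead of raising NoSuchElementException, and to fall back to a default value when a selector no longer matches. The safe_text helper below is a sketch of this idea, not part of Selenium, and is demonstrated here with a stand-in object so no browser is needed:

```python
def safe_text(elements, default=""):
    """Return the .text of the first matched element, or `default`
    when the selector matched nothing (i.e. the list is empty)."""
    return elements[0].text if elements else default

# Browser-free illustration with a stand-in element object:
class FakeElement:
    text = "Chief Financial Officer"

print(safe_text([FakeElement()]))  # Chief Financial Officer
print(repr(safe_text([])))         # ''
```

In the extraction loop you would pass it a query such as each_job.find_elements(By.XPATH, ".//h2[@class='title']/a"), so a vanished selector yields an empty string instead of crashing the run.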
Step Six: Saving the Data in Preparation for Output
At this point you will want to add column names to the data frame and make use of the ‘to_csv’ method in order to save all of the obtained data in CSV format, as follows:
job_details_df = pd.DataFrame(job_details)
job_details_df.columns = ['title', 'company', 'location', 'summary', 'publish_date']
job_details_df.to_csv(FILE_PATH_FOLDER + '/job_details.csv', index=False)
Your desired CSV file will be saved to the following location: FILE_PATH_FOLDER
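To see the shape of the resulting CSV without running a browser, here is the same DataFrame step with one invented dummy row in the field order the scraper collects:

```python
import pandas as pd

# One invented row: title, company, location, summary, publish date
job_details = [
    ["Chief Financial Officer", "Acme Corp", "New York, NY",
     "Lead the finance organization.", "2022-05-01"],
]

job_details_df = pd.DataFrame(job_details)
job_details_df.columns = ['title', 'company', 'location', 'summary', 'publish_date']

# to_csv with no path returns the CSV text instead of writing a file:
csv_text = job_details_df.to_csv(index=False)
print(csv_text)
```

Note that fields containing commas (like the location) are automatically quoted in the output.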
That’s it, you have just successfully completed your first web scraping job with Selenium.
Integrating Proxies With Selenium
By integrating proxies into your Selenium-built scraper you can:
- Bypass site-specific geo-restrictions
- Avoid blocks, bans, and CAPTCHAs
- Ensure you are not served misleading information
Get started by creating a Bright Data account and choosing a proxy network type. Then head over to Selenium and fill in the proxy ‘IP:Port’ in the ‘setProxy’ function, for example zproxy.lum-superproxy.io:22225, for both HTTP and HTTPS.
Under ‘sendKeys’, input your Bright Data account ID and proxy Zone name (lum-customer-CUSTOMER-zone-YOURZONE) along with your Zone password, which can be found in the Zone settings.
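The pieces above can be composed into a single authenticated proxy URL. This is a sketch: the customer ID, zone name, and password are placeholders you would replace with your own Zone settings, while the host, port, and username format mirror those shown above:

```python
# Placeholders -- substitute your own Bright Data account values:
customer_id = "CUSTOMER"
zone_name = "YOURZONE"
zone_password = "ZONE_PASSWORD"

proxy_host = "zproxy.lum-superproxy.io"
proxy_port = 22225

# Username format described above: lum-customer-CUSTOMER-zone-YOURZONE
proxy_username = f"lum-customer-{customer_id}-zone-{zone_name}"

# Full authenticated proxy URL for tools that accept inline credentials:
proxy_url = f"http://{proxy_username}:{zone_password}@{proxy_host}:{proxy_port}"
print(proxy_url)
```

Keep in mind that Chrome’s plain --proxy-server flag ignores inline credentials, so authenticated proxies are typically handled via a browser extension or a wrapper library rather than ChromeOptions alone.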