In this guide, we are going to cover:
- Getting Started with Web Scraping in R
- Deep-Dive into Web Scraping in R: Tutorial
- Scaling to Multiple URLs
- Next Step: Pre-Built vs. Self-Built?
Getting Started with Web Scraping in R
The first step is to understand which tools we are going to use in this R tutorial.
Understanding the Tools: R and rvest
R is a rich, easy-to-use language for statistical analysis and data visualization. It offers useful tools for data wrangling and features dynamic typing.
rvest (from "harvest") is one of the most popular R packages for web scraping, thanks in no small part to its extremely user-friendly interface. Vanilla rvest lets you extract data from a single web page, which is perfect for an initial exploration. You can extend it afterward with the polite library to scrape multiple pages.
Setting Up the Dev Environment
If you’re not already using R within RStudio, follow the instructions here for installation.
Once done, open the console and install rvest:
install.packages("rvest")
rvest is part of the tidyverse collection, and it is officially recommended to further extend its built-in functionality with other packages in the collection, such as magrittr for code readability or xml2 for working with HTML and XML. You can do this by installing tidyverse directly:
install.packages("tidyverse")
Understanding the Web Page
Web scraping is a technique for retrieving data from websites through compliant automated processes.
Three important considerations come from this definition:
- Data comes in various formats.
- Websites display information in very different ways.
- Scraped data needs to be legally accessible.
To understand how to scrape a URL, you first need to understand how the content of the web page is displayed via the HTML markup language and the CSS sheet style language.
HTML provides the content and structure of the web page (loaded into the web browser to create a tree-like Document Object Model, or DOM) by organizing content with "tags."
Tags have a hierarchical structure, with each tag applying a specific functionality to all content contained between its opening (<tag>) and closing (</tag>) statements:
<!DOCTYPE html>
<html lang="en-gb" class="a-ws a-js a-audio a-video a-canvas a-svg a-drag-drop a-geolocation a-history a-webworker a-autofocus a-input-placeholder a-textarea-placeholder a-local-storage a-gradients a-transform3d -scrolling a-text-shadow a-text-stroke a-box-shadow a-border-radius a-border-image a-opacity a-transform a-transition a-ember" data-19ax5a9jf="dingo" data-aui-build-date="3.22.2-2022-12-01">
▶<head>..</head>
▶<body class="a-aui_72554-c a-aui_accordion_a11y_role_354025-c a-aui_killswitch_csa_logger_372963-c a-aui_launch_2021_ally_fixes_392482-t1 a-aui_pci_risk_banner_210084-c a-aui_preload_261698-c a-aui_rel_noreferrer_noopener_309527-c a-aui_template_weblab_cache_333406-c a-aui_tnr_v2_180836-c a-meter-animate" style="padding-bottom: 0px;">..</body>
</html>
The <html> tag is the minimal component of any web page, with the <head> and <body> tags nested inside it. The <head> and <body> tags are themselves "parents" of other tags within them, with <div> (for a document section) and <p> (for a paragraph) being some of their most common "children."
In the snippet above, you can see the "attributes" associated with each HTML "element": lang, class, and style are pre-built, while the attributes starting with data- are custom to Amazon. class is of particular interest for web scraping, together with the id attribute, as they allow you to target a group of elements and a specific element, respectively. They were originally intended for styling in CSS.
CSS provides the styling of the web page. From coloring to positioning and sizing, you can select any HTML element and assign new values to its styling properties. You can also apply CSS styling inline within an HTML element via the style attribute, as you saw in the snippet above:
<body .. style="padding-bottom: 0px;">
In pure CSS, this would be written as:
body {padding-bottom: 0px;}
Here, body is the "selector," padding-bottom is the "property," and 0px is the "value."
Any tag, class, or id can be used as a CSS selector.
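For example, reusing the class and id values from the Amazon snippets in this guide, the three selector styles look like this:

p                                     /* tag selector: every <p> element */
.review-title-content                 /* class selector: every element with this class */
#customer_review-R2U9LWUSIPY0GS       /* id selector: the single element with that id */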
Users can dynamically interact with the content displayed on a web page via functionality provided by the JavaScript programming language, loaded through the script tag. After a user interaction, the displayed content may change and new content may appear; advanced web scrapers can mimic user interactions, as we'll discuss later.
Understanding DevTools
Major web browsers provide built-in developer tools that allow for the collection and live updating of technical information on a web page for logging, debugging, testing, and performance analysis. For this tutorial, we’ll be using Chrome’s DevTools.
Developer Tools are accessible from the upper-right corner of the browser, in More Tools:
In DevTools, you can scroll through the raw HTML in the Elements tab. As you scroll through any of the HTML lines, you’ll see the corresponding element rendered in the web page highlighted in blue:
Conversely, you can click on the icon in the top-left corner and select any rendered element from the web page to be redirected to its raw HTML counterpart, again highlighted in blue.
These two processes are all you need to extract the CSS Descriptors for our hands-on tutorial.
Deep-Dive into Web Scraping in R: Tutorial
In this section, we’ll explore how to web scrape the Amazon URL to extract product reviews.
Prerequisites
Ensure you have the following installed in your RStudio environment:
- R = 4.2.2
- rvest = 1.0.3
- tidyverse = 1.3.2
Interactively Exploring the Web Page
You can use Chrome’s DevTools to explore the HTML of your URL and create a list of all the classes and IDs of the HTML elements that contain the information we’re interested in scraping, i.e., the product reviews:
Each customer review belongs to a div with an id in the format customer_review-$INTERNAL_ID.
The HTML content of the div corresponding to the customer review in the screenshot above is the following:
<div id="customer_review-R2U9LWUSIPY0GS" class="a-section celwidget" data-csa-c-id="kj23dv-axnw47-69iej3-apdvzi" data-cel-widget="customer_review-R2U9LWUSIPY0GS">
<div data-hook="genome-widget" class="a-row a-spacing-mini">..</div>
<div class="a-row">
<a class="a-link-normal" title="4.0 out of 5 stars" href="https://www.amazon.co.uk/gp/customer-reviews/R2U9LWUSIPY0GS/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07SR4R8K1">
<i data-hook="review-star-rating" class="a-icon a-icon-star a-star-4 review-rating">
<span class="a-icon-alt">4.0 out of 5 stars</span>
</i>
</a>
<span class="a-letter-space"></span>
<a data-hook="review-title" class="a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold" href="https://www.amazon.co.uk/gp/customer-reviews/R2U9LWUSIPY0GS/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07SR4R8K1">
<span>Very good controller if a little overpriced</span>
</a>
</div>
<span data-hook="review-date" class="a-size-base a-color-secondary review-date">..</span>
<div class="a-row a-spacing-mini review-data review-format-strip">..</div>
<div class="a-row a-spacing-small review-data">
<span data-hook="review-body" class="a-size-base review-text">
<div data-a-expander-name="review_text_read_more" data-a-expander-collapsed-height="300" class="a-expander-collapsed-height a-row a-expander-container a-expander-partial-collapse-container" style="max-height:300px">
<div data-hook="review-collapsed" aria-expanded="false" class="a-expander-content reviewText review-text-content a-expander-partial-collapse-content">
<span>In all honesty I'm not sure why the price is quite as high ….</span>
</div>
…</div>
…</span>
…</div>
…</div>
Each piece of content you're interested in for the customer reviews has its own unique class: review-title-content for the title, review-text-content for the body, and review-rating for the rating.
You could check that each class is unique in the document and use it directly as a "simple selector." A more foolproof approach is to use the full CSS descriptor instead, which will remain unique even if the class is assigned to new elements in the future.
Simply retrieve the CSS descriptor by right-clicking the element in DevTools and selecting Copy > Copy selector:
You can define your three selectors as:
- #customer_review-R2U9LWUSIPY0GS > div:nth-child(2) > a.a-size-base.a-link-normal.review-title.a-color-base.review-title-content.a-text-bold > span for the title
- #customer_review-R2U9LWUSIPY0GS > div.a-row.a-spacing-small.review-data > span > div > div.a-expander-content.reviewText.review-text-content.a-expander-partial-collapse-content > span for the body
- #customer_review-R2U9LWUSIPY0GS > div:nth-child(2) > a:nth-child(1) > i.review-rating > span for the rating*

* .review-rating was manually added for better consistency.
CSS Selector vs. XPath for Web Scraping
For this tutorial, we've decided to use CSS selectors to identify elements for web scraping. The other common approach is XPath (the XML Path Language), which identifies an element through its full path in the DOM.
You can extract the full XPath following the same procedure as for the CSS selector. For example, the review title is:
/html/body/div[2]/div[3]/div[6]/div[32]/div/div/div[2]/div/div[2]/span[2]/div/div/div[3]/div[3]/div/div[1]/div/div/div[2]/a[2]/span
A CSS selector is slightly faster, while XPath has slightly better backward compatibility. Outside of these small differences, selecting one over the other depends more on personal preferences than technical implications.
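To illustrate the difference, here is a minimal sketch of how both approaches look in rvest. It assumes the page has already been read into HTMLContent with read_html() (covered in the next section), and the simplified selectors target the review-title element analyzed above:

library(rvest)

# CSS selector: match by class, then take the nested <span>
titles_css <- HTMLContent %>%
  html_nodes(css = ".review-title-content span") %>%
  html_text()

# XPath: same element, expressed as a path with a predicate on the class attribute
titles_xpath <- HTMLContent %>%
  html_nodes(xpath = "//a[contains(@class, 'review-title-content')]/span") %>%
  html_text()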
Programmatically Extracting Information from the Web Page
While we could use the Console directly to start exploring how to web scrape the URL, we'll instead create a script for traceability and reproducibility and run it via the Console using the source() command.
After creating the script, the first step is to load the installed libraries:
library("rvest")
library("tidyverse")
Then, you can programmatically extract the content you’re interested in as follows. First, create a variable where you’ll store the URL to search:
HtmlLink <- "https://www.amazon.co.uk/Xbox-Elite-Wireless-Controller-2/dp/B07SR4R8K1/ref=sr_1_1_sspa?crid=3F4M36E0LDQF3"
Next, extract the Amazon Standard Identification Number (ASIN) from the URL to use it as a unique product ID:
ASIN <- str_match(HtmlLink, "/dp/([A-Za-z0-9]+)/")[,2]
Using RegEx to clean the text extracted via web scraping is common and recommended to ensure data quality.
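For the URL above, the captured group is the product code that follows /dp/:

print(ASIN)
[1] "B07SR4R8K1"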
Now, download the HTML content of the web page:
HTMLContent <- read_html(HtmlLink)
The read_html() function is part of the xml2 package.
If you print() the content, you'll see that it matches the raw HTML structure analyzed previously:
{html_document}
<html lang="en-gb" class="a-no-js" data-19ax5a9jf="dingo">
[1] <head>\n<meta http-equiv="Content-Type" content="text/ht ...
[2] <body class="a-aui_72554-c a-aui_accordion_a11y_role_354 ...
You can now extract the three nodes of interest for all product reviews on the page. Use the CSS descriptors provided by Chrome's DevTools, modified to remove the specific customer review identifier #customer_review-R2U9LWUSIPY0GS and the ">" connectors from the string. You can also take advantage of the html_nodes() and html_text() functionalities of rvest to save the HTML content in separate objects.
The following commands will extract the review titles:
review_title <- HTMLContent %>%
html_nodes("div:nth-child(2) a.a-size-base.a-link-normal.review-title.a-color-base.review-title-content.a-text-bold span") %>%
html_text()
An example of an entry in review_title is "Very good controller if a little overpriced".
The code below will extract the review body:
review_body <- HTMLContent %>%
html_nodes("div.a-row.a-spacing-small.review-data span div div.a-expander-content.reviewText.review-text-content.a-expander-partial-collapse-content span") %>%
html_text()
An example of an entry in review_body begins with "In all honesty I'm not sure why the price…".
And you can use the following commands to extract the review rating:
review_rating <- HTMLContent %>%
html_nodes("div:nth-child(2) a:nth-child(1) i.review-rating span") %>%
html_text()
An example of an entry in review_rating is "4.0 out of 5 stars".
To improve the quality of this variable, extract only the rating "4.0" and convert it to an integer:
review_rating <- substr(review_rating, 1, 3) %>% as.integer()
The pipe functionality %>% is provided by the magrittr toolkit.
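If you'd rather not rely on fixed character positions, a minimal alternative sketch is to use readr's parse_number() (also part of the tidyverse), which pulls the leading number out of a string of any length:

# Alternative to the substr() line above: parse_number() extracts "4.0"
# from "4.0 out of 5 stars" before the conversion to integer.
review_rating <- review_rating %>% parse_number() %>% as.integer()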
Now it's time to export the scraped content into a tibble for data analysis.
tibble is an R package that also belongs to the tidyverse collection and is used to manipulate and print data frames.
df <- tibble(review_title, review_body, review_rating)
The output dataframe is as follows:
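To sanity-check the result or persist it for downstream analysis, you can print a summary of the tibble and write it to disk; the file name here is just an illustrative choice:

# glimpse() (dplyr) prints one line per column with a preview of its values
glimpse(df)

# write_csv() (readr) saves the reviews for later use; pick any path you like
write_csv(df, "amazon_reviews.csv")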
Finally, it's a good idea to refactor the code into a function, scrape_amazon <- function(HtmlLink), to comply with best practices and to prepare the code for scaling to multiple URLs.
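A minimal sketch of that refactoring, reusing the exact steps from above (the ASIN column is added to the output so rows stay identifiable once you scrape several products):

scrape_amazon <- function(HtmlLink) {
  # Extract the product ID from the URL
  ASIN <- str_match(HtmlLink, "/dp/([A-Za-z0-9]+)/")[,2]

  # Download the HTML content of the page
  HTMLContent <- read_html(HtmlLink)

  review_title <- HTMLContent %>%
    html_nodes("div:nth-child(2) a.a-size-base.a-link-normal.review-title.a-color-base.review-title-content.a-text-bold span") %>%
    html_text()

  review_body <- HTMLContent %>%
    html_nodes("div.a-row.a-spacing-small.review-data span div div.a-expander-content.reviewText.review-text-content.a-expander-partial-collapse-content span") %>%
    html_text()

  review_rating <- HTMLContent %>%
    html_nodes("div:nth-child(2) a:nth-child(1) i.review-rating span") %>%
    html_text() %>%
    substr(1, 3) %>%
    as.integer()

  tibble(ASIN, review_title, review_body, review_rating)
}

df <- scrape_amazon(HtmlLink)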
Scaling to Multiple URLs
Once the web scraping template has been created, you can build a list of URLs for all the top competitors' products on Amazon via web crawling and scraping.
When scaling to multiple URLs to productionize the solution, you need to outline the application’s technical requirements.
Having well-defined technical requirements will ensure you correctly support business requirements and seamlessly integrate with your existing systems.
Depending on the specific technical requirements, the scraping function needs to be updated to support a combination of the following:
- Real-time or batch process
- Output format(s), such as JSON, NDJSON, CSV, or XLSX (a minimal export sketch follows this list)
- Output target(s), such as email, API, webhook, or cloud storage
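As a hedged sketch of two of those output formats, assuming the tibble produced earlier and the jsonlite package (the file names are arbitrary):

library(jsonlite)

# JSON: a single array of review objects
write_json(df, "reviews.json")

# NDJSON: one JSON object per line, convenient for streaming pipelines
stream_out(df, file("reviews.ndjson"))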
We've already mentioned that you can extend rvest with polite to scrape multiple web pages. polite creates and manages a web harvesting session through three main functions, in full compliance with the web host's robots.txt file and with built-in rate limiting and response caching:
- bow() creates the scraping session for a specific URL, i.e., it introduces you to the web host and asks for permission to scrape.
- scrape() accesses the HTML of the URL; you can pipe the function to html_nodes() and html_text() from rvest to retrieve specific content.
- nod() updates the session's URL to the next page, without the need to recreate a session.
Quoting directly from their website, “The three pillars of a polite session are seeking permission, taking slowly and never asking twice.”
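As a rough sketch of how the pieces fit together when scaling the Amazon scraper, assuming polite is installed and you already have a vector of product page paths (the one below is just the path from the example URL); note that scrape() will refuse pages that the site's robots.txt disallows:

# install.packages("polite") if needed
library(polite)

# Introduce yourself to the host once; polite reads robots.txt and sets a crawl delay
session <- bow("https://www.amazon.co.uk/")

# Hypothetical list of product pages to visit (relative paths on the same host)
product_paths <- c("/Xbox-Elite-Wireless-Controller-2/dp/B07SR4R8K1/")

review_titles <- lapply(product_paths, function(path) {
  nod(session, path) %>%          # move the session to the next page
    scrape() %>%                  # politely fetch and parse the HTML
    html_nodes(".review-title-content span") %>%
    html_text()
})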
Next Step: Pre-Built vs. Self-Built?
To develop a state-of-the-art web scraper that can extract good data for a business, a few capabilities need to be available:
- A team of data specialists with expertise in web data extraction
- A team of DevOps engineers with expertise in proxy management and anti-bot circumvention to get past CAPTCHAs and unlock less publicly accessible websites
- A team of data engineers with expertise in creating infrastructure for real-time and batch data extraction
- A team of legal experts to understand data protection legal requirements for privacy (such as GDPR and CCPA)
Content for the web comes in varied formats, and it is hard to find two websites with the exact same structure. The more complex a website is and the more features and data there are to scrape, the more advanced the programming knowledge required will be, not to mention the additional time and resources needed for the solution.
Typically, you’d want to at least implement the following advanced functionalities:
- Minimize the chances of CAPTCHAs and bot detection: A simple approach is adding a random Sys.sleep() between requests to avoid overloading web servers and producing regular request patterns. A more effective approach is setting a custom user agent and/or using proxy servers to spread requests across different IPs (see the sketch after this list).
- Scrape JavaScript-powered websites: In our Amazon example, the URL does not change when selecting a specific product variant. This is acceptable for scraping reviews, as they are shared across variants, but not for scraping product specifications. To mimic user interactions in dynamic web pages, you could use a tool such as RSelenium to automate the navigation of the web browser.
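A minimal sketch of the first point, using httr (the user-agent string and the delay bounds are arbitrary examples):

library(httr)
library(rvest)

# Random pause of 1 to 5 seconds so requests don't follow a regular pattern
Sys.sleep(runif(1, min = 1, max = 5))

# Fetch the page while identifying with a custom User-Agent header
response <- GET(HtmlLink, user_agent("my-r-scraper/0.1 (contact@example.com)"))

# Parse the response body with rvest as before
HTMLContent <- read_html(content(response, as = "text", encoding = "UTF-8"))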
If you want to access web data with limited resources, ensure data quality, or unlock more advanced use cases, a pre-built web scraper can be the right choice.
Bright Data’s Web Scraper provides templates for many websites powered by state-of-the-art functionalities, including a much more advanced implementation of the demoed Amazon Scraper!
Don't want to deal with data collection? Check out our datasets; free samples are available.