How to Scrape Websites with PhantomJS

Learn how to leverage headless web browsers to streamline your data collection operations, and explore fully automated alternatives
Daniel Shashko | SEO Specialist
18-Jul-2022

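In this article we will discuss:

  • Demystifying PhantomJS
  • The pros and cons of using PhantomJS for data crawling
  • A step-by-step PhantomJS data collection guide
  • Data automation: Easier alternatives to manual scraping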

Demystifying PhantomJS

PhantomJS is a ‘headless web browser’. This means that it has no Graphical User Interface (GUI). Instead, it is driven entirely by scripts, which makes it leaner and quicker than a regular browser. It can be used to automate different tasks using JavaScript (JS), such as testing code or collecting data.

For beginners, I would recommend first installing PhantomJS on your computer using ‘npm’ in your CLI. You can do this by running the following command:

npm install phantomjs -g 

Now the ‘phantomjs’ command will be available for you to use. 

The pros and cons of using PhantomJS for data crawling 

PhantomJS has many advantages, including being ‘headless’ which, as I explained above, makes it quicker as no graphics need to be loaded in order to test or retrieve information. 

PhantomJS can be used efficiently to accomplish the following:

Screen capture 

PhantomJS can help automate the process of capturing and saving pages as PNGs, JPEGs, and even GIFs. This makes front-end User Interface (UI) and User Experience (UX) quality assurance that much easier. For example, you can run a script from the command line with ‘phantomjs amazon.js’ in order to collect images of competitor product listings or to ensure that your company’s product listings are displayed properly.
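As a rough illustration of what such a capture script might contain, here is a minimal sketch. The file name ‘amazon.js’ comes from the example above, but its contents, the ‘screenshotName’ helper, the viewport size, and the default URL are all assumptions made for this illustration. The code sticks to ES5 syntax, which PhantomJS’s older JavaScript engine handles reliably.

```javascript
// amazon.js (hypothetical contents): run with `phantomjs amazon.js <url>`

// Hypothetical helper: turn a URL into a safe PNG file name.
function screenshotName(url) {
  var name = url
    .replace(/^https?:\/\//, '')        // drop the scheme
    .replace(/[^a-zA-Z0-9.-]+/g, '_')   // replace slashes and symbols
    .replace(/^_+|_+$/g, '');           // trim leading/trailing underscores
  return name + '.png';
}

// The `phantom` global only exists when the file is run by PhantomJS.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page = require('webpage').create();
  var url = system.args[1] || 'https://example.com/'; // placeholder URL

  // Assumed viewport; adjust to the layout you want to capture.
  page.viewportSize = { width: 1280, height: 800 };

  page.open(url, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + url);
      phantom.exit(1);
      return;
    }
    var file = screenshotName(url);
    page.render(file); // the output format follows the file extension
    console.log('Saved ' + file);
    phantom.exit(0);
  });
}
```

Changing the extension returned by the helper to ‘.jpg’ or ‘.gif’ should switch the output format, since PhantomJS picks the format from the file name.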

Page automation

This is a major PhantomJS advantage as it helps developers save a lot of time. By running commands like ‘phantomjs userAgent.js’, developers can write and check JS code in relation to a specific web page. The main time-saving advantage here is that this process can be automated and accomplished without ever having to open a browser.
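To make this more concrete, the sketch below shows one way a ‘userAgent.js’ script could look. It sets a custom user agent and then runs a small piece of JS inside the loaded page. The ‘pickUserAgent’ helper, the ‘ExampleBot/1.0’ string, and the target URL are placeholders chosen for this illustration, not part of PhantomJS itself.

```javascript
// userAgent.js (hypothetical contents): run with `phantomjs userAgent.js [user-agent]`

// Hypothetical helper: use the first command-line argument if present,
// otherwise fall back to a default user agent string.
function pickUserAgent(args, fallback) {
  return args.length > 1 && args[1] ? args[1] : fallback;
}

// The `phantom` global only exists when the file is run by PhantomJS.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page = require('webpage').create();

  console.log('Default user agent: ' + page.settings.userAgent);
  // The setting must be changed before the page is opened.
  page.settings.userAgent = pickUserAgent(system.args, 'ExampleBot/1.0');

  page.open('https://example.com/', function (status) {
    if (status !== 'success') {
      console.log('Failed to load the page');
      phantom.exit(1);
      return;
    }
    // The function passed to evaluate() runs inside the page itself.
    var info = page.evaluate(function () {
      return { title: document.title, userAgent: navigator.userAgent };
    });
    console.log('Title: ' + info.title);
    console.log('User agent seen by the page: ' + info.userAgent);
    phantom.exit(0);
  });
}
```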

Testing

PhantomJS is advantageous when testing websites, as it streamlines the process, much like other popular browser automation tools such as Selenium. Headless browsing with no GUI means that scanning for issues can happen faster, with errors being discovered and reported at the command line level.

Developers also integrate PhantomJS with different types of Continuous Integration (CI) systems in order to test code before it is live. This helps developers fix broken code in real-time, ensuring smoother live projects. 
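A CI job typically decides pass or fail from a script’s exit code, so a PhantomJS check can be as small as the sketch below. The file name ‘smoke-test.js’, the ‘exitCodeFor’ helper, the exit code values, and the local URL are assumptions for this example; a real project would more likely plug PhantomJS into its existing test runner.

```javascript
// smoke-test.js (hypothetical): run with `phantomjs smoke-test.js <url>`

// Hypothetical helper: 0 means pass, any other value fails the CI step.
function exitCodeFor(status, jsErrors) {
  if (status !== 'success') { return 2; }  // the page did not load
  if (jsErrors.length > 0) { return 1; }   // the page loaded but threw JS errors
  return 0;
}

// The `phantom` global only exists when the file is run by PhantomJS.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page = require('webpage').create();
  var url = system.args[1] || 'http://localhost:8080/'; // placeholder URL
  var jsErrors = [];

  // Called for every uncaught JavaScript error raised by the page.
  page.onError = function (message, trace) {
    jsErrors.push(message);
    console.log('JS error: ' + message);
  };

  page.open(url, function (status) {
    var code = exitCodeFor(status, jsErrors);
    console.log((code === 0 ? 'PASS ' : 'FAIL ') + url);
    phantom.exit(code);
  });
}
```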

Network monitoring / Data collection

PhantomJS can also be used to monitor network traffic/activity. Many developers program it in a way that can help to collect target data, such as:

  • The performance of a specific web page
  • When lines of code are added/removed
  • Stock price fluctuation data 
  • Influencer/engagement data when scraping sites like TikTok
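For the first item in the list above, page performance, a network monitoring script might look like the following sketch. It logs each request, records each completed response, and prints a short summary. The file name ‘netlog.js’, the ‘summarize’ helper, and the shape of its report are assumptions made for illustration only.

```javascript
// netlog.js (hypothetical): run with `phantomjs netlog.js <url>`

// Hypothetical helper: count responses, count failures (HTTP status >= 400),
// and compute the elapsed load time in milliseconds.
function summarize(responses, startedAt, finishedAt) {
  var failed = 0;
  for (var i = 0; i < responses.length; i++) {
    if (responses[i].status >= 400) { failed++; }
  }
  return {
    total: responses.length,
    failed: failed,
    loadTimeMs: finishedAt - startedAt
  };
}

// The `phantom` global only exists when the file is run by PhantomJS.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page = require('webpage').create();
  var url = system.args[1] || 'https://example.com/'; // placeholder URL
  var responses = [];
  var startedAt = Date.now();

  page.onResourceRequested = function (requestData) {
    console.log('-> ' + requestData.method + ' ' + requestData.url);
  };

  // This callback fires more than once per resource; keep only the 'end' stage.
  page.onResourceReceived = function (response) {
    if (response.stage === 'end') {
      responses.push({ url: response.url, status: response.status });
    }
  };

  page.open(url, function (status) {
    var report = summarize(responses, startedAt, Date.now());
    console.log('Load ' + status + ': ' + JSON.stringify(report));
    phantom.exit(status === 'success' ? 0 : 1);
  });
}
```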

Some downsides of using PhantomJS include:

  • It can be utilized by malicious parties to carry out automated attacks (mainly ‘thanks’ to the fact that it does not use a User Interface) 
  • It can sometimes prove tricky when it comes to full-cycle/end-to-end testing, and functional testing. 

A step-by-step PhantomJS data collection guide

PhantomJS is widely popular among NodeJS developers, so here is an example of how to use it in a NodeJS environment. The example shows how to fetch the HTML content from a URL.

Step One: Setup package.json and install npm packages 

Create a project folder and a file “package.json” in it.

{
  "name": "phantomjs-example",
  "version": "1.0.0",
  "title": "PhantomJS Example",
  "description": "PhantomJS Example",
  "keywords": [
    "phantom example"
  ],
  "main": "./index.js",
  "scripts": {
    "inst": "rm -rf node_modules && rm -f package-lock.json && npm install",
    "dev": "nodemon index.js"
  },
  "dependencies": {
    "phantom": "^6.3.0"
  }
}

Then run this command in your terminal: $ npm install. It will install Phantom in your local project folder “node_modules”.

Step Two: Create a PhantomJS script

Create a JS script and name it “index.js”:

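const phantom = require('phantom');

const main = async () => {
  // Start a PhantomJS process.
  const instance = await phantom.create();
  try {
    const page = await instance.createPage();
    // Log every resource the page requests while loading.
    await page.on('onResourceRequested', function(requestData) {
      console.info('Requesting', requestData.url);
    });

    const url = 'https://example.com/';
    console.log('URL::', url);

    // page.open resolves with 'success' or 'fail'.
    const status = await page.open(url);
    console.log('STATUS::', status);

    // The 'content' property holds the page's HTML.
    const content = await page.property('content');
    console.log('CONTENT::', content);
  } finally {
    // Always shut down the PhantomJS process, even if a step above throws.
    await instance.exit();
  }
};

main().catch(console.error);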

Step Three: Run the JS script

To start the script, run in your terminal: $ node index.js. The result will be HTML content.
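Once you have the HTML content as a string, the next step is usually to pull specific values out of it. The sketch below is one rough way to do that for the page title and link targets. The ‘extractTitle’ and ‘extractLinks’ helpers and the sample HTML string are made up for this illustration, and the regular expressions are a shortcut for a quick look rather than a replacement for a proper HTML parser.

```javascript
// Hypothetical helper: return the text inside <title>, or null if there is none.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Hypothetical helper: collect the href value of every <a> tag.
function extractLinks(html) {
  const links = [];
  const pattern = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Stand-in for the `content` value printed by index.js (not real output).
const sampleContent =
  '<html><head><title>Example Domain</title></head>' +
  '<body><a href="https://www.iana.org/domains/example">More information...</a></body></html>';

console.log('TITLE::', extractTitle(sampleContent));
console.log('LINKS::', extractLinks(sampleContent));
```

In the script from Step Two, the same helpers could be called with the ‘content’ variable instead of the sample string.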

Source: npmjs

Data automation: Easier alternatives to manual scraping

When it comes to scraping data at scale, some companies may prefer utilizing PhantomJS alternatives. 

These include:

  1. Proxies: Web scraping with proxies can be beneficial in that they enable users to collect data at scale by submitting a large number of concurrent requests. Proxies can also help address target site blockades such as rate limitations or geolocation-based blocks. In this instance, businesses can leverage country/city-specific Mobile and Residential IPs/devices in order to route data requests, enabling them to retrieve more accurate user-facing data (e.g. competitor pricing, ad campaigns, and Google search results).
  2. Ready-to-use Datasets: Datasets are essentially ‘informational packages’ that have already been collected and are ready to be delivered to algorithms/teams for immediate use. They typically include information from a target site and are enriched from relevant sites across the web (say, information on products in a relevant category among multiple vendors and a variety of eCom marketplaces). Datasets can also be refreshed periodically to ensure that all data points are up-to-date. The major advantage here is that zero time/resources are invested in the act of data collection, meaning that more time can be spent on data analysis and creating value for customers.
  3. Fully automated Data Collectors: Data Collector is a zero-code, zero-infrastructure, customizable data collection solution. Data collectors enable companies to be active participants in the data collection process without the headache of software/hardware development and maintenance. The three-step process includes choosing a target site and collection times (real-time or scheduled), deciding on an output format (JSON, CSV, HTML, or Microsoft Excel), and having data delivered to wherever is convenient for you (webhook, email, Amazon S3, Google Cloud, Microsoft Azure, SFTP, or API).
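If you want to try the proxy option from the list above together with the ‘phantom’ npm package used earlier, one possible approach is to pass PhantomJS’s proxy command-line flags when creating the instance. The sketch below assumes a hypothetical ‘proxyArgs’ helper and a placeholder proxy address; substitute the host, port, and credentials supplied by your own proxy provider.

```javascript
// Hypothetical helper: build PhantomJS command-line flags for a proxy.
// --proxy, --proxy-type and --proxy-auth are PhantomJS CLI options.
function proxyArgs(proxy) {
  const args = ['--proxy=' + proxy.host + ':' + proxy.port];
  if (proxy.type) {
    args.push('--proxy-type=' + proxy.type); // e.g. http or socks5
  }
  if (proxy.username) {
    args.push('--proxy-auth=' + proxy.username + ':' + proxy.password);
  }
  return args;
}

// Fetch a page's HTML through the given proxy.
async function fetchThroughProxy(url, proxy) {
  const phantom = require('phantom'); // loaded lazily, only when fetching
  const instance = await phantom.create(proxyArgs(proxy));
  try {
    const page = await instance.createPage();
    const status = await page.open(url);
    if (status !== 'success') {
      throw new Error('Failed to load ' + url);
    }
    return await page.property('content');
  } finally {
    await instance.exit(); // always stop the PhantomJS process
  }
}

// Placeholder proxy details (an assumption, not a working proxy).
const exampleProxy = { host: '127.0.0.1', port: 8080, type: 'http' };
console.log('PhantomJS flags:', proxyArgs(exampleProxy).join(' '));

// Only touch the network when a URL is passed: node proxy.js https://example.com/
if (process.argv[2]) {
  fetchThroughProxy(process.argv[2], exampleProxy)
    .then((html) => console.log(html.length + ' characters received'))
    .catch(console.error);
}
```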

Eager to put manual data collection techniques in your rear view mirror?

Daniel Shashko | SEO Specialist

Daniel is an SEO specialist here at Bright Data with a B2C background. He is in charge of ensuring that businesses get exposed to articles that help them become more data-driven. He is fascinated by the intricate inner workings that the digital world is comprised of and how these can be navigated for hypergrowth.