Web Scraping with PHP: a Step-By-Step Guide

Learn how to easily create and program your own simple web scraper in PHP, from scratch.
Daniel Shashko | SEO Specialist
04-Sep-2022

PHP (Hypertext Preprocessor) is a scripting language for web development that can also be used to collect web data. In this post, we will cover why to use PHP, how to get started with it, and how to build a simple web scraper in three steps.

Why use PHP

PHP currently powers ~40% of the web, including sites such as WordPress and Slack. It is one of the more popular server-side scripting languages for web development. It integrates closely with MySQL databases, is relatively easy to learn, and has good documentation and libraries that can reduce development time.

Getting started with PHP

This guide introduces a method of manual web scraping in which you send a bot to a web server and collect data using PHP as the foundational programming language, as opposed to using a fully automated data collection tool that can simplify and streamline the process.

The web scraper will function by sending an HTTP request to the server and then collecting the website’s code. We will then teach you how to parse the information retrieved. 

Here is an example of a code snippet that may appear in the heading of a website that you wish to scrape:

<html><body><h1>This is a heading!</h1></body></html>

Once retrieved, this code needs to be parsed so that the text can be read and understood by human analysts. In this example, post-parsing, you will be left with the following plain text:

‘This is a heading!’
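
As a minimal sketch of what "parsing" means here, the snippet below feeds the sample markup above to PHP's built-in strip_tags() function. This is only an illustration of the idea, not the method used in the rest of this guide; the variable names are assumptions.

```php
<?php
// The sample markup from above, stored as a plain string.
$html = '<html><body><h1>This is a heading!</h1></body></html>';

// strip_tags() removes every HTML tag and keeps the text between them.
$text = strip_tags($html);

echo $text . PHP_EOL; // prints: This is a heading!
```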

Before beginning, make sure that you have PHP installed on your computer.

Web scraping with PHP in 3 easy steps

Step One: Collecting your target website’s code 

Begin by typing in the following code:

<?php
$code = file_get_contents("http://quotes.toscrape.com");
?>

In terms of coding conventions:

  • “<?php” and “?>” are the opening and closing tags that mark where PHP code begins and ends.
  • The second line creates a variable called “$code” and assigns it the contents returned from the URL in question; in this example, we are targeting “http://quotes.toscrape.com”. The page’s HTML is now stored inside the “$code” variable.
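
file_get_contents() returns false when the request fails, so it can be worth checking the result before parsing it. The sketch below wraps the call in a hypothetical helper named fetchPage(); the function name, the timeout, and the user agent string are all assumptions chosen for illustration.

```php
<?php
/**
 * Hypothetical helper: fetch a URL (or local path) and return its contents,
 * or null when the request fails.
 */
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    // Options for the http wrapper; they are ignored when reading local files.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleScraper/1.0', // assumed value, pick your own
        ],
    ]);

    // file_get_contents() returns false on failure; @ hides the PHP warning
    // so that the caller can handle the error instead.
    $contents = @file_get_contents($url, false, $context);

    return $contents === false ? null : $contents;
}

// Usage (performs a real network request, so it is left commented out):
// $code = fetchPage("http://quotes.toscrape.com");
// if ($code === null) { exit("Could not download the page\n"); }
```

Returning null on failure keeps the calling code simple: a single comparison decides whether there is anything to parse.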


Step Two: Parsing the webpage 

This job aims to collect all the quotes from this website:

[Image: quotes text on a quotes website]

Right-click on your target page and click ‘view page source’; a new window with the source code will open. In our example, you will notice that all of the quotes are contained within <span> tags that have the class “text” and the itemprop attribute also set to “text”, as follows:

[Image: quote code example]
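
Before writing the parser, it can help to confirm that you have identified the right tag. The sketch below uses a regular expression, which is a different technique from the one used in the rest of this guide, on a made-up sample string modelled on the markup described above. The sample text and variable names are assumptions, not content copied from the site.

```php
<?php
// Assumed sample markup, modelled on the structure described above.
$sample = '<div class="quote"><span class="text" itemprop="text">An example quote.</span>'
        . '<span>by <small class="author">Someone</small></span></div>';

// Capture whatever sits between the opening quote tag and the next </span>.
$pattern = '/<span class="text" itemprop="text">(.*?)<\/span>/s';
preg_match_all($pattern, $sample, $matches);

// $matches[1] holds the captured text of every match.
print_r($matches[1]);
```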

We will begin by using PHP to strip out all undesired text in the code except for the first quote presented in the <span> tags, and then display it on our screen using the ‘echo’ statement:

<?php
$code = file_get_contents("http://quotes.toscrape.com");
$code = str_replace(">", "<>", $code);

$splitCode = explode("<", $code);

// Find the first occurrence of the opening tag of the quotes:
$openingTag = array_search('span class="text" itemprop="text"', $splitCode, true);

// Find the first occurrence of the closing tag of the quotes:
$closingTag = array_search('/span', $splitCode, true);

// Now, find the text in between the tags 
$i = $openingTag;
$total = "";
while ($i < $closingTag) {
	$total = $total . $splitCode[$i];
	$i = $i + 1;
}
$final = substr($total, 37);
echo $final;
?>

In line 3, it replaces all occurrences of “>” in the code with “<>”. This is so that the code can be split on “<” in line 5, which yields an array in which every tag and every piece of text is a separate element. In line 8, our program finds the location of the opening <span> tag, and in line 11 it finds the location of the </span> closing tag.

All it needs to do now is retrieve the text between these two positions. It accomplishes this by creating a variable called “$i” that starts at the position of the opening tag, along with a variable called “$total” to hold the result. On line 16, it begins to loop through each array element from the opening tag onward, appending the element to “$total” and then incrementing “$i”. Once it reaches the closing tag, the loop stops.

Next, it deletes the first 37 characters of the resulting string, because the string still starts with the contents of the <span> tag that we are parsing. Lastly, it prints the final result using the ‘echo’ statement.
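
To make the splitting step easier to follow, here is a small sketch that runs the same technique on a one-line sample instead of the live page. The sample string and the “$tag” variable are assumptions for illustration. Instead of a fixed offset, this version derives the offset from the length of the tag text plus one for the “>” that follows it.

```php
<?php
// Assumed one-line sample; the real page contains much more markup.
$code = '<span class="text" itemprop="text">Hello</span>';

$tag = 'span class="text" itemprop="text"';

// Put a "<" before every ">" so that splitting on "<" isolates each tag.
$splitCode = explode("<", str_replace(">", "<>", $code));
// $splitCode is now: ["", $tag, ">Hello", "/span", ">"]

$openingTag = array_search($tag, $splitCode, true);     // 1
$closingTag = array_search('/span', $splitCode, true);  // 3

// Join the elements from the opening tag up to (not including) the closing tag.
$total = implode("", array_slice($splitCode, $openingTag, $closingTag - $openingTag));

// Skip the tag text plus the ">" that follows it.
$final = substr($total, strlen($tag) + 1);

echo $final . PHP_EOL; // prints: Hello
```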

Once you run the program, it will look something like this:

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”

That is the first quote shown on the website that we are scraping without any of the ‘non-human-friendly’ code.

Step Three: Looping through

You may have noticed that it only collects the first occurrence and none after that. To fix this, we can simply delete the occurrences that we just returned and then repeat the process until we have retrieved them all. Additionally, we can simplify our code by putting the scraping process into a function so that it can be run whenever we need it. Try using this code:

<?php
$code = file_get_contents("http://quotes.toscrape.com");
$code = str_replace(">", "<>", $code);

$splitCode = explode("<", $code);

function parseCode($splitCode) {
	// Find the first occurrence of the opening tag of the quotes:
	$openingTag = array_search('span class="text" itemprop="text"', $splitCode, true);

	// Find the first closing tag after it, and store its position globally:
	$GLOBALS["closingTag"] = $openingTag + array_search('/span', array_slice($splitCode, $openingTag), true);

	// Now, find the text in between the tags
	$i = $openingTag;
	$total = "";
	while ($i < $GLOBALS["closingTag"]) {
		$total = $total . $splitCode[$i];
		$i = $i + 1;
	}
	echo "<p>" . substr($total, 37) . "</p>";
}

// Run the function, then update splitCode to delete the previous occurrence so
// that it can be repeated for the next quote, then loop through 3 times
// (You can change how many times):
parseCode($splitCode);
$splitCode = array_slice($splitCode, $GLOBALS["closingTag"] + 1);
parseCode($splitCode);
$splitCode = array_slice($splitCode, $GLOBALS["closingTag"] + 1);
parseCode($splitCode);
$splitCode = array_slice($splitCode, $GLOBALS["closingTag"] + 1);
parseCode($splitCode);

?>

You may have noticed that our previous code has been moved into a function called ‘parseCode’, which takes a parameter called ‘$splitCode’ so that it can access the code and then ‘echo’ the result. The ‘parseCode’ function is run on line 27, and then on line 28 our program trims the quote that was just returned from the array so that the process can be repeated. Lines 27 and 28 are then simply repeated so that each call finds the next quote.

Lastly, we store the position of the closing tag as a global variable via the ‘$GLOBALS’ superglobal so that it remains available outside the function, and on line 21 we wrap each result in <p> tags so that every quote appears on a new line. This is the result:

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”

“Try not to become a man of success. Rather become a man of value.”

The result is exactly what we were looking for. No code, just readable text. This process can be replicated for nearly any target site, such as scraping eBay for target data points such as product pricing, reviews, and SKUs (Stock Keeping Units). 
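
Rather than repeating the same two lines for every quote, the process can also be wrapped in a loop that runs until no opening tag is left. The sketch below is one possible way to do that, using the same split technique on an assumed sample string; the extractQuotes() helper and the sample markup are hypothetical and not part of the code above.

```php
<?php
/**
 * Hypothetical helper: return the text of every quote found in $html,
 * using the same split-on-"<" technique as above.
 */
function extractQuotes(string $html): array
{
    $tag = 'span class="text" itemprop="text"';
    $parts = explode("<", str_replace(">", "<>", $html));
    $quotes = [];

    // Keep going until no opening quote tag is left.
    while (($open = array_search($tag, $parts, true)) !== false) {
        // Drop everything before the opening tag so that it sits at index 0.
        $parts = array_slice($parts, $open);

        $close = array_search('/span', $parts, true);
        if ($close === false) {
            break; // malformed markup: no closing tag found
        }

        // The elements between the two tags hold the text, starting with ">".
        $text = implode("", array_slice($parts, 1, $close - 1));
        $quotes[] = substr($text, 1);

        // Continue searching after the closing tag.
        $parts = array_slice($parts, $close + 1);
    }

    return $quotes;
}

// Assumed sample markup with two quotes and an unrelated <span> in between.
$sample = '<span class="text" itemprop="text">First quote.</span>'
        . '<span>by someone</span>'
        . '<span class="text" itemprop="text">Second quote.</span>';

foreach (extractQuotes($sample) as $quote) {
    echo "<p>" . $quote . "</p>" . PHP_EOL;
}
```

Returning an array instead of echoing inside the function removes the need for a global variable and leaves the choice of output format to the caller.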

The bottom line  

Using PHP to scrape the web for target data can be an effective, albeit slow and manual, process. One viable alternative that companies may want to consider is simply purchasing ready-to-use Datasets. This saves time and resources, allowing you and your team to shift your attention to expanding your business, ensuring customer satisfaction, and focusing on core product development.

Daniel is an SEO specialist here at Bright Data with a B2C background. He is in charge of ensuring that businesses get exposed to articles that help them become more data-driven. He is fascinated by the intricate inner workings that the digital world is comprised of and how these can be navigated for hypergrowth.