In this guide, you will see:
- The reasons why parsing HTML in PHP is useful
- The prerequisites to get started with the article’s goal
- How to parse HTML in PHP using:
  - Dom\HTMLDocument
  - Simple HTML DOM Parser
  - Symfony’s DomCrawler
- A comparison table of the three approaches
Let’s dive in!
Why Parse HTML in PHP?
HTML parsing in PHP involves converting HTML content into its DOM (Document Object Model) structure. Once the content is in DOM format, you can easily navigate and manipulate it.
In particular, the top reasons to parse HTML in PHP are:
- Data extraction: Gather specific content from web pages, such as text or attributes from HTML elements.
- Automation: Automate tasks like content scraping, reporting, and data aggregation from HTML content.
- Server-side HTML content handling: Parse HTML to manipulate, clean, or format web content on the server before displaying it in your application.
Discover the best HTML parsing libraries!
Prerequisites
Before you start coding, make sure you have PHP 8.4+ installed on your machine. You can verify this by running the following command:
php -v
The output should look something like this:
PHP 8.4.3 (cli) (built: Jan 19 2025 14:20:58) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.4.3, Copyright (c) Zend Technologies
with Zend OPcache v8.4.3, Copyright (c), by Zend Technologies
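If you prefer to enforce the version requirement inside the script itself, a small guard can help. This is only a minimal sketch: meetsMinimumPhpVersion() is a hypothetical helper name, not part of PHP, and it relies only on the built-in version_compare() function and the PHP_VERSION constant.

```php
<?php
// Hypothetical helper (not part of PHP): checks a version string against a minimum.
function meetsMinimumPhpVersion(string $minimum, string $current = PHP_VERSION): bool
{
    // with the ">=" operator, version_compare() returns true when $current >= $minimum
    return version_compare($current, $minimum, ">=");
}

// warn early if the running interpreter is older than PHP 8.4
if (!meetsMinimumPhpVersion("8.4.0")) {
    fwrite(STDERR, "This guide assumes PHP 8.4+, but found " . PHP_VERSION . PHP_EOL);
}
```

The guard only prints a warning, so the rest of the script still runs. You could exit instead if the later code depends on PHP 8.4 features such as Dom\HTMLDocument.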
Next, you want to initialize a Composer project to make dependency management easier. If Composer is not installed on your system, download it and follow the installation instructions.
First, create a new folder for your PHP HTML project:
mkdir php-html-parser
Navigate to the folder in your terminal and initialize a Composer project inside it using the composer init command:
composer init
During this process, you will be prompted with a few questions. The default answers will work, but feel free to add more specific details for your PHP HTML parsing project if desired.
Next, open the project folder in your favorite IDE. Visual Studio Code with the PHP extension or JetBrains PhpStorm are good choices for PHP development.
Now, add an empty index.php file to the project folder. Your project structure should now look like this:
php-html-parser/
├── vendor/
├── composer.json
└── index.php
Open index.php and add the following code to initialize your project:
<?php
require_once __DIR__ . "/vendor/autoload.php";
// scraping logic...
This file will soon contain the logic to parse HTML in PHP.
You can now run your script with this command:
php index.php
Great! You are all set up to start parsing HTML in PHP. From here, you can begin adding the necessary HTML retrieval and parsing logic to your script.
HTML Retrieval in PHP
Before parsing HTML in PHP, you need some HTML to parse. In this section, we will see two different approaches to accessing HTML content in PHP.
With cURL
PHP natively supports cURL, a popular HTTP client used to perform HTTP requests. Enable the cURL extension or install it on Linux with:
sudo apt-get install php8.4-curl
You can use cURL to send an HTTP GET request to an online server and retrieve the HTML document returned by the server.
Here is an example script that makes a simple GET request and retrieves HTML content:
// initialize cURL session
$ch = curl_init();
// set the URL you want to make a GET request to
curl_setopt($ch, CURLOPT_URL, "https://www.scrapethissite.com/pages/forms/?per_page=100");
// return the response instead of outputting it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// execute the cURL request and store the response in $html
$html = curl_exec($ch);
// close the cURL session
curl_close($ch);
// output the HTML response
echo $html;
Add the above snippet to index.php and launch it. It will produce the following HTML code:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
<link rel="icon" type="image/png" href="/static/images/scraper-icon.png" />
<!-- Omitted for brevity... -->
</html>
Learn more in our guide on cURL GET requests in PHP.
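The request above assumes everything goes well. In practice, curl_exec() returns false on a transport failure, and the server may answer with an error status. Here is a hedged sketch of how the same request could be wrapped with basic checks. The fetchHtml() name, the 10-second default timeout, and the "status 400 or above is an error" rule are assumptions made for illustration:

```php
<?php
// Hypothetical helper: fetches a URL and returns the response body, or throws on failure.
function fetchHtml(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,      // return the response instead of printing it
        CURLOPT_FOLLOWLOCATION => true,      // follow HTTP redirects
        CURLOPT_TIMEOUT => $timeoutSeconds,  // give up after this many seconds
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        // transport-level failure (DNS error, refused connection, timeout, ...)
        throw new RuntimeException("cURL error: " . curl_error($ch));
    }

    // treat HTTP 4xx and 5xx responses as failures
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException("Unexpected HTTP status: " . $status);
    }

    return $body;
}
```

You would then call it with the same URL as before, for example $html = fetchHtml("https://www.scrapethissite.com/pages/forms/?per_page=100");, inside a try/catch block.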
From a File
Another way to get the HTML content is to store it in a dedicated file. To do that:
- Visit a page of your choice in the browser
- Right-click on the page
- Select the “View page source” option
- Copy and paste the HTML into a file
Alternatively, you can write your own HTML logic in a file.
For this example, we will assume the file is named index.html. It contains the HTML of the “Hockey Teams” page from Scrape This Site, which was previously retrieved using cURL.
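If you want to read the file into a string yourself before handing it to a parser, file_get_contents() is enough. The sketch below adds a couple of sanity checks. loadHtmlFile() is a hypothetical helper name used only for illustration:

```php
<?php
// Hypothetical helper: reads an HTML file and returns its content, or throws on failure.
function loadHtmlFile(string $path): string
{
    // fail early with a clear message if the file is missing or unreadable
    if (!is_file($path) || !is_readable($path)) {
        throw new InvalidArgumentException("Cannot read HTML file: " . $path);
    }

    $html = file_get_contents($path);
    if ($html === false) {
        throw new RuntimeException("Failed to load HTML file: " . $path);
    }

    return $html;
}
```

For the file used in this guide, the call would be $html = loadHtmlFile("./index.html");.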
HTML Parsing in PHP: 3 Approaches
In this section, you will learn three different approaches to parsing HTML in PHP:
- Using Dom\HTMLDocument for vanilla PHP
- Using the Simple HTML DOM Parser library
- Using Symfony’s DomCrawler component
In all three cases, you will see how to parse either the HTML string retrieved via cURL or the HTML content read from the local index.html file.
Then, you will learn how to use the methods provided by each PHP HTML parsing approach to select all hockey team entries on the page and extract data from them.
The final result will be a list of scraped hockey team entries containing the following details:
- Team Name
- Year
- Wins
- Losses
- Win %
- Goals For (GF)
- Goals Against (GA)
- Goal Difference
You can extract them from the HTML table on the page. Each team row has the team class, and each column in a row has its own class (name, year, wins, losses, pct, gf, ga, and diff). You can extract data by selecting elements using their class as a CSS selector and then retrieving their content by accessing their text.
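To make the row structure concrete, here is a simplified, hand-written sample stored in a PHP string. It is an approximation for illustration, not a copy of the live page. The class names and values match those used in the code and output shown in this guide, while the header labels and the ot-losses cell are assumptions.

```php
<?php
// Simplified sample markup (an assumption, not the exact HTML of the live page).
// The ot-losses cell is assumed; it accounts for the column that the
// index-based example in this guide skips over.
$sampleHtml = <<<'HTML'
<table class="table">
  <tr>
    <th>Team Name</th><th>Year</th><th>Wins</th><th>Losses</th><th>OT Losses</th>
    <th>Win %</th><th>Goals For (GF)</th><th>Goals Against (GA)</th><th>+ / -</th>
  </tr>
  <tr class="team">
    <td class="name">Boston Bruins</td>
    <td class="year">1990</td>
    <td class="wins">44</td>
    <td class="losses">24</td>
    <td class="ot-losses"></td>
    <td class="pct">0.55</td>
    <td class="gf">299</td>
    <td class="ga">264</td>
    <td class="diff">35</td>
  </tr>
  <tr class="team">
    <td class="name">Detroit Red Wings</td>
    <td class="year">1994</td>
    <td class="wins">33</td>
    <td class="losses">11</td>
    <td class="ot-losses"></td>
    <td class="pct">0.688</td>
    <td class="gf">180</td>
    <td class="ga">117</td>
    <td class="diff">63</td>
  </tr>
</table>
HTML;

// count the team rows in the sample
echo substr_count($sampleHtml, '<tr class="team">'), " sample team rows", PHP_EOL;
```

A small fixture like this is handy for trying out selectors locally without sending a request to the site on every run.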
Keep in mind that parsing HTML is just one step in a web scraping script. To dive deeper, read our tutorial on web scraping with PHP.
Now, let’s explore three different approaches to HTML parsing in PHP.
Approach #1: With Dom\HTMLDocument
PHP 8.4+ comes with a built-in Dom\HTMLDocument class. This represents an HTML document and allows you to parse HTML content and navigate the DOM tree. See how to use it for HTML parsing in PHP!
Step #1: Installation and Set Up
Dom\HTMLDocument is part of PHP’s DOM extension, so no third-party package is required. Still, you need to make sure the DOM extension is enabled, or install it with this Linux command:
sudo apt-get install php-dom
No further action is needed. You are now ready to use Dom\HTMLDocument for HTML parsing in PHP.
Step #2: HTML Parsing
You can parse the HTML string as below:
$dom = \Dom\HTMLDocument::createFromString($html);
Equivalently, you can parse the index.html file with:
$dom = \Dom\HTMLDocument::createFromFile("./index.html");
$dom is a Dom\HTMLDocument object that exposes the methods you need for data parsing.
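Real-world pages often contain markup that is not strictly valid HTML5, and in that case the parser may emit a PHP warning for each parse error. Passing the LIBXML_NOERROR option is one way to silence them. This is a minimal sketch, assuming PHP 8.4+ with the DOM extension enabled. The inline $html string is a made-up example with a stray closing tag.

```php
<?php
// made-up HTML containing a stray </span> closing tag
$html = "<!doctype html><html><head><title>Demo</title></head>"
    . "<body><p>Hello</span></p></body></html>";

// LIBXML_NOERROR asks the parser not to report parse errors as PHP warnings
$dom = \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

// the document is still parsed and can be queried as usual
$paragraph = $dom->getElementsByTagName("p")->item(0);
echo trim($paragraph->textContent), PHP_EOL;
```

The same option can be passed as the second argument to createFromFile().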
Step #3: Data Parsing
You can select all hockey team entries using Dom\HTMLDocument with the following approach:
// select the first table on the page and all of its rows
$table = $dom->getElementsByTagName("table")->item(0);
$rows = $table->getElementsByTagName("tr");
// iterate through each row and extract data
foreach ($rows as $row) {
    $cells = $row->getElementsByTagName("td");
    // skip rows that lack the expected data cells, such as the header row
    if ($cells->length < 9) {
        continue;
    }
    // extract the data from each column by position
    // (index 4 holds a column that is not among the target fields)
    $team = trim($cells->item(0)->textContent);
    $year = trim($cells->item(1)->textContent);
    $wins = trim($cells->item(2)->textContent);
    $losses = trim($cells->item(3)->textContent);
    $win_pct = trim($cells->item(5)->textContent);
    $goals_for = trim($cells->item(6)->textContent);
    $goals_against = trim($cells->item(7)->textContent);
    $goal_diff = trim($cells->item(8)->textContent);
    // create an array for the scraped team data
    $team_data = [
        "team" => $team,
        "year" => $year,
        "wins" => $wins,
        "losses" => $losses,
        "win_pct" => $win_pct,
        "goals_for" => $goals_for,
        "goals_against" => $goals_against,
        "goal_diff" => $goal_diff
    ];
    // print the scraped team data
    print_r($team_data);
    print("\n");
}
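The snippet above relies on getElementsByTagName() and manual iteration over the cells by index. Dom\HTMLDocument also exposes querySelector() and querySelectorAll() for CSS selectors, but tag-based traversal is enough for a simple table like this one.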
Here is a breakdown of the methods used:
- getElementsByTagName(): Retrieve all elements of a given tag (like <table>, <tr>, or <td>) within the document.
- item(): Return an individual element, by index, from a list of elements returned by getElementsByTagName().
- textContent: This property gives the raw text content of an element, allowing you to extract the visible data (like the team name, year, etc.).
We also used trim() to remove extra whitespace before and after the text content for cleaner data.
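Since PHP 8.4, the new DOM classes also accept CSS selectors through querySelector() and querySelectorAll(), so the same kind of extraction can be written without depending on cell positions. The following is a minimal sketch rather than the approach used in this guide. The inline HTML is a made-up, shortened sample with only two columns.

```php
<?php
// made-up sample with the same row and cell classes as the target table
$html = <<<'HTML'
<!doctype html>
<html><head><title>Sample</title></head><body>
<table>
  <tr><th>Team Name</th><th>Year</th></tr>
  <tr class="team"><td class="name"> Boston Bruins </td><td class="year"> 1990 </td></tr>
  <tr class="team"><td class="name"> Detroit Red Wings </td><td class="year"> 1994 </td></tr>
</table>
</body></html>
HTML;

$dom = \Dom\HTMLDocument::createFromString($html);

$teams = [];
// querySelectorAll() returns every element matching the CSS selector,
// so the header row (which has no "team" class) is never visited
foreach ($dom->querySelectorAll("table tr.team") as $row) {
    $teams[] = [
        // querySelector() returns the first matching descendant of the row
        "team" => trim($row->querySelector(".name")->textContent),
        "year" => trim($row->querySelector(".year")->textContent),
    ];
}

print_r($teams);
```

Selecting by class keeps the code working if columns are reordered, at the cost of depending on the class names staying the same.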
When added to index.php, the above snippet will produce this result:
Array
(
[team] => Boston Bruins
[year] => 1990
[wins] => 44
[losses] => 24
[win_pct] => 0.55
[goals_for] => 299
[goals_against] => 264
[goal_diff] => 35
)
// omitted for brevity...
Array
(
[team] => Detroit Red Wings
[year] => 1994
[wins] => 33
[losses] => 11
[win_pct] => 0.688
[goals_for] => 180
[goals_against] => 117
[goal_diff] => 63
)
Approach #2: Using Simple HTML DOM Parser
Simple HTML DOM Parser is a lightweight PHP library that makes it easy to parse and manipulate HTML content. The library is actively maintained and has over 880 stars on GitHub.
Step #1: Installation and Set Up
You can install Simple HTML DOM Parser via Composer with this command:
composer require voku/simple_html_dom
Alternatively, you can manually download and include the simple_html_dom.php file in your project.
Then, import it in index.php with this line of code:
use voku\helper\HtmlDomParser;
Step #2: HTML Parsing
To parse an HTML string, use the str_get_html() method:
$dom = HtmlDomParser::str_get_html($html);
To parse the index.html file, use file_get_html() instead:
$dom = HtmlDomParser::file_get_html("./index.html");
This will load the HTML content into a $dom object, which allows you to navigate the DOM easily.
Step #3: Data Parsing
Extract the hockey team data from the HTML using Simple HTML DOM Parser:
// find all team rows in the table
$rows = $dom->findMulti("table tr.team");
// loop through each row to extract the data
foreach ($rows as $row) {
    // extract data using CSS selectors
    $team_element = $row->findOne(".name");
    $team = trim($team_element->plaintext);
    $year_element = $row->findOne(".year");
    $year = trim($year_element->plaintext);
    $wins_element = $row->findOne(".wins");
    $wins = trim($wins_element->plaintext);
    $losses_element = $row->findOne(".losses");
    $losses = trim($losses_element->plaintext);
    $win_pct_element = $row->findOne(".pct");
    $win_pct = trim($win_pct_element->plaintext);
    $goals_for_element = $row->findOne(".gf");
    $goals_for = trim($goals_for_element->plaintext);
    $goals_against_element = $row->findOne(".ga");
    $goals_against = trim($goals_against_element->plaintext);
    $goal_diff_element = $row->findOne(".diff");
    $goal_diff = trim($goal_diff_element->plaintext);
    // create an array with the extracted team data
    $team_data = [
        "team" => $team,
        "year" => $year,
        "wins" => $wins,
        "losses" => $losses,
        "win_pct" => $win_pct,
        "goals_for" => $goals_for,
        "goals_against" => $goals_against,
        "goal_diff" => $goal_diff
    ];
    // print the scraped team data
    print_r($team_data);
    print("\n");
}
The Simple HTML DOM Parser features used above are:
- findMulti(): Select all elements identified by the given CSS selector.
- findOne(): Locate the first element matching the given CSS selector.
- plaintext: A property to get the raw text content inside an HTML element.
This time, we used CSS selectors with a more complete and robust logic. Still, the result will be the same as in the initial HTML parsing PHP approach.
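Because every field follows the same "find one element, read its text" pattern, the repeated lines can be driven by a map from field name to CSS selector. This is a sketch under stated assumptions: TEAM_FIELD_SELECTORS and extractTeamsWithSimpleHtmlDom() are hypothetical names, and the code assumes voku/simple_html_dom is installed and that vendor/autoload.php has already been required, as in index.php. It uses only the str_get_html(), findMulti(), findOne(), and plaintext features shown above.

```php
<?php
use voku\helper\HtmlDomParser;

// Hypothetical map: output field name => CSS selector of the cell within a team row.
const TEAM_FIELD_SELECTORS = [
    "team" => ".name",
    "year" => ".year",
    "wins" => ".wins",
    "losses" => ".losses",
    "win_pct" => ".pct",
    "goals_for" => ".gf",
    "goals_against" => ".ga",
    "goal_diff" => ".diff",
];

// Hypothetical helper: returns one associative array per team row found in $html.
// Assumes the Composer autoloader has already been loaded.
function extractTeamsWithSimpleHtmlDom(string $html): array
{
    $dom = HtmlDomParser::str_get_html($html);

    $teams = [];
    foreach ($dom->findMulti("table tr.team") as $row) {
        $team = [];
        // apply the same extraction step to every configured field
        foreach (TEAM_FIELD_SELECTORS as $field => $selector) {
            $team[$field] = trim($row->findOne($selector)->plaintext);
        }
        $teams[] = $team;
    }

    return $teams;
}
```

With this layout, supporting an extra column only requires one more entry in the map.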
Approach #3: Using Symfony’s DomCrawler Component
Symfony’s DomCrawler component provides an easy way to parse HTML documents and extract data from them.
Note: The component is part of the Symfony framework but can also be used standalone, as we will do in this section.
Step #1: Installation and Set Up
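Install Symfony’s DomCrawler component with the Composer command below. The symfony/css-selector package is included because the filter() method needs it to handle CSS selectors:
composer require symfony/dom-crawler symfony/css-selector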
Then, import it in the index.php file:
use Symfony\Component\DomCrawler\Crawler;
Step #2: HTML Parsing
To parse an HTML string, pass it to the Crawler constructor:
$crawler = new Crawler($html);
For parsing a file, use file_get_contents() and create the Crawler instance:
$crawler = new Crawler(file_get_contents("./index.html"));
The above lines will load the HTML content into the $crawler object, which provides easy methods to traverse and extract data.
Step #3: Data Parsing
Extract the hockey team data using the DomCrawler component:
// select all team rows within the table
$rows = $crawler->filter("table tr.team");
// loop through each row to extract the data
$rows->each(function ($row, $i) {
    // extract data using CSS selectors
    $team_element = $row->filter(".name");
    $team = trim($team_element->text());
    $year_element = $row->filter(".year");
    $year = trim($year_element->text());
    $wins_element = $row->filter(".wins");
    $wins = trim($wins_element->text());
    $losses_element = $row->filter(".losses");
    $losses = trim($losses_element->text());
    $win_pct_element = $row->filter(".pct");
    $win_pct = trim($win_pct_element->text());
    $goals_for_element = $row->filter(".gf");
    $goals_for = trim($goals_for_element->text());
    $goals_against_element = $row->filter(".ga");
    $goals_against = trim($goals_against_element->text());
    $goal_diff_element = $row->filter(".diff");
    $goal_diff = trim($goal_diff_element->text());
    // create an array with the extracted team data
    $team_data = [
        "team" => $team,
        "year" => $year,
        "wins" => $wins,
        "losses" => $losses,
        "win_pct" => $win_pct,
        "goals_for" => $goals_for,
        "goals_against" => $goals_against,
        "goal_diff" => $goal_diff
    ];
    // print the scraped team data
    print_r($team_data);
    print("\n");
});
The DomCrawler methods used are:
- each(): Iterate over a list of selected elements.
- filter(): Select elements based on CSS selectors.
- text(): Extract the text content of the selected elements.
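One detail worth knowing is that each() also returns an array with the value returned by the closure for every node, so you can collect the rows instead of printing them. The sketch below shows this for three of the fields. extractTeamsWithDomCrawler() is a hypothetical helper name, and the code assumes symfony/dom-crawler and symfony/css-selector are installed and that the Composer autoloader has already been loaded, as in index.php.

```php
<?php
use Symfony\Component\DomCrawler\Crawler;

// Hypothetical helper: returns one associative array per team row found in $html.
// Assumes the Composer autoloader has already been loaded.
function extractTeamsWithDomCrawler(string $html): array
{
    $crawler = new Crawler($html);

    // each() collects the closure's return value for every matched row
    return $crawler->filter("table tr.team")->each(function (Crawler $row): array {
        return [
            // passing a default to text() avoids an exception when a cell is missing
            "team" => $row->filter(".name")->text(""),
            "year" => $row->filter(".year")->text(""),
            "wins" => $row->filter(".wins")->text(""),
        ];
    });
}
```

Returning the data rather than printing it makes it easier to export the result later, for example with json_encode().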
Wonderful! You are now a PHP HTML parsing master.
Parsing HTML in PHP: Comparison Table
You can compare the three approaches to parsing HTML in PHP explored here in the summary table below:
| | Dom\HTMLDocument | Simple HTML DOM Parser | Symfony’s DomCrawler |
|---|---|---|---|
| Type | Native PHP component | External library | Symfony component |
| GitHub Stars | N/A | 880+ | 4,000+ |
| XPath Support | ✔️ (via Dom\XPath) | ✔️ | ✔️ |
| CSS Selector Support | ✔️ (querySelector(), querySelectorAll()) | ✔️ | ✔️ |
| Learning Curve | Low | Low to Medium | Medium |
| Simplicity of Use | Medium | High | High |
| API | Basic | Rich | Rich |
Conclusion
In this article, you learned about three approaches to HTML parsing in PHP, ranging from using vanilla built-in extensions to third-party libraries.
While all these solutions work, keep in mind that the target web page might use JavaScript for rendering. In that case, simple HTML parsing approaches like the ones presented above will not work. Instead, you need a fully-fledged scraping browser with advanced HTML parsing capabilities like Scraping Browser.
Want to skip HTML parsing and get the data immediately? Check out our ready-to-use datasets covering hundreds of websites!
Create a free Bright Data account today to test our data and scraping solutions with a free trial!