In this guide, you will see:
- The reasons why parsing HTML in PHP is useful
- The prerequisites you need to follow along
- How to parse HTML in PHP using:
  - Dom\HTMLDocument
  - Simple HTML DOM Parser
  - Symfony’s DomCrawler
- A comparison table of the three approaches
Let’s dive in!
Why Parse HTML in PHP?
HTML parsing in PHP involves converting HTML content into its DOM (Document Object Model) structure. Once the content is in DOM format, you can easily navigate and manipulate it.
In particular, the top reasons to parse HTML in PHP are:
- Data extraction: Gather specific content from web pages, such as text or attributes from HTML elements.
- Automation: Automate tasks like content scraping, reporting, and data aggregation from HTML content.
- Server-side HTML content handling: Parse HTML to manipulate, clean, or format web content on the server before displaying it in your application.
Discover the best HTML parsing libraries!
Prerequisites
Before you start coding, make sure you have PHP 8.4+ installed on your machine. You can verify this by running php -v in your terminal. The output should report a PHP version of 8.4 or later.
Next, you want to initialize a Composer project to make dependency management easier. If Composer is not installed on your system, download it and follow the installation instructions.
First, create a new folder for your PHP HTML parsing project.
Then, navigate to the folder in your terminal and initialize a Composer project inside it using the composer init command.
During this process, you will be prompted with a few questions. The default answers will work, but feel free to add more specific details for your PHP HTML parsing project if desired.
Next, open the project folder in your favorite IDE. Visual Studio Code with the PHP extension or JetBrains PhpStorm are good choices for PHP development.
Now, add an empty index.php file to the project folder, next to the composer.json file generated by Composer.
Open index.php and add the following code to initialize your project:
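A minimal starting point could look like the sketch below. The guarded autoloader include and the environment checks are additions made for this illustration (the `$checks` array is a hypothetical name), so adapt them to your setup:

```php
<?php

// Load Composer's autoloader when it is available. The guard keeps the
// script runnable before any dependency has been installed.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

// Optional sanity checks for the requirements used in this guide
// (an addition for this sketch, not a required step).
$checks = [
    'PHP 8.4+' => version_compare(PHP_VERSION, '8.4.0', '>='),
    'curl extension' => extension_loaded('curl'),
    'dom extension' => extension_loaded('dom'),
];

foreach ($checks as $label => $ok) {
    echo sprintf("%-15s %s\n", $label, $ok ? 'OK' : 'MISSING');
}

// HTML retrieval and parsing logic will go here...
```

Any line reported as MISSING points at a requirement to fix before moving on.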
This file will soon contain the logic to parse HTML in PHP.
You can now run your script with the php index.php command.
Great! You are all set up to start parsing HTML in PHP. From here, you can begin adding the necessary HTML retrieval and parsing logic to your script.
HTML Retrieval in PHP
Before parsing HTML in PHP, you need some HTML to parse. In this section, we will see two different approaches to accessing HTML content in PHP.
With cURL
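PHP natively supports cURL, a popular HTTP client used to perform HTTP requests. To use it, enable the cURL extension in your PHP configuration or, on Linux, install it through your package manager (on Debian-based systems, the package is typically named php-curl).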
You can use cURL to send an HTTP GET request to an online server and retrieve the HTML document returned by the server.
Here is an example script that makes a simple GET request and retrieves HTML content:
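A minimal sketch of such a script is shown here. The `fetchHtml()` helper name, the timeout value, and the target URL in the usage comment are assumptions made for illustration:

```php
<?php

/**
 * Fetch the HTML of a page with an HTTP GET request.
 * Hypothetical helper written for this sketch.
 */
function fetchHtml(string $url): string
{
    $ch = curl_init($url);

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT => 30,          // give up after 30 seconds (arbitrary value)
    ]);

    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    return $html;
}

// Usage (performs a real network request; the URL is assumed to be the
// "Hockey Teams" page used later in this guide):
// $html = fetchHtml('https://www.scrapethissite.com/pages/forms/');
// echo $html;
```

Throwing an exception on failure keeps the later parsing steps from silently running on an empty string.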
Add the above snippet to index.php and launch it. It will print the HTML code of the target page.
Learn more in our guide on cURL GET requests in PHP.
From a File
Another way to get the HTML content is to store it in a dedicated file. To do that:
- Visit a page of your choice in the browser
- Right-click on the page
- Select the “View page source” option
- Copy and paste the HTML into a file
Alternatively, you can write your own HTML in a file.
For this example, we will assume the file is named index.html. It contains the HTML of the “Hockey Teams” page from Scrape This Site, which was previously retrieved using cURL.
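To read that file back into a string, a small helper along these lines should be enough. The `loadHtmlFile()` name is hypothetical, and the sketch assumes index.html sits next to index.php:

```php
<?php

/**
 * Read a local HTML file into a string.
 * Hypothetical helper written for this sketch.
 */
function loadHtmlFile(string $path): string
{
    if (!is_readable($path)) {
        throw new RuntimeException("Cannot read HTML file: {$path}");
    }

    $html = file_get_contents($path);
    if ($html === false) {
        throw new RuntimeException("Failed to load HTML file: {$path}");
    }

    return $html;
}

// Usage:
// $html = loadHtmlFile(__DIR__ . '/index.html');
```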
HTML Parsing in PHP: 3 Approaches
In this section, you will learn how to use three different libraries to parse HTML in PHP:
- Using Dom\HTMLDocument for vanilla PHP
- Using the Simple HTML DOM Parser library
- Using Symfony’s DomCrawler component
In all three cases, you will see how to parse either the HTML string retrieved via cURL or the HTML content read from the local index.html file.
Then, you will learn how to use the methods provided by each PHP HTML parsing library to select all hockey team entries on the page and extract data from them.
The final result will be a list of scraped hockey team entries containing the following details:
- Team Name
- Year
- Wins
- Losses
- Win %
- Goals For (GF)
- Goals Against (GA)
- Goal Difference
You can extract them from the HTML table with this structure:
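For reference, the sketch below stores a simplified version of that table in a PHP string. The class names, the column order (including an extra OT Losses column that is not part of the list above), and the sample values are assumptions made for illustration, so check them against the real page before relying on them:

```php
<?php

// Simplified, illustrative markup: one header row and one team row.
// Class names, column order, and values are assumptions of this sketch.
$sampleTable = <<<'HTML'
<table class="table">
  <tr>
    <th>Team Name</th><th>Year</th><th>Wins</th><th>Losses</th>
    <th>OT Losses</th><th>Win %</th><th>Goals For (GF)</th>
    <th>Goals Against (GA)</th><th>+ / -</th>
  </tr>
  <tr class="team">
    <td class="name">Boston Bruins</td>
    <td class="year">1990</td>
    <td class="wins">44</td>
    <td class="losses">24</td>
    <td class="ot-losses"></td>
    <td class="pct">0.55</td>
    <td class="gf">299</td>
    <td class="ga">264</td>
    <td class="diff">35</td>
  </tr>
</table>
HTML;

echo $sampleTable, PHP_EOL;
```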
As you can see, each column in a table row has a specific class. You can extract data from it by selecting elements using their class as a CSS selector and then retrieving their content by accessing their text.
Keep in mind that parsing HTML is just one step in a web scraping script. To dive deeper, read our tutorial on web scraping with PHP.
Now, let’s explore three different approaches to HTML parsing in PHP.
Approach #1: With Dom\HTMLDocument
PHP 8.4+ comes with a built-in Dom\HTMLDocument class. This represents an HTML document and allows you to parse HTML content and navigate the DOM tree. See how to use it for HTML parsing in PHP!
Step #1: Installation and Set Up
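Dom\HTMLDocument is part of PHP’s DOM extension, not of the Standard PHP Library. The extension is enabled in most PHP builds. If it is missing on your system, enable it or, on Linux, install it through your package manager (on Debian-based systems, it typically ships in the php-xml package).
No further action is needed. You are now ready to use Dom\HTMLDocument for HTML parsing in PHP.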
Step #2: HTML Parsing
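You can parse the HTML string with the Dom\HTMLDocument::createFromString() static method.
Equivalently, you can parse the index.html file with Dom\HTMLDocument::createFromFile().
In both cases, $dom is a Dom\HTMLDocument object that exposes the methods you need for data parsing.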
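A minimal sketch of both calls is shown here. The inline `$html` value is a stand-in for the markup retrieved earlier, and passing `LIBXML_NOERROR` to silence warnings on imperfect markup is a choice made for this example:

```php
<?php

// Stand-in for the HTML retrieved via cURL (an assumption of this sketch).
$html = '<!DOCTYPE html><html><body><p>Hello</p></body></html>';

// Parse an HTML string. LIBXML_NOERROR suppresses parser warnings
// that real-world markup often triggers.
$dom = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

// Parse a local file instead, when index.html exists in the working directory.
$domFromFile = is_file('index.html')
    ? Dom\HTMLDocument::createFromFile('index.html', LIBXML_NOERROR)
    : null;

// Quick check that the document was parsed.
echo trim($dom->getElementsByTagName('p')->item(0)->textContent), PHP_EOL;
```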
Step #3: Data Parsing
You can select all hockey team entries using Dom\HTMLDocument with the following approach:
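The sketch below illustrates one way to do it with tag-based traversal. The embedded sample markup, the array keys, and the column indexes (which skip an assumed OT Losses column at position 4) are assumptions, so adjust them to the actual table:

```php
<?php

// Sample markup standing in for the real page (an assumption of this sketch).
$html = <<<'HTML'
<table class="table">
  <tr><th>Team Name</th><th>Year</th><th>Wins</th><th>Losses</th><th>OT Losses</th>
      <th>Win %</th><th>GF</th><th>GA</th><th>+ / -</th></tr>
  <tr class="team">
    <td class="name"> Boston Bruins </td>
    <td class="year"> 1990 </td>
    <td class="wins"> 44 </td>
    <td class="losses"> 24 </td>
    <td class="ot-losses"></td>
    <td class="pct"> 0.55 </td>
    <td class="gf"> 299 </td>
    <td class="ga"> 264 </td>
    <td class="diff"> 35 </td>
  </tr>
</table>
HTML;

$dom = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

$teams = [];
$table = $dom->getElementsByTagName('table')->item(0);

foreach ($table->getElementsByTagName('tr') as $row) {
    $cells = $row->getElementsByTagName('td');

    // Header rows contain <th> cells only, so skip rows without enough <td> cells.
    if ($cells->length < 9) {
        continue;
    }

    $teams[] = [
        'team' => trim($cells->item(0)->textContent),
        'year' => trim($cells->item(1)->textContent),
        'wins' => trim($cells->item(2)->textContent),
        'losses' => trim($cells->item(3)->textContent),
        'win_pct' => trim($cells->item(5)->textContent),
        'goals_for' => trim($cells->item(6)->textContent),
        'goals_against' => trim($cells->item(7)->textContent),
        'goal_diff' => trim($cells->item(8)->textContent),
    ];
}

print_r($teams);
```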
This approach relies on basic DOM methods like getElementsByTagName() and manual iteration, even though Dom\HTMLDocument also supports CSS selectors through querySelector() and querySelectorAll().
Here is a breakdown of the methods used:
- getElementsByTagName(): Retrieve all elements of a given tag (like <table>, <tr>, or <td>) within the document.
- item(): Return an individual element from the list of elements returned by getElementsByTagName().
- textContent: This property gives the raw text content of an element, allowing you to extract the visible data (like the team name, year, etc.).
We also used trim() to remove extra whitespace before and after the text content for cleaner data.
When added to index.php, the above snippet will print the list of scraped hockey team entries.
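As an alternative to tag-based traversal, the DOM API introduced in PHP 8.4 also exposes querySelector() and querySelectorAll(). The sketch below is a hedged illustration of that route: the shortened sample markup and the class names are assumptions rather than the exact structure of the target page.

```php
<?php

// Shortened sample markup (an assumption of this sketch).
$html = <<<'HTML'
<table class="table">
  <tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
  <tr class="team">
    <td class="name"> Boston Bruins </td>
    <td class="year"> 1990 </td>
    <td class="wins"> 44 </td>
  </tr>
</table>
HTML;

$dom = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

$teams = [];

// Select every team row with a CSS selector, then each cell by its class.
foreach ($dom->querySelectorAll('tr.team') as $row) {
    $teams[] = [
        'team' => trim($row->querySelector('td.name')?->textContent ?? ''),
        'year' => trim($row->querySelector('td.year')?->textContent ?? ''),
        'wins' => trim($row->querySelector('td.wins')?->textContent ?? ''),
    ];
}

print_r($teams);
```

Selecting cells by class instead of by position keeps the extraction working if columns are reordered.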
Approach #2: Using Simple HTML DOM Parser
Simple HTML DOM Parser is a lightweight PHP library that makes it easy to parse and manipulate HTML content. The library is actively maintained and has over 880 stars on GitHub.
Step #1: Installation and Set Up
You can install Simple HTML DOM Parser via Composer by requiring the voku/simple_html_dom package.
Alternatively, you can manually download and include the simple_html_dom.php file in your project.
Then, import the parser class in index.php with a use statement.
Step #2: HTML Parsing
To parse an HTML string, use the str_get_html() method.
For parsing index.html, use file_get_html() instead.
Either call loads the HTML content into a $dom object, which allows you to navigate the DOM easily.
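A minimal sketch of both calls is shown here, assuming the library was installed as the voku/simple_html_dom Composer package. The `parseHtmlString()` and `parseHtmlFile()` wrappers are hypothetical helpers added for illustration:

```php
<?php

use voku\helper\HtmlDomParser;

// Load Composer's autoloader when present (guarded so the file still
// loads before the dependency is installed).
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

/** Parse an HTML string (hypothetical wrapper for this sketch). */
function parseHtmlString(string $html)
{
    return HtmlDomParser::str_get_html($html);
}

/** Parse a local HTML file such as index.html (hypothetical wrapper). */
function parseHtmlFile(string $path)
{
    return HtmlDomParser::file_get_html($path);
}

// Usage:
// $dom = parseHtmlString($html);
// $dom = parseHtmlFile(__DIR__ . '/index.html');
```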
Step #3: Data Parsing
Extract the hockey team data from the HTML using Simple HTML DOM Parser:
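The sketch below shows one possible version of that extraction, again assuming the voku/simple_html_dom package. The `extractTeams()` function name, the array keys, and the CSS class names are assumptions based on the table structure described earlier:

```php
<?php

use voku\helper\HtmlDomParser;

// Load Composer's autoloader when present.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

/**
 * Extract hockey team entries from an HTML string.
 * Hypothetical helper; the class names are assumptions about the page.
 */
function extractTeams(string $html): array
{
    $dom = HtmlDomParser::str_get_html($html);
    $teams = [];

    // Select every team row, then read each cell by its class.
    foreach ($dom->findMulti('table tr.team') as $row) {
        $teams[] = [
            'team' => trim($row->findOne('td.name')->plaintext),
            'year' => trim($row->findOne('td.year')->plaintext),
            'wins' => trim($row->findOne('td.wins')->plaintext),
            'losses' => trim($row->findOne('td.losses')->plaintext),
            'win_pct' => trim($row->findOne('td.pct')->plaintext),
            'goals_for' => trim($row->findOne('td.gf')->plaintext),
            'goals_against' => trim($row->findOne('td.ga')->plaintext),
            'goal_diff' => trim($row->findOne('td.diff')->plaintext),
        ];
    }

    return $teams;
}

// Usage:
// print_r(extractTeams($html));
```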
The Simple HTML DOM Parser features used above are:
- findMulti(): Select all elements identified by the given CSS selector.
- findOne(): Locate the first element matching the given CSS selector.
- plaintext: A property that returns the raw text content inside an HTML element.
This time, we used CSS selectors with a more complete and robust logic. Still, the result will be the same as in the first approach.
Approach #3: Using Symfony’s DomCrawler Component
Symfony’s DomCrawler component provides an easy way to parse HTML documents and extract data from them.
Note: The component is part of the Symfony framework but can also be used standalone, as we will do in this section.
Step #1: Installation and Set Up
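Install Symfony’s DomCrawler component by requiring the symfony/dom-crawler package with Composer. To select elements with CSS selectors, also require the symfony/css-selector package.
Then, import the Crawler class in the index.php file with a use statement.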
Step #2: HTML Parsing
To parse an HTML string, pass it to the Crawler constructor.
For parsing a file, read its content with file_get_contents() and pass the result to the Crawler constructor in the same way.
Both approaches load the HTML content into the $crawler object, which provides easy methods to traverse and extract data.
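A minimal sketch of both options is shown here, assuming symfony/dom-crawler is installed. The `crawlerFromString()` and `crawlerFromFile()` helpers are hypothetical names introduced for illustration:

```php
<?php

use Symfony\Component\DomCrawler\Crawler;

// Load Composer's autoloader when present.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

/** Build a Crawler from an HTML string (hypothetical helper). */
function crawlerFromString(string $html): Crawler
{
    return new Crawler($html);
}

/** Build a Crawler from a local HTML file such as index.html (hypothetical helper). */
function crawlerFromFile(string $path): Crawler
{
    $html = file_get_contents($path);
    if ($html === false) {
        throw new RuntimeException("Failed to load HTML file: {$path}");
    }

    return new Crawler($html);
}

// Usage:
// $crawler = crawlerFromString($html);
// $crawler = crawlerFromFile(__DIR__ . '/index.html');
```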
Step #3: Data Parsing
Extract the hockey team data using the DomCrawler component:
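The sketch below shows one possible version, assuming both symfony/dom-crawler and symfony/css-selector are installed. The `extractTeamsWithCrawler()` name, the array keys, and the CSS class names are assumptions based on the table structure described earlier:

```php
<?php

use Symfony\Component\DomCrawler\Crawler;

// Load Composer's autoloader when present.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

/**
 * Extract hockey team entries from an HTML string.
 * Hypothetical helper; the class names are assumptions about the page.
 */
function extractTeamsWithCrawler(string $html): array
{
    $crawler = new Crawler($html);

    // each() runs the callback on every matched row and collects the results.
    // text('') returns an empty string when a cell is missing.
    return $crawler->filter('table tr.team')->each(
        fn (Crawler $row): array => [
            'team' => $row->filter('td.name')->text(''),
            'year' => $row->filter('td.year')->text(''),
            'wins' => $row->filter('td.wins')->text(''),
            'losses' => $row->filter('td.losses')->text(''),
            'win_pct' => $row->filter('td.pct')->text(''),
            'goals_for' => $row->filter('td.gf')->text(''),
            'goals_against' => $row->filter('td.ga')->text(''),
            'goal_diff' => $row->filter('td.diff')->text(''),
        ]
    );
}

// Usage:
// print_r(extractTeamsWithCrawler($html));
```

By default, text() also trims and normalizes whitespace, so no explicit trim() call is needed here.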
The DomCrawler methods used are:
- each(): To iterate over a list of selected elements.
- filter(): Select elements based on CSS selectors.
- text(): Extract the text content of the selected elements.
Wonderful! You are now a PHP HTML parsing master.
Parsing HTML in PHP: Comparison Table
You can compare the three approaches to parsing HTML in PHP explored here in the summary table below:
Feature | Dom\HTMLDocument | Simple HTML DOM Parser | Symfony’s DomCrawler |
---|---|---|---|
Type | Native PHP component | External Library | Symfony Component |
GitHub Stars | — | 880+ | 4,000+ |
XPath Support | ✔️ (via Dom\XPath) | ✔️ | ✔️ |
CSS Selector Support | ✔️ (via querySelector() and querySelectorAll()) | ✔️ | ✔️ (requires symfony/css-selector) |
Learning Curve | Low | Low to Medium | Medium |
Simplicity of Use | Medium | High | High |
API | Basic | Rich | Rich |
Conclusion
In this article, you learned about three approaches to HTML parsing in PHP, ranging from using vanilla built-in extensions to third-party libraries.
While all these solutions work, keep in mind that the target web page might use JavaScript for rendering. In that case, simple HTML parsing approaches like the ones presented above will not work. Instead, you need a fully-fledged scraping browser with advanced HTML parsing capabilities like Scraping Browser.
Want to skip HTML parsing and get the data immediately? Check out our ready-to-use datasets covering hundreds of websites!
Create a free Bright Data account today to test our data and scraping solutions with a free trial!