How to Parse HTML with PHP in 2025

This guide compares three PHP HTML parsing techniques, highlighting their strengths and differences to help you choose the right solution.

In this guide, you will see:

  • The reasons why parsing HTML in PHP is useful
  • The prerequisites for following along with this guide
  • How to parse HTML in PHP using:
    • Dom\HTMLDocument
    • Simple HTML DOM Parser
    • Symfony’s DomCrawler
  • A comparison table of the three approaches

Let’s dive in!

Why Parse HTML in PHP?

HTML Parsing in PHP involves converting HTML content into its DOM (Document Object Model) structure. Once in the DOM format, you can easily navigate and manipulate the HTML content.

In particular, the top reasons to parse HTML in PHP are:

  • Data extraction: Gather specific content from web pages, such as text or attributes from HTML elements.
  • Automation: Automate tasks like content scraping, reporting, and data aggregation from HTML content.
  • Server-side HTML content handling: Parse HTML to manipulate, clean, or format web content on the server before displaying it in your application.

Discover the best HTML parsing libraries!

Prerequisites

Before you start coding, make sure you have PHP 8.4+ installed on your machine. You can verify this by running the following command:

php -v

The output should look something like this:

PHP 8.4.3 (cli) (built: Jan 19 2025 14:20:58) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.4.3, Copyright (c) Zend Technologies
    with Zend OPcache v8.4.3, Copyright (c), by Zend Technologies
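
If you prefer to check the requirements from PHP itself, the following is a minimal sketch. It assumes the two extensions this guide relies on later (dom and curl); missing_requirements() is a hypothetical helper name, not a built-in function.

```php
<?php

/**
 * Return a list of unmet requirements (hypothetical helper for this guide).
 * The extension names are assumptions based on what the guide uses later.
 */
function missing_requirements(): array
{
  $missing = [];

  // Dom\HTMLDocument is only available from PHP 8.4 onward
  if (version_compare(PHP_VERSION, "8.4.0", "<")) {
    $missing[] = "PHP 8.4+ (found " . PHP_VERSION . ")";
  }

  // check that each required extension is loaded
  foreach (["dom", "curl"] as $extension) {
    if (!extension_loaded($extension)) {
      $missing[] = "ext-" . $extension;
    }
  }

  return $missing;
}

$missing = missing_requirements();
echo $missing === []
  ? "All requirements met\n"
  : "Missing: " . implode(", ", $missing) . "\n";
```

An empty list means the environment should be able to run every snippet in this guide that uses only built-in features.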

Next, you want to initialize a Composer project to make dependency management easier. If Composer is not installed on your system, download it and follow the installation instructions.

First, create a new folder for your PHP HTML project:

mkdir php-html-parser

Navigate to the folder in your terminal and initialize a Composer project inside it using the composer init command:

composer init

During this process, you will be prompted with a few questions. The default answers will work, but feel free to add more specific details for your PHP HTML parsing project if desired.

Next, open the project folder in your favorite IDE. Visual Studio Code with the PHP extension or JetBrains PhpStorm are good choices for PHP development.

Now, add an empty index.php file to the project folder. Your project structure should now look like this:

php-html-parser/
  ├── vendor/
  ├── composer.json
  └── index.php

Open index.php and add the following code to initialize your project:

<?php

require_once __DIR__ . "/vendor/autoload.php";

// scraping logic...

This file will soon contain the logic to parse HTML in PHP.

You can now run your script with this command:

php index.php

Great! You are all set up to start parsing HTML in PHP. From here, you can begin adding the necessary HTML retrieval and parsing logic to your script.

HTML Retrieval in PHP

Before parsing HTML in PHP, you need some HTML to parse. In this section, we will see two different approaches to accessing HTML content in PHP.

With cURL

PHP natively supports cURL, a popular HTTP client used to perform HTTP requests. Enable the cURL extension or install it on Debian-based Linux distributions with:

sudo apt-get install php8.4-curl

You can use cURL to send an HTTP GET request to an online server and retrieve the HTML document returned by the server.

Here is an example script that makes a simple GET request and retrieves HTML content:

// initialize cURL session
$ch = curl_init();

// set the URL you want to make a GET request to
curl_setopt($ch, CURLOPT_URL, "https://www.scrapethissite.com/pages/forms/?per_page=100");

// return the response instead of outputting it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// execute the cURL request and store the response in $html
$html = curl_exec($ch);

// close the cURL session
curl_close($ch);

// output the HTML response
echo $html;

Add the above snippet to index.php and launch it. It will produce the following HTML code:

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
    <link rel="icon" type="image/png" href="/static/images/scraper-icon.png" />
    <!-- Omitted for brevity... -->
</html>

Learn more in our guide on cURL GET requests in PHP.
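
The snippet above assumes the request always succeeds. As a hedged sketch, the same request can be wrapped in a function that reports failures. Note that fetch_html() and its $timeout_seconds parameter are hypothetical names introduced here for illustration; only the curl_* functions and CURLOPT_*/CURLINFO_* constants are part of PHP.

```php
<?php

/**
 * Fetch a page with cURL and return its body, or throw on failure.
 * fetch_html() is a hypothetical helper, not part of the cURL extension.
 */
function fetch_html(string $url, int $timeout_seconds = 15): string
{
  $ch = curl_init($url);
  curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,      // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,      // follow HTTP redirects
    CURLOPT_TIMEOUT => $timeout_seconds, // give up after this many seconds
  ]);

  $html = curl_exec($ch);
  if ($html === false) {
    // transport-level failure (DNS error, refused connection, timeout...)
    throw new RuntimeException("cURL error: " . curl_error($ch));
  }

  // treat HTTP error statuses as failures too
  $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
  if ($status >= 400) {
    throw new RuntimeException("Unexpected HTTP status: " . $status);
  }

  return $html;
}
```

You could then call fetch_html("https://www.scrapethissite.com/pages/forms/?per_page=100") inside a try/catch block and pass the returned string to any of the parsers covered below.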

From a File

Another way to get the HTML content is to store it in a dedicated file. To do that:

  1. Visit a page of your choice in the browser
  2. Right-click on the page
  3. Select the “View page source” option
  4. Copy and paste the HTML into a file

Alternatively, you can write your own HTML logic in a file.

For this example, we will assume the file is named index.html. This contains the HTML of the “Hockey Teams” page from Scrape This Site, which was previously retrieved using cURL:

The index.html file in the project folder
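
To load the saved file into a string before parsing, a minimal sketch could look like the following. read_html_file() is a hypothetical helper name; the path is whatever location you saved index.html to.

```php
<?php

/**
 * Read an HTML file into a string, or throw if it cannot be read.
 * read_html_file() is a hypothetical helper for this guide.
 */
function read_html_file(string $path): string
{
  // fail early with a clear message instead of a PHP warning
  if (!is_file($path) || !is_readable($path)) {
    throw new RuntimeException("Cannot read HTML file: " . $path);
  }

  $html = file_get_contents($path);
  if ($html === false) {
    throw new RuntimeException("Failed to read HTML file: " . $path);
  }

  return $html;
}
```

For the file used in this guide, the call would be read_html_file(__DIR__ . "/index.html").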

HTML Parsing in PHP: 3 Approaches

In this section, you will learn three different approaches to parsing HTML in PHP:

  1. Using Dom\HTMLDocument for vanilla PHP
  2. Using the Simple HTML DOM Parser library
  3. Using Symfony’s DomCrawler component

In all three cases, you will see how to parse either the HTML string retrieved via cURL or the HTML content read from the local index.html file.

Then, you will learn how to use the methods provided by each PHP HTML parsing library to select all hockey team entries on the page and extract data from them:

The table on the target page

The final result will be a list of scraped hockey team entries containing the following details:

  • Team Name
  • Year
  • Wins
  • Losses
  • Win %
  • Goals For (GF)
  • Goals Against (GA)
  • Goal Difference

You can extract them from the HTML table with this structure:

The HTML DOM structure of the table's rows

As you can see, each column in a table row has a specific class. You can extract data from it by selecting elements using their class as a CSS selector and then retrieving their content by accessing their text.
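
To make that structure concrete, here is a hedged sketch built on a hand-written, simplified stand-in for one table row. The class names name, year, wins, losses, pct, gf, ga, and diff are the ones used by the selectors later in this guide; the ot-losses class and the header labels are assumptions about the live page, so verify them against the actual HTML.

```php
<?php

// Simplified stand-in for the target table (an assumption, not the live markup)
$html = <<<HTML
<!doctype html>
<table class="table">
  <tr>
    <th>Team Name</th><th>Year</th><th>Wins</th><th>Losses</th><th>OT Losses</th>
    <th>Win %</th><th>Goals For (GF)</th><th>Goals Against (GA)</th><th>+ / -</th>
  </tr>
  <tr class="team">
    <td class="name">Boston Bruins</td>
    <td class="year">1990</td>
    <td class="wins">44</td>
    <td class="losses">24</td>
    <td class="ot-losses"></td>
    <td class="pct">0.55</td>
    <td class="gf">299</td>
    <td class="ga">264</td>
    <td class="diff">35</td>
  </tr>
</table>
HTML;

// LIBXML_NOERROR silences any parser warnings about the markup
$dom = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

// map each cell's position in the row to its class name
$classes = [];
foreach ($dom->querySelectorAll("tr.team td") as $index => $cell) {
  $classes[$index] = $cell->className;
}

print_r($classes);
```

Under these assumptions, each data cell can be addressed either by its position in the row or by its class, which is the distinction the three approaches below build on.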

Keep in mind that parsing HTML is just one step in a web scraping script. To dive deeper, read our tutorial on web scraping with PHP.

Now, let’s explore three different approaches to HTML parsing in PHP.

Approach #1: With Dom\HTMLDocument

PHP 8.4+ comes with a built-in Dom\HTMLDocument class. This represents an HTML document and allows you to parse HTML content and navigate the DOM tree. See how to use it for HTML parsing in PHP!

Step #1: Installation and Set Up

Dom\HTMLDocument is part of PHP’s DOM extension, which is bundled with PHP and enabled by default in most builds. If it is missing on your system, enable it or install it on Debian-based Linux distributions with this command:

sudo apt-get install php-dom

No further action is needed. You are now ready to use Dom\HTMLDocument for HTML parsing in PHP.

Step #2: HTML Parsing

You can parse the HTML string as below:

$dom = \Dom\HTMLDocument::createFromString($html);

Equivalently, you can parse the index.html file with:

$dom = \Dom\HTMLDocument::createFromFile("./index.html");

$dom is a Dom\HTMLDocument object that exposes the methods you need for data parsing.
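
Real-world pages are rarely perfectly valid, and the parser may emit warnings for them. As a small sketch, the LIBXML_NOERROR option can be passed as the second argument to suppress those warnings; the sloppy markup below is made up for illustration.

```php
<?php

// Deliberately sloppy markup: no doctype and unclosed <p> tags
$html = "<title>Demo</title><p>First<p>Second";

// LIBXML_NOERROR silences the warnings the HTML5 parser would otherwise emit
$dom = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

// the parser still builds a complete DOM tree from the broken input
echo $dom->querySelector("title")->textContent, "\n";
echo $dom->querySelectorAll("p")->length, "\n";
```

The same option can be passed as the second argument of createFromFile() when parsing index.html.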

Step #3: Data Parsing

You can select all hockey team entries using Dom\HTMLDocument with the following approach:

// select each row of the first table on the page
$table = $dom->getElementsByTagName("table")->item(0);
$rows = $table->getElementsByTagName("tr");

// iterate through each row and extract data
foreach ($rows as $row) {
  $cells = $row->getElementsByTagName("td");

  // skip rows without <td> cells, such as a header row made of <th> cells
  if ($cells->length === 0) {
    continue;
  }

  // extract the data from each column by position
  // (index 4 is not read because that column is not part of the target data)
  $team = trim($cells->item(0)->textContent);
  $year = trim($cells->item(1)->textContent);
  $wins = trim($cells->item(2)->textContent);
  $losses = trim($cells->item(3)->textContent);
  $win_pct = trim($cells->item(5)->textContent);
  $goals_for = trim($cells->item(6)->textContent);
  $goals_against = trim($cells->item(7)->textContent);
  $goal_diff = trim($cells->item(8)->textContent);

  // create an array for the scraped team data
  $team_data = [
    "team" => $team,
    "year" => $year,
    "wins" => $wins,
    "losses" => $losses,
    "win_pct" => $win_pct,
    "goals_for" => $goals_for,
    "goals_against" => $goals_against,
    "goal_diff" => $goal_diff
  ];

  // print the scraped team data
  print_r($team_data);
  print ("\n");
}

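The snippet above relies on getElementsByTagName() and manual iteration, reading each cell by its position in the row. Note that Dom\HTMLDocument also provides querySelector() and querySelectorAll() for CSS selectors, which this snippet does not use.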

Here is a breakdown of the methods used:

  • getElementsByTagName(): Retrieve all elements of a given tag (like <table>, <tr>, or <td>) within the document or element it is called on.
  • item(): Return an individual element from a list of elements returned by getElementsByTagName().
  • textContent: This property gives the raw text content of an element, allowing you to extract the visible data (like the team name, year, etc.).

We also used trim() to remove extra whitespace before and after the text content for cleaner data.
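
As an alternative to positional access, here is a hedged sketch that selects cells by class with querySelectorAll() and querySelector(), both available on Dom\HTMLDocument and its elements in PHP 8.4. The sample HTML is a shortened, hand-written stand-in for the target table, and the $selectors map is a name introduced here for illustration.

```php
<?php

// Shortened stand-in for the target table (an assumption, not the live markup)
$html = <<<HTML
<!doctype html>
<html><body>
<table class="table">
  <tr><th>Team Name</th><th>Year</th><th>Wins</th><th>+ / -</th></tr>
  <tr class="team">
    <td class="name"> Boston Bruins </td>
    <td class="year"> 1990 </td>
    <td class="wins"> 44 </td>
    <td class="diff"> 35 </td>
  </tr>
</table>
</body></html>
HTML;

$dom = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

// output key => CSS selector of the cell holding that value
$selectors = [
  "team" => ".name",
  "year" => ".year",
  "wins" => ".wins",
  "goal_diff" => ".diff",
];

$teams = [];
// "tr.team" matches only data rows, so the header row is skipped automatically
foreach ($dom->querySelectorAll("table tr.team") as $row) {
  $team_data = [];
  foreach ($selectors as $key => $selector) {
    // ?-> yields null instead of an error when a cell is missing
    $team_data[$key] = trim($row->querySelector($selector)?->textContent ?? "");
  }
  $teams[] = $team_data;
}

print_r($teams);
```

Selecting by class keeps the extraction working even if columns are reordered, at the cost of depending on the page's class names.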

When added to index.php, the above snippet will produce this result:

Array
(
    [team] => Boston Bruins
    [year] => 1990
    [wins] => 44
    [losses] => 24
    [win_pct] => 0.55
    [goals_for] => 299
    [goals_against] => 264
    [goal_diff] => 35
)

// omitted for brevity...

Array
(
    [team] => Detroit Red Wings
    [year] => 1994
    [wins] => 33
    [losses] => 11
    [win_pct] => 0.688
    [goals_for] => 180
    [goals_against] => 117
    [goal_diff] => 63
)

Approach #2: Using Simple HTML DOM Parser

Simple HTML DOM Parser is a lightweight PHP library that makes it easy to parse and manipulate HTML content. The library is actively maintained and has over 880 stars on GitHub.

Step #1: Installation and Set Up

You can install Simple HTML DOM Parser via Composer with this command:

composer require voku/simple_html_dom

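Alternatively, the original Simple HTML DOM Parser is distributed as a single simple_html_dom.php file that you can download and include manually. Keep in mind that its API differs from the voku\helper\HtmlDomParser class used in this guide.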

Then, import it in index.php with this line of code:

use voku\helper\HtmlDomParser;

Step #2: HTML Parsing

To parse an HTML string, use the str_get_html() method:

$dom = HtmlDomParser::str_get_html($html);

For parsing index.html, use file_get_html() instead:

$dom = HtmlDomParser::file_get_html("./index.html");

This will load the HTML content into a $dom object, which allows you to navigate the DOM easily.

Step #3: Data Parsing

Extract the hockey team data from the HTML using Simple HTML DOM Parser:

// find all rows in the table
$rows = $dom->findMulti("table tr.team");

// loop through each row to extract the data
foreach ($rows as $row) {
  // extract data using CSS selectors
  $team_element = $row->findOne(".name");
  $team = trim($team_element->plaintext);

  $year_element = $row->findOne(".year");
  $year = trim($year_element->plaintext);

  $wins_element = $row->findOne(".wins");
  $wins = trim($wins_element->plaintext);

  $losses_element = $row->findOne(".losses");
  $losses = trim($losses_element->plaintext);

  $win_pct_element = $row->findOne(".pct");
  $win_pct = trim($win_pct_element->plaintext);

  $goals_for_element = $row->findOne(".gf");
  $goals_for = trim($goals_for_element->plaintext);

  $goals_against_element = $row->findOne(".ga");
  $goals_against = trim($goals_against_element->plaintext);

  $goal_diff_element = $row->findOne(".diff");
  $goal_diff = trim($goal_diff_element->plaintext);

  // create an array with the extracted team data
  $team_data = [
    "team" => $team,
    "year" => $year,
    "wins" => $wins,
    "losses" => $losses,
    "win_pct" => $win_pct,
    "goals_for" => $goals_for,
    "goals_against" => $goals_against,
    "goal_diff" => $goal_diff
  ];

  // print the scraped team data
  print_r($team_data);
  print("\n");
}

The Simple HTML DOM Parser features used above are:

  • findMulti(): Select all elements identified by the given CSS selector.
  • findOne(): Locate the first element matching the given CSS selector.
  • plaintext: A property that returns the raw text content inside an HTML element.

This time, we selected each cell with a CSS selector based on its class, which is more robust than relying on the position of the cell in the row. Still, the result will be the same as in the first approach.

Approach #3: Using Symfony’s DomCrawler Component

Symfony’s DomCrawler component provides an easy way to parse HTML documents and extract data from them.

Note: The component is part of the Symfony framework but can also be used standalone, as we will do in this section.

Step #1: Installation and Set Up

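Install Symfony’s DomCrawler component with the Composer command below. It also installs the CssSelector component, which the filter() method used in this section requires to handle CSS selectors:

composer require symfony/dom-crawler symfony/css-selector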

Then, import it in the index.php file:

use Symfony\Component\DomCrawler\Crawler;

Step #2: HTML Parsing

To parse an HTML string, pass it to the Crawler constructor:

$crawler = new Crawler($html);

For parsing a file, use file_get_contents() and create the Crawler instance:

$crawler = new Crawler(file_get_contents("./index.html"));

The above lines will load the HTML content into the $crawler object, which provides easy methods to traverse and extract data.

Step #3: Data Parsing

Extract the hockey team data using the DomCrawler component:

// select all rows within the table
$rows = $crawler->filter("table tr.team");

// loop through each row to extract the data
$rows->each(function ($row, $i) {
  // extract data using CSS selectors
  $team_element = $row->filter(".name");
  $team = trim($team_element->text());

  $year_element = $row->filter(".year");
  $year = trim($year_element->text());

  $wins_element = $row->filter(".wins");
  $wins = trim($wins_element->text());

  $losses_element = $row->filter(".losses");
  $losses = trim($losses_element->text());

  $win_pct_element = $row->filter(".pct");
  $win_pct = trim($win_pct_element->text());

  $goals_for_element = $row->filter(".gf");
  $goals_for = trim($goals_for_element->text());

  $goals_against_element = $row->filter(".ga");
  $goals_against = trim($goals_against_element->text());

  $goal_diff_element = $row->filter(".diff");
  $goal_diff = trim($goal_diff_element->text());

  // create an array with the extracted team data
  $team_data = [
    "team" => $team,
    "year" => $year,
    "wins" => $wins,
    "losses" => $losses,
    "win_pct" => $win_pct,
    "goals_for" => $goals_for,
    "goals_against" => $goals_against,
    "goal_diff" => $goal_diff
  ];

  // print the scraped team data
  print_r($team_data);
  print ("\n");
});

The DomCrawler methods used are:

  • each(): To iterate over a list of selected elements.
  • filter(): Select elements based on CSS selectors.
  • text(): Extract the text content of the selected elements.
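
Since each() returns an array with the value returned by the callback for every node, the rows can also be collected instead of printed. The following is a sketch under a few assumptions: both symfony/dom-crawler and symfony/css-selector are installed via Composer, and the sample HTML (including the made-up "Example Team" row with a missing cell) is hand-written for illustration.

```php
<?php

require_once __DIR__ . "/vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

// Hand-written sample; the second row is made up and lacks a .wins cell
$html = <<<HTML
<!doctype html>
<html><body><table>
  <tr class="team">
    <td class="name"> Boston Bruins </td>
    <td class="year"> 1990 </td>
    <td class="wins"> 44 </td>
  </tr>
  <tr class="team">
    <td class="name"> Example Team </td>
    <td class="year"> 1990 </td>
  </tr>
</table></body></html>
HTML;

$crawler = new Crawler($html);

// each() collects the callback's return value for every matched row
$teams = $crawler->filter("table tr.team")->each(function (Crawler $row) {
  return [
    // text("") returns the default "" instead of throwing when nothing matches
    "team" => trim($row->filter(".name")->text("")),
    "year" => trim($row->filter(".year")->text("")),
    "wins" => trim($row->filter(".wins")->text("")),
  ];
});

print_r($teams);
```

Passing a default to text() is a design choice: it avoids an exception when a row lacks a cell, at the price of silently producing an empty string.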

Wonderful! You are now a PHP HTML parsing master.

Parsing HTML in PHP: Comparison Table

You can compare the three approaches to parsing HTML in PHP explored here in the summary table below:

Feature | Dom\HTMLDocument | Simple HTML DOM Parser | Symfony’s DomCrawler
Type | Native PHP component | External library | Symfony component
GitHub Stars | – | 880+ | 4,000+
XPath Support | ✔️ | – | ✔️
CSS Selector Support | ✔️ (querySelector(), querySelectorAll()) | ✔️ | ✔️
Learning Curve | Low | Low to Medium | Medium
Simplicity of Use | Medium | High | High
API | Basic | Rich | Rich

Conclusion

In this article, you learned about three approaches to HTML parsing in PHP, ranging from using vanilla built-in extensions to third-party libraries.

While all these solutions work, keep in mind that the target web page might use JavaScript for rendering. In that case, simple HTML parsing approaches like the ones presented above will not work. Instead, you need a fully-fledged scraping browser with advanced HTML parsing capabilities like Scraping Browser.

Want to skip HTML parsing and get the data immediately? Check out our ready-to-use datasets covering hundreds of websites!

Create a free Bright Data account today to test our data and scraping solutions with a free trial!
