Web Scraping With Goutte In PHP: 2025 Tutorial

Master Goutte web scraping with this step-by-step guide. Learn setup, alternatives, and how to bypass scraping limitations for better data extraction.

In this Goutte web scraping guide, you will learn:

  • What the PHP library Goutte is
  • How to use it for web scraping in a step-by-step tutorial
  • Alternatives to Goutte for web scraping
  • The limitations of this approach and possible solutions

Let’s dive in!

What Is Goutte?

Goutte is a PHP library for screen scraping and web crawling, offering an intuitive API to navigate websites and extract data from HTML/XML responses. It includes an integrated HTTP client and HTML parsing capabilities, allowing you to retrieve web pages through HTTP requests and process them for data scraping.
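
For example, fetching a page and reading its <title> element takes just a few lines (a minimal sketch, assuming Goutte is already installed; the full setup is covered in the tutorial below):

use Goutte\Client;

// download and parse the page, then print its <title>
$client = new Client();
$crawler = $client->request("GET", "https://example.com/");
echo $crawler->filter("title")->text();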

Note: As of April 1, 2023, Goutte is no longer maintained and is now considered deprecated. However, as of this writing, it still functions reliably.

How to Perform Web Scraping With Goutte: Step-By-Step Guide

Follow this step-by-step tutorial to learn how to use Goutte to extract data from the “Hockey Teams” site:

The "Hockey Teams" target page

The goal is to extract the data from the table above and export it to a CSV file.

Time to learn how to perform web scraping with Goutte!

Step #1: Project Setup

Before you get started, make sure your system meets Goutte’s requirements—PHP 7.1 or higher. To check your current PHP version, run the following command:

php -v

The output should look something like this:

PHP 8.4.3 (cli) (built: Jan 19 2025 14:20:58) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.4.3, Copyright (c) Zend Technologies
    with Zend OPcache v8.4.3, Copyright (c), by Zend Technologies

If your PHP version is lower than 7.1, you will need to upgrade PHP before proceeding.

Next, keep in mind that Goutte will be installed via Composer—a dependency manager for PHP. If Composer is not installed on your system, download it from the official site and follow the installation instructions.

Now, create a new directory for your Goutte project and navigate to it in the terminal:

mkdir goutte-parser
cd goutte-parser

Next, use the composer init command to initialize a Composer project inside the folder:

composer init

Composer will prompt you to enter project details like package name and description. The default answers will work, but feel free to customize them according to your goals.

Now, open the project folder in your favorite PHP IDE. Visual Studio Code with the PHP extension or JetBrains PhpStorm are both good choices.

Create an empty index.php file in the project folder. Your project structure should now look like this:

goutte-parser/
  ├── vendor/
  ├── composer.json
  └── index.php

Open index.php and add the following line of code for importing Composer libraries:

<?php

require_once __DIR__ . "/vendor/autoload.php";

// scraping logic...

This file will soon contain the Goutte scraping logic.

You can now execute your script using this command:

php index.php

Great! You are all set up to start scraping data with Goutte in PHP.

Step #2: Install and Configure Goutte

Install Goutte with the Composer command below:

composer require fabpot/goutte

This will add the fabpot/goutte dependency to your composer.json file, which will now include:

"require": {
    "fabpot/goutte": "^4.0"
}

In index.php, import Goutte by adding the following line of code:

use Goutte\Client;

This exposes the Goutte HTTP client you can use to connect to a target page, parse its HTML, and extract data from it. See how to do that in the next step!

Step #3: Get the HTML of the Target Page

First, create a new Goutte HTTP client:

$client = new Client();

Behind the scenes, Goutte’s Client class is simply a wrapper around Symfony’s BrowserKit\HttpBrowser component. See it in action in our guide on web scraping with Laravel.

Next, store the target webpage URL in a variable and use the request() method to fetch its contents:

$url = "https://www.scrapethissite.com/pages/forms/";
$crawler = $client->request("GET", $url);

This sends a GET request to the webpage, retrieves its HTML document, and parses it for you. Specifically, the $crawler object provides access to all the methods of Symfony’s DomCrawler component. $crawler is the object you will use to navigate and extract data from the page.
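
As a quick sanity check (optional, and not part of the final script), you can print the page title to confirm the HTML was retrieved:

// print the page <title> to verify the request succeeded
echo $crawler->filter("title")->text() . "\n";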

Amazing! You now have everything you need for Goutte web scraping.

Step #4: Prepare to Scrape the Data of Interest

Before extracting data, you must familiarize yourself with the HTML structure of the target page.

First, remember that the data of interest is presented in rows inside a table. Since that table contains multiple rows, an array is a great data structure in which to store the scraped data:

$teams = [];

Now, focus on the HTML structure of the table. Visit the target page in your browser, right-click on the table containing the data of interest, and select the “Inspect” option:

The HTML structure of the table element

In the DevTools, you will see that the table has a table class and is contained within a <section> element with the id="hockey". This means you can target the table using the following CSS selector:

#hockey .table

Apply the CSS selector to select the table node using the $crawler->filter() method:

$table = $crawler->filter("#hockey .table");

Then, note that each row is represented by a <tr> element with the class team. Select all rows and iterate over them, preparing to extract data from them:

$table->filter("tr.team")->each(function ($tr) use (&$teams) {
  // data extraction logic...
});

Wonderful! You now have a skeleton ready for Goutte data scraping.

Step #5: Implement the Data Extraction Logic

Just as before, inspect the HTML of the rows inside the table:

The HTML structure of the row elements

You will notice that each row contains the following information in dedicated columns:

  • Team name → inside the .name element
  • Season year → inside the .year element
  • Number of wins → inside the .wins element
  • Number of losses → inside the .losses element
  • Overtime losses → inside the .ot-losses element
  • Winning percentage → inside the .pct element
  • Goals scored (Goals For – GF) → inside the .gf element
  • Goals conceded (Goals Against – GA) → inside the .ga element
  • Goal difference → inside the .diff element

To retrieve a single piece of information, you need to apply these two steps:

  1. Select the HTML element using filter()
  2. Extract its text content using the text() method and remove any extra spaces with trim()

For example, you can scrape the team name with:

$teamElement = $tr->filter(".name");
$team = trim($teamElement->text());

Similarly, extend this logic to all other columns:

$yearElement = $tr->filter(".year");
$year = trim($yearElement->text());

$winsElement = $tr->filter(".wins");
$wins = trim($winsElement->text());

$lossesElement = $tr->filter(".losses");
$losses = trim($lossesElement->text());

$otLossesElement = $tr->filter(".ot-losses");
$otLosses = trim($otLossesElement->text());

$pctElement = $tr->filter(".pct");
$pct = trim($pctElement->text());

$gfElement = $tr->filter(".gf");
$gf = trim($gfElement->text());

$gaElement = $tr->filter(".ga");
$ga = trim($gaElement->text());

$diffElement = $tr->filter(".diff");
$diff = trim($diffElement->text());
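
As a side note, every column follows the same filter-and-trim pattern, so you could factor it into a small helper function. Here is an optional sketch (extractText() is a made-up name, not part of Goutte's API):

// optional helper: select a child node and return its trimmed text
function extractText($tr, $selector)
{
  return trim($tr->filter($selector)->text());
}

// usage: $team = extractText($tr, ".name");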

Once you have extracted the data of interest from the row, store it in the $teams array:

$teams[] = [
  "team" => $team,
  "year" => $year,
  "wins" => $wins,
  "losses" => $losses,
  "ot_losses" => $otLosses,
  "win_perc" => $pct,
  "goals_for" => $gf,
  "goals_against" => $ga,
  "goal_diff" => $diff
];

After looping through all rows, the $teams array will contain:

Array
(
    [0] => Array
        (
            [team] => Boston Bruins
            [year] => 1990
            [wins] => 44
            [losses] => 24
            [ot_losses] =>
            [win_perc] => 0.55
            [goals_for] => 299
            [goals_against] => 264
            [goal_diff] => 35
        )

    // ...

    [24] => Array
        (
            [team] => Chicago Blackhawks
            [year] => 1991
            [wins] => 36
            [losses] => 29
            [ot_losses] =>
            [win_perc] => 0.45
            [goals_for] => 257
            [goals_against] => 236
            [goal_diff] => 21
        )
)

Terrific! Goutte data scraping performed successfully.

Step #6: Implement the Crawling Logic

Now, do not forget that the target site presents data across multiple pages, showing only a portion at a time. Below the table, there is a pagination element that provides links to all pages:

The pagination element

Thus, you can manage pagination in your scraping script with these simple steps:

  1. Select the pagination link elements
  2. Extract the URLs of the paginated pages
  3. Visit each page and apply the scraping logic devised earlier

Start by inspecting the pagination link elements:

The HTML structure of the pagination link elements

Note that you can select all pagination links using the following CSS selector:

.pagination li a

To implement step 2 and collect all pagination URLs, use this logic:

$urls = [$url];

// select the pagination link elements
$crawler->filter(".pagination li a")->each(function ($a) use (&$urls) {
  // construct the absolute URL
  $url = "https://www.scrapethissite.com" . $a->attr("href");

  // add the pagination URL to the list only if it is not already present
  if (!in_array($url, $urls)) {
    $urls[] = $url;
  }
});

This initializes a list of URLs that will store the pagination links, starting with the first page’s URL. It then selects all pagination elements and iterates over them, adding new URLs to the $urls array only if they are not already present. As the URLs on the page are relative, they must be converted into absolute URLs before adding them to the list.

Since pagination handling should only be executed once and is not directly tied to data extraction, it is best to wrap it in a function:

function getPaginationUrls($client, $url)
{
  // connect to the first page of the site
  $crawler = $client->request("GET", $url);

  // initialize the list of URLs to scrape with the current URL
  $urls = [$url];

  // select the pagination link elements
  $crawler->filter(".pagination li a")->each(function ($a) use (&$urls) {
    // construct the absolute URL
    $url = "https://www.scrapethissite.com" . $a->attr("href");

    // add the pagination URL to the list only if it is not already present
    if (!in_array($url, $urls)) {
      $urls[] = $url;
    }
  });

  return $urls;
}

You can call the getPaginationUrls() function like this:

$urls = getPaginationUrls($client, "https://www.scrapethissite.com/pages/forms/?page_num=1");

After execution, $urls will contain all paginated URLs:

Array
(
    [0] => https://www.scrapethissite.com/pages/forms/?page_num=1
    [1] => https://www.scrapethissite.com/pages/forms/?page_num=2
    [2] => https://www.scrapethissite.com/pages/forms/?page_num=3
    [3] => https://www.scrapethissite.com/pages/forms/?page_num=4
    [4] => https://www.scrapethissite.com/pages/forms/?page_num=5
    [5] => https://www.scrapethissite.com/pages/forms/?page_num=6
    [6] => https://www.scrapethissite.com/pages/forms/?page_num=7
    [7] => https://www.scrapethissite.com/pages/forms/?page_num=8
    [8] => https://www.scrapethissite.com/pages/forms/?page_num=9
    [9] => https://www.scrapethissite.com/pages/forms/?page_num=10
    [10] => https://www.scrapethissite.com/pages/forms/?page_num=11
    [11] => https://www.scrapethissite.com/pages/forms/?page_num=12
    [12] => https://www.scrapethissite.com/pages/forms/?page_num=13
    [13] => https://www.scrapethissite.com/pages/forms/?page_num=14
    [14] => https://www.scrapethissite.com/pages/forms/?page_num=15
    [15] => https://www.scrapethissite.com/pages/forms/?page_num=16
    [16] => https://www.scrapethissite.com/pages/forms/?page_num=17
    [17] => https://www.scrapethissite.com/pages/forms/?page_num=18
    [18] => https://www.scrapethissite.com/pages/forms/?page_num=19
    [19] => https://www.scrapethissite.com/pages/forms/?page_num=20
    [20] => https://www.scrapethissite.com/pages/forms/?page_num=21
    [21] => https://www.scrapethissite.com/pages/forms/?page_num=22
    [22] => https://www.scrapethissite.com/pages/forms/?page_num=23
    [23] => https://www.scrapethissite.com/pages/forms/?page_num=24
)

Perfect! You just implemented web crawling in Goutte.

Step #7: Scrape Data From All Pages

Now that you have all the page URLs stored in an array, you can scrape them one by one by:

  1. Iterating over the list
  2. Retrieving and parsing the HTML content for each URL
  3. Extracting the required data
  4. Storing the scraped information in the $teams array

Implement the above logic as follows:

$teams = [];

// iterate over all pages and scrape them all
foreach ($urls as $url) {
  // logging which page the scraper is currently working on
  echo "Scraping webpage \"$url\"...\n";

  // retrieve the HTML of the current page and parse it
  $crawler = $client->request("GET", $url);

  // $table = $crawler-> ...
  // data extraction logic
}

Note the echo instruction to log the current page the scraper is operating on. That info is useful to understand what the script is doing during execution.

Beautiful! All that remains is to export the scraped data to a human-readable format like CSV.

Step #8: Export the Scraped Data to CSV

Right now, the scraped data is stored in the $teams array. To make it easier to share and analyze, export it to a CSV file.

PHP provides built-in support for CSV export through the fputcsv() function. Use it to write the scraped data to a file named teams.csv as below:

// open the output file for writing
$file = fopen("teams.csv", "w");

// write the header row
fputcsv($file, ["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %","Goals For (GF)",        "Goals Against (GA)", "+ / -"]);

// append each team as a new row
foreach ($teams as $team) {
  fputcsv($file, [
    $team["team"],
    $team["year"],
    $team["wins"],
    $team["losses"],
    $team["ot_losses"],
    $team["win_perc"],
    $team["goals_for"],
    $team["goals_against"],
    $team["goal_diff"]
  ]);
}

// close the file
fclose($file);

Mission complete! The Goutte scraper is fully functional.

Step #9: Put It All Together

Your Goutte web scraping script should now contain:

<?php

require_once __DIR__ . "/vendor/autoload.php";

use Goutte\Client;

function getPaginationUrls($client, $url)
{
  // connect to the first page of the site
  $crawler = $client->request("GET", $url);

  // initialize the list of URLs to scrape with the current URL
  $urls = [$url];

  // select the pagination link elements
  $crawler->filter(".pagination li a")->each(function ($a) use (&$urls) {
    // construct the absolute URL
    $url = "https://www.scrapethissite.com" . $a->attr("href");

    // add the pagination URL to the list only if it is not already present
    if (!in_array($url, $urls)) {
      $urls[] = $url;
    }
  });

  return $urls;
}

// initialize a new Goutte HTTP client
$client = new Client();

// get the URLs of the pages to scrape
$urls = getPaginationUrls($client, "https://www.scrapethissite.com/pages/forms/?page_num=1");

// where to store the scraped data
$teams = [];

// iterate over all pages and scrape them all
foreach ($urls as $url) {
  // logging which page the scraper is currently working on
  echo "Scraping webpage \"$url\"...\n";

  // retrieve the HTML of the current page and parse it
  $crawler = $client->request("GET", $url);

  // select the table element with the data of interest
  $table = $crawler->filter("#hockey .table");

  // iterate over each row and extract data from them
  $table->filter("tr.team")->each(function ($tr) use (&$teams) {
    // data extraction logic

    $teamElement = $tr->filter(".name");
    $team = trim($teamElement->text());

    $yearElement = $tr->filter(".year");
    $year = trim($yearElement->text());

    $winsElement = $tr->filter(".wins");
    $wins = trim($winsElement->text());

    $lossesElement = $tr->filter(".losses");
    $losses = trim($lossesElement->text());

    $otLossesElement = $tr->filter(".ot-losses");
    $otLosses = trim($otLossesElement->text());

    $pctElement = $tr->filter(".pct");
    $pct = trim($pctElement->text());

    $gfElement = $tr->filter(".gf");
    $gf = trim($gfElement->text());

    $gaElement = $tr->filter(".ga");
    $ga = trim($gaElement->text());

    $diffElement = $tr->filter(".diff");
    $diff = trim($diffElement->text());

    // add the scraped data to the array
    $teams[] = [
      "team" => $team,
      "year" => $year,
      "wins" => $wins,
      "losses" => $losses,
      "ot_losses" => $otLosses,
      "win_perc" => $pct,
      "goals_for" => $gf,
      "goals_against" => $ga,
      "goal_diff" => $diff
    ];
  });
}

// open the output file for writing
$file = fopen("teams.csv", "w");

// write the header row
fputcsv($file, ["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %","Goals For (GF)",        "Goals Against (GA)", "+ / -"]);

// append each team as a new row
foreach ($teams as $team) {
  fputcsv($file, [
    $team["team"],
    $team["year"],
    $team["wins"],
    $team["losses"],
    $team["ot_losses"],
    $team["win_perc"],
    $team["goals_for"],
    $team["goals_against"],
    $team["goal_diff"]
  ]);
}

// close the file
fclose($file);

Launch it with this command:

php index.php

The scraper will log output like the following:

Scraping webpage "https://www.scrapethissite.com/pages/forms/?page_num=1"...
// omitted for brevity...
Scraping webpage "https://www.scrapethissite.com/pages/forms/?page_num=24"...

At the end of the execution, a teams.csv file containing this data will appear in the project folder:

The CSV output file

Et voilà! The data from the target site is now available in a structured, human-readable format.

Alternatives to the PHP Goutte Library for Web Scraping

As mentioned at the beginning of this article, Goutte is deprecated and no longer maintained. This means you should consider alternative solutions.

The GitHub announcement of the library deprecation

As explained on GitHub, Goutte v4 essentially became a thin proxy for Symfony's HttpBrowser class, so you should migrate to that class directly. To do so, install these libraries:

composer require symfony/browser-kit symfony/http-client

Then, replace:

use Goutte\Client;

with

use Symfony\Component\BrowserKit\HttpBrowser;

Finally, remove the fabpot/goutte dependency from your project. The underlying API remains the same, so you should not need to change much in your script.
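
For instance, the first steps of this tutorial translate to HttpBrowser almost verbatim (a minimal sketch, assuming the two Symfony packages above are installed):

<?php

require_once __DIR__ . "/vendor/autoload.php";

use Symfony\Component\BrowserKit\HttpBrowser;

// HttpBrowser exposes the same request()/filter() API used throughout this tutorial
$browser = new HttpBrowser();
$crawler = $browser->request("GET", "https://www.scrapethissite.com/pages/forms/");
$table = $crawler->filter("#hockey .table");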

Instead of Goutte, you can also combine an HTTP client with an HTML parser. Some recommended alternatives:

  • Guzzle or cURL for making HTTP requests.
  • Dom\HTMLDocument, Simple HTML DOM Parser, or DomCrawler for parsing HTML in PHP.

All these alternatives give you more flexibility and help ensure that your web scraping script remains maintainable in the long run.
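
For example, here is a minimal sketch pairing Guzzle with DomCrawler (assuming you installed guzzlehttp/guzzle, symfony/dom-crawler, and symfony/css-selector via Composer):

<?php

require_once __DIR__ . "/vendor/autoload.php";

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

// fetch the raw HTML with Guzzle
$client = new Client();
$html = (string) $client->get("https://www.scrapethissite.com/pages/forms/")->getBody();

// parse it with DomCrawler, reusing the same CSS selectors as before
$crawler = new Crawler($html);
echo $crawler->filter("#hockey .table tr.team")->count() . " rows found\n";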

Limitations of This Approach to Web Scraping

Goutte is a powerful tool, but using it for web scraping comes with several limitations:

  • The library is deprecated
  • Its API is no longer maintained
  • It is subject to rate limiters and anti-scraping blocks
  • It cannot handle dynamic pages that rely on JavaScript
  • It has limited built-in proxy support, which is essential for avoiding IP bans (see the sketch after this list)
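
The proxy limitation, for example, can be worked around after migrating to HttpBrowser, since you can configure the underlying HttpClient (a minimal sketch; the proxy URL below is a placeholder):

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

// route all requests through a proxy (placeholder credentials and host)
$client = HttpClient::create([
  "proxy" => "http://username:password@proxy.example.com:8080",
]);
$browser = new HttpBrowser($client);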

Some of these limitations can be mitigated by using alternative libraries or different approaches, as covered in our guide on web scraping with PHP. Still, you will eventually face anti-scraping measures that can only be bypassed using a Web Unlocker API.

A Web Unlocker API is a specialized scraping endpoint designed to bypass anti-bot protections and retrieve the raw HTML of any webpage. Using it is as simple as making an API call and parsing the returned content. This approach integrates seamlessly with Goutte (or Symfony’s updated components), just as demonstrated in this article.
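
In practice, the integration pattern looks like the sketch below. The endpoint, query parameters, and API key are all placeholders (refer to your provider's docs for the real values); the point is that the response is plain HTML you can feed into the same parsing logic:

// hypothetical unlocker endpoint and API key (placeholders, not a real API)
$apiUrl = "https://unlocker.example.com/scrape"
  . "?api_key=YOUR_API_KEY"
  . "&url=" . urlencode("https://www.scrapethissite.com/pages/forms/");

// the API returns the raw, unblocked HTML of the target page
$html = file_get_contents($apiUrl);

// parse it as usual
$crawler = new Symfony\Component\DomCrawler\Crawler($html);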

Conclusion

In this guide, you explored what Goutte is and what it offers for web scraping through a step-by-step tutorial. Since this library is now deprecated, you also had the chance to explore some of its alternatives.

Regardless of which PHP scraping library you choose, the biggest challenge is that most websites protect their data using anti-bot and anti-scraping technologies. These mechanisms can detect and block automated requests, making traditional scraping methods ineffective.

Fortunately, Bright Data offers a suite of solutions to avoid any issue:

  • Web Unlocker: An API that bypasses anti-scraping protections and delivers clean HTML from any webpage with minimal effort.
  • Scraping Browser: A cloud-based, controllable browser with JavaScript rendering. It automatically handles CAPTCHAs, browser fingerprinting, retries, and more for you. It integrates seamlessly with Panther or Selenium PHP.
  • Web Scraping APIs: Endpoints for programmatic access to structured web data from dozens of popular domains.

Don’t want to deal with web scraping but are still interested in web data? Explore our ready-to-use datasets!

Sign up for Bright Data now and start your free trial to test our scraping solutions.
