In the context of using Guzzle, a proxy acts as an intermediary server connecting your client application with the intended web server. It facilitates the forwarding of your requests to the desired server and returns the server’s response to your client. Additionally, proxies are instrumental in circumventing IP-based restrictions that may block web scraping activities or limit access to certain websites, besides offering benefits like caching server responses to reduce the number of direct requests to the target server.
This guide covers the essentials of using a proxy with Guzzle effectively.
Getting Started: Requirements and Integration
Before proceeding, make sure you have PHP version 7.2.5 or higher and Composer installed on your system. A basic understanding of web scraping with PHP will also be beneficial for following this guide. Begin by creating a new directory for your project and use Composer to install Guzzle within it:
composer require guzzlehttp/guzzle
Next, create a PHP file within the newly established directory and include Composer’s autoloader to proceed:
<?php
// Load Composer's autoloader
require 'vendor/autoload.php';
With that in place, we’re ready to configure the proxy settings.
Utilizing a Proxy with Guzzle
This section shows how to send a request through an authenticated proxy with Guzzle. First, obtain your proxies, confirm they are active, and make sure each one follows the format: <PROXY_PROTOCOL>://<PROXY_USERNAME>:<PROXY_PASSWORD>@<PROXY_HOST>:<PROXY_PORT>.
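Credentials that contain characters such as "@" or ":" can break the format above. The sketch below assembles the URL from its parts and percent-encodes the username and password. The helper name build_proxy_url and the sample values are hypothetical, and it assumes your HTTP handler decodes percent-encoded proxy credentials, as cURL does.

```php
<?php
// Hypothetical helper (not part of Guzzle): assembles a proxy URL from its parts.
function build_proxy_url(string $scheme, string $user, string $pass, string $host, int $port): string
{
    // Percent-encode the credentials so "@" or ":" inside them
    // cannot be mistaken for URL delimiters.
    return sprintf(
        '%s://%s:%s@%s:%d',
        $scheme,
        rawurlencode($user),
        rawurlencode($pass),
        $host,
        $port
    );
}

// Placeholder values for illustration only.
echo build_proxy_url('http', 'USERNAME', 'p@ss:word', 'proxy.example.com', 8080), PHP_EOL;
// prints "http://USERNAME:p%40ss%3Aword@proxy.example.com:8080"
```

The resulting string can be used wherever this guide passes a proxy URL to Guzzle.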
Key insight: Guzzle allows proxy use either through request-options or middleware. For straightforward, unchanged proxy setups, request-options are suitable. Conversely, middleware offers enhanced flexibility and control, albeit with more initial configuration.
We’ll cover both approaches, starting with request-options.
Method A: Set a Guzzle Proxy with request-options
To set a proxy with request-options, start by importing Guzzle’s Client and RequestOptions classes:
use GuzzleHttp\Client;
use GuzzleHttp\RequestOptions;
Then, define your target URL and an associative array of all the proxies you’ll use:
# make request to
$targetUrl = 'https://lumtest.com/myip.json';
# proxies
$proxies = [
'http' => 'http://USERNAME:PASSWORD@PROXY_HOST:22225',
'https' => 'http://USERNAME:PASSWORD@PROXY_HOST:22225',
];
The target URL, lumtest, returns the IP address of any client that sends it a GET request. With this setup, Guzzle routes both HTTP and HTTPS traffic through the proxies assigned to the http and https keys. In this example, both keys point to the same HTTP proxy.
Next, we’ll initiate a Guzzle client instance, incorporating the previously defined proxies by assigning them to the proxy option in Guzzle’s configuration.
$client = new Client([
RequestOptions::PROXY => $proxies,
RequestOptions::VERIFY => false, # disable SSL certificate validation
RequestOptions::TIMEOUT => 30, # timeout of 30 seconds
]);
Proxy servers that intercept traffic often cause SSL certificate verification to fail, so this setup disables verification through the verify option. That is acceptable for a quick test, but it removes protection against tampering, so avoid it in production. The timeout option limits each request to a maximum of thirty seconds. With the client configured, we can execute the request and display the response:
try {
$body = $client->get($targetUrl)->getBody();
echo $body->getContents();
} catch (\Exception $e) {
echo $e->getMessage();
}
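The client configuration above turns certificate verification off entirely. As an alternative, Guzzle's verify option also accepts true (use the system CA bundle) or a path to a CA bundle file. The sketch below picks one of those two values. It assumes your proxy provider supplies a CA certificate file; the helper name resolve_verify_option and the PROXY_CA_BUNDLE environment variable are hypothetical.

```php
<?php
// Hypothetical helper: chooses a value for Guzzle's "verify" request option.
// $caBundlePath is an assumed location of a CA certificate file.
function resolve_verify_option(?string $caBundlePath)
{
    // Use the custom CA bundle when it exists and is readable...
    if ($caBundlePath !== null && is_file($caBundlePath) && is_readable($caBundlePath)) {
        return $caBundlePath;
    }
    // ...otherwise verify against the system's default CA bundle.
    return true;
}

// Options array in the shape passed to "new Client([...])" in this guide.
// PROXY_CA_BUNDLE is an assumed environment variable name.
$options = [
    'proxy'   => [
        'http'  => 'http://USERNAME:PASSWORD@PROXY_HOST:22225',
        'https' => 'http://USERNAME:PASSWORD@PROXY_HOST:22225',
    ],
    'verify'  => resolve_verify_option(getenv('PROXY_CA_BUNDLE') ?: null),
    'timeout' => 30,
];

var_export($options['verify']);
echo PHP_EOL;
```

If the target still fails verification through the proxy, that is a sign the proxy is re-signing traffic and needs its own certificate in the bundle.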
By now, your PHP script ought to resemble this:
<?php
# composer's autoloader
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use GuzzleHttp\RequestOptions;
# make request to
$targetUrl = 'https://lumtest.com/myip.json';
# proxies
$proxies = [
'http' => 'http://USERNAME:PASSWORD@PROXY_HOST:22225',
'https' => 'http://USERNAME:PASSWORD@PROXY_HOST:22225',
];
$client = new Client([
RequestOptions::PROXY => $proxies,
RequestOptions::VERIFY => false, # disable SSL certificate validation
RequestOptions::TIMEOUT => 30, # timeout of 30 seconds
]);
try {
$body = $client->get($targetUrl)->getBody();
echo $body->getContents();
} catch (\Exception $e) {
echo $e->getMessage();
}
?>
Execute your script with the command php <filename>.php, and you’ll receive output similar to the example below:
{"ip":"212.80.220.187","country":"IE","asn":{"asnum":9009,"org_name":"M247 Europe SRL"},"geo":{"city":"Dublin","region":"L","region_name":"Leinster","postal_code":"D12","latitude":53.323,"longitude":-6.3159,"tz":"Europe/Dublin","lum_city":"dublin","lum_region":"l"}}
Excellent! The value of the ip key is the IP address of the client that made the request to lumtest. In this instance, it should be the proxy’s IP address rather than your own.
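To check the result in code rather than by eye, the response body can be decoded and the ip field read directly. This is a minimal sketch using a shortened copy of the sample output above; in your script, the string would come from $body->getContents().

```php
<?php
// Sample body, shortened from the output shown above.
$body = '{"ip":"212.80.220.187","country":"IE","geo":{"city":"Dublin"}}';

// Decode into an associative array; json_decode() returns null on invalid JSON.
$data = json_decode($body, true);

if (!is_array($data) || !isset($data['ip'])) {
    echo 'Unexpected response format', PHP_EOL;
} else {
    // The reported IP should belong to the proxy, not to your own machine.
    echo 'Exit IP: ', $data['ip'], ' (', $data['country'] ?? 'unknown', ')', PHP_EOL;
}
// prints "Exit IP: 212.80.220.187 (IE)"
```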
Method B: Set a Guzzle Proxy with Middleware
Employing middleware for setting a Guzzle HTTP proxy follows a pattern similar to the first method. The sole distinction lies in creating and incorporating proxy middleware into the default handler stack.
To begin, adjust your import as follows:
# ...
use Psr\Http\Message\RequestInterface;
use GuzzleHttp\HandlerStack;
# ...
Then, establish a proxy middleware by inserting the following code immediately after your $proxies array. This middleware will intercept every request and configure the proxies accordingly.
function proxy_middleware(array $proxies)
{
return function (callable $handler) use ($proxies) {
return function (RequestInterface $request, array $options) use ($handler, $proxies) {
# add proxy to request option
$options[RequestOptions::PROXY] = $proxies;
return $handler($request, $options);
};
};
}
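The middleware above overwrites the proxy option on every request. If you want individual requests to be able to supply their own proxy, one possible variant sets the option only when it is missing. The sketch below also exercises the middleware without any network access by using a stand-in handler. The names default_proxy_middleware and $fakeHandler are hypothetical, the request type hint is dropped so the snippet runs without Guzzle installed, and a real Guzzle handler would return a promise rather than the options array.

```php
<?php
// Variant of proxy_middleware() that keeps a proxy already set on the request.
function default_proxy_middleware(array $proxies): callable
{
    return function (callable $handler) use ($proxies): callable {
        return function ($request, array $options) use ($handler, $proxies) {
            // 'proxy' is the string value of RequestOptions::PROXY.
            if (!isset($options['proxy'])) {
                $options['proxy'] = $proxies;
            }
            return $handler($request, $options);
        };
    };
}

// Stand-in handler for a dry run: returns the options it receives
// instead of sending anything over the network.
$fakeHandler = function ($request, array $options) {
    return $options;
};

$wrapped = default_proxy_middleware(['http' => 'http://PROXY_HOST:22225'])($fakeHandler);

print_r($wrapped(null, []));                                    // default proxy applied
print_r($wrapped(null, ['proxy' => 'http://OTHER_HOST:8080'])); // per-request proxy kept
```

In a real script, this function would be pushed onto the handler stack in the same way as proxy_middleware.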
Now, we can integrate the middleware into the default handler stack and refresh our Guzzle client by incorporating the stack:
$stack = HandlerStack::create();
$stack->push(proxy_middleware($proxies));
$client = new Client([
'handler' => $stack,
RequestOptions::VERIFY => false, # disable SSL certificate validation
RequestOptions::TIMEOUT => 30, # timeout of 30 seconds
]);
Your PHP script should look like this:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use GuzzleHttp\RequestOptions;
use Psr\Http\Message\RequestInterface;
use GuzzleHttp\HandlerStack;
# make request to
$targetUrl = 'https://lumtest.com/myip.json';
# proxies
$proxies = [
'http' => 'http://USERNAME:PASSWORD@PROXY_HOST:22225',
'https' => 'http://USERNAME:PASSWORD@PROXY_HOST:22225',
];
function proxy_middleware(array $proxies)
{
return function (callable $handler) use ($proxies) {
return function (RequestInterface $request, array $options) use ($handler, $proxies) {
# add proxy to request option
$options[RequestOptions::PROXY] = $proxies;
return $handler($request, $options);
};
};
}
$stack = HandlerStack::create();
$stack->push(proxy_middleware($proxies));
$client = new Client([
'handler' => $stack,
RequestOptions::VERIFY => false, # disable SSL certificate validation
RequestOptions::TIMEOUT => 30, # timeout of 30 seconds
]);
try {
$body = $client->get($targetUrl)->getBody();
echo $body->getContents();
} catch (\Exception $e) {
echo $e->getMessage();
}
?>
Execute the PHP script once more, and you’ll obtain results akin to those of the other method.
Implementing a Rotating Proxy with Guzzle
A rotating proxy is a proxy server that frequently changes its IP address. This helps circumvent IP blocking, because each request originates from a different IP, which makes it harder to identify a bot operating from a single source.
With a provider such as Bright Data’s proxy services, this is fairly easy to implement. For example, the function below appends a random session ID to the proxy username on every call:
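If you manage your own list of proxies instead of relying on a provider-side rotation, the same idea can be sketched as a simple round-robin pool. The ProxyPool class and the proxy addresses below are hypothetical; the class only cycles through the list and returns each entry in the array shape Guzzle's proxy option expects.

```php
<?php
// Minimal round-robin rotator over a fixed list of proxy URLs (hypothetical class).
final class ProxyPool
{
    private $proxies;
    private $index = 0;

    public function __construct(array $proxies)
    {
        if ($proxies === []) {
            throw new InvalidArgumentException('Proxy pool must not be empty.');
        }
        // Re-index so positions run from 0 to count - 1.
        $this->proxies = array_values($proxies);
    }

    // Returns the next proxy in the ['http' => ..., 'https' => ...] shape.
    public function next(): array
    {
        $proxy = $this->proxies[$this->index];
        // Advance and wrap around at the end of the list.
        $this->index = ($this->index + 1) % count($this->proxies);
        return ['http' => $proxy, 'https' => $proxy];
    }
}

// Placeholder proxy addresses for illustration only.
$pool = new ProxyPool(['http://PROXY_A:8080', 'http://PROXY_B:8080']);
echo $pool->next()['http'], PHP_EOL; // http://PROXY_A:8080
echo $pool->next()['http'], PHP_EOL; // http://PROXY_B:8080
echo $pool->next()['http'], PHP_EOL; // http://PROXY_A:8080
```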
function get_random_proxies(): array {
// Base proxy URL before the session identifier
$baseProxyUrl = 'http://USERNAME-session-';
$sessionSuffix = rand(1000, 9999); // Random integer between 1000 and 9999
$proxyCredentials = ':PASSWORD@PROXY_HOST:22225';
$httpProxy = $baseProxyUrl . $sessionSuffix . $proxyCredentials;
$httpsProxy = $baseProxyUrl . $sessionSuffix . $proxyCredentials;
$proxies = [
'http' => $httpProxy,
'https' => $httpsProxy,
];
return $proxies;
}
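Next, add a function that sends the request through a freshly generated proxy and retries on failure, then call it:
# returns the response body, or null if every attempt failed
function rotating_proxy_request(string $http_method, string $targetUrl, int $max_attempts = 3): ?string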
{
$response = null;
$attempts = 1;
while ($attempts <= $max_attempts) {
$proxies = get_random_proxies();
echo "Using proxy: ".json_encode($proxies).PHP_EOL;
$client = new Client([
RequestOptions::PROXY => $proxies,
RequestOptions::VERIFY => false, # disable SSL certificate validation
RequestOptions::TIMEOUT => 30, # timeout of 30 seconds
]);
try {
$body = $client->request(strtoupper($http_method), $targetUrl)->getBody();
$response = $body->getContents();
break;
} catch (\Exception $e) {
echo $e->getMessage().PHP_EOL;
echo "Attempt ".$attempts." failed!".PHP_EOL;
if ($attempts < $max_attempts) {
echo "Retrying with a new proxy".PHP_EOL;
}
$attempts += 1;
}
}
return $response;
}
$response = rotating_proxy_request('get', 'https://lumtest.com/myip.json');
echo $response;
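Note that the "Using proxy" line in the function above prints the full proxy URL, password included. If that output ends up in logs, you may prefer to mask the password first. The helper below is a hypothetical sketch built on PHP's parse_url(); the host name in the example is a placeholder.

```php
<?php
// Hypothetical helper: hides the password of a proxy URL before logging it.
function redact_proxy_url(string $proxyUrl): string
{
    $parts = parse_url($proxyUrl);
    if ($parts === false || !isset($parts['host'])) {
        return '[invalid proxy URL]';
    }

    // Keep the username (it carries the session ID) but mask the password.
    $auth = '';
    if (isset($parts['user'])) {
        $auth = $parts['user'] . (isset($parts['pass']) ? ':***' : '') . '@';
    }

    $scheme = isset($parts['scheme']) ? $parts['scheme'] . '://' : '';
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';

    return $scheme . $auth . $parts['host'] . $port;
}

echo redact_proxy_url('http://USERNAME-session-1234:PASSWORD@proxy.example.com:22225'), PHP_EOL;
// prints "http://USERNAME-session-1234:***@proxy.example.com:22225"
```

Inside rotating_proxy_request, you could apply it to each entry, for example with array_map('redact_proxy_url', $proxies), before calling json_encode.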
Here’s the full PHP script:
<?php
# composer's autoloader
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use GuzzleHttp\RequestOptions;
function get_random_proxies(): array {
// Base proxy URL before the session identifier
$baseProxyUrl = 'http://USERNAME-session-';
// Session ID suffix and proxy credentials
$sessionSuffix = rand(1000, 9999); // Random integer between 1000 and 9999
$proxyCredentials = ':PASSWORD@PROXY_HOST:22225';
// Assemble the full proxy URLs with the randomized session ID
$httpProxy = $baseProxyUrl . $sessionSuffix . $proxyCredentials;
$httpsProxy = $baseProxyUrl . $sessionSuffix . $proxyCredentials;
// Package and return the proxies
$proxies = [
'http' => $httpProxy,
'https' => $httpsProxy,
];
return $proxies;
}
# returns the response body, or null if every attempt failed
function rotating_proxy_request(string $http_method, string $targetUrl, int $max_attempts = 3): ?string
{
$response = null;
$attempts = 1;
while ($attempts <= $max_attempts) {
$proxies = get_random_proxies();
echo "Using proxy: ".json_encode($proxies).PHP_EOL;
$client = new Client([
RequestOptions::PROXY => $proxies,
RequestOptions::VERIFY => false, # disable SSL certificate validation
RequestOptions::TIMEOUT => 30, # timeout of 30 seconds
]);
try {
$body = $client->request(strtoupper($http_method), $targetUrl)->getBody();
$response = $body->getContents();
break;
} catch (\Exception $e) {
echo $e->getMessage().PHP_EOL;
echo "Attempt ".$attempts." failed!".PHP_EOL;
if ($attempts < $max_attempts) {
echo "Retrying with a new proxy".PHP_EOL;
}
$attempts += 1;
}
}
return $response;
}
$response = rotating_proxy_request('get', 'https://lumtest.com/myip.json');
echo $response;
Conclusion
In this guide, we’ve covered the necessary steps for integrating proxies with Guzzle. You’ve learned:
- The fundamentals of employing a proxy when working with Guzzle.
- Strategies for implementing a rotating proxy system.
Bright Data offers a dependable rotating proxy service accessible via API calls, along with sophisticated features designed to circumvent anti-bot measures, enhancing the efficiency of your scraping endeavors.