Today, we’re going to learn how to use proxies with HTTPX. A proxy sits between your scraper and the site you’re trying to scrape: your scraper sends its request to the proxy server, and the proxy fetches the target site and returns it to your scraper.
How To Use Unauthenticated Proxies
In summary, all of our requests go to a proxy_url. Below is an example using an unauthenticated proxy, meaning we don’t use a username or password. This example was inspired by the HTTPX documentation.
import httpx
proxy_url = "http://localhost:8030"
with httpx.Client(proxy=proxy_url) as client:
    ip_info = client.get("https://geo.brdtest.com/mygeo.json")
    print(ip_info.text)
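HTTPX can also route traffic through a different proxy per URL scheme by mounting a transport for each one. Below is a minimal sketch, assuming a recent HTTPX version and reusing the same local proxy for both schemes (in practice, you could swap in separate endpoints):

import httpx

proxy_url = "http://localhost:8030"

# Mount a proxied transport per URL scheme. Each entry could point
# at a different proxy server if needed.
mounts = {
    "http://": httpx.HTTPTransport(proxy=proxy_url),
    "https://": httpx.HTTPTransport(proxy=proxy_url),
}

with httpx.Client(mounts=mounts) as client:
    ip_info = client.get("https://geo.brdtest.com/mygeo.json")
    print(ip_info.text)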
How To Use Authenticated Proxies
When a proxy requires a username and password, it’s called an “authenticated” proxy. These credentials are used to authenticate your account and give you a connection to the proxy.
With authentication, our proxy_url looks like this: http://<username>:<password>@<proxy_url>:<port_number>. In the example below, we use both our zone and username to create the user portion of the authentication string.
We’re using datacenter proxies for our base connection.
import httpx
username = "your-username"
zone = "your-zone-name"
password = "your-password"
proxy_url = f"http://brd-customer-{username}-zone-{zone}:{password}@brd.superproxy.io:33335"
ip_info = httpx.get("https://geo.brdtest.com/mygeo.json", proxy=proxy_url)
print(ip_info.text)
The code above is pretty simple. It’s the basis for any sort of proxy connection you want to set up.
- First, we create our config variables: username, zone, and password.
- We use those to create our proxy_url: f"http://brd-customer-{username}-zone-{zone}:{password}@brd.superproxy.io:33335".
- We make a request to the API to get general information about our proxy connection.
Your response should look similar to this.
{"country":"US","asn":{"asnum":20473,"org_name":"AS-VULTR"},"geo":{"city":"","region":"","region_name":"","postal_code":"","latitude":37.751,"longitude":-97.822,"tz":"America/Chicago"}}
How To Use Rotating Proxies
When we use rotating proxies, we create a list of proxies and choose from them randomly. In the code below, we create a list of countries. When we make a request, we use random.choice() to pick a random country from the list. Our proxy_url gets formatted to fit the country.
The example below creates a small list of rotating proxies.
import httpx
import random

countries = ["us", "gb", "au", "ca"]
username = "your-username"
proxy_url = "brd.superproxy.io:33335"
datacenter_zone = "your-zone"
datacenter_pass = "your-password"

# make one request per entry in the list, each through a randomly chosen country
for _ in countries:
    print("----------connection info-------------")
    datacenter_proxy = f"http://brd-customer-{username}-zone-{datacenter_zone}-country-{random.choice(countries)}:{datacenter_pass}@{proxy_url}"
    ip_info = httpx.get("https://geo.brdtest.com/mygeo.json", proxy=datacenter_proxy)
    print(ip_info.text)
This example really isn’t all that different from our first one. Here are the key differences.
- We create an array of countries: ["us", "gb", "au", "ca"].
- Instead of making a single request, we make multiple ones. Each time we create a new request, we use random.choice(countries) to pick a random country for our proxy_url.
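The same idea works with full proxy URLs. If you have several distinct endpoints instead of one endpoint with multiple countries, you can rotate over those directly. A minimal sketch with hypothetical placeholder URLs:

import httpx
import random

# hypothetical endpoints -- replace with your real proxy URLs
proxies = [
    "http://user1:pass1@proxy-one.example.com:8080",
    "http://user2:pass2@proxy-two.example.com:8080",
    "http://user3:pass3@proxy-three.example.com:8080",
]

for _ in range(3):
    # pick a different proxy at random for each request
    random_proxy = random.choice(proxies)
    ip_info = httpx.get("https://geo.brdtest.com/mygeo.json", proxy=random_proxy)
    print(ip_info.text)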
How To Create a Fallback Proxy Connection
In the examples above, we’ve used only datacenter and free proxies. Free proxies aren’t very reliable, and datacenter proxies tend to get blocked by more difficult sites.
In this example, we create a function called safe_get(). When we call this function, it first tries to get the url using a datacenter connection. If this fails, it falls back to our residential connection.
import asyncio

import httpx
from bs4 import BeautifulSoup

country = "us"
username = "your-username"
proxy_url = "brd.superproxy.io:33335"

datacenter_zone = "datacenter_proxy1"
datacenter_pass = "datacenter-password"
residential_zone = "residential_proxy1"
residential_pass = "residential-password"

cert_path = "/home/path/to/brightdata_proxy_ca/New SSL certifcate - MUST BE USED WITH PORT 33335/BrightData SSL certificate (port 33335).crt"

datacenter_proxy = f"http://brd-customer-{username}-zone-{datacenter_zone}-country-{country}:{datacenter_pass}@{proxy_url}"
residential_proxy = f"http://brd-customer-{username}-zone-{residential_zone}-country-{country}:{residential_pass}@{proxy_url}"

async def safe_get(url: str):
    # first attempt: the cheaper datacenter connection
    async with httpx.AsyncClient(proxy=datacenter_proxy) as client:
        print("trying with datacenter")
        response = await client.get(url)
        if response.status_code == 200:
            # a 200 can still be a CAPTCHA page, so inspect the HTML too
            soup = BeautifulSoup(response.text, "html.parser")
            if not soup.select_one("form[action='/errors/validateCaptcha']"):
                print("response successful")
                return response
    print("response failed")
    # fallback: residential connection, which requires the SSL certificate
    async with httpx.AsyncClient(proxy=residential_proxy, verify=cert_path) as client:
        print("trying with residential")
        response = await client.get(url)
        print("response successful")
        return response

async def main():
    url = "https://www.amazon.com"
    response = await safe_get(url)
    with open("out.html", "w") as file:
        file.write(response.text)

asyncio.run(main())
This example is a bit more complicated than the other ones we’ve dealt with in this article.
- We now have two sets of config variables, one for our datacenter connection and one for our residential connection.
- This time, we use an AsyncClient() session to introduce some of the more advanced functionality of HTTPX.
- First, we attempt to make our request with the datacenter_proxy.
- If we fail to get a proper response, we retry the request using our residential_proxy. Also note the verify flag in the code: when using our residential proxies, you need to download and use our SSL certificate.
- Once we’ve got a solid response, we write the page to an HTML file. We can open this page up in our browser and see what the proxy actually accessed and sent back to us.
If you try the code above, your output and resulting HTML file should look a lot like this.
trying with datacenter
response failed
trying with residential
response successful
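If you want extra resilience on top of the fallback, you can wrap safe_get() in a simple retry loop. Below is a minimal sketch, assuming the safe_get() defined above; the max_retries parameter and backoff values are illustrative:

import asyncio
import httpx

async def get_with_retries(url: str, max_retries: int = 3):
    # hypothetical wrapper around the safe_get() defined above
    for attempt in range(1, max_retries + 1):
        try:
            return await safe_get(url)
        except httpx.HTTPError as exc:
            # catches network-level failures: timeouts, connect errors, etc.
            print(f"attempt {attempt} failed: {exc}")
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"all {max_retries} attempts failed for {url}")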
How Bright Data Products Help
As you’ve probably noticed throughout this article, our datacenter proxies are very affordable and our residential proxies provide an excellent fallback for when datacenter proxies don’t work. We also provide various other tools to assist with your data collection needs.
- Web Unlocker: Get past even the most difficult anti-bots. Web Unlocker automatically recognizes and solves any CAPTCHAs on the page. Once it’s through the anti-bots, it sends you back the web page.
- Scraping Browser: This product has even more features. Scraping Browser actually allows you to control a remote browser with proxy integration and an automated CAPTCHA solver.
- Web Scraper APIs: With these APIs, we do the scraping for you. All you need to do is call the API and parse the JSON data you receive in the response.
- Datasets: Explore our dataset marketplace to find hundreds of pre-collected datasets, or request/build a custom one. You can choose a refresh rate and filter only the data points you need.
Conclusion
When you combine HTTPX with our proxies, you get a private, efficient, and reliable way to scrape the web. If you want to rotate proxies, it’s as simple as using Python’s built-in random library. With a combination of datacenter and residential proxies, you can build a redundant connection that gets past most blocking systems.
As you learned, Bright Data offers the full package for your web scraping projects. Start your free trial with Bright Data’s proxies today!