Guide to Using a Proxy with Python Requests

A guide to using a proxy with Python Requests for web scraping, and why this can be helpful when working on a web scraping project.

A proxy is an IP address provided by a proxy server that connects to the internet on your behalf. When you connect to the internet through a proxy, your requests are routed through the proxy server instead of being sent directly to the website you visit. Using a proxy server is a great way to safeguard your online privacy and enhance security.


The proxy server acts as a middleman, which means your original IP address and location are hidden from the website. This helps protect you from online tracking and targeted advertising, and from being blocked by the website you’re trying to access. Some proxies also offer an added layer of security by encrypting the traffic between your device and the proxy server.

In this article, you’ll learn more about proxies and how you can use them with Python requests. You’ll also learn why this can be helpful when working on a web scraping project.

Why You Need Proxies When Web Scraping

Web scraping is an automated process for extracting data from websites for different purposes, including data aggregation, market research, and data analysis. However, many of these websites have restrictions that make it challenging to access the information you want.

Thankfully, proxies can help you circumvent IP- and location-based restrictions. For instance, some websites serve different information to specific locations, such as a country or state. If you’re not in that particular location, you won’t be able to access the information you’re looking for without a proxy, which masks your IP address and changes your apparent location.

In addition, most websites block the IP addresses of devices that are involved in web scraping activities. In this situation, you can implement a proxy to hide your IP address and location, making it more difficult for the website to identify and block you.

You can also use several proxies at the same time to distribute web scraping activities across different IP addresses and speed up the web scraping process, allowing the scraper to make multiple requests simultaneously.

Now that you know how proxies can help with web scraping projects, you’ll learn how to implement a proxy in your own project using the Python Requests package.

How to Use a Proxy with Python Requests

To use a proxy with Python Requests, you need to set up a new Python project on your computer to write and run the scripts for web scraping. Create a directory (e.g., web_scrape_project) where you’ll store your source code files.

All the code for this tutorial is available in this GitHub repo.

Install Packages

After you’ve created your directory, you need to install the following Python packages to send requests to the web page and collect the links:

  • Requests: The Requests Python package sends HTTP requests to the website you want to scrape. An HTTP request returns a response object containing all the response data, such as the status, encoding, and content. Run the following pip command in your terminal to install the package:

    pip install requests

  • Beautiful Soup: Beautiful Soup is a powerful Python library that parses HTML and XML documents. You’ll use this library to navigate through the HTML document and extract all the links on Bright Data’s web page. To install Beautiful Soup, run the following pip command in your terminal (a quick way to verify both installs is shown right after this list):

    pip install beautifulsoup4
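
If you want to confirm that both packages installed correctly, an optional sanity check is to import them and print their versions. This snippet is not part of the scraping script itself; the exact version numbers will simply be whatever pip installed:

# Optional: verify that Requests and Beautiful Soup can be imported.
import requests
import bs4

print(requests.__version__)  # e.g., 2.x
print(bs4.__version__)       # e.g., 4.x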

Components of a Proxy IP Address

Before using a proxy, it’s best to understand its components. The following are the three primary components of a proxy server:

  1. Protocol: shows the type of content you can access on the internet. The most common protocols are HTTP and HTTPS.
  2. Address: shows where the proxy server is located. The address can be an IP address (e.g., 192.167.0.1) or a DNS hostname (e.g., proxyprovider.com).
  3. Port: directs traffic to the correct server process when multiple services run on a single machine (e.g., port number 2000).

Using all three of these components, a proxy IP address would look like this: 192.167.0.1:2000 or proxyprovider.com:2000.
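
To make this concrete, here’s a minimal sketch of how those components map onto the proxies dictionary that Requests expects. The host proxyprovider.com and port 2000 are the same placeholder values used above, not a real proxy:

# protocol://address:port makes up the proxy URL (placeholder values).
proxy_url = "http://proxyprovider.com:2000"

# Requests takes a dictionary that maps each request scheme to a proxy URL.
proxies = {
    "http": proxy_url,
    "https": proxy_url,
}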

How to Set Proxies Directly in Requests

There are several ways to set proxies in Python requests, and in this article, you’ll look at three different scenarios. In this first example, you’ll learn how to set proxies directly in the requests module.

To start, import the Requests and Beautiful Soup packages into the Python file you’ll use for web scraping. Then create a dictionary called proxies that contains the proxy server information used to hide your IP address when scraping the web page. Here, you have to define both the HTTP and HTTPS connections to the proxy URL.

You also need to define a Python variable to set the URL of the web page you want to scrape the data from. For this tutorial, the URL is https://brightdata.com/.

Next, send a GET request to the web page using the requests.get() method. This method takes two arguments: the URL of the website and the proxies dictionary. The response from the web page is then stored in the response variable.

To collect the links, use the Beautiful Soup package to parse the HTML content of the web page by passing response.content and html.parser as arguments to the BeautifulSoup() method.

Then use the find_all() method with "a" as an argument to find all the links on the web page. Finally, extract the href attribute of each link using the get() method.

Following is the complete source code to set proxies directly in requests:

# import packages.  
import requests  
from bs4 import BeautifulSoup  
  
# Define proxies to use.  
proxies = {  
    'http': 'http://proxyprovider.com:2000',  
    'https': 'http://proxyprovider.com:2000',  
}  
  
# Define a link to the web page.  
url = "https://brightdata.com/"  
  
# Send a GET request to the website.  
response = requests.get(url, proxies=proxies)  
  
# Use BeautifulSoup to parse the HTML content of the website.  
soup = BeautifulSoup(response.content, "html.parser")  
  
# Find all the links on the website.  
links = soup.find_all("a")  
  
# Print all the links.  
for link in links:  
    print(link.get("href"))

When you run this block of code, it sends a request to the web page through the proxy IP address and then prints all the links found on that web page.
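
As a small extension that isn’t part of the original script, you may also want to confirm the request succeeded before parsing the response. Using the url and proxies variables defined above, that could look like this:

# Raise an exception for 4xx/5xx responses instead of parsing an error page.
response = requests.get(url, proxies=proxies)
response.raise_for_status()

# Alternatively, check the status code explicitly before parsing.
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")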

How to Set Proxies via Environment Variables

Sometimes, you have to use the same proxy for all your requests to different web pages. In this case, it makes sense to set environment variables for your proxy.

To make the environment variables for the proxy available whenever you run scripts in the shell, run the following command in your terminal:

export HTTP_PROXY='http://proxyprovider.com:2000'  
export HTTPS_PROXY='https://proxyprovider.com:2000'

Here, the HTTP_PROXY variable sets the proxy server for HTTP requests, and the HTTPS_PROXY variable sets the proxy server for HTTPS requests.
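
If you’d rather set these variables from inside Python (for example, so they only apply to a single script), a minimal alternative is to set them through os.environ before making any requests; Requests picks up the standard HTTP_PROXY and HTTPS_PROXY environment variables by default:

import os

# Set the proxy environment variables for this process only (placeholder URLs).
os.environ["HTTP_PROXY"] = "http://proxyprovider.com:2000"
os.environ["HTTPS_PROXY"] = "https://proxyprovider.com:2000"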

At this point, your Python code needs only a few lines and uses the environment variables whenever you make a request to the web page:

# import packages.  
import requests  
from bs4 import BeautifulSoup  
  
# Define a link to the web page.  
url = "https://brightdata.com/"  
  
# Send a GET request to the website.  
response = requests.get(url)  
  
# Use BeautifulSoup to parse the HTML content of the website.  
soup = BeautifulSoup(response.content, "html.parser")  
  
# Find all the links on the website.  
links = soup.find_all("a")  
  
# Print all the links.  
for link in links:  
    print(link.get("href"))

How to Rotate Proxies Using a Custom Method and an Array of Proxies

Rotating proxies is crucial because websites often block or restrict access to bots and scrapers when they receive a large number of requests from the same IP address. When this happens, websites may suspect malicious scraping activity and, consequently, implement measures to block or limit access.

By rotating through different proxy IP addresses, you can avoid being detected, appear as multiple organic users, and bypass most anti-scraping measures implemented on the website.

In order to rotate proxies, you need to import a few Python libraries: Requests, Beautiful Soup, and Random.

Then create a list of proxies to use during the rotation process. This list must contain the URLs of the proxy servers in this format: http://proxyserver.com:port:

# List of proxies  
proxies = [  
    "http://proxyprovider1.com:2010", "http://proxyprovider1.com:2020",  
    "http://proxyprovider1.com:2030", "http://proxyprovider2.com:2040",  
    "http://proxyprovider2.com:2050", "http://proxyprovider2.com:2060",  
    "http://proxyprovider3.com:2070", "http://proxyprovider3.com:2080",  
    "http://proxyprovider3.com:2090"  
]


Then create a custom method called get_proxy(). This method randomly selects a proxy from the list of proxies using the random.choice() method and returns the selected proxy in dictionary format (both HTTP and HTTPS keys). You’ll use this method whenever you send a new request:

# Custom method to rotate proxies  
def get_proxy():  
    # Choose a random proxy from the list  
    proxy = random.choice(proxies)  
    # Return a dictionary with the proxy for both http and https protocols  
    return {'http': proxy, 'https': proxy}  

Once you’ve created the get_proxy() method, you need to create a loop that sends a certain number of GET requests using the rotated proxies. In each request, the get() method uses a randomly chosen proxy specified by the get_proxy() method.

Then you need to collect the links from the HTML content of the web page using the Beautiful Soup package, as explained in the first example.

Finally, the Python code catches any exceptions that occur during the request process and prints the error message to the console.

Here is the complete source code for this example:

# import packages  
import requests  
from bs4 import BeautifulSoup  
import random  
  
# List of proxies  
proxies = [  
    "http://proxyprovider1.com:2010", "http://proxyprovider1.com:2020",  
    "http://proxyprovider1.com:2030", "http://proxyprovider2.com:2040",  
    "http://proxyprovider2.com:2050", "http://proxyprovider2.com:2060",  
    "http://proxyprovider3.com:2070", "http://proxyprovider3.com:2080",  
    "http://proxyprovider3.com:2090"  
]

  
# Custom method to rotate proxies  
def get_proxy():  
    # Choose a random proxy from the list  
    proxy = random.choice(proxies)  
    # Return a dictionary with the proxy for both http and https protocols  
    return {'http': proxy, 'https': proxy}  
  
  
# Send requests using rotated proxies  
for i in range(10):  
    # Set the URL to scrape  
    url = 'https://brightdata.com/'  
    try:  
        # Send a GET request with a randomly chosen proxy  
        response = requests.get(url, proxies=get_proxy())  
  
        # Use BeautifulSoup to parse the HTML content of the website.  
        soup = BeautifulSoup(response.content, "html.parser")  
  
        # Find all the links on the website.  
        links = soup.find_all("a")  
  
        # Print all the links.  
        for link in links:  
            print(link.get("href"))  
    except requests.exceptions.RequestException as e:  
        # Handle any exceptions that may occur during the request  
        print(e)
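
Random selection is simple, but it can pick the same proxy several times in a row. If you prefer a strict round-robin rotation instead, one alternative sketch (using the same placeholder proxy list) swaps random.choice() for itertools.cycle:

from itertools import cycle

# Build an endless round-robin iterator over the proxy list.
proxy_pool = cycle(proxies)

def get_proxy():
    # Take the next proxy in order instead of choosing one at random.
    proxy = next(proxy_pool)
    return {'http': proxy, 'https': proxy}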

Using the Bright Data Proxy Service with Python

If you’re looking for a reliable, fast, and stable proxy for your web scraping tasks, then look no further than Bright Data, a web data platform that offers different types of proxies for a wide range of use cases.

Bright Data has a large network of more than 72 million residential IPs and more than 770,000 datacenter proxies, which helps it provide reliable and fast proxy solutions. Its proxy offerings are designed to help you overcome the challenges of web scraping, ad verification, and other online activities that require anonymous and efficient web data collection.

Integrating Bright Data’s proxies into your Python requests is easy. For example, you can use its datacenter proxies to send a request to the URL used in the previous examples.

If you don’t already have an account, sign up for a free Bright Data trial and then add your details to register your account on the platform.

Once you’re done, follow these steps to create your first proxy:

Click View proxy product on the welcome page to view the different types of proxy offered by Bright Data:

Bright Data proxy types

Select Datacenter Proxies to create a new proxy, and on the subsequent page, add your details, and save it:

Datacenter proxies configuration

Once your proxy is created, you can view the important parameters (i.e., host, port, username, and password) to start accessing and using it:

Datacenter proxy parameters

Once you’ve accessed your proxy, you can use this information to configure your proxy URL and send a request using the Requests Python package. The proxy URL format is username-session-(session_id):password@host:port.

Note: The session_id is a random number generated using Python’s built-in random module.

Following is what your code sample would look like to set your proxy from Bright Data in a Python request:

import requests  
from bs4 import BeautifulSoup  
import random  
  
# Define parameters provided by Bright Data  
host = 'zproxy.lum-superproxy.io'  
port = 22225  
username = 'username'  
password = 'password'  
session_id = random.random()  
  
# format your proxy  
proxy_url = ('http://{}-session-{}:{}@{}:{}'.format(username, session_id,  
                                                     password, host, port))  
  
# define your proxies in dictionary  
proxies = {'http': proxy_url, 'https': proxy_url}  
  
# Send a GET request to the website  
url = "https://brightdata.com/"  
response = requests.get(url, proxies=proxies)  
  
# Use BeautifulSoup to parse the HTML content of the website  
soup = BeautifulSoup(response.content, "html.parser")  
  
# Find all the links on the website  
links = soup.find_all("a")  
  
# Print all the links  
for link in links:  
    print(link.get("href"))

Here, you import the packages and define the proxy host, port, username, password, and session_id variables. Then you create a proxies dictionary with the http and https keys and the proxy credentials. Finally, you pass the proxies parameter to the requests.get() function to make the HTTP request and collect the links from the URL.

And that’s it! You’ve just made a successful request using Bright Data’s proxy service.
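
If you want to confirm that your traffic is actually going through the proxy, one common check (not part of the example above) is to request a service that echoes your IP address, such as https://httpbin.org/ip, using the same proxies dictionary:

# Ask an IP-echo service which address the request appears to come from.
check = requests.get("https://httpbin.org/ip", proxies=proxies)
print(check.json())  # should show the proxy's IP, not your own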

Conclusion

In this article, you learned why you need proxies as well as the different ways you can use them to send a request to a web page using the Requests Python package.

With Bright Data’s web platform, you can get reliable proxies for your project that cover any country or city in the world. They offer multiple ways to get the data you need through various types of proxies and tools for web scraping to suit your specific needs.

Whether you’re looking to gather market research data, monitor online reviews, or track competitor pricing, Bright Data has the resources you need to get the job done quickly and efficiently.