A proxy is a server that connects to the internet on your behalf. Instead of transmitting your requests directly to websites, a proxy routes them through its own server, hiding your original IP address and location. This protects your privacy, prevents tracking, and helps you avoid blocks. Some proxies can also encrypt your traffic for added security.
In this article, you’ll learn about using proxies with Python requests, particularly for web scraping. Web scraping involves extracting data from websites, but many sites have restrictions. Proxies help bypass these by changing your IP and location, making it harder for websites to detect and block you. You can also use multiple proxies to distribute requests and speed up the process.
Next, you’ll learn how to implement a proxy in your project using the Requests Python package.
How to Use a Proxy with a Python Request
To use a proxy with a Python request, you need to set up a new Python project on your computer where you'll write and run the web scraping scripts. Create a directory (e.g. web_scrape_project) where you'll store your source code files.
All the code for this tutorial is available in this GitHub repo.
Install Packages
After you’ve created your directory, you need to install the following Python packages to send requests to the web page and collect the links:
- Requests: The Requests Python package sends HTTP requests to the website you want to scrape. Each HTTP request returns a response object containing all the response data, such as status, encoding, and content. To install the package, run the following pip command in your terminal: pip install requests
- Beautiful Soup: Beautiful Soup is a powerful Python library that parses HTML and XML documents. You'll use this library to navigate through the HTML document and extract all the links on Bright Data's web page. To install Beautiful Soup, run the following pip command in your terminal: pip install beautifulsoup4
Components of a Proxy IP Address
Before using a proxy, it’s best to understand its components. The following are the three primary components of a proxy server:
- Protocol: the type of traffic the proxy handles. The most common protocols are HTTP and HTTPS.
- Address: where the proxy server is located. The address can be an IP address (e.g. 192.167.0.1) or a DNS hostname (e.g. proxyprovider.com).
- Port: directs traffic to the correct server process when multiple services run on a single machine (e.g. port number 2000).
Using all three of these components, a proxy address would look like 192.167.0.1:2000 or proxyprovider.com:2000.
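To make the format concrete, here's a minimal sketch that assembles a proxy URL from these three components. The protocol, address, and port values below are placeholders, not a real proxy:
# Build a proxy URL from its three components (placeholder values).
protocol = "http"
address = "proxyprovider.com"  # could also be an IP address such as 192.167.0.1
port = 2000

proxy_url = f"{protocol}://{address}:{port}"
print(proxy_url)  # prints: http://proxyprovider.com:2000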
How to Set Proxies Directly in Requests
There are several ways to set proxies in Python requests, and in this article, you’ll look at three different scenarios. In this first example, you’ll learn how to set proxies directly in the requests module.
To start, import the Requests and Beautiful Soup packages in your Python file for web scraping. Then create a dictionary called proxies that contains the proxy server information used to hide your IP address when scraping the web page. In it, you have to define the proxy URL for both the HTTP and HTTPS protocols.
You also need to define a Python variable that holds the URL of the web page you want to scrape. For this tutorial, the URL is https://brightdata.com/.
Next, send a GET request to the web page using the requests.get() method. This method takes two arguments: the URL of the website and the proxies dictionary. The response from the web page is then stored in the response variable.
To collect the links, use the Beautiful Soup package to parse the HTML content of the web page by passing response.content and html.parser as arguments to the BeautifulSoup() method.
Then use the find_all() method with "a" as an argument to find all the links on the web page. Finally, extract the href attribute of each link using the get() method.
Following is the complete source code to set proxies directly in requests:
# import packages.
import requests
from bs4 import BeautifulSoup

# Define proxies to use.
proxies = {
    'http': 'http://proxyprovider.com:2000',
    'https': 'http://proxyprovider.com:2000',
}

# Define a link to the web page.
url = "https://brightdata.com/"

# Send a GET request to the website.
response = requests.get(url, proxies=proxies)

# Use BeautifulSoup to parse the HTML content of the website.
soup = BeautifulSoup(response.content, "html.parser")

# Find all the links on the website.
links = soup.find_all("a")

# Print all the links.
for link in links:
    print(link.get("href"))
When you run this block of code, it sends a request to the web page through the defined proxy IP address and then returns a response containing all the links on that web page.
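If your proxy server requires authentication, Requests lets you embed the credentials directly in the proxy URL. Here's a minimal sketch; the username, password, host, and port are placeholders rather than real credentials:
import requests

# Placeholder credentials and proxy address; substitute your provider's values.
proxies = {
    'http': 'http://user:password@proxyprovider.com:2000',
    'https': 'http://user:password@proxyprovider.com:2000',
}

response = requests.get("https://brightdata.com/", proxies=proxies)
# Checking the status code before parsing helps you catch proxy or connection issues early.
print(response.status_code)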
How to Set Proxies via Environment Variables
Sometimes, you have to use the same proxy for all your requests to different web pages. In this case, it makes sense to set environment variables for your proxy.
To make the environment variables for the proxy available whenever you run scripts in the shell, run the following command in your terminal:
export HTTP_PROXY='http://proxyprovider.com:2000'
export HTTPS_PROXY='https://proxyprovider.com:2000'
Here, the HTTP_PROXY variable sets the proxy server for HTTP requests, and the HTTPS_PROXY variable sets the proxy server for HTTPS requests.
At this point, your Python script only needs a few lines of code, and Requests picks up the proxy from the environment variables whenever you make a request to the web page:
# import packages.
import requests
from bs4 import BeautifulSoup

# Define a link to the web page.
url = "https://brightdata.com/"

# Send a GET request to the website.
response = requests.get(url)

# Use BeautifulSoup to parse the HTML content of the website.
soup = BeautifulSoup(response.content, "html.parser")

# Find all the links on the website.
links = soup.find_all("a")

# Print all the links.
for link in links:
    print(link.get("href"))
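If you'd rather not export the variables in your shell, you can also set them from within the script itself before making any requests. This is a small sketch using the same placeholder proxy URL; Requests reads these environment variables by default:
import os
import requests

# Set the proxy environment variables for this process only (placeholder proxy URL).
os.environ['HTTP_PROXY'] = 'http://proxyprovider.com:2000'
os.environ['HTTPS_PROXY'] = 'http://proxyprovider.com:2000'

# Requests honors HTTP_PROXY/HTTPS_PROXY automatically because trust_env is enabled by default.
response = requests.get("https://brightdata.com/")
print(response.status_code)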
How to Rotate Proxies Using a Custom Method and an Array of Proxies
Rotating proxies is crucial because websites often block or restrict access to bots and scrapers when they receive a large number of requests from the same IP address. When this happens, websites may suspect malicious scraping activity and, consequently, implement measures to block or limit access.
By rotating through different proxy IP addresses, you can avoid being detected, appear as multiple organic users, and bypass most anti-scraping measures implemented on the website.
To rotate proxies, you need to import a few Python packages: Requests, Beautiful Soup, and the built-in random module.
Then create a list of proxies to use during the rotation process. This list must contain the URLs of the proxy servers in the http://proxyserver.com:port format:
# List of proxies
proxies = [
    "http://proxyprovider1.com:2010", "http://proxyprovider1.com:2020",
    "http://proxyprovider1.com:2030", "http://proxyprovider2.com:2040",
    "http://proxyprovider2.com:2050", "http://proxyprovider2.com:2060",
    "http://proxyprovider3.com:2070", "http://proxyprovider3.com:2080",
    "http://proxyprovider3.com:2090"
]
Then create a custom method called get_proxy(). This method randomly selects a proxy from the list of proxies using the random.choice() method and returns the selected proxy in dictionary format (with both http and https keys). You'll use this method whenever you send a new request:
# Custom method to rotate proxies
def get_proxy():
    # Choose a random proxy from the list
    proxy = random.choice(proxies)
    # Return a dictionary with the proxy for both http and https protocols
    return {'http': proxy, 'https': proxy}
Once you've created the get_proxy() method, you need to create a loop that sends a certain number of GET requests using the rotated proxies. In each request, the get() method uses a randomly chosen proxy returned by the get_proxy() method.
Then you need to collect the links from the HTML content of the web page using the Beautiful Soup package, as explained in the first example.
Finally, the Python code catches any exceptions that occur during the request process and prints the error message to the console.
Here is the complete source code for this example:
# import packages
import requests
from bs4 import BeautifulSoup
import random

# List of proxies
proxies = [
    "http://proxyprovider1.com:2010", "http://proxyprovider1.com:2020",
    "http://proxyprovider1.com:2030", "http://proxyprovider2.com:2040",
    "http://proxyprovider2.com:2050", "http://proxyprovider2.com:2060",
    "http://proxyprovider3.com:2070", "http://proxyprovider3.com:2080",
    "http://proxyprovider3.com:2090"
]

# Custom method to rotate proxies
def get_proxy():
    # Choose a random proxy from the list
    proxy = random.choice(proxies)
    # Return a dictionary with the proxy for both http and https protocols
    return {'http': proxy, 'https': proxy}

# Send requests using rotated proxies
for i in range(10):
    # Set the URL to scrape
    url = 'https://brightdata.com/'
    try:
        # Send a GET request with a randomly chosen proxy
        response = requests.get(url, proxies=get_proxy())

        # Use BeautifulSoup to parse the HTML content of the website.
        soup = BeautifulSoup(response.content, "html.parser")

        # Find all the links on the website.
        links = soup.find_all("a")

        # Print all the links.
        for link in links:
            print(link.get("href"))
    except requests.exceptions.RequestException as e:
        # Handle any exceptions that may occur during the request
        print(e)
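A practical extension of this pattern is to retry a failed request with a different proxy instead of only printing the error. The sketch below is one possible approach; it reuses the proxies list and get_proxy() method from above, and the retry count and timeout are arbitrary choices:
# Retry a request with a freshly rotated proxy on each attempt (sketch, not part of the original example).
def fetch_with_retries(url, retries=3):
    for attempt in range(retries):
        proxy = get_proxy()
        try:
            response = requests.get(url, proxies=proxy, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed with proxy {proxy['http']}: {e}")
    # All attempts failed.
    return None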
Using the Bright Data Proxy Service with Python
If you’re looking for a reliable, fast, and stable proxy for your web scraping tasks, then look no further than Bright Data, a web data platform that offers different types of proxies for a wide range of use cases.
Bright Data has a large network of more than 72 million residential proxy IPs and more than 770,000 datacenter proxies, which helps it provide reliable and fast proxy solutions. Its proxy offerings are designed to help you overcome the challenges of web scraping, ad verification, and other online activities that require anonymous and efficient web data collection.
Integrating Bright Data's proxies into your Python requests is easy. For example, you can use the datacenter proxies to send a request to the URL used in the previous examples.
If you don’t already have an account, sign up for a free Bright Data trial and then add your details to register your account on the platform.
Once you’re done, follow these steps to create your first proxy:
Click View proxy product on the welcome page to view the different types of proxies offered by Bright Data.
Select Datacenter Proxies to create a new proxy; on the subsequent page, add your details and save it.
Once your proxy is created, you can view its important parameters (i.e. host, port, username, and password) to start accessing and using it.
Once you've accessed your proxy, you can use this parameter information to configure your proxy URL and send a request with the Requests Python package. The proxy URL follows the format username-session-(session_id):password@host:port, as shown in the code below.
Note: The session_id is a random number created using the Python random package.
The following code sample shows how to set your Bright Data proxy in a Python request:
import requests
from bs4 import BeautifulSoup
import random

# Define parameters provided by Brightdata
host = 'zproxy.lum-superproxy.io'
port = 22225
username = 'username'
password = 'password'
session_id = random.random()

# format your proxy
proxy_url = ('http://{}-session-{}:{}@{}:{}'.format(username, session_id,
                                                    password, host, port))

# define your proxies in dictionary
proxies = {'http': proxy_url, 'https': proxy_url}

# Send a GET request to the website
url = "https://brightdata.com/"
response = requests.get(url, proxies=proxies)

# Use BeautifulSoup to parse the HTML content of the website
soup = BeautifulSoup(response.content, "html.parser")

# Find all the links on the website
links = soup.find_all("a")

# Print all the links
for link in links:
    print(link.get("href"))
Here, you import the packages and define the proxy host, port, username, password, and session_id variables. Then you create a proxies dictionary with the http and https keys and the proxy credentials. Finally, you pass the proxies parameter to the requests.get() function to make the HTTP request and collect the links from the URL.
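To confirm that your traffic is actually going through the proxy, one quick check is to request an IP echo service and compare the reported address with your own. This sketch assumes the proxies dictionary from the example above and uses httpbin.org/ip purely as an illustrative endpoint:
import requests

# Reuse the proxies dictionary built from your Bright Data credentials above,
# e.g. proxies = {'http': proxy_url, 'https': proxy_url}
check = requests.get("https://httpbin.org/ip", proxies=proxies)
# The "origin" field should show the proxy's IP address, not your own.
print(check.json()["origin"])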
And that’s it! You’ve just made a successful request using Bright Data’s proxy service.
Conclusion
In this article, you learned why you need proxies as well as the different ways you can use them to send a request to a web page using the Requests Python package.
With Bright Data’s web platform, you can get reliable proxies for your project that cover any country or city in the world. They offer multiple ways to get the data you need through various types of proxies and tools for web scraping to suit your specific needs.
Whether you’re looking to gather market research data, monitor online reviews, or track competitor pricing, Bright Data has the resources you need to get the job done quickly and efficiently.