cURL is a versatile open source command line tool for transferring data over a network. It comes with a large variety of parameters, so it can handle almost any request. In addition, cURL is extensible and has bindings in basically every modern programming language.
Using cURL with a programming language offers many benefits. For instance, making requests can be automated for debugging or web scraping use cases.
In this article, you’ll learn how Python and cURL can be used together to automate GET, POST, and PUT requests, and for downloading files and web pages.
What Is cURL?
cURL is a software project, but its name is also used for two products: a library known as libcurl and a command line tool known as curl (which uses libcurl). In this article, curl refers to the command line tool.
curl is considered versatile; however, its core task is simple: transferring data over various network protocols. Given the complexity of today’s web, curl comes with an enormous list of options to handle the most complex requests.
curl was first released in 1996 as HttpGet and later named urlget before becoming curl. Its first use case was fetching currency exchange rates to use them in an IRC channel. Nowadays, curl supports transferring data via a range of methods, including FTP(S), HTTP(S) (POST, GET, PUT), IMAP, POP3, MQTT, and SMB. Moreover, curl can handle cookies and SSL certificates.
When curl makes a connection via HTTPS, it obtains the remote server's certificate and checks it against its CA certificate store to ensure the remote server is the one it claims to be. For example, the following request sends an HTTPS request to the Bright Data website and sets a cookie named greeting with the value hello:
curl --cookie "greeting=hello" https://www.brightdata.com
Why Use curl with Python?
Even though curl is a versatile tool, there is still one main reason why you would want to use it with Python: Python can automate your requests. Following are three use cases where this is a valuable combination:
Web Scraping
Web scraping is the practice of collecting (often large) amounts of data from one or more web pages. To scrape data with Python, people often rely on the requests library. For scraping recursively, you can use wget. However, for advanced scraping use cases with complex HTTP(S) calls, curl with Python is ideal.
A single curl command can generate and process an HTTP(S) request to collect data from a web page, but it can't do so recursively. By embedding curl in Python code, you can simulate a navigational path on a website by manipulating elements such as the request parameters, cookies, and user agents.
The navigation doesn’t even have to be fixed. By making it contingent on the scraped content, each new request can be entirely dynamic.
For example, say you're scraping the comment section of a popular news website and only want to scrape an author's profile page if their comment contains hateful keywords. You can create a conditional statement that depends on the scraped comments and apply this dynamic filter easily.
In addition, many websites have safety mechanisms that make scraping a lot of pages difficult: think of distributed denial-of-service (DDoS) protection or a reCAPTCHA prompt. By applying certain rules and pauses between requests, you can simulate human behavior that is harder to detect.
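To make this concrete, here is a minimal sketch of that kind of conditional, throttled scraping loop, using the subprocess approach covered later in this article. The URLs, the keyword list, and the way author links are extracted are hypothetical placeholders; a real scraper would parse the actual comment markup of the target site.
import random
import shlex
import subprocess
import time
# Hypothetical keyword filter and target page.
KEYWORDS = ["keyword1", "keyword2"]
COMMENTS_URL = "https://news.example.com/article/123/comments"  # placeholder
def fetch(url):
    # Run curl and return the response body as text.
    cmd = shlex.split(f'curl -s "{url}"')
    return subprocess.run(cmd, capture_output=True, text=True).stdout
comments_html = fetch(COMMENTS_URL)
# Pretend each line is one comment ending in an author URL (placeholder logic).
for line in comments_html.splitlines():
    if any(keyword in line.lower() for keyword in KEYWORDS):
        author_url = line.split()[-1]  # hypothetical extraction
        profile_html = fetch(author_url)
        print(f"Fetched profile page ({len(profile_html)} bytes)")
        # Pause for a random interval to look less like a bot.
        time.sleep(random.uniform(1, 5))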
Testing and Debugging
Using curl on your own website might seem silly, but it's useful in a testing and debugging context. Testing or debugging one or more features of an application is often a cumbersome task: it needs to be done repeatedly and with a variety of settings or parameters. Although there are many off-the-shelf testing tools, Python and curl make it easy to set up some quick tests.
For example, suppose you're releasing a new checkout flow for your (complex) online service that uses cookies, relies on the referrer, behaves slightly differently per browser (i.e., user agent), and packs all the steps of the checkout flow into the body of a POST request. Manually testing all variations could take ages. In Python, you can build a dictionary that contains the whole parameter set and send a request using curl for each possible combination.
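As a rough sketch of that idea, the snippet below enumerates every combination of a few hypothetical parameters and fires one curl request per combination. The endpoint, cookie values, user agents, and request bodies are placeholders you'd replace with your own.
import itertools
import shlex
import subprocess
# Hypothetical parameter sets for the checkout flow under test.
params = {
    "user_agent": ["Mozilla/5.0 (Windows NT 10.0)", "Mozilla/5.0 (Macintosh)"],
    "referer": ["https://shop.example.com/cart", "https://shop.example.com/home"],
    "cookie": ["session=abc", "session=xyz"],
    "body": ['{"step": "payment"}', '{"step": "confirm"}'],
}
CHECKOUT_URL = "https://shop.example.com/api/checkout"  # placeholder
for user_agent, referer, cookie, body in itertools.product(*params.values()):
    cmd = shlex.split(
        f"curl -s -X POST -A '{user_agent}' -e '{referer}' -b '{cookie}' "
        f"-H 'Content-Type: application/json' -d '{body}' {CHECKOUT_URL}"
    )
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(user_agent, referer, cookie, body, "->", result.returncode)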
Workflow Automation
In addition to testing and debugging and web scraping, curl can be used in workflow automation use cases. For instance, many data integration pipelines start with a recurring dump of a data export, such as a CSV or Apache Parquet file. With a Python application that polls for new files on an (S)FTP server, copying data dumps can be entirely automated.
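The following sketch polls an FTP(S) server for new CSV dumps by listing a remote directory with curl and downloading anything it hasn't seen before. The host, credentials, and paths are placeholders; for SFTP you'd additionally need a libcurl build with SSH support.
import os
import shlex
import subprocess
# Placeholder connection details.
REMOTE = "ftp://ftp.example.com/exports/"
CREDENTIALS = "user:password"
LOCAL_DIR = "dumps"
os.makedirs(LOCAL_DIR, exist_ok=True)
# -l asks the FTP server for a name-only directory listing.
listing = subprocess.run(
    shlex.split(f'curl -s -l -u {CREDENTIALS} {REMOTE}'),
    capture_output=True, text=True
).stdout.splitlines()
for name in listing:
    local_path = os.path.join(LOCAL_DIR, name)
    if name.endswith(".csv") and not os.path.exists(local_path):
        # Download files that haven't been copied yet.
        subprocess.run(shlex.split(
            f'curl -s -u {CREDENTIALS} -o {local_path} {REMOTE}{name}'
        ))
        print(f"Downloaded {name}")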
Or consider setting up mailhooks. Imagine how many daily tasks could be automated if an application could poll for email messages that contain a query. By polling for new messages via the POP3 or IMAP protocol, Python applications can be triggered when a mailbox receives a specific email.
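As a rough illustration, the sketch below uses curl's IMAP support to search an inbox for unseen messages and triggers a placeholder handler when any are found. The server, credentials, and mailbox are hypothetical, and exact IMAP command support depends on your curl build.
import shlex
import subprocess
# Placeholder mailbox details.
MAILBOX = "imaps://imap.example.com/INBOX"
CREDENTIALS = "user:password"
# Ask the IMAP server for the IDs of unseen messages via a custom command.
result = subprocess.run(
    shlex.split(f'curl -s -u {CREDENTIALS} "{MAILBOX}" -X "SEARCH UNSEEN"'),
    capture_output=True, text=True
)
# A non-empty SEARCH result means new mail arrived; trigger the workflow here.
if result.stdout.replace("* SEARCH", "").strip():
    print("New mail detected:", result.stdout.strip())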
How to Use cURL with Python
There are various ways to make requests using curl in Python. This article covers two options. The first is to simulate curl requests in the command line via the os and subprocess Python packages. This straightforward approach programmatically sends commands to the command line interface of your operating system.
The second option is to use the PycURL package. If you want to learn about other ways of scraping websites with Python (without using curl), you can check this Bright Data Scraping with Python guide.
Prerequisites
Before you begin this tutorial, make sure you've downloaded and installed curl. If you use Windows, make sure to add curl to your PATH environment variable so that you can simply execute the curl command.
To interface with your operating system from Python, you can use various packages; the two most popular are os and subprocess. Both ship with the Python standard library, so there's nothing to install with pip.
Making a Request Using curl and os
The os package is extremely simple. Executing a curl request without processing the response takes only two lines of code. You just pass the cookie described in the previous example, and the output gets written to the output.txt file:
import os
os.system('curl -o output.txt --cookie "greeting=hello" -k https://curl.se')
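os.system returns the command's exit status rather than its output, so if all you need is the file, a minimal follow-up (assuming the request above succeeded and wrote output.txt) could check the status and read the file back:
import os
status = os.system('curl -o output.txt --cookie "greeting=hello" -k https://curl.se')
# A zero exit status means curl completed successfully.
if status == 0:
    with open('output.txt', encoding='utf-8') as f:
        print(f.read()[:200])  # print the first 200 characters of the response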
If you want to process the response in Python instead of writing it to a file, you should use the subprocess package discussed in the next section. The following code runs the same statement, but instead of writing the response to a file, it returns stdout and stderr as a tuple. This output can then be processed with other Python packages, like Beautiful Soup:
import shlex
import subprocess
# Split the command string into a list of arguments.
shell_cmd = shlex.split('curl --cookie "greeting=hello" -k https://curl.se')
# Run curl and capture stdout and stderr as text.
process = subprocess.Popen(shell_cmd,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE,
                           text=True)
std_out, std_err = process.communicate()
print(std_out.strip(), std_err)
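If you don't need the lower-level control that Popen offers, subprocess.run is a more concise way to make the same call:
import shlex
import subprocess
shell_cmd = shlex.split('curl --cookie "greeting=hello" -k https://curl.se')
# run() waits for curl to finish and captures both output streams.
result = subprocess.run(shell_cmd, capture_output=True, text=True)
print(result.stdout.strip(), result.stderr)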
Using PycURL
Instead of interfacing with your terminal in Python, you can use the PycURL package. If you’re a Linux user, you’re in luck since you can install PycURL using pip:
pip install pycurl
pip install certifi
You should also install certifi to interface over the HTTPS protocol. If you run into issues, follow these instructions from Stack Overflow.
While PycURL is also installable on Windows, it’s a very frustrating endeavor. If you try to install it via pip, it will return the following error:
Please specify --curl-dir=/path/to/built/libcurl
That’s why you need to install it from source, which is “not for the faint of heart due to the multitude of possible dependencies and each of these dependencies having its own directory structure, configuration style, parameters and quirks.”
For this reason, it’s recommended to stick to the requests package for basic network requests if you’re working on a Windows machine.
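For reference, the equivalent of the earlier cookie example with the requests package is only a couple of lines (the URL and cookie are the same ones used above):
import requests
# Send the same cookie as the earlier curl examples.
response = requests.get('https://curl.se', cookies={'greeting': 'hello'})
print(response.status_code, response.text[:200])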
How to Make Requests with PycURL
The remainder of this article elaborates on creating various types of requests using the PycURL package.
Making a GET Request with PycURL
The easiest request you can make using PycURL is a GET request. It's basically the template for all the other requests throughout this section.
You can identify five steps in the following code:
- All required packages are imported.
- Two objects are created: the buffer in which the curl request will store its response and the curl object, which is used to make the request.
- The options of the request are specified: the URL, the destination, and the SSL validation.
- The request is executed and closed.
- The response is decoded and printed.
# Preparation
import pycurl
import certifi
from io import BytesIO
# Set buffer and Curl object.
buffer = BytesIO()
c = pycurl.Curl()
# Set request options.
## Set the request destination.
c.setopt(c.URL, 'http://pycurl.io/')
## Set the buffer as the destination of the request's response.
c.setopt(c.WRITEDATA, buffer)
## Refer to the installed certificate authority bundle for validating the SSL certificate.
c.setopt(c.CAINFO, certifi.where())
# Execute and close the request.
c.perform()
c.close()
# Print the buffer's content with a Latin1 (iso-8859-1) encoding.
body = buffer.getvalue()
data = body.decode('iso-8859-1')
print(data)
Making a POST Request with PycURL
Making a POST request with PycURL is very similar to making a GET request. However, one extra option is added to the request: the POST body. In the following code snippet, a key-value pair is set and URL-encoded to ensure it's processed correctly:
# Preparation
import pycurl
import certifi
from io import BytesIO
from urllib.parse import urlencode
# Set buffer and Curl object.
buffer = BytesIO()
c = pycurl.Curl()
# Set request options.
## Set the request destination.
c.setopt(c.URL, 'http://pycurl.io/')
## Set the request's body.
post_body = {'greeting': 'hello'}
postfields = urlencode(post_body)
c.setopt(c.POSTFIELDS, postfields)
## Set the buffer as the destination of the request's response.
c.setopt(c.WRITEDATA, buffer)
## Refer to the installed certificate authority bundle for validating the SSL certificate.
c.setopt(c.CAINFO, certifi.where())
# Execute and close the request.
c.perform()
c.close()
# Print the buffer's content with a Latin1 (iso-8859-1) encoding.
body = buffer.getvalue()
print(body.decode('iso-8859-1'))
Making a PUT Request with PycURL
The POST request you created in the previous section can also be sent as a PUT request. Instead of sending the key-value pair as URL-encoded fields in the body of the request, you'll stream it from a file-like object encoded in UTF-8. This method can also be used for uploading files:
import pycurl
import certifi
from io import BytesIO
c = pycurl.Curl()
# Set request options.
## Set the request destination.
c.setopt(c.URL, 'http://pycurl.io/')
## Set data for the PUT request.
c.setopt(c.UPLOAD, 1)
data = '{"greeting": "hello"}'
buffer = BytesIO(data.encode('utf-8'))
c.setopt(c.READDATA, buffer)
## Refer to the installed certificate authority bundle for validating the SSL certificate.
c.setopt(c.CAINFO, certifi.where())
# Execute and close the request.
c.perform()
c.close()
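The snippet above sends the data but lets the response print to stdout. If you want to capture and inspect the response instead, you can register a buffer with WRITEDATA, just as in the GET example; a variation of the same PUT request might look like this:
import pycurl
import certifi
from io import BytesIO
c = pycurl.Curl()
# Same PUT request as above.
c.setopt(c.URL, 'http://pycurl.io/')
c.setopt(c.UPLOAD, 1)
data = '{"greeting": "hello"}'
c.setopt(c.READDATA, BytesIO(data.encode('utf-8')))
# Capture the response body instead of letting it print to stdout.
response_buffer = BytesIO()
c.setopt(c.WRITEDATA, response_buffer)
c.setopt(c.CAINFO, certifi.where())
c.perform()
c.close()
print(response_buffer.getvalue().decode('iso-8859-1'))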
Downloading a File with PycURL
The next snippet demonstrates how a file can be downloaded using PycURL. A random JPEG image is requested, and a write stream is opened to some_image.jpg and passed to PycURL as the destination for the file:
import pycurl
import certifi
c = pycurl.Curl()
# Set the request destination.
c.setopt(c.URL, 'http://pycurl.io/some_image.jpg')
# Refer to the installed certificate authority bundle for validating the SSL certificate.
c.setopt(c.CAINFO, certifi.where())
# Execute and close the request.
with open('some_image.jpg', 'wb') as f:
    c.setopt(c.WRITEFUNCTION, f.write)
    c.perform()
c.close()
Downloading and Processing a Web Page with PycURL
Because lots of PycURL use cases involve web scraping, the next snippet describes how you can process a request’s response with Beautiful Soup, a popular package for parsing HTML files.
First, install Beautiful Soup 4 using pip:
pip install beautifulsoup4
Second, put the next snippet right behind the first PycURL snippet that made a GET request. This will make Beautiful Soup process the response data.
For demonstration, the find_all method is used to find all paragraph elements, and the content of the individual paragraphs is printed:
from bs4 import BeautifulSoup
# Parsing data using BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
# Find all paragraphs
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)
Using a Proxy with PycURL
Web scraping at scale works best when you work with proxies. The benefit is that you can emulate browsing behavior in parallel without your scraper being flagged as a bot or for anomalous behavior.
In this final section, you'll learn how to create a request with PycURL through a proxy. This is achieved by adjusting the request options, as you did previously. Four settings are described below, but you can adjust them to your situation:
- To make it easy, insecure proxies are enabled.
- The proxy server is set.
- The script authenticates with the server.
- The proxy type is set to HTTPS.
# Enable insecure proxies
c.setopt(c.PROXY_SSL_VERIFYHOST, 0)
c.setopt(c.PROXY_SSL_VERIFYPEER, 0)
# Set proxy server
c.setopt(pycurl.PROXY, <YOUR_HTTPS_PROXY_SERVER>)
# Authenticate with the proxy server
c.setopt(pycurl.PROXYUSERPWD, "<YOUR_USERNAME>:<YOUR_PASSWORD>")
# Set proxy type to https
c.setopt(pycurl.PROXYTYPE, 2)
These options can be inserted anywhere in a previously described code snippet to make the request reroute via the proxy server.
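For instance, a complete version of the earlier GET request routed through a proxy could look like the following; the proxy address and credentials are placeholders for your own values:
import pycurl
import certifi
from io import BytesIO
buffer = BytesIO()
c = pycurl.Curl()
# Same GET request as before.
c.setopt(c.URL, 'http://pycurl.io/')
c.setopt(c.WRITEDATA, buffer)
c.setopt(c.CAINFO, certifi.where())
# Proxy settings from above (placeholders).
c.setopt(c.PROXY_SSL_VERIFYHOST, 0)
c.setopt(c.PROXY_SSL_VERIFYPEER, 0)
c.setopt(pycurl.PROXY, "<YOUR_HTTPS_PROXY_SERVER>")
c.setopt(pycurl.PROXYUSERPWD, "<YOUR_USERNAME>:<YOUR_PASSWORD>")
c.setopt(pycurl.PROXYTYPE, 2)
c.perform()
c.close()
print(buffer.getvalue().decode('iso-8859-1'))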
Conclusion
In this article, the combination of curl and Python was explained in detail, highlighting why you would want to use them together for generating complex requests for web scraping and application testing use cases. Multiple examples were provided to demonstrate the versatility of PycURL for generating a multitude of network requests.
Alternatively, you can make use of the Bright Data Proxy Network and its Web Scraper IDE, which was specifically designed to handle all the heavy lifting for developers. That way, you can focus on working with the scraped data instead of worrying about getting past anti-scraping mechanisms.