cURL: What It Is, And How You Can Use It For Web Scraping
In this blog post you will learn what cURL is, why it is so popular, and how you can use it for web scraping.
What Is cURL?
cURL is a command-line tool that you can use to transfer data via network protocols. The name cURL stands for ‘Client URL’, and is also written as ‘curl’. This popular command uses URL syntax to transfer data to and from servers. Curl is powered by ‘libcurl’, a free and easy-to-use client-side URL transfer library.
Why is using curl advantageous?
The versatility of this command means you can use curl for a variety of use cases, including:
- User authentication
- HTTP posts
- SSL connections
- Proxy support
- FTP uploads
The simplest use case for curl is downloading or uploading a complete web page using one of the supported protocols.
While curl supports a long list of protocols, it will use HTTP by default if you don’t specify one. Supported protocols include DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMB, SMTP, SMTPS, TELNET, and TFTP.
The curl command comes preinstalled in most Linux distributions.
How do you check if you already have curl installed?
1. Open your Linux console
2. Type ‘curl’ and press ‘Enter’.
3. If you already have curl installed, you will see a message like: curl: try 'curl --help' or 'curl --manual' for more information.
4. If you don’t have curl installed, you will see the message ‘command not found’. You can then install it from your distribution’s package manager (more details below).
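The check above can also be scripted. A minimal sketch:

```shell
# Print curl's version if it is installed, or a hint if it is not
if command -v curl >/dev/null 2>&1; then
  curl --version | head -n 1
else
  echo "curl not found - install it with your package manager"
fi
```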
How to use cURL
Curl’s syntax is pretty simple:
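In its general form, curl takes a set of options followed by one or more URLs:

```shell
# General form: curl [options] [URL...]
# curl's own help text shows the full option list:
curl --help
```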
For example, if you want to download a webpage: webpage.com just run:
The command will then give you the source code of the page in your terminal window. Keep in mind that if you don’t specify a protocol, curl will default to HTTP. Below you can find an example of how to define specific protocols:
If you forget to add the ://, curl will guess the protocol you want to use.
We covered the basic use of the command above, but you can find the full list of options on the curl documentation site. Options tell curl what action to perform on the URLs you list; the URLs tell curl where to perform it. You can list one or several URLs in a single command.
To download multiple URLs, prefix each URL with a -O followed by a space. You can do this on a single line or write a separate line for each URL. You can also download a range of pages by listing them in brackets. For example:
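A sketch of both forms, using the placeholder domain and hypothetical file names:

```shell
# Download two files, each saved under its own remote name (-O)
curl -O http://webpage.com/a.html -O http://webpage.com/b.html

# Download a numbered range of pages with curl's URL globbing
curl -O "http://webpage.com/page[1-5].html"
```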
Saving the download
You can save the content of the URL to a file with curl in two different ways:
1. -o method: Lets you choose the filename the downloaded content will be saved under. This option has the following structure:
2. -O method: Here you don’t need to add a filename, since this option allows you to save the file under the URL name. To use this option, you just need to prefix the URL with a -O.
Resuming the download
Your download may stop partway through. In that case, rerun the command with the -C - option added:
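A sketch, with a hypothetical file name:

```shell
# "-C -" tells curl to inspect the partially downloaded file
# and continue the transfer from where it stopped
curl -C - -O http://webpage.com/bigfile.zip
```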
Why is curl so popular?
Curl is really the ‘Swiss Army knife’ of transfer commands, built for complex operations. There are alternatives, however, such as ‘wget’ or ‘Kurly’, that are good for simpler tasks.
Curl is a favorite among developers because it is available for almost every platform, and is sometimes even installed by default. This means that, whatever programs or jobs you are running, curl commands should work.
Chances are that if your OS is less than a decade old, you already have curl installed, and you can read the curl documentation in your browser. If you are running a recent version of Windows, you probably already have curl as well. If you don’t, check out this post on Stack Overflow to learn how to install it.
Web Scraping with cURL
Pro tip: Be sure to abide by a website’s rules. In general, do not try to access password-protected content, which is illegal in most cases or at the very least frowned upon.
You can use curl to automate the repetitive parts of web scraping and avoid tedious manual work. For that, you will need to pair it with PHP. Here’s an example we found on GitHub:
When you use curl to scrape a webpage, there are three functions you should use:
- curl_init($url) -> Initializes the session
- curl_exec() -> Executes the request
- curl_close() -> Closes the session
Other options you should set include:
- CURLOPT_URL -> Sets the URL you want to scrape
- CURLOPT_RETURNTRANSFER -> Tells curl to return the scraped page as a string instead of printing it. (This lets you store and process exactly what you wanted to extract from the page.)
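Putting these pieces together, a minimal PHP sketch might look like this (the target URL is a hypothetical placeholder, and the PHP cURL extension is assumed to be enabled):

```php
<?php
// Minimal scraping sketch; webpage.com is a placeholder target
$ch = curl_init("http://webpage.com/");          // initialize the session
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the page as a string
$html = curl_exec($ch);                          // execute the request
curl_close($ch);                                 // close the session

echo $html;                                      // the scraped page source
?>
```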
In this post, we explained what curl is and what you can do with some basic commands. We also showed you an example of how you can use curl to scrape web pages. Start taking advantage of this versatile tool to start collecting your target data.
Tired of complex and time-consuming web scraping techniques?