How to Use Wget With Python to Download Web Pages and Files

This comprehensive guide introduces wget, a powerful command-line tool for downloading files via HTTP, HTTPS, and FTP, comparing it favorably to the requests library in Python.

In this guide, you will see:

  • What wget is.
  • Why it can be better than the requests library.
  • How easy it is to use wget with Python.
  • Pros and cons of its adoption in Python scripts.

Let’s dive in!

What Is Wget?

wget is a command-line utility for downloading files from the Web using HTTP, HTTPS, FTP, FTPS, and other Internet protocols. It is natively installed in most Unix-like operating systems, but it is also available for Windows.

Why Wget and Not a Python Package Like requests?

Sure, wget is a cool command-line tool, but why should you use it for downloading files in Python instead of a popular library like requests?

Well, there is a list of compelling reasons for using wget over requests:

  • Supports many more protocols than requests.
  • Can resume aborted or interrupted downloads.
  • Can limit the download speed so that it does not consume all the available network bandwidth.
  • Supports filenames and network locations with wildcards.
  • NLS-based message files for many languages.
  • Can convert absolute links in downloaded documents to relative links.
  • Supports HTTP/S proxies.
  • Supports persistent HTTP connections.
  • Can perform unattended/background downloading operations.
  • Uses local file timestamps to determine whether documents need to be re-downloaded when mirroring.
  • Can recursively download files linked on a specific web page or until it reaches a user-specified depth of recursion.
  • Automatically respects robot exclusion rules defined in robots.txt. Find out more in our guide on robots.txt for web scraping.

These are only some of the wget features that make it so powerful and special compared to any Python HTTP client library. Discover more in the official manual.

In particular, note how wget can follow links in HTML pages and download files referenced in those pages. That helps you retrieve even entire websites, which makes wget ideal for web crawling.

In short, wget is a great option when writing scripts that need to download files and web pages from the Web. Let’s learn how to use wget with Python!

Running CLI Commands in Python

Follow the steps below and build a Python script that can run wget commands.

Prerequisites

Before getting started, make sure you have wget installed on your computer. The setup process depends on your operating system: wget comes preinstalled on most Linux distributions (otherwise it is available through the system package manager), on macOS you can install it with Homebrew (brew install wget), and on Windows you can download a wget binary and add its folder to your PATH.

You will also need Python 3+ installed on your machine. To set it up, download the installer from python.org, launch it, and follow the instructions.

A Python IDE such as PyCharm Community Edition or Visual Studio Code with the Python extension will also be useful.

Set Up a Python Project

Create a wget Python project with a virtual environment using the commands below:

mkdir wget-python-demo
cd wget-python-demo
python -m venv env

The wget-python-demo directory created above represents your project’s folder, while env contains the virtual environment (you can activate it with source env/bin/activate on Linux/macOS or env\Scripts\activate on Windows).

Load it in your Python IDE, create a script.py file, and initialize it as follows:

print('Hello, World!')

Right now, this is just a sample script that prints “Hello, World!” in the terminal. Soon, it will contain the wget integration logic.

Verify that the script works by pressing the run button of your IDE or with the command below:

python script.py

In the terminal, you should see:

Hello, World!

Perfect! You now have a Python project in place.

See how to use wget in the next section!

Write a Function to Execute CLI Commands Through the Subprocess Module

The easiest way to run CLI commands in a Python script is using the subprocess module.

This module from the Python Standard Library allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. In other words, it equips you with everything you need to execute terminal commands from Python.

This is how you can use the Popen() method from subprocess to execute CLI commands like wget with Python:

import subprocess

def execute_command(command):
    """
    Execute a CLI command and return the output and error messages.

    Parameters:
    - command (str): The CLI command to execute.

    Returns:
    - output (str): The output generated by the command.
    - error (str): The error message generated by the command, if any.
    """
    try:
        # execute the command and capture the output and error messages
        process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        output, error = process.communicate()
        output = output.decode("utf-8")
        error = error.decode("utf-8")

        # return the output and error messages
        return output, error
    except Exception as e:
        # if an exception occurs, return the exception message as an error
        return None, str(e)

Popen() executes the command passed as a string in a new operating system process. The shell=True option makes sure the command runs through the default shell configured in your OS, while communicate() waits for the process to terminate and returns the content of its stdout and stderr pipes.

Paste the above snippet in your script.py file. You can now invoke a CLI command in Python as in the following example:

output, error = execute_command("<CLI command string>")

if error:
    print("An error occurred while running the CLI command:", error)
else:
    print("CLI command output:", output)

Using Wget with Python: Use Cases

This is the syntax of a wget command:

wget [options] [url]

Where:

  • [options] is a list of the options and flags supported by the CLI tool to customize its behavior.
  • url is the URL of the file you want to download. This can be a direct link to a file or a URL of a webpage containing links to multiple files.

Note: On Windows, write wget.exe instead of wget.
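
If you plan to build many wget command strings, a tiny helper can keep them tidy and handle the Windows executable name for you. wget_command() below is not part of wget or the standard library, just a hypothetical convenience function:

import platform

def wget_command(url, *options):
    # use wget.exe on Windows and plain wget everywhere else
    binary = "wget.exe" if platform.system() == "Windows" else "wget"
    # compose the final command string: wget [options] [url]
    return " ".join([binary, *options, url])

# example usage:
# wget_command("http://lumtest.com/myip.json", "--directory-prefix=./download")
# -> 'wget --directory-prefix=./download http://lumtest.com/myip.json' on Linux/macOS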

Time to see wget in action in some Python snippets covering popular use cases!

Download a File

Suppose you want to download the file at http://lumtest.com/myip.json with wget. The command to achieve that would be:

wget http://lumtest.com/myip.json

In Python, that would become the following line of code:

output, error = execute_command("wget http://lumtest.com/myip.json")

If you print the command’s log (note that wget writes its progress messages to standard error, so with the execute_command() helper above you will find them in the error variable rather than in output), you will see something like:

--2024-04-18 15:20:59-- http://lumtest.com/myip.json
Resolving lumtest.com (lumtest.com)... 3.94.72.89, 3.94.40.55
Connecting to lumtest.com (lumtest.com)|3.94.72.89|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 266 [application/json]
Saving to: 'myip.json.1'

myip.json.1 100%[=================================================>] 266 --.-KB/s in 0s

2024-04-18 15:20:59 (5.41 MB/s) - 'myip.json.1' saved [266/266]

From the output of the command, you can see that:

  1. The URL is resolved to the IP address of the server.
  2. wget connects to the server via an HTTP request to the specified resource.
  3. The HTTP response status code received from the server is 200 OK.
  4. wget downloads the file and stores it in the current directory.

Your Python project directory will now contain a myip.json file. (If a file with that name already exists, wget appends a numeric suffix instead of overwriting it, which is why the sample log above shows myip.json.1.)
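
Since the file now lives on disk, you can load it back into your script right away. Here is a minimal sketch, assuming the download above succeeded and myip.json is in the current working directory:

import json

# open the file wget just saved and parse its JSON content
with open("myip.json", encoding="utf-8") as f:
    ip_info = json.load(f)

print(ip_info)  # e.g. a dictionary with your IP address and geolocation details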

If you want to change the destination folder where the downloaded file is stored, use the --directory-prefix or -P flag as below:

output, error = execute_command("wget --directory-prefix=./download http://lumtest.com/myip.json")

The myip.json file will now be stored in the download folder inside your project’s directory. Note that if the destination folder does not exist, wget will automatically create it.

To change the file name of the downloaded resource, use the --output-document or -O flag:

output, error = execute_command("wget --output-document=custom-name.json http://lumtest.com/myip.json")

This time, the wget Python script will create a file named custom-name.json instead of myip.json.
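
If you need both a custom folder and a custom file name, one simple approach is to create the folder from Python and point --output-document at a path inside it. The snippet below is a sketch of that idea, with a check that the file actually landed where expected:

from pathlib import Path

# make sure the destination folder exists before pointing --output-document at it
Path("./download").mkdir(exist_ok=True)

output, error = execute_command(
    "wget --output-document=./download/custom-name.json http://lumtest.com/myip.json"
)

# verify that wget created the file in the expected location
print("Download succeeded:", Path("./download/custom-name.json").exists())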

Download a Webpage

The wget command is the same as before, with the main difference being that this time url will point to a web page:

output, error = execute_command("wget https://brightdata.com/")

Your project’s directory will now contain an index.html file with the HTML content of the webpage at https://brightdata.com/.
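
The downloaded page is a regular file on disk, so you can read it back as a string whenever you need it in Python. A minimal sketch, assuming index.html was saved in the current working directory:

from pathlib import Path

# read the HTML document wget saved so it can be processed in Python
html = Path("index.html").read_text(encoding="utf-8")
print(len(html), "characters downloaded")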

Download a File Only If It Has Changed Since the Last Download

To save disk space and network resources, you may not want to download a file if it has not changed since the last download. That is why wget offers file timestamping capabilities.

In detail, the --timestamping option instructs wget to compare the timestamps of local files with those on the server. If the local file has the same or a newer timestamp than the remote one, wget will not download the file again. Otherwise, it will download it.

That is how the timestamping mechanism works:

  1. When you download a file using the --timestamping or -N option, wget retrieves the timestamp of the remote file.
  2. It checks the local file’s timestamp (if it exists) and compares it with the remote file’s timestamp.
  3. If the local file does not exist or if its timestamp is older than the one on the server, wget downloads the file. If the local file exists and its timestamp is the same or newer than the one on the server, wget will not download the file.

Timestamping in HTTP is implemented by checking the Last-Modified header returned by the server in response to a HEAD request. wget also looks at the Content-Length header to compare the file sizes: if they differ, the remote file is downloaded regardless of what the Last-Modified header says. Keep in mind that Last-Modified is an optional response header; if it is not present, wget will download the file anyway.

Use the --timestamping option in Python with the following line of code:

output, error = execute_command("wget --timestamping https://brightdata.com")

If you have already downloaded index.html, you will get the message below, indicating that the file will not be downloaded again:

--2024-04-18 15:55:06-- https://brightdata.com
Resolving brightdata.com (brightdata.com)... 104.18.25.60, 104.18.24.60
Connecting to brightdata.com (brightdata.com)|104.18.25.60|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File 'index.html' not modified on server. Omitting download.

The same mechanism also works when downloading files via FTP.
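
To use this behavior from Python, you can wrap the timestamping download in a small function and call it whenever you want to refresh your local copy. The sketch below simply relays wget’s log so you can see whether a new copy was fetched; as mentioned earlier, with the execute_command() helper that log ends up in the error variable because wget writes its messages to standard error:

def refresh_page(url):
    # --timestamping re-downloads the page only if the remote copy is newer
    output, error = execute_command(f"wget --timestamping {url}")
    # wget logs to stderr, so the interesting messages are in `error`
    print(error)

refresh_page("https://brightdata.com")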

Complete Interrupted Downloads

By default, wget automatically retries downloading a file up to 20 times if the connection is lost during the process. If you want to manually resume a partially downloaded file, use the --continue or -c option as follows:

output, error = execute_command("wget --continue http://lumtest.com/myip.json")
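
The --continue flag also combines well with other options covered in the wget manual. For instance, the sketch below resumes a partial download while capping the number of retries and the download speed; the specific values are placeholders to adapt to your needs:

output, error = execute_command(
    "wget --continue --tries=5 --limit-rate=500k http://lumtest.com/myip.json"
)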

Download an Entire Site

Recursive downloading is a wget feature to download an entire site with a single command.

Starting from the specified URL, wget parses the HTML page and follows the documents referenced in the src and href HTML attributes and in url() CSS values. If a linked file is itself a text/HTML document, wget parses it and follows its links as well, until it reaches the desired depth. Recursive download follows a breadth-first search algorithm, retrieving files at depth 1, then depth 2, and so on.

The wget options to bear in mind when using this download mode are:

  • --recursive or -r: Tells wget to download files recursively, meaning it will follow links on the web pages. It enables you to create local copies of entire websites, including all linked resources such as images, stylesheets, and scripts. When this option is specified, wget stores all downloaded files in a folder with the same name as the domain name of the target site.
  • --level=<depth> or -l <depth>: Specifies the maximum depth of recursion to follow when downloading linked pages. For example, if you set --level=1, wget will only download the pages linked directly from the starting URL and will not follow links on those pages to download further pages. To prevent crawling huge sites, the default depth value is 5. Set this option to 0 or 'inf' for infinite depth. If you want to make sure that all resources needed to properly display a page are downloaded regardless of the specified depth, add the -p or --page-requisites option.
  • --convert-links or -k: Modifies the links in the downloaded HTML files to point to the locally downloaded files instead of the original URLs. This option is useful when you want to create a local mirror of a site and ensure that all links within the downloaded pages work correctly offline.

Assume you want to recursively download the Bright Data site with a maximum depth of 1 while converting all links to point to local files. That is the Python wget instruction you should write:

output, error = execute_command("wget --recursive --level=1 --convert-links https://brightdata.com")

Note: This command might take a while based on your Internet connection speed, so be patient.

The brightdata.com folder will now contain a local copy of the Bright Data site files, with one level of recursion depth.
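
Once the recursive download completes, you can inspect the local mirror directly from Python. The following sketch, assuming the files ended up in a brightdata.com folder as described above, lists every file wget saved:

from pathlib import Path

# walk the local mirror folder and print every downloaded file
mirror_dir = Path("brightdata.com")
for file_path in sorted(mirror_dir.rglob("*")):
    if file_path.is_file():
        print(file_path)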

Pros and Cons of Using Wget with Python

Let’s see the pros and cons of using wget with Python.

 Pros

  • Easy Python integration thanks to the subprocess module.
  • Tons of features and options, including recursive download, automatic retries, file timestamping, and more.
  • Can make a local copy of an entire site with a single command.
  • FTP support.
  • Support for proxy integration.
  • Ability to recover interrupted downloads.

 Cons

  • The output consists of files written to disk, not string variables you can use directly in your Python script.
  • You need a parser like Beautiful Soup to access specific DOM elements from the downloaded HTML files, as shown in the sketch below.
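
To give an idea of that extra step, here is a minimal sketch that feeds a page downloaded by wget into Beautiful Soup. It assumes you have installed the beautifulsoup4 package (pip install beautifulsoup4) and that index.html is in the current directory:

from pathlib import Path
from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4

# parse the HTML file wget saved and extract the page title
html = Path("index.html").read_text(encoding="utf-8")
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text() if soup.title else "No <title> found")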

[Extra] Using Wget with a Proxy

The main challenge of using wget to download a file or an entire site is that your requests can get blocked. This is because wget requests appear to destination servers as coming from a bot. To protect against bots, some websites implement restrictions and limitations on their pages and resources. These may include geographic restrictions, rate-limiting policies, or anti-scraping measures.

Integrating a proxy server into wget is a viable solution to bypass such restrictions. A proxy acts as an intermediary between your computer and the Internet. By forwarding wget traffic through a proxy server, you can avoid exposing your IP address and circumvent most limitations imposed by sites.
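
In practice, wget can read proxy settings from the http_proxy/https_proxy environment variables or from -e options passed on the command line. The snippet below sketches the latter approach; the proxy host, port, and credentials are placeholders to replace with the details provided by your proxy service:

# placeholder proxy URL: replace host, port, and credentials with your provider's details
proxy_url = "http://username:password@proxy.example.com:8080"

output, error = execute_command(
    f"wget -e use_proxy=yes -e http_proxy={proxy_url} -e https_proxy={proxy_url} "
    "http://lumtest.com/myip.json"
)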

For a more complete tutorial, refer to our guide on how to use a proxy with Wget.

Conclusion

In this article, you saw what wget is, why it can be better than the requests library, and how to call it from Python. You now know that wget is a powerful tool for downloading files and web pages over HTTP, HTTPS, and FTP, and how to integrate it into your Python scripts.

As seen in this guide, a proxy can be a great ally in avoiding all the anti-bot measures that sites take to prevent utilities like wget from downloading their content. The problem is that there are dozens of providers online and choosing the best one is not easy. Save time and go directly to the best in the market, Bright Data!

Bright Data controls the best proxy servers in the world, serving Fortune 500 companies and more than 20,000 customers. Its offer includes a wide range of proxy types.

Start with a free trial or talk to one of our data experts about our proxy and scraping solutions.