In this article, you will see:
- The primary causes of a slow web scraping process
- Various techniques to speed up web scraping
- How to optimize a sample Python scraping script for faster data retrieval
Let’s dive in!
Reasons Why Your Scraping Process Is Slow
Explore the key reasons why your web scraping process may be slow.
Reason #1: Slow Server Responses
One of the most notable factors affecting your web scraping speed is the server response time. When you send a request to a website, the server processes it and responds. If the server is slow, your requests will take longer to complete. Common reasons for a slow server include heavy traffic, limited resources, or network slowdowns.
Unfortunately, there is little you can do to speed up a target server. That is beyond your control, unless the slowdown is due to an overwhelming number of requests from your end. If this is the case, spread your requests over a longer duration by adding random delays between them.
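For example, here is a minimal sketch of that idea with requests (the 1–3 second range below is arbitrary and should be tuned to the target site):
import random
import time

import requests

urls = [
    "http://quotes.toscrape.com/",
    "https://quotes.toscrape.com/page/2/",
]

for url in urls:
    response = requests.get(url)
    # ... process the response ...
    # pause for a random interval before the next request to avoid overwhelming the server
    time.sleep(random.uniform(1, 3))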
Reason #2: Slow CPU Processing
CPU processing speed plays a crucial role in how quickly your scraping scripts can operate. When you run your scripts sequentially, your CPU is tasked with processing each operation one at a time, which can be time-consuming. This is particularly noticeable when your scripts involve complex computations or data transformations.
Additionally, HTML parsing takes some time and can significantly slow down your scraping process. Learn more in our article on HTML web scraping.
Reason #3: Limited I/O Operations
Input/Output (I/O) operations can easily become the bottleneck of your scraping operation. That is especially true when your target site consists of several pages. If your script is designed to wait for responses from external resources before proceeding, that can lead to considerable delays.
Sending a request, waiting for the server to respond, processing it, and then moving on to the next request is not an efficient way to perform web scraping.
Other Reasons
Other reasons that make your web scraping script slow are:
- Inefficient code: Poorly optimized scraping logic can make the entire scraping process slow. Avoid inefficient data structures, unnecessary loops, or excessive logging.
- Rate limiting: If the target site restricts the number of requests a user can make in a specified time frame, your automated scraper will be slowed down as a result. The solution? Proxy services (see the brief example after this list)!
- CAPTCHAs and other anti-scraping solutions: CAPTCHAs and anti-bot measures can interrupt your scraping process by requiring user interaction. Discover other anti-scraping techniques.
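As a quick illustration of the proxy idea mentioned above, requests lets you route traffic through a proxy via its proxies parameter. A minimal sketch, with a placeholder proxy endpoint you would replace with your provider’s details:
import requests

# placeholder proxy endpoint: replace host, port, and credentials with your provider's values
proxy_url = "http://username:password@proxy.example.com:8080"
proxies = {
    "http": proxy_url,
    "https": proxy_url,
}

response = requests.get("http://quotes.toscrape.com/", proxies=proxies)
print(response.status_code)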
Techniques to Speed Up Web Scraping
In this section, you will discover the most popular methods to speed up web scraping. We will start with a basic Python scraping script and demonstrate the impact of various optimizations on it.
Note: The techniques explored here work with any programming language or technology. Python is used just for simplicity and because it is one of the best programming languages for web scraping.
This is the initial Python scraping script:
import requests
from bs4 import BeautifulSoup
import csv
import time
def scrape_quotes_to_scrape():
    # array with the URLs of the pages to scrape
urls = [
"http://quotes.toscrape.com/",
"https://quotes.toscrape.com/page/2/",
"https://quotes.toscrape.com/page/3/",
"https://quotes.toscrape.com/page/4/",
"https://quotes.toscrape.com/page/5/",
"https://quotes.toscrape.com/page/6/",
"https://quotes.toscrape.com/page/7/",
"https://quotes.toscrape.com/page/8/",
"https://quotes.toscrape.com/page/9/",
"https://quotes.toscrape.com/page/10/"
]
# where to store the scraped data
quotes = []
# scrape the pages sequentially
for url in urls:
print(f"Scraping page: '{url}'")
# send a GET request to get the page HTML
response = requests.get(url)
# parse the page HTML using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
# select all quote elements on the page
quote_html_elements = soup.select(".quote")
# iterate over the quote elements and scrape their content
for quote_html_element in quote_html_elements:
# extract the text of the quote
text = quote_html_element.select_one(".text").get_text()
# extract the author of the quote
author = quote_html_element.select_one(".author").get_text()
# extract tags associated with the quote
tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
# populate a new quote object and add it to the list
quote = {
"text": text,
"author": author,
"tags": ", ".join(tags)
}
quotes.append(quote)
print(f"Page '{url}' scraped successfully\n")
print("Exporting scraped data to CSV")
# export the scraped quotes to a CSV file
with open("quotes.csv", "w", newline="", encoding="utf-8") as csvfile:
fieldnames = ["text", "author", "tags"]
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(quotes)
print("Quotes exported to CSV\n")
# measure execution time
start_time = time.time()
scrape_quotes_to_scrape()
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")
The above scraper targets 10 paginated URLs from the Quotes to Scrape website. For each URL, the script performs the following operations:
- Sends a GET request using requests to fetch the page’s HTML.
- Parses the HTML content with BeautifulSoup.
- Extracts the quote text, author, and tags for each quote element on the page.
- Stores the scraped data in a list of dictionaries.
Finally, it exports the scraped data to a CSV file named quotes.csv.
To run the script, install the required libraries with:
pip install requests beautifulsoup4
The scrape_quotes_to_scrape() function call is wrapped with time.time() calls to measure how long the scraping process takes. On our machine, the initial script takes approximately 4.6 seconds to complete.
Running the script generates a quotes.csv file in your project folder. Additionally, you will see logs similar to the following:
Scraping page: 'http://quotes.toscrape.com/'
Page 'http://quotes.toscrape.com/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/2/'
Page 'https://quotes.toscrape.com/page/2/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/3/'
Page 'https://quotes.toscrape.com/page/3/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/4/'
Page 'https://quotes.toscrape.com/page/4/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/5/'
Page 'https://quotes.toscrape.com/page/5/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/6/'
Page 'https://quotes.toscrape.com/page/6/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/7/'
Page 'https://quotes.toscrape.com/page/7/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/8/'
Page 'https://quotes.toscrape.com/page/8/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/9/'
Page 'https://quotes.toscrape.com/page/9/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'https://quotes.toscrape.com/page/10/' scraped successfully
Exporting scraped data to CSV
Quotes exported to CSV
Execution time: 4.63 seconds
This output clearly shows that the script sequentially scrapes each paginated webpage from Quotes to Scrape. As you are about to see, some optimizations will significantly change the flow and speed of this process.
Now, let’s find out how to make web scraping faster!
1. Use a Faster HTML Parsing Library
Data parsing consumes time and resources, and different HTML parsers use various approaches to accomplish this task. Some focus on providing a rich feature set with a self-describing API, while others prioritize performance. For more details, check out our guide on the best HTML parsers.
In Python, Beautiful Soup is the most popular HTML parser, but it is not necessarily the fastest. See some benchmarks for more context.
In reality, Beautiful Soup acts simply as a wrapper around different underlying parsers. You can specify the parser you want to use when initializing it, via the second argument:
soup = BeautifulSoup(response.content, "html.parser")
Generally, Beautiful Soup is used in combination with html.parser, the built-in parser from Python’s standard library. However, if you are looking for speed, you should consider lxml. This is one of the fastest HTML parsers available in Python, as it is based on a C implementation.
To install lxml, run the following command:
pip install lxml
Once installed, you can use it with Beautiful Soup like this:
soup = BeautifulSoup(response.content, "lxml")
Now, run your Python scraping script again. This time, you should see the following output:
# omitted for brevity...
Execution time: 4.35 seconds
The execution time dropped from 4.61 seconds to 4.35 seconds. While this change might seem small, the impact of this optimization greatly depends on the size and complexity of the HTML pages being parsed and how many elements are being selected.
In this example, the target site has pages with a simple, short, and shallow DOM structure. Still, achieving a speed improvement of around 6% with just a small code change is a worthwhile gain!
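If you want to verify the difference on your own target pages, a rough timing sketch like the one below can help. It downloads a single page from Quotes to Scrape and parses it repeatedly with both parsers (the 200-iteration count is arbitrary):
import time

import requests
from bs4 import BeautifulSoup

html = requests.get("http://quotes.toscrape.com/").content

for parser in ["html.parser", "lxml"]:
    start = time.perf_counter()
    # parse the same document many times to get a measurable duration
    for _ in range(200):
        BeautifulSoup(html, parser)
    elapsed = time.perf_counter() - start
    print(f"{parser}: {elapsed:.2f} seconds for 200 parses")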
👍 Pros:
- Easy to implement in Beautiful Soup
👎 Cons:
- Small performance gain
- Noticeable benefits mainly on large pages with complex DOM structures
- Faster HTML parsers may have a more complex API
2. Implement Multiprocessing Scraping
Multiprocessing is an approach to parallel execution where a program spawns multiple processes. Each of these processes operates in parallel and independently on a CPU core to perform tasks simultaneously rather than sequentially.
This method is particularly beneficial for I/O-bound operations like web scraping. The reason is that the primary bottleneck is often the time spent waiting for responses from web servers. By utilizing multiple processes, you can send requests to several pages at the same time, reducing overall scraping time.
To adapt your scraping script for multiprocessing, you need to make some important modifications to the execution logic. Follow the steps below to transform your Python scraper from a sequential to a multiprocessing approach.
To get started with multiprocessing in Python, the first step is to import Pool and cpu_count from the multiprocessing module:
from multiprocessing import Pool, cpu_count
Pool provides what you need to manage a pool of worker processes, while cpu_count helps you determine the number of CPU cores available for parallel processing.
Next, isolate the logic to scrape a single URL within a function:
def scrape_page(url):
print(f"Scraping page: '{url}'")
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
quote_html_elements = soup.select(".quote")
quotes = []
for quote_html_element in quote_html_elements:
text = quote_html_element.select_one(".text").get_text()
author = quote_html_element.select_one(".author").get_text()
tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
quotes.append({
"text": text,
"author": author,
"tags": ", ".join(tags)
})
print(f"Page '{url}' scraped successfully\n")
return quotes
Each worker process will call the above function, executing it on its own CPU core.
Then, replace the sequential scraping flow with a multiprocessing logic:
def scrape_quotes():
urls = [
"http://quotes.toscrape.com/",
"https://quotes.toscrape.com/page/2/",
"https://quotes.toscrape.com/page/3/",
"https://quotes.toscrape.com/page/4/",
"https://quotes.toscrape.com/page/5/",
"https://quotes.toscrape.com/page/6/",
"https://quotes.toscrape.com/page/7/",
"https://quotes.toscrape.com/page/8/",
"https://quotes.toscrape.com/page/9/",
"https://quotes.toscrape.com/page/10/"
]
# create a pool of processes
with Pool(processes=cpu_count()) as pool:
results = pool.map(scrape_page, urls)
# flatten the results list
quotes = [quote for sublist in results for quote in sublist]
print("Exporting scraped data to CSV")
with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
fieldnames = ["text", "author", "tags"]
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(quotes)
print("Quotes exported to CSV\n")
Finally, run the scrape_quotes() function while measuring the execution time:
if __name__ == "__main__":
start_time = time.time()
scrape_quotes()
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")
Note that the if __name__ == "__main__": guard is required to prevent parts of your code from being executed when the module is imported, which is what happens in each child process on platforms that spawn new processes. Without this check, the multiprocessing module may keep spawning new processes, leading to unexpected behavior, especially on Windows.
Put it all together, and you will get:
from multiprocessing import Pool, cpu_count
import requests
from bs4 import BeautifulSoup
import csv
import time
def scrape_page(url):
print(f"Scraping page: '{url}'")
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
quote_html_elements = soup.select(".quote")
quotes = []
for quote_html_element in quote_html_elements:
text = quote_html_element.select_one(".text").get_text()
author = quote_html_element.select_one(".author").get_text()
tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
quotes.append({
"text": text,
"author": author,
"tags": ", ".join(tags)
})
print(f"Page '{url}' scraped successfully\n")
return quotes
def scrape_quotes():
urls = [
"http://quotes.toscrape.com/",
"https://quotes.toscrape.com/page/2/",
"https://quotes.toscrape.com/page/3/",
"https://quotes.toscrape.com/page/4/",
"https://quotes.toscrape.com/page/5/",
"https://quotes.toscrape.com/page/6/",
"https://quotes.toscrape.com/page/7/",
"https://quotes.toscrape.com/page/8/",
"https://quotes.toscrape.com/page/9/",
"https://quotes.toscrape.com/page/10/"
]
# create a pool of processes
with Pool(processes=cpu_count()) as pool:
results = pool.map(scrape_page, urls)
# flatten the results list
quotes = [quote for sublist in results for quote in sublist]
print("Exporting scraped data to CSV")
with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
fieldnames = ["text", "author", "tags"]
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(quotes)
print("Quotes exported to CSV\n")
if __name__ == "__main__":
start_time = time.time()
scrape_quotes()
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")
Execute the script again. This time, it will produce some logs as follows:
Scraping page: 'http://quotes.toscrape.com/'
Scraping page: 'https://quotes.toscrape.com/page/2/'
Scraping page: 'https://quotes.toscrape.com/page/3/'
Scraping page: 'https://quotes.toscrape.com/page/4/'
Scraping page: 'https://quotes.toscrape.com/page/5/'
Scraping page: 'https://quotes.toscrape.com/page/6/'
Scraping page: 'https://quotes.toscrape.com/page/7/'
Scraping page: 'https://quotes.toscrape.com/page/8/'
Page 'http://quotes.toscrape.com/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/9/'
Page 'https://quotes.toscrape.com/page/3/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'https://quotes.toscrape.com/page/4/' scraped successfully
Page 'https://quotes.toscrape.com/page/5/' scraped successfully
Page 'https://quotes.toscrape.com/page/6/' scraped successfully
Page 'https://quotes.toscrape.com/page/7/' scraped successfully
Page 'https://quotes.toscrape.com/page/2/' scraped successfully
Page 'https://quotes.toscrape.com/page/8/' scraped successfully
Page 'https://quotes.toscrape.com/page/9/' scraped successfully
Page 'https://quotes.toscrape.com/page/10/' scraped successfully
Exporting scraped data to CSV
Quotes exported to CSV
Execution time: 1.87 seconds
As you can see, the execution order is no longer sequential. Your script can now scrape several pages simultaneously. Specifically, it can scrape up to the number of cores available on your CPU (8, in our case).
Parallel processing cuts the execution time from 4.61 seconds to 1.87 seconds, roughly a 2.5× speedup. That is impressive!
👍 Pros:
- Great execution time enhancement
- Natively supported by most programming languages
👎 Cons:
- Limited by the number of cores available on your machine
- Pages are not scraped in the order listed (although pool.map() still returns results in input order)
- Requires significant changes to the code
3. Implement Multithreading Scraping
Multithreading is a programming technique to run multiple threads concurrently within a single process. This enables your script to perform multiple tasks simultaneously, with each task handled by a dedicated thread.
While similar to multiprocessing, multithreading does not necessarily require multiple CPU cores, because a single core can interleave many threads that share the same memory space. In CPython, the GIL prevents threads from running Python code in parallel, but that matters little for I/O-bound tasks like web scraping, since the GIL is released while waiting for network responses. Dig into this concept in our guide on concurrency vs parallelism.
Keep in mind that transforming a scraping script from a sequential approach to a multithreaded one requires changes similar to those described in the previous chapter.
In this implementation, we will use ThreadPoolExecutor from the Python concurrent.futures module. You can import it as below:
from concurrent.futures import ThreadPoolExecutor
ThreadPoolExecutor provides a high-level interface for managing a pool of threads, running them concurrently for you.
As in the previous chapter, start by isolating the logic for scraping a single URL into a function. The key difference is that you now need to use ThreadPoolExecutor to run that function across multiple threads:
quotes = []
# create a thread pool with up to 10 workers
with ThreadPoolExecutor(max_workers=10) as executor:
# use map to apply the scrape_page function to each URL
results = executor.map(scrape_page, urls)
# combine the results from all threads
for result in results:
quotes.extend(result)
By default, if max_workers is None or not specified, ThreadPoolExecutor uses min(32, os.cpu_count() + 4) worker threads (before Python 3.8, the default was the number of processors multiplied by 5). In this case, there are only 10 pages, so setting it to 10 is fine. Do not forget that opening too many threads can slow down your system and hurt performance instead of improving it.
The entire scraping script will contain the following code:
from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup
import csv
import time
def scrape_page(url):
print(f"Scraping page: '{url}'")
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
quote_html_elements = soup.select(".quote")
quotes = []
for quote_html_element in quote_html_elements:
text = quote_html_element.select_one(".text").get_text()
author = quote_html_element.select_one(".author").get_text()
tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
quotes.append({
"text": text,
"author": author,
"tags": ", ".join(tags)
})
print(f"Page '{url}' scraped successfully\n")
return quotes
def scrape_quotes():
urls = [
"http://quotes.toscrape.com/",
"https://quotes.toscrape.com/page/2/",
"https://quotes.toscrape.com/page/3/",
"https://quotes.toscrape.com/page/4/",
"https://quotes.toscrape.com/page/5/",
"https://quotes.toscrape.com/page/6/",
"https://quotes.toscrape.com/page/7/",
"https://quotes.toscrape.com/page/8/",
"https://quotes.toscrape.com/page/9/",
"https://quotes.toscrape.com/page/10/"
]
# where to store the scraped data
quotes = []
# create a thread pool with up to 10 workers
with ThreadPoolExecutor(max_workers=10) as executor:
# use map to apply the scrape_page function to each URL
results = executor.map(scrape_page, urls)
# combine the results from all threads
for result in results:
quotes.extend(result)
print("Exporting scraped data to CSV")
with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
fieldnames = ["text", "author", "tags"]
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(quotes)
print("Quotes exported to CSV\n")
if __name__ == "__main__":
start_time = time.time()
scrape_quotes()
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")
Launch it, and it will log messages as below:
Scraping page: 'http://quotes.toscrape.com/'
Scraping page: 'https://quotes.toscrape.com/page/2/'
Scraping page: 'https://quotes.toscrape.com/page/3/'
Scraping page: 'https://quotes.toscrape.com/page/4/'
Scraping page: 'https://quotes.toscrape.com/page/5/'
Scraping page: 'https://quotes.toscrape.com/page/6/'
Scraping page: 'https://quotes.toscrape.com/page/7/'
Scraping page: 'https://quotes.toscrape.com/page/8/'
Scraping page: 'https://quotes.toscrape.com/page/9/'
Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'http://quotes.toscrape.com/' scraped successfully
Page 'https://quotes.toscrape.com/page/6/' scraped successfully
Page 'https://quotes.toscrape.com/page/7/' scraped successfully
Page 'https://quotes.toscrape.com/page/10/' scraped successfully
Page 'https://quotes.toscrape.com/page/8/' scraped successfully
Page 'https://quotes.toscrape.com/page/5/' scraped successfully
Page 'https://quotes.toscrape.com/page/9/' scraped successfully
Page 'https://quotes.toscrape.com/page/4/' scraped successfully
Page 'https://quotes.toscrape.com/page/3/' scraped successfully
Page 'https://quotes.toscrape.com/page/2/' scraped successfully
Exporting scraped data to CSV
Quotes exported to CSV
Execution time: 0.52 seconds
Similar to multiprocessing scraping, the execution order of the pages is no longer sequential. This time, the performance improvement is even greater than with multiprocessing. That is because the script can now execute 10 requests simultaneously, exceeding the previous limit of 8 requests (the number of CPU cores).
The time improvement is huge: from 4.61 seconds to 0.52 seconds, which is roughly a 9× speedup (an execution time reduction of about 89%)!
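As a side note, ThreadPoolExecutor also lets you submit tasks individually and consume results as they complete, which makes per-page error handling easier. A minimal sketch, reusing the scrape_page() function defined above:
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_quotes_with_error_handling(urls):
    quotes = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        # submit one task per URL and remember which future belongs to which page
        future_to_url = {executor.submit(scrape_page, url): url for url in urls}
        # handle each page as soon as it finishes, regardless of order
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                quotes.extend(future.result())
            except Exception as e:
                # a failed page does not stop the remaining workers
                print(f"Failed to scrape '{url}': {e}")
    return quotes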
👍 Pros:
- Huge execution time improvement
- Natively supported by most technologies
👎 Cons:
- Finding the right number of threads is not easy
- Pages are not scraped in the order listed
- Requires significant changes to the code
4. Use Async/Await Scraping
Asynchronous programming is a modern programming paradigm that enables you to write non-blocking code. The idea is to give developers the ability to deal with concurrent operations without explicitly managing multithreading or multiprocessing.
In a traditional synchronous approach, each operation must terminate before the next one begins. This can lead to inefficiencies, especially in I/O-bound tasks like web scraping. With async programming, you can initiate multiple I/O operations simultaneously and then wait for them to complete. That keeps your script responsive and efficient.
In Python, asynchronous scraping is generally implemented using the asyncio module from the standard library. This package provides the infrastructure for writing single-threaded concurrent code using coroutines, via the async and await keywords.
However, standard HTTP libraries like requests do not support asynchronous operations. Thus, you need to use an asynchronous HTTP client like AIOHTTP, which is specifically designed to work seamlessly with asyncio. This combination helps you send multiple HTTP requests concurrently without blocking your script’s execution.
Install AIOHTTP using the following command:
pip install aiohttp
Then, import asyncio and aiohttp:
import asyncio
import aiohttp
Just as in previous chapters, encapsulate the logic for scraping a single URL into a function. However, this time, the function will be asynchronous:
async def scrape_url(session, url):
async with session.get(url) as response:
print(f"Scraping page: '{url}'")
html_content = await response.text()
soup = BeautifulSoup(html_content, "html.parser")
# scraping logic...
Note the use of the await keyword to retrieve the HTML of the webpage.
To execute the function concurrently across all URLs, create an AIOHTTP session and gather multiple scraping tasks:
# executing the scraping tasks concurrently
async with aiohttp.ClientSession() as session:
tasks = [scrape_url(session, url) for url in urls]
results = await asyncio.gather(*tasks)
# flatten the results list
quotes = [quote for sublist in results for quote in sublist]
Finally, use asyncio.run() to execute your asynchronous main scraping function:
if __name__ == "__main__":
start_time = time.time()
asyncio.run(scrape_quotes())
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")
Your async scraping script in Python will contain these lines of code:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
import csv
import time
async def scrape_url(session, url):
async with session.get(url) as response:
print(f"Scraping page: '{url}'")
html_content = await response.text()
soup = BeautifulSoup(html_content, "html.parser")
quote_html_elements = soup.select(".quote")
quotes = []
for quote_html_element in quote_html_elements:
text = quote_html_element.select_one(".text").get_text()
author = quote_html_element.select_one(".author").get_text()
tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
quotes.append({
"text": text,
"author": author,
"tags": ", ".join(tags)
})
print(f"Page '{url}' scraped successfully\n")
return quotes
async def scrape_quotes():
urls = [
"http://quotes.toscrape.com/",
"https://quotes.toscrape.com/page/2/",
"https://quotes.toscrape.com/page/3/",
"https://quotes.toscrape.com/page/4/",
"https://quotes.toscrape.com/page/5/",
"https://quotes.toscrape.com/page/6/",
"https://quotes.toscrape.com/page/7/",
"https://quotes.toscrape.com/page/8/",
"https://quotes.toscrape.com/page/9/",
"https://quotes.toscrape.com/page/10/"
]
# executing the scraping tasks concurrently
async with aiohttp.ClientSession() as session:
tasks = [scrape_url(session, url) for url in urls]
results = await asyncio.gather(*tasks)
# flatten the results list
quotes = [quote for sublist in results for quote in sublist]
print("Exporting scraped data to CSV")
with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
fieldnames = ["text", "author", "tags"]
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(quotes)
print("Quotes exported to CSV\n")
if __name__ == "__main__":
start_time = time.time()
asyncio.run(scrape_quotes())
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")
Launch it, and you will get an output like this:
Scraping page: 'http://quotes.toscrape.com/'
Page 'http://quotes.toscrape.com/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/3/'
Scraping page: 'https://quotes.toscrape.com/page/7/'
Scraping page: 'https://quotes.toscrape.com/page/9/'
Scraping page: 'https://quotes.toscrape.com/page/6/'
Scraping page: 'https://quotes.toscrape.com/page/8/'
Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'https://quotes.toscrape.com/page/3/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/5/'
Scraping page: 'https://quotes.toscrape.com/page/4/'
Page 'https://quotes.toscrape.com/page/7/' scraped successfully
Page 'https://quotes.toscrape.com/page/9/' scraped successfully
Page 'https://quotes.toscrape.com/page/6/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/2/'
Page 'https://quotes.toscrape.com/page/10/' scraped successfully
Page 'https://quotes.toscrape.com/page/5/' scraped successfully
Page 'https://quotes.toscrape.com/page/4/' scraped successfully
Page 'https://quotes.toscrape.com/page/8/' scraped successfully
Page 'https://quotes.toscrape.com/page/2/' scraped successfully
Exporting scraped data to CSV
Quotes exported to CSV
Execution time: 0.51 seconds
Note that the execution time is similar to the multithreading approach, but with the added benefit of not having to manually manage threads.
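One caveat: asyncio.gather() launches all requests at once, which can overwhelm a target server when the URL list grows. A common pattern is to cap concurrency with asyncio.Semaphore. A minimal sketch, reusing the scrape_url() coroutine defined above (the limit of 10 is arbitrary):
import asyncio
import aiohttp

async def scrape_url_limited(semaphore, session, url):
    # only a limited number of coroutines can enter this block at the same time
    async with semaphore:
        # scrape_url() is the coroutine defined in the full script above
        return await scrape_url(session, url)

async def scrape_quotes_limited(urls):
    # cap the number of simultaneous requests
    semaphore = asyncio.Semaphore(10)
    async with aiohttp.ClientSession() as session:
        tasks = [scrape_url_limited(semaphore, session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    # flatten the per-page lists into a single list of quotes
    return [quote for sublist in results for quote in sublist]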
👍 Pros:
- Huge execution time gain
- Async logic is widely used in modern programming and libraries
- Does not require manual management of threads or processes
👎 Cons:
- Not so easy to master
- Pages are not scraped in the order listed
- Requires dedicated async libraries
5. Other Tips and Approaches to Speed Up Web Scraping
Other ways to make web scraping faster are:
- Request rate optimization: Fine-tune request intervals to find the optimal balance between speed and avoiding rate-limiting or getting banned.
- Rotating proxies: Use rotating proxies to distribute requests across multiple IP addresses, reducing the chances of being blocked and enabling faster scraping (see the sketch after this list). See the best rotating proxies.
- Parallel scraping with distributed systems: Distribute scraping tasks across multiple machines.
- Reduce JavaScript rendering: Avoid browser automation tools where possible, preferring HTTP clients combined with HTML parsers. Remember that browsers eat up a lot of resources and are far slower than plain HTTP requests.
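To illustrate the rotating proxy idea from the list above, here is a minimal sketch that picks a random proxy from a hypothetical pool for each request; replace the placeholder endpoints with the ones from your proxy provider:
import random

import requests

# hypothetical proxy pool: replace with your provider's endpoints
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

def get_with_rotating_proxy(url):
    # use a different proxy (and thus a different exit IP) for each request
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = get_with_rotating_proxy("http://quotes.toscrape.com/")
print(response.status_code)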
Conclusion
In this guide, we saw how to make web scraping faster. We uncovered the main reasons why a scraping script can be slow and examined various techniques to address these issues using a sample Python script. With just a few adjustments to the scraping logic, we achieved an 8x improvement in execution time.
While manual optimization of your web scraping logic is crucial for speeding up the data retrieval process, using the right tools is equally important. When targeting dynamic sites that require browser automation solutions, things can become more complicated, as browsers tend to be slow and resource-intensive.
To overcome these challenges, try Scraping Browser, a fully hosted cloud-based solution designed for scraping. It integrates seamlessly with Puppeteer, Selenium, Playwright, and other popular browser automation tools. Equipped with a CAPTCHA auto-solver and backed by a proxy network of 72+ million residential IPs, it offers unlimited scalability to cover any scraping need!
Sign up now and start your free trial.
No credit card required