Serverless Scraping With Scrapy and AWS

Learn to write a Scrapy Spider, deploy it to AWS Lambda, and store scraped data in S3 with this tutorial.
8 min read

In today’s guide, we’re going to write a Scrapy Spider and deploy it to AWS Lambda. The code itself is pretty straightforward, but cloud services like Lambda come with a lot of moving parts. We’ll show you how to navigate those moving parts and what to do when things break.

Prerequisites

To accomplish this task, you’ll need the following:

  • Basic understanding of Python: We’ll write our code using Python. You can learn more about web scraping with Python here.
  • AWS (Amazon Web Services) Account: Since we’re using AWS Lambda, you’ll need an AWS account.
  • A Linux machine, or a Windows machine with WSL 2: Lambda runs your code on Amazon Linux, so the packages we upload need to be binary compatible with it.
  • Basic knowledge of Scrapy: It’s not a hard requirement, but some familiarity with scraping using Scrapy will be helpful as well.

What is Serverless?

Serverless architecture has been hailed as the future of computing. While the actual runtime of a serverless app might cost more per hour, Lambda makes sense if you’re not already paying to keep a server running.

Let’s pretend your scraper takes one minute to run and you run it once a day. With a traditional server, you’d be paying for roughly 720 hours of uptime per month, but your actual usage is only about 30 minutes. With services like Lambda, you only pay for what you actually use.

Pros

  • Billing: You only pay for what you use.
  • Scalability: Lambda scales automatically, you don’t have to worry about it.
  • Server Management: You don’t have to spend any time managing a server. All of this is done automatically.

Cons

  • Latency: If your function has been idle, it takes longer to start up and run.
  • Execution Time: Lambda functions run with a default timeout of 3 seconds and a maximum time of 15 minutes. Traditional servers are much more flexible.
  • Portability: You’re not only dependent on OS compatibility, you’re also at the mercy of your vendor. You can’t just copy your Lambda function and run it on Azure or Google Cloud.

Getting Started

Setting Up Services

After you’ve got your AWS account, you need an S3 bucket. Head over to their ‘All services’ page and scroll down.

A list of all AWS services

Eventually, you’ll see a section called Storage. The first option in this section is called S3. Click on it.

The S3 storage service

Next, click the Create bucket button.

Creating a new S3 bucket

Now, you need to name your bucket and choose your settings. We’re just going with the default ones.

Naming the new S3 bucket and choosing your settings

When you’re finished, click the Create bucket button located in the bottom right corner of the page.

Creating the new S3 bucket

Once you’ve created it, your bucket will show up in the Buckets tab under Amazon S3.

Clicking the create bucket button when the configuration is done

Setting Up Your Project

Create a new project folder.

mkdir scrapy_aws

Move into the new folder and create a virtual environment.

cd scrapy_aws
python3 -m venv venv  

Activate the environment.

source venv/bin/activate

Install Scrapy.

pip install scrapy

What To Scrape

For dynamic websites, anti-bot measures, or large-scale scraping, use Bright Data’s Scraping Browser. It automates tasks, bypasses CAPTCHAs, and scales seamlessly.

We’ll be using books.toscrape as our target site. It’s an educational site devoted entirely to web scraping. If you look at the image below, each book is an article element with the class name product_pod. We want to extract all of these elements from the page.

Inspecting one of the books in an article tag

The title of each book is embedded within an a element which is nested inside of an h3 element.

Inspecting the book title

Each price is embedded in a p element, which is nested inside a div. The p element has a class name of price_color.

Inspecting the book price
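
To sanity-check these selectors before writing the full spider, you can run them against a simplified snippet of the markup with Scrapy’s Selector class. The HTML below is a trimmed-down sketch of a single product_pod card, not the site’s exact markup.

from scrapy import Selector

# Simplified sketch of one book card from books.toscrape.com,
# trimmed down for illustration.
html = """
<article class="product_pod">
    <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
    <div class="product_price">
        <p class="price_color">£51.77</p>
    </div>
</article>
"""

card = Selector(text=html).css("article.product_pod")[0]
print(card.css("h3 > a::text").get())   # A Light in the ...
print(card.css("div > p::text").get())  # £51.77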

Writing Our Code

Now, we’ll write our scraper and test it locally. Open a new Python file and paste the following code into it. We named ours aws_spider.py.

import scrapy


class BookSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        # Each book on the page is an article card
        for card in response.css("article"):
            yield {
                # The title text sits in the link nested inside the h3
                "title": card.css("h3 > a::text").get(),
                # The price sits in the p nested inside the pricing div
                "price": card.css("div > p::text").get(),
            }
        # Follow the pagination link until there are no more pages
        next_page = response.css("li.next > a::attr(href)").get()

        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

You can test the spider with the following command. It should output a JSON file full of books with prices.

python -m scrapy runspider aws_spider.py -o books.json
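
If the crawl succeeds, books.json should contain a JSON array of objects along these lines (the exact titles and prices come from the live site):

[
    {"title": "A Light in the ...", "price": "£51.77"},
    ...
]

Note that the link text is truncated for long titles. If you want the full title instead, card.css("h3 > a::attr(title)").get() will read it from the title attribute.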

Now, we need a handler. The handler’s job is simple: run the spider. Here, we’ll create two handlers that are basically the same. The major difference is that one runs locally and one runs on Lambda.

Here is our local handler; we called it lambda_function_local.py.

import subprocess

def handler(event, context):
    # Output file path for local testing
    output_file = "books.json"

    # Run the Scrapy spider with the -o flag to save output to books.json
    subprocess.run(["python", "-m", "scrapy", "runspider", "aws_spider.py", "-o", output_file])

    # Return success message
    return {
        'statusCode': '200',
        'body': f"Scraping completed! Output saved to {output_file}",
    }

# Add this block for local testing
if __name__ == "__main__":
    # Simulate an AWS Lambda invocation event and context
    fake_event = {}
    fake_context = {}

    # Call the handler and print the result
    result = handler(fake_event, fake_context)
    print(result)

Delete books.json, then test the local handler with the following command. If everything is working properly, you’ll see a new books.json in your project folder.

python lambda_function_local.py

Now, here’s the handler we’ll use for Lambda; we saved it as lambda_function.py. It’s pretty similar, it just has some small tweaks to store our data in our S3 bucket. Remember to change bucket_name to your own bucket.

import subprocess
import boto3

def handler(event, context):
    # Define the local and S3 output file paths
    local_output_file = "/tmp/books.json"  # Must be in /tmp for Lambda
    bucket_name = "aws-scrapy-bucket"
    s3_key = "scrapy-output/books.json"  # Path in S3 bucket

    # Run the Scrapy spider and save the output locally
    subprocess.run(["python3", "-m", "scrapy", "runspider", "aws_spider.py", "-o", local_output_file])

    # Upload the file to S3
    s3 = boto3.client("s3")
    s3.upload_file(local_output_file, bucket_name, s3_key)

    return {
        'statusCode': 200,
        'body': f"Scraping completed! Output uploaded to s3://{bucket_name}/{s3_key}"
    }

  • We first save our data to a temp file: local_output_file = "/tmp/books.json". Lambda functions can only write to the /tmp directory, so the output has to go there.
  • We then upload it to our bucket with s3.upload_file(local_output_file, bucket_name, s3_key).
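
One note on boto3: AWS Lambda’s Python runtimes ship with boto3 preinstalled, so you don’t strictly need to bundle it in your deployment package. If you want to run this S3 version of the handler locally, though, install it into your virtual environment first.

pip install boto3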

Deploying To AWS Lambda

Now, we need to deploy to AWS Lambda.

Make a package folder.

mkdir package

Copy our dependencies over to the package folder.

cp -r venv/lib/python3.*/site-packages/* package/
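
If you built your virtual environment on a machine that isn’t binary compatible with Amazon Linux, copying from venv can break compiled dependencies such as lxml. As an alternative sketch (the platform tag and Python version below are assumptions, match them to your Lambda runtime), you can ask pip to download Linux-compatible wheels straight into the package folder:

pip install scrapy --target package/ --platform manylinux2014_x86_64 --python-version 3.12 --only-binary=:all: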

Copy the spider and the handler into the package folder. Make sure you copy the handler you wrote for Lambda (lambda_function.py), not the local handler we tested earlier.

cp lambda_function.py aws_spider.py package/

Compress the contents of the package folder into a ZIP file. The files need to sit at the root of the archive (not nested inside a package/ directory), so zip from inside the folder.

cd package
zip -r ../lambda_function.zip .
cd ..

Once we’ve created the ZIP file, we need to head on over to AWS Lambda and select Create function. When prompted, enter your basic information such as runtime (Python) and architecture.

Make sure to give it permission to access your S3 bucket.

Creating a new function
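
Before testing, it’s worth bumping the function’s timeout: new Lambda functions default to the 3-second timeout mentioned in the cons above, which is far too short for a full crawl. You can change it under Configuration > General configuration, or with the AWS CLI; the function name and values here are just examples.

aws lambda update-function-configuration --function-name scrapy-books --timeout 300 --memory-size 512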

Once you’ve created the function, select the Upload from dropdown. It’s located in the top right-hand corner of the Code source tab.

Lambda upload from

Choose .zip file and upload the ZIP file you created.

Uploading the created ZIP file

Click the Test button and wait for your function to run. After it runs, check your S3 bucket, and you should see a new file, books.json.

The new books.json file in your S3 bucket

Troubleshooting Tips

Scrapy Cannot Be Found

You might get an error saying that Scrapy cannot be found. If so, you need to add the following to your array of commands in subprocess.run().

Adding a piece of code in the subprocess.run() function
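
The exact snippet is shown in the screenshot above. The underlying problem is usually that the child Python process started by subprocess.run() can’t see the dependencies bundled in your deployment package. One workaround, sketched here as an assumption rather than the exact change from the image, is to point PYTHONPATH at the Lambda task root before launching Scrapy.

import os
import subprocess

# Make the site-packages bundled in the deployment package
# (LAMBDA_TASK_ROOT, normally /var/task) visible to the child process.
env = dict(os.environ)
env["PYTHONPATH"] = os.environ.get("LAMBDA_TASK_ROOT", "/var/task")

subprocess.run(
    ["python3", "-m", "scrapy", "runspider", "aws_spider.py", "-o", "/tmp/books.json"],
    env=env,
    check=True,
)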

General Dependency Issues

You need to make sure your Python versions are the same. Check your local install of Python.

python --version

If this command outputs a different version than your Lambda function, change your Lambda configuration to match it.
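
If you’re not sure which version the function itself is running, one quick check (a temporary debugging addition, not part of the final handler) is to log the runtime’s version and compare it against your local interpreter.

import sys

def handler(event, context):
    # Log the Lambda runtime's Python version so it can be compared
    # against the local interpreter used to build the package.
    print(sys.version)
    return {"statusCode": 200, "body": sys.version}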

Handler Issues

The handler should match the function you wrote

Your handler setting should match the function you wrote in lambda_function.py. As you can see above, we have lambda_function.handler. Here, lambda_function is the name of your Python file, and handler is the name of the function.
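
You can correct the handler string in the console under Runtime settings, or with the AWS CLI; the function name below is just an example.

aws lambda update-function-configuration --function-name scrapy-books --handler lambda_function.handler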

Can’t Write to S3

You might run into permission issues when storing the output. If so, you need to grant S3 access to your Lambda function’s execution role.

Go to the IAM console and find your function’s execution role (Lambda creates one named after the function). Click on it and then click the Add permissions dropdown.

Click Attach policies.

Clicking on 'attach policies' in the Lambda function

Select AmazonS3FullAccess.

Selecting AmazonS3FullAccess

Conclusion

You made it! At this point, you should be able to hold your own in the UI nightmare known as the AWS console. You know how to write a crawler with Scrapy. You know how to package the environment with either Linux or WSL to ensure binary compatibility with Amazon Linux.

If you are not into manual scraping, check out our Web Scraper APIs and ready-made datasets. Sign up now to start your free trial!

No credit card required