In today’s guide, we’re going to write a Scrapy spider and deploy it to AWS Lambda. The code itself is pretty straightforward, but cloud services like Lambda come with a lot of moving parts. We’ll show you how to navigate those moving parts and what to do when things break.
Prerequisites
To accomplish this task, you’ll need the following:
- Basic understanding of Python: We’ll write our code using Python. You can learn more about web scraping with Python here.
- AWS (Amazon Web Services) Account: Since we’re using AWS Lambda, you’ll need an AWS account.
- A Linux or Windows machine with WSL 2: Lambda runs your code on Amazon Linux, so the packages we upload need to be binary compatible with it.
- Basic knowledge of Scrapy: It’s not a requirement, but basic knowledge of scraping with Scrapy will be helpful as well.
What is Serverless?
Serverless architecture has been hailed as the future of computing. Hour for hour, serverless compute can cost more than a traditional server, but if you aren’t already paying to keep a server running around the clock, Lambda makes a lot of sense.
Let’s say your scraper takes one minute to run and you run it once a day. With a traditional server, you’d be paying for a month of 24-hour uptime even though your actual usage is only about 30 minutes per month. With services like Lambda, you only pay for what you actually use.
Pros
- Billing: You only pay for what you use.
- Scalability: Lambda scales automatically, you don’t have to worry about it.
- Server Management: You don’t have to spend any time managing a server. All of this is done automatically.
Cons
- Latency: If your function has been idle for a while, it has to cold start, so the next invocation takes longer to spin up and run.
- Execution Time: Lambda functions run with a default timeout of 3 seconds and a maximum time of 15 minutes. Traditional servers are much more flexible.
- Portability: You’re not only dependent on OS compatibility, you’re also at the mercy of your vendor. You can’t just copy your Lambda function and run it on Azure or Google Cloud.
Getting Started
Setting Up Services
After you’ve got your AWS account, you need an S3 bucket. Head over to their ‘All services’ page and scroll down.
Eventually, you’ll see a section called Storage. The first option in this section is S3. Click on it.
Next, click the Create bucket button.
Now, you need to name your bucket and choose your settings. We’re just going with the default ones.
When you’re finished, click the Create bucket button located in the bottom right corner of the page.
Once you’ve created it, your bucket will show up in the Buckets tab under Amazon S3.
Setting Up Your Project
Create a new project folder.
mkdir scrapy_aws
Move into the new folder and create a virtual environment.
cd scrapy_aws
python3 -m venv venv
Activate the environment.
source venv/bin/activate
Install Scrapy.
pip install scrapy
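If you want to confirm the install worked before moving on, Scrapy ships with a version command:
python -m scrapy version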
What To Scrape
For dynamic websites, anti-bot measures, or large-scale scraping, use Bright Data’s Scraping Browser. It automates tasks, bypasses CAPTCHAs, and scales seamlessly.
We’ll be using books.toscrape.com as our target site. It’s an educational site devoted entirely to web scraping. If you inspect the page, you’ll see that each book is an article element with the class name product_pod. We want to extract all of these elements from the page.
The title of each book is embedded within an a element nested inside an h3 element.
Each price is embedded in a p element with the class name price_color, which is nested inside a div.
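If you’d like to sanity-check these selectors before writing the spider, Scrapy’s interactive shell is a quick way to do it. The selectors below are a bit more explicit than the ones used in the spider later on, but they target the same elements:
scrapy shell "https://books.toscrape.com"
>>> response.css("article.product_pod h3 > a::text").get()
>>> response.css("article.product_pod div > p.price_color::text").get()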
Writing Our Code
Now, we’ll write our scraper and test it locally. Open a new Python file and paste the following code into it. We named ours aws_spider.py.
import scrapy


class BookSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        # Each book on the page is an article with the class product_pod
        for card in response.css("article"):
            yield {
                "title": card.css("h3 > a::text").get(),
                "price": card.css("div > p::text").get(),
            }

        # Follow the "next" link until we run out of pages
        next_page = response.css("li.next > a::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
You can test the spider with the following command. It should output a JSON file full of books with prices.
python -m scrapy runspider aws_spider.py -o books.json
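One thing to watch: Scrapy’s -o flag appends to an existing file, so re-running the command can leave you with duplicated or invalid JSON. In recent Scrapy versions you can use the capital -O flag instead, which overwrites the file on each run:
python -m scrapy runspider aws_spider.py -O books.json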
Now, we need a handler. The handler’s job is simple: run the spider. We’ll create two handlers that are basically the same; the major difference is that one runs locally and the other runs on Lambda.
Here is our local handler. We called it lambda_function_local.py.
import subprocess


def handler(event, context):
    # Output file path for local testing
    output_file = "books.json"

    # Run the Scrapy spider with the -o flag to save output to books.json
    subprocess.run(["python", "-m", "scrapy", "runspider", "aws_spider.py", "-o", output_file])

    # Return success message
    return {
        'statusCode': 200,
        'body': f"Scraping completed! Output saved to {output_file}",
    }


# Add this block for local testing
if __name__ == "__main__":
    # Simulate an AWS Lambda invocation event and context
    fake_event = {}
    fake_context = {}

    # Call the handler and print the result
    result = handler(fake_event, fake_context)
    print(result)
Delete the books.json file left over from the earlier test run, then test the local handler with the following command. If everything is working properly, you’ll see a fresh books.json in your project folder.
python lambda_function_local.py
Now, here’s the handler we’ll use for Lambda. It’s pretty similar; it just has a few small tweaks to store our data in our S3 bucket. We saved this one as lambda_function.py. Remember to change bucket_name to your own bucket.
import subprocess

import boto3


def handler(event, context):
    # Define the local and S3 output file paths
    local_output_file = "/tmp/books.json"  # Must be in /tmp for Lambda
    bucket_name = "aws-scrapy-bucket"
    s3_key = "scrapy-output/books.json"  # Path in S3 bucket

    # Run the Scrapy spider and save the output locally
    subprocess.run(["python3", "-m", "scrapy", "runspider", "aws_spider.py", "-o", local_output_file])

    # Upload the file to S3
    s3 = boto3.client("s3")
    s3.upload_file(local_output_file, bucket_name, s3_key)

    return {
        'statusCode': 200,
        'body': f"Scraping completed! Output uploaded to s3://{bucket_name}/{s3_key}"
    }
- We first save our data to a temporary file: local_output_file = "/tmp/books.json". Lambda’s filesystem is read-only apart from /tmp, so this is the only place the spider can write its output.
- We then upload it to our bucket with s3.upload_file(local_output_file, bucket_name, s3_key). An optional tweak for configuring the bucket name is sketched just below.
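As mentioned in the list above, one small optional tweak (our suggestion, not part of the original handler) is to read the bucket name from a Lambda environment variable instead of hardcoding it:
import os

# Hypothetical tweak: fall back to the hardcoded bucket if BUCKET_NAME isn't set
bucket_name = os.environ.get("BUCKET_NAME", "aws-scrapy-bucket")
You can set BUCKET_NAME under Configuration > Environment variables in the Lambda console.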
Deploying To AWS Lambda
Now, we need to deploy to AWS Lambda.
Make a package folder.
mkdir package
Copy our dependencies over to the package folder.
cp -r venv/lib/python3.*/site-packages/* package/
Copy the spider and the handler into the package folder. Make sure you copy the handler you made for Lambda (saved as lambda_function.py), not the local handler we tested earlier.
cp lambda_function.py aws_spider.py package/
Compress the contents of the package folder into a ZIP file. Lambda expects lambda_function.py and its dependencies at the root of the archive, so zip from inside the folder rather than zipping the folder itself.
cd package && zip -r ../lambda_function.zip . && cd ..
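If you want to double-check the archive before uploading, listing its contents should show lambda_function.py, aws_spider.py, and the scrapy package at the top level rather than nested under a package/ directory:
unzip -l lambda_function.zip | head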
Once we’ve created the ZIP file, we need to head on over to AWS Lambda and select Create function. When prompted, enter your basic information such as runtime (Python) and architecture.
Make sure to give it permission to access your S3 bucket.
Once you’ve created the function, select the Upload from dropdown. It’s located in the top right-hand corner of the Code source tab.
Choose .zip file and upload the ZIP file you created.
Click the Test button and wait for your function to run. After it finishes, check your S3 bucket and you should see a new file at scrapy-output/books.json.
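If you have the AWS CLI configured, you can also confirm the upload from your terminal (the path below assumes the bucket name and key used in the handler above):
aws s3 ls s3://aws-scrapy-bucket/scrapy-output/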
Troubleshooting Tips
Scrapy Cannot Be Found
You might get an error saying that Scrapy cannot be found. If so, you’ll need to adjust the array of commands you pass to subprocess.run() so that the bundled Scrapy package can be found; one possible approach is sketched below.
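As a rough sketch of one way to do this (an assumption on our part rather than an official fix): point the subprocess at the same interpreter Lambda is running and make sure the deployment package directory is on PYTHONPATH. run_spider below is a hypothetical helper you would call from your handler.
import os
import subprocess
import sys

def run_spider(output_file):
    # LAMBDA_TASK_ROOT points at the unzipped deployment package,
    # which is where scrapy and aws_spider.py were bundled
    task_root = os.environ.get("LAMBDA_TASK_ROOT", ".")
    env = os.environ.copy()
    env["PYTHONPATH"] = task_root + os.pathsep + env.get("PYTHONPATH", "")

    # sys.executable is the Python interpreter Lambda itself is running
    subprocess.run(
        [sys.executable, "-m", "scrapy", "runspider",
         os.path.join(task_root, "aws_spider.py"), "-o", output_file],
        env=env,
        check=True,  # raise an error if the spider exits badly
    )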
General Dependency Issues
You need to make sure your Python versions are the same. Check your local install of Python.
python --version
If this command outputs a different version than your Lambda function’s runtime, change the runtime in your Lambda configuration to match it.
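Alternatively, you can rebuild the environment with the interpreter that matches the runtime you picked in the console. For example, assuming you chose the Python 3.12 runtime and have python3.12 installed locally (the version here is just an assumption; use whichever matches your function):
python3.12 -m venv venv
source venv/bin/activate
pip install scrapy
cp -r venv/lib/python3.12/site-packages/* package/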
Handler Issues
Your handler setting should match the function you wrote in lambda_function.py. Ours is lambda_function.handler: lambda_function is the name of your Python file (without the .py extension), and handler is the name of the function inside it.
Can’t Write to S3
You might run into permissions issues when it comes to storing the output. If so, you need to add S3 permissions to your Lambda function’s execution role.
Go to the IAM console, open Roles, and search for the execution role attached to your Lambda function. Click on it and then click the Add permissions dropdown.
Click Attach policies.
Select AmazonS3FullAccess.
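AmazonS3FullAccess is the quickest option, but it grants far more than this function needs. If you’d rather scope things down, a minimal inline policy along these lines (assuming the aws-scrapy-bucket name used in the handler) only allows uploads to that one bucket:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::aws-scrapy-bucket/*"
        }
    ]
}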
Conclusion
You made it! At this point, you should be able to hold your own in the UI nightmare known as the AWS console. You know how to write a crawler with Scrapy. You know how to package the environment with either Linux or WSL to ensure binary compatibility with Amazon Linux.
If you are not into manual scraping, check out our Web Scraper APIs and ready-made datasets. Sign up now to start your free trial!
No credit card required