Craigslist has a gold mine of information you can use for market research, price monitoring, and competitor analysis. In fact, it publishes more than 80 million classified ads every month.
But accessing all that information programmatically can be a challenge, thanks in part to bot-detection measures, like geoblocking, IP banning, rate limiting, and CAPTCHAs.
In this article, you’ll learn how to build your own Craigslist scraper with Playwright and Python so you can take advantage of the wealth of information available. You’ll also learn how to avoid getting blocked by using Bright Data proxies, the Scraping Browser, and datasets.
How to Scrape Craigslist Using Python
In this section, you’ll build the Python script to scrape Craigslist. Specifically, your scraper will extract car listings for any city that’s entered and then store that data in a CSV file.
Start by setting up a new directory for your project:
mkdir craigScraper
Then navigate into your new directory:
cd craigScraper
And set up a virtual environment for your project using the following command:
python3 -m venv env
A virtual environment allows you to install packages and versions specific to that directory, avoiding conflict with your global installation. Activate the virtual environment using this command:
source env/bin/activate # For Linux/MacOS
env\Scripts\activate # For Windows
Then install the Playwright library, Microsoft’s open source cross-browser automation and testing platform:
pip3 install pytest-playwright
This is what you’ll use to scrape Craigslist. Once that’s done, install the required browsers:
playwright install
This command installs specific supported versions of testing browsers, like Chromium and Firefox.
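If you want to confirm the setup before writing the scraper, a minimal sanity check (assuming the install above succeeded) is to launch headless Chromium and print a page title:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # uses the Chromium that `playwright install` downloaded
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # prints "Example Domain" if everything is wired up
    browser.close()

If this prints the title without errors, Playwright and its browsers are ready to go.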
Then create a scraper.py file and add the following code to it:
from playwright.sync_api import sync_playwright
import csv, sys

def main():
    city = input("Enter any city you want to scrape cars from: ").lower()  # Get the city that we want to scrape cars from
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()  # Launch the browser using Playwright
        try:  # Go to the cars listing section on Craigslist of the city entered by the user
            page.goto(f'https://{city}.craigslist.org/search/cta#search=1~gallery~0~0')
        except:
            print("Page does not exist on Craigslist")
            sys.exit(1)
        page.wait_for_timeout(15000)
        cars = page.get_by_role("listitem").all()  # Grab all the listings on the page
        with open("cars.csv", "w") as csvfile:  # Write to a CSV file
            writer = csv.writer(csvfile)
            writer.writerow(
                ['Car', 'Price', 'Miles Driven', 'Location', 'Posted', 'Link'])  # Columns of the CSV file
            for car in cars:  # Looping over each of the listings
                try:
                    meta = []
                    price = car.locator("span.priceinfo").inner_text()  # Price of the car
                    text = car.locator("a > span.label").inner_text()  # Title of the listing
                    link = car.locator("a.posting-title").get_attribute('href')  # Link to the listing
                    info = car.locator("div.meta").inner_text()  # Meta info (posted ago, miles driven, location)
                    meta = info.split("·")
                    time, miles, location = meta[0], meta[1], meta[2]
                    writer.writerow([text, price, miles, location, time, link])  # Writing to CSV file
                except:
                    print("Inadequate information about the car \n -------------")
        page.wait_for_timeout(10000)
        browser.close()

if __name__ == "__main__":
    main()
In this code, you’re importing the required libraries, namely Playwright, csv, and sys. You’re using the synchronous API in this tutorial, but for more complex scenarios, you should use async_playwright.

Next, you declare a main() function that contains all the scraping logic. The city variable stores the city that the user wants to scrape data from.
Then Playwright launches a browser window that navigates to the URL for the car listings. If the page doesn’t exist, the program prints an error message and quits.
After that, the program pauses for 15,000 milliseconds to give the listings time to load. It then gets all the car listings, opens a CSV file, writes the column headings, and scrapes and writes the remaining data. If a listing is missing any of the fields, it prints an error message instead.
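A fixed pause works, but it either wastes time or fails on slow connections. As a sketch of a more robust approach, you could wait for the first listing to become visible and only then collect them all (the 15-second ceiling here is an assumption you can tune):

# Instead of page.wait_for_timeout(15000), wait until the first listing
# is visible (up to 15 seconds), then grab all of them.
page.get_by_role("listitem").first.wait_for(state="visible", timeout=15000)
cars = page.get_by_role("listitem").all()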
Playwright uses locators like page.get_by_role(), page.get_by_text(), page.get_by_label(), and page.locator() to find specific elements on the page.
This code searches for all the list items (car listings) and then uses CSS selectors to find the title, price, link, and other info related to each of the elements.
If you look at Craigslist’s website structure, you’ll see the following:
- Every car listing has a class of gallery-card.
- Inside that, the a tag contains the title.
- The div with class meta has the time posted and miles driven, as well as the location.
- The span with class priceinfo has the price of the listing.
That’s why, in this code, you’ve used span.priceinfo to find the price. It goes to the span tag inside the gallery card with the class name of priceinfo. You’ve extracted the other details in a similar way.
Once the extraction is complete, the browser window is closed.
The output for the code looks like this:
With data from the listings, the scraper creates a CSV file containing all the parameters (i.e., title, price, link, and location) for complete entries and prints an error message in the terminal for the rest.
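To double-check the output programmatically, a small sketch like this reads cars.csv back in and prints a quick summary (it assumes the file produced by the scraper above):

import csv

with open("cars.csv", newline="") as csvfile:
    rows = list(csv.DictReader(csvfile))  # header row becomes the dict keys

print(f"Scraped {len(rows)} listings")
for row in rows[:5]:  # preview the first five entries
    print(row["Car"], "-", row["Price"])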
This might seem simple on the surface, but it’s not. It gets progressively harder and riskier as soon as you try to scrape more data. For instance, your scraper has a high risk of getting your IP banned for scraping data too frequently. Or worse yet, you could face a legal risk for large-scale projects.
That’s why using proxies is a wise choice.
Sending requests to a website from the same IP at a high frequency can lead to banning. Additionally, most websites have bot-detection measures in place to prevent such attacks. That’s why proxies are a must for large-scale, anonymous web scraping.
Why You Should Use Proxies When Scraping Websites
Proxies add a layer of anonymity between you and the destination website by masking your IP address. It’s like sending a friend to pick up a package for you; they go to the store instead of you, protecting your identity.
Here are some reasons why you should use proxies while web scraping:
Geoblocking Avoidance
Geoblocking is the process of restricting access based on the user’s location.
When you visit a website, it looks up your IP address in a database that maps IPs to the locations they belong to. This enables the website to deliver personalized content or enforce location-based restrictions.
Proxies overcome this by masking your IP address and relaying your requests to the server using an IP that isn’t blocked (eg unblocking regional content on streaming platforms).
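To see this masking in action, here’s a minimal sketch that routes a headless browser through a proxy and asks an IP-echo service which address it sees; the proxy server and credentials below are placeholders, not real values:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True, proxy={
        "server": "http://proxy.example.com:8080",  # placeholder proxy endpoint
        "username": "YOUR_USERNAME",
        "password": "YOUR_PASSWORD",
    })
    page = browser.new_page()
    page.goto("https://httpbin.org/ip")  # echoes back the IP the server sees
    print(page.inner_text("body"))       # should show the proxy's IP, not yours
    browser.close()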
IP Rotation
As the name suggests, IP rotation refers to constantly changing your proxy IP to further trick the website. Frequent requests from one IP might lead to the website flagging it, which can be avoided by rotating from a pool of IPs.
Proxy providers like Bright Data use advanced rotation algorithms to ensure you don’t get blocked. They also allow you to customize predefined criteria, like the number of requests before changing, the time between IP changes, and the quantity of IPs in your pool.
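Bright Data handles this rotation for you, but the underlying idea is simple enough to sketch: keep a pool of proxy endpoints and launch each session through a different one. The addresses below are placeholders for illustration:

import random
from playwright.sync_api import sync_playwright

# Placeholder pool -- in practice these would be real proxy endpoints.
proxy_pool = [
    {"server": "http://proxy1.example.com:8080"},
    {"server": "http://proxy2.example.com:8080"},
    {"server": "http://proxy3.example.com:8080"},
]

def fetch_title(url):
    # Each call launches a browser through a randomly chosen proxy.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=random.choice(proxy_pool))
        page = browser.new_page()
        page.goto(url)
        title = page.title()
        browser.close()
        return title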
Load Balancing
Load balancers are reverse proxies that sit in front of your servers and distribute client requests across them. This allows you to scale your application easily while ensuring optimal resource utilization:
For example, in this diagram, the load balancer (or reverse proxy) distributes the client traffic among three servers, which then access the database server.
Apart from that, load balancing also ensures that your website is available at all times and eases the maintenance process at a minimal cost. The firewall built within the load balancer prevents attacks on your website, and it can also identify and mitigate distributed denial-of-service (DDoS) attacks.
While web scraping, proxies can distribute the load for efficiency and overcome bot detection.
How to Scrape Craigslist with Bright Data Proxies
Now that you know how beneficial proxies can be, enhance your Craigslist scraper by routing it through a proxy. You’ll use Bright Data in this tutorial as it provides comprehensive web scraping solutions, including several proxy types, an easy-to-use proxy manager, a browser extension, a Scraping Browser, and a web scraping IDE.
Start off by visiting their website and signing up for a free account.
Once you’ve created your account, navigate to the Proxies & Scraping Infrastructure section in the sidebar and create a new proxy using the Add button. Then choose the Datacenter proxies option:
Next, you’ll see a screen where you can name the solution and select the IP type. The options are as follows:
- Shared (pay per usage): Shared pool of datacenter IPs in various countries. You only pay based on your usage.
- Shared (pay per IP): Select the number of IPs and countries. You pay for the number of IPs plus usage, or you can switch to unlimited bandwidth.
- Dedicated: Buy dedicated IPs for dedicated domains.
- Premium IPs: Best-performing IPs for top-targeted websites, like airbnb.com, amazon.com, and reddit.com.
Make sure that you activate the proxy:
Next, you only need to change a few lines of code to route the scraper through the proxy.
Update this line in scraper.py:
browser = p.chromium.launch(headless=False)
Use this code:
browser = p.chromium.launch(headless=False, proxy={
    "server": "brd.superproxy.io:22225",
    "username": "YOUR_USERNAME",
    "password": "YOUR_PASSWORD"
})
You can find these credentials in your Bright Data dashboard. Go to Proxies & Scraping Infrastructure and select the specific proxy you want to view them for. You’ll see them listed on the information page:
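Rather than hard-coding the credentials in scraper.py, you may prefer to read them from environment variables; here’s a minimal sketch (the variable names BRIGHTDATA_USERNAME and BRIGHTDATA_PASSWORD are arbitrary choices, not anything Bright Data requires):

import os

# Read the proxy credentials from the environment instead of hard-coding them.
browser = p.chromium.launch(headless=False, proxy={
    "server": "brd.superproxy.io:22225",
    "username": os.environ["BRIGHTDATA_USERNAME"],  # arbitrary variable name
    "password": os.environ["BRIGHTDATA_PASSWORD"],
})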
That’s it! If the scraper loads the website and creates a CSV file with the data, the proxy integration is successful.
Another benefit of Bright Data proxies is that you don’t need to worry about IP rotation because it’s built into the proxies.
Scrape Craigslist with the Bright Data Scraping Browser
Bright Data also offers a built-in browser for scraping that is compatible with Puppeteer, Playwright, and Selenium. You’ll implement your Craigslist scraper using that now.
The Scraping Browser offers tons of benefits over a local Chromium instance, like automated built-in CAPTCHA solving, browser fingerprinting, automatic retries, and JavaScript rendering, to bypass bot-detection software.
You can add as many scraping projects and browsers as you want. Everything is hosted and handled by Bright Data, saving you time, money, and resources.
In your dashboard, create a new Scraping Browser instance by navigating to Proxies & Scraping Infrastructure. Then click on the Add button and choose the Scraping Browser option:
You’ll be asked to give the browser a name and shown the estimated cost, after which you can click Add:
Next, copy your password from the Scraping Browser details and add the following code:
password = 'USER:PASS'
url = f'wss://{password}@brd.superproxy.io:9222'
browser = p.chromium.connect_over_cdp(url)
Please note: You need to replace the USER part with the username and PASS with the password of the Scraping Browser.
This is what your code should look like:
from playwright.sync_api import sync_playwright
import csv, sys

def main():
    city = input("Enter any city you want to scrape cars from: ").lower()  # Get the city that we want to scrape cars from
    with sync_playwright() as p:
        password = 'USER:PASS'
        url = f'wss://{password}@brd.superproxy.io:9222'
        browser = p.chromium.connect_over_cdp(url)  # Connect to the Scraping Browser instead of launching a local browser
        page = browser.new_page()
        try:  # Go to the cars listing section on Craigslist of the city entered by the user
            page.goto(f'https://{city}.craigslist.org/search/cta#search=1~gallery~0~0')
        except:
            print("Page does not exist on Craigslist")
            sys.exit(1)
        page.wait_for_timeout(15000)
        cars = page.get_by_role("listitem").all()  # Grab all the listings on the page
        with open("cars.csv", "w") as csvfile:  # Write to a CSV file
            writer = csv.writer(csvfile)
            writer.writerow(
                ['Car', 'Price', 'Miles Driven', 'Location', 'Posted', 'Link'])  # Columns of the CSV file
            for car in cars:  # Looping over each of the listings
                try:
                    meta = []
                    price = car.locator("span.priceinfo").inner_text()  # Price of the car
                    text = car.locator("a > span.label").inner_text()  # Title of the listing
                    link = car.locator("a.posting-title").get_attribute('href')  # Link to the listing
                    info = car.locator("div.meta").inner_text()  # Meta info (posted ago, miles driven, location)
                    meta = info.split("·")
                    time, miles, location = meta[0], meta[1], meta[2]
                    writer.writerow([text, price, miles, location, time, link])  # Writing to CSV file
                except:
                    print("Inadequate information about the car \n -------------")
        page.wait_for_timeout(10000)
        browser.close()

if __name__ == "__main__":
    main()
Now, instead of your local Chromium instance, the Bright Data Scraping Browser is used, giving you access to all the benefits described earlier.
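Because the Scraping Browser runs remotely, you can’t watch it the way you would a local headful window. One simple way to see what the remote browser rendered is to save a screenshot right after the page loads, using Playwright’s standard screenshot call (the filename here is arbitrary):

# Right after page.goto(...), capture what the remote browser rendered.
page.screenshot(path="craigslist_debug.png", full_page=True)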
Datasets: An Alternative to Web Scraping
Bright Data offers custom datasets for public websites, tailored to your needs. These datasets provide high volumes of freshly scraped data on request, with Bright Data managing the entire process, from building the scraper to validating the data.
You can request the data in JSON, NDJSON, or CSV format, delivered via Snowflake, Google Cloud, Pub/Sub, Amazon Simple Storage Service (Amazon S3), or Microsoft Azure, along with APIs for on-demand access.
Bright Data also complies with data protection laws like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).
You can check out their custom Craigslist dataset, which has data on jobs, housing, services, cars, and more. Apart from on-demand data access, you also have the ability to tweak and debug the code to fit your needs. For example, you could edit it to extract only housing data or specific parts from within, like images, titles, and links.
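Once a dataset is delivered, working with it is ordinary file processing. As a hypothetical example, this sketch reads an NDJSON export and keeps only housing records; the filename and field names (category, title, url, images) are assumptions for illustration, since the real schema depends on the dataset you order:

import json

housing = []
with open("craigslist_dataset.ndjson") as f:    # hypothetical delivered file
    for line in f:
        record = json.loads(line)                # one JSON object per line
        if record.get("category") == "housing":  # hypothetical field name
            housing.append({
                "title": record.get("title"),
                "url": record.get("url"),
                "images": record.get("images"),
            })

print(f"Found {len(housing)} housing listings")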
Conclusion
Scraping Craigslist can uncover a lot of useful data. However, you risk being blocked or banned without proper measures, like proxies.
This article showed you how to scrape car listings from Craigslist, save them to a CSV file, and integrate proxies. You also explored Bright Data scraping solutions, like the fully managed Scraping Browser, as well as ready-made datasets for websites such as Facebook, LinkedIn, Crunchbase, Amazon, and Zillow.
The Scraping Browser comes with built-in CAPTCHA unblocking, auto-retrying, and JavaScript rendering, while the datasets offer a no-code method to gather huge amounts of reliable website data.
Bright Data tools will help you with reliable data extraction at affordable prices. Happy scraping!
Register now to find the best Craigslist scraping solution for you, including a free trial.
Note: This guide was thoroughly tested by our team at the time of writing, but as websites frequently update their code and structure, some steps may no longer work as expected.