For 30 years, Craigslist has been a go-to marketplace for all kinds of deals. In spite of its very simple, 1990s design, Craigslist might be the best place in the world to shop “for sale by owner” deals.
Today, we’re going to extract car data from Craigslist using a Python scraper. Follow along and you’ll be scraping Craigslist like a pro in no time.
What to Extract From Craigslist
Digging Through HTML: The Hard Way
The most important skill in web scraping is knowing where to look. We could write an overcomplicated parser that pulls individual items from the HTML code.
If you look at the truck in the image below, its data is nested within a `div` element of the class `cl-gallery`. If we want to do things the hard way, we can find this tag and then further parse elements from there.
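Purely as a sketch, the hard way might look something like this (the Detroit search URL is an illustrative choice, and in practice you may need request headers or retries):

```python
import requests
from bs4 import BeautifulSoup

# Illustrative search URL; any Craigslist results page works the same way
url = "https://detroit.craigslist.org/search/cta"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find every gallery div, then dig each field out of its nested tags
for card in soup.select("div.cl-gallery"):
    print(card.get_text(strip=True))
```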
Finding The JSON: Saving Precious Time
However, there’s a better way. Many sites, including Craigslist, use embedded JSON data to build the entire page. If you can find this JSON, it cuts your parsing work to almost zero.
On a Craigslist page, there is a `script` element holding all the data we want. If we pull this one element, we get the data for the entire page. If you look, its `id` is `ld_searchpage_results`, so we can locate it with the CSS selector `script[id='ld_searchpage_results']`.
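As a quick sketch (fetching with `requests` and the same illustrative Detroit URL), grabbing the entire page’s data takes just one selector:

```python
import json
import requests
from bs4 import BeautifulSoup

url = "https://detroit.craigslist.org/search/cta"  # illustrative search URL
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# One selector pulls the data for the entire page as a JSON string
script_tag = soup.select_one("script[id='ld_searchpage_results']")
search_data = json.loads(script_tag.text)
```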
Scraping Craigslist With Python
Now that we know what we’re trying to find, scraping Craigslist is going to be much easier. In the next few sections, we’ll go over the individual pieces of code and then put them all together into a functional scraper.
Parsing the Page
- First, we create our `url`, `scraped_data`, and `success` variables:
  - `url`: the exact URL of the search we want to perform.
  - `scraped_data`: where we put all of our search results.
  - `success`: we want this scraper to be persistent. Used in combination with a `while` loop, our scraper won’t exit until the job has completed and we set `success` to `True`.
- Then, we get the page and throw an error in the event of a bad response.
- `soup = BeautifulSoup(response.text, "html.parser")` creates a `BeautifulSoup` object that we can use to parse the page.
- We find our embedded JSON with `embedded_json_string = soup.select_one("script[id='ld_searchpage_results']")`.
- We then convert it into a `dict` with `json.loads()`.
- Next, we iterate through all the items and clean their data. Each `clean_item` gets appended to our `scraped_data`.
- Finally, we set `success` to `True` and return the array of scraped listings. You can see all of these steps in the sketch below.
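Here’s a minimal sketch of that function. The JSON-LD layout (`itemListElement` and the fields inside each `item`) is an assumption about Craigslist’s embedded data, so adjust the keys to match what you see on the live page:

```python
import json
import requests
from bs4 import BeautifulSoup

def scrape_search_results(url):
    scraped_data = []
    success = False
    while not success:
        try:
            response = requests.get(url)
            # Throw an error in the event of a bad response
            response.raise_for_status()
        except requests.RequestException as exc:
            # Persistent: keep retrying until the job completes
            print(f"Request failed, retrying: {exc}")
            continue
        # Create a BeautifulSoup object we can use to parse the page
        soup = BeautifulSoup(response.text, "html.parser")
        # Find the embedded JSON holding the data for the entire page
        embedded_json_string = soup.select_one("script[id='ld_searchpage_results']")
        json_data = json.loads(embedded_json_string.text)
        # Iterate through the items and clean their data
        # (these field names are assumptions; inspect the live JSON)
        for element in json_data.get("itemListElement", []):
            item = element.get("item", {})
            # Craigslist stores images as a list; keep only the first one
            images = item.get("image") or []
            clean_item = {
                "name": item.get("name"),
                "price": item.get("offers", {}).get("price"),
                "image": images[0] if images else None,
            }
            scraped_data.append(clean_item)
        success = True
    return scraped_data
```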
Storing Our Data
The two most common storage methods in web scraping are CSV and JSON. We’ll walk through how to store our listings in both formats.
Saving to a JSON File
This basic snippet contains our JSON storage logic. We open a file and pass it into `json.dump()` along with our data. We use `indent=4` to make the JSON file readable.
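A minimal reconstruction of that snippet (the filename is just an example):

```python
import json

def save_to_json(listings, filename="craigslist-results.json"):
    # indent=4 keeps the output file human-readable
    with open(filename, "w") as file:
        json.dump(listings, file, indent=4)
```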
Saving to a CSV File
Saving to a CSV requires a little more work: CSV doesn’t handle arrays very well, which is why we only extracted one image when cleaning the data.
If there are no listings, the function exits. If there are listings, we write the CSV headers using the `keys()` from the first item in the array. Then we use `csv.DictWriter()` to write the headers and the listings.
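Here’s a sketch of that logic, assuming every listing dict shares the same keys:

```python
import csv

def save_to_csv(listings, filename="craigslist-results.csv"):
    # If there are no listings, exit without writing anything
    if not listings:
        return
    with open(filename, "w", newline="") as file:
        # The keys() of the first item become the CSV headers
        writer = csv.DictWriter(file, fieldnames=listings[0].keys())
        writer.writeheader()
        writer.writerows(listings)
```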
Putting it All Together
Now, we can put all these pieces together. This code contains our fully functional scraper.
Inside the `main` block, you can control the storage method with the `OUTPUT` variable. If you want to store the results in a JSON file, set it to `json`. If you want a CSV, set this variable to `csv`. In data collection, you’re going to use both of these storage methods all the time.
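A sketch of that `main` block, reusing the functions sketched above (the search URL is an illustrative Detroit car search):

```python
if __name__ == "__main__":
    OUTPUT = "json"  # set to "csv" for a spreadsheet instead

    # Illustrative search: cars and trucks for sale in Detroit
    url = "https://detroit.craigslist.org/search/cta"
    listings = scrape_search_results(url)

    if OUTPUT == "json":
        save_to_json(listings)
    elif OUTPUT == "csv":
        save_to_csv(listings)
```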
JSON Output
As you can see in the image below, each car is represented by a readable JSON object with a clean, clear structure.
CSV Output
Our CSV output is pretty similar. We get a clean spreadsheet holding all of our listings.
Using Scraping Browser
Scraping Browser allows us to run a Playwright instance with proxy integration, taking your scraping to the next level by operating a full browser from your Python script. If you’re interested in integrating proxies with Playwright, this is the tool for the job.
In the code below, our parsing method remains largely the same, but we use `asyncio` with `async_playwright` to open a headless browser and fetch the page through that browser. Instead of `BeautifulSoup`, we pass our CSS selector into Playwright’s `query_selector()` method.
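Here’s a sketch of that approach. The WebSocket endpoint is a placeholder; substitute the credentials from your own Scraping Browser zone:

```python
import asyncio
import json
from playwright.async_api import async_playwright

# Placeholder endpoint; use your own Scraping Browser credentials
SBR_WS_ENDPOINT = "wss://USERNAME:PASSWORD@brd.superproxy.io:9222"

async def scrape_with_browser(url):
    async with async_playwright() as playwright:
        # Connect to the remote Scraping Browser instance over CDP
        browser = await playwright.chromium.connect_over_cdp(SBR_WS_ENDPOINT)
        page = await browser.new_page()
        await page.goto(url)
        # Same CSS selector as before, but through Playwright this time
        element = await page.query_selector("script[id='ld_searchpage_results']")
        embedded_json_string = await element.text_content()
        await browser.close()
        return json.loads(embedded_json_string)

if __name__ == "__main__":
    data = asyncio.run(
        scrape_with_browser("https://detroit.craigslist.org/search/cta")
    )
```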
Using a Custom Built No-Code Scraper
Here at Bright Data, we also offer a No-Code Craigslist Scraper. With the No-Code Scraper, you specify the data and pages you want to scrape, and we create and deploy a scraper for you!
In the “My Scrapers” section, click “New” and select “Request a custom built scraper.”
Next, you’ll be prompted to enter some URLs that represent your target site’s layout. In the image below, we pass the URL for our car search in Detroit. You could add a second URL for your own city.
Through our automated process, we scrape the sites and create a schema for you to review.
Once the schema’s been created, you need to review it.
Here’s the sample JSON data from the schema for a custom Craigslist scraper. Within minutes, you have a functional prototype.
Next, you set the collection scope. We don’t need it to scrape all of Craigslist or even a specific section, so we’ll feed it URLs to initiate each scrape.
Finally, you’ll be prompted to schedule a call with one of our experts for deployment. You can pay $300 monthly for upkeep and maintenance, or a one-time deployment fee of $1,000.
Conclusion
You can now harness Python to scrape Craigslist quickly and efficiently. You know how to parse and clean the data, and how to store it using CSV and JSON. If you need full browser functionality, Scraping Browser fills that need with full proxy integration. And if you’d rather automate the process entirely, you now know how to use our No-Code Scraper as well.
In addition, if you want to skip the scraping process entirely, Bright Data offers ready-to-use Craigslist datasets. Sign up now and start your free trial today!
No credit card required