In this guide, you will see:
- What a Crunchbase scraper is and how it works
- What data you can automatically collect from Crunchbase
- How to build a Crunchbase scraping script with Python
- Why you might need a more advanced solution to scrape the site
Let’s dive in!
What Is a Crunchbase Scraper?
A Crunchbase scraper is an automated tool designed to extract data from Crunchbase web pages. It navigates through the site, identifies the desired information, and collects it through web scraping.
Crunchbase employs advanced anti-bot and anti-scraping measures to safeguard its data. As a result, an effective Crunchbase scraper must include features like JavaScript rendering, CAPTCHA solving, and browser fingerprint spoofing.
What Data To Scrape From Crunchbase
Below is a list of the data you can automatically retrieve from Crunchbase via web scraping:
- Company information: Name, description, industry, headquarters location, founded date, status (e.g., active, acquired), and more
- Funding data: Total funding amount, funding rounds, investors, and more
- Key people: Founders, executives, members, roles and titles, and more
- Products and services: Product descriptions, categories of products or services offered, and more
- Acquisitions and mergers: Details of any acquired companies, dates and terms of acquisitions, and more
- Market and financial data: Revenue estimates, number of employees, and more
- News and events: Press releases, significant milestones or events, and more
- Competitors: List of competing companies and more
How to Build a Crunchbase Scraper in Python
In this tutorial section, you will learn how to create a Crunchbase scraper using Python. The objective is to develop a script that can automatically gather data from the Bright Data Crunchbase page:
Follow the steps below to see how to scrape Crunchbase with Python!
Step #1: Create a Python Project
First, make sure you have Python 3+ installed on your machine. Otherwise, download it from the official site and follow the instructions.
Create a directory for your Python Crunchbase scraper:
The crunchbase-scraper folder will contain your scraping bot.
Open the project folder in your favorite Python IDE, such as PyCharm Community Edition or Visual Studio Code with the Python extension.
Next, create a scraper.py file inside the project folder. That file will contain the Crunchbase scraping logic.
Now, initialize a Python virtual environment. If you are a macOS or Linux user, execute:
Equivalently, on Windows, run:
This will add an env directory to your project.
Right now, your project should have the following structure:
Activate the virtual environment with this command:
Or, on Windows:
Great! You now have a Python project where you can install local dependencies.
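If you prefer to script the setup, the project folder, the empty scraper.py file, and the env virtual environment can also be created with Python's standard library. This is a minimal sketch, not the shell commands referred to in this step. The function name create_project is an illustrative assumption, and activating the environment still has to be done from your shell.

```python
import venv
from pathlib import Path


def create_project(root="crunchbase-scraper", with_pip=True):
    """Create the project folder, an empty scraper.py, and an env/ virtual environment."""
    project = Path(root)
    project.mkdir(parents=True, exist_ok=True)

    # Empty script file that will later hold the scraping logic
    (project / "scraper.py").touch(exist_ok=True)

    # Roughly what running "python -m venv env" does from the command line
    venv.create(project / "env", with_pip=with_pip)
    return project
```

Calling create_project() from the parent directory leaves you with the same layout described above: a crunchbase-scraper folder holding scraper.py and env.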
Keep in mind that you can launch your script with:
Or, on Windows:
Step #2: Determine and Install the Scraping Libraries
You now need to find out which scraping libraries are best suited for extracting data from Crunchbase. Start by making a GET HTTP request to the target webpage using a desktop HTTP client. Here is the result you will get:
As you can see, Crunchbase blocks your request—even if you use realistic browser headers. In other words, you will need a browser automation tool to effectively scrape Crunchbase. Find out more in our article on the best headless browsers.
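The check described above can be approximated with Python's standard library instead of a desktop HTTP client. The sketch below is only an illustration: the header values and the helper names build_request and fetch_status are assumptions, and the URL to probe is the company page you are targeting.

```python
import urllib.error
import urllib.request

# Illustrative browser-like headers (assumed values, not taken from this guide)
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url):
    """Create a GET request that carries the browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS, method="GET")


def fetch_status(url, timeout=10):
    """Return the HTTP status code of the response, including error codes."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses, but the status code is still useful
        return error.code
```

If fetch_status() returns an error code such as 403 Forbidden for the target page, a plain HTTP client is not enough and a real browser is needed.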
For Python, Selenium is one of the most popular headless browser automation tools. It lets you instruct a browser to perform specific interactions and scrape data from dynamic pages.
To install Selenium, use the selenium pip package. In an activated Python virtual environment, run the following command:
Then, import Selenium in your scraper.py file with the following line:
Wonderful! You now have everything you need to perform web scraping on Crunchbase.
Step #3: Visit the Target Page
Initialize a Chrome WebDriver instance and use the get() method to instruct the controlled browser to visit the desired page:
Then, do not forget to close the WebDriver and release the browser resources with:
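One way to make sure quit() always runs, even when a later scraping step raises an exception, is to wrap the browser session in a context manager. The sketch below is an illustration rather than this tutorial's own snippet: browser_session is a hypothetical helper, and both the driver factory (for example, webdriver.Chrome from the selenium package) and the page URL are supplied by the caller.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(create_driver, url):
    """Open a browser, visit url, and always release the browser afterwards."""
    driver = create_driver()  # e.g. webdriver.Chrome, passed in by the caller
    try:
        driver.get(url)
        yield driver
    finally:
        # Runs on normal exit and on errors, so no browser process is leaked
        driver.quit()
```

With Selenium installed, you would write with browser_session(webdriver.Chrome, url) as driver: and place the scraping steps inside that block.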
Currently, your Crunchbase scraper script will contain:
If you run it, you will see the following page for a split second before the script terminates:
The “Chrome is being controlled by test software” message signals that Selenium is operating on Chrome as intended.
Usually, browsers in Selenium scraping scripts are launched in headless mode to save resources. Unfortunately, Crunchbase has an advanced anti-bot detection system that blocks headless browsers. Thus, you need to keep the browser in headed mode. Alternatively, you can try using Playwright Stealth to bypass these detection mechanisms.
Step #4: Handle the Cookie Popup
If you are a European user, the page will show the following cookie popup after a few seconds:
If you do not click the “Accept All” button, interacting with the page is not possible. Inspect the button:
See that you can select it with the #onetrust-accept-btn-handler CSS selector.
Now, write a function that waits up to 60 seconds for the “Accept All” button to be on the page and clickable, and then click it:
Note that:
- The try ... except block is required because the cookie popup may not be on the page. In that case, WebDriverWait will raise a TimeoutException, which will be caught by except.
- “Accept All” is clicked via JavaScript and not through the click() method. The reason is that the HTML button appears slowly with a fade-in animation. So, if you try to click it with click(), you may get an ElementClickInterceptedException.
To work, the above function requires the following imports:
You can now handle the cookie popup by calling:
Fantastic! Get ready to start scraping data on the page.
Step #5: Scrape the About Information
The first piece of information to scrape in the “Summary” card is the “About” description of the company:
Inspect the “About” HTML element:
Note that you can select it with the CSS selector below:
Use the find_element() method to apply the CSS selector on the page. Then, extract the text inside the node with the text attribute:
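As a hedged sketch, this select-then-read pattern can be wrapped in a small helper. Here, get_text is a hypothetical name, and the selector is passed in by the caller because it comes from your own inspection of the page. The helper uses find_elements() instead of find_element() so that a missing node yields None rather than an exception. It also passes the locator strategy as the plain string "css selector", which is the value behind Selenium's By.CSS_SELECTOR.

```python
CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def get_text(root, css_selector):
    """Return the stripped text of the first node matching css_selector, or None.

    root can be the WebDriver itself or any element found earlier.
    """
    matches = root.find_elements(CSS, css_selector)
    if not matches:
        return None
    return matches[0].text.strip()
```

For the “About” description, the call would be about = get_text(driver, about_selector), where about_selector holds the selector identified during inspection.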
The about variable will now contain:
Here we go!
Step #6: Inspect the Page Structure
Now, focus on the information contained in the “Details” card on the page:
If you inspect this section, you will notice that there is no easy way to select the HTML elements to scrape data from:
Most of these nodes have random HTML attributes that are likely generated at build time. These attributes change after each deployment, so you cannot rely on them for node selection. Additionally, many of these elements are not marked with unique classes or IDs.
An effective approach for selecting the elements of interest is to focus on their labels. For example, you can select the fields-card node containing the industries information by identifying which fields-card has a label-with-info node that contains the “Industries” string.
This technique will be used to scrape data from this section. So, it makes sense to centralize the logic in a function:
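A minimal sketch of such a function is shown below. The name get_element_by_label and the case-insensitive comparison are assumptions made for illustration. The locator strategy is passed as the plain string "css selector", which is the value behind Selenium's By.CSS_SELECTOR.

```python
CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def get_element_by_label(root, css_selector, label):
    """Return the first node matching css_selector whose label contains label.

    The label text is read from a nested label-with-info node and compared
    case-insensitively. Returns None when no node matches.
    """
    wanted = label.lower()
    for element in root.find_elements(CSS, css_selector):
        for label_element in element.find_elements(CSS, "label-with-info"):
            if wanted in label_element.text.lower():
                return element
    return None


# Example call (requires a live driver):
# industries_element = get_element_by_label(driver, "fields-card", "Industries")
```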
Use the above function to select the “Industries” fields-card node with:
Terrific! Scraping Crunchbase will now be much easier.
Step #7: Scrape Company Details
Inspect the “Industries” node:
It stores the industries in which the company operates in chips-container a nodes. Select them all, iterate over them, and extract data from them:
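The select, iterate, and extract steps can be sketched as a small reusable helper. The name scrape_texts is hypothetical, and skipping empty strings is an assumption about what you want in the output. The locator strategy is the plain string "css selector", the value behind Selenium's By.CSS_SELECTOR.

```python
CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def scrape_texts(parent, css_selector):
    """Return the non-empty, stripped text of every node matching css_selector."""
    texts = []
    for node in parent.find_elements(CSS, css_selector):
        text = node.text.strip()
        if text:
            texts.append(text)
    return texts
```

With the “Industries” node at hand, the call would be industries = scrape_texts(industries_element, "chips-container a").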
Now, focus on the “Founded Date” element:
In this case, the scraping logic is easier, as you only need to extract the text from the field-formatter element inside the parent fields-card li node:
The same logic can be applied to most of the other company details elements:
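To avoid repeating that logic for every field, it can be driven by a list of labels. The following is a sketch under two assumptions: each fields-card li item contains both its label-with-info node and its field-formatter value, and the labels you pass match the text shown on the page. The function names are hypothetical.

```python
CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def scrape_field(root, label):
    """Return the field-formatter text of the fields-card li item labelled label."""
    wanted = label.lower()
    for item in root.find_elements(CSS, "fields-card li"):
        labels = item.find_elements(CSS, "label-with-info")
        if any(wanted in node.text.lower() for node in labels):
            values = item.find_elements(CSS, "field-formatter")
            return values[0].text.strip() if values else None
    return None


def scrape_fields(root, labels):
    """Apply scrape_field() to several labels and return a label -> text dict."""
    return {label: scrape_field(root, label) for label in labels}
```

For example, scrape_fields(driver, ["Founded Date"]) returns a dictionary with one entry, and any label that is not found maps to None.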
Another node that requires special attention is the “Founders” element:
In this case, you need to iterate over identifier-multi-formatter a nodes and extract data from them:
Finally, take a look at the description node at the end of the “Details” section:
Scrape this data with:
Amazing! Your Crunchbase scraper is almost complete.
Step #8: Scrape the Products and Services Table
Other information worth collecting is the list of products and services offered by the company:
Select the “Products and Services” section using the function defined earlier:
Then, scrape data from the table with:
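A hedged sketch of the table scraping logic follows. It assumes the section uses standard table markup, with rows selectable as tbody tr and cells as td. The helper name scrape_table and the column names you pass in are illustrative, not taken from the page.

```python
CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def scrape_table(section, columns):
    """Turn each table row under section into a dict keyed by columns.

    Assumes rows are "tbody tr" and cells are "td". Rows with fewer cells
    than columns are skipped, and extra cells are ignored.
    """
    rows = []
    for row in section.find_elements(CSS, "tbody tr"):
        cells = [cell.text.strip() for cell in row.find_elements(CSS, "td")]
        if len(cells) >= len(columns):
            rows.append(dict(zip(columns, cells)))
    return rows
```

For a two-column table, a call such as scrape_table(products_section, ["name", "description"]) would produce one dictionary per product row.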
Impressive! The Crunchbase scraping logic is completed.
Step #9: Export the Scraped Data
Populate a company dictionary with the scraped data:
Next, export it to a company.json file:
First, open() creates a company.json output file. Then, json.dump() transforms company into its JSON representation and writes it to the output file.
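The export step described above can be sketched as follows. The function name export_company, the four-space indentation, and the ensure_ascii=False setting are illustrative choices, not requirements.

```python
import json


def export_company(company, path="company.json"):
    """Write the company dictionary to path as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as file:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(company, file, indent=4, ensure_ascii=False)
```

Calling export_company(company) after the scraping steps writes the dictionary to company.json in the current working directory.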
Remember to import json from the Python standard library:
Step #10: Put It All Together
Here is the final scraper.py file:
In just over 100 lines of code, you built a Crunchbase scraper in Python!
Launch the script with the following command:
Or, on Windows:
A company.json file will appear in your project’s folder. Open it and you will see:
That is the data available on the Crunchbase company page for Bright Data.
Et voilà! You just learned how to do web scraping on Crunchbase using Python.
Unlocking Crunchbase Data with Ease
Crunchbase provides a wealth of valuable data but also takes extensive measures to protect it from scrapers and automated bots. While interacting with the site using a headless browser or performing certain actions, you may encounter 403 Forbidden pages or CAPTCHAs.
As a first step, you can refer to our guide on how to bypass CAPTCHAs in Python. However, Crunchbase employs additional advanced anti-scraping solutions that could still lead to blocks.
Without the right tools, scraping Crunchbase can quickly become a slow and frustrating experience. The best solution is Bright Data’s dedicated Crunchbase Scraper API. Retrieve data from Crunchbase without getting blocked!
Conclusion
In this step-by-step tutorial, you learned what a Crunchbase scraper is and the types of data it can retrieve. You also saw how to build a Python script to scrape Crunchbase for company overview data, which required just over 100 lines of code.
The problem is that Crunchbase adopts strict measures against bots and automated scripts. CAPTCHAs, browser fingerprinting, and IP bans are just a few of the defenses used to prevent scraping. Forget about all those challenges with our Crunchbase Scraper API.
If web scraping is not for you but you are still interested in Crunchbase data, explore our Crunchbase datasets!
Talk to one of our experts to find out which of Bright Data’s solutions best suits your needs.