In this guide, you will see:
- What a Crunchbase scraper is and how it works
- What data you can automatically collect from Crunchbase
- How to build a Crunchbase scraping script with Python
- Why you might need a more advanced solution to scrape the site
Let’s dive in!
What Is a Crunchbase Scraper?
A Crunchbase scraper is an automated tool designed to extract data from Crunchbase web pages. It navigates through the site, identifies the desired information, and collects it through web scraping.
Crunchbase employs advanced anti-bot and anti-scraping measures to safeguard its data. As a result, an effective Crunchbase scraper must include features like JavaScript rendering, CAPTCHA solving, and browser fingerprint spoofing.
What Data To Scrape From Crunchbase
Below is a list of the data you can automatically retrieve from Crunchbase via web scraping:
- Company information: Name, description, industry, headquarters location, founded date, status (e.g., active, acquired), and more
- Funding data: Total funding amount, funding rounds, investors, and more
- Key people: Founders, executives, board members, roles and titles, and more
- Products and services: Product descriptions, categories of products or services offered, and more
- Acquisitions and mergers: Details of any acquired companies, dates and terms of acquisitions, and more
- Market and financial data: Revenue estimates, number of employees, and more
- News and events: Press releases, significant milestones or events, and more
- Competitors: List of competing companies and more
How to Build a Crunchbase Scraper in Python
In this tutorial section, you will learn how to create a Crunchbase scraper using Python. The objective is to develop a script that can automatically gather data from the Bright Data Crunchbase page:
Follow the steps below to see how to scrape Crunchbase with Python!
Step #1: Create a Python Project
First, make sure you have Python 3+ installed on your machine. Otherwise, download it from the official site and follow the instructions.
Create a directory for your Python Crunchbase scraper:
mkdir crunchbase-scraper
The crunchbase-scraper folder will contain your scraping bot.
Open the project folder in your favorite Python IDE, such as PyCharm Community Edition or Visual Studio Code with the Python extension.
Next, create a scraper.py file inside the project folder. That file will contain the Crunchbase scraping logic.
Now, initialize a Python virtual environment. If you are a macOS or Linux user, execute:
python3 -m venv env
Equivalently, on Windows, run:
python -m venv env
This will add an env directory to your project.
Right now, your project should have the following structure:
crunchbase-scraper/
├── env/
└── scraper.py
Activate the virtual environment with this command:
source env/bin/activate
Or, on Windows:
env\Scripts\activate
Great! You now have a Python project where you can install local dependencies.
Keep in mind that you can launch your script with:
python3 scraper.py
Or, on Windows:
python scraper.py
Step #2: Determine and Install the Scraping Libraries
You now need to find out which scraping libraries are best suited for extracting data from Crunchbase. Start by making a GET HTTP request to the target webpage using a desktop HTTP client. The request gets blocked with an error page.
In other words, Crunchbase blocks your request—even if you use realistic browser headers. You will need a browser automation tool to effectively scrape Crunchbase. Find out more in our article on the best headless browsers.
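For reference, you can reproduce the same check in Python. Below is a minimal sketch, assuming the requests package is installed (pip install requests); the exact response may vary, but the request is typically blocked:
import requests

# plain GET request to the target page, with a realistic browser User-Agent
response = requests.get(
    "https://www.crunchbase.com/organization/brightdata",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
    }
)
# typically prints 403, as Crunchbase blocks non-browser requests
print(response.status_code)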
For Python, Selenium is one of the most popular headless browser automation tools. It allows you to instruct a browser to perform specific interactions and scrape data from dynamic pages.
To install Selenium, use the selenium pip package. In an activated Python virtual environment, run the following command:
pip install -U selenium
Then, import Selenium in your scraper.py file with the following line:
from selenium import webdriver
Wonderful! You now have everything you need to perform web scraping on Crunchbase.
Step #3: Visit the Target Page
Initialize a Chrome WebDriver instance and use the get() method to instruct the controlled browser to visit the desired page:
driver = webdriver.Chrome()
url = "https://www.crunchbase.com/organization/brightdata"
driver.get(url)
Then, do not forget to close the WebDriver and release the browser resources with:
driver.quit()
Currently, your Crunchbase scraper script will contain:
from selenium import webdriver
# initialize the driver to control a Chrome instance
# in headed mode
driver = webdriver.Chrome()
# navigate to the desired Crunchbase page
url = "https://www.crunchbase.com/organization/brightdata"
driver.get(url)
# scraping logic...
# close the driver and release the browser resources
driver.quit()
If you run it, you will see the following page for a split second before the script terminates:
The “Chrome is being controlled by automated test software” message signals that Selenium is controlling Chrome as intended.
Usually, browsers in Selenium scraping scripts are launched in headless mode to save resources. Unfortunately, Crunchbase has an advanced anti-bot detection system that blocks headless browsers. Thus, you need to keep the browser in headed mode. Alternatively, you can try using Playwright Stealth to bypass these detection mechanisms.
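If you want to reduce the most obvious automation signals while staying in headed mode, here is a hedged sketch of Chrome options commonly used with Selenium. They are not guaranteed to get past Crunchbase’s anti-bot system:
from selenium import webdriver

options = webdriver.ChromeOptions()
# disable the navigator.webdriver flag exposed by automated Chrome
options.add_argument("--disable-blink-features=AutomationControlled")
# hide the "Chrome is being controlled by automated test software" infobar
options.add_experimental_option("excludeSwitches", ["enable-automation"])

driver = webdriver.Chrome(options=options)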
Step #4: Handle the Cookie Popup
If you are a European user, the page will show the following cookie popup after a few seconds:
If you do not click the “Accept All” button, interacting with the page is not possible. Inspect the button:
See that you can select it with the #onetrust-accept-btn-handler CSS selector.
Now, write a function that waits up to 60 seconds for the “Accept All” button to appear on the page and become clickable, and then clicks it:
def handle_cookie_popup(driver, seconds=60):
    try:
        # wait for the given number of seconds for the "Accept All"
        # button of the cookie popup to appear and become clickable
        accept_button = WebDriverWait(driver, seconds).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))
        )
        # click the button via JavaScript to avoid
        # ElementClickInterceptedException errors
        driver.execute_script("arguments[0].click();", accept_button)
        print("'Accept All' button clicked")
    except TimeoutException:
        print(f"'Accept All' button not found within {seconds} seconds")
Note that:
- The try ... except block is required because the cookie popup may not appear on the page. In that case, WebDriverWait will raise a TimeoutException, which is caught by the except clause.
- “Accept All” is clicked via JavaScript rather than through the click() method. The reason is that the HTML button appears slowly with a fade-in animation, so clicking it with click() may raise an ElementClickInterceptedException.
To work, the above function requires the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
You can now handle the cookie popup by calling:
handle_cookie_popup(driver)
Fantastic! Get ready to start scraping data on the page.
Step #5: Scrape the About Information
The first piece of information to scrape in the “Summary” card is the “About” description of the company:
Inspect the “About” HTML element:
Note that you can select it with the CSS selector below:
profile-section description-card
Use the find_element() method to apply the CSS selector on the page. Then, extract the text inside the node with the text attribute:
about_node = driver.find_element(By.CSS_SELECTOR, "profile-section description-card")
about = about_node.text
The about variable will now contain:
"The World's #1 Web Data Platform"
Here we go!
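Note that find_element() fails immediately if the node has not been rendered yet. Since Crunchbase pages are dynamic, a safer variant is to wrap the lookup in an explicit wait, reusing the WebDriverWait imports from Step #4 (the 10-second timeout below is an arbitrary choice):
# wait up to 10 seconds for the "About" card to be rendered before reading it
about_node = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "profile-section description-card"))
)
about = about_node.text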
Step #6: Inspect the Page Structure
Now, focus on the information contained in the “Details” card on the page:
If you inspect this section, you will notice that there is not an easy way to select the HTML elements to scrape data from:
Most of these nodes have random HTML attributes that are likely generated at build time. These attributes change after each deployment, so you cannot rely on them for node selection. Additionally, many of these elements are not marked with unique classes or IDs.
An effective approach for selecting the elements of interest is to focus on their labels. For example, you can select the fields-card li node containing the industries information by identifying which one has a label-with-info child containing the “Industries” string.
This technique will be used to scrape data from this section. So, it makes sense to centralize the logic in a function:
from selenium.common.exceptions import NoSuchElementException

def find_parent_node_based_on_child_node_text(parent_nodes_selector, child_node_selector, text):
    # select all parent nodes (this relies on the global "driver" instance)
    parent_nodes = driver.find_elements(By.CSS_SELECTOR, parent_nodes_selector)
    # iterate through the parent nodes to find the one
    # whose specific child node contains the desired text
    for parent_node in parent_nodes:
        try:
            # get the specific child node within the current parent node
            child_node = parent_node.find_element(By.CSS_SELECTOR, child_node_selector)
            # check if it contains the desired text (case-insensitively)
            if text.upper() in child_node.text.upper():
                return parent_node
        except NoSuchElementException:
            # the current parent node does not have the child node
            continue
    return None
Use the above function to select the “Industries” fields-card li node with:
industries_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Industries")
Terrific! Scraping Crunchbase will now be much easier.
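As a side note, Selenium also supports XPath, which can express the same “parent with a matching child” lookup in a single query. Here is a hedged equivalent sketch for the “Industries” node; note that XPath’s contains() is case-sensitive, unlike the function above:
# one-query XPath alternative: the <li> inside <fields-card> whose
# label-with-info descendant contains "Industries"
industries_parent_node = driver.find_element(
    By.XPATH,
    "//fields-card//li[.//label-with-info[contains(., 'Industries')]]"
)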
Step #7: Scrape Company Details
Inspect the “Industries” node:
It stores the industries in which the company operates in chips-container a nodes. Select them all, iterate over them, and extract their text:
industries_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Industries")
industries_nodes = industries_parent_node.find_elements(By.CSS_SELECTOR, "chips-container a")
industries = []
for industry_node in industries_nodes:
    industries.append(industry_node.text)
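Equivalently, the loop above can be written as a more idiomatic list comprehension:
industries = [industry_node.text for industry_node in industries_nodes]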
Now, focus on the “Founded Date” element:
In this case, the scraping logic is easier, as you only have to extract the text from the field-formatter element inside the parent fields-card li node:
founded_date_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Founded Date")
founded_date_node = founded_date_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
founded_date = founded_date_node.text
The same logic can be applied to most of the other company details elements:
company_type_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Company Type")
company_type_node = company_type_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
company_type = company_type_node.text
operating_status_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Operating Status")
operating_status_node = operating_status_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
operating_status = operating_status_node.text
headquarters_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Headquarters Regions")
headquarters_node = headquarters_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
headquarters = headquarters_node.text
legal_name_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Legal Name")
legal_name_node = legal_name_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
legal_name = legal_name_node.text
contact_email_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Contact Email")
contact_email_node = contact_email_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
contact_email = contact_email_node.text
phone_number_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Phone Number")
phone_number_node = phone_number_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
phone_number = phone_number_node.text
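Since these snippets all follow the same three-line pattern, you could optionally factor it into a small helper. The scrape_detail_field() function below is not part of the original script, just a possible refactor that assumes each field exists on the page:
def scrape_detail_field(label):
    # select the "Details" field labeled with the given string
    parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", label)
    # extract the text of its field-formatter child
    return parent_node.find_element(By.CSS_SELECTOR, "field-formatter").text

company_type = scrape_detail_field("Company Type")
operating_status = scrape_detail_field("Operating Status")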
Another node that requires special attention is the “Founders” element:
In this case, you need to iterate over the identifier-multi-formatter a nodes and extract data from them:
founders_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Founders")
founders_nodes = founders_parent_node.find_elements(By.CSS_SELECTOR, "identifier-multi-formatter a")
founders = []
for founders_node in founders_nodes:
    founders.append(founders_node.text)
Finally, take a look at the description node at the end of the “Details” section:
Scrape this data with:
description_node = driver.find_element(By.CSS_SELECTOR, "section-card description-card")
description = description_node.text
Amazing! Your Crunchbase scraper is almost complete.
Step #8: Scrape the Products and Services Table
Another piece of information worth collecting is the list of products and services offered by the company:
Select the “Products and Services” section using the function defined earlier:
products_parent_node = find_parent_node_based_on_child_node_text("profile-section", ".section-title", "Products and Services")
Then, collect the table rows and scrape data from them:
products_table_rows = products_parent_node.find_elements(By.CSS_SELECTOR, "table tbody tr")
products = []
for row in products_table_rows:
    # extract the name and description from each row's columns
    # (dedicated variable names avoid clobbering the "description"
    # variable scraped earlier)
    product_name = row.find_element(By.CSS_SELECTOR, "td:nth-child(1)").text
    product_description = row.find_element(By.CSS_SELECTOR, "td:nth-child(2)").text
    product = {
        "name": product_name,
        "description": product_description
    }
    products.append(product)
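Keep in mind that find_parent_node_based_on_child_node_text() returns None when no match is found, so the loop above would crash on company pages without a “Products and Services” section. Here is a defensive variant, under the assumption that the section may be missing:
products = []
# guard against company pages that do not have the section
if products_parent_node is not None:
    for row in products_parent_node.find_elements(By.CSS_SELECTOR, "table tbody tr"):
        products.append({
            "name": row.find_element(By.CSS_SELECTOR, "td:nth-child(1)").text,
            "description": row.find_element(By.CSS_SELECTOR, "td:nth-child(2)").text,
        })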
Impressive! The Crunchbase scraping logic is now complete.
Step #9: Export the Scraped Data
Populate a company dictionary with the scraped data:
company = {
    "about": about,
    "industries": industries,
    "founded_date": founded_date,
    "company_type": company_type,
    "operating_status": operating_status,
    "headquarters": headquarters,
    "founders": founders,
    "email": contact_email,
    "phone": phone_number,
    "description": description,
    "products": products
}
Next, export it to a company.json file:
with open("company.json", "w") as json_file:
    json.dump(company, json_file, indent=4)
First, open() creates a company.json output file. Then, json.dump() transforms company into its JSON representation and writes it to the output file.
Remember to import json from the Python standard library:
import json
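By default, json.dump() escapes non-ASCII characters. If the scraped text contains accented names or special symbols, you can optionally keep them human-readable in the output file:
with open("company.json", "w", encoding="utf-8") as json_file:
    # ensure_ascii=False writes non-ASCII characters as-is
    json.dump(company, json_file, indent=4, ensure_ascii=False)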
Step #10: Put It All Together
Here is the final scraper.py file:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import json

def find_parent_node_based_on_child_node_text(parent_nodes_selector, child_node_selector, text):
    # select all parent nodes
    parent_nodes = driver.find_elements(By.CSS_SELECTOR, parent_nodes_selector)
    # iterate through the parent nodes to find the one
    # whose specific child node contains the desired text
    for parent_node in parent_nodes:
        try:
            # get the specific child node within the current parent node
            child_node = parent_node.find_element(By.CSS_SELECTOR, child_node_selector)
            # check if it contains the desired text (case-insensitively)
            if text.upper() in child_node.text.upper():
                return parent_node
        except NoSuchElementException:
            # the current parent node does not have the child node
            continue
    return None

def handle_cookie_popup(driver, seconds=60):
    try:
        # wait for the given number of seconds for the "Accept All"
        # button of the cookie popup to appear and become clickable
        accept_button = WebDriverWait(driver, seconds).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))
        )
        # click the button via JavaScript to avoid
        # ElementClickInterceptedException errors
        driver.execute_script("arguments[0].click();", accept_button)
        print("'Accept All' button clicked")
    except TimeoutException:
        print(f"'Accept All' button not found within {seconds} seconds")
# initialize the driver to control a Chrome instance
# in headed mode
driver = webdriver.Chrome()
# navigate to the desired Crunchbase page
url = "https://www.crunchbase.com/organization/brightdata"
driver.get(url)
# handle the cookie popup, if present
handle_cookie_popup(driver)
# scraping logic
about_node = driver.find_element(By.CSS_SELECTOR, "profile-section description-card")
about = about_node.text
industries_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Industries")
industries_nodes = industries_parent_node.find_elements(By.CSS_SELECTOR, "chips-container a")
industries = []
for industry_node in industries_nodes:
    industries.append(industry_node.text)
founded_date_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Founded Date")
founded_date_node = founded_date_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
founded_date = founded_date_node.text
company_type_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Company Type")
company_type_node = company_type_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
company_type = company_type_node.text
operating_status_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Operating Status")
operating_status_node = operating_status_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
operating_status = operating_status_node.text
headquarters_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Headquarters Regions")
headquarters_node = headquarters_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
headquarters = headquarters_node.text
founders_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Founders")
founders_nodes = founders_parent_node.find_elements(By.CSS_SELECTOR, "identifier-multi-formatter a")
founders = []
for founders_node in founders_nodes:
    founders.append(founders_node.text)
legal_name_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Legal Name")
legal_name_node = legal_name_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
legal_name = legal_name_node.text
contact_email_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Contact Email")
contact_email_node = contact_email_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
contact_email = contact_email_node.text
phone_number_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Phone Number")
phone_number_node = phone_number_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")
phone_number = phone_number_node.text
description_node = driver.find_element(By.CSS_SELECTOR, "section-card description-card")
description = description_node.text
products_parent_node = find_parent_node_based_on_child_node_text("profile-section", ".section-title", "Products and Services")
products_table_rows = products_parent_node.find_elements(By.CSS_SELECTOR, "table tbody tr")
# scrape the product table
products = []
for row in products_table_rows:
    # extract the name and description from each row's columns
    # (dedicated variable names avoid clobbering the "description"
    # variable scraped earlier)
    product_name = row.find_element(By.CSS_SELECTOR, "td:nth-child(1)").text
    product_description = row.find_element(By.CSS_SELECTOR, "td:nth-child(2)").text
    product = {
        "name": product_name,
        "description": product_description
    }
    products.append(product)
# populate a dictionary with the scraped data
company = {
    "about": about,
    "industries": industries,
    "founded_date": founded_date,
    "company_type": company_type,
    "operating_status": operating_status,
    "headquarters": headquarters,
    "founders": founders,
    "email": contact_email,
    "phone": phone_number,
    "description": description,
    "products": products
}
# export the scraped data to a JSON file
with open("company.json", "w") as json_file:
    json.dump(company, json_file, indent=4)
# close the driver and release the browser resources
driver.quit()
In just over 100 lines of code, you have built a complete Crunchbase scraper in Python!
Launch the script with the following command:
python3 scraper.py
Or, on Windows:
python scraper.py
A company.json file will appear in your project’s folder. Open it, and you will see:
{
    "about": "The World's #1 Web Data Platform",
    "industries": [
        "Business Intelligence",
        "Cloud Data Services",
        "Computer",
        "Data Collection and Labeling",
        "Information Technology",
        "IT Infrastructure",
        "Network Security",
        "SaaS",
        "Software"
    ],
    "founded_date": "2014",
    "company_type": "For Profit",
    "operating_status": "Active",
    "headquarters": "Greater New York Area, East Coast, Northeastern US",
    "founders": [
        "Derry Shribman",
        "Ofer Vilenski"
    ],
    "email": "[email protected]",
    "phone": "(888) 538-9204",
    "description": "Proxies that hide your location and IP address, allowing access to public web content anonymously without detection or blocking.",
    "products": [
        {
            "name": "Residential Proxies",
            "description": "A network of over 72 million real residential IPs from 195 countries, allowing access to any website content while avoiding IP bans and CAPTCHAs."
        },
        {
            "name": "Datacenter Proxies",
            "description": "A network of 770,000+ datacenter IPs offering global coverage and the ability to target specific countries and cities for reliable data collection."
        },
        {
            "name": "Mobile Proxies",
            "description": "A network of 7 million+ real 3G/4G mobile IPs from around the world, enabling users to see the web as real mobile users and bypass IP location blocks and CAPTCHAs."
        },
        {
            "name": "ISP Proxies",
            "description": "700,000+ static residential IPs assigned by ISPs, providing long sessions and exclusive use for as long as needed."
        },
        {
            "name": "Rotating Proxies",
            "description": "Proxies that constantly replace your IP address to avoid detection and blocking, with 99.99% uptime and easy management through a Proxy Manager."
        },
        {
            "name": "Anonymous Proxies",
            "description": "Proxies that hide your location and IP address, allowing access to public web content anonymously without detection or blocking."
        }
    ]
}
That is the data available on the Crunchbase company page for Bright Data.
Et voilà! You just learned how to do web scraping on Crunchbase using Python.
Unlocking Crunchbase Data with Ease
Crunchbase provides a wealth of valuable data but also takes extensive measures to protect it from scrapers and automated bots. While interacting with the site using a headless browser or performing certain actions, you may encounter 403 Forbidden pages or CAPTCHAs:
As a first step, you can refer to our guide on how to bypass CAPTCHAs in Python. However, Crunchbase employs additional advanced anti-scraping solutions that could still lead to blocks.
Without the right tools, scraping Crunchbase can quickly become a slow and frustrating experience. The best solution is Bright Data’s dedicated Crunchbase Scraper API. Retrieve data from Crunchbase without getting blocked!
Conclusion
In this step-by-step tutorial, you learned what a Crunchbase scraper is and the types of data it can retrieve. You also saw how to build a Python script to scrape Crunchbase for company overview data, which required just over 100 lines of code.
The problem is that Crunchbase adopts strict measures against bots and automated scripts. CAPTCHAs, browser fingerprinting, and IP bans are just a few of the defenses used to prevent scraping. Forget about all those challenges with our Crunchbase Scraper API.
If web scraping is not for you but you are still interested in Crunchbase data, explore our Crunchbase datasets!
Talk to one of our experts to find out which of Bright Data’s solutions best suits your needs.