How to Scrape Crunchbase With Python in 2024

Learn how to scrape Crunchbase data using Python, explore what information can be collected, and discover solutions to bypass anti-scraping measures.

In this guide, you will see:

  • What a Crunchbase scraper is and how it works
  • What data you can automatically collect from Crunchbase
  • How to build a Crunchbase scraping script with Python
  • Why you might need a more advanced solution to scrape the site

Let’s dive in! 

What Is a Crunchbase Scraper?

A Crunchbase scraper is an automated tool designed to extract data from Crunchbase web pages. It navigates through the site, identifies the desired information, and collects it through web scraping.

Crunchbase employs advanced anti-bot and anti-scraping measures to safeguard its data. As a result, an effective Crunchbase scraper must include features like JavaScript rendering, CAPTCHA solving, and browser fingerprint spoofing.

What Data To Scrape From Crunchbase

Below is a list of the data you can automatically retrieve from Crunchbase via web scraping:

  • Company information: Name, description, industry, headquarters location, founded date, status (e.g., active, acquired), and more
  • Funding data: Total funding amount, funding rounds, investors, and more
  • Key people: Founders, executives, board members, roles and titles, and more
  • Products and services: Product descriptions, categories of products or services offered, and more
  • Acquisitions and mergers: Details of any acquired companies, dates and terms of acquisitions, and more
  • Market and financial data: Revenue estimates, number of employees, and more
  • News and events: Press releases, significant milestones or events, and more
  • Competitors: List of competing companies and more

How to Build a Crunchbase Scraper in Python

In this tutorial section, you will learn how to create a Crunchbase scraper using Python. The objective is to develop a script that can automatically gather data from the Bright Data Crunchbase page:

Bright Data's page on Crunchbase

Follow the steps below to see how to scrape Crunchbase with Python! 

Step #1: Create a Python Project

First, make sure you have Python 3+ installed on your machine. Otherwise, download it from the official site and follow the instructions.

Create a directory for your Python Crunchbase scraper:

mkdir crunchbase-scraper

The crunchbase-scraper folder will contain your scraping bot.

Open the project folder in your favorite Python IDE, such as PyCharm Community Edition or Visual Studio Code with the Python extension.

Next, create a scraper.py file inside the project folder. That file will contain the Crunchbase scraping logic.

Now, initialize a Python virtual environment. If you are a macOS or Linux user, execute:

python3 -m venv env

Equivalently, on Windows, run:

python -m venv env

This will add an env directory to your project.

Right now, your project should have the following structure: 

The structure of your Crunchbase scraper

Activate the virtual environment with this command:

source env/bin/activate

Or, on Windows:

env\Scripts\activate

Great! You now have a Python project where you can install local dependencies. 

Keep in mind that you can launch your script with:

python3 scraper.py

Or, on Windows:

python scraper.py

Step #2: Determine and Install the Scraping Libraries

You now need to find out which scraping libraries are best suited for extracting data from Crunchbase. Start by making a GET HTTP request to the target webpage using a desktop HTTP client. Here is the result you will get:

The result of the website on the desktop HTTP client

As you can see, Crunchbase blocks your request—even if you use realistic browser headers. In other words, you will need a browser automation tool to effectively scrape Crunchbase. Find out more in our article on the best headless browsers.
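You can reproduce this check directly in Python with the requests library (installed via pip install requests). Here is a minimal sketch, where the headers are illustrative and the exact response may vary:

import requests

# mimic a real browser with common request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("https://www.crunchbase.com/organization/brightdata", headers=headers)

# Crunchbase typically answers plain HTTP clients with an error page
print(response.status_code)  # e.g., 403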

For Python, Selenium is one of the most popular headless browser automation tools. In detail, it allows you to instruct a browser to perform specific interactions and scrape data from dynamic pages.

To install Selenium, use the selenium pip package. In an activated Python virtual environment, run the following command:

pip install -U selenium

Then, import Selenium in your scraper.py file with the following line:

from selenium import webdriver

Wonderful! You now have everything you need to perform web scraping on Crunchbase.

Step #3: Visit the Target Page

Initialize a Chrome WebDriver instance and use the get() method to instruct the controlled browser to visit the desired page:

driver = webdriver.Chrome()

url = "https://www.crunchbase.com/organization/brightdata"

driver.get(url)

Then, do not forget to close the WebDriver and release the browser resources with:

driver.quit()
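As an optional improvement, you can wrap the scraping logic in a try ... finally block so that the browser is released even if an exception occurs mid-script. A small sketch (the rest of this tutorial keeps the linear structure for readability):

driver = webdriver.Chrome()

try:
    # navigate and scrape...
    driver.get("https://www.crunchbase.com/organization/brightdata")
finally:
    # runs even if the scraping logic raises an exception
    driver.quit()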

Currently, your Crunchbase scraper script will contain:

from selenium import webdriver

# initialize the driver to control a Chrome instance

# in headed mode

driver = webdriver.Chrome()

# navigate to the desired Crunchbase page

url = "https://www.crunchbase.com/organization/brightdata"

driver.get(url)

# scraping logic...

# close the driver and release the browser resources

driver.quit()

If you run it, you will see the following page for a split second before the script terminates:

The page you will see before the script terminates

The “Chrome is being controlled by automated test software” message signals that Selenium is controlling Chrome as intended.

Usually, browsers in Selenium scraping scripts are launched in headless mode to save resources. Unfortunately, Crunchbase has an advanced anti-bot detection system that blocks headless browsers. Thus, you need to keep the browser in headed mode. Alternatively, you can try using Playwright Stealth to bypass these detection mechanisms.
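If you want to experiment with reducing Chrome’s automation signals from Selenium itself, you can pass a couple of commonly used options. This is only a mitigation sketch: it may lower, but will not eliminate, the chance of detection:

from selenium import webdriver

options = webdriver.ChromeOptions()
# drop the switch that triggers the "controlled by automated test software" infobar
options.add_experimental_option("excludeSwitches", ["enable-automation"])
# hide the navigator.webdriver flag exposed by automated Chrome
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)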

Step #4: Handle the Cookie Popup

If you are a European user, the page will show the following cookie popup after a few seconds:

Cookie popup on Crunchbase

If you do not click the “Accept All” button, interacting with the page is not possible. Inspect the button:

Inspecting the "Accept All" button

Note that you can select it with the #onetrust-accept-btn-handler CSS selector.

Now, write a function that waits up to 60 seconds for the “Accept All” button to appear on the page and become clickable, and then clicks it:

def handle_cookie_popup(driver, seconds=60):

  try:

    # wait for the given number of seconds for the "Accept All"

    # button of the cookie popup to appear on the page

    accept_button = WebDriverWait(driver, seconds).until(

      EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))

    )

    # click the button via JavaScript to avoid

    # ElementClickInterceptedException errors

    driver.execute_script("arguments[0].click();", accept_button)

    print("'Accept All' button clicked")

  except:

    print("'Accept All' button not found within {seconds} seconds")

Note that:

  1. The try ... except block is required because the cookie popup may not be on the page. In that case, WebDriverWait will raise a TimeoutException, which will be caught by except.
  2. “Accept All” is clicked via JavaScript and not through the click() method. The reason is that the HTML button appears slowly with a fade-in animation. So, if you try to click it with click(), you may get an ElementClickInterceptedException.

To work, the above function requires the following imports:

from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.support import expected_conditions as EC

from selenium.webdriver.common.by import By

You can now handle the cookie popup by calling:

handle_cookie_popup(driver)

Fantastic! Get ready to start scraping data on the page.

Step #5: Scrape the About Information

The first piece of information to scrape in the “Summary” card is the “About” description of the company:

The summary card of the company

Inspect the “About” HTML element:

Inspecting the HTML of the "about" element

Note that you can select it with the CSS selector below:

profile-section description-card

Use the find_element() method to apply the CSS selector on the page. Then, extract the text inside the node with the text attribute:

about_node = driver.find_element(By.CSS_SELECTOR, "profile-section description-card")

about = about_node.text

The about variable will now contain:

"The World's #1 Web Data Platform"

Here we go!
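Since the page renders its content dynamically, find_element() may run before the card exists in the DOM. If you hit a NoSuchElementException, a more robust variant is to wait for the element explicitly, reusing the imports from the previous step (same selector, with an assumed 30-second timeout):

about_node = WebDriverWait(driver, 30).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "profile-section description-card"))
)
about = about_node.text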

Step #6: Inspect the Page Structure

Now, focus on the information contained in the “Details” card on the page:

The details card on the company page

If you inspect this section, you will notice that there is not an easy way to select the HTML elements to scrape data from:

Inspecting the details card

Most of these nodes have random HTML attributes that are likely generated at build time. These attributes change after each deployment, so you cannot rely on them for node selection. Additionally, many of these elements are not marked with unique classes or IDs.

An effective approach for selecting the elements of interest is to focus on their labels. For example, you can select the fields-card node containing the industries information by identifying which fields-card has a label-with-info node that contains the “Industries” string.

This technique will be used to scrape data from this section. So, it makes sense to centralize the logic in a function:

def find_parent_node_based_on_child_node_text(parent_nodes_selector, child_node_selector, text):

  # select all parent nodes

  parent_nodes = driver.find_elements(By.CSS_SELECTOR, parent_nodes_selector)

  # iterate through the parent nodes to find the one

  # whose specific child node contains the desired text

  for parent_node in parent_nodes:

    try:

      # get the specific child node within the current parent node

      child_node = parent_node.find_element(By.CSS_SELECTOR, child_node_selector)

      # check if it contains the desired text

      if text.upper() in child_node.text.upper():

          return parent_node

    except:

      continue

  return None

Use the above function to select the “Industries” fields-card node with:

industries_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Industries")
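As an aside, the same “parent by child text” lookup can be expressed as a single XPath query. Here is an equivalent sketch, assuming the same element structure (note that, unlike the Python helper, contains() is case-sensitive):

industries_parent_node = driver.find_element(
    By.XPATH,
    "//fields-card//li[.//label-with-info[contains(., 'Industries')]]"
)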

Terrific! Scraping Crunchbase will now be much easier.

Step #7: Scrape Company Details

Inspect the “Industries” node:

Inspecting the industry node

It stores the industries in which the company operates in chips-container a nodes. Select them all, iterate over them, and extract the text from each:

industries_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Industries")

industries_nodes = industries_parent_node.find_elements(By.CSS_SELECTOR, "chips-container a")

industries = []

for industry_node in industries_nodes:

  industries.append(industry_node.text)

Now, focus on the “Founded Date” element:

The "founded date" element

In this case, the scraping logic is simpler, as you only need to extract the text from the field-formatter element inside the parent fields-card li node:

founded_date_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Founded Date")

founded_date_node = founded_date_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

founded_date = founded_date_node.text

The same logic can be applied to most of the other company details elements:

company_type_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Company Type")

company_type_node = company_type_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

company_type = company_type_node.text

operating_status_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Operating Status")

operating_status_node = operating_status_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

operating_status = operating_status_node.text

headquarters_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Headquarters Regions")

headquarters_node = headquarters_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

headquarters = headquarters_node.text

legal_name_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Legal Name")

legal_name_node = legal_name_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

legal_name = legal_name_node.text

contact_email_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Contact Email")

contact_email_node = contact_email_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

contact_email = contact_email_node.text

phone_number_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Phone Number")

phone_number_node = phone_number_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

phone_number = phone_number_node.text
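Since the same three lines repeat for each field, you could optionally condense them into a small utility. Below is a sketch built on the same selectors; the extract_field name is ours:

def extract_field(label):
    # locate the "fields-card li" element whose label matches, then read its value
    parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", label)
    if parent_node is None:
        return None
    return parent_node.find_element(By.CSS_SELECTOR, "field-formatter").text

company_type = extract_field("Company Type")
operating_status = extract_field("Operating Status")
headquarters = extract_field("Headquarters Regions")
legal_name = extract_field("Legal Name")
contact_email = extract_field("Contact Email")
phone_number = extract_field("Phone Number")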

Another node that requires special attention is the “Founders” element:

The founders element

In this case, you need to iterate over identifier-multi-formatter a nodes and extract data from them:

founders_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Founders")

founders_nodes = founders_parent_node.find_elements(By.CSS_SELECTOR, "identifier-multi-formatter a")

founders = []

for founders_node in founders_nodes:

  founders.append(founders_node.text)

Finally, take a look at the description node at the end of the “Details” section:

description node at the end of the “Details” section

Scrape this data with:

description_node = driver.find_element(By.CSS_SELECTOR, "section-card description-card")

description = description_node.text

Amazing! Your Crunchbase scraper is almost complete.

Step #8: Scrape the Products and Services Table

Other information worth collecting is the list of products and services offered by the company:

list of products and services offered by the scraped company

Select the “Products and Services” section using the function defined earlier:

products_parent_node = find_parent_node_based_on_child_node_text("profile-section", ".section-title", "Products and Services")

Then, select the table rows and scrape data from them:

products_table_rows = products_parent_node.find_elements(By.CSS_SELECTOR, "table tbody tr")

products = []

for row in products_table_rows:

  # extract the name and description from each row's columns

  name = row.find_element(By.CSS_SELECTOR, "td:nth-child(1)").text

  # a dedicated variable name avoids shadowing the company description scraped earlier
  product_description = row.find_element(By.CSS_SELECTOR, "td:nth-child(2)").text

  product = {

    "name": name,

    "description": description

  }

  products.append(product)
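Keep in mind that find_parent_node_based_on_child_node_text() returns None when a section is missing, and not every company profile has a “Products and Services” card. A small defensive guard avoids an AttributeError in that case:

products = []

# scrape the table only if the section is actually on the page
if products_parent_node is not None:
    products_table_rows = products_parent_node.find_elements(By.CSS_SELECTOR, "table tbody tr")
    for row in products_table_rows:
        name = row.find_element(By.CSS_SELECTOR, "td:nth-child(1)").text
        product_description = row.find_element(By.CSS_SELECTOR, "td:nth-child(2)").text
        products.append({"name": name, "description": product_description})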

Impressive! The Crunchbase scraping logic is now complete.

Step #9: Export the Scraped Data 

Populate a company dictionary with the scraped data:

company = {

  "about": about,

  "industries": industries,

  "founded_date": founded_date,

  "company_type": company_type,

  "operating_status": operating_status,

  "headquarters": headquarters,

  "founders": founders,

  "email": contact_email,

  "phone": phone_number,

  "description": description,

  "products": products

}

Next, export it to a company.json file:

with open("company.json", "w") as json_file:

  json.dump(company, json_file, indent=4)

First, open() creates a company.json output file. Then, json.dump() transforms company into its JSON representation and writes it to the output file.

Remember to import json from the Python standard library:

import json
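If you prefer a tabular output for the products list, you could also export it with the standard csv module. An optional sketch (the products.csv file name is ours):

import csv

with open("products.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["name", "description"])
    writer.writeheader()
    writer.writerows(products)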

Step #10: Put It All Together

Here is the final scraper.py file:

from selenium import webdriver

from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.support import expected_conditions as EC

from selenium.webdriver.common.by import By

import json

def find_parent_node_based_on_child_node_text(parent_nodes_selector, child_node_selector, text):

  # select all parent nodes

  parent_nodes = driver.find_elements(By.CSS_SELECTOR, parent_nodes_selector)

  # iterate through the parent nodes to find the one

  # whose specific child node contains the desired text

  for parent_node in parent_nodes:

    try:

      # get the specific child node within the current parent node

      child_node = parent_node.find_element(By.CSS_SELECTOR, child_node_selector)

      # check if it contains the desired text

      if text.upper() in child_node.text.upper():

          return parent_node

    except:

      continue

  return None

def handle_cookie_popup(driver, seconds=60):

  try:

    # wait for the given number of seconds for the "Accept All"

    # button of the cookie popup to appear on the page

    accept_button = WebDriverWait(driver, seconds).until(

      EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))

    )

    # click the button via JavaScript to avoid

    # ElementClickInterceptedException errors

    driver.execute_script("arguments[0].click();", accept_button)

    print("'Accept All' button clicked")

  except:

    print("'Accept All' button not found within {seconds} seconds")

# initialize the driver to control a Chrome instance

# in headed mode

driver = webdriver.Chrome()

# navigate to the desired Crunchbase page

url = "https://www.crunchbase.com/organization/brightdata"

driver.get(url)

# handle the cookie popup, if present

handle_cookie_popup(driver)

# scraping logic

about_node = driver.find_element(By.CSS_SELECTOR, "profile-section description-card")

about = about_node.text

industries_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Industries")

industries_nodes = industries_parent_node.find_elements(By.CSS_SELECTOR, "chips-container a")

industries = []

for industry_node in industries_nodes:

  industries.append(industry_node.text)

founded_date_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Founded Date")

founded_date_node = founded_date_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

founded_date = founded_date_node.text

company_type_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Company Type")

company_type_node = company_type_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

company_type = company_type_node.text

operating_status_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Operating Status")

operating_status_node = operating_status_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

operating_status = operating_status_node.text

headquarters_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Headquarters Regions")

headquarters_node = headquarters_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

headquarters = headquarters_node.text

founders_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Founders")

founders_nodes = founders_parent_node.find_elements(By.CSS_SELECTOR, "identifier-multi-formatter a")

founders = []

for founders_node in founders_nodes:

  founders.append(founders_node.text)

legal_name_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Legal Name")

legal_name_node = legal_name_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

legal_name = legal_name_node.text

contact_email_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Contact Email")

contact_email_node = contact_email_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

contact_email = contact_email_node.text

phone_number_parent_node = find_parent_node_based_on_child_node_text("fields-card li", "label-with-info", "Phone Number")

phone_number_node = phone_number_parent_node.find_element(By.CSS_SELECTOR, "field-formatter")

phone_number = phone_number_node.text

description_node = driver.find_element(By.CSS_SELECTOR, "section-card description-card")

description = description_node.text

products_parent_node = find_parent_node_based_on_child_node_text("profile-section", ".section-title", "Products and Services")

products_table_rows = products_parent_node.find_elements(By.CSS_SELECTOR, "table tbody tr")

# scrape the product table

products = []

for row in products_table_rows:

  # extract the name and description from each row's columns

  name = row.find_element(By.CSS_SELECTOR, "td:nth-child(1)").text

  # a dedicated variable name avoids shadowing the company description scraped above
  product_description = row.find_element(By.CSS_SELECTOR, "td:nth-child(2)").text

  product = {

    "name": name,

    "description": description

  }

  products.append(product)

# populate a dictionary with the scraped data

company = {

  "about": about,

  "industries": industries,

  "founded_date": founded_date,

  "company_type": company_type,

  "operating_status": operating_status,

  "headquarters": headquarters,

  "founders": founders,

  "email": contact_email,

  "phone": phone_number,

  "description": description,

  "products": products

}

# export the scraped data to a JSON file

with open("company.json", "w") as json_file:

  json.dump(company, json_file, indent=4)

# close the driver and release the browser resources

driver.quit()

In just over 100 lines of code, you have built a Crunchbase scraper in Python!

Launch the script with the following command:

python3 scraper.py

Or, on Windows:

python scraper.py

A company.json file will appear in your project’s folder. Open it and you will see:

{

    "about": "The World's #1 Web Data Platform",

    "industries": [

        "Business Intelligence",

        "Cloud Data Services",

        "Computer",

        "Data Collection and Labeling",

        "Information Technology",

        "IT Infrastructure",

        "Network Security",

        "SaaS",

        "Software"

    ],

    "founded_date": "2014",

    "company_type": "For Profit",

    "operating_status": "Active",

    "headquarters": "Greater New York Area, East Coast, Northeastern US",

    "founders": [

        "Derry Shribman",

        "Ofer Vilenski"

    ],

    "email": "[email protected]",

    "phone": "(888) 538-9204",

    "description": "Proxies that hide your location and IP address, allowing access to public web content anonymously without detection or blocking.",

    "products": [

        {

            "name": "Residential Proxies",

            "description": "A network of over 72 million real residential IPs from 195 countries, allowing access to any website content while avoiding IP bans and CAPTCHAs."

        },

        {

            "name": "Datacenter Proxies",

            "description": "A network of 770,000+ datacenter IPs offering global coverage and the ability to target specific countries and cities for reliable data collection."

        },

        {

            "name": "Mobile Proxies",

            "description": "A network of 7 million+ real 3G/4G mobile IPs from around the world, enabling users to see the web as real mobile users and bypass IP location blocks and CAPTCHAs."

        },

        {

            "name": "ISP Proxies",

            "description": "700,000+ static residential IPs assigned by ISPs, providing long sessions and exclusive use for as long as needed."

        },

        {

            "name": "Rotating Proxies",

            "description": "Proxies that constantly replace your IP address to avoid detection and blocking, with 99.99% uptime and easy management through a Proxy Manager."

        },

        {

            "name": "Anonymous Proxies",

            "description": "Proxies that hide your location and IP address, allowing access to public web content anonymously without detection or blocking."

        }

    ]

}

That is the data available on the Crunchbase company page for Bright Data.

Et voilà! You just learned how to do web scraping on Crunchbase using Python.

Unlocking Crunchbase Data with Ease

Crunchbase provides a wealth of valuable data but also takes extensive measures to protect it from scrapers and automated bots. While interacting with the site using a headless browser or performing certain actions, you may encounter 403 Forbidden pages or CAPTCHAs.

As a first step, you can refer to our guide on how to bypass CAPTCHAs in Python. However, Crunchbase employs additional advanced anti-scraping solutions that could still lead to blocks.

Without the right tools, scraping Crunchbase can quickly become a slow and frustrating experience. The best solution is Bright Data’s dedicated Crunchbase Scraper API. Retrieve data from Crunchbase without getting blocked!

Conclusion

In this step-by-step tutorial, you learned what a Crunchbase scraper is and the types of data it can retrieve. You also saw how to build a Python script to scrape Crunchbase for company overview data, which required only around 100 lines of code.

The problem is that Crunchbase adopts strict measures against bots and automated scripts. CAPTCHAs, browser fingerprinting, and IP bans are just a few of the defenses used to prevent scraping. Forget about all those challenges with our Crunchbase Scraper API.

If web scraping is not for you but you are still interested in Crunchbase data, explore our Crunchbase datasets!

Talk to one of our experts to find out which of Bright Data’s solutions best suits your needs.
