Querying Bright Data Datasets via Databricks API, SQL Connector, and CLI

Discover how to easily query Bright Data’s datasets within Databricks using REST API, CLI, or SQL Connector. Step-by-step setup and code included!

In this article, you will discover:

  • Which products Bright Data provides on Databricks.
  • How to set up a Databricks account and retrieve all the required credentials for programmatic data retrieval and exploration.
  • How to query a Bright Data dataset using the following Databricks tools:
    • REST API
    • CLI
    • SQL Connector

Let’s dive in!

Bright Data’s Data Products on Databricks

Databricks is an open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale. Through the Databricks Marketplace, you can find data products from multiple providers, which is why it is considered one of the best data marketplaces.

Bright Data has recently joined Databricks as a data products provider, already offering over 40 products:

These solutions include B2B datasets, company datasets, financial datasets, real estate datasets, and many others. On top of that, you also have access to more general web data retrieval and web scraping solutions through Bright Data’s infrastructure, such as Scraping Browser and the Web Scraper API.

In this tutorial, you will learn how to programmatically query data from one of these Bright Data datasets using the Databricks API, CLI, and dedicated SQL Connector library. Let’s jump into it!

Getting Started with Databricks

To query Bright Data datasets from Databricks via API or CLI, you first need to set up a few things. Follow the steps below to configure your Databricks account and retrieve all the required credentials for Bright Data dataset access and integration.

At the end of this section, you will have:

  • A configured Databricks account
  • A Databricks access token
  • A Databricks warehouse ID
  • A Databricks host string
  • Access to one or more Bright Data datasets in your Databricks account

Prerequisites

First, make sure you have a Databricks account (a free account is enough). If you do not have one, create an account. Otherwise, simply log in.

Configure Your Databricks Access Token

To authorize access to Databricks resources, you need an access token. Follow the instructions below to set one up.

In your Databricks dashboard, click your profile image and select the “Settings” option:

Selecting the “Settings” option

On the “Settings” page, select the “Developer” option, then click the “Manage” button in the “Access tokens” section:

Clicking the “Manage” button in the “Access tokens” section

On the “Access tokens” page, click “Generate new token” and follow the instructions in the modal:

Pressing the “Generate new token” button

You will receive a Databricks API access token. Store it in a safe place, as you will need it soon.

Retrieve Your Databricks Warehouse ID

Another piece of information you need to programmatically call the API or query the datasets via CLI is your Databricks warehouse ID. To retrieve it, select the “SQL Warehouses” option in the menu:

Selecting the “SQL Warehouses” option

Click the available warehouse (in this example, “Serverless Starter Warehouse”) and open the “Overview” tab:

The “Overview” tab of your selected Databricks warehouse

In the “Name” section, you will see your Databricks warehouse ID (in parentheses, after ID:). Copy it and store it safely, as you will need it shortly.

Find Your Databricks Host

To connect to any Databricks compute resource, you need to specify your Databricks hostname. This corresponds to the base URL associated with your Databricks account and has a format like:

  https://<random-string>.cloud.databricks.com

You can find this information directly by copying it from your Databricks dashboard URL:
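Alternatively, now that you have both your host and your access token, you can retrieve warehouse information programmatically. Below is a minimal Python sketch that lists the SQL warehouses in your workspace via the REST API’s /api/2.0/sql/warehouses endpoint (the placeholder credentials are yours to fill in):

  import requests

  databricks_host = "<YOUR_DATABRICKS_HOST>"
  databricks_access_token = "<YOUR_DATABRICKS_ACCESS_TOKEN>"

  # List the SQL warehouses in the workspace and print their IDs and names
  response = requests.get(
      f"{databricks_host}/api/2.0/sql/warehouses",
      headers={"Authorization": f"Bearer {databricks_access_token}"}
  )
  response.raise_for_status()

  for warehouse in response.json().get("warehouses", []):
      print(warehouse["id"], "-", warehouse["name"])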

Get Access to Bright Data Datasets

Now you need to add one or more Bright Data datasets to your Databricks account so that you can query them via API, CLI, or SQL Connector.

Go to the “Marketplace” page, click the settings button on the left, and select “Bright Data” as the only provider you are interested in:

Selecting the “Bright Data” provider

This will filter the available data products to only those provided by Bright Data and accessible via Databricks.

For this example, assume you are interested in the “Zillow Properties Information Dataset”:

The Bright Data “Zillow Properties Information Dataset” product on Databricks

Click on the dataset card, and on the “Zillow Properties Information Dataset” page, press “Get instant access” to add it to your Databricks account:

Pressing the “Get instant access” button

The dataset will be added to your account, and you will now be able to query it via Databricks SQL. If you are wondering where that data comes from, the answer is Bright Data’s Zillow Datasets.

Verify that by opening the “SQL Editor” page and querying the dataset with a SQL statement like this:

  SELECT *
  FROM bright_data_zillow_properties_information_dataset.datasets.zillow_properties
  WHERE state LIKE 'NY' AND homestatus LIKE 'FOR_SALE'
  LIMIT 10;

The result should be something like:

The result produced by the SQL query

Great! You have successfully added the chosen Bright Data dataset and made it queryable via Databricks. You can follow the same steps to add other Bright Data datasets.

In the next sections, you will learn how to query this dataset:

  • Via the Databricks REST API
  • Through the Databricks CLI
  • With the Databricks SQL Connector for Python

How to Query a Bright Data Dataset via the Databricks REST API

Databricks exposes some of its features via a REST API, including the ability to query datasets available in your account. Follow the steps below to see how to programmatically query the “Zillow Properties Information Dataset” provided by Bright Data.

Note: The code below is written in Python, but it can be easily adapted to other programming languages or called directly in Bash via cURL.

Step #1: Install the Required Libraries

To run SQL queries on remote Databricks warehouses, the REST API endpoint to use is /api/2.0/sql/statements. You can call it via a POST request using any HTTP client. In this example, we will use the Python Requests library.

Install it with:

  pip install requests

Next, import it in your script, along with Python’s built-in json module (you will use it to serialize the request payload):

  import requests
  import json

Learn more about it in our dedicated guide on Python Requests.

Step #2: Prepare Your Databricks Credentials and Secrets

To call the Databricks REST API endpoint /api/2.0/sql/statements using an HTTP client, you need to specify:

  • Your Databricks access token: For authentication.
  • Your Databricks host: To build the complete API URL.
  • Your Databricks warehouse ID: To query the correct table in the correct warehouse.

Add the secrets you retrieved earlier to your script like this:

  databricks_access_token = "<YOUR_DATABRICKS_ACCESS_TOKEN>"
  databricks_warehouse_id = "<YOUR_DATABRICKS_WAREHOUSE_ID>"
  databricks_host = "<YOUR_DATABRICKS_HOST>"

Tip: In production, avoid hard-coding these secrets in your script. Instead, consider storing those credentials in environment variables and loading them using python-dotenv for better security.
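For instance, here is a minimal sketch of that approach with python-dotenv (install it with pip install python-dotenv), assuming a local .env file that defines the hypothetical DATABRICKS_ACCESS_TOKEN, DATABRICKS_WAREHOUSE_ID, and DATABRICKS_HOST variables:

  import os
  from dotenv import load_dotenv

  # Load the variables defined in the local .env file into the environment
  load_dotenv()

  # Read the credentials from the environment instead of hard-coding them
  databricks_access_token = os.environ["DATABRICKS_ACCESS_TOKEN"]
  databricks_warehouse_id = os.environ["DATABRICKS_WAREHOUSE_ID"]
  databricks_host = os.environ["DATABRICKS_HOST"]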

Step #3: Call the SQL Statement Execution API

Make a POST HTTP call to the /api/2.0/sql/statements endpoint with the appropriate headers and body using Requests:

  # The parametrized SQL query to run on the given dataset
  sql_query = """
  SELECT *
  FROM bright_data_zillow_properties_information_dataset.datasets.zillow_properties
  WHERE state LIKE :state AND homestatus LIKE :homestatus
  LIMIT :row_limit;
  """

  # The parameters to populate the SQL query
  parameters = [
      {"name": "state", "value": "NY", "type": "STRING"},
      {"name": "homestatus", "value": "FOR_SALE", "type": "STRING"},
      {"name": "row_limit", "value": "10", "type": "INT"}
  ]

  # Make the POST request and query the dataset
  headers = {
      "Authorization": f"Bearer {databricks_access_token}",  # For authenticating in Databricks
      "Content-Type": "application/json"
  }

  payload = {
      "statement": sql_query,
      "warehouse_id": databricks_warehouse_id,
      "parameters": parameters
  }

  response = requests.post(
      f"{databricks_host}/api/2.0/sql/statements",
      headers=headers,
      data=json.dumps(payload)
  )

As you can see, the snippet above relies on a prepared SQL statement. As stressed in the documentation, Databricks strongly recommends using parameterized queries as a best practice for your SQL statements, since they keep user-provided values out of the SQL text and protect against SQL injection.

In other words, running the script above is equivalent to executing the following query on the bright_data_zillow_properties_information_dataset.datasets.zillow_properties table, just as we did earlier:

  SELECT *
  FROM bright_data_zillow_properties_information_dataset.datasets.zillow_properties
  WHERE state LIKE 'NY' AND homestatus LIKE 'FOR_SALE'
  LIMIT 10;

Fantastic! All that remains is to handle the output data.

Step #4: Export the Query Results

Handle the response and export the retrieved data with this Python logic:

  if response.status_code == 200:
      # Access the output JSON data
      result = response.json()

      # Export the retrieved data to a JSON file
      output_file = "zillow_properties.json"
      with open(output_file, "w", encoding="utf-8") as f:
          json.dump(result, f, indent=4)
      print(f"Query successful! Results saved to '{output_file}'")
  else:
      print(f"Error {response.status_code}: {response.text}")

If the request is successful, the snippet will create a zillow_properties.json file containing the query results.
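One caveat worth noting: the Statement Execution API is asynchronous by design, so the response JSON also carries a statement_id and a status.state field (e.g., PENDING, RUNNING, SUCCEEDED). For long-running queries, the first response may not contain the results yet, and you may need to poll the GET /api/2.0/sql/statements/{statement_id} endpoint until the statement completes. A minimal sketch, reusing the headers and credentials from the snippets above:

  import time

  statement = response.json()
  statement_id = statement["statement_id"]

  # Poll the statement status until the query leaves the in-progress states
  while statement["status"]["state"] in ("PENDING", "RUNNING"):
      time.sleep(2)
      statement = requests.get(
          f"{databricks_host}/api/2.0/sql/statements/{statement_id}",
          headers=headers
      ).json()

  print(statement["status"]["state"])  # e.g., SUCCEEDED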

Step #5: Put It All Together

Your final script should contain:

  import requests
  import json

  # Your Databricks credentials (replace them with the right values)
  databricks_access_token = "<YOUR_DATABRICKS_ACCESS_TOKEN>"
  databricks_warehouse_id = "<YOUR_DATABRICKS_WAREHOUSE_ID>"
  databricks_host = "<YOUR_DATABRICKS_HOST>"

  # The parametrized SQL query to run on the given dataset
  sql_query = """
  SELECT *
  FROM bright_data_zillow_properties_information_dataset.datasets.zillow_properties
  WHERE state LIKE :state AND homestatus LIKE :homestatus
  LIMIT :row_limit;
  """

  # The parameters to populate the SQL query
  parameters = [
      {"name": "state", "value": "NY", "type": "STRING"},
      {"name": "homestatus", "value": "FOR_SALE", "type": "STRING"},
      {"name": "row_limit", "value": "10", "type": "INT"}
  ]

  # Make the POST request and query the dataset
  headers = {
      "Authorization": f"Bearer {databricks_access_token}",  # For authenticating in Databricks
      "Content-Type": "application/json"
  }

  payload = {
      "statement": sql_query,
      "warehouse_id": databricks_warehouse_id,
      "parameters": parameters
  }

  response = requests.post(
      f"{databricks_host}/api/2.0/sql/statements",
      headers=headers,
      data=json.dumps(payload)
  )

  # Handle the response
  if response.status_code == 200:
      # Access the output JSON data
      result = response.json()

      # Export the retrieved data to a JSON file
      output_file = "zillow_properties.json"
      with open(output_file, "w", encoding="utf-8") as f:
          json.dump(result, f, indent=4)
      print(f"Query successful! Results saved to '{output_file}'")
  else:
      print(f"Error {response.status_code}: {response.text}")

Execute it, and it should produce a zillow_properties.json file in your project directory.

The output first contains the result schema, which describes the available columns. Then, in the data_array field, you can see the resulting query data, with each value serialized as a string:

A portion of the data_array field
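If you prefer to work with the rows as Python dictionaries, you can pair the column names from the schema with the values in data_array. Here is a short sketch, assuming the saved response follows the manifest.schema.columns / result.data_array layout shown above:

  import json

  with open("zillow_properties.json", "r", encoding="utf-8") as f:
      result = json.load(f)

  # Column names live in the manifest, row values in the data array
  columns = [col["name"] for col in result["manifest"]["schema"]["columns"]]
  rows = [dict(zip(columns, row)) for row in result["result"]["data_array"]]

  print(rows[0])  # The first property as a dictionary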

Mission complete! You just collected Zillow property data provided by Bright Data via the Databricks REST API.

How to Access Bright Data Datasets Using the Databricks CLI

Databricks also lets you query data in a warehouse through the Databricks CLI, which relies on the REST API under the hood. Learn how to use it!

Step #1: Install the Databricks CLI

The Databricks CLI is an open-source command-line tool that lets you interact with the Databricks platform directly from your terminal.

To install it, follow the installation guide for your operating system. If everything is set up correctly, running the databricks -v command should display something like this:

The output of the “databricks -v” command

Perfect!

Step #2: Define a Configuration Profile for Authentication

Use the Databricks CLI to create a configuration profile named DEFAULT that authenticates you with your Databricks personal access token. To do so, run the command below:

  databricks configure --profile DEFAULT

You will then be prompted to provide:

  1. Your Databricks host
  2. Your Databricks access token

Paste both values and press Enter to complete the configuration:

Setting up your Databricks configuration profile

You will now be able to authenticate CLI api commands by specifying the --profile DEFAULT option.

Step #3: Query Your Dataset

Use the following CLI command to run a parameterized query via the api post command:

  databricks api post "/api/2.0/sql/statements" \
    --profile DEFAULT \
    --json '{
      "warehouse_id": "<YOUR_DATABRICKS_WAREHOUSE_ID>",
      "statement": "SELECT * FROM bright_data_zillow_properties_information_dataset.datasets.zillow_properties WHERE state LIKE :state AND homestatus LIKE :homestatus LIMIT :row_limit",
      "parameters": [
        { "name": "state", "value": "NY", "type": "STRING" },
        { "name": "homestatus", "value": "FOR_SALE", "type": "STRING" },
        { "name": "row_limit", "value": "10", "type": "INT" }
      ]
    }' \
    > zillow_properties.json

Replace the <YOUR_DATABRICKS_WAREHOUSE_ID> placeholder with the actual ID of your Databricks SQL warehouse.

Behind the scenes, this does the same thing we did before in Python. More specifically, it makes a POST request to the Databricks REST SQL API. The result will be a zillow_properties.json file containing the same data as seen before:

The output zillow_properties.json file

How to Query a Dataset From Bright Data Through the Databricks SQL Connector

The Databricks SQL Connector is a Python library that lets you connect to Databricks clusters and SQL warehouses. In particular, it provides a simplified API for connecting to Databricks infrastructure and exploring your data.

In this guide section, you will learn how to use it to query the “Zillow Properties Information Dataset” from Bright Data.

Step #1: Install the Databricks SQL Connector for Python

The Databricks SQL Connector is available via the databricks-sql-connector Python library. Install it with:

  pip install databricks-sql-connector

Then, import it in your script with:

  from databricks import sql

Step #2: Get Started with the Databricks SQL Connector

The Databricks SQL Connector requires different credentials compared to the REST API and CLI. In detail, it needs:

  • server_hostname: Your Databricks host name (without the https:// part).
  • http_path: The HTTP path used to connect to your warehouse (typically of the form /sql/1.0/warehouses/<YOUR_DATABRICKS_WAREHOUSE_ID>).
  • access_token: Your Databricks access token.

You can find the necessary authentication values, along with a sample starter snippet, in the “Connection Details” tab of your SQL warehouse:

Clicking the Python button on the “Connection Details” of your warehouse

Press the “Python” button, and you will get:

The databricks-sql-connector sample snippet

These are all the instructions you need to get started with the databricks-sql-connector.

Step #3: Put It All Together

Adapt the code from the sample snippet shown above to your warehouse in order to run the parameterized query of interest. You should end up with a script like the following:

  from databricks import sql

  # Connect to your SQL warehouse in Databricks (replace the credentials with your values)
  connection = sql.connect(
      server_hostname = "<YOUR_DATABRICKS_HOST>",
      http_path = "<YOUR_DATABRICKS_WAREHOUSE_HTTP_PATH>",
      access_token = "<YOUR_DATABRICKS_ACCESS_TOKEN>"
  )

  # Execute the parametrized SQL query and get the results in a cursor
  cursor = connection.cursor()

  sql_query = """
  SELECT *
  FROM bright_data_zillow_properties_information_dataset.datasets.zillow_properties
  WHERE state LIKE :state AND homestatus LIKE :homestatus
  LIMIT :row_limit
  """

  params = {
      "state": "NY",
      "homestatus": "FOR_SALE",
      "row_limit": 10
  }

  # Run the query
  cursor.execute(sql_query, params)

  result = cursor.fetchall()
  # Print the first two rows of the result set
  for row in result[:2]:
      print(row)

  # Close the cursor and the SQL warehouse connection
  cursor.close()
  connection.close()

Run the script, and it will generate an output like this:

The output data for a single row

Notice that each row object is a Row instance, representing a single record from the query results. You can then process that data directly in your Python script.

Keep in mind that you can convert a Row instance into a Python dictionary with the asDict() method:

  row_data = row.asDict()
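Building on that, here is a small sketch that converts every fetched row into a dictionary and exports the result set to a JSON file (the output filename is just an example):

  import json

  # Convert each Row object into a plain dictionary
  rows = [row.asDict() for row in result]

  # Export the result set to a JSON file
  with open("zillow_properties_connector.json", "w", encoding="utf-8") as f:
      json.dump(rows, f, indent=4, default=str)  # default=str handles values that are not JSON-serializable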

Et voilà! You now know how to interact with and query your Bright Data datasets in Databricks in multiple ways.

Conclusion

In this article, you learned how to query Bright Data’s datasets from Databricks using its REST API, CLI, or dedicated SQL Connector library. As demonstrated, Databricks provides multiple ways to interact with the products offered by its data providers, which now include Bright Data.

With over 40 products available, you can explore the extensive richness of Bright Data’s datasets directly within Databricks and access their data in a variety of ways.

Create a Bright Data account for free and start experimenting with our data solutions today!

Antonello Zanini

Technical Writer

5.5 years experience

Antonello Zanini is a technical writer, editor, and software engineer with 5M+ views. Expert in technical content strategy, web development, and project management.

Expertise
Web Development | Web Scraping | AI Integration