In addition to running Spark pipelines on AWS EMR clusters and standalone Spark clusters, Mage also supports running Spark pipelines on a Databricks cluster.

Set up

Here is an overview of the steps required to use Mage with a Databricks cluster:

  1. Set up a Databricks cluster
  2. Use the Mage Databricks Docker image
  3. Configure environment variables
  4. Sample pipeline with PySpark code
  5. Verify everything worked

If you get stuck, run into problems, or just want someone to walk you through these steps, please join our Slack.

1. Set up a Databricks cluster

Set up a Databricks workspace and cluster by following the Databricks documentation.

2. Use the Mage Databricks Docker image

Contact the Mage team to update your Mage Pro cluster to use the Mage Databricks Docker image.

3. Configure environment variables

Set the following environment variables in your Mage Pro cluster to enable connectivity with your Databricks workspace:

  • DATABRICKS_HOST
    The base URL of your Databricks workspace.
    Example: https://<your-databricks-instance>.cloud.databricks.com

  • DATABRICKS_TOKEN
    A personal access token (PAT) used for authenticating with Databricks.
    You can generate this token in the Databricks UI by navigating to:
    Settings > Developer > Access Tokens.

  • DATABRICKS_CLUSTER_ID
    The unique identifier for the Databricks cluster where queries will be executed.
    You can find this in your Databricks workspace under: Compute > Clusters.
    Refer to the Databricks documentation for detailed steps on retrieving the cluster ID.
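
Once the variables are set, you can sanity-check the connection with a short Databricks Connect snippet, run from any Python shell or scratch block in your Mage Pro cluster. This is a minimal sketch, assuming the databricks-connect package is available in the image and the three DATABRICKS_* variables above are present in the environment:

import os

from databricks.connect import DatabricksSession

# Fail fast if any required variable is missing.
for name in ("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID"):
    assert os.environ.get(name), f"{name} is not set"

# With no arguments, remote() resolves the connection from the
# DATABRICKS_* environment variables.
spark = DatabricksSession.builder.remote().getOrCreate()

# Run a trivial query that round-trips to the cluster.
print(spark.range(5).count())  # Should print 5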

4. Sample pipeline with PySpark code

  1. Create a new pipeline by clicking New pipeline on the /pipelines page.
  2. Open the pipeline’s metadata.yaml file and set the following config:
    cache_block_output_in_memory: true
    run_pipeline_in_one_process: true
    
  3. Click + Data loader, then Base template (generic) to add a new data loader block.
  4. Paste the following sample code in the new data loader block:
from databricks.connect import DatabricksSession

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_data(*args, **kwargs):
    # Connect to the Databricks cluster through Databricks Connect. With no
    # arguments, remote() resolves the connection from the DATABRICKS_HOST,
    # DATABRICKS_TOKEN, and DATABRICKS_CLUSTER_ID environment variables
    # configured in step 3.
    spark = DatabricksSession.builder.remote().getOrCreate()

    # Build a small sample DataFrame on the remote cluster.
    data = [("John", 28), ("Anna", 23), ("Mike", 35)]
    columns = ["Name", "Age"]

    df = spark.createDataFrame(data, columns)
    return df


@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """
    assert output is not None, 'The output is undefined'
  5. Click the “Run code” button to run the block.
  6. Click + Data exporter, then Base template (generic) to add a new data exporter block.
  7. Paste the following sample code in the new data exporter block:
if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data(df, *args, **kwargs):
    # Append the DataFrame to a Delta table, creating the table on the
    # first run if it doesn't already exist.
    df.write.format("delta").mode("append").saveAsTable("user_table")

    return df
  8. Click the “Run code” button to run the block.

5. Verify everything worked

Check the user_table table in your Unity Catalog to verify that the data was written correctly.
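
You can also verify from within Mage by reading the table back through the same Databricks Connect session, for example in a scratch block. A minimal sketch, assuming the DATABRICKS_* environment variables from step 3 are set and the table name matches the user_table used in the exporter above:

from databricks.connect import DatabricksSession

# Reuse the environment-variable-based connection from the pipeline blocks.
spark = DatabricksSession.builder.remote().getOrCreate()

# Read back the table written by the data exporter and inspect a few rows.
df = spark.table("user_table")
df.show(5)
print(f"Row count: {df.count()}")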