Mage Pro provides built-in support for running PySpark code inside your pipelines. You can run distributed Spark jobs directly within a Mage Pro batch pipeline, with automatic resource management, cloud integration, and observability. Use PySpark in Mage to transform large datasets, connect to cloud object storage like GCS or S3, and integrate with modern data lake formats like Apache Iceberg.

How to Use PySpark in Mage Pro

Follow these steps to run PySpark code in Mage Pro:
  1. Create a batch pipeline in the Mage UI.
  2. Add a block of type: Data Loader, Transformer, Data Exporter, or Custom.
  3. In your block, write PySpark code using the provided SparkSession.
  4. Install or mount any required Spark JARs, such as those for Iceberg or cloud storage access.

Example Pipeline

Create a standard batch pipeline and configure the following settings in the pipeline’s metadata.yaml file to ensure PySpark works properly:
cache_block_output_in_memory: true
run_pipeline_in_one_process: true
These settings pass data between PySpark blocks in memory instead of serializing it to disk, and run the whole pipeline in a single process so that all blocks share the same SparkSession.

Data Loader Block (PySpark)

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test
from pyspark.sql import Row, SparkSession

@data_loader
def load_data(*args, **kwargs):
    spark = SparkSession.builder.getOrCreate()
    data = [
        Row(id=1, name="Alice", age=29, salary=50000),
        Row(id=2, name="Bob", age=35, salary=60000),
        Row(id=3, name="Charlie", age=40, salary=70000),
        Row(id=4, name="David", age=45, salary=None)  # Null value example
    ]

    df = spark.createDataFrame(data)
    df.show()
    return df
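
Transformer Block (PySpark)

Downstream blocks receive the upstream block’s output as their first argument. Below is a minimal sketch of an optional Transformer block you could add between the loader and the exporter; it follows the same decorator pattern as the other blocks and simply drops the row with the null salary.
if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

@transformer
def transform(df, *args, **kwargs):
    # Drop rows with a null salary (e.g., the "David" row from the loader above)
    return df.dropna(subset=['salary'])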

Data Exporter Block (PySpark)

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter

@data_exporter
def export_data(df, *args, **kwargs):
    df.write.csv("spark_output", header=True, mode="overwrite")
This example pipeline loads structured data into a Spark DataFrame and writes it to disk in CSV format.
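In practice, you will often export to cloud object storage instead of local disk. Below is a minimal sketch that writes Parquet to a hypothetical GCS bucket; it assumes the GCS connector JAR is available and credentials are mounted in the cluster (see the notes at the end of this page).
if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter

@data_exporter
def export_data(df, *args, **kwargs):
    # Hypothetical bucket path; requires the GCS connector JAR and mounted credentials
    df.write.parquet("gs://your-bucket/spark_output", mode="overwrite")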

Connecting to Spark on Kubernetes

Mage Pro can connect to an external Spark cluster running on Kubernetes, letting you leverage existing Spark infrastructure or run Spark jobs on a dedicated Kubernetes cluster. There are two ways to configure the connection.

Option 1: Configure via metadata.yaml

Configure the Spark connection settings in your project’s metadata.yaml file:
spark_config:
  # Application name
  app_name: 'my spark app'
  # Master URL to connect to Spark on Kubernetes
  # For Kubernetes native mode:
  spark_master: 'k8s://https://kubernetes.default.svc:443'
  # Or for standalone Spark cluster in Kubernetes:
  # spark_master: 'spark://spark-master-service:7077'
  
  # Executor environment variables
  executor_env: {}
  
  # Jar files to be uploaded to the cluster and added to the classpath
  spark_jars: []
  
  # Path where Spark is installed on worker nodes (if applicable)
  spark_home: null
  
  # Additional Spark configuration options
  others:
    spark.executor.memory: '4g'
    spark.executor.cores: '2'
    spark.kubernetes.namespace: 'default'
    spark.kubernetes.container.image: 'your-spark-image:tag'

Option 2: Configure via Environment Variable

Alternatively, you can set the SPARK_MASTER_HOST environment variable in your Mage Pro workspace configuration:
  • For Kubernetes native mode: SPARK_MASTER_HOST=k8s://https://kubernetes.default.svc:443
  • For standalone Spark cluster: SPARK_MASTER_HOST=spark://spark-master-service:7077

Setting Up Spark on Kubernetes

If you need to deploy Spark on Kubernetes first, you can use Helm:
# Add the Bitnami Helm repository
helm repo add bitnami https://charts.bitnami.com/bitnami

# Install Spark cluster
helm install my-spark bitnami/spark

# Get the Spark master service URL
kubectl get svc my-spark-spark-master-svc
Then use the service URL in your spark_master configuration. For example:
  • Service name: my-spark-spark-master-svc
  • Port: 7077
  • Configuration: spark_master: 'spark://my-spark-spark-master-svc:7077'
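Putting it together, a minimal spark_config for this standalone cluster could look like the following sketch (resource values are illustrative):
spark_config:
  app_name: 'my spark app'
  spark_master: 'spark://my-spark-spark-master-svc:7077'
  others:
    spark.executor.memory: '4g'
    spark.executor.cores: '2'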

Using Spark in Your Code

Once configured, you can use Spark in your pipeline blocks:
from pyspark.sql import SparkSession

# SparkSession will automatically use the configuration from metadata.yaml
spark = SparkSession.builder.getOrCreate()

# Your Spark operations here
df = spark.read.csv("your-data.csv")
The Spark session will automatically connect to your configured Kubernetes Spark cluster.
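As a quick sanity check, you can print the master URL the session resolved; it should match the spark_master value you configured (for example, spark://my-spark-spark-master-svc:7077):
# Prints the master URL the active session is connected to
print(spark.sparkContext.master)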

Benefits of Running PySpark in Mage Pro

Mage Pro handles all the infrastructure so you can focus on your PySpark code:
  • ⚙️ Distributed execution with automatic pod scheduling and resource allocation
  • ☁️ Seamless cloud integration with GCS, S3, and service account/IAM-based authentication
  • 🧩 Support for Spark JARs and connectors like Apache Iceberg, GCS connectors, Delta Lake, and more
  • 📈 Built-in observability, with access to logs, resource usage, and block-level monitoring in the Mage UI

Notes

  • You can customize the SparkSession in any block using .builder.config(...) to tune performance or integrate external tools; a minimal sketch follows these notes.
  • Cloud storage credentials (e.g., a GCP service account key or AWS credentials) must be mounted and accessible inside the Mage Pro cluster.
  • For advanced use cases (e.g., Apache Iceberg), see the Iceberg + PySpark guide.
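For example, here is a minimal sketch of a customized session; the specific settings and values are illustrative, not required:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('my spark app')
    # Illustrative tuning values; choose settings that fit your workload
    .config('spark.sql.shuffle.partitions', '64')
    .config('spark.executor.memory', '4g')
    .getOrCreate()
)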