Only in Mage Pro. Try our fully managed solution to access this advanced feature.
How to Use PySpark in Mage Pro
Follow these steps to run PySpark code in Mage Pro:
- Create a batch pipeline in the Mage UI.
- Add a block of type: Data Loader, Transformer, Data Exporter, or Custom.
- In your block, write PySpark code using the provided SparkSession.
- Install or mount any required Spark JARs, such as those for Iceberg or cloud storage access.
Example Pipeline
Create a standard batch pipeline and configure the following settings in the pipeline's metadata.yaml file to ensure PySpark works properly:
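The exact settings depend on your project, but the configuration centers on a spark_config section. The following is a sketch with illustrative values (the key names follow Mage's spark_config convention; the app name, master URL, and JAR path are assumptions you should replace):

```yaml
spark_config:
  app_name: 'my_pyspark_pipeline'      # illustrative name
  spark_master: 'local[*]'             # replace with your cluster URL in production
  spark_jars:
    - '/opt/spark/jars/iceberg-spark-runtime.jar'   # hypothetical JAR path
```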
Data Loader Block (PySpark)
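A minimal data loader sketch, assuming Mage injects the configured SparkSession as the `spark` keyword argument (per Mage's block conventions); the sample rows and the commented-out storage path are illustrative:

```python
try:
    from mage_ai.data_preparation.decorators import data_loader
except ImportError:  # fallback so this sketch runs outside of Mage
    def data_loader(fn):
        return fn


@data_loader
def load_data(*args, **kwargs):
    # Mage Pro injects the configured SparkSession via kwargs.
    spark = kwargs.get('spark')
    # Illustrative in-memory data; in practice you would read from storage,
    # e.g. spark.read.parquet('gs://your-bucket/input/').
    return spark.createDataFrame(
        [(1, 'alpha'), (2, 'beta')],
        ['id', 'name'],
    )
```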
Data Exporter Block
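A matching data exporter sketch; the GCS bucket path is a hypothetical placeholder:

```python
try:
    from mage_ai.data_preparation.decorators import data_exporter
except ImportError:  # fallback so this sketch runs outside of Mage
    def data_exporter(fn):
        return fn


@data_exporter
def export_data(df, *args, **kwargs):
    # df is the Spark DataFrame returned by the upstream block.
    # The bucket path below is a hypothetical example.
    (
        df.write
        .mode('overwrite')
        .parquet('gs://your-bucket/output/events/')
    )
```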
Connecting to Spark on Kubernetes
Mage Pro can connect to an external Spark cluster running on Kubernetes. This allows you to leverage existing Spark infrastructure or run Spark jobs on a dedicated Kubernetes cluster.

Option 1: Configure via metadata.yaml (Recommended)
Configure Spark connection settings in your project's metadata.yaml file:
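For example, a sketch pointing spark_master at a Kubernetes-native Spark master (the key name follows Mage's spark_config convention; both master URLs match the values used elsewhere on this page):

```yaml
spark_config:
  spark_master: 'k8s://https://kubernetes.default.svc:443'
  # Or, for a standalone Spark cluster:
  # spark_master: 'spark://spark-master-service:7077'
```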
Option 2: Configure via Environment Variable
Alternatively, you can set the SPARK_MASTER_HOST environment variable in your Mage Pro workspace configuration:
- For Kubernetes native mode: SPARK_MASTER_HOST=k8s://https://kubernetes.default.svc:443
- For a standalone Spark cluster: SPARK_MASTER_HOST=spark://spark-master-service:7077
Setting Up Spark on Kubernetes
If you need to deploy Spark on Kubernetes first, you can use Helm. Once the cluster is running, point the spark_master configuration at the master service it creates. For example:
- Service name: my-spark-spark-master-svc
- Port: 7077
- Configuration: spark_master: 'spark://my-spark-spark-master-svc:7077'
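As a sketch, deploying the Bitnami Spark chart with the release name my-spark (an assumption; any chart that exposes a master service works) produces a master service matching the name above:

```shell
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-spark bitnami/spark
```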
Using Spark in Your Code
Once configured, you can use Spark in your pipeline blocks.

Benefits of Running PySpark in Mage Pro
Mage Pro handles all the infrastructure so you can focus on your PySpark code:
- ⚙️ Distributed execution with automatic pod scheduling and resource allocation
- ☁️ Seamless cloud integration with GCS, S3, and service account/IAM-based authentication
- 🧩 Support for Spark JARs and connectors like Apache Iceberg, GCS connectors, Delta Lake, and more
- 📈 Built-in observability, with access to logs, resource usage, and block-level monitoring in the Mage UI
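As an illustration of using Spark in a pipeline block, a transformer might operate on the Spark DataFrame passed from the upstream block (a sketch; the column names are invented):

```python
try:
    from mage_ai.data_preparation.decorators import transformer
except ImportError:  # fallback so this sketch runs outside of Mage
    def transformer(fn):
        return fn


@transformer
def transform(df, *args, **kwargs):
    # df is a Spark DataFrame from the upstream block; keep rows with a
    # positive amount and project two columns (names are illustrative).
    return df.filter(df.amount > 0).select('id', 'amount')
```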
Notes
- You can customize the SparkSession in any block using .builder.config(...) to tune performance or integrate external tools.
- Cloud storage credentials (e.g., a GCP service account key or AWS credentials) must be mounted and accessible inside the Mage Pro cluster.
- For advanced use cases (e.g., Apache Iceberg), see the Iceberg + PySpark guide.
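As a sketch of the SparkSession customization mentioned above, one pattern is to keep settings in a dict and chain them onto the builder; the values are illustrative, though both keys are standard Spark options:

```python
# Collect Spark settings in a dict, then chain them onto SparkSession.builder
# inside your block. The option values below are illustrative.
session_conf = {
    'spark.sql.shuffle.partitions': '64',  # fewer shuffle partitions for small data
    'spark.executor.memory': '2g',         # per-executor memory
}

def apply_conf(builder, conf):
    """Chain .config(key, value) calls onto a SparkSession builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder

# Inside a Mage block you would then do:
#   spark = apply_conf(SparkSession.builder.appName('my_app'), session_conf).getOrCreate()
```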