This is a guide for using Spark (PySpark) with Mage in different cloud providers or a Kubernetes cluster.

You can customize the Spark session through the `spark_config` section of the project’s `metadata.yaml` file, or set the environment variable `SPARK_MASTER_HOST` to the URL of the master node of the Spark cluster in the Mage container. Then you’ll be able to connect Mage to your Spark cluster and execute PySpark code in Mage.
## Kubernetes

Here is an overview of the steps required to use Mage with Spark in a Kubernetes cluster:
1. Set up your Spark cluster in the Kubernetes cluster and note the URL of the Spark master.
2. Create a `Dockerfile` to build Mage with the Spark environment, then build the Docker image with the command `docker build -t mage_spark .`
3. Add the `SPARK_MASTER_HOST` environment variable, with the Spark master URL from step 1, to the container spec in your Kubernetes yaml file. In Mage version 0.8.83 and above, you don’t need to specify the environment variable anymore; you can customize the Spark session in the `metadata.yaml` file as stated in the Custom Spark Session section.
4. Create a standard (batch) pipeline (with the standard `python` kernel) and then write PySpark code in any blocks.

In pipeline blocks, you can access the Spark session from `kwargs` via `kwargs['spark']`. In a `Scratchpad` block, you’ll need to manually create the Spark session with code along the lines of the sketch below.
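A minimal sketch of creating the session in a scratchpad, assuming `SPARK_MASTER_HOST` is set in the container environment (falling back to local mode when it isn’t):

```python
import os

from pyspark.sql import SparkSession

# Connect to the master named by SPARK_MASTER_HOST;
# fall back to local mode when the variable is unset.
spark = (
    SparkSession
    .builder
    .master(os.getenv('SPARK_MASTER_HOST', 'local'))
    .getOrCreate()
)
```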
## Standalone Spark cluster

1. Create a `Dockerfile` to build Mage with the Spark environment, then build the Docker image with the command `docker build -t mage_spark .`
2. Start the Mage container with the `SPARK_MASTER_HOST` environment variable set to the URL of your Spark master (`demo_project` is the name of your project; you can change it to anything you want); see the sketch after this list. If you use local Spark, you can set the value of `SPARK_MASTER_HOST` to `local` or not set the environment variable at all. In Mage version 0.8.83 and above, you don’t need to specify the environment variable anymore; you can customize the Spark session in the `metadata.yaml` file as stated in the Custom Spark Session section.
3. Create a standard (batch) pipeline (with the standard `python` kernel) and then write PySpark code in any blocks. In pipeline blocks, access the Spark session from `kwargs` via `kwargs['spark']`; in a `Scratchpad` block, you’ll need to manually create the Spark session, as shown in the Kubernetes section above.
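For example, a start command along these lines; the entrypoint path and the master URL are assumptions that may vary by Mage version and cluster setup, so treat this as a sketch rather than the exact command:

```bash
docker run -it --name mage_spark \
  -e SPARK_MASTER_HOST='spark://<your-master-node>:7077' \
  -p 6789:6789 \
  -v $(pwd):/home/src \
  mage_spark /app/run_app.sh mage start demo_project
```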
## Custom Spark session

1. Build the Docker image with the command `docker build -t mage_spark .`
2. Start Mage (`demo_project` is the name of your project; you can change it to anything you want).
3. Edit the `spark_config` section in `metadata.yaml` under the project folder, and make the necessary adjustments; a sketch of the section follows this list.
4. Create a standard (batch) pipeline (with the standard `python` kernel), add a block, then write PySpark code using the Spark session via `kwargs['spark']`, e.g., the data loader sketch below. In a `Scratchpad` block, you’ll need to manually create the Spark session, as shown in the Kubernetes section above.
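A sketch of the `spark_config` section; the exact set of keys depends on your Mage version, so treat the names below (`app_name`, `spark_master`, `others`) as assumptions to verify against your generated `metadata.yaml`:

```yaml
spark_config:
  # Name reported to the Spark UI.
  app_name: 'my spark app'
  # Master URL, e.g. 'spark://host:port', 'yarn', or 'local'.
  spark_master: 'local'
  # Extra SparkConf key-value pairs (assumed key name).
  others:
    spark.executor.memory: '2g'
```

And a hypothetical data loader block that uses the injected session:

```python
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data(*args, **kwargs):
    # kwargs['spark'] is the session Mage builds from spark_config.
    spark = kwargs['spark']
    return spark.createDataFrame([(1, 'one'), (2, 'two')], ['id', 'name'])
```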
To configure the Spark session for a single pipeline instead:

1. Build the Docker image with the command `docker build -t mage_spark .`
2. Start Mage (`demo_project` is the name of your project; you can change it to anything you want).
3. Create a standard (batch) pipeline (with the standard `python` kernel), and update the `spark_config` section in the `metadata.yaml` file under the pipeline folder; it takes the same keys as the project-level sketch above.
4. Write PySpark code using the Spark session via `kwargs['spark']`, e.g., the data loader sketch above.
To use a custom Spark session that you create yourself:

1. Set `run_pipeline_in_one_process: true` in your pipeline’s `metadata.yaml`.
2. In the `spark_config` of your project’s `metadata.yaml` or pipeline’s `metadata.yaml`, set `use_custom_session` to `true` (example config after this list).
3. Create the custom Spark session in a block and set it in `kwargs['context']['spark']` (example code after this list).
4. In the blocks that run afterward, access the custom Spark session via `kwargs['spark']`.
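A sketch of the config and of a block that registers the session; the `custom_session_var_name` key is an assumption mirroring the `use_custom_session` flag above, so verify it against your Mage version:

```yaml
spark_config:
  use_custom_session: true
  # Assumed key: the name under kwargs['context'] where the session is stored.
  custom_session_var_name: 'spark'
```

```python
from pyspark.sql import SparkSession

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data(*args, **kwargs):
    # Build a session and hand it to Mage for downstream blocks.
    spark = SparkSession.builder.master('local').getOrCreate()
    kwargs['context']['spark'] = spark
    return spark.sql('select 1')
```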
## Hadoop and Yarn cluster

1. Build the Docker image with the command `docker build -t mage_hadoop .`
2. Start Mage (`demo_project` is the name of your project; you can change it to anything you want).
3. Update the `metadata.yaml` file in the main pipeline folder to include the Spark settings for Yarn; a sketch follows this list.
4. Create a `Standalone (batch)` pipeline and add a `Data loader`, then run it with the code sketched after this list.
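A sketch of the Yarn settings and a smoke-test loader; the `spark_master: 'yarn'` key follows the `spark_config` sketch earlier and is an assumption to verify against your `metadata.yaml`:

```yaml
spark_config:
  app_name: 'my spark app'
  # Submit to the Yarn resource manager instead of a standalone master.
  spark_master: 'yarn'
```

```python
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data(*args, **kwargs):
    # A trivial query to confirm the Yarn-backed session works.
    spark = kwargs['spark']
    return spark.sql('select 1')
```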
## AWS EMR

Edit the `metadata.yaml` file located at the root of your project’s directory: `demo_project/metadata.yaml` (presuming your project is named `demo_project`). Change the values for the keys mentioned in the following steps.
Change the value for the key `remote_variables_dir` to the S3 bucket you created in an earlier step. For example, if your S3 bucket is named `my-awesome-bucket`, then the value for the key `remote_variables_dir` should be `s3://my-awesome-bucket`.
You can also set the `master_security_group` and `slave_security_group` keys if you want the cluster to use specific security groups. The resulting `metadata.yaml` file could look like this:
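A sketch of the relevant section, assuming the EMR settings live under an `emr_config` key; the instance-type keys and the security-group IDs are placeholders to verify against your Mage version:

```yaml
remote_variables_dir: 's3://my-awesome-bucket'
emr_config:
  # Assumed keys for the cluster instance sizes.
  master_instance_type: 'r5.4xlarge'
  slave_instance_type: 'r5.4xlarge'
  # Security groups; must allow the Mage server to reach port 8998.
  master_security_group: 'sg-xxxxxxxxxxxx'
  slave_security_group: 'sg-yyyyyyyyyyyy'
```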
Once the configuration is saved, switch to the `pyspark` kernel. Mage will automatically create the EMR cluster when you switch to the `pyspark` kernel. The cluster is usually created and initialized within 8 to 10 minutes. Then you can select your EMR cluster in the dropdown of your `pyspark` kernel selector.
### Allow EMR connection permissions

Add an inbound rule like the following to the security group of the EMR master node, so that the Mage server can reach the Livy server on port 8998:

| Type | Protocol | Port range | Source |
| --- | --- | --- | --- |
| Custom TCP | TCP | 8998 | My IP |
If you installed Mage using `pip`, you must run the following commands in your terminal to use the `pyspark` kernel:
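A sketch of the typical setup, assuming the `pyspark` kernel is backed by sparkmagic talking to Livy; the config path and the source of `example_config.json` (the sparkmagic repository) are assumptions, so check the sparkmagic README for your environment:

```bash
# Install the sparkmagic package that provides the kernel.
pip install sparkmagic
# Seed the config directory with sparkmagic's example_config.json
# (downloaded from the sparkmagic repository), then edit the Livy
# endpoint in it to point at your EMR master node on port 8998.
mkdir -p ~/.sparkmagic
cp example_config.json ~/.sparkmagic/config.json
```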
To create and run a pipeline on the EMR cluster:

1. Create a new pipeline by clicking `File` in the top left corner of the page and then clicking `New pipeline`.
2. Change the pipeline’s kernel from `python` to `pyspark`: click the button with the green dot and the word `python` next to it, located at the top of the page on the right side of your header.
3. Click `+ Data loader`, then `Generic (no template)` to add a new data loader block, and paste the first sketch shown after this list.
4. Click `+ Data exporter`, then `Generic (no template)` to add a new data exporter block, and paste the second sketch (change `s3://bucket-name` to the bucket you created from a previous step).
5. Click `+ Data loader`, then `Generic (no template)` to add a new data loader block, and paste the third sketch (again, change `s3://bucket-name` to the bucket you created from a previous step).
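Hypothetical block bodies for the steps above, assuming the blocks receive the Spark session as `kwargs['spark']` as in the earlier sections; the column names and the `mage_demo` key prefix are illustrative:

```python
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data(*args, **kwargs):
    # Build a small DataFrame on the EMR cluster.
    spark = kwargs['spark']
    return spark.createDataFrame(
        [(1, 'apple'), (2, 'banana')],
        ['id', 'fruit'],
    )
```

```python
if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data(df, *args, **kwargs):
    # Write the upstream DataFrame to your bucket;
    # change s3://bucket-name to the bucket you created.
    (
        df.write
        .option('header', 'true')
        .mode('overwrite')
        .csv('s3://bucket-name/mage_demo')
    )
```

```python
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data_from_s3(*args, **kwargs):
    # Read the exported data back to confirm the round trip;
    # change s3://bucket-name to the bucket you created.
    spark = kwargs['spark']
    return (
        spark.read
        .option('header', 'true')
        .csv('s3://bucket-name/mage_demo')
    )
```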
If a block hangs when you run it, it could be due to a network connection issue. First, try restarting the kernel by clicking `Run` > `Restart kernel`. Also make sure the EMR cluster is accessible from the Mage server: you can verify whether the Mage server’s security group or IP is whitelisted in the EMR cluster’s security group by following the section Allow EMR connection permissions. If that doesn’t work, restart the app by stopping the Docker container and starting it again.