Create a new pipeline

Each pipeline is represented by a YAML file in a folder named pipelines/ under the Mage project directory. For example, if your project is named demo_project and your pipeline is named etl_demo, then you’ll have a folder structure that looks like this:
demo_project/
|   pipelines/
|   |   etl_demo/
|   |   |   __init__.py
|   |   |   metadata.yaml
Create a new folder in the demo_project/pipelines/ directory and name it after your pipeline. Add 2 files to this new folder:
  1. __init__.py
  2. metadata.yaml
In the metadata.yaml file, add the following content:
blocks: []
name: etl_demo
type: python
uuid: etl_demo
Change etl_demo to whatever name you’re using for your new pipeline.
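
If you’d rather script this step, here is a minimal sketch in Python (assumed to be run from inside the demo_project directory, with etl_demo as an example pipeline name) that creates the folder and both files:

import os

pipeline_name = "etl_demo"  # replace with your pipeline name
pipeline_dir = os.path.join("pipelines", pipeline_name)

# Create pipelines/<pipeline_name>/ (and pipelines/ itself if it does not exist yet)
os.makedirs(pipeline_dir, exist_ok=True)

# 1. Empty __init__.py
open(os.path.join(pipeline_dir, "__init__.py"), "w").close()

# 2. metadata.yaml with the minimal content shown above
with open(os.path.join(pipeline_dir, "metadata.yaml"), "w") as f:
    f.write(
        "blocks: []\n"
        f"name: {pipeline_name}\n"
        "type: python\n"
        f"uuid: {pipeline_name}\n"
    )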

Sample pipeline metadata content

This sample pipeline metadata.yaml will produce the following block dependencies: load_data_from_file → select_columns → export_to_file.
blocks:
- downstream_blocks:
    - select_columns
  executor_type: local_python
  language: python
  name: load_data_from_file
  type: data_loader
  upstream_blocks: []
  uuid: load_data_from_file
- downstream_blocks:
    - export_to_file
  executor_type: local_python
  language: python
  name: select_columns
  type: transformer
  upstream_blocks:
    - load_data_from_file
  uuid: select_columns
- downstream_blocks: []
  executor_type: local_python
  language: python
  name: export_to_file
  type: data_exporter
  upstream_blocks:
    - select_columns
  uuid: export_to_file
name: etl_demo
type: python
uuid: etl_demo
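
Each block in the list above also needs a code file in the project (see the block type and uuid attributes below for how file locations are derived). As a rough, illustrative sketch only (not the exact template Mage generates), demo_project/data_loaders/load_data_from_file.py could look like this, where the CSV path and the pandas usage are assumptions:

import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data(*args, **kwargs):
    # Illustrative only: load a local CSV into a DataFrame.
    # Whatever this function returns becomes the block's output,
    # which the downstream select_columns block receives as input.
    return pd.read_csv('data.csv')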

metadata.yaml sections

Pipeline attributes

blocks
array of objects
An array of blocks that are in the pipeline.
name
string
Unique name of the pipeline.
type
string enum
The type of pipeline. Currently available options are:
  • databricks
  • integration
  • pyspark
  • python (most common)
  • streaming
uuid
string
Unique identifier of the pipeline. This UUID must be unique across all pipelines.
description
string
Optional description of what the pipeline does.
executor_type
string enum
Pipeline-level executor type. Supported values:
  • ecs
  • gcp_cloud_run
  • azure_container_instance
  • k8s
  • local_python (most common)
  • pyspark
executor_count
integer
Number of concurrent executors used to run the pipeline. Used in streaming pipelines.
executor_config
object
Optional configuration specific to the selected executor type. Refer to the executor documentation for executor-specific options.
retry_config
object
Retry configuration at the pipeline level. See documentation for details.
notification_config
object
Configuration for pipeline notification messages (e.g., on failure or success). See documentation for details.
concurrency_config
object
Concurrency settings for block and pipeline run execution within the pipeline. See documentation for details.
  • block_run_limit: Maximum number of blocks that can run in parallel within a single pipeline run.
  • pipeline_run_limit: Maximum number of concurrent pipeline runs per trigger.
  • pipeline_run_limit_all_triggers: Maximum number of concurrent pipeline runs across all of the pipeline’s triggers.
  • on_pipeline_run_limit_reached: What to do when the pipeline run limit is reached (e.g., wait or skip).
cache_block_output_in_memory
boolean
Whether to cache block output in memory during execution.
run_pipeline_in_one_process
boolean
If true, runs all blocks in a single process or k8s pod.
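
To illustrate how these attributes fit together, a pipeline that sets several of the optional fields above might have a metadata.yaml like the following (only keys described in this section are used; the values are examples, not recommendations):

blocks: []
description: Example pipeline with optional pipeline-level settings.
name: etl_demo
type: python
uuid: etl_demo
executor_type: local_python
cache_block_output_in_memory: false
run_pipeline_in_one_process: false
concurrency_config:
  block_run_limit: 2
  pipeline_run_limit: 1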

Block attributes

downstream_blocks
array of strings
An array of block UUIDs that depend on this current block. These downstream blocks will have access to this current block’s data output.
executor_type
string enum
The method for running this block of code. Currently available options are:
  • ecs
  • gcp_cloud_run
  • azure_container_instance
  • k8s
  • local_python (most common)
  • pyspark
executor_config
object
Optional configuration specific to the selected executor type. Refer to the executor documentation for executor-specific options.
language
string enum
Programming language used by the block. Supported values:
  • python (most common)
  • r
  • sql
  • yaml
name
string
Unique name of the block.
type
string enum
The type of block. Currently available options are:
  • chart
  • custom (most common)
  • data_exporter
  • data_loader
  • dbt
  • scratchpad
  • sensor
  • transformer
The type of block will determine which folder it needs to be in. For example, if the block type is data_loader, then the file must be in the [project_name]/data_loaders/ folder. It can be nested in any number of subfolders.
upstream_blocks
array of strings
An array of block UUIDs that this current block depends on. These upstream blocks will pass their data output to this current block.
uuid
string
Unique identifier of the block. This UUID must be unique within the current pipeline. The UUID corresponds to the name of the file for this block. For example, if the UUID is load_data and the language is python, then the file name will be load_data.py.
retry_config
object
Retry configuration at the block level. See documentation for details.
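
Putting the type and uuid rules together: each block’s file lives in the folder for its type and is named after its uuid. For the sample etl_demo pipeline above, the project layout would look like this (assuming the transformer and data_exporter folders follow the same plural naming convention as data_loaders/):

demo_project/
|   data_loaders/
|   |   load_data_from_file.py
|   transformers/
|   |   select_columns.py
|   data_exporters/
|   |   export_to_file.py
|   pipelines/
|   |   etl_demo/
|   |   |   __init__.py
|   |   |   metadata.yaml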