Create a new pipeline

Each pipeline is represented by a YAML file in a folder named pipelines/ under the Mage project directory. For example, if your project is named demo_project and your pipeline is named etl_demo, then you’ll have a folder structure that looks like this:
demo_project/
|   pipelines/
|   |   etl_demo/
|   |   |   __init__.py
|   |   |   metadata.yaml
Create a new folder in the demo_project/pipelines/ directory and name it after your pipeline. Add 2 files to this new folder:
  1. __init__.py
  2. metadata.yaml
In the metadata.yaml file, add the following content:
blocks: []
name: etl_demo
type: python
uuid: etl_demo
Change etl_demo to whatever name you’re using for your new pipeline.
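
If you’d rather script this step, here is a minimal sketch in Python (assumed to be run from inside the demo_project directory, with etl_demo as an example pipeline name) that creates the folder and both files:

import os

pipeline_name = "etl_demo"  # replace with your pipeline name
pipeline_dir = os.path.join("pipelines", pipeline_name)

# Create pipelines/<pipeline_name>/ (and pipelines/ itself if it does not exist yet)
os.makedirs(pipeline_dir, exist_ok=True)

# 1. Empty __init__.py
open(os.path.join(pipeline_dir, "__init__.py"), "w").close()

# 2. metadata.yaml with the minimal content shown above
with open(os.path.join(pipeline_dir, "metadata.yaml"), "w") as f:
    f.write(
        "blocks: []\n"
        f"name: {pipeline_name}\n"
        "type: python\n"
        f"uuid: {pipeline_name}\n"
    )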

Sample pipeline metadata content

This sample pipeline metadata.yaml will produce the following block dependencies: load_data_from_file → select_columns → export_to_file.
blocks:
- downstream_blocks:
    - select_columns
  executor_type: local_python
  language: python
  name: load_data_from_file
  type: data_loader
  upstream_blocks: []
  uuid: load_data_from_file
- downstream_blocks:
    - export_to_file
  executor_type: local_python
  language: python
  name: select_columns
  type: transformer
  upstream_blocks:
    - load_data_from_file
  uuid: select_columns
- downstream_blocks: []
  executor_type: local_python
  language: python
  name: export_to_file
  type: data_exporter
  upstream_blocks:
    - select_columns
  uuid: export_to_file
name: etl_demo
type: python
uuid: etl_demo
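
Each block in the list above also needs a code file in the project (see the block type and uuid attributes below for how file locations are derived). As a rough, illustrative sketch only (not the exact template Mage generates), demo_project/data_loaders/load_data_from_file.py could look like this, where the CSV path and the pandas usage are assumptions:

import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data(*args, **kwargs):
    # Illustrative only: load a local CSV into a DataFrame.
    # Whatever this function returns becomes the block's output,
    # which the downstream select_columns block receives as input.
    return pd.read_csv('data.csv')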

metadata.yaml sections

Pipeline attributes

blocks
array of objects
An array of blocks that are in the pipeline.
name
string
Unique name of the pipeline.
type
string enum
The type of pipeline. Currently available options are:
  • databricks
  • integration
  • pyspark
  • python (most common)
  • streaming
uuid
string
Unique identifier of the pipeline. This UUID must be unique across all pipelines.
description
string
Optional description of what the pipeline does.
executor_type
string enum
Pipeline-level executor type. Supported values:
  • ecs
  • gcp_cloud_run
  • azure_container_instance
  • k8s
  • local_python (most common)
  • pyspark
executor_count
integer
Number of concurrent executors used to run the pipeline. Used in streaming pipelines.
executor_config
object
Optional configuration specific to the selected executor type. Refer to the executor documentation for executor-specific options.
retry_config
object
Retry configuration at the pipeline level. See documentation for details.
notification_config
object
Configuration for pipeline notification messages (e.g., on failure or success). See documentation for details.
concurrency_config
object
Concurrency settings for block and pipeline run execution within the pipeline. See documentation for details.
  • block_run_limit: Maximum number of blocks that can run in parallel within a single pipeline run.
  • pipeline_run_limit: Maximum number of concurrent pipeline runs per trigger.
  • pipeline_run_limit_all_triggers: Maximum number of concurrent pipeline runs across all of the pipeline’s triggers.
  • on_pipeline_run_limit_reached: What to do when the pipeline run limit is reached (e.g., wait or skip).
cache_block_output_in_memory
boolean
Whether to cache block output in memory during execution.
run_pipeline_in_one_process
boolean
If true, runs all blocks in a single process or k8s pod.
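
To illustrate how these attributes fit together, a pipeline that sets several of the optional fields above might have a metadata.yaml like the following (only keys described in this section are used; the values are examples, not recommendations):

blocks: []
description: Example pipeline with optional pipeline-level settings.
name: etl_demo
type: python
uuid: etl_demo
executor_type: local_python
cache_block_output_in_memory: false
run_pipeline_in_one_process: false
concurrency_config:
  block_run_limit: 2
  pipeline_run_limit: 1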

Block attributes

downstream_blocks
array of strings
An array of block UUIDs that depend on this current block. These downstream blocks will have access to this current block’s data output.
executor_type
string enum
The method for running this block of code. Currently available options are:
  • ecs
  • gcp_cloud_run
  • azure_container_instance
  • k8s
  • local_python (most common)
  • pyspark
executor_config
object
Optional configuration specific to the selected executor type. Refer to the executor documentation for executor-specific options.
language
string enum
Programming language used by the block. Supported values:
  • python (most common)
  • r
  • sql
  • yaml
name
string
Unique name of the block.
type
string enum
The type of block. Currently available options are:
  • chart
  • custom (most common)
  • data_exporter
  • data_loader
  • dbt
  • scratchpad
  • sensor
  • transformer
The type of block will determine which folder it needs to be in. For example, if the block type is data_loader, then the file must be in the [project_name]/data_loaders/ folder. It can be nested in any number of subfolders.
upstream_blocks
array of strings
An array of block UUIDs that this current block depends on. These upstream blocks will pass their data output to this current block.
uuid
string
Unique identifier of the block. This UUID must be unique within the current pipeline. The UUID corresponds to the name of the file for this block. For example, if the UUID is load_data and the language is python, then the file name will be load_data.py.
retry_config
object
Retry configuration at the block level. See documentation for details.
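
Putting the type and uuid rules together: each block’s file lives in the folder for its type and is named after its uuid. For the sample etl_demo pipeline above, the project layout would look like this (assuming the transformer and data_exporter folders follow the same plural naming convention as data_loaders/):

demo_project/
|   data_loaders/
|   |   load_data_from_file.py
|   transformers/
|   |   select_columns.py
|   data_exporters/
|   |   export_to_file.py
|   pipelines/
|   |   etl_demo/
|   |   |   __init__.py
|   |   |   metadata.yaml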