A project is like a repository on GitHub; this is where you write all your code.
Here is a sample project and a sample folder structure:
📁 charts/
📁 data_exporters/
📁 data_loaders/
📁 pipelines/
  ⌄ 📁 demo/
      📝 __init__.py
      📝 metadata.yaml
📁 scratchpads/
📁 transformers/
📁 utils/
📝 __init__.py
📝 io_config.yaml
📝 metadata.yaml
📝 requirements.txt
Code in a project can be shared across the entire project.
You can create a new project by running one of the following commands.
With Docker:
docker run -it -p 6789:6789 -v $(pwd):/home/src \
  mageai/mageai mage init [project_name]
With a local Mage installation:
mage init [project_name]
A pipeline contains references to all the blocks of code you want to run and to charts for visualizing data, and it organizes the dependencies between blocks.
Each pipeline is represented by a YAML file. Here is an example.
This is what it could look like in the notebook UI:
A block is a file with code that can be executed independently or within a pipeline.
Blocks can depend on each other. A block won’t start running in a pipeline until all its upstream dependencies are met.
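The dependency rule above can be illustrated with a minimal sketch. This is not Mage's scheduler: the block names and the `upstream` mapping are hypothetical, and Python's standard-library `graphlib` module stands in for whatever ordering logic Mage uses internally. It only shows that a block appears in the run order after every block it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each block maps to the set of blocks it depends on.
upstream = {
    'load_users': set(),
    'load_orders': set(),
    'join_users_orders': {'load_users', 'load_orders'},
    'export_report': {'join_users_orders'},
}


def execution_order(dependencies):
    """Return one valid order in which the blocks could run."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(upstream)
print(order)
```

The two loader blocks have no upstream dependencies, so either may come first; the join block always follows both, and the exporter always runs last.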
There are 5 types of blocks.
For more information, please see the documentation on blocks.
Here is an example of a data loader block and a snippet of its code:
import io

import pandas as pd
import requests
from pandas import DataFrame


@data_loader  # decorator provided by Mage
def load_data_from_api() -> DataFrame:
    url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
    response = requests.get(url)
    return pd.read_csv(io.StringIO(response.text), sep=',')
Each block file is stored in a folder that matches its respective type (e.g.
transformers are stored in
A sensor is a block that continuously evaluates a condition until it’s met or until a period of time has elapsed.
If there is a block with a sensor as an upstream dependency, that block won’t start running until the sensor has evaluated its condition successfully.
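The "evaluate until met or until time runs out" behavior can be sketched in plain Python. This is an illustration of the polling pattern only, not Mage's implementation; `wait_for`, its parameters, and `file_has_arrived` are hypothetical names introduced for this example.

```python
import time


def wait_for(condition, timeout_seconds, poll_interval_seconds=1.0):
    """Poll `condition` until it returns True or the timeout elapses.

    Returns True if the condition was met, False if time ran out.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_seconds)


# Hypothetical condition that becomes true on the third check.
attempts = {'count': 0}


def file_has_arrived():
    attempts['count'] += 1
    return attempts['count'] >= 3


met = wait_for(file_has_arrived, timeout_seconds=5, poll_interval_seconds=0.01)
print(met, attempts['count'])
```

A downstream block would start only once `wait_for` returns True; a False result corresponds to the sensor timing out.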
Sensors can check for anything. Common sensors check conditions such as:
Does a table exist (e.g.
Does a partition of a table exist (e.g. ds = 2022-12-31)?
Does a file in a remote location exist (e.g.
Has another pipeline finished running successfully?
Has a block from another pipeline finished running successfully?
Has a pipeline run or block run failed?
Here is an example of a sensor that keeps checking whether the pipeline transform_users has finished running successfully for the current execution date:
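from mage_ai.orchestration.run_status_checker import check_status


@sensor
def check_condition(**kwargs) -> bool:
    # 'pipeline_uuid' is a placeholder for the UUID of the pipeline to watch
    # (transform_users in this example).
    return check_status(
        'pipeline_uuid',
        kwargs['execution_date'],
    )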
This example uses a helper function called check_status that handles the logic for retrieving the status of a pipeline run for transform_users on the current execution date.
Every block produces data after it’s been executed. These are called data products in Mage.
Data validation occurs whenever a block is executed.
Additionally, each data product produced by a block can be automatically partitioned, versioned, and backfilled.
Some examples of data products produced by blocks:
📋 Dataset/Table in a database, data warehouse, etc.
📝 Text file
🎧 Audio file
A trigger is a set of instructions that determines when or how a pipeline should run. A pipeline can have 1 or more triggers.
There are 3 types of triggers:
A schedule-type trigger will instruct the pipeline to run after a start date and on a set interval.
Currently, pipelines can be scheduled at the following frequencies:
Run exactly once
Every N minutes (coming soon)
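To make "after a start date and on a set interval" concrete, here is a small sketch that lists the run times such a schedule would produce. It assumes the first run falls exactly on the start date, which may differ from how Mage computes runs; `scheduled_run_times` and its parameters are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta


def scheduled_run_times(start, interval, until):
    """List run times from `start` (inclusive) up to `until` (exclusive)."""
    if interval <= timedelta(0):
        raise ValueError('interval must be positive')
    times = []
    current = start
    while current < until:
        times.append(current)
        current += interval
    return times


# A daily schedule starting on 2022-12-29, listed up to the new year.
runs = scheduled_run_times(
    start=datetime(2022, 12, 29),
    interval=timedelta(days=1),
    until=datetime(2023, 1, 1),
)
for run_time in runs:
    print(run_time.isoformat())
```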
An event-type trigger will instruct the pipeline to run whenever a specific event occurs.
For example, you can have a pipeline start running when a database query is finished executing or when a new object is created in Amazon S3 or Google Storage.
You can also trigger a pipeline using your own custom event by making a request to the http://localhost/api/events endpoint with a custom event.
Check out this tutorial on how to create an event trigger.
An API-type trigger will instruct the pipeline to run after a specific API call is made.
You can make a POST request to an endpoint provided in the UI when creating or editing a trigger. You can optionally include runtime variables in your request payload.
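As a rough sketch, such a request could be built with Python's standard library as shown below. The URL is a placeholder (copy the real endpoint from the trigger page in the UI), and the payload shape, with runtime variables nested under `pipeline_run.variables`, is an assumption that should be checked against the example shown in the UI. The request is built but not sent.

```python
import json
import urllib.request

# Placeholder: replace with the endpoint shown in the UI for your trigger.
TRIGGER_URL = 'http://localhost:6789/api/hypothetical_trigger_endpoint'


def build_trigger_request(url, variables=None):
    """Build (but do not send) a POST request carrying runtime variables."""
    # Assumed payload shape; verify it against the example in the UI.
    payload = {'pipeline_run': {'variables': variables or {}}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


request = build_trigger_request(TRIGGER_URL, {'env': 'staging'})
print(request.get_method(), request.data.decode('utf-8'))
# To actually send it: urllib.request.urlopen(request)
```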
A run record stores information about when it was started, its status, when it was completed, any runtime variables used in the execution of the pipeline or block, etc.
Every time a pipeline or a block is executed (outside of the notebook while building the pipeline and block), a run record is created in a database.
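The kind of information a run record holds can be pictured with a small data class. This is an illustrative stand-in, not Mage's database schema: the `RunRecord` class, its field names, and its status values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class RunRecord:
    """Illustrative stand-in for a pipeline run or block run record."""
    uuid: str
    status: str = 'initial'
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    variables: dict = field(default_factory=dict)

    def start(self, now):
        self.status = 'running'
        self.started_at = now

    def complete(self, now, succeeded=True):
        self.status = 'completed' if succeeded else 'failed'
        self.completed_at = now


record = RunRecord(uuid='transform_users', variables={'env': 'staging'})
record.start(datetime(2022, 12, 31, 0, 0))
record.complete(datetime(2022, 12, 31, 0, 5))
print(record.status, record.completed_at - record.started_at)
```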
There are 2 types of runs:
A pipeline run contains information about the entire pipeline execution.
Every time a pipeline is executed, each block in the pipeline will be executed and potentially create a block run record.
A log is a file that contains system output information.
It’s created whenever a pipeline or block is run.
Logs can contain information about the internal state of a run, text that is
outputted by loggers or
Here is an example of a log in the Data pipeline management UI:
Logs are stored on disk wherever Mage is running. However, you can configure where log files are written (e.g. Amazon S3, Google Storage, etc.).
A backfill creates 1 or more pipeline runs for a pipeline.