Native support for Polars

Mage supports using Polars DataFrame in addition to Pandas DataFrame to process the data.

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
import polars as pl


@data_loader
def load_data(*args, **kwargs):
    df = pl.DataFrame(
        {
            'A': [1, 2, 3, 4, 5] * 10,
            'fruits': ['banana', 'banana', 'apple', 'apple', 'banana'] * 10,
            'ham': ['a', 'b', 'c', 'd', 'e'] * 10,
            'dt': ['2023-03-01', '2023-03-02', '2023-03-03', '2023-03-04', '2023-03-05'] * 10,
            'arr': [[1, 2], [1, 2], [1, 2], [1, 2], [1, 2]] * 10,
            'obj': [dict(a=[1, 2]), dict(a=[3, 4]), dict(a=[5, 6]), dict(a=[7, 8]), dict(a=[9, 10])] * 10,
        }
    )
    df = df.with_columns(
        pl.col('dt').str.to_datetime('%Y-%m-%d')
    )
    return df

If you return a Pandas or Polars DataFrame, it’ll be serialized to disk using PyArrow and stored in Parquet format. When a downstream block loads an upstream block’s output data as an input argument, that data will be deserialized back to the data type before it was serialized.

Leveraging Polars and PyArrow in this process improves the memory usage while serializing data to disk and deserializing data from disk by leveraging Polar’s lazy data frames, PyArrow partitions, and Parquet file format features such as accessing metadata and schema information without loading the raw data into memory.

Quickstart

Developer experience

Version control

Blocks

Pipelines

dbt

Native support for Polars