ETL pipeline tutorial
Build a data pipeline that loads restaurant data, transforms it, then exports it to a DuckDB database. 🦆
Overview
In this tutorial, we’ll create a data pipeline with the following steps:
Load data from an online endpoint
We’ll use a Python block to load data from an online endpoint: a CSV file containing restaurant user transactions. We’ll also run a test to make sure the data is clean.
Clean column names and add a new column
We’ll use a Python block to transform the data by cleaning the columns and creating a new column, number of meals, that counts the number of meals for each user.
Write the transformed data to a local DuckDB database
Finally, we’ll write the transformed data to a local DuckDB table.
If you haven’t created a Mage project before, follow the setup guide before starting this tutorial.
Quickstart
Want to dive in? Simply run the following command to clone a pre-built repo:
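The command itself isn’t shown above; based on Mage’s standard Docker quickstart, it would look something like this (the repo URL and directory name are assumptions, so use the ones from the official docs if they differ):

```bash
# Clone the demo repo and start Mage with Docker Compose.
# The repo URL below is an assumption, not confirmed by this page.
git clone https://github.com/mage-ai/etl-demo mage-etl-demo
cd mage-etl-demo
docker compose up
```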
Then navigate to http://localhost:6789 in your browser to see the pipeline in action!
Tutorial
1️⃣ Create a pipeline
Open the pipelines list page (/pipelines). This is the default page when navigating to Mage in your web browser. If you have Mage running, just click here.
In the top left corner, select the button labeled + New, then select the option labeled Standard (batch) to create a new pipeline.
In the left vertical navigation, click the last link labeled Pipeline settings. Change the pipeline’s name to ETL demo. Click the button labeled Save pipeline settings.
2️⃣ Load data from an API
In the left vertical navigation, click the 1st link labeled Edit pipeline. Click the button labeled + Data loader then hover over Python, and click the option labeled API.
A dialog menu will appear. Change the block name to load data. Click the button labeled Save and add block. Paste the following code in the data loader block:
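The loader’s code sample doesn’t appear above; here is a minimal sketch of what the block might contain, using Mage’s @data_loader and @test decorators (the dataset URL is an assumption, so substitute the one from the official tutorial if it differs):

```python
import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_data_from_api(*args, **kwargs):
    # Fetch the CSV of restaurant user transactions.
    # NOTE: this URL is an assumption, not confirmed by the tutorial text.
    url = 'https://raw.githubusercontent.com/mage-ai/datasets/master/restaurant_user_transactions.csv'
    response = requests.get(url)
    return pd.read_csv(io.StringIO(response.text), sep=',')


@test
def test_row_count(df, *args) -> None:
    # A simple cleanliness check: fail the block if the data looks empty.
    assert len(df) >= 1000, 'The data does not have enough rows.'
```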
Run the block by clicking the play icon button in the top right corner of the data loader block, or use one of the following keyboard shortcuts:
- ⌘ + Enter
- Control + Enter
- Shift + Enter
After you run the block (⌘ + Enter), you’ll see a sample of the data that was loaded.
3️⃣ Transform data
Let’s transform the data in 2 ways:
- Add a column that counts the number of meals for each user.
- Clean the column names so they store properly in the DuckDB database.
Follow these steps:
Click the button labeled + Transformer, then hover over Python, and click the option labeled Generic (no template). Paste the following code in the transformer block:
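The transformer’s code sample doesn’t appear above either; here is a sketch of a block that makes both changes (the 'user ID' column name is an assumption about the dataset’s schema):

```python
if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


def number_of_rows_per_key(df, key, column_name):
    # Count the rows (meals) belonging to each key (user).
    data = df.groupby(key)[key].agg(['count'])
    data.columns = [column_name]
    return data


def clean_column(column_name):
    # Lowercase and snake_case a column name so it stores cleanly.
    return column_name.lower().replace(' ', '_')


@transformer
def transform(df, *args, **kwargs):
    # Add a column counting the number of meals for each user.
    # The 'user ID' column name is an assumption about the dataset schema.
    df_new_column = number_of_rows_per_key(df, 'user ID', 'number of meals')
    df = df.join(df_new_column, on='user ID')

    # Clean all column names.
    df.columns = [clean_column(col) for col in df.columns]

    return df


@test
def test_number_of_meals_column(df, *args) -> None:
    assert 'number_of_meals' in df.columns, 'Missing the number_of_meals column.'
```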
Run the block by clicking the play icon button in the top right corner of the transformer block, or press one of the following keyboard shortcuts:
- ⌘ + Enter
- Control + Enter
- Shift + Enter
After you run the block (⌘ + Enter), you’ll see a sample of the data that was transformed.
4️⃣ Export data to DuckDB
Let’s export the data to a local DuckDB database! Create a new Data Exporter SQL block and change the connection type to DuckDB.
Next, change the Table name to magic_duck.restaurant_user_transactions and make sure the content of the block is:
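The block’s SQL isn’t reproduced above. In Mage SQL blocks, the output of the upstream block is exposed as a Jinja variable, so a minimal export that selects everything from the transformer would be:

```sql
-- {{ df_1 }} refers to the dataframe produced by the upstream (transformer) block.
SELECT * FROM {{ df_1 }}
```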
This will take our data from the previous block and export it to a local DuckDB table: magic_duck.restaurant_user_transactions! 🎉
After you run the block (⌘ + Enter), you’ll see a sample of the data that was exported.
You should also see logs indicating that the export completed.
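If you want to double-check the export outside of Mage, you can query the DuckDB file directly with the duckdb Python package. The database file path below is an assumption; it’s typically set by the DUCKDB_DATABASE key in your project’s io_config.yaml:

```python
import duckdb

# The database file path is an assumption; check the DUCKDB_DATABASE
# setting in your project's io_config.yaml for the real location.
con = duckdb.connect('mage_demo.db')

# Count the rows that were just exported.
print(con.sql('SELECT COUNT(*) FROM magic_duck.restaurant_user_transactions').fetchone())
```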
🎉 Congratulations!
You’ve successfully built an end-to-end ETL pipeline that loaded data, transformed it, and exported it to a database.
Now you’re ready to raid the dungeons and find magical treasures with your new powers!
If you have more questions or ideas, get real-time help in our live support Slack channel.