Machine learning pipeline tutorial
Build a machine learning pipeline to train a model on the Titanic dataset.
In this tutorial, we’ll create a pipeline that does the following:
- Load data from an online endpoint
- Select columns and fill in missing values
- Train a model to predict which passengers will survive
If you prefer to skip the tutorial and view the finished code, follow this guide.
If you haven’t set up a project before, check out the setup guide before starting.
1. Setup
1a. Add Python packages to project
In the left sidebar (aka file browser), click on the requirements.txt file under the demo_project/ folder.
Then add the following dependencies to that file:
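The exact pinned versions are up to you; this tutorial only needs the following packages, so a minimal requirements.txt would be:

```
pandas
requests
scikit-learn
```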
Then, save the file by pressing ⌘ + S.
1b. Install dependencies
The simplest way is to run pip install from the tool. Add a scratchpad block by pressing the + Scratchpad button, then run the following command:
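Scratchpad blocks run Python, so one way to invoke pip from inside a block (a sketch; adjust the path if your project folder differs from demo_project/) is:

```python
# A sketch: invoke pip for the current Python environment from a scratchpad block.
import subprocess
import sys

subprocess.run(
    [sys.executable, '-m', 'pip', 'install', '-r', 'demo_project/requirements.txt'],
    check=True,
)
```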
Alternatively, here are other ways of installing dependencies (depending on whether you are using Docker or not):
Docker
Get the name of the container that is running the tool:
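```
docker ps
```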
Sample output:
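An illustrative listing (your container ID, image tag, command, and ports will differ):

```
CONTAINER ID   IMAGE           COMMAND          CREATED         STATUS         PORTS                    NAMES
6f8d367ac405   mageai/mageai   "mage start …"   5 minutes ago   Up 5 minutes   0.0.0.0:6789->6789/tcp   mage-ai_server_run_6f8d367ac405
```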
The container name in the above sample output is mage-ai_server_run_6f8d367ac405.
Then run this command to install the Python packages in the demo_project/requirements.txt file:
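For example, with the container name from above:

```
docker exec mage-ai_server_run_6f8d367ac405 pip install -r demo_project/requirements.txt
```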
pip
If you aren’t using Docker, just run the following command in your terminal:
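```
pip install -r demo_project/requirements.txt
```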
2. Create new pipeline
In the top left corner, click File > New pipeline. Then, click the name of the pipeline next to the green dot to rename it to titanic survivors.
3. Play around with scratchpad
There are 4 buttons; click the + Scratchpad button to add a block.
Paste the following sample code in the block:
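The original snippet isn’t critical; any small piece of Python will do. For example:

```python
import pandas as pd

# A tiny DataFrame, just to confirm that the block executes and renders output.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
df
```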
Then click the Play button on the right side of the block to run the code.
Alternatively, you can use the following keyboard shortcuts to execute code in the block:
- ⌘ + Enter
- Control + Enter
- Shift + Enter (run code and add a new block)
Now that we’re done with the scratchpad, we can leave it there or delete it. To delete a block, click the trash can icon on the right side of the block, or use the keyboard shortcut: press the letter D twice.
4. Load data
- Click the + Data loader button, select Python, then click the template called API.
- Rename the block to load dataset.
- In the function named load_data_from_api, set the url variable to: https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
- Run the block by clicking the play icon button or using the keyboard shortcuts ⌘ + Enter, Control + Enter, or Shift + Enter.
After you run the block (⌘ + Enter), you can immediately see a sample of the data in the block’s output.
Here is what the code should look like:
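A sketch based on Mage’s Python API data loader template (the globals() guard around the decorator import comes from the template; details may vary across versions):

```python
import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data_from_api(*args, **kwargs):
    """Load the Titanic dataset from a CSV endpoint."""
    url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
    response = requests.get(url)

    return pd.read_csv(io.StringIO(response.text), sep=',')
```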
5. Transform data
We’re going to select numerical columns from the original dataset, then fill in missing values for those columns (aka impute).
- Click the + Transformer button, select Python, then click Generic (no template).
- Rename the block to extract and impute numbers.
- Paste the following code in the block:
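One way to implement the block (the column names follow the Titanic CSV; selecting these particular columns and imputing with the median are illustrative choices):

```python
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

# Numerical columns to keep from the original dataset (illustrative choice).
NUMBER_COLUMNS = ['Age', 'Fare', 'Parch', 'Pclass', 'SibSp', 'Survived']


@transformer
def transform_df(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Select the numerical columns, then fill missing values (impute)
    # in each column with that column's median.
    df = df[NUMBER_COLUMNS]

    return df.fillna(df.median())
```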
After you run the block (⌘ + Enter), you can immediately see a sample of the data in the block’s output.
6. Train model
In this part, we’re going to accomplish the following:
- Split the dataset into a training set and a test set.
- Train a logistic regression model.
- Calculate the model’s accuracy score.
- Save the training set, test set, and model artifact to disk.
Here are the steps to take:
- Add a new data exporter block by clicking the + Data exporter button, select Python, then click Generic (no template).
- Rename the block to train model.
- Paste the following code in the block:
Run the block (⌘ + Enter).
7. Run pipeline
We can now run the entire pipeline end-to-end. In your terminal, execute the following command:
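Assuming the default project folder and that the pipeline name titanic survivors is slugged to titanic_survivors, the command looks like:

```
mage run demo_project titanic_survivors
```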
You can also run the pipeline from the UI: click Execute pipeline in the bottom right panel.
Your output should show each block executing in order and the pipeline completing successfully.
Congratulations!
You’ve successfully built an ML pipeline that consists of modular code blocks and is reproducible in any environment.
If you have more questions or ideas, please live chat with us in Slack.