Transformer Actions are a library of modular, reusable data transformations that reduce the boilerplate needed to perform common transformations on your data.
build_transformer_action() creates a transformer action payload, which is a dictionary describing the settings the action needs in order to run. The arguments passed in are:

- df - the data frame to transform, used to infer column types
- action_type - specifies the action to perform. In this case, we are performing an IMPUTE action.
- arguments - specifies the columns to perform the action on. Currently, the action is configured to impute all columns of the data frame. Since we only want to transform “Cabin”, “Age”, and “Embarked”, we will update this entry to contain only these column names.
- axis - specifies which axis to perform the action on. Imputing values is done on a per-column basis, so this is a column transformation.
- options - extra settings to specify for this action. For an impute transformation, the main option to specify is 'strategy', dictating the imputation strategy to use. See Fill in Missing Values for details on the valid imputation strategies. For now we will use MODE, which fills empty values with the most frequent value per column.

After building the payload with build_transformer_action(), it is used to construct the BaseAction object, representing a reusable instance of the transformation. Then the transformation is executed on the input data frame using execute(), returning the transformed data frame when done.
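To make this concrete, here is a hand-written sketch of what such a payload could contain, alongside a plain-pandas illustration of MODE imputation. The key names follow the payload-key table later in this document; the exact values are assumptions, since in practice build_transformer_action() produces the payload for you.

```python
import pandas as pd

# Hand-written sketch of an impute payload (illustrative values only;
# in practice build_transformer_action() produces this dictionary).
impute_payload = {
    "action_type": "impute",
    "action_arguments": ["Cabin", "Age", "Embarked"],  # columns to transform
    "action_code": "",
    "action_options": {"strategy": "mode"},  # fill with most frequent value
    "axis": "column",
    "outputs": [],
}

# The effect of MODE imputation, sketched in plain pandas for illustration:
def impute_mode(df: pd.DataFrame, columns) -> pd.DataFrame:
    df = df.copy()
    for col in columns:
        modes = df[col].mode(dropna=True)
        if not modes.empty:
            # Fill missing entries with the most frequent non-null value.
            df[col] = df[col].fillna(modes.iloc[0])
    return df
```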
Suppose then you’re interested in computing the average fare and age by each location of embarkation - Southampton (S), Cherbourg (C), and Queenstown (Q). Transformer Actions can help us here as well - create a new transformer using the “Aggregate/Aggregate by average value” template.
- action_code is used by aggregation-type transformer actions to filter rows of the dataset before performing the aggregation. For example, if the “Embarked” column had null values, we could filter them out using the filter 'Embarked != null'. In this case we won’t perform any filtering. See Filter Syntax for more details on valid filter syntax.
- arguments holds the columns to aggregate over: ['Fare', 'Age'].
- In options, the only parameter is groupby_columns, which is the list of the columns to group by when performing the aggregation. In this case, the column we want to use to form groups is “Embarked”.
- outputs is used to specify the name and type of the new columns created by the aggregation. As there are two columns in arguments that we want to perform the aggregation over, there should be two corresponding entries in outputs.

As before, construct the BaseAction object using the payload and execute the action to get the transformed data frame.
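For illustration, the effect of an "Aggregate by average value" action can be sketched in plain pandas (this is not the library's implementation; column names follow the Titanic example):

```python
import pandas as pd

# Compute the mean of each target column per group and broadcast it
# back onto every row of that group.
def average_by_group(df, groupby_columns, arguments, output_names):
    df = df.copy()
    grouped = df.groupby(groupby_columns)
    for col, out_name in zip(arguments, output_names):
        df[out_name] = grouped[col].transform("mean")
    return df
```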
Here is a list of the keys used in transformer action payloads:
Key | Description |
---|---|
axis | Axis to perform the transformation on. Either ‘row’ or ‘column’. |
action_type | Type of action to perform. See Transformer Action Reference for the list of action types. |
action_code | Text field for specifying any code such as filters. |
action_arguments | The columns to perform the transformation on. |
action_options | Other settings to provide to the transformer action. See Transformer Action Reference for specific details on which actions require which options. |
action_variables | Variable metadata for each input argument (column). Specifies column name and type. |
outputs | Variable metadata for each output column. Specifies column name and type. See Column Types for information on possible column types to specify. |
Instead of writing the payload dictionary by hand, you can use the build_transformer_action factory to generate the payload. The above snippet generates the same payload as above using build_transformer_action.
The transformer action payload, once built, can be used to construct a BaseAction
object. This object processes the payload to identify the transformations to apply to a data frame. Through this action object, the transformer action can be applied to a data frame using the execute
function:
build_transformer_action(df: DataFrame, action_type: ActionType | str, arguments: List[str], action_code: str, options: Dict, axis: Axis | str, outputs: List[Dict]) -> Dict:
Source: mage_ai.data_cleaner.transformer_actions.utils.build_transformer_action
Builds transformer action payload from arguments. The output of this function can be passed
as input to transformer_actions.base.BaseAction
in order to perform the requested transformation.
Note: Action variables are inferred from arguments and df.

- df (DataFrame): The data frame to build a transformer action payload for.
- action_type (ActionType | str): Transformer action to perform.
- arguments (List[str], optional): Columns to perform this action on. Defaults to [].
- action_code (str, optional): Special code or query to execute with action. Defaults to ''.
- options (Dict, optional): Options specifying behavior of action. Defaults to {}.
- axis (Union[Axis, str], optional): Axis of the data frame to apply the action to. Defaults to Axis.COLUMN.
- outputs (List[Dict], optional): Specifies metadata of newly created columns. Defaults to [].

Returns: (Dict) Transformer action payload to use with transformer_actions.base.BaseAction
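To make the documented defaults concrete, here is a rough stand-in that mirrors the signature above. This is NOT the library's implementation: the real factory also infers action_variables from arguments and df, which is omitted here.

```python
# Rough stand-in mirroring the documented signature and defaults.
# Hypothetical sketch only; the real build_transformer_action also
# infers action_variables from `arguments` and `df`.
def build_transformer_action_sketch(df, action_type, arguments=None,
                                    action_code='', options=None,
                                    axis='column', outputs=None):
    return {
        'action_type': action_type,
        'action_arguments': arguments or [],
        'action_code': action_code,
        'action_options': options or {},
        'axis': axis,
        'outputs': outputs or [],
    }
```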
Paid Dues | Due Amount |
---|---|
Yes | $5.00 |
No | $15.20 |
Yes | $3.50 |
Yes | $1.28 |
No | $25.03 |
Paid Dues | Due Amount |
---|---|
Yes | $5.00 |
Yes | $3.50 |
Yes | $1.28 |
Paid Dues | Due Amount |
---|---|
No | $15.20 |
No | $25.03 |
Paid Dues | Total Due Amount |
---|---|
Yes | $9.78 |
No | $40.23 |
Paid Dues | Due Amount | Total Due Amount by Payment Status |
---|---|---|
Yes | $5.00 | $9.78 |
No | $15.20 | $40.23 |
Yes | $3.50 | $9.78 |
Yes | $1.28 | $9.78 |
No | $25.03 | $40.23 |
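The worked example above (summing Due Amount within each Paid Dues group, then broadcasting the total back onto every row) can be reproduced in plain pandas, for illustration only:

```python
import pandas as pd

df = pd.DataFrame({
    "Paid Dues": ["Yes", "No", "Yes", "Yes", "No"],
    "Due Amount": [5.00, 15.20, 3.50, 1.28, 25.03],
})
# Sum per group, then broadcast the group total back to each row.
df["Total Due Amount by Payment Status"] = (
    df.groupby("Paid Dues")["Due Amount"].transform("sum")
)
```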
All aggregation actions are constructed with build_transformer_action, but have different action types corresponding to the different aggregation functions that could be applied to the groups.

Example

- groupby_columns (optional) - list of columns to form groups with when aggregating. A group consists of each unique combination of values chosen from each of the columns specified in this option.
- timestamp_feature_a (optional) - time feature A (a column in the data frame)
- timestamp_feature_b (optional) - time feature B (a column in the data frame)
- window (optional) - max time window between feature A and feature B in seconds

Each entry in outputs specifies:

- uuid (string): Name of the new column
- column_type (string | ColumnType): Type of the new column (see mage_ai.data_cleaner.column_types.constants.ColumnType for options)

These are passed to build_transformer_action.
Name | Action Type Enum | Description |
---|---|---|
Average | ActionType.AVERAGE | Averages input column value over each group |
Count Distinct | ActionType.COUNT_DISTINCT | Counts the number of distinct rows in each group per input column |
Count | ActionType.COUNT | Counts the number of rows in each group per input column |
First | ActionType.FIRST | For each input column, selects first item in each group (as appearing in the order of the original data frame) |
Last | ActionType.LAST | For each input column, selects last item in each group (as appearing in the order of the original data frame) |
Maximum | ActionType.MAX | Gets the maximum value per group for each input column |
Median | ActionType.MEDIAN | Gets the median value per group for each input column |
Minimum | ActionType.MIN | Gets the minimum value per group for each input column |
Sum | ActionType.SUM | Sums over all values in each group per input column |
Original Name | Cleaned Name | Notes |
---|---|---|
good_name | good_name | Columns in snake case are left unchanged |
bad case | bad_case | |
%@#342%34@@#342 | number_34234342 | Symbols are removed first, then number is cleaned |
PascalCaseColumn | pascal_case_column | |
camelCaseText | camel_case_text | |
___snake_case___ | snake_case | Surrounding underscores are stripped |
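The renaming rules in the table above can be approximated with a short regex pass. This is an illustrative sketch, not the library's actual implementation (the numeric-cleaning case, for example, is not covered):

```python
import re

def clean_column_name(name: str) -> str:
    # Illustrative approximation of the renaming rules shown above.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)  # split camel/Pascal case
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)           # symbols/spaces -> underscore
    return name.strip("_").lower()                       # strip surrounding underscores
```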
"invalid"
if column is of categorical or string typepandas.NaT
if column is a datetimenp.nan
if column is a number typereformat = 'caps_standardization'
): Forces text column to follow a single capitalization strategy (either lowercase or uppercase)reformat = 'currency_to_num'
): Converts a currency value (by default stored as a string) to a decimal number. If unable to convert, doesn’t perform conversion
Original String | Parsed Decimal | Notes |
---|---|---|
" $ 10000" | 10000 | |
"- ¥ 22.324523" | -22.324523 | Negation in front of symbol supported |
" 10000 元 " | 10000 | Currencies with postfixed symbols are supported |
"0.42 €" | 0.42 | |
"- 3.42032 CAD" | -3.42032 | |
"Rs - 100000.23" | -100000.23 | Negation after symbol is supported |
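A minimal sketch that parses strings like those in the table (illustrative only; the library's parser likely handles more cases, such as thousands separators):

```python
import re

def currency_to_num_sketch(s: str):
    # Keep digits and the decimal point; apply a minus sign if one
    # appears anywhere in the string. Returns None if nothing numeric.
    digits = re.sub(r"[^0-9.]", "", s)
    if not digits:
        return None
    value = float(digits)
    return -value if "-" in s else value
```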
Datetime conversion (reformat = 'date_time_conversion'): Converts a string value representing a datetime to a pandas.Timestamp object if correctly formatted, else converts to None.

Trim (reformat = 'trim'): Removes leading and trailing whitespace from text columns.

Options:

- reformat (optional): The reformat action to apply to the specified columns. Can be:
  - None - no reformat is applied
  - 'caps_standardization' - standardize capitalization
  - 'currency_to_num' - convert currency string to number
  - 'date_format_conversion' - convert datetime string to pandas.Timestamp
  - 'trim' - remove leading and trailing whitespace from text columns
- capitalization (optional): Specifies the capitalization strategy to use when standardizing capitalization. This argument is ignored unless reformat = "caps_standardization". Options are ['lowercase', 'uppercase'].
Type | Fill Value |
---|---|
Numeric | np.nan |
Datetime | pandas.NaT |
Other Types | None |
- uuid: Name of the new upshifted column
- type: Data type of the new upshifted column

- periods (optional): Specifies how many rows to shift down. Defaults to 1.
- groupby_columns (optional): Specifies columns to group by. If specified, shifts each group by periods rows instead of the entire data frame. The resulting column is then returned. If unspecified, no grouping is performed. Multiple columns can be specified to group by.
- uuid: Name of the new downshifted column
- type: Data type of the new downshifted column

Type | Formula |
---|---|
Normalization | (X-Xmin)/(Xmax-Xmin) |
Standardization | (X-Xmean)/Xstd |
- uuid: Name of the difference column
- type: Data type of the difference column (most often a number)

Strategy | Description |
---|---|
Average | Fill missing values with the average per column. |
Column | Fill missing values with values from another column. |
Constant | Fill missing values with a placeholder constant. Can fill with a specific value; otherwise a default placeholder is chosen by column type. |
Median | Fill missing values with the median per column. |
Mode | Fill missing values with the most frequent value per column. |
Random | Fill missing values with a randomly sampled non-null value from same column. |
Sequential | Fill missing values with the previous row. Should only be used with timeseries columns. |
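A few of these strategies, sketched in plain pandas for illustration (not the library's implementation):

```python
import pandas as pd

def impute_sketch(df: pd.DataFrame, col: str, strategy: str, value=None):
    # Illustrative sketches of three strategies from the table above.
    df = df.copy()
    if strategy == "average":
        df[col] = df[col].fillna(df[col].mean())
    elif strategy == "constant":
        df[col] = df[col].fillna(value)
    elif strategy == "sequential":
        df[col] = df[col].ffill()  # carry the previous row's value forward
    return df
```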
- strategy (required): Specifies the strategy to use. See the table above or the enum ImputationStrategy.
- value (optional): Value to fill missing values with. This argument is ignored unless strategy = 'constant', in which case the constant value used to impute is value.
- timeseries_cols (optional): Columns which specify the time series in the data frame. Defaults to []. This argument is ignored unless strategy = 'sequential', in which case the data frame is sorted along these columns to sequentially impute data in the order of the time series.

Method | Description |
---|---|
'itree' | Uses the Isolation Tree algorithm. Number of trees is set to 100. |
'lof' | Uses the Local Outlier Factor algorithm. Number of neighbors for computing local density scales with amount of data and contamination rate. |
'auto' | Automatically chooses from the above methods based on the number of data points and the dimensionality of the data. |
- method (optional): Specifies the outlier detection method to use: 'itree', 'lof', or 'auto'. Defaults to 'auto'.
This action is performed with axis = Axis.ROW.
- keep (optional): Specifies the policy on keeping duplicates:
  - 'first' - keeps the first occurrence of the row
  - 'last' - keeps the last occurrence of the row
  - False - removes all occurrences of duplicates
  Defaults to 'last'.
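The three keep policies map directly onto pandas (shown here as an illustration):

```python
import pandas as pd

# pandas analogue of the `keep` policies described above.
df = pd.DataFrame({"key": [1, 1, 2], "val": ["a", "b", "c"]})
keep_first = df.drop_duplicates(subset=["key"], keep="first")
keep_last = df.drop_duplicates(subset=["key"], keep="last")
keep_none = df.drop_duplicates(subset=["key"], keep=False)  # drop all copies
```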
- rows: List of the indices corresponding to the rows to remove. If empty, no rows are removed.

- ascending (optional): Specifies how to sort rows. If True, rows are sorted in ascending order; if False, in descending order. Defaults to True.
- ascendings (optional): A list of booleans specifying whether each column specified in arguments should be sorted in ascending order (must be the same length as arguments). If specified, this list takes precedence over the ascending option.

The following operators are supported in filters: ==, !=, <=, <, >, >=, contains, not contains
contains (and not contains) checks if the value is a substring (or not a substring) of values in the column specified. Filters can be combined with the and and or operators and grouped with parentheses.
age != null and price_of_item >= 50.23
If a column name contains special characters such as +=-*&^%$! ?~|<>(){}[],. or spaces, then the quoted column name can be used in filter code. Any quote character can be used.
Example: ("+= -*&^%$! ?~ |<>" == False and ' kas22d fe ($)' > 50.23) or ("dis>>> ??cou nted" contains "partial")
Column types are specified with the enum mage_ai.data_cleaner.column_types.constants.ColumnType.
Type | Enum | String Value | Description |
---|---|---|---|
Category | ColumnType.CATEGORY | ”category” | Categorical variable that takes one of some finite set of values. This set has at most 255 entries. |
High Cardinality Category | ColumnType.CATEGORY_HIGH_CARDINALITY | ”category_high_cardinality” | Categorical variable that takes one of some finite set of values. This set has more than 255 entries. |
Datetime | ColumnType.DATETIME | ”datetime” | Datetime objects or strings representing a datetime. |
Email | ColumnType.EMAIL | ”email” | Strings that contain emails. |
List | ColumnType.LIST | ”list” | Lists of primitives (integers, strings, booleans, floats) of variable length. |
Number | ColumnType.NUMBER | ”number” | Integer numbers. Can be a Python integer, NumPy integer or a stringified integer. |
Floating Point Number | ColumnType.NUMBER_WITH_DECIMALS | ”number_with_decimals” | Floating point numbers. Can be a Python float, NumPy float, or a stringified float. |
Phone Number | ColumnType.PHONE_NUMBER | ”phone_number” | Phone numbers (either in string form or integer form). |
Text | ColumnType.TEXT | ”text” | Text values, such as sentences or paragraphs. |
Binary | ColumnType.TRUE_OR_FALSE | ”true_or_false” | A variable that takes on two different values (including booleans). |
Zip Code | ColumnType.ZIP_CODE | ”zip_code” | Zip codes (either in string form or integer form). |