Only in Mage Pro. Try our fully managed solution to access this advanced feature.
## Add credentials
To access Apache Iceberg tables stored in Amazon S3, you'll need to configure your AWS credentials:

- Create a new pipeline or open an existing pipeline.
- Expand the left side of your screen to view the file browser.
- Scroll down and click on a file named `io_config.yaml`.
- Enter the following keys and values under the key named `default` (or the profile you are using). These credentials must have read/write access to the S3 bucket that contains your Iceberg tables.
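A minimal sketch of the AWS keys in `io_config.yaml`, following Mage's standard AWS profile layout; the values are placeholders for your own credentials and region:

```yaml
default:
  AWS_ACCESS_KEY_ID: your_access_key_id        # placeholder
  AWS_SECRET_ACCESS_KEY: your_secret_access_key  # placeholder
  AWS_REGION: us-west-2                        # placeholder region
```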
## Using Python block
You can use Mage to load data from Iceberg tables stored in S3, or export data to Iceberg tables, using a configurable Python block.

### Steps
- Create or open a pipeline in your Mage Pro cluster.
- Add a block of type Data Loader or Data Exporter.
- From the block template list, choose: Data lakes → Apache Iceberg.
- In the generated code block, update the following configuration parameters:
  - `base_uri`: Base S3 URI for the Iceberg warehouse (e.g., `s3://your-bucket-name/warehouse/`); required for the SQL catalog
  - `namespace`: Namespace for the Iceberg catalog (default: `'default'`)
  - `catalog_type`: Type of catalog to use, either `'sql'` (default) or `'glue'`
  - `table_name`: Name of the Iceberg table
  - `bucket_name`: Name of your S3 bucket (for exports)
  - `mode`: Write mode for exports, either `'append'` (default) or `'overwrite'`
  - `metadata_file`: Optional; used to access the S3 metadata file directly when the table is not registered in the catalog
- If you're using a non-default profile, update the `config_profile` field accordingly.
- Run the block to load or export data from your Iceberg table stored on S3.
## Configuration Options
### Catalog Types
Mage supports multiple catalog types for Iceberg, including:

- SQL Catalog (default): Uses a Postgres-backed catalog to store table metadata
  - Requires `base_uri` to specify the warehouse location
  - Tables can be registered in the catalog or accessed directly via metadata files
- AWS Glue Catalog: Uses AWS Glue as the catalog
  - No `base_uri` required
  - Tables must be registered in AWS Glue

Set `catalog_type` to the appropriate value for your catalog implementation.
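For reference, a sketch of how these two catalog types map onto pyiceberg's `load_catalog` call; the Postgres URI, bucket, and catalog names below are illustrative assumptions, not values from the Mage template:

```python
from pyiceberg.catalog import load_catalog

# SQL catalog: table metadata lives in a relational database (hypothetical
# Postgres URI shown); `warehouse` corresponds to Mage's `base_uri`.
sql_catalog = load_catalog(
    'default',
    type='sql',
    uri='postgresql+psycopg2://user:password@host:5432/iceberg',
    warehouse='s3://your-bucket-name/warehouse/',
)

# Glue catalog: metadata lives in AWS Glue, so no warehouse URI is required
# here; tables must already be registered in Glue.
glue_catalog = load_catalog('glue_catalog', type='glue')

# Load a table registered in the catalog (placeholder namespace and name).
table = sql_catalog.load_table('default.your_table')
```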
## Loading Data
When loading data, you can:

- Load from catalog: If the table is registered in the catalog, just provide `table_name`.
- Load from metadata file: If the table is not in the catalog, provide `metadata_file` to access the S3 metadata file directly. The metadata file path will be `{base_uri}{table_name}/metadata/{metadata_file}` (see the sketch below).

Additional scan parameters (e.g., `row_filter`, `selected_fields`, `case_sensitive`, `snapshot_id`) can be passed into the `load()` method to customize the data retrieval. These parameters are passed through to pyiceberg's scan method; see Scan Parameters below.
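As an illustration of the metadata-file path, a sketch using pyiceberg's `StaticTable` to read a table that is not registered in any catalog; the bucket, table, and metadata file names are placeholders:

```python
from pyiceberg.table import StaticTable

# Direct access via the full metadata path:
# {base_uri}{table_name}/metadata/{metadata_file}
table = StaticTable.from_metadata(
    's3://your-bucket-name/warehouse/your_table/metadata/00001-example.metadata.json'
)
df = table.scan().to_pandas()
```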
## Scan Parameters
The following parameters can be passed to the `load()` method to customize data retrieval. These parameters are passed through to pyiceberg's `scan()` method. For the complete method signature, see the pyiceberg source code.
| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
| `row_filter` | `str` or `BooleanExpression` | `AlwaysTrue()` | A string or `BooleanExpression` that describes the desired rows | `'id > 100'` or `'status == "active"'` |
| `selected_fields` | `tuple[str]` | `("*",)` | A tuple of strings representing the column names to return in the output dataframe | `('id', 'name', 'created_at')` |
| `case_sensitive` | `bool` | `True` | If `True`, column matching is case sensitive | `True` |
| `snapshot_id` | `int` or `None` | `None` | Optional snapshot ID to time travel to. If `None`, scans the table as of the current snapshot ID | `12345` |
| `limit` | `int` or `None` | `None` | An integer representing the number of rows to return in the scan result. If `None`, fetches all matching rows | `1000` |
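For example, several of these parameters applied together; the `table` object is assumed to be a pyiceberg table like the ones loaded in the sketches above:

```python
# Keyword arguments pass straight through to pyiceberg's scan().
df = table.scan(
    row_filter='status == "active"',              # string predicate
    selected_fields=('id', 'name', 'created_at'),  # project only these columns
    case_sensitive=True,
    limit=1000,                                    # cap the rows returned
).to_pandas()
```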
## Exporting Data
When exporting data, you can use:

- Append mode (default): Adds new data to the existing table
- Overwrite mode: Replaces all data in the table with the new data
## Example: Loading Data
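A minimal sketch of a data loader block that reads an Iceberg table with pyiceberg. The decorator scaffold is Mage's standard loader pattern; the catalog URI, warehouse path, and table name are placeholder assumptions, and the code the template actually generates may differ:

```python
from pyiceberg.catalog import load_catalog

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_from_iceberg(*args, **kwargs):
    # Placeholder catalog settings; replace with your own warehouse and URI.
    catalog = load_catalog(
        'default',
        type='sql',
        uri='postgresql+psycopg2://user:password@host:5432/iceberg',
        warehouse='s3://your-bucket-name/warehouse/',
    )
    table = catalog.load_table('default.your_table')

    # Scan parameters pass through to pyiceberg's scan() (see table above).
    return table.scan(row_filter='id > 100', limit=1000).to_pandas()
```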
## Example: Exporting Data
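A matching sketch of a data exporter block, again with placeholder catalog settings; it appends the upstream dataframe to an Iceberg table via pyiceberg's Arrow-based write API:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_iceberg(df, *args, **kwargs):
    # Placeholder catalog settings; replace with your own warehouse and URI.
    catalog = load_catalog(
        'default',
        type='sql',
        uri='postgresql+psycopg2://user:password@host:5432/iceberg',
        warehouse='s3://your-bucket-name/warehouse/',
    )
    table = catalog.load_table('default.your_table')

    arrow_table = pa.Table.from_pandas(df)
    # Append mode adds rows; use table.overwrite(arrow_table) for overwrite mode.
    table.append(arrow_table)
```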
## Additional Methods
The Iceberg integration also provides methods for managing tables and namespaces:

- `list_namespaces()`: List all namespaces in the catalog
- `list_tables(namespace)`: List all tables in a namespace
- `get_table_schema(table_name, namespace)`: Get the schema of a table
- `drop_table(table_name, namespace)`: Drop (delete) a table
- `drop_namespace(namespace)`: Drop (delete) a namespace
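A usage sketch, where `client` is a hypothetical stand-in for the Iceberg helper object created by the generated block; the method names match the list above:

```python
# Walk every namespace and table, printing each table's schema.
for namespace in client.list_namespaces():
    for table_name in client.list_tables(namespace):
        print(namespace, table_name, client.get_table_schema(table_name, namespace))
```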
## Notes
- The Iceberg integration supports direct access to tables through metadata stored in S3 when tables are not registered in the catalog.
- For the SQL catalog, `base_uri` should point to your warehouse location and include a trailing slash (e.g., `s3://bucket/warehouse/`).
- For the AWS Glue catalog, tables must be registered in AWS Glue.
- When using `metadata_file`, the path is constructed as `{base_uri}{table_name}/metadata/{metadata_file}`.
- Additional scan parameters (e.g., `row_filter`, `selected_fields`, `case_sensitive`, `snapshot_id`) can be passed to the `load()` method.
## Using Iceberg with PySpark
Mage Pro supports reading from and writing to Apache Iceberg tables using PySpark, enabling scalable and efficient data lake operations.

### Example code
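A sketch of a PySpark session configured with a Hadoop catalog named `iceberg` backed by S3; the bucket path is a placeholder, and the Iceberg Spark runtime JAR is assumed to be available on the cluster:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName('iceberg_example')
    # Enable Iceberg SQL extensions and register a catalog named 'iceberg'.
    .config('spark.sql.extensions',
            'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.sql.catalog.iceberg', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.iceberg.type', 'hadoop')
    # Placeholder warehouse path on S3.
    .config('spark.sql.catalog.iceberg.warehouse', 's3a://your-bucket-name/warehouse')
    .getOrCreate()
)

# Write a dataframe to an Iceberg table, then read it back.
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])
df.writeTo('iceberg.db_name.iceberg_table').createOrReplace()

spark.table('iceberg.db_name.iceberg_table').show()
```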
## Using Iceberg with Google Cloud Storage (GCS)
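A sketch of the same Hadoop-catalog setup pointed at GCS instead; the bucket and key-file paths are placeholders, and the GCS connector JAR is assumed to be available to Spark:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName('iceberg_gcs_example')
    .config('spark.sql.extensions',
            'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.sql.catalog.iceberg', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.iceberg.type', 'hadoop')
    # Metadata and data files are stored under this GCS path (placeholder).
    .config('spark.sql.catalog.iceberg.warehouse', 'gs://your-bucket-name/warehouse')
    # Authenticate the GCS connector with a service account key file; the
    # path must be accessible inside the Mage Pro cluster.
    .config('spark.hadoop.google.cloud.auth.service.account.enable', 'true')
    .config('spark.hadoop.google.cloud.auth.service.account.json.keyfile',
            '/path/to/service-account-key.json')
    .getOrCreate()
)

spark.table('iceberg.db_name.iceberg_table').show()
```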
### Notes

- `iceberg.db_name.iceberg_table` uses the Hadoop catalog type and stores metadata in the specified GCS path.
- You can modify the catalog configs to use Hive, Glue, or Nessie depending on your architecture.
- For AWS S3, update the warehouse path and authentication configurations accordingly.
- The Google service account key file path must be accessible inside the Mage Pro cluster.
- You can run this code inside a block in a Mage Pro batch pipeline.