Add credentials

To access Apache Iceberg tables stored in Amazon S3, you’ll need to configure your credentials.

  1. Create a new pipeline or open an existing pipeline.

  2. Expand the left side of your screen to view the file browser.

  3. Scroll down and click on a file named io_config.yaml.

  4. Enter the following keys and values under the key named default (or the profile you are using):

    version: 0.1.1
    default:
      AWS_ACCESS_KEY_ID: ...
      AWS_SECRET_ACCESS_KEY: ...
      AWS_REGION: us-west-2  # or the region where your S3 bucket is located
    

    These credentials must have read/write access to the S3 bucket that contains your Iceberg tables.
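
Inside a block, these values are read through the profile you pass to the block's config loader. As a quick, optional sanity check, here is a minimal sketch assuming Mage's built-in ConfigFileLoader and ConfigKey helpers; the profile name default matches the example above and should be swapped for your own profile if needed.

from os import path

from mage_ai.io.config import ConfigFileLoader, ConfigKey
from mage_ai.settings.repo import get_repo_path

# Load the AWS credentials from the 'default' profile in io_config.yaml.
config_path = path.join(get_repo_path(), 'io_config.yaml')
config = ConfigFileLoader(config_path, 'default')

# Print only the access key ID to confirm the profile resolves correctly.
print(config[ConfigKey.AWS_ACCESS_KEY_ID])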

Using a Python block

You can use a configurable Python block in Mage to load data from, or export data to, Apache Iceberg tables stored in S3.

Steps

  1. Create or open a pipeline in your Mage Pro cluster.
  2. Add a block of type Data Loader or Data Exporter.
  3. From the block template list, choose:
    Data lakes → Apache Iceberg
  4. In the generated code block, update the following required configuration parameters:
    • base_uri: The URI of your data lake storage (e.g., s3://your-bucket-name/warehouse)
    • bucket_name: Name of your S3 bucket
    • table_name: Fully qualified name of your Iceberg table (e.g., db_name.table_name)
  5. If you’re using a non-default profile, update the config_profile field accordingly.
  6. Run the block to load data from, or export data to, your Iceberg table in S3 (a rough sketch of a configured loader block follows these steps).
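
The code generated by the Apache Iceberg template can differ across Mage Pro versions, so treat the following as a minimal, hypothetical sketch rather than the exact template. It shows where base_uri, table_name, and the config profile fit, assuming pyiceberg is installed and a SQL (Postgres-backed) catalog is used; the catalog URI and the placeholder values are illustrative only.

from os import path

import pandas as pd
from pyiceberg.catalog import load_catalog

from mage_ai.io.config import ConfigFileLoader, ConfigKey
from mage_ai.settings.repo import get_repo_path

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_from_iceberg(*args, **kwargs) -> pd.DataFrame:
    # Pull AWS credentials from io_config.yaml (see "Add credentials" above).
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config = ConfigFileLoader(config_path, 'default')  # config_profile

    base_uri = 's3://your-bucket-name/warehouse'  # URI of your data lake storage
    table_name = 'db_name.table_name'             # fully qualified table name

    # Hypothetical catalog settings: a SQL (Postgres-backed) Iceberg catalog.
    # Replace the URI with your own catalog connection string.
    catalog = load_catalog(
        'default',
        **{
            'type': 'sql',
            'uri': 'postgresql+psycopg2://user:password@host:5432/iceberg_catalog',
            'warehouse': base_uri,
            's3.access-key-id': config[ConfigKey.AWS_ACCESS_KEY_ID],
            's3.secret-access-key': config[ConfigKey.AWS_SECRET_ACCESS_KEY],
            's3.region': config[ConfigKey.AWS_REGION],
        },
    )

    # Load the table and materialize the scan as a pandas DataFrame.
    table = catalog.load_table(table_name)
    return table.scan().to_pandas()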

Notes

  • The Iceberg integration supports direct access to tables through metadata stored in S3.
  • Make sure the table path you configure (via base_uri and table_name) points to the root directory of the Iceberg table (i.e., where the metadata/ folder resides).
  • Mage uses a Postgres-backed Iceberg catalog by default. If you’re using a different catalog (e.g., Hive or REST), you can customize the configuration to point to your specific catalog endpoint and type.
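
For example, at the pyiceberg level a different catalog is just a different set of catalog properties. The endpoints and names below are placeholders, not Mage defaults; adapt them to your own catalog.

from pyiceberg.catalog import load_catalog

# Hypothetical REST catalog: point at your own catalog endpoint.
rest_catalog = load_catalog(
    'rest_example',
    **{
        'type': 'rest',
        'uri': 'https://your-catalog-endpoint.example.com',
        'warehouse': 's3://your-bucket-name/warehouse',
    },
)

# Hypothetical Hive metastore catalog (thrift endpoint is a placeholder).
hive_catalog = load_catalog(
    'hive_example',
    **{
        'type': 'hive',
        'uri': 'thrift://your-hive-metastore:9083',
        'warehouse': 's3://your-bucket-name/warehouse',
    },
)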

Using Iceberg with PySpark

Mage Pro supports reading from and writing to Apache Iceberg tables using PySpark, enabling scalable and efficient data lake operations.

Example code

Using Iceberg with Google Cloud Storage (GCS)

from pyspark.sql import SparkSession

# Build a Spark session with the Iceberg Spark runtime, the GCS connector jar,
# and a Hadoop-type Iceberg catalog whose warehouse lives on GCS.
spark = SparkSession.builder \
    .appName("IcebergWithGCS") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.1") \
    .config("spark.jars", "/opt/spark/jars/gcs-connector-hadoop3-2.2.5-shaded.jar") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.iceberg.type", "hadoop") \
    .config("spark.sql.catalog.iceberg.warehouse", "gs://mage_icerberg_test/test") \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/home/src/default_repo/your_google_service_account_key.json") \
    .getOrCreate()

Create and Write an Iceberg Table

# Sample DataFrame
df = spark.createDataFrame([
    (1, "apple"),
    (2, "banana"),
    (3, "banana"),
], ["id", "fruit"])

# Create the database if it doesn't exist
spark.sql("CREATE DATABASE IF NOT EXISTS iceberg.db_name")

# Write data to an Iceberg table
df.writeTo("iceberg.db_name.iceberg_table") \
  .using("iceberg") \
  .createOrReplace()
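
To verify the write, you can query the table back through the same iceberg catalog using standard Spark SQL or the DataFrame reader:

# Query the table through the Iceberg catalog defined above.
spark.sql("SELECT fruit, COUNT(*) AS cnt FROM iceberg.db_name.iceberg_table GROUP BY fruit").show()

# Or load it as a DataFrame.
df_read = spark.read.table("iceberg.db_name.iceberg_table")
df_read.show()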

Notes

  • iceberg.db_name.iceberg_table uses the Hadoop catalog type and stores metadata in the specified GCS path.
  • You can modify the catalog configs to use Hive, Glue, or Nessie depending on your architecture.
  • For AWS S3, update the warehouse path and authentication configurations accordingly (see the sketch after these notes).
  • The Google service account key file path must be accessible inside the Mage Pro cluster.
  • You can run this code inside a block in a Mage Pro batch pipeline.
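
For reference, an S3 variant of the session above might look like the following rough sketch, using the Hadoop S3A connector. The bucket name, jar versions, and credential placeholders are assumptions; you may instead rely on instance profiles, environment variables, or Iceberg's S3FileIO.

from pyspark.sql import SparkSession

# Iceberg runtime plus hadoop-aws for S3A access; the hadoop-aws version
# should match the Hadoop build of your Spark cluster.
spark = SparkSession.builder \
    .appName("IcebergWithS3") \
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.1,"
            "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.iceberg.type", "hadoop") \
    .config("spark.sql.catalog.iceberg.warehouse", "s3a://your-bucket-name/warehouse") \
    .config("spark.hadoop.fs.s3a.access.key", "...") \
    .config("spark.hadoop.fs.s3a.secret.key", "...") \
    .getOrCreate()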