AWS Snowflake Lakehouse: 2 Practical Apache Iceberg Integration Patterns
Over 70% of modern data pipelines still rely on brittle file-format conversions, costing enterprises an average of $1.2M per year in hidden ETL debt. By marrying Snowflake's native lakehouse capabilities with Apache Iceberg, you can slash that debt and run truly atomic, version-controlled ETL jobs with no data loss and no reprocessing. Imagine a data engineer who can push a Spark job, have Airflow orchestrate it, and let dbt instantly validate the new Iceberg snapshot, all inside a single Snowflake-powered lakehouse.
Why Apache Iceberg + Snowflake = a Game-Changing Lakehouse
Iceberg gives you atomic table evolution that feels like magic. No more half-finished loads breaking downstream dashboards. Schema evolution without rewrites lets you add, drop, or rename columns in seconds instead of staring at a 50-hour back-fill job (see the short sketch after this list). And the cost model? Snowflake's pay-as-you-go storage keeps you from over-provisioning, while Iceberg's metadata lives as lightweight files right next to the data in your object store, so you pay for what you use, not for the number of files.
- ACID guarantees across batch and streaming workloads, so your data pipeline stays reliable.
- Zero‑dropped‑row surprises thanks to Iceberg’s immutable snapshots.
- Partition pruning stays fast because the metadata is always up‑to‑date.
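To make the schema-evolution claim concrete, here is a minimal PySpark sketch. It assumes a Spark session wired to an Iceberg catalog named iceberg (configured as in the walkthrough later in this post) and a table iceberg_db.my_table; both statements rewrite only metadata, never data files.
# Add a column: metadata-only, existing data files are untouched (names are illustrative)
spark.sql("ALTER TABLE iceberg.iceberg_db.my_table ADD COLUMNS (channel STRING)")

# Rename a column: also metadata-only, because Iceberg tracks columns by ID rather than by name
spark.sql("ALTER TABLE iceberg.iceberg_db.my_table RENAME COLUMN event TO event_name")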
Pattern #1 – Batch‑Oriented ETL with Spark → Iceberg → Snowflake
And here’s the flow that feels almost too clean. Spark writes to an Iceberg table stored in an S3 bucket. Snowflake exposes that table via an external table that reads the Iceberg manifests. Airflow orchestrates the whole dance: a DAG that submits the Spark job, waits for the commit, then runs a Snowflake SQL task to refresh the downstream view.
# Spark (PySpark) job: append the batch to the Iceberg table
(
    df.write.format("iceberg")
      .mode("append")
      .save("s3://my-bucket/iceberg_db.my_table")
)
# Airflow DAG snippet (connection ids are placeholders)
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG("spark_to_iceberg", start_date=days_ago(1)) as dag:
    submit = SparkSubmitOperator(
        task_id="submit_spark",
        conn_id="spark_default",
        application="s3://code-bucket/spark_job.py",
    )
    refresh = SnowflakeOperator(
        task_id="refresh_view",
        sql="ALTER EXTERNAL TABLE my_ext_table REFRESH;",
        snowflake_conn_id="snowflake_default",
    )
    submit >> refresh
The key is that the Spark job commits atomically. Iceberg writes a new snapshot and updates the manifest list in a single step. Snowflake’s external table sees the new snapshot only after the commit, so downstream queries are never reading half‑written data.
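If you want to see those commits for yourself, Iceberg exposes the snapshot history as a metadata table you can query straight from Spark. A minimal sketch, assuming the same catalog and table names used above:
# Each row is one atomic commit; downstream readers only ever see a fully committed snapshot
spark.sql("""
    SELECT committed_at, snapshot_id, operation
    FROM iceberg.iceberg_db.my_table.snapshots
    ORDER BY committed_at DESC
""").show(truncate=False)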
Pattern #2 – Real‑Time Incremental Loads via dbt + Snowflake Streams on Iceberg
Sound familiar? You’ve probably set up Kafka → Spark Structured Streaming → Snowflake. Now, swap out the raw Snowflake write for an Iceberg append-only table. Snowflake Streams capture the changes through an external table pointing at the Iceberg manifests, and dbt pulls those changes into curated models. The flow looks like this (a streaming sketch follows the list):
- Kafka -> Spark Structured Streaming writes to Iceberg (append-only).
- S3 bucket holds the raw parquet files; Iceberg keeps the commit log.
- Snowflake external table reads the manifest list; a Snowflake Stream records insertions.
- dbt runs a model that selects from the stream view and materializes a business‑ready mart.
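Here is what the streaming leg can look like in PySpark. Treat it as a sketch rather than a drop-in job: the broker address, topic, checkpoint path, and table name are placeholders, and it assumes the Iceberg catalog from the walkthrough below plus the spark-sql-kafka connector on the classpath.
# Read click-stream events from Kafka (broker and topic are placeholders)
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp AS ts")
)

# Append each micro-batch to the Iceberg table; every micro-batch lands as an atomic snapshot
query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/clickstream/")
    .toTable("iceberg.iceberg_db.my_table")
)
query.awaitTermination()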
Practical Walkthrough – End‑to‑End ETL Pipeline (Code‑Heavy)
Recently, I built a pilot for a retail client that needed a single source of truth for click-stream events. The stack was Snowflake, Spark 3.4+, Airflow 2.x, and dbt-snowflake. Here’s the step-by-step you can copy and paste.
Prerequisites
- Snowflake account with Snowpark and external stages enabled
- An S3 bucket with an IAM role that Snowflake can assume
- Spark 3.4+ with the iceberg-spark-runtime jar added
- Airflow 2.x with the Spark and Snowflake providers
- dbt-snowflake installed and configured
Step 1: Configure the Iceberg catalog in Spark
from pyspark.sql import SparkSession

# Register an Iceberg catalog on the Spark session (catalog name, type, and warehouse path are examples)
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "hive")
    .config("spark.sql.catalog.iceberg.warehouse", "s3://my-bucket/iceberg_warehouse/")
    .getOrCreate()
)
Step 2: Build an Airflow DAG
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG("full_etl", start_date=days_ago(1), schedule_interval="@daily") as dag:
    spark_job = SparkSubmitOperator(
        task_id="load_to_iceberg",
        application="s3://code-bucket/etl_job.py",
        conn_id="spark_default",
    )
    snowflake_task = SnowflakeOperator(
        task_id="validate_and_refresh",
        sql="""
            INSERT INTO curated_view
            SELECT * FROM iceberg_db.my_table;
        """,
        snowflake_conn_id="snowflake_default",
    )
    spark_job >> snowflake_task
Step 3: Expose the Iceberg table to Snowflake
-- Snowflake reads externally managed Iceberg metadata through an Iceberg table.
-- The external volume and catalog integration names below are placeholders you create beforehand;
-- columns such as id, event, and ts are inferred from the Iceberg metadata, so no column list is needed.
CREATE ICEBERG TABLE iceberg_db.my_table
  EXTERNAL_VOLUME = 'iceberg_ext_volume'
  CATALOG = 'iceberg_catalog_int'
  METADATA_FILE_PATH = 'iceberg_db/my_table/metadata/v1.metadata.json';
Step 4: dbt model
-- models/curated_view.sql
WITH raw AS (
    SELECT * FROM iceberg_db.my_table
)
SELECT
    id,
    event,
    DATE_TRUNC('hour', ts) AS event_hour
FROM raw
Step 5: Verify results
SELECT COUNT(*) FROM curated_view;             -- should match the row count in the Iceberg table
SELECT MIN(ts), MAX(ts) FROM curated_view;     -- sanity-check timestamps
SELECT START_TIME, END_TIME, QUERY_TEXT
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TEXT ILIKE '%curated_view%'
ORDER BY START_TIME DESC;                      -- confirm runs are clean and non-overlapping
If the counts line up and the query history shows clean, non‑overlapping runs, you’re looking at an ETL that’s both reliable and fast.
Actionable Takeaways & Best‑Practice Checklist
- Choose the right pattern. Batch is great for nightly jobs; real‑time is for low‑latency dashboards.
- Metadata hygiene. Iceberg manifests and snapshots can grow large; schedule regular maintenance (for example, Iceberg's expire_snapshots and rewrite_manifests procedures, sketched after this checklist) to keep them lean.
- Monitoring & alerts. Use Snowflake's QUERY_HISTORY view and Airflow SLA misses to catch failures early.
- Security tip. Configure Snowflake external stages with IAM roles that only allow read access to the bucket, following least privilege.
- Next steps. Pilot with one high‑value table; once the pattern proves itself, replicate across your lakehouse catalog.
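For the metadata-hygiene item above, here is a hedged sketch of what a scheduled maintenance job can run from Spark; the catalog, table, and cutoff timestamp are placeholders you would tune to your retention policy.
# Expire snapshots older than the cutoff so metadata (and the time-travel surface) stays small
spark.sql("""
    CALL iceberg.system.expire_snapshots(
        table => 'iceberg_db.my_table',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")

# Compact small manifest files so partition pruning stays fast
spark.sql("CALL iceberg.system.rewrite_manifests(table => 'iceberg_db.my_table')")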
Frequently Asked Questions
How does Apache Iceberg improve ETL reliability in Snowflake?
Iceberg stores table metadata (snapshots, manifests) separately from data files, enabling atomic commits and time‑travel queries. When Snowflake reads an Iceberg external table, it sees a consistent snapshot, eliminating half‑finished loads that typically break ETL pipelines.
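For example, time travel lets you re-run a query against an earlier committed snapshot from Spark; a small sketch, with the timestamp as a placeholder and the catalog and table names from the walkthrough above:
# Count rows as they existed at an earlier, committed snapshot
spark.sql("""
    SELECT COUNT(*)
    FROM iceberg.iceberg_db.my_table TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()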
Can I orchestrate Iceberg‑based Spark jobs with Airflow on AWS?
Yes. Airflow’s SparkSubmitOperator (or a BashOperator calling spark-submit) can push the job to an EMR or EKS cluster; the DAG can then trigger a Snowflake task to validate the load, creating a fully managed end‑to‑end ETL workflow.
What is the role of dbt when using Iceberg tables in a Snowflake lakehouse?
dbt treats the Iceberg external table as any other Snowflake source. You can write dbt models that select from the external table, apply tests, and materialize downstream mart tables, giving you version‑controlled, test‑driven ETL on top of Iceberg.
Do I need a separate compute cluster for Spark when using Snowflake?
Not necessarily. You can run Spark on Amazon EMR, AWS Glue, or even on an EKS-based Spark operator. Snowflake handles the query processing for downstream analytics, so separate compute is only needed for the transformation stage.
Is there a performance penalty reading Iceberg manifests from S3 into Snowflake?
The overhead is minimal because Snowflake caches external table metadata and only reads the manifest files needed for the query. With proper partitioning and manifest pruning, latency is comparable to native Snowflake tables.
What do you think?
Have experience with this topic? Drop your thoughts in the comments - I read every single one and love hearing different perspectives!