
AWS Snowflake Lakehouse: 2 Practical Apache Iceberg Integration Patterns

Over 70% of modern data pipelines still rely on brittle file-format conversions, costing enterprises an average of $1.2M per year in hidden ETL debt. By pairing Snowflake's native lakehouse capabilities with Apache Iceberg, you can slash that debt and run truly atomic, version-controlled ETL jobs with no data loss and no reprocessing. Imagine a data engineer who can push a Spark job, have Airflow orchestrate it, and let dbt instantly validate the new Iceberg snapshot, all inside a single Snowflake-powered lakehouse.

Why Apache Iceberg + Snowflake = a Game‑Changing Lakehouse

Iceberg gives you atomic table evolution that feels like magic. No more half-finished loads breaking downstream dashboards. Schema evolution without rewrites lets you add, drop, or rename columns in seconds instead of staring at a 50-hour backfill job; a short sketch follows the list below. And the cost model? Snowflake's pay-as-you-go storage keeps you from over-provisioning, while Iceberg's metadata lives as compact manifest files alongside the data in S3, so you pay for the storage you actually use, not for scanning thousands of loose files.
  • ACID guarantees across batch and streaming workloads, so your data pipeline stays reliable.
  • Zero‑dropped‑row surprises thanks to Iceberg’s immutable snapshots.
  • Partition pruning stays fast because the metadata is always up‑to‑date.
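A minimal sketch of that schema evolution from Spark SQL, assuming the iceberg catalog configured in Step 1 of the walkthrough below and the same hypothetical iceberg_db.my_table used throughout this post:

# Each statement is a metadata-only Iceberg commit; no Parquet files are rewritten
spark.sql("ALTER TABLE iceberg.iceberg_db.my_table ADD COLUMN referrer STRING")
spark.sql("ALTER TABLE iceberg.iceberg_db.my_table RENAME COLUMN referrer TO referrer_url")
spark.sql("ALTER TABLE iceberg.iceberg_db.my_table DROP COLUMN referrer_url")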

Pattern #1 – Batch‑Oriented ETL with Spark → Iceberg → Snowflake

And here’s the flow that feels almost too clean. Spark writes to an Iceberg table stored in an S3 bucket. Snowflake exposes that table via an external table that reads the Iceberg manifests. Airflow orchestrates the whole dance: a DAG that submits the Spark job, waits for commit, then runs a Snowflake SQL task to refresh the downstream view.
# Spark (PySpark) job: append through the catalog so the commit is atomic
df.writeTo("iceberg.iceberg_db.my_table").append()

# Airflow DAG snippet
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from airflow.utils.dates import days_ago

with DAG("spark_to_iceberg", start_date=days_ago(1)) as dag:
    submit = SparkSubmitOperator(
        task_id="submit_spark",
        conn_id="spark_default",
        application="s3://code-bucket/spark_job.py",
    )
    refresh = SnowflakeOperator(
        task_id="refresh_view",
        # Refresh the external table's metadata so Snowflake sees the new snapshot
        sql="ALTER EXTERNAL TABLE my_ext_table REFRESH;",
        snowflake_conn_id="snowflake_default",
    )
    submit >> refresh
The key is that the Spark job commits atomically. Iceberg writes a new snapshot and updates the manifest list in a single step. Snowflake’s external table sees the new snapshot only after the commit, so downstream queries are never reading half‑written data.
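Those commits are also inspectable: Iceberg exposes its history as queryable metadata tables, so you can watch snapshots appear as jobs land. A quick check from Spark, using the same table as above:

# Each atomic commit shows up as one row in the snapshots metadata table
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM iceberg.iceberg_db.my_table.snapshots
""").show()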

Pattern #2 – Real‑Time Incremental Loads via dbt + Snowflake Streams on Iceberg

Sound familiar? You’ve probably set up Kafka → Spark Structured Streaming → Snowflake. Now, swap out the raw Snowflake write for an Iceberg append‑only table. Snowflake Streams capture the changes through an external table pointing at the Iceberg manifests, and dbt pulls those changes into curated models.
  • Kafka → Spark Structured Streaming writes to Iceberg (append-only); a sketch follows this list.
  • The S3 bucket holds the raw Parquet files; Iceberg keeps the commit log.
  • A Snowflake external table reads the manifest list; an insert-only Snowflake Stream records the new rows.
  • dbt runs a model that selects from the stream and materializes a business-ready mart.
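A minimal sketch of the streaming leg, assuming a Kafka topic named clickstream, a placeholder broker address, and the iceberg catalog from Pattern #1 (the spark-sql-kafka package must be on the classpath):

# Kafka → Spark Structured Streaming → Iceberg, append-only
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "clickstream")                # hypothetical topic
    .load())

(stream
    .selectExpr("CAST(key AS STRING) AS id",
                "CAST(value AS STRING) AS event",
                "timestamp AS ts")
    .writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/clickstream/")
    .toTable("iceberg.iceberg_db.my_table"))

Each micro-batch lands as its own Iceberg snapshot, which is exactly what the Snowflake Stream picks up.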
Because dbt uses Snowflake's native SQL under the hood, you get all the benefits of version control, testing, and documentation without touching Iceberg directly. The result feels like a single, tidy pipeline that still keeps the best of both worlds. The capture side is a one-time setup, sketched below.
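A minimal sketch of that capture step in Snowflake SQL; the table and stream names are hypothetical, and insert-only is the mode Snowflake supports for streams on external tables:

-- Record new rows as Iceberg snapshots land; external tables require INSERT_ONLY
CREATE OR REPLACE STREAM my_table_stream
  ON EXTERNAL TABLE my_ext_table
  INSERT_ONLY = TRUE;

-- dbt models then read the delta, e.g. SELECT * FROM my_table_stream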

Practical Walkthrough – End‑to‑End ETL Pipeline (Code‑Heavy)

Recently, I built a pilot for a retail client that needed a single source of truth for click‑stream events. The stack was: Snowflake, Spark 3.4+, Airflow 2.x, and dbt‑snowflake. Here’s the step‑by‑step you can copy‑paste.

Prerequisites

  • Snowflake account with Snowpark and external stages enabled
  • An S3 bucket with an IAM role that Snowflake can assume
  • Spark 3.4+ with the iceberg-spark-runtime jar on the classpath
  • Airflow 2.x with the Spark and Snowflake providers installed
  • dbt-snowflake installed and configured

Step 1: Configure the Iceberg catalog in Spark

from pyspark.sql import SparkSession

# Spark catalogs are configured via session properties, not a CREATE CATALOG statement
spark = (SparkSession.builder
    .appName("iceberg_etl")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "hive")
    .config("spark.sql.catalog.iceberg.warehouse", "s3://my-bucket/iceberg_warehouse/")
    .getOrCreate())

Step 2: Build an Airflow DAG

with DAG("full_etl", start_date=days_ago(1), schedule_interval="@daily") as dag:
    spark_job = SparkSubmitOperator(
        task_id="load_to_iceberg",
        application="s3://code-bucket/etl_job.py",
        conn_id="spark_default",
    )
    snowflake_task = SnowflakeOperator(
        task_id="validate_and_refresh",
        sql="""
            INSERT INTO curated_view
            SELECT * FROM external_iceberg_table;
        """,
        snowflake_conn_id="snowflake_default",
    )
    spark_job >> snowflake_task

Step 3: Define the Snowflake Iceberg table

-- Assumes an external volume and a catalog integration for the S3 warehouse
-- have already been created; the names below are placeholders
CREATE ICEBERG TABLE iceberg_db.my_table
  EXTERNAL_VOLUME = 'my_s3_volume'
  CATALOG = 'my_catalog_integration'
  METADATA_FILE_PATH = 'iceberg_db/my_table/metadata/v1.metadata.json';

-- No column list: Snowflake derives id, event, and ts from the Iceberg metadata
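Once registered, the table reads like any native Snowflake table; a quick smoke test:

-- Smoke test against the freshly registered Iceberg table
SELECT COUNT(*) AS row_count, MAX(ts) AS latest_event
FROM iceberg_db.my_table;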

Step 4: dbt model

-- models/curated_view.sql
WITH raw AS (
  SELECT * FROM {{ source('iceberg_db', 'my_table') }}
)
SELECT
  id,
  event,
  DATE_TRUNC('hour', ts) AS event_hour
FROM raw

Declare iceberg_db.my_table as a source in your sources.yml, then dbt run materializes curated_view with tests and documentation attached.

Step 5: Verify results

SELECT COUNT(*) FROM curated_view; -- should match the row count committed to Iceberg
SELECT MIN(ts), MAX(ts) FROM curated_view; -- sanity check timestamps
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TEXT LIKE '%load_to_iceberg%'; -- no overlapping runs
If the counts line up and the query history shows clean, non‑overlapping runs, you’re looking at an ETL that’s both reliable and fast.

Actionable Takeaways & Best‑Practice Checklist

  • Choose the right pattern. Batch is great for nightly jobs; real‑time is for low‑latency dashboards.
  • Metadata hygiene. Snapshots and manifests accumulate with every commit; schedule Iceberg's expire_snapshots and rewrite_manifests procedures to keep them lean (sketch after this list).
  • Monitoring & alerts. Use Snowflake’s QUERY_HISTORY view and Airflow SLA misses to catch failures early.
  • Security tip. Configure Snowflake external stages with IAM roles that only allow read access to the bucket, following least‑privilege.
  • Next steps. Pilot with one high‑value table; once the pattern proves itself, replicate across your lakehouse catalog.
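A minimal maintenance sketch using Iceberg's built-in Spark procedures (requires the SQL extensions enabled in Step 1; keeping 10 snapshots is an assumption, tune it to your recovery needs):

# Expire old snapshots but keep the 10 most recent, then compact manifests
spark.sql("""
    CALL iceberg.system.expire_snapshots(
        table => 'iceberg_db.my_table',
        older_than => now(),
        retain_last => 10
    )
""")
spark.sql("CALL iceberg.system.rewrite_manifests('iceberg_db.my_table')")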

Frequently Asked Questions

How does Apache Iceberg improve ETL reliability in Snowflake?

Iceberg stores table metadata (snapshots, manifests) separately from data files, enabling atomic commits and time‑travel queries. When Snowflake reads an Iceberg external table, it sees a consistent snapshot, eliminating half‑finished loads that typically break ETL pipelines.
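Time travel is a one-liner once you have a snapshot ID from the snapshots metadata table shown earlier; a sketch from Spark (the ID and timestamp are illustrative):

# Query the table as of an earlier snapshot, by ID or by wall-clock time
spark.sql("SELECT * FROM iceberg.iceberg_db.my_table VERSION AS OF 5963279034021012303")
spark.sql("SELECT * FROM iceberg.iceberg_db.my_table TIMESTAMP AS OF '2024-01-01 00:00:00'")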

Can I orchestrate Iceberg‑based Spark jobs with Airflow on AWS?

Yes. Airflow’s SparkSubmitOperator (or a BashOperator calling spark-submit) can push the job to an EMR or EKS cluster; the DAG can then trigger a Snowflake task to validate the load, creating a fully managed end‑to‑end ETL workflow.

What is the role of dbt when using Iceberg tables in a Snowflake lakehouse?

dbt treats the Iceberg external table as any other Snowflake source. You can write dbt models that select from the external table, apply tests, and materialize downstream mart tables, giving you version‑controlled, test‑driven ETL on top of Iceberg.

Do I need a separate compute cluster for Spark when using Snowflake?

Not necessarily. You can run Spark on Amazon EMR, AWS Glue, or an EKS-based Spark operator. Snowflake handles the query processing for downstream analytics, so separate compute is only needed for the transformation stage.

Is there a performance penalty reading Iceberg manifests from S3 into Snowflake?

The overhead is minimal because Snowflake caches external table metadata and only reads the manifest files needed for the query. With proper partitioning and manifest pruning, latency is comparable to native Snowflake tables.

