AWS Snowflake Lakehouse: 2 Practical Apache Iceberg Integration Patterns
Over 70% of modern data pipelines still rely on brittle file-format conversions, costing enterprises an average of $1.2M per year in hidden ETL debt. By marrying Snowflake's native lakehouse capabilities with Apache Iceberg, you can slash that debt and run truly atomic, version-controlled ETL jobs with no data loss and no reprocessing. Imagine a data engineer who can push a Spark job, have Airflow orchestrate it, and let dbt instantly validate the new Iceberg snapshot, all inside a single Snowflake-powered lakehouse.
Why Apache Iceberg + Snowflake = a Game-Changing Lakehouse
Iceberg gives you atomic table evolution that feels like magic. No more half-finished loads breaking downstream dashboards. Schema evolution without rewrites lets you add, drop, or rename columns in seconds instead of staring at a 50-hour back-fill job (see the short sketch after this list). And the cost model? Snowflake's pay-as-you-go storage keeps you from over-provisioning, while Iceberg's metadata lives as lightweight files right next to the data in your object store, so you pay for what you use, not for the number of files.
- ACID guarantees across batch and streaming workloads, so your data pipeline stays reliable.
- Zero‑dropped‑row surprises thanks to Iceberg’s immutable snapshots.
- Partition pruning stays fast because the metadata is always up‑to‑date.
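To make the schema-evolution claim concrete, here is a minimal PySpark sketch. It assumes a Spark session wired to an Iceberg catalog named iceberg (configured as in the walkthrough later in this post) and a table iceberg_db.my_table; both statements rewrite only metadata, never data files.
# Add a column: metadata-only, existing data files are untouched (names are illustrative)
spark.sql("ALTER TABLE iceberg.iceberg_db.my_table ADD COLUMNS (channel STRING)")

# Rename a column: also metadata-only, because Iceberg tracks columns by ID rather than by name
spark.sql("ALTER TABLE iceberg.iceberg_db.my_table RENAME COLUMN event TO event_name")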
Pattern #1 – Batch‑Oriented ETL with Spark → Iceberg → Snowflake
And here’s the flow that feels almost too clean. Spark writes to an Iceberg table stored in an S3 bucket. Snowflake exposes that table via an external table that reads the Iceberg manifests. Airflow orchestrates the whole dance: a DAG that submits the Spark job, waits for the commit, then runs a Snowflake SQL task to refresh the downstream view.
# Spark (PySpark) job: append the batch to the Iceberg table
(
    df.write.format("iceberg")
      .mode("append")
      .save("s3://my-bucket/iceberg_db.my_table")
)
# Airflow DAG snippet (connection ids are placeholders)
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG("spark_to_iceberg", start_date=days_ago(1)) as dag:
    submit = SparkSubmitOperator(
        task_id="submit_spark",
        conn_id="spark_default",
        application="s3://code-bucket/spark_job.py",
    )
    refresh = SnowflakeOperator(
        task_id="refresh_view",
        sql="ALTER EXTERNAL TABLE my_ext_table REFRESH;",
        snowflake_conn_id="snowflake_default",
    )
    submit >> refresh
The key is that the Spark job commits atomically. Iceberg writes a new snapshot and updates the manifest list in a single step. Snowflake’s external table sees the new snapshot only after the commit, so downstream queries are never reading half‑written data.
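If you want to see those commits for yourself, Iceberg exposes the snapshot history as a metadata table you can query straight from Spark. A minimal sketch, assuming the same catalog and table names used above:
# Each row is one atomic commit; downstream readers only ever see a fully committed snapshot
spark.sql("""
    SELECT committed_at, snapshot_id, operation
    FROM iceberg.iceberg_db.my_table.snapshots
    ORDER BY committed_at DESC
""").show(truncate=False)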
Pattern #2 – Real‑Time Incremental Loads via dbt + Snowflake Streams on Iceberg
Sound familiar? You’ve probably set up Kafka → Spark Structured Streaming → Snowflake. Now, swap out the raw Snowflake write for an Iceberg append-only table. Snowflake Streams capture the changes through an external table pointing at the Iceberg manifests, and dbt pulls those changes into curated models. The flow looks like this (a streaming sketch follows the list):
- Kafka -> Spark Structured Streaming writes to Iceberg (append-only).
- S3 bucket holds the raw parquet files; Iceberg keeps the commit log.
- Snowflake external table reads the manifest list; a Snowflake Stream records insertions.
- dbt runs a model that selects from the stream view and materializes a business‑ready mart.
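Here is what the streaming leg can look like in PySpark. Treat it as a sketch rather than a drop-in job: the broker address, topic, checkpoint path, and table name are placeholders, and it assumes the Iceberg catalog from the walkthrough below plus the spark-sql-kafka connector on the classpath.
# Read click-stream events from Kafka (broker and topic are placeholders)
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp AS ts")
)

# Append each micro-batch to the Iceberg table; every micro-batch lands as an atomic snapshot
query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/clickstream/")
    .toTable("iceberg.iceberg_db.my_table")
)
query.awaitTermination()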
Practical Walkthrough – End‑to‑End ETL Pipeline (Code‑Heavy)
Recently, I built a pilot for a retail client that needed a single source of truth for click-stream events. The stack was Snowflake, Spark 3.4+, Airflow 2.x, and dbt-snowflake. Here’s the step-by-step you can copy and paste.
Prerequisites
- Snowflake account with Snowpark and external stages enabled
- An S3 bucket with an IAM role that Snowflake can assume
- Spark 3.4+ with the iceberg-spark-runtime jar added
- Airflow 2.x with the Spark and Snowflake providers
- dbt-snowflake installed and configured
Step 1: Configure the Iceberg catalog in Spark
from pyspark.sql import SparkSession

# Register an Iceberg catalog on the Spark session (catalog name, type, and warehouse path are examples)
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "hive")
    .config("spark.sql.catalog.iceberg.warehouse", "s3://my-bucket/iceberg_warehouse/")
    .getOrCreate()
)
Step 2: Build an Airflow DAG
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG("full_etl", start_date=days_ago(1), schedule_interval="@daily") as dag:
    spark_job = SparkSubmitOperator(
        task_id="load_to_iceberg",
        application="s3://code-bucket/etl_job.py",
        conn_id="spark_default",
    )
    snowflake_task = SnowflakeOperator(
        task_id="validate_and_refresh",
        sql="""
            INSERT INTO curated_view
            SELECT * FROM iceberg_db.my_table;
        """,
        snowflake_conn_id="snowflake_default",
    )
    spark_job >> snowflake_task
Step 3: Expose the Iceberg table to Snowflake
-- Snowflake reads externally managed Iceberg metadata through an Iceberg table.
-- The external volume and catalog integration names below are placeholders you create beforehand;
-- columns such as id, event, and ts are inferred from the Iceberg metadata, so no column list is needed.
CREATE ICEBERG TABLE iceberg_db.my_table
  EXTERNAL_VOLUME = 'iceberg_ext_volume'
  CATALOG = 'iceberg_catalog_int'
  METADATA_FILE_PATH = 'iceberg_db/my_table/metadata/v1.metadata.json';
Step 4: dbt model
-- models/curated_view.sql
WITH raw AS (
    SELECT * FROM iceberg_db.my_table
)
SELECT
    id,
    event,
    DATE_TRUNC('hour', ts) AS event_hour
FROM raw
Step 5: Verify results
SELECT COUNT(*) FROM curated_view;             -- should match the row count in the Iceberg table
SELECT MIN(ts), MAX(ts) FROM curated_view;     -- sanity-check timestamps
SELECT START_TIME, END_TIME, QUERY_TEXT
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TEXT ILIKE '%curated_view%'
ORDER BY START_TIME DESC;                      -- confirm runs are clean and non-overlapping
If the counts line up and the query history shows clean, non‑overlapping runs, you’re looking at an ETL that’s both reliable and fast.
Actionable Takeaways & Best‑Practice Checklist
- Choose the right pattern. Batch is great for nightly jobs; real‑time is for low‑latency dashboards.
- Metadata hygiene. Iceberg manifests and snapshots can grow large; schedule regular maintenance (for example, Iceberg's expire_snapshots and rewrite_manifests procedures, sketched after this checklist) to keep them lean.
- Monitoring & alerts. Use Snowflake's QUERY_HISTORY view and Airflow SLA misses to catch failures early.
- Security tip. Configure Snowflake external stages with IAM roles that only allow read access to the bucket, following least privilege.
- Next steps. Pilot with one high‑value table; once the pattern proves itself, replicate across your lakehouse catalog.
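For the metadata-hygiene item above, here is a hedged sketch of what a scheduled maintenance job can run from Spark; the catalog, table, and cutoff timestamp are placeholders you would tune to your retention policy.
# Expire snapshots older than the cutoff so metadata (and the time-travel surface) stays small
spark.sql("""
    CALL iceberg.system.expire_snapshots(
        table => 'iceberg_db.my_table',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")

# Compact small manifest files so partition pruning stays fast
spark.sql("CALL iceberg.system.rewrite_manifests(table => 'iceberg_db.my_table')")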
Frequently Asked Questions
How does Apache Iceberg improve ETL reliability in Snowflake?
Iceberg stores table metadata (snapshots, manifests) separately from data files, enabling atomic commits and time‑travel queries. When Snowflake reads an Iceberg external table, it sees a consistent snapshot, eliminating half‑finished loads that typically break ETL pipelines.
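For example, time travel lets you re-run a query against an earlier committed snapshot from Spark; a small sketch, with the timestamp as a placeholder and the catalog and table names from the walkthrough above:
# Count rows as they existed at an earlier, committed snapshot
spark.sql("""
    SELECT COUNT(*)
    FROM iceberg.iceberg_db.my_table TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()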
Can I orchestrate Iceberg‑based Spark jobs with Airflow on AWS?
Yes. Airflow’s SparkSubmitOperator (or a BashOperator calling spark-submit) can push the job to an EMR or EKS cluster; the DAG can then trigger a Snowflake task to validate the load, creating a fully managed end‑to‑end ETL workflow.
What is the role of dbt when using Iceberg tables in a Snowflake lakehouse?
dbt treats the Iceberg external table as any other Snowflake source. You can write dbt models that select from the external table, apply tests, and materialize downstream mart tables, giving you version‑controlled, test‑driven ETL on top of Iceberg.
Do I need a separate compute cluster for Spark when using Snowflake?
Not necessarily. You can run Spark on Amazon EMR, AWS Glue, or even on an EKS-based Spark operator. Snowflake handles the query processing for downstream analytics, so separate compute is only needed for the transformation stage.
Is there a performance penalty reading Iceberg manifests from S3 into Snowflake?
The overhead is minimal because Snowflake caches external table metadata and only reads the manifest files needed for the query. With proper partitioning and manifest pruning, latency is comparable to native Snowflake tables.
What do you think?
Have experience with this topic? Drop your thoughts in the comments - I read every single one and love hearing different perspectives!