Real‑time streaming pipeline with Apache Flink 2.0, Kafka & Iceberg
Sub‑second latency is fast becoming a competitive necessity for data‑driven decisions. With the right combination of Apache Flink 2.0, Kafka and Iceberg, you can build an ETL‑style data pipeline that processes events in true real time while keeping your lakehouse immutable and query‑ready. Imagine a fraud‑detection team that gets alerts the instant a suspicious transaction lands on the stream: no batch windows, no stale data.
Architecture Overview – From Ingestion to Lakehouse
Kafka sits at the heart of the flow, acting as the durable event backbone. Topics are split into a handful of partitions so that parallelism can scale with traffic, while a schema registry keeps every producer and consumer in sync. Flink 2.0 consumes those partitions, applies stateful operators and guarantees exactly‑once semantics by checkpointing to a reliable store. Finally, Iceberg writes the transformed data into immutable tables, hiding partition details and enabling time‑travel queries for downstream analytics.
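The partition‑to‑parallelism relationship above can be sketched in plain Python (a toy illustration, not Kafka's or Flink's actual code): a producer's key hash decides the partition, which preserves per‑key ordering, and each Flink source subtask owns a subset of partitions.

```python
def partition_for(key: str, num_partitions: int) -> int:
    """Toy stand-in for Kafka's murmur2-based partitioner: the same key
    always lands on the same partition, preserving per-key ordering."""
    return hash(key) % num_partitions

def partitions_owned(subtask: int, parallelism: int, num_partitions: int) -> list[int]:
    """Round-robin assignment of Kafka partitions to Flink source subtasks."""
    return [p for p in range(num_partitions) if p % parallelism == subtask]

# With 8 partitions and parallelism 4, each subtask reads 2 partitions,
# so throughput scales with traffic simply by adding partitions and subtasks.
assignment = {s: partitions_owned(s, 4, 8) for s in range(4)}
```

The actual assignment strategy differs between Kafka consumer groups and Flink's source splits, but the principle is the same: partitions are the unit of parallelism.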
What I love about this trio is that each component does one thing really well. Kafka delivers high throughput with minimal latency; Flink delivers low‑latency stream processing with strong consistency; Iceberg delivers a clean, query‑friendly lakehouse that evolves without breaking downstream workloads.
So, let's be real: you don't need to reinvent the wheel. Just stitch these pieces together, tune each knob, and you get an ETL pipeline that feels like a single, coherent system.
Building the Real‑time ETL with Flink 2.0 (Code Walkthrough)
Below is a compact Java example that illustrates the core flow. You’ll see how to read from a JSON Kafka topic, do a tumbling‑window aggregation, enrich via a broadcast lookup, and sink the result into an Iceberg table with daily partitioning.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class StreamingETL {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);
        env.enableCheckpointing(5000); // checkpoint every 5 seconds; Iceberg commits on checkpoint

        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Source: Kafka topic with an event-time watermark, plus a
        // processing-time attribute for the JDBC lookup join below
        tEnv.executeSql(
            "CREATE TABLE source_kafka (" +
            "  user_id STRING," +
            "  amount DOUBLE," +
            "  ts TIMESTAMP(3)," +
            "  proc_time AS PROCTIME()," +
            "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'transactions'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'properties.group.id' = 'streaming-etl'," +
            "  'format' = 'json'" +
            ")"
        );

        // Lookup table for enrichment
        tEnv.executeSql(
            "CREATE TEMPORARY TABLE user_lookup (" +
            "  user_id STRING," +
            "  country STRING," +
            "  PRIMARY KEY (user_id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'jdbc'," +
            "  'url' = 'jdbc:postgresql://db:5432/userdb'," +
            "  'table-name' = 'users'," +
            "  'username' = 'user'," +
            "  'password' = 'pass'" +
            ")"
        );

        // Transformation: 1-minute tumbling window aggregation + enrichment.
        // The JDBC lookup join requires the processing-time attribute;
        // the window itself is driven by event time (ts).
        Table result = tEnv.sqlQuery(
            "SELECT " +
            "  CAST(TUMBLE_START(ts, INTERVAL '1' MINUTE) AS DATE) AS event_date," +
            "  country," +
            "  COUNT(*) AS tx_count," +
            "  SUM(amount) AS total_amount " +
            "FROM source_kafka " +
            "LEFT JOIN user_lookup FOR SYSTEM_TIME AS OF source_kafka.proc_time " +
            "  ON source_kafka.user_id = user_lookup.user_id " +
            "GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE), country"
        );

        // Sink: Iceberg table with daily partitioning
        tEnv.executeSql(
            "CREATE TABLE iceberg_sink (" +
            "  event_date DATE," +
            "  country STRING," +
            "  tx_count BIGINT," +
            "  total_amount DOUBLE" +
            ") PARTITIONED BY (event_date) WITH (" +
            "  'connector' = 'iceberg'," +
            "  'catalog-name' = 'iceberg_catalog'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'hdfs://namenode:8020/warehouse'," +
            "  'write.format.default' = 'parquet'" +
            ")"
        );

        result.executeInsert("iceberg_sink");
    }
}
The snippet is intentionally lightweight; production deployments add a checkpoint directory, a schema registry reference, and robust error handling. I've found that keeping the job short and declarative makes debugging a breeze.
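To demystify what `TUMBLE(ts, INTERVAL '1' MINUTE)` actually computes, here is a minimal plain‑Python sketch (not Flink code) that buckets events into one‑minute windows keyed by country and aggregates the same count and sum:

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def window_start(ts_ms: int) -> int:
    """Align an event timestamp to the start of its tumbling window."""
    return ts_ms - (ts_ms % WINDOW_MS)

def aggregate(events):
    """events: iterable of (ts_ms, country, amount) tuples.
    Returns {(window_start_ms, country): (tx_count, total_amount)}."""
    acc = defaultdict(lambda: (0, 0.0))
    for ts_ms, country, amount in events:
        key = (window_start(ts_ms), country)
        count, total = acc[key]
        acc[key] = (count + 1, total + amount)
    return dict(acc)

# The first two events share window [0, 60_000); the third opens a new window.
events = [(5_000, "DE", 10.0), (59_999, "DE", 5.0), (60_000, "DE", 7.0)]
result = aggregate(events)
```

The real engine adds what this sketch omits: watermarks decide when a window is complete and can be emitted, and window state is checkpointed so a restart doesn't lose partial aggregates.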
Orchestrating the Pipeline – Airflow + dbt Integration
Airflow is great for managing the lifecycle of each component. The DAG below spins up a Flink job, ensures Kafka topics exist, and triggers dbt tests once the Iceberg snapshot is ready. Here’s a simplified representation:
- Trigger Flink job via REST API or Kubernetes operator.
- Wait for job completion and check Flink metrics for any dropped records.
- Run dbt models against the new Iceberg tables: compile, test, and publish docs.
Sound familiar? You’ve probably orchestrated a batch Spark job before, but the difference with Flink is that the streaming job keeps running on new data; you don't re‑run the ingestion step on every schedule. On top of that, Airflow’s SLA support alerts you when data‑freshness targets are missed. The code in the DAG might look like this:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    'flink_iceberg_etl',
    start_date=datetime(2026, 1, 1),
    schedule_interval='@hourly',
    catchup=False,
) as dag:
    # Wait for the separately managed Flink job DAG to report success
    wait_for_flink = ExternalTaskSensor(
        task_id='wait_for_flink',
        external_dag_id='flink_job',
        external_task_id='run',
        allowed_states=['success'],
    )
    # Run dbt against the freshly committed Iceberg snapshot
    run_dbt = BashOperator(
        task_id='run_dbt',
        bash_command='dbt run --target prod',
    )
    wait_for_flink >> run_dbt
By coupling Airflow with dbt, you get a single source of truth for pipeline state and a declarative layer for business logic. I think this combo is better than writing custom scripts that jump around between services because it centralizes monitoring and auditability.
Why Real‑time Streaming ETL Matters – Business & Technical Impact
First, let's talk money. Faster time‑to‑insight translates to revenue uplift, risk mitigation, and personalized experiences that customers notice. If your fraud team can act within seconds, fraudulent transactions are blocked before they settle. Batch Spark pipelines, by contrast, can add a lag of ten minutes or more, a luxury most products can't afford.
Second, consider cost efficiency. In a streaming setup, resources scale with the event rate rather than a fixed batch schedule. A batch Spark cluster may sit idle between runs, while a containerized Flink deployment can scale down to a minimal footprint during quiet periods. Cloud providers now offer managed Flink with auto‑scaling, making the economics even more favorable.
Finally, future‑proofing is a key selling point. Iceberg's open table format means you can swap the write path between Flink and Spark Structured Streaming with minimal changes. Your analytical models, built in dbt, keep querying the same tables, so downstream teams never notice a difference.
Actionable Takeaways & Best‑Practice Checklist
What I love about this architecture is that it's a recipe you can follow, tweak, and scale. Here are the hard facts you should check before you deploy:
- Checkpointing on a durable store (S3, HDFS) with a 5‑second interval.
- Schema registry integration to avoid incompatible evolution.
- Idempotent writes to Iceberg—commits are tied to Flink checkpoints, and table properties such as `write.format.default = 'parquet'` and `write.distribution-mode = 'hash'` can be tuned as needed.
- Prometheus and Grafana dashboards for Flink metrics, Airflow task durations, and Iceberg metadata checks.
- dbt tests that validate cardinality, null counts, and partition column integrity.
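The idempotency point in the checklist deserves a closer look, because it is what makes exactly‑once sink semantics work: the Iceberg sink records the Flink checkpoint ID in snapshot metadata, so a commit replayed after a failure is recognized and skipped. A toy sketch of that dedup logic in plain Python (hypothetical, not the connector's actual code):

```python
class SnapshotLog:
    """Toy commit log keyed by checkpoint id; a retried commit is a no-op."""

    def __init__(self):
        self.committed = {}  # checkpoint_id -> list of committed data files

    def commit(self, checkpoint_id: int, files: list) -> bool:
        if checkpoint_id in self.committed:
            return False  # duplicate commit after a restart: skip it
        self.committed[checkpoint_id] = list(files)
        return True

log = SnapshotLog()
first = log.commit(42, ["data-0.parquet"])   # normal commit
replay = log.commit(42, ["data-0.parquet"])  # retried after job restart
```

Because the second commit is rejected, replaying the same checkpoint never duplicates rows in the table, which is exactly the guarantee downstream consumers rely on.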
Next steps: scale out the Flink cluster with operator‑managed PVCs, add CDC sources like Debezium to capture database changes, and experiment with in‑stream ML inference to push predictions directly into Iceberg. The key is to iterate fast, monitor rigorously, and keep your lakehouse as a single source of truth.
Frequently Asked Questions
What is the best way to implement an ETL pipeline that runs in real‑time with Flink and Kafka?
Use Kafka as the source of truth, let Flink 2.0 perform the transformations via the Table API or DataStream API, and sink the results into Iceberg tables for downstream analytics. This pattern guarantees exactly‑once processing and makes new data queryable within seconds of each commit.
How does Apache Iceberg complement a streaming data pipeline compared to using plain Parquet files?
Iceberg adds a transaction log, hidden partitioning, and schema evolution support, which means streaming writes from Flink can be committed atomically and queried without data‑skew or stale reads. It also enables time‑travel queries, something raw Parquet directories cannot provide.
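Mechanically, time travel boils down to an append‑only snapshot log: each commit adds a snapshot, and a query "as of" a timestamp reads the latest snapshot at or before that time. A toy sketch of the idea in plain Python (not Iceberg's actual API):

```python
import bisect

class TableSnapshots:
    """Append-only snapshot log: the core of Iceberg-style time travel."""

    def __init__(self):
        self._ts = []     # commit timestamps, kept in sorted (append) order
        self._snaps = []  # snapshot payloads, e.g. the visible file lists

    def commit(self, ts: int, snapshot):
        self._ts.append(ts)
        self._snaps.append(snapshot)

    def as_of(self, ts: int):
        """Return the snapshot visible at time ts, or None before the first commit."""
        i = bisect.bisect_right(self._ts, ts) - 1
        return self._snaps[i] if i >= 0 else None

t = TableSnapshots()
t.commit(100, ["f1.parquet"])
t.commit(200, ["f1.parquet", "f2.parquet"])
```

A reader pinned to time 150 sees only `f1.parquet`, no matter what later commits add; plain Parquet directories have no such log, so readers see whatever files happen to exist.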
Can I schedule Flink jobs with Airflow, and why would I do that?
Yes—Airflow’s DAGs can trigger Flink job submissions (via the Flink REST API or Kubernetes operator) and handle dependencies such as topic creation or dbt model runs. Orchestration adds reliability, retry logic, and a single source of truth for pipeline state.
Is it possible to replace Spark with Flink for a real‑time data pipeline without rewriting downstream dbt models?
Absolutely. Because Iceberg presents a unified table format, downstream dbt models continue to query the same tables regardless of whether the upstream write path is Spark or Flink. You only need to adjust the ingestion layer; the analytical layer remains unchanged.
What are the latency expectations for a Flink‑Kafka‑Iceberg pipeline in production?
With proper tuning (a low checkpoint interval, adequate parallelism, and fast storage for Iceberg), end‑to‑end latency from event ingestion to queryable data is typically a few seconds. Note the floor: Iceberg data becomes visible only when a checkpoint completes and the snapshot commits, so the checkpoint interval bounds table freshness. In‑stream outputs such as fraud alerts emitted directly from Flink can react in well under a second, before the data lands in the table.
What do you think?
Have experience with this topic? Drop your thoughts in the comments - I read every single one and love hearing different perspectives!