Real‑time streaming pipeline with Apache Flink 2.0, Kafka & Iceberg

For a growing share of enterprises, sub-second latency has become a competitive necessity for data-driven decisions. With the right combination of Apache Flink 2.0, Kafka, and Iceberg, you can build an ETL-style data pipeline that processes events in real time while keeping your lakehouse immutable and query-ready. Imagine a fraud-detection team that gets alerts the instant a suspicious transaction lands on the stream: no batch windows, no stale data.

Architecture Overview – From Ingestion to Lakehouse

Kafka sits at the heart of the flow, acting as the durable event backbone. Topics are split into a handful of partitions so that parallelism can scale with traffic, while a schema registry keeps every producer and consumer in sync. Flink 2.0 consumes those partitions, applies stateful operators and guarantees exactly‑once semantics by checkpointing to a reliable store. Finally, Iceberg writes the transformed data into immutable tables, hiding partition details and enabling time‑travel queries for downstream analytics.
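
To make the partitioning concrete, here's a minimal sketch of provisioning the transactions topic with Kafka's Java AdminClient. The partition and replication counts are illustrative, not prescribed by this setup:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public final class TopicSetup {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
    try (AdminClient admin = AdminClient.create(props)) {
      // 8 partitions cap the Flink source's parallelism; replication factor 3 for durability
      NewTopic transactions = new NewTopic("transactions", 8, (short) 3);
      admin.createTopics(List.of(transactions)).all().get(); // blocks until the broker confirms
    }
  }
}

The partition count matters because a Flink Kafka source can run at most one consumer task per partition; size it to the parallelism you expect to need.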

What I love about this trio is that each component does one thing really well. Kafka delivers high throughput with minimal latency; Flink delivers low‑latency stream processing with strong consistency; Iceberg delivers a clean, query‑friendly lakehouse that evolves without breaking downstream workloads.

So, let's be real: you don't need to reinvent the wheel. Just stitch these pieces together, tune each knob, and you get an ETL pipeline that feels like a single, coherent system.

Building the Real‑time ETL with Flink 2.0 (Code Walkthrough)

Below is a compact Java example that illustrates the core flow. You’ll see how to read from a JSON Kafka topic, enrich via a JDBC lookup join, run a tumbling-window aggregation, and sink the result into an Iceberg table with daily partitioning.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class StreamingETL {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(4);
    env.enableCheckpointing(5000); // checkpoint every 5 seconds

    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

    // Source: Kafka topic. proc_time is a processing-time attribute,
    // which the JDBC lookup join below requires.
    tEnv.executeSql(
      "CREATE TABLE source_kafka (" +
      "  user_id STRING," +
      "  amount DOUBLE," +
      "  ts TIMESTAMP(3)," +
      "  proc_time AS PROCTIME()," +
      "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
      ") WITH (" +
      "  'connector' = 'kafka'," +
      "  'topic' = 'transactions'," +
      "  'properties.bootstrap.servers' = 'kafka:9092'," +
      "  'properties.group.id' = 'streaming-etl'," +
      "  'scan.startup.mode' = 'earliest-offset'," +
      "  'format' = 'json'" +
      ")"
    );

    // Lookup table for enrichment
    tEnv.executeSql(
      "CREATE TEMPORARY TABLE user_lookup (" +
      "  user_id STRING," +
      "  country STRING," +
      "  PRIMARY KEY (user_id) NOT ENFORCED" +
      ") WITH (" +
      "  'connector' = 'jdbc'," +
      "  'url' = 'jdbc:postgresql://db:5432/userdb'," +
      "  'table-name' = 'users'," +
      "  'username' = 'user'," +
      "  'password' = 'pass'" +
      ")"
    );

    // Enrichment: processing-time lookup join against Postgres.
    // The event-time attribute ts survives the lookup join, so it can
    // still drive the window aggregation below.
    tEnv.executeSql(
      "CREATE TEMPORARY VIEW enriched AS " +
      "SELECT s.user_id, s.amount, s.ts, u.country " +
      "FROM source_kafka AS s " +
      "LEFT JOIN user_lookup FOR SYSTEM_TIME AS OF s.proc_time AS u " +
      "  ON s.user_id = u.user_id"
    );

    // Transformation: 1-minute tumbling window aggregation (window TVF syntax)
    Table result = tEnv.sqlQuery(
      "SELECT " +
      "  CAST(window_start AS DATE) AS event_date," +
      "  country," +
      "  COUNT(*) AS tx_count," +
      "  SUM(amount) AS total_amount " +
      "FROM TABLE(TUMBLE(TABLE enriched, DESCRIPTOR(ts), INTERVAL '1' MINUTE)) " +
      "GROUP BY window_start, window_end, country"
    );

    // Sink: Iceberg table with daily partitioning
    tEnv.executeSql(
      "CREATE TABLE iceberg_sink (" +
      "  event_date DATE," +
      "  country STRING," +
      "  tx_count BIGINT," +
      "  total_amount DOUBLE" +
      ") PARTITIONED BY (event_date) WITH (" +
      "  'connector' = 'iceberg'," +
      "  'catalog-name' = 'iceberg_catalog'," +
      "  'catalog-type' = 'hadoop'," +
      "  'warehouse' = 'hdfs://namenode:8020/warehouse'," +
      "  'write.format.default' = 'parquet'" +
      ")"
    );

    result.executeInsert("iceberg_sink");
  }
}

The snippet is intentionally lightweight; production deployments add a checkpoint directory, a schema registry reference, and robust error handling. I've found that keeping the job short and declarative makes debugging a breeze.
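
For illustration, here's a hedged sketch of what that hardening can look like, wiring durable checkpoint storage and RocksDB through the environment configuration. The bucket path is a placeholder, and exact option keys have shifted between Flink releases, so treat this as a starting point rather than a drop-in:

import org.apache.flink.configuration.CheckpointingOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.StateBackendOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public final class ProductionEnv {
  static StreamExecutionEnvironment create() {
    Configuration conf = new Configuration();
    // Durable checkpoint storage (S3 or HDFS) instead of JobManager memory
    conf.set(CheckpointingOptions.CHECKPOINT_STORAGE, "filesystem");
    conf.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, "s3://my-bucket/flink/checkpoints");
    // RocksDB keeps large state off-heap and enables incremental checkpoints
    conf.set(StateBackendOptions.STATE_BACKEND, "rocksdb");
    conf.set(CheckpointingOptions.INCREMENTAL_CHECKPOINTS, true);

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
    env.enableCheckpointing(5000);
    return env;
  }
}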

Orchestrating the Pipeline – Airflow + dbt Integration

Airflow is great for managing the lifecycle of each component. A production DAG can ensure Kafka topics exist, track the Flink job, and trigger dbt tests once the Iceberg snapshot is ready. Here’s a simplified representation:

  • Trigger the Flink job via the REST API or Kubernetes operator (a REST sketch follows this list).
  • Wait for job completion and check Flink metrics for any dropped records.
  • Run dbt models against the new Iceberg tables: compile, test, and publish docs.
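
For the first item, submitting over Flink's REST API is a small HTTP call. Here's a sketch assuming the job jar was already uploaded via POST /jars/upload; the jar id and JobManager address are placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class FlinkSubmitter {
  public static void main(String[] args) throws Exception {
    String jarId = "streaming-etl.jar"; // id returned by POST /jars/upload
    HttpClient client = HttpClient.newHttpClient();
    // POST /jars/{jarId}/run starts a job from a previously uploaded jar
    HttpRequest run = HttpRequest.newBuilder()
        .uri(URI.create("http://flink-jobmanager:8081/jars/" + jarId + "/run"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString("{\"parallelism\": 4}"))
        .build();
    HttpResponse<String> resp = client.send(run, HttpResponse.BodyHandlers.ofString());
    System.out.println(resp.body()); // contains the new job id on success
  }
}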

Sound familiar? You’ve probably orchestrated a batch Spark job before, but the difference with Flink is that the ingestion job is long-running: you don't re-run it every cycle, it just keeps consuming new data. The thing is, Airflow’s SLA alerts give you early warning whenever data freshness slips. The code in the DAG might look like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    'flink_iceberg_etl',
    schedule_interval='@hourly',
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    # Wait for the separately managed Flink submission DAG to succeed
    wait_for_flink = ExternalTaskSensor(
        task_id='wait_for_flink',
        external_dag_id='flink_job',
        external_task_id='run',
        allowed_states=['success'],
    )

    # Build and test the dbt models on top of the fresh Iceberg snapshot
    run_dbt = BashOperator(
        task_id='run_dbt',
        bash_command='dbt build --target prod',
    )

    wait_for_flink >> run_dbt

By coupling Airflow with dbt, you get a single source of truth for pipeline state and a declarative layer for business logic. I think this combo beats custom glue scripts that jump between services, because it centralizes monitoring and auditability.

Why Real‑time Streaming ETL Matters – Business & Technical Impact

First, let's talk money. Faster time-to-insight translates to revenue uplift, risk mitigation, and personalized experiences that customers notice. If your fraud team can act within seconds, you stop a lot of bad money early. The thing is, batch Spark pipelines can add a ten-minute or longer lag, and that's a delay most products can't afford.

Second, consider cost efficiency. In a streaming setup, resource usage tracks the data actually arriving: Spark batch clusters often sit idle between runs, while a containerized Flink deployment can autoscale with traffic and shrink to a minimal footprint during quiet periods. As of 2026, cloud providers offer Flink-as-a-service with auto-scaling, making the economics even more favorable.

Finally, future-proofing is a key selling point. Iceberg's open table format means you can swap Flink for Spark Structured Streaming (or roll back to Flink 1.x) with minimal changes. Your analytical models, built in dbt, keep querying the same tables, so downstream teams never notice a difference.

Actionable Takeaways & Best‑Practice Checklist

What I love about this architecture is that it's a recipe you can follow, tweak, and scale. Here's the checklist to run through before you deploy:

  • Checkpointing on a durable store (S3, HDFS) with a 5‑second interval.
  • Schema registry integration to avoid incompatible evolution.
  • Idempotent writes to Iceberg: enable upsert mode (`'format-version' = '2'` plus `'write.upsert.enabled' = 'true'`) or set `'write.distribution-mode' = 'hash'` as needed (see the sketch after this list).
  • Prometheus and Grafana dashboards for Flink metrics, Airflow task durations, and Iceberg metadata checks.
  • dbt tests that validate cardinality, null counts, and partition column integrity.
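
For the idempotent-writes item, here's a hedged sketch of an upsert-capable Iceberg sink, reusing the walkthrough's catalog settings; the primary key choice is illustrative:

tEnv.executeSql(
  "CREATE TABLE iceberg_upsert_sink (" +
  "  event_date DATE," +
  "  country STRING," +
  "  tx_count BIGINT," +
  "  total_amount DOUBLE," +
  "  PRIMARY KEY (event_date, country) NOT ENFORCED" +
  ") PARTITIONED BY (event_date) WITH (" +
  "  'connector' = 'iceberg'," +
  "  'catalog-name' = 'iceberg_catalog'," +
  "  'catalog-type' = 'hadoop'," +
  "  'warehouse' = 'hdfs://namenode:8020/warehouse'," +
  "  'format-version' = '2'," +          // v2 tables support row-level (equality) deletes
  "  'write.upsert.enabled' = 'true'" +  // re-emitted keys replace previously written rows
  ")"
);

Note that the partition column sits inside the primary key; Iceberg's upsert path requires the equality fields to cover the partition spec.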

Next steps: scale out the Flink cluster with operator‑managed PVCs, add CDC sources like Debezium to capture database changes, and experiment with in‑stream ML inference to push predictions directly into Iceberg. The key is to iterate fast, monitor rigorously, and keep your lakehouse as a single source of truth.
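
A CDC source is just another table definition. Here's a sketch using the Flink CDC Postgres connector, assuming the flink-connector-postgres-cdc dependency is on the classpath; host, credentials, and slot name are placeholders:

tEnv.executeSql(
  "CREATE TABLE users_cdc (" +
  "  user_id STRING," +
  "  country STRING," +
  "  PRIMARY KEY (user_id) NOT ENFORCED" +
  ") WITH (" +
  "  'connector' = 'postgres-cdc'," +
  "  'hostname' = 'db'," +
  "  'port' = '5432'," +
  "  'username' = 'user'," +
  "  'password' = 'pass'," +
  "  'database-name' = 'userdb'," +
  "  'schema-name' = 'public'," +
  "  'table-name' = 'users'," +
  "  'slot.name' = 'users_cdc_slot'" +  // dedicated replication slot per job
  ")"
);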

Frequently Asked Questions

What is the best way to implement an ETL pipeline that runs in real‑time with Flink and Kafka?

Use Kafka as the source of truth, let Flink 2.0 perform the transformations via the Table API or DataStream API, and sink the results into Iceberg tables for downstream analytics. This pattern guarantees exactly‑once processing and keeps the lakehouse queryable instantly.

How does Apache Iceberg complement a streaming data pipeline compared to using plain Parquet files?

Iceberg adds atomic snapshot-based commits, hidden partitioning, and schema evolution support, which means streaming writes from Flink are committed atomically and can be queried without partial or stale reads. It also enables time-travel queries, something raw Parquet directories cannot provide.
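
For example, a point-in-time read can be expressed with Iceberg's Flink read options; the epoch-millis timestamp below is purely illustrative:

// Query the table as of a past point in time (epoch milliseconds).
// Iceberg's 'as-of-timestamp' read option applies to batch-mode reads.
tEnv.executeSql(
  "SELECT * FROM iceberg_sink " +
  "/*+ OPTIONS('as-of-timestamp' = '1735689600000') */"
).print();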

Can I schedule Flink jobs with Airflow, and why would I do that?

Yes—Airflow’s DAGs can trigger Flink job submissions (via the Flink REST API or Kubernetes operator) and handle dependencies such as topic creation or dbt model runs. Orchestration adds reliability, retry logic, and a single source of truth for pipeline state.

Is it possible to replace Spark with Flink for a real‑time data pipeline without rewriting downstream dbt models?

Absolutely. Because Iceberg presents a unified table format, downstream dbt models continue to query the same tables regardless of whether the upstream write path is Spark or Flink. You only need to adjust the ingestion layer; the analytical layer remains unchanged.

What are the latency expectations for a Flink‑Kafka‑Iceberg pipeline in production?

With proper tuning (a low checkpoint interval, adequate parallelism, and fast storage for Iceberg), end-to-end latency can range from sub-second to a few seconds; well-tuned deployments commonly land between roughly 500 ms and 2 s from event ingestion to queryable data.

