Real‑Time Data Streaming vs Batch Data ETL: Why Timing Matters
In 2024, 73 % of Fortune 500 companies say a delay of just 5 minutes in data delivery caused a missed revenue opportunity. Yet most data teams still default to nightly ETL jobs, treating latency as an after‑thought. In this article we’ll unpack why the when of data movement is as critical as the how, and how the right mix of streaming and batch can turn timing into a competitive advantage.
Foundations: Batch ETL vs Real‑Time Streaming
We’re glued to the idea that “ETL” means “Extract, Transform, Load,” but the world has split that into two distinct modes.
- Batch ETL pulls data once, processes it in bulk, and writes a snapshot. Classic tools: Airflow for orchestration, Spark for large‑scale transforms, dbt for modeling.
- Real‑time streaming ingests events as they happen, processes them on the fly, and pushes results downstream. Think Kafka, Flink, or Spark Structured Streaming.
The latency spectrum ranges from hours, to minutes, down to milliseconds. ETL can mean either, depending on the schedule.
When Real‑Time Wins: Business Scenarios That Demand Low Latency
Sound familiar? Those moments when a second can mean profit or loss.
- Fraud detection & security alerts – you need sub‑second reaction to block a transaction.
- Personalized user experiences – recommendation engines that update instantly keep users engaged.
- Operational monitoring & IoT – continuous sensor feeds trigger alerts before a machine fails.
Timing directly impacts revenue, risk, and user satisfaction. In my experience, the teams that built a streaming layer first saw a 30 % boost in conversion rates.
When Batch Still Makes Sense (and Why Hybrid is Often Best)
Batch isn’t dead. It’s just not always the primary engine.
- Large‑scale historical analytics – petabytes of data are cheaper to process in bulk with Spark or Hive.
- Regulatory & compliance reporting – deterministic windows simplify audit trails.
- Data quality & enrichment – batch runs allow expensive joins, cleansing, and dbt‑style modeling.
The hybrid pattern: ingest with a streaming layer (Kafka/Flink) into a Delta Lake, then run nightly dbt models for aggregates.
Practical Walkthrough: Building a Hybrid Pipeline with Airflow, dbt, and Spark Structured Streaming
Let’s roll up our sleeves. Here’s a minimal, but functional, hybrid pipeline.
# PySpark streaming job (app.py)
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, current_timestamp
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
spark = SparkSession.builder.appName("stream_to_delta").getOrCreate()
schema = StructType([
StructField("event_id", StringType(), True),
StructField("user_id", StringType(), True),
StructField("amount", DoubleType(), True),
StructField("event_ts", StringType(), True)
])
df = spark.readStream.format("kafka")\
.option("kafka.bootstrap.servers", "kafka:9092")\
.option("subscribe", "transactions")\
.load()\
.selectExpr("CAST(value AS STRING) as json_str")\
.select(from_json(col("json_str"), schema).alias("data"))\
.select("data.*")\
.withColumn("ingest_ts", current_timestamp())
query = df.writeStream.format("delta")\
.outputMode("append")\
.option("path", "/delta/raw_transactions")\
.option("checkpointLocation", "/delta/checkpoints/raw_transactions")\
.trigger(availableNow=True)\
.start()
query.awaitTermination()
# dbt model: models/daily_sales_summary.sql
{{ config(materialized='table') }}
WITH source AS (
SELECT
user_id,
amount,
DATE(event_ts) AS sale_date
FROM {{ ref('raw_transactions') }}
WHERE event_ts >= CURRENT_DATE() - INTERVAL '1' DAY
)
SELECT
sale_date,
COUNT(*) AS transactions,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_transaction
FROM source
GROUP BY sale_date;
# Airflow DAG: dags/daily_dbt_run.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'data',
'depends_on_past': False,
'start_date': datetime(2026, 6, 1),
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
with DAG('daily_dbt_run',
default_args=default_args,
schedule_interval='0 2 * * *',
catchup=False) as dag:
run_dbt = BashOperator(
task_id='run_dbt',
bash_command='cd /opt/dbt && dbt run --models daily_sales_summary',
env={'DBT_PROFILE':'prod'}
)
In this setup:
- Kafka streams events to a Spark Structured Streaming job that writes to a Delta Lake.
- Airflow triggers a nightly dbt run to aggregate the raw data.
- Both streaming and batch layers are exposed to BI tools.
Actionable Takeaways: Choosing the Right Timing Strategy for Your Data Pipeline
- Assess latency requirements – map business KPIs to maximum acceptable delay. If a KPI swings by 5 % in a minute, you’re in streaming territory.
- Evaluate cost vs value – streaming infrastructure can be pricey; batch compute may be cheaper for large volumes.
- Start with a hybrid MVP – pick one high‑impact use case, implement streaming ingest, then layer batch transforms.
- Governance checklist – monitoring, schema evolution, and testing for both streaming and batch components.
Honestly, most teams end up with a mix. The key is to keep the pipeline flexible enough to shift workloads as business needs evolve.
Frequently Asked Questions
What is the difference between ETL and ELT in modern data pipelines?
ETL extracts data, transforms it in a staging area, then loads into the target, while ELT loads raw data first and pushes transformation logic (often with dbt) to the warehouse. ELT is common with cloud warehouses because compute is cheap and scalable, but the timing considerations (batch vs streaming) remain the same.
How does Apache Airflow fit into a real‑time streaming architecture?
Airflow is primarily an orchestrator, so it schedules batch jobs, model builds (dbt), and even triggers streaming jobs (e.g., start/stop a Flink job). It can also monitor streaming health via sensors and alert on SLA breaches.
Can Spark Structured Streaming replace traditional batch Spark jobs?
Structured Streaming can run in micro‑batch mode, giving near‑real‑time results, but pure batch jobs are still more efficient for large, static datasets that don’t require low latency. Choose based on data volume, latency SLA, and cost.
Is dbt useful for streaming data transformations?
dbt excels at declarative, version‑controlled transformations on data that is already materialized (e.g., in a Delta Lake). For true event‑by‑event logic you’d use a streaming engine, then let dbt run nightly to create aggregates and dimensional models.
What are the key metrics to monitor when comparing batch vs streaming pipelines?
Latency (time from source to consumer), throughput (records/sec), error rate, resource utilization (CPU/Memory), and cost per million records. Monitoring these helps you decide when to shift workloads between batch and streaming.
Related reading: Original discussion
Related Articles
What do you think?
Have experience with this topic? Drop your thoughts in the comments - I read every single one and love hearing different perspectives!
Comments
Post a Comment