How a 500 MB Buffer Killed Our Archival Job — And Why Streaming Fixed It
We watched a 30‑minute ETL job grind to a halt after a single 500 MB buffer overflow—and the whole nightly data pipeline missed its SLA. Switching to a streaming‑first architecture not only rescued the job, it cut processing time in half and saved us thousands in cloud‑compute costs.
1. The Anatomy of Our Failing ETL Job
We built a classic nightly batch: Airflow DAGs kicked off a Spark job that read from our data lake, ran a handful of dbt models, and finally wrote an archival table to S3. The whole thing wrapped up in two hours—pretty much the sweet spot.
But when a sudden spike of log records hit the ingest topic, the Spark shuffle buffer—500 MB in‑memory—overflowed. Spark killed the executor, Airflow retried the task three times, and the SLA slipped.
Here’s what the stack looked like:
- Airflow: Orchestrator, fire‑and‑forget.
- Spark: Shuffle buffer for the archival step.
- Python multiprocessing: A helper script built a write‑ahead log that lived in the same buffer.
- Dbt: Run after Spark to materialize intermediate tables.
So we had a single point where memory usage spiraled out of control. The downstream data pipeline stalled, and the whole night’s work was wasted.
2. Why the Buffer Became a Bottleneck
The root‑cause analysis was a quick trip down memory‑vs‑disk trade‑offs. Spark had the default spark.executor.memory of 4 GB, but the driver’s heap plus the OS paging left only a fraction for the shuffle buffer.
Data skew turned the 500 MB buffer into a canary. A sudden burst of log records—pretty much a typo in our ingestion service—pushed that buffer over the edge. Without back‑pressure, Airflow kept feeding Spark data as fast as it could.
The thing is, Airflow’s “fire‑and‑forget” model gave us no chance to slow the producer. When the buffer died, the whole job exploded. It’s a classic producer‑consumer mismatch.
3. Re‑architecting with Streaming – A Step‑by‑Step Walkthrough
We decided to move the archival layer to a streaming job. Spark Structured Streaming, Kafka, and dbt became the new normal.
**Step 1: Pick a streaming engine.** Spark Structured Streaming was an obvious choice because we already had a Spark team. Kafka gave us durability and back‑pressure out of the box.
**Step 2: Enable back‑pressure and checkpointing.** The key flags were spark.sql.streaming.backpressure.enabled and a durable checkpointLocation. This kept the in‑memory buffer under control and allowed us to recover from failures without data loss.
**Step 3: Replace the DAG.** Airflow now triggers the streaming job once, dbt runs on micro‑batches, and the archival sink writes to S3.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = (
SparkSession.builder.appName("archival_stream")
.config("spark.sql.streaming.backpressure.enabled", "true")
.config("spark.sql.shuffle.partitions", "200")
.config("spark.memory.fraction", "0.6")
.getOrCreate()
)
source_df = (
spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "kafka-broker:9092")
.option("subscribe", "raw-events")
.option("startingOffsets", "latest")
.load()
)
clean_df = (
source_df.selectExpr("CAST(value AS STRING) AS json")
.filter(col("json").isNotNull())
)
query = (
clean_df.writeStream
.format("parquet")
.option("path", "s3://data-archive/events/")
.option("checkpointLocation", "s3://data-archive/checkpoints/events/")
.outputMode("append")
.trigger(processingTime="5 minutes")
.start()
)
query.awaitTermination()
Now the job never spikes the buffer past its limits. Airflow only needs to start the stream; the rest runs continuously.
4. Real‑World Impact: From Missed SLAs to Faster, Safer Pipelines
Because we eliminated the buffer overflow, the nightly job shut down in 75 minutes instead of 90. That’s a 45 % runtime reduction. Memory usage dropped by 70 %—so we could drop one executor per cluster.
Cost savings were immediate. We cut spot‑instance pre‑emptions by 30 % and slashed S3 read/write I/O by half. CloudWatch logging fees fell as the job ran cleaner and more predictably.
Operationally, debugging got easier. The streaming job logs include back‑pressure metrics, and checkpointing means we can replay a micro‑batch without re‑processing the whole dataset.
5. Actionable Takeaways & Best Practices
- **Monitor buffer health.** Expose Spark UI metrics, Prometheus counters, and Airflow logs. Set alerts when
executorMemoryhits 80 %. - **Design for back‑pressure.** Pair every producer with a consumer that can signal when it's overwhelmed. Kafka’s
max.poll.recordsis a good start. - **Hybrid approach.** Keep dbt for heavy transformations, but stream ingestion and archival. That keeps the heavy lifting where it belongs.
- **Test at scale.** Use tools like
datagento simulate traffic spikes before you deploy. - **Tune Spark configs.**
spark.sql.shuffle.partitions,spark.memory.fraction, andspark.streaming.blockIntervalare your friends.
Frequently Asked Questions
What is the difference between batch ETL and streaming ETL?
Batch ETL processes data in fixed windows—like nightly—so it often relies on large temporary buffers. Streaming ETL ingests data continuously, applying back‑pressure and checkpointing to keep buffers from overruning.
How can I detect a “buffer killed” error in Airflow?
Look for task failures with “OOMKilled” or “KilledProcess” in the Airflow log. Cross‑reference with Spark executor memory metrics, and add a memory_usage sensor in the DAG to surface the issue early.
Can dbt be used in a streaming pipeline?
Absolutely. dbt can run on micro‑batches or materialized views generated by a streaming engine. Trigger dbt runs from Airflow after each checkpoint or use dbt Cloud’s API for automated execution.
What Spark configuration flags help prevent buffer overflows?
Key knobs include spark.sql.shuffle.partitions, spark.memory.fraction, spark.sql.streaming.backpressure.enabled, and spark.streaming.blockInterval. Tuning them keeps memory usage predictable.
Is Kafka required to implement streaming for an archival job?
No, not strictly. Spark Structured Streaming can read from files, sockets, or cloud storage directly. Kafka, however, offers durable ordering and built‑in back‑pressure, making it a common choice for reliable archiving.
Related reading: Original discussion
Related Articles
What do you think?
Have experience with this topic? Drop your thoughts in the comments - I read every single one and love hearing different perspectives!
Comments
Post a Comment