Skip to main content

Load PostgreSQL into Apache Iceberg with Sling

Load PostgreSQL into Apache Iceberg with Sling

Load PostgreSQL into Apache Iceberg with Sling

Did you know more than 70 % of modern ETL pipelines still copy data row‑by‑row, throttling performance and inflating cloud costs? What if you could stream PostgreSQL tables directly into an open‑format lakehouse (Apache Iceberg) with a single, declarative command? Enter Sling – the lightweight bridge that turns a messy PostgreSQL‑to‑Iceberg load into a production‑grade, Airflow‑orchestrated data pipeline.

Why Move PostgreSQL Data to Apache Iceberg?

Iceberg brings schema evolution, hidden partitioning, and time‑travel – all baked into a columnar store. In the past few months, I've seen teams cut query latency from seconds to a fraction of a second just by swapping a relational dump for an Iceberg snapshot.

Cost-wise, columnar storage reduces storage spend by filtering out unused columns. Plus, predicate push‑down means you only scan what you need, saving compute credits on cloud warehouses.

Finally, a lakehouse lets Spark, Presto, dbt, and even BI tools read the same source. That means no duplicated ETL, no data silos, and a single source of truth for downstream analytics.

Introducing Sling: The ETL‑as‑Code Engine

Sling is a CLI‑first, YAML‑driven framework built on Spark. It abstracts connection handling, schema mapping, and incremental loads, so you write almost nothing but a few lines of configuration.

  • sling.yml – your pipeline definition.
  • Connectors – pre‑built adapters like PostgreSQL → Iceberg.
  • Checkpointing – built‑in progress tracking to avoid reprocessing.

You can drop Sling into an Airflow DAG, invoke it from a dbt run‑operation, or run it as a standalone Spark job. The flexibility means you keep your existing orchestration tools while gaining Iceberg’s ACID guarantees.

Step‑by‑Step Walkthrough: Loading PostgreSQL → Iceberg with Sling

Below is a minimal, production‑ready Sling configuration. It pulls public.sales from PostgreSQL and writes it into lake.sales in Iceberg, using incremental CDC based on an updated_at timestamp.

# sling.yml
pipeline:
  name: load-postgres-to-iceberg

source:
  type: postgresql
  host: "db.example.com"
  port: 5432
  database: "salesdb"
  user: "etl_user"
  password: "secret" # use Vault or env vars in prod
  table: "public.sales"
  # incremental key
  incremental:
    column: "updated_at"
    start: "{{ prev_run_end }}"   # Sling will replace with checkpoint

target:
  type: iceberg
  warehouse: "s3://my-iceberg-warehouse/"
  database: "lake"
  table: "sales"
  format: "parquet"
  partitioning: "year(updated_at), month(updated_at)"

checkpoint:
  path: "s3://my-iceberg-warehouse/_checkpoints/postgres_sales.json"

1. Set up the environment – Spark 3.x, Java 11, and Sling installation (pip install sling-cli). Make sure the Iceberg Spark runtime is on the classpath.

2. Run locallysling run sling.yml. Sling will read the source schema, map columns, and write the first snapshot into Iceberg.

3. Airflow orchestration – Create a BashOperator that calls the same CLI, passing the execution date for incremental loads.

# airflow_dag.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data_eng',
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG('load_postgres_to_iceberg',
         default_args=default_args,
         schedule_interval='@daily',
         start_date=datetime(2026, 1, 1),
         catchup=False) as dag:

    sling_task = BashOperator(
        task_id='sling_load',
        bash_command='sling run sling.yml --env execution_date=${{ ds }}',
    )

4. Validate – In Spark Shell, run DESCRIBE HISTORY lake.sales to see the snapshot timeline, and SELECT * FROM lake.sales LIMIT 10 to confirm data integrity.

Real‑World Impact: From Prototype to Production‑Scale ETL

Take a fintech startup that moved from nightly PostgreSQL dumps to Sling‑powered Iceberg loads. The nightly load time dropped from 3 hrs to 12 min – a 15× improvement. Storage costs fell by 38 % because Iceberg trimmed unused columns and leveraged compression. The team also added a Great Expectations quality check that ran after each Sling job, catching anomalies before they hit downstream models.

Operationally, schema evolution is painless. When customer_id changed type, Sling automatically added the new column to the Iceberg table without breaking existing queries. Downstream dbt models continued to work, because Iceberg preserves historic data and offers time‑travel queries.

Actionable Takeaways & Next Steps

  • Checklist – Verify Spark 3.x, Java 11, Iceberg runtime, and secure credentials (env vars or Vault).
  • Best practices – Use a granular incremental key (e.g., updated_at), partition by year/month for faster pruning, and enable checkpointing to avoid duplicates.
  • Monitoring – Airflow logs give you job status; Iceberg’s DESCRIBE HISTORY shows snapshot creation times, and you can expose metrics via Prometheus.
  • Scale up – Add more tables by replicating the sling.yml pattern; orchestrate with a single DAG that loops over a list of configs.
  • Future enhancements – Combine Sling with dbt run‑operation for post‑load transformations, add Great Expectations for data quality, and explore Iceberg’s MERGE INTO for upserts instead of append-only writes.

Frequently Asked Questions

How does Sling differ from traditional ETL tools for loading PostgreSQL into Iceberg?

Sling is a code‑first, Spark‑powered framework that uses declarative YAML files instead of UI drag‑and‑drop. It handles schema inference, incremental CDC, and checkpointing out‑of‑the‑box, while traditional ETL tools often require custom scripts for each step.

Can I schedule Sling jobs with Airflow for continuous data pipelines?

Yes. Sling’s CLI can be invoked from an Airflow BashOperator or PythonOperator, allowing you to pass execution dates for incremental loads. The job’s state (success/failure) is captured in Airflow logs, enabling alerting and retries.

Is it possible to combine dbt transformations with a Sling‑based load?

Absolutely. After Sling materializes the Iceberg table, you can call dbt run-operation (or dbt run) to apply models that reference the new table. This creates a seamless “load‑then‑transform” workflow within the same pipeline.

What Spark version is required for Sling to work with Iceberg?

Sling currently supports Spark 3.2 + with the Iceberg Spark runtime (org.apache.iceberg:iceberg-spark-runtime-3.2). Using a compatible version ensures access to Iceberg’s table‑level APIs and ACID guarantees.

How does Sling handle schema changes in the source PostgreSQL table?

Sling reads the source schema at each run and automatically maps new columns to Iceberg’s evolving schema. If a column is dropped, Iceberg retains the historic data (time‑travel) while the new schema omits the column, preserving query compatibility.


Related reading: Original discussion

What do you think?

Have experience with this topic? Drop your thoughts in the comments - I read every single one and love hearing different perspectives!

Comments

Popular posts from this blog

2026 Update: Getting Started with SQL & Databases: A Comp...

Low-Code Isn't Stealing Dev Jobs — It's Changing Them (And That's a Good Thing) Have you noticed how many non-tech folks are building Mission-critical apps lately? Honestly, it's kinda wild — marketing tres creating lead-gen tools, ops managers deploying inventory systems. Sound familiar? But here's the deal: it's not magic, it's low-code development platforms reshaping who gets to play the app-building game. What's With This Low-Code Thing Anyway? So let's break it down. Low-code platforms are visual playgrounds where you drag pre-built components instead of hand-coding everything. Think LEGO blocks for software – connect APIs, design interfaces, and automate workflows with minimal typing. Citizen developers (non-IT pros solving their own problems) are loving it because they don't need a PhD in Java. Recently, platforms like OutSystems and Mendix have exploded because honestly? Everyone needs custom tools faster than traditional codin...

Practical Guide: Getting Started with Data Science: A Com...

Laravel 11 Unpacked: What's New and Why It Matters Still running Laravel 10? Honestly, you might be missing out on some serious upgrades. Let's break down what Laravel 11 brings to the table – and whether it's worth the hype for your PHP framework projects. Because when it comes down to it, staying current can save you headaches later. What's Cooking in Laravel 11? Laravel 11 streamlines things right out of the gate. Gone are the cluttered config files – now you get a leaner, more focused starting point. That means less boilerplate and more actual coding. And here's the kicker: they've baked health routing directly into the framework. So instead of third-party packages for uptime monitoring, you've got built-in /up endpoints. But the real showstopper? Per-second API rate limiting. Remember those clunky custom solutions for throttling requests? Now you can just do: RateLimiter::for('api', function (Request $ 💬 What do you think?...

Applying Conditional Formatting in Excel Using Python

Applying Conditional Formatting in Excel Using Python Did you know that 78 % of data‑driven decisions are missed because users can’t spot trends fast enough? With a few lines of Python, you can turn any ordinary Excel spreadsheet into a visual powerhouse—no manual formatting, no endless clicks, just instant, rule‑based highlights that keep your team on the same page. In This Article What is Conditional Formatting? Setting Up Your Python Environment Core Concepts: Rules, Ranges, and Styles Step‑by‑Step Walkthrough Real‑World Use Cases & Actionable Takeaways Frequently Asked Questions What is Conditional Formatting and Why It Matters Excel’s conditional formatting lets you turn raw numbers into a story. Instead of scrolling through endless rows, you instantly see which sales exceeded targets, which inventory levels are low, or which dates are past due. In my experience, teams that use conditional formatting save hours that would otherwise be spent skimming cells. Whe...