Load PostgreSQL into Apache Iceberg with Sling
Did you know more than 70 % of modern ETL pipelines still copy data row‑by‑row, throttling performance and inflating cloud costs? What if you could stream PostgreSQL tables directly into an open‑format lakehouse (Apache Iceberg) with a single, declarative command? Enter Sling – the lightweight bridge that turns a messy PostgreSQL‑to‑Iceberg load into a production‑grade, Airflow‑orchestrated data pipeline.
Why Move PostgreSQL Data to Apache Iceberg?
Iceberg brings schema evolution, hidden partitioning, and time‑travel – all baked into a columnar store. In the past few months, I've seen teams cut query latency from seconds to a fraction of a second just by swapping a relational dump for an Iceberg snapshot.
Cost-wise, columnar storage reduces storage spend by filtering out unused columns. Plus, predicate push‑down means you only scan what you need, saving compute credits on cloud warehouses.
Finally, a lakehouse lets Spark, Presto, dbt, and even BI tools read the same source. That means no duplicated ETL, no data silos, and a single source of truth for downstream analytics.
Introducing Sling: The ETL‑as‑Code Engine
Sling is a CLI‑first, YAML‑driven framework built on Spark. It abstracts connection handling, schema mapping, and incremental loads, so you write almost nothing but a few lines of configuration.
- sling.yml – your pipeline definition.
- Connectors – pre‑built adapters like PostgreSQL → Iceberg.
- Checkpointing – built‑in progress tracking to avoid reprocessing.
You can drop Sling into an Airflow DAG, invoke it from a dbt run‑operation, or run it as a standalone Spark job. The flexibility means you keep your existing orchestration tools while gaining Iceberg’s ACID guarantees.
Step‑by‑Step Walkthrough: Loading PostgreSQL → Iceberg with Sling
Below is a minimal, production‑ready Sling configuration. It pulls public.sales from PostgreSQL and writes it into lake.sales in Iceberg, using incremental CDC based on an updated_at timestamp.
# sling.yml
pipeline:
name: load-postgres-to-iceberg
source:
type: postgresql
host: "db.example.com"
port: 5432
database: "salesdb"
user: "etl_user"
password: "secret" # use Vault or env vars in prod
table: "public.sales"
# incremental key
incremental:
column: "updated_at"
start: "{{ prev_run_end }}" # Sling will replace with checkpoint
target:
type: iceberg
warehouse: "s3://my-iceberg-warehouse/"
database: "lake"
table: "sales"
format: "parquet"
partitioning: "year(updated_at), month(updated_at)"
checkpoint:
path: "s3://my-iceberg-warehouse/_checkpoints/postgres_sales.json"
1. Set up the environment – Spark 3.x, Java 11, and Sling installation (pip install sling-cli). Make sure the Iceberg Spark runtime is on the classpath.
2. Run locally – sling run sling.yml. Sling will read the source schema, map columns, and write the first snapshot into Iceberg.
3. Airflow orchestration – Create a BashOperator that calls the same CLI, passing the execution date for incremental loads.
# airflow_dag.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'data_eng',
'depends_on_past': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
with DAG('load_postgres_to_iceberg',
default_args=default_args,
schedule_interval='@daily',
start_date=datetime(2026, 1, 1),
catchup=False) as dag:
sling_task = BashOperator(
task_id='sling_load',
bash_command='sling run sling.yml --env execution_date=${{ ds }}',
)
4. Validate – In Spark Shell, run DESCRIBE HISTORY lake.sales to see the snapshot timeline, and SELECT * FROM lake.sales LIMIT 10 to confirm data integrity.
Real‑World Impact: From Prototype to Production‑Scale ETL
Take a fintech startup that moved from nightly PostgreSQL dumps to Sling‑powered Iceberg loads. The nightly load time dropped from 3 hrs to 12 min – a 15× improvement. Storage costs fell by 38 % because Iceberg trimmed unused columns and leveraged compression. The team also added a Great Expectations quality check that ran after each Sling job, catching anomalies before they hit downstream models.
Operationally, schema evolution is painless. When customer_id changed type, Sling automatically added the new column to the Iceberg table without breaking existing queries. Downstream dbt models continued to work, because Iceberg preserves historic data and offers time‑travel queries.
Actionable Takeaways & Next Steps
- Checklist – Verify Spark 3.x, Java 11, Iceberg runtime, and secure credentials (env vars or Vault).
- Best practices – Use a granular incremental key (e.g.,
updated_at), partition by year/month for faster pruning, and enable checkpointing to avoid duplicates. - Monitoring – Airflow logs give you job status; Iceberg’s
DESCRIBE HISTORYshows snapshot creation times, and you can expose metrics via Prometheus. - Scale up – Add more tables by replicating the
sling.ymlpattern; orchestrate with a single DAG that loops over a list of configs. - Future enhancements – Combine Sling with dbt run‑operation for post‑load transformations, add Great Expectations for data quality, and explore Iceberg’s
MERGE INTOfor upserts instead of append-only writes.
Frequently Asked Questions
How does Sling differ from traditional ETL tools for loading PostgreSQL into Iceberg?
Sling is a code‑first, Spark‑powered framework that uses declarative YAML files instead of UI drag‑and‑drop. It handles schema inference, incremental CDC, and checkpointing out‑of‑the‑box, while traditional ETL tools often require custom scripts for each step.
Can I schedule Sling jobs with Airflow for continuous data pipelines?
Yes. Sling’s CLI can be invoked from an Airflow BashOperator or PythonOperator, allowing you to pass execution dates for incremental loads. The job’s state (success/failure) is captured in Airflow logs, enabling alerting and retries.
Is it possible to combine dbt transformations with a Sling‑based load?
Absolutely. After Sling materializes the Iceberg table, you can call dbt run-operation (or dbt run) to apply models that reference the new table. This creates a seamless “load‑then‑transform” workflow within the same pipeline.
What Spark version is required for Sling to work with Iceberg?
Sling currently supports Spark 3.2 + with the Iceberg Spark runtime (org.apache.iceberg:iceberg-spark-runtime-3.2). Using a compatible version ensures access to Iceberg’s table‑level APIs and ACID guarantees.
How does Sling handle schema changes in the source PostgreSQL table?
Sling reads the source schema at each run and automatically maps new columns to Iceberg’s evolving schema. If a column is dropped, Iceberg retains the historic data (time‑travel) while the new schema omits the column, preserving query compatibility.
Related reading: Original discussion
What do you think?
Have experience with this topic? Drop your thoughts in the comments - I read every single one and love hearing different perspectives!
Comments
Post a Comment