Lightweight ETL on AWS Lambda Using DuckDB and Snowflake Connector

Over 70% of data engineering teams say “cold‑start latency” is the biggest blocker to serverless ETL. With just a few lines of Python you can replace heavyweight Spark jobs with a lightweight Lambda that runs DuckDB in‑memory and pushes results straight into Snowflake—no EMR, no Glue, no managed Airflow required.

Why a Serverless, DuckDB‑Powered ETL Matters

Imagine a world where you pay for exactly the milliseconds you use instead of a rented cluster that sits idle for hours.
  • Cost & scalability: Pay‑per‑invocation pricing eliminates idle clusters; a 5‑minute Lambda can be 10‑20× cheaper than a Spark job on EMR.
  • Performance edge: DuckDB’s vectorized execution on a single vCPU often out‑performs a multi‑node Spark job for medium‑size CSV/Parquet loads.
  • Operational simplicity: No DAG orchestrator (Airflow) or CI/CD for dbt models—Lambda’s built‑in retries and CloudWatch logging give you a self‑contained data pipeline.

Sound familiar? That’s the sweet spot for teams who want to keep ETL costs predictable while still delivering fresh data. And the thing is, you don’t have to sacrifice the power of SQL. DuckDB speaks a SQL dialect close to Snowflake’s, so you can prototype locally, then push to production with minimal changes.

Architecture Overview: Lambda ↔ DuckDB ↔ Snowflake

  • Ingest raw files from S3 → DuckDB in‑memory analytics → Snowflake “stage” via the Snowflake Python connector.
  • Key AWS services: S3 (source), Lambda (compute), Secrets Manager (credentials), CloudWatch (monitoring).
  • Security & compliance: IAM least‑privilege role, encrypted environment variables, Snowflake private key authentication.
Look, the data flow is basically a single straight line, so you avoid the pain of data sharding, cross‑region replication, or costly data movement. If you’re worried about the Lambda memory ceiling, split the job into smaller chunks or use S3 multipart uploads. The Lambda execution role only needs read access to S3 and Secrets Manager; the Snowflake connection itself is authenticated with the key pair, not IAM.
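To make the least‑privilege point concrete, here is a minimal sketch of the execution role’s policy expressed as a Python dict. The bucket name, region, account ID, and secret name are hypothetical placeholders, not values from this post:

```python
import json

# Minimal execution-role policy for the ETL Lambda: read the source bucket,
# fetch the Snowflake secret, and write CloudWatch logs. All ARNs below are
# hypothetical placeholders.
LAMBDA_ETL_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-etl-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:snowflake-*",
        },
        {
            # Logging permissions every Lambda needs for CloudWatch
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "*",
        },
    ],
}

policy_json = json.dumps(LAMBDA_ETL_POLICY, indent=2)
```

Attach this inline policy (or its JSON) to the Lambda’s execution role and nothing else; the function never needs s3:PutObject or broad secretsmanager:* access.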

Step‑by‑Step Walkthrough (Code‑First)

Below is the minimal Lambda handler that pulls a CSV from S3, runs a DuckDB query, and writes the results to Snowflake. You’ll need a custom layer that bundles duckdb and snowflake-connector-python, or you can use a container image if you prefer.
import os
import json
import boto3
import duckdb
import snowflake.connector
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
secrets_client = boto3.client('secretsmanager')

def get_secret(secret_name):
    response = secrets_client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])

def lambda_handler(event, context):
    # 1️⃣ Parse incoming S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key    = event['Records'][0]['s3']['object']['key']

    # 2️⃣ Download CSV into /tmp
    local_path = f"/tmp/{os.path.basename(key)}"
    try:
        s3.download_file(bucket, key, local_path)
    except ClientError as e:
        print(f"Error downloading {key}: {e}")
        raise

    # 3️⃣ DuckDB transformation
    con = duckdb.connect(database=':memory:')
    # read_csv_auto infers the schema; materialize the file as a table first
    con.execute(f"CREATE TABLE data AS SELECT * FROM read_csv_auto('{local_path}')")
    df = con.execute("""
        SELECT 
            user_id,
            COUNT(*) AS page_views,
            MAX(event_ts) AS last_seen
        FROM data
        GROUP BY user_id
    """).fetchdf()

    # 4️⃣ Load to Snowflake
    secret = get_secret(os.getenv('SNOWFLAKE_SECRET'))
    ctx = snowflake.connector.connect(
        user=secret['user'],
        account=secret['account'],
        # The connector expects the private key as DER-encoded bytes, not a
        # PEM string; deserialize the PEM from the secret before passing it.
        # Set warehouse/database/schema here too, or rely on session defaults.
        private_key=secret['private_key']
    )
    cs = ctx.cursor()
    try:
        cs.execute("CREATE OR REPLACE TEMP TABLE tmp_etl (user_id INT, page_views INT, last_seen TIMESTAMP)")
        # Batch the rows in one round-trip instead of one INSERT per row
        cs.executemany(
            "INSERT INTO tmp_etl VALUES (%s, %s, %s)",
            list(df.itertuples(index=False, name=None))
        )
        # COPY INTO only loads from stages; move rows with INSERT ... SELECT
        cs.execute("INSERT INTO target_table SELECT user_id, page_views, last_seen FROM tmp_etl")
    finally:
        cs.close()
        ctx.close()

    print(f"ETL completed: {key} → Snowflake")
    return {"status": "success"}
*What I love about this pattern is that the entire pipeline lives in a single function, so debugging becomes a matter of looking at CloudWatch logs instead of juggling multiple services.*

✅ Testing & deployment:
  • Use sam local invoke to run the handler locally against a mock S3 event.
  • Zip the handler and the layer together, upload to Lambda, and set the memory to 512 MB (or more if you’re pushing multi‑gigabyte files).
  • In GitHub Actions, simply add a step that runs sam package and sam deploy.
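For the sam local invoke step, a minimal mock S3 event is all you need. Here is a sketch; the bucket and key names are hypothetical, and the last two lines mirror the handler’s step 1 to sanity‑check the event shape:

```python
import json

# Minimal S3 PUT-style event; save the JSON below as event.json and run
# `sam local invoke -e event.json`. Bucket and key names are hypothetical.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-etl-bucket"},
                "object": {"key": "raw/2024/users.csv"},
            }
        }
    ]
}

event_json = json.dumps(sample_event, indent=2)  # contents of event.json

# Mirror the handler's step 1 to confirm the fields it reads are present:
bucket = sample_event["Records"][0]["s3"]["bucket"]["name"]
key = sample_event["Records"][0]["s3"]["object"]["key"]
```

If the handler parses this event without a KeyError, the real S3 trigger will work the same way, since S3 notifications wrap the bucket and key in exactly this Records structure.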

Integrating with Existing Orchestration (Airflow / dbt) – When to Keep Them

  • Hybrid pipelines: Use Lambda for “micro‑ETL” tasks while Airflow schedules larger batch jobs or dbt model runs.
  • Triggering Lambda from Airflow: PythonOperator or AwsLambdaInvokeFunctionOperator for seamless hand‑off.
  • Version control & testing: Keep transformation logic in dbt models; use DuckDB as a local dev engine that mirrors Snowflake’s SQL dialect.
So if your team already has an Airflow DAG that pulls data from an API, you can fire off a Lambda that does the heavy lifting and writes to Snowflake. The DAG stays lightweight, and you retain a single source of truth for orchestration.
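As a sketch of that hand‑off, the PythonOperator callable can boil down to a plain boto3 invoke. The function name is a caller‑supplied assumption, and the payload shape matches the S3‑style event the handler in this post expects:

```python
import json


def invoke_etl_lambda(lambda_client, function_name, bucket, key):
    """Fire the ETL Lambda with a synthetic S3-style event.

    `lambda_client` is a boto3 Lambda client, injected as a parameter so the
    function is unit-testable; `function_name`, `bucket`, and `key` come from
    the DAG. The payload mimics a real S3 notification.
    """
    payload = {
        "Records": [
            {"s3": {"bucket": {"name": bucket}, "object": {"key": key}}}
        ]
    }
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # Lambda returns the handler's result as a streaming body
    return json.loads(response["Payload"].read())
```

Inside an Airflow DAG this becomes the python_callable of a PythonOperator; or skip boto3 entirely and use the AwsLambdaInvokeFunctionOperator mentioned above.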

Actionable Takeaways & Best‑Practice Checklist

  1. When to choose this pattern: Data volumes <10 GB per run, latency‑sensitive workloads, or cost‑constrained teams.
  2. Performance tuning tips: Increase Lambda memory (CPU scales with it), enable DuckDB’s PRAGMA threads, and batch inserts with Snowflake’s PUT/COPY INTO instead of row‑by‑row INSERTs.
  3. Monitoring & alerting: CloudWatch metrics for duration & errors, Snowflake query history for downstream verification, Slack/SNS alerts for failures.
  4. Security: Store Snowflake private key in Secrets Manager, never hard‑code credentials, and give Lambda a minimal IAM policy.
  5. Testing: Write unit tests for the DuckDB SQL against a small in‑memory dataset; use boto3 stubs for S3 interactions.
But remember: Lambda’s timeout is 15 minutes, and /tmp defaults to just 512 MB (ephemeral storage is configurable up to 10 GB). If you’re hitting those limits, split the job into smaller functions or orchestrate with Step Functions.
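If you do hit those limits, the split can be as simple as slicing the list of S3 keys into fixed‑size chunks, one invocation (or one Step Functions Map iteration) per chunk. A minimal sketch:

```python
def chunk_keys(keys, chunk_size):
    """Split a list of S3 keys into fixed-size chunks.

    Each chunk becomes the payload of a separate Lambda invocation or one
    iteration of a Step Functions Map state, keeping every run under the
    timeout and /tmp limits.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return [keys[i:i + chunk_size] for i in range(0, len(keys), chunk_size)]


# e.g. fan five files out two at a time:
batches = chunk_keys(["a.csv", "b.csv", "c.csv", "d.csv", "e.csv"], 2)
# batches == [["a.csv", "b.csv"], ["c.csv", "d.csv"], ["e.csv"]]
```

The chunk size becomes your tuning knob: smaller chunks mean more parallelism and safer headroom under the 15‑minute limit, at the cost of more invocations.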

Frequently Asked Questions

How does DuckDB compare to Spark for ETL on AWS?

DuckDB runs entirely in‑process, eliminating cluster startup time and JVM overhead. For datasets that fit in the memory of a single Lambda (up to ~10 GB), DuckDB often finishes 2‑5× faster than a Spark job on a modest EMR cluster.

Can I schedule a Lambda‑based ETL with Airflow?

Yes—Airflow can invoke the Lambda via the AwsLambdaInvokeFunctionOperator or through an HTTP endpoint (API Gateway). This lets you keep a single source of truth for orchestration while off‑loading compute to a serverless function.

Is the Snowflake Python connector supported in Lambda layers?

The connector ships prebuilt wheels that bundle its compiled dependencies; include a Linux‑compatible (manylinux) snowflake-connector-python wheel in a Lambda layer, or use a container image, and no extra native build step is required.

What are the limits of using Lambda for ETL (size, time, concurrency)?

Lambda caps at 15 minutes per invocation; /tmp defaults to 512 MB and can be configured up to 10 GB of ephemeral storage, and concurrency is governed by your account’s regional limit. For larger files, split the workload into multiple invocations or stage data in S3 and process chunks in parallel.

How do I secure Snowflake credentials in a Lambda function?

Store the Snowflake private key or password in AWS Secrets Manager, grant the Lambda’s execution role secretsmanager:GetSecretValue, and retrieve it at runtime. Avoid hard‑coding credentials in the deployment package.


Related reading: Original discussion

What do you think?

Have experience with this topic? Drop your thoughts in the comments - I read every single one and love hearing different perspectives!
