Lightweight ETL on AWS Lambda Using DuckDB and Snowflake Connector
Over 70% of data engineering teams say cold‑start latency is the biggest blocker to serverless ETL. With just a few lines of Python you can replace heavyweight Spark jobs with a lightweight Lambda that runs DuckDB in‑memory and pushes results straight into Snowflake—no EMR, no Glue, no managed Airflow required.
Why a Serverless, DuckDB‑Powered ETL Matters
Imagine a world where you pay for exactly the milliseconds you use instead of a rented cluster that sits idle for hours.
- Cost & scalability: Pay‑per‑invocation pricing eliminates idle clusters; a 5‑minute Lambda can be 10‑20× cheaper than a Spark job on EMR.
- Performance edge: DuckDB's vectorized execution on a single vCPU often out‑performs a multi‑node Spark job for medium‑size CSV/Parquet loads.
- Operational simplicity: No DAG orchestrator (Airflow) and no CI/CD for dbt models—Lambda's built‑in retries and CloudWatch logging give you a self‑contained data pipeline.
Sound familiar? That's the sweet spot for teams who want to keep ETL costs predictable while still delivering fresh data. And you don't have to sacrifice the power of SQL: DuckDB speaks a dialect close enough to Snowflake's that you can prototype locally, then push to production with minimal changes.
Architecture Overview: Lambda ↔ DuckDB ↔ Snowflake
- Ingest raw files from S3 → DuckDB in‑memory analytics → Snowflake “stage” via the Snowflake Python connector.
- Key AWS services: S3 (source), Lambda (compute), Secrets Manager (credentials), CloudWatch (monitoring).
- Security & compliance: IAM least‑privilege role (a policy sketch follows this list), encrypted environment variables, Snowflake private key authentication.
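To make the least‑privilege point concrete, here is a minimal sketch of the execution‑role policy as a Python dict you could feed to boto3 or your IaC tool. The bucket name and secret ARN are placeholders for your own resources, not values from this walkthrough.

```python
import json

# Hypothetical least-privilege policy for the ETL Lambda's execution role:
# read the raw bucket, read the Snowflake secret, and write its own logs.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-raw-bucket/*"},
        {"Effect": "Allow", "Action": "secretsmanager:GetSecretValue",
         "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:snowflake-etl-*"},
        {"Effect": "Allow",
         "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
         "Resource": "*"},
    ],
}

print(json.dumps(policy, indent=2))
```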
Step‑by‑Step Walkthrough (Code‑First)
Below is the minimal Lambda handler that pulls a CSV from S3, runs a DuckDB query, and writes the results to Snowflake. You'll need a custom layer that bundles duckdb and snowflake-connector-python, or you can use a container image if you prefer.
```python
import os
import json

import boto3
import duckdb
import snowflake.connector
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
secrets_client = boto3.client('secretsmanager')


def get_secret(secret_name):
    """Fetch and parse a JSON secret from AWS Secrets Manager."""
    response = secrets_client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])


def lambda_handler(event, context):
    # 1️⃣ Parse the incoming S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # 2️⃣ Download the CSV into /tmp
    local_path = f"/tmp/{os.path.basename(key)}"
    try:
        s3.download_file(bucket, key, local_path)
    except ClientError as e:
        print(f"Error downloading {key}: {e}")
        raise

    # 3️⃣ DuckDB transformation: load the CSV into a table, then aggregate
    con = duckdb.connect(database=':memory:')
    con.execute(f"CREATE TABLE data AS SELECT * FROM read_csv_auto('{local_path}')")
    df = con.execute("""
        SELECT
            user_id,
            COUNT(*) AS page_views,
            MAX(event_ts) AS last_seen
        FROM data
        GROUP BY user_id
    """).fetchdf()

    # 4️⃣ Load to Snowflake
    secret = get_secret(os.getenv('SNOWFLAKE_SECRET'))
    ctx = snowflake.connector.connect(
        user=secret['user'],
        account=secret['account'],
        # the connector expects the private key as DER-encoded bytes;
        # see the key-handling helper below if the secret stores a PEM string
        private_key=secret['private_key'],
        # add warehouse/database/schema parameters as your account requires
    )
    cs = ctx.cursor()
    try:
        cs.execute(
            "CREATE OR REPLACE TEMP TABLE tmp_etl "
            "(user_id INT, page_views INT, last_seen TIMESTAMP)"
        )
        # row-by-row is fine for small frames; for large loads prefer
        # write_pandas() or a staged PUT + COPY INTO
        for row in df.itertuples(index=False):
            cs.execute(
                "INSERT INTO tmp_etl VALUES (%s, %s, %s)",
                (row.user_id, row.page_views, row.last_seen)
            )
        # COPY INTO reads from stages, not tables, so merge from the temp table instead
        cs.execute("INSERT INTO target_table SELECT * FROM tmp_etl")
    finally:
        cs.close()
        ctx.close()

    print(f"ETL completed: {key} → Snowflake")
    return {"status": "success"}
```
*What I love about this pattern is that the entire pipeline lives in a single function, so debugging becomes a matter of looking at CloudWatch logs instead of juggling multiple services.*
✅ Testing & deployment:
- Use sam local invoke to run the handler locally against a mock S3 event (a sketch of such an event follows this list).
- Zip the handler and the layer, upload to Lambda, and set the memory to 512 MB (or 1024 MB if you're pushing multi‑GB files).
- In GitHub Actions, simply add a step that runs sam package and sam deploy.
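For the local run, sam local invoke -e event.json only needs a file that mimics the S3 notification shape the handler parses. A minimal sketch for generating it, with a placeholder bucket and key:

```python
import json

# Minimal S3 put-event payload matching what lambda_handler reads
event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-raw-bucket"},
                "object": {"key": "exports/2024-01-15/events.csv"}}}
    ]
}

with open("event.json", "w") as f:
    json.dump(event, f, indent=2)
```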
Integrating with Existing Orchestration (Airflow / dbt) – When to Keep Them
- Hybrid pipelines: Use Lambda for “micro‑ETL” tasks while Airflow schedules larger batch jobs or dbt model runs.
- Triggering Lambda from Airflow: use PythonOperator or AwsLambdaInvokeFunctionOperator for a seamless hand‑off (a minimal DAG sketch follows this list).
- Version control & testing: Keep transformation logic in dbt models; use DuckDB as a local dev engine that mirrors Snowflake's SQL dialect.
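As a rough sketch of that hand‑off on Airflow 2.x: the operator's import path and name vary by amazon provider version (newer releases call it LambdaInvokeFunctionOperator), and the DAG id, function name, and payload below are placeholders.

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.lambda_function import (
    LambdaInvokeFunctionOperator,  # AwsLambdaInvokeFunctionOperator in older providers
)

with DAG(
    dag_id="micro_etl_trigger",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    run_duckdb_etl = LambdaInvokeFunctionOperator(
        task_id="run_duckdb_etl",
        function_name="duckdb-etl",
        # payload mirrors the S3 event shape the handler expects
        payload=json.dumps({
            "Records": [{"s3": {"bucket": {"name": "my-raw-bucket"},
                                "object": {"key": "exports/latest.csv"}}}]
        }),
    )
```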
Actionable Takeaways & Best‑Practice Checklist
- When to choose this pattern: Data volumes <10 GB per run, latency‑sensitive workloads, or cost‑constrained teams.
- Performance tuning tips: Increase Lambda memory (CPU scales with it), set DuckDB's PRAGMA threads, and batch large loads with Snowflake's PUT/COPY INTO rather than row‑by‑row inserts.
- Monitoring & alerting: CloudWatch metrics for duration & errors, Snowflake query history for downstream verification, Slack/SNS alerts for failures.
- Security: Store Snowflake private key in Secrets Manager, never hard‑code credentials, and give Lambda a minimal IAM policy.
- Testing: Write unit tests for the DuckDB SQL against a small in‑memory dataset (see the sketch after this list); use botocore's Stubber to stub S3 interactions.
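A sketch of that SQL unit test, pytest‑style and with no AWS access needed: the same aggregation asserted against a tiny in‑memory table.

```python
import duckdb

def test_page_view_aggregation():
    con = duckdb.connect(database=":memory:")
    con.execute("CREATE TABLE data (user_id INTEGER, event_ts TIMESTAMP)")
    con.execute("""
        INSERT INTO data VALUES
            (1, '2024-01-01 10:00:00'),
            (1, '2024-01-02 09:30:00'),
            (2, '2024-01-01 12:00:00')
    """)
    rows = con.execute("""
        SELECT user_id, COUNT(*) AS page_views, MAX(event_ts) AS last_seen
        FROM data
        GROUP BY user_id
        ORDER BY user_id
    """).fetchall()
    assert rows[0][:2] == (1, 2)   # user 1 has two page views
    assert rows[1][:2] == (2, 1)   # user 2 has one
```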
Frequently Asked Questions
How does DuckDB compare to Spark for ETL on AWS?
DuckDB runs entirely in‑process, eliminating cluster startup time and JVM overhead. For datasets that fit in the memory of a single Lambda (up to ~10 GB), DuckDB often finishes 2‑5× faster than a Spark job on a modest EMR cluster.
Can I schedule a Lambda‑based ETL with Airflow?
Yes—Airflow can invoke the Lambda via the AwsLambdaInvokeFunctionOperator or through an HTTP endpoint (API Gateway). This lets you keep a single source of truth for orchestration while off‑loading compute to a serverless function.
Is the Snowflake Python connector supported in Lambda layers?
Yes. snowflake-connector-python ships prebuilt manylinux wheels, so bundling the wheel that matches your Lambda runtime's Python version and architecture in a layer (or using a container image) works without compiling native dependencies yourself.
What are the limits of using Lambda for ETL (size, time, concurrency)?
Lambda caps at 15 minutes per invocation and /tmp storage is configurable up to 10 GB. For larger files, split the workload into multiple invocations or stage data in S3 and process chunks in parallel (see the fan‑out sketch below).
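One way to stay under those limits is to fan out: list the objects once and fire one asynchronous invocation per file, so each Lambda handles a chunk that fits in /tmp. A hedged sketch where the bucket, prefix, and function name are placeholders:

```python
import json
import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-raw-bucket", Prefix="exports/2024-01-15/"):
    for obj in page.get("Contents", []):
        event = {"Records": [{"s3": {"bucket": {"name": "my-raw-bucket"},
                                     "object": {"key": obj["Key"]}}}]}
        lam.invoke(
            FunctionName="duckdb-etl",
            InvocationType="Event",  # async: each file is processed independently
            Payload=json.dumps(event),
        )
```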
How do I secure Snowflake credentials in a Lambda function?
Store the Snowflake private key or password in AWS Secrets Manager, grant the Lambda’s execution role secretsmanager:GetSecretValue, and retrieve it at runtime. Avoid hard‑coding credentials in the deployment package.
What do you think?
Have experience with this topic? Drop your thoughts in the comments - I read every single one and love hearing different perspectives!