Lightweight ETL on AWS Lambda Using DuckDB and Snowflake Connector
Over 70% of data engineering teams say cold‑start latency is the biggest blocker to serverless ETL. With just a few lines of Python you can replace heavyweight Spark jobs with a lightweight Lambda that runs DuckDB in‑memory and pushes results straight into Snowflake—no EMR, no Glue, no managed Airflow required.
Why a Serverless, DuckDB‑Powered ETL Matters
Imagine a world where you pay for exactly the milliseconds you use instead of a rented cluster that sits idle for hours.
- Cost & scalability: Pay‑per‑invocation pricing eliminates idle clusters; a 5‑minute Lambda can be 10‑20× cheaper than a Spark job on EMR.
- Performance edge: DuckDB's vectorized execution on a single vCPU often out‑performs a multi‑node Spark job for medium‑size CSV/Parquet loads.
- Operational simplicity: No DAG orchestrator (Airflow) and no CI/CD for dbt models—Lambda's built‑in retries and CloudWatch logging give you a self‑contained data pipeline.
Sound familiar? That's the sweet spot for teams who want to keep ETL costs predictable while still delivering fresh data. And you don't have to sacrifice the power of SQL: DuckDB speaks a dialect close enough to Snowflake's that you can prototype locally, then push to production with minimal changes.
Architecture Overview: Lambda ↔ DuckDB ↔ Snowflake
- Ingest raw files from S3 → DuckDB in‑memory analytics → Snowflake “stage” via the Snowflake Python connector.
- Key AWS services: S3 (source), Lambda (compute), Secrets Manager (credentials), CloudWatch (monitoring).
- Security & compliance: IAM least‑privilege role (a policy sketch follows this list), encrypted environment variables, Snowflake private key authentication.
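To make the least‑privilege point concrete, here is a minimal sketch of the execution‑role policy as a Python dict you could feed to boto3 or your IaC tool. The bucket name and secret ARN are placeholders for your own resources, not values from this walkthrough.

```python
import json

# Hypothetical least-privilege policy for the ETL Lambda's execution role:
# read the raw bucket, read the Snowflake secret, and write its own logs.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-raw-bucket/*"},
        {"Effect": "Allow", "Action": "secretsmanager:GetSecretValue",
         "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:snowflake-etl-*"},
        {"Effect": "Allow",
         "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
         "Resource": "*"},
    ],
}

print(json.dumps(policy, indent=2))
```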
Step‑by‑Step Walkthrough (Code‑First)
Below is the minimal Lambda handler that pulls a CSV from S3, runs a DuckDB query, and writes the results to Snowflake. You'll need a custom layer that bundles duckdb and snowflake-connector-python, or you can use a container image if you prefer.
```python
import os
import json

import boto3
import duckdb
import snowflake.connector
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
secrets_client = boto3.client('secretsmanager')


def get_secret(secret_name):
    """Fetch and parse a JSON secret from AWS Secrets Manager."""
    response = secrets_client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])


def lambda_handler(event, context):
    # 1️⃣ Parse the incoming S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # 2️⃣ Download the CSV into /tmp
    local_path = f"/tmp/{os.path.basename(key)}"
    try:
        s3.download_file(bucket, key, local_path)
    except ClientError as e:
        print(f"Error downloading {key}: {e}")
        raise

    # 3️⃣ DuckDB transformation: load the CSV into a table, then aggregate
    con = duckdb.connect(database=':memory:')
    con.execute(f"CREATE TABLE data AS SELECT * FROM read_csv_auto('{local_path}')")
    df = con.execute("""
        SELECT
            user_id,
            COUNT(*) AS page_views,
            MAX(event_ts) AS last_seen
        FROM data
        GROUP BY user_id
    """).fetchdf()

    # 4️⃣ Load to Snowflake
    secret = get_secret(os.getenv('SNOWFLAKE_SECRET'))
    ctx = snowflake.connector.connect(
        user=secret['user'],
        account=secret['account'],
        # the connector expects the private key as DER-encoded bytes;
        # see the key-handling helper below if the secret stores a PEM string
        private_key=secret['private_key'],
        # add warehouse/database/schema parameters as your account requires
    )
    cs = ctx.cursor()
    try:
        cs.execute(
            "CREATE OR REPLACE TEMP TABLE tmp_etl "
            "(user_id INT, page_views INT, last_seen TIMESTAMP)"
        )
        # row-by-row is fine for small frames; for large loads prefer
        # write_pandas() or a staged PUT + COPY INTO
        for row in df.itertuples(index=False):
            cs.execute(
                "INSERT INTO tmp_etl VALUES (%s, %s, %s)",
                (row.user_id, row.page_views, row.last_seen)
            )
        # COPY INTO reads from stages, not tables, so merge from the temp table instead
        cs.execute("INSERT INTO target_table SELECT * FROM tmp_etl")
    finally:
        cs.close()
        ctx.close()

    print(f"ETL completed: {key} → Snowflake")
    return {"status": "success"}
```
*What I love about this pattern is that the entire pipeline lives in a single function, so debugging becomes a matter of looking at CloudWatch logs instead of juggling multiple services.*
✅ Testing & deployment:
- Use sam local invoke to run the handler locally against a mock S3 event (a sketch of such an event follows this list).
- Zip the handler and the layer, upload to Lambda, and set the memory to 512 MB (or 1024 MB if you're pushing multi‑GB files).
- In GitHub Actions, simply add a step that runs sam package and sam deploy.
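For the local run, sam local invoke -e event.json only needs a file that mimics the S3 notification shape the handler parses. A minimal sketch for generating it, with a placeholder bucket and key:

```python
import json

# Minimal S3 put-event payload matching what lambda_handler reads
event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-raw-bucket"},
                "object": {"key": "exports/2024-01-15/events.csv"}}}
    ]
}

with open("event.json", "w") as f:
    json.dump(event, f, indent=2)
```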
Integrating with Existing Orchestration (Airflow / dbt) – When to Keep Them
- Hybrid pipelines: Use Lambda for “micro‑ETL” tasks while Airflow schedules larger batch jobs or dbt model runs.
- Triggering Lambda from Airflow: use PythonOperator or AwsLambdaInvokeFunctionOperator for a seamless hand‑off (a minimal DAG sketch follows this list).
- Version control & testing: Keep transformation logic in dbt models; use DuckDB as a local dev engine that mirrors Snowflake's SQL dialect.
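As a rough sketch of that hand‑off on Airflow 2.x: the operator's import path and name vary by amazon provider version (newer releases call it LambdaInvokeFunctionOperator), and the DAG id, function name, and payload below are placeholders.

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.lambda_function import (
    LambdaInvokeFunctionOperator,  # AwsLambdaInvokeFunctionOperator in older providers
)

with DAG(
    dag_id="micro_etl_trigger",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    run_duckdb_etl = LambdaInvokeFunctionOperator(
        task_id="run_duckdb_etl",
        function_name="duckdb-etl",
        # payload mirrors the S3 event shape the handler expects
        payload=json.dumps({
            "Records": [{"s3": {"bucket": {"name": "my-raw-bucket"},
                                "object": {"key": "exports/latest.csv"}}}]
        }),
    )
```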
Actionable Takeaways & Best‑Practice Checklist
- When to choose this pattern: Data volumes <10 GB per run, latency‑sensitive workloads, or cost‑constrained teams.
- Performance tuning tips: Increase Lambda memory (CPU scales with it), set DuckDB's PRAGMA threads, and batch large loads with Snowflake's PUT/COPY INTO rather than row‑by‑row inserts.
- Monitoring & alerting: CloudWatch metrics for duration & errors, Snowflake query history for downstream verification, Slack/SNS alerts for failures.
- Security: Store Snowflake private key in Secrets Manager, never hard‑code credentials, and give Lambda a minimal IAM policy.
- Testing: Write unit tests for the DuckDB SQL against a small in‑memory dataset (see the sketch after this list); use botocore's Stubber to stub S3 interactions.
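A sketch of that SQL unit test, pytest‑style and with no AWS access needed: the same aggregation asserted against a tiny in‑memory table.

```python
import duckdb

def test_page_view_aggregation():
    con = duckdb.connect(database=":memory:")
    con.execute("CREATE TABLE data (user_id INTEGER, event_ts TIMESTAMP)")
    con.execute("""
        INSERT INTO data VALUES
            (1, '2024-01-01 10:00:00'),
            (1, '2024-01-02 09:30:00'),
            (2, '2024-01-01 12:00:00')
    """)
    rows = con.execute("""
        SELECT user_id, COUNT(*) AS page_views, MAX(event_ts) AS last_seen
        FROM data
        GROUP BY user_id
        ORDER BY user_id
    """).fetchall()
    assert rows[0][:2] == (1, 2)   # user 1 has two page views
    assert rows[1][:2] == (2, 1)   # user 2 has one
```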
Frequently Asked Questions
How does DuckDB compare to Spark for ETL on AWS?
DuckDB runs entirely in‑process, eliminating cluster startup time and JVM overhead. For datasets that fit in the memory of a single Lambda (up to ~10 GB), DuckDB often finishes 2‑5× faster than a Spark job on a modest EMR cluster.
Can I schedule a Lambda‑based ETL with Airflow?
Yes—Airflow can invoke the Lambda via the AwsLambdaInvokeFunctionOperator or through an HTTP endpoint (API Gateway). This lets you keep a single source of truth for orchestration while off‑loading compute to a serverless function.
Is the Snowflake Python connector supported in Lambda layers?
Yes. snowflake-connector-python ships prebuilt manylinux wheels, so bundling the wheel that matches your Lambda runtime's Python version and architecture in a layer (or using a container image) works without compiling native dependencies yourself.
What are the limits of using Lambda for ETL (size, time, concurrency)?
Lambda caps at 15 minutes per invocation and /tmp storage is configurable up to 10 GB. For larger files, split the workload into multiple invocations or stage data in S3 and process chunks in parallel (see the fan‑out sketch below).
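One way to stay under those limits is to fan out: list the objects once and fire one asynchronous invocation per file, so each Lambda handles a chunk that fits in /tmp. A hedged sketch where the bucket, prefix, and function name are placeholders:

```python
import json
import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-raw-bucket", Prefix="exports/2024-01-15/"):
    for obj in page.get("Contents", []):
        event = {"Records": [{"s3": {"bucket": {"name": "my-raw-bucket"},
                                     "object": {"key": obj["Key"]}}}]}
        lam.invoke(
            FunctionName="duckdb-etl",
            InvocationType="Event",  # async: each file is processed independently
            Payload=json.dumps(event),
        )
```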
How do I secure Snowflake credentials in a Lambda function?
Store the Snowflake private key or password in AWS Secrets Manager, grant the Lambda’s execution role secretsmanager:GetSecretValue, and retrieve it at runtime. Avoid hard‑coding credentials in the deployment package.
What do you think?
Have experience with this topic? Drop your thoughts in the comments - I read every single one and love hearing different perspectives!