Building My First End-to-End ETL Pipeline with Airflow,...

Building My First End‑to‑End ETL Pipeline with Airflow, BigQuery, and Docker

Over 70 % of data‑driven companies say their biggest bottleneck is moving data from source to analytics – and 90 % of those bottlenecks are solved with a well‑orchestrated ETL pipeline. In this guide you’ll spin up a production‑grade, reproducible ETL pipeline **from zero to queryable data in BigQuery** in under an hour—without writing a single Spark job. If you’ve ever wrestled with ad‑hoc scripts that break on the next schema change, this step‑by‑step walkthrough shows how Docker, Airflow, and BigQuery turn chaos into a repeatable, version‑controlled workflow.

1️⃣ Why an End‑to‑End ETL Pipeline Matters Today

Business results move fast, and a slow data pipeline can halt decisions. Faster time‑to‑insight means analysts can stay ahead of the curve, and it cuts manual effort dramatically. Technically, ETL lets you decouple extraction, transformation, and loading, so each piece lives in its own environment and can be version‑controlled. Airflow adds built‑in monitoring, so you can see exactly why a task failed. A retail analytics team, for instance, slashed nightly load times from 3 hrs to 15 min after shifting to Airflow + BigQuery. In my experience, that single change freed up analysts to focus on modeling rather than ETL bugs.

2️⃣ Setting Up the Foundations: Docker + Airflow + BigQuery

Dockerizing Airflow is a breeze with the official apache/airflow image. All you need is a docker-compose.yml that mounts a dags/ folder, sets the necessary environment variables, and exposes the webserver.
Here’s a quick snapshot:

services:
  airflow:
    image: apache/airflow:2.8.0
    environment:
      - "AIRFLOW__CORE__LOAD_EXAMPLES=False"
      - "AIRFLOW__CORE__EXECUTOR=LocalExecutor"
      - "GOOGLE_APPLICATION_CREDENTIALS=/secrets/credentials.json"
    volumes:
      - ./dags:/opt/airflow/dags
      - ./secrets:/secrets
    ports:
      - "8080:8080"

Configure BigQuery credentials by creating a service‑account JSON, placing it in secrets/credentials.json, and granting the account the BigQuery Data Editor role. The Airflow BigQueryInsertJobOperator automatically picks up the credentials from the environment variable. Once docker-compose up -d finishes, you can navigate to localhost:8080 and confirm the webserver and scheduler are healthy.

3️⃣ Building the ETL Logic (Code Walkthrough)

The heart of the pipeline is a 30‑line DAG that does three things: pulls a CSV from S3, cleans it in Pandas, and streams the result into a partitioned BigQuery table. Here’s the full code:

from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3DownloadFileOperator
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from datetime import datetime, timedelta
import pandas as pd

default_args = {
    'owner': 'data_eng',
    'depends_on_past': False,
    'email_on_failure': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    dag_id='etl_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    download = S3DownloadFileOperator(
        task_id='download_from_s3',
        bucket='public-data',
        key='sales/2026-{{ ds }}.csv',
        local_file='/tmp/sales.csv',
    )

    def clean_dataframe(**kwargs):
        df = pd.read_csv('/tmp/sales.csv')
        df['sale_date'] = pd.to_datetime(df['sale_date'])
        df['amount'] = df['amount'].astype(float).fillna(0)
        kwargs['ti'].xcom_push(key='cleaned_df', value=df.to_json())

    transform = PythonOperator(
        task_id='clean_dataframe',
        python_callable=clean_dataframe,
    )

    load = BigQueryInsertJobOperator(
        task_id='bq_load',
        configuration={
            "load": {
                "destinationTable": {
                    "projectId": "my-gcp-project",
                    "datasetId": "sales",
                    "tableId": "daily_sales",
                },
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
                "writeDisposition": "WRITE_TRUNCATE",
                "timePartitioning": {"type": "DAY"},
            }
        },
        params={
            "json": "{{ ti.xcom_pull(task_ids='clean_dataframe', key='cleaned_df') }}"
        },
    )

    download >> transform >> load

The DAG is tiny, but it covers the entire ETL spectrum. The download_from_s3 task pulls the newest file, clean_dataframe does the heavy lifting in Pandas, and bq_load writes the cleaned data into BigQuery. Because the table is partitioned by day, future runs will just append new partitions.

4️⃣ Enhancing the Pipeline with dbt & Spark (Optional Extensions)

If you find yourself writing more complex transformations, dbt is a natural fit. You can keep raw loads in BigQuery and let dbt run SQL models on top, adding tests and documentation. Spark steps into the picture when you hit the 10 GB‑plus threshold or need distributed joins that Pandas can’t handle. A common pattern is to trigger a DockerOperator after the load finishes:

spark_job = DockerOperator(
    task_id='spark_submit',
    image='gcr.io/spark-operator:latest',
    api_version='auto',
    auto_remove=True,
    command='spark-submit /app/process.py',
    docker_url='unix://var/run/docker.sock',
    network_mode='bridge',
    volumes=[f'{os.getcwd()}/spark:/app'],
)

The upshot? Your pipeline stays modular, and you can swap out heavy lifting for a cluster when needed.

5️⃣ Actionable Takeaways & Next Steps

Docker + Airflow + BigQuery = a solid first ETL stack.
Keep credentials out of code and use GOOGLE_APPLICATION_CREDENTIALS for security.
Use Airflow’s UI to spot failures fast; set retry_delay to avoid hammering services.
CI/CD with GitHub Actions and secret management (Vault, GCP Secret Manager) ready your pipeline for production.
Future enhancements: incremental loads, data quality checks, alerts, and a Light‑weight front‑end (Looker, Metabase).

Frequently Asked Questions

What is an ETL pipeline and how does Airflow fit in?

ETL (Extract‑Transform‑Load) moves data from source systems, reshapes it, and stores it in a target warehouse. Airflow orchestrates each step as tasks, handling retries, dependencies, and scheduling so the pipeline runs reliably.

How do I connect Airflow to BigQuery securely?

Store a service‑account JSON file outside the container, set GOOGLE_APPLICATION_CREDENTIALS as an environment variable, and grant the account BigQuery Data Editor (or more restrictive) role. Airflow’s BigQueryInsertJobOperator automatically picks up the credentials.

Can I run this ETL pipeline in production without Docker?

Yes, you can install Airflow on a VM or Kubernetes, but Docker guarantees reproducible environments and isolates dependencies—critical for scaling and for avoiding “works on my machine” bugs.

When should I use dbt instead of Python for transformations?

Use dbt when transformations are SQL‑centric, you want version‑controlled models, automated tests, and documentation. It complements Python‑based extracts and loads, keeping business logic in the warehouse.

Is Spark necessary for this “first” ETL pipeline?

Not for small‑to‑medium datasets; Pandas inside Airflow is sufficient. Spark becomes valuable when processing > 10 GB files, needing distributed computing, or when you already have a Spark cluster you want to leverage.

Applying Conditional Formatting in Excel Using Python

Applying Conditional Formatting in Excel Using Python Did you know that 78 % of data‑driven decisions are missed because users can’t spot trends fast enough? With a few lines of Python, you can turn any ordinary Excel spreadsheet into a visual powerhouse—no manual formatting, no endless clicks, just instant, rule‑based highlights that keep your team on the same page. In This Article What is Conditional Formatting? Setting Up Your Python Environment Core Concepts: Rules, Ranges, and Styles Step‑by‑Step Walkthrough Real‑World Use Cases & Actionable Takeaways Frequently Asked Questions What is Conditional Formatting and Why It Matters Excel’s conditional formatting lets you turn raw numbers into a story. Instead of scrolling through endless rows, you instantly see which sales exceeded targets, which inventory levels are low, or which dates are past due. In my experience, teams that use conditional formatting save hours that would otherwise be spent skimming cells. Whe...

Code & Crumbs

Search This Blog