Skip to main content

From Data Quality Checks to Analytics-Ready Parquet with...

From Data Quality Checks to Analytics-Ready Parquet with...

From Data Quality Checks to Analytics‑Ready Parquet with Python

90 % of data‑driven projects stall because raw data never passes quality gates – and the bottleneck is usually the format conversion step. In this article you’ll see how a handful of Python libraries can turn messy, unverified CSVs into Spark‑ready Parquet files in under 5 minutes, without writing a single custom ETL job. Imagine you’ve just landed a new dataset in your Airflow DAG; instead of wrestling with schema drift, you run a reproducible quality‑check‑and‑convert script and hand the result off to dbt or a downstream Spark job—effortless, auditable, and production‑grade.

Why Data Quality & Format Matter in Modern ETL Pipelines

Bad data inflates debugging time by up to 30 % (source: Gartner). Parquet’s columnar compression reduces storage by 70‑80 % and speeds up Spark scans dramatically. Embedding validation early satisfies audit trails and makes lineage tools trustworthy. And honestly, if that’s not a business‑case, I’m not sure why anyone would chase it.

Core Building Blocks – Python Libraries You’ll Need

  • pandas & pyarrow: Fast in‑memory manipulation + native Parquet writer.
  • great‑expectations (or pandera): Declarative data‑quality expectations that integrate with CI/CD.
  • prefect / airflow operators: Orchestrate checks & conversion as reusable tasks.

Step‑by‑Step Walkthrough: From Raw CSV → Validated Parquet (Code Example)

Below is a self‑contained script that you can drop straight into an Airflow PythonOperator. It loads, profiles, validates, casts, and writes a Parquet file with Snappy compression. The code is pretty much a copy‑paste for your next DAG.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import great_expectations as ge
from pathlib import Path

# 1️⃣ Load raw CSV
raw_path = Path("data/raw/customers.csv")
df = pd.read_csv(raw_path)

# 2️⃣ Define GE expectations (could be loaded from a .yml suite)
expectations = {
    "customer_id": {"type": "int64", "unique": True, "not_null": True},
    "email": {"type": "string", "not_null": True, "regex": r".+@.+\..+"},
    "signup_date": {"type": "datetime64[ns]"},
    "revenue": {"type": "float64", "min_value": 0},
}
ge_df = ge.from_pandas(df)
for col, rules in expectations.items():
    if rules.get("not_null"):
        ge_df.expect_column_values_to_not_be_null(col)
    if rules.get("unique"):
        ge_df.expect_column_values_to_be_unique(col)
    if "type" in rules:
        ge_df.expect_column_values_to_be_of_type(col, rules["type"])
    if "min_value" in rules:
        ge_df.expect_column_values_to_be_greater_than(col, rules["min_value"])
    if "regex" in rules:
        ge_df.expect_column_values_to_match_regex(col, rules["regex"])

validation_result = ge_df.validate()
if not validation_result.success:
    raise ValueError("Data quality checks failed – see logs for details")

# 3️⃣ Cast to final schema (optional but recommended)
schema = {
    "customer_id": "int64",
    "email": "string",
    "signup_date": "datetime64[ns]",
    "revenue": "float64",
}
df = df.astype(schema)

# 4️⃣ Write to Parquet (Snappy compression)
table = pa.Table.from_pandas(df, preserve_index=False)
out_path = Path("data/processed/customers.parquet")
pq.write_table(table, out_path, compression="snappy")
print(f"✅ Parquet written to {out_path}")

In my experience, keeping the validation logic separate from the writing logic makes the DAG cleaner and easier to test.

Integrating the Parquet Output into Your Data Stack (dbt, Spark, Lakehouse)

After the script runs, you’re left with a clean Parquet file. dbt models can point to the new Parquet location and run incremental builds that now benefit from columnar pruning. Spark reads it instantly: spark.read.parquet("s3://bucket/data/processed/customers.parquet"). If you’re using Unity Catalog or Hive Metastore, just register the asset so downstream users see the same schema and lineage.

Actionable Takeaways & Best‑Practice Checklist

  • Define expectations before ingestion.
  • Automate validation in CI.
  • Store schema‑enforced Parquet.
  • Version‑control quality suites.
  • Monitor drift with Airflow alerts.

Next steps? Deploy the DAG to staging, add a dbt test that asserts row counts, and schedule a daily run. The thing is, the most common pitfall is over‑validating—performance takes a hit, but you’ll see the trade‑off quickly. Also, never forget to compress; compression="snappy" is the sweet spot for most workloads.

Frequently Asked Questions

What is the best way to add data‑quality checks to an ETL pipeline in Python?

Use a declarative library like Great Expectations or pandera to define expectations (e.g., column types, ranges, uniqueness). These can be called from Airflow or Prefect tasks, and the validation report can fail the DAG if thresholds aren’t met.

How does Parquet improve Spark performance compared to CSV?

Parquet stores data column‑wise with built‑in compression and statistics, allowing Spark to skip irrelevant columns and prune partitions. Typical scan time drops from minutes to seconds for the same dataset size.

Can I schedule the quality‑check‑and‑convert script with Airflow?

Absolutely—wrap the Python script in a PythonOperator or a custom DockerOperator. Airflow’s XCom can pass validation metrics to downstream tasks or trigger alerts.

Is it possible to version‑control data‑quality expectations like code?

Yes. Store the expectation suites as .yml files in a Git repo, and include them in your CI pipeline (e.g., GitHub Actions) to ensure they don’t regress before deployment.

Do I need Spark to write Parquet, or can pure Python do it?

Pure Python can write Parquet via pyarrow or fastparquet, which is perfect for the “batch‑convert” stage. Spark is only needed when you want distributed processing or to read the files at scale later.


Related reading: Original discussion

Related Articles

What do you think?

Have experience with this topic? Drop your thoughts in the comments - I read every single one and love hearing different perspectives!

Comments

Popular posts from this blog

2026 Update: Getting Started with SQL & Databases: A Comp...

Low-Code Isn't Stealing Dev Jobs — It's Changing Them (And That's a Good Thing) Have you noticed how many non-tech folks are building Mission-critical apps lately? Honestly, it's kinda wild — marketing tres creating lead-gen tools, ops managers deploying inventory systems. Sound familiar? But here's the deal: it's not magic, it's low-code development platforms reshaping who gets to play the app-building game. What's With This Low-Code Thing Anyway? So let's break it down. Low-code platforms are visual playgrounds where you drag pre-built components instead of hand-coding everything. Think LEGO blocks for software – connect APIs, design interfaces, and automate workflows with minimal typing. Citizen developers (non-IT pros solving their own problems) are loving it because they don't need a PhD in Java. Recently, platforms like OutSystems and Mendix have exploded because honestly? Everyone needs custom tools faster than traditional codin...

Practical Guide: Getting Started with Data Science: A Com...

Laravel 11 Unpacked: What's New and Why It Matters Still running Laravel 10? Honestly, you might be missing out on some serious upgrades. Let's break down what Laravel 11 brings to the table – and whether it's worth the hype for your PHP framework projects. Because when it comes down to it, staying current can save you headaches later. What's Cooking in Laravel 11? Laravel 11 streamlines things right out of the gate. Gone are the cluttered config files – now you get a leaner, more focused starting point. That means less boilerplate and more actual coding. And here's the kicker: they've baked health routing directly into the framework. So instead of third-party packages for uptime monitoring, you've got built-in /up endpoints. But the real showstopper? Per-second API rate limiting. Remember those clunky custom solutions for throttling requests? Now you can just do: RateLimiter::for('api', function (Request $ 💬 What do you think?...

Applying Conditional Formatting in Excel Using Python

Applying Conditional Formatting in Excel Using Python Did you know that 78 % of data‑driven decisions are missed because users can’t spot trends fast enough? With a few lines of Python, you can turn any ordinary Excel spreadsheet into a visual powerhouse—no manual formatting, no endless clicks, just instant, rule‑based highlights that keep your team on the same page. In This Article What is Conditional Formatting? Setting Up Your Python Environment Core Concepts: Rules, Ranges, and Styles Step‑by‑Step Walkthrough Real‑World Use Cases & Actionable Takeaways Frequently Asked Questions What is Conditional Formatting and Why It Matters Excel’s conditional formatting lets you turn raw numbers into a story. Instead of scrolling through endless rows, you instantly see which sales exceeded targets, which inventory levels are low, or which dates are past due. In my experience, teams that use conditional formatting save hours that would otherwise be spent skimming cells. Whe...