From Data Quality Checks to Analytics‑Ready Parquet with Python
90 % of data‑driven projects stall because raw data never passes quality gates – and the bottleneck is usually the format conversion step. In this article you’ll see how a handful of Python libraries can turn messy, unverified CSVs into Spark‑ready Parquet files in under 5 minutes, without writing a single custom ETL job. Imagine you’ve just landed a new dataset in your Airflow DAG; instead of wrestling with schema drift, you run a reproducible quality‑check‑and‑convert script and hand the result off to dbt or a downstream Spark job—effortless, auditable, and production‑grade.
Why Data Quality & Format Matter in Modern ETL Pipelines
Bad data inflates debugging time by up to 30 % (source: Gartner). Parquet’s columnar compression reduces storage by 70‑80 % and speeds up Spark scans dramatically. Embedding validation early satisfies audit trails and makes lineage tools trustworthy. And honestly, if that’s not a business‑case, I’m not sure why anyone would chase it.
Core Building Blocks – Python Libraries You’ll Need
- pandas & pyarrow: Fast in‑memory manipulation + native Parquet writer.
- great‑expectations (or pandera): Declarative data‑quality expectations that integrate with CI/CD.
- prefect / airflow operators: Orchestrate checks & conversion as reusable tasks.
Step‑by‑Step Walkthrough: From Raw CSV → Validated Parquet (Code Example)
Below is a self‑contained script that you can drop straight into an Airflow PythonOperator. It loads, profiles, validates, casts, and writes a Parquet file with Snappy compression. The code is pretty much a copy‑paste for your next DAG.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import great_expectations as ge
from pathlib import Path
# 1️⃣ Load raw CSV
raw_path = Path("data/raw/customers.csv")
df = pd.read_csv(raw_path)
# 2️⃣ Define GE expectations (could be loaded from a .yml suite)
expectations = {
"customer_id": {"type": "int64", "unique": True, "not_null": True},
"email": {"type": "string", "not_null": True, "regex": r".+@.+\..+"},
"signup_date": {"type": "datetime64[ns]"},
"revenue": {"type": "float64", "min_value": 0},
}
ge_df = ge.from_pandas(df)
for col, rules in expectations.items():
if rules.get("not_null"):
ge_df.expect_column_values_to_not_be_null(col)
if rules.get("unique"):
ge_df.expect_column_values_to_be_unique(col)
if "type" in rules:
ge_df.expect_column_values_to_be_of_type(col, rules["type"])
if "min_value" in rules:
ge_df.expect_column_values_to_be_greater_than(col, rules["min_value"])
if "regex" in rules:
ge_df.expect_column_values_to_match_regex(col, rules["regex"])
validation_result = ge_df.validate()
if not validation_result.success:
raise ValueError("Data quality checks failed – see logs for details")
# 3️⃣ Cast to final schema (optional but recommended)
schema = {
"customer_id": "int64",
"email": "string",
"signup_date": "datetime64[ns]",
"revenue": "float64",
}
df = df.astype(schema)
# 4️⃣ Write to Parquet (Snappy compression)
table = pa.Table.from_pandas(df, preserve_index=False)
out_path = Path("data/processed/customers.parquet")
pq.write_table(table, out_path, compression="snappy")
print(f"✅ Parquet written to {out_path}")
In my experience, keeping the validation logic separate from the writing logic makes the DAG cleaner and easier to test.
Integrating the Parquet Output into Your Data Stack (dbt, Spark, Lakehouse)
After the script runs, you’re left with a clean Parquet file. dbt models can point to the new Parquet location and run incremental builds that now benefit from columnar pruning. Spark reads it instantly: spark.read.parquet("s3://bucket/data/processed/customers.parquet"). If you’re using Unity Catalog or Hive Metastore, just register the asset so downstream users see the same schema and lineage.
Actionable Takeaways & Best‑Practice Checklist
- Define expectations before ingestion.
- Automate validation in CI.
- Store schema‑enforced Parquet.
- Version‑control quality suites.
- Monitor drift with Airflow alerts.
Next steps? Deploy the DAG to staging, add a dbt test that asserts row counts, and schedule a daily run. The thing is, the most common pitfall is over‑validating—performance takes a hit, but you’ll see the trade‑off quickly. Also, never forget to compress; compression="snappy" is the sweet spot for most workloads.
Frequently Asked Questions
What is the best way to add data‑quality checks to an ETL pipeline in Python?
Use a declarative library like Great Expectations or pandera to define expectations (e.g., column types, ranges, uniqueness). These can be called from Airflow or Prefect tasks, and the validation report can fail the DAG if thresholds aren’t met.
How does Parquet improve Spark performance compared to CSV?
Parquet stores data column‑wise with built‑in compression and statistics, allowing Spark to skip irrelevant columns and prune partitions. Typical scan time drops from minutes to seconds for the same dataset size.
Can I schedule the quality‑check‑and‑convert script with Airflow?
Absolutely—wrap the Python script in a PythonOperator or a custom DockerOperator. Airflow’s XCom can pass validation metrics to downstream tasks or trigger alerts.
Is it possible to version‑control data‑quality expectations like code?
Yes. Store the expectation suites as .yml files in a Git repo, and include them in your CI pipeline (e.g., GitHub Actions) to ensure they don’t regress before deployment.
Do I need Spark to write Parquet, or can pure Python do it?
Pure Python can write Parquet via pyarrow or fastparquet, which is perfect for the “batch‑convert” stage. Spark is only needed when you want distributed processing or to read the files at scale later.
Related reading: Original discussion
Related Articles
- How Excel is Used in Real-word Data Analysis
- How I Learned Excel in My First Week Of Data Science -...
- Show HN: HelixDB – A graph database built on object storage
What do you think?
Have experience with this topic? Drop your thoughts in the comments - I read every single one and love hearing different perspectives!
Comments
Post a Comment