You can have the most elegant pipeline architecture in the world; if business stakeholders cannot trust the numbers, none of it matters. Quality and governance are what separate "data infrastructure" from a "data platform".
Six Dimensions of Data Quality
| Dimension | Question it answers | Example test |
|---|---|---|
| Freshness | Is the data recent enough? | max(updated_at) within last 1 hour |
| Volume | Is the row count reasonable? | row count between 0.8x and 1.2x of 7-day average |
| Schema | Are columns/types as expected? | columns and types match contract |
| Accuracy | Are values correct? | amount > 0; status in known set |
| Completeness | Are required fields populated? | customer_id IS NOT NULL |
| Uniqueness | Are keys unique? | no duplicate order_id |
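The freshness and volume rows of the table translate almost directly into code. The following is a minimal sketch, assuming you already have the latest `updated_at` value and the daily row counts on hand; the function names and sample numbers are illustrative, not part of any specific tool.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def is_fresh(max_updated_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: the newest row must be no older than max_age."""
    return now - max_updated_at <= max_age


def volume_ok(today_rows: int, last_7_days: list[int],
              low: float = 0.8, high: float = 1.2) -> bool:
    """Volume: today's count must fall within [low, high] times the 7-day average."""
    baseline = mean(last_7_days)
    return low * baseline <= today_rows <= high * baseline


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = is_fresh(datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc), now, timedelta(hours=1))
stale = is_fresh(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc), now, timedelta(hours=1))

history = [1000, 1100, 900, 1000, 1050, 950, 1000]  # made-up counts, average 1000
normal_day = volume_ok(1100, history)
spike_day = volume_ok(1500, history)
```

A fixed band around a rolling average is the simplest possible volume check; datasets with strong weekly seasonality usually need a per-weekday baseline instead.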
Testing Frameworks
dbt tests
# models/marts/orders.yml
version: 2
models:
  - name: fact_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customer')
              field: customer_id
      - name: amount
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
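It helps to know what these tests do under the hood: a dbt generic test compiles to a SELECT that returns the failing rows, and the test passes when that query returns nothing. The sketch below approximates the four tests above as plain SQL against an in-memory SQLite database. The sample rows are made up, and the queries are simplified stand-ins, not dbt's exact compiled output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id TEXT);
    CREATE TABLE fact_orders (order_id TEXT, customer_id TEXT, amount REAL);
    INSERT INTO dim_customer VALUES ('c1'), ('c2');
    INSERT INTO fact_orders VALUES
        ('o1', 'c1', 10.0),
        ('o1', 'c2', 25.0),   -- duplicate order_id
        ('o2', 'c9', 5.0),    -- customer_id missing from dim_customer
        ('o3', 'c1', -4.0);   -- negative amount
""")

# Each query selects the rows that violate the rule; zero rows means the test passes.
CHECKS = {
    "unique(order_id)": """
        SELECT order_id FROM fact_orders
        WHERE order_id IS NOT NULL
        GROUP BY order_id HAVING COUNT(*) > 1""",
    "not_null(order_id)": """
        SELECT * FROM fact_orders WHERE order_id IS NULL""",
    "relationships(customer_id)": """
        SELECT f.customer_id FROM fact_orders f
        LEFT JOIN dim_customer d ON f.customer_id = d.customer_id
        WHERE f.customer_id IS NOT NULL AND d.customer_id IS NULL""",
    "accepted_range(amount >= 0)": """
        SELECT amount FROM fact_orders WHERE NOT (amount >= 0)""",
}

failures = {name: len(conn.execute(sql).fetchall()) for name, sql in CHECKS.items()}
for name, count in failures.items():
    status = "PASS" if count == 0 else "FAIL"
    print(name, status, count)
```

Because every test is just a query for bad rows, the same pattern extends to any custom rule you can express in SQL.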
Great Expectations
A Python library that is more expressive than dbt tests and backend-agnostic: the same expectations run against pandas, Spark, or SQL databases. Heavier to set up, but powerful for ML feature pipelines.
Soda / Anomalo / Monte Carlo
Data observability tools, mostly commercial (Soda also ships an open-source core). You define expectations, get monitoring and anomaly detection on top, and integrate with the orchestrator and Slack/PagerDuty.
Data Contracts
A data contract is an explicit agreement between a data producer and its consumers:
name: orders_completed
owner: orders-team
version: 2
schema:
  - name: order_id
    type: string
    nullable: false
  - name: customer_id
    type: string
    nullable: false
  - name: amount
    type: decimal(10,2)
    nullable: false
  - name: completed_at
    type: timestamp
    nullable: false
sla:
  freshness: "5m"
  uniqueness: "order_id"
breaking_changes_require:
  - approval_from: data-platform
  - notice_period: "30d"
Contracts shift quality "left" — producers, often application engineers, take responsibility for the data they emit, instead of consumers cleaning up downstream.
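On the producer side, enforcing a contract can start as a small validation step before records are emitted. The following is a minimal sketch, assuming the `orders_completed` schema above has been loaded into a Python dict; the mapping from contract types to Python types and the `violations` helper are assumptions made for illustration.

```python
from datetime import datetime
from decimal import Decimal

# column -> (Python type standing in for the contract type, nullable).
# Assumption of this sketch: decimal precision and scale are not checked.
CONTRACT_SCHEMA = {
    "order_id": (str, False),
    "customer_id": (str, False),
    "amount": (Decimal, False),
    "completed_at": (datetime, False),
}


def violations(record: dict) -> list[str]:
    """Return human-readable contract violations for one record."""
    problems = []
    for column, (expected_type, nullable) in CONTRACT_SCHEMA.items():
        value = record.get(column)
        if value is None:
            if not nullable:
                problems.append(f"{column}: null not allowed")
        elif not isinstance(value, expected_type):
            problems.append(f"{column}: expected {expected_type.__name__}")
    # Columns the contract does not declare are reported too.
    extra = set(record) - set(CONTRACT_SCHEMA)
    problems.extend(f"{column}: not in contract" for column in sorted(extra))
    return problems


good = {"order_id": "o1", "customer_id": "c1",
        "amount": Decimal("19.99"), "completed_at": datetime(2024, 1, 1, 12, 0)}
bad = {"order_id": "o2", "customer_id": None, "amount": 19.99,
       "completed_at": datetime(2024, 1, 1, 12, 0), "coupon": "X"}

good_problems = violations(good)
bad_problems = violations(bad)
```

Here `bad` fails on three counts: a null `customer_id`, a float where a decimal is required, and an undeclared `coupon` column.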
Tools: Confluent Schema Registry (data contracts), dbt model contracts, GitHub-based YAML + CI checks, dedicated platforms (Gable, etc.).
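A GitHub-based CI check can be as simple as diffing the contract in the pull request against the version on the main branch. The sketch below assumes both versions are already parsed into dicts keyed by column name. What counts as "breaking" here (a removed column, a changed type, a column becoming nullable) is an assumption of this example, and the version 3 schema is hypothetical.

```python
def breaking_changes(old: dict[str, dict], new: dict[str, dict]) -> list[str]:
    """Compare two contract schemas keyed by column name."""
    problems = []
    for column, spec in old.items():
        if column not in new:
            problems.append(f"removed column: {column}")
            continue
        if new[column]["type"] != spec["type"]:
            problems.append(
                f"type change: {column} {spec['type']} -> {new[column]['type']}"
            )
        if new[column]["nullable"] and not spec["nullable"]:
            problems.append(f"became nullable: {column}")
    return problems  # added columns are treated as non-breaking


v2 = {
    "order_id": {"type": "string", "nullable": False},
    "customer_id": {"type": "string", "nullable": False},
    "amount": {"type": "decimal(10,2)", "nullable": False},
    "completed_at": {"type": "timestamp", "nullable": False},
}
# Hypothetical proposed version: retypes amount, drops completed_at, adds currency.
v3 = {
    "order_id": {"type": "string", "nullable": False},
    "customer_id": {"type": "string", "nullable": False},
    "amount": {"type": "float", "nullable": False},
    "currency": {"type": "string", "nullable": True},
}

problems = breaking_changes(v2, v3)
```

A CI job would fail the build when `problems` is non-empty, unless the `breaking_changes_require` conditions (approval and notice period) have been met.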
Lineage
Lineage tracks how data flows from source to consumer. Two flavours:
- Coarse-grained — table-level: "fact_orders depends on stg_orders, stg_payments".
- Fine-grained — column-level: "fact_orders.amount comes from stg_payments.amount via a sum".
Why it matters:
- Debugging: "why is revenue wrong today?" → trace back upstream.
- Impact analysis: "if we drop column X in source, what breaks?".
- Compliance: prove where personal data lives and who touched it.
Sources of lineage: dbt's manifest, OpenLineage, BigQuery Data Lineage, Unity Catalog, DataHub, Atlan.
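Impact analysis is a graph traversal over lineage edges. Here is a minimal sketch over a table-level graph, assuming the edges have already been extracted from one of the sources above. The `fact_orders` parents come from the example in this section; the downstream tables are hypothetical.

```python
from collections import deque

# upstream table -> tables that read from it directly
DOWNSTREAM = {
    "stg_orders": ["fact_orders"],
    "stg_payments": ["fact_orders"],
    "fact_orders": ["revenue_daily", "customer_ltv"],  # hypothetical consumers
    "revenue_daily": ["finance_dashboard"],            # hypothetical consumer
}


def impacted(table: str) -> list[str]:
    """Breadth-first walk returning every dataset downstream of `table`."""
    seen, queue, order = {table}, deque([table]), []
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


blast_radius = impacted("stg_payments")
```

Reversing the edges turns the same traversal into the debugging case: start at the wrong number and walk upstream.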
Catalog
A catalog is the inventory of your data platform. Modern catalogs go beyond Hive metadata to include:
- Table and column descriptions, owners, tags.
- Usage metrics — most-queried tables, last-queried date.
- Quality scores and freshness.
- Lineage edges.
- Glossary and business definitions ("what is an active customer?").
- Access policies.
Players: Unity Catalog, Glue + Lake Formation, Google Cloud Data Catalog (Dataplex), DataHub (open-source), Atlan, Collibra, Alation.
Governance: Classification, Access, Masking
Classification
Tag columns by sensitivity: PII, PHI, financial, internal, public. Many catalogs and warehouses can detect sensitive columns automatically, typically by combining regex patterns with ML classifiers.
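The regex half of auto-detection is easy to picture: sample some values from a column and tag it when most of them match a known pattern. The following is a rough sketch with two illustrative patterns. The tag names, patterns, and 80% threshold are assumptions, and a production detector would use far more rules than this.

```python
import re
from typing import Optional

# Illustrative patterns only, deliberately loose.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(samples: list[str], threshold: float = 0.8) -> Optional[str]:
    """Return the first tag whose pattern matches at least `threshold` of non-empty samples."""
    values = [s for s in samples if s]
    if not values:
        return None
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            return tag
    return None


email_tag = classify_column(["a@example.com", "b@example.org", "", "c@example.net"])
ssn_tag = classify_column(["123-45-6789", "987-65-4321"])
free_text_tag = classify_column(["hello", "world"])
```

Treat the result as a suggestion for the column owner to confirm, not as a final classification.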
Access Control
- RBAC — roles → privileges (analyst, engineer, finance).
- ABAC — attribute-based, finer-grained ("only this region" / "only own customers").
- Row-level security and column-level masking enforce least privilege without splitting tables.
-- Snowflake column-level masking
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('SUPPORT', 'PLATFORM') THEN val
    ELSE REGEXP_REPLACE(val, '.+@', '****@')
  END;

ALTER TABLE dim_customer
  MODIFY COLUMN email
  SET MASKING POLICY mask_email;
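Row-level security follows the same idea as the masking policy, applied to rows instead of column values. The sketch below models the "only this region" ABAC rule in plain Python so the logic is easy to see. The `User` attributes, the role name, and the sample rows are hypothetical, and in practice the warehouse enforces this filter, not application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    role: str
    region: str


def visible_rows(user: User, rows: list[dict]) -> list[dict]:
    """PLATFORM sees every row; everyone else sees only rows from their own region."""
    if user.role == "PLATFORM":
        return list(rows)
    return [row for row in rows if row["region"] == user.region]


customers = [
    {"customer_id": "c1", "region": "EU"},
    {"customer_id": "c2", "region": "US"},
    {"customer_id": "c3", "region": "EU"},
]
eu_view = visible_rows(User("ana", "ANALYST", "EU"), customers)
admin_view = visible_rows(User("pat", "PLATFORM", "US"), customers)
```

The decision depends on an attribute of the user and an attribute of the row, which is what distinguishes ABAC from a pure role check.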
Retention
Keep data only as long as it is needed. Retention periods are driven by legal minimums, business value, and erasure rights such as the GDPR "right to be forgotten" and the CCPA right to delete. Automate deletion and anonymisation; don't rely on humans.
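An automated retention job boils down to "find rows older than the cutoff, then delete or anonymise them". The following is a minimal in-memory sketch. The two-year period, the column names, and the choice to null out identifiers are all assumptions; whether nulling is sufficient anonymisation is a question for your legal team.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=730)   # hypothetical two-year policy
PII_COLUMNS = ("name", "email")   # hypothetical columns tagged as PII


def apply_retention(rows: list[dict], now: datetime) -> list[dict]:
    """Return rows with PII columns nulled out once a row is past retention."""
    cutoff = now - RETENTION
    result = []
    for row in rows:
        if row["created_at"] < cutoff:
            row = {**row, **{column: None for column in PII_COLUMNS}}
        result.append(row)
    return result


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
rows = [
    {"customer_id": "c1", "name": "Ada", "email": "ada@example.com",
     "created_at": datetime(2020, 1, 1, tzinfo=timezone.utc)},
    {"customer_id": "c2", "name": "Bo", "email": "bo@example.com",
     "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]
cleaned = apply_retention(rows, now)
```

Scheduled in the orchestrator, a job like this runs without anyone having to remember it.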
Audit
Every access should be logged: who queried what, and when. The major cloud warehouses provide query logs; centralise them in a security data lake (or SIEM).
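Once the logs are centralised, a typical audit question is "who touched sensitive tables, and how often?". The sketch below answers it over already-parsed log entries. The entry shape and the set of sensitive tables are hypothetical; each warehouse has its own log schema.

```python
from collections import Counter

SENSITIVE_TABLES = {"dim_customer"}  # hypothetical: tables tagged PII in the catalog

# Hypothetical, already-parsed query-log entries.
QUERY_LOG = [
    {"user": "alice", "tables": ["fact_orders"], "at": "2024-01-01T09:00:00Z"},
    {"user": "bob", "tables": ["dim_customer"], "at": "2024-01-01T09:05:00Z"},
    {"user": "bob", "tables": ["fact_orders", "dim_customer"], "at": "2024-01-01T10:00:00Z"},
    {"user": "carol", "tables": ["dim_customer"], "at": "2024-01-02T08:30:00Z"},
]


def sensitive_access_counts(log: list[dict]) -> dict[str, int]:
    """Count, per user, the queries that read at least one sensitive table."""
    counts = Counter()
    for entry in log:
        if SENSITIVE_TABLES.intersection(entry["tables"]):
            counts[entry["user"]] += 1
    return dict(counts)


access_counts = sensitive_access_counts(QUERY_LOG)
```

The same aggregation, grouped by day, is a reasonable starting point for alerting on unusual access patterns.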
The Ownership Model
Modern best practice: the team that produces the data owns its quality. The platform team provides the tools — catalog, observability, contracts — but does not babysit every table.
This is the core idea of data mesh: domain-oriented data ownership, with a central platform for discoverability and governance. You don't have to adopt the buzzword; the principle is sound.
An Operating Rhythm
- Every production dataset has an owner, a contract, and tests.
- Failed tests block downstream consumers (or alert + page).
- Monthly review: stale tables, low-quality tables, undocumented tables.
- Quarterly review: cost per dataset, top-10 most expensive queries.
- Yearly review: classification, retention, access policies.
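The monthly review is easy to automate if the catalog exposes owners, descriptions, and last-queried dates. Here is a minimal sketch over exported catalog metadata. The field names, the 90-day staleness threshold, and the sample tables are assumptions of this example.

```python
from datetime import date, timedelta


def monthly_review(catalog: list[dict], today: date,
                   stale_after_days: int = 90) -> dict[str, list[str]]:
    """Flag tables that are stale (not queried recently), undocumented, or unowned."""
    cutoff = today - timedelta(days=stale_after_days)
    return {
        "stale": [t["name"] for t in catalog if t["last_queried"] < cutoff],
        "undocumented": [t["name"] for t in catalog if not t["description"]],
        "unowned": [t["name"] for t in catalog if not t["owner"]],
    }


catalog = [
    {"name": "fact_orders", "owner": "orders-team",
     "description": "One row per completed order.", "last_queried": date(2024, 5, 30)},
    {"name": "tmp_export_2022", "owner": None,
     "description": "", "last_queried": date(2022, 11, 2)},
]
report = monthly_review(catalog, today=date(2024, 6, 1))
```

Posting the report to the owning teams' channels keeps the review from depending on someone remembering to run it.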
Cert Mapping
| Cert | Governance scope |
|---|---|
| DP-203 / DP-700 | Microsoft Purview, Fabric data lineage, sensitivity labels |
| AWS DEA-C01 | Lake Formation, Glue Data Quality, Glue Data Catalog, IAM lake permissions |
| GCP PDE | Dataplex, Data Catalog, BigQuery policy tags, DLP |
The Mindset
Quality and governance are not bureaucracy added on top of pipelines. They are the difference between a data platform people trust and one they tolerate. Build them in early; retrofitting is much harder than starting with them.