5 min read·Lesson 9 of 10

Data Quality and Governance

Tests, contracts, lineage, catalogs, and access control — the practices that turn raw data infrastructure into a trustworthy data platform.

You can have the most elegant pipeline architecture in the world; if business stakeholders cannot trust the numbers, none of it matters. Quality and governance are what separate "data infrastructure" from a "data platform".

Six Dimensions of Data Quality

Dimension | Question it answers | Example test
Freshness | Is the data recent enough? | max(updated_at) within last 1 hour
Volume | Is the row count reasonable? | row count between 0.8x and 1.2x of 7-day average
Schema | Are columns/types as expected? | columns and types match contract
Accuracy | Are values correct? | amount > 0; status in known set
Completeness | Are required fields populated? | customer_id IS NOT NULL
Uniqueness | Are keys unique? | no duplicate order_id
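
To make the six dimensions concrete, here is a minimal Python sketch that runs one check per dimension over a small in-memory batch. The function name `check_quality`, the row layout, and the `status` values are assumptions for illustration; in practice these checks would usually run as SQL or framework tests against the warehouse.

```python
from datetime import datetime, timedelta, timezone


def check_quality(rows, now, avg_7d_rows, expected_schema, known_statuses):
    """Return {dimension: passed} for one batch of order rows (illustrative only)."""
    timestamps = [r["updated_at"] for r in rows if r.get("updated_at") is not None]
    newest = max(timestamps, default=None)
    order_ids = [r.get("order_id") for r in rows]
    return {
        # Freshness: the newest row was updated within the last hour.
        "freshness": newest is not None and now - newest <= timedelta(hours=1),
        # Volume: row count is within 0.8x to 1.2x of the 7-day average.
        "volume": 0.8 * avg_7d_rows <= len(rows) <= 1.2 * avg_7d_rows,
        # Schema: exact column set, and non-null values have the expected type.
        "schema": all(
            set(r) == set(expected_schema)
            and all(r[c] is None or isinstance(r[c], t) for c, t in expected_schema.items())
            for r in rows
        ),
        # Accuracy: amount is a positive number and status is in the known set.
        "accuracy": all(
            isinstance(r.get("amount"), (int, float))
            and r["amount"] > 0
            and r.get("status") in known_statuses
            for r in rows
        ),
        # Completeness: the required customer_id is populated.
        "completeness": all(r.get("customer_id") is not None for r in rows),
        # Uniqueness: no duplicate order_id.
        "uniqueness": len(order_ids) == len(set(order_ids)),
    }


# Hypothetical sample batch; the second row is missing its customer_id.
NOW = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
SCHEMA = {"order_id": str, "customer_id": str, "amount": float,
          "status": str, "updated_at": datetime}
rows = [
    {"order_id": "o1", "customer_id": "c1", "amount": 10.0,
     "status": "paid", "updated_at": NOW - timedelta(minutes=5)},
    {"order_id": "o2", "customer_id": None, "amount": 25.5,
     "status": "paid", "updated_at": NOW - timedelta(minutes=30)},
]
report = check_quality(rows, NOW, avg_7d_rows=2, expected_schema=SCHEMA,
                       known_statuses={"paid", "refunded"})
failed = sorted(name for name, ok in report.items() if not ok)
print(failed)
```

With this sample batch only the completeness check fails, because one row has no customer_id.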

Testing Frameworks

dbt tests

# models/marts/orders.yml
version: 2
models:
  - name: fact_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customer')
              field: customer_id
      - name: amount
        tests:
          - dbt_utils.accepted_range:
              min_value: 0

Great Expectations

More expressive than dbt tests. It is a Python library rather than a SQL-only tool, and it runs against pandas, Spark, and SQL backends. Heavier setup, but powerful for ML feature pipelines.

Soda / Anomalo / Monte Carlo

Data observability tools, mostly commercial (Soda also offers an open-source core). You define expectations and get monitoring and anomaly detection, with integrations for the orchestrator and Slack/PagerDuty.

Data Contracts

A data contract is an explicit agreement between a data producer and its consumers:

name: orders_completed
owner: orders-team
version: 2
schema:
  - name: order_id
    type: string
    nullable: false
  - name: customer_id
    type: string
    nullable: false
  - name: amount
    type: decimal(10,2)
    nullable: false
  - name: completed_at
    type: timestamp
    nullable: false
sla:
  freshness: "5m"
  uniqueness: "order_id"
breaking_changes_require:
  - approval_from: data-platform
  - notice_period: "30d"

Contracts shift quality "left" — producers, often application engineers, take responsibility for the data they emit, instead of consumers cleaning up downstream.
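
As a rough illustration of enforcing a contract at the producer boundary, the sketch below validates a single record against the schema from the contract above. The contract is written as a Python dict so the example has no YAML dependency; `validate_record` and the type mapping are assumptions for this sketch, not part of any contract tool, and decimal precision and scale are not checked.

```python
from datetime import datetime, timezone
from decimal import Decimal

# The schema section of the orders_completed contract, as a plain dict.
CONTRACT = {
    "name": "orders_completed",
    "schema": [
        {"name": "order_id", "type": "string", "nullable": False},
        {"name": "customer_id", "type": "string", "nullable": False},
        {"name": "amount", "type": "decimal(10,2)", "nullable": False},
        {"name": "completed_at", "type": "timestamp", "nullable": False},
    ],
}

# Assumed mapping from contract base types to Python types.
PYTHON_TYPES = {"string": str, "decimal": Decimal, "timestamp": datetime}


def validate_record(record, contract):
    """Return a list of human-readable violations (empty list means valid)."""
    errors = []
    declared = {field["name"] for field in contract["schema"]}
    for field in contract["schema"]:
        name = field["name"]
        value = record.get(name)
        if value is None:
            if not field["nullable"]:
                errors.append(f"{name}: required field is missing or null")
            continue
        base_type = field["type"].split("(")[0]  # "decimal(10,2)" -> "decimal"
        if not isinstance(value, PYTHON_TYPES[base_type]):
            errors.append(f"{name}: expected {field['type']}, got {type(value).__name__}")
    # Fields the contract does not declare are reported as well.
    for extra in sorted(set(record) - declared):
        errors.append(f"{extra}: field not declared in contract")
    return errors


good = {"order_id": "o1", "customer_id": "c1", "amount": Decimal("12.50"),
        "completed_at": datetime(2024, 1, 1, tzinfo=timezone.utc)}
bad = {"order_id": "o1", "amount": 12.5,
       "completed_at": datetime(2024, 1, 1, tzinfo=timezone.utc)}
good_errors = validate_record(good, CONTRACT)
bad_errors = validate_record(bad, CONTRACT)
```

A producer could run a check like this before publishing, so that a missing customer_id or a float amount is rejected at the source rather than discovered downstream.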

Tools: Confluent Stream Catalog, dbt model contracts, GitHub-based YAML + CI checks, dedicated platforms (Gable, etc.).
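
For the YAML-plus-CI approach, one possible check is to diff the schema of the proposed contract version against the current one and fail the build on breaking changes. The sketch below assumes that removing a column, changing a type, or relaxing nullability is breaking, and that adding a column is not; the function name and the `coupon_code` column are hypothetical.

```python
def breaking_changes(old_schema, new_schema):
    """Compare two contract schemas and list changes that would break consumers."""
    old = {field["name"]: field for field in old_schema}
    new = {field["name"]: field for field in new_schema}
    problems = []
    for name, field in old.items():
        if name not in new:
            problems.append(f"removed column {name}")
        elif new[name]["type"] != field["type"]:
            problems.append(
                f"type of {name} changed from {field['type']} to {new[name]['type']}"
            )
        elif new[name]["nullable"] and not field["nullable"]:
            problems.append(f"{name} became nullable")
    # Columns that exist only in new_schema are treated as non-breaking additions.
    return problems


v2 = [
    {"name": "order_id", "type": "string", "nullable": False},
    {"name": "amount", "type": "decimal(10,2)", "nullable": False},
    {"name": "completed_at", "type": "timestamp", "nullable": False},
]
v3 = [
    {"name": "order_id", "type": "string", "nullable": False},
    {"name": "amount", "type": "float", "nullable": False},
    {"name": "coupon_code", "type": "string", "nullable": True},  # hypothetical new column
]
problems = breaking_changes(v2, v3)
```

A CI job could exit non-zero when the returned list is non-empty, which is one way to trigger the approval and notice period that the contract requires for breaking changes.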

Lineage

Lineage tracks how data flows from source to consumer. Two flavours:

  • Coarse-grained — table-level: "fact_orders depends on stg_orders, stg_payments".
  • Fine-grained — column-level: "fact_orders.amount comes from stg_payments.amount via a sum".

Why it matters:

  • Debugging: "why is revenue wrong today?" → trace back upstream.
  • Impact analysis: "if we drop column X in source, what breaks?".
  • Compliance: prove where personal data lives and who touched it.

Sources of lineage: dbt's manifest, OpenLineage, BigQuery Data Lineage, Unity Catalog, DataHub, Atlan.
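
Table-level lineage is a directed graph, so debugging and impact analysis both reduce to graph traversal. The sketch below is a minimal illustration: the edges for fact_orders come from the example above, while `raw_orders`, `raw_payments`, and `revenue_dashboard` are hypothetical nodes added to make the graph deeper.

```python
from collections import deque

# Each table maps to the tables it reads from directly.
DEPENDS_ON = {
    "fact_orders": ["stg_orders", "stg_payments"],
    "stg_orders": ["raw_orders"],          # hypothetical source
    "stg_payments": ["raw_payments"],      # hypothetical source
    "revenue_dashboard": ["fact_orders"],  # hypothetical consumer
}


def upstream(table, depends_on):
    """Debugging: every table that feeds `table`, directly or transitively."""
    seen, queue = set(), deque([table])
    while queue:
        for parent in depends_on.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def downstream(table, depends_on):
    """Impact analysis: every table that would be affected by a change to `table`."""
    reverse = {}
    for child, parents in depends_on.items():
        for parent in parents:
            reverse.setdefault(parent, []).append(child)
    # Walking the reversed graph "upstream" yields the downstream set.
    return upstream(table, reverse)


suspects = upstream("revenue_dashboard", DEPENDS_ON)
affected = downstream("stg_payments", DEPENDS_ON)
```

Here `suspects` answers "why is revenue wrong today?" by listing everything the dashboard depends on, and `affected` answers "what breaks if stg_payments changes?".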

Catalog

A catalog is the inventory of your data platform. Modern catalogs go beyond Hive metadata to include:

  • Table and column descriptions, owners, tags.
  • Usage metrics — most-queried tables, last-queried date.
  • Quality scores and freshness.
  • Lineage edges.
  • Glossary and business definitions ("what is an active customer?").
  • Access policies.

Players: Unity Catalog, Glue + Lake Formation, BigQuery Data Catalog, DataHub (open-source), Atlan, Collibra, Alation.

Governance: Classification, Access, Masking

Classification

Tag columns by sensitivity: PII, PHI, financial, internal, public. Many catalogs and warehouses can auto-detect sensitive columns using regex patterns and ML classifiers.
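
The regex half of auto-detection can be sketched in a few lines: sample values from a column and tag it when most of them match a known pattern. The patterns, the 0.8 threshold, and the tag names below are assumptions for illustration and are far simpler than what production classifiers use.

```python
import re

# Deliberately simple patterns; real detectors use many more and add ML scoring.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{7,}\d$"),
}


def classify_column(samples, threshold=0.8):
    """Tag a column as PII if at least `threshold` of non-empty samples match a pattern."""
    values = [s for s in samples if s]
    if not values:
        return "unclassified"
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for value in values if pattern.match(value))
        if hits / len(values) >= threshold:
            return f"PII:{tag}"
    return "unclassified"


email_tag = classify_column(["a@example.com", "b@example.org", None])
phone_tag = classify_column(["+1 415 555 0100", "020 7946 0958"])
status_tag = classify_column(["paid", "refunded"])
```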

Access Control

  • RBAC — roles → privileges (analyst, engineer, finance).
  • ABAC — attribute-based, finer-grained ("only this region" / "only own customers").
  • Row-level security and column-level masking enforce least privilege without splitting tables.

-- Snowflake column-level masking
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('SUPPORT', 'PLATFORM') THEN val
    ELSE REGEXP_REPLACE(val, '.+@', '****@')
  END;

ALTER TABLE dim_customer
  MODIFY COLUMN email
  SET MASKING POLICY mask_email;
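
To show how row-level security and column masking combine, here is a small Python sketch that mirrors the logic of the masking policy above and adds a row filter. It is only a model of the behaviour, not how a warehouse implements it; the `region` attribute and the `query_customers` helper are hypothetical.

```python
import re

UNMASKED_ROLES = {"SUPPORT", "PLATFORM"}


def mask_email(value, role):
    """Same rule as the mask_email policy: privileged roles see the raw value."""
    if role in UNMASKED_ROLES:
        return value
    return re.sub(r".+@", "****@", value)


def query_customers(rows, user):
    """Row-level filter on region (assumed attribute), then column masking by role."""
    return [
        {**row, "email": mask_email(row["email"], user["role"])}
        for row in rows
        if row["region"] == user["region"]
    ]


customers = [
    {"customer_id": "c1", "email": "ada@example.com", "region": "EU"},
    {"customer_id": "c2", "email": "bob@example.com", "region": "US"},
]
analyst_view = query_customers(customers, {"role": "ANALYST", "region": "EU"})
support_view = query_customers(customers, {"role": "SUPPORT", "region": "US"})
```

Both users query the same table; the analyst sees only EU rows with masked emails, while support sees only US rows with emails in clear text.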

Retention

Keep data only as long as needed. Retention periods are driven by legal requirements, business value, and erasure rights such as the GDPR "right to be forgotten" and CCPA deletion requests. Automate deletion and anonymisation; don't rely on humans.
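
A scheduled retention job might look roughly like the following sketch, which anonymises rows older than a cutoff by nulling their PII fields. The field names, the 30-day window, and the choice to null rather than delete are assumptions for illustration; real policies depend on the legal basis for each dataset.

```python
from datetime import datetime, timedelta, timezone


def apply_retention(rows, now, retention_days, pii_fields):
    """Return rows with PII fields nulled once a row is older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    result = []
    for row in rows:
        if row["created_at"] < cutoff:
            # Keep the non-personal facts (for aggregates) but drop the personal data.
            row = {key: (None if key in pii_fields else value) for key, value in row.items()}
        result.append(row)
    return result


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
orders = [
    {"order_id": "o1", "email": "ada@example.com", "amount": 10.0,
     "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"order_id": "o2", "email": "bob@example.com", "amount": 20.0,
     "created_at": datetime(2024, 5, 25, tzinfo=timezone.utc)},
]
retained = apply_retention(orders, now, retention_days=30, pii_fields={"email"})
```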

Audit

Log every access: who queried what, and when. The major cloud warehouses provide query logs; centralise them into a security data lake (or SIEM).
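
Once query logs are centralised, a simple first report is how often each user touched sensitive tables. The sketch below assumes each log event has already been parsed into a dict with `user` and `tables` keys; that event shape and the user names are hypothetical, since every warehouse exposes its own log format.

```python
from collections import Counter


def sensitive_access_counts(events, sensitive_tables):
    """Count, per user, the queries that touched at least one sensitive table."""
    return Counter(
        event["user"]
        for event in events
        if sensitive_tables & set(event["tables"])
    )


events = [
    {"user": "ana", "tables": ["dim_customer", "fact_orders"]},
    {"user": "ana", "tables": ["dim_customer"]},
    {"user": "ben", "tables": ["fact_orders"]},
]
counts = sensitive_access_counts(events, sensitive_tables={"dim_customer"})
```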

The Ownership Model

Modern best practice: the team that produces the data owns its quality. The platform team provides the tools — catalog, observability, contracts — but does not babysit every table.

This is the core idea of data mesh: domain-oriented data ownership, with a central platform for discoverability and governance. You don't have to adopt the buzzword; the principle is sound.

An Operating Rhythm

  1. Every production dataset has an owner, a contract, and tests.
  2. Failed tests block downstream consumers (or alert + page).
  3. Monthly review: stale tables, low-quality tables, undocumented tables.
  4. Quarterly review: cost per dataset, top-10 most expensive queries.
  5. Yearly review: classification, retention, access policies.
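
The monthly review in step 3 can be partly automated from catalog metadata. The following sketch flags tables with no owner, no description, or no recent queries; the catalog record shape, the 90-day staleness threshold, and the `tmp_export` table are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def monthly_review(catalog, now, stale_after_days=90):
    """Return {table_name: [issues]} for tables that need attention."""
    findings = {}
    for table in catalog:
        issues = []
        if not table["owner"]:
            issues.append("no owner")
        if not table["description"]:
            issues.append("undocumented")
        if now - table["last_queried"] > timedelta(days=stale_after_days):
            issues.append("stale")
        if issues:
            findings[table["name"]] = issues
    return findings


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
catalog = [
    {"name": "fact_orders", "owner": "orders-team", "description": "One row per order",
     "last_queried": datetime(2024, 5, 31, tzinfo=timezone.utc)},
    {"name": "tmp_export", "owner": None, "description": "",
     "last_queried": datetime(2023, 11, 1, tzinfo=timezone.utc)},
]
findings = monthly_review(catalog, now)
```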

Cert Mapping

Cert | Governance scope
DP-203 / DP-700 | Microsoft Purview, Fabric data lineage, sensitivity labels
AWS DEA-C01 | Lake Formation, Glue Data Quality, Glue Data Catalog, IAM lake permissions
GCP PDE | Dataplex, Data Catalog, BigQuery policy tags, DLP

The Mindset

Quality and governance are not bureaucracy added on top of pipelines. They are the difference between a data platform people trust and one they tolerate. Build them in early; retrofitting is much harder than starting with them.

Key Takeaways

  • Data quality spans freshness, volume, schema, accuracy, completeness, and uniqueness, and each dimension is testable like code.
  • Data contracts make producer/consumer expectations explicit and enforceable at the boundary.
  • Lineage answers "where did this number come from?" and is essential for debugging and compliance.
  • A catalog turns a swamp of tables into a discoverable, documented platform.
  • Governance combines classification, access control, masking, retention, and audit — not just a security checkbox.
