Data Quality and Governance — Data Engineering Fundamentals | CertQnA

You can have the most elegant pipeline architecture in the world; if business stakeholders cannot trust the numbers, none of it matters. Quality and governance are what separate "data infrastructure" from a "data platform".

Six Dimensions of Data Quality

Dimension	Question it answers	Example test
Freshness	Is the data recent enough?	max(updated_at) within last 1 hour
Volume	Is the row count reasonable?	row count between 0.8x and 1.2x of 7-day average
Schema	Are columns/types as expected?	columns and types match contract
Accuracy	Are values correct?	amount > 0; status in known set
Completeness	Are required fields populated?	customer_id IS NOT NULL
Uniqueness	Are keys unique?	no duplicate order_id
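
To make the table concrete, here is a minimal sketch of four of these checks in plain Python. The function names and the sample rows are hypothetical, chosen only for illustration; in practice the same logic would usually run as SQL or inside a testing framework.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def check_freshness(latest_update, now, max_age=timedelta(hours=1)):
    """Freshness: the newest row must be no older than max_age."""
    return now - latest_update <= max_age


def check_volume(todays_rows, last_7_days, low=0.8, high=1.2):
    """Volume: today's row count must sit in a band around the 7-day average."""
    baseline = mean(last_7_days)
    return low * baseline <= todays_rows <= high * baseline


def check_completeness(rows, column):
    """Completeness: a required column must never be None."""
    return all(row.get(column) is not None for row in rows)


def check_uniqueness(rows, key):
    """Uniqueness: no two rows may share the same key value."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))


# Hypothetical sample data, for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": "o1", "customer_id": "c1", "updated_at": now - timedelta(minutes=10)},
    {"order_id": "o2", "customer_id": None, "updated_at": now - timedelta(minutes=45)},
]

results = {
    "freshness": check_freshness(max(r["updated_at"] for r in rows), now),
    "volume": check_volume(len(rows), [2, 2, 2, 2, 2, 2, 2]),
    "completeness": check_completeness(rows, "customer_id"),
    "uniqueness": check_uniqueness(rows, "order_id"),
}
print(results)
```

With this sample, only the completeness check fails, because the second row has no customer_id.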

Testing Frameworks

dbt tests

# models/marts/orders.yml
version: 2
models:
  - name: fact_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customer')
              field: customer_id
      - name: amount
        tests:
          - dbt_utils.accepted_range:
              min_value: 0

Great Expectations

More expressive than dbt tests and, although it is a Python library, backend-agnostic: the same expectations run against pandas, Spark, or SQL. Heavier setup, but powerful for ML feature pipelines.

Soda / Anomalo / Monte Carlo

Data observability tools, mostly commercial (Soda also ships the open-source Soda Core): define expectations, get monitoring and anomaly detection, and integrate with the orchestrator and Slack/PagerDuty.

Data Contracts

A data contract is an explicit agreement between a data producer and its consumers:

name: orders_completed
owner: orders-team
version: 2
schema:
  - name: order_id
    type: string
    nullable: false
  - name: customer_id
    type: string
    nullable: false
  - name: amount
    type: decimal(10,2)
    nullable: false
  - name: completed_at
    type: timestamp
    nullable: false
sla:
  freshness: "5m"
  uniqueness: "order_id"
breaking_changes_require:
  - approval_from: data-platform
  - notice_period: "30d"

Contracts shift quality "left" — producers, often application engineers, take responsibility for the data they emit, instead of consumers cleaning up downstream.
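
As a rough illustration of what enforcing such a contract on the producer side could look like, the sketch below validates one record against the schema fields listed in the contract. The mapping from contract types to Python types and the `validate_record` helper are assumptions of this example, not part of any particular contract tool, and the sketch does not check the precision of decimal(10,2).

```python
from datetime import datetime
from decimal import Decimal

# Field name -> (Python type standing in for the contract type, nullable).
# The type mapping is an assumption made for this sketch.
CONTRACT_SCHEMA = {
    "order_id": (str, False),
    "customer_id": (str, False),
    "amount": (Decimal, False),
    "completed_at": (datetime, False),
}


def validate_record(record, schema=CONTRACT_SCHEMA):
    """Return a list of contract violations; an empty list means the record is valid."""
    errors = []
    for name, (expected_type, nullable) in schema.items():
        value = record.get(name)
        if value is None:
            if not nullable:
                errors.append(f"{name}: missing or null")
            continue
        if not isinstance(value, expected_type):
            errors.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    # Fields the contract does not know about are reported as violations too.
    for name in sorted(record.keys() - schema.keys()):
        errors.append(f"{name}: not in contract")
    return errors


# Hypothetical records, for illustration only.
good = {
    "order_id": "o1",
    "customer_id": "c1",
    "amount": Decimal("19.99"),
    "completed_at": datetime(2024, 1, 1, 12, 0),
}
bad = {**good, "customer_id": None, "amount": 19.99, "coupon": "X"}

print(validate_record(good))
print(validate_record(bad))
```

A check like this can run in the producer's CI or at publish time, so a violating record is rejected before any consumer sees it.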

Tools: Confluent Schema Registry and Stream Catalog, dbt model contracts, GitHub-based YAML + CI checks, dedicated platforms (Gable, etc.).

Lineage

Lineage tracks how data flows from source to consumer. Two flavours:

Coarse-grained — table-level: "fact_orders depends on stg_orders, stg_payments".
Fine-grained — column-level: "fact_orders.amount comes from stg_payments.amount via a sum".

Why it matters:

Debugging: "why is revenue wrong today?" → trace back upstream.
Impact analysis: "if we drop column X in source, what breaks?".
Compliance: prove where personal data lives and who touched it.
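
Both debugging and impact analysis reduce to walking a dependency graph in one direction or the other. The sketch below assumes table-level lineage held as a plain dictionary; `raw_orders`, `raw_payments`, and `revenue_dashboard` are hypothetical names added to make the graph deeper than the example above.

```python
from collections import deque

# Table -> tables it reads from. Only fact_orders, stg_orders and
# stg_payments come from the text; the other names are hypothetical.
UPSTREAM = {
    "fact_orders": ["stg_orders", "stg_payments"],
    "stg_orders": ["raw_orders"],
    "stg_payments": ["raw_payments"],
    "revenue_dashboard": ["fact_orders"],
}


def invert(edges):
    """Flip the graph so it maps each table to the tables that read from it."""
    downstream = {}
    for table, parents in edges.items():
        for parent in parents:
            downstream.setdefault(parent, []).append(table)
    return downstream


def reachable(start, edges):
    """Breadth-first walk: every table reachable from start, excluding start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


# Debugging: everything the dashboard depends on.
print(sorted(reachable("revenue_dashboard", UPSTREAM)))
# Impact analysis: everything that breaks if raw_payments changes.
print(sorted(reachable("raw_payments", invert(UPSTREAM))))
```

Column-level lineage works the same way, with `table.column` strings as nodes instead of table names.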

Sources of lineage: dbt's manifest, OpenLineage, BigQuery Data Lineage, Unity Catalog, DataHub, Atlan.

Catalog

A catalog is the inventory of your data platform. Modern catalogs go beyond Hive metadata to include:

Table and column descriptions, owners, tags.
Usage metrics — most-queried tables, last-queried date.
Quality scores and freshness.
Lineage edges.
Glossary and business definitions ("what is an active customer?").
Access policies.

Players: Unity Catalog, Glue + Lake Formation, Google Cloud Data Catalog (Dataplex), DataHub (open-source), Atlan, Collibra, Alation.

Governance: Classification, Access, Masking

Classification

Tag columns by sensitivity: PII, PHI, financial, internal, public. Many catalogs and warehouses can detect sensitive columns automatically by sampling values and applying regex rules and ML classifiers.
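
The regex half of auto-detection can be sketched in a few lines: sample a column's values and tag the column when most of them match a known pattern. The patterns, the 0.8 threshold, and the tag names below are illustrative assumptions; real classifiers use far more robust rules.

```python
import re

# Deliberately simple patterns, assumed for this sketch only.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s\-()]{7,}\d"),
}


def classify_column(values, threshold=0.8):
    """Tag a column as PII when at least `threshold` of its non-null
    sample values fully match one of the patterns."""
    sample = [v for v in values if v is not None]
    if not sample:
        return "unclassified"
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in sample if pattern.fullmatch(str(v)))
        if hits / len(sample) >= threshold:
            return f"PII:{tag}"
    return "unclassified"


# Hypothetical column samples.
print(classify_column(["a@example.com", "b@example.org", None]))
print(classify_column(["+1 415 555 0100", "020 7946 0958"]))
print(classify_column(["red", "blue"]))
```

Matching on a share of sampled values rather than on every value tolerates a few dirty rows without missing the column.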

Access Control

RBAC — roles → privileges (analyst, engineer, finance).
ABAC — attribute-based, finer-grained ("only this region" / "only own customers").
Row-level security and column-level masking enforce least privilege without splitting tables.

-- Snowflake column-level masking
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('SUPPORT', 'PLATFORM') THEN val
    ELSE REGEXP_REPLACE(val, '.+@', '****@')
  END;

ALTER TABLE dim_customer
  MODIFY COLUMN email
  SET MASKING POLICY mask_email;
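
To see what the two mechanisms do to query results, the following Python sketch mirrors the masking rule above and adds an ABAC-style row filter. It is only a model of the behaviour, not how a warehouse implements it; the `region` attribute and the sample rows are hypothetical.

```python
import re

# Roles allowed to see the raw value, as in the policy above.
UNMASKED_ROLES = {"SUPPORT", "PLATFORM"}


def mask_email(value, role):
    """Column-level masking: hide the local part of the address for other roles."""
    if role in UNMASKED_ROLES:
        return value
    return re.sub(r".+@", "****@", value)


def visible_rows(rows, user):
    """Row-level security: a user sees only the rows of their own region."""
    return [row for row in rows if row["region"] == user["region"]]


# Hypothetical data, for illustration only.
customers = [
    {"email": "jane.doe@example.com", "region": "EU"},
    {"email": "sam@example.org", "region": "US"},
]
analyst = {"role": "ANALYST", "region": "EU"}

for row in visible_rows(customers, analyst):
    print(mask_email(row["email"], analyst["role"]), row["region"])
```

The analyst queries the same table as everyone else but sees one row with a masked address, which is the point of enforcing least privilege without splitting tables.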

Retention

Keep data only as long as needed. Retention periods are driven by legal minimums, business value, and erasure rights such as GDPR's "right to be forgotten" and the CCPA right to delete. Automate deletion / anonymisation; don't rely on humans.
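
A minimal sketch of automated retention, assuming a per-dataset retention period and a `created_at` timestamp on every row. The dataset names, periods, and PII field names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per dataset.
RETENTION = {
    "web_events": timedelta(days=90),
    "support_tickets": timedelta(days=730),
}


def apply_retention(rows, dataset, now):
    """Drop rows older than the dataset's retention period.
    Returns the kept rows and the number of rows removed."""
    cutoff = now - RETENTION[dataset]
    kept = [row for row in rows if row["created_at"] >= cutoff]
    return kept, len(rows) - len(kept)


def anonymise(row, pii_fields=("email", "name")):
    """Alternative to deletion: blank the PII fields, keep the rest."""
    return {k: (None if k in pii_fields else v) for k, v in row.items()}


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
events = [
    {"id": 1, "email": "a@example.com", "created_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": 2, "email": "b@example.com", "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]

kept, removed = apply_retention(events, "web_events", now)
print(removed, [row["id"] for row in kept])
print(anonymise(events[1]))
```

Run on a schedule by the orchestrator, a job like this turns the retention policy into something that happens by default rather than by reminder.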

Audit

Every access should be logged: who queried what, and when. The major cloud warehouses provide query logs; centralise them in a security data lake (or SIEM).

The Ownership Model

Modern best practice: the team that produces the data owns its quality. The platform team provides the tools — catalog, observability, contracts — but does not babysit every table.

This is the core idea of data mesh: domain-oriented data ownership, with a central platform for discoverability and governance. You don't have to adopt the buzzword; the principle is sound.

An Operating Rhythm

Every production dataset has an owner, a contract, and tests.
Failed tests block downstream consumers, or at minimum alert and page the owner.
Monthly review: stale tables, low-quality tables, undocumented tables.
Quarterly review: cost per dataset, top-10 most expensive queries.
Yearly review: classification, retention, access policies.
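
The "block or alert" rule can be expressed as a small gate that runs after the tests and before downstream tasks. This is a sketch under assumed conventions: the `block` / `warn` severities, the `DataQualityError` exception, and the check names are all hypothetical.

```python
class DataQualityError(Exception):
    """Raised to stop downstream tasks when a blocking check fails."""


def quality_gate(dataset, results):
    """results maps check name -> (passed, severity), where severity is
    'block' or 'warn'. Blocking failures raise; warnings are returned
    so the caller can route them to alerting."""
    blocking = sorted(n for n, (ok, sev) in results.items() if not ok and sev == "block")
    warnings = sorted(n for n, (ok, sev) in results.items() if not ok and sev == "warn")
    if blocking:
        raise DataQualityError(f"{dataset}: blocking checks failed: {', '.join(blocking)}")
    return warnings


# Hypothetical results for one run.
alerts = quality_gate(
    "fact_orders",
    {"unique_order_id": (True, "block"), "volume_in_band": (False, "warn")},
)
print(alerts)

try:
    quality_gate("fact_orders", {"unique_order_id": (False, "block")})
except DataQualityError as exc:
    print(exc)
```

In an orchestrator, the raised exception fails the task, which keeps dependent tasks from running on bad data.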

Cert Mapping

Cert	Governance scope
DP-203 / DP-700	Microsoft Purview, Fabric data lineage, sensitivity labels
AWS DEA-C01	Lake Formation, Glue Data Quality, Glue Data Catalog, IAM lake permissions
GCP PDE	Dataplex, Data Catalog, BigQuery policy tags, DLP

The Mindset

Quality and governance are not bureaucracy added on top of pipelines. They are the difference between a data platform people trust and one they tolerate. Build them in early; retrofitting is much harder than starting with them.