Data Warehouses — Data Engineering Fundamentals | CertQnA

A data warehouse is a database optimised for analytical queries — scanning millions or billions of rows, aggregating, joining, grouping. It is not optimised for thousands of small inserts and updates per second; that is what an OLTP database is for.

OLTP vs OLAP

	OLTP	OLAP / Warehouse
Workload	Many small reads + writes	Few large reads
Storage	Row-oriented	Column-oriented
Indexes	Many B-tree	Zone maps, micro-partitions, clustering
Concurrency	Thousands of users	Tens to hundreds of analysts + dashboards
Examples	PostgreSQL, MySQL, SQL Server	Snowflake, BigQuery, Redshift, Synapse, Databricks SQL

Why Columnar?

Analytical queries typically touch many rows but few columns. Storing data column-wise means:

Only the columns referenced in the query are read from disk.
Compression is dramatically better — columns of one type compress far more than mixed rows.
Vectorised execution: process many values of the same type at once with SIMD.

Row store:
[1, "Alice", 30, "USA"][2, "Bob", 25, "UK"][3, "Eve", 40, "DE"]

Column store:
ids:      [1, 2, 3]
names:    ["Alice", "Bob", "Eve"]
ages:     [30, 25, 40]
country:  ["USA", "UK", "DE"]

For "average age by country", the column store reads only ages and country. The row store reads everything.

Massively Parallel Processing (MPP)

A warehouse splits a query across many workers. A 10-billion-row scan distributes across 100 nodes; each scans 100 million rows in parallel. Results are merged.

This is why warehouses can return in seconds what a single PostgreSQL would take hours on.

The Big Four (Five) Cloud Warehouses

Snowflake

Architecture: separates storage (S3 / GCS / Blob) from compute ("virtual warehouses" — independent clusters you can size up/down/clone).
Pricing: per-second compute, per-TB storage.
Strengths: multi-cloud, easy to manage, time travel, zero-copy clones.
Use when: you want a great DX, no infrastructure ops, and predictable analytics workloads.

Google BigQuery

Architecture: fully serverless — no clusters to manage. Storage is GCS-backed; compute is "slots".
Pricing: on-demand per-TB scanned, or flat-rate slot reservations.
Strengths: truly zero ops, instant scale, integrated with GCP analytics + ML.
Use when: you are on GCP, want serverless, or need huge scan elasticity.

Amazon Redshift

Architecture: originally cluster-based (ra3 nodes); Redshift Serverless is now available.
Pricing: per-cluster-hour or per-RPU for serverless.
Strengths: deep AWS integration, Spectrum for querying S3 data.
Use when: heavily on AWS, especially with S3-based lakes.

Azure Synapse / Microsoft Fabric

Architecture: Synapse Dedicated SQL pools (MPP) and Serverless SQL; Fabric centralises this on OneLake (Delta-backed lakehouse).
Pricing: dedicated DWUs or per-TB scanned (serverless).
Strengths: tight integration with Power BI and the broader Microsoft ecosystem.
Use when: Azure-first organisation, especially with Power BI.

Databricks SQL Warehouse

Architecture: Photon engine on top of Delta Lake / Unity Catalog (lakehouse — see next lesson).
Strengths: unifies data engineering, ML, and BI in one platform.

Modelling: Star Schemas Still Win

Despite all the architectural change, Kimball-style dimensional modelling remains the dominant approach for analytical layers.

                 ┌────────────┐
                 │ dim_date   │
                 └────────────┘
                       │
   ┌────────────┐  ┌──────────┐  ┌─────────────┐
   │ dim_customer├─┤fact_orders├─┤ dim_product │
   └────────────┘  └──────────┘  └─────────────┘
                       │
                 ┌────────────┐
                 │ dim_store  │
                 └────────────┘

Fact tables — measurable events (orders, payments, page views). Long and narrow.
Dimension tables — descriptive attributes (customers, products, dates). Short and wide.
Grain — define what one row of the fact represents. Be explicit. ("One row per order line item, per day.")

Modern variants: "one big table" (OBT) for analytics dashboards, where all dimensions are denormalised into the fact for query performance. The warehouse handles the cost; the BI layer becomes simpler.

Performance Levers

Technique	What it does
Partitioning (BigQuery, Synapse)	Physically split table by date / key; queries scan only relevant partitions
Clustering / sort keys	Co-locate related rows for predicate pruning
Materialised views / incremental models	Pre-compute heavy aggregations
Result caching	Repeated queries pay nothing — most warehouses cache results
Right-sized warehouse	Don't run a 64-credit warehouse for a 1-second query

Cost Discipline

Watch unbounded scans — SELECT * on a giant table without filters.
Materialise expensive joins once per day, not once per dashboard load.
Use clustering / partition pruning aggressively.
Auto-suspend idle warehouses (Snowflake) or use serverless models with billing alerts.
Tag queries by user / project so you can attribute cost.

Cert Mapping

Cert	Warehouse focus
DP-203 / DP-700	Synapse / Fabric, dedicated and serverless SQL pools
AWS DEA-C01	Redshift, RA3 nodes, Spectrum, AQUA, Redshift Serverless
GCP PDE	BigQuery: slots, partitioning, clustering, materialised views, BI Engine

The next lesson covers the other major storage paradigm — data lakes and the lakehouse — and when to use it instead of (or alongside) a warehouse.