A data warehouse is a database optimised for analytical queries — scanning millions or billions of rows, aggregating, joining, grouping. It is not optimised for thousands of small inserts and updates per second; that is what an OLTP database is for.
OLTP vs OLAP
| OLTP | OLAP / Warehouse | |
|---|---|---|
| Workload | Many small reads + writes | Few large reads |
| Storage | Row-oriented | Column-oriented |
| Indexes | Many B-tree | Zone maps, micro-partitions, clustering |
| Concurrency | Thousands of users | Tens to hundreds of analysts + dashboards |
| Examples | PostgreSQL, MySQL, SQL Server | Snowflake, BigQuery, Redshift, Synapse, Databricks SQL |
Why Columnar?
Analytical queries typically touch many rows but few columns. Storing data column-wise means:
- Only the columns referenced in the query are read from disk.
- Compression is dramatically better — columns of one type compress far more than mixed rows.
- Vectorised execution: process many values of the same type at once with SIMD.
Row store:
[1, "Alice", 30, "USA"][2, "Bob", 25, "UK"][3, "Eve", 40, "DE"]
Column store:
ids: [1, 2, 3]
names: ["Alice", "Bob", "Eve"]
ages: [30, 25, 40]
country: ["USA", "UK", "DE"]
For "average age by country", the column store reads only ages and country. The row store reads everything.
Massively Parallel Processing (MPP)
A warehouse splits a query across many workers. A 10-billion-row scan distributes across 100 nodes; each scans 100 million rows in parallel. Results are merged.
This is why warehouses can return in seconds what a single PostgreSQL would take hours on.
The Big Four (Five) Cloud Warehouses
Snowflake
- Architecture: separates storage (S3 / GCS / Blob) from compute ("virtual warehouses" — independent clusters you can size up/down/clone).
- Pricing: per-second compute, per-TB storage.
- Strengths: multi-cloud, easy to manage, time travel, zero-copy clones.
- Use when: you want a great DX, no infrastructure ops, and predictable analytics workloads.
Google BigQuery
- Architecture: fully serverless — no clusters to manage. Storage is GCS-backed; compute is "slots".
- Pricing: on-demand per-TB scanned, or flat-rate slot reservations.
- Strengths: truly zero ops, instant scale, integrated with GCP analytics + ML.
- Use when: you are on GCP, want serverless, or need huge scan elasticity.
Amazon Redshift
- Architecture: originally cluster-based (ra3 nodes); Redshift Serverless is now available.
- Pricing: per-cluster-hour or per-RPU for serverless.
- Strengths: deep AWS integration, Spectrum for querying S3 data.
- Use when: heavily on AWS, especially with S3-based lakes.
Azure Synapse / Microsoft Fabric
- Architecture: Synapse Dedicated SQL pools (MPP) and Serverless SQL; Fabric centralises this on OneLake (Delta-backed lakehouse).
- Pricing: dedicated DWUs or per-TB scanned (serverless).
- Strengths: tight integration with Power BI and the broader Microsoft ecosystem.
- Use when: Azure-first organisation, especially with Power BI.
Databricks SQL Warehouse
- Architecture: Photon engine on top of Delta Lake / Unity Catalog (lakehouse — see next lesson).
- Strengths: unifies data engineering, ML, and BI in one platform.
Modelling: Star Schemas Still Win
Despite all the architectural change, Kimball-style dimensional modelling remains the dominant approach for analytical layers.
┌────────────┐
│ dim_date │
└────────────┘
│
┌────────────┐ ┌──────────┐ ┌─────────────┐
│ dim_customer├─┤fact_orders├─┤ dim_product │
└────────────┘ └──────────┘ └─────────────┘
│
┌────────────┐
│ dim_store │
└────────────┘
- Fact tables — measurable events (orders, payments, page views). Long and narrow.
- Dimension tables — descriptive attributes (customers, products, dates). Short and wide.
- Grain — define what one row of the fact represents. Be explicit. ("One row per order line item, per day.")
Modern variants: "one big table" (OBT) for analytics dashboards, where all dimensions are denormalised into the fact for query performance. The warehouse handles the cost; the BI layer becomes simpler.
Performance Levers
| Technique | What it does |
|---|---|
| Partitioning (BigQuery, Synapse) | Physically split table by date / key; queries scan only relevant partitions |
| Clustering / sort keys | Co-locate related rows for predicate pruning |
| Materialised views / incremental models | Pre-compute heavy aggregations |
| Result caching | Repeated queries pay nothing — most warehouses cache results |
| Right-sized warehouse | Don't run a 64-credit warehouse for a 1-second query |
Cost Discipline
- Watch unbounded scans —
SELECT *on a giant table without filters. - Materialise expensive joins once per day, not once per dashboard load.
- Use clustering / partition pruning aggressively.
- Auto-suspend idle warehouses (Snowflake) or use serverless models with billing alerts.
- Tag queries by user / project so you can attribute cost.
Cert Mapping
| Cert | Warehouse focus |
|---|---|
| DP-203 / DP-700 | Synapse / Fabric, dedicated and serverless SQL pools |
| AWS DEA-C01 | Redshift, RA3 nodes, Spectrum, AQUA, Redshift Serverless |
| GCP PDE | BigQuery: slots, partitioning, clustering, materialised views, BI Engine |
The next lesson covers the other major storage paradigm — data lakes and the lakehouse — and when to use it instead of (or alongside) a warehouse.