Skip to content
5 min read·Lesson 5 of 10

Data Warehouses

How cloud data warehouses work, why they are columnar, and how Snowflake, BigQuery, Redshift, and Synapse compare for analytics workloads.

A data warehouse is a database optimised for analytical queries — scanning millions or billions of rows, aggregating, joining, grouping. It is not optimised for thousands of small inserts and updates per second; that is what an OLTP database is for.

OLTP vs OLAP

OLTPOLAP / Warehouse
WorkloadMany small reads + writesFew large reads
StorageRow-orientedColumn-oriented
IndexesMany B-treeZone maps, micro-partitions, clustering
ConcurrencyThousands of usersTens to hundreds of analysts + dashboards
ExamplesPostgreSQL, MySQL, SQL ServerSnowflake, BigQuery, Redshift, Synapse, Databricks SQL

Why Columnar?

Analytical queries typically touch many rows but few columns. Storing data column-wise means:

  • Only the columns referenced in the query are read from disk.
  • Compression is dramatically better — columns of one type compress far more than mixed rows.
  • Vectorised execution: process many values of the same type at once with SIMD.
Row store:
[1, "Alice", 30, "USA"][2, "Bob", 25, "UK"][3, "Eve", 40, "DE"]

Column store:
ids:      [1, 2, 3]
names:    ["Alice", "Bob", "Eve"]
ages:     [30, 25, 40]
country:  ["USA", "UK", "DE"]

For "average age by country", the column store reads only ages and country. The row store reads everything.

Massively Parallel Processing (MPP)

A warehouse splits a query across many workers. A 10-billion-row scan distributes across 100 nodes; each scans 100 million rows in parallel. Results are merged.

This is why warehouses can return in seconds what a single PostgreSQL would take hours on.

The Big Four (Five) Cloud Warehouses

Snowflake

  • Architecture: separates storage (S3 / GCS / Blob) from compute ("virtual warehouses" — independent clusters you can size up/down/clone).
  • Pricing: per-second compute, per-TB storage.
  • Strengths: multi-cloud, easy to manage, time travel, zero-copy clones.
  • Use when: you want a great DX, no infrastructure ops, and predictable analytics workloads.

Google BigQuery

  • Architecture: fully serverless — no clusters to manage. Storage is GCS-backed; compute is "slots".
  • Pricing: on-demand per-TB scanned, or flat-rate slot reservations.
  • Strengths: truly zero ops, instant scale, integrated with GCP analytics + ML.
  • Use when: you are on GCP, want serverless, or need huge scan elasticity.

Amazon Redshift

  • Architecture: originally cluster-based (ra3 nodes); Redshift Serverless is now available.
  • Pricing: per-cluster-hour or per-RPU for serverless.
  • Strengths: deep AWS integration, Spectrum for querying S3 data.
  • Use when: heavily on AWS, especially with S3-based lakes.

Azure Synapse / Microsoft Fabric

  • Architecture: Synapse Dedicated SQL pools (MPP) and Serverless SQL; Fabric centralises this on OneLake (Delta-backed lakehouse).
  • Pricing: dedicated DWUs or per-TB scanned (serverless).
  • Strengths: tight integration with Power BI and the broader Microsoft ecosystem.
  • Use when: Azure-first organisation, especially with Power BI.

Databricks SQL Warehouse

  • Architecture: Photon engine on top of Delta Lake / Unity Catalog (lakehouse — see next lesson).
  • Strengths: unifies data engineering, ML, and BI in one platform.

Modelling: Star Schemas Still Win

Despite all the architectural change, Kimball-style dimensional modelling remains the dominant approach for analytical layers.

                 ┌────────────┐
                 │ dim_date   │
                 └────────────┘
                       │
   ┌────────────┐  ┌──────────┐  ┌─────────────┐
   │ dim_customer├─┤fact_orders├─┤ dim_product │
   └────────────┘  └──────────┘  └─────────────┘
                       │
                 ┌────────────┐
                 │ dim_store  │
                 └────────────┘
  • Fact tables — measurable events (orders, payments, page views). Long and narrow.
  • Dimension tables — descriptive attributes (customers, products, dates). Short and wide.
  • Grain — define what one row of the fact represents. Be explicit. ("One row per order line item, per day.")

Modern variants: "one big table" (OBT) for analytics dashboards, where all dimensions are denormalised into the fact for query performance. The warehouse handles the cost; the BI layer becomes simpler.

Performance Levers

TechniqueWhat it does
Partitioning (BigQuery, Synapse)Physically split table by date / key; queries scan only relevant partitions
Clustering / sort keysCo-locate related rows for predicate pruning
Materialised views / incremental modelsPre-compute heavy aggregations
Result cachingRepeated queries pay nothing — most warehouses cache results
Right-sized warehouseDon't run a 64-credit warehouse for a 1-second query

Cost Discipline

  • Watch unbounded scans — SELECT * on a giant table without filters.
  • Materialise expensive joins once per day, not once per dashboard load.
  • Use clustering / partition pruning aggressively.
  • Auto-suspend idle warehouses (Snowflake) or use serverless models with billing alerts.
  • Tag queries by user / project so you can attribute cost.

Cert Mapping

CertWarehouse focus
DP-203 / DP-700Synapse / Fabric, dedicated and serverless SQL pools
AWS DEA-C01Redshift, RA3 nodes, Spectrum, AQUA, Redshift Serverless
GCP PDEBigQuery: slots, partitioning, clustering, materialised views, BI Engine

The next lesson covers the other major storage paradigm — data lakes and the lakehouse — and when to use it instead of (or alongside) a warehouse.

Key Takeaways

  • A warehouse is an OLAP store optimised for large analytical queries, not transactional writes.
  • Columnar storage and massively parallel processing are the two architectural cornerstones.
  • Snowflake separates storage and compute; BigQuery is fully serverless; Redshift is cluster-based with serverless option.
  • Dimensional modelling (star schemas) still applies, alongside modern wide-table patterns.
  • Cost discipline (clustering, partitioning, materialisation) is half the job of a warehouse engineer.

Test your knowledge

Try exam-style practice questions to reinforce what you've learned.

Practice Questions →