Skip to content
5 min read·Lesson 6 of 10

Data Lakes and Lakehouses

Object storage as a data platform: lakes, the swamp problem, and how Delta, Iceberg, and Hudi solved it with the lakehouse architecture.

If a warehouse is a curated library, a data lake is a giant warehouse-store of crates: cheap, flexible, holds anything — but you need to know what you have and where it is, or it becomes a swamp.

What a Data Lake Is

  • Storage: object store (S3, ADLS Gen2, GCS) — cheap, infinitely scalable, durable.
  • Format: open file formats — Parquet, ORC, Avro, JSON, CSV.
  • Compute: separate engines that read those files — Spark, Presto / Trino, Athena, Dremio, BigQuery external tables.
  • Schema: applied on read, not on write.

Originally hyped as the cure-all replacement for warehouses ("Hadoop will eat the database"). Reality was messier.

The Swamp Problem

Without discipline, lakes accumulate problems:

  • Files everywhere, no canonical view of "the orders table".
  • Schema drifts silently — column added, column type changes — readers break.
  • No transactions: a partial write leaves consumers reading half-data.
  • No reliable updates / deletes — GDPR requests are painful.
  • "Small files" problem: millions of tiny files crush query engines.
  • No time travel: cannot answer "what did this table look like yesterday?".

This is why pure lakes lost ground to warehouses for analytics in the 2015–2020 era.

The Lakehouse Fix: Open Table Formats

Three open formats brought ACID semantics to object storage:

FormatOriginStrengths
Delta LakeDatabricks (open-source 2019)Mature, deep Spark integration, Unity Catalog
Apache IcebergNetflix → ApacheEngine-agnostic, strong on schema evolution + hidden partitioning; backed by Snowflake, AWS, GCP
Apache HudiUber → ApacheStrong on streaming upserts, CDC patterns

What they all add on top of Parquet:

  • ACID transactions — atomic writes, snapshot isolation. No half-files.
  • Schema enforcement and evolution — columns add safely, type changes are tracked.
  • Time travel — query a table "as of" timestamp or version.
  • Upserts and deletes — MERGE INTO works on a lake.
  • Optimisations — file compaction, Z-ordering, data skipping.
-- Spark SQL on a Delta table
MERGE INTO silver.customers AS t
USING staging.customer_changes AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel
SELECT * FROM silver.customers VERSION AS OF 142;
SELECT * FROM silver.customers TIMESTAMP AS OF '2025-09-01 00:00:00';

Lakehouse Architecture

Coined by Databricks but now broadly used. The pattern:

[ Sources ] ─→ [ Bronze (raw) ] ─→ [ Silver (cleaned) ] ─→ [ Gold (curated marts) ]
                  Delta/Iceberg       Delta/Iceberg            Delta/Iceberg
                          │
                          ▼
            Engines query directly:
            Spark, Trino, Presto, Athena,
            Snowflake (Iceberg), BigQuery (Iceberg)
  • Bronze — raw landing, append-only, exact source replica.
  • Silver — cleaned, conformed, deduplicated; production-ready datasets.
  • Gold — business-level aggregates, marts, ML features.

The medallion (bronze/silver/gold) terminology is Databricks-flavoured but the layering is universal.

When to Pick a Lakehouse vs Warehouse

Choose lakehouse whenChoose warehouse when
Heavy ML / data science workloadsPure SQL analytics + BI
Semi-structured / nested dataStructured tabular data
Petabyte+ scaleTens of TB to low PB
Need to share data across many enginesOne team, one engine
Multi-language (Python, Scala, SQL)SQL-only is fine
Streaming + batch unifiedMostly batch

Reality: many teams run both. Land raw data in a lake; promote curated marts to a warehouse; or vice versa with external tables (Snowflake reading Iceberg, BigQuery reading Iceberg, Redshift Spectrum reading Parquet).

Object Storage Layout Best Practices

  • Partition by date (yyyy/mm/dd) and one or two business keys at most.
  • Target file sizes ~128–512 MB (Parquet); avoid the small files problem.
  • Compact regularly (OPTIMIZE / VACUUM in Delta; rewrite_data_files in Iceberg).
  • Use the table format's partition transforms (Iceberg hidden partitioning, Delta generated columns).
  • Track table-level metadata in a catalog (Glue Data Catalog, Unity Catalog, Hive Metastore, REST catalog).

Catalogs Matter

A table format on object storage is useless without a catalog telling engines where the metadata lives. Common catalogs:

  • AWS Glue Data Catalog — Hive-compatible, used by Athena, Redshift Spectrum, EMR.
  • Unity Catalog — Databricks-native, multi-cloud, supports Delta and Iceberg.
  • Iceberg REST catalog — open spec; AWS, Snowflake, Tabular, Polaris implementations.
  • Hive Metastore — older, but still everywhere.

Choosing a catalog is increasingly the strategic decision — once data is locked into one catalog, switching engines later is hard.

Cert Mapping

CertLakehouse content
DP-203 / DP-700ADLS Gen2, Synapse Spark / Fabric Lakehouse on OneLake (Delta-backed)
AWS DEA-C01S3, Lake Formation, Glue Catalog, Athena, Iceberg in Athena, EMR
GCP PDEGCS, Dataproc, BigQuery + BigLake (Iceberg / Delta external tables)
Databricks DEDelta Lake deep — MERGE, OPTIMIZE, Z-order, Auto Loader, Live Tables

The Practical View

Lakes alone are 2010s thinking. Lakehouses with Delta / Iceberg are the modern default for non-warehouse workloads. For most teams, the architecture is:

  • Object storage + open table format = the durable foundation.
  • Warehouse on top for SQL-first analytics and BI.
  • Spark / Trino on top for ML and large-scale processing.

The next lesson goes deeper into Spark and distributed compute — the engine that powers most lakehouse workloads.

Key Takeaways

  • A data lake is cheap object storage holding raw data in open formats — flexible but easy to mismanage.
  • Without governance, lakes become "data swamps" — unqueryable and untrusted.
  • Open table formats (Delta, Iceberg, Hudi) bring ACID, schema evolution, and time travel to lakes.
  • A lakehouse combines warehouse semantics with lake economics on object storage.
  • Choose lake/lakehouse for ML, semi-structured data, and very large scale; warehouse for pure SQL analytics.

Test your knowledge

Try exam-style practice questions to reinforce what you've learned.

Practice Questions →