If a warehouse is a curated library, a data lake is a giant warehouse-store of crates: cheap, flexible, holds anything — but you need to know what you have and where it is, or it becomes a swamp.
What a Data Lake Is
- Storage: object store (S3, ADLS Gen2, GCS) — cheap, infinitely scalable, durable.
- Format: open file formats — Parquet, ORC, Avro, JSON, CSV.
- Compute: separate engines that read those files — Spark, Presto / Trino, Athena, Dremio, BigQuery external tables.
- Schema: applied on read, not on write.
Originally hyped as the cure-all replacement for warehouses ("Hadoop will eat the database"). Reality was messier.
The Swamp Problem
Without discipline, lakes accumulate problems:
- Files everywhere, no canonical view of "the orders table".
- Schema drifts silently — column added, column type changes — readers break.
- No transactions: a partial write leaves consumers reading half-data.
- No reliable updates / deletes — GDPR requests are painful.
- "Small files" problem: millions of tiny files crush query engines.
- No time travel: cannot answer "what did this table look like yesterday?".
This is why pure lakes lost ground to warehouses for analytics in the 2015–2020 era.
The Lakehouse Fix: Open Table Formats
Three open formats brought ACID semantics to object storage:
| Format | Origin | Strengths |
|---|---|---|
| Delta Lake | Databricks (open-source 2019) | Mature, deep Spark integration, Unity Catalog |
| Apache Iceberg | Netflix → Apache | Engine-agnostic, strong on schema evolution + hidden partitioning; backed by Snowflake, AWS, GCP |
| Apache Hudi | Uber → Apache | Strong on streaming upserts, CDC patterns |
What they all add on top of Parquet:
- ACID transactions — atomic writes, snapshot isolation. No half-files.
- Schema enforcement and evolution — columns add safely, type changes are tracked.
- Time travel — query a table "as of" timestamp or version.
- Upserts and deletes — MERGE INTO works on a lake.
- Optimisations — file compaction, Z-ordering, data skipping.
-- Spark SQL on a Delta table
MERGE INTO silver.customers AS t
USING staging.customer_changes AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- Time travel
SELECT * FROM silver.customers VERSION AS OF 142;
SELECT * FROM silver.customers TIMESTAMP AS OF '2025-09-01 00:00:00';
Lakehouse Architecture
Coined by Databricks but now broadly used. The pattern:
[ Sources ] ─→ [ Bronze (raw) ] ─→ [ Silver (cleaned) ] ─→ [ Gold (curated marts) ]
Delta/Iceberg Delta/Iceberg Delta/Iceberg
│
▼
Engines query directly:
Spark, Trino, Presto, Athena,
Snowflake (Iceberg), BigQuery (Iceberg)
- Bronze — raw landing, append-only, exact source replica.
- Silver — cleaned, conformed, deduplicated; production-ready datasets.
- Gold — business-level aggregates, marts, ML features.
The medallion (bronze/silver/gold) terminology is Databricks-flavoured but the layering is universal.
When to Pick a Lakehouse vs Warehouse
| Choose lakehouse when | Choose warehouse when |
|---|---|
| Heavy ML / data science workloads | Pure SQL analytics + BI |
| Semi-structured / nested data | Structured tabular data |
| Petabyte+ scale | Tens of TB to low PB |
| Need to share data across many engines | One team, one engine |
| Multi-language (Python, Scala, SQL) | SQL-only is fine |
| Streaming + batch unified | Mostly batch |
Reality: many teams run both. Land raw data in a lake; promote curated marts to a warehouse; or vice versa with external tables (Snowflake reading Iceberg, BigQuery reading Iceberg, Redshift Spectrum reading Parquet).
Object Storage Layout Best Practices
- Partition by date (yyyy/mm/dd) and one or two business keys at most.
- Target file sizes ~128–512 MB (Parquet); avoid the small files problem.
- Compact regularly (OPTIMIZE / VACUUM in Delta; rewrite_data_files in Iceberg).
- Use the table format's partition transforms (Iceberg hidden partitioning, Delta generated columns).
- Track table-level metadata in a catalog (Glue Data Catalog, Unity Catalog, Hive Metastore, REST catalog).
Catalogs Matter
A table format on object storage is useless without a catalog telling engines where the metadata lives. Common catalogs:
- AWS Glue Data Catalog — Hive-compatible, used by Athena, Redshift Spectrum, EMR.
- Unity Catalog — Databricks-native, multi-cloud, supports Delta and Iceberg.
- Iceberg REST catalog — open spec; AWS, Snowflake, Tabular, Polaris implementations.
- Hive Metastore — older, but still everywhere.
Choosing a catalog is increasingly the strategic decision — once data is locked into one catalog, switching engines later is hard.
Cert Mapping
| Cert | Lakehouse content |
|---|---|
| DP-203 / DP-700 | ADLS Gen2, Synapse Spark / Fabric Lakehouse on OneLake (Delta-backed) |
| AWS DEA-C01 | S3, Lake Formation, Glue Catalog, Athena, Iceberg in Athena, EMR |
| GCP PDE | GCS, Dataproc, BigQuery + BigLake (Iceberg / Delta external tables) |
| Databricks DE | Delta Lake deep — MERGE, OPTIMIZE, Z-order, Auto Loader, Live Tables |
The Practical View
Lakes alone are 2010s thinking. Lakehouses with Delta / Iceberg are the modern default for non-warehouse workloads. For most teams, the architecture is:
- Object storage + open table format = the durable foundation.
- Warehouse on top for SQL-first analytics and BI.
- Spark / Trino on top for ML and large-scale processing.
The next lesson goes deeper into Spark and distributed compute — the engine that powers most lakehouse workloads.