Data Lakes and Lakehouses — Data Engineering Fundamentals | CertQnA

If a warehouse is a curated library, a data lake is a giant warehouse-store of crates: cheap, flexible, holds anything — but you need to know what you have and where it is, or it becomes a swamp.

What a Data Lake Is

Storage: object store (S3, ADLS Gen2, GCS) — cheap, infinitely scalable, durable.
Format: open file formats — Parquet, ORC, Avro, JSON, CSV.
Compute: separate engines that read those files — Spark, Presto / Trino, Athena, Dremio, BigQuery external tables.
Schema: applied on read, not on write.

Originally hyped as the cure-all replacement for warehouses ("Hadoop will eat the database"). Reality was messier.

The Swamp Problem

Without discipline, lakes accumulate problems:

Files everywhere, no canonical view of "the orders table".
Schema drifts silently — column added, column type changes — readers break.
No transactions: a partial write leaves consumers reading half-data.
No reliable updates / deletes — GDPR requests are painful.
"Small files" problem: millions of tiny files crush query engines.
No time travel: cannot answer "what did this table look like yesterday?".

This is why pure lakes lost ground to warehouses for analytics in the 2015–2020 era.

The Lakehouse Fix: Open Table Formats

Three open formats brought ACID semantics to object storage:

Format	Origin	Strengths
Delta Lake	Databricks (open-source 2019)	Mature, deep Spark integration, Unity Catalog
Apache Iceberg	Netflix → Apache	Engine-agnostic, strong on schema evolution + hidden partitioning; backed by Snowflake, AWS, GCP
Apache Hudi	Uber → Apache	Strong on streaming upserts, CDC patterns

What they all add on top of Parquet:

ACID transactions — atomic writes, snapshot isolation. No half-files.
Schema enforcement and evolution — columns add safely, type changes are tracked.
Time travel — query a table "as of" timestamp or version.
Upserts and deletes — MERGE INTO works on a lake.
Optimisations — file compaction, Z-ordering, data skipping.

-- Spark SQL on a Delta table
MERGE INTO silver.customers AS t
USING staging.customer_changes AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel
SELECT * FROM silver.customers VERSION AS OF 142;
SELECT * FROM silver.customers TIMESTAMP AS OF '2025-09-01 00:00:00';

Lakehouse Architecture

Coined by Databricks but now broadly used. The pattern:

[ Sources ] ─→ [ Bronze (raw) ] ─→ [ Silver (cleaned) ] ─→ [ Gold (curated marts) ]
                  Delta/Iceberg       Delta/Iceberg            Delta/Iceberg
                          │
                          ▼
            Engines query directly:
            Spark, Trino, Presto, Athena,
            Snowflake (Iceberg), BigQuery (Iceberg)

Bronze — raw landing, append-only, exact source replica.
Silver — cleaned, conformed, deduplicated; production-ready datasets.
Gold — business-level aggregates, marts, ML features.

The medallion (bronze/silver/gold) terminology is Databricks-flavoured but the layering is universal.

When to Pick a Lakehouse vs Warehouse

Choose lakehouse when	Choose warehouse when
Heavy ML / data science workloads	Pure SQL analytics + BI
Semi-structured / nested data	Structured tabular data
Petabyte+ scale	Tens of TB to low PB
Need to share data across many engines	One team, one engine
Multi-language (Python, Scala, SQL)	SQL-only is fine
Streaming + batch unified	Mostly batch

Reality: many teams run both. Land raw data in a lake; promote curated marts to a warehouse; or vice versa with external tables (Snowflake reading Iceberg, BigQuery reading Iceberg, Redshift Spectrum reading Parquet).

Object Storage Layout Best Practices

Partition by date (yyyy/mm/dd) and one or two business keys at most.
Target file sizes ~128–512 MB (Parquet); avoid the small files problem.
Compact regularly (OPTIMIZE / VACUUM in Delta; rewrite_data_files in Iceberg).
Use the table format's partition transforms (Iceberg hidden partitioning, Delta generated columns).
Track table-level metadata in a catalog (Glue Data Catalog, Unity Catalog, Hive Metastore, REST catalog).

Catalogs Matter

A table format on object storage is useless without a catalog telling engines where the metadata lives. Common catalogs:

AWS Glue Data Catalog — Hive-compatible, used by Athena, Redshift Spectrum, EMR.
Unity Catalog — Databricks-native, multi-cloud, supports Delta and Iceberg.
Iceberg REST catalog — open spec; AWS, Snowflake, Tabular, Polaris implementations.
Hive Metastore — older, but still everywhere.

Choosing a catalog is increasingly the strategic decision — once data is locked into one catalog, switching engines later is hard.

Cert Mapping

Cert	Lakehouse content
DP-203 / DP-700	ADLS Gen2, Synapse Spark / Fabric Lakehouse on OneLake (Delta-backed)
AWS DEA-C01	S3, Lake Formation, Glue Catalog, Athena, Iceberg in Athena, EMR
GCP PDE	GCS, Dataproc, BigQuery + BigLake (Iceberg / Delta external tables)
Databricks DE	Delta Lake deep — MERGE, OPTIMIZE, Z-order, Auto Loader, Live Tables

The Practical View

Lakes alone are 2010s thinking. Lakehouses with Delta / Iceberg are the modern default for non-warehouse workloads. For most teams, the architecture is:

Object storage + open table format = the durable foundation.
Warehouse on top for SQL-first analytics and BI.
Spark / Trino on top for ML and large-scale processing.

The next lesson goes deeper into Spark and distributed compute — the engine that powers most lakehouse workloads.