Cloud Data Platforms Compared

This is the platform-selection lesson. Each major vendor offers a complete stack — ingestion, storage, transformation, serving, governance — but their strengths differ. There is no universal "best".

AWS Data Stack

Layer	Service
Storage	S3 (lake), Redshift (warehouse), Aurora / DynamoDB (operational)
Ingestion	DMS, Kinesis Data Streams / Firehose, MSK, Glue crawlers
Transformation	Glue (Spark), EMR, Athena (Trino), Redshift
Streaming	Kinesis, MSK, Managed Service for Apache Flink
Orchestration	MWAA (Airflow), Step Functions, EventBridge
Catalog / governance	Glue Data Catalog, Lake Formation, IAM, DataZone
BI / ML	QuickSight, SageMaker

Strengths: deepest service catalog and broadest ecosystem; S3 is the de facto lake substrate.

Watch out: integrating the services (IAM, networking, catalogs) is your job, and there are many small services to learn.

Cert: AWS DEA-C01 (Data Engineer Associate) is the focused track.

Azure Data Stack

Layer	Service
Storage	ADLS Gen2 (lake), Synapse / Fabric Warehouse, Cosmos DB / SQL DB (operational)
Ingestion	ADF / Fabric Data Factory, Event Hubs, Stream Analytics
Transformation	Synapse Spark, Synapse SQL, Fabric notebooks, Databricks (often)
Streaming	Event Hubs (Kafka-compatible), Stream Analytics, Fabric eventstreams
Orchestration	ADF / Fabric pipelines, Logic Apps
Catalog / governance	Microsoft Purview, Fabric OneLake
BI / ML	Power BI (deeply integrated), Azure ML

Strengths: Power BI integration is unmatched; Microsoft Fabric is a serious bet on a unified experience over OneLake (Delta-based).

Watch out: Synapse and Fabric overlap; product strategy is in transition. Check current Microsoft guidance.

Cert: DP-700 Fabric Data Engineer Associate (DP-203 was retired on March 31, 2025).

GCP Data Stack

Layer	Service
Storage	GCS (lake), BigQuery (warehouse), Spanner / Firestore / Cloud SQL (operational)
Ingestion	Datastream (CDC), Pub/Sub, Dataflow, Storage Transfer
Transformation	BigQuery (SQL), Dataform / dbt, Dataflow (Beam), Dataproc (Spark)
Streaming	Pub/Sub, Dataflow streaming, BigQuery streaming inserts
Orchestration	Cloud Composer (Airflow), Workflows, Dataform schedules
Catalog / governance	Dataplex, Data Catalog, BigLake, policy tags, DLP
BI / ML	Looker, Vertex AI

Strengths: BigQuery is the gold standard for serverless analytics; Dataflow is exceptional for unified streaming + batch (Apache Beam).

Watch out: smaller market share than AWS/Azure means a smaller hiring pool and fewer third-party connectors.

Cert: Google Professional Data Engineer (PDE).

Databricks

Layer	Capability
Storage	Delta Lake on S3 / ADLS / GCS; Unity Catalog
Compute	Spark (Photon), SQL Warehouses, Delta Live Tables, Workflows
Streaming	Structured Streaming + Auto Loader
ML	MLflow, Mosaic AI, model serving, vector search
Governance	Unity Catalog (tables, files, ML models, dashboards)
BI	Databricks SQL + dashboards; integrates with Power BI / Tableau

Strengths: the leading lakehouse platform; unifies BI, data engineering, and ML on one substrate. A strong multi-cloud option for teams that want to avoid lock-in to a single cloud provider.

Watch out: still a separate platform you operate; cost can climb if clusters are mismanaged.

Cert: Databricks Certified Data Engineer Associate / Professional.

Snowflake

Layer	Capability
Storage	Snowflake-managed micro-partitions (S3/ADLS/GCS under the hood); Iceberg tables
Compute	Independent virtual warehouses; Snowpark (Python / Java / Scala)
Streaming	Snowpipe / Snowpipe Streaming; Dynamic Tables
ML	Snowpark ML, Cortex AI
Governance	Horizon Catalog, dynamic data masking, row access policies
BI	Snowsight dashboards and Streamlit in Snowflake apps; strong third-party support

Strengths: simplest warehouse experience, multi-cloud, strong governance, mature ecosystem.

Watch out: primarily warehouse-shaped; ML and unstructured workloads are catching up, but Databricks still leads there.

Cert: SnowPro Core; SnowPro Advanced Data Engineer.

Decision Framework

"We're already on cloud X"

Default to that hyperscaler's stack. The integration savings dominate. Add Databricks or Snowflake only if a specific workload demands it.

"We're multi-cloud or undecided"

Snowflake (warehouse-led) or Databricks (lakehouse-led) is the natural choice. Both run on all three clouds with consistent UX.

"BI-heavy, SQL-only team"

BigQuery, Snowflake, or Synapse / Fabric. Avoid Spark unless you have to.

"ML-heavy / unstructured / huge scale"

Databricks, or BigQuery + Vertex AI if you are on GCP.

"Real-time first"

Kafka + Flink (managed) on top of any of the above. Confluent + Snowflake / BigQuery is a popular combo.

"Power BI is non-negotiable"

Microsoft Fabric / Synapse will be the path of least resistance.
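
The scenarios above can be read as a small rule table. The following is a minimal sketch of that reading, not a definitive selector: the `TeamProfile` fields, the `recommend` function, and in particular the order in which the rules are checked are assumptions made for illustration, since the lesson does not rank the scenarios against each other.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for each hyperscaler's native stack.
HYPERSCALER_STACK = {
    "aws": "AWS data stack",
    "azure": "Azure data stack",
    "gcp": "GCP data stack",
}


@dataclass(frozen=True)
class TeamProfile:
    """Illustrative inputs; the field names are not from any real tool."""
    current_cloud: Optional[str] = None  # "aws", "azure", "gcp", or None if undecided
    power_bi_required: bool = False
    sql_only: bool = False
    ml_heavy: bool = False
    real_time_first: bool = False


def recommend(profile: TeamProfile) -> list[str]:
    """Return platform suggestions, base platform first, add-ons after."""
    picks: list[str] = []

    # Base platform. Assumption: a hard Power BI requirement outranks the
    # "stay on your current cloud" default.
    if profile.power_bi_required:
        picks.append("Microsoft Fabric / Synapse")
    elif profile.current_cloud in HYPERSCALER_STACK:
        picks.append(HYPERSCALER_STACK[profile.current_cloud])
    else:
        picks.append("Snowflake (warehouse-led) or Databricks (lakehouse-led)")

    # Workload-specific additions.
    if profile.ml_heavy:
        picks.append("BigQuery + Vertex AI" if profile.current_cloud == "gcp" else "Databricks")
    if profile.sql_only:
        picks.append("BigQuery, Snowflake, or Synapse / Fabric; avoid Spark")
    if profile.real_time_first:
        picks.append("Managed Kafka + Flink on top")
    return picks


print(recommend(TeamProfile(current_cloud="gcp", ml_heavy=True)))
print(recommend(TeamProfile(real_time_first=True)))
```

Encoding the rules this way mostly serves to expose where they conflict, for example a Power BI requirement on a team that already runs on AWS. Those conflicts are the cases worth discussing with stakeholders rather than settling in code.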

Cost: The Hidden Variable

Per-TB / per-credit pricing differs less than people think; workload patterns matter more.
Always-on streaming is usually the most expensive mode because compute never idles; batch with auto-suspend is usually the cheapest.
Track cost per pipeline / per team from day one; tag everything.
Beware "managed serverless" sticker shock — convenience has a price.
Reserved / committed-use discounts (Snowflake, BigQuery, Databricks) reward predictable workloads.
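
To make the workload-pattern point concrete, here is a rough sketch of the arithmetic. Everything in it is hypothetical: the flat `RATE_PER_HOUR`, the pipeline and team names, and the usage figures are invented for illustration and do not reflect any vendor's pricing. It compares an always-on cluster with a batch job that auto-suspends, then rolls tagged usage up per team.

```python
from collections import defaultdict
from dataclasses import dataclass

RATE_PER_HOUR = 8.0  # hypothetical flat compute rate, in any currency


@dataclass(frozen=True)
class UsageRecord:
    """One month of compute usage for a pipeline, tagged with its owner."""
    pipeline: str
    team: str
    compute_hours: float


def monthly_cost(hours_per_day: float, rate_per_hour: float, days: int = 30) -> float:
    """Cost of compute that is billed only while it is running."""
    return hours_per_day * days * rate_per_hour


def cost_by_team(records: list[UsageRecord], rate_per_hour: float) -> dict[str, float]:
    """Aggregate cost per team tag; untagged usage cannot be attributed."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.team] += record.compute_hours * rate_per_hour
    return dict(totals)


# Same hourly rate, very different bills: 24 h/day versus 2 h/day of billed compute.
always_on = monthly_cost(hours_per_day=24, rate_per_hour=RATE_PER_HOUR)
batch_auto_suspend = monthly_cost(hours_per_day=2, rate_per_hour=RATE_PER_HOUR)

usage = [
    UsageRecord("orders_stream", "sales", 720),
    UsageRecord("daily_rollup", "sales", 60),
    UsageRecord("feature_build", "data_science", 100),
]
per_team = cost_by_team(usage, RATE_PER_HOUR)

print(always_on, batch_auto_suspend, per_team)
```

With an identical unit price, the always-on workload costs twelve times as much as the batch one in this toy setup, which is why the usage pattern tends to matter more than the list price.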

The Convergence Trend

Open formats are eroding lock-in:

Iceberg is now first-class in Snowflake, BigQuery, AWS, Databricks.
Delta Lake is open source and supported by many engines.
Unity Catalog and the Iceberg REST catalog spec aim for cross-engine governance.

The 2030-era picture: data lives in object storage in an open table format, governed by an open catalog, queried by whichever engine fits the workload. Today we are about halfway there.

Cert Mapping Recap

Goal	Cert
Microsoft / Azure / Fabric	DP-700 (Fabric DE); DP-203 (retired)
AWS	DEA-C01
GCP	Professional Data Engineer
Multi-cloud lakehouse	Databricks Certified DE Associate / Professional
Snowflake	SnowPro Core, SnowPro Advanced DE

Learn one cloud or platform deeply and the others conceptually. The fundamentals from this course (pipelines, warehouses, lakes, Spark, streaming, governance) transfer across all of them.

Where to Go Next

Build something end-to-end on a free tier: ingest → land → transform → serve.
Pair this course with our DevOps and Cloud Security courses — modern data platforms live and die by both.
For depth, pick the cert most relevant to your environment and study it against real workloads, not just Q&A drills.
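
As a starting point for the end-to-end exercise, the ingest → land → transform → serve flow can be sketched locally before touching any cloud service. The example below is a toy under stated assumptions: it uses only the Python standard library, an in-memory SQLite database stands in for both the landing zone and the warehouse, and the `raw_orders` and `customer_totals` tables and the sample data are made up.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in a real project this would come from an API, file drop, or CDC feed.
RAW_CSV = """order_id,customer,amount
1,alice,30.00
2,bob,12.50
3,alice,7.50
"""


def ingest(raw: str) -> list[dict]:
    """Ingest: parse the source extract into rows without changing it."""
    return list(csv.DictReader(io.StringIO(raw)))


def land(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Land: store the rows as-is (all text) so the raw layer stays replayable."""
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :customer, :amount)", rows
    )


def transform(conn: sqlite3.Connection) -> None:
    """Transform: cast types and aggregate into a serving table."""
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT customer,
               SUM(CAST(amount AS REAL)) AS total_amount,
               COUNT(*) AS order_count
        FROM raw_orders
        GROUP BY customer
        """
    )


def serve(conn: sqlite3.Connection) -> list[tuple]:
    """Serve: the query a dashboard or API would run."""
    return conn.execute(
        "SELECT customer, total_amount, order_count FROM customer_totals ORDER BY customer"
    ).fetchall()


conn = sqlite3.connect(":memory:")
land(conn, ingest(RAW_CSV))
transform(conn)
report = serve(conn)
print(report)
```

Once the four stages work locally, each one can be swapped for a managed equivalent on the platform you chose (object storage for landing, a warehouse or lakehouse engine for the transform, a BI tool for serving) without changing the overall shape.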