This is the platform-selection lesson. Each major vendor offers a complete stack — ingestion, storage, transformation, serving, governance — but their strengths differ. There is no universal "best".
AWS Data Stack
| Layer | Service |
|---|---|
| Storage | S3 (lake), Redshift (warehouse), Aurora / DynamoDB (operational) |
| Ingestion | DMS, Kinesis Data Streams / Firehose, MSK, Glue crawlers |
| Transformation | Glue (Spark), EMR, Athena (Trino), Redshift |
| Streaming | Kinesis, MSK, Managed Service for Apache Flink |
| Orchestration | MWAA (Airflow), Step Functions, EventBridge |
| Catalog / governance | Glue Data Catalog, Lake Formation, IAM, DataZone |
| BI / ML | QuickSight, SageMaker |
Strengths: deepest service catalog, broadest ecosystem, S3 is the de-facto lake substrate.
Watch out: integration between services (IAM, networking, catalogs) is your job. Many small services to learn.
Cert: AWS DEA-C01 (Data Engineer Associate) is the focused track.
Azure Data Stack
| Layer | Service |
|---|---|
| Storage | ADLS Gen2 (lake), Synapse / Fabric Warehouse, Cosmos DB / SQL DB (operational) |
| Ingestion | ADF / Fabric Data Factory, Event Hubs, Stream Analytics |
| Transformation | Synapse Spark, Synapse SQL, Fabric notebooks, Databricks (often) |
| Streaming | Event Hubs (Kafka-compatible), Stream Analytics, Fabric eventstreams |
| Orchestration | ADF / Fabric pipelines, Logic Apps |
| Catalog / governance | Microsoft Purview, Fabric OneLake |
| BI / ML | Power BI (deeply integrated), Azure ML |
Strengths: Power BI integration is unmatched; Microsoft Fabric is a serious bet on a unified experience over OneLake (Delta-based).
Watch out: Synapse and Fabric overlap; product strategy is in transition. Check current Microsoft guidance.
Cert: DP-700 Fabric Data Engineer (replaces DP-203, which was retired in March 2025).
GCP Data Stack
| Layer | Service |
|---|---|
| Storage | GCS (lake), BigQuery (warehouse), Spanner / Firestore / Cloud SQL (operational) |
| Ingestion | Datastream (CDC), Pub/Sub, Dataflow, Storage Transfer |
| Transformation | BigQuery (SQL), Dataform / dbt, Dataflow (Beam), Dataproc (Spark) |
| Streaming | Pub/Sub, Dataflow streaming, BigQuery streaming inserts |
| Orchestration | Cloud Composer (Airflow), Workflows, Dataform schedules |
| Catalog / governance | Dataplex, Data Catalog, BigLake, policy tags, DLP |
| BI / ML | Looker, Vertex AI |
Strengths: BigQuery is the gold standard for serverless analytics; Dataflow is exceptional for unified streaming + batch (Apache Beam).
Watch out: smaller market share than AWS/Azure means a smaller hiring pool and fewer third-party connectors.
Cert: Google Professional Data Engineer (PDE).
Databricks
| Layer | Capability |
|---|---|
| Storage | Delta Lake on S3 / ADLS / GCS; Unity Catalog |
| Compute | Spark (Photon), SQL Warehouses, Delta Live Tables, Workflows |
| Streaming | Structured Streaming + Auto Loader |
| ML | MLflow, Mosaic AI, model serving, vector search |
| Governance | Unity Catalog (tables, files, ML models, dashboards) |
| BI | Databricks SQL + dashboards; integrates with Power BI / Tableau |
Strengths: the lakehouse leader; unifies BI, data engineering, and ML on one substrate. A strong multi-cloud option for teams that want to avoid lock-in to a single cloud provider.
Watch out: still a separate platform you operate; cost can climb if clusters are mismanaged.
Cert: Databricks Certified Data Engineer Associate / Professional.
Snowflake
| Layer | Capability |
|---|---|
| Storage | Snowflake-managed micro-partitions (S3/ADLS/GCS under the hood); Iceberg tables |
| Compute | Independent virtual warehouses; Snowpark (Python / Java / Scala) |
| Streaming | Snowpipe / Snowpipe Streaming; Dynamic Tables |
| ML | Snowpark ML, Cortex AI |
| Governance | Horizon Catalog, dynamic data masking, row access policies |
| BI | Native dashboards (Streamlit), strong third-party support |
Strengths: simplest warehouse experience, multi-cloud, strong governance, mature ecosystem.
Watch out: primarily warehouse-shaped; ML and unstructured workloads are catching up, but Databricks still leads there.
Cert: SnowPro Core; SnowPro Advanced Data Engineer.
Decision Framework
"We're already on cloud X"
Default to that hyperscaler's stack. The integration savings dominate. Add Databricks or Snowflake only if a specific workload demands it.
"We're multi-cloud or undecided"
Snowflake (warehouse-led) or Databricks (lakehouse-led) is the natural choice. Both run on all three clouds with consistent UX.
"BI-heavy, SQL-only team"
BigQuery, Snowflake, or Synapse / Fabric. Avoid Spark unless you have to.
"ML-heavy / unstructured / huge scale"
Databricks, or BigQuery + Vertex AI if you are on GCP.
"Real-time first"
Kafka + Flink (managed) on top of any of the above. Confluent + Snowflake / BigQuery is a popular combo.
"Power BI is non-negotiable"
Microsoft Fabric / Synapse will be the path of least resistance.
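The scenarios above can be read as a small set of rules. The following is a minimal sketch only: `TeamProfile`, its fields, and `recommend` are hypothetical names invented for illustration, and the order in which the rules are checked is an assumption, since the lesson does not rank the scenarios against each other.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels for each hyperscaler's native stack (illustrative only).
HYPERSCALER_STACKS = {
    "aws": "AWS native stack",
    "azure": "Azure native stack",
    "gcp": "GCP native stack",
}


@dataclass
class TeamProfile:
    """Hypothetical description of a team's constraints."""
    current_cloud: Optional[str] = None  # "aws", "azure", "gcp", or None if multi-cloud / undecided
    power_bi_required: bool = False
    sql_only: bool = False
    ml_heavy: bool = False
    real_time_first: bool = False


def recommend(profile: TeamProfile) -> List[str]:
    """Collect every suggestion from the framework that matches the profile."""
    suggestions: List[str] = []

    # "Power BI is non-negotiable"
    if profile.power_bi_required:
        suggestions.append("Microsoft Fabric / Synapse")

    # "We're already on cloud X" vs. "We're multi-cloud or undecided"
    if profile.current_cloud in HYPERSCALER_STACKS:
        suggestions.append(HYPERSCALER_STACKS[profile.current_cloud])
    else:
        suggestions.append("Snowflake (warehouse-led) or Databricks (lakehouse-led)")

    # "BI-heavy, SQL-only team"
    if profile.sql_only:
        suggestions.append("BigQuery, Snowflake, or Synapse / Fabric; avoid Spark unless required")

    # "ML-heavy / unstructured / huge scale"
    if profile.ml_heavy:
        if profile.current_cloud == "gcp":
            suggestions.append("Databricks, or BigQuery + Vertex AI")
        else:
            suggestions.append("Databricks")

    # "Real-time first"
    if profile.real_time_first:
        suggestions.append("Managed Kafka + Flink on top of the chosen platform")

    return suggestions


if __name__ == "__main__":
    for line in recommend(TeamProfile(current_cloud="gcp", ml_heavy=True)):
        print("-", line)
```

The function returns every matching suggestion rather than a single winner, which mirrors how the framework is meant to be used: as a list of starting points to weigh against each other.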
Cost: The Hidden Variable
- Per-TB / per-credit pricing differs less than people think; workload patterns matter more.
- Always-on streaming is the most expensive mode; batch with auto-suspend is the cheapest.
- Track cost per pipeline / per team from day one; tag everything.
- Beware "managed serverless" sticker shock — convenience has a price.
- Reserved / committed-use discounts (Snowflake, BigQuery, Databricks) reward predictable workloads.
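Two of the points above lend themselves to a quick back-of-the-envelope sketch: how much active compute time drives the bill, and how tags let you roll cost up per team. The hourly rate, the line-item layout, and the `team` tag key below are all made-up assumptions, not any vendor's pricing or billing-export schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def monthly_compute_cost(rate_per_hour: float, active_hours_per_day: float, days: int = 30) -> float:
    """Cost of compute that is billed only while it is running."""
    return rate_per_hour * active_hours_per_day * days


def cost_by_tag(line_items: Iterable[Mapping], tag_key: str) -> Dict[str, float]:
    """Sum cost per tag value; items missing the tag are grouped under 'untagged'."""
    totals: Dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    RATE = 4.0  # hypothetical price per compute-hour

    # Always-on streaming: the cluster runs 24 hours a day.
    always_on = monthly_compute_cost(RATE, active_hours_per_day=24)
    # Batch with auto-suspend: assume the warehouse is active 2 hours a day.
    batch = monthly_compute_cost(RATE, active_hours_per_day=2)
    print(f"always-on: {always_on:.2f}, batch with auto-suspend: {batch:.2f}")

    # Hypothetical billing line items; the third one was never tagged.
    items = [
        {"cost": 120.0, "tags": {"team": "analytics", "pipeline": "orders_daily"}},
        {"cost": 80.0, "tags": {"team": "analytics", "pipeline": "clickstream"}},
        {"cost": 45.5, "tags": {}},
    ]
    print(cost_by_tag(items, "team"))
```

With these assumed numbers the same hourly rate yields a twelvefold difference in monthly cost, and the untagged bucket shows spend that nobody can be held accountable for, which is why tagging from day one matters.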
The Convergence Trend
Open formats are eroding lock-in:
- Iceberg is now first-class in Snowflake, BigQuery, AWS, and Databricks.
- Delta is open-sourced and supported by many engines.
- Unity Catalog and the Iceberg REST catalog spec aim for cross-engine governance.
The 2030-era picture: data lives in object storage in an open table format, governed by an open catalog, queried by whichever engine fits the workload. Today we are about halfway there.
Cert Mapping Recap
| Goal | Cert |
|---|---|
| Microsoft / Azure / Fabric | DP-700 (Fabric DE), DP-203 (legacy) |
| AWS | DEA-C01 |
| GCP | Professional Data Engineer |
| Multi-cloud lakehouse | Databricks Certified DE Associate / Professional |
| Snowflake | SnowPro Core, SnowPro Advanced DE |
Pick one cloud / platform deeply, learn the others conceptually. The fundamentals from this course — pipelines, warehouses, lakes, Spark, streaming, governance — transfer across all of them.
Where to Go Next
- Build something end-to-end on a free tier: ingest → land → transform → serve.
- Pair this course with our DevOps and Cloud Security courses — modern data platforms live and die by both.
- For depth, pick the cert most relevant to your environment and study it against real workloads, not just Q&A drills.