Data engineering is the discipline of designing and operating systems that move data from where it is produced to where it is useful. Where software engineering ships features, data engineering ships data products — clean, reliable datasets that downstream analysts, scientists, and applications depend on.
What Data Engineers Actually Build
- Ingestion — pull data from databases, SaaS APIs, event streams, files; land it in a central place.
- Transformation — clean, normalise, join, deduplicate, enrich, aggregate.
- Storage — choose where data lives: warehouse, lake, lakehouse, OLTP, NoSQL, time-series.
- Modelling — design schemas (star, snowflake, data vault) so the data is queryable and trustworthy.
- Serving — make data available via SQL, APIs, dashboards, ML feature stores, streaming topics.
- Operating — monitor freshness, quality, cost, lineage, and respond when something breaks.
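As a rough illustration of the transformation step listed above, here is a minimal Python sketch that cleans, normalises, deduplicates, and aggregates a handful of order records. The record layout, the `RAW_ORDERS` sample, and the function names are assumptions made up for this example; a real pipeline would more likely express the same logic in SQL or Spark.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical raw records, as they might look right after ingestion.
RAW_ORDERS = [
    {"order_id": "1", "email": " Ana@Example.com ", "amount": "19.90"},
    {"order_id": "1", "email": "ana@example.com", "amount": "19.90"},  # duplicate
    {"order_id": "2", "email": "BOB@example.com", "amount": "5.00"},
    {"order_id": "3", "email": "ana@example.com", "amount": None},     # missing amount
]


def transform(rows):
    """Clean, normalise, and deduplicate raw order rows."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["amount"] is None:       # clean: drop rows without an amount
            continue
        if row["order_id"] in seen:     # deduplicate on the business key
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            "email": row["email"].strip().lower(),               # normalise
            "amount_cents": int(Decimal(row["amount"]) * 100),   # exact cents
        })
    return cleaned


def revenue_by_customer(rows):
    """Aggregate cleaned rows into total cents per customer."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["email"]] += row["amount_cents"]
    return dict(totals)


print(revenue_by_customer(transform(RAW_ORDERS)))
```

Amounts are parsed with `Decimal` and stored as integer cents so that aggregation does not accumulate floating-point error.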
How the Role Compares
| | Software engineer | Data engineer | Data analyst / scientist |
|---|---|---|---|
| Primary artefact | Application code, services | Pipelines, models, datasets | Insights, reports, ML models |
| Primary language | App language (TS, Java, Go) | SQL + Python (+ Scala/Java) | SQL + Python / R |
| "Production" means | Service is up and serving requests | Data lands on time, correct, queryable | Dashboard or model is accurate and used |
| Failure looks like | 500 error, latency spike | Stale data, wrong number, broken schema | Bad decision based on the data |
A good data engineer borrows software practices (version control, CI, tests, code review) but applies them to data — testing not just code, but datasets.
The Modern Data Stack
The dominant pattern today, sometimes called the "MDS":
[ Sources ]
- OLTP DBs, SaaS, events, files
│
▼ (ingestion: Fivetran, Airbyte, custom, CDC)
[ Cloud warehouse / lakehouse ]
- Snowflake, BigQuery, Redshift, Databricks, Synapse
│
▼ (transformation: dbt, Spark, native SQL)
[ Modelled / curated layer ]
- dim_customer, fact_orders, daily_active_users
│
├──→ [ BI: Looker, Power BI, Tableau, Mode ]
├──→ [ Reverse ETL: Hightouch, Census ]
├──→ [ ML feature stores ]
└──→ [ APIs / apps ]
Two structural shifts make this stack possible:
- Cheap, scalable storage and compute in cloud warehouses, which make it practical to load raw data first and transform it later (ELT rather than ETL).
- SQL-first transformation tools (dbt) — analytics engineers and data engineers share a language.
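The "load raw first, transform later" pattern can be sketched in a few lines. The example below uses Python's built-in `sqlite3` module as a stand-in for a cloud warehouse; the `raw_orders` columns and sample rows are assumptions for illustration, and `fact_orders` borrows its name from the diagram above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# 1. Load: land the source rows untouched, as text, in a raw table.
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        ("1", "ana", "19.90", "paid"),
        ("2", "bob", "5.00", "paid"),
        ("3", "ana", "7.50", "cancelled"),
    ],
)

# 2. Transform: build the curated table afterwards, in SQL, inside the same engine.
conn.execute("""
    CREATE TABLE fact_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           customer,
           CAST(amount AS REAL)      AS amount
    FROM raw_orders
    WHERE status = 'paid'
""")

rows = conn.execute(
    "SELECT order_id, customer, amount FROM fact_orders ORDER BY order_id"
).fetchall()
print(rows)
```

Because the raw table is kept as loaded, the transformation can be changed and re-run later without going back to the source system.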
Data Engineering vs Big Data
"Big data" was a useful early-2010s framing — Hadoop, MapReduce, terabytes that did not fit on one machine. Today, cloud warehouses transparently scale to petabytes, and most teams' "big" data is comfortably handled by Snowflake or BigQuery without thinking about clusters. The job has shifted from "make this fit" to "make this trustworthy and timely."
Spark, Kafka, and the Hadoop ecosystem still matter — for true high-volume streaming, ML feature engineering, and lakehouse workloads. They are tools, not the centre of gravity.
Data Engineering vs MLOps and Data Science
The boundaries blur:
- Data engineers build the pipelines that produce features and training data.
- ML engineers / MLOps run training, serving, monitoring of models.
- Data scientists explore, build, and evaluate models on those datasets.
In small companies one person wears all three hats. In large ones the split is sharper. The throughline: data engineering is the foundation; everything downstream depends on it.
The Three Constants of the Job
1. Quality
A pipeline that runs on time but produces wrong numbers is worse than no pipeline. Tests on schemas, freshness, row counts, and business invariants are non-negotiable.
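A minimal sketch of what such tests might look like, covering a schema check, a row-count check, and two business invariants (non-negative amounts, unique keys). The column names and the `check_dataset` function are hypothetical; teams more often express these checks in dbt or a dedicated data-quality tool.

```python
# Hypothetical expected schema for an orders dataset.
EXPECTED_COLUMNS = {"order_id": int, "amount": float, "currency": str}


def check_dataset(rows, min_rows=1):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []

    # Row-count check: an empty or truncated load is a failure in itself.
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below minimum {min_rows}")

    for i, row in enumerate(rows):
        # Schema check: exact column set, then column types.
        if set(row) != set(EXPECTED_COLUMNS):
            failures.append(f"row {i}: unexpected columns {sorted(row)}")
            continue
        for col, typ in EXPECTED_COLUMNS.items():
            if not isinstance(row[col], typ):
                failures.append(f"row {i}: {col} is not {typ.__name__}")
        # Business invariant: order amounts are never negative.
        if isinstance(row["amount"], float) and row["amount"] < 0:
            failures.append(f"row {i}: negative amount {row['amount']}")

    # Business invariant: order_id is unique across the dataset.
    ids = [row.get("order_id") for row in rows]
    if len(ids) != len(set(ids)):
        failures.append("order_id is not unique")

    return failures


good = [{"order_id": 1, "amount": 9.5, "currency": "EUR"}]
print(check_dataset(good))
```

Returning every failure, rather than stopping at the first, makes it easier to see the full extent of a bad load in one run.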
2. Reliability
Downstream consumers (executives, automated decisions, ML models) bake your output into their workflows. SLAs matter: "the daily revenue mart lands by 6am" is a contract.
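One way to make a contract like "lands by 6am" checkable is a small function that compares the last successful load time against the deadline. This is only a sketch: `sla_status` and its four status labels are invented for this example, and it uses naive datetimes, which assumes every timestamp is in the same time zone.

```python
from datetime import datetime, time

SLA_DEADLINE = time(6, 0)  # "lands by 6am", taken from the example contract


def sla_status(loaded_at, now):
    """Classify today's load of a daily table against the deadline.

    loaded_at: timestamp of the most recent successful load, or None.
    now: current time, passed in so the function is easy to test.
    """
    deadline = datetime.combine(now.date(), SLA_DEADLINE)
    loaded_today = loaded_at is not None and loaded_at.date() == now.date()
    if loaded_today and loaded_at <= deadline:
        return "met"
    if loaded_today:
        return "late"
    # Nothing has landed today: still fine before the deadline, a breach after it.
    return "pending" if now <= deadline else "breached"


now = datetime(2024, 5, 2, 7, 0)
print(sla_status(datetime(2024, 5, 2, 5, 30), now))
```

A check like this would typically run on a schedule and alert on "late" or "breached", so the team hears about a miss before the consumers do.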
3. Cost
Cloud warehouses bill by usage: bytes scanned per query on some platforms, compute time on others. A naive transformation that scans 50 TB on every scheduled run can run up a five-figure bill before anyone notices. Cost discipline is part of the craft.
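The arithmetic behind that warning is easy to sketch. The price of 5 USD per TB scanned used below is an illustrative assumption, not a quoted rate for any vendor, and the model only applies to platforms that bill by bytes scanned.

```python
import math

# ASSUMPTION: illustrative on-demand price, not a real vendor quote.
ASSUMED_USD_PER_TB = 5.0


def scan_cost_usd(tb_scanned, usd_per_tb=ASSUMED_USD_PER_TB):
    """Cost of one run under a pay-per-bytes-scanned pricing model."""
    return tb_scanned * usd_per_tb


def runs_until(budget_usd, tb_per_run, usd_per_tb=ASSUMED_USD_PER_TB):
    """Number of runs it takes for cumulative cost to reach the budget."""
    return math.ceil(budget_usd / scan_cost_usd(tb_per_run, usd_per_tb))


per_run = scan_cost_usd(50)                      # the 50 TB transformation
print(f"cost per run: ${per_run:,.2f}")
print(f"runs to reach $10,000: {runs_until(10_000, 50)}")
```

Under these assumed numbers an hourly schedule would cross five figures in under two days, which is why partition pruning, incremental models, and per-query cost monitoring are worth the effort.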
Skills to Build
- SQL — fluency, including window functions, CTEs, performance tuning.
- Python — for ingestion glue, orchestration, occasional Spark / Pandas.
- Modelling — Kimball-style dimensional, Inmon, data vault, modern "one big table" patterns.
- Cloud platforms — at least one of AWS, Azure, GCP, Databricks deeply; the others conceptually.
- Orchestration — Airflow / Prefect / Dagster.
- Transformation — dbt or equivalent.
- Streaming — Kafka concepts at minimum.
- Data governance — lineage, classification, access control, privacy.
Where the Certs Fit
| Cert | Focus |
|---|---|
| Microsoft DP-203 (retired) / DP-700 Fabric Data Engineer | Azure: ADF, Synapse, Databricks, Fabric, Event Hubs |
| AWS DEA-C01 | AWS: Glue, EMR, Redshift, Kinesis, MSK, S3, Lake Formation |
| Google PDE | GCP: Dataflow, BigQuery, Pub/Sub, Dataproc, Looker, Vertex |
| Databricks Certified Data Engineer | Spark, Delta Lake, lakehouse architecture |
| Snowflake SnowPro Data Engineer | Snowflake-specific patterns |
The certs share concepts — pipelines, warehouses, lakes, streaming, governance. Master the concepts here and the per-cloud variants are mostly nomenclature.
The Mental Frame for the Rest of the Course
Data engineering is plumbing for decisions. Every pipeline you build should make a downstream decision faster, more reliable, or cheaper; if it does none of these, it should not exist. As we cover ETL, warehouses, lakes, Spark, and streaming, keep that lens: who consumes this, and how does this make their work better?