5 min read·Lesson 1 of 10

What Is Data Engineering?

Data engineering as a discipline — what data engineers actually do, how the role compares to software engineering and analytics, and the tools that define the modern data stack.

Data engineering is the discipline of designing and operating systems that move data from where it is produced to where it is useful. Where software engineering ships features, data engineering ships data products — clean, reliable datasets that downstream analysts, scientists, and applications depend on.

What Data Engineers Actually Build

  • Ingestion — pull data from databases, SaaS APIs, event streams, files; land it in a central place.
  • Transformation — clean, normalise, join, deduplicate, enrich, aggregate.
  • Storage — choose where data lives: warehouse, lake, lakehouse, OLTP, NoSQL, time-series.
  • Modelling — design schemas (star, snowflake, data vault) so the data is queryable and trustworthy.
  • Serving — make data available via SQL, APIs, dashboards, ML feature stores, streaming topics.
  • Operating — monitor freshness, quality, cost, lineage, and respond when something breaks.
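The stages above can be sketched as a toy pipeline. This is a minimal illustration only: the `RAW_ORDERS` records, the field names, and the three function names are hypothetical, and a real pipeline would read from and write to actual systems rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    {"order_id": 1, "customer": " Alice ", "amount": "20.50"},
    {"order_id": 2, "customer": "bob", "amount": "5.00"},
    {"order_id": 2, "customer": "bob", "amount": "5.00"},  # duplicate delivery
    {"order_id": 3, "customer": "ALICE", "amount": "4.50"},
]


def ingest():
    """Ingestion: stand-in for reading from a database, API, or file."""
    return list(RAW_ORDERS)


def transform(rows):
    """Transformation: normalise names, cast amounts, drop repeated order_ids."""
    seen, clean = set(), []
    for row in rows:
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        clean.append({
            "order_id": row["order_id"],
            "customer": row["customer"].strip().lower(),
            "amount": float(row["amount"]),
        })
    return clean


def aggregate(rows):
    """Serving shape: total spend per customer, ready for a consumer to query."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


def run_pipeline():
    return aggregate(transform(ingest()))
```

Calling `run_pipeline()` on the sample records yields one total per customer, with the duplicate order counted once. Storage, modelling, and operating are left out here to keep the sketch short.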

How the Role Compares

|  | Software engineer | Data engineer | Data analyst / scientist |
| --- | --- | --- | --- |
| Primary artefact | Application code, services | Pipelines, models, datasets | Insights, reports, ML models |
| Primary language | App language (TS, Java, Go) | SQL + Python (+ Scala/Java) | SQL + Python / R |
| "Production" means | Service is up and serving requests | Data lands on time, correct, queryable | Dashboard or model is accurate and used |
| Failure looks like | 500 error, latency spike | Stale data, wrong number, broken schema | Bad decision based on the data |

A good data engineer borrows software practices (version control, CI, tests, code review) but applies them to data — testing not just code, but datasets.
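To make the difference between testing code and testing datasets concrete, here is a small sketch. The function names and the `email` column are hypothetical, chosen only for illustration; the first test checks logic against a fixed input, while the second checks properties of whatever rows a run actually produced.

```python
def normalise_email(raw):
    """Transformation under test: trim whitespace and lower-case an email."""
    return raw.strip().lower()


def test_normalise_email():
    # Code test: fixed input, expected output.
    assert normalise_email("  Ada@Example.COM ") == "ada@example.com"


def check_dataset(rows):
    """Data test: return a list of problems found in the produced rows."""
    emails = [row["email"] for row in rows]
    problems = []
    if any(email is None or email == "" for email in emails):
        problems.append("email contains empty values")
    if len(set(emails)) != len(emails):
        problems.append("email is not unique")
    return problems
```

A code test can pass while the data test fails, for example when the source system sends the same customer twice. That is why both kinds of test are worth running.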

The Modern Data Stack

The dominant pattern today, sometimes called the "MDS":

[ Sources ]
  - OLTP DBs, SaaS, events, files
        │
        ▼  (ingestion: Fivetran, Airbyte, custom, CDC)
[ Cloud warehouse / lakehouse ]
  - Snowflake, BigQuery, Redshift, Databricks, Synapse
        │
        ▼  (transformation: dbt, Spark, native SQL)
[ Modelled / curated layer ]
  - dim_customer, fact_orders, daily_active_users
        │
        ├──→ [ BI: Looker, Power BI, Tableau, Mode ]
        ├──→ [ Reverse ETL: Hightouch, Census ]
        ├──→ [ ML feature stores ]
        └──→ [ APIs / apps ]

Two structural shifts make this stack possible:

  1. Cheap, scalable storage and compute in cloud warehouses, which make it practical to load raw data first and transform it later (ELT rather than classic ETL).
  2. SQL-first transformation tools (dbt) — analytics engineers and data engineers share a language.
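The "load first, transform later" pattern can be sketched with Python's built-in `sqlite3` module standing in for a cloud warehouse. This is an assumption-laden toy: the `raw_orders` table and its columns are invented for illustration, and a tool such as dbt would manage the SQL models in a real project.

```python
import sqlite3


def build_models(conn):
    # Load: land the raw rows untouched, everything as text, as exported.
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer_email TEXT, amount TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [
            ("1", "Ada@Example.com", "20.00"),
            ("2", "ada@example.com", "5.50"),
            ("3", "grace@example.com", "12.00"),
        ],
    )
    # Transform: build curated tables with SQL, inside the same database.
    conn.executescript("""
        CREATE TABLE dim_customer AS
        SELECT DISTINCT lower(customer_email) AS customer_email
        FROM raw_orders;

        CREATE TABLE fact_orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               lower(customer_email)     AS customer_email,
               CAST(amount AS REAL)      AS amount
        FROM raw_orders;
    """)


def revenue_by_customer(conn):
    """Query the curated layer the way a BI tool might."""
    return conn.execute(
        "SELECT customer_email, SUM(amount) FROM fact_orders "
        "GROUP BY customer_email ORDER BY customer_email"
    ).fetchall()
```

Because the raw table is kept as loaded, the curated tables can be rebuilt with different SQL at any time without re-extracting from the source.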

Data Engineering vs Big Data

"Big data" was a useful early-2010s framing — Hadoop, MapReduce, terabytes that did not fit on one machine. Today, cloud warehouses transparently scale to petabytes, and most teams' "big" data is comfortably handled by Snowflake or BigQuery without thinking about clusters. The job has shifted from "make this fit" to "make this trustworthy and timely."

Spark, Kafka, and the Hadoop ecosystem still matter — for true high-volume streaming, ML feature engineering, and lakehouse workloads. They are tools, not the centre of gravity.

Data Engineering vs MLOps and Data Science

The boundaries blur:

  • Data engineers build the pipelines that produce features and training data.
  • ML engineers / MLOps run the training, serving, and monitoring of models.
  • Data scientists explore the resulting datasets and build and evaluate models on them.

In small companies one person wears all three hats. In large ones the split is sharper. The throughline: data engineering is the foundation; everything downstream depends on it.

The Three Constants of the Job

1. Quality

A pipeline that runs on time but produces wrong numbers is worse than no pipeline. Tests on schemas, freshness, row counts, and business invariants are non-negotiable.
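As a rough sketch of what such tests might look like, the function below checks a batch for row count, schema, and one business invariant. The `EXPECTED_COLUMNS` schema and the "amounts are never negative" rule are assumptions made up for this example; real projects usually express these checks in a testing framework or in the transformation tool itself.

```python
# Hypothetical schema for an orders batch.
EXPECTED_COLUMNS = {"order_id": int, "amount": float, "status": str}


def run_quality_checks(rows, min_rows=1):
    """Return a list of failure messages; an empty list means the batch passes."""
    failures = []

    # Row count: an empty or tiny batch often signals an upstream outage.
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below minimum {min_rows}")

    for i, row in enumerate(rows):
        # Schema: exact column set and expected Python types.
        if set(row) != set(EXPECTED_COLUMNS):
            failures.append(f"row {i}: unexpected columns {sorted(row)}")
            continue
        for column, expected_type in EXPECTED_COLUMNS.items():
            if not isinstance(row[column], expected_type):
                failures.append(f"row {i}: {column} is not {expected_type.__name__}")
        # Business invariant (assumed rule): order amounts are never negative.
        if isinstance(row["amount"], float) and row["amount"] < 0:
            failures.append(f"row {i}: negative amount {row['amount']}")
    return failures
```

A pipeline would typically refuse to publish a batch when the returned list is non-empty, so that wrong numbers never reach consumers.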

2. Reliability

Downstream consumers (executives, automated decision systems, ML models) bake your output into their workflows. SLAs matter: "the daily revenue mart lands by 6am" is a contract.
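A "lands by 6am" contract can be checked mechanically. The sketch below assumes the deadline is expressed in UTC and that the pipeline records a timezone-aware landing timestamp; both the function names and those conventions are illustrative choices, not a standard.

```python
from datetime import date, datetime, time, timezone


def minutes_late(landed_at, run_date, deadline=time(6, 0)):
    """Minutes past the deadline at which the data landed (0.0 if on time)."""
    due = datetime.combine(run_date, deadline, tzinfo=timezone.utc)
    delay = (landed_at - due).total_seconds() / 60
    return max(0.0, delay)


def meets_sla(landed_at, run_date, deadline=time(6, 0)):
    """True if the dataset for run_date landed at or before the deadline."""
    return minutes_late(landed_at, run_date, deadline) == 0.0


# Example: one run that lands at 05:42 and one that lands at 06:10.
early = datetime(2024, 1, 15, 5, 42, tzinfo=timezone.utc)
late = datetime(2024, 1, 15, 6, 10, tzinfo=timezone.utc)
on_time_result = meets_sla(early, date(2024, 1, 15))
late_result = meets_sla(late, date(2024, 1, 15))
```

Tracking the lateness in minutes, rather than only pass or fail, makes it easier to see a pipeline drifting toward its deadline before it actually breaches it.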

3. Cost

Cloud warehouses bill by usage, whether that is bytes scanned or compute time. A naive transformation that scans 50 TB per run can run up a five-figure bill before anyone notices, especially on a frequent schedule. Cost discipline is part of the craft.
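The arithmetic is worth doing explicitly. The sketch below assumes a warehouse that charges by data scanned and uses a placeholder price of 5 USD per TB; that figure is an illustrative assumption, not a quoted rate, so substitute your platform's actual pricing.

```python
import math

# Illustrative assumption only; check your platform's current price list.
ASSUMED_USD_PER_TB = 5.0


def cost_per_run(tb_scanned, usd_per_tb=ASSUMED_USD_PER_TB):
    """Cost of one run on a scan-priced warehouse."""
    return tb_scanned * usd_per_tb


def runs_until_budget(budget_usd, tb_scanned, usd_per_tb=ASSUMED_USD_PER_TB):
    """Number of runs needed to reach or exceed the given spend."""
    return math.ceil(budget_usd / cost_per_run(tb_scanned, usd_per_tb))


single_run = cost_per_run(50)                 # 250.0 USD at the assumed price
runs_to_10k = runs_until_budget(10_000, 50)   # 40 runs
```

Under these assumptions a 50 TB scan costs 250 USD per run and crosses 10,000 USD after 40 runs, which an hourly schedule reaches in under two days.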

Skills to Build

  • SQL — fluency, including window functions, CTEs, performance tuning.
  • Python — for ingestion glue, orchestration, occasional Spark / Pandas.
  • Modelling — Kimball-style dimensional, Inmon, data vault, modern "one big table" patterns.
  • Cloud platforms — at least one of AWS, Azure, GCP, Databricks deeply; the others conceptually.
  • Orchestration — Airflow / Prefect / Dagster.
  • Transformation — dbt or equivalent.
  • Streaming — Kafka concepts at minimum.
  • Data governance — lineage, classification, access control, privacy.
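As one example of the SQL fluency listed above, the sketch below combines a CTE with a window function to compute a running revenue total. It runs against an in-memory SQLite database through Python's `sqlite3` module (window functions need SQLite 3.25 or newer); the `orders` table and its columns are hypothetical.

```python
import sqlite3

QUERY = """
WITH daily AS (                       -- CTE: one row per day
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
)
SELECT order_date,
       revenue,
       SUM(revenue) OVER (ORDER BY order_date) AS running_total  -- window function
FROM daily
ORDER BY order_date
"""


def daily_revenue_with_running_total(rows):
    """Load (order_date, amount) rows and return per-day revenue with a running total."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


SAMPLE_ORDERS = [
    ("2024-01-01", 10.0),
    ("2024-01-01", 5.0),
    ("2024-01-02", 7.5),
    ("2024-01-03", 2.5),
]
```

The same query shape carries over to most warehouses with little change, which is part of why SQL fluency transfers so well between platforms.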

Where the Certs Fit

| Cert | Focus |
| --- | --- |
| Microsoft DP-203 (retiring) / DP-700 Fabric Data Engineer | Azure: ADF, Synapse, Databricks, Fabric, Event Hubs |
| AWS DEA-C01 | AWS: Glue, EMR, Redshift, Kinesis, MSK, S3, Lake Formation |
| Google PDE | GCP: Dataflow, BigQuery, Pub/Sub, Dataproc, Looker, Vertex |
| Databricks Certified Data Engineer | Spark, Delta Lake, lakehouse architecture |
| Snowflake SnowPro Data Engineer | Snowflake-specific patterns |

The certs share concepts — pipelines, warehouses, lakes, streaming, governance. Master the concepts here and the per-cloud variants are mostly nomenclature.

The Mental Frame for the Rest of the Course

Data engineering is plumbing for decisions. Every pipeline you build should make a downstream decision faster, more reliable, or cheaper; if it does none of these, it should not exist. As we cover ETL, warehouses, lakes, Spark, and streaming, keep that lens: who consumes this, and how does this make their work better?

Key Takeaways

  • Data engineers build the pipelines and platforms that move and shape data for analytics, machine learning, and operations.
  • The role sits between software engineering, infrastructure, and analytics — borrowing from each.
  • The modern data stack centres on a cloud warehouse or lakehouse, with ELT replacing classic ETL.
  • Data engineering work spans ingestion, transformation, storage, modelling, serving, and operating.
  • Strong data engineers care as much about data quality and reliability as code quality.
