What Is Data Engineering? — Data Engineering Fundamentals | CertQnA

Data engineering is the discipline of designing and operating systems that move data from where it is produced to where it is useful. Where software engineering ships features, data engineering ships data products — clean, reliable datasets that downstream analysts, scientists, and applications depend on.

What Data Engineers Actually Build

Ingestion — pull data from databases, SaaS APIs, event streams, files; land it in a central place.
Transformation — clean, normalise, join, deduplicate, enrich, aggregate.
Storage — choose where data lives: warehouse, lake, lakehouse, OLTP, NoSQL, time-series.
Modelling — design schemas (star, snowflake, data vault) so the data is queryable and trustworthy.
Serving — make data available via SQL, APIs, dashboards, ML feature stores, streaming topics.
Operating — monitor freshness, quality, cost, lineage, and respond when something breaks.
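
To make the stages above concrete, here is a minimal sketch of ingestion, transformation, and serving as three plain Python functions. The records, field names, and function names are all hypothetical; a real pipeline would read from an actual source and write to a warehouse rather than pass lists around in memory.

```python
from collections import defaultdict

# Hypothetical raw events, as they might arrive from a source system.
RAW_EVENTS = [
    {"order_id": "1", "customer": " Alice ", "amount": "20.00"},
    {"order_id": "2", "customer": "bob", "amount": "5.50"},
    {"order_id": "2", "customer": "bob", "amount": "5.50"},  # duplicate delivery
    {"order_id": "3", "customer": "ALICE", "amount": "4.50"},
]


def ingest(source):
    """Ingestion: copy records from the source into a landing area unchanged."""
    return list(source)


def transform(landed):
    """Transformation: deduplicate on order_id, normalise names, cast types."""
    seen = set()
    clean = []
    for row in landed:
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        clean.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            "amount": float(row["amount"]),
        })
    return clean


def serve(clean):
    """Serving: aggregate into a small dataset a dashboard could read."""
    totals = defaultdict(float)
    for row in clean:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


revenue_by_customer = serve(transform(ingest(RAW_EVENTS)))
print(revenue_by_customer)  # {'alice': 24.5, 'bob': 5.5}
```

Keeping each stage as its own function mirrors how real pipelines are split: each step can be tested, rerun, and monitored on its own.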

How the Role Compares

	Software engineer	Data engineer	Data analyst / scientist
Primary artefact	Application code, services	Pipelines, models, datasets	Insights, reports, ML models
Primary language	App language (TS, Java, Go)	SQL + Python (+ Scala/Java)	SQL + Python / R
"Production" means	Service is up and serving requests	Data lands on time, correct, queryable	Dashboard or model is accurate and used
Failure looks like	500 error, latency spike	Stale data, wrong number, broken schema	Bad decision based on the data

A good data engineer borrows software practices (version control, CI, tests, code review) but applies them to data — testing not just code, but datasets.

The Modern Data Stack

The dominant pattern today, sometimes called the "MDS":

[ Sources ]
  - OLTP DBs, SaaS, events, files
        │
        ▼  (ingestion: Fivetran, Airbyte, custom, CDC)
[ Cloud warehouse / lakehouse ]
  - Snowflake, BigQuery, Redshift, Databricks, Synapse
        │
        ▼  (transformation: dbt, Spark, native SQL)
[ Modelled / curated layer ]
  - dim_customer, fact_orders, daily_active_users
        │
        ├──→ [ BI: Looker, Power BI, Tableau, Mode ]
        ├──→ [ Reverse ETL: Hightouch, Census ]
        ├──→ [ ML feature stores ]
        └──→ [ APIs / apps ]

Two structural shifts make this stack possible:

Cheap, scalable storage and compute in cloud warehouses: teams can load raw data first and transform it later (ELT rather than ETL).
SQL-first transformation tools such as dbt: analytics engineers and data engineers share a language.
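
The "load raw first, transform later" pattern can be sketched with Python's built-in sqlite3 module standing in for a cloud warehouse. This is an illustration under stated assumptions, not how any particular tool works: the raw_orders table and its columns are hypothetical, and the dim_customer and fact_orders names are borrowed from the diagram above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# 1. Load: land the source rows as-is, with every value stored as text.
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, customer_email TEXT, amount TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        ("1", "A@Example.com", "20.00"),
        ("2", "b@example.com", "5.50"),
        ("3", "a@example.com ", "4.50"),
    ],
)

# 2. Transform: build the modelled layer from the raw layer using SQL only.
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, email TEXT UNIQUE);
INSERT INTO dim_customer (email)
SELECT DISTINCT LOWER(TRIM(customer_email)) FROM raw_orders ORDER BY 1;

CREATE TABLE fact_orders AS
SELECT CAST(r.order_id AS INTEGER) AS order_id,
       d.customer_key,
       CAST(r.amount AS REAL) AS amount
FROM raw_orders AS r
JOIN dim_customer AS d ON d.email = LOWER(TRIM(r.customer_email));
""")

# 3. Serve: downstream consumers query the modelled tables, not the raw ones.
rows = conn.execute("""
    SELECT d.email, SUM(f.amount) AS revenue
    FROM fact_orders AS f
    JOIN dim_customer AS d USING (customer_key)
    GROUP BY d.email
    ORDER BY d.email
""").fetchall()
print(rows)  # [('a@example.com', 24.5), ('b@example.com', 5.5)]
```

Because the raw table is kept untouched, a bug in the cleaning logic can be fixed and the modelled tables rebuilt without going back to the source system.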

Data Engineering vs Big Data

"Big data" was a useful early-2010s framing — Hadoop, MapReduce, terabytes that did not fit on one machine. Today, cloud warehouses transparently scale to petabytes, and most teams' "big" data is comfortably handled by Snowflake or BigQuery without thinking about clusters. The job has shifted from "make this fit" to "make this trustworthy and timely."

Spark, Kafka, and the Hadoop ecosystem still matter — for true high-volume streaming, ML feature engineering, and lakehouse workloads. They are tools, not the centre of gravity.

Data Engineering vs MLOps and Data Science

The boundaries blur:

Data engineers build the pipelines that produce features and training data.
ML engineers / MLOps run the training, serving, and monitoring of models.
Data scientists explore, build, and evaluate models on those datasets.

In small companies one person wears all three hats. In large ones the split is sharper. The throughline: data engineering is the foundation; everything downstream depends on it.

The Three Constants of the Job

1. Quality

A pipeline that runs on time but produces wrong numbers is worse than no pipeline. Tests on schemas, freshness, row counts, and business invariants are non-negotiable.
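
As a rough sketch of what such tests can look like, the function below checks row count, schema, one business invariant, and key uniqueness on an in-memory dataset. The schema, the "amounts are never negative" rule, and all names are assumptions made for illustration; in practice these checks usually live in a testing framework or in the transformation tool itself. Freshness is left out to keep the example short.

```python
# Hypothetical schema for an orders dataset: column name -> expected Python type.
SCHEMA = {"order_id": int, "amount": float}


def check_dataset(rows, min_rows=1):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []

    # Row count: an empty or tiny load usually signals an upstream problem.
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} is below minimum {min_rows}")

    for i, row in enumerate(rows):
        # Schema: each row has exactly the expected columns...
        if set(row) != set(SCHEMA):
            failures.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue
        # ...and each value has the expected type.
        bad_types = [c for c, t in SCHEMA.items() if not isinstance(row[c], t)]
        if bad_types:
            failures.append(f"row {i}: wrong type in {bad_types}")
            continue
        # Business invariant (assumed for illustration): amounts are never negative.
        if row["amount"] < 0:
            failures.append(f"row {i}: negative amount {row['amount']}")

    # Uniqueness: order_id is treated as the primary key.
    ids = [row.get("order_id") for row in rows]
    if len(ids) != len(set(ids)):
        failures.append("duplicate order_id values")

    return failures


good = [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 5.5}]
bad = [{"order_id": 1, "amount": -3.0}, {"order_id": 1, "amount": "5.5"}]

print(check_dataset(good))  # []
for failure in check_dataset(bad):
    print(failure)
```

A pipeline would typically run checks like these before publishing a dataset and refuse to publish if the list of failures is not empty.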

2. Reliability

Downstream consumers (executives, automated decisions, ML models) bake your output into their workflows. SLAs matter: "the daily revenue mart lands by 6am" is a contract.
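
A contract like "lands by 6am" can be turned into an automated check. The sketch below assumes the deadline is 06:00 UTC and that the pipeline records a timezone-aware landing timestamp; the text does not specify a timezone, and the function name is hypothetical.

```python
from datetime import date, datetime, time, timezone


def meets_sla(landed_at, run_date, deadline=time(6, 0)):
    """Return True if the data for run_date landed at or before the deadline.

    Assumption: all timestamps are timezone-aware UTC.
    """
    due = datetime.combine(run_date, deadline, tzinfo=timezone.utc)
    return landed_at <= due


run_date = date(2024, 1, 15)
on_time = meets_sla(datetime(2024, 1, 15, 5, 42, tzinfo=timezone.utc), run_date)
late = meets_sla(datetime(2024, 1, 15, 6, 10, tzinfo=timezone.utc), run_date)
print(on_time, late)  # True False
```

Tracking the share of days on which this check passes gives a simple, reportable measure of reliability.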

3. Cost

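Cloud warehouses bill by usage, typically by bytes scanned (BigQuery on-demand) or by compute time (Snowflake credits). A naive transformation that scans 50 TB per run, repeated on a schedule, can run up a five-figure bill before anyone notices. Cost discipline is part of the craft.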
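
The arithmetic behind a bill like that is worth doing once by hand. The sketch below assumes a bytes-scanned pricing model and a made-up round price of 5 USD per TB; check your provider's current price list before relying on any real figure.

```python
import math

# Hypothetical price, chosen as a round number for illustration only.
USD_PER_TB = 5.0


def cost_per_run(tb_scanned, usd_per_tb=USD_PER_TB):
    """Cost of a single run under a bytes-scanned pricing model."""
    return tb_scanned * usd_per_tb


def runs_until(budget_usd, tb_scanned, usd_per_tb=USD_PER_TB):
    """Number of runs needed to reach or exceed a given spend."""
    return math.ceil(budget_usd / cost_per_run(tb_scanned, usd_per_tb))


per_run = cost_per_run(50)          # 250.0 USD for one 50 TB scan
runs = runs_until(10_000, 50)       # 40 runs to reach five figures
print(per_run, runs)
```

Under these assumptions an hourly schedule reaches 10,000 USD in 40 hours, which is less than two days.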

Skills to Build

SQL — fluency, including window functions, CTEs, performance tuning.
Python — for ingestion glue, orchestration, occasional Spark / Pandas.
Modelling — Kimball-style dimensional, Inmon, data vault, modern "one big table" patterns.
Cloud platforms — at least one of AWS, Azure, GCP, Databricks deeply; the others conceptually.
Orchestration — Airflow / Prefect / Dagster.
Transformation — dbt or equivalent.
Streaming — Kafka concepts at minimum.
Data governance — lineage, classification, access control, privacy.
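
As a taste of the SQL fluency listed first, the sketch below combines a CTE with two window functions to compute a running total and a day-over-day change. It runs on Python's built-in sqlite3 module and assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added; the orders table and its data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("2024-01-01", 60.0),
        ("2024-01-01", 40.0),
        ("2024-01-02", 150.0),
        ("2024-01-03", 50.0),
    ],
)

rows = conn.execute("""
    WITH daily AS (                      -- CTE: one row per day
        SELECT day, SUM(amount) AS revenue
        FROM orders
        GROUP BY day
    )
    SELECT day,
           revenue,
           SUM(revenue) OVER (ORDER BY day) AS running_total,          -- window
           revenue - LAG(revenue) OVER (ORDER BY day) AS change_vs_prev
    FROM daily
    ORDER BY day
""").fetchall()

for row in rows:
    print(row)
```

The first day has no previous row, so its change_vs_prev comes back as NULL (None in Python) rather than zero.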

Where the Certs Fit

Cert	Focus
Microsoft DP-203 (retired March 2025) / DP-700 Fabric Data Engineer	Azure: ADF, Synapse, Databricks, Fabric, Event Hubs
AWS DEA-C01	AWS: Glue, EMR, Redshift, Kinesis, MSK, S3, Lake Formation
Google PDE	GCP: Dataflow, BigQuery, Pub/Sub, Dataproc, Looker, Vertex
Databricks Certified Data Engineer	Spark, Delta Lake, lakehouse architecture
Snowflake SnowPro Advanced: Data Engineer	Snowflake-specific patterns

The certs share concepts — pipelines, warehouses, lakes, streaming, governance. Master the concepts here and the per-cloud variants are mostly nomenclature.

The Mental Frame for the Rest of the Course

Data engineering is plumbing for decisions. Every pipeline you build either makes a downstream decision faster, more reliable, or cheaper — or it should not exist. As we cover ETL, warehouses, lakes, Spark, and streaming, keep that lens: who consumes this, and how does this make their work better?