ETL vs ELT — Data Engineering Fundamentals | CertQnA

Three letters define how data moves: Extract, Transform, Load. The order matters, and it has flipped over the past decade.

ETL: The Classical Approach

[ Source ] → [ ETL tool ] ──transform in flight──→ [ Warehouse ]
                Informatica
                Talend
                SSIS

The pipeline pulls data, transforms it on a dedicated server (or the ETL tool itself), and only the cleaned, modelled output lands in the warehouse.

Why ETL was the default

Warehouses (Teradata, on-prem Oracle, early data warehouses) were expensive — every byte of storage and every CPU second cost real money.
You wanted only modelled data in the warehouse, not raw operational data.
Compliance: PII could be masked or dropped before it ever touched the analytics layer.

Downsides

Transformation logic lived in proprietary GUI tools, hard to version-control.
If you forgot a column or wanted to model differently, you had to re-extract from source.
Schema changes upstream broke the entire pipeline.
ETL tools were a separate skill set; not many people could maintain them.

ELT: The Modern Default

[ Source ] → [ Loader ] ──load raw──→ [ Warehouse ]
              Fivetran                    │
              Airbyte                     ▼ (transform in SQL)
              CDC                   [ Modelled marts ]
                                    dbt, native SQL

The pipeline lands raw data in the warehouse first. Transformations run inside the warehouse using its own SQL engine.

Why ELT won

Cheap warehouse compute — Snowflake, BigQuery, Redshift, Databricks have pushed compute and storage prices down 10–100×.
Elastic compute — large transforms can spin up dedicated warehouses for minutes, then shut down.
SQL fluency — most data people know SQL; few knew Informatica.
Replayability — raw data is preserved, so you can rebuild any model without re-extracting.
dbt brought software practices (modular SQL, version control, tests, docs) to the T step.

The standard ELT layering

Layer	Purpose	Naming
Raw / source	Exact replica of source, append-only	raw.salesforce__opportunity
Staging	Renamed, typed, lightly cleaned 1:1 with source	stg_salesforce__opportunities
Intermediate	Joins, filters, business logic building blocks	int_orders_with_payments
Marts	Final dim_ / fact_ tables consumers query	fact_orders, dim_customer

When ETL Still Wins

Strong data residency / compliance — health data or PII that should never land raw in the warehouse. Mask or pseudonymise in flight.
Schema-on-read becomes a problem — extreme nested or proprietary formats may benefit from upfront parsing.
Source rate-limiting — when re-extracting is expensive (paid API quotas), pre-shape and cache.
Small warehouse / fixed cluster — when warehouse compute is constrained, push transformation upstream.

Hybrid: do a minimal ETL step (mask + format) before loading, then full ELT inside the warehouse for everything else. This is common for healthcare and financial data.

Code: The Same Job Two Ways

ETL (Python in transit)

def transform(rows):
    out = []
    for r in rows:
        if r["status"] == "active":
            out.append({
                "customer_id": r["id"],
                "email": r["email"].lower().strip(),
                "signup_year": parse_date(r["created_at"]).year,
            })
    return out

rows = extract_from_crm()
clean = transform(rows)
load_to_warehouse("dim_customer", clean)

Logic is in Python, only running on whatever box runs the pipeline. No history of raw CRM data — you cannot rebuild dim_customer differently without re-extracting.

ELT (raw + dbt)

-- raw.crm_customers landed by Fivetran, append-only
-- dbt model: stg_crm__customers.sql
select
  id            as customer_id,
  lower(trim(email)) as email,
  date_part('year', created_at::date) as signup_year,
  status
from {{ source('crm', 'customers') }}
where status = 'active'

Logic is SQL, version-controlled, tested, documented. Raw is preserved — you can rebuild the model with new logic any time.

Reverse ETL: The Other Direction

Once your warehouse is the source of truth, you often want to push data out to operational tools — Salesforce, HubSpot, Intercom, ad platforms. Tools like Hightouch and Census do this. It is "reverse ETL": warehouse → SaaS.

This makes the warehouse not just an analytics store but the operational data hub.

Cert Mapping

Cert	ETL/ELT story
DP-203 / DP-700	ADF copy + Mapping data flows; Synapse / Fabric pipelines; dataflows Gen2
AWS DEA-C01	Glue ETL jobs, Glue DataBrew, Step Functions; Redshift COPY for ELT
GCP PDE	Dataflow (ETL streaming/batch), Dataform / dbt for ELT in BigQuery

Practical Choice Today

Default to ELT with a managed loader (Fivetran / Airbyte) and dbt on a cloud warehouse.
Add light transform-in-flight only when compliance or cost demands it.
Use orchestration (next lesson) to coordinate loaders, dbt runs, and downstream consumers.