Three letters define how data moves: Extract, Transform, Load. The order matters, and it has flipped over the past decade.
ETL: The Classical Approach
[ Source ] → [ ETL tool ] ──transform in flight──→ [ Warehouse ]
Informatica
Talend
SSIS
The pipeline pulls data, transforms it on a dedicated server (or the ETL tool itself), and only the cleaned, modelled output lands in the warehouse.
Why ETL was the default
- Warehouses (Teradata, on-prem Oracle, early data warehouses) were expensive — every byte of storage and every CPU second cost real money.
- You wanted only modelled data in the warehouse, not raw operational data.
- Compliance: PII could be masked or dropped before it ever touched the analytics layer.
Downsides
- Transformation logic lived in proprietary GUI tools, hard to version-control.
- If you forgot a column or wanted to model differently, you had to re-extract from source.
- Schema changes upstream broke the entire pipeline.
- ETL tools were a separate skill set; not many people could maintain them.
ELT: The Modern Default
[ Source ] → [ Loader ] ──load raw──→ [ Warehouse ]
Fivetran │
Airbyte ▼ (transform in SQL)
CDC [ Modelled marts ]
dbt, native SQL
The pipeline lands raw data in the warehouse first. Transformations run inside the warehouse using its own SQL engine.
Why ELT won
- Cheap warehouse compute — Snowflake, BigQuery, Redshift, Databricks have pushed compute and storage prices down 10–100×.
- Elastic compute — large transforms can spin up dedicated warehouses for minutes, then shut down.
- SQL fluency — most data people know SQL; few knew Informatica.
- Replayability — raw data is preserved, so you can rebuild any model without re-extracting.
- dbt brought software practices (modular SQL, version control, tests, docs) to the T step.
The standard ELT layering
| Layer | Purpose | Naming |
|---|---|---|
| Raw / source | Exact replica of source, append-only | raw.salesforce__opportunity |
| Staging | Renamed, typed, lightly cleaned 1:1 with source | stg_salesforce__opportunities |
| Intermediate | Joins, filters, business logic building blocks | int_orders_with_payments |
| Marts | Final dim_ / fact_ tables consumers query | fact_orders, dim_customer |
When ETL Still Wins
- Strong data residency / compliance — health data or PII that should never land raw in the warehouse. Mask or pseudonymise in flight.
- Schema-on-read becomes a problem — extreme nested or proprietary formats may benefit from upfront parsing.
- Source rate-limiting — when re-extracting is expensive (paid API quotas), pre-shape and cache.
- Small warehouse / fixed cluster — when warehouse compute is constrained, push transformation upstream.
Hybrid: do a minimal ETL step (mask + format) before loading, then full ELT inside the warehouse for everything else. This is common for healthcare and financial data.
Code: The Same Job Two Ways
ETL (Python in transit)
def transform(rows):
out = []
for r in rows:
if r["status"] == "active":
out.append({
"customer_id": r["id"],
"email": r["email"].lower().strip(),
"signup_year": parse_date(r["created_at"]).year,
})
return out
rows = extract_from_crm()
clean = transform(rows)
load_to_warehouse("dim_customer", clean)
Logic is in Python, only running on whatever box runs the pipeline. No history of raw CRM data — you cannot rebuild dim_customer differently without re-extracting.
ELT (raw + dbt)
-- raw.crm_customers landed by Fivetran, append-only
-- dbt model: stg_crm__customers.sql
select
id as customer_id,
lower(trim(email)) as email,
date_part('year', created_at::date) as signup_year,
status
from {{ source('crm', 'customers') }}
where status = 'active'
Logic is SQL, version-controlled, tested, documented. Raw is preserved — you can rebuild the model with new logic any time.
Reverse ETL: The Other Direction
Once your warehouse is the source of truth, you often want to push data out to operational tools — Salesforce, HubSpot, Intercom, ad platforms. Tools like Hightouch and Census do this. It is "reverse ETL": warehouse → SaaS.
This makes the warehouse not just an analytics store but the operational data hub.
Cert Mapping
| Cert | ETL/ELT story |
|---|---|
| DP-203 / DP-700 | ADF copy + Mapping data flows; Synapse / Fabric pipelines; dataflows Gen2 |
| AWS DEA-C01 | Glue ETL jobs, Glue DataBrew, Step Functions; Redshift COPY for ELT |
| GCP PDE | Dataflow (ETL streaming/batch), Dataform / dbt for ELT in BigQuery |
Practical Choice Today
- Default to ELT with a managed loader (Fivetran / Airbyte) and dbt on a cloud warehouse.
- Add light transform-in-flight only when compliance or cost demands it.
- Use orchestration (next lesson) to coordinate loaders, dbt runs, and downstream consumers.