Skip to content
5 min read·Lesson 3 of 10

ETL vs ELT

Extract, Transform, Load — and why modern stacks flipped the order. When to choose ETL, when to choose ELT, and how the warehouse changed the economics.

Three letters define how data moves: Extract, Transform, Load. The order matters, and it has flipped over the past decade.

ETL: The Classical Approach

[ Source ] → [ ETL tool ] ──transform in flight──→ [ Warehouse ]
                Informatica
                Talend
                SSIS

The pipeline pulls data, transforms it on a dedicated server (or the ETL tool itself), and only the cleaned, modelled output lands in the warehouse.

Why ETL was the default

  • Warehouses (Teradata, on-prem Oracle, early data warehouses) were expensive — every byte of storage and every CPU second cost real money.
  • You wanted only modelled data in the warehouse, not raw operational data.
  • Compliance: PII could be masked or dropped before it ever touched the analytics layer.

Downsides

  • Transformation logic lived in proprietary GUI tools, hard to version-control.
  • If you forgot a column or wanted to model differently, you had to re-extract from source.
  • Schema changes upstream broke the entire pipeline.
  • ETL tools were a separate skill set; not many people could maintain them.

ELT: The Modern Default

[ Source ] → [ Loader ] ──load raw──→ [ Warehouse ]
              Fivetran                    │
              Airbyte                     ▼ (transform in SQL)
              CDC                   [ Modelled marts ]
                                    dbt, native SQL

The pipeline lands raw data in the warehouse first. Transformations run inside the warehouse using its own SQL engine.

Why ELT won

  • Cheap warehouse compute — Snowflake, BigQuery, Redshift, Databricks have pushed compute and storage prices down 10–100×.
  • Elastic compute — large transforms can spin up dedicated warehouses for minutes, then shut down.
  • SQL fluency — most data people know SQL; few knew Informatica.
  • Replayability — raw data is preserved, so you can rebuild any model without re-extracting.
  • dbt brought software practices (modular SQL, version control, tests, docs) to the T step.

The standard ELT layering

LayerPurposeNaming
Raw / sourceExact replica of source, append-onlyraw.salesforce__opportunity
StagingRenamed, typed, lightly cleaned 1:1 with sourcestg_salesforce__opportunities
IntermediateJoins, filters, business logic building blocksint_orders_with_payments
MartsFinal dim_ / fact_ tables consumers queryfact_orders, dim_customer

When ETL Still Wins

  • Strong data residency / compliance — health data or PII that should never land raw in the warehouse. Mask or pseudonymise in flight.
  • Schema-on-read becomes a problem — extreme nested or proprietary formats may benefit from upfront parsing.
  • Source rate-limiting — when re-extracting is expensive (paid API quotas), pre-shape and cache.
  • Small warehouse / fixed cluster — when warehouse compute is constrained, push transformation upstream.

Hybrid: do a minimal ETL step (mask + format) before loading, then full ELT inside the warehouse for everything else. This is common for healthcare and financial data.

Code: The Same Job Two Ways

ETL (Python in transit)

def transform(rows):
    out = []
    for r in rows:
        if r["status"] == "active":
            out.append({
                "customer_id": r["id"],
                "email": r["email"].lower().strip(),
                "signup_year": parse_date(r["created_at"]).year,
            })
    return out

rows = extract_from_crm()
clean = transform(rows)
load_to_warehouse("dim_customer", clean)

Logic is in Python, only running on whatever box runs the pipeline. No history of raw CRM data — you cannot rebuild dim_customer differently without re-extracting.

ELT (raw + dbt)

-- raw.crm_customers landed by Fivetran, append-only
-- dbt model: stg_crm__customers.sql
select
  id            as customer_id,
  lower(trim(email)) as email,
  date_part('year', created_at::date) as signup_year,
  status
from {{ source('crm', 'customers') }}
where status = 'active'

Logic is SQL, version-controlled, tested, documented. Raw is preserved — you can rebuild the model with new logic any time.

Reverse ETL: The Other Direction

Once your warehouse is the source of truth, you often want to push data out to operational tools — Salesforce, HubSpot, Intercom, ad platforms. Tools like Hightouch and Census do this. It is "reverse ETL": warehouse → SaaS.

This makes the warehouse not just an analytics store but the operational data hub.

Cert Mapping

CertETL/ELT story
DP-203 / DP-700ADF copy + Mapping data flows; Synapse / Fabric pipelines; dataflows Gen2
AWS DEA-C01Glue ETL jobs, Glue DataBrew, Step Functions; Redshift COPY for ELT
GCP PDEDataflow (ETL streaming/batch), Dataform / dbt for ELT in BigQuery

Practical Choice Today

  1. Default to ELT with a managed loader (Fivetran / Airbyte) and dbt on a cloud warehouse.
  2. Add light transform-in-flight only when compliance or cost demands it.
  3. Use orchestration (next lesson) to coordinate loaders, dbt runs, and downstream consumers.

Key Takeaways

  • ETL transforms before loading; ELT loads raw, then transforms inside the warehouse.
  • Cheap warehouse compute and elastic storage made ELT the modern default.
  • ETL still wins for sensitive data that must be cleansed or masked before landing.
  • Raw + staging + marts is the standard ELT layering pattern.
  • dbt made ELT mainstream by treating SQL transformations like software.

Test your knowledge

Try exam-style practice questions to reinforce what you've learned.

Practice Questions →