Orchestration: Airflow, dbt, and Friends — Data Engineering Fundamentals | CertQnA

Once you have more than two pipelines, you need an orchestrator. Without one, you are reduced to cron + bash + Slack — fragile and unobservable.

What an Orchestrator Provides

Scheduling — cron-like, event-driven, or sensor-based.
Dependencies — task B runs only after task A succeeds.
Retries and backoff — transient failures should self-heal.
Observability — UI, logs, run history, lineage.
Parametrisation — backfill a date range, retry a single partition.
Alerting — page someone when something breaks.

Airflow: The Incumbent

Apache Airflow has been the de-facto standard since ~2016. You define DAGs in Python:

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from datetime import datetime, timedelta

with DAG(
    dag_id="daily_orders_mart",
    schedule="0 6 * * *",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:

    extract = BashOperator(
        task_id="fivetran_sync",
        bash_command="fivetran-cli sync --connector salesforce",
    )

    dbt_run = BashOperator(
        task_id="dbt_run_orders",
        bash_command="cd /dbt && dbt run --select orders",
    )

    dbt_test = BashOperator(
        task_id="dbt_test_orders",
        bash_command="cd /dbt && dbt test --select orders",
    )

    extract >> dbt_run >> dbt_test

Strengths

Massive ecosystem — providers for nearly every system.
Mature managed offerings: MWAA (AWS), Cloud Composer (GCP), Astronomer.
Huge community, lots of patterns documented.

Weaknesses

Task-centric, not asset-centric — you orchestrate steps, not data.
Local dev is fiddly; testing DAG logic is harder than it should be.
Dynamic DAG generation is awkward.

Prefect

Prefect 2 / 3 reimagined orchestration around Python-native flows and tasks, with a strong focus on developer experience and dynamic workflows.

from prefect import flow, task

@task(retries=3)
def extract():
    ...

@task
def transform(rows):
    ...

@flow
def daily_orders():
    rows = extract()
    transform(rows)

daily_orders()

Pros: Pythonic, easy local dev, dynamic workflows. Cons: smaller ecosystem than Airflow.

Dagster

Dagster is the most opinionated of the three. Its core abstraction is the asset — a piece of data that exists or does not, has a freshness, has dependencies. Pipelines are derived from the asset graph.

from dagster import asset

@asset
def raw_orders():
    return fetch_orders()

@asset
def cleaned_orders(raw_orders):
    return clean(raw_orders)

@asset(deps=["cleaned_orders"])
def order_marts():
    run_dbt(["--select", "orders"])

This asset-aware model maps naturally to dbt models, files in S3, BigQuery tables. Lineage, freshness, and "what is stale?" are first-class. Many modern teams pick Dagster for greenfield projects.

Where dbt Fits

A common point of confusion: dbt is not an orchestrator. dbt is the SQL transformation engine — it turns your SELECT statements into a DAG of materialised tables and views, runs tests, and generates docs.

You still need an orchestrator to:

Trigger the loaders that land raw data.
Run dbt run at the right time, with the right vars.
Run dbt test after.
Trigger downstream tasks (Reverse ETL, ML training, dashboard refresh).

dbt Cloud provides a lightweight scheduler that is enough for many teams. Larger or more heterogeneous stacks pair dbt-core with Airflow / Prefect / Dagster.

DAG Design Principles

Idempotent

Re-running yesterday's DAG should produce the same output, not duplicates. Partition by date, use MERGE/INSERT OVERWRITE instead of blind INSERT, key off business identifiers.

Partitioned

Operate on a single day or hour at a time. Failures recover by re-running just that partition.

Small, atomic tasks

One thing per task. A 4-hour monolithic task that fails 99% in is a bad day. Split it.

Owned data contracts at boundaries

Each DAG produces clear outputs. Downstream DAGs consume those outputs, not raw upstream tables.

Tests inline

Use dbt tests, Great Expectations, or custom assertions. A failing test should fail the run, not silently land bad data.

Observability and Lineage

The orchestrator alone is not enough. Modern stacks add:

Data observability — Monte Carlo, Bigeye, Anomalo for freshness, volume, schema, distribution monitoring.
Catalog and lineage — DataHub, Atlan, Unity Catalog, BigQuery Data Lineage.
Cost monitoring — track per-DAG / per-model warehouse cost.

Cert Mapping

Cert	Orchestration
DP-203 / DP-700	ADF / Synapse pipelines / Fabric Data Factory
AWS DEA-C01	Step Functions, MWAA, EventBridge schedules
GCP PDE	Cloud Composer (managed Airflow), Workflows, Dataform tags

Practical Picks

Small team, modern stack, greenfield — dbt Cloud or Dagster.
Existing Airflow expertise / many connectors — Airflow on managed platform.
Heavy Python data science work — Prefect.
Tightly tied to a cloud — its native pipeline service is fine for simple cases.

Whatever you pick, the principles — idempotent DAGs, partitioned runs, inline tests, observable lineage — are universal.