Skip to content
5 min read·Lesson 7 of 10

Spark and Distributed Compute

How Apache Spark turned distributed processing into a productive API. RDDs, DataFrames, the Catalyst optimiser, and where Spark fits in modern stacks.

Apache Spark, born at UC Berkeley in 2009 and Apache top-level in 2014, became the dominant engine for distributed data processing. It is fast, multi-language (Python, Scala, Java, R, SQL), and runs everywhere — Databricks, EMR, Dataproc, Synapse, Fabric, Kubernetes, on-prem.

The Core Idea

Distribute a dataset across many machines, then express transformations on it as if it were a single collection. Spark handles the parallelism, fault tolerance, and shuffling.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("s3://lake/orders/")

revenue_by_country = (
    orders
      .filter(col("status") == "paid")
      .groupBy("country")
      .agg(sum("amount").alias("revenue"))
      .orderBy(col("revenue").desc())
)

revenue_by_country.write.mode("overwrite").parquet("s3://lake/marts/revenue_by_country/")

That code might run on 1 GB locally or 100 TB across 500 nodes — same API.

Architecture in 60 Seconds

[ Driver ]  ── orchestrates ──→  [ Executors on worker nodes ]
   │                                       │
   │                                       ▼
   └── splits work into tasks ───→  process partitions of data
                                    return results / shuffle
  • Driver — your code's process; builds the plan, schedules stages.
  • Executors — JVMs on workers; run tasks, hold cached data.
  • Cluster manager — YARN, Kubernetes, Mesos, Databricks proprietary.
  • Tasks — unit of work for one partition.
  • Stages — groups of tasks separated by shuffles.

RDDs vs DataFrames vs Datasets

APIEraUse today
RDDSpark 1.xRare — only for low-level control
DataFrameSpark 1.3+The default API, Python and Scala
DatasetSpark 1.6+ (typed)Scala / Java only, mostly for type-safe code
Spark SQL1.x onwardSame engine, plain SQL queries

Use DataFrames or Spark SQL. RDDs are an escape hatch.

Lazy Evaluation and Catalyst

Spark does not run operations as you write them. It builds a logical plan, then the Catalyst optimiser rewrites it: pushing filters down, pruning columns, reordering joins, choosing physical operators. Only when you trigger an action (write, count, collect, show) does execution start.

This is why orders.filter(...).select("id") only ever reads the columns you asked for — Catalyst pushed the projection into the Parquet read.

Where Performance Goes Wrong

Shuffles

Joins, groupBys, and window functions cause data to move across the network between executors. Shuffles are the most expensive operation in Spark.

  • Use broadcast joins for small dimension tables (broadcast(small_df)).
  • Pre-partition large tables on join keys when running the same join repeatedly.
  • Avoid orderBy at scale unless required.

Skew

If one key (e.g. customer_id = NULL, or a single megacustomer) holds 30% of the data, one task does 30% of the work. Symptoms: 99 tasks finish, one runs for 4 hours.

  • Salt skewed keys.
  • Spark 3+ AQE (Adaptive Query Execution) handles many cases automatically.

Small Files

Reading a million 100KB files is slower than 1000 100MB files. Compact and target file sizes appropriately.

UDFs

Python UDFs serialise rows out of the JVM. Use built-in functions (pyspark.sql.functions) where possible. If you must use a UDF, prefer Pandas UDFs (vectorised).

Structured Streaming

The same DataFrame API, applied incrementally to a stream. Spark treats a stream as an "unbounded table" that grows over time.

stream = (
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "orders")
      .load()
)

agg = (
    stream
      .selectExpr("CAST(value AS STRING) as json")
      .selectExpr("from_json(json, 'order_id INT, amount DOUBLE') as o")
      .select("o.*")
      .groupBy(window("event_time", "1 minute"))
      .sum("amount")
)

query = (
    agg.writeStream
      .format("delta")
      .option("checkpointLocation", "/chk/orders_per_min")
      .outputMode("append")
      .trigger(processingTime="30 seconds")
      .start("/lake/silver/orders_per_min")
)

Trade-offs: Spark Streaming is micro-batch (sub-second to seconds), great for most use cases, but for true sub-100ms latency consider Flink.

Spark vs the Warehouse

If your work is "join two tables, group by, aggregate, write" and the data fits in a modern warehouse, the warehouse is faster, simpler, and cheaper. You write SQL; you don't manage clusters.

Spark wins when:

  • Data is unstructured or nested in ways SQL cannot easily express.
  • You need ML (MLlib, Pandas, sklearn-on-Spark, distributed PyTorch).
  • Data scale exceeds practical warehouse cost.
  • You need a single engine for batch, streaming, and ML.
  • You want full control of execution / custom logic.

Where Spark Runs

PlatformForm
DatabricksManaged Spark + Delta Lake; the most polished experience
AWS EMR / Glue / Athena SparkEMR for full clusters, Glue for serverless ETL, Athena Spark for notebooks
GCP Dataproc / ServerlessManaged Spark on GCP; Dataproc Serverless for elastic jobs
Azure Synapse / FabricSynapse Spark pools; Fabric notebooks on OneLake
Self-managed K8sSpark Operator on Kubernetes

Cert Mapping

CertSpark scope
Databricks Certified Data EngineerSpark + Delta Lake deep
DP-203 / DP-700Synapse Spark / Fabric Spark notebooks
AWS DEA-C01Glue Spark, EMR Spark, Athena Spark
GCP PDEDataproc, Dataproc Serverless

Mental Model

Think of Spark as "SQL with Python superpowers, distributed". For routine ELT, prefer the warehouse. For everything else — ML, semi-structured, hyper-scale, streaming-batch unification — Spark is still the workhorse.

Key Takeaways

  • Spark distributes work across a cluster while presenting a familiar DataFrame API.
  • Lazy evaluation and the Catalyst optimiser turn high-level code into efficient execution plans.
  • Most performance issues come from skew, shuffles, and small files — not the API itself.
  • Structured Streaming brings the same DataFrame API to streaming workloads.
  • Today, Spark dominates lakehouse and ML pipelines; warehouse-only teams may not need it.

Test your knowledge

Try exam-style practice questions to reinforce what you've learned.

Practice Questions →