Apache Spark, born at UC Berkeley in 2009 and Apache top-level in 2014, became the dominant engine for distributed data processing. It is fast, multi-language (Python, Scala, Java, R, SQL), and runs everywhere — Databricks, EMR, Dataproc, Synapse, Fabric, Kubernetes, on-prem.
The Core Idea
Distribute a dataset across many machines, then express transformations on it as if it were a single collection. Spark handles the parallelism, fault tolerance, and shuffling.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum
spark = SparkSession.builder.getOrCreate()
orders = spark.read.parquet("s3://lake/orders/")
revenue_by_country = (
orders
.filter(col("status") == "paid")
.groupBy("country")
.agg(sum("amount").alias("revenue"))
.orderBy(col("revenue").desc())
)
revenue_by_country.write.mode("overwrite").parquet("s3://lake/marts/revenue_by_country/")
That code might run on 1 GB locally or 100 TB across 500 nodes — same API.
Architecture in 60 Seconds
[ Driver ] ── orchestrates ──→ [ Executors on worker nodes ]
│ │
│ ▼
└── splits work into tasks ───→ process partitions of data
return results / shuffle
- Driver — your code's process; builds the plan, schedules stages.
- Executors — JVMs on workers; run tasks, hold cached data.
- Cluster manager — YARN, Kubernetes, Mesos, Databricks proprietary.
- Tasks — unit of work for one partition.
- Stages — groups of tasks separated by shuffles.
RDDs vs DataFrames vs Datasets
| API | Era | Use today |
|---|---|---|
| RDD | Spark 1.x | Rare — only for low-level control |
| DataFrame | Spark 1.3+ | The default API, Python and Scala |
| Dataset | Spark 1.6+ (typed) | Scala / Java only, mostly for type-safe code |
| Spark SQL | 1.x onward | Same engine, plain SQL queries |
Use DataFrames or Spark SQL. RDDs are an escape hatch.
Lazy Evaluation and Catalyst
Spark does not run operations as you write them. It builds a logical plan, then the Catalyst optimiser rewrites it: pushing filters down, pruning columns, reordering joins, choosing physical operators. Only when you trigger an action (write, count, collect, show) does execution start.
This is why orders.filter(...).select("id") only ever reads the columns you asked for — Catalyst pushed the projection into the Parquet read.
Where Performance Goes Wrong
Shuffles
Joins, groupBys, and window functions cause data to move across the network between executors. Shuffles are the most expensive operation in Spark.
- Use broadcast joins for small dimension tables (
broadcast(small_df)). - Pre-partition large tables on join keys when running the same join repeatedly.
- Avoid
orderByat scale unless required.
Skew
If one key (e.g. customer_id = NULL, or a single megacustomer) holds 30% of the data, one task does 30% of the work. Symptoms: 99 tasks finish, one runs for 4 hours.
- Salt skewed keys.
- Spark 3+ AQE (Adaptive Query Execution) handles many cases automatically.
Small Files
Reading a million 100KB files is slower than 1000 100MB files. Compact and target file sizes appropriately.
UDFs
Python UDFs serialise rows out of the JVM. Use built-in functions (pyspark.sql.functions) where possible. If you must use a UDF, prefer Pandas UDFs (vectorised).
Structured Streaming
The same DataFrame API, applied incrementally to a stream. Spark treats a stream as an "unbounded table" that grows over time.
stream = (
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker:9092")
.option("subscribe", "orders")
.load()
)
agg = (
stream
.selectExpr("CAST(value AS STRING) as json")
.selectExpr("from_json(json, 'order_id INT, amount DOUBLE') as o")
.select("o.*")
.groupBy(window("event_time", "1 minute"))
.sum("amount")
)
query = (
agg.writeStream
.format("delta")
.option("checkpointLocation", "/chk/orders_per_min")
.outputMode("append")
.trigger(processingTime="30 seconds")
.start("/lake/silver/orders_per_min")
)
Trade-offs: Spark Streaming is micro-batch (sub-second to seconds), great for most use cases, but for true sub-100ms latency consider Flink.
Spark vs the Warehouse
If your work is "join two tables, group by, aggregate, write" and the data fits in a modern warehouse, the warehouse is faster, simpler, and cheaper. You write SQL; you don't manage clusters.
Spark wins when:
- Data is unstructured or nested in ways SQL cannot easily express.
- You need ML (MLlib, Pandas, sklearn-on-Spark, distributed PyTorch).
- Data scale exceeds practical warehouse cost.
- You need a single engine for batch, streaming, and ML.
- You want full control of execution / custom logic.
Where Spark Runs
| Platform | Form |
|---|---|
| Databricks | Managed Spark + Delta Lake; the most polished experience |
| AWS EMR / Glue / Athena Spark | EMR for full clusters, Glue for serverless ETL, Athena Spark for notebooks |
| GCP Dataproc / Serverless | Managed Spark on GCP; Dataproc Serverless for elastic jobs |
| Azure Synapse / Fabric | Synapse Spark pools; Fabric notebooks on OneLake |
| Self-managed K8s | Spark Operator on Kubernetes |
Cert Mapping
| Cert | Spark scope |
|---|---|
| Databricks Certified Data Engineer | Spark + Delta Lake deep |
| DP-203 / DP-700 | Synapse Spark / Fabric Spark notebooks |
| AWS DEA-C01 | Glue Spark, EMR Spark, Athena Spark |
| GCP PDE | Dataproc, Dataproc Serverless |
Mental Model
Think of Spark as "SQL with Python superpowers, distributed". For routine ELT, prefer the warehouse. For everything else — ML, semi-structured, hyper-scale, streaming-batch unification — Spark is still the workhorse.