Spark and Distributed Compute — Data Engineering Fundamentals | CertQnA

Apache Spark, born at UC Berkeley in 2009 and Apache top-level in 2014, became the dominant engine for distributed data processing. It is fast, multi-language (Python, Scala, Java, R, SQL), and runs everywhere — Databricks, EMR, Dataproc, Synapse, Fabric, Kubernetes, on-prem.

The Core Idea

Distribute a dataset across many machines, then express transformations on it as if it were a single collection. Spark handles the parallelism, fault tolerance, and shuffling.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("s3://lake/orders/")

revenue_by_country = (
    orders
      .filter(col("status") == "paid")
      .groupBy("country")
      .agg(sum("amount").alias("revenue"))
      .orderBy(col("revenue").desc())
)

revenue_by_country.write.mode("overwrite").parquet("s3://lake/marts/revenue_by_country/")

That code might run on 1 GB locally or 100 TB across 500 nodes — same API.

Architecture in 60 Seconds

[ Driver ]  ── orchestrates ──→  [ Executors on worker nodes ]
   │                                       │
   │                                       ▼
   └── splits work into tasks ───→  process partitions of data
                                    return results / shuffle

Driver — your code's process; builds the plan, schedules stages.
Executors — JVMs on workers; run tasks, hold cached data.
Cluster manager — YARN, Kubernetes, Mesos, Databricks proprietary.
Tasks — unit of work for one partition.
Stages — groups of tasks separated by shuffles.

RDDs vs DataFrames vs Datasets

API	Era	Use today
RDD	Spark 1.x	Rare — only for low-level control
DataFrame	Spark 1.3+	The default API, Python and Scala
Dataset	Spark 1.6+ (typed)	Scala / Java only, mostly for type-safe code
Spark SQL	1.x onward	Same engine, plain SQL queries

Use DataFrames or Spark SQL. RDDs are an escape hatch.

Lazy Evaluation and Catalyst

Spark does not run operations as you write them. It builds a logical plan, then the Catalyst optimiser rewrites it: pushing filters down, pruning columns, reordering joins, choosing physical operators. Only when you trigger an action (write, count, collect, show) does execution start.

This is why orders.filter(...).select("id") only ever reads the columns you asked for — Catalyst pushed the projection into the Parquet read.

Where Performance Goes Wrong

Shuffles

Joins, groupBys, and window functions cause data to move across the network between executors. Shuffles are the most expensive operation in Spark.

Use broadcast joins for small dimension tables (broadcast(small_df)).
Pre-partition large tables on join keys when running the same join repeatedly.
Avoid orderBy at scale unless required.

Skew

If one key (e.g. customer_id = NULL, or a single megacustomer) holds 30% of the data, one task does 30% of the work. Symptoms: 99 tasks finish, one runs for 4 hours.

Salt skewed keys.
Spark 3+ AQE (Adaptive Query Execution) handles many cases automatically.

Small Files

Reading a million 100KB files is slower than 1000 100MB files. Compact and target file sizes appropriately.

UDFs

Python UDFs serialise rows out of the JVM. Use built-in functions (pyspark.sql.functions) where possible. If you must use a UDF, prefer Pandas UDFs (vectorised).

Structured Streaming

The same DataFrame API, applied incrementally to a stream. Spark treats a stream as an "unbounded table" that grows over time.

stream = (
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "orders")
      .load()
)

agg = (
    stream
      .selectExpr("CAST(value AS STRING) as json")
      .selectExpr("from_json(json, 'order_id INT, amount DOUBLE') as o")
      .select("o.*")
      .groupBy(window("event_time", "1 minute"))
      .sum("amount")
)

query = (
    agg.writeStream
      .format("delta")
      .option("checkpointLocation", "/chk/orders_per_min")
      .outputMode("append")
      .trigger(processingTime="30 seconds")
      .start("/lake/silver/orders_per_min")
)

Trade-offs: Spark Streaming is micro-batch (sub-second to seconds), great for most use cases, but for true sub-100ms latency consider Flink.

Spark vs the Warehouse

If your work is "join two tables, group by, aggregate, write" and the data fits in a modern warehouse, the warehouse is faster, simpler, and cheaper. You write SQL; you don't manage clusters.

Spark wins when:

Data is unstructured or nested in ways SQL cannot easily express.
You need ML (MLlib, Pandas, sklearn-on-Spark, distributed PyTorch).
Data scale exceeds practical warehouse cost.
You need a single engine for batch, streaming, and ML.
You want full control of execution / custom logic.

Where Spark Runs

Platform	Form
Databricks	Managed Spark + Delta Lake; the most polished experience
AWS EMR / Glue / Athena Spark	EMR for full clusters, Glue for serverless ETL, Athena Spark for notebooks
GCP Dataproc / Serverless	Managed Spark on GCP; Dataproc Serverless for elastic jobs
Azure Synapse / Fabric	Synapse Spark pools; Fabric notebooks on OneLake
Self-managed K8s	Spark Operator on Kubernetes

Cert Mapping

Cert	Spark scope
Databricks Certified Data Engineer	Spark + Delta Lake deep
DP-203 / DP-700	Synapse Spark / Fabric Spark notebooks
AWS DEA-C01	Glue Spark, EMR Spark, Athena Spark
GCP PDE	Dataproc, Dataproc Serverless

Mental Model

Think of Spark as "SQL with Python superpowers, distributed". For routine ELT, prefer the warehouse. For everything else — ML, semi-structured, hyper-scale, streaming-batch unification — Spark is still the workhorse.