A PySpark job joins a large transactions table (200 GB) with a small currency rates lookup table (10 MB). The job experiences slow performance due to shuffle overhead. Which optimization directly addresses this problem?

Question

Accepted Answer

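B. Use the `broadcast()` hint on the small currency rates DataFrame. The hint tells Spark to send a full copy of the small DataFrame to every executor, so the 200 GB transactions table is joined in place, partition by partition, with a broadcast hash join instead of being shuffled for a sort-merge join. Note that the default value of `spark.sql.autoBroadcastJoinThreshold` is 10 MB (10485760 bytes), so a 10 MB lookup table sits right at the limit rather than comfortably under it, and Spark's size estimate may keep it from being broadcast automatically. The explicit `broadcast()` hint removes that uncertainty, which makes it the optimization that directly eliminates the shuffle of the large table.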
