Question

A data engineer needs to join a large `transactions` DataFrame with a small `lookup` DataFrame that contains 500 rows. The join is causing a shuffle that is slowing down the pipeline. What optimization should the engineer apply?

Accepted Answer

B. Use a broadcast join hint to send the small `lookup` DataFrame to every executor. When one side of a join is small enough to fit in memory on each executor, Spark can broadcast it and join each partition of the large DataFrame in place, which eliminates the expensive shuffle of the large DataFrame. In PySpark, wrap the small side in `broadcast` from `pyspark.sql.functions` (e.g., `df_large.join(broadcast(df_small), ...)`); in SQL, use the `BROADCAST` hint. The Databricks Spark documentation covers broadcast joins as a key optimization for small-table joins and for skewed joins.

Repartitioning the large DataFrame is itself a shuffle and does not remove the one caused by the join. A cross join produces a Cartesian product, which is generally the wrong result for a keyed lookup. Increasing the number of shuffle partitions changes how many partitions the shuffle produces, not whether a shuffle occurs.
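To make the "no shuffle" claim concrete, here is a minimal conceptual sketch of a broadcast hash join in plain Python. It is not Spark code and not how Spark implements the join internally. The function name `broadcast_hash_join`, the column names, and the sample rows are all hypothetical, chosen only to illustrate the idea that every partition of the large side is joined locally against a full copy of the small side.

```python
# Conceptual model of a broadcast hash join in plain Python (not Spark code).
# All names and data below are hypothetical and exist only for illustration.

def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a copy of the small side."""
    # Build one hash table from the small side. In Spark, this is the piece
    # that would be shipped ("broadcast") to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition is joined where it already lives: rows of the large
        # side never move between partitions, so no shuffle is required.
        joined = []
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
        joined_partitions.append(joined)
    return joined_partitions


# Two partitions of a hypothetical transactions dataset.
transactions = [
    [{"txn_id": 1, "code": "A"}, {"txn_id": 2, "code": "B"}],
    [{"txn_id": 3, "code": "A"}, {"txn_id": 4, "code": "Z"}],  # "Z" has no match
]
# The small lookup side (500 rows in the question; 2 rows here).
lookup_rows = [
    {"code": "A", "label": "alpha"},
    {"code": "B", "label": "beta"},
]

result = broadcast_hash_join(transactions, lookup_rows, key="code")
```

The output keeps the same partition layout as the input, which is the point: only the small table is copied around, while the large table stays put. A shuffle-based join would instead have to regroup both sides by the join key before matching rows.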
