Question

A data engineer writes the following PySpark code to find the top 3 products by total revenue per category. Which statement best describes the result?

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("category").orderBy(F.desc("total_revenue"))
result = (
    df.groupBy("category", "product_id")
    .agg(F.sum("revenue").alias("total_revenue"))
    .withColumn("rank", F.rank().over(windowSpec))
    .filter(F.col("rank") <= 3)
)
```

Accepted Answer

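B. The code returns the top 3 products per category by total revenue, but ties in revenue may cause more than 3 rows to be returned per category.

The code is logically sound: it groups by `category` and `product_id` to compute `total_revenue`, then applies `F.rank()` (SQL `RANK()`) within each category, ordered by revenue descending. However, `RANK()` assigns the same rank to tied values and skips the ranks that follow. For example, two products with equal top revenue both receive rank 1, the next product receives rank 3, and no row gets rank 2. As a result, whenever a group of tied products extends past the third position (for example, ranks 1, 2, 3, 3), more than 3 rows pass the `rank <= 3` filter. To return at most 3 rows per category, use `F.row_number()` (SQL `ROW_NUMBER()`) instead, ideally with a tiebreaker such as `product_id` added to the window's `orderBy`, because the order of tied rows is otherwise not deterministic.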
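
To make the tie behavior concrete, here is a minimal plain-Python sketch that emulates how `RANK()` and `ROW_NUMBER()` number the rows of a single, already-ordered partition. It does not use Spark; the `rows` data, the `number_rows` and `top_n` helpers, and the use of `product_id` as a tiebreaker are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical (product_id, total_revenue) rows for ONE category, already
# sorted by total_revenue descending, with product_id breaking ties.
rows: List[Tuple[str, int]] = [
    ("p1", 500),
    ("p2", 500),
    ("p3", 300),
    ("p4", 300),
    ("p5", 100),
]


def number_rows(sorted_rows: List[Tuple[str, int]]) -> List[Dict[str, Union[str, int]]]:
    """Emulate ROW_NUMBER and RANK over one ordered partition."""
    out = []
    prev_value = None
    rank = 0
    for row_number, (product_id, value) in enumerate(sorted_rows, start=1):
        if value != prev_value:
            # A new value takes its position as rank, so ranks are skipped after ties.
            rank = row_number
            prev_value = value
        out.append({"product_id": product_id, "row_number": row_number, "rank": rank})
    return out


def top_n(numbered: List[Dict[str, Union[str, int]]], column: str, n: int = 3) -> List[str]:
    """Keep the product_ids whose value in `column` is <= n, like the filter step."""
    return [r["product_id"] for r in numbered if r[column] <= n]


numbered = number_rows(rows)
by_rank = top_n(numbered, "rank")              # p3 and p4 tie at rank 3, so 4 rows survive
by_row_number = top_n(numbered, "row_number")  # at most 3 rows survive

print([r["rank"] for r in numbered])
print(by_rank)
print(by_row_number)
```

In this sample the ranks come out as 1, 1, 3, 3, 5, so filtering on `rank <= 3` keeps four products while filtering on `row_number <= 3` keeps three.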
