Question

A data science team deploys a model for batch inference using `mlflow.pyfunc.spark_udf()` on a DataFrame with 1 million rows. The job is very slow despite having a large cluster. Which change is most likely to improve performance?

Accepted Answer

C. Repartition the input DataFrame to increase parallelism across executors. With `mlflow.pyfunc.spark_udf()`, batch inference runs as ordinary Spark tasks, and Spark schedules one task per partition. If the DataFrame has only a few partitions, only that many tasks can run at once, so most of the cluster's cores sit idle no matter how large the cluster is. Repartitioning the DataFrame to a count in line with the cluster's total cores lets the UDF run in parallel across all executors.
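The fix can be sketched roughly as follows. This is a minimal illustration, not the only way to do it: `target_partitions`, `score_in_parallel`, the `tasks_per_core=2` rule of thumb, and the `feature_cols` and `"prediction"` names are all hypothetical choices made for the example. It also assumes a classic cluster where `spark.sparkContext` and `df.rdd` are available.

```python
def target_partitions(total_cores, tasks_per_core=2):
    """Rule-of-thumb partition count: a small multiple of the available cores.

    The multiplier is an assumption for illustration, not an MLflow or Spark setting.
    """
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("total_cores and tasks_per_core must be positive")
    return total_cores * tasks_per_core


def score_in_parallel(spark, df, model_uri, feature_cols):
    """Repartition df if it is under-partitioned, then apply the model as a Spark UDF."""
    # Imported lazily so target_partitions() works without Spark or MLflow installed.
    import mlflow.pyfunc
    from pyspark.sql.functions import col

    wanted = target_partitions(spark.sparkContext.defaultParallelism)
    if df.rdd.getNumPartitions() < wanted:
        # repartition() triggers a full shuffle and spreads rows evenly across tasks.
        df = df.repartition(wanted)

    predict = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)
    return df.withColumn("prediction", predict(*[col(c) for c in feature_cols]))
```

Checking `df.rdd.getNumPartitions()` before and after the call is a quick way to confirm whether under-partitioning was the bottleneck. The shuffle from `repartition()` has its own cost, so it pays off mainly when the starting partition count is far below the number of cores.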
