An ML team deploys a real-time model serving endpoint and enables the scale-to-zero feature. A product manager is concerned about latency for the first request after a quiet period. What behavior should the team communicate?

Question

Accepted Answer

B. The first request after scaling to zero triggers a cold start; scaling up from zero typically takes 10-20 seconds but can sometimes take minutes, with no SLA on scale-from-zero latency.. Databricks documentation states that when scale-to-zero is enabled and the endpoint has been idle for 30 minutes, it scales to zero. The first request after this triggers a cold start where the endpoint must scale back up, typically taking 10-20 seconds but potentially longer. There is no SLA on scale-from-zero latency, making it unsuitable for production workloads requiring guaranteed response times.

An ML team deploys a real-time model serving endpoint and enables the scale-to-zero feature. A product manager is concerned about latency for the first request after a quiet period. What behavior should the team communicate?

Related Questions

Discussion