Question

An ML team wants to deploy a PyTorch transformer model for real-time inference. The model weighs 8 GB and requires GPU acceleration. During peak hours, the team expects up to 4 concurrent requests. Which compute type configuration is most appropriate?

Accepted Answer

C. `GPU_MEDIUM` (1xA10G, 24 GB per concurrency) with provisioned concurrency set to 4.

`GPU_MEDIUM` provides 24 GB of GPU memory per concurrency on a single A10G GPU, which fits the 8 GB model and leaves about 16 GB of headroom for computation. `GPU_SMALL` (T4, 16 GB) is borderline: once the 8 GB of weights are loaded, only about 8 GB remains for framework overhead and computation. `MULTIGPU_MEDIUM` is designed for very large models that require multiple GPUs and would be overkill here. Setting provisioned concurrency to 4 covers the expected peak of 4 concurrent requests.
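The sizing argument above can be sketched as a quick calculation. This is only an illustration: the memory figures are the ones quoted in the answer, while the `overhead_factor` of 1.5 and the 70% utilization ceiling are assumed rules of thumb, and the helper names `utilization` and `smallest_fit` are hypothetical rather than part of any serving API.

```python
# GPU memory per compute type in GB, as quoted in the answer above.
GPU_MEMORY_GB = {"GPU_SMALL": 16, "GPU_MEDIUM": 24}


def utilization(compute_type, model_gb, overhead_factor=1.5):
    """Fraction of GPU memory used by the model plus assumed runtime overhead.

    overhead_factor is an assumption standing in for framework overhead
    and working memory; measure the real footprint before relying on it.
    """
    return (model_gb * overhead_factor) / GPU_MEMORY_GB[compute_type]


def smallest_fit(model_gb, max_utilization=0.7, overhead_factor=1.5):
    """Return the smallest compute type that stays under the assumed ceiling."""
    for name in sorted(GPU_MEMORY_GB, key=GPU_MEMORY_GB.get):
        if utilization(name, model_gb, overhead_factor) <= max_utilization:
            return name
    return None  # nothing in the table fits; a multi-GPU type may be needed


if __name__ == "__main__":
    for name in GPU_MEMORY_GB:
        print(name, utilization(name, model_gb=8))
    print("choice:", smallest_fit(model_gb=8))
```

Under these assumptions the 8 GB model uses three quarters of `GPU_SMALL` and half of `GPU_MEDIUM`, so the sketch picks `GPU_MEDIUM`. A different overhead factor or ceiling could change the outcome, which is why the answer calls `GPU_SMALL` borderline rather than impossible.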

