An ML team runs Agent Evaluation on their RAG chatbot's evaluation set. 30% of responses fail quality checks, and the root cause analysis shows failures concentrated in the groundedness judge. The team confirms the LLM is following the system prompt grounding instructions correctly. What is the most likely root cause and recommended fix?

Question

Accepted Answer

B. The retrieval step is returning documents that do not contain information relevant to the query, so the LLM cannot be grounded and may hallucinate or decline to answer.. In Agent Evaluation, the groundedness judge fails when the response contains claims not supported by the retrieved context. If the LLM is correctly following grounding instructions but groundedness still fails, the root cause is retrieval quality — the retrieved documents do not contain the answer, so the LLM cannot produce a grounded response. The fix is to improve retrieval: tune chunk size, improve the embedding model, add re-ranking, or enable hybrid search.

Related Questions

Discussion