claim
Metrics such as ROUGE and F1 can be misleading because they measure shallow lexical similarity (word overlap) between the ground truth and the LLM response. A response can therefore score highly even when its actual meaning differs from the ground truth.
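A minimal sketch of the effect, assuming a simplified unigram-overlap F1 (similar in spirit to ROUGE-1, but not the exact scoring used by any particular library or by the cited source). The function name `token_f1` and the example sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings.

    Illustrative simplification: no stemming, no punctuation handling.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and two hypothetical LLM responses.
reference = "the drug is safe for children"
contradiction = "the drug is not safe for children"  # opposite meaning
paraphrase = "kids can take this medication without risk"  # same meaning

f1_contradiction = token_f1(reference, contradiction)
f1_paraphrase = token_f1(reference, paraphrase)

print(f"contradiction: {f1_contradiction:.3f}")  # 0.923
print(f"paraphrase:    {f1_paraphrase:.3f}")     # 0.000
```

In this sketch the response that contradicts the ground truth scores about 0.92 because it shares six of its seven words with it, while the faithful paraphrase scores 0 because it shares none.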
Authors
Sources
- Evaluating RAG applications with Amazon Bedrock knowledge base ... aws.amazon.com via serper
Referenced by nodes (3)
- ROUGE concept
- ground truth concept
- F1 concept