claim
Prompt engineering and dataset-specific post-processing, when used to improve ROUGE scores, often fail to scale or generalize across models and datasets.
Authors
Sources
- Re-evaluating Hallucination Detection in LLMs (arXiv, arxiv.org)
Referenced by nodes (2)
- ROUGE concept
- prompt engineering concept