reference
The study evaluated several alternative metrics for text evaluation, including BERTScore (Zhang et al., 2020), BLEU (Papineni et al., 2002), SummaC (Laban et al., 2022), and UniEval-fact (Zhong et al., 2022), benchmarking them against LLM-as-Judge labels.
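Benchmarking a metric against LLM-as-Judge labels typically means checking how well the metric's continuous scores separate the judge's binary verdicts. A minimal sketch of that comparison, using AUROC and invented example scores (the numbers and the AUROC framing are illustrative assumptions, not taken from the study):

```python
def auroc(scores, labels):
    """AUROC: probability a randomly chosen positive example
    outscores a randomly chosen negative one (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical metric scores (e.g., SummaC-style consistency scores)
# paired with binary LLM-as-Judge labels (1 = consistent, 0 = hallucinated).
metric_scores = [0.91, 0.78, 0.30, 0.35, 0.12, 0.40]
judge_labels  = [1,    1,    1,    0,    0,    0]

print(auroc(metric_scores, judge_labels))  # → 0.777...
```

A higher AUROC means the metric's ranking agrees more closely with the judge's labels; 0.5 is chance level.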
Sources
- Re-evaluating Hallucination Detection in LLMs (arXiv)
Referenced by nodes (2)
- SummaC concept
- LLM-as-a-judge concept