claim
Benchmarking results from the PHANTOM study indicate that out-of-the-box Large Language Models face severe challenges in detecting real-world hallucinations within long-context data.
Sources
- A Benchmark for Hallucination Detection in Financial Long-Context QA (neurips.cc)
Referenced by nodes (1)
- Large Language Models concept