claim
Benchmarks for large language models that test only high-frequency factual questions fail to reveal hallucination on tail (rarely mentioned) entities, and benchmarks that test only short responses fail to reveal the accumulation of exposure bias over long generations.
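
One way to make this claim checkable is to report error rates per stratum rather than as a single aggregate. The sketch below is only an illustration under stated assumptions: the record fields (`entity_frequency`, `response_tokens`, `correct`), the thresholds, and the toy records are all hypothetical placeholders, not measurements or part of any particular benchmark.

```python
from collections import defaultdict

def stratified_error_rate(records, bucket_of):
    """Return {bucket: error rate}, where bucket = bucket_of(record)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        bucket = bucket_of(record)
        totals[bucket] += 1
        if not record["correct"]:
            errors[bucket] += 1
    return {bucket: errors[bucket] / totals[bucket] for bucket in totals}

def frequency_bucket(record, head_threshold=1000):
    # Assumed threshold: entities seen at least this often count as "head".
    return "head" if record["entity_frequency"] >= head_threshold else "tail"

def length_bucket(record, short_limit=50):
    # Assumed threshold: responses up to this many tokens count as "short".
    return "short" if record["response_tokens"] <= short_limit else "long"

# Synthetic placeholder records, invented purely to exercise the functions.
records = [
    {"entity_frequency": 50000, "response_tokens": 12, "correct": True},
    {"entity_frequency": 42000, "response_tokens": 300, "correct": False},
    {"entity_frequency": 30, "response_tokens": 15, "correct": False},
    {"entity_frequency": 8, "response_tokens": 400, "correct": False},
]

by_frequency = stratified_error_rate(records, frequency_bucket)
by_length = stratified_error_rate(records, length_bucket)

# A benchmark restricted to head entities and short responses sees only
# this slice, so errors in the other strata never enter its score.
head_short_only = [
    r for r in records
    if frequency_bucket(r) == "head" and length_bucket(r) == "short"
]
restricted = stratified_error_rate(head_short_only, lambda r: "all")

print(by_frequency)
print(by_length)
print(restricted)
```

Reporting the per-bucket rates side by side with the restricted score shows which failures a narrow benchmark cannot observe; the bucket boundaries would need to be chosen for the actual corpus and task.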
