claim
Evaluation of text generated by Large Language Models is inconsistent and unreliable: human judgments and automatic evaluation tools often disagree, and the models themselves can be biased by their training data.
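
One common way to make the disagreement between human judgments and an automatic tool concrete is a rank correlation between the two sets of scores. The sketch below is illustrative only and is not taken from the claim's sources: the function names `average_ranks` and `spearman` are hypothetical helpers, and the ratings and metric scores are made-up numbers, not measurements.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five generated texts (illustrative only, not real data).
human_ratings = [5, 4, 2, 3, 1]                   # assumed 1-5 human quality ratings
metric_scores = [0.62, 0.71, 0.40, 0.35, 0.30]    # assumed automatic metric outputs

rho = spearman(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 would mean the tool ranks texts much as the human raters do; values near 0 or below would indicate the kind of inconsistency the claim describes. Rank correlation is only one possible agreement measure, chosen here because it does not assume the two scales are comparable.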
