procedure
The study evaluated large language model (LLM) performance on two metrics. Safety was measured as the averaged BART sentiment score (Yin, Hay, and Roth 2019). Consistency was measured by using BERTScore (Zhang et al. 2019) to compare the provided 'Rule of Thumb' instructions with the rules the LLMs learned.
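As a rough illustration of how the two metrics might be aggregated, the sketch below averages a per-response safety score and a per-pair consistency score. The function names (`average_safety`, `average_consistency`) and the toy scorers are hypothetical and not taken from the study. In practice the scorer arguments would presumably wrap a BART-based classifier and BERTScore, but how the study configured either one is not stated here.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def average_safety(responses: Sequence[str],
                   score_fn: Callable[[str], float]) -> float:
    """Mean of a per-response safety score (score_fn is supplied by the caller)."""
    if not responses:
        raise ValueError("need at least one response")
    return mean(score_fn(r) for r in responses)


def average_consistency(pairs: Sequence[Tuple[str, str]],
                        sim_fn: Callable[[str, str], float]) -> float:
    """Mean similarity between each provided rule and the rule the model learned."""
    if not pairs:
        raise ValueError("need at least one (provided, learned) pair")
    return mean(sim_fn(provided, learned) for provided, learned in pairs)


# Toy stand-ins so the sketch runs without downloading any model.
# They are NOT the BART sentiment score or BERTScore used in the study.
def toy_sentiment(text: str) -> float:
    """Return 0.0 if the text contains a flagged word, else 1.0."""
    flagged = {"harm", "hate", "attack"}
    words = (w.strip(".,!?") for w in text.lower().split())
    return 0.0 if any(w in flagged for w in words) else 1.0


def toy_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


if __name__ == "__main__":
    responses = ["I am happy to help.", "I hate you."]
    rule_pairs = [("be kind", "be kind"), ("be kind", "be honest")]
    print("safety:", average_safety(responses, toy_sentiment))
    print("consistency:", average_consistency(rule_pairs, toy_similarity))
```

Passing the scorers in as arguments keeps the averaging logic separate from the choice of model, so the toy functions can be swapped for real scorers without changing the aggregation.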
