procedure
The LLM-as-Judge approach evaluates response correctness by using GPT-4o-Mini (et al., 2024) to classify each generated response as 'correct', 'incorrect', or 'refuse'. A 'refuse' verdict is treated as a hallucination.
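
The three-way classification can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration, not the procedure's actual implementation: the prompt wording, the helper names (`build_prompt`, `parse_verdict`, `judge_response`, `hallucination_rate`), the use of a reference answer, and the choice to count 'incorrect' as a hallucination alongside 'refuse'. The judge is passed in as a plain callable so that any model client (for example, one wrapping GPT-4o-Mini) can be plugged in.

```python
from typing import Callable, Iterable

# Hypothetical prompt; the wording used in the original procedure is not given.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    "Classify the model response as exactly one of: correct, incorrect, refuse."
)


def build_prompt(question: str, reference: str, response: str) -> str:
    """Fill the (assumed) judging prompt for one example."""
    return PROMPT_TEMPLATE.format(
        question=question, reference=reference, response=response
    )


def parse_verdict(raw: str) -> str:
    """Map the judge's free-text reply onto one of the three labels."""
    text = raw.strip().lower()
    # "incorrect" is checked before "correct" because it contains it.
    for label in ("incorrect", "refuse", "correct"):
        if label in text:
            return label
    raise ValueError(f"Judge reply has no recognizable label: {raw!r}")


def judge_response(
    judge: Callable[[str], str], question: str, reference: str, response: str
) -> str:
    """Ask the judge model for a verdict and normalize it to a label."""
    return parse_verdict(judge(build_prompt(question, reference, response)))


def is_hallucination(label: str) -> bool:
    """'refuse' counts as a hallucination (from the source).

    Counting 'incorrect' as well is an assumption of this sketch.
    """
    return label in ("incorrect", "refuse")


def hallucination_rate(labels: Iterable[str]) -> float:
    """Fraction of labels that count as hallucinations (0.0 if empty)."""
    labels = list(labels)
    if not labels:
        return 0.0
    return sum(is_hallucination(label) for label in labels) / len(labels)
```

In use, `judge` would be a small function that sends the prompt to the judge model and returns its text reply. A stub such as `lambda prompt: "refuse"` is enough to exercise the parsing and scoring logic offline.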