measurement
GPT-4 achieves an F1-score of approximately 0.625 in detecting subtle falsehoods on the hardest subset of the MedHallu benchmark.
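
For context on what an F1-score of about 0.625 means, the sketch below computes F1 for a binary "hallucinated vs. faithful" detector from confusion counts. The function name `f1_score` and the counts are hypothetical and chosen only for illustration; they are not taken from MedHallu or from any reported GPT-4 run.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class (here assumed to be 'hallucinated').

    F1 is the harmonic mean of precision and recall, which simplifies to
    2*TP / (2*TP + FP + FN).
    """
    denom = 2 * tp + fp + fn
    if denom == 0:
        # Convention used in this sketch: no positives predicted or present.
        return 0.0
    return 2 * tp / denom


# Toy counts (made up for illustration, NOT benchmark results):
# 5 hallucinations caught, 3 false alarms, 3 hallucinations missed.
toy_f1 = f1_score(tp=5, fp=3, fn=3)
print(f"{toy_f1:.3f}")  # prints 0.625
```

With these toy counts, precision and recall are both 5/8, so a score near 0.625 is consistent with a detector that misses a substantial share of hallucinations, raises a substantial share of false alarms, or both.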
