procedure
The Med-HALT benchmark computes two cosine-similarity scores from embeddings of the model output: Answer Similarity, between the correct option and the model output, and Question Similarity, between the original question and the model output.
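
As a rough illustration of the two scores described above, the following minimal sketch computes cosine similarity between already-embedded vectors. The function names (`cosine_similarity`, `similarity_scores`) and the toy vectors are hypothetical and chosen only for illustration. The embedding model is not specified here, so the sketch assumes the embeddings are supplied as plain lists of floats.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_scores(
    output_emb: Sequence[float],
    correct_option_emb: Sequence[float],
    question_emb: Sequence[float],
) -> Dict[str, float]:
    """Return both scores for one model output (hypothetical helper)."""
    return {
        # Similarity between the correct option and the model output.
        "answer_similarity": cosine_similarity(correct_option_emb, output_emb),
        # Similarity between the original question and the model output.
        "question_similarity": cosine_similarity(question_emb, output_emb),
    }


# Toy 3-dimensional vectors standing in for real embeddings (assumed values).
scores = similarity_scores(
    output_emb=[1.0, 0.0, 1.0],
    correct_option_emb=[1.0, 0.0, 1.0],
    question_emb=[0.0, 1.0, 0.0],
)
print(scores)
```

In this toy case the output points in the same direction as the correct option and is orthogonal to the question, so Answer Similarity is close to 1 and Question Similarity is 0.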
