claim
Human evaluation is considered the gold standard for RAG systems, but it does not scale, and automation requires ground truth, which synthetic test sets often fail to provide in real enterprise domains.

Authors

Sources

Referenced by nodes (4)