FACTS Grounding
The FACTS Grounding Leaderboard evaluates LLMs' ability to generate factually accurate long-form responses grounded in provided context documents of up to 32k tokens.
FACTS Grounding Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type |
---|---|---|---|---|---|---|
1 | Gemini 2.5 Pro | Google | 87.8 | | 2025-05-06 | Multimodal |
2 | Gemini 2.5 Flash | Google | 85.8 | | 2025-05-20 | Multimodal |
3 | Gemini 2.0 Flash | Google | 85.6 | | 2025-02-25 | Multimodal |
4 | Gemini 2.5 Flash-Lite | Google | 83.8 | | 2025-06-17 | Multimodal |
5 | Claude 3.5 Sonnet | Anthropic | 83.3 | | 2024-06-20 | Multimodal |
6 | Gemini 2.0 Flash-Lite | Google | 83.2 | | 2025-02-25 | Multimodal |
7 | Gemini 2.0 Pro | Google | 82.8 | | 2025-02-05 | Multimodal |
8 | Claude Sonnet 4 | Anthropic | 79.1 | | 2025-05-22 | Multimodal |
9 | Claude 3.7 Sonnet | Anthropic | 78.8 | | 2025-02-24 | Multimodal |
10 | o1 | OpenAI | 78.8 | | 2024-09-12 | Multimodal |
About FACTS Grounding
Description
The FACTS Grounding benchmark measures language models' ability to generate text that is factually accurate with respect to the context given in the user prompt. Each prompt includes a user request and a full document of up to 32k tokens, requiring a long-form response that is fully grounded in the provided context document while fulfilling the user request.

The evaluation uses automated judge models in two phases: (1) a response is disqualified if it does not fulfill the user request; (2) an eligible response is judged accurate only if it is fully grounded in the provided document. The judge models were comprehensively evaluated against a held-out test set, and the final factuality score aggregates multiple judge models (Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet) to mitigate evaluation bias; a sketch of this flow appears after the feature list below.

The benchmark contains 860 public examples (the 'Open' split) and 859 private examples (the 'Blind' split) of natural, complex LLM prompts written by humans. The prompts span enterprise domains including finance, technology, retail, medical, and legal, with tasks such as fact finding, summarization, effect analysis, and concept comparison. Context documents have a mean length of 2.5k tokens and a maximum of 32k tokens.

Key features:
• Long-form grounded response generation with context up to 32k tokens
• Multi-judge evaluation system to reduce bias
• Eligibility filter to prevent metric gaming
• Diverse enterprise domains and task types
• Public and private splits for leaderboard integrity
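The two-phase evaluation can be pictured as a small scoring loop. The sketch below is a hypothetical illustration only: the `Example` dataclass, the judge callables, the decision to count disqualified responses as inaccurate, and aggregation by a simple mean across judges are all assumptions, not the benchmark's released implementation.

```python
# Hypothetical sketch of the two-phase FACTS Grounding scoring flow.
# The judge callables below stand in for LLM judges; names and the
# aggregation rule are assumptions, not the benchmark's actual code.
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class Example:
    user_request: str
    context_document: str
    response: str

# A judge inspects one example and returns a boolean verdict.
Judge = Callable[[Example], bool]

def score_model(examples: list[Example],
                eligibility_judges: dict[str, Judge],
                grounding_judges: dict[str, Judge]) -> float:
    """Return a 0-100 factuality score aggregated across judge models."""
    per_judge_scores = []
    for name, grounding_judge in grounding_judges.items():
        accurate = 0
        for ex in examples:
            # Phase 1: disqualify responses that do not fulfill the user request.
            if not eligibility_judges[name](ex):
                continue  # assumed: a disqualified response scores 0 on this example
            # Phase 2: an eligible response counts as accurate only if it is
            # fully grounded in the provided context document.
            if grounding_judge(ex):
                accurate += 1
        per_judge_scores.append(100.0 * accurate / len(examples))
    # Final score: aggregate across judges (e.g. Gemini 1.5 Pro, GPT-4o,
    # Claude 3.5 Sonnet) to mitigate single-judge bias.
    return mean(per_judge_scores)
```

In practice each judge is itself an LLM prompted with the request, the context document, and the candidate response; the stub callables above stand in for those judge calls.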
Methodology
FACTS Grounding evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
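As a toy illustration of the 0 to 100 scale, the snippet below runs the hypothetical `score_model` sketch from the description above with dummy judges; the data and lambda judges are placeholders, not real judge models.

```python
# Toy usage of the hypothetical score_model sketch; dummy judges only.
toy_examples = [
    Example("Summarize the policy.", "Policy text ...", "Grounded summary of the policy."),
    Example("List the key risks.", "Report text ...", "Answer citing facts not in the report."),
]
eligibility = {"judge-a": lambda ex: True}                     # every response fulfills the request
grounding = {"judge-a": lambda ex: "Grounded" in ex.response}  # only the first response is grounded

print(score_model(toy_examples, eligibility, grounding))  # 50.0: one of two responses is accurate
```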
Publication
This benchmark was published in 2025.
Technical Paper