FACTS Grounding


The FACTS Grounding Leaderboard evaluates LLMs' ability to generate factually accurate long-form responses grounded in provided context documents of up to 32k tokens.

Published: 2025
Score Range: 0-100
Top Score: 87.8

FACTS Grounding Leaderboard

Rank  Model                  Provider   Score  Released    Type
1     Gemini 2.5 Pro         Google     87.8   2025-05-06  Multimodal
2     Gemini 2.5 Flash       Google     85.8   2025-05-20  Multimodal
3     Gemini 2.0 Flash       Google     85.6   2025-02-25  Multimodal
4     Gemini 2.5 Flash-Lite  Google     83.8   2025-06-17  Multimodal
5     Claude 3.5 Sonnet      Anthropic  83.3   2024-06-20  Multimodal
6     Gemini 2.0 Flash-Lite  Google     83.2   2025-02-25  Multimodal
7     Gemini 2.0 Pro         Google     82.8   2025-02-05  Multimodal
8     Claude Sonnet 4        Anthropic  79.1   2025-05-22  Multimodal
9     Claude 3.7 Sonnet      Anthropic  78.8   2025-02-24  Multimodal
10    o1                     OpenAI     78.8   2024-09-12  Multimodal

Parameter counts are not reported for the listed models.

About FACTS Grounding

Description

The FACTS Grounding benchmark measures language models' ability to generate text that is factually accurate with respect to the context given in the user prompt. Each prompt includes a user request and a full document of up to 32k tokens, requiring a long-form response that is fully grounded in the provided context document while fulfilling the user request.

The evaluation process uses automated judge models in two phases: (1) responses are disqualified if they do not fulfill the user request; (2) remaining responses are judged as accurate if they are fully grounded in the provided document. The automated judge models were comprehensively evaluated against a held-out test set, and the final factuality score is an aggregate of multiple judge models (Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet) to mitigate evaluation bias.

The benchmark contains 860 public examples (the 'Open' split) and 859 private examples (the 'Blind' split) of natural, complex LLM prompts written by humans. The prompts span various enterprise domains including finance, technology, retail, medical, and legal, with tasks including fact finding, summarization, effect analysis, and concept comparison. Context documents have a mean length of 2.5k tokens and a maximum of 32k tokens.

Key features:
• Long-form grounded response generation with context up to 32k tokens
• Multi-judge evaluation system to reduce bias
• Eligibility filter to prevent metric gaming
• Diverse enterprise domains and task types
• Public and private splits for leaderboard integrity
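As a rough illustration of this two-phase protocol, the sketch below shows how per-judge verdicts could be combined into a single 0-100 factuality score. The judge-calling helper, field names, and data layout are hypothetical placeholders for illustration only, not the benchmark's actual implementation.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical verdict returned by one judge model for one (prompt, response) pair.
@dataclass
class Verdict:
    eligible: bool   # Phase 1: does the response fulfill the user request?
    grounded: bool   # Phase 2: is the response fully supported by the context document?

def judge_response(judge_model: str, request: str, document: str, response: str) -> Verdict:
    """Placeholder for querying one automated judge (e.g. Gemini 1.5 Pro, GPT-4o,
    or Claude 3.5 Sonnet). The real judge prompts and parsing are assumptions here."""
    raise NotImplementedError

def factuality_score(examples, judge_models) -> float:
    """A response counts as factual for a given judge only if it passes the
    eligibility filter AND is fully grounded; per-judge accuracies are then
    averaged across judges to mitigate single-judge bias."""
    per_judge = []
    for judge in judge_models:
        verdicts = [
            judge_response(judge, ex["request"], ex["document"], ex["response"])
            for ex in examples
        ]
        per_judge.append(mean(v.eligible and v.grounded for v in verdicts))
    return 100 * mean(per_judge)
```

The eligibility check in phase 1 is what prevents metric gaming: a model cannot earn a high score by producing trivially grounded but unresponsive text, because such responses are disqualified before grounding is assessed.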

Methodology

FACTS Grounding reports scores on a scale of 0 to 100, where higher scores indicate that a larger share of a model's responses were judged both eligible and fully grounded in the provided document. For detailed information about the judge prompts, aggregation, and scoring methodology, please refer to the original paper.
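As an illustrative worked example (the numbers are invented, not taken from the paper): if the three judge models rate 0.88, 0.86, and 0.89 of a model's responses as both eligible and fully grounded, the aggregated leaderboard score would be 100 × (0.88 + 0.86 + 0.89) / 3 ≈ 87.7.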

Publication

This benchmark was published in 2025.

Technical Paper