FACTS Grounding
The FACTS Grounding Leaderboard evaluates LLMs' ability to generate factually accurate long-form responses grounded in provided context documents of up to 32k tokens.
FACTS Grounding Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type |
---|---|---|---|---|---|---|
1 | Gemini 2.5 Pro | Google | 87.8 | | 2025-05-06 | Multimodal |
2 | Gemini 2.5 Flash | Google | 85.8 | | 2025-05-20 | Multimodal |
3 | Gemini 2.0 Flash | Google | 85.6 | | 2025-02-25 | Multimodal |
4 | Gemini 2.5 Flash-Lite | Google | 83.8 | | 2025-06-17 | Multimodal |
5 | Claude 3.5 Sonnet | Anthropic | 83.3 | | 2024-06-20 | Multimodal |
6 | Gemini 2.0 Flash-Lite | Google | 83.2 | | 2025-02-25 | Multimodal |
7 | Gemini 2.0 Pro | Google | 82.8 | | 2025-02-05 | Multimodal |
8 | Claude Sonnet 4 | Anthropic | 79.1 | | 2025-05-22 | Multimodal |
9 | Claude 3.7 Sonnet | Anthropic | 78.8 | | 2025-02-24 | Multimodal |
10 | o1 | OpenAI | 78.8 | | 2024-09-12 | Multimodal |
About FACTS Grounding
Description
The FACTS Grounding benchmark measures language models' ability to generate text that is factually accurate with respect to the context given in the user prompt. Each prompt includes a user request and a full document of up to 32k tokens, requiring a long-form response that is fully grounded in the provided context document while fulfilling the user request.

The evaluation uses automated judge models in two phases: (1) a response is disqualified if it does not fulfill the user request; (2) an eligible response is judged accurate only if it is fully grounded in the provided document. The judge models were comprehensively evaluated against a held-out test set, and the final factuality score aggregates multiple judge models (Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet) to mitigate evaluation bias; a sketch of this flow appears after the feature list below.

The benchmark contains 860 public examples (the 'Open' split) and 859 private examples (the 'Blind' split) of natural, complex LLM prompts written by humans. The prompts span enterprise domains including finance, technology, retail, medical, and legal, with tasks such as fact finding, summarization, effect analysis, and concept comparison. Context documents have a mean length of 2.5k tokens and a maximum of 32k tokens.

Key features:
• Long-form grounded response generation with context up to 32k tokens
• Multi-judge evaluation system to reduce bias
• Eligibility filter to prevent metric gaming
• Diverse enterprise domains and task types
• Public and private splits for leaderboard integrity
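The two-phase evaluation can be pictured as a small scoring loop. The sketch below is a hypothetical illustration only: the `Example` dataclass, the judge callables, the decision to count disqualified responses as inaccurate, and aggregation by a simple mean across judges are all assumptions, not the benchmark's released implementation.

```python
# Hypothetical sketch of the two-phase FACTS Grounding scoring flow.
# The judge callables below stand in for LLM judges; names and the
# aggregation rule are assumptions, not the benchmark's actual code.
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class Example:
    user_request: str
    context_document: str
    response: str

# A judge inspects one example and returns a boolean verdict.
Judge = Callable[[Example], bool]

def score_model(examples: list[Example],
                eligibility_judges: dict[str, Judge],
                grounding_judges: dict[str, Judge]) -> float:
    """Return a 0-100 factuality score aggregated across judge models."""
    per_judge_scores = []
    for name, grounding_judge in grounding_judges.items():
        accurate = 0
        for ex in examples:
            # Phase 1: disqualify responses that do not fulfill the user request.
            if not eligibility_judges[name](ex):
                continue  # assumed: a disqualified response scores 0 on this example
            # Phase 2: an eligible response counts as accurate only if it is
            # fully grounded in the provided context document.
            if grounding_judge(ex):
                accurate += 1
        per_judge_scores.append(100.0 * accurate / len(examples))
    # Final score: aggregate across judges (e.g. Gemini 1.5 Pro, GPT-4o,
    # Claude 3.5 Sonnet) to mitigate single-judge bias.
    return mean(per_judge_scores)
```

In practice each judge is itself an LLM prompted with the request, the context document, and the candidate response; the stub callables above stand in for those judge calls.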
Methodology
FACTS Grounding evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
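As a toy illustration of the 0 to 100 scale, the snippet below runs the hypothetical `score_model` sketch from the description above with dummy judges; the data and lambda judges are placeholders, not real judge models.

```python
# Toy usage of the hypothetical score_model sketch; dummy judges only.
toy_examples = [
    Example("Summarize the policy.", "Policy text ...", "Grounded summary of the policy."),
    Example("List the key risks.", "Report text ...", "Answer citing facts not in the report."),
]
eligibility = {"judge-a": lambda ex: True}                     # every response fulfills the request
grounding = {"judge-a": lambda ex: "Grounded" in ex.response}  # only the first response is grounded

print(score_model(toy_examples, eligibility, grounding))  # 50.0: one of two responses is accurate
```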
Publication
This benchmark was published in 2025.
Technical Paper