Mixtral 8×7B
A sparse Mixture-of-Experts model with 8 experts per layer (2 routed to each token), totalling 46.7B parameters of which only 12.9B are active per token. It outperforms or matches dense models such as Llama 2 70B and GPT-3.5 on many benchmarks at a significantly lower per-token compute cost.
Released: 2023-12-01
Parameters: 46.7B (8×7B)
Architecture: Mixture-of-Experts Transformer
License: Apache 2.0
Specifications
- Parameters: 46.7B (8×7B)
- Architecture: Mixture-of-Experts Transformer
- License: Apache 2.0
- Context Window: 32,768 tokens
- Type: text
- Modalities: text
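The sparse routing described in the summary means only 2 of the 8 expert feed-forward blocks run for each token. Below is a minimal sketch of such a top-2 Mixture-of-Experts layer; names and dimensions are illustrative, not Mixtral's actual implementation.

```python
# Illustrative top-2 Mixture-of-Experts routing (toy dimensions, not Mixtral's real code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router scores each token against every expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an independent feed-forward block (a plain MLP here for brevity).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        logits = self.router(x)                             # (tokens, n_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)   # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)                # normalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


# Only 2 of the 8 experts execute per token, which is why roughly 12.9B of the
# 46.7B total parameters are active for any given token.
tokens = torch.randn(5, 64)
print(Top2MoELayer()(tokens).shape)  # torch.Size([5, 64])
```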
Advanced Specifications
- Model Family: Mixtral
- API Access: Not Available
- Chat Interface: Not Available
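Although no hosted API or chat interface is listed here, the Apache 2.0 license means the weights can be run locally. A minimal sketch using the Hugging Face transformers library, assuming the public mistralai/Mixtral-8x7B-v0.1 checkpoint and enough GPU memory for the full model:

```python
# Minimal local-inference sketch (assumes the public "mistralai/Mixtral-8x7B-v0.1"
# checkpoint on Hugging Face and sufficient GPU memory).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision still needs on the order of 90 GB of VRAM
    device_map="auto",           # shard across available GPUs
)

inputs = tokenizer("Sparse Mixture-of-Experts models work by", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```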