BIG-bench

diverse

Pending Human Review

Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark of 204 diverse tasks.

Published: 2022

Score Range: 0-100

Top Score: 93.1

BIG-bench Leaderboard

Rank	Model	Provider	Score	Released	Type
1	Claude 3.5 Sonnet	Anthropic	93.1	2024-06-20	Multimodal
2	Claude 3 Opus	Anthropic	86.8	2024-03-04	Multimodal
3	Claude 3 Sonnet	Anthropic	82.9	2024-03-04	Multimodal
4	Claude 3 Haiku	Anthropic	73.7	2024-03-04	Multimodal
5	Gemini Diffusion	Google	15	2025-05-20	Text

About BIG-bench

Methodology

BIG-bench scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For details on the scoring system and methodology, refer to the original paper.
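As a minimal sketch of how the leaderboard ordering follows from the rule that higher scores are better, the snippet below re-ranks the rows from the table above. The names `LEADERBOARD` and `rank_by_score` are hypothetical and chosen for illustration. The range check assumes only the 0 to 100 scale stated on this page. This is not the benchmark's own scoring code.

```python
# Rows copied from the leaderboard table above: (model, provider, score).
# They are deliberately listed out of order to show the sorting step.
LEADERBOARD = [
    ("Claude 3 Opus", "Anthropic", 86.8),
    ("Gemini Diffusion", "Google", 15.0),
    ("Claude 3.5 Sonnet", "Anthropic", 93.1),
    ("Claude 3 Haiku", "Anthropic", 73.7),
    ("Claude 3 Sonnet", "Anthropic", 82.9),
]


def rank_by_score(rows, low=0.0, high=100.0):
    """Return (rank, model, score) tuples, best score first.

    Raises ValueError if a score falls outside the reported scale.
    """
    for model, _provider, score in rows:
        if not low <= score <= high:
            raise ValueError(f"{model}: score {score} is outside {low}-{high}")
    # Higher is better, so sort in descending order of score.
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    return [
        (rank, model, score)
        for rank, (model, _provider, score) in enumerate(ordered, start=1)
    ]


for rank, model, score in rank_by_score(LEADERBOARD):
    print(f"{rank}\t{model}\t{score}")
```

Sorting on the score alone is enough here because no two rows in the table share a score. A tie-breaking rule would be an additional assumption.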

Publication

This benchmark was published in 2022.

Technical Paper

Related Benchmarks

Humanity's Last Exam

diverse

A challenging benchmark of novel problems designed to test the limits of AI capabilities.

Published: 2023

Scale: 0-100


Scale MultiChallenge

diverse

A multi-domain challenge set created by Scale AI to test models across diverse tasks.

Published: 2024

Scale: 0-100
