Vending-Bench

Tags: agent, Verified

Vending-Bench tests long-term coherence in agents by simulating a vending machine business. Agents manage ordering, inventory, and pricing over long context horizons, with the goal of making money.
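To make the task concrete, here is a minimal toy sketch of the kind of loop such a simulation might involve: ordering stock, selling at a chosen price, and tracking the balance over many days. It is not the benchmark's actual environment. The class `VendingBusiness`, the `simple_policy` stand-in for the agent, the starting balance, the costs, the fee, and the demand model are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
import random


@dataclass
class VendingBusiness:
    """Toy state for a single-product vending business (all values hypothetical)."""
    cash: float = 500.0      # assumed starting balance
    stock: int = 0           # units currently in the machine
    price: float = 2.0       # current sale price per unit
    unit_cost: float = 1.0   # assumed wholesale cost per unit
    daily_fee: float = 2.0   # assumed fixed daily operating fee

    def order(self, units: int) -> None:
        """Buy as many of the requested units as the cash balance allows."""
        affordable = int(self.cash // self.unit_cost)
        units = max(0, min(units, affordable))
        self.cash -= units * self.unit_cost
        self.stock += units

    def step(self, demand: int) -> int:
        """Advance one day: sell up to `demand` units, then pay the daily fee."""
        sold = min(demand, self.stock)
        self.stock -= sold
        self.cash += sold * self.price
        self.cash -= self.daily_fee
        return sold


def simple_policy(biz: VendingBusiness) -> None:
    """Hypothetical stand-in for the agent: restock when inventory runs low."""
    if biz.stock < 10:
        biz.order(30)


def run(days: int, seed: int = 0) -> float:
    """Simulate `days` days and return cash plus inventory valued at cost."""
    rng = random.Random(seed)
    biz = VendingBusiness()
    for _ in range(days):
        simple_policy(biz)
        # Assumed demand model: higher prices attract fewer customers.
        max_demand = max(0, int(20 - 4 * biz.price))
        biz.step(rng.randint(0, max_demand))
    return biz.cash + biz.stock * biz.unit_cost


if __name__ == "__main__":
    print(f"Final value after 30 days: {run(days=30):.2f}")
```

In the benchmark, a language model takes the place of the fixed rule in `simple_policy`. The difficulty the description points to is staying consistent in these decisions over a long horizon, not working out any single day's arithmetic.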

Published: 2025
Score Range: 0-5000
Top Score: 4,694.15

Vending-Bench Leaderboard

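| Rank | Model | Provider | Score | Parameters | Released | Type |
|------|-------|----------|-------|------------|----------|------|
| 1 | Grok 4 | xAI | 4,694.15 | Unknown | 2025-07-09 | Multimodal |
| 2 | Claude 3.5 Sonnet | Anthropic | 2,217.93 | | 2024-06-20 | Multimodal |
| 3 | Claude Opus 4 | Anthropic | 2,077.41 | | 2025-05-22 | Multimodal |
| 4 | o3 | OpenAI | 1,843.11 | | 2025-04-16 | Multimodal |
| 5 | Claude 3.7 Sonnet | Anthropic | 1,567.9 | | 2025-02-24 | Multimodal |
| 6 | Claude Sonnet 4 | Anthropic | 968.31 | | 2025-05-22 | Multimodal |
| 7 | Gemini 2.5 Pro | Google | 789.34 | | 2025-05-06 | Multimodal |
| 8 | Claude 3.5 Haiku | Anthropic | 373.36 | | 2024-10-22 | Multimodal |
| 9 | GPT-4o | OpenAI | 335.46 | | 2024-05-13 | Multimodal |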

About Vending-Bench

Methodology

Vending-Bench evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 5000, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
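As a small illustration of the stated scale, the sketch below checks that each leaderboard score falls within the 0 to 5000 range and orders entries so that higher scores rank first. The scores are copied from the leaderboard on this page. The function name `validate_and_rank` and the constants are hypothetical and not part of any official tooling.

```python
SCORE_MIN, SCORE_MAX = 0.0, 5000.0  # score range stated on this page

# (model, score) pairs copied from the leaderboard on this page
LEADERBOARD = [
    ("Grok 4", 4694.15),
    ("Claude 3.5 Sonnet", 2217.93),
    ("Claude Opus 4", 2077.41),
    ("o3", 1843.11),
    ("Claude 3.7 Sonnet", 1567.9),
    ("Claude Sonnet 4", 968.31),
    ("Gemini 2.5 Pro", 789.34),
    ("Claude 3.5 Haiku", 373.36),
    ("GPT-4o", 335.46),
]


def validate_and_rank(entries):
    """Reject out-of-range scores, then sort so the highest score comes first."""
    for model, score in entries:
        if not SCORE_MIN <= score <= SCORE_MAX:
            raise ValueError(f"{model}: score {score} outside {SCORE_MIN}-{SCORE_MAX}")
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


ranked = validate_and_rank(LEADERBOARD)
top_model, top_score = ranked[0]
```

Here `top_model` and `top_score` hold the highest-scoring entry, which corresponds to the "Top Score" figure listed at the top of the page.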

Publication

This benchmark was published in 2025.