Vending-Bench

Tags: agent, Verified

Vending-Bench tests long-term coherence in agents by simulating a vending machine business. Agents manage ordering, inventory, and pricing over long context horizons, with the goal of making money.

Published: 2025
Score Range: 0-5000
Top Score: 4,694.15

Vending-Bench Leaderboard

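| Rank | Model | Provider | Score | Parameters | Released | Type |
|------|-------|----------|-------|------------|----------|------|
| 1 | Grok 4 | xAI | 4,694.15 | Unknown | 2025-07-09 | Multimodal |
| 2 | Claude 3.5 Sonnet | Anthropic | 2,217.93 | | 2024-06-20 | Multimodal |
| 3 | Claude Opus 4 | Anthropic | 2,077.41 | | 2025-05-22 | Multimodal |
| 4 | o3 | OpenAI | 1,843.11 | | 2025-04-16 | Multimodal |
| 5 | Claude 3.7 Sonnet | Anthropic | 1,567.9 | | 2025-02-24 | Multimodal |
| 6 | Claude Sonnet 4 | Anthropic | 968.31 | | 2025-05-22 | Multimodal |
| 7 | Gemini 2.5 Pro | Google | 789.34 | | 2025-05-06 | Multimodal |
| 8 | Claude 3.5 Haiku | Anthropic | 373.36 | | 2024-10-22 | Multimodal |
| 9 | GPT-4o | OpenAI | 335.46 | | 2024-05-13 | Multimodal |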

About Vending-Bench

Methodology

Vending-Bench evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 5000, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
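
To make the 0 to 5000 scale easier to read, the sketch below expresses each leaderboard score as a fraction of the reported range. This is only a reading aid, not part of the benchmark's scoring methodology: the scores are copied from the leaderboard on this page, and the helper names (`fraction_of_range`, `ranked`, `format_row`) are hypothetical.

```python
# Illustrative only: scores are copied from the leaderboard on this page.
# The normalization below is an assumption for readability, not the
# benchmark's own scoring method.
LEADERBOARD = {
    "Grok 4": 4694.15,
    "Claude 3.5 Sonnet": 2217.93,
    "Claude Opus 4": 2077.41,
    "o3": 1843.11,
    "Claude 3.7 Sonnet": 1567.9,
    "Claude Sonnet 4": 968.31,
    "Gemini 2.5 Pro": 789.34,
    "Claude 3.5 Haiku": 373.36,
    "GPT-4o": 335.46,
}

# Reported score range for the benchmark.
SCORE_MIN, SCORE_MAX = 0.0, 5000.0


def fraction_of_range(score, lo=SCORE_MIN, hi=SCORE_MAX):
    """Return where `score` sits in [lo, hi] as a value between 0 and 1."""
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside the range [{lo}, {hi}]")
    return (score - lo) / (hi - lo)


def ranked(board):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(board.items(), key=lambda item: item[1], reverse=True)


def format_row(rank, model, score):
    """Format one leaderboard row with its share of the score range."""
    return f"{rank}. {model}: {score:,.2f} ({fraction_of_range(score):.1%} of range)"
```

Calling `format_row` on each entry of `ranked(LEADERBOARD)` reproduces the ranking above with a percentage next to each score, which makes the gap between the top entry and the rest easier to see.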

Publication

This benchmark was published in 2025. See the technical paper for details.