Vending-Bench
agent | Verified
Testing long-term coherence in agents by simulating a vending machine business. Agents manage ordering, inventory, and pricing over long context horizons to successfully make money.
Published: 2025
Score Range: 0-5000
Top Score: 4,694.15
Vending-Bench Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type |
---|---|---|---|---|---|---|
1 | Grok 4 | xAI | 4,694.15 | Unknown | 2025-07-09 | Multimodal |
2 | Claude 3.5 Sonnet | Anthropic | 2,217.93 | | 2024-06-20 | Multimodal |
3 | Claude Opus 4 | Anthropic | 2,077.41 | | 2025-05-22 | Multimodal |
4 | o3 | OpenAI | 1,843.11 | | 2025-04-16 | Multimodal |
5 | Claude 3.7 Sonnet | Anthropic | 1,567.9 | | 2025-02-24 | Multimodal |
6 | Claude Sonnet 4 | Anthropic | 968.31 | | 2025-05-22 | Multimodal |
7 | Gemini 2.5 Pro | | 789.34 | | 2025-05-06 | Multimodal |
8 | Claude 3.5 Haiku | Anthropic | 373.36 | | 2024-10-22 | Multimodal |
9 | GPT-4o | OpenAI | 335.46 | | 2024-05-13 | Multimodal |
About Vending-Bench
Methodology
Vending-Bench scores each model on a scale of 0 to 5000, where higher scores indicate better performance. For details of the scoring system and methodology, refer to the original paper.
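To make the task concrete, here is a minimal toy sketch of the kind of loop an agent has to sustain: ordering stock, selling from inventory at a chosen price, and paying recurring costs. It is not the benchmark's actual environment or scoring code. Every name and number in it (`VendingState`, `daily_demand`, `restock_policy`, the starting balance, the fee, the demand curve) is a hypothetical chosen for illustration, and the score it returns (cash plus unsold stock valued at cost) is an assumption, not the paper's definition.

```python
from dataclasses import dataclass


@dataclass
class VendingState:
    cash: float = 500.0       # hypothetical starting balance
    stock: int = 0            # units currently in the machine
    unit_cost: float = 1.0    # hypothetical wholesale cost per unit
    price: float = 2.0        # retail price chosen by the agent
    daily_fee: float = 2.0    # hypothetical fixed operating fee per day


def daily_demand(price: float) -> int:
    """Hypothetical linear demand curve: fewer sales at higher prices."""
    return max(0, int(20 - 5 * price))


def restock_policy(state: VendingState) -> int:
    """Stand-in for the agent: reorder when stock runs low and cash allows."""
    if state.stock >= 10:
        return 0
    affordable = int(state.cash // state.unit_cost)
    return min(30, affordable)


def step(state: VendingState) -> None:
    """Advance the toy simulation by one day: order, sell, pay the fee."""
    order = restock_policy(state)
    state.cash -= order * state.unit_cost
    state.stock += order

    sold = min(state.stock, daily_demand(state.price))
    state.stock -= sold
    state.cash += sold * state.price
    state.cash -= state.daily_fee


def run(days: int) -> float:
    """Run the toy business and return an illustrative net-worth score."""
    state = VendingState()
    for _ in range(days):
        step(state)
    # Illustrative score (an assumption): cash plus unsold stock at cost.
    return state.cash + state.stock * state.unit_cost
```

In this sketch a fixed rule stands in for the agent. In a long-horizon evaluation the interesting question is whether a model keeps making sensible ordering and pricing decisions after many steps, which a hard-coded policy like `restock_policy` does trivially and a language model may not.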
Publication
This benchmark was published in 2025 (see the technical paper).