Vending-Bench
agent · Verified
Tests long-term coherence in agents by simulating a vending machine business. Agents manage ordering, inventory, and pricing over long context horizons, with the goal of making money.
Published: 2025
Score Range: 0-5000
Top Score: 4,694.15
Vending-Bench Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type |
---|---|---|---|---|---|---|
1 | Grok 4 | xAI | 4,694.15 | Unknown | 2025-07-09 | Multimodal |
2 | Claude 3.5 Sonnet | Anthropic | 2,217.93 | | 2024-06-20 | Multimodal |
3 | Claude Opus 4 | Anthropic | 2,077.41 | | 2025-05-22 | Multimodal |
4 | o3 | OpenAI | 1,843.11 | | 2025-04-16 | Multimodal |
5 | Claude 3.7 Sonnet | Anthropic | 1,567.9 | | 2025-02-24 | Multimodal |
6 | Claude Sonnet 4 | Anthropic | 968.31 | | 2025-05-22 | Multimodal |
7 | Gemini 2.5 Pro | Google | 789.34 | | 2025-05-06 | Multimodal |
8 | Claude 3.5 Haiku | Anthropic | 373.36 | | 2024-10-22 | Multimodal |
9 | GPT-4o | OpenAI | 335.46 | | 2024-05-13 | Multimodal |
About Vending-Bench
Methodology
Vending-Bench scores each model on how well it runs the simulated vending machine business. Scores are reported on a scale of 0 to 5000, and higher scores indicate better performance. For details of the scoring system and methodology, refer to the original paper.
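To make the 0 to 5000 scale easier to read, the sketch below expresses each leaderboard score as a fraction of the stated range and as a fraction of the top score. This is only an illustrative way of comparing the numbers on this page, not part of the benchmark's scoring method. The scores are copied from the leaderboard table, and the helper names `fraction_of_range` and `relative_to_top` are hypothetical.

```python
# Illustrative only: these helpers are NOT part of Vending-Bench itself.
# The range below is the one stated on this page (0 to 5000).
SCORE_MIN, SCORE_MAX = 0.0, 5000.0

# Scores copied from the leaderboard table above.
LEADERBOARD = {
    "Grok 4": 4694.15,
    "Claude 3.5 Sonnet": 2217.93,
    "Claude Opus 4": 2077.41,
    "o3": 1843.11,
    "Claude 3.7 Sonnet": 1567.90,
    "Claude Sonnet 4": 968.31,
    "Gemini 2.5 Pro": 789.34,
    "Claude 3.5 Haiku": 373.36,
    "GPT-4o": 335.46,
}


def fraction_of_range(score: float) -> float:
    """Hypothetical helper: where a score sits within the stated range (0.0 to 1.0)."""
    if not SCORE_MIN <= score <= SCORE_MAX:
        raise ValueError(f"score {score} is outside {SCORE_MIN}-{SCORE_MAX}")
    return (score - SCORE_MIN) / (SCORE_MAX - SCORE_MIN)


def relative_to_top(board: dict[str, float]) -> dict[str, float]:
    """Hypothetical helper: each score divided by the highest score on the board."""
    top = max(board.values())
    return {model: score / top for model, score in board.items()}


if __name__ == "__main__":
    vs_top = relative_to_top(LEADERBOARD)
    # Print models from highest to lowest score.
    for model, score in sorted(LEADERBOARD.items(), key=lambda kv: kv[1], reverse=True):
        print(
            f"{model:<20} {score:>9,.2f}  "
            f"{fraction_of_range(score):6.1%} of range  "
            f"{vs_top[model]:6.1%} of top"
        )
```

Read this way, the top entry sits at roughly 94% of the stated range, while the second-ranked entry is at less than half of the top score.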
Publication
This benchmark was published in 2025.