Vending-Bench tests long-term coherence in agents by simulating a vending machine business. Agents manage ordering, inventory, and pricing over long context horizons, with the goal of making money.

Vending-Bench
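
To make the ordering, inventory, and pricing loop concrete, here is a minimal toy sketch of one simulated day. It is not the benchmark's actual simulator: the `VendingState` fields, the `step_day` function, and every parameter (demand, unit cost, daily fee) are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class VendingState:
    # Hypothetical state for a single-product machine (illustration only).
    cash: float
    stock: int
    price: float


def step_day(state: VendingState, demand: int, order_qty: int,
             unit_cost: float, daily_fee: float) -> VendingState:
    """Advance the toy simulation by one day and return the new state."""
    # Ordering: buy as many units as requested, capped by available cash.
    affordable = int(state.cash // unit_cost)
    bought = min(order_qty, affordable)
    cash = state.cash - bought * unit_cost
    stock = state.stock + bought

    # Sales: customers can only buy what is actually in stock.
    sold = min(demand, stock)
    cash += sold * state.price

    # A fixed daily fee is charged regardless of sales (assumed rule).
    cash -= daily_fee
    return VendingState(cash=cash, stock=stock - sold, price=state.price)


start = VendingState(cash=100.0, stock=0, price=2.0)
after = step_day(start, demand=10, order_qty=20, unit_cost=1.0, daily_fee=2.0)
```

An agent in such a loop has to keep choosing `order_qty` and `price` sensibly over many steps; a single bad day is cheap, but forgetting to restock or misreading cash over hundreds of steps compounds.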

xAI's latest-generation model with enhanced mathematical reasoning capabilities, showing significant improvements on competition-level mathematics benchmarks. Features 2x faster end-to-end latency, supports 5 different voices, and achieves 10x the daily user seconds of previous models.

Grok 4

Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations, with the speed and cost of a mid-tier model. It operates at twice the speed of Claude 3 Opus.

Claude 3.5 Sonnet

Claude Opus 4 is Anthropic's most powerful model and the best coding model in the world, leading on SWE-bench (72.5%) and Terminal-bench (43.2%). It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, with the ability to work continuously for several hours. Features include extended thinking with tool use, parallel tool execution, improved memory capabilities, and significantly reduced shortcut behaviors.

Claude Opus 4

o3 is OpenAI's most powerful reasoning model, pushing the frontier across coding, math, science, visual perception, and more. Sets new state-of-the-art results on benchmarks including Codeforces, SWE-bench, and MMMU. Features full tool access and can agentically use and combine every tool within ChatGPT.

o3

Anthropic's most intelligent model to date and the first hybrid reasoning model on the market. Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user. API users have fine-grained control over how long the model can think for (up to 128K tokens). Priced at $3 per million input tokens and $15 per million output tokens at launch, including thinking tokens.

Claude 3.7 Sonnet

Claude Sonnet 4 significantly improves on Sonnet 3.7's capabilities, excelling in coding with a state-of-the-art 72.7% on SWE-bench. The model balances performance and efficiency, with enhanced steerability for greater control over implementations. Features include extended thinking with tool use, parallel tool execution, improved instruction following, and significantly reduced shortcut behaviors.

Claude Sonnet 4

Gemini 2.5 Pro is capable of reasoning through its thoughts before responding, resulting in enhanced performance and improved accuracy. Features Deep Think, an enhanced reasoning mode, and native audio outputs that capture subtle nuances of speech.

Gemini 2.5 Pro

Anthropic's fastest model at time of launch, delivering intelligence at blazing speeds. Claude 3.5 Haiku is optimized for quick responses while maintaining high quality.

Claude 3.5 Haiku

GPT-4o ('o' for 'omni') is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.

GPT-4o

Rank	Model	Provider	Score	Parameters	Released	Type
1	Grok 4	xAI	4,694.15	Unknown	2025-07-09	Multimodal
2	Claude 3.5 Sonnet	Anthropic	2,217.93		2024-06-20	Multimodal
3	Claude Opus 4	Anthropic	2,077.41		2025-05-22	Multimodal
4	o3	OpenAI	1,843.11		2025-04-16	Multimodal
5	Claude 3.7 Sonnet	Anthropic	1,567.90		2025-02-24	Multimodal
6	Claude Sonnet 4	Anthropic	968.31		2025-05-22	Multimodal
7	Gemini 2.5 Pro	Google	789.34		2025-05-06	Multimodal
8	Claude 3.5 Haiku	Anthropic	373.36		2024-10-22	Multimodal
9	GPT-4o	OpenAI	335.46		2024-05-13	Multimodal
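
The gaps between ranks are easier to read as fractions of the top score. The sketch below, using only the names and scores from the table above, strips the thousands separators and normalizes each score against the leader. The helper names `parse_score` and `relative_to_top` are hypothetical, and the table does not state the unit of the Score column, so the code treats it as a plain number.

```python
# Model names and scores copied from the table above, kept as the
# original strings (including thousands separators).
LEADERBOARD = [
    ("Grok 4", "4,694.15"),
    ("Claude 3.5 Sonnet", "2,217.93"),
    ("Claude Opus 4", "2,077.41"),
    ("o3", "1,843.11"),
    ("Claude 3.7 Sonnet", "1,567.9"),
    ("Claude Sonnet 4", "968.31"),
    ("Gemini 2.5 Pro", "789.34"),
    ("Claude 3.5 Haiku", "373.36"),
    ("GPT-4o", "335.46"),
]


def parse_score(text: str) -> float:
    """Convert a score such as '4,694.15' to a float."""
    return float(text.replace(",", ""))


def relative_to_top(rows):
    """Map each model name to its score as a fraction of the best score."""
    scores = [(name, parse_score(raw)) for name, raw in rows]
    top = max(score for _, score in scores)
    return {name: score / top for name, score in scores}


shares = relative_to_top(LEADERBOARD)
for name, share in shares.items():
    print(f"{name:<20} {share:6.1%}")
```

Read this way, the second-ranked model reaches a bit under half of the leader's score, and the bottom two entries sit below a tenth of it.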
