TAU-bench

Category: tool-use

TAU-bench (the Tool-Agent-User benchmark, also written τ-bench) evaluates models on their ability to use tools to complete user-facing tasks.

Published: 2024
Score Range: 0-100
Top Score: 90.7
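
As a rough illustration of what "tool use" means in this setting, the sketch below shows a single hypothetical evaluation turn in which a model is shown a tool definition and must emit a structured call. The tool name, argument schema, and checker function are illustrative assumptions, not part of TAU-bench's actual harness.

```python
# Hypothetical illustration of one tool-use turn; the tool schema,
# arguments, and checker below are assumptions, not TAU-bench's real API.
import json

# A tool definition the model would be shown (illustrative).
GET_ORDER_STATUS = {
    "name": "get_order_status",
    "description": "Look up the status of a customer order.",
    "parameters": {"order_id": "string"},
}

def check_tool_call(model_output: str) -> bool:
    """Return True if the model emitted a well-formed call to the expected tool."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return (
        call.get("name") == GET_ORDER_STATUS["name"]
        and "order_id" in call.get("arguments", {})
    )

# Example model output for the user request "Where is order #1234?"
print(check_tool_call('{"name": "get_order_status", "arguments": {"order_id": "1234"}}'))  # True
```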

TAU-bench Leaderboard

| Rank | Model             | Provider    | Score | Parameters                        | Released   | Type       |
|------|-------------------|-------------|-------|-----------------------------------|------------|------------|
| 1    | Gemini 3 Pro      | Google      | 90.7  | Proprietary                       | 2025-11-18 | Multimodal |
| 2    | Claude 3.7 Sonnet | Anthropic   | 81.2  |                                   | 2025-02-24 | Multimodal |
| 3    | Kimi K2           | Moonshot AI | 74.3  | 1T total, 32B activated           | 2025-07-11 | Text       |
| 4    | o3                | OpenAI      | 73.9  |                                   | 2025-04-16 | Multimodal |
| 5    | o4-mini           | OpenAI      | 71.8  |                                   | 2025-04-16 | Multimodal |
| 6    | GPT-OSS-120B      | OpenAI      | 67.8  | 117B total, 5.1B active per token | 2025-08-05 | Text       |
| 7    | GPT-OSS-20B       | OpenAI      | 54.8  | 21B total, 3.6B active per token  | 2025-08-05 | Text       |
| 8    | Claude 3.5 Haiku  | Anthropic   | 51    |                                   | 2024-10-22 | Multimodal |
| 9    | Nemotron 3 Nano   | NVIDIA      | 49    | 31.6B total, ~3.2B active         | 2025-12-15 | Text       |

About TAU-bench

Methodology

TAU-bench scores are reported on a 0-100 scale, where higher is better; a model's score reflects the share of benchmark tasks it completes successfully. For the full task definitions and scoring rules, refer to the original paper.
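
As a minimal sketch of how such a 0-100 score could be aggregated, assuming each task yields a binary pass/fail outcome (the exact per-task scoring rules are defined in the paper, not here):

```python
# Minimal sketch: turning per-task pass/fail outcomes into a 0-100 score.
# The binary-success assumption and the example data are illustrative only.

def benchmark_score(task_outcomes: list[bool]) -> float:
    """Percentage of tasks completed successfully, on a 0-100 scale."""
    if not task_outcomes:
        raise ValueError("no task outcomes to score")
    return 100.0 * sum(task_outcomes) / len(task_outcomes)

# Example: 3 of 4 tasks solved -> 75.0
print(benchmark_score([True, True, False, True]))
```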

Publication

This benchmark was published in 2024; see the accompanying technical paper for full details.