TAU-bench

Category: tool-use

TAU-bench (the Tool-Agent-User benchmark, also written τ-bench) evaluates models on their ability to use tools to complete user-facing tasks.

Published: 2024
Score Range: 0-100
Top Score: 90.7
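
As a rough illustration of what "tool use" means in this setting, the sketch below shows a single hypothetical evaluation turn in which a model is shown a tool definition and must emit a structured call. The tool name, argument schema, and checker function are illustrative assumptions, not part of TAU-bench's actual harness.

```python
# Hypothetical illustration of one tool-use turn; the tool schema,
# arguments, and checker below are assumptions, not TAU-bench's real API.
import json

# A tool definition the model would be shown (illustrative).
GET_ORDER_STATUS = {
    "name": "get_order_status",
    "description": "Look up the status of a customer order.",
    "parameters": {"order_id": "string"},
}

def check_tool_call(model_output: str) -> bool:
    """Return True if the model emitted a well-formed call to the expected tool."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return (
        call.get("name") == GET_ORDER_STATUS["name"]
        and "order_id" in call.get("arguments", {})
    )

# Example model output for the user request "Where is order #1234?"
print(check_tool_call('{"name": "get_order_status", "arguments": {"order_id": "1234"}}'))  # True
```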

TAU-bench Leaderboard

| Rank | Model             | Provider    | Score | Parameters                        | Released   | Type       |
|------|-------------------|-------------|-------|-----------------------------------|------------|------------|
| 1    | Gemini 3 Pro      | Google      | 90.7  | Proprietary                       | 2025-11-18 | Multimodal |
| 2    | Claude 3.7 Sonnet | Anthropic   | 81.2  |                                   | 2025-02-24 | Multimodal |
| 3    | Kimi K2           | Moonshot AI | 74.3  | 1T total, 32B activated           | 2025-07-11 | Text       |
| 4    | o3                | OpenAI      | 73.9  |                                   | 2025-04-16 | Multimodal |
| 5    | o4-mini           | OpenAI      | 71.8  |                                   | 2025-04-16 | Multimodal |
| 6    | GPT-OSS-120B      | OpenAI      | 67.8  | 117B total, 5.1B active per token | 2025-08-05 | Text       |
| 7    | GPT-OSS-20B       | OpenAI      | 54.8  | 21B total, 3.6B active per token  | 2025-08-05 | Text       |
| 8    | Claude 3.5 Haiku  | Anthropic   | 51    |                                   | 2024-10-22 | Multimodal |
| 9    | Nemotron 3 Nano   | NVIDIA      | 49    | 31.6B total, ~3.2B active         | 2025-12-15 | Text       |

About TAU-bench

Methodology

TAU-bench scores are reported on a 0-100 scale, where higher is better; a model's score reflects the share of benchmark tasks it completes successfully. For the full task definitions and scoring rules, refer to the original paper.
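
As a minimal sketch of how such a 0-100 score could be aggregated, assuming each task yields a binary pass/fail outcome (the exact per-task scoring rules are defined in the paper, not here):

```python
# Minimal sketch: turning per-task pass/fail outcomes into a 0-100 score.
# The binary-success assumption and the example data are illustrative only.

def benchmark_score(task_outcomes: list[bool]) -> float:
    """Percentage of tasks completed successfully, on a 0-100 scale."""
    if not task_outcomes:
        raise ValueError("no task outcomes to score")
    return 100.0 * sum(task_outcomes) / len(task_outcomes)

# Example: 3 of 4 tasks solved -> 75.0
print(benchmark_score([True, True, False, True]))
```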

Publication

This benchmark was published in 2024; see the accompanying technical paper for full details.