TAU-bench
Category: tool use · Status: Pending Verification
τ-bench (Tool-Agent-User benchmark, commonly written TAU-bench) evaluates models on their ability to act as agents: conversing with a user while calling domain tools to complete real-world tasks.
Published: 2024
Score Range: 0-100
Top Score: 81.2
TAU-bench Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | Claude 3.7 Sonnet | Anthropic | 81.2 | — | 2025-02-24 | Multimodal |
| 2 | o3 | OpenAI | 73.9 | — | 2025-04-16 | Multimodal |
| 3 | o4-mini | OpenAI | 71.8 | — | 2025-04-16 | Multimodal |
| 4 | GPT-OSS-120B | OpenAI | 67.8 | 117B total (5.1B active per token) | 2025-08-05 | Text |
| 5 | GPT-OSS-20B | OpenAI | 54.8 | 21B total (3.6B active per token) | 2025-08-05 | Text |
| 6 | Claude 3.5 Haiku | Anthropic | 51.0 | — | 2024-10-22 | Multimodal |
About TAU-bench
Methodology
TAU-bench places an agent in simulated customer-service scenarios where it must converse with an LLM-simulated user and call domain APIs; a task counts as solved when the final database state matches the annotated goal state. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance; in the τ-bench paper the headline metric is the pass^1 success rate (the percentage of tasks solved in a single attempt), with pass^k variants measuring how reliably a model solves the same task across k repeated trials. For detailed information about the scoring system and methodology, refer to the original paper. A minimal sketch of the pass^k estimator follows.
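The sketch below illustrates how a pass^k score on a 0-100 scale can be estimated from repeated trials, following the unbiased estimator described in the τ-bench paper: for a task with c successes out of n trials, pass^k is estimated as C(c, k) / C(n, k), then averaged over tasks. The function names and the example data here are hypothetical, not part of any official harness.

```python
from math import comb

def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate pass^k for one task: the probability that k
    independently sampled attempts ALL succeed, given that
    num_successes of num_trials attempts succeeded."""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of trials")
    # Fraction of size-k subsets of the trials that contain only successes.
    # math.comb(c, k) is 0 when k > c, so tasks with few successes score 0.
    return comb(num_successes, k) / comb(num_trials, k)

def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass^k over tasks, scaled to the 0-100 range used above.

    results: one (num_trials, num_successes) pair per task.
    """
    per_task = [pass_hat_k(n, c, k) for n, c in results]
    return 100.0 * sum(per_task) / len(per_task)

# Hypothetical example: three tasks, 4 trials each, with 4, 2, and 0 successes.
print(benchmark_score([(4, 4), (4, 2), (4, 0)], k=1))  # 50.0
```

Note that for k = 1 the estimator reduces to the plain success rate c / n, so pass^1 is simply the percentage of tasks solved on a single attempt.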
Publication
This benchmark was published in 2024. Technical paper: "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains" (Yao et al., 2024, arXiv:2406.12045).