TAU-bench

Category: tool-use · Status: Pending Verification

TAU-bench (τ-bench, the Tool-Agent-User benchmark) evaluates models on their ability to use tools in multi-turn interactions with simulated users.
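To make the task concrete, here is a minimal sketch of the kind of tool-use episode such a benchmark exercises: the model proposes tool calls, a harness executes them against a toy environment, and the episode passes if the final state matches the expected one. All names here (lookup_order, run_episode, the toy database) are illustrative assumptions, not the benchmark's actual API.

```python
# Illustrative sketch of a tool-use evaluation episode.
# None of these names come from TAU-bench itself.

def lookup_order(order_id: str) -> dict:
    """Toy tool: pretend database lookup."""
    db = {"A1": {"status": "shipped"}}
    return db.get(order_id, {"status": "unknown"})

TOOLS = {"lookup_order": lookup_order}

def run_episode(tool_calls, expected) -> bool:
    """Execute the model's proposed tool calls; pass if the final result matches."""
    result = None
    for name, args in tool_calls:
        result = TOOLS[name](**args)
    return result == expected

# A model that issues the correct call passes the episode.
ok = run_episode([("lookup_order", {"order_id": "A1"})],
                 {"status": "shipped"})
print(ok)  # True
```

The real benchmark adds simulated user turns and domain policies on top of this loop, but the pass/fail check per episode has the same shape.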

Published: 2024
Score Range: 0-100
Top Score: 81.2

TAU-bench Leaderboard

Rank  Model              Provider   Score  Parameters                          Released    Type
1     Claude 3.7 Sonnet  Anthropic  81.2   —                                   2025-02-24  Multimodal
2     o3                 OpenAI     73.9   —                                   2025-04-16  Multimodal
3     o4-mini            OpenAI     71.8   —                                   2025-04-16  Multimodal
4     GPT-OSS-120B       OpenAI     67.8   117B total (5.1B active per token)  2025-08-05  Text
5     GPT-OSS-20B        OpenAI     54.8   21B total (3.6B active per token)   2025-08-05  Text
6     Claude 3.5 Haiku   Anthropic  51.0   —                                   2024-10-22  Multimodal

About TAU-bench

Methodology

TAU-bench evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
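Assuming the reported score is simply the percentage of evaluated tasks the model completes successfully (an assumption; the paper's exact aggregation may differ), the mapping from per-task pass/fail results to the 0-100 scale can be sketched as:

```python
# Hedged sketch: score = percentage of passed tasks on a 0-100 scale.
# This is an assumed aggregation, not the benchmark's verified formula.

def benchmark_score(results: list[bool]) -> float:
    """Fraction of passed tasks, scaled to the 0-100 range."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

# e.g. 406 passes out of 500 hypothetical tasks -> 81.2
print(benchmark_score([True] * 406 + [False] * 94))  # 81.2
```

Under this reading, a top score of 81.2 means roughly four in five tasks were completed correctly.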

Publication

This benchmark was published in 2024. For full details, see the technical paper.