Terminal-bench
system · Pending Human Review
Evaluates models on their ability to use terminal commands to solve system administration tasks.
Published: 2024
Score Range: 0-100
Top Score: 46.6
Terminal-bench Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | Google | 46.6 | Proprietary | 2025-11-18 | Multimodal |
| 2 | Claude Opus 4 | Anthropic | 43.2 | | 2025-05-22 | Multimodal |
| 3 | Kimi K2 | Moonshot AI | 33.15 | 1T total, 32B activated | 2025-07-11 | Text |
| 4 | Nemotron 3 Nano | NVIDIA | 8.5 | 31.6B (Total), ~3.2B (Active) | 2025-12-15 | Text |
About Terminal-bench
Methodology
Terminal-bench scores each model on a scale of 0 to 100, where a higher score indicates better performance on terminal-based system administration tasks. For details of the scoring system and methodology, refer to the original paper.
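Because scores share one 0 to 100 scale and higher is better, the leaderboard order is just a descending sort on score. The snippet below is a minimal sketch of that ranking step, using the four scores from the table above. The `rank` helper and its range check are illustrative assumptions, not part of any official Terminal-bench tooling.

```python
# Leaderboard rows copied from the table above: (model, score).
ROWS = [
    ("Kimi K2", 33.15),
    ("Gemini 3 Pro", 46.6),
    ("Nemotron 3 Nano", 8.5),
    ("Claude Opus 4", 43.2),
]


def rank(rows, lo=0.0, hi=100.0):
    """Return (rank, model, score) tuples, best score first.

    Hypothetical helper: rejects scores outside the documented 0-100 range.
    """
    for model, score in rows:
        if not lo <= score <= hi:
            raise ValueError(f"{model}: score {score} outside [{lo}, {hi}]")
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(i, model, score) for i, (model, score) in enumerate(ordered, start=1)]


ranked = rank(ROWS)
for position, model, score in ranked:
    print(f"{position}. {model}: {score}")
```

The first entry of `ranked` corresponds to the "Top Score" field in the page header.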
Publication
This benchmark was published in 2024.