Terminal-bench

system | Pending Human Review

Evaluates models on their ability to use terminal commands to solve system administration tasks.

Published: 2024
Score Range: 0-100
Top Score: 46.6

Terminal-bench Leaderboard

Rank  Model            Provider     Score  Parameters                     Released    Type
1     Gemini 3 Pro     Google       46.6   Proprietary                    2025-11-18  Multimodal
2     Claude Opus 4    Anthropic    43.2   -                              2025-05-22  Multimodal
3     Kimi K2          Moonshot AI  33.15  1T total, 32B activated        2025-07-11  Text
4     Nemotron 3 Nano  NVIDIA       8.5    31.6B (Total), ~3.2B (Active)  2025-12-15  Text

About Terminal-bench

Methodology

Terminal-bench scores each model with a standardized methodology. Scores are reported on a scale of 0 to 100, and higher scores indicate better performance. For details of the scoring system and methodology, refer to the original paper.
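
As a rough illustration of how a 0 to 100 score could arise, the sketch below assumes the score is simply the percentage of tasks a model completes successfully. That aggregation rule is an assumption made here for illustration and is not confirmed by this page. The TaskResult type, the benchmark_score function, and the task names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str  # hypothetical task identifier, not a real benchmark task
    passed: bool  # True if the task's checks succeeded


def benchmark_score(results: list[TaskResult]) -> float:
    """Return the share of passed tasks as a percentage on a 0-100 scale.

    Assumption: every task is weighted equally. The actual benchmark
    may aggregate results differently.
    """
    if not results:
        raise ValueError("no task results to score")
    passed = sum(1 for r in results if r.passed)
    return round(100.0 * passed / len(results), 2)


# Made-up results for four hypothetical system administration tasks.
results = [
    TaskResult("fix-cron-job", True),
    TaskResult("rotate-logs", False),
    TaskResult("configure-nginx", True),
    TaskResult("recover-git-repo", False),
]
print(benchmark_score(results))  # 50.0
```

Under this assumption, a leaderboard score of 46.6 would correspond to a model passing a little under half of the tasks.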

Publication

This benchmark was published in 2024.

Technical Paper