Terminal-bench

system | Pending Verification

Evaluates models on their ability to use terminal commands to solve system administration tasks.

Published: 2024
Score Range: 0-100
Top Score: 43.2

Terminal-bench Leaderboard

Rank  Model          Provider   Score  Parameters  Released    Type
1     Claude Opus 4  Anthropic  43.2   -           2025-05-22  Multimodal

About Terminal-bench

Methodology

Terminal-bench reports scores on a scale of 0 to 100, where higher scores indicate better performance. For details of the scoring system and methodology, refer to the original paper.
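
To make the 0 to 100 scale concrete, here is a minimal sketch of how such a score could be computed. It assumes the score is simply the percentage of tasks the model solved; this page does not confirm that, and the original paper remains the authority on the actual scoring. The function name percent_score and the example results are hypothetical.

```python
def percent_score(task_results):
    """Return the share of passed tasks as a score on a 0-100 scale.

    ASSUMPTION: the score is the plain pass rate over all tasks.
    The benchmark's real scoring rules may differ.
    """
    if not task_results:
        raise ValueError("need at least one task result")
    passed = sum(1 for ok in task_results if ok)
    return round(100.0 * passed / len(task_results), 1)


# Hypothetical run: 3 of 8 tasks solved (illustrative data, not real results).
example_results = [True, False, True, False, False, True, False, False]
example_score = percent_score(example_results)
print(example_score)
```

Under this assumption, a score of 43.2 would mean that a little under half of the tasks were solved.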

Publication

This benchmark was published in 2024.