Berkeley Function-Calling Leaderboard

Category: coding (Verified)

The first comprehensive evaluation of LLMs' function-calling capabilities. It tests several call forms, including parallel and multiple function calls, across diverse programming languages. Models are evaluated on execution accuracy and on their ability to withhold a function call when none is appropriate.
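To make the terms above concrete, here is a minimal sketch of how a single test case with parallel calls, or a case where no call should be made, could be checked and rolled up into a 0-100 score. It is an illustration under stated assumptions, not the benchmark's actual grader: the call format (`name` plus `arguments`), the order-insensitive exact-match rule, and the helper names `canonical`, `case_passes`, and `score` are all hypothetical.

```python
import json
from collections import Counter


def canonical(call):
    # Hypothetical call format: {"name": str, "arguments": dict}.
    # Sorting keys makes argument order irrelevant to the comparison.
    return call["name"], json.dumps(call["arguments"], sort_keys=True)


def case_passes(predicted, expected):
    # Assumed rule: the multiset of predicted calls must equal the expected one.
    # An empty expected list means the model should withhold any call.
    return Counter(map(canonical, predicted)) == Counter(map(canonical, expected))


def score(cases):
    # Percentage of passing cases, on a 0-100 scale.
    passed = sum(case_passes(pred, exp) for pred, exp in cases)
    return 100.0 * passed / len(cases)


def weather(city):
    # Hypothetical tool call used only for this example.
    return {"name": "get_weather", "arguments": {"city": city}}


cases = [
    # Parallel calls: same calls in a different order still pass.
    ([weather("Oakland"), weather("Berkeley")], [weather("Berkeley"), weather("Oakland")]),
    # No call expected, none made: pass.
    ([], []),
    # No call expected, but the model called a function: fail.
    ([weather("Berkeley")], []),
    # Single call with arguments in a different key order: pass.
    ([{"name": "add", "arguments": {"b": 2, "a": 1}}],
     [{"name": "add", "arguments": {"a": 1, "b": 2}}]),
]

print(score(cases))  # 75.0
```

Exact matching is the simplest possible rule; a real grader would likely also need to handle type coercion, optional parameters, and multiple acceptable argument values.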

Published: 2024
Score Range: 0-100
Top Score: 70.8

Berkeley Function-Calling Leaderboard Rankings

Rank  Model   Provider  Score  Parameters         Released    Type
1     Qwen-3  Alibaba   70.8   235B (22B active)  2025-04-29  Text

About Berkeley Function-Calling Leaderboard

Methodology

Berkeley Function-Calling Leaderboard evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.

Publication

This benchmark was published in 2024.
