Berkeley Function-Calling Leaderboard

coding · Verified

The first comprehensive evaluation of LLMs' function-calling capabilities. It tests several forms of function calling, including parallel and multiple function calls, across diverse programming languages. Models are evaluated on execution accuracy and on their ability to withhold a function call when none of the available functions is appropriate.

Published: 2024
Scale: 0-100
Top Score: 70.8

Berkeley Function-Calling Leaderboard Rankings

Rank  Model   Provider  Score  Parameters         Released    Type
1     Qwen-3  Alibaba   70.8   235B (22B active)  2025-04-29  Text

About Berkeley Function-Calling Leaderboard

Methodology

Berkeley Function-Calling Leaderboard evaluates models on a scale of 0 to 100. Higher scores indicate better performance. For detailed information about the methodology, please refer to the original paper.
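To make the 0 to 100 scale concrete, here is a minimal sketch of how a pass rate over function-calling test cases could be computed. It is not the benchmark's actual harness; the original paper describes the real checking procedure. The names `Call`, `make_call`, `case_passes`, and `score`, the `get_weather` function, and the exact-match comparison are all assumptions made for illustration. The sketch covers a single call, parallel calls returned in any order, and the case where the model should withhold a call.

```python
from collections import Counter
from typing import NamedTuple


class Call(NamedTuple):
    """One function call: a name plus its arguments (hypothetical type)."""
    name: str
    args: tuple  # sorted (key, value) pairs, so the call is hashable


def make_call(name: str, **kwargs) -> Call:
    # Sorting the keyword arguments makes argument order irrelevant.
    return Call(name, tuple(sorted(kwargs.items())))


def case_passes(expected: list[Call], predicted: list[Call]) -> bool:
    # Multiset comparison: parallel calls may arrive in any order.
    # An empty `expected` list means the model should make no call at all.
    return Counter(expected) == Counter(predicted)


def score(cases: list[tuple[list[Call], list[Call]]]) -> float:
    """Percentage of passing cases, on a 0-100 scale."""
    if not cases:
        return 0.0
    passed = sum(case_passes(exp, pred) for exp, pred in cases)
    return 100.0 * passed / len(cases)


cases = [
    # Single call, correct.
    ([make_call("get_weather", city="Paris")],
     [make_call("get_weather", city="Paris")]),
    # Parallel calls returned in a different order: still correct.
    ([make_call("get_weather", city="Paris"), make_call("get_weather", city="Rome")],
     [make_call("get_weather", city="Rome"), make_call("get_weather", city="Paris")]),
    # No suitable function, but the model called one anyway: fails.
    ([], [make_call("get_weather", city="Paris")]),
    # No suitable function, and the model correctly made no call.
    ([], []),
]
print(score(cases))  # 75.0
```

Three of the four hypothetical cases pass, giving 75.0. Exact matching on hashable argument values keeps the sketch short; a real evaluator would also need to handle nested arguments and equivalent values.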

Publication

This benchmark was published in 2024.

Related Benchmarks