Berkeley Function-Calling Leaderboard

coding, Verified

The first comprehensive evaluation of LLMs' function-calling capabilities. It tests several call forms, including parallel and multiple function calls, across diverse programming languages. Models are evaluated on execution accuracy and on their ability to withhold a function call when none is appropriate.

Published: 2024

Score Range: 0-100

Top Score: 70.8

Technical Paper

Berkeley Function-Calling Leaderboard Rankings

Rank	Model	Provider	Score	Parameters	Released	Type
1	Qwen-3	Alibaba	70.8	235B (22B active)	2025-04-29	Text
2	Nemotron 3 Nano	NVIDIA	53.8	31.6B (~3.2B active)	2025-12-15	Text

About Berkeley Function-Calling Leaderboard

Methodology

Berkeley Function-Calling Leaderboard evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
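As a rough illustration of what scoring function calls on a 0 to 100 scale can involve, the sketch below compares a model's emitted calls against expected calls. It is not the benchmark's official scorer: the call shape (`name` plus `arguments`), the helper names `calls_match` and `score`, and the order-insensitive matching rule are all assumptions made for this example. An empty expected list stands in for the case where the model should withhold a call.

```python
from typing import Any

# Hypothetical call shape for this sketch: {"name": str, "arguments": dict}
Call = dict[str, Any]


def calls_match(predicted: list[Call], expected: list[Call]) -> bool:
    """Return True if both lists hold the same calls, ignoring order.

    An empty `expected` list models the case where no function
    should be called at all.
    """
    if len(predicted) != len(expected):
        return False
    remaining = list(expected)
    for call in predicted:
        if call in remaining:
            remaining.remove(call)  # consume one matching expected call
        else:
            return False
    return True


def score(results: list[bool]) -> float:
    """Convert per-test pass/fail results to a 0-100 score."""
    return 100.0 * sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    weather = {"name": "get_weather", "arguments": {"city": "Berkeley"}}
    time_ = {"name": "get_time", "arguments": {"tz": "UTC"}}
    results = [
        calls_match([weather, time_], [time_, weather]),  # parallel calls
        calls_match([], []),                              # correctly withheld
        calls_match([weather], []),                       # spurious call
    ]
    print(round(score(results), 1))
```

Exact dictionary equality is the simplest possible matching rule. A real harness would likely need to tolerate equivalent argument values or check the result of actually executing the call.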

Publication

This benchmark was published in 2024.

Technical Paper

Related Benchmarks

Aider Polyglot

coding

A code-editing benchmark based on Exercism coding exercises in six programming languages (C++, Go, Java, JavaScript, Python, and Rust). It focuses on the 225 most challenging of the 697 available exercises, recalibrating the scale so that top coding LLMs score between 5% and 50%. It replaces the original Python-only benchmark, which was saturating at scores above 80%, and differentiates frontier models better through increased difficulty and language diversity.

Published: 2024

Scale: 0-100

aider.chat

CodeForces

coding

Advanced competitive programming benchmark for evaluating large language models on algorithmic problem-solving and code generation.

Published: 2025

Scale: 0-3500

Technical Paper

HumanEval

coding

Evaluates code generation capabilities by asking models to complete Python functions based on docstrings and function signatures.

Published: 2021

Scale: 0-100

Technical Paper