Arena-Hard-v2


Evaluation pipeline for instruction-tuned models using challenging user queries.

Published: 2024
Score Range: 0-100
Top Score: N/A

Arena-Hard-v2 Leaderboard

Rank | Model | Provider | Score | Parameters | Released | Type
No models found with scores for this benchmark.

About Arena-Hard-v2

Methodology

Arena-Hard-v2 evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
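The paper is the authoritative source for the scoring pipeline, but as an illustration, Arena-Hard-style scores are commonly derived from pairwise judge verdicts against a fixed baseline model and reported as a win rate on the 0-100 scale. The sketch below is an assumption-laden illustration (the function name, verdict labels, and tie handling are hypothetical, not the benchmark's actual API):

```python
def score_from_verdicts(verdicts):
    """Convert pairwise judge verdicts into a 0-100 score.

    verdicts: list of "win" / "tie" / "loss" outcomes for the candidate
    model versus a baseline. A common convention (assumed here) counts a
    tie as half a win; higher scores indicate better performance.
    """
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    if not verdicts:
        raise ValueError("no verdicts to score")
    win_rate = sum(points[v] for v in verdicts) / len(verdicts)
    return round(100 * win_rate, 1)  # map win rate onto the 0-100 scale

# Example: 2 wins, 1 tie, 1 loss -> (1 + 1 + 0.5 + 0) / 4 = 0.625
print(score_from_verdicts(["win", "win", "tie", "loss"]))  # 62.5
```

Under this convention, a model that ties the baseline on every query would score 50, which is why leaderboard scores cluster around the midpoint for baseline-comparable models.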

Publication

This benchmark was published in 2024.
