Arena-Hard-v2


Evaluation pipeline for instruction-tuned models using challenging user queries.

Published: 2024
Score Range: 0-100
Top Score: N/A

Arena-Hard-v2 Leaderboard

Rank | Model | Provider | Score | Parameters | Released | Type
No models found with scores for this benchmark.

About Arena-Hard-v2

Methodology

Arena-Hard-v2 evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
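The paper is the authoritative source for the scoring pipeline, but as an illustration, Arena-Hard-style scores are commonly derived from pairwise judge verdicts against a fixed baseline model and reported as a win rate on the 0-100 scale. The sketch below is an assumption-laden illustration (the function name, verdict labels, and tie handling are hypothetical, not the benchmark's actual API):

```python
def score_from_verdicts(verdicts):
    """Convert pairwise judge verdicts into a 0-100 score.

    verdicts: list of "win" / "tie" / "loss" outcomes for the candidate
    model versus a baseline. A common convention (assumed here) counts a
    tie as half a win; higher scores indicate better performance.
    """
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    if not verdicts:
        raise ValueError("no verdicts to score")
    win_rate = sum(points[v] for v in verdicts) / len(verdicts)
    return round(100 * win_rate, 1)  # map win rate onto the 0-100 scale

# Example: 2 wins, 1 tie, 1 loss -> (1 + 1 + 0.5 + 0) / 4 = 0.625
print(score_from_verdicts(["win", "win", "tie", "loss"]))  # 62.5
```

Under this convention, a model that ties the baseline on every query would score 50, which is why leaderboard scores cluster around the midpoint for baseline-comparable models.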

Publication

This benchmark was published in 2024.
