CharXiv-Reasoning

scientific | Pending Verification

Tests reasoning on challenging problems from arXiv papers across multiple scientific domains.

Published: 2023
Score Range: 0-100
Top Score: 78.6

CharXiv-Reasoning Leaderboard

Rank  Model    Provider  Score  Parameters  Released    Type
1     o3       OpenAI    78.6   -           2025-04-16  Multimodal
2     o4-mini  OpenAI    72     -           2025-04-16  Multimodal
3     GPT-4.1  OpenAI    56.7   -           2025-04-14  Multimodal
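
For readers who want to work with these rows programmatically, the following is a minimal sketch that encodes the three leaderboard entries and computes each model's gap to the top score. The `Entry` class and the `ranked` and `gaps_to_leader` helpers are hypothetical names introduced here for illustration; they are not part of any official CharXiv tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    provider: str
    score: float      # reported on the 0-100 scale
    released: date
    modality: str


# Values copied from the leaderboard above; the Parameters column is empty there.
LEADERBOARD = [
    Entry("o3", "OpenAI", 78.6, date(2025, 4, 16), "Multimodal"),
    Entry("o4-mini", "OpenAI", 72.0, date(2025, 4, 16), "Multimodal"),
    Entry("GPT-4.1", "OpenAI", 56.7, date(2025, 4, 14), "Multimodal"),
]


def ranked(entries):
    """Return entries sorted from highest to lowest score."""
    return sorted(entries, key=lambda e: e.score, reverse=True)


def gaps_to_leader(entries):
    """Map each model name to its score difference from the top entry."""
    ordered = ranked(entries)
    top = ordered[0].score
    return {e.model: round(top - e.score, 1) for e in ordered}


if __name__ == "__main__":
    for model, gap in gaps_to_leader(LEADERBOARD).items():
        print(f"{model}: {gap} points behind the leader")
```

The gap is rounded to one decimal place to match the precision of the published scores.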

About CharXiv-Reasoning

Methodology

CharXiv-Reasoning scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For details of the scoring system and methodology, please refer to the original paper.
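
This page does not say how a score is derived. As a rough sketch only, and assuming (this is an assumption, not something stated here) that a score is the percentage of questions graded as correct, a value on the 0 to 100 scale could be computed as follows. The function name `percent_score` is hypothetical.

```python
def percent_score(correct_flags):
    """Return the share of correct answers as a value from 0 to 100.

    ASSUMPTION: the benchmark score is a plain accuracy percentage.
    The page above does not confirm this; consult the original paper.
    """
    flags = list(correct_flags)
    if not flags:
        raise ValueError("need at least one graded response")
    # Count truthy flags and scale the fraction to the 0-100 range.
    return 100.0 * sum(bool(f) for f in flags) / len(flags)


if __name__ == "__main__":
    # Hypothetical grading results for four questions, three of them correct.
    print(percent_score([True, True, False, True]))
```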

Publication

This benchmark was published in 2023.

Technical Paper