CharXiv-Reasoning

scientific | Pending Verification

Tests reasoning on challenging problems from arXiv papers across multiple scientific domains.

Published: 2023
Score Range: 0-100
Top Score: 78.6

CharXiv-Reasoning Leaderboard

Rank  Model    Provider  Score  Parameters  Released    Type
1     o3       OpenAI    78.6   -           2025-04-16  Multimodal
2     o4-mini  OpenAI    72     -           2025-04-16  Multimodal
3     GPT-4.1  OpenAI    56.7   -           2025-04-14  Multimodal

About CharXiv-Reasoning

Methodology

CharXiv-Reasoning scores models with a standardized methodology. Scores range from 0 to 100, and higher scores indicate better performance. For details of the scoring system and methodology, refer to the original paper.
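
As an illustration only, the leaderboard ordering can be reproduced by sorting entries on score, highest first. This is a minimal sketch, not the benchmark's own tooling: the names `Entry`, `ENTRIES`, and `rank` are hypothetical, and the range check simply assumes the 0 to 100 scale stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure, for illustration)."""
    model: str
    provider: str
    score: float   # reported on a 0-100 scale, higher is better
    released: str  # release date as YYYY-MM-DD


# Rows copied from the leaderboard, deliberately out of order.
ENTRIES = [
    Entry("GPT-4.1", "OpenAI", 56.7, "2025-04-14"),
    Entry("o3", "OpenAI", 78.6, "2025-04-16"),
    Entry("o4-mini", "OpenAI", 72.0, "2025-04-16"),
]


def rank(entries):
    """Return (rank, entry) pairs sorted by score, highest first."""
    for entry in entries:
        # Reject scores outside the documented 0-100 range.
        if not 0 <= entry.score <= 100:
            raise ValueError(f"score out of range for {entry.model}: {entry.score}")
    ordered = sorted(entries, key=lambda entry: entry.score, reverse=True)
    return list(enumerate(ordered, start=1))


if __name__ == "__main__":
    for position, entry in rank(ENTRIES):
        print(position, entry.model, entry.provider, entry.score, entry.released)
```

Sorting on score alone leaves tied models in their input order; a real leaderboard may break ties differently.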

Publication

This benchmark was published in 2023.