CharXiv-Reasoning

CharXiv-Reasoning tests reasoning on challenging chart-based problems drawn from arXiv papers across multiple scientific domains.

o3

OpenAI's most powerful reasoning model, pushing the frontier across coding, math, science, visual perception, and more. It sets a new state of the art on benchmarks including Codeforces, SWE-bench, and MMMU. It has full tool access and can agentically use and combine every tool within ChatGPT.

o4-mini

A smaller model optimized for fast, cost-efficient reasoning. It performs strongly for its size and cost, particularly in math, coding, and visual tasks. It is the best-performing benchmarked model on AIME 2024 and 2025 and has significantly higher usage limits than o3.

GPT-4.1

Flagship GPT model for complex tasks, well suited to problem solving across domains. It features major improvements in coding, instruction following, and long-context comprehension.

Rank	Model	Provider	Score	Released	Type
1	o3	OpenAI	78.6	2025-04-16	Multimodal
2	o4-mini	OpenAI	72	2025-04-16	Multimodal
3	GPT-4.1	OpenAI	56.7	2025-04-14	Multimodal
