Our state-of-the-art, experimental text diffusion model explores a new kind of language model that gives users greater control, creativity, and speed in text generation.
Evaluates code generation capabilities by asking models to complete Python functions based on docstr...
Graduate-Level Google-Proof Q&A (GPQA) evaluates advanced reasoning on graduate-lev...
American Invitational Mathematics Examination (AIME) 2025 problems....
Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark of 204 diverse tasks....
Massive Multitask Language Understanding (MMLU) tests knowledge across 57 subjects including mathema...
Software Engineering Benchmark (SWE-bench) evaluates models on real-world software engineering tasks...