Grok 3

Name: Grok 3
Author: xAI

xAIProprietaryVerified

xAI's most advanced model yet, blending superior reasoning with extensive pretraining knowledge. Trained on the Colossus supercluster with 10x the compute of previous state-of-the-art models. Features test-time compute and reasoning capabilities through reinforcement learning, allowing it to think for seconds to minutes while correcting errors and exploring alternatives. Achieved an Elo score of 1402 in the Chatbot Arena.

2025-02-19

Unknown (multi-trillion estimated)

Decoder-only Transformer

Proprietary

Compare with other models

Specifications

Parameters: Unknown (multi-trillion estimated)
Architecture: Decoder-only Transformer
License: Proprietary
Context Window: 1,000,000 tokens
Type: multimodal
Modalities: textimagevideo

Benchmark Scores

AIME-202452.2

American Invitational Mathematics Examination (AIME) 2024 problems....

AIME-202593.3

American Invitational Mathematics Examination (AIME) 2025 problems....

EgoSchema74.5

EgoSchema is a very long-form video question-answering dataset and benchmark for evaluating long vid...

FACTS Grounding75.7

The FACTS Grounding Leaderboard evaluates LLMs' ability to generate factually accurate long-form res...

GPQA75.4

Graduate-level Problems in Quantitative Analysis (GPQA) evaluates advanced reasoning on graduate-lev...

LiveCodeBench-v557

Evaluates models on their ability to solve coding problems in real-time....

LOFT (128k)83.3

Long-Context Frontiers benchmark evaluating long-context language models on real-world tasks requiri...

MMLU-Pro79.9

MMLU-Pro is an enhanced benchmark with over 12,000 challenging questions across 14 domains including...

MMMU73.2

A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI with 11.5...

SimpleQA43.6

A benchmark of simple but precise questions to test factual knowledge and reasoning....

Advanced Specifications

Model Family: Grok
API Access: Available
Chat Interface: Available
Variants: Grok 3 (Think)Grok 3 miniGrok 3 mini (Think)

Capabilities & Limitations

Capabilities: reasoningmathematicscodingtest-time-computechain-of-thoughttool-useagentsmultimodal-understanding
Notable Use Cases: advanced reasoningmathematical problem solvingcode generationresearch assistanceagent workflows
Function Calling Support: Yes
Tool Use Support: Yes

Resources

x.ai/blog/grok-3

Related Models

Claude Opus 4.6

Anthropic

Claude Opus 4.6 is Anthropic's most intelligent model, upgrading its predecessor's coding, reasoning, and agentic capabilities. It plans more carefully, sustains agentic tasks for longer, operates more reliably in larger codebases, and has superior code review and debugging skills. In a first for Opus-class models, it features a 1M token context window (beta) and 128K max output tokens. Opus 4.6 achieves state-of-the-art results on Terminal-Bench 2.0 (65.4%), leads on Humanity's Last Exam, BrowseComp, and GDPval-AA, and scores 76% on the 8-needle 1M variant of MRCR v2 (vs. 18.5% for Sonnet 4.5), representing a qualitative leap in long-context performance. It introduces adaptive thinking and four effort levels (low, medium, high, max) for fine-grained control over intelligence, speed, and cost.

FunctionGemma

Google

FunctionGemma is a specialized version of Gemma 3 270M fine-tuned for function calling and designed to run on edge devices. It bridges natural language and software execution, translating user commands into executable API actions. The model excels at unified action and chat capabilities, switching seamlessly between generating structured function calls and conversational responses. Built specifically for customization through fine-tuning, it demonstrated 85% accuracy on Mobile Actions after training (up from 58% baseline). Small enough to run on mobile phones and edge devices like NVIDIA Jetson Nano, it uses Gemma's 256k vocabulary to efficiently tokenize JSON and multilingual inputs.

Claude 4.5 Opus

Anthropic

Claude 4.5 Opus is the most intelligent and capable model in the Claude 4.5 family, designed for the most demanding reasoning, research, and coding tasks. It introduces a unique 'effort' parameter (low, medium, high) that allows users to control the depth of the model's reasoning process and token usage. Opus 4.5 excels in handling ambiguity, complex problem-solving, and deep analysis, often succeeding where other models fail on multi-system bugs or intricate research questions. It is priced to be more accessible than previous Opus generations while delivering state-of-the-art performance.