GPT-4o

Name: GPT-4o
Author: OpenAI

OpenAIProprietaryVerified

GPT-4o ('o' for 'omni') is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.

2024-05-13

Autoregressive omni model

Proprietary

Compare with other models

Specifications

Architecture: Autoregressive omni model
License: Proprietary
Context Window: 128,000 tokens
Max Output: 16,384 tokens
Training Data Cutoff: Oct 2023
Type: multimodal
Modalities: textvisionaudiovideo

Benchmark Scores

BrowseComp0.6

A benchmark for measuring browsing agents' ability to navigate the web and find hard-to-find, entang...

DROP83.4

Discrete Reasoning Over Paragraphs (DROP) requires models to resolve references in a passage and per...

FACTS Grounding75.9

The FACTS Grounding Leaderboard evaluates LLMs' ability to generate factually accurate long-form res...

GPQA53.6

Graduate-level Problems in Quantitative Analysis (GPQA) evaluates advanced reasoning on graduate-lev...

HumanEval90.2

Evaluates code generation capabilities by asking models to complete Python functions based on docstr...

MATH76.6

A dataset of 12,500 challenging competition mathematics problems requiring multi-step reasoning....

MGSM90.5

Multilingual Grade School Math (MGSM) extends GSM8K to 10 languages....

MMLU88.7

Massive Multitask Language Understanding (MMLU) tests knowledge across 57 subjects including mathema...

Vending-Bench335.46

Testing long-term coherence in agents by simulating a vending machine business. Agents manage orderi...

Advanced Specifications

Model Family: omni
API Access: Available
Chat Interface: Available
Multilingual Support: Yes

Capabilities & Limitations

Capabilities: voice generationreal-time audio processingmultimodal reasoningcode generationmath reasoningscientific researchmultilingual supportimage understandingvideo understandingaudio translationspeech recognitiontone detectionemotion expressionsinginglaughter generationmultiple speaker detectionbackground noise handlingreal-time translationvisual narrative understanding
Known Limitations: Audio outputs limited to preset voices at launchMay struggle with low quality audio input, background noise, and echoesText extraction mistakes with scientific terms and complex figuresNon-native accents when speaking non-English languagesOver-refusal behavior in non-English conversations for voice moderationPotential for anthropomorphization and emotional relianceAudio modalities present novel risks requiring additional safety measures
Notable Use Cases: Real-time voice conversationsScientific research assistanceMultimodal content creationLanguage learning and translationEducational tutoringAccessibility applicationsCustomer serviceInterview preparationMeeting assistanceCreative collaborationVisual storytellingMusic and audio generation
Function Calling Support: Yes
Tool Use Support: Yes

Resources

Related Models

GPT-5.2

OpenAI

GPT-5.2 is OpenAI's flagship frontier model series designed for professional knowledge work, advanced coding, and agentic workflows. Released in December 2025 as a response to competitive pressures, it features a massive 400,000-token context window and a 128,000-token maximum output capacity. The model utilizes a Mixture-of-Experts (MoE) architecture to balance inference efficiency with deep reasoning capabilities. It is available in three variants—Instant, Thinking, and Pro—each optimized for different points on the latency-intelligence curve. GPT-5.2 demonstrates state-of-the-art performance in tool calling reliability (98.7%), coding (SWE-Bench Verified 80.0%), and long-context retrieval.

Typemultimodal

ParametersProprietary

2025-12-11

Proprietary

Details Compare

GPT-5.1

OpenAI

GPT-5.1 is a frontier-grade multimodal language model family released by OpenAI in November 2025. It introduces a unified system architecture featuring a 'Smart Router' that dynamically allocates compute resources between two primary modes: 'Instant' (optimized for low latency and conversational warmth) and 'Thinking' (optimized for deep, adaptive reasoning). The model utilizes a Sparse Mixture-of-Experts (MoE) architecture with a central language backbone and attachable modules, allowing it to process text, audio, image, and video inputs natively. Key capabilities include adaptive test-time compute, where the model adjusts its reasoning depth based on query complexity, and enhanced personalization options with distinct personality presets. It demonstrates significant improvements in instruction following, coding (via the Codex variants), and mathematical reasoning compared to its predecessor, GPT-5.

Typemultimodal

ParametersUndisclosed (Estimated ~1.7-2T)

2025-11-12

Proprietary

Details Compare

GPT-OSS-120B

OpenAI

GPT-OSS-120B is a state-of-the-art open-weight language model that delivers strong real-world performance at low cost. This 120 billion parameter mixture-of-experts model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks while running efficiently on a single 80 GB GPU. It was trained using reinforcement learning and techniques informed by OpenAI's most advanced internal models, including o3 and other frontier systems.

Typetext

Parameters117B total (5.1B active per token)

2025-08-05

Open Weights

Details Compare