OpenAI logo

GPT-4o

OpenAIProprietaryVerified

GPT-4o ('o' for 'omni') is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.

2024-05-13
Autoregressive omni model
Proprietary

Specifications

Architecture
Autoregressive omni model
License
Proprietary
Context Window
128,000 tokens
Max Output
16,384 tokens
Training Data Cutoff
Oct 2023
Type
multimodal
Modalities
textvisionaudiovideo

Benchmark Scores

MMLU88.7

Massive Multitask Language Understanding (MMLU) tests knowledge across 57 subjects including mathema...

GPQA53.6

Graduate-level Problems in Quantitative Analysis (GPQA) evaluates advanced reasoning on graduate-lev...

MATH76.6

A dataset of 12,500 challenging competition mathematics problems requiring multi-step reasoning....

Evaluates code generation capabilities by asking models to complete Python functions based on docstr...

MGSM90.5

Multilingual Grade School Math (MGSM) extends GSM8K to 10 languages....

DROP83.4

Discrete Reasoning Over Paragraphs (DROP) requires models to resolve references in a passage and per...

A benchmark for measuring browsing agents' ability to navigate the web and find hard-to-find, entang...

The FACTS Grounding Leaderboard evaluates LLMs' ability to generate factually accurate long-form res...

Testing long-term coherence in agents by simulating a vending machine business. Agents manage orderi...

Advanced Specifications

Model Family
omni
API Access
Available
Chat Interface
Available
Multilingual Support
Yes

Capabilities & Limitations

Capabilities
voice generationreal-time audio processingmultimodal reasoningcode generationmath reasoningscientific researchmultilingual supportimage understandingvideo understandingaudio translationspeech recognitiontone detectionemotion expressionsinginglaughter generationmultiple speaker detectionbackground noise handlingreal-time translationvisual narrative understanding
Known Limitations
Audio outputs limited to preset voices at launchMay struggle with low quality audio input, background noise, and echoesText extraction mistakes with scientific terms and complex figuresNon-native accents when speaking non-English languagesOver-refusal behavior in non-English conversations for voice moderationPotential for anthropomorphization and emotional relianceAudio modalities present novel risks requiring additional safety measures
Notable Use Cases
Real-time voice conversationsScientific research assistanceMultimodal content creationLanguage learning and translationEducational tutoringAccessibility applicationsCustomer serviceInterview preparationMeeting assistanceCreative collaborationVisual storytellingMusic and audio generation
Function Calling Support
Yes
Tool Use Support
Yes

Related Models