GPT-4o
GPT-4o ('o' for 'omni') is a multimodal model that accepts any combination of text, audio, image, and video as input and generates any combination of text, audio, and image outputs. It matches GPT-4 Turbo performance on English text and code, with significant improvement on text in non-English languages. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. GPT-4o is 50% cheaper in the API than GPT-4 Turbo and has stronger vision and audio understanding than earlier models.
Specifications
- Architecture: Decoder-only Transformer (with vision encoder for images and audio processing)
- License: Proprietary
- Context Window: 128,000 tokens
- Max Output: 16,384 tokens
- Training Data Cutoff: Sep 30, 2023
- Type: Multimodal
- Modalities: text, vision, audio, video
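The context window and max output figures listed above can be turned into a quick budget check before sending a request. The following is a minimal sketch, not an official client: it assumes the 128,000-token context window is shared between the prompt and the generated output, and the helper names `max_prompt_tokens` and `fits_in_context` are hypothetical, introduced only for illustration. Token counts are taken as plain integers; how they are measured is left to the caller.

```python
# Limits copied from the Specifications list above.
CONTEXT_WINDOW = 128_000  # total tokens (assumed to cover prompt + output)
MAX_OUTPUT = 16_384       # maximum tokens the model may generate


def max_prompt_tokens(requested_output: int) -> int:
    """Return the prompt budget left after reserving room for the output.

    Assumption: prompt and output tokens share one context window.
    """
    if not 0 < requested_output <= MAX_OUTPUT:
        raise ValueError(
            f"requested_output must be between 1 and {MAX_OUTPUT}"
        )
    return CONTEXT_WINDOW - requested_output


def fits_in_context(prompt_tokens: int, requested_output: int) -> bool:
    """Check whether a prompt of the given size leaves room for the output."""
    return prompt_tokens <= max_prompt_tokens(requested_output)
```

For example, reserving the full 16,384 output tokens leaves 128,000 - 16,384 = 111,616 tokens for the prompt under this assumption.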
Benchmark Scores
Evaluates models on competitive programming problems from the Codeforces platform....
Evaluates code generation capabilities by asking models to complete Python functions based on docstr...
Evaluates models on their ability to solve cybersecurity challenges across various domains including...
Massive Multitask Language Understanding (MMLU) tests knowledge across 57 subjects including mathema...
Graduate-Level Google-Proof Q&A (GPQA) evaluates advanced reasoning on graduate-lev...
A dataset of 12,500 challenging competition mathematics problems requiring multi-step reasoning....
A sample of 500 diverse problems from the MATH benchmark, spanning topics like probability, algebra,...
Multilingual Grade School Math (MGSM) extends GSM8K to 10 languages....
Discrete Reasoning Over Paragraphs (DROP) requires models to resolve references in a passage and per...
Advanced Specifications
- Model Family: omni
- API Access: Not Available
- Chat Interface: Not Available