
DeepSeek-V3
A powerful Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. It adopts Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture, together with an auxiliary-loss-free load-balancing strategy and a multi-token prediction training objective. The model was pre-trained on 14.8T high-quality tokens using only 2.788M H800 GPU hours.
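The practical effect of the MoE design is that a router selects a small subset of experts for each token, so only a fraction of the total parameters participates in any forward pass. Below is a minimal, illustrative sketch of top-k expert routing in PyTorch; the dimensions, expert count, and top-k value are hypothetical and do not reflect DeepSeek-V3's real configuration.

```python
# Toy top-k MoE layer: only the experts chosen by the router run per token,
# which is why ~37B of 671B parameters are "activated" for each token.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):  # placeholder sizes
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)      # per-token expert choice
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():                              # only chosen experts compute
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

x = torch.randn(5, 64)
print(ToyMoELayer()(x).shape)                               # torch.Size([5, 64])
```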
Specifications
- Parameters
- 671B total, 37B activated
- Architecture
- Mixture of Experts (MoE)
- License
- MIT
- Context Window
- 128,000 tokens
- Training Data Cutoff
- 2024-12
- Type
- text
- Modalities
- text
Benchmark Scores
- Massive Multitask Language Understanding (MMLU): tests knowledge across 57 subjects, including mathematics, history, computer science, and law.
- MMLU-Pro: an enhanced benchmark with over 12,000 challenging questions across 14 domains, including mathematics, physics, law, and engineering.
- Discrete Reasoning Over Paragraphs (DROP): requires models to resolve references in a passage and perform discrete operations over them, such as addition, counting, or sorting.
- Grade School Math 8K (GSM8K): consists of 8.5K high-quality grade school math word problems.
- A dataset of 12,500 challenging competition mathematics problems requiring multi-step reasoning.
- Evaluates code generation capabilities by asking models to complete Python functions based on docstrings.
- Advanced competitive programming benchmark for evaluating large language models on algorithmic problems.
Advanced Specifications
- Model Family
- DeepSeek
- API Access
- Available (see the usage sketch after this list)
- Chat Interface
- Available
- Multilingual Support
- Yes
- Variants
- Base, Chat, FP8, BF16
- Hardware Support
- CUDA, AMD GPU, Huawei Ascend NPU
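Since API access is listed as available, the sketch below shows a minimal chat request through an OpenAI-compatible client. The base URL and model name ("deepseek-chat") are assumptions based on DeepSeek's published API conventions; verify both against the current API documentation.

```python
# Minimal sketch of calling the model via an OpenAI-compatible endpoint.
# Base URL and model name are assumed; check the current DeepSeek API docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",                    # placeholder credential
    base_url="https://api.deepseek.com",       # assumed endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",                     # assumed model identifier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the DeepSeekMoE architecture in two sentences."},
    ],
)
print(response.choices[0].message.content)
```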
Capabilities & Limitations
- Capabilities
- reasoning, code, math, multilingual, long-context, multi-token prediction
- Function Calling Support
- Yes (see the function-calling sketch after this list)
- Tool Use Support
- Yes
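Function calling and tool use are exposed through the same OpenAI-compatible tools interface assumed above. The sketch below defines one hypothetical tool (get_weather) and lets the model decide whether to call it; the tool name and schema are illustrative, not part of the model's specification.

```python
# Sketch of a function-calling request using the OpenAI-compatible tools
# parameter; the weather tool and its schema are hypothetical examples.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                 # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-chat",                     # assumed model identifier
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)
# If the model chooses to call the tool, the call appears here instead of text.
print(response.choices[0].message.tool_calls)
```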