Qwen2.5-VL 72B
A state-of-the-art vision-language model capable of understanding videos over 1 hour long, performing precise object grounding, and generating structured output. It can also act as a visual agent for computer and mobile GUI operations.
2025-01-26
72B
Vision Transformer (ViT) + Decoder-only Transformer
Qwen License
Specifications
- Parameters: 72B
- Architecture: Vision Transformer (ViT) + Decoder-only Transformer
- License: Qwen License
- Context Window: 128,000 tokens
- Max Output: 8,192 tokens
- Training Data Cutoff: Jan 2025
- Type: multimodal
- Modalities: text, image, video
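The context window and maximum output figures listed above imply a prompt budget. The following is a minimal sketch of that arithmetic, assuming that prompt tokens and generated tokens share a single context window (the specification does not state this). The function name `max_prompt_tokens` is hypothetical and used only for illustration.

```python
# Figures taken from the specification list above.
CONTEXT_WINDOW = 128_000  # tokens
MAX_OUTPUT = 8_192        # tokens


def max_prompt_tokens(reserved_output: int = MAX_OUTPUT) -> int:
    """Return the prompt budget left after reserving room for the reply.

    Assumption: prompt and generated tokens share one context window.
    """
    if not 0 <= reserved_output <= MAX_OUTPUT:
        raise ValueError("reserved_output must be between 0 and MAX_OUTPUT")
    return CONTEXT_WINDOW - reserved_output


print(max_prompt_tokens())      # 119808
print(max_prompt_tokens(1024))  # 126976
```

Under this assumption, reserving the full 8,192-token output leaves 119,808 tokens for text, image, and video input combined.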
Advanced Specifications
- Model Family: Qwen
- API Access: Available
- Chat Interface: Available
- Multilingual Support: Yes
- Variants: Qwen2.5-VL-3B, Qwen2.5-VL-7B, Qwen2.5-VL-32B
- Hardware Support: CUDA
Capabilities & Limitations
- Capabilities: visual understanding, video analysis, OCR, object grounding, visual agent, structured output
- Notable Use Cases: video summarization, document parsing, GUI automation
- Function Calling Support: Yes
- Tool Use Support: Yes
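Object grounding combined with structured output typically means the model returns bounding boxes as JSON that downstream code must parse. The sketch below shows one way to do that. It assumes the reply is a JSON array whose items carry a `bbox_2d` list of `[x1, y1, x2, y2]` pixel coordinates and a `label` string; that schema, the `Box` class, and the `parse_boxes` helper are assumptions for illustration, so adjust them to whatever format your prompt requests. The sample reply is written by hand and is not real model output.

```python
import json
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def parse_boxes(reply: str) -> list[Box]:
    """Parse a JSON array of {"bbox_2d": [x1, y1, x2, y2], "label": str}.

    Assumed schema; boxes with zero or negative area are skipped.
    """
    boxes = []
    for item in json.loads(reply):
        x1, y1, x2, y2 = item["bbox_2d"]
        if x2 <= x1 or y2 <= y1:
            continue  # skip degenerate boxes
        boxes.append(Box(item["label"], x1, y1, x2, y2))
    return boxes


# Hypothetical reply, written by hand for illustration.
sample = '[{"bbox_2d": [10, 20, 110, 220], "label": "button"}]'
print(parse_boxes(sample))
```

Validating coordinates before use matters most in GUI automation, where a malformed box could send a click to the wrong place.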