Qwen2.5-VL 72B

Alibaba, Open Weights, Pending Human Review

A state-of-the-art vision-language model that can understand videos more than an hour long, perform precise object grounding, and generate structured output. It can also act as a visual agent that operates computer and mobile GUIs.

2025-01-26

Specifications

Parameters
72B
Architecture
Vision Transformer (ViT) + Decoder-only Transformer
License
Qwen License
Context Window
128,000 tokens
Max Output
8,192 tokens
Training Data Cutoff
Jan 2025
Type
multimodal
Modalities
text, image, video
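
The context window and max output figures above imply a simple prompt budget. The sketch below works through that arithmetic. It assumes that prompt tokens (including any image or video tokens) and generated tokens share the same 128,000-token window, which this page does not state; the function names are hypothetical and only for illustration.

```python
# Figures copied from the Specifications list above.
CONTEXT_WINDOW = 128_000  # tokens
MAX_OUTPUT = 8_192        # tokens


def max_prompt_tokens(reserved_output: int = MAX_OUTPUT) -> int:
    """Tokens left for the prompt after reserving room for the reply.

    ASSUMPTION: prompt and output share one context window.
    """
    if not 0 <= reserved_output <= MAX_OUTPUT:
        raise ValueError("reserved_output must be between 0 and MAX_OUTPUT")
    return CONTEXT_WINDOW - reserved_output


def fits(prompt_tokens: int, reserved_output: int = MAX_OUTPUT) -> bool:
    """Return True if a prompt of this size leaves the reserved output room."""
    return 0 <= prompt_tokens <= max_prompt_tokens(reserved_output)


if __name__ == "__main__":
    # 128,000 - 8,192 = 119,808 tokens available for the prompt.
    print(max_prompt_tokens())
```

Under that assumption, reserving the full 8,192 output tokens leaves 119,808 tokens for text, image, and video input combined.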

Advanced Specifications

Model Family
Qwen
API Access
Available
Chat Interface
Available
Multilingual Support
Yes
Variants
Qwen2.5-VL-3B, Qwen2.5-VL-7B, Qwen2.5-VL-32B
Hardware Support
CUDA

Capabilities & Limitations

Capabilities
visual understanding, video analysis, OCR, object grounding, visual agent, structured output
Notable Use Cases
Video summarization, Document parsing, GUI automation
Function Calling Support
Yes
Tool Use Support
Yes
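
To show how the object grounding and structured output capabilities might feed GUI automation, here is a minimal sketch that parses a grounding reply and computes a click point. The reply format is an assumption made for illustration: a JSON list of objects with hypothetical "label" and "bbox" keys, where "bbox" holds [x1, y1, x2, y2] in pixels. This page does not document the model's actual output schema, so check the official model documentation before relying on any particular key names.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def parse_boxes(reply: str) -> list[Box]:
    """Parse a JSON list of {"label", "bbox"} items (hypothetical schema)."""
    items = json.loads(reply)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of detections")
    boxes = []
    for item in items:
        x1, y1, x2, y2 = item["bbox"]
        # Reject degenerate or inverted boxes instead of passing them on.
        if x2 <= x1 or y2 <= y1:
            raise ValueError(f"invalid box for {item['label']!r}")
        boxes.append(Box(str(item["label"]), float(x1), float(y1), float(x2), float(y2)))
    return boxes


def center(box: Box) -> tuple[float, float]:
    """Midpoint of the box, e.g. a point a GUI agent could click."""
    return ((box.x1 + box.x2) / 2, (box.y1 + box.y2) / 2)


# Hand-written sample reply, not real model output.
sample = '[{"label": "button", "bbox": [10, 20, 110, 60]}]'

if __name__ == "__main__":
    for box in parse_boxes(sample):
        print(box.label, center(box))
```

Validating the reply before acting on it matters for agent use: a malformed or inverted box raises an error here rather than producing a click at the wrong coordinates.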