Multi-IF evaluates LLMs on multi-turn, multilingual instruction following across 8 languages, using 4,501 conversations of three turns each. Results show that performance degrades with each additional turn and that models have more difficulty with languages written in non-Latin scripts.

Multi-IF
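
To make the per-turn degradation concrete, here is a minimal sketch of how instruction-following accuracy could be aggregated by turn. The function name `per_turn_accuracy` and the toy results are hypothetical and are not taken from the benchmark's own scoring code; only the conversation count and the three-turn structure come from the description above.

```python
from collections import defaultdict

# From the benchmark description: 4,501 conversations, three turns each.
NUM_CONVERSATIONS = 4501
TURNS_PER_CONVERSATION = 3
TOTAL_TURNS = NUM_CONVERSATIONS * TURNS_PER_CONVERSATION


def per_turn_accuracy(results):
    """Aggregate (turn, followed) pairs into an accuracy per turn.

    `results` is an iterable of (turn_number, followed_instruction) pairs.
    This is an illustrative helper, not the official Multi-IF metric.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for turn, followed in results:
        totals[turn] += 1
        passed[turn] += int(followed)
    return {turn: passed[turn] / totals[turn] for turn in sorted(totals)}


# Hypothetical toy data: four responses per turn, fewer passing at later turns.
toy_results = [
    (1, True), (1, True), (1, True), (1, False),
    (2, True), (2, True), (2, False), (2, False),
    (3, True), (3, False), (3, False), (3, False),
]

accuracy = per_turn_accuracy(toy_results)
print(TOTAL_TURNS)  # 13503
print(accuracy)     # {1: 0.75, 2: 0.5, 3: 0.25}
```

With real evaluation output, the same aggregation could also be grouped by language to compare Latin-script and non-Latin-script results.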

Third-generation Qwen model with hybrid reasoning: it can switch between thinking and non-thinking modes. Trained on 36 trillion tokens (double that of Qwen2.5), with support for 119 languages and dialects. Available as six dense models (0.6B to 32B parameters) and two MoE models (30B total with 3B active, and 235B total with 22B active).

Qwen-3

GPT-4.1 is a flagship GPT model for complex tasks. It is well suited to problem solving across domains and features major improvements in coding, instruction following, and long-context comprehension.

Rank	Model	Provider	Score	Parameters	Released	Type
1	Qwen-3	Alibaba	71.9	235B (22B active)	2025-04-29	Text
2	GPT-4.1	OpenAI	70.8		2025-04-14	Multimodal
