Multi-IF

Tags: instruction, Verified

Multi-IF evaluates LLMs on multi-turn, multilingual instruction following across 8 languages, using 4,501 conversations of three turns each. Results show that performance degrades with each additional turn and that models struggle more with languages written in non-Latin scripts.

Published: 2024
Score Range: 0-100
Top Score: 71.9

Multi-IF Leaderboard

Rank  Model    Provider  Score  Parameters         Released    Type
1     Qwen-3   Alibaba   71.9   235B (22B active)  2025-04-29  Text
2     GPT-4.1  OpenAI    70.8   (not listed)       2025-04-14  Multimodal

About Multi-IF

Methodology

Multi-IF scores are reported on a scale of 0 to 100, where higher scores indicate better instruction following. For details of the scoring system and methodology, refer to the original paper.
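
As a rough illustration of how a 0 to 100 score can be read per turn, the sketch below aggregates pass/fail instruction checks into a percentage for each of the three turns. It is not the paper's scoring procedure: the function name `per_turn_scores`, the `(turn, followed)` record layout, the toy data, and the choice to average the per-turn values into one number are all assumptions made for illustration.

```python
from collections import defaultdict


def per_turn_scores(results):
    """Return {turn: percentage of instructions followed} on a 0-100 scale.

    `results` is an iterable of (turn, followed) pairs, where `turn` is the
    turn index (1, 2, or 3) and `followed` is True if the model satisfied
    the instruction. This record layout is hypothetical.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for turn, followed in results:
        totals[turn] += 1
        passed[turn] += int(followed)
    return {turn: 100.0 * passed[turn] / totals[turn] for turn in sorted(totals)}


# Toy data (invented): accuracy drops as the conversation gets longer.
toy_results = [
    (1, True), (1, True), (1, True), (1, False),
    (2, True), (2, True), (2, False), (2, False),
    (3, True), (3, False), (3, False), (3, False),
]

scores = per_turn_scores(toy_results)
# Assumption: a single headline number taken as the mean of per-turn scores.
overall = sum(scores.values()) / len(scores)

print(scores)   # {1: 75.0, 2: 50.0, 3: 25.0}
print(overall)  # 50.0
```

Reporting the per-turn values separately, rather than only the mean, is what makes turn-over-turn degradation visible.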

Publication

This benchmark was published in 2024. Read the full paper.