Multi-IF

Tags: instruction, Verified

Multi-IF evaluates LLMs on multi-turn, multilingual instruction following. It covers 8 languages and contains 4,501 conversations of three turns each. Results show performance degrading with each additional turn, and models struggling more with languages written in non-Latin scripts.

Published: 2024
Score Range: 0-100
Top Score: 71.9

Multi-IF Leaderboard

Rank  Model    Provider  Score  Parameters         Released    Type
1     Qwen-3   Alibaba   71.9   235B (22B active)  2025-04-29  Text
2     GPT-4.1  OpenAI    70.8   -                  2025-04-14  Multimodal

About Multi-IF

Methodology

Multi-IF evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
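
To make the 0 to 100 scale and the per-turn degradation concrete, here is a minimal sketch of how per-turn scores could be aggregated. It is an illustration only, not the paper's scoring code: the TurnResult record, its field names, the per_turn_scores helper, and the toy data are all assumptions made up for this example, and the real benchmark may weight or combine its metrics differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TurnResult:
    """Hypothetical record: outcome of one turn in one conversation."""
    conversation_id: str
    language: str
    turn: int                   # 1, 2, or 3
    instructions_followed: int  # instructions the response satisfied
    instructions_total: int     # instructions active at this turn


def per_turn_scores(results):
    """Return {turn: score on a 0-100 scale}, pooling instructions
    across all conversations for each turn (an assumed aggregation)."""
    followed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        followed[r.turn] += r.instructions_followed
        total[r.turn] += r.instructions_total
    return {
        t: 100.0 * followed[t] / total[t]
        for t in sorted(total)
        if total[t] > 0
    }


# Toy data (invented): two conversations of three turns each.
results = [
    TurnResult("c1", "en", 1, 1, 1),
    TurnResult("c1", "en", 2, 1, 2),
    TurnResult("c1", "en", 3, 1, 3),
    TurnResult("c2", "hi", 1, 1, 1),
    TurnResult("c2", "hi", 2, 2, 2),
    TurnResult("c2", "hi", 3, 2, 3),
]

scores = per_turn_scores(results)
for turn, score in scores.items():
    print(f"turn {turn}: {score:.1f}")
```

Grouping by r.language instead of r.turn would give a per-language breakdown under the same assumptions.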

Publication

This benchmark was published in 2024; details are in the accompanying technical paper.