CoVoST 2
CoVoST 2 Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type |
---|---|---|---|---|---|---|
1 | Gemini 2.0 Pro | Google | 40.6 | | 2025-02-05 | Multimodal |
2 | Gemini 2.0 Flash | Google | 39.0 | | 2025-02-25 | Multimodal |
3 | Gemini 2.0 Flash-Lite | Google | 38.4 | | 2025-02-25 | Multimodal |
About CoVoST 2
Description
CoVoST 2 is an open, large-scale multilingual speech-to-text translation (ST) dataset developed to advance research in both high- and low-resource language settings. It includes translations from 21 languages into English and from English into 15 languages, making it the largest publicly available corpus in terms of language coverage and volume (2880 hours, 78K speakers). Built on Mozilla's Common Voice, CoVoST 2 extends its predecessor by incorporating rigorous quality control (language model perplexity, LASER scores, length heuristics) and improved data utilization via revised train/dev/test splits. The dataset supports automatic speech recognition (ASR), machine translation (MT), and end-to-end ST, with baseline models implemented in fairseq. Extensive evaluations show that multilingual training substantially boosts performance on low-resource language pairs.
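For orientation, the sketch below shows one way a CoVoST 2 language pair could be loaded with the Hugging Face `datasets` library. This is a hedged example rather than the official tooling (the paper's baselines are implemented in fairseq): it assumes the community `facebook/covost2` loader on the Hugging Face Hub, an `fr_en` (French-to-English) configuration, and a locally extracted Common Voice audio directory passed via `data_dir`.

```python
# Minimal sketch (assumptions: the "facebook/covost2" loader, an "fr_en" config,
# and a Common Voice release for the source language already extracted locally).
from datasets import load_dataset

# CoVoST 2 reuses Common Voice clips, so the loader expects the path to the
# extracted Common Voice audio for the source language.
covost = load_dataset(
    "facebook/covost2",
    "fr_en",                              # source_target language pair
    data_dir="/path/to/common_voice/fr",  # placeholder path
)

sample = covost["train"][0]
# Typical fields: the source audio, the source-language transcript ("sentence"),
# and its English translation ("translation"); exact column names follow the loader's schema.
print(sample["sentence"])
print(sample["translation"])
```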
Methodology
CoVoST 2 evaluates translation quality with BLEU, reported on a scale of 0 to 100; higher scores indicate better performance. For detailed information about the methodology, please refer to the original paper.
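The exact scoring pipeline behind the leaderboard is not described here, but the CoVoST 2 paper reports corpus-level BLEU computed with sacreBLEU, which lives on the same 0-to-100 scale. The snippet below is a minimal sketch of that style of scoring; the hypothesis and reference sentences are made-up placeholders, not dataset content.

```python
# Sketch of corpus-level BLEU scoring in the style of the CoVoST 2 baselines
# (sacreBLEU; the sentences below are invented examples, not dataset content).
import sacrebleu

hypotheses = [
    "the cat sits on the mat",
    "he bought a new car yesterday",
]
references = [
    "the cat is sitting on the mat",
    "he bought a new car yesterday",
]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")  # 0-100 scale; higher is better

# For Japanese and Chinese targets, the paper reports character-level BLEU,
# e.g. sacrebleu.corpus_bleu(hypotheses, [references], tokenize="char").
```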
Publication
This benchmark was published in 2020. Read the full paper.