CoVoST 2
CoVoST 2 Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type |
---|---|---|---|---|---|---|
1 | Gemini 2.0 Pro | Google | 40.6 | | 2025-02-05 | Multimodal |
2 | Gemini 2.0 Flash | Google | 39.0 | | 2025-02-25 | Multimodal |
3 | Gemini 2.0 Flash-Lite | Google | 38.4 | | 2025-02-25 | Multimodal |
About CoVoST 2
Description
CoVoST 2 is an open, large-scale multilingual speech-to-text translation (ST) dataset developed to advance research in both high- and low-resource language settings. It includes translations from 21 languages into English and from English into 15 languages, making it the largest publicly available corpus in terms of language coverage and volume (2880 hours, 78K speakers). Built on Mozilla's Common Voice, CoVoST 2 extends its predecessor by incorporating rigorous quality control (language model perplexity, LASER scores, length heuristics) and improved data utilization via revised train/dev/test splits. The dataset supports automatic speech recognition (ASR), machine translation (MT), and end-to-end ST, with baseline models implemented in fairseq. Extensive evaluations show that multilingual training substantially boosts performance on low-resource language pairs.
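For orientation, the sketch below shows one way a CoVoST 2 language pair could be loaded with the Hugging Face `datasets` library. This is a hedged example rather than the official tooling (the paper's baselines are implemented in fairseq): it assumes the community `facebook/covost2` loader on the Hugging Face Hub, an `fr_en` (French-to-English) configuration, and a locally extracted Common Voice audio directory passed via `data_dir`.

```python
# Minimal sketch (assumptions: the "facebook/covost2" loader, an "fr_en" config,
# and a Common Voice release for the source language already extracted locally).
from datasets import load_dataset

# CoVoST 2 reuses Common Voice clips, so the loader expects the path to the
# extracted Common Voice audio for the source language.
covost = load_dataset(
    "facebook/covost2",
    "fr_en",                              # source_target language pair
    data_dir="/path/to/common_voice/fr",  # placeholder path
)

sample = covost["train"][0]
# Typical fields: the source audio, the source-language transcript ("sentence"),
# and its English translation ("translation"); exact column names follow the loader's schema.
print(sample["sentence"])
print(sample["translation"])
```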
Methodology
CoVoST 2 evaluates translation quality with BLEU, reported on a scale of 0 to 100; higher scores indicate better performance. For detailed information about the methodology, please refer to the original paper.
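The exact scoring pipeline behind the leaderboard is not described here, but the CoVoST 2 paper reports corpus-level BLEU computed with sacreBLEU, which lives on the same 0-to-100 scale. The snippet below is a minimal sketch of that style of scoring; the hypothesis and reference sentences are made-up placeholders, not dataset content.

```python
# Sketch of corpus-level BLEU scoring in the style of the CoVoST 2 baselines
# (sacreBLEU; the sentences below are invented examples, not dataset content).
import sacrebleu

hypotheses = [
    "the cat sits on the mat",
    "he bought a new car yesterday",
]
references = [
    "the cat is sitting on the mat",
    "he bought a new car yesterday",
]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")  # 0-100 scale; higher is better

# For Japanese and Chinese targets, the paper reports character-level BLEU,
# e.g. sacrebleu.corpus_bleu(hypotheses, [references], tokenize="char").
```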
Publication
This benchmark was published in 2020. Read the full paper.