Open ASR Leaderboard Analysis: Multilingual and Long-Form Transcription Trends and Insights
Summary
The recently updated Open ASR Leaderboard moves beyond short-form, English-centric evaluation to include multilingual performance and model throughput as core metrics. The best accuracy to date comes from Conformer encoder + LLM decoder combinations, with models such as NVIDIA Canary-Qwen-2.5B recording the lowest WER. Meanwhile, for long-form audio where processing speed matters, models using CTC/TDT decoders deliver overwhelmingly higher throughput. Whisper Large v3 remains a strong multilingual baseline, but models fine-tuned for a specific language can outperform it.
Key Points
- The best accuracy comes from Conformer encoder + LLM decoder combinations (e.g., NVIDIA Canary-Qwen-2.5B), showing that LLM reasoning contributes substantially to ASR performance.
- CTC and TDT decoders, optimized for real-time/batch processing, deliver far higher throughput than Whisper Large v3 (RTFx 2793.75 vs 68.56).
- General-purpose models like Whisper Large v3 are strong at multilingual support, but fine-tuned models specialized for a particular language can perform better on English-only tasks.
- Closed-source systems still lead in long-form audio transcription (podcasts, meetings), but the open-source community has strong potential to innovate here.
Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks
Most benchmarks focus on short-form English transcription (<30s) and overlook other important tasks, such as (1) multilingual performance and (2) model throughput, which can be a deciding factor for long-form audio like meetings and podcasts.
Over the past two years, the Open ASR Leaderboard has become a standard for comparing open and closed-source models on both accuracy and efficiency. Recently, multilingual and long-form transcription tracks have been added to the leaderboard 🎉
TL;DR - Open ASR Leaderboard
- 📝 New preprint on ASR trends from the leaderboard: https://hf.co/papers/2510.06961
- 🧠 Best accuracy: Conformer encoder + LLM decoders (open-source ftw 🥳)
- ⚡ Fastest: CTC / TDT decoders
- 🌍 Multilingual: Comes at the cost of single-language performance
- ⌛ Long-form: Closed-source systems still lead (for now 😉)
- 🧑‍💻 Fine-tuning guides (Parakeet, Voxtral, Whisper): to continue pushing performance
As of 21 Nov 2025, the Open ASR Leaderboard compares 60+ open and closed-source models from 18 organizations, across 11 datasets.
In a recent preprint, we dive into the technical setup and highlight some key trends in modern ASR. Here are the big takeaways 👇
Models combining Conformer encoders with large language model (LLM) decoders currently lead in English transcription accuracy. For example, NVIDIA’s Canary-Qwen-2.5B, IBM’s Granite-Speech-3.3-8B, and Microsoft’s Phi-4-Multimodal-Instruct achieve the lowest word error rates (WER), showing that integrating LLM reasoning can significantly boost ASR accuracy.
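Concretely, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal sketch of the core computation (real evaluation pipelines, including leaderboards, typically normalize punctuation and casing first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table over the hypothesis words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev holds d[i-1][j-1]
        for j, h in enumerate(hyp, 1):
            cur = d[j]  # d[i-1][j] before overwrite
            d[j] = min(
                d[j] + 1,         # deletion (reference word dropped)
                d[j - 1] + 1,     # insertion (extra hypothesis word)
                prev + (r != h),  # substitution, free if words match
            )
            prev = cur
    return d[-1] / max(len(ref), 1)

print(wer("the cat sat", "the bat sat"))  # one substitution out of three words
```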
💡 Pro-tip: NVIDIA introduced Fast Conformer, a 2x faster variant of the Conformer that is used in their Canary and Parakeet suite of models.
While highly accurate, these LLM decoders tend to be slower than simpler approaches. On the Open ASR Leaderboard, efficiency is measured using inverse real-time factor (RTFx), where higher is better.
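RTFx is simply audio duration divided by wall-clock transcription time, so an RTFx of 100 means an hour of audio is transcribed in 36 seconds. A sketch of how it could be measured, where `transcribe` is a hypothetical stand-in for any ASR model call:

```python
import time

def measure_rtfx(transcribe, audio, audio_duration_s: float) -> float:
    """Inverse real-time factor: seconds of audio processed per wall-clock second.
    Higher is better; RTFx > 1 means faster than real time."""
    start = time.perf_counter()
    transcribe(audio)  # run the model (hypothetical callable)
    elapsed = time.perf_counter() - start
    return audio_duration_s / elapsed
```

In practice, leaderboard-style measurements also average over many files and warm up the model first so that one-off compilation or loading costs don't skew the number.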
For even faster inference, CTC and TDT decoders deliver 10–100× faster throughput, albeit with slightly higher error rates. This makes them ideal for real-time, offline, or batch transcription tasks (such as meetings, lectures, or podcasts).
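Part of why CTC decoding is so cheap: greedy CTC inference is one argmax per frame, then collapsing repeats and dropping blanks, with no autoregressive loop over output tokens. A minimal sketch (token IDs and the blank index are illustrative):

```python
def ctc_greedy_decode(logits, blank: int = 0) -> list[int]:
    """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks.

    logits: one list of per-token scores per audio frame.
    Returns the decoded token-ID sequence.
    """
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    out, prev = [], None
    for tok in best:
        if tok != prev and tok != blank:  # new non-blank label
            out.append(tok)
        prev = tok
    return out

# Frames whose argmaxes are [1, 1, blank, 2, 2] decode to [1, 2]:
frames = [[0.1, 0.9, 0.0], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05],
          [0.1, 0.1, 0.8], [0.0, 0.2, 0.8]]
print(ctc_greedy_decode(frames))  # -> [1, 2]
```

Because every frame is decoded independently, this runs in a single parallel pass, unlike an LLM decoder that must generate tokens one at a time.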
OpenAI’s Whisper Large v3 remains a strong multilingual baseline, supporting 99 languages. However, fine-tuned or distilled variants like Distil-Whisper and CrisperWhisper often outperform the original on English-only tasks, showing how targeted fine-tuning can improve specialization (how to fine-tune? Check out guides for Whisper, Parakeet, and Voxtral).
That said, focusing on English tends to reduce multilingual coverage 👉 a classic case of the tradeoff between specialization and generalization. Similarly, while self-supervised systems like Meta’s Massively Multilingual Speech (MMS) and Omnilingual ASR can support 1K+ languages, they trail behind language-specific encoders in accuracy.
⭐ While just five languages are currently benchmarked, we’re planning to expand to more languages and are excited for new dataset and model contributions to multilingual ASR through GitHub pull requests.
🎯 Alongside multilingual benchmarks, several community-driven leaderboards focus on individual languages. For example, the Open Universal Arabic ASR Leaderboard compares models across Modern Standard Arabic and regional dialects, highlighting how speech variation and diglossia challenge current systems. Similarly, the Russian ASR Leaderboard provides a growing hub for evaluating encoder-decoder and CTC models on Russian-specific phonology and morphology. These localized efforts mirror the broader multilingual leaderboard’s mission to encourage dataset sharing, fine-tuned checkpoints, and transparent model comparisons, especially in languages with fewer established ASR resources.
For long-form audio (e.g., podcasts, lectures, meetings), closed-source systems still edge out open ones, likely due to domain tuning, custom chunking, or production-grade optimization.
Among open models, OpenAI’s Whisper Large v3 performs the best. But for throughput, CTC-based Conformers shine 👉 for example, NVIDIA’s Parakeet CTC 1.1B achieves an RTFx of 2793.75, compared to 68.56 for Whisper Large v3, with only a moderate WER degradation (6.68 and 6.43 respectively).
The tradeoff? Parakeet is English-only, again reminding us of that multilingual and specialization tradeoff 🫠.
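Open models typically handle long-form audio by slicing it into overlapping windows, transcribing each window independently (embarrassingly parallel for CTC models), and merging the overlapping text. A sketch of the chunking step only, with illustrative parameter values and the merge logic omitted:

```python
def chunk_audio(samples, sr: int = 16000,
                chunk_s: float = 30.0, stride_s: float = 5.0):
    """Split a long waveform into overlapping fixed-length chunks.

    Each chunk overlaps its neighbours by stride_s seconds on each side,
    so text decoded near chunk edges can later be deduplicated when the
    per-chunk transcripts are stitched back together.
    """
    chunk = int(chunk_s * sr)              # samples per chunk
    step = chunk - 2 * int(stride_s * sr)  # hop between chunk starts
    assert step > 0, "stride too large for chunk length"
    return [samples[start:start + chunk]
            for start in range(0, len(samples), step)]
```

The chunk and stride lengths are the knobs that trade accuracy near boundaries against total compute, which is one reason closed systems with tuned production pipelines still hold an edge here.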
⭐ While closed systems still lead, there’s huge potential for open-source innovation here. Long-form ASR remains one of the most exciting frontiers for the community to tackle next!
Given how fast ASR is evolving, we’re excited to see what new architectures push performance and efficiency, and how the Open ASR Leaderboard continues to serve as a transparent, community-driven benchmark for the field, and as a reference for other leaderboards (Russian, Arabic, and Speech DeepFake Detection).
We’ll keep expanding the Open ASR Leaderboard with more models, more languages, and more datasets, so stay tuned 👀
👉 Want to contribute? Head on over to the GitHub repo to open a pull request 🚀
AI-Generated Content
This content is an AI-generated summary, translation, and analysis of the original Hugging Face Blog post. Copyright belongs to the original authors; please refer to the original post for the authoritative version.