arXiv · Key Papers · 2026. 04. 25. 00:42

A Study of LLM-Based Semantic Evaluation Methodology for STT

Summary

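Automatic speech recognition (ASR) has traditionally been evaluated with word error rate (WER), a metric that is limited in its ability to capture meaning. This paper examines semantic evaluation of ASR using decoder-based large language models (LLMs) through three approaches: selecting the best hypothesis, computing semantic distance with generative embeddings, and qualitatively classifying errors. On the HATS dataset, the best LLMs reached 92-94% agreement with human annotators in hypothesis selection, compared with 63% for WER, and also outperformed existing semantic metrics. The study moves the paradigm of ASR evaluation from the word level toward 'meaning', …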

Key Points

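LLM-based evaluation tracks human perception more closely than traditional WER: in best-hypothesis selection, the best LLMs agreed with human annotators 92-94% of the time, against 63% for WER.
Embeddings from decoder-based LLMs perform on par with encoder models, which makes them a viable option for ASR evaluation.
ASR evaluation should go beyond counting word-level errors and use generative LLMs to assess output from a semantic perspective.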
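
The second point rests on turning a reference transcript and an ASR hypothesis into vectors and measuring how far apart they are. The sketch below is a minimal illustration under stated assumptions, not the paper's procedure: it assumes sentence vectors are obtained by mean-pooling per-token vectors (one common choice; the summary does not say how the generative embeddings are built) and that distance is cosine distance. The names `mean_pool` and `cosine_distance` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(token_vectors: Sequence[Vector]) -> List[float]:
    """Average per-token vectors into one sentence vector (assumed pooling)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[k] for vec in token_vectors) / count for k in range(dim)]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy 2-dimensional token vectors standing in for LLM hidden states
# (hypothetical values, chosen only to make the arithmetic easy to follow).
reference_tokens = [[1.0, 0.0], [0.0, 1.0]]
hypothesis_tokens = [[1.0, 0.0], [1.0, 0.0]]

reference_vec = mean_pool(reference_tokens)    # [0.5, 0.5]
hypothesis_vec = mean_pool(hypothesis_tokens)  # [1.0, 0.0]
semantic_distance = cosine_distance(reference_vec, hypothesis_vec)
```

With real models, the token vectors would come from the encoder or decoder being compared; the distance function stays the same, which is what makes the two families directly comparable.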

Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
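
The agreement figures quoted for hypothesis selection can be read as follows: for each reference and pair of candidate hypotheses, a metric picks the candidate it scores as better, and agreement is the fraction of pairs where that pick matches the human choice. The sketch below computes this for WER. It is an illustration under assumptions, not the paper's evaluation code: the three items and their human labels are made up, ties are counted as disagreements, and the helper names are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def pick_with_metric(metric: Callable[[str, str], float],
                     reference: str, hyp_a: str, hyp_b: str) -> str:
    """Return "A" or "B" for the lower-error hypothesis, or "tie"."""
    score_a = metric(reference, hyp_a)
    score_b = metric(reference, hyp_b)
    if score_a < score_b:
        return "A"
    if score_b < score_a:
        return "B"
    return "tie"


def agreement_rate(metric: Callable[[str, str], float],
                   items: Sequence[Tuple[str, str, str, str]]) -> float:
    """Fraction of items where the metric's pick equals the human pick.

    Assumption: a tie never matches a human label, so it counts as a miss.
    """
    hits = sum(1 for ref, a, b, human in items
               if pick_with_metric(metric, ref, a, b) == human)
    return hits / len(items)


# Hypothetical items: (reference, hypothesis A, hypothesis B, human choice).
toy_items = [
    ("turn the lights off", "turn the light off", "turn the lights on", "A"),
    ("she did not call", "she did call", "she didn't call", "B"),
    ("play some music", "play some music", "pay some music", "A"),
]

toy_agreement = agreement_rate(wer, toy_items)
```

In the first toy item both hypotheses have the same WER even though only one flips the meaning, and in the second WER prefers the hypothesis that drops the negation. Cases like these are the kind where a meaning-blind metric is expected to part ways with human judgment.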

AI-generated content
