HuggingFace | Headlines | 2026. 05. 07. 23:39

Fixing Open LLM Leaderboard with Math-Verify

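Summary

This technical article introduces Math-Verify, a new parser that fixes several problems found in the Open LLM Leaderboard's MATH-Hard evaluation. The previous leaderboard had serious flaws: it marked answers wrong when models did not follow the expected answer format, and it failed to parse more complex mathematical expressions (for example parametric equations and matrices). Math-Verify addresses both the format-compliance problem and the parsing of varied mathematical structures, so that the actual math ability of LLMs can be evaluated more fairly and robustly.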

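Key points

Math-Verify is the key tool for improving the accuracy of the Open LLM Leaderboard's MATH-Hard evaluation.
The previous leaderboard misjudged model performance when answers did not follow the expected format (for example, when an introductory sentence was added) or when complex mathematical expressions failed to parse.
Math-Verify fixes the errors at every stage: text extraction, conversion to a symbolic expression (SymPy parsing), and the final comparison.
With this improvement, the actual ability of LLMs to solve math problems can be measured more fairly and reliably.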

Today, we have thoroughly re-evaluated all 3,751 models submitted to the Open LLM Leaderboard using Math-Verify, for a fairer and more robust model comparison!

The Open LLM Leaderboard is the most used leaderboard on the Hugging Face Hub: it compares the performance of open Large Language Models (LLMs) across a range of tasks. One of these tasks, MATH-Hard, is specifically about math problems: it evaluates how well LLMs solve high-school and university-level math problems. It uses the 1,324 highest-difficulty problems (Level 5) of the Hendrycks MATH dataset, spread across 7 topics (precalculus, prealgebra, algebra, intermediate algebra, geometry, counting/probability and number theory), with a 5-shot approach (the model is given 5 examples in the prompt to show how it should answer).

A typical question looks like this:

For all real numbers $r$ and $s$, define the mathematical operation $\#$ such that the following conditions apply: $r\ \#\ 0 = r$, $r\ \#\ s = s\ \#\ r$, and $(r + 1)\ \#\ s = (r\ \#\ s) + s + 1$. What is the value of $11\ \#\ 5$?

The expected answer is:

71
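The value 71 can be checked by unrolling the third rule. Commutativity together with $r\ \#\ 0 = r$ gives $0\ \#\ s = s$, and each increment of $r$ adds $s + 1$, so $11\ \#\ 5 = 5 + 11 \cdot 6 = 71$. A minimal Python sketch of that recursion, restricted to non-negative integer $r$ (the function name `hash_op` is ours, not part of the dataset):

```python
def hash_op(r: int, s: int) -> int:
    """Evaluate r # s for a non-negative integer r using the three rules."""
    if r == 0:
        # r # 0 = r and r # s = s # r together give 0 # s = s # 0 = s.
        return s
    # (r + 1) # s = (r # s) + s + 1, applied with r - 1 in place of r.
    return hash_op(r - 1, s) + s + 1


print(hash_op(11, 5))  # 71
```

Swapping the arguments, `hash_op(5, 11)`, gives the same value, as the commutativity rule requires.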

On the leaderboard, models had to end their answers with a very specific string (following the Minerva-Math paper):

"Final answer is [ANSWER]. I hope it is correct."

The leaderboard would then parse [ANSWER] with SymPy to convert it to a symbolic representation (simplifying the values if needed), and finally compare it with the gold target.
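The three steps (extract, parse, compare) can be sketched as follows. This is an illustrative reconstruction, not the leaderboard's actual code: the regular expression and the function names are assumptions, and `fractions.Fraction` stands in for SymPy parsing so that the snippet runs with the standard library only.

```python
import re
from fractions import Fraction
from typing import Optional

# Assumed pattern for the required closing sentence (case-insensitive).
FINAL_ANSWER = re.compile(
    r"final answer is (.+?)\. I hope it is correct\.", re.IGNORECASE
)


def extract_answer(output: str) -> Optional[str]:
    """Step 1: return the [ANSWER] text, or None if the format is not followed."""
    match = FINAL_ANSWER.search(output)
    return match.group(1).strip() if match else None


def parse_answer(text: str) -> Optional[Fraction]:
    """Step 2: convert the text to an exact number (stand-in for SymPy)."""
    try:
        return Fraction(text.strip("$ "))
    except (ValueError, ZeroDivisionError):
        return None


def is_correct(output: str, gold: str) -> bool:
    """Step 3: compare the parsed prediction with the parsed gold target."""
    extracted = extract_answer(output)
    if extracted is None:
        return False
    predicted = parse_answer(extracted)
    return predicted is not None and predicted == parse_answer(gold)
```

With this sketch, `is_correct("The final answer is 71. I hope it is correct.", "71")` is `True`, while `is_correct("So it's 71", "71")` is `False` even though the value is right, because the extraction step returns nothing.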

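However, users reported a number of issues with this approach.

First, a recurring issue was that some models failed to follow the answer format shown in the examples: they introduced their answer with a different sentence. Because the format was not followed, answers were marked as wrong even when they were actually correct! (This is a problem if what you care about is the model's math ability.)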

📄 Example	❗️Issue	✅ Math-Verify	🛑 Old-Leaderboard
Therefore, the perimeter of one of these triangles is $14 + 7\sqrt{2}$ inches, expressed in simplest radical form.	Failed extraction	`7*sqrt(2) + 14`	None
Therefore, the sum of the infinite geometric series is \(\frac{7}{9}\).	Failed extraction	`7/9`	None
\( p(n) \) and \( p(n+1) \) share a common factor greater than 1 is \(\boxed{41}\).	Failed extraction	`41`	None
So it's \frac{1}{9}	Failed extraction	`1/9`	None
Concluding he has \boxed{5} cars	Failed extraction	`5`	None
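One way to make extraction tolerant of outputs like those in the table above is to try several patterns in order of reliability. The sketch below only illustrates that idea and is not Math-Verify's actual extraction logic; the patterns, their priority and the function name `extract_candidate` are assumptions.

```python
import re
from typing import Optional

# Patterns tried in order; each captures the answer text in group 1.
PATTERNS = [
    # The strict closing sentence required by the old evaluator.
    re.compile(r"final answer is (.+?)\. I hope it is correct\.", re.IGNORECASE),
    # A \boxed{...} answer without nested braces.
    re.compile(r"\\boxed\{([^{}]+)\}"),
    # Inline math delimited by \( ... \).
    re.compile(r"\\\((.+?)\\\)"),
    # Inline math delimited by $ ... $.
    re.compile(r"\$(.+?)\$"),
]


def extract_candidate(output: str) -> Optional[str]:
    """Return the last match of the first pattern that matches, else None."""
    for pattern in PATTERNS:
        matches = pattern.findall(output)
        if matches:
            return matches[-1].strip()
    return None
```

Bare LaTeX without any delimiter (the `So it's \frac{1}{9}` row) is deliberately left out of this sketch and returns `None`; handling it needs a real expression grammar rather than a few regular expressions.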

Next, converting [ANSWER] to a symbolic expression also caused problems, this time related to SymPy parsing:

📄 Example	❗️Issue	✅ Math-Verify	🛑 Old-Leaderboard
The final answer is $2x + 4y + z - 19 = 0$. I hope it is correct.	Partial parse of parametric eq	`Eq(2*x + 4*y + z - 19, 0)`	0
\(23\)	Failed extraction due to latex borders	`23`	None
\((- \infty, -14) \cup (-3, \infty)\).	Failed extraction due to interval	`Union(Interval.open(-oo, -14), Interval.open(-3, oo))`	None
100%	Failed extraction due to invalid symbol	`1`	None
\begin{pmatrix}\frac{1}{50}&\frac{7}{50}\\\frac{7}{50}&\frac{49}{50}\end{pmatrix}	Failed extraction due to Matrix	`Matrix([[1/50, 7/50], [7/50, 49/50]])`	None
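Two rows of this table (the LaTeX delimiters and the percent sign) come down to cleaning the string before it is parsed. Below is a minimal sketch of such a normalization step. The helper names are hypothetical and `fractions.Fraction` is used instead of SymPy; equations, intervals and matrices need a real LaTeX parser and are out of scope here.

```python
from fractions import Fraction

# Delimiter pairs that may wrap an answer.
DELIMITERS = ((r"\(", r"\)"), (r"\[", r"\]"), ("$", "$"))


def strip_math_delimiters(text: str) -> str:
    """Remove surrounding LaTeX math delimiters and a trailing period."""
    text = text.strip().rstrip(".").strip()
    for left, right in DELIMITERS:
        long_enough = len(text) >= len(left) + len(right)
        if long_enough and text.startswith(left) and text.endswith(right):
            return text[len(left):len(text) - len(right)].strip()
    return text


def parse_number(text: str) -> Fraction:
    """Parse a plain number, treating a trailing % (or \\%) as division by 100."""
    cleaned = strip_math_delimiters(text)
    if cleaned.endswith("%"):
        return Fraction(cleaned[:-1].rstrip("\\").strip()) / 100
    return Fraction(cleaned)
```

Under these assumptions `parse_number(r"\(23\)")` yields 23 and `parse_number("100%")` yields 1, matching the Math-Verify column for those two rows.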

Finally, comparing the extracted answer with the target expression also ran into several issues:

📄 Example	❗️Issue	✅ Math-Verify	🛑 Old-Leaderboard
1/3 == 0.333333	No rounding support	True	False
...
All of these issues are now completely fixed with the new Math-Verify parser!
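The rounding row of the comparison table (`1/3 == 0.333333`) can be illustrated with exact rational arithmetic. The rule used here, rounding the gold value to as many decimal places as the prediction shows, is an assumption chosen for the sketch and not necessarily the rule Math-Verify applies; the function names are ours.

```python
from fractions import Fraction


def decimal_places(text: str) -> int:
    """Number of digits after the decimal point in a decimal string."""
    return len(text.split(".")[1]) if "." in text else 0


def matches_with_rounding(gold: str, predicted: str) -> bool:
    """True if the values are equal, or equal once the gold value is rounded
    to the number of decimal places shown in the prediction."""
    gold_value = Fraction(gold)
    predicted_value = Fraction(predicted)
    if gold_value == predicted_value:
        return True
    places = decimal_places(predicted)
    return places > 0 and round(gold_value, places) == predicted_value


print(matches_with_rounding("1/3", "0.333333"))  # True
print(matches_with_rounding("1/3", "0.34"))      # False
```

A strict equality check, as in the old evaluator's column, would reject `0.333333` because it is not exactly equal to 1/3.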

As all these issues tend to accumulate, some models deeply suffered from this, and their performance was strongly underestimated… so we removed the previous evaluator and added Math-Verify, which was as simple as changing only 3 lines of code! (You can try it too on your math evals!)
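For reference, here is a minimal usage sketch built on the `parse` and `verify` functions of the Math-Verify package (installed with `pip install math-verify`). The wrapper name `is_correct` is ours, and the import sits inside the function only so that the snippet can be loaded without the package installed.

```python
def is_correct(gold_text: str, answer_text: str) -> bool:
    """Parse both strings with Math-Verify and check that they match."""
    from math_verify import parse, verify  # requires: pip install math-verify

    gold = parse(gold_text)
    answer = parse(answer_text)
    # Argument order matters: the gold target comes first.
    return verify(gold, answer)
```

For example, `is_correct("${1,3} \\cup {2,4}$", "${1,2,3,4}$")` compares two ways of writing the same set. How this call is wired into a particular evaluation harness depends on that harness and is not shown here.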

This therefore meant re-evaluating all submitted models since June… and it completely overhauled the top 20 models on the MATH subset of the leaderboard.

On average, models solved 61 more problems, equating to a 4.66-point increase across the board!
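As a rough consistency check, the two figures can be related through the 1,324 problems in MATH-Hard. The sketch below assumes the score is simply the percentage of problems solved. Under that assumption 61 extra problems correspond to about 4.6 points, and 4.66 points to about 61.7 problems, so the two reported numbers agree up to rounding.

```python
TOTAL_PROBLEMS = 1324  # size of MATH-Hard, as stated earlier in the article


def problems_to_points(problems: float) -> float:
    """Convert a number of solved problems into percentage points."""
    return 100 * problems / TOTAL_PROBLEMS


def points_to_problems(points: float) -> float:
    """Convert percentage points into a number of solved problems."""
    return points * TOTAL_PROBLEMS / 100


print(round(problems_to_points(61), 2))    # 4.61
print(round(points_to_problems(4.66), 1))  # 61.7
```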

The two subsets that showed the most significant improvement were both algebra-related (Algebra and Prealgebra) with gains of 8.27 and 6.93, respectively. In extreme cases, some models demonstrated improvements of nearly 90 points on these subsets.
We believe these subsets saw the greatest improvement because they frequently involve answers presented as sets (due to questions with multiple solutions) and matrices. Math-Verify handles both answer types better, which contributed to these notable gains.

We initially discovered the math evaluation issues when inspecting Qwen models, which had unusually low scores compared to their self-reported performance. After the introduction of Math-Verify, the scores of these models more than doubled, showing how severely their performance had been underestimated.

But Qwen models aren't alone. Another major family affected is DeepSeek. After switching to Math-Verify, DeepSeek models almost tripled their scores! This is because their answers are typically wrapped in \boxed{} notation, which the old evaluator couldn't extract.

As mentioned at the beginning, the Top 20 rankings have undergone a significant shift, with Nvidia's AceMath models now dominating the MATH-Hard leaderboard.
Other major beneficiaries of this change are the Qwen derivatives, which are now almost exclusively the only models ranking right below AceMath.
Following is the complete table comparing the old and new Top 20 leaderboard rankings:

Finally, we examined how the overall Leaderboard results have evolved. While the top four positions remain unchanged, the rest have undergone significant shifts. Due to the rise of multiple Qwen derivatives in the MATH subset, the presence of Qwen-derived models among the top 20 has grown even further in the overall results.

Many other models also completely jumped in the rankings, gaining 200 places or more! You can check out the results in more detail at the Open LLM Leaderboard.

The introduction of Math-Verify has significantly improved the accuracy and fairness of our evaluations on the Open LLM Leaderboard. This has led to a reshuffling of the leaderboard, with many models showing substantial improvements in their scores.

We encourage all developers and researchers to adopt Math-Verify for their own math evaluations. By doing so, you can make sure your models are evaluated more reliably. Additionally, we invite you to explore the updated rankings and see how your favorite models have changed in performance.
