Measuring AI's Scientific Research Capabilities: New Benchmark 'FrontierScience' Released
Summary
OpenAI has announced FrontierScience, a new benchmark for evaluating AI's expert-level scientific reasoning. The benchmark measures expert-level knowledge and problem-solving across three fields: physics, chemistry, and biology. It moves beyond the limitations of earlier, largely multiple-choice benchmarks and consists of two tracks: Olympiad-style reasoning (Olympiad) and real-world research task performance (Research). In initial evaluations, GPT-5.2 posted strong scores, raising expectations for AI-accelerated science.
Key Points
- FrontierScience is a new scientific reasoning benchmark created by experts across three fields: physics, chemistry, and biology.
- It consists of two tracks: the Olympiad track measures theoretical reasoning at the level of international competitions, while the Research track evaluates the ability to carry out PhD-level research tasks.
- GPT-5.2 improved substantially over GPT-4 on the GPQA benchmark and also performed strongly on FrontierScience (Olympiad: 77%, Research: 25%).
- The benchmark's goal is to measure AI's ability to accelerate complex, open-ended scientific research workflows, beyond simple knowledge retrieval.
Evaluating AI’s ability to perform scientific research tasks
We introduce FrontierScience, a new benchmark that evaluates AI capabilities for expert-level scientific reasoning across physics, chemistry, and biology.
Reasoning is at the core of scientific work. Beyond recalling facts, scientists generate hypotheses, test and refine them, and synthesize ideas across fields. As our models become more capable, the central question is how they can reason deeply to contribute to scientific research.
Over the last year, our models have reached major milestones, including achieving gold-medal performance at the International Math Olympiad and the International Olympiad in Informatics. In parallel, we’re starting to see our most capable models, such as GPT‑5, meaningfully accelerate real scientific workflows. Researchers are using these systems for tasks such as searching the literature across disciplines and languages and working through complex mathematical proofs. In many cases, the model shortens work that might have taken days or weeks to hours. This progress is documented in our paper Early science acceleration experiments with GPT‑5, released in November 2025, which presents early evidence that GPT‑5 can measurably accelerate scientific workflows.
As accelerating scientific progress is one of the most promising opportunities for AI to benefit humanity, we’re improving our models on difficult math and science tasks and working on the tools that will help scientists get the most from them.
When GPQA, a “Google-Proof” science benchmark of questions written by PhD experts, was released in November 2023, GPT‑4 scored 39%, below the expert baseline of 70%. Two years later, GPT‑5.2 scored 92%. As models’ reasoning and knowledge capabilities continue to scale, more difficult benchmarks will be important to measure and forecast models’ ability to accelerate scientific research. Prior scientific benchmarks largely focus on multiple-choice questions, are saturated, or are not centrally focused on science.
To bridge this gap, we’re introducing FrontierScience: a new benchmark built to measure expert-level scientific capabilities. FrontierScience is written and verified by experts across physics, chemistry, and biology, and consists of hundreds of questions designed to be difficult, original, and meaningful. FrontierScience includes two tracks of questions: Olympiad, which measures Olympiad-style scientific reasoning capabilities, and Research, which measures real-world scientific research abilities. Providing more insight into models’ scientific capabilities helps us track progress and advance AI-accelerated science.
In our initial evaluations, GPT‑5.2 is our top performing model on FrontierScience-Olympiad (scoring 77%) and Research (scoring 25%), ahead of other frontier models. We’ve seen substantial progress on solving expert-level questions while leaving headroom for more progress, especially on open-ended research-style tasks. For scientists, this suggests that current models can already support parts of research that involve structured reasoning, while highlighting that significant work remains to improve their ability to carry out open-ended thinking. These results align with how scientists are already using today’s models: to accelerate research workflows while relying on human judgment for problem framing and validation, and increasingly to explore ideas and connections that would otherwise take much longer to uncover—including, in some cases, contributing new insights that experts then evaluate and test.
In the end, the most important benchmark for the scientific capabilities of AI is the novel discoveries it helps generate; those are what ultimately matter to science and society. FrontierScience sits upstream of that. It gives us a north star for expert-level scientific reasoning, letting us test models on a standardized set of questions, see where they succeed or fail, and identify where we need to improve them. FrontierScience is narrow and has limitations in key respects (for example, focusing on constrained, expert-written problems) and does not capture everything scientists do in their everyday work. But the field needs more difficult, original, and meaningful science benchmarks, and FrontierScience provides a step forward in this direction.
Benchmark Structure and Creation
The full FrontierScience evaluation spans over 700 textual questions (with 160 in the gold set) covering subfields across physics, chemistry, and biology. The benchmark is composed of an Olympiad and a Research split. FrontierScience-Olympiad contains 100 questions designed by international olympiad medalists to assess scientific reasoning in a constrained, short answer format. The Olympiad set was designed to contain theoretical questions at least as difficult as problems at international olympiad competitions. FrontierScience-Research consists of 60 original research subtasks designed by PhD scientists (doctoral candidates, professors, or postdoctoral researchers) that are graded using a 10-point rubric. The Research set was created to contain self-contained, multi-step subtasks at the level of difficulty that a PhD scientist might encounter during their research.
The Olympiad questions were created in collaboration with 42 former international medalists or national team coaches in the relevant domains, totalling 109 olympiad medals. The research questions were created in collaboration with 45 qualified scientists and domain experts. All scientists were either doctoral candidates, post-doctoral researchers, or professors. Their areas of expertise spanned an array of specialized and important scientific disciplines, from quantum electrodynamics to synthetic organic chemistry to evolutionary biology.
The task creation process for both sets included some selection against OpenAI internal models (e.g., discarding tasks that models answered correctly, so we expect the evaluation to be somewhat biased against these models relative to others). We open-source the Olympiad gold set of 100 questions and Research gold set of 60 questions, holding out the other questions to track contamination.
Grading Methodology
The Olympiad set is gradable with a short answer (a number, an expression, or a fuzzy string match), which helps with verifying correctness. However, this verifiability often trades off with the expressivity and open-endedness of the problem. For the Research set, we introduce a rubric-based architecture for grading more open-ended tasks. Each question includes a scoring rubric with multiple independent and objectively assessable items, totaling 10 points. The grading rubric assesses not only the accuracy of the final answer, but also the correctness of intermediate reasoning steps, allowing for nuanced model performance and failure analysis. A solution is considered “correct” if it’s awarded at least 7/10 rubric points.
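As a minimal sketch of the scoring rule described above (the rubric item names and point values here are invented for illustration; they are not from the benchmark):

```python
# Sketch of the FrontierScience-Research scoring rule: each rubric item is
# independently assessable, the items total 10 points, and a solution counts
# as "correct" when it earns at least 7 of those points.
# Item names and point values below are hypothetical.

def score_rubric(items: dict[str, int], awarded: set[str]) -> tuple[int, bool]:
    """Return (points earned, whether the solution counts as correct)."""
    total = sum(items.values())
    assert total == 10, "Research rubrics total 10 points"
    earned = sum(pts for name, pts in items.items() if name in awarded)
    return earned, earned >= 7

# Hypothetical rubric for a single research subtask.
rubric = {
    "states correct governing equation": 2,
    "justifies each approximation": 2,
    "correct intermediate derivation": 3,
    "correct final numerical answer": 3,
}

earned, correct = score_rubric(rubric, {
    "states correct governing equation",
    "justifies each approximation",
    "correct intermediate derivation",
})
print(earned, correct)  # 7 True
```

Note that a response can miss the final answer entirely and still count as correct if enough intermediate reasoning items are awarded, which is what enables the failure analysis the post describes.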
Responses are evaluated by a model-based grader (GPT‑5) against either the short answer or the rubric criteria. While we’d ideally use an expert human to grade each response, this approach is not scalable, so we designed the rubric to be checkable using a model grader. We developed a verification pipeline to help ensure rubrics and questions were well-calibrated to difficulty and correctness.
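A sketch of how a model grader's per-item verdicts might be turned into a rubric score. The JSON schema here (a list of objects with "item", "points", and "awarded" fields) is an assumption for illustration; the post does not describe the grader's actual output format.

```python
import json

# Hypothetical: the model grader returns one verdict per rubric item,
# and the pipeline sums the points of awarded items.

def parse_grader_output(raw: str) -> int:
    """Sum points for rubric items the grader marked as awarded."""
    verdicts = json.loads(raw)
    return sum(v["points"] for v in verdicts if v["awarded"])

# Example grader response (invented for illustration).
raw = json.dumps([
    {"item": "correct governing equation", "points": 2, "awarded": True},
    {"item": "justifies approximations", "points": 2, "awarded": False},
    {"item": "intermediate derivation", "points": 3, "awarded": True},
    {"item": "final numerical answer", "points": 3, "awarded": True},
])
score = parse_grader_output(raw)
print(score, score >= 7)  # 8 True
```

Keeping each verdict independently checkable is what makes the rubric tractable for a model grader in place of a human expert, as the verification pipeline described above requires.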
Evaluation Results
We evaluated several frontier models on FrontierScience-Olympiad and FrontierScience-Research: GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro, GPT‑4o, OpenAI o4-mini, and OpenAI o3. All reasoning models were evaluated at “high” reasoning effort, with the exception of GPT‑5.2, which was evaluated at “xhigh”. GPT‑5.2 is our top performing model on both tracks, scoring 77% on FrontierScience-Olympiad and 25% on Research, ahead of other frontier models. Gemini 3 Pro is comparable to GPT‑5.2 on the Olympiad set (scoring 76%).
We’ve seen substantial progress on solving expert-level questions, but there is still room to grow, especially on open-ended research-style tasks: in analyzing failure transcripts, we found that frontier models sometimes made reasoning, logic, and calculation errors, misunderstood niche scientific concepts, and introduced factual inaccuracies.
While FrontierScience represents a step forward in difficulty of scientific benchmarks, there are still many limitations. FrontierScience is composed of questions with a constrained problem statement, which focuses on evaluating the final answer (Olympiad) or evaluating the reasoning to complete a research task (Research). In addition, using rubrics with multiple components on longer tasks is less objective than checking the final answer.
FrontierScience offers a higher resolution snapshot of models’ reasoning on difficult, expert-written questions, but not a full picture of how science gets done in practice. In particular, it does not assess a significant part of scientific research: how models generate genuinely novel hypotheses, or interact with multiple modalities, including video data and real experimental systems in the physical world.
Looking ahead, we expect progress in scientific reasoning to come from both better general-purpose reasoning systems and focused effort on improving scientific capabilities. FrontierScience is one tool among many, and as models improve, we plan to iterate on this benchmark, expand it to new domains, and pair it with more real-world evaluations that look at what these systems actually enable scientists to do. Benchmarks like FrontierScience help us understand the weaknesses of today’s AI systems so we can focus our work on making models reliable partners in scientific discovery.
AI-Generated Content
This content was automatically summarized, translated, and analyzed by AI from the original OpenAI Blog post. Copyright remains with the original author; please consult the original article for accurate details.
Read the original article