OpenAI Headline 2026. 05. 06. 22:20

Decoding Large-Scale AI Training Networks with MRC (Multipath Reliable Connection)

Summary

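This article introduces MRC (Multipath Reliable Connection), a high-reliability supercomputer networking technology essential for training large AI models. OpenAI is releasing the new protocol through the Open Compute Project (OCP) and says it substantially improves GPU networking performance and resilience. MRC combines a multi-plane high-speed network design, adaptive packet spraying, and static source routing to minimize training delays caused by network congestion and failures and to help AI systems scale.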

Key Points

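MRC (Multipath Reliable Connection) is a new protocol developed to improve GPU networking performance and resilience in large AI training clusters.
The protocol makes it possible to build multi-plane high-speed networks that provide redundancy against network failures, and its adaptive packet spraying virtually eliminates core congestion.
MRC uses static source routing to bypass path failures and eliminate whole classes of routing failure, minimizing the risk of interrupted training.
By publishing the specification through OCP, OpenAI is making a strategic move to share standards at a key infrastructure layer of AI systems and improve overall scalability and reliability.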

Frontier model training depends on reliable supercomputer networks that can quickly move data between GPUs. To make this faster and more efficient, OpenAI has partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to develop MRC (Multipath Reliable Connection): a novel protocol that improves GPU networking performance and resilience in large training clusters. We released MRC today through the Open Compute Project (OCP) to enable the broader industry to use it.

With more than 900M people using ChatGPT every week, our systems are becoming core infrastructure for AI, helping people and businesses around the world build with increasingly capable models. Prior to the inception of Stargate, we co-developed, brought up, and maintained our first three generations of supercomputers with great care and close collaboration with our partners over the span of a few years. This invaluable experience informed our strong belief that, to efficiently use compute at the scale of Stargate and succeed in our mission, we need to rethink and drastically reduce complexity in every layer of the stack – including network design.

Publishing the MRC specification is part of OpenAI’s overall compute strategy: shared standards in key infrastructure layers can help scale AI systems more efficiently, reliably, and across a broader partner ecosystem. In this post, we’ll cover the design of MRC, including: i) how it enables us to build multi-plane high-speed networks to create redundancy to ride out network failures, while using fewer components and less power ii) how MRC’s adaptive packet spraying virtually eliminates core congestion and iii) how our deployments use static source routing to bypass failures and eliminate whole classes of routing failure. In concert, these benefits allow us to deliver better models to everyone faster.

When training large AI models, a single step can involve many millions of data transfers. One transfer arriving late can ripple through the entire job, potentially causing GPUs to sit idle. Network congestion, link, and device failures are the most common sources of delay and jitter in transfers.

These problems get more frequent, and harder to solve, as the size of the cluster increases. This makes networking technology a key part of the design of Stargate.

To enable the current scale of Stargate supercomputers, we faced two key networking challenges. First, whenever possible, we should minimize the possibility of network congestion. There are unavoidable bottlenecks, such as two GPUs sending to the same destination at the same time. But outside of these cases, we should avoid congestion through design.

Second, we need to minimize the effect of network failures on the training job itself. At large enough scale, even the best network will have a constant background level of link and switch failures. Previously, a single failure would often cause a training job to crash, forcing a restart from a saved checkpoint, or stall progress for many seconds while the network recomputed routes. Such interruptions are costly in both GPU cycles and time. With synchronous pretraining – where many GPUs across many computers cooperate in lockstep to train one AI model – this is especially true. The larger the job we run, the greater the impact of any single link flap or failure. These workloads act as a form of "failure amplifier," so preventing this has become critical.

Our goal was not just to build a fast network, but one that delivers very predictable performance even under failure, so that training jobs can keep running.

To achieve this reliability, we have spent the past two years working with AMD, Broadcom, Intel, Microsoft, and NVIDIA to develop a new way of building and operating networks. The result of this effort is what we call Multipath Reliable Connection, or MRC. It is a new network protocol built into the latest 800Gb/s network interfaces that spreads a single transfer across hundreds of paths, routes around failures in microseconds, and lets us run a simpler network control plane.

MRC extends RDMA over Converged Ethernet (RoCE), an InfiniBand Trade Association (IBTA) standard that enables hardware-accelerated remote direct memory access between GPUs and CPUs. MRC builds on techniques developed in the Ultra Ethernet Consortium (UEC) and adds SRv6-based source routing to support large-scale AI networking fabrics.

MRC is already deployed across all of OpenAI's largest NVIDIA GB200 supercomputers that we use to train frontier models. This includes the site in Abilene, Texas, with Oracle Cloud Infrastructure (OCI), and Microsoft's Fairwater supercomputers. MRC has been used to train multiple OpenAI models on hardware from NVIDIA and Broadcom. Today, the MRC specification is available to the community as an Open Compute Project (OCP) contribution. We have also co-authored a paper detailing our experience: "Resilient AI Supercomputer Networking using MRC and SRv6".

To build a highly robust network, we need to start with a network topology that has enough natural redundancy for every flow to get good performance even when links or switches fail.

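Instead of treating each network interface as one 800Gb/s link, we split it into multiple smaller links. For example, one interface can connect to eight different switches. We can then build eight separate parallel networks, or planes, each running at 100Gb/s, rather than a single 800Gb/s network.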

This change has a big effect on the shape of the cluster. A switch that can connect 64 ports at 800Gb/s can instead connect 512 ports at 100Gb/s. This lets us build a network that fully connects about 131,000 GPUs with only two tiers of switches. A conventional 800Gb/s network would need three or four tiers.
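The port and GPU counts above can be checked with a little arithmetic. The sketch below assumes a standard non-blocking two-tier (leaf-spine) layout in which each tier-0 switch uses half of its ports for endpoints and half for uplinks, and one network interface per GPU; the article does not state these details, and the function name `two_tier_endpoints` is purely illustrative.

```python
def two_tier_endpoints(radix: int) -> int:
    """Maximum endpoints in a non-blocking two-tier network (assumed layout).

    Each tier-0 switch splits its ports evenly: radix/2 face endpoints and
    radix/2 face tier-1 switches. A tier-1 switch has `radix` ports, so it
    can reach at most `radix` tier-0 switches.
    """
    down_ports_per_tier0 = radix // 2
    max_tier0_switches = radix
    return down_ports_per_tier0 * max_tier0_switches


# Total switch bandwidth is the same either way: 64 ports x 800 Gb/s.
switch_capacity_gbps = 64 * 800
ports_at_100g = switch_capacity_gbps // 100

print(ports_at_100g)                      # 512
print(two_tier_endpoints(ports_at_100g))  # 131072 endpoints per plane
print(two_tier_endpoints(64))             # 2048 with 64 x 800 Gb/s ports
```

Under these assumptions a two-tier plane built from 512-port switches reaches 131,072 endpoints, which matches the "about 131,000 GPUs" figure, while the same switch used as 64 ports of 800Gb/s tops out at 2,048 endpoints in two tiers and therefore needs extra tiers.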

As a result, the network costs less, uses less power, and offers more path diversity than conventional network designs. It can also keep more traffic local to the tier-0 switches, which improves performance.

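However, fully exploiting all of this path diversity is difficult. The traditional network protocols commonly used for AI training require each transfer to follow a single path so that packets arrive in order. In a large multi-plane network this causes two problems: different flows collide on the same link and create congestion, and each flow can use only one of the available planes. If we changed nothing else, a multi-plane network would suffer severe congestion and poor overall performance.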

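MRC fundamentally changes this model. Instead of assigning a transfer to one path, MRC sprays the packets of a single transfer across hundreds of paths through the network, spanning all of the planes. Packets may arrive out of order, but every MRC packet carries its final memory address, so the destination can place packets in memory as they arrive.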
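To illustrate why out-of-order arrival is harmless when each packet carries its own destination address, here is a minimal sketch. The packet layout, the `spray` and `deliver` helpers, and the random path choice are all assumptions made for illustration; they are not the MRC wire format or its actual path-selection logic.

```python
import random


def spray(message: bytes, base_addr: int, mtu: int, paths: list, rng) -> list:
    """Split a transfer into packets, each tagged with a path and a memory address."""
    packets = []
    for offset in range(0, len(message), mtu):
        packets.append({
            "path": rng.choice(paths),       # hypothetical per-packet path choice
            "addr": base_addr + offset,      # final memory address for this chunk
            "payload": message[offset:offset + mtu],
        })
    return packets


def deliver(packets: list, memory: bytearray) -> None:
    """Write each packet straight to its address, in whatever order it arrives."""
    for p in packets:
        memory[p["addr"]:p["addr"] + len(p["payload"])] = p["payload"]


rng = random.Random(0)
paths = [f"plane{plane}/path{i}" for plane in range(8) for i in range(4)]
message = bytes(range(256)) * 4
base_addr = 128

packets = spray(message, base_addr, mtu=64, paths=paths, rng=rng)
rng.shuffle(packets)  # simulate reordering across different paths

memory = bytearray(base_addr + len(message))
deliver(packets, memory)
assert bytes(memory[base_addr:]) == message  # reassembled without any reorder buffer
```

Because the receiver never has to wait for a missing earlier packet before placing a later one, the sender is free to use as many paths as it likes.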

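Each MRC connection keeps a small amount of state about the many paths it uses. If it detects congestion on a path, it swaps that path for another, evening out load across the network. If a packet is lost, MRC takes the safe option: it assumes that part of the path may have failed, stops using it immediately, and retransmits the lost packet. After MRC retires a path, it sends probe packets to check whether the path has actually failed and, if so, whether it has recovered.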
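The per-connection path handling described above can be sketched as a small state machine. The `PathSet` class, its three pools (active, spare, retired), and the method names are hypothetical; the article says only that a connection keeps a small amount of per-path state, swaps congested paths, retires lossy ones, and probes retired paths.

```python
from dataclasses import dataclass, field


@dataclass
class PathSet:
    """Illustrative per-connection path state (not the MRC data structure)."""
    active: list
    spare: list
    retired: set = field(default_factory=set)

    def on_congestion(self, path) -> None:
        # A congested path is still healthy: swap it out and keep it as a spare.
        if path in self.active and self.spare:
            self.active.remove(path)
            self.active.append(self.spare.pop(0))
            self.spare.append(path)

    def on_loss(self, path) -> None:
        # Safe option: assume the path failed and stop using it immediately.
        # The caller is expected to retransmit the lost packet on another path.
        if path in self.active:
            self.active.remove(path)
            self.retired.add(path)
            if self.spare:
                self.active.append(self.spare.pop(0))

    def on_probe_ok(self, path) -> None:
        # A probe came back: the path works (again), so make it available.
        if path in self.retired:
            self.retired.remove(path)
            self.spare.append(path)


ps = PathSet(active=[0, 1, 2, 3], spare=[4, 5])
ps.on_congestion(1)   # path 1 swapped for spare 4
ps.on_loss(2)         # path 2 retired, spare 5 takes over
ps.on_probe_ok(2)     # probe succeeds, path 2 becomes a spare again
print(ps.active, ps.spare, ps.retired)  # [0, 3, 4, 5] [1, 2] set()
```

The key design point is that loss and congestion are handled differently: congestion only rebalances load, while loss removes the path from use until a probe confirms it works.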

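But failures are not the only cause of packet loss; another common cause is congestion at the destination. MRC handles this with packet trimming. When a switch would otherwise drop a packet because of congestion, it trims off the payload and forwards only the header to the destination, which triggers an explicit retransmission request. Packet trimming reduces the false positives that arise when we wrongly assume that a loss means a path has failed.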
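A toy model of trimming may help show why it distinguishes congestion from failure. The `Packet` type, the fixed queue `capacity`, and the `switch_forward` and `receiver` functions are assumptions for illustration only; real switches trim based on queue occupancy and forward trimmed headers at high priority.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int
    payload: bytes
    trimmed: bool = False


def switch_forward(burst: list, capacity: int) -> list:
    """Forward up to `capacity` full packets; trim the rest instead of dropping."""
    out = []
    for i, pkt in enumerate(burst):
        if i < capacity:
            out.append(pkt)
        else:
            # Queue is full: keep the header, discard the payload.
            out.append(Packet(pkt.seq, b"", trimmed=True))
    return out


def receiver(arrived: list):
    """Return (sequence numbers received intact, sequence numbers to re-request)."""
    received = [p.seq for p in arrived if not p.trimmed]
    retransmit = [p.seq for p in arrived if p.trimmed]
    return received, retransmit


burst = [Packet(seq, b"x" * 64) for seq in range(6)]
arrived = switch_forward(burst, capacity=4)
received, retransmit = receiver(arrived)
print(received)    # [0, 1, 2, 3]
print(retransmit)  # [4, 5]
```

Because the headers of packets 4 and 5 still arrive, the receiver can ask for exactly those packets, and the sender learns that the path is congested rather than broken, so it has no reason to retire it.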

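Together, this combination of multi-plane topology, spraying, load balancing, and trimming means that an MRC connection can detect and route around network failures on microsecond timescales, minimizing the impact on synchronous training jobs. By contrast, conventional network fabrics can take seconds or even tens of seconds to route around a failure.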

MRC lets us go one step further in simplifying the network.

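Traditionally, switches run a dynamic routing protocol such as BGP (Border Gateway Protocol) to compute the available paths and route around failures. But switches are complex devices running complex software. When they fail in subtle ways, the resulting problems are hard to diagnose, cause connectivity failures, and persist until they are fixed.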

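With MRC, dynamic routing becomes much less necessary. If packets are lost on a path, MRC simply stops using that path. We took a more radical approach: we disabled dynamic routing and used IPv6 Segment Routing (SRv6) instead. SRv6 lets the sender directly specify the path each packet should take through the network. It does this by embedding a sequence of switch identifiers in each packet's destination address.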

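To break this down: when forwarding a packet, a switch checks whether its own identifier is present. If so, it removes that identifier by shifting the destination address so that the next switch's identifier is exposed. The switch then looks up this identifier in a static routing table to decide where to send the packet next. Unlike dynamic routing, this static routing table is set up when the switch is first configured and never changes afterward.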
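The shift-and-lookup step can be sketched as follows. This is a deliberate simplification: the destination address is modeled as a tuple of identifiers rather than a 128-bit IPv6 address, and the switch names, port names, and table contents are invented for the example.

```python
def forward(switch_id: str, static_table: dict, dest: tuple):
    """Return (output_port, updated_dest) for one hop.

    If this switch's identifier is at the front of the destination, shift it
    off so the next identifier is exposed, then look that identifier up in
    the switch's fixed routing table.
    """
    if dest and dest[0] == switch_id:
        dest = dest[1:]
    next_id = dest[0]
    return static_table[next_id], dest


# Static tables, written once at configuration time and never updated.
tables = {
    "t0a": {"t1x": "up3"},      # tier-0 switch: uplink toward tier-1 switch t1x
    "t1x": {"t0b": "down12"},   # tier-1 switch: downlink toward tier-0 switch t0b
    "t0b": {"nic7": "host7"},   # tier-0 switch: port facing the destination NIC
}

# The sender chooses the whole path by encoding it in the destination.
dest = ("t0a", "t1x", "t0b", "nic7")
trace = []
for switch_id in ("t0a", "t1x", "t0b"):
    port, dest = forward(switch_id, tables[switch_id], dest)
    trace.append((switch_id, port))

print(trace)  # [('t0a', 'up3'), ('t1x', 'down12'), ('t0b', 'host7')]
print(dest)   # ('nic7',)
```

Choosing a different path is just a matter of the sender writing a different identifier sequence, for example a different tier-1 switch in the second position; no switch state changes.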

MRC uses SRv6 to spray packets across all network planes and to use many paths within each plane at the same time. If a path fails, MRC simply stops using it. Switches never need to recompute routes or do anything other than blindly follow their configured static routes.

Our training networks have millions of links. While these networks are of high quality, at sufficient scale some link flaps are inevitable. During training, we have observed cases of multiple link flaps each minute between tier-0 and tier-1 switches, but MRC ensured that they had no measurable impact on our synchronous pretraining jobs. In fact, their impact was small enough that we did not even need to prioritize the immediate repair of those links.

It's not just links that can fail. During training of a recent frontier model for ChatGPT and Codex, we had to reboot four tier-1 switches. Previously, rebooting a switch would have required the operations team to be very careful not to disrupt training. With MRC, we didn't even need to coordinate with the teams running training jobs in the cluster. The same is true for many link repairs. We used to coordinate with operations teams to disable a link when maintenance work needed to happen. Now we can repair links while they are still in service. If a link is working well enough, MRC will use it. If not, MRC avoids it until it is fixed.

Before MRC, if a link between a GPU's network interface and a tier-0 switch failed, the training job would fail. With MRC, the job survives with reasonable performance. If an 8-port network interface loses one port, the maximum rate is reduced by one eighth. MRC detects this, recalculates paths to avoid the failed plane, and immediately tells peers not to use that plane for inbound traffic. Most failed links recover within a minute, at which point MRC brings the plane back into use.
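The one-eighth figure follows directly from the plane count. The sketch below assumes eight 100Gb/s planes per interface, as in the example earlier in the post, and models "telling peers not to use that plane" as taking the intersection of the planes that are up at both ends; the helper names are illustrative, not part of MRC.

```python
from fractions import Fraction

PLANES = 8
PER_PLANE_GBPS = 100


def usable_planes(local_up: set, peer_up: set) -> set:
    """A flow can only use planes whose link is up at both endpoints."""
    return local_up & peer_up


def max_rate_gbps(planes: set) -> int:
    return len(planes) * PER_PLANE_GBPS


sender_up = set(range(PLANES))
receiver_up = set(range(PLANES)) - {3}   # receiver's link on plane 3 has failed

usable = usable_planes(sender_up, receiver_up)
lost = Fraction(PLANES - len(usable), PLANES)

print(max_rate_gbps(usable))  # 700
print(lost)                   # 1/8
```

When the failed link recovers, plane 3 rejoins `receiver_up` and the maximum rate returns to 800Gb/s.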

The slowdown caused by losing a GPU interface link has differed across training jobs, but in practice it tends to be significantly less than the amount of physical capacity lost.

MRC ultimately gives us three critical advantages when scaling our supercomputers.

First, it lets us build multi-plane high-speed networks for supercomputers with over 100,000 GPUs using only two tiers of Ethernet switches. This gives us enough redundancy to ride out network failures, while using less power than equivalent three- or four-tier single-plane networks.

Second, MRC's adaptive packet spraying load-balances well enough that we see essentially no congestion in the core of the network. This greatly reduces variation in throughput between flows during synchronous training, where eliminating outliers is central to performance. It also means that when multiple jobs share the cluster, they do not impact one another's performance.

Last, MRC uses SRv6 source routing to bypass failures quickly and send packets only over working paths. This lets us run a simple static network control plane and eliminate whole classes of dynamic routing failure behavior.

MRC has markedly advanced our ability to train new frontier models and ensure our networks keep pace with our researchers' ambitious AI roadmap. It delivers a significant improvement over previous approaches and helps accelerate our goal of bringing the benefits of AGI to everyone, reliably. We're proud of the cross-industry collaboration that made it possible.
