OpenAI Open Sources MRC Networking for Stargate-Scale AI
OpenAI just opened up one of its most important pieces of behind-the-scenes infrastructure. The company announced MRC (Multipath Reliable Connection), a new networking protocol it co-developed with AMD, Broadcom, Intel, Microsoft, and NVIDIA to keep tens of thousands of GPUs in lockstep during frontier model training. The full specification is being released through the Open Compute Project so the rest of the industry can use it.
This is a relatively rare look inside the plumbing that makes Stargate-scale training possible. OpenAI says MRC is already running on its largest NVIDIA GB200 clusters, including the Oracle-built Stargate site in Abilene, Texas, and Microsoft's Fairwater supercomputers, and has been used to train multiple OpenAI models powering ChatGPT and Codex.
Why GPU Networks Needed a Redesign
When tens of thousands of GPUs train a single model in lockstep, one late packet can stall the entire job. As clusters grow into the hundreds of thousands of GPUs, link flaps and switch reboots become routine, and traditional networks take seconds or even tens of seconds to recover from them. At Stargate scale, those interruptions translate into idle GPU time and wasted training cycles.
OpenAI says it set out to fix two problems at once: reduce the chance of network congestion in the first place, and make sure that when failures do happen they barely register on the training run.
How MRC Works
MRC extends RDMA over Converged Ethernet (RoCE) and borrows ideas from the Ultra Ethernet Consortium, then layers on a few aggressive design choices. Instead of treating each 800Gb/s network interface as one fat pipe, OpenAI splits it into eight 100Gb/s "planes," giving each GPU eight separate parallel networks. This lets a single cluster fully connect roughly 131,000 GPUs using only two tiers of Ethernet switches, compared with three or four tiers in a conventional design. The result is fewer components, lower power draw, and lower cost.
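To make the topology arithmetic concrete, here is a back-of-the-envelope sketch in Python. The 512-port switch radix is an assumption for illustration (OpenAI has not published the switch model), but under it the standard two-tier leaf-spine formula lands almost exactly on the roughly 131,000-GPU figure above.

```python
# Back-of-the-envelope sizing of one two-tier (leaf-spine) Ethernet plane.
# ASSUMPTION: switches expose 512 ports at 100 Gb/s each; this radix is
# illustrative and not a figure from OpenAI's announcement.

NIC_SPEED_GBPS = 800      # one GPU network interface, treated as a single pipe
PLANE_SPEED_GBPS = 100    # bandwidth of each plane carved out of that interface
SWITCH_RADIX = 512        # hypothetical 100 Gb/s ports per switch

planes_per_gpu = NIC_SPEED_GBPS // PLANE_SPEED_GBPS          # 8 planes

# Non-blocking two-tier Clos: each leaf uses half its ports for GPUs and half
# for spines, and each spine port reaches one leaf, so capacity = radix^2 / 2.
gpus_per_plane = SWITCH_RADIX * (SWITCH_RADIX // 2)          # 131,072

# Every GPU attaches to all planes, so the plane capacity is the cluster size.
print(f"planes per GPU:   {planes_per_gpu}")
print(f"GPUs per cluster: {gpus_per_plane:,}")

# For contrast, the same switch silicon exposed as 64 x 800 Gb/s ports would
# top out at 64 * 64 // 2 = 2,048 GPUs in two tiers, forcing the third or
# fourth tier of the conventional design mentioned above.
```

The exact radix in production may differ, but the shape of the argument is the same: narrower per-plane ports mean a higher effective switch radix, and a higher radix is what keeps the fabric at two tiers.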
On top of that topology, MRC sprays the packets of any single transfer across hundreds of paths at once. Packets carry their final memory address, so they can be reassembled out of order. If one path gets slow or drops a packet, MRC simply stops using it within microseconds and probes it later to see whether it has recovered. On over-subscribed links, switches trim the payload from congested packets rather than dropping them outright, so the sender learns of the loss and retransmits quickly instead of mistaking congestion for a hardware failure.
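That path-management loop is easy to caricature in code. The sketch below is not OpenAI's implementation; it is a toy, with made-up names and timings, that shows the core idea: spray one transfer across many paths, retire a path the moment it misbehaves, and probe it again later.

```python
import random
import time

PROBE_AFTER_S = 0.5   # made-up back-off before re-trying a retired path

class Path:
    """One of the many equal-cost paths a plane offers between two endpoints."""
    def __init__(self, path_id):
        self.path_id = path_id
        self.healthy = True
        self.retired_at = 0.0

class Sprayer:
    def __init__(self, num_paths):
        self.paths = [Path(i) for i in range(num_paths)]

    def usable_paths(self):
        now = time.monotonic()
        for p in self.paths:
            # Periodically probe retired paths to see whether they recovered.
            if not p.healthy and now - p.retired_at > PROBE_AFTER_S:
                p.healthy = True
        usable = [p for p in self.paths if p.healthy]
        return usable or self.paths   # never strand the transfer entirely

    def send(self, packets):
        """Spray the packets of one transfer across all healthy paths."""
        for offset, payload in enumerate(packets):
            path = random.choice(self.usable_paths())
            # Each packet carries its destination memory offset, so the
            # receiver can place it directly even if it arrives out of order.
            if not transmit(path, offset, payload):
                # A slow or lossy path is pulled out of rotation immediately...
                path.healthy = False
                path.retired_at = time.monotonic()
                # ...and the packet is resent on some other path.
                transmit(random.choice(self.usable_paths()), offset, payload)

def transmit(path, offset, payload):
    # Stand-in for the real NIC send; fails occasionally to mimic a flaky link.
    return random.random() > 0.01

# Example: spray a 1,000-packet transfer across 64 paths.
Sprayer(num_paths=64).send([b"chunk"] * 1000)
```

Because the retransmit lands on a different path, a single bad link costs microseconds of bookkeeping rather than a stalled collective.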
OpenAI also replaced the usual dynamic routing protocols like BGP with IPv6 Segment Routing (SRv6). Each packet carries the full list of switches it should hop through, so the network only needs static routing tables. That eliminates an entire class of routing failures and makes the control plane much simpler.
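A toy version of the source-routing idea makes the trade clear. The switch names and fields below are invented for illustration; real SRv6 encodes segments as IPv6 addresses in a routing header, but the mechanics are the same: the sender writes the whole path into the packet, and each switch just follows it.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    """Toy packet with an SRv6-style segment list chosen by the sender."""
    segments: list          # hops stored in reverse order, as SRv6 does
    segments_left: int      # index of the next segment to visit
    payload: bytes

def build_packet(path, payload):
    # The sender (or a central controller) fixes the entire path up front.
    segs = list(reversed(path))
    return Packet(segments=segs, segments_left=len(segs) - 1, payload=payload)

def forward(packet):
    """Everything a switch has to do: read the list, no routing protocol."""
    next_hop = packet.segments[packet.segments_left]
    packet.segments_left -= 1
    return next_hop

# Example: steer a packet through three switches picked by the sender.
pkt = build_packet(["leaf-7", "spine-12", "leaf-42"], b"gradient shard")
while pkt.segments_left >= 0:
    print("forward via", forward(pkt))
```

Since the path lives in the packet, switches only need static forwarding entries, which is exactly what removes the dynamic-routing failure modes described above.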
Why It Matters
The numbers OpenAI shared paint a picture of how brittle large-scale training used to be. During one recent frontier model run, the team had to reboot four tier-1 switches and saw multiple link flaps per minute; MRC absorbed all of it without measurable impact on the training job. Previously, a single GPU-to-switch link failure could crash a whole training run.
By publishing the spec through OCP and co-authoring a paper titled "Resilient AI Supercomputer Networking using MRC and SRv6," OpenAI is also making a strategic move. With more than 900 million people now using ChatGPT each week, the company is positioning shared networking standards as the next layer of AI infrastructure the whole ecosystem can build on, much as RoCE and InfiniBand were before it.
