May 31, 2026

aiincider.ai

AI News. No Noise. Just Signal.

Mercury 2: Diffusion LLM Hits 1,000 Tokens Per Second

Inception Labs released Mercury 2, a diffusion-based reasoning LLM that breaks 1,000 tokens per second, challenging transformer dominance with parallel token generation.

Inception Labs has released Mercury 2, a reasoning language model that swaps the standard transformer for a diffusion architecture and reaches speeds exceeding 1,000 tokens per second. The launch is low-key but architecturally significant: it sits alongside the frontier releases from OpenAI, Google, and xAI in the May 2026 cycle, yet takes a fundamentally different route to fast, accurate reasoning.

Conventional transformer LLMs generate text autoregressively, one token at a time, with each new token depending on the ones before it. Diffusion models work in parallel: they start from noisy candidate tokens and iteratively refine them across the full sequence, the same principle behind image models such as Stable Diffusion, applied to language. Mercury 2 is the first commercial reasoning model to use this approach at production scale, and the throughput numbers reflect the architectural change. Most leading reasoning models top out between 50 and 200 tokens per second on a single accelerator. Mercury 2 pushes past 1,000 while still running the multi-step chain-of-thought behavior that has become the default for frontier systems.
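The difference in sequential work can be made concrete with a toy sketch. It is not Inception's algorithm, which the article does not detail: the function names, the refinement schedule, and the use of the known answer as a stand-in for a learned model are all assumptions made for illustration. The only thing it shows is how many sequential steps each decoding style needs.

```python
import random


def autoregressive_decode(target):
    """Toy autoregressive loop: one sequential step per output token."""
    out, steps = [], 0
    for tok in target:
        # Stand-in for one model call that predicts the next token
        # from everything generated so far.
        out.append(tok)
        steps += 1
    return out, steps


def parallel_refine_decode(target, num_steps, seed=0):
    """Toy diffusion-style loop: start from noise, refine all positions in a few steps.

    Hypothetical schedule: each step finalizes an equal share of the positions
    that are still noisy. A real diffusion LM would use a learned denoiser.
    """
    rng = random.Random(seed)
    # Start from noisy candidate tokens at every position.
    seq = [rng.choice(target) for _ in target]
    noisy = list(range(len(target)))
    rng.shuffle(noisy)
    steps = 0
    for step in range(num_steps):
        if not noisy:
            break
        remaining_steps = num_steps - step
        k = -(-len(noisy) // remaining_steps)  # ceiling division
        for pos in noisy[:k]:
            # Stand-in for the denoiser's prediction at this position;
            # all k positions are updated within the same step.
            seq[pos] = target[pos]
        noisy = noisy[k:]
        steps += 1
    return seq, steps


tokens = "the model refines every position in parallel".split()
_, ar_steps = autoregressive_decode(tokens)
_, pr_steps = parallel_refine_decode(tokens, num_steps=3)
print(f"autoregressive steps: {ar_steps}, parallel refinement steps: {pr_steps}")
```

In the toy, the autoregressive loop needs as many sequential steps as there are tokens, while the refinement loop needs only the fixed number of passes it is given. Whether a real system turns that into higher throughput depends on how expensive each pass is, which this sketch does not model.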

The speed has practical consequences. Reasoning models burn tokens internally before producing a final answer, and the wait can stretch from seconds to a minute on complex queries. A model running at 1,000 tokens per second is five to twenty times faster than the 50 to 200 tokens per second typical of other reasoning models, which collapses that wait into something close to real time. That matters most for agentic systems, where one step blocks the next. Inception is positioning Mercury 2 for high-volume agent loops, code generation, and inference-heavy enterprise workloads where latency is the dominant cost.
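A back-of-the-envelope calculation shows why throughput dominates in agent loops. The throughput figures are the ones quoted in this article; the 3,000-token reasoning budget and the ten-step loop are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def wait_seconds(reasoning_tokens, tokens_per_second):
    """Time spent generating hidden reasoning tokens before the answer appears."""
    return reasoning_tokens / tokens_per_second


def agent_loop_seconds(num_steps, reasoning_tokens, tokens_per_second):
    """Total generation time when each agent step must finish before the next starts."""
    return num_steps * wait_seconds(reasoning_tokens, tokens_per_second)


REASONING_TOKENS = 3_000  # assumed per-query reasoning budget
AGENT_STEPS = 10          # assumed number of sequential agent steps

for tps in (50, 200, 1_000):  # throughput figures quoted in the article
    single = wait_seconds(REASONING_TOKENS, tps)
    loop = agent_loop_seconds(AGENT_STEPS, REASONING_TOKENS, tps)
    print(f"{tps:>5} tok/s: {single:6.1f} s per query, {loop:7.1f} s per {AGENT_STEPS}-step loop")
```

Under these assumptions a single query takes a full minute at 50 tokens per second and 3 seconds at 1,000, and the gap compounds across the loop because the steps cannot overlap. The estimate ignores prompt processing, network latency, and tool-call time.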

The release fits a broader pattern in May 2026. Reasoning has become the default mode across labs, with every major release shipping models that think before they answer. The competitive frontier is no longer just capability but how quickly and cheaply that reasoning can run. GPT-5.5 Instant, Gemini 3.5 Flash, and SubQ 1M-Preview have all leaned into latency and price. Mercury 2 attacks the same problem from the architecture side rather than the optimization side.

The open question is quality. Diffusion language models have historically lagged transformers on hard benchmarks, and Inception has not yet published full evaluation results against the latest GPT-5.5 or Claude releases. If Mercury 2 holds up on reasoning quality while keeping its speed advantage, it would mark the first credible architectural alternative to the transformer paradigm that has dominated since 2017.
