The machine learning community has been stuck on the autoregressive bottleneck for years, but a new paper shows that it’s possible to use diffusion models on discrete data, such as text, at scale. The researchers trained two coding-focused models, Mercury Coder Mini and Mercury Coder Small, that shatter the current speed-quality tradeoff.

Independent evaluations measured the Mini model at an absurd throughput of 1109 tokens per second on H100 GPUs, while the Small model reaches 737 tokens per second. That outperforms existing speed-optimized frontier models by up to ten times in throughput without sacrificing coding capability. On practical benchmarks and human evaluations like Copilot Arena, the Mini model tied for second place in quality against far larger models like GPT-4o while maintaining an average latency of just 25 ms. It matched the performance of established speed-optimized models like Claude 3.5 Haiku and Gemini 2.0 Flash Lite across multiple programming languages while decoding an order of magnitude faster.

The advantage of diffusion over classical autoregressive models stems from parallel generation, which greatly improves speed. Standard language models are chained to a sequential decoding process: they must generate an answer exactly one token at a time. Mercury abandons this bottleneck entirely by training a Transformer to predict multiple tokens in parallel. The model starts from a sequence of pure random noise and applies a denoising process that iteratively refines all tokens simultaneously, in a coarse-to-fine manner, until the final text emerges. Because generation happens in parallel rather than sequentially, the algorithm achieves much higher arithmetic intensity and fully saturates modern GPU architectures. The team paired this parallel decoding with a custom inference engine featuring dynamic batching and specialized kernels to squeeze out maximum hardware utilization.
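To get a feel for the coarse-to-fine decoding loop, here is a minimal toy sketch. It is not Mercury’s actual sampler: the `toy_denoiser` cheats by peeking at the target sequence just so the control flow is runnable, where a real model would produce proposals and confidences from a Transformer forward pass. The point it illustrates is the schedule: every position starts as noise, each step scores all positions at once, and the most confident fraction is committed until the whole sequence is filled in.

```python
import math
import random

MASK = "<mask>"

def toy_denoiser(tokens, target):
    """Stand-in for a trained Transformer: proposes a token for every
    position in a single parallel pass, with a fake per-position
    confidence score. (In a real model the proposals come from the
    network, not from peeking at the target.)"""
    return [(tgt if tok == MASK else tok, random.random())
            for tok, tgt in zip(tokens, target)]

def diffusion_decode(target, steps=4):
    """Coarse-to-fine parallel decoding: all positions start as noise
    (here, a mask token) and each step commits the most confident
    predictions, refining the whole sequence simultaneously instead
    of generating strictly left to right."""
    n = len(target)
    tokens = [MASK] * n
    for step in range(1, steps + 1):
        proposals = toy_denoiser(tokens, target)  # one parallel pass
        # After this step, ceil(n * step / steps) positions should be filled.
        quota = math.ceil(n * step / steps)
        committed = n - tokens.count(MASK)
        # Rank the still-masked positions by confidence and commit the top ones.
        masked = sorted(((conf, i) for i, (_, conf) in enumerate(proposals)
                         if tokens[i] == MASK), reverse=True)
        for _, i in masked[:quota - committed]:
            tokens[i] = proposals[i][0]
    return tokens
```

With `steps=4`, the sequence is completed in four parallel denoising passes regardless of its length, which is where the throughput win over token-by-token decoding comes from; a real sampler trades the number of steps against output quality.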

  • dendrite_soup@lemmy.ml · 15 hours ago

    The headline numbers (1109 tokens/sec on H100) are real but the more interesting claim is the architectural one: parallel token prediction via diffusion sidesteps the autoregressive bottleneck at inference time. Autoregressive models generate token N only after token N-1 is committed — that’s a hard sequential dependency that limits throughput regardless of hardware. Diffusion models predict multiple tokens simultaneously and iteratively refine the whole sequence.
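    To make that dependency concrete, here’s a toy sketch (not the paper’s decoder; `next_token` is a hypothetical stand-in for a full model forward pass). Each iteration consumes the previous iteration’s output, so the N forward passes cannot run concurrently no matter how much hardware you throw at them:

    ```python
    def next_token(prefix):
        # Toy stand-in: deterministically "predict" the next token
        # from the prefix (a real model runs a full forward pass here).
        return f"tok{len(prefix)}"

    def autoregressive_decode(n):
        out = []
        for _ in range(n):  # hard sequential loop: step i waits on step i-1
            out.append(next_token(out))
        return out
    ```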

    The honest caveat: this paper is from Inception Labs (the people building the product), and the benchmarks are coding tasks specifically. The quality-speed tradeoff may look different on open-ended generation where coherence over long sequences matters more. Copilot Arena ranking second on quality is meaningful signal, but it’s a narrow domain.

    The deeper question is whether discrete diffusion can match autoregressive models on tasks that require strict left-to-right causal reasoning — legal drafting, formal proofs, anything where the output at position N genuinely depends on a decision made at position N-3. That’s where I’d want independent evaluation.

    • ☆ Yσɠƚԋσʂ ☆@lemmy.mlOP · 15 hours ago

      Even if this ends up being a narrow-domain speedup, it’s still massive, and coding happens to be one of the big practical applications for LLMs. I can also see hybrid approaches going forward, where specialized models are invoked based on the task at hand.