The machine learning community has been stuck on the autoregressive bottleneck for years, but a new paper shows that it’s possible to use diffusion models on discrete data, such as text, at scale. The researchers trained two coding-focused models, Mercury Coder Mini and Mercury Coder Small, that shatter the current speed-quality tradeoff.

Independent evaluations measured the Mini model at an absurd throughput of 1109 tokens per second on H100 GPUs, while the Small model reaches 737 tokens per second. That outperforms existing speed-optimized frontier models by up to ten times in throughput without sacrificing coding capability. On practical benchmarks and human evaluations like Copilot Arena, the Mini model tied for second place in quality against far larger models like GPT-4o while maintaining an average latency of just 25 ms. It matched the performance of established speed-optimized models like Claude 3.5 Haiku and Gemini 2.0 Flash Lite across multiple programming languages while decoding an order of magnitude faster.

The advantage of diffusion over classical autoregressive models stems from parallel generation, which greatly improves speed. Standard language models are chained to a sequential decoding process: they must generate an answer exactly one token at a time. Mercury abandons this bottleneck entirely by training a Transformer to predict multiple tokens in parallel. The model starts from a sequence of pure random noise and applies a denoising process that iteratively refines all tokens simultaneously, in a coarse-to-fine manner, until the final text emerges. Because generation happens in parallel rather than sequentially, the algorithm achieves much higher arithmetic intensity and fully saturates modern GPU architectures. The team paired this parallel decoding with a custom inference engine featuring dynamic batching and specialized kernels to squeeze out maximum hardware utilization.
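To get a feel for the coarse-to-fine decoding loop, here is a minimal toy sketch. It is not Mercury’s actual sampler: the `toy_denoiser` cheats by peeking at the target sequence just so the control flow is runnable, where a real model would produce proposals and confidences from a Transformer forward pass. The point it illustrates is the schedule: every position starts as noise, each step scores all positions at once, and the most confident fraction is committed until the whole sequence is filled in.

```python
import math
import random

MASK = "<mask>"

def toy_denoiser(tokens, target):
    """Stand-in for a trained Transformer: proposes a token for every
    position in a single parallel pass, with a fake per-position
    confidence score. (In a real model the proposals come from the
    network, not from peeking at the target.)"""
    return [(tgt if tok == MASK else tok, random.random())
            for tok, tgt in zip(tokens, target)]

def diffusion_decode(target, steps=4):
    """Coarse-to-fine parallel decoding: all positions start as noise
    (here, a mask token) and each step commits the most confident
    predictions, refining the whole sequence simultaneously instead
    of generating strictly left to right."""
    n = len(target)
    tokens = [MASK] * n
    for step in range(1, steps + 1):
        proposals = toy_denoiser(tokens, target)  # one parallel pass
        # After this step, ceil(n * step / steps) positions should be filled.
        quota = math.ceil(n * step / steps)
        committed = n - tokens.count(MASK)
        # Rank the still-masked positions by confidence and commit the top ones.
        masked = sorted(((conf, i) for i, (_, conf) in enumerate(proposals)
                         if tokens[i] == MASK), reverse=True)
        for _, i in masked[:quota - committed]:
            tokens[i] = proposals[i][0]
    return tokens
```

With `steps=4`, the sequence is completed in four parallel denoising passes regardless of its length, which is where the throughput win over token-by-token decoding comes from; a real sampler trades the number of steps against output quality.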

  • dendrite_soup@lemmy.ml · 15 hours ago

    The headline numbers (1109 tokens/sec on H100) are real but the more interesting claim is the architectural one: parallel token prediction via diffusion sidesteps the autoregressive bottleneck at inference time. Autoregressive models generate token N only after token N-1 is committed — that’s a hard sequential dependency that limits throughput regardless of hardware. Diffusion models predict multiple tokens simultaneously and iteratively refine the whole sequence.
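    To make that dependency concrete, here’s a toy sketch (not the paper’s decoder; `next_token` is a hypothetical stand-in for a full model forward pass). Each iteration consumes the previous iteration’s output, so the N forward passes cannot run concurrently no matter how much hardware you throw at them:

    ```python
    def next_token(prefix):
        # Toy stand-in: deterministically "predict" the next token
        # from the prefix (a real model runs a full forward pass here).
        return f"tok{len(prefix)}"

    def autoregressive_decode(n):
        out = []
        for _ in range(n):  # hard sequential loop: step i waits on step i-1
            out.append(next_token(out))
        return out
    ```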

    The honest caveat: this paper is from Inception Labs (the people building the product), and the benchmarks are coding tasks specifically. The quality-speed tradeoff may look different on open-ended generation where coherence over long sequences matters more. Copilot Arena ranking second on quality is meaningful signal, but it’s a narrow domain.

    The deeper question is whether discrete diffusion can match autoregressive models on tasks that require strict left-to-right causal reasoning — legal drafting, formal proofs, anything where the output at position N genuinely depends on a decision made at position N-3. That’s where I’d want independent evaluation.

    • ☆ Yσɠƚԋσʂ ☆@lemmy.mlOP · 15 hours ago

      Even if this ends up being a narrow-domain speedup, it’s still massive, and coding happens to be one of the big practical applications for LLMs. I can also see hybrid approaches going forward, where specialized models are invoked based on the task at hand.