DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

arxiv.org

☆ Yσɠƚԋσʂ ☆@lemmy.ml to

Technology@lemmy.mlEnglish · 6 hours ago

The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput. We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path -- which inherently avoids network congestion and avoids interference with latency-critical model execution communications -- with a global scheduler that dynamically balances load across prefill and decode engines. Our evaluation on three models with production agentic workloads demonstrates that DualPath improves offline inference throughput by up to 1.87× on our in-house inference system. It can also improve online serving throughput by an average factor of 1.96× without violating SLO.

DualPath is a system developed by DeepSeek to address the storage I/O bottleneck that slows down agentic LLM inference. When LLMs run as agents, they repeatedly interact with their environments over many turns, which builds up a massive context history stored as a KV-Cache. Most current systems split the workload into prefill engines that process new prompt tokens and decode engines that generate the actual responses. The fundamental issue is that prefill engines have to load the KV-Cache directly from external persistent storage, which maxes out the storage NIC bandwidth on the prefill side while the storage network connections on the decode engines sit idle.

DualPath creates a second route for the data, which lets the system load the KV-Cache from storage into the idle decode engines first. Once the data reaches the decode engines, it is forwarded to the prefill engines via RDMA over the fast compute network connecting the GPUs. It’s basically a routing strategy that aggregates the storage bandwidth across all the machines and stops the prefill nodes from becoming a choke point.
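To put a rough number on the aggregation idea, here is a back-of-the-envelope sketch in Python. It assumes every engine has one storage NIC of the same speed and ignores the cost of the RDMA hop, so it is an idealized upper bound rather than anything from the paper. The function name and the node counts and NIC speed are made up for illustration.

```python
def aggregate_storage_bandwidth(prefill_nodes, decode_nodes, nic_gbps, dual_path):
    """Idealized upper bound on storage read bandwidth for KV-Cache loading.

    Assumption (not from the paper): each node has one storage NIC of
    nic_gbps, and forwarding over the compute network is free.
    """
    # Without the second path only prefill NICs read from storage;
    # with it, decode NICs can read as well.
    readers = prefill_nodes + decode_nodes if dual_path else prefill_nodes
    return readers * nic_gbps


# Hypothetical cluster: 4 prefill nodes, 4 decode nodes, 50 Gbps storage NICs.
single = aggregate_storage_bandwidth(4, 4, 50.0, dual_path=False)
dual = aggregate_storage_bandwidth(4, 4, 50.0, dual_path=True)
print(single, dual, dual / single)  # 200.0 400.0 2.0
```

With an even prefill/decode split the ceiling doubles, which is at least in the same ballpark as the speedups the abstract reports. Real gains depend on how the scheduler splits the reads and on the actual hardware ratio.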

A traffic manager places the KV-Cache transfers on a lower-priority virtual lane, so the actual inference communication gets priority for most of the bandwidth while the data shuffling happens in the background without causing latency spikes. A dynamic scheduler then constantly monitors token counts and queue lengths to distribute the read tasks evenly across all available hardware. In tests, DualPath improved offline inference throughput by up to 1.87× and online serving throughput by 1.96× on average compared to the baseline setup. Turns out that properly balancing network traffic that was already available in the cluster makes multi-turn agent workloads dramatically faster.
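The load-balancing part can be sketched as a simple least-loaded assignment. This is only a toy illustration of the idea, not DeepSeek's scheduler: the `Engine` class, the `pick_reader` function, and the use of queued token count as the sole load signal are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    name: str
    kind: str  # "prefill" or "decode"
    queued_tokens: int = 0  # hypothetical load signal: tokens waiting to be read


def pick_reader(engines, request_tokens):
    """Assign a KV-Cache read to the engine with the fewest queued tokens."""
    # min() returns the first engine on ties, so list order acts as tie-breaker.
    target = min(engines, key=lambda e: e.queued_tokens)
    target.queued_tokens += request_tokens
    # A read landing on a decode engine would then be forwarded to prefill
    # over the compute network (the forwarding step is not modeled here).
    path = "storage-to-prefill" if target.kind == "prefill" else "storage-to-decode"
    return target.name, path


engines = [Engine("p0", "prefill"), Engine("d0", "decode")]
assignments = [pick_reader(engines, t) for t in (8000, 4000, 4000, 2000)]
for name, path in assignments:
    print(name, path)
```

In this run the first read goes to `p0`, the next two go to `d0` because its queue is shorter, and the last one goes back to `p0` on a tie. A real scheduler would presumably also weigh queue lengths on the compute network and per-request deadlines.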
