This paper shows that treating the prompt as an external variable is a surprisingly effective way to handle massive contexts. The authors argue that instead of shoving ten million tokens directly into the model and hoping for the best, we should store the text in a Python REPL environment where the model can interact with it programmatically. That setup lets the LLM write code that slices the text into manageable chunks and recursively calls new instances of itself to process those pieces individually. It is essentially the same logic as out-of-core algorithms, which process datasets far larger than the available memory by fetching only what is needed at any given moment.
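
To make that setup concrete, here is a minimal sketch of what the environment might look like. This is my own illustration rather than the authors' harness, and the file name `huge_input.txt` is just a placeholder for an arbitrarily large text.

```python
import re

# Placeholder for a context far too large to paste into a model directly.
prompt = open("huge_input.txt", encoding="utf-8").read()

# The kind of code the root model might emit in the REPL instead of
# reading the text itself:
print(len(prompt))                                        # how big is this thing?
print(prompt[:2000])                                      # peek at the opening
print(re.findall(r"^#+ .*", prompt, re.MULTILINE)[:20])   # skim any headings
```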

One of the most interesting parts of the study is how it exposes the reality of context rot in frontier models like GPT-5. The results show that while base models handle simple needle-in-a-haystack tasks just fine, they fall apart completely on information-dense tasks that require aggregating data across the entire input. For example, on the OOLONG-Pairs benchmark, which has quadratic complexity, the base GPT-5 model scores under 0.1% accuracy once the context gets long enough. Meanwhile, the recursive language model holds steady even up to a million tokens and reaches 58% on that same difficult task.

It turns out that for retrieval tasks like CodeQA, simply having the REPL to grep through files was enough to beat the base model, because the model could filter the data before reading it. The recursive capability, however, was essential for reasoning tasks like OOLONG where the model needs to process every line. The version of the system that could not make subcalls performed significantly worse because it could not offload the thinking to fresh contexts and keep its own window from getting polluted.
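
The aggregation pattern described here looks roughly like the following sketch. The `llm()` helper is a hypothetical stand-in for whatever API call spawns a fresh model instance, and the chunk size is an arbitrary illustrative choice, not a value from the paper.

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a call that spawns a fresh LLM instance
    with its own clean context window."""
    raise NotImplementedError("wire this up to your model API of choice")

def recursive_aggregate(context: str, question: str, chunk_chars: int = 50_000) -> str:
    # Split the stored text into slices small enough for one sub-model to read in full.
    chunks = [context[i:i + chunk_chars] for i in range(0, len(context), chunk_chars)]

    # Each subcall reasons over one slice in a fresh context, so the root
    # model's window never fills up with raw text or scratch work.
    partials = [
        llm(f"Slice:\n{chunk}\n\nQuestion: {question}\nReport anything relevant:")
        for chunk in chunks
    ]

    # A final call merges the per-slice reports into one answer.
    return llm(
        "Merge these reports into a single answer.\n\n"
        + "\n---\n".join(partials)
        + f"\n\nQuestion: {question}"
    )
```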

Since the model writes code to filter the text with tools like regex before it actually reads anything, it processes fewer tokens on average than a summary agent that is forced to read everything in order to compress it. The main downside is that the variance can be pretty wild: the model sometimes gets stuck in a loop or verifies its own answer several times in a row, which blows up the compute cost for that particular run.
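
As a rough illustration of that filter-before-reading behavior (again my own sketch, not the paper's tooling), the root model can pull out only small windows around regex matches and read those instead of the whole corpus:

```python
import re

def grep_context(context: str, pattern: str, window: int = 200) -> str:
    """Return only small windows of text around regex matches, so the model
    reads a few thousand characters instead of the full context."""
    pieces = []
    for match in re.finditer(pattern, context):
        start = max(0, match.start() - window)
        end = min(len(context), match.end() + window)
        pieces.append(context[start:end])
    return "\n...\n".join(pieces)

# e.g. surface only the regions mentioning an error code before reading them:
# snippet = grep_context(prompt, r"ERROR\s+\d+")
```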

We are clearly seeing a shift where inference-time compute and smart context management are becoming more important than just having a massive raw context window. The fact that this method beats retrieval-based agents on deep-research tasks suggests that giving the model a loop to think and code in is the future for tasks that need a large, persistent context.