
Creating Agentic LLM apps: The Secret of Token & Conversation Management.

George Salapa
6 min read · Jan 16, 2025


A Large Language Model (LLM) is a pattern predictor. A text calculator.

Every time you send in a question, it: 1. tokenizes the input text, 2. processes through transformer layers, 3. generates probabilities for next tokens, 4. outputs the most likely sequence.

Here is what sending a query to a Llama model looks like:
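(A minimal sketch using the Hugging Face transformers library; the model id and prompt are placeholders.)

```python
# A minimal sketch of querying a Llama model, step by step.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1. tokenize the input text
inputs = tokenizer("What is the capital of France?", return_tensors="pt")

# 2.-3. run through the transformer layers, producing next-token probabilities
output_ids = model.generate(**inputs, max_new_tokens=50)

# 4. decode the most likely sequence back to text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```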

Most will be more familiar with sending that query to an API in a structure like the one below: the OpenAI format, which has become something of an industry standard.
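(A sketch of the request; the model name and API key are placeholders.)

```python
import requests

payload = {
    "model": "gpt-4o",  # placeholder model name
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
}
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
)
print(response.json()["choices"][0]["message"]["content"])
```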

Importantly, of course, it is helpful to understand that every time a user sends in a question, the LLM infers from a clean slate. It is an agnostic, probabilistic machine running to predict text.

To the extent that we want to make the conversation meaningful (let alone agentic or acting autonomously) in any way, we have to accumulate the messages and send the previous history with each new question, giving the LLM context to continue from.
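A rough sketch of that accumulation, using the OpenAI Python SDK (the model name and system prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_text: str) -> str:
    # append the new question, send the whole history, append the answer
    history.append({"role": "user", "content": user_text})
    reply = client.chat.completions.create(model="gpt-4o", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer
```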

Doing this message accumulation makes the conversation meaningful, and that is great, but hey … token limit! If we keep growing a list like this, we hit the LLM's limit soon. More importantly, even if we don't, the longer the list, the worse the LLM's inference becomes.

So we need to introduce some kind of token management block into our code. For example:
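(Something along these lines; the 4-characters-per-token estimate is a crude stand-in for a real tokenizer such as tiktoken.)

```python
MAX_TOKENS = 3000

def estimate_tokens(message: dict) -> int:
    # crude heuristic: roughly 4 characters per token, plus per-message overhead
    return len(message.get("content") or "") // 4 + 4

def trim_history(messages: list[dict], budget: int = MAX_TOKENS) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # drop the oldest messages until everything fits the budget
    while rest and sum(estimate_tokens(m) for m in system + rest) > budget:
        rest.pop(0)
    return system + rest
```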

Now, every time a new user query enters our code and we retrieve all that important history to give it context, we run it through a block of code that ensures the oldest messages are cut off (for example, we retain only up to 3,000 tokens of messages) before sending to the LLM.

You can also ask LLMs to infer structured output that can move things. … It is called tool calling, and it is what makes AI magical. But this just means more messages to add to our history.
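Sketched in the OpenAI tool-calling shape (the tool name, arguments and result here are made up), a single tool round appends at least two more messages:

```python
# The assistant's tool call and the tool's result both join the history.
history.append({
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "search_docs", "arguments": '{"query": "Q3 revenue"}'},
    }],
})
history.append({
    "role": "tool",
    "tool_call_id": "call_1",
    "content": "... possibly thousands of tokens of retrieved text ...",
})
```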

And so the conversation keeps growing, and maybe in our token cutter we can be a little more clever. For example, on each run we still cut, but we make sure to retain a number of the most recent tool results, or, if we hit a hard limit, we truncate only the last message so that we retain as much context as possible.
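One possible version of that smarter cutter, building on estimate_tokens above (the number of protected tool results is arbitrary):

```python
KEEP_RECENT_TOOL_RESULTS = 2

def trim_smarter(messages: list[dict], budget: int = MAX_TOKENS) -> list[dict]:
    # protect the system prompt and the most recent tool results
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    protected = set(tool_idx[-KEEP_RECENT_TOOL_RESULTS:])
    protected |= {i for i, m in enumerate(messages) if m["role"] == "system"}

    kept = list(enumerate(messages))
    while sum(estimate_tokens(m) for _, m in kept) > budget:
        droppable = [k for k, (i, _) in enumerate(kept) if i not in protected]
        if not droppable:
            break
        kept.pop(droppable[0])  # drop the oldest unprotected message first

    result = [m for _, m in kept]
    if result and sum(estimate_tokens(m) for m in result) > budget:
        # hard limit: truncate the last message instead of dropping more
        last = dict(result[-1])
        last["content"] = (last.get("content") or "")[: budget * 4]
        result[-1] = last
    return result
```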

But this is all still problematic. Because we retain the messages as a simple list, they keep growing in a linear fashion. Moreover, the size expands very quickly, as some of the tools invoked by the LLM perform document or information retrieval, which can rapidly fill the token window.

Context is lost quickly!

Let's say we ask a complex question that requires the LLM to invoke several tools, filling and exceeding our token window very fast, and then we ask a follow-up question; so much of the previous tool results will have been cut off that we leave the LLM with very limited, or even worse, misleading context.

Can we do better?

Once we recognize that LLMs (henceforth models) are nothing but next-token predictors (however superhuman their predictions have become), we are in a great place to understand that it is upon us to somehow cleverly manage the context to allow models, across their inferences, to do something spectacular … to truly start exhibiting agentic behavior.

At its simplest, storing conversation history is about retaining messages. We saw that we can do this linearly — each new message is appended to a list, creating a chronological record. However, real conversations aren’t just sequences; they have structure and relationships.

Consider this simple exchange:
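(The content here is made up; stored flatly, it is just a chronological list.)

```python
conversation = [
    {"role": "user", "content": "What was our Q3 revenue?"},
    {"role": "assistant", "content": "Let me check the finance report."},
    {"role": "tool", "content": "Q3 revenue: $4.2M"},
    {"role": "assistant", "content": "Q3 revenue was $4.2M."},
    {"role": "user", "content": "And how does that compare to Q2?"},
]
```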

Instead of this flat structure, what if we stored each message with information about its relationships? Each message can have a parent (the message it responds to) and children (the responses it generates). Like this:
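(A minimal sketch; the class and field names are illustrative.)

```python
from dataclasses import dataclass, field

@dataclass
class MessageNode:
    id: str
    role: str        # "user", "assistant", or "tool"
    content: str
    parent: "MessageNode | None" = None
    children: list["MessageNode"] = field(default_factory=list)

    def add_child(self, child: "MessageNode") -> "MessageNode":
        # wire up both directions of the relationship
        child.parent = self
        self.children.append(child)
        return child
```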

Now when we add messages, a natural tree structure emerges:
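(Replaying the exchange from above through these nodes; the ids are arbitrary.)

```python
root = MessageNode("1", "user", "What was our Q3 revenue?")
call = root.add_child(MessageNode("2", "assistant", "Let me check the finance report."))
tool = call.add_child(MessageNode("3", "tool", "Q3 revenue: $4.2M"))
answer = tool.add_child(MessageNode("4", "assistant", "Q3 revenue was $4.2M."))
follow_up = answer.add_child(MessageNode("5", "user", "And how does that compare to Q2?"))

# root -> call -> tool -> answer -> follow_up
# an unrelated question would branch off root instead of extending this chain
```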

This simple addition of parent-child relationships transforms our conversation storage from a flat list into a navigable structure. This fundamentally changes context management.

By structuring conversation as a tree rather than a linear list, each message becomes a node in a growing knowledge structure, complete with parent-child relationships, related branches, and tool results. Now, when a user sends in a question, it doesn’t simply append to a list — it finds its place in a web of related contexts.

Say, a complex query enters our system. The initial user message might invoke multiple tool calls, each generating results that become child nodes. These results trigger assistant responses, which in turn might lead to more tool invocations. Instead of a linear sequence that quickly hits token limits, we build a rich tree structure where each branch preserves its specific context. When a subsequent user message arrives, our system analyzes its content against existing nodes, finding semantic similarities that determine where in the tree the new message belongs. Below, a simplified example:
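(A sketch: embed stands in for any sentence-embedding model, and similarity is plain cosine.)

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def place_message(all_nodes: list[MessageNode], new_node: MessageNode, embed) -> None:
    # attach the new message under the most semantically similar existing node
    new_vec = embed(new_node.content)
    best = max(all_nodes, key=lambda n: cosine(embed(n.content), new_vec))
    best.add_child(new_node)
```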

Then, when providing context to the model, we don’t just retrieve the last n messages — we traverse the tree both vertically through parent-child relationships and horizontally across semantically related branches, gathering relevant context until we reach our token limit. Below, a simplified example:
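(A sketch of the traversal, reusing the rough token estimate from earlier; `related` is the list of semantically similar nodes found elsewhere in the tree.)

```python
def gather_context(node: MessageNode, related: list[MessageNode], budget: int = 3000) -> list[dict]:
    def cost(n: MessageNode) -> int:
        return len(n.content) // 4 + 4  # same rough token estimate as before

    picked, used = [], 0

    # vertical: walk the parent chain back toward the root
    current = node
    while current is not None and used + cost(current) <= budget:
        picked.append(current)
        used += cost(current)
        current = current.parent
    picked.reverse()  # oldest ancestor first, the new message last

    # horizontal: semantically related nodes from other branches, best first
    for other in related:
        if any(other is p for p in picked) or used + cost(other) > budget:
            continue
        picked.append(other)
        used += cost(other)

    return [{"role": n.role, "content": n.content} for n in picked]
```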

Over the lifetime of the conversation, we can also prune and clean the tree by merging related branches. A simplified example below:
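(One possible pruning pass, reusing cosine and embed from above; the similarity threshold is arbitrary. Sibling branches whose root messages are near-duplicates get folded into one.)

```python
def merge_related_branches(parent: MessageNode, embed, threshold: float = 0.9) -> None:
    merged: list[MessageNode] = []
    for child in parent.children:
        match = next(
            (m for m in merged
             if cosine(embed(m.content), embed(child.content)) >= threshold),
            None,
        )
        if match is None:
            merged.append(child)               # a genuinely new branch, keep it
        else:
            for grandchild in child.children:
                match.add_child(grandchild)    # fold the near-duplicate branch in
    parent.children = merged
```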

This tree-based approach fundamentally changes how models can navigate conversation history. We enable the model to naturally reference and build upon previous contexts. While the model has no other way than to go forward (it is, after all, a next-token predictor), by retrieving relevant message nodes from our tree both vertically and horizontally, the model gains the ability to step back, re-evaluate, and simply infer better.

When the model needs to solve a complex problem, it can loop through related contexts, building upon previous findings and tool results, achieving something akin to recursive reasoning. (See also my article on agentic architecture using Swarm) The tree structure becomes an implicit knowledge base, allowing the model to exhibit more sophisticated agentic behavior simply through better context management.
