Stateful LLMs and memory techniques

Oh you too writing about AI…

Yep! After tinkering with LLMs in my spare time for fun, I’d like to share some learnings about them and their memory, or better said… the lack of it.

The stateless nature of LLMs sometimes goes unnoticed. From the very beginning, we’ve been interacting with them through ‘wrappers’, like ChatGPT, that made us think LLMs can keep track of a conversation - or even remember details across conversations!

However, this is far from true. LLMs are very knowledgeable, but they’re terrible at remembering things unless we help them big time.

Let’s look at the example below to see the problem in action.

import replicate

# replicate.run streams the output in chunks, so we join them into a single string
llm = replicate.run(
    "meta/meta-llama-3-70b-instruct",
    input={
        "prompt": "Hey, my name is Rafa",
    }
)
print("".join(llm))

Hi Rafa! It’s nice to meet you! Is there something I can help you with or would you like to chat about something in particular?

And then if we go and ask for my name…

# A brand-new call: nothing from the previous request is carried over
llm = replicate.run(
    "meta/meta-llama-3-70b-instruct",
    input={
        "prompt": "How did I say my name was?",
    }
)
print("".join(llm))

I apologize, but this is the beginning of our conversation, and you haven’t mentioned your name yet.

LLM Memory Techniques

One key aspect to understand with LLMs is the Context Window. In simple terms, the context window is how much data the model can take as input (and produce as output…) when generating responses. It’s measured in tokens: a 32k-token context window means the model can handle at most 32k tokens. Simplifying a lot, 1 token ~= 4 chars in English, and 1 token ~= ¾ of a word.
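
To get a feel for these numbers, here’s a toy way to ballpark how many tokens a piece of text might use, based purely on the 4-chars-per-token rule of thumb (real counts depend on each model’s tokenizer, so treat this as an approximation only):

# Very rough token estimate using the ~4 chars per token rule of thumb.
# Real counts depend on the model's tokenizer; this is only a ballpark.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

prompt = "Hey, my name is Rafa"
print(f"~{estimate_tokens(prompt)} tokens out of a 32k-token context window")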

With this in mind, let’s now see some of the techniques that can be used to give memory to our poor LLMs and how they would affect context windows.

Buffer Memory

The first quick win. Probably the easiest, least scalable solution. When providing the human message (aka a prompt), the whole previous conversation is appended, so the LLM can use it as ‘memory’.

  • Pros: Complete context retention ensures no loss of detail, which can enhance the model’s response accuracy.
  • Cons: Rapidly consumes available context window space.

Prompt structure:

Conversation history:
User: <message>
AI: <message>
User: <message> //this being the last human message

import replicate

def ask_llm(prompt):
    # Each call is stateless: the model only 'remembers' what is inside the prompt
    output = replicate.run(
        "meta/meta-llama-3-70b-instruct",
        input={"prompt": prompt},
    )
    return "".join(output)

conversation_history = []

while True:
    user_message = input("You: ")
    conversation_history.append(f"User: {user_message}")
    # The entire conversation so far becomes part of the prompt, acting as 'memory'
    llm_prompt = "Conversation history:\n" + "\n".join(conversation_history) + "\nAI:"
    ai_response = ask_llm(llm_prompt)
    conversation_history.append(f"AI: {ai_response}")
    print("AI:", ai_response)

Windowed Buffer

A less greedy variation of the Buffer strategy: it keeps track of only the last N messages/interactions.

  • Pros: Balances detail and efficiency by keeping recent interactions.
  • Cons: Information outside the window is lost.

Prompt structure:

Conversation History - Last (up to) 3 interactions:
User: <message>
AI: <message>
User: <message>
AI: <message>
User: <message>
AI: <message>
User: <message> //this being the last human message

from collections import deque

import replicate

def ask_llm(prompt):
    output = replicate.run(
        "meta/meta-llama-3-70b-instruct",
        input={"prompt": prompt},
    )
    return "".join(output)

# Keep only the last 3 interactions (3 user messages + 3 AI messages)
conversation_history = deque(maxlen=6)

while True:
    user_message = input("You: ")
    conversation_history.append(f"User: {user_message}")
    llm_prompt = "Conversation history:\n" + "\n".join(conversation_history) + "\nAI:"
    ai_response = ask_llm(llm_prompt)
    conversation_history.append(f"AI: {ai_response}")
    print("AI:", ai_response)

Summary Memory

Instead of remembering everything, this technique keeps track of a summary of the conversation, which is then passed to the LLM. To do this, the LLM itself (or a different, e.g. cheaper, LLM) can be used to continuously summarise the conversation.

  • Pros: More efficient use of the context window.
  • Cons: Potential loss of nuanced details. Summarization cost.

Prompt structure:

Summary of previous interactions:
<LLM generated summary>

<Last User Message>

import replicate

def ask_llm(prompt):
    output = replicate.run(
        "meta/meta-llama-3-70b-instruct",
        input={"prompt": prompt},
    )
    return "".join(output)

ongoing_summary = ""

while True:
    user_message = input("You: ")
    prompt_for_llm = f"Summary of previous interactions:\n{ongoing_summary}\n\n{user_message}"
    ai_response = ask_llm(prompt_for_llm)
    print("AI:", ai_response)
    # Ask the LLM (this could be a cheaper model) to refresh the running summary
    summary_prompt = f"Summarize this conversation: {ongoing_summary} User: {user_message} AI: {ai_response}"
    ongoing_summary = ask_llm(summary_prompt)
    print("Updated Summary:", ongoing_summary)

Windowed Buffer + Summary

Combines the last two techniques to get the best of both worlds: the ‘freshness’ of the last N messages plus a summary of everything older. A minimal sketch follows the prompt structure below.

  • Pros: Combines the benefits of windowed buffer and summary to optimize context usage without losing important details.
  • Cons: Summarization cost. Less straightforward implementation.

Prompt structure:

Conversation History - Last (up to) 3 interactions:
User: <message>
AI: <message>
User: <message>
AI: <message>
User: <message>
AI: <message>
User: <message> //this being the last human message

Summary of previous interactions:
<LLM generated summary>
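
A minimal sketch of how this combination could look, reusing the same ask_llm helper as in the previous snippets (one possible way of wiring it up, not the only one):

import replicate

def ask_llm(prompt):
    # Same stateless helper as in the previous snippets
    output = replicate.run(
        "meta/meta-llama-3-70b-instruct",
        input={"prompt": prompt},
    )
    return "".join(output)

recent_history = []   # last (up to) 3 interactions, i.e. 6 messages
ongoing_summary = ""

while True:
    user_message = input("You: ")
    recent_history.append(f"User: {user_message}")
    llm_prompt = (
        "Conversation History - last interactions:\n" + "\n".join(recent_history) +
        "\n\nSummary of previous interactions:\n" + (ongoing_summary or "None yet") +
        "\nAI:"
    )
    ai_response = ask_llm(llm_prompt)
    recent_history.append(f"AI: {ai_response}")
    print("AI:", ai_response)

    # Once the window exceeds 3 interactions, fold the oldest one into the summary
    while len(recent_history) > 6:
        oldest = recent_history.pop(0) + " " + recent_history.pop(0)
        ongoing_summary = ask_llm(
            f"Summarize this conversation so far: {ongoing_summary} {oldest}"
        )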

RAG

While the above techniques are sufficient for most applications, sometimes they might not be enough. Retrieval-Augmented Generation (RAG) is used in these scenarios (on its own or in addition to the above). It consists of storing a large amount of data (e.g. the whole conversation history) in a database. When the user asks something, a semantic search over that data is performed, and the most relevant bits are extracted and provided to the LLM.
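
As a rough sketch of the idea, here’s a toy version with an in-memory list as the ‘DB’ and a stand-in embedding function (in a real application you’d swap these for a proper embedding model and a vector database):

import numpy as np
import replicate

def ask_llm(prompt):
    output = replicate.run(
        "meta/meta-llama-3-70b-instruct",
        input={"prompt": prompt},
    )
    return "".join(output)

def embed(text):
    # Toy stand-in for a real embedding model: hash words into a fixed-size vector
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec

memory = []  # our 'DB': a list of (text, embedding) pairs

def remember(text):
    memory.append((text, embed(text)))

def retrieve(query, k=3):
    # Semantic search: rank stored snippets by cosine similarity to the query
    q = embed(query)
    def cosine(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
    ranked = sorted(memory, key=lambda item: cosine(item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

while True:
    user_message = input("You: ")
    relevant = retrieve(user_message)
    llm_prompt = (
        "Relevant past messages:\n" + "\n".join(relevant) +
        f"\n\nUser: {user_message}\nAI:"
    )
    ai_response = ask_llm(llm_prompt)
    remember(f"User: {user_message}")
    remember(f"AI: {ai_response}")
    print("AI:", ai_response)

The flow is always the same: embed the query, search for the most similar stored snippets, and stuff the top results into the prompt.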

Expanding Context Windows and Conclusion

Is RAG or other sophisticated memory techniques worth it?

Maybe not! As LLMs evolve rapidly, they’re getting larger and larger context windows. At the time of writing this, Google has announced a 2M-token context window for Gemini. That’s a lot of data!

My gut tells me these context windows will make advanced memory techniques kind of niche and on the verge of over-engineering, at least for most LLM applications, with notable exceptions where massive amounts of data are needed. Only time will tell!

Hope you enjoyed the article. For any questions, my LinkedIn and Twitter accounts are the best places to reach me. Thanks for your time!
