
Is there no way to share that "memory" across chats, or are we at the mercy of hosted models?



There’s caching, but only so much can be cached when small changes in the input can lead to an entirely different space of outputs. Furthermore, even with caching, LLM inference can take anywhere from 1 to 15 seconds using GPT-4 Turbo via the API. As was mentioned, the more tokens you prefix in the context, the longer this takes. Similarly, the model produces a variable-length output (up to a fixed context length), so the time it takes to generate the "answer" can also vary widely. In particular, with CoT you are basically forcing the model to use more tokens than it otherwise would by asking it to explain itself in a verbose, step-by-step manner.
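
To make that last point concrete, here's a rough sketch of timing a direct answer vs. a chain-of-thought answer with the OpenAI Python client. The model name and prompts are just placeholders; the point is that latency grows roughly in proportion to the number of output tokens generated.

  import time
  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  def timed_completion(prompt: str) -> tuple[float, int]:
      """Return (seconds elapsed, completion tokens) for one request."""
      start = time.perf_counter()
      resp = client.chat.completions.create(
          model="gpt-4-turbo",  # illustrative model name
          messages=[{"role": "user", "content": prompt}],
      )
      elapsed = time.perf_counter() - start
      return elapsed, resp.usage.completion_tokens

  question = "What is 17 * 24?"

  # Direct answer: few output tokens, short wait.
  t_direct, n_direct = timed_completion(question + " Answer with just the number.")

  # Chain-of-thought: the verbose step-by-step answer means many more
  # output tokens, and generation time scales with tokens produced.
  t_cot, n_cot = timed_completion(question + " Explain your reasoning step by step.")

  print(f"direct: {n_direct} tokens in {t_direct:.1f}s")
  print(f"CoT:    {n_cot} tokens in {t_cot:.1f}s")

In practice the CoT variant will usually return several times as many completion tokens, and the wall-clock time stretches accordingly.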




