
Chain of thought takes time: the model has to generate all of those extra characters. If you do a chain-of-thought for every action and every misstep (and you need to, for quality + reliability), it adds up.
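To put rough numbers on it (a toy estimate; every constant below is an illustrative assumption, not a measurement of any particular model):

    # Back-of-the-envelope: per-step CoT latency compounds across an
    # agent's actions. All constants are assumptions for illustration.
    TOKENS_PER_COT_STEP = 300   # assumed reasoning tokens per action
    TOKENS_PER_DIRECT = 30      # assumed tokens for a bare answer
    TOKENS_PER_SECOND = 40      # assumed decode throughput
    STEPS = 25                  # assumed actions per task, missteps included

    cot_s = STEPS * TOKENS_PER_COT_STEP / TOKENS_PER_SECOND
    direct_s = STEPS * TOKENS_PER_DIRECT / TOKENS_PER_SECOND
    print(f"with CoT:    ~{cot_s:.0f}s per task")    # ~188s
    print(f"direct only: ~{direct_s:.0f}s per task") # ~19s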


Is there no way to share that "memory" across chats?

Or are we at the mercy of hosted models?


There’s caching, but only so much can be cached when small changes in the input can lead to an entirely different space of outputs. And even with caching, LLM inference can take anywhere from 1 to 15 seconds using GPT-4 Turbo via the API. As was mentioned, the more characters you prefix into the context, the longer this takes.

Similarly, the output from the model is variable-length (up to a fixed context length), so the time it takes to produce the “answer” can also vary a lot. With CoT in particular, you are basically forcing the model to use more characters than it otherwise would, by asking it to explain itself in a verbose, step-by-step manner.
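A toy illustration of why the caching only goes so far (this is a whole-prompt memo, not any provider's actual KV cache, which operates on token prefixes, but it shares the key property that an edit early in the prompt misses everything cached after it):

    import hashlib

    # Toy exact-match prompt cache, for illustration only.
    cache: dict[str, str] = {}

    def cached_generate(prompt: str, generate) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in cache:
            return cache[key]       # hit: skip the 1-15s inference call
        out = generate(prompt)      # miss: pay the full latency
        cache[key] = out
        return out

    # Stand-in for a real model call.
    fake_model = lambda p: f"<completion for {len(p)}-char prompt>"
    cached_generate("Plan the next action step by step.", fake_model)   # miss
    cached_generate("Plan the next action step by step.", fake_model)   # hit
    cached_generate("Plan the next action, step by step.", fake_model)  # miss: one comma changed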



