Chain of thought takes time because the model has to generate all those extra characters. If you do a chain of thought for every action and every misstep (and you need to, for quality and reliability), it adds up.
There's caching, but only so much can be cached when small changes in the input can lead to an entirely different space of outputs. Furthermore, even with caching, LLM inference can take anywhere from 1-15s using GPT4-Turbo via the API. As was mentioned, the more characters you put in the context, the longer this takes. Similarly, the model's output is variable in length (up to a fixed context length), so the time it takes to calculate the "answer" can also take a while. With CoT in particular, you are basically forcing the model to use more characters in its answer than it otherwise would, by asking it to explain itself in a verbose, step-by-step manner.
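To make the "it adds up" point concrete, here is a rough back-of-the-envelope sketch. It assumes latency is a fixed overhead plus time that grows linearly with the prompt length and with the output length. The function name, the throughput numbers, and the token counts are all made-up placeholders for illustration, not measured GPT4-Turbo figures.

```python
def estimate_latency_s(prompt_tokens, output_tokens,
                       prefill_tokens_per_s=2000.0,
                       decode_tokens_per_s=30.0,
                       overhead_s=0.5):
    """Rough per-call latency in seconds.

    All rates are hypothetical placeholders; plug in numbers you
    measure yourself for the model and API you actually use.
    """
    prefill_s = prompt_tokens / prefill_tokens_per_s  # reading the context
    decode_s = output_tokens / decode_tokens_per_s    # writing the answer
    return overhead_s + prefill_s + decode_s


# Same prompt, short direct answer vs. verbose step-by-step answer.
direct = estimate_latency_s(prompt_tokens=1500, output_tokens=20)
with_cot = estimate_latency_s(prompt_tokens=1500, output_tokens=300)

# An agent that reasons at every action pays the CoT cost on every step.
steps = 25  # hypothetical number of actions in one task
print(f"per call: direct={direct:.1f}s, CoT={with_cot:.1f}s")
print(f"per task ({steps} steps): direct={direct * steps:.0f}s, "
      f"CoT={with_cot * steps:.0f}s")
```

Under these assumptions the output length dominates, because the answer is written much more slowly than the context is read. That is why forcing a verbose explanation on every step gets expensive.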