It's always a tradeoff, and most of the time chunking and keeping the context short performs better.

I feed long-context tasks to each new model and snapshot just to test the performance improvements, and every time it's immediately obvious that no current model can handle its own max context. I don't believe any of the benchmarks, because contrary to many of their results, no matter what the (coding) task is, quality starts degrading after just a few tens of thousands of tokens, and past a hundred thousand the accuracy becomes unacceptable. Lost-in-the-middle is still a big issue as well, at least for reasoning if not for direct recall, despite benchmarks suggesting otherwise. LLMs are still pretty unreliable at one-shotting big things, and everything around that is still alchemy.
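To make the "chunk and keep the context short" point concrete, here's a minimal sketch of the kind of splitter I mean: break a long input into overlapping windows well under the model's max context, so each call stays in the token range where accuracy still holds up. Token counting here is a crude whitespace approximation just for illustration; a real pipeline would use the model's actual tokenizer, and the 8000/200 defaults are arbitrary assumptions, not recommendations.

```python
def chunk_text(text: str, max_tokens: int = 8000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks of at most max_tokens words.

    Uses whitespace tokens as a rough stand-in for model tokens.
    Consecutive chunks share `overlap` words so no passage is cut
    off mid-thought at a chunk boundary.
    """
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max_tokens - overlap  # advance by less than the window size
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already reached the end of the text
    return chunks
```

Each chunk then gets its own model call, and the per-chunk answers are merged afterwards, which is more plumbing than one-shotting the whole thing, but in my experience it beats trusting the model past the point where recall falls apart.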


