One thing that stands out playing with the sorting is that Google's Gemini claims to have a context window more than 10x that of most of its competition. Has anyone experimented with this to see if its useful context window is actually anything close to that?
In my own experiments with the chat models, they seem to lose the plot after about 10 replies unless constantly "refreshed", which is a tiny fraction of the supposed 128,000-token input length that 4o has. Does Gemini actually do something dramatically different, or is their 3 million token context window pure marketing nonsense?
When they released it, they specifically focused on accurate recall across the context window. There are a bunch of demos of things like giving it a whole movie as input (a frame every N seconds plus the script, or something similar) and asking for highly specific facts.
Anecdotally, I use NotebookLM a bit, and while that’s probably RAG plus large contexts (to be clear, this is a guess not based on inside knowledge), it seems very accurate.
I tend to use a sentence along these lines:
"Give me a straightforward summary of what we discussed so far, someone who didn't read the above should understand the details. Don't be too verbose."
Then i just continue from there or simply use this as a seed in another fresh chat.
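For what it's worth, the refresh trick above can be sketched in a few lines. This is a minimal sketch, not tied to any particular API: `call_model` is a hypothetical stand-in for whatever chat-completion function you use, taking a list of role/content messages and returning the model's reply as a string.

```python
# Sketch of the "summarize and re-seed" refresh trick.
# `call_model` is a hypothetical stand-in for your chat API:
# it takes a list of {"role": ..., "content": ...} messages
# and returns the assistant's reply as a string.

SUMMARY_PROMPT = (
    "Give me a straightforward summary of what we discussed so far, "
    "such that someone who didn't read the above would understand the "
    "details. Don't be too verbose."
)

def refresh_conversation(history, call_model):
    """Collapse a long chat history into one summary message and
    return a fresh history seeded with that summary."""
    summary = call_model(
        history + [{"role": "user", "content": SUMMARY_PROMPT}]
    )
    # The new conversation's only context is the summary,
    # so the model starts from a short, dense prompt.
    return [{"role": "user",
             "content": "Context from a previous chat: " + summary}]
```

You can either continue in the same session after appending the summary, or paste the returned seed message into a brand-new chat, which is what the trick amounts to.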