That's true, but those results are rarely the correct ones, at least for v1 llama models. In my experience each model has an optimal temperature at which it performs vastly better. I'm sure OpenAI has the best config they know of set up for ChatGPT, but lets people generate trash through the API if they want to waste their credits on it.


Why would the accuracy decrease with lower temperature? Setting temperature to 0 just means at each step the model will emit the token with the highest likelihood.
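
In case it helps, a rough sketch of what the temperature knob does mechanically; the logit values are made up and this is only the standard softmax-with-temperature recipe, not any particular vendor's implementation:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_token(logits, temperature):
        # Temperature rescales the logits before the softmax; as it
        # approaches 0 the distribution collapses onto the single
        # highest-logit token, i.e. greedy / argmax decoding.
        if temperature == 0:
            return int(np.argmax(logits))
        scaled = np.asarray(logits, dtype=float) / temperature
        probs = np.exp(scaled - scaled.max())  # numerically stable softmax
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    logits = [2.0, 1.5, 0.3]          # made-up next-token logits
    print(sample_token(logits, 0.0))  # always token 0 (greedy)
    print([sample_token(logits, 1.0) for _ in range(10)])  # a mix of tokens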


Yes, that's what I'm saying; to reiterate: the likeliest token at each step does not lead to the best-performing result. Otherwise temperature wouldn't even be an option. I would imagine things like word frequency in the language affect the token ranking a lot while having nothing to do with the task at hand beyond producing a correctly formatted answer, but that's probably not the whole story. A toy example of the gap between per-step argmax and the most likely whole sequence is sketched below.
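
The probabilities here are invented purely for illustration:

    # Two-step toy model: pick a first token, then a second token
    # conditioned on it. Numbers are invented for illustration.
    p_first = {"A": 0.6, "B": 0.4}
    p_second = {
        "A": {"x": 0.34, "y": 0.33, "z": 0.33},  # mass spread thin after A
        "B": {"x": 0.9,  "y": 0.1},              # concentrated after B
    }

    # Greedy decoding: argmax at each step.
    first = max(p_first, key=p_first.get)                    # "A"
    second = max(p_second[first], key=p_second[first].get)   # "x"
    greedy_prob = p_first[first] * p_second[first][second]   # ~0.204

    # Exhaustive search over whole sequences.
    best_seq = max(
        ((f, s) for f in p_first for s in p_second[f]),
        key=lambda fs: p_first[fs[0]] * p_second[fs[0]][fs[1]],
    )

    print((first, second), greedy_prob)  # ('A', 'x') ~0.204
    print(best_seq)                      # ('B', 'x') with probability 0.36

Greedy commits to "A" because it looks best locally, even though every continuation of "B" is far more likely as a whole sequence.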

OpenAI (and others that know what they're doing) always do their benchmarks in a multi-sampled way, running 5 or 20 completions at the optimal temperature. Using a wrapper that draws those samples and then makes another pass that judges self-consistency to pick a final answer can give you a correct answer 100% of the time on a question that would be wrong 100% of the time with the temperature at zero.
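
A minimal sketch of such a wrapper; sample_answer is a hypothetical stand-in for whatever model call plus answer-extraction you actually use:

    from collections import Counter

    def self_consistent_answer(prompt, sample_answer, k=20, temperature=0.7):
        # Draw k independent completions at a non-zero temperature and
        # keep the answer the samples agree on most often
        # (majority vote, a.k.a. self-consistency).
        answers = [sample_answer(prompt, temperature) for _ in range(k)]
        answer, votes = Counter(answers).most_common(1)[0]
        return answer, votes / k

    # sample_answer(prompt, temperature) is assumed to call your model,
    # sample one completion, and parse out just the final answer string.

The vote only works if each sample can be reduced to a comparable answer string, which is why the extraction/judging pass matters as much as the sampling.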


I had a conversation with a friend about this exact question, and my understanding is that the model is trained to match the distribution of all text. When you restrict it to deterministic sampling, which isn't representative of that training objective, you select only a slice of the distribution the model learned, one that conveys much less information than the full distribution, and hence you get poorer results.
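
One way to make that concrete: the entropy of the temperature-scaled softmax, i.e. how much of the learned distribution is still in play, collapses toward zero as the temperature goes down. Toy logits again, not from any real model:

    import numpy as np

    def entropy_at_temperature(logits, temperature):
        # Entropy (in bits) of the softmax distribution after temperature
        # scaling; 0 bits means the output is fully deterministic.
        scaled = np.asarray(logits, dtype=float) / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return float(-(probs * np.log2(probs)).sum())

    logits = [2.0, 1.5, 0.3]  # made-up next-token logits
    for t in (1.5, 1.0, 0.5, 0.1):
        print(t, round(entropy_at_temperature(logits, t), 3))
    # entropy shrinks toward 0 as t -> 0: greedy keeps only the mode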



