For one thing, it's not a real score; they judged the results themselves, and Putnam judges are notoriously tough. Amongst the top 500 humans, there was not a single 8 on the problem they claim partial credit for (nor any partial credit above a 2): https://kskedlaya.org/putnam-archive/putnam2024stats.html.
For another thing, the 2024 Putnam problems are in their RL data.
Also, it's very unclear how these competitions consisting of problems designed to have clear-cut answers and be solved by (well-prepared) humans in an hour will translate to anything else.
> Curating Cold Start RL Data: We constructed our initial training data through the following process:
> 1. We crawled problems from Art of Problem Solving (AoPS) contests, prioritizing math olympiads, team selection tests, and post-2010 problems explicitly requiring proofs, totaling 17,503 problems.
> Why do you think that the 2024 Putnam problems that they used to test were in the training data?
Putnam solutions can be found in multiple places online, e.g. https://kskedlaya.org/putnam-archive/ and https://artofproblemsolving.com/community/c3249_putnam. These could have appeared either in the training of the base LLM, DeepSeek-V3.2-Exp, or as problems in the RL training set. They give no further detail on which problems they selected from AoPS, and as the second link shows, the 2024 Putnam problems are hosted there.