There's a difficult balance between letting the model simply memorize inputs, an...

frozenseven · 2025-07-10T14:09:00 1752156540

You're not answering the question. Grok 4 also performs better on the semi-private evaluation sets for ARC-AGI-1 and ARC-AGI-2. It's across-the-board better.

emp17344 · 2025-07-10T14:45:23 1752158723

If these things are truly exhibiting general reasoning, why do the same models do significantly worse on ARC-AGI-2, which is practically identical to ARC-AGI-1?

frozenseven · 2025-07-10T15:25:40 1752161140

It's not identical. ARC-AGI-2 is more difficult - both for AI and humans. In ARC-AGI-1 you kept track of one (or maybe two) kinds of transformations or patterns. In ARC-AGI-2 you are dealing with at least three, and the transformations interact with one another in more complex ways.

Reasoning isn't an on-off switch. It's a hill that needs climbing. The models are getting better at complex and novel tasks.

emp17344 · 2025-07-10T15:33:30 1752161610

This simply isn’t the case. Humans actually perform better on ARC-AGI-2, according to their website: https://arcprize.org/leaderboard

frozenseven · 2025-07-10T16:12:54 1752163974

The 100.0% you see there just verifies that all the puzzles got solved by at least 2 people on the panel. That was calibrated to be so for ARC-AGI-2. The human panel averages for ARC-AGI-1 and ARC-AGI-2 are 64.2% and 60% respectively. Not a huge difference, sure, but it is there.

I've played around with both, and yes, I'd also personally say that v2 is harder. Overall a better benchmark. ARC-AGI-3 will be a set of interactive games. I think they're moving in the right direction if they want to measure general reasoning.