Except of course it's not true lol. Horses are smart critters, but they absolutely cannot do arithmetic no matter how much you train them.

These things are not horses. How can anyone choose to remain so ignorant in the face of irrefutable evidence that they're wrong?

https://arxiv.org/abs/2507.15855

It's as if a disease like COVID swept through the population, and every human's IQ dropped 10 to 15 points while our machines grew smarter to an even larger degree.



Or -- and hear me out -- that result doesn't mean what you think it does.

That's the exact reason I mentioned the Clever Hans story. You think it's obvious because you can't come up with any other explanation; therefore there can't be another explanation, and the horse must be able to do math. And if I can't come up with an explanation either, well, that just proves it, right? Those are the only two options, obviously.

Except no, all it means is you're the limiting factor. This isn't science 101 but maybe science 201?

My current hypothesis is the IMO thing gets trotted out mostly by people who aren't strong at math. They find the math inexplicable, therefore it's impressive, therefore machine thinky.

When you actually look hard at what's claimed in these papers -- and I've done this for a number of these self-published things -- the evidence frequently does not support the conclusions. Have you actually read the paper, or are you just waving it around?

At any rate, I'm not shocked that an LLM can cobble together what looks like a reasonable proof for some things sometimes, especially for the IMO which is not novel math and has a range of question difficulties. Proofs are pretty code-like and math itself is just a language for concisely expressing ideas.

Here, let me call a shot -- I bet this paper says LLMs fuck up on proofs like they fuck up on code. They will sometimes generate things that are fine, but they'll frequently generate things that are just irrational garbage.


> Have you actually read the paper, or are you just waving it around?

I've spent a lot of time feeding similar problems to various models to understand what they can and cannot do well at various stages of development. Reading papers is great, but by the time a paper comes out in this field, it's often obsolete. Witness how much mileage the ludds still get out of the METR study, which was conducted with a now-ancient Claude 3.x model that wasn't at the top of the field when it was new.

> Here, let me call a shot -- I bet this paper says LLMs fuck up on proofs like they fuck up on code. They will sometimes generate things that are fine, but they'll frequently generate things that are just irrational garbage.

And the goalposts have now been moved to a dark corner of the parking garage down the street from the stadium. "This brand-new technology doesn't deliver infallible, godlike results out of the box, so it must just be fooling people." Or in equestrian parlance, "This talking horse told me to short NVDA. What a scam."


On the IMO paper: pointing out that it’s not a gold medal or that some proofs are flawed is irrelevant to the claim being discussed, and you know it. The claim is not “LLMs are perfect mathematicians.” The claim is that they can produce nontrivial formal reasoning that passes external verification at a rate far above chance and far above parroting. Even a single verified solution falsifies the “just regurgitation” hypothesis, because no retrieval-only or surface-pattern system can reliably construct valid proofs under novel compositions.

Your fallback move here is rhetorical, not scientific: “maybe it doesn’t mean what you think it means.” Fine. Then name the mechanism. What specific process produces internally consistent multi-step proofs, respects formal constraints, generalizes across problem types, and fails in ways analogous to human reasoning errors, without representing the underlying structure? “People are impressed because they’re bad at math” is not a mechanism, it’s a tell.

Also, the “math is just a language” line cuts the wrong way. Yes, math is symbolic and code-like. That’s precisely why it’s such a strong test. Code-like domains have exact semantics. They are adversarial to bullshit. That’s why hallucinations show up so clearly there. The fact that LLMs sometimes succeed and sometimes fail is evidence of partial competence, not illusion. A parrot does not occasionally write correct code or proofs under distribution shift. It never does.
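To make "exact semantics" concrete, here's a minimal Lean 4 sketch of what machine-checked verification looks like. (Purely illustrative: the IMO proofs argued about in this thread were natural-language write-ups checked by people rather than a proof assistant, and the theorem name below is made up.)

  -- The kernel either accepts a proof term or rejects it; there is no
  -- partial credit and no way to argue with it.
  theorem sum_comm (a b : Nat) : a + b = b + a :=
    Nat.add_comm a b   -- accepted: the term type-checks against the statement

  -- A bluffed proof does not get through:
  --   theorem sum_comm (a b : Nat) : a + b = b + a := rfl
  -- is rejected, because `a + b` and `b + a` are not definitionally equal
  -- for arbitrary `a` and `b`.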

You keep asserting that others are being fooled, but you haven’t produced what science actually requires: an alternative explanation that accounts for the full observed behavior and survives tighter controls. Clever Hans had one. Stage magic has one. LLMs, so far, do not.

Skepticism is healthy. But repeating “you’re the limiting factor” while refusing to specify a falsifiable counter-hypothesis is not adversarial engineering. It’s just armchair disbelief dressed up as rigor. And engineers, as you surely know, eventually have to ship something more concrete than that.


(Continuing from my other post)

The first thing I checked was "how did they verify the proofs were correct?" The answer: they got other AI people to check them, and those people said there were serious problems with the paper's methodology and that it would not have been a gold medal.

https://x.com/j_dekoninck/status/1947587647616004583

This is why we do not take things at face value.


That tweet is aimed at Google. I don't know much about Google's effort at IMO, but OpenAI was the primary newsmaker in that event, and they reportedly did not use hints or external tools. If you have info to the contrary, please share it so I can update that particular belief.

Gemini 2.5 has since been superseded by 3.0, which is less likely to need hints. 2.5 was not as strong as the contemporary GPT model, but 3.0 with Pro Thinking mode enabled is up there with the best.

Finally, saying "Well, they were given some hints" is like me saying "LOL, big deal, I could drag a Tour peloton up the Col du Galibier if I were on the same drugs Lance was using."

No, in fact I could do no such thing, drugs or no drugs. Similarly, a model that can't legitimately reason will not be able to solve these types of problems, even if given hints.



