daveguy's comments | Hacker News

Chollet literally never says that. Quite the opposite. He says that AIs are currently abysmally bad at the skills this benchmark tests. An AGI should be able to do this, but doing this doesn't mean it's AGI. He has been very clear about that. I suggest you go back and (re)read the intro ARC-AGI paper.

No system can crack these out of the box (like humans can) because we don't have AGI.


This is the correct strategy for this particular game (center the mirrors between the yellow squares, move the black squares). I didn't realize it until about round 6 or 7.

Can AI models generalize+ to any long-context problem solving and agency, regardless of modality? I think the answer is no, and this is why they are not yet AGI.

+ generalize being the key word.


Rate of learning and general applicability of what is learned is essentially the point of ARC-AGI.

That's why all the AIs score abysmally until humans step in to guide them (fine tuning, harnesses, etc).


>> As long as there is a gap between AI and human learning, we do not have AGI.

>> "It's silly to say airplanes don't fly because they don't flap their wings the way birds do."

> Just because a human can do X and the LLM can't doesn't negate the LLM's "intelligence", any more than an LLM doing a task better than a human negates the human's intelligence.

You misinterpret what is meant by "a gap between AI and human learning". The point isn't that they aren't similar enough or that they aren't as intelligent. The statement is specifically about "learning". Humans learn continuously and can devise new strategies for problem solving. Current AI, especially LLMs, are just snapshots of a single strategy. LLMs do not learn at all -- they have "knowledge cutoffs". Even with all the tools available to them in a harness, we still have to wait for new frontier models or new fine-tuning before they can solve significantly new problems. A human does this continually -- learns, regardless of intelligence.


This is a gross misrepresentation of the scoring process.

No, there is no source for this. Opus is scoring around 1%, just like all the other frontier models. It would be fairly trivial to add a renderer intermediary. And if it improves to 97+%... then you would get a huge cut of the $2 million prize. The assertion that Opus gets 97% if you just give it a GUI is completely bogus.

Source? I haven't seen anything like that for ARC-AGI performance.

Also, if it makes that big of a difference, then make a renderer for your agent that looks like the web page, have it solve the tasks in the graphical interface, and funnel the results to the API. I guarantee you won't get better performance, because the AGI is going to have to "understand" that the raw data can be represented as a 2D matrix regardless of whether it gets a 2D matrix of pixels or a 2D matrix of enumerated values in JSON. If anything, that makes it a more difficult problem for an AI system that "speaks" in tokens.
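
Concretely, such a renderer intermediary is only a few lines. A rough sketch (the palette and the exact JSON grid format here are assumptions for illustration, not the official ARC format):

    # Toy renderer: turn an ARC-style grid of small integers (as returned in
    # JSON) into a scaled-up PNG so a vision-capable agent sees pixels instead
    # of raw numbers. Palette and grid layout are illustrative assumptions.
    import json
    from PIL import Image

    PALETTE = [
        (0, 0, 0), (0, 116, 217), (255, 65, 54), (46, 204, 64), (255, 220, 0),
        (170, 170, 170), (240, 18, 190), (255, 133, 27), (127, 219, 255), (135, 12, 37),
    ]

    def render_grid(grid, cell=16):
        # grid: list of lists of ints; returns a PIL image scaled up by `cell`
        h, w = len(grid), len(grid[0])
        img = Image.new("RGB", (w, h))
        img.putdata([PALETTE[v % len(PALETTE)] for row in grid for v in row])
        return img.resize((w * cell, h * cell), Image.NEAREST)

    if __name__ == "__main__":
        frame = json.loads("[[0,1,0],[1,2,1],[0,1,0]]")  # stand-in for an API frame
        render_grid(frame).save("frame.png")

Either way, the model still has to recognize that the numbers and the pixels encode the same 2D structure, which is the point above.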


That score is in the arc technical paper [1]. It's the full benchmark score using this harness [2] (which is just open code with read, grep, bash tools).

This is already a solved benchmark. That's why the scoring is so convoluted and a self-proclaimed agent benchmark won't allow basic agent tools. ARC has always been a bit of a nothingburger of a benchmark, but this takes the cake.

[1] https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf

[2] https://blog.alexisfox.dev/arcagi3
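
For context on how generic "read, grep, bash tools" are, here is a minimal sketch of that kind of tool set (this is not the harness from [2], just an illustration; nothing in it encodes knowledge of any particular environment):

    # Three generic tools of the kind described above; standard library only.
    import pathlib
    import subprocess

    def read_file(path: str) -> str:
        # Return the contents of a file the agent asks to inspect.
        return pathlib.Path(path).read_text()

    def grep(pattern: str, path: str) -> str:
        # Return matching lines with line numbers, like `grep -n pattern path`.
        result = subprocess.run(["grep", "-n", pattern, path],
                                capture_output=True, text=True)
        return result.stdout

    def bash(command: str) -> str:
        # Run an arbitrary shell command and return combined output.
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True)
        return result.stdout + result.stderr

    # An agent loop would simply expose these callables as tools to the model.
    TOOLS = {"read_file": read_file, "grep": grep, "bash": bash}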


> For example, in a variant of environment TR87, Opus 4.6 scores 0.0% with no harness and 97.1% with the Duke harness (12), yet in environment BP35, Opus 4.6 scores 0.0% under both configuration

This is with a harness that has been designed to tackle "a small set of public environments: ls20, ft09, and vc33" (of the arc-agi-3 challenge), yet it looks like it does not solve the full arc-agi-3 benchmark, just some of its environments.


The harness was designed with the preview, but no, it was still tested on the full public set in that environment. You can run the benchmark in different 'environments', though it's unclear what the difference between them is.

> We then tested the harnesses on the full public set (which researchers did not have access to at the time)


It may have been tested on the full set, but the score you quote is for a single game environment, not the full public set. That fact is verbatim in what you responded to and in what vbarrielle quoted. It scored 97% in one game and 0% in another game. The full prelude to what vbarrielle quoted, the last sentence of which you left out, was:

> We then tested the harnesses on the full public set (which researchers did not have access to at the time). We found extreme bimodal performance across the two sets, controlling for the same frontier model...

The harness only transfers to like environments, and the intelligence for those specific games is baked into the harness by the humans who coded it for this specific challenge.

The point of ARC-AGI is to test the intelligence of AI systems in novel, but simple, environments. Having a human give the system more powerful tools in a harness defeats the purpose. You should go back and read the original ARC-AGI paper to see what this is about+. Are you upset about the benchmark because frontier LLMs do so poorly at generalizing when the benchmarks are released?

+ https://arxiv.org/abs/1911.01547


> intelligence for those specific games is baked into the harness

This is your claim, but the other commenter claims the harness consists only of generic tools. What's the reality?

I also encountered confusion about this exact issue in another subthread. I had thought that generic tooling was allowed, but others believed the benchmark to be limited to ingesting the raw text directly from the API, without access to any agent environment, however generic it might be.


The purpose is to benchmark both generality and intelligence. "Making up for" a poor score on one test with an excellent score on another would be the opposite of generality. There's a ceiling based on how consistent the performance is across all tasks.
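
As a toy illustration (not ARC's actual scoring formula, and the numbers are made up): an aggregate that rewards consistency, unlike a plain mean, can't be gamed by acing one environment and failing the rest:

    # Compare a plain mean with consistency-sensitive aggregates on two
    # hypothetical score profiles.
    from statistics import mean, geometric_mean

    specialist = [0.97, 0.0, 0.0, 0.0]    # aces one environment, fails the rest
    generalist = [0.25, 0.25, 0.25, 0.25]

    for name, scores in [("specialist", specialist), ("generalist", generalist)]:
        # geometric_mean needs positive values, so floor zeros at a tiny epsilon
        gm = geometric_mean([max(s, 1e-9) for s in scores])
        print(f"{name}: mean={mean(scores):.2f} geo-mean={gm:.2g} min={min(scores):.2f}")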

The scores they're getting are on the order of 0-1% for this ARC-AGI-3 benchmark.
