Hacker News | zurfer's comments

It's just not binary. Today's world is dominated by capitalistic competition, and a lot of people earn a living by competing with their labor. If AI + robots can do that labor better, cheaper, and faster, most (90%+) of today's jobs are gone with no obvious replacement.

"In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh)."

I find that very surprising. This problem seemed out of reach 3 months ago, but now three frontier models are able to solve it.

Is everybody distilling each other's models? Are companies selling the same data and RL environments to all the big labs? Can anybody more involved share some rumors? :P

I do believe that AI can solve hard problems, but the fact that progress in such a narrow domain is this evenly distributed makes me a bit suspicious that there's a hidden factor. Like, did some "data worker" solve a problem like that, and it's now in the training data?


Yes, there's a whole ecosystem of companies that create and sell RL gyms to AI labs, and of course the labs develop their own internally too. You don't hear much about this ecosystem because RL at scale is all private; there's almost no academic research on it.

A lot of this is probably just throwing roughly equal amounts of compute at continuous RLVR training. I'm not convinced there's any big research breakthrough that separates GPT 5.4 from 5.2. The difference is probably more than just checkpoints but less than architectural changes, and closer to the former than the latter.

I think it's just easy to underestimate how much impact continuous training+scaling can have on the underlying capabilities.


Is it possible the AI labs are seeding their models with these solved problems? Like, if I were Sam Altman with a bazillion dollars of investment, I would pay some mathematicians to solve some of these problems so that the models could "solve" them later on. Not that I think that's what's happening here, of course...

But it is pretty funny how 5.4 miscounted the number of 1's in 18475838184729 on the same day it solved this.
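For what it's worth, the count the model fumbled is a one-liner to verify:

```python
# Count occurrences of the digit '1' in the number from the comment above.
n = 18475838184729
ones = str(n).count("1")
print(ones)  # 2
```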


Maybe so, but GPT 5.4 is absolutely pulling ahead. You can see the differences visually on https://minebench.ai/.

Buy public OpenAI investors, e.g. Microsoft. It's diluted exposure, but easy.

Less diluted, but still: https://fundrise.com/vcx

This outage is taking long enough for me to give Codex another try.


As a cheap user who only uses the $20/month subscriptions, I started with Claude Code as my main tool and Codex as backup for when the 5-hour quota was exhausted.

Then I saw that Codex worked better for me and cancelled my Claude Code subscription. Now, for my moderate use (4-5 hours a day with no parallel agents), Codex at $20 is enough, with free AMP if I want to save some weekly quota.

But honestly I usually have enough usage to last the full week without using AMP.


Seriously, it's been down for two hours. How complicated is their auth system?


They can't fix it if Claude Code isn't up; nobody understands the code anymore. /s (a little)



I liked how it read. Not as a perfectly thought-out post, but more as an ongoing conversation.

These are confusing times for engineers as the automators can now automate themselves away at even greater speed. Reminding ourselves to play positive sum games seems relevant.

The cake is too small to divide with humans and AI. We all feel that. Time to make more cakes :)


tl;dr: the author argues it's closer to $500 per month, IF a user hits their weekly rate limits every week.

Which is probably a lot more accurate than other claims. However, it's also true that anybody who has to use the API might pay that much, creating a real cost-per-token moat for Anthropic's Claude Code vs. other models, as long as it stays so far ahead in productivity.
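A rough back-of-the-envelope version of that comparison. The token volume and per-token price below are purely illustrative assumptions (the thread only gives the $20 subscription and the author's ~$500/month figure):

```python
# Hypothetical comparison: flat subscription vs. pay-per-token API billing.
# All rates and volumes here are illustrative assumptions, not real prices.
SUBSCRIPTION_USD_PER_MONTH = 20

def api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Cost of the same usage billed through a metered API."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

# Example: a heavy user pushing 50M tokens/month at an assumed $10 per 1M tokens
heavy = api_cost(50_000_000, 10.0)
print(f"API equivalent: ${heavy:.0f}/mo vs ${SUBSCRIPTION_USD_PER_MONTH}/mo subscription")
```

Under those assumed numbers the metered cost lands around the author's $500 figure, which is the gap the subscription pricing papers over.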


Whenever I worry that AI will eventually do all the work, I remind myself that the world is full of almost infinite problems, and we'll continue to have the choice to be problem solvers rather than just consumers.


> we'll continue to have a choice to be problem solvers over just consumers.

That's only if we stay relevant and competitive with AI at problem solving.


10 years ago, self-driving EVs were going to make it so nobody owned a car.

There was a lot of hype. We are possibly still on track to get to that world. But it might easily take another 10-20 years :)

AI will change things, but don't underestimate the timeline.

Also even if we get a super intelligence in a box, it probably won't fold my laundry. Super intelligence might not unlock as much as we dream.


LLMs before extensive RL were harmless. Now, with RL, I do fear that labs just let them play games, and the only objective in a game is to win in the short term.

Please, guys and girls at those labs, be wise. Don't give them Counter-Strike etc., even if it improves the score.


Yes, but honestly what's the best source when reporting about a person? Their personal website no?

I think it's a hard problem and I feel there are a lot of trade-offs here.

It's not as simple as saying chatgpt is stupid or the author shouldn't be surprised.


The problem isn't that it pulled the data from his personal site; it's that it simply accepted his information, which was completely false. That's not a hard problem to solve at this point: "Oh, there are exactly zero corroborating sources on this. I'll ignore it."


Verifying that something is "true" requires more than corroborating sources. Making a second blog post on another domain is trivial, then a third and a fourth.


I'm not going to write out the entire logic train for an LLM to determine whether or not one of the billion documents scanned that day is new. Of course you'd need more than one simple "does anyone else on the internet say this" check. It's obvious to everyone that I didn't mean this one thing would somehow be a bulletproof, complete method of determining whether something is true. It'd just be an incredibly strong signal of inauthenticity. Come on, man.
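The signal being described could be sketched roughly like this. Everything here is a hypothetical illustration (the function names, the claim/source shapes, and the threshold are all my own assumptions, not anyone's actual pipeline):

```python
# Hypothetical sketch of the "corroboration" signal discussed above:
# a claim backed only by the subject's own domain is treated as weakly sourced.
from urllib.parse import urlparse

def corroborating_domains(source_urls: list[str]) -> set[str]:
    """Distinct hostnames among the sources that repeat the claim."""
    return {h for h in (urlparse(u).hostname for u in source_urls) if h}

def is_weakly_sourced(source_urls: list[str],
                      subject_domain: str,
                      min_independent: int = 2) -> bool:
    """Flag claims where (almost) all sources are the subject's own site."""
    independent = {d for d in corroborating_domains(source_urls)
                   if d != subject_domain}
    return len(independent) < min_independent

# A bio claim sourced only from the subject's own site gets flagged.
print(is_weakly_sourced(["https://example-person.com/bio"],
                        "example-person.com"))  # True
```

It's a coarse heuristic, and as the sibling comment notes, it's defeated by anyone willing to register a couple of extra domains; the point is only that zero independent domains is a cheap, strong red flag.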


To me it is like steering a car into the ditch and then posting how the car went into a ditch.

You don't have to drive much to figure out that what's impressive is keeping the car on the road and then traveling further or faster than you could by walking. For that, though, you actually have to have a destination in mind and not just spin the wheels, posting pointless metrics about how fast they spin on a blog no one reads, in the vague hope of some hyper-Warhol 15 milliseconds of "fame".

The models for me are just making the output of the average person an insufferable bore.

