AI Fails at 96% of (General Work) Jobs (New Study)

ben_w · 2026-02-16T10:45:54 1771238754

Actual paper: https://www.remotelabor.ai/paper.pdf

Sounds about right.

Given their estimates of how long a human would take to complete the same work, this fits a similar pattern to METR: at "humans would take 11.5 hours" (Figure 4, median) you're pushing your luck for any success with all but the most recent models*. And METR is testing software tasks, where AI has the possibility of fully automating a lot of its own tests.

Even models more recent than the ones they tested, like Opus 4.5, are only 50% successful on tasks that take humans 5h20m: https://metr.org/time-horizons/

Assuming the bubble doesn't pop/WW3 doesn't start first (IDK, 25% and 5% respectively?), and if trends continue (???), I expect a similar paper this time next year to show something like 50% success at automation of similar tasks.

* which they didn't test; I don't blame them for that, because this field moves too fast
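
For anyone who wants to sanity-check the "this time next year" guess, here is a rough sketch of the arithmetic. It assumes the 50%-success time horizon keeps doubling at a fixed rate and that the 5h20m figure (software tasks) is comparable to the 11.5 hour median in this paper, neither of which is established. The doubling times in the loop are hypothetical inputs, not numbers taken from either source.

```python
import math

def months_until_horizon(current_hours, target_hours, doubling_months):
    """Months for a 50%-success time horizon to grow from current to target,
    assuming it keeps doubling at a fixed rate (an assumption, not a given)."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months

current = 5 + 20 / 60   # 5h20m, the Opus 4.5 figure quoted above
target = 11.5           # median human time quoted from Figure 4

# Hypothetical doubling times, chosen only to show how sensitive the answer is.
for doubling_months in (4, 7, 12):
    months = months_until_horizon(current, target, doubling_months)
    print(f"doubling every {doubling_months} months: about {months:.1f} months")
```

The gap between 5h20m and 11.5h is a bit more than one doubling, so the answer is roughly "one doubling period", whatever that period turns out to be.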

deterministic · 2026-02-16T23:38:57 1771285137

Duplicate of this one: https://news.ycombinator.com/item?id=47011722

belter · 2026-02-16T23:50:31 1771285831

Also

https://news.ycombinator.com/item?id=46928172

https://news.ycombinator.com/item?id=47004754

adyashakti · 2026-02-16T09:51:45 1771235505

translation: "96% of people trying to replace workers with AI don't know how to prompt it effectively or supervise its output."

BoredPositron · 2026-02-16T10:42:54 1771238574

The 4% is using it to write posts about AI on LinkedIn.

devnonymous · 2026-02-16T09:56:25 1771235785

So what you're saying is the interface fails the common case?

gdulli · 2026-02-16T16:35:54 1771259754

Or they've determined that micromanaging it is circuitous and increases their dependence on tech giants, so it's a bad deal given that they also need to know the work well enough to verify it anyway.

vrighter · 2026-02-16T14:03:44 1771250624

96% are "holding it wrong".

There's a saying that if everywhere you go it smells like shit, you might just have some shit smeared on your own nose.

96% is not "holding it wrong".