I don't speak for bopbopbop7, but I will say this: my experience of using Claude...

I don't speak for bopbopbop7, but I will say this: my experience of using Claude Code has been that it can do much longer tasks than the METR benchmark implies are possible.

The converse of this is that if those tasks are representative of software engineering as a whole, I would expect a lot of other tasks where it absolutely sucks.

This expectation is further supported by the number of times people pop up in conversations like this to say for any given LLM that it falls flat on its face even for something the poster thinks is simple, that it cost more time than it saved.

As with supposedly "full" self driving on Teslas, the anecdotes about the failure modes are much more interesting than the success: one person whose commute/coding problem happens to be easy, may mistake their own circumstances for normal. Until it does work everywhere, it doesn't work everywhere.

When I experiment with vibe coding (as in, properly unsupervised), it can break down large tasks into small ones and churn through each sub-task well enough, such that it can do a task I'd expect to take most of a sprint by itself. Now, that said, I will also say it seems to do these things a level of "that'll do" not "amazing!", but it does do them.

But I am very much aware this is like all the people posting "well my Tesla commute doesn't need any interventions!" in response to all the people pointing out how it's been a decade since Musk said "I think that within two years, you'll be able to summon your car from across the country. It will meet you wherever your phone is … and it will just automatically charge itself along the entire journey."

It works on my [use case], but we can't always ship my [use case].