> As an example, they cited how Devin, when asked to deploy multiple applications to the infrastructure deployment platform Railway, failed to understand this wasn't supported and spent more than a day trying approaches that didn't work and hallucinating non-existent features.
An engineer not reading the docs and wasting a day chasing their tail because of that. Yes… how unrealistic…
>"Tasks that seemed straightforward often took days rather than hours, with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions," the researchers explain in their report. "Even more concerning was Devin’s tendency to press forward with tasks that weren’t actually possible."
Apparently we've all been working with Devin for years.
> "Even more concerning was Devin’s tendency to press forward with tasks that weren’t actually possible."
Quickest way to get AI engineers kicked out of the company will be to patch them so they push back against unrealistic goals from management.
Seriously though, where is the AI C-suite? The AI BoD? At least with an AI BoD you don't have to worry about them pulling backstabbing financial shenanigans in their own self-interest at the expense of the company.
You would need much less "agreeable" AI to reliably steer a company. With current models an AI C-suite would quickly get "captured" by almost anyone interacting with it.
If an employee behaved like an LLM, a company should immediately get them into a debriefing with corporate counsel, HR, management, and trusted top technical personnel.
For example, to try to find out whose IP they plagiarized, and how badly we're scrod.
Or, for example, to find out how they generated so much code they don't understand at all, and how badly we're scrod.
Or, for example, to find out why they introduced a criminally negligent security vulnerability or data-corruption bug, and how badly we're scrod.
Or, for example, to see what engineering assurance they "hallucinated", and how badly we're scrod.
I wonder how much you get billed if the agent spends a whole day running around in circles. The $500/month subscription only comes with 250 vaguely defined "compute units", so past a certain point you'd have to pay extra for the time it wastes.
Move over "bankrupted by runaway cloud spending", it's time for "bankrupted by AI agents trying and failing to complete a task indefinitely".
Depends on the company. We all hear stories of people earning themselves a promotion / bonus by shipping a bunch of bugs and then saving the day by fixing them.
Do people actually do that? Finding bugs in virtually any piece of software isn’t difficult if you have access to the source. Merging in a bug only to fix it later honestly seems like more work. Most bugs are pretty easy to fix…
A much more common story would be people knowingly cutting corners because of management pressure/demotivation/etc, then fixing the resulting bugs. It's easy for somebody doing that to look like a hard-working hero compared to the programmer who just avoided the problems in the first place.
No, if A's PRs always bounce because the testers find bugs then A is going to look like an idiot. Then again you need to work at a place that actually employs testers.
If B always submits PRs and they go straight to merge and into prod, then B knows what he's doing.
I've seen a lot of fairly explicit discussions around "this timeline will require cutting these corners and cost this much time to fix later or else it will cause these problems", and also some relatively internal discussions around "how strongly can we rely on promises that the project won't get dropped before all the cleanup is done, and how does that impact what options we can present".
> Finding bugs in virtually any piece of software isn’t difficult if you have access to the source.
What????
Yeah, trivial bugs maybe.
If only because most "hairy" bugs (and those are the ones that count at the end of the day) manifest themselves not in obvious ways, but only under some hard-to-predict set of preconditions and input data. And let's not even get started on threaded/asynchronous code.
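To make that last point concrete, here's a minimal hypothetical sketch (not from the thread): a shared counter incremented from several threads. The unlocked version splits the read-modify-write into separate steps, so whether updates get lost depends purely on thread-switch timing, not on the inputs, which is exactly why such bugs evade casual testing.

```python
import threading

def run(n_threads=4, n_iters=50_000, use_lock=False):
    """Increment a shared counter from several threads.

    Without the lock, the read-modify-write below is three separate
    steps; a thread switch between them silently loses updates.
    With the lock, the result is always n_threads * n_iters.
    """
    state = {"counter": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(n_iters):
            if use_lock:
                with lock:
                    state["counter"] += 1
            else:
                tmp = state["counter"]   # read
                tmp += 1                 # modify
                state["counter"] = tmp   # write back (may clobber a
                                         # concurrent thread's update)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]
```

The locked version deterministically returns 200,000 with the defaults; the unlocked version can return anything up to that, and on a lightly loaded machine it may even pass a quick test, which is the whole problem.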
In my experience, "not supported" has a wide range of meanings, from "next to impossible" to "we just don't want you to", so that alone wouldn't deter a human either (I've interpreted it as "challenge accepted" several times myself). But a human would be unlikely to hallucinate non-existent features.