It doesn't seem obvious that it's a problem for LLM coders to write their own tests (if we assume that their coding/testing abilities are up to snuff), given human coders do so routinely.
This thread is talking about vibe coding, not LLM-assisted human coding.
The defining feature of vibe coding is that the human prompter doesn't know or care what the actual code looks like. They don't even try to understand it.
You might instruct the LLM to add test cases, and even tell it what behavior to test. And it will very likely add something that passes, but you have to take the LLM's word that it properly tests what you want it to.
The issue I have with using LLMs is reviewing the test code. Often the LLM will make a 30- or 40-line change to the application code, which I can easily review and comprehend. Then I have to look at the 400 lines of generated test code. Each piece may be easy to understand, but there's a lot of it. Go through this cycle several times a day and I'm not convinced I'm doing a good review of the test code due to mental fatigue; who knows what I'm missing in the tests six hours into the work day?
> This thread is talking about vibe coding, not LLM-assisted human coding.
I was writing about vibe-coding. It seems these guys are vibe-coding (https://factory.strongdm.ai/) and their LLM coders write the tests.
I've seen this in action, though with dubious results: the coding (sub)agent writes tests, runs them (they fail), writes the implementation, runs tests (repeat this step and the last until tests pass), then says it's done. Next, the reviewer agent looks at everything and says "this is bad and stupid and won't work, fix all of these things", and the coding agent tries again with the reviewer's feedback in mind.
Models are getting good enough that this seems to "compound correctness", per the post I linked. It is reasonable to think this is going somewhere. The hard parts seem to be specification and creativity.
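For concreteness, the loop is roughly this shape. Sketched in Python with made-up llm_* helpers standing in for the actual model calls and test runner; none of these names are a real API:

    # Sketch of the tests-first agent loop; the helpers below are
    # placeholders for whatever the real system calls.

    def llm_write_tests(task: str) -> str:
        raise NotImplementedError("placeholder for a model call")

    def llm_write_code(task: str, tests: str, feedback: str = "") -> str:
        raise NotImplementedError("placeholder for a model call")

    def llm_review(task: str, code: str, tests: str) -> str:
        raise NotImplementedError("placeholder; returns review notes, or 'LGTM'")

    def run_tests(code: str, tests: str) -> bool:
        raise NotImplementedError("placeholder for the test runner")

    def build(task: str, max_attempts: int = 5) -> str:
        tests = llm_write_tests(task)          # tests are written first and fail
        code = llm_write_code(task, tests)
        for _ in range(max_attempts):          # iterate until the tests pass
            if run_tests(code, tests):
                break
            code = llm_write_code(task, tests, feedback="tests still failing")

        notes = llm_review(task, code, tests)  # separate reviewer pass
        while notes != "LGTM":                 # coder retries with reviewer feedback
            code = llm_write_code(task, tests, feedback=notes)
            notes = llm_review(task, code, tests)
        return code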
Maybe it’s just the people I’m around, but assuming you write good tests is a big assumption. It’s very easy to test only what you already know works. It’s the human version of context collapse, becoming myopic around just what you’re doing in the moment, so I’d expect LLMs to suffer from it as well.
> the human version of context collapse, becoming myopic around just what you’re doing in the moment
The setups I've seen use subagents to handle coding and review, separately from each other and from the "parent" agent, which is tasked with implementing the thing. The parent agent just hands a task off to a coding agent whose only purpose is to do the task; the review agent reviews and goes back and forth with the coding agent until the review agent is satisfied. Coding agents don't seem likely to suffer from this particular failure mode.
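Something like the sketch below, with a hypothetical Agent class where each instance keeps its own isolated context, and implement() standing in for the parent agent's handoff. The point is structural: the reviewer only ever sees the task and the diff, never the coder's reasoning, so it can't inherit the coder's blind spots.

    from dataclasses import dataclass, field

    def call_model(role: str, history: list[str]) -> str:
        raise NotImplementedError("placeholder for the underlying model call")

    @dataclass
    class Agent:
        role: str
        history: list[str] = field(default_factory=list)  # isolated context per agent

        def run(self, prompt: str) -> str:
            self.history.append(prompt)
            return call_model(self.role, self.history)

    def implement(task: str) -> str:
        coder = Agent("coder")        # only sees the task and review feedback
        reviewer = Agent("reviewer")  # only sees the task and the diff

        diff = coder.run(task)
        verdict = reviewer.run(f"Task: {task}\n\nDiff:\n{diff}")
        while verdict != "LGTM":      # loop until the reviewer is satisfied
            diff = coder.run(f"Reviewer feedback:\n{verdict}")
            verdict = reviewer.run(f"Updated diff:\n{diff}")
        return diff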