It can be hard enough for humans to just look at some (already consistently passing) tests and think, "is X actually the expected behavior or should it have been Y instead?"
I think you should have a look at the abstract, especially this quote:
> 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers
This tool sounds awesome in that it generated real tests that engineers liked! "Zero human checking of AI outputs" is very different, though, and "this test passes" is very different from "this is a good test."
Good points regarding test quality. One takeaway for me from this paper is that you can increase code coverage with LLMs without any human checking of LLM outputs, because it's easy to build a fully automated checker: a candidate test either compiles, passes reliably, and raises coverage, or it gets discarded. Pure coverage may not be the most interesting metric, but it's still nontrivial. LLM-based applications that run fully autonomously without bubbling hallucinations up to users seem elusive, and this is an example of one.
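To make that concrete, here's a minimal sketch of what such a fully automated filter could look like. This is just my reading of the three filters described in the abstract (builds correctly, passes reliably, increases coverage), not the paper's actual tooling; the build/test commands, `write_candidate_test`, and `measure_coverage` are placeholders you'd swap for your project's real infrastructure.

```python
import subprocess

# Placeholder commands -- substitute your project's real build/test invocations.
BUILD_CMD = ["./gradlew", "compileDebugUnitTestSources"]
TEST_CMD = ["./gradlew", "testDebugUnitTest"]


def run(cmd) -> bool:
    """Return True if the command exits with status 0."""
    return subprocess.run(cmd, capture_output=True).returncode == 0


def measure_coverage() -> float:
    """Placeholder: parse the coverage report produced by the test run."""
    ...


def accept_candidate(write_candidate_test, baseline_coverage: float,
                     reliability_runs: int = 5) -> bool:
    """Keep an LLM-generated test only if it builds, passes repeatedly
    (to screen out flakes), and raises measured coverage."""
    write_candidate_test()                 # add the generated test to the tree
    if not run(BUILD_CMD):                 # filter 1: must compile
        return False
    for _ in range(reliability_runs):      # filter 2: must pass reliably
        if not run(TEST_CMD):
            return False
    return measure_coverage() > baseline_coverage  # filter 3: must add coverage
```

None of these checks need a human in the loop, which is why coverage-only gains can be harvested autonomously, even though "good test" judgments still can't.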