> The code I generate is usually better than what I'd do by hand.
I'm always baffled by this. If you can't do it that well by hand, how can you discriminate its quality so confidently?
I get there is an artist/art consumer analogy to be made (i.e. you can see a piece is good without knowing how to paint), but I'm not convinced it is transferable to code.
Also, not really my experience when dealing with IaC or (complex) data related code.
You're forgetting that code quality also costs time. Developers make tradeoffs all the time about how much time to invest in improving the quality of what they write, for both new and existing code. When someone claims that LLMs can produce higher-quality code, that can include quality levels that would be unjustifiably slow to hand-craft, depending on constraints and needs.
Related - agentic LLMs may be slow to produce output but they are parallelizable by an individual unlike hand-written work.
I get that. I'm exclusively talking about verifying code quality after it has been written, whether by a human or an LLM; in fact I don't really care by whom. Mainly because I do care about introducing tech debt and/or hidden ballooning costs.
Ah, alright, that makes a lot more sense, like another poster said I read "'d" as "could".
Point still remains for junior and semi-senior devs though, or any dev trying to leap over a knowledge barrier with LLMs. Emphasis on good pipelines and human (eventually maybe also LLM-based) peer reviews will be very important in the years to come.
You underestimate how lazy people are. I always take shortcuts and skip taking edge cases into account. LLMs have no problem writing tedious guards and creating abstractions without hacks, which means the code becomes more robust than if I wrote it by hand.
What an odd question. For the exact same reason people who write prose professionally usually have someone else edit their work: because editing your own work is harder, and everybody slips up sometimes.
I'm not getting this analogy. Editors can't normally tell whether the content itself is good (after all, the writer is the SME); they can only perfect its form (syntax, grammar, etc.).
Well-written bullshit in perfect prose is still bullshit.
ehhhhhhh yeah but this is like hiring Reddit to do your prose editing, considering generated code is slightly worse than what you'd find on r/programming
You can believe that or not believe that without changing the implication of the previous question, which was that someone who routinely slips while writing code would be incapable of determining whether the LLM got it right. Obviously not.
I am pattern matching your last statement with what I've seen with my teammates who are more AI-oriented: I suspect this is a matter of making the metrics the goal. I would rather maintain something that is simple, works, and has targeted comments than something messy that meets the metrics you list.
I don't get all the prompt vibe coding going around. I don't use prompts to generate code.
I use "tab-tab" auto complete to speed through refactorings and adding new fields / plumbing.
It's easily a 3x productivity gain. On a good day it might be 10x.
It gets me through boring tedium. It gets strings and method names right for languages that aren't statically typed. For languages that are statically typed, it's still better than the best IDE AST understanding.
It won't replace the design and engineering work I do to scope out active-active systems of record, but it'll help me when time comes to build.
I use tab auto complete, and I think it's a 5% productivity gain. On a good day, maybe 10%. I haven't put much effort into optimizing the setup or learning advanced usage patterns or anything. I'm using stock Copilot, provided by my employer. If I had to pay for it, I wouldn't be using it, as it doesn't justify the cost.
The 5% is an increase in straight-ahead code speed. I spend a small fraction of my time typing code. Smaller than I'd like.
And it very well might be an economically rational subscription. For me personally, I'm subscription averse based on the overhead of remembering that I have a subscription and managing it.
I can't attest to C++, but we've got a large Rust monorepo, and it's magical.
It expands match blocks against highly complex enums from different crates, then tab completes test cases after I write the first one. Sometimes even before that.
We may be at different levels of "large" (and "gnarly") - this code-base has existed in some form since 1985, through various automated translations Pascal -> C -> C++.
Just by virtue of Rust being relatively short-lived I would guess that your code base is modular enough to live inside reasonable context limits, and written following mostly standard practice.
One of the main files I work on is ~40k lines of code, and one of the main proprietary API headers I consume is ~40k lines of code.
My attempts at getting the models available to Copilot to author functions for me have often failed spectacularly - as in I can't even get it to generate edits at prescribed places in the source code or to follow examples from prescribed places. And the hallucination issue is EXTREME when trying to use the big C API I alluded to.
That said Claude Code (which I don't have access to at work) has been pretty impressive (although not what I would call "magical") on personal C++ projects. I don't have Opus, though.
Prompts are worth mastering. AI autocomplete is better than older autocomplete systems but of course it only works based on what you started to type.
Prompts are especially good for building a structural template for a new code module, or basic boilerplate for some of the more verbose environments. E.g. Android Java programming can be a mess: huge amounts of code for something simple like an efficient scrolling view. AI takes care of this. It's obvious code, no thought, but it's still over 100 lines scattered across XML (the view definitions), resources, and multiple Java files.
Do you really want to be copying boilerplate like this across to many different files? Prompts that are well integrated to the IDE (they give a diff to add the code) are great (also old style Android before Jetpack sucked) https://stackoverflow.com/questions/40584424/simple-android-...
Do you have a link to some of the code that you have produced using this approach? I am yet to see a public or private repo with non-trivial generated code that is not fundamentally flawed.
I took an existing MIT licensed prefix tree crate and had Claude+Gemini rewrite it to support immutable, quickly comparable views. The execution took about one day's work, following two or three weeks of thinking about the problem part time. I scoured the prefix tree libraries available in Rust, as well as the various existing immutable collections libraries, and found that nothing like this existed. I wanted O(1) comparable views into a prefix tree. This implementation has decently comprehensive tests and benchmarks.
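The crate itself isn't linked, but the core idea behind O(1) comparable views can be sketched in a few lines: tag every distinct version of a persistent structure with a monotonically increasing id and compare views by id instead of walking the structure. Everything below is hypothetical (the names `View` and `insert`, and a dict standing in for the actual trie), just a minimal sketch of the technique:

```python
import itertools

_version = itertools.count()

class View:
    """A persistent map whose views compare in O(1) via a version id."""
    def __init__(self, data):
        self._data = data              # treated as immutable once stored
        self._id = next(_version)      # unique id per distinct version

    def insert(self, key, value):
        # A real prefix tree would share structure instead of copying.
        new_data = dict(self._data)
        new_data[key] = value
        return View(new_data)

    def get(self, key):
        return self._data.get(key)

    def __eq__(self, other):
        # Identity-style comparison: one integer check, no structural walk.
        return self._id == other._id

v1 = View({})
v2 = v1.insert("app", 1)
v3 = v2
assert v2 == v3        # same version: equal in O(1)
assert not (v1 == v2)  # different versions: unequal, also O(1)
```

The tradeoff is that equality becomes identity of versions rather than structural equality: two views built independently with the same contents compare unequal, which is usually acceptable when the views all derive from one shared tree.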
No code for the next two but definitely results...
In both these examples, I leaned on Claude to set up the boilerplate, the GUI, etc, which gave me more mental budget for playing with the challenging aspects of the problem. For example, the tabu graph layout is inspired by several papers, but I was able to iterate really quickly with claude on new ideas from my own creative imagination with the problem. A few of them actually turned out really well.
Not the OP, not my code. But here is Mitchel Hashimoto showing his workflow and code in Zig, created with AI agent assistance: https://youtu.be/XyQ4ZTS5dGw
I think this is still some kind of 'fight' between assisted coding and something closer to 'vibe'. Vibe, for me, means not reading the generated code, just trying it; the other extreme is writing everything without AI. I don't think people here are talking about assisted: they are talking about vibe or almost-vibe coding. And it's fairly terrible if the LLM does not have tons of info. It can loop, hang, remove tons of features, break random things, etc., all while being cheerful and saying 'this is production code now, ready to deploy'. And people believe it. When you use it to assist, it is great imho.
https://github.com/wglb/gemini-chat Almost entirely generated by Gemini based on my English-language description. Several rounds, with me adding requirements.
That's disingenuous or naive. Almost nobody decides to expressly highlight the section of code (or whole files generated by ai) they just get on with the job when there's real deadlines and it's not about coding for the sake of the art form...
If the generated implementation is not good, you're trading short-term "getting on with the job" and "real deadlines" for mid-to-long-term slowdown and missed deadlines.
In other words, it matters whether the AI is creating technical debt.
Do you want to clarify your original comment, then? I just read it again, and it really sounds like you're saying that asking to review AI-generated code is "disingenuous or naive".
I am talking about correctness, not style. Coding isn't just about being able to show activity (code produced), but about producing a system that correctly performs the intended task.
Yes, and frankly you should be spending time writing large integration tests correctly, not microscopic tests that forget how tools interact.
It's not about lines of code or quality it's about solving a problem. If the problem creates another problem then it's bad code. If it solves the problem without causing that then great. Move onto the next problem.
Same as pretending that vibe coding isn't producing tons of slop. "Just improve your prompt bro" doesn't work for most real codebases. The recent TEA app leak is a good example of vibe coding gone wrong, I wish I had as much copium as vibe coders to be blind to these things, as most of them clearly are like "it happened to them but surely won't happen to ME."
> The recent TEA app leak is a good example of vibe coding gone wrong
Weren't there 2 or 3 dating apps that were launched before the "vibecoding" craze that went extremely popular and got extremely hacked weeks/months in? I also distinctly remember a social network having firebase global tokens on the clientside, also a few years ago.
Not an excuse, no. I agree it should be better. And it will get better. Just pointing out that some mistakes were systematically happening before vibecoding became a thing.
We went from "this thing is a stochastic parrot that gives you poems and famous people styled text, but not much else" to "here's a fullstack app, it may have some security issues but otherwise it mainly works" in 2.5 years. People expect perfection, and move the goalposts. Give it a second. Learn what it can do today, adapt, prepare for what it can do tomorrow.
No one is moving the goalposts. There are a ton of people and companies trying to replace large swathes of workers with AI. So it's very reasonable to point out ways in which the AI's output does not measure up to that of those workers.
I thought the idea was that AI would make us collectively better off, not flood the zone with technical debt as if thousands of newly minted CS/bootcamp graduates were unleashed without any supervision.
LLMs are still stochastic parrots, though highly impressive and occasionally useful ones. LLMs are not going to solve problems like "what is the correct security model for this application given this use case".
AI might get there at some point, but it won't be solely based on LLMs.
> "what is the correct security model for this application given this use case".
Frankly I've seen LLMs answer better than people trained in security theatre so be very careful where you draw the line.
If you're trying to say they struggle with what they've not seen before: yes, provided that what is new isn't within the phase space they've been trained over. Remember, there are no photographs of cats riding dinosaurs, but SD models can generate them.
I've heard this multiple times (Tea being an example of problems with vibe coding) but my understanding was that the Tea app issues well predated vibe coding.
I have experimented with vibe coding. With Claude Code I could produce a useful and usable small React/TS application, but it was hard to maintain and extend beyond a fairly low level of complexity. I totally agree that vibe coding (at the moment) is producing a lot of slop code, I just don't think Tea is an example of it from what I understand.
    # loop over the images
    for filename in images_filenames:
        # download the image
        image = download_image(filename)
        # resize the image
        resize_image(image)
        # upload the image
        upload_image(image)
They're often repetitive if you're reading the code, but they're useful context that feeds back into the LLM. Often once the code is clear enough I'll delete them before pushing to production.
do you have proof of this being useful for the LLM? wouldn't you rather have it re-read the actual code it generated, instead of relying on a potentially wishful or stale comment that could lead it astray?
it reads both, so with the comments it more or less parrots the desired outcome I explained... and it sometimes catches the mismatch between code and comment itself before I even mention it
I read and understand 100% of the code it outputs, so I'm not so worried about falling too far astray...
being too prescriptive about it (like prompting "don't write comments") makes the output worse in my experience
I've noticed this too. They are often restatements of the line in verbal form, or intended for me, the prompt author reading the LLM's output, rather than for a code maintainer.
Very often, comments generated by humans are also useless. The reason for this is mandated comment policies, e.g. 'every public method should have a comment'. An utterly disgusting practice. One should only write a comment when one has something interesting to say. In a not-overly-complex code base there should be a comment maybe every 100 lines or so. In many cases it makes more sense to comment the unit tests than the code.
I think the rule requiring comments on public methods exists so that tools like doxygen can extract a reference from them. Most IDEs can also display them on hover. And comments can remind the caller of pre- and post-conditions.
I am pretty far to one end of the spectrum on need for comments. Very rarely is a comment useful to help you/another developer decipher the intent and function of a piece of code.
Ah, so it's good enough to write code on its own without time-consuming, excessive hand-holding. But it's not good enough to write comments on its own.
I can't speak to comments rules specifically but I am a heavy user of "agentic" coding and use rules files and while they help they are simply not that reliable. For something like comments that's probably not that big of a deal because some extra bad comments isn't the end of the world.
But I have rules that are quite important for successfully completing a task by my standards and it's very frustrating when the LLM randomly ignores them. In a previous comment I explained my experiences in more detail but depending on the circumstances instruction compliance is 9/10 times at best, with some instructions/tasks as poor as 6/10 in the most "demanding" scenarios particularly as the context window fills up during a longer agentic run.
Me: Here's the relevant part of the code, add this simple feature.
Opus: here's the modified code blah blah bs bs
Me: Will this work?
Opus: There's a fundamental flaw in blah bleh bs bs here's the fix, but I only generate part of the code, go hunt for the lines to make the changes yourself.
Me: did you change anything from the original logic?
Opus: I added this part, do you want me to leave it as it was?
Sorry to be that guy, but you're using it wrong. The best flows right now are architect -> act -> test. First you have a session in "architect" / "plan" mode (depending on your IDE/tool) where you discuss, ask questions, etc. Then, when everything is clear in "chat" mode, you ask the model to make a plan. You verify the plan, and then you tell it to start implementing it. You still get to approve tool calls, tests, etc. You can also provide feedback along the way if you missed something (e.g. use uv instead of pip).
Coding in a chat interface, and expecting the same results as with dedicated tools is ... 1-1.5 years old at this point. It might work, but your results will be subpar.
Nah it's good thanks for your input. I saw people use plan.md and todo.md and ide/commandline for this before. manus.ai demonstrates this via its chat interface as well.
These conversations on AI code good, vs AI code bad constantly keep cropping up.
I feel we need to build a cultural norm to share examples places of succeeded, and failures, so that we can get to some sort of comparison and categorization.
The sharing also has to be made non-contentious, so that we get a multitude of examples. Otherwise we’d get nerd-sniped into arguing the specifics of a single case.
Let’s talk about rules and docs, shall we? What makes a good rule for AI to keep it on task? What are your setups for docs and attaching them to the context (do you need to? Or just the location?)
Let’s boil this down to an easy set of reproducible steps any engineer can take to wrangle some sense from their AI trip.
The company I work at (https://getunblocked.com) is built to give tools like Claude Code and Cursor context based on all your docs, issues, code, and chat threads from Slack and soon Teams. Happy to give you a demo sometime if you're interested!
In my experience, unit tests and logging code generated by LLMs tend to be overly verbose, miss meaningful assertions, and often produce boilerplate that looks correct but doesn’t test or log anything useful. It’s easy to get misled by the surface structure.
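A concrete illustration of that failure mode, using a made-up helper (`parse_price` and both test names are hypothetical): both tests below pass, but only the second would catch a regression, because the first asserts nothing about the actual value.

```python
def parse_price(text):
    """Hypothetical helper: parse a string like '$1,234.50' into a float."""
    return float(text.replace("$", "").replace(",", ""))

# Boilerplate that looks like a test but asserts almost nothing:
def test_parse_price_runs():
    result = parse_price("$1,234.50")
    assert result is not None      # passes for any non-None return value

# A meaningful assertion about actual behavior:
def test_parse_price_value():
    assert parse_price("$1,234.50") == 1234.5

test_parse_price_runs()
test_parse_price_value()
```

The surface structure of the two tests is nearly identical, which is exactly why the weak one is easy to wave through in review.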
I've been finding actual human-written bugs and correcting them with Claude, so I find the "often broken" claims a load of nonsense... I've been fixing dozens of minor bugs in our codebase that no one's been arsed to fix for years due to bigger priorities (which tbh is generating more features and tech debt).
It may change in the future, but AI is without a doubt improving our codebase right now. Maybe not 10X but it can easily 2X as long as you actually understand your codebase enough to explain it in writing.
Yeah there's so many now it's hard to settle on one. YouTube is littered with them. Agent OS, amp.code, BMAD. I'm probably trying BMAD in earnest next ...
Each of the "tools" does things slightly differently but the techniques to use them effectively are largely the same now (rules, planning, context management, good prompting).
You know like when the loom came out there were probably quite a few models but using it was similar. Like cars are now.
I do think a lot of the discourse in this space can be summed up as: people are arguing about two non-overlapping segments of a distribution having no idea the other segment even exists; instead they just assume the other side is [hype/pessimistic].
What a scary time it is for devs. We spent all this time learning this obscure skill and now when I play with claude or even chatgpt it makes really good code. I just asked it to write me a video game and it did it. Perfect godot code. I was stunned it didn't hallucinate and when I asked for clarification on a snippet of code, it perfectly answered.
I think it's only a matter of time until our roles are commoditized and vibe-coding becomes the norm in most industries.
Vibe coding being a dismissive term for what is really a new skillset. For example, we'll be doing more planning and testing and such instead of writing code. The same way, say, sysadmins just spin up k8s instead of racking servers, or car mechanics read diagnosis codes from readers and often just replace an electric part instead of hand-tuning carbs or gapping spark plugs. That is to say, a level of skill is being abstracted away.
I think we just have to see this, most likely, as how things will get done going forward.
Could you at least mention what the video game was, or why it was such a good implementation? Also, what was "perfect" about the code? "Perfect" is not a word I would ever use to describe code.
This reads like empty hype to me, and there's more than one claim like this in these threads, where AI magically creates an app, but any description of the app itself is always conspicuously missing.
Yes, I'm exaggerating, and it's not writing a AAA game from a prompt, but I asked it to make a game like Zelda and it figured it out and walked me through all the aspects of it. That's a lot more than I expected. I'm not a games programmer, so I'm probably a lot more impressed than I should be, but I went from not knowing anything about Godot to having a framework up to build a 2D RPG-esque game fairly quickly, learning as it gave me the code. Note, I used the new ChatGPT study mode, so that may be different than just regular prompts. I fully expected broken code and random AI musings, but instead I got a very solid implementation of a game, albeit a simple one. Or at least as simple as I asked for; I imagine I can keep building out more with its help.
I also have never used godot before, and I was surprised at how well it navigated and taught me the interface as well.
At least the horror stories about "all the code is broken and hallucinations" aren't really true for me and my uses so far. If LLMs succeed anywhere, it will be in the overly logical and predictable worlds of programming languages, but that's just a guess on my part. Thus far, whenever I reach out for code from LLMs, it's been a fairly positive experience.
Thanks for elaborating, this puts things into perspective, although the complexity of the end product is still unclear to me.
I do still disagree with your assessment. I think the syntactic tokens in programming languages have a kind of impedance mismatch with the tokens that LLMs operate on, and that the formal semantics of programming languages are a bad fit with fuzzy statistical LLMs. I firmly believe that increased LLM usage will drive software safety and quality down, simply because a) no semblance of semantic reasoning or formal verification has been applied to the code and b) a software developer will have an incomplete understanding of code not written by themselves.
But our opinions can co-exist, good luck in your game development journey!
I'm still playing with it and am now adding more scenes and more logic. I think the complexity here is whatever my goals are. I'm not sure what the practical limits are, or at least they exceed my own ability in games development right now. This is just a toy game, but as I reach into Claude and GPT, I can keep going, which is nice. I already have coding experience, so I'm not exactly a 'vibe coder', but professionally I don't think people with zero coding experience are getting dev roles; instead the role will change, like my examples of the modern mechanic and modern sysadmin above.
As far as QA goes, we then circle back to the tool itself being the cure for the problems the tool brings in, which is typical in technology. The same way agile/'break things' programming's solution to QA was to fire the 'hands on' QA department and then programmatically do QA. Mostly for cost savings, but partly because manual QA couldn't keep up.
I think like all artifacts in capitalism, this is 'good enough,' and as such the market will accept it. The same way my laggy buggy Windows computer would be laughable to some in the past. I know if you gave me this Win11 computer when I was big into low-footprint GUI linux desktop, I would have been very unimpressed, but now I'm used to it. Funny enough, I'm migrating back to kubuntu because Windows has become unfun and bloaty and every windows update feels a bit like gambling. But that's me. I'm not the typical market.
I think your concerns are real and correct factually and ideologically, but in terms of a capitalist market will not really matter in the end, and AI code is probably here to stay because it serves the capital owning class (lower labor costs/faster product = more profit for them). How the working class fares or if the consumer product isn't as good as it was will not matter either unless there's a huge pushback, which thus far hasn't happened (coders arent unionizing, consumers seem to accept bloaty buggy software as the norm). If anything the right-wing drift of STEM workers and the 'break things' ideology of development has primed the market for lower-quality AI products and AI-based workforces.
First thing I do is tell the LLM to stop writing useless docstrings and comments and instead follow clean code principles, where each variable name is a noun and each function name is a verb.
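As a minimal sketch of that naming rule (all names here are hypothetical): variables are nouns describing what they hold, functions are verb phrases describing what they do, and line-by-line comments become redundant.

```python
def filter_active_users(users):  # verb phrase: the action performed
    # The variable name says what it holds; no restating comment needed.
    active_users = [u for u in users if u["active"]]
    return active_users

user_records = [
    {"name": "ada", "active": True},
    {"name": "bob", "active": False},
]
assert filter_active_users(user_records) == [{"name": "ada", "active": True}]
```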
With enough rules and good prompting this is not true. The code I generate is usually better than what I'd do by hand.
The reason the code is better is that all the extra polish and gold plating is essentially free.
Everything I generate comes out commented, with great error handling, logging, SOLID principles, and unit tests following established patterns in the code base.