
This is also my experience. Everything I’ve ever tried to vibe code has ended up with off-by-one errors, logic errors, repeated instances of incorrect assumptions etc. Sometimes they appear to work at first, but, still, they have errors like this in them that are often immediately obvious on code review and would definitely show up in anything more than very light real world use.

They _can_ usually be manually tidied and fixed, with varying amounts of effort (small project = easy fixes, on a par with regular code review, large project = “this would’ve been easier to write myself...”)

I guess Gas Town’s multiple layers of supervisory entities are meant to replace this manual tidying and fixing, but, well, really?

I don’t understand how people are supposedly having so much success with things like this. Am I just holding it wrong?

If they are having real success, why are there no open source projects that are AI developed and maintained that are _not_ just systems for managing AI? (Or are there and I just haven’t seen them?...)





In my comment history can be found a comment much like yours.

Then Opus 4.5 was released. I already had my CC CLAUDE.md and Windsurf global rules + workspace rules set up. Also, my main money-making project is React/Vite/Refine.dev/antd/Supabase... known patterns.

My point is that given all that, I can now deploy amazing features that "just work," and have excellent ux in a single prompt. I still review all commits, but they are now 95% correct on front end, and ~75% correct on Postgres migrations.

Is it magic? Yes. What's worse is that I believe Dario. In a year or so, many people will just create their own Loom or Monday.com equivalent apps with a one-page request. Will it be production ready? No. Will it have all the features that everyone wants? No. But it will do what they want, which is 5% of most SaaS feature sets. That will kill at least 10% of basic SaaS.

If Sonnet 3.5 (~Nov 2024) to Opus 4.5 (Nov 2025) progress is a thing, then we are slightly fucked.

"May you live in interesting times" - turns out to be a curse. I had no idea. I really thought it was a blessing all this time.


Yeah, it sounds like "you're holding it wrong"

Like, why are you manually tidying and fixing things? The first pass is never perfect. Maybe the functionality is there but the code is spaghetti or untestable. Have another agent review and feed that review back into the original agent that built out the code. Keep iterating like that.

My usual workflow:

Agent 1 - Build feature

Agent 2 - Review these parts of the code, see if you find any code smells, bad architecture, scalability problems that will pop up, untestable code, or anything else falling outside of modern coding best practices

Agent 1 - Here's the code review for your changes, please fix

Agent 2 - Do another review

Agent 1 - Here's the code review for your changes, please fix

Repeat until testable, maybe throw in a full codebase review instead of just the feature.

Agent 1 - Code looks good, start writing unit tests, go step by step, let's walk through everything, etc. etc. etc.

Then update your .md directive files to tell the agents how to test.

Voila, you have an llm agent loop that will write decent code and get features out the door.
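The loop above can be sketched as code. This is purely a hypothetical illustration: `run_agent` is a stub standing in for however you actually invoke an agent (a CLI like claude/codex/gemini, or an API call), and the reviewer's approval is faked so the sketch runs on its own.

```python
# Sketch of the build/review agent loop described above.
# run_agent is a hypothetical stand-in for invoking a real agent.

def run_agent(role: str, prompt: str) -> str:
    # Stubbed for illustration; a real version would shell out to an
    # agent CLI (or call an API) and capture its final message.
    if role == "reviewer":
        # pretend the reviewer approves on the third pass
        return "APPROVED" if "attempt 3" in prompt else "needs work: tighten error handling"
    return f"<code produced for: {prompt[:40]}>"

def build_with_review(task: str, max_rounds: int = 5) -> str:
    result = run_agent("builder", f"Build feature: {task}")
    for attempt in range(1, max_rounds + 1):
        review = run_agent(
            "reviewer",
            f"attempt {attempt}: review this change for code smells, "
            f"bad architecture, untestable code: {result}",
        )
        if "APPROVED" in review:
            return result
        # feed the review back into the builder and iterate
        result = run_agent("builder", f"Fix per this review: {review}")
    return result
```

The shape is the whole point: build, review, feed the review back, repeat until the reviewer signs off or you hit a round limit.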


I'm not trying to be rude here at all, but are you manually verifying any of that? When I've had LLMs write unit tests, they are quick to write pointless unit tests that look impressive ("2123/2123 tests passed!") but in reality test mostly nothing of value. And that's when they aren't bypassing commit checks, or commenting out tests, or saying "I fixed it all" while multiple tests are broken.

Maybe I need a stricter harness but I feel like I did try that and still didn't get good results.


I feel like it was doing what you're saying about 4-6 months ago. Especially the commenting out tests. Not always but I'd have to do more things step by step and keep the llm on track. Now though, the last 3-4 months, it's writing decent unit tests without much hand holding or refactors.

Hmm, my last experience was within the last 2 months but I'm trying not to write it off as "this sucked and will always suck", that's the #1 reason I keep testing and playing with these things, the capabilities are increasing quickly and what did/didn't work last week (especially "last model") might work this week.

I'll keep testing it, but that just hasn't been my experience. I sincerely hope that changes, because an agent that runs unit tests [0] and can write them would be very powerful.

[0] This is a pain point for me. The number of times I've watched Claude run "git commit --no-verify"... I've told it in CLAUDE.md to never bypass commit checks, I've told it in the prompt, I've added it 10 more times in different places in CLAUDE.md, but still, the agent will always reach for that if it can't fix something in 1-3 iterations. And yes, I've told it "If you can't get the checks to pass then ask me before bypassing the checks".

It doesn't matter how many guardrails I put up and how good they are if the agent will lazily bypass them at the drop of a hat. I'm not sure how other people are dealing with this (maybe with agents managing agents and checking their work? A la Gas Town?).


I haven't seen your issue, but git is actually one of the things I don't have the llm do.

When I work on issues I create a new branch off of master, let the llm go to town on it, then I manually commit and push to remote for an MR/PR. If there are any errors on the commit hooks I just feed the errors back into the agent.


Interesting, ok, I might try that on my next attempt. I was trying to have it commit so that I could use pre-commit hooks to enforce things I want (test, lint, prettier, etc) but maybe instead I should handle that myself and make it more explicit in my prompts/CLAUDE.md to test/lint/etc. In reality I should just create a `/prep` command or similar that asks it to do all of that so that once it thinks it's done, I can quickly type that and have it get everything passing/fixed and then give a final report on what it did.
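A `/prep` command like that can live as a markdown file in `.claude/commands/` (Claude Code's custom slash command mechanism); the file content below is a hypothetical sketch of what it might say:

```markdown
<!-- .claude/commands/prep.md (hypothetical content) -->
Run the full pre-merge checklist and report the results:

1. Run the test suite and fix any failures.
2. Run lint and prettier and fix any issues.
3. Do NOT bypass or disable any checks; if something cannot pass, stop and ask me.
4. Finish with a short report of what was run, what was fixed, and anything still failing.
```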

You’ll likely have the same issue relying on CLAUDE.md instructions to test/lint/etc; mine get ignored constantly, to the point of uselessness.

I’m trying to redesign my setup to use hooks now instead because poor adherence to rules files across all the agentic CLIs is exhausting to workaround.

(and no, Opus 4.5 didn’t magically solve this problem to preemptively respond to that reply)
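For the --no-verify case specifically, a hook can block the command before it runs instead of hoping the model follows CLAUDE.md. A minimal sketch, assuming Claude Code's documented PreToolUse hook protocol (the hook receives the tool call as JSON on stdin; exit code 2 blocks the call and feeds stderr back to the agent); the file name and exact wiring are up to you:

```python
# Sketch of a PreToolUse hook that refuses `git commit --no-verify`.
# Assumes the hook receives JSON like {"tool_input": {"command": "..."}}
# on stdin, and that exit code 2 blocks the tool call (stderr goes to the agent).
# As a script, wire it up with: raise SystemExit(main(sys.stdin.read()))
import json
import sys

def should_block(command: str) -> bool:
    return "git" in command and "--no-verify" in command

def main(raw: str) -> int:
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return 0  # not a hook payload; allow
    command = event.get("tool_input", {}).get("command", "")
    if should_block(command):
        print("Blocked: --no-verify is not allowed. Fix the failing checks "
              "or ask the user.", file=sys.stderr)
        return 2  # blocks the tool call
    return 0
```

You'd register it under a `PreToolUse` entry matched to the Bash tool in `.claude/settings.json`. Unlike a CLAUDE.md rule, the agent cannot talk its way past it.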


What do your rules files look like?

I wonder if some people are putting in too much into their markdown files of what NOT to do.

I hate people saying the LLMs are just better auto-correct, but in some ways they're right. I think putting in too much "don't do this" leads the LLM down the path of doing "this" because you mentioned it at all. The LLM is probabilistically generating its response based on what you've said and what's in the markdown files; the fact that you put some of that stuff in there at all probably increases the probability those things will show up.


In my projects there's generally a "developer" way to do things and an "llm agent" way to do things.

For the llm a lot of linting and build/test tools go into simple scripts that the llm can run and get shorthand info out of. Some tools, if you have the llm run them, they're going to ingest a lot from the output (like a big stacktrace or something). I want to keep context clean so I have the llm create the tool to use for build/test/linting and I tell it to create it so the outputs will keep its context clean, then I have it document it in the .md file.
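A context-friendly wrapper like that can be tiny. This is a hypothetical sketch (the name and the line budget are arbitrary choices): run the noisy tool, hand back only the tail.

```python
# Sketch of an llm-facing wrapper: run a noisy build/test/lint command,
# but return only the last few lines so the agent's context stays clean.
import subprocess
import sys

def run_quiet(cmd: list[str], keep_lines: int = 10) -> tuple[int, str]:
    """Run cmd and return (exit code, last keep_lines lines of combined output)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    lines = (proc.stdout + proc.stderr).splitlines()
    return proc.returncode, "\n".join(lines[-keep_lines:])

# Example: the llm would call something like
#   run_quiet([sys.executable, "-m", "pytest", "-q"])
# and see only the summary line, not every traceback.
```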

When working with the LLM I have to start out pretty explicit about using the tooling. As we work through things it will start to automatically run the tooling. Sometimes it will want to do something else, I just nudge it back to use the tooling (or I'll ask it why or if there are benefits to the other way and if there are we'll rebuild the tooling to use the other way).

Finally, if the LLM is really having trouble, I kill the session and start a new one. It used to feel bad to do that. I'd feel like I'm losing a lot of info that's in context. But now, I feel like it's not so bad... but I'm not sure if that's because the llms are better or if my workflow has adapted.

Now, let me backup a little bit. I mentioned that I don't have the llm use git. That's the control I maintain. And with that my workflow is: llm builds feature->llm runs linters/tests->I e2e test whatever I'm building by deploying to a dev/staging/local env->once verified I commit. Now I will continue that context window/session until I feel like the llm starts fucking up. Then I kill the session and start a new one. I rarely compact, but it does happen and I generally don't fret about it too much.

I try to keep my units of work small and I feel like it does the best when I do. But then I often find myself surprised at how much it can do from a single prompt, so idk.

I do understand some of the skepticism, because a lot of this stuff sounds "hand-wavy". I'm hoping we all start to hone in on some more concrete general patterns, but with it being so non-deterministic I'm not sure if we will. It feels like everyone is using it differently, and people are having successes and failures across different things.

People where I work LOVE MCPs but I can't stand them. When I use them it always feels like I have to remind the llm that it has an MCP; then it feels like the MCP takes too much context window, and sometimes the llm still trips over how to use it.


Ok, that's a good tip about separate tools/scripts for the LLM, I did something similar less than a year ago so that I kept lint/test output to a minimum but it was still invoked via git hooks. I'll try again with scripts next time I'm doing this. My hope was to let the agent commit to a branch (with code that passed lint/test/prettier/etc), push it, auto-deploys to preview branches, and then that's where I'd do my e2e/QA and once I was happy I could merge it and it get deployed to the main site.

I discussed approaches in my earlier reply. But what you are saying now makes me think you are having problems with too much context. Pare down your CLAUDE.md massively and never let your context usage get over 60-65%. And tell Claude not to commit anything without explicit instructions from you (unless you are working in a branch/worktree and are willing to throw it all away).

put a `git` script in `PATH` that errors out on the flag and otherwise forwards to the underlying `git`, i.e.:

    import subprocess, sys

    if "--no-verify" in sys.argv:
        print("--no-verify is not allowed", file=sys.stderr)
        sys.exit(1)
    # forward everything else to the real git
    sys.exit(subprocess.call(["/usr/bin/git"] + sys.argv[1:]))

Literally yesterday I was using Claude to write a SymPy symbolic verification of a mathematical assertion it was making, in some rigorous algebra/calculus I was having it do for me. This is the best possible hygiene I could adopt for checking its output, and it still failed to report the results correctly.

After manual line-by-line inspection and hand-tweaks, it still saved me time. But it's going to be a long, long time before I no longer manually tweak things or trust that there are no silent mistakes.
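For what it's worth, the SymPy-as-checker pattern itself is sound: make the machine verify the algebra rather than trusting the model's prose. A toy example (the identity below is a stand-in, not the one from the comment):

```python
# Check a symbolic claim mechanically instead of taking the LLM's word for it.
# The identity here is a hypothetical stand-in example.
import sympy as sp

x = sp.symbols("x")

# Claimed: d/dx [sin(x)^2] == 2*sin(x)*cos(x)
claimed = 2 * sp.sin(x) * sp.cos(x)
actual = sp.diff(sp.sin(x) ** 2, x)

# The difference must simplify to exactly zero for the claim to hold.
assert sp.simplify(actual - claimed) == 0
```

The catch, as the comment notes, is the step after the check: you still have to read whether the model reported the result of the verification honestly.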


Those kinds of errors were super common 4-6 months ago, but LLM quality moves fast; nowadays I don't see them very often at all. Two things make a huge difference. First, work on writing a spec. GitHub's Spec Kit, GSD, BMAD, or whatever tool you like can help with this. Do several passes on the spec to refine it and focus on the key ideas.

Now that you have a spec, task it out, but tell the LLM to write the tests first (like Test-Driven Development, but without all the formalisms). This forces the LLM to focus on the desired behavior instead of the algorithms. Be sure the tests cover real behavior: client APIs doing the right error handling when they get bad input, handling tricky cases, etc. Tell the system not to write 'struct' tests - checking that getters/setters work isn't interesting or useful.

Then you implement 1-3 tasks at a time, getting the tests to pass. The rules prevent disabling tests, commenting out tests, and, most importantly, changing the behavior of the tests. This doesn't use a lot of context, there's little to no hallucinating, and progress is easily measurable.
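The "behavior, not struct" distinction can be made concrete with a toy example (`parse_port` is hypothetical):

```python
# "Behavior" tests vs. "struct" tests, using a hypothetical parse_port helper.
# The interesting tests hit error handling and boundaries, not getters/setters.

def parse_port(value: str) -> int:
    """Parse a TCP port from user input, rejecting anything out of range."""
    try:
        port = int(value.strip())
    except ValueError:
        raise ValueError(f"not a number: {value!r}")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port

def test_parse_port_behavior():
    assert parse_port("8080") == 8080
    assert parse_port(" 443 ") == 443  # tolerates surrounding whitespace
    # bad input and boundary cases must be rejected, not silently accepted
    for bad in ("0", "65536", "-1", "http", ""):
        try:
            parse_port(bad)
        except ValueError:
            pass
        else:
            raise AssertionError(f"{bad!r} should have been rejected")
```

A "struct" test would just assert that a field you set is the field you get back, which tells you nothing about whether the code does the right thing.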


>> When I've had LLMs write unit tests they are quick to write pointless unit tests that seem impressive "2123/2123 tests passed!" but in reality it's testing mostly nothing of value.

This has not happened to me since Sonnet 4.5. Opus 4.5 is especially robust when it comes to writing tests. I use it daily in multiple projects and verify the test code.


I thought I did use Opus 4.5 when I tested this last time, but I might have still been on the $20 plan, and I can't remember if you get any Opus 4.5 on that in Claude Code (I thought you did, with really low limits?). So maybe I wasn't using Opus 4.5; I will need to try again.

I haven’t used a multi-agent setup yet, but it’s intriguing.

Are you using Claude Code? How do you run the agents and make them speak?


Let me clarify actually, I run separate terminals and the agents are separated. I think claude code cli is the best. But at home I pay per token. I have a google account and I pay for chatgpt. So I often use codex and gemini cli in tandem. I'll copy + paste stuff between them sometimes or I'll have one review the changes or just the code in general and then feed the other with the outputs. I'll break out claude code for specific tasks or when I feel like gemini/chatgpt aren't quite doing the job right (which has gotten rarer the past few months).

I messed around with separate "agents" in the same context window for a while. I even went as far as playing with strands agents. Having multiple agents was a crapshoot.

Sometimes they'd work great, but sometimes they start working on the same files at the same time, argue with each other, etc. I'd always get multiple agents working, at least how I assumed they should work, by telling the llm explicitly what agents to create and what work to pass off to what agents. And it did a pretty poor job of that. I tried having orchestration agents, but at a certain point the orchestration agent would just takeover and do everything. So I'm not big on having multiple agents (in theory it sounds great, especially since they are supposed to each have their own context window). When I attempted doing this kind of stuff with strands agents it honestly felt like I was trying to recreate claude, so I just stick with plain cli llm tools for now.


I worry about people who use this approach where they never look at the code. Vibe-coding IS possible, but you have to spend a lot of time in plan mode and be very clear about the architecture and abstractions you want it to use.

I've written two separate moderately-sized codebases using agentic techniques (oftentimes being very lazy and just blanket-approving changes), and I don't encounter logic or off-by-one errors very often, if at all. It seems quite good at the basic task of writing working code, but it sucks at architecture, and you need occasional code review rounds to keep the codebase tidy and readable. My code reviews with the AI are like 50% DRY and separating concerns.


In a recent Yegge interview, he mentions that he often throws away the entire codebase and starts from scratch rather than try to get LLMs to refactor their code for architecture.

This has been my best way to learn: put one agent on a big task, let it learn things about the problem and any gotchas, and have it take notes; do it again until I'm happy with the result. If in the middle I think there are two choices with merit, I ask a subagent to go explore that solution in another worktree and make all its own decisions, then I compare. I also personally learn a lot about the problem space during the process, so my prompts and choices on subsequent iterations use the right language.

Honestly, in my experience so far, if an LLM starts going down a bad path, it’s better just to roll back to a point where things were OK and throw away whatever it was doing, rather than trying to course correct.


