
I'm not trying to be rude here at all, but are you manually verifying any of that? When I've had LLMs write unit tests they are quick to write pointless unit tests that seem impressive ("2123/2123 tests passed!") but in reality test mostly nothing of value. And that's when they aren't bypassing commit checks, commenting out tests, or saying "I fixed it all" while multiple tests are broken.

Maybe I need a stricter harness but I feel like I did try that and still didn't get good results.


I feel like it was doing what you're saying about 4-6 months ago, especially the commenting out of tests. Not always, but I'd have to do more things step by step and keep the llm on track. Now though, in the last 3-4 months, it's been writing decent unit tests without much hand-holding or refactoring.

Hmm, my last experience was within the last 2 months but I'm trying not to write it off as "this sucked and will always suck", that's the #1 reason I keep testing and playing with these things, the capabilities are increasing quickly and what did/didn't work last week (especially "last model") might work this week.

I'll keep testing it, but that just hasn't been my experience. I sincerely hope that changes, because an agent that runs unit tests [0] and can write them would be very powerful.

[0] This is a pain point for me. The number of times I've watched Claude run "git commit --no-verify"... I've told it in CLAUDE.md to never bypass commit checks, I've told it in the prompt, I've added it 10 more times in different places in CLAUDE.md, but still, the agent will always reach for that if it can't fix something in 1-3 iterations. And yes, I've told it "If you can't get the checks to pass then ask me before bypassing the checks".

It doesn't matter how many guardrails I put up and how good they are if the agent will lazily bypass them at the drop of a hat. I'm not sure how other people are dealing with this (maybe with agents managing agents and checking their work? A la Gas Town?).


I haven't seen your issue, but git is actually one of the things I don't have the llm do.

When I work on issues I create a new branch off of master, let the llm go to town on it, then I manually commit and push to remote for an MR/PR. If there are any errors on the commit hooks I just feed the errors back into the agent.


Interesting, ok, I might try that on my next attempt. I was trying to have it commit so that I could use pre-commit hooks to enforce things I want (test, lint, prettier, etc) but maybe instead I should handle that myself and make it more explicit in my prompts/CLAUDE.md to test/lint/etc. In reality I should just create a `/prep` command or similar that asks it to do all of that so that once it thinks it's done, I can quickly type that and have it get everything passing/fixed and then give a final report on what it did.
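
Claude Code reads custom slash commands from markdown files in `.claude/commands/`, so `/prep` could be as simple as this (a hypothetical `prep.md`; the exact checks are placeholders for whatever the project uses):

    Run the full pre-merge checklist before reporting done:
    1. Run the test suite and fix any failures.
    2. Run lint and prettier and fix anything they flag.
    3. Do NOT commit anything.
    4. Give a final report of what you changed and anything still failing.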

You’ll likely have the same issue relying on CLAUDE.md instructions to test/lint/etc; mine get ignored constantly, to the point of uselessness.

I’m trying to redesign my setup to use hooks now instead, because poor adherence to rules files across all the agentic CLIs is exhausting to work around.

(and no, Opus 4.5 didn’t magically solve this problem to preemptively respond to that reply)
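
The shape I’m aiming for is a PreToolUse hook that inspects each Bash call before it runs. A minimal sketch (this assumes Claude Code’s documented hook interface: the pending tool call arrives as JSON on stdin, and exiting with code 2 blocks the call and feeds stderr back to the model):

    import json, sys

    # Block any git commit that tries to skip the hooks.
    call = json.load(sys.stdin)
    command = call.get("tool_input", {}).get("command", "")
    if "git commit" in command and "--no-verify" in command:
        print("Blocked: --no-verify is not allowed", file=sys.stderr)
        sys.exit(2)  # exit 2 = deny the tool call

Wired up under a "Bash" matcher in `.claude/settings.json`.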


What do your rules files look like?

I wonder if some people are putting too much into their markdown files about what NOT to do.

I hate people saying the llms are just better auto-correct, but in some ways they're right. I think putting in too much "don't do this" is leading the llm down the path of doing "this" because you mentioned it at all. The LLM is probabilistically generating its response based on what you've said and what's in the markdown files; the fact that you put some of that stuff in there at all probably increases the probability those things will show up.


In my projects there's generally a "developer" way to do things and an "llm agent" way to do things.

For the llm, a lot of linting and build/test tools go behind simple scripts that the llm can run to get shorthand info. Some tools, if you have the llm run them directly, will make it ingest a lot of output (like a big stacktrace or something). I want to keep context clean, so I have the llm create the build/test/lint wrappers itself, tell it to design the outputs to keep its context clean, and then have it document them in the .md file.
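
As a sketch of the shape (illustrative only; the real scripts are project-specific, and I'm assuming pytest here):

    import subprocess, sys

    # Run the test suite but only surface failures and the summary line,
    # so the agent never ingests thousands of lines of passing output.
    result = subprocess.run(["pytest", "-q", "--tb=line"],
                            capture_output=True, text=True)
    lines = result.stdout.splitlines()
    for line in lines:
        if "FAILED" in line or "ERROR" in line:
            print(line)
    if lines:
        print(lines[-1])  # pytest's one-line summary
    sys.exit(result.returncode)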

When working with the LLM I have to start out pretty explicit about using the tooling. As we work through things it will start to run the tooling automatically. Sometimes it will want to do something else; I just nudge it back to the tooling (or I'll ask it why, or whether there are benefits to the other way, and if there are we'll rebuild the tooling to work that way).

Finally, if the LLM is really having trouble, I kill the session and start a new one. It used to feel bad to do that. I'd feel like I'm losing a lot of info that's in context. But now, I feel like it's not so bad... but I'm not sure if that's because the llms are better or if my workflow has adapted.

Now, let me back up a little bit. I mentioned that I don't have the llm use git. That's the control I maintain. And with that my workflow is: llm builds feature -> llm runs linters/tests -> I e2e test whatever I'm building by deploying to a dev/staging/local env -> once verified, I commit. I will continue that context window/session until I feel like the llm starts fucking up. Then I kill the session and start a new one. I rarely compact, but it does happen and I generally don't fret about it too much.

I try to keep my units of work small and I feel like it does best when I do. But then I often find myself surprised at how much it can do from a single prompt, so idk. I do understand some of the skepticism because a lot of this stuff sounds "hand-wavy". I'm hoping we all start to home in on some more concrete general patterns, but with it being so non-deterministic I'm not sure if we will. It feels like everyone is using it differently and people are having successes and failures across different things. People where I work LOVE MCPs but I can't stand them. When I use them it always feels like I have to remind the llm that it has an MCP; then the MCP takes too much context window, and sometimes the llm still trips over how to use it.


Ok, that's a good tip about separate tools/scripts for the LLM. I did something similar less than a year ago to keep lint/test output to a minimum, but it was still invoked via git hooks. I'll try again with scripts next time. My hope was to let the agent commit to a branch (with code that passed lint/test/prettier/etc) and push it, have that auto-deploy to a preview environment, do my e2e/QA there, and once I was happy, merge it and have it deployed to the main site.

I discussed approaches in my earlier reply. But what you are saying now makes me think you are having problems with too much context. Pare down your CLAUDE.md massively and never let your context usage get over 60-65%. And tell Claude not to commit anything without explicit instructions from you (unless you are working in a branch/worktree and are willing to throw it all away).

Put a `git` wrapper script in `PATH` that errors out on `--no-verify`, i.e.:

    if "--no-verify" in sys.args:
        println("--no-verify is not allowed, file=sys.stderr)
        sys.exit(1)
and otherwise forwards to the underlying `git`
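
(Assumes a Python shebang, the executable bit, and the script's directory sitting ahead of the real binary in PATH. The agent can still call `/usr/bin/git` by full path, so treat it as a speed bump rather than a hard guarantee.)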

Literally yesterday I was using Claude for writing a SymPy symbolic verification of a mathematical assertion it was making with respect to some rigorous algebra/calculus I was having it do for me. This is the best possible hygiene I could adopt for checking its output, and it still failed to report on results correctly.

After manual line-by-line inspection and hand-tweaks, it still saved me time. But it's going to be a long, long time before I no longer manually tweak things or trust that there are no silent mistakes.


Those kinds of errors were super common 4-6 months ago, but LLM quality moves fast; nowadays I don't see them very often at all. Two things make a huge difference. First, write a spec: GitHub's Spec Kit, GSD, BMAD, or whatever tool you like can help with this. Do several passes on the spec to refine it and focus on the key ideas.

Now that you have a spec, task it out, but tell the LLM to write the tests first (like Test-Driven Development, but without all the formalisms). This forces the LLM to focus on the desired behavior instead of the algorithms. Be sure to push it toward tests that exercise real behavior: client APIs doing the right error handling on bad input, tricky cases, etc. Tell the system not to write "struct" tests; checking that getters/setters work isn't interesting or useful.
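
To illustrate the distinction with a made-up `parse_price` helper (purely illustrative, not from any real project):

    import pytest

    def parse_price(text: str) -> float:
        value = float(text)
        if value < 0:
            raise ValueError("negative price")
        return value

    # Worth having: asserts real behavior on bad input.
    def test_parse_price_rejects_negative():
        with pytest.raises(ValueError, match="negative"):
            parse_price("-4.99")

    # A "struct" test, not worth having: it just restates the implementation.
    def test_parse_price_returns_float():
        assert isinstance(parse_price("4.99"), float)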

Then you implement 1-3 tasks at a time, getting the tests to pass. The rules prevent disabling tests, commenting out tests, and, most importantly, changing the behavior of the tests. This doesn't use a lot of context, there's little to no hallucinating, and progress is easily measurable.


>> When I've had LLMs write unit tests they are quick to write pointless unit tests that seem impressive "2123/2123 tests passed!" but in reality it's testing mostly nothing of value.

This has not happened to me since Sonnet 4.5. Opus 4.5 is especially robust when it comes to writing tests. I use it daily in multiple projects and verify the test code.


I thought I did use Opus 4.5 when I tested this last time, but I might have still been on the $20 plan and I cannot remember if you get any Opus 4.5 on that in Claude Code (I thought you did, with really low limits?). So maybe I wasn't using Opus 4.5; I will need to try again.

Where is the "super upvote button" when you need it?

YES! I have been playing with vibe coding tools since they came out. "Playing" because only on rare occasions have I created something that is good enough to commit/keep/use. I keep playing with them because, well, I have a subscription, but also so I don't fall into the fuddy-duddy camp of "all AI is bad" and can legitimately speak on the value, or lack thereof, of these tools.

Claude Code is super cool, no doubt, and with _highly targeted_ and _well planned_ tasks it can produce valuable output. Period. But every attempt at full-vibe-coding I've done has gotten hung up at some point, and I have to step in and manually fix things. My experience is often:

1. First Prompt: Oh wow, this is amazing, this is the future

2. Second Prompt: Ok, let me just add/tweak a few things

10. 10th prompt: Ugh, every time I fix one thing, something else breaks

I'm not sure at all what I'm doing "wrong". Flogging the agents along doesn't work well for me, or maybe I am just having trouble letting go of control and I'm not flogging enough?

But the bottom line is I am generally shocked that something like Gas Town was able to be vibe-coded. Maybe it's a case of the LLM overstating what it's accomplished (typical) and if you look under the hood it's doing 1% of what it says it is, but I really don't know. Clearly it's doing something, but then I sit over here trying to build a simple agent with some MCPs hooked up to it using an LLM agent framework and it's falling over after a few iterations.


So I’m probably in a similar spot - I mostly prompt-and-check, unless it’s a throwaway script or something, and even then I give it a quick glance.

One thing that stands out in your steps, and that I’ve noticed myself: yeah, by prompt 10, it starts to suck. If it ever hits “compaction” then it’s past the point of no return.

I still find myself slipping into this trap sometimes because I’m just in the flow of getting good results (until it nosedives), but the better strategy is to do a small unit of work per session. It keeps the context small and that keeps the model smarter.

“Ralph” is one way to do this. (decent intro here: https://www.aihero.dev/getting-started-with-ralph)

Another way is “Write out what we did to PROGRESS.md” - then start new session - then “Read @PROGRESS.md and do X”

Just playing around with ways to split up the work into smaller tasks basically, and crucially, not doing all of those small tasks in one long chat.


I will check out Ralph (thank you for that link!).

> Another way is “Write out what we did to PROGRESS.md” - then start new session - then “Read @PROGRESS.md and do X”

I agree on small context and if I hit "compacting" I've normally gone too far. I'm a huge fan of `/clear`-ing regularly or `/compact <Here is what you should remember for the next task we will work on>` and I've also tried "TODO.md"-style tracking.

I'm conflicted on TODO.md-style tracking because in practice I've had an agent work through every item on the list, confidently telling me steps are done, only to find that's not the case when I check its work. Both a TODO.md that I created and one I had the agent create suffer from this. Also, getting it to update the TODO.md has been frustrating; even when I add it to CLAUDE.md ("Make sure to mark tasks as complete in the TODO.md as you finish them") or add the same message to the end of all my prompts, it won't always update it.

I've been interested in trying out beads to see if it works better than a markdown TODO file, but I haven't played with that yet.

But overall I agree with you, smaller chunks are key to success.


I hate TODO.mds too. If I ever have to use one, I'll keep track of it manually and split the work myself into chunks of the size I believe CC/codex can handle. TODO.md is a recipe for failure because you'll quickly have more code than you can review and no basis to trust that it was executed well.

> 10. 10th prompt: Ugh, every time I fix one thing, something else breaks

Maybe that is the time to start making changes by hand. I think this dream of humans never ever writing any more code might be a step too far, and unnecessary.


I’ve definitely hit that same pattern in the early iterations, but for me it hasn’t really been a blocker. I’ve found the iteration loop itself isn’t that bad as long as you treat it like normal software work. I still test, review, and check what it actually did each time, but that’s expected anyway. What’s surprised me is how quickly things can scale once the overall architecture is thought through. I’ve built out working pieces in a couple of weeks using Claude Code, and a lot of that time was just deciding on the architecture up front and then letting it help fill in the details. It’s not hands-off, but used deliberately, it’s been quite effective: https://robos.rnsu.net

I agree that it can be very useful when used like that, but I'm referring to fully vibe-coding: the "I've never looked at the code" people. CC is a great tool when you plan carefully, review its work, etc., but people are building things they say they've never read the code for, and that just hasn't been my experience; it always falls over on its own if I'm not in the code reviewing/tweaking.

> Keep in mind that Steve has LLMs write his posts on that blog.

Ok, I can accept that, it's a choice.

> Things said there may not reflect his actual thoughts on the subject(s) at hand.

Nope, you don't get to have it both ways. LLMs are just tools; there is always a human behind them, and that human is responsible for what they let the LLM do/say/post/etc.

We have seen the hell that comes from playing the "They said that but they don't mean it" or "It's just a joke" game (re: Trump); I'm not a fan of whitewashing with LLMs.

This is not an anti or pro Gas Town comment, just a comment on giving people a pass because they used an LLM.


Do you read that as giving him a pass? I read it as more of a condemnation. If you have an LLM write "your" blog posts then of course their content doesn't represent your thoughts. Discussing the contents of the post then is pointless, and we can disregard it entirely. Separately we can talk about what the person's actual views might be, using the fact that he has a machine generate his blog posts as a clue. I'm not sure I buy that the post was meaningfully LLM-generated though.

The same approach actually applies to Trump and other liars. You can't take anything they say as truth or serious intent on its own; they're not engaging in good faith. You can remove yourself one step and attempt to analyze why they say what they do, and from there get at what to take seriously and what to disregard.

In Steve's case, my interpretation is that he's extremely bullish on AI and sees his setup or something similar as the inevitable future, but he sprinkles in silly warnings to lampshade criticism. That's how the two messages of "this isn't serious" and "this is the future of software development" co-exist. The first is largely just a cover and an admission that his particular project is a mess. Note that this interpretation assumes that the contents of the blog post in question were largely written by him, even if LLM assistance was used.


Hmm, maybe I read the original comment wrong then? I did read it as "You can't blame him, that might not even be what he thinks" and my stance is "He posted it on his blog, directly or indirectly, what else am I supposed to think?".

I agree with you on Steve's case, and I have no ill will towards him. Mostly it was just me trying to "stomp" on giving him a pass, but, as you point out, that may not have been what the original commenter meant.


I'm building a house currently and I really wish there were more options to have the things I want without needing all the extra space in places I don't care about. The problem is, even if I was able to build such a house (I'm using a large builder, this is not a fully custom house) the resale prospects would be poor.

I missed that thread originally; the post and the comments were a good read, thank you for sharing.

I got a kick out of this comment [0]. "BenjiWiebe" made a comment about the SSH packets you stumbled across in that thread. Obviously making the connection between what you were seeing in your game and this random off-hand comment would be insane (if you had seen the comment at all), but I got a smile out of it.

[0] https://news.ycombinator.com/item?id=46366291


wow, I missed that comment, that's an incredible connection. Thank you!

First time I've been reading on HN and come across my name randomly.

I wanted to look into their pricing for Devin+ and I have to say, ACUs are entirely too opaque/confusing/complicated. The entire description of them is shrouded in mystery. And this part confuses me even more:

> Aside from the few ACUs required to keep the Devin VM running, Devin will not consume ACUs when:

> Waiting for your response

> Waiting for a test suite to run

> Setting up and cloning repositories

Ok, that kind of makes sense, but what does "the few ACUs required to keep the Devin VM running" mean? These cost $2.50 each, so "a few" means $5+, but on what time scale? Daily? Monthly?

The lowest plan comes with $20 of ACUs, but they don't list anywhere how far that gets you, or even rough examples. I guess if you want to kick the tires $20 isn't a crazy amount to test it out yourself, and maybe I'm just not the target market (I kind of feel like I am though?), but I wish their pricing made sense.


Disclaimer: No disrespect meant towards FreeBSD or the maintainers.

I currently work on FreeBSD servers pretty much exclusively for my job and I have a really hard time grokking why I would want to use them over some flavor of Linux. I also work (and have worked throughout my career) with Linux servers (Ubuntu and Debian primarily, and things like Alpine in Docker) and there isn't anything I do where I think "I wish I was on FreeBSD". The opposite is not true: I semi-regularly pine for X tool or Y program that doesn't run on FreeBSD (or is harder to run).

It's very possible that I am just not using/experiencing the full power of FreeBSD (as in: I'm too dumb to know how great it is), but if I made pro/con columns for FreeBSD I could think of a number of cons and very few pros that Linux doesn't share. Again, there is a very good chance that I'm "holding it wrong", but I've heard "oh, but not on FreeBSD" or "Hmm, they don't support FreeBSD" about too many things that might have solved issues we've run into at my job.

Maybe I'm boring or maybe I'm just lazy, but I feel like Linux is the path of least resistance: it has the most info available online, the most guides, blog posts, LLM training, etc.

I'd be interested to hear what people on HN like best about FreeBSD so I can see if it applies to my usage or not and to see if I can't learn new tips/tricks.


BSD can be a better choice for a variety of reasons. First, business reasons: the BSDs have more permissive licenses than Linux's GPL, which compels you to share any modifications you make to the software. The BSD license lets you modify the source code without releasing your changes, which is why many embedded devices like routers/firewalls use BSD over Linux. That, and BSD is fast at networking.

It also has better storage (ZFS). Although ZFS is now available on Linux, it is not as mature there as on FreeBSD, which has shipped it as a first-class filesystem for many years.


I run most of my personal network infra (routers, DNS servers, etc.) on FreeBSD because I have been running it on FreeBSD since the late 90's, and have never had any reason to change it.

In all that time, I've never felt like I suffered from lack of information on how to get things done: the documentation is generally good, and I've always been able to fill in any missing details by reading shell scripts and, very very rarely, source code.


While the "better" security than Linux argument is weak, the FreeBSD/OpenBSD OS network packet handling is extremely good (common OS for routers etc.) =3

I ended up switching from FreeBSD to Linux twice (TrueNAS CORE -> TrueNAS SCALE, OPNsense -> OpenWrt) due to poor network performance on FreeBSD. I could just never get 10 Gbps throughput on FreeBSD, whereas Linux on the same hardware was fine. This was across Intel and Mellanox NICs, so not a specific driver issue.

Usually Linux can enable a vendor's direct packet-handling driver as closed-source firmware that bypasses the kernel almost completely once the user connection is set up. That was the most economical way to handle several SFP-DD transceivers at saturation in a normal host. ymmv

There are probably better solutions around these days. =3


For me it’s the amazing ports and pkg system.

I use Arch for superior hardware support on the laptop and FreeBSD on the server for superior software management.


The typical touted benefit is the native first-party ZFS support.

Mine: It's not Linux. Linux feels like a heavyweight. Compiling a kernel is tedious. If a service fails, systemd breaks, which is a PITA to fix ("Waiting for X/Y to quit..."), and NetworkManager is archaic.

I've found that on Red Hat-based distros you have to at least enable different repos (EPEL, RPM Fusion, EL) just to get the packages required. On Debian you're already out of date, but that's for security, so fair enough. It's all under corporate control: Ubuntu (Canonical) is corporate, anything Red Hat (IBM) is corporate. You try to look online for a reason why SSSD is failing and the actual answers are hidden behind a paywall on redhat.com.

We have aggressive HP machines designed for Windows with RTX 4000s, which get used for rendering. They get thrashed, and for the studio to obtain further TPN status I am moving from Windows to Linux. The struggle on a good day to operate with them is insanity. I'm now drinking 2x double-shot lattes a day, up from just a single double shot. Next it will be whisky; some days I have snuck in a shot of Mezcal before work in hopes the Mezcal gods save the day.

FreeBSD handles them like a champ. TPN doesn't recognize FreeBSD so it has to be Rocky Linux.

I needed a PXE server, and this shop only had an old 2009 Mac mini left over in the cupboard. It does the job; 100Mbit is fine for provisioning, and if I want more I'll just use a USB Ethernet dongle. Linux failed. FreeBSD booted off a memory stick and has been working flawlessly. I now have a working PXE server coded in TCL and running on FreeBSD. It's glorious, and because of that I've now been told going forward all my future creations must be Python. Urgh, but fair enough, TCL is niche.

ZFS <3, why the hell TrueNAS went Linux is beyond my grasp.

I run FBSD 16 (bleeding edge) on my main rig, 4x screens: 2x 27" 4K and 2x 27", all working flawlessly with Xorg.

Jails are fantastic: my web browsers never touch the OS, and at any point I can torch them and roll back to a clean snapshot. Thanks, ZFS.
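
The whole loop is just two ZFS commands (the dataset name here is made up):

    # snapshot the browser jail's dataset while it's clean
    zfs snapshot zroot/jails/browser@clean
    # later: torch the session and roll back (-r discards newer snapshots)
    zfs rollback -r zroot/jails/browser@clean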

Four of my colocated servers are running FreeBSD. Two of them have over 1000 days uptime.

    mookie@cookie:~ $ uname -a && uptime
    FreeBSD cookie.server 12.2-BETA1 FreeBSD 12.2-BETA1 r365618 GENERIC  amd64
    10:39PM  up 1699 days,  1:31, 1 user, load averages: 0.64, 1.30, 1.31
My laptop (MSI Modern 2015), which works flawlessly including suspend, serves as my media TV station with Bluetooth audio streaming to my sound bar through a 3rd-party HDMI transmitter. It runs FreeBSD.

I didn't see you give any reason why you don't like FreeBSD, because what you can do on Linux, you can do on FreeBSD.

./configure, make, make install. Nothing else is required, unless you want docker. Then eww, go away.

My life as a FreeBSD admin has been a large weight off my shoulders. And I was there when Linux was on the 2.x kernel branch & you had to write your own X config without internet, at the age of 13. If it wasn't for Minix pissing off Linus, Linux wouldn't have existed. The only distribution, if forced, would be Slackware.


RCS is such trash. It's amazing that people fell for Google's BS in pushing Apple to implement it. I imagine that in the near future I will just disable it on my phone if I start getting spam. I push all my Android friends to use other messaging platforms; even with RCS it's a crap-shoot and pictures still come through looking like it's the year 2000.

RCS was a bad idea literally from day 1 and I do not understand why so many people thought it was worth pursuing. I mean, other than Google, since they effectively own the "standard": finally, after an untold number of failed messaging projects, they have something they strong-armed other idiots into using.


Interesting. I'm on 3x 27" 2K monitors (same setup as you: portrait, landscape, portrait) and while it works very well for me, I'd like to replace it with just 1 screen (or 3x 4-5K monitors, but that is less interesting to me). I already have custom window management software that I use, so it wouldn't be hard to switch to sub-dividing 1 monitor to get a similar experience (I think).

Maybe I should look into the 40" 5K monitors, thanks!


Losing the bezel is great, and the Dell 4025qw that I have also has an IPS Black panel, which is a vast improvement over what I had before: a Dell U27-something (4K IPS) and a 3219Q (4K IPS). And it's 120Hz. I really enjoy it.

By having fewer pixels, lower quality screens? Crazy what you can do when you cut corners.

This screen reminds me of when I did tech support in high school and helped a guy who bragged about his computer monitor: it was a massive TV running at 720p (if not lower). The Windows start bar was hilariously large (as were all UI elements); I had to just smile and nod until I got out of there.

Sure, your screen may be bigger but it's blurry and everything is scaled way too large.


> By having fewer pixels

I thought samdixon was referencing the Apple Pro Display XDR? If so, Apple has fewer pixels.

Apple Pro XDR: 6016 x 3384

Kuycon G32P: 6144 x 3456


> everything is scaled way too large

The HiDPI/Retina bullshit is just bullshit. I've been running a 4K 43" 4:3 display at 100% scaling since 2018. It is neither blurry nor scaled too large. It can, however, comfortably fit 10 A4 pages simultaneously. Or 4 terminals + a browser + a PDF reader.


My arithmetic nodule is having a conniption fit. Does not compute. If this is 16:9 and you mistook your aspect ratio, I can breathe again. A4 is √2:1, i.e. 1.41:1, which isn't 1.33:1.

10 A4 pages do not fill a 4:3 or 3:4 aspect ratio box. They don't fill a 16:9 box either, but it's more plausible; the wastage is different.
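
To make it concrete (a quick Python check; A4 is 210x297 mm):

    # Aspect-ratio sanity check: can ten A4 pages tile these screens?
    a4 = 297 / 210                    # portrait A4, ~1.414 (sqrt(2):1)
    grid = (5 * 210) / (2 * 297)      # ten pages as a 5-wide, 2-tall grid
    print(f"A4: {a4:.3f}, 5x2 grid: {grid:.3f}")   # 1.414, 1.768
    print(f"16:9 = {16/9:.3f}, 4:3 = {4/3:.3f}")   # 1.778, 1.333

A 5x2 grid of portrait A4 lands at about 1.77:1, nearly dead-on 16:9, which is why the claim is plausible on a 16:9 panel and not on 4:3.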


My comment (or at least that quote) was specifically about someone using a 30"+ TV at 720p as their computer monitor.
