The one time I trusted Claude to write a full class - with excessive rounds of prompts and revisions - it introduced a very subtle bug that I would never have made myself, and that took a couple of months to show up.
Fixing the bug required a root-and-branch overhaul of the class, and ended up taking more time in aggregate.
And that’s the problem: just like with self-driving cars, if it isn’t right 100% of the time you are worse off because you think it’s ok to take your hand off the wheel when it very much is not.
We’ll get full autonomy in cars before we get an LLM that can write production code reliably, and we’re still very far from that.
> get an LLM that can write production code reliably
It depends on what the user wants the code to do, and how important it is.
For example, an average, non-technical user could use this to generate a script to sort out their email, or a script for automation in MS Office VBA.
Just because it's not perfect doesn't mean it isn't useful, or that it won't improve. Tom Scott's video[0] makes a very good argument: we don't know where we are on the technology curve.
I mean systems-critical infrastructure code, where the error tolerance is zero.
For one off scripts or sketching a concept quickly it’s good enough, and for language reference it’s generally useful.
However, one thing I’ve noticed with Claude in particular is it tends to overweight the top answers in stack overflow.
The problem there is that the top answer is rarely the best answer - it tends to be overly verbose, whereas the best answer is usually the second one, which just tells you what function to call.
On multiple occasions I’ve had Claude answer a simple prompt with horribly verbose and complicated code.
Then I’ll say “what about this single call?” (e.g. the type of SO answer that gets the second-most votes), and it says “You’re right! That’s a much better answer”.
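To make that concrete, here's a made-up illustration of the pattern (reading a file, with a hypothetical "config.txt"): the verbose top-answer style versus the single call the second answer usually gives you.

    # Made-up illustration: verbose "top answer" style vs. the single call.
    from pathlib import Path

    # Verbose version: correct, but far longer than it needs to be.
    lines = []
    with open("config.txt", "r", encoding="utf-8") as f:
        for line in f:
            lines.append(line.rstrip("\n"))
    text = "\n".join(lines)

    # The "what about this single call?" version.
    text = Path("config.txt").read_text(encoding="utf-8")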
Likewise, I suggest anybody take the domain they are most knowledgeable in and pepper their LLM of choice with lots of questions to see how much it knows.
You’ll get a feel for how much you can trust it in other domains - which is “not much”.
Which is quite niche, and you'd be correct that no one would trust GPT-generated code blindly for that!
But this is a spectrum, and while I think today's GPT models don't quite make it there, I'd argue that we're closer to success here than with self-driving cars - mainly due to the larger tolerance for bad code, rather than actual tech improvements.
My problem is that even if I do that, I'm not convinced it's making me any faster. When it gets it right and I compare the time to writing it myself, I'd estimate it's maybe 20% faster. But when it gets it wrong after a few prompts and I have to write it myself anyway, it's more like 20% slower. Those two seem to average out, but then at the p90 it gets things subtly wrong in a way where I accept the code and then spend twice as much time reviewing and adjusting it as I would have spent doing it myself in the first place. So I'm not convinced it's making me any faster; if anything it feels like it's the same or a bit slower. And unlike with a junior engineer, there's no ROI on this time investment, since it's just as likely to get it wrong again the next time.
The only place I've noticed a pronounced speed-up is when I use other languages I'm not super familiar with. AI can more easily help me translate concepts from languages I do know better, and then a good old Google search is often enough to fill in the rest of the blanks for me to be reasonably productive in a way that I wouldn't be without AI.
I think it depends on the problem domain. I have to implement a lot of throwaway ideas quickly, and LLMs are really useful there.
For instance, say I want to plot a complicated Matplotlib diagram. It takes me 10+ minutes and many context switches to get the syntax right (I don't use Matplotlib enough to have all the args at the tips of my fingers). I also don't know everything Matplotlib is able to do - I haven't read the entire docs. Fortunately, LLMs have, and they get me into the right ballpark in 10-20 seconds. I usually want to try maybe 10-15 plots before settling on something, and LLMs definitely get me there much faster.
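For a sense of the boilerplate involved, here's a minimal sketch of the sort of plot I mean (data and styling are made up purely for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 10, 200)
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))

    # Top panel: a signal, plus a secondary y-axis for its cumulative sum.
    ax1.plot(x, np.sin(x), label="sin(x)", color="tab:blue")
    ax1.set_ylabel("signal")
    ax1.legend(loc="upper right")
    ax1b = ax1.twinx()
    ax1b.plot(x, np.cumsum(np.sin(x)), color="tab:orange", alpha=0.6)
    ax1b.set_ylabel("cumulative")

    # Bottom panel: a histogram with an annotated mean line.
    samples = np.random.default_rng(0).normal(size=1000)
    ax2.hist(samples, bins=40, color="tab:green", alpha=0.7)
    ax2.axvline(samples.mean(), linestyle="--", color="black")
    ax2.annotate("mean", xy=(samples.mean(), 30), xytext=(1.5, 60),
                 arrowprops=dict(arrowstyle="->"))
    ax2.set_xlabel("value")

    fig.tight_layout()
    plt.show()

Remembering twinx, annotate's arrowprops, and the rest is exactly the kind of thing I'd otherwise be looking up.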
I think if you have a clear idea of what you want to do, and how to do it, then maybe the time savings are not compelling. But if you're in a space where you're ideating and groping at an idea, LLMs can significantly cut down the iteration time and even open up new channels of inquiry that you didn't know existed.
They're primarily generative assistants. Using them to implement ideas in production is probably a secondary use.
I have a friend who recently built a SaaS from the ground up using only Replit and Claude (Claude is integrated into Replit).
He's non-technical: never built a React app, never built with Supabase, never built with Firebase (for auth), never coded a single Stripe flow.
100x might be an understatement. He built it from nothing with minimal knowledge of React, Tailwind, Supabase, Postgres, Stripe, and Firebase using Claude.
(He knew what all of the building blocks were, but had no technical coding knowledge at all.)
He legit has paying customers after under a month, and is just running it directly via Replit (not even hosted externally).
Once you have paying customers, you can hire an actual developer.
Claude is not 100x for typical software work, but the biggest gains come from precisely the 'non-typical' work that was previously impossible.
Imagine a domain expert who knows a niche super well, with all the weird edge cases and untapped demand. Hiring a developer for it doesn't work because:
1. The communication costs are too high; the developer won't know the business niche deeply enough to make a good product.
2. The niche is not profitable enough to risk hiring a developer.
Now LLMs allow the solo non-technical founder to make an MVP app and take it to market to test, for very little cost and risk. Sure, the app is not really extensible and may have to be heavily rewritten to expand and maintain, but hiring a developer at that point is a much lower-risk task.
It doesn't even reduce developer employment this way, since a ton more niche use cases are being opened up and becoming profitable enough to support developers.
As a very senior dev who has worked in a good number of startups, including YC-backed ones, I can tell you that the hardest part is rarely technical for most SaaS. It's actually validating the idea, the market, and the business model.
If he can get 10 or 20 paying customers, then it's easy to find money to fix or scale the code.
If you can't see the problem, it doesn't matter for an early startup; the only things that matter are what users complain about or request, and getting more users to pay you. Everything can be fixed later, once the idea itself has been validated.
Data breaches, financial errors, and the like are a death sentence for an early-stage company. That’s the type of error I’m talking about, not some flaky CSS.
The death knell is not solving a problem in the first place. Almost everything else is negotiable if you solve a valuable problem that people are willing to pay you to solve.
Probably loads the whole thing into Claude’s massive context window and asks it to make alterations.
I do similar things on a smaller scale with codebases in ChatGPT all the time. Half the time I still need to make small tweaks, but it’s increased my productivity tremendously.
I just wrapped up a project where I had to do a bunch of work with audio, which I’d never done before. I wouldn’t say 100x, but working at night over a handful of weeks I did what would’ve taken me months to teach myself, and some of it I probably never would have figured out. I could prototype on an easy library and go “rewrite it with librosa instead”, or “nah, I don’t really like this, let’s just do such and such with ffmpeg, that would work and probably be faster, right?”, or “find a way to do this funky thing with torchaudio and a bunch of file I/O in memory”, and then ask “how much money would I save on GCP egress if I did such and such?” It’s not always right on the first try, but omg does it save a shitload of time and energy.
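For flavor, here's a minimal sketch of the librosa side of that kind of prototyping (the filename and parameters are placeholders, not the actual project):

    import numpy as np
    import librosa

    # Load a clip at its native sample rate ("clip.wav" is a placeholder).
    y, sr = librosa.load("clip.wav", sr=None)

    # Mel spectrogram, converted from power to decibels.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)
    S_db = librosa.power_to_db(S, ref=np.max)

    print(f"{len(y) / sr:.1f}s of audio -> mel spectrogram of shape {S_db.shape}")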
I agree that it makes sense for this style of learning but only if you already know the fundamentals and are knowledgeable enough to know the right questions to ask.
>"Claude's extensive context window has also transformed their approach to handling large codebases. When the 200K context window was released, Hedley notes they "ripped out the entire RAG and just put it in the context window instead and it went from 60 percent accuracy to 98."
RAG = Retrieval Augmented Generation
Related:
What is retrieval-augmented generation, and what does it do for generative AI?:
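As a rough sketch of the difference in that quote - retrieval picks a few chunks that look relevant, while "just put it in the context window" sends everything - here is some illustrative pipeline code. call_model and the "src" directory are hypothetical stand-ins, and the keyword scoring is deliberately naive:

    from pathlib import Path

    def call_model(prompt: str) -> str:
        # Hypothetical stand-in for whatever LLM API is actually used.
        raise NotImplementedError

    # Read every Python file under a placeholder "src" directory.
    files = {p: p.read_text() for p in Path("src").rglob("*.py")}

    def answer_with_rag(question: str, top_k: int = 5) -> str:
        # Retrieval-augmented: score files by naive keyword overlap and
        # prompt with only the top few.
        scored = sorted(files.items(),
                        key=lambda kv: sum(w in kv[1] for w in question.split()),
                        reverse=True)
        context = "\n\n".join(text for _, text in scored[:top_k])
        return call_model(context + "\n\nQuestion: " + question)

    def answer_with_full_context(question: str) -> str:
        # "Ripped out the entire RAG": if the codebase fits in the
        # context window, just send all of it.
        context = "\n\n".join(f"# {path}\n{text}" for path, text in files.items())
        return call_model(context + "\n\nQuestion: " + question)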
A good deal of software development is clarifying the required specification. This alone takes a lot of work and a lot of coding! If you don't know the nitty-gritty of what you really need, you can't get it at 100x or even at 10x speed.
When it comes to financial accounting work, there is just no room for buggy or sloppy code. The customer will go away forever at the first instance of being billed incorrectly, and will also seek a refund through their credit card.
Update: I think this is an ad by Anthropic