The cope + disappointment will be knowing that a large population of HN users wi...

bopbopbop7 · 2026-01-01T20:25:58 1767299158

Please do provide some data for this "obvious value of coding agents". Because right now the only thing obvious is the increase in vulnerabilities, people claiming they are 10x more productive but aren't shipping anything, and some AI hype bloggers that fail to provide any quantitative proof.

aspenmartin · 2026-01-01T20:39:15 1767299955

Sure: at my MAANG company, where I watch the data closely on adoption of CC and other internal coding agent tools, most (significant) LOC are written by agents, and most employees have adopted coding agents as WAU, and the adoption rate is positively correlated with seniority.

Like a lot of things LLM related (Simon Willison's pelican test, researchers + product leaders implementing AI features) I also heavily "vibe" check the capabilities myself on real work tasks. The fact of the matter is I am able to dramatically speed up my work. It may be actually writing production code + helping me review it, or it may be tasks like: write me a script to diagnose this bug I have, or build me a streamlit dashboard to analyze + visualize this ad hoc data instead of me taking 1 hour to make visualizations + munge data in a notebook.

> people claiming they are 10x more productive but aren't shipping anything, and some AI hype bloggers that fail to provide any quantitative proof.

what would satisfy you here? I feel you are strawmanning a bit by picking the most hyperbolic statements and then blanketing that on everyone else.

My workflow is now:

- Write code exclusively with Claude

- Review the code myself + use Claude as a sort of review assistant to help me understand decisions about parts of the code I'm confused about

- Provide feedback to Claude to change / steer it away or towards approaches

- Give up when Claude is hopelessly lost

It takes a bit to get the hang of the right balance but in my personal experience (which I doubt you will take seriously but nevertheless): it is quite the game changer and that's coming from someone who would have laughed at the idea of a $200 coding agent subscription 1 year ago

Denzel · 2026-01-02T02:30:35 1767321035

We probably work at the same company, given you used MAANG instead of FAANG.

As one of the WAU (really DAU) you’re talking about, I want to call out a couple things: 1) the LOC metrics are flawed, and anyone using the agents knows this - eg, ask CC to rewrite the 1 commit you wrote into 5 different commits, now you have 5 100% AI-written commits; 2) total speed up across the entire dev lifecycle is far below 10x, most likely below 2x, but I don’t see any evidence of anyone measuring the counterfactuals to prove speed up anyways, so there’s no clear data; 3) look at token spend for power users, you might be surprised by how many SWE-years they’re spending.

Overall it’s unclear whether LLM-assisted coding is ROI-positive.

ben_w · 2026-01-02T10:53:32 1767351212

To add to your point:

If the M stands for Meta, I would also like to note that as a user, I have been seeing increasingly poor UI, of the sort I'd expect from people committing code that wasn't properly checked before going live, as I would expect from vibe coding in the original sense of "blindly accept without review". Like, some posts have two copies of the sender's name in the same location on screen with slightly different fonts going out of sync with each other.

I can easily believe the metrics that all [MF]AANG bonuses are denominated in are going up, our profession has had jokes about engineers gaming those metrics even back when our comics were still printed in books: https://imgur.com/bug-free-programs-dilbert-classic-tyXXh1d

aspenmartin · 2026-01-02T13:21:36 1767360096

Oh yes all of this I agree with. I had tried to clarify this above but your examples are clearer: my point is: all measures and studies I have personally seen of AI impact on productivity have been deeply flawed for one reason or another.

Total speed up is WAY less than 10x by any measure. 2x seems too high too.

By data alone it’s a bit unclear of impact I agree. But I will say there seems to be a clear picture that to me, starting from a prior formed from personal experience, indicates some real productivity impact today, with a trajectory that suggests these claims of a lot of SWE work being offloaded to agents over the next few years seems not that far fetched.

- adoption and retention numbers internally and externally. You can argue this is driven by perverse incentives and/or the perception performance mismatch but I’m highly skeptical of this even though the effects of both are probably really, it would be truly extraordinary to me if there weren’t at least a ~10-20% bump in productivity today and a lot of headroom to go as integration gets better and user skill gets better and model capabilities grow

- benchmark performance, again benchmarks are really problematic but there are a lot of them and all of them together paint a pretty clear picture of capabilities truly growing and growing quickly

- there are clearly biases we can think of that would cause us to overestimate AI impact, but there are also biases that may cause us to underestimate impact: e.g. I’m now able to do work that I would have never attempted before. Multitasking is easier. Experiments are quicker and easier. That may not be captured well by e.g. task completion time or other metrics.

I even agree: quality of agentic code can be a real risk, but:

- I think this ignores the fact that humans have also always written shitty code and always will; there is lots of garbage in production believe me, and that predates agentic code

- as models improve, they can correct earlier mistakes

- it’s also a muscle to grow: how to review and use humans in the loop to improve quality and set a high bar

Denzel · 2026-01-09T02:49:49 1767926989

Great response, we’re like 98% aligned at a high-level. :) These next few years will be interesting.

bopbopbop7 · 2026-01-01T20:54:53 1767300893

Anecdotes don’t prove anything, ones without any metrics, and especially at MAANG where AI use is strongly incentivized.

Evidence is peer reviewed research, or at least something with metrics. Like the METR study that shows that experienced engineers often got slower on real tasks with AI tools, even though they thought they were faster.

aspenmartin · 2026-01-01T22:16:11 1767305771

That's why I gave you data! METR study was 16 people using Sonnet 3.5/3.7. Data I'm talking about is 10s of thousands of people and is much more up to date.

Some counter examples to METR that are in the literature but I'll just say: "rigor" here is very difficult (including METR) because outcomes are high dimensional and nuanced, or ecological validity is an issue. It's hard to have any approach that someone wouldn't be able to dismiss due to some issue they have with the methodology. The sources below also have methodological problems just like METR

https://arxiv.org/pdf/2302.06590 -- 55% faster implementing HTTP server in javascript with copilot (in 2023!) but this is a single task and not really representative.

https://demirermert.github.io/Papers/Demirer_AI_productivity... -- "Though each experiment is noisy, when data is combined across three experiments and 4,867 developers, our analysis reveals a 26.08% increase (SE: 10.3%) in completed tasks among developers using the AI tool. Notably, less experienced developers had higher adoption rates and greater productivity gains." (but e.g. "completed tasks" as the outcome measure is of course problematic)

To me, internal company measures for large tech companies will be most reliable -- they are easiest to track and measure, the scale is large enough, and the talent + task pool is diverse (junior -> senior, different product areas, different types of tasks). But then outcome measures are always a problem...commits per developer per month? LOC? task completion time? all of them are highly problematic, especially because its reasonable to expect AI tools would change the bias and variance of the proxy so its never clear if you're measuring the change in "style" or the change in the underlying latent measure of productivity you care about

bopbopbop7 · 2026-01-01T22:38:44 1767307124

To be fair, I’ll take a non-biased 16 person study over “internal measures” from a MAANG company that burned 100s of billions on AI with no ROI that is now forcing its employees to use AI.

anorwell · 2026-01-02T02:04:56 1767319496

What do you think about the METR 50% task length results? About benchmark progress generally?

ben_w · 2026-01-02T11:02:33 1767351753

I don't speak for bopbopbop7, but I will say this: my experience of using Claude Code has been that it can do much longer tasks than the METR benchmark implies are possible.

The converse of this is that if those tasks are representative of software engineering as a whole, I would expect a lot of other tasks where it absolutely sucks.

This expectation is further supported by the number of times people pop up in conversations like this to say for any given LLM that it falls flat on its face even for something the poster thinks is simple, that it cost more time than it saved.

As with supposedly "full" self driving on Teslas, the anecdotes about the failure modes are much more interesting than the success: one person whose commute/coding problem happens to be easy, may mistake their own circumstances for normal. Until it does work everywhere, it doesn't work everywhere.

When I experiment with vibe coding (as in, properly unsupervised), it can break down large tasks into small ones and churn through each sub-task well enough, such that it can do a task I'd expect to take most of a sprint by itself. Now, that said, I will also say it seems to do these things a level of "that'll do" not "amazing!", but it does do them.

But I am very much aware this is like all the people posting "well my Tesla commute doesn't need any interventions!" in response to all the people pointing out how it's been a decade since Musk said "I think that within two years, you'll be able to summon your car from across the country. It will meet you wherever your phone is … and it will just automatically charge itself along the entire journey."

It works on my [use case], but we can't always ship my [use case].

aspenmartin · 2026-01-02T00:08:21 1767312501

I could have guessed you would say that :) but METR is not an unbiased study either. Maybe you mean that METR is less likely to intentionally inflate their numbers?

If you insist or believe in a conspiracy I don’t think there’s really anything I or others will be able to say or show you that would assuage you, all I can say is I’ve seen the raw data. It’s a mess and again we’re stuck with proxies (which are bad since you start conflating the change in the proxy-latent relationship with the treatment effect). And it’s also hard and arguably irresponsible to run RCTs.

All I will say is: there are flaws everywhere. METR results are far from conclusive. Totally understandable if there is a mismatch between perception and performance. But also consider: even if task takes the same or even slightly more time, one big advantage for me is that it substantially reduces cognitive load so I can work in parallel sessions on two completely different issues.

bopbopbop7 · 2026-01-02T00:28:24 1767313704

I bet it does reduce your cognitive load, considering you, in your own words "Give up when Claude is hopelessly lost". No better way to reduce cognitive load.

aspenmartin · 2026-01-02T01:03:22 1767315802

I give up using Claude when it gets hopelessly lost, and then my cognitive load increases.

Ianjit · 2026-01-01T23:59:26 1767311966

Meta internal study showed a 6-12% productivity uplift.

https://youtu.be/1OzxYK2-qsI?si=8Tew5BPhV2LhtOg0

insin · 2026-01-01T22:40:57 1767307257

> - Give up when Claude is hopelessly lost

You love to see "Maybe completely waste my time" as part of the normal flow for a productivity tool

aspenmartin · 2026-01-02T00:09:51 1767312591

That negates everything else? If you have a tool that can boost you for 80% of your work and for the other 20% you just have to do what you’re already doing, is that bad?

shimman · 2026-01-02T01:45:44 1767318344

There's a reason why sunk cost IS a fallacy and not a sound strategy.

Ianjit · 2026-01-01T23:56:35 1767311795

The productivity uplift is massive, Meta got a 6-12% productivity uplift from AI coding!

https://youtu.be/1OzxYK2-qsI?si=8Tew5BPhV2LhtOg0

ben_w · 2026-01-02T10:43:33 1767350613

> You then have massive hardware buildouts and improvements to stack + a ton of R&D + competition to squeeze the juice out of the current paradigm (there are 4 orders of magnitude of scaling left before we hit real bottlenecks)

This is a surprising claim. There's only 3 orders of magnitude between US data centre electricity consumption and worldwide primary energy (as in, not just electricity) production. Worldwide electricity supply is about 3/20ths of world primary energy, so without very rapid increases in electricity supply there's really only a little more than 2 orders of magnitude growth possible in compute.

Renewables are growing fast, but "fast" means "will approach 100% of current electricity demand by about 2032". Which trend is faster, growth of renewable electricity or growth of compute? Trick question, compute is always constrained by electricity supply, and renewable electricity is growing faster than anything else can right now.

aspenmartin · 2026-01-02T13:31:25 1767360685

This is not my own claim, it’s based on the following analysis from Epoch: https://epoch.ai/blog/can-ai-scaling-continue-through-2030

But I forgot how old that article is: it’s 4 orders of magnitude past GPT-4 in terms of total compute which is I think only 3.5 orders of magnitude from where we are today (based on 4.4x scaling/yr)

nsxwolf · 2026-01-01T20:28:04 1767299284

The nature of my job has always been fighting red tape, process, and stake holders to deploy very small units of code to production. AI really did not help with much of that for me in 2025.

I'd imagine I'm not the only one who has a similar situation. Until all those people and processes can be swept away in favor of letting LLMS YOLO everything into production, I don't see how that changes.

aspenmartin · 2026-01-01T20:41:12 1767300072

No I think that's extremely correct. I work at a MAANG where we have the resources to hook up custom internal LLMs and agents to actually deal with that but that is unique to an org of our scale.