Hacker News

> What's the metric?

Language model capability at generating text output.

The model progress this year has been a lot of:

- “We added multimodal”

- “We added a lot of non-AI tooling” (i.e., agents)

- “We put more compute into inference” (i.e., thinking mode)

So yes, there is still rapid progress, but these points make it clear, at least to me, that next-gen models are significantly harder to build.

Simultaneously, we see a distinct narrowing between players (OpenAI, DeepSeek, Mistral, Google, Anthropic) in their offerings.

That's usually a signal that the rate of progress is slowing.

Remind me what was so great about GPT-5? How about GPT-4 over GPT-3?

Do you even remember the releases? Yeah. I don't. I had to look it up.

Just another model with more or less the same capabilities.

“Mixed reception”

That is not what exponential progress looks like, by any measure.

The progress this year has been in the tooling around the models, and in smaller, faster models with similar capabilities. Multimodal add-ons that no one asked for, because it's easier to add image and audio processing than to improve text handling.

That may still be on a path to AGI, but it's not an exponential path to it.





> Language model capability at generating text output.

That's not a metric; that's a vague, non-operationalized concept that could be operationalized into an infinite number of different metrics. And an improvement that was linear in one of those possible metrics would be exponential in another one (in fact, one that was linear in one metric would also be linear in an infinite number of others, as well as exponential in an infinite number of others).

That’s why you have to define an actual metric, not simply describe a vague concept of a kind of capacity of interest, before you can meaningfully discuss whether improvement is exponential. Because the answer is necessarily entirely dependent on the specific construction of the metric.


I don’t think the path was ever exponential, but your claim here reads almost as if the slowdown hit an asymptote-like wall.

Most of the improvements are intangible. Can we truly say how much more reliable the models are? We barely have quantitative measurements on this, so it’s all vibes and feels. We don’t even have a baseline metric for what AGI is, and we invalidated the Turing test based on vibes and feels too.

So my argument is that part of the slowdown is itself a hallucination, because the improvement is not actually measurable or definable outside of vibes.


I kind of agree in principle, but there are a multitude of clever benchmarks that try to measure lots of different aspects: robustness, knowledge, understanding, hallucinations, tool-use effectiveness, coding performance, multimodal reasoning and generation, etc. All of these have lots of limitations, but together they paint a pretty compelling picture that complements the “vibes”, which are also important.

> Language model capability at generating text output.

How would you put this on a graph?


> Language model capability at generating text output.

That's not a quantifiable sentence. Unless you put it in numbers, anyone can argue exponential/not.

> next gen models are significantly harder to build.

That's not how we judge capability progress though.

> Remind me what was so great about GPT-5? How about GPT-4 over GPT-3?

> Do you even remember the releases?

At the GPT-3 level we could generate some reasonable code blocks / tiny features. (An example shown around at the time was "explain what this function does" for a "fib(n)".) At GPT-4, we could build features and tiny apps. At GPT-5, you can often one-shot whole apps from a vague description. The difference between them is massive for coding capabilities. Sorry, but if you can't remember that massive change... why are you making claims about the progress in capabilities?

> Multimodal add ons that no one asked for

Not only does multimodal input training improve the model overall, it's useful for (for example) feeding back screenshots during development.


Exactly, GPT-5 was unimpressive not because of its leap from GPT-4 but because of expectations set by the string of releases since GPT-4 (especially the reasoning models). The leap from 4 -> 5 was actually massive.

Next-gen models are always hard to build; they are by definition pushing the frontier. Every generation of CPU was hard to build, but we still had Moore's law.

> Simultaneously we see a distinct narrowing between players (OpenAI, DeepSeek, Mistral, Google, Anthropic) in their offerings. That's usually a signal that the rate of progress is slowing.

I agree with you on the first part but not the second…why would convergence of performance between players indicate anything about the absolute performance improvements of frontier models?

> Remind me what was so great about GPT-5? How about GPT-4 over GPT-3? Do you even remember the releases? Yeah. I don't. I had to look it up.

3 -> 4 -> 5 were extraordinary leaps…not sure how one would be able to say anything else

> Just another model with more or less the same capabilities.

GPT-5 is absolutely not a model with more or less the same capabilities as GPT-4; what could you mean by this?

> “Mixed reception”

A mixed reception is an indication of model performance against a backdrop of market expectations, not against gpt 4…

> That is not what exponential progress looks like, by any measure.

Sure it is…exponential growth is a constant % improvement per year. We’re absolutely in that regime by a lot of measures.
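To make "constant % improvement per year" concrete, here's a minimal sketch; the 40%/year rate and the starting score of 100 are made-up illustrative numbers, not real benchmark figures:

```python
# Sketch: "exponential" just means a constant percentage improvement
# per period. The rate and starting score are hypothetical.
def growth(start, rate, years):
    """Scores after compounding a constant % gain each year."""
    return [start * (1 + rate) ** t for t in range(years + 1)]

scores = growth(100, 0.40, 5)  # hypothetical 40%/year improvement
# The year-over-year ratio stays constant -- the signature of an exponential,
# even though the absolute jump gets bigger every year.
ratios = [b / a for a, b in zip(scores, scores[1:])]
```

The point of the ratios: on an exponential curve the *ratio* between consecutive years is fixed, so "same % every year" and "exponential" are the same claim.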

> The progress this year has been in the tooling around the models, smaller faster

Effective tool use is not some trivial add-on; it is a core capability for which we are on an exponential progress curve.

> models with similar capabilities. Multimodal add-ons that no one asked for, because it's easier to add image and audio processing than improve text handling.

This is definitely a personal feeling of yours; multimodal models are not something no one asked for…they are absolutely essential. Text data is essential, and data curation is non-trivial and continually improving, but we are also hitting the ceiling of internet text data. Yet we use an incredible amount of synthetic data for RL, and this continues to grow…you guessed it, exponentially. And multimodal data is incredibly information-rich. Adding multimodality lifts all boats and provides core capabilities necessary for open-world reasoning and even better text data (e.g. understanding charts and image context for text).


> exponential is a constant % improvement per year

I suppose if you pick a low enough exponent then the exponential graph is flat for a long time, and you're right: zero progress is “exponential” if you cherry-pick your growth rate to be low enough.

Generally, though, people understand “exponential growth” as “getting better/bigger faster and faster in an obvious way”.
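For what it's worth, both readings are easy to check with a toy calculation. The 1% and 50% rates below are arbitrary, chosen only to show how differently two curves that are both technically "exponential" can behave over a decade:

```python
# Sketch: both curves below are exponential in the mathematical sense,
# but only one matches the colloquial sense. Rates are illustrative.
def trajectory(rate, years=10, start=1.0):
    # Compound a fixed percentage gain each year.
    return [start * (1 + rate) ** t for t in range(years + 1)]

slow = trajectory(0.01)  # 1%/year: barely moves over a decade (~+10%)
fast = trajectory(0.50)  # 50%/year: the "obvious" kind of exponential
```

With a 1% rate the curve is close to flat over ten years, while 50%/year compounds to well over 50x; both satisfy "constant % per year", which is exactly why arguing about "exponential" without fixing the rate and the metric goes nowhere.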

> 3 -> 4 -> 5 were extraordinary leaps…not sure how one would be able to say anything else

They objectively were not.

The metrics and the reception to them were very clear and overwhelming.

You're spitting some meaningless revisionist BS here.

You're wrong.

That's all there is to it.


It doesn’t sound like you’re really interested in any sort of rational dialogue. Metrics were “objectively” not better? What are you talking about? Of course they were; have you even looked at the benchmark progression for every benchmark we have?

You don’t understand what an exponential is, or apparently what the benchmark numbers even are, or possibly even how we actually measure model performance and the very real challenges and nuances involved, and yet I’m “spitting some revisionist BS”. You have cited zero sources and are calling measured numbers “revisionist”.

You are also citing reception to models as some sort of indication of their performance, which is yet another confusing part of your reasoning.

I do agree that the “metrics were very clear”; it just seems you don’t happen to understand what they are or what they mean.



