My use case has been trying to remove the damn "apologies for this" and extraneous language that just waste tokens for no reason. GPT has always always always been so quick to waffle.
And removing the chat interface as much as possible. Many benchmark scores are better with text-completion models, but they keep insisting on this horrible chat interface for their models.

Fine-tuning is there to ensure you get the output format you want without the extra garbage. I swear they have tuned their models to waste tokens.
It turns out if you generate two LLM responses and ask a judge to choose which is better, many judges have a bias in favour of long answers full of waffle.
> use of [LLMs] as judges [..] reveals a notable bias towards longer responses, undermining the reliability of such evaluations. To better understand such bias, we propose to decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass [..]
(If you're interested, give it a click. I tried to pare this down to avoid quoting a wall of text.)
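To make the bias concrete, here's a toy simulation (not the paper's method, just an illustration): two models produce answers of identical quality, but the judge's preference is nudged by relative length. The `biased_judge` function, the `length_bias` strength, and the length ranges are all made-up assumptions for the sketch.

```python
import random

random.seed(0)

def biased_judge(len_a, len_b, length_bias=0.3):
    """Toy judge: picks A with probability that grows with A's relative length.
    Both answers are assumed equally good, so any preference is pure length bias."""
    p_a = 0.5 + length_bias * (len_a - len_b) / max(len_a, len_b)
    return "A" if random.random() < p_a else "B"

# Model A waffles (300-500 tokens), model B is terse (80-150 tokens).
trials = 10_000
wins_a = sum(
    biased_judge(random.randint(300, 500), random.randint(80, 150)) == "A"
    for _ in range(trials)
)
win_rate = wins_a / trials
print(f"verbose model win rate: {win_rate:.2f}")
```

Even with zero quality difference, the verbose model's win rate lands well above 0.5, which is exactly why raw win rate alone is an unreliable preference metric.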
I've heard the theory a few times lately that AI businesses will increasingly move toward usage-based pricing over subscriptions. So while the verbosity is probably accidental, it could also be a longer-term strategy to normalize excessive token usage.
I don't know whether the major AI companies will move to usage models. But let's assume that they do.
However: I would expect chat interfaces to be charged per query, not per token. End users don't understand tokens, and don't want to have to understand tokens.
If you charge per query, you don't gain anything from extra wordy responses.