My use case has been trying to remove the damn "apologies for this" and extraneous language that just waste tokens for no reason. GPT has always always always been so quick to waffle.
And removing the chat interface as much as possible. Many benchmark scores are better with text-completion models, but they keep insisting on this horrible chat interface for their models.

Fine-tuning is there to ensure you get the output format you want without the extra garbage. I swear they have tuned their models to waste tokens.
It turns out if you generate two LLM responses and ask a judge to choose which is better, many judges have a bias in favour of long answers full of waffle.
> use of [LLMs] as judges [..] reveals a notable bias towards longer responses, undermining the reliability of such evaluations. To better understand such bias, we propose to decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass [..]
(If you're interested, give it a click. I tried to pare this down to avoid quoting a wall of text.)
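To make the bias concrete, here's a toy simulation (not the paper's method, just an illustration): two models produce answers of identical quality, but the judge's preference is nudged by relative length. The `biased_judge` function, the `length_bias` strength, and the length ranges are all made-up assumptions for the sketch.

```python
import random

random.seed(0)

def biased_judge(len_a, len_b, length_bias=0.3):
    """Toy judge: picks A with probability that grows with A's relative length.
    Both answers are assumed equally good, so any preference is pure length bias."""
    p_a = 0.5 + length_bias * (len_a - len_b) / max(len_a, len_b)
    return "A" if random.random() < p_a else "B"

# Model A waffles (300-500 tokens), model B is terse (80-150 tokens).
trials = 10_000
wins_a = sum(
    biased_judge(random.randint(300, 500), random.randint(80, 150)) == "A"
    for _ in range(trials)
)
win_rate = wins_a / trials
print(f"verbose model win rate: {win_rate:.2f}")
```

Even with zero quality difference, the verbose model's win rate lands well above 0.5, which is exactly why raw win rate alone is an unreliable preference metric.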
I've heard the theory a few times lately that AI businesses will increasingly move toward usage-based pricing over subscriptions. So while the verbosity is probably accidental, it could also be a longer-term strategy to normalize excessive token usage.
I don't know whether the major AI companies will move to usage models. But let's assume that they do.
However: I would expect chat interfaces to be charged per query, not per token. End users don't understand tokens, and don't want to have to understand tokens.
If you charge per query, you don't gain anything from extra wordy responses.