qsort 6 months ago, on: AI agent benchmarks are broken

Not even that, see LMArena. They vaguely gesture in the general direction of the model being good, but between contamination and issues with scoring they're little more than a vibe check.