I originally only wanted to point out that the cited benchmark doesn't actually run the same thing in both cases (the evaluation order of the two calls in `do_fib_throws(...) + do_fib_throws(...)` is unspecified), but then I looked at the assembly and noticed that the two versions are structured very differently. It turned out that GCC recognized only `do_fib_throws` as eligible for tail calls and did some more inlining, and adding `noinline` and `optimize("no-optimize-sibling-calls")` attributes reduced the gap to a more believable level (~50%). As tail calls are highly sensitive to the exact call sequence, this benchmark cannot support the claim without a detailed analysis.
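To make both points concrete, here is a rough sketch, not the benchmark's actual code. `tagged`, `sum_of_two_calls`, `fib_pinned` and the `NO_INLINE_NO_TAIL` macro are names I made up for illustration. I'm also assuming the attributes only matter on GCC, so the macro expands to nothing elsewhere.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: records which operand was evaluated first.
static std::string order_log;

static int tagged(char tag, int value) {
    order_log += tag;
    return value;
}

// The two calls may run in either order; the language only guarantees
// that they do not interleave. The sum is the same either way, but any
// side effects (such as where an exception is thrown) can differ.
int sum_of_two_calls() {
    order_log.clear();
    return tagged('a', 1) + tagged('b', 2);
}

// GCC-specific attributes (assumption: other compilers ignore or reject
// them, so they are compiled out there). `noinline` blocks inlining and
// `optimize("no-optimize-sibling-calls")` turns off tail-call
// optimization for this one function.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_INLINE_NO_TAIL \
    __attribute__((noinline, optimize("no-optimize-sibling-calls")))
#else
#define NO_INLINE_NO_TAIL
#endif

// Stand-in for the benchmarked function, pinned so that both variants
// under comparison would get the same treatment from the optimizer.
NO_INLINE_NO_TAIL int fib_pinned(int n) {
    assert(n >= 0);
    if (n < 2) return n;
    return fib_pinned(n - 1) + fib_pinned(n - 2);
}
```

After `sum_of_two_calls()`, `order_log` can legitimately be either `"ab"` or `"ba"`. Applying the same attributes to both variants of a benchmark is one way to keep the compiler from treating them differently, though checking the generated assembly is still the only way to be sure.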

Yes, result types may cause worse branch prediction, among other things. But that is rarely the primary performance issue they cause, as you would expect the "unexpected" branch to be rarely taken anyway. The actual performance issue simply comes from the fact that they put more complex code on the typical path, which may confuse a less sophisticated optimizer and prevent optimizations that would otherwise be possible, just like above.
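A small sketch of what "more complex code on the typical path" means in practice. The `Result` struct and the `parse_digit_*` / `sum_digits_*` functions are hypothetical and exist only for this comparison. Whether a given optimizer sees through the extra work is something you would have to check in the generated code.

```cpp
#include <cassert>
#include <stdexcept>

// Exception style: the typical path is just a call and an add.
int parse_digit_throwing(char c) {
    if (c < '0' || c > '9') throw std::invalid_argument("not a digit");
    return c - '0';
}

int sum_digits_throwing(const char* s) {
    assert(s != nullptr);
    int total = 0;
    for (; *s; ++s) total += parse_digit_throwing(*s);
    return total;
}

// Result style: a hypothetical minimal result type.
struct Result {
    bool ok;
    int value;
};

Result parse_digit_result(char c) {
    if (c < '0' || c > '9') return Result{false, 0};
    return Result{true, c - '0'};
}

Result sum_digits_result(const char* s) {
    assert(s != nullptr);
    int total = 0;
    for (; *s; ++s) {
        Result d = parse_digit_result(*s);
        // Extra work on every iteration, even when nothing fails:
        // build the wrapper, test the flag, unwrap the value.
        if (!d.ok) return Result{false, 0};
        total += d.value;
    }
    return Result{true, total};
}
```

The branch on `d.ok` is almost always predicted correctly, so misprediction is not the cost here. The cost is that the loop body now carries a wrapper, a check and an early return that the optimizer has to reason through before it can do anything clever with the loop.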