What do minutes and hours even mean? Comparing software by absolute time duration is meaningless without a description of the system it ran on, e.g. SHA256 hashes per second on a Win10 OS with an i7-14100 processor. For a product as complex as a multiuser TB-sized LLM, compute time depends on everything from the VM software stack to the physical networking and memory caching architecture.
CPU/GPU cycles, FLOPs, IOPs, or even joules would be superior measurements.
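For what it's worth, here's a minimal Python sketch (stdlib only, iteration count arbitrary) of why the distinction matters: time.process_time() counts only CPU seconds the process actually consumed, which at least tracks cycles, whereas wall-clock time also counts scheduler and I/O waits. And even then, the hashes/sec figure means nothing without naming the machine it ran on.

    import hashlib
    import time

    payload = b"x" * 64
    n = 1_000_000

    # CPU seconds this process consumed, excluding sleeps and waits
    start = time.process_time()
    for _ in range(n):
        hashlib.sha256(payload).digest()
    cpu_seconds = time.process_time() - start

    print(f"{n / cpu_seconds:,.0f} SHA256 hashes/sec (CPU time, this machine only)")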
These are API calls to a remote server. We have no option to scale the hardware, or even to measure the compute used to serve each request, so for better or worse the server cluster has to be measured as part of the model service offering.
You're right about local software comparisons, but this is different. If I'm comparing two SaaS platforms, wall-clock time to complete the same task is a fair metric. The only caveat is if the service offers some kind of tiered performance pricing, like if we were comparing a task on an AWS EC2 instance vs an Azure VM instance, but that is not the case with these LLMs.
So yes, the wall-clock time may not reflect the performance of the model itself, but it does reflect the performance of the SaaS offering.
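To illustrate, a rough sketch of the comparison I'm describing, assuming the requests library; the endpoint URLs, provider names, and payload shape are hypothetical placeholders, not real APIs:

    import statistics
    import time

    import requests

    # Hypothetical endpoints standing in for two competing LLM SaaS offerings
    ENDPOINTS = {
        "provider_a": "https://api.provider-a.example/v1/chat",
        "provider_b": "https://api.provider-b.example/v1/chat",
    }
    PROMPT = {"messages": [{"role": "user", "content": "Summarize RFC 2616."}]}

    def median_wall_clock(url, payload, runs=10):
        """Median wall-clock seconds per request, network latency included."""
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            requests.post(url, json=payload, timeout=120)
            samples.append(time.perf_counter() - start)
        return statistics.median(samples)

    for name, url in ENDPOINTS.items():
        print(f"{name}: {median_wall_clock(url, PROMPT):.2f}s median")

Whatever compute, batching, or queueing sits behind each URL gets bundled into that number, which is exactly the point.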