To add on to what the sibling said — setting aside that modern CISC chips have a frontend that breaks complex instructions down into an internal RISC-like micro-op set, which blurs the distinction — RISC instruction sets do tend to win on performance and power largely because the encoding has a fixed width. That means once you fetch a cache line, say a 128-byte line of 4-byte instructions, you can start decoding all 32 instructions in parallel, whereas x86's variable-length encoding makes it much harder to keep a superscalar pipeline full (its decoder is significantly more complex in order to extract the same parallelism, which costs power and latency). It's a bit more complicated on ARM (and RISC-V with the compressed extension), where you can have two widths, but even then it's easier to extract performance than on x86, where an instruction can be anywhere from 1 to 15 bytes, which makes it hard to find instruction boundaries in parallel.
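A toy sketch of the boundary-finding difference (made-up encodings, not real ISAs — the point is just the data dependency, not the bit layout):

```python
# Toy illustration: why fixed-width decode parallelizes trivially
# while variable-length decode has a serial dependency chain.

# Fixed-width (RISC-like, 4-byte): every instruction boundary is known
# the instant the line arrives, so all of them can be fed to decoders
# at once -- no decoding needed just to find the starts.
def fixed_width_boundaries(line_bytes, width=4):
    return list(range(0, len(line_bytes), width))

# Variable-length (x86-like): the length of instruction N depends on
# its own bytes, so you only learn where instruction N+1 starts after
# at least length-decoding instruction N.
def variable_length_boundaries(line_bytes, length_of):
    offsets, pos = [], 0
    while pos < len(line_bytes):
        offsets.append(pos)
        pos += length_of(line_bytes, pos)  # must decode before advancing
    return offsets

line = bytes(128)  # pretend 128-byte cache line

# All 32 boundaries known immediately:
print(fixed_width_boundaries(line))

# Hypothetical length decoder: pretend the first byte encodes a
# length of 1..15 (real x86 length decoding is far messier).
fake_len = lambda b, i: (b[i] % 15) + 1
print(variable_length_boundaries(line, fake_len))
```

The `while` loop is the whole problem: in hardware that chain is what the extra x86 decoder complexity exists to break up.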
There’s a reason that Apple is whooping AMD and Intel on performance/watt and it’s not solely because they’re on a newer fab process (it’s also why AMD and Intel utterly failed to get mobile CPU variants of their chips off the ground).
Any reading resources? I’d love to learn more about the techniques they use to get better parallelism. The most obvious solution I can imagine is brute force: start decoding at every possible boundary and rely on either hitting an invalid instruction or late-latching the result until it’s confirmed that it was a valid instruction boundary. Is that generally the technique, or are they doing even more than that? The downside of this approach, of course, is that you risk wasting energy and execution units on phantom instructions versus an architecture that didn’t have as much phantom potential in the first place.
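Roughly what I’m imagining, as a toy sketch (hypothetical length decoder, not real x86 — `length_of` here stands in for whatever the hardware does per byte offset):

```python
# Sketch of the "speculatively decode at every offset, then keep only
# the real ones" idea. Toy encoding, purely illustrative.
def speculative_decode(line_bytes, length_of):
    # Step 1 (parallel in hardware): tentatively length-decode at EVERY
    # byte offset, as if each were an instruction start. Most of these
    # are phantoms whose work gets thrown away -- the energy cost.
    lengths = [length_of(line_bytes, i) for i in range(len(line_bytes))]

    # Step 2 (a much cheaper serial pass, since the lengths are already
    # computed): starting from the one known-good boundary at offset 0,
    # chase the length chain to mark which speculative decodes were real.
    valid, pos = set(), 0
    while pos < len(line_bytes):
        valid.add(pos)
        pos += lengths[pos]
    return valid

# Hypothetical length decoder: first byte encodes a length of 1..15.
fake_len = lambda b, i: (b[i] % 15) + 1
print(sorted(speculative_decode(bytes(range(64)), fake_len)))
```

The win is that step 1 has no serial dependency at all, and step 2 is just chasing precomputed pointers rather than doing full decodes one after another.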