I sometimes feel like I don't produce enough code, but then I see something like this, where someone probably spent hours or days refactoring a tiny amount of code to cut the cycle count by a bit less than 25%, and I start to feel like maybe I'm producing too much code. This is a really cool article, even though I don't quite understand the nuances of moving bits between the processor and memory.
First rule of optimisation - don't.
Second rule of optimisation (experts only) - don't, yet.
However, in this case it's the hot path half the internet uses (granted, M1-specific here), and a 6% speedup is significant. Some companies rewrite their SSL libraries completely to get single-digit percentage increases in performance. Multiplied out over 50,000 machines, that's a significant business win.
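Back-of-envelope, just to illustrate the scale (my own assumed numbers, not from the article): if that hot path dominates CPU time on each machine, a 6% speedup frees up capacity roughly equivalent to thousands of machines.

    # Rough illustration with made-up assumptions: the hot path dominates CPU
    # time, so a 6% speedup translates roughly into reclaimed fleet capacity.
    machines = 50_000
    speedup = 0.06
    equivalent_machines = machines * speedup
    print(f"~{equivalent_machines:,.0f} machines' worth of capacity reclaimed")  # ~3,000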
I understand the "first rule of optimisation" (and I am sure you do, too), but I hate it because people cite it far too often as an excuse to simply ignore the performance aspects of their code (not you right now, just a general rant).
You have to know when to apply it, as in your example. Don't optimize a task that runs once an hour from 1 minute down to 50 seconds when you have another task that runs every minute, takes 10 seconds, and could be shaved down to 9.5 seconds; as the rough numbers below show, the second one saves more time overall, and you need to know how to tell which is which. Maybe don't optimize either if you have other stuff to do first.
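To make that concrete, here's a quick back-of-envelope sketch in Python (my own illustrative numbers, matching the hypothetical tasks above):

    # Total seconds saved per hour for the two hypothetical tasks above.
    # Numbers are illustrative, not measurements from the article.

    def saved_per_hour(runs_per_hour, before_s, after_s):
        """Seconds saved per hour by an optimization."""
        return runs_per_hour * (before_s - after_s)

    # Task A: runs once an hour, optimized from 60 s down to 50 s.
    task_a = saved_per_hour(runs_per_hour=1, before_s=60, after_s=50)

    # Task B: runs every minute, optimized from 10 s down to 9.5 s.
    task_b = saved_per_hour(runs_per_hour=60, before_s=10, after_s=9.5)

    print(f"Task A saves {task_a:.0f} s/hour")  # 10 s/hour
    print(f"Task B saves {task_b:.0f} s/hour")  # 30 s/hour -> do B first

Even though Task B's per-run saving is 20x smaller, it runs 60x as often, so it saves 30 s/hour versus 10 s/hour for Task A.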
I think another thing to note is that the rule is mostly about low-level optimization, which can be done later. For things like protocol design and software architecture, which also affect performance and are hard to change later, you probably want to take performance into account from the start...
I mine ETH with a lot of GPUs (let's just say a lot more than 50k). A recent release of the mining application we use on each rig brings a ~6% power (watts) improvement, even on GPUs released 6 years ago, as well as an increase in hashrate. Installing the software and tuning the GPUs at my scale has been a huge undertaking, but totally worth it.