And as far as I can see, it's a total waste of silicon. Anything running on it will be so underpowered anyway that it doesn't matter. It'd be better to dedicate the transistors to the GPU.
The latest Ryzen mobile CPU line didn't improve performance over its predecessor (the integrated GPU is actually worse), and I suspect the die area spent on the NPU is to blame.
If you ask NVIDIA, inference should always run on the GPU. If you ask anybody else designing chips for consumer devices, they say there's a benefit to having a low-power NPU that's separate from the GPU.
Okay, yeah, and those are both obvious reflections of the manufacturers' market positions, independent of the merits. What do people who actually run inference say?
(Also, NPUs usually aren't any more separate from the GPU than tensor cores are from an NVIDIA GPU; they're integrated on the same die as the CPU and iGPU.)
If you're running an LLM, there's a benefit in shifting prompt pre-processing (prefill) to the NPU. More generally, anything that's memory-throughput-limited should stay on the GPU, while the NPU can help with compute-limited work to at least some extent.
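To put rough numbers on the compute-limited vs. memory-limited split, here's a back-of-envelope sketch. Everything in it is an illustrative assumption (a hypothetical 4096-wide model with fp16 weights), not a benchmark:

```python
# Arithmetic intensity of a transformer matmul: FLOPs done per byte of
# weights read. High intensity = compute-bound, low = memory-bound.
# All numbers are illustrative assumptions, not measurements.

def matmul_intensity(batch_tokens, d_model, bytes_per_weight=2):
    """FLOPs per weight byte for a (tokens x d) @ (d x d) matmul."""
    flops = 2 * batch_tokens * d_model * d_model         # multiply-accumulates
    weight_bytes = d_model * d_model * bytes_per_weight  # weights read once
    return flops / weight_bytes

d = 4096  # hypothetical model width

# Prefill: hundreds of prompt tokens amortize each weight read.
print(f"prefill (512 tokens): {matmul_intensity(512, d):,.0f} FLOPs/byte")

# Decode: one new token per step, so every weight byte buys ~1 FLOP.
print(f"decode  (1 token):    {matmul_intensity(1, d):,.0f} FLOPs/byte")
```

At ~1 FLOP per weight byte, decode is pinned to memory bandwidth no matter how much compute the NPU brings; prefill's hundreds of FLOPs per byte is where extra compute can actually help.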
The general problem with NPUs on memory-limited tasks is that either the memory throughput available to them is too low to begin with, or they're constrained to a small set of data formats, so newer quantized models need wasteful padding/dequantizing in memory before the NPU can read them, whereas a GPU just dequantizes in local registers.
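As a rough illustration of that padding/dequantizing cost, here's a sketch of the extra weight traffic when 4-bit weights have to be expanded in memory before the accelerator can consume them. The 7B parameter count and the target formats are assumptions for the example:

```python
# Memory-traffic penalty when 4-bit quantized weights must be padded or
# dequantized to a wider format in memory, instead of being unpacked in
# registers at read time. All figures are illustrative assumptions.

def weight_traffic_gb(n_params, bits_per_weight):
    """GB of memory traffic for one full pass over the weights."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # hypothetical 7B-parameter model

q4_native   = weight_traffic_gb(n, 4)   # GPU: reads Q4, dequantizes in registers
padded_int8 = weight_traffic_gb(n, 8)   # NPU forced to an 8-bit format
padded_fp16 = weight_traffic_gb(n, 16)  # or worse, a 16-bit format

print(f"native 4-bit reads : {q4_native:.1f} GB per weight pass")
print(f"padded to 8-bit    : {padded_int8:.1f} GB ({padded_int8/q4_native:.0f}x)")
print(f"padded to 16-bit   : {padded_fp16:.1f} GB ({padded_fp16/q4_native:.0f}x)")
```

For a memory-bound decode pass, that 2-4x extra traffic translates almost directly into 2-4x lower token throughput.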