And as far as I can see, it's a total waste of silicon. Anything running on it will be so underpowered anyway that it doesn't matter. It'd be better to dedicate the transistors to the GPU.
The latest Ryzen mobile CPU line didn't improve performance over its predecessor (the integrated GPU is actually worse), and I suspect the die area spent on the NPU is to blame.
If you ask NVIDIA, inference should always run on the GPU. If you ask anybody else designing chips for consumer devices, they say there's a benefit to having a low-power NPU that's separate from the GPU.
Okay, yeah, and those are both obvious reflections of the manufacturers' market positions, independent of the merits. What do people who actually run inference say?
(Also, NPUs usually aren't any more separate from the GPU than tensor cores are from an NVIDIA GPU; they're integrated on the same die as the CPU and iGPU.)
If you're running an LLM, there's a benefit in shifting prompt pre-processing (prefill) to the NPU. More generally, anything that's memory-throughput-limited should stay on the GPU, while the NPU can help with compute-limited work to at least some extent.
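To put rough numbers on the compute-limited vs. memory-limited split, here's a back-of-envelope sketch. Everything in it is an illustrative assumption (a hypothetical 4096-wide model with fp16 weights), not a benchmark:

```python
# Arithmetic intensity of a transformer matmul: FLOPs done per byte of
# weights read. High intensity = compute-bound, low = memory-bound.
# All numbers are illustrative assumptions, not measurements.

def matmul_intensity(batch_tokens, d_model, bytes_per_weight=2):
    """FLOPs per weight byte for a (tokens x d) @ (d x d) matmul."""
    flops = 2 * batch_tokens * d_model * d_model         # multiply-accumulates
    weight_bytes = d_model * d_model * bytes_per_weight  # weights read once
    return flops / weight_bytes

d = 4096  # hypothetical model width

# Prefill: hundreds of prompt tokens amortize each weight read.
print(f"prefill (512 tokens): {matmul_intensity(512, d):,.0f} FLOPs/byte")

# Decode: one new token per step, so every weight byte buys ~1 FLOP.
print(f"decode  (1 token):    {matmul_intensity(1, d):,.0f} FLOPs/byte")
```

At ~1 FLOP per weight byte, decode is pinned to memory bandwidth no matter how much compute the NPU brings; prefill's hundreds of FLOPs per byte is where extra compute can actually help.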
The general problem with NPUs on memory-limited tasks is that either the memory throughput available to them is too low to begin with, or they're constrained to a small set of data formats, so newer quantized models need wasteful padding/dequantizing in memory before the NPU can read them, whereas a GPU just dequantizes in local registers.
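As a rough illustration of that padding/dequantizing cost, here's a sketch of the extra weight traffic when 4-bit weights have to be expanded in memory before the accelerator can consume them. The 7B parameter count and the target formats are assumptions for the example:

```python
# Memory-traffic penalty when 4-bit quantized weights must be padded or
# dequantized to a wider format in memory, instead of being unpacked in
# registers at read time. All figures are illustrative assumptions.

def weight_traffic_gb(n_params, bits_per_weight):
    """GB of memory traffic for one full pass over the weights."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # hypothetical 7B-parameter model

q4_native   = weight_traffic_gb(n, 4)   # GPU: reads Q4, dequantizes in registers
padded_int8 = weight_traffic_gb(n, 8)   # NPU forced to an 8-bit format
padded_fp16 = weight_traffic_gb(n, 16)  # or worse, a 16-bit format

print(f"native 4-bit reads : {q4_native:.1f} GB per weight pass")
print(f"padded to 8-bit    : {padded_int8:.1f} GB ({padded_int8/q4_native:.0f}x)")
print(f"padded to 16-bit   : {padded_fp16:.1f} GB ({padded_fp16/q4_native:.0f}x)")
```

For a memory-bound decode pass, that 2-4x extra traffic translates almost directly into 2-4x lower token throughput.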