
> Because memory bandwidth is the #1 bottleneck for inference, even more than capacity.

But there are a ton of models I can't run locally at all due to VRAM limitations. I'd happily take being able to run those models more slowly. I know there are ways to run them on the CPU, orders of magnitude slower, but ideally there's some middle ground.



You can load giant models into ordinary system RAM, e.g. on an Epyc box, but they're still mostly bottlenecked by the lower memory bandwidth.
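
Rough back-of-the-envelope sketch (my own illustrative numbers, not from the thread): during decoding, each generated token has to stream the active weights through memory, so tokens/s is capped at roughly bandwidth divided by the bytes of weights touched per token.

    # Illustrative assumptions: ~460 GB/s theoretical bandwidth for a
    # 12-channel DDR5 Epyc, and ~40 GB of weights for a 70B model
    # quantized to ~4 bits.
    bandwidth_gb_s = 460   # assumed system memory bandwidth
    weights_gb = 40        # assumed weight bytes read per generated token
    tokens_per_s = bandwidth_gb_s / weights_gb
    print(f"bandwidth-bound ceiling: ~{tokens_per_s:.1f} tokens/s")
    # ~11-12 tokens/s best case; real throughput is lower, and a GPU with
    # 1 TB/s+ of VRAM bandwidth clears that ceiling easily.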


You can offload tensors to CPU memory. It will make your model run much slower, but it will work.
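
A minimal sketch of one way to do that with Hugging Face transformers + accelerate (the model id and memory caps below are placeholders I'm assuming for illustration, adjust for your hardware):

    # Sketch: split a model between GPU VRAM and CPU RAM.
    # Requires `pip install transformers accelerate`.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-70b-hf"   # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",                        # let accelerate place layers
        max_memory={0: "22GiB", "cpu": "96GiB"},  # cap GPU 0; overflow goes to RAM
    )

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

llama.cpp does the same thing with its -ngl / --n-gpu-layers flag: put as many layers as fit in VRAM and leave the rest on the CPU.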



