
> Because memory bandwidth is the #1 bottleneck for inference, even more than capacity.

But there are a ton of models I can't run locally at all due to VRAM limitations. I'd happily take being able to run those models more slowly. I know there are ways to run them on the CPU, orders of magnitude slower, but ideally there's some middle ground.



You can load giant models into ordinary system RAM, e.g. on an Epyc box, but they're still mostly bottlenecked by the lower memory bandwidth.
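
Rough back-of-the-envelope sketch (my own illustrative numbers, not from the thread): during decoding, each generated token has to stream the active weights through memory, so tokens/s is capped at roughly bandwidth divided by the bytes of weights touched per token.

    # Illustrative assumptions: ~460 GB/s theoretical bandwidth for a
    # 12-channel DDR5 Epyc, and ~40 GB of weights for a 70B model
    # quantized to ~4 bits.
    bandwidth_gb_s = 460   # assumed system memory bandwidth
    weights_gb = 40        # assumed weight bytes read per generated token
    tokens_per_s = bandwidth_gb_s / weights_gb
    print(f"bandwidth-bound ceiling: ~{tokens_per_s:.1f} tokens/s")
    # ~11-12 tokens/s best case; real throughput is lower, and a GPU with
    # 1 TB/s+ of VRAM bandwidth clears that ceiling easily.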


You can offload tensors to CPU memory. It will make your model run much slower, but it will work.
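
A minimal sketch of one way to do that with Hugging Face transformers + accelerate (the model id and memory caps below are placeholders I'm assuming for illustration, adjust for your hardware):

    # Sketch: split a model between GPU VRAM and CPU RAM.
    # Requires `pip install transformers accelerate`.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-70b-hf"   # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",                        # let accelerate place layers
        max_memory={0: "22GiB", "cpu": "96GiB"},  # cap GPU 0; overflow goes to RAM
    )

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

llama.cpp does the same thing with its -ngl / --n-gpu-layers flag: put as many layers as fit in VRAM and leave the rest on the CPU.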



