
Would it be realistic to buy and self-host the hardware to run, for example, the latest Llama 4 models, assuming a budget of less than $500,000?


Yes - I'm able to run Llama 3.1 405B on 3x A6000 + 3x 4090.

I'll have Llama 4 Maverick running in 4-bit quantization (which typically results in only minor quality degradation) once llama.cpp support is merged.

Total hardware cost well under $50,000.

The 2T Behemoth model is tougher, but enough Blackwell RTX 6000 Pro cards (16 of them) should be able to run it for under $200k.
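
If it helps, here's a minimal sketch of loading a 4-bit GGUF split across several GPUs with llama-cpp-python; the model filename and tensor split are placeholders, not the exact setup above:

    # Minimal sketch using llama-cpp-python; the GGUF filename and the
    # 6-way tensor split are assumptions -- adjust to your own files/cards.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Llama-4-Maverick-Q4_K_M.gguf",   # hypothetical 4-bit quant
        n_gpu_layers=-1,                  # offload all layers to the GPUs
        tensor_split=[1, 1, 1, 1, 1, 1],  # spread weights evenly across 6 cards
        n_ctx=8192,
    )

    out = llm("Explain mixture-of-experts inference in one paragraph.", max_tokens=256)
    print(out["choices"][0]["text"])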


Llama 4 Scout is a 17B x 16-expert MoE, so only 17B parameters are active per token. That makes it faster to run, but the memory requirements are still large. They claim it fits on a single H100, so under 80GB. A Mac Studio with 96GB could run this. By run I mean inference; Ollama is easy to use for that. 4x 3090 NVIDIA cards would also work, but it's not the easiest PC build. The tinybox https://tinygrad.org/#tinybox is $15k and you can do LoRA fine-tuning on it. You could also use a regular PC with 128GB of RAM, but it would be quite slow.
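
A minimal sketch of the Ollama route via its Python client; the "llama4:scout" tag is an assumption, so check what Ollama actually publishes:

    # Minimal sketch with the ollama Python client; the model tag is an
    # assumption -- run `ollama list` to see what you actually have pulled.
    import ollama

    response = ollama.chat(
        model="llama4:scout",  # assumed tag
        messages=[{"role": "user", "content": "Summarize MoE inference in two sentences."}],
    )
    print(response["message"]["content"])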


A box of AMD MI300X GPUs (1.5TB of memory) is much less than $500k, and AMD made sure to have day-zero support in vLLM.

That said, I'm obviously biased but you're probably better off renting it.
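
For reference, serving on a box like that with vLLM's offline Python API looks roughly like this; the Hugging Face model id and tensor-parallel size are assumptions:

    # Rough sketch with vLLM's offline API; the checkpoint id and
    # tensor_parallel_size=8 are assumptions for an 8-GPU node.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed HF id
        tensor_parallel_size=8,  # shard the weights across the 8 GPUs in the node
    )
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(["Why rent GPU capacity instead of buying it?"], params)
    print(outputs[0].outputs[0].text)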


You can do it with regular GPUs for less.



