Awesome - we'd love to have our CEO/CTO chat with you and your team if you're interested. Shoot me a note at mike.bilodeau @ baseten.co and I'll make it happen!
Earlier today I read a reddit comment[1] about a guy who tried running the quantized version from unsloth[2] on 4xH100, and the results were underwhelming (it ended up costing $137 per 1 million tokens).
They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor, the R1 Distills like Llama 70B are great and should run a lot faster, as they take advantage of existing optimizations around inference on llama-architecture models.
> They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
That's something I thought about, but it wouldn't explain much: they are roughly two orders of magnitude off in terms of cost, and only a small fraction of that could be explained by the performance of the inference engine.
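For what it's worth, here's the back-of-the-envelope that makes me say two orders of magnitude (the ~$2.50/GPU-hour H100 rental price is my assumption, not something from the reddit post):

    # What throughput does $137 per 1M tokens imply on 4xH100?
    # Assumption (mine): H100s rented at ~$2.50/GPU-hour.
    gpu_count = 4
    price_per_gpu_hour = 2.50          # USD/hour, assumed rental price
    cost_per_million_tokens = 137.0    # USD, the figure from the reddit post

    cluster_cost_per_hour = gpu_count * price_per_gpu_hour              # $10/hour
    hours_per_million_tokens = cost_per_million_tokens / cluster_cost_per_hour
    tokens_per_second = 1_000_000 / (hours_per_million_tokens * 3600)

    print(f"Implied throughput: {tokens_per_second:.1f} tokens/sec")    # ~20 tok/s

A well-batched server on the same hardware should be doing thousands of tokens/sec in aggregate, which is where the ~100x gap comes from.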
> The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor, the R1 Distills like Llama 70B are great and should run a lot faster, as they take advantage of existing optimizations around inference on llama-architecture models.
What kind of optimization do you have in mind? DeepSeek has only 37B active parameters, which means ~12GB at this level of quantization, so inference ought to be much faster than a dense 70B model, especially an unquantized one, no? The Llama 70B distill would benefit from speculative decoding, though it shouldn't be enough to compensate. So I'm really curious what kind of llama-specific optimizations you have in mind, and how much speedup you think they'd bring.
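To spell out the arithmetic behind the ~12GB figure (rough math, and the ~2.5 bits/param for the Unsloth dynamic quant is my assumption):

    # Weight bytes touched per generated token: MoE with 37B active params vs a dense 70B.
    # Assumes ~2.5 bits/param for the Unsloth dynamic quant and FP16 for the dense model.
    def weight_gb(params_billion, bits_per_param):
        return params_billion * bits_per_param / 8   # GB: 1e9 params * bits / 8 bits-per-byte

    print(weight_gb(37, 2.5))   # ~11.6 GB of active expert weights per token
    print(weight_gb(70, 16))    # ~140 GB per token for an unquantized dense 70B
    print(weight_gb(70, 4))     # ~35 GB per token even at 4-bit

So purely on memory traffic per token the MoE should win, which is why the llama-specific-optimizations argument surprises me.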
I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
As the comments on reddit said, those numbers don’t make sense.
> I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
That was my first thought as well, but from a quick search it looks like Llama.cpp has a default batch size that's quite high (like 256 or 512, I don't remember exactly, which I find surprising for something that's mostly used by local users), so it shouldn't be the issue.
> As the comments on reddit said, those numbers don’t make sense.
Sure, but that default batch size would only matter if the person in question was actually generating and measuring parallel requests, not just measuring the straight line performance of sequential requests... and I have no confidence they were.
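And that's easy to check from the client side: unless requests are actually in flight concurrently, the server never has anything to batch. A rough sketch of the comparison I mean (assumes a local llama-server exposing its OpenAI-compatible endpoint; the URL and payload are placeholders):

    # Compare sequential vs concurrent request patterns against a local server.
    # Server-side batching only kicks in for the concurrent case.
    import asyncio
    import time
    import httpx

    URL = "http://localhost:8080/v1/chat/completions"   # placeholder llama-server endpoint
    PAYLOAD = {"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}

    async def one_request(client: httpx.AsyncClient) -> None:
        resp = await client.post(URL, json=PAYLOAD, timeout=600)
        resp.raise_for_status()

    async def run(n: int, concurrent: bool) -> float:
        async with httpx.AsyncClient() as client:
            start = time.perf_counter()
            if concurrent:
                await asyncio.gather(*(one_request(client) for _ in range(n)))
            else:
                for _ in range(n):           # sequential: effectively batch size 1
                    await one_request(client)
            return time.perf_counter() - start

    # print(asyncio.run(run(16, concurrent=False)), asyncio.run(run(16, concurrent=True)))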
Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size, so you have to think about running the model as a whole.
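To put rough numbers on "you hit every expert" (toy math, assuming DeepSeek-V3-style routing with 256 routed experts per MoE layer, 8 active per token, and uniform routing):

    # Expected number of idle experts per MoE layer as tokens-in-flight grows.
    experts, active = 256, 8

    for batch_tokens in (1, 16, 128, 1024):
        p_idle = (1 - active / experts) ** batch_tokens
        print(f"{batch_tokens:>5} tokens in flight -> ~{experts * p_idle:.1f} experts idle")
    #     1 tokens in flight -> ~248.0 experts idle
    #    16 tokens in flight -> ~154.0 experts idle
    #   128 tokens in flight -> ~4.4 experts idle
    #  1024 tokens in flight -> ~0.0 experts idle

So at production batch sizes essentially every expert's weights are in use, and the whole model has to be resident.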
There are two ways we can run it:
- 8xH200 GPU == 8x141GB == 1128 GB VRAM
- 16xH100 GPU == 16x80GB == 1280 GB VRAM
Within a single node (up to 8 GPUs) you don't see any meaningful hit from GPU-to-GPU communication.
More than that (e.g. 16xH100) requires multi-node inference which very few places have solved at a production-ready level, but it's massive because there are way more H100s out there than H200s.
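Rough memory math for those two configurations (my arithmetic, assuming the full 671B-parameter model in its native FP8 weights and ignoring activations):

    # Weights vs available VRAM for each configuration; the rest is headroom for KV cache.
    weights_gb = 671  # ~671B params at 1 byte/param (FP8)

    configs = {"8xH200": 8 * 141, "16xH100": 16 * 80}
    for name, vram_gb in configs.items():
        print(f"{name}: {vram_gb} GB VRAM, ~{vram_gb - weights_gb} GB left for KV cache etc.")
    # 8xH200: 1128 GB VRAM, ~457 GB left for KV cache etc.
    # 16xH100: 1280 GB VRAM, ~609 GB left for KV cache etc.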
> Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size
In their V3 paper DeepSeek talk about having redundant copies of some "experts" when deploying with expert parallelism in order to account for the different amounts of load they get. I imagine it only makes a difference at very high loads, but I thought it was a pretty interesting technique.
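Roughly, as I understand it (a toy sketch of the idea, not DeepSeek's actual deployment code): track per-expert load and give the hottest experts extra replicas so their traffic can be split across copies:

    # Toy illustration of redundant experts for load balancing under expert parallelism.
    from collections import Counter
    import random

    NUM_EXPERTS, NUM_REDUNDANT = 16, 4
    random.seed(0)

    # Simulated skewed routing stats: some experts receive far more tokens than others.
    load = Counter({e: int(1000 * random.paretovariate(2)) for e in range(NUM_EXPERTS)})

    # Give each of the hottest experts one extra replica; copies split the traffic.
    replicas = {e: 1 for e in range(NUM_EXPERTS)}
    for expert, _ in load.most_common(NUM_REDUNDANT):
        replicas[expert] += 1

    per_copy_load = {e: load[e] / replicas[e] for e in range(NUM_EXPERTS)}
    print("hottest expert before:", max(load.values()), "tokens")
    print("hottest copy after:   ", max(per_copy_load.values()), "tokens")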
We're super proud to support this work. If you're thinking of running deepseek in production, give us a shout!