> I'd rather place that 10K on a RTX Pro 6000 if I was choosing between them. On...

embedding-shape · 2025-12-23T09:05:38 1766480738

No, but the models you will be able to run, will run fast and many of them are Good Enough(tm) for quite a lot of tasks already. I mostly use GPT-OSS-120B and glm-4.5-air currently, both easily fit and run incredibly fast, and the runners haven't even yet been fully optimized for Blackwell so time will tell how fast it can go.

bigyabai · 2025-12-22T23:49:07 1766447347

You definitely could, the RTX Pro 6000 has 96 (!!!) gigs of memory. You could load 2 experts at once at an MXFP4 quant, or one expert at FP8.

coder543 · 2025-12-22T23:55:40 1766447740

No… that’s not how this works. 96GB sounds impressive on paper, but this model is far, far larger than that.

If you are running a REAP model (eliminating experts), then you are not running GLM-4.7 at that point — you’re running some other model which has poorly defined characteristics. If you are running GLM-4.7, you have to have all of the experts accessible. You don’t get to pick and choose.

If you have enough system RAM, you can offload some layers (not experts) to the GPU and keep the rest in system RAM, but the performance is asymptotically close to CPU-only. If you offload more than a handful of layers, then the GPU is mostly sitting around waiting for work. At which point, are you really running it “on” the RTX Pro 6000?

If you want to use RTX Pro 6000s to run GLM-4.7, then you really need 3 or 4 of them, which is a lot more than $10k.

And I don’t consider running a 1-bit superquant to be a valid thing here either. Much better off running a smaller model at that point. Quantization is often better than a smaller model, but only up to a point which that is beyond.

bigyabai · 2025-12-23T00:34:24 1766450064

You don't need a REAP-processed model to offload on a per-expert basis. All MoE models are inherently sparse, so you're only operating on a subset of activated layers when the prompt is being processed. It's more of a PCI bottleneck than a CPU one.

> And I don’t consider running a 1-bit superquant to be a valid thing here either.

I don't either. MXFP4 is scalar.

coder543 · 2025-12-23T00:43:34 1766450614

Yes, you can offload random experts to the GPU, but it will still be activating experts that are on the CPU, completely tanking performance. It won't suddenly make things fast. One of these GPUs is not enough for this model.

You're better off prioritizing the offload of the KV cache and attention layers to the GPU than trying to offload a specific expert or two, but the performance loss I was talking about earlier still means you're not offloading enough for a 96GB GPU to make things how they need to be. You need multiple, or you need a Mac Studio.

If someone buys one of these $8000 GPUs to run GLM-4.7, they're going to be immensely disappointed. This is my point.

embedding-shape · 2025-12-23T09:06:58 1766480818

> If someone buys one of these $8000 GPUs to run GLM-4.7, they're going to be immensely disappointed. This is my point.

Absolutely, same if they get a $10K Mac/Apple computer, immense disappointment ahead.

Best is of course to start looking at models that fit within 96GB, but that'd make too much sense.

virgildotcodes · 2025-12-23T11:07:47 1766488067

$10k is > 4 years of a $200/mo sub to models which are currently far better, continue to get upgraded frequently, and have improved tremendously in the last year alone.

This almost feels like a retro computing kind of hobby than anything aimed at genuine productivity.

embedding-shape · 2025-12-23T11:33:48 1766489628

I don't think the calculation is that simple. With your own hardware, there literally is no limits of runtime, or what models you use, or what tooling you use, or availability, all of those things are up to you.

Maybe I'm old school, but I prefer those benefits over some cost/benefit analysis across 4 years which by the time we're 20% through it, everything has changed.

But I also use this hardware for training my own models, not just inference and not just LLMs, I'd agree with you if we were talking about just LLM inference.

naasking · 2025-12-23T14:08:17 1766498897

They are better in some ways, but they're also neutered.