
Correct, most of r/LocalLLaMA has moved on to next-gen MoE models. DeepSeek introduced a few good optimizations that every new model seems to use now too. Llama 4 was generally seen as a fiasco, and Meta hasn't made a release since.




Llama 4 isn't that bad, but it was overhyped, and people generally "hold it wrong".

I recently needed an LLM to batch-process some queries. I ran an ablation on 20+ models from OpenRouter to find the best one. Guess which ones got 100% accuracy? GPT-5-mini, Grok-4.1-fast and... Llama 4 Scout. For comparison, DeepSeek v3.2 got 90%, and the community darling GLM-4.5-Air got 50%. Even the newest GLM-4.7 only got 70%.

Of course, this is just a single anecdotal data point that doesn't mean much on its own, but it suggests that Llama 4 is probably underrated.
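
For anyone who wants to run a similar comparison, a minimal sketch against OpenRouter's OpenAI-compatible API might look like the following. The model IDs, prompts, and scoring are placeholders, not the actual ablation described above:

    # Sketch of a small model ablation over OpenRouter's OpenAI-compatible API.
    # Model IDs, prompts, and the scoring rule are placeholders.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_KEY",  # placeholder
    )

    MODELS = [
        "openai/gpt-5-mini",
        "meta-llama/llama-4-scout",
        "deepseek/deepseek-chat",
    ]  # approximate IDs; check OpenRouter's model list for exact names

    CASES = [
        ("Extract the year from: 'Founded in 1998 in Menlo Park.'", "1998"),
        # ... your real batch queries and expected answers go here
    ]

    for model in MODELS:
        correct = 0
        for prompt, expected in CASES:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            answer = resp.choices[0].message.content.strip()
            correct += expected in answer  # naive string-match scoring
        print(f"{model}: {correct}/{len(CASES)} correct")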


Oh, this is very interesting. I'll have to test it out on coding too. Very good point about testing. Had I only followed benchmarks, I'd have missed a few gems completely (long-context models and 4B vision models that are unbelievably capable for their size). I'd encourage anyone to test the models on the actual problems you're working on.

The Llama 4 models were instruct models at a time when everyone was hyped about and expecting reasoning models. As instruct models, I agree they seemed fine, and I think Meta mostly dropped the ball by taking the negative community feedback as a signal that they should just give up. They’ve had plenty of time to train and release a Llama-4.5 by now, which could include reasoning variants and even stronger instruct models, and I think the community would have come around. Instead, it sounds like they’re focusing on closed source models that seem destined for obscurity, where Llama was at least widely known.

On the flip side, it also shows how damaging echo chambers can be, where relatively few people even gave the models a chance, just repeating the negativity they heard from other people and downvoting anyone who voiced a different experience.

I think this was exacerbated by the fact that Llama models had previously come in small, dense sizes like 8B that people could run on modest hardware, where even Llama 4 Scout was a large model that a lot of people in the community weren’t prepared to run. Large models seem more socially accepted now than they were when Llama 4 launched.


Large MoE models are more socially accepted because even medium/large MoE models can have quite small active experts (which is what largely determines the required VRAM). A large dense model is still challenging to get running.

I meant large MoE models are more socially accepted now. They were not when Llama 4 launched, and I believe that worked against the Llama 4 models.

The Llama 4 models are MoE models, in case you're unaware; your comment seems to imply they were dense models.


What are some of the models people are using? (Rather than naming the ones they aren't.)

GLM 4.7 is new and promising. MiniMax 2.1 is good for agents. Of course there's the Qwen3 family; the VL versions are spectacular. NVIDIA Nemotron Nano 3 excels at long context, and the Unsloth variant has been extended to 1M tokens.

I thought the last one was a toy until I tried it with a full 1.2 MB repomix project dump. It actually works quite well for general code comprehension across the whole codebase, CI scripts included.
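
For anyone curious, the workflow is roughly this. A sketch, assuming repomix is runnable via npx and a local OpenAI-compatible server (e.g. llama-server) is listening on port 8080; file names, the model name, and the question are made up:

    # Sketch: pack a repo with repomix, then ask a long-context model about it.
    # Assumes `npx repomix` works and an OpenAI-compatible server is on :8080.
    import subprocess
    from openai import OpenAI

    subprocess.run(["npx", "repomix", "--output", "repo.txt"], check=True)
    repo_dump = open("repo.txt", encoding="utf-8").read()  # ~1 MB of packed source

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
    resp = client.chat.completions.create(
        model="nemotron-nano",  # whatever name the local server exposes
        messages=[
            {"role": "system",
             "content": "Answer questions about this codebase:\n\n" + repo_dump},
            {"role": "user",
             "content": "Where is CI configured, and what does the release job do?"},
        ],
    )
    print(resp.choices[0].message.content)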

GPT-OSS-120B is good too, although I've yet to try it for coding specifically.


Since I'm just a pleb with a 5090, I run GPT-OSS 20B a lot; it fits comfortably in VRAM with the maximum context size. I find it quite decent for a lot of things, especially after setting reasoning effort to high, disabling top-k and top-p, and setting min-p to something like 0.05.
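
For reference, those sampler settings roughly correspond to a llama-server launch like the following. Just a sketch: the model path and port are placeholders, and top-k=0 / top-p=1.0 are llama.cpp's "disabled" values; reasoning effort for GPT-OSS is set via the chat template or per request, not shown here:

    # Sketch: launching llama-server with the sampling settings mentioned above.
    # Model path and port are placeholders.
    import subprocess

    subprocess.run([
        "llama-server",
        "-m", "gpt-oss-20b.gguf",  # placeholder path
        "-c", "0",                 # 0 = use the model's full context size
        "--top-k", "0",            # disable top-k
        "--top-p", "1.0",          # disable top-p
        "--min-p", "0.05",
        "--port", "8080",
    ])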

For Qwen3-VL, I recently read that someone got significantly better results by using an F16 or even F32 version of the vision part while using Q4 or similar for the text part. In llama.cpp you can specify these separately[1]. Since the vision part is usually quite small in comparison, this isn't as costly as it sounds. I haven't had a chance to test it yet, though.

[1]: https://github.com/ggml-org/llama.cpp/blob/master/tools/serv... (using --mmproj AFAIK)
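
If I read the docs correctly, it would look roughly like this. A sketch: the GGUF file names are made up; -m and --mmproj are llama-server's flags for the text model and the multimodal projector respectively:

    # Sketch: serving Qwen3-VL with a Q4 text model but an F16 vision projector.
    # File names are placeholders.
    import subprocess

    subprocess.run([
        "llama-server",
        "-m", "Qwen3-VL-8B-Instruct-Q4_K_M.gguf",   # quantized text model
        "--mmproj", "mmproj-Qwen3-VL-8B-F16.gguf",  # higher-precision vision part
        "--port", "8080",
    ])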


Does GLM 4.7 run well on the Spark? I thought I read it didn't, but it wasn't clear.



