no, they definitely fit. They're just awkwardly exactly the right size, so while you're trying to plug things in, hunched over under the desk, crawling around and feeling for the backside... it just, yeah.
It's all optional if you have enough mechanical empathy. No speedo, oil light, odo, gas gauge. You just get a feel for how fast you're going. You haven't really lived until you've ridden a salvage titled motorcycle with zero instrument cluster across 17 without headlights after the sun's gone down. Sometimes I'm surprised I made it this long.
> What I'm struggling with is, when you ask AI to do something, its answer is always undeterministically different, more or less.
For some computer science definition of deterministic, sure, but who gives a shit about that? If I ask it to build a login page, and it puts GitHub login first one day, and Google login first the next day, do I care? I'm not building login pages every other day. What point do you want to define as "sufficiently deterministic", and for which use case?
"Summarize this essay into 3 sentences" for a human is going to vary from day to day, and yeah, it's weird for computers to no longer be 100% deterministic, but I didn't decide this future for us.
GLM 4.7 is new and promising. MiniMax 2.1 is good for agents. Of course the Qwen3 family, and the VL versions in particular, are spectacular. NVIDIA Nemotron Nano 3 excels at long context, and the Unsloth variant has been extended to 1M tokens.
I thought the last one was a toy, until I tried it with a full 1.2 MB repomix project dump. It actually works quite well for general code comprehension across the whole codebase, CI scripts included.
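For context, that dump comes straight from the repomix CLI; something like the following (flag names are from memory, so double-check --style/--output against the current repomix help):

    # Pack the whole repo into a single plain-text file to paste into the model's context
    npx repomix --style plain --output repo-dump.txt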
GPT-OSS-120B is good too, although I've yet to try it out for coding specifically.
Since I'm just a pleb with a 5090, I run GPT-OSS 20B a lot, as it fits comfortably in VRAM with max context size. I find it quite decent for a lot of things, especially after setting reasoning effort to high, disabling top-k and top-p, and setting min-p to something like 0.05.
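Concretely, that maps to roughly these llama-server flags (model filename and context size are placeholders, and the reasoning-effort flag is my assumption about how recent builds pass it into the chat template):

    # top-k 0 and top-p 1.0 disable those samplers; only min-p filtering remains
    llama-server -m gpt-oss-20b.gguf -ngl 99 -c 65536 \
      --top-k 0 --top-p 1.0 --min-p 0.05 \
      --chat-template-kwargs '{"reasoning_effort": "high"}'  # assumed flag; check your build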
For the Qwen3-VL, I recently read that someone got significantly better results by using F16 or even F32 versions of the vision model part, while using a Q4 or similar for the text model part. In llama.cpp you can specify these separately[1]. Since the vision model part is usually quite small in comparison, this isn't as rough as it sounds. Haven't had a chance to test that yet though.
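If anyone wants to try that split, it looks something like this (assuming a llama-server build with multimodal support; the filenames are just illustrative):

    # Q4 text weights, with the vision projector (mmproj) loaded separately at F16
    llama-server -m Qwen3-VL-8B-Instruct-Q4_K_M.gguf \
      --mmproj mmproj-Qwen3-VL-8B-Instruct-F16.gguf \
      -c 32768 -ngl 99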
I know money is the root of all evil and all that, but a total aversion to it isn't a very healthy way of interfacing with it at all.