The brand-new Qwen3-Coder-Next runs at 300Tok/s PP and 40Tok/s TG on an M1 64GB with a 4-bit MLX quant. Together with Qwen Code (a fork of Gemini CLI) it is actually pretty capable.
Before that I used Qwen3-30B, which is good enough for some quick JavaScript or Python, like 'add a new endpoint /api/foobar which does foobaz'. It is also very decent for a quick summary of code.
It does 530Tok/s PP and 50Tok/s TG. If you have it spit out lots of code that is just a copy of the input, it reaches about 200Tok/s, e.g. 'add a new endpoint /api/foobar which does foobaz and return the whole file'.
Gawker was a well-known website with 23 million visits per month, and a Wikipedia page. This guy has 44k subscribers and no Wikipedia page. It's a stretch to go from "Thiel had a vendetta against Gawker" to "Thiel had a vendetta against this guy".
There are now quite a few cases in Europe where the EU or local governments have been de-banking individuals. No court, no judge needed. It's a much more efficient way to shut down critics. We ain't need no people who delegitimize those in power.
Can we have a link? In France, at most you can get your account restricted (you can't go into deficit and a sum is blocked) until the issue is resolved (99% of the time because of unpaid taxes; sometimes the money is blocked by a judge until a judgment is passed). It would be weird if the EU didn't have a standard.
Yes, exactly. French citizens have an inalienable right to a checking account, enforced by the French government. I don't remember the exact law, but I know someone who was 'interdit bancaire' (banned from banking after taking out too much revolving credit in the 90s) and the local bank _had_ to let him open a new account (a very limited one).
Look at what happened to Jacques Baud because he criticized the EU over the Ukraine war: he is now considered "pro-Russia" and accused of propagating "disinformation". [1]
I read what he wrote ("L'art de la guerre russe, comment l'Occident a conduit l'Ukraine à l'échec", roughly "The Russian art of war: how the West led Ukraine to failure"); you can download it temporarily here: [2].
His book is absolutely not pro-Russia, nor is it pro-EU.
The only debanking cases I'm aware of are the US putting pressure on judges of the International Criminal Court and on the UN Special Rapporteur for Palestine.
He's Chinese, and if you looked into his comment history you'd know this is not someone who uses LLMs for karma farming. Looking at his blog, he has a long history of posting about database topics going back to before there was GPT.
Should I ever participate in a Chinese-speaking forum, I'd certainly use an LLM for translation as well.
Unfortunately Qwen3-Next is not well supported on Apple silicon; it seems the Qwen team doesn't really care about Apple.
On an M1 64GB, Q4_K_M on llama.cpp gives only 20Tok/s, while MLX is more than twice as fast. However, MLX has problems with KV-cache consistency and especially with branching. So while in theory it is twice as fast as llama.cpp, it often does the PP all over again, which completely trashes performance, especially with agentic coding.
So the agony is deciding whether to endure half the possible speed but get much better KV caching in return, or to have twice the speed but then often have to sit through prompt processing again.
But who knows, maybe Qwen gives them a hand? (hint,hint)
KV caching means that when you have a 10k-token prompt, all follow-up questions return immediately; this is standard with all inference engines.
Now if you are not happy with the last answer, maybe you want to simply regenerate it or change your last question; this is branching of the conversation. llama.cpp is capable of re-using the KV cache up to that point while MLX is not (I am using the MLX server from the MLX community project). I haven't tried with LM Studio. Maybe worth a try, thanks for the heads-up.
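To make the branching point concrete, here is a minimal Python sketch of prefix-based cache reuse (my own illustration, not llama.cpp's or MLX's actual implementation): only the tokens after the longest common prefix with the cached sequence need prompt processing again.

```python
# Minimal sketch of KV-cache prefix reuse (illustrative only; not the actual
# llama.cpp or MLX code). Idea: only the tokens after the longest common
# prefix with the cached sequence need prompt processing again.

def common_prefix_len(cached: list[int], new: list[int]) -> int:
    """Length of the shared token prefix between the cached and new prompt."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

def tokens_to_prefill(cached: list[int], new: list[int]) -> list[int]:
    """Tokens that still need prompt processing when the cache is reused."""
    keep = common_prefix_len(cached, new)
    return new[keep:]

# Example: a 10k-token conversation prefix, then a changed follow-up question.
cached_tokens = list(range(10_000)) + [1, 2, 3]      # previous conversation
new_tokens    = list(range(10_000)) + [7, 8, 9, 10]  # branched follow-up

print(len(tokens_to_prefill(cached_tokens, new_tokens)))  # 4, not 10_004
```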
Any notes on the problems with MLX caching? I've experimented with local models on my MacBook and there's usually a good speedup from MLX, but I wasn't aware there's an issue with prompt caching. Is it from MLX itself or from LM Studio/mlx-lm/etc.?
It is the buffer implementation.
[u1 10kTok]->[a1]->[u2]->[a2]. If you branch between the assistant answer a1 and the user prompt u2, then MLX reprocesses the u1 prompt of, let's say, 10k tokens, while llama.cpp does not.
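A back-of-the-envelope check using the speeds quoted earlier in the thread (my own assumption: PP throughput stays roughly constant over the prompt) shows why this hurts:

```python
# Rough cost of re-running prompt processing on the branched prefix,
# using the PP speed quoted above for Qwen3-Coder-Next on M1/MLX.
prefix_tokens = 10_000
pp_tok_per_s = 300   # Tok/s PP, from the numbers earlier in the thread

print(f"{prefix_tokens / pp_tok_per_s:.0f} s of extra prefill per branch")  # ~33 s
# With prefix cache reuse (llama.cpp), this re-prefill is skipped entirely.
```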
I just tested the GGUF and MLX versions of Qwen3-Coder-Next with llama.cpp and now with LM Studio. As I branch very often, this is highly annoying for me, to the point of being unusable. Qwen3-30B is then much more usable on the Mac, but by far not as powerful.
> People writing and maintaining software need to optimize for simplicity, readability, maintainability. Whether they use an LLM to achieve that is secondary. The humans in the loop must understand what's going on.
Linux is nowadays mostly sponsored by big corporations. They have different goals and different ways of doing things. For probably the first 10 years Linux was driven by enthusiasts, and therefore it was a lean system. Something like systemd is typical corporate output. Due to its complexity it would have died long before finding adoption, but with enterprise money this is possible. Try developing for the combo Linux Bluetooth/Audio/D-Bus: the complexity drives you crazy, because all this stuff was made for (and financed by) the corporate needs of the automotive industry. Simplicity is never a goal in these big companies.
But then Linux wouldn't be where it is without the business side paying for the developers. There is no such thing as a free lunch...
> AIs have endless grit (or at least as endless as your budget).
That is the only thing he doesn't address: the money it costs to run the AI. If you let the agents loose, they easily burn north of 100M tokens per hour. At $25 per 1M tokens, that gets expensive quickly. At some point, when we are all drug^W AI dependent, the VCs will start to cash in on their investments.
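Plugging those numbers in (both figures are the ones claimed above, not measured):

```python
# Quick sanity check on the cost claim: 100M tokens/hour at $25 per 1M tokens.
tokens_per_hour = 100_000_000     # "north of 100M tokens per hour"
price_per_million = 25            # $25 per 1M tokens

cost_per_hour = tokens_per_hour / 1_000_000 * price_per_million
print(f"${cost_per_hour:,.0f} per hour")            # $2,500 per hour
print(f"${cost_per_hour * 8:,.0f} per 8h workday")  # $20,000 per workday
```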