This is really impressive. At these speeds, it’s possible to run agents with multi-tool turns within seconds. Consider it a feature-rich, “non-deterministic API” for your platform or business.
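For anyone wondering what a multi-tool turn loop looks like, here’s a minimal sketch against an OpenAI-compatible endpoint (e.g. a local llama.cpp or Ollama server). The URL, model name, and get_weather tool are placeholders for illustration, not anything specific to this release:

    # Minimal multi-tool agent loop against an OpenAI-compatible endpoint.
    # URL, model name, and the get_weather stub are assumptions.
    import json
    import requests

    URL = "http://localhost:8080/v1/chat/completions"  # hypothetical local server

    def get_weather(city: str) -> str:
        return f"Sunny and 22C in {city}"  # stub tool for the demo

    TOOLS = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    messages = [{"role": "user", "content": "What's the weather in Oslo?"}]
    for _ in range(5):  # cap the number of turns
        resp = requests.post(URL, json={
            "model": "local-model", "messages": messages, "tools": TOOLS,
        }).json()
        msg = resp["choices"][0]["message"]
        messages.append(msg)
        if not msg.get("tool_calls"):
            print(msg["content"])  # final answer, no more tools requested
            break
        for call in msg["tool_calls"]:  # execute each requested tool
            args = json.loads(call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": get_weather(**args),
            })

At 40+ tokens/s, each of those round trips resolves in a second or two, which is what makes multi-turn tool use feel interactive.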
Don’t have enough RAM for this model, but the smaller 20B model runs nice and fast on my MacBook and is reasonably good for my use cases. Pity that function calling is still broken with llama.cpp.
I'm glad to see this was a bug of some sort and (hopefully) not a hard RAM limitation. I've used quite a few of these models on my MacBook Air with 16GB of RAM. I also have a plan to build an AI chatbot and host it from my bedroom on a $149 mini-PC. I'll probably go much smaller than the 20B models for that. The Qwen3 4B model looks quite good.
The key benefit is significantly lower power usage. I benchmarked llama3.2-1B on my machines: M1 Max (47 t/s, ~1.8 W) and M4 Pro (62 t/s, ~2.8 W). The GPU is twice as fast (even faster on the Max) but draws much more power (~20 W) than the ANE.
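The efficiency gap is clearer as energy per token (watts divided by tokens/s). The GPU throughputs below just assume the “twice as fast” figure above:

    # Back-of-envelope energy per token from the numbers above.
    # GPU throughput is an assumption (2x the measured ANE figures).
    ane = {"M1 Max": (47, 1.8), "M4 Pro": (62, 2.8)}    # (tokens/s, watts)
    gpu = {"M1 Max": (94, 20.0), "M4 Pro": (124, 20.0)}

    for name in ane:
        for label, (tps, watts) in (("ANE", ane[name]), ("GPU", gpu[name])):
            print(f"{name} {label}: {watts / tps * 1000:.1f} mJ/token")

That works out to roughly 38–45 mJ/token on the ANE versus roughly 160–215 mJ/token on the GPU, so the ANE is about 4–6x more energy-efficient per token despite the lower throughput.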
Also, the ANE models are limited to 512 tokens of context, so they’re unlikely to be usable in production yet.
Love this! The C64 introduced me to the world of computers as a kid. I still have that almost 40-year-old machine in my collection, but I’m wary of failure every time I turn it on. This is somewhat better than the MiSTer, as I can use physical peripherals with it. Great work!
The most common failure points in these old boxes are the capacitors and the power supply. Swap out all the caps, replace the original power supply with a modern remake, and the 64 could last you another 40 years. :)
Let’s remind ourselves that MCP was announced to the world in November 2024, only 4 short months ago. The RFC is actively being worked on and evolving.
I had the same frustration and wanted to see "under the hood", so I coded up this little agent tool to play with MCP (SSE and stdio): https://github.com/sunpazed/agent-mcp
It really is just JSON-RPC 2.0 under the hood, either piped over stdio or POSTed over HTTP.
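You can see the framing by hand-rolling the stdio transport with subprocess instead of an SDK. A minimal sketch; the server command is a placeholder, and the protocol version string is my assumption from the current spec revision:

    # Hand-rolled MCP handshake over stdio: newline-delimited JSON-RPC 2.0.
    # "my_mcp_server.py" is a hypothetical stdio MCP server; swap in your own.
    import json
    import subprocess

    proc = subprocess.Popen(
        ["python", "my_mcp_server.py"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )

    def send(msg: dict) -> None:
        proc.stdin.write(json.dumps(msg) + "\n")  # one JSON object per line
        proc.stdin.flush()

    def recv() -> dict:
        return json.loads(proc.stdout.readline())

    # 1. initialize request/response
    send({"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {
        "protocolVersion": "2024-11-05",  # assumed spec revision
        "capabilities": {},
        "clientInfo": {"name": "toy-client", "version": "0.1"},
    }})
    print(recv())

    # 2. initialized notification (no id, so no response expected)
    send({"jsonrpc": "2.0", "method": "notifications/initialized"})

    # 3. list the server's tools
    send({"jsonrpc": "2.0", "id": 2, "method": "tools/list"})
    print(recv())

The SSE transport is the same JSON-RPC payloads, just POSTed to the server with responses streamed back as server-sent events.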