zozbot234's comments (Hacker News)

> It'd be nice to have say, ABI for slices for example.

The de facto ABI for slices is to pass/store the pointer and length separately and rebuild the slice locally. It's hard to do better than that, short of somehow standardizing a "slice" binary representation across C and C-like languages. And even then you'd still have to deal with existing legacy code that doesn't follow that representation.


The underlying advantage of local inference is that you're repurposing your existing hardware for free. You don't need your token spend to pay a share of the capex cost for datacenters that are large enough to draw gigawatts in power, you can just pay for your own energy use. Even though the raw energy cost per operation will probably be higher for local inference, the overall savings in hardware costs can still be quite real.

For long context, yes this is at least plausible. And the latest models are reaching context lengths of 1M tokens or perhaps more.

KV quantization has long been available in llama.cpp.

Yes, but the optimisation described hasn't been, right?

With proper mmap support you don't really need the entire model in memory. It can be streamed from a fast SSD, and this is most useful for MoE models, where not all expert layers are used uniformly. Of course, the more data you stream from the SSD, the slower it gets; caching in RAM is still relevant to good performance.

You can do this on a Mac as well, though, right? So the 128 GB of unified memory becomes a cache for the very fast 1+ TB Apple SSD.

I think the advantage of Flash-MoE compared to plain mmap is mostly the coalesced representation where a single expert-layer is represented by a single extent of sequential data. That could be introduced to existing binary formats like GGUF or HF - there is already a provision for differently structured representations, and that would easily fit.

> If there is no barrier, there is no moat, just a transitory advantage

A moat can very much be transitory and caused by natural (or at least, not specifically intended) factors. Perhaps we could then call it something different, like a river as opposed to a moat, but the strategic effect is the same either way so it makes sense to use the established term for it.


> If the file is mmap'd, and the string view points into that, presumably decent performance depends on the page cache having those strings in RAM.

Not so much, because you only need some fraction of that memory when the program is actually running; the OS is free to evict it as soon as it needs the RAM for something else. Non-file-backed memory can only be evicted by swapping it out, and that's way more expensive.


All major OSes (well, Windows and macOS at least) do in-memory compression before swapping, which is cheaper than evicting a file-backed page. But it's still slow, so you don't want to rely on it.

If you use an ownership/lifetime system under the hood you only pay that synchronization overhead when ownership truly changes, i.e. when a reference is added or removed that might actually impact the object's lifecycle. That's a rare case with most uses of reference counting; most of the time you're creating a "sub"-reference and its lifetime is strictly bounded by some existing owning reference.

There are two unavoidable atomic updates for RC: the allocation and the free event. That alone significantly increases the amount of traffic per thread back to main memory.

A lifetime system could possibly eliminate those, but it'd be hard to add to the JVM at this point. The JVM sort of has this in the form of escape analysis, but that's notoriously easy to defeat with pretty typical Java code.


Why would an allocation require an atomic write for a reference count?

Swift routinely optimizes out reference count traffic.


> Why would an allocation require an atomic write for a reference count?

It won't always require it, but it usually will because you have to ensure the memory containing the reference count is correctly set before handing off a pointer to the item. This has to be done almost first thing in the construction of the item.

It's not impossible that a smart compiler could elide that initialization and the final decrement if it can prove that the item never escapes the current scope. But if it does escape, for example by being added to a list or returned from a function, then those two atomic writes are required.


But commodity hardware that's right-sized for your own private needs is many orders of magnitude cheaper than datacenter hardware that's intended to serve millions of users simultaneously while consuming gigawatts in power. You're mostly paying for that hardware when you buy LLM tokens, not just for power efficiency. And your own hardware stays available for non-AI related needs, while paying for these tokens would require you to address these needs separately in some way.

>And your own hardware stays available for non-AI related needs, while paying for these tokens would require you to address these needs separately in some way.

^ Fair. Yep, I agree the calculus changes if you don't have _any_ local hardware and need to factor in the cost of acquiring it.

When I did this napkin math, I was mostly interested in the energy aspect, using cost as a proxy. I calculated the $/token (taking into account the cost of a kWh from my utility, the measured power draw of my M1 work machine, and the measured tokens per second of a ~20B-parameter open-weight model). I then compared this to the published $/token rate of a frontier provider, and it was something like two orders of magnitude in favor of the frontier model. I get it, they're subsidizing, but I've got to imagine there's some truth in the numbers.

I wonder, does (or will) the $/token ratio fall asymptotically toward the cost of electricity? In my mind I'm drawing a parallel to how the value of mined cryptocurrency approximately tracks the cost of electricity... but I might be misremembering that detail.


I doubt it, because you aren't going to get the utilisation that a commercial setup would. No point wasting tons of money on hardware that sits idle most of the time.

If you're running agentic workloads in the background (either some coding agent or personal claw-agent type) that's enough utilization that the hardware won't be sitting idle.

> Two Chinese firms are ramping up production of consumer RAM/SSDs because they see a market opening

Yes but these Chinese firms are a tiny share of the overall RAM/SSD market, and they'll have the same problems with expanding production as everyone else. So it doesn't actually help all that much.


The biggest problem in expanding, for everyone else, is that they don't trust the market to exist long enough to be worth paying for a new factory, so they are not investing. The Chinese firms might be small, but they think the market will exist and are investing. Whether they'll be right or wrong, I don't know.

Chinese firms won’t have exactly the same problems as everyone else. Some problems will be the same, but not all.

* Chinese firms finance through different banks and investors than current ram producers

* A company with a mission statement of consumer RAM won’t have its supply outbid by data centers

* Chinese manufacturing has more expertise in scaling than any other manufacturing culture


The fact that there’s been a massive expansion in the nonconsumer market means the consumer market makes up a smaller proportion of the overall market, but it doesn’t mean the consumer market is any smaller than it used to be.

