
There's also a "llamafile" 4MB binary that can run any model (GGUF file) that you pass to it: https://simonwillison.net/2023/Nov/29/llamafile/#llamafile-t...
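For anyone who wants to try that route, the basic flow is: download the bare llamafile binary from the project's releases page, mark it executable, and point it at a GGUF file. Roughly something like this (the model filename is just an example, and the flags follow llama.cpp's conventions, so check the llamafile README for specifics):

  # make the downloaded runner executable (asset name varies by release)
  chmod +x ./llamafile

  # run any GGUF model you already have on disk
  ./llamafile -m mistral-7b-instruct.Q4_K_M.gguf -p "Hello"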


Right. So if that exists, why would I want to embed my weights in the binary rather than distributing them as a side file?

I assume the answers are "because Justine can" and "sometimes it's easier to distribute a single file than two".


Personally I really like the single file approach.

If the weights are 4GB, and the binary code needed to actually execute them is 4.5MB, then the size of the executable part is a rounding error - I don't see any reason NOT to bundle that with the model.
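And for the bundled case, the llamafile repo ships a zipalign tool that appends the GGUF (uncompressed) into the executable's embedded zip archive. From memory it's roughly the following, so double-check the README for the exact flags; filenames are illustrative:

  # start from the bare runner and stuff the weights into it
  cp llamafile mymodel.llamafile
  zipalign -j0 mymodel.llamafile mistral-7b-instruct.Q4_K_M.gguf
  ./mymodel.llamafile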


I guess in every world I've worked in, deployment meant shipping a small executable that would run millions of times on thousands of servers, each instance loading a different model (or models) over its lifetime, with the weights stored in a large, fast filesystem with much higher aggregate bandwidth than a typical local storage device. The executable itself doesn't even contain the final model, just a description of the model that is compiled only after the executable starts (so the compilation has all the runtime info of the machine it will run on).

But I think llama plus obese binaries must be targeting a very, very different community: one that doesn't build its own binaries, runs in any number of different locations, and focuses on getting the model to run with the least friction.


> a large, fast filesystem with much higher aggregate bandwidth than a typical local storage device

That assumption breaks down very fast with NVMe storage, even before you add herding effects.


Until you compare a single machine with NVMe to a cluster of storage servers with NVMe, where each machine has 800 Gbit connectivity and you use smart replication for herding. But yes, NVMe definitely has amazing transfer rates.


> Until you compare a single machine with NVMe to a cluster of storage servers with NVMe

No, only as long as you compare against a very low number of machines with local NVMe.

The sum of the bandwidth available across typical local storage devices (even cheap, low-end ones) will be many times greater than what you get out of your expensive top-of-the-line cluster.

If you have a single local storage machine, you don't have scale, so you won't have money for an expensive top-of-the-line cluster either. If you are wasting money on one anyway, then yes, you will have more bandwidth, but that's a degenerate case.

If you have more than a few machines with local storage, the assumption breaks down very fast: one low-end NVMe drive does ~1 GB/s at worst, one top-of-the-line WD 990 does ~8 GB/s at best, so we're talking about a ratio of ~8 in the most favorable scenario.
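Putting the thread's numbers side by side (the fleet size of 100 is an arbitrary example; the per-drive and link figures are the ones quoted above):

  machines=100
  low_end_gbs=1         # GB/s for a cheap local NVMe drive
  high_end_gbs=8        # GB/s for a top-of-the-line drive
  link_gbs=$((800 / 8)) # an 800 Gbit/s NIC is ~100 GB/s per client
  echo "aggregate local, low end:   $((machines * low_end_gbs)) GB/s"
  echo "aggregate local, high end:  $((machines * high_end_gbs)) GB/s"
  echo "network ceiling per client: $link_gbs GB/s"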


> But I think llama plus obese binaries must be targeting a very, very different community: one that doesn't build its own binaries, runs in any number of different locations, and focuses on getting the model to run with the least friction.

Yes, the average user.


This is convenient for people who don't want to go knee-deep in LLM-ology to try an LLM out on their computer. That said, a single download that in turn downloads the weights for you is just as good in my book.


`ollama pull <modelname>` has worked for me, and then I can try out new models and update the binary trivially.
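For anyone who hasn't tried it, the whole flow is about two commands (model name is just an example):

  ollama pull mistral
  ollama run mistral "Why is the sky blue?"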



