
> you pass the --n-gpu-layers 35 flag (or whatever value is appropriate) to enable GPU

This is a bit like specifying how large your strings will be to a C program. That might have been acceptable in the old days, but not really anymore.



That's not a limitation introduced by Llamafile; it's actually a feature of all GGUF models. If not specified, the GPU is not used at all. Optionally, you can offload some of the work to the GPU. This lets you run 7B models (Zephyr, Mistral, OpenHermes) on regular PCs; it just takes a bit more time to generate the response. What other API would you suggest?
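For illustration, a launch with partial offload might look like the sketch below; the binary name is a placeholder, and only the --n-gpu-layers flag is the one under discussion:

    # Hypothetical launch of a llamafile-style binary with partial GPU offload:
    # 35 layers go to the GPU, the remaining layers run on the CPU.
    import subprocess

    subprocess.run(
        ["./mistral-7b.llamafile",      # placeholder binary name
         "--n-gpu-layers", "35"],       # the flag discussed in this thread
        check=True,
    )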


This is a bit like saying if you don't specify "--dram", the data will be stored on punchcards.

From the user's point of view: they just want to run the thing, and as quickly as possible. If multiple programs want to use the GPU, then the OS and/or the driver should figure it out.


They don't, though. If you try to allocate too much VRAM, it will either hard-fail or everything will suddenly run like garbage because the driver is constantly swapping it out / falling back to shared memory.

The reason for this flag to exist in the first place is that many of the models are larger than the available VRAM on most consumer GPUs, so you have to "balance" it between running some layers on the GPU and some on the CPU.

What would make sense is a default auto option that uses as much VRAM as possible, on the assumption that the model is the only thing running on the GPU apart from whatever VRAM is already in use when it starts.
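A minimal sketch of that auto option on NVIDIA hardware, using the pynvml bindings; the per-layer size and layer count here are assumptions standing in for the model's real metadata:

    # Pick --n-gpu-layers from the VRAM that is free right now (NVIDIA-only sketch).
    import pynvml

    PER_LAYER_BYTES = 140 * 1024**2   # assumed ~140 MiB per layer (7B, ~4-bit quant)
    TOTAL_LAYERS = 32                 # assumed layer count

    pynvml.nvmlInit()
    mem = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0))
    n_gpu_layers = min(TOTAL_LAYERS, mem.free // PER_LAYER_BYTES)  # .free is in bytes
    print(f"--n-gpu-layers {n_gpu_layers}")
    pynvml.nvmlShutdown()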


> They don't, though. If you try to allocate too much VRAM it will either hard fail or everything suddenly runs like garbage due to the driver constantly swapping it / using shared memory.

What I don't understand is why it can't just check your VRAM and allocate accordingly by default. The allocation isn't that dynamic AFAIK - when I run models it all happens basically upfront when the model loads. Ollama even prints out how much VRAM it's allocating for model + context for each layer. But I still have to tune the layers manually, and any time I change my context size I have to retune.
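Something like the sketch below is what I'd expect it to do on its own; the sizes are made-up placeholders, but the point is that the context reservation comes off the top, so changing it wouldn't mean retuning by hand:

    # Fit as many layers as possible into a VRAM budget after reserving the
    # KV cache for the chosen context. All sizes here are illustrative.
    def layers_that_fit(vram_budget, per_layer_bytes, kv_cache_bytes, total_layers):
        usable = vram_budget - kv_cache_bytes   # context memory is reserved first
        if usable <= 0:
            return 0
        return min(total_layers, usable // per_layer_bytes)

    # e.g. 8 GiB budget, ~140 MiB per layer, 1 GiB of KV cache, 32 layers -> 32
    print(layers_that_fit(8 * 1024**3, 140 * 1024**2, 1 * 1024**3, 32))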


This is a great point. Context size has a large impact on memory requirements and Ollama should take this into account (something to work on :)
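For a rough sense of scale, the back-of-envelope below assumes a Mistral-style 7B (32 layers, 8 KV heads, head dim 128, fp16 cache); the real numbers come from the model's metadata:

    # KV-cache size grows linearly with context length:
    # 2 (K and V) * layers * tokens * kv_heads * head_dim * bytes per element.
    def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8,
                       head_dim=128, elem_bytes=2):
        return 2 * n_layers * n_ctx * n_kv_heads * head_dim * elem_bytes

    for n_ctx in (2048, 4096, 32768):
        print(n_ctx, kv_cache_bytes(n_ctx) // 1024**2, "MiB")
    # -> 256, 512 and 4096 MiB: long contexts eat a big slice of VRAM on their own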


Thanks for the work you've done already :D


Some GPUs have quirks where VRAM access slows down near the end, or where the GPU just crashes and disables display output when it's actually used. I think it's sort of sensible that they don't use the GPU at all by default.


Wouldn't the sensible default be to use 80% of available VRAM, or total VRAM minus 2GB, or something along those lines? Something that's a tad conservative but works for 99% of cases, with tuning options for those who want to fly closer to the sun.


2GB is a huge amount - you'd be dropping a dozen layers. Saving a few MB should be sufficient, and a layer is generally going to be on the order of tens to hundreds of megabytes, so unless your model fits perfectly into VRAM (using 100%) you're already going to be leaving at least a few MB / tens of MB / hundreds of MB free.

Your window manager will already have reserved its VRAM upfront, so it isn't a big deal to use ~all of the rest.
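Back-of-envelope on the "dozen layers" point, with assumed sizes for a ~4-bit 7B model:

    # Rough arithmetic behind "2 GB is a dozen layers" (all numbers assumed):
    model_bytes = 4 * 1024**3            # ~4 GiB for a 7B model at ~4-bit quant
    n_layers = 32
    per_layer = model_bytes / n_layers   # ~128 MiB per layer
    print(per_layer / 1024**2)           # ~128
    print((2 * 1024**3) / per_layer)     # ~16 layers sacrificed to a 2 GiB reserve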


I think in the vast majority of cases the GPU being the default makes sense, and for the incredibly niche cases where it isn't, there is already a tunable.


Llama.cpp allocates stuff to the GPU statically. It's not really analogous to a game.

It should have a heuristic that looks at available VRAM by default, but it does not. Probably because this is vendor specific and harder than you would think, and they would rather not use external libraries.
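To illustrate the vendor problem: even the "easy" path below only covers CUDA, calling cudaMemGetInfo from the runtime library via ctypes (the library name is platform dependent, assumed to be libcudart.so here); ROCm, Metal and Vulkan each need a different API.

    # CUDA-only way to read free/total VRAM without extra Python dependencies.
    import ctypes

    cudart = ctypes.CDLL("libcudart.so")              # platform-dependent name
    free, total = ctypes.c_size_t(), ctypes.c_size_t()
    if cudart.cudaMemGetInfo(ctypes.byref(free), ctypes.byref(total)) == 0:
        print(f"{free.value / 2**20:.0f} MiB free of {total.value / 2**20:.0f} MiB")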


> What other API would you suggest?

Assuming that using more VRAM leads to an appreciable improvement in model speed, it should default to using all but 10% of the VRAM of the largest GPU, or all but 1GB, whichever is less.

If I've got 8GB of VRAM, the software should figure out the right number of layers to offload and a sensible context size, so as not to exceed 7GB of VRAM.

(Although I realise the authors are just doing what llama.cpp does, so they didn't design it the way it is)
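A sketch of that default, with assumed per-layer and per-token sizes standing in for the model metadata:

    # Budget = min(90% of VRAM, VRAM - 1 GiB), then split it between layers
    # and context. Per-layer and per-token sizes below are placeholders.
    GIB = 1024**3

    def plan(total_vram, per_layer=140 * 2**20,
             n_layers=32, kv_per_token=128 * 1024):
        budget = min(0.9 * total_vram, total_vram - 1 * GIB)
        n_gpu_layers = min(n_layers, int(budget // per_layer))
        n_ctx = int((budget - n_gpu_layers * per_layer) // kv_per_token)
        return n_gpu_layers, n_ctx

    print(plan(8 * GIB))   # 8 GiB card -> 7 GiB budget -> (32 layers, ~21k context)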


> What other API would you suggest?

MLC LLM?

I think the binary it compiles down to (probably the Vulkan and Metal ones for y'all) is separate from the weights, so you could ship a bunch of them in one file.



