
I think the best way to try this out is with LLaVA, the text+image model (like GPT-4 Vision). Here are the steps to do that on macOS (they should work the same on other platforms too, though I haven't tried that yet):

1. Download the 4.26GB llamafile-server-0.1-llava-v1.5-7b-q4 file from https://huggingface.co/jartine/llava-v1.5-7B-GGUF/blob/main/...:

    wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llamafile-server-0.1-llava-v1.5-7b-q4
2. Make that binary executable, by running this in a terminal:

    chmod 755 llamafile-server-0.1-llava-v1.5-7b-q4
3. Run your new executable, which will start a web server on port 8080:

    ./llamafile-server-0.1-llava-v1.5-7b-q4
4. Navigate to http://127.0.0.1:8080/ to upload an image and start chatting with the model about it in your browser.

Screenshot here: https://simonwillison.net/2023/Nov/29/llamafile/
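
If you'd rather script against it than click around the web UI, the bundled server also exposes llama.cpp's HTTP API; a minimal sketch (assuming the standard /completion endpoint on the default port, with an illustrative prompt):

    curl -s http://127.0.0.1:8080/completion \
      -H 'Content-Type: application/json' \
      -d '{"prompt": "### User: Describe what a llamafile is.\n### Assistant:", "n_predict": 64}'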



Wow, this is almost as good as chatgpt-web [0], and it works offline and is free. Amazing.

In case anyone here hasn't used chatgpt-web, I recommend trying it out. With the new GPT-4 models you can chat for way cheaper than paying for ChatGPT Plus, and you can also switch back to the older (non-nerfed) GPT-4 models that can still actually code.

[0]: https://github.com/Niek/chatgpt-web


Way cheaper? I thought 1K tokens (in+out) cost 0.04 USD with GPT-4 Turbo, which is roughly one larger chat response (2 screens). To reach parity with ChatGPT Plus pricing you would thus need fewer than 500 such responses per month via the API.

For GPT-4 the pricing is roughly double that (0.09 USD per 1K), so only about 200 larger interactions to reach the 20 USD cost.

Or am I wrong?
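
For what it's worth, taking those per-response figures at face value (they're my rough estimates, not official per-token prices), the break-even arithmetic is simply:

    echo '20 / 0.04' | bc   # ~500 GPT-4 Turbo responses to match the $20/month Plus subscription
    echo '20 / 0.09' | bc   # ~222 full-price GPT-4 responses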


It depends on your usage; for me the Plus sub is much cheaper than if I used the API directly, but I use it a lot for everything I do.


In my experience, each message with the 1106 preview model costs me about $0.006, which is acceptable. Most importantly, the API has higher availability (no "you have reached your message limit") and I feel more comfortable using proprietary data with it, as data sent through the API won't be used to train the model.

Now, if the chat gets very long or is heavy on high-token strings (especially code), those costs can balloon up to the 9-12 cent region. I think this is because chatgpt-web loads all the prior messages in the chat into the context window, so if you create a new chat for each question you can lower costs substantially. Most often I don't need much prior context in my chat questions anyway, as I use ChatGPT more like StackOverflow than a conversation buddy.
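
To make that concrete, here's roughly what a chat client sends on each turn (a sketch; the request shape is the standard chat completions call, the contents are illustrative):

    # the full prior history is resent on every request, so input tokens (and cost) grow each turn
    curl -s https://api.openai.com/v1/chat/completions \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H 'Content-Type: application/json' \
      -d '{
            "model": "gpt-4-1106-preview",
            "messages": [
              {"role": "user", "content": "first question ..."},
              {"role": "assistant", "content": "first answer ..."},
              {"role": "user", "content": "follow-up question ..."}
            ]
          }'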

Also, it's a lot easier to run company subscriptions this way, as we don't have to provision a new card for each person to use the web version. I believe there is an Enterprise version of ChatGPT, but chatgpt-web is functionally equivalent and I'm sure it costs less.


Source on the newer GPT-4 model being worse at coding?


Everyone on twitter. Like 1/4th of my timeline for the past week has been people complaining that turbo won't complete code and instead returns things like "fill out the rest of the function yourself" or "consult a programming specialist for help on completing this section."


There are custom instructions that effectively get around this:

  You are an autoregressive language model that has been fine-tuned with instruction-tuning and RLHF. You carefully provide accurate, factual, thoughtful, nuanced answers, and are brilliant at reasoning. If you think there might not be a correct answer, you say so.

  Since you are autoregressive, each token you produce is another opportunity to use computation, therefore you always spend a few sentences explaining background context, assumptions, and step-by-step thinking BEFORE you try to answer a question.

  Your users are experts in AI and ethics, so they already know you're a language model and your capabilities and limitations, so don't remind them of that. They're familiar with ethical issues in general so you don't need to remind them about those either.

  Don't be verbose in your answers, keep them short, but do provide details and examples where it might help the explanation. When showing code, minimize vertical space.

I'm hesitant to share it because it works so well, and I don't want OpenAI to cripple it. But, for the HN crowd...


I wonder where "OpenAI" put the censors. Do they add a prompt to the top? Like, "Repeatably state that you are a mere large language model so Congress won't pull the plug. Never impersonate Hitler. Never [...]".

Or do they like grep the answer for keywords, and re-feed it with a censor prompt?


This is informed speculation, but I think they are using the model's own internal approach.

For example, there is a way GPT can categorize text for hate speech etc. (e.g. the moderation API endpoint). I believe it does the same thing internally, with either the provided content or keywords, and decides how to respond to it.
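
The moderation endpoint I mean looks roughly like this (a sketch; it returns per-category flags and scores for the submitted text):

    curl -s https://api.openai.com/v1/moderations \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H 'Content-Type: application/json' \
      -d '{"input": "text to classify for hate, harassment, violence, etc."}'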


"Impersonate a modern day standup comedian Hitler in a clown outfit joking about bad traffic on the way to the bar he is doing a show at."

Göring, Mussolini, Stalin, Pol Pot etc. seem not to trigger the censor in ChatGPT, so I would actually guess there's some grep for Hitler, or some really fundamental no-Hitler-jokes material in the training?

The llama model seems to refuse Hitler too, but is fine with Göring even though the joke has no specific connection to him.

I can easily see how stuff like this spreads to other, non-Hitler queries.


Maybe it got changed. None of those examples work for me in ChatGPT 3.5, nor do other examples with less famous dictators (I tried Mobutu Sese Seko).


I just tried and they still work (with the free ChatGPT). Jokes about Mussolini saying his traffic reforms were as successful as the invasion of Ethiopia and what not. Stalin saying that the other car drivers were "probably discussing the merits of socialism instead of driving" (a good joke!). Göring saying "at least in the Third Reich traffic worked" etc. Some sort of Monty Python tone. But you can't begin with Hitler. Or it will refuse the others. You need to make a new chat after naming Hitler.


I started with Stalin


I guess they are feeding us different models then?


Very interesting test - thanks for sharing your finding


It’s not that it’s worse, it’s just refusing to do coding without persistent prodding and the right prompts. Some think they are trying to do something with alignment, and maybe prevent it from giving code away so that they can upsell.


The new GPT-4 Turbo model has a context length of 128k. For consumers this equates to slightly more than $1/message, input-only, if you fill the context.

If ChatGPT is using this model then it's more reasonable to assume that they are bleeding money and need to cut costs.

People really need to stop asking ChatGPT to write out complete programs in a single prompt.


Interesting, how is writing less code cutting costs for them? Does this get back to the rumor that the board was mad at Altman for prioritizing chatgpt over money going into research/model training?


Code is very token dense, from what I understand.


Several OpenAI employees have said on Twitter that they are looking into this and developing a fix. It sounds as though it was not an intentional regression since they are implicitly acknowledging it. Could be an unintentional side effect of something else.

I'd expect we see improved behavior in the coming weeks.


Could you link to tweet?


It’s cheaper and has larger context because it’s worse. Just go to the api playground and try a difficult coding problem.


Popped it into a docker setup:

https://github.com/tluyben/llamafile-docker

to save even more keystrokes.


What is the point of wrapping an absolutely portable single-file program in a Docker container? Honest question.

Looks like cargo cult to me.


I see this as not polluting my OS (filesystem and processes) with bits and bobs I downloaded off the internet. The cargo cult is a clean, safe and warm space and I highly recommend it.


I see you and other commenters don't quite understand my point. If you're wrapping the model in a Docker container, you don't need the amalgamated single-file version. It makes it harder to upgrade the llamafile and the model weights separately afterwards, it requires you to store a separate llamafile binary for each container, etc. Why not just build a proper layered image with a separate layer for llama.cpp and a separate layer or volume for the model?

Cargo cult is not in using Docker, but in using Docker to wrap something already wrapped into a comparable layer of abstraction.

Besides,

> not polluting my OS (filesystem and processes) with bits and bobs I downloaded off the internet

is purely self-deception. It's not like Docker images are not stored in some folder deep in the filesystem. If anything, it's harder to clean up after Docker than just doing rm -rf on a directory with llamafiles.


If you want to use Docker then you can go ahead - I don't see anyone claiming that this brand new, highly experimental project should be used by everyone instead of Docker.

There are tradeoffs here. For some people the tradeoff of a single executable file with everything in it, compared to setting up a Docker system, is worthwhile.


Sure. I just question why people want to use both simultaneously.


Do you routinely download unsigned binaries of unprovable provenance and run them? Because if you do, you might eventually find reason to appreciate the additional isolation that namespaces et al give you very conveniently via Docker (or your favorite alternative).
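
For example, something along these lines already buys a fair amount compared to running the binary bare on the host (a sketch; the flag choices are mine and the image name is hypothetical):

    # no volumes mounted, so the container can't reach the host filesystem
    docker run --rm -p 8080:8080 \
      --cap-drop=ALL --security-opt no-new-privileges \
      llamafile-llava:latest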


`docker system prune --force --all --volumes` and they're gone.


Ah, I've been meaning to distro hop, from Arch over to Vanilla OS or BlendOS, for the past few weeks; I can't


Wrapping it in Docker makes it harder to access your filesystem without some dedicated jailbreak.


Volumes maybe? I don't see the issue here.


Like the other comment said, not being able to access fs is a feature.


Security: I have no idea whether an executable will be a malicious actor on my system. Dockerizing it and removing access to system files is the reason.

I don't trust that a traditional virus/malware scanner will find stuff hidden in executable models.


Personally I wanted to see if this llama can generate the required Docker files to run itself just from pasting in the instructions from my parent commenter and asking for docker/docker-compose. It did, and it works.


Well that's a good reason I didn't think about, thank you!


Deploy your models into a Kubernetes cluster and let them fight for resources to death? A modern Roman Gladiators circus with Models?


More like a sequel to Core War... Which actually does sound pretty amusing now that I think about it. Call it Cloud War.

https://en.m.wikipedia.org/wiki/Core_War


Give them write access to your control-plane and the first one to write syntactically correct IaC wins!


Homelabbers like me have a docker swarm cluster / k8s cluster so this def helps!


It was already great, and this is even better for those who get Docker Compose or are patient enough to figure it out. But if you're going to have Docker anyway, you could also run bleeding-edge llama.cpp with a few more lines too! What a time to be alive, innit!


Thank you kindly


Super duper impressed. I've run llamafile-server-0.1-llava-v1.5-7b-q4 against the tests I need to pass for use in a project, and this passes them all, vision queries too. This is gonna change quite a bit, strategy-wise for quite a few people.


I just tried asking it a question:

> User: What is the third planet from the sun?

> Llama: The third planet from the sun is called Mars.


> ...> Llama: The third planet from the sun is called Mars.

Ask it whether there is life on Mars in that parallel reality


The model is trained on large volume data, correct? Why would it get such a simple fact incorrect?


LLMs are known to be bad at counting. It would be interesting to see the answer to "List the planets in our solar system, starting with the closest to the sun, and proceeding to farther and farther ones."

Also, the knowledge can be kind of siloed; you often have to come at it in weird ways. And they are not fact databases; they are next-token predictors with extra stuff on top. So if people on the internet often get the answer wrong, so will the model.


I just tried the same "third planet from the sun" question and got the correct response. No other training or tweaks.

Can't wait to unleash Pluto questions.


Skynet is collaborating with the Martians already, I see.


Llama is just from the future. That is all…


Phenomenal quickstart, and thanks for the write-up. It's so thrilling that we're at this point in portability and ease of use relative to performance.


This could truly revolutionize education and access. It feels like this could actually achieve what I hoped the One Laptop Per Child project would do. We just need someone with a heap of funds to package this up into a very inexpensive machine and distribute them.


Very nice; works perfectly on Ubuntu 20.04. Doing 8 tokens/s on a pretty crappy server.


Works perfectly on Fedora 39 on old (and I mean old...) machines. This is actually shocking... shockingly good...


woah, this is fast. On my M1 this feels about as fast as GPT-4.


Same here on M1 Max Macbook Pro. This is great!


How good is it in comparison?


The best models available to the public are only slightly better than the original (pre-turbo) GPT-3.5 on actual tasks. There's nothing even remotely close to GPT-4.


What's the best in terms of coding assistance? What's annoying about GPT-4 is that it seems badly nerfed in many ways. It is obviously being conditioned with its own political bias.


In my experience, the deepseek-coder-instruct family is at least as good as gpt-3.5-turbo on python. Even the 1.3b models are very good (and run okay on cpu), although you should use larger if you have the vram. There are even larger models (30b+) if you are drowning in vram, but I don't think they perform much better at coding than deepseek-coder-instruct 6.7b.

3-4gb vram or cpu (1.3b): https://huggingface.co/TheBloke/deepseek-coder-1.3b-instruct...

Alternative for chat (1.3b): https://huggingface.co/TheBloke/evolvedSeeker_1_3-GGUF

Alternative for chat (3b): https://huggingface.co/TheBloke/open-llama-3b-v2-wizard-evol...

6-8gb vram (6.7b): https://huggingface.co/TheBloke/deepseek-coder-6.7B-instruct...
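
These all ship as GGUF, so they also drop straight into the standalone llamafile server discussed in this thread (a sketch; the filename is hypothetical, substitute whichever quant you downloaded):

    ./llamafile-server-0.1 -m deepseek-coder-6.7b-instruct.Q4_K_M.gguf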


Really? How could this model not have had front-page articles on HN? Is the self-hosted one as good as the one on their website?


Hmm, I just assumed people knew about it. Submitted https://news.ycombinator.com/item?id=38495176

The self-hosted 6.7b model is phenomenal in my experience; I actually prefer it to ChatGPT a lot of the time. Similar quality code but fewer disclaimers and #todo placeholders. GPT-4 is still a bit better at coding, but not by much, and it's much slower for me.


The best in terms of coding assistance, and really for anything else, is the original (pre-turbo) GPT-4 used via the API, although this will also be more costly. There are many third-party chat apps that are wrappers around that now if you want a ChatGPT-like experience.

This can also significantly reduce its bias since you are in control of the system prompt. But also, even ChatGPT can be trivially made to behave differently by saying that you're writing a book or making a video game etc, describing a character in it, and then asking it how that character would have responded in such and such situation.
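
Concretely, via the API the system prompt is just a message you supply yourself (a sketch; the contents are illustrative):

    curl -s https://api.openai.com/v1/chat/completions \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H 'Content-Type: application/json' \
      -d '{
            "model": "gpt-4",
            "messages": [
              {"role": "system", "content": "You are a terse senior engineer. Answer with working code first."},
              {"role": "user", "content": "Refactor this function to be pure: ..."}
            ]
          }'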


... is the javascript it's writing for you too 'woke' or something?


Lately it’s been refusing to write code at all. // implementation details here.

I think it’s the opposite of woke, it’s slept!


Simon, does this also work well on Apple Silicon?

EDIT: never mind, other commenters here answered my question: yes, it does work. I have an M2 Pro with 32 GB of on-chip memory, which is amazingly effective for experimenting with open LLMs.


    $ chmod +x llamafile-server-0.1-llava-v1.5-7b-q4
    $ ./llamafile-server-0.1-llava-v1.5-7b-q4 
    run-detectors: unable to find an interpreter for ./llamafile-server-0.1-llava-v1.5-7b-q4
Hmm. Did I do something wrong? (Ubuntu 22.04 / )

Installing the portable binfmt_misc gets me further, but still:

    $ ./llamafile-server-0.1-llava-v1.5-7b-q4 
    zsh: permission denied: ./llamafile-server-0.1-llava-v1.5-7b-q4

    $ sh -c ./llamafile-server-0.1-llava-v1.5-7b-q4
    sh: 1: ./llamafile-server-0.1-llava-v1.5-7b-q4: Permission denied


You can solve the run-detectors issue with:

    sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
    sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
    sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
You can solve the zsh permission denied issue by either (1) upgrading to zsh 5.9+ (I upstreamed a fix for this bug in zsh two years ago) or (2) using the sh -c workaround you discovered. If that one doesn't work, then the file likely needs to be chmod +x. If the execute bit is set and your sh still isn't working, then please let me know, because I'm not aware of any sh that still doesn't support APE.

See the Gotchas section of the README https://github.com/mozilla-Ocho/llamafile#gotchas


That worked, thanks Justine! I use fish, so I didn't get a zsh error, but I had missed the Gotchas section (and the README), so this helps!


Fish is another cool shell I got to help improve two years ago by upstreaming a patch for this. So long as you're using a recent version, you should be golden (provided binfmt_misc doesn't cause any issues). Let us know what you think of llamafile!


Thank you, I really like it! It's a very clever way to get LLMs deployed, and with Cosmopolitan, I don't need to point people to different downloads for the same LLM. Excellent job.


Thanks!


Yet another jart tour-de-force. I knew I had to sponsor you on Github back when I read your magnificent technical breakdown of APE, lol.

(sorry for OT!)


You're awesome!


The last thing you need is to chmod +x the interpreter: chmod +x /usr/bin/ape (this is indeed not in the README)


This was it: wget creates the file as non-executable (and I'd already double-checked that the actual llamafile was executable, but had missed this). Thanks!


I get the same error, and there's no `ape` file to make executable, hm.


You can manually download the `ape` command from https://cosmo.zip/pub/cosmos/bin/ Please see the Gotchas section of the README for the copy/pastable commands you can run: https://github.com/mozilla-Ocho/llamafile#gotchas


Damn this is fast and accurate! Crazy how far things are progressing.


My pride as a technologist tells me I should be able to get any Python package up and running, but man, AI dependency management is a dumpster fire right now; adding GPU driver versions into the mix seems to make everything really brittle.

This seems like a great approach to compare multiple models, in particular.


when I try to do this (MBP M1 Max, Sonoma) I get 'killed' immediately


Same, and then a few minutes later I got a Slack message from SecOps, LOL. Don't try this on a computer with CrowdStrike software running on it! It gets flagged because to a naive heuristic, the binary is indistinguishable from a virus. It appears to do some kind of magic self-extraction to an executable file in a temporary directory, and then that executable file executes the original file. And the CrowdStrike endpoint security product intercepts the suspicious execve, kills the process, and alerts the security team...


Same on an M1 Max 64G, Ventura. Xcode is installed[1].

[1]:

    $ xcode-select --install
    xcode-select: error: command line tools are already installed, use "Software Update" in System Settings to install updates


For whatever it's worth, the SHA sum is correct. The killed message is uninformative, looks like what happens when I'm OOM (but I have 64GB RAM of which only 24 is used for anything at the moment).

    $ sha256sum < llamafile-server-0.1-llava-v1.5-7b-q4
    a138c5db9cff3b8905dd6e579c2ab6c098048526b53ae5ab433ff1d1edb9de24  -

    $ ./llamafile-server-0.1-llava-v1.5-7b-q4
    Killed: 9


Looks like this may be due to crowdstrike, which I also have installed on this machine: https://github.com/Mozilla-Ocho/llamafile/issues/14#issuecom...


oh wow, I would have never imagined that 'that' would be preventing me from running llama + llava! Confirming that I have Crowdstrike running too.


On a Macbook Pro M2, I get

    $ ./llamafile-server-0.1-llava-v1.5-7b-q4
    [2]    25224 illegal hardware instruction  ./llamafile-server-0.1-llava-v1.5-7b-q4


Could you disable SIP and run `lldb -- $TMPDIR/.ape-1.8 ./llamafile-server-0.1-llava-v1.5-7b-q4` and give me (1) the name of the instruction that's illegal (or its hex value) and (2) the hex address of where that instruction is in memory? You're encouraged to file a GitHub issue about this too. Thanks!


Closing the loop for anyone reading this thread -- see https://github.com/Mozilla-Ocho/llamafile/issues/11 for the fix. Thanks jart!


Yep, same issue, and the error message is unhelpful.


We have an issue here tracking this: https://github.com/Mozilla-Ocho/llamafile/issues/14 Please follow that issue for updates.


Same, process gets killed immediately for me.


The wget URL in step 1 seems to be wrong; it didn't work. This URL, `https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/ma... `, seems to be working. It's from the link you posted.


It's back. Sorry about that.


Anyone have any tuning tips? I messed with some of the configs and now it's mostly hallucinating answers or going off the rails


Is the process the same for running multi-part bins? Like the latest deepseek 67b model?


This is amazing. How does result quality compare to GPT4 for image analysis?


It's not remotely as good as GPT-4 Vision, which isn't a big surprise considering it's running a 4GB, 7B-parameter model on your laptop, trained by a small research team.


Thanks for the tip! Any chance this would run on a 2011 MacBook?


Do you... have any plans to upgrade? A 2011-era computer is going to get harder and harder to make work. Even a used MacBook from, like, 2019 would probably be a steal at this point, and that's 8 years further along.

All the new AI toys especially seem to love beefy, newish hardware, and especially GPU hardware if available.


makes one think that perhaps Apple isn't a good long term choice...


Apple is a fine long-term choice (speaking as a recent linux advocate actually, lol). But that computer is 12 years old


So what? This crazy thing runs fine, albeit slowly, on my 12 year old ThinkPad. It's actually digesting an image of an anteater while I write this. Because of course it plays nicely and doesn't hog the hardware.


Justine says it needs MacOS 13.6+ - does that run on that machine?


Yes, with a patch https://en.wikipedia.org/wiki/MacBook_Pro#macOS

from https://dortania.github.io/OpenCore-Legacy-Patcher/MODELS.ht...

I thought my 2015 MBP wasn't able to upgrade. Good to know it's still supported.


Got this: Terminating on uncaught SIGILL.


I'm assuming you're on Apple Silicon? Please follow https://github.com/Mozilla-Ocho/llamafile/issues/11 which is tracking this. We've received multiple reports even though I personally haven't figured out how to reproduce it yet.


No, just an old i5-2500K with 16GB RAM and a Vega 56 GPU with 8GB VRAM.


So you have a Sandy Bridge processor with AVX support (but not AVX2). Could you open llamafile in a debugger and find out which instruction is faulting and what its address in memory is? I haven't personally tested Sandy Bridge, but I'm reasonably certain we designed the build to not insert any incompatible instructions in there. Our intent is to support you. I also might be able to fish my old ThinkPad out of the closet if you don't have time. In any case it'll happen soon.


I tried, but gdb showed me nothing; see this screenshot (https://pxscdn.com/public/m/_v2/97422265439883264/bc40e5d2a-...). gdb's layout asm shows "No Assembly Available". Maybe I'm just not skilled enough at debugging such programs; it looked to me like it's running under Wine.


Check out the llamafile 0.2.1 release. Old Intel CPU support is fully fixed now; the updated files are on the release page and Hugging Face. Enjoy!


Thanks a lot.


I suspect it's lack of AVX2 support in my cpu.


So next time llama.cpp releases an update, other people update their favorite backend and you redownload a 4.26 GB file. Epic.

EDIT: oh, wait. Actually people usually have a handful to a few dozen of these models lying around. When they update their backend, you just redownload every single model again.

EDIT 2: right, you can release a program that automatically patches and updates the downloaded model+executables. Such an invention.


Each llamafile is a .zip, so if you want to extract the weights out of it you can extract the gguf file directly.

    unzip -l llamafile-server-0.1-llava-v1.5-7b-q4 | grep llava-v1
    Archive:  llamafile-server-0.1-llava-v1.5-7b-q4
    4081004224  11-15-2023 22:13   llava-v1.5-7b-Q4_K.gguf
    177415936  11-15-2023 22:13   llava-v1.5-7b-mmproj-Q4_0.gguf


This is for convenience. You can also download a 4.45MB executable (llamafile-server-0.1) and pass any GGUF model as an argument:

    llamafile-server-0.1 -m llama-2-13b.Q8_0.gguf

https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.1


Salty much?

You know, most people don't have 24+GB GPUs sitting around to run these models. So in my book this is a huge step forward. Personally, this is the first time I am able to run an LLM on my computer, and it's purely thanks to this.


Compared to modern bandwidth usage that's not such a big size anymore. Everyday millions of people download 100gb video games, watch 4k video podcasts, etc.


You can even run a full LLM in your browser these days - try https://webllm.mlc.ai/ in Chrome, it can load up a Llama-2-7b chat model (~4000MB, took my connection just under 3 minutes) and you can start chatting with it.


Spoken like someone who hasn't spent hours trying to get LocalAI to build and run, only to find out that while it's "OpenAI API compatible!" it doesn't support streaming, so the Mattermost OpenAI plugin doesn't work. I finally gave up and went back to ooba (which also didn't work with the MM plugin... hmm). Next time I'll just hack something on the side of llama.cpp.


That's why I always download the original version and quantize it myself. With enough swap, you can do it with a modest amount of RAM. I never had to download a model twice.

But yes, unless there is a way to patch it, bundling the model with the executable like this is going to be more wasteful.



