
As opposed to simply being locked into OpenAI's APIs as the only option?


A false dilemma, also referred to as false dichotomy or false binary, is an informal fallacy based on a premise that erroneously limits what options are available.[1]

[1] https://en.wikipedia.org/wiki/False_dilemma


These models cost millions to train. The only reason open-source LLMs have a heartbeat is they’re standing on Meta’s weights. The only third path is a public option.


> The only reason open-source LLMs have a heartbeat is they’re standing on Meta’s weights.

Not necessarily.

RWKV, for example, is a different architecture that wasn't based on Facebook's weights whatsoever. I don't know where BlinkDL (the author) got the training data, but they seem to have done everything mostly independently otherwise.

https://github.com/BlinkDL/RWKV-LM

disclaimer: I've been doing a lot of work lately on an implementation of CPU inference for this model, so I'm obviously somewhat biased since this is the model I have the most experience in.


My personal bet is specialised models have a niche. Do you think one of these could compete with GPT if e.g. trained on a law firm’s correspondence and contracts?


Probably not, honestly. Because it's an RNN, old information gradually deteriorates as new information is fed into the model. That's undesirable compared to e.g. transformers, which can reference any part of the context without degradation but have a hard limit on context size. (RWKV can ingest a theoretically infinite number of tokens, but after around 16k it starts to degrade into madness until restarted, so in practice it does sort of have a limit too.)

(The reason it degrades is that a single internal state is updated in place per token, and the current models have only been trained with up to 8192 tokens of context, so once you get past double that or so, the state starts to diverge from "sanity", with no known way to correct it. And then priming a new instance of the model with 8192 or so tokens of the new context takes a really long time, because you can't compute the next token of an RNN until you also have the previous one!)
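To make the "strictly sequential" point concrete, here's a toy sketch (the update rule is hypothetical, not RWKV's actual math): a recurrent model carries one fixed-size state that is overwritten per token, so token t can't be processed until token t-1 has updated the state, and old tokens' contributions decay away.

```python
# Toy illustration (NOT RWKV's actual update rule): an RNN keeps a single
# fixed-size state that is overwritten in place for every token.

def toy_rnn_step(state, token):
    # Hypothetical mixing rule: old information is exponentially decayed
    # as each new token is folded in -- the source of gradual forgetting.
    decay = 0.9
    return decay * state + (1 - decay) * token

def prime(tokens):
    state = 0.0
    for t in tokens:  # strictly sequential: no parallel prefill possible
        state = toy_rnn_step(state, t)
    return state

# After n steps the first token's contribution has shrunk by decay**n,
# which is why early context fades, while a transformer can still attend
# to any position directly (at the cost of a hard context limit).
print(prime([1.0] * 10))
```

This also shows why re-priming is slow: the loop over tokens can't be parallelized the way a transformer's prompt ingestion can.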

With some fine-tuning (which, unfortunately, is still out of reach for most people, but I digress) it can be turned into a pretty good chat model, generate story completions, generate boilerplate code, etc., and the base model is reasonably okay at most of these things already.

I think it's definitely a competitor in some areas, though I don't remember if there have already been benchmarks putting it up against the other models. I do know that it's better than the majority of other open-source models, including transformer-based ones, but this is probably more the fault of training data than architecture.


It is interesting how “catastrophic forgetting” is subtly different technically between these large corpus LLMs and say a CNN, but the basic “the sequences you are looking for are not here” is the same.


oh, you said trained. If trained, then the long context length issue may not be as severe. It might still go mad if you let it eat too much of a hundred-page lawsuit, but if you work with portions of it (like how transformers work), RWKV can be vastly more economical than the larger models (requiring a much less powerful GPU, or even running on no GPU at all, thanks to rwkv.cpp).
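The "work with portions of it" idea can be sketched as a simple sliding-window chunker; the window size matches the 8192-token trained context mentioned above, and the function name and overlap parameter are illustrative, not part of rwkv.cpp's API:

```python
# Illustrative chunker: split a long tokenized document into windows no
# longer than the model's trained context, so the recurrent state never
# drifts far past what it saw in training.

def chunk_tokens(tokens, window=8192, overlap=512):
    """Yield overlapping windows of at most `window` tokens."""
    step = window - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + window]

doc = list(range(20000))  # stand-in for a tokenized hundred-page lawsuit
windows = list(chunk_tokens(doc))
# Each window primes a fresh model state, trading re-priming cost for a
# state that stays within the trained context length.
print(len(windows), len(windows[0]))  # 3 windows of <= 8192 tokens each
```

The overlap gives each fresh state a little shared context with the previous window, at the cost of some re-priming time per chunk.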

rwkv.cpp in particular depends on a project that would not have existed in its current form without LLaMA, even though the project itself isn't LLaMA-specific. However, there are enough other implementations of CPU inference (at least two?) that I think RWKV could still exist even if LLaMA had never existed.


Didn't the whole "we have no moat" paper show how this is actually not the case and that the future is far brighter for open-source LLMs?



