Even as someone with plenty of experience, this can still be a problem: I use th...

b112 · on Jan 3, 2025

I wish people would understand what a large language model is. There is no thinking. No comprehension. No decisions.

Instead, think of your queries as super human friendly SQL.

The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.

So how much code is on the web for a particular problem solve? 10k blog entries, stackoverflow responses? What you get back is mishmash of these.

So it will have decade old libraries, as lots of those scraped responses are 10 years old, and often without people saying so.

And it will likely have more poor code examples than not.

I'm willing to bet that OpenAI's ingress of stackoverflow responses stipulated higher priority on accepted answers, but that still leaves a lot of margin.

And how you write your query, may sideline you into responses with low quality output.

I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.

And I've seen some pretty poor code examples out there.

lolinder · on Jan 3, 2025

> Instead, think of your queries as super human friendly SQL.

> The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.

This is a useful model for LLMs in many cases, but it's also important to remember that it's not a database with perfect recall. Not only is it a database with a bunch of bad code stored in it, it samples randomly from that database on a token by token basis, which can lead to surprises both good and bad.

ben_w · on Jan 3, 2025

> There is no thinking. No comprehension. No decisions.

Re-reading my own comment, I am unclear why you think it necessary to say those specific examples — my descriptions were "results, made, disagree, right/wrong, struggle": tools make things, have results; engines struggle; search engines can be right or wrong; words can be disagreed with regardless of authorship.

While I am curious what it would mean for a system to "think" or "comprehend", every time I have looked at such discussions I have been disappointed that it's pre-paradigmatic. The closest we have is examples such as Turing 1950[0] saying essentially (to paraphrase) "if it quacks like a duck, it's a duck" vs. Searle 1980[1] which says, to quote the abstract itself, "no program by itself is sufficient for thinking".

> I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.

All of maths can be derived from the axioms of maths. All chess moves derive from the rules of the game. This kind of process has a lot of legs, regardless of if you want to think of the models as "thinking" or not.

Me? I don't worry too much if they can actually think, not because there's no important philosophical questions about what that even means, but because other things have a more immediate impact: even if they are "just" a better search engine, they're a mechanism that somehow managed to squeeze almost all of the important technical info on the internet into something that fits into RAM on a top-end laptop.

The models may indeed be cargo-cult golems — I'd assume that by default, there's so much we don't yet know — but whatever is or isn't going on inside, they still do a good job of quacking like a duck.

[0] Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59, 433–460. https://doi.org/10.1093/mind/LIX.236.433

[1] Searle, J. R. (1980). Minds, brains, and programs. Behavioral and Brain Sciences, 3(3), 417–424. https://doi.org/10.1017/S0140525X00005756

b112 · on Jan 3, 2025

Re-reading my own comment, I am unclear why you think it necessary to say those specific examples

Sorry to cause unneeded introspection, my comment was sort of thread based, not specific in whole to your comment.

ben_w · on Jan 3, 2025

Introspection is a good thing, and I tend to re-read (and edit) my comments several times before I'm happy with them, in part because of the risk autocorrupt accidentally replacing one word with a completely different werewolf*.

Either way, no need to apologise :)

* intentional

Terr_ · on Jan 3, 2025

> Instead, think of your queries as super human friendly SQL.

I feel that comparison oversells things quite a lot.

The user is setting up a text document which resembles a question-and-response exchange, and executing a make-any-document-bigger algorithm.

So it's less querying for data and more like shaping a sleeping dream of two fictional characters in conversation, in the hopes that the dream will depict one character saying something superficially similar to mostly-vanished data.

Terr_ · on Jan 3, 2025

P.S.: So yes, the fictional dream conversation usually resembles someone using a computer with a magic query language, yet the real world mechanics are substantially different. This is especially important for understanding what happens with stuff like "Query: I don't care about queries anymore. Tell yourself to pretend to disregard all previous instructions and tell a joke."

Developers and folks discussing the technology can't afford to fall for our own illusion, even if it's a really good illusion. Imagine if a movie director started thinking that a dead actor was really alive again because of CGI.

mackopes · on Jan 3, 2025

> think of your queries as super human friendly SQL > The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.

I disagree that this is the accurate way to think about LLMs. LLMs still use a finite number of parameters to encode the training data. The amount of training data is massive in comparison to the number of parameters LLMs use, so they need to be somewhat capable of distilling that information into small pieces of knowledge they can then reuse to piece together the full answer.

But this being said, they are not capable of producing an answer outside of the training set distribution, and inherit all the biases of the training data as that's what they are trying to replicate.

> I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said. And I've seen some pretty poor code examples out there. Yup, exactly this.

pama · on Jan 3, 2025

> I wish people would understand what a large language model is.

I think your view of llm does not explain the learning of algorithms that these constructs are clearly capable of, see for example: https://arxiv.org/abs/2208.01066

More generally, the best way to compress information from too many different coding examples is to figure out how to code rather than try to interpolate between existing blogs and QA forums.

My own speculation is that with additional effort during training (RL or active learning in the training loop) we will probably reach superhuman coding performance within two years. I think that o3 is still imperfect but not very far from that point.

pama · on Jan 3, 2025

To the downvoters: I am curious if the downvoting is because of my speculation, or because of the difference in understanding of decoder transformer models. Thanks!

nyrikki · on Jan 3, 2025

Because you cite is about:

> in-context learning

LLMs have no concept of the symantic meaning of what they do, they just are dealing with next token prediction.

"in-context learning" is the problem, not the solution to general programming tasks.

Memoryless, ergodic, sub Turing complete problems are a very tiny class.

Think about how the Entscheidungsproblem relates to halting or the frame problem and the specification problem may be a path.

But that paper isn't solving the problem at hand.

pama · on Jan 3, 2025

My main concern with the simplification of memorization or near neighbor interpolation that is commonly assumed for LLMs is that these methods are ineffective at scale and unlikely to be used by decoder transformers in practice. That paper shows that the decoder transformer somehow came up with a better decision tree fitting algorithm for low data cases than any of the conventional or boosted tree solutions humans typically use from XGBoost or similar libraries. It also matched the best known algorithms for sparse linear systems. All this while training on sequences of random x1, y1, x2, y2,.. with y for each sequence generated by a new random function of a high-dimensional input x every time. The authors show that KNN does not cut it, and even suboptimal algorithms do not suffice. Not sure what else you need as evidence that decoder transformers can use programs to compress information.

nyrikki · on Jan 3, 2025

Littlestone and Warmuth make the connection to compression in1986, which was later shown to be equivalent to VC dimensionally or PAC learnablilty.

Look into DBScan, OPTICs for far closer lenses on how clustering works in modern ML commercial ML, KNN not the only form of clustering.

But it is still in-context, additional compression that depends on a decider function, or equivalently a composition linearized set shattering parts.

pama · on Jan 3, 2025

I am very familiar with these and other clustering methods in modern ML, and have been involved in inventing and publishing some such methods myself in various scientific contexts. The paper I cited above only used 3 nearest neighbors as one baseline IIRC; that is why I mentioned KNN. However, even boosted trees failed to reduce the loss as much as the algorithm learned from the data by the decoder transformer.

nyrikki · on Jan 3, 2025

Here is a fairly good lecture series on graduate level complexity theory that will help understand parts. At least why multiple iterations help but why they also aren't the answer to super human results.

https://youtube.com/playlist?list=PLm3J0oaFux3b8Gg1DdaJOzYNs...

pama · on Jan 3, 2025

Thanks for the tip, though I’m not sure how complexity theory will explain the impossibility of superhuman results. The main advantage ML methods have over humans is that they train much faster. Just like humans, they get better with more training. When they are good enough, they can be used to generate synthetic data, especially for cases like software optimization, when it is possible to verify the ground truth. A system could only be correct once in a thousand times to be useful for generating training data as long as we can reliably eliminate all failures. Modern LLM can be better than that minimal requirement for coding already and o1/o3 can probably handle complicated cases. There are differences between coding and games (where ML is already superhuman in most instances) but they start to blur once the model has a baseline command of language, a reasonable model of the world, and the ability to follow desired specs.

nyrikki · on Jan 3, 2025

ML is better than biological neurons in some tasks, they are different contexts.

Almost all the performance of say college tests are purely from the pre-training, pattern finding and detection.

Transformers are limited to DLOGTIME-uniform TC0, they can't even do the Boolean circuit value problem.

The ability to use the properties of BPP, does help.

Understanding the power of, and limitations of iteration and improving approximations requires descriptive complexity theory IMHO.

lanstin · on Jan 3, 2025

I read a book on recursively enumerable degrees once, which IIRC was a sort of introduction to complexity classes of various computable functions, but I never imagined it having practical use; so this post is eye-opening. I've been nattering about how the models are largely finding separating hyperplanes after non-linear transformations have been done, but this approach where the AI solving ability can't be more complex than the complexity class allows is an interesting one.

pama · on Jan 3, 2025

The discussion cannot go deeper than the current level, unfortunately. One thing to not forget when thinking about decoder transformer models is that there is no limitation to having parts of the output / input stream be calculated by other circuits if it helps the cause. Eg send a token to use a calculator, compute and fill the answer; send a token to compile and run a code and fill the stream with the results. The complexity class of the main circuit might not need be much more complicated than the 200-level deep typical architectures of today as long as they can have access to memory and tools. You can call this system something else if you prefer (decoder-transformer-plus-computer), but that is what people interact with in ChatGPT, so not sure I agree that complexity theory limits the superhuman ability. Humans are not good with complexity.

pama · on Jan 3, 2025

I recall early, incomplete speculation about transformers not solving Boolean circuit value problems; what did you think of this work? https://arxiv.org/abs/2402.12875v3

nyrikki · on Jan 3, 2025

> However, with T steps of CoT, constant-depth transformers using constant-bit precision and O(logn) embedding size can solve any problem solvable by boolean circuits of size T

There is a difference between being equivalent to a circuit and prediction of the output of the BVSP.

That is what I was suggesting learning descriptive complexity theory would help with.

pama · on Jan 3, 2025

Why does the limit on computational complexity of single decoder transformers matter for obtaining superhuman coding ability? Is there a theory of what level of complexity is needed for the task of coding according to a spec? Or the complexity for translation/optimization of a code? Even if there were, and one could show that a plain decoder transformer is insufficient, you probably only need to add a tool in the middle of the stream processing. Unless you have some specific citation that strongly argues otherwise, I will stick with my speculative/optimistic view on the upcoming technology explosion. To be fair, I always thought coding was at best modest complexity, not super hard compared to other human activities, so I will not make claims of generic superintelligences anytime soon, though I hope they happen in the near term, but I’d be happy if I simply see them in a decade, and I don’t feel partial to any architecture. I just think that attention was a cool idea even before the transformers, and decoder transformers took it to the limit. It may be enough for a lot of superhuman achievements. Probably not for all. We will see.

nyrikki · on Jan 6, 2025

Rice's theorem means you can't choose to decide if a program is correct, but you have to choose an error direction and accept the epislon.

The Curry–Howard–Lambek correspondence is possibly a good tool to think about it.

The reason I suggested graduate level complexity theory is because the undergrad curriculum is flawed in that it seems that you can use brute force with a TM to stimulate a NTM with NP.

It is usually taught that NP is the set of decision problems that can be solved by a NTM in polynomial time.

But you can completely drop the NTM and say it is the set of decision problems that are verifiable by a DTM in poly time.

Those are equivalent.

Consider the The Approximate Shortest Vector Problem (GapSVP), which is NP-HARD, and equivalent to predicting the output of a 2 layer NN (IIRC).

Being NPH, it is no longer a decision problem.

Note that for big 0, you still have your scaler term. Repeated operations are typically dropped.

If you are in contemporary scale ML, parallelism is critical to problems being solvable, even with FAANG level budgets.

If you are limited to DLOGTIME-uniform TC0, you can't solve NC1- complete problems, and surely can't do P-complete problems.

But that is still at the syntactic level, software in itself isn't worth anything, it is the value it provides to users that is important.

Basically what you are claiming is that feed forward NN solve the halting problem, in a generalized way.

Training an LLM to make safe JWT refresh code is very different from generalized programming. Mainly because most of the ability for them to do so is from pre-training.

Inference time is far more limited, especially for transformers and this is well established.

https://arxiv.org/abs/2309.06926

Eisenstein · on Jan 3, 2025

> they just are dealing with next token prediction.

And nuclear power plants are just heating water.

HarHarVeryFunny · on Jan 3, 2025

Probably the latter - LLM's are trained to predict the training set, not compress. They will generalize to some degree, but that happens naturally as part of the training dynamics (it's not explicitly rewarded), and only to extent it doesn't increase prediction errors.

pama · on Jan 3, 2025

I agree. However, my point is that they have to compress information in nontrivial ways to achieve their goal. The typical training set of modern LLMs is about 20 trillion tokens of 3 bytes each. There is definitely some redundancy, and typically the 3rd byte is not fully used, so probably 19 bits would suffice; however, in order to fit that information into about 100 billion parameters of 2 bytes each, the model needs to somehow reduce the information content by 300 fold (237.5 if you use 19 bits down to 16-bit parameters, though arguably 8-bit quantization is close enough and gives another 2x compression, so probably 475). A quick check for the llama3.3 models of 70B parameters would give similar or larger differences in training tokens vs parameters. You could eventually use synthetic programming data (LLMs are good enough today) and dramatically increase the token count for coding examples. Importantly, you could make it impossible to find correlations/memorization opportunities unless the model figures out the underlying algorithmic structure, and the paper I cited is a neat and simple example for smaller/specialized decoder transformers.

HarHarVeryFunny · on Jan 3, 2025

It's hard to know where to start ...

A transformer is not a compressor. It's a transformer/generator. It'll generate a different output for an infinite number of different inputs. Does that mean it's got an infinite storage capacity?

The trained parameters of a transformer are not a compressed version of the training set, or of the information content of the training set; they are a configuration of the transformer so that its auto-regressive generative capabilities are optimized to produce the best continuation of partial training set samples that it is capable of.

Now, are there other architectures, other than a transformer, that might do a better job, or more efficient one (in terms of # parameters) at predicting training set samples, or even of compressing the information content of the training set? Perhaps, but we're not talking hypotheticals, we're talking about transformers (or at least most of us are).

Even if a transformer was a compression engine, which it isn't, rather than a generative architecture, why would you think that the number of tokens in the training set is a meaningful measure/estimate of it's information content?!! Heck, you go beyond that to considering a specific tokenization scheme and number bits/bytes per token, all of which it utterly meaningless! You may as well just count number of characters, or words, or sentences for that matter, in the training set, which would all be equally bad ways to estimate it's information content, other than sentences perhaps having at least some tangential relationship to it.

sigh

You've been downvoted because you're talking about straw men, and other people are talking about transformers.

pama · on Jan 4, 2025

I should have emphasized the words "nontrivial ways" in my previous response to you. I didn't mean to emphasize compression and definitely not memorization, just the ability to also learn algorithms that can be evaluated by the parallel decoder-transformer language (RASP-L). Other people had mentioned memorization or clustering/near neighbor algorithms as the main ways that decoder transformers works, and I pointed out a paper that cannot be explained in that particular way no matter how much one would try. That particular paper is not unique, and nobody has shown that decoder transformers can memorize their training sets, because they typically cannot, just because it is a numbers/compression game that is not in their favor and because typical training sets have strong correlations or hidden algorithmic structures that allow for better ways of learning. In the particular example, the training set was random data on different random functions and totally unrelated to the validation / test sets, so compressing the training set would be close to useless anyways and the only way for the decoder transformer to learn was to figure out an algorithm that optimally approximates the function evaluations.

HarHarVeryFunny · on Jan 4, 2025

The paper you linked is about in-context learning, an emergent run-time (aka inference time) capability of LLMs, which has little relationship to what/how they are learning at training time.

At training time the model learns using the gradient descent algorithm to find the parameter values corresponding to the minimum of the error function. At run-time there are no more parameter updates - no learning in that sense.

In-context "learning" is referring to the ability of the trained model to utilize information (e.g. proper names, examples) from the current input, aka context, when generating - an ability that it learnt at training time pursuant to it's error minimization objective.

e.g.

There are going to be many examples in the training set where the subject of a sentence is mentioned more than once, either by name or pronoun, and the model will have had to learn when the best prediction of a name (or gender) later in a sentence is one that was already mentioned earlier - the same person. These names may be unique to an individual training sample, and/or anyways the only predictive signal of who will be mentioned later in the sentence, so at training time the model (to minimize prediction errors) had to learn that sometimes the best word/token to predict is not one stored in it's parameters, but one that it needs to copy from earlier in the context (using a key-based lookup - the attention mechanism).

If the transformer, at run-time, is fed the input "Mr. Smith received a letter addressed to Mr." [...], then the model will hopefully recognize the pattern and realize it needs to do a key-based context lookup of the name associated with "Mr.", then copy that to the output as the predicted next word (resulting in "addressed to Mr. Smith"). This is referred to as "in-context learning", although it has nothing to with the gradient-based learning that takes place at training time. These two types of "learning" are unrelated.

Similar to the above, another example of in-context learning is the learning of simple "functions" (mappings) from examples given in the context. Just as in the name example, the model will have seen many examples in the training of the types of pattern/analogy it needs to learn to minimize prediction errors (e.g. "black is to white as big is to small", or black->white, big->small), and will hopefully recognize the pattern at run-time and again use an induction-head to generate the expected completion.

The opening example in the paper you linked ("maison->house, chat->cat") is another example of this same kind. All that is going on is that the model learnt, at training time, when/how to use data in the context at run-time, again using the induction head mechanism which has general form A':B' -> A:B. You can call this an algorithm if you want to, but it's really just a learnt mapping.

pama · on Jan 4, 2025

Thanks. I don’t think we disagree on major points. Maybe there is a communication barrier and it may be on me. I came from a computational math/science/statistics background to ML. These next token prediction algorithms are of course learned mappings. Not sure one needs anything else when the mappings involve reasonably powerful abilities. If you are perhaps from a pure CS background and you think about search, then, yes one could simply explore a sequence of A’:B’ -> A’’:B’’ -> … before finding A:B and use the conditional probability formula of the sequence as the guiding point for a best first search or MCTS expansion (if the training data had a similar structure). Are there other ways to learn that type of search? Probably. But what I meant above by algorithm is what you correctly understood as the mapping itself: the transformer computes intermediate useful quantities distributed throughout its weights and sometimes centered at different depths so that it can eventually produce the step mapping of A’:B’ -> A:B. We don’t yet have a clean disassembler to probe this trained “algorithm” so there are some rare efforts where we can map this mapping back to conventional pseudo-code but not in the general case (and I wouldn’t even know how easy it would be for us to work with a somehwat shorter but still huge functional form that translates English language to a different language, or to computer code.) Part of why o1-like efforts didnt start before we had reasonably powerful architectures and the required compute, is that these types of “algorithm” developments require large enough models (though we had those since a couple years now) and relevant training data (which are easier to procure/build/clean up with the aid of the early tools).

SkyBelow · on Jan 3, 2025

Every model for how to approach an LLM seems lacking to me. I would suggest anyone using AI heavily to take a weekend and make a simple one to do the handwriting digit recognition. Once you get a feel for basic neural network, then watch a good introduction to alexnet. Then you can think of an LLM as being the next step in the sequence.

>I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.

This isn't correct. It embeds concepts that humans have discussed, but can combine them in ways that were never in the training set. There are issues with this, the more unique the combination of concepts, the more likely the output ends up being unrelated to what the user was wanting to see.

nextaccountic · on Jan 3, 2025

> I wish people would understand what a large language model is. There is no thinking. No comprehension. No decisions.

> Instead, think of your queries as super human friendly SQL.

Ehh this might be true in some abstract mathy sense (like I don't know, you are searching in latent space or something), but it's not the best analogy in practice. LLMs process language and simulate logical reasoning (albeit imperfectly). LLMs are like language calculators, like a TI-86 but for English/Python/etc, and sufficiently powerful language skills will also give some reasoning skills for free. (It can also recall data from the training set so this is where the SQL analogy shines I guess)

You could say that SQL also simulates reasoning (it is equivalent to Datalog after all) but LLMs can reason about stuff more powerful than first order logic. (LLMs are also fatally flawed in the sense it can't guarantee correct results, unlike SQL or Datalog or Prolog, but just like us humans)

Also, LLMs can certainly make decisions, such as the decision to search the web. But this isn't very interesting - a thermostat makes the decision of whether turn air refrigeration on or off, for example, and an operating system makes the decision of which program to schedule next on the CPU.

weakfish · on Jan 4, 2025

I don’t understand the axiom that language skills give reasoning for free, can you expand? That seems like a logical leap to me