> Q: Are the releases aligned with pre-training efforts?
> A: There used to be a time not that long ago, maybe half a year, distant past, where the models would align with RL runs or pretraining runs ... now the naming is by capability. GPT5 is a capable model; 5.1 is a more capable model
> I also think it’s important to notice that a lot of these challenges happen with humans too. The concept of prompt injection isn’t that different from social engineering, right? When somebody calls in and says, “Oh, I forgot my password, can you just help me this one time?”
I wonder if the error propagation problem could be solved with a “branching” generator? Basically at every token you fork off N new streams, with some tree pruning policy to avoid exponential blowup. With a bit of bookkeeping you could make an attention mask to support the parallel streams in the same context sharing prefixes. Perhaps that would allow more of an e2e error minimization than the greedy generation algorithm in use today?
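Roughly, the fork-and-prune loop I'm imagining (with a hypothetical next-token-candidates call standing in for the model, returning (token . logprob) pairs best-first, and all parameters made up) admittedly ends up looking a lot like beam search:

;; Sketch only. Each stream is (tokens . cumulative-logprob); pruning keeps
;; the MAX-STREAMS best streams so the pool can't blow up exponentially.
(defun branching-generate (prefix steps &key (fanout 3) (max-streams 16))
  (let ((streams (list (cons prefix 0.0))))
    (dotimes (i steps streams)
      (let* ((forked
               (loop for (tokens . score) in streams
                     for candidates = (next-token-candidates tokens) ; hypothetical model call
                     append (loop for (tok . logprob)
                                    in (subseq candidates
                                               0 (min fanout (length candidates)))
                                  collect (cons (append tokens (list tok))
                                                (+ score logprob)))))
             (n-forked (length forked)))
        ;; pruning policy: keep only the best cumulative scores
        (setf streams (subseq (sort forked #'> :key #'cdr)
                              0 (min max-streams n-forked)))))))

The attention-mask bookkeeping that would let the forked streams share the prefix in one context is the part not shown here.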
Having the data structures is nice and all, but using them is kind of painful. They are certainly second class.
Having to use accessor functions or destructuring macros instead of just a period or -> is often annoying too. The lack of syntax has cons as well as pros.
Writing a reader macro that allows for something like...
[some-numbers 0]
...to get the first (many programming languages make this mistake, using 0 to refer to the first element of a collection, so we can forgive CL for this) element. But I'm curious how you can write...
(object -> slot)
...without getting an error about OBJECT not being a valid function or macro.
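For reference, the bracket half can be as small as this (the ELT expansion and the names are just my illustration):

;; Make #\] a terminating macro character, like #\).
(set-macro-character #\] (get-macro-character #\) nil))
;; Read [coll idx] as (elt coll idx); a minimal sketch, not a robust implementation.
(set-macro-character #\[
  (lambda (stream char)
    (declare (ignore char))
    (let ((forms (read-delimited-list #\] stream t)))
      (list 'elt (first forms) (second forms)))))

After that, [some-numbers 0] reads as (elt some-numbers 0).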
The Lisp 1.5 Programmer's Manual, dated 1962, already describes a 0-based array feature. Lisp was clearly one of the historic instigators of zero-based arrays, rather than just playing along.
Yes, but the various Lisps that Common Lisp is the more-or-less common subset of are (were?) all 0-indexed. Between easy heap implementation (left is (ash index 1), right is (1+ (ash index 1)), parent is (ash index -1)) and easy last-element selection ((nth (length seq) seq)) I prefer 1-indexing, but I realize that's an unpopular opinion.
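Spelled out next to the 0-indexed equivalents (my own comparison, with CL-style names):

;; 1-indexed heap navigation, as above:
(defun heap-left   (i) (ash i 1))        ; 2i
(defun heap-right  (i) (1+ (ash i 1)))   ; 2i + 1
(defun heap-parent (i) (ash i -1))       ; floor(i / 2)
;; The 0-indexed versions each need an extra offset:
;; left = 2i + 1, right = 2i + 2, parent = floor((i - 1) / 2)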
A late reply, but it's worth showing one way of doing this. First, your concern about object not being a valid function or macro isn't relevant at read time. Second, note that Lisp already has similar syntax: '(1 . 2) is essentially (cons 1 2). Implementing this kind of syntax is not a privilege of the implementation alone. You're allowed to redefine your own reader for the left paren. In SBCL:
You can write `(set-macro-character #\( 'sb-impl::read-list)` and everything continues to work just fine. You can also jump-to-source and modify it if you want -- though it's cleaner to just copy it out to your own project; that's what I did for a quick hack/proof of concept. Essentially, I added the following before the existing (when...) that handles the special dot syntax:
(when (and (eq firstchar #\-)
           (eq (peek-char t stream t nil t) #\>))
  (read-char stream t) ; actually read the nextchar > to discard it
  (let ((next-obj (read stream)))
    (sb-impl::flush-whitespace stream rt)
    (return `(slot-value ,@listtail ',next-obj))))
I won't claim this is good or proper, but it shows that it's quite feasible. We've turned (foo -> bar) into (slot-value foo 'bar).
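With that change loaded, a quick illustration (class and slot names made up):

(defclass point () ((x :initform 1) (y :initform 2)))
(let ((p (make-instance 'point)))
  (p -> x))  ; reads as (slot-value p 'x) => 1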
Personally I wouldn't use this even if it was more properly/carefully implemented. (There's really no reason to replace the default left-paren reader, and no reason we have to have a space surrounding the "->". One thing I like about the infix reader macro package https://github.com/quil-lang/cmu-infix is that it doesn't care about spaces, I can write #I(1+1 + 4) and get 6.) I'm quite happy putting my class in its own package, and thus getting the primary tab-completion behavior I care about. e.g. "(ma:<tab>" could complete to "(math:" and then "(math:v<tab>" could complete to a list of options like "vector-x" "vector-y" or so on. I also like the somewhat unusual approach of naming my accessors with a dot prefix, e.g. (.x vec) and (.y vec), or even (math:.x vec) if I haven't imported the symbol.
Sparse attention essentially combines 3 types of attention optimizations:
1. Compression of sequential blocks of key/value vectors into coarse block-level representations, so each query attends over far fewer entries than the full KV cache
2. Selectively computing uncompressed attention on a subset of tokens based on the compressed blocks with the highest attention scores
3. Using a sliding window for local attention at full resolution (a rough sketch of how these branches pick positions follows below)
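Here is a toy paraphrase of the selection logic, not code from the paper (block size, top-k, and window length are made-up parameters), showing which positions a single query would attend to at full resolution:

(defun nsa-selected-positions (query-pos block-scores
                               &key (block-size 4) (top-k 2) (window 8))
  "BLOCK-SCORES holds one attention score per compressed KV block.
Returns the sorted token positions <= QUERY-POS that get full-resolution attention."
  (let* ((n-blocks (length block-scores))
         (indexed (loop for score in block-scores
                        for i from 0
                        collect (cons i score)))
         ;; branch 2: the blocks whose compressed score ranks in the top K
         (chosen (mapcar #'car (subseq (sort indexed #'> :key #'cdr)
                                       0 (min top-k n-blocks))))
         (positions '()))
    (dolist (b chosen)
      (loop for p from (* b block-size) below (* (1+ b) block-size)
            when (<= p query-pos) do (push p positions)))
    ;; branch 3: sliding window over the most recent tokens
    (loop for p from (max 0 (- query-pos window -1)) to query-pos
          do (push p positions))
    (sort (remove-duplicates positions) #'<)))

This only covers which positions branches 2 and 3 touch; the compressed branch itself and how the three outputs get combined are not shown.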
> Both Full Attention and sparse attention models are pretrained on 270B tokens of 8k-length texts, followed by continued training and supervised fine-tuning on 32k-length texts with YaRN to achieve long-context adaptation. Both models are trained to full convergence to ensure fair comparison.
> our experiments adopt a backbone combining Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE), featuring 27B total parameters with 3B active parameters
Evaluated on MMLU, MMLU-PRO, CMMLU, BBH, GSM8K, MATH, DROP, MBPP, and HumanEval. NSA outperforms full attention on 7/9.
Beats out H2O, InfLLM, Quest, Exact-Top, and full attention on LongBench
Perfect retrieval on 64k needle-in-a-haystack
The CoT eval is less convincing, but NSA outperforms full attention on AIME24.
Training speed of 2-9x vs. FlashAttention
Decoding speedup of 4-12x vs. full attention ["expected"? Didn't see comparison to other attention mechanisms]
Great to see this is alive and progressing! I believe Ohm started life in Alan Kay’s research group, to build a graphical OS and office suite in 10k lines of code. I found this talk immensely inspiring https://m.youtube.com/watch?v=ubaX1Smg6pY
Very close! Alex Warth created OMeta (https://en.wikipedia.org/wiki/OMeta) as part of the STEPS project. Ohm was designed as a kind of successor to OMeta, but was created after STEPS.
My takeaway was that autocomplete, boilerplate, and one-off scripts are the main use cases. To use an analogy, I think the code assistants are more like an upgrade from handsaw to power tools and less like hiring a carpenter. (Which is not what the hype engine will claim).
For me, only the one-off script (write-only code) use-case is useful. I've had the best results on this with Claude.
Emacs abbrevs/snippets (+ choice of language) virtually eliminate the boilerplate problem, so I don't have a use for assistants there.
For autocomplete, I find that LSP completion engines provide 95% of the value for 1% of the latency. Physically typing the code is a small % of my time/energy, so the value is more about getting the right names, argument order, and other fiddly details I may not remember exactly. But I find that LSP-powered autocomplete and tooltips largely solve those challenges.
> like an upgrade from handsaw to power tools and less like hiring a carpenter. (Which is not what the hype engine will claim).
I 100% agree with the not-hiring-a-carpenter part, but we need a better way to describe the improvement over just a handsaw. If you have domain knowledge, it can become an incredible design aid/partner. Here is a real-world example of how it is changing things for me.
I have a TreeTable component which I built 100% with an LLM, and when I need to update it, I just follow the instructions in this chat:
I'm thoroughly impressed as it suggested data structures and more for me to think about. And here I am asking it to review what was discussed to make the information easier to understand.
All of this cost me less than a penny. I'm still waiting for my Anthropic API limit to reset and I'm going to ask Sonnet for feedback as well, and I figure that will cost me 5 cents.
I fully understand the not hiring a carpenter part, but I think what LLMs bring to the table is SO MUCH more than an upgrade to a power tool. If you know what you need and can clearly articulate it well enough, there really is no limit to what you can build with proper instructions, provided the solution is in its training data and you have a good enough BS detector.
> If you know what you need and can clearly articulate it well enough, there really is no limit to what you can build with proper instructions, provided the solution is in its training data and you have a good enough BS detector.
In other words: you must already know how to do what you are asking the LLM to do.
In other words: it may make sense if typing speed is your bottleneck and you are dealing with repetitive tasks that have been solved well many times (i.e., you want an advanced autocomplete).
This basically makes it useless for me. Typing speed is not a bottleneck, I automate or abstract away repetition, and I seek novel tasks that have not yet been well solved—or I just reuse those existing solutions (maybe even contributing to respective OSS projects).
In the cases where something new was needed in areas I don’t know well, it completely failed me. NB: I never actually used it myself; I only gave in to a suggestion by a friend (whom LLMs reportedly help) to use his LLM-wrangling skills in a thorny case.
> In other words: you must already know how to do what you are asking the LLM to do.
Those who will benefit the most will be senior developers. They might not know the exact problem or language, but they should know enough to guide the LLM.
> In other words: it may make sense if typing speed is your bottleneck and you are dealing with repetitive tasks that have been solved well many times (i.e., you want an advanced autocomplete).
I definitely use an LLM as a typist and I love it. I've come to a point now where I mentally ask myself, "Will it take more time to do it myself or to explain it?" Another factor is cost, as you can rack up a bill pretty quickly with Claude Sonnet if you ask it to generate a lot of code.
But honestly, what I love about integrating an LLM into my workflow is that I'm better able to capture and summarize my thought process. I've also found LLMs can better articulate my thoughts most of the time. If you know how to prompt an LLM, it almost feels like you are working with a knowledgeable colleague.
> I never actually used it myself; I only gave in to a suggestion by a friend (whom LLMs reportedly help) to use his LLM-wrangling skills in a thorny case.
LLMs are definitely not for everyone, but I personally cannot see myself coding without LLMs now. Just asking for variable name suggestions is pretty useful. Or describing something vague and having it properly articulate my thoughts is amazing. I think we like to believe what we do is rather unique, but I think a lot of things that we need to do have already been done. Whether it is in the training data is another thing, though.
> They might not know the exact problem or language, but they should know enough to guide the LLM.
I was in this exact situation. I was working in an unfamiliar area with a hardware SDK in C that I needed to rewrite for my runtime, or at least call its C functions from my runtime, or at least understand how the poorly written (but working) example SDK invocation works in C by commenting it. The LLMs failed to help with any of that; they produced code that was 1) incorrect (literally doing the opposite of what’s expected) and 2) full of obvious comments and missing implementations (like a “cleanup if needed” comment in the empty deinit function).
Later it turned out there is actually an SDK for my runtime; I just failed to find it at first, so the code the LLM could have used or pointed me to actually existed (it just wasn't very easy to find).
Those were two top LLMs as of December 2024. It left me unimpressed.
I don’t think I would be compelled to guide them; once I understood how the code worked, it was faster to just write it or read the relevant reference.
My friend, who volunteered to waste those precious tokens to help with my project, does use chatbots a lot while coding, but he’s more of an intermediate than a senior developer.
> Just asking for variable name suggestions is pretty useful.
I can’t see myself asking anyone, much less an LLM, for the name of a variable. I am known to ask about and/or look up, say, subject domain terminology that I then use when naming things, but to name things well you first need to have a full picture of what you are making. Our job is to have one…
I think you make a very good point about your existing devenv. I recently turned off GitHub Copilot after maybe 2 years of use — I didn’t realize how often I was using its completions over LSPs.
Quality of Life went up massively. LSPs and nvim-cmp have come a long way (although one of these days I’ll try blink.cmp)
https://youtu.be/3K-R4yVjJfU?si=JdVyYOlxUbEcvEEo&t=2624