Afaik the experts are not usually very interpretable, and I would generally be surprised if at least one did not change on every token. I don't know what happens in practice, but I know that, at least during training, nothing is done to minimize the number of expert switches between tokens.
I'd have thought there would be at least a tiny explicit penalty term for switching, to discourage messing around with the composition without any expected gains from it.
If one is to use these on hardware that can't keep everything loaded, I guess someone should examine how it works out in practice. Interpretability may be too much to ask, but I can't offhand see any reason why the experts can't at least be pushed to incorporate what's needed to remain the good choice for a longer segment.
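To make the idea of a switching penalty concrete, here is a minimal sketch in plain Python. It is not taken from any real MoE training code: `switch_count`, `switch_penalty` and the weighting factor mentioned afterwards are hypothetical names, and the routing probabilities are made-up numbers. The first function counts how many experts would have to be newly loaded between consecutive tokens; the second is a smooth surrogate (mean total-variation distance between consecutive routing distributions) of the kind one could imagine adding to a loss.

```python
def top_k(probs, k):
    """Indices of the k largest routing probabilities for one token."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def switch_count(routing, k=2):
    """Experts that are active for a token but were not for the previous one."""
    sets = [top_k(p, k) for p in routing]
    return sum(len(cur - prev) for prev, cur in zip(sets, sets[1:]))


def switch_penalty(routing):
    """Mean total-variation distance between consecutive routing distributions.

    Hypothetical surrogate for "how much the routing moved" between tokens.
    """
    if len(routing) < 2:
        return 0.0
    tv = [0.5 * sum(abs(a - b) for a, b in zip(p, q))
          for p, q in zip(routing, routing[1:])]
    return sum(tv) / len(tv)


# Made-up router outputs for three tokens over three experts.
routing = [[0.7, 0.2, 0.1],
           [0.6, 0.3, 0.1],
           [0.1, 0.2, 0.7]]
print(switch_count(routing, k=1), switch_penalty(routing))
```

In a real setup this would presumably be written on tensors and added to the task loss with a small weight (something like `loss = task_loss + lam * penalty`, where `lam` is an assumed hyperparameter); whether that helps or hurts quality is exactly the open question.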
The switching is done per layer, not just per token. Every layer loads completely different parameters, so you don't really benefit from continuity. You're generally better off shifting this work to the CPU, since CPU RAM is more abundant than GPU VRAM, so it matters less that so much of it is "wasted" on inactive expert layers. Disk storage is relatively even more abundant, so offloading experts to disk if you can't keep them in RAM (as OP does) is the next step.
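A toy simulation of the per-layer point, under loudly stated assumptions: expert weights are cached with a simple LRU policy keyed by `(layer, expert)`, every layer has its own expert weights, and each token picks a single expert id that is reused at every layer. `count_loads` and `make_trace` are hypothetical helpers, not part of any inference framework.

```python
from collections import OrderedDict


def count_loads(trace, capacity):
    """Count cache misses (weight loads) for an LRU cache of expert weights."""
    cache = OrderedDict()
    loads = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)         # mark as most recently used
        else:
            loads += 1                     # weights must be fetched
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return loads


def make_trace(choices, n_layers):
    """Access order: for each token, visit every layer with that token's expert."""
    return [(layer, expert) for expert in choices for layer in range(n_layers)]


# Three tokens that all pick expert 0, in a model with 4 layers.
trace = make_trace([0, 0, 0], n_layers=4)
print(count_loads(trace, capacity=2))  # cache smaller than the layer count
print(count_loads(trace, capacity=4))  # cache holds one expert per layer
```

In this sketch, even perfectly stable routing gives no reuse when the cache holds fewer entries than there are layers, because each layer's expert is a different set of weights. Continuity across tokens only starts to pay off once at least one expert per layer fits in fast memory.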
Personally, I defined <dtf> as 'don't touch files' in the general claude.md, with the explanation that when this is present in the query, it means to not edit anything, just answer questions.
It has worked pretty well so far: when I include <dtf> in the query, the model never runs around modifying things.
One important operation I've noticed in the examples that do end up with abiogenesis is having a 'copy' operation. In the bf version they use in the paper, one head can copy the byte under it to the location of the other head. Which makes it quite easy to make a self-replicator: just loop on the copy operation and move both heads, essentially (5 instructions). You could try adding the 'copy' operation to your setup and see if that helps!
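A rough sketch of what such a five-instruction replicator could look like. The instruction set here is a toy one made up for illustration, not the paper's exact encoding: `>`/`<` move head 0, `}`/`{` move head 1, `.` copies the byte under head 0 to head 1's position, and `[`/`]` loop while the byte under head 0 is non-zero. Code and data share one tape.

```python
def _match(tape, pc, step):
    """Find the bracket matching tape[pc], scanning in direction `step`."""
    opener, closer = (b"[]" if step > 0 else b"][")
    depth = 0
    i = pc
    while 0 <= i < len(tape):
        if tape[i] == opener:
            depth += 1
        elif tape[i] == closer:
            depth -= 1
            if depth == 0:
                return i
        i += step
    return len(tape)  # unmatched bracket: jump past the end, which halts


def run(tape, head0=0, head1=0, max_steps=10_000):
    """Run a toy two-head, self-modifying BF-like tape in place."""
    pc, n = 0, len(tape)
    for _ in range(max_steps):
        if pc >= n:
            break
        op = chr(tape[pc])
        if op == ">":
            head0 = (head0 + 1) % n
        elif op == "<":
            head0 = (head0 - 1) % n
        elif op == "}":
            head1 = (head1 + 1) % n
        elif op == "{":
            head1 = (head1 - 1) % n
        elif op == ".":
            tape[head1] = tape[head0]  # the 'copy' operation
        elif op == "[" and tape[head0] == 0:
            pc = _match(tape, pc, +1)  # skip the loop
        elif op == "]" and tape[head0] != 0:
            pc = _match(tape, pc, -1)  # repeat the loop
        pc += 1
    return tape


program = b"[.>}]"                         # loop: copy, advance both heads
tape = bytearray(program) + bytearray(11)  # program followed by empty space
run(tape, head0=0, head1=8)
print(bytes(tape[8:13]))                   # b'[.>}]'
```

The loop stops when head 0 reaches the first zero byte after the program, by which point the program has been written out again at head 1's starting offset.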
Sure. But this argument is surely less powerful than it was back in the era of church bells and big clocks on factory walls and so on. We now have electronics that add a whole new layer of abstraction to our schedules, to the point that you can now miss a DST change if you're not paying attention. For many people (I'm one) this change is now just a useless irritation.
Yeah, but he can't use his $200 subscription for the API.
That's limited to accessing the models through code/desktop/mobile.
And while I'm also using their subscriptions because of the cost savings vs direct access, having the subscription be considerably cheaper than the usage billing rings all sorts of alarm bells that it won't last.