
frotaur · 2026-03-22T12:50:57 1774183857

Afaik the experts are not usually very interpretable, and I would generally be surprised if at least one did not change on every token. I don't know what happens in practice, but I do know that, at least during training, nothing is done to minimize the number of expert switches between tokens.

etiam · 2026-03-22T21:14:18 1774214058

I'd have thought there would be at least a tiny explicit penalty term for switching, to discourage messing around with the composition without any expected gains from it.

If one is to use these on hardware that can't keep everything loaded, I guess someone should examine how it works out in practice. Interpretability may be too much to ask, but I can't spontaneously see any reason why the experts can't at least be pushed to incorporate what's needed to remain the good choice for a longer segment.

zozbot234 · 2026-03-22T21:27:04 1774214824

The switching is done by layer, not just per token. Every layer loads completely different parameters, so you don't really benefit from continuity. You're generally better off shifting this work to the CPU: CPU RAM is more abundant than the GPU's VRAM, so it matters less that so much of it is "wasted" on inactive expert layers. Disk storage is relatively even more abundant, so offloading experts to disk if you can't keep them in RAM (as OP does) is the next step.
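
To make the per-layer point concrete, here is a minimal sketch of how one might count expert switches between consecutive tokens, layer by layer. Everything in it is an assumption for illustration: the router scores are uniformly random stand-ins, and `route`, `switch_rate`, and the sizes used are hypothetical names and values, not taken from any real model or library.

```python
import random

def top_k(scores, k):
    """Return the indices of the k largest scores as a frozenset."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return frozenset(ranked[:k])

def route(num_tokens, num_layers, num_experts, k, seed=0):
    """Pick k experts per (token, layer) from random stand-in router scores."""
    rng = random.Random(seed)
    return [
        [top_k([rng.random() for _ in range(num_experts)], k)
         for _ in range(num_layers)]
        for _ in range(num_tokens)
    ]

def switch_rate(routing):
    """Fraction of (consecutive token pair, layer) slots whose expert set changed."""
    changed = total = 0
    for prev, cur in zip(routing, routing[1:]):
        for prev_set, cur_set in zip(prev, cur):
            total += 1
            changed += prev_set != cur_set
    return changed / total if total else 0.0

# Hypothetical sizes, chosen only to keep the example small.
routing = route(num_tokens=64, num_layers=8, num_experts=16, k=2)
print(f"switch rate: {switch_rate(routing):.2f}")
```

With random scores the rate comes out close to 1, which is a property of the stand-in router rather than a measurement. Running the same count on routing decisions logged from an actual model would be the way to see how much continuity there is in practice.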

frotaur · 2026-03-13T21:33:20 1773437600

Nitpick, but the volume increases cubically (it scales with volume), not exponentially.

observationist · 2026-03-13T21:39:10 1773437950

Thank you, I'll correct that. I was thinking inverse square law, then instead of asking an AI like a good nerd, I just winged it.

setsewerd · 2026-03-13T23:29:37 1773444577

Some might say you're a purist in that regard

Side note, would positing an argument online without doing an AI fact check first be considered rawdogging your answer?

It seems fitting.

frotaur · 2026-03-13T09:51:42 1773395502

Personally, I defined <dtf> as 'don't touch files' in the general claude.md, with the explanation that when it is present in the query, the model should not edit anything and should just answer questions.

It has worked pretty well so far: when I include <dtf> in the query, the model has never run around modifying things.

frotaur · 2026-03-09T11:01:29 1773054089

One important operation I've noticed in the examples that do end up with abiogenesis is a 'copy' operation. In the bf version they use in the paper, one head can copy the byte under it to the location of the other head, which makes it quite easy to make a self-replicator: essentially, just loop on the copy operation and move both heads (5 instructions). You could try adding the 'copy' operation to your setup and see if that helps!
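
A minimal sketch of that idea, on a toy two-head tape where the program lives on the tape it modifies. The instruction symbols and the `run` and `find_match` helpers are assumptions made for illustration and are not claimed to match the paper's exact instruction set.

```python
def find_match(tape, ip, step):
    """Scan from a bracket at ip in direction step; return its partner's index or None."""
    if step > 0:
        open_ch, close_ch = ord("["), ord("]")
    else:
        open_ch, close_ch = ord("]"), ord("[")
    depth = 0
    while 0 <= ip < len(tape):
        if tape[ip] == open_ch:
            depth += 1
        elif tape[ip] == close_ch:
            depth -= 1
            if depth == 0:
                return ip
        ip += step
    return None

def run(tape, head0=0, head1=0, max_steps=1000):
    """Run a toy two-head, self-modifying tape machine in place.

    Hypothetical instruction set (an assumption for this sketch):
      >  move head0 right      }  move head1 right
      .  copy the byte under head0 to head1's position
      [  jump to the matching ] if the byte under head0 is zero
      ]  jump back to the matching [ if the byte under head0 is nonzero
    Any other byte is a no-op.
    """
    n = len(tape)
    ip = 0
    for _ in range(max_steps):
        if ip >= n:
            break
        op = chr(tape[ip])
        if op == ">":
            head0 = (head0 + 1) % n
        elif op == "}":
            head1 = (head1 + 1) % n
        elif op == ".":
            tape[head1] = tape[head0]
        elif op == "[" and tape[head0] == 0:
            ip = find_match(tape, ip, +1)
        elif op == "]" and tape[head0] != 0:
            ip = find_match(tape, ip, -1)
        if ip is None:  # unmatched bracket: halt
            break
        ip += 1
    return tape

# Five-instruction replicator: loop { copy; advance head0; advance head1 }.
tape = bytearray(16)
tape[:5] = b"[.>}]"
run(tape, head0=0, head1=8)
print(bytes(tape[8:13]))
```

The loop stops once head0 reaches the first zero byte after the program, so what gets copied is exactly the five program bytes, written at the position where head1 started.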

frotaur · 2026-03-07T13:57:51 1772891871

Seems Claude is also writing the comments for you?

frotaur · 2026-03-02T22:45:52 1772491552

In a world without work schedules, and more generally without the whole of society's schedule being built around the arbitrary clock time, I agree with you.

bluebarbet · 2026-03-02T22:55:40 1772492140

Sure. But this argument is surely less powerful than it was back in the era of church bells and big clocks on factory walls and so on. We now have electronics that add a whole new layer of abstraction to our schedules, to the point that you can now miss a DST change if you're not paying attention. For many people (I'm one) this change is now just a useless irritation.

lamontcg · 2026-03-02T22:55:45 1772492145

So adjust the work schedule.

If people want more time in the evening, get up earlier and go to work and go home earlier.

You can even shift school/work schedules throughout the year.

dagss · 2026-03-03T07:15:54 1772522154

Changing the time (zone) IS changing the work schedule. That is essentially what a time change IS. In the most expedient way possible.

1718627440 · 2026-03-02T23:01:42 1772492502

The work schedule is adjusting all the time, and it moves in the opposite direction.

frotaur · 2026-02-11T00:39:11 1770770351

I'd venture this article was written by AI, given the density of 'it isn't X, it's Y'.

frotaur · 2026-02-04T01:00:08 1770166808

and : kessler syndrome

frotaur · 2026-01-30T11:44:37 1769773477

I think the website was just down when you tried. Skills should work with most models, they are just textual instructions.

frotaur · 2026-01-23T12:25:13 1769171113

I don't understand, you CAN use claude code through the API.

horsawlarway · 2026-01-23T13:42:48 1769175768

Yeah, but he can't use his $200 subscription for the API.

That's limited to accessing the models through code/desktop/mobile.

And while I'm also using their subscriptions because of the cost savings vs direct access, having the subscription be considerably cheaper than the usage billing rings all sorts of alarm bells that it won't last.