I think it would be hard to make a solid argument that AR or non-AR is strictly better wrt full-sequence error rates, whether or not we place constraints on compute, memory, etc. I'd guess there's some complexity intrinsic to any particular distribution of sequences which requires spending at least some amount of compute to get full-sequence generation error below some epsilon. I'd also guess that AR and non-AR models could both achieve this bound in principle, though maybe it's practically harder with one or the other. It would be interesting to formally characterize this sort of complexity, but that's above my analytical pay grade.
The hash function example is interesting. I think the model could compute y prior to outputting any tokens and then emit the `hash(y), y` sequence deterministically. In architectures like transformers, the compute from earlier steps can be reused in later steps via attention, so the model wouldn't need to recompute y at each step as long as it commits to a given y up front, before it starts generating hash(y).
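A toy sketch of what I mean, purely my own illustration (sha256 and the hardcoded y are stand-ins for whatever hash and whatever internal sampling the example assumes): the generator commits to y once, computes hash(y) once, and then every "step" just reads off the already-decided sequence.

```python
import hashlib

def sample_y() -> str:
    # Stand-in for however the model internally settles on y before
    # emitting anything; hardcoded here so the sketch runs.
    return "hello world"

def generate_hash_then_y() -> str:
    y = sample_y()                              # commit to y up front
    h = hashlib.sha256(y.encode()).hexdigest()  # compute hash(y) once
    plan = h + " " + y                          # full sequence to emit
    out = ""
    for ch in plan:   # each autoregressive "step" just reads off the plan;
        out += ch     # y is never recomputed
    return out

print(generate_hash_then_y())
```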
Ah, yeah, I guess that probably is true of transformers in practice. I was thinking of something that strictly takes in a sequence of tokens and outputs a (possibly one-hot) probability distribution over all possible next tokens. Such a thing running autoregressively would have to recompute y at each step. But if intermediate computations are cached, as they are in transformers, then this isn't necessary.
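Here's a toy contrast of the two cases, purely illustrative (again sha256 and the hardcoded y are stand-ins): the first function is the "strict" stateless next-token map and redoes all the up-front work from the prefix on every call; the second carries a cache, loosely analogous to reusing earlier-step computation via attention.

```python
import hashlib

Y = "hello world"  # stand-in for whatever y the model would choose

def stateless_next_token(prefix: str) -> str:
    # Strict formulation: nothing persists between calls, so the work of
    # reconstructing y and hash(y) is redone from scratch at every step.
    target = hashlib.sha256(Y.encode()).hexdigest() + " " + Y
    return target[len(prefix)]

def cached_next_token(prefix: str, cache: dict) -> str:
    # With cached intermediate computation (loosely analogous to a
    # transformer reusing earlier activations), y and hash(y) are
    # computed once and reused on all later steps.
    if "target" not in cache:
        cache["target"] = hashlib.sha256(Y.encode()).hexdigest() + " " + Y
    return cache["target"][len(prefix)]

# Both decoders emit the same hash(y), y sequence; they differ only in
# whether each per-step call has to redo the up-front work.
out, cache = "", {}
for _ in range(64 + 1 + len(Y)):  # hexdigest length + space + y
    out += cached_next_token(out, cache)
print(out)
```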