I think it would be hard to make a solid argument that AR or non-AR is strictly better wrt full-sequence error rates, whether or not we place constraints on compute, memory, etc. I'd guess there's some complexity intrinsic to any particular distribution of sequences which requires spending at least some amount of compute to get full-sequence generation error below some epsilon. I'd also guess that AR and non-AR models could both achieve this bound in principle, though maybe it's practically harder with one or the other. It would be interesting to formally characterize this sort of complexity, but that's above my analytical pay grade.
The hash function example is interesting. I think the model could compute y prior to outputting any tokens and then emit the `hash(y), y` sequence deterministically. In architectures like transformers, the compute from earlier steps can be reused in later steps via attention, so the model wouldn't need to recompute y at each step as long as it commits to a given y up front, before it starts generating hash(y).
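A toy sketch of what I mean, purely my own illustration (sha256 and the hardcoded y are stand-ins for whatever hash and whatever internal sampling the example assumes): the generator commits to y once, computes hash(y) once, and then every "step" just reads off the already-decided sequence.

```python
import hashlib

def sample_y() -> str:
    # Stand-in for however the model internally settles on y before
    # emitting anything; hardcoded here so the sketch runs.
    return "hello world"

def generate_hash_then_y() -> str:
    y = sample_y()                              # commit to y up front
    h = hashlib.sha256(y.encode()).hexdigest()  # compute hash(y) once
    plan = h + " " + y                          # full sequence to emit
    out = ""
    for ch in plan:   # each autoregressive "step" just reads off the plan;
        out += ch     # y is never recomputed
    return out

print(generate_hash_then_y())
```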
Ah, yeah, I guess that probably is true of transformers in practice. I was thinking of something that strictly takes in a sequence of tokens and outputs a (possibly one-hot) probability distribution over all possible next tokens. Such a thing running autoregressively would have to recompute y at each step. But if intermediate computations are cached, as they are in transformers, then this isn't necessary.
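Here's a toy contrast of the two cases, purely illustrative (again sha256 and the hardcoded y are stand-ins): the first function is the "strict" stateless next-token map and redoes all the up-front work from the prefix on every call; the second carries a cache, loosely analogous to reusing earlier-step computation via attention.

```python
import hashlib

Y = "hello world"  # stand-in for whatever y the model would choose

def stateless_next_token(prefix: str) -> str:
    # Strict formulation: nothing persists between calls, so the work of
    # reconstructing y and hash(y) is redone from scratch at every step.
    target = hashlib.sha256(Y.encode()).hexdigest() + " " + Y
    return target[len(prefix)]

def cached_next_token(prefix: str, cache: dict) -> str:
    # With cached intermediate computation (loosely analogous to a
    # transformer reusing earlier activations), y and hash(y) are
    # computed once and reused on all later steps.
    if "target" not in cache:
        cache["target"] = hashlib.sha256(Y.encode()).hexdigest() + " " + Y
    return cache["target"][len(prefix)]

# Both decoders emit the same hash(y), y sequence; they differ only in
# whether each per-step call has to redo the up-front work.
out, cache = "", {}
for _ in range(64 + 1 + len(Y)):  # hexdigest length + space + y
    out += cached_next_token(out, cache)
print(out)
```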