
> Plain Unicode, however, doesn't really work well with neural networks.

That is not true. See ByT5, for example.

> As an illustration, let's take the word "PostgreSQL". If we were to encode it (convert to an array of numbers) using Unicode, we would get 10 numbers that could potentially be from 1 to 149186. It means that our neural network would need to store a matrix with 149186 rows in it and perform a number of calculations on 10 rows from this matrix.

What the author calls an alphabet here is typically called a vocabulary. And you can just use UTF-8 bytes as your vocabulary, so you end up with 256 tokens, not 149186. That is what ByT5 does.
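To make the counting concrete, here's a quick Python sketch: encoding "PostgreSQL" as UTF-8 yields 10 byte values, each in 0-255, so a byte-level model's embedding matrix needs only 256 rows rather than one per Unicode code point.

```python
# Byte-level "tokenization": the UTF-8 bytes themselves are the vocabulary.
text = "PostgreSQL"
byte_ids = list(text.encode("utf-8"))

print(byte_ids)       # [80, 111, 115, 116, 103, 114, 101, 83, 81, 76]
print(len(byte_ids))  # 10 positions, same as the character count for ASCII
assert all(0 <= b < 256 for b in byte_ids)  # vocabulary size is just 256
```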



The point isn't that it doesn't work at all, but that it doesn't work as well as other approaches we have, as evidenced by the fact that all the best-performing models on the market use tokenization. It's no secret that tokenization is fundamentally a hack, and that ideally we'll get rid of it eventually one way or another (https://twitter.com/karpathy/status/1657949234535211009). In principle, you can compensate for the deficiencies of byte-level tokenization with larger models and larger contexts. But in practice that means a model with the same level of intelligence takes a lot more resources to train, which is why we aren't seeing more byte-level models. (There are specific tasks, like counting characters in a word, where tokenization is actively detrimental to intelligence, but those are the exception.)
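A back-of-the-envelope sketch of that resource cost (Python; the ~4 bytes-per-token figure is a rough rule of thumb for English BPE tokenizers, an assumption here, not a measured constant): byte-level input means more sequence positions for the same text, and self-attention cost grows roughly quadratically with sequence length.

```python
text = "Tokenization is fundamentally a hack, but it buys shorter sequences."

n_bytes = len(text.encode("utf-8"))   # byte-level sequence length
BYTES_PER_TOKEN = 4                   # rough rule of thumb for English BPE (assumption)
n_tokens = max(1, round(n_bytes / BYTES_PER_TOKEN))

length_ratio = n_bytes / n_tokens     # ~4x more positions at byte level
attention_ratio = length_ratio ** 2   # self-attention FLOPs scale ~O(n^2)

print(f"bytes: {n_bytes}, estimated BPE tokens: {n_tokens}")
print(f"~{length_ratio:.1f}x longer sequence, ~{attention_ratio:.1f}x attention cost")
```

The quadratic factor is why "just use bytes and a bigger context" is expensive in practice, even though nothing about it is impossible.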




