
> Plain Unicode, however, doesn't really work well with neural networks.

That is not true. See ByT5, for example.

> As an illustration, let's take the word "PostgreSQL". If we were to encode it (convert to an array of numbers) using Unicode, we would get 10 numbers that could potentially be from 1 to 149186. It means that our neural network would need to store a matrix with 149186 rows in it and perform a number of calculations on 10 rows from this matrix.

What the author calls an alphabet here is typically called a vocabulary. And you can just use UTF-8 bytes as your vocabulary, so you end up with 256 tokens, not 149186. That is what ByT5 does.
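To make the counting concrete, here's a quick Python sketch: encoding "PostgreSQL" as UTF-8 yields 10 byte values, each in 0-255, so a byte-level model's embedding matrix needs only 256 rows rather than one per Unicode code point.

```python
# Byte-level "tokenization": the UTF-8 bytes themselves are the vocabulary.
text = "PostgreSQL"
byte_ids = list(text.encode("utf-8"))

print(byte_ids)       # [80, 111, 115, 116, 103, 114, 101, 83, 81, 76]
print(len(byte_ids))  # 10 positions, same as the character count for ASCII
assert all(0 <= b < 256 for b in byte_ids)  # vocabulary size is just 256
```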



The point isn't that it doesn't work at all, but that it doesn't work as well as other approaches we have, as evidenced by the fact that all the best-performing models on the market use tokenization. It's no secret that tokenization is fundamentally a hack, and that ideally we'll get rid of it eventually one way or another (https://twitter.com/karpathy/status/1657949234535211009). In principle, you can compensate for the deficiencies of byte-level tokenization with larger models and larger contexts. But in practice that means a model with the same level of intelligence takes a lot more resources to train, which is why we aren't seeing more byte-level models. (There are specific tasks, like counting characters in a word, where tokenization is actively detrimental to intelligence, but those are the exception.)
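A back-of-the-envelope sketch of that resource cost (Python; the ~4 bytes-per-token figure is a rough rule of thumb for English BPE tokenizers, an assumption here, not a measured constant): byte-level input means more sequence positions for the same text, and self-attention cost grows roughly quadratically with sequence length.

```python
text = "Tokenization is fundamentally a hack, but it buys shorter sequences."

n_bytes = len(text.encode("utf-8"))   # byte-level sequence length
BYTES_PER_TOKEN = 4                   # rough rule of thumb for English BPE (assumption)
n_tokens = max(1, round(n_bytes / BYTES_PER_TOKEN))

length_ratio = n_bytes / n_tokens     # ~4x more positions at byte level
attention_ratio = length_ratio ** 2   # self-attention FLOPs scale ~O(n^2)

print(f"bytes: {n_bytes}, estimated BPE tokens: {n_tokens}")
print(f"~{length_ratio:.1f}x longer sequence, ~{attention_ratio:.1f}x attention cost")
```

The quadratic factor is why "just use bytes and a bigger context" is expensive in practice, even though nothing about it is impossible.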




