
Interesting to consider whether this limitation of BPE points to a more fundamental issue with the model. Does GPT-3 "fail" when BPE is replaced with the conventional English alphabet as input symbols (for various definitions of "fail")?

If so, wouldn't this be evidence that the model is using its mind-blowingly large latent space to memorize surface patterns that bear no real relationship to the underlying language (as most people suspect)?

I suppose this comes back to my broader question about Transformer models in general: the use of a very large attention window of BPE tokens.

When I finish reading a paragraph, I can probably use my own words to explain it. But there's no chance I could even try to recreate the sentences using the exact words I just read. So I doubt our brains are keeping some running stack of the last XXXX words, or even some smaller distributed representation thereof.

It's more plausible that we're using some kind of natural hierarchical compression/comprehension mechanism that operates at the character/word/sentence/paragraph level.

It certainly feels like GPT-3 is using a huge parameter space to bypass this mechanism and simply learn a "reconstitutable" representation.

Either way, I'd be really interested to see how it handles character-level input symbols.



> Does GPT-3 "fail" when BPE is replaced with the conventional English alphabet as input symbols (for various definitions of "fail")?

The attention mechanism is quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples to apples comparison.
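To put rough numbers on it (a back-of-the-envelope sketch only; the ~4 characters per BPE token is an assumed average, and 2048 is GPT-3's token window):

    # Back-of-the-envelope: attention compares every position with every other,
    # so cost grows with the square of the sequence length. The 4 chars/token
    # ratio below is an illustrative assumption, not a measured value.
    context_tokens = 2048              # GPT-3's context window, in BPE tokens
    chars_per_token = 4                # assumed average characters per token

    bpe_pairs = context_tokens ** 2
    char_pairs = (context_tokens * chars_per_token) ** 2

    print(f"BPE attention pairs:       {bpe_pairs:,}")    # 4,194,304
    print(f"character attention pairs: {char_pairs:,}")   # 67,108,864
    print(f"blow-up factor:            {char_pairs // bpe_pairs}x")  # 16x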

> When I finish reading a paragraph, I can probably use my own words to explain it. But there's no chance I could even try to recreate the sentences using the exact words I just read.

Sure you could: you could look it up and copy it, which is an ability GPT-3 also needs to model if it's to successfully learn from the internet, where people do that all the time. :)

> the use of a very large attention window of BPE

You're also able to remember chunks from not long before. You just don't remember all of them. I'm sure people working on transformers would _prefer_ to not have it remember everything for a window (and instead spend those resource costs elsewhere), but it's necessary that the attention mechanism be differentiable for training, and that excludes obvious constructions. (E.g. you can't just bolt a nearest-key->value database on the side and simply expect it to learn to use it).
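If it helps, here's a minimal NumPy sketch of that distinction (my own toy illustration, nothing more): soft attention mixes all values with a softmax, so gradients flow through every query-key pair (which is also where the quadratic cost comes from), whereas a hard nearest-key lookup goes through an argmax that gradients can't pass through.

    import numpy as np

    def soft_attention(Q, K, V):
        # Scaled dot-product attention: every query gets a softmax-weighted
        # blend of ALL values. Fully differentiable, but the score matrix is
        # n x n -- the source of the quadratic cost.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                        # (n, n)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
        return weights @ V                                   # (n, d)

    def hard_lookup(Q, K, V):
        # Nearest-key "database" lookup: pick the single best value per query.
        # argmax is a step function, so no gradient flows through the choice.
        idx = np.argmax(Q @ K.T, axis=-1)
        return V[idx]

    n, d = 8, 16
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, n, d))
    print(soft_attention(Q, K, V).shape)   # (8, 16)
    print(hard_lookup(Q, K, V).shape)      # (8, 16)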


> The attention mechanism is quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples to apples comparison.

That's ultimately my point. If an alphabet-based model can't achieve nearly the same results as a BPE-based model (even if appropriately scaled up to accommodate the expanded cost), doesn't that suggest that Transformers really are just a neat memorization hack?

> Sure you could: you could look it up and copy it, which is an ability GPT-3 also needs to model if it's to successfully learn from the internet, where people do that all the time. :)

That's right - but then we're just talking about memorization and regurgitation. Sure, it's impressive when done on a large scale, but is it really a research direction worth throwing millions of dollars at?

> I'm sure people working on transformers would _prefer_ to not have it remember everything for a window (and instead spend those resource costs elsewhere), but it's necessary that the attention mechanism be differentiable for training, and that excludes obvious constructions.

Of course, but all of my whinging about Transformers is a roundabout way of saying "I'm not convinced that the One True AI will unquestionably use some variant of differentiation/backpropagation".


> That's ultimately my point. If an alphabet-based model can't achieve nearly the same results as a BPE-based model (even if appropriately scaled up to accommodate the expanded cost), doesn't that suggest that Transformers really are just a neat memorization hack?

BPEs aren't even words for the most part. Are all native Chinese authors non-conscious memorization hacks? :)
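For anyone unfamiliar, here's a toy sketch of what BPE actually produces (a hand-picked merge list for illustration, not the real GPT-2/GPT-3 merge table). The learned units are subword fragments, not words:

    # Toy BPE: repeatedly merge adjacent symbol pairs according to a learned
    # merge list. The merges below are hand-picked for illustration only.
    merges = [("t", "h"), ("i", "n"), ("in", "g"), ("r", "e")]

    def toy_bpe(word):
        pieces = list(word)
        for a, b in merges:                    # apply merges in priority order
            i = 0
            while i < len(pieces) - 1:
                if pieces[i] == a and pieces[i + 1] == b:
                    pieces[i:i + 2] = [a + b]  # merge the pair in place
                else:
                    i += 1
        return pieces

    print(toy_bpe("rethinking"))   # ['re', 'th', 'in', 'k', 'ing']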



