
Interesting to consider whether this limitation of BPE points to a more fundamental issue with the model. Does GPT-3 "fail" when BPE is replaced with the conventional English alphabet as input symbols (for various definitions of "fail")?

If so, wouldn't this be evidence that the model is using its mind-blowingly large latent space to memorize surface patterns that bear no real relationship to the underlying language (as most people suspect)?

I suppose this comes back to my broader question about Transformer models in general: the use of a very large attention window of BPE tokens.

When I finish reading a paragraph, I can probably use my own words to explain it. But there's no chance I could even try to recreate the sentences using the exact words I just read. So I doubt our brains are keeping some running stack of the last XXXX words, or even some smaller distributed representation thereof.

It's more plausible that we're using some kind of natural hierarchical compression/comprehension mechanism that operates at the character/word/sentence/paragraph level.

It certainly feels like GPT-3 is using a huge parameter space to bypass this mechanism and simply learn a "reconstitutable" representation.

Either way, I'd be really interested to see how it handles character-level input symbols.



> Does GPT-3 "fail" when BPE is replaced with the conventional English alphabet as input symbols (for various definitions of "fail")?

The attention mechanism is quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples to apples comparison.
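To put rough numbers on it (a back-of-the-envelope sketch only; the ~4 characters per BPE token is an assumed average, and 2048 is GPT-3's token window):

    # Back-of-the-envelope: attention compares every position with every other,
    # so cost grows with the square of the sequence length. The 4 chars/token
    # ratio below is an illustrative assumption, not a measured value.
    context_tokens = 2048              # GPT-3's context window, in BPE tokens
    chars_per_token = 4                # assumed average characters per token

    bpe_pairs = context_tokens ** 2
    char_pairs = (context_tokens * chars_per_token) ** 2

    print(f"BPE attention pairs:       {bpe_pairs:,}")    # 4,194,304
    print(f"character attention pairs: {char_pairs:,}")   # 67,108,864
    print(f"blow-up factor:            {char_pairs // bpe_pairs}x")  # 16x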

> When I finish reading a paragraph, I can probably use my own words to explain it. But there's no chance I could even try to recreate the sentences using the exact words I just read.

Sure you could: you could look it up and copy it, which is an ability GPT-3 also needs to model if it's to successfully learn from the internet, where people do that all the time. :)

> the use of a very large attention window of BPE

You're also able to remember chunks from not long before. You just don't remember all of them. I'm sure people working on transformers would _prefer_ to not have it remember everything for a window (and instead spend those resource costs elsewhere), but it's necessary that the attention mechanism be differentiable for training, and that excludes obvious constructions. (E.g. you can't just bolt a nearest-key->value database on the side and simply expect it to learn to use it).
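If it helps, here's a minimal NumPy sketch of that distinction (my own toy illustration, nothing more): soft attention mixes all values with a softmax, so gradients flow through every query-key pair (which is also where the quadratic cost comes from), whereas a hard nearest-key lookup goes through an argmax that gradients can't pass through.

    import numpy as np

    def soft_attention(Q, K, V):
        # Scaled dot-product attention: every query gets a softmax-weighted
        # blend of ALL values. Fully differentiable, but the score matrix is
        # n x n -- the source of the quadratic cost.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                        # (n, n)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
        return weights @ V                                   # (n, d)

    def hard_lookup(Q, K, V):
        # Nearest-key "database" lookup: pick the single best value per query.
        # argmax is a step function, so no gradient flows through the choice.
        idx = np.argmax(Q @ K.T, axis=-1)
        return V[idx]

    n, d = 8, 16
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, n, d))
    print(soft_attention(Q, K, V).shape)   # (8, 16)
    print(hard_lookup(Q, K, V).shape)      # (8, 16)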


> The attention mechanism is quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples to apples comparison.

That's ultimately my point. If an alphabet-based model can't achieve nearly the same results as a BPE-based model (even if appropriately scaled up to accommodate the expanded cost), doesn't that suggest that Transformers really are just a neat memorization hack?

> Sure you could: you could look it up and copy it, which is an ability GPT-3 also needs to model if it's to successfully learn from the internet, where people do that all the time. :)

That's right - but then we're just talking about memorization and regurgitation. Sure, it's impressive when done on a large scale, but is it really a research direction worth throwing millions of dollars at?

> I'm sure people working on transformers would _prefer_ to not have it remember everything for a window (and instead spend those resource costs elsewhere), but it's necessary that the attention mechanism be differentiable for training, and that excludes obvious constructions.

Of course, but all of my whinging about Transformers is a roundabout way of saying "I'm not convinced that the One True AI will unquestionably use some variant of differentiation/backpropagation".


> That's ultimately my point. If an alphabet-based model can't achieve nearly the same results as a BPE-based model (even if appropriately scaled up to accommodate the expanded cost), doesn't that suggest that Transformers really are just a neat memorization hack?

BPEs aren't even words for the most part. Are all native Chinese authors non-conscious memorization hacks? :)
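For anyone unfamiliar, here's a toy sketch of what BPE actually produces (a hand-picked merge list for illustration, not the real GPT-2/GPT-3 merge table). The learned units are subword fragments, not words:

    # Toy BPE: repeatedly merge adjacent symbol pairs according to a learned
    # merge list. The merges below are hand-picked for illustration only.
    merges = [("t", "h"), ("i", "n"), ("in", "g"), ("r", "e")]

    def toy_bpe(word):
        pieces = list(word)
        for a, b in merges:                    # apply merges in priority order
            i = 0
            while i < len(pieces) - 1:
                if pieces[i] == a and pieces[i + 1] == b:
                    pieces[i:i + 2] = [a + b]  # merge the pair in place
                else:
                    i += 1
        return pieces

    print(toy_bpe("rethinking"))   # ['re', 'th', 'in', 'k', 'ing']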



