I think it's worth pointing out, for anyone interested in this comment, that a few things in it are wrong.

1. Standard JPEG compression uses the Discrete Cosine Transform, not the Fourier Transform.

2. It is easy to be dismissive of any technology by saying that it is 'just' X with Y, Z, etc. on top.

3. Vision transformers allow for much longer-range context. The magic comes in part from the ability to relate patches to one another, and in part from the learned features. JPEG does neither: it applies a fixed transform to each block independently.
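
To make points 1 and 3 concrete, here is a rough NumPy sketch, not how any real JPEG codec or ViT is implemented. The function names (`dct_matrix`, `block_dct`, `self_attention`) are my own, and the attention uses identity projections in place of the learned query/key/value weights a real model would have. It only illustrates the structural difference: the block DCT uses a fixed, real-valued cosine basis and sees one block at a time, while attention mixes information across all patches.

```python
import numpy as np


def dct_matrix(n: int = 8) -> np.ndarray:
    """Orthonormal DCT-II basis matrix (n x n), the transform family JPEG uses."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    x = np.arange(n)[None, :]  # sample index (columns)
    m = np.cos(np.pi * (2 * x + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] = np.sqrt(1.0 / n)  # DC row gets its own scale factor
    return m


def block_dct(block: np.ndarray) -> np.ndarray:
    """2-D DCT of a single square block; no other block is ever consulted."""
    d = dct_matrix(block.shape[0])
    return d @ block @ d.T


def self_attention(patches: np.ndarray):
    """Single-head self-attention over patch vectors of shape (num_patches, dim).

    Assumption: identity projections stand in for learned Q/K/V weights.
    Returns (outputs, attention_weights).
    """
    scores = patches @ patches.T / np.sqrt(patches.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over patches
    return weights @ patches, weights
```

Running `block_dct` on a constant block leaves only the DC coefficient non-zero, and the coefficients are real numbers rather than the complex values a Fourier transform would give. Changing one patch passed to `self_attention` changes the output for every other patch, which is the cross-patch interaction the block transform lacks.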