
The JPEG algorithm is:

1. Divide up the image into 8x8 patches

2. Take the DCT (a variant of the Fourier transform) of each patch to extract key features

3. Quantize the outputs

4. Use entropy coding (Huffman by default; arithmetic coding is an optional mode) to compress the quantized coefficients
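
The transform-and-quantize core of those steps can be sketched in a few lines of numpy. This is illustrative only, not a real JPEG codec: it uses an orthonormal DCT-II built by hand and a single uniform quantizer step size `q` in place of JPEG's per-frequency quantization tables, and it skips the entropy-coding stage entirely.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis: row k is cos(pi*(2i+1)*k / (2n)), scaled so C @ C.T = I.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0] *= 1 / np.sqrt(2)
    return C * np.sqrt(2 / n)

def encode_block(block, q=16):
    # Level-shift, 2D DCT of an 8x8 patch, then uniform quantization.
    C = dct_matrix(8)
    coeffs = C @ (block - 128.0) @ C.T
    return np.round(coeffs / q).astype(int)

def decode_block(qcoeffs, q=16):
    # Dequantize, inverse 2D DCT, undo the level shift.
    C = dct_matrix(8)
    return C.T @ (qcoeffs * q) @ C + 128.0
```

Larger `q` throws away more of the high-frequency detail, which is where the lossy compression happens; the entropy coder then exploits the long runs of zeros that quantization produces.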

The ViT algorithm is:

1. Divide up the image into 16x16 patches

2. Use query/key/value attention matrices to extract key features

3. Minimize cross-entropy loss between predicted and actual next tokens. (This is equivalent to trying to minimize encoding length.)
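
Steps 1 and 2 can likewise be sketched in numpy: patchify a 224x224 image into 196 tokens of 16x16 pixels, project them to an embedding dimension, and run one attention head over them. All the weights here are random placeholders (a real ViT learns them, adds positional embeddings, multiple heads, and MLP blocks); the point is just the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: patchify. A 224x224 image -> 14x14 = 196 patches, each 16x16 = 256 values.
img = rng.standard_normal((224, 224))
patches = img.reshape(14, 16, 14, 16).transpose(0, 2, 1, 3).reshape(196, 256)

# Linear projection of flattened patches to d-dimensional tokens.
d = 64
W_embed = rng.standard_normal((256, d)) * 0.02
tokens = patches @ W_embed                       # (196, d)

# Step 2: one attention head. Each patch queries all patches for relevant features.
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
scores = Q @ K.T / np.sqrt(d)                    # (196, 196) patch-to-patch affinities
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)         # softmax over patches
out = attn @ V                                   # (196, d) attended features
```

Note that `scores` is dense over all patch pairs: nothing stops patch 3 from attending to patch 190, which is the non-autoregressive point raised below.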

ViTs don't have quantization baked into the algorithm, but NNs are moving toward quantization in general. As another user correctly pointed out, vision transformers are not necessarily autoregressive (i.e. they may use future patches when computing values for earlier patches), while arithmetic coding usually is causal (as JPEG's entropy coding is), so the algorithms have a few differences, but nothing major.
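
The parenthetical in step 3, that minimizing cross-entropy is minimizing encoding length, is exact: an ideal arithmetic coder spends -log2 p bits on a symbol the model assigns probability p, so the average cross-entropy in bits is exactly the compressed bits per token. A toy example with made-up next-token probabilities:

```python
import math

# Hypothetical probabilities a model assigns to the 4 tokens it actually observed.
probs = [0.5, 0.25, 0.125, 0.125]

# Ideal arithmetic code length for each token: -log2 p bits.
bits = [-math.log2(p) for p in probs]      # [1.0, 2.0, 3.0, 3.0]

total_bits = sum(bits)                     # ideal compressed length: 9 bits
cross_entropy = total_bits / len(probs)    # average bits/token: 2.25
```

Lowering the cross-entropy loss and shrinking the compressed size of the training data are the same objective, which is why "language modeling is compression" keeps coming up.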

-----

I think it's pretty interesting how closely related generation and compression are. ClosedAI's Sora[^1] uses a denoising vision transformer for their state-of-the-art video generator, while JPEG has led image compression for the past several decades.

[^1]: https://openai.com/index/sora/?video=big-sur
