Unless you specifically design your compressor to emit output that can still be compressed further, it's going to trash compressibility: a good general-purpose compressor's output is close to random bytes, with nearly all the exploitable structure squeezed out.
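You can see this in a couple of lines (a minimal Python sketch, not part of the original comment; the sample text is made up):

```python
import zlib

# Redundant sample text: compresses very well the first time.
text = ("The quick brown fox jumps over the lazy dog. " * 200).encode()

once = zlib.compress(text, 9)    # first pass: big win
twice = zlib.compress(once, 9)   # second pass: output is near-random, so ~no win

print(f"raw:   {len(text)} bytes")
print(f"once:  {len(once)} bytes")
print(f"twice: {len(twice)} bytes")  # barely shrinks, may even grow slightly
```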
And if you find a way to compress text that isn't insanely computationally expensive, yet leaves the compressed text compressible further by LLMs - i.e. usable in training/inference? You'd have, basically, invented a better tokenizer.
A lot of people in the industry are itching for a better tokenizer, so feel free to try.
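For context, BPE - the basis of most current tokenizers - is itself a greedy compression algorithm: it repeatedly merges the most frequent adjacent pair of symbols. A toy sketch (illustrative only, not any particular library's implementation):

```python
from collections import Counter

def bpe_merges(text: str, num_merges: int) -> list[tuple[str, str]]:
    # Start from individual characters, then greedily merge the
    # most frequent adjacent pair num_merges times.
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)  # apply the new merge
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    print(f"{len(text)} chars -> {len(tokens)} tokens after {len(merges)} merges")
    return merges

bpe_merges("low lower lowest low low", 10)
```

The output is shorter than the input (that's the compression), but it's still a sequence over a fixed vocabulary with learnable statistics - which is exactly the property a generic compressor destroys.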