AVX512 got far, far, FAR easier to read and write than previous iterations. Lots of 'missing' operations were added, and the set seems far better thought out for autovectorizing compiler passes.

Still, yes, it's difficult to grok because it's low-level vector assembly, and like any low-level assembly it can be hard to follow at first.

AVX512 also has amazing features like mask registers, all the GFNI stuff, bit extract/compress, vpternlog, vpopcnt, etc. Once you've grokked them, vector code starts to look like even more magic.
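To make vpternlog a bit less magic, here is a minimal scalar sketch in plain C of what one 32-bit lane of it computes. The function name `ternlog32` is made up for illustration; it is a model of the semantics, not the intrinsic itself, so it runs on any CPU.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of VPTERNLOGD (hypothetical helper name).
 * For every bit position, the three input bits (a, b, c) form a 3-bit index
 * (a is the high bit, c the low bit) into the 8-bit immediate, and the
 * selected immediate bit becomes the output bit. The immediate is therefore
 * just the truth table of an arbitrary 3-input boolean function. */
uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8)
{
    uint32_t out = 0;
    for (int bit = 0; bit < 32; ++bit) {
        unsigned idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        out |= (uint32_t)((imm8 >> idx) & 1u) << bit;
    }
    return out;
}
```

With this model, an immediate of 0x96 gives a three-way XOR, 0xE8 gives a bitwise majority, and 0xCA gives a bitwise select (take `b` where `a` is set, else `c`). On real hardware the whole loop is a single instruction applied to all lanes at once.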

My main gripe with AVX512 is that it's still hard to get amazing performance, because of memory bandwidth limits, unoptimized streaming, and cache locality issues. You can write amazingly compact AVX512 that still performs poorly, because in the end it's all about keeping the FMA units fed.
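A back-of-the-envelope way to see the "feeding the FMA units" problem is a roofline-style estimate, sketched below in C. This is my own illustration rather than anything measured: the function names are hypothetical and any peak or bandwidth figures you plug in are assumptions about your machine.

```c
/* Arithmetic intensity of a streaming y[i] = a * x[i] + y[i] over floats:
 * 2 floating-point operations per element (one multiply, one add), against
 * 12 bytes of memory traffic (load x, load y, store y, 4 bytes each). */
double axpy_flops_per_byte(void)
{
    return 2.0 / 12.0;
}

/* Roofline-style bound: throughput is capped either by the compute peak or
 * by how fast memory can deliver operands, whichever is lower.
 * All three inputs are caller-supplied assumptions, not measured values. */
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double flops_per_byte)
{
    double memory_bound = bandwidth_gbs * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For example, with an assumed 1000 GFLOP/s of vector peak and 100 GB/s of memory bandwidth, the streaming loop above tops out around 16.7 GFLOP/s no matter how tight the AVX512 is, unless the data stays in cache or you do more work per byte loaded.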

But if you want to see a simpler, higher-level language that will generate fast code for AVX512 targets, check out ISPC. I've had very nice successes with it in the past, writing idiomatic code and getting better performance than my shitty experiments with intrinsics. It's not for every kind of code, but when it fits, it's just nice. And ISPC generates C-callable objects/libraries, so it's relatively easy to integrate into a build pipeline.