Hacker News

hm, that seems optimistic for this use case. I heard from a compiler engineer that autovectorizing a sort (which is full of permutes/shuffles) is much harder and likely to remain infeasible for quite some time.


GPUs have a crossbar that allows for high-speed lane-to-lane permutes and bpermutes, but even those are slow compared to butterfly shuffles.
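For readers who haven't run into them: a butterfly shuffle exchanges lane i with lane i ^ mask, which is the pattern CUDA exposes as `__shfl_xor_sync`. A toy Python emulation (the helper name is mine, not a real API) showing how log2(n) such stages compose into a full lane reversal:

```python
def butterfly_shuffle(lanes, mask):
    """Butterfly (XOR) shuffle: lane i reads from lane i ^ mask.
    Every lane participates in exactly one pairwise exchange, so the
    pattern is conflict-free and cheap in hardware."""
    return [lanes[i ^ mask] for i in range(len(lanes))]

# Reverse an 8-wide "register" with log2(8) = 3 butterfly stages:
# the masks XOR together to 7, and i ^ 7 == 7 - i for 3-bit lane ids.
data = list(range(8))
for mask in (4, 2, 1):
    data = butterfly_shuffle(data, mask)
# data is now [7, 6, 5, 4, 3, 2, 1, 0]
```

Each stage is a fixed, conflict-free exchange, which is exactly why hardware can make it faster than a general crossbar.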

I do believe compilers will eventually be able to lower any data-movement pattern into the right butterfly shuffles (not today in the general case; modern CUDA compilers are impressive, but this is a hard problem). Still, I'm convinced the programmer needs to be aware of how difficult many-to-many data movements are at a low level, whether on a 16-wide AVX-512 register or a 32-wide GPU block / warp / wavefront.
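To make "many-to-many" concrete: the general lane permute that AVX-512's `vpermps` or a warp-wide `__shfl_sync` provides is a full crossbar gather, where several lanes may read the same source lane. A minimal Python sketch (the helper name is mine):

```python
def permute(lanes, idx):
    """Full-crossbar lane permute: lane i reads from lane idx[i].
    Many-to-one reads (broadcasts) are allowed, which is what makes
    this strictly more general than a fixed butterfly exchange."""
    return [lanes[j] for j in idx]

data = [10, 20, 30, 40]
print(permute(data, [3, 3, 0, 1]))  # [40, 40, 10, 20]
```

Lanes 0 and 1 both read lane 3 here; a single butterfly stage can never express that, which is where the routing difficulty comes from.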

--------

EDIT: I'm like 90% sure some dude at Bell Labs in the 1950s working on Clos or Beneš network design already has an efficient representation for many-to-many data shuffles on a parallel architecture. But I'm not Ph.D. enough to have read all those papers or to keep up with those old designs.

Many-to-many data movement is traditionally a networking and routing problem, but SIMD programmers and SIMD chip designers are starting to run up against it, because a ton of parallel programming is about efficiently moving data between conceptual lanes and/or threads.
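As a sketch of why those old designs matter here: a Beneš network is just butterfly stages whose 2x2 switches are controlled individually, instead of the all-or-nothing XOR shuffle. Uniform XOR shuffles can only ever reach n of the n! permutations of n lanes, but per-switch control makes the same wiring rearrangeably nonblocking. A brute-force Python check for n = 4 (stage masks [1, 2, 1] follow the standard Beneš construction; the function names are mine):

```python
from itertools import product

def switched_stage(lanes, mask, bits):
    """One butterfly stage of 2x2 switches: the pair (i, i ^ mask)
    is swapped iff its control bit is set. Per-switch control is
    what distinguishes a Benes stage from a plain XOR shuffle."""
    out = list(lanes)
    pairs = [i for i in range(len(lanes)) if i < i ^ mask]
    for bit, i in zip(bits, pairs):
        if bit:
            j = i ^ mask
            out[i], out[j] = out[j], out[i]
    return out

def benes4(lanes, settings):
    """Benes network for n=4: three stages (masks 1, 2, 1),
    two independent switches per stage = 6 control bits."""
    for mask, bits in zip((1, 2, 1),
                          (settings[0:2], settings[2:4], settings[4:6])):
        lanes = switched_stage(lanes, mask, bits)
    return lanes

# Enumerate all 2^6 switch settings: every one of the 4! = 24
# permutations of 4 lanes turns out to be routable.
reachable = {tuple(benes4(list(range(4)), s))
             for s in product((0, 1), repeat=6)}
print(len(reachable))  # 24
```

The part the Bell Labs folks actually solved is finding those control bits for a given permutation without brute force (the classic looping algorithm), which is exactly the kind of representation a shuffle-aware compiler would want.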



