Another way that rearrange is not vpshufb is that 0 in the 17th position means "...

gergo_barany · on June 3, 2023

Not sure what you mean by "get me the 17th byte (from this 16-byte lane)". Do you have a concrete example of a vpshufb that you would like to model?

anonymoushn · on June 3, 2023

vpshufb only accesses bytes within the same 16-byte lane and has latency 1. Instructions that shuffle bytes across lanes tend to have higher latency. So at the 16th position, a shuffle value of 0 (or 16, or 32...) gets you the first byte (byte 0) from the other argument, but at the 17th position, a shuffle value of 0 instead gets you the 17th byte (byte 16, or byte 0 of the corresponding 16-byte lane, if you like) from the other argument. This isn't really desirable behavior. Rather, it's somewhat awkward behavior that one has to work with to avoid doing a sequence of more expensive operations.

I am concerned that it takes like 6 instructions (2 to put copies of each lane into all lanes, 2 shuffles, cmpgt, blend) including a load and using an extra register to implement rearrange for use cases that would be fine with vpshufb for 32-byte registers and maybe 14 instructions (4 to put copies of each lane into all lanes, 4 shuffles, 3 compares, 3 blends) with 3 loads and 3 extra registers to implement rearrange for 64-byte registers. Maybe you can do better for the latter case though, I haven't had a thorough try at it.

A common use of vpshufb is to get a mask of character positions from a string matching a small set of characters, which you can do with cmpeq(in_slice, vpshufb(c, in_slice)), where the input substring in_slice is used as the shuffle and c is chosen so that input characters in the set of characters of interest (e.g. {'\n' '\r' '\t' '\"' '\\'}) map to themselves and other characters map to something other than themselves. rearrange doesn't seem to define what happens when indices in the shuffle are negative or too large, so it seems like to use rearrange we need to mask off some bits and do cmpeq(in_slice, rearrange(c, and(d, in_slice))) where the `and` is only there because of the nit about invalid indices and `rearrange` is 1 or 6 or 14 instructions depending on what register width we're working with. That could be unfortunate.

You don't need this stuff for sorting primitives though!

gergo_barany · on June 3, 2023

You are still making things up about the API. If you're actually interested, look at the VectorShuffle class, its docs will tell you all about how illegal indices are handled and how to legalize them. But I doubt that you're actually interested.

It's a pity because you clearly have a lot of knowledge that you could put to use to provide very helpful feedback to implementors of the Vector API. (I'm doing this for the GraalVM compiler.)

anonymoushn · on June 4, 2023

I just read the entire documentation for the rearrange method. My apologies for not reading some unrelated docs to find out what rearrange does. It looks like this just makes things worse though? Like you really do have to mask off the high bits of the input, but then you're also paying several instructions for the mapping of out-of-bounds indices to negative values when making the VectorShuffle and then several instructions again to check whether any of them were negative so you can throw an exception from `rearrange`.