
Auto-vectorization is easier to get into than other SIMD frameworks like CUDA, OpenCL, ROCm, Intel's ISPC, and whatnot. But in my experience, auto-vectorizers are just not as flexible as the proper SIMD tools.

I'd say auto-vectorization should still be learned by modern high-performance programmers, because it's very low-hanging fruit. You barely have to do anything and suddenly your for-loops are AVX512-optimized, though maybe not to the fullest extent possible.

Still, I suggest that programmers also learn how to write SIMD code properly. Maybe intrinsics are too hard in practice, but ISPC, CUDA, and other SIMD programming environments make things far easier to learn than you might expect.

------------

ISPC in particular is Intel's SIMD programming language, much akin to CUDA except it compiles into AVX512. So for code targeting AVX512-like execution environments, the ISPC language/compiler is exceptionally useful.

It's harder to learn a new language than to learn a few compiler options that enable auto-vectorization, however. So in practice, auto-vectorization will continue to be used. But for tasks that specifically benefit from SIMD-thinking, the dedicated ISPC language should be beneficial.



> I'd say auto-vectorization should still be learned by modern high-performance programmers, because it's very low-hanging fruit.

Somewhat related: 55 GiB/s FizzBuzz https://codegolf.stackexchange.com/questions/215216/high-thr... (discussion: https://news.ycombinator.com/item?id=29031488)


Definitely not autovectorized.


What do you mean by learning auto vectorization? Do you mean learning how to structure your code so that your compiler of choice has a good chance of pulling it off?


Compilers are pretty good at restructuring code. As a programmer you need to know the common situations where the compiler would like to reorder your code, but can't.

E.g. when you do summation using a for loop and the compiler is trying to preserve the rounding behavior of your float operations. Or because you created a dependency chain with a for loop where each iteration requires the previous iteration's result.
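
For example, a minimal C++ sketch of the summation case (function names are just for illustration; assumes GCC or Clang):

    #include <cstddef>

    // Naive sum: every iteration depends on the previous one, and the compiler
    // must preserve the exact rounding order unless something like -ffast-math
    // (or -fassociative-math) is enabled, so it usually stays scalar.
    float sum_naive(const float* a, std::size_t n) {
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            s += a[i];
        return s;
    }

    // Breaking the dependency chain into independent partial sums gives the
    // compiler (and the CPU's parallel execution units) room to work. The
    // result can differ in the last bits because the additions are reordered.
    float sum_partial(const float* a, std::size_t n) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i + 0];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        float s = (s0 + s1) + (s2 + s3);
        for (; i < n; ++i)  // scalar tail
            s += a[i];
        return s;
    }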


Pretty much.

If one wants to sum two arrays, then instead of for-looping through the elements one by one, one might iterate in chunks, so that the compiler can easily tie them together into single operations that vectorize.
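
Something like this (a rough sketch; the chunk size and names are made up for illustration):

    #include <cstddef>

    // Processing in fixed-size chunks makes the vectorizable block explicit:
    // the inner loop has a constant trip count, so the compiler can turn it
    // into a handful of SIMD operations per chunk.
    void add_chunked(float* out, const float* a, const float* b, std::size_t n) {
        constexpr std::size_t kChunk = 8;
        std::size_t i = 0;
        for (; i + kChunk <= n; i += kChunk)
            for (std::size_t j = 0; j < kChunk; ++j)
                out[i + j] = a[i + j] + b[i + j];
        for (; i < n; ++i)  // scalar tail for the leftovers
            out[i] = a[i] + b[i];
    }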


If I recall, you can absolutely loop through the elements in a tight loop and the compiler (e.g. GCC) will auto-vectorize for you (if you have the relevant optimization flag set).

The trick with coding for auto-vectorization is to keep your loops small and free of clutter.

I don't have the documentation handy, but I think you only need to follow a couple of rules (a quick sketch follows the list):

- loop must have a defined size (for-loop instead of while-loop)

- don't muck with pointers inside the loop (simple pointer increment is okay)

- don't modify other variables (only the array should be modified)
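
For instance, a loop like this follows all three rules and typically auto-vectorizes (a sketch, assuming GCC or Clang with optimizations on):

    #include <cstddef>

    // Known trip count, pointers only advance linearly, and only the output
    // array is written (__restrict tells the compiler the arrays don't
    // overlap). GCC/Clang will usually vectorize this at -O2/-O3;
    // -fopt-info-vec (GCC) or -Rpass=loop-vectorize (Clang) reports whether
    // they did.
    void add_arrays(float* __restrict out, const float* a, const float* b,
                    std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = a[i] + b[i];
    }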


The linked document describes Intel's autovectorizer, its warnings, and the compiler flags that point out which loops were auto-vectorized and which were not, as well as listing specific reason codes why.

Microsoft, GCC and Clang all do this too, though with different compiler flags and messages.

I'd say that the whole point of the document listed here is to bring the programmer up to understanding these messages and knowing specifically how to fix the issues that cause auto-vectorization to fail.


In Rust I've noticed that using iterators instead of for loops tends to trigger auto-vectorization more often.


In Rust, using iterators is almost always better than for loops. The compiler is virtually guaranteed to optimize out the bounds checks, versus indexing in manually.


Here are some examples, or you could use this library itself in Rust: https://github.com/vorner/slipstream


Even more important than learning how to vectorize is learning the basics of computer architecture: registers, caches, pipelining, branch prediction, and superscalar execution.

One of the common interview questions I ask is computing the sum of an array of doubles. Barely 1% of applicants even get the basics correct, such as having multiple partial sums.

Once you do the work so that your code actually allows execution to happen in parallel, leveraging superscalar execution and SIMD happens automatically.


> One of the common interview questions I ask is computing the sum of an array of doubles. Barely 1% of applicants even get the basics correct, such as having multiple partial sums.

Ehhh... I've got multiple hats from my college classes.

Yes, the optimization guides say that multiple partial sums is fastest. But... my floating-point analysis class has my brain yelling "cancellation error". The only way to minimize cancellation error is to sort the numbers by magnitude and separate the positives from the negatives. You add the numbers from smallest to largest in magnitude, keeping matching signs together, and then finally combine them at the end.

The only two partial sums you should be keeping are "positive numbers" and "negative numbers", and everything else needs to be folded together to minimize rounding and cancellation errors.
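
A rough C++ sketch of that sort-and-separate approach (illustrative only, and definitely not the fast path):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Sort by magnitude, accumulate positives and negatives separately from
    // smallest to largest, and cancel only once at the end. Slow, but it keeps
    // intermediate rounding and cancellation error small.
    double stable_sum(std::vector<double> v) {
        std::sort(v.begin(), v.end(),
                  [](double a, double b) { return std::fabs(a) < std::fabs(b); });
        double pos = 0.0, neg = 0.0;
        for (double x : v) {
            if (x >= 0.0)
                pos += x;
            else
                neg += x;
        }
        return pos + neg;  // the single big cancellation happens here
    }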

--------

You know. For stability and minimization of errors. There are other, faster methods that could be employed. But optimization of partial sums is probably the last thing I'm thinking about if I see "float" or "double".

If things are "int" or "long", I'll have a higher chance of actually thinking (and caring) about optimization, because stability/cancellation error isn't a thing with int or long.


The reason you need multiple sums is that there is a loop-carried dependency which causes every iteration to stall, preventing pipelining from saturating the multiple compute units of your superscalar processor.

Yes, floating-point isn't associative, so the results aren't numerically equivalent, and that is why you need to do that optimization yourself since the compiler isn't allowed to do it for you.

As per the exercise, there is no information about the distribution of the data, so there is no indication that any order would be any better than another for numerical stability.

If you care about stability you'd implement a Kahan sum, which isn't really affected by changing the order of the summation, so you can still parallelize it.
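
For reference, a minimal C++ sketch of a Kahan (compensated) sum (build without -ffast-math, or the compensation gets optimized away):

    #include <cstddef>

    // The small term c carries the rounding error lost in each addition and
    // feeds it back into the next one.
    double kahan_sum(const double* a, std::size_t n) {
        double sum = 0.0, c = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double y = a[i] - c;
            double t = sum + y;
            c = (t - sum) - y;  // recovered rounding error
            sum = t;
        }
        return sum;
    }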


Kahan summation doesn't fix cancellation error. It just extends the precision by another 53 bits or so.

The only way you fix cancellation error is summing the positives in one variable, and the negatives in another, and do a big cancellation at the end.

Anyway, my overall point is not to criticize Kahan summation or other techniques. My point is that these techniques are incompatible with optimizations. The faster way is oftentimes a less stable operation.


Is asking a question that fewer than 1% of applicants "can even get the basics correct" -- which implies that only a tiny fraction of 1% actually provide a fully complete and correct answer -- a good use of either the applicants' or the interviewer's time? Is it effective in distinguishing candidates, or are you only seeking 1 success (fewer, actually) per 100?


1% success rate is typical when interviewing developers.

Also, they get asked multiple exercises; not doing great at one is not disqualifying.

Being able to quickly identify whether an individual is exceptionally good is very important so the employer can fast-track them and respond quickly enough that they don't accept another offer.


cpu cache line optimization is relevant to a superminority of programming contexts

if you're working in those contexts then it's all fine, but it's important to acknowledge that most software development is bottlenecked by concerns far above this level of concern


> auto-vectorizers are just not as flexible as the proper SIMD tools

What is a "proper" SIMD tool?


There is no such thing as 'failure to autovectorize' in CUDA or OpenCL. All code is vectorized in these languages. SIMD is fundamental to the language model.

As such, it's easier to write high performance code in practice. Intel's ISPC is the closest tool that replicates this effect for AVX512.


What would happen if I try to run a program that contains AVX instructions on a system that does not support them? Do I need to recompile the program?


The cheeky answer would be to throw away the trash every decade, but the real answer is that there is tooling in place on modern *nix systems to pick the best available implementation of a function when the program is loaded, targeting several levels of SIMD support, e.g. SSE2 only, SSE4.2, AVX2, or a common collection of AVX512 extensions.
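
A minimal C++ sketch of one such mechanism, GNU function multiversioning (the function name is just for illustration; assumes GCC or a recent Clang):

    #include <cstddef>

    // You provide one body per feature level; the dynamic loader picks the
    // best supported clone at startup through an IFUNC resolver.
    __attribute__((target("default")))
    void scale(float* a, float s, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) a[i] *= s;
    }

    __attribute__((target("avx2")))
    void scale(float* a, float s, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) a[i] *= s;  // same source, compiled with AVX2 enabled
    }

    __attribute__((target("avx512f")))
    void scale(float* a, float s, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) a[i] *= s;
    }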


What tool are you referring to? Can you please point me to those resources?


I think crest may be referring to function multiversioning, but that only helps choose/call the implementation, not generate it.

github.com/google/highway supports compiling your single source code once into several versions, and then dispatching to the best one.

Disclosure: I am the main author.


E.g. ISPC can compile for multiple targets at once. The runtime system selects the best compatible implementation for the machine running the code.


It crashes with an illegal-instruction fault (SIGILL).


So when I am downloading a binary - e.g. Krita/GIMP or any open-source project - how do the devs ensure that the common binary will utilize these newer instructions without hitting an 'illegal opcode' error?


Depends on the binary. Compiling to the lowest common denominator is one strategy. The other is to have the code detect what's available at run time and only use that subset.
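
A minimal sketch of the detect-and-dispatch strategy, assuming GCC or Clang:

    #include <cstdio>

    // The binary is compiled for a conservative baseline, but it checks the
    // CPU at run time and only branches into the wider code paths when they
    // are actually supported.
    int main() {
        if (__builtin_cpu_supports("avx512f"))
            std::puts("dispatching to the AVX-512 kernel");
        else if (__builtin_cpu_supports("avx2"))
            std::puts("dispatching to the AVX2 kernel");
        else
            std::puts("falling back to the SSE2 baseline");
    }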


Interesting. How do the devs decide on the lowest common denominator? Do you happen to know any blog or resource that explains the process?


Most devs don't care, so it depends on the compiler default. However, some people who are eager for optimization will turn on -march=native to get the best profile for the processor family the build machine happens to be running on. This is where the trouble comes in.


I touched on that question in the "Hardware Support" section, on page 6 of this article: http://const.me/articles/simd/simd.pdf

TL;DR: SSE1 and SSE2 are part of the base AMD64 ISA, so they are always available when compiling for 64 bits. SSE3 and SSE4.1 are old and market penetration is very close to 100%, so they are always available in practice. For newer extensions like AVX and FMA, the optimal tradeoff depends on the product and the market.


Do I need to download the source and compile it with the correct flags on the target system for optimal performance? At my previous institution they would download the commonly distributed R language binary and run it as-is on a brand-new system, and I was wondering if recompiling from source would have made a difference.


I don’t know, it depends on the specific library.

If they were compiling for the least common denominator like SSE2, recompiling with -march=native might make a difference. The exact amount of that difference depends heavily on the code.

Automatic vectorizers are limited: they only generate good vectorized code for pure vertical operations which handle long, dense arrays of FP32 or FP64 values.

OTOH, some widely used libraries like Eigen https://eigen.tuxfamily.org/index.php?title=Main_Page use manually written intrinsics under the hood, selecting an implementation with preprocessor macros. If your R library uses Eigen, compiling for AVX2 or AVX512 will probably yield a non-trivial performance win.

If your library uses runtime dispatch to select the best implementation at run time, and was compiled recently (i.e. your CPU doesn't have new instruction sets which weren't available back then), you're unlikely to gain much.

BTW, I've noticed the C standard library shipped with VC++ uses runtime dispatch; functions like memcpy() use AVX vectors internally when supported.



