SIMD instructions

dang · on April 10, 2020

This is a list of articles—probably a good one, but HN is itself a list of articles, so this is too much indirection.

Lists don't make good HN submissions, because the only thing to discuss about them is the lowest common denominator of the items on the list [1], leading to generic discussion, which isn't as interesting as specific discussion [2].

It's better to pick the most interesting item from the list and submit that. You can always do it more than once, if there is more than one interesting item—but it's best to wait a while between such submissions, to let the hivemind caches clear.

[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

[2] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

Twinklebear · on April 10, 2020

SIMD is used a ton in rendering applications and starting to see more use in games too (through ISPC for example).

I'd add to the list:

- Embree: https://www.embree.org/ Open source high-performance ray tracing kernels for CPUs using SIMD.

- OpenVKL: https://www.openvkl.org/ Similar to Embree (high-performance ray tracing kernels), but for volume traversal and sampling.

- ISPC: https://ispc.github.io/ an open source compiler for a SPMD language which compiles it to efficient SIMD code

- OSPRay: http://www.ospray.org/ A large project using SIMD throughout (via ISPC) for real time ray tracing for scientific visualization and physically based rendering.

- Open Image Denoise: https://openimagedenoise.github.io/ An open-source image denoiser using SIMD (via ISPC) for some image processing and denoising.

- (my own project) ChameleonRT: https://github.com/Twinklebear/ChameleonRT has an Embree + ISPC backend, using Embree for SIMD ray traversal and ISPC for vectorizing the rest of the path tracer (shading, texture sampling).

bityard · on April 10, 2020

> starting to see more use in games

Starting to see? Back in Ye Olde 586 Days of the late 1990s, MMX was added to the Pentium architecture pretty much exclusively for 3D games and real-time audio/video decoding. (This was back when the act of playing an MP3 was no small chore for the average consumer CPU.) Intel made quite a big deal over MMX including millions of dollars in TV ads aimed at the general population, despite the fact that software had to be built specifically to use MMX and that only certain kinds of software could benefit from it.

rasz · on April 12, 2020

MMX had nothing to do with games! It was a part of Intel _marketing scam_. https://news.ycombinator.com/item?id=19468837 :

"MMX was useless for games. MMX is Integer math only, good for DSP, things like audio filters, or making a softmodem out of your sound card. Unsuitable for accelerating 3D games. Whats worse MMX has no dedicated registers, and instead reuses/shares FPU ones, this means you cant use MMX and FPU (all 3D code pre Direct3D 7 Hardware T&L) at the same time. ... Funnily enough AMDs 1998 3DNow! did actually add floating point support to MMX and was useful for 3D acceleration until hardware T&L came along 2 years later.

Intel Paid few dev houses to release make believe MMX enhancements, like POD (1997)

https://www.mobygames.com/images/covers/l/51358-pod-windows-...

1/6 of box covered with Intel MMX advertising while game used it only for some sound effects. Intel repeated this trick in 99 while introducing Pentium 3 with SSE. Intel commissioned Rage Software to build a demo piece showcasing P3 during Comdex Fall. It worked .. by cheating with graphic details ;-) Quoting hardware.fr "But looking closely at the demo, we notice - as you can see on the screenshots - that the SSE version is less detailed than the non-SSE version (see the ground). Intel would you try to roll the journalists in the flour?". Of course Anandtech used this never released publicly cheating demo pretending to be a game in all of their Pentium 3 tests for over a year.

https://www.vogons.org/viewtopic.php?f=46&t=65247&start=20#p... "

MMX was one of Intel's many Native Signal Processing (NSP) initiatives. They had plenty of ideas for making PCs dependent on Intel hardware, something Nvidia is really good at these days (physx, cuda, hairworks, gameworks). Thankfully Microsoft was quick to kill their other fancy plans https://www.theregister.co.uk/1998/11/11/microsoft_said_drop... Microsoft did the same thing to Creative with Vista killing DirectAudio, out of fear that one company was gripping positional audio monopoly on their platform.

djmips · on April 10, 2020

Indeed it's been a workhorse since forever. All the consoles from PS2 forward have included some form of SIMD and it is used extensively.

Here's a GDC 2015 article about SIMD at Insomniac Games. https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afr...

misnome · on April 10, 2020

> ISPC: https://ispc.github.io/ an open source compiler for a SPMD language which compiles it to efficient SIMD code

I've been learning ispc lately and it does seem like a wonderful solution, you avoid having to build separate implementations for every instruction set and/or worrying about per-compiler massaging to get it to recognise the vectorisation opportunities. The arguments for having a domain-specific language variant and why it was written (https://pharr.org/matt/blog/2018/04/30/ispc-all.html is a good read) seem like persuasive arguments.

However, outside of the projects in the above list - it doesn't seem to have very wide usage. There are still commits coming in/responding to some issues so it doesn't seem dead, but there are many issues untouched or just untriaged. There isn't much discussion about using it, or people asking for advice. The mailing list has about a message a month.

Is it merely just an extremely highly specialised domain? Is it just that CUDA/OpenCL is a more efficient solution for most cases where one would otherwise consider it? Are there too many ASM/intrinsic experts out there to bother learning?

Twinklebear · on April 10, 2020

ISPC is really awesome, but you're right it is much less known than CUDA/OpenCL. Part of that might just be lack of marketing effort and focus (you don't hear much about it compared to e.g. CUDA) and the team working on it is far smaller than that on CUDA. There has been some wider adoption, like Unreal Engine 4 using it now: https://devmesh.intel.com/projects/intel-ispc-in-unreal-engi... which is super cool, so hopefully we'll see more of that.

As far as support from other languages I did write this wrapper for using ISPC from Rust https://github.com/Twinklebear/ispc-rs (but that's just me again), and there has been work on a WebASM+SIMD backend which is really exciting. Intel does also have an ISPC based texture compressor (https://github.com/GameTechDev/ISPCTextureCompressor) which I think does have some popularity.

However, the domain is pretty specialized, and I think the fraction of people who really care about CPU performance and are willing to port or write part of their code in another language is smaller still. It's also possible that a lot of those who would do so have their own hand written intrinsics wrappers already. Migrating to ISPC would reduce a lot of maintenance effort on such projects, but when they already have momentum in the other direction it can be harder to switch. I think that on the CPU ISPC is easier and better than OpenCL for performance and tight integration with the "host" language, since you can directly share pointers and even call back and forth between the "host" and "kernel".

fluffy87 · on April 10, 2020

ISPC’s creator Matt Pharr works at NVIDIA, they have a series of blog post explaining the history of ISPC.

conistonwater · on April 10, 2020

Link: https://pharr.org/matt/blog/2018/04/18/ispc-origins.html

KMag · on April 11, 2020

At work, I had a project involving a DSL for Monte Carlo simulations. The DSL was an internal DSL in Scala, our interpreter was in Scala, and we transpiled to ISPC (for servers/VMs that didn't expose a GPU) and OpenCL.

I generally liked ISPC, but I really didn't like that it tried to look as close as possible to C but departed from C in unnecessary ways. With Monte Carlo simulations, we deal with a lot of probabilities represented as doubles in the range [0.0, 1.0]. The biggest pain is that operations between a double and any integral type cast the double to the integral type, whereas in C, the integral type gets implicitly cast to a double. I understand the implicit casting rules were changed to give the fastest speed rather than minimize worst-case rounding error. I could understand getting rid of implicit casts, or maybe I could understand changing rules to improve accuracy and know that the user could easily use a profiler to discover any performance problems this caused. However, in our case, uint32_t * double = (uint32_t) 0, which then would get implicitly cast back to a double if being assigned to a variable. My interne was beating his head against the wall for the better part of an afternoon before I gave him a bit of debugging help. All of his probabilities were coming out 0% and 100% for his component.

I actually emailed the authors with a bug report when I found the implicit casting rules differed so radically from C and were in the direction away from accuracy. (Note there's no rounding error when converting uint32_t to a 64-bit IEEE-754 double.) They were very nice, and pointed us to where this behavior was documented.

If you're going out of your way to make your language look like C and interoperate seamlessly with C, you should have really strong justifications for the places where you radically depart from C's semantics.

adev_ · on April 10, 2020

> However, outside of the projects in the above list - it doesn't seem to have very wide usage.

ISPC is pretty popular in the HPC world.

BubRoss · on April 10, 2020

Is it? I haven't heard about it actually being popular anywhere. It definitely works well, but I haven't seen it talked about much except in the case of embree, Intel's ray tracing library. It doesn't seem like there is any funding for it, though it actually works so well already it doesn't seem to need big leaps in progress to be valuable.

adev_ · on April 10, 2020

> Is it? I haven't heard about it actually being popular anywhere.

I know 3 simulators running on supercomputers in the neurosciences domains that use it + some graph processing over supercomputers tools.

It is true that is is not extremely well known, but it is used.

BubRoss · on April 10, 2020

That's great, but you said it was popular in the HPC world. I would love that to be true, but I don't know of a way to see the big picture.

apjana · on April 10, 2020

Adding nnn: https://github.com/jarun/nnn A terminal file manager.

It takes advantage of SIMD at -O3 level of optimization in it's custom string copy function: https://github.com/jarun/nnn/blob/bc7a81921ed974a408d4de2cbf...

The function is used extensively in the program.

saagarjha · on April 10, 2020

I don’t see why you’d write one of these yourself when the one in the standard library is probably vectorized too, and better…

apjana · on April 10, 2020

Not necessarily. There are implementations which don't even take advantage of 4/8 byte copying. We wanted to have something uniform. But yes, you are right with glibc or macOS.

Also, from the strncpy man page:

   strlcpy()
       Some systems (the BSDs, Solaris, and others) provide the following function:

           size_t strlcpy(char *dest, const char *src, size_t size);

       This function is similar to strncpy(), but it copies at most size-1 bytes to dest, always adds  a
       terminating  null  byte,  and  does  not pad the target with (further) null bytes.  This function
       fixes some of the problems of strcpy() and strncpy(), but the caller must still handle the possi‐
       bility of data loss if size is too small.  The return value of the function is the length of src,
       which allows truncation to be easily detected: if the return value is greater than  or  equal  to
       size,  truncation  occurred.  If loss of data matters, the caller must either check the arguments
       before the call, or test the function return value.  strlcpy() is not present in glibc and is not
       standardized by POSIX, but is available on Linux via the libbsd library.

saagarjha · on April 10, 2020

Why not call strncpy or memcpy rather than exhibiting undefined behavior?

apjana · on April 11, 2020

> Why not call strncpy

Read the excerpt.

> undefined

Nothing's _undefined_ there.

saagarjha · on April 11, 2020

I'm not sure which specific excerpt you're referring to, but I have a good idea of the many functions that libraries have come up with to sling characters from one buffer to another, plus I read your implementation and the man page snippet you linked above. I'm still not seeing why you can't replace the code between lines 881 and 902 with one of the appropriate copying routines; you quite literally have a source, destination, and length and you can fix up the last NUL byte right after the call. The standard library's function will be vectorized regardless of how your compiler was feeling that day, and it's probably smarter than yours (glibc, for example, does a "small copy" up to alignment before it launches into the vectorized stuff, rather than skipping it entirely if the buffers aren't aligned). And your function does have undefined behavior: you pun a char * to a ulong *.

apjana · on April 11, 2020

> glibc, for example

> The standard library's function will be vectorized regardless of how your compiler was feeling that day

We talked glibc. I mentioned there are libraries which do not do _any_ optimization other than a byte by byte copy.

> I'm still not seeing why you can't replace the code between lines 881 and 902

Because you are considering only glibc.

And yes, we can do a lot of things. But the function and copying buffers are not the top priority for us ATM. I shared it as an example in the context of the current topic. Not all code is supposed to match your preferences.

> char * to a ulong *

Both the source and destination are guarded by length and alignment requirement checks.

saagarjha · on April 11, 2020

> Not all code is supposed to match your preferences.

Yikes, sorry if I came off as trying to force my opinion on your project. I'm just trying to understand the rationale behind the choices you made, since I've (clearly) never seen anything like it. (If I was genuinely interested in trying to modify your project to my desires, I hope you can believe I'd be kind enough to dig through the project to see if I could figure this out myself, then send a patch with rationale for you to decide whether you wanted it or not, rather than yell at you on Hacker News to fix it.) But to your points:

> We talked glibc. I mentioned there are libraries which do not do _any_ optimization other than a byte by byte copy.

I haven't actually seen one for quite a while–most of the libcs that I'm familiar with (glibc, macOS's libc, musl, libroot, the various BSD libcs, Bionic) have some sort of vectorized code. I'm curious if the project can run on some obscure system that I'm not considering ;)

> Both the source and destination guarded by length and alignment requirement checks.

Perhaps we have a misunderstanding here: I'm saying it's undefined by the C standard, as the pointer cast is a strict aliasing violation regardless of the checks. It will generally compile correctly as char * can alias any type, so the compiler will probably be unable to find the undefined behavior, but it's technically illegal. (I would assume this is one of the many reasons most libcs implement their string routines in assembly.)

apjana · on April 11, 2020

> send a patch with rationale for you to decide whether you wanted it or not

I would really appreciate that. And I do understand your intention is good.

The problem I see with geeky forums is there are just too many people trying to force their ideas on you at every step and expect you to implement those. So it's kind of a standard reply from my side.

saagarjha · on April 11, 2020

It’s a bit too late for me to be writing string manipulation code in C, but I’ll see if I can take a look at this tomorrow. It’ll probably just be a replacement of the copying part with memcpy, and a benchmark if I can find one.

apjana · on April 11, 2020

Sure thing! Thanks a lot!

apjana · on April 11, 2020

Also, here's the source from musl (somewhat similar to what we have)

https://github.com/ifduyue/musl/blob/79f653c6bc2881dd6855299...

saagarjha · on April 11, 2020

Yeah, musl just vectorizes mildly using when certain GNU C extensions are available. Presumably Rich didn’t want to write out another version in assembly. (It really is a shame that strncpy returns dest.)

_3lin · on April 10, 2020

Hi, thank you for the pointers!

I try not to include C or C++ projects other than for educational purpose (like the Mandelbrot set) because one of my life's goal is to help the world to transition to a C & C++ free world (other than for kernels...).

I believe that my role is to promote projects which are "building the new world" and thus we need to abandon and port all form insecure core.

sk0g · on April 10, 2020

So in an article about high/extreme performance systems, you're ignoring the vast majority of them because you don't agree with the tool used to achieve said performance? What..?

pjmlp · on April 10, 2020

I guess because using other programming languages proves the point that there are other approaches, instead of reinforcing the status quo.

_3lin · on April 10, 2020

Exactly this

_3lin · on April 10, 2020

I believe that performance is irrelevant without correctness and security.

My opinion is that C and C++ can't bring enough security and correctness guarantees for mere mortals (lack of tooling, language features...).

Yes some correct and secure programs are written in C and C++ but it's not and will never be the norm.

adev_ · on April 10, 2020

By the sake of god, please stop to put C and C++ in the same basket when talking about security.

It just show you do not know what you are talking about.

Most security problems affecting C program DO NOT affect C++ programs.

Stack smash, vla abuse, string null termination problems, goto error control, double free corruption do NOT affect C++, they are C specific.

pjmlp · on April 11, 2020

Unfortunately they surely do, because a large set of developers writes C++ code full of C idioms.

Which is why Google has thrown out the towel and Android 11 will require hardware memory tagging for native code, and now everything is compiled with FORTIFY enabled.

Also Microsoft research shows otherwise, https://msrc-blog.microsoft.com/2019/07/16/a-proactive-appro...

> ~70% of the vulnerabilities Microsoft assigns a CVE each year continue to be memory safety issues

So yeah, you are correct that C++ does offer the tools not to write C like security holes.

Now you just need to convince a large spectrum of companies to actually stop doing C idioms while writing C++ code.

adev_ · on April 11, 2020

> Unfortunately they surely do, because a large set of developers writes C++ code full of C idioms.

That's an other problem, not technical but educational. A lot of (older) programmer came to C++ passing by C and continue to use C in C++.

That need time, education and guidelines to change that... a lot of time.

Changing mindset and programmer education is sometimes harder than changing the program itself.

> Now you just need to convince a large spectrum of companies to actually stop doing C idioms while writing C++ code.

That is already ongoing. However do not forget that C++ has a bagage of 25 years of code pre-C++11 to upgrade before arriving there.

pjmlp · on April 11, 2020

While I mostly agree, plenty of companies aren't going to change their coding, and outsourcing practices, until they hurt their button line.

_3lin · on April 10, 2020

C++ is too large and huge to not shoot yourself in the foot (or of your user's) in one way or another.

adev_ · on April 10, 2020

This argument has been debunked 20 times already.

pjmlp · on April 11, 2020

And 20 times more in security reports from Microsoft, Google and Apple.

fermentation · on April 10, 2020

Despise of C and an interest in high-performance, a unique mix.

pjmlp · on April 10, 2020

Back in the 80's C was anything but high performance.

Only with people willing to challenge the status quo do we move forward.

beagle3 · on April 10, 2020

It was always higher performance than e.g. Pascal or Basic on any relevant platform (the cost was lack of error checking, e.g. array bounds).

And it was slower than FORTRAN on most 32-but platforms such as DEC, Sun and IBM Unix workstations, VAXen and mainframes - but it was still the speed king on the most prevalent platform of the time, 8086/80286 and friends.

pjmlp · on April 10, 2020

Only as urban myth scattered around by the C crowd.

As user from all Borland product until they changed to Inprise, it was definitely not the case. Pascal and Basic compilers provided enough customization points.

When one of them wasn't fast enough versus Assembly, none of them were.

I used to have fun showing C dudes in demoscene parties how to optimize code.

Now, if you are speaking about the dying days of MS-DOS, when everyone was jumping into 32 bit extenders with Watcom C++, then we are already in another chapter of 16 bit compiler history.

beagle3 · on April 10, 2020

I used TP from 3.0 to 7.0 and a little bit of Delphi 1, and contemporary Turbo C; I dropped to assembly often, dropped TP bound checking often, and was well aware of all these controls.

Parsing with a *ptr++ in TC was not matched by TP until IIRC v7; 16 bit watcom often produced way better code than either TP or TC.

And, as you say, indeed when speed was really needed, you dropped to assembly; no compiler at the time would properly generate “lodsb” inside a loop, although watcom did in its late win3 target days IIRC.

pjmlp · on April 11, 2020

I cannot say I ever bother to benchmark parsing algorithms across languages in MS-DOS, so maybe that was a case where Turbo C might have won a couple of micro-benchmarks.

beagle3 · on April 11, 2020

That was just an example. In general, properly written TP code (properly configured) was on par with properly written TC code, and both were slower than properly written Watcom code in my experience - I did them all and switched frequently.

Parsing was one example where C shone above Pascal, and there were others. My experience was Watcom was consistently better, but in general C was sometimes easier/faster, Pascal was rarely easier/faster, and if speed mattered ASM was the only way.

pjmlp · on April 11, 2020

Well, as I mentioned in several comments, in what concerns my part of the world, in a time and age where a BBS was the best we could get for going online, Watcom did not even exist on my radar until MS-DOS 32bit extenders were relevant.

So we are forgetting here the complete 8 bit generation, and 3/4 of MS-DOS lifetime.

BubRoss · on April 10, 2020

This doesn't line up with the reality that almost all games were written in combinations of C and asm.

pjmlp · on April 10, 2020

During 8 bit days, all games that mattered were written in Assembly.

During the 16 bit days, Pascal, Basic, C, Modula-2, AMOS were the "Unity" of early 90's game developers, with serious games still being written in Assembly.

The switch to C occurred much later at the end of MS-DOS lifetime, when 386 and 486 were widespread enough, thanks success like Doom and Abrash books.

Easy to check from pouet, hugi, breakout, Assembly or GDC postmortem archives.

BubRoss · on April 10, 2020

The person you replied to said that C was the language of choice for speed and not rivaled by pascal or basic. What games were written in pascal or basic and known to be competitive with other high end games of the time?

pjmlp · on April 11, 2020

This kind of speed?

"Allen: Oh, it was quite a while ago. I kind of stopped when C came out. That was a big blow. We were making so much good progress on optimizations and transformations. We were getting rid of just one nice problem after another. When C came out, at one of the SIGPLAN compiler conferences, there was a debate between Steve Johnson from Bell Labs, who was supporting C, and one of our people, Bill Harrison, who was working on a project that I had at that time supporting automatic optimization...The nubbin of the debate was Steve's defense of not having to build optimizers anymore because the programmer would take care of it. That it was really a programmer's issue....

Seibel: Do you think C is a reasonable language if they had restricted its use to operating-system kernels?

Allen: Oh, yeah. That would have been fine. And, in fact, you need to have something like that, something where experts can really fine-tune without big bottlenecks because those are key problems to solve. By 1960, we had a long list of amazing languages: Lisp, APL, Fortran, COBOL, Algol 60. These are higher-level than C. We have seriously regressed, since C developed. C has destroyed our ability to advance the state of the art in automatic optimization, automatic parallelization, automatic mapping of a high-level language to the machine. This is one of the reasons compilers are ... basically not taught much anymore in the colleges and universities."

-- Excerpted from: Peter Seibel. Coders at Work: Reflections on the Craft of Programming

Back to your games list,

Most strategy games from SSI used compiled Basic and Pascal based engines. Only at the very end did they switch to C / C++.

Apogee has written several games in Turbo Pascal.

The games released by Oliver Twins on the BBC Micro, using a mix of Basic and Assembly. Which then eventually found Blitz Games Studios.

If one considers OS having the same performance requirements as games, Apple's Lisa and Mac OSes, written in a mix of Object Pascal and Assembly.

Also related to games, Adobe Photoshop was initially written in Pascal before going cross platform.

EDIT: Forgot to add some demos as well,

Demos from Denthor, tpolm.

Anything from Triton and first games from their Starbreeze studio.

MaxBarraclough · on April 11, 2020

> C has destroyed our ability to advance the state of the art in automatic optimization, automatic parallelization, automatic mapping of a high-level language to the machine. This is one of the reasons compilers are ... basically not taught much anymore in the colleges and universities.

How is C to blame for universities not teaching compilers?

BubRoss · on April 11, 2020

You didn't list any actual game titles, just game makers.

Also quoting someone saying that C destroyed the ability to make compiler optimizations is a little strange when that has been at the core of most software for decades. It's bizarre how much you try to argue about things with mountains of evidence to the contrary.

beagle3 · on April 11, 2020

While I don't necessarily agree with his claims, it is true that there's a huge gap of about 10-15 years between when FORTRAN compilers did some optimizations and when C compilers were able to do them (and only if you properly annotated things with __restrict, etc). I used FORTRAN77 compilers in the early 1990s that did vectorization / pipelining of the kind C compilers started doing in the last decade.

The main reason, though, is that in FORTRAN, the aliasing rules allow the compiler to assume basically anything, whereas C has sequence points and a (super weak) memory model which don't. But I wouldn't say it is C's fault.

pjmlp · on April 11, 2020

Apparently going into the history of the games produced by those game makers is asking too much.

That someone has done more for improving the computing world than either of us ever will.

See that is is the thing with online forums, I tell my point of view and personal lifetime experience, someone like you will dismiss it, then I reply, you dimiss it again as not fitting your view of the world, ask for yet another set of whatever stuff, and I will just watch Netflix for what I care, as I have better things to do with my life than win online discussions.

BubRoss · on April 11, 2020

Saying C stops optimizations is a preposterous claim in the modern era, credentials of someone quoted from decades ago have nothing to do with that.

It's hilarious that you would try to make that claim and then expect someone to buy into it.

Write whatever you want on your resume, it doesn't change reality.

z3phyr · on April 10, 2020

In my opinion, we should instead focus on hardware and experiment more with different kinds of cpus, memory, co-processors etc. The key to newer software systems are newer kinds of hardware, for which you can write newer experimental systems in the language of your own designs.

The sky is the limit, and there is so much to do! Transactional memory, massively multicore computers, hardware built on predicate logic, neuromorphic computers, and whatnot.

We are still mostly stuck with the cpu and memory designs of old.

hedora · on April 10, 2020

Some of the most secure software is written in C.

The language matters less than you’d think once you get past a certain correctness baseline.

_3lin · on April 10, 2020

I have not doubts that secure software can be written is C, but it's not the norm and it's too easy to introduce vulnerabilities in C for mere mortals.

bityard · on April 10, 2020

I mean, have you even _seen_ the trail of CVEs that Java has left in its wake over the past few decades?

burntsushi · on April 10, 2020

ripgrep does, and it's a big reason why it edges out GNU grep in a lot of common cases, especially for case insensitive searches. The most significant use of SIMD is the Teddy algorithm, which I copied from the Hyperscan project. I wrote up how it works here: https://github.com/BurntSushi/aho-corasick/blob/66f581583b69...

_3lin · on April 10, 2020

I wasn't aware! Added it.

Thank you very much for all your work. Your CLI tools are really making a positive impact on the world of development.

burntsushi · on April 10, 2020

Thanks, I appreciate it.

But could you take out the part that says it "aims to kill grep"? That definitely wasn't my intention. grep still has a place. See https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#int... and https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#can...

mastax · on April 10, 2020

The intended connotation of ripgrep wasn't "RIP Grep" but that it rips through searches, i.e. it is fast. I can't find the comment where he said this but burntsushi can confirm.

burntsushi · on April 10, 2020

Right, yes. It's a common enough mistake that it has its own FAQ entry now. :-)

https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#int...

reikonomusha · on April 10, 2020

A Common Lisp project that uses SIMD (specifically AVX2) is the Quantum Virtual Machine [1]. It’s a quantum computer simulator. Here [2] is part of the source that has the SIMD instructions.

It’s cool that with using SBCL, an implementation of Common Lisp, you can write compartmentalized assembly very easily in an otherwise extremely high-level language.

[1] https://github.com/rigetti/qvm

[2] https://github.com/rigetti/qvm/blob/master/src/impl/sbcl-avx...

corysama · on April 10, 2020

The megahertz-scaling "Free Lunch" was declared dead 15 years ago [http://www.gotw.ca/publications/concurrency-ddj.htm] and it's been only getting deader. People are finally, grudgingly accepting that they must go parallel unless we want to see software performance stagnate permanently. For most people here, the issue has been obvious since before they learned to program. But, still they are putting off learning how to deal with it. The first, obvious answer to that is threading. But, in my experience, SIMD is a bigger bang for the buck for two reasons: 1) No synchronization problems. 2) Better cache utilization. It's not just that SIMD forces you to work in large, contiguous blocks. Fun fact: When you aren't using SIMD you are only using a fraction of your L1 cache bandwidth!

A big challenge is that SIMD intrinsic-function APIs are weird. They have inscrutable function names and sometimes difficult semantics. What helped me greatly was going through the effort of writing #define wrappers for myself that just gave each function in SSE1-3 names that made sense to me. I don't expect many people to put in that effort. And, unfortunately, I don't have go-to recommendations for pre-existing libraries. Best I can do is:

https://github.com/VcDevel/Vc is working on being standardized into C++. It's great for processing medium-to-large arrays.

https://ispc.github.io/ is great for writing large, complicated SIMD features.

https://github.com/microsoft/DirectXMath is not actually tied to DirectX. It's has a huge library of small-vector linear algebra (3D graphics math) function. It used to be pretty tied to MS's compiler. But, I believe they've been cleaning it up to be cross compiler lately.

CyberDildonics · on April 10, 2020

Can you say more about non SIMD instructions not making full use of the L1 bandwidth? Is it just that even keeping all the integer units busy still doesn't equate to using all the bandwidth? I suppose that makes sense when adding up the numbers for clock cycles and bytes. I'm guessing this not common to point out since being limited to L1 cache bandwidth is so unlikely to be a program's main bottleneck.

corysama · on April 10, 2020

Intel's scalar pipelining does do an amazing job of keeping pipes busy. And, well-pipelined code can approximate SIMD performance. But, in practice to solidly get that kind of pipelining you need to pretty much write your scalar code as if you were emulating SIMD.

But, the point is that a 4-byte load instruction leaves 12 bytes of bandwidth on the table for many architectures -even with a perfect L1 cache hit.

I point it out because I usually get rebuttals that everything is memory bound (true) and that using the cache well is more important (true, but it turns out...).

TazeTSchnitzel · on April 10, 2020

> SIMD […] is a good alternative to multithreading

They are not alternatives to eachother, they are orthogonal things, unless you're using a GPU.

_3lin · on April 10, 2020

They can, but as explained in one of the article (by Cloudflare, "On the dangers of Intel's frequency scaling") SIMD in a multithreaded environment can cause performance problems due to CPU throttling.

So generally SIMD are used for single thread algorithms.

gameswithgo · on April 10, 2020

I'm not sure it is fair to say "Generally". Sometimes, you maybe don't want to multithread it. When I've used it, multithreading was still useful, despite downclocks, by huge margins.

And on AMD cpus, the downclock issue doesn't exist near as badly.

opportune · on April 10, 2020

If you're doing a scientific workload where you need to process all your vectors before moving on to the next part of the simulation, you're still better off multithreading your vectorized operations even with the throttling.

poorman · on April 10, 2020

Surprised Apache Arrow isn't on this list.

https://arrow.apache.org/ > Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single instruction, multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing.

singhrac · on April 10, 2020

Pretty much every neural network framework is aggressively SIMD-optimized (after all, that's kind of the point besides autodiff), not sure why Tencent's framework is picked..

If you know about it, I want to hear about more fast SIMD-based CLI tools that can replace my existing workflow (e.g. burntsushi's ripgrep or xsv).

_3lin · on April 10, 2020

Thank you for the feedback.

I picked Chinese technology because they are rarely promoted but really great.

Regarding the CLI tools it's a great question and I have opened a ticket for a future issue: https://gitlab.com/bloom42/open_source_weekly/-/issues/14

nickysielicki · on April 10, 2020

Plus many many more if they use the right compiler flags and use aligned types.

I wrote this a couple days ago: https://sielicki.github.io/posts/playing-around-with-autovec...

mynegation · on April 10, 2020

Exactly this! I am glad this list exists but even more interesting question is what are the reasons for the list like that to exist? Ideally it is up to the compiler to user target architecture to its maximum potential.

Every item on this list is a library, compiler optimization, or an idiomatic abstraction waiting to happen.

saagarjha · on April 10, 2020

Most compilers will by default generate portable code.

sharpneli · on April 10, 2020

SSE2 is mandatory on X64 and Arm64 has mandatory neon instructions.

So even portable code has simd insts available.

saagarjha · on April 10, 2020

…which compilers will emit when possible?

gameswithgo · on April 10, 2020

I have a couple of videos introducing intel SIMD intrinsics and how to use them well.

For Rust/C/C++: https://www.youtube.com/watch?v=4Gs_CA_vm3o

For C#: https://www.youtube.com/watch?v=8RcjQPbvvRU

dmos62 · on April 10, 2020

To someone just hearing about SIMD, anyone care to give an experience infused introduction? Is it worth the hassle only in rare cases?

dahart · on April 10, 2020

GPUs are SIMD machines, does that color whether it seems rare? Coming from a graphics background and working on GPUs, I am super biased, but I say it's very worth the hassle.

The easiest intro IMO is check out & play around in ShaderToy. You don't have to know much about SIMD to write shaders, but once you start paying attention to how the machine works, you can make it go really fast.

In general, using something like CUDA is similar to C++ you just have to make sure all your threads do as close to the same thing as possible in order to see good perf.

tombert · on April 10, 2020

If you are just hearing about it, there's a good chance that you're not doing any kind of elaborate matrix math, so it's tough to say if it's useful.

It's incredible useful if you're doing a lot of stuff involving matrices (graphics, image/video processing, neural-net stuff, stuff like that). If you have any interest in that topic, it's absolutely worth learning how to use SIMD (or at least learning a library that takes advantage of SIMD in your language of choice).

xioxox · on April 10, 2020

It's certainly useful for quite a lot of mathematical science code. In my experience compilers are not very good at autovectorizing anything but the most simple loops and writing SIMD intrinsics is necessary to obtain the maximum output of the processor.

calaphos · on April 10, 2020

The cpu just takes a bunch of numbers and applies the same operation on them in one instruction instead of a bunch single ones. E.g. if you are writing some simulations/game code where you deal with 3d vectors a lot the use case is there a lot. Is it worth writing the custom instruction intrinsics yourself? Most likely no as the compiler often (!) does a good job at auto vectorizing or the underlying library already uses them (I think c# vector types does).

augustt · on April 10, 2020

Compilers are really smart, so in many cases they will vectorize without even asking - no hassle involved. Some examples here: https://llvm.org/docs/Vectorizers.html

TazeTSchnitzel · on April 10, 2020

GCC and Clang will try their best to vectorise simple code, but if you try to write more complex algorithms, they will have a more difficult time and you may end up trying to massage your code to be more vectoriser-friendly. Auto-vectorisation is no substitute for a predictable SIMD programming model.

qppo · on April 10, 2020

They tend to stumble over data dependencies and for floating point math they can't reorder instructions as well (unless you explicitly disable IEEE754 compliance).

There are a number of structured approaches to rewriting code to be SIMD friendly, but you do need a degree of explicitness to get what you want out of the compiler.

opportune · on April 10, 2020

Some proprietary compilers provide tools to explicitly instruct a block of code to be vectorized via annotations

nickysielicki · on April 10, 2020

> Is it worth the hassle only in rare cases?

Vectorization is not always faster. It's important to understand that modern processors can perform work on >100 instructions in a given cycle, and not all instructions take equal amounts of time. So reducing a dozen instructions to a single instruction doesn't necessarily mean that the single instruction is going to be faster.

atq2119 · on April 10, 2020

This is misleading, at least if you're looking at a single core.

A single core can have >100 instructions in flight, but most of them will be waiting for their operands to arrive. The theoretical maximum throughput is gated by the instruction decoders and execution units, and is an order of magnitude lower.

nickysielicki · on April 10, 2020

That's fair to call me out on that, what I said wasn't clear. I think the broader point still stands though: don't look at the number of ISA instructions and think that fewer instructions necessarily means performance improves.

I think that we also miss a piece of the puzzle in terms of what happens once you leave ISA instructions and get into uops. There's magic inside there and it's completely opaque.

vardump · on April 10, 2020

> So reducing a dozen instructions to a single instruction doesn't necessarily mean that the single instruction is going to be faster.

Are you saying dozen scalar instructions can be faster than one vector instruction? That's wrong 100% of time on modern CPUs.

jcranmer · on April 10, 2020

One corner case that exists is that using AVX instructions imposes a frequency limit, although this isn't the case for SSE instructions.

There exist some vector instructions that are going to be slower than a non-equivalent sequence of scalar instructions: VPGATHER is going to be an easy such case.

However, I doubt there are going to be any cases where a vector instruction will take fewer clock cycles than its equivalent scalarized instructions. There are some where it might be equivalent--a vector of 2 elements performing an operation that can be issued twice a cycle is an easy example--but I can't think of any where it would be worse. If that were the case, then you should just implement the operation in hardware by scalarizing the uops (and some instructions appear to be so implemented--e.g., gather/scatter).

vardump · on April 10, 2020

VPGATHER is actually significantly faster than scalar loads at least on Skylake, possibly on some earlier CPUs as well (Broadwell?).

On Haswell where it was first introduced... yeah, not very fast, like you mentioned.

nickysielicki · on April 10, 2020

Yeah, I think there exists a case where this is possible. Let me play around with it this weekend and I'll comment here if I find anything.

vardump · on April 10, 2020

I know it can happen on some older ARM designs, microcontrollers and weird DSP-like etc. chips. But I can't think of any case on modern x86 chips at least.

Some low-performance ARMv7/8 designs can split NEON SIMD instructions into multiple clock cycles, but I think even then NEON is going to perform better.

nickysielicki · on April 13, 2020

I started digging into it but I don't have an intel machine with linux easily available and that's a prerequisite for looking at microop performance. I think you win.

vardump · on April 14, 2020

I don't need to win, I just want the truth to win. In other words, I'd consider it a win if I, you or anyone else learns something new.

SIMD is highly optimized at this point, with sustainable two instructions per clock throughput. It's hard to imagine scalar getting anywhere near even at 4 inst/clk rates.

haolez · on April 10, 2020

I found this curious regarding QuestDB[1]:

> Java 8 64-bit. We recommend Oracle Java 8, but OpenJDK8 will also work (although a little slower).

Anyone have an idea why?

[1]https://github.com/questdb/questdb

bluestreak · on April 10, 2020

This is because OracleJDK has more intrinsics than OpenJDK.

veselin · on April 10, 2020

I was wondering generally, is SIMD a good idea for general purpose CPUs. Imagine if the current high end CPUs had double the number of cores, no SIMD, but possibly higher frequency and the algorithms that benefit from SIMD were all run on integrated accelerators instead.

At least as a side observer it looks like a huge number of very large registers take large portions of a core, for sure consuming a lot of power as well, just to sit idle while the core is running JavaScript. Can somebody with CPU architecture experience say what is the real tradeoff here.

zamadatix · on April 10, 2020

Adding SIMD takes less space than adding cores and the use case where you need double the cores on a many core chip but aren't doing the same thing many times is pretty rare.

SIMD units don't need to consume power or limit the frequency of the rest of the chip while not being used Same as when JavaScript is running on one boosted core and the other 63 powersave. While being used SIMD units are more efficient than running 2x or 4x entire cores just to get the additional operation per clock.

jzelinskie · on April 10, 2020

Reminder that nothing is a panacea: I've heard from game engine authors and cryptographers that on Intel chips _over-using_ SIMD can actually heat up the chips too much such that it'll cause the system to then adjust the clockrate lower to cool down and you can degrade performance beyond not using SIMD at all. Before hearing that, I had never considered thermal properties of particular instructions.

corysama · on April 10, 2020

It's not a problem for SSE and AVX1. But, with AVX2/AVX-512, the deal is that you should not just dip your toe with an occasional call to a small SIMD task using such heavy-hitting features. Either do enough SIMD work to overcome the down-clock, or use a lower-end SIMD functionality for smaller tasks.

And, even within AVX2/512 there are huge sets of added functionality that are really "AVX1-enhanced" without going wider. Those are fine to use to without worrying about downclocking.

calaphos · on April 10, 2020

The AVX ALUs also go into power saving when not used and take a couple of cycles to switch back on, delaying the first AVX instruction. There is afaik even a paper on a side channel attack that uses this.

robocat · on April 10, 2020

“Intel cores can run in one of three modes: license 0 (L0) is the fastest (and is associated with the turbo frequencies written on the box), license 1 (L1) is slower and license 2 (L2) is the slowest. To get into license 2, you need sustained use of heavy 512-bit instructions, where sustained means approximately one such instruction every cycle. Similarly, if you are using 256-bit heavy instructions in a sustained manner, you will move to L1. The processor does not immediately move to a higher license when encountering heavy instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Otherwise, any other 512-bit instructions will move the core to L1: the processor stops and changes its frequency as soon as an instruction is encountered.”

Great info from here: https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...

Be careful to benchmark real loads as there are perverse interactions e.g. “Downclocking, when it happens, is per core and for a short time after you have used particular instructions (e.g., ~2ms).“ so a function using AVX512 can affect the speed of unrelated code (similar to thermal throttling).

saagarjha · on April 10, 2020

That’s probably AVX-512.

GordonS · on April 10, 2020

As a point of interest, you can even use SIMD (and other hardware intrinsics) in dotnet core, since 3.0, e.g. https://medium.com/@alexyakunin/geting-4x-speedup-with-net-c...

tarr11 · on April 10, 2020

Apache Lucene has recently started using SIMD to decode postings lists (Java)

https://issues.apache.org/jira/browse/LUCENE-9027

FZ1 · on April 10, 2020

Adding the obvious numpy vectorization - I presume that counts as an 'open source project'?

Or maybe this is limited to little personal projects, and not major libraries ?

truth_seeker · on April 10, 2020

JVM generates SIMD to certain extent, i wish other runtimes like V8 (Browser/NodeJS), Go, BEAM (Elixir/Erlang) etc did the same.

gameswithgo · on April 10, 2020

JVM does it to an extremely limited extent. Anything jitted, doesn't have a lot of time to do autovectorization. Even GCC/llvm are pretty limited at this, as it is just a hard problem, and doing it with floating point is problematic as it usually changes the result.

pjmlp · on April 10, 2020

To a limited extent that also takes advantage of AVX thanks to Intel contributions.

All modern JVMs take advantage of code caches and possibly PGO (depending on the configuration) between runs.

ART also does the same for NEON.

andrea_s · on April 10, 2020

Yandex ClickHouse also should be on the list!

vmchale · on April 10, 2020

gcc can do SIMD on its own quite surprisingly (to me)

https://github.com/vmchale/ats-codecount/blob/master/DATS/wc...

capableweb · on April 10, 2020

Looks like a blog post, so shouldn't really be a Show HN. Take a look at the guidelines: https://news.ycombinator.com/showhn.html

> This week has been particularly bad regarding security

Seems like a weird way to phrase it, when talking about security fixes. If it was about security vulnerabilities, I would understand you say it's bad, but in this case it's about fixing vulnerabilities, that's good right?

_3lin · on April 10, 2020

Oops, Fixed!

I prefixed with "Show HN" because I started this newsletter 7 weeks ago and wanted to 'show it' to HN.

You are right regarding the phrasing (I'm not a native English speaker, so any feedback is welcome). Fixed :)

The_rationalist · on April 10, 2020

The easiest way to benefit from SIMD is to use OpenMP SIMD directive on for loops.