The FreeBSD kernel only runs on platforms where 32-bit sized and aligned ordinary loads and stores are atomic; this is a requirement that it demands of the hardware. That may not be especially portable, and it may not hold for Swift, or for userspace under some very weird runtime environments, but it is true of the vast majority of CPU hardware out there.
It is important to separate the hardware and the language. The article is about Swift, and it doesn't matter what guarantees the hardware gives for 32-bit loads and stores. That is the compiler's job. If the language doesn't guarantee that a particular action will produce an atomic 32-bit load or store, then it may be compiled to something different.
This is why it is important to tell the language what you mean. Even in C, non-atomic accesses don't have these guarantees. It may seem pointless because in 99% of cases that "non-atomic" store compiles to the exact same instructions as a relaxed store. But that is just because you are getting lucky. The language doesn't guarantee it, and without the atomic store the compiler is well within its rights to emit something different (like two stores, spilling dirty values, ...).
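For example, a minimal C11 sketch of "telling the language what you mean" (the names here are mine, not from the article):

    #include <stdatomic.h>
    #include <stdint.h>

    uint32_t plain_word;            /* ordinary object: no atomicity guarantee from the language */
    _Atomic uint32_t atomic_word;   /* atomic object: even a relaxed store cannot be torn */

    void publish(uint32_t v) {
        plain_word = v;             /* the optimizer may split, merge, or elide this store */
        atomic_store_explicit(&atomic_word, v, memory_order_relaxed);  /* indivisible by definition */
    }

On most targets both lines compile to the same single 32-bit store today, which is exactly why the plain version looks safe right up until it isn't.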
“Platform” in my comment refers to the combination of hardware and language/compiler targeting that hardware. FreeBSD does not target the C abstract machine, only a handful of very specific platforms. I agree it would probably be better to explicitly state atomic requirements using the primitives provided by the language.
A problem identified in the article is that if you access a word-size object, your compiler may use a load or store of some different size, violating your assumptions about atomicity. This is not just about ISAs.
So FreeBSD can run only on simple single 32-bit CPUs with no cache (or no registers)? I would argue that outside of a few select embedded controllers there's almost no hardware out there that meets that requirement.
It's the assumption that the hardware operation is the same as the C abstract machine operation.
Hint: they're not, and that causes all kinds of subtle problems when programmers assume they are, especially when it comes to atomicity. The compiler caches a value in a register and two threads synchronize on that value? Boom. Or maybe no boom when one was expected.
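The classic shape of that failure, sketched in C (a contrived example of mine, not from the article):

    /* Two threads "synchronizing" through a plain int. */
    int ready;   /* not _Atomic, not volatile */

    void waiter(void) {
        /* The compiler is free to load `ready` once, keep it in a register,
           and turn this into an infinite loop -- or delete the loop entirely
           if it can prove `ready` was already nonzero. */
        while (!ready) { }
        /* ... carry on as if the other thread signalled us ... */
    }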
This. Even Java guarantees this. The only practical place this can be a problem is maybe 8/16-bit microcontrollers. What the hell is TFA talking about?
It's not guaranteed in C and C++. The author pointed out an example where an arm64 compiler split a word-size write into 2 store instructions because it resulted in smaller code size.
Interestingly, the Go memory model[1] does in fact guarantee atomicity of word-sized reads/writes:
> Otherwise, each read of a single-word-sized or sub-word-sized memory location must observe a value actually written to that location (perhaps by a concurrent executing goroutine) and not yet overwritten.
However, it allows the implementation to immediately exit and report an error as well.
While Rust won't let you share a non-atomic variable across threads without going through a lock of some form or using `unsafe`.
So for both Rust and Go you'll get atomic accesses when you need them, and for C etc you only get atomic accesses if you ask for them. Which pretty much speaks to the different programming models: Rust and Go will only compile the subset of possibly-valid programs that can be expressed by the language and proven by the compiler. C compilers will only reject code they can prove is incorrect.
This is touched on in Olivier Giroux' talk on forward progress in C++[1]. I've time-stamped the section on the roach motel problem, which ends with the observation that the execution model should capture "as-if isolated for a finite time." But even if C++ were to be improved in this way, you'd still only get the guarantee if you expressed the write as an atomic. There are no guarantees for non-atomic writes and shouldn't be.
But this gets to the deeper problem of the blog. There are three ways you can reason about these kinds of problems:
1. What happens with the execution of assembly language on the chip? Behavior is pretty well specified by the ISA, and you absolutely can reason operationally. The x86 memory model includes TSO, so you get fairly strong guarantees; it's basically acquire/release "for free," so you know you won't get tearing or other such things.
2. The formal memory model provided by the language. In the case of C and C++, it's also pretty well specified, and there's lots of work that's gone into it over many years. Again, it's possible to reason productively in this world, though it's quite different than case 1 - ultimately you've got causality graphs and other things expressed on an abstract machine. Ideally you'd do this work with formal proofs, but model checkers like CDSChecker can help a lot.
3. Informal reasoning based on an intuitive model of what the computer "should" do. This is serious YOLO territory, and basically a guarantee that whatever you write will be broken, possibly leading to serious security vulnerabilities. It's popular among a subset of confidently wrong HN commenters though, who I expect will come out in force in this thread.
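To make the contrast concrete, here's the same one-liner in C viewed through each lens (my example, not Giroux's):

    #include <stdint.h>

    /* 1. ISA view: on x86-64 this is a single `mov dword ptr [rdi], esi`,
          and TSO means no tearing plus fairly strong ordering.
       2. Language-model view: if another thread reads *p concurrently with no
          synchronization, it's a data race and undefined behavior, no matter
          which instruction happened to be emitted.
       3. Intuitive view: "it's one aligned word, it'll be fine" -- YOLO. */
    void set_word(uint32_t *p, uint32_t v) {
        *p = v;
    }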
> Even that isn’t guaranteed. If you don’t say “atomic”, the compiler might decide to split up a store to optimize for speed or code size! This isn’t hypothetical; Greg Parker ran into this with libobjc.
Maybe I'm tired and lack imagination, but please enlighten me. How can splitting up a store speed something up or reduce the code size? Assuming it is aligned and on a native 32-bit CPU.
The linked thread doesn't say exactly how, but I can construct some way in my head:
Final 32-bit store effectively stores two 16-bit values or'ed together as high-16 and low-16. You can get this without any shifting: a simple addition of the properly positioned values does the job. If those two sixteen-bit values are calculated along two very different paths with different lengths, the compiler might opt to spill one of them early, only to reload the spilled value just before the final addition.
From there it isn't too hard to see some liveness analysis determine that the reload, the addition, and the immediate store only affect half of the bits of the final store, so that part can be replaced with a narrower partial store.
Pure speculation on my part that this is what happened, but all of these intermediate steps are performed all the time by modern optimizing compilers.
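Pure guesswork as to what the actual libobjc code looked like, but the general shape that invites this is something like:

    #include <stdint.h>

    /* One 32-bit field assembled from two independently computed 16-bit halves.
       The abstract operation is a single word store, but since the object is
       not atomic, a compiler chasing code size (or avoiding a spill) may
       legally emit two halfword stores instead. */
    void compose(uint32_t *field, uint16_t lo, uint16_t hi) {
        *field = (uint32_t)lo | ((uint32_t)hi << 16);
    }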
Or if I understood it correctly, there are platforms like CHERI which have metadata about memory, for example to detect invalid access. This metadata might race with the data itself.
The hardware must, and does, ensure that the metadata (both addressable - bounds, permissions, etc - and non-addressable - the tag) is kept atomic with the address portion of the capability, as otherwise you would be able to forge capabilities via such races. That is, you will never see a torn capability write, and the tag is updated atomically with every write, capability or not. This is easy to do since capabilities are always within a single cache line.
The value being written was composed from multiple inputs. Instead of instructions to combine those inputs into a register and store them, it performed multiple partial stores of the values it was composed from. Think two u16s being ORed into a u32.
Maybe it chose this optimization due to register pressure, and not wanting to spill values that were still going to be used?
You don't even need unusual architectures - e.g. on ARM it's pretty easy to find a constant where it's cheaper to encode the value as two immediates and write it to memory as two halfword stores - see https://alisdair.mcdiarmid.org/arm-immediate-value-encoding/ for the immediate encoding. That can be quicker than loading those immediates into a register, shifting and or-ing them as necessary, and then doing a single store. Especially if your focus is on code size (for small platforms or cache utilization reasons).
And I wouldn't be surprised if there's some weird x86 encoding that helps code size similarly in some cases, with different encoded instruction lengths allowing for different immediate sizes.
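A sketch of the ARM case in C (the constant is just an illustrative pick of mine, not from the thread):

    #include <stdint.h>

    /* 0x00FF0001 is not encodable as a single ARM modified immediate, but each
       16-bit half (0x00FF and 0x0001) is. A size-optimizing compiler could
       plausibly prefer two STRH instructions over materializing the full
       constant in a register and doing one STR - and that's legal here,
       because the store is not atomic. */
    void write_magic(uint32_t *p) {
        *p = 0x00FF0001u;
    }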
The Go runtime has quite a few places where it assumes atomic load/store behavior of int-sized words, as well as the eventual propagation of those stores.
I always get a little anxious when looking at such code. But it seems to work well in practice?
Yeah, but there is no guarantee that a write of one goroutine will eventually become visible to other goroutines. So in practice the runtime expects stronger guarantees.
Memory models don't usually explicitly guarantee that writes "eventually become visible". They're usually written as ordering guarantees for when a write becomes visible, such as happens-before relationships. Obviously, for multithreaded programs to be useful, the writes have to eventually become visible to other threads/goroutines, just like you want all sorts of other operations to happen in finite time that are not explicitly guaranteed by standards (like whether a thread/goroutine eventually starts).
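A minimal C11 sketch of what such an ordering guarantee actually says (my example): nothing states when the release store becomes visible, only what must already be visible once the acquire load observes it.

    #include <stdatomic.h>

    int payload;              /* plain data */
    _Atomic int ready;        /* synchronization flag */

    void producer(void) {
        payload = 42;
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire)) { }
        return payload;       /* guaranteed to be 42: writes before the release
                                 store happen-before reads after the acquire load */
    }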
Well said. But how much finite time are we talking about (for writes to become visible)? Does it differ between architectures? Can there be extreme edge cases?
I guess I’m less worried about the atomic nature of the operation and more about the way the writes become visible. That seems to be entirely hardware dependent?
If a language is self-hosting, that means it must guarantee that code which is not explicitly atomic in that language can be translated into instructions which are explicitly atomic in the CPU ISA, because that is how the language's explicit atomics get implemented.
I don't feel that the article's take is useful or informative; it amounts to gatekeeping ISA parallelism.