The FreeBSD kernel only runs on platforms where 32-bit sized and aligned ordinary loads and stores are atomic; this is a requirement that it demands of the hardware. That may not be especially portable, and it may not hold for Swift, or for userspace under some very weird runtime environments, but it is true of the vast majority of CPU hardware out there.
It is important to separate the hardware and the language. The article is about Swift, and it doesn't matter what guarantees the hardware gives for 32-bit loads and stores. That is the compiler's job. If the language doesn't guarantee that a particular action will produce an atomic 32-bit load or store, then it may be compiled to something different.
This is why it is important to tell the language what you mean. Even in C, non-atomic accesses don't have these guarantees. It may seem pointless because in 99% of cases that "non-atomic" store compiles to the exact same instructions as a relaxed store. But that is just because you are getting lucky. The language doesn't guarantee it, and without the atomic store the compiler is well within its rights to emit something different (like two stores, spilling dirty values, ...).
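For example, a minimal C11 sketch of "telling the language what you mean" (the names here are mine, not from the article):

    #include <stdatomic.h>
    #include <stdint.h>

    uint32_t plain_word;            /* ordinary object: no atomicity guarantee from the language */
    _Atomic uint32_t atomic_word;   /* atomic object: even a relaxed store cannot be torn */

    void publish(uint32_t v) {
        plain_word = v;             /* the optimizer may split, merge, or elide this store */
        atomic_store_explicit(&atomic_word, v, memory_order_relaxed);  /* indivisible by definition */
    }

On most targets both lines compile to the same single 32-bit store today, which is exactly why the plain version looks safe right up until it isn't.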
“Platform” in my comment refers to the combination of hardware and language/compiler targeting that hardware. FreeBSD does not target the C abstract machine, only a handful of very specific platforms. I agree it would probably be better to explicitly state atomic requirements using the primitives provided by the language.
A problem identified in the article is that if you access a word-size object, your compiler may use a load or store of some different size, violating your assumptions about atomicity. This is not just about ISAs.
So FreeBSD can run only on simple single 32-bit CPUs with no cache (or no registers)? I would argue that outside of a few select embedded controllers there's almost no hardware out there that meets that requirement.
It's the assumption that the hardware operation is the same as the C abstract machine operation.
Hint: they're not, and that causes all kinds of subtle problems when programmers assume they are, especially when it comes to atomicity. The compiler caches a value in a register and two threads synchronize on that value? Boom. Or maybe no boom when one was expected.
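The classic shape of that failure, sketched in C (a contrived example of mine, not from the article):

    /* Two threads "synchronizing" through a plain int. */
    int ready;   /* not _Atomic, not volatile */

    void waiter(void) {
        /* The compiler is free to load `ready` once, keep it in a register,
           and turn this into an infinite loop -- or delete the loop entirely
           if it can prove `ready` was already nonzero. */
        while (!ready) { }
        /* ... carry on as if the other thread signalled us ... */
    }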
This. Even Java guarantees this. The only practical place this can be a problem is maybe 8/16-bit microcontrollers. What the hell is TFA talking about?
It's not guaranteed in C and C++. The author pointed out an example where an arm64 compiler split a word-size write into 2 store instructions because it resulted in smaller code size.
Interestingly, the Go memory model[1] does in fact guarantee atomicity of word-sized reads/writes:
> Otherwise, each read of a single-word-sized or sub-word-sized memory location must observe a value actually written to that location (perhaps by a concurrent executing goroutine) and not yet overwritten.
However, it allows the implementation to immediately exit and report an error as well.
While Rust won't let you share a non-atomic variable across threads without going through a lock of some form or using `unsafe`.
So for both Rust and Go you'll get atomic accesses when you need them, and for C etc you only get atomic accesses if you ask for them. Which pretty much speaks to the different programming models: Rust and Go will only compile the subset of possibly-valid programs that can be expressed by the language and proven by the compiler. C compilers will only reject code they can prove is incorrect.
This is touched on in Olivier Giroux' talk on forward progress in C++[1]. I've time-stamped the section on the roach motel problem, which ends with the observation that the execution model should capture "as-if isolated for a finite time." But even if C++ were to be improved in this way, you'd still only get the guarantee if you expressed the write as an atomic. There are no guarantees for non-atomic writes and shouldn't be.
But this gets to the deeper problem of the blog. There are three ways you can reason about these kinds of problems:
1. What happens with the execution of assembly language on the chip? Behavior is pretty well specified by the ISA, and you absolutely can reason operationally. The x86 memory model includes TSO, so you get fairly strong guarantees; it's basically acquire/release "for free," so you know you won't get tearing or other such things.
2. The formal memory model provided by the language. In the case of C and C++, it's also pretty well specified, and there's lots of work that's gone into it over many years. Again, it's possible to reason productively in this world, though it's quite different than case 1 - ultimately you've got causality graphs and other things expressed on an abstract machine. Ideally you'd do this work with formal proofs, but model checkers like CDSChecker can help a lot.
3. Informal reasoning based on an intuitive model of what the computer "should" do. This is serious YOLO territory, and basically a guarantee that whatever you write will be broken, possibly leading to serious security vulnerabilities. It's popular among a subset of confidently wrong HN commenters though, who I expect will come out in force in this thread.
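To make the contrast concrete, here's the same one-liner in C viewed through each lens (my example, not Giroux's):

    #include <stdint.h>

    /* 1. ISA view: on x86-64 this is a single `mov dword ptr [rdi], esi`,
          and TSO means no tearing plus fairly strong ordering.
       2. Language-model view: if another thread reads *p concurrently with no
          synchronization, it's a data race and undefined behavior, no matter
          which instruction happened to be emitted.
       3. Intuitive view: "it's one aligned word, it'll be fine" -- YOLO. */
    void set_word(uint32_t *p, uint32_t v) {
        *p = v;
    }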
> Even that isn’t guaranteed. If you don’t say “atomic”, the compiler might decide to split up a store to optimize for speed or code size! This isn’t hypothetical; Greg Parker ran into this with libobjc.
Maybe I'm tired and lack imagination, but please enlighten me. How can splitting up a store speed something up or reduce the code size? Assuming it is aligned and on a native 32-bit CPU.
The linked thread doesn't say exactly how, but I can construct some way in my head:
Final 32-bit store effectively stores two 16-bit values or'ed together as high-16 and low-16. You can get this without any shifting: a simple addition of the properly positioned values does the job. If those two sixteen-bit values are calculated along two very different paths with different lengths, the compiler might opt to spill one of them early, only to reload the spilled value just before the final addition.
From there it isn't too hard to see some liveness analysis determine that the reload, the addition, and the immediate store only affect half of the bits of the final store, so that part can be replaced with a narrower partial store.
Pure speculation on my part that this is what happened, but all of these intermediate steps are performed all the time by modern optimizing compilers.
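Pure guesswork as to what the actual libobjc code looked like, but the general shape that invites this is something like:

    #include <stdint.h>

    /* One 32-bit field assembled from two independently computed 16-bit halves.
       The abstract operation is a single word store, but since the object is
       not atomic, a compiler chasing code size (or avoiding a spill) may
       legally emit two halfword stores instead. */
    void compose(uint32_t *field, uint16_t lo, uint16_t hi) {
        *field = (uint32_t)lo | ((uint32_t)hi << 16);
    }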
Or if I understood it correctly, there are platforms like CHERI which have metadata about memory, for example to detect invalid access. This metadata might race with the data itself.
The hardware must, and does, ensure that the metadata (both addressable - bounds, permissions, etc - and non-addressable - the tag) is kept atomic with the address portion of the capability, as otherwise you would be able to forge capabilities via such races. That is, you will never see a torn capability write, and the tag is updated atomically with every write, capability or not. This is easy to do since capabilities are always within a single cache line.
The value being written was composed from multiple inputs. Instead of instructions to combine those inputs into a register and store them, it performed multiple partial stores of the values it was composed from. Think two u16s being ORed into a u32.
Maybe it chose this optimization due to register pressure, and not wanting to spill values that were still going to be used?
You don't even need unusual architectures - e.g. on ARM it's pretty easy to find a constant where it's cheaper to encode the value as two immediates and write it to memory as two halfword stores - see https://alisdair.mcdiarmid.org/arm-immediate-value-encoding/ for the immediate encoding. That can be quicker than loading those immediates into a register, shifting and or-ing them as necessary, and then doing a single store. Especially if your focus is on code size (for small platforms or cache utilization reasons).
And I wouldn't be surprised if there's some weird x86 encoding that helps code size similarly in some cases, with different encoded instruction lengths allowing for different immediate sizes.
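A sketch of the ARM case in C (the constant is just an illustrative pick of mine, not from the thread):

    #include <stdint.h>

    /* 0x00FF0001 is not encodable as a single ARM modified immediate, but each
       16-bit half (0x00FF and 0x0001) is. A size-optimizing compiler could
       plausibly prefer two STRH instructions over materializing the full
       constant in a register and doing one STR - and that's legal here,
       because the store is not atomic. */
    void write_magic(uint32_t *p) {
        *p = 0x00FF0001u;
    }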
The Go runtime has quite a few places where it assumes atomic load/store behavior of int-sized words, as well as the eventual propagation of those stores.
I always get a little anxious when looking at such code. But it seems to work well in practice?
Yeah, but there is no guarantee that a write of one goroutine will eventually become visible to other goroutines. So in practice the runtime expects stronger guarantees.
Memory models don't usually explicitly guarantee that writes "eventually become visible". They're usually written as ordering guarantees for when a write becomes visible, such as happens-before relationships. Obviously, for multithreaded programs to be useful, the writes have to eventually become visible to other threads/goroutines, just like you want all sorts of other operations to happen in finite time that are not explicitly guaranteed by standards (like whether a thread/goroutine eventually starts).
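A minimal C11 sketch of what such an ordering guarantee actually says (my example): nothing states when the release store becomes visible, only what must already be visible once the acquire load observes it.

    #include <stdatomic.h>

    int payload;              /* plain data */
    _Atomic int ready;        /* synchronization flag */

    void producer(void) {
        payload = 42;
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire)) { }
        return payload;       /* guaranteed to be 42: writes before the release
                                 store happen-before reads after the acquire load */
    }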
Well said. But how much finite time are we talking about (for writes to become visible)? Does it differ between architectures? Can there be extreme edge cases?
I guess I’m less worried about the atomic nature of the operation and more about the way the writes become visible. That seems to be entirely hardware dependent?
If a language is self-hosting, that means it must guarantee that code which is not explicitly atomic in that language can be translated into instructions which are explicitly atomic in the CPU ISA, because that is how the language's explicit atomics get implemented.
I don't feel that the article's take is useful or informative; it amounts to gatekeeping ISA parallelism.