Codelet: (Notes From) Reasoning About The Performance of Atomics


These are the notes from a (non-codelet) post that I don’t know when/if I’ll finish.

(If you’re wondering “What’s a codelet and why is it in the title of this blog post?”, it’s a silly name I’m using for my less involved, shorter, or less polished blog posts… for example… this, which is literally a pile of unedited notes in my typical “mess of nested lists” format)

Wall Of Caveats

So, a while ago I was going to write a blog post on how to reason about atomic performance on various hardware, mostly focusing on non-x86, since on x86 you can just measure (well, assuming you’re using it, which is likely). Non-x86 is much more of a pain to run benches on: even if you have an Android phone or whatever, it’s still a big hassle to do just for some benchmarks. Anyway, note that my experience here is fairly dated, which is part of why this needs more research.

I wrote a bunch of notes on stuff from experience, then did a decent amount of research, and the result is the notes below.

The biggest issues are that I never got around to:

  1. Checking how much of this stuff holds true today.
    • Most notably, PPC is super different than it used to be.
    • Also, everywhere I describe how the hardware works “under the hood”, it’s worth verifying against the architecture manuals as much as possible.
  2. Investigating more chips, in particular ARM flavors.
    • I have no real experience with ARMv6 or earlier, and am only just beginning to do this stuff on ARMv8 (e.g. aarch64).
    • A comparison of ARMv8 and IA64 would be neat, since they’re both very close to the C++11 (and, by extension, Rust) memory model.
  3. Taking some measurements, although note that measuring concurrent performance in microbenchmarks is Very Hard.
    • That said, while intuition (like this post hopes to impart) is helpful when optimizing, it… can be wrong, so you can’t go on it entirely.
    • Conversely, benchmarks aren’t wrong (although they might measure something you don’t expect), but have poor signal to noise.
    • This isn’t only true for atomics, but… The signal to noise problem becomes much worse when you add in threads. Try to measure your real application.
      • In particular, threading microbenchmarks often spawn a bunch of threads to do the same stuff. This is bad unless your application has no threads going on other than these.
    • Kinda off topic, tbh.

Anyway, there’s definitely stuff I like in here, in particular the explanation of LL/SC, which is about half of the reason I’m posting this.

Caveat: So, these are incomplete research notes. I’ve done a bit of touching up (even though I said I wouldn’t), but be cautious: there’s almost no way I didn’t get something in here at least some degree of wrong. Hopefully (and to my knowledge), I didn’t get anything massively wrong in ways that matter a lot, but I definitely wouldn’t be putting this up without this caveat. Also, it’s not really polished: expect tons of grammar errors, unfinished thoughts, bad explanations, todos, notes to self, and dodgy oversimplifications.

With that out of the way, let the overly-nested lists begin!


Basics / Terminology / Background

  • 3 types of atomic ops, broadly
    • load/store, AtomicFoo::{load, store}
    • fences, core::sync::atomic::fence
    • read-modify-write (aka RMW), like AtomicFoo::{compare_exchange, fetch_*, ...}. (slowest)
  • I’m assuming some familiarity here; start with the nomicon, and ideally familiarize yourself with the APIs in core::sync::atomic::*.
  • Some hardware implements general atomic updates with a “compare-and-swap” (CAS) instruction. Most use the pair of LL/SC instructions.
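  • In Rust terms, the three flavors look like this (a quick illustration using the std APIs; the function name is just for the example):
    use core::sync::atomic::{fence, AtomicU32, Ordering::*};

    fn examples(a: &AtomicU32) {
        let v = a.load(Relaxed); // plain atomic load
        a.store(v, Release);     // plain atomic store
        fence(SeqCst);           // standalone fence
        a.fetch_add(1, AcqRel);  // read-modify-write (RMW)
    }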

LL/SC

  • LL/SC stands for load-linked/store-conditional, a pair of instructions that together are used as an alternative to CAS on most weakly ordered processors.
  • The load-linked instruction (LL) loads a register as normal, but also does some bookkeeping so that store-conditional can be implemented.
  • The store-conditional instruction (SC) uses the bookkeeping from LL to say “store this new value if no writes have happened since the last LL.”
  • I gather the names “load-linked” and “store-conditional” refer to what they were called on the Alpha (an old, very weakly ordered processor); other architectures all name them different things.

  • Internally, LL/SC is generally implemented by the CPU like so:
    • LL tags the cache line holding the address with the processor ID in some manner.
    • Any write to anything on the cache line by another thread (atomic or not) clears that tag.
    • The store-conditional succeeds iff the tag hasn’t been cleared, meaning it fails if any other write to the cache line happens, regardless of where.
  • As probably the simplest LL/SC loop, this is what x.fetch_and(0x80, Relaxed) compiles as on ARMv7, cleaned up and manually commented.
.retry:
    ; linked_load is like a normal read, but can
    ; be used with store_conditional later.

    ; you can think of it as `ptr.claim_and_load()`
    ; if that helps

    ; `let loaded = ptr.linked_load()`
    ldrex   r1, [r0]

    ; `let to_store = loaded & 0x80`
    and     r2, r1, #128

    ; if there have been no writes to `ptr` since the load
    ; linked, it does the store and returns true, otherwise
    ; it returns false.

    ; you can think of it like `ptr.store_if_still_claimed(to_store)`
    ; if that helps

    ; `let success: bool = ptr.store_conditional(to_store)`
    strex   r3, r2, [r0]

    ; `if (!success) goto retry;`
    cmp     r3, #0
    bne     .retry
    ; carry on — old value (result of fetch_and) in `r1`
  • Looks a bit like a cmpxchg loop, but note that the comparison for the store-conditional can’t be changed; it’s just “have any writes occurred”.
  • The tracking for writes is on the granularity of a cache line and not whatever was loaded. This means that it can fail spuriously.
    • This is like compare_exchange_weak, and LL/SC needing a loop is why compare_exchange_weak exists (see the sketch after this list). Unfortunately, compare_exchange_weak is mostly a “man, LLVM should really get on this” thing, since it compiles the same as the strong version a lot of the time.
  • On the bright side, the arbitrary computation you get to do between the LL and the SC is nice and more flexible than what you get with CAS (but both are equally powerful).
  • Note that stronger-than-relaxed RMW ops are generally just implemented by adding fencing around the loop (sometimes inside) in a similar way as for non-RMWs.
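  • Since it came up: here’s a minimal Rust sketch of the kind of loop compare_exchange_weak is designed for. The load plays the role of the LL, and the weak CAS plays the role of the SC (like SC, it’s allowed to fail spuriously):
    use core::sync::atomic::{AtomicU32, Ordering::Relaxed};

    // A hand-rolled `fetch_and`, built from a `compare_exchange_weak` loop.
    fn fetch_and(atom: &AtomicU32, mask: u32) -> u32 {
        let mut old = atom.load(Relaxed);
        loop {
            match atom.compare_exchange_weak(old, old & mask, Relaxed, Relaxed) {
                // Success: return the previous value, like `fetch_and` does.
                Ok(prev) => return prev,
                // Failure (including spurious): we’re handed the value we saw
                // instead of `old`; retry with it.
                Err(seen) => old = seen,
            }
        }
    }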

CAS

  • CAS stands for compare-and-swap, which behaves like AtomicFoo::compare_exchange
  • TODO: write a good description like the LL/SC one (have half of it below)
  • x86 is basically the only arch that still matters that uses this. (I think SPARC is another relatively recent one, but it’s probably firmly in the “no longer matters” camp…)
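  • Until then, here’s a minimal sketch of the semantics, written as Rust pseudocode (pretend the whole function executes as one indivisible hardware step):
    // Compare-and-swap: store `new` only if the current value still equals
    // `expected`; either way, report what was there.
    fn compare_and_swap(ptr: &mut u32, expected: u32, new: u32) -> u32 {
        let current = *ptr;
        if current == expected {
            *ptr = new;
        }
        current // the caller checks `current == expected` to see if it succeeded
    }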

x86/x86_64

  • load/store with any ordering is basically free (aside from limiting the compiler a lot).
    • The major exception is store(_, SeqCst), which is a swap. This is pretty slow. (See the sketch at the end of this list.)
    • Note that here and elsewhere, when I say “free”, it’s just architecturally speaking.
    • It also quickly becomes less than free if that cache line is contended, in which case everything sucks…
  • For most RMW ops, there’s no real difference between orderings, even Relaxed vs. SeqCst.
    • For some operations, x86 can just use a lock prefix on the instruction:
      • lock xadd for atomic add
      • or lock cmpxchg for CAS/compare_exchange
      • note that xchg (swap) is implicitly locked (when used on memory, anyway).
    • That said, most stuff doesn’t have a dedicated fast way to do it, and so compiles to a CAS loop.
    • For example, a.fetch_and(0x80, _) from the LL/SC example is still something like (same for all orderings, so I’m omitting them):
            ; let old = atom.load();
            mov     eax, dword ptr [rdi]
        .retry:
            ; let new = old & 0x80;
            mov     ecx, eax
            and     ecx, 128
            ; if let Err(saw_instead_of_old) = atom.compare_exchange(old, new) {
            ;     old = saw_instead_of_old;
            ;     goto .retry;
            ; }
            ; note: CAS instruction is lock prefixed too
            lock    cmpxchg dword ptr [rdi], ecx
            jne     .retry
      
  • lock-prefixed instructions are slowest if there are other cores accessing stuff on the same cache line, and they’ll make those cores slow too, even if they don’t lock anything.
    • Note that, IIUC, it’s not really a metaphor: the processor actually takes a lock out (well, a latch) on the cache line on behalf of that instruction, so it reduces concurrency as well.
  • All fences are noops, except fence(SeqCst) which is mfence.
    • This is kinda an oversimplification, but for the purposes of atomics it’s true.
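  • As promised above, here’s a tiny Rust sketch annotated with the x86_64 instructions these typically lower to (easy to double-check on Godbolt):
    use core::sync::atomic::{AtomicU32, Ordering};

    fn stores(a: &AtomicU32) {
        a.store(1, Ordering::Relaxed); // mov
        a.store(2, Ordering::Release); // mov (still just a plain store)
        a.store(3, Ordering::SeqCst);  // xchg (implicitly locked; the slow one)
    }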

ARM 32bit

Caveat: the ARMv7s used in phones are what I’m most familiar with, so the further you get from that, the less true this stuff might be. I also suspect part of why some of these things aren’t so costly here (fencing isn’t as bad as on PPC) is that these chips tend not to have many cores.

  • Weak ordering: this means everything stronger than relaxed requires additional fences or special instructions.
  • One point to note is that 64-bit atomics are supported but inefficient compared to 32-bit ones, so the widest operations aren’t always the best here.
  • ARM32 has one relevant fence, and no general fenced-loads or fenced-stores that I’m aware of (and to be fair: these wouldn’t be very RISCey).
    • The fence is dmb ish, and it’s basically a fence(SeqCst) instruction.
    • A dmb ishst also exists, but I’ve never used it nor seen it emitted.
    • This costs around a cache miss: sometimes better, sometimes worse. It’s cheaper if fewer writes have happened since you last did one, I believe.
    • Pretty much every non-relaxed atomic is implemented by inserting this before the op, after the op, or both (when it’s both, the 2nd one is usually cheaper). See the sketch after this list.
  • RMW operations are especially costly; all require the LL/SC loop from above.
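  • For instance, here’s roughly where the fence lands for plain loads and stores on ARMv7 (worth re-verifying on Godbolt):
    use core::sync::atomic::{AtomicU32, Ordering};

    fn example(a: &AtomicU32) -> u32 {
        a.store(1, Ordering::Release); // dmb ish; str
        a.load(Ordering::Acquire)      // ldr; dmb ish
    }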

64bit ARM/aarch64

  • Also uses LL/SC for RMWs.
  • Actually has dedicated instructions for load(acq), store(rel), and fence(acq), and will use them (see the sketch below).
  • Pretty similar to arm32 otherwise AFAICT, but I don’t have a lot of experience here yet.
  • Adopts the C++11 memory model to a surprising extent.
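  • Concretely, the acquire/release memory ops become single instructions rather than fence+op pairs; roughly (again, worth double-checking on Godbolt):
    use core::sync::atomic::{AtomicU32, Ordering};

    fn example(a: &AtomicU32) -> u32 {
        a.store(1, Ordering::Release); // stlr
        a.load(Ordering::Acquire)      // ldar
    }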

PPC/POWER

Caveat: I haven’t written for this in a while, and what I wrote for was old and in-order, which I imagine is no longer the case. (TODO: probably verify or delete the bits about what the hardware does; I imagine lwsync and isync are probably close if the recent PPCs are out of order.)

  • Uses LL/SC loops for RMW.
  • No support for 64-bit atomics at all (even slowly) unless it’s ppc64.
  • PPC implements non-relaxed orderings similarly to arm32, i.e. by peppering stuff with fences. Its fences are weird, though (see the sketch at the end of this section).

  • Has multiple fences, which don’t quite map to the C++11/Rust model.
    • isync: Weakest fence, giving a subset of fence(Acquire) (the load-load subset)
      • Specifically, loads from after the isync can’t be ordered around loads before it, but stores from after it can (for fence(Acquire) neither loads nor stores can).
      • I don’t remember what this costs, but it’s cheaper than lwsync — it basically just flushes the instruction pipeline.
      • sometimes used on acquire loads and in other sequences of instructions
    • lwsync: Behaves basically as fence(AcqRel).
      • Doesn’t give SeqCst; doesn’t give store-load ordering. Costs around the same as ARM’s dmb ish, i.e. about a cache miss.
      • In the hardware, this just impacts your own processor core, serializing its queued writes and cache reads.
      • Critically, it doesn’t create a synchronization point with other cores, which is why it’s not SeqCst, but also why it’s cheaper.
      • Generally the one peppered about as the common case, unless you use SeqCst.
    • hwsync (aka sync): fence(SeqCst) (well, I think not quite exactly but close).
      • This is slow as hell, like, hundreds of cycles, and slower the more cores you have.
      • It’s costly enough that IBM’s docs oddly try to steer you away from it, suggesting you instead use eieio, disabling out-of-order I/O (which includes memory operations) altogether.
      • It forces all cores to come together and mark a sequence point, in addition to fully flushing basically everything that can be flushed on at least the current core.
      • Required for implementing any sort of SeqCst, where it will show up once or twice depending on the op and how it gets compiled. If it happens twice, the 2nd one still has to make a sync point on all cores, but the flushing should be faster then.
  • That said, the only people running PPC these days are enterprisey mainframes, so SeqCst to your heart’s content tbh.
  • Also who knows, maybe things have gotten better.
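  • For reference, here’s roughly how the Rust fences map to PPC’s (per the mapping page linked in Misc below; worth re-verifying against modern chips):
    use core::sync::atomic::{fence, Ordering};

    fn fences() {
        fence(Ordering::Acquire); // lwsync
        fence(Ordering::Release); // lwsync
        fence(Ordering::AcqRel);  // lwsync
        fence(Ordering::SeqCst);  // hwsync (the expensive one)
    }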

Misc

  • This page on the mapping between atomic ops and various hardware is great: https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html. Godbolt is also very useful.

  • Stronger orderings do impact the compiler’s ability to reorder stuff around the operations, even though I didn’t mention it above.
    • In practice I’ve never seen this be a difference maker, but YMMV.
    • The compiler will pretty much never remove use of any atomics, not even relaxed, although it would be nice if it did. (It will if you use the unstable unordered atomic intrinsics, so maybe we’ll someday have those for certain cases.)
  • Contention on the same cache line is what makes anything here many, many times worse; it causes the line to bounce back and forth between cores’ caches on every access.
    • Can be hard to avoid, given that the allocation quantum is smaller than a cache line (a common mitigation is sketched after this list).
    • Some particularly bad patterns are stuff like while !atomic.rmw(stuff, ...) { spins += 1; } when spins is on the same cache line as atomic.
      • On LL/SC arches under contention: the write to spins will cause other threads to fail the LL/SC on atomic, which they will retry, causing us to fail ours, and so on, leading to a livelock.
      • For clarity, this is fine and normal if they won’t be on the same cache line.
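  • A sketch of that mitigation (assuming 64-byte cache lines, which is common but not universal): align hot atomics so unrelated data can’t share their line.
    use core::sync::atomic::AtomicU64;

    // Anything wrapped in this gets its own cache line (64 bytes assumed).
    #[repr(align(64))]
    struct CacheLinePadded<T>(T);

    struct SpinState {
        atomic: CacheLinePadded<AtomicU64>, // on its own line
        spins: u64,                         // lands on a different line
    }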

Atomic Perf TL;DR

  • Except on PPC (where all SeqCst operations are very costly), using a stronger-than-needed ordering usually costs around a cache miss (at worst two, plus a pipeline flush).
    • However, often enough there’s no difference.
  • Contention (specifically, on the same cache line) is the real killer even if using relaxed. This is slow on all arches, even x86.
    • If you’re optimizing this code, target this!
    • Stuff like thread_local or per-cpu data helps, but often only under high contention, especially in the per-cpu case.
  • RMW ops are slow and complex (they compile to a loop) on non-CAS arches, i.e. everything that still matters except x86.
    • For clarity: most things other than x86 don’t have, say, fetch_add as a single instruction.
    • I’d love to give advice like “compare_exchange_weak is a better target than compare_exchange most of the time”, but LLVM doesn’t do a great job with it…

Anyway

Sorry about dumping this on you, but I figure someone will find it worth reading, and it being in my drafts does nobody any good.