Codelet: (Notes From) Reasoning About The Performance of Atomics
These are the notes from a (non-codelet) post that I don’t know when/if I’ll finish.
(If you're wondering "What's a codelet and why is it in the title of this blog post?", it's a silly name I'm using for my less involved, shorter, or less polished blog posts… for example… this, which is literally a pile of unedited notes in my typical "mess of nested lists" format)
Wall Of Caveats
So, a while ago I was going to write a blog post on how to reason about atomic performance on various hardware, mostly focusing on non-x86, since on x86 you can just measure (well, assuming you're using it, which is likely), whereas everything else is much more of a pain to run benches on: even if you have an Android phone or whatever, it's still a big hassle to do just for some benchmarks. Anyway, note that my experience here is fairly dated, which is part of why this needs more research.
I wrote a bunch of notes on stuff from experience, then did a decent amount of research, and wrote these notes
The biggest issues are that I never got around to:
- Checking how much of this stuff holds true today.
- Most notably, PPC is super different than it used to be.
- Also, everywhere I describe how the hardware works "under the hood", it's worth verifying against the architecture manuals as much as possible.
- Investigating more chips, in particular ARM flavors.
- I have no real experience with ARMv6 or earlier, and am only just beginning to do this stuff on ARMv8 (e.g. aarch64).
- A comparison of ARMv8 and IA64 would be neat, since both are very close to the C++11 (and, by extension, Rust) memory model.
- Some measurements, although note that measuring concurrent performance in microbenchmarks is Very Hard.
- That said, while intuition (like this hopes to impart) is helpful when optimizing, it… can be wrong, so you can’t go on it entirely.
- Conversely, benchmarks aren’t wrong (although they might measure something you don’t expect), but have poor signal to noise.
- This isn’t only true for atomics, but… The signal to noise problem becomes much worse when you add in threads. Try to measure your real application.
- In particular, threading microbenchmarks often spawn a bunch of threads to do the same stuff. This is bad unless your application has no threads going on other than these.
- Kinda off topic, tbh.
Anyway, there's definitely stuff I like in here, in particular the explanation of LL/SC, which is about half of the reason I'm posting this.
Caveat: So, these are incomplete research notes. I've done a bit of touching up (even though I said I wouldn't), but be cautious: there's almost no way I didn't get something in here at least some degree of wrong. Hopefully (and to my knowledge) I didn't get anything massively wrong in ways that matter a lot, but I definitely wouldn't be putting this up without this caveat. Also, it's not really polished: expect tons of grammar errors, unfinished thoughts, bad explanations, todos, notes to self, and dodgy oversimplifications.
With that out of the way, let the overly-nested lists begin!
Basics / Terminology / Background
- 3 types of atomic ops, broadly:
- load/store: `AtomicFoo::{load, store}`
- fences: `core::sync::atomic::fence`
- read-modify-write (aka RMW), like `AtomicFoo::{compare_exchange, fetch_*, ...}` (slowest)
- I'm assuming some familiarity here; start with the nomicon, and ideally familiarize yourself with the APIs in `core::sync::atomic::*` (tiny example of all three kinds at the end of this list).
- Some hardware implements general atomic updates with a "compare-and-swap" (CAS) instruction. Most use the pair of LL/SC instructions.
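Just to pin down which APIs I mean, here's a tiny example touching all three kinds (the specific orderings here are arbitrary; this isn't meant to be a useful program):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering::{Acquire, Relaxed, Release, SeqCst}};

static COUNTER: AtomicUsize = AtomicUsize::new(0);

pub fn demo() -> usize {
    COUNTER.store(1, Release);        // plain store
    let seen = COUNTER.load(Acquire); // plain load
    fence(SeqCst);                    // standalone fence
    COUNTER.fetch_add(seen, Relaxed)  // read-modify-write (RMW)
}
```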
LL/SC
- LL/SC stands for load-linked/store-conditional, which are two instructions that together are used as an alternative to CAS on weakly ordered processors.
- The load-linked instruction (LL) basically loads a register as normal, and does some bookkeeping so that store-conditional can be implemented.
- The store-conditional instruction (SC) uses the bookkeeping from LL to say "store this new value if no writes have happened since the last LL."
- I gather the names "load-linked" and "store-conditional" refer to what they were called on the Alpha (an old, very weakly ordered processor); other architectures all name them different things.
- Internally LL/SC are generally implemented by the CPU by:
- having LL tag the cache line holding the address with the processor ID in some manner.
- any writes to anything on the cache line by other threads (atomic or not) will clear that tag,
- the store-conditional succeeds iff the tag hasn’t been cleared, meaning it fails if other writes on the cache line happen, regardless of where.
- As probably the simplest LL/SC loop, this is what `x.fetch_and(0x80, Relaxed)` compiles to on ARMv7, cleaned up and manually commented:

```
.retry:
; linked_load is like a normal read, but can
; be used with store_conditional later.
; you can think of it as `ptr.claim_and_load()`
; if that helps
; `let loaded = ptr.linked_load()`
ldrex r1, [r0]
; `let to_store = loaded & 0x80`
and r2, r1, #128
; if there have been no writes to `ptr` since the load
; linked, it does the store and returns true, otherwise
; it returns false.
; you can think of it like `ptr.store_if_still_claimed(to_store)`
; if that helps
; `let success: bool = ptr.store_conditional(to_store)`
strex r3, r2, [r0]
; `if (!success) goto retry;`
cmp r3, #0
bne .retry
; carry on: old value (result of fetch_and) in `r1`
```
- Looks a bit like a cmpxchg loop, but note that the comparison for store-conditional can't be changed; it's just "have writes occurred".
- The tracking for writes is on the granularity of a cache line and not whatever was loaded. This means that it can fail spuriously.
- This is like `compare_exchange_weak`, and LL/SC needing a loop is why `compare_exchange_weak` exists. Unfortunately, `compare_exchange_weak` is mostly a "man, LLVM should really get on this" thing, since it compiles the same as the strong version a lot of the time. (There's a rough Rust version of this kind of loop at the end of this section.)
- On the bright side, the arbitrary computation you get to do between the LL/SC is nice and more flexible than what you get with CAS (but both are equally powerful).
- Note that stronger-than-relaxed RMW ops are generally just implemented by adding fencing around the loop (sometimes inside) in a similar way as for non-RMWs.
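To tie the LL/SC loop back to the Rust-level API: here's roughly what writing that kind of RMW by hand with `compare_exchange_weak` looks like. This is my own sketch (function name included), not anything authoritative; the point is that on an LL/SC machine the weak compare-exchange is the thing that can map to a single LL/SC pair, which is exactly why it's allowed to fail spuriously:

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

// Roughly the loop that `fetch_and(0x80, Relaxed)` boils down to, by hand.
fn fetch_and_by_hand(atom: &AtomicU32, mask: u32) -> u32 {
    let mut old = atom.load(Relaxed);
    loop {
        let new = old & mask;
        match atom.compare_exchange_weak(old, new, Relaxed, Relaxed) {
            Ok(_) => return old, // old value, like fetch_and returns
            // Failed (a real conflict or a spurious SC failure):
            // we get the value we saw instead, and retry.
            Err(actual) => old = actual,
        }
    }
}
```

(The standard library's `fetch_update` is basically this loop packaged up for you.)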
CAS
- CAS stands for compare-and-swap, which behaves like `AtomicFoo::compare_exchange` (rough sketch of the semantics below).
- TODO: write a good description like the LL/SC one (have half of it below)
- x86 is basically the only arch that still matters that uses this. (I think SPARC is another recent one, but it's probably firmly in the "no longer matters" camp…)
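Until that description exists, here's a rough sketch of the semantics. This is my own illustration (a real CAS is a single instruction; the mutex below is only there to express "this all happens as one indivisible step"):

```rust
use std::sync::Mutex;

// What a CAS does, written out: compare the current value against `expected`,
// and only if they match, store `new`. Either way, report what was actually there.
fn cas_sketch(cell: &Mutex<u32>, expected: u32, new: u32) -> Result<u32, u32> {
    let mut val = cell.lock().unwrap();
    if *val == expected {
        *val = new;
        Ok(expected)
    } else {
        Err(*val)
    }
}
```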
x86/x86_64
- load/store with whatever ordering is basically free (aside from limiting the compiler a lot).
- major exception is `store(_, SeqCst)`, which is a swap. This is pretty slow. (See the snippet at the end of this section.)
- Note that here and elsewhere: when I say "free", it's just architecturally speaking.
- It also will quickly become less than free if that cache line is contended, in which case everything sucks…
- No real difference in cost between orderings for most RMW ops, even Relaxed vs. SeqCst.
- For some operations, x86 can just use `lock` on the instruction:
- `lock xadd` for atomic add
- or `lock cmpxchg` for CAS/compare_exchange
- note that `xchg` (swap) is implicitly `lock`ed (when used on memory, anyway).
- That said, most stuff doesn’t have a dedicated fast way to do it, and so compiles to a CAS loop.
- In some cases, an unnecessary one https://bugs.llvm.org/show_bug.cgi?id=37974
- (TODO: note that this is only true if you use the result; the example below could be a `lock and` if it were just `a.fetch_and(0x80, _);` with the result discarded. See the snippet after the asm.)
- For example, `a.fetch_and(0x80, _)` from the LL/SC example still compiles to something like this (same for all orderings, so I'm omitting them):

```
; let old = atom.load();
mov eax, dword ptr [rdi]
.retry:
; let new = old & 0x80;
mov ecx, eax
and ecx, 128
; FIXME: is this too abstracted from what's happening?
; if let Err(saw_instead_of_old) = atom.compare_exchange(old, new) {
;     old = saw_instead_of_old;
;     goto .retry;
; }
; note: the CAS instruction is lock prefixed too
lock cmpxchg dword ptr [rdi], ecx
jne .retry
```
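- To illustrate the result-used vs. result-discarded distinction from the TODO above, here's my example (the exact codegen depends on your LLVM version, so check Godbolt):

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

pub fn and_discarding(a: &AtomicU32) {
    // Old value unused: this can be a single `lock and`.
    a.fetch_and(0x80, Relaxed);
}

pub fn and_using(a: &AtomicU32) -> u32 {
    // Old value used: there's no `lock and` that hands back the old value,
    // so this is the cmpxchg loop shown above.
    a.fetch_and(0x80, Relaxed)
}
```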
- `lock`-prefixed instructions are slowest if there are other cores accessing stuff on the same cache line, and it will make them slow too, even if they don't `lock` anything.
- Note that IIUC it's not really a metaphor: the processor is actually taking a lock out (well, a latch) on the cache line on behalf of that instruction, so it reduces concurrency as well.
- All fences are noops, except `fence(SeqCst)`, which is `mfence`.
- This is kinda an oversimplification, but for the purposes of atomics it's true.
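Putting the x86 notes above in one place, here's a little example with the instructions I'd expect as comments (my annotations; worth double-checking on Godbolt for your exact compiler):

```rust
use std::sync::atomic::{AtomicU32, Ordering::{Relaxed, SeqCst}};

pub fn x86_flavors(a: &AtomicU32) -> u32 {
    a.store(1, Relaxed);              // plain `mov`
    a.store(2, SeqCst);               // `xchg`: the slow swap case mentioned above
    let old = a.fetch_add(3, SeqCst); // `lock xadd`, same as it would be for Relaxed
    old + a.load(SeqCst)              // loads are plain `mov`s regardless of ordering
}
```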
ARM 32bit
Caveat: the ARMv7s used in phones are what I'm most familiar with, so the further away from that you get, the less true this stuff might be. I also suspect part of why some of these things aren't so costly here (fencing isn't as bad as on PPC) is that these chips tend not to have many cores.
- Weakly ordered; this means everything stronger than Relaxed requires additional fences or special instructions.
- One point to note is that 64-bit atomics are supported but inefficient compared to 32 bit ones, so the widest operations aren’t always the best here.
- ARM32 has one relevant fence, and no general fenced-load or fenced-store instructions that I'm aware of (and to be fair: these wouldn't be very RISCey).
- The fence is `dmb ish`, and it's basically a `fence(SeqCst)` instruction.
- A `dmb ishst` also exists, but I've never used it nor seen it emitted.
- This costs around a cache miss (sometimes better, sometimes worse); it's cheaper if fewer writes have happened since you last did it, I believe.
- Pretty much every non-relaxed atomic is implemented by inserting this before the op, after the op, or both (when it's both, it's usually cheaper the 2nd time). A sketch of the load/store lowerings follows this list.
- RMW operations are especially costly; they all require the LL/SC loop from above.
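To make the "pepper in `dmb ish`" point concrete, here's a small example with the ARMv7 lowerings I'd expect from the standard C++11 mappings as comments (my annotations, not freshly verified):

```rust
use std::sync::atomic::{AtomicU32, Ordering::{Acquire, Release, SeqCst}};

pub fn armv7_flavors(a: &AtomicU32) -> u32 {
    a.store(1, Release); // dmb ish; str
    a.store(2, SeqCst);  // dmb ish; str; dmb ish
    a.load(Acquire)      // ldr; dmb ish
}
```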
64bit ARM/aarch64
- Also uses LL/SC for RMWs
- actually has dedicated instructions for load(acq), store(rel), and fence(acq), and will use them (quick sketch at the end of this section).
- pretty similar to arm32 otherwise AFAICT, but I don’t have a lot of experience here yet.
- adopts the C++11 memory model to a surprising extent.
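Same sketch for aarch64, where the dedicated load-acquire/store-release instructions show up instead of standalone fences (again, my annotations; check Godbolt):

```rust
use std::sync::atomic::{AtomicU64, Ordering::{Acquire, Release, SeqCst}};

pub fn aarch64_flavors(a: &AtomicU64) -> u64 {
    a.store(1, Release); // stlr
    a.store(2, SeqCst);  // also just stlr
    a.load(Acquire)      // ldar
}
```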
PPC/POWER
Caveat: I haven't written for this in a while, and what I wrote for was old and in-order, which I imagine is no longer the case.
(TODO: probably verify or delete the bits about what the hardware does; I imagine `lwsync` and `isync` are probably close if the recent PPCs are out of order)
- Uses ll/sc loops for RMW.
- No support for 64-bit atomics at all (not even slow ones) unless it's ppc64.
- PPC implements non-relaxed atomics similarly to arm32, e.g. by peppering stuff with fences. Its fences are weird, though.
- Has multiple fences, which don't quite map to the C++11/Rust model.
- `isync`: Weakest fence, giving a subset of `fence(Acquire)` (the load-load subset).
- Specifically, loads after the isync can't be reordered before loads that precede it, but stores after it can be (for `fence(Acquire)`, neither loads nor stores can).
- I don't remember what this costs, but it's cheaper than `lwsync`; it basically just flushes the instruction pipeline.
- Sometimes used for acquire loads and in other sequences of instructions.
- `lwsync`: Behaves basically as `fence(AcqRel)`.
- Doesn't give SeqCst, and doesn't give store-load ordering. Costs around the same as arm's `dmb ish`, e.g. a cache miss.
- In the hardware this just impacts your processor core, serializing its queued writes and cache reads.
- Critically, it doesn't create a synchronization point with other cores, which is why it's not SeqCst, but also why it's cheaper.
- Generally this is the one peppered about as the common case, unless you use SeqCst.
- `hwsync` (aka `sync`): `fence(SeqCst)` (well, I think not quite exactly, but close).
- This is slow as hell, like, hundreds of cycles, and slower the more cores you have.
- It's costly enough that IBM's docs oddly try to get you not to use it, and suggest using `eieio`, which disables out-of-order IO (which includes memory operations) altogether, instead.
- It forces all cores to come together and mark a sequence point, in addition to fully flushing basically everything that can be flushed on at least the current core.
- Required for implementing any sort of `SeqCst`, for which it will show up once or twice depending on the op and how it gets compiled. If it happens twice, the 2nd time still has to make a sync point on all cores, but the flushing should be faster then. (Sketch of where these show up at the end of this section.)
- That said, the only places running PPC these days are enterprisey mainframes, so SeqCst to your heart's content tbh.
- Also who knows, maybe things have gotten better.
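For reference, here's the same sort of sketch for POWER, with the lowerings I'd expect from the standard C++11 mappings as comments (my annotations; this is exactly the kind of thing the TODO above says to re-verify):

```rust
use std::sync::atomic::{AtomicU32, Ordering::{Acquire, Release, SeqCst}};

pub fn power_flavors(a: &AtomicU32) -> u32 {
    a.store(1, Release);     // lwsync; st
    a.store(2, SeqCst);      // hwsync; st
    let x = a.load(Acquire); // ld, then a dependent branch plus isync
    x + a.load(SeqCst)       // hwsync; ld, then branch plus isync
}
```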
Misc
- This page on the mapping between atomic ops and various hardware is great: https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html. Godbolt is also very useful.
- Stronger orderings do impact compiler’s ability to reorder stuff around the operations, even though I didn’t mention it.
- In practice I’ve never seen this be a difference maker but YMMV
- The compiler will pretty much never remove use of any atomics, not even relaxed, although it would be nice if it did. (It will if you use the unstable `unordered` atomic intrinsics, so maybe we'll someday have those for certain cases.)
- Contention on the same cache line is what makes everything here many, many times worse, since it causes the line to bounce back and forth between cores' caches.
- Can be hard to avoid given that allocation quantum < cache lines.
- Some particularly bad patterns are stuff like `while !atomic.rmw(stuff, ...) { spins += 1; }` when `spins` is on the same cache line as `atomic`.
- On LL/SC arches under contention: the write to `spins` will cause other threads to fail their LL/SC on `atomic`, which they will retry, causing us to fail ours, etc. etc., leading to a livelock.
- For clarity, this is fine and normal if they won't be on the same cache line. (A sketch of keeping them apart follows this list.)
Atomic Perf TL;DR
- Except on PPC (where all SeqCst operations are very costly), using a stronger-than-needed atomic is usually a cache miss (at worst two and a pipeline flush).
- However, often enough there’s no difference.
- Contention (specifically, on the same cache line) is the real killer even if using relaxed. This is slow on all arches, even x86.
- If you’re optimizing this code, target this!
- Stuff like thread_local or per-cpu data is good, but often only worth it for high contention, especially in the per-cpu case (tiny sketch at the end of this list).
- RMW ops are slow and complex (they compile to a loop for most ops) on non-CAS arches (everything that still matters except x86).
- For clarity, most things other than x86 don’t have, say, fetch_add as an instruction.
- I’d love to give advice like “compare_exchange_weak is a better target than SeqCst most of the time” but LLVM doesn’t do a great job with it…
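And a tiny sketch of the thread_local flavor of that advice, since it's the part people ask about; the names and the flush threshold are made up, and whether batching like this wins depends entirely on your workload:

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

static GLOBAL_COUNT: AtomicU64 = AtomicU64::new(0);

thread_local! {
    static LOCAL_COUNT: Cell<u64> = Cell::new(0);
}

pub fn bump() {
    LOCAL_COUNT.with(|c| {
        let n = c.get() + 1;
        if n >= 1024 {
            // One (possibly contended) RMW per 1024 bumps instead of one per bump.
            GLOBAL_COUNT.fetch_add(n, Relaxed);
            c.set(0);
        } else {
            c.set(n);
        }
    });
}
```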
Anyway
Sorry about dumping this on you, but I figure someone will find it worth reading, and it being in my drafts does nobody any good.