Crust of Rust: Atomics and Memory Ordering

Reddit Comments

u/jonhoo Thank you for taking the time to go through this. I've read through the docs, but having it explained like this helps cement it.

👍 53 · u/MrHalzy · Apr 02 2021

I'm a bit sad that rust accepts Ordering values that don't make sense. For example, the store methods on atomic types panic if you pass in Acquire or AcqRel orderings. Why couldn't rust have defined a separate enum here so it isn't possible to pass in semantically wrong orderings? Seems like it would be more consistent with the way the rest of std is usually built. A compile-time error is better than a runtime panic.

👍 29 · u/loonyphoenix · Apr 02 2021

After watching with interest (Thanks!): fence documentation in the cppreference is here: atomic_thread_fence

👍 3 · u/kryps · Apr 03 2021

A question I had while watching:

Why does compare_exchange return Result<usize, usize> instead of Result<(), usize>? In the success case we already know the previous value since we passed it, why does the function need to return it again?

👍 3 · u/flaghacker_ · Apr 03 2021

Does anyone know what his vim config is? Which vim plugin does he use with rust-analyzer?

👍 5 · u/EarthyFeet · Apr 03 2021
Captions
Hello folks, welcome back to yet another Crust of Rust video. In this one I'm going to tackle a topic that keeps being suggested over and over: atomics and memory ordering. If you don't know what those are, that's fine; I'm going to go through them in a decent amount of detail over the course of this. This is a topic I've been hesitant to go into, partially because there isn't really great documentation for how this stuff works, so it's very easy for me to explain something and get it wrong, which I don't like to do — not because I'm worried about being wrong, but because I'm worried about putting content out there that people rely on, and that content being wrong. The other reason I haven't tackled it is that I felt like I was still grappling with some of the concepts myself, and I was worried that would translate into a poor explanation. I feel like I've now read through enough of the documentation and worked with this material enough that I can give an explanation that is both correct and understandable, so that's what we're going to try to do today.

The core of the video is the Rust atomic types and the memory orderings as observed in Rust, so the stream will be Rust-specific. At the same time, most of this translates to basically any other language that has atomics in this sense: C and C++, to some extent Java too (although its memory model is a little different), and the same with Go. Hopefully you can take what you learn here and apply it to other languages as well; the underlying memory ordering stuff is useful to know regardless of what language you're working in.

Most of what we're talking about today is the std::sync::atomic module from the Rust standard library. This module has pretty good docs on the high-level ideas of why these types exist and what they're used for, but there's a lot of detail that makes it hard to get things right and to understand all the subtleties of how the different types interact.

We're going to start with: why do we need atomics in the first place? You'll see that the documentation lists a bunch of types like AtomicBool, AtomicIsize, AtomicUsize, AtomicI8, and so on, and you might wonder, why not just use bool or isize or usize — the primitive types? There are a couple of reasons. The primary one is that if you have a primitive type like a bool or usize that's shared across thread boundaries, there are only certain ways to interact with that value that are safe, and I mean "safe" here in the sense that data races are undefined behavior: if you don't control how different threads operate on the same memory location, you run into undefined behavior, which, as we've talked about in previous streams, is just bad. The other reason is that it makes sense to have slightly different APIs for these types, because when you operate on the atomic versions you're really issuing different instructions to the underlying CPU and placing different limitations on what code the compiler is allowed to generate.
We'll get into what those differences are and why they're important, but the core of it is that if you have shared access to some memory location, you need additional information about each access to let the CPU know: when should different threads see the operations that other threads do, and how do they have to synchronize? If one thread writes to a value and another one reads it, what are the guarantees about which values the reader will see? Will it always read exactly the latest one, and what does "latest" even mean? And for the other reads and writes in the program, in both of those threads, which of them are visible to the other thread?

In general, if you had something like a usize, and let's say you found a sound way to do data racing — one thread just wrote to the value and another thread just read from it, and it was a plain usize — there would actually be no real guarantee that the reading thread ever sees the value that was stored. In practice the CPU and the memory system will usually make that be the case, but you're not guaranteed it; nothing in the specification says it has to be. This is where atomics come into play: they let you place limitations and rules on the use of the type and the value, and on which values can be exposed when.

When I say "the specification", or the rules the compiler has to follow, what I'm really talking about is the language's memory model. There's the Rust documentation and the standard library docs, there are the RFCs, and then there's the Rust Reference. The idea is that the Reference should fully specify the language, such that if someone else came along and wanted to implement a compiler for the Rust language, they would know exactly what to implement and how it should behave. It has a section on the memory model, and it says: Rust does not yet have a defined memory model; various academics and industry professionals are working on various proposals, but for now this is an under-defined place in the language. Which is kind of unhelpful if you're trying to write anything that uses atomics and concurrency. In practice it's not actually that bad, because, partially since Rust relies on LLVM, Rust generally follows the C memory model — in particular the C11 memory model, which I think is the most recent variant of it. So what we're actually going to follow here is the memory ordering documentation from cppreference.com (sorry for the bright screen, this one doesn't have a dark mode). It's not that C++ is exactly what we match; we're using it because it has fairly good documentation for what the different memory orderings mean, and good examples of what goes wrong. We're not going to read that page much directly; it's just where many of the explanations, examples, and guarantees I'll be talking about come from, just so you're aware.

So let's start with a type like AtomicUsize. This is the atomic equivalent of the usize type: if you look at its in-memory representation, it's exactly the same as a usize.
The only real difference between an AtomicUsize and a usize is that it has different methods on it, which we'll see in a second, and you can only access the value through those methods. You can't get at the usize without calling one of them; it's not trivially castable into a usize, for example. It's the same size as a usize, too. The only real difference is what instructions get generated when you access the value.

If we look at the methods on AtomicUsize, there's a constructor that gives you a new one. An AtomicUsize — and this applies to all the atomic types — is not inherently shared; it's a value, placed on the stack. So if you want to share it across thread boundaries, you can't just create a single AtomicUsize and hand it out. Well, you can hand out shared references to it, but shared references to something on the stack won't be 'static, and that usually gets you into pain. So generally, if you have one of these atomic types, you stick it on the heap using something like Box, or more frequently an Arc, and that lets you share a pointer — or a reference, depending on how you want to phrase it — to that shared atomic value, which you can then update.

The reason this works, and the main differentiator between AtomicUsize operations and usize operations, is that you can operate on an AtomicUsize through a shared reference to self. Notice that for a normal usize you need an exclusive reference in order to modify it; that's not the case for AtomicUsize, where a shared reference is sufficient. The reason is that the compiler generates special CPU instructions that make it safe for multiple threads to access the value at the same time.

Okay, so let's first survey the methods; we'll go into what each of them means and how you might use them in a second. First and foremost there are load and store, and they do basically what you'd expect: load reads out the value stored in the AtomicUsize (so in this case it returns a usize), and store takes a usize and stores it into the AtomicUsize. Similarly there's swap, which sort of does both. You'll notice these take an additional Ordering argument, and we will talk plenty about Ordering. Ordering is just an enum with a handful of variants, and what those variants mean — what kind of semantics they establish — is part of what this video is about. Fundamentally, the Ordering tells the compiler which set of guarantees you expect for this particular memory access, with respect to things that might be happening in other threads at the same time.

The other methods are compare_and_swap, compare_exchange, and compare_exchange_weak; we'll talk about the differences between those later in the stream. These are basically ways of reading a value and also swapping in a new one, but doing so conditionally, and in one atomic step. When I say one atomic step: I'll show you in a second what that means, but basically, if you do a load and then a store, there's a chance that some other thread comes along and does something in the meantime — it gets to run between your load and your store — whereas a compare-and-swap is a single operation where no other thread can get in between.
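To make the shape of the API concrete, here is a minimal sketch (my example, not code from the stream) of sharing an AtomicUsize across a thread via Arc and using load/store/swap through a shared reference:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    // Same in-memory representation as a usize, but mutable through &self,
    // and every access takes an Ordering.
    let v = Arc::new(AtomicUsize::new(0));

    let handle = {
        let v = Arc::clone(&v);
        thread::spawn(move || v.store(1, Ordering::Relaxed))
    };
    handle.join().unwrap();

    assert_eq!(v.load(Ordering::Relaxed), 1);
    // swap writes a new value and returns the old one in a single step.
    assert_eq!(v.swap(2, Ordering::Relaxed), 1);
}
```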
There are also a bunch of fetch_ methods: fetch_add, fetch_sub, fetch_and, fetch_nand. These similarly try to avoid anything happening between when you load the value and when you change it. fetch_add, for example, will atomically — as a single step — load the current value and add to it, without it being possible for any other thread to execute, modify the value, or read the value in between those operations. We'll talk about why that's useful, and why these are separate from the compare methods, in a second. Before we dive into actually using one of these, I'm going to take some questions on what I've covered so far, because I went through a bunch of things fairly rapidly, and it's useful to make sure we have a shared foundation.

"Don't u64s on x86 have atomic access, or am I confusing it with something else?" On some platforms the non-atomic types do have additional guarantees. For example, I think on Intel x86, if you have a 64-bit value, any access to it is automatically atomic, assuming it doesn't cross a cache-line boundary — I think it needs to be an aligned access. That means that in theory, if you're down in raw assembly and you know you're on x86-64, you could do this without AtomicUsize. But in the standard library we want to expose a common interface that always works, so we can't expose "just use usize through shared references on x86-64". Instead, everything is modeled through the atomic types; it's just that on a target like that, an AtomicU64, for example, will generally be free. That's not quite true either, but close enough.

"Why is it non-exhaustive?" Ordering is marked #[non_exhaustive], which is a great attribute if you haven't used it before. It says that no code is allowed to assume these are all the variants that will ever be in Ordering, so if you match on an Ordering you always need to include an underscore or else branch. The standard library wants the ability to add additional orderings later if necessary; the biggest one, I think, is the Consume ordering, which C++ has but Rust currently doesn't. I don't know whether it will be added, but if we want the ability to add things later, the enum needs to be non-exhaustive.

"Is the Ordering enum related to the different memory models of the architectures, like x86 or ARM?" The different orderings aren't related to the memory architecture; they're related to what guarantees the operation gives you. How different architectures implement those guarantees varies from architecture to architecture, as we'll see in a second.

"What's the difference between an atomic and a mutex?" With an atomic there's no locking: multiple threads can operate on the value at the same time in some reasonably well-defined way. With a mutex, one thread gets to access the value at a time and all other threads have to wait, and the mutex usually guards a larger section of code: I grab the mutex, I run this code, then I release the mutex, and no other thread is allowed to execute anything under that mutex while I'm in this "critical section", as it's called.
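As a small illustration of the fetch_ methods (my example, not the stream's): a shared counter is the classic case where fetch_add alone is enough and no mutex is involved anywhere.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let counter = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    // One atomic read-modify-write: no other thread can run
                    // between the load of the old value and the store of the new one.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    // A racy load-then-store version could lose increments; fetch_add cannot.
    assert_eq!(counter.load(Ordering::Relaxed), 4 * 1_000);
}
```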
With an atomic there's nothing really like that. You can maybe think of it as a mutex that guards just a single memory access, if you want, but it's much more efficient than that.

"Do those atomic operations then not block other threads, potentially?" No. All atomic operations are lock-free. They're not necessarily wait-free: on certain architectures that don't have, say, a native fetch-add, it's implemented in terms of compare-and-swap, so you may end up waiting on other threads — but they're considered lock-free; there's no mutex in there.

The atomic operations are also not just per CPU architecture. It's not that they only change which instructions are emitted; they also change the compiler's semantics, as we'll see in a second when we start looking at the different orderings. They limit both what the CPU can do and what the compiler can do around a given memory access. The point is, even if you're on something like Intel x86-64, you still want to use the atomic types, because you need guarantees from the compiler as well.

"store, swap, and friends all take an immutable reference to self, not &mut self. I guess that means they rely on UnsafeCell?" That's a good question, and I think you're right; we can check pretty easily. It's defined through a macro, of course... I see a self.v.get(), which is usually an UnsafeCell, so my guess is that all of these atomic types contain an UnsafeCell that they call get() on to obtain the pointer to the value, and then they issue the actual assembly instructions on that. And in fact, if we go to the top of the macro we see the definition: a pub struct for the atomic type — this is what the macro expands to — which just has a field v that holds an UnsafeCell of the underlying integer type. So if you have an AtomicUsize, it's really just an UnsafeCell<usize>, plus the special instructions used to access the value behind it. This ties back to the earlier unsafety stream where we talked about interior mutability: UnsafeCell is, at the fundamental level, the only way to get mutable access through a shared reference.

"You said atomics are generally shared via Arc rather than Box, but the only reason for that is that Arc is Send and Sync while Box is not, so it's simply more convenient, right?" That's not quite true. If I Box a value and place it on the heap, I have an owned value; but if I spawn two threads, both of them generally require that the closure you pass in has a 'static lifetime. If you pass a reference to the Box to both threads, that reference does not have a 'static lifetime; it's tied to the stack lifetime of the Box. With an Arc, you can clone the Arc and give each thread its own individually owned, and therefore 'static, Arc to the AtomicUsize. That's why Arc is usually used over Box. You don't have to do it that way, though. For example, there's Box::leak: it leaks a value on the heap, so the destructor is never called, which gives you back a 'static reference that you can then pass to both threads.
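A minimal sketch of the Box::leak alternative just described (assumed usage, not code from the stream):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    // Box::leak never frees the allocation, so it hands back a &'static
    // reference, which is Copy and can be moved into any number of
    // 'static-requiring spawned closures — no Arc needed.
    let x: &'static AtomicUsize = Box::leak(Box::new(AtomicUsize::new(0)));

    let t1 = thread::spawn(move || x.fetch_add(1, Ordering::Relaxed));
    let t2 = thread::spawn(move || x.fetch_add(1, Ordering::Relaxed));
    t1.join().unwrap();
    t2.join().unwrap();

    assert_eq!(x.load(Ordering::Relaxed), 2);
}
```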
Great. Now that we have a shared foundational understanding of why these types exist, let's get into what you might actually use them for, and in particular what this memory ordering business is. So here we'll do a cargo new — actually, make it a binary — and we'll call it... "nothang" is from a previous stream; we'll call it "atomics", because I'm feeling boring today. Fine, just to make it stop complaining.

Let's say we want to try to implement our own mutex type. I'm going to define a Mutex<T>, and a Mutex<T> is going to hold an UnsafeCell<T> — the value we're going to hand out — and we're also going to need AtomicBool and Ordering from std::sync::atomic. A mutex is the combination of a boolean that marks whether or not someone currently holds the lock, and an UnsafeCell of the inner type, so that if we hold the lock we can give out a mutable reference. Then there's an impl block for Mutex<T>. I'm not going to implement the usual guard mechanism, because that's not that interesting for this particular stream; instead I'll write a with_lock method that takes an impl FnOnce(&mut T) -> R — basically a closure that it will call, with a mutable reference to the T, once it holds the lock — and returns the R. For now, let's just make it call f; that's not actually going to work yet, but for now.

I want to caveat this: what we're implementing here is known as a spin lock, because if the lock is currently held, we're going to spin until it's no longer held — we might yield or something, but fundamentally it's a spin. Don't use spin locks. You almost never want a spin lock, and you almost never want to implement your own mutex. There's a great article by matklad, "Spinlocks Considered Harmful" — I'll post it in chat — read it, and don't implement your own spin locks, and probably don't use spin locks in the first place. But we're going to do it here because it's a good exercise in understanding what this stuff is for.

So how is this going to work? I guess we need a new method too; let's make these pub. Wow, I can't spell today. pub fn new takes the value to initially create the mutex with, and it creates a Self with locked: AtomicBool::new(UNLOCKED) — I'll add some constants to make the code a little nicer, LOCKED is true and UNLOCKED is false — so it starts out unlocked, and the value is UnsafeCell::new(t). So far so good.

Now we have a constructed Mutex, and the question becomes: how do we take the lock? Let's start with the naive way to do it: while self.locked.load(...) != UNLOCKED, spin; then self.locked.store(LOCKED); call f and capture its return value; set it back to UNLOCKED; and return ret. The boolean here is whether the mutex is locked, so true means locked and false means unlocked. That's the naive implementation: wait while the lock is held... oh yeah, you're right, this should be self.v.get() here, and it has to be unsafe, because what we're asserting — the safety comment — is: we hold the lock, therefore we can create a mutable reference.
All right, so now we have something that seems reasonable: we wait for the lock to become unlocked, then we store the fact that it's locked so that no one else can take it, then we call the function with a mutable reference — and we're allowed to create that mutable reference because no other thread can be in the critical section at the same time. (Trust me, I will get to compare-and-swap; I know this is wrong, just hang with me for a second.) And then we store that it's now unlocked, so other threads can access the value. Great, so far so good.

This code has some problems, and the first is that it doesn't compile. Why not? Because load needs an argument: as we saw in the documentation, it requires an Ordering, and it's not immediately clear which ordering that should be, because we don't know what the orderings mean yet — maybe you already know, in which case great, but for now, who knows. We're just going to use Ordering::Relaxed, because we're relaxed people; it seems fine not to be super strict about things — we like anarchy, apparently. So we just do a relaxed load, whatever that means, and see if it's unlocked. Great: it now compiles, and in fact, if we try to run it, it would probably work.
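Putting the pieces described so far together, the naive version looks roughly like this (a sketch reconstructed from the description above, not the exact code from the stream; the bug it contains is discussed next):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

const LOCKED: bool = true;
const UNLOCKED: bool = false;

pub struct Mutex<T> {
    locked: AtomicBool,
    v: UnsafeCell<T>,
}

impl<T> Mutex<T> {
    pub fn new(t: T) -> Self {
        Self {
            locked: AtomicBool::new(UNLOCKED),
            v: UnsafeCell::new(t),
        }
    }

    pub fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        // BROKEN, as discussed below: another thread can also observe UNLOCKED
        // between this load and the store, and then both threads end up in the
        // critical section at the same time.
        while self.locked.load(Ordering::Relaxed) != UNLOCKED {}
        self.locked.store(LOCKED, Ordering::Relaxed);
        // SAFETY (the claim): we hold the lock, so no other &mut T exists.
        let ret = f(unsafe { &mut *self.v.get() });
        self.locked.store(UNLOCKED, Ordering::Relaxed);
        ret
    }
}
```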
So here's what I'm going to do. I'm going to create a lock with Mutex::new — with a little bit of ugliness, because why not — we'll use 1 as the initial value for now, and it's going to be a &'static to whatever type we implemented. Then we're just going to spawn some threads. This is going to be a silly test: lots of threads, all trying to modify the value at the same time through the lock, and we'll see whether it ends up doing the right thing, because we totally got this right, right? Each thread is going to call l.with_lock, take the value, and all it's really going to do is v += 1, in a for loop, a hundred times. Actually, let's do ten threads that each do this a hundred times. And I guess we have to wait for these threads, so handles is a Vec — in fact, let's do this the hardcore Rust way and collect all the thread handles, just so we can wait for them afterwards. So we spin up 10 threads, each thread increments the value a hundred times (oh, what am I doing here... right, I'll get to that in a second, it's worth explaining), and then at the end we do for handle in handles { handle.join() }, because we want to wait for each thread to exit. Once all the threads have exited, surely we can assert that l.with_lock(|v| *v) — just reading the value out from behind the lock — is now 10 times 100. Because we have a lock, all of the increments should actually happen; there should be no bad stuff going on here, right?

This complains that UnsafeCell cannot be shared between threads safely, which is correct: UnsafeCell in general does not implement Send or Sync, because it's inherently so unsafe that we want people to opt in to Send and Sync for their own type. In this case we know that Mutex is in fact Sync — and this is the important bit — where T: Send. So we implement Sync for Mutex<T>, so that you can concurrently access it from multiple threads, as long as the value T is Send. The reason we need the T: Send bound is that the lock can be taken from multiple different threads, and each of those threads might take the value. In fact... does this need to be T: Send + Sync, because the closure might access the value as well? No, it does not: we never concurrently access the inner value from multiple threads at the same time.

All right, so now we have code that surely works, right? Someone pointed out that the collect isn't needed — the collect is needed: without it, this would still be an iterator, but you'd join each thread the moment you spawn it, so you'd only ever have one thread running at a time. "Initial value is one" — oh, you're right, that should be zero, good call. Great, so we have perfect code, right? If I now run this — what did I call it, atomics — with --release to make it fast... great, this works, our lock is perfect. What if we increase the concurrency and do this a lot more times: not a hundred, a thousand? Oh nice, works so well. So clearly this code is entirely right. Fantastic. It works, ship it, nothing more to worry about.
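The test described above, sketched out (it assumes the Mutex<T> from the earlier sketch; the thread and iteration counts are the ones used on stream):

```rust
use std::thread;

// Uses the Mutex<T> from the sketch above. UnsafeCell is not Sync, so we opt
// in explicitly: sharing the Mutex across threads is fine as long as T can be
// sent to whichever thread currently holds the lock.
unsafe impl<T: Send> Sync for Mutex<T> {}

fn main() {
    // Box::leak gives a &'static Mutex we can copy into every spawned closure.
    let l: &'static Mutex<usize> = Box::leak(Box::new(Mutex::new(0)));

    let handles: Vec<_> = (0..10)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..100 {
                    l.with_lock(|v| *v += 1);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    // 10 threads x 100 increments each.
    assert_eq!(l.with_lock(|v| *v), 10 * 100);
}
```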
Except it turns out there are some problems with this; they're just really hard to reproduce. In particular, remember how I said one of the reasons we need atomics, and operations like compare-and-swap, is that here we do a load and then a store, and in between, maybe another thread runs. Imagine two threads call with_lock concurrently, and the mutex is currently unlocked. Both threads see that it's unlocked, so both of them leave the while loop, both get down to the store, and both store LOCKED — but neither sees that the other one has stored LOCKED, because they've both already left the loop. Then they both construct a mutable reference to the same value, which is undefined behavior; they both call f with it; and then they both unlock the mutex. So it's clearly possible for this to happen; it's just that generally the computer is so fast that it doesn't.

We can do things to encourage it to happen, though. For example, we could put a thread::yield_now() between the load and the store, just to make it more likely that some other thread gets to run in between. If we run it now, you see that it panics: we didn't get to the final value, we got some slightly smaller value, and the reason is that two threads ended up executing the closure we passed to with_lock at the same time. They both read the same old value, they both wrote back that old value plus one, and because they ran at the same time we lose some of the increments: they get overwritten by another thread running concurrently. This is also just undefined behavior; the compiler is allowed to produce garbage here. It happens to do something somewhat reasonable, but you can't generally rely on that.

You might say: okay, but there's no yield_now in the real code, so why would some other thread get to run there? There are a couple of reasons. One is that there might be multiple cores in your computer — this machine has something like 16 — and when two threads run on different cores, you can't control the relative timing of when they do what. So it could absolutely be that two threads on different cores are both in the while loop, both see it unlocked at the same moment, and both proceed. The other reason is that even on a single core, the operating system limits how long a given thread gets to run and then does something called preemption: it forcibly stops the thread in the middle, because it's been running for too long, to ensure that all the threads on the system get to run. You have no control over when that happens, so you might be preempted at any point, including between this load and this store. It's unlikely that preemption happens exactly there, but it can, and that's what we're modeling with the yield_now: we're pretending the thread gets preempted right there.

Okay, before I go into how we solve this, let's take some questions to make sure everyone understands what just happened and why it's wrong. "Would sleeping a bit while doing the += 1 create concurrency issues?" If we slept inside the closure, that would probably not increase the concurrency issues, because by the time you're in there you've already taken the lock; it's taking the lock that's racy, not what you do under the lock. So a sleep there wouldn't have the same effect as the yield_now I inserted. "Isn't the compiler already predetermining the sum?" No, the compiler isn't precomputing the sum here, because the increments happen on other threads; even though this could technically be computed statically, the compiler doesn't do that across thread boundaries. "Would ThreadSanitizer catch this?" We'll talk about ThreadSanitizer later in the stream, but it would catch something like this, because you have two threads concurrently writing to one memory location — a write-write conflict — which is exactly the kind of thing ThreadSanitizer generally catches.
We will also talk about loom, I promise; please trust me on this. "It seems like sleeping there would result in more lock contention and thus more races." No — actually you get more lock contention when the critical section is shorter, so a sleep would make it less contentious... well, that's not quite true either. I don't think the sleep is all that important, and it's not necessary to demonstrate the problem.

"Are atomics related to multi-threading or concurrency?" Atomics themselves are useful for concurrency, and multi-threading is a form of concurrency. If you don't have multiple threads, it's unlikely you need atomics; if you have multiple threads, even if you only have one core, atomics are necessary.

"Isn't sequential consistency the default in Rust?" No, there's no default sequential consistency in Rust. It is true that for code that doesn't use atomics — if you're operating on a single thread, for example — you don't have to worry about these problems, and the reason is that, per the memory model, any sequence of operations within a given thread is guaranteed to be observed in order: they might execute out of order, but the effects will appear as though the program ran top to bottom, in program order. That might be what you're thinking of, but there's no default of sequential consistency; that's why all these methods take an Ordering that you have to explicitly pass in.

All right, so it's now clear that something here is broken, and one of the reasons is that there's a race between the load and the store, where multiple threads might "win" at the same time. We need some way to avoid this race condition of two threads observing the mutex as unlocked at the same time, and the way we do that is the compare_exchange operation. You'll notice there's also compare_and_swap — compare-and-swap is often what the CPU instruction is called — but in general you want compare_exchange rather than compare_and_swap; compare_and_swap is being deprecated for exactly this reason. compare_exchange is strictly more powerful — in fact, I think the implementation of compare_and_swap used to just call compare_exchange — and the reason it's more powerful is that compare_exchange lets you specify the memory ordering separately for when the operation succeeds and when it fails, whereas compare_and_swap does not. We'll talk about what that means too. compare_exchange_weak is interesting, and we'll get to it in a second.

If we look at compare_exchange, it takes the current value, it takes the new value, and then it takes two orderings — we'll talk about why there are two and what they mean in a second — and the first line of the documentation is pretty helpful: stores a value into the atomic integer (or in our case boolean) if the current value is the same as the `current` argument. So let's see what that looks like: compare_exchange from UNLOCKED to LOCKED, still just Ordering::Relaxed, because we don't know what that means yet, and in fact now the store goes away. I realize this is maybe hard to see; I wish I'd format this differently — in fact, maybe I'll do that here, just so it's easier to follow what's going on.
Okay. compare_exchange takes what the current value should be for us to update it, and what it should be set to if the current value matches that first argument. What happens is the CPU goes and looks at the value — the AtomicBool here — sees whether it is UNLOCKED, that is, whether it's currently false, and then, and only then, sets it to true, doing so in such a way that no other thread gets to modify the value between when we look at it and when we change it. So compare_exchange is a single operation that is both the read and the write.

Notice that we can do this in a loop: if the current value is LOCKED, the value will not be updated, because the current value is not UNLOCKED, so compare_exchange returns an error, and we loop and try again. compare_exchange returns Err if the value was not updated and Ok if it was, and in either case the value contained in the Ok or the Err is the current value, whatever it was when the CPU went to that memory location. So if you get an error, it tells you "here's what the value was when it wasn't equal to what you passed me". In our case we don't need the actual value; all we care about is whether it was updated, hence the call to is_err(). But with something like an AtomicUsize it might actually matter what the old value was: imagine you're updating a counter and trying to increment it by one; the next time around the loop you want to use the updated value to do the increment, rather than the value you read previously.

So does this make sense — does it make sense why compare_exchange solves this particular problem? There's no space between the load and the store anymore; they're just one operation, one atomic operation performed on the memory location we're operating on.
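Inside with_lock, the locking loop now looks roughly like this (a sketch of the change being described; still all Relaxed orderings, which the stream revisits later):

```rust
// Replaces the separate load + store pair: the check and the write
// happen as one atomic step, so two threads can no longer both observe
// UNLOCKED and then both store LOCKED.
while self
    .locked
    .compare_exchange(UNLOCKED, LOCKED, Ordering::Relaxed, Ordering::Relaxed)
    .is_err()
{
    // someone else holds the lock; try again
}
```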
Now, in practice you don't actually want to do it quite this way, and there are two reasons. First, compare_exchange is a fairly expensive operation: if every core is spinning on compare_exchange, they're all trying to get exclusive access to the underlying memory location. Imagine you have eight cores and one of them currently holds the lock; all the other cores are trying to get exclusive access to the value that holds the true/false. Each core says "give me exclusive access to this value", which is a coordination effort: it has to coordinate with all the other cores to say "I now own this memory location", to make sure no one else is writing it at the same time; then it looks at the value, goes "oh, that's not the value I expected", and then some other core says "now give it to me" — and the ownership of that location in memory just bounces between the cores. That's very inefficient. It's not that CPUs aren't great at this; coordinating exclusive access among all these cores is just inherently expensive. If you're curious about this, I highly recommend looking up the MESI protocol, which explains a lot about how this actually works at the hardware level; it's a super nice protocol to know about if you want to understand performance implications at this kind of low level.

The MESI protocol basically says that a given location in memory — it's really about cache coherence and cache lines, but I'll refer to it as a location in memory — can be either Shared or Exclusive (there are some other states too, but basically Shared and Exclusive). In compare_exchange, the CPU requires exclusive access to that location, which requires coordinating with everyone else. Alternatively, a location can be marked Shared, and multiple cores are allowed to hold the value in the Shared state at the same time. So if we can have the value stay in the Shared state while the lock is held, we avoid all the ownership bouncing. What you'll see in certain spin lock implementations is another, inner loop that does exactly this. Notice that it doesn't change the behavior; it's just that if we fail to take the lock, we then only read the value. The inner loop doesn't do a compare-and-swap, it doesn't require exclusive access, it's read-only. So if we fail to get the lock, we spin doing only loads, which lets the value stay in the Shared state, which means we don't have all the ownership bouncing. The moment it changes — because some core takes exclusive access, probably to do the unlock store — only then do we go back to the expensive compare_exchange where we try to get exclusive access, and if we fail to get the lock, we fall back to the read-only loop again. This is actually a fair amount more efficient when you have high contention. But again: don't use a spin lock.

"Do you think the performance degradation would be visible if you just redid your test?" No — these optimizations are in the nanoseconds. They only really matter when you have a lot of contention on a particular value, and a lot of threads and cores participating in that contention. You usually see this kind of thing if you plot performance — throughput, or goodput — against the number of cores: with a lot of contention, where everyone is trying to get exclusive access, the goodput starts to flatten out or even go down as the number of cores goes up, because the cores spend all their time fighting over who gets exclusive access to the value. If you do something like this read-only inner loop, you avoid that collapse; you still might not see linear growth (double the cores, double the goodput), but you'll usually see a better curve because you avoid some of the performance collapse.

"Is compare_exchange then much faster than locking a mutex?" It's hard to say. A single compare_exchange is not that expensive. The biggest difference between compare_exchange and a mutex is that a mutex has to wait; a compare_exchange never waits. A single compare_exchange call goes and tries to do the operation you asked for, and then reports either "I succeeded" or "I failed". If it fails, it doesn't block the thread or anything. You can choose to wait on top of it — in this case we spin, so we are effectively holding up and waiting for it to succeed — whereas a mutex, if you lock it, blocks your thread until you hold the lock.
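With the read-only inner loop added, the locking loop inside with_lock might look roughly like this (a sketch; orderings are still Relaxed at this point in the stream):

```rust
while self
    .locked
    .compare_exchange(UNLOCKED, LOCKED, Ordering::Relaxed, Ordering::Relaxed)
    .is_err()
{
    // Inner, read-only spin: plain loads let the cache line stay in the
    // MESI "Shared" state instead of bouncing exclusive ownership between
    // cores on every failed locking attempt.
    while self.locked.load(Ordering::Relaxed) == LOCKED {}
}
```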
That's the difference; think of compare_exchange as a much more primitive operation. In general, compare_exchange is often, but not always, used in a loop: keep retrying with an updated current value until you succeed. There are some algorithms where, if the compare_exchange fails, there's other useful work you can do, so you go do that and then try the compare_exchange again at some later point. But this is where weak comes in.

Remember, in the documentation there's compare_exchange and there's compare_exchange_weak. The difference is fairly subtle, but basically it comes down to this: compare_exchange is only allowed to fail if the current value did not match the value you passed in; it is not allowed to fail under any other conditions. compare_exchange_weak is allowed to fail spuriously — that is, even if the current value is what you passed in, it's allowed to fail. It usually won't, but it's allowed to. The reason this matters comes down to what operations the CPU supports. On Intel (and AMD) x86 there's a compare-and-swap operation — it's not technically called that, but effectively there's an instruction that implements compare_exchange as exactly one atomic instruction. On ARM, though, you don't usually have a compare-and-swap instruction; instead you have LDREX and STREX, load-exclusive and store-exclusive. Load-exclusive you can think of as taking exclusive ownership of the location in memory and loading the value; store-exclusive says "only if I still have exclusive access to that location — no one else has taken it away from me — will I store, and otherwise I'll fail." You could imagine some other thread taking ownership of the value just to read it, or to overwrite it with the same value that's already there; in that case the store-exclusive on ARM will fail, even though the value might still be the same. The upside of doing it with these two operations is that the store-exclusive is really cheap — you don't have to go and grab exclusive access again — but it might mean you fail the operation without strictly needing to.

So on ARM, compare_exchange is actually implemented as a loop of LDREX and STREX, because it has to implement the compare_exchange semantics of failing only when the current value didn't match. That means that on ARM processors, our spin loop ends up being a nested loop, and nested loops aren't inherently a problem, but they tend to generate less efficient code and more register pressure. This is why we have compare_exchange_weak: it's allowed to fail spuriously, so it can be implemented directly with LDREX and STREX. And of course on x86-64, compare_exchange_weak is just the compare-and-swap instruction; it always works, it doesn't generate spurious failures. The idea is: if you're already calling it in a loop, you should use compare_exchange_weak; if you're not calling it in a loop — if you want it to fail only when the current value turned out to be different — then you use compare_exchange.
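A small illustration of the "weak in a loop" guidance — a hypothetical helper, not from the stream, that atomically doubles a counter. A spurious failure just costs one extra iteration, and the Err value tells us what the value actually was so we can retry with it:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn double(counter: &AtomicUsize) -> usize {
    let mut cur = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange_weak(cur, cur * 2, Ordering::Relaxed, Ordering::Relaxed) {
            // Ok carries the previous value on success.
            Ok(prev) => return prev,
            // Err carries the value actually found; retry using it.
            Err(actual) => cur = actual,
        }
    }
}
```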
So in our case, since we're calling it in a loop, this should be compare_exchange_weak. All right, did that make sense before we continue? Oh, it's called load-linked and store-conditional — that's the general name for these operations. And ARM64 does have a compare-and-swap instruction, but there are a bunch of other ARM variants that do not. Okay, that seems like it was fairly clear as to what the difference between compare_exchange and compare_exchange_weak is. I guess let's leave that comment up here — "stay in S when locked" — just to keep the code documented. "Is that all the conditions when weak is allowed to fail?" Weak is allowed to fail spuriously, so it can fail for all sorts of reasons; it will only succeed under the one condition you expect, but it can fail for whatever reason it feels like, whereas the non-weak version can only fail if the current value turned out to be different.

All right, so far so good. We're now at the point where we have something that looks like it can't have this problem, and to test that, I guess we could put a thread::yield_now() in here, and maybe one in here too, and run it again. All right — nothing fails. So clearly our code is correct now, right? Ship it, it's all done, we fixed the race condition, there are no more problems. Of course, no, that's not true. I wish it were, but it is not, and the reason comes down to this Ordering business. Let's see, do we want to do orderings first or the fetch methods first? Let's do orderings first.

As we discussed, there's this Ordering enum, and you'll see that it has a couple of variants: Relaxed, Release, Acquire, AcqRel, and SeqCst (sequentially consistent). As I described before, these basically dictate the allowed observable behavior when you have multiple threads that interact at some memory location — what is allowed to happen when you run this code. Ordering::Relaxed means there are no guarantees. When I say no guarantees, that's not quite true: you're still guaranteed that the operation happens atomically — if you do a load, you can't see some bits that were stored by one store and some bits that were stored by another; the operation is still atomic — but that's the only thing that's guaranteed. To demonstrate just how extreme that is, I'm going to give you a fun little test case, with Relaxed everywhere.

Here's what's going to happen: I'm going to spawn two threads. I'll say x is Box::leak(Box::new(AtomicUsize::new(0))) — the reason I give the type explicitly is that Box::leak returns a 'static mutable reference, which I couldn't move into two threads, so this is my way of casting it, basically — and the same for y. (I guess I could make this a test; it doesn't really matter.) So x and y are both numbers, and I'm going to have two threads — I can't spell — let's call them thread 1 and thread 2.
Thread 1 is going to read y with Relaxed ordering and then store that value into x, and we'll have it return r1. Thread 2 is going to load x, and then store 42 into y. Then we do r1 = t1.join().unwrap(), and the same for t2, so we join both threads and get both values back. So one thread reads y and stores into x; the other reads x and stores into y. And the surprising bit I'm going to tell you is that it is possible for r1 == r2 == 42. That is a possible execution of this program.

Let's see why that's weird. Thread 2 loads x into r2 and then it stores 42, and I'm telling you it's possible for r2 to be 42, even though the store of 42 happens after r2 is read. This should be extremely surprising to you, but it is possible with Ordering::Relaxed. The reason is that when you have multiple threads executing concurrently, by default — well, that's not quite true, but close enough — there are basically no guarantees about which values a thread can read of the ones another thread wrote, under Ordering::Relaxed. Even though you might think "this happens before that, so how could it go the other way around", the answer is that for atomic operations there is a modification order, stored per value. The modification order for y is: it's 0, and then it's 42. For x it's: 0, and then 42. When you load a value with Ordering::Relaxed, you can see any value written to that location by any thread; there's no restriction on when that write has to have happened relative to you. The only restriction is if there's a synchronization point between the two of you — the spec talks about this in terms of happens-before relationships, and we'll get to those in a second — but with Relaxed there's no guarantee that, just because you read or write the same value, you happen before or after anything in particular. So in particular, this load of x is allowed to see any value ever stored to x. That includes 42, because 42 is in the modification order of x, and there are no constraints on which range of that modification order is visible to this thread. It might seem like time travel.
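Here is roughly what the demo being described looks like (a sketch; variable names as used on stream):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    // Box::leak gives &'static references we can copy into both threads.
    let x: &'static AtomicUsize = Box::leak(Box::new(AtomicUsize::new(0)));
    let y: &'static AtomicUsize = Box::leak(Box::new(AtomicUsize::new(0)));

    let t1 = thread::spawn(move || {
        let r1 = y.load(Ordering::Relaxed);
        x.store(r1, Ordering::Relaxed);
        r1
    });
    let t2 = thread::spawn(move || {
        let r2 = x.load(Ordering::Relaxed);
        y.store(42, Ordering::Relaxed);
        r2
    });

    let r1 = t1.join().unwrap();
    let r2 = t2.join().unwrap();
    // Under Relaxed, r1 == r2 == 42 is a permitted outcome, even though it looks
    // as if t2 reads the 42 "before" storing it. You will essentially never
    // observe this on x86, but the memory model allows it.
    println!("r1 = {}, r2 = {}", r1, r2);
}
```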
Another way to think about this is that the compiler is totally allowed to reorder those two operations in thread 2 — that's maybe an easier way to think about it — and similarly the CPU is allowed to execute them out of order, just for optimization purposes. The reason it's allowed to is that there's no sequencing between the two: the store doesn't depend on anything from the load — it doesn't use r2, it doesn't access x — and the load doesn't write into y, so there's nothing that strictly requires them to happen in that order. You might wonder why the CPU and compiler don't just run them in order, and the reason is that very often there are significant performance gains from rejiggering the execution of a program: run operations slightly out of order and you might get much better performance, much better use of memory. And under either of those conditions — compiler reordering or out-of-order execution — the reversed execution might happen, and if the code looks like the reordered version, it's totally obvious why r2 might be 42. With Ordering::Relaxed there's nothing that guarantees this won't happen. Think of it as: Relaxed does not establish an ordering between those two operations, or between this thread and that thread, and therefore r2 is allowed to see 42.

Does that make sense? It's weird, I know, it's really weird. But if you think about it, the reordering makes a lot of sense. Imagine you were a programmer working in a huge codebase, and you didn't know about these other threads — you're just looking at this in some 10,000-line file. Would you ever think twice about swapping those two lines? They don't depend on each other, so why not swap them? Is there a downside? Probably not. And of course the observable execution outcomes are different — or, in the case of Relaxed, they're not, because the two versions are interchangeable as far as the memory ordering specification is concerned.

Yet another way to think about this is speculative execution: the CPU is allowed to run the second operation before the first, speculatively, because it knows it's about to be executed. In this case there are no branches, so speculative execution isn't quite the right term; it usually comes up when there's a question of whether something will be executed at all. If there were an if here on something unrelated to r2 or x — say, if z == 3 — that's where you get speculative execution: the CPU doesn't know yet whether the condition is true, but it executes the operation anyway, in a way that it can later undo, so that if the condition turns out to be true it has already started the work. That's where things like Spectre and Meltdown come in. But in this case it's not really speculative execution, it's just out-of-order execution, which today's CPUs are very good at, because it's necessary: it makes programs much faster.

"Wait, so when I write code, the CPU can take any instruction that doesn't depend on others and start with that?" Yes, sort of. There are a bunch of constraints on what reordering the CPU and compiler are and aren't allowed to do, but in general they're allowed to reorder anything that doesn't have a clear happens-before relationship, which is the formal way to specify this. Think of your program as a graph of dependencies: if A depends on B, you're not allowed to reorder them, because A depends on B; but if there's no relationship between A and B — no dependence, nothing that says B has to happen first — then they're interchangeable, and that's what the specification gives you. This is also where it becomes important that memory ordering and memory semantics are not just about the CPU or the architecture; they're also about what the compiler is allowed to do. It could be the compiler that reorders these, or the CPU executing them out of order, and they're equivalent: as far as we're concerned, all that matters is whether it's legal for them to be reordered.
One of the reasons this might actually happen: it might be that we already have y in cache — we have exclusive access to y already — but we don't even have shared access to x, so the load from x has to wait for the memory system to do something, while we can do the store right now because we already hold y exclusively. So the CPU does the store first, because it can do that efficiently.

Okay, so that's an example of the kind of shenanigans that can happen when you're using Relaxed, because Relaxed just doesn't imply any ordering; it doesn't imply that anything happens before anything else. And why does this matter? Remember, we're taking the lock with Relaxed. So here's an example of something that might happen: imagine that instead of calling f, with_lock just did the increment directly — it can't really, because we'd have to specialize to usize, but imagine it did the += 1 inline. Well, the lock acquisition is Relaxed, so the increment can just be moved up above it: the lock operation works on self.locked, the increment works on self.v, those are different memory locations, and there's nothing that says this reordering isn't allowed. We know it's not okay — if these execute out of order, that's really bad: now the increment executes while some other thread might be holding the lock, and we're violating the exclusivity of the mutable reference. This is bad. Not okay, compiler; not okay, CPU. But based on the ordering we gave, this is fine.

So how do we fix this? This is where we have to use a different memory ordering; Relaxed is too weak. Let's look at the alternatives. If we go back to Ordering, the next things up from Relaxed are Release and Acquire, and those sound a lot like the terms we use for locks — you acquire a lock and you release a lock — and that's no accident: these memory orderings are basically designed for use in the context of acquiring or releasing a resource.

Let's leave the compare_exchange_weak alone for a second and first talk about the store. Here's another example of a reordering that's valid with Relaxed: the increment can just move down, below the unlock store. Why not? Totally fine, all good, no fires here... but of course, no, this is not fine; we don't want to allow that either. So it's not just the acquisition of the lock that needs to ensure things don't move above it; the release of the lock also needs to ensure that nothing moves below it. This is where the Release ordering comes in.

Let's switch back to the spec — this is going to be a little bright; actually, let's use the Rust docs instead, because they have dark mode. For the Release ordering: when coupled with a store, all previous operations become ordered before any load of this value with Acquire (or stronger) ordering; in particular, all previous writes become visible to all threads that perform an Acquire (or stronger) load of this value, and this ordering is only applicable for operations that can perform a store. There's a lot of language there, and all of it matters, but basically it means: if we do this store with Release, then any load of the same value that uses the Acquire ordering must see all operations that happened before the store as having happened before the store. There's an additional restriction, too. If we go back to the memory_order page on cppreference (this will be bright): for memory_order_release, a store operation with this memory order performs the release operation — no reads or writes in the current thread can be reordered after this store.
That second sentence is not actually in the Rust version of the documentation, which seems unfortunate, because it should be there: nothing can be reordered to after a Release store. Great, that solves one half. The other thing the documentation said is that all previous operations become ordered before any load of this value that uses Acquire. That load is the one that happens as part of the compare_exchange when the next thread takes the lock: if it uses the Acquire ordering, it must see everything we did inside the critical section. This might sound like a strange thing to care about, but if the lock acquisition were Relaxed, the next thread to take the lock would not be guaranteed to see the modifications we made to memory while holding it. This comes back to the stuff we talked about with Relaxed: a load is allowed to see old values. There are even the more extreme cases where a thread just will not see the 42 in the time it executes, even if, looking at wall-clock time, the storing thread ran first; that's all Ordering::Relaxed guarantees. In our case with the mutex, if we kept this as Relaxed, the modifications we made in the critical section might not be visible to the next thread that takes the lock, which is a huge problem; that's not okay. So the unlocking store needs to be Release and, for now, let's say the locking compare_exchange has to be Acquire. Now we have the two guarantees: nothing can be reordered to after the Release store, and anything we did in the critical section must be visible after an Acquire load of the same value, so whoever next takes the lock must see everything that happened before the store. The way to think about this is that the acquire/release pair establishes a happens-before relationship between the thread that previously released the lock and the next thread that takes it: anything that happened before the Release store also happens before anything that happens after the Acquire load. And there's one more part: the spec says that a load operation with Acquire performs the acquire operation on the affected memory location, and no reads or writes in the current thread can be reordered before that load. That's the other property we needed: nothing from the critical section can move above the point where we take the lock. Nice. So the unlock has to perform a Release, and the lock has to perform an Acquire. There's also AcqRel, acquire-release, which you pass to an operation that does both a read and a write, like compare_exchange, which reads the value but also writes it; AcqRel says do the load with Acquire semantics and the store with Release semantics. In our case the question becomes whether we care if the store, which records that we now hold the lock, has Release semantics, and here we don't: the Release semantics we need are established by the unlock. AcqRel is more commonly used when you do a fetch_add, for example (we'll get to what those mean in a second), where a single read-modify-write is the whole synchronization and there's no critical section; you're not going to do a later store, you're just doing one operation that you want to synchronize with other threads.
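As an aside, the same Release store / Acquire load pairing is the classic way to publish data from one thread to another even without a lock. A minimal sketch, with names of my own choosing rather than code from the stream:

    use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
    use std::thread;

    static DATA: AtomicUsize = AtomicUsize::new(0);
    static READY: AtomicBool = AtomicBool::new(false);

    fn main() {
        let producer = thread::spawn(|| {
            DATA.store(42, Ordering::Relaxed);    // happens before the Release store below
            READY.store(true, Ordering::Release); // publish
        });
        let consumer = thread::spawn(|| {
            // Spin until the Acquire load observes the Release store.
            while !READY.load(Ordering::Acquire) {}
            // The acquire/release pair guarantees we now see the write to DATA.
            assert_eq!(DATA.load(Ordering::Relaxed), 42);
        });
        producer.join().unwrap();
        consumer.join().unwrap();
    }

If the flag load were Relaxed instead of Acquire, the assertion would be allowed to fail.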
So in our case the load half of the compare_exchange can just be Acquire. All right, does the stuff so far make sense? From chat: is it only Acquire ordering, or Acquire and stronger? It's Acquire and stronger; SeqCst, sequentially consistent ordering, is acquire and release and more. Great, so it seems like what we went through so far makes sense. The question then becomes: what is the extra parameter to compare_exchange and compare_exchange_weak for? The extra parameter covers the case where the compare_exchange looks at memory and goes: this value has changed, so I didn't do the store. That ordering is the ordering the load should have if the load indicates that you shouldn't store. In our case, taking the lock, you can think of it as the ordering of failing to take the lock. You can imagine cases where, even if you fail to take the lock, you still want to establish a happens-before relationship; those cases are pretty rare, but they do exist. In our case we don't need that: if you fail to take the lock, that doesn't mean you now have to synchronize with the thread that last released it. All that matters is that the moment you do get the lock, you establish a happens-before relationship with the thread that last held it, because you're about to operate on the data it protected. So here we can keep the failure ordering Relaxed. The reason this matters is that if you fail to take the lock, you don't want to be forced into coordination with the other threads, or rather the other cores, to get something like exclusive access when it isn't important. Great. Okay, so hopefully this now makes sense, and we have a working example of a spin lock (you should basically never use a spin lock, but it is a working example of one). This, I will tell you, is, I believe, a sound and correct implementation: not one you should use, but correct and sound. Now, you might wonder why these problems didn't show up when we ran this with Relaxed. When both of these orderings were Relaxed, we ran the test, we ran the binary, we did lots of iterations with lots of threads and lots of cores; why didn't anything fail, and why did we still observe the correct output? The reason is that my machine is an x86_64 machine, and on x86_64 the architecture basically guarantees acquire/release semantics for all operations. It gives fairly strong consistency guarantees by default; you can't opt out of it, the CPU's memory architecture is simply such that all operations behave as acquire/release. That's not true for all platforms: on ARM, for example, if you specify Ordering::Relaxed you will really get relaxed memory semantics. And this is one of the problems with trying to test concurrent code by just running it lots of times: you're still only testing it on your current hardware and your current compiler. A bit like undefined behavior, which we've talked about in past streams, you just can't rely on the current state of affairs being representative of future executions. It does give you some idea; our previous implementation, the obviously wrong one with the separate load and store, did break when we ran it.
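To summarize where the lock has ended up, here is a minimal sketch of a spin lock along these lines. It's illustrative only, with made-up names rather than the exact code from the stream, and, as said above, not something you should actually use:

    use std::cell::UnsafeCell;
    use std::sync::atomic::{AtomicBool, Ordering};

    pub struct SpinLock<T> {
        locked: AtomicBool,
        v: UnsafeCell<T>,
    }

    // Safety: the locking protocol below ensures exclusive access to `v`.
    unsafe impl<T: Send> Sync for SpinLock<T> {}

    impl<T> SpinLock<T> {
        pub const fn new(t: T) -> Self {
            Self { locked: AtomicBool::new(false), v: UnsafeCell::new(t) }
        }

        pub fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
            // Acquire on success: we must see everything the previous holder did.
            // Relaxed on failure: failing to take the lock synchronizes nothing.
            while self
                .locked
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_err()
            {
                // Spin until the lock looks free again; this probe can stay Relaxed.
                while self.locked.load(Ordering::Relaxed) {}
            }
            let r = f(unsafe { &mut *self.v.get() });
            // Release: nothing from the critical section may move below this store,
            // and the next Acquire of `locked` will see all of our writes.
            self.locked.store(false, Ordering::Release);
            r
        }
    }

    fn main() {
        static COUNTER: SpinLock<u32> = SpinLock::new(0);
        std::thread::scope(|s| {
            for _ in 0..4 {
                s.spawn(|| COUNTER.with_lock(|v| *v += 1));
            }
        });
        assert_eq!(COUNTER.with_lock(|v| *v), 4);
    }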
But just running the code lots of times is not sufficient to test this kind of concurrent code; we'll talk about how you might do something much better a little later in the stream. (From chat: can that spin load stay Relaxed? Yes, because there too we don't need to establish any happens-before relationship; we haven't taken the lock yet, so it can still be Relaxed.) In general, the cases where you want to use Relaxed are the ones where it doesn't matter what each thread sees. For example, if you maintain a counter for statistics or the like, you can generally make it Relaxed: if one thread doesn't see another thread's increment, or the counter gets updated slightly earlier or slightly later relative to execution order, it doesn't really matter. Relaxed is great there because it imposes the least amount of restrictions on the CPU and compiler, so the compiler can generate more efficient code and the CPU can execute the instructions more efficiently. But for anything where the ordering between threads and the relative ordering of instructions matters, you may have to look at the stronger orderings. Okay, so now that we've talked about acquire/release, let's look at the fetch operations, and then sequentially consistent ordering. If you look at the atomic types, you'll see a bunch of fetch operations in addition to load, store, and compare_exchange: fetch_add, fetch_sub, fetch_and, fetch_nand, fetch_or, and so on. Think of load and store as the bread and butter, the nuts and bolts, the very low-level primitives you can build things with, and think of compare_exchange as the sledgehammer: either set this value to that value, or tell me the value changed, and do it atomically. The fetch operations are gentler versions of the sledgehammer; a mallet, a rubber mallet. More concretely and perhaps more helpfully: instead of saying what the new value should be, you tell the CPU how to compute the new value. Imagine you have a counter. You could read the current value and then do a compare_exchange of the current value against the current value plus one, which will fail if the counter changed in the meantime. Instead, you can do a fetch_add, which will never fail: you say, do a fetch_add of one on this value, and the CPU gets exclusive access to it, reads whatever is currently there, adds one to it regardless of what it was, and stores it back. So fetch_add just doesn't fail; whatever the current value was, the new value is one plus that. And the reason it's called fetch_add, fetch_sub, and so on is that it also tells you what the value was when you incremented it. This again gets at the race condition between doing a separate load and a separate store: fetch_add assumes you care what the current value was, and you can throw that value away if you don't need it.
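For the statistics-counter case just mentioned, a tiny sketch of what that can look like (names are illustrative):

    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::thread;

    // A statistics counter: we only care about the eventual total, not about
    // ordering relative to any other memory, so Relaxed is enough.
    static HITS: AtomicUsize = AtomicUsize::new(0);

    fn main() {
        let handles: Vec<_> = (0..4)
            .map(|_| {
                thread::spawn(|| {
                    for _ in 0..1_000 {
                        HITS.fetch_add(1, Ordering::Relaxed); // never fails; returns the old value
                    }
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
        // Joining the threads establishes happens-before, so this reads 4000.
        assert_eq!(HITS.load(Ordering::Relaxed), 4_000);
    }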
If fetch_add didn't hand the old value back, there would be no other way to learn which value you just incremented. If you only had an atomic add operation, you couldn't combine it with a load to figure out which value you incremented to or from: do a load and then an atomic increment, and there's a small window in between where some other thread can sneak in; do the add and then a load, and there's a window there too. So fetch_add is a single atomic instruction, but instead of dictating what the updated value should be, you tell it what the operation should be. This is perhaps best illustrated by the fetch_update method, which is a bit of an ugly duckling. fetch_update takes a closure that is given the current value and should return the new value. Think of it as the generic version of fetch_add and fetch_sub, where you could imagine fetch_add being fetch_update with a closure that adds one to its argument. The reason I say fetch_update is a little ugly and weird is that the CPU has built-in support for adding atomically, subtracting atomically, doing a bitwise AND atomically, and so on, but it does not have built-in support for an arbitrary closure. So what fetch_update really is, is a compare_exchange loop that's implemented for you. If you look at the source for fetch_update, it loads the current value and then does compare_exchange in a loop until it succeeds. And you'll notice that it correctly uses compare_exchange_weak, because it's already in a loop, which ties back to what we talked about earlier and the difference between the non-weak and weak compare exchange. This is why I say fetch_update isn't really the same as the others: in general a fetch_add will just be a single atomic instruction, whereas fetch_update is genuinely a compare_exchange loop. The reason fetch_update still sort of fits in this family is in the module documentation under Portability: all atomic types in the module are guaranteed to be lock-free if they're available. For example, if the platform supports atomic access to u64, the AtomicU64 type will be available; if it doesn't, it won't be. That means they never internally acquire a global mutex. But atomic types and operations are not guaranteed to be wait-free: operations like fetch_or or fetch_add may be implemented with a compare-and-swap loop. The idea is that if the architecture you're compiling for doesn't support an atomic increment, the standard library on that architecture implements fetch_add with a compare-exchange loop. In other words, if you call fetch_add because you want to avoid a compare_exchange, that is the right thing to do, because you get to take advantage of the atomic increment instruction if it exists on your architecture; but on some architectures it might end up being a compare-exchange loop anyway. Okay, so hopefully you now understand what the fetch operations are for.
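To make the "compare-exchange loop implemented for you" point concrete, here's a rough sketch of such a loop written by hand. It has the same shape as fetch_update, though it is not the actual standard-library source:

    use std::sync::atomic::{AtomicUsize, Ordering};

    // Hand-rolled equivalent of an increment done via a compare-exchange loop,
    // returning the old value just like fetch_add does.
    fn increment(counter: &AtomicUsize) -> usize {
        let mut current = counter.load(Ordering::Relaxed);
        loop {
            let new = current + 1;
            // compare_exchange_weak is fine (and preferred) here because we loop anyway.
            match counter.compare_exchange_weak(current, new, Ordering::Relaxed, Ordering::Relaxed) {
                Ok(previous) => return previous, // we installed `new`
                Err(actual) => current = actual, // someone beat us to it; retry with their value
            }
        }
    }

    fn main() {
        let c = AtomicUsize::new(41);
        assert_eq!(increment(&c), 41);
        assert_eq!(c.load(Ordering::Relaxed), 42);
    }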
In general, the fetch operations are used for things like handing out unique sequence numbers to a bunch of operations that happen concurrently. Rather than having a lock around a next-sequence-number counter, where you take the lock, read the sequence number, increment it, and release the lock, you just do a fetch_add on an AtomicUsize instead. That guarantees that every call to get a sequence number gets a distinct one, because every increment happens and the fetch ensures you read the value that was there when you updated it, so no two threads read the same value if each increments by one. And it will generally be a fair amount more efficient than a mutex, and even than a compare-exchange loop on platforms that support an atomic increment. Okay, that then brings us, finally, to sequentially consistent ordering, and this may be another instance of head explosions. Here we're going to need a different example, because our lock is now correct. The example I'm going to give is fairly involved, because that's what it takes to demonstrate the difference between acquire/release and sequentially consistent; it's essentially the same example as the one on the cppreference memory_order page, translated into Rust. We'll have two AtomicBools, x and y, that start out false, and z will be an AtomicUsize. One thread, call it tx, stores true into x with Ordering::Release (remember, Release is for stores and Acquire is for loads), and another thread, ty, does the same kind of store to y. Then we have a thread t1 that does: while not x.load(Acquire), spin; then, if y.load(Acquire), fetch_add one to z, with the fetch_add itself Relaxed. And we have a t2 that does the same thing in reverse: it spins until y becomes true, and then, if x is true, adds one to z. Down at the bottom we wait on t1 and on t2, and then read z with z.load, where I'm just going to use SeqCst for uninteresting reasons; you can ignore that one, I just want to read whatever the final value of z is. And the question is: what are the possible values for z? Which essentially boils down to: is zero possible, is one possible, is two possible? More than two should not be possible, because there are only two increments; that would be insane, so we're going to assume it isn't. (What did I do there? Every now and again I hit some weird vim binding.) Let me zoom out a little bit so we can fit the whole thing on screen; hopefully that's still legible. All right, I'll give you a second to consider this code, and I'll check chat in the meantime.
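For reference, a rough reconstruction of that example in code. It's a sketch of what's on screen rather than a verbatim copy; the thread names and the use of statics are mine:

    use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
    use std::thread;

    static X: AtomicBool = AtomicBool::new(false);
    static Y: AtomicBool = AtomicBool::new(false);
    static Z: AtomicUsize = AtomicUsize::new(0);

    fn main() {
        let tx = thread::spawn(|| X.store(true, Ordering::Release));
        let ty = thread::spawn(|| Y.store(true, Ordering::Release));
        let t1 = thread::spawn(|| {
            while !X.load(Ordering::Acquire) {} // spin until x is true
            if Y.load(Ordering::Acquire) {
                Z.fetch_add(1, Ordering::Relaxed);
            }
        });
        let t2 = thread::spawn(|| {
            while !Y.load(Ordering::Acquire) {} // spin until y is true
            if X.load(Ordering::Acquire) {
                Z.fetch_add(1, Ordering::Relaxed);
            }
        });
        for t in [tx, ty, t1, t2] {
            t.join().unwrap();
        }
        // Which final values are possible here: 0, 1, 2?
        println!("z = {}", Z.load(Ordering::SeqCst));
    }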
(From chat: if some of these things are implemented under the hood with compare_exchange, why doesn't everything return a Result? Because fetch_add always succeeds; the fetch methods always succeed. That's exactly why, when one is implemented with compare_exchange, it does compare_exchange_weak in a loop until it succeeds, so a fetch_add will never fail and therefore doesn't need to return a Result. Also from chat: is the mutex panic safe? That mutex is panic safe, but it won't propagate panics.) Okay, let's work through this, and I'm going to go from last to first, so I'll start with whether two is possible and then go back up. Two is clearly possible, and just to make sure I demonstrate why, let's call the storing threads tx and ty, because they respectively store true to x and y. Two is possible under the execution tx, ty, t1, t2: imagine we had just one core and the threads ran one at a time. tx runs and sets x to true; ty runs and sets y to true; t1 runs, immediately sees that x is true, and after observing that, sees that y is true, so it increments z, which is now one. Then t2 runs, immediately sees that y is true, then sees that x is true, both because tx and ty have already run, and increments z by one, so we get z equal to two. That one is trivially possible. Is one possible? It should be: run tx, then t1, then ty, then t2. With that kind of execution, when t1 runs, x is true but y is false: t1 waits for x to become true, which it is immediately because tx already ran, then tries to load y; ty has not run yet, so y is false, and t1 does not increment z. Then ty runs and stores true to y. Then t2 runs, observes that y is true, immediately exits its while loop, checks whether x is true, which it is, and increments z by one. So z is one, and one is a possible outcome. Zero is more complicated, so let's try to work through it the same way we did one and two: can we find some execution of the threads where the outcome zero is possible? We have a couple of restrictions. We know that t1 must run after tx, in the sense that even if t1 starts first, it just spins until tx runs, so in order to complete it has to execute after tx. For the same reason, t2 must run after ty: if ty hasn't run and t2 tries to run, it just sits in its loop until it gets preempted, or ty runs on another core, and only then does t2 observe that y is true, exit the loop, and complete. So, given those two restrictions, what execution would allow z to be zero? Let's arbitrarily say tx goes first, and at some point later t1 goes. Now let's try to place t2 in different locations. If t2 goes first, then ty must go before it, and in that execution t1 will increment z, because t1 runs after both x and y have been set to true. Back to the drawing board: let's say t2 goes right after tx. Well, ty still has to go before t2 will do anything useful, and whichever spot ty lands in, both t1 and t2 will increment z, because they both end up running after the stores that set x and y.
What about placing t2 at the end? ty could go just before it, but by then tx has already run, so t2 will increment z; and if ty went earlier, t2 would still increment z. So, given these restrictions, there is no place we can put t2 in this kind of execution order such that one of t1 or t2 won't increment z. It seems impossible to have a thread schedule where z equals zero. Does that ring true so far? Does the walkthrough make sense? It seems impossible, right? Okay, and now this is where it gets super weird: the way we've currently written this, zero is possible. That may not come as a huge shock at this point, but it is possible, and here's why. It is true that there is no thread schedule that would allow z to be zero, but we're not bound to think in terms of thread schedules. Thread schedules are just the human desire to put things in a single order; in reality, computers don't have to have a single order that things run in. You have multiple cores, and those cores can show old values and new values. All we're subject to are the exact rules of the acquire and release semantics we've specified here. So go back to the modification order of each value: we talked about how there's a modification order for each memory location, so the modification order for x is false and then true, and the modification order for y is likewise false and then true. Now let's look at what happens when t1 runs. t1 runs and observes that x is true; it observes the true entry from x's modification order. That tells us a couple of things. Remember, from the documentation of Acquire and Release, that if you observe a value with an Acquire load, you will see all operations that happened before the corresponding Release store. The corresponding Release store is the one in tx, and there are no operations prior to that store in that thread; if there were, we would be guaranteed to see them here, because we're synchronizing with tx, but there are none. Then, when t1 gets to its load of y, that Acquire load synchronizes with whichever store produced the value it gets, and there is no requirement that that be any particular store of y. If there had been a store to y in the tx thread before its store to x, we would have to observe it, because it happened strictly before the store to x, which happens before our load via acquire/release. But there is no such store, so we're allowed to see any value for y: the initial false, or the true that ty stores, regardless of whether ty has run or not. Even if ty has already run, strictly speaking in wall-clock time, it doesn't matter; the memory system is still allowed to show us false. So t1 is bound to get true from x's modification order, but it is allowed to see either entry of y's, whether or not ty has run. And the same thing applies to t2: its load of y synchronizes with ty's store, so after that load it must see all memory operations that happened before the store to y, and there are none. Great, we're done; there's no requirement that it sees anything that happened in tx. (To the chat question: right, in here there's no relationship to that thread.) Although, technically, I'm lying a little bit: when you spawn threads, they all happen after the main thread that spawned them, so we must see at least the initial false. It's not as if we could see some value written somewhere else entirely; we must see that false, or anything that happens later.
And in fact, imagine the spawning thread had itself done an x.store(true) before spawning these threads: then the loads down in t1 and t2 would have to see that store, because it happened before them, because this thread spawned those threads. So when t2 runs, even though it synchronizes with the ty thread, and therefore must see any writes that thread made before its store, there is no requirement that it sees any particular operation that happened to x, because there is no happens-before relationship between the store in tx and the load of x down in t2. There just isn't one. And note, this is not about reordering: remember how an Acquire load says you're not allowed to move any operation from below it to above it, so the compiler and the CPU are not allowed to reorder these; that's not what's going on. It's simply that this load is allowed to see any previous value for x, subject to happens-before, and happens-before does not include the store in tx. So t2 must see ty's store, but it can see either value of x, and t1 can see either value of y, and that's still valid. And if that happens, you can imagine this execution: both tx and ty run; then t1 runs, observes x being true, but does not observe y being true even though ty has run, because there's no happens-before relationship there (imagine the old value is already sitting in that core's cache and it just uses it, without bothering to check again, because it's allowed not to), so t1 does not increment z; and t2 spins until y is true, eventually observes that it is, but there's no requirement that it observes x being true even though tx has run, so it does not increment z either. Therefore z is zero. This is weird, and it's a really weird way to have to think, but the way to get at it with your brain is that acquire/release, and in general all memory ordering, is about which concurrent things happen before which other concurrent things, and if there is no happens-before relationship between two operations, it's up in the air whether one sees the other. (The "seems" was a giveaway, someone says. Yeah, you're right.) Okay, so you might wonder why this is allowed; why do the designers of languages and memory models and CPUs and compilers make this possible? The answer is that if you look at something like our mutex, this is fine, it doesn't cause any problems, and it gives the CPU and the compiler more leeway to rearrange your code. Here, for example, x and y are just independent variables, so why should we establish an arbitrary connection between them when there really isn't one? If there were, suddenly we'd be enforcing that the CPU and compiler execute things in order and obtain exclusive access for certain operations, and it's just technically not necessary; it would be imposing a cost that you could otherwise avoid. Think of it as: if you want stronger guarantees, you have to opt into them, because by opting in you're also opting into their cost. If every operation enforced sequential consistency, you would have no way to opt out of it. What you want is a default that you can override, and that's exactly what Rust gives you: every operation takes an ordering, and you must provide one. So if you don't want
to think about these problems, you can always just provide Ordering::SeqCst, sequentially consistent ordering, everywhere. You can always do that, but if you do, you're also signing up for the cost that comes with it. All right, so the question now becomes: how does sequentially consistent ordering change this? The Rust documentation describes SeqCst as: like Acquire, Release, and AcqRel, with the additional guarantee that all threads see all sequentially consistent operations in the same order. And notice that that says all sequentially consistent operations, not all sequentially consistent operations on this memory location. So if we go back and change this program to make all of these operations SeqCst (we can leave the counter as Relaxed, because it doesn't matter), zero is no longer possible. The reason is that if t1 observes that x is true and then that y is true, it has observed an ordering between those stores: some thread observed that x was set to true and then that y was set to true, and that means no thread is allowed to behave as though the opposite ordering happened, that is, to see y as true while still seeing x as false. That's what the documentation text is trying to get at: there must exist some single ordering of the sequentially consistent operations that is consistent across all of the threads. If some thread sees that x happened and then y happened, no thread is allowed to see y happen without x having happened. So in this case, if t2 sees that y is true, then x must be true for it as well; it's not allowed to see x being false, because that would be inconsistent with the ordering that the other sequentially consistent operations saw. Notice, though, that sequentially consistent ordering only really interacts with other sequentially consistent operations. If these stores stayed Release and the loads stayed Acquire, this would not give us the guarantee we needed (well, the Release stores might be okay, actually, but if this load were only Acquire it would not), because then we're allowed to see x as true here and x as false there without there being any single sequentially consistent order relating the two: no thread has observed, through sequentially consistent operations, that x became true and then y became true. So, subtle stuff, but in general sequentially consistent only really interacts with other sequentially consistent operations, while acquire/release does interact with sequentially consistent in the sense that sequentially consistent is always stronger than acquire/release: if you have a store that's Release, or sequentially consistent, then a load with sequentially consistent ordering that reads it will indeed happen after that release store. All right, fairly involved, but basically: sequentially consistent is acquire/release with the additional guarantee that all sequentially consistent operations must be seen as happening in the same order on all threads, and therefore, in this particular example, if all of these operations are SeqCst, z equal to zero is not possible. All right, nice, and we did that on exactly the two-hour mark, very good. I'm going to talk a little bit about testing shortly, that is, how you avoid making these mistakes, but I'm very happy with the timing. Okay, so as we've seen, memory ordering is
real subtle it's just like hard to make sure you got it right or rather i think it comes down to the human brain is just really bad at emulating all of the legal executions and as we talked about like you can test this by just running your code lots of times in a loop or like making your computer be really busy with other tasks to cause more weird interleavings to happen but it's not a very reliable way to do this kind of testing because it might depend on the architecture might depend on the operating system scheduler it might depend on the compiler it might depend on which optimizations you have enabled for that compiler it might just depend on like how likely is the thing to happen maybe you need to execute your code like billions and billions of time in order to get that one execution where the code is wrong and so surely you're like there must be a better way and the other way to think about this is like it might even be that there's a problem in your code but it doesn't cause a panic right like if you think about the the counter we had early on with our mutex where the values didn't end up adding up to a hundred thousand or whatever the value was right it added up to something slightly less the only reason why the program crashed the only reason we noticed is because we had an assertion that checked that the value was right at the end but if you have a program that just like does a bunch of operations it takes locks it just assumes that everything is right it might not be obvious if something went wrong if the lock had this kind of bug in it two things run the critical section at the same time maybe you write a log entry twice maybe the log gets truncated maybe your backup ends up empty right like who knows what happens when you just a mutex just isn't the mutex right it's very hard to predict and it might not crash your program it might not be detectable sort of immediately and that's scary right but it also means that even if you ran it for like 10 years at lots of cores on lots of different hardware right like your code might hit the bug but there's nothing that notices that you hit a bug and these two problems of how do you make sure that you test all the possible legal executions and how do you know if something bad happened are sort of separate problems and they're they're often handled a little bit differently uh in general for the second one like how do you detect these bugs when writing concurrent code like this with with atomics you want to stick in a lot of assertions just to make sure that you did the right thing you can make them debug assertions if you really want to and then run your test suite with like in release mode but with debug assertions turned on or something like that so that it doesn't impact release builds too much but you do need to have ways to detect these problems there are also some automated systems so for example there's um so google has a bunch of different sanitizers that are now built into a lot of compilers like i think both clang and gcc and msvc probably others have many of these sanitizer built in there's a nightly flag to use it in rust and one of the sanitizers is the thread sanitizer and thread sanitizer runs your code in like an instrumented way where every load and store you do in your program like every memory operation gets tracked like it gets special instructions added to it to get to track what it modifies and when um and then as your program runs the thread sanitizer in the background detects if you ever do like if you ever have two 
threads that write to the same memory location, or a thread that writes to a memory location after another thread has read from it, without a happens-before relationship between them. So it basically tries to detect when you have unsynchronized operations on a single memory location. That can't detect all problems; there might be logic problems you won't detect this way. And if you look at the documentation for the ThreadSanitizer algorithm, towards the bottom it gives the state machine and the assertions it checks, so the algorithm is specified, but there are things that aren't detected yet and problems you simply can't detect this way. Also, ThreadSanitizer works by running your code: it will detect whether that particular execution of the code did anything bad. It might be that this particular execution was fine and you still have a concurrency bug. And this comes back to the other problem: how do you make sure that you exercise every legal execution of your program? That's actually much harder to solve, because you need to figure out how to explore all the possible behaviors allowed by the spec, across all possible compilers, CPUs, and architectures. The best tool I know of for this is a tool called loom. Loom is a Rust project that implements a paper called CDSChecker, which was written for C and C++ atomics but translates to Rust pretty well. Basically, what that paper outlines and what loom implements is a strategy for taking a concurrent program and instrumenting it. This is not automatic instrumentation: you use loom's synchronization and atomic types instead of the ones from the standard library, so loom's AtomicUsize, loom's Mutex, loom's channels, and so on. Then you pass loom a test closure, and it will run that closure multiple times. Every time you do a load through one of the loom data structures, it will feed you back one of the possible legal values; when you do a store, it keeps track of all the values that have been stored, so that on other loads it can expose your execution to any one of those possible values. Every execution of the closure exposes you to a different possible execution: as loom runs, it works through all the possible thread interleavings and all the possible memory orderings. Remember when we looked at the modification order for each variable, where t1 was allowed to see either value for y and t2 was allowed to see either value for x? Loom will make sure that there is one execution where t2 sees the false value for x and one execution where it sees the true value, and in general it will do this for all of the operations in your program. The example in the loom documentation is pretty illustrative of how this works: you use the atomic types out of loom instead of out of the standard library, the idea being that when you're running your loom tests you use the loom primitives, and when you're building your code for release or for the standard test suite you use the standard library primitives; the documentation has a bunch of explanation of how you go about setting that up. The basic idea is that in a loom test, everything, including inside the library and inside any libraries that library uses, all the way down, should use the loom primitives. Then, inside your test, you call loom::model, which is the function that will try all these different execution paths: you pass in a closure, and loom will call that closure over and over and over again. For example, if a thread in the closure does a load with Acquire, that load might get a different value each time the closure executes; there are lots of threads, and which threads execute in which order will differ each time through; and loom will make sure that it completely explores all the possible legal executions.
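Here's a rough sketch of what such a test can look like, modeled loosely on the examples in the loom documentation and assuming loom is available as a dev-dependency; check the loom docs for the exact setup rather than trusting the details here:

    use loom::sync::atomic::AtomicUsize;
    use loom::thread;
    use std::sync::atomic::Ordering;
    use std::sync::Arc;

    #[test]
    fn concurrent_increments() {
        // loom::model runs the closure once for every execution it decides to explore.
        loom::model(|| {
            let v = Arc::new(AtomicUsize::new(0));
            let handles: Vec<_> = (0..2)
                .map(|_| {
                    let v = Arc::clone(&v);
                    thread::spawn(move || {
                        v.fetch_add(1, Ordering::SeqCst);
                    })
                })
                .collect();
            for h in handles {
                h.join().unwrap();
            }
            // This assertion is checked on every explored execution.
            assert_eq!(v.load(Ordering::SeqCst), 2);
        });
    }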
Of course, the downside of this is that there might be an insane number of such possible executions. Imagine you have 10 threads that each do 10 stores and 10 loads: you end up testing a staggering number of interleavings, every one that's legal according to the memory specification you're using, which can be enormous. So loom is somewhat limited in that respect, not because of the implementation, just because of the fundamentals: there are that many possible legal executions. Loom does implement a bunch of techniques from the paper we just saw that reduce the number. For example, imagine a thread does two loads of two different values; it often doesn't matter which one executes first, and even though the memory model allows them to be reordered, the correctness of your program is unlikely to depend on whether those loads happen in order. (That's actually a bad example, because in the case we just looked at the load ordering might have mattered, but the paper has a precise specification of which possible executions cannot differ from one another and can therefore be eliminated, so that you only need to run some of them, and loom implements a bunch of those reductions to help.) Ultimately, though, there's a limit to how complex your test cases can be when you run them under loom: you can only have so many modifications to a given atomic variable, so many threads, so many instructions. Even so, loom is the best tool I know of for making sure your concurrent code is actually correct under only the assumptions the memory model gives you, rather than the assumptions your current compiler, optimization level, or CPU might give you. All right, let's do some questions before I say more about loom. Are there any toy programs one can think of to drill this in? The spin lock was a pretty good example, but this last one seems crazy. Yes: for the difference between acquire/release and sequentially consistent, I don't have a less contrived example for you. My recommendation would be to look at some papers that implement concurrent data structures, notice where they use sequentially consistent ordering as opposed to other orderings, and generally the paper will explain why. I think it's hard to come up with a simple motivating example; the one I just gave is the simplest I can come up with, and even though there's little code, the thinking and reasoning are complex. I just don't have a good concrete
example where sequentially consistent is necessary. Let's see: someone pointed out that the name loom actually makes a lot of sense, because it spins threads in many ways. So, for the cases we've looked at so far, loom does have some limitations. Some of them are known problems, some are more fundamental problems with the approach or are just impossible to model. For example, Relaxed ordering is so relaxed that you can't fully model all the possible executions, even with something like loom. One example of this is the Relaxed example from earlier: remember the load that might see the 42 stored by the other thread's later store? Imagine how you would try to emulate that. You pass the program into loom; the load happens first; it's a load from a loom AtomicUsize, so loom has to produce some value for it, and loom doesn't know about 42 yet, because that store hasn't happened yet. Ordering::Relaxed allows 42 to be returned from that load, but loom doesn't even know 42 is a candidate value yet, so how can it return it? Now, if you go read the paper, and it's a fascinating paper, there are ways to do this: because you know you're executing the closure many times over, you can execute it once and remember all the relaxed stores you saw; the next time through, return from the load a value that will be stored by a later store, continue executing, and check whether that store really does store that value. If it does, you've successfully created the reverse dependency; if it doesn't, you discard that whole execution, because the race no longer happened. It's really weird; read the paper, it's fascinating. But fundamentally, without some really fancy tricks, modeling Relaxed properly is just very difficult. Loom is a really cool and helpful project, and I know there's a lot more that could be done with it; I know the project is looking for contributors, so if you're interested in this concurrency stuff, please go contribute to loom. I would be so happy to see that project do even better than it does now. Here's another example of where that kind of contribution could be helpful: loom currently doesn't really model sequentially consistent ordering. Let me restart that sentence: as we saw, sequentially consistent ordering is acquire/release plus some additional guarantees, and loom doesn't currently model those additional guarantees; it runs the sequentially consistent operations as if they were acquire/release. The reason for this is partly that implementing the additional guarantees of sequentially consistent ordering is actually fairly complicated; the paper talks a little bit about this too, and the translation and implementation are fairly complex. What loom does is simply downgrade every SeqCst to AcqRel, which means loom won't miss any bugs, because acquire/release is weaker than sequentially consistent, but it might tell you that your code has a bug when it doesn't. So it won't give you a false negative, but it might give you a false positive, which is probably the direction you want, but is also unfortunate. There's an open issue for it, and there's a test case for it in the code that's currently ignored; this is something I know is high on the priority
list to fix, because sometimes, if you have an algorithm that depends on sequentially consistent ordering to be correct, loom can't currently tell you that it's correct, since it will downgrade that ordering. That's very much worth fixing. But what's important is that it does model acquire/release correctly, which is generally the thing you're not able to test well on something like x86. So, take the case where our example still used Release for the stores and Acquire for the loads, and say we kept the code that way and added an assert_ne!(z, 0): remember, z equal to zero is a possible outcome with acquire/release, but not with sequentially consistent. If you ran that through loom, loom would correctly cause the assertion to panic; it would actually explore that possible case under acquire/release. In other words, loom will catch real errors, and it will catch some of the ones that are the hardest to actually reproduce. There are some cases where it will give you a false positive, but hopefully those will be fixed, and similarly, for Relaxed ordering there are some behaviors it just cannot model. Arguably the answer there is that you shouldn't be using Relaxed for anything critical in a data structure anyway, but if you do, loom doesn't quite capture it. I actually recently pushed a large documentation revamp to loom, so much of what I'm talking about now is written in the loom documentation. I highly recommend you read it: it explains how loom works, how you use it, what its limitations are, and how you can work around those limitations; I think the documentation should be fairly good at walking you through this. And if it inspires you to contribute to loom, please do, especially things like documentation; now that you've watched this stream, you should be in a really good position to document a lot of the things loom exposes. Someone in the stream says that tokio uses loom and that it's caught serious issues, and it really has: I know tokio uses loom, especially for the scheduler, which does a lot of lock-free business, and loom has caught several critical bugs there, which, luckily, as far as I'm aware, never made it into production; it's hard to say. But yes, loom has absolutely caught real issues. One thing you run into with loom is that, because it tries every possible execution, test cases have to have a state space that's reasonable to explore, so you sometimes have to run them on limited use cases even when a more intricate use case could also trigger a bug. I've found that it can generally explore a state space large enough to still catch the interesting cases, and the worst ones. Loom also requires that execution be deterministic: if you did a syscall in there, loom would re-execute the syscall each time, so you may have to do some mocking, just as with normal tests really, in order to use loom effectively. But apart from that, I've been very happy with loom for concurrency. I think that's all I wanted to go through; let me see if there's anything more in loom that you need to know about. Yeah, see, the documentation even gives the relaxed-ordering example as an example limitation of something it can't model. Sweet. I think that's all I wanted to cover, so now that we're at the tail end of the stream,
are there any questions, now that we've been through all the weirdness and, hopefully, all the explanations? Is there anything that's still unclear that I can try to help with? Oh, memory barriers, that's a good one, good call. Let's go back to the atomic module in the standard library. You see that there are all these atomic types, there's the Ordering enum, and there are constants: those are from before the new method on the atomic types was const, when they were the only way you could get a const atomic. That's the other way to share one between threads: instead of sticking it in an Arc or leaking a Box, you can just make a constant. The module also has three functions. There's spin_loop_hint, which probably should never have been in the atomics module at all, which is why it's now deprecated and has been moved to the hint module; it's not really an atomic instruction, it was just there because it's often used in atomic contexts. In our spin lock, for example, you might want to call spin_loop_hint in the while loop. I'm not going to go into what that means; it's not terribly relevant here. Then there's compiler_fence, which is weird: it's a way to ensure that the compiler won't move a given load or store across the fence. You place the fence, and the compiler is not allowed to move operations from below the fence to above it, or from above the fence to below it, within that thread's execution. This is only for the compiler; the CPU can still execute things out of order. You very rarely need compiler_fence specifically; it's mostly used for preventing a thread from racing with itself, which can really only happen when you're using signal handlers. In general you won't need it. And then, finally, there's fence, and fence is important. A fence is basically an atomic operation that establishes a happens-before relationship between two threads, but without talking about a particular memory location. Remember that with load and store and acquire/release, an Acquire load synchronizes with, or happens after, a Release store of the same memory location; a fence is a way to say: synchronize with all the other threads that are also doing a fence. The C++ memory_order document has really detailed definitions of exactly what these happens-before dependencies mean (it also has the note that on certain strongly ordered systems, release/acquire ordering is automatic for the majority of operations, and only weakly ordered systems like ARM need special instructions). I feel like there used to be a section specifically about fences in there, but I guess there isn't; that's too bad. The documentation on the Rust fence function is pretty good too, though. Basically, a fence synchronizes with another fence if there are memory operations before the one and after the other that happen to synchronize; a fence is a way to say "synchronize a little bit later" or "a little bit earlier" without putting the stronger ordering on the actual load or store itself. Again, I recommend that you actually read through the documentation here to understand what it's for, but essentially a fence is a way to establish a happens-before relationship between two threads where the synchronization point is shifted a little relative to the actual memory accesses.
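As an illustration of that shifted synchronization, here's a sketch of the usual fence pattern: relaxed accesses on the flag itself, with the ordering supplied by fences. The names are mine, and this is only a sketch:

    use std::sync::atomic::{fence, AtomicBool, AtomicUsize, Ordering};
    use std::thread;

    static DATA: AtomicUsize = AtomicUsize::new(0);
    static READY: AtomicBool = AtomicBool::new(false);

    fn main() {
        let producer = thread::spawn(|| {
            DATA.store(42, Ordering::Relaxed);
            fence(Ordering::Release);             // everything above stays before this fence
            READY.store(true, Ordering::Relaxed); // the flag store itself can be Relaxed
        });
        let consumer = thread::spawn(|| {
            while !READY.load(Ordering::Relaxed) {}
            fence(Ordering::Acquire);             // pairs with the Release fence above, because
                                                  // the Relaxed load saw the Relaxed store
            assert_eq!(DATA.load(Ordering::Relaxed), 42);
        });
        producer.join().unwrap();
        consumer.join().unwrap();
    }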
(To the chat correction: you're right, the atomic has to be a static, not a const, but you can only assign const values to statics; that's what I meant.) Oh, volatile. Okay, someone asked about volatile. Notice that the atomic module has no mention of volatile; there's nothing volatile in here, and that's for a good reason: volatile is just unrelated to atomics. But since it's a common question: there are read_volatile and write_volatile operations in the pointer module in Rust. A volatile read sounds like it has something to do with atomics, because it's often explained as ensuring that you go to memory, so people think: maybe the reason concurrent operations might race is that one CPU does the operation in a register rather than in memory, and if it goes to memory, surely other threads must see it, therefore I'll use volatile. That's not really what volatile is for, and volatile does not establish happens-before relationships across thread boundaries. What volatile is for is interacting with memory-mapped devices. Imagine your network card, which has to send and receive packets: it maps itself into a particular region of memory, a particular range of addresses, and let's say its packet queue is mapped into that region. It might be that reading from a part of that memory ends up modifying it. For example, many memory-mapped devices expose a ring buffer along with pointers to the head and the tail of that buffer (if you don't know what ring buffers are, just bear with me for the next few minutes): the head and the tail dictate where new writes are added and where the reader reads from, and these might be operated on by different parties, one touching only the head and one only the tail. Often these device-mapped memory regions have the side effect that reading the tail also resets it, or some other side effect of reading. In those cases, imagine that you read twice from a given location in memory-mapped memory. If you didn't use read_volatile, the compiler would probably do the first read, realize that reads are read-only, cache the result in a register, and serve the second read from the register; but in reality both reads need to go to memory, because each has a side effect on the device-mapped memory. That is what read_volatile is for: it's a way to tell the compiler that the access must actually go to memory and that it cannot be elided or reordered relative to other volatile operations, because it might have side effects. There's a section on this at the bottom of the memory_order documentation (sorry, bright screen again) which talks about the relationship between atomics, volatile, and memory ordering, and the answer is basically that they don't interact; they're not for the same purpose, even though it might sometimes seem that way.
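A tiny sketch of the API itself, using an ordinary local value rather than real device memory so that it stays runnable; with actual memory-mapped registers the pointer would come from the device mapping instead:

    use std::ptr;

    fn main() {
        let mut slot: u32 = 0;
        let p: *mut u32 = &mut slot;
        unsafe {
            // With plain accesses the compiler could merge or elide these; volatile
            // accesses must each actually be performed, in program order, which is
            // what you need when the pointer refers to device registers.
            ptr::write_volatile(p, 1);
            ptr::write_volatile(p, 2);
            let a = ptr::read_volatile(p);
            let b = ptr::read_volatile(p);
            assert_eq!((a, b), (2, 2));
        }
    }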
Okay, great, I think that's all I wanted to cover on atomics. Is there anything that still feels unclear, anything you'd like me to go over one more time, anything you feel I haven't touched on? Now's your time. (Can you tell I'm going a little hoarse? That's what happens when I talk for two and a half hours.) I have to wait about ten seconds for YouTube comments to come in, in case anyone has questions from there; so for those of you watching this as a video on demand, it might seem weird when I ask whether there are any questions and then just sit there for twenty seconds. That's the delay on YouTube comments coming back to me. Anything particular about AtomicPtr? Okay, so AtomicPtr might seem a little different from the other types: the others are primitive types, numbers and booleans, and they seem simple. But AtomicPtr isn't really special: think of it as an atomic usize whose methods are specialized to the underlying type being a pointer. (In fact, my guess was that it stores a usize internally; looking at the source, it actually stores a pointer now, nice.) You'll see that it doesn't have fetch_add, for example, because that's not an operation that makes sense here. Technically you can do a fetch_add on a pointer using assembly, but it's probably not going to do what you want: you run into all sorts of undefined behavior, and you lose pointer provenance if you start thinking about those things; it's complicated. Basically it's a specialized version of AtomicUsize that operates on pointers and keeps pointer provenance: it provides load and store, operating on pointers, it does compare_exchange, and it has fetch_update, which, remember, is really just a compare_exchange loop, so it gives you a nicer interface to writing one. So there's really nothing that special about AtomicPtr, and it still requires unsafe to turn the raw pointer it gives you into a reference. There's one more thing I can mention while I'm at it: most of the atomic types have a get_mut, which takes a mutable reference to self and gives you a mutable reference to the inner type. That's safe because if you have an exclusive reference to the atomic itself, no one else has any reference to it, so you don't need any special memory ordering or atomic operations to touch it; that's what get_mut does. In fact, AtomicUsize could implement DerefMut; it just can't implement Deref, which is kind of interesting. I wonder if it implements AsMut. No: DerefMut extends Deref, so you can't implement DerefMut without Deref, and I think AsMut is the same in requiring AsRef, so even though technically this type could implement one of them, it doesn't. Next question: how do atomics interact with regular stores, say if I cast an atomic to a mutable pointer to u32 and pass it to C? Well, in C you also have the option of doing atomic operations using memory_order, in which case they would be subject to the same rules (although I don't actually know exactly how the compiler's view of those orderings carries across that boundary). If you did an atomic operation in C, say a store with memory_order_seq_cst, that would still be enforced for that operation: think of it as, it's not that the value is special, it's that the compiler knows to generate special instructions for it when you're using an AtomicUsize or AtomicU32. But if you pass it to C as a raw pointer and then just dereference it with star, that's even weaker than a Relaxed load of that value.
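A quick sketch of the get_mut point: with exclusive access to the atomic, no atomic operations or orderings are involved at all.

    use std::sync::atomic::{AtomicUsize, Ordering};

    fn main() {
        let mut counter = AtomicUsize::new(0);
        // We own `counter` exclusively here, so we can touch the inner value directly.
        *counter.get_mut() += 1;
        assert_eq!(*counter.get_mut(), 1);
        // Once the atomic is shared with other threads, you're back to atomic operations.
        counter.fetch_add(1, Ordering::Relaxed);
        assert_eq!(counter.load(Ordering::Relaxed), 2);
    }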
It's like what you would get if you cast it to a raw pointer to a usize and just dereferenced that (well, you couldn't even quite do this in Rust), which I don't think is even sound; you might just run into undefined behavior. What is consume ordering? Rust doesn't currently implement consume ordering, and I don't know that it's used very often in C and C++ source either. (I should have warned you, sorry, bright screen.) I haven't looked at release/consume too carefully, but it looks like the rules about which things become visible to the other thread change a little. My guess is that it's specialized to cases where you don't need the full release/acquire guarantee, or maybe it's the other way around and release/acquire is weaker than release/consume; I'm not quite sure. It's not available in Rust currently anyway, probably for good reason; it's really hard to define consume ordering. And yeah, looking at the description of release/consume, it mostly looks annoying, and the page says that since C++17 the specification of release-consume ordering is being revised and the use of memory_order_consume is temporarily discouraged. So I think it's just not worth trying to learn it at the moment, and even if you do learn it, chances are that information will be outdated. All right, with that I think we're going to call it a stream. Thank you everyone for coming; I hope that was interesting, I hope you learned something, and I hope this does not leave you more confused than you were. I wish you luck in writing lock-free atomic code, but keep in mind that in general you shouldn't write lock-free code unless you have to. I want to end on this note: if you can just use a mutex, and most of the time you can, do that; it will cause you much less headache. If you really need to get into atomics, use loom, use ThreadSanitizer, run it through Miri (which I think now has some concurrency support), get other people to vet the code, find a paper that describes the algorithm you're trying to implement; do everything you can to make really sure you get it right, because there's a lot of subtlety, as we've seen. And the best way to avoid the complexity of atomics is to not use atomics, so in particular, don't use them unless you need to. And with that, I will see you next time. So long, and I hope it was an educational experience.
Info
Channel: Jon Gjengset
Views: 27,281
Rating: 4.9909296 out of 5
Keywords: rust, live-coding, atomics, memory ordering
Id: rMGWeSjctlY
Length: 159min 20sec (9560 seconds)
Published: Fri Apr 02 2021