ITT 2019 - Kavya Joshi - Let's talk locks!

Video Statistics and Information

Captions
Good morning, or let me try this: günaydın, nasılsınız? Cool. So I'm Kavya, and I'm here today to talk with you about locks.

Now, I take it everybody here is familiar with multi-threaded programming? Yeah? So you're all probably familiar with locks, and you probably just love locks, no? Right. We all use locks, but locks have this big, bad, scary reputation, and for good reason: we've probably all heard about or worked with systems where the use of locks caused performance problems. Here's an example from one of the production services where I work, at Samsara. This is a critical component of our read pipeline, so we care a lot about the latency of this service, and you can see that one fine day the latency went up by nearly 10x. That turned out to be because of a locking issue, which we'll come back to. So we know from experience that locks are kind of bad.

That said, this production service is written in Go, and the Go runtime uses locks extensively under the hood, as does your favorite database server, your memory allocator, your favorite operating system scheduler. Locks are everywhere. So what gives? What is it about locks that causes them to have this fascinating spectrum of effects on the performance of our programs? In what scenarios do they perform well, when do they cause problems, and what can we do about it? These are the questions we'll explore today. To do that, we'll first take a tour through lock internals, then talk about locking performance, and finally discuss locking strategies.

Before we get started, an important note: everything about locks is specific to the hardware, the instruction set architecture, the operating system, and the language implementation. Today we'll assume a standard SMP multi-core system, where all CPU cores share a single memory bus; we'll assume x86-64 and a modern Linux; and we'll look at the lock implementation of Go.

Speaking of Go, any Go programmers in the house? No? Okay, don't worry, I have you covered. All you need to know about Go for this talk is that in Go the unit of concurrency is called the goroutine, and goroutines are conceptually very similar to regular threads in a language like Java or C++, so you use them exactly like you'd use threads. The big difference is that they are userspace threads: the Go runtime takes care of managing them, not the operating system. The details don't matter right now. The other thing to know is that, just like with threads, when two goroutines access the same shared data, that access needs to be synchronized, and you can do this using Go's lock implementation, the sync.Mutex. The sync.Mutex is a blocking, non-recursive lock: there are no try-acquire semantics, so if a goroutine tries to acquire the lock and doesn't get it, it blocks.

Okay, with that out of the way, let's talk lock internals, or better yet, let's just build a lock. Where do we start? Well, let's start with what we know and work backwards from there. We know we want locks to give us mutual exclusion. That is to say, here's a very simple program: a task queue that's shared between two threads. On the right-hand side we have the reader thread, which reads a task from the queue and processes it, and on the left-hand side we have the writer thread, which creates work and puts it into the task queue. We know that whatever lock construct we come up with, it needs to give us mutual exclusion: it needs to make it so that T1 and T2 never access the task queue at the same time.
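As a quick illustration of what using Go's sync.Mutex looks like in practice, before we go off and build our own lock, here is a minimal sketch of that running example: a task queue shared by a writer and a reader goroutine. The type and method names are made up for illustration; this is not code from the talk.

```go
package main

import "sync"

// TaskQueue is a queue of work items shared by multiple goroutines.
type TaskQueue struct {
	mu    sync.Mutex // guards tasks; blocking, non-recursive
	tasks []func()
}

// Push is what the writer goroutine calls to add work.
func (q *TaskQueue) Push(t func()) {
	q.mu.Lock() // blocks if another goroutine holds the lock
	defer q.mu.Unlock()
	q.tasks = append(q.tasks, t)
}

// Pop is what the reader goroutine calls to take work; ok is false if empty.
func (q *TaskQueue) Pop() (t func(), ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.tasks) == 0 {
		return nil, false
	}
	t = q.tasks[0]
	q.tasks = q.tasks[1:]
	return t, true
}
```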
So how do we go about doing this? Well, let's see, can we just use a flag? Here's how that would work. We have a flag, and the flag tracks whether the task queue can be accessed: if it's free, flag is 0, and if the task queue can't be accessed, flag is 1, meaning it's not available, it's occupied. Now T1, the reader, looks like this: it checks whether flag is 0. If it is, it can access the tasks, so it increments flag, setting it to 1 to indicate that the queue is now in use; then it does its thing, it reads a task; and then it sets flag back to 0 and returns. If, however, it reads flag and flag is 1, the queue is in use by the other thread, so what do we do? We just try again; it loops. The writer thread looks exactly the same: on the writer side we do the same thing, we ask, is flag 0? If so, I can access tasks; if not, try again.

Okay, does this work? Can we just go home, are we done? No. This doesn't work, for a couple of reasons, both of which come down to what finally gets run on the hardware.

First, let's look at this line, flag++. The compiler compiles this down, on x86, to the increment instruction, inc, and the processor runs it as a read-modify-write operation, in three steps: it reads the variable from memory into a CPU register, it modifies it, in this case increments it, and then it writes it back. So this one instruction results in two memory accesses. Why is this a problem? Because with two threads, it's totally possible for that read and write to be interleaved with the other thread's read: T2's read can start after T1's read-modify-write has started and still see the old value of flag, which is 0.

This is called a non-atomic operation, and a non-atomic operation simply means that another processor or another thread can see the effects of the operation half complete. An operation can be non-atomic because it uses multiple CPU instructions, for example a load or store of a large data structure, or because it uses a single CPU instruction that results in many memory accesses, like the flag++ we just saw. The opposite of non-atomic instructions is atomic instructions, and atomic instructions are indivisible. On x86, simple loads and stores of data that fits within a single cache line are guaranteed to be atomic, and this comes down to the fact that x86 is cache coherent: it guarantees that if you have many CPU cores, they all have a consistent view of a single cache line. Okay, so atomic is what we want, atomic is good, and using a flag is not atomic.

But that's not the only problem. The second problem with our implementation comes down to this block of code, and the problem is that memory operations get reordered all the time. The compiler can reorder these memory operations: in this case it can move the store of flag = 0 before the read of tasks. And it's not just the compiler; the processor can also reorder operations. This is an example of store-load reordering, in which the store of flag = 1 can happen after the read of tasks.
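To see why this breaks, here is the flag scheme written out in Go, purely as a counterexample; it is intentionally incorrect, and the Go race detector will flag it. The variable names are illustrative.

```go
package main

// DO NOT use this: a plain int flag is not a lock.
var (
	flag  int // 0 = free, 1 = in use
	tasks []func()
)

func reader() {
	for {
		if flag == 0 { // two goroutines can both see 0 here...
			flag++ // ...because flag++ is a non-atomic read-modify-write
			if len(tasks) > 0 {
				t := tasks[0]
				tasks = tasks[1:]
				t() // process the task
			}
			flag = 0 // and this store may be reordered with the reads above
			return
		}
		// flag was 1: in use by the other goroutine, so just try again
	}
}
```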
Now, store-load reordering is something x86 is allowed to do, and it does it to hide write latency: writes are expensive because of cache coherency, so the processor starts the write, performs the read, and lets the write complete later. So the compiler and the processor reorder memory operations all the time to speed things up. The only cardinal rule about how they can reorder is sequential consistency for single-threaded programs: they can't change the order of memory operations as visible to a single-threaded program. The compiler has additional rules about what it can and can't do, which come down to the specific compiler implementation and are captured in the language's memory model. For example, Go, C++, and I think Java too guarantee that if you use locks and other synchronization mechanisms correctly, so that your program is data-race-free, they will give you sequential consistency in multi-threaded programs as well. That's the compiler's reordering. The processor also has rules about what it can and can't reorder, and those come down to the hardware's memory model. In the case of x86, like we said, it's a relaxed consistency model called total store order: it can't reorder most memory operations, but it can do the thing we just saw; store-load reordering is valid on x86.

Cool, so that's the second reason we can't use our silly flag example: it's not atomic, and it doesn't give us memory ordering guarantees. Okay, no problem, we'll just build a construct that gives us atomicity and memory ordering. Easy, right? Where do we even start? Well, luckily for us, the instruction set exposes special hardware instructions that give us atomicity, an example being the exchange instruction in x86, as well as instructions that guarantee memory order is preserved. Those are called memory barriers or memory fences; in x86 the examples are mfence, a full memory barrier across which loads and stores are not reordered, a store fence called sfence, and a load fence called lfence. But the really powerful tool that x86 gives us is the lock instruction prefix. The lock prefix can be applied to memory operations like adds and increments, and it gives us both things: it makes the instruction atomic, and it acts as a full memory barrier.

That sounds useful. Can we use the lock prefix to solve our conundrum? Why yes, we can. The lock-prefixed compare-and-exchange instruction, also called an atomic compare-and-swap, gives us exactly what we need: it conditionally updates a variable, checking whether the variable has an expected value and updating it to the desired value only if it does. How are we going to use the atomic compare-and-swap? Let's see, here's our program again. The first thing we do is the atomic compare-and-swap: only if flag was 0 does it get updated to 1, and this happens atomically. Then we do our thing, we read or write, and when we're done we atomically store flag back to 0. If the compare-and-swap fails, it means flag was not 0, that another thread had set it to 1, and so we loop and try the compare-and-swap again. Pretty simple. Does this actually work? It turns out that it does. This is a simple spin lock implementation, and simple spin locks like this are used all over the place in the Linux kernel. The other thing to note is that these atomic compare-and-swaps are the quintessence of any lock implementation, as we'll see throughout this talk.
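Here is what that spin lock looks like as a sketch in Go, using the sync/atomic package, which compiles down to the lock-prefixed compare-and-swap and atomic store just described. The yield in the loop is only there to be polite to other goroutines; a pure spin lock would simply keep looping.

```go
package main

import (
	"runtime"
	"sync/atomic"
)

// SpinLock is a minimal spin lock: 0 = unlocked, 1 = locked.
type SpinLock struct {
	flag int32
}

func (l *SpinLock) Lock() {
	// Atomically: if flag == 0, set it to 1 and we own the lock.
	for !atomic.CompareAndSwapInt32(&l.flag, 0, 1) {
		runtime.Gosched() // lost the race: yield, then try again
	}
}

func (l *SpinLock) Unlock() {
	atomic.StoreInt32(&l.flag, 0) // atomic store releases the lock
}
```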
Okay, so what is this atomic compare-and-swap going to cost us? That's what we care about. Well, let's just measure it. Here's a simple microbenchmark (this one is written in C): in a loop, ten million times, we atomically store an integer, and we measure how long that operation takes. We see that with a single thread it takes about 13 nanoseconds, and when we have multiple threads doing it, they're all effectively serialized; those are the special guarantees this operation gives us. So they're effectively serialized, and it takes about 12 times as long with 12 threads on my 12-core machine.

Cool, so this is pretty nice: we have a scheme that gets us mutual exclusion, atomicity, and memory ordering guarantees. Should we ship it? Well, there's one problem: we have a thread just spinning, burning CPU cycles, not doing any useful work, which is wasteful because we could be running other threads on that CPU instead. You know what would be really nice? If the thread tried the compare-and-swap and failed, instead of spinning we could just put it to sleep, put it to sleep until it could get flag again. That sounds pretty nice, and luckily for us the Linux operating system gives us a way to do it. Linux has futexes, and futexes are both an interface and a mechanism to sleep and wake threads: they give our programs a way to request that threads be put to sleep and woken up. The interface is a system call, the futex syscall, and the mechanism is a kernel-managed wait queue.

All right, let's just see how it works in practice, too many words. Here we are, back at our example. Now we extend flag to take an additional value: 0 means unlocked, 1 means locked, and 2 means there's a thread waiting, a thread sleeping on that flag variable. Now we have the reader; the reader does the compare-and-swap, and assume it fails. We update flag to 2, to indicate that there's a thread about to sleep on it, and then we make the futex system call to tell the kernel: hey kernel, I'm a thread, I want to sleep until flag changes. The kernel takes care of that for us and eventually resumes us once flag has changed, and once we're resumed, we try the compare-and-swap again.

Okay, now let's switch to kernel land and see what the kernel does. The kernel needs to do two things. First, it needs to store away the information that T1 is waiting on this variable, so it can find it and resume the thread later, when it's time, and to do this it uses a wait queue. We take the userspace address, we convert it to a unique key, because remember, we're in the kernel now, and we create an entry holding the key and the thread. Now, having a wait queue per unique address would obviously be very expensive, so what the kernel does instead is use a hash table: the key is hashed to a bucket, and there is a wait queue per hash table bucket. This means two unique addresses can hash to the same bucket and get added to the same wait queue, but that's not a problem, because the entries store the key as well, so we can still find the right thread.
Okay, so step one, we've stored this information away; step two, we deschedule T1, we put it to sleep. All right, now the writer comes along; the writer finishes whatever it's doing, and it atomically sets flag back to 0, saying, I'm done with it, flag is now free, and it makes the futex system call again, this time to tell the kernel to wake up any threads that were sleeping on the flag variable. What does the kernel do in this case? It finds the right wait queue and walks it, finds the first thread that was sleeping on the flag variable, and reschedules it, wakes it up. Now, this is really nice and convenient. The implementation we saw is an extremely simplified version of a futex implementation (futexes are incredibly tricky to get right), but this is the conceptual idea behind them, and what the futex mechanism gets us is a lightweight way to build synchronization primitives like locks. In fact, the pthread mutex uses futexes.

Okay, pretty nice, but what is this upgrade going to cost us? Let's see. Again we have our microbenchmark; this time we use a pthread mutex, we lock and unlock it in a loop ten million times, and we measure how long each lock/unlock pair takes. This is what we get: in the uncontended case it takes about the cost of that atomic compare-and-swap we saw, around 30 nanoseconds, but in the contended case it takes about a microsecond, because we're calling into the kernel, that syscall, and we're paying the cost of a context switch, which is really expensive. At this point it's worth asking: is it really worth it to sleep instead of spin? It comes down to an amortized cost argument. Spinning makes sense if we're going to hold the lock for a short duration, but the trade-off is that we're burning CPU cycles, so at some point it makes sense to pay the cost of that thread context switch, put the thread to sleep, and run other threads. There are also hybrid futexes, which are really smart futex implementations that mix both: they spin a small, fixed number of times, and only if the compare-and-swap still hasn't succeeded do they make the futex system call and put the thread to sleep. An example of this is the Go runtime's futex implementation; that's exactly what it does.

Okay, so at this point we've evolved from spin locks to futexes to hybrid futexes. Are we done now, can we go home? Well, we could, and in a language like C or Java that has a native threading model, we probably would. But what about a language with userspace threads, like Go with its goroutines? Can we do even better? The whole thing about userspace threads is that they run on top of regular threads: the Go runtime takes care of swapping goroutines in and out on top of regular threads, and the reason to use them is that goroutines run entirely in user space, so they're lightweight and cheap, and a context switch between goroutines takes only tens of nanoseconds, rather than the roughly one microsecond a thread context switch takes. Why does this matter? Because it gives us the opportunity to do something clever: to block the goroutine without blocking the underlying thread. That is to say, if a goroutine tries to get a lock and can't, we can put the goroutine to sleep while keeping the thread running, so we don't pay the cost of that thread context switch.
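Before we look at how the Go runtime exploits this, here is the wait and wake protocol above put together as a rough sketch of a three-state futex lock, written in Go so we can stay in one language. It assumes Linux and the golang.org/x/sys/unix package, omits error handling, and, unlike the goroutine-level approach described next, calling futex directly like this puts the whole OS thread to sleep.

```go
package main

import (
	"sync/atomic"
	"unsafe"

	"golang.org/x/sys/unix"
)

// futexLock: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
type futexLock struct {
	state int32
}

// futex wraps the raw futex(2) system call (errors ignored for brevity).
func futex(addr *int32, op, val int32) {
	unix.Syscall6(unix.SYS_FUTEX,
		uintptr(unsafe.Pointer(addr)), uintptr(op), uintptr(val), 0, 0, 0)
}

func (l *futexLock) Lock() {
	// Fast path: uncontended 0 -> 1, no syscall at all.
	if atomic.CompareAndSwapInt32(&l.state, 0, 1) {
		return
	}
	for {
		// Mark the lock contended; if it was actually free, we now own it.
		if atomic.SwapInt32(&l.state, 2) == 0 {
			return
		}
		// Ask the kernel to put us to sleep while state is still 2.
		futex(&l.state, unix.FUTEX_WAIT, 2)
	}
}

func (l *futexLock) Unlock() {
	// Release; if the state was 2, someone may be asleep, so wake one waiter.
	if atomic.SwapInt32(&l.state, 0) == 2 {
		futex(&l.state, unix.FUTEX_WAKE, 1)
	}
}
```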
Now, this is exactly what the Go runtime does. The Go runtime has an internal semaphore, and these semaphores are conceptually very similar to the futexes we just saw; the big difference is that they're used to sleep and wake goroutines, not threads. Let's go back to our implementation and see how it would work in Go, with semaphores. We have the reader again; it does the compare-and-swap, and if that fails, the Go runtime adds the goroutine to a wait queue, and in this case the wait queue is managed in userspace. What does it look like? Very similar to the wait queue for the futex: again we hash the flag and find a hash bucket; the details of the per-hash-bucket wait queue are slightly different, but they don't matter today. So we store away the information that G1 is waiting on flag, and once that's stored, we can put G1 to sleep: the Go runtime invokes the Go scheduler and the goroutine is descheduled. On the writer side, once the writer is done, it sets flag back to 0, and if there's a waiter, we need to wake it up, so the Go runtime finds the waiter, in this case a goroutine, and reschedules it.

Cool, so this is clever, this is really clever, because it gives us a way to avoid that big, expensive thread context switch. Have I mentioned that it's really clever? Cool. So at this point we could almost be done, but not quite, because there are two problems with this implementation, and both come down to the fact that when a goroutine is resumed, it has to compete with other goroutines. That is to say, G2 is done and G1 is woken up; now G1 has to loop again, it has to try the compare-and-swap again, and other readers could be trying to do the same thing, so it's entirely possible that G1 will lose again. In fact, not only is it possible, it's very likely, because there's a delay between G2 setting flag to unlocked and G2 invoking the scheduler to put G1 back on a thread, to reschedule it. So the semaphore implementation we just saw can result in unnecessarily resuming a waiter goroutine: we wake it up, it competes, it loses again, and we pay the goroutine context switch cost all over again. The second problem is that this implementation can cause goroutine starvation: we have a goroutine that's waiting, we keep waking it up, it keeps losing, and this can cause really high tail latencies.

So what does the Go runtime do about it? It adds another layer of implementation and sophistication on top of this semaphore, and that nicely wrapped-up package is the sync.Mutex that our Go programs use. So finally we can talk about the sync.Mutex. The sync.Mutex is a hybrid lock: it does a compare-and-swap, it spins a small, fixed number of times, and if that doesn't work, it uses the semaphore we just saw to sleep and wake the goroutine. But it's not a simple hybrid lock; it has some additional state tracking that it uses to avoid unnecessarily waking up a goroutine and to prevent goroutine starvation. Again, the details don't matter today; these can be considered performance optimizations. So that's our fancy sync.Mutex.
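To put numbers on it, a microbenchmark along these lines, using Go's testing package, would do (a sketch, not the speaker's actual benchmark; run with `go test -bench=Mutex`):

```go
package locks

import (
	"sync"
	"testing"
)

var mu sync.Mutex

// Uncontended: a single goroutine locking and unlocking in a loop.
func BenchmarkMutexUncontended(b *testing.B) {
	for i := 0; i < b.N; i++ {
		mu.Lock()
		mu.Unlock()
	}
}

// Contended: GOMAXPROCS goroutines hammering the same mutex.
func BenchmarkMutexContended(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			mu.Unlock()
		}
	})
}
```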
So again, the question is: what does this cost us? Again we run a microbenchmark, this time locking and unlocking a sync.Mutex, and we see that in the uncontended case it costs about as much as the atomic compare-and-swap, and in the contended case it takes about 0.8 microseconds. How does this compare to the futex implementation C has? In the contended case, initially, up to a certain number of goroutines, Go performs better than C, and we'd expect that, right, because of its fancy mutex implementation. But as the concurrency level, the number of goroutines, goes up, the difference between them starts to converge, and this is curious. Why does it happen? Let me let you in on a secret. The sync.Mutex uses a semaphore, and the semaphore has that hash table. What if two goroutines are waiting on things that hash to the same hash bucket, so they both need to be added to the same wait queue? It's almost like that hash bucket needs to be synchronized, which means we need a lock, and guess what the per-hash-bucket lock is: it's a futex. That's where the contention comes from: once we have enough goroutines, we hit thread-level contention, and threads end up getting suspended. Now wait a minute, the futex also had a hash table, so can't you have the same problem there? Two threads both trying to update the same hash bucket's wait queue, needing a lock as well? Yes, and in that case the lock is a spin lock. So we've come full circle: the mutex uses a semaphore, which uses futexes, which use spin locks. It's locks all the way down. You know what this reminds me of? Have you watched the movie Inception? This is like Inception, but with locks. It's really wild.

Okay, I could stand here and talk about lock internals all day, but it's time to move on and talk about performance. Let's see if we can take our understanding and apply it to explain, and put numbers on, that spectrum we talked about. We know there's this huge disconnect between uncontended-case and contended-case performance for mutexes, but in the contended case, how expensive it really is depends on how much contention there is. Our microbenchmarks gave us a nice cost breakdown, but they don't answer the question of how performance changes as contention changes. And really, as application developers and systems engineers, what we care about is: how does my application's performance change as the number of threads changes? For throughput, say I have a target throughput I'm trying to reach: how many extra threads do I need to add? For response time, say I have a fixed workload and I want to parallelize it: how does adding more threads improve my response time? These are the questions I care about, and to answer them we turn to theory, to Amdahl's law. You're probably familiar with Amdahl's law; it tells us that the speed-up, the response time, depends on the workload, specifically on how much of the workload is serial versus parallel, and it gives us a formula for that.
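For reference, the formula in question: with s the serial fraction of the workload and N the number of threads, Amdahl's law puts the speed-up at

```latex
S(N) = \frac{1}{\,s + \dfrac{1 - s}{N}\,}
```

As N grows, the speed-up is capped at 1/s, which is why the mostly serial workload in the experiment described next flatlines so early.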
Now, this formula doesn't mean much to me on its own, so let's do an experiment and graph it. We have a program with a fixed workload; the serial fraction of the workload is work done while holding a sync.Mutex, and the parallel fraction is the part that can be parallelized. We vary the parallel fraction from 0.75, so mostly parallel, down to mostly serial; we scale the workload by changing the number of goroutines, from one all the way up to twelve; and we measure, end to end, how long it takes to complete the workload, and graph it. This is what the graph looks like: up to a point, adding threads gives a nice speed-up, but at some point the curve flatlines, so adding threads no longer helps, and in the mostly parallel case that happens much later than in the mostly serial case.

Now, Amdahl's law is good and all, but in practice systems don't just pay a contention penalty; they also pay a coordination penalty that Amdahl's law doesn't account for. A law that does account for it is the Universal Scalability Law, the USL, and it says that most systems exhibit both contention and coordination, so they pay both penalties. Graphing the Universal Scalability Law looks like this, and don't let the formula scare you; all it says is that if we had no contention and no coordination, we'd expect linear scaling: add more threads and the program gets faster. That's not what happens in practice. If you have contention, you get the Amdahl's law curve, and if you have contention and coordination, you get the green curve, which is even lower. Plotting our experiment with the USL, we see the same thing: a mostly serial workload is not so good, its throughput flatlines early; a mostly parallel workload's throughput flatlines much later. Okay, so now we have tools, Amdahl's law and the USL, to analyze performance.

Now let's quickly talk about a few smart locking strategies, things we can do if we hit contention. First things first: profile your program. Like we said, locks have this vast spectrum of performance effects, so they might not be the problem at all; the way to know is to use something like the Go contention profiler, or perf lock if you're on Linux (there are some links and resources), to profile your program and figure out whether lock contention really is the problem. If it is, here are some things you can do.

Strategy one: don't use locks. More specifically, try not to have synchronization in your hot paths. Alternatively, use atomic operations if your critical sections are very small, or use lock-free data structures; the Go scheduler has per-core run queues, and those run queues are lock-free, manipulated by multiple threads without holding a lock.

Strategy two: use more granular, fine-grained locks. One way to do this is to shard your data: if you have a large set of data, break it up and have different threads work on different subsets of it. If you do this, make sure you don't run into false sharing: make sure your data, or the locks guarding each subset of your data, don't collide on a single cache line.

Strategy three: do less work in the critical section. This is the same graph we saw at the beginning of the talk, from our production service. We saw that lock contention caused a huge latency hit, and we learned this by profiling our code with the mutex profiler; the change we made to bring contention back down was to restructure our code to do less computation while holding the lock.
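Going back to strategy two for a moment, here is a hypothetical sketch of sharding in Go (none of these names come from the talk): a counter map split across shards, each guarded by its own mutex, so goroutines touching different keys usually take different locks.

```go
package main

import (
	"hash/fnv"
	"sync"
)

const nShards = 16

type shard struct {
	mu     sync.Mutex
	counts map[string]int
	// In a real implementation you might also pad each shard so that two
	// shards' mutexes don't land on the same cache line (false sharing).
}

// ShardedCounter spreads keys across nShards independently locked maps.
type ShardedCounter struct {
	shards [nShards]shard
}

func NewShardedCounter() *ShardedCounter {
	c := &ShardedCounter{}
	for i := range c.shards {
		c.shards[i].counts = make(map[string]int)
	}
	return c
}

// shardFor hashes the key to pick which shard (and which lock) to use.
func (c *ShardedCounter) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &c.shards[h.Sum32()%nShards]
}

func (c *ShardedCounter) Inc(key string) {
	s := c.shardFor(key)
	s.mu.Lock()
	s.counts[key]++
	s.mu.Unlock()
}
```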
All right, so those are three simple strategies, and it is now time to say goodbye. It's been great speaking with you all. I hope you walk away having learned something about locks, and hopefully with a tool or a strategy you can take back to your programs. Thank you. [Applause]
Info
Channel: Istanbul Tech Talks
Views: 3,624
Rating: 5 out of 5
Keywords: Istanbul, Tech, Talks, Development, Technology, Programming, Coding, Code, Conference, Tech Talk, Developer, open source, Locks
Id: tjpncm3xTTc
Length: 39min 49sec (2389 seconds)
Published: Wed Jun 26 2019