GopherCon UK 2019: Roberto Clapis - Tackling Contention: The Monsters Inside the 'sync.Locker'

Captions
Hello everyone. A little bit of introduction: I work on security for the web at Google, which is totally not related to my talk. I've been doing parallel programming for a while, and I've been into Go for years now. Before starting my talk I would like to introduce the agenda: we're about to talk about contention, what it is, how to measure it, how to reduce it, and some surprising facts about it.

Contention is competition between processes to acquire a resource. That doesn't only happen in programming; it happens everywhere. And since I'm a scientist, when I need to face a new beast I want to measure it. Go is very good at helping you write parallel code and concurrent threads of execution. The problem is that when you access the same data, and you need to share something across those threads, you might encounter issues, and it's not good when you have to deal with them. So let's take a look at the tools we have.

First, there is pprof. pprof is very good as a tool. The documentation is not excellent, but it allows you to do many things fairly intuitively, and the UI is about as good as Go UIs get. When you run your benchmarks or your tests, you can just add a flag, -cpuprofile for example, and give it a file; you can later use that file to understand what's going on. Alternatively, you can import a package that sets up listeners to collect your profiles live. I strongly suggest doing that, because the overhead is minimal and it gives you real data on your real problem; otherwise you might end up measuring the wrong thing.

Today I have a small snippet of code, just as an example: I spawn count goroutines (a fairly small number, depending on your cores), and then, a hundred thousand times each, I acquire a lock, increment an integer, and release the lock. I ran the profiler on this piece of code and it gave me a lot of information. The most important part is the root of the tree: it says that I spent 0.04 seconds in my function, doing what I actually wanted to do (incrementing the variable), and 2.41 seconds outside of it, doing something else. I don't know about you, but I don't like it when my code spends 1% of the time doing the work and 99% doing something else.

To understand what is going on, you need to dig a little deeper and look at the rest of the profile. It looks like a lot of output, but most of the time you won't care about the runtime intrinsics; I hope you never have to go there, because otherwise it gets really complicated. The breakdown told me that I was spending all the time acquiring and releasing the mutex. That is suboptimal, I would say. This is just a toy, but it's an oversimplification of something I saw in real production code. And if you're really into fancy UIs, you can also get a flame graph by running the command-line tool and selecting it from the menu.

That was the CPU profile: the amount of time spent with each function on the stack. Sometimes this is not enough, because maybe you're actually doing work, so your mutexes don't take enough time to show up in the CPU profile, yet you still spend a lot of time on them. For that there is the mutex profile, which highlights time spent acquiring locks. It's pretty good, because it can tell you the exact line at which you spend the most time trying to acquire a lock. Sometimes the time shows up in the Unlock function; that is a known quirk and it is intended, because the profile only tells you which mutex you're contending on, not how, when, or where.
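Below is a minimal sketch of the kind of snippet described above: count goroutines hammering one mutex-protected counter, with the standard library's mutex profiler turned on. runtime.SetMutexProfileFraction and pprof.Lookup("mutex") are real stdlib calls; the program structure itself is a guess at what was on the slide.

    // Toy contended counter with mutex profiling enabled.
    package main

    import (
        "log"
        "os"
        "runtime"
        "runtime/pprof"
        "sync"
    )

    func main() {
        runtime.SetMutexProfileFraction(1) // sample every contention event

        var (
            mu sync.Mutex
            n  int
            wg sync.WaitGroup
        )
        count := runtime.NumCPU()
        for i := 0; i < count; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := 0; j < 100000; j++ {
                    mu.Lock()
                    n++ // the ~1% of time spent on actual work
                    mu.Unlock()
                }
            }()
        }
        wg.Wait()

        // Dump the mutex profile; inspect with: go tool pprof mutex.prof
        f, err := os.Create("mutex.prof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        pprof.Lookup("mutex").WriteTo(f, 0)
    }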
In this case it was fairly simple to understand. If pprof is not enough, there is another good tool, which is better than pprof and has even worse documentation, to the point where there is no documentation at all, so you can't really tell whether it's bad; it's just not there. But it gives you a lot more options. There are blocking profiles, like the ones I showed you before, just rendered in a browser. And one of the most important features I found is regions: you can import the trace package and begin and end regions, so you can laser-point at exactly what you want to measure (a sketch follows below). You then get a breakdown of the total time spent in the region, how much of it went to sync blocking, which is what I'm going to talk about, and also how much time you wasted in the scheduler, because if you acquire a lot of locks, or send a lot of things over channels, the scheduler is going to suffer under the pressure of your actions. If you don't have the time or the will to set up trace regions, you can also get an automatic breakdown based on functions, and most of the time a function is what you care about.

One of the most important things to notice is that you'll usually have a lot of numbers, and some of them will be much greater than the others. That is tail latency. When you see that one of your goroutines took, say, four seconds while all the others took ten milliseconds, you need to look into it. I don't know about you, but as a user you never notice when things go smoothly; when one thing goes slow, that's when you open a bug report, that's when you get angry and try other products. So you want to know the longest time you waited, and why.

There is also this tool, the trace viewer. Not only does it have no documentation, it also sometimes breaks, it only works in Chrome, and it is very unintuitive to use. What you're looking at is a masterpiece. The reason this tool is good, and the reason I still use it, is not just that I want to suffer: it lets you know when you're not doing work. pprof will never tell you when you did nothing; pprof only counts which functions are on the stack at any given time. This view tells you that processor zero and processor one were there, and one of them was doing nothing at all at that moment. This is why the trace tool is good, and the colors are amazing.

That said, try not to measure too much. The Heisenberg uncertainty principle is still valid in Go: if you try to measure too much, adding regions everywhere and profiling everything, you end up measuring the measuring tools. At one point I was trying to make the previous screenshots precise enough, because I knew the real value, and at the end of the day I realized I was just benchmarking the trace.StartRegion function. So do not overdo it. Still, you should always measure: you should know what your bottlenecks are, and you should not take any action, including everything in the rest of this talk, unless you know contention is a problem. Sometimes it will be, sometimes it won't; the tools work either way.
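The regions feature mentioned above lives in the standard runtime/trace package. Here is a small sketch; trace.Start, trace.StartRegion, and Region.End are the real stdlib API, while doWork is an invented stand-in for whatever you want to laser-point at.

    // Wrap a hot section in a trace region, then inspect with:
    //   go tool trace trace.out
    package main

    import (
        "context"
        "log"
        "os"
        "runtime/trace"
        "time"
    )

    func doWork() {
        time.Sleep(10 * time.Millisecond) // placeholder for the real work
    }

    func main() {
        f, err := os.Create("trace.out")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        if err := trace.Start(f); err != nil {
            log.Fatal(err)
        }
        defer trace.Stop()

        region := trace.StartRegion(context.Background(), "hotSection")
        doWork() // only this span shows up under the region breakdown
        region.End()
    }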
Now, how do we reduce contention? There are two main ways: change the algorithm, or change the primitives you are using. I strongly advise the first one. The typical structure of a Go pipeline program has producers and consumers; sometimes you use the actor pattern. Whatever you do, you have some inputs and someone reading from those inputs. It could be a channel, it could be a structure locked by a mutex, it could be anything. That is the way you see it in your mind when you design the code, but the hardware sees two bottlenecks, and you don't want your hardware to feel like that, because you care about it, mostly because you're paying for it.

Channels, just to give you a little background, are a struct with a mutex in it. Every time you do anything on a channel, except reading its length, you acquire that mutex. So every time a channel is widely shared in your program, you are contending on that one mutex.

One way to address the problem is to decentralize. Do not use a central state. Full stop. If you write your code so that you can pass state into the functions you're calling, then instead of creating one single shared state you can create multiple ones, and have something on the side that synchronizes the values from those states only when you need to read them for real. This works when you know how many producers and consumers you have, the ratio, so you can break the state down up front (see the sharding sketch below). Beware that it only really works if the jobs take roughly the same time; if you have very long-running jobs, one of the workers might starve, so you need to know the data you're processing and how you process it. Done right, this divides contention by the number of states you have, which is pretty good.

The other thing you can do is batch. Instead of touching the shared state on every update, you buffer the updates and send a whole batch of work over the channel. This works well if you only care about eventual consistency and you don't care about latency. If you care about actual consistency, it's not good: if you're writing software for a bank, for example, you want to be sure you don't spend money twice. Keep that kind of thing in mind.

Many times I see people switching from channels to mutexes because they're faster. They are slightly faster, that's true, but consider that mutexes also give you a lot less: you have to handle the buffers manually and you have to think hard about deadlock situations.

Also, when you shard state this way, so you have multiple states, there is a Zipf law that will haunt you. One example: you have a dictionary, and you realize the lock on that dictionary is contended. So you say, okay, let's break it down; the keys are strings, so let's shard by the first letter of the alphabet. Instead of one lock I now have twenty-six, each shard with its own mutex, so I've reduced contention. Then you measure and you find out that nothing changed. Why? Because your dictionary is being used by a Java indexer for packages, and all the Java package names start with "com". You did a lot of work for nothing. So be aware that even if your data looks evenly distributed, the accesses might not be.

If you have a high read-to-write ratio, you can use a read/write mutex. It's a little more complicated and slower than a normal mutex, but if you have a lot of reads and few writes it's actually good, and it's fair. Fairness is important, and that's why I always say: do not try to do better than mutexes when you need to lock a state.
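As a rough sketch of the decentralize-and-shard idea, here is one contended counter split into per-shard counters, each behind its own mutex, synchronized only when a real read is needed. The shardedCounter type and its methods are invented for illustration, and the caveat above still applies: uneven, Zipf-distributed access can defeat the sharding.

    // A counter sharded across independently locked slots.
    package main

    import (
        "fmt"
        "sync"
    )

    type shard struct {
        mu sync.Mutex
        n  int64
        _  [48]byte // pad to a 64-byte line; see the false-sharing part later
    }

    type shardedCounter struct {
        shards []shard
    }

    func newShardedCounter(n int) *shardedCounter {
        return &shardedCounter{shards: make([]shard, n)}
    }

    // Inc touches a single shard, so writers with different ids
    // rarely contend with each other.
    func (c *shardedCounter) Inc(id int) {
        s := &c.shards[id%len(c.shards)]
        s.mu.Lock()
        s.n++
        s.mu.Unlock()
    }

    // Total synchronizes the per-shard values only when a real read
    // is needed.
    func (c *shardedCounter) Total() int64 {
        var sum int64
        for i := range c.shards {
            c.shards[i].mu.Lock()
            sum += c.shards[i].n
            c.shards[i].mu.Unlock()
        }
        return sum
    }

    func main() {
        c := newShardedCounter(8)
        var wg sync.WaitGroup
        for id := 0; id < 8; id++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                for j := 0; j < 100000; j++ {
                    c.Inc(id)
                }
            }(id)
        }
        wg.Wait()
        fmt.Println(c.Total()) // 800000
    }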
Mutexes are like this: if you're in a situation where you don't have contention, the entire code path that gets executed when you Lock and Unlock is tiny. That's it: a compare-and-swap atomic instruction, pretty quick, roughly forty cycles, and an atomic add on an int32, still about forty cycles. You can afford those forty cycles, believe me.

That said, we now enter the danger zone. Up until now I'm confident that every one of you was able to follow me, understand my reasoning, and will probably be able to apply it, and you could even stop here; it's not necessary to go on. But let's take a look at this piece of code. It's generic code, sorry about that. Let's say I contend on an integer type, and that type might be one of the three types you see on the right. What I do is spawn one goroutine per core, and each of them writes to a different position of a slice. The slice is shared across cores: if I have eight cores, it has eight elements, and each goroutine keeps writing into its own element. No locks; they access different regions of memory, so none are needed. With the smallest type a goroutine writes a lot less data than with the biggest one, since the values are smaller, and they all loop the same number of times.

So, who would say that writing less data is faster? No one? Someone, okay. Who would say that the size in between is faster? Okay. I'm assuming the people who didn't raise their hands were for the third option, because you're really following me: writing more data is, in this case, actually faster, at least on 64-bit machines. And this is where I get nasty, because if you're on a 32-bit machine instead, the people who raised their hands second were actually right.

The reason is that when you code, you usually like to think about a perfect machine that doesn't have any hardware underneath it, but your machine has hardware. This is how your CPU looks at memory: there are the cores, each one sees an L1 cache on top of it, and that is the only memory it can access directly. If a core needs something that is not there, it asks the cache to fetch the value from another core's cache, and if it's not there either, it has to go through further cache layers. That can waste thousands of cycles, and I'm not kidding: accessing main memory on my machine takes twelve hundred cycles, and you can do a lot of computation in that time. You have to keep this in mind.

To solve this issue, you usually pad. If you have a struct holding booleans that are accessed concurrently, you can just add a lot of underscore fields. Look at it: isn't it beautiful Go code? Please add comments if you do this; never apply this kind of optimization without comments, because it makes it impossible to understand what was going on in your mind. Another thing you can do is use bigger slices and skip items: just advance your index by the size of a cache line. There is a way to get the word size of the architecture you're running on, and I've seen people use that as a reference: if a word is 64 bits, you can use reflection to see the size of an integer, or do magic bit shifting. But that's not reliable: you might be on a 64-bit machine and have 128-byte cache lines. The assumption usually holds; it is not always the case.

All of this was false sharing, but there is also true sharing.
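Here is a hedged sketch of that experiment and the padding fix: one goroutine per core, each writing only its own slice element. The 64-byte line size is an assumption, as just discussed; on machines with 128-byte lines the padding would need to grow.

    // Per-core counters, padded so each one sits on its own
    // (assumed 64-byte) cache line, avoiding false sharing.
    package main

    import (
        "runtime"
        "sync"
    )

    type padded struct {
        n int64
        _ [56]byte // pad to 64 bytes so neighbours map to different lines
    }

    func main() {
        cores := runtime.NumCPU()
        counters := make([]padded, cores)

        var wg sync.WaitGroup
        for i := 0; i < cores; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                for j := 0; j < 1000000; j++ {
                    counters[i].n++ // no locks: distinct elements, distinct lines
                }
            }(i)
        }
        wg.Wait()
    }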
You know the read/write mutex: who has used a read/write mutex at least once? Good, I see a lot of hands. In theory you think: if I read, I'm just reading. Good assumption? You're not. When you take a read lock, you need to update the lock itself, telling it how many readers there are, so you are writing to memory. A read is actually a write, and that write invalidates the caches. If you're contending a read lock across hundreds of cores and they all try to read, you're going to thrash main memory: every access to a variable that hasn't even changed goes out to main memory, and you find yourself wasting a lot of time writing just in order to read. Suboptimal, I would say. I don't like these kinds of things.

There are ways to get around this. There is one specialized structure in the sync package, which is sync.Map. It has methods similar to the built-in map, but there isn't really a lock on it. It's quite a bit bigger, and, you know, when you pass things through an empty interface you should always be aware that you're doing something not quite right; you should feel bad. And that's assuming you write wrappers to access it: until we get generics, you should really wrap your empty-interface maps with types (a sketch follows below).

Even if you do that, there is a problem with sync.Map. If you blindly use sync.Map just because you need to access data across threads, consider what every access costs: you read into a map, and the map gives you pointers to structs with an atomic value that points to a struct with a map that points to a map header that points to a hash table whose elements are themselves pointers to atomic pointers to interface values that point to concrete values, which may themselves be pointers. So if you're really worried about accessing main memory, either you make sure all those pointers are very close to each other, or you're just wasting your time. Sometimes sync.Map is better, but you really need to benchmark for it.

And sometimes sync.Map is not what you need; even the previous example wasn't that good. There are rare cases in which you read a lot, write rarely, and it's not a map key you're accessing. In those cases you might want to use atomics. Atomics have two main problems. One: they drive you insane. That's a big downside, but you can probably deal with it; after all, you're in tech, so you're already familiar with the concept. The second problem is that when the race detector sees an atomic operation, which emits a memory barrier, it gives up: the race detector will not detect a data race that goes through an atomic. That's it, full stop. You're alone, in a jungle, on fire, with sharks. Sharks in a jungle, well.

So sometimes you have to reason about atomics (there was a talk about Prometheus and its atomics) and you need to reason a lot. Writing atomic code is not about writing the code; it's about spending hours thinking about it, writing it, realizing you made a mistake, and going back again. Real experts in the field have a hard time thinking about atomics. But since some of you are probably experts, and some of you like atomics and know all the infrastructure and everything that happens in the hardware down to the electron level, you might want to do this yourself. In that case I have a very comfortable snippet for you, an example of how to deal with reading and writing a shared integer. The read is very simple: you do an atomic load, and there is a typed variant based on what you're loading. That part is easy. The write is where it gets a little more complicated.
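And here is a small sketch of the typed-wrapper advice from a moment ago, hiding sync.Map's empty-interface API behind concrete types until generics arrive. The userCache type and its method set are invented for illustration.

    // A typed wrapper around sync.Map: callers never see interface{}.
    package main

    import (
        "fmt"
        "sync"
    )

    type User struct{ Name string }

    // userCache behaves like a concurrent map[string]User.
    type userCache struct {
        m sync.Map
    }

    func (c *userCache) Store(id string, u User) { c.m.Store(id, u) }

    func (c *userCache) Load(id string) (User, bool) {
        v, ok := c.m.Load(id)
        if !ok {
            return User{}, false
        }
        return v.(User), true // the type assertion lives in one place
    }

    func main() {
        var c userCache
        c.Store("42", User{Name: "roberto"})
        if u, ok := c.Load("42"); ok {
            fmt.Println(u.Name)
        }
    }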
If you need to update the value, say you want to increment it, you load it, you change it, and then you have to bet that no one updated it in the meantime. So you do a compare-and-swap, which is an atomic instruction: if this is still the old value, update it with the new one. If you succeed, you're done; otherwise you do it all again. As you can see, if your assumption that this value is rarely written is wrong, you might take down your entire system, because this spin loop will take everything down as soon as a garbage collection happens. And if you just use a plain store instead, you're simply wrong.

Even if you use this snippet correctly, there are two problems. The first is the ABA problem: you read the value, someone else updates it, and then someone else updates it back to the original value. Something happened in the meantime, you didn't see it, and when you do the compare-and-swap you accept it as a valid operation. Depending on your logic you might not care about missing an update, but usually, if you've reached for this kind of code, you want to know when something happened in between. The second problem is that on some platforms integers have to be aligned for atomic operations to function; otherwise you atomically update only half of the value. And if you ever reach the point in your life where you're debugging code and you see that half an integer was updated, I don't know when it happened, but you probably took a wrong decision somewhere.

A little bit of history. There were two very scary bugs in 2011, which for newcomers might sound like ancient times, but there was a time when the runtime and the garbage collector used locks, and they were switched to a lock-free implementation. Two bugs surfaced, found in September and July 2011. One report said that the bug "caused some random looking memory corruptions". I don't know what you dream about at night, but that is nightmare material for me. And there is so much going on in that statement that I'm just going to show the code. This is from the time when the runtime was written in C. Can you spot the race? It's just four lines of code, and you basically don't need two of them. In this if statement there was a race condition that caused WaitGroup to wake up too early in some cases: when someone added a new waiter, a new thing that should be waited upon, the wait group could still wake up anyway. The race is right there in those reads, and the fix is amazing: they copy the value before they actually use it. When you read this, and the fifty-comment-long discussion on it, you realize you should probably stay away from this stuff.

If there is one takeaway from my talk, it is this: change your algorithm, measure your stuff, then change your algorithm again, and measure again. When you're satisfied, you're fine. Then maybe switch to mutexes. But don't go there. Just don't. And that was it. Thank you.
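For reference, here is a minimal sketch of the load/compare-and-swap loop described in that closing section, assuming a rarely written shared integer. For a plain increment, atomic.AddInt64 would be simpler; the CAS loop is shown because it is the read-modify-write pattern the talk walks through, ABA caveat included.

    // Atomic read, plus a CAS retry loop for read-modify-write.
    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // A package-level int64 is 8-byte aligned, which matters for
    // 64-bit atomics on 32-bit platforms (the alignment issue above).
    var shared int64

    func read() int64 { return atomic.LoadInt64(&shared) }

    // increment retries until nobody raced our update. If the value
    // is written often, this degenerates into the spin loop the talk
    // warns about, and CAS cannot see an A->B->A change (ABA problem).
    func increment() {
        for {
            old := atomic.LoadInt64(&shared)
            if atomic.CompareAndSwapInt64(&shared, old, old+1) {
                return
            }
        }
    }

    func main() {
        increment()
        fmt.Println(read()) // 1
    }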
Info
Channel: GopherCon UK
Views: 693
Rating: 5 out of 5
Keywords: computer science, technology
Id: KUC5WtbBdFA
Length: 25min 36sec (1536 seconds)
Published: Wed Oct 02 2019