GopherCon 2019: Daniel Martí - Optimizing Go Code Without a Blindfold

Captions
So I'm here to talk about optimizations, because it's fun and because I think this picture is great. But wait: the first question you should ask yourselves is, is your program actually slow? By which we mean, do you think it could go faster? There's a meme in here somewhere. And lastly, the most important question: is it worth optimizing, and spending your time doing that?

Enter a silly example. We've got a function that simply copies a list of strings: it creates a new list of strings and then appends the elements from the old one into the new one. Pretty simple. Now we create a benchmark that takes a static list of strings and reports the allocations for copying that list of strings N times, and also how long each of those copies takes. We can run that benchmark with go test, and we can see that it takes about 5,000 nanoseconds per run, does around 10 allocations, and creates some garbage, as we would expect.

We can grab a CPU profile as part of this benchmark, and if we use pprof to see what's happening in this function, we can see that most of the CPU time is spent in the append. What we can do instead is pre-allocate the slice and then just assign each of the elements. What happens then, as you might expect, is that we only allocate once and it's much, much faster. It's less convenient, but it's faster. So we're done, right? Thanks for listening. But this is a silly example, after all.
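As a rough reconstruction of that example (the function and benchmark names here are my own guesses, not the exact code from the slides), the two versions and the benchmark could look something like this:

    package copydemo

    import "testing"

    // copyStringsAppend builds the copy by appending, which grows the
    // backing array several times and allocates repeatedly.
    func copyStringsAppend(src []string) []string {
        var dst []string
        for _, s := range src {
            dst = append(dst, s)
        }
        return dst
    }

    // copyStringsPrealloc allocates the destination once, then assigns.
    func copyStringsPrealloc(src []string) []string {
        dst := make([]string, len(src))
        for i, s := range src {
            dst[i] = s
        }
        return dst
    }

    var input = []string{"foo", "bar", "baz", "one", "two", "three", "four", "five"}

    // sink keeps the results alive so the compiler cannot discard the work.
    var sink []string

    func BenchmarkCopyAppend(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            sink = copyStringsAppend(input)
        }
    }

    func BenchmarkCopyPrealloc(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            sink = copyStringsPrealloc(input)
        }
    }

Running both with go test -bench=Copy shows the append version allocating on every growth step, while the pre-allocated version allocates exactly once.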
So, enter a JSON benchmark. If you have ever done JSON in Go, you might know it's a bit of a controversial topic: it's a good package, it's not a great one. I'm a maintainer and maybe I shouldn't be saying this. There's a benchmark called CodeDecoder which decodes a fairly large JSON file into a struct. If you run it, you can see that it runs at about 200 megabytes per second and that it only managed about a hundred iterations; that ns/op number has a lot of zeros in it, so this benchmark is slow. But the bigger problem is that it's not going to stay still: if you run it a bunch of times, the relative speed of this benchmark jumps up and down by as much as four percent. That's a problem, because recent JSON decoder speed-ups have been as small as 1.3 percent. How can you possibly measure progress with that much noise?

The answer is math, or rather statistics. Please take what I say with a grain of salt; I'm going to remain practical, because I only took a couple of statistics classes at university, that was a long time ago, and I was asleep for most of them. So I'm not going to talk about theory, I'm mostly going to talk about how you can use statistics. The idea is that you want multiple samples, and with those you can measure variance. So instead of using benchcmp, which just compares two numbers directly, we're going to use a tool called benchstat, which I believe was initially written by Russ Cox. What we do is use go test's -count flag, which gives us a number of measurements. In this case, if we take eight measurements of this benchmark, we can ask benchstat for stats about those numbers, and it tells us the benchmark takes roughly 10 milliseconds per run, plus or minus three percent. That's your variance.

But we still need less noise, and what we should think about is: is the machine actually idle? You should meet my work friends. I tend to have a couple of browsers open, a couple of Slack windows, because it's 2019, a couple of editors, my music, maybe email, and so on. So my CPU usage when "idle", quote unquote, sits somewhere between 0 and 15 percent, but this benchmark depends on all of my CPU. If Slack suddenly decides to do something, my benchmark is going to get slower. And here's a fun fact about Slack, and it's just an example, I'm not trying to hate on it: at least until recently, the way it rendered GIFs was on the CPU. Each of those dancing badgers, which we use a lot in the tools channel, consumed about 2% of my CPU. So if somebody, say Paul Jolly, sent me 50 badgers at once, that's all of my CPU being burned, and I wouldn't even notice otherwise. So you should close resource-hungry apps and get your idle CPU usage down to a few percent at most; that's OK-ish, at least for a laptop. If we rerun the same benchmark eight times, the variance is now down to about 1%, and we can work with 1%.

But we still have another problem, at least with laptops, and it's that the laptop gets really hot. With my laptop, after about 10 seconds or so of running at full speed, it gets much slower. Why is that? The reason is that laptops throttle, at least modern ones, and I've got some evidence for that. This is my laptop with my blue Gopher on top of it, and you can see that the air vents are tiny, even smaller than the Gopher itself, yet this laptop has four cores that can go pretty fast. So in practice, whenever I try to benchmark or measure anything with my laptop and no extra tooling, my laptop practically sets itself on fire. The idea here is that we cannot use turbo speeds, the maximum CPU speed, because the laptop cannot maintain that speed for a long period of time.

There's a tool for that too, written by Austin Clements, who is also on the Go team. It's called perflock, and it does what you might think from the name: it locks your CPU to a certain performance level. You run the daemon, which needs sudo because it needs root, and then you tell it "please run this command at a certain maximum CPU speed". In my case I found that 70% is fast enough to be realistic, but slow enough that my laptop doesn't heat up and then throttle; it still heats up, but the fan can keep up. After 20 or so runs you can see that it's consistently slower, but it's consistent. There's a small caveat: this only works on Linux. It should be portable to Mac and Windows, and if any of you want to spend time on that, please go talk to Austin; you can probably flag him down somewhere at the conference.

If we gather eight measurements before and after without actually changing anything, we should see that even though the averages might change, benchstat is smart enough to tell us the variance is too high, so probably nothing actually happened here: there's no significant change, and that's the delta with the ~ symbol that you see there.
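Reconstructing that workflow as a terminal session (the exact perflock flags here are from memory and may differ; check perflock -help and the benchstat documentation):

    # start the perflock daemon once (needs root)
    sudo perflock -daemon &

    # take 8 measurements at a fixed 70% CPU speed, saving them to a file
    perflock -governor 70% go test -run='^$' -bench=CodeDecoder -count=8 | tee old.txt

    # mean and variance of those runs
    benchstat old.txt

    # ...make a change, gather new.txt the same way, then compare;
    # benchstat prints the delta, or ~ if the change is not significant
    benchstat old.txt new.txt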
Now you might have some questions, such as: what count should I use, 8, 10, 20? Or: what's a p-value? I'm not going to delve into the details here, but here's how I think about it pragmatically: if you've got high variance, you want a higher count to get good results, or at least meaningful results.

Here's a visual aid. Suppose each of these little fat Gophers is a data point: if it's higher up, it's a slower benchmark run, and if it's lower, it's a faster one. You run the benchmark three times before and three times after your change, and you're wondering, did I actually improve anything? If you've only got three points, maybe with one of them you just got unlucky and Slack was rendering some GIF or something like that; if a single one of those points on the left were a tiny bit lower, the result might be completely different. What if you have ten? With ten, as a human you can sort of see that there's a cluster of Gophers further up on the left and a cluster further down on the right, and it's unlikely that you got unlucky in that many cases at the same time. That's roughly the intuition behind the p-value.

But here's a gotcha: you shouldn't go searching for p-values. This is called the multiple testing problem, which is just the technical term for it, but the gist is this. Remember that benchstat compares sets of runs: if you keep re-running the benchmark over and over without actually changing anything, the numbers keep jumping up and down. Most of the time benchstat is smart enough to tell you that nothing actually changed, but if you keep hammering it, then, because that's how statistics work, about 5% of the time you can get a false positive result. In my case I ran it five times, and around the fifth run it told me that maybe something had improved, but I hadn't actually touched any of the code. So if the data looks bad, don't just keep collecting new data until you're happy with it, because then you run into exactly this problem. I have done that many times before, and Austin wasn't happy with it. And here's an xkcd, because obviously there's one that sort of explains this: a company gets its scientists to test the relation between different colors of jellybeans and acne, and they run the test with so many combinations that statistically one of them shows a positive result. It gets published, and it says 95% confidence, but the joke is that they ran it so many times that that's just how statistics work; it's a false positive.

Here's a small side note that I think I should mention, and that's bottlenecks. I've been talking a lot about CPU load, and tools like perflock work purely at the CPU level, but statistics, in this case benchstat, work for all benchmarks, even if your benchmark is measuring network activity, disk activity, syscalls, or anything else. So, to recap: use benchstat to compare runs and get statistics out of them, and perflock if your machine is too noisy or heats up too much.

Now we get to the fun part, and please don't worry if you can't follow every single slide; we will publish the slides later so you can try these commands yourselves. The first trick that I think is quite useful: you can ask go build, with the -m flag passed twice, which function calls could have been inlined but weren't, and why. In this case, in the io package, a call to copyBuffer was not inlined because the function was too complex. So maybe that function is really hot and you want it to be faster; maybe you can tweak the code a little to get the compiler to do what you want. I'm not saying you should do that, but you can, and it's interesting. Another thing the -m flag can tell you is when expressions escape to the heap, which is when they cause allocations. Allocations are not a bad thing per se, but if you've got a very hot function that generates garbage and you're trying to figure out where that garbage comes from and why, maybe this can help you a little bit.
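To try these reports yourself, a tiny file like the following (entirely my own, purely for illustration) can be built with go build -gcflags='-m -m' to see the inlining and escape-analysis decisions:

    package inlinedemo

    // small is simple enough that the compiler typically inlines calls to it;
    // -m -m reports the decision and, for functions that are not inlined,
    // the reason why (for example "function too complex").
    func small(x int) int { return x * 2 }

    // escapes returns the address of a local variable, so -m reports that
    // x escapes to the heap, which means it causes an allocation.
    func escapes() *int {
        x := small(21)
        return &x
    }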
There's also BCE, which stands for bounds check elimination. A bounds check in Go happens, for example, when you index into a slice: if your index is out of bounds, the runtime is going to panic, because you're not allowed to do that, so the compiler inserts a check whenever you do that kind of thing. That's what a message like "Found IsInBounds" means: that kind of check being added at that position in your code. You can also ask for information from the prove pass, which tries to prove things like "this slice bounds check isn't actually necessary, because I know statically that the index is always going to be within bounds". It can tell you the cases where it is removing those bounds checks, and if you increase the debug level it tells you even more; it might not all be useful to you, but maybe you'll find it interesting.

There's another gotcha I also want to cover, and that's when code suddenly gets slower or faster and you don't know why. Here's an example, and I hope Russ Cox is not in the audience; it's from 2011, in the JSON package (you might start seeing a trend here), and it says: "I cannot explain why BenchmarkSkipValue gets faster. Maybe it is one of those code alignment things." He changed some part of the JSON codebase, and then this benchmark, which is entirely separate, suddenly got faster. Why is that? The short answer is that modern computers are complicated, and in this case code alignment could affect the CPU cache. I don't know much about hardware, so I'm going to leave it at that, but the useful bit of information here is that sometimes you just cannot tell why a computer is doing what it is doing.

The compiler is getting better, though. For example, it used to be that the best way to empty a map was to just create a new one. That was fine and fast, but it threw away all of the allocated memory and allocated an entirely new map, so it generated extra garbage. Since Go 1.11, if you range over the map and delete all the keys, that is compiled into a call to an efficient runtime function which clears all the elements without allocating a new map. It also used to be that, without using any standard library package, the only way to efficiently count the number of runes in a string was to iterate over that string; but since Go 1.11, if you take the length of the rune slice of a string, it's not going to allocate the slice, it's simply going to give you the number of runes. So please give the compiler a chance, and if you think it could do better, please file bugs so that it actually can get better. There's a label on the GitHub issue tracker you can use for this, called Performance; there's quite a lot of stuff in it, but it's good to be able to search for.
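The two compiler-recognized patterns mentioned above look like this in code (a sketch; the function names are mine):

    package patterns

    // clearMap: since Go 1.11, this range-and-delete loop is recognized by
    // the compiler and turned into an efficient runtime map clear, reusing
    // the existing buckets instead of allocating a new map.
    func clearMap(m map[string]int) {
        for k := range m {
            delete(m, k)
        }
    }

    // runeCount: since Go 1.11, len([]rune(s)) does not actually build the
    // rune slice; the compiler rewrites it into a rune-counting loop over s.
    func runeCount(s string) int {
        return len([]rune(s))
    }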
If you want to go deeper, and this is the last trick I'm going to cover, there's an environment variable called GOSSAFUNC. If you give it a function name and then build the package, for example GOSSAFUNC=helloWorld go build for a hello-world function that just prints something with println, it generates an ssa.html file with information about how the compiler built that function. If you open it there's a lot of information, so I'm just going to skim through a few bits. The first thing it gives you is what the compiler, quote unquote, "starts with" when compiling the function, in SSA form, and at the far end of it is the assembly it generated for that function. The interesting bit is that, between those two, it shows you all the optimization passes and all the changes to the SSA that the compiler made. So if you expected the compiler to optimize a piece of code but it didn't, maybe this generated page can point you at where things went wrong. And if you want to learn more about the compiler, there are a couple of README files you can read; they don't go into detail, but they give a good introduction and ideally point you to other places where you can learn more about specific things.

Cool, so now we get to the demo. I'm going to stop talking, because I know that's not particularly interesting, and this is a demo involving benchmarking and pprof, so it is likely to go wrong; please don't be too harsh. We are in a clone of Go 1.12.7, built from source, and I want to improve the performance of the JSON decoder, so I go to the source of the JSON package. There's a benchmark I like using quite a lot, called CodeDecoder; it's the one I showed earlier. What it does is decode a very large JSON file into a data structure: just a struct with a tree-like structure inside it, where each tree element is a struct itself. So it's basically decoding structs and floats, and that's pretty much it.

The first thing we do is start the perflock daemon with sudo, so that we can use it. Then we get the first benchmark numbers: go test with -run='^$' so we don't run any tests, just the benchmark, -bench=CodeDecoder, -vet=off because vet now runs by default and we're not interested in it, under perflock at 70% (if I use anything more than that, my laptop might set itself on fire), -count=6, which I think is a decent number to start with, and finally we tee that into a file. Once that finishes, we run benchstat on the resulting file, and we see it runs at about 12.7 milliseconds per operation, plus or minus 1 percent.

We want to improve this benchmark, so one thing we can do is use a pprof profile to see where the time goes. One way to do that is to use the benchmark itself to get a profile: go test, much the same as before (we don't strictly need perflock for this, but sure, why not), running the benchmark just once, but for a bit longer than the default one second, say -benchtime=10s, and with -cpuprofile=cpu.out. This takes 10 seconds, and once it finishes we have a CPU profile that tells us where the benchmark spent its CPU time.
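The profiling step, again reconstructed from memory; depending on your Go version you may also need to pass the test binary (json.test) to pprof, and the web UI port is arbitrary:

    go test -run='^$' -bench=CodeDecoder -benchtime=10s -cpuprofile=cpu.out
    go tool pprof -http=:8080 cpu.out    # opens the graph, top and source views in a browser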
So we run pprof on it; oh, it opened somewhere else; great. The first view it gives us is a graph; whoa, this is very sensitive, okay, I'm just not going to zoom any more. You can see that most of the time is spent in these big boxes, which are functions. One piece of information that I find really useful is "top", which shows where most of the CPU time is spent per function; you can see, for example, the flat percentage of time spent in each function, such as scanWhile. You want to look at the functions that take a lot of CPU time and try to see how you could optimize at least one of them. Then we can go into the "source" view, which gives us the source for each of these top CPU users.

I could search one by one and see which line I could optimize. For example, this line takes 2.74 seconds out of the ten seconds, which is a long time, but I already tried to optimize that and it didn't work, so we're just going to skip past it. I know ahead of time that there's something we can do here, though. The first number is how much CPU time the function itself uses, and the number on the right is how much time the function and all its sub-calls use. We can see that this function uses about 1.6 seconds in total, which seems like a lot, because the ones around it don't use much time, and look: it's the function that finds which field to decode into in a struct. That's something you do a lot when you decode a struct; every time you see a value, you have to map it to the struct field you need to decode into. I think this isn't very performant; I think we can do better. So let's go and do better; let's kill that.

It's this piece of code; can everybody read that? If anybody cannot read it, please raise your hand. Okay, great. What this piece of code does is: first it tries to find an exact match, and then it tries to find a case-insensitive match. If you think about it, the happy path is that there's one exact match and that's it, but what the code is doing here is linearly searching one by one, and doing the case-insensitive comparison at the same time, which is very expensive. So what we could do is use a map, for example a map from string, which is the key, to int, which is the index of the struct field.

Where is this fields data structure defined? It's a slice of field, and each of these fields holds the information for decoding into that field; you can see the field type has a name, its bytes, and equalFold, which is what compares the name to some byte slice in a case-insensitive way. So what we want to do is add a map to this. We're going to add a struct type with two fields: a list of field, and what I'll call index, a map from string to int. Then typeFields should return a structFields; that function just finds the fields and builds the slice, not particularly interesting, and towards the end it returns the fields. So in there, instead, we want to build this map, and for each of the fields we say index[f.name] = i; then we return a structFields with the list of fields and the index. And I'm not using field names in the composite literal; please don't shout. I think that's it; oh, the JSON decoder also caches these data structures, so we also have to update that.
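A simplified, self-contained sketch of the change typed in the demo (the real encoding/json field type carries much more information and uses a per-field folding function rather than bytes.EqualFold; the names here are approximate):

    package jsonfields

    import "bytes"

    type field struct {
        name      string
        nameBytes []byte
    }

    type structFields struct {
        list  []field
        index map[string]int // exact field name -> position in list
    }

    func newStructFields(list []field) structFields {
        index := make(map[string]int, len(list))
        for i, f := range list {
            index[f.name] = i
        }
        return structFields{list: list, index: index}
    }

    // lookup finds the field matching a JSON object key.
    func (fs structFields) lookup(key []byte) *field {
        // Happy path: an exact match via the map. string(key) does not
        // allocate here, because it is only used to index the map.
        if i, ok := fs.index[string(key)]; ok {
            return &fs.list[i]
        }
        // Fall back to the expensive case-insensitive linear search.
        for i := range fs.list {
            ff := &fs.list[i]
            if bytes.EqualFold(ff.nameBytes, key) {
                return ff
            }
        }
        return nil
    }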
Bear with me: cachedTypeFields also has to return the new type. Great. And now, oops, there we go, we've got the fields; that's going to be a structFields now. Instead of doing the linear search, the first thing we do is the map lookup: if i, ok := fields.index[string(key)]; ok. The key is a byte slice, but we can index with string(key), and recent compiler versions are not going to copy the byte slice, because we're only using it to index the map, so it indexes directly without copying or allocating. If we found it, cool. Else, if we don't find it with this index of ours, we do the expensive linear search; but because we already did the case-sensitive search via the map, we can simply remove all of this, and we know that f here is going to be... no, it's not going to be... yes, it's going to be nil. Cool, great, I believe that's it.

Now I'm going to try it; it's probably not going to work. There we go, it doesn't work. goimports... I just have to fix decode.go around line 698; oh, it's an index into the fields, duh, so f = &fields.list[i]. Ah, thank you; this is very realistic, because I do make lots of mistakes in real life. You can tell this is real because I'm making a lot of mistakes. We only have one mistake left, around line 617, where the struct encoder also uses the fields... why does this need... maybe I don't remember. Okay, this broke something; we're going to ignore the tests for now, because I don't have time to fix that, and we're just going to focus on the benchmark. Thanks for helping.

What we want to do now is grab new benchmark numbers and compare them with the old ones, so we run the same thing but tee into new, and we wait. Now we check benchstat old new; great, about five percent faster. This is good; it's a speed-up that went into Go 1.13, so it's already in master, which is why I'm doing this on top of 1.12, and the fix in master does not introduce a bug, I can tell you that much. So, great, we're done. But we've got time, so I'm going to do another demo.

This demo is a change that's not in master yet, and a change that I haven't tried yet, so you can find out with me whether it works or not; if it doesn't, that's maybe a low note to end the talk on, but we'll see. Let's leave this 1.12 tree and enter master; this is a version from a couple of days ago. Same thing: we go into encoding/json, and the first thing we do is tee old. Oh yeah, I forgot: by default this also gives us allocation information, which is useful.

The change I want to make: if we look at the decode file, this is the code I ended up committing to master, pretty much the same thing I did live, but with comments, because comments are useful. My theory is that using this map is good, but it's not great, because maps are actually pretty expensive. Maps are great if you've got thousands and thousands of elements, but realistically, when you're decoding a struct, the struct is going to have fewer than a dozen fields, right? So hashing the string and then looking it up in the map is comparatively quite expensive next to just iterating over the list; what was expensive in the previous implementation was the equalFold call. So here's what we're going to do: if the length of fields.list is small, and I'm going to guess that 20 is small, and nobody please look that up, then we do pretty much the same linear scan as before, but with bytes.Equal(ff.nameBytes, key); otherwise, if it's large, we use the map; and only if we still haven't found a match do we do the expensive case-insensitive linear search.
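Extending the sketch from before, the hybrid lookup might look like this (same package as the earlier sketch; the cutoff of 20 was an on-stage guess, and this is not necessarily the code that was eventually committed):

    // lookupHybrid prefers a plain linear scan for small field lists, since
    // hashing the key and hitting the map costs more than a few comparisons.
    func (fs structFields) lookupHybrid(key []byte) *field {
        if len(fs.list) <= 20 {
            // Few fields: scan linearly for an exact match.
            for i := range fs.list {
                ff := &fs.list[i]
                if bytes.Equal(ff.nameBytes, key) {
                    return ff
                }
            }
        } else if i, ok := fs.index[string(key)]; ok {
            // Many fields: the map lookup pays for itself.
            return &fs.list[i]
        }
        // No exact match either way: the expensive case-insensitive search.
        for i := range fs.list {
            ff := &fs.list[i]
            if bytes.EqualFold(ff.nameBytes, key) {
                return ff
            }
        }
        return nil
    }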
Great, that looks like it. Let's see what tests broke this time; oh, just goimports, nothing actually broke, that's useful. Great, so tee new. Who thinks this is going to be faster? Please raise your hands. I see very little confidence in my abilities. Okay, let's see: yeah, 1.5 percent, that's not bad. I would say that should go into master; well, sorry, I should say I can mail this and then somebody can review it, somebody in the audience maybe.

We've got time left and the internet maybe works, so let's create a new branch. First, the commit message; I want to write something short: "encoding/json: decode small structs faster". Do you think that's okay? I'm doing this live at GopherCon, please go easy on me. For the body: "Maps are non-trivially expensive. If we have few struct fields, a linear search is faster and simpler than using a map." And I forgot to copy the rest, so we want to copy all of this in too, and indent it; by the way, what I'm doing now is showing you how I would actually send this as a patch to Go. And a TODO for myself: add a benchmark, because I guessed that 20 is a good number, but maybe 20 is too low or too high, so future Daniel will write a benchmark that measures the old and new code for, say, one field, ten fields, a hundred fields, a thousand fields, and with that maybe I can find a better cutoff between the two approaches.

With that, we can use git-codereview to mail the change, and if the demo gods are being merciful... success. Let's go look at it. So, who in the audience can review this? I think we've got time. Oh, Caleb is doing it; no, oh no, he's not. Cool, great; oh, there's already a review on it, amazing.

That's pretty much it. I was trying to show two things: first, how in real life somebody would do little optimizations in Go, and second, that sending patches for those is not difficult, at least for small changes like these. Hopefully this will help you benchmark and optimize your own code, but if you want to join me in benchmarking and optimizing the standard library, I can help, and I'll be happy to get more hands. Thank you.
Info
Channel: Gopher Academy
Views: 4,693
Keywords: software development, programming, golang, gophercon
Id: oE_vm7KeV_E
Length: 37min 27sec (2247 seconds)
Published: Tue Aug 27 2019