GopherCon 2019: Dave Cheney - Two Go Programs, Three Different Profiling Techniques

Captions
Hopefully what I'm going to show you this afternoon is more impressive than a sight gag. As you can see, not only do I have no sleeves, I have nothing up my sleeves. There are no slides, to add an extra element of danger — it's all live coding. To set the scene: today I'm going to show you three different types of profiling, we're going to improve the performance of two different programs, and we're going to do it in 43 minutes and 38 seconds. So I'd better get cracking.

What we have here is a simple program that reads a piece of text and counts the number of words in it. We open the file, as you'd expect, we have a counter for words, and a very simple state machine to track whether we're currently inside a word. Until we reach the end of the file, we read a byte, check we didn't hit the end of the file, make sure there's no other kind of weird error, and then ask: did we just read a space, and are we currently in a word? We're basically reading through the file character, character, character, space — that's a word, so increment the word counter. What happens if there are a few spaces in a row? A space only ends a word if there's a non-space on either side, so we have to keep that little bit of state, and then we ask whether the byte is actually a letter.

First thing: why am I building this program rather than using the more convenient go run? Because we want to find out how long it takes to execute, and we don't want to include the time of compilation. So first of all, go build the program. Now we need some input, and it should probably be a reasonably sized input — you can count the words in very short strings yourself; you don't need a computer for that. Thanks to the internet we have the complete text of Herman Melville's Moby-Dick, which is quite an extensive volume, and if you've ever had the unpleasant pleasure of working with me, you'll know that I will take this text and drop it into any text box in your application just to see what happens. How big is it? 1.2 megabytes of very fine prose.

So how long does it take to count 1.2 megabytes of characters? About two seconds. Is that a reasonable amount of time? Given that we've only run this one program, we have no basis for comparison, so let's compare it to something else that counts words. Does anyone know a program we can use? Word count — wc. Why write word count when you can just say wc? That was effectively instantaneous. For comparison, let's run our version again. One important thing to point out: the two programs report different numbers, and that is correct — what wc considers to be a word and what my program considers to be a word are not the same. But the underlying idea is the same: the program has to iterate over all the characters in the file to find the spaces and count the words. So while they come up with slightly different answers, I don't think that's really important for the purposes of this demonstration of optimization.
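As a point of reference, here is a minimal sketch of the kind of program being described. It is a reconstruction from the description above, not the exact source shown on screen, and the function and variable names are assumptions:

```go
// wordcount: read a file one byte at a time (unbuffered) and count words.
// A reconstruction of the program described in the talk, not the exact code.
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"unicode"
)

// readbyte reads a single byte through an io.Reader.
func readbyte(r io.Reader) (rune, error) {
	var buf [1]byte
	_, err := r.Read(buf[:]) // with a bare *os.File this is one read(2) syscall per byte
	return rune(buf[0]), err
}

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatalf("could not open file %q: %v", os.Args[1], err)
	}
	defer f.Close()

	words := 0
	inword := false
	for {
		c, err := readbyte(f)
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatalf("could not read file %q: %v", os.Args[1], err)
		}
		if unicode.IsSpace(c) && inword {
			words++ // a word ends at the first space after a run of non-spaces
		}
		inword = unicode.IsLetter(c)
	}
	fmt.Printf("%q: %d words\n", os.Args[1], words)
}
```

Built with go build and timed against wc, a sketch like this shows the same rough gap described above: a couple of seconds versus effectively instantaneous.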
Given that, we have some challenges: we need to make this program at least two orders of magnitude faster. How will we go about doing that? If you were in Chris's talk a few hours ago, the answer is pprof. So the first thing we need to do is profile this program. Welcome to my office, with the world's least ergonomic standing desk. Because I can never remember how to do it with runtime/pprof directly, I'm going to use my own helper package that I wrote, and we'll tell it to put the profile in the working directory so there's not a horribly long path. Run goimports on save, like everyone does. Okay, so now when we run this program — golly gosh, I forgot to defer the Stop. There we go; I told you there was an element of danger.

Two things to note. One, this program has got a little bit slower. pprof profiling gets the operating system to set up an interrupt on a very frequent timer — I think it's a hundred times a second — and stopping the process that often to do a bit of profiling work in the background does add some overhead, so it's a little slower. But the benefit is we now have a profile. One of the nice things added a couple of releases ago is a much more up-to-date version of pprof in the standard library, so we can now ask it for a visualization in a nice web-based GUI. Let's full-screen this.

This is the default view you'll come across when you get a profile and look at it in pprof, and there are a few things to point out. What you're seeing here is the call stack — there are three kind of prominent call stacks, three distinct trees. The way pprof works under the hood is that every time we stop the process, we sample the current call stacks of the active goroutines, and the ones that show up the most by the end of the program are obviously the ones doing the most work. So there are three distinct areas. I'm going to ask you to ignore this one for the moment and go back and find the bit that's actually our program, which is over here — the others are both runtime stuff, which we'll come back to. One thing to know: because this is just a graph passed through dot, the position of the elements on the screen is not important and can be shuffled around between runs — it's the arrows between them that matter. Normally when I run this, runtime.main is over on the other side; it's entirely random.

So we have runtime.main, main.main, main.readbyte, and the important thing to see is that they each contribute nothing themselves to the cost — zero seconds of the 1.4 seconds — and pass all of it down to the file's Read method, through some file-descriptor plumbing, until we get to the big contributor: syscall.Syscall. We had an issue on the Go project — which actually inspired this example — where someone wrote in and said: syscall is too slow, look at it, it's huge, it's consuming all my CPU, you must make it faster. So let me ask you: do you think syscall is too slow? I was expecting either a yes or a no to that. Okay, I prompted you for the answer, but if syscall is not too slow, then what is the problem? It is using all the time in our program — but why? Too many syscalls, exactly.
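As an aside, the profiling hook added at the top of main a few moments ago is roughly the following — a sketch using github.com/pkg/profile, Dave's helper package, including the deferred Stop that was initially forgotten:

```go
// Sketch of enabling CPU profiling with github.com/pkg/profile, writing the
// profile file into the current working directory.
package main

import "github.com/pkg/profile"

func main() {
	// The deferred Stop() flushes the profile when main returns — forgetting
	// it (as in the live demo) leaves you without a usable profile.
	defer profile.Start(profile.CPUProfile, profile.ProfilePath(".")).Stop()

	// ... the word-counting code from the sketch above ...
}
```

The web UI described above can then be opened with something like `go tool pprof -http=:8080 ./cpu.pprof` (the exact file name depends on the helper package's defaults).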
It's not that syscall is slow; it's that in Go, all our read and write operations are unbuffered by default. There's no buffered abstraction unless you ask for one — if you want buffering, you have to buffer yourself. Now let me come back to the part of the graph I asked you to ignore; this tool is so easy to just show on a screen. If this arm here — maybe I can demonstrate with technology — if this arm here is the work of our program, then what's all this nonsense over here? It's a lot of runtime stuff, and all of it ends up in pthread_cond_wait and pthread_cond_signal. Again, if you were at Chris's talk, this is what he was talking about: the overhead of hand-off between threads — it's quite expensive to put one thread to sleep and wake another up. One of the reasons this happens is that, although the function is called syscall.Read, it dispatches to a generic syscall. From the Go runtime's point of view, and from the operating system underneath it, they're all just syscalls — whether you're asking for the time, reading a file, or reading from a network connection — and we don't know how long the thread underlying this goroutine is going to be gone in syscall land. So under the hood the Go runtime has a timer, and if that timer elapses before the syscall has returned, it's time to panic and start a new thread to top up the pool of threads that services goroutines. If the syscall returns fast enough, that's fine — that latency we can just hide. I think the timer is about 20 microseconds at the moment, something like that; it's pretty small. So what we're seeing is that we're spending so much time in syscalls that the Go runtime keeps having to replace the thread that went off to do a syscall with another one, and every time we come back from a syscall there's the thread that's supposed to be in the scheduler plus an extra one hanging around. I believe most of this overhead is the constant chopping and changing, the going to sleep and handing off between those threads.

So let's come back to the problem. How do we fix this? We buffer. It was previously 2.6 seconds; let's see how long it takes now. Impressive — that one change improved the performance of this program by an order of magnitude.
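The buffering change itself, as a fragment of the earlier sketch (countWords-style loop and variable names are assumptions):

```go
// The buffering fix: instead of handing the bare *os.File to readbyte, wrap
// it in a bufio.Reader first. Each byte is then served from a 4 KB buffer,
// so one read(2) syscall covers roughly 4,096 bytes instead of one.
f, err := os.Open(os.Args[1])
if err != nil {
	log.Fatalf("could not open file %q: %v", os.Args[1], err)
}
defer f.Close()

b := bufio.NewReader(f) // the one-line change; bufio.Reader is also an io.Reader
// ...
c, err := readbyte(b) // was: readbyte(f)
```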
Let's have a look at the CPU profile again. It's still a profile, but not only has it changed, we can't even really see our own program in here any more — there's a main.main, but all the code we were actually running is no longer visible; there's just some runtime stuff. Note that the times are very small here: 30 milliseconds, 10 milliseconds. That's because, again, profiling is triggered on a periodic basis, and if the program doesn't run for very long we won't collect many samples — it only ran for 0.2 of a second, so it probably took twenty samples, probably fewer. And what is a histogram when all the samples are different? Basically random. So the CPU profile isn't telling us anything useful any more; it's just background noise. Well, there is one thing. It doesn't show it here — it really depends, because we have so few samples — but sometimes when I run this it points at runtime.mallocgc. This program takes so little time to run that we can just grab another profile, and yes: even though this is just one sample of 10 milliseconds, it's saying that of the activity it could catch, it saw runtime.mallocgc. Memory. Maybe memory usage is the key to the difference between what wc is doing and what our program is doing. That's the hint that we should try a different kind of profiling, so let's switch from CPU profiling to memory profiling.

Now when we run the program, rather than capturing CPU samples with an interrupt from the operating system, the Go runtime itself captures the stack trace that led to each allocation. You can see it says rate 4096: by default we don't sample every allocation, we sample roughly one out of every four thousand. Let's look at that profile. You can also see that the real time went down significantly — different types of profiling have different overheads, and now that we're not using the operating system interrupt, the overhead is quite a bit lower. We can read this profile just like the CPU profile, and it's telling us that all of the allocations in this program are coming from main.readbyte. And rather than looking at it graphically, we can view the source code and see the exact line the allocation is coming from.

There's one thing I want to diverge on for a moment. You see it says 617 kilobytes — that's because we're sample based, only recording a fraction of the allocations. At the cost of some performance we can crank that rate up and say: I want to record every allocation. You can see that increased the overhead a little, but we've set the profiling rate to one, so now when we look at the profile we should see every allocation that happened in this program — and of course we see readbyte up there with 1.2 megabytes. That's a really strong clue that the memory usage of this program is directly derived from the data you feed it: feed it a 10 megabyte file and it's going to allocate 10 megabytes of memory. And at 1.2 megabytes, we know exactly where the allocation is coming from.

So can we fix that? I should explain why this happens. It happens because we declare an array, slice it, and pass that slice to Read. Now, what is the type of r? It's an io.Reader — an interface. The runtime and the compiler don't know what reader we're actually going to be passed: it could be a file, it could be a bufio.Reader, it could be something somebody else has written. So we don't know what the callee will do with the contents of the buffer we pass it — it might capture the address of one of the elements and keep it for as long as it likes. That forces the allocation to escape to the heap, and that's where all the memory allocation is being reported. In the interest of time, one way to solve this is like so: instead of allocating a buffer that escapes to the heap every single time we call readbyte, we allocate one buffer and reuse it, reslicing it over and over again.
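Two short sketches of what's just been described, under the same naming assumptions as before — the memory-profiling switch with the sampling rate turned up to every allocation, and the buffer-reuse change that stops the escape:

```go
// Memory profiling with github.com/pkg/profile, recording every allocation:
//
//	defer profile.Start(profile.MemProfile, profile.MemProfileRate(1),
//		profile.ProfilePath(".")).Stop()

// The escape fix (a sketch): readbyte no longer declares its own buffer.
// Because r is an interface, a locally declared buffer cannot be proven to
// stay local, so it escapes to the heap on every call; passing in a buffer
// the caller allocates once removes that per-call allocation.
func readbyte(r io.Reader, buf []byte) (rune, error) {
	_, err := r.Read(buf[:1]) // reslice the caller's buffer; no new allocation
	return rune(buf[0]), err
}

// In the caller:
//
//	buf := make([]byte, 1)     // allocated once, outside the read loop
//	c, err := readbyte(b, buf) // reused for every byte in the file
```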
And now that's what we expected: 0.2 of a second — compared with wc, pretty close these days. Let's have a look at the profile and compare it with the previous one — which I can't do, because I've closed the terminal. It might look like I've made it worse, because there's an awful lot more on the screen now, but if you actually zoom in and find out what these allocations are, they're tiny. In fact, the biggest allocation in this entire program is the 4K buffer for the bufio.Reader. We've reduced the allocations so much that things which were previously below the threshold of visibility have now bubbled to the top — one of the cutest things I like about this is that you can see the allocations for printing out the log line that says where your memory profile was written. And just for completeness, let's turn off all the profiling. We're now roughly twice as slow as wc, and I think that's a pretty good result. We don't have time to argue about what it would take to get down to wc's level, but the thing I want you to take away is that, knowing nothing about this program and using only the tools in the Go standard library, we were able to improve its performance by two orders of magnitude. Maybe not every program has this many pieces of low-hanging fruit, but this simple investment of time pays off for anybody interested in improving the performance of a program.

I promised you three different profiling methods, so we're going to leave this example and move to something different. This program, when you run it, produces a Mandelbrot image — people are probably familiar with these; this kind of fractal was very popular in the 90s when I was a child. It produces an image like this, 1024 by 1024: that's a megapixel, a megabyte of pixels we have to compute. How long does it take? Let's do a quick sniff test: 1.6 seconds, on this day, at this altitude, at this temperature — don't forget laptops run slower when they get hotter. So today it's 1.6 seconds. Is that good? I don't know — 1.6 seconds to make a megapixel image? Think about when the last time was that you had a camera with only a megapixel. So maybe, on 2019 hardware, 1.6 seconds isn't great. Let's apply what we just learned on the previous program and see if we can improve it. If I were really smart I would have set up macros for all this, but there we go — exactly the same as with the last program: grab yourself a CPU profile and have a look at it.

Okay, there's a bunch of stuff here, some runtime stuff, all very small — but when you're reading these graphs you follow the arrow, the giant arrow, and it tells you where most of the time is being spent. It's telling us that main, something called a sequential fill-image function, fillPixel and paint are taking nearly 90 percent of the time. Don't worry about anything else on this graph, because if we don't fix this, we don't improve the performance of the code. So let's look at the source. This is paint: it's just a little function that takes the result of the Mandelbrot calculation — you iterate on a complex number until you reach some limit, and you use that limit as the colour of the pixel — and assigns it to the red, green and blue channels.
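A rough sketch of the two functions being described — the escape-time iteration and the paint that just assigns colour fields. The names, the mapping onto the complex plane, and the iteration limit are assumptions, not the talk's exact source (and it assumes imports of image, image/color and math/cmplx):

```go
// fillPixel runs the Mandelbrot escape-time iteration for one pixel and
// hands the result to paint. The constants here are illustrative only.
func fillPixel(img *image.RGBA, x, y, width, height int) {
	const limit = 256
	c := complex(
		float64(x)/float64(width)*3.5-2.5, // map pixel coordinates onto the complex plane
		float64(y)/float64(height)*2.0-1.0,
	)
	var z complex128
	n := 0
	for ; n < limit && cmplx.Abs(z) <= 2; n++ {
		z = z*z + c
	}
	paint(img, x, y, n)
}

// paint just assigns the red, green and blue fields — nothing to speed up here.
func paint(img *image.RGBA, x, y, n int) {
	shade := uint8(n)
	img.SetRGBA(x, y, color.RGBA{R: shade, G: shade, B: shade, A: 0xff})
}
```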
So paint is just assigning to some fields — I don't think we can speed that up very much. fillPixel is the Mandelbrot equation: it iterates until it reaches the limit, then calls paint. So pprof has told us what it saw — you're spending all your CPU time here — but I don't think we're going to improve the speed of assigning to a field, and I don't think I'm smart enough to compute the Mandelbrot equation any faster. pprof has told us what it saw, but there's not much we can do with that information.

This is the moment to introduce a different tool: the execution tracer. Tracing was added in Go 1.5, and over the next couple of years we actually figured out how to use it — we uncovered the missing manual of trace. Asking for a trace rather than a pprof profile is just as simple, but we use a different tool to analyse it, because it reuses a bunch of infrastructure from Chrome: go tool trace runs a little web server that translates our trace file into the format Chrome's trace viewer wants to visualize. Super useful tip: question mark shows you the hotkeys, just like on every other Google property. Using those, A and E go left and right, W and S zoom in and out.

So what can you see on this graph? Three things are very obvious. The first is that this program is bimodal: it does some stuff here, and then — let me zoom in — its behaviour changes; we'll get into what that is in a second, but it starts behaving differently, it moves around. We can also see this graph of the heap, with the stair-step behaviour that's very common in garbage-collected languages, so obviously memory allocation is involved: at some point it starts to allocate memory. We can explain what we're seeing, because don't forget: we compute the image and then we write it out. So this first part is obviously computing the image, and this second part is writing it out, and we can prove that — if you click on these and you're lucky, you actually get a stack trace: there we go, main, png.Encode, and down into the compress writer. But there's one glaring thing on here. Does anyone want to take a guess why this program is not as fast as it could be? Single-threaded — yeah. There's one big green line here on proc zero and all the rest of them are doing nothing.

So one of the ways we could improve the performance of this code, without having to invent a new form of mathematics for multiplying complex numbers, is to spread the work over multiple CPUs, and that's exactly what we can do with this program. This is our sequential fill-image: for each row, and for each column in each row, compute the pixel. But this is Go — we have goroutines — so why don't we just compute every pixel in parallel? For each row, for each column, start a goroutine to compute the pixel. Of course we can't actually run a million pixels at the same time — there are only a limited number of cores — but we know the scheduler is built for this, so it should work through that backlog of goroutines as quickly as it can. The way we do this: the program has a mode flag, so we can switch from sequential mode to per-pixel mode.
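A sketch of that per-pixel mode, assuming a sync.WaitGroup is how the main goroutine waits for everything to finish (that detail isn't spelled out in the talk):

```go
// One goroutine per pixel: a million tiny goroutines, each doing a very
// small amount of work relative to the cost of creating and scheduling it.
func onePerPixelFillImg(img *image.RGBA, width, height int) {
	var wg sync.WaitGroup
	for y := 0; y < height; y++ {
		for x := 0; x < width; x++ {
			wg.Add(1)
			go func(x, y int) {
				defer wg.Done()
				fillPixel(img, x, y, width, height) // each goroutine writes a distinct pixel, so no locking is needed
			}(x, y)
		}
	}
	wg.Wait() // wait for all 1024*1024 goroutines to finish
}
```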
Okay — so it didn't really change the time: 1.6 seconds became 1.7. One thing that did change was user versus real. Real is wall-clock time — the time we experience, the progression of time. User is an accounting measure of how many CPU-seconds you used; if you use more than one CPU, you consume user time at a faster rate. Given that they were about the same before but are vastly different now, there's a hint that we're doing more in parallel — we're using more CPUs — but it didn't make the program any faster. So let's have a look at the trace. It's taking longer to load, because the size of a trace is based on the number of goroutines and their interactions: we used to be single-threaded, and now we have a million goroutines all fighting to get onto the scheduler to do their work, and this really does chug quite hard. Because of that, when it opens we won't be able to look at the whole one and a half seconds of the run; we have to break it down into smaller chunks, because loading all those points into Chrome's trace viewer is not a good idea — it really, really chugs.

Okay, so there's more work on more procs, but a lot of gaps. Something else I want to point out: the idea was to start all those goroutines and then just sit back and wait for them to finish, and we have a counter up here of the number of runnable goroutines and the number of running goroutines. Runnable means ready to run, so I was expecting hundreds of thousands, but there are only about six — it never seems to get going — and if we zoom all the way in here, you can see there's virtually no work happening at all. In the interest of time: what I believe is happening is that the amount of work it takes to compute one pixel is very small relative to the overhead of starting and managing those goroutines. And don't forget that the producer itself is a goroutine — it competes for CPU time with the things consuming its work, so if all your CPUs are busy computing pixels, there are no threads available to start more goroutines.

So if one pixel per goroutine is too small an amount of work, what if we split the difference and did a row at a time? We'd have a thousand times fewer goroutines, but each would be doing a thousand times more work. Is that going to make us faster?
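A sketch of the row-at-a-time mode, under the same assumptions as the per-pixel sketch above:

```go
// One goroutine per row: about a thousand goroutines instead of a million,
// each doing roughly a thousand times more work, so the per-goroutine
// overhead is amortised over a whole row of pixels.
func onePerRowFillImg(img *image.RGBA, width, height int) {
	var wg sync.WaitGroup
	for y := 0; y < height; y++ {
		wg.Add(1)
		go func(y int) {
			defer wg.Done()
			for x := 0; x < width; x++ {
				fillPixel(img, x, y, width, height)
			}
		}(y)
	}
	wg.Wait()
}
```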
For comparison, let's look at per-pixel mode again — 1.8 seconds — and now let's try row-at-a-time mode. That's looking better: we've almost cut the time to a quarter of what it was, and we're apparently using more user time than real time, so there's some parallelism going on, and that's why we improved the speed of the code. Let's look at the trace tool — notice how much faster that opened? It's because there are a thousand times fewer goroutines and a thousand times fewer interactions, and one of the nice consequences is that we now get to see the whole of the program in the trace. We also see what we expected with the goroutines: we get this large peak — maybe I'll use magic — the producer running ahead of the consumers, building up work on the scheduler. At this point we have 850 runnable goroutines, then 600, then 400; that's what we expect. If we look at the number of threads, eight are running, so there are eight procs just chewing through that backlog of goroutines. And — let me collapse that — we also see that now we have work on each of our CPUs.

The colours, by the way, are not indicative of anything; you don't get to control the colour — it's not like what Patrick Hawley was suggesting yesterday. They're just assigned so that you can tell when two different goroutines are next to each other. Each of these boxes represents the time to start a goroutine and compute a thousand pixels, and at the end we obviously collapse back into printing out the image. So that's pretty good — in fact that's very good: the time to compute the image is about 200 milliseconds and the time to write it out is about 200 milliseconds. Instead of the program being dominated by the time to compute the image, with a little bit of overhead for writing it out, it's now almost one to one.

Now, when I give this example in my workshop, people usually say: okay, you start a thousand goroutines, that's fair — but what if instead we had just some workers and fed them coordinates? Instead of starting a thousand goroutines, we have a worker pool and feed it work to do. Is that different? Is it going to be faster? Are we still paying too much overhead in the scheduler? We have the time, so let's investigate. We start the number of workers we were asked for, and each of those workers ranges over a channel, getting essentially a row — please compute this row — and then for every column in that row it calls fillPixel. So we say to the worker: please go and render that line. In the main goroutine we queue up all the work, and we know we can queue it up because this is a buffered channel — we've made sure there's enough space that the producer, if it wants to, can run all the way to the end, set up all the work, and close the channel so that the workers eventually exit, and then we just wait for them to exit. So instead of running each row in its own goroutine, we've chosen a worker model, and we now have the ability to control how many workers we have — let's say eight, because that's how many CPUs I have.
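A sketch of that worker version — a fixed worker count, a buffered channel of row numbers, and the channel closed once all the work has been queued:

```go
// A fixed pool of workers ranging over a buffered channel of row numbers.
// The channel is sized to hold every row, so the producer can queue all the
// work and close the channel without ever blocking; closing it is what lets
// the workers' range loops finish.
func workerFillImg(img *image.RGBA, width, height, workers int) {
	rows := make(chan int, height)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for y := range rows { // each worker pulls whole rows to render
				for x := 0; x < width; x++ {
					fillPixel(img, x, y, width, height)
				}
			}
		}()
	}
	for y := 0; y < height; y++ {
		rows <- y
	}
	close(rows) // no more work; workers exit once the channel drains
	wg.Wait()
}
```

A call site might pass runtime.NumCPU() as the worker count, matching the eight used in the demo.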
Just for comparison: this is row mode and this is worker mode — pretty much the same, pretty much the same. Let's have a quick look at the trace as a comparison. Here is the per-goroutine model; we can see each of the different goroutines — this G number is the goroutine identifier the runtime uses internally — and these are different Gs, scheduled kind of at random. If we look at the trace of the worker version, overall it's very similar, but instead of a giant build-up of goroutines to service and then a trailing away, it's just a fixed number of goroutines: our eight workers. We see down here in the proc list that it's always the same goroutine doing work, waiting, getting another piece, doing more work — that's why they're clumped together on the same line: this is G10, G10, G10, G10, G10. One of the nice things we can see is that they're not thrashing between CPUs; they live on the same CPU, so they probably have all their data in the cache. Did this make much difference in practice? It's going to be quite difficult to get them on the same time scale — there we go: zero, 200, 400 versus zero, 200, 400. It turns out that for this example, just firing off a bunch of goroutines or building the worker pattern gives about the same result; we get good CPU utilization either way. So which one is better for you, which one is more applicable? Maybe it depends on how the program is run: this work might be in response to some user input — it might be work you're doing as part of an API request or something like that — and you may not want to dedicate all of the capacity of the machine to this one request, lest you starve out the next one to arrive. So depending on your use case, you may want to dedicate a fixed number of workers to computing this image and compute the next one in parallel with it. I think I've shown that the two strategies deliver roughly the same performance, and it's really down to operational concerns which one you choose: the per-goroutine one is nice and simple, the worker one has a little more overhead in setting up the workers and waiting for them, but they deliver pretty much the same result. Obviously if we have fewer workers it takes longer, and if we have too many workers we don't get any extra benefit. But you can see that while pprof couldn't help us here, the execution tracer showed us that we could bring more CPUs to bear on this problem, and we improved the performance of the code by nearly four times.

What I want to end on, just as a little aside: whether it's the worker model or the goroutine model doesn't really matter — obviously the speed-up in this program is coming from throwing more CPUs at it. You could imagine — Apple don't ship them yet — that if we had 16 cores, instead of 200 milliseconds maybe that goes down to 100 milliseconds. We saw what happened when we threw one worker at it: it takes 1.6 seconds. So even though the scaling isn't completely linear, you could argue that with more cores we could reduce this front part even further. You could take it to the logical extreme and say: if we had a thousand cores we could compute every row at the same time — let's not get distracted by how much a thousand cores would cost, but this problem lends itself very well to parallelism. Imagine we could compute every row in parallel; imagine we had a million cores and could compute every pixel in parallel; we could effectively reduce that time to nothing.

The reason I mention this is that you might have noticed the other part of the program: the time to compute the image, which used to dominate, is now 50% of the run time. The other 50% is spent writing out the compressed PNG of the image, and I can tell you that parallelising that compression is much harder than parallelising Mandelbrot calculations. So overall, the runtime of this program is now dominated by how long it takes to write out the image. Although it jiggles between CPUs — because it's doing writes, which issue syscalls, so sometimes it ends up back on another CPU — this is effectively now the sequential part of the program. And I mention it because of a very important law: Amdahl's law, which says that the potential speed-up you can achieve in a program is limited by its sequential part. You can see it on the classic Amdahl's law chart — you can imagine the x-axis extending to infinity, but when this was proposed, I guess in the 70s or 80s, the idea of 65,000 cores was considered to be infinite — and the line shows that after about 16 cores you get basically no further benefit; there's no more speed-up to be had, and that's with 50% of the workload parallelized.
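For reference, Amdahl's law as it's usually written, with the two cases from the talk worked through:

$$ S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}} $$

where $p$ is the fraction of the program that can be parallelized and $N$ is the number of cores. With $p = 0.5$ the speed-up can never exceed $1/(1-0.5) = 2$, which is why the 50/50 curve flattens out after a handful of cores; with $p = 0.95$ it can never exceed $1/0.05 = 20$, the ceiling mentioned just below.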
And we can see a visualization of that: because we have a 50/50 split here, even if we reduce the compute part of the program to nothing, it's still going to take at least this long. The thing I want to leave you with is to think about your own programs. Obviously we want to use goroutines, we want to take advantage of the multiple cores in our machines, but think about how much of your program's work can actually be done in parallel. Even the example here — 95% in parallel, only 5% of the program sequential — still tops out at no greater than a 20 times speed-up, even with 4,000 cores. So think about your programs: they must do some argument parsing at the start, a little bit of setting up connections, a little bit of validation, reading a config, things like that — and that's all coming out of your five percent of sequential work. To speed up your programs the most, the key is to make sure you can do as much of the work as possible in parallel — not just most of it, but effectively all of it. And with that, that's all I have for you. Thank you very much.
Info
Channel: Gopher Academy
Views: 44,298
Keywords: golang, gophercon, programming, software development
Id: nok0aYiGiYA
Length: 42min 59sec (2579 seconds)
Published: Tue Aug 27 2019