Golang UK Conference 2016 - Dave Cheney - Seven ways to Profile Go Applications

Video Statistics and Information

Captions
OK, so hello again. Notwithstanding my remarks yesterday, one of the reasons that people choose Go as a solution is that it lets us write efficient programs, and it's not just the language and the compiler and the technology behind them, but also the tooling and the ecosystem we can build around it, so that when your programs are not fast you can find out why. I think this really gives us an edge. So, perhaps foolhardily, I chose seven different ways to talk about profiling Go programs, and if you have an advanced degree in mathematics you'll be able to divide 30 by seven and come up with a fairly small number, so I will be going very quickly. Please do keep your hands inside the vehicle. I won't cover everything, but the slides will be online and you're welcome to come and buttonhole me with questions afterwards.

So the first way of profiling a program is time. Perhaps a little bit anticlimactic, but time is just a stopwatch: it tells you how long your program took to run. In this example I just put time in front of go fmt over all the packages in the standard library, because it takes a few seconds, and things that take a few seconds rather than a few milliseconds are easier for us humans to reason about. As I said, time is built into your shell; like cd and ls, it's not actually running a program, and the format of its output is specified by POSIX. But on most of your computers you'll also have a time command in /usr/bin, and as you can see, the output we get there is slightly different (this is running on different machines, so don't worry about the difference in the times); we actually get more information, and the BSD version, if you're running a Mac, gives you something slightly different again. Because this one is a command, not a shell built-in, it has flags, and we can use things like -v, which will tell us how many pages of memory the process used, how many times it was swapped out, how many times it caused a page fault, things like that.
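To make the idea concrete, here is a minimal Go sketch (my own illustration, not from the talk) of what time measures: wall-clock duration around a piece of work.

```go
package main

import (
	"fmt"
	"time"
)

// timed reports how long fn takes to run, which is essentially what
// putting `time` in front of a command does, minus the user/sys split
// and the page-fault counters that /usr/bin/time -v adds.
func timed(fn func()) time.Duration {
	start := time.Now()
	fn()
	return time.Since(start)
}

func main() {
	d := timed(func() {
		// Stand-in workload for something like `go fmt std`.
		sum := 0
		for i := 0; i < 10_000_000; i++ {
			sum += i
		}
		_ = sum
	})
	fmt.Printf("elapsed: %v\n", d)
}
```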
So you might think time is a bit pedestrian because it's just this shell built-in, but actually it's quite powerful. This is the same thing for BSD. Now, did you know that the go tool has a flag called -toolexec? -toolexec effectively lets you interpose your own command in front of the command that the go tool is going to shell out to, the compiler or the linker; you can put your own command in as a prefix. This is super useful. Russ wrote a tool called toolstash (I think that's what it's called), and we used it during the compiler transition: every time the go tool went to build a package, it would actually build it with both the old compiler and the new compiler and verify that the results were byte-for-byte compatible, by interposing this little shim. More practically, whenever we build and test for Android or iOS, we also use -toolexec: the packages are cross-compiled for Android on what's probably a Mac or a Linux machine, and then adb push, via -toolexec, runs them on the phone. You can read a little bit more about this; it is documented in go help build.

So, why is my build slow? Who knows about the -x flag? Who's used -x? OK, not too many people. -x basically tells you what the go tool is doing behind the scenes, and it does it in quite a nice format: you can actually copy and paste this back into your shell and run the exact same commands that the go tool runs. This is quite nifty. So if we wanted to find out how long running the compiler took, we could copy all these lines one by one, and when we get to this one we could put time in front of it. But as I said, this is kind of laborious.
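A -toolexec shim is just a program that receives the tool and its arguments. A minimal sketch (my own hypothetical shim, not the toolstash source) that times whatever tool it is handed might look like this; you would point the go tool at it with `go build -toolexec=/path/to/timer <package>`:

```go
// timer: a sketch of a -toolexec shim. The go tool invokes it as
// `timer <tool> <args...>`; it runs the tool, forwarding its output,
// and reports how long the tool took on stderr.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// run executes tool with args, forwarding stdio, and returns the elapsed time.
func run(tool string, args ...string) (time.Duration, error) {
	start := time.Now()
	cmd := exec.Command(tool, args...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	err := cmd.Run()
	return time.Since(start), err
}

func main() {
	args := os.Args[1:]
	if len(args) == 0 {
		// Run outside the go tool (no tool given): demo on `go version`.
		args = []string{"go", "version"}
	}
	d, err := run(args[0], args[1:]...)
	// Report on stderr so we don't pollute the tool's own stdout.
	fmt.Fprintf(os.Stderr, "# %s took %v\n", args[0], d)
	if err != nil {
		os.Exit(1)
	}
}
```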
But we can combine -toolexec and time to do this for us. I won't refresh that, that's a bad idea, just in case it breaks. So we can take this, and I'm choosing cmd/compile here. A lot of the motivation for this talk was the work I've been doing, with many other people (I certainly shouldn't take any of the credit for it), on speeding up the compiler for 1.7, so we used all these tools in trying to figure out what is going on. This is a good example because it takes a good 6.9 seconds to run, and by interposing time we now get this output. So instead of having to copy and paste line by line from the output of -x, we can just put time in front of running this one command. We'll come back to -toolexec a little later to see how we can use it for other things.

So, number two: who's used the GODEBUG variable? Who knows about it? OK. Whenever someone comes on the mailing list or the forum, or raises an issue on GitHub, and says "my program is slow, the garbage collector is to blame": most of the time you'll be allocation bound, and the easiest way to get a ballpark figure for how hard the garbage collector is working is to turn on the debug output from the garbage collector. That's controlled with GODEBUG. GODEBUG controls a whole bunch of other flags too; it's basically our kitchen sink for getting into the internals of the runtime. But in this example I'm going to run godoc with GODEBUG=gctrace=1, just to show you what you should be looking for. As godoc starts up, it's going to read through your GOPATH and parse a whole bunch of files; that's going to be fairly allocation heavy, and as you saw from this example, the output was effectively continuous. That's one way of knowing: if we continually see the garbage collector running, there are probably a lot of allocations happening.
Over-generalizing a lot, many people's applications are bound by their allocations; that's basically one of the things which will determine the speed of your application. So being able to get a ballpark number for how hard you're making the garbage collector work, which you can do very quickly with GODEBUG, without having to recompile anything and without having to use some of the other things I'll talk about later in this talk, is very useful. Now, every application is different, and we're not really concerned with the actual numbers being printed out here, just how much information is coming: if your application is running and constantly outputting garbage collection lines, there's a fair chance that you're allocation bound; if the lines are ticking over very slowly, it's likely that you have your allocations under control. So GODEBUG is useful for this.

All right, time for an intermission. In this next section we're going to talk about profilers. How does a profiler work? It basically runs your program and configures the operating system to interrupt that program at regular intervals. In the Linux world, in the Unix world, this is done by sending a signal called SIGPROF to the program, which suspends it and transfers execution back to the profiler, which then grabs the program counters for all the threads that were running, and starts the program again until it's next interrupted. A question I'll just leave open to the floor: why am I talking about threads here, not goroutines?

So, profiling do's and don'ts. I think it's important to talk about this, even as a kind of sidetrack, because I see a lot of people getting this wrong. When you're profiling, the machine you're running the profile on must be idle; this is crucial. Don't run on shared hardware where it's competing for other jobs. Don't browse the web or watch a YouTube video while you're waiting for your benchmark to run in the background; that will skew your results. A big one: watch out for power saving and thermal management, and everyone with a laptop in this room is affected by this problem. The longer your benchmark runs, the more heat your CPU will generate, and the firmware, which you have little control over, will scale back the CPU speed to maintain the thermal envelope. This is a really big problem, and if you want repeatable, accurate benchmark numbers, it's something you will run into all the time. As an extension of that, avoid virtual machines, and benchmarking in the cloud? Come on, seriously, that's not an option. And if you are using versions of OS X older than El Capitan, there is a bug that affects all BSDs where the profiling signals are basically broken; the good news is that it's fixed in El Capitan, and people seem to be pretty aggressive about upgrading their Macs. The takeaway advice is: if you care about this and can afford it, buy dedicated performance test hardware, rack it in the closet or in your data center, disable all the power management, disable all the thermal scaling (or set it to a fixed value), and never update the software or change the kernel, otherwise you won't be able to compare apples to apples. For everybody else, it's important to remember to have a reliable before-and-after sample: run your before sample and your after sample many times, to get a number which is stable.

OK, so pprof. pprof descends from, I believe, the Google Perf Tools suite, which was originally targeted at C++; the ideas behind it were ported over to the Go runtime very early on. It consists of two parts: runtime/pprof, which is the actual package and is part of every Go program that's built, and
go tool pprof, which investigates the data that runtime/pprof generates. Let me talk very quickly about the types of profiles you can generate. CPU profiling is the most common, most obvious one. When CPU profiling is enabled, every 10 milliseconds, which is the frequency of the timer, we record the stack trace of all the currently running goroutines (and there's a hint back to that question I left open earlier). We write the profile to disk, and afterwards we can analyze it by asking: what is the most frequent program counter, what is the most frequent function that is running, whenever we interrupt the program?

Memory profiling is a bit different. It records the stack trace of where the allocation was made. It's sample based, so it's not driven by time but by allocation: by default, when you turn on memory profiling, roughly one out of every thousand allocations will record this information. You can change that rate, right down to one in one, but that has an extreme impact on performance. It's important to note that stack allocations are assumed to be free and are not tracked in the memory profile. Because it's sample based, and because it tracks allocations, not the use of memory, using the memory profile to determine how much memory your application is using overall is difficult and not very accurate. I see a lot of people who think "my application is using a lot of memory; I know, the tool for this is the memory profiler", and it just leads them astray. There are better ways to get an accurate idea of how much memory your application is using, and I'll talk about those a little later on.

Block profiling I won't have time to go into today, but it's quite unique to Go. It's similar to the CPU profile in nature, but it records the amount of time a goroutine spent waiting for a shared resource rather than running.
This can be very useful in determining where bottlenecks occur: the places where a goroutine could make progress if it weren't blocked, things like channel sends and receives, mutexes, reads on shared resources, basically. But the takeaway is that the block profile is a very specialized tool; you shouldn't reach for it first. You should reach for it once you believe you've eliminated all the CPU and memory usage bottlenecks in your application.

And lastly: one profile at a time, please. Profiling is not free in terms of the overhead it generates, and the runtime and most of the tools around it will not deliberately prevent you from enabling multiple profiles at once. People think "I want to have all the information about my program, I'm going to turn them all on, I want all the data", and unfortunately they just end up measuring the profiling itself rather than their application. So the key is: one profile at a time.

The easiest way to profile a function is going to be with the testing package. The testing package gives you a harness to turn all these things on, and it exposes them as flags: -cpuprofile writes a CPU profile to a file while running that test binary, and -memprofile and -blockprofile do the same thing for the other profiles. A nice addition that came a few versions of Go ago is that using any one of these flags will also keep the test binary, so you don't have to do two stages of go test -c and then specifically running that program. I've got an example of how to do this down here: go test -run=XXX -bench=IndexByte, running the IndexByte benchmark from the bytes package and writing out a CPU profile. The -run=XXX here is very important; it's actually a regex, and XXX is just anything that doesn't match the name of any test in your package, to basically say: don't run any tests. If we're doing a CPU profile, we don't want to profile the tests, we want to profile the benchmark; so that basically says run the benchmarks, don't run any tests.
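For reference, here is a benchmark of the kind that command profiles. This version uses testing.Benchmark so it runs as a plain program, but the same function body dropped into a _test.go file is what -bench=IndexByte would pick up (the harness and the workload are my own illustration, not the stdlib benchmark):

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// BenchmarkIndexByte scans a 1 KB buffer for a byte that only appears
// at the very end, so every iteration does a full scan.
func BenchmarkIndexByte(b *testing.B) {
	data := bytes.Repeat([]byte{'a'}, 1024)
	data[len(data)-1] = 'z'
	for i := 0; i < b.N; i++ {
		if bytes.IndexByte(data, 'z') != len(data)-1 {
			b.Fatal("wrong index")
		}
	}
}

func main() {
	// testing.Benchmark lets us run a benchmark outside `go test`.
	result := testing.Benchmark(BenchmarkIndexByte)
	fmt.Println(result)
}
```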
The opposite is if you want to profile a whole program. Testing is great if you have micro-benchmarks that are structured as testing-style benchmarks, but if you want to profile a complete application, what are you going to do? You're back to using the raw runtime/pprof API. That's the situation I found myself in a couple of years ago, and I was a bit disappointed: testing makes it so easy, why can't we have this for all our applications? So I sat down and wrote a little package called package profile that just makes it easier to profile your whole program. To do this is literally just one package to import and one line at the top of your main function, which I want to demonstrate by doing go doc... oh Jesus. Yep, and for everybody wondering why it's so slow: this is my machine at home in Australia, so you can now see what the rest of the internet feels like for me. Setting it up is as simple as finding your main function and sticking this at the top, so goimports adds the import for us. I may regret this. So now if I run the godoc command I was running before... we'll just take out the trace. It's my fault, I have godoc aliased, piped through a pager. I'm going to skip that example, but come and see me afterwards.

OK, net/http/pprof. Package profile is great for profiling the whole thing from start to end; that's how it works. But if you have applications deployed in your data center, in production, they're running and you want to connect to them and interrogate them, net/http/pprof is one of the ways to do it, and pretty much everyone should know about it. When you import net/http/pprof, it will attach itself to the default mux and create a bunch of endpoints which the profiler can then talk to. Here are some examples: I've instrumented this copy of present, which is running here, just to show you what kind of information you get.
The goroutine endpoint gives you a dump of all the goroutines; that's pretty uncontroversial. The heap one, I think, is the most useful. We can ignore this part, which is all for the pprof tool; if we go all the way down to the bottom, we get runtime.MemStats, and this is the way to find out how much memory your Go program is using, because this is the real number. Anybody else, anything like top, is reinterpreting this number, and shared memory on Linux is very complicated. This value here is the number of bytes your Go program has asked the operating system for. This is the real number. There are other statistics in here too, but if you want to know how much memory your Go program is using, that's the value right there.

OK, so now that we've talked about what pprof measures, I want to talk about how to use pprof to analyze a profile. pprof should always be invoked with two arguments: the name of the binary and the name of the profile. A lot of people get this wrong, because pprof also supports an online mode and can be run with just one argument; if you do that, your profile will be empty, because pprof needs both the binary and the profile. This is a sample of the pprof output you get, and to be honest with you, that's completely useless; it doesn't tell me anything. What I do like is to visualize it, and that's done either with the web command or with svg or pdf. So that same profile, we can look at it as a graph, and this is an example from an issue report where someone was saying "my program is spending all its time in syscall; syscall is slow". Actually, from here we can see that most of the time is being spent in net.Read, and the reason most of the time is being spent in net.Read is that there is no buffering: every ReadByte reads one byte, which calls down to conn.Read, which calls netFD.Read, which actually makes a syscall to read one byte. So the problem there was that the reporter needed to add some buffering.
But the key there is that I find it so much easier to look at it graphically than numerically. Again, you can visualize memory profiles and block profiles the same way; you can have a look at those yourself.

OK, second intermission. Now that Go 1.7 is out, we have frame pointers, and frame pointers are crucially important for integrating with the rest of the Unix toolset. Frame pointers basically let you unwind the stack in an agnostic way: Go knows how to unwind its own stack, but for other programs to understand how Go programs are laid out, you need frame pointers, and we now have them in Go 1.7. That means we can use tools like perf. perf stat, again using -toolexec, gives us great statistics on what the application is doing at the CPU level; where time was looking from the outside, this is looking from the inside. You get information like the front-end cycle stalls, the back-end cycle stalls, and how many instructions per cycle your application is achieving, and this is important information if you're tuning your application for its memory layout, because generally a lower instructions-per-cycle figure means you're spending more time waiting on memory. We can also use perf record to record a profile, similar to the way we used the pprof tool; it's the same program, but we're introspecting it in a different way. I'm going to try to run this... I have a picture if it doesn't work. So now that we have frame pointers, we can unwind the stack of all these things, which you'd appreciate if you've ever tried to use perf or gdb before 1.7; now this works. perf lets you dig into each of the actual instructions that are running in the program: we have the source code on the left, and a percentage on the left of how expensive each instruction was. You can get incredibly fine-grained information here.

So, number six. This is where things get really exciting: who's used flame graphs? This is Brendan Gregg's idea, from Netflix, and thanks to our friends at Uber we can use flame graphs with Go. They've built this nice package which effectively looks very similar to the go tool pprof interface: you build a CPU profile and then you run the tool, similar to the way you would run go tool pprof. I'm not going to run that, but I do have an example here. The key thing about flame graphs is that the width represents time, but the order is unimportant: it's sorted alphabetically, not chronologically. Each level represents a stack frame. We have runtime.main, main.main, this is the compiler, and at the top level the compiler is basically doing three or four big things: this small one, this slightly larger one, and this enormous one called funccompile, which takes up all of that time. And then with flame graphs we can dig in: of the time that funccompile is taking, what is the percentage of each thing it's doing? buildssa is obviously the big component, because this is the new SSA compiler in 1.7, and you can really dig into what your application is doing. The key here is that funccompile is called once for every function in your package, but it's only listed here once: every call to it is summarized at this point. So even if funccompile were called a hundred thousand times with a very small cost each time, when that's all added up it's very obvious that funccompile is the heaviest user here.

So, the last one: go tool trace. This was added by Dmitry Vyukov around Go 1.5, and it's a new kind of profiling. Rather than tools like perf and pprof, which just look at the application from the outside (capture the program counter, write it down), go tool trace knows more about the actual application. It's very high precision, but it's also aware of things like goroutines blocking and unblocking, when network calls occur as opposed to other kinds of syscalls, and it knows a lot about the garbage collector and its interactions. The downside is that it's effectively undocumented, and has been for about two releases; maybe that'll get fixed soon. Capturing a trace is another kind of profile: I have an outstanding CL that adds a trace profile flag, and package profile also supports traces now, so you can generate them yourself. Once you have a trace, you pass it to go tool trace in a similar way to how you would use the pprof tool, and you get something like this, which lets you analyze the operation of each goroutine. This data set is taken from the compiler; if this were a server application with many goroutines, it would look a little different. Let me just go back one, because I can show you: if you've used tools like VisualVM, you'll find this very comforting. You have the heap size, in the green and the other colour, with that traditional stair-step pattern of a garbage-collected language; you have the goroutines that are active at any one time, and the threads that underlie them. Again, because the compiler is effectively a sequential operation, there is not a lot of concurrency there. Each of these little segments, which are very difficult to zoom in on, is effectively a goroutine running for some time and then going back to the scheduler, either because it was preempted or because it executed an operation that blocked, like reading from a file. And so we can see that basically this is the operation of the compiler: it goes along for a little bit, with the memory usage increasing, up until the point where we've exhausted the heap. All of a sudden the garbage collector springs into action and takes all the CPUs away to work on garbage collection; that's why we can see all four of these processes are busy. Then we go back to doing some more work, and then this process happens again, but the goroutine that's actually doing the compilation ends up on another CPU and works there for a while, and gets moved around. This is where Go tooling is going. I'm doing a particularly bad job of explaining it, but this is the kind of diagnostics we can now get out of a Go application, and hopefully one day there'll be a blog post that explains what it does.

So, that was a lot to go through in 30 minutes; I went very fast. The takeaway I want you to take from this is that different tools give you a different perspective on the performance of your application. I'm certainly not saying that you should learn and use every one of these tools all the time; you should find the ones that work well for you. But all of these tools have been instrumental in helping us understand the performance of Go programs, especially the compiler. These slides are online; there's a lot of stuff I didn't have time to go through, but I'm going to be around for the rest of the conference and I welcome you to come and ask me difficult questions about this material. Thank you very much.
Info
Channel: GopherCon UK
Views: 11,277
Rating: 4.9322033 out of 5
Keywords: Golang, Go
Id: 2h_NFBFrciI
Length: 30min 3sec (1803 seconds)
Published: Thu Sep 08 2016