Analyzing production using Flamegraphs - Prashant Varanasi - Release Party #GoSF

Captions
I'm Prashant, an engineer working on infrastructure at Uber. I specifically work on network infrastructure, supporting our RPCs. We have a pretty extreme microservice architecture at Uber, so there's a whole bunch of services communicating with each other, and I'm the tech lead for the team that owns the service discovery components and a lot of the routing components. Today I'm going to talk about how we can use pprof and flame graphs to analyze production, and I don't just mean performance regressions: I mean how you can also find bugs using pprof and flame graphs.

But before we can cover flame graphs, let's first talk about pprof. Go comes with pretty powerful profiling built right into it, called pprof, and pprof supports a whole bunch of different profiles. A lot of people are familiar with CPU profiling: roughly a hundred times a second it looks at what's running on the CPU, takes the stack trace of what was running, and records it. That's basically stack sampling, and it tells you where your time is being spent. There's also heap profiling, where Go is, not constantly but for a fraction of memory allocations, recording the stack trace of where that memory was allocated. pprof lets you view roughly what memory you're currently using, or you can even look at what has been allocated since the start of the program. You can control the sampling rate as well; I think by default it tries to sample at least one allocation for every 512 kilobytes or so, but you can control that using runtime.MemProfileRate. There are goroutine profiles, which basically tell you every goroutine that's running in your program and its exact stack trace; you can even look at the arguments that were passed on the stack, or at least the first five or six of them, in the goroutine profiles, and we'll have a look at that later. There's also tracing, which Francesc kind of mentioned earlier: traces record a bunch of important runtime events, for example a garbage collection was started, a garbage collection entered stop-the-world, we did a memory allocation, a syscall. All of these are recorded in traces and give you a timeline of exactly what happened when.

You can use pprof to profile code during development, whether it's with your unit tests or with any benchmarks you've written, but the biggest advantage of pprof is that you can run these profiles against production.

So I'm going to do a really quick demo of pprof just to show you how it works. We have some code; don't worry too much about the exact thing I'm profiling, I want to show you the pprof interface and how you can use it. I'm going to cover the simple case where you want to profile some benchmark you have, just a basic benchmark you've written. You run that using go test; I'm going to run ReadHeaders. Normally when you run your benchmark you just get a number which represents how much time was spent inside that function. If you want to profile it, just pass an extra flag, -cpuprofile prof.cpu. Now Go is profiling that benchmark run, and it's going to create a file called prof.cpu with information about what was happening during the run.
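As a rough, self-contained sketch of the kind of benchmark being profiled here (the function name, data, and file layout are my own stand-ins, not the talk's actual code), something like this could be run with go test -bench=ReadHeaders -cpuprofile=prof.cpu:

```go
// headers_test.go: a hypothetical stand-in for the benchmark profiled in the talk.
package headers

import "testing"

// readHeaders converts raw key/value byte pairs into a map, roughly the
// shape of work the talk's benchmark exercises.
func readHeaders(raw [][2][]byte) map[string]string {
	h := make(map[string]string) // map creation shows up prominently in the profile
	for _, kv := range raw {
		h[string(kv[0])] = string(kv[1]) // []byte to string conversions allocate
	}
	return h
}

var sink map[string]string // keeps the compiler from optimizing the call away

func BenchmarkReadHeaders(b *testing.B) {
	raw := [][2][]byte{
		{[]byte("content-type"), []byte("application/json")},
		{[]byte("user-agent"), []byte("benchmark")},
	}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = readHeaders(raw)
	}
}
```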
If we want to look at that data and analyze it, we run go tool pprof. You have to pass the binary: Go creates a test binary when you run a benchmark with -cpuprofile, and that's mostly for symbols, so it's information about which symbol is at which memory location. So we run with that binary, pass the profiling data, and now we have the pprof interface.

The most basic command is top. You can pass a number, like 1, 10, 5, whatever; I tend to do top10. top10 shows you the top functions in terms of CPU usage, but this is just time inside the function itself, not any other functions it called. If you want the cumulative time inside that function and everything it called, you do top10 -cum. Let me do that again so we can see it a little more clearly, say top5. It kind of makes sense if you think about it: I'm running a benchmark, so of course the top cumulative entry should be my benchmark function, followed by the functions it called.

If you want to see what happened inside BenchmarkReadHeaders, we can use a command called list, so: list BenchmarkReadHeaders. Now we see the source code annotated with profiling information. We have two columns. The left column is time spent inside this function without calling any child functions, whereas the second column represents time spent inside functions called by this line. In this case, because it's a function call, there's no time in the left column; all of the time was spent inside the called function. So let's look at what that function was doing. Similar thing again: most of the time was spent inside another function, but you can see some time was spent on a different line as well, so you get some idea of where time goes within that function. Let's look at the underlying function again. Now we see something more useful, and it might be a little surprising: most of the time in this function was actually spent creating the map, and then we're reading a couple of strings and setting some things in the map. It turns out the most expensive part was creating the map.

One thing you might notice: I mentioned that the left column was for time spent within this function, whereas the right column is for things that were called. However, if you look at the line that sets a header, something like headers[k] = v, I'm not actually calling a function, so why is Go telling us the time was spent inside some function call? We can dig down into what your source code compiles into using another command called disasm, which brings up the Go disassembly; it's a little higher level than most underlying architecture assemblers. disasm also takes a regex for the function to match, so we pass the same function in, and we see a whole bunch of information about the actual assembly. One thing you'll notice is that when I set a map key to some value, under the hood Go is actually calling a function, which is why the profile data looked that way. So don't be too afraid of using disasm: it shows you what the runtime is doing under the hood, and it's surprisingly easy to read.

Now, I mentioned you can do other types of profiles. We did a CPU profile; let's try a memory profile. It's almost as easy: just pass -memprofile instead of -cpuprofile, which creates a separate file, and then you can use go tool pprof the same way, this time with the memory profile. One thing about memory is that there are two different measures. There's a measure of what is still being held on to, what hasn't been garbage collected, what references we are still holding, which is really useful in production because you want to see why your process is using 500 megabytes of RAM and what types you're holding on to in memory. But when you're doing things like benchmarks, you typically don't care about what you're holding on to; you want to see how many allocations you're doing per iteration of the benchmark loop. So you can tell Go you want the allocations, and you have two choices there as well. You can ask for alloc_space, which shows you information in terms of bytes and megabytes, or you can ask for alloc_objects, which tells you the number of objects you're allocating. Maybe you have one huge object and a hundred little objects; those hundred little objects are going to cost you, because it turns out allocations are expensive, whereas one large allocation doesn't actually cost that much. But if you look at the space allocation profile, you won't notice those smaller allocations; they'll just blend into the noise. So you can tell Go you care about the object count. Let's do that, and now it's telling us the number of allocations that happened.

So let's look at that readHeaders function again. Now it's telling us exactly where we're doing our allocations. Of course, making a map allocates; reading the strings allocates, because this is converting byte slices to strings; and it looks like setting the keys is sometimes allocating as well. You'll notice that we're not allocating as much when we set keys as when we allocated the map originally or read the strings. Why is that? Well, first of all, we're profiling a fraction of allocations. For example, the key and value conversions are likely to be exactly the same in production, but the numbers look a little off because we're profiling some percentage of them, and it just happened that the profile rate caught more of the keys than the values. So even though these should have been the same, they look a little different; that usually shows up as a factor of three or four at most. Whereas this other difference is huge. Why is it so different? Because most of the time when I'm setting these keys in my map, I've already given the map a hint for how big it's going to be, so it doesn't need to resize. But whether you need a resize or not really depends on which buckets your keys and values happen to fall into, and whenever you do need to resize, you end up with an extra allocation. This is the kind of information that's really useful to see from a heap profile.
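To make the resize point concrete, here is the hint applied to the hypothetical readHeaders sketch from earlier; this only illustrates the idea the talk describes, it is not the benchmark's actual code:

```go
package headers

// Same hypothetical readHeaders shape as before, but with a capacity hint
// passed to make. Pre-sizing the map means inserts rarely trigger a resize,
// so far fewer map-growth allocations show up in an alloc_objects profile.
func readHeadersHinted(raw [][2][]byte) map[string]string {
	h := make(map[string]string, len(raw)) // capacity hint
	for _, kv := range raw {
		h[string(kv[0])] = string(kv[1])   // []byte to string conversions still allocate
	}
	return h
}
```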
Now, I mentioned that you can also profile a running binary. So how do we do that? First I'm going to show you what you need to do to a binary so you can profile it at runtime. It's really, really simple: just add a blank import of net/http/pprof, and here I'm just listening and serving on the default ServeMux. By blank-importing the net/http/pprof package you are registering the pprof handlers on the default ServeMux, so if you're using the default ServeMux, that's all you need to do.
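A minimal sketch of that setup, assuming a port and a trivial handler of my own choosing (the essential part, straight from the talk, is the blank import and serving on the default ServeMux):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof" // side effect only: registers /debug/pprof/* on the default ServeMux
)

func main() {
	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})
	// Serving with a nil handler uses the default ServeMux, which is what
	// makes the pprof endpoints reachable without any extra wiring.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

If you use your own ServeMux rather than the default one, you would need to register the pprof handlers on it yourself; net/http/pprof exposes handlers such as pprof.Index for that purpose.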
So I've already got the program running, and I'm just going to start some sort of benchmark, because if you don't have any load, the profiling output is just empty; there's nothing running. Let's create some artificial load, and now we want to see what that pprof page looks like. We can just go to /debug/pprof, which is the default path where net/http/pprof puts the endpoints. You see a bunch of different useful information, like goroutines. You can click goroutine, and it's telling us that we have 21 goroutines with this stack trace, and the stack trace is basically, for every TCP connection, the HTTP library reading from that connection. So we roughly have 21 TCP connections going into this process right now. There's also heap, which is a little hard to read without using go tool pprof.

So how do we use go tool pprof to look at this data? It's pretty simple: go tool pprof, and you just pass the URL, and all of a sudden we are profiling our production service and getting a heap profile of what's going on. It turns out there's not much being referenced by this program, so it could be that my benchmark has ended; let me start that again. We can also say we care about allocated objects, because there should be a lot of those, and there we go, there's a whole bunch of different objects. You'll notice I'm looking at a view that isn't the command-line interface we were using earlier; I used a command called web to create it. web generates an SVG and shows you the hot paths, so you can see that you're allocating a bunch of different objects from this function here, and it shows you the path taken before you allocated the object.

Now let's go back to the slides for a bit. We just saw the visualizations of pprof data. As Francesc mentioned, 1.9 came with some improvements to pprof. Before 1.9, this is what the data looked like: monochrome, pretty hard to follow. There were some differences in the sizes of the boxes, but it was hard to see what exactly was going on. 1.9 adds color, which makes it a lot easier to see where the hot paths are and where you're allocating a ton of objects. But even with all of this colored output there's a lot of data here, especially with a more complex program; you're going to have hundreds or thousands of functions allocating, and it can be hard to process all of it. That's where flame graphs come in.

Flame graphs are a visualization of profiling data that Brendan Gregg, who now works at Netflix, came up with around, I think, 2011. Flame graphs are a much easier and faster way to comprehend this profiling data: instead of digging through a really complex graph with a large number of nodes, you can see at a quick glance where your time is being spent. So let me open this flame graph and make it a little bigger so we can see what's going on. Very quickly I can see my reads and my writes, and you can tell immediately what proportion of the total time they're taking up. You didn't have to follow the graph, you didn't have to read percentages, you can just visually see that writes are a little bit slower than reads.

Now, in a flame graph the y-axis shows the stack depth: this is the leaf function call, and this was the stack trace leading up to that call. On the x-axis, the position actually doesn't mean anything, which is a huge point of confusion for a lot of users, because they think the x-axis represents time. That is not the case with flame graphs; the x-axis is just alphabetically ordered, there's nothing special about it. What is important, though, is the width. The fact that this box is not as wide as this box tells us something: the width represents how long that function spent on the CPU.
So this box being wider means it spent more time on the CPU. That doesn't just mean the function ran for a really long time; it could mean that, but it could also mean you're calling that function a lot of times, so it happens to be on your CPU a lot of the time. One of the other big advantages of flame graphs is that you can zoom in. This read and write, maybe I can't really do much about it; I want to debug some specific part of my program, I want to see what the performance is like inside some handler. I can click in and zoom in, and now that read/write noise has just moved away, I don't need to worry about the garbage collector, and I can focus on what's happening inside my handler. You can keep zooming in until you find the information you're looking for. So that's flame graphs in a nutshell. The other question I get is: what do these colors mean? Contrary to what you might expect, the colors don't actually mean anything; they're basically random, but they're chosen from yellows, oranges and reds so it looks like fire, hence flame graphs.

So let's generate a flame graph for the same data we were profiling earlier. Let's start the same benchmark. How do we generate a flame graph? Well, this was the command line we used to open a pprof session; it's almost exactly the same, just replace go tool pprof with go-torch. That's it, nothing else you have to do. It creates an SVG, you open that SVG, and there you go, we've created a profile for the HTTP server that was running. Now you can very quickly see: I don't really care about reading the request, that's stuff in the HTTP library, I can't really optimize what the HTTP library does to read a request; I care about my handler. Let's zoom into my handler. Within my handler I can see very clearly there are three big things I'm doing: regexes, encoding JSON, and converting a time to a string. So if I were to optimize, I would pick the widest bar. For the regexes it looks like there's a compile in there; we're not using compiled regexes, so let's precompile our regexes rather than compiling them on every single request. It's very easy to see what's going on with flame graphs and to zoom in very quickly.
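The regex fix described here is the usual pattern of compiling once at package initialization instead of on every request. A hedged sketch, where the pattern, variable name, and handler are hypothetical rather than the talk's actual handler code:

```go
package server

import (
	"net/http"
	"regexp"
)

// Compiled once, at startup.
var idPattern = regexp.MustCompile(`^[0-9]+$`)

func handler(w http.ResponseWriter, r *http.Request) {
	// Calling regexp.MustCompile here instead would recompile the pattern on
	// every request and show up as a wide regexp.Compile bar in the flame graph.
	if !idPattern.MatchString(r.URL.Query().Get("id")) {
		http.Error(w, "bad id", http.StatusBadRequest)
		return
	}
	w.Write([]byte("ok"))
}
```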
So flame graphs can be really useful for analyzing any performance issues you're having, whether it's CPU or memory, but there are other things you can do with them: they're actually really useful for debugging other issues in production. I'll go through a few examples where I've used flame graphs and pprof to debug things like memory leaks, goroutine spikes, and even deadlocks.

One thing we do in production at Uber is that we've enabled the pprof endpoints by default in our RPC frameworks, so almost every service has pprof by default. There's no overhead to registering these handlers; you only pay a cost when you hit the profiling endpoint, and until you hit it, it's free. So why not register them? Then, when you have an issue, you know where to go looking for more data. To make it easier for developers we also have some tooling that basically SSHes into production and forwards ports, so that you can run pprof, go-torch, whatever it is, on your local laptop, because that's what all the tools assume: when you run web, it assumes it can just open a browser on your local machine. So we forward all of these ports to the developer laptops using a little tool.

Let's talk about some of the issues we've seen in production recently. Before I cover memory leaks, I want to first mention that memory leaks can happen even in a garbage-collected language. People are sometimes surprised when I say "memory leak" and "Go": wait, you don't have to free your memory, that's what the garbage collector does, how can you have a memory leak in Go? Memory leaks still happen in garbage-collected languages because you can hold on to a reference for longer than you intend to. You may have a list or a slice somewhere that's global and that's keeping track of every request you received; because you're holding on to those references, you now have a memory leak, and over time you're using more and more memory and your server is going to slow down.

So recently a service owner reached out asking for help debugging a Go memory leak. The easiest thing I typically do when people ask me to debug a memory leak is: take a heap snapshot before you restart the service, while you still have the leak; restart it, so you have a fresh instance; throw a few requests at it to warm it up; and compare the two heap snapshots to see what you're leaking. What is normal, what is supposed to be there because you cached it, what are objects you always create on startup, and what is actually the memory leak? The easiest way is to compare the difference between pre-restart and post-restart. I did that in this case: we took the snapshots and looked at the web view. Now, it is possible to figure out what the memory leak is using these two graphs; the problem is there's a lot of data here, and it's going to take a while, you have to put some effort into reading it: what are the numbers, what are the expensive things, what can I ignore? Instead you can throw that information into a flame graph using go-torch, and this is what it looked like; let me make it larger so everyone can see. This is the same comparison, exactly the same profile, except with the flame graph you can see very clearly that there's something going on down here that is not there in the other graph, and it turns out that was actually the issue.

What exactly does it say? It says runtime.systemstack, and you're probably thinking: what is runtime.systemstack, and how did it end up in my profile? After a little digging I saw references to deferproc in the code, and then I looked through the runtime, and it turns out this happens when you defer some work and there isn't enough stack space, so it uses the system stack; there's something about how defers use the system stack. So I knew I was looking for something related to defer. As the next step I used the pprof /debug/pprof/goroutine page, where I could see every goroutine running in this service, and each goroutine has a stack trace with exact file and line information about where it's running. Using that, I was able to quickly find which goroutines were running, trace them to the code, and I saw some code that looked a little like this. It's probably obvious to some people, but basically someone tried to use a defer to recover from panics, because this is a long-running service and they didn't want one little bug to crash the service. They used a defer, but they used it in a for loop. Defers don't run at the end of each loop iteration; they run at the end of the function, and this function is not expected to end during the lifetime of the service. So what ends up happening is that we keep piling up defers on the stack, or in memory in general, and these defers are never actually run, and that's what ended up causing the memory leak.

The surprising thing was that even after a few days of running, the service was still only consuming about 300 or 400 megabytes. It wasn't a huge amount; it was a very slow memory leak that would have been hard to find, but thanks to the pprof pages and the flame graphs it was surprisingly easy. And this ended up saving the service owner from a bunch of errors, because once they got into this state they were serving requests a little slower than before, and they're a service with a very low timeout, so all of a sudden one percent of their requests were failing. All it took was moving the defer one line out: no more issues, no more restarts, just a one-line fix for a memory leak.
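A hedged sketch of the shape of that bug and the fix; the names and the job-channel structure are my own illustration, not the service's actual code:

```go
package worker

import "log"

// Buggy shape: a defer inside the loop of a function that is never expected
// to return. Deferred calls only run when the surrounding function returns,
// so these closures pile up on every iteration and are never executed,
// which is the slow memory leak described in the talk.
func processBuggy(jobs <-chan func()) {
	for job := range jobs {
		defer func() {
			if r := recover(); r != nil {
				log.Println("recovered:", r)
			}
		}()
		job()
	}
}

// Fixed shape: the defer moved one line out of the loop. It is registered
// once; if a job panics, the recover runs and the function returns instead
// of crashing the whole process, and nothing accumulates in the meantime.
func process(jobs <-chan func()) {
	defer func() {
		if r := recover(); r != nil {
			log.Println("recovered:", r)
		}
	}()
	for job := range jobs {
		job()
	}
}
```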
Another kind of issue I've dealt with is goroutine leaks. We typically report a bunch of runtime metrics from our processes in production so we can trace things like "this deploy is doing something odd" or "memory spiked". We report a whole bunch of Go-specific information, such as the heap usage according to Go, which we get from runtime.ReadMemStats, and we also report the number of goroutines. In one case we saw a huge spike in the number of goroutines. Again, using a flame graph we were very quickly able to see what these goroutines were doing. We could look at it and say: these two are actually related to reading and writing, you can see read here and this one is for writing. Those were expected, because for every TCP connection we need one goroutine to read and one goroutine to write, so we knew those were fine. This other one was the surprising leak. We looked into what was happening there, and you can see a runtime.chansend, which is what the goroutine is doing. What does that mean? It means a goroutine was blocked on writing to a channel. It turns out this was supposed to be an asynchronous endpoint: it took some request data, put the request into a channel, and responded immediately, and there was a background goroutine processing that work. What happened was that the background goroutine slowed down a huge amount, because the downstream service it used to process the data was having a bunch of timeouts. We were seeing a whole bunch of errors, and because we retry until a certain timeout, processing slowed down a huge amount and put a ton of back pressure on the channel. The channel was full, and now we were blocking a whole bunch of goroutines on that channel. The fix was surprisingly easy: just return an error if the channel is full. Don't write to a channel without checking whether it's full first, especially in an asynchronous context.
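A minimal sketch of that fix, assuming a hypothetical work queue type and error name; the essential part from the talk is the non-blocking send instead of an unconditional one:

```go
package queue

import "errors"

var ErrQueueFull = errors.New("work queue is full")

// Work is a stand-in for whatever the asynchronous endpoint enqueues.
type Work struct{ Payload []byte }

type Queue struct {
	ch chan Work
}

func New(size int) *Queue {
	return &Queue{ch: make(chan Work, size)}
}

// Enqueue does a non-blocking send: if the background consumer has fallen
// behind and the channel buffer is full, return an error to the caller
// instead of parking the request goroutine on the send.
func (q *Queue) Enqueue(w Work) error {
	select {
	case q.ch <- w:
		return nil
	default:
		return ErrQueueFull
	}
}
```

The caller can then surface ErrQueueFull to its client (or shed the work) instead of silently accumulating blocked goroutines.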
And thanks to pprof this was just as easy to detect as well: you can either use a flame graph, or you can just go to the /debug/pprof/goroutine page, which gives you all of this information too.

Now, once we got rid of the bad downstream service that was timing out, we expected all of these goroutines to clear up, because we were processing the backlog and the channel should have emptied out, and we should have had no more issues, right? Well, it turns out we were watching this number here, which comes from ps, and the memory usage was over a hundred gigabytes. At this point there was nothing odd in our goroutine numbers; everything looked good, goroutines were back to normal. So why was this still using such a huge amount of memory? We went to pprof, and you quickly see something interesting: runtime.malg, not something I'm used to seeing. What's going on here? A few searches later: this is memory related to goroutines. While we were backlogged we had up to, I think, a million goroutines running, and even though all of those goroutines had been drained and weren't running anymore, there's some memory related to goroutines that is never freed in the runtime, so we were always going to hold on to that memory. But a hundred gigabytes is a huge amount; goroutines are supposed to use a tiny amount of memory, a few kilobytes each. What happened? It turns out there's a known issue where, when you allocate a whole bunch of goroutines, the goroutine descriptors can leave your heap in a fragmented state. As you do more allocations, they can't use the free memory that's available because of the holes that are left, so we ended up with a huge amount of fragmentation and a hundred gigs of RAM in use. Unfortunately, in this case we didn't really have any choices; we just restarted the process and that took care of it.

pprof can also help you find deadlocks. The most common way to run into deadlocks is misuse of the locking structures, whether it's sync.Mutex or sync.RWMutex. We had a slightly more subtle deadlock in one case, where we were accidentally taking a read lock and then taking the same read lock again in a separate function. We didn't realize this was happening, of course, because most of the time it works out fine. The only scenario where you see a problem is if someone tries to take a write lock in between your first read lock and the second read lock. That doesn't happen very frequently, but it happens, and when it happens, all your goroutines are deadlocked. We detected it because, again, we could see a large spike in goroutines and memory usage, since something was deadlocked and things weren't being processed. So I used the goroutine page again, except you'll notice this debug=2. What is debug=2? When you look at the goroutine page there are actually two different versions of it. One does a whole bunch of aggregation for you, so it says there are 33 goroutines with this exact stack; it aggregates that information for you. debug=2, on the other hand, gives you information about every individual goroutine, and one of the most useful things is that it also tells you exactly how long a goroutine has been in some blocked state. Here we could see a semacquire, which means we're trying to acquire some semaphore, which is typically what happens on a lock. This one here isn't actually so bad, it was something waiting on a close, that's okay, but you'll see that the read locks were also in this semacquire state. Using this, we were able to find the stack trace that led to it: how did we end up taking the same lock twice? We traced through the stack trace, and of course fixing it was super easy. It was a one-line change: instead of calling a function that takes the read lock for you (it was supposed to be a helper function that takes the read lock and returns the information), we just accessed the map directly.
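A hedged sketch of that deadlock pattern and the one-line style of fix; the Registry type, field names, and helper are hypothetical stand-ins for the service's real code:

```go
package registry

import "sync"

type Registry struct {
	mu    sync.RWMutex
	peers map[string]string
}

// get takes the read lock itself; it is meant to be called without the lock held.
func (r *Registry) get(key string) string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.peers[key]
}

// Buggy shape: the read lock is already held, and then get re-acquires it.
// This usually works, but if a writer calls Lock between the two RLocks,
// the second RLock queues behind the writer while the writer waits for the
// first RLock to be released, and every goroutine involved is stuck.
func (r *Registry) describeBuggy(key string) string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return "peer: " + r.get(key) // re-entrant read lock: potential deadlock
}

// Fixed shape, the one-line change described above: access the map directly
// while the read lock is held instead of going through the locking helper.
func (r *Registry) describe(key string) string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return "peer: " + r.peers[key]
}
```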
One other example I want to cover of how pprof has helped us debug production issues is a very tricky memory leak I ran into recently. We have a forwarding proxy: you send your request to the proxy, and the proxy sends it to the destination. A ton of requests go through this proxy, these proxies are everywhere, and we suddenly saw about five or six instances using a ton of memory. This thing usually uses less than a few hundred megabytes, but some of these instances were using gigabytes, so we decided to investigate.

The first thing I did, of course, was open up pprof. Here is the actual pprof output from that issue. There's a lot going on, and it's pretty hard to tell what exactly the issue is. We could see roughly what was causing the memory usage to spike: a specific type called a frame. Any time you send a request, we read it into a frame and we forward the frame. A frame is a very generic type, the most basic type in this proxy, so knowing that frames are being leaked doesn't really tell us much. We need to know more, like who is still holding references to these frames. Unfortunately there's no easy way to get that information from the Go runtime right now, so all you can do is use the rest of the information pprof makes available to figure out what went wrong. We tried to do that using this page here, but because there's so much data it was pretty hard to analyze. Instead we made a flame graph, and what was useful there was that we now knew, proportionally, how many objects of each type were being allocated, by looking at the leaves. One thing I immediately noticed was this object here called a peer. A peer represents a back-end instance: every time you have a connection to some back-end, you have a peer object to represent it. I looked at how many connections this instance had, a couple of thousand, and yet somehow we had 95,000 samples of peer objects being referenced. This doesn't make any sense: why do we have 95,000 peers if we only had 2,000 connections? And this helped us eventually trace down the issue.

It turns out we were holding on to peer objects, and a peer object is tiny, but a peer object references connections, very indirectly. What was happening was that we maintain a list of connections per peer, and when a connection is gone we clear it out: we have a slice with the connections, and every time a connection ends, we remove it by moving it to the end of the slice and then truncating the slice. Simple enough: we're no longer referencing that connection, the connection list should be empty, and it was; the length of all of these slices was zero. Yet somehow we were leaking memory. It turns out that because we didn't nil out that last element in the slice, the underlying array still held a valid pointer. So from the code's perspective there was a length-zero slice, but there was still data underneath that slice pointing to valid memory, and that caused the leak.
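A hedged sketch of that removal pattern and the fix; the Conn and peer types are stand-ins, not the proxy's actual types:

```go
package pool

// Conn is a stand-in for a real connection type.
type Conn struct{}

type peer struct {
	conns []*Conn
}

// Buggy shape: swap the closed connection to the end and truncate. The slice
// length drops, but the underlying array still holds the pointer in its last
// slot, so the garbage collector can never free that connection or anything
// it references.
func (p *peer) removeBuggy(i int) {
	last := len(p.conns) - 1
	p.conns[i] = p.conns[last]
	p.conns = p.conns[:last]
}

// Fixed shape: nil out the last element before truncating so the array no
// longer references the removed connection.
func (p *peer) remove(i int) {
	last := len(p.conns) - 1
	p.conns[i] = p.conns[last]
	p.conns[last] = nil
	p.conns = p.conns[:last]
}
```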
We were only able to debug this because we could look at the flame graph and say: these two objects appear in roughly the same numbers, why do we still have peers and connections that shouldn't exist, how did these references survive? And we eventually traced it down to that one line; that was all it took. This is also a good life lesson: if you have pointers in a slice and you're truncating the slice, nil out the elements first, otherwise you will leak memory, and that leak won't necessarily be tiny, it can leak gigabytes. So that was a lesson we learned. If you're curious about this issue, when I eventually send out the slides feel free to look at the linked issue, which has more information.

So let me recap the talk. pprof and flame graphs are great for profiling, but they're not just for profiling; you can debug a whole bunch of production issues using flame graphs and pprof. Typically, when you're dealing with a production issue you want to figure out the problem as quickly as possible, and that's where flame graphs come in, because they help you comprehend the data very easily and very quickly. We tend to use flame graphs very frequently in our production ecosystem to help service owners debug any issues they're seeing.

One other thing I wanted to cover is that Go 1.9 adds a new feature to pprof: labels. You'll notice that everything I showed was very much dependent on the code location: which function are you running, what line of code in that function allocated this. There's no way to slice that information by any kind of runtime parameter. For example, maybe you have two different callers, and one of them is, say, a bank and really important, and you want to know how much memory you're allocating for their requests, because you need those requests to happen as quickly as possible. You can do that now with Go 1.9, because the same function can have different labels based on some runtime information, such as a request parameter or a header or whatever it is. You can add that as a label, and that lets you bucket your data in different ways and slice it in a way that isn't just your code; it lets you model your runtime shape as well.
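A sketch of what those labels look like with the runtime/pprof API that shipped in 1.9; the "caller" label and the X-Caller header are my own illustration of the "important caller" example above, not code from the talk:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"runtime/pprof"
)

func handle(w http.ResponseWriter, r *http.Request) {
	// Tag this request's work with a label derived from runtime data
	// (here, a hypothetical request header identifying the caller).
	labels := pprof.Labels("caller", r.Header.Get("X-Caller"))
	pprof.Do(r.Context(), labels, func(ctx context.Context) {
		// Profile samples taken while this function runs carry the "caller"
		// label, so they can later be bucketed by caller rather than only
		// by code location.
		w.Write([]byte("ok"))
	})
}

func main() {
	http.HandleFunc("/", handle)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```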
So that's all I had. One thing I wanted to mention: I focused here on the production issues you can debug using flame graphs and pprof, but if you're more interested in profiling and optimization specifically, I've given a talk on that topic previously, so feel free to check that out. Other than that, that's all I had. Any questions?

[Applause]

That was awesome. Any questions? Is this on? Yeah, any questions?

What do you do in a situation where the thing you're trying to profile takes a long time to run? So the question is basically how you track long-term trends of whether something has gotten faster or slower, because profiling is kind of expensive. Typically we don't use profiling to look at whether a code change got faster or slower over the long term; we tend to look at metrics like latency instead. If you keep track of your latency over months and months, it's very obvious when something changed with a deploy. And don't look at your P50; your medians are actually not as useful as your P99.9s. Look at some of the worst latencies you've seen, not the absolute max, because the max is often affected by external factors like CPU scheduling or your OS in general, but the P99s and the P99.9s give you good information about whether your code has actually gotten worse or better. So that's what we tend to use. Any more questions?

How much overhead does adding labels from runtime data add to an app? That's a great question. I personally haven't measured the impact yet; I've played with it a little, but it's still a little bit early, and I don't think the labels are shown in the pprof UI yet, so it hasn't been useful enough for me to add it to production services and get the benefit out of it. It's more of a "do this now, because when we eventually have tooling to show these labels it'll be useful". Looking at the code, it doesn't seem like a huge overhead; it seems to just associate a label set with your goroutine descriptor, so it doesn't seem particularly bad. Jaana Dogan wrote a blog post on exactly that topic, so you might want to check it out.

Any more questions? Okay, so thank you very much. Thank you.
Info
Channel: The Go Programming Language
Views: 5,062
Rating: 4.9733334 out of 5
Keywords: go, golang, performance, pprof, flamegraph, prashant
Id: aAhNDgEZj_U
Length: 36min 47sec (2207 seconds)
Published: Tue Oct 03 2017