A Simple Approach to the Advanced JVM Profiling

Captions
Hi everyone, thanks for joining us for this session. I'm your host, Mala Gupta, and today I'm pleased to welcome Andrei Pangin, author of async-profiler and a Java Champion. He works as a principal software engineer at Odnoklassniki. Hi Andrei, it's a pleasure to have you present today.

Hi Mala, hi everyone. Thank you for the introduction. It's an honor for me to be here to conduct this webinar, especially on one of my favorite topics.

Before I let you take the stage, Andrei, I have a little more to add. Andrei will talk about how to find performance bottlenecks using profilers. But given multiple choices, which profiler should you choose? Today Andrei will demonstrate how you can use async-profiler, and how it discovers subtle performance issues, native memory leaks, input/output problems, and other noteworthy pitfalls in Java applications. Let me also share some quick housekeeping details. For anyone attending the session, please use the YouTube chat to post your questions. Kirill from the JetBrains team and I will answer your questions as you pose them, so you don't have to wait until the end of the session. Andrei will also take multiple breaks while presenting to answer your questions. And lastly, the session will be recorded and posted to IntelliJ IDEA's YouTube channel. So Andrei, I'll let you take the stage now.

Thank you. So here I am, and here is my Twitter handle, which is also a kind of official channel for news about async-profiler, so make sure to follow me. You may also find me on Odnoklassniki, one of the largest social networks in Russia: many millions of users, thousands of servers, highly scalable distributed Java applications. I'm crazy about everything related to JVM technologies: I used to develop the HotSpot JVM before, and besides that, I'm the author of the profiler I'm going to speak about today, so don't expect me to be unbiased.

Nevertheless, we'll talk today not only about async-profiler. I'll start with some generic profiling theory, then I'll introduce async-profiler and show how to use it standalone and inside IntelliJ IDEA, and finally I'll show some advanced profiling techniques.

Why do we talk about profiling at all? We as developers care about the performance of our code. When we investigate bugs, we use debuggers; so what do we do if we or our users experience performance problems? A profiler shouldn't be the very first thing to reach for, because we first need to know what resource is stopping us from getting better performance: is it CPU, memory, network, I/O? In our production environment we monitor all resources. For example, in this picture the CPU seems to be underutilized, so it's unlikely to be the root cause of the problem. But what if the picture looks like this? Here we clearly see that the CPU got saturated, and to improve performance further we need to either add more resources or optimize our code. This is where we need to start CPU profiling.

So let's dive into CPU profiling. All profilers can be divided into two large categories. First, instrumenting profilers: they use bytecode instrumentation to modify your code, basically recording every method entry and exit. This gives you the full execution trace with timings, but of course it implies a big overhead. Furthermore, it distorts the real execution profile, since the code being measured is not the same as you would run without profiling.
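To make the entry/exit recording idea concrete, here is a minimal, purely illustrative sketch of what an instrumenting profiler effectively turns a method into. The Tracer class and all names are hypothetical, not taken from any real tool:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the calls a profiler injects via bytecode
// rewriting; real tools generate equivalent entry/exit hooks.
public class Tracer {
    private static final Map<String, LongAdder> TOTAL_NANOS = new ConcurrentHashMap<>();

    static long methodEntry() {
        return System.nanoTime();
    }

    static void methodExit(String method, long startNanos) {
        TOTAL_NANOS.computeIfAbsent(method, k -> new LongAdder())
                   .add(System.nanoTime() - startNanos);
    }

    // What an instrumented method effectively looks like after rewriting:
    // every call pays for two timestamps and a map update, which is exactly
    // the overhead and distortion described above.
    static long instrumentedMethod(long x) {
        long t = methodEntry();
        try {
            return x * x;   // the original method body
        } finally {
            methodExit("Tracer.instrumentedMethod", t);
        }
    }

    public static void main(String[] args) {
        for (long i = 0; i < 100_000; i++) {
            instrumentedMethod(i);
        }
        TOTAL_NANOS.forEach((m, t) -> System.out.println(m + ": " + t.sum() + " ns"));
    }
}
```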
For profiling production code, sampling profilers are more appropriate. How do they work? They basically take a snapshot of thread state every small period of time. This is a lightweight approach, and by tuning the sampling period you can control the amount of overhead. In our company we do use sampling in production, and my first topic will be about sampling profilers.

How does a sampling profiler take a thread dump? There is a simple, convenient Java API, Thread.getAllStackTraces, which gives you stack traces for all Java threads, and there is a corresponding function to be called from native agents. This approach is standard and supported on all platforms; it doesn't require anything to be tuned on the JVM side, no extra JVM options.

Well, if there are facilities for profiling right there in the Java API, isn't the problem solved? Indeed, there are many tools on the market, both free and commercial. I used to demonstrate VisualVM before, but in reply to the announcement of this session I saw a nice tweet and decided to show JProfiler instead.

So, time for a demo. Here is my small program; actually, this is my favorite example, which shows that almost every sampling profiler lies. The test is pretty simple: it creates a large StringBuilder of one million characters and repeatedly, in an endless loop, appends 5 characters to the end of the StringBuilder and removes 5 characters from the beginning. Try to guess what will be the most CPU-intensive part of this code.
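Here is a minimal reconstruction of the test just described; the exact code from the webinar may differ in details:

```java
// StringBuilderTest: almost every safepoint-based sampling profiler
// misattributes where this program spends its time.
public class StringBuilderTest {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1_000_000; i++) {
            sb.append('a');          // build a 1M-character buffer
        }
        while (true) {
            sb.append("abcde");      // cheap: amortized O(1)
            sb.delete(0, 5);         // expensive: shifts ~1M chars left
        }
    }
}
```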
If we run this under a profiler, what will the profile show? As I promised, here is JProfiler with our StringBuilder test: start recording CPU data, let it work for a few seconds, and it shows that nearly 95 percent of the time is spent in Thread.isAlive, a native method. So basically all the thread's time is spent in this method. Do you believe that? I don't.

OK, but we have Flight Recorder, which is known to be a modern, really powerful tool. Maybe with Flight Recorder and Mission Control we'll see something better. Let's try the same test: start a flight recording, 10 seconds will be enough. So, your guesses: will Mission Control with Flight Recorder do better or not? And here is the answer: no answer. Flight Recorder was able to take literally one sample, from a completely irrelevant place in the code, and there is nothing about our StringBuilder test at all. So it didn't help us either.

Why do profilers lie to us, or see nothing at all? The reason is safepoints. To take a thread dump, the HotSpot JVM needs to stop all threads, but it cannot stop an application at an arbitrary point. There are certain places in the compiled code where it is safe to stop execution and where the JVM knows how to get a stack trace. These places are usually at the end of a method, before return, and inside loops. When the JVM requests a safepoint, threads continue to run until they reach the nearest safepoint. That's why, if there is a long plain piece of code with no loops, it has no chance of appearing in a stack trace. This is not a problem of a particular profiler, but a common problem of all safepoint-based profilers; it is a deficiency of the API itself. There is an excellent blog post by Nitsan Wakart explaining why sampling Java profilers are terrible; it's quite an old article, but still relevant, more than ever. In short, the reason is that profilers rely on the standard API, which is subject to safepoint bias.

Unfortunately, safepoint bias is not the only problem of profilers based on the GetAllStackTraces API. Another big issue concerns native methods: the JVM has no clue what a native method is doing, whether it is sleeping or doing CPU-intensive work. Consider this simple example: an application is reading from a socket. In one case the socket may be receiving more and more data, while in the other case it may be just waiting for data, doing nothing. The JVM doesn't distinguish between these two cases: in both, the thread is considered RUNNABLE, and in a traditional profiler we will see literally no difference between the two threads, even though the busy client actually receives data while the idle client just sleeps in a native socket read method. Here's the summary of the problems of the standard API.
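A sketch of the setup just described; the ports and buffer size are assumptions, and a separate sender process is assumed to feed only the "busy" port:

```java
import java.io.InputStream;
import java.net.Socket;

// Two identical reader threads: both block in a native socket read and are
// reported as RUNNABLE, so a GetAllStackTraces-based profiler cannot tell
// the busy one (constantly receiving data) from the idle one.
public class SocketTest {
    static Runnable client(int port) {
        return () -> {
            try (Socket s = new Socket("localhost", port);
                 InputStream in = s.getInputStream()) {
                byte[] buf = new byte[64 * 1024];
                while (in.read(buf) >= 0) {
                    // busy client: loops here consuming data
                    // idle client: stays blocked inside read()
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        };
    }

    public static void main(String[] args) {
        new Thread(client(5001), "BusyClient").start();
        new Thread(client(5002), "IdleClient").start();
    }
}
```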
How can we get over these problems? One attempt to solve them from the JVM side is the AsyncGetCallTrace API. This is an internal HotSpot API for getting a stack trace of the current thread, not only at a safepoint; that's why it's called async. A profiler is supposed to call this function from inside a timer signal handler. Initially this API was designed especially for Sun Studio, later renamed to Oracle Developer Studio. Even though it is not a standard API, it's a treasure that couldn't remain unnoticed, so other developers started using it too. One of the first open source tools featuring this asynchronous API was honest-profiler. async-profiler uses this API as well, and recently some commercial profilers also added a so-called async sampling mode, which likewise relies on AsyncGetCallTrace.

Another benefit of AsyncGetCallTrace comes for free: since the signal is delivered only to running threads, sleeping threads won't get a chance to appear in the profile at all. So if I rerun the previous example with two clients under honest-profiler, it will now show the real picture: the CPU is taken by the busy client, not the idle one.

But not everything is so bright about AsyncGetCallTrace. In many marginal cases it is unable to get a valid stack trace, and so we can see many "unknown Java" errors. By the way, that's also the reason why Flight Recorder in my very first example didn't show anything: even though Flight Recorder doesn't call AsyncGetCallTrace directly, under the hood it relies on the same stack walking mechanism in the JVM and therefore suffers from the same bug. The right way would be to fix it in the JVM; I submitted this bug with a detailed analysis almost four years ago, but unfortunately it is still there. Without diving deeper into technical details, I'll just say that I have a workaround for most of these problems in async-profiler.

That's what we have from the JVM side. Now, what can we do for profiling from the operating system perspective? Did you know that almost every modern CPU has a built-in profiler in hardware? Starting from the Pentium architecture, there is a special performance monitoring unit which collects various performance counters. It can be programmed to collect given counters and generate a hardware interrupt once a counter overflows. Linux provides a native interface to work with hardware counters: the perf_event_open system call, which allows you to subscribe to certain hardware events. Moreover, perf events provide a sampling capability right at the kernel level: while your program is running, the kernel can collect execution samples into a ring buffer, and these samples include some important things like the CPU registers, but what is most important for us is the call chain. It's exactly the stack trace: not only the stack trace of your application, but even the stack trace of what is happening inside the Linux kernel.

Of course, the kernel API is not easy to use directly, but there is the well-known perf utility that does all the complicated stuff for us. To profile an application with perf, you just run perf record, and after profiling, perf report brings up a table with the hottest functions. In this table we can see native functions, and even kernel functions, denoted by [k]. But what we don't see is Java methods: instead there are some random hex addresses. That's because perf knows only about statically compiled functions, not about the code generated by the JIT. Fortunately, there is an open source library called perf-map-agent that can build a table mapping JIT addresses to Java method names, and using this tool we can profile Java methods with perf as well. But this requires another JVM option: perf support on the JVM side includes the -XX:+PreserveFramePointer option, available since JDK 8 update 60. What this option does is make the JVM reserve one register for keeping track of the top Java frame, so that perf is able to walk the linked list of frames. Since this register has to be set up and can't be used for other things in the code, using the option sometimes implies a performance penalty. Usually it is quite low, less than one or two percent, but in some cases the overhead can reach five percent or even more.

The most beautiful part of perf profiling is flame graph visualization. Thanks to Brendan Gregg and his blog, flame graphs are very popular nowadays, and with Brendan Gregg's scripts and a few shell commands you can convert perf output into cool interactive graphics. I hope almost everyone already knows what flame graphs are and how to read them, but in short, this is a compact yet comprehensive view of the application profile. Vertically, the graph shows stack traces: the y-axis is the frames of the stack, and the x-axis is the number of samples, i.e. how many times the method was met in the profile during sampling. So the longer a bar on this graph, the more time was spent in that method.

So far so good, but when I tried to profile application startup with perf and perf-map-agent, I got this strange picture: I saw a lot of time spent in the JVM, in the JIT compiler, but I didn't see Java code at all; instead, just this big red part, the interpreter. That's because perf can walk only through compiled Java frames, not through interpreted frames, and during application startup much of the Java code is not yet compiled, so we don't see it in the profile.

perf is a powerful tool, but when it comes to Java it has a number of limitations: there are no interpreted frames; it requires a special agent to collect symbols; it needs a separate JVM option; it requires tweaking a few kernel (sysctl) options; and finally, the data that perf collects can be really huge. For example, when profiling a large Java application with 1000 threads for just one minute, the collected data can take more than one gigabyte of disk.

What if we could take the best from both approaches: the top of the stack trace, with native and kernel functions, from perf, and the bottom, Java, part from AsyncGetCallTrace, which provides a fast and efficient way to get the Java stack trace?
This exact idea was implemented in async-profiler. It uses the perf_event_open API directly to receive stack traces from the kernel, and AsyncGetCallTrace to get the Java part of the stack traces. The tricky part is how to merge these together; whoever is interested in the implementation, please ask me later, I'll be happy to discuss it.

OK, let's finally get our hands on async-profiler. It's an open source project available on GitHub, with a rather permissive Apache 2.0 license. Go to the page, scroll to the download section; here is the up-to-date release, 2.0, released just a few days ago. It already supports a number of platforms; there is no Windows support yet. When you download and unpack the archive, you'll find these contents: the core profiler agent, the command line interface, the Java API, converters, and the most important part, the README, which everyone forgets about, but I highly recommend reading it before using async-profiler. Or, instead of downloading, installing and running it manually from the command line, you can just install IntelliJ IDEA: starting from 2018.3, it comes with async-profiler integrated.

OK, time for a demo. Any questions so far?

So Andrei, we had two questions. One was, I think, one of your favorites: a comparison between async-profiler and JProfiler. I'm not sure whether you want to take that now or later, because Kirill has already answered that question a bit. Do you want to take it now?

Yeah. JProfiler traditionally was based, as far as I understand (I don't know everything for sure, because it is a closed source commercial product), on the public API for getting stack traces, and it actually had two modes, instrumenting and sampling, the latter being safepoint-biased. Recently they also added an experimental async sampling mode, and I have a strong feeling that they have taken the code of async-profiler and adapted it for use inside. Kirill also mentioned that JProfiler has some enterprise-related features, for example it can show SQL queries inside the call tree. And the biggest difference that Kirill mentioned is that async-profiler can show native stacks of Java and native threads, which can help us understand if something is going wrong with GC or the compiler. Yeah, that's true; and the most recent version of JProfiler can also show native frames of Java threads.

OK, I think that was the only question that needed your answer, so I'll let you move forward with your session.

OK, so let's get back to our StringBuilder example. Here is IntelliJ IDEA Ultimate Edition, the Early Access Preview, and in the Run popup menu there are new menu items: we can run the application with a profiler, with the CPU profiler, the allocation profiler, or with flight recording, and the first two options are actually powered by async-profiler. So let's run our StringBuilder test in CPU profiling mode; five seconds will be enough. IDEA renders interactive flame graphs, and what we see now is that the StringBuilder test spends almost all its time in arraycopy, which is called from StringBuilder.delete. And this time I believe it, because indeed we need to copy the whole one million characters just to shift by 5 symbols, and of course copying 2 megabytes takes time.

OK, let's turn to our second example, the
socket test: two identical clients that just read from a socket stream, and here is the sender, which sends almost everything to the busy client and only one packet in 10 million iterations to the idle client. Here it is. If we expand the command line, we can see that IDEA adds async-profiler to the Java launch string. And here is what we see: only the busy client, no idle client. You also see the sender thread, and they both take roughly the same amount of CPU time. What is also interesting here is that besides Java frames, which are colored in brown, you also see blue frames, which are native functions, and violet frames, which are kernel stacks. So we can see what happens in the Linux kernel when we call send or receive.

The graphs are fully interactive: we can zoom into a frame, and there is a hint on each frame saying how many samples it got, i.e. how many times it was met in the profile. We can just start typing to search frames by keyword, and the matches will be automatically highlighted. Another interesting thing is that we can see the flame graph for the whole application, and a separate flame graph for each individual thread, for example for our busy client thread and the sender thread.

You may object that we are Java developers, and rarely does a Java developer need to dig so deep inside the Linux kernel and know what happens on a socket send. But I can reply that in our practice we do investigate such flame graphs a lot, because the real problems can happen there too. Here's a simple example. This is pure Java code, nothing fancy: we just read a whole file from an input stream, in chunks of a given buffer size. The buffer size is a configurable option. So what do you think will be the optimal buffer size for reading this large file? Let's try a few options. First, let's try one megabyte and see how long it takes to read the entire file. OK, that's seven to eight seconds. Intuitively, we think that the larger the buffer, the faster we can read the whole file. But if we reduce the buffer to just 32 kilobytes and repeat the test, it gets faster. Now a really large buffer, 32 megabytes: it takes even longer. Quite counter-intuitive: the largest buffer, the slowest application.

Let's run this under the CPU profiler. What we see is that most of the time is indeed spent in kernel code, and there are page faults, which we wouldn't see with a small buffer. These page faults explain why the larger buffer takes longer: this is how the standard native allocation algorithm works. When we ask the C runtime to allocate a large buffer, malloc falls back to the mmap system call, which allocates new pages from the operating system, and the first time we access these pages we get page faults. With a small buffer, the buffer is allocated right from the native heap, without mmap system calls.
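A sketch of the file-reading test, with assumed argument handling. The point is that each large read() is copied through a temporary native buffer on the JDK side, so a huge Java buffer means a huge malloc per call, served by mmap and fault-on-first-touch pages:

```java
import java.io.FileInputStream;
import java.io.InputStream;

public class FileReadTest {
    public static void main(String[] args) throws Exception {
        String fileName = args[0];
        int bufferSize = Integer.parseInt(args[1]);  // e.g. 32768 vs 33554432

        long start = System.nanoTime();
        long total = 0;
        try (InputStream in = new FileInputStream(fileName)) {
            byte[] buf = new byte[bufferSize];
            for (int n; (n = in.read(buf)) > 0; ) {
                total += n;   // each large read() also costs a native malloc,
            }                 // big enough to be served by mmap + page faults
        }
        System.out.printf("Read %d bytes in %.1f s%n",
                total, (System.nanoTime() - start) / 1e9);
    }
}
```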
Maybe this is not the best example, but I have another one. A fellow of mine told me that when he ran his benchmarks on Amazon EC2, all his benchmarks became enormously slow, and when he profiled, it turned out that most of the benchmark time was spent on measuring the time itself, by calling System.nanoTime. For example, here we see that in one second you can call System.nanoTime 20 million times. If we try another configuration, the throughput suddenly drops by two orders of magnitude. And again, the CPU profiler with kernel-level profiling helps to understand where the problem is: it tells us that the time is spent in kernel code. But normally, getting the system time should not call into the kernel at all; you know that system calls are rather expensive. The problem was in the clock source. In Linux we can set up several different clock sources for getting timestamps, and by default the fastest one, tsc, is used, but here I manually set another clock source, which is orders of magnitude slower. In the Amazon cloud, with its virtualization, there was yet another clock source. There is actually an article about configuring the clock source in the Amazon cloud; you'll find the link in my presentation. What I wanted to say is that kernel profiling is important too, and you can do it with async-profiler.

But CPU profiling does not always tell us where the problem is. For example, let me run this datagram test. This is a simple test which, in 10 threads, sends datagrams to localhost and counts how many packets it could send. Here it is able to send fewer than 50,000 packets, not so many. If we profile it with the CPU profiler, it again shows us a lot of kernel code, not very meaningful to a regular Java developer. What we do see here are some ReentrantLocks, which may give a clue: indeed, the problem is in locks, because not all threads are able to proceed, and some have to wait on a lock. But this is not at all obvious from the CPU graph. Fortunately, async-profiler has a special mode for profiling lock contention. Go to the profiler settings; here are the settings for the CPU profiler and for the allocation profiler, and here we can change the profiling event. If you replace the event with lock, you'll be able to see contended locks. The syntax of these agent options is the same as the syntax of async-profiler's options; you can find it in the documentation.

Now we'll start lock profiling, and everything becomes clearer, because here we see that there is indeed heavy contention on a ReentrantLock: an attempt to acquire a lock that was already taken happened more than 50,000 times, and all of these attempts came from DatagramChannel.send. You saw that I ran this test on a modern JDK 16, which was released just a few days ago, and it's a shame, but in JDK 16 there is still a lock inside DatagramChannel.send, so we cannot really send from multiple threads through one datagram socket. This is a big problem of Java NIO, and in our production systems, since we have really highly loaded applications which need to deal with many packets and big traffic, we don't rely on Java NIO; we have our own library, which is also available on GitHub: one-nio. Here is basically the same test, running the same loop in 10 threads; the only difference is that instead of a standard Java socket we create a one-nio socket. Let's run it with the CPU profiler. You see, the throughput increased almost three times, and now there are no locks at all in the profile; well, there were just three samples not related to sockets, just class loading.
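A sketch of the kind of datagram test shown here; packet size, port, and the reporting loop are assumptions, and the one-nio variant would differ only in the socket implementation:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.util.concurrent.atomic.LongAdder;

// 10 threads share one DatagramChannel and count sent packets. A lock
// inside DatagramChannel.send() serializes the threads, which the lock
// profiling mode (event=lock) makes obvious.
public class DatagramTest {
    public static void main(String[] args) throws Exception {
        DatagramChannel ch = DatagramChannel.open();
        InetSocketAddress target = new InetSocketAddress("127.0.0.1", 5555);
        LongAdder packets = new LongAdder();

        for (int i = 0; i < 10; i++) {
            new Thread(() -> {
                ByteBuffer buf = ByteBuffer.allocate(1024);
                try {
                    while (true) {
                        buf.clear();
                        ch.send(buf, target);   // contended lock inside
                        packets.increment();
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }).start();
        }

        while (true) {
            Thread.sleep(1000);
            System.out.println(packets.sumThenReset() + " packets/s");
        }
    }
}
```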
Any questions so far about CPU profiling and lock profiling from IDEA?

So Andrei, we have a lot of questions, but not specifically related to what you just mentioned. Let me quickly mention some, and you can decide whether you want to talk about them now. One question is: could a Java app compiled to a native image be monitored with async-profiler?

Native image with async-profiler... I haven't tried that. I guess the answer is no, so far, but it's definitely an interesting area for investigation.

The attendees are loving what you are demonstrating, and they want to try it out, but they have a lot of questions related to where they can use this profiler: is it ready to be used in production, can they use it remotely? One related question, from when you were adding some configurations, was: can we include this argument in an application server start script which is hosting the WAR files? The question was probably about whether we can add an option to start async-profiler together with an application.

Certainly we can, and in our production environment we include the async-profiler agent by default in all Java applications. When the agent is added, it doesn't start profiling; it just enables profiling for the future, and profiling can then be started on demand.

Right. There was another question, and I hear this question from a lot of developers: it is about the difference between attaching async-profiler at the start of the Java application versus attaching it while the application is running. Kirill answered that if you have a heavy startup, it is fine to attach the profiler after you're done with the startup. Is there a practice you see developers following?

OK, so, the question is about starting the profiler with the application versus attaching it later. One thing I didn't mention about the AsyncGetCallTrace API is that it needs debug info to be generated by the JVM when compiling methods, to be able to unwind inlined stack frames properly. This option can be turned on manually by specifying (let me write it here) -XX:+DebugNonSafepoints. But if we start the profiler agent right at JVM startup, this option is not needed, because async-profiler turns on debug information automatically. It can still attach later to a running application, even one started with no extra JVM options, but the profiling data may be less accurate.

I think we have some network issues with Andrei. Alex, is it only me who sees Andrei's screen freezing? OK, so, yes, I think Andrei is trying to reconnect. That's OK, it happens all the time; this is a live event. He says he's trying to connect. For everyone who's watching: as I mentioned in the announcement blog post, this is a bit longer session, it runs for one and a half hours, and depending on how many questions you have, we might extend it a little. So thank you for sending in all the questions, and don't forget to like the session and the video and subscribe to our YouTube channel; the buttons are right down there somewhere.

Andrei is still trying to connect, and I'm not sure whether I'd be able to answer any questions, so let me just pick one, by Steve. Steve, thank you so much for the question, in which you asked how the IntelliJ IDEA profiler compares to JProfiler. We've already discussed that question, and just to let you know: JProfiler and async-profiler, which Andrei is presenting today, are not by JetBrains; IntelliJ IDEA integrates with these profilers.

So Andrei is back again. Hi Andrei!

It was just me who dropped; I thought the chat was working, but there was no video.

That's OK,
it happens all the time, because we are doing this live. So I'll let you get on with your session, and I'll ask the remaining questions whenever you are ready to answer them.

OK, let me see. I'm sharing my screen now.

Yes, we can see it. Thank you.

OK, we talked about kernel profiling. Here is the link I promised about the clock sources. Another interesting case is how the RandomAccessFile implementation changed in JDK 10, and how async-profiler could reveal the problem inside a system call.

We've already seen a case with lock profiling. Sometimes CPU profiling doesn't help, because a thread may spend its time doing nothing: it can be sleeping, waiting for some condition, for a lock, for I/O and so on. In this case it doesn't consume CPU, and CPU sampling is useless, but you still want to know where the time is spent. That's exactly where wall-clock profiling comes onto the scene. The idea is to sample all threads equally, whether they are running or sleeping or doing whatever they do, just like the good old GetAllStackTraces API worked, but now without safepoint bias and without the need to stop the JVM, so with no safepoint pauses. async-profiler has an option for wall-clock profiling; here's how it's enabled from the command line interface and when using async-profiler as an agent. So far IntelliJ IDEA doesn't support this mode, but I believe that will be fixed very soon.

So let me show the next example from the command line. This example came from real practice: it was the simplest Micronaut application, just a hello-world-like application, but for some reason it took very long to start. Two seconds... two seconds looks like a bit much. Now even more: I've added some magic setup, and now the simplest hello world application takes almost four seconds to start. If we profile CPU with the async-profiler agent on the command line, let's see what the CPU profile shows us: too many things, nothing obvious at all. But if we look at the CPU profile carefully, there are just 200 samples, very few. This means the application did not spend much CPU; it was waiting for something else. Here is where wall-clock profiling helps; the only thing I change is the event type. This profile also shows us JVM threads, because JIT threads and GC threads are also waiting, and they also appear in the profile. But we are not interested in the GC threads; we just want to know what our main thread was doing, so we can exclude the unnecessary information, or just filter everything except stack traces that end with the main method. And now the problem becomes obvious: most of the time is spent in an InetAddress lookup, which comes from a determineCloudProvider function. So the Micronaut framework, during startup, tries to detect whether it runs in a cloud, and for this it does a network address lookup. Fortunately, this can be disabled by setting the Micronaut environment deduction option to false. I'll repeat the profiling, and our time is back to two seconds. That's still a lot for a hello world application, but at least there is no single obvious problem anymore; what we see now is that most of the time is spent in class loading, reading from jars. Yeah, almost everything is about class loading.

OK, moving on to the next examples. Another case where CPU profiling is not very helpful is when we have too much garbage collection. Let me run the next example with CPU profiling first. What does this example do? It reads a file line by line; the lines are actually key-value pairs separated by a colon. The test splits these key-value pairs and puts them into a map, and that's it. And it takes more than 20 seconds. While it's working, you may try to guess what will take the most CPU time here.
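While it runs, here is a sketch of the test just described; the file name handling and exact format are assumptions:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class MapTest {
    static Map<String, String> readMap(String fileName) throws Exception {
        Map<String, String> map = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            for (String line; (line = in.readLine()) != null; ) {
                String[] kv = line.split(":");        // allocates a temporary String[]
                map.put(kv[0].trim(), kv[1].trim());  // each trim() may copy again
            }
        }
        return map;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readMap(args[0]).size());
    }
}
```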
So, the most CPU time is spent in garbage collection: more than 60 percent of the entire execution time. This is quite interesting, because that's what traditional profilers won't show you at all: all they show you is this brown part, which is actually the smaller part here. With such a profiler we would probably spend a lot of time optimizing something like HashMap.put instead of focusing on the real problem. Here we see that the problem is garbage collection, and we can't affect garbage collection directly, except by reducing the allocation rate; and that's where the allocation profiler helps. Now we can run the same example with the allocation profiler, and it will tell us what objects are allocated and where they are allocated from.

Allocation profiling is not something that first appeared in async-profiler; many traditional profilers did it long before. But what is different in async-profiler? Analyzing the flame graph, we see that again more than 60 percent of allocations happen from String.split, and at the top of the flame graph are the class names of the allocated instances. So basically this one line is responsible for most of the allocations. How can we improve it? split is implemented by allocating a temporary array, but we don't need that if we know there are exactly two items, the key and the value: we can just look for the colon ourselves, and trim everything on the left and on the right side. Let's rerun our modified version.

About traditional profilers: yes, they did have allocation profiling, but how did they achieve it? They actually instrumented your code, basically adding a call to every allocation and recording each one. But recording every single allocation is quite expensive; that's why instrumenting profilers usually kept a counter and recorded, say, every 10th allocation, or one sample in a hundred.

OK, so by replacing just one split call we saved seven seconds of execution time. What is the next problem? Another thing we see is that a lot is allocated by String.substring: substring is called from trim and directly from our readMap, so a substring is done multiple times, once here and once inside trim. Why not do just one trim instead? We can rewrite this method to do just one copy instead of two: we write our own implementation of trim combined with substring, so it skips whitespace and does one substring instead of two. Hmm, something went wrong; I'm not sure why it took so long, but now we don't see this substring in the profile.
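Extending the sketch above, the two fixes just described look roughly like this:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class MapTestOptimized {
    static Map<String, String> readMap(String fileName) throws Exception {
        Map<String, String> map = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            for (String line; (line = in.readLine()) != null; ) {
                int colon = line.indexOf(':');   // no temporary array from split()
                map.put(trimmedSubstring(line, 0, colon),
                        trimmedSubstring(line, colon + 1, line.length()));
            }
        }
        return map;
    }

    // trim() combined with substring(): skip whitespace on both sides,
    // then copy each key/value exactly once
    static String trimmedSubstring(String s, int from, int to) {
        while (from < to && s.charAt(from) <= ' ') from++;
        while (to > from && s.charAt(to - 1) <= ' ') to--;
        return s.substring(from, to);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readMap(args[0]).size());
    }
}
```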
How does the allocation profiler in async-profiler work? I created this table a couple of years ago. It's not necessarily up to date, but it basically showed that most allocation profilers are just terrible in terms of performance overhead. Once I tried Flight Recorder, I found that the overhead of allocation profiling in Flight Recorder is extremely small, typically not even noticeable. It didn't show the number of allocated objects, though, but instead some strange metric: total TLAB size. So what is a TLAB? A thread-local allocation buffer, used for fast-path allocations. Usually an object is allocated from the TLAB by a simple pointer increment. But if we run out of TLAB space and the object is too large to fit in the TLAB, it can be allocated directly in the heap with an atomic operation; or, if the current TLAB is full, the JVM creates a new thread-local allocation buffer and starts allocating from the new one. So all allocation paths can be split into a fast path, where we allocate from the TLAB by a simple pointer increment, and two slow paths: allocation outside a TLAB, and allocation of a new TLAB. The idea of allocation sampling is to profile only slow-path allocations. This way you don't touch the fast path at all, and that's why the overhead is so small. Of course, we won't be able to measure all allocated objects, only those that go through the slow paths, but since this is sampling, it's fine to profile just a portion of the objects, and if we sample long enough, or if the allocation rate is high enough, the resulting picture will be more or less representative of the actual allocation profile of the application.

So how does async-profiler do it? It exploits the same technique as Flight Recorder: it intercepts slow-path allocations in the JVM runtime and records their stack traces, and this feature is available even on JDK 8, for free. async-profiler was able to profile allocations long before Flight Recorder was open sourced. It's worth mentioning that since JDK 11 there is a special API for allocation profiling. async-profiler does not use it yet; there are plans to. Anyway, the idea is pretty much the same, to profile slow-path allocations, but with this new API the slow path happens not when the TLAB is full, but after a given allocation size: the JVM puts a virtual barrier inside the TLAB, and after this barrier is reached, a new object is sampled.

Any questions about allocation profiling before we get to some advanced profiling techniques?

Let me quickly check. No, there are no specific questions on the topic you just mentioned, so you can probably move forward with your demonstrations.

OK. What else can async-profiler do besides the integration with IDEA and the command line interface? It also has a Java API, so you can control profiling directly from your application. Consider this example, which takes a list of files to process: it reads the files, converts them to another format, and saves the results. It does just three steps: read, convert, write. Say we want to find out how long this takes. We can start profiling directly from the application via the Java API, and in the simplest form it's just a call of one start function, that's it. After profiling is over, we can get the resulting profile as a simple string, or we can dump it into a flame graph, using a stop command with a file name.
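A minimal sketch of this Java API usage, using the one.profiler.AsyncProfiler class that ships with async-profiler; the file name and the three step methods are placeholders:

```java
import one.profiler.AsyncProfiler;   // ships with async-profiler

public class ConvertFiles {
    public static void main(String[] args) throws Exception {
        AsyncProfiler profiler = AsyncProfiler.getInstance();

        profiler.execute("start,event=cpu");

        readInput();      // placeholders for the three steps described above
        convert();
        saveResults();

        // Dump an interactive flame graph (the file name is arbitrary);
        // alternatively, the profile can be fetched as a plain String.
        profiler.execute("stop,file=profile.html");
    }

    static void readInput() { /* ... */ }
    static void convert() { /* ... */ }
    static void saveResults() { /* ... */ }
}
```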
OK, let's try it. I don't need to run it with profiling; just a usual run of the application, because the profiler will be turned on automatically by the Java API. Here is a textual representation of a flat profile that we literally printed, and you should also be able to see a flame graph; indeed, here it is. Now suppose we are not interested in the reading and writing phases, and we want to optimize just the conversion phase, so we'd like to exclude the rest from the profile. Of course, we could exclude frames with the command line options you've already seen, but in this case we don't even want to collect them, not merely exclude them during the dump phase. This is also possible with the async-profiler Java API: the difference is that instead of the start command we use the same call, but named resume. It works exactly like start, except that it doesn't reset the previously collected data. So what we do is resume profiling before the conversion phase and stop profiling right after it, so that reading the input and saving the results don't get into the final picture. Let's see if it makes a difference. By the way, in the execute command you may write any valid profiler command; the list of all commands can be found in the README. And we see the updated profile, which includes just the convert function and nothing else.

I talked about hardware performance counters on purpose: sometimes we need to dig really deep, because not everything can be expressed by plain CPU utilization. In this example there are two threads accessing two arrays: one array is small, and the other is much larger. Both threads do the same thing: they just increment elements of their array, that's it. If I profile this with regular CPU profiling, I see that both threads take the same amount of CPU time, but they behave very differently: they take the same CPU time, yet the duration of the algorithm differs greatly. You see that one thread has already finished its work while the other is still working, and I'm afraid we won't be able to wait until it ends. Yeah, it runs a lot longer, but in both cases the number of iterations is the same. So why? If you profile both threads while they are working, you'll see that they take almost the same amount of CPU time, almost 50 percent each. So why does one thread work so much slower than the other? The reason is that they access different amounts of data in memory: in one case the array is small enough to fit in the CPU cache, while in the other, accessing the large array, the CPU needs to fetch data from external memory. But how do we confirm this? We can set the event type to cache-misses and find out which thread has more cache misses. If our hypothesis is right, the 8-megabyte thread should have many more cache misses. And that's true: we see that more than 80 percent of cache misses are in the large array, and far fewer in the small one.
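A sketch of the two-array test just shown; the small array's size and the random access pattern are assumptions (the talk only says both arrays are incremented and the large one is about 8 MB):

```java
import java.util.Random;

public class CacheTest {
    static void work(long[] array, long iterations) {
        Random random = new Random(0);
        for (long i = 0; i < iterations; i++) {
            array[random.nextInt(array.length)]++;  // random access defeats the prefetcher
        }
    }

    public static void main(String[] args) {
        long[] small = new long[16 * 1024];      // 128 KB: fits in the CPU cache
        long[] large = new long[1024 * 1024];    // 8 MB: served from main memory
        new Thread(() -> work(small, 2_000_000_000L), "SmallArray").start();
        new Thread(() -> work(large, 2_000_000_000L), "LargeArray").start();
    }
}
```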
Any questions so far? I guess we have two more small examples, and then we'll come to the end. Let me show the two final examples, and if people are still here, we'll proceed to the Q&A session.

What else can async-profiler do that traditional profilers can't? Profile native functions. This case also came from real practice. Consider a small servlet that just reads an image from an input stream and returns the image as an array of bytes, and that's it. We call this servlet many, many times in a loop and watch the memory usage of the process. Let me set the heap size to, say, 2 gigabytes. If you watch the process in the top command, you'll see that the memory usage of the Java process constantly increases, and if you wait a bit longer, the usage grows even beyond the two gigabytes that we gave to the heap. So it looks like a memory leak, but this is not a Java leak, because there is no OutOfMemoryError or anything like that; it is a native memory leak, and it's quite a typical problem. I even have a separate presentation about the problems of native memory in Java; you can find it on YouTube. But async-profiler can help us find where the native memory is allocated, by profiling the malloc function. So here, as the event name, I specify malloc, but it can be any native function: this is not a special event type, it's just the name of the native function I want to profile. Profile for five seconds, that will be enough; I can specify the process ID, or the name of the application instead of the process ID, and write the output to a flame graph. This time async-profiler showed us where malloc is called from, but these places don't yet mean there are native leaks, because memory allocated by malloc may be freed later. This gives us a clue and a direction to investigate, but it doesn't yet tell us where the problem is. So instead of profiling malloc, we can profile the mmap system call: with this system call, a process asks the operating system for more memory. If the resident set size of the process keeps increasing, then the process certainly calls mmap somewhere. And now it points us to the problem in our code, which wasn't obvious at all from the beginning: the native leak happens in an Inflater, which comes from our getImage code, because here we open an input stream, and this accesses a jar file under the hood, but we never close the stream, so the native resources associated with the jar file remain alive.
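A sketch of the leak just described and its fix (the resource path is illustrative); the try-with-resources fix is the one confirmed in the Q&A below:

```java
import java.io.IOException;
import java.io.InputStream;

public class ImageServlet {
    // Leaky version: the stream over a jar-backed resource is never closed,
    // so the native Inflater and jar handles behind it stay alive and the
    // native memory of the process keeps growing.
    byte[] getImageLeaky() throws IOException {
        InputStream in = getClass().getResourceAsStream("/images/logo.png");
        return in.readAllBytes();
    }

    // The fix: try-with-resources closes the jar input stream, which
    // releases the underlying native resources as well.
    byte[] getImageFixed() throws IOException {
        try (InputStream in = getClass().getResourceAsStream("/images/logo.png")) {
            return in.readAllBytes();
        }
    }
}
```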
So what else can we profile with async-profiler? We already saw that it can profile CPU, do wall-clock profiling, allocations and locks; various hardware performance counters like cache misses, branch misses and so on; kernel events such as page faults and context switches, just by specifying them as the event name; tracepoints in the Linux kernel (to see the complete list, use perf list); any native function, like malloc, or we can find where the process starts new threads. As a particular case of profiling native functions, we can profile functions inside the JVM itself: for example, to find where deoptimization happens, or where the huge objects are allocated that go to the G1 humongous space. async-profiler can also profile Java methods: you can choose a specific method name, or all methods of a given class, and async-profiler will record a sample whenever this method, or any method of the given class, is called.

Here is a kind of marketing slide comparing async-profiler with perf and JFR, and a few common myths and facts about async-profiler. It doesn't require perf to be installed on the system, because it doesn't use perf: it calls the Linux system call directly. It doesn't require root privileges, but to profile the kernel you need to tweak a kernel option, perf_event_paranoid, and set it to one; in new Linux kernels the default is two or even three, so you won't be able to see kernel stack traces unless you tune this option. And async-profiler can certainly be used inside a container, although by default Docker, for example, restricts the use of the perf_event_open system call. This can be adjusted via the seccomp security policy of the container; or, if nothing helps and configuring the container permissions isn't possible, you can fall back to the special itimer event: it works in a similar way to cpu, but it doesn't show you kernel stack traces and doesn't require privileged access.

So what can you do to contribute to async-profiler? The easiest thing is to use it and provide feedback: post questions, report bugs, request features. There are two primary places to do this: of course, the GitHub page of the project, and there is also a tag on Stack Overflow; if you ask a question there and mark it with the async-profiler tag, I will definitely see it and will answer. Pull requests are also welcome, especially pull requests with documentation updates, because the documentation gets outdated so quickly; if you find something interesting not covered by the documentation, I'll be happy if you send me a pull request with the update. And that's all I wanted to tell you about async-profiler today; if we have time, we can discuss more.

Thank you so much, Andrei, it was amazing. I got to learn so much about so many dimensions of profiling. The parts I really liked were working with cache misses and profiling sections of my code; I wasn't aware of that, and it kind of blew my mind that there are so many things we can do. Of course, you covered a lot in this presentation, and I'm sure the attendees will need multiple sessions to catch up with everything. A couple of questions I want to ask you. First: a lot of attendees were asking whether you have already published the presentation and the code samples you used today; they would like access to them. If they're already hosted, you can share the link now, or we can post it later.

Yes, the presentation as well as the code examples are already on GitHub. They are probably not all up to date, but we'll post the link later as well, on Twitter or wherever you like.

Sure. There was another question about which JVMs async-profiler works with, because there was a question regarding a specific JVM: AdoptOpenJDK with OpenJ9, in Docker, running Spring Boot. I think Kirill mentioned that OpenJ9 is not supported. Do you have a list of JVM versions and vendors that async-profiler works with?

Yes, it's basically covered in the README: async-profiler works with the HotSpot JVM and with JVMs based on HotSpot, and also, for example, with Zing, which is not HotSpot but has a compatible AsyncGetCallTrace API.

Perfect. One of the users also posted the link to the presentation which Andrei mentioned, so it's already there; everyone who is listening can check the README for more details on which JVM flavors async-profiler supports. There was one other question, about your native memory leak example: will try-with-resources help to deallocate the memory?

Can you repeat, please?

The question was: for the sample that you showed, will try-with-resources help to deallocate memory?

Ah, yes, absolutely, and that would be the right fix: just wrap this input stream in try-with-resources, because once we close the jar input stream, it closes the underlying native resource as well.

Perfect. And I'm not sure whether the attendees have tried out the interactive flame graphs
that you showed; it was amazing to see the different colors in the flame graphs, because then you get to know which are the native frames and which are the Java ones, and it's absolutely amazing to just click on them and go to the source code, so that's helpful as well.

Another question: can we mark one Java method to profile for async-profiler, so that any thread which enters it gets profiled? It could come in handy in web applications.

If I understand the question correctly, the idea is to start profiling automatically once the application reaches a certain method. This is currently implemented for native functions: the recent async-profiler version 2 has an option to start profiling automatically when a given native function is executed. This is helpful, for example, when you want to profile safepoint pauses, because there is a known point in the JVM code where it begins a safepoint request, and another known point where the safepoint is reached and the VM operation starts, and these two functions can be used to profile what happens between them. A similar mechanism is not yet implemented for Java methods, but yes, the idea is to allow such begin/end trigger functions for Java methods as well.

OK, thank you so much for that one. The next question: can async-profiler also profile used memory, heap and non-heap?

Well, it profiles allocations, which occur in the heap, and it can profile the system calls that allocate native memory, like I showed in the examples with malloc and mmap. Basically, whatever goes off-heap is usually allocated with either malloc or mmap, so yes, it is possible, just like I did with the getImage example.

OK, I'll take the last question: what do you see as the future of profiling? Is there a third style of profiling on the horizon? Could it be possible to partially instrument parts of the code, for example via regexes?

Partial instrumentation is somewhat in contradiction with the idea of sampling profiling, which is time-based or event-based; profiling specific pieces of code is usually covered by instrumentation techniques. But with async-profiler we can use filters to simply reject everything we are not interested in.

Even though I said that was the last question, another one just came in, so we'll take one more and then call it a day: did you try async-profiler with Java virtual threads? I think the question is about Project Loom and the lightweight virtual threads we have now.

It's an interesting point. The quick answer is no, it doesn't work yet, but it's already on my list of future plans.

So, no more questions now, even though we ran long. Thanks a lot to Kirill, who was amazing at answering so many questions in the chat, and thank you so much, Andrei, for the amazing session, and thanks everyone for attending and for asking questions. Don't forget to click the like button, because this was an amazing session, and please stay tuned for our next live stream, which is on Spring, on April 21st. Before we say bye, Andrei, would you like to say anything to the audience?

Well, thank you very much for coming to the session and for sitting a bit longer than was planned, but I hope your patience was rewarded. Feel free to
contact me, and as I said before, come to the project page on GitHub, use it, and post your feedback. Thanks for inviting me; it was a pleasure to speak here today.

Thank you so much, Andrei, the pleasure was all ours. I'm sure everyone attending the session loved all the features you talked about in async-profiler, and yes, sitting a bit longer would have helped them. Thank you so much! Thank you, bye.
Info
Channel: IntelliJ IDEA by JetBrains
Views: 15,873
Keywords: webinar, intellij idea, intellij, jvm, jvm profiling, Andrei Pangin, performance bottlenecks, open source
Id: TDpbt4thECc
Length: 111min 52sec (6712 seconds)
Published: Wed Mar 17 2021