Thread Dump Analysis for IBM JVM - Part1

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hi everyone my name is Pierre Charbonneau I'm a senior IT consultant and the author on Joey support patterns today's tutorial will provide you a basically trade em analysis tutorial for the IBM GBM what you will see is that the IBM JVM shred them format or Java core it's quite different than the hotspot I'm going to show you where a VI case study how you can do at random analysis so I hope you will enjoy this tutorial for today so what you see here is basically a generated tread dump from production environment following a start read problem on an oracle service bus so as you can see it's the raw data typically you'll get a Java core when you hit or you generate with the kill minus 3 option that's a typical way to generate a tread um if your if your environment running on a UNIX base OS right so this is the fine you will get and the as you can see the IBM GMO generator it's called Java core file that's why I generated and contains a lot of crucial details so that's what we're going to go to today I'm going to show you how you can do the analysis it's very crucial analysis as you will see I'm not using any fancy tool to do that there are few tools are there like TDE and a few others but it I really recommend that you learn how to basically analyze the raw data and after that you can explore some other tools but bottom line start to understand the raw data and at some point with experience you'll be able to figure the issue fairly a quite quickly so the Java cores you can see it's kind of it it's more than at read them write it it's basically kind of a specialized core of the java vm process it contains a lot of information so as i mentioned in this problem we're seeing a lot of hugging treads and structured from WebLogic so we went under production generated kill - tree which is why you see dump event user here that file was generated then if we go to you'll see it provides some of the detail right like the US version some of the processor detail you will see in the Dom then you will see he provides also some detail on your runtime Java environment as you can see we're obviously we're using the IBM JVM 1.6 for this production environment and some of the arguments I did remove some for security purposes but just to let you know that basically for this environment we're using 64-bit IBM JVM with 4 gig Java heaps paid as you can see verbal GC was enabled so we can track down the garbage collection helped and the environment is using the GC policy GenCon which is a generational concurrent JC policy and the exam policy means that you basically can generate some heap dump of the process now this is just basic information in the header of the Java Java core you will see later where the true tread dump is so again when you get that that file you don't really care about this information for now it's more reference and information you have also some of the user limit detail on the process level this is sometime useful from let's say a cycle perch perspective and then you also get detail about the actual jg VM and the heap right so you know for example how much free heap basically the one other time you took the snap that you will see on our free heap you had versus the total capacity then you also have access to all the different native segments as as well again this is the information is available it always depends your problem context in the context of a struct read problem you won't really need that level of details but it's always useful to understand what was the availability of the heap space at that time so let me jump to the locks portion right so you will see the tread dump there is a log portion and just above that if I scroll back up a little bit you will see get also a GC history so you see the Java core of the IBM is pretty good right it's content at GC history as well detail on the heat so it's very important when you try to correlate start read problem with excessive Jaycee so this is always a good idea to look at this data unless you know exactly what's going on with your stock threads there right so in our case as I said we saw some stock threads so don't spend too much time on this just scroll down quickly the next step is the lock subroutine the lock subroutine that in Java core that one is a really crucial section of the tread down right so it contains all the different dependencies of the tread so for example if you're dealing with a tread lock contention right or tread lock chain then that will show up in the lock sub component up routine as we can see here you can see right so the way it works is that you will see chain of treads for example you will see typically like this right you will see one trend like in this example you can see tread 16 which did acquire a flat lot right on this object and all these guys are waiting right you can see these are all stock tread from WebLogic all these are waiting to acquire the flat lock on this object which is held by tread 16 so this is just an example right so which is why and this is a big problem for us in our case right so right away by looking at a locked routine we know we have a big problem here we have a lot a lot a lot of treads about 200 threads here waiting from a single tread to complete its processing so it can release a flat lock and for your reference a flat lock means synchronize block in in the code right so for example if you write Java code and you put synchronized statement that will basically translate the flat lock when the tread goes in the synchronized it will basically hold the flat lock on the object monitor and then if for example if your computing test is taking too much time then this will generate a waiting queue like this okay so let's keep this one in mind for now and let's go to the tread done so when you start looking at read them it's always best to start at this portion right so don't spend too much time going through the lock a subroutine at the above the locks subroutine that we just saw will be serve you as a reference so when you get at read them always start with this right so search for the word treads right at the beginning right just need to go back do a quick search don't spend too much time you see in just search after the GGC and and quickly move to the tread subcomponent this is your actual tread dump information and then depending what you will find at read them what you will learn then you can go back in the reference which is the lock subroutine and the heap history as a reference or supporting data right so always start to analyze this portion right here so back to a problem so basically it's it's giving us all the detail of each active tread and the jvm at a given time again a common mistake I've seen when people I start to learn how to analyze certain time is to go on to each read line by line and try to figure out well obviously it's it's a common reflux if you if you're quite new to this with my recommendation to you is no don't do that start with a fast scan of all the tread and try to pinpoint a pattern or two or three patterns you see this this is how you have to analyze this stuff is to try to pinpoint patterns read it quickly and for that you need to be able to differentiate a healthy tread versus a non healthy trend right healthy tread will means tread waiting to do something and a non Lt trend means it's a try doing an actual computing task but it's hanging or waiting to leader I you call or it's winning to a quarter flat lock like in this case like you we see so if we have a look at this right tread detail so the anonymous native tread this we don't care too much here here the next thread in the chain you see it's time event generator kind of an internal tread you see it's in weight state nothing to what we're about and WebLogic perspective typically you will see especially for you know heal your version or later version like version mine plus you will always see like this right so the tread name is created with WebLogic default self tuning so these are the trigger that you have to worry about the other internal thread unless you see some CPU issues you don't have to worry too much so here level for example you see execute read eight from the cue default sub tuning is it's the main cue WebLogic right so all tread will be basically defaulting to that cue by default so though it works you will get this right you will get the state of the tread under in the WebLogic term because web logic has its own status of the tread so you always see that here so status of the tread obviously it's tucked so we know this try this stock stock mean by default it's taking more than two minutes to complete it it's work then you will get the ID of the trend execute read eight then you get the name of the queue as I said that's typically what you'll see for web logic then if you start scrolling on the right side then you get some native information about the tread including the native state which is B which we blocked okay but don't worry too much about this State I've seen scenarios where that the state is healthy but the actual tread is stuck as well so it's not so don't worry too much about the negative state worry more about the actual computing tatts is doing here right the more important thing analysis is the stack trace the Java call stack okay so the IBM VM will give you two thing that Java call stack and the native call stack native means the actual portion of code that goes beyond the Java call at JVM native code execution again don't worry too much about the natives don't start to spend too much time analyzing that your curve the work should be on the Java call stack this is where 95% plus of the issue will will arise and this is where you have to focus your your time on so as I said start quickly go to that look at this thread oh so you know you will destroy a stock now you start looking what this trade is doing the core of your work is analyzed stacktrace if you did have a look at my tutorial on my blog you will see that when we talked about the stock trades you it's very important you analyze from bottom up write a Java stack trace is like an exception that you would see in the log it's it the execution stack right of the tread either if it's triggering exception in this case is the execution snapshot what that rate is doing at the time so all the way start from bottom up so in this case what we can see is well an execute thread was fired and WebLogic then we can see it's executing its work and the cool stuff about the stack journalist getting will be able to understand the type of requests you see then we can see the request its execute GMS session so we already know right here that this request it's not an HTTP call it's right at that JMS it's basically trigger from a JMS call and then we can see there's an MDB fired right and then we can see there some internal code executed here right and then and then there is some again this is the OSB product right so it's a big abstraction layer it will be more friendly with your application code you will you will be able to see your application code before the USB product you have to go to all these abstraction layers right and then when it's telling us is that it's trying to execute the kind of a kind of a callback from one of the call and turn or the OSB right and then and then basically what it what it tries to do after it executes some other pieces of code trying to basically begin a transaction and that and then that shred is stuck at this line right so question is where is it stuck right well what you have to understand is that when you for IBM VM when you see something like this the the hotspot return will actually show you for example waiting to lock right in this IBM VM what I don't like is that you don't see that here right so when you see a line like this you have to think that that tread is probably waiting to acquire lock and in order to prove that especially if you know that the native state is blocked right then what you do just to confirm right you just search the execute track right you can go back in the lock sub routine right oh there it is here's our guys see that so which is why the lock routine is very good as a reference showing you basically the all the synchronize so we know that this trend based on the stack right that we saw at that line it's waiting to a Korell lock on this guy so that's good so we already know that trap we know what it's doing we know it's waiting tack wear a flat lock so we can you know we can go back nap the the tread section and move on where analysis right so now that you know at some point we'll be able to see if other treads are waiting right then then you continue allies as you move to the next row this one is called dynamic SSL this one we don't care it's more for the SSL inbound requests then you continue there's another trail right so again as I said you have to move quickly or you don't want to spend like two days on this right so you have to move and then try to figure out a pattern very quickly now you move on you see then this thread what it does again this red is stock right state is blocked again now Java called back as I said by lemma right oh this one you see it's an HTTP call again it's a service bus so typically it's a web service driven so it's executing some web service call right here and then again in the same code we can see be a double I derive cache this is kind of a object in ways be used for the service service definition that's a proxy and business service definition so that code is heavily solicited and we can we can what this code is showing us that there is some synchronized piece of code in there right and we can see this thread is also stuck and very likely also weighing that core a lock on an object monitor again you can do the same analysis right you search you go back up and they're going you will see again from the lock subroutine 83 it's part of this right at that same chain so bottom line in this case you can see we have so many treads that very quickly after a few minutes you will realize well I have I definitely have a pattern so you can go back you Serge this line down go back to the trend right so at this point it's quite easy right you search up and down using the trail ID and you were able to quickly navigate to the edge to tread them just looking at the raw data then at this point one you see - a tree - in the same pattern then to move more quickly then you know this one is in the same condition no point to spend too much time you see these are what I call healthy tread you see some of the tread they are not doing anything then you can move quickly right this one again up - interesting and then you can see again execute read row seven same piece of code you see transaction manager began also this one this is where get interesting around because this shred is an active State but it's in block state so you have to be careful not to get confused so because web logic has its own state access means web logic has no idea at this point if this thread is healthy or not so you have to do your own due diligence because this the the state will change to stock only when you reach your stock time configuring web logic which by default is around two or three minutes but it's still possible that this thread is waiting to occur the same luck even doing it did not reach that two or three minutes buffer at this point so you think it's healthy but this thread is not at all it's waiting in the same condition right and we can see it's blocked Nate state lock means it's waiting - typically a I quote some some lock so if we search back down just to validate our assumption you go back there go you see three oh seven you go back up and this guy's part our big chain again so at some point you can see there's no point spending too much time you see all these guys we can see they're all waiting to acquire a lock now now that we know that all these shredder waiting to acquire a lock the next step is what well you have to analyze which thread is holding this lock right as as the next step so we go back up to the chain so this time the next part of your analysis will be to look at the culprit so we know that all these guys are waiting tech wear flat lock or synchronize right on an object monitor which is transaction manager it's internal to the product so it's not the other source you have in your end obviously but then we know which thread is the culprit right so then we know it's red 16 so what's the next step well we need to understand what the trade 16 is doing and here's stress 16 so you Serge back down look at trade 16 what is this trade 16 doing well you see this this one is in basically pork state this is where it gets a bit complex now and I'll explain this whole issue after but then the IDS to add undefined the culprit because in this case we have a starch red jutsu one culprit red creating at red lock contention chain right basically multiple multiple trade winning from only one thread complete its work you have to see what this work is doing so let's analyze again bottom-up so we know that trades and run run state run oh look at this WebLogic deploy service deployment callback so what it's telling us is that this shred was fired by the WebLogic deployment service it's a deployment receiver callbacks waiting means that there was some action taken in the console in the WebLogic or service box console so for example in OS b when you take an action in terms of us consulates you want to make some change the login from a prongs your business service this action will trigger a tread from the managed server basically trying to commit that change to the all the member of the cluster of that OSB right that's exactly what's going to do here you see it's receiving it is trying to commit the change let's say you want to increase logging level for one business service so it's trying to commit the change we can see it here you see so far so good right now it's trying to end the transaction just the deplore commit right it's kind of a transaction on this process it's trying to commit but then you see it's trying to run through this code and its trying basically to hear this is why it's getting a bit tricky because it's trying to acquire a right lock it I'm not if people are in familiar with the re-entering readwrite lock the subject will be on the scope for it for today but it's basically another way for you to implement Java concurrency so instead of just implementing flatlock yes synchronize you can also implement your your tread safe code or non trade safe code I mean by the using a reentrant read or write lock so it kind of allows you for example the you can example acquire a read lock and multiple tread could be acquiring between each other some read lock but obviously if there is a trail attempting to acquire a write clock then there are some rules that apply obviously that the read lock has to be released and obviously a same apply for write lock so this is different concept but that's the way always be code implemented that portion of the code which is around the management of the service definition for your proxy and business service so there is a it's heavy duty code which is used it's kind of using a mix of synchronized also reentrant lock so what this really showing us is here is that this trend is attending tech wear right lock and as we can see acquired pork and check for interrupt that typically means that this trend right is unable to acquire the right lock on the object and typically this is this that that will happen if another thread is holding either a read or write lock so this is where it's getting very interesting right because we have a culprit tread which is holding up a flat lock but that same culprit red is waiting check where right lock you see that again the IBM JVM will not show you unfortunately the native lock routine for the reenter because this code is part of the GD key so it's not basically a detected by the lock subroutine so you have to try to understand where you may find that so at this point we'll identify the culprit we know it's a deployment call back and I can give you I know in this problem case I know we add DMS order we took an action in the OSB console and by taking this action during heavy load what happened is that the commit on one our managed server did not work and did end up in this hang scenario and ultimately triggering a domino effect as you can see all the treads waiting from this guy right and this guy unfortunately stuck waiting to a write lock and I'll show you the potential culprit soon but at least does give you an idea I can analyze the data now what we did was basically what we did was to run some search through the derive cab because this is kind of one of the culprit is the derive cache and what we what we found is that the in the derive - there is definitely some piece of code basically that also did attempt to acquire some read lock right and at some point you end up this is why you see in the tread 616 right because there is another tread which did acquired also a read lock on on that same man monitor this will be presented as a part 2 of this analysis so this part 1 was just about at least to give you an ID for the tread dump and I can identify that chain so as you can see in summary always start with the treads section as I mentioned go quickly to the treads identify the path the pattern then go back to the lock subroutine right as we just saw and identify the culprit right when they dare it there when there is a culprit like in our case and after that analyze the actual culprit itself and understand what its work is doing and typically you will be able to pinpoint okay yeah this thread could be trying to acquire a connection from a database and so on in this scenario in this trade them analysis pattern the pattern is tread lock contention or tread lock chain due to one culprit and the part two we'll be showing you the actual deadlock condition what you will see in the part two is that this this track is in deadlock with another tread because of a re-entrant read/write lock so there is a lock ordering problem here that will be presented as a part two so for now keep that in mind and I hope you have appreciated this video I will be providing you a lot more videos going forward with several patterns obviously this is just one of the pattern which is a tread lock chain there are many many many pattern that you have to understand when you lie straight down so this is just the first video and I hope you enjoy it please feel free to ask me any question or post any comments on this view thank you and have a good day
Info
Channel: Pierre-Hugues Charbonneau
Views: 35,720
Rating: undefined out of 5
Keywords: JVM, Thread Dump, Java (programming Language), Part1, Hosting, Software, Technology, Computer, Tutorial, IBM, AIX, Web Application Monitoring, Network Monitoring, Java Monitoring
Id: YQgmF8I-zhk
Channel Id: undefined
Length: 26min 0sec (1560 seconds)
Published: Tue Jan 22 2013
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.