Learn about JVM internals - what does the JVM do?

Captions
My PhD thesis was some of the foundation for a technology which became involved in a startup company called Transitive Technologies, and one of its most successful products was Rosetta; it's good to see lots of MacBooks around here, because they'll have Rosetta on demand and so on. I did a lot of work on a meta-circular, semi-famous virtual machine. I have a chapter in an O'Reilly book called Beautiful Architecture, which has been translated into Russian, Japanese and Chinese; I only have the Chinese version in the US, for obscure reasons, but it's nice to have been translated. I'm a software engineer at Azul Systems, and I do more than just coding: I also engage my brain, which is something we're encouraged to do at Azul. We're not just code monkeys.

My old research group boss was Steve Furber, and I'll probably keep mentioning how great Manchester is; it's something that I do. Manchester in the UK, that is; I hear there are other ones. Steve Furber is a Mancunian, and he was one of the co-developers of the ARM processor. What people used to say at the University of Manchester was that Steve is a great professor, but whenever he teaches a course he starts with transistors and builds up. So I'm going to do something a bit similar tonight: start fairly low-level with the JVM and build upwards. If you have problems with it, raise your hand, and I'll hopefully be able to answer your questions and concerns. It's a fairly new slide deck, so I expect people to have problems and issues with it, and I've got a whiteboard and whiteboard markers, so that's good. What I'm going to do is introduce some of the underlying technologies of the JVM and how it achieves portability, then go into some of the more challenging and novel aspects of the JVM and how performance is achieved in those areas, and then talk about what we're doing at Azul Systems with the Zing virtual machine, which is the next-generation Azul solution.

OK, so a JVM described in a simple sentence; reading from the slide: a software module that provides the same execution environment to all Java applications, and takes care of translation to the underlying layers with regard to the execution of instructions and resource management. That's a big sentence, so let's look at the picture. We've got some code; we can take the same piece of code and run it on two different hardware architectures and two different operating systems, and the JVM is the thing doing the mapping between these things. And the JVM is doing a whole lot of stuff: it's mapping the Java bytecodes down to the hardware architecture, and it's also dealing with things like giving you a consistent threading model. I haven't done any slides on threading models, but if people are interested, I can talk about them.

OK, so let's start at the beginning, with the interpreter. We have Java bytecodes wrapped up in a Java class file: when you write your Java program, your compiler turns it into a .class file. I'm assuming that's familiar to people in this audience, because of who you are. We need to bridge the gap between that and the machine, and the simplest way to do it is with an interpreter. Although I work on meta-circular JVMs, and such VMs generally don't have interpreters,
in production VMs you still need an interpreter, because you still end up with large methods that run infrequently, and it's not really worth spending the energy compiling them; a good example of that is servlets.

OK, so let's look at a silly little interpreter that I've written. I have an infinite loop; we're running forever, fetching bytecodes from a bytecode stream, which is a class file we've loaded into memory. I read a byte out of the bytecode stream and advance the program counter at the same time, and then I do a switch statement on the bytecode: I decode what the bytecode's meaning is, dispatch it, and work out what operation I want to perform, and then I perform that operation. Here are three example bytecodes; I've chosen easy ones, because that makes it easy for me. iconst_1 just puts the value 1 onto what's known as the expression stack inside the interpreter; iload_0 takes a local variable and puts it onto the stack; and iadd pops two things from the stack, adds them together, and pushes the result onto the stack. This is pretty much how the JVM is defined: if you look in the virtual machine specification, you'll see, for each bytecode, what the expression stack looks like before, what it looks like afterwards, and the transformation from one to the other.

OK, so this isn't very efficient: we have a lot of branching going on, we've got to decode each operation, and we've got to manipulate the local variables and the stack in memory. We can do better inside interpreters. You can do things like threaded interpretation: here we have a single switch statement which branches to one of these bytecode handlers, then a break which goes down here, and conceptually we go back up to the top. With a threaded interpreter, you instead branch directly to the next bytecode's handler, right where the break statement is, and GCC will actually do this for you as a compiler optimization, so a lot of these things can be done for you. And we're always accessing memory to manipulate the stack or the local variables, so if we want an efficient interpreter, and we've got access to the low levels of the machine, we can start putting these values into machine registers, which means we can avoid accessing memory. One of the more advanced techniques is to pre-decode the bytecodes: the handlers are just at arbitrary addresses, so you can create a list of those addresses, but people don't tend to do that, because you might as well compile when you're getting into that level of complexity. And then there are other tricks, like optimizing the dispatch so that it's just an arithmetic operation, which is one of the tricks in the Dalvik VM.
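To make the shape of this concrete, here is a minimal sketch in Java of the kind of switch-dispatch loop being described. The opcode values match the JVM specification, but everything else (the names, the fixed-size stack, the tiny bytecode subset) is illustrative only, not code from any real VM:

    final class ToyInterpreter {
        // Opcode values from the JVM specification.
        static final int ICONST_1 = 0x04, ILOAD_0 = 0x1a, IADD = 0x60, IRETURN = 0xac;

        static int execute(byte[] code, int[] locals) {
            int[] stack = new int[16];   // the expression stack
            int sp = 0;                  // stack pointer
            int pc = 0;                  // program counter
            while (true) {
                int opcode = code[pc++] & 0xff;   // fetch, and advance the pc
                switch (opcode) {                 // decode and dispatch
                    case ICONST_1:
                        stack[sp++] = 1;          // push the constant 1
                        break;
                    case ILOAD_0:
                        stack[sp++] = locals[0];  // push local variable 0
                        break;
                    case IADD: {
                        int b = stack[--sp], a = stack[--sp];
                        stack[sp++] = a + b;      // pop two values, push the sum
                        break;
                    }
                    case IRETURN:
                        return stack[--sp];       // pop and return the result
                    default:
                        throw new IllegalStateException("bad opcode " + opcode);
                }
            }
        }
    }

For example, executing the sequence iload_0, iconst_1, iadd, ireturn with locals[0] = 41 returns 42.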
OK, so interpreters: that's it, done. Any questions? No? Good.

Interpreters are intentionally simple, but their performance isn't that great, so we want to do just-in-time compilation. Just-in-time is one of these terms which I believe the Japanese actually took from America, from a grocery chain; there's a whole origin story to just-in-time that I'm not familiar with. But just-in-time inside a virtual machine means: we have some code, we're executing it frequently, so let's compile it so that it runs faster. We don't compile everything; we just focus on the hot parts of the code, so you get names like HotSpot, indicative of the fact that you're concentrating on compiling the hot regions of code.

So I'm going to talk about every single bit of a compiler. Isn't that scary? It shouldn't be, and hopefully I'll give a good enough explanation that you can follow how a VM compiler works; stop me when you're failing to understand, it really shouldn't be that hard. The first thing we want to do is turn the bytecodes into some kind of graph representation. The reason we're going to do this is that we want to eliminate the overhead of having that expression stack, and I'm going to show you what the graph representation looks like; I'm actually going to use examples taken from HotSpot, so it's scary, but not really. The next thing: once we've got this graph of instructions connected together, we want to linearize the graph, because instructions inside a computer live at addresses and are sequential in memory, so you'd better linearize that graph somehow, and I'll talk about how that's done. At that stage you have an infinite pool of virtual registers, and the next thing that needs to happen is mapping that infinite pool of virtual registers onto some finite resource within the processor, some finite number of registers. Mapping infinite into finite doesn't always work, and when it doesn't work, you use the stack in memory. Then, once you've got instruction nodes with registers allocated to them, you can actually do some code generation.

So let's start off with a very simple Java program: Fibonacci, a simple recursive routine for computing the Fibonacci sequence. It's written recursively, and there are no clever optimizations in it; I wanted a simple example that fits onto two slides, but it's a classic example at the same time.
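The slides themselves aren't reproduced in these captions, so here is a plausible reconstruction of the method and its bytecode (in the style of javap -c output); the bytecode offsets match the walkthrough that follows, but the exact code on the slide is an assumption:

    static int fib(int n) {
        if (n <= 1) return 1;
        return fib(n - 1) + fib(n - 2);
    }

    //  0: iload_0            // push the int argument (local variable 0)
    //  1: iconst_1           // push the constant 1
    //  2: if_icmpgt 7        // note: the <= has been flipped to >
    //  5: iconst_1           // fall-through case, n <= 1 ...
    //  6: ireturn            // ... return 1
    //  7: iload_0
    //  8: iconst_1
    //  9: isub               // n - 1
    // 10: invokestatic fib   // recursive call; result stays on the stack
    // 13: iload_0
    // 14: iconst_2
    // 15: isub               // n - 2
    // 16: invokestatic fib   // second recursive call
    // 19: iadd               // add the two results
    // 20: ireturn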
So let's go from the Java to the bytecode; are people comfortable with bytecode, to some degree? What we can see in the bytecode is that we access local variables and push constants onto the stack, as I mentioned earlier, and I can work through this on the whiteboard if that's useful. What we can see from the definition of the method is that it takes an integer as a parameter, and in the JVM the first argument is in the first local variable, so this iload_0 is taking the int argument and putting it onto the stack. Then we push the value 1 onto the stack, and then we do a greater-than comparison, and if it is greater than, we go to location 7. If you flip back to the original code, the less-than-or-equal sign has been flipped around to be greater-than. The bytecodes immediately following this compare-and-branch put a 1 on the expression stack and return, which handles the case where the argument is less than or equal to 1; that's what's going on with that flow. Good, I'm not scaring people yet. Then we get down to bytecode 7: we load the integer argument again, load the value 1, subtract, and recursively call the Fibonacci method; this directly corresponds to the call in the source where we do n - 1 and recursively call Fibonacci. The result of that gets left on the expression stack, and then we load the incoming argument again, load the value 2, subtract, and do another recursive call. Finally we add the two values together, and then we return. So hopefully people can see that there's been a syntax-directed translation of the Java code into the bytecode; there's not really been any smarts or cleverness going on, it's a fairly literal translation from one to the other. Are people happy? Are you wondering where I'm going with this? OK, we're going further down, of course.

Going back a few stages, I said that the next thing that happens is that we build up a graph of nodes; this is in the client compiler, C1, inside HotSpot. What happens is you have basic blocks: when you have block-structured code like this, you can imagine that this 'return 1' is one block and the other return is another block of code, and so on. (The term 'basic block' can mean slightly different things at the high level and the low level within the compiler.) Here are the command-line arguments to get this printout for the very method I was showing you a few seconds ago. You can't get this dump out of a regular JVM, but if you use java_g, the debug version of the JVM, you can see what's going on inside the VM, you can see the VM internals yourself, and you can go 'what horrible code have you generated for my method?', if you really care.

The first basic block we have is just the entry basic block. Here we have two graph nodes: a graph node which is the constant 1, and another graph node which is a comparison between two other graph nodes, plus a branch to one of the two basic blocks based on the comparison. We can see here that it says i3; so where is i3? i3 isn't defined here; i3 is actually the incoming argument to the method, and i4 is the constant 1. We then have basic block number 1, which has got this i6, which again is the constant 1, and i7, which is just an ireturn graph node taking i6 as its input. So you've got this graph of blocks, each block has a list of instructions, and instructions are defined in terms of the things coming into them; that's the internal format.

I've chosen an example which doesn't really show one of the more complicated operations. This compiler intermediate form is known as static single assignment, SSA, and there's a subtlety to it. What single assignment gives you is the property that things are only written to once: here we define a value i6, here we define a value i7, and we've only defined each of these values once. What this gives you as a property inside a compiler is that you only have true dependences; if you're a hardware guy, that's a read-after-write dependence, where something's written and then you read it, and that's the only kind of dependence you're allowed. You're not allowed write-after-write, because that would mean writing to it twice, which wouldn't be single assignment, and you're not allowed anti-dependences, which is a write after a read. So how do you deal with loops and things like that, where you don't naturally get that property? You have a special instruction node called a phi. A phi instruction takes as inputs other instructions, but the choice of which input gets used is based on where in the graph control flow is coming from. It sounds complicated, and it is; there are PLDI papers on this. Constructing SSA is not too bad; deconstructing it is the trickier situation.
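As an illustration (my example, not one from the slides), here is a small loop and a comment sketch of its SSA form, with phi nodes at the loop header choosing a value according to which edge control flow arrived on:

    static int sumTo(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum = sum + i;    // 'sum' and 'i' are rewritten on every iteration
        }
        return sum;
    }

    // SSA sketch: each name is assigned exactly once, so the loop header
    // needs phi nodes to merge the entry value with the back-edge value:
    //   entry: sum0 = 0
    //          i0   = 0
    //   loop:  sum1 = phi(sum0 from entry, sum2 from back edge)
    //          i1   = phi(i0   from entry, i2   from back edge)
    //          if (i1 >= n) goto exit
    //          sum2 = sum1 + i1
    //          i2   = i1 + 1
    //          goto loop
    //   exit:  return sum1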
Anyway, we've got this graph, and we can see here the more complicated bit from the example: we take the input argument, subtract 1 from it, and do the recursive call; we take the input argument again, subtract 2 from it, and do the recursive call; we add the two values together; and then we return back to the caller. So hopefully you can see how we've gone from the Java source code down to this graph of instruction nodes, step by step.

OK, so we're going to go even further down; here's the transition, going down towards the transistors from my source example. The JVM option for this is PrintIRWithLIR. IR means intermediate representation, and inside the client compiler the intermediate representation is the graph; the LIR is the low-level intermediate representation, the representation once things have been made into a linear list of instructions. We can see that each graph node has, immediately after it, the LIR instructions associated with it. What's happening here reads left to right: we're moving the value in RDI into a virtual register from my infinite pool of virtual registers. It's fantastic: I've got this infinite pool of virtual registers, I don't need to worry about memory. So we move RDI into R36, and RDI is the incoming argument register in the calling convention of the VM; this is the Zing virtual machine I'm taking this example from, so you may see different things if you're using a different VM. There's some stuff here to do with safepoints; I'm going to skip over that, it's more to do with garbage collection, and I've got material at the end about what goes on with garbage collection. And here we have a branch which says: always branch to block number 0. Block number 0 is here, and it does the comparison between R36, which was defined up here, and the value 1. The value 1 has gone from being its own individual graph node to being folded into this instruction, which is one of the things that happens during LIR construction in the client compiler. So we compare virtual register 36 with the constant 1, and we branch, if it's greater, to block number 2, which is over there; otherwise we fall through. If we fall through, we're in the case where we return 1: we put the value 1 into RAX, which is commonly the result register on Intel processors, and then we just do the return. The return implicitly uses the RAX register; it doesn't actually name it, but that's the Intel architecture, don't worry about these things.

Over here we have our graph node for doing the subtraction, i3 - i4. That turns into a move of R36 into R37: over here we had copied the input argument into R36, and here we copy it again, into R37.
We create redundant copies when we're doing this kind of construction, and we don't worry about having lots of copies, because the register allocator's job is actually to remove copies; a good register allocator will eliminate all of the copies. So we move the value into R37, and then we subtract the value 1 from R37 and put the result into R37. You might observe here that we could have just subtracted 1 from R36 and put the result straight into R37; but Intel instructions only have two operands, and so something that goes on inside the client compiler, to ease the transition from the compiler intermediate form to the Intel instructions, is simply to make two of the operands the same. If you're doing this on a RISC architecture, which tends to have three-address instructions, this naturally doesn't happen, so there are some differences in the compiler intermediate form depending on which architecture you're targeting. So we've done the subtract of 1, then we do the recursive call; here we do the subtract of 2, and for that we've made another copy of the incoming argument, into R39, then the recursive call, and then finally we've got our add and our return down here.

So we've gone down a layer: we've gone from having a graph to a linearized list of instructions. But we can't yet run these instructions, because they're using registers from an infinite pool, and processors don't have infinite pools of registers, so the next thing we must do is register allocation. At the bottom again is the java_g option which will show you what's going on inside the compiler when this next step happens. On Intel processors the register naming convention is somewhat wacky, so try not to think too much about it, but hopefully what you can see is that I've got rid of the graph nodes from the IR: here we just have the LIR instructions in a list, starting at zero and going through to 252, and now, instead of mentioning things like R36, they mention specific registers. So we move RDI into RBX, and here we compare RBX with the value 1; these directly correspond to the things we had before, except that instead of an infinite pool of virtual registers we now have specific ones. The other thing that's happened is that we've lost moves. The moves have been eliminated by the register allocator, because the register allocator was always trying to make the source of a move and the destination of a move the same register, and then the move is doing no work at all, so it just eliminates the instruction.

So finally we get the Intel machine code for this. There's some boilerplate in here: some boilerplate for saving registers, and some for restoring the registers which were saved. But you can see that I've got a compare of EBX with the value 1, where before we had a compare of RBX with the value 1. The E and the R look somewhat arbitrary: R means the 64-bit version of the register, and since this is an integer it's really the 32-bit version of the register that matters, but for simplicity inside the compiler we just use the 64-bit name. Down here we have the calls, which are the recursive calls; we have a decrement, which is the subtract of 1; we have the subtract of 2;
we have again the recursive call; we have at the end the add which adds the two values together, making sure that EAX holds the result at the end; and then we restore the registers which need to be preserved and do the return. So that wasn't painful, was it? It's a complicated thing, though.

Now, safepoints. The VM allocates memory all the time. Java has the brilliant property of memory safety: you never free memory yourself. Memory safety is a good thing; it gets rid of whole classes of bugs, and it means people can develop Java code very quickly, which is a whole advantage of using Java. The problem with allocating memory all the time is that every once in a while you want to stop the application from changing memory, and a safepoint is a way for the VM to request that all of the application threads stop running, to let the garbage collector threads run. You can see here that what's happening is there's a comparison which asks 'has some flag been set which says I need to stop running?', and if so, the thread goes to the place where it stops running and lets the garbage collector run. I'll mention something later about stop-the-world garbage collectors and concurrent garbage collectors; one of the take-homes from this talk is that concurrent garbage collectors are the be-all and end-all, and you want to pay big money for them.
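Conceptually, compiled code is sprinkled with polls along the lines of the sketch below. This is my illustration of the idea in plain Java; a real VM typically implements the poll as a load from a page it can protect, rather than an explicit flag test, so the fast path costs almost nothing:

    final class SafepointDemo {
        static volatile boolean safepointRequested;  // set by the VM when the GC must run

        static void hotLoop() {
            while (true) {
                // ... a unit of application work ...
                if (safepointRequested) {            // the poll the compiler emitted
                    blockUntilSafepointReleased();   // park this thread; GC runs now
                }
            }
        }

        static void blockUntilSafepointReleased() {
            // In a real VM the thread reports its state and waits on a VM-internal lock.
        }
    }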
There's also a little bit of extra code, which wasn't mentioned in the push-down from source to machine code, to do with exception handling. Java methods might call through to other things, and this code does method calls, and those things might throw exceptions. What this code achieves (it doesn't quite say it) is returning to the thing which called it, saying 'I don't handle this exception'. By default, every bit of generated code has something which says: if I don't handle this exception, make sure I return to the caller, so they can try to handle the exception. And then there's a little bit at the end here to do with deoptimization, which is something I'll pick up on later.

This slide deck was done by someone other than me, and they thought that dead code elimination was a great compiler optimization. I think dead code elimination means you're writing bad code, because if there's that much dead code, it shouldn't be there. Now, that's not quite true, especially if you're writing meta-circular runtimes and things like that, where you have big bits of your code base which aren't necessary all of the time; they might just be there for debugging, in which case you can eliminate them. 'Reduces the size of the generated code, prevents irrelevant operations from occupying time in the CPU': yeah, not my statement, but basically, if code isn't useful to the running of the program, get rid of it, and guess what, the compilers do that.

OK, now method inlining. The first time I was in California was for JVM '01, so I am older than I look, and one of the take-homes from JVM '01 was this: when it comes to optimizing Java programs, what's the main trick the JVM does? It inlines. And then it inlines, and if it still isn't going fast enough, it inlines again. Inline, inline, inline: it's one of the key tricks. What inlining does is take a method, in this case this daysLeft method, and rather than calling it, it inlines the code of the method body into the call site, so it's getting rid of the call into the method. You saw, from the deep dive into what the compiler was producing, that there's a certain amount of boilerplate to do with entering a method and returning from a method, and inlining gets rid of that. But it also enables other optimizations. If you see in this example, it's passing through the value 0, and in the method, if x equals 0, we return 0, and guess what: in that situation the else statement is dead code. So inlining enables you to do constant propagation: propagating constants, and then eliminating code which turns out to be dead because it was unreachable. This slide goes into a bit more depth on what the previous example looks like when you've taken the method body and replicated it all of the times that were necessary, and then yet more constant propagation. Lovely.
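The slide's code isn't in the captions, so the following is a hedged reconstruction of the shape of the example; the name daysLeft comes from the talk, everything else is a guess:

    static int daysLeft(int x) {
        if (x == 0) return 0;
        return 365 - x;            // hypothetical 'else' work
    }

    static int caller() {
        return daysLeft(0);        // the call site passes the constant 0
    }

    // After inlining the body into caller(), the compiler sees:
    //     if (0 == 0) return 0; else return 365 - 0;
    // Constant propagation proves the condition always true, so the else
    // branch is unreachable, and dead code elimination leaves: return 0;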
Redundancy removal; this slide was written by a marketing person, I'd call it common subexpression elimination. Anyway: if you have two statements which are doing the same thing, say y = x + 1 and z = x + 1, why do the work of computing x + 1 twice? I can common up the work, compute x + 1 once, and just say z = y. It sounds like it should be a very complicated optimization; I assure you it isn't. Basically, you work out a way of putting all of the instructions into a hash table, and if you get a hit in the hash table, that means you've seen this instruction before, and you go 'aha, I don't need to do the work, I can just reuse the result of that instruction'. That falls out almost for free from static single assignment form, which is one reason compiler people like static single assignment form.

OK, so the first two examples were from marketing; here's one that I like: lower-level tree rewriting. What have I got? I've got a class, my class has got a field in it, and I've got a very simple method which just increments the field. Is that over anyone's head? Good. So what do we get when we turn this into bytecode? We get a getfield instruction, which reads the field and pushes it onto the expression stack; we generate the constant 1; we add the two top-of-stack values together, which does the ++ operation; and then we do a putfield to write the result back. (In the Java code I missed a semicolon, so shoot me.) So while in the Java code this is a single expression, it turns into many bytecodes, as you can see from this example; I did actually compile this, though I obviously didn't copy the compiled code across exactly. When we do the deep dive from the source code down to the instructions we generate, what we'd end up with in this case is: the getfield turns into a load of the field into some register, I'm calling it r1; we move the value 1 into r2; we add the two values together; and then we store the result back into memory. Hopefully you're now feeling somewhat comfortable with how that code could actually be generated by the VM.

(Question: where does this happen, when computing the linear form?) It tends to depend where you do this. In the meta-circular VMs I've worked on, I've already linearized the code by the time I'm doing the tree rewriting; if you're doing this in a compiler like the server compiler in HotSpot, the whole thing is in a graph, and you can look at regions of a graph as if they were trees. Is it true that you work from the leaves upwards? I think the answer is yes, so I hope that answers it.

OK, so what I want to look at as a compiler optimization is tree rewriting. What I can observe is that this part of my tree is just adding the constant 1, in which case, where previously I was moving the constant 1 into a register, I can instead directly add the constant 1 onto the register, assuming I've got a machine which can add constants to registers, which most can. OK, so that's nice, I've saved an instruction. On x86 you can do better than that: you can directly add 1 to a memory location. So what I want to do in that situation is recognize this whole tree, and then realize I can do an add-with-memory. (By the way, a regular add-with-memory in the x86 hardware actually becomes a load, an add, and then a store; there's a variant where you can put a lock prefix on it, and that makes it atomic, yes, entirely.)

(Question about the expression stack.) When we went from the Java bytecode to the graph of instruction nodes, we eliminated the notion of local variables and the stack; what we had instead were instructions which generate a value, and edges which feed those values into other instructions. So in a way this code doesn't have anything to do with the stack anymore, because this is happening way down in the compiler, at the point where I've got these LIR instructions saying 'do a load, do a store', and I'm recognizing that I can fold them together.

OK, so this looks like it should be amazing compiler magic, and you'd think you need a PhD or something to understand it. What happens is an algorithm which is about as complicated as diff or something like that. What the algorithm does is try the possible coverings of the tree, discarding the ones that don't work. So we start off by labelling: OK, here I can do the getfield, and that's got a certain cost if I generate the result into a register (these cost functions are actually just arbitrary numbers); we can generate this constant into a register, and that'll have a cost; or we can realize that we can fold these two operations together, and that'll have a cost. So we label the tree with these costs, and once we've labelled the tree, we do a traversal which just selects the least-cost covering of the tree.
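Here is the increment example reconstructed end to end. The Java and the bytecode follow the talk; the LIR names (r1, r2) and the final x86 form are illustrative of the rewrite being described, not dumped from a VM:

    class Counter {
        int count;

        void increment() {
            count++;
        }
    }

    // Bytecode for increment():
    //   aload_0            // push 'this'
    //   dup
    //   getfield count     // read the field onto the stack
    //   iconst_1           // push the constant 1
    //   iadd               // add the top two stack values
    //   putfield count     // write the result back
    //
    // Naive LIR:                        After tree rewriting on x86:
    //   r1 = load [this + #count]         add dword ptr [this + #count], 1
    //   r2 = 1
    //   r1 = r1 + r2
    //   store [this + #count] = r1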
Now, doing that exhaustively isn't really acceptable when you're doing it at runtime, so this is a fairly cheap approximation, and instruction selection like this in HotSpot only happens in the server compiler; the client compiler, other than recognizing that it's got a constant 1 going into the add instruction, doesn't do anything else. You use the client compiler for most warm code; you only start using the server compiler for extremely hot code, so while your compile of the bytecode may take a little bit longer, you're going to recoup the cost because the code is very hot. There's a cost-benefit analysis the VM has to do, to determine whether it should spend energy doing compilation.

OK, high-level tree rewriting; new stuff, and this stuff's good fun, and yes, JVMs actually do this. Unfortunately it's not as well structured as it should be: the previous optimization gets captured in what gets called the architecture definition for a CPU platform in HotSpot, whereas this kind of stuff is less formally defined and tends to be hacked into the VM. But it's good stuff, so we should talk about it, and hopefully you'll appreciate why it's good stuff; I might be overshooting, but we'll see.

So, can people understand this code? I've got three strings, a, b and c. (I could have made the method static.) What I'm going to do is concatenate the three strings together and just return the result. OK, so again, not fantastically complicated. And again, javac, the Java-to-class-file compiler, has done some magic for us: it's actually going to use a StringBuilder, and the concatenation operations turn into StringBuilder append operations. We create a new StringBuilder with our first string a, we append b onto it, then we append c, and then we turn the whole thing into a String. So although the original code mentioned nothing about StringBuilders, the compiled code mentions StringBuilders, and there's nothing the JVM can do about this; the JVM just sees the StringBuilders. (In Java 1.4 they were StringBuffers, and there's a whole slew of literature you can go and read about that.)
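A sketch of the desugaring being described; the source method is my guess at the slide, and the lowered form is approximately what javac of that era emitted for string concatenation:

    static String concat(String a, String b, String c) {
        return a + b + c;
    }

    // Roughly what javac turns it into:
    static String concatLowered(String a, String b, String c) {
        return new StringBuilder(String.valueOf(a))
                .append(b)
                .append(c)
                .toString();
    }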
So you've got this code, and I'm going to say it's not very good. (Audience suggestion.) That's a very good point: the suggestion was, why not look at the lengths of the strings a, b and c, add those three lengths together, and create the StringBuilder with that capacity? Then you avoid the inefficiency where, if the original StringBuilder is too small, you end up growing it in the append operation, and then the next append grows it again. So I'm going to say that's what we're going to do; however, what I need is a new StringBuilder constructor which can take multiple strings and append them together. Now, people will probably jump to the Java API documentation and go 'no such method exists'. There are methods on these things which you don't know about. How is that achieved? StringBuffer and StringBuilder have a common parent class, I think it's called AbstractStringBuilder, and AbstractStringBuilder has these methods on it, but AbstractStringBuilder is package-protected, so no one outside java.lang can see it, not even java.util. So what actually happens is that this 'new StringBuilder' becomes a method on AbstractStringBuilder which takes these three strings. Hopefully you can see the parallel: earlier I was talking about efficiently mapping certain instruction trees onto Intel instructions; here I'm talking about mapping several high-level JVM operations into a more efficient JVM operation, and this can avoid memory allocation and other things like that.

There are lots more optimizations, and I got bored of writing slides. You can pretty much do any optimization you want, and the question which will get thrown back at me is 'well, it's at runtime, you can't afford to do this'. Basically there's a law of diminishing returns: the more optimization you do, the less extra performance you get, so you should focus your attention. There's the very popular 80/20 rule, or 90/10 rule depending on who you talk to: you spend 80% of your execution time in 20% of your code. If you've got code like Eclipse, that 20% is still a lot of code, but it's still true that you spend most of your time in certain hot pieces of code. So the JVM will do complicated optimizations like loop unswitching, loop unrolling, and loop-invariant code motion; I can tell you about those, but just believe me, they're all good stuff, you like them, they make your code run faster.

There's dataflow analysis, which is really nice: it lets the VM propagate information through the graph we were talking about earlier. One of the things you have in Java is the ability for runtime exceptions to occur in lots of different places, and one of those runtime exceptions is NullPointerException. However, I can guarantee that the 'this' pointer is never null, so I can propagate that information through the graph, and anywhere I see a compare-with-null of that pointer, I can get rid of that code; it's dead code. And dataflow analysis can do more impressive things than that: it can say 'this integer is within this range, and that's never going to be outside the range of the array length', and so it can eliminate bounds checks and things like that. So believe me, VMs have very smart sorts of people working on them; we're doing good stuff.

There's vectorization (I've spelled it correctly, good), where you're trying to take advantage of the SIMD multimedia instructions which are on modern processors; 'modern' meaning you've been able to buy them for about fifteen years. Code layout: code layout is a big win. It's just getting the hottest bits of code all in a straight line, and making the branches go to the less common bits of code. If you have profiling, you can do this; the JVM is a hot running system, it can profile all the time, and so it can do very good code layout. This is one of the main advantages of a JVM over a static compilation route like C and C++.

Escape analysis; I deliberately put that there because it's a good optimization. It's very common in Java to have a collection and to say 'I want an iterator over that collection', and that iterator will commonly have one field inside it, called next, which just walks along your linked list, or goes down your ArrayList, or whatever it's going to do. If you're just using that iterator inside a single method, and you've inlined all the code, then what you would like to do is take that next field and place it in a register; you don't want to do any memory allocation for it at all. So what escape analysis does is look for allocations of memory and try to see if they can in some way escape. An escape is where you take the freshly allocated iterator and you write it to a field, or you return it, something like that. If that happens outside of the code which has been inlined, you disable the escape analysis optimization, your next field won't get put into a register, and you will actually allocate some heap. Escape analysis is very nice and very good, and it's much more common inside the VM than people give it credit for. Don't avoid iterators because you think they're allocating memory; they shouldn't be, in a decent VM. (Dalvik doesn't do escape analysis, but anyway.)
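A small illustration of the pattern being described (my code, not the slide's). After the iteration methods are inlined, the iterator below never escapes the method, so an escape-analysis-capable VM can avoid allocating it on the heap and keep its cursor in a register:

    import java.util.List;

    final class EscapeDemo {
        // The Iterator allocated by the for-each loop is only used inside this
        // method; once iterator()/hasNext()/next() are inlined, it cannot
        // escape, so the VM may eliminate the allocation entirely.
        static int sum(List<Integer> list) {
            int total = 0;
            for (int value : list) {   // sugar for iterator(), hasNext(), next()
                total += value;
            }
            return total;
        }
    }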
So, the JVM has a bunch of challenges. Being object-oriented is a challenge: you have virtual method dispatch, and it's certainly true that on older processors virtual method dispatch had a penalty compared to just being able to call a specific target method. If you look at what Java did compared to C++, it very deliberately made everything virtual by default: every method is virtual, and you can override it, unless the class is declared final or the method is declared final, and so on and so forth. So Java is encouraging you to write object-oriented code, unlike C++, which makes you type the word 'virtual' if you actually want object-oriented dispatch; it makes you pay a penalty by making you type a word. C++, because it's statically compiled, must create a dispatch table for virtual method dispatch: when you dispatch, you load a value out of the table and then call through it, so you've got a load followed by a call, and normally it's much worse than this (Gil's there in the audience and can scream at me and say it's far, far worse).

What happens inside the JVM? For example, suppose you have an interface and there's only a single class implementing that interface. If that's the case, why don't I just call the method belonging to that single class? The interface doesn't have any method implementation, there's a single implementing class, there's only one place I can go, so why do I need to look it up in a table? There's no need. So I can make an optimistic optimization which says: I'm just going to go to this one place. But inside the VM, I remember that I made that assumption, and if it ever proves not to be true, I have to do this thing called deoptimization, which is basically throwing away the generated code and recompiling it if it becomes hot again, and so on and so forth. I can also observe that I'm commonly dispatching to one particular method: if I have a HashMap and all of my keys are strings, then I'm going to be calling String.hashCode from the HashMap methods. So while it's true that I could be going to any hashCode method, I could be going to Object.hashCode or something like that, I can do profiling and know where the common target is. I can see that it's always String.hashCode, and then I can inline String.hashCode, but put a guard in there which says: if the receiver ever proves to be anything else, either deoptimize or fall back to a full virtual dispatch.
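A sketch of that guarded inlining in source-level terms; the VM does this in compiled code, not in Java, and deoptimize() below is a stand-in for the VM's internal uncommon-trap mechanism:

    final class GuardedDispatchDemo {
        // What the optimizer conceptually turns 'key.hashCode()' into when
        // profiling says the receiver is always a String.
        static int guardedHashCode(Object key) {
            if (key.getClass() == String.class) {
                return stringHashCodeInlined((String) key);  // fast path: inlined body
            }
            deoptimize();                 // rare path: give up on the assumption
            return key.hashCode();        // full virtual dispatch
        }

        // Stand-in for the inlined body of String.hashCode().
        static int stringHashCodeInlined(String s) {
            int h = 0;
            for (int i = 0; i < s.length(); i++) h = 31 * h + s.charAt(i);
            return h;
        }

        static void deoptimize() { /* VM-internal in reality; a no-op here */ }
    }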
There's another optimization which gets used, which is inline caching; you don't have to do it. Inline caching is where you remember the last target that a method dispatch went to, and you assume that, because you went to it last time, you're going to go to it again. But it's somewhat unnecessary if you've got decent profile data.

So, Java has a lot of stuff, and a lot of this stuff existed before Java, but some people have said that class loading is the big new thing that Java introduced, that no one had really seen before: the ability to dynamically take class files from the internet and include them in your program, or indeed to load them from anywhere a class loader can go and grab them. However, how can the JVM then make optimistic assumptions like 'this interface only has a single implementing class'? If I've inlined a method, how do I know that someone's not going to later load something that overrides that method and invalidates the inlining assumption? Class loading throws up this challenge, and the solution to it is to deoptimize: to say 'oh, I got it wrong' and deoptimize, where the deoptimization in the worst case rolls back to using the interpreter again.

(Question, repeated for the benefit of the microphone: how does it determine that it got it wrong?) You can either put guard tests in the dynamically generated code, so in the generated code you go 'if this isn't a String, then deoptimize this method', or you can record information in tables. The class loader has a table associated with it called the dictionary, and in the dictionary you can remember that this class has certain dependencies: for example, that this class is the single implementer of an interface. If anything loads which invalidates those assumptions, then you can bring the VM to a safepoint and go and deoptimize. There will be threads running that code, but because the class loading happens at a particular point, you can actually do this semantically correctly.

If the VM is doing its job well, this shouldn't be problematic, because the VM has what's known as an adaptive optimization system. The adaptive optimization system keeps track of the assumptions the VM is making, does cost-benefit analyses, and determines what it should compile, with which compiler, and what kind of assumptions should be baked into the generated code. If the adaptive optimization system is working well, then deoptimization will rarely happen. You make these very optimistic assumptions because, when they hold, they're a really great performance win. And this is really why Java can beat C++ and C: they have to be conservative. With the latest versions of GCC you're getting into things like link-time optimization, but even in that scenario they're looking at the whole program, and they have to conservatively assume that any code path which exists in the program is going to happen. Inside the VM, you can go: that's never been run, so I'm just
going to assume that if I ever get to that piece of code, I'll deoptimize to the interpreter. It means you can make very small, fast bits of code with very aggressive assumptions baked into them.

(Question: what kind of problems do you see if deoptimization is occurring a lot?) For deoptimization to be occurring a lot, you've got to be compiling code and then having that code deoptimized, so one of the things you'd observe is that a lot of CPU time is being spent by the compiler threads of the VM; if you've got visibility of the compiler threads in your profiler or whatever, you can see that they're running a lot. There's also a command-line flag, -XX:+TraceDeoptimization, and if you switch that flag on, you can see the deoptimization events occurring inside the VM. If you start seeing a lot of them, then you know the VM is doing something wrong. Now, the VM doesn't generally do things wrong, so you shouldn't get paranoid and think 'oh, I've heard about this deoptimization stuff, the VM is going to make all of these aggressive assumptions and thrash'. What the VM does is: if it sees that one of its assumptions has been invalidated, it adds that to its profile data, and when it comes back to recompile that code, it compiles it in a different way and doesn't make the same assumptions. So you'll see a deoptimization event once; it's only when the compilers are broken (and I've seen compilers be broken) and the adaptive optimization system is broken that you'll get repeated deoptimization going on.

(Question: can you force the VM to compile a piece of code?) You can't; well, that's not quite true, you can force the compiler always to run. Back in my Fibonacci example, I forced the compiler to run by using a command-line flag called -Xcomp, which just says 'always use the compiler'. You don't normally want to do that, because it has a big performance overhead; in general, you want the adaptive optimization system to be working and making these decisions for you. So what can you do to encourage the compiler to compile a particular method? Make it small and hot. Yeah, small hot methods: the VM loves them, and it's going to compile them. If you have something like 'if this condition holds, do something nice and fast, otherwise do this big complicated thing', move the complicated thing into another method; then you've made the original method smaller and more attractive to the compiler, both to inline and to compile on its own, because the adaptive optimization system is making these cost-benefit analyses, and part of that is based on the length of the method. It's a crude heuristic, unfortunately, but it is what it is. The compiler doesn't spend energy compiling cold code.
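A small before-and-after illustration of that advice (my example, not the slide's):

    // Before: one big method; the rare, complicated path bloats it and can make
    // the inliner's size heuristic reject the whole thing.
    static int process(int x) {
        if (x >= 0) {
            return x + 1;                  // the hot, fast path
        } else {
            // ... imagine many lines of rare error handling here ...
            throw new IllegalArgumentException("negative: " + x);
        }
    }

    // After: the cold path is split out, leaving a tiny hot method that the
    // JIT is happy to inline and compile.
    static int processSplit(int x) {
        if (x >= 0) return x + 1;
        return handleRareCase(x);
    }

    static int handleRareCase(int x) {
        // ... many lines of rare error handling ...
        throw new IllegalArgumentException("negative: " + x);
    }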
I was about to get onto trace compilation, but sorry, there's a question first, about Java agents. Java agents are used to enable debugging and profiling through an API called the JVM Tool Interface, JVMTI, and before that there was the JVM Profiler Interface, JVMPI; the question was whether these profilers affect the optimization, and the answer is yes. What the JVMTI agents do, as they get specified on the command line, is say 'I'm an agent, and I want these features from the VM'. They might say something like 'I want to be able to single-step through bytecode', and the JVM can go 'that's a real nuisance to implement inside a compiler', so what it can do in that situation is say: if you really want to do that, use the interpreter. So if you switch on a JVMTI agent which uses that option, the VM can start becoming very conservative about what it's going to do, and you can start seeing different behaviors: you'll be spending more time running in the interpreter than in compiler-generated code. Now, there's no reason in principle why you can't single-step through bytecode in compiler-generated code; it's a question of where the JVM guy spends his energy. Does he spend it making profilers run very quickly for specific cases, or does he spend it doing cool stuff like vectorization and escape analysis, all the good stuff for people who aren't running code with profilers? People have done a lot of stuff with the JVMTI interfaces, and most JVMTI interfaces aren't fully implemented by most VMs, but it is what it is.

So, back to the earlier question: are final, static or private methods more likely to be optimized? The VM doesn't care whether they're final, static or private. If they're static or private, you don't need dynamic dispatch, they're only ever going to go to one target, so static and private are a good way of forcing the hand of the inliner. But you shouldn't need to: the profiler should be able to figure out the inlining itself and see that there's a single target. So the key message is: write clean, nice, easy-to-debug, good-quality code, and don't worry about what the JVM is doing; it's far smarter than you are. And yet there are people who should know better, like the Scala people, who do ridiculous things which completely break escape analysis, which is frustrating, and then they tell you it's a really great optimization. No.

OK, so that's capturing the take-home: the JVM's compilers and the adaptive optimization system are going to do all of this smart stuff, and you shouldn't worry about it unless you want to; and hopefully I've given you some pointers to some of the -XX options, so you can get involved and look at the machine code that's being generated, and so on. It's all open source, so go and play around with it and do good stuff.

(Question about Dalvik.) Dalvik doesn't have this; certainly in 2008 it was just an interpreter. Trace compilation? That was my PhD thesis, ha, but I won't bore you with the details. Trace compilation is a different way of looking at compilation, where instead of looking at hot methods, you look at hot basic blocks. The problem with that is that if you start looking at things as basic blocks instead of methods, you lose the call stack, so if anything becomes uncommon, you have to go and recreate the call stack. And if you look at what hardware architectures do, they do things like predicting that when you do a call, you're going to return to the next instruction after it. So the problem with trace compilation systems is that they've been so smart that they've gone beyond what the architecture was actually expecting them to do, and then they have to be smart all over again to win back that performance loss; in the extreme, the two approaches are going to roughly equal each other in terms of performance.
(Question: do hardware architectures have smart instructions which help JIT compilation?) They do come up with all of these smart instructions, and the one I'm most familiar with is that ARM have an explicit null-check instruction, which compares a value with zero and traps if it is zero. How does a regular JVM handle null checks? It makes sure that the pages of memory at address 0 and just above aren't mapped in, so if you ever try to load or store through those low pages of memory, it creates a page fault, which becomes a segmentation violation, and then the VM handles that and goes 'oh look, it's really a NullPointerException'. So ARM have got these instructions, and I can't find a single use for them, because we've long since engineered around the problems these instructions were trying to solve. They might have come up with some other great ones, but I'm unaware of them; and my research group boss was the co-inventor of the ARM instruction set.

OK, so: garbage collection. Memory safety is really great stuff. We can allocate objects, and look, you can see that the author of the slide deck was called Eva. You can see here we're doing memory allocation, and I didn't need to worry about the fact that I've used some resource, some memory; I don't need to worry about trying to reclaim it. If I were working in C or C++, I'd have to try to free up that memory, otherwise eventually I would run out of memory. Memory safety is the fact that the runtime system does this for you, and thank God.

So we've got garbage collection, and we've got all of this goodness. What do we want from a garbage collector? We want it to be parallel, using all of the CPU resources we have on the machine; we want it to be concurrent; we want it to be generational; we want it to handle fragmentation and compaction; and we want a lack of tuning. With most JVMs it's hard to work with this stuff. (And here these slides are getting away from my own.)

So, we're creating objects in memory: we're saying these regions of memory are going to be used for these particular objects. The simplest way to do that is just to have a pointer and go 'OK, we'll put an object here', and move the pointer up, and the next object we allocate goes immediately above that point, and we bump the pointer up again. At some point in the future, we want to work out which objects are alive and which objects are garbage. Now, one of the things with garbage collection is that we don't want to spend time tracing through stuff which is garbage. We want to work from a set of root nodes, and the root nodes tend to be things like static fields, values in registers, and values on the stack. We start from those root nodes and find what they can reach, in terms of objects which are alive, and this is the marking of the objects. There are other ways to do garbage collection, you can do reference counting and so on, but with reference counting, every time you update a reference field within an object you have to bump a counter up and down, and that sucks in terms of performance. So what you really want to do is tracing, and tracing only pays a penalty for objects which are alive. And if we're doing tracing to find which objects are alive, anything that isn't reached is therefore garbage.
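A toy sketch, in Java, of the marking phase just described; a real collector works on raw heap words and stores the mark bit in the object header, but the worklist structure is the same:

    import java.util.ArrayDeque;
    import java.util.Collections;
    import java.util.Deque;
    import java.util.IdentityHashMap;
    import java.util.List;
    import java.util.Set;

    final class MarkDemo {
        static final class Obj {
            List<Obj> references;                  // outgoing pointers
            Obj(List<Obj> refs) { references = refs; }
        }

        // Everything reachable from the roots is alive; everything else is garbage.
        static Set<Obj> mark(List<Obj> roots) {
            Set<Obj> marked = Collections.newSetFromMap(new IdentityHashMap<>());
            Deque<Obj> worklist = new ArrayDeque<>(roots);
            while (!worklist.isEmpty()) {
                Obj o = worklist.pop();
                if (marked.add(o)) {               // first visit: mark, then scan fields
                    worklist.addAll(o.references);
                }
            }
            return marked;
        }
    }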
If we're not going to move objects around, then we need to do some kind of maintenance of this memory: we can start putting things onto free lists, to remember that this memory is now available for us to reuse. But that means we get away from doing our simple bump-pointer allocation: we have to go and look at the free lists and go 'oh, there's something on the list which isn't currently in use, I can go and use that'. That kind of garbage collection is called mark and sweep: you go through and mark everything that's alive, and everything else is swept out of the way.

Generational collection: I know there's a slide coming up on generational collection, but the observation is that most objects die young. There are different forms of the generational hypothesis, a weak one and a strong one. The strong one says that the older an object gets, the more the probability that it's about to die increases; that doesn't tend to hold. But the weak generational hypothesis, that the vast majority of objects which get created will simply die young, does hold, and this is an interesting observation, because it means that we can focus our garbage collection energy on a small set of freshly allocated objects, and not look at every object in the entire system; we can just look at the recently allocated ones and concentrate on reclaiming the memory they were using, if they're dead.

Fragmentation and compaction: do people remember what it was like to defragment hard drives? So you're familiar with the term fragmentation. Fragmentation is when you've got things distributed around in memory, and what you'd like to do is push them all together and make more efficient use of the memory. If you don't do this, at some point there's going to be so much wasted space in your memory that you're going to run out of it much faster than you normally would. So this leads to wanting to do compaction, and when you do compaction, you move objects around, and if you start moving objects around, they can be in two places in memory at once. That's a problem, because if your application is running at the same time as the garbage collector, which copy of the object do you use? Let's carry on working through the slides, because I think that's the stuff coming up.

We have the VM's own memory, and for the JVM that's very small, so don't worry about it; what we care about here is what your application is using, which is this great big whopper over here. When we do allocation inside the HotSpot VM (I mean, there are much simpler VMs out there which don't do this), we do thread-local allocation. We've got this pointer that we're bumping up, and in the simplest case the problem is that if multiple guys are trying to bump this pointer up, then we've got race conditions. If we make the pointer thread-local, that all goes away; fantastic. So we have thread-local allocation regions: here we've got thread A, which has a thread-local allocation region, and thread B, which has another, and we place objects into these TLABs, as they get called. And then thread B comes along and tries to put an object into his local allocation buffer, and there's no space for it. This is the most common cause of a GC event having to happen: you've filled some finite memory resource.
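A sketch of bump-pointer allocation within a TLAB; this is illustrative Java with addresses as plain longs, and none of the real VM's alignment, object headers, or card marking:

    final class TlabDemo {
        static final class Tlab {
            long top;        // next free address in this thread's buffer
            final long end;  // one past the last usable address

            Tlab(long start, long end) { this.top = start; this.end = end; }

            // Bump-pointer allocation: no locks needed, the buffer is thread-local.
            long allocate(long size) {
                long obj = top;
                if (obj + size > end) {
                    return 0;          // TLAB full: get a new TLAB, maybe trigger a GC
                }
                top = obj + size;      // bump the pointer
                return obj;            // the 'address' of the new object
            }
        }
    }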
OK. So in the regular HotSpot VM, what we have to do at this point is go: we need to do a GC cycle, we need to work out what's alive, we need to throw away the garbage, and we possibly want to do compaction. You don't always want to compact, but you're always going to sometimes need to. And because you're moving things around, and the application can't deal with things being in multiple places at the same time, the best thing you can do is stop the whole application and do your garbage collection; this is what leads to GC pauses. So here we are doing some garbage collection... oh, I hadn't seen that animation, that's new to me.

For concurrent garbage collection, we want the application and the garbage collection to be able to happen together. This slide deck refers to Azul's garbage collector as the Generational Pauseless Garbage Collector: generational, so we've got old and young generations and the efficiency of realizing we don't always want to be scanning the old generation; and pauseless, in that it never has these safepoint pauses. In newer terminology we've rebranded it the C4 garbage collector, the Continuously Concurrent Compacting Collector. Gil comes up with these names; C4, which apparently is an explosive.

So what do we want? We want to be able to move things around. What's this slide showing me? There's not enough room; OK, generational collection, fine. Promotion, we don't care: promotion is just when something young has been alive for a certain amount of time and you go, aha, now I want to promote it into the old generation. And then compaction. But none of that has mentioned how you actually do any of the smart stuff. So what's the problem? We've got objects in two places. How can we deal with an object being in two places? We need some smarts which tell us where the object we're interested in is in memory. Normally the object we're interested in is going to be in one place; it's only when some compaction is going on that it's being moved to another place. So what we need is something which says: when we have a pointer to something which is being relocated, make sure it points to the place it has been relocated to. That is a read barrier, and this is the big difference between the Zing virtual machine and the HotSpot virtual machine: the Zing virtual machine has read barriers, and this property of having a read barrier is what avoids garbage collection pauses.

There are some subtleties to that. You can try to reduce garbage collection pauses by other means, and certainly Sun tries to do that, but they don't do as good a job. This is an example of a production Concurrent Mark Sweep setup, CMS being the concurrent garbage collector that Sun has, and here's what the command line might look like if you want to tune the garbage collection. Normally, what people are trying to do by tuning the garbage collection is to make sure that everything ends up in the nursery, the young generation, and seldom gets promoted into the old generation; because if things end up in the old generation, then you're going to have to look at the whole of the heap to actually work out what data is alive.
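The slide's exact command line isn't preserved in these captions, but a typical CMS tuning incantation of that era looked something like the following. The flags are real HotSpot flags; the values are illustrative:

```
# Illustrative CMS tuning: big young generation, delayed promotion, and a
# concurrent cycle that starts early. Values are made up for the example.
java -Xms4g -Xmx4g -Xmn1g \
     -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -jar app.jar
```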
There are various white papers on how you can do "garbage collection ergonomics", as it gets called, to try to get these things to work. Basically, the take-home is: this is a mess. Get yourself a proper garbage collector, and Azul now has one in software too. The other thing is that these garbage collection pauses only get worse as your heap sizes increase. So it's realistic to think that a current JVM is going to start getting unacceptable garbage collection pauses: a pause which means you miss a heartbeat and end up rebooting a server, because visiting the whole of a four-gigabyte heap inside a garbage collection pause takes many seconds. If you miss a heartbeat in a server environment, it looks like the server has died or crashed, so you end up rebooting it. It wasn't that at all; it was just your garbage collector kicking in. So this is a real problem, and that's why Azul has a real solution.

OK, these come from Eva. Don't be afraid of garbage: it is good. Don't like finalizers: finalizers are just a mess, because what they're trying to do is recoup resources, typically from the operating system or something, and there's not much control over when they get run. In Java 7 you're getting things like AutoCloseable, so the need for finalizers is going to be reduced (there's a small try-with-resources sketch below). The way finalizers are implemented in the VM is kind of messy as well. So I wouldn't be too afraid of them, but know their limitations. Always be careful around locking: if you're writing very parallel things and you have a lock and everything gets serialized on that lock, guess what, that's bad; so make things concurrent when you can. Benchmarks are often focused on throughput but miss real GC impacts. It's very common for people to game benchmarks, and one of the ways is to set up the garbage collector so it doesn't do an old-generation garbage collection pause whilst the benchmark is running. Typically a benchmark will run several iterations; if you do your old-generation pauses in between the iterations, they never get timed, and that gives you very good performance numbers. So guess what: when performance engineers generate benchmark numbers, this is what they'll try to do, because it makes their VM look faster than it actually is. If you're an application guy and you actually care about these pauses, you should wake up and benchmark it yourself.

OK, so I'm at the end; hopefully I haven't used the whole of an hour or forty-five minutes. The JVM is a great abstraction. Hopefully you've seen into what the JVM is doing in terms of the compilers, and I've talked some about the garbage collector. The garbage collector is a much more complicated beast than the few slides here let on, but the take-home is that you need to do something smarter with the garbage collector than has typically happened. Compaction is the nasty bit: we like memory safety, we don't like garbage collection pauses. Why do we get garbage collection pauses? Because of compaction. How can we avoid these pauses? We can be continuously compacting, avoiding the situation where these pauses are going to happen, and Zing is at the forefront of doing this.
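For the finalizer point above: the Java 7 replacement pattern is try-with-resources, where anything implementing AutoCloseable is closed deterministically instead of whenever a finalizer happens to run. A minimal example:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// try-with-resources (Java 7+): the reader is closed when the block exits,
// even on exception -- no finalizer needed to reclaim the OS file handle.
class ReadFirstLine {
    static String firstLine(String path) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
            return r.readLine();
        } // r.close() runs here, deterministically
    }
}
```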
What is Zing? OK, read the slide. As I already said, the Generational Pauseless Garbage Collector, now known as the C4 garbage collector, is the garbage collector inside the Azul software JVM that we produce. Why did we originally develop this whole platform? We had these server boxes, as I mentioned earlier on; with the first one, we were showing off the boxes. These systems have got eight hundred and sixty-four processor cores in them and, I think, 640 gigabytes of RAM. So they're big: a lot of memory, a lot of RAM, a lot of CPU cores. And even with a JVM at that kind of size, guess what: with 640 gigabytes, if you're going to do a GC pause, that's going to take a while. So you have to engineer around it, and you want to engineer everything to be as concurrent as possible. Azul has been focused on this problem for eight-plus years, and it's really great that, in terms of the number of cores you're getting now inside a laptop, inside a desktop, inside your common server, Intel is catching up with what Azul has had for the last few years. But that's some history.

OK, so the real answer is that there's a read barrier. The read barrier means that when you load a reference from an object (you've got a field which is a reference; maybe it's a linked list and it's the next field, the next node in the list), you load that reference out of the object into a register, say, and immediately following that you do this read barrier operation. On the original Vega hardware this was a single instruction, and rather than being called a read barrier it was called the Loaded Value Barrier, the LVB. On x86 we have something equivalent, and it's super smart and super fast: about the same cost as the LVB instruction on the Vega hardware. What property does the LVB give you? It gives you the property that, whilst an object can be in one of two places, you only ever see it in the final place. And if it's not already in that final place, the application thread will actually do the copying itself; however, that's bounded, so it only ever has to do a small amount of copying. If it's a large amount of copying, it does a small bounded amount and then asks the garbage collector to help it out and do everything else. Between application threads and garbage collection threads you have to do compare-and-swap operations, so that if the garbage collector is relocating something and the mutator is relocating the same thing, the winner of the compare-and-swap is the one whose relocation counts, and then the application continues with whoever won. The chance of that happening is very rare, because two parties copying the same object at once is a very unusual situation: the normal case is that you've just got a pointer which is pointing at the old location and you need to update it to point at the new location; actually doing the copy is the unusual case. Gil can give a much more detailed explanation of all of these things; as you might have guessed, I tend to work more on the compilers than on the garbage collector.
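As a conceptual sketch only (the real LVB is a single Vega instruction or a short x86 sequence, not Java, and these helper names are invented), the semantics of a loaded-value barrier look roughly like this:

```java
// Conceptual sketch of an LVB-style read barrier (invented names, illustrative
// semantics only): every loaded reference is checked, so you never get to see
// the stale copy of an object that is being relocated.
final class LvbSketch {
    static Object loadedValueBarrier(Object ref) {
        if (ref != null && isBeingRelocated(ref)) {
            // Help finish the (bounded) copy and/or fetch the forwarding
            // pointer; a compare-and-swap decides who "wins" if the collector
            // is copying the same object concurrently.
            ref = forwardedCopyOf(ref);
        }
        return ref;  // always the object's final location
    }

    // Stubs so the sketch compiles; a real VM answers these from heap metadata.
    static boolean isBeingRelocated(Object ref) { return false; }
    static Object forwardedCopyOf(Object ref) { return ref; }
}
```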
So, moving on to JVM optimizations, and I specifically mentioned escape analysis. The Scala guys in Switzerland did a nice paper on this and submitted it to a workshop that I was the chair of, so I know the paper quite well. Because invokedynamic didn't exist yet (invokedynamic is the new bytecode which is going to help languages like Scala and so on), what they were doing was using reflection for method dispatch. And they thought: let's remember the Method objects used for doing the method dispatch. So they were storing these Method objects into a static final field, or a final field, or whatever, and going: now I remember what the method is, I can just reuse it, I don't need to recompute the Method object all the time. However, because they've taken this Method object and written it into a field, it has now escaped; so the escape analysis can't kick in, because the object has escaped, just as it truly has (there's a sketch of this pattern below). What escape analysis does for reflection is this: behind the scenes, the runtime system will generate bytecodes, and those bytecodes do the reflective method call for you. With the escape analysis optimization, you can realize that the Method object, whilst it is created, has no purpose other than to carry around these bytecodes; and all these bytecodes are doing is a virtual method call. So what happened in this particular case for Scala (and hopefully they've fixed it, or HotSpot's escape analysis now works slightly differently) is that because they store the Method object, they break the escape analysis, and then you don't get the direct virtual method calls to the target. You can do reflective method calls at the same performance as a regular method call if you can inline everything and escape-analyze it and so on; I've got research papers on it.

Now, HotSpot, Jikes RVM, pick your virtual machine: they tend not to have something like C's alloca, which allocates on the stack; they tend not to have that operation inside the VM. But they do put objects onto the stack. How do they do that? They do the escape analysis, realize the object doesn't escape, and then, say, the next field from our iterator becomes one of our registers in our infinite pool of virtual registers. When we register-allocate, it'll either end up in a real register, or, if we're limited in the number of registers we have, we'll spill it out onto the stack and fill it back in when we need it again. So this is effectively stack allocation, but not quite. A side note: the Vega architecture actually did have a notion of stack allocation, but not to worry about that.

As for avoiding inlining: make the method very large, or make it cold. Basically, the JVM is going to inline a method if it thinks there's a performance advantage to inlining it. An accessor method should always get inlined, so if you're trying to avoid inlining an accessor method, bad luck. There's also a command-line flag, -XX:-Inline, which switches off all inlining in the VM, if you're crazy enough to want to defeat the whole of the compiler.
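Here is a hypothetical reconstruction of the reflection-caching pattern described above. The class and method names are invented, and whether a particular HotSpot version optimizes either shape depends on its escape analysis:

```java
import java.lang.reflect.Method;

// Hypothetical reconstruction of the pattern from the Scala paper anecdote.
class DispatchExample {
    // Writing the Method into a static field makes it escape, so escape
    // analysis can no longer prove it's just a carrier for dispatch bytecodes.
    static final Method CACHED = lookup();

    static Method lookup() {
        try {
            return DispatchExample.class.getMethod("target", String.class);
        } catch (NoSuchMethodException e) {
            throw new AssertionError(e);
        }
    }

    public static String target(String s) { return s.toUpperCase(); }

    static String viaEscapedMethod(String s) throws Exception {
        return (String) CACHED.invoke(null, s);   // escaped: harder to optimize
    }

    static String viaLocalMethod(String s) throws Exception {
        Method m = lookup();                      // never escapes this frame:
        return (String) m.invoke(null, s);        // a candidate for escape
    }                                             // analysis plus inlining
}
```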
Concurrent things sound like they're going to have lots of bugs in them, and so on; how do you avoid this? Well, you invest in QA; that's a big way to solve these problems. But you can also recognize that there are certain invariants about the code and about the heap, and you can make sure that those invariants hold. You can have slower debug versions of the VM which check that all of the invariants always hold and that your heap is always in the correct state, and when they don't hold, you go: we need to go and fix a bug (there's a tiny sketch of the idea below). If we've done our job right, none of this affects you. But yeah, it's a lot of work, it's complicated; I've got a job, that's great.

Reflection: people have told me in the past, avoid reflection because it's got this huge performance overhead, and I usually say to them, show me the numbers. And then they never come back to me, and the reason they never come back is that when they go and look at the numbers, when they actually have hot code doing reflection, it works very well. It's true that in very early JVMs there was very little optimization there, so a method dispatch through the reflective mechanisms would be something like hundreds of instructions, whereas a regular method dispatch would be, say, one instruction. These days, I think the rule of thumb is that it's about twice as costly as a regular method dispatch. The thing to realize is that it's a dynamic system, so it's when the code gets hot that you're going to see these performance wins. In the metacircular VM I did a lot of work on, the cost of a reflective call was exactly the same as a regular method call; we optimized everything away. So there's really no reason for the overhead. The catch is that to get the optimized version, you've got to get the optimizing compiler in there: the adaptive optimization system has to realize what's going on and remove all of the code which was causing the performance overhead.

The idea behind metacircularity is that you should eat your own dog food. If you read the Dragon Book and things like that, there's a rule of thumb that when you write a compiler, you should write it in the language it's actually trying to compile; and clearly there's a bootstrapping problem when you do that. If you look at a regular VM like HotSpot, it's written in C++, not in Java, so this eat-your-own-dog-food property doesn't exist. A metacircular VM eats its own dog food. Jikes RVM is an example of a metacircular VM: a VM written in Java. It sounds like that shouldn't be possible, but it's very possible; people have built whole operating systems entirely out of Java, and there's really no reason why not. The advantage of eating your own dog food is that when you realize something is an issue, you go and fix it in the language, you go and fix it inside the VM, and you get this beneficial circle happening. And whilst it's a bit unfair to say that HotSpot isn't eating its own dog food (it did have the bootstrapping problem of what VM it was going to bootstrap itself on), what happens inside HotSpot is that things get pushed out into Java code, so they can take advantage of all of this optimizing-compiler goodness.
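A tiny illustration of the invariant-checking idea. This is the generic pattern, not Azul's actual verifier, and ObjectHandle is an invented stand-in for however a VM models heap objects (run with -ea to enable the assertions):

```java
// Debug-build heap verification sketch: a slow pass that walks every object
// and asserts the heap invariants hold, so corruption is caught near its
// source rather than crashing much later.
final class HeapVerifier {
    interface ObjectHandle {                      // invented stand-in
        long header();
        Iterable<ObjectHandle> referenceFields();
        boolean pointsIntoLiveHeap();
    }

    static void verify(Iterable<ObjectHandle> allObjects) {
        for (ObjectHandle o : allObjects) {
            assert o.header() != 0L : "object with corrupt header";
            for (ObjectHandle field : o.referenceFields()) {
                assert field == null || field.pointsIntoLiveHeap()
                        : "dangling reference escaping the live heap";
            }
        }
    }
}
```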
If you have application threads and you have garbage collection threads, and you've got your heap which has things which are alive and things which are dead, and you need to compact it: it's possible, via the algorithm we have, this continuously compacting concurrent collection algorithm, and there's a whole research paper which describes it. I'm trying to think if there's a really quick way to summarize it for you. It's complicated, but essentially you make sure that when you're doing this compaction you're not reliant on things like locks and so on; and, as I already said, if the application can't make progress because it's waiting for the GC to do something, the application makes the progress itself, in small bounded amounts.

The big thing for Java 7 really is just this invokedynamic bytecode. The invokedynamic bytecode has these interesting method dispatch properties, but it's been created with a knowledge of how the VM actually works. So is it really a VM optimization? It's kind of the VM guys holding hands with the guys who are generating the bytecode and realizing there's a more efficient way to do something; invokedynamic also exposes some clever call-site semantics and things like that. Really, what you should be waiting for is just for the optimizations which already exist to get better. The new Intel processors have AVX instructions: AVX gives you 256-bit registers, the instructions are also three-address, and there are various other features, so there's a lot a compiler can do with them. To use them you've got to do vectorization, which means realizing you've got a loop structure and you can actually do, say, four operations instead of one inside that loop (there's a small example below). Other things: escape analysis needs to get better, everything needs to get better, but it's more of the same. There's been some interesting research on things like object inlining and different things like that; it's a very dynamic and interesting research space.

So, when the JVM has compiled some code, how are we going to know where the incoming arguments for a method are? In the Zing virtual machine we pass arguments through registers; this avoids having to load them from the stack, because they're just there in the register. This is common in 64-bit x86 code; in 32-bit x86 code it's very common to pass things on the stack. The number of parameter registers you have for passing things in registers is only six, so if you have more than six parameters, the extra parameters are going to get passed on the stack. But of course, when these things get inlined, all of this passing on the stack and passing in registers goes away, because the inlined method becomes part of the graph and these values need never end up in particular argument registers. So, to answer the question more directly: it's a choice for the VM as to what calling convention it's going to use, and it just happens that in the Zing VM we pass by register, for efficiency reasons. The benchmarks that show the performance differences... so, as you guys probably know, you…
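For the vectorization point above: here is the classic loop shape that JIT auto-vectorizers recognize. Whether a given JVM actually emits AVX code for it depends on the JIT and the hardware, so treat this as illustrative:

```java
// A vectorization-friendly counted loop: no cross-iteration dependencies, so
// the JIT can process several array elements per SIMD (e.g. AVX) instruction.
final class VectorizableLoop {
    static void add(float[] a, float[] b, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }
}
```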
Info
Channel: InfoQ
Views: 143,621
Rating: 4.8218126 out of 5
Keywords: java, JVM, Java, Virtual, Machine, Azul, systems, garbage, collection, concurrency, ian, rogers, marakana, open, source, techtv
Id: UwB0OSmkOtQ
Length: 93min 35sec (5615 seconds)
Published: Fri May 27 2011