A JVM Does That??? by Dr Cliff Click

Captions
All right, I think I'm live. Hi, I'm Cliff Click, and I'm here to talk about what goes on under the hood of the Java Virtual Machine. I was a JVM engineer for more than 15 years, and I'm still amazed at what's in a JVM. There's a huge count of services that have just sort of accreted slowly over time, and many of those services were painfully volunteered by very naive changes in the spec — finalizers, and what they did to garbage collection, were just awful. There are lots of these things where somebody had a cool idea, they changed the language, and the impact on the JVM was enormous. So if you look at a JVM and what goes on inside, you find a bunch of interesting parts — the whole is greater than the sum of the parts, but the parts themselves are interesting. There's high-quality GC: parallel, concurrent, incremental collectors with reasonably good, low total allocation cost. When Java started, that was just not possible; now it's simply assumed. High-quality machine code generation was something I had a very personal hand in; again, that was a thing nobody believed you could do at run time, and now it's just assumed that the code quality coming out matches what you'd get from a standard static compiler at the -O2 level, including profile-driven code and all the code management that goes with it. And the thing under there that says "bytecode cost model" — that's a key theme the JIT brings in, and I'll talk more about it later. There's a uniform threading and memory model, and again that was not possible when Java started, so Java did something new there, broke some ground, and said: this is what it means to have threads communicate, across all kinds of different hardware. There's type safety of all kinds, dynamic code loading — you don't need the closed-world assumption, you can keep adding code and it all just works — fast time access, lots of internal introspective services, and this huge library. One of the key building blocks for Java is that you get access to the JDK, the concurrent collections and all the other collections, and access to OS-level things that only the OS provides but are brought through the JVM — threads, scheduling priorities, and of course native code. So where do these services come from, and why are they there? They were mostly added incrementally over time; the language, the JVM, and the hardware have all co-evolved together. 64-bit math wasn't there when Java started because you didn't have 64-bit processors; support for high core-count machines wasn't there in the beginning because you didn't have those machines; and so on. So why did these services show up? Because each service provides an illusion, and those illusions are a powerful abstraction — it's the V in Virtual Machine. It's a great abstraction: you can go think about solving other problems somewhere else and know that the illusion holds, covering up for all kinds of complexity elsewhere. It's a separation of concerns: the JVM handles this problem, you solve that problem. So today I'm going to talk about what kinds of services are in a JVM, a little bit about where they came from, what they do for you, whether or not they do a good job, and what could be done better. In particular, many of the services overlap with existing OS services, but they have
some different requirement going on that doesn't exactly match, so you can't just lean on the OS resource to make it happen. And then there are some easy changes I would like to see made that would make JVMs work better. Let me look first at illusions that we have — and when I say we have them, I mean they actually work. Garbage collection is the illusion of infinite memory: you just allocate and allocate and allocate, you never free, you don't track lifetimes, and it works because the GC figures out what's live and what's dead. This is vastly easier to use than the malloc/free model — fewer bugs, quicker time to market, and all that goes with it — and it enables certain classes of concurrent algorithms that you simply cannot do with malloc and free, because you cannot track the lifetimes. GCs have made huge strides in the last 15 years; they are obviously production-ready, and have been for quite a while now: robust, parallel, concurrent. There's still a major pain point for many users — too many GC tuning flags, too many issues with GC pauses — and there's lots of active differentiation going on between the different kinds of GCs available. I'll mention three right here; you'll see pause times vary over six orders of magnitude. Azul's pauseless GC has actually gotten substantially better than one millisecond max pause time — these are max pause times, by the way — but you have to go to, not custom hardware, but almost a custom setup, because you can't get below a millisecond without involving all kinds of other features of the OS; it's not just the JVM, and not just the GC, that you have to fix to get rid of all the pauses. Whereas the stock, standard full GC is very efficient, very high throughput, but if you get into tens of gigs of heap you can get tens of seconds of pause time, and that's just not acceptable in some situations. (The flag sketch below gives a flavor of the tuning zoo involved.)
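To make the "too many GC tuning flags" point concrete, here is the kind of collector selection and tuning a stock HotSpot user ends up juggling — a minimal sketch of flags that existed around the time of this talk, not a recommendation; the exact set and defaults vary by JDK release.

```
# Pick a collector (mutually exclusive), then start tuning it:
java -XX:+UseParallelGC       -Xmx32g MyApp   # throughput collector: efficient, but multi-second full-GC pauses on big heaps
java -XX:+UseConcMarkSweepGC  -Xmx32g MyApp   # mostly-concurrent, but fragments and eventually falls back to a full GC
java -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -Xmx32g MyApp   # pause-target collector; the target is a goal, not a guarantee
```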
Another illusion we have is that bytecodes are fast. In fact, bytecodes are a really lousy way to describe program semantics — there are all kinds of better ways to describe program semantics — but we're stuck with bytecodes, fine. The main win, of course, is that bytecodes hide CPU details: they talk about language semantics, not about machine registers or cores or the speeds of loads and stores and multiplies or whatever. But it's all a big illusion, because if you interpret the bytecodes, with their complicated semantics, the interpretation is slow. The JIT brings back an expected cost model: the things that look like a load and a store turn into a load and a store eventually, and it's because of that that you can reason about the speed of a program — you understand where the time goes — and that lets you write high-performance code. That was one of the reasons Java was able to take off: being able to match C performance at a close-enough level, and that was only possible with the cost model that came out of the JVM. To make that happen, the JVM had to mimic the optimizations that the C compilers were doing, so the JIT inside the JVM brings everyone in this room essentially GCC's -O2 level of optimization, without you ever thinking about it or realizing it's happening. And why didn't we just use GCC directly, or one of the other open-source compilers readily available in that era? Because they didn't do the right job everywhere; they missed in certain key places. They didn't track pointers for garbage collection. They didn't follow the Java memory model, which has very strong restrictions on when you can reorder things — everyone reschedules loads and stores for performance, but under the Java memory model there are times you cannot, and you simply have to track that through; it's woven throughout the compiler. Another thing is that there were different patterns of code to optimize. Java has implicit range checks; C does not. That meant we had to do aggressive range-check elimination, and once that was done, Java array accesses became, on average, the same speed as C array accesses — one of the performance guarantees that came about that wasn't possible when the JVM started. The JITting also requires profiling, because you can't compile everything, but the profiling has a bunch of extra benefits: you get focused code generation, and actually better code generation — the same as handing profile-guided compilation to everyone in the room. When I started doing JVM work, optimizing with profile feedback was only done by vendors chasing the latest, greatest benchmark number; profiling definitely bought a large speed improvement, but it was too cumbersome for everyday use. Bringing it into the JVM let us do it on the fly, every time, everywhere, and it brings all these performance benefits that you now just accept as "hey, Java is fast." Another major speed issue was virtual calls — the illusion that virtual calls are fast. In the land of C++, of course, you have virtual calls, but they're slow, so you have to ask for them by saying "virtual", and then you don't ask for them very often, so they stay slow and there's no point optimizing them — they're just not common. Java went the other way: you get them by default; everything is virtual except where you didn't actually override. So what do you do? You have to make them fast by default, and it turns out you mostly can. The JVM does class hierarchy analysis and discovers there's only one possible target, so you get a static call out of a virtual call; sometimes a new class gets loaded, that turns out to have been a mistake, and you have to re-JIT — okay, fine, you do. If that doesn't work, you mostly discover that a call site goes to only one target, so you use an inline cache, and again the speed comes back down to a static call. And if all of that fails, you do an actual virtual call, the same as C++ does every time. So the cost comes back down to the cost of a static call whenever your behavior looks like a static call, and you never had to think about it — the illusion that virtual calls are fast becomes a reality. (There's a sketch of this monomorphic-call pattern below.) Another thing the C folks could never do is load partial programs: they always had the whole-world assumption come along, usually at the link step — sometimes, for performance reasons, earlier, and then your compilation cost went through the roof. In Java you can just load code on the fly, any time, and it gets compiled on the spot when it's needed and becomes as fast as the code in the program you started with. It may require unwinding some optimizations, reprofiling, and recompiling, and that just happens.
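Here is a minimal sketch of the shape the JIT exploits: a call site that is virtual in the language but monomorphic in practice, which class hierarchy analysis or an inline cache can turn into a static (and inlinable) call until a conflicting class shows up. The class names are made up for illustration.

```java
abstract class Shape {
    abstract double area();                 // virtual by default in Java
}

final class Circle extends Shape {
    final double r;
    Circle(double r) { this.r = r; }
    @Override double area() { return Math.PI * r * r; }
}

public class Devirt {
    // While Circle is the only loaded subclass of Shape, the JIT can prove
    // s.area() has exactly one target, call it statically, and inline it.
    // Loading a second subclass later forces deoptimization and a re-JIT.
    static double total(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) sum += s.area();   // monomorphic call site in practice
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = new Shape[1_000];
        for (int i = 0; i < shapes.length; i++) shapes[i] = new Circle(i);
        System.out.println(total(shapes));
    }
}
```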
So the illusion is that you can incrementally build up your program — every Java J2EE server on the planet does that: they pull in more code and more code and more code, run it and run it and run it, and everything they touch eventually gets fast, and it's all good. Another thing you get is a consistent memory model, so you can reason about parallel programs. All the pieces of hardware out there have very different memory models, and it turns out they vary not only fairly widely from machine to machine but across generations of the same machine. We all think x86 has this very conservative memory model; the real answer is that if you roll back ten years, it didn't really have one at all — the memory model depended on what the motherboard vendor did and how the chips communicated, so the JVM had to deal with Compaq motherboards versus DEC Alpha motherboards versus whatever, and they'd behave differently. Other chips people commonly use have much more aggressive memory models than the x86 one — ARM is probably the best known — and I've worked on a bunch of varieties of hardware with very aggressive memory models. The real short story is they're all different, and none of them match the Java memory model semantics, so the JVM has to bridge the gap — I think I have a slide on that — yes, the JVM bridges the gap, but it has to do so while keeping the cost model cheap: loads and stores have to remain fast so you still understand where the speed goes. With the right combination of memory fences, code scheduling, and placement of locks, you get back both performance and a well-understood memory model of how threads communicate. (There's a small sketch of that model below.) It requires detailed knowledge of the hardware and close cooperation from the JIT; it's not something you can just patch into somebody else's compiler afterwards. There's also a consistent threading model, and again this was very different back in the day — it wasn't Linux everywhere, there was AIX and Solaris and a dozen other OSes — but even today you've got Windows and whatever runs on your phone, pick your OS of choice, and Java covers up the differences in the OS's threading model, anywhere from small devices to thousand-core machines. synchronized, wait, notify, join all just work, and they work efficiently — that's one of the keys: not just that they work, but that if you have a hundred thousand Runnables piling onto some stupid lock, all waiting for a notify, you don't get a hundred thousand things woken up that all try to run and then all go back to sleep; you actually get efficient, reasonably fair locking. That is the JVM covering up for the sins of the OS. Which brings us to locks — the illusion that locks are fast. Obviously if you contend on a lock you're not fast; you have to block and go into the OS, and you would like fairness from the OS, but it turns out that every OS I've ever worked on does not provide fairness on locks. With ten threads, or a hundred, maybe it's reasonably fair; with a hundred thousand threads piling onto one lock, somebody starves indefinitely and never runs. So the JVM cannot rely on the OS for all the properties of locks; it has to do something else when the count of threads on a lock gets big enough.
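As a concrete instance of the memory model the JVM guarantees on every piece of hardware, here is the classic safe-publication idiom — a minimal sketch: the volatile write/read pair establishes a happens-before edge, so the reader is guaranteed to see the fully written payload, and the JIT emits whatever fences the target CPU needs to make that true.

```java
public class Publish {
    static int payload;            // plain field, written before the flag
    static volatile boolean ready; // volatile: its write happens-before any read that sees 'true'

    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            payload = 42;          // 1: write the data
            ready = true;          // 2: publish (volatile write — release)
        });
        Thread reader = new Thread(() -> {
            while (!ready) { }     // 3: spin on the flag (volatile read — acquire)
            System.out.println(payload); // 4: guaranteed to print 42, on x86 and ARM alike
        });
        writer.start();
        reader.start();
    }
}
```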
But people do lock a lot, so locks got optimized, and they have to run fast: biased locking is just a handful of clocks when it works, and you get very fast user-mode locks in pretty much all situations. Why is this happening? Because people still don't know how to program concurrently — I gave this talk five years ago, and I believe that statement is still true; we don't know how it works — so you get into this mode where you just add a lock to fix your bug, and the bug goes away, right? You get a lot of junk locks. Okay, fine: locks are common, they're taken all the time, they have to be optimized, and they turned out to be optimizable. You get this particular concurrent programming style of "I'll lock around every possible thing," and it mostly kind of works. We've learned a lot about concurrent programming as a result, and there are some better answers out there, but I don't think there's a clear winning solution yet; it's still commonly the case that synchronized keywords show up, and they show up a lot. Then there's quick time access: System.currentTimeMillis. Why do I care? Well, when I first started in the land of HotSpot and Java virtual machines, there were all these benchmarks — one in particular would call currentTimeMillis billions of times a second. Count them: billions per second. Not on one core — you couldn't run billions per second on one core 10-15 years ago — but across a big shared-memory multiprocessor server you would get billions of calls a second. It's still fairly common in nearly all large apps. And then there's that funny line in the middle: real Java programs really expect this. What the hell does that mean? It means that if some thread calls currentTimeMillis and gets a value one milli off from some other thread's call, then thread one makes an implicit assumption that there's a happens-before relationship between the two threads. That is to say, if thread one loads currentTimeMillis from memory, and thread two loads currentTimeMillis, and they compare them and they're off by one, then thread two "knows" that its load came after the first guy's — clock cycle by clock cycle, not millisecond by millisecond. That's a really hard invariant to maintain, because there are a million nanoseconds in a millisecond, and nanoseconds are roughly the timescale these loads happen on. But if you don't honor this property, real Java applications crash, because they get things out of order: they assume that because my millisecond was less than yours, my transaction completed before yours began; if that's not true, you die. We tried a couple of ways around it and discovered that you must actually provide this property. (There's a sketch of the pattern applications rely on below.) Now, up until about 2012, if you were on an x86 and you grabbed the TSC — the counting register used for high-accuracy time — it was not coherent across cores, and it was definitely not coherent across sockets on a motherboard. Even on one chip, processors that were idle would tick slower in low-power mode, and their counts would drift from a core running fast, so if a thread jumped from a busy core to an idle core, its currentTimeMillis could run backwards, because the TSC was ticking at a different rate. You couldn't use it; it was a real mess. In fact, it led to this magical flag, -XX:+AggressiveOpts, which basically said: hey, we're going to cheat on time.
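Here is a minimal sketch of the implicit assumption described above — real applications timestamp events in different threads and then treat a smaller millisecond value as proof of ordering, which is only sound if the JVM keeps currentTimeMillis coherent and monotonic across cores. The Event record and commit logic are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

public class TimeOrdering {
    record Event(long millis, String what) { }

    static final AtomicReference<Event> lastCommit = new AtomicReference<>();

    // Thread A: commit a transaction and stamp it with wall-clock time.
    static void commit(String what) {
        lastCommit.set(new Event(System.currentTimeMillis(), what));
    }

    // Thread B: applications routinely reason like this — "my stamp is not smaller,
    // therefore their commit happened before mine started."  That inference only
    // holds if the clock never runs backwards when a thread hops between cores.
    static boolean startedAfter(Event other) {
        return System.currentTimeMillis() >= other.millis();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> commit("tx-1"));
        a.start();
        a.join();
        System.out.println(startedAfter(lastCommit.get())); // expected: true
    }
}
```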
We know this app won't crash, so we'll use the TSC register even though it's not monotonic. You couldn't use that flag if you were running, say, WebSphere, because you would crash; but if you were running that one magical app-server benchmark, whatever it was, it worked, and you got a lot faster because you could just grab the TSC directly. And this led to the notion of: well, isn't there a better way to get time? Of course there is. All you need is a plain load from a page updated by a background thread — have the Linux kernel bump a counter in a shared page a thousand times a second, and everyone can just read it. It would be coherent across all the cores, for no more than the cost of a load. We played with it for a while: 10% speed-up on that key benchmark, whatever. Then along came hypervisors and VMware, and people love to play hypervisor games. Because the TSC sucked so badly, the hypervisor people wanted to help, so they jumped in and said: we'll intercept the TSC read and make it uniform, monotonically ticking, everything you want out of a good time register — except that, because they intercepted it, it got about a hundred times slower. And if you're calling something a billion times a second and it gets a hundred times slower, you notice. So that was not helpful. Okay, now let me talk about some illusions that people hoped to have, or thought they had, that actually aren't there — things you maybe wish you had. The illusion of infinite stacks, i.e., tail recursion: there are functional languages that would love to have tail calls; it's not in the JVM now, and it's not actually that hard to put in. I thought it would have come along years ago, but I bailed out of this game, so it's up to Oracle now. Closures — running code as data: if you have actual JVM-level support for closures, you can do interesting things. Java has these thunks that aren't quite really closures, so you kind of, sort of get first-class functions, but not really — you get really close, lambdas help a lot, but it's not all the way there — and so various people swear at the JVM for not providing real closures. "Capital-I Integer is as cheap as int," i.e., autoboxing optimizations. The magical badness about capital-I Integer and autoboxing is that it autoboxes silently if you make a mistake, and once you autobox, you allocate every time you make an int, and the allocation misses in cache essentially 100% of the time, guaranteed. Furthermore, because the value is a final field, there has to be a memory fence, even on an x86. And then, because you have a bunch of these capital-I Integers running around, they all alias with each other, so the JIT can't reorder them or hoist them into registers, and suddenly you get this massive slowdown in some piece of code, plus a huge pile of allocation, plus you burn a lot of memory bandwidth — and it's like: what did I do? I turned a little-i int into a big-I Integer by accident — whoa, what happened here? (There's a sketch of the trap below.) Better autoboxing optimizations would help, but I think there's actually a language failure there: you can't say "this code is performance sensitive, warn me if I'm autoboxing." And that brings us around to the BigInteger thing: JavaScript has this notion that all ints are infinitely sized.
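A minimal sketch of the accidental-boxing trap just described — one capital letter turns a register-resident counter into a per-iteration heap allocation, plus the cache misses and lost optimizations that come with it. The loop bound is arbitrary.

```java
public class BoxingTrap {
    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long sum = 0;                    // primitive: lives in a register, no allocation
        for (long i = 0; i < 50_000_000L; i++) sum += i;
        long t1 = System.nanoTime();

        Long boxedSum = 0L;              // one capital letter: every += unboxes, adds, re-allocates a Long
        for (long i = 0; i < 50_000_000L; i++) boxedSum += i;
        long t2 = System.nanoTime();

        System.out.printf("primitive: %d ms, boxed: %d ms (sums %d / %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sum, boxedSum);
    }
}
```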
What that really means is that when you overflow a small int, you have to switch to some complicated structure — but most people don't ever actually overflow, so I think Java got this one right. Although, if we wanted silent overflow to BigInteger, you can't do that performantly without help from the JIT directly; otherwise it's a language-visible feature and you have to ask for it. In Java you ask for it; in JavaScript everyone suffers a little for the ability to accidentally flip to a big int, because they have to pay for all the extra tests that check for it. (There's a sketch of the checked-math building block involved at the end of this passage.) One of the ways you might hope to get better at concurrent programming is to have some sort of atomic multi-address update. Unfortunately, that never really panned out in practice; only recently do some hardware chips have the ability to update more than one word atomically, and there's no language-level support for it. People tried software transactional memory for many years to provide the illusion in software, and the answer, of course, is that it's much too fragile — it never actually worked. invokedynamic, I think, is actually here now — and by "here" I mean here and performant enough that you could actually use it. When I first wrote this talk five or six years ago, invokedynamic was a concept, but it wasn't performant enough to be worth bothering with. Okay, what's going on here? Well, there's this giant mass of code approaching twenty years old, and even though I've been out of the game for a while, it's very clear, given the rate of change, that it still has issues: large chunks of the code are fragile or, honestly, very fluffy per line of code — there's too much crud to make good forward progress. I'll give a nod to Oracle and Brian Goetz and those folks for trying very hard and making some progress, but it's clear there are issues with adding a lot of the features you might want to add. Here's another illusion we'd like to have, or think we have: thread priorities. On Linux, mostly, you don't have them — you do, but you have to be root, and of course nobody wants to run their application as root in production, so you're not root, and all you can do is lower your priority; you can't raise it. That means a thread can sort of commit suicide, but you can't declare that this thread is important. So, for instance, if you have a concurrent GC, the concurrent GC threads have to actually run if it's going to be concurrent; if all your mutator threads are running full blast and the concurrent threads are not running, they never catch up and you get a GC pause. You'd fix that by raising their priority, but you end up raising it for the entire box, because you can't raise it within a process — and that means a low-priority JVM doing batch work, burning all the cores, has high-priority threads doing concurrent GC, and those are going to starve some other JVM. Write once, run anywhere kind of, sort of works, but scale matters: what you do for programs that are very small versus very large is very different, and you have to think about the problems differently — it's not the case that programs just up and run everywhere.
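Circling back to the overflow point at the top of this passage: the JDK already exposes checked math that a JIT can map onto the hardware's overflow flag, which is the building block a fixnum-style "promote to BigInteger on overflow" scheme would need. A minimal sketch, with the promotion policy made up for illustration:

```java
import java.math.BigInteger;

public class CheckedMath {
    // Add two longs; stay in the fast fixnum (long) representation unless the
    // checked add overflows, and only then fall back to BigInteger.
    static Number addWithPromotion(long a, long b) {
        try {
            return Math.addExact(a, b);          // the JIT can use the CPU overflow flag here
        } catch (ArithmeticException overflow) {
            return BigInteger.valueOf(a).add(BigInteger.valueOf(b));
        }
    }

    public static void main(String[] args) {
        System.out.println(addWithPromotion(1, 2));               // 3 (still a long)
        System.out.println(addWithPromotion(Long.MAX_VALUE, 1));  // 9223372036854775808 (BigInteger)
    }
}
```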
Finalizers — what a great concept, and what a horrible... I don't know of an actual valid use case, because they have no timeliness guarantees. Finalizers run eventually, but eventually might be never. There was a situation where Tomcat used finalizers to close file handles in its web-serving work: it had a very high turnover rate on requests, a file handle per request, and periodically a full GC cycle would come along and all your file handles would come back. But then heaps got bigger, full GC cycles got further and further apart, and suddenly you ran out of OS file handles and your Tomcat would crash — until the next full GC cycle. That was such an egregious situation that we ended up putting a callback from the file-handle allocation path — when you didn't get one from the OS — into the GC, to demand a full GC cycle even when there was plenty of heap to spare, just to get finalizers to run, and then you asked the OS again: can I have a file handle now? This is completely the wrong way to deal with OS-level resources; there are no timeliness guarantees on when those resources come back. Fine, file handles, okay — but are you going to do the same with the byte buffers backing your video game and your screen? How about byte buffers? I can name 27 more in a row. This is not the right way to handle OS resources, and finalizers put a huge burden on the GC. How about soft, weak, and phantom refs? Using them is essentially asking the GC to manage a cache, and the problem is the GC has no idea why your cache exists or what it's trying to cache. So this situation would occur: you have a server, it's running, you ramp up load, the load gets higher and higher, and the server is working well — the cache is working, because that's what caches are supposed to do, it's catching most requests, and you have high throughput. Then you bobble on just a little more load and you get a GC cycle, and the GC cycle flushes some refs out of the cache. Then you go back to the cache, it's empty, you miss, and you have to do the work to rebuild the thing that was in the cache. That takes a lot of allocation, so you allocate, but you were already low on memory, so pretty quickly you get another GC cycle and flush some more things out of your cache. You get this vicious cycle where you're constantly flushing the cache and can never get it full, so the server keeps doing all the work to refill it while it keeps getting flushed out from under it, and throughput crashes — and it stays crashed until you kill the load, ramp back up, and it's high-performing again. That's because the GC has no freaking clue what your cache is trying to do, or that this SoftReference is even related to a cache; there's no way for it to understand your application-level caching needs. So it's a great idea, it sounds like fun — we'll have the GC handle your caching behavior and throw things out when it needs to reclaim memory — but it doesn't understand the load requirements and it has no feedback mechanism, so in practice, under load, it leads to a very fragile situation. (There's a sketch of the pattern below.)
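A minimal sketch of the kind of GC-managed cache being described — correct as far as it goes, but the GC clears entries based on memory pressure alone, with no knowledge of hit rates or rebuild cost, which is exactly what produces the throughput collapse under load. The loader function and key type are made up for illustration.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // expensive recomputation on a miss

    public SoftCache(Function<K, V> loader) { this.loader = loader; }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();  // null if the GC cleared it under pressure
        if (value == null) {
            value = loader.apply(key);               // rebuild — more allocation, more GC pressure
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }

    public static void main(String[] args) {
        SoftCache<Integer, String> cache = new SoftCache<>(k -> "expensive-result-" + k);
        System.out.println(cache.get(42));   // miss: computed
        System.out.println(cache.get(42));   // hit — unless a GC cycle cleared it in between
    }
}
```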
Okay, so now I'm going to walk back through these illusions and propose doing different things with them. Here are some services the JVM provides: GC — we all love GC, are you kidding? — the Java memory model, which is woven throughout the GC, the JIT, and the VM itself; thread management; some sort of fast time access; and hiding the CPU details and the hardware memory model — those last two give you a cost model for execution plus a communication model between threads. There are services provided below the JVM, at the OS layer: threads, context switching, thread priorities, I/O, file access, virtual memory protection — a bunch of those. And there are services above the JVM that are pretty common: people lean on thread pools, work lists, some sort of transactional behavior, crypto, caching layers, models of concurrent programming that are not just threads and locks. Maybe alternative languages want new dispatch rules, or a BigInteger, or alternative concurrency models — an Erlang or actor style; there's a bunch of different services you might provide above a JVM. So here's a service I think belongs in the OS, which the JVM currently provides or needs to provide: fast, quality time. You get fast-but-not-quality from the TSC, at least until recently, and even then you need custom code because you can get context-switched between cores, so it's not quite there. You get quality-but-not-fast from OS gettimeofday — it's pretty good, but if your benchmark calls it a billion times a second, it could be faster. And we could have both: all it takes is a memory page where the kernel bumps a word on a timer interrupt, and everyone gets a cache-coherent, up-to-date version of time for no more than the cost of a load. (There's a user-level sketch of that trick below.) Thread priorities: the OS already provides priorities at the process level, but we want thread priorities within a process, because the JVM has to have some threads with higher priority than the mutator threads. The GC has to get cycles to catch up with the garbage being produced by the mutators, or it starves, you run out of heap, and you take a pause — and the whole point of a low-pause collector is to not take a pause. That doesn't work unless you can give the GC threads a priority higher than the mutators'. The same thing happens with the JIT: if you run a thousand threads, all busy, and the JIT threads get starved, those thousand threads only ever run interpreted — they never get JITted — and I have totally seen that happen. So you need thread priorities for the JIT as well. When I was at Azul, we ended up faking thread priorities with duty-cycle-style locks and blocking; we just had to, or we weren't going to get a low-pause concurrent collector. This is something that simply belongs in the OS: it's already doing process priorities, it's already managing threads, it should do priorities on threads. The current Linux situation is that you can't raise your priority without being root — which would be fine if it weren't the case that everyone starts out at max unless you're root. If everyone started at some middle level and could raise themselves voluntarily without being root, this problem would go away. Alternative concurrency models: the JVM does give you thread management — you can make a hundred thousand Runnables and they'll work, with reasonable performance — and it gives you very fast locking. That's great, but the threads-and-locks model of concurrent programming is like an assembly language for concurrency; there's got to be a better way.
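A user-level approximation of the shared time page described above can be faked inside a JVM today — a minimal sketch, assuming millisecond granularity is good enough: one background thread stores the time into a volatile field, and every other thread gets "current time" for the cost of a plain load instead of a call into the OS.

```java
public final class CachedClock {
    private static volatile long nowMillis = System.currentTimeMillis();

    static {
        Thread ticker = new Thread(() -> {
            while (true) {
                nowMillis = System.currentTimeMillis();   // one real clock read per tick
                try { Thread.sleep(1); } catch (InterruptedException e) { return; }
            }
        }, "cached-clock");
        ticker.setDaemon(true);
        ticker.start();
    }

    // Hot-path callers pay only a volatile load, not an OS/vDSO call.
    public static long currentTimeMillis() { return nowMillis; }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(CachedClock.currentTimeMillis());
        Thread.sleep(10);
        System.out.println(CachedClock.currentTimeMillis());  // advanced by the ticker thread
    }
}
```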
There are actors, there's the message-passing model, software transactional memory, the fork/join model, the streaming thing — these are all new ideas about how to think about concurrency, and for exploring them the JVM itself is just too big and cumbersome to move fast; you really need to do the exploratory work above the JVM level, at least until we get some consensus on the right way to do concurrency. Then the JVM might provide building blocks — say, a fast park and unpark, or some specific kind of software transactional memory that overlays the new Intel hardware transactional memory, something like that. Fixnums: some people love them, some people hate them. They're actually best implemented in the JVM, because the JVM can emit the special instructions that check for overflow — all the hardware chips have overflow-checking instructions, and they're not emitted by the JIT today because the JVM doesn't know you're trying to do big-integer math; it's not the default case that everything might overflow. With support for overflowing integer math inside the JVM, you could get high-quality JITted code with roughly the same cost as small-int math all the time, unless you actually flipped over to the big representation — and then, of course, you're doing BigInteger math. It's one of those things, though, where mostly people know their values fit in 64 bits, so they just ask for a long, and that's very efficient; and if they know they won't fit in a long, they know that too, and they can ask for BigInteger. Just don't make it the default, which is what JavaScript does. The GC, the JIT, the Java memory model, and type safety are all strongly tied together, and that, more or less, defines what the JVM is. The GC has deep hooks into the JITting process: where safepoints go, where read barriers and write barriers go, how to optimize them. The Java memory model also has deep hooks into the JITting process and the JVM proper: when threads are allowed to talk, when they're not, and what happens in JITted code. There's some chance that the alternative concurrency models would let you get away with a weaker memory model, but that will still require close cooperation from the JIT. So I argue that these things, taken together, basically define the core guts of the Java virtual machine — that is its definition. OS resource lifetime, on the other hand, I think can be moved out of the GC, out of finalizers; people should do it themselves, and that's what try-with-resources is all about. Stack-based allocation of OS-level resources basically works, and when it doesn't, you can go to reference counting or some sort of arena-level management. The GC is a really bad place to say "go manage this resource for me, and whenever it's next convenient to run a full GC cycle, clean it up" — because it might never be convenient, if you have the right kind of collector. Same thing for weak, soft, and phantom refs: they were originally designed to help people with caches, where you didn't want to keep something alive only because it was in the cache; but again, the GC doesn't know the meaning of your cache, or why it exists, or even that it is a cache, so it's a really lousy heuristic for deciding when something should be thrown out of the cache. There's a better way to go here, but don't let the GC change your application semantics. (There's a sketch of the resource-handling point below.)
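A minimal sketch of the contrast being drawn — tying an OS resource's lifetime to scope with try-with-resources instead of to whenever the collector happens to run finalization. The file path is arbitrary.

```java
import java.io.IOException;
import java.io.PrintWriter;

public class ResourceLifetime {
    // Bad: the underlying OS handle is released whenever (if ever) GC finalization gets to it.
    static void writeViaFinalizer(String path) throws IOException {
        PrintWriter out = new PrintWriter(path);
        out.println("hello");
        // no close(): the OS handle lingers until some later (possibly never) GC/finalizer pass
    }

    // Good: the handle's lifetime is the block's lifetime, independent of GC timing.
    static void writeViaTryWithResources(String path) throws IOException {
        try (PrintWriter out = new PrintWriter(path)) {
            out.println("hello");
        }   // close() runs here, deterministically, even if an exception is thrown
    }

    public static void main(String[] args) throws IOException {
        writeViaTryWithResources("demo.txt");
    }
}
```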
Okay, quick summary: the JVM currently gives you thread priorities, fast time, and OS resource management via finalizers, and I think those things should move out — either down to the OS or up to the application level. Fixnums, tail calls, and closures I think should go into the JVM, and that would enable a bunch of language features that various people have argued and asked for, for a long time. It turns out that GPGC — I hate to say it, Oracle, but just fess up — G1 is nowhere close to GPGC; it's a truly better collector by a lot. It needs a little bit of OS work that is now in the kernel — what really had to happen was user-mode access to the page-remapping and TLB machinery, and that piece has happened already — and this stuff should be moved into Oracle land. Hardware performance counters: this is sort of Intel screwing up a few more times; it required privileged access to get at the counters, but you want the JVM, a user-mode process, to not only get at the counters but probably generate code that reads them on the fly, so you can do fine-grained counter analysis. The JVM is a natural consumer of hardware performance counters — there's all kinds of fun stuff I can do, and have done, with them — but Intel made them genuinely hard to get at. Fine: with that kind of information, the JIT and the JVM are the natural place to map it back to the Java level. You have the performance counters, and you have the mapping inside the JVM, so you know what to inline, what to optimize, where the best places to spend effort are. That information should also be brought out through a standard performance interface for IntelliJ, YourKit, whoever's going to run a profiler — it isn't today, and it really should be. That's a theme: the JVM has the information and the mapping; it should bring them out and let people see what's going on. Okay, so I claim these things ought to happen, and currently they can't, for one reason or another — and, boom, I'm done. That was Cliff running through it in thirty-five minutes, fast-like, so it's time for questions. Yeah — what the hell is this thing I'm working on now? I'm looking for people cheating in the stock market. I'm using big data and machine learning to analyze stock-trading activity to look for fraud. There are various governmental requirements for looking for fraud — internal to companies, for trading firms, for exchanges, for clearing houses — and we have sort of the obvious better answer, by a long shot, over what's currently common practice. We've also been working closely with various regulatory agencies on actual real-life court cases, because we're able to find and pull out things that nobody else is finding — and people who might be on the wrong side of those inquiries also want to know what we're finding. So it's an interesting business use case, and it's written in Java. Yes, it's H2O, my last project, applied to a financial vertical. No, it's running on stock hardware — today's demo ran on stock hardware — and we use the full GC, the standard old-school collector; it works great with H2O and big data. No G1, no CMS, no GPGC, just the stock collector, and we get fantastic performance out of it, because we've arranged the data to work that way. Another fun performance hack: most of my data is stored in big, fat byte arrays — there's a lot of data, but not very many objects — and the
old-school collector really goes well with that. It's my opinion that Oracle would have to try really hard to remove it; they'd just shoot themselves in the foot. It's been the default for so long, and so many people are so wedded to its exact performance characteristics, that they're not going to be able to turn it off for a long time, not until they have a truly better answer that's uniformly better across the board. So I think it will remain supported for a long time; it's my belief that it would be business-wise foolish to stop supporting it. It's also going to have a very low maintenance cost, because it's been alive and well forever and a day — if you don't touch it, it just keeps working. Not that I'm going to speak for Brian Goetz and Oracle about what they end up doing. Right — is there any work on mmap and friends, is that what I just heard you ask? I don't know; I've been out of the Oracle arena for a while, you'd have to ask Brian Goetz and company. My understanding is that as of a release or two ago there was a big push to make memory mapping behave better, and that's the current state — there are more reasonable things you can do with it now, you can unmap. I don't know about the 2 GB limit; that's a different issue. You're asking about memcached? I ended up doing my own caching in the Java heap, so I don't have memcached or mmap issues, but people do use memcached and want to know how that works; as far as I know you can now do more with it, but I'm the wrong guy to ask. Okay, I see a lot of people hanging around hoping another interesting conversation gets started — for that to happen, you have to actually speak up. Yeah — so, I've talked with the Graal guys a fair amount. They're in a funny situation where they've been behind the curve, playing catch-up for a long time; for some set of interesting benchmarks they've finally caught up, for some other benchmarks they clearly have not, and then there's a domain of mixed Java and non-Java — like C native code — where they can cross the boundary at no cost and even optimize across it, and there they're well ahead. So they've reached a point where the implementation has some interesting properties: you might want to use it if you fall in the domain where it's clearly better; if you don't, it's pretty questionable, because you're picking up a new technology with a lot of uncompleted, unfinished warts you'll have to deal with. But if you're experimenting with a new application and you know you might get into the zone where you cross the native-code boundary a fair amount, or you're mixing Python — actual Python, not Jython — with your Java code or whatever, then yeah, take a look at it; there's something there. One of my comments earlier was that the JVM has been around forever and a day and the code is fluffy. I did a lot of rip-and-replace at Azul and got a lot of mileage out of it, and Graal is essentially a rip-and-replace of various interesting components, so it might also be more maintainable and therefore faster to move forward. I'm of the opinion that they've pushed hard for a long time against the leader, and maybe they're at a point
where they can make something of it — where they can actually challenge HotSpot directly. I'm on the sidelines watching; I'm cheering for both parties. Yeah — to what extent is Java performance driven by non-Java languages? So, invokedynamic was the big one: the non-Java languages were screaming out for a way to do virtual calls with different dispatch semantics. Other things, I don't hear a lot of screaming pushing them forward. I mentioned a bunch of things, including fixnums, that have been brought up to me repeatedly, both before and after I left Sun — "can't we do something here?" — and the answer is that it's not too hard to do something, and it would enable fixnums and big integers, which enable JavaScript-like things; tail calls go to everyone doing true functional-programming-like things; closures likewise. Yes, something could be done; there just doesn't seem to be enough pressure to make it happen yet, so for the moment I think internal JVM performance work is looking in other directions. I know Graal has been a long-running project with a lot of resources going into it; I know people have tried and experimented with other kinds of performance optimizations, like escape analysis, which I don't believe was found to be very effective, though maybe they want to do something else with it. I don't know what else is going on in C2 — I haven't heard from those guys in forever and a day. I did a lot of stuff that Oracle maybe should or could have copied, including knowing what a NUMA node is and doing allocation and optimizations around that; I did a bunch of work on interfaces and interface-call optimizations that, again, would require major C2 plumbing. I don't know if anything else is going on at the JIT-compiler level. On the GC side, G1 is a giant pile of work. What else? What Azul did for GPGC included replumbing how threads are stopped and how exceptions are thrown, and that replumbing enabled a really cheap way to do biased locking — much cheaper than what Oracle ended up with — as well as the ability to stop and start single threads at nanosecond granularity, which turned out to be what you need for a low-pause collector. That's something Oracle ought to be doing; it's not really JIT work, but it's certainly JVM performance work. Really, though, this is an "ask Brian Goetz" question — I'm sure he'll have an answer for what's going on inside the JVM. There's a lot of work going on above it: streams, right, that was all non-JVM stuff. Although, thinking it through, streams bring in a new coding style for which you want certain optimizations done, and you might tweak the compilers to make sure those optimizations happen, so something may have been done there to make streams go faster. I know fork/join went through rounds of optimization in the VM at various levels, because there were all kinds of fun games with too many threads or not enough threads, and with what waits and notifies mean when you have huge counts of threads constantly juggling tasks back and forth. Someone was raising their hand — maybe they were stretching. Okay, this is a real question: do I think Intel's HTM can help performance? Well, HTM covers only a very small count of things you can do
atomic updates on, so it's only going to apply where you have a small count of things to update atomically, which in turn means it's restricted to cases where people understand very carefully what can be done atomically and why. That probably means its best use is to be buried in a library call, allowing some sort of concurrent update on some shared structure at low cost — so it probably means making something in the JDK locks, the tryLock variants and reader/writer locks, more efficient; it has to get buried in a library somehow to be useful. That said, we've largely worked around these problems: people screamed at Intel for years on end, Intel in its infinite wisdom stalled for a very long time, so people worked around it, and now the need for multi-word atomic update is less than it used to be because we have better answers in many cases. So I think there's some room there, but I don't know that there's a lot of performance room. Where it might pan out in a more aggressive way is very specific use cases where you want multiple cores with very tight coupling — high-frequency trading and the like; there are a few domains where people pay big bucks for that, but I don't know of many other domains where it's going to matter. There might be something in the OS itself — maybe when it's doing context switches among lots of threads and has to do a bunch of state changes, it could lower context-switch costs; maybe, I don't know — but that's not a JVM question at all. Yeah, no, I thought I said that already: GPGC is a much better GC in a lot of ways — hugely more performant, hugely lower latency, hugely higher throughput, like, what the hell. Having said that, I fiddled with G1 some years ago and it was clearly a loser to GPGC, and it was clearly a loser for my use cases, where I was targeting my heap allocation patterns to work well with the stock VM. It's been a few years; maybe it's better; I could try it again, but I haven't. The numbers I hear from Oracle don't look very impressive compared to the numbers I know Azul is getting. I think Google or Oracle should buy Azul's tech, shove it into the JVM, and just turn that sucker on as the default, and I think people would jump on it — but unfortunately I don't control either of those companies. Okay, help me out — which one is Valhalla? Oh, value types. Okay, value types, sure. Actually, if I were looking at JVM work that might go somewhere, that one has some legs. H2O essentially did value types manually, using Unsafe and big piles of primitive byte arrays — it's really "how do I get rid of the object header on a zillion tiny objects?" If I use capital-D Double — because I love my hash tables and I'm using Double keys for whatever reason — I end up with eight bytes of payload and something like 32 bytes of object: a four-to-one loss ratio. If I have a billion of these things at a four-to-one ratio, I lose a lot. That loss turns into allocation time, it turns into memory bandwidth, and your caches effectively get 4x smaller because of all the headers you're hauling around. I picked capital-D Double as the sort of obvious case — why would you ever do that? But people do.
People do this with Integer and with Long without meaning to; it happens with autoboxing, for instance. So there's something to be said for having a structure that looks like a full-fledged object but where you can make an array of a zillion of them and have it perform without all the overheads. Now, having said that, the way I do it is to rotate my data structures 90 degrees: if I need a zillion tiny objects, instead of an array of structs I have a struct of arrays, and I write accessors that make it transparently behave. You have a two-dimensional structure, accessed one way by field name — the point being x, y, and z — and the other way by array index, so you write an accessor that doesn't care which way it goes, and under the hood you flip the thing: you have a bunch of arrays, the arrays then have linear striding access patterns, there are no headers except one per array instead of one per element, and so on. You get all the memory speed out, you get the bandwidth, you get the hardware prefetching, you lose the object headers — you have all the right properties — except you can't take an individual point, move it out of that array, and pass it around. In the Valhalla world that would sort of auto-box it back into an object, if you will; for me, if I want to do that, I explicitly box or unbox it as the case may be, but I get my performance out in the cases where I have large arrays of structs, and that's in fact what H2O basically does, more or less — that's the major theme there. (There's a sketch of the rotation below.) All right, I'll be around for a while; if you want to talk to me but are too embarrassed to talk in public, come talk to me one-on-one. I'll be here.
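A minimal sketch of the "rotate the data structure 90 degrees" trick described just above — a struct-of-arrays point store with accessors that hide the layout, so you keep linear striding and one header per array instead of one per element. The class and field names are made up for illustration.

```java
public class PointStore {
    // One array per field instead of one object per point: three headers total,
    // dense linear layout, hardware-prefetch friendly.
    private final double[] xs, ys, zs;

    public PointStore(int n) {
        xs = new double[n];
        ys = new double[n];
        zs = new double[n];
    }

    // Accessors hide whether the layout is array-of-structs or struct-of-arrays.
    public double x(int i) { return xs[i]; }
    public double y(int i) { return ys[i]; }
    public double z(int i) { return zs[i]; }
    public void set(int i, double x, double y, double z) { xs[i] = x; ys[i] = y; zs[i] = z; }

    // Explicit "box" only when a single point has to escape the store.
    public double[] copyOut(int i) { return new double[] { xs[i], ys[i], zs[i] }; }

    public static void main(String[] args) {
        PointStore pts = new PointStore(1_000_000);
        for (int i = 0; i < 1_000_000; i++) pts.set(i, i, 2.0 * i, 3.0 * i);
        double sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += pts.x(i);  // streams through one dense array
        System.out.println(sum);
    }
}
```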
Info
Channel: Devoxx
Views: 16,377
Rating: 4.9186993 out of 5
Keywords: DevoxxBE2016
Id: -vizTDSz8NU
Length: 51min 59sec (3119 seconds)
Published: Fri Nov 11 2016