Performance tuning Twitter services with Graal and ML - Chris Thalinger

Captions
Welcome! Ten years of Scala Days, who's excited about this? Yeah, a few. All right, well, at least you're here, that's the most important thing.

So my name is Chris, I work for a tiny little company that's here on my shirt, you might know it, it's called Twitter, and I do a lot of conferences. I've already done a lot of conferences this year giving this presentation, most of the time at very Java-centric conferences, but this one, Scala Days, is actually the most important one for me. Because, A, Twitter uses a lot of Scala, and B, Graal and all the technology we're using is so good for Scala code that it's almost unbelievable. So you need to pay really good attention right now, and then you go home and try the same things we did.

I know it's already the second day, or the third, but I really want you to tweet about this event. It's ten years, so tweet about it, put a bunch of hashtags in, tweet about all the sessions that you liked, maybe tweet about the ones you didn't like, just don't do that with mine, thank you. If you tweet about my talk, add the hashtag #TwitterVMTeam, because Twitter actually has a VM team, and we are not that small anymore, we are now I think eight people or something. We have a bunch of JVM engineers. We have three GC engineers, and they are constantly busy doing GC support, all the issues that you have with GC, like everyone in the world does. Then I'm a compiler engineer, so I do JIT compiler work. A colleague of mine, his name is Flavio, helps me out a little bit, and this might be interesting to you: he does Scala-specific optimizations for Graal. Some of them are already upstreamed, and we have one in the pipeline that's a little more complicated and convoluted, but that will bring another performance boost. Then we have some people working on infrastructure, testing and so on, we build our own JDKs and we have to do that, and then we have some people working on a tool called Autotune, and Autotune is the thing that I'm going to talk about today.

A little bit about me: I've been working on JVMs for a very long time, I think it's 15 years now, and all of those 15 years I've worked on compilers. I'm going to ask you a bunch of questions because I want to know what you know, so please raise your hands when I ask. Do people know what a JVM is? All right, I mean, we start low. Who knows HotSpot? Yes, the JVM of OpenJDK. Who knows what a JIT compiler is? Oh, that's a lot of people. All right, so just for the few who don't: basically, when you have Java bytecode, whether it comes from Java or Scala doesn't matter, the JVM takes that Java bytecode and interprets it, which is really slow, and then there are compilers that compile the Java bytecode into native code on the fly, while you run your application. We call it just-in-time. The other end would be AOT, ahead-of-time compilation, which would be something like GCC, where you compile something into a native executable. So these are JIT compilers; I worked on JIT compilers and still do. I worked at Sun and Oracle on the HotSpot compiler team, mostly on C2. Who knows what C1 and C2 are? All right, fewer hands. So HotSpot has two JIT compilers: one is called C1, or the client compiler, and the other one is called C2, or the server compiler. You might have heard the -client and -server flags before; don't use them anymore, we don't need them anymore. But there are two JIT compilers, and I mostly worked on C2.
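If you want to watch that interpreter-to-JIT handoff at home, here is a minimal sketch. The object Hot and the method square are made up for illustration; the real pieces are the standard HotSpot flag -XX:+PrintCompilation and the scala runner's -J prefix for passing JVM options.

    // Hot.scala: a tiny hot loop so the JIT has something worth compiling.
    object Hot {
      // Small and called a hundred million times, so it gets compiled quickly.
      def square(x: Long): Long = x * x

      def main(args: Array[String]): Unit = {
        var sum = 0L
        var i = 0L
        while (i < 100000000L) {
          sum += square(i)
          i += 1
        }
        println(sum)
      }
    }

    // Compile, then run while printing JIT compilation events:
    //   scalac Hot.scala
    //   scala -J-XX:+PrintCompilation Hot

Methods show up in the compilation log as they graduate from the interpreter to compiled code, which is exactly the handoff Chris is describing.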
So, C1: the purpose of C1 is to get away from interpreting code as quickly as we can. It does a little bit of optimization, nothing too crazy, some inlining, some other stuff, but nothing fancy; we just want to be running native code. And then C2 is a highly optimizing compiler, so it does all the fancy optimizations you can think of: certainly more inlining than C1, escape analysis, loop unrolling, auto-vectorization, all that stuff. So that's C2.

These three projects are basically the biggest ones I worked on in my time at Sun and Oracle. JSR 292, invokedynamic, you might know it. I'm not sure how many people here do Java, but if you use Java 8 lambdas you're actually using invokedynamic under the hood, that's how it's implemented. We had two implementations for JSR 292; that was Java 7, a long time ago. The first one was a lot of handwritten assembly code, pages and pages of assembly, which we had to port to all the architectures that we support. It was a major pain in the ass to write, a major pain to support and maintain, and then on top of it we had a performance issue. So we completely redesigned the whole thing and moved all the handwritten assembly logic into Java, into a package called java.lang.invoke, and a lot of that Java code in java.lang.invoke I wrote. So if it doesn't work you could technically blame me, but I always say the code I wrote was perfectly fine and the people who touched it after me broke it. That's how I look at it.

JEP 243: in JDK 9 we introduced the Java-level JVM compiler interface, also called JVMCI. So, Graal is written in Java... maybe I should ask this now: who knows what Graal is? Not GraalVM. Other hands went down, everyone's confused. So GraalVM is an umbrella term for basically three different technologies: Graal the JIT compiler (who knows what JIT compilers are? All hands should be up, I just explained it), so Graal the JIT compiler; then Truffle, which is a framework to implement language runtimes; and SubstrateVM, which you might know as native image, that's what everyone's super excited about right now, compiling things to native. If you saw Stu and Danny's talk yesterday, they're using native image to speed up compilation, but we in the VM team only use Graal the JIT compiler, and that's basically what I'm talking about, that's what we are using at Twitter. It's written in Java, and Graal needs an interface because it has to talk to HotSpot, which is written in C++. So there's an interface that Graal uses, and we just extracted that interface, put it in a Java 9 module, and called it JEP 243, JVMCI. It's kind of stable-ish; it's not an officially supported API, but it hasn't changed a lot since 9, so basically if you have a Graal jar file you can just take JDK 9, 10, 11, 12, 13 and run with it.

And JEP 243 we did for JEP 295, ahead-of-time compilation, also in JDK 9. This is not native image, it does something different. It's basically a small command-line utility where you pass in class files or jar files, it takes all the methods in there, sends them off to Graal, compiles them, and spits out a shared library at the other end. Then HotSpot can pick up that shared library, and basically you are skipping the interpreting step. That might help if the application is really, really big and runs a lot of different Java code when it starts up; it might help you with startup, but it's very tricky. The main difference between this one and native image is that this one is actually Java, the full Java specification; native image is just a subset of Java.
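For a sense of what JEP 295 looks like in practice, here is the shape of an invocation, assuming a JDK 9 to 13 build that ships the jaotc tool. The class name is a placeholder (and for a Scala class you would also need the Scala library on the class path); jaotc and -XX:AOTLibrary are the real pieces from the JEP.

    # AOT-compile one or more classes into a shared library:
    jaotc --output libhello.so HelloWorld.class
    # Tell HotSpot to load the AOT-compiled code at startup:
    java -XX:AOTLibrary=./libhello.so HelloWorld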
So yeah. And now I work at the greatest company on the planet, Twitter, and this is what Twitter looks like: we have hundreds and hundreds of microservices, and thousands of instances of these. And on top of that we run on heterogeneous hardware. We own our own data centers, so we actually know what hardware we own; you probably run in the cloud and don't even know what machines you're running on. One problem with this is that when you want to do performance tuning, you don't even know what you're tuning for.

So, performance tuning, or optimization in general: hand-tuning doesn't scale, we all know that. The way we do performance tuning today is: you sit at work at your desk, you are annoyed by the performance of something, and you decide, OK, I'll spend a couple of hours and tune this. And this happens every three years, five years, never, right? That's how we do it today. And you can only tune a few parameters, because you have to build up a mental model in your head to understand what's going on: if I change this one, then that happens. It's very time-consuming, labor-intensive, error-prone. The most important point on this slide is that updates make optimizations fleeting. We all live in the super cool agile world and you deploy a hundred times a day because it's so cool, but that means the code is constantly changing. Even we deploy, not multiple times a day, but we have some services we deploy multiple times a week. So if you do performance tuning today, it's not valid tomorrow anymore, because the code changed; the code is constantly changing. So an important point is that many services operate below optimality, and that's certainly true for Twitter services; we are in the same boat as you right now, there's not much difference.

Here's the kind of theoretical part: what is performance tuning? You have a function f defined over a domain X, and what you want to do is find a configuration that maximizes f, whatever f is. It could be throughput, it could be reducing latency, reducing memory consumption, it could be anything. One important point is that it's always subject to some constraints, and the number one, most important constraint is: it still has to work. You will see with the experiments I did that you can actually tune too far, so that it doesn't work anymore.

Good. This is performance tuning as we all know it; we've probably all done it at least once in our lifetime. You pick a parameter to test, run it on the system, you measure your f, you get the result back, and then you, the performance engineer, a highly-paid person I hope, have to decide if the value you got back is better or worse than before. Well, this is ridiculous. I mean, maybe we could get a monkey to decide if it's better or not. What we really want is some black box. We don't need to know what the black box is or what it's doing; the only thing we need to know is whether the thing we just measured is better or worse than before, and what to try next. That's all we need. And we use something called Bayesian optimization for that. It's a method to learn potentially noisy objective functions iteratively and efficiently. That's important, because as I said earlier, we can't wait forever for a result, the code is constantly changing, we need a result in some reasonable time. It finds near-optimal configurations in few iterations, and it works well with non-linear, multi-modal, high-dimensional functions. That last point is important because, if you think of the JVM, or anything else really, there are hundreds of parameters you could tune, especially on the JVM. Some people ask me after my talk, could I tune my database? Yeah, of course you can. I'm just using the JVM because that's what I know; you could tune anything you want, tune your Minecraft if you want.
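To make that split concrete, here is a rough Scala sketch of the loop he is describing: maximize f(x) over configurations x, subject to the constraints holding. Every name in it (Config, Optimizer, suggest, observe, tune, evaluate) is hypothetical, not Autotune's actual API; the point is only that the driver reports scores and the black box decides what to try next.

    // A configuration is just a set of parameter values.
    final case class Config(params: Map[String, Double])

    // The "black box" (for example Bayesian optimization): it proposes configurations
    // and learns from the scores it gets back. It never needs to know what f measures.
    trait Optimizer {
      def suggest(): Config                                  // what to try next
      def observe(c: Config, score: Option[Double]): Unit    // None = a constraint was violated
    }

    // The driver: evaluate, report, repeat.
    def tune(opt: Optimizer, iterations: Int)(evaluate: Config => Option[Double]): Unit =
      for (_ <- 1 to iterations) {
        val candidate = opt.suggest()
        val result    = evaluate(candidate)   // deploy an instance, measure f, check constraints
        opt.observe(candidate, result)
      }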
If you want to know more about this Bayesian optimization thing, my colleague Ramki is the expert; he has given presentations about this, for example at a DevOps conference a couple of years ago. He has a very soothing voice, so listen to his YouTube videos, and you'll see the slides he uses. Basically the only thing I did is steal his slides, and I'm briefly explaining to you how Bayesian optimization works. I'm not an expert in this; if you know Bayesian optimization, don't call me out, don't ask questions. This is just so you understand the data you'll see later a little bit better, and how we got there.

All right. So we have a parameter that affects performance, it goes from minus six to six, and on the y-axis we have performance, higher is better. We have three actual evaluation runs, so the three data points you're seeing are true results. This is the actual performance curve, which we don't know; that's what we want to figure out. What Bayesian optimization does is assume a performance curve, with some certainty, and the blue area above and below the curve is the uncertainty. You see that where we actually have a data point the uncertainty is zero, because we know that result is correct. Overlaid with the actual performance curve it looks like that; it's kind of what it looks like, but not really, but it doesn't matter. Then Bayesian optimization puts in a line at the best result you have right now, and the blue area above that red line becomes the curve at the bottom: that's your expected improvement. The highest point of your expected-improvement curve is the value you try next. That's basically what it does. So we take the highest one, which is the second one from the left, we run an evaluation, we get the data point, and so it goes: the one on the left next, then this one, and then we have exhausted that space; we know the performance curve there, we have the data points, so we move on to the right. We do that one, that was not very good, so we do the one on the very left, that was even worse, and then we do the one over there, and this one, and that one, and this one, and then we've found our global optimum. That's all Bayesian optimization does for us. It's basically what you would do as a human as well, just the machine does it better and faster and cheaper.

All right, that's really all. So we like it: non-parametric, robust, extensible, sure, but the last point is the most important one: it's battle-tested on many types of real-world, high-impact problems. This thing has been around for a while and it works, and that's what we need. We are not there yet, but what we want is for this Autotune thing to be always on in production. We want our services, our hundreds of services, to tune themselves constantly, all the time, and for that we need to rely on the system we're using. So we're using this one because we know it works.

OK, that was the theory part, now let's go to the experiments, the numbers, the graphs. So what is Autotune? It's Bayesian optimization as a service, as I call it. Please get that hashtag trending; I try it at every conference and it never works, so don't disappoint me this time. It's kind of a joke, but it's actually true: it runs as a service inside Twitter, other things at Twitter actually use this Bayesian optimization service as well, and we use it for Autotune.
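To pin down the expected-improvement step from that walkthrough, here is a small self-contained Scala sketch (REPL-style; wrap it in an object for scalac): given a surrogate model's mean and uncertainty at a few candidate parameter values, score each candidate by its expected improvement over the best result so far and pick the highest. The surrogate numbers are invented; a real system such as Spearmint fits them with a Gaussian process.

    import math.{exp, sqrt, Pi}

    // Standard normal pdf, and cdf via the Abramowitz & Stegun 26.2.17 approximation.
    def pdf(z: Double): Double = exp(-z * z / 2) / sqrt(2 * Pi)
    def cdf(z: Double): Double = {
      val t = 1.0 / (1.0 + 0.2316419 * math.abs(z))
      val poly = t * (0.319381530 + t * (-0.356563782 +
                 t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))))
      val p = 1.0 - pdf(z) * poly
      if (z >= 0) p else 1.0 - p
    }

    // Expected improvement of a candidate with surrogate mean mu and std dev sigma,
    // relative to the best score observed so far (we are maximizing).
    def expectedImprovement(mu: Double, sigma: Double, best: Double): Double =
      if (sigma <= 0) math.max(mu - best, 0.0)
      else {
        val z = (mu - best) / sigma
        (mu - best) * cdf(z) + sigma * pdf(z)
      }

    // Invented surrogate predictions (candidate value, mean, uncertainty):
    val candidates = Seq(
      (-4.0, 0.1, 0.40), (-1.0, 0.6, 0.25), (0.5, 0.9, 0.10), (2.0, 0.8, 0.30), (4.5, 0.4, 0.45)
    )
    val bestSoFar = 0.95
    // The next value to evaluate is the one with the highest expected improvement:
    val next = candidates.maxBy { case (_, mu, sigma) =>
      expectedImprovement(mu, sigma, bestSoFar)
    }._1

This is only the selection step; the other half of the job, fitting the surrogate and its uncertainty band to the observed points, is what the Gaussian process does.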
The Bayesian optimization part of that is a thing called Whetlab, which is a company Twitter acquired a couple of years ago. It's unfortunately closed source and we can't open source it, for whatever legal reasons, but it's an enhancement of a framework called Spearmint, and Spearmint is still available as open source on GitHub, if you're interested. So that's the one part, the Bayesian optimization part, and then Autotune is just a driver to run the experiments. You can think of it as a very complicated script that starts and stops instances and measures things; that's really all it is. Right now it's very hard to use Autotune, there's no nice user experience or anything, but once we have that, at some point, we're planning to open source Autotune, and then you could take it, make it work with Spearmint, and do exactly what we are doing right now. We run in Aurora on Mesos, that's what we do; so if we open source Autotune and you run on Docker or Kubernetes or whatever, you would have to write, let's call it, a backend to start and stop instances. Really, that's all you would have to do.

So what is Graal? We talked about this before: it's a Java virtual machine just-in-time compiler. It's actively developed by Oracle Labs; there's a stand out there, walk over and ask them a trillion questions. There's an official OpenJDK project for it, but all the work is actually done on GitHub. It uses JVMCI, as I said earlier, and it's written in Java. The fact that it's written in Java is not important for this talk, but if you're interested in running Graal as of today, you might want to watch my other talk, "How to use the new JVM JIT compiler in real life", because there are a few things to know. If you run on JDK 11, and I think everyone here is running 11 in production today, yes? No? All right, probably not, then it would be very easy. Because, as you've seen before, with AOT, JEP 295, we introduced Graal into OpenJDK, so it's been in the open since 9, and since JDK 10 there was a JEP where Oracle said, OK, it's now supported, in quotes, as an experimental JIT compiler, so you can just turn it on with a few command-line flags. As of today there are a few things you should know. Number one is something called bootstrapping: because Graal is written in Java but it is the JIT compiler, it has to compile itself, it's kind of a meta-circular thing, and that costs a little bit of startup time. It's not that bad; if you watch that talk you'll see it's not a big issue. The other thing to know, and this will actually change in the future, is that there's a project called libgraal, where Oracle Labs and the Java team are working on AOT-compiling Graal itself and linking it in, and then all these things will be gone; but as of today, that's the state. And since Graal is written in Java, memory allocations are different: C1 and C2 are written in C++, so the memory allocations they do during JIT compilation happen with malloc on the native heap, but Graal allocates memory on the Java heap. It's just something you should know if you run benchmarks, so that you interpret the numbers correctly.

OK, so which parameters did I tune? I picked three inlining-related parameters. Who knows what inlining is? All right. So, inlining: when you write code and you have a piece of code that you need in multiple places, the thing you do is factor it out into a small method. That's a perfectly sane thing to do from a maintenance perspective, because you're human. But the JIT compiler undoes what you did: it takes that piece of code and puts it back into the places where it's called. That's for performance reasons, and also to widen the view of the compilation unit, to make it bigger; the more we know as a compiler, the more optimizations we can apply. So inlining is the mother of all optimizations, because it widens the world view of what you see, and then you can apply other optimizations much better, especially escape analysis. Escape analysis goes hand in hand with inlining. I'm not going too much into detail on how escape analysis works, but basically it can get rid of object allocations if the compiler can prove you don't need them. That's the most important reason why Graal is so good with Scala code: Scala allocates all these temporary objects that you don't need, and it's happening underneath you and you can't control it, but Graal takes care of that.
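A tiny Scala example of the pattern he means. The names are made up, but the shape, a short helper that returns a temporary object inside a hot loop, is the classic case where inlining plus escape analysis lets the JIT prove the object never escapes and replace it with plain arithmetic instead of an allocation.

    // A small helper that, naively, allocates a temporary object on every call.
    final case class Point(x: Double, y: Double)

    def midpoint(ax: Double, ay: Double, bx: Double, by: Double): Point =
      Point((ax + bx) / 2, (ay + by) / 2)

    def totalY(n: Int): Double = {
      var sum = 0.0
      var i = 0
      while (i < n) {
        // Once midpoint is inlined here, escape analysis can see that the Point
        // never leaves the loop body, so the allocation can be eliminated
        // (scalar replacement) and only the double arithmetic remains.
        sum += midpoint(i, i, i + 1, i + 1).y
        i += 1
      }
      sum
    }

Whether a given JIT actually manages this depends on its inlining decisions, which is exactly why the inlining thresholds he tunes next matter so much.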
So, the three inlining parameters I tuned. The first one is called TrivialInliningSize, with a default value of 10, and that means 10 nodes in the compiler graph. What a compiler does is parse the source code, whatever that source code is, in our case Java bytecode, into a graph, and if that graph has fewer than 10 nodes we know it's a very trivial method and we just inline it, always. MaximumInliningSize is the other end of the spectrum: it's 300 nodes, and beyond 300 nodes we don't inline. And then SmallCompiledLowLevelGraphSize is similar to the second one, but: almost every optimizing compiler, C2 has this, Graal has it, IBM J9, GCC and Clang have something like it as well, works with different levels of compiler graphs, maybe even more than two. There's usually a high-level intermediate representation and a low-level intermediate representation. The difference is that the high-level one is high and the low-level one is low; you understand, that's how I explain things. No, just kidding: the high-level one is closer to the source language, in our case Java bytecode, but it could be, I don't know, BASIC or whatever you write, and the low-level one is closer to the actual machine instructions for your architecture, in our case Intel, or it could be ARM or SPARC or whatever. That's why this one is also 300. These are the ones I tuned.
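For reference, the three knobs he just named are ordinary Graal compiler options. Assuming a JVMCI-enabled JDK where Graal options are passed as -Dgraal.* system properties (which is how it worked on the JDK versions he mentions, as far as I know), setting them looks roughly like this; the values shown are just the defaults he quoted, and the service jar is a placeholder.

    # Only meaningful when Graal is the active JIT (see the flags at the end of the talk).
    # Defaults he quoted: 10 / 300 / 300 compiler-graph nodes.
    java -Dgraal.TrivialInliningSize=10 \
         -Dgraal.MaximumInliningSize=300 \
         -Dgraal.SmallCompiledLowLevelGraphSize=300 \
         -jar my-service.jar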
I have another talk, I gave it last year at Scala Days in New York, called "Twitter's quest for a wholly Graal runtime". It's basically the story of my first year working at Twitter, how we started running Twitter services on Graal and all the bugs we found; I explain all the bugs, if you're interested in that, and also how much CPU we are saving by running on Graal. While I was preparing the slides for that talk I did the thing that I told you earlier you shouldn't be doing: I hand-tuned it. I sat down for a few hours on a Friday afternoon and tuned it, and I'm going to show you two slides out of that talk for perspective. When I did that talk I still thought JDK 9 would be coming up, which was wrong, but anyway: we're looking at GC cycles here, this is 24 hours of the tweet service, by the way, and, unfortunately the numbers are really oddly presented, I'm not sure why I did it that way, but by just running on Graal, and Graal would be the orange one, or the green one, you can see we reduced GC cycles by about 2.7 percent, I think. Then I hand-tuned the three parameters you just saw and reduced it by another 1.5 percent, up to a total of 4.2, which was pretty cool. But the thing I care most about is user CPU time: by just running the tweet service on Graal instead of C2 we save 11 percent of CPU utilization. That's huge, 11 percent, just imagine. And then I hand-tuned it and got another 2 out of it, and I was very proud of that 2; hey, it took me a few hours, 2 percent, very cool, I probably drank a beer that night because I deserved it. But you will see that Autotune will kick my ass, so remember that 2.

Good. These are snippets of my configuration file. I gave my parameters ranges; you don't necessarily have to do that, you saw how Bayesian optimization works, you could say go from 1 to 1 million and it will figure it out anyway, and if you have to, you can write constraints that way. The reason I gave it ranges is that I needed it to finish in a certain time so it would fit on slides, so it's basically just for this talk, and I know the service really well, so I know what ranges are good ranges.

The test setup I used: a dedicated machine per instance configuration. That's very important: if you do performance analysis and you expect improvements in the single-digit percentage range, you need dedicated machines, because if you do it in your data center and other things are running on your machines, all the results you're seeing are noise. So use dedicated machines. And all my instances receive the exact same requests, so not only the same number of requests but the exact same requests. That's very important for a service like the tweet service, because one tweet could be one character long and another 280, and the memory allocation patterns for the two would be very different, but we want to compare apples to apples. These are read-only dark-traffic requests, so they are not replayed or anything, it's just live data coming in; if you tweet right now while my experiment is running, I would be handling your request. I ran this version of JVMCI and this version of Graal, not that important, and we run default tiered: C1 plus Graal.

Who knows what tiered compilation is? That's what I expected. So, remember when I talked about C1 and C2, that there are two JIT compilers. The way HotSpot does it, and IBM's OpenJ9 does this too: you start out interpreting code, then we compile with C1, as I explained earlier, and we compile in a way that adds extra code that collects profiling information. It counts how often a method was called, how often a loop was executed; for if-else branches it counts how often they were taken; it collects the types seen at, say, invokeinterface call sites, and so on. Then we take that profiling information and recompile with C2 later, after a bunch of iterations. So we're stepping through the tiers; that's tiered compilation. What we do when you turn on Graal in, let's say, OpenJDK 11 (we actually run on 8, but anyway) is just replace C2 with Graal in tiered compilation. We still use C1, because we need that step, but peak performance is achieved with Graal.

All right, good. Experiment one: the tweet service, my favorite one. It's a Finagle Thrift service. You maybe even know what Finagle is: it's an extensible RPC system for the JVM, used to construct high-concurrency servers. I have no idea what that means, I really don't, but it doesn't matter, because the thing that's important to me is at the bottom left: 92 percent Scala code.
As I explained earlier, Graal can optimize Scala code really well, so this works great, and almost all of our services are written in Scala; there are only a handful of services written in Java. Most of our services are basically Netty at the bottom, then Finagle, and then the logic of the service on top; that's kind of what they look like.

OK, my objective, our f: what are we trying to do? You see it at the end: user CPU time. We are trying to reduce user CPU time, and if you remember, Bayesian optimization always looks for a maximum, but since we're all really smart computer scientists and we know math really well, we know how to solve that problem: we say one divided by it. Amazing. Then we have constraints, or at least one; you see it there, something called throttled. If you're doing something really, really wrong, Mesos will throttle you and basically tell you you're a bad citizen. If we're going to have Autotune always on in production, as I said earlier, for all of our services, we certainly need more constraints: every service owner has some metrics they look at to tell, ah yes, my service runs fine, and all of those would be needed here as well. But I only used the one, because I know this service really well and I was actually monitoring it while it was running. So I only used that one, and you'll see what happens.

So, the outcome. That's what the run looks like: 24 hours of the tweet service. Blue is the experiment and orange is the control, and this is just requests per second; as I said earlier, they receive the exact same requests, so it's the same curve. The slices are 30-minute evaluation runs. Thirty minutes is not very long, but again, I needed this to finish in 24 hours, preferably, so that I can fit it on a slide, and in thirty minutes I know the service has compiled all the methods, it has actually reached a steady state, and the results are meaningful, so trust me on that. This is the actual outcome: this is user CPU time, and every time the blue line is below the orange one, that's an improvement. The spikes you see are when the JVM restarts; the spikes are basically JIT compilations, when we compile the whole thing again, and then it runs for about 30 minutes.

What Autotune gives you is a website, a table basically, and this table is sorted by objective. The top one is 1.0838, and what that means is that we reduced user CPU time by 8.4 percent. You remember I hand-tuned and got 2 and was very proud of it; 8 is slightly more than 2, 8 is really good. And remember, the improvement we got by just running on Graal instead of C2 was eleven-ish, so we get another 8 on top of that. Then we have an 8.1, a 6.4, another 6.4, a 5.8; so let's assume the first two are outliers, and we could expect a 6.4-ish improvement, I would guess. The bottom of the table looks like that, and you can actually see that three evaluation runs violated the constraint, so we tuned it so far that it didn't work anymore; as I said, that actually happens.
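Sketching what one of those 30-minute evaluations boils down to, using the same hypothetical shapes as the earlier tuning-loop sketch: the score is the reciprocal of user CPU time, so maximizing the score minimizes CPU, and a throttled run is reported as a constraint violation instead of a score. The metric names here are invented.

    // Hypothetical measurements from one 30-minute slice.
    final case class Measurement(userCpuSeconds: Double, throttledCount: Long)

    def score(m: Measurement): Option[Double] =
      if (m.throttledCount > 0) None             // constraint violated: Mesos throttled us
      else Some(1.0 / m.userCpuSeconds)          // maximize 1/f to minimize user CPU time

That Option is exactly what the observe step of the earlier sketch would be fed.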
Then there's the table, and at the top you can also look at the charts, so these are the charts. This one is TrivialInliningSize. Take this with a grain of salt, because we tuned three parameters, so we're exploring a three-dimensional space, which is very hard to put on slides; every data point in here depends on two other values that are not the same. But at least we get an idea of what's going on. If you look at it, there's a tendency going upwards: the default is 10 and our best result is 21, and looking at the curve we can say, yeah, 21, 22, 23, maybe that would be a good value. This one is MaximumInliningSize; it's almost flat. There might be a slight tendency upwards, but it's almost flat, so it doesn't affect performance too much. There are two outliers at the top right, but again, they might be outliers or not; it could be that the two other values for those data points happen to be a perfect configuration. We don't know, but we don't need to know; we just take whatever Bayesian optimization gives us and use it. And with this one it's very obvious what's going on: the default value was 300, and I actually went down to 200, I think, to see whether it affects performance negatively, and it certainly does, and our best value was 580 or whatever it is; it should certainly be around 600, almost double the default, for this particular service.

OK, so what I did next: I wanted to see whether what Autotune found is actually true in the real world, so I ran a verification experiment. I ran the tweet service for 24 hours, that's it. I compared C2 and Graal, and then red is Graal with the values we just found with Autotune. This is just requests per second. This is GC cycles: the tweet service uses the parallel collector, so we're looking at PS Scavenge cycles here, and you can see that, yes, when we run on Graal we reduce GC cycles by roughly 3.4 percent; you've seen similar numbers before, in this particular run it's 3.4. Now, 3.4 might not seem like a lot, but it's actually impactful: remember, we're still processing the same number of requests per second, but we reduce the memory consumption, and if you can avoid GCs, that always means your latency improves. So, 3.4. This is Autotune, and you can already tell that's a good improvement: another 3.5, up to a total of almost 7 percent. Very important here: yes, very nice, run on Graal, you will save a lot of money, in our case we do, but if you're not tuning your stuff you're throwing a lot of money out of the window; in this particular case we tuned out twice as much as we get by default.

OK, a different slide: that's the same data, just in a different graph, maybe some people understand this better. It's allocated bytes per tweet, how many bytes we allocate per tweet. You see it's pretty flat over the day, it fluctuates a little bit because tweets have different lengths, but the improvement is the same: 3.5, and then with Autotune another 3.4, up to a total of 7; it's the same data. All right, user CPU time: in this particular run, 12 percent. So imagine the tweet service, and we have thousands of instances of this service: we can use 12 percent fewer machines to process the same number of tweets. Just imagine how much money that is. I cannot tell you how much it is, but it's a lot; it's way more than I get paid, I don't know how much you get paid, but it's more than I get paid. So, 12 percent, and Autotune is certainly a good improvement on top: remember we expected a 6.4-ish, maybe minus the two outliers, and we got 6.2, exactly what we wanted to see, up to a total of 18. So, 18 percent fewer machines, it's ridiculous; that's like we improved it by another 50 percent, and at this scale it's just so much money. The next thing I looked at was p99 latencies for the tweet service.
You can already see that Graal certainly gives us way better p99 latencies than C2; it's a little hard to tell how much it really is. I only look at two nines here, not three or four, because with three or four nines you're really only looking at your long GC pauses; with two nines you actually get a rough idea of what the real world looks like, and this is 99 percent of the tweets anyway, so it's fine. This is Autotune, certainly better than the regular default run, but hard to tell how much. So what I did was integrate over the 24 hours, and that's this graph, and then we can actually tell what the improvement is. By just running the tweet service on Graal instead of C2, we not only reduce CPU utilization by 12 percent, we reduce p99 latencies by 20: it runs better and faster. And then another 8 percent on top if we tune it: 28 percent. That means if you look at Twitter right now, and I encourage you to do it, you will see your tweets 28 percent faster; if you scroll really fast, you could read 28 percent more tweets. I would also appreciate it if you would do that.

All right, experiment two: social graph. Social graph is also a Finagle Thrift service; it's an abstraction for managing many-to-many relationships at Twitter, basically who follows you and who you are following. The reason I'm also doing an experiment with this one is that it's basically the same stack, as I said: maybe Netty, Finagle, and a little bit of different logic on top, and you'll see that the logic actually influences the outcome, even though they are very similar. Objective: same thing, we try to reduce user CPU time. And, excuse me, constraints: we don't want to get throttled. This is the run: different day, different graph, but you see nothing new. What I did, again, is 30-minute evaluation runs, because this service I know really well, like the tweet service, so I could do that. That's the result: you see improvements, you see some that are worse, and again the spikes are JIT compilations. And this is the outcome: the top one is 7.6, that's a pretty good improvement, then we have a 7.6, a 7.2, a 6.8, and a 6.4, so maybe we get a 7 percent improvement; that would be really cool. Oh yeah, on the right, I haven't mentioned this earlier: you can actually see the values for the parameters that Autotune found. At the bottom of the table we had one run that violated the constraint; don't look at the three that are still "running", that's a bug, but I think we fixed that one.

And these are the charts. It's not as obvious as with the tweet service, especially this one, TrivialInliningSize, but you can see a tendency up, and the best one is again around 23; before it was like 21, so 21, 22, 23 is probably a good value for the code base we have. This guy, again, is kind of flat, with a tendency up; our best one is around the 400 mark or whatever. This one is also not as clear as with the tweet service, but there's certainly a tendency upwards. The important thing to point out on this slide is that the best one we have is 649 or whatever it is, and my range goes to 650, so I might redo this experiment with either a bigger range or no range, because we might actually get a better result.

So, again, same thing: I did a verification experiment, ran social graph for 24 hours, and this is the outcome. Social graph runs CMS, so we're looking at ParNew cycles here, and by just running on Graal instead of C2 we reduced GC cycles by only 1.6 percent, roughly half of what we had for the tweet service.
And then we tuned it, and that's the important thing; you can kind of already see it: 3.5, we tuned out twice as much as we get by default. So performance tuning is really, really important, and the compute power we just waste every day is ridiculous. When we do this, we reduce, for the tweet service, where was it, 18 percent, right, 18 percent fewer machines, and we own our own machines, so 18 percent fewer machines means a lot of money, means less electricity, means less cooling. I'm trying to save the world here, really. I'm glad it's warm outside today, but still, climate change is real.

User CPU time for the social graph service: 5.5 percent, roughly half of what we had for the tweet service, and that kind of goes hand in hand with the roughly-half reduction in GC cycles; we had, I can't remember the number now, 3.4 I think for the tweet service and 1.6 for social graph, also about half. If we can eliminate object allocations, we don't have to allocate, which saves us a lot of CPU time, but more importantly we don't have to collect the garbage, and collecting garbage takes a lot of CPU. If we can't reduce memory allocations as much, only half as much as for the tweet service, we also can't save as much CPU; you see, they kind of go hand in hand. This is Autotune, and the very important thing is that, again, we got more out of it than Graal gives us by default, up to a total of 13. You might see it if you look at the blue and orange curves: there the improvement is the same whether the load is low or high, but with Autotune the improvement is bigger when the load is high than when it's low. That shouldn't really be the case, and I actually haven't had time yet to go back and look at why that is. But it doesn't really matter; the 7.8, I'll take it from up here, right, I'm not stupid, I'll pick the best one. And it actually makes sense, because if the load is low we have enough machines anyway; we need the improvement when the load is high, that's how we size our instances.

Questions? This is a joke: everyone right now should have this question. Remember, this is actually an Autotune talk, and I talk about Autotune all the time, so you should have this question... Well, of course I did it. I couldn't come up here on stage and rave about how great Graal is and then not do an Autotune experiment with C2. So I did that. (We're not doing questions right now... no, we're not; that was for dramatic effect, but you'll be first afterwards.) So I picked three similar parameters for C2. One is called MaxInlineLevel; Graal doesn't have that, but MaxInlineLevel is very important, I talked to someone yesterday about this. It's basically how deep your inlining goes, and this matters a lot today: the default value is nine, and in our world today everyone's using a trillion frameworks, so framework A is calling framework B is calling framework C, and at nine it stops, there's a hard stop; sometimes it doesn't even get to the code you wrote, and it doesn't inline the whole thing. So, MaxInlineLevel. MaxInlineSize is the same idea as TrivialInliningSize; the difference is that the 35 here is bytecode size. Remember, for Graal it was nodes in the compiler graph, but here it's actually bytecode size. If you ever added assert statements to your code and suddenly your performance tanked, it's because of this guy: when you add assert statements your bytecode size increases, but C2 doesn't discount for asserts, so suddenly it doesn't inline that method anymore. I know, very embarrassing; we never fixed it, I'm very sorry.
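The C2 knobs he names are ordinary HotSpot flags, so you can look at them on your own JDK. A quick, hedged sketch; the service jar is a placeholder, and the numbers are just the defaults he quotes.

    # Print the defaults your JDK ships with:
    java -XX:+PrintFlagsFinal -version | grep -E 'MaxInlineLevel|MaxInlineSize|InlineSmallCode'
    # Override them; defaults he quotes: 9 levels / 35 bytecodes / 2000 bytes of compiled code:
    java -XX:MaxInlineLevel=9 -XX:MaxInlineSize=35 -XX:InlineSmallCode=2000 -jar my-service.jar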
Then InlineSmallCode, a very odd one. I was not at Oracle or Sun when they introduced this guy; 2000 means two thousand bytes of native code. Their reasoning behind it, I think, is: if C2 wants to inline a method and we already have compiled code for that method, and that code is bigger than two thousand bytes, then the method is very big, so we'd rather not inline it. That's the reasoning. It's kind of odd; sure, it makes a little bit of sense, but if you know how inlining works, when you inline, a lot of things can collapse in your compiler graph and you actually end up with less code than if you compile the method on its own. It's an odd thing, but I tried to tune it.

OK, I gave it ranges, again for the same reason, and then the run: again 30-minute evaluation runs, this time with C2, and that's the outcome; you see some better, some worse. That's the table. The best result we have is 5.1, that's pretty cool: a 5.1 percent improvement with a compiler that's been around for many, many years and is highly optimized for all the code out there, I'll certainly take it. The next one is a 3.8, a 3.5, a 3.3 and a 3, so I would argue the 5 is an outlier, especially with the numbers we've seen before, so let's say we get a 3.5 percent improvement. Still good, right? And remember, it's an Autotune talk: Autotune did its job, it tuned a compiler that's been around for 15 years well enough to get another 3.5 percent out of it. I would certainly take that, if there weren't Graal; because with Graal we already get twelve by default and another six on top, up to a total of eighteen, compared to three and a half. That's the bottom of the table; no constraint violations, it doesn't really matter.

And these are the charts. That one's very obvious: that's the MaxInlineLevel thing I talked about, how deep inlining goes; you see the others are not really important, it's a perfectly clear curve. The default is nine, but it should be sixteen, seventeen or eighteen, and I'd argue it should be eighteen or whatever for all the code out there. Because when we picked nine, I don't even know when that was, we'd have to go back into the Mercurial history and see when we did it, but let's say it was ten years ago, and I'm sure it was that or more, the code ten years ago looked very different from what it looks like today: it's way more code, way more frameworks and stuff. So if there's one thing you want to tune, tune that, and increase it. This one, MaxInlineSize, is completely flat; that surprised me, because it's basically the same thing as TrivialInliningSize, and if you remember the charts before, that one heavily influenced performance, but this one is completely flat, doesn't affect performance at all. This one is also flat, which didn't surprise me too much, because it's that odd one I was talking about. So basically MaxInlineLevel is the one you want to change. I did not do a verification experiment for C2, but let's assume we get a 3.5 percent improvement: sure, cool, but compared to 18, as I said, especially for Scala code.

So, my summary, and I think I'm kind of running out of time: it's very simple, and it's the same summary for all of my talks. That's how you turn Graal on: if you run on JDK 10 or later, basically 11, the only thing you have to do is this,
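The slide he is referring to is, as far as I know, the standard pair of experimental HotSpot flags on a stock JDK 10 to 13; the service jar below is a placeholder.

    # Unlock experimental options and use the Graal JIT (via JVMCI) instead of C2:
    java -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler -jar my-service.jar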
That replaces C2 with Graal. And if you're running Scala code, and I'm sure everyone in this room is, you're an idiot if you're not turning on Graal, you really are, because I have never seen Scala code that ran worse with Graal than with C2, never; we've always seen improvements. (How long can I rant on? More? I don't even know. You're fine? I'm fine, good.)

So, I don't think we are the only ones anymore, but Twitter is kind of the biggest company that uses Graal in production. I don't think I even mentioned this before: the tweet service, the social graph service, the user service, which are kind of our biggest services in terms of instance count, and a bunch of maybe 20 other services have run 100 percent on Graal in production for over two years. So every tweet you've seen in the last two years, or tweeted yourself, was processed by code compiled by Graal, and it works fine, it never crashed, it just works. (Did you lose any tweets?) I don't think so; you wouldn't even know if you lost one, you would blame the Wi-Fi, but no, we're fine. It runs really, really well for us. When we moved to Graal we found a few bugs, and in my other talk I talk about them, as I said, but we haven't found a bug in more than two and a half years.

So what I need is for you to run Graal and find bugs. We need the shitty old production code that you have, the stuff that would tickle the corner cases of the compiler; the older the better. It doesn't have to be big, it doesn't have to be a huge application, it can be a small pet project you're working on, just give it a shot. If it runs better for you, very good; if it runs as well for you as it does for us, maybe your company saves a lot of money, and I'm very happy for you. I'm a nice person, I want you to save money, especially when I come back next year and you say, we saved so much money, I'll buy you a beer. That's cool; no, you cannot sleep at my house. But I mean it: I want you to save money. Sure, your scale is probably not as big as Twitter's, but your compute expense is usually a fraction of your company revenue, so if your compute expense is, let's say, ten thousand dollars a year and you can save 10 percent of that, a thousand dollars, that's an amazing Christmas party. So do that. If it crashes for you: excellent, that's exactly what I want, I want you to find a bug, because if it crashes for you and we find the bug and fix it, it means Twitter doesn't crash and Twitter doesn't go down. We would all be sad if Twitter were down; especially I would be sad, because then I would have to go back to work and fix it. So please find bugs, and file them on GitHub if you find one. If it runs worse for you performance-wise, also let us know; we want to know where Graal is still lacking compared to C2. That might be difficult if it's code from your company, it's probably proprietary, but maybe you can extract a small test case or something and send it to us, and the Graal team will probably take a look at it. So please, please, please.

I'm doing these talks, and the other things I'm doing, moving to Twitter and running Twitter on Graal, for a reason: I want Graal to become the new default JIT compiler in OpenJDK, for many reasons, but the number one reason is what you've just seen:
the improvements that especially this community gets. There are also improvements for Java, but not as much; it's especially Scala where Graal just shines, and I want this to become the new compiler. My colleague Flavio, who I mentioned at the very beginning, who's writing these Scala-specific optimizations: he had a little bit of compiler experience, but he had never worked on a JVM or a JIT compiler. C2 is very, very complex; when I started working on C2 it took me four years to fully understand how it works. No one does that today anymore: everyone downloads a framework from GitHub and expects to understand it in a week, and that's not the case with something like C2. But Flavio was able to write compiler optimizations and upstream them, two of them already upstreamed, in about three or four months. So we need a new framework, basically, a compiler framework where we can do these things, and it's especially important for languages other than Java. I worked at Sun and Oracle for many years on the HotSpot compiler team, on C2, and not a single day, not a single hour, not a single minute did we optimize for any other language than Java. So that's why Graal can give you so much of a boost with Scala. Give it a shot, let me know how it goes, tweet it out, tweet at me, send me a DM, tweet at the GraalVM hashtag or Twitter handle, whatever, please give it a shot. You can always reach out to me or the GraalVM team; they have an evangelist who will take care of you, they actually have two, and one is here, Alina, so ask her a trillion questions. If you need help setting it up, or if you don't understand anything, ask us on Twitter; you'll get an answer in two minutes, I would say. All right, that's all I had. Tweet about everything you're seeing here, and thank you very much. And since I was too rude earlier, you can ask your question now, or did I answer it in the meantime?

Question from the audience: it was just a curiosity question, have you ever run Autotune experiments with other platforms beyond Graal and the JVM? Right, so Autotune, as you've seen, Ramki, my colleague, his talk was already two years ago, so Autotune is already a few years old, and it was developed to tune the GC. As I said earlier, you can tune anything. I've not done it; I've actually never tried to tune the GC, I might do that next, but I'm a compiler engineer, that's what I understand, you know what I mean. So actually the answer is no, but I might, yes. Other questions? Yes. (It's louder than I expected.) Yeah, it always is, and I'm not speaking louder right now. Why is Graal so effective with Scala, optimizing Scala versus Java? It mainly has to do with what I said earlier, with these temporary objects that are being created. So, I didn't say this earlier, I wanted to get out of this room without saying it, but I know nothing about Scala. If you forced me right now to write hello world in Scala, I couldn't do it; no, it's not a joke, I mean it, because all I need to speak is Java bytecode. But what I can tell you: with all the, whatever, monads and things you use, there's a lot of immutability going on, so it has to create all these new objects all the time, and if you have a big enough view of your source code, then you can prove, I don't need these objects, and then the compiler can say, well, I'm not allocating these. That's basically it, and if you watch my "Twitter's quest for a wholly Graal runtime" talk, you'll see that's basically 50 percent of the improvement.
The other 50 percent comes mostly, I would guess, and I don't have exact data to back this up, from the better inlining, and then a few other optimizations on the side. But I think it's mostly inlining, because the inlining implementation in Graal is just better, it's not as restricted as with C2; you've seen it here, you've seen which of the things we tuned actually affected performance and which didn't so much, and the MaxInlineLevel one is a very important one. So that's kind of the reason why Graal is so good with Scala. Other questions? Over there. (Are we over time? We've run out of time.) All right, one more, it's only lunch after this.

Have you experimented with, like, the Zing VM? No. So, no, and I've also not experimented with J9; I promise every time I see them that I'll try J9, and I've not done it yet. So, Zing: are you talking about Falcon? Yes. Oh, before Falcon? No, we never tried that either. I think if you download Zing now, they've gotten rid of all the other compilers and Falcon is the default. So Falcon, for the people who don't know, is a JIT compiler that Azul Systems was and is working on, and it's based on LLVM, it uses LLVM to do compilations. The problem with Falcon, and I don't want to bash Azul too much here, is LLVM. When I worked at Oracle and was looking for a replacement for C2, I was also looking at LLVM, I mean, it's an obvious choice, but LLVM was always designed to be a static compilation framework. It has some support for JIT compilation, but it was never designed to be a fast JIT compiler. So the problem Falcon has is that it's very slow, slow in terms of compiler throughput, not the code it generates, but the throughput. So Azul has to jump through some hoops: they have the ReadyNow technology, they've had that for a while, and they're working on another technology, I'm not sure if they have a marketing term yet, they call it compile stashing, which is basically where they save JIT-compiled code, because Falcon is so slow. So Graal is just better, I'm sorry, Azul. All right, thank you. [Applause]
Info
Channel: Scala Days Conferences
Views: 1,072
Rating: 4.5555553 out of 5
Keywords: Chris Thalinger, Twitter, Graal, Machine Learning, ScalaDays, Lausanne
Id: ldk8CL0fygE
Length: 58min 46sec (3526 seconds)
Published: Thu Jul 11 2019