The Evolution of ART - Google I/O 2016

Captions
NICOLAS GEOFFRAY: All right. Good afternoon, everyone. Thank you for coming to this talk. You'll hear about ART, the Android Runtime, and all about the new, exciting features we've worked on for Android N. I'm Nicolas Geoffray, a software engineer on the ART team, and I'll be presenting along with my colleagues Mathieu Chartier and Calin Juravle.

So ART is the software layer between the operating system and the applications. It provides an execution environment to run Android applications. That environment essentially consists of two things. It executes Dex bytecode, the internal format of Android applications, through a mix of interpretation and compilation. It also manages the memory of those applications, allocating it and reclaiming it when it's no longer used. Why should you guys care about ART? Well, it turns out ART is at the forefront of the user experience. ART needs to be fast so that applications can run smoothly. It needs to start applications really quickly to give a snappy experience to the user. It needs to ensure the UI can render at least 60 frames per second to make the user experience jank free. It needs to be power-efficient and not do too much extra work besides executing the application. And it needs to be savvy in terms of memory use. The less memory being used by the runtime, the more applications you can actually run.

So if you remember Dalvik, which was the first Android runtime released on Android phones, it was sort of efficient on some of these metrics, and sort of OK on a few. At the time memory footprint was paramount, so we constrained the compiler to not use much memory, and that explains the not-so-great performance. Dalvik also had a relatively unsophisticated garbage collector, and that could lead to long application pauses, leading to a janky experience.

So in 2014 we introduced ART. ART shifted the paradigm from doing interpretation and Just-In-Time compilation at runtime to Ahead-Of-Time compilation when the application is being installed. So when the application is being installed, ART will compile it directly to optimized code. That gave us a great boost on performance and application startup. ART did not need a Just-In-Time compiler to be warmed up before it could execute. Applications were running optimized code directly. And because this code is directly loaded from disk, you don't need to pay the cost of a JIT compilation code cache. The garbage collector has also been completely revamped: we implemented state-of-the-art concurrent garbage collection algorithms, and we made sure that GC pauses were kept to a minimum so that the experience was jank free.

Since Lollipop we've mostly improved on performance, so we have iterated on the compiler. When we shipped Lollipop, it had the Quick compiler. That was a fast Dex-to-machine-code compiler, ported from the Dalvik JIT at the time. We were really eager to ship ART, given all the benefits, and we mostly focused on shipping a well-tested and robust compiler. However, Quick was not structurally meant for more sophisticated optimizations, such as inlining and register allocation. So in Marshmallow we introduced the Optimizing compiler. That's a state-of-the-art SSA-form-based compiler infrastructure. SSA, in compiler jargon, stands for Static Single Assignment, and that's a well-known form for doing optimizations in a compiler. We implement all sorts of optimizations, such as inlining, bounds-check elimination, common subexpression elimination, and so on. We also implemented a linear scan register allocator, a state-of-the-art register allocation technique for compilers. And in N we iterated on the compiler and made more aggressive optimizations, like more inlining and lots more optimizations.
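To make those optimization names a bit more concrete, here is a small, hypothetical Java method of the kind an SSA-based optimizing compiler can improve. This is only an illustration of the categories mentioned in the talk (inlining, bounds-check elimination, common subexpression elimination), not actual ART output, and the class and method names are invented.

```java
// Illustrative only: the kind of Java code where the optimizations named
// above apply. Not ART's actual output.
public final class OptimizationExample {

    // A small method like this is a typical inlining candidate.
    private static int square(int x) {
        return x * x;
    }

    public static int sum(int[] values, int scale, int offset) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            // Bounds-check elimination: i is provably within [0, values.length),
            // so the implicit array bounds check can be removed.
            // Common subexpression elimination: (scale * offset) is the same on
            // every iteration and can be computed once instead of repeatedly.
            total += square(values[i]) + scale * offset;
        }
        return total;
    }
}
```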
This graph shows the performance of Marshmallow and N, normalized to the performance of Lollipop. The benchmarks run on a Nexus 9, a tablet where we shipped Lollipop and the first release of ART. And you can see we've been constantly improving performance over time. We measured on three different kinds of benchmarks. DeltaBlue and Richards are well-known object-oriented benchmarks that stress how the runtime implements calls. Dhrystone is about how the compiler generates integer computations. ReversiBench, ChessBench, and Ritz are actually adaptations of Android applications: Reversi and Chess emulate the AI of an actual game that you can download from the Play Store, and Ritz emulates spreadsheet computation. So we've been very pleased with the improvements we've made here, ranging from 1.2x to 5x speed-ups. And the performance boost isn't specific to one architecture. Because most of our optimizations are CPU independent, all platforms benefit from them. So running the same benchmarks on ARM32 will lead to the same improvement trend across the board.

So that's great. Improving the code generated by the compiler is a great win for performance, faster frame rendering, and faster startup. Great, but-- [AUDIENCE CHUCKLES] It looks like you guys have run into this. So you're probably familiar with that dialogue now. It's pretty lucky, right? Only 21 to go. [AUDIENCE LAUGHTER] So that dialogue is what is shown when you take a system update. And what happens there, in case you don't know, is that we are redoing all the optimizations we've done at install time of an application. Why? Because when you install an application, ART will optimize it heavily against your platform. And it will put hard-coded dependencies in the compiled code that will make your application run much faster. But when you get a system update, all those hard-coded dependencies become invalid because you've got a new system. So we need to redo all the work. When we shipped ART, that was actually a trade-off we made, because OTAs, or system updates, were usually once a year. So for once-a-year updates, you'd get a better Android experience once a year. But the Android ecosystem has evolved. And security being at the heart of Android, our security team worked hard on making sure security fixes could be sent to Android phones as soon as possible. So when our Lead Security Engineer and Director of Nexus Products announced we were now moving to monthly updates, it was sort of clear that this initial trade-off we'd made would not work, and ART needed to adapt.

So we brought back a Just-In-Time compiler in ART. So no more "optimizing app" dialogue. Woo-hoo! [AUDIENCE APPLAUSE] Thank you. It turns out removing that dialogue is not the only benefit of having a JIT. We now get much faster installs, around 75% faster. And because AOT could not know, when it was compiling, which parts of the app would be executed, it turns out that we were compiling all the code of your APK. And that is sort of a waste of your storage if you're going to use just 5% or 10% of the code.
But a JIT has some unknowns. Clearly, keeping on compiling all the time -- you start your app with the JIT, you kill the app, you start it up again, you JIT again -- has some implications on your battery that AOT-compiled code didn't have. And having a compilation code cache could be wasteful if not managed properly. So what we did in N is introduce a hybrid Just-In-Time / Ahead-Of-Time compilation system that combines the benefits of both worlds. The idea is that applications start running with the JIT, and when the phone is idle and charging, ART will do profile-guided optimizations and Ahead-Of-Time compile, with heavy optimizations, the parts of the application that the JIT has already executed. So later in this talk, my colleague Calin will go into more detail about how this hybrid AOT/JIT profile-guided compilation works.

What I want to focus on now is how we dealt with those unknowns in the JIT. Let's go back to the five metrics I mentioned earlier in the talk. For runtime performance, the JIT is based on the same AOT compiler, so it brings the same high performance. So we're covered. But the other metrics are actually down to how we tune the JIT. A couple of design decisions we made when starting this project were based on those metrics. Like, we implemented a much faster interpreter in N compared to the one in Marshmallow. It runs up to 3x faster than the one in Marshmallow. It's very important to have a fast interpreter, because when you start your app you don't have any Ahead-Of-Time compiled code: the interpreter is going to run first, and later on the JIT. Second, we do JIT compilation on a separate thread and not on the application threads, because some compilations can take very long, and you don't want to block the UI thread just for doing compilation. For saving power, what we do is mostly focus on the hottest methods of an application. And finally, for memory footprint, we implemented a garbage collection technique for the code cache that ensures that only the methods that matter over time are kept. So if a method has been compiled and then is not being used anymore, we'll remove it from memory.
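As a rough sketch of the design decisions just described -- per-method hotness counters, a threshold before anything is compiled, and compilation on a dedicated background thread so application threads are never blocked -- here is a simplified model in Java. The class names, the threshold value, and the queue are invented for illustration; ART's actual JIT is implemented natively inside the runtime.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified, hypothetical model of a hotness-counter-driven JIT: methods start
// out interpreted, and once a counter crosses a threshold the method is queued
// for compilation on a separate thread, off the application threads.
final class JitSketch {
    private static final int HOTNESS_THRESHOLD = 10_000; // invented value

    static final class Method {
        final String name;
        final AtomicInteger hotness = new AtomicInteger();
        volatile boolean compiled;
        Method(String name) { this.name = name; }
    }

    private final BlockingQueue<Method> compileQueue = new LinkedBlockingQueue<>();

    JitSketch() {
        Thread compiler = new Thread(() -> {
            try {
                while (true) {
                    Method m = compileQueue.take(); // wait for work
                    compile(m);                     // expensive, but off the app threads
                }
            } catch (InterruptedException ignored) {
                // background compiler shutting down
            }
        }, "jit-compiler");
        compiler.setDaemon(true);
        compiler.start();
    }

    // Called by the interpreter on every invocation of an interpreted method.
    void onInterpretedCall(Method m) {
        if (!m.compiled && m.hotness.incrementAndGet() == HOTNESS_THRESHOLD) {
            compileQueue.offer(m);
        }
    }

    private void compile(Method m) {
        // Placeholder for code generation; in ART the output lives in a JIT
        // code cache that is itself garbage collected when methods go cold.
        m.compiled = true;
    }
}
```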
All right. So let's focus on application startup. Now I'm going to walk you through some Systrace captures that explain some of the implementation decisions we've made. If you don't know what Systrace is, it's a great tool for both app developers and platform developers to analyze what is happening on an Android system. So bear with me. There's a lot of information on that slide, but we'll focus on the things that matter for us.

So here's how Systrace looks after launching Gmail on a device that had Gmail Ahead-Of-Time compiled. Application startup, for a user, is actually when the first frame is being rendered. And this trace here is, more or less, what you would get on Marshmallow and Lollipop: starting Gmail would take half a second for the first frame to be drawn. With our first, alpha implementation of the JIT, we did the same measurements, and the results were not great. You can see now the startup is around one second, so we increased the startup time around 2x. And you can see that the JIT thread here is initially idle, and then becomes very busy. So what's happening? If you take a closer look at what the application is doing, there's around 200 milliseconds spent doing just the APK extraction. And APK extraction is blocking, so you need to do it before executing any code. Similarly, there are lots of things happening after the APK extraction that don't have to do with executing the application. And that's verification. ART needs to verify the Dex code in order to run it and optimize it. So we fixed this problem. We decided to move extraction and verification out of every application startup and back to when the application is actually being installed. That made application startup two times faster than our initial JIT implementation, and quite on par with compiled code.
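If you want to do this kind of startup analysis on your own app, one simple way to make your code show up as named slices in a Systrace capture is the android.os.Trace API. The activity and the loadInbox() method below are hypothetical placeholders, not code from the talk.

```java
import android.app.Activity;
import android.os.Bundle;
import android.os.Trace;

// Hypothetical example: named trace sections appear as labelled slices in a
// Systrace capture, making it easier to see where application launch time goes.
public class MainActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        Trace.beginSection("MainActivity.loadInbox");
        try {
            loadInbox(); // stand-in for expensive startup work
        } finally {
            Trace.endSection();
        }
    }

    private void loadInbox() {
        // ...
    }
}
```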
So I've just talked about application startup. How about jank? For jank we looked at the frame rate of scrolling within the Google Photos application. And Systrace gives you this nice list of frames that are being drawn. A green frame is when the UI managed to render it in time. Orange and red is where you're dropping the frame and it hasn't been rendered in time. Now, jank can be attributed to many factors, and ART does its best at executing the code of the application, but if the application does too much on the UI thread, obviously we're going to miss a frame. So for that specific experiment, we have around 4% of frames being dropped. During our first bring-up we ran the same experiment. The application wasn't compiled; we'd just run it with the JIT. And the results weren't great. We were dropping around 20% of frames. If you take a closer look at what the JIT is actually doing here, you can see those long compilation activities, where the compiler is actually waiting for requests to compile methods. Those methods haven't reached the hotness threshold that we've set. And the hotness threshold is high because we want to save on battery. But that doesn't help with jank. For the UI thread, you want the code that that thread is executing to be hot as soon as possible. So the solution was to increase the weight of UI thread requests for compilation, so the methods it runs would be compiled almost instantly. If you look at this trace, there are no more long compilation pauses, and we only dropped around 4% of frames, which was the AOT level we had initially.

All right. Battery usage. We've measured the power usage of starting Gmail, pausing for 30 seconds, and then scrolling around the emails. And we can see here that the JIT is paying a high cost at application startup compared to Ahead-Of-Time compilation. But once startup is done, the scrolling actually shows no difference between AOT and JIT. The reason for this difference is that, at startup, the JIT is very busy compiling a lot of methods. And Gmail seems to be very aggressive in executing code at startup, which is not necessarily the behavior of all apps. So we've done the same experiment with other apps. We've looked at Chrome, Camera, and Photos. Chrome and Camera are mostly based on native code, so here the JIT is not really useful for all the things we mentioned, and the power usage is very similar whether you're in an AOT setup or a JIT setup. Photos, on the other hand, does have Java code, but doesn't have the behavior that Gmail had. And you can see that the difference is, again, very small.

That was battery. Let's discuss the final metric, memory footprint. So we looked at the overall usage of our beta testers within Google, and we were quite happy to find that the memory use of the JIT is fairly reasonable. The maximum we've seen on a heavily loaded system that had lots of applications being executed was around 30 megabytes, system-wide. But on average what we've seen is, in general, around 10 megabytes. For individual applications, big Java applications will take some memory, like four megabytes. But on average, most applications have a reasonable Java code size, and the code cache is fairly small, around 300 kilobytes. So to wrap up on how the JIT affects these metrics: we've seen that performance and jank are on par with the quality level we had with ART Ahead-Of-Time compilation. And I have shown you that it does have a relative impact on application startup, battery, and memory footprint. So now I'll be handing it over to my colleague Calin, and he's going to explain to you how we're recovering from those small regressions compared to AOT by doing Ahead-Of-Time profile-guided compilation. [AUDIENCE APPLAUSE]

CALIN JURAVLE: Thank you, Nicolas. Hello, everyone. I'm Calin and I'm here today to give you more details on profile-guided compilation. This is a new compilation strategy that we introduced in N. And, as my colleague Nicolas mentioned, it is a combination between JIT and Ahead-Of-Time compilation. It's mainly based on the observation that the percentage of an application's code which is actually worth optimizing is very small in practice. And focusing on the most important part of the application drives a lot of benefits for the overall system performance, not only limited to recovering the slight regressions that come with the JIT system. So the goal here was to have a full five-star system. And this is what profile-guided compilation helps us achieve.

So let's take a look at the idea. In a nutshell, we want to combine execution profiles with Ahead-Of-Time compilation, and that leads to profile-guided compilation. What it means is that during application execution we'll record profile information about how the app is executed, and we will use that information to drive offline optimizations, at a time when the device is charging and idle so that it doesn't take resources away from our users. So let's take a look in more detail at how this works, how it affects the lifecycle of the application, and how it fits together with the JIT system that my colleague talked about. The first time the application is executed, the runtime will interpret it. The JIT system will kick in and optimize the hot methods, and eventually this cycle will repeat. In parallel with the interpretation and with the JIT compilation, we will record the profile information. And this profile information gets dumped to disk at regular intervals. The next time that you execute the app, the same process starts again, and the profile files will eventually be expanded with new use cases based on how the user used the app. At a later point, when the device is not in use anymore -- it's idle and charging, a state that we call "maintenance mode" -- we kick in a service. And what the service does is take a look at the application, look at the profiles, and try to optimize the application based on its use. The output of the service is actually a compiled binary, and this compiled binary will replace the initial application in the system. So the next time the application is launched after this service was executed, the application will contain different states of the same code. You may have code which is interpreted and eventually JIT-ed, and it will also have code which is Ahead-Of-Time compiled.
If the user, for example, uses a new part of the application that hasn't been explored before, that part will be interpreted and JIT-ed, and it will generate new profile information. And so the cycle begins again. What's important here is that we'll improve the application's performance as the user exercises new use cases, and we will keep recompiling it until we discover all possible cases.

So let's focus a bit on the profile collection and how that impacts the application performance and other factors. As I mentioned, we collect profiles in parallel with the application execution, and what we focused on is making sure that it has a minimal impact on performance. One factor that we put a lot of attention into is having efficient caching and I/O throttling so that we minimize the write operations to disk. We also have a very small file footprint, and the amount of data that we write to disk is actually very, very small. Another point, which I mentioned before, is that we keep expanding these profiles as the app executes. Obviously, it depends on the application and on the use case, but our tests show that the largest part of the data is actually captured during the first run. Subsequent runs add to the profile information. It obviously depends on how the user uses the app, but the largest chunk of the information is captured during the first execution, and that gives us important data to work on. A final point worth mentioning here is that all the applications and all the users get their own profiles.

With that in mind, let's take a look at what exactly we record in this profile information. The first things are the hot methods. And what constitutes a hot method? It's a metric which we have internal to the runtime, and the factors that contribute to it are, for example, the number of invocations, and whether or not that method is executed on the UI thread, so that we can speed up requests that impact the users directly. We'll use this information to drive offline optimizations and to dedicate more time to optimizing those methods. The second piece of data that we record is the classes which impact startup times. How do we know they do so? It means that they are loaded in the first few seconds after the user launched the app. And my colleague Mathieu will go into more detail on how we use that to improve startup times even more. A final piece of information that we record is whether or not the application code is loaded by some other apps. That's very important to know because it means that the application behaves more like a shared library, and when it does so we'll use a different compilation strategy to optimize it.
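As a rough mental model of what such a profile might contain, here is a hypothetical sketch in Java. The class and field names are invented; ART's real profiles are a compact binary format maintained internally by the runtime.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified view of the per-app profile data described above.
// It only mirrors the three kinds of information mentioned in the talk.
final class AppProfileSketch {
    // Methods that met the hotness criteria (invocation counts,
    // whether they ran on the UI thread, ...).
    final Set<String> hotMethods = new HashSet<>();

    // Classes loaded in the first few seconds after launch; used later
    // when building the application image for faster startup.
    final Set<String> startupClasses = new HashSet<>();

    // Whether this APK's code was loaded by other applications,
    // i.e. it behaves like a shared library.
    boolean usedByOtherApps;

    // Profiles expand over time: each run can only add information.
    void mergeFrom(AppProfileSketch other) {
        hotMethods.addAll(other.hotMethods);
        startupClasses.addAll(other.startupClasses);
        usedByOtherApps |= other.usedByOtherApps;
    }
}
```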
So let's focus now on the compilation daemon, the service that actually does the compilation, and let's take a look at what decisions it makes. This is a service which is started at boot time by the system and is scheduled to run daily. Its main job is to iterate through all the APKs installed in the system and figure out whether or not we need to compile them, and if we do need to compile them, what sort of strategy we should use. The service wakes up when the device becomes idle and charging, and the main reason for that is that we don't want to use user time when the device is active, and we don't want to waste battery. So we delay this until the device is not in use anymore.

When the daemon wakes up, it iterates over the applications. The first question it asks is whether or not the application's code has been used by some other apps -- the data I was telling you that we record in the profile. If that's the case, then we perform a full compilation to make sure that all the users benefit from the optimized code. If it's not -- and this is probably the case for the largest percentage of apps: it's a regular app, it doesn't get used by some other apps -- we go in and perform a much deeper analysis of the profile information. If we have enough new data, if we collected enough information about the application, then we will profile-guided compile that application. We'll take a look at the profiles and then optimize only the methods that were executed, so that we focus on what the user actually used from that application. If, by any chance, we don't have enough information -- let's say we only have data about one single method from the app -- then we'll just leave it as is, because it's probably not worth optimizing. An important thing here is that we perform the profile analysis every time that we run the daemon. And what that means is that we might end up recompiling the app again and again until we no longer have any new information about it.
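Put as a sketch in Java, with invented names and thresholds, the nightly decision just described looks roughly like this; it is not the actual framework code.

```java
// Rough, hypothetical sketch of the compilation daemon's per-app decision.
final class CompilationDaemonSketch {

    enum Action { FULL_COMPILE, PROFILE_GUIDED_COMPILE, SKIP }

    static final class ProfileSummary {
        boolean usedByOtherApps;   // does the APK behave like a shared library?
        int hotMethodCount;        // how many methods the profile marks as hot
    }

    private static final int MIN_NEW_HOT_METHODS = 10; // invented threshold

    Action decide(ProfileSummary current, ProfileSummary lastCompiled) {
        // Code used by other apps behaves like a shared library:
        // compile everything so every client benefits from optimized code.
        if (current.usedByOtherApps) {
            return Action.FULL_COMPILE;
        }
        // The profile is re-analyzed on every run: only recompile when enough
        // new information has accumulated since the last compilation.
        if (current.hotMethodCount - lastCompiled.hotMethodCount >= MIN_NEW_HOT_METHODS) {
            return Action.PROFILE_GUIDED_COMPILE; // compile only what was actually used
        }
        // Too little data (say, a single hot method): leave the app as it is.
        return Action.SKIP;
    }
}
```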
And here you can see what I was talking about: different use cases get different compilation strategies. Shared libraries get a full compilation, whereas regular apps prefer a profile-guided compilation. In N we generalized on that, and different use cases from the lifecycle of the application have different compilation strategies. For example, at install time we don't have profile information, yet we still want the application to start as fast as possible. And as my colleague Nicolas mentioned, we have a strategy where we extract and verify the app with minimal running time, which will ensure that the application starts fast. When you update the app we have the same story: we no longer have a valid profile, because what was recorded before is now invalid, so we repeat this extraction and verification procedure. When the compilation daemon kicks in, we'll do a profile-guided compilation where possible. And for system and shared libraries we're going to do a full compilation, so that we make sure that all their users are properly optimized.

With that in mind, let's take a closer look at the benefits of profile-guided compilation. All the benefits share the same root cause: we only optimize what is being used. And what does that mean? When you first start the app after the compilation happened, previously hot code is already optimized. We no longer have to wait for methods to become hot so that the JIT can compile them, so the application will start faster. There is also less work for the JIT, and that means we use less CPU and we increase battery life overall. And because we are much more selective about what we optimize, we can dedicate more time to it and apply more optimizations. Besides that, we get a smaller size for the compiled binary. That's a very important thing, because we map this binary into memory. A smaller size translates to a reduced memory footprint, and the important difference here is that the binary that we map into memory will be clean memory, compared to the dirty memory that the JIT generates. We also use far less disk space, because the binaries are much smaller now, and that frees a lot of space for our users. How much space?

Let's take a look at the numbers. In this chart we compare different applications: Google Plus, Play Music, and Hangouts. And we tracked the size of the generated binary for the compiled code across Marshmallow, which is the blue line; the N preview during the first boot, which is the equivalent of a fresh install of the application, which is the orange line; and the green line, which is how much space the profile-guided compilation takes. As you can see, the reduction is more than 50%. Obviously, the green line will go up over time, but our tests show that it actually stays around 50% most of the time. And you may wonder, how come we get such a great reduction in terms of size? Well, when we analyzed the profiles, we realized that only around 4% to 5% of the methods actually get compiled. And as we use the app more, obviously this percentage will go up and we will generate more code. But, as I said, in general we stay below 50%, or around that area. Now, a natural question here is: if I'm only compiling 5% of the app, how come I reduce the space only by 50%? Why not 95%? Well, those lines also contain the application's own code, the Dex code. And that's a line below which we cannot go, because we need something to run. And here is how we compare to the application size. You can see that, in Marshmallow, we generated more than 3x the code size, whereas in N we stayed below 1.5x.

These are all cool benefits, but that is actually not the only thing we use profiles for. We also use them to further speed up system updates. As my colleague Nicolas mentioned, because of the JIT, we don't need to recompile the app again, and that basically gets rid of the long waiting time for the optimizing-apps dialogue. We still want to do some processing of the app, in particular extraction and verification, to ensure that those apps get executed as fast as possible when they are first launched. And in N we actually know how those apps were executed before, so we can use the profile to guide the verification. And that saves around an extra 40% of the time. We also added new, improved usage stats. Compared to M, we can now track precisely how the application was used and how it was executed. And we only analyze what actually matters for the users. What is that? Applications that have a user interface the users can interact with. Those are the most valuable for our users, and we focus on them during system updates. However, when we take the first update to N, we don't have access to all the goodies. We don't have profiles, and we don't have accurate enough usage stats to know how the app was used. So what we do is a full verification of most of the apps. This is still much, much faster than what we used to do in M. How much faster? Let's take a look at the numbers. You can see here three different lines, and these numbers are obtained on the same device, which has roughly 100 applications installed. The first line represents an OTA -- a system update from M to M prime, where we took a security update -- and we took around 14 minutes to process all the applications. That's the time the runtime spends processing them. When you take the update to N, we reduce the time to about three minutes, which is up to a 5x reduction. And there we verify most of the applications. Now, the next step: security updates will kick in for N. And there we already have profile information, we already have improved usage stats, and we can use that to drive that time even lower.
And when you take a security update on this device, we take less than a minute. Compared to M, that's more than a 12x improvement in terms of speed-up. With that I'd like to pass it to Mathieu, and he will explain how we use profiles to speed up applications even more. [AUDIENCE APPLAUSE]

MATHIEU CHARTIER: Thank you, Calin. One new feature in N that reduces application launch time is application images. An application image is a serialized heap consisting of pre-initialized classes. This image is loaded by the application during launch. During launch, most applications end up loading many classes for initialization work, such as creating views or inflating layouts. Unfortunately, loading classes is not free and can make up a large portion of the application launch time. The way that application images reduce this cost is by effectively shifting work from application launch time to compile time. Since the classes inside the application image are pre-initialized, this means they're able to be accessed right off the bat by the application. Application images are generated by the AOT compiler during the background compilation phase -- I think it was Calin who referred to it as the maintenance mode. Leveraging the JIT profiles, the application images include and serialize only the set of classes that were used during prior launches of the application. Using the profile is also key to having a small application image, since it only includes a small fraction of the classes inside the actual application. Having a small application image is important because the application image is resident in RAM for the entire lifetime of the application, and larger images also take longer to load.

As you can see here, application images have a very low storage requirement. This is mostly due to the profile. For the four apps here, the storage requirements were less than two megabytes per app. As a comparison, I put in the application code compiled with the profile, so this is already reduced compared to what application code sizes would have been on M. The loading process of application images begins with the application ClassLoader creation. When the application ClassLoader is created, we load the application image from storage and decompress it into RAM. For dynamically loaded Dex files, we also verify that the ClassLoader is a supported type. Since there is no dependency from the application's compiled code to the image, it means that we can reject the image. So if the ClassLoader is not supported, we simply reject the image and resume execution. Here are some results for application image launch time improvements for the four apps that I just displayed. As you can see here, there's around a 5% to 20% improvement compared to profile-guided compilation only.
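To see why pre-initialized classes help, consider a hypothetical class like the one below. Without an application image, its static initializer has to run during launch, when the class is first loaded; with an application image built from the startup-class profile, an already-initialized copy of the class can simply be mapped in. The class and its contents are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example of class-initialization work that normally runs at
// launch. Loading this class triggers its static initializer (<clinit>),
// which builds the map below; a pre-initialized class in an application
// image avoids running any of this on the launch path.
public final class MimeTypes {
    private static final Map<String, String> EXTENSION_TO_MIME = new HashMap<>();

    static {
        EXTENSION_TO_MIME.put("png", "image/png");
        EXTENSION_TO_MIME.put("jpg", "image/jpeg");
        EXTENSION_TO_MIME.put("pdf", "application/pdf");
        // ... typically many more entries, or more expensive setup
    }

    public static String mimeTypeFor(String extension) {
        return EXTENSION_TO_MIME.getOrDefault(extension, "application/octet-stream");
    }
}
```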
And now to the garbage collector. The garbage collector has not changed too much since our last I/O presentation. As you can see here, we were already in pretty good shape back then. There is still only one short GC pause, and the garbage collector has high throughput since it is generational. There have been a few GC changes, mostly related to application images and class unloading; however, these GC changes do not have a substantial performance impact. One thing that has improved with each release is allocation time. With L we introduced a new, custom thread-local allocator that reduced allocation time substantially by avoiding synchronization costs. In N we removed all of the compare-and-swap operations in the allocation common case. Finally, also in N, the allocation common case has been rewritten in hand-optimized assembly code. This is the largest speed-up yet, at over 200% compared to M. Combined, all of these improvements mean that allocations are now around 10x faster than on KitKat, which was Dalvik.

Class unloading, also introduced in N, is a way to reduce RAM for modular applications. Basically, what this means is that classes and class loaders can be freed by the GC when they're no longer used. In previous Android versions, these were held live for the entire lifetime of the application. For class unloading to occur, all the classes in the ClassLoader must no longer be reachable, as well as the ClassLoader itself. When the GC notices this, it frees them, as well as the associated metadata. This chart demonstrates a bit of how the ClassLoader interacts with other components and retains them. With that, off to Nicolas for the wrap-up. [AUDIENCE APPLAUSE]

NICOLAS GEOFFRAY: Thank you, Mathieu. All right. So to wrap up, we've shown you all the features we've worked on for N: mainly a faster compiler, up to 5x compared to previous releases; a faster interpreter, up to 3x compared to Marshmallow; a new JIT compiler and profile-guided compilation, which together removed that optimizing-apps dialogue and provide faster installs; and application images for fast startup and fast applications. If you're interested in all that sort of low-level stuff, you can actually follow it on the AOSP website, where we do all the development. With that, we'll take your questions. Thank you. [AUDIENCE APPLAUSE] [MUSIC PLAYING]
Info
Channel: Android Developers
Views: 9,946
Rating: 4.9039998 out of 5
Keywords: Android, ART, runtime, Android platform, Android runtime, app, apps, mobile development, performance, compiler, profile guided compilation, io16, product: android, Location: MTV, Team: Scalable Advocacy, Type: Live Event, GDS: Full Production, Other: NoGreenScreen
Id: fwMM6g7wpQ8
Length: 41min 19sec (2479 seconds)
Published: Wed May 25 2016