Google I/O 2014 - The ART runtime

Captions
ANWAR GHULOUM: Thanks for coming to the talk. We're going to talk about ART, the runtime in L that replaces Dalvik. Hopefully you'll leave with a better understanding of what ART is and what improvements you can expect. We'll also talk a bit about 64-bit support in ART, and in the platform, at the end. I'll be presenting today along with my colleagues Brian Carlstrom and Ian Rogers from the Android runtime and tools team. I'm Anwar Ghuloum. They'll talk about different aspects of the runtime, and they'll be coming up pretty shortly. First, why ART? When Dalvik was originally designed, the processors we were targeting were single core, with relatively poor performance and relatively little flash and RAM available. We had a compact bytecode that was designed for fast interpretation, a relatively unsophisticated GC, and it wasn't designed for multiprocessing. But things have changed since then. We've got phones shipping with eight cores. We've got devices that have improved over the G1 by 50x or more in terms of raw CPU performance. We've got 4x more RAM, sometimes more, 64x more flash, high resolution screens, and extremely powerful GPUs. We made a lot of changes in Dalvik to improve things: we added a JIT, we added a concurrent garbage collector, and we added support for multiprocessing. But frankly, we felt the ecosystem was outpacing us and that we had to do more. So ART was born. Our motivation for ART is that you shouldn't notice the runtime at all. Your app should be buttery and smooth, even though it's using GC. Code should be fast without having to resort to JNI or native code. Startup time should be fast, without having to warm up a JIT, and so on. The runtime in general should scale to the parallelism of modern processors, as well as the complexity of modern mobile applications. In other words, it should be a really solid foundation for the future. So in this talk, we're going to start by telling you about performance and compilation in ART: how ART compiles ahead of time, and what kind of performance improvements that brings you. We'll then talk about the garbage collector and how GCed apps can be buttery and smooth. And then we'll talk a bit about 64-bit. With that, I'll hand it over to Brian. BRIAN CARLSTROM: Thanks, Anwar. So I'm going to start by talking about a user. We have this public bug that's one of our favorites on the ART team: a user wanting a performance-boosting thing. If you read the details in the bug, they want to take their existing phone and their existing apps and just get a system software update. They want faster performance for their apps, they want better battery life, and they want to use less memory so they can multitask more. We really like this bug because it matches what Anwar said about the ART manifesto: transparently improving performance for apps without people having to do anything. So what can we do about that? As we've already said, we started with Dalvik, and it had some performance-boosting things. It had a fast interpreter. A JIT was added, another performance-boosting thing. So what can we do for ART? What can we do going forward? Well, Dalvik was built really targeting ARM. In modern devices, we have a lot more ABIs we have to support, and we want to do potentially different optimization levels for different use cases.
So rather than just trying to build one performance thing, we want to build a framework of performance-boosting things, and so we're building a flexible compiler infrastructure that allows us to do that. This is just a little box diagram of the ART compiler. As you'd expect, we have a front end that reads dex bytecode for compilation. We have a compiler driver, and what's interesting is that it dispatches to the various different compilers that are available in L. We have three main compilers that we're using in L, but there are already seeds in the AOSP code of other compilers that we're working on for the future — seeds of future performance-boosting things. What we have for L is a compiler we call the Quick compiler, which is based on the old Dalvik JIT. It does many of the same things, but it does more than that, and we'll talk about that in the next slide; it focuses on compiling your dex code to native ARM code, x86 code, MIPS code, et cetera. We also have a JNI compiler that builds the bridges from your dex code to any native code you've written with the NDK. And finally, we have a dex-to-dex optimizer, which just quickens code that we decide not to compile and only want to interpret quickly. In addition, we have numerous back ends, so we're supporting the 32-bit ABIs that Android currently supports, as well as the new 64-bit ABIs that are coming in L. So what are some of the other benefits of doing ahead-of-time compilation? One big difference between ahead-of-time compilation and the JIT we had in Dalvik is that we can potentially optimize the whole program, instead of just focusing on small traces of the program, loops, and things like that. The kinds of optimizations you do when you're targeting small parts of the program, like loops, are very different from the ones you do for the whole program, and we try to do optimizations that apply generally to the kind of object-oriented programming you might be doing on Android. Some examples: we try to make virtual method calls faster when possible; we try to lower the overhead of interface invocations; and we try to avoid some of the implicit checks that need to be done when you're calling from one class to another to make sure it's initialized — when you're calling constructors, accessing static fields, or calling [INAUDIBLE] methods. And finally, there are a lot of implicit checks that get done in managed languages — null pointer checks, stack overflow checks, and other uncommon cases like that — so we try to make all of those common-case things faster to improve the general performance of your program. But we've also seen some other benefits from ahead-of-time compilation, the things that user wanted. We do see better battery life, and there are two reasons for that. One is that because we compile at installation time, we compile once, as opposed to a JIT-based environment where every time you run the program it typically re-JITs, rediscovers things, and redoes work — it's wasteful to do the same compilations every time you run your program. The second thing is that generally, faster-performing code saves battery, because you can run more efficiently and get the processor back to a lower power state quicker.
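To make the JNI compiler's role a bit more concrete: the bridges it builds start, on the managed side, from native method declarations like the one below. This is only a minimal sketch; the library name and the method are made up for illustration.

    // A minimal sketch of the managed side of a JNI bridge. The library name
    // ("blur") and the method are hypothetical; ART's JNI compiler generates
    // the stub that crosses from this declaration into the native code.
    public class NativeBlur {
        static {
            System.loadLibrary("blur"); // loads libblur.so built with the NDK
        }

        // Declared in Java, implemented in C/C++ inside the NDK library.
        public static native void blur(int[] pixels, int width, int height, int radius);
    }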
We also see better memory behavior — remember Project Svelte from last year, the effort to run better on lower-memory devices. The reason we see better Svelte numbers is that the compiled code is written out to disk and can be paged in and out by the kernel on demand. If there are a lot of applications running and multitasking, the kernel can manage paging things in and out, as opposed to the JIT, where there was a set of private pages per application which couldn't be paged out. That meant that if the kernel needed to free up more memory, it basically had to kill a process, and you'd lose one of the processes you were trying to multitask between. So how do we do all of this? This shows the life cycle of an APK, the Android package that goes from developers to users. The typical steps in an APK's life cycle are: you take your source code, you compile it into a dex file, you can combine it with native code from the NDK and with resources, and you put it all in a zip file we call the APK. That gets distributed through Google Play to the user. On the user's side, they install the application, and the native code and the resources generally get unpacked into the application directory. But the dex code requires further processing, and this is true both for Dalvik and for ART, as you see here. Dalvik ran it through a tool called dexopt during installation, which already did some ahead-of-time work, like ahead-of-time class verification, and again tried to quicken things to improve performance for the interpreter. dex2oat is the tool that ART uses — again, transparently, behind the scenes. It compiles the application: in addition to doing some of what Dalvik did in terms of verifying and quickening things ahead of time, it also compiles to native code for the particular target device. The key takeaway is that from a developer's point of view and a user's point of view, the flow of how the APK works is the same, and hopefully you just get a faster, better experience. This is a little picture of what it looks like when you're running dex2oat. We actually start up a minimal runtime that loads the framework code and some of the commonly used framework objects, so that when we compile the code we have an idea of what the Android system you're running on looks like and can make optimizations specific to that target device. Once we've initialized this framework environment, we go ahead and load the compiler, compile the code of your application, and generate an ELF file. So what does this ELF file look like? This is an overview of what these files look like. ELF is a standard file format for representing executable code, and we have the sections you'd expect: a symbol table, read-only data, and a text section that contains the actual code. One difference between how we use ELF files and how a C or C++ style language would use them is that we don't use symbols to find every single entry point. A Java program typically has many, many methods, and having a unique symbol for each one would be too big. So instead, we use the things that are already in the dex bytecode — type indexes and method IDs — to identify things, and we have an oat data structure that we instead look up using the symbols in the symbol table.
And once we have that data structure loaded, we can quickly navigate, when we're loading a class, to find all the methods for that class and link them up so we can execute the code quickly. One other thing we do that's a little bit different is that we actually keep the original dex file around in the metadata of the ELF file. We use it first for the class metadata when loading classes, but we also use it for other purposes, such as when you want to run the debugger and single-step through your code: we keep the original code around so we can give you an accurate debugging experience. One final thing to note is that the compiled code for the class — the box at the bottom — actually does have direct links into framework code as well as framework objects, so we can call things efficiently and not have to do runtime lookups to find references to framework code, which generally speeds up application execution. So this is a time-based view of what a compilation looks like. On the lower left, we have a systrace of a compilation of a pretty meaty app, running on a four-core device, and I've tried to break out the colored boxes on the lower left to the right so you can get an idea of what's going on, and I'll talk through it. You can see that most of it gets good parallelism during compilation — that's part of taking more advantage of modern devices, whereas dexopt and the other tools for Dalvik were more single-threaded. We do have a serial phase at the beginning, where we extract the dex files for processing, but most of the passes after that are parallel. The first pass is class resolution, where we basically load all the classes and load their fields and methods. This is where the compiler learns what the layout of a class is and what the layout of the vtable is for the methods in the class, so we can use that later during compilation. The next phase is class verification: we walk through the bytecodes of the class, and for almost all methods we encounter, we find that they verify, so we can mark the class as verified and, again, skip doing that work during every run. There are cases where we have failures during verification. For example, if you install an APK that's meant for the L release on KitKat, it may reference new APIs from the L release. We deal with that by noticing this [INAUDIBLE] as a soft failure during verification and marking that one thing, so later on, when the compiler compiles, it knows, "well, I'm not sure if this code is available or not," and puts in a slower path so we can do the correct thing and handle it if the code actually goes down the path of calling a method that doesn't exist in this release. If all that passes, we do a pass called the initialization pass, where we check any class that was properly verified: if it verified OK and has no class initializer, we go ahead and mark it as initialized ahead of time, so that when we're compiling code that references that class, we know we don't have to do any class-initialization checks — again, another optimization for object-oriented programming.
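At the source level, the usual way developers handle that same situation — code that may reference an API which doesn't exist on the installed release — is to guard the call on the SDK version, so the guarded branch is simply never taken on older releases. A minimal sketch, with View.setElevation used only as an example of an API added in L:

    import android.os.Build;
    import android.view.View;

    // Guarding a call to a newer API so the same APK still verifies and runs
    // on older releases; the unreachable branch corresponds to the slow path
    // the compiler keeps around for soft verification failures.
    public final class ElevationCompat {
        private ElevationCompat() {}

        public static void setElevationIfSupported(View view, float elevation) {
            if (Build.VERSION.SDK_INT >= 21) { // 21 is the L API level
                view.setElevation(elevation);
            }
        }
    }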
And finally you can see that we've spent — not quite half the time, but a lot of it — down in that purple section during verification, and now we finally start compiling. This is where the compiler driver takes over and directs the different methods to different compilers, potentially choosing to compile to native code or JNI bridges, or just doing the dex-to-dex optimization for things that aren't commonly run, like class initializers, so we don't waste time or space compiling them. Finally, we have a link step where we combine the compiled code, any references to the framework code, and the metadata; we link that together and write out our ELF file. So what does this get us? This is the slide from yesterday's keynote that I wanted to go into in a little more detail. The blue bars are the Dalvik JIT as a baseline — this is on a Nexus 5, an AOSP build from recently, around L snapshot time — and the red bars show ART's performance. You can see that across our collection of common benchmarks, we're doing better than the Dalvik JIT. The lowest number there is the AnTuTu composite score, and I want to talk about that for a second, because it's a composite of some things that run dex code and some things that run native code. We can only improve the parts of that composite benchmark that are running dex code, and we do get some improvement in those areas. A lot of the other benchmarks show this general band of improvement; many of those are historical benchmarks, like Dhrystone, ported from C or Fortran or something, and they're not really written in a Java style. What we really like is the chess benchmark score, because that's more reflective of the benefits an Android developer writing code in a modern way will see, and I think that's hopefully more representative of what you'll see for most applications doing compute on Android. So with that, just a reminder: this is a snapshot of where we are now. We have additional work we've been doing in AOSP, and hopefully some of that will make it into L, so we'll have more performance-boosting things before we ship. And with that, I want to introduce Ian Rogers, who's going to talk about other performance-boosting things: the ART garbage collectors. IAN ROGERS: Thank you, Brian. So I'm going to talk about garbage collection, and basically how we achieve fast allocation. What we've really been focusing on is less jank — giving you a buttery experience when you're using Android — and minimizing the memory footprint. So what is the memory manager trying to do? We're trying to allocate objects for the application; we need to track, on behalf of the application, whether those objects are still in use, whether they're alive; and we need to reclaim them when they're no longer in use. We want to do all of this, and we want to do it for free. A real advantage of the Android ecosystem is that this burden is taken away from the developer. So let's think about memory management schemes. There's no approach which is free. If you have native code, you have to worry about when you allocate objects and when you're going to reclaim them; you have to worry about complex data structures which might have cycles in them; and you have to worry about multithreaded issues, like when an object is accessed by multiple threads. So there are costs associated with that. If you're using reference counting, then you can find that a thread that lowers a reference count suddenly ends up having to reclaim all of the garbage in the known universe. All of these things can create a lot of work on an application thread and cause significant jank and problems for the application developer.
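As a toy illustration of one of those costs: a cycle of objects like the one below is never freed by naive reference counting, but a tracing collector reclaims it as soon as it becomes unreachable. This sketch is not from the talk; the class names are made up.

    // Toy illustration: two objects referencing each other. Naive reference
    // counting leaves their counts at one forever; a tracing GC reclaims both
    // once nothing outside the cycle can reach them.
    public final class CycleDemo {
        static final class Node {
            Node next;
        }

        public static void main(String[] args) {
            Node a = new Node();
            Node b = new Node();
            a.next = b;
            b.next = a;   // cycle
            a = null;
            b = null;     // the cycle is now unreachable and eligible for collection
            System.gc();  // only a hint; the runtime decides when to collect
        }
    }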
So what we've been focusing on in ART is making GC something that doesn't get in the way of the application and doesn't get in the way of the application developer, so they're freed from worrying about it. Our principal consideration has been how to reduce jank, which comes from pause times in the garbage collector, and I'll go into detail about that. So let's look at what's going on inside a garbage collection. Here we have some threads — the wiggly lines. We have a heap, which is the rectangle, and inside the heap are green and red rectangles: the green ones are live objects, objects in use by the application, and the red ones are dead objects, objects the application can no longer access. The blue object on the left is an object we're trying to allocate. So the job the garbage collector has to do is, at allocation time, to find places for objects to fit so they can be allocated; and at GC time, it has to go and find all of these live objects, the green objects, and having found all of them, it then has to go and free up the rest. So let's see what happens inside Dalvik. When it's finding this initial set of live objects, Dalvik suspends the application threads. It also suspends threads inside the virtual machine, so we tend to refer to these threads as a whole as mutator threads — they mutate the state of the Java heap. Dalvik suspends these threads so that it can crawl their stacks and find what objects those stacks are referencing. Once it's determined this initial reachable set, it traverses it in a concurrent phase, so the garbage collector is running alongside the application. Once it's found all of the objects, it then does another pause. The reason for this pause is to make sure that the application wasn't playing any games with the garbage collector and hiding objects which the garbage collector thought it had already processed; in this phase, the card table is examined by the garbage collector. Finally, having determined that everything has been marked, the garbage collector restarts all the threads and concurrently sweeps and frees up the dead objects. In Dalvik, the pauses have been highly optimized: the first pause is around 3 to 4 milliseconds, and the second pause is around 5 to 6 milliseconds, but the sum total is about 10 milliseconds. If we're going to achieve 60 frames per second, we need to be able to render each frame in 16 milliseconds, and taking 10 milliseconds of garbage collection pauses out of that 16 milliseconds means we get dropped frames. So here is a systrace of Dalvik running the Google Play Store. We're doing a fling — scrolling down and allowing lots of bitmaps to scroll by — and the bitmaps are causing objects to get created and collected, triggering GCs. At the top, you can see what's running on each individual CPU. Beneath this are the vsync events, which are the screen refresh events. Below this, we have the vending application — that's the process name of the Play Store application — and it's rendering frames; you can see that in the dark green. Ideally, with every vsync event there would be some dark green. But you can see in the center here that the dark green isn't there, and that's a dropped frame.
At the bottom, we've got a view of what's going on inside Dalvik, and we can see that there are GCs going on — explicit GCs, back to back — and there are pauses within them. We can see that the second pause came in just before the vsync event where we had the dropped frame. So this isn't good; this isn't giving us the buttery experience you want on Android. One of the things we saw on that systrace was GC cycles happening back to back. Why was this happening? Well, the application was trying to avoid a pathology in Dalvik's behavior, where Dalvik tries to avoid heap fragmentation and avoids growing the heap, because it's worried about large objects getting reclaimed and then small objects coming in and fragmenting the heap, so that there's no longer anywhere a large object can fit. This is shown on the graphic here: we've got the large blue object, we've failed to find somewhere to fit it, we're worried about fragmentation, and so Dalvik suspends all of the application threads and does a GC-for-alloc, to go and free up all of the objects it can and hopefully do a better job of fitting this large object into the heap. What does that look like for the application? It looks like one large pause, and these pauses are typically around 50 milliseconds, so that translates into four dropped frames. Here we can see another systrace, of the Google Maps application this time, and we've got two GCs happening at the bottom. In this situation, the two GCs have caused a pause of around 60 milliseconds, and 4 to 5 dropped frames. So for us, everything is awesome — we've fixed all the problems. Open bugs if we haven't, please. [APPLAUSE] We've made the garbage collector faster with new garbage collection algorithms. We've reduced the number of pauses and made the pauses themselves shorter. We've made ergonomic decisions and adopted strategies to reduce the amount of fragmentation, so that we no longer need the GC-for-alloc. And we've done all of this with support for things like moving and compacting collectors, in ways that allow us to use less memory. So what do pauses look like in ART's garbage collector? Going back to the earlier graphic, we've gotten rid of the first pause. What we do to create the initial set of objects that the garbage collector needs to traverse is that the garbage collector requests that the application threads go and mark their own stacks. This is a very short thing for the application threads to do, and after they've done it they can carry on running, so there's no need to suspend all of the application threads. This checkpointing, as it's called, gets done first, and the garbage collector waits for it. Then it enters the concurrent phase that we had before in the Dalvik CMS, where we go and mark all of the objects. Then we have the second pause, and we've managed to make that pause a lot shorter than the Dalvik pause. We've done this with tricks like [INAUDIBLE] cleaning, where we take work out of the pause, do it ahead of time, and then just double-check during the pause that what we did was correct. If we look at a systrace — this is going back to the Google Play Store application, doing a fling — we've got the vsync events and frames being rendered, we've got a garbage collection pause happening just where the single pause is indicated, and no dropped frames. The other thing you can see on this systrace is that the back-to-back GCs we were seeing in the Play Store before aren't occurring. That's because ART doesn't suffer from the fragmentation problems that Dalvik had, so we now treat the explicit system GCs as optional and only use them as a hint that we should trigger a garbage collection cycle, whereas in Dalvik they would always trigger a concurrent garbage collection cycle. So there are no dropped frames here, and that single pause is measuring about 3 milliseconds — a lot faster than the pauses you were seeing in Dalvik.
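Shorter pauses don't make allocation free, of course; a common way to keep the GC out of the frame loop entirely is to avoid allocating in the draw path in the first place. A minimal sketch of that pattern (the view and field names are made up, not from the talk):

    import android.content.Context;
    import android.graphics.Canvas;
    import android.graphics.Paint;
    import android.graphics.RectF;
    import android.view.View;

    // Reusing objects across frames keeps the allocation rate, and therefore
    // the number of GC cycles, down while scrolling or animating.
    public class MeterView extends View {
        private final Paint paint = new Paint(Paint.ANTI_ALIAS_FLAG); // allocated once
        private final RectF bounds = new RectF();                     // reused every frame

        public MeterView(Context context) {
            super(context);
        }

        @Override
        protected void onDraw(Canvas canvas) {
            // No allocation in the draw path: mutate the reused RectF instead.
            bounds.set(0, 0, getWidth(), getHeight());
            canvas.drawOval(bounds, paint);
        }
    }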
So how did we get rid of the GC-for-alloc events? The problem we were having in Dalvik was that these large objects were coming in, and in the common case those large objects were bitmaps. So what we do in ART is move the bitmaps out of the main heap and put them into their own separate space — still part of the managed heap, but a separate area especially dedicated to large arrays of primitive types. This is also useful for us because, by definition, arrays of primitives don't have references to other objects, so we can improve the bookkeeping and the performance of the garbage collector as a consequence. What does this look like for an application like Google Maps? On the left, we've got the concurrent mark-sweep GCs happening in Dalvik: the first pause around 3 to 4 milliseconds, the second pause around 5 to 6 milliseconds, and the total coming in at just under 10 milliseconds. These are times from a Nexus 4. The GC-for-alloc events were typically much worse — the average one was around 54 milliseconds, and that translates into a lot of dropped frames. What does ART look like in the same test scenario? Well, the average pauses come in at about 2.5 milliseconds, so that's around four times faster. We also found that one of the areas where we needed to improve was allocation performance. We were working with the Google Spreadsheets team internally inside Google, and they were really excited by ART because they'd been experiencing some bad performance. So they switched over to using ART, but they didn't realize the full performance gain they were expecting to see. You can see that on this bar chart: on the left, normalized to one, is the Dalvik performance, and when they switched over to ART they got a 25% performance improvement on a Nexus 4. However, they wanted more — they wanted the 2x, they wanted the 4x. So there were other things we did in ART, having seen this. One was that we started specializing the allocation paths, so that the allocator can allocate certain classes of objects faster than others by knowing things like the size of the object ahead of time. This achieved a performance win, but not the dramatic performance win we would really have liked.
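For a feel of what "allocation performance" means here, below is a rough micro-benchmark sketch — not the benchmark the team used, just the pattern it measures: lots of small, short-lived objects allocated on a single hot path.

    // A rough allocation-throughput sketch (not the team's benchmark).
    // Small, short-lived objects like these hit the fast allocation path.
    public final class AllocBench {
        public static void main(String[] args) {
            final int iterations = 10000000;
            long start = System.nanoTime();
            Object sink = null;
            for (int i = 0; i < iterations; i++) {
                sink = new byte[32]; // small, short-lived allocation
            }
            long elapsedMs = (System.nanoTime() - start) / 1000000;
            System.out.println("Allocated " + iterations + " objects in "
                    + elapsedMs + " ms (" + (sink != null) + ")");
        }
    }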
So the next thing we did was something we'd wanted to do for a long time: we implemented our own memory allocator. Dalvik — and ART in KitKat — were based on top of Doug Lea's memory allocator, which is the common one used in Unixes and so on. It supports [INAUDIBLE] free, but it's not really tuned for Java, and it's not really tuned for a multithreaded environment, so to make sure it behaves in a correct manner, it has a single global lock around it. That means the garbage collector could be freeing things, and that would conflict with the application trying to allocate objects; and if you had multiple allocations happening at the same time, they would get run sequentially. RosAlloc, as we call it — the runs-of-slots allocator — has a number of optimizations to improve performance. The first is that small objects, the small temporary objects that are so common in languages like Java, get allocated in a thread-local region. Because the region is thread-local, allocations can occur with no locking and none of that sequential behavior. For larger objects, the bins that they fit into have their own individual locks, so again, contention for the locks is reduced and there's a lot more parallelism. This translated into a huge performance win for this benchmark, and you can see that the memory allocation performance improved 10x. We've also been working on a number of different garbage collection algorithms for ART — we're making ART into a framework that developers and partners can come along and contribute to and improve upon. The first new garbage collector that ART added was the sticky garbage collector. The sticky collector takes advantage of the fact that we know which objects were allocated since the last GC cycle, and because the generational hypothesis tells us that these objects are the most likely to be freeable, we focus our energy on them. Because garbage collection time is proportional to the amount of live data we have to process, and we only consider the data allocated since the last GC, the sticky GC typically runs 2x or 3x faster than the regular garbage collector while reclaiming about the same amount of memory. When the sticky garbage collector can't run, we run the regular CMS collector on either the application's heap, or the application's heap plus the zygote's. As for the ratio between the sticky garbage collector and the regular concurrent garbage collector, we tend to run the sticky one five times for every one time we need to run the partial one. We're also working on moving garbage collectors, which is where this slide comes in. What are moving garbage collectors? They do the same root set creation and marking of objects, but during the sweeping phase they move all of the live objects together, which basically removes all of the fragmentation. We do this in ART via two approaches: we have a semi-space garbage collector, where we create a new memory region and evacuate the objects out of the heap into that newly created space; and we also have a mark-compact algorithm coming online at the moment, where we compact things in place. Moving collectors are good — there's less fragmentation, and that saves heap. The downside is that if you're going to be moving objects around, the effect of moving an object has to be seen by the application atomically. There are two approaches you can take to that: you can introduce read barriers, so that the read barrier handles the fact that the object is moving around in the heap; or, the most common approach, you can just suspend all of the application threads again, which isn't good from a GC jank point of view. So we want to use the moving collectors to compact the heap and get these memory savings. When can we do it?
Well, we know certain things about how applications run on Android. We know that it's worthwhile to compact the heap during zygote startup, because if we save space in the zygote, we save it for every application after the Android runtime is up. The other place we can use compaction is when applications go into the background — when you press the Home button. So we have various ergonomics determining when it's a good idea to switch the garbage collector over to a compacting algorithm, and various means of determining whether an application can perceive jank. We don't want to take something like the Play Music application — which isn't a foregrounded application — and cause jank in it, because that would give you choppy audio. So hopefully we've done our job right, it's all working, and you'll get less memory usage. With that, let me hand back to Anwar. ANWAR GHULOUM: All right, thanks Ian. So, a few words about 64-bit support coming with L. It's one of the major features of L. Partners have been shipping, or are readying to ship, 64-bit SoCs, and we're working pretty closely with them to make the 64-bit userspace happen. So why 64-bit? Well, you might argue that we don't need the additional address space now, but if you look at the keynote yesterday, Android usage is diversifying pretty rapidly, from wearables to TVs, and bearing that in mind, there's definitely value today in supporting 64-bit. We see nice performance gains in 64-bit apps running on 64-bit capable cores. There are new instructions — domain-specific instructions for media and crypto — where we get huge speedups that can benefit the entire platform. But even in general-purpose apps, we're seeing nice speedups. So we'll take a look at that first, and then I'll get into a few details of how we make this work. As I said, we expect compute-intensive work to take the biggest advantage of this, and we're working closely with ARM, Intel, MIPS, Qualcomm, and others to make sure that we really deliver on it. What I'm showing you here are graphs of the speedup of 64-bit apps over 32-bit apps on the same silicon. On the upper left-hand side as you face the screen, I'm showing you speedup in terms of a multiplier. This is on Intel's Bay Trail processor, a four-core SoC, for some custom RenderScript scripts, and we're seeing some really nice speedups going to 64-bit, up to and over 4x. For crypto, we use OpenSSL as the primary crypto engine. If you look at ARMv8 support on Cortex-A53 and Cortex-A57 for the OpenSSL speed benchmark, we're seeing some really nice speedups there as well. Again, this is used throughout the entire platform: any application that's using Android crypto is probably going to benefit from it. Those are multipliers on the vertical axis — we're seeing about a 15x improvement there. And even for native code, we're seeing some nice improvements. The panorama benchmark is a benchmark we use internally to evaluate toolchain updates; it's meant to be representative of a user taking a bunch of photos and then waiting for them to get stitched together into a nice little panorama. What we see is that on ARM Cortex-A53 and A57 cores, simply by recompiling for 64-bit, we get up to a 20% boost. That means the end user sitting there waiting for a panorama gets it 20% faster, and it probably also uses about 20% less power in terms of the CPU cores. That's a really nice thing. But what about ART?
So for ART, we weren't really looking for big performance gains from 64-bit. Obviously we were hoping, and we've been pleased with what we've seen so far. We expected most of this, again, to come in the compute-intensive workloads. Looking at ART performance here — this is on Bay Trail — we have, on the left, the speedups of going from Dalvik 32-bit to ART 32-bit. We're getting some really nice speedups on SPECjvm, up to over 6x. But the cherry on top is that we get up to another 30% speedup going to 64-bit. And again, with all these numbers, it's early days; there's a lot more tuning we're going to be doing for 64-bit with ARM, Intel, and MIPS, so expect more to come. OK, so how are we getting this done? Well, as Brian mentioned, we have compilers for 64-bit — for x86-64 and ARM64, with MIPS64 coming soon. We've extended the zygote model for app creation by having dual zygotes: a zygote for 32-bit apps and a zygote for 64-bit apps. When you launch an app or a service, we detect what ABI it needs, or what ABIs it can use, and then we delegate to that zygote to start up the app. What that means — and I'll say more about this later — is that your 32-bit apps are still going to work on your 64-bit device. The other thing we were concerned about is memory bloat: how much more memory is this going to take, having two zygotes and having 64-bit references? Well, in ART we're using compressed references — just 32 bits for object references on the heap — so that mitigates some of the concern. The cool thing about this is that if you have an application that's just running on ART — that's just dex code, written in Java or whatever — it'll just work. As long as there's no native code, you can download that app, and if you have a 64-bit device, it'll run on the 64-bit VM. That's 85% of the apps in the Play Store that are immediately 64-bit ready. Developers don't need to recompile, they don't need to upload a new version or anything; it just works, and it's ready for you for free when you get your new device. So that's another performance-boosting thing that ART, L, and our silicon partners are bringing you. The other thing is that we'll be shipping NDK support and RenderScript support for 64-bit with L as well, so you can take advantage of it for all that other compute-intensive work too. OK, so to learn more about ART, there are a couple of articles you should check out on android.com. The first one, "Introducing ART," covers some of the same ground we've covered here. The second one, "Verifying App Behavior on ART," goes into more detail on how ART can help you find bugs in your application, especially around JNI — you should definitely check that out. And with that, we'll take questions. Thank you. AUDIENCE: Can I start off? ANWAR GHULOUM: Yeah, go for it. AUDIENCE: Cool. So I was wondering how ART is going to jive with bytecode injection that might happen right after compilation or even at runtime. ANWAR GHULOUM: Go ahead. BRIAN CARLSTROM: No, I'll let you do it. IAN ROGERS: So, maybe I'll-- ANWAR GHULOUM: Oh, yeah. IAN ROGERS: OK. So the model that Dalvik has, and that ART continues, is that for class loaders, everything the class loader loads has to be backed by a file. So Dalvik never had support for doing in-memory injection of instructions, and so on.
If you have a file on disk, then that's something we can compile ahead of time and put into our cache so that we're not regenerating it all of the time. So basically, it works the same way as with Dalvik. There might be a little more time spent doing the compilation, so you might notice a pause at initial startup because of it, but our compilation times are really good and really fast, so hopefully it's imperceptible. AUDIENCE: Cool, thanks. ANWAR GHULOUM: Go ahead. AUDIENCE: What's the effect on boot time compared to Dalvik? ANWAR GHULOUM: So on first boot, when we're compiling things, things can slow down a bit. Compile time takes a bit longer — the scale we're really talking about is an app going from taking one second to dexopt to, say, two and a half or three seconds to compile. If you're doing that over many apps, it adds up. But that's just at first boot, or after an OTA, which are relatively infrequent events. In general, we think boot time will be faster. The one mitigating factor, though, is that because we're doing compaction at boot time to really make the heap more efficient, there is some potential to extend things — although recently, I think a patch went up to optimize that, so it only does the compaction toward the end and we're not extending boot time. But code should be faster, starting up system services and so on should be faster, and things should just be faster. AUDIENCE: Thank you.
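For reference, the file-backed class loading Ian describes in that last answer is what DexClassLoader exposes: you hand it a dex or APK file on disk, and the runtime can compile it ahead of time and cache the result. A minimal sketch — the file path and the plugin class name are made up:

    import android.content.Context;
    import dalvik.system.DexClassLoader;

    // Loading additional dex code from a file on disk, as described in the Q&A.
    // The path and the plugin class name below are hypothetical.
    public final class PluginLoader {
        public static Object loadPlugin(Context context) throws Exception {
            String dexPath = context.getFilesDir() + "/plugin.apk";
            String optimizedDir = context.getDir("plugin_code", Context.MODE_PRIVATE)
                    .getAbsolutePath();
            DexClassLoader loader = new DexClassLoader(
                    dexPath, optimizedDir, null /* librarySearchPath */,
                    context.getClassLoader());
            Class<?> pluginClass = loader.loadClass("com.example.Plugin");
            return pluginClass.newInstance();
        }
    }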
Info
Channel: Google Developers
Length: 45min 6sec (2706 seconds)
Published: Fri Jun 27 2014