ANWAR GHULOUM: Thanks
for coming to the talk. We're going to talk about ART,
L's runtime, replacing Dalvik. Hopefully you'll leave
with a better understanding of what ART is and what
improvements you can expect. We'll also talk a bit about
64-bit support in ART, and in the platform at the end. So I'll be presenting today
along with my colleagues, Brian Carlstrom and Ian Rogers from
the Android runtime and tools team. I'm Anwar Ghuloum. They'll talk about different
aspects of the runtime. They'll be coming
up pretty shortly. Thanks guys. First, why ART? So when Dalvik was
originally designed, the processors we were
targeting were single core, relatively poor performance,
relatively little flash and RAM available. But things have changed. We had a compact bytecode
that was designed for fast interpretation,
relatively unsophisticated GC, and it wasn't designed
for multiprocessing. But since then, what
we've seen is we've got phones shipping
with eight cores. We've got devices that have
improved in performance over the G1 by 50x or more in
terms of raw CPU performance. We've got 4x more RAM,
sometimes more, 64x more flash, high resolution screens,
extremely powerful GPUs. We made a lot of changes in
Dalvik to improve things. We added a JIT, we added a
concurrent garbage collector, and we added
support for multiprocessing. But we felt frankly, that the
ecosystem was outpacing us, that we had to do more. So ART was born. So our motivation
for ART is that you shouldn't notice
the runtime at all. Your apps should be buttery and smooth, even though they're using GC. Code should be
fast without having to resort to JNI or native code. Start up time should
be fast, without having to warm up a JIT, and so on. The runtime in
general should scale to the parallelism
of modern processors, as well as the complexity of
modern mobile applications. In other words, it should
be a really solid foundation for the future. So in this talk,
we're going to start with telling you
about performance and compiling in ART; how
ART compiles ahead of time, and what kind of performance
improvements that brings you. We'll then talk about
the garbage collector and how GCed apps can
be buttery and smooth. And then we'll talk
a bit about 64-bit. So with that, I'll
hand it over to Brian. BRIAN CARLSTROM: Thanks, Anwar. So I'm going to start
by talking about a user. We have this public
bug that's one of our favorites
on the ART team. So a user wanting a
performance boosting thing. And if you read the
details in the bug, they want to take their existing
phone and their existing apps and they just want to get
a system software update. And they want faster
performance for their apps, want better battery life. They want to use less memory so they can multitask more. And we really like this
bug because it really matches what Anwar said
about the ART manifesto, about trying to transparently
improve performance for apps without people
having to do anything. So what can we do about that? So as we've already said,
we started with Dalvik. And it had some performance
boosting things. It had a fast interpreter. JIT was added, another
performance boosting thing. And what can we do for ART? What can we do going forward? And we said well, Dalvik was
built for really targeting ARM. Now in modern mobile devices, we have a lot more ABIs
we have to support. We want to do potentially
different optimization levels for different use cases. So rather than just trying to
build one performance thing, we want to build a framework
of performance boosting things. And so we're going to
build a flexible compiler infrastructure that can
allow us to do that. So this is just a little box diagram of what we're doing in the ART compilers. As you'd expect,
we have a front end that can read our dex
bytecode for compilation. We have a compiler driver,
and what's interesting is that it directs to various
different compilers that are available in L. We have three main compilers
that we're using in L, but there's already seeds in the
AOSP code for other compilers that we're working on
for the future, seeds of future performance
boosting things. What we have for L is a compiler
we call a quick compiler, which is based on the old Dalvik JIT. So it does many of
the same things, but does more than that. And we'll talk about
that in the next slide. It focuses on
compiling your dex code to native ARM code, x86
code, MIPS code, et cetera. We also have a JNI compiler
that helps build bridges from your dex code to
any native code you've done within the NDK. And finally, we have
a dex to dex optimizer which just quickens code
we decide not to compile, that we just want to do
fast interpretation on. In addition, we have to
have numerous back ends, and so we're supporting the
32-bit ABIs that Android currently supports, as well as
the new 64-bit ABIs that are coming in L. So what are some of
the other benefits of doing ahead of
time compilation? One big difference between
ahead of time compilation and the JIT we had in
Dalvik is that we potentially optimize the whole program, instead of just focusing on small traces of the program,
loops, or things like that. So the kind of
optimizations you do when you're targeting small parts of the program, or loops, are very different
than when you're trying to do the whole program. And we try to do optimizations
that can generally apply to all kinds of object-
oriented programming that you might be
doing in Android. And so some examples are, we
try to make virtual method calls faster when possible,
we try to lower the overhead for
interface invocations, we try to avoid some
implicit checks that need to be done
when you're calling from one class to another to
make sure it's initialized, and when you're calling
constructors or static fields. So accessing static fields, or
calling [INAUDIBLE] methods. And finally, there's
a lot of things that are sort of implicit checks that are done in managed languages, like null pointer checks, and uncommon cases, and stack overflow checks, and things like that. So we try to make all those
kind of common case things faster so we can improve
the general performance of your program.
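To picture what those implicit checks look like, here is a minimal Java sketch; the class and field names are invented for illustration, and the comments mark where the runtime checks would conceptually sit, not how ART actually emits them.

    // Illustrative only: these checks are implicit in the bytecode rather than
    // written by the developer, and ART's ahead-of-time compiler tries to prove
    // many of them unnecessary and remove or cheapen them.
    class Greeter {
        static String prefix = "Hello, ";       // reading this implies a class-initialization check
        String greet(String name) {
            return prefix + name.trim();        // implicit null check when 'name' is dereferenced
        }
    }

    public class ImplicitChecksDemo {
        public static void main(String[] args) {
            Greeter g = new Greeter();          // constructor call: class-initialization check
            System.out.println(g.greet("ART")); // virtual call: vtable dispatch plus null check on 'g'
        }
    }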
But we've also seen some other benefits from ahead of
time compilation. Those things that
the user wanted. We do see better battery
life, and there are two reasons for that. One is because we're compiling only at installation time; we compile once as opposed to a JIT-based environment where every
time you run the program, it typically re-JITs,
rediscovers things, redoes work. It's kind of wasteful to be doing the same compilations every time you run your program. The second thing we see is that
generally faster performing code saves battery because
you can more efficiently run and get back to a lower state
on the processor quicker. We also see better Svelte memory numbers. Remember Project
Svelte from last year, trying to run better on
lower memory devices. The reason we see
better Svelte numbers is because the code is written out to disk, and can be paged in and out
by the kernel on demand. If there's a lot of applications
running, multitasking, the kernel can kind of manage
paging things in and out, as opposed to the
JIT, where there would be a kind of set of
private pages per application which couldn't be paged out. So that meant that if the kernel
needed to free up more memory, it basically had to kill a process, and you lose one of the processes you maybe are trying to multitask between. So how do we do that? This shows the life cycle of an
APK, the Android package that comes from developers
to the users. And so the typical steps in
an APK's life cycle are: you take your source code. You compile it, you
make a dex file, and can combine it
with native code from the NDK and resources,
and you basically put it in a zip file we call the APK. And that gets distributed
through Google Play to the user. So on the user
side they take it, and they install
the application. And the native code
and the resources generally get unpacked
into the application directory. But the dex code requires
further processing. And this is true both for Dalvik
and for ART, as you see here. In Dalvik, they ran it
through a tool called dexopt during installation,
which already did some ahead-of-time things like ahead-of-time class verification and, again, tried to quicken things up to improve performance for the interpreter. dex2oat is the
tool that ART uses. And again, this is sort of
transparent behind the scenes. But it compiles the application: in addition to doing some of what Dalvik did in terms of verifying things ahead of time and quickening things, it also does the compilation to native code for the particular target device. And the key takeaway
here is that from a developer's point of
view and a user's point of view, kind of the flow of how
the APK works is the same. And hopefully, you're just
going to get faster, a better experience. This is a little
bit of a picture of what it looks like when
you're running dex2oat, what's happening inside of the tool. We actually start up
a minimal runtime that loads the
framework code and some of the commonly used
framework objects so that when we
compile the code, we kind of have an idea of what the
Android system you're running on looks like so we
can make optimizations specific to that target device. So once we kind of initialize
this framework environment, we then go ahead and
load the compiler, compile the code of your application, and generate an ELF file. So what does this
ELF file look like? Sorry, yeah. This is the ELF,
kind of an overview of what these files look like. ELF is a standard file
format for representing executable code. It is a standard file. We have kind of the
sections you would expect. We have a symbol table,
we have read only data. We have some text that
contains the actual code. One difference between how we
use ELF files versus how more of a C, C++ kind of language would use
them is we don't use symbols to find every
single entry point. Because typically a Java
program has many, many methods. And having a unique symbol for each one would be too big. So instead, we use the
things that are already in the dex bytecode, type indexes and method IDs, to identify things. And we have this OAT data
structure that we instead look up using the symbols
in the symbol table. And once we have that
data structure loaded, we can quickly navigate
when we're loading a class to find all the methods
for a particular thing and link them up so we can
execute the code quickly.
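As a rough mental model only, that lookup works like a table keyed by the dex file's own method indices rather than by per-method ELF symbols. The sketch below is a toy Java illustration of that idea, not the real OAT format; the type names and numbers are invented.

    import java.util.HashMap;
    import java.util.Map;

    // Toy model: resolving compiled-code entry points by dex method index, the
    // way one table lookup can stand in for thousands of per-method ELF symbols.
    public class OatLookupSketch {
        // method index from the dex file -> offset of compiled code in .text
        private final Map<Integer, Long> codeOffsets = new HashMap<>();

        void record(int methodIndex, long textOffset) {
            codeOffsets.put(methodIndex, textOffset);
        }

        long entryPointFor(int methodIndex) {
            Long offset = codeOffsets.get(methodIndex);
            // a method we chose not to compile falls back to the interpreter
            return offset == null ? -1L : offset;
        }

        public static void main(String[] args) {
            OatLookupSketch oat = new OatLookupSketch();
            oat.record(42, 0x1a40L);             // invented numbers
            System.out.println(Long.toHexString(oat.entryPointFor(42)));
        }
    }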
One other thing that we do that's a little bit different is we actually keep the original dex files around in the metadata in the ELF file. We use that first as sort of the class metadata for loading classes. We also use it
for other purposes, such as when you want to run
the debugger and single-step your code. We keep the original code
around so we can give you an accurate
debugging experience. One final thing to note here
is that in the compiled code for the class in the box at the bottom, you can see it actually does have direct links into framework code as well as framework objects, so we can efficiently call things and not have to do runtime lookups to find references to framework code, which generally speeds up application execution. So this is just
kind of a time based view of what a
compilation looks like. On the lower left,
we have a systrace of a compilation of
a pretty meaty app. It's running on a
four core device. And I've tried to break out the
colored boxes on the lower left there to the right so you can
kind of get an idea of what's going on there, and
I'll talk through that. You can see most
of it actually does get good parallelism
during compilation. That's part of taking more
advantage of modern devices, whereas dexopt and other tools for Dalvik were more single threaded. So we do have a serial phase at
the beginning, where we extract out the dex files
for processing. But then we do a pass, most
of the passes are parallel. And the first pass
is class resolution, where we basically load
all the classes, load their fields and methods. And this is where
the compiler learns what the layout of a class is,
what the layout of the vtable is for the methods in the
class so we can again, use that later on
during compilation. The next phase is
class verification. We walk through the
bytecodes of the class. For almost all
methods we encounter, we find that they
are verified, and we can mark that class as verified. And again, we can skip doing that during every run. One other thing we
do, there are cases where we have failures
during verification. So for example, if you
install an APK that's maybe meant for the L release,
and you install it on KitKat, you may reference new
APIs from the L release. And so we deal with that by
noticing this [INAUDIBLE] as a soft failure
during verification, we mark that one
thing and so later on when the compiler
compiles, it can know that well, I'm not
sure if this code is available or not, and put a slower
path in so that we can do the correct
thing and handle it if the code accidentally goes
down this path of calling a method that doesn't actually
exist in this release.
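From the app side, the pattern that produces such a soft failure is just a guarded call to an API that only exists on L. JobScheduler is used below purely as an example of an L-only API; the sketch is illustrative of the situation being described, not of anything inside the verifier.

    import android.app.job.JobScheduler;
    import android.content.Context;
    import android.os.Build;

    public class SoftFailureExample {
        // JobScheduler first appears in L (API 21). If this APK is installed on
        // KitKat, the verifier cannot resolve these references, records a soft
        // failure for this method, and the compiler emits a slower, guarded path
        // so the check happens at the call site at runtime instead of failing.
        void cancelJobsIfPossible(Context context) {
            if (Build.VERSION.SDK_INT >= 21) {
                JobScheduler scheduler =
                        (JobScheduler) context.getSystemService(Context.JOB_SCHEDULER_SERVICE);
                scheduler.cancelAll();
            }
            // on older releases we simply skip the work
        }
    }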
So if all that passes, we do this pass called
for any method, or sorry, any class that was
probably verified. If it all verified OK and
has no class initializer, we'll go ahead and mark it
as initialized ahead of time so that we know when
we're compiling code that references that class, that
we don't have to do any class initialization checks. Again, another optimization
for object-oriented programming.
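A small Java illustration of the distinction, with made-up class names: a class whose static state needs no runtime work can be marked initialized ahead of time, while one with a static initializer block cannot.

    // Can be marked initialized ahead of time: plain fields, no static
    // initializer work beyond compile-time constants.
    class Point {
        int x;
        int y;
    }

    // Cannot be pre-initialized: the static block has to run at first use, so
    // code touching this class keeps its class-initialization check.
    class Config {
        static final long START_TIME;
        static {
            START_TIME = System.currentTimeMillis();
        }
    }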
And so finally you can see, we've
the program down there through that purple section
during verification, and now we finally start to do compiling. And this is where the compiler
driver takes over and directs the different methods
to different compilers potentially, choosing to compile
for native code or JNI bridges, or just doing dex to dex
for things that are not commonly run, like class initializers, so we don't waste time or space compiling them. So finally, we have
a link step where we combine all the references
between the compiled code and any references to the framework's code and metadata. And then we link that together,
and we write our own ELF file. So what does this get us? So this is the slide
from yesterday's keynote that I wanted to go into in a
little bit more detail. You can see here
the blue bars are the Dalvik JIT as a baseline. This is on a Nexus 5,
kind of an AOSP build from recently around
L snapshot time. The red bars show
the ART performance. And then you can see on our
collection of common benchmarks we are doing better
than the Dalvik JIT. The lowest number there is
the AnTuTu composite score. And I wanted to just talk
about that for a second. Because the composite,
it's a composite of some things that
are running dex code, and some things that are
running native code. We can only get
improvement on the parts of that composite benchmark
that are running dex code. And then we do get some
improvement in those areas. A lot of the other
benchmarks, you can see we have this general
kind of bars of improvement. Many of those are kind of
these historical benchmarks, like Dhrystone, or were ported from C or Fortran or something. And they're not really
written in a Java style. What we really like is
the Chessbench score, because that's actually more reflective of an Android developer writing code in a modern way, where they'll see the benefits. And I think hopefully
that's more representative of what you'll see for
most applications doing kind of compute in Android. So with that, just a reminder. This is kind of a snapshot
of where we are now. We have additional work
we've been doing in AOSP that hopefully some of
that will make it into L, so we'll have more performance
boosting things before we ship. And with that, I
want to introduce Ian Rogers, who's going to
talk about other performance boosting things, the
ART garbage collectors. IAN ROGERS: Thank you, Brian. So-- very loud. So I'm going to talk
about garbage collection, and basically how we're going
to achieve fast allocation. What we've really been focusing on is less jank, giving you a buttery experience
when you're using Android, and minimizing the
memory footprint. So what is the memory
manager trying to do? We're trying to allocate
objects inside an application. We need to track on
behalf of the application whether those objects are still
in use, whether they're alive. And need to reclaim them when
they're no longer in use. We want to do all this and
we want to do it for free. And this is a real advantage
in the Android ecosystem is that this burden is taken
away from the developer. So let's think about
memory management schemes. There's no approach
which is free. If you have native
code, you have to worry about when
you're allocating objects when you're going
to reclaim them. You have to worry about
complex data structures which might have cycles in them. You have to worry about
multi-threaded issues like when an object is accessed by multiple threads. So there are costs
associated with that. If you're using
reference counting, then you can find that a
thread that lowers a reference count suddenly ends
up having to reclaim all of the garbage in
the known universe. And all of these things
can create a lot of work on an application thread
and cause significant jank and problems for the
application developer. So what we've been
focusing on in ART is making GC
something that doesn't get in the way of
the application and doesn't get in the way
of the application developer so they're freed from
worrying about it. And our principal consideration
has been how to reduce jank. And this comes from pause times in the garbage collector. And I'll go into
detail about that. So let's look at what's going
on inside a garbage collection. Here, we have some threads, the wiggly lines. We have a heap, which
is the rectangle. And inside the heap are
green and red rectangles indicative of live objects
which are objects which are in use by the application
and dead objects, which are objects that the application
can no longer access. The blue object on
the left is an object that we're trying to allocate. So the job that the garbage
collector's got to do is at allocation time, it's
got to find places for objects to fit in to get allocated. And also at GC time,
it's got to go and find all of these live objects,
these green objects. And having found
all of them, it then has to go and free them up. So let's see what happens
inside of Dalvik. When we're going and
finding this initial set of live objects, Dalvik
will suspend the application threads. It also suspends threads
inside of the virtual machine. So we tend to refer
to these threads as a whole as mutator threads. They mutate the state
of the Java heap. And to do this, Dalvik
suspends these threads so that it can crawl
less stacks and find what objects those
stacks are referencing. Once it's determined, this
initial reachable set, it then traverses it
in a concurrent phase. So the garbage
collector is running alongside the application. Once it's found
all of the objects, it then goes and
does another pause. The reason for this
pause is to make sure that the application
wasn't playing any games with the
garbage collector and hiding objects which the
garbage collector thought it had already processed. In this phase, the
card table will be examined by the
garbage collector. Finally, having determined that
everything has been marked, the garbage collector
restarts all the threads, and it concurrently
sweeps and frees up the objects which are no longer in use. So in Dalvik, the pauses
have been highly optimized. And the first pause is
around 3 to 4 milliseconds, and the second pause is
around 5 to 6 milliseconds. But the sum total of that
is about 10 milliseconds. If we're going to achieve
60 frames per second, we need to be able to run the
frames in 16 milliseconds. And taking 10 milliseconds
for garbage collection pauses out of that
16 milliseconds means we get dropped frames. So here is a systrace of
Dalvik running the Google Play store. And we're doing
a fling, so we're scrolling down and allowing
lots of bitmaps to scroll by. And the bitmaps
are causing objects to get created, and
collected, and triggering GCs. And at the top,
you can see what's running on each individual CPU. Beneath this are
the vsync events, which are the screen
refresh events. Below this, we have the
vending application, the name of the Play
Store application. And it's rendering
frames, and you can see that in the dark green. And ideally, what we would
see is with every vsync event, there would be some dark green. But you can see in the center
here that the dark green isn't there, and that's because
of a dropped frame. The GC was causing the dropped frame. At the bottom,
we've got the view of what's going on
inside of Dalvik. And we can see that
there are GCs going on. There are explicit GCs going on back to back. And there are pauses within this. And we can see that the
second pause came in just before the vsync event where
we had the dropped frame. So this isn't good. This isn't giving us the buttery
experience you want on Android. And one of the things
that was on that systrace, we saw that GC cycles were
happening back to back. And why was this happening? Well, the application was
trying to avoid a pathology. And it's kind of what happens in Dalvik's behavior, where Dalvik is trying to avoid heap fragmentation and allowing the heap
to grow, because it's worried about large
objects getting reclaimed, and then small objects coming in
and fragmenting the heap, making it so that there are no longer places where large objects can come and fit into the heap. So this is kind of shown
on this graphic here. We've got the large blue object. We failed to find
somewhere to fit it. We're worried about
fragmentation, and so Dalvik is going to
suspend all of the application threads and do it
a GC for our log, to then go and free up all
of the objects it can so hopefully, do a
better job at fitting this large object into the heap. What does that look like
for the application? It looks like one large pause. And these pauses are typically
around 50 milliseconds. So this is going to translate
into four dropped frames. Here, we can see again, a
systrace of a Google Maps application this time. And we can see that's
we've got two GCs happening at the bottom. And for in this
situation, the two GCs have caused a pause of around
60 milliseconds, and 4 to 5 dropped frames. So for us, everything
is awesome. We've fixed all the problems. Open bugs if we haven't, please. [APPLAUSE] So we've made the
garbage collector faster with new garbage
collection algorithms. We've reduced the
number of pauses and made the pauses
themselves shorter. We've taken ergonomics
decisions and strategies to reduce the amount
of fragmentation so that we no longer
need the GC-for-alloc. And we've done all
of this with support for things like
moving and compacting collectors in ways that
allow us to use less memory. So what do pauses look like in
the ART's garbage collector? Going back to the
earlier graphic, we've got rid of
the first pause. What we do to create this
initial set of objects that the garbage collector
needs to traverse is the garbage
collector requests that the application threads go
and mark their own stacks. So this is a very short thing
for the application threads to do. And after they've done it,
they can carry on running. So there's no need to suspend
all of the application
gets called gets done first, and the garbage
collector waits for that. And then it enters
the concurrent phase that we had before
in the Dalvik CMS asking where we're going
to mark all of the objects. Then we have the second pause. And we've managed to
make pause a lot shorter than the Dalvik pause. We've done this by tricks like
[INAUDIBLE] cleaning where we take work out of the
pause and do it ahead of it, and then we just
double check that what we did before
the pause was correct. If we look at systrace, this is
going back to the Google Play Store application,
and doing a fling. We've got the vsync events
and frames being rendered. We've got a garbage
collection pause happening just where the
single pause is indicated. And no dropped frames. The other thing that you
can see on this systrace is that the GCs aren't
occurring, the back-to-back GCs that were happening in the Play Store before. And that's because ART doesn't
suffer from the fragmentation problems that Dalvik had. And so we now treat the system GCs as something that's optional, and we
only use them as a hint that we should trigger a
garbage collection cycle. Whereas with Dalvik,
they would always trigger a concurrent
garbage collection cycle. And so there are no
dropped frames here. And that single pause is
measuring about 3 milliseconds. So that was kind of a lot
faster than the pauses that you're seeing in Dalvik.
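Going back to those explicit GCs for a moment, in application terms the change is about what an explicit collection request now means. A minimal sketch, with a method name of our own choosing:

    public class ExplicitGcHint {
        void onScrollFinished() {
            // On Dalvik an explicit request like this reliably kicked off a
            // concurrent GC cycle, which is why back-to-back GCs showed up in
            // the earlier systrace. On ART, as described here, it is only a
            // hint that the runtime is free to ignore.
            System.gc();
        }
    }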
So how did we get rid of the GC-for-alloc events? So the problem we were having in
Dalvik was these large objects were coming in. And these large objects would
be bitmaps in the common case. And so what we do in ART
is we move the bitmaps out of the main heap, and put them
into their own separate space, still part of the managed heap. But it's a separate area in
the managed heap especially dedicated to large arrays
of primitive objects. And this is also useful
for us, because we know that by definition, arrays of primitives don't have references to other objects. And so we can improve the
bookkeeping and the performance of the garbage collector
as a consequence.
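For a sense of scale, the pixel buffer behind one decoded screen-sized bitmap is just a large primitive array. The sketch below is illustrative only; the exact size threshold for this large object space is an ART implementation detail.

    public class LargeAllocationExample {
        // A screen-sized ARGB pixel buffer: 1080 * 1920 ints is roughly 8 MB.
        // An array of primitives like this holds no references to other
        // objects, which is what allows the cheaper GC bookkeeping.
        static int[] allocatePixelBuffer() {
            return new int[1080 * 1920];
        }

        public static void main(String[] args) {
            int[] pixels = allocatePixelBuffer();
            System.out.println("allocated " + pixels.length + " pixels");
        }
    }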
What does this look like for an application like Google Maps? So on the left, we've
3 to 4 milliseconds, the second pause around
5 to 6 milliseconds. And total coming in at
just under 10 milliseconds. These are times from a Nexus 4. The GC-for-allocs, typically, taking the average, were around 54 milliseconds. And so that translates into
a lot of dropped frames. What is ART looking like
in the same test scenario? Well the average
pauses are coming in at 2 and 1/2 milliseconds. So that's around
four times faster. We've also found that
one of the areas where we needed to improve was
in terms of our allocation performance. So we were working with the
Google Spreadsheets team internally inside of Google,
and they were really excited by ART, because they've
been experiencing some bad performance. And so they switched
over to using ART, but they didn't realize
the full performance gains that they were expecting to see. So you can see that
on this bar chart. On the left normalized to one
is the Dalvik performance. And when they switched
over to using ART, they managed to get
a 25% performance improvement on a Nexus 4. However, they wanted more. They wanted the 2X. They wanted the 4X. So there were other things we
did in ART, having seen this. So one of the things was
we started specializing the allocation paths so that
the allocation paths can allocate certain classes of
objects faster than others by knowing things like the size
of the object ahead of time. And this achieved
a performance win, but not the dramatic performance
win we would really have liked. So the next thing that
we did was something that we'd wanted to
do for a long time. And we implemented our
own memory allocator. So Dalvik and ART
in KitKat were based on top of Doug Lea's
memory allocator, which is the common one used
in Unixes and so on. And this supports
a [INAUDIBLE] free, but it's not really
tuned for Java, and it's not really tuned for
a multi-threaded environment. So to make sure that it
behaves in a correct manner, it has a single
global lock around it. And so the garbage collector
could be freeing things, and that would conflict
with the application trying to allocate objects. And if you had multiple allocations
happening at the same time, then they would get
run sequentially. RosAlloc, as we call it, the runs-of-slots allocator, has a number of optimizations
to improve performance. The first of which,
is the small objects, the small temporary
objects that are so useful in
languages like Java. They get allocated in
a thread local region. And so because the
region is thread local, this means that
allocations can occur with no locking and no kind
of sequential behavior. For larger objects, the
bins that they fit into have their own individual locks. So again, it's reducing the
contention for the locks, and there is a lot more parallelism.
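The workload that benefits looks roughly like this hypothetical sketch: several threads churning through small, short-lived objects at the same time, with the thread count and object sizes chosen arbitrarily.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SmallAllocationChurn {
        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int t = 0; t < 4; t++) {
                pool.execute(() -> {
                    long total = 0;
                    // Small, short-lived objects: with thread-local allocation
                    // regions these are handed out without a global heap lock,
                    // so the four threads don't serialize on allocation.
                    for (int i = 0; i < 1_000_000; i++) {
                        String item = new StringBuilder(16).append("item-").append(i).toString();
                        total += item.length();
                    }
                    System.out.println("total length " + total);
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }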
This has translated into a huge performance improvement for this benchmark, and you can see that
the memory allocation performance has improved 10x. So we've been working on a
number of different garbage collection algorithms for ART. We're making ART
into a framework that developers and partners
can come along and contribute to and improve upon. The first new garbage
collector that ART added was a sticky garbage collector. The sticky garbage collector
takes advantage of the fact that we can know
what objects were allocated since
the last GC cycle. And because the generational
hypothesis tells us that these objects are
the most likely to be able to be freed up, then
we focus our energy on these. And because garbage
collection time is proportional to the
amount of live data we're going to
process, and we're going to process less
because we're only going to consider the live
data since the last GC, the sticky GC is typically
running 2x or 3x faster than the current
garbage collector, and reclaiming the
same amount of memory. When the sticky garbage
collector can't run, we run the regular CMS collector
for either the application's heap, or the application's heap and the zygote. And the ratio between running
the sticky garbage collector and the regular concurrent
garbage collector, we tend to run the sticky
one five times for every one time we need to run
the partial one. We're also working on
moving garbage collectors where this slide comes in. And what are moving
garbage collectors? So moving garbage
collectors, they want to do the
same reset creation and marking of objects. But what they want to do
during the sweeping phase is they want to move all
of those objects together. And this basically removes all
of the fragmentation and so on. We do this in ART
via two approaches. We have a semi space
garbage collector where we create a
new memory region, and we evacuate the
objects out of a heap and into that newly
created space. And we also have a
mark-compact algorithm that's coming online
at the moment where we compact things in place. So moving collectors are good. There's less fragmentation,
it saves heap. The downside to them
is that if you're going to be moving
objects around, the effects of
moving an object have to be seen by the
application atomically. There are two approaches
you can take to doing this. You can introduce read barriers
so that the read barrier handles the fact that the object
is moving around in the heap. Or the most common
approach is just to suspend all of the
application threads again. And this isn't good from
a GC jank point of view. So we want to use the moving
collector to compact the heap and get these memory savings. So when can we do it? So we know certain
things about how applications run in Android. We know that it's
worthwhile to compact the heap during zygote startup. Because if we save
space in the zygote, we're going to save it
for every application after the Android runtime is up. The other place we
can use compaction is when applications
go into the background, so when you press
the Home screen. And so we have
various ergonomics determining when is
a good idea to run, to switch the garbage collector
over to a compacting algorithm. We have various
means of determining whether an application
can perceive junk. We don't want to do things
like have the Play music application, which isn't a
foregrounded application. We don't want to
have junk in that, because that would
give you choppy audio. And so hopefully we've
done our jobs right, and it's all working, and so on. And so you'll get
less memory usage. So with that, let
me hand back to Anwar. ANWAR GHULOUM: All right. Thanks, Ian. So a few words about 64-bit
support coming with L. So it's one of the major
features of L. Partners have been shipping, or
readying to ship 64-bit SoCs, and we're working
pretty closely with them to make 64-bit userspace happen. So why 64-bit? Well you might
argue that we don't need the additional
address space now, but if you look at
the keynote yesterday, Android usage is
diversifying pretty rapidly from wearables to TVs. And bearing in mind,
I think it's really-- and there's definitely value
today in supporting 64-bit. We see nice performance
gains in 64-bit apps running on 64-bit capable cores. There are new instructions,
domain specific instructions for media and crypto, where
we get huge speedups that can benefit the entire platform. But even in general
purpose apps, we're seeing nice speedups. So we'll take a
look at that first, and then I'll get
into a few details of how we make this work. So as I said, we spec
compute intensive stuff to take biggest
advantage of this. We're working closely with ARM,
with Intel, MIPS, Qualcomm, and others to make sure that
we really deliver on this. So what I'm showing
you here is graphs of speedup of 64-bit
apps over 32-bit apps on the same silicon. On upper left hand side
as you face the screen, I'm showing you speedup
in terms of a multiplier. This is on Intel's Bay Trail processor, a four-core SoC, for some custom RenderScript scripts. And we're seeing some
really nice speedups going to 64-bit
up to and over 4x. For crypto, we use OpenSSL as the primary crypto engine. If you look at ARM64 support,
ARMv8 support on Cortex A53 and Cortex A57 for the
open SSL speed benchmark, we're seeing some really
nice speedups there as well. Again, this is used throughout
the entire platform. Any application that's
using Android crypto is probably going to
benefit from this. And again, those are multipliers
on the vertical axis, showing about a 15x improvement there. And even for native stuff, we're
seeing some nice improvements. Panorama benchmark
is a benchmark that we use internally to
evaluate tool chain updates. It's meant to be
representative of a user taking a bunch of photos and then
waiting for them to get stitched together in a
nice little panorama. What we see here is that on
ARM Cortex-A53 and A57 cores, simply recompiling it for 64-bit, we see up to a 20%
boost for this one. That means the end user
is sitting there waiting for a panorama, and
it happens 20% faster. But it also uses 20% less power
probably in terms of CPU core. And that's a really nice thing. But what about ART? So for ART, we
weren't really looking for big performance
gains from 64-bit. Obviously we were
hoping, but we've been pleased with what
we've seen thus far. We expected most of this
again, to come in the compute-
intensive workloads. So looking at ART performance
here, this is on Bay Trail. We have on the left,
speedups of going to ART 32-bit from
Dalvik 32-bit. So we're getting some really nice speedups on SpecJVM of up to over 6x. But the cherry on
top is that we get up to another 30% speedup
going to 64-bit. And again, all these
numbers, it's early days. There's a lot more
tuning that we're going to be doing for 64-bit
with ARM, Intel, and MIPS. So expect more to
come from this. OK, so how are we
getting this done? Well, as Brian mentioned, we
have compilers for 64-bit, for x86-64, for ARM64,
and MIPS64 coming soon. We've extended the zygote
model for app creation by having dual zygotes. So we have a zygote
for 32-bit apps, and a zygote for 64-bit apps. What happens is, when you
launch an app or a service, we detect what ABI it needs,
or what ABIs it can use, and then we delegate to that
zygote to start up the app. So what that means is that, and
I'll say more about this later, is that your 32-bit apps are still going to work
on your 64-bit device.
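An app can't observe the zygote choice directly, but it can see which ABIs the device supports through the Build.SUPPORTED_ABIS and Build.SUPPORTED_64_BIT_ABIS fields added in L (API 21). The sketch below just logs them; the class name and tag are placeholders.

    import java.util.Arrays;
    import android.os.Build;
    import android.util.Log;

    public class AbiInfo {
        static void logAbis() {
            // All ABIs this device supports, in order of preference (API 21+).
            Log.i("AbiInfo", "supported: " + Arrays.toString(Build.SUPPORTED_ABIS));
            // The 64-bit subset; empty on a 32-bit-only device. A pure dex app
            // is launched from whichever zygote matches the preferred ABI.
            Log.i("AbiInfo", "64-bit: " + Arrays.toString(Build.SUPPORTED_64_BIT_ABIS));
        }
    }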
The other thing we were concerned about
memory is this going to take having two zygotes, and
then having 64-bit references? Well in ART, we're using
compressed references. We're using just 32 bits for
object references on the heap. So that mitigates
some of this concern. So the cool thing about
this is that if you have an application that's just
running on ART, that's just dex code written in Java or
whatever, it'll just work. As long as there's no native
code you can download that app, and if you have a 64-bit device,
it'll run on the 64-bit VM. That's 85% of apps in Play
Store that are immediately 64-bit ready. Developers don't
need to recompile, they don't need to upload
a new version or whatever. It just works, and it's
ready for you for free when you get your new device. So another performance
boosting thing that ART in L and our SoC
partners are bringing you. The other thing is that
we'll be shipping NDK support and RenderScript support
for 64-bit with L as well. So you can take advantage of
all that other compute intensive stuff as well. OK, so to learn more
about ART, there's a couple of articles you should
check out on android.com. The first one,
"Introducing ART," covers some of the same ground
that we've covered here. The second one, "Verifying
App Behavior on ART," goes into more
detail on how ART can help you find bugs
in your application. Especially around JNI,
you should definitely check that out. And with that, we'll
take questions. Thank you. AUDIENCE: Can I start off? ANWAR GHULOUM: Yeah, go for it. AUDIENCE: Cool. So I was wondering
how ART is going to jive with bytecode injection
that might happen right after compilation
or even at runtime. ANWAR GHULOUM: Go ahead. BRIAN CARLSTROM: No. I'll let you do it. IAN ROGERS: So, maybe I'll-- ANWAR GHULOUM: Oh, yeah. IAN ROGERS: OK. So the model that Dalvik
has and ART continues is that for class loaders,
we have to have everything that the class loader
has backed up by a file. So Dalvik never had support for the kind of doing in-memory injection of
instructions, and so on. If you have a file on
the disk, then this is something we can do
ahead of time compilation for and put into our
cache so that we're not regenerating it all of the time. So basically, it works the
So basically, it works the same way as with Dalvik. There might be a
little bit more time when we're doing
the compilation so that you might notice something
at the initial startup, a pause at startup, because
of the compilation. But our compilation times are
really good and really fast, so hopefully imperceptible. AUDIENCE: Cool, thanks. ANWAR GHULOUM: Go ahead. AUDIENCE: What's the
effect on the boot time compared to Dalvik? ANWAR GHULOUM: So first boot
when we're compiling things, things can slow down a bit. Compile time takes a bit
longer, and the scale we're really talking
about is an app goes from taking one
second to dexopt to, say, 2 and 1/2 or 3 seconds to dex2oat. But if you're doing
that over many apps, it adds up. That's just at first boot,
or after an OTA, which are relatively
infrequent events. In general, we think
boot time will be faster. The one mitigating
thing though is because we're doing compaction at boot time, to really make the heap more efficient. There is potential
to extend things. Although recently, I
think a patch went up to optimize that down; it'll only do the compaction sort
of toward the end so we're not
extending boot time. But code should be faster,
starting up system services and so on should be faster. And things should
just be faster. AUDIENCE: Thank you.