[MUSIC PLAYING] CHET HASSE: We're going
to talk about trash. So why are we going
to talk about trash? Why don't we-- yeah, let me-- [INTERPOSING VOICES] CHET HASSE: All right, so there
is a common phrase in software, "garbage in, garbage out,"
but nobody ever says how fast. So we're going to
talk about that today. And why? Because back at I/O, we
had this talk that we gave, called Modern
Android Development. I have to apologize
for the sunglasses. The sun was right there-- right there, really annoying,
couldn't see a thing. Sunglasses didn't actually help. Nor did they make
us look any better. So we talked about many things. You know, Android had
certain development practices way back then. And Android has changed,
and devices have changed, and ecosystems have changed. And now we are recommending
different practices. All the stuff that you may have
known about Android development may not necessarily
be true today. One of the things that we
talked about in particular was a lot about memory
and garbage collection. We had certain
recommendations there. So for example, we said,
back in the Dalvik days, Dalvik was optimized for size. It was meant to fit
in a very small area. It didn't have a lot of
area to do things like AOT. It didn't have a place
to stash the code. You really needed
to constrain memory. The optimizations
were not optimal. Allocations/collections--
unbelievably expensive. You get GC for
alloc all the time, causing jank all over the place. Heap fragmentation
was a problem. So really the recommendation was
to not really allocate anything ever if you could
possibly help it. And use primitive
types everywhere, because objects are expensive
because you're allocating them. Avoid autoboxing,
all this stuff. ROMAIN GUY: Yes, so I need
to correct you once again. You said, avoid allocations whenever possible. Enums, they don't allocate. CHET HASSE: Yeah. OK, well-- ROMAIN GUY: That's
the whole point. CHET HASSE: But but but but
but they take up space, right? So that's what, 1
to 2K compared to-- ROMAIN GUY: You can't say-- [INTERPOSING VOICES] CHET HASSE: All right, anyway. It's memory-related. So the recommendation instead
was, pay attention to ART, because it turns out ART is
optimized for performance, actually getting faster
with every release because the whole
platform was built to be able to optimize more and
more the more the team works on it. We're doing JIT as well as AOT. So we compile this code. We find out where
the hotspots are, and we stash this
code somewhere so we can run that faster next time
you go through that loop. Allocations and collections
are much, much faster. We'll go into
details about this. We had the ability to defragment
the heap and actually compact as we go now. And there's a large
object heap, which makes some of the allocations
amazingly faster and simpler than they used to be. So the new recommendations
are, go ahead and allocate. It's really not that
big a deal anymore. Still be concerned for
inner loop situations, and be aware that
you are actually causing the device to do stuff. You are causing
battery and CPU usage. So you still want to be
aware of these things, but maybe they're
not such a big deal that they should affect your
APIs and your development patterns the way that they used. However, that was a lot of
really terse information stuffed into a very
short amount of time. So we thought maybe we should
do this talk to actually explain ourselves a little
more completely, and say, why is this the case. Maybe talk about what ART
has done to make life better. The original idea for
the talk was actually, OK, well, we said
all this stuff, but wouldn't it be
nice if we could just write a bunch of
demo applications, and show benchmarks,
and say, this is why, and these are the canonical
results that prove our premise. And it turns out
that is really hard. Because garbage collection, by
its nature, especially in ART, is concurrent. There's stuff happening in
the background all the time. And if you want to trigger
that thing right now, you will not be able to. So we made an attempt to
write some demo applications. You will see some
of the results here. But we realized we don't
really have enough canonical deterministic data to show you. So instead we're
going to tell you the background of why
it's difficult to write these things because everything
is happening magically for you in the background. So first of all, let's
talk about memory. We see a couple of simple lines of code here. We have a primitive type-- here's foo-- and we're going to set it equal to 5, and then we're going to allocate an object.
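In Kotlin, the two lines being described look roughly like this (bar is just a placeholder name for the object):

    // A primitive: the value lives on the stack or in a register,
    // in a spot decided at compile time, so it is essentially free.
    val foo = 5

    // An object: the runtime has to find room for this instance in the heap.
    val bar = Any()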
Well, there are different kinds of memory in the system, and different implications. So if we're going to
have a primitive type, that may show up in the stack,
or it may show up in registers. Dalvik was a register-based
allocation system. So it would actually
pop it in over there. But whether it shows up in
the stack or the register, you can think of it
as essentially free. It's kind of allocated
at compiler time. It says, when I run
this line of code, here's where I'm going
to stick this thing. It didn't need to find
space for it anywhere. It knows that it has
space on the stack. Sort of you can
think of that as kind of limitless storage space--
as well as the registers. It will basically find a
cubbyhole to stash that thing. It's free. However, when you allocate
any other sort of thing-- when you say, new object, then
it needs to find space for it in the heap, which means it
needs to figure out where it's going to fit among the other
things that already occupy the heap. And it will find
space down there. So that's the
basic memory system of the runtime and
the computer overall. So the idea of
garbage collection-- we've all been doing
garbage collection, even before higher level languages
like Java and Kotlin. If you're writing C++ code, then
you can do an allocation like this, and then you can write
the code that actually uses that object that you've allocated. And then if you don't
do anything else, you've just leaked an object. You're taking up space
in memory somewhere that eventually is going to
cause you to run out of memory. So what you need to do is
actually free the object. So you delete the thing,
you reclaim that memory. So you're basically doing
manual garbage collection here. But sometimes you forget that
you've freed that over there. And then in this other
place over there, you continue using that object. And then you crash-- maybe. So very
non-deterministic system. You are responsible for
managing your own garbage, your own allocations
and collections. It tends to be tedious. It tends to be very error-prone. And so higher level
languages came along. So we have a language like Java,
where you allocate an object, and then you write the code
that actually uses that thing, and then eventually it is freed. And if you continue
using it, then you still have a reference to
this object, which means it's going to eventually
be freed without freeing it too soon. The system knows
what's going on. You don't have to
manage this thing. It does it for you. However, if it's doing it for
you, several questions occur. And there is no crash. Yay-- well, ideally. So there are some things
that naturally occur to you, such as OK, well,
how long does it take for the system to do this? I know how long my malloc
statement was going to take in C++. But how long is it going
to take for this system to automatically find
space in the heap for me? How long does it take
to walk the heap, find the appropriate
spot, to put this thing, and then allocate that space? How long does it take to
actually collect these objects? When a reference goes away,
when will it be collected, and how long does it take to
actually collect these things? And what impact is
there system-wide? So if we're running this
sort of garbage collector thread, this heavyweight
thing in the background, then what impact does that have
on jank on the UI thread or whatever? And when do these
things actually happen? And also, how efficient
is that heap usage? It's not just going to malloc
these little tiny chunks for your things. It's probably allocating
a large amount of space, and then putting things in
there, and sorting them around. And certainly on
Dalvik, as we'll see, it's fragmenting
the heap over time, which means you may have
a lot of memory available, but you can't necessarily
even access it anymore. So how efficient is
that, and how much memory are you taking up in the
system to do all this stuff? ROMAIN GUY: So we're going to
start by taking a look at how Dalvik collects garbage. Dalvik was the runtime we were
using until Android KitKat. It was replaced in Lollipop by ART. So this is a
picture of the heap. Everything that's in
white has been allocated. And we have a few holes. And we're trying to
allocate this blue object. So what Dalvik is
going to do, it's just going to walk
through the heap and find this spot
where the object fits. When it finds one, pretty
easy, just slot it there. Things become a lot
more complicated when it comes time
to collect objects. So there's four phases. The first one,
Dalvik has to pause all the threads in
your application to find a root set. So root sets are typically local variables, threads, static variables. They're just the roots of all the allocations that can be reached
in your application. So that takes a bit of time. And during that time, your
application cannot do anything. The second phase is concurrent. Your app is running again. So from this route,
Dalvik will find all the objects that can be
reached and mark them as such. Unfortunately, since that
second phase is concurrent, allocations could be
triggered during that time. So you see here, like
for instance, we just allocating another object. So we need to pause the
application again and find the reachable objects again. And all your threads are
paused, so your application stutters a little bit again. And finally, we can mark all the
objects that are not reachable. And there are candidates
for garbage collection. So the collection
itself just gets rid of the objects in the heap. So now, if we want to allocate
something in the heap, and the heap is
pretty much full, and we have objects that have
been marked for allocation, Dalvik will go through
the heap, realize that there is no
memory available, and then it will trigger
a garbage collection for the specific use
case of an allocation. That's the GC_FOR_ALLOC. And every time this
happens, you see a log. On KitKat, doing
an adb logcat, you can see those
GC_FOR_ALLOC messages. So it will run the collection,
get rid of this memory, then we can run through the
typical allocation mechanism. However, sometimes
your heap is full. There are no objects
that can be collected. Every object is reachable. So Dalvik will run through the
heap-- can't find anything. And only two things can happen. Either the heap will grow or
your application will crash. And it's the Out of Memory error
that you've seen, I'm sure, in a lot of visual applications,
especially on older devices. This typically happens when you
are allocating large objects. I remember a lot of developers
in the early days of Android filed bugs against
Bitmap Factory because Out of Memory
error tended to happen during the decoding of bitmaps. And they thought there was a
memory leak in Bitmap Factory. The main reason was
bitmap objects are big, and it was hard to
find space for them. There was no leak in
Bitmap factory whatsoever. CHET HASSE: So we wrote
a simple demo application to show off how some of the
stuff with heap fragmentation works. So in the demo, we
allocate chunks of memory all the way up to max heap. So your heap starts
out very small. And then if you can't
allocate an object, it's going to grow that over
and over and over until it reaches the max possible. So for this demo, we
allocate these 1 meg chunks. You press this button that
says, there's 400 megs free. You press the button, it's going
to allocate all these 1 meg chunks, all the way until it
grows the heap until it cannot anymore. You get an error. We catch the error, and now
we know the heap is full. There's no space anywhere. ROMAIN GUY: And
this is pretty much the only situation where
you should do a try-catch on an out of memory error. Don't do that in
your application. [CHET CHUCKLING] CHET HASSE: And then we say, OK,
well there's only one meg free. So we're going to go ahead
and fragment the heap, and we're going to
free a bunch of chunks. So we're basically going
to go through everything we've allocated, and
free every other one through the entire heap. ROMAIN GUY: So just go
through every other reference, set it to null,
then force the GC a couple times to make sure
that that memory goes away. CHET HASSE: And then
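Roughly, in Kotlin, the demo does something like this (sizes and names are illustrative, not the actual demo code):

    // Fill the heap with 1 MB chunks until it cannot grow any further.
    val chunks = ArrayList<ByteArray?>()
    try {
        while (true) {
            chunks.add(ByteArray(1024 * 1024))
        }
    } catch (e: OutOfMemoryError) {
        // The one case where catching OutOfMemoryError is reasonable:
        // we filled the heap on purpose.
    }

    // Fragment the heap: free every other chunk, then force a couple of GCs.
    for (i in chunks.indices step 2) {
        chunks[i] = null
    }
    System.gc()
    System.gc()

    // On Dalvik this can still throw: ~200 MB may be free overall,
    // but not 2 MB of contiguous space.
    val bigChunk = ByteArray(2 * 1024 * 1024)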
CHET HASSE: And then we get this result that says, OK, the fragmented heap size is now 200 megs. So we should have
plenty of space to go ahead and allocate
a two megabyte object. So that blue object
there, it's only 2 megs. We've got 200 megs free. So what can be the problem? So we press the button,
and it says, nope, we can't fit that in there. And if you look at the log,
you get these depressing things here, where it says-- ROMAIN GUY: It's the
best error message in all of computer science, I think. CHET HASSE: Yes, right here. This is beautiful. It says, OK, you have
200 megs free out of 400. And we're forcing a collection
so that we can make room for a 2 meg chunk. And we're trying
really hard to do that, and we ran out of memory. So we cannot find space
for 2 megs out of 200. Because apparently Dalvik
was really bad at math. The problem was, of
course, that we couldn't find 2 megs contiguous,
and Dalvik did not have the ability to
compact the heap. You get what you get. Once we put the thing there,
it was not going to move. So we couldn't
actually shove stuff to the side to make room
for a larger object. So ART came along in Lollipop. As I said, it's a platform
for optimizations. It no longer had the memory
constraints that Dalvik did, so they could sort of build
in a lot of the fundamentals that they'd been
improving over time. But out of the box, it came with
much faster allocation times, much faster collections, as
well as a faster runtime, the ability to do
ahead-of-time compilations, so we're actually running
binary code all the time. We're not constantly
jitting things to find how we can
speed things up. We're making everything faster. ROMAIN GUY: One thing
we need to make clear is when we talk about
allocation and faster allocation in this talk, we mean just the
time it takes for the runtime to reserve the memory. We are not talking
about the time it takes to run the constructors. It has nothing to
do with your code. It's only in the runtime itself. CHET HASSE: All right, so
how did ART allocation work? ROMAIN GUY: So in ART, we
introduced a new allocator called a RosAlloc. And I don't know what
it stands for, actually. But it replaces something
called dlmalloc. So DL stands for Doug Lea, who
is also the person who wrote, I believe, java.util.concurrent. CHET HASSE: That's
what happens if you write a really nice algorithm. Then people name the
algorithm after you. ROMAIN GUY: Yeah, he's
the kind of person that makes me feel very
inadequate as an engineer-- really smart. Anyway, dlmalloc is
basically the algorithm behind the malloc function
call in native code. So this is what you use
when you call malloc, or we call new in C or C++. This is the algorithm
we're using. So Dalvik was relying on that. ART replaced it with
its own, called RosAlloc. So the main benefit of RosAlloc
is that it's thread-aware. So it's able to do allocations
that are specific to a thread. And we're going to
look at this in detail. And you'll understand why
it brings a lot of benefits. There's also a lot of little
tweaks that have been done. So small allocations
are grouped together to reduce fragmentation. We align larger
allocations on pages-- typically 4 kilobytes on
modern OSes, and gives you better performance. Also, finer-grained lock,
because the garbage collector has to acquire locks because we
have a lot of threads running. It used to protect
a lot more code. So now it protects less
code, and it runs faster. And overall, allocations with
RosAlloc are four to five times faster than they
were with Dalvik. Again, this is just about the
act of allocating the memory. It has nothing to
do with your code. You could do something
really, really, really expensive in
your constructor, and we would not be five
times faster than Dalvik. CHET HASSE: All right. So let's take a look
at how allocations work on ART, in this new system. Oh, sorry. The other very, very
important thing with ART was the ability to
deal with large objects in a much better way. So you've got this
normal-sized object. It's going to find a space
for it in the heap over here. And what happens if you
have this large item? By a large object, we mean an
array of primitives or string. And these are the types chosen,
because we can guarantee that these objects will not have
a reference to something else. They can live
completely on their own. ROMAIN GUY: And it's an array
of at least 12 kilobytes. CHET HASSE: Yes. And that is the
heuristic for now. That could change over time. But right now, it's 12K,
primitive types or string. So you've got this huge object. Where are we going to put it? Well, in Dalvik, we would
put it where, exactly? You can see in this
fragmented heap that there may not
be space for it. In ART, it's a
little bit simpler. The complicated mechanism
looks like this. We just put it somewhere. We just malloc a space for
it, and shove it in there. It's not even in a object
bucket that holds all of them. We just allocate a space
for it somewhere in there. And we say, OK, you are
now part of the heap. But really it's just living on
it's own somewhere-- very fast, very easy. It's also a moving collector. So we can actually
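For example, a primitive array along these lines is big enough to qualify (a sketch; 4,096 floats at 4 bytes each is 16 KB, over that 12 KB threshold):

    // 4,096 floats * 4 bytes = 16 KB, above the ~12 KB threshold,
    // so ART can place this directly in the large object space.
    val samples = FloatArray(4096)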
It's also a moving collector, so we can actually compact things. So we no longer have the
fragmentation problem. However, it does this
in the background. So actually it's a
little more complicated. My understanding
originally was, well, if your application goes
into the background, then eventually this very
expensive operation-- it could take up to
100 milliseconds-- may run that's going
to compact the heap. Obviously, we don't want to
run that in the foreground, because we're going to jank
your app all over the place. So we're going to wait
till it's sitting there in the background, user
is doing something else, they're not paying attention. So we're compacting
the heap for you. That's awesome. So I said, OK, well I'm going
to demonstrate this and show that same defragmentation crash
error that we saw earlier. I'm going to show how it
crashes on KitKat, using Dalvik, and it will also crash
in all of the releases until we're able to do it in the
foreground on a later release, in O. And this will
be a cool demo. And then I ran it on
L, and it didn't crash. And the thing is, yes, it will
defragment in the background, but it will also do
it in the foreground if it has to, which is
really what you want. So if you actually
need that memory now, wouldn't it be nice
if you didn't crash? So that's what we-- ROMAIN GUY: That compaction
is almost a replacement for the GC_FOR_ALLOC
from before. CHET HASSE: So now it
basically takes everything, it says, well, you need space
for this really large object. We're going to go ahead and
compact things, and then put it where we need to. So on L and above, we run
the same fragmentation demo that we saw before. We go ahead and alloc up
to the maximum heap size. It says, yep, you've only
got about 1 meg free. And we go ahead and
free every other meg, null out those
references, and then we try to find space
for this 2 meg block. It compacts the heap,
and puts it in-- very simple. ROMAIN GUY: So, another
improvement-- so remember with the
Dalvik GC, we had those four phases, including
two pauses for your application. So the pauses were bad because
your application was not doing anything during that time. But what was even worse
was that those pauses could be pretty expensive. So on average, the sum
of those two pauses was about 10 milliseconds. But even when it was
only 10 milliseconds, that was actually pretty good. I mean, we've done a
lot of performance work over the years, and I've seen
these kind of pauses last for 100, 200 milliseconds
in some applications. And during that time,
nothing happens, which means no matter how
good your UI is, it will jank. Like if the user is
trying to scroll, it's not going to work well. One of the things
that ART does is it removes one of the pauses. So now the first
step, the marking the route set, finding the
routes of all the allocations that are reachable in your
heap, is now a concurrent phase. It doesn't pause the
application anymore. And on top of that,
the second one-- the only pause that we still
have left is also a lot faster. So instead of spending
10 milliseconds in there, we only spend about
3 milliseconds now. So at most, your
application will probably pause for 3 milliseconds. So if your application
is well-optimized, the GC happens during an
animation or scrolling. You should be able to
reach 60 fps without any jank. Another thing that
was introduced in ART was the concept of the
minor garbage collection. So the idea here is to keep
track of all the objects that have been allocated
since the last major garbage collection. Those are typically
temporary objects, and we're going to
look at them first. So if we can reclaim enough
memory by looking at this subset
they're short-lived, we won't have to go
through the entire heap. CHET HASSE: This is
one of the things that has an important consequence
for Android development, where we used to tell
you, never allocate even temporary objects,
because they tend to be expensive, because they're
going to fragment the heap. We have to do the allocation. We have to do the collection. All of a sudden, we made
temporary object allocation and collection much
faster and easier. ROMAIN GUY: It's not free. It's just less expensive. CHET HASSE: Yes. ROMAIN GUY: We also introduced
the larger malloc heap that we talked about. So we have less fragmentation. But one of the huge
benefits of that is, because we don't
fragment the heap as much, we don't have to grow the heap
as much in all the processes. And of course, we don't
have to GC_FOR_ALLOC. I mean, they still exist. We just don't see them as much
because they were very, very common in the Dalvik days. And also there's
a faster runtime. That's the ahead-of-time
compilation. And we introduced the G tag. But this has nothing to do
with the garbage collector. Marshmallow-- I was
joking with Chet yesterday that it's kind of
a boring release, because I can never remember
what was in Marshmallow. So here it is-- optimizations in ART. CHET HASSE: Things got faster. Fine-grained details,
things got faster. And again, things got faster. Isn't that nice? Allocation in particular. They rewrote everything, all
the core stuff in assembly. And that turns out to
still help in software. Who knew? And now we're up
to about 10 times faster for allocation costs
when you compare it to Dalvik. And now we're in Oreo,
where basically they rewrote the whole thing. So we introduce an
entirely new collector, called the Concurrent
Heap Compaction Collector. And this means that
now we can actually compact in the
foreground, not just when we need to do that
GC_FOR_ALLOC compact to find a space. But it is actually
running all the time, moving stuff around,
and optimizing what the heap looks like
so that allocations are much faster and easier overall. So defragmentation
in the foreground, you're not resizing
the heap as often because the heap is
always optimized, because we're always sort
of culling the things out of it as necessary. There's far fewer
GC_FOR_ALLOCs because we just don't get into that
situation anymore. Huge savings for
the entire device. Think about it. If you can only defragment the
heap when an application is in the background, what
about the applications and services that are constantly
in the foreground-- system service, play
services, system UI. Well, we couldn't
necessarily defragment those until it got into a
really bad situation. Wouldn't it be nice if we
could do it in the foreground so that those things
were optimized? Well, if we can do
it for them, that means we're getting system-wide
savings on the pure heap size of all of those applications. So smaller heaps for
all means less memory in the entire system. And what we found
was the entire device had about a 30% savings on
overall memory requirements. So we can take a look at how
compaction works in general. We have these 256K buckets
that are assigned per thread. Which means, again,
huge savings in not having to lock down all the
threads to do these allocations and collections. Instead, a thread,
if it needs memory, it's just responsible
for itself. So a thread says, OK,
I need some memory. It's handed one
of these buckets. It allocates in there. And then over time that
thing may get defragmented. There's a heuristic
that they have about if there's less than 70%
or 75% utilization in one of these things, then they'll go
ahead and collect it, and empty the thing out entirely. So we see the T1,
2, 3 regions here don't have much going on there. So we're going to take
the memory in there, all of those allocations,
and shove them somewhere else in the heap, completely empty
out those buckets, which allows something
that's super-efficient, called thread-local
bump allocator. This means that
all we have to do is actually just move a pointer. We don't need to
walk the heap to find where the free space is. We just put the next object
in the next available space. And we know where that is
according to the pointer. It turns out, all
of this stuff put together, we're now 18
times faster than Dalvik for these allocations. So we can see how these
allocations work in practice. You can see all these
little colored objects. We're making space for them
in the different threads that need those allocations. If we take a close look at T1,
you've got the free pointer. We've emptied it out. We've zeroed it out. We're at the beginning
of that bucket. And now we need to
allocate an object. Well, we know where to put it
because the pointer tells us. And now all we do is advance the pointer. The pointer tells us where to put the next one, and the one after that, and so on. So very fast and easy compared to what we were doing before.
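Conceptually, a bump allocator is no more than this (a toy sketch, not ART's actual implementation):

    // Toy model of a thread-local bump region: allocation is a bounds
    // check plus advancing an offset, nothing more.
    class BumpRegion(sizeBytes: Int) {
        private val buffer = ByteArray(sizeBytes)
        private var next = 0

        // Returns the offset of the reserved space, or null if the region is full.
        fun allocate(bytes: Int): Int? {
            if (next + bytes > buffer.size) return null
            val offset = next
            next += bytes   // "bump" the pointer
            return offset
        }
    }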
We can see on the graph the comparison of where we were at with
Dalvik allocation costs compared to where we're at
now in O, with bump pointer and assembly
allocations instead. All right, so where
are we going now? It is important to note
that the young generation stuff that we talked
about as being so awesome is currently gone. ROMAIN GUY: But
it's back in AOSP. And so it will probably
be in the future release. CHET HASSE: Yep. So it's important to note. There are trade-offs here. We believe it's better overall. All of the benefits that you get
from the garbage collector in O should compensate for the
young generation collections not being there anymore. However, they're still
a nice thing to have. So they're back in AOSP. So look for those to show
up in a future release. ROMAIN GUY: So
object pools-- this is a technique that we've
recommended using in the past. We use it ourselves extensively
inside the platform. And the conventional wisdom
is that reusing object has to be faster than allocating
them and collecting them all the time. And you can see here
we have a performance graph, so the x-axis is
the size of the object that you're creating or reusing. In red, it's the time it
takes to handle those objects with a pool, and in
blue, it's the time it takes to just allocate
and collect those objects. And until N, using the
pool was basically always a win compared to the
garbage collector. But with O, with all the
improvements we've done, and this new thread-local bump allocator, synchronized pools of
objects are generally slower. And I want to emphasize the
synchronized part of the pool. If you have a pool of objects
that's used only on one thread, and that thread only,
you're effectively doing the kind of optimization
that ART does for you now in O. So you might not see those
same kind of savings. But in general, on O, if you have a synchronized pool of objects, you'd probably be better off without that pool. Again, make sure you profile your application before you take it out.
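For illustration, a synchronized pool along these lines is the kind of thing that is now often slower than plain allocation (a sketch; the class name and sizes are made up):

    // Every acquire() and release() takes a lock -- exactly the cost
    // that ART's lock-free, thread-local allocation path no longer pays.
    class VectorPool(private val maxSize: Int) {
        private val pool = ArrayDeque<FloatArray>()

        fun acquire(): FloatArray =
            synchronized(pool) { pool.removeLastOrNull() } ?: FloatArray(3)

        fun release(v: FloatArray) {
            synchronized(pool) {
                if (pool.size < maxSize) pool.addLast(v)
            }
        }
    }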
CHET HASSE: And there's a certain memory size-- a certain size of object where these graphs cross. But this was in a
benchmark application that was really hammering it
and maximizing the bandwidth. So the general
advice is, you really shouldn't use that, especially
the synchronized pool approach. Because A, it's error-prone
and tedious to manage, and B, it is slower in general because
that synchronization access tends to be slower than
what we can do now with-- [INTERPOSING VOICES] ROMAIN GUY: The
whole point of O is that we don't have a lock
anymore for the allocations. CHET HASSE: So what are
the recommendations now? Creating garbage is OK. I wouldn't go out of your
way to create garbage. It is still taking time. You are requiring
us to take time to allocate objects, as
well as later collect them. And you're taking up battery
and CPU, and you're causing-- ROMAIN GUY: But
don't do like Chet. Pick up after yourself. You should see his desk. It's pretty disgusting. [CHUCKLING] CHET HASSE: I like to
be a counterexample. So in general, creating
the stuff, if you need it, is OK, and so is collecting it. Use the types and the
objects that you need. If they make sense
for the architecture, for the APIs that you're
building, for the libraries, for your code, go
ahead and use those. We are no longer pushing everybody to
use int and bitflags everywhere to optimize every little CPU
cycle and memory allocation. Instead, go ahead
and allocate when you need to for your use case. However, GC is still overhead. ROMAIN GUY: And
we're going to look at a demo that showcases that. CHET HASSE: Yep. And make the right
choices for you. And in general, the
framework programmers, we still are your inner loop. So we still take
the old practices. Because why would
we allocate if we didn't need to, if we can
make your inner loop faster? So be aware of the
inner loop situations. Otherwise, do the right
thing for your code. All right, so I wrote
a simple application to sort of showcase
some of the jank stuff that you can see because of
allocations and collections. And this-- during the
onDraw, I would call out to a method to run some stuff. And in this case, we're
going to test autoboxing. So we have an array of
100,000 Float objects-- capital F Float objects. So it's not just
the primitive float. Instead, it's the
capital F. So we're going to be allocating
these objects on the fly. Here's what the method
looks like that's going to run on every single frame. It's going to walk the
entire length of the array, and take a primitive
float, and stuff it into the array, which is
going to cause an autobox. So these little
tiny allocations are going to go into this array-- a lot of them, over time-- and because of the autoboxing, that's going to cause a bunch of allocations, and then we're going to need to collect them over time as well.
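The per-frame method looks roughly like this (a sketch of the demo, assuming a field of boxed Floats):

    // An array of boxed java.lang.Float objects, not primitive floats.
    private val floatArray = arrayOfNulls<Float>(100_000)

    // Called from onDraw() on every frame: each assignment autoboxes the
    // primitive float, creating 100,000 small temporary objects per frame.
    fun testAutoboxing() {
        for (i in floatArray.indices) {
            floatArray[i] = i * 0.5f
        }
    }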
So let's take a look at the demo. Let me pop out. All right, so we're
running on K here. We run this animation. We should take a
look at the log here. And we run the
autobox, and now we're calling out to that method. And the important thing here,
if you look at the log-- so we're taking allocation
times of 28, 24, in general high-20s
milliseconds. And we're causing a
lot of GC_FOR_ALLOCs, because that is what happens
when you do this thing. So we can pause this one. We can pop over and see what
this looks like on O. We can run the animation here. We enable the log for O.
We'll do the autoboxing. And now we can zoom
in on this, and you will notice a couple of things. One is that the
allocation times obviously are much less than
they were before. And also, more importantly,
there are no GC_FOR_ALLOCs. We allocate, we
collect, but we're never triggering that
that jank-inducing GC_FOR_ALLOC in this case. There is a similar test
that I wrote for bitmaps. So we're running along-- let's take a look
at the KitKat log. So in this one, you're
allocating bitmaps. And again, we're getting
a lot of jank there. If we zoom in on the
log, you're seeing the allocations for these
large objects at the 1,000 by 1,000 bitmap. If you're taking 12, 13
milliseconds each time, and you're constantly
triggering these GC_FOR_ALLOCs, because you need to collect
the old memory to make room for the new one. So if you pop over to the O
log, stop this one over here, run the animation,
do the bitmap test, and now we've got
allocation times of 0, 1. Because again, all
it's doing is a malloc. It's just shoving it into
the large object heap. Very easy to allocate, very easy
to collect when it needs it. Stop that. All right. That just duplicates
what I just said. Here's the bitmap test. Very simple. It's just walking
through every frame. It's allocating 1,000
by 1,000 bitmap. And then you see the
results that we did. ROMAIN GUY: All right. So this is a demo
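The per-frame work in that test is roughly this (a sketch, not the exact demo code):

    import android.graphics.Bitmap

    // Called on every frame: a 1,000 x 1,000 ARGB_8888 bitmap is about
    // 4 MB of pixel data that immediately becomes garbage.
    fun testBitmapAllocation() {
        val bitmap = Bitmap.createBitmap(1000, 1000, Bitmap.Config.ARGB_8888)
        // Dropped right away; the runtime has to reclaim it.
    }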
ROMAIN GUY: All right. So this is a demo to basically stress test the garbage collector and see what kind of overhead we get. So this is kind of
a ray tracer using fancy physically-based rendering
that I wrote for my desktop. CHET HASSE: You can tell it's
a ray tracer, because it's a checkerboard with spheres. ROMAIN GUY: That's right. And I ported it to Android. And I want to run
you through the code because there's hundreds
of lines of code. So this is Kotlin. But the trick here is that
this does a lot of allocation. So I have a data
class, a float3. It just contains three
floats, x, y, and z. And those are primitives. They're not the capital F
objects that you get in Java. And here I have a function, just
an excerpt of those hundreds of lines of code. And you can see it's just
doing math on the objects. So I'm using operator
overloading in Kotlin. We are multiplying a float3 called x by a bunch of constants. We do divisions, additions. But the important thing here is that, the way those functions are implemented, my float3 is immutable. So every time I add or multiply, I'm creating a new float3 object.
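Trimmed down, the data class and its operators look something like this (a sketch; the real code has many more operations, and the shade function here is just an illustrative example):

    // Immutable three-component vector: every operator returns a brand
    // new instance, so each *, / and + below is an allocation.
    data class Float3(val x: Float, val y: Float, val z: Float) {
        operator fun plus(v: Float3) = Float3(x + v.x, y + v.y, z + v.z)
        operator fun times(s: Float) = Float3(x * s, y * s, z * s)
        operator fun div(s: Float) = Float3(x / s, y / s, z / s)
    }

    // One small step of shading math: three temporary Float3 objects,
    // repeated for every sample of every pixel.
    fun shade(x: Float3) = (x * 2.51f + Float3(0.03f, 0.03f, 0.03f)) / 12.92f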
So those are fairly small objects. But for every pixel that we'll
want to render in that ray tracer, we're going to
be allocating hundreds of thousands or millions
of these float3's. So we're going to
really exercise the GC. So then I created two system
images, one on KitKat, one on O, both running
on the emulator. We're emulating
the same hardware. It's a Nexus 5x. We have the same
number of cores. They're running on
the same host machine. So the only difference really
is the garbage collector. What I'm going to
press Next, both demos will run at the same time. We're going to start rendering
an animation with that ray tracer. And at the top you have O, and
at the bottom you have KitKat. See if you can spot the
difference in performance. So we'll save us some time. It takes 47 seconds for KitKat
to render the first tile. CHET HASSE: Yeah, but
it's a really good tile. ROMAIN GUY: It's a
really good tile. And again, we're running
the exact same application. The only difference
is effectively the garbage collector. So on O, rendering
one tile takes about 100 to 500
milliseconds, and on K, it takes 40 to 50 seconds. So two orders of
magnitude slower. And if you look at the logs on
KitKat, this is what we see. We see a bunch of GC file logs. And I just grabbed logs
for about 10 seconds worth of computations. And you can see we're constantly
stopping all the threads. We're blocking the
applications all the time. Nothing is getting done. That's a lot of GCs. And you can see, every time the
GC takes 3 to 5 milliseconds. So if we look at
systrace on O, we can see that even though
things are better, they are not quite perfect. So this is systrace here. You can see what
the CPUs are doing. And from afar, it looks
like they are very busy because of three
worker threads that are computing this 3D scene. But you can see there's a lot
of cases where the CPUs are actually not doing anything. There's holes in our pipe. So if you look at
the app itself, we see two interesting things. First of all, I have
my three threads that are computing--
that are doing the work. They're busy chugging along. But from time to time, here for
instance, the pause, and you probably can't read this, but
it says "full suspend check." And then we're not
doing anything. We're not doing any computation. And the reason is we have this
thread called the HeapTaskDaemon. It's basically the
concurrent garbage collector. So even though we're doing
concurrent garbage collection, we're allocating
so many objects, it takes so much time for the
concurrent garbage collection to do its job. So here it's taking
about 200 milliseconds in load time that our threads-- from time to time-- have to block anyway. It's not that we
want to pause them because we have to pause them. It's not for the algorithms,
because the garbage collector is still busy. And originally, I was
planning more threads. And I was running
into weird situations where I had so many threads
doing computation that I was starting the GC thread. And it could not
run fast enough, so my threads were
pausing even more, and as a result, were getting
only about 30% to 40% CPU utilization. So we're not using
all the compute power that we have available
on the device. So the demo you saw
running on O could actually be something like three
times faster than that if I was not allocating as much. Another thing I
wanted to talk about-- we only have four minutes left,
so we'll go through this fairly quickly-- is that you have to be careful
when you create benchmarks, because the garbage
collector can greatly affect the results of your benchmark-- not really the benchmark
itself, but the algorithm you are benchmarking, once
it's inside your application, it might behave
very differently. So I'm going to
skip some of this. Basically a quick recap--
when you have a CPU-- this is the Pixel 3
CPU, the gold cores. We have big cores
and little cores. I'm looking at the big cores. Every core has an L1 cache. That's about 64 kilobytes. Every core has an L2 cache. That's a 256 kilobyte cache. And then there's an L3 cache
that's shared by all the cores. And this is important. Because when you
want to access data. So here, we have a
float array, and I just want to read one
float from that array. The first time we
access that float, the CPU is going to look in the
L1 cache, see if it's there. If it's not, it has to
go fetch it from the L2. If it's not there, it
has to go to the L3, and then finally to the RAM. And every time we have to fall
back to a higher level cache, we have to do an
expensive memory access that gets more
and more expensive as you go up the chain. So accessing the L1 takes
only a few nanoseconds. Accessing the L2 is going to
take 4 or 5 times that amount. Accessing the L3 is
going to be probably 10 times slower, and so on. So I wrote a demo that allocates
a list of arrays of floats. Each array of floats is 4
floats, so it's 16 bytes. They're represented
by the red lines here. And I built a benchmark
basically using that. I'm just going to
run some computations over those arrays of floats. So if I look at all those arrays
of floats, one after the other, in the loop, this is what it's
going to look like in RAM. All the arrays are
neatly stacked together next to one another in RAM. And I'm using a
width of 64 bytes here for a reason that we're
going to see in a minute. Then I wrote a very
simple algorithm. So I go through the arrays. That takes four of them. I run some computations. It doesn't really matter what
computations I'm running. And let's see what
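A sketch of that setup (the names and the padding size are illustrative):

    // Each element is 4 floats = 16 bytes, so four of them fit in one
    // 64-byte cache line when they sit next to each other in memory.
    val arrays = ArrayList<FloatArray>()
    val padding = ArrayList<ByteArray>()

    fun allocate(thrash: Boolean) {
        repeat(100_000) {
            arrays.add(FloatArray(4))
            // The "thrash" variant puts unrelated allocations between the
            // arrays so they no longer share cache lines.
            if (thrash) padding.add(ByteArray(64))
        }
    }

    // The benchmarked loop is identical in both variants; only the memory
    // layout produced by allocate() differs.
    fun compute(): Float {
        var sum = 0f
        for (a in arrays) {
            sum += a[0] + a[1] + a[2] + a[3]
        }
        return sum
    }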
And let's see what happens in memory. So when we access the first
float array in our list, it's not anywhere in our caches. It's in the RAM, but it's not
in the L1 or the L2 or the L3. So we're going to go fetch
it and put it in the L1. But one optimization
that CPUs have is that when you need
1 byte of memory, they're not going to
fetch only 1 byte. They're going to fetch
64 bytes at a time. So by finishing the
first array, we actually fetch the next three
arrays at the same time. So then when I want
to read those arrays, nothing happens because they're
already in the L1 cache. So this is pretty efficient. Then we run our computation. Now in my test app, I've
modified the initialization of the arrays so that I allocate
other stuff between each array. And I'm doing this to
basically replicate what happens when your garbage
collector moves things around, or you have
fragmentation in the app, if you do your allocations
over the lifetime of the application. For any number of reasons
that we've seen before, your locations won't be neatly
next to one another in RAM. So here I'm representing this
with a bunch of gray lines. So if you run the
algorithm again, we go fetch our first array. But instead of
fetching the other data that we want, we
fetch that gray data, stuff that we don't
even know what it is, but it's going to
be put in the L1. So then when we want the next
array, it's not in the L1, and we have to go back
to memory and get it, and so on, and so on. But again, we're running
the same algorithm. It's just now we
have to do more-- the CPU has to do more work. And we can recreate the same
thing by spacing out our arrays even more so that we won't find
the arrays in the L2 or the L3, and we can force the CPU to
do even more and more work. So if we run those different
variants of the algorithm-- where again all we
did was change the way we allocate the objects. We're running the exact
same computations. When everything is neatly
stored together in RAM, the algorithm takes about,
I think, 150 milliseconds on the Pixel 3. So that's the no thrash. And when I space out the
allocation so that we confine the data we want in
the L1, certainly we're almost twice as slow. We're running the same
exact computations. But the CPU is just busy
going to fetch data in RAM. And if I space out
the allocations even more so that we can't find
the data in the L2, now we are over
five times slower. Again, same exact algorithm. So if you write benchmarks--
and that's very good, you should probably do that-- be very careful. Be aware of the fact
that the numbers you're going to get in
your benchmark may be very different than
the numbers you're going to get in the actual app
running on your users' devices. Yeah, you are truly benchmarking
the super access patterns. And with that, we're done. We have six seconds left. CHET HASSE: Very
important thing-- if you are interested in
what we talked about today, there is a deeper
version of this as well as a lot of the
runtime improvements in ART over time by the
engineers on the ART team. So please check that out. It's at the same time as
the fireside chat, which I think is at 11:10. So please go see
that talk as well. And that's it. ROMAIN GUY: Thank you. [APPLAUSE] [MUSIC PLAYING]