Trash talk (Android Dev Summit '18)

Captions
[MUSIC PLAYING]

CHET HAASE: We're going to talk about trash. So why are we going to talk about trash? Why don't we-- yeah, let me--

[INTERPOSING VOICES]

CHET HAASE: All right, so there is a common phrase in software, "garbage in, garbage out," but nobody ever says how fast. So we're going to talk about that today. And why? Because back at I/O, we had this talk that we gave, called Modern Android Development. I have to apologize for the sunglasses. The sun was right there-- right there, really annoying, couldn't see a thing. Sunglasses didn't actually help. Nor did they make us look any better. So we talked about many things. You know, Android had certain development practices way back then. And Android has changed, and devices have changed, and ecosystems have changed. And now we are recommending different practices. All the stuff that you may have known about Android development may not necessarily be true today. One of the things that we talked about in particular was a lot about memory and garbage collection. We had certain recommendations there. So for example, we said, back in the Dalvik days, Dalvik was optimized for size. It was meant to fit in a very small area. It didn't have a lot of room to do things like AOT. It didn't have a place to stash the code. You really needed to constrain memory. The optimizations were not optimal. Allocations and collections-- unbelievably expensive. You'd get GC_FOR_ALLOC all the time, causing jank all over the place. Heap fragmentation was a problem. So really the recommendation was to not really allocate anything ever if you could possibly help it. And use primitive types everywhere, because objects are expensive because you're allocating them. Avoid autoboxing, all this stuff.

ROMAIN GUY: Yes, so I need to correct you once again. You said, avoid allocations whenever possible. Enums-- they don't allocate.

CHET HAASE: Yeah. OK, well--

ROMAIN GUY: That's the whole point.

CHET HAASE: But-- but they take up space, right? So that's what, 1 to 2K compared to--

ROMAIN GUY: You can't say--

[INTERPOSING VOICES]

CHET HAASE: All right, anyway. It's memory-related. So the recommendation instead was, pay attention to ART, because it turns out ART is optimized for performance, actually getting faster with every release, because the whole platform was built to be able to optimize more and more the more the team works on it. We're doing JIT as well as AOT. So we compile this code. We find out where the hotspots are, and we stash this code somewhere so we can run it faster next time you go through that loop. Allocations and collections are much, much faster. We'll go into details about this. We have the ability to defragment the heap and actually compact as we go now. And there's a large object heap, which makes some of the allocations amazingly faster and simpler than they used to be. So the new recommendations are, go ahead and allocate. It's really not that big a deal anymore. Still be concerned for inner loop situations, and be aware that you are actually causing the device to do stuff. You are causing battery and CPU usage. So you still want to be aware of these things, but maybe they're not such a big deal that they should affect your APIs and your development patterns the way that they used to. However, that was a lot of really terse information stuffed into a very short amount of time. So we thought maybe we should do this talk to actually explain ourselves a little more completely, and say why this is the case.
Maybe talk about what ART has done to make life better. The original idea for the talk was actually, OK, well, we said all this stuff, but wouldn't it be nice if we could just write a bunch of demo applications, and show benchmarks, and say, this is why, and these are the canonical results that prove our premise. And it turns out that is really hard. Because garbage collection, by its nature, especially in ART, is concurrent. There's stuff happening in the background all the time. And if you want to trigger that thing right now, you will not be able to. So we made an attempt to write some demo applications. You will see some of the results here. But we realized we don't really have enough canonical, deterministic data to show you. So instead we're going to tell you the background of why it's difficult to write these things, because everything is happening magically for you in the background.

So first of all, let's talk about memory. We see a couple of simple lines of code here. We have a primitive type. Here's foo. We're going to set it equal to 5, and then we're going to allocate an object here. Well, there are different kinds of memory in the system, and different implications. So if we're going to have a primitive type, that may show up on the stack, or it may show up in registers. Dalvik was a register-based virtual machine, so it would actually pop it in over there. But whether it shows up on the stack or in a register, you can think of it as essentially free. It's kind of allocated at compile time. It says, when I run this line of code, here's where I'm going to stick this thing. It didn't need to find space for it anywhere. It knows that it has space on the stack. You can sort of think of that as kind of limitless storage space-- as well as the registers. It will basically find a cubbyhole to stash that thing. It's free. However, when you allocate any other sort of thing-- when you say, new Object-- then it needs to find space for it in the heap, which means it needs to figure out where it's going to fit among the other things that already occupy the heap. And it will find space down there. So that's the basic memory system of the runtime and the computer overall.

So the idea of garbage collection-- we've all been doing garbage collection, even before higher-level languages like Java and Kotlin. If you're writing C++ code, then you can do an allocation like this, and then you can write the code that actually uses that object that you've allocated. And then if you don't do anything else, you've just leaked an object. You're taking up space in memory somewhere that eventually is going to cause you to run out of memory. So what you need to do is actually free the object. So you delete the thing, you reclaim that memory. So you're basically doing manual garbage collection here. But sometimes you forget that you've freed that over there. And then in this other place over there, you continue using that object. And then you crash-- maybe. So it's a very non-deterministic system. You are responsible for managing your own garbage, your own allocations and collections. It tends to be tedious. It tends to be very error-prone. And so higher-level languages came along. So we have languages like Java, where you allocate an object, and then you write the code that actually uses that thing, and then eventually it is freed. And if you continue using it, then you still have a reference to this object, which means it won't be freed too soon; it will eventually be freed once nothing references it anymore. The system knows what's going on.
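To make the reachability idea concrete, here is a minimal Kotlin sketch (names like Payload and Cache are made up for illustration; this is not code from the talk): an object becomes collectible once the last reference to it is gone, and can never be collected while some root still reaches it.

```kotlin
// Hypothetical sketch (not code from the talk): with a garbage collector,
// lifetime is decided by reachability rather than by explicit delete calls.
class Payload(val data: ByteArray = ByteArray(1024))

object Cache {
    // Anything referenced from a static root like this stays reachable,
    // so the collector will never reclaim it -- the classic "leak".
    val retained = mutableListOf<Payload>()
}

fun use(p: Payload) = println("using ${p.data.size} bytes")

fun main() {
    var temp: Payload? = Payload()  // allocated on the heap
    use(temp!!)                     // safe while a reference still exists
    temp = null                     // last reference dropped: now collectible

    Cache.retained += Payload()     // still reachable from Cache, never collected
}
```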
You don't have to manage this thing. It does it for you. However, if it's doing it for you, several questions occur. And there is no crash. Yay-- well, ideally. So there are some things that naturally occur to you, such as, OK, well, how long does it take for the system to do this? I know how long my malloc statement was going to take in C++. But how long is it going to take for this system to automatically find space in the heap for me? How long does it take to walk the heap, find the appropriate spot to put this thing, and then allocate that space? How long does it take to actually collect these objects? When a reference goes away, when will it be collected, and how long does it take to actually collect these things? And what impact is there system-wide? So if we're running this sort of garbage collector thread, this heavyweight thing in the background, then what impact does that have on jank on the UI thread or whatever? And when do these things actually happen? And also, how efficient is that heap usage? It's not just going to malloc these little tiny chunks for your things. It's probably allocating a large amount of space, and then putting things in there, and sorting them around. And certainly on Dalvik, as we'll see, it's fragmenting the heap over time, which means you may have a lot of memory available, but you can't necessarily even access it anymore. So how efficient is that, and how much memory are you taking up in the system to do all this stuff?

ROMAIN GUY: So we're going to start by taking a look at how Dalvik collects garbage. Dalvik was the runtime we were using until Android KitKat. It was replaced in Lollipop by ART. So this is a picture of the heap. Everything that's in white has been allocated. And we have a few holes. And we're trying to allocate this blue object. So what Dalvik is going to do, it's just going to walk through the heap and find a spot where the object fits. When it finds one, pretty easy, it just slots it there. Things become a lot more complicated when it comes time to collect objects. So there are four phases. In the first one, Dalvik has to pause all the threads in your application to find the root set. So root sets are typically local variables, threads, static variables. They're just the roots of all the allocations that can be reached in your application. So that takes a bit of time. And during that time, your application cannot do anything. The second phase is concurrent. Your app is running again. So from those roots, Dalvik will find all the objects that can be reached and mark them as such. Unfortunately, since that second phase is concurrent, allocations can be triggered during that time. So you see here, for instance, we're just allocating another object. So we need to pause the application again and find the reachable objects again. And all your threads are paused, so your application stutters a little bit again. And finally, we can mark all the objects that are not reachable. And those are candidates for garbage collection. So the collection itself just gets rid of those objects in the heap.

So now, if we want to allocate something in the heap, and the heap is pretty much full, and we have objects that have been marked for collection, Dalvik will go through the heap, realize that there is no memory available, and then it will trigger a garbage collection for the specific use case of an allocation. That's the GC_FOR_ALLOC. And every time this happens, you see a log. On KitKat, doing an adb logcat, you can see those GC_FOR_ALLOC messages.
So it will run the collection, get rid of this memory, then we can run through the typical allocation mechanism. However, sometimes your heap is full and there are no objects that can be collected. Every object is reachable. So Dalvik will run through the heap-- can't find anything. And only two things can happen. Either the heap will grow or your application will crash. And that's the Out of Memory error that you've seen, I'm sure, in a lot of your applications, especially on older devices. This typically happens when you are allocating large objects. I remember a lot of developers in the early days of Android filed bugs against BitmapFactory, because Out of Memory errors tended to happen during the decoding of bitmaps. And they thought there was a memory leak in BitmapFactory. The main reason was bitmap objects are big, and it was hard to find space for them. There was no leak in BitmapFactory whatsoever.

CHET HAASE: So we wrote a simple demo application to show how some of this heap fragmentation stuff works. So in the demo, we allocate chunks of memory all the way up to the max heap. So your heap starts out very small. And then if you can't allocate an object, it's going to grow that over and over and over until it reaches the max possible. So for this demo, we allocate these 1 meg chunks. You press this button that says there's 400 megs free. You press the button, it's going to allocate all these 1 meg chunks, growing the heap until it cannot anymore. You get an error. We catch the error, and now we know the heap is full. There's no space anywhere.

ROMAIN GUY: And this is pretty much the only situation where you should do a try-catch on an Out of Memory error. Don't do that in your application.

[CHET CHUCKLING]

CHET HAASE: And then we say, OK, well, there's only one meg free. So we're going to go ahead and fragment the heap, and we're going to free a bunch of chunks. So we're basically going to go through everything we've allocated, and free every other one through the entire heap.

ROMAIN GUY: So we just go through every other reference, set it to null, then force the GC a couple of times to make sure that that memory goes away.

CHET HAASE: And then we get this result that says, OK, the fragmented heap size is now 200 megs. So we should have plenty of space to go ahead and allocate a two megabyte object. So that blue object there, it's only 2 megs. We've got 200 megs free. So what could be the problem? So we press the button, and it says, nope, we can't fit that in there. And if you look at the log, you get these depressing things here, where it says--

ROMAIN GUY: It's the best error message in all of computer science, I think.

CHET HAASE: Yes, right here. This is beautiful. It says, OK, you have 200 megs free out of 400. And we're forcing a collection so that we can make room for a 2 meg chunk. And we're trying really hard to do that, and we ran out of memory. So we cannot find space for 2 megs out of 200. Because apparently Dalvik was really bad at math. The problem was, of course, that we couldn't find 2 megs contiguous, and Dalvik did not have the ability to compact the heap. You get what you get. Once we put a thing there, it was not going to move. So we couldn't actually shove stuff to the side to make room for a larger object.
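The demo code itself wasn't shown on stage, but a rough Kotlin sketch of the sequence just described might look like the following (hypothetical code; the real app's chunk bookkeeping and button wiring are omitted):

```kotlin
// Rough sketch of the fragmentation demo described above (hypothetical code,
// not the actual demo app). On a non-compacting collector like Dalvik's, the
// final 2 MB request can fail even though ~200 MB are nominally free, because
// none of that free memory is contiguous.
const val MB = 1024 * 1024

fun fragmentationDemo() {
    val chunks = mutableListOf<ByteArray?>()

    // 1. Fill the heap with 1 MB chunks until it cannot grow any further.
    //    This is about the only case where catching OutOfMemoryError is defensible.
    try {
        while (true) chunks.add(ByteArray(MB))
    } catch (e: OutOfMemoryError) {
        println("Heap full after ${chunks.size} MB")
    }

    // 2. Free every other chunk, then request a couple of collections.
    for (i in chunks.indices step 2) chunks[i] = null
    repeat(2) { Runtime.getRuntime().gc() }

    // 3. Roughly half the heap is free again, but only in 1 MB holes, so a
    //    single contiguous 2 MB allocation can still fail on Dalvik.
    try {
        val big = ByteArray(2 * MB)
        println("Allocated ${big.size / MB} MB")
    } catch (e: OutOfMemoryError) {
        println("Could not place 2 MB in ~${chunks.size / 2} MB of free space")
    }
}
```

On a runtime that can compact the heap, the same sequence generally succeeds, which is where the talk goes next.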
So ART came along in Lollipop. As I said, it's a platform for optimizations. It no longer had the memory constraints that Dalvik did, so they could build in a lot of the fundamentals that they'd been improving over time. But out of the box, it came with much faster allocation times, much faster collections, as well as a faster runtime, and the ability to do ahead-of-time compilation, so we're actually running binary code all the time. We're not constantly jitting things to find out how we can speed things up. We're making everything faster.

ROMAIN GUY: One thing we need to make clear is that when we talk about allocation and faster allocation in this talk, we mean just the time it takes for the runtime to reserve the memory. We are not talking about the time it takes to run the constructors. It has nothing to do with your code. It's only in the runtime itself.

CHET HAASE: All right, so how did ART allocation work?

ROMAIN GUY: So in ART, we introduced a new allocator called RosAlloc. And I don't know what it stands for, actually. But it replaces something called dlmalloc. So "dl" stands for Doug Lea, who is also the person who wrote, I believe, java.util.concurrent.

CHET HAASE: That's what happens if you write a really nice algorithm. Then people name the algorithm after you.

ROMAIN GUY: Yeah, he's the kind of person that makes me feel very inadequate as an engineer-- really smart. Anyway, dlmalloc is basically the algorithm behind the malloc function call in native code. So this is what you use when you call malloc, or when we call new in C or C++. This is the algorithm we're using. So Dalvik was relying on that. ART replaced it with its own, called RosAlloc. So the main benefit of RosAlloc is that it's thread-aware. So it's able to do allocations that are specific to a thread. And we're going to look at this in detail, and you'll understand why it brings a lot of benefits. There are also a lot of little tweaks that have been done. So small allocations are grouped together to reduce fragmentation. We align larger allocations on pages-- typically 4 kilobytes on modern OSes-- which gives you better performance. Also, finer-grained locks: the garbage collector has to acquire locks because we have a lot of threads running. They used to protect a lot more code. Now they protect less code, and it runs faster. And overall, allocations with RosAlloc are four to five times faster than they were with Dalvik. Again, this is just about the act of allocating the memory. It has nothing to do with your code. You could do something really, really, really expensive in your constructor, and we would not be five times faster than Dalvik.

CHET HAASE: All right. So let's take a look at how allocations work on ART, in this new system. Oh, sorry-- the other very, very important thing with ART was the ability to deal with large objects in a much better way. So you've got this normal-sized object. It's going to find a space for it in the heap over here. And what happens if you have this large item? By a large object, we mean an array of primitives or a string. And these are the types chosen because we can guarantee that these objects will not have a reference to something else. They can live completely on their own.

ROMAIN GUY: And it's an array of at least 12 kilobytes.

CHET HAASE: Yes. And that is the heuristic for now. That could change over time. But right now, it's 12K, primitive arrays or strings. So you've got this huge object. Where are we going to put it? Well, in Dalvik, we would put it where, exactly? You can see in this fragmented heap that there may not be space for it. In ART, it's a little bit simpler. The complicated mechanism looks like this: we just put it somewhere. We just malloc a space for it, and shove it in there.
It's not even in an object bucket that holds all of them. We just allocate a space for it somewhere in there. And we say, OK, you are now part of the heap. But really it's just living on its own somewhere-- very fast, very easy. It's also a moving collector, so we can actually compact things. So we no longer have the fragmentation problem. However, it does this in the background. So actually it's a little more complicated. My understanding originally was, well, if your application goes into the background, then eventually this very expensive operation-- it could take up to 100 milliseconds-- may run, and that's going to compact the heap. Obviously, we don't want to run that in the foreground, because we're going to jank your app all over the place. So we're going to wait till it's sitting there in the background, the user is doing something else, they're not paying attention. So we're compacting the heap for you. That's awesome. So I said, OK, well, I'm going to demonstrate this and show that same defragmentation crash error that we saw earlier. I'm going to show how it crashes on KitKat, using Dalvik, and it will also crash in all of the releases until we're able to do it in the foreground on a later release, in O. And this will be a cool demo. And then I ran it on L, and it didn't crash. And the thing is, yes, it will defragment in the background, but it will also do it in the foreground if it has to, which is really what you want. So if you actually need that memory now, wouldn't it be nice if you didn't crash? So that's what we--

ROMAIN GUY: That compaction is almost a replacement for the GC_FOR_ALLOC from before.

CHET HAASE: So now it basically takes everything, and it says, well, you need space for this really large object. We're going to go ahead and compact things, and then put it where we need to. So on L and above, we run the same fragmentation demo that we saw before. We go ahead and alloc up to the maximum heap size. It says, yep, you've only got about 1 meg free. And we go ahead and free every other meg, null out those references, and then we try to find space for this 2 meg block. It compacts the heap, and puts it in-- very simple.

ROMAIN GUY: So, another improvement-- remember, with the Dalvik GC we had those four phases, including two pauses for your application. So the pauses were bad because your application was not doing anything during that time. But what was even worse was that those pauses could be pretty expensive. So on average, the sum of those two pauses was about 10 milliseconds. And even when it was only 10 milliseconds, that was actually pretty good. I mean, we've done a lot of performance work over the years, and I've seen these kinds of pauses last for 100, 200 milliseconds in some applications. And during that time, nothing happens, which means no matter how good your UI is, it will jank. Like, if the user is trying to scroll, it's not going to work well. One of the things that ART does is it removes one of the pauses. So now the first step-- marking the root set, finding the roots of all the allocations that are reachable in your heap-- is a concurrent phase. It doesn't pause the application anymore. And on top of that, the second one-- the only pause that we still have left-- is also a lot faster. So instead of spending 10 milliseconds in there, we only spend about 3 milliseconds now. So at most, your application will probably pause for 3 milliseconds. So if your application is well-optimized and the GC happens during an animation or scrolling,
you should still be able to reach 60 fps without any jank. Another thing that was introduced in ART was the concept of the minor garbage collection. So the idea here is to keep track of all the objects that have been allocated since the last major garbage collection. Those are typically temporary objects, and we're going to look at them first. So if we can reclaim enough memory by looking at those objects first, because they're short-lived, we won't have to go through the entire heap.

CHET HAASE: This is one of the things that has an important consequence for Android development, where we used to tell you, never allocate even temporary objects, because they tend to be expensive, because they're going to fragment the heap. We have to do the allocation. We have to do the collection. All of a sudden, we made temporary object allocation and collection much faster and easier.

ROMAIN GUY: It's not free. It's just less expensive.

CHET HAASE: Yes.

ROMAIN GUY: We also introduced the large object heap that we talked about. So we have less fragmentation. But one of the huge benefits of that is, because we don't fragment the heap as much, we don't have to grow the heap as much in all the processes. And of course, we don't have to GC_FOR_ALLOC. I mean, they still exist. We just don't see them as much, whereas they were very, very common in the Dalvik days. And also there's a faster runtime. That's the ahead-of-time compilation. And we introduced the JIT. But this has nothing to do with the garbage collector. Marshmallow-- I was joking with Chet yesterday that it's kind of a boring release, because I can never remember what was in Marshmallow. So here it is-- optimizations in ART.

CHET HAASE: Things got faster. Fine-grained details: things got faster. And again, things got faster. Isn't that nice? Allocation in particular. They rewrote everything, all the core stuff, in assembly. And that turns out to still help in software. Who knew? And now we're up to about 10 times faster for allocation costs when you compare it to Dalvik. And now we're in Oreo, where basically they rewrote the whole thing. So we introduced an entirely new collector, called the concurrent heap compaction collector. And this means that now we can actually compact in the foreground, not just when we need to do that GC_FOR_ALLOC compaction to find a space. It is actually running all the time, moving stuff around, and optimizing what the heap looks like so that allocations are much faster and easier overall. So: defragmentation in the foreground. You're not resizing the heap as often, because the heap is always optimized, because we're always sort of culling things out of it as necessary. There are far fewer GC_FOR_ALLOCs, because we just don't get into that situation anymore. Huge savings for the entire device. Think about it. If you can only defragment the heap when an application is in the background, what about the applications and services that are constantly in the foreground-- the system server, Play Services, System UI? We couldn't necessarily defragment those until things got into a really bad situation. Wouldn't it be nice if we could do it in the foreground so that those things were optimized? Well, if we can do it for them, that means we're getting system-wide savings on the pure heap size of all of those applications. So smaller heaps for all means less memory used in the entire system. And what we found was the entire device had about a 30% savings on overall memory requirements.
So we can take a look at how compaction works in general. We have these 256K buckets that are assigned per thread. Which means, again, huge savings in not having to lock down all the threads to do these allocations and collections. Instead, a thread, if it needs memory, is just responsible for itself. So a thread says, OK, I need some memory. It's handed one of these buckets. It allocates in there. And then over time that thing may get defragmented. There's a heuristic they have: if there's less than 70% or 75% utilization in one of these things, then they'll go ahead and collect it, and empty the thing out entirely. So we see the T1, 2, 3 regions here don't have much going on. So we're going to take the memory in there, all of those allocations, and shove them somewhere else in the heap, and completely empty out those buckets, which allows something that's super-efficient, called a thread-local bump allocator. This means that all we have to do is actually just move a pointer. We don't need to walk the heap to find where the free space is. We just put the next object in the next available space. And we know where that is according to the pointer. It turns out, with all of this stuff put together, we're now 18 times faster than Dalvik for these allocations. So we can see how these allocations work in practice. You can see all these little colored objects. We're making space for them in the different threads that need those allocations. If we take a close look at T1, you've got the free pointer. We've emptied it out. We've zeroed it out. We're at the beginning of that bucket. And now we need to allocate an object. Well, we know where to put it, because the pointer tells us. And now all we do is advance the pointer. The pointer tells us where to put the next one, and the one after that, and so on. So very fast and easy compared to what we were doing before. We can see on the graph the comparison of where we were at with Dalvik allocation costs compared to where we're at now in O, with bump pointer and assembly allocations instead.

All right, so where are we going now? It is important to note that the young generation stuff that we talked about as being so awesome is currently gone.

ROMAIN GUY: But it's back in AOSP. And so it will probably be in a future release.

CHET HAASE: Yep. So it's important to note. There are trade-offs here. We believe it's better overall. All of the benefits that you get from the garbage collector in O should compensate for the young generation collections not being there anymore. However, they're still a nice thing to have. So they're back in AOSP. So look for those to show up in a future release.

ROMAIN GUY: So object pools-- this is a technique that we've recommended using in the past. We use it ourselves extensively inside the platform. And the conventional wisdom is that reusing objects has to be faster than allocating them and collecting them all the time. And you can see here we have a performance graph, where the x-axis is the size of the object that you're creating or reusing. In red is the time it takes to handle those objects with a pool, and in blue is the time it takes to just allocate and collect those objects. And until N, using the pool was basically always a win compared to the garbage collector. But with O, with all the improvements we've done, and this new thread-local bump allocator, synchronized pools of objects are generally slower. And I want to emphasize the synchronized part of the pool. If you have a pool of objects that's used only on one thread, and that thread only, you're effectively doing the kind of optimization that ART does for you now in O. So you might not see those same kinds of savings. But in general, on O, if you have a synchronized pool of objects, you'd probably be better off without that pool. Again, make sure you profile your application before you take it out.

CHET HAASE: And there's a certain memory size-- a certain size of object-- where these graphs cross. But this was in a benchmark application that was really hammering it and maximizing the bandwidth. So the general advice is, you really shouldn't use that, especially the synchronized pool approach. Because A, it's error-prone and tedious to manage, and B, it is slower in general, because that synchronized access tends to be slower than what we can do now with--

[INTERPOSING VOICES]

ROMAIN GUY: The whole point of O is that we don't have a lock anymore for the allocations.
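As a minimal sketch of the kind of synchronized pool being discussed (hypothetical Point and SynchronizedPool names, not the platform's own pool classes), compare the two paths below: the pooled one funnels every acquire and release through a single lock, while the plain one just allocates a short-lived object and lets the collector deal with it.

```kotlin
// Sketch of the trade-off discussed above (hypothetical code, not the talk's
// benchmark): a synchronized pool serializes acquire/release on one lock,
// while a plain allocation on O+ goes through the runtime's lock-free,
// thread-local bump allocator.
class Point(var x: Float = 0f, var y: Float = 0f)

class SynchronizedPool<T>(private val factory: () -> T) {
    private val free = ArrayDeque<T>()

    fun acquire(): T = synchronized(free) { free.removeFirstOrNull() } ?: factory()

    fun release(obj: T) = synchronized(free) { free.addLast(obj) }
}

val pointPool = SynchronizedPool { Point() }

fun withPool(): Float {
    val p = pointPool.acquire()      // contends on the pool's lock
    p.x = 1f; p.y = 2f
    val r = p.x + p.y
    pointPool.release(p)             // contends again on the way back
    return r
}

fun withAllocation(): Float {
    val p = Point(1f, 2f)            // short-lived allocation; cheap on ART O+
    return p.x + p.y                 // p becomes garbage immediately afterwards
}
```

The pooled version also carries the usual correctness hazards (double release, use after release), which is part of the "error-prone and tedious" cost mentioned above.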
CHET HAASE: So what are the recommendations now? Creating garbage is OK. I wouldn't go out of your way to create garbage. It is still taking time. You are requiring us to take time to allocate objects, as well as later collect them. And you're taking up battery and CPU, and you're causing--

ROMAIN GUY: But don't do like Chet. Pick up after yourself. You should see his desk. It's pretty disgusting.

[CHUCKLING]

CHET HAASE: I like to be a counterexample. So in general, creating the stuff, if you need it, is OK, and so is collecting it. Use the types and the objects that you need. If they make sense for the architecture, for the APIs that you're building, for the libraries, for your code, go ahead and use those. We are no longer pushing everybody to use ints and bitflags everywhere to optimize every little CPU cycle and memory allocation. Instead, go ahead and allocate when you need to for your use case. However, GC is still overhead.

ROMAIN GUY: And we're going to look at a demo that showcases that.

CHET HAASE: Yep. And make the right choices for you. And in general, as framework programmers, we are still your inner loop. So we still follow the old practices. Because why would we allocate if we didn't need to, if we can make your inner loop faster? So be aware of the inner loop situations. Otherwise, do the right thing for your code.

All right, so I wrote a simple application to showcase some of the jank that you can see because of allocations and collections. During onDraw, I would call out to a method to run some stuff. And in this case, we're going to test autoboxing. So we have an array of 100,000 Float objects-- capital-F Float objects. So it's not the primitive float. Instead, it's the capital F. So we're going to be allocating these objects on the fly. Here's what the method looks like that's going to run on every single frame: it's going to walk the entire length of the array, take a primitive float, and stuff it into the array, which is going to cause an autobox. So these little tiny allocations are going to go into this array-- a lot of them, over time. Because of the autoboxing, it's going to cause a bunch of allocations, and then we're going to need to collect them over time as well.
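The method itself wasn't shown line by line, so here is a Kotlin sketch of what's being described (hypothetical names; the original demo may well have been written in Java, but the boxing behavior is the same):

```kotlin
// Sketch of the autoboxing test described above (hypothetical names; the demo
// app was not shown in full). In Kotlin, Array<Float> is an array of boxed
// java.lang.Float -- the equivalent of Java's Float[] -- so every store below
// allocates a small temporary object.
const val COUNT = 100_000

private val boxed: Array<Float> = Array(COUNT) { 0f }   // 100,000 boxed slots

// Imagine this being called once per frame, e.g. from a custom View's onDraw().
fun autoboxPass(frame: Int) {
    val value = frame.toFloat()          // primitive float
    for (i in 0 until COUNT) {
        boxed[i] = value + i             // autoboxes: a new Float per store
    }
}
```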
So if we take a look at the demo-- let me pop out. All right, so we're running on K here. We run this animation. We should take a look at the log here. And we run the autobox test, and now we're calling out to that method. And the important thing here, if you look at the log-- we're seeing allocation times of 28, 24, in general high-20s of milliseconds. And we're causing a lot of GC_FOR_ALLOCs, because that is what happens when you do this thing. So we can pause this one. We can pop over and see what this looks like on O. We can run the animation here. We enable the log for O. We'll do the autoboxing. And now we can zoom in on this, and you will notice a couple of things. One is that the allocation times obviously are much less than they were before. And also, more importantly, there are no GC_FOR_ALLOCs. We allocate, we collect, but we're never triggering that jank-inducing GC_FOR_ALLOC in this case.

There is a similar test that I wrote for bitmaps. So we're running along-- let's take a look at the KitKat log. In this one, we're allocating bitmaps. And again, we're getting a lot of jank there. If we zoom in on the log, you're seeing the allocations for these large objects, the 1,000 by 1,000 bitmaps. They're taking 12, 13 milliseconds each time, and we're constantly triggering these GC_FOR_ALLOCs, because you need to collect the old memory to make room for the new one. So if we pop over to the O log-- stop this one over here, run the animation, do the bitmap test-- now we've got allocation times of 0, 1. Because again, all it's doing is a malloc. It's just shoving it into the large object heap. Very easy to allocate, very easy to collect when it needs to. Stop that. All right. That just duplicates what I just said. Here's the bitmap test. Very simple. It's just walking through every frame, allocating a 1,000 by 1,000 bitmap. And then you see the results that we did.

ROMAIN GUY: All right. So this is a demo to basically stress test the garbage collector and see what kind of overhead we get. So this is a kind of ray tracer using fancy physically-based rendering that I wrote for my desktop.

CHET HAASE: You can tell it's a ray tracer, because it's a checkerboard with spheres.

ROMAIN GUY: That's right. And I ported it to Android. And I won't run you through the code, because there are hundreds of lines of code. So this is Kotlin. But the trick here is that this does a lot of allocation. So I have a data class, a float3. It just contains three floats: x, y, and z. And those are primitives. They're not the capital-F objects that you get in Java. And here I have a function, just an excerpt of those hundreds of lines of code. And you can see it's just doing math on the objects. So I'm using operator overloading in Kotlin. We are multiplying a float3 called x by a bunch of constants. We do divisions, additions. But the important thing here is that, the way those functions are implemented, my float3 is immutable. So every time I add or multiply, I'm creating a new float3 object. So those are fairly small objects. But for every pixel that we want to render in that ray tracer, we're going to be allocating hundreds of thousands or millions of these float3's. So we're going to really exercise the GC. So then I created two system images, one on KitKat, one on O, both running on the emulator. We're emulating the same hardware. It's a Nexus 5X. We have the same number of cores. They're running on the same host machine. So the only difference, really, is the garbage collector. When I press Next, both demos will run at the same time. We're going to start rendering an animation with that ray tracer. And at the top you have O, and at the bottom you have KitKat.
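The full ray tracer wasn't shown, but the float3 pattern being described looks roughly like this (an illustrative sketch; the exact operators and the shade function here are assumptions, and the demo's class is spelled float3):

```kotlin
// Sketch of the immutable vector type described above. Because the class is
// immutable, every operator returns a brand-new instance, which is what puts
// millions of small allocations on the GC for each rendered frame.
data class Float3(val x: Float, val y: Float, val z: Float) {
    operator fun plus(v: Float3) = Float3(x + v.x, y + v.y, z + v.z)
    operator fun minus(v: Float3) = Float3(x - v.x, y - v.y, z - v.z)
    operator fun times(s: Float) = Float3(x * s, y * s, z * s)
    operator fun div(s: Float) = Float3(x / s, y / s, z / s)
}

// Typical shading-style math: each operator below allocates a temporary Float3.
fun shade(color: Float3, lightDir: Float3, intensity: Float): Float3 =
    (color + lightDir * 0.5f) * intensity / 2.2f
```

Because every plus, minus, times, and div returns a fresh instance, shading a single pixel allocates many temporaries, and a full frame reaches into the millions, which is exactly the load the two emulator images are about to be compared under.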
See if you can spot the difference in performance. I'll save us some time: it takes 47 seconds for KitKat to render the first tile.

CHET HAASE: Yeah, but it's a really good tile.

ROMAIN GUY: It's a really good tile. And again, we're running the exact same application. The only difference is effectively the garbage collector. So on O, rendering one tile takes about 100 to 500 milliseconds, and on K, it takes 40 to 50 seconds. So two orders of magnitude slower. And if you look at the logs on KitKat, this is what we see. We see a bunch of GC_FOR_ALLOC logs. And I just grabbed logs for about 10 seconds' worth of computation. And you can see we're constantly stopping all the threads. We're blocking the application all the time. Nothing is getting done. That's a lot of GCs. And you can see, every time, the GC takes 3 to 5 milliseconds.

So if we look at systrace on O, we can see that even though things are better, they are not quite perfect. So this is systrace here. You can see what the CPUs are doing. And from afar, it looks like they are very busy, because we have three worker threads that are computing this 3D scene. But you can see there are a lot of cases where the CPUs are actually not doing anything. There are holes in our pipeline. So if you look at the app itself, we see two interesting things. First of all, I have my three threads that are computing-- that are doing the work. They're busy chugging along. But from time to time-- here, for instance-- they pause, and you probably can't read this, but it says "full suspend check." And then we're not doing anything. We're not doing any computation. And the reason is we have this thread called the HeapTaskDaemon. It's basically the concurrent garbage collector. So even though we're doing concurrent garbage collection, we are allocating so many objects that it takes a lot of time for the concurrent garbage collector to do its job. So here it's taking about 200 milliseconds, and our threads, from time to time, have to block anyway. It's not that we want to pause them; we have to pause them. It's not because of the algorithm; it's because the garbage collector is still busy. And originally, I was planning on using more threads. And I was running into weird situations where I had so many threads doing computation that I was starving the GC thread. It could not run fast enough, so my threads were pausing even more, and as a result, we were getting only about 30% to 40% CPU utilization. So we're not using all the compute power that we have available on the device. So the demo you saw running on O could actually be something like three times faster than that if I was not allocating as much.

Another thing I wanted to talk about-- we only have four minutes left, so we'll go through this fairly quickly-- is that you have to be careful when you create benchmarks, because the garbage collector can greatly affect the results of your benchmark. Not really the benchmark itself, but the algorithm you are benchmarking, once it's inside your application, might behave very differently. So I'm going to skip some of this. Basically, a quick recap: when you have a CPU-- this is the Pixel 3 CPU, the gold cores; we have big cores and little cores, and I'm looking at the big cores-- every core has an L1 cache. That's about 64 kilobytes. Every core has an L2 cache. That's a 256 kilobyte cache. And then there's an L3 cache that's shared by all the cores. And this is important when you want to access data.
So here, we have a float array, and I just want to read one float from that array. The first time we access that float, the CPU is going to look in the L1 cache to see if it's there. If it's not, it has to go fetch it from the L2. If it's not there, it has to go to the L3, and then finally to RAM. And every time we have to fall back to a higher-level cache, we have to do an expensive memory access that gets more and more expensive as you go up the chain. So accessing the L1 takes only a few nanoseconds. Accessing the L2 is going to take 4 or 5 times that amount. Accessing the L3 is going to be probably 10 times slower, and so on.

So I wrote a demo that allocates a list of arrays of floats. Each array of floats is 4 floats, so it's 16 bytes. They're represented by the red lines here. And I built a benchmark basically using that. I'm just going to run some computations over those arrays of floats. So if I allocate all those arrays of floats, one after the other, in a loop, this is what it's going to look like in RAM. All the arrays are neatly stacked together, next to one another, in RAM. And I'm using a width of 64 bytes here for a reason that we're going to see in a minute. Then I wrote a very simple algorithm. So I go through the arrays, I take four of them, and I run some computations. It doesn't really matter what computations I'm running. And let's see what happens in memory. So when we access the first float array in our list, it's not anywhere in our caches. It's in RAM, but it's not in the L1 or the L2 or the L3. So we're going to go fetch it and put it in the L1. But one optimization that CPUs have is that when you need 1 byte of memory, they're not going to fetch only 1 byte. They're going to fetch 64 bytes at a time. So by fetching the first array, we actually fetch the next three arrays at the same time. So then, when I want to read those arrays, nothing happens, because they're already in the L1 cache. So this is pretty efficient. Then we run our computation.

Now, in my test app, I've modified the initialization of the arrays so that I allocate other stuff between each array. And I'm doing this to basically replicate what happens when your garbage collector moves things around, or you have fragmentation in the app, or you do your allocations over the lifetime of the application. For any number of reasons that we've seen before, your allocations won't be neatly next to one another in RAM. So here I'm representing this with a bunch of gray lines. So if we run the algorithm again, we go fetch our first array. But instead of fetching the other data that we want, we fetch that gray data-- stuff that we don't even know what it is-- and it's going to be put in the L1. So then, when we want the next array, it's not in the L1, and we have to go back to memory and get it, and so on, and so on. Again, we're running the same algorithm. It's just that now the CPU has to do more work. And we can recreate the same thing by spacing out our arrays even more, so that we won't find the arrays in the L2 or the L3, and we can force the CPU to do even more and more work. So if we run those different variants of the algorithm-- where, again, all we did was change the way we allocate the objects; we're running the exact same computations-- when everything is neatly stored together in RAM, the algorithm takes about, I think, 150 milliseconds on the Pixel 3. So that's the no-thrash case.
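The benchmark source wasn't shown, but a minimal Kotlin sketch of the packed-versus-scattered setup just described might look like this (hypothetical names and sizes; on a compacting collector the scattered layout may not persist exactly, so treat it as illustrative only):

```kotlin
// Rough sketch of the locality experiment described above (hypothetical code,
// not the actual benchmark). The computation is identical in both runs; only
// the allocation pattern, and therefore the cache behavior, changes.
const val ARRAYS = 100_000

fun buildPacked(): List<FloatArray> =
    List(ARRAYS) { FloatArray(4) { i -> i.toFloat() } }     // allocated back to back

private val padding = mutableListOf<ByteArray>()             // kept alive on purpose

fun buildScattered(): List<FloatArray> =
    List(ARRAYS) {
        padding.add(ByteArray(64))                           // junk between our arrays
        FloatArray(4) { i -> i.toFloat() }
    }

fun sumAll(data: List<FloatArray>): Float {
    var acc = 0f
    for (a in data) acc += a[0] + a[1] + a[2] + a[3]         // same work in both cases
    return acc
}

fun time(label: String, data: List<FloatArray>) {
    val start = System.nanoTime()
    val result = sumAll(data)
    println("$label: ${(System.nanoTime() - start) / 1_000_000} ms (sum=$result)")
}

fun main() {
    time("packed", buildPacked())
    time("scattered", buildScattered())
}
```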
And when I space out the allocations so that we can't find the data we want in the L1, suddenly we're almost twice as slow. We're running the same exact computations, but the CPU is just busy going to fetch data in RAM. And if I space out the allocations even more, so that we can't find the data in the L2, now we are over five times slower. Again, same exact algorithm. So if you write benchmarks-- and that's very good, you should probably do that-- be very careful. Be aware of the fact that the numbers you're going to get in your benchmark may be very different than the numbers you're going to get in the actual app running on your users' devices. Really, you are benchmarking the memory access patterns. And with that, we're done. We have six seconds left.

CHET HAASE: Very important thing-- if you are interested in what we talked about today, there is a deeper version of this, as well as a lot of the runtime improvements in ART over time, presented by the engineers on the ART team. So please check that out. It's at the same time as the fireside chat, which I think is at 11:10. So please go see that talk as well. And that's it.

ROMAIN GUY: Thank you.

[APPLAUSE]

[MUSIC PLAYING]