Yes, I don't think I will live that down! Hi, how are you guys all doing? Good to be here? How is London? Are you liking it? Okay. So, let's talk about some - I'm kind of hoping
I'm going to dive deep into some technical stuff, and I hope you're going to enjoy it
with me here. Let's see. Slides don't work. Why not? That's me. I am now a tech lead manager on Angular. My main focus is the Angular framework, the core of it, which is all the bits that end up in the browser. I share that role together with Igor, who works on all the stuff that makes developers productive. The two of us are complementary. Let's talk about performance. The first thing we want to talk about is running
performance tests, and I always joke that with performance tests, you think you're just running the same tests over and over again. I want to show you a couple of things that can really surprise you. First, let's talk about inlining and opt and
de-opt. I'm going to make a simple benchmark. Which of these four lines is faster? How is the size? Shall I make it bigger? Oops. Is that good? So, which of these are faster? Now, I hope you agree with me that these are essentially identical functions that return different values, and because the value is zero, this side of the equation will always execute and always return zero. The value one, two, three, four does not even come into play in any way, and so it should not have any performance impact. All of these should execute exactly the same. How will we benchmark this? I have a trivial benchmark here where we grab the time, rerun the function some number of times, and work out exactly what the timing was.
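In code, the setup is roughly like this - a sketch with assumed names and values, not the exact code from the slides:

```ts
// A sketch of the kind of code being described (assumed names and values).
const value = 0;
// Four "identical" functions: the first branch always wins because value is 0,
// so the 1/2/3/4 never comes into play.
function benchmarkA() { return value === 0 ? 0 : 1; }
function benchmarkB() { return value === 0 ? 0 : 2; }
function benchmarkC() { return value === 0 ? 0 : 3; }
function benchmarkD() { return value === 0 ? 0 : 4; }

// A naive harness: grab the time, call the function in a loop, print the elapsed time.
function time(name: string, fn: () => number, iterations: number): void {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  console.log(name, Date.now() - start, 'ms');
}

const ITERATIONS = 100_000;
time('A', benchmarkA, ITERATIONS);
time('B', benchmarkB, ITERATIONS);
time('C', benchmarkC, ITERATIONS);
time('D', benchmarkD, ITERATIONS);
```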
When I run it with 100,000 iterations, they all seem about the same. Let's increase the number of iterations. As I increase it, they still show up about the same. I go even higher, and something weird happens: test case A is somehow faster than the rest of them, and, if you go even higher, it consistently shows up as being faster. What's going on? These are essentially identical functions. So I wanted to show you what actually happens
in the VM. First of all, and I'm sure you know this, but you can run node --v8-options. These are options on top of Node that only V8 understands. You can see there are about 800 lines of different options that you can pass in. I'm not going to go through all of them, but there are fun ones to look at like --trace-opt and --trace-deopt. We can run those and dump out a lot of interesting
text for you. We can go through that text and figure out
what actually is happening, but it turns out there is a better tool. Someone wrote this tool called deoptigate, and you can take that same piece of code and run it through it, and it opens it up in your browser. Opening in the default browser. Where is my default browser? Okay, great for the demo. What is going on? Seriously, where is it? It went into the wrong tab. So it produces a file that you can zoom into, and specifically, it places a red box right here. You can click on it. It tells you what it is. It says it de-optimised because of a particular
bail-out which is a wrong call target. Let's see if we can think through what actually
happened. So we started running benchmark A - this one right here - and the harness took this function, and as it was executing it over here, the
VM noticed "I keep calling the same function over and over again." If the counter gets too high, the VM wakes
up every millisecond or so, looks at the counters and says which of the functions looks like
a good candidate for optimisation? And, in this particular case, the VM wakes up, goes in, and says: it looks like what I can do is take benchmark A and inline it in this location to make it run faster. And that's exactly what happened. So, after the first benchmark runs for a while, A gets inlined over there, and then we invoke it with B. When we run it with B, what happens
is the VM says, "Oh, wait, I inlined A over here, and now you're calling me with B, so
I made an assumption that turns out to be wrong, and so now it does what is known as
a de-optimisation", which essentially undoes the thing it did before. Because it did a de-opt, the VM says: I now know this call site sometimes gets called with A and sometimes gets called with B, and therefore it probably should not inline here. When benchmark B runs, it no longer gets inlined. It's slower. You have the overhead of actually making the
call. So, it is very easy to make a benchmark where
you think you're actually profiling what is actually happening in your application, but
because real applications are much more complex, a lot of these optimisations get DeOpted in
the future. This is an example of how you can make - sorry, I will make this bigger - how you can make a very simple test case and then shoot yourself in the foot, because you're actually measuring the wrong thing unless you really understand what is happening inside of it. So, what is the fix for it? The fix is to run the same exact benchmark, but done differently. Instead of writing my benchmark library as
something that takes a function and then runs it in a loop, I flipped it around. I turned my benchmarking library into a simple call, and my benchmark is basically a while loop that keeps going until the benchmark says it's done, and inside that while loop I placed the thing I wanted to benchmark. By flipping it around, instead of the function under test being inlined into the benchmark, what happens is the exact opposite: the benchmarking code gets inlined into my function under test. And this way, we can make sure that we are measuring the right thing.
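The flipped structure looks roughly like this - again a sketch, with names I made up rather than the actual benchmarking library:

```ts
// A sketch of the flipped structure (names are assumptions, not the actual library).
// The harness is just a cheap "keep going?" check; the while loop lives inside the
// function under test, so the harness gets inlined into it, not the other way around.
class Benchmark {
  private iterations = 0;
  private start = Date.now();
  // Returns true while more samples are needed (here: run for ~500 ms).
  keepGoing(): boolean {
    this.iterations++;
    return Date.now() - this.start < 500;
  }
  report(name: string): void {
    console.log(name, (Date.now() - this.start) / this.iterations, 'ms per iteration');
  }
}

const value = 0;
function benchmarkA(bm: Benchmark): number {
  let result = 0;
  while (bm.keepGoing()) {          // the loop is inside the function under test
    result += value === 0 ? 0 : 1;  // the code we actually want to measure
  }
  return result;
}

const bm = new Benchmark();
benchmarkA(bm);
bm.report('A');
```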
So, if I run this particular benchmark, which is right here, it will show that A, B, C, and D all perform exactly the same. Now, when you're running a benchmark, you have a problem, and that is that you need to determine how many times you should iterate over something. If you just put a timestamp before you run a piece of code and right after it, and you don't iterate enough times through it, first of all, you might be looking at unoptimised code, and second, you might be looking at the fact that your timer has jitter. So you really need to execute the function for a certain amount of time, which means you need to know how many iterations to run for. So a good benchmarking library will not only
make sure that it doesn't fall for the trap of opt and de-opt, but also figure out how many iterations your function should run for before it measures the time difference, right? In this particular benchmark, what it does
is execute the function once and measure the time; if it is less than 100 milliseconds, it says "not enough" and increases the iteration count until it gets 100 milliseconds' worth
of data. Then what we do is you have two options: one
option is you could simply average all these measurements over a long period of time and
look at the statistical deviation, how much faster and slower it is, but we've noticed
if we do an average, the average is not as predictive as if you take the best-case scenario. The reason for that is, because when you are
benchmarking a particular function, other code is running in the background, the machine can be doing other things, the browser is doing garbage collection, and all of those slow down the benchmark that you're running. And so by running the benchmark repeatedly
and then selecting the fastest possible run, you're actually selecting the situation where
no garbage collection runs, no other application woke up, and so on and so forth, and the numbers you get in this particular case are a lot more predictive than any other way of measuring it. At least that's what we noticed.
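The calibration and "take the best run" idea, sketched in code (thresholds and names are assumptions; the real harness keeps the loop inside the function under test, as described above - this only illustrates the calibration and best-case selection):

```ts
// A rough sketch of the calibration and "take the best run" idea described above
// (thresholds and names are assumptions, not the actual library).
function measure(fn: () => void, samples = 20): number {
  // Grow the iteration count until one run takes at least ~100 ms, so timer
  // jitter and unoptimised warm-up code stop dominating the measurement.
  let iterations = 1;
  while (timeOnce(fn, iterations) < 100) {
    iterations *= 2;
  }
  // Take several samples and keep the best (fastest) one: that is the run
  // least disturbed by garbage collection or other background work.
  let best = Infinity;
  for (let i = 0; i < samples; i++) {
    best = Math.min(best, timeOnce(fn, iterations) / iterations);
  }
  return best; // milliseconds per iteration, best case
}

function timeOnce(fn: () => void, iterations: number): number {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  return Date.now() - start;
}
```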
We talked about inlining, we talked about opt and de-opt. One thing I would like you to take away from this is that if you really want to measure things, it's really useful to measure the best-case scenario. That number seems to be a lot more consistent than any other number you can get. Let's switch gears a little bit and talk about how the VM lays out objects. This is important because it will lead into how Ivy does things. Let's say we have an object such as name "John", age 42. It turns out that memory inside of your computer
is contiguous. It's one big array. That's what memory is, and the VM's job is to convert your object into an array. The way it does that is it breaks the object into two parts. One part is what is called a hidden class ID. It contains some header information, which we're going to ignore for a second, but it also contains key-value pairs which say the name property is at index offset two and the age property is at index offset three, such that if you look at the object itself, "John" is in the second position - this is how the VM breaks up your object. So all of the metadata about the object goes
into one location, and the values of the object go into another location. So if you have two objects, for example, "John" and "Mary", they will share the same hidden class ID because the shapes of the two objects are identical. So, as long as the shape of the object is the same - and this is what we mean by shape - we will share the metadata about the object
in the same location and then we will have the array which represents the object itself
stored the same way. So that means if you want to perform a read,
such as you can see over here, you can say object.name, you need to translate the name
string into the offset into the array. For languages such as Java and C++, this offset
calculation can be done in the compile time because the language is static, but for JavaScript,
the object look-up has to be done at run time because JavaScript is a lot more dynamic than
the static languages. And so whenever you do something like obj.name, what the VM internally does is something like getProperty. It knows that position zero of the object always contains the information about the object itself - the hidden class - so it can extract position zero, and then it passes in the key that it wants to look up. The function scans the list of possible keys, and it comes back with an answer, and the answer in this case should be two. And so then we can basically read the object at index 2 and we get back out "John". This is essentially how VMs work inside.
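Conceptually, you can picture it like this - a toy model of the idea, not V8's actual data structures:

```ts
// A toy model of what was just described - not V8's real data structures.
// The hidden class holds the key-to-offset mapping; the object itself is just an
// array whose slot 0 points back at its hidden class.
interface HiddenClass {
  offsets: { [key: string]: number };  // property name -> slot index in the object array
}
type VMObject = unknown[];  // slot 0: hidden class, slot 1: header (ignored), then values

const personClass: HiddenClass = { offsets: { name: 2, age: 3 } };
const john: VMObject = [personClass, /* header */ null, 'John', 42];
const mary: VMObject = [personClass, /* header */ null, 'Mary', 27];  // same shape, shared class

// The slow path: scan the hidden class to translate a key into a slot index.
function getProperty(obj: VMObject, key: string): unknown {
  const hiddenClass = obj[0] as HiddenClass;  // slot 0 is always the hidden class
  return obj[hiddenClass.offsets[key]];
}

console.log(getProperty(john, 'name'));  // 'John' - the offset the VM computed is 2
```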
So, if you have a more complicated situation, such as when you have object 1, which has name and age, and object 2, which has a title and a name, and the name in this case is in position two rather than position one, they can no longer share their hidden class IDs. They have to have separate hidden class IDs,
and they store the information about one shape of the object in one location, and the other
shape of the object in another location, and now the objects are no longer tied together
by their hidden class ID. So if you wanted to read the .name property off the object, as we can see over here, the line above says the object sometimes will be object 1 and sometimes object 2, depending on the expression. So when we call getProperty, sometimes it will return two and sometimes three, depending on which shape of the object we have. In both cases, we are going to get the name out of the object. So that is kind of what's going on. Let's look at how this performs. Okay, so I have a benchmark here. And what the benchmark will do is it will
create 10,000 objects, and it will create 10,000 different kinds of objects, and we will place them inside different kinds of arrays. The first array will be 10,000 items long, and what it will contain are objects of the same shape - value, prop0; value, prop0; and so on and so forth - 10,000 of these objects. We know that array 1 will contain objects that all share the same hidden class ID. For array 2, we will create a value, prop0 object, then a value, prop1 object, and then back to 0, 1, alternating back and forth like this. We know that array 2 will contain 10,000 objects, 5,000 of which will be shape 1 and 5,000 of which will be shape 2. The same for three and four; array 4 will contain four different kinds of objects, and so on, up to array 10,000, where there will be 10,000 different shapes of objects inside of it. The next thing we're going to do is try and
go and read. We know these objects have a value property
so we will try to read the value property out of every single object that we have, and
we're just going to add it to the sum. We know that in all cases the value property will be zero, so we don't have to worry about the sum overflowing and becoming a double, because
that also will cause a de-opt to happen. And so in this case, what we are going to do is just read the value property all the time, and we know that here we will have the same-shaped object in the array all the time, whereas in here, we will have two shapes, three shapes, and so on, and so forth.
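A sketch of that benchmark setup (property names are assumptions, not the exact code from the slides):

```ts
// A sketch of the shape benchmark described above (property names are assumptions).
// Every object has a `value` plus one extra property; which extra property it has
// determines its shape (hidden class).
function makeObject(shape: number): { value: number; [key: string]: number } {
  const obj: { value: number; [key: string]: number } = { value: 0 };
  obj['prop' + shape] = 0;   // prop0, prop1, ... - a different property per shape
  return obj;
}

// makeArray(1) -> 10,000 objects of one shape; makeArray(2) -> alternating between
// two shapes; ... makeArray(10000) -> every object has its own shape.
function makeArray(shapes: number, size = 10_000) {
  const array = [];
  for (let i = 0; i < size; i++) array.push(makeObject(i % shapes));
  return array;
}

// The code under test: read `value` out of every object and add it to a sum.
function sumValues(array: Array<{ value: number }>): number {
  let sum = 0;
  for (let i = 0; i < array.length; i++) sum += array[i].value;
  return sum;
}

const array1 = makeArray(1);           // every read sees the same shape
const array10000 = makeArray(10_000);  // every read sees a different shape
console.log(sumValues(array1), sumValues(array10000));
```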
Let's execute this particular test case and see how we do. In this case, the benchmarking library is trying to figure out how long the particular test should be run for, and then it runs it a number of times, takes the smallest possible value, and places that smallest value for us right here. Here is the result. Not surprisingly, if you have the same-shaped object all the time, you see the highest performance. You see that that particular function executes
once - or rather iterating over all 10,000 objects took 20 microseconds. That's our best-case scenario. As we increase the number of shapes of objects
we have in a system, we are getting slightly slower. One, two, three, four, notice, are pretty
much the same. Something funny happens at five. All of a sudden, we are about two and a half
times slower than we were before, and we kind of stay at two and a half times slower until we get to 10,000 shapes, and, wow, we're now 50 times slower than reading the objects initially. So the question is: what's going on? How can we debug this particular thing? So let's go back to what was going on with
our objects. Let's talk about it some more before I show
you how to debug this thing. What happens inside the VM internally is that we have to execute this piece of code right here, which says getProperty of the hidden class ID and "name", we look it up, and we get 2 back. It turns out that this property look-up is actually slow. But the VM, once it runs for a while, can collect meta information and produce what is called an inline cache. It says: if the hidden class ID you read at position zero of the object is this particular class ID - and this is a simple comparison, so it is extremely cheap - then we know the answer is two, because we ran it before in interpretative mode, and while we were running it in interpretative mode we collected all kinds of information about it. So we know that if the hidden class ID of this particular object happens to be this special one - because, in the past, we have seen this particular one to be very, very common through this execution path - then we can short-circuit the whole thing and answer it with two. Otherwise, we can't do that, and we have to call the getProperty function, the slow one, the one that requires us to scan the hidden class, and it takes some time. Doing this particular trick allows the VM
to generate code that is significantly faster at execution time. So this works in the case where the shape of the objects is fixed, so you know that we always have the same-shaped object coming through it. In this particular example, where we have two different kinds of shapes of objects, the VM can also collect information about it and say: you know what, while I was running in interpretative mode, I collected information about how this code executes, and I have realised that there are now two kinds of shapes that can come through this particular location. There is the object 1 shape and the object 2 shape. I know that if object 1 comes across, then the name property is in location 2, but if object 2 comes across, the name property is in location 3; otherwise I give up and I go the slow route and look it up.
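In pseudo-code, the generated inline cache is effectively doing something like this (a conceptual sketch, not V8's actual generated code):

```ts
// A conceptual sketch of a polymorphic inline cache for reading `name` - roughly
// what the generated code does; not V8's actual implementation. Reusing the toy
// model from before: slot 0 of each object is its hidden class.
const object1HiddenClass = { offsets: { name: 2, age: 3 } };
const object2HiddenClass = { offsets: { title: 2, name: 3 } };

function readNameSlow(obj: unknown[]): unknown {
  // The generic, slow path: scan the hidden class for the offset.
  const hiddenClass = obj[0] as { offsets: Record<string, number> };
  return obj[hiddenClass.offsets['name']];
}

function readNameWithIC(obj: unknown[]): unknown {
  const hiddenClass = obj[0];
  if (hiddenClass === object1HiddenClass) return obj[2];  // shape 1: name at offset 2
  if (hiddenClass === object2HiddenClass) return obj[3];  // shape 2: name at offset 3
  return readNameSlow(obj);  // too many shapes seen: fall back to the slow path
}
```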
So these are inline caches, and VMs are willing to do this particular trick up to four times - different VMs can choose different numbers, but typically they do it up to four times. Because they can do it up to four times, we can explain what is going on here. The first four lines are inline cache hits,
right? So, for the first case, the VM only had to
compare one hidden class ID against it, and it was always correct. And that's why that particular case runs the fastest. In the second case, the VM had to compare two different values, but it always got an object of one shape or the other, and so the comparison was a little more complicated, but it's still relatively
fast. You can see the same thing happening for three,
slightly slower, four, slightly slower, and at five, the VM gives up and says: you know what, if you have this many shapes, I'm just going to call getProperty, right? That's when it gets slower. However, why is it that 5,000 shapes was still relatively fast, but at 10,000 it's really slow? There's something else going on in here that
we can't explain. I want to show you a couple of tools by which
we can explore this. So the first thing is something called --trace-ic - IC means inline cache - and this is going to generate a log file for us that we can explore. In order to look at this log file, you have to install something called V8 Tools. The instructions on how to install that are actually in this repo, in a slide that I will show you at the end. If you open the thing, you can load the log file ... oh, come on! Sorry, I'm looking at the wrong one. It's the IC processor. That's what is happening. Let's open it up. We can now look at it by function - sorry, not by function but by file position. This is our code. It has flagged all the locations. We are going to look at the bottom one first. This is at location 44, so to cross-reference this with
our source code, 44 is this one here. We're talking about this value over here,
so this is line 44, column - it doesn't show my column numbers - column 33. And here is what it says. It was trying to read "value" - this is the property that it's trying to read - and it says that V8 calls these object maps, so it says: I saw two different kinds of objects come through, and these are their hidden class IDs. The dot means "I was collecting information about these objects", and then, after a while, the VM says: you know, I see that there is one shape - this is called a transition - and it says I saw this shape of object before, so I'm going to record its hidden class ID. Then there is another transition where it goes from what is known as a monomorphic look-up to a polymorphic look-up. The VM is able to generate four of these things.
As we go up - so that was line 44; line 61 is - sorry, that's the benchmark. The next one up is line 51 of example 3, and here the VM saw up to four - sorry, three different kinds of maps, so it transitioned from the monomorphic state to polymorphic to polymorphic. It's willing to do up to four of these states, and so everything still works well. As you go up to the last one, to the
top, here is what it shows you. In this particular location, it has seen 10,000 different shapes, and it has transitioned from monomorphic to polymorphic, then polymorphic a few more times, and ended up megamorphic - most of the execution hits that happened at this location, about 100 per cent of them, were in the megamorphic state. This only repeats what we already know. It turns out that while this tool is pretty cool,
there is a better way of looking at this. There is another tool out there called deoptigate. Deoptigate will present the same information, but in a more human-readable way. Okay, so this is the tool we executed. If we look at our example 3 code, what the
VM did is that it placed these markers in here. First, notice the green icon means that this property read is monomorphic, so the best kind of property read you can have. In other words, every time the VM executed this piece of code, the object shape underneath was always the same, and therefore it could just short-circuit the whole thing and read the property directly. The blue ones are polymorphic. It is showing you that there were two different
shapes of object that went through this location. While 2 is not ideal, it's still pretty good
because it can still do inline caching. Now, notice when we transition to array 5
with five different shapes of objects, it becomes red, saying your system has transitioned into the megamorphic state and we're no longer executing the code in the highest-performance way possible. So now you have an explanation and you know how to use these tools to understand your code, but we still haven't explained why there's
such a sudden drop at the 10,000 level. I executed the code one more time, this time with another option called --prof, the profiler, and we can open it up in a profiling tool that V8 has. We get this kind of a graph. You can click on function 1. Benchmark 1 has the same-shaped object, and
you can see it's all green. Notice at the very, very beginning, those
are when the VM is running in the interpretative mode, collecting information about your code,
and once the interpretative mode is kind of finished, the VM generates actual assembly
instructions that it then executes. The green means it's running those generated assembly instructions, and those are fast. All of these functions begin with a little
bit of colour there at the beginning that says, "I'm compiling or collecting some metadata"
and then all of the green stuff says, "I'm running in the most efficient possible manner." You can see that one, two, three, and four,
they're all very efficient here, but, in five, we now have a whole bunch of yellow that happens
in here. Let's highlight this piece of code over here and do a bottom-up view. You can see that about 40 per cent of the time, we were basically in the light green area - the generated code - and that's our function, line 62, 60 per cent of the time in generated code. The yellow means code that the VM generates which is actually dynamic. So it is still generated code, but it's suboptimal in the sense that it doesn't execute as fast as purely green code. So there is some of that yellow right here.
We can see that, as we go through all of these benchmarks, we have a little bit of yellow. This is when we are two and a half times slower, and it comes from the yellow area, and it stays that way all the way up until we hit 10,000 shapes. When we hit 10,000 shapes, we select this bit of code here, and you can see we have a whole bunch of blue. What is going on here? If you zoom in, we have C++ code here now,
and it says runtime LoadIC miss - the yellow said LoadIC. The yellow is our getProperty function. First, it tries to go into what is called a megamorphic cache, a cache that is populated as the code runs, and it can resolve the property look-up not as fast as an inline cache, but still relatively fast. That's the two-and-a-half-times slowdown that you see while the number of shapes is still relatively small. Because we have generated 10,000 different shapes in this location, that's more than the number of items that the cache can hold. When we iterate through 10,000 objects, and
we have 10,000 different objects, we are sure to be evicting cache entries from the cache
and replacing them with new ones, but we never get to use that new entry, because we still have other shapes to read; by the time we read an object of that shape again, we've already overwritten the cache entry with something else, and that's why we get a whole bunch of cache misses here. Okay, so those are some cool tools by which
you can understand how the VMs do things internally. Let's look at Ivy and how we take this information
and use it to make ourselves fast. So now, imagine you have two components. Let's say we have Hello and you have MyApp. MyApp instance eighties a Hello component
and inside of its content, it puts the bold "world" and the hello component simply says
"hello" and redirects the content. The resulting DOM tree that you have is on
The resulting DOM tree that you get is: hello, span, world, end span, hello. Now, while the tree on the left is what we render, the tree on the right is what we care about. This is what we call the logical tree. The way to think about it is not that the world is a child of the span; a better way to think about it is that the world is a child of hello, and there is a sibling to it which is the content view of hello, which then has the content projection. We have an arrow like that. The reason why this particular tree is important
- so, right, we call the tree on the left the render tree; the tree on the right, we
call the logical tree. The reason we care about this tree is, because
if you look at how Angular actually resolves things - injectors, look-ups, parents - it's all done on the logical tree, never on the render tree. Furthermore, we have to store additional information in there, such as the injector, the component, the directives, the bindings, the listeners, the clean-up work, the pipes, and all of this information has to get stored on the logical tree so we can execute on it. Now, one way to do that would be to
create what we call an L node or logical node, which would have pointers to parent, children,
the next injector, bindings, pipes, styles, directives, and so on, and most of these things
would be blank most of the time. You know, because most nodes don't necessarily
have an injector. Most don't have pipes, and so on. This is a very inefficient way of storing
this particular thing. So the good thing about this way of storing it is that it's simple and it's super maintainable. The con is that it is very sparse. Because it is sparse, it is memory-intensive, and being memory-intensive means you will have a hard time running it on your mobile devices. The other thing is that it duplicates
information across templates. If you have a template which is inside of
*ngFor and you execute the template over and over again, the parents, siblings, and children would be replicated over and over and over again. We will have the same information spread across
multiple different things. Finally, it suffers from what is known as poor cache locality, which basically means that, because the information is sparse, when you cause a cache miss and the CPU loads memory into its cache, it loads something called a cache line, which is essentially 128 words of information, and, if things are sparse, the chances are that the thing you've just loaded will have nulls everywhere you don't need, and so, because of that, when you try to read something related, you will most likely cause another cache miss rather than reading it from the data you have just loaded. What we can do is rearrange the data a little
bit. We can use the same exact trick that the VM
does at the Ivy level as well. Instead of storing the sparse data in the
object, we can store all of our data in an array. We will define a new thing called an L view. An L view is just an array. Now you have the problem that you need to know where inside the L view to look for the data. So instead of having an L node - a logical node - we have a T node, which is shared across all instances of the template. So, if you're inside of *ngFor, we would only generate one T node no matter how many times it unrolls your template. You could have an *ngFor that unrolls your template
a thousand times, but there will only be one T node describing how it got unrolled. The advantage of having everything inside of the array is that now we can compact everything into as small a space as possible. We don't store the information we don't need. We don't have to say that this component doesn't have any directives or any components; we look at the T node, and the T node tells us that the binding points to negative one, so we don't bother looking inside of the array.
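Very roughly, you can picture the arrangement like this (a heavily simplified sketch; the real Ivy data structures have more fields and a different layout):

```ts
// A heavily simplified sketch of the idea - the real Ivy data structures have many
// more fields and different layouts. It only illustrates "shared static metadata
// (T node) + compact per-instance array (L view)".
interface TNode {
  // Where in the L view this node's binding value lives; -1 means "no binding",
  // so we never even look into the array for it.
  bindingIndex: number;
  // Index of the parent T node; -1 for the root.
  parentIndex: number;
}

// One T node per template position, shared by every instance of the template...
const tNodes: TNode[] = [
  { bindingIndex: -1, parentIndex: -1 },
  { bindingIndex: 2, parentIndex: 0 },
];

// ...and one compact L view (just an array) per template *instance*. An *ngFor
// that stamps the template 1,000 times creates 1,000 L views but still only one
// set of T nodes.
type LView = unknown[];
const lViewForRowA: LView = [/* element refs, etc. */ null, null, 'binding value for row A'];
const lViewForRowB: LView = [null, null, 'binding value for row B'];
```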
This is all nice. But the con of this is that it is hard to debug and understand what's going on. Because it's hard to debug and understand, it really could complicate the way open-source contributions would have to be done, because now you can't just look at the code; it becomes much more complicated. And finally, it becomes hard to profile, because if everything's an array, then it's hard to separate which object is which - which array
am I looking at? We're trying to mitigate all of these things. And the way we mitigate is by this mode we
call ngDevMode, which is a flag that can be either true or false. If it is true, we execute a whole bunch of extra code and decorate the objects with additional information that allows us to simplify all of these things. So, instead of the L view having to be just an array, we actually subclass the array into an L view and put a debug property on it. All of the arrays not only have the regular
information that you have, but they also have additional debug information that allows us
to see it in a more unrolled, friendly manner and explore what's going on, and that makes it much easier to understand how Ivy works internally and makes it a lot more approachable.
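As a very loose sketch of that pattern (not Ivy's actual classes or field names):

```ts
// A very loose sketch of the dev-mode pattern described above - not Ivy's actual
// classes or field names. In dev mode the raw array gets a `debug` view that
// presents the same data in a human-readable form; in production the extra code
// is stripped out.
declare const ngDevMode: boolean;  // global flag, false in production builds

class LViewDebug {
  constructor(private lView: unknown[]) {}
  get bindings() {
    // Unpack the flat array into something readable in the console.
    return this.lView.slice(2).map((value, i) => ({ slot: i + 2, value }));
  }
}

function createLView(): unknown[] {
  const lView: unknown[] = [];
  if (ngDevMode) {
    // Only attached in dev mode, so production builds pay no size or speed cost.
    (lView as any).debug = new LViewDebug(lView);
  }
  return lView;
}
```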
This particular approach is really the best of both worlds. We can be super compact but at the same time produce the metadata for the developer so they can understand what is going on. It makes it easier to debug, easier to profile, and also fundamentally easier to contribute, because we really want all of you to help us make Angular better. This extra code has an impact on the amount of code that ships, which is why, in the production build, when ngDevMode is false, all of the ngDevMode code is removed. There's only extra code there if you're in development mode, and there is some speed cost to it, but only in development mode. Once you do a production build, all of that stuff is removed and it becomes much simpler. So, let's see if I can do a little demo here. So I have a very simple "Hello, World" which is styled, and it also prints the letters of the alphabet, so it's pretty simple, straightforward
stuff. One thing I want to show you is that Igor
already talked about this: we can select any one of the DOM elements. Let's say we click on a span over here; we have this ng property that exists in dev mode. There are all kinds of useful methods on there, like getComponent and the other helpers, so if I pass in the selected node, I can see that I'm going to get my application, I can look at all the properties, and I can explore it. So that's one way in which we make sure that development under Ivy is going to be much simpler.
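In the DevTools console that looks roughly like this (ng is the dev-mode debug global and $0 is the element selected in the Elements panel):

```ts
// A sketch of the dev-mode console workflow being described. `ng` and `$0` are
// provided by the browser environment: `ng` is Angular's dev-mode debug global,
// `$0` is the element currently selected in the Elements panel.
declare const ng: { getComponent(el: Element): unknown };
declare const $0: Element;

const component = ng.getComponent($0);  // the component instance that owns the element
console.log(component);                 // explore its properties from here
```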
The other thing that I wanted to show you is that these objects have the L view on them, so we can see that it says L view right here, and this L view is just an array, right? That makes it super difficult to figure out what's going on over here, but luckily we have this debug property, and it gives us the same exact information in a much more readable view. For example, we can ask what kind of views
do we have? So we can look at child views. And we can see that this view has one child, which is a container; we can go into the container and look at its views,
and we can see that that container has 26 sub views because this is an *ngFor, one for
each letter, right? If I select one of these views, I can look
at nodes, and I can see that that node refers to - if you hover over it, it highlights,
like you can see here. And I can also see what styling, et cetera,
it has. For example, if I go and look at the nodes
at the top, I can see that I have a span node, one of the root nodes, and the span happens
to have all kinds of bindings, and so I can look at the styles, and I can see how the
style bindings have combined together. How do I get to my presentation? Here we go. In summary: first, we really are very careful about making sure that everything in Ivy is monomorphic. We use the same tools that I showed you to look at our code and make sure that all the code shows up with monomorphic property reads, and that is the first big win. It has two benefits. First, it makes us fast, but second, it doesn't pollute the global cache for megamorphic property reads, and not polluting the megamorphic property read cache means that if the application code uses megamorphic properties, we don't evict things and make the application unintentionally slower. We always try to iterate over arrays rather
than object keys, because working with object keys is expensive whereas working with arrays
is super cheap. We are careful about function inlining, like I showed you at the beginning, because it's easy to end up measuring something that is faster than it would be in reality. We store data in these arrays for speed as well as for space efficiency. And we have lots of focused tests that are basically built on the same benchmarking setup that I was demonstrating here, which allows us to isolate things individually and reproduce what is going on. Once we have these tests, we can execute them
inside the same set of tools that I showed you, and figure out what is going on inside
of it. With that, I will close it. Thank you!