Yes, I don't think I will live that down! Hi, how are you guys all doing? Good to be here? How is London? Are you liking it? Okay. So, let's talk about some - I'm kind of hoping
I'm going to dive deep into some technical stuff, and I hope you're going to enjoy it
with me here. Let's see. Slides don't work. Why not? That's me. I am now a tech lead manager on Angular. My main focus is the Angular framework, the core of it, which is all the bits that end up in the browser. I share that role together with Igor, who works on all the stuff that makes developers productive. The two of us are complementary. Let's talk about performance. The first thing we want to talk about is running
performance tests, and I always joke that with performance tests, you think you're just running the same tests over and over again. I want to show you a couple of things that can really surprise you. First, let's talk about inlining and opt and
de-opt. I'm going to make a simple benchmark. Which of these four lines is faster? How is the size? Shall I make it bigger? Oops. Is that good? So, which of these are faster? Now, I hope you agree with me that these are essentially identical functions that return different values, and because the value is zero, this side of the equation will always execute and always return zero. The value one, two, three, four does not even come into play in any way, and so it should not have any performance impact. All of these should execute exactly the same. How will we benchmark this? I have a trivial benchmark here where we grab the time, rerun the function some number of times, and work out exactly what the timing was.
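In code, the setup is roughly like this - a sketch with assumed names and values, not the exact code from the slides:

```ts
// A sketch of the kind of code being described (assumed names and values).
const value = 0;
// Four "identical" functions: the first branch always wins because value is 0,
// so the 1/2/3/4 never comes into play.
function benchmarkA() { return value === 0 ? 0 : 1; }
function benchmarkB() { return value === 0 ? 0 : 2; }
function benchmarkC() { return value === 0 ? 0 : 3; }
function benchmarkD() { return value === 0 ? 0 : 4; }

// A naive harness: grab the time, call the function in a loop, print the elapsed time.
function time(name: string, fn: () => number, iterations: number): void {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  console.log(name, Date.now() - start, 'ms');
}

const ITERATIONS = 100_000;
time('A', benchmarkA, ITERATIONS);
time('B', benchmarkB, ITERATIONS);
time('C', benchmarkC, ITERATIONS);
time('D', benchmarkD, ITERATIONS);
```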
When I run it with 100,000 iterations, they all seem about the same. Let's increase the number of iterations. As I increase it, they still show up about the same. I go even higher, and something weird happens: test case A is somehow faster than the rest of them, and, if you go even higher, it consistently shows up as being faster. What's going on? These are essentially identical functions. So I wanted to show you what actually happens
in the VM. First of all, and I'm sure you know this, but you can run node --v8-options. These are options on top of Node that only V8 understands. You can see there are about 800 lines of different options that you can pass in. I'm not going to go through all of them, but there are fun ones to look at like --trace-opt and --trace-deopt. We can run those and dump out a lot of interesting
text for you. We can go through that text and figure out
what actually is happening, but it turns out there is a better tool. Someone wrote this tool called deoptigate, and you can take that same piece of code and run it through it, and it opens it up in your browser. Opening in the default browser. Where is my default browser? Okay, great for the demo. What is going on? Seriously, where is it? It went into the wrong tab. So it produces a file that you can zoom into, and specifically, it places a red box right here. You can click on it. It tells you what it is. It says it de-optimised because of a particular
bail-out which is a wrong call target. Let's see if we can think through what actually
happened. So we started running benchmark A - this one right here - and the harness took this function, and as it was executing it over here, the
VM noticed "I keep calling the same function over and over again." If the counter gets too high, the VM wakes
up every millisecond or so, looks at the counters and says which of the functions looks like
a good candidate for optimisation? And, in this particular case, the VM wakes up, goes in, and says: it looks like what I can do is take benchmark A and inline it in this location to make it run faster. And that's exactly what happened. So, after the first benchmark runs for a while, A gets inlined over there, and then we invoke it with B. When we run it with B, what happens
is the VM says, "Oh, wait, I inlined A over here, and now you're calling me with B, so
I made an assumption that turns out to be wrong, and so now it does what is known as
a de-optimisation", which essentially undoes the thing it did before. Because it did a de-opt, the VM says: I now know this call site sometimes gets called with A and sometimes gets called with B, and therefore it probably should not inline here. When benchmark B runs, it no longer gets inlined. It's slower. You have the overhead of actually making the
call. So, it is very easy to make a benchmark where
you think you're actually profiling what is actually happening in your application, but
because real applications are much more complex, a lot of these optimisations get DeOpted in
the future. This is an example of how you can make - sorry, I will make this bigger - how you can make a very simple test case and then shoot yourself in the foot, because you're actually measuring the wrong thing unless you really understand what is happening inside of it. So, what is the fix for it? The fix is to run the same exact benchmark, but done differently. Instead of writing my benchmark library as
something that takes a function and then runs it in a loop, I flipped it around. I turned my benchmarking library into a simple call, and my benchmark is basically a while loop that keeps going until the benchmark says it's done, and inside that while loop I placed the thing I wanted to benchmark. By flipping it around, instead of the function under test being inlined into the benchmark, what happens is the exact opposite: the benchmarking code gets inlined into my function under test. And this way, we can make sure that we are measuring the right thing.
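The flipped structure looks roughly like this - again a sketch, with names I made up rather than the actual benchmarking library:

```ts
// A sketch of the flipped structure (names are assumptions, not the actual library).
// The harness is just a cheap "keep going?" check; the while loop lives inside the
// function under test, so the harness gets inlined into it, not the other way around.
class Benchmark {
  private iterations = 0;
  private start = Date.now();
  // Returns true while more samples are needed (here: run for ~500 ms).
  keepGoing(): boolean {
    this.iterations++;
    return Date.now() - this.start < 500;
  }
  report(name: string): void {
    console.log(name, (Date.now() - this.start) / this.iterations, 'ms per iteration');
  }
}

const value = 0;
function benchmarkA(bm: Benchmark): number {
  let result = 0;
  while (bm.keepGoing()) {          // the loop is inside the function under test
    result += value === 0 ? 0 : 1;  // the code we actually want to measure
  }
  return result;
}

const bm = new Benchmark();
benchmarkA(bm);
bm.report('A');
```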
So, if I run this particular benchmark, which is right here, it will show that A, B, C, and D all perform exactly the same. Now, when you're running a benchmark, you have a problem, and that is that you need to determine how many times you should iterate over something. If you just put a timestamp before you run a piece of code and right after it, and you don't iterate enough times through it, first of all, you might be looking at unoptimised code, and second, you might be looking at the fact that your timer has jitter. So you really need to execute the function for a certain amount of time, which means you need to know how many iterations to run for. So a good benchmarking library will not only
make sure that it doesn't fall for the trap of opt and de-opt, but also figure out how many iterations your function should run for before it measures the time difference, right? In this particular benchmark, what it does
is execute the function once and measure the time; if it is less than 100 milliseconds, it says "not enough" and increases the iteration count until it gets 100 milliseconds' worth
of data. Then what we do is you have two options: one
option is you could simply average all these measurements over a long period of time and
look at the statistical deviation, how much faster and slower it is, but we've noticed
if we do an average, the average is not as predictive as if you take the best-case scenario. The reason for that is, because when you are
benchmarking a particular function, other code is running in the background, the machine can be doing other things, the browser is doing garbage collection, and all of those slow down the benchmark that you're running. And so by running the benchmark repeatedly
and then selecting the fastest possible run, you're actually selecting the situation where
no garbage collection runs, no other application woke up, and so on and so forth, and the numbers you get in this particular case are a lot more predictive than any other way of measuring it. At least that's what we noticed.
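The calibration and "take the best run" idea, sketched in code (thresholds and names are assumptions; the real harness keeps the loop inside the function under test, as described above - this only illustrates the calibration and best-case selection):

```ts
// A rough sketch of the calibration and "take the best run" idea described above
// (thresholds and names are assumptions, not the actual library).
function measure(fn: () => void, samples = 20): number {
  // Grow the iteration count until one run takes at least ~100 ms, so timer
  // jitter and unoptimised warm-up code stop dominating the measurement.
  let iterations = 1;
  while (timeOnce(fn, iterations) < 100) {
    iterations *= 2;
  }
  // Take several samples and keep the best (fastest) one: that is the run
  // least disturbed by garbage collection or other background work.
  let best = Infinity;
  for (let i = 0; i < samples; i++) {
    best = Math.min(best, timeOnce(fn, iterations) / iterations);
  }
  return best; // milliseconds per iteration, best case
}

function timeOnce(fn: () => void, iterations: number): number {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  return Date.now() - start;
}
```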
We talked about inlining, we talked about opt and de-opt. One thing I would like you to take away from this is that if you really want to measure things, it's really useful to measure the best-case scenario. That number seems to be a lot more consistent than any other number you can get. Let's switch gears a little bit and talk about how the VM lays out objects. This is important because it will lead into how Ivy does things. Let's say we have an object such as name "John", age 42. It turns out that memory inside of your computer
is contiguous. It's one big array. That's what memory is, and the VM's job is to convert your object into an array. The way it does that is it breaks the object into two parts. One part is what is called a hidden class ID. It contains some header information, which we're going to ignore for a second, but it also contains key-value pairs which say the name property is at index offset two and the age property is at index offset three, such that if you look at the object itself, "John" is in the second position - this is how the VM breaks up your object. So all of the metadata about the object goes
into one location, and the values of the object go into another location. So if you have two objects, for example, "John" and "Mary", they will share the same hidden class ID because the shapes of the two objects are identical. So, as long as the shape of the object is the same - and this is what we mean by shape - we will share the metadata about the object
in the same location and then we will have the array which represents the object itself
stored the same way. So that means if you want to perform a read,
such as you can see over here, you can say object.name, you need to translate the name
string into the offset into the array. For languages such as Java and C++, this offset
calculation can be done in the compile time because the language is static, but for JavaScript,
the object look-up has to be done at run time because JavaScript is a lot more dynamic than
the static languages. And so whenever you do something like obj.name, what the VM internally does is something like getProperty. It knows that position zero of the object always contains the information about the object itself - the hidden class - so it can extract position zero, and then it passes in the key that it wants to look up. The function scans the list of possible keys, and it comes back with an answer, and the answer in this case should be two. And so then we can basically read the object at index 2 and we get back out "John". This is essentially how VMs work inside.
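Conceptually, you can picture it like this - a toy model of the idea, not V8's actual data structures:

```ts
// A toy model of what was just described - not V8's real data structures.
// The hidden class holds the key-to-offset mapping; the object itself is just an
// array whose slot 0 points back at its hidden class.
interface HiddenClass {
  offsets: { [key: string]: number };  // property name -> slot index in the object array
}
type VMObject = unknown[];  // slot 0: hidden class, slot 1: header (ignored), then values

const personClass: HiddenClass = { offsets: { name: 2, age: 3 } };
const john: VMObject = [personClass, /* header */ null, 'John', 42];
const mary: VMObject = [personClass, /* header */ null, 'Mary', 27];  // same shape, shared class

// The slow path: scan the hidden class to translate a key into a slot index.
function getProperty(obj: VMObject, key: string): unknown {
  const hiddenClass = obj[0] as HiddenClass;  // slot 0 is always the hidden class
  return obj[hiddenClass.offsets[key]];
}

console.log(getProperty(john, 'name'));  // 'John' - the offset the VM computed is 2
```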
So, if you have a more complicated situation, such as when you have object 1, which has name and age, and object 2, which has a title and a name, and the name in this case is in position two rather than position one, they can no longer share their hidden class IDs. They have to have separate hidden class IDs,
and they store the information about one shape of the object in one location, and the other
shape of the object in another location, and now the objects are no longer tied together
by their hidden class ID. So if you wanted to read the .name property off the object, as we can see over here, the line above says the object sometimes will be object 1 and sometimes object 2, depending on the expression. So when we call getProperty, sometimes it will return two and sometimes three, depending on which shape of the object we have. In both cases, we are going to get the name out of the object. So that is kind of what's going on. Let's look at how this performs. Okay, so I have a benchmark here. And what the benchmark will do is it will
create 10,000 objects, and it will create 10,000 different kinds of objects, and we will place them inside different kinds of arrays. The first array will be 10,000 items long, and what it will contain are objects of the same shape - value, prop0; value, prop0; and so on and so forth - 10,000 of these objects. We know that array 1 will contain objects that all share the same hidden class ID. For array 2, we will create a value, prop0 object, then a value, prop1 object, and then back to 0, 1, alternating back and forth like this. We know that array 2 will contain 10,000 objects, 5,000 of which will be shape 1 and 5,000 of which will be shape 2. The same for three and four; array 4 will contain four different kinds of objects, and so on, up to array 10,000, where there will be 10,000 different shapes of objects inside of it. The next thing we're going to do is try and
go and read. We know these objects have a value property
so we will try to read the value property out of every single object that we have, and
we're just going to add it to the sum. We know that in all cases the value property will be zero, so we don't have to worry about the sum overflowing and becoming a double, because
that also will cause a de-opt to happen. And so in this case, what we are going to do is just read the value property all the time, and we know that here we will have the same-shaped object in the array all the time, whereas in here, we will have two shapes, three shapes, and so on, and so forth.
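A sketch of that benchmark setup (property names are assumptions, not the exact code from the slides):

```ts
// A sketch of the shape benchmark described above (property names are assumptions).
// Every object has a `value` plus one extra property; which extra property it has
// determines its shape (hidden class).
function makeObject(shape: number): { value: number; [key: string]: number } {
  const obj: { value: number; [key: string]: number } = { value: 0 };
  obj['prop' + shape] = 0;   // prop0, prop1, ... - a different property per shape
  return obj;
}

// makeArray(1) -> 10,000 objects of one shape; makeArray(2) -> alternating between
// two shapes; ... makeArray(10000) -> every object has its own shape.
function makeArray(shapes: number, size = 10_000) {
  const array = [];
  for (let i = 0; i < size; i++) array.push(makeObject(i % shapes));
  return array;
}

// The code under test: read `value` out of every object and add it to a sum.
function sumValues(array: Array<{ value: number }>): number {
  let sum = 0;
  for (let i = 0; i < array.length; i++) sum += array[i].value;
  return sum;
}

const array1 = makeArray(1);           // every read sees the same shape
const array10000 = makeArray(10_000);  // every read sees a different shape
console.log(sumValues(array1), sumValues(array10000));
```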
Let's execute this particular test case and see how we do. In this case, the benchmarking library is trying to figure out how long the particular test should be run for, and then it runs it a number of times, takes the smallest possible value, and places that smallest value for us right here. Here is the result. Not surprisingly, if you have the same-shaped object all the time, you see the highest performance. You see that that particular function executes
once - or rather iterating over all 10,000 objects took 20 microseconds. That's our best-case scenario. As we increase the number of shapes of objects
we have in a system, we are getting slightly slower. One, two, three, four, notice, are pretty
much the same. Something funny happens at five. All of a sudden, we are about two and a half
times slower than we were before, and we kind of stay at two and a half times slower until we get to 10,000 shapes, and, wow, we're now 50 times slower than reading the objects initially. So the question is: what's going on? How can we debug this particular thing? So let's go back to what was going on with
our objects. Let's talk about it some more before I show
you how to debug this thing. What happens inside the VM internally is that we have to execute this piece of code right here, which says getProperty of the hidden class ID and "name", we look it up, and we get 2 back. It turns out that this property look-up is actually slow. But the VM, once it runs for a while, can collect meta information and produce what is called an inline cache. It says: if the hidden class ID you read at position zero of the object is this particular class ID - and this is a simple comparison, so it is extremely cheap - then we know the answer is two, because we ran it before in interpretative mode, and while we were running it in interpretative mode we collected all kinds of information about it. So we know that if the hidden class ID of this particular object happens to be this special one - because, in the past, we have seen this particular one to be very, very common through this execution path - then we can short-circuit the whole thing and answer it with two. Otherwise, we can't do that, and we have to call the getProperty function, the slow one, the one that requires us to scan the hidden class, and it takes some time. Doing this particular trick allows the VM
to generate code that is significantly faster at execution time. So this works in the case where the shape of the objects is fixed, so you know that we always have the same-shaped object coming through it. In this particular example, where we have two different kinds of shapes of objects, the VM can also collect information about it and say: you know what, while I was running in interpretative mode, I collected information about how this code executes, and I have realised that there are now two kinds of shapes that can come through this particular location. There is the object 1 shape and the object 2 shape. I know that if object 1 comes across, then the name property is in location 2, but if object 2 comes across, the name property is in location 3; otherwise I give up and I go the slow route and look it up.
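In pseudo-code, the generated inline cache is effectively doing something like this (a conceptual sketch, not V8's actual generated code):

```ts
// A conceptual sketch of a polymorphic inline cache for reading `name` - roughly
// what the generated code does; not V8's actual implementation. Reusing the toy
// model from before: slot 0 of each object is its hidden class.
const object1HiddenClass = { offsets: { name: 2, age: 3 } };
const object2HiddenClass = { offsets: { title: 2, name: 3 } };

function readNameSlow(obj: unknown[]): unknown {
  // The generic, slow path: scan the hidden class for the offset.
  const hiddenClass = obj[0] as { offsets: Record<string, number> };
  return obj[hiddenClass.offsets['name']];
}

function readNameWithIC(obj: unknown[]): unknown {
  const hiddenClass = obj[0];
  if (hiddenClass === object1HiddenClass) return obj[2];  // shape 1: name at offset 2
  if (hiddenClass === object2HiddenClass) return obj[3];  // shape 2: name at offset 3
  return readNameSlow(obj);  // too many shapes seen: fall back to the slow path
}
```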
So these are inline caches, and VMs are willing to do this particular trick up to four times - different VMs can choose different numbers, but typically they do it up to four times. Because they can do it up to four times, we can explain what is going on here. The first four lines are inline cache hits,
right? So, for the first case, the VM only had to
compare one hidden class ID against it, and it was always correct. And that's why that particular case runs the fastest. In the second case, the VM had to compare two different values, but it always got an object of one shape or the other, and so the comparison was a little more complicated, but it's still relatively
fast. You can see the same thing happening for three,
slightly slower, four, slightly slower, and at five, the VM gives up and says: you know what, if you have this many shapes, I'm just going to call getProperty, right? That's when it gets slower. However, why is it that 5,000 shapes was still relatively fast, but at 10,000 it's really slow? There's something else going on in here that
we can't explain. I want to show you a couple of tools by which
we can explore this. So the first thing is something called --trace-ic - IC means inline cache - and this is going to generate a log file for us that we can explore. In order to look at this log file, you have to install something called V8 Tools. The instructions on how to install that are actually in this repo, in a slide that I will show you at the end. If you open the thing, you can load the log file ... oh, come on! Sorry, I'm looking at the wrong one. It's the IC processor. That's what is happening. Let's open it up. We can now look at it by function - sorry, not by function but by file position. This is our code. It has flagged all the locations. We are going to look at the bottom one first. This is at location 44, so to cross-reference this with
our source code, 44 is this one here. We're talking about this value over here,
so this is line 44, column - it doesn't show my column numbers - column 33. And here is what it says. It was trying to read "value" - this is the property that it's trying to read - and it says that V8 calls these object maps, so it says: I saw two different kinds of objects come through, and these are their hidden class IDs. The dot means "I was collecting information about these objects", and then, after a while, the VM says: you know, I see that there is one shape - this is called a transition - and it says I saw this shape of object before, so I'm going to record its hidden class ID. Then there is another transition where it goes from what is known as a monomorphic look-up to a polymorphic look-up. The VM is able to generate four of these things.
As we go up - so that was line 44; line 61 is - sorry, that's the benchmark. The next one up is line 51 of example 3, and here the VM saw up to four - sorry, three different kinds of maps, so it transitioned from the monomorphic state to polymorphic to polymorphic. It's willing to do up to four of these states, and so everything still works well. As you go up to the last one, to the
top, here is what it shows you. In this particular location, it has seen 10,000 different shapes, and it has transitioned from monomorphic to polymorphic, then polymorphic a few more times, and ended up megamorphic - most of the execution hits that happened at this location, about 100 per cent of them, were in the megamorphic state. This only repeats what we already know. It turns out that while this tool is pretty cool,
there is a better way of looking at this. There is another tool out there called deoptigate. Deoptigate will present the same information, but in a more human-readable way. Okay, so this is the tool we executed. If we look at our example 3 code, what the
VM did is that it placed these markers in here. First, notice the green icon means that this property read is monomorphic, so the best kind of property read you can have. In other words, every time the VM executed this piece of code, the object shape underneath was always the same, and therefore it could just short-circuit the whole thing and read the property directly. The blue ones are polymorphic. It is showing you that there were two different
shapes of object that went through this location. While 2 is not ideal, it's still pretty good
because it can still do inline caching. Now, notice when we transition to array 5
with five different shapes of objects, it becomes red, saying your system has transitioned into the megamorphic state and we're no longer executing the code in the highest-performance way possible. So now you have an explanation and you know how to use these tools to understand your code, but we still haven't explained why there's
such a sudden drop at the 10,000 level. I executed the code one more time, this time with another option called --prof, the profiler, and we can open it up in a profiling tool that V8 has. We get this kind of a graph. You can click on function 1. Benchmark 1 has the same-shaped object, and
you can see it's all green. Notice at the very, very beginning, those
are when the VM is running in the interpretative mode, collecting information about your code,
and once the interpretative mode is kind of finished, the VM generates actual assembly
instructions that it then executes. The green means it's running those generated assembly instructions, and those are fast. All of these functions begin with a little
bit of colour there at the beginning that says, "I'm compiling or collecting some metadata"
and then all of the green stuff says, "I'm running in the most efficient possible manner." You can see that one, two, three, and four,
they're all very efficient here, but, in five, we now have a whole bunch of yellow that happens
in here. Let's highlight this piece of code over here and do a bottom-up view. You can see that about 40 per cent of the time, we were basically in the light green area - the generated code - and that's our function, line 62, 60 per cent of the time in generated code. The yellow means code that the VM generates which is actually dynamic. So it is still generated code, but it's suboptimal in the sense that it doesn't execute as fast as purely green code. So there is some of that yellow right here.
We can see that, as we go through all of these benchmarks, we have a little bit of yellow. This is when we are two and a half times slower, and it comes from the yellow area, and it stays that way all the way up until we hit 10,000 shapes. When we hit 10,000 shapes, we select this bit of code here, and you can see we have a whole bunch of blue. What is going on here? If you zoom in, we have C++ code here now,
and it says runtime LoadIC miss - the yellow said LoadIC. The yellow is our getProperty function. First, it tries to go into what is called a megamorphic cache, a cache that is populated as the code runs, and it can resolve the property look-up not as fast as an inline cache, but still relatively fast. That's the two-and-a-half-times slowdown that you see while the number of shapes is still relatively small. Because we have generated 10,000 different shapes in this location, that's more than the number of items that the cache can hold. When we iterate through 10,000 objects, and
we have 10,000 different objects, we are sure to be evicting cache entries from the cache
and replacing them with new ones, but we never get to use that new entry, because we still have other shapes to read; by the time we read an object of that shape again, we've already overwritten the cache entry with something else, and that's why we get a whole bunch of cache misses here. Okay, so those are some cool tools by which
you can understand how the VMs do things internally. Let's look at Ivy and how we take this information
and use it to make ourselves fast. So now, imagine you have two components. Let's say we have Hello and you have MyApp. MyApp instance eighties a Hello component
and inside of its content, it puts the bold "world" and the hello component simply says
"hello" and redirects the content. The resulting DOM tree that you have is on
The resulting DOM tree that you get is: hello, span, world, end span, hello. Now, while the tree on the left is what we render, the tree on the right is what we care about. This is what we call the logical tree. The way to think about it is not that the world is a child of the span; a better way to think about it is that the world is a child of hello, and there is a sibling to it which is the content view of hello, which then has the content projection. We have an arrow like that. The reason why this particular tree is important
- so, right, we call the tree on the left the render tree; the tree on the right, we
call the logical tree. The reason we care about this tree is, because
if you look at how Angular actually resolves things - injectors, look-ups, parents - it's all done on the logical tree, never on the render tree. Furthermore, we have to store additional information in there, such as the injector, the component, the directives, the bindings, the listeners, the clean-up work, the pipes, and all of this information has to get stored on the logical tree so we can execute on it. Now, one way to do that would be to
create what we call an L node or logical node, which would have pointers to parent, children,
the next injector, bindings, pipes, styles, directives, and so on, and most of these things
would be blank most of the time. You know, because most nodes don't necessarily
have an injector. Most don't have pipes, and so on. This is a very inefficient way of storing
this particular thing. So the good thing about this way of storing it is that it's simple and it's super maintainable. The con is that it is very sparse. Because it is sparse, it is memory-intensive, and being memory-intensive means you will have a hard time running it on your mobile devices. The other thing is that it duplicates
information across templates. If you have a template which is inside of
*ngFor and you execute the template over and over again, the parents, siblings, and children would be replicated over and over and over again. We will have the same information spread across
multiple different things. Finally, it suffers from what is known as poor cache locality, which basically means that, because the information is sparse, when you cause a cache miss and the CPU loads memory into its cache, it loads something called a cache line, which is essentially 128 words of information, and, if things are sparse, the chances are that the thing you've just loaded will have nulls everywhere you don't need, and so, because of that, when you try to read something related, you will most likely cause another cache miss rather than reading it from the data you have just loaded. What we can do is rearrange the data a little
bit. We can use the same exact trick that the VM
does at the Ivy level as well. Instead of storing the sparse data in the
object, we can store all of our data in an array. We will define a new thing called an L view. An L view is just an array. Now you have the problem that you need to know where inside the L view to look for the data. So instead of having an L node - a logical node - we have a T node, which is shared across all instances of the template. So, if you're inside of *ngFor, we would only generate one T node no matter how many times it unrolls your template. You could have an *ngFor that unrolls your template
a thousand times, but there will only be one T node describing how it got unrolled. The advantage of having everything inside of the array is that now we can compact everything into as small a space as possible. We don't store the information we don't need. We don't have to say that this component doesn't have any directives or any components; we look at the T node, and the T node tells us that the binding points to negative one, so we don't bother looking inside of the array.
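Very roughly, you can picture the arrangement like this (a heavily simplified sketch; the real Ivy data structures have more fields and a different layout):

```ts
// A heavily simplified sketch of the idea - the real Ivy data structures have many
// more fields and different layouts. It only illustrates "shared static metadata
// (T node) + compact per-instance array (L view)".
interface TNode {
  // Where in the L view this node's binding value lives; -1 means "no binding",
  // so we never even look into the array for it.
  bindingIndex: number;
  // Index of the parent T node; -1 for the root.
  parentIndex: number;
}

// One T node per template position, shared by every instance of the template...
const tNodes: TNode[] = [
  { bindingIndex: -1, parentIndex: -1 },
  { bindingIndex: 2, parentIndex: 0 },
];

// ...and one compact L view (just an array) per template *instance*. An *ngFor
// that stamps the template 1,000 times creates 1,000 L views but still only one
// set of T nodes.
type LView = unknown[];
const lViewForRowA: LView = [/* element refs, etc. */ null, null, 'binding value for row A'];
const lViewForRowB: LView = [null, null, 'binding value for row B'];
```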
This is all nice. But the con of this is that it is hard to debug and understand what's going on. Because it's hard to debug and understand, it really could complicate the way open-source contributions would have to be done, because now you can't just look at the code; it becomes much more complicated. And finally, it becomes hard to profile, because if everything's an array, then it's hard to separate which object is which - which array
am I looking at? We're trying to mitigate all of these things. And the way we mitigate is by this mode we
call ngDevMode, which is a flag that can be either true or false. If it is true, we execute a whole bunch of extra code and decorate the objects with additional information that allows us to simplify all of these things. So, instead of the L view having to be just an array, we actually subclass the array into an L view and put a debug property on it. All of the arrays not only have the regular
information that you have, but they also have additional debug information that allows us
to see it in a more unrolled, friendly manner and explore what's going on, and that makes it much easier to understand how Ivy works internally and makes it a lot more approachable.
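As a very loose sketch of that pattern (not Ivy's actual classes or field names):

```ts
// A very loose sketch of the dev-mode pattern described above - not Ivy's actual
// classes or field names. In dev mode the raw array gets a `debug` view that
// presents the same data in a human-readable form; in production the extra code
// is stripped out.
declare const ngDevMode: boolean;  // global flag, false in production builds

class LViewDebug {
  constructor(private lView: unknown[]) {}
  get bindings() {
    // Unpack the flat array into something readable in the console.
    return this.lView.slice(2).map((value, i) => ({ slot: i + 2, value }));
  }
}

function createLView(): unknown[] {
  const lView: unknown[] = [];
  if (ngDevMode) {
    // Only attached in dev mode, so production builds pay no size or speed cost.
    (lView as any).debug = new LViewDebug(lView);
  }
  return lView;
}
```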
This particular approach is really the best of both worlds. We can be super compact but at the same time produce the metadata for the developer so they can understand what is going on. It makes it easier to debug, easier to profile, and also fundamentally easier to contribute, because we really want all of you to help us make Angular better. This extra code has an impact on the amount of code that ships, which is why, in the production build, when ngDevMode is false, all of the ngDevMode code is removed. There's only extra code there if you're in development mode, and there is some speed cost to it, but only in development mode. Once you do a production build, all of that stuff is removed and it becomes much simpler. So, let's see if I can do a little demo here. So I have a very simple "Hello, World" which is styled, and it also prints the letters of the alphabet, so it's pretty simple, straightforward
stuff. One thing I want to show you is that Igor
already talked about this: we can select any one of the DOM elements. Let's say we click on a span over here; we have this ng property that exists in dev mode. There are all kinds of useful methods on there, like getComponent and the other helpers, so if I pass in the selected node, I can see that I'm going to get my application, I can look at all the properties, and I can explore it. So that's one way in which we make sure that development under Ivy is going to be much simpler.
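In the DevTools console that looks roughly like this (ng is the dev-mode debug global and $0 is the element selected in the Elements panel):

```ts
// A sketch of the dev-mode console workflow being described. `ng` and `$0` are
// provided by the browser environment: `ng` is Angular's dev-mode debug global,
// `$0` is the element currently selected in the Elements panel.
declare const ng: { getComponent(el: Element): unknown };
declare const $0: Element;

const component = ng.getComponent($0);  // the component instance that owns the element
console.log(component);                 // explore its properties from here
```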
The other thing that I wanted to show you is that these objects have the L view on them, so we can see that it says L view right here, and this L view is just an array, right? That makes it super difficult to figure out what's going on over here, but luckily we have this debug property, and it gives us the same exact information in a much more readable view. For example, we can ask what kind of views
do we have? So we can look at child views. And we can see that this view has one child, which is a container; we can go into the container and look at its views,
and we can see that that container has 26 sub views because this is an *ngFor, one for
each letter, right? If I select one of these views, I can look
at nodes, and I can see that that node refers to - if you hover over it, it highlights,
like you can see here. And I can also see what styling, et cetera,
it has. For example, if I go and look at the nodes
at the top, I can see that I have a span node, one of the root nodes, and the span happens
to have all kinds of bindings, and so I can look at the styles, and I can see how the
style bindings have combined together. How do I get to my presentation? Here we go. In summary: first, we really are very careful about making sure that everything in Ivy is monomorphic. We use the same tools that I showed you to look at our code and make sure that all the code shows up with monomorphic property reads, and that is the first big win. It has two benefits. First, it makes us fast, but second, it doesn't pollute the global cache for megamorphic property reads, and not polluting the megamorphic property read cache means that if the application code uses megamorphic properties, we don't evict things and make the application unintentionally slower. We always try to iterate over arrays rather
than object keys, because working with object keys is expensive whereas working with arrays
is super cheap. We are careful about function inlining, like I showed you at the beginning, because it's easy to end up measuring something that is faster than it would be in reality. We store data in these arrays for speed as well as for space efficiency. And we have lots of focused tests that are basically built on the same benchmarking setup that I was demonstrating here, which allows us to isolate things individually and reproduce what is going on. Once we have these tests, we can execute them
inside the same set of tools that I showed you, and figure out what is going on inside
of it. With that, I will close it. Thank you!