Daniel Khan - Everything I thought I knew about the event loop was wrong

Captions
[Music] First of all, thank you for having me. My talk today is about the event loop, and it's called "Everything I thought I knew about the event loop was wrong". I also tried to translate my slides a little bit, including the animations; I used Google Translate for that, so let me know on Twitter if it sounds like some Japanese manual or something, because I have no idea what it actually says. So, who is this guy? I'm Daniel Khan. I've been a developer since 1999, which means I'm reaching the point where there are people in the audience who were born when I started in the industry. I've been doing Node.js since 2012, and I'm a member of the Node.js Diagnostics working group, though I should do much more there. I'm technical product manager and Node.js guy at Dynatrace; we're a performance monitoring company of around 2,000 people, and everything that has something to do with Node.js kind of goes through my inbox there. I'm a lecturer for Node.js at the local university, which, I wonder, is that really translated as "profesor"? Because I'm not a professor, I'm just teaching. And I'm a course author for Lynda and LinkedIn. I'm from Austria, as already said, this little spot here in the middle of Europe. We're around nine million people, and we are mostly known for Wolfgang Amadeus Mozart, Red Bull, and a Californian governor. [Laughter] It means when I say "hasta la vista, baby", it sounds a little bit like him. This is how we look on a Sunday; no, really, it's a beautiful country. If you happen to come to Europe, don't skip it on your way from, I don't know, Prague to Italy; visit us, it's really worth spending a little bit of time there. It's small, so you're done pretty quickly. What will we talk about today? We will cover my, or some common, misconceptions about the Node.js event loop; we will then hopefully learn how the event loop really works; and then we will also cover a few metrics that we can collect from the event loop to know how healthy it is.
Why am I even doing this topic? As I said, we're a performance monitoring company, and customers suddenly started to come to us and say: but do you also have Node.js event loop metrics? Because of course, for performance monitoring we collect a lot of metrics, like memory, CPU, etc. Suddenly everyone wanted to know about event loop metrics, and I thought this has to be something really obvious; everyone knows how the event loop works, I know it as well. So I just created a Jira ticket, like: let's do the event loop metrics. And then this story began, and I figured out that I didn't really know what I actually wanted, and also that most probably our customers didn't know what they were actually asking for. This is the story of that journey. What do we know about Node.js and the event loop? We know that Node.js is evented, like everything that is JavaScript. This means when you click on a button in the browser and have some onclick event, that's basically the same mechanism as you have in Node.js with the event loop: there is some event handler registered with some event loop that executes this callback when the event occurs. Node.js runs in a single thread; that's well-known knowledge. There's the cluster module and so on, but request handling as such runs in a single thread. And in Node.js, as you know, all I/O is asynchronous. Or, to put it differently: everything in Node.js is asynchronous except your code, right? You hear that quite often. And there is something called libuv that provides an event loop and a thread pool, and this basically does all of that. If you're like me, you always need to see some pictures to understand things; I'm more of a visual person, so I tried to draw a little bit. How does traditional request handling work? Let's say PHP or Rails, or also Java most of the time: every time a request comes in, one thread is created.
One bar here is a thread. A new request comes in, and another, and every request, as you see here, will spawn a new thread. This also means that if one thread is blocked, say doing a very heavy database operation, that's not a problem, because all other requests are not affected. It becomes a problem if many of those requests are blocked like that, because then there are many threads hanging around waiting to be executed. That's the traditional way. The Node way is very different, because there is some interleaving going on: we have one thread, a request comes in, we process it, but then something asynchronous happens, so we offload that, and another request is processed. We do all of this in one thread, switching between requests; we are handling everything in this one thread, and this goes on and on. And then there is something called libuv, and it's totally magical and no one knows how it works, so it gets a unicorn here. That's known knowledge; those are the "facts". Now we can derive some misconceptions from that. Everything I say now is wrong, so please don't take pictures of me with these slides. We could say the event loop runs in a separate thread from the user code. Makes sense: we have our main thread, and every time something is to be done, we offload it to our magical event loop, which does its magic and then pings the main thread back with the callback, and our main thread executes it. Everything that is asynchronous is handled by a thread pool. Also makes sense: we have a thread pool, we have to handle asynchronous tasks somehow, so obviously everything asynchronous means thread pool. And the event loop has to be something like a stack or queue, right? If you think of the easiest way to accomplish this whole event loop magic, you have a queue or a stack, and all the tasks you schedule on the event loop are kind of stacked there.
Then they are processed one by one, and the callback is executed. For me at least, this makes total sense. Again a little drawing: we have our main thread here, and here I have this asynchronous function, it's called castSpell, you put in some potion and a callback. Then I have my event loop, and the event loop already has some spell stack; these are tasks that are pending on the event loop. And then we have the thread pool; threads look like aliens to me, and here's a little duck, so it's a pool. Now we send a castSpell call off to our event loop, and it gets added to the stack. At some point all other tasks are done, now it's our turn, so it is sent to the thread pool, the thread pool does some magic, then the result comes back and we can resolve the callback: we ping the main thread, and on the main thread the callback runs. Makes total sense to me, and it's totally wrong. Like, really. And the fun thing is that I'm not the only one. There was this very funny and great presentation by Bert Belder at Node Interactive 2016 in Amsterdam. He was talking about libuv, and because he wrote much of it, he knows how it works. I don't know if you know this, but when we speakers prepare a talk, we always use Google image search to see if there is something we can steal and reuse. He did this as well: he was hoping to find good diagrams of how the event loop works, so he used image search, and everything he found there was actually wrong. Really, completely, plainly wrong. There is some learning in that, too: just because it's on Google image search does not mean that it's correct in any way. I have a few examples here, like this one, or that one, or this one, or that one. Interestingly, you also see how everyone kind of copies from the others; it's always the same, just with different graphics, and so wrong information is replicated that way.
Those three are basically the same diagram, and then you see this event queue I was talking about before, which is equally wrong; everything there is basically wrong too. Let's talk about reality. I had to find this out the hard way. Our developers who are building the agent for Node.js, they think in C++, they're not JavaScript developers; they looked into the event loop, into Node.js, into how things work, and I was totally confident and told them: I think it works like this and that. And they were like: no, that's wrong. So they took their time and explained the reality to me. First of all, the event loop does not run in a separate thread from the user code. There is only one thread that executes JavaScript, and the event loop is an inherent part of this runtime and of the whole processing of a Node application; there is nothing else. Of course Node.js and V8 spawn a few more threads, for garbage collection etc., but when we talk about your code running in Node.js, it's just this one thread. Is everything that's asynchronous handled by the thread pool? No. libuv creates a pool with four threads; you can override that with an environment variable (UV_THREADPOOL_SIZE), but basically it's four threads. And it only uses this thread pool if there is really no other way to accomplish the task, because on a modern system you have a lot of APIs that are already asynchronous, like epoll, or database libraries or database backends that are already asynchronous. So the event loop is a lot smarter than one would think. It does not stupidly throw every task at the thread pool; no, it knows: okay, this is a file system operation, this will need the thread pool; this is a database operation, we can use some system libraries for that. So in any case, when it's possible, the thread pool will not be utilized.
I think that's also a very important point here. And: the event loop is something like a stack or queue? No, it's a set of phases, and I will show this in a second. It's a set of phases, and of course each phase has some data structures that are used in that phase; these may be stacks or queues, but still, the whole process we are running through is phases, not traversing one stack. So now it's my turn to create something you'll find on Google image search. I'll show you the phases now and then explain each of them in more detail. The first phase is called timers, where we process all timers. The next phase is called callbacks. Then we have a phase dedicated to I/O polling, then one for setImmediate, and then we have the close events. For each of these phases we have data structures: here we may have some stack of timers we process, here we have polling against system events or the thread pool, and there we may have stacks again. But this is traversing phases, not running through a stack. And all of this is clocked by ticks: the switch from one phase to the next is a tick. Let's look at how we can schedule tasks that run in these phases. Timers, for instance: very easy, you do a setTimeout or a setInterval, and you schedule something to run in this timers phase. What you see here in red is always callbacks, and that's very interesting, because callbacks are basically your code. For example, we create an HTTP server here and pass in some request handler function; this request handler function is executed for every incoming request, and it runs exactly in this callbacks phase. That is basically your code, because in Node.js everything is basically a callback. Think of an Express application: you create a route and do something in there; this route is actually already a callback.
Then you have a cascade of callbacks: you do some database operation in this route, etc., but it's still a cascade of callbacks. This means everything that is your code is processed here in this callbacks phase. If you do a setTimeout, this is of course processed within timers, but the callback function you pass is then put on this callback stack and executed when the callbacks run. Then we have I/O polling; you can trigger that by doing a simple readFile. Here we are polling system events, or the thread pool, to see if anything is done now so we can proceed and execute the callback, which again means putting this function on the callback stack. Then we have setImmediate, also very easy: just do a setImmediate and it is processed there. And then we have all those close events, which you get with socket.on('close') and so on, all the events around closing a socket, etc. So those are basically the phases of the event loop. What you might wonder, because you don't see it anywhere here: what about nextTick? nextTick is also something we use sometimes, though mostly we shouldn't. What process.nextTick schedules basically happens at the ticks: everywhere you see these little triangles, every time one phase is over, at this tick we will execute everything that is scheduled with nextTick. So this runs kind of next to the event loop; it is not really part of those phases. And I created a little example, if you can even read it; it's not really sharp for me. What I do here: I schedule something with nextTick; then I do some I/O polling, this is number one, a readFile; then I do another I/O call, which will not utilize the thread pool but will run through system events; then I do a setImmediate, then a setTimeout, then I just write something out on the console, and then I do another nextTick.
In case you wondered why I'm using process.stdout.write and not console.log: console.log itself can already behave asynchronously, so it would kind of destroy the whole thing, because then it would be yet another asynchronous operation. And the result you get is this: first, the synchronous write is executed, because your main index file itself already runs as a callback, so this runs first. Then you see the nextTicks: regardless of where we scheduled them, they are executed at the end, in sequence; they really have priority. After that we have the timer events, then setImmediate, and then the system and thread pool polling. That makes sense, because these are file system reads etc., they need a little bit more time, so they run last. I think now we basically know, or have an idea, how the event loop works, right? Hopefully? Yes? No? I see someone... So now that we know that, we should maybe start to find out if we can derive metrics from the event loop, because obviously our customers were very eager to get them. And let me add that it's interesting: we just learned that basically everyone has some misconception about how the event loop really works, but everyone is asking for metrics. How does that even make sense? I have to know what I'm actually measuring. So, metrics of the event loop. The simplest one is tick frequency, and that name is a little bit misleading already, because there is something ambiguous here: a tick sometimes means one step from one phase to the next, and sometimes a tick also means a whole run-through of the whole event loop.
I still have to figure out what to really call it; for now we call it tick frequency, but if you think about nextTick, it's obvious that this is not really what we're measuring here. What we are measuring is the number of ticks per unit of time, or the tick duration, meaning how long a tick takes. This is very easy to do, because you just have to queue some tasks via setImmediate, and this gives you a more or less very clear measuring point in your application: if you time the interval between two such setImmediates over time, you're very close to measuring how long a run-through of the event loop takes. We were sitting in our sprint planning, something similar had just been shown around that time, and we looked at that and ran some scenarios, which I have redone here. This is idle, so nothing is happening, the application is basically sleeping. Then I use Apache Bench as a benchmarking tool with a concurrency of five, then with a concurrency of ten, and then I add a slow-acting route to also simulate some congestion. What you see here, and that's the problem, is that when the event loop is idle, it looks a little bit similar to when it's under heavy load, which makes these metrics not so valuable. With a concurrency of five you see: okay, now the event loop is at full speed, very short duration, very high frequency; here it balances out a little bit; and here suddenly the tick frequency goes down and the duration goes up. Not so good. The problem we see with this metric: idle looks very similar to high load, and we don't really know where the time is actually spent. In our sprint planning people were like: yeah, those are nice metrics, but if I have them in my application now, what do I do with them? So this does not really help.
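The setImmediate-based measurement described above can be sketched in a few lines. This is my illustration of the idea, not Dynatrace's actual implementation; it samples five ticks and prints each gap.

```javascript
// Time the gap between consecutive setImmediate callbacks: over time
// this approximates how long one run-through of the event loop takes.
let last = process.hrtime.bigint();
let ticks = 0;

function measure() {
  const now = process.hrtime.bigint();
  const durationMs = Number(now - last) / 1e6; // nanoseconds -> ms
  last = now;
  ticks += 1;
  console.log(`tick ${ticks}: ${durationMs.toFixed(3)} ms`);
  if (ticks < 5) setImmediate(measure); // sample five ticks, then stop
}

setImmediate(measure);
```

In a real collector you would keep this loop running and aggregate the samples into a frequency (ticks per second) and a duration distribution instead of printing them.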
You might wonder why this is even the case: why is the event loop frequency so low when idle? It makes total sense from a programming point of view. We are measuring setImmediate here, but the event loop waits in I/O polling. Why? Every time you do something in Node.js, it is first triggered by some kind of event: a thread event, an incoming request, etc. If the event loop figures out there is nothing in timers and nothing in callbacks, then it makes sense to spend a little bit more time in I/O polling to see if something comes in during this time, because obviously there is nothing more to do. So it waits in I/O polling for longer, and this means the event loop is slow. It's really like a car: when there is load, the event loop will start speeding up, because it adapts to the load. Again a sign that the event loop is a lot smarter than it may seem. So we had that, and we figured out this is nothing we want to put into the product on its own, because it does not really tell us much. So we invented something else: we came up with work process latency. What is it? We measure how long an asynchronous task waits to be executed. How can we measure that? We simply schedule a task on the thread pool and wait until it's executed. This is easy when you are a native module, in C++: you can schedule your work item on the thread pool, basically put yourself in the queue, and then you see when you are executed. If it takes long to be executed, then obviously there is a lot more pending on the thread pool and we have to wait longer. Important here: where we measured the tick duration vertically before, we are now measuring the work process latency horizontally.
This means there can be a lot of ticks, I don't know, 1,000 ticks, while the I/O polling is still going on, so the tick duration is not directly affected by this waiting; the two are not directly correlated. Whenever there is nothing to process in I/O polling and we are still waiting because the thread pool is congested or busy, the ticks will just proceed, one after the next, and so on. So this is really a totally different metric than the tick duration; it's not that we simply carve the polling phase out of the tick duration. And if you do that, you already get something that makes sense. Here I ran a benchmark with a concurrency of five, and not much is going on. And here I'm using sharp, as I showed before, for image processing, because sharp utilizes the thread pool. I created a scenario where I rendered an image again and again in an Express route, to find out if I would see this on the event loop, and as you see, you indeed do, and you can really derive some insight from that. You can say that a high work process latency means that the thread pool is busy or exhausted, and this is a metric people actually care about, because it means you're doing too much on the thread pool, maybe processing too many images in Node.js. Another metric we came up with is the event loop latency. It means: how long does a callback wait to be executed? Also quite easy to do: we simply schedule something on the timers queue and then wait until it fires. This now really measures your callbacks, whether there is much going on in your code. What does latency mean here? If you schedule a function via setTimeout and it should run after one second, this one second is what's expected, so we expect it to be processed here; the delta between that and the real execution time is the latency.
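The timer-based latency idea can be sketched as follows (my illustration, not the product's code); the deliberate busy-loop at the end stands in for "Node.js being too busy", so the timer fires late and a non-zero latency is reported.

```javascript
// "Event loop latency": schedule a timer and compare when it actually
// fired with when it should have fired.
const DELAY_MS = 100;
let measuredLatency = null;

function sampleEventLoopLatency(cb) {
  const scheduled = Date.now();
  setTimeout(() => {
    const latencyMs = Date.now() - scheduled - DELAY_MS;
    cb(Math.max(0, latencyMs));
  }, DELAY_MS);
}

sampleEventLoopLatency((ms) => {
  measuredLatency = ms;
  console.log(`event loop latency: ~${ms} ms`);
});

// Block the main thread on purpose so the timer cannot fire on time.
const blockUntil = Date.now() + 150;
while (Date.now() < blockUntil) { /* simulate heavy synchronous work */ }
```

With the 150 ms busy-loop in place, the 100 ms timer cannot fire before roughly the 150 ms mark, so the reported latency is around 50 ms.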
If we schedule something to run after one second and it runs after one second and two hundred milliseconds, we have 200 milliseconds of latency, because obviously Node.js, or this callback queue, was too busy doing something else before our callback was called. This is quite insightful. In this case I again used Apache Bench with a concurrency of five, and then I calculated Fibonacci numbers in another route to keep the main thread, the callbacks, busy, to see if I could detect that something is going on. And as you see, it really worked: we really see now that something is busy. It's still just 25 milliseconds, but the insight is: a high event loop latency means that the event loop is busy processing callbacks. Great, again we have learned something new. The result of all of that is that in the end we were able to create this dashboard in our product, with all these event loop metrics on one page, and that's totally magic. But the problem is: what does this really tell you? We did it basically for our customers, we created metrics, but you still don't really know what's going on in your application; it's just one more pile of metrics. And how do you even tell whether this behavior here is normal? Who defines what is normal? This means that every time you collect metrics, here as anywhere, you always have to make sure that you baseline and correlate all the metrics you have together and look at what's going on in your whole system to really make sense of all of it, because the metrics alone don't tell you anything. As I said before, the event loop metrics especially are highly in demand, mostly by people who are not even Node developers, because our customers are more operations folks; they want event loop metrics without even knowing how the event loop works and what they are actually measuring. This was also a learning for me.
I had to really understand which metrics I collect to be able to make sense of them. But after all, I think through this whole journey and exploration I learned a lot, because I now know better how everything works together as a whole and understand Node.js a little bit better, and I hope I could share some of that with you in this talk. Thank you. [Applause] Thank you, thank you. Very quickly, a commercial break now, before they switch me off. What I talked about here is also a Medium post in the Node.js Collection; it covers basically everything I was talking about in this talk. If you like my Austrian accent, you can listen to me on LinkedIn Learning; I have a few courses there, like how to build a Slack bot. Let me know via Twitter in case you want a coupon or so; I guess I can do something so you can watch it for free. And now, really done: hasta la vista. [Applause] [Music]
Info
Channel: NodeConf Argentina
Views: 4,541
Rating: 4.9349594 out of 5
Id: gl9qHml-mKc
Length: 33min 43sec (2023 seconds)
Published: Fri May 04 2018