ElixirConf 2021 - Chris Ertel - The Hitchhiker's Guide to Elixir Performance

Captions
[Music] [Applause] Howdy folks, welcome to the ElixirConf 2021 virtual track. My name is Chris Ertel — @crertel on GitHub and Twitter — and today we're going to be talking about the Hitchhiker's Guide to Elixir Performance.

Why are we here? This talk is about several things. We're going to talk about performance, and if we want to talk about performance we're going to need to talk about benchmarking, and if we're going to talk about benchmarking we're probably going to have to get down into the weeds and talk about disassembly. But the overarching thing we're talking about today, especially in these times when it's such a critical topic, is science.

Now, the scientific method is something that in the English-speaking world we would credit to Sir Francis Bacon in the early 1600s. It was independently discovered and rediscovered before that, in antiquity, in various parts of the world, by different cultures and civilizations; nobody really has a monopoly on knowledge and learning. For us, we'll say it's an English tradition that started with Bacon in the 1600s. It's this very interesting mix of inductive and deductive reasoning: in order to understand the world, first we come up with an observation ("oh, that's a strange thing") or a question like "I wonder how we would make the system do X." Then you research it, to build some context and some understanding of the system. Then you form a hypothesis about how you think the system will behave, or why you think the system does something. You test that hypothesis by actually conducting experiments, making some changes, and collecting data. You analyze that data to identify trends and make sure you understand what the observed behavior is. And then, finally, you report your conclusions: "this is what we observed, and this is why we think it's happening." Now, the important thing is
that we use science to overcome superstition. That means a lot of things to a lot of different people, but in our field a superstition could be: "Erlang and Elixir are not fast." Or: "the BEAM is not for math" — to an Erlanger the BEAM is for message passing, to an Elixir developer the BEAM is probably for serving web requests; neither party is going to think it's for math. Or: "if you want performance, you need to rewrite everything in Rust" — okay, or C or Rust or Zig or some other language. We're going to push back on all of those today and dig into them a bit.

Why can't we write fast code? I think we can; we just need better tools and some understanding of things. So we'll learn some tools today.

Our design scenario: we're going to write the world's worst vector library. It's a 2D vector library, and we're doing that because it's relatively small. We're not going to worry about most of the operations on vectors other than addition — no dot products or cross products or anything like that. We're going to worry about adding two vectors together, with dimensionality two. In computer graphics we might have dimensionality four; in data science we might have dimensionality of tens or hundreds or thousands of elements; in machine learning, if you're using Nx or something like that, a vector might have dimensionality in the millions or tens of millions. For our purposes today, because it makes a very convenient small example with easy-to-follow code, we'll be using vectors of size 2.
So, 2D vectors. The first part of our method is to pose our question: what is the fastest way, in pure Elixir, to add two two-dimensional vectors?

The next step of the scientific method is to build some context — to research the topic a little. We'll focus on two things: first, what does it even mean to add two vectors together? And second, how do we represent vectors? The answers to these two questions are probably going to have a very large impact on how we implement things.

First question: how do we add two vectors? For two vectors a and b, we write them as the two-element column vectors they represent: a is (aₓ, a_y), b is (bₓ, b_y), and the sum a + b is (aₓ + bₓ, a_y + b_y) — a very straightforward sum of the elements. I promise that's all the math we're going to do; that one slide was my contractually obligated LaTeX, and now we return to our regularly scheduled Elixir.

Our next question, then, is how we're going to represent those vectors. We could represent them as tuples — we know it's going to be exactly two elements of some primitive, so tuples make sense; it's a good fit. In the back we've got Robert sitting there with LFE going, "why don't we just use a cons list? We have lists, lists are great, we don't need anything else — a list of two elements, no big deal." And then us web developers are probably going to say, "okay, it's a map, clearly," because on the web everything is a map: a map with a key x and a key y, and you're done — you'd probably make it a struct, and all is well. So what we're going to do is blow the dust off the ancient tomes, go to Mount Ericsson, and return with the Erlang Efficiency Guide. We'll turn to section 11, verse 1, and we're
going to get an idea of the memory sizes. We'll assume a 64-bit system and an Intel architecture; the actual sizes, and some of the ranges these primitives can represent, differ between 32-bit and 64-bit systems, and might behave a little differently on an Intel architecture versus, say, an Apple M1 or an Ampere Altra or some other ARM variant — to say nothing of a Blackfin or MIPS or a more exotic Loongson processor.

Looking at the guide, we see that the numerical primitives we care about have different sizes. A small integer — which on 64 bits is anything between roughly negative and positive 5.8 × 10¹⁷ ("five grillion," give or take) — is represented by one word: eight bytes, sixty bits of which are devoted to actually representing the number. That's down to the tagged memory architecture and the different storage forms you'll see on the BEAM: there's one form for storage on the heap and the stack — how we actually run things — and then there's a format for storing to disk or for interop, the external term format. We won't be covering that in this talk; just know that it's there, and that you can read up on it if you find yourself needing it.

A large integer — anything larger than a small integer, larger than five grillion — is represented by at least three words, possibly more, because we support arbitrary-precision integers; that's at least 24 bytes. A float is likewise three words, 24 bytes.

Those are the numerical primitives, but vectors are collections of these — in our case, collections of length two — so let's look at those sizes. A tuple is two words plus the size of each element, so at least 16 bytes for us. A list of length n is represented by n + 1 words plus the size of each element, so it's going
to be at least 24 bytes for our two-element list. Then we ask: what if we put it in a small map? A small map is any map with fewer than 32 entries; anything larger uses what's called a hashmap, which is a different sort of data structure. If you look at the Efficiency Guide you'll see there's a somewhat involved formula for calculating that size; for our purposes we can assume a small map, because we have two entries — three if it's a struct — and that's five words plus the size of all the keys and values, so at least 40 bytes. Note that for our purposes, when we have a map we'll be using atoms as keys. You don't have to — you can use all kinds of neat things as keys — but we'll typically use atoms. If you're sanitizing user input you'll be converting it to an atom, hopefully using String.to_existing_atom/1, although as my colleague Lee points out, we can have such large atom tables these days that it's not as big a deal. A fun little bit of trivia.

So those are our backing stores and a rough idea of their sizes. Combining all of that, for a vector of floats on a 64-bit Intel machine, our best guesses at memory usage are roughly seven words for the tuple approach, nine words for the list approach, and thirteen words for the map/struct approach.

Now, per the method, we form our hypothesis: the simplest data structure for representing a vector is a tuple, and, not knowing anything else about how things are implemented, we expect that tuples will be the fastest.

Then comes the part of the method where we actually set up our experiments and run them. In our case we just have to code up our different approaches and see whether our hypothesis holds. Using tuples, it's very straightforward: we destructure the
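The word-count estimates above can be sanity-checked empirically from an IEx session. Here's a minimal sketch using :erts_debug.size/1, which reports the flat heap size of a term in machine words on the running system; the exact counts vary by OTP version and architecture, so treat the numbers as illustrative:

```elixir
# Report the heap footprint, in machine words, of the three candidate
# representations of the same 2-D vector of floats.
# :erts_debug.size/1 returns the flattened size of a term in words.
as_tuple = {1.0, 2.0}
as_list = [1.0, 2.0]
as_map = %{x: 1.0, y: 2.0}

for {label, term} <- [tuple: as_tuple, list: as_list, map: as_map] do
  IO.puts("#{label}: #{:erts_debug.size(term)} words")
end
```

You should see the tuple come out smallest of the three, in line with the rough seven-word estimate above.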
argument tuples, then create a new tuple holding the sums of the elements and return it — no big deal. The list version is basically the same thing; we've just traded our curly braces for square brackets. Our final version uses a struct — which is really a map — and, ignoring the slightly clunky syntax for defining a struct, we destructure the two structs, add, create a new struct, and return it. Again, nothing too exciting.

And then just for giggles — just to satisfy the new person to Elixir on every team who is really, really excited about writing everything as a GenServer, because we have message passing, we have the runtime system, we have all these great tools, we have OTP, and by golly we're going to use them — we'll write a version where we do all of the addition in a GenServer. This is a very stripped-down GenServer: you initialize it, it registers itself under its module name, and then you can call add through our nice little client interface, and it adds using what is effectively the tuple mechanism.

If we want to test these manually, we can: in the :timer module there's a function tc/3, which takes module, function, and arguments — MFA form. You run it and it returns a tuple whose first element is the number of microseconds the call took and whose second element is the function's return value. We ran this a bunch of times. The first run took 3,400 microseconds, which is about 3.4 milliseconds if I'm moving my decimals correctly; the rest of the runs took two, three, four microseconds. One thing we notice is that those times are all over the place, and if we really wanted to do a good experiment we'd have to run it a whole bunch of times. The reason we believe that to be the case is we believe
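The four implementations aren't shown verbatim in this transcript, so here is a minimal reconstruction of what they plausibly look like; the module and function names are my guesses, not the speaker's exact code:

```elixir
defmodule Vector2D do
  # Struct ("really a map") representation; slightly clunky to define.
  defstruct x: 0, y: 0

  # Tuple representation: destructure the argument tuples, return a new
  # tuple holding the sums of the elements.
  def add_tuples({ax, ay}, {bx, by}), do: {ax + bx, ay + by}

  # List representation: curly braces traded for square brackets.
  def add_lists([ax, ay], [bx, by]), do: [ax + bx, ay + by]

  # Struct representation: destructure both structs, build a new one.
  def add_structs(%Vector2D{x: ax, y: ay}, %Vector2D{x: bx, y: by}) do
    %Vector2D{x: ax + bx, y: ay + by}
  end
end

defmodule Vector2D.Server do
  # "Just for giggles": all of the addition routed through a GenServer.
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Client interface.
  def add(a, b), do: GenServer.call(__MODULE__, {:add, a, b})

  @impl true
  def init(_opts), do: {:ok, nil}

  @impl true
  def handle_call({:add, {ax, ay}, {bx, by}}, _from, state) do
    # Effectively the tuple mechanism, behind a message round-trip.
    {:reply, {ax + bx, ay + by}, state}
  end
end

# Manual timing with :timer.tc/3, in module-function-arguments (MFA) form;
# the result is {microseconds_elapsed, return_value}.
{micros, result} = :timer.tc(Vector2D, :add_tuples, [{1, 2}, {3, 4}])
IO.puts("took #{micros} µs, returned #{inspect(result)}")
```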
that we're going to have some outliers we can throw out of hand, and we're going to make some guesses that, due to the statistical behavior of our noise and measurement error, it's normally distributed. There's a bunch of hand-waving statistics that says: look, as long as we take a whole bunch of measurements, it will all come out in the wash, and we'll be able to identify a distribution and run statistics that give us a true representation of reality. That would be a fun project — wrapping :timer.tc in some code that generates these values and does some statistics. But we might also have to worry about initializing the values, about having the correct core affinity, the correct scheduler behavior, the correct process behavior; it can be finicky. That's why we have something called Benchee.

Here's an example script file we use with Benchee. The first part of the script aliases Vector2D as V, just to make life easier. Then we have our definitions of the tuples, the lists, and the structs we'll be operating on. Looking at the tuples as an example, we go from one to a thousand, making a thousand pairs of vectors: the first vector in each pair is (n, n) and the second is (n - 1, n + 1) — just to give some actual numbers to bang against. If we initialized everything to zero, I might worry that an overzealous compiler would just elide the additions, so this gives a little grist for the mill.

Then we start up our GenServer to make sure it's running, and finally we invoke Benchee.run. Now, Benchee.run is an arity-2 function; the first
argument is a map from scenario name to the function it corresponds to. The functions here are arity-0 anonymous functions; Benchee supports up to arity 1 if you need to pass in inputs — and I want to say arity 2, although I may be mistaken about that — if you need to do any pre-processing. You can additionally pass in tuples instead of functions, and the documentation will tell you how to specify setup and teardown for each testing scenario that way. The second argument is a keyword list containing many different options. In our case it contains time, which is how many seconds to run for, and memory_time, how long to measure memory allocations for. You can also specify formatters: Benchee comes with a JSON formatter and a console formatter, which we'll be using today, and you can get an HTML formatter that produces nice little web pages you can view your results in.

Each of these scenarios, when it runs, takes our already-generated values and throws them into an Enum.map, running the function over our values. It's important that we did all of the generation outside of the actual timing: if you don't somehow warn Benchee about it — if we had decided to replace tuples here with that for-loop, or inlined the generation or something — we'd be timing the generation as well as the work we actually care about. So it's incumbent on us to pre-process all of this.

So we go ahead and run the benchmark. Benchee figures out my operating system, which is Linux; it finds out I'm on a pretty beefy first-generation Ryzen chip with plenty of RAM, running Elixir 1.12.3 on Erlang/OTP 24.1. It double-checks my
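Putting those pieces together, the script described above looks roughly like the following sketch. It assumes the benchee package as a project dependency (e.g. {:benchee, "~> 1.0"}, run with mix run bench.exs), plus a Vector2D module exposing add_tuples/2, add_lists/2, and add_structs/2 and a Vector2D.Server GenServer — those names are my reconstruction, not the speaker's verbatim script:

```elixir
alias Vector2D, as: V

# Pre-generate 1,000 pairs per representation, outside the timed section,
# so we benchmark the addition rather than the data generation. Pairs are
# {n, n} and {n - 1, n + 1}: real numbers to bang against, so an
# overzealous compiler can't just elide the additions.
tuples = for n <- 1..1_000, do: {{n, n}, {n - 1, n + 1}}
lists = for n <- 1..1_000, do: {[n, n], [n - 1, n + 1]}
structs = for n <- 1..1_000, do: {%V{x: n, y: n}, %V{x: n - 1, y: n + 1}}

# Make sure the GenServer variant is running before benchmarking it.
{:ok, _pid} = Vector2D.Server.start_link()

Benchee.run(
  %{
    "tuples" => fn -> Enum.map(tuples, fn {a, b} -> V.add_tuples(a, b) end) end,
    "lists" => fn -> Enum.map(lists, fn {a, b} -> V.add_lists(a, b) end) end,
    "structs" => fn -> Enum.map(structs, fn {a, b} -> V.add_structs(a, b) end) end,
    "genserver" => fn -> Enum.map(tuples, fn {a, b} -> Vector2D.Server.add(a, b) end) end
  },
  # Seconds to measure runtime and memory allocations for, respectively.
  time: 10,
  memory_time: 2
)
```

Benchee.run/2 also accepts a formatters: option (console, JSON, or HTML via the separate benchee_html package) and an inputs: option for running every scenario against multiple input sizes, as mentioned below.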
settings, the most interesting of which here is that we don't have multiple inputs specified. Benchee is capable of running a collection of scenarios: you'd say, okay, I'd like to test the tuples and the lists and the maps, but I'd like to test each with one vector, with 10,000 pairs of vectors, with 10 million pairs of vectors. We're not doing that for simplicity's sake, but you totally can build more interesting testing scenarios that way. Finally there's the parallel setting, which says we run one of these at a time, all of the tests back to back. That's important, because if you have multiple processes each running tests concurrently, you can get weird knock-on effects on your system if there's any resource under contention. So we run them back to back.

Benchee benchmarks these, and in the results we see that tuples were absolutely the fastest, at something like 50,000 iterations per second. GenServer, as expected, was dog slow — Benchee breaks it down for us: almost two orders of magnitude slower. Lists are close to tuples but definitely still a bit of a hit, and structs were almost three times as slow as tuples. And then in memory usage it becomes really obvious that there's a non-trivial difference for lists and structs — which is kind of interesting, and when we see the bytecode it'll make a little more sense — and GenServer is again just gnarly memory usage.

So we've now done — and actually bled over a little into — the analysis section: we've taken our measurements, which Benchee did internally, and it ran the analysis for us. That's pretty neat. Unfortunately it's not enough, right? Having the notion
of the distributions is good, but we also want a little more context for why we got what we got. So let's dig deeper, and the way to do that is to dig into the .beam files that get emitted. You have your Elixir code; you run it through the compiler, and after a bunch of internal steps it emits a .beam file. That .beam file is what gets fed to the interpreter, and the interpreter will do additional optimizations — what are called peephole optimizations, dead-code elimination, and so on; the last leg of cleanup it can do before executing your code. But for our purposes the .beam file is what we want to look at, because that bytecode is almost as close as you can get to what will actually run.

You can use :beam_disasm.file to read a binary and disassemble it for you. So what we do is open the .beam file, which gives back a binary blob, feed that blob to the disassembler, and inspect the result: a very neat structure saying "this is a BEAM file, here's the module," followed by all of its chunks. That's cool, but I use VS Code and I like nicer interfaces, and scout119 has a plugin called BEAM Dasm that will give you a friendlier version of this.

So remember, here's our source code for the tuple version, and our emitted bytecode is this stuff right here. The first thing it does is check the arguments. At this point x0 refers to the first register, which holds the first argument; there are up to 1,024 X registers. There are also things called Y registers, which refer to slots on the stack: the BEAM itself is a register machine with 1,024 registers plus a stack, which is used, I believe, for handing things between functions. The work we're doing here doesn't require that, so
you're not going to see any Y registers mentioned. The first check on x0 is that it's a tuple, followed by a test that the tuple has arity 2. We do the same for the second argument, so we've validated what was given to us as function arguments. Then we pull the first element out of the first tuple and store it in register two, and the first element of the second tuple into register three. Then we say: okay, invoke a function — and if it doesn't work, throw an exception and blow up — and execute erlang:+/2, actually adding these things together; the instruction carries a "live" count, and I think we need to keep four registers live here in case we garbage collect. We add registers two and three and store the result into register two. Then we do the same thing for the y component: we grab the second element of each tuple, add those together, and store into register one. It's all zero-based indexing, so apologies for that. We test the heap to make sure we have three words available, and failing that we garbage collect if we have to, making sure our first three X registers are preserved. Then we shove the resulting value into the first register, x0, which is also the accumulator — so when we hit return, that's what gets pushed up.

So, very quickly, that's us talking through the assembly, and that's actually what the BEAM will do when running this code. Doing the same for the list version, it's the same thing — except notice that the first thing it does is a whole bunch of "are you a list, are you actually a list, do you contain lists, what's up with those lists." So there's a bunch of that work,
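If you'd rather poke at this programmatically than through an editor plugin, the same disassembly can be pulled up from IEx. A small sketch (the demo module name is arbitrary), relying on the fact that :beam_disasm.file/1 accepts either a file path or a compiled BEAM binary:

```elixir
# Compile a tiny module in memory and disassemble its bytecode.
code = """
defmodule DisasmDemo do
  def add({ax, ay}, {bx, by}), do: {ax + bx, ay + by}
end
"""

# Code.compile_string/1 returns a list of {module, bytecode} pairs.
[{DisasmDemo, beam}] = Code.compile_string(code)

# :beam_disasm.file/1 accepts a path or a BEAM binary.
{:beam_file, module, _exports, _attributes, _compile_info, functions} =
  :beam_disasm.file(beam)

IO.puts("disassembled #{inspect(module)}")

# Print the instruction list for add/2: you should see the tuple tests,
# the element extraction, and the erlang:+/2 additions described above.
for {:function, :add, 2, _entry, instructions} <- functions do
  Enum.each(instructions, &IO.inspect/1)
end
```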
and the map version is again similar. One thing that's kind of neat is this instruction get_map_elements: it takes a nice interleaved array of keys and the registers to store their values in as it looks them up. You'll also see this interesting bit where it pulls out whatever's in the __struct__ field and then checks: hey, is that field equal to Elixir.Vector2D? That's how the struct pattern match is implemented — which is kind of slick, because if you remember, structs in Elixir are just maps with the __struct__ key set. The GenServer bytecode I'm just not going to show; it looks very similar to the tuple version for the actual addition part, and the rest of it is a bunch of message sends.

So what are we seeing here? We're seeing that the struct variant has the fewest opcodes, yet it was extremely slow when we actually measured it. Why is that? Well, some of those opcodes are chunky: they do a bunch of stuff — they're not just testing a value, they're doing a bunch of destructuring and other work. The list variant spends fully half of its opcodes on error checking. All the variants do garbage-collection bookkeeping, which is kind of interesting: even though we're just extracting a couple of numbers, smooshing them together, and sending them back out, we're still pausing to give a chance to garbage collect, via the gc_bif2 or test_heap ops. And finally, the tuple variant uses the fewest registers, which I think is interesting: several of these use six or seven, and tuple uses four.

So which of these matters? I'm pretty sure what makes the tuple fast is that it uses the fewest registers, and what makes the struct and list variants slow is that they either do a whole
lot of extra error checking — and they probably kind of have to — or they use instructions that are just heavier-weight. So at this point we can conclude that with Elixir 1.12.3 and Erlang/OTP 24.1 on x64 Linux, on an Intel architecture, with small integers specifically, tuples are the fastest way to add 2D vectors, and we believe this to be the case because they use fewer registers and less memory. I want to stress that we had to be very precise there, and we can't generalize those results: if anything changes, we have to test again.

Look: there's no optimization without experimentation — that's just superstition. If we're going to say that something is faster, then we actually have to test it, and it's tricky to do that work. We have to make sure we understand exactly what architecture we're on, that we're testing exactly what we think we're testing, that we're warming everything up, and that we're doing it in a clean environment. But all of this is great, because it means we have the power to write performant code. We can write efficient code in Elixir, in Erlang, in LFE, and in other things; we have the tools to write fast code, and we don't necessarily have to leave our tooling or our languages behind. We just have to put in a little extra work.

Today we saw the science, and I'm going to leave you with the engineering, because the engineering is up to you. Engineering is science with cost in the equation; engineering is science when you have a budget. As a developer, does your day include time — between meetings and progress reports and code reviews — to do this extra benchmarking? Is your project one where you're sending people into space and everything needs to be hard real-time, so you're going to put in the hours to profile? Or are you a web developer just putting up an MVP, and
it might not matter? Then maybe it's okay to cut corners and not put in this extra effort on the code. But I will caution you — and I'll caution all of us as an industry — that inefficiency is ultimately paid for. In the short term it's paid for by a test suite that takes a long time to run, or a compile that takes forever. In the medium term it costs our employers money — or, if we're self-funding, costs us money — because we're paying for servers to run code that's not as fast as it could be. And in the long term it's our kids and our grandkids and future generations who pay for it, when maybe the water level has risen because the glaciers have melted because we wasted a bunch of heat and energy on these poor little rocks, heating registers to run code that wasn't optimal — time spent on sorting or addition or math that was just inefficient because we didn't profile it, and enough of us did this for long enough that it hurt the planet.

So the engineering is up to you, and I hope that with what we've talked about in this hitchhiker's guide you can say: hey, I'm empowered to do good engineering, and I'm empowered to make the world, and my code, a better place.

I'd like to thank my colleague Lee; I'd like to thank several people, organizations, specific contributors, and keepers of arcane knowledge; and I'd even like to thank Ericsson for documenting at least several parts of how the BEAM is implemented. This is all a group effort — we're in this together — and I'd like to thank everybody who helped me get here. Finally, it's time for our question-and-answer period, so if you have any questions I'd be more than happy to answer them. Thank you.
Info
Channel: ElixirConf
Views: 570
Keywords: elixir
Id: cONM2YwCO0Y
Length: 27min 22sec (1642 seconds)
Published: Sun Oct 24 2021