Functional Performance by Martin Thompson at Functional Conf 2017

Captions
[Host] Morning — it's 9:00, we're going to get started. How many of you were here yesterday? Wow, that's a lot of people, and that's why you're here on time: we like to start on time and end late; that's how we do it. A few people were asking about slides and other things — I'm going to give the welcome talk after this, where I'll cover some of those logistics, but I wanted to kick off the day with something that's very important, and hence we have Martin here.

I actually met Martin a couple of months ago in Hong Kong; we were both presenting at a conference, and Martin was doing this talk on the performance work he's been doing. One of the things that occurred to me is that as functional programmers we seem to be so far away from the metal — a lot of the time we talk about using multiple cores and such, and sitting in his talk I realized that I actually don't understand anything about even using a single core. So I think Martin's talk is really going to help. I actually requested that Martin create a new talk for this conference: to talk about how we can leverage a single core, and then how we can leverage multi-core. So without much ado — thanks, Martin.

[Martin Thompson] OK, good morning everyone — hopefully you can all hear me. Everyone awake? Have you all had coffee yet? I'm going to ask questions of everybody, so a presenter who's going to talk to you for a while could be very scary.

My normal programming languages are not the functional languages — can I confess a little bit? I write Java most of the time, which is probably pretty horrible to most of the people here. I write a lot of C and C++, and I even write a fair bit of x86 assembler, so that's all horrible, horrible stuff. But truth be told, I actually like functional programming, and I do a fair bit of it — if anyone sees my C or Java, it actually looks quite functional. There are a lot of really good things in this space, and what I want to talk about today is how we bring together some of the functional side along with an awareness of how computers actually work, and also some basic mathematical principles. We're going to dig into this a little bit.

So let's start off with a question, and look at memory operations. We're all kind of used to the idea that all memory is equal — it's random access memory, so it doesn't really matter what we do. But does it? Let's test some of our thinking, test our understanding. Here's a simple question: if I want to go through an array of memory — a big vector — touch every single slot, every element in that array, and add up all of the values (they're all integers), and that array is one gigabyte in size, what is the average time per operation to do this?

I said I was going to ask questions, so: what do we think? How much time — nanoseconds, microseconds, milliseconds? What do you think is the time per element in the array? And you've got to think: I've got to load the element, I've got to sum it into a scalar with what I have so far, then I'm going to move on to the next element, load it, sum it up, and move on. What is the average per operation to do that?

"It depends" — good answer; that's always the right answer in computing: it depends. But if we were to measure it, what do we think we'd get? What would be the typical average on, say, a two or three gigahertz x86 machine, which is what I tested this on? Guesses? Ten — ten what? Ten
nanoseconds? OK, cool — let's see what happens when I actually run the test. Here's the answer: I've written a benchmark, I warm up the JVM and get the result, and I find that it's actually 0.8 of a nanosecond per operation. And you go: how do I end up with roughly one nanosecond when I run this?

You'd probably be quite surprised if you look inside a modern processor. Is it really one nanosecond? This is inside a Haswell CPU — sorry for the squished screen, we had something up with the presenting mode. Inside, we don't have one ALU. If you studied this at school and looked at the old von Neumann architectures, you have one ALU and you have memory. We don't have one ALU in our modern processors. In fact we have a scheduler inside the processor — not an operating-system scheduler, this is inside the processor core itself, not the processor socket — and we have eight ports on that scheduler, and off those we have multiple ALUs. Look: there's an ALU, there's an ALU, there's another one — there are four inside a typical CPU core. So you can be doing four integer operations at the same time.

You may say: how can that be? I've written single-threaded code; how can I get that level of parallelism? What you find is that there's actually a lot of natural parallelism in code when you write it. For example, if I'm going to go over that array and add up all of the values, I'm incrementing the value which is the summation I have so far, but I'm also incrementing the index that I'm walking over the array with. Those two things can happen at the same time — one can happen slightly before or slightly after the other; it doesn't matter, as long as it's ready by the next iteration — and we can go forward with that. So we get quite a lot of natural parallelism in there. We're looking at: load a piece of memory, store it in a register, increment the scalar I've been keeping for the summation by that register, store that result, write it back out at the end, and keep going around doing this. So our processors can do a lot at the same time.

[Audience question about paging] Yes — good question, good point. There's no paging here. The way the benchmark has been set up is that I've created an array one gigabyte in size and pre-initialized it with random values. It's particularly important to do this with random values so the compiler doesn't just eliminate all of the dead code and give you the answer at the end — our compilers are very smart; they won't compute something at run time if they can do it at compile time, and they'll strip away all of that. So this is an array that's been pre-allocated with random values put into it, and that's how we get these sorts of results.
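A minimal Java sketch of the kind of benchmark being described — not the actual harness from the talk (a real measurement would want proper warm-up, e.g. via JMH); the array size and timing approach here are illustrative assumptions:

```java
import java.util.concurrent.ThreadLocalRandom;

public class SequentialSum {
    // 2^28 ints * 4 bytes = 1 GiB; needs roughly -Xmx2g to run.
    static final int SIZE = 1 << 28;

    public static void main(String[] args) {
        int[] data = new int[SIZE];
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < SIZE; i++) {
            data[i] = rnd.nextInt(); // random values so nothing folds at compile time
        }

        long sum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < SIZE; i++) {
            sum += data[i]; // index increment and summation can overlap on the ALUs
        }
        long elapsed = System.nanoTime() - start;

        // Printing the sum keeps the loop from being eliminated as dead code.
        System.out.printf("sum=%d, %.2f ns per element%n", sum, (double) elapsed / SIZE);
    }
}
```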
But let's make it a bit more interesting — what if I've got a different pattern of access? Let's try some different patterns. What if, rather than going through the whole array linearly, I decided to stay within an operating-system page — a 4-kilobyte page — pick the addresses within it at random, and add up all of the values in that 4K page; then move on to the next page and do the same thing, and the next, and so on? So it's a similar amount of work, but done with a different access pattern. Then I'm going to try something else: what if I just go randomly around the whole heap — does it make a difference when I'm not staying inside the same operating-system page? Or, let's make it really interesting: what if I go from one step to the next and I can't take the next step until I know the value of the previous step? The step function I apply needs to have read the cell, and the random value in the cell feeds into the random step that comes next.

I'm going to write this so that it all does roughly the same amount of work — we have all of those ALUs and all of the address-generation units to do this work in parallel without really impacting the code. The only thing that's different is the memory access pattern. Let's see what happens if I benchmark some of those scenarios.

So here are the different scenarios: our sequential pattern going through the array; randomly within a page, then stepping on to the next page; randomly within a page where the next step depends on having read the previous value; randomly through the whole heap; and randomly through the whole heap where the next step is based upon the previous value. Similar work — big difference between the different steps. For those at the back who can't quite read it: it's nearly 90 nanoseconds per element when I'm going randomly through the whole one-gigabyte heap with each value dependent on the previous one.

Why is this so different? Well, our hardware is doing some very nice things for us. In the first, simple case, where we're just linearly going through the array reading everything sequentially, we have a thing called a hardware prefetcher helping us: it's going ahead and getting the memory before we need it, so the memory is in our cache, ready to use whenever we need it. The in-page random one is benefiting from something different: it doesn't get the hardware prefetcher, but it has the TLB cache — the translation lookaside buffer — where our virtual memory addresses get converted to physical memory addresses. If we had to do that lookup every single time it would be slow, so we do the lookup at a per-page level, not a per-address level; if you stay within the same operating-system page, you benefit from that cache. If we start going randomly across the wider heap, the heap in this case is too large for the TLB cache to help, and it's too large to benefit from the prefetcher.

But there's something interesting going on with the dependent loads. What's that? Well, consider what happens on a cache miss going to main memory: you go to the cache, and if it's there you get it; if it's not, you've got to go to memory to get it. Our x86 processors can track ten concurrent cache misses at the same time. So in the case where you've just got random values with no dependency between them, those ten can go off at the same time, be amortized, and come back together — we divide the cost basically by ten. But if you can't take the next step until you know the value of the previous one, there's no going off in parallel to get it. It's like going out to get some shopping: if you have to get the first item before you know what feeds into the next one, you can't go ahead and get the other nine items; if they're totally independent, you can fetch them all and bring them back together. So how much dependency we have in our data structures and our code starts to really matter, and this becomes fundamental to the design of our data structures and the design of our code.
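To make the difference concrete, here is a hedged sketch of the two random-access patterns; the masking trick for deriving the next index is my illustration (it assumes `data.length` is a power of two), not the talk's exact code:

```java
final class AccessPatterns {
    // Independent random reads: up to ten outstanding cache misses can be
    // tracked at once, so their latencies overlap and amortise.
    static long sumIndependent(int[] data, int[] randomIndices) {
        long sum = 0;
        for (int idx : randomIndices) {
            sum += data[idx];
        }
        return sum;
    }

    // Dependent reads: each loaded value supplies the next address, so the
    // misses serialise and every step pays the full main-memory latency.
    static long sumDependent(int[] data, int steps) {
        long sum = 0;
        int idx = 0;
        for (int i = 0; i < steps; i++) {
            int value = data[idx];
            sum += value;
            idx = value & (data.length - 1); // next index depends on this load
        }
        return sum;
    }
}
```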
So let's go forward from that. I'm going to expose you to some fundamental laws, and then we're going to dig into some implementations and see what happens.

First fundamental law — anybody know what that is? Come on, this is India, the home of mathematics. Queueing — yes, Kendall notation; that's an M/D/1 queue. How does this play out? If we want to work out the mean response time of something, we need to know the service time and the utilization. Service time and utilization are very interesting in how they work together. Service time is the time to do a job; utilization is, for some resource, what percentage of its time is busy doing that. For example, if I've got a job that takes 100 milliseconds and I'm running it once per second, I'm using 100 milliseconds of that second — it's 10% utilized; if I'm running it five times per second, it's 50% utilized. So you'll see this one little equation quite often: ρ = λS, where ρ is utilization, λ is the arrival rate, and S is the service time. You'll also hear this called Little's Law. It appears in lots of places; it's really fundamental to how things work.

So what if we plug some figures into this — what does it look like from a response-time perspective? You'll see that this is not a linear function. If I increase the utilization along the x-axis, what happens to the response time on the y-axis? Things look pretty good as you ramp up utilization until you get to about 70%, then all of a sudden you go up this curve where things become unresponsive. If you use things too much and you've no slack in your system, you end up with something very unresponsive — it doesn't respond. And if you work in teams, you'll know this: if there's no slack in your teams, you just can't react to requirements from the business. Anybody who's got a project manager knows — if you work your teams to the limit, they become totally unresponsive.

How can we put this in a bit more context? This is what happens when you get something fully utilized. I'm a huge fan of cars, and the road system here is insane — I have never seen anything like it before. I can look at the mathematics of queueing theory and see what's going on, and your road system is a wonderful example: when you use something too much, this is what happens, and it ends up in gridlock. So how could we make this better? A tip for driving: if you leave no space to the car in front, you have no slack, and if you've got no slack, when something goes wrong you have no buffering — and that's what you end up with as a result. If you want a tip to improve your driving, keep a little bit of distance. It's basic mathematics; you cannot get away from it. Yet most people drive right up to the car in front — it happens all over the world. We've got to allow a little bit more slack, and it's the same in our software; we'll see some of this later.
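Plugging figures into that curve is easy to do yourself. Here is a small sketch using the textbook M/D/1 mean response time, R = S(2 − ρ) / (2(1 − ρ)) — presumably the shape behind the slide, though the exact formula used there is my assumption:

```java
public class QueueResponse {
    // Mean response time for an M/D/1 queue, where S is the service time
    // and rho = lambda * S is the utilization.
    static double responseTime(double serviceTime, double rho) {
        return serviceTime * (2 - rho) / (2 * (1 - rho));
    }

    public static void main(String[] args) {
        double s = 100.0; // the talk's 100 ms job
        for (int i = 1; i <= 9; i++) {
            double rho = i / 10.0;
            System.out.printf("utilization=%.0f%%  response=%.0f ms%n",
                    rho * 100, responseTime(s, rho));
        }
        // Response stays near S until ~70% utilization, then climbs steeply:
        // 50% -> 150 ms, 70% -> ~217 ms, 90% -> 550 ms.
    }
}
```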
Another equation — what's this one? Maybe fewer people recognize it. Anyone heard of Amdahl's Law? This is not Amdahl's Law; this is actually what's known as the Universal Scalability Law, which takes Amdahl's Law further. In fact, Amdahl's Law was just an argument put together by Gene Amdahl to scare people away from minicomputers: he wanted you to buy his mainframes, which were very fast single-threaded machines; he wanted people to use them rather than the minicomputers, so he tried to scare people off. Neil Gunther, who came up with the Universal Scalability Law, discovered when he was measuring systems at Xerox PARC that they couldn't achieve Amdahl's Law — they couldn't get the scale-up factors — and what he realized is that there are two major components to scaling. One is the contention penalty, which is what Amdahl talked about; that's the alpha in this. But there's also the beta that needs to be considered, and that is the coherence penalty.

What is coherence? How do we think about it? It's really agreement. Whenever you get multiple parties working together, it takes them time to agree to do something — and that's on top of the contended part; you have to reach agreement as well. In a cache subsystem, that's the time to get the state to where everybody knows what's going on. Without that agreement, you really limit what can happen.

What does this look like if we feed some figures into a graph? Let's take a job that can be run 95% in parallel — that's pretty good, really, in many cases. So there's 5% of it I have to do under, say, a lock, under some form of contention, and 95% I can do in parallel. If I add CPU cores, or CPU nodes of some description, how does it scale up? And then I'm going to mix in a coherence penalty, which in this case is 150 microseconds. What's the difference between pure Amdahl's Law and the Universal Scalability Law? You get this graph. The blue line is Amdahl's Law in action: as I add processors, the speed-up factor never reaches 20x — because if you think about the problem, if it's 95% parallel, then one twentieth of it, the 5%, cannot be done in parallel. It doesn't matter how much you subdivide the rest; you're still left with that one twentieth, so you cannot get more than a 20x speed-up no matter how many processors you throw at the problem.

But what if it takes time to get agreement? The agreement is where it gets interesting. With small numbers of processors it's easy to get agreement; as the number of processors goes up, the cost of agreement starts to become the dominant factor — you don't even get near the Amdahl limit, and after a while the speed-up actually starts to go down. Anybody noticed how this works in teams? Small teams are really good; they're really productive. As teams grow past a certain size, they actually slow down. It's mathematics — it hunts you down; there is no escape; you cannot beat mathematics. That's the coherence penalty in action.

You see this in lots and lots of places. Let's put it into real life and context again: this is what happens when you try to get agreement across a road without rules — it slows down, because you're spending so much time reaching agreement rather than actually flowing. What I find fascinating in this picture is, if we look up here, that's a stoplight. Are those just decoration — did they just add a little bit of color to the street? It seems nobody uses them. I've seen this in Italy as well — I think in Italy they're advisory — and you see the same sort of scene in Brazil and a few other places: they really are just decoration, street furniture, and people do their own thing. But if we get organized and we reach agreement quicker, we can achieve really quite cool things.

Now, where is some of this going? For a long time we've had single processors, or just a few processors in a machine. What if we have many more processors — where are we going?
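The Universal Scalability Law itself is easy to compute: C(N) = N / (1 + α(N − 1) + βN(N − 1)), where α is the contention (Amdahl) term and β the coherence term. A sketch with the talk's 5% serial fraction and a made-up β — the talk's 150 µs coherence penalty doesn't translate directly into a dimensionless β, so the value below is purely illustrative:

```java
public class Scalability {
    // USL speedup for n processors; beta = 0 reduces to Amdahl's Law.
    static double speedup(int n, double alpha, double beta) {
        return n / (1 + alpha * (n - 1) + beta * n * (n - 1));
    }

    public static void main(String[] args) {
        double alpha = 0.05;  // 5% of the work is serialized
        double beta = 0.002;  // illustrative coherence cost
        for (int n = 1; n <= 256; n *= 2) {
            System.out.printf("N=%3d  Amdahl=%6.2fx  USL=%6.2fx%n",
                    n, speedup(n, alpha, 0.0), speedup(n, alpha, beta));
        }
        // Amdahl's curve approaches 20x asymptotically; the USL curve
        // peaks and then falls as coherence costs come to dominate.
    }
}
```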
AMD are just bringing out their new Ryzen chips, and their Infinity Fabric server parts are fascinating. This is what an AMD chip looks like in a server: we've got four little groups of chips on the same socket, each of those has eight CPU cores, and those individual cores can each run two hyperthreads. So we've ended up with 32 cores here and another 32 over here, all interconnected in different ways. What's interesting about this is that you've got so many cores freely available now that we can go parallel — but look at the interconnects: they are not all the same. You've got different levels of coherence time, so you cannot escape this. Two cores talking together here is different from a core here talking to a core over there, taking a walk through this mesh.

The figures I showed you for memory access times are your best-case scenario on a single core. It gets much worse with multi-core, because when you go to memory you may be joining a queue — the waiting and queueing effects come in. And if you start dealing with cores communicating across different sockets, they're using the fabric to talk, with longer latency, so you've got more coherence cost. There's a lot here we need to think about: how do we design for this and deal with it, particularly when we care about what's going on underneath? Who's heard of NUMA? Non-uniform memory access — that's what this is. Memory costs are not the same these days; the location makes a big difference. It really is location, location, location — it matters a lot.

Now, how do we put this together? There's an interesting concept called systems engineering, where we look at problems as a whole. In a mature engineering discipline in another field — civil engineering, mechanical engineering, chemical engineering, whatever — you don't just do your one little bit; you have to have a wider view of what's going on. You can specialize in an area, but you need the wider picture. Today we need some understanding of the stack on which we run, of our choice of data structures, and of the programming patterns we apply to them — and if we show some of that sympathy, we can achieve really good things, both for our hardware and for our software as a whole.

So I want to talk about how we put some of this in context — what sort of stuff is useful — and I'm going to use the example of a messaging system I worked on. I pick messaging because it's a cool example: it has lots of really interesting features, and it also happens to be something I'm quite experienced at. Messaging systems are highly concurrent, so we're going to deal with all of that side; they're also distributed, so the coherence costs can get quite large; and they require not just CPU and memory knowledge — you also need a fairly good knowledge of networking and what can go on if you build one of these things from scratch. So it's a really good test case for putting this stuff together.

What are the normal approaches? If you're going to build a messaging system, you can take the easy route, which is to build it on top of TCP, where a lot of the work is taken care of for you. The hard route is to build it on top of an unreliable protocol like UDP. You can get certain benefits from that, and we're moving very much in this direction, because TCP was not designed for what most people are using it for.
TCP is actually a really good protocol — it's just designed for something other than what everyone's typically doing with it today. Does anybody here use Google Chrome? Do you use any Google services — Gmail, Google Docs, lots of that stuff? When your Chrome browser is talking to one of those Google services over HTTP, do you think you're using TCP? Anyone think they're using TCP? Most people probably think they are. In fact, you're not, because Google have realized that TCP is an inappropriate protocol for that, and they're using a protocol called QUIC that runs over the top of UDP — based on a lot of the problems we're having with queueing, connection setup, and different ways of working. I work on a messaging system called Aeron; it's UDP-based and does similar things, because that's more suitable for this sort of environment.

But once you go to UDP, you've got a couple of interesting problems: how do we deal with unreliable delivery? We send a packet from one machine to another: it may not get there; packets may arrive in a different order than you sent them; they may in fact arrive multiple times. You get some really interesting things you've got to deal with, and to do that you typically use interesting data structures — things like skip lists, trees, scoreboards, those sorts of things — where you put the packets back together on the other side so that they're in the order you sent them, and then we can deal with them. It becomes an interesting challenge.

Now, typically, if you look at trees and structures like this, you end up with something that has a lot of depth — and what happens when you go from node to node? Remember the memory access patterns: the dependent loads, the data dependencies. This is called a data-dependent load; you'll also hear it referred to as pointer chasing. It is one of the biggest performance hits you will take on a modern processor. If you've got linked lists — any sort of data structure where, to get to a node, you start up here and have to resolve this one, then that one, then the next, and the next — you can't jump ahead, you can't do anything in parallel; you have to resolve everything before you get there. So we've got to watch what we're doing, and I think this is one of the biggest challenges for the functional programming community, because so many of the data structures are built on linked lists and tree-based structures.

How do we achieve immutability? We do a thing called path copying, all of the time. For example, if I want to change something down here in this tree, I take the tree and create a new version with a new root. I can reuse a lot of the old tree, but I've got to path-copy from the leaf back to the root, replacing all of the nodes along the way, and then I've got a reference to an immutable data structure that I can use. This is kind of cool, but it gets a bit worse than that.

(What's wrong with this thing? It isn't working anymore — the Wi-Fi just stopped working and as a result it's locked my machine up, so I'm going to have to use this instead.)

So, if we do the path-copy stuff, we create the whole path from bottom to top — and if I do that, I've got an interesting challenge when I want to do it concurrently. The typical technique is: you build from the bottom up, and then you do a compare-and-swap on the root node to publish the new version.
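Here is a deliberately simplified sketch of that pattern — path copying on a plain unbalanced binary tree, published with a compare-and-swap on the root. Real persistent collections are far more sophisticated; the names and structure here are mine, and the retry behavior is discussed next:

```java
import java.util.concurrent.atomic.AtomicReference;

final class PersistentMap {
    static final class Node {
        final int key, value;
        final Node left, right;
        Node(int key, int value, Node left, Node right) {
            this.key = key; this.value = value; this.left = left; this.right = right;
        }
    }

    private final AtomicReference<Node> root = new AtomicReference<>();

    // Path copy: allocate a new node for every step from the root to the
    // insertion point, sharing all untouched subtrees.
    private static Node put(Node n, int k, int v) {
        if (n == null)  return new Node(k, v, null, null);
        if (k == n.key) return new Node(k, v, n.left, n.right);
        if (k < n.key)  return new Node(n.key, n.value, put(n.left, k, v), n.right);
        return new Node(n.key, n.value, n.left, put(n.right, k, v));
    }

    // Publish with CAS; on failure another thread won, and all of the
    // copying work is thrown away and repeated.
    void insert(int k, int v) {
        Node current, next;
        do {
            current = root.get();
            next = put(current, k, v);
        } while (!root.compareAndSet(current, next));
    }
}
```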
That swap replaces the root and gives us concurrent access while everything stays immutable. But the problem with that is: if it fails, because another thread has beaten you, you repeat the whole process over again. You start at the bottom, you copy all the way up to the top, you try to CAS it in, and maybe it works. Go back to the Universal Scalability Law at this point: you've got the contention penalty of building the structure up, and then you've got the coherence penalty of getting agreement, and the more threads you've got doing it, the more you repeat the process. Ten threads doing it at the same time: one of them is going to succeed and win; the other nine are going to fail and have to do it all again; of those nine, one will win and the rest will fail again. You can see the math unfolding here — it becomes a really expensive and difficult thing to do. You also end up in garbage-collection hell, because all of those nodes have to be allocated, and they've all got to be collected again afterwards. So this becomes a difficult and interesting challenge.

If you build these things, you end up with an interesting mental model. In the UK, people love sausages. These things are as horrible as you can imagine to a vegetarian culture: usually the skin is something like a pig's intestine or a sheep's stomach, and inside that they stuff meat and lots of other things. They're really quite gross. Many people find them tasty, but they're really not very nice if you understand anything about how they're made. So here's my way of thinking about this: immutable functional data structures are a little bit like sausages. They look beautiful on the outside, and they may taste nice, but the more you know about how they're made, the less well you sleep. I've built a lot of these structures, and sometimes people are enjoying a beautiful facade when they really don't want to see what's behind it.

So where do we go from here — what do we do, and how do we do this better? One of the really interesting sets of techniques available is CRDTs — conflict-free replicated data types. There are some great ideas in this space that we can use for concurrency. What I want to look at is how we take some of the concepts from CRDTs, along with some of the other nice functional concepts, and build something that performs but also gives us good behavior underneath — nice and easy to reason about, with all the benefits of functional programming.

There are two major types of these. One is the operation-based type, known as the commutative replicated data types — you'll see them written CmRDT. The other is the state-based type, the convergent replicated data types — CvRDT. Let's look at these in a little more detail: what good properties can we get? With the commutative ones, you replicate operations. Imagine I want to change the structure: I can send an operation like "decrement 10 from a value" or "increment 10 to a value," and if these operations all arrive at an endpoint, you eventually end up with the data changed on the other side. The catch is that they must all be commutative operations: if I send a plus-10 and a minus-10, it doesn't matter what order they arrive in — we end up with the right result. That's a really nice property. But they can't tolerate being applied multiple times — the operations are not idempotent — so if I'm going to build a data structure that goes over a network, I can't really do this without an underlying reliable transport.
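A tiny sketch of the operation-based idea — commutative but not idempotent (illustrative code, not from any CRDT library):

```java
// Operation-based (commutative) counter: replicas apply signed deltas.
final class OpBasedCounter {
    private long value;

    void apply(long delta) { value += delta; } // ops commute: order is irrelevant
    long get()             { return value; }

    public static void main(String[] args) {
        OpBasedCounter a = new OpBasedCounter();
        OpBasedCounter b = new OpBasedCounter();
        a.apply(+10); a.apply(-10);             // one order...
        b.apply(-10); b.apply(+10);             // ...and the reverse order
        System.out.println(a.get() == b.get()); // true: commutative
        b.apply(+10);                           // deliver the same op twice...
        System.out.println(a.get() == b.get()); // false: not idempotent, which
        // is why op-based CRDTs need reliable, exactly-once delivery.
    }
}
```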
The other type is also interesting: here we replicate the state from one structure to another. We can use this for concurrency and we can use it for distribution, but we have to replicate the whole state. The interesting thing is that the state must increase monotonically — it must be purely monotonically increasing, append-only. The problem with that is removal: if you need to remove things, the solution is a tombstone, and I'll show an example of how that can work. When we do this we need a merge function, and that merge function must be associative, commutative, and idempotent — it has to tolerate the same state arriving multiple times — and we've got to deal with the conflicts. The really nice thing is that some of this can be handled at a lower level if we work with deltas, and deltas are a good way to go.
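A classic state-based example is the grow-only counter, whose element-wise-max merge has exactly those properties — associative, commutative, and idempotent. A minimal sketch for a fixed, known set of nodes:

```java
import java.util.Arrays;

// State-based (convergent) G-counter: state only ever grows, and replicas
// can exchange full state in any order, any number of times, and converge.
final class GCounter {
    private final long[] counts; // one monotonic slot per node
    private final int nodeId;

    GCounter(int numNodes, int nodeId) {
        this.counts = new long[numNodes];
        this.nodeId = nodeId;
    }

    void increment() {
        counts[nodeId]++; // each node only ever advances its own slot
    }

    long value() {
        return Arrays.stream(counts).sum();
    }

    // Element-wise max: merging the same state twice changes nothing.
    void merge(GCounter other) {
        for (int i = 0; i < counts.length; i++) {
            counts[i] = Math.max(counts[i], other.counts[i]);
        }
    }
}
```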
So how do we build something like this and bring these techniques together? Aeron is an example of this — it's not really a sales pitch; it's about how we take the ideas and apply them. One of the concepts is to apply that systems-engineering thinking: I've got to shift messages from one machine to another; I want to respect the hardware, I want to respect the data structures, and I want to respect the properties of the network. Can I pull those all together and come up with a good result?

The log buffer is one way of doing that. I want something that goes through memory as a nice big linear array — I don't want to access it in random patterns — and I also need to be able to deal with it concurrently. So I have a file, and that file contains a header for each message, followed by the message body, and a tail counter for where the next message goes. I can grow this data monotonically: I can add in messages, but there are multiple steps to that — I need to increment the tail, I need to copy in the message, and I need to apply the header. If I do that, I can move forward in a sensible way.

You may say: why not one big file that goes on forever? That's where a lot of the academic research in this area goes, and there's an interesting thing you find: you read academic research papers and they don't work in practice, for lots of very practical reasons. One of the practical reasons here is that an ever-growing file doesn't work because of things like page-cache churn and virtual-memory pressure — we'd be constantly allocating. So how do we deal with that? We use the good old technique from backpacking: wash one, wear one, dry one, and bring them around with you. We can reuse buffers and keep going: I've got a clean buffer, I've got a dirty buffer, and I've got the currently active buffer, and I can keep adding and reusing them over time.

Where did some of these ideas come from? Is anybody familiar with a guy called John Carmack? He started id Software — Quake, Doom, all that sort of stuff. He wrote the original games engine in C, and then he immersed himself in functional programming. He liked a lot of what he found — he rewrote the whole thing in Haskell, and he learned a lot from it; it took him over 18 months, so he really immersed himself properly. When he came back from it, he realized there were really good elements of both. One of the things he wanted to bring across was immutability, because games are a highly parallel, highly concurrent problem, and immutability really helps with that. But when he tried a pure-immutability approach, he discovered it didn't work and it didn't scale: the garbage-collection pressure and the management of it all became totally impractical. What he realized is that if you limit it to periods of time, you can have immutability within a time frame and then start over: he'd be functional up to a point, hit a checkpoint, then go imperative and change the state of the world, so he could start again with a clean slate. He did this frame to frame within the game. I stole the idea and did it buffer to buffer: within a buffer I get this nice immutability, and once I reach the end of the buffer I flip over — there's an ordered technique to do that — and then I continue again. So sometimes it's not one or the other; we blend the techniques together.

So how do we do concurrent publication into something like this? I don't want locks; I want something that works well under the Universal Scalability Law. What do we need to care about? Let's say two threads both want to put messages in here. I could lock the structure, put in what I need, and release the lock — but now I've got a large contention penalty, and I've also got a large coherence penalty, because locks are often quite expensive: many microseconds. What can I do that cuts down the contention and coherence penalties? For the contention penalty, I want to do the minimum possible under contention, and the minimum I can do is just increment the tail: one message increments it, then the other message increments it. That is the only contended operation, and its coherence cost is a single cache line between processors — we're down to very small amounts of time. Then, in parallel, each thread can copy in its message, completely independently, outside the concurrency window we need to care about. The really hard case is when I've actually reached the end of the log and I've got to deal with rotation: the writer that reaches the end can be responsible for filling in the rest of the log and rotating to the next buffer, then putting in its message again. It's all about looking at the contention and coherence penalties, pulling them apart, and fixing your algorithms to deal with them: no locks, no complexity, and we can go forward — and mostly it's because the tail is a monotonic function, which gives you the ability to reason about a lot of this.
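A rough sketch of that claim step — the single contended operation is one atomic add on the tail, and everything else proceeds in parallel. Offsets, alignment, and the rotation path are simplified assumptions here, not Aeron's actual layout:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Concurrent publication into a log buffer: writers contend only on one
// atomic add, then copy their message into their own claimed region.
final class LogBuffer {
    private static final int HEADER_LENGTH = 8; // assumed: 4-byte length + 4-byte type
    private final AtomicLong tail = new AtomicLong();
    private final ByteBuffer buffer;

    LogBuffer(int capacity) {
        this.buffer = ByteBuffer.allocateDirect(capacity);
    }

    // Returns the claimed offset, or -1 if this writer tripped the end of
    // the log and must fill the remainder and rotate to the next buffer.
    long claim(int messageLength) {
        int frameLength = HEADER_LENGTH + messageLength;
        long offset = tail.getAndAdd(frameLength); // the one contended step
        if (offset + frameLength > buffer.capacity()) {
            return -1; // rotation: the writer that hits the end handles it
        }
        return offset; // copy of header and body now proceeds in parallel
    }
}
```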
You may ask: what's in a header, and how do we get this across to another machine? Our header is in typical IETF form: we lay out the fields — what's in the message, the length of the frame, the version, and some data to rebuild the structure on the other side. Merging the two sides together becomes really interesting, so we've got to look at what happens here: what if a publisher dies mid-operation? In the earlier lock-based approach, it could have died while holding the lock, and you'd have to revoke the lock somehow. Here, we can continue forward concurrently — in fact, even across processes — with no coordination other than what's in the buffer itself. We need a protocol, and protocols are how we interact nicely.

The way to do this: if I'm going to write the frame, I can't write it all atomically, but I can write it in an ordered fashion. First of all I write the length — and I write the length negative to begin with. If this is going to be a 100-byte frame, I write minus 100; then I fill in all of the rest of the header, I fill in the message body, and I say I'm done by flipping the negative back to positive. By putting in the negative to begin with, if this fails and I need to go in and fix it up, I know the size that was reserved, because I've got the absolute value. Writing it negative first and flipping the sign at the end is a really nice way of dealing with this; now I can write the different fields in an order I can reason about.

So how do we handle removal in the failure case? Say I've put in the frame length here and it's negative, and the thread crashes — or the whole process crashes. What do I do going forward? Think of the Erlang world: we just let things die, and then we clean up. I fill in the type field to say this frame is actually a tombstone, and at that stage I flip the length back to positive. Now we have monotonic state going forward, even in the failure case, right in the buffer.
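A sketch of that ordered-write protocol and the tombstone fix-up. Plain `ByteBuffer` puts are used for brevity — a real implementation needs ordered/volatile writes for cross-thread visibility — and this header layout is an assumption, not Aeron's wire format:

```java
import java.nio.ByteBuffer;

final class FrameWriter {
    static final int TYPE_MESSAGE = 1;
    static final int TYPE_PADDING = 2; // tombstone type for abandoned frames

    static void write(ByteBuffer buffer, int offset, byte[] message) {
        int frameLength = 8 + message.length;
        buffer.putInt(offset, -frameLength);       // 1: reserve, visibly incomplete
        buffer.putInt(offset + 4, TYPE_MESSAGE);   // 2: fill in the header...
        for (int i = 0; i < message.length; i++) { // ...and the message body
            buffer.put(offset + 8 + i, message[i]);
        }
        buffer.putInt(offset, frameLength);        // 3: flip the sign to publish
    }

    // If the writer died mid-operation, the negative length says how much
    // was reserved; mark the frame as a tombstone and move on.
    static void tombstone(ByteBuffer buffer, int offset) {
        int length = buffer.getInt(offset);
        if (length < 0) {
            buffer.putInt(offset + 4, TYPE_PADDING);
            buffer.putInt(offset, -length);
        }
    }
}
```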
That gives us what we need to replicate messages to another machine, and we go forward from there. As the messages arrive, we copy them in on the other side — copy in, say, the first message; the header contains all the information for where it needs to go. If something arrives out of order, we just copy it to its location in the buffer on the other side, because the header says what offset it starts at. We don't need the trees, the skip lists, the scoreboards, anything like that. Then another message arrives out of order — but notice there are two counters: the rebuild position and the high-water mark. With those two counters, whenever they come together, we know we've got a complete log; if they're apart, we know we've got a gap — we've got loss — and we deal with it. Just protocols: very simple ways of dealing with this, and it gives us something that's strongly eventually consistent, put together in a nice way.

What's also interesting is the memory layout: it's sequential all through memory, so the prefetcher will work really well, and it will vectorize on modern processors. And we can apply a lot of really interesting things on top: I can apply searches as monotonic functions across the log to look at the different states. Most of this is just mathematics and how it applies — very little branching, just looking for values. Lots of these techniques have been around for a long time, just not applied — the APL folks got a lot of this right a long time ago, and much of the industry has simply ignored it. So we can steal lots of cool ideas from other places. And how do we get to the stage where we can reason about all of this? That's another beautiful thing about the monotonic functions: everything we have — publishers, senders, receivers, subscribers — just keeps a counter, and the counter is the latest version of state it has ever seen. Now we can see progress right through the whole system: you can track it, you can debug it, you can deal with it, in really nice ways.

So, wrapping up now. In closing — and this is the interesting part — can we all agree that shared mutable state is evil? Is that reasonable in this room? I think it's a fair thing to say. So we want to avoid it. Given what we've learned, what can we do? If we are going to share state within a process, we start applying the different laws I've talked about: if we must update, we should only ever have single writers. If we've got concurrent update to anything, we're in the world of shared mutable state — it's hell, it's not nice, and we need to deal with it another way. The easy way to start thinking about it is: if we must share memory in a process somewhere, have only one owner from a write perspective; many owners from a read perspective is fine. How, then, do multiple processes or threads update it? They send a message — something on a queue — to the one owner that does the updating, and it can update with whatever type of structure it needs. Then it becomes nice and clean: we can hide all sorts of evils behind an interface, provided we only update it in one place. Once things get updated concurrently, it becomes really, really difficult.

The other, better way to think about this is shared-nothing. You've now got a bit of a flavor of the different structures and how we need to think about them: if we do not share them at all, we can do whatever we want. The whole idea of shared state is something we need to get away from — we need our own local state, and with our own local state we can use whatever techniques we want, and use messaging between them. I've spent a lot of my career looking at messaging and developing messaging systems, deliberately to get that isolation; it's a really, really important thing. We've got many languages where we share memory, and sharing memory is where the difficulty comes from: it gives us issues for garbage collection, issues for what types of structures we can have, issues for failure detection. We deal with it by putting the evilness of the sausages into the communication mechanism. If there's one thing I would say to the Erlang folks, it's this: you've got the stuff right at a high level; now you really need to invest in your messaging infrastructure so that it is super fast, super efficient, and has all the right properties to let this approach work really well. We build protocols on top of this, and the secret to making so much of it work well is to build things with monotonically increasing state. If we do that, everything becomes so much easier: we never go backwards, we always go forward, and we tombstone when we need to remove.

If you want to see some examples in code, have a look at Aeron, and come to the workshop I'm giving tomorrow. With that — thank you very much. I'll take questions; I think I've left enough time.
Info
Channel: ConfEngine
Views: 2,893
Rating: 4.89 out of 5
Keywords: FunctionalConf, FnConf17, multi-core, performance, littles-law, queueing-theory, usl, aeron
Id: OqsAGFExFgQ
Length: 43min 10sec (2590 seconds)
Published: Tue Nov 28 2017