Adventures with concurrent programming in Java: A quest for predictable latency, by Martin Thompson

Video Statistics and Information

Captions
Okay, good morning everyone. I think we're running about five minutes late, but the opening keynote was really worthwhile; I very much enjoyed it and it was a good laugh.

What we're going to talk about today is the fun I've had over the last ten-plus years dealing with concurrency and the challenges that come with it. I have a picture that sums up my view of concurrency: I was in Brazil just over a month ago, in São Paulo, where the traffic is just absolutely unbelievable. Look at the state of this junction. I particularly like this case here: why is that van pointing that way? I have to debug a lot of concurrent applications, and so often you find there's a thread somewhere in this kind of weird, weird state, and you ask yourself how it ever ended up there and how you work all of this out.

So I want to explore a number of things today. Let's start by thinking about what concurrency is and what it means to be parallel. One of the sensible reasons we do these things is that sometimes we want resilience in our systems: if you do something more than once at the same time and one of them succeeds while another fails, that's good, and we can have a lot more resilience in our systems; actor systems are often cited as a good example of this. But more commonly people employ parallelism or concurrency for performance: we want a greater amount of throughput, or we want a lower response time, lower latency, so we can respond better. Response time is something I get involved with a lot, because we need systems that respond quickly, particularly in finance. If a system doesn't respond quickly we get issues, particularly if it doesn't respond quickly all of the time; it needs to respond in a timely manner. We often end up with these awkward cases where we're responsive most of the time, but then, when we really need it to work, it doesn't. I like to think of it this way: if a system doesn't respond in a timely manner, it's effectively unavailable. If you're not there on time, it doesn't matter that you're only a little bit late; in many cases a little bit late is already too late, and sometimes it's really, really bad.

So what are we going to cover? I want to spend a little bit of time on the theory behind concurrency and what's relevant to the systems I've looked at. I'm going to talk about the quest of looking at data structures over the last decade to try to address some of the latency issues we've had, and then a little bit about where this can go next and what we've learned along the way.

Let's start with some theory and kick off with the idea of parallel and concurrent. We've heard these terms a lot; who can comfortably say they know the difference between the two? A few hands, and a few unsure. To be honest, I'd be a little unsure sticking my own hand up. One of my electives at university was parallel and concurrent programming, in a computer science degree, and sometimes I'm still not sure. I've become more comfortable over the years, but even the experts disagree. John Gustafson is a really good example: he says he doesn't even like to compare the two, because they're so confusing; whenever he writes he actually pulls these subjects apart rather than dealing with them together, with parallel being the opposite of serial, concurrent being the opposite of sequential, and vector being the opposite of scalar. He doesn't even like to cross those terms, which is interesting in itself.
I like Rob Pike's view: concurrency is about dealing with lots of things at the same time, parallelism is about doing lots of things at the same time. It's that dealing point that makes things complicated: we have to deal with multiple things at the same time, manage that somehow, and cope with the complexity it brings.

So let's focus on what we have to deal with when we're accessing a concurrent resource, with a couple of simple examples. Our modern systems have processors, memory and lots of other resources. Take a thread of activity running on a processor, dealing with some data structures we have in memory: is that concurrent or parallel? In this case it's fairly simple: it's neither concurrent nor parallel; single thread, single memory, nice and easy. Now introduce a different case where we use a single processor but we time-slice on it. This is typically the world we were in through the 90s, before truly multiprocessor systems were a common occurrence. Is this concurrent or parallel? It's not parallel, because the two things aren't really happening at the same time, but we are having to deal with the consequences of the fact that they interleave, so they are happening concurrently. Then we're into our modern world: even our laptops now have multiple CPU cores dealing with the same memory, with data structures shared in that memory. Are we concurrent or parallel here? Obviously yes on both counts, if we're dealing with the same resource at the same point in time. We can also have cases where the data structures are independent: separate data structures, no relationship between them, being dealt with by different threads running at the same time. That's parallel but not concurrent. Pulling these apart, the thing that really stands out for me is that parallelism is a good thing, because we can get more done; concurrency is questionable as to whether it's a good thing, because it introduces a lot of complexity that we have to deal with in our applications.

Let's look at this a different way, with a typical request-response flow through a system. We've got some input; we decode it, typically bytes coming off the wire; we then do a bunch of compute or work on some data structures; we store some results for the future, to keep a trail of what's going on, so we archive it; we then encode a message to go back onto the wire, and we send it out as a response to the end system. Now, we could put a single thread on this and it would work fairly well; in fact many very high-performance systems do just this, single-threaded.
But we end up with issues of throughput. We can have great response time, and this is actually a fairly decent response time, but while we're dealing with one request we can't be processing the next, and that slows us down. We can't make progress, especially now that we have multiple CPUs in our machines, and different steps along the way can take time: while I'm writing to disk I'm not making any other progress. The typical response, what our industry has gone for, is thread pools. This would be your classic Tomcat-style container: coarse-grained parallelism, with threads picking up a request at a time, going through the whole process and back again. I've seen a lot of systems like this; I ended up working on a lot of high-performance systems from about 2000 onwards and this was the common way people dealt with these things, and it probably still is the most common. But it has an awful lot of interesting consequences when you start measuring it, particularly around CPU caches. CPU caches are the main thing that gives us performance on a machine today, because the cost of going to memory is so much greater than the cost of going to the cache. For reference, you can typically process more than about a thousand instructions in the time of a full cache miss on a modern CPU, because we have multiple execution units even within the same core, so a miss is a huge wasted opportunity. If we can pick something up from the cache, that's a much better thing.

Let's go through this in a bit more detail and see how it plays out. In the coarse-grained parallelism world, with multiple threads running at the same time, we've got some really interesting concurrency issues: the data models we're working on have to be protected with locks, we have to deal with mutual exclusion, and all the different steps along the process are constantly flushing things through our caches, because a thread picks something up and runs from beginning to end doing all the different steps, which washes lots of data through the caches; things we've used recently get evicted because we're pulling in new stuff, because we're doing different things all of the time. We also start looking at writing down to storage, and we've got bottlenecks there, because we end up having to manage contention. If we have mutual exclusion on a critical section, we must queue up and take turns going through it; only one thing goes through at a time. So even though we've got all of these threads, we still have all of these queueing effects, and we haven't even begun to deal with them properly. Looking at all the different stages of the pipeline, there's a lot of concurrency here; sure, there's a lot of parallelism, but the concurrency is very difficult to manage. Who here finds it easy to hunt down concurrency bugs in data structures that are accessed by multiple threads? I don't think anybody finds that easy; it's an incredibly difficult thing to do, and we struggle with it. Even dealing with concurrency on a very simple, well-understood data structure is a difficult challenge; applying concurrency on top of some business process that's vague to begin with just seems like insanity, because it's so difficult to deal with. So what can we do that's different?
If you look at these sorts of scenarios, let's take away the coarse-grained parallelism approach. We've still got all of these cores and threads, so what do we end up doing? It's interesting to see how many of our modern systems evolve towards the same patterns when performance, scalability, security or other really difficult qualities of service have to be applied: we change to very similar designs, and the design we end up seeing is pipelines, where rather than having a pool of threads that can be applied to anything, a thread takes on each stage and owns that stage. Think of it like the real world: if you go to a factory, you don't see one person pick up a request at the door as the product is ordered, walk it through the entire factory and every step of the process, and then return it to the customer. There's a pipeline of activities, and different people deal with different parts of it.

To make this work we need queues between each stage; if we're going to have a conveyor belt of tasks, we have queues between each of the stages. This is where we push the concurrency into the queues rather than into the individual tasks. Is the task that's working on the data structures concurrent? Not as far as that thread is concerned, because there's only ever one thread dealing with the data structures of the business model, and it becomes much simpler to deal with. It's also hot in the cache: it's doing the same thing over and over again, so it stays hot in the cache for instructions and hot in the cache for data. Then we look at bottlenecks in the system, like going to storage: rather than everyone taking a turn and doing one operation at a time, we can batch things up, taking a lot of things and writing them in the same go. It's the sort of thing we see in the real world: if we all want to leave London and go to Manchester, it would be crazy to have one motorbike ferry each of us backwards and forwards; it would be much more sensible to get on a train, if British Rail worked properly, or hire a bus. We need to look at how we batch these things correctly, because when you start digging into this, the main performance enablers we have are our cache lines, how we pipeline, and how we batch. If we start thinking like that, we can get a lot more throughput.

The reason I've ended up looking at these designs is that I've worked on many of the coarse-grained parallelism problems, and I've then had to deal with situations involving incredibly high volumes, transaction systems going into hundreds of thousands of transactions a second, and you just find that the other approaches don't work; no matter what you do, they're really difficult to make work. So you end up with these pipelined approaches that are cache friendly and friendly in their use of resources, but the really nice thing about them is that they're much easier to understand and reason about, and our industry is starting to head that way.
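To make the pipeline-with-queues idea concrete, here is a minimal sketch, not taken from the talk, of two stages each owned by a single thread and exchanging work over a bounded queue; the stage names and the work done are illustrative only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal two-stage pipeline: a decode stage and a compute stage, each owned by
// exactly one thread, exchanging work over a bounded queue. Illustrative only.
public final class TwoStagePipeline
{
    public static void main(final String[] args) throws Exception
    {
        final BlockingQueue<String> decoded = new ArrayBlockingQueue<>(1024);

        final Thread decoder = new Thread(() ->
        {
            try
            {
                for (int i = 0; i < 10; i++)
                {
                    decoded.put("request-" + i); // blocks if the compute stage falls behind
                }
            }
            catch (final InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        });

        final Thread computer = new Thread(() ->
        {
            try
            {
                for (int i = 0; i < 10; i++)
                {
                    // The single consumer owns the business data structures outright.
                    System.out.println("processed " + decoded.take());
                }
            }
            catch (final InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        });

        decoder.start();
        computer.start();
        decoder.join();
        computer.join();
    }
}
```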
Now, when we get into concurrency, some interesting things stand out, and just to cover a little bit of terminology: we talk about things being blocked. A thread is blocked if something is stopping it from making progress, and that's the main thing you want to start looking for, because if you have to wait on something else, you stall the pipeline. There are many things that can stall these pipelines, so I end up focusing on them a lot. What prevents progress? What stalls these pipelines? They come in two classes.

One group I call the systemic pauses: things like our garbage collectors running. Most people are familiar with this: if you're writing some Java code, every now and again the garbage collector will run, it will stop the world during part of that process, and then you get to run again; while that's happening, nothing else can make progress. It's not just garbage collectors; we get similar things inside our operating systems, things like transparent huge pages. Who knows what a transparent huge page is? Probably a few people. It's kind of interesting. We don't deal with physical memory on our machines, we deal with virtual memory, and every virtual memory address must be converted to a physical memory address. We have a nice little cache inside the processor called the TLB that converts those virtual addresses to physical addresses so we know how to address our memory. They're very, very tiny: at the first level you have something like 64 entries, and a typical operating system page is only 4K, so 64 times 4K is a very small amount of memory on a modern server. We also have huge pages that can be two megabytes in size, with around 32 TLB entries for them, and Linux, for example, will try to use those 32 entries for 2-megabyte pages; they're called transparent huge pages because it uses them transparently. That raises a really interesting problem: after a while the memory for your process starts looking like Swiss cheese, with lots of little pieces of pages used and not used. Now, if you want to allocate a 2-megabyte page and you've got a whole load of memory, but it's all in tiny little 4K contiguous chunks, you can't allocate the 2-megabyte page; you have to compact to be able to do that. It's the same problem we see inside our JVM, and guess what, the operating system has to do some of that stuff too, and it can end up pausing for just as long as our GC does in many cases. These things are all over the place and we've got to be aware of them. Even down in the hardware, the hardware is constantly checking whether it's in a good state: am I too hot, am I seeing errors, have I got certain problems going on; these are called system management interrupts, and they can just stop the whole machine. So when we look at making progress there's this whole class of things; it's a great and interesting subject I could talk about for hours on its own, but I don't want to talk about it today, because I want to focus on the other class of problems, and that's the algorithms themselves.

Our own algorithms end up having concurrent parts where they interact with each other, and one thread can block another thread, one process can block another process. What are some of the cases of that? Things like just notifying another thread that you've completed part of a task, mutual exclusion, all the different concurrency primitives we have and how we employ them; those are the things that end up making our own algorithms quite interesting.

So, getting back to the quest: I've been building coarse-grained systems, and I've also been building pipelined systems.
When you start building lots of pipelined systems, you realise that queues are fundamental, so you look at lots of different types of queue, and if you spend enough time on this you start realising that even the queues can become the bottleneck in your application. So you end up measuring them a lot, and when you measure them a lot, certain things start to jump out at you; it becomes one of the interesting problems we're dealing with, and in particular the evils of blocking. As I mentioned, if one thread is stopping another thread from making progress, that other thread is effectively blocked. So what sorts of things can block progress with our queues?

We have two blocking methods on a queue: put and take. If I want to offer to a queue I do a put operation; if there's space in the queue I get to put into it, and if there's not space I block and wait for space to become available. That means I can't do anything else while I'm blocking. The good thing is the CPU can power down, or be used for something else via a context switch, but at least it's an honest sort of operation. It's exactly the same on the other side: I can take from the queue; if there's something there I take it, and if there's nothing there I block and wait for something to become available. This is how we pass work along the pipeline, but if we've got multiple threads we've got some interesting concurrency to deal with. The concurrency is all pushed onto the queue, because it's dealing with the interactions at either end; your own data structures may not have to deal with concurrency, but you're dealing with concurrency at the point of exchange. If you get underneath these queues and measure them, you find that to signal from the producer thread to the consumer thread that something is available, we have to use condition variables. If you profile your JVM and trace what it does when it calls into Linux, it's calling POSIX condition variables: within a mutex, once you've done some work and you want to signal that it's done, you signal on the condition variable, and the other side waits on the condition variable to be told it's done. It's the flag that lets you exchange between producer and consumer, to let them know that some work is there and available.

Now let's look at measuring the cost of this with a simple experiment. We'll have a producer that signals on a condition variable that something is ready, and an echo service running on the other side that, when it gets a ping, sends a pong back; think of it like sonar, ping-pong, so you know the round trip. And rather than take an average, because averages are really misleading and we don't want to use those, we'll histogram it and get a full percentile distribution of the response time. So we pass from one producer to one consumer and just echo it back again. How long do we think that would take, for two threads running in the same process? Anyone care to guess? Ten microseconds; that's pretty good, it's in the right sort of ballpark. Do you think that's the mode, mean, median, 99th percentile? Median; that's a pretty good guess, really. What it actually is, if I measure it on a fairly modern CPU, is around about eight microseconds, so the gentleman over there was very close.
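A minimal sketch of this kind of ping-pong measurement, assuming two ArrayBlockingQueues and a plain nanoTime-and-sort percentile summary rather than a proper histogram library, might look like this; all class and variable names are illustrative, and the actual benchmarks are in the GitHub repository mentioned later.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Ping-pong round-trip measurement between two threads using blocking queues.
// Illustrative sketch only; the real benchmarks record a full percentile histogram.
public final class PingPongLatency
{
    private static final int ITERATIONS = 100_000;

    public static void main(final String[] args) throws Exception
    {
        final BlockingQueue<Long> pingQueue = new ArrayBlockingQueue<>(1);
        final BlockingQueue<Long> pongQueue = new ArrayBlockingQueue<>(1);

        final Thread echoer = new Thread(() ->
        {
            try
            {
                for (int i = 0; i < ITERATIONS; i++)
                {
                    pongQueue.put(pingQueue.take()); // echo whatever arrives
                }
            }
            catch (final InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        });
        echoer.start();

        final long[] roundTripNanos = new long[ITERATIONS];
        for (int i = 0; i < ITERATIONS; i++)
        {
            final long start = System.nanoTime();
            pingQueue.put(start);
            pongQueue.take();
            roundTripNanos[i] = System.nanoTime() - start;
        }
        echoer.join();

        Arrays.sort(roundTripNanos);
        System.out.printf("median=%dns 99%%=%dns max=%dns%n",
            roundTripNanos[ITERATIONS / 2],
            roundTripNanos[(int)(ITERATIONS * 0.99)],
            roundTripNanos[ITERATIONS - 1]);
    }
}
```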
This is with the machine doing nothing else, well tuned, running a fairly recent version of the Linux kernel, and we're looking at up to around ten microseconds, a few microseconds either way, for 90% of the cases. But as we go up to the 99th and 99.9th percentiles, the distribution starts to get a little bit ugly; some things can take a lot longer to respond, and in the case of doing this a hundred thousand times, my worst case is actually up at five and a half milliseconds. That may seem really quite small; five milliseconds is well less than the blink of a human eye, but in computing terms it's a very long time. Can anyone guess the round trip between two machines on the same local area network using some of the latest networking technology? Eleven milliseconds? Three milliseconds? Any others? You're using completely the wrong order of magnitude: we're looking at single-digit microseconds, and the very best will be low single-digit microseconds, which is actually less than going between two threads via condition variables. That's machine to machine, which is kind of scary, and that was just the best-case scenario. The bad news is it gets much worse in real-world applications: you're probably looking at around 50 microseconds between two threads for signalling, because the caches are not going to be hot, they're going to be very cold, and there are lots of other interesting problems going on. We've got to be aware of that.

So let's look at the alternatives. If you're building these systems and you realise this is the major issue, what can you do that's better? Well, there are the non-blocking APIs; I'm talking here about offer and poll, which don't use condition variables when they signal. You can just offer to the queue (if the other side is using take it will still use the condition variable, but let's go with the pure non-blocking versions). So I offer to the queue; if it's full it returns false, which means my thread can go off and do some other work. It's the same on the other side: I poll from the queue; if there's something there I get it, if not it returns null and I can go off and do something else. The fundamental question, when thinking about making good progress, is: can I go off and do something else? That's a big change in how we need to think about designs if we want to get really good at this.

I'm trying to do reasonably good science here, and any experimental evidence should be backed up by making it publicly available and letting people repeat it for themselves; if anybody's interested, all the benchmarks are up on GitHub. I ran this on Java 8 update 60 and a fairly recent version of Linux, running in performance mode on isolated CPUs. So what sort of figures do we get? I don't want a wall of numbers, but I want to look at a couple of scenarios. In one scenario I send one message from a producer thread to a consumer thread and then echo the message back again, so it's a single-message round-trip time. But I want to make it interesting: I want to try some of the different data structures we get with the JDK, and I also want to increase the number of producers to make the concurrency more interesting, like the real world; so one producer, two producers and three producers, all contending, with a single consumer echoing back. A really simple case: a burst of one message, echoed back, but with some contention.
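The offer/poll pattern described here, try the exchange and go do other useful work instead of blocking, looks roughly like the following sketch; the queue choice and the placeholder "other work" are assumptions, not the benchmarked code.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Non-blocking exchange with offer and poll: neither call parks the thread, so
// when the queue is full or empty the caller can go and do other useful work.
// Illustrative sketch; the queue choice and the "other work" are placeholders.
public final class NonBlockingExchange
{
    private final ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

    public void produce(final String message)
    {
        while (!queue.offer(message)) // returns false when full, never blocks
        {
            doOtherUsefulWork();
        }
    }

    public void consume()
    {
        final String message = queue.poll(); // returns null when empty, never blocks
        if (message != null)
        {
            process(message);
        }
        else
        {
            doOtherUsefulWork();
        }
    }

    private void process(final String message)
    {
        System.out.println(message);
    }

    private void doOtherUsefulWork()
    {
        Thread.yield(); // placeholder for genuinely useful work
    }
}
```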
These are the sort of figures we get, but a burst of one is not a very real-world case. A much more real-world case is bursts of messages: the real systems you measure in production don't have a nice simple arrival rate of one message and then, a bit later, another message. You get a burst, and in finance, say, you'll typically get bursts of a hundred or more messages that come at you at a time. So what I want to look at is sending a hundred messages from one, two or three producers, and then having a single consumer acknowledge that and send a response back again. This way we actually drive the concurrency and make it much more interesting, because our algorithm has to deal with that scenario. I also measure the average in this case, which gives me a good idea of general efficiency, but then I look for blocking scenarios and how good the algorithm is at coping with them, and the 99th percentile is a pretty good proxy for that. You should really be looking beyond the 99th percentile too, but that tends to be dominated by the systemic pauses I mentioned earlier; they're an interesting problem in their own right, but let's focus on the algorithms themselves, so I'll pick the 99th percentile.

What starts to stand out here is the JDK classes: to send a hundred events onto the queue and get one response back, when I've got some contention, ArrayBlockingQueue is around 90 microseconds and LinkedBlockingQueue is around 180 microseconds. That is many round trips between two machines on a modern network, just to send messages via the queues we get in the JDK, though it's still much better than the blocking alternatives I talked about before, because those numbers are crazy in comparison. So we're looking at quite a lot of time spent sending bursts of messages between two threads. The nicer result is that ConcurrentLinkedQueue is much better, so if you start building these systems with what's available in the JDK, you start gravitating towards it, because it uses different techniques internally. For reference, the baseline implementation at the top of the table is a variation of Lamport's algorithms from the 1970s on circular buffers, with some influences from FastFlow: some of the best ways you can exchange between two processes, though it only works single producer to single consumer. It gives you a baseline for the communication round-trip time, so you can pull out the difference your algorithm's handling of concurrency makes against the raw transit time, and that baseline is very, very low. You start to see that handling the concurrency is the very expensive part of this.
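As a rough idea of what such a Lamport-style baseline looks like, here is a minimal single-producer single-consumer ring buffer sketch, correct only under the assumption of exactly one producer thread and one consumer thread; it is not the actual baseline from the benchmarks.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal Lamport-style single-producer single-consumer ring buffer sketch.
// Correct only when exactly one thread offers and exactly one thread polls.
public final class SpscRingBuffer<E>
{
    private final Object[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to consume
    private final AtomicLong tail = new AtomicLong(); // next slot to produce

    public SpscRingBuffer(final int capacityPowerOfTwo)
    {
        buffer = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    public boolean offer(final E element)
    {
        final long currentTail = tail.get();
        if (currentTail - head.get() == buffer.length)
        {
            return false; // full: bounded, so back pressure instead of growth
        }
        buffer[(int)(currentTail & mask)] = element;
        tail.set(currentTail + 1); // volatile store publishes the element
        return true;
    }

    @SuppressWarnings("unchecked")
    public E poll()
    {
        final long currentHead = head.get();
        if (currentHead == tail.get())
        {
            return null; // empty
        }
        final int index = (int)(currentHead & mask);
        final E element = (E)buffer[index];
        buffer[index] = null; // let the element be collected
        head.set(currentHead + 1);
        return element;
    }
}
```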
So you move forward, you use some of these queues, and then something occurs to you: in the real world you need a lot more than you get with the standard JDK queues. We have to deal with back pressure. If you use ConcurrentLinkedQueue, it's a linked list of nodes and it can grow indefinitely, and my experience shows that you always end up with an application that eventually crashes: unbounded queues will just keep growing. You get into this nasty position where the producer outpaces the consumer and it goes catastrophically wrong, because the producer is hitting the caches, it's most recently used, while the consumer is cache missing, so it slows down even more; you very quickly exhaust all memory and your application crashes. If you use unbounded queues, you're really playing with fire with the reliability of your applications. On top of that, if you're going to start modelling your systems and applying queueing theory, you need to know the length of the queue, so you need to be able to check its size at times; and the size methods on the JDK queues, for LinkedBlockingQueue or ArrayBlockingQueue, use the same lock that's used for the exchange. So now we've got an observer effect which slows down progress; we don't want that. ConcurrentLinkedQueue's size is an O(n) operation that walks the queue, so it's not really working for me for real-world usage either. They also generate a lot of garbage, and they're not bounded, so we can't really use those; we've got to look at some different alternatives, and that's what I'll explore for the next section: what alternatives I've looked at and been involved with, and where we're going with this.

So what are the alternatives? We start realising that if we use locks and condition variables, we end up with blocking strategies happening all the time and preventing progress; it's a bit like the traffic light scenario. There are other alternatives where we can use lock-free techniques and CAS operations, and one of the reasons ConcurrentLinkedQueue gets a lot more throughput than the others, with better response time, is that it uses some of these lock-free techniques. They are more complicated to get right, but we're going to focus on isolating our concurrency to this and not dealing with it in the rest of the application. If you deal with locks: anybody remember The Hitchhiker's Guide to the Galaxy? I'm of the era when it was a cool book to read when I was young, and the Vogons are the most extreme bureaucrats in the universe. (I'm actually starting to discover that some European countries have amazing bureaucracy that can compete with this; try working across Europe and you realise some countries are amazing at bureaucracy.) We have to fill in all these forms in triplicate rather than working in a more sensible protocol-based manner. When we use locks we end up doing this: we delegate to the lawyers and the governments to get really simple things done, when we just want to make simple progress, and there are better techniques. A lot of the lock-free techniques are based on the simple concept of taking a ticket that gives you your turn, and then taking your turn when your number comes up so you can make progress; a lot of this goes back to Lamport's work from the 1970s, and it's very useful stuff.

So how do we apply this, and what does it end up looking like? One of the first data structures we put out on this was the Disruptor, and the Disruptor uses the same sort of concepts as taking a ticket, the bakery algorithm sort of ideas, heavily influenced by Lamport and by some of the work on network cards. How does it work? First of all you've got to claim your ticket: the sequence number that you're going to deal with. This is one of the points where we've got to deal with concurrency, because the different producer threads will contend, and it's really a CAS: you read the value, then conditionally set the new value back if it hasn't changed; if it has changed, you read the value again and go around and do it again. So we go into this little spinning CAS loop to claim a sequence. Once we've got the sequence we can take our turn: you work with the element that's been allocated to your sequence, and then you've got to signal that it's done.
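A minimal sketch of that spinning-CAS claim, assuming an AtomicLong holds the next sequence to hand out; this is illustrative, not the Disruptor's actual code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Claim the next sequence with a spinning compare-and-set loop.
// Each producer reads the current value and tries to advance it by one;
// if another producer won the race, it re-reads and tries again.
public final class SequenceClaimer
{
    private final AtomicLong nextSequence = new AtomicLong();

    public long claimNext()
    {
        long current;
        do
        {
            current = nextSequence.get();
        }
        while (!nextSequence.compareAndSet(current, current + 1));

        return current; // this sequence (and its slot) now belongs to the caller
    }
}
```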
In this case we signal that it's done by updating another counter, the cursor, once it has reached the value that was just before ours. This is how the first version of the Disruptor ended up working. Once that's updated, the consumer can see that it's available, take the element out, and update its own counter. So just by using a sequence of counters we can coordinate without any third-party involvement; this can all happen simply in user space.

Now, that spinning on the cursor has a kind of interesting effect, and this little piece of code is probably the worst piece of code I have let get out into the world in open source. I can hold my hands up and say, yep, this is not very good, and I can blame myself for it. Why is this really simple piece of code so bad? Whenever you've completed your action, you've taken your ticket, you've used your ticket, and you're now saying you're done, what you're waiting on is for the cursor to be one less than the sequence you claimed; only then can you move it forward. You can't move it forward until the one behind you has, so you only get to move along when the others have caught up with you, and you end up waiting in this loop for them to catch up. That's bad in some ways. If you've got more cores than threads that need to run, you're not in a bad position, because you're just spinning while you wait. But if you have more threads that want to run than physical cores to run them, you end up in cases where this spins around eating up a core and starving out the other threads that need to run. So there's a bad side to some of these lock-free structures if you don't have the right amount of resource, and this can get particularly nasty: as you spin here, you're actually starving out the very threads you need to make progress, and as a result it can go catastrophically wrong and be much, much worse than the locking scenarios. I felt terrible the first time it dawned on me how bad this was; we didn't hit these problems at LMAX when we worked on it, because we were always running on systems that had more cores than runnable threads.

So how do we learn from that? Disruptor one and two had this structure of using the claim sequence and the cursor to say a slot is complete. How can we break that blocking between the two threads? You've got concurrency dealing with that structure: how do we break away from concurrency and move more towards parallelism, so we don't have that lock-step situation? We did it by just introducing another array that says what is available. It works like this: we go through the same steps as before to claim; get your ticket, take your turn. Once you've got the ticket you use the slot that's been made available to you, and then afterwards, rather than waiting on another thread, you just update a flag to say you're done, and that can happen in parallel. Now we've teased these things apart: the consumer can just watch the availability array to see when things are ready, take the element out, and update its own sequence. Very subtle little changes to an algorithm. What does this mean for an overall system?
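To illustrate the idea of publishing with a per-slot flag instead of waiting on the cursor, here is a minimal sketch in the spirit of that change, assuming a claim counter plus an availability array; the real Disruptor 3 code is considerably more involved, and this sketch omits the tracking needed to stop producers lapping the consumer.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.atomic.AtomicLong;

// Claim a slot with a CAS on a claim sequence, then publish by setting a
// per-slot availability flag, so producers never wait on each other to publish.
// Illustrative sketch only: a real implementation also tracks the consumer
// position so producers cannot lap it.
public final class FlagPublishingBuffer
{
    private final Object[] entries;
    private final AtomicIntegerArray available;
    private final AtomicLong claimSequence = new AtomicLong();
    private final int mask;

    public FlagPublishingBuffer(final int capacityPowerOfTwo)
    {
        entries = new Object[capacityPowerOfTwo];
        available = new AtomicIntegerArray(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
    }

    public void publish(final Object value)
    {
        // Get your ticket: claim the next sequence with a spinning CAS.
        long sequence;
        do
        {
            sequence = claimSequence.get();
        }
        while (!claimSequence.compareAndSet(sequence, sequence + 1));

        // Take your turn, then flag the slot available; no waiting on other producers.
        final int index = (int)(sequence & mask);
        entries[index] = value;
        available.set(index, 1); // volatile store publishes the entry
    }

    // The consumer watches the availability flags for its next sequence.
    public Object poll(final long consumerSequence)
    {
        final int index = (int)(consumerSequence & mask);
        if (available.get(index) == 0)
        {
            return null; // not published yet
        }
        final Object value = entries[index];
        available.set(index, 0); // hand the slot back for reuse
        return value;
    }
}
```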
Here's a plot of a trading system that was using the Disruptor version 2 and was upgraded to version 3. What you've got there is over two billion data points plotted, most of them overlaid on each other, and it's a great way to look at a system: a log scale on the left and the time period over which it ran. We go from Disruptor 2 to Disruptor 3 with the same trading flow, and it makes such a big difference to the overall system: just some simple changes that break down concurrency and turn it into parallelism, and then we can make a lot more progress and things get a lot better.

But let's say you can't use the Disruptor and you just want an existing queue; sometimes people just need a queue. What sort of options do you have there? Well, let's take a lot of those influences and see if we can deal with this in a better way. Claim the tail of the queue the same way, grab your ticket, and store an element into the queue. The nice thing here is that I don't need a separate array to say the stuff is available, because if the reference is in the queue's array itself, it's already signalling that it's done; using the actual slots becomes a nice way of dealing with this. You can take elements out the same way and update the head. So it's a really simple change, a very simple algorithm, just inspired by some of the work that has come before.

What do these approaches give us from a performance perspective? I've taken off ArrayBlockingQueue and LinkedBlockingQueue because those aren't very useful; let's look at ConcurrentLinkedQueue compared against the Disruptor and this many-to-one concurrent array queue. You'll notice that in the one-to-one case things have got a good bit better, because there's a lot less work, particularly if you look at the case of the many-to-one concurrent array queue: it's a very simple algorithm, it works really well, and it's really nice. But let's move on to bursts of a hundred, which, as I said, is more realistic. What stands out, and the thing that occurred to me after measuring lots of these systems, is that once you start dealing with concurrency and spinning, it doesn't matter how clever or how simple your algorithm is: it all kind of equalises. I've played with many different variants of this, and the spinning CAS loop where you go to take the ticket ends up dominating the cost, because you've got the different threads going to take the ticket; if you see the value and you miss, you've got to go and try to take it again. So we end up in this sort of interesting case: how do we get around those problems, and what other ideas are there?
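A rough sketch of that idea, claim the tail with a CAS and let the stored reference itself act as the published flag, might look like the following; it is illustrative and not the actual many-to-one queue implementation.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Many-producer, single-consumer array queue sketch: producers CAS the tail to
// claim a slot, and the stored reference itself signals that the slot is ready.
// Illustrative only; assumes a single consumer thread.
public final class ManyToOneArrayQueue<E>
{
    private final AtomicReferenceArray<E> buffer;
    private final AtomicLong tail = new AtomicLong(); // next slot for producers
    private final AtomicLong head = new AtomicLong(); // next slot for the consumer
    private final int mask;

    public ManyToOneArrayQueue(final int capacityPowerOfTwo)
    {
        buffer = new AtomicReferenceArray<>(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
    }

    public boolean offer(final E element)
    {
        long currentTail;
        do
        {
            currentTail = tail.get();
            if (currentTail - head.get() == buffer.length())
            {
                return false; // full: apply back pressure
            }
        }
        while (!tail.compareAndSet(currentTail, currentTail + 1));

        buffer.set((int)(currentTail & mask), element); // the reference is the "done" flag
        return true;
    }

    public E poll()
    {
        final long currentHead = head.get();
        final int index = (int)(currentHead & mask);
        final E element = buffer.get(index);
        if (element == null)
        {
            return null; // empty, or the claiming producer has not stored yet
        }
        buffer.set(index, null);
        head.set(currentHead + 1);
        return element;
    }
}
```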
There were some interesting problems in this that we found through measurement, which influenced other work I'll talk about in a second. We also end up having to work not just across threads but across processes, which is becoming very common, and we're in a much more polyglot environment now, where many interesting systems are made up of many different languages, so we need to be able to exchange data, particularly across processes. Network cards have been doing this for a long time with ring buffers; our operating systems and device drivers do these sorts of things with ring buffers a lot. So let's take some of those concepts up into our applications: if we have a memory-mapped file and we write our data directly into it, we can exchange that between threads or between processes, the same way operating systems, network cards and device drivers have been doing for a long time, but at an application level.

How do we claim space? Rather than the tail advancing a slot sequence, it becomes a range of bytes: I need to claim a number of bytes, so if I CAS the tail forward by that number of bytes I can claim a region. Claiming a region is a very similar concept to claiming an individual slot, but you're into that spinning CAS again: with two threads going for it, one will fail and then have to take its turn and go around again, so you end up with this waiting action between the threads. Once you've got the region, you write the message in, then you write a header on the message, and you write one particular field as the very last thing you do, and by doing that you've said you're done. Just simple protocols of interaction. How do you read it? You read the header to find out how long the message is, copy the message out, zero it afterwards, and move your counter along, and that way we have a really simple exchange from producer to consumer. What do you need in a header? Just the type of the message and the length, and ideally you put the length up front so you can skip forward from message to message; it's quite simple stuff. The problem with these approaches is that you have to zero the memory as you consume, and what you find when you measure them is that you spend twice as much CPU effort on the consumption as on the production, so you're fundamentally limited by the consumer side. So how do we get around both the spinning CAS and this problem of zeroing out the memory?

This led on to me looking at Aeron and its IPC. I wanted to take the problems we'd seen before and work out how to overcome them. In an ideal world we would have a log, and that log could go on forever; if I didn't have to keep zeroing it, that would be really nice, and it would also be really nice if I didn't need a spinning CAS, so that when I go to claim I don't have to read, miss and retry; I just want to increment, get my memory, and not do the zeroing. Well, if we just have a concept of a tail that can logically go on forever, we get the space, write in the message and read the header exactly the same as with the ring buffer, but we're not trying to wrap around; it's one big file that goes on forever. A lot of the academic literature points that way, but if you measure it, it's not a good approach, because you get page faults and virtual memory churn, loads of those systemic pauses coming out of the operating system from that type of behaviour. So how do we address that? I just stole blatantly from when I used to go backpacking. The thing you learn from backpacking quite quickly is you get down to one to wash, one to wear, one to dry, and that's how you learn to travel light. Taking that approach, if we cycle not within a buffer but across buffers, we can do some interesting things. And how can we get rid of that CAS? x86 has this really nice instruction called XADD, which increments a number and gives me the value it held before, so I can move forward and make good progress without having to read, increment and conditionally set; it does the whole thing in one go. So if I do an XADD on the tail, and this is available in Java 8 through the atomics, I can claim an amount of space. Two threads could be doing this at the same time and they're not going to get into the spinning cycle, but the second one could move the tail even beyond the end of the buffer.
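A minimal sketch of that XADD-style claim, assuming AtomicLong.getAndAdd (which the JIT compiles down to a fetch-and-add such as lock xadd on x86) over a fixed-size log buffer; the rotation case is reduced to a comment, and all names are illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;

// Claim a region of bytes in a log buffer with a single fetch-and-add instead of
// a spinning compare-and-set. Illustrative only; a real log buffer must also
// handle rotation when a claim runs past the end of the buffer.
public final class LogBufferClaimer
{
    private final AtomicLong tail = new AtomicLong();
    private final int capacity;

    public LogBufferClaimer(final int capacity)
    {
        this.capacity = capacity;
    }

    /**
     * @return the offset at which the caller may write, or -1 if this claim
     *         tripped over the end of the buffer and the caller must rotate.
     */
    public long claim(final int messageLength)
    {
        final long offset = tail.getAndAdd(messageLength); // one atomic step, no retry loop
        if (offset + messageLength > capacity)
        {
            // Only the thread whose claim crossed the end is responsible for
            // closing out this buffer and rotating to the next one.
            return -1;
        }
        return offset;
    }
}
```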
So let's look at the worst-case scenario. The first thread copies in its message just as before. The second one has a little bit more responsibility: it knows it tripped over the end of the buffer, because the value the counter held before, plus its length, takes it past the end. If another thread was incrementing at the same time and its starting position was already beyond the end of the buffer, that's fine too, because only one thread ever has the responsibility for rotation. That thread needs to close out the log and rotate to the next buffer, and then it can put its message in there. This gives us really interesting semantics, because we've beaten the CAS; we don't have to deal with that any more. And also, as we rotate, we can have a separate thread do the zeroing, so the zeroing becomes a background activity. So we've dealt with both of the interesting problems.

What does this look like from a numbers perspective? Well, Aeron IPC ends up taking a little bit more time in the uncontended case. Why is that? It's because Aeron puts a much larger header on than the normal ring buffer. Aeron was designed to send messages across the network, not just between processes, and, if anybody's up on the latest distributed systems coolness at the moment, CRDTs, the idea is that we can recreate a data structure across a distributed network where the packets arrive out of order, or arrive multiple times, and still assemble the same thing; we can recreate the same data structure on another machine, which is a nice property, and that means we have to describe the whole header. If anybody's interested, it's an IETF-style protocol drawn here on the screen, and we can dig into it.

So let's look at the numbers for what's actually going on here. The ring buffer going off-heap is the first major step forward: we get better latency, especially at the 99th percentile, which is looking good. Why has that happened? It comes down to this thing called false sharing. When we have the array that holds all of the elements, in the queue cases, or in the Disruptor's case the availability array, we move memory around our systems in blocks called cache lines, 64 bytes at a time, and you end up having to share those between different threads even though they're not logically writing to the same location; they're writing to the same physical block. It also shows up in our garbage collection through a thing called card marking, and you can see that effect if you turn it on or off.
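A common way to keep a hot counter from being falsely shared is to pad it out to its own cache line. Here is a minimal sketch of that idiom, assuming 64-byte cache lines; production implementations, including the Disruptor's sequences, usually split the padding across a class hierarchy so the JVM cannot reorder the fields, or use the @Contended annotation (unlocked with -XX:-RestrictContended) instead.

```java
// A counter padded so it occupies its own cache line and is not falsely shared
// with neighbouring hot fields; 64-byte cache lines are assumed.
public class PaddedCounter
{
    protected long p1, p2, p3, p4, p5, p6, p7;       // padding before the hot field
    protected volatile long value;
    protected long p9, p10, p11, p12, p13, p14, p15; // padding after the hot field

    public long get()
    {
        return value;
    }

    public void set(final long newValue)
    {
        value = newValue;
    }
}
```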
Then, getting rid of the spinning CAS as well: what difference has that made to our algorithm? Interestingly, now we've got a major step forward. Especially at the 99th percentile we're getting much, much better results with Aeron than we have had with anything else under contention; in fact it's a lot better. You can see the numbers, but what does this look like in something we can get our heads around a bit more? Let's just graph it, looking at the different data structures: with a burst of a hundred messages, what is the 99th percentile for the round-trip time? ArrayBlockingQueue and LinkedBlockingQueue aren't that good, over on the far side. You've got the middle range of stuff with ConcurrentLinkedQueue, the Disruptor and the concurrent array queue, but because they are all fundamentally limited by the spinning CAS, and have issues with false sharing, they all end up around the same level. We go off-heap to the ring buffer and we get some nice advantages, and then Aeron, getting rid of the issue of zeroing and the issue of the spinning CAS, can make a lot more progress beyond that.

So that's kind of the last ten years. Where can we go next, and what have we learned from this? This has been developed in both Java and C++, and what's been frustrating for me at times is that it's been really great competing against native implementations written in C and C++, and even FPGAs, from different messaging systems, with the Java implementation beating them all; but one of the guys working with me on this is a really good C++ programmer, and he can annoyingly just copy the new algorithms we've applied, and in C++ they're even faster again. So we're running a little bit behind, probably in the order of 10 to 20 percent now, between Java and C++. We're also doing C#, and it too is getting close to the Java side on performance. So what comes out of learning from this? There are some standout features on the C++ side that, if we could get them in Java, could really make a big difference. One of them is our spin loops. x86 is a speculative processor: when you hit a branch in your code it will speculate past the branch and continue to make progress. We want that on normal code, but not on concurrent code, because it often makes a bad decision on concurrent code. There's a way to control that with an instruction called PAUSE on x86, which prevents speculation that would have to be unwound, and it also means the CPU doesn't use more energy than it needs to. In C++ we can easily get at that; from Java we couldn't, but as of Java 9 we're going to have a hint that's applied as an intrinsic, and we can get at it. Some tests on the Disruptor are showing a 20% reduction in latency from being able to use this technique that's available in JDK 9.
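The JDK 9 hint being referred to is presumably Thread.onSpinWait(); a minimal sketch of using it in a busy-spin wait follows, with the sequence and expected-value names being illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;

// Busy-spin until a sequence reaches an expected value, hinting to the CPU that
// we are in a spin loop. Thread.onSpinWait() is a JDK 9 intrinsic that typically
// maps to the x86 PAUSE instruction; on JDK 8 this method does not exist.
public final class SpinWaitExample
{
    public static void awaitSequence(final AtomicLong sequence, final long expected)
    {
        while (sequence.get() < expected)
        {
            Thread.onSpinWait(); // curb speculation and save power while spinning
        }
    }
}
```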
We also have some really interesting problems with what are known as data-dependent loads; if anybody wants to grab me afterwards I'm happy to talk more about this. Our biggest performance issue on modern processors is that if you read a pointer, then have to go through that pointer to find something else, and then on to something else again, the processor cannot speculate, and as a result we get really limited in our throughput. We can address this, and it's probably the last big performance gap between Java and native languages. Value types will partially address it, but only partially, probably 30% of the problem; ObjectLayout is another thing being worked on that will help us more significantly build better data structures with better performance. We also find some really interesting things with memory copying, where at times C was ten times or even forty times faster than Java, and some interesting alignment issues that need to get fixed, where Java is copying and different arrays don't start on memory addresses that are multiples of 32, which ends up hurting us. And last but not least, our data structures, particularly the concurrent ones, cannot just be based on the single-threaded alternatives: our queues being based on the Java collections has fundamentally limited the performance and the behaviour of what's possible, and it conflates concerns together; we need to break some of those apart.

So, wrapping up now. In closing: we are getting into a world of many more cores and many more threads, and we need to be able to deal with them. We've got to start thinking differently, thinking about how we manage this concurrency in a better way, and how we get parallelism that doesn't have to deal with concurrency. You can find all of this stuff up on GitHub, so feel free to go and have a play and send us any feedback. I'm almost out of time; I've got one minute, so I can probably take one question. Yes, it's a good question, on how we will give hints. I don't see any direct work on that at the moment. What people typically do is write some native code that's called from JNI or JNA to control some of it; at that point we can do a little bit by controlling things from the outside. For example, control groups can help us quite a lot with NUMA regions, and that can make quite a significant difference to the performance of your code, but you need to think about things outside of the JVM; there's very little activity inside the JVM that's helping with that, from what I've seen. Thanks.
Info
Channel: Devoxx UK
Views: 10,176
Rating: 4.9705882 out of 5
Keywords: DevoxxUK2016
Id: 929OrIvbW18
Length: 51min 35sec (3095 seconds)
Published: Thu Jun 16 2016