Understanding the Disruptor, a Beginner's Guide to Hardcore Concurrency - Trisha Gee & Mike Barker

Captions
Hello, I'm Trisha Gee, and I'm Mike Barker; we're both developers at LMAX, a financial exchange here in London, and today we're going to give you a beginner's guide to hardcore concurrency. We developed a message-passing framework called the Disruptor, which recently won the Duke's Choice Award for innovative programming framework, and we're going to talk a little bit about the Disruptor, but mostly we're going to dive in quite deep and narrow into what makes it so fast and how it works. Some of the things we talk about might slightly contradict things you've heard over the last couple of days. For example, we might argue against a certain amount of abstraction: we've heard a few people talk about how it would be nice to provide abstractions to make multi-threaded and concurrent programming a little bit easier, and we don't necessarily think that's true. We may also disagree with a few of the techniques and tips for getting enhanced performance out of software as hardware has moved forward.

This is the audience participation bit: is anyone here working on high-performance code? Excellent, at least one. And who's working with concurrent code on a daily basis, or regularly? OK. Who finds it difficult? Excellent; it's always good to hear.

So why is it so difficult? Compilers and CPUs are free to reorder code in order to produce the fastest, most efficient way of working, so the program order, what you actually wrote, can't be assumed to be the order in which it gets executed. We all know this; we all know that when one thread is modifying some of these values, you don't necessarily see the same values from another thread. Just because z is equal to 40 does not necessarily mean, seen from another thread, that w is equal to 10. Different CPUs reorder by different amounts: x86 doesn't tend to reorder things very much, but there are other CPUs which might reorder to a ridiculous extent. The order is only guaranteed from inside the one thread that is actually executing that piece of code at any one time. So there's an ordering problem and a visibility problem, and that's what makes concurrent programming quite difficult.

I was talking about reordering at the CPU level, so it's worth going into a little bit of the architecture of the CPU, and we've heard this a couple of times over the last few days: in order to get faster performance we've gone multi-core, so we've got a number of different cores across a couple of sockets. Inside the core you've got the registers, where you're working on various data, and the store buffer, which is where information waits to be written to your different levels of cache: L1, which is small and fast; L2, slightly bigger, slightly further away, so slightly slower to access; L3, shared across the cores; and eventually main memory, shared across both sockets. As you go further out, your access time to those different levels of memory gets slower, and that's the point of the caches: so that you can access things as quickly as possible and reduce the latency between your core and main memory. On some CPUs you've also got invalidation queues, but we're not really going to talk about those here. The Java memory model gives us a good model for reasoning about how things interact at these different levels. When your thread is working
on a piece of data, that data could be anywhere in there: it could be in a register, it could be sat in the store buffer and not flushed to cache yet, it could be in main memory. If your thread is executing on this core and the data is sat somewhere over here, another thread executing over there is not necessarily going to be able to see that data.

Taking a slight detour for a bit: one of the things I noticed, especially when I started listening to talks about performance, even from some fairly senior people in the Java community (Josh Bloch is a good example), is that there seems to be a degree of not wanting to explain the details or help people understand them. It sort of seems to be more about "let's provide some really nice APIs and features, use them and it will be fine". I come from a slightly different approach: I think we actually need to know how these things work, we need to understand the details. I've got a few examples, going through a couple of different concurrent programs and looking at how they behave, specifically from the performance side, because that's really what I care about. I think you shouldn't really be doing concurrency and parallelism unless you're actually trying to get better performance; if you don't care about performance, why go parallel, why make your code more complicated than it needs to be?

So the first one, and probably the most important concept to understand when you talk about concurrency from a performance perspective, is contention. I'm going to use a slightly contrived but very simple piece of code: iterate half a billion times, incrementing a value, in a single thread. This version is not going to be run on multiple threads; it's just a single thread incrementing the value. If we wanted to make it thread-safe, here's one approach: take a lock each time. This is a horrible piece of code and I hate it, but it turns up constantly in examples when somebody tries to explain how concurrency works; even some really, really good people, like the guys developing Google Go, use this exact example as the first example of concurrent code in Go. Really annoying; you'll see why I hate it so much. And here's a third approach, also thread-safe, using an AtomicLong: underneath the covers there's a special CPU instruction (on x86, HotSpot happens to use a lock-prefixed compare-and-exchange), and there are reasons why that's the better option. It's a more efficient way of doing the same thing, but it's still thread-safe.

So let's see what these things do when we run them half a billion times and measure how long that piece of work takes. With just one thread, without any locks or atomics, it takes about 300 milliseconds. If we make that value volatile, and that's all we change, still only one thread, the time goes up: that's a 15x cost, and I'll go into the reasons why in more detail later on. Let's introduce the atomic, using the atomic's increment, but still running on only one thread: it's slightly longer than the volatile one. There are similarities between the two, in that both of them effectively issue a memory barrier, which has a very, very high cost. If one thread takes out a lock, that's about 30-odd times slower than having no lock at all.
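To make this concrete, here is a rough, illustrative sketch of that kind of micro-benchmark (this is not the speakers' actual code; the class and field names are made up). It increments a counter half a billion times with a plain long, a volatile long, an AtomicLong, and a synchronized block, so the relative costs can be reproduced on your own hardware:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative reconstruction of the increment benchmark described in the talk.
// Each variant runs on a single thread; the point is the relative cost of volatile
// writes, atomic instructions and locks, not the absolute numbers.
public class IncrementBenchmark {
    private static final long ITERATIONS = 500_000_000L;

    private static long plainCounter;
    private static volatile long volatileCounter;        // non-atomic ++, but fine for a cost comparison
    private static final AtomicLong atomicCounter = new AtomicLong();
    private static long lockedCounter;
    private static final Object lock = new Object();

    public static void main(String[] args) {
        time("plain       ", () -> { for (long i = 0; i < ITERATIONS; i++) plainCounter++; });
        time("volatile    ", () -> { for (long i = 0; i < ITERATIONS; i++) volatileCounter++; });
        time("AtomicLong  ", () -> { for (long i = 0; i < ITERATIONS; i++) atomicCounter.incrementAndGet(); });
        time("synchronized", () -> {
            for (long i = 0; i < ITERATIONS; i++) { synchronized (lock) { lockedCounter++; } }
        });
        // Print the counters so the JIT cannot treat the loops as dead code.
        System.out.println(plainCounter + volatileCounter + atomicCounter.get() + lockedCounter);
    }

    private static void time(String name, Runnable work) {
        long start = System.currentTimeMillis();
        work.run();
        System.out.println(name + " took " + (System.currentTimeMillis() - start) + " ms");
    }
}
```

Absolute timings will vary by machine and JVM; the shape of the results (plain fastest, then volatile, then atomic, then the lock) is what the talk is relying on.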
Notice too that the lock is roughly twice the cost of an atomic, or roughly the cost of an atomic plus a volatile, depending on the implementation. Funnily enough, a lock does one of each: one when it takes the lock out and one when it releases it, when it's uncontended.

So let's add some contention. Let's add a second thread, and start with the slightly faster version, the atomic: two threads contending on the same atomic, trying to increment this value. It's 100 times slower than not doing it multi-threaded at all, and notice it's also six times slower than one thread with an atomic. That's what starts to illustrate the cost of contention: when you have two threads contending on the same value, you've basically got a problem trying to make that efficient, even with probably the most efficient way of incrementing available in Java at the moment. Anyone want to hazard a guess at how long two threads with a lock will take? Any takers? Four minutes. It's 746 times slower. It's very, very slow, and this is one of the surprising things: if you're writing a parallel algorithm, and there's mutable state that's shared, and you have to put a lock around it because there's no other way to do it, no fancy non-blocking approach, and that data is highly contended, written to a lot, there is probably no point in parallelising your algorithm at all. This is where we wanted to get to when we talk about really understanding the details: often it's better not to parallelise at all if you've got a contended problem. This is one of the main things we'll come back to when we talk about the Disruptor a bit later: don't share things, and especially don't share things you write to.

Parallel versus serial is another interesting one. Has anybody seen the presentation by Guy Steele? He gave a talk at Strange Loop last year about Fortress and the implicit parallelism inside Fortress, which uses fork/join underneath. The idea is that you can rewrite your algorithms using a divide-and-conquer model, like Martin and Ben explained earlier today: you decompose the problem, parallelise the leaf-node work, and then reconstruct the answer. He talked about an algorithm for string splitting: basically partition a chunk of text containing a number of words, compute partial results, and then recombine those results. I thought that was quite interesting, and Scala has introduced parallel collections as well, so I did an implementation in Scala, thinking it was an interesting problem I could solve that way. Then, just to compare, I wrote a brute-force serial version in Java, a standard iteration over the text. The code is up on GitHub if anybody wants to have a look. Interesting differences: the serial Java version is half the number of lines of code of the Scala version, one serial, one parallel, and the serial one, on a single thread, is four times as fast as the parallel version. Anybody want to guess how many cores this is running on? It's a 16-core box. So even when I threw 16 cores at the parallel version, it only went at a quarter of the speed of the serial one.
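The speakers' Scala and Java versions are on GitHub; as a stand-in only, here is a minimal Java sketch in the same spirit (it uses Java 8 parallel streams rather than Scala parallel collections or fork/join, and it is not the speakers' code), comparing a single-pass serial word count with a parallel one that pays for allocation and recombination:

```java
import java.util.Arrays;

// Illustrative stand-in for the word-splitting experiment described above.
// The serial version makes a single pass over the text with no allocation; the
// parallel version first allocates an array of every word, then counts in parallel.
public class WordCountComparison {

    // Serial: one pass, no intermediate allocation.
    static long countWordsSerial(String text) {
        long words = 0;
        boolean inWord = false;
        for (int i = 0; i < text.length(); i++) {
            boolean whitespace = Character.isWhitespace(text.charAt(i));
            if (!whitespace && !inWord) words++;
            inWord = !whitespace;
        }
        return words;
    }

    // Parallel: split the text into words (allocating all of them), then count in parallel.
    static long countWordsParallel(String text) {
        return Arrays.stream(text.split("\\s+"))
                     .parallel()
                     .filter(w -> !w.isEmpty())
                     .count();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1_000_000; i++) sb.append("lorem ipsum dolor sit amet ");
        String text = sb.toString();

        long t0 = System.currentTimeMillis();
        long serial = countWordsSerial(text);
        long t1 = System.currentTimeMillis();
        long parallel = countWordsParallel(text);
        long t2 = System.currentTimeMillis();

        System.out.println("serial:   " + serial   + " words in " + (t1 - t0) + " ms");
        System.out.println("parallel: " + parallel + " words in " + (t2 - t1) + " ms");
    }
}
```

This is not the same decomposition Guy Steele describes, but it shows the general point made in the talk: the parallel variant does far more allocation and memory traffic, so extra cores do not automatically mean extra speed.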
The reasons for this start to get very, very complicated, but they boil down to the fact that because I'm running a single thread I can do things mutably and be very memory efficient; I can take advantage of the caches and stride through memory predictably. The parallel version, because of the requirements of fork/join, ends up doing a lot more memory allocation; I'm pushing data through main memory and through the cores all the time, doing a lot more allocation, garbage collection kicks in, and so on. So the message is: parallelism can work, and can work for a number of problems, but first understand your problem, because you might find that all you're doing is making your code slower and more complicated.

Here's another interesting one. I've been told a lot today that CPUs aren't getting any faster. That's not the case: clock speeds are not getting any higher, but CPUs are getting faster. With that previous algorithm, where I wrote the serial version, yes, the parallel version is slower, and yes, I could throw more cores at it and it would eventually become faster. But what happens if I take my serial version and start running it on newer and newer CPUs? If it were the case that CPUs aren't getting faster, that code shouldn't give me faster results. The results say otherwise. These are four different CPUs: the very top one is from about 2008, a Core 2 Duo, my work laptop; the second one down is a Nehalem EP, the 16-core box I ran the previous tests on; then a Sandy Bridge, which is this laptop, and another Sandy Bridge machine I tried as well, both much newer. You'll notice it's almost twice as fast on the latest Sandy Bridge CPUs as on the P8600, the dual-core box. Interestingly, the Sandy Bridge CPUs, which are both faster, have lower clock speeds. Clock speed isn't an indication of CPU performance at all; there are so many other things going on. Sandy Bridge in particular has a totally different cache and memory architecture to the old Core 2 Duo. A good example: a Core 2 Duo uses the same bus for both the memory traffic, the actual data, and the cache coherency traffic, all the extra little bits of traffic that tell all the caches that things have become invalid as they change. Then there are wider data paths, larger caches, more registers; lots of things going on. So this assumption that CPUs aren't getting any quicker and that parallelism is the only answer: I'm not so sure. It comes down to your problem, so focus on what you're trying to solve rather than just saying "I've got to go parallel", because it's not always going to work.

Back to the real world, and the problem we were actually trying to solve when we came up with our message-passing framework. As I mentioned, we're a financial exchange, so we sit quite firmly in the real world: we have to be FSA authorised, and speed of execution is fundamental to our business. We have to be extremely fast, because that's our differentiator; we need to be able to match prices extremely quickly. In order to do that we keep pretty much everything in memory, we don't do any I/O inline, we don't do anything which is going to slow us down; we want to reduce latency. However, because we are FSA authorised we also can't lose any data, and we've got to be reliable and highly available, which means that we also need to
do things like make sure we have the same data on a secondary, running at our DR site, and journal the in-memory state so that when the system goes down we can bring it back up again without losing absolutely everything.

Our initial architecture to solve this problem was a fairly common one: a SEDA architecture, a staged event-driven architecture, and it makes a lot of sense. You pull your messages off the network (this is your receiver) and put them on a queue. Replication, which is going to send the data off to your secondary server, pulls them off that queue, sends them to the secondary, and chucks the messages onto another queue. When journalling is ready, it takes the message off that queue and journals the information to disk, so at that point you know that if it all comes crashing down you can replay from there; then it puts the message onto the next queue. Then you do the stuff we're all interested in, the actual business logic, the number crunching, the interesting stuff; then you chuck the result onto another queue, hand it to the publisher, and publish it out to the network. This is just one service; we have a bunch of different services throughout the exchange doing slightly different things, but this is the fundamental architecture.

We found, when we started measuring (and measuring is very important when you have specific requirements like low latency), that this wasn't going quite as fast as we would like, so we did some profiling to find out what was taking the time. Now, of course, replicating to a secondary server elsewhere and doing I/O to disk is going to take some time. But each one of those queues was taking ten microseconds, probably more, and when you add them all up you find they account for quite a large amount of your overall latency, while your business logic, the interesting stuff, the stuff that differentiates you, is taking almost no time at all. This caused us some problems: we need the queues, because we're passing information between these different stages, but they're slowing us down.

Why do queues hurt so much? That's one thing we learnt. We spent a lot of time optimising queues; we wrote probably two or three of our own queue implementations to try and get around some of the problems, and we basically came to the conclusion that the structure is fundamentally flawed as a way to move data between cores in the fastest way possible. It comes back to what I was talking about previously with those locks: it's about contention. Here's a question: I've got a producer and a consumer on a queue; how many writers are there to the queue, one or two? There are two. Both of the threads, the producer and the consumer, are changing the state of the queue: they're both trying to move the pointers along, they're both taking out locks, they are both contending on the same pieces of data. And as I showed with that lock example, that's when you run into your performance problems; that's when you run into these very interesting latency issues.
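To make the "how many writers?" point concrete, here is a deliberately simple, hypothetical bounded queue (not the JDK's ArrayBlockingQueue source, though it has the same general shape): both put and take mutate the shared head, tail and count fields under the same lock, so the producer and the consumer contend on every single operation:

```java
// A deliberately simple bounded queue, shown only to illustrate the contention
// argument above: BOTH the producing and the consuming thread write to the shared
// state (head, tail, count) and take the same lock, so they contend constantly.
public class SimpleBoundedQueue<E> {
    private final Object[] items;
    private int head;   // written by take()  -> the consumer thread
    private int tail;   // written by put()   -> the producer thread
    private int count;  // written by BOTH    -> guaranteed contention

    public SimpleBoundedQueue(int capacity) { items = new Object[capacity]; }

    public synchronized void put(E e) throws InterruptedException {
        while (count == items.length) wait();       // full: block the producer
        items[tail] = e;
        tail = (tail + 1) % items.length;
        count++;
        notifyAll();
    }

    @SuppressWarnings("unchecked")
    public synchronized E take() throws InterruptedException {
        while (count == 0) wait();                   // empty: block the consumer
        E e = (E) items[head];
        items[head] = null;
        head = (head + 1) % items.length;
        count--;
        notifyAll();
        return e;
    }
}
```

Even a far more sophisticated queue has the same fundamental property the talk is pointing at: two threads, one shared structure, and writes to that structure from both sides.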
As an aside, one of the reasons why a lock is so bad under contention, and why it's so much worse with two threads contending, is that the only way for the lock to arbitrate between those two threads and decide who has responsibility for the data is to ask the operating system. You actually have to delve into a system call, and once you're in there the operating system can do whatever it likes: it can stop your thread completely, move you off the core, reschedule another thread onto your core, which then comes in and stomps all over your cache and basically messes it up. So by the time your thread gets back to running, you're going to have a whole heap of cache misses, and you'll have to drag back in all the data that was stomped over by the previous process that happened to be running there. When you're running virtualised it's even worse: if you're in a virtual host and you have a contended lock, it has to go not just to the guest operating system but all the way down to the host OS to have that lock resolved, and that's why you run into so many performance issues there.

So, looking at an array-backed queue: you've got a head and a tail that need to be maintained, there's a size field that needs to be maintained, you're going to take out locks, and the threads contend over those fields. One of the things we found with queues, and you'll see this if you do some research and look at any queueing theory, is that a queue is basically always in one of two states: either heading towards being full or heading towards being empty. It will move in one direction or the other; you'll generally never have a case where a queue holds a medium amount of data and stays steady, because at three billion operations a second there's no way two independent threads can run at exactly the same pace given they're doing different things. It's a bit like trying to balance a pencil on its tip: it'll never happen. So you'll always end up with a queue either filling up, if the producer is faster, or draining, if the consumer is faster. If it's filling up you're basically screwed anyway, your system is about to fall over; so in most cases your queues are draining, and you're stuck in the situation where both threads are contending at the very front of the queue on the one element being passed between them, and that's almost unavoidable. The same is true for the lock-free ones as well: you'll be tending towards a single element, so if you look at something like the ConcurrentLinkedQueue you end up trying to compare-and-swap the same value in and out. Plus, if you've got a linked-list-style queue, you need a separate size variable to be maintained, because otherwise you can't have bounded semantics, so that gets contended too.

So what do we do to avoid this problem? How do we not use queues; what do we come up with as an alternative? The important point, obviously, was the contention, so we needed a contention-free design. This is a really basic, cut-down view of the Disruptor. The main thing is that with the queues you're contending over particular pieces of data: the head, the tail, the size, and all the rest of it. Here, we use a ring buffer as the data structure; it's basically an array, and we keep the sequence number of the most recent thing written into the ring buffer. Like ring buffers do, it wraps around, and we just keep incrementing the sequence number and use it to figure out where in the ring buffer we are, which point we've reached. The producer, therefore, will be
writing into the ring buffer and updating the sequence number, and the producer is the only thing writing into this area. The consumer has its own sequence number: the consumer knows where it's up to, and the producer and the ring buffer know where they're up to. The consumer can read the producer's sequence number, but it never writes to it, because it isn't writing anything into the ring buffer. If that sequence is at, say, eight, the consumer can say "right, I know I can read up to number eight"; it has its own pointer, and it's read up to five already, so it stores five, sees it can read to eight, reads six, seven and eight, and then updates its own pointer, its own sequence number. It does no writing on the ring buffer at all.

Applying this to the architecture we had before, that linear pipelined process, we get something interesting: we can actually do things in parallel. Now our receiver is pulling stuff off the network and writing into the ring buffer instead of into a queue, updating the ring buffer's pointer to eight. Replication, which is going to replicate that data over to your secondary, can read anything up to number eight; it's currently at six (arbitrary numbers), so it reads from six up to eight. Journalling similarly has its own pointer which it updates. Now, we know the business logic needs both of those things to have happened before it can carry on; this is our reliability requirement, and we can't carry on with any business logic until we've both replicated and journalled. The business logic knows where the ring buffer is up to, but it can also read (again, read, not write) where the other two pointers are. The business logic will always have a number which is less than the others, and it reads up to the minimum of those two numbers, so here it reads up to six; there's a small sketch of this gating a little further on. Again, nothing writes over anyone else's sequence numbers. When the business logic is finished, it publishes into another ring buffer, so it's also a publisher for a second ring buffer, and then the network publisher can pull it off and chuck it out to the network again.

How much faster is it? We'd spent a lot of time, prior to development, benchmarking various different queue implementations, and the fastest one we found was actually the ArrayBlockingQueue, so that's what we compared against. These are two of the performance tests we ran. The unicast one is a single producer sending messages to a single consumer: the ArrayBlockingQueue maxed out at somewhere around six million messages per second, and the Disruptor maxes out somewhere in the region of twenty-five million messages per second. The diamond configuration is very similar to what we do in our actual application: a message comes in, some bits can be done in parallel, and then it comes back to a single point where the business logic runs at the end. In that diamond configuration the ArrayBlockingQueue maxed out at somewhere around a million messages per second, whereas the Disruptor maxed out at about sixteen million messages per second. These numbers are quite old, actually; we've got quite a bit faster than this with a number of other tricks we've since tried.
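Here is the small sketch of that dependency gating mentioned above (hypothetical names, not the real Disruptor API): the business-logic consumer only advances up to the minimum of the journaller's and replicator's sequences, and each sequence is only ever written by the thread that owns it:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the gating rule described above: the business logic may only consume
// entries that both the journaller and the replicator have finished with, i.e. up
// to min(journalSequence, replicationSequence). Each AtomicLong is written by
// exactly one thread and merely read by the others.
public class BusinessLogicGate {
    public interface RingBufferReader { Object entryAt(long sequence); }
    public interface EntryHandler { void onEntry(Object entry, long sequence); }

    private final AtomicLong journalSequence;      // advanced only by the journalling thread
    private final AtomicLong replicationSequence;  // advanced only by the replication thread
    private final AtomicLong businessSequence = new AtomicLong(-1); // advanced only here

    public BusinessLogicGate(AtomicLong journalSequence, AtomicLong replicationSequence) {
        this.journalSequence = journalSequence;
        this.replicationSequence = replicationSequence;
    }

    /** Process every entry that is safely available and advance our own sequence. */
    public void drainAvailable(RingBufferReader ringBuffer, EntryHandler handler) {
        long availableUpTo = Math.min(journalSequence.get(), replicationSequence.get());
        long next = businessSequence.get() + 1;
        for (long seq = next; seq <= availableUpTo; seq++) {
            handler.onEntry(ringBuffer.entryAt(seq), seq);   // read-only access to the buffer
        }
        if (availableUpTo >= next) {
            businessSequence.set(availableUpTo);             // we are the only writer of this
        }
    }
}
```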
But that's not the important bit, not the bit we really cared about. What we were more interested in was the latency side of things. These numbers are in nanoseconds, and smaller is better. For the mean, the ArrayBlockingQueue came in at about 32 microseconds, and the Disruptor at 52 nanoseconds. That's probably a best case, probably two threads running on the same socket but on different cores, and you'll notice it's very close to the cost of going via the L3 cache. That's one of the things we took into the design: actually thinking about our physical limits. What is the lowest possible cost we could see for moving a message between two cores on the same socket? It's basically the cost of writing down to the L3 cache, because a thread on the same socket doesn't need to wait for the data to get all the way out to main memory. But as you start to look at the two-nines and the four-nines, you see the ArrayBlockingQueue jitter right up to about two to four milliseconds, and if you're trying to provide a system with not only low latency but very predictable latency, that's terrible. We have an internal goal of an end-to-end transaction time of one millisecond, so when a queue takes two milliseconds just to pass a message between two threads, and across the entire system we do about ten or so thread transitions, it starts to add up. Our two-nines and four-nines stay down in the hundreds, or maybe the low thousands, of nanoseconds, and that's really important for us.

So, how does this all work? Where does the actual concurrency stuff come in? What I'm going to show isn't the actual implementation of the RingBuffer, but it's very similar and the principles are the same, and it comes down to the ordering and visibility stuff we talked about at the beginning being the difficult bit. I've got two methods, one to publish and one to get; you'll notice a similarity to the queue implementation that Martin and Ben showed today, so this is a potential alternative. What happens when we want to publish a value out to another thread? The value comes in; we're holding a next-value pointer, which is just a plain long; we increment it and say that's the index we're going to write to (we start at minus one, so the first index value is zero); we then do a mod of that value and stick the entry into the array at that point; and then, at the very end of the method, we set the sequence to that value, saying "this value is now available to be read". So for the first message: nextValue started at minus one, got updated to zero, we wrote the entry into the array at zero, and then we said zero is now available to be read. On the other side, the reading thread comes in and has to pass in an index; it's required to keep its own index, "what am I interested in seeing?". Say it passes in zero: it checks whether the index it passed in is less than or equal to the sequence; if it is, it does a modulo into the array, grabs the value out and returns it; otherwise it just returns none. There's a lot more going on in the real Disruptor, but this is kind of a simplified version.
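A reconstruction of that simplified ring buffer might look like the following (a sketch in the spirit of the slides, not the real Disruptor source): the only concurrency primitive is the volatile sequence field, the entry is stored into the array before the sequence is published, and the reader never writes to anything the producer owns:

```java
// Sketch of the simplified single-producer / single-consumer ring buffer described
// in the talk (not the real Disruptor source). The only concurrency primitive is the
// volatile 'sequence': the entry is stored into the array BEFORE the sequence is
// published (release), and the reader checks the sequence BEFORE reading the array
// (acquire), so the data is always visible by the time the index is visible.
// Note: no wrap-around guard here; the real Disruptor also gates the producer on
// the consumers' sequences so it cannot overwrite unread entries.
public class SimpleRingBuffer<E> {
    private final Object[] entries;
    private volatile long sequence = -1;  // highest published index, written only by the producer
    private long nextValue = -1;          // producer-local, no other thread touches it

    public SimpleRingBuffer(int size) { entries = new Object[size]; }

    /** Called by the single producer thread. */
    public void publish(E value) {
        long index = ++nextValue;                         // claim the next slot
        entries[(int) (index % entries.length)] = value;  // store the data first...
        sequence = index;                                 // ...then publish it (volatile write)
    }

    /** Called by the consumer; returns null if the requested index is not yet available. */
    @SuppressWarnings("unchecked")
    public E get(long index) {
        if (index <= sequence) {                          // volatile read first (acquire)
            return (E) entries[(int) (index % entries.length)];
        }
        return null;                                      // "none": not published yet
    }
}
```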
So how does this work? Well, the only real concurrency construct here is volatile; that's the only concurrency primitive we've included. What happens when we write to that volatile field? This is what's known as (I'll probably get the exact term wrong) an atomic ordered release. What that means, from a Java memory model perspective, is that any other store that occurs in our code before that point may not be reordered to after it. Neither the compiler nor the CPU can take any of that logic and reorder it to after the write to the sequence. If they could, this would never work, because the sequence might be updated before the data value was set, and then we'd either never actually see the value or we'd end up with corrupted data. For example, if those two were switched around and the sequence was updated before the data, you could come in on the reading side, pass the condition, and end up fetching something out of the array that wasn't actually there yet, because the writing code hadn't reached that point. So a write to a volatile variable is an atomic ordered release, which means none of the writes that occur earlier in the same piece of code can be placed after it.

The other side of this is that reading from the sequence value is what's known as an atomic ordered acquire, and that means no other reads that occur after it in code order can be ordered ahead of it. If we were allowed to reorder those two lines, say the compiler changed our code underneath to load the entry from the array into a local variable before it executed the sequence check, maybe because it was doing some speculative execution and decided, "actually, I might have a branch misprediction here, so if I do this bit first I can get some more efficiency out of it", things would break; the read from a volatile sequence tells the CPU and the compiler not to do that. So because I put the data into the array before I wrote the sequence value of zero, I know that if I read that sequence value afterwards I can safely get the data out of the array. And that's pretty much all the concurrency constraints we have inside the Disruptor.

This is one of the other principles around why you want to avoid contention: when you go down to having only a single writer for a piece of data, it's not a case of taking locks so that only one person can write; you only ever have one thing attempting to write, and your concurrency code gets much, much simpler. It's so much easier to reason about and write this code when I know there will only ever be one thread writing; if I've got two threads trying to write, this code breaks. I've gone for the simple model of a single-threaded publisher, and the amount of extra overhead you have to build in to make the multi-threaded publisher case work is actually quite high, to the point where, if you look at the Disruptor, we provide a pluggable strategy (an enum) for the two different cases. If you know you only have a single-threaded writer, you can say you only ever expect a single thread to write into this and get much better performance out of it. So, to take a step down from that: how does the CPU actually ensure the right things happen?
This is where we step into a little bit of assembly code. This is the code that's effectively generated from that publish method. At the very top you've got a mov and an add; this is where we're incrementing next value to get the new value. Then we're getting a reference to the array and storing the data into the array. As people have talked about previously, x86 has quite a strong memory model: the Intel CPU will not reorder stores with respect to other stores, so the addition of the next sequence, the store of the data into the array, and the store of the sequence number (which is what each of those instructions is) will not happen out of order; x86 will not reorder those. The final instruction is the interesting one: this is basically the lock-prefixed instruction used to publish the data out to all the other cores. Remember the model of the CPU where the data could be anywhere in that hierarchy of caches or in the store buffer: when you execute this instruction it flushes the store buffer, marks all of the other caches' copies of that data as invalid, and waits for that to complete before returning, so you know the data has been published and is visible to all the other cores in the system. The interesting thing is that it's really expensive compared to the others: those are normally only a few cycles, because they just stick a value in a buffer, whereas this can be 100 to 130 CPU cycles. When I showed the example of writing to a volatile, even on a single thread, taking 15 times longer, this is the culprit; this is the thing making it slow.

[Audience question] I haven't tried it on that chip, but the code for this is on GitHub as well, so if anybody's interested, let me know and we can try it out and see what the assembly looks like; I think you'd end up with something very similar.

If we get to the read side, there are no magic concurrency instructions in here at all. On a SPARC, or some other weak-memory-model system, if you were reading a value wrapped in a volatile you would actually have to do what's known as flushing the invalidation queues. On some CPUs, when you execute that lock instruction (or something very similar to it), it doesn't go and mark every single cache line invalid; it marks its own ones invalid and then queues up, on all the other CPUs, a message telling them to invalidate their cache lines. So on the read side you'd have to issue what's known as a read barrier, and that read barrier would flush the invalidation queue, marking all those things held in the various layers of cache as invalid, so that the next time you tried to read the value you'd have to go all the way out to main memory, or maybe L3. But with Intel, all you have to do on a volatile read is make sure you're not reading it from a register, so here the read of the sequence field pulls it in fresh, and again, because Intel has this very strong memory model, as long as the instructions are issued in this order it will not reorder reads with respect to other reads. Just having the instructions in the right order is sufficient on Intel to make sure we haven't violated those ordering constraints: we can't flip the order of those reads around the read of the sequence value. So, one other little trick that we ran into.
The really interesting bit is that the lock instruction is not actually required. That was a little stunning realisation, unfortunately not by me but by my boss, who figured this out. There's this funky little method on the atomics called lazySet. As I mentioned, we have that ordering constraint inside the Intel CPU for both reads and writes, and that's all we actually have to ensure: that when we write the data out we write it in the right order, and when we read it we read it in the right order. That's all lazySet does: it just ensures the ordering, it ensures that the things we've written prior to that point (the index and the value of the data) happen before it, and it skips the lock instruction. That gave us probably a factor of two or three performance improvement over the figures I showed previously.

[Audience question] Yes, it will do. The one thing to be a bit wary of is that this isn't defined in the memory model; it's not defined in JSR-133. It is, though, documented to work, because Doug Lea wrote it and says it works, and it follows the same principles as an ordered release in the C++11 memory model. If you read the JSR-133 cookbook, which explains how to implement the various operations, this was added just afterwards, after they'd finalised the memory model, because he needed it to get some of the fork/join work going. So it does work. I believe a couple of JVMs have had problems with it; I think there's actually a bug in the JVM on Mac with regards to lazySet. But all of our tests pass with it, we run quite a lot of heavy performance tests, and this code has been running in production in our exchange for about four months now, and we haven't seen any issues with it yet. It's an interesting one, one of those things I never really knew about until Martin found it.
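Sketched with a plain AtomicLong (illustrative, not the Disruptor's own sequence class), the lazySet trick looks like this: the data is written first, then the sequence is advanced with lazySet, which preserves the store ordering we need but skips the expensive lock-prefixed instruction that a full volatile write would issue:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of publishing via AtomicLong.lazySet (an "ordered" store), as described above.
// lazySet ensures the array store below cannot be reordered to after the sequence
// store, but it does not pay for the full memory barrier that a plain volatile write
// (set) would issue, so the publish is considerably cheaper on the producer side.
public class LazySetPublisher<E> {
    private final Object[] entries;
    private final AtomicLong cursor = new AtomicLong(-1); // written only by the producer
    private long nextValue = -1;                          // producer-local

    public LazySetPublisher(int size) { entries = new Object[size]; }

    public void publish(E value) {
        long index = ++nextValue;
        entries[(int) (index % entries.length)] = value;  // data first
        cursor.lazySet(index);                            // ordered store, cheaper than cursor.set(index)
    }

    public long highestPublished() {
        return cursor.get();                              // readers simply read the cursor
    }
}
```

As the talk notes, the exact guarantees of lazySet sit outside JSR-133 proper, so this is a sketch of the technique rather than a formally specified recipe.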
We did a number of other tricks to try and speed up the Disruptor as well. One of the things that always puzzles people when they first look at the code is that we have some weird, extraneous-looking things in there. The reason is that when you load things into the different levels of cache, you're not loading individual items, you're not loading a single variable: you're loading a whole cache line, probably about 64 bytes of contiguous memory, into each of the caches. So say you want to read your head, and you've got one thread writing to the head and another thread writing to the tail, but they're actually sat on the same cache line. When thread one writes to this variable, and thread two then tries to write to the tail (two totally different variables, not the same thing at all), the cache line has to go all the way up through main memory to the other core, and when the first thread goes back to try and write to its variable again, its copy has been totally trashed because the line has moved over there. You end up with this ping-ponging of the cache line between cores even though they're two separate variables: that's false sharing on that cache line.

Understanding this, and understanding that it was a potential problem, meant we had to come up with a solution for our sequence numbers. Obviously those are key; we don't want false sharing with anybody else's sequence numbers, and we've just said that only one thing writes to each of them at any time. So we make them independent with a bit of a hack. It's not nice, but it's a very effective way to pad the sequence number out so that it sits in its own cache line: we declare a bunch of padding variables and reference them from a public method so that HotSpot doesn't optimise them away. Very usefully, we've had assurances from a couple of people who work on VMs that if you've got a volatile variable and a public method referencing the padding, there is no known VM that will actually optimise those values away, though it's still not particularly well guaranteed. Doug Lea has actually talked about potentially adding an annotation, @Contended, so that if you have a field you never want falsely shared, you put @Contended on it and underneath the HotSpot VM can figure out what the cache line size is and pad it out for you; that would be a much better way to do this. Ours is nasty, it's a trick. In terms of throughput this wasn't a massive outright gain, but it gave us much more predictable latency: when we didn't have it, we would see spikes in our latency every now and again as two threads ended up contending on the same cache line.
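A sketch of that padding hack (illustrative; the real Disruptor's padded classes are laid out somewhat differently): unused long fields surround the hot volatile value so it tends to sit in its own cache line, and a public method references the padding so the JIT is unlikely to strip it out. As the speakers note, this relies on unguaranteed VM behaviour, which is why the @Contended annotation mentioned above would be a cleaner mechanism:

```java
// Sketch of the cache-line padding hack described above (not the exact Disruptor
// source). The padding longs on either side of 'value' are there so that no other
// hot, independently written field ends up on the same 64-byte cache line, avoiding
// false sharing. sumPadding() exists purely so the padding fields are referenced
// from a public method and are less likely to be optimised away.
// Caveat: the JVM is free to reorder fields in memory, so this layout is a hopeful
// convention rather than a guarantee; @Contended-style support is the cleaner fix.
public class PaddedSequence {
    private long p1, p2, p3, p4, p5, p6, p7;       // padding before the hot field
    private volatile long value = -1;               // the actual sequence, written by one thread
    private long p8, p9, p10, p11, p12, p13, p14;   // padding after the hot field

    public long get()       { return value; }
    public void set(long v) { value = v; }

    // Referenced publicly so the padding fields count as "used"; never meant to be called.
    public long sumPadding() {
        return p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8 + p9 + p10 + p11 + p12 + p13 + p14;
    }
}
```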
And in summary: one of the key things I always want to bring up is that concurrency and parallelism are tools; they're not the goal. You should be focusing on your problem and solving your problem, and asking "if I need to make this faster, what is the best solution?" Maybe it's a concurrent solution, a message-passing solution; maybe it's going parallel; maybe it's neither of those and it's something else. It's a tool, not an end in itself. Ordering and visibility are probably your biggest challenges, especially when you're trying to write really high-performance code, avoiding locks and doing the non-blocking kind of thing. If you care about performance, the details are very important. And don't believe everything you read; that was one of the first things we learnt. Come up with your own theories and test them, apply a little bit of science. Some of the traditional ways of doing things are just a good way to make your code more complicated and possibly slower. That's it, thank you. Questions?

[Audience question] Yes, it does. We make them very, very large deliberately, because we actually have a reliable messaging system based on them: the idea is that we publish each message out with its sequence number, so if the receiver on the other end sees a sequence that isn't directly incrementing, it can NAK back a value and we just rewind and replay those messages out. We deliberately make them big for that case. To be honest, if you're just using it as a barrier between threads you don't need them to be that big; it just needs to be sized sufficiently to deal with bursts in the system. That's the whole point of having a queue, the whole reason you want a bit of a buffer: if you get a burst of traffic you've got somewhere to put it, so you're not slowing down the producing thread; it can just stick the messages somewhere and the receiving thread can pull them off.

One of the other things we do in the Disruptor, which I haven't really talked about, is an efficient batching mechanism. Take the case where something like your journaller is getting behind: when we read the sequence value, let's say the journaller is stuck at number 10. It's just finished writing number 10 to disk, but in the meantime that took about ten microseconds (it's got a fast SSD or a SAS RAID array or something like that), and more messages have arrived, flooding in off a 10-gigabit network. By the time it comes back, it sees the sequence number is now 30. It wrote 10, so it can iterate directly across from 10 to 30, batch those, potentially into the same block, and write them to disk in a single operation. We manifest that in the API in the form of a callback: the way you write code against the Disruptor is that there's a thing called an event handler which you implement, and it gives you the message, the sequence number it was at, and a boolean flag telling you whether you're at the end of the batch. So you basically keep taking messages and adding them to the block until you've filled the block or the boolean says you've reached the end of the batch. That's one of the other things we do with buffering. I don't know of any actor-based or event-based framework that does that; I think the Disruptor is the only one I've seen do it, and it's absolutely essential for low latency and high performance, especially when your individual message rates are higher than the throughput rate of a component like a disk or the network. You have to have some way of batching efficiently, because, for example, a disk write of 4K will always be a 4K write, disks always write in the same block size, so if you're writing 4K with only 100 bytes of payload in it every single time, why not write 4K with 4K in it? The same goes for MTUs on a network.

[Audience question] We've done all of our performance tests to deliberately create no garbage. One of the other things that's hidden in there, which we haven't really talked about, is that you can use the Disruptor in a way where you don't allocate anything at all while it's running: when you start it up you can provide a factory that pre-allocates all the entries, and this is what we do in our production system. We basically have a message bus that gives us a byte array, and we copy that data into the pre-allocated byte array sitting inside the Disruptor; the way the API works is that it hands you the object you can write into, and you copy your data in rather than allocating a new object each time. Most of our performance tests work like that, so we avoid any jitter from the GC, and we haven't really bothered testing with different GC settings, because you can actually build systems this way that are completely garbage-free. Now, not everybody wants to do that; we're in a bit of a niche where we're really chasing this low-latency stuff, and not a lot of people like that model. But when you're using things like the ArrayBlockingQueue and the standard JDK collections, you don't have a choice: you're forced to go down the slightly more garbage-heavy route of allocating and using immutable objects. The ethos of the Disruptor was to give you more choice: you can go down the zero-garbage, reuse-the-objects route, or you can have an entry which just holds a value and use immutable objects like you might be more familiar with, in which case the GC behaviour doesn't end up much different. And with regards to the Disruptor, I think we're done. All right, thank you very much.
Info
Channel: JAXConference
Views: 29,369
Rating: 4.9631901 out of 5
Keywords: Disruptor, LMAX, concurrency, memory barriers, cache coherency, trisha gee, mike barker, java, jax, jax london
Id: DCdGlxBbKU4
Length: 45min 19sec (2719 seconds)
Published: Fri Dec 02 2011