Netty - One Framework to rule them all by Norman Maurer

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
so first off thanks for coming it's the last talk I know I guess you're already hungry and want to go home anyway but I hope this one will at least be a little bit entertaining and show you a little bit why I think Nettie is a good framework and why you probably should use it so the title of the talks a little bit propagating I guess um it's called like one framework to rule them all it's not like that I think that all frameworks are bad except Nettie it's more like that most of the times you may even use Nettie and don't know it because many frameworks and network frameworks are built on top of Nettie already so if you for example use word X use Nettie if you use something like play use Nettie if you use Cassandra use Nettie and so on and so on so I want to show a little bit of the bottom journals and why I think it makes sense to build something on top of Nettie so first off something about me my name is Nomura I'm leading the Nettie project so hmm I started to work on it like five years ago or six years and heating it now for about one and a half years I was just like as Apache Cassandra mfb a MSBP which basically means I some very valuable person whatever that means I wrote a book on Eddie's so if you want to support me by it I'm a member of the pep Apache Software Foundation and also a commuter in the Eclipse Foundation so first off I want to talk a little bit about neti itself and how it involved all the time so it was a very very long journey since then and I hope it will go on for this for a little bit longer so Nettie 3 was released in 2008 which doesn't sound too long ago I mean it's eight years it's still a long time but basically um it was started even before the so what started like I think 13 years ago which is a long time nettie for which was the first new major version was released like three years ago which had a very different API s and different characteristics like in performance and all the kind of stuff I will talk a little bit about this and what we change to make that happen then we released 94.1 this year which basically means or provided on top of what nettie for that old provide like HTTP to support and other protocols which I think are very important these days it's one of the most used Network frameworks for the JVM so if you don't use niño directly most of the people use neti or framework on top of neti it was founded by trustingly on when he was still arms duty he works now at line these days which is a big messenger company which builds a big messenger up for Asia and it was first a JBoss project and then it got independent so basically when trustingly left Red Hat it got well outside of red head and is now standalone on github so net III I mean that III works pretty good I think in general and it's still widely used but it has a few problems that we kind of solved in 94 or made less worse so first of 93 created too much garbage what does it mean so basically we're creating too much objects which some people may think it's not a big deal because the garbage collector would just collect them and all the kind of stuff but if you start to push stuff pretty hard it can become a very very painful thing we made too many memory copies so basically what happened in 93 was most of the times you allocated something arm then basically on the heap you get a heap buffer then you want to write on the socket so we do memory copy to direct memory to hand it over to the socket so you do a lot of memory copies it has no good memory pool included why is this important because if you want to write high-performance Network applications most of the times you want to use native memory like direct memory which is expensive to allocate and also expensive to deallocate I will talk a little bit more about this why it's expensive and why you need to pool to make it work it was not very optimized for Linux based OS what does that mean I mean you think like Java has n io abstraction which works on Windows on Linux on OS X on whatever which is fine but sometimes if you want to have like the last person of performance or even want to have features which are only there on wood knocks and Nettie's we didn't provide this because we was just using the nao abstraction that changed a little bit of metaphor and the threading model was not really easy to reason about why was this a case basically in 83 every time you wanted to read something from a socket we did this in a non-blocking fashion so we had like a thread which runs in an event loop basically and just whenever something some socket or file descriptors ready we read some data from it and then pass it through the pipeline where you can basically do all the processing transformation all that kind of stuff when you write something we basically write it through the pipeline again until it hits the socket and then write it through or through the well through the system call to the network to the kernel space in the work network and all that kind of stuff the problem here was really that inbound traffic or inbound processing happened in the either in the event loop so always in the same thread but for outbound for Reince that doesn't that was not the case so basically always this was done in the calling thread which means it was very very difficult to reason about which threads operate on which data in which sequence so we changed that in 84 I would also talk a little bit about more about this how we did this no problem here really is it worked pretty good so net III is still used and people don't upgrade which is a pain for us because we need to maintain it we just stopped to maintain it so it is three end of life since like four months I think but people still use it because it just works so please upgrade so Nettie for now we create less garbage which means the garbage collector doesn't need to run so frequently I will talk a little bit about this later how we do that we have optimized transport for Linux based operation systems we use that by doing Jane I I will talk about this as well a little bit later because Jane is fun we have a hyper phone buffer pool which is based on the chamberlock paper so basically we re implemented gemma log in java for direct memory and it has a very very well defined threading model everything like inbound and outbound have been always on the same thread so we don't need to carry care about synchronization which is great so it's really very easy to write your stuff it's more like note I would say but with better performance in Java so to give you a little bit like an idea how Neddie works this is how the concept works of Neddie so Neddie you have a channel the channel itself is basically like the abstraction up on top of the socket so you can do a write which goes down to the socket you do right or right V's this call and pass data then you basically have remote address local address and all that kind of stuff so all the information so basically it's like a connection if you use TCP then with each generally of a general pipeline the general pipeline is basically just a doubly linked list of different general handlers which contains general handlers the general handler itself can process inbound and outbound events so what this means is basically every time an event which is in Bound which may be for example we just read some data the new connection was just accepted this kind of events I mean this happens basically we start on the head of the pipeline walk through it pass it through each channel handler where you can basically do something with the event if you want to like you could lock a new connection you can transform bytes to you could whatever you want basically if you do an outbound operation like a right we start at the tail of the pipeline and fun flow from the tail to the air to the beginning to the head to the channel and once we reach this we basically do with this call this allows you to basically implement like insert an interceptor pattern where you can do all the processing logic business logic just in pipeline reusable parts so you could use this and we use this this way Neddie to basically write and codec for a network protocol so our HTTP codec is basically general inbound general outbound handler which means once we wrote this if you want to support HTTP you just put these two handlers in the pipeline you're done the same is true for SMTP for storm for whatever so the whole whole magic from transforming bytes to some poachers that you can handle happens in the pipeline there's also other interesting stuff that you can do for example you could count connections close them metrics all that kind of stuff and all the frameworks that use native build on top of this for example vinegar is using this to to do all that like surfer filter Cheney vertex is using that to dispatch that to their buffer handler or body handling all that kind of stuff so you can do a lot of interesting things here and to give you a better example about it's basically it's just like pipes if you ever use UNIX it's like pipes but in both directions so you can just chain different commands and alters the output and get something at the end and you can do that inbound and outbound in both directions basically so garbage so I want to talk a little bit about garbage collector and why it's a problem so the cat is the garbage collector the kit is basically your process that runs away because the garbage collector want to do all the kind of stuff all the time so what we did in 83 was basically each event or each network event was a Patrol which made it very easy so basically we only had like two methods for inbound we had like something which was called armed upstream event or handle upstream event so we've just had a person approach oh did some incident objects and no like this is a matter of message event this is a connective and all the kind of stuff the same we had for outbound this is a ride event this is a connective end and all that kind of stuff the problem here was really every time you wanted to do a network operation you create a new port well only for this even without doing creating one for the data of transforming and all that kind of stuff which sums up pretty pretty heavily if you handle a lot of concurrent connections even verse it's not very easy to optimize this for for added jitter for the JVM because you're calling the same method with different porches all the time do all kind of instant objects which may well have a very very complex class hair Sheen or the kind of stuff so it's not very friendly the other thing that we did for for help with this we introduced a very light white object poor I'm not saying that you should do that only do that if it makes a lot of sense don't pull every object but there are a lot of objects and Neddie that we reuse all the time well that we only use for a very very limited subset in the same thread all the time for example if we have codecs we want to transform bytes to multiple objects and put it in the list we don't need to allocate a new list all the time because the lifetime is very limited we can just basically store that and reuse the same thing again later the same is true if we do like did the porters before to call the method if we now have replaced it with direct method calls that we have now so basically each event is now like we have a general read method which basically is called every time you read something from Network layer then we have a channel active which is called every time a new connection happen we have a general inactive which is called if the file descriptor close and all that kind of stuff so you can make a big impact if you just save a few objects and there's just like a week ago there was a very very interesting issue open up and Eddie we're basically Twitter was testing how much difference it makes to either use this on small object pool or not use let's just garbage collector do a lot in this set like it's four times more CPU to not use the object pool for the use case but we even allow you to basically disable it if you say like I want to just have the garbage collector do everything for us yeah like I said we use J and I I know most people hate J and I I also kind of hate Chennai but I do it so you don't need to do it so there sometimes things that you can only do with Jane I at least at the moment so for this kind of stuff it makes a lot of sense so the problems here really is um if we talk for example about the nao java and io package which are basically the socket abstraction it's very general proposed which is a good thing right because it should work on every platform and all the kind of stuff it should work with every threading model and all that stuff we don't care too much about this entity to some extent because we have for example a very very well defined threading model so we know which thread will call something on one socket um we want to support advanced features that you don't get on the JDK it's at the moment I will show you some of the features in a few seconds we can do directly operate on pointers for buffers so we basically can get the memory address pass it over which is a lot cheaper than just pass over object for example on and do a lot of interesting things and everything like this is built for you you just include the jar and it works at the moment it only works on Linux so we only have extra transport for Linux which are optimized yesterday the body of mine just opened another PR to support the same 406 4kq basically if you are on Windows you can still use nao the API is the same it's just hidden from the user is just another dependency that we basically provide and here you on the left side you basically see the code switch from one to the other the first one is using the niño so what you see here is basically you have a bootstrap Suzette's a client you bootstrap a client you set the croup let's event loop group the event loop group basically has like it's like a thread pool which has a lot of threads which basically process different channels which are assigned to them like processing events per connection you say I want to have an niño event loop group then you say I want to use the NI Osaka general class basically so this is telling it use an AO Java and L if you want to use the native transport and you are on linux you can just say exactly the same and just alter two class names basically and that's it it's getting a little bit harder if you use features which are only provided by the native transport then you cannot change so it could also easily anymore but it could change to the new KQ transport once it's merged so what what do we provide in the native transport to make it interesting I'm not sure how many of you know about all the network socket options that are listed there so I will explain a little bit what they are doing to give you a better idea why it makes sense to use them so there's SOE use port it's completely different to si reuse address so SOE use port let you bind to the same port multiple times with different threads basically with different sockets why you want to do that basically you let the kernel and the operation system handle for you the load balancing between threads which is super nice so if you have an application which needs to accept a lot of sockets concurrently using only one thread maybe a bottleneck because you cannot accept fast enough and then basically the SOP a clock that you can set get populated think you get too too long and then basically everything just times out so you could fire up like four threads and let all the four threads basically accept different connections on the same part another very very interesting thing that you can do with that is if you write in DNS server you could accept Datagram packets on the same port with multiple threads that was actually the use case for which it was developed first by Google I think like four years ago when it was merged in the car but you can do a lot of interesting things with this so this one is very interesting you could even well start the same process multiple times and bind it basically the same port the only problem here is Java basically is that you would need to reserve the heap multiple times so it may not work out too well but their use cases TCP Korg TCP Korg allows you to basically arm send in a performant way like for example if you have h TTP you would have the headers in a buffer and the content in a file and you could send it like with very efficient with only one round trip post together so let's super nice if you want to say if you want to basically serve files TCP not send elevate is mostly useful for example if you use HTTP - so you can basically tell or give the kernel a little bit of a high end when it should buffer data and how much before it really flush it through the network layer on TCP fast open TCP fast open is very interesting if you have full control over both sides of the connection like the client and the server because if both support you can basically in the first handshake you can already send data if you use this together with SSL you basically can save a round-trip for the handshake but it only works if you have full control over both sides so if you for example at the client app in surfer app and post running the damned same data center but you still want to use SSL TLS that's super interesting feature because it can really reduce latency and TCP info tcp info i think is a very well-known feature of linux it's very old actually um it's very interesting so with TCP info you can basically get metrics on a file descriptor and I think it's certainly two different metrics basically so you can get everything that you ever never wanted to know but also interesting stuff so you could have you can get for example metrics about round trip times arrows on the network interface where the fighters script is on and all that kind of stuff and what you can do with this kind of things it's a very interesting thing for example you can write a TCP or client connection pool and can choose the connection based on the round-trip time to the host that you want to talk to so that's super interesting I mean it's very advanced but you can do a lot of stuff um they are also like you can do health checking stuff and all that kind of things if you're really interested about what you all can do about this just type man TCP and in the main page it's described everything it's basically we just pass through what links provide so everything that you read here you can read up in the main page and get more details about this if it's interesting to you and another nice thing about this is because we can act directly or because we call directly in chain I am just pass over like memory addresses we don't need to create too much objects but we can do that because it's very isolated so everything of this is hidden from the end user which is nice because it's very easy to crash the JVM if you do something horrible wrong with memory address and chain I especially chain I it's a lot of fun to crash the JVM all the time buffers um yeah I want to talk a little bit about buffers so I guess most of hopefully most of you have used byte buffer before how many people liked byte buffer so either no one is using pipe buffer or no one likes it okay so the problem with byte buffer is really the interface if you ask me it's not very user friendly because why first off you only have one index position basically in a limit and if you fill a buffer to read you basically need to flip it before which basically sets position to zero again in the limit to where the position was for before and then you can read it it's very easy to forget this so whenever you debug an application which use byte buffer and you wonder why you don't write anything on the network to a socket properly because you didn't flip the buffer another thing about byte buffer is it doesn't ship a lot of utility masses which are useful for example if you often if you write network protocols or network applications you need to iterate over all the bytes in one buffer doing this respite buffer is super expensive for multiple reasons first off every time you do a get it does a range check right so basically it takes like if you access index I is it between position between position limit whatever right so that can be super expensive we do love them like iterating over I don't know a buffer with ten thousand bytes for example in neti itself we have our own buffer implementation which helps you with this use cases so for example we have something which we call for each byte which allows you to basically pass in a byte processor which is called until you return false so basically every time it just gets the bytes in one pipe every time when you return true it keeps continuing the loop when you return false it's done so we only need to do like the index check one time which can makes a huge difference so when we introduced this for example we had before using just like get back get back get byte in our HTTP implementation and we use this it was like a 20% performance improvement just to switch to something like this because we need to iterate over a lot of bytes then what we have in Eddy in pipe off is we have different indices for reading and writing you don't need to flip basically if you write the writer index increase if you read the reader index increase once both on the same position you cannot read any thing anymore which i think makes sense by prefer itself you can't extend by buffer in the JDK because it has a package private constructor which sometimes can be painful because if you want to basically have a few on multiple byte buffers it's not possible basically if you say well I may have multiple byte buffers you need to pass in an area of byte buffers and need to iterate over them by yourself do all that kind of stuff in Eddy we basically have a composite pipe buff which allows you to wrap multiple by buffs but have the same abstraction over them so you just operate on the same abstractions before it just serves you with multiple well extra methods like you can get by buffer at specific index and all that kind of stuff I can add new ones to them which makes it super easy to say like okay now I want to compose a packet of three byte buffs just wrap them api's the same the best thing about by buffing area is its reference counted what does that even mean you are you need to basically release them so you're writing Java like you write C which sounds scary it's a little bit scary so let me first explain you why we are doing this so what does Java doing so for if you don't want to write to a socket you basically need to have a direct buffer which is off eeep which means it's native memory some at some point this native memory needs to be released right because otherwise you will get out of memory so the so Java itself tries to give you the illusion that you don't need to care about this kind of stuff which is nice that's why we write Java right we don't need we don't want to free stuff or manage our memory the problem really is here just think about what I said it's handled by the garbage collector the garbage collector runs once you're short on here memory but we allocating outside of the heap memory which means it may never run may run too late or if you're lucky enough it runs well in a good time frame so you have no good idea about whenever something will run too free of your direct memory and now comes something really scary if you want to have a little bit of fun check out the open JDK look into the allocation de-allocation parts of direct memory of the Reg pipe buffers so first of what they had I think they recently changed it in Java 9 actually so if you want to allocate direct memory it needs to go to a static synchronized method that's because they want to keep track how much direct memory you use because there's something like max direct memory that you can specify on the command line and once you go well over it is basically throws on out of memory exception or out of memory error but it's static synchronized what does it mean if you have a lot of threads which basically allocate together all the time which is very very often the case if you write network applications then everything will just block each other that's bad now comes the allocation part so the D allocation is also static synchronized which is also bad if you run short of memory now comes a catch it does a thread sleep 100 well first it well first it cost C system GC to tell the GC now it's time to run then it does this read a thread sleep on a thread sleep 100 and says now okay now I waited 100 milliseconds an hour we'll try it again to allocate which means you could have a 100 milliseconds pause when you allocate which is super bad if you write something which is kind of latency sensitive you may say well if it's latency-sensitive you should not write in Java but let's not do any more but 100 milliseconds powers for this kind of stuff could be horrible in production so that's why we do reference counting so what we do in Eddy is basically if you allocate something you get a bi buff it has a reference count of 1 once you are done with that you call release then it gets a reference count of 0 you can also call retain which means it gets incremented by 1 so it can pass it along this kind of stuff so typically like you would do and see basically once it's 0 we D allocate it for you what we do with native memory is basically we get a reference to the cleaner don't do this don't we do it by yourself so we get a reference to the cleaner call the clean method or in Java 9 the run method and then basically this one will call well all that kind of the allocation code that the garbage collector normally does that works out kind of good the problem still is it static synchronized so you still have the problem so now we have two solutions for that Neddie the first solution is basically use the memory pool which I will talk a little bit about later which means every time you call release and it's in the reference count goes to zero basically you put it back in the pool you can reuse it so it will never go to the static synchronized stuff but you need to keep the memory right in your pool which is fine or we can also if you if we detect you have unsafe we can allocate memory by ourself and don't go to the static synchronized blocks because we can also free call free by ourselves the only downside here is that basically not the same limits will be restricted that you set with the max direct memory on the command line but we provide different ones so it's trade-off because it's hard to get the releasing right I mean if you ever did see you know it's sometimes hard to free all the time you forget it all the kind of stuff and it's very very hard to debug this kind of stuff we have a leak detector nelly how did we do that so basically what we do is we have differently detector levels default a simple which basically means we will enable leak detection in one person of all your allocations and take a stack trace basically every time you elect to allocate and every time you call release so if something leaks we can tell you here you here you're basically allocated it so just check what you do with a buffer and then you need to find where you need to deallocate that can give you kind of a hind but it's not very helpful most of the times we use phantom references for this so basically we just wrap it into a phantom reference do it in reference queue and then we just pull it if you see something like this in production should still use this in production if you see this basically you should enable advanced mode what does it mean advanced basically take a stack trace arm on each allocation and de-allocation for each buffer so it slows you down a little bit but you have a better idea if that still doesn't help to find it you can do paranoid paranoid takes the stack trace on each operation on the buffer it's super expensive so if you do that in production basically we'll just everything will just crying to a halt don't do that but what you can do with this kind of stuff is and what we do in Eddy we run with this with this leak detector level every CI volt and if some leak detector report pulls up we basically fail the build so that's super handy for this kind of stuff you can even adjust it on runtime so you can do like write a chain X whatever thing and adjust it put it well enable it in production if you think there's something going on so this really helps to get you a better idea about where the leak happens what we also do is now we use direct memory by default so all the allocations in Eddy are direct by default you can also request heap memory if you want to yeah and we also have message like I don't know get it unsigned in get an unsigned bind and all that kind of stuff so you don't need to mask it by yourself so there are a lot of different message that you can take you can write by its you can write array of bytes you can write a byte buffer all that kind of stuff so we have a lot of methods that we basically expose that are very very useful I think most people that use byte buffer before needed something like read an unsigned in at some point or read an unsigned byte we provide everything like this so the buffer pooling so to give you a better idea about why it matters so here's a diagram this was done by by Twitter so it's on a Twitter blog post in the engineering block so today is using neti as well with Senegal so they are peculiar benchmark the different number of nanoseconds for allocating deallocating so all the time just allocate the allocate the same buffer with different sizes so first off with zero so why does it even take time zero basically because of the overhead I mean we have a wrapper around the memory so you see there as a different and with different sizes and I would just pick for example the biggest one because you see as bigger as the memory sizes go as more as the effect you see allocating port well direct buffer is like three times the time faster than allocating it on port so it makes a lot of difference and if your benchmark your application you will see it even in the profiler to show up but it even may make a difference for do that with sheep buffer and that's why we also support pooling heap buffers you may think well why does it matter it's it's just by Derek right which is rep so problem here is or I don't know if it's problem but it's a JVM or Java itself as part of the specification it's definite that they need to zero out the bite area every time you allocate it which is not for free so if you have a very self-contained application you may not want to do that you just may want to use the same memory all the time because also if you write C you don't do a mem set all the time to reuse binary write the same is true here you can use either pool pose none of them all kind of stuff or configurable and this is how it works so if you ever well looked at J Melek it's basically Chimel x3 in Java so J medic works like this so there's mu there are multiple threads basically that may want to allocate so first off if they try to allocate they look into into a thread cache which is read local basically in Java if there's memory or buffer already in the right size I will just pick it from there because then I don't need to do any synchronization right it's just the same thread basically if there's nothing in there I will go to some arena they call arena in the arena is basically like a tiny small allocated by itself so there are multiple arenas how much of them you can configure them in and Eddie it's basically like two times course by default so it's choosing randomly pass read and never change go there try to allocate something near inna the arena itself has different size classes so you have like if you want to allocate something which is 32 bytes you can allocate it from a size class which can be used for allocations up to 64 bytes if you want to allocate 256 you go to a different size class which support up to 512 for example and so on so on you can configure the different size class as well so it's super flexible it may happen that two threads use the same arena which C which means that basically at this point they need to synchronize right and that's why you have the thread local caches if you want to read more about this there there's a white paper and Chema log there's a blog post by Facebook about J Mello because the guy that wrote Jim melic works at Facebook so if you want to know all the details about this you can read this the change in malloc doesn't stand for Java ah it's it's Jason effing it's his name from the guy that wrote it so that nothing to do with Java it's a default allocator for FreeBSD this this day so it works out pretty well so if you want to read up on this go on the Internet you can do by yourself so this reading model I talked about this a little bit before I just want to show you some diagrams about um about 94 to make it a little bit more clearer so here's the general general is put on an i/o thread which is the event you would never change which is a good thing so you always know the same thread is operating this so you don't need to have any volatile and all that kind of stuff then the iOS read basically fires event to the channel in Bound handler general outbound handlers so this really enables you to don't care about synchronization at all because writing multiple threaded code is very hard I mean it's very hard to get right the only time when you really need to care about this kind of stuff is if you have general handlers that you share between different channels because different channels may end up on different threads if you don't have an event loop which just have one thread which makes sense I think so if you have a handler that for example counts all the connections that needs to be thread safe because it's shared but if you have a handler which is only used in one channel you never need to care about synchronized and how does it work if you call something from a non IO thread in Eddy from outside so you call general right basically from the outside we check is thread currents read the same thread that sits under the value of the channel if not we basically schedule it on the event loop so the event of itself is just an executor basically in Java the same is true for for writes and reads we change the right semantics in a t4 so net e3 every time you call general right if it ends up on the socket basically we do a socket right which may make sense um the problem really is that people tend to channel right with very very small buffers because every time you do it when you have something like if you have for example a bite output stream you call write multiple times right so we were thinking about how we can improve that for the user because I mean in POSIX there's something which is called reitv which is basically a Cisco which allows you to do one sis call and write multiple buffers at the same time which is a lot cheaper than do multiple Cisco's because this calls are expensive for multiple reasons first off for this in Java you need to well you need to basically go from Java to see through the j'ni layer which is expensive and then in C do sis call and go from user space to kernel space which is also expensive so we don't want to do too many of them so what we did in 94 on was basically the idea came from someone from Twitter that worked there from Jeff Pinner we decoupled the writing and the flushing which was one before in 83 so what we do in 84 now it's more like what you do in an output stream basically when you call write it may never end up on the socket you need to call flush if you call flush you're a guarantee that everything that sits in the outbound path recall general outbound buffer will get written to a socket at this point which means you can do a lot of interesting things it comes in pretty handy if you have protocols which were example supports pipelining so if you write an HTTP server which supports pipelining what you can do now is basically every time channel regions color is called which is when something inbound happens you call generate with the response then we have an event which is called January complete which basically said now I am NOT able to read anything more from this channel this one is triggered then you call channel flush and everything that was written before it's just written in once this call and we benchmark this and the performance different is certainly 5% if you write a lot I mean really depends on how often you call right but if you do pipelining with I mean you can do whatever like 1000 pipeline requests it makes a huge different here the catch is that you may not be able to write everything because you may explore basically the network buffer that sits on the OS but we know basically when we write we write as much as we can if we cannot write everything we basically reduce our itself on the selector you pour KQ whatever it is and once it becomes writable again it gets woken up it knows how much it needs to write because when the last flush happened and we write the rest so it's really not blocking everything like this the only downside with this approach is basically that you need to be careful to call flush at some point because if you don't do this you end up with a lot of stuff in memory right but to help you with this basically we provide support for backpressure so we have something which we call it on general write ability a general writer we change the band so every time it channel goes from writable to non writable and vice versa basically we file an event and you can check ok now the channel is writable again I will continue right now it's not writable again I will call flush in this kind of stuff and all of this depends on which settings you basically do on the channel how much bytes you're one to buffer quicker channel becomes non writable so it's super flexible you can even do more interesting things here if you but if you build for example proxy applications you can even say well now the the outbound channel from my proxy application is not writable anymore I will stop reading until it's writable again so you can build back pressure from basically two sockets which are piped together that leads me to the read semantics so net III basically what we did we use every time newer new data comes in we just fire off read events so there was no easy way to basically say I only want to read one time it does it all the day all the time well in a loop you could do something like set our set readable false then it stops but you don't have good control how much how often it will call it before it stops actually in nettie for what you can do is basically you can set something like set how to read false which means you're responsible to call generate each time you are ready to read more data that doesn't mean that something will be read exactly this moment it's more like now I'm ready to read more data if there's something there so you can really like you can preach something like reactive streams on top of Nettie way basically say each time there's a request to read some objects I will just call general read and stop doing this once basically subscriber cannot take it anymore that's basically also how how the reactive streams adapter or Neddy's is working basically which was done by a live band or typesafe or I think it's like that now by which is also used by play you can even said like how many messages per read loop you want to do because if you would just keep on reading everything that is on the channel you may get a problem if you have a very very slow remote peer because you share the same cert with different connections right if it just fills in the buffer every time a little bit until you get a treat you will never stop um you also we also have like a receive pipe of a locator where you can basically say like this is my best guess how much memory I want to use and this is a buffer I give you so there's a lot of flexibility that you can do I also read so I talked a little bit about IO threats before so basically the set of threats that do all the IO and event processing its abstracted as a vent loop which is nice in Eddy 3 basically there was no like global abstraction here basically it means you can use the same search for served one for the client why is this interesting for example again if you build a proxy application you can use the fate and same event loop of the accepted inbound soccer for the outbound socket for the remote host which means you can pass data from one circuit to another without the context which by the way in the native transport we also support splicing where you can basic just splice for one socket with another without leaving kernel space which is super nice as well that's even cheaper but you cannot basically transform data in the user space anymore the event loop itself um it extends event executor an event executor support extend scheduled executor service which means you can schedule events on the same thread that process IO for a channel why is this useful for X sample if you do right you can schedule an event and can say if this right doesn't finish in five seconds close the connection for example and again you don't need to worry about synchronization because it's the same thread so it's super handy the only problem sometimes here is that people say okay it's an executor so we'll just do something blocking here don't do this because if you block there you block everything that sits on the same thread it's not blocking right well sometimes you need to do some work out of the out of the iOS read vicars sometimes you need to plot because maybe you need to call some API Oh which doesn't support non-blocking like JDBC or whatever what a filesystem there are a lot of API is it specially in Java because Java was not built with as in currently in mind right so there are a lot of API switch are still blocking for this we provide something which we call event executor group so the event executor group is the same like an event rule basically you have multiple threads which process stuff so it's also just a thread executor well thread pool basically the interesting thing here is basically that if you build up a pipeline in the pipeline you put all the processing logic like different handlers to process like bytes and put them to pocho business logic you can specify as well in event executors which means if i want to handle something in this handle i do it on this event executors which means basically you can move stuff from one thread to another which is the event loop for example to an event executor you can even say well for this handle i want to use this event executor for this and i want to use this event executor and for this one i want to go back to the event loop so we can do all this kind of different things the same stuff event executors for event loop like I said before it's also basically extending scheduled event executor scheduled executor server so you can also schedule on this we have different implementations the default implementation basically also preserve order for you because the problem is if you have a protocol like HTTP even if you offload stuff to a thread pool you to preserve order for the responses because it's not expected to basically if you do two requests to first answer with the second request we do all the ordering for you if you don't care about this we also have an unordered event executors group which could be handy for Datagram UDP because you don't care about this kind of stuff or for example Cassandra is using this Cassandra is using something like this because they already have their message they're poachers basically which are their cql whatever messages and they don't care about ordering there anymore because they were just multiplexing on different threads and put everything together because they have a message ID so you can do this just with Nellie without any extra work Jane I've a Cesc SSL engine so most people still think if you want to do SSL you cannot terminate on Java because it's slow right it's kind of slow if you use the JDK implementation it's slow it's not as slow as it was before so this are the worst case numbers that I'll show you here so for most ciphers these days when you use Java 8 update 60 order or later it's kind of fast so it's half the speed of our implementation which I talked about now after I show you the numbers so here I did a benchmark which basically is using WR car which is unlike HTTP benchmarking tool fired of an HTTP server except basically just hello world like the teaching power of benchmarks basically I mean it's a very very primitive benchmark but it just shows you how much you can push so with the SSL engine implementation I was kind of be able to do almost 150,000 requests per second well we say with our implementation which is based on open SSL with Chennai I could almost do 500,000 so it's a lot faster another very interesting thing is because we are using open SSL or leaper SSL or boring SSL whatever you like I mean it's the same API we also support a LPN and NPN out-of-the-box which means you don't need any hacks to make HTTP to work which is a very very big painful point with Java 8 still because you need to use like the arm like the arm the boot class part would class pass HEC 4 which is support by jetty for example so this will just work we provide support for on 4 different operation system just let me show you the CPU before just to give you an idea so the right side is SSL engine implementation so it's kind of idling on 24 box 24 core box left side is open SSL engine so basically it's just melting down which is the right thing to do with a newer JDK build with some ciphers it look almost like this as well it's just half the speed it really depends on the cipher they just changed it in update 60s so if you want an old build and use SSL and use the JDK implementation update it's faster so the open SSL engine like I said it's a drop-in replacement for JDK so these days we support surface side client-side S&I all the ciphers that basically JDK support blue so exercise that open SSL or lip press SLE or boring SSL support gives you up to 6 times of performance I was not able to get the JDK implementation faster than half of the speed then the open SSL one so even in the best case it gives a like two times performance it has less memory usage because of how the caches are managed by open SSL itself let's G C because we don't need to create all the extra objects which are with on the JDK one does if you want to switch from one to the other you just need to change one line of code basically as a provider JDK or open SSL depending on what you want to do if you don't specify this by yourself nettie will just detect if you have the open SSL jars the well the module basically on the class pass if so it will use it if not it for specs to JDK which means you can basically just write a code if of the OpenSSL bindings it will be fast if not it will just work it's just slower which is nice it's the same API you can even use the OpenSSL engine without Nettie if you want to so if you don't want to use Nettie you should use Nellie but if you don't want to use Nellie you can just use all measure engine and it's it's based on Apache Tomcat Native Twitter fort it at some point then basically open source the contribute to Nettie and now we have all version so in the last five minutes I want to talk a little bit about the JVM and Neddie itself and why are sometimes need to hate the JVM I mean I really love Java and the JVM and I think it's a it's an awesome engineering work the JVM but sometimes it just comes into your way and I would want to show you a little bit why maybe for most of you it doesn't care too much because it's only care you only to careful you do a lot of performance stuff and really low level stuff like network programming the read memory management I talked about this a little bit before so direct memory direct memory management is done by the JVM by using finalizes or a cleaner or whatever and I think this just doesn't work because of the reasons I told you before because the garbage collector basically runs when you run out of heap space right so I don't think it's the right abstraction I see why they did this because I mean it's just the way how you develop Java because it was would be very strange for people to basically say like oh now I have direct memory and I need to release them right because you don't want to do that in Java but I think it's very bad for the performance if you write Network program Network application because most of the times you have a very good handle about when you want to release memory because you want to release it basically when you wrote it to the socket right you have a pretty good handle about when you can release memory and that's why we basically do it in Eddy the memory layout sometimes the JIT just tries to be too smart I mean most of the times it makes a lot of sense because it tries to basically get you the best out of performance and memory resist right so the JIT may rearrange files inside of your class because there's some space in between so it doesn't well basically lose too much memory and all that kind of stuff but sometimes you know what you are doing because you can get fault sharing I don't know if you know what for sharing what it is I think it's more known from people that do C or C++ or something more low-level than Java basically what false sharing is is you have two objects to memory well pointers or whatever which are sitting on the same cache line and you access it from multiple threads and because multiple threads using the same cache line you play ping pong to update between them because you need to basically refresh the view on the memory that they have which can have a very very bad impact on performance and it's very hard to basically see this why this happens I mean just to debug this you need to look at the hardware point a hardware counters not the kind of stuff so it's very very invisible from Java to how with this what you can do and see is basically you can add padding you can say here's a struct and I want to have the structure laid out exactly like this and the GCC or whatever uses compiler will never touch your stuff to readjust this if you want to do the same in Java or basically what you are doing here is well you may think well I could just use the fields and just put it like this but the chip may just rearrange this for various reasons like safe memory so what you can do in Java here is and that's basically what for example GC tools does which is a queue implementation which we use in Eddy it's a very awesome project so if you want to use a queue which has some very very specific excess padding like multi producer single consumer you should use something like GC tools and not the JDK ones because it's a lot faster the JDK ones are always woody producer multiple consumers to make it safe to use but if you know what you're doing you can get a lot more speed with GC tools so what they are doing is basically well to prevent a JIT to do something stupid well okay let us try this so I would just arrange them how I want them okay that didn't work I will build I will make them public because then it can't optimize it away well that didn't work out too well as well I make them public volatile to say well it may be accessed by multiple threads and you are not allowed to do anything that doesn't work as well mmm it did work at some point in time it doesn't work too well anymore so you start to do okay I will pet them by myself now I put like eight Long's before it because eight Long's has 64 bytes which is a cache line and well that will work it doesn't work because it should will rearrange the files so I say okay what I'm doing now okay so it now comes the ugly part you create a class in the class you start to do only the padding then extent this class at you fill your file there then extend this class do you again a padding because you need to pad before in behind because otherwise you may share cache line again then you're at your other stuff and so on and so on so you get like 20 classes to extend each other just due to padding and that works it's information implementation detail it may not work in some update but it works for now so that's super painful and I would like to have more control over this if I know what I'm doing and object layout may change this to some extent gen-i to make China I've work they've sometimes you need to have nasty hacks like past memory pointers directly because otherwise it's just too expensive I mean there's just in you you talk to include a new standard for this to improve j'ni but it's super expensive I mean especially if you go from the C layer to Java layer back if you need to call back into this so I hope this will change a little bit I don't have too much time so I will just move over this and I all ni to basically I don't think it's a good up it's a very good improvement over nao because I mean I think there's not too much in ni o - which is good enough to use it compared to nao I mean there's a abstraction but that's it basically for the rest you can just use and I all it reads too much garbage for example selector the key set it's backed by a hash set which means every time you need to access it you create a new iterate a new hash set you can create garbage all the kind of stuff all the Java frameworks that I know that care about performance basically use reflection to inject their own implementation to reduce garbage like Aaron like Neddy so hopefully that will change at some point pipe buffer API is not very user-friendly there are no special exceptions for connection reset events so you have not no good idea about if it's really a hard problem or not Yahoo util concurrent future is blocking by default which i think is against the design of future i mean completion stage little bit better in this terms but it's still problem because it extends future in some point the completely future stuff so there's a lot of stuff i think that could be changed to make it easier and harder to shoot yourself basically so the last one before i'm finish make me rich by my book you need to buy a lot of books to make me rich so buy a lot of books for your friends for your wife for your husband for your kids kids love network programming so just buy a lot of books and make me rich so yeah that's it if you have question i think the time is over but it's okay now it's over okay if you have questions I am here I'm around a little bit just catch me up and thanks for your time
Info
Channel: Devoxx
Views: 39,190
Rating: 4.9257731 out of 5
Keywords: DevoxxBE2016
Id: DKJ0w30M0vg
Channel Id: undefined
Length: 61min 36sec (3696 seconds)
Published: Fri Nov 11 2016
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.