ASP.NET Community Standup - gRPC Performance Improvements

Captions
All right, we're live. Welcome to the .NET Community Standup. I'm joined today, from across the world, by James Newton-King, so welcome back, James. Thank you. And today we're going to be talking about gRPC and performance, which sounds quite exciting. Those who have been around for a bit will know that we always start with the community links, so I'll jump right into those. My name's Jon Galloway, I'm a PM on the .NET community team. So let's jump right in here. I'll do the screen sharing, and I'll share the right screen, because I like to do that, and it is this one. Okay, wonderful. Cool. So I just shared them over in the chat, and I'll share them again at the end. So here we go. First one up: this is the ASP.NET Core updates in .NET 6 Preview 2. Some really cool stuff in here, including of course lots of Razor and Blazor stuff lately. We've got the Razor compiler updated to use source generators, which is kind of mind-blowing to me. This is really cool: it's using source generators and compiling ahead of time, during the compilation step, and there are big performance improvements already. Some other cool stuff in here, including CSS isolation: we saw CSS isolation before in Blazor, and it's good to see this also in MVC and Razor Pages, allowing you to componentize your CSS, so that is good. And there was one other thing in here... preserving pre-rendered state in Blazor is cool, and nullable annotations in SignalR are nice to see, nullable just kind of bubbling throughout things. Yeah, well, we've got quite a long process going on adding annotations to everything in ASP.NET Core. There's a lot of code, and we're getting there; the aim is to do everything in .NET 6. Oh, okay, cool. Yeah, it's a ton of work. I've seen pull requests come in, and there are these pull requests with thousands of lines of code
changed, and all the things you have to check. So cool. Next, this is just a book I've been reading lately, from Jürgen; we featured several of his blog posts along the way as he was writing the book, and it's newly updated for ASP.NET Core 5. What I like about this is it just goes into all the different ways that you can customize ASP.NET Core, all the way throughout the stack: different middleware, hosting, action filters, all the stuff. So a really cool level of depth. It's been neat seeing, over the years, the community writing some really great posts, but it's great to have it all in one book that I can pick up, cross-referenced and indexed. So this is just a great resource. Next, this is cool, from Rahul. Actually, let me switch over and show this, so people can get the links if they didn't get them earlier. So Rahul's got a series, and in this one he's going through JWT authentication in ASP.NET Core with Azure AD. There are a lot of things to set up: he shows going through and writing up the application, he shows the configuration options in Azure AD, and he walks through the steps of actually getting it all set up, and then he actually goes in with Fiddler and inspects, so you're seeing the flow all the way through. This is part of a series that Rahul's been doing, with really good quality screencasts. This is a nice one to see: Yutang just posted about this, and it's about App Service apex domain support. These are managed certificates, with support for all the stuff including auto-renewals and everything, so it's just something you can click through and do directly in the portal as you're setting it up. It's also something that you can automate with ARM templates, so this is really cool. There have been different
ways to kind of hack this together in the past, so it's really great to just see first-class support. I have seen some people on the Twitter conversation saying it's not working, and in some cases it does take a little time to provision; some of that is covered in this FAQ, but overall, just really cool. Is this like an alternative to using Let's Encrypt? Exactly, yep. I mean, Let's Encrypt is fine, but you do have to do some work to set it up; I've seen extensions and things to do the auto-renewal stuff, and this is just something that's kind of built in. Nice, yeah. Next, this is just a tweet that I saw go by: Ralph is saying that there's a survey. I've been posting a ton of surveys in our community links lately; this one is interesting, it's about an upcoming experience for cloud-hosted, or cloud-native, applications. One thing that jumped out at me with this, too, is that they're actually offering to compensate: depending on your level of involvement, they'll actually compensate you for your time in looking at this. Seems neat. Next up, Niels; I saw Niels in the chat earlier. He's been doing this series, and it's just really cool: a walkthrough of all the different ways that you can host Blazor WebAssembly static applications. They're static, but then there are other things you need to do, as far as front-end routing and all that kind of stuff, and he walks through and shows this on Cloudflare Pages. Cloudflare is pretty neat because you have that kind of local CDN distribution, everything's optimized and edge-distributed, so pretty cool there. He walks through showing how to set that up and how to deploy it using GitHub; I actually wasn't aware that they have that level of integration for Cloudflare Pages, so that's neat. And then also, he's got a little bonus at the end showing setting
up pull request previews as well, so every pull request can get a preview URL. Neat stuff. Next, this is kind of cool, from the Dotnetos folks. They've been doing these conferences, very low-level conferences where they're talking about really under-the-hood stuff: optimizing GC, async, all kinds of things. So this is neat: they've got these infographics that you can print out, with some pretty in-depth stuff. You can put these up on your wall and people will know how smart you are, and that's very important. I used to have the ASP.NET Web Forms life cycle poster up near my desk in my office, and that wasn't to show how smart I was; that was just so I could make sense of it. It was very big on PreRender, ViewState and all that. Yeah... Init and postback phases... man, nightmares. Next, Khalid posted here on hosting two ASP.NET Core apps in one host, the idea being, say you've got a front-end site and then a dashboard. Using the generic host, he shows how you can configure multiple hosts using CreateHostBuilder. The standard Run call is synchronous, so it'll just run the first host and wait; instead, he uses a Task.WhenAny and runs them both. This is one of those things where I always read the docs several times to make sure I'm doing it right, because I'm always worried I'm doing something wrong, but my best memory is that WhenAny is the right way to do this sort of thing. Cool. And then once he spins that up and runs it, he's got the app listening on two different ports. Pretty neat. Next, this is kind of building on something we covered: Steve is going a little deeper into the stuff that Brady and I covered a few weeks ago on generating clients from an OpenAPI spec, so he digs really deep into it.
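As an aside, the two-hosts pattern from Khalid's post can be sketched roughly like this. The URLs, inline handlers, and names here are illustrative placeholders, not taken from his post:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Hosting;

public static class Program
{
    // Build a small host bound to a single URL with one inline handler.
    private static IHost BuildHost(string url, string name) =>
        Host.CreateDefaultBuilder()
            .ConfigureWebHostDefaults(web => web
                .UseUrls(url)
                .Configure(app => app.Run(ctx => ctx.Response.WriteAsync(name))))
            .Build();

    public static async Task Main()
    {
        var site = BuildHost("http://localhost:5000", "front-end site");
        var dashboard = BuildHost("http://localhost:5001", "dashboard");

        // RunAsync's task only completes when that host shuts down, so
        // awaiting the first host serially would never start the second.
        // WhenAny starts both and returns when either one stops.
        await Task.WhenAny(site.RunAsync(), dashboard.RunAsync());
    }
}
```

Whether WhenAny or WhenAll is the right completion semantics depends on whether you want one host stopping to bring the whole process down, which is worth checking against the docs, as discussed above.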
Thanks for the link over to the show that we did. He talks in depth, goes into the docs, explains some gotchas that you can run into depending on how you have your directory props set up, and then looks in depth at ways you can customize it, including stuff like the NSwag documentation: all the different options and customization opportunities there. I really like that the generation is built on top of these tools: it's surfaced inside Visual Studio and Visual Studio for Mac, but it runs on top of the dotnet openapi command, which runs on top of the NSwag command, so you can really get down there and customize it as much as you want to. Good stuff. All right, this is a cool one from Thomas: just a quick pointer to a NuGet package you can pull in called PwnedPasswords. This integrates with the Have I Been Pwned website, and as somebody is creating an account, it'll check whether or not that password has been compromised. There's some neat infrastructure behind the scenes with this too. I forget the exact spec, but there's a way that it sends a partial hash of the password, so it's not transmitting your password up; it's just saying, here's a piece of the hash of the password, and there's some sort of agreement that these security people have for a way of transmitting password hashes. So that's pretty cool. And finally, just including this link: this is a thing that you wrote up, gRPC performance improvements in .NET 5, and this is what we're going to be talking about today. That's right. That's it for me. Awesome. So you wrote that post for .NET 5, and of course things always keep marching forward; is there work continuing on that? Yeah, so the big push for performance with gRPC was in .NET 5.
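For reference, the half-remembered scheme described above is the Have I Been Pwned "range" API's k-anonymity model: the password is SHA-1 hashed locally and only the first five hex characters ever leave the machine. A minimal sketch of that flow (the PwnedPasswords NuGet package mentioned above wraps this for you, so this is just to show the idea):

```csharp
using System;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

public static class PwnedCheck
{
    private static readonly HttpClient Http = new();

    // Returns true if the password's SHA-1 hash appears in the
    // Have I Been Pwned corpus. Only the first 5 hex characters of
    // the hash are ever sent over the wire (k-anonymity).
    public static async Task<bool> IsPwnedAsync(string password)
    {
        using var sha1 = SHA1.Create();
        var hex = Convert.ToHexString(
            sha1.ComputeHash(Encoding.UTF8.GetBytes(password)));
        var prefix = hex[..5];
        var suffix = hex[5..];

        // The range endpoint returns lines of "SUFFIX:COUNT" for every
        // breached hash sharing the 5-character prefix.
        var body = await Http.GetStringAsync(
            $"https://api.pwnedpasswords.com/range/{prefix}");

        foreach (var line in body.Split('\n'))
        {
            if (line.TrimEnd().StartsWith(suffix, StringComparison.OrdinalIgnoreCase))
                return true; // the match is computed locally
        }
        return false;
    }
}
```

The server never sees enough of the hash to identify the password; it returns every known suffix in the range, and the comparison happens on the client.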
There hasn't been much work done directly for gRPC in ASP.NET Core since then, but there have been improvements in protobuf serialization, and one of the demos I'll walk through today will show some improvements we've made since .NET 5. Protobuf is a third-party library, and those improvements are available now, since it's not strictly part of .NET. Ah, cool. All right, let's bring it in then. Cool. So first I'll just start off by talking about what gRPC is, for anyone who isn't familiar with it; if you've already seen this before, just bear with me for a couple of minutes. It's a popular open source remote procedure call (RPC) framework. RPC frameworks make it easy for apps to talk to each other; you can think of WCF as another RPC framework. REST you can kind of use like RPC, but it's about state and representing entities instead of method calls. gRPC is run by the Cloud Native Computing Foundation, the CNCF, and what Microsoft has done is write a new implementation of gRPC, gRPC for .NET, using all the latest and greatest tools, hosted on Kestrel, and we have contributed it to the CNCF, so we're working together with them to make gRPC in .NET good. As you're talking about that, sorry to jump in already, but how much overhead is there, how easy is that kind of interaction, working with the CNCF and gRPC? It's been really good. I think we reuse a lot of the tools that they make; for example, we use their code generation tools, and we build upon their base API, so the APIs between the original .NET implementation and the new one that we've written are very, very similar. And we work quite closely with them: every week we have a meeting where we discuss what we're working on, what they're working on, things we should work on together, and we just keep in touch on a weekly basis. Cool. Okay, so gRPC is supported by every major programming language: C++, Java, Go, Node, Ruby, Python, C#, you name it, there is a gRPC
implementation for it, and the gRPC team maintains most of those, while we maintain our new .NET implementation. gRPC is contract-first: messages and methods are defined in a contract, and then that contract is used to drive code generation. And gRPC is not new; it was originally open sourced in 2015, but it's the union of two modern technologies: it uses HTTP/2 as its protocol layer, and it uses protobuf to serialize messages and also as the contract language. So fundamentally, gRPC is another way for apps to talk to each other: a client app calls the server and gets a response. The first difference between gRPC and RESTful APIs is that HTTP/2 is required, and rather than sending text-based JSON as the message, it sends a binary protobuf message. The bigger difference is how gRPC apps are written. gRPC is contract-first, so the first thing you'll write in Visual Studio will probably be your proto file; this is where you define your services and also your messages. You can think of writing a proto file like writing an interface: your server will implement that interface and your client will call that interface. From that proto file we get code generation happening, both for our server app and our client app, and this makes gRPC really productive to use, because you don't need to write the client and all the messages; that's all done for you. And because the proto file is language-neutral, you can give it to someone who is writing a gRPC client in a completely different language, say for example C++; they can use the gRPC tooling to generate a C++ client and call your .NET gRPC server. I've got some more detail here breaking down the differences, but I'd really summarize it as: gRPC is focused on performance. Protobuf and HTTP/2 are binary protocols; they're designed for computers to read and write really fast, compared to text APIs, which are more human-focused. And gRPC also has great developer productivity because of
that proto file and the code generation. Meanwhile, HTTP APIs are very easy to get started with, and they reach the widest possible audience, because every computer and every device is able to talk HTTP/1.1 and JSON, since those are just text-based formats. So let's talk about what work has gone into gRPC performance in .NET 5. The biggest improvements have come from optimizing HTTP/2 in HttpClient and Kestrel: we've worked to reduce latency, improve concurrency, and remove allocations, and all of these improvements are great because every app that is using HTTP/2, regardless of whether it's gRPC or not, benefits. You could be doing RESTful JSON APIs over HTTP/2 and you're still going to get these same benefits. Something to call out in particular is that we reduced server allocations by 92 percent when using HTTP/2, which is quite amazing, and I'll talk a bit later about how we did that. Also, Kestrel now supports HPACK dynamic compression of response headers. HPACK is a standard that's built into HTTP/2, and it's how you can reduce large HTTP headers that are sent with every request and response: instead of sending the entire thing every time, the entire header is sent once, and after that just a very small identifier is sent for that header. So there are bandwidth savings there, and also some CPU savings: we found that it was actually faster to calculate and maintain a dynamic cache and then write an ID to a response than it was to write the entire header every time and worry about encoding. And data transfer and stuff like that, continually transferring back and forth between strings? Yeah, yeah. And also, just if you think about an ID, it might just be a byte in size; writing that to the response is going to be much faster than having to take a potentially kilobyte-long HTTP header and having to check and encode every single character. So there are bandwidth savings there, and also CPU savings, both on the
client and on the server. Wow. And finally, we've worked with the protobuf project (they write the serializer that we use to serialize and deserialize protobuf messages) for their serializer to support Span<T>. So rather than having to allocate a byte array every time we want to serialize or deserialize a message, with that byte array containing all the data we're working with, we're able to read data directly from Kestrel's request buffer and deserialize directly from that, and then do the inverse when we're writing the response: we're able to serialize directly to Kestrel's response buffer and then flush it directly to the client. Wow. Yeah, and that's especially important when you realize that previously we were having to allocate a buffer which would contain the entire message you were sending. So if you were sending, say, a four-megabyte message, which is very big, we would have to temporarily allocate a four-megabyte byte array, serialize to it, write it to the response, and then just immediately throw it away, which would cause havoc with the large object heap and garbage collection and stuff like that. So it's a big improvement. Wow. So, actually breaking it down, comparing .NET Core 3.1 to .NET 5, we increased server requests per second by 65 percent, and client requests per second increased by over 200 percent. So yeah, a significant increase. That comes from the range of improvements: reducing latency, improving concurrency, which can be a big deal with HTTP/2 (I'll talk about that in a second), and the allocation reductions that I mentioned. So if I had a service written on 3.1 and I moved it to 5, how big of a deal is that, do I have to do a lot to take advantage of those changes, or is it kind of just automatic? It's just automatic. All you'll need to do is update to .NET 5 and update your NuGet packages to the latest version, and it will just work; these are just fundamental low-level
improvements we've made. Nice. So this graph here is from a community-run benchmark comparing gRPC implementations against each other, and you can see gRPC on .NET highlighted in light blue: we sit up at the top with other fast implementations like C++, Go and Rust, and Rust is the only one we're second to. So this is actually faster than C++? I mean, yes it's close, but wow, that's amazing. Yeah, well, people think C++ has to be fast, but really, when you're thinking about gRPC and HTTP/2, it's about the quality of the algorithms inside your HTTP server. If you're coding in C++ you can still do really dumb things (not saying that they've done dumb things or not), but it's about identifying those bottlenecks and working to improve them. You know, it's funny you mention this, because there's a recent thread that I saw on Hacker News, and it was about Grand Theft Auto and its long start-up time, I don't know if you saw that. They dug into why it was taking so long to start the application up, and it came down to something near and dear to your heart: parsing JSON. It was parsing a 10-megabyte JSON file, and it was doing it by counting one character at a time, and this guy, just hacking around without the source code (he decompiled stuff), got it around 70 percent faster. They've actually paid him out of their bug bounty program, they paid him 10,000 dollars, and they're going to be shipping that fix. So it is true, C++ is not a magic bullet, it's not necessarily faster, and you can shoot yourself in the foot there. Yeah, and I have no doubt that the team working on GTA probably tested the JSON parser when it got released, and that 10-megabyte file was probably a few kilobytes then, and it was really fast; just as the data grew, there was that one weak point which caused
performance to go out of control. And that can happen whatever language you're using and whatever implementation you're using. It would be very possible to change what particular parameters are being tested in this gRPC benchmark and create a completely opposite result, where C++ and Java absolutely smoke .NET. It's just about where your strengths are: once you identify where you're slow at some point, you focus on making it faster. Cool. I had a few questions pop in. You mentioned HTTP/2 earlier; what's the status of support for gRPC and HTTP/2 on Azure App Service? Yeah, that's a tricky thing to support. With HTTP/2 you need it to be supported from end to end: a client might support HTTP/2, a server app might support HTTP/2, but if there are proxies sitting in the middle, they can cause issues, because they're intercepting HTTP requests, rewriting them, and sending them on while doing load balancing and stuff like that. Last I heard they're working on it, but I still don't have a good answer that I can speak to on that at the moment. What we have done is give our gRPC implementation in .NET really good support for gRPC-Web, which allows most of the gRPC functionality but lets it work over HTTP/1.1, so that's an alternative you can use today. Another alternative is you can use AKS, and when you do that, your data and your requests aren't going via the built-in Azure load balancers; you sort of manage that yourself, so that's another option. Okay. You mentioned gRPC-Web, and the obvious thing is access from a browser client; is it possible to use gRPC-Web from a non-browser? Yeah, definitely. I think the .NET client is probably the only client that supports gRPC-Web apart from the JavaScript client. Basically, what that means is you can take the .NET gRPC client, you can run it inside a Xamarin app, and you
can make gRPC-Web calls from your Xamarin app to your server and get a response successfully, and that allows you to use pretty much all of the functionality of gRPC and a lot of the performance benefits, like protobuf: your messages will still be using protobuf on devices that don't support HTTP/2. Another example of that is Blazor WebAssembly. Blazor WebAssembly is .NET, but it's sitting inside the browser, so you only have HTTP/1.1 support, but you're still able to use gRPC-Web from that environment. Okay, cool. All right, I'll let you get back to it. Cool. So let's go into a bit of detail about what we did in .NET 5. The first thing to think about is what actually impacts HTTP/2 and protobuf, what the constraints are, because if we're making performance improvements, we don't want to just start making changes based on what we think will make it faster; we want to first measure, and then change. gRPC itself is a very lightweight framework: pretty much all the real work happens at the HTTP/2 layer and in protobuf, so it's the efficiency of our HTTP/2 and protobuf that determines how quickly we can send and receive gRPC. So what we care about are allocations: how many objects are we allocating, how much data are we allocating, and how much time is spent doing garbage collection? Because CPU time spent doing garbage collection isn't CPU time spent doing HTTP/2, protobuf, and serving gRPC requests. We also want to think about CPU usage: how efficiently are we processing HTTP/2 frames and protobuf messages? We need to think about thread contention: HTTP/2 is multiplexed, which means you can have multiple requests happening simultaneously over a single connection. That's great because it reduces the number of connections you use, but it means you also need to think about connection bottlenecks: there are places with multiplexing where you can only do one thing at a time, with one request at a time, so you need to focus on making those
bottlenecks as fast as possible. And then there's also network usage: minimizing the amount and size of HTTP/2 frames being sent over the network. An example of that might be reducing header frame size by supporting HPACK dynamic compression. And to do that measuring, there are a lot of choices when it comes to .NET profilers. The ones built into Visual Studio are very easy to use, and I'll briefly demo using the results from a profiling session later. There are also the dotTrace and dotMemory profilers; they're also very good. They don't come with Visual Studio, you need to pay some extra money to use them, but overall I think the UI in them is really great and easy to use. And the final one I'll call out is PerfView. PerfView is very advanced, but it's also very tough to master; the UI was definitely designed by a programmer for programmers, and if you're not familiar with performance profiling, it's very difficult to use. People who are good at PerfView are always in hot demand across Microsoft internal teams, to help debug and improve performance in our apps and also frameworks. And I'll just call out, before I jump into a demo, that there are a lot of different approaches to doing performance profiling. Some people have knowledge of using .NET profilers, others use dump files, other people will be good at improving performance at an IL level. When it comes to performance, the more eyes you get on a problem, the more perspectives you'll get, and the more people are able to improve performance. Quite often I'll look at something and I'll say, this is as fast as I know how to make it, do you want to take a look and see if you can add your knowledge to making performance improvements? So what I cover today is some approaches I generally use, but they're certainly not the be-all and end-all of doing performance improvements in .NET. Whenever I see profilers I always think of analyzers too; are there analyzers that
will help me see if I'm writing inefficient code, like, can I mess things up in my gRPC service or client in a way that would be caught by an analyzer? We don't ship any analyzers for gRPC with the gRPC project or with ASP.NET Core. There are general analyzers built into Visual Studio itself, that the runtime team makes, that will do analysis of your code and let you know if you're doing something wrong. For example, a good example is if you call an asynchronous method and you don't await it; an analyzer will probably pop up and tell you, hey, are you sure you meant to do this? And as we get further and further into polishing .NET Core and .NET 5, we're writing more and more analyzers that will check what you're doing and see whether there's a better way of doing it. Perhaps you might do a Substring, and an analyzer might come along and say, hey, for what you're doing here, you could probably just use spans and slices and avoid the allocation from a string. Okay. One other question is nullability; we talked about nullability earlier with SignalR. Is there any nullability impact or support in gRPC? Yes, so the gRPC library has been annotated with nullability attributes, and has been since the first release. I think there are plans to add nullable attributes to the generated code, but we haven't done that yet, so that would be the place still to do it; but the actual gRPC libraries and NuGet packages have nullable attributes. All right, okay, well, I'll let you get to your demo. Cool. So the first thing I'm going to talk about is analyzing Kestrel server allocations, particularly around HTTP/2, and how we introduced the pooling of HTTP/2 streams to dramatically reduce allocations. Here in Visual Studio I have the Kestrel solution open. Kestrel is our HTTP web server for ASP.NET Core; it's written in .NET and C#, and we're just going to take a look at some of the internals and see what we can do to improve its
performance. If you're ever interested in doing some performance analysis in Visual Studio, what you can do is come up to the Debug menu up here, where there is a performance profiler option. I think most people are familiar with starting an app with debugging and hitting breakpoints; performance profiling is similar, except instead of starting out with debugging, we're starting with Visual Studio gathering statistics. Here you would just select your startup project, and then you would choose what exactly you want to measure. .NET object allocations is one I commonly use; another one is CPU usage. In this case today we're going to look at allocations, so I would check this, I would click Start, the gRPC app would start up, and then I would get a client to make some requests to the server. That would gather some statistics, I'd click Stop, and then I would analyze them. I'm not going to do that today, just because I don't want to put a whole bunch of load on my computer while presenting, so I've got some results here all ready to go. Those results you have there, are those easy to share with team members, that kind of thing? Yeah, so I actually made this measurement, I think, four months ago, when I originally wrote that blog post; I just saved it as this file, with the diagnostic session suffix, and for the demo today I just opened it back up again and I've got all the data ready to go. So it's just a matter of sharing these files. Cool. So in the results here there are a couple of different panes; the one we'll focus on first is live objects. Live objects is how many objects are currently live and allocated in your app. Here we can see our app when it first started; this is a gRPC server app, so when startup happened we got a bunch of allocations, and then we didn't have any load, and then at this point I started sending gRPC requests to the server. So this is demoing 50,000 gRPC
requests and what happens to memory usage, and here we can see our allocations slowly build up. You can see the amounts of delta down here: our allocations build up until we get to this point where they suddenly drop off. This is where garbage collection happened; this is where .NET decided it needed to free up some memory, so it spent some time analyzing which objects are no longer used, and then it throws those all away. And this is that classic sawtooth; we've had other people on recently show us this exact same graph. It's that sawtooth where you build up allocations, then you garbage collect, and you do it all over again. Yeah, and then the goal is to allocate less, so that you don't have to garbage collect, right? Exactly. Well, the goal isn't to completely remove garbage collection, it's just to reduce how often it happens. Time spent doing garbage collection takes CPU cycles, and it's better to spend that time serving gRPC requests than doing something we don't need to do. So if we reduce the amount of garbage we create, we reduce how many collections we have to make, and the amount of time spent doing GC. Does your gRPC service essentially stop during GC, or just slow way down due to CPU usage? These particular garbage collections will be gen 0. There are generations of object lifetime, and I can't get into it all right now, but these ones won't stop anything; your app will keep running. I'm not even sure whether your app will pause during a gen 2 GC, which is the worst one, the one that takes the most CPU; we might not even need to pause for that, but I'm not a GC expert, so don't quote me on that. GC, I wouldn't be afraid of it happening, you just want to minimize it where possible. Yeah. So this sawtooth graph indicates we're doing a lot of allocations, and fairly regularly we're needing to do garbage collection. Down here in this allocations pane we get a
breakdown of what exact types are being allocated and where all the memory is going. This is descending by bytes, so Pipe is our biggest culprit, then we've got Http2Stream, and then we've got a whole bunch of strings, which are really popular but don't take as much space, so they're a bit lower compared to pipes, and so on. Something which jumps out to me immediately when I look at this list is that there's very little here that has actually got anything to do with gRPC or what our app is doing. In fact, the only one is this ServerCallContext. The ServerCallContext is allocated once for each gRPC call, which makes sense, because we've got 50,000 of them and we made 50,000 requests. Now, this is the only actual gRPC or app-related allocation; all the rest of these are coming from Kestrel and how Kestrel is managing HTTP/2 requests. And the biggest one, the one which immediately jumps out, is this Http2Stream. You can think of a stream as the HTTP/2 representation of a request. When you are a client and you make a request to an HTTP/2 server, you'll get a connection created, and then on that connection, for each request you make, a stream will get created on the server. That stream is bi-directional, so it'll process your incoming HTTP/2 frames and outgoing HTTP/2 frames, but it's sort of the heart of an HTTP request. What we were doing in .NET Core 3.1 is once we had finished with an Http2Stream we would just throw it away, and when the next request came in we would create a new one, and then eventually those streams we threw away would get garbage collected. And from the stream, a whole bunch of other types hang off it: there'll be request and response pipes, strings will be allocated as headers, we've got a request headers collection, a response headers collection, this dictionary is probably also related to headers, this is related to buffers and pipes, this looks like an async state machine, more header stuff, and a
frame writer, more pipe stuff. So there's a whole bunch of stuff happening for every single HTTP/2 request that we want to try and eliminate where possible, so that our allocations are really just the gRPC stuff our app is doing and not the stuff Kestrel needs to do. The way we did that is we introduced pooling of HTTP/2 streams, so let me jump into the source code. Here we've got our Http2 folder; this is where most of the HTTP/2 implementation in Kestrel lives, and we've got this Http2Connection. When our client connects to our server we'll get a connection created, and then we start processing frames. This particular method I already have open is where we process a HEADERS frame. A HEADERS frame is where a client sends request headers and the server then processes them, and the HEADERS frame is important because once we've finished processing it, we start a new stream. So once we have the headers for a new request, we allocate a stream to then manage the lifetime of that request. A whole bunch of error checking happens here, we check state, a whole bunch more, we do some processing, until we get to this point where we're ready to start a new stream, which is helpfully called GetStream. And if we jump into this, the code I'm viewing right now is .NET 5, this is after we've made the improvements, but previously, in .NET Core 3.1, we would always just create a new stream: we got some new headers for a new request, let's create a new stream and all the types which sit on it. Now what we do is we have a pool of streams. This pool is streams that we've previously created and that have served an HTTP/2 request, and once we're finished with a stream we return it back to this pool so it can be reused. So we're sort of recycling, saving the planet inside our HTTP/2 server. And if we successfully have a stream to use, we pop it out of this collection, we initialize it with the new ID, and we start using it. So this is
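In miniature, the pool-or-allocate logic just described looks something like this; `StreamPool` and `PooledStream` are hypothetical stand-ins for Kestrel's internal types, not the real implementation:

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for Kestrel's Http2Stream.
class PooledStream
{
    public int StreamId { get; private set; }
    public void Initialize(int streamId) => StreamId = streamId;
}

class StreamPool
{
    // Per-connection and effectively single-threaded, so a plain Stack is enough.
    private readonly Stack<PooledStream> _pool = new();

    public PooledStream GetStream(int streamId)
    {
        // Reuse a previously created stream if one is available,
        // otherwise fall back to allocating a new one.
        if (!_pool.TryPop(out var stream))
        {
            stream = new PooledStream();
        }
        stream.Initialize(streamId);
        return stream;
    }

    public void Return(PooledStream stream) => _pool.Push(stream);
}
```

The point is that the second request pays no allocation cost: it pops the instance the first request left behind and just re-initializes it.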
just the tip of the iceberg of what is happening to support stream reuse, but the core concept is that rather than creating something every time, we're attempting to reuse it. If I jump in a bit further to look at some more details, let's look inside the Http2Stream. When we create, oops, I've got the wrong file open, when we create an Http2Stream we need to initialize all the types that are hanging off it, so we've got this Initialize method, and we set a bunch of values back to what they should be for when a stream starts being used. Now, when this happens, we check to see whether it's the first time the stream has been used. If it's the first time it's been used, then some types that are sitting on it happen to be null, in which case we allocate them: in this case we're allocating some types for managing input and output flow control, our output producer, this is our request body pipe, and this is our output pipe. But if we've already used the stream, rather than recreating all of those things we just reset them, and each one of these types internally then has some code to handle resetting all its state back to the right values. One of the other things we need to do to make this work is ensure that we only reuse a stream that is safe to reuse. An HTTP/2 request could fail for a number of reasons, and if it didn't gracefully complete, we don't want to attempt to reuse it, because if that request failed because it went into a bad state or did something wrong, like it was aborted, we don't want to attempt to reuse it and then have another error pop up that we have no knowledge of and that is very hard for us to debug, because the only reason that request failed is because the request before it failed, and surfacing those kinds of dependencies is quite difficult. So we have this CanReuse flag, and with this flag we check, when we're completing a stream, we've got this complete stream method, we check some state on the stream to see whether
it's safe to reuse the stream. Was the stream aborted? Okay, in that case we don't want to attempt to reuse it. Did its response gracefully complete? If it did, then it's safe. So if both these things are true, then we know the stream is safe to reuse; otherwise we just throw it away and we create a new one. Better to be safe than sorry. You mentioned the case where things can fail due to dependencies. Is there something similar to distributed tracing? I've seen that with other microservice applications, and there's a W3C distributed tracing spec and stuff; is there something like that for gRPC? For gRPC, yes, certainly. gRPC has support, and it also has its own particular metadata for OpenTelemetry. So when OpenTelemetry data is being collected, gRPC requests and responses will have their own metadata, so it will say what gRPC method you're invoking and also include the gRPC-specific status, which is slightly different from an HTTP status. Very cool. So here is where we are removing the stream from the connection, we're in the Http2Connection, we're removing the stream from it, and at this point we check: can we reuse the stream? If we can, we attempt to return it and then we pool it; otherwise we dispose it. The next thing I'll talk about is that when we're pooling something, we want to put a limit on how many we're going to allow ourselves to pool, because we don't want to accidentally pool thousands of streams we've previously used and have memory grow out of control. For that reason we have a limit on the stream pool size, and we only pool something if we're lower than this limit; otherwise we just throw it away. And the final thing I'll mention is we also have time-based expiration of pooled streams. Something which can happen with an HTTP/2 connection is you could visit a website and download all the content at once, so you might have 100 concurrent
requests going on, and once all those requests finish, all 100 streams will go into the pool. Now, it's quite possible you never visit that website again, you never browse to a different web page or make any more HTTP requests on that particular connection, in which case all those pooled streams are just sitting in memory doing nothing, and that's just wasteful. For that reason we have time-based expiration, where if a stream has been sitting in the pool for longer than this configurable number of ticks, then we just expire it out of the stream pool for you. And the way we do that is, within Kestrel there is this concept known as a heartbeat: every one second, Kestrel will just take a look to see whether there is state that should be updated based on time, and on that one-second interval it will inspect this stream pool to see if there are streams that should be removed. How much of the object pooling did you have to write yourself versus what's built in to .NET Core in general? In this case this is something we wrote ourselves, but it's not very complicated, it's just a stack of streams, just a regular .NET collection, because, although I'm talking about multiple things happening in parallel, this is one of those cases where all the logic is single-threaded when we're working with the request, so we don't even need to do any locking when working with this. So there isn't a need for a special collection type. Compare that to something like renting arrays from an array pool: that will probably be happening from a static collection, so various locking needs to happen to ensure that two callers don't rent the same array at the same time. We don't really need to worry about that here. Okay. So those are all the changes we made, well, some of the changes we made, and this is our previous results, with
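A sketch of the return-side rules just described, the CanReuse gate, the size cap, and heartbeat-driven expiry, might look like this. The type and member names are made up, and a queue is used here for oldest-first expiry where Kestrel's real pool is structured differently:

```csharp
using System;
using System.Collections.Generic;

class ExpiringStreamPool
{
    private const int MaxPoolSize = 100; // cap so memory can't grow unbounded
    private static readonly TimeSpan ExpiryAfter = TimeSpan.FromSeconds(5);

    private readonly Queue<(object Stream, DateTime ReturnedAt)> _pool = new();

    public bool TryReturn(object stream, bool canReuse)
    {
        // Only pool streams that completed gracefully, and only up to the cap.
        if (!canReuse || _pool.Count >= MaxPoolSize)
        {
            return false; // caller disposes it instead
        }
        _pool.Enqueue((stream, DateTime.UtcNow));
        return true;
    }

    // Called from a once-per-second heartbeat, like Kestrel's.
    public int RemoveExpired(DateTime now)
    {
        int removed = 0;
        while (_pool.Count > 0 && now - _pool.Peek().ReturnedAt > ExpiryAfter)
        {
            _pool.Dequeue(); // oldest entries expire first
            removed++;
        }
        return removed;
    }
}
```

Because the connection processes requests single-threaded, none of this needs locking, which is what makes the plain collection sufficient.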
that sawtooth pattern, and now if we take a look at our .NET 5 results, we can see that sawtooth pattern has disappeared, and now we have a gently increasing slope of allocations. Now, allocations are still going up, GC is going to happen eventually at some point, but we've reduced the need for any GC to happen during our 50,000 requests. So this is the same 50,000 gRPC requests, just to a .NET 5 server compared to a .NET Core 3.1 one. Now, it looks like the slope dropped off about three quarters of the way through up here, is there a reason for that? Yeah, we finished processing the 50,000 requests. Oh okay, all right, so this is the point where we finish doing any work and the server just stays constant from this point. Got it. Okay, and now if we take a look down at our list of allocations, that ServerCallContext now happens to be the number one allocator, and everything else has to do with gRPC: this is a gRPC message, this is the task we're returning from our gRPC method, another message, some internal stuff. So pretty much all the big allocations that are happening are now related to gRPC, which is what we want. We want allocations to be related to our app and gRPC, rather than related to just HTTP/2 stuff. Wow. Yep. So that's a 92 percent reduction, that's incredible. A question, well, I don't know if it exactly fits in, but: can you use the protobuf serializer without gRPC? How separate are those things? They're completely separate. gRPC just happens to use protobuf; protobuf isn't tied to gRPC. One of the nice things about gRPC tooling is it will automatically code-generate protobuf messages for you. If you're using just the protobuf serializer by itself, you'll need to generate those types manually from the command line. An alternative to that is there's also a library called protobuf-net, written by someone in the community, and that one actually doesn't use code
generation at all, it uses reflection instead. While we're talking about that, can you use a JSON serializer and deserializer with gRPC? You could. There are extension points for doing the serialization and deserialization of messages. The messages with gRPC are just a blob of data, so that data could be protobuf, it could be JSON, it could be text, it could be just plain old binary, you could send an image over it, but most people are sending protobuf messages. Okay. And here's a more geeky one: are related allocations contiguous? I'm going to pretend I'm sure I know what that means, are the allocations contiguous in memory. I'm not quite sure about that. Obviously, if you're allocating an array, all the slots in that array will be contiguous, but I'm not sure, if you just new up a bunch of objects in a loop, how those would get allocated. There are performance benefits around having your memory closely grouped together, but, and this is where I talk about different skills when it comes to doing performance analysis and improvements, that would be detail I don't really work on, and you'd probably bring in someone who's much more familiar with the GC or the runtime to do those kinds of improvements. Paging Stephen Toub or something. Yeah, or pull in Ben Adams. There you go. But it seems like in general what's nice with the level you're working at is that I don't have to worry about that; I can write gRPC code, and it's generating client code and keeping things up to date with that. And that's something we looked at recently too with the connected services: when you generate clients, you can regenerate and update your clients from that contract, right? So if I update my proto contract, then it'll just regenerate my
client as well, and so then I can focus on writing business logic and not worrying about the plumbing. Yeah, we try and focus on making the low-level parts of .NET and ASP.NET Core fast so that CPU time and memory are available for your apps; that's the goal. So this is interesting here from Alfred, and he's asking: since one of the signals to indicate a stream can be reused is the fact that the previous request wasn't aborted, could an actor repeatedly abort to defeat the pooling? And just generally, how much do you have to worry about a bad actor, and managing poorly behaved clients? Well, it's not necessarily the previous request, but yeah, aborted requests and connections that weren't gracefully closed just won't get pooled. It's not really an attack, it's just that those requests wouldn't happen to take advantage of the extra efficiency that we've added to Kestrel, so I wouldn't describe it as an attack. But if you're worried about just poorly behaved, I guess, is a better word for it. Right. Yeah, if you're worried about performance and a lot of the HTTP requests coming into your server are failing, and by failing I don't mean they're gracefully returning a 404 response, that's not the kind of failure I'm talking about, I'm talking about throwing an exception, or aborting a request before all the content has been read, that's the kind of non-graceful failure I mean. If that's happening, your performance is probably already bad and you should focus on fixing that before doing other analysis. Okay, we have a trick question from Sebastian. He says, how big can the pools grow and how fast will they shrink? So the pool maximum size is 100, so we will pool up to 100 streams on a single connection, and it will start shrinking if a stream hasn't been used in five seconds. So
five seconds sounds quite short, and maybe it should be slightly longer, we haven't done a huge amount of analysis on that. But if you've got a steady number of requests, thousands of requests coming into your server per second, that's the point where you start caring about performance and you start taking advantage of this pooling. Well, if you've got one request coming in every 10 seconds, the cost to allocate a stream and all its related types is really nothing, and you shouldn't worry about it. All right, a question on hosting multiple gRPC services in the same app but on separate transports, so service one on a Unix domain socket, service two on TCP. Yeah, you can do that. All right, easy enough. Okay, let me see, one here: timeline or roadmap for using gRPC with Service Fabric, within the cluster and for external clients? I have no idea, that's a separate thing, right, Service Fabric. Yeah, I have no idea what they're doing and I probably wouldn't be able to speak for their team. Okay, cool. Wow, I'm just trying to think through all this, this is really interesting, especially as I've been doing more stuff with microservices lately and thinking about high-performance microservice communication between different services using gRPC. And then looking at these performance improvements, it would be amazing to just say, boom, all my microservices just got 92 percent less allocation. Yeah, wow. So these are updates for .NET 5, and then what are some of the things that you're thinking about and looking at with .NET 6, is there much new stuff kind of in the works there? So I had a second demo, but we haven't got time for it, but I'll jump ahead to the slide. Oh shoot, I talked too much and we didn't get the SIMD demo. Actually, you've got enough time for a, you
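Going back to the question above about hosting services on separate transports, a minimal .NET 6-style sketch could look like this; the socket path, port, and service types are placeholders, and in practice you would also constrain which endpoints match which transport:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Server.Kestrel.Core;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

builder.WebHost.ConfigureKestrel(options =>
{
    // Transport 1: Unix domain socket, HTTP/2 for gRPC.
    options.ListenUnixSocket("/tmp/service1.sock",
        listen => listen.Protocols = HttpProtocols.Http2);

    // Transport 2: plain TCP, also HTTP/2.
    options.ListenLocalhost(5001,
        listen => listen.Protocols = HttpProtocols.Http2);
});

builder.Services.AddGrpc();

var app = builder.Build();
app.MapGrpcService<Service1Impl>(); // hypothetical service types
app.MapGrpcService<Service2Impl>();
app.Run();
```

As written, both services are reachable on both listeners; routing constraints would be needed to pin each service to one transport.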
know, a quick demo, that's fine. Okay, a very quick demo. SIMD, for anyone who's not familiar, stands for single instruction, multiple data. Basically, it's about doing more on data with fewer instructions, so your CPU can be more efficient. This is something which has been done in protobuf, in our protobuf serializer. So now I've got the protobuf solution open. I don't have a profile of this, but we identified that WriteString was a hot path, which makes sense, because when you're serializing JSON or protobuf or any message type, strings are generally probably the majority of the data. So we're basically taking a string and wanting to write it to this buffer as UTF-8 bytes. We'll ignore all of this and we'll focus on this particular loop. This is the hot-path loop which we identified, and although this looks very simple, on this particular line there is a bunch of stuff happening which doesn't need to happen. We've already previously identified that this particular string value is all ASCII, and because it's all ASCII, each character fits in a single byte, so basically we're looping over the characters in the string, casting each one to byte, and putting it in our byte array slash byte span. So there's a bunch of stuff happening here which doesn't need to happen. There are these accesses by index, and each time you do one of these, typically .NET will just verify that you're within the bounds of your buffer or your array, because if you go outside then you can start reading or writing random memory, which is very bad. And also, casting to byte, there's probably some checking to double-check that the char you have actually fits within a byte, but we already know that, we've already validated that up here. So this was the hot path; we'll get rid of this. What we introduced in a newer version of the protobuf serializer, let me simplify it a bit, is some more advanced APIs that live in .NET. So we've got this MemoryMarshal.GetReference, which sort of
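The hot-path loop being described, before any of the optimizations, has roughly this shape; this is a reconstruction for illustration, not the actual protobuf source:

```csharp
using System;

static class AsciiWriter
{
    // The "before" shape: per-iteration bounds checks on both the string
    // and the destination span, plus a char-to-byte narrowing,
    // for every single character.
    public static void WriteNaive(string value, Span<byte> destination)
    {
        // value is assumed to have been validated as all-ASCII already.
        for (int i = 0; i < value.Length; i++)
        {
            destination[i] = (byte)value[i];
        }
    }
}
```

It is correct and simple, but every `value[i]` and `destination[i]` can carry a bounds check the JIT may not be able to elide, which is the overhead being discussed.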
gets a reference to the very first character in our string, and then we're doing the same thing with the first byte in our buffer. What we can then do is use some unsafe APIs to read and write values to and from these references without having to do bounds checks. Obviously, you want to double and triple check that you're not going to accidentally read off the end of an array, because that would be very bad, but doing this allows us to remove those bounds checks, and obviously this is unsafe, which is indicated by this unsafe keyword. So this is step one. Step two, getting more advanced, we're going to process four characters at a time using SIMD. So rather than doing this byte by byte like we're doing down here, casting chars to bytes one at a time, we're going to do four at once using SIMD. The first thing we're going to do is take our chars, our reference to the string value which we're reading as chars, and instead of chars, read them as bytes. This allows us to reinterpret the data in an unsafe way, to read the chars as if they were bytes. And then we're going to process four at a time: we're going to loop over the data, we're going to double-check we've got at least four characters remaining, we're going to get the offset of our current position based on the current index, and we're then going to read the source chars, which we're now reinterpreting as bytes, as a ulong. That's a 64-bit-sized data type, so it will fit four characters inside it, because each one of those is a UTF-16 char, 16 bits each. And then we're going to set them four bytes at a time. I've got a method here, and you might be wondering what's going on in this method; this is where we start doing the SIMD logic. So what happens is we check to see whether the current CPU supports it, in this case we're checking: does it support
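Putting those two steps together, the no-bounds-check reference access plus reading four UTF-16 chars as one ulong, might look roughly like this. Again, a reconstruction under the stated assumptions: input already validated as ASCII, destination already known to be big enough, and little-endian layout:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class AsciiWriterUnrolled
{
    public static void Write(string value, Span<byte> destination)
    {
        ref char src = ref MemoryMarshal.GetReference(value.AsSpan());
        ref byte dst = ref MemoryMarshal.GetReference(destination);

        int i = 0;
        for (; i + 4 <= value.Length; i += 4)
        {
            // Reinterpret four UTF-16 chars (8 bytes) as one ulong, no bounds checks.
            ulong fourChars = Unsafe.ReadUnaligned<ulong>(
                ref Unsafe.As<char, byte>(ref Unsafe.Add(ref src, i)));

            // Narrow each 16-bit char down to 8 bits (little-endian layout assumed).
            uint packed = (uint)(fourChars & 0xFF)
                        | (uint)((fourChars >> 8) & 0xFF00)
                        | (uint)((fourChars >> 16) & 0xFF0000)
                        | (uint)((fourChars >> 24) & 0xFF000000);
            Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, i), packed);
        }

        // Handle the remaining 0-3 chars one at a time.
        for (; i < value.Length; i++)
        {
            Unsafe.Add(ref dst, i) = (byte)Unsafe.Add(ref src, i);
        }
    }
}
```

Nothing here is SIMD yet; it is the scalar four-at-a-time version, which is exactly the middle step in the benchmark results discussed below.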
SSE2. If it does, we use some SIMD intrinsics to process all four characters simultaneously and then write them at once. If SSE2 isn't supported, we then check to see whether you're on an ARM computer; if you're on ARM, then it has its own equivalent instructions for doing the same thing. And then if we have neither, we'll fall back to a non-SIMD approach, and in this case, because byte order matters, we'll just check this guy, and then after doing that we will set the values. So basically, what's going on here is we're taking UTF-8, no, UTF-16 values, which is what a char is, so let me, yeah, so this char represents a UTF-16 character, if we zoom in a bit you can see that there, and we're narrowing it down to a byte. We've already confirmed that it fits within 8 bits; we just want to do this as quickly as possible, and the way to do that as quickly as possible is to remove all the branching going on. So previously, when we were doing bounds checking, that was branching; when we were checking again that a char fit into an 8-bit byte, that was some more branching. But by getting rid of all of that and doing this here, you'll notice, let me just say, all of these if checks will get removed when the JIT runs, because these are constants. Oh, this one, if you're on ARM, we'll just remove this for you; likewise, if neither ARM nor SSE2 is supported, the JIT will remove this for you; likewise, the JIT will see this check and it will remove either this or that for you. So all these if checks just disappear. So basically, what we're aiming towards is to remove all the ifs, all the branching related to bounds checking and checking data types; we just want to focus on converting our 16-bit values down to 8 bits. And then, in most cases you're going to have support for SSE2 or ARM, and then, right, I mean, then you're processing four characters at a time instead of one, so you're going to be four times faster? It's not
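The capability-check shape being described, where `IsSupported` is a JIT-time constant so the dead branches disappear from the compiled code, can be sketched like this; it's an illustrative narrowing helper, not the actual protobuf method, and the scalar fallback assumes little-endian layout:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

static class SimdNarrow
{
    // Narrows four UTF-16 chars packed in a ulong down to four bytes
    // in the low 32 bits of the result. Assumes all chars are ASCII.
    public static ulong NarrowFourChars(ulong fourUtf16Chars)
    {
        if (Sse2.IsSupported)
        {
            // x86/x64 path: pack 16-bit lanes down to 8-bit lanes.
            var vector = Vector128.CreateScalar(fourUtf16Chars).AsInt16();
            var narrowed = Sse2.PackUnsignedSaturate(vector, vector);
            return narrowed.AsUInt64().GetElement(0);
        }
        else if (AdvSimd.IsSupported)
        {
            // ARM path: equivalent narrowing instruction.
            var vector = Vector128.CreateScalar(fourUtf16Chars).AsUInt16();
            var narrowed = AdvSimd.ExtractNarrowingLower(vector);
            return narrowed.AsUInt64().ToScalar();
        }
        else
        {
            // Scalar fallback: shift each 16-bit value down to 8 bits.
            return (fourUtf16Chars & 0xFF)
                 | ((fourUtf16Chars >> 8) & 0xFF00)
                 | ((fourUtf16Chars >> 16) & 0xFF0000)
                 | ((fourUtf16Chars >> 24) & 0xFF000000);
        }
    }
}
```

Because each `IsSupported` property is a constant to the JIT, the compiled method contains only one of the three branches with no runtime check at all.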
necessarily four times faster, but it's certainly faster. Even if SIMD isn't supported, you can see this comment here, even if the environment doesn't support it, it's still faster, doing this is still going to be faster. Wow. So I've got the results, which we'll jump to. Our baseline, with that for loop plus casting, was about 11 milliseconds; this is running on multiple kilobytes of ASCII text. If we switch to using that ref and unsafe approach and processing four chars at a time without SIMD, we drop down to about 6 milliseconds, and then if we add in SIMD on top of that, we drop down another 0.8 of a millisecond. So from our initial value, we've pretty much halved it by using these advanced techniques. That's crazy. It surprised me when I looked at the cases; I was like, well, hopefully I hit one of these SIMD instructions, because that's what's going to do all the work, but this is amazing, you get most of the performance just by writing smarter code. Yeah, I think it might be a lot of the bounds checking which was happening in that for loop: there are two arrays, and you're getting and setting from both of them, and that just adds up. So yeah, I think that's where probably most of the gain comes from, and then SIMD is just the cherry on top to knock off another millisecond, which seems small at this point, but that's probably 20 percent faster. Yeah, which is certainly not bad. Nice. For writing this code, is there some sort of guide or something for people? You only want to do this in a hot path, right? You don't want to be writing that kind of code all the time, but if you do get to a place where, hey, I'm writing a service and this is my hot path and I need to optimize it, what's the best thing for doing what you just did there? I don't know. I did find one guide that someone wrote, there's a blog post, I don't have it handy, but yeah,
writing this sort of code is tricky. In fact, if you take a look at the protobuf code, you'll see I actually copied this directly from dotnet/runtime. This particular method which is doing this logic, .NET was doing the same thing: when you have a string and you want to convert it to UTF-8 bytes with Encoding.UTF8.GetBytes, this is potentially executed internally. So I took a shortcut and I just copied this code; up here, I just sort of figured it out myself. I don't really have a recommendation, this is pretty advanced stuff, it's also relatively new stuff, and I'm not sure of the best way to learn it, but when you're doing it, just be careful, because this is unsafe. You need to be checking your indexes and checking your lengths even before you do any of this stuff. My big takeaway is to be thankful to you for writing this code so I don't have to, and it's just part of gRPC now, protobuf, so that's cool. Yep. And I'll just do a summary of what's new in .NET 6 with gRPC. gRPC retries is something I've just finished writing; perhaps I might talk about it in a future community standup. Another thing is support for the gRPC client on .NET Standard 2.0, and that's now done. If you're on .NET Standard 2.0, your .NET implementation might not support HTTP/2, but you can always fall back to using gRPC-Web. gRPC running on top of HTTP/3 is work in progress, so I'm in the middle of adding support for HTTP/3 to Kestrel, and I have a basic hello world of gRPC working, so progress is being made there. And then one which hasn't been started yet is client-side load balancing: that's configuring a client to be able to talk to multiple endpoints, and it will then use a load-balancing strategy and health checks to pick healthy servers, so you can do load balancing directly from an app without having to use a proxy in the middle. Wow, that's cool. All right, well, that is a
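For the retry support mentioned above, the client-side configuration in the Grpc.Net.Client packages looks roughly like this; the address and the specific numbers are placeholder values:

```csharp
using System;
using Grpc.Core;
using Grpc.Net.Client;
using Grpc.Net.Client.Configuration;

// Retry up to 5 times on Unavailable, with exponential backoff.
var retryPolicy = new RetryPolicy
{
    MaxAttempts = 5,
    InitialBackoff = TimeSpan.FromSeconds(1),
    MaxBackoff = TimeSpan.FromSeconds(5),
    BackoffMultiplier = 1.5,
    RetryableStatusCodes = { StatusCode.Unavailable }
};

var channel = GrpcChannel.ForAddress("https://localhost:5001", new GrpcChannelOptions
{
    ServiceConfig = new ServiceConfig
    {
        MethodConfigs =
        {
            new MethodConfig
            {
                Names = { MethodName.Default }, // apply to all methods on the channel
                RetryPolicy = retryPolicy
            }
        }
    }
});
```

Clients created from this channel then retry transparently, so the calling code doesn't change.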
ton of stuff. I kept you late and I made you show the extra demo, but I'm selfishly glad, I love SIMD stuff, and I've been following along since the early days when they were looking at it, so it's cool to see it actually doing something there. So for people that are interested and want to keep up, of course they can follow on Twitter, and then, there it is, you know the questions I ask. Oh yeah, so I'm a big believer in documentation, so there's lots of docs available, and if you read the docs and you want to see examples, there's also lots of examples. I will paste a link to that in the chat. That is one step ahead of me, that is awesome. Well, wow, those are really cool, that was a lot of great stuff, thank you so much. Please let me know next time you've got something cool to show off, and we'll be happy to have you back on. Yeah, maybe just before or after .NET 6. That sounds great. All right, awesome, well, thanks everyone, this was a great show, I appreciate it, I appreciate all the cool demos and stuff, James, and we'll talk to you soon. Yep, thanks, bye. [Music]
Info
Channel: dotNET
Views: 6,144
Rating: 4.9756098 out of 5
Id: DkElWa3--8s
Length: 78min 28sec (4708 seconds)
Published: Tue Mar 16 2021