Get the most out of EF Core by avoiding these common mistakes

Captions
I'm going to start simple with counting. Counting, yes. It should be a very easy one, you know? Surely everyone with some SQL experience can do a simple count. How can it go wrong?

Hello and welcome to Tech Talk. My name is Andreas Lincke, a software architect here at SSW, and today I'm joined by JK. He's a solution architect at SSW, he's been through the trenches of Entity Framework and database performance, and he's here to tell us everything he's learned and answer some of the questions we might encounter ourselves. JK, welcome to the show.

Hi Andreas, thank you for having me. I really enjoyed the last show on logging, and I'm really happy to be back with something new.

That's right, we're here for part two.

Yes, and this time around it's all of the experience I've gathered jumping from client to client and looking at my friends' code, looking at how they use EF Core. I developed a little bit of PTSD seeing some of that code, and I want to show people how to avoid common issues. When they're wondering "should I use AsNoTracking?" or "should I use FirstOrDefault or ToList?", I'll show you a little bit of the consequences, what happens if you don't do the right thing in EF Core, and how you can fix it.

Fantastic, let's get to it.

All right. Here I have Visual Studio, but first I want to show you the data structure. Over here I have SalesDB. It's open source, we're going to put a link to it, and the most important part is that the Sales table has six million records, which is a nice representation of what happens when you make a mistake at scale. Usually the question is how you can scale things; I'm going to show how things go wrong at scale, and also how to fix them. Next up: all of the code I'm going to show you is already on GitHub, with all of the metrics I have recorded so far. I'm going to show you some code and then run it in Postman so we get a feel for the performance, but I've also already done load tests on the endpoints, so we can see how well each one performs over 60 seconds with five concurrent requests.

It seems like a really good sample database. Six million records is a medium-sized application, and we can start to see the impact of these kinds of optimizations.

Correct. And I'm going to start simple with counting.

Counting, yes. It should be a very easy one. Surely everyone with some SQL experience can do a simple count. How can it go wrong?

Well, yes, the famous words: how can it go wrong? Over here is the worst-case scenario. I have a GetBaseQuery method, because some machines actually can't handle this, so I need a little conditional: hey, if it's on this machine, don't run against the full database. All it does is return the Sales table as an IQueryable so I can query it. And what I have seen in some code bases is ToList. Don't worry about this other bit, it just sets up a little context; I can show you later what it does. What we have here is: we get the Sales table, we pull everything out of it, and then we count it. Now, if I run this, and hopefully I have not forgotten to close something...

I think I can already sense it. The number one thing with performance and databases is: try not to do your big operations in memory, and here we're loading this massive table.

Correct. What's going to happen is that it usually takes about 20 to 30 seconds for this to load.
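To make the counting demo concrete, here is a minimal sketch of the kind of setup being described. The entity shapes, property names, and connection string are assumptions for illustration, not taken from the actual SalesDB repo; the later sketches in this transcript reuse the same hypothetical `SalesDbContext`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class Sale
{
    public int SaleId { get; set; }
    public DateTime SaleDate { get; set; }
    public int SalesPersonId { get; set; }
    public SalesPerson SalesPerson { get; set; } = null!;
    public int ProductId { get; set; }
    public Product Product { get; set; } = null!;
    public int CustomerId { get; set; }
    public Customer Customer { get; set; } = null!;
}

public class SalesPerson
{
    public int SalesPersonId { get; set; }
    public string FirstName { get; set; } = "";
    public string LastName { get; set; } = "";
    public List<Sale> Sales { get; set; } = new();
}

public class Product
{
    public int ProductId { get; set; }
    public string Name { get; set; } = "";
}

public class Customer
{
    public int CustomerId { get; set; }
}

public class SalesDbContext : DbContext
{
    public DbSet<Sale> Sales => Set<Sale>();
    public DbSet<SalesPerson> SalesPersons => Set<SalesPerson>();
    public DbSet<Product> Products => Set<Product>();

    protected override void OnConfiguring(DbContextOptionsBuilder options) =>
        options.UseSqlServer("Server=.;Database=SalesDB;Trusted_Connection=True;");
}

public static class CountingExamples
{
    // Worst case from the demo: ToList() drags all ~6M rows across the
    // wire as tracked entities, and Count then runs client-side in memory.
    public static async Task<int> CountWorstCase(SalesDbContext db)
    {
        var allSales = await db.Sales.ToListAsync();
        return allSales.Count;
    }
}
```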
Actually, I'll switch to Postman for the next one, since this one is already running. What happened here is that it downloads the whole database and then it does the count, which, if you have 6 million records, is obviously horrible. But some people don't notice it, because you may have 100 records, or even fewer, in development. It doesn't scale up; you end up with not enough data to cause a performance issue until a year or two in. Here it's not currently showing us the timings, but I can tell you this would be very bad. If I go to the load tests over here, I have the worst case. I ran Bombardier, a tool where you can gently load test your endpoints, and I've done the same thing as before: the worst case, five concurrent connections for 60 seconds, so it keeps up to five requests in flight and tries to finish them as fast as possible. It was only able to do 0.1 requests per second.

Wow, 0.1 per second. That is horrible performance, and you don't want to be seeing that in your production database. Just for a simple count; imagine the more complicated queries.

Yeah. Now, if it ever happens that you do need to get all of that data for some reason, there's a nice trick: AsNoTracking. AsNoTracking automatically gives you around 30 percent better performance on the query. So if I run this, instead of taking 30 seconds it should take about 20. It may not give a clean result right now, because we're running Teams and lots of other things and the CPU may be busy. Oh, okay, eight seconds. That's actually better than I expected.

That's what they wanted.

And the GitHub repo I've created also lets you see what SQL is actually generated. You can see it does nothing special, it just tries to get everything, and over here you can see that it's actually about seven million records and it takes about eight seconds. So it's much better than the 30 seconds I was getting in the worst case.

Pretty simple update there.

Yeah, very simple. And if you still, for some reason, need to return everything, you can improve it even further with a Select, saying "just the ID" or something like that. You're narrowing the columns, so you're simply returning less data from the database; there's less going over the wire. At this stage it's less about how to fix it and more about how to limit the damage, but you can see here we get 4.5 seconds and the query is a little bit simpler. Okay, I may have made a bug over here, but don't worry about it, this also needs to be here: it's printing the SQL at the top before it runs the query, but the actual code that runs does select the single SaleId column. Now, in this particular case you wouldn't actually want to do this, but it's a really nice self-contained example where you can see the impact of doing simple things. Just with AsNoTracking we went from 0.1 requests per second to 0.5 requests per second, and with the ID-only version we got 1.4 requests per second. It's almost usable. So you can see how much performance you can gain with just a few changes. But if we move to the best case: the best case is to just take the query and call .Count. If we go here and run it, it's only 254 milliseconds.
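The progression just described, sketched against the same hypothetical context. The roughly-30% AsNoTracking gain and the timings are JK's measurements on this particular workload, not general guarantees.

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public static class CountingVariants
{
    public static async Task CompareCounts(SalesDbContext db)
    {
        // Still fetches everything, but AsNoTracking skips the change
        // tracker, which in the demo took ~30s down to ~20s (about 30%).
        var withNoTracking = (await db.Sales.AsNoTracking().ToListAsync()).Count;

        // Projecting a single column sends far less data over the wire
        // (1.4 req/s in the load test versus 0.1 req/s for the worst case).
        var idOnly = (await db.Sales.AsNoTracking()
                                    .Select(s => s.SaleId)
                                    .ToListAsync()).Count;

        // Best case: a server-side COUNT; only one integer comes back
        // (254 ms in the demo, versus 20-30 seconds for the worst case).
        var serverSide = await db.Sales.CountAsync();
    }
}
```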
Wow. Compare that with the first one, and you're getting the same result back. That's a big performance increase.

Yeah, and this run is actually slow; I've managed to get it down to 80 milliseconds. What happens here is that the average is now almost 18 requests per second, which is pretty good considering it has to count all of the rows; it's a reasonably expensive query. Another thing to consider: when you run this version, most of the work happens on the SQL Server. But when you run the queries above, with ToList and a lot of data to pull from SQL Server, there is also a lot of work on the web application. It actually gets to the point where your SQL Server isn't doing much work because your web application is so busy parsing the data.

You want to make sure those data operations are kept in the database, and the web app isn't taking on too much of that responsibility.

Yeah. And in this case we have seven million records. You can imagine: if each record is just one byte, seven million is seven megabytes; if it's an integer, multiply that by four; and if you have, say, 50 requests over a short period of time, multiply again. You quickly start talking about gigabytes that need to be allocated in memory, and if you don't use AsNoTracking, that almost doubles the amount of storage. All of a sudden your 32-gigabyte laptop can't handle the load test.

Yeah. With this simple example, you'd think most people would look at count, go "okay, let's just do a SELECT COUNT on the ID column", and never have this problem. But I guess this really shows the simple performance increases you can get by using what Entity Framework provides.

Correct. I've also made some examples you can play around with for doing it in raw SQL. I won't go into too much detail, but you can run raw SQL, and what I found is that, at least for counting, the performance is the same. Whether you go with the EF APIs like FromSqlRaw, or you go a bit more old school and run SQL commands directly, the difference is negligible. So if you're hoping to improve that kind of performance, you're probably wasting your time. It's not that raw SQL isn't useful; it's just not useful when you're only getting one integer back.

So this is a pretty simple example. Is there a way we can ramp this up? A more realistic one where we can see the performance hit even earlier, with joins and things like that?

Yes, I have prepared some join examples as well. Here I have a worst-case scenario. Not really the worst case you can do with joins; I have seen truly, truly horrendous things, but this is one of the bad examples I've managed to find.

I mean, this doesn't look too bad. I'd use this in a regular database. We're using the .Include method; what's wrong with this?

Yes, so what is happening here? What we're actually trying to achieve is to return sales with a salesperson, and we have a couple of things over here. What happens sometimes is that we have, say, this line of code pulling some data from the database, and we package it into a method, or the mapping logic changes over time.
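A sketch of the over-including pattern being set up here, again against the hypothetical context from earlier: every navigation gets an Include, but the mapping only ever touches SalesPerson.

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public static class JoinWorstCase
{
    public static async Task<object> GetSales(SalesDbContext db, int salesPersonId)
    {
        // Every navigation is Included "just in case"...
        var sales = await db.Sales
            .AsNoTracking()
            .Include(s => s.SalesPerson)
            .Include(s => s.Product)   // never read by the mapping below
            .Include(s => s.Customer)  // never read by the mapping below
            .Where(s => s.SalesPersonId == salesPersonId)
            .ToListAsync();

        // ...but the mapping only uses SalesPerson. EF Core cannot know
        // that, so SQL Server joins and returns every column regardless.
        return sales.Select(s => new
        {
            s.SaleId,
            SalesPerson = s.SalesPerson.FirstName + " " + s.SalesPerson.LastName,
        }).ToList();
    }
}
```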
In this particular case, if I go to Postman and run the no-pagination join endpoint, what we're going to see is that it takes a very long time. The reason is that we are actually fetching all of the products, all of the customers, and all of the salespeople within this condition, but if we look down here at the mapping, we're only using the salesperson. EF Core doesn't know that that's the only thing we need, because we've already pulled all of the data from the database, so there's no way for EF Core to optimize the query. If we look at the visualization, we can see that it selects everything, joins all of the tables, and then applies the WHERE clause. So it tries to get everything, and the performance is six seconds, which is pretty terrible. Let me pull up the join worst case... okay, you can see we got about 0.8 requests per second. If we go to the bad case, where we improve the situation a little by removing all of the Includes we don't need, and run it in Postman: what we did right now is manually clean up the unneeded Includes, and you can see we got half the time. So it's 50 percent faster just by removing the things we don't need, and you can see the LINQ query is also a little bit better.

Yes. And to avoid those issues altogether, we can do what I call an implicit join. In the earlier version you can see we're getting all of the data directly, the actual database entities, and keep in mind we're using AsNoTracking, which already gives us roughly 30 percent extra performance; it would be even worse without it. But the mapping to a specific class happens after we've pulled all of the data. In the implicit version, notice that I'm not using Include at all. I'm just saying: hey, this is exactly the data I need from the database. With the implicit join we might not necessarily get better performance, although here we actually do, which is a nice bonus; what's important is that we end up with the optimal query. We may get a little more performance here because we're not pulling the whole SalesPerson table out, only a couple of its columns.

So you can actually join these tables without explicitly including them.

Correct. And when you do this, you're not only improving performance in the sense that removing something from the mapping means you're no longer joining to that table. Once you start having thousands of records, each column adds kilobytes or megabytes of transfer, and then parsing, and assigning, and all of that. It adds up as you scale and need to fetch more and more data. That's where we get a bit more performance: even though it's technically the same result, it's more efficient.
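The implicit join JK describes, sketched as a projection; the anonymous DTO shape is illustrative. With no Include, EF Core generates a join that selects only the columns the projection needs.

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public static class ImplicitJoin
{
    public static async Task<object> GetSales(SalesDbContext db, int salesPersonId)
    {
        // No Include at all: projecting through the navigation property
        // inside the query produces a join over just these columns.
        return await db.Sales
            .AsNoTracking()
            .Where(s => s.SalesPersonId == salesPersonId)
            .Select(s => new
            {
                s.SaleId,
                SalesPerson = s.SalesPerson.FirstName + " " + s.SalesPerson.LastName,
            })
            .ToListAsync();
    }
}
```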
Now, another thing you can do is a split query. This is an interesting one; I'll just show you the results, for the sake of time. Because we are selecting data directly from a specific table, you would imagine that with a split query EF Core could say: since we're only querying for one specific salesperson, run one query to get that salesperson and then assign it afterwards. But it doesn't happen that way, because there's a limitation of split query: if you have a one-to-one relationship, it will not split the query. What you can do instead, if you want to do a little optimization, and a warning here about premature optimization, is fetch the salesperson separately yourself and attach it to the results. The reason is that we're constantly getting the same salesperson in the same query, which is a waste; we already know it's the same one, so we can get it once. So why not just say: okay, this is the person, let's pre-fetch them? Now, there's a wrong way to do it: if you use the pre-fetched object inside the query's projection, it actually gets folded into the SQL that's executed on SQL Server and included in every single row, which makes each row larger, which means more data to download from SQL Server and more to parse. That version in fact takes about 300 milliseconds longer than it should. But if you do the same thing as before and assign the salesperson separately, outside the mapping, then, and all of the results are in the GitHub repo if you're interested in the details, in my testing this is sometimes even twice as fast as the original queries. On my laptop it's very janky, but on my desktop, which has more consistent performance, I had much better results. We kind of forgot to look at the load test data, but here it is: about four requests per second with the implicit join; roughly the same with split query; with the manual version that did the projection we almost lost a full request per second; and with the fully manual version, where we assigned the mapping separately at the end, we got seven requests per second, which almost doubles the performance. And if you look at the code, it's really not that difficult: we get the one person separately and then update the list at the end, in memory.

So that's simple joins; here's something a bit more complicated. I won't go too far into the details, but we're selecting from multiple different tables. Over here we have employee stats: we want to know how many sales each person has, the first name, the last name, how much they sold, and at the end we want a list of all of their sales. There's another hidden join in there, the product name, so there are two implicit joins. If we ran this one as-is, the "complex" case, we'd get a certain performance, but you can improve it by simply adding AsSplitQuery. In this case it does help. There's only one issue: AsSplitQuery does not allow a select inside a select, so you'd need to process that part separately.

Some people might not know that Entity Framework actually provides this AsSplitQuery functionality, where if you're doing these joins, Entity Framework can try to handle that for you.

Yeah. By default it's a single query, but you can say: hey, I want to split this up, and for things that are many-to-many, it's going to split the query into multiple ones.
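Two sketches of what was just described, against the same hypothetical context: the manual pre-fetch of a shared one-to-one navigation, and AsSplitQuery for collection includes. The in-memory assignment is my reading of the technique JK demonstrates; the exact shapes in his repo may differ.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public static class SplitExamples
{
    // Manual pre-fetch: the single shared SalesPerson is loaded once and
    // attached in memory, instead of being joined onto every row.
    public static async Task<object> PrefetchPerson(SalesDbContext db, int salesPersonId)
    {
        var person = await db.SalesPersons.AsNoTracking()
            .SingleAsync(p => p.SalesPersonId == salesPersonId);

        var sales = await db.Sales.AsNoTracking()
            .Where(s => s.SalesPersonId == salesPersonId)
            .Select(s => new { s.SaleId, s.SaleDate })
            .ToListAsync();

        // Assign here, after the query. Capturing `person` inside the
        // Select above would push its values into the SQL instead,
        // bloating every returned row (the ~300 ms penalty he mentions).
        return sales.Select(s => new { s.SaleId, s.SaleDate, Person = person })
                    .ToList();
    }

    // AsSplitQuery: collection includes become separate SQL statements
    // rather than one statement with large joins. Note it does not split
    // one-to-one reference navigations, per the limitation in the talk.
    public static Task<List<SalesPerson>> SplitQuery(SalesDbContext db) =>
        db.SalesPersons
          .AsNoTracking()
          .Include(p => p.Sales)
          .AsSplitQuery()
          .ToListAsync();
}
```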
For instance, if I quickly run the complex one: we have a couple of joins, a pretty decently sized SQL statement, and two seconds of execution. With the split query we get a visibly simpler query. We do have a warning over here, but it should be fine; it just means that not everything could be split. We can't see the performance here, but going to the load test: the complex join is 0.16 requests per second, and with the split query it's 1.74 requests per second. So if you have a very complicated query, you can sometimes get away with just adding .AsSplitQuery and modifying the query a little. And if you go further and manually split up the queries, you can get even better performance.

I think what's surprising to me here is that you can actually get more performance by doing more queries instead of putting everything in one.

Correct, yeah. I'll just show you the code for the manual split. You can see I'm getting the information for the employee separately, and once I have that, I say: get me all of the sales for that person. The interesting thing is what I do next: once I have all of the sales, I take all of the product IDs, in memory, and make them distinct, so we only have unique product IDs, and then I use them in a new query on products, filtering by them. Now, there's a caveat here; there are diminishing returns. If this list contains too many unique products, it becomes a performance bottleneck, so you need to understand your data. This particular piece of code, the products-Contains query, is at the same time a very good way of optimizing your queries and a very good way of destroying your SQL Server. It's all about how many IDs end up in that list. That's one of the warnings I'd give here, and that's what I have for joins.

One of the last things I want to leave you with, and I'm actually preparing a full talk about this because it's a massive topic, and it's the one thing I've seen consistently done incorrectly, is pagination. Here I have a worst-case scenario. What I see in practice is that because people don't know how to get a count for pagination together with filtering, they get all of the data from the database, which is a horrible way to do it.

It's a horrible way to do it, but, you know, sometimes I filter in memory. How bad is it, really?

If it's a fixed table, where you have a certain amount of data and that's it, it's okay. For instance, user roles, or sometimes even a users table where you know you're never going to have more than, say, a thousand active users. It's not great, but it's not going to destroy your database per se. But in this particular case I was testing the code against a sales page. Imagine somebody implemented it that way very early on, then couldn't afford developers anymore, and it stayed that way for years. For years they accumulated data, and for years the page got slower and slower. Here we get only two requests per second in that situation.
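The manual split JK walks through a moment ago, sketched end to end; the caveat about Contains lives in step 3. Names and shapes are assumptions against the same hypothetical context.

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public static class ManualSplit
{
    public static async Task<object> EmployeeStats(SalesDbContext db, int salesPersonId)
    {
        // 1. One small query for the employee.
        var employee = await db.SalesPersons.AsNoTracking()
            .SingleAsync(p => p.SalesPersonId == salesPersonId);

        // 2. One query for their sales, projected down to what we need.
        var sales = await db.Sales.AsNoTracking()
            .Where(s => s.SalesPersonId == salesPersonId)
            .Select(s => new { s.SaleId, s.ProductId })
            .ToListAsync();

        // 3. Distinct the product ids in memory, then filter with Contains.
        //    Caveat from the talk: Contains becomes an IN (...) clause, so
        //    if this list grows too large it hurts SQL Server instead of
        //    helping. You have to know your data.
        var productIds = sales.Select(s => s.ProductId).Distinct().ToList();

        var products = await db.Products.AsNoTracking()
            .Where(p => productIds.Contains(p.ProductId))
            .ToListAsync();

        return new { employee, sales, products };
    }
}
```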
But you can fix this. And this is also a query that already limits quite a bit: it's just one salesperson. Imagine a much broader query. And here, I forgot, you can see that I'm again filtering after the ToList. The more optimized way is to write the whole query up front. I build the WHERE clause, I build the Select, and then I say: okay, for this IQueryable variable, give me back the count. It executes the query, but the interesting thing is that it ignores the Select statement, and anything else that isn't relevant to getting the count, and that's a very fast operation, all things considered. Then on the next line I say: now take the same query, skip the number of results I need to skip, and take one page's worth. This is highly performant: it's now 28 requests per second, so we've gone from two requests to 28.

That's quite a mind-blowing performance improvement from such small changes.

Yeah. We've made some small changes and got roughly ten times better performance, and there are cases where I have found a hundred times. There are a lot of cases where people have written foreach loops and made a database request in each iteration, instead of consolidating which IDs they want and prefetching them. There's a lot you can optimize, but keep in mind that there's a balance between optimizing and just making it work. The saying goes that you shouldn't prematurely optimize, but there's also a danger in saying "okay, it just works", because "just works" can destroy a database with a little bit of scale. So you should really think about the best performance you can get for the code readability you want. I don't want to spend half an hour or an hour on it, but if it only takes five minutes, I probably should do it.

Yeah, it's an important consideration to have in the back of your mind whenever you're dealing with a database.

Correct. One interesting issue I've found is that people were puzzled about how to do filtering in EF Core when you have several optional WHERE conditions. What they do is either write very funky logic, "if it's this and that", or, and I have seen this case, they wrote a complicated WHERE along the lines of "salesperson ID is null OR salesperson ID equals the value", because the filter is nullable, put that in brackets, and then stacked those statements with ANDs between them for every filter. That makes it really badly readable. Here we have three different parameters; you can imagine more complicated filters with even more. And all of those conditions are actually sent to SQL Server, which has to parse all of it and filter on those rules. The better way is to use an IQueryable and just say: if I have this value, add this condition; if I have that value, add that condition. You're effectively stringing AND operations together, so with all three values present it's the same as writing the combined WHERE with this AND that.
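Both patterns from this part of the talk in one sketch: the count-then-page approach for pagination, and the composable chained-Where filtering. Parameter names and the 0-based page numbering are hypothetical choices for illustration.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public static class PaginationExample
{
    public static async Task<object> GetPage(
        SalesDbContext db, int? salesPersonId, int? productId,
        DateTime? from, int pageNumber, int pageSize)
    {
        IQueryable<Sale> query = db.Sales.AsNoTracking();

        // Composable filtering: each `if` chains another Where, which EF
        // Core combines with AND, so only the filters actually supplied
        // ever reach SQL Server.
        if (salesPersonId is not null)
            query = query.Where(s => s.SalesPersonId == salesPersonId);
        if (productId is not null)
            query = query.Where(s => s.ProductId == productId);
        if (from is not null)
            query = query.Where(s => s.SaleDate >= from);

        // A lean server-side COUNT over the same filters; parts of the
        // query that don't affect the count (like a projection) are ignored.
        var total = await query.CountAsync();

        // The page itself: Skip/Take translate to OFFSET/FETCH, so only
        // one page of rows ever leaves the database.
        var page = await query
            .OrderBy(s => s.SaleId)
            .Skip(pageNumber * pageSize)
            .Take(pageSize)
            .Select(s => new { s.SaleId, s.SaleDate, s.SalesPersonId })
            .ToListAsync();

        return new { total, page };
    }
}
```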
If you need an OR operation, there is a way, but I don't want to go into it here; it would be too much.

For that, people can check out your blog or your talk.

Yes, I'm working really hard on getting that done. And just to blow everyone's mind with the last thing: the Any operation is very interesting. Obviously, if you do a ToList and then Any after you've fetched all of the data, that's terrible performance. But what genuinely surprised me is this: take FirstOrDefault, so you get just one item, and compare it with the same query doing just .Any. The FirstOrDefault is slightly faster. What that means is that if you want to check whether a record exists in the database and then get it, it's better to just get it right away. Now, in this case the column sizes are small, so it may not hold for larger tables with larger columns, but I was querying the Sales table, which at 7 million rows is pretty big, and yes, FirstOrDefault was slightly faster, which genuinely shocked me.

Yeah. I think what this shows, for anyone looking at database optimization, is that there are so many small steps, and you can see it from the sheer volume of examples you have here. Somebody can apply tiny optimizations, and the more experienced you get, the more you can put them in place as you're developing the code for the first time. These small things you don't do, or do without thinking, can add up and cause a really unperformant database. Having a review, adding these small optimizations, understanding your data, it can really have an impact, and it becomes so important to the end user.

Correct. And if you're interested in all of the performance details, I haven't gone through everything for the sake of time, but all of these images with the load test results are on GitHub; we'll show you the URL. At the moment they're in the Blazor application, under wwwroot/images, and the reason for that is I'm going to make the Blazor app the entry point where you can test things out and get a bit more explanation, like I gave just now. And if you want to replicate all of the tests I've done, there's a PowerShell script in the API project that shows roughly how I did it. You just copy this part over here, change the endpoints, and you'll be able to run the same tests I did on my machine. My machine was pretty decent: a Ryzen 5, what was it, 2500X, with 32 gigabytes of RAM. That's what I used to test it all.

Fantastic. So if people want to check out your results, they can have a look at this GitHub repository, and they should definitely go check out your blog and follow you on Twitter for the latest in databases and the many other adventures you go on; I know you speak on a bunch of different topics. I just want to say thank you for your time, JK, it's really been enlightening for me.

Thank you for having me, and we'll see you next time.
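For reference, the existence-check comparison from the end of the talk, sketched against the same hypothetical context. The "slightly faster" claim is JK's measurement on the 7-million-row Sales table; it may not hold for wider rows.

```csharp
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public static class ExistenceChecks
{
    public static async Task<Sale?> FindSale(SalesDbContext db, int saleId)
    {
        // Pattern A - check, then fetch. AnyAsync translates to a
        // server-side EXISTS(...), which is already fine on its own.
        var exists = await db.Sales.AnyAsync(s => s.SaleId == saleId);
        Sale? viaAny = exists
            ? await db.Sales.AsNoTracking().FirstAsync(s => s.SaleId == saleId)
            : null;

        // Pattern B - just fetch. In JK's test this was slightly faster
        // than the check-then-fetch round trip, so if you need the row
        // anyway, get it right away.
        Sale? direct = await db.Sales.AsNoTracking()
            .FirstOrDefaultAsync(s => s.SaleId == saleId);

        return direct;
    }
}
```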
Info
Channel: SSW TV | Videos for developers, by developers
Views: 2,208
Keywords: ssw, software, industry, cutting edge, latest, developments, development, dev, design, functionality, microsoft, windows, computing, computer, advanced, expert, information, knowledge, dotnet, .net, angular, developers
Id: C9Fnysvvgvg
Length: 38min 7sec (2287 seconds)
Published: Tue Sep 07 2021