Shay Rojansky - How Entity Framework translates LINQ all the way to SQL - Dotnetos Conference 2019

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hi everybody welcome to this talk which dive into an ORM into specifically the entity framework or query pipeline a few words about myself super quickly so as I was already introduced I'm an engineer I'm part of the team that actually works on entity framework so I'm gonna talk about sort of my day to day job I'm also the lead of of npg sequel which is the Postgres dotnet driver I kind of do a lot of stuff around post quest for dotnet if anybody needs anything come see me afterwards I'm totally a hardcore linux guy I work on Linux that's what I do all day long and I'm based in Berlin in case anybody cares or it comes to Berlin so what is this talk about first of all this talk is not about a low-level intrinsics gcj runtime I feel like I should apologize because in this talk it's almost like a disappointment but I hope I hope to show you that it can be interesting even if we're not talking and assembly and intrinsics and all that stuff yeah it's also not a talk about the new EF core 3 feature set or anything like this you can read the blog post or the release notes I don't think it's worth spending our time over this what we will do is do a very I think quite a deep dive into what actually happens inside the ORM Peter's talk for example did some sort of dive into the GC mat this morning did also a dive into some scenarios of the runtime I think I really liked what Matt said at some point you have to always know that was a nice quote you have to always know one level behind underneath you so I'm guessing a lot of people here have had a chance to use in all realm and so on this is gonna be a little bit about this so we're gonna dive into that layer that you're using a lot of you are using I think you know in some sort of permanent way and we're gonna actually look inside the belly of this beast it's quite a beast it's very complicated and see what goes on inside there's gonna be a lot of architecture and of course this is nanos there's gonna be a lot of performance as well you're gonna see how they they're kind of intermarried let's do a very like a very quick kind of getting started before we really dive into it so all our M's in this context they allow without them we'd have to do two things which we really hate doing as developers one is write sequel so the problem with writing sequel is not that we have to understand sequel one of the big problems with it is that we would be writing sequel and strings you know in our program it's not checked by the compiler we changed some property or some column name everything breaks it's not a good way to do stuff to you know integrate strings in your database and so on and in general sometimes it's good to the idea of our aims is to delegate the responsibility for this task which can also be quite complicated to a component that is supposed to know how to do it the other thing that's very important in the context of this talk is materialization which is I'm gonna focus specifically about the part where fine I sent the sequel to the database now we have to read the results back when a database sends results back it's it comes like in some sort of table which doesn't at all correspond to your objects which you have in your program to your pocos or whatever your types so you have to do this thing there's an impedance kind of thing you have to read all this you know crap from the table data that you get back and you have to construct materialized entities from it nobody wants to do that stuff a few words specifically about dotnet ohms and dotnet so c-sharp is a bit special because c-sharp has linked link is frankly one of the great features of you know c-sharp it has been there for a while and it I still consider it like a very amazing thing link the whole point of link is that we have sorry we have link as language integrated query link defines operators which we can use to express queries we can run those queries with the same operators on memory objects but we can also use them to describe something that will go to the database so this power of using a native c-sharp statically type compile time verified query operator set and to use that over us over like a sequel database or anything of that sort is frankly quite amazing in other languages where something like this doesn't exist in orem typically we'll have to define its own operators right so when Oren has a set of functions where we use stuff like where and order by these standard link operators over there in in other languages many times and Oren will come with its own set of functions and it's it's a completely different world one gotcha one thing that you have to remember is that link can express an infinite number of queries this is a very big headache for us and if cor there's a huge space of possibilities of what you can express and a very restrained subset of that is actually translatable into sequel and add one more thing even within the things that are translatable many times you cannot translate them in an efficient way so there's stuff that isn't translatable at all you can't you cannot express it in sequel and there's stuff that you can maybe Express but you probably don't want to because it's really gonna slow you down just kind of reminding things if core just really a few moments he of course Microsoft's recommend the data access technology it's the ORM that is basically receiving you know active development and innovation at the point it's not the first one Microsoft previously had linked to sequel in that sense we are standing on the shoulders of giants completely this is also something that is linked to sequel is more than ten years old now there's also EF six nad frameworks this would six which is the predecessor it's still there it's maintained but it's no longer receiving any sort of you know innovation or moving on we just had the release with that in core three we also have EF core three the query pipeline which is the main focus of the stock just got rewritten so it's kind of like a nice occasion to actually talk about what happens in that thing and one last thing contrary to most arms that you're gonna see and also the other two Microsoft alarms their core has as a goal not to be only about relational databases it's not just about sequel if core as an architecture has this split there's a basic part and then there's the part that's about relational databases what this means in practice is you can use AF core to interact with databases which are for example no sequel a database there's an in-memory database that you can also use there's absolutely no sequel involved this is kind of an interesting concept and it's it's gonna be a little bit important okay I'd like to dive right in just a quick show of hands how many people here have actually worked with the c-sharp expression tree API I'm kind of interested wow I'm actually quite amazed I didn't think there would be so many so I'm gonna breeze through this I'm not really gonna breeze through it a lot of people haven't so I'm gonna actually explain it but it's good to see so many people have seen this code and presentation is measured by how quick it gets to the first line of code so what you see here is something very trivial you have a lambda this is something compared so it's a lambda that receives one integer parameter and returns whether it's bigger larger than five or not just scaler Co no no intrinsic snow vectorization very simple stuff we assign this thing into a function this is sometimes called a delegate in any case it's a function compare is a variable of type function which receives an integer and returns a boolean this is super trivial C sharp I'm not gonna even spend another thing what most people don't know is that on the left side of that line we simply enclose the function by an expression in the generics we just add another thing everything else stays completely the same and something completely magical happens instead of compiling a delegate which is what happens in the beginning which we can simply invoke what happens now is that the c-sharp compiler generates an expression tree for us this is not code this is not something we can run it's not IL it's not assembly it's nothing of the sort it's an actual hierarchy kind of structure that represents that code all right that's very very important it's a representation of that logic what we can do what that representation is compile it that's the second line and at that point we do have the function I left out the generics we have the exact same signatures above so we got to the same thing we have in the first line but via a more complicated means if we want to complicate this further here we use the compiler services to actually generate this expression tree it's a weird thing but we can also generate it ourselves we can use the explicit expression tree API and it looks something like this that's pretty ugly already I'll go through it quite quickly so the first thing that you see here we have to generate something called a parameter node now we're in expression trees right so there's gonna be a node of type parameter everything here works as a factory so expression dot parameter returns a parameter this is just how it works because the lambda has a parameter so we generate that thing now our compare expression the second one is a lambda so we create this thing again via a factory it's body is an expression node of type less than so this represents the less than operation here it has two operands it's a binary expression which has two operands the first one is the parameter and the second one is the constant five I hope everybody understands how this ugly kind of thing naps do this nice and concise kind of thing in practice they mean the same thing we also have to give the lambda the actual parameter that's just how the API works if we visualize that we have this kind of structure once again a lambda node targeting referencing a binary expression of type less than which itself basically represents a comparison between parameter and constant once again we can take this and compile it and we're right back where we started okay hoping this is completely obvious to everybody by now what is this good for why am I talking about this the first thing this is actually you can take this home and use it right away unlike the rest of my talk which is more about internals this is actually useful stuff this is usually used for cogeneration it's a great solution for broth instead so what typically you're gonna do let's say you get some sort of you know configuration or user input you can use this ugly kind of API it's actually not that ugly I'm saying that but it's kind of okay you can use this API to construct an expression of what you want the logic and then compile that thing and in runtime this is super important the compiler will generate that code which will run almost as a fish basically as efficiently as if you wrote it yourself think about this when you do you know when you do this kind of thing when you have a very hot piece of code a very perfect sense ative piece of code you can sometimes speed it up only in that case don't do this like in general but you can speed it up by encoding the exact logic you want as an expression tree and then boom you compile it alright let's just compare this to a few other alternatives for cogeneration I'm just kind of this is like an introductory thing of course we have Roslyn which is the beloved C shop compiler we can use Roslyn from the c-sharp program to actually compile something Roslyn operates at a pretty high level so it represents source code files and c-sharp so it has even all the white space and the comments and all kinds of crap we don't really care about right that's not what we what we want so the job of Roslyn is a bit more high-level but you can really do anything with it notably with Rozlyn you can generate classes assemblies entire assemblies and so on expression trees are much more limited typically you're going to use it to generate expressions rather than say a big big function although that's also possible another alternative is something called t4 and guessing not a lot of people have heard about it this also operates at the source code level but it's basically a template language so you write a c-sharp file and inside you have if you use razor or any sort of templating language it's basically the same you can integrate stuff and you use it to cogent rate at a very very very low level we have something called reflection emit which is an API for generating il I really hope you don't ever have to use this it's pretty rare if you if you you if you need to use it it's because there's something you cannot express in c-sharp so you're forced to go down to IL this is a very very rare case but it's possible an expression tree is in a very snug comfortable kind of place somewhere in the middle so it's actually a pretty cool API once you get used to it finally getting closer to our or RM world it is expression trees are also the basis for something called a feature of the language once against again called iqueryable which probably not a lot of you know about its plumbing for our ends and things like this what we've seen up to now is taking these expression trees and compiling them into a delegate which we can then invoke cool but what i queryable allows us to do is to write a provider to write a piece of code that instead of compiling and invoking that thing it's gonna get it as input and execute it in some other way typically an ORM is gonna take this expression treat convert it translate it into sequel in some way do a sort of transformation and send it to the database this is what an Oran does it's basically an iqueryable provider in this sense this is what the query pipeline is hoping this is all very clear so I want to jump right in and start talking about some problems when we translate from the world of link to the world of sequel okay what kind of stuff are we gonna hit let's look at something very simple right let's call this null semantics so if you have this link query you see we have a like a thing called employees kind of like a database table and we have a where operator and we're comparing the first name to the last name so we're searching for all employees with an identical first-name and lastname anybody here read catch-22 ooh am i referring to no ideas major major no anyway I tried the joke nobody got it usually go read catch-22 it's a great book usually the I mean the thing a naive answer to translating this into sequel would be to translate it to this select from employees where first-name equals lastname this is wrong it's a very very typical kind of beginners mistake for people who have not really had a lot of experience with sequel sequel has what's called three-valued logic and normal you know in c-sharp and other normal languages you've got true and you've got false in sequel you've got true/false and now it's really a three-valued logic system it works in a very different way comparing first thing to now will never give you true even if first name indeed is null it will not give you false either what it will give you is of course no right this is why you have a specific operator for testing if something contains a null it's a very very different world from c-sharp here we have our first big mismatch between these two worlds so translating any simple equality this is the most basic thing you could imagine we already run into a wall all right so now this is like a point of philosophy what EF core tries to do in general and it's a decision is to try to provide the same behavior in the database in the sequel as the user would expect in the c-sharp program because we're using link once again and not a dedicated operator you know language set coming from the ORM like in other languages because we're in length there's a natural you know we have a natural expectation for the code as we write it to perform the same way as if it ran against objects in memory right makes sense that's not trivial if we want to do this if we want to mimic the c-sharp behavior we are going to have to write something like this this is the actual correct 100% correct translation you have to check if first naming is equal to last name or if both of them are null I hope everybody understands what's going on anybody with just a little bit of sequel experience understands this very well other people I mean I'm guessing it's gonna be okay note I'm hoping note that if one of these columns are non nullable we know that in the database schema they cannot accept now then we can optimize this away right it is enough for just one of these columns do not beat and not be capable of having now and then we no longer need this check and it's enough to just say just check first name equals last name so we now can use information that we know about the schema and the database to may be optimize this query a little bit just giving you a little bit of an idea of you know the kind of stuff that we do here I don't want to give a counter example to this idea of mimicking Seashore behavior it's the best effort thing for example is this comparison first name equals last name is this case case sensitive or case insensitive the answer is it depends where you're running different databases have different behavior the the most awful thing about writing an ORM as opposed to for example something like asp.net is that you have to target different databases and each one of them believe me has awful quirks and you know arbitrary behaviors it's an awful world but just the fact that you have different providers is already quite awful in in many many ways so already different databases are gonna work in different ways now we could have guaranteed a case for example a case insensitive thing just to give an example we could have added like to lower or to upper on both sides we could have done some sort of forcing a c-sharp behavior on to our target database but that would have a big performance cost because then if we forced to upper or to lower maybe our indexes don't get used and so on what I'm trying to say is this idea of trying to mimic to match the c-sharp behavior only goes so far and at some point we're gonna stop we're gonna say sorry guys this is the natural the behavior that your database gives you we cannot go around it without starting to hack things in a way that's gonna compromise performance in a very significant way let's move on I'm gonna go over a few more translation examples a bit more complicated let's say we want all employees where their tasks count is equal to two so now we have a related a related kind of table there's like a foreign key relationship and all that kind of stuff doesn't really matter so we have something a bit more complicated happening in the where clause not sure how many people are aware of you know how this kind of thing translates but the natural way to translate it is like this this is what we call the sub-query in sequel so the we're what you see here we open a parenthesis and we have another query inside that other query returns one scalar value it doesn't return like a table but just a single count for that employee and we compare that to to just showing this because it's a complex thing we have to transform this thing into that thing just as there's no real mismatch here per se but it's it's a complicated task and we're also gonna need this example in my next thing the next thing is instance equality this is another weird mismatch that you're gonna see let's say we have this code here employees where the employee is equal to their boss so basically we're asking who here is their own boss right I guess like anybody who doesn't work in a big company like Microsoft basically so this kind of query is weird if we try to translate it into something like a database an employee here employees corresponds to a table which means that a single employee corresponds to a row in that table if we try to just translate this nightly we're kind of asking to compare rows there's no such thing in relational databases there's no idea of a customer or an employee or anything of that sort there's only rows and columns so then the idea here does not translate at all what we're asking in this eShop world is basically to compare references write a thing an employee here let's say it's a normal you know a normal reference type in C shop is identified by their reference so we're doing reference equality the the corresponding thing in the sequel world is this is what I'm looking for is to compare IDs so in the relational world there is an idea of an identity in c-sharp it's the reference and the sequel world it's the key the primary key so what we have to do here is translate this equality into this kind of thing in sequel note that this is happening because we're comparing the actual instance this is not a property in the employee it's the employee itself this is the key this is why this is complicated this is the thing that does not translate the sequel okay so this is instance or entity equality if you speak the core kind of language which I'm trying to avoid a little bit note that the same thing happens even if we replace boss this this thing here we can replace it with a sub-query for example the employees where their first task is equal to some ex note that in general if you put a first you also have to put an order by but I'm glossing over a little bit because I don't want to make it too complicated there's also an ordering mismatch between C sharp and and you know the relational world and we get this thing where this complic arbitrarily complicated sub query soon you'll see where I'm building to dot ID 1 is equal to X dot ID 1 everybody's clear about this we're just playing games with you know translation I'm gonna take this two steps further in the relational world something can have what's called a composite key this is very foreign to c-sharp in c-sharp you have a reference that's the idea that's the identity that's the identifier but in the database you could have two columns as your primary key as your identifier so okay fine that's not so difficult right all you have to do is write something like this so now when we translate that thing we look we see oh that thing has two columns two keys two key properties so we're gonna do an and between them clear right again nothing to write home about but then if we combine the whole thing with again with the sub-query I'm just reminding this then we'll get this equal right here so we'll have this sequel here which is that that sequel that represents the the tasks will have it here once again we have to duplicate it because there is a composite key so the combination of a sub-query plus a composite key plus instance equality yields something which we call the double evaluation this sub query is gonna actually execute twice in the database and that's gonna be quite horrible for performance this is a great example of something which is actually translatable but we don't want to translate it right it's like it's what we sometimes call a pit of failure somebody will stumble into it they won't even know they're sequel will be horribly inefficient okay so this is the kind of problem that we have when we translate between these two languages or domains which are really quite different moving on there's two more kind two more things and then we're going to dive into actual architecture so related instances is basically we have employees and the employees have tasks right the tasks are available on the employees when we're in link to objects we're in c-sharp there's just a reference there right so an instance of an employee has a reference to a lot of to a list of tasks or whatever but when we're now going to fetch something from the database we don't simply have those tasks we have to choose whether to fetch them or not the fact that something that a user wants to have all the employees doesn't mean that they necessarily want to also have the tasks for each employee it's a very different situation right we need for the user to tell us exactly what related stuff they want to bring back from the database otherwise once again we're gonna bring a lot of useless stuff that they're not gonna use network bandwidth database load a lot of performance problems so in AF core for those who haven't used it ever we've devised we basically invented an operator this is not rocket science but we have something that is not part of link called include this is our own thing it looks a little bit like a link operator but it is not it's a way for you guys to signal to us that yes get the employees but get them along with their tasks it's a way of communicating with EF with the ORM to tell what you want it doesn't make any sense in the in the memory world right because the tests are just there okay this will generate this sequel right here so this naturally will produce a join in this case a left join now it's it's worth stressing that this is actually very new behavior in E of course three if you do this exact thing in earlier versions of EF you're gonna get something very different instead of having one query with a join you're gonna get multiple queries so you're gonna get to in this case two queries one for the employees one for their tasks it's a radically different way to translate the exact same user request get me the employees and get me their tasks there's advantages and disadvantages to both the problem with the multiple queries is exactly that it's multiple round-trips to the database so if you're including ten things then we're doing ten round trips to the database okay that's one disadvantage on the other hand this kind of query with the left join anybody who's done database knows about a phenomenon the so called Cartesian explosion when you do a relational join as you're adding more and more joints various column data gets duplicated across the rows I'm not gonna go into this because those of you who know databases know those who don't it's not worth really going into the point here is that no solution is exactly perfect each has its advantages or disadvantages one additional important thing that's problematic with the multiple round trips is transactionality if we send one query for the employees another query for the tasks maybe the database has changed maybe somebody has already updated something in between and we get a mismatch we get an inconsistent view that's about transactionality so that's a good reason to say one linked query is better translated as one sequel query it maintains transactionality coherency and all that kind of stuff all right another point is that it is still possible today of course to write your query in a way that's actually going to perform multiple queries if this is what you want for example if you have like 10 joints that you need to do it's probably a bad idea to do this with 10 joints because that's gonna do this dreaded Cartesian explosion so you can rewrite your queries if you look on our site you can rewrite them in a way that won't reduce this in the old world there was actually no real way it was much more difficult or let's say there was no way to actually produce a joint so in the new world you can have both you just have to be explicit about it but by default we're gonna produce this so just zooming out again it's another case of there's there's really two possible translations in behavior of how to you know do the thing that you asked and each of them has its advantages and disadvantages the last thing with translation is untranslatable ax T that's another important thing that also changed in 3.0 let's say we have something like this employees where and then some function that returns a boolean on e how do we translate this so the answer is it depends some function some func might be something that we can actually translate there are a lot of functions that we know how to translate if you for example you if you run like a c-sharp substring function this is something that we know how to translate into sequel because sequel typically databases also have a substring kind of function that works the same way but in some ways we don't know this function we have no idea so then what happens in all the versions you would we would translate to the server if we can otherwise we would evaluate on the client so EF core 2.2 2.1 had what's called kleine evaluation we would transparently automatically just evaluate it here all right it's a cool feature because it means you know you you do your link and whatever we can we'll execute it on the server and whatever we can we'll bring the data and executed here of course this has some problems maybe some of you have already hit this this can bring a lot of data to the client so once again if some func just to make sure everybody understands if some funk is translatable we send it in the sequel to the database the database does it and returns only the rows that matched if some funk is not translatable we are going to download that whole friggin table to the client and then we're gonna apply that function because that's the only thing we can do the difference between these two is non manifest if I look at this query I have no idea what's gonna happen depends depends on if that function is translatable on my database provider maybe tomorrow I try to run I was on sequel server now I try on Postgres suddenly the whole thing changes ok this is particularly dangerous because this is typically a problem that you find when switching from testing to production because in testing you have a small data set you're running your test so everything is fine then you push this thing to production and suddenly your table is huge so you're transferring like I don't know a gigabyte of table data to the client boom application crashes this actually happened another more subtle issue is that adding a new translation can break behavior what does this mean if today some funk is something we cannot translate we bring the data and we do client evaluate great let's say tomorrow we add a new feature it's a cause for celebration right it's like a new feature for EF core that means that all of a sudden you upgrade an instead of bringing the data and executing c-sharp code we now send a sequel that does it on the server side the behavior as we've seen is gonna maybe be slightly different because this in the database it's not going to be a 100% exact same behavior as at the client and suddenly the behavior has changed so it sort of started making it it made it impossible for us to make a change while being sure that there's no breaking change involved and this is why because of these issues of core 3 throws if we cannot translate we just don't do it the only exception is in the topmost projection so if you have a select at the very end that's like a top a top most projection then we allow a client evaluation because the data is coming anyway to the client there's no danger in doing client evaluation but I'm not not gonna talk about this anymore all right what did we learn from we talked about like four or five cases we've learned that there are many many mismatches between C sharp and and C sharp link and sequel it does not translate well these are two different worlds different rules nulls are not the same nothing is the same basically it seems like it's very easy it really is not unfortunately we also learned that Oren's have to make big decisions these decisions are not it's not like a straightforward thing it's not obvious what we should do there are decision points with pros and cons to each way that we decide to go it's it's quite it's quite hard and it's very very hard to satisfy everybody by the way each each one of our users has their own like idea of what they need and so on and they're used to doing things in a certain way it's a big headache I'll just say and we need to do many things to expression trees before they can before they can get converted okay so let's zoom out how are we gonna actually do this thing so the query pipeline that we're going to build it has inputs and outputs so the input is a link expression tree that we get from an iqueryable what are the outputs for a relational database one is a sequel obviously and the second one is a shaper or materializer which is a component that's going to read back all those remember that I said in the beginning these are inputs and outputs for this component what we are basically doing is in a sort of way building a compiler right taking an expression tree for one language which represents eShop and built and basically compiling it into a sort of sequel we actually use that word internally so it's actually an interesting thing what is the architecture of such a thing look like the query engine what we call the query pipeline is architected as a pipeline of visitors each visitor is a sort of component that is going to traverse the expression tree that we get from link and does exactly one thing one of these massaging operations manipulations that we just talked about but you will see is necessary so let's take a look a little bit at what a visitor looks like in this architecture it's quite important for example this instance equality remember what we said it has a very simple job it traverses the entire tree it attempts to identify places where we're we're doing equality between instances and when it finds those it's gonna replace those expressions by expressions doing equality over there keys I hope that's clear that's a one job description for one visitor in our pipeline let's take a look at a very minimal kind of exercise of what a visitor actually looks like I'm gonna kind of I need to do to speed up a little bit so this might be a little bit brief but let's say that what we want to do we have this kind of query where CH is larger than 18 and C name is not equal to null but we know that C name is actually a non null above column in the database there's no need to check to generate sequel that asks whether it's not null we know that it cannot be null so we want to simplify this optimize this to this so we're going to show we're gonna see in a very quick way how to write something like this so we have a base class called an expression visitor in C shop this is a standard class the default behavior of this thing is basically you traverse the entire tree and to do nothing what we are gonna do we're gonna override one method one type of expression that interests us we're gonna basically hook into all binary expressions and what this method returns is what is going to replace that binary expression in the tree this is a very common pattern called visitation so I'm not going to go too much to detail here but basically you identify nodes which are of type not equal we're looking for non equality nodes if we find those we check if the right side is a constant and that constant is null and the left side is not malleable I'm basically identified the exact case that I want to optimize and in that case I'm gonna return a constant of true so what I did is replace what you just saw before into something that looks very weird see age is larger than 18 and true okay I hope it's kind of clear what what just happened here this went over the entire tree identify this specific case and replaced it with a true it's one component because we want to get rid of that true as well we have another visitor that identifies and also and also is basically a logical and and once it finds on the right side of a constant which is of type bool and that bool is true then it's gonna return the left side all right so it's gonna replace this ridiculous expression with just the left side which is the part that interests us and we get this to recap we have this expression that came in one visitor made it into this and another visitor made it into this this is exactly how the whole thing is architected if you look at an actual this is an actual code fragment from EF core inside what you see at some point is this chain of visitors there's a lot of visitors each one gets a query visits does something returns we passed it the next one returns the next one returns each one has its own function we even have the energy equality rewriting expression visitor that's the one that I just described before it looks like this it's a chain of visitors and slowly we transform the expression in some way now up to now we've dealt with these expression trees from link but at some point we want to start to get close to something that resembled an actual sequel query this is our destination so we're gonna transform this into a model of a sequel query no longer wear order by in the link sense now we want to talk about an actual sequel kind of thing so if we look at this tree sorry this query right here so there's aware of filter and order by a taking a select the expression tree representation without any sort of change is this looks a bit weird maybe but you see the topmost method is this select it has two parameters take these are all extension methods right so this operates on the output of take and applies this selector to it this projection take gets the output of order by and so on and so forth until the DB set which represents our table this is the link expression for you but what we want at some point is something that represents the sequel that we're going to actually do which is something more like this this is a select expression that has nothing to do with this select even though the name is the same it has a projection several projections name an age it has a predicate it has you know the take and it has an order by so this is now starting to be something a different model of expression trees that's is starting to resemble what we actually want to generate it's worth mentioning here that we still are talking about expressions and expression trees but now we're talking about sequel specific ones a select expression that I just showed that's a sequel expression another good example of this if you look at this kind of query here what we are asking is to get me all the customers whose ID is either one two or three where this array contains this ID we can translate this in sequel to this kind of thing where CID in one two three sequel has this kind of expression which obviously c-sharp does not would be nice to have it one day behind the scenes this is actually represented via something called a sequel in expression what this shows you is that expression trees can also have custom types which you can invent yourselves you can actually mix them up it's a very kind of interesting model which you can actually do yeah I'll just I'll move on a little bit this is also by the way the point where providers different database providers or drivers and user translations kick in so if you want to check whether the name the length of a column is greater than 5 on sequel server it's going to be one function on Postgres it's going to be called a completely different function and you can also yourself define a function and say what it's going to translate into as a sequel fragment or as a sequel expression all right we're pretty much at the end of the line of what this looks like so there's a bit more massaging and post-processing of this this thing okay of this this expression tree that's been abused and changed and modified in various ways at the end of this line what we have is yet another visitor the query that yes the query sequel generating visitor this is just another visitor that walks over our expression tree which now resembles an actual sequel query and generate just has a string builder and generates an actual sequel query so this is the thing that generates sequel and there's another one that will generate the materializer or the shaper which means it's basically generating code via expression trees for something that's gonna take the results coming back from that sequel and materializing them into your actual objects alright quick schema so this is we're starting to build out what this looks like the expression tree goes in it has a pre-processing phase now here we're still in link kind of expression trees then boom we have a translation point now we've shifted into a sequel model expression tree then we have a bit of post-processing and let's say the two outputs are the sequel and the shaper this is obviously a simplification of what actually goes on right this is it's a 45-minute talk okay let's let's go and talk a little bit about perf I don't have that much time we'll see where we can get with this so the first important thing I'm gonna do a small segue before I come back to this the first thing that you need to know is that we have a very two completely different types of queries they look very very similar the first one we are getting all the posts where the title contains narrows the string Narrows is embedded as a literal as a constant inside that query in contrast the second query which looks the same has a variable which is coming from the outside we translate this in very different ways the first one is gonna have you know it's not a surprise it's gonna have the literal actually embedded inside whereas the second one has this weird thing which is a parameter of placeholder again for people who are not familiar with sequel typically you're gonna have a sequel and there's gonna be a parameter inside and you're gonna run the same sequel with different parameter values this is usually the way things work why is this so important because this allows you to have one idea of a query where there's a placeholder you've identified a place where the actual values are gonna change all the time it's important for AF core it's important because there's only gonna be one query behind the scenes it's important for your database which has a query cache prepared statements this is a major major part of performance it's very important now let's go back to this pipeline this is the actual optimization which I wanted to show so compilation this whole thing that I just showed you with this big string of visitors and all that is extremely slow if we actually run these visitors this whole pipeline with all these visitors on every single query that thing would grind to a halt it's super long no way we can run this every time so what do we do whenever we have a performance problem always the same answer we have to apply some form of caching right when it's not you know some intrinsic source um low level thing basically we just have to cache things like we do it everywhere else so the inputs whenever we cache we ask what's the input what's the output the input is an expression tree and the output once again sequel in shaper right so how do we do this thing when we get a query like this okay the first time that we get it so we have this expression tree we go through the entire pipeline and we do this whole compilation process that's very expensive and at the end we're gonna actually cache this expression tree the next time that we get an expression tree we're gonna do a structural comparison which means we're gonna actually walk this tree and compare each node so we're gonna make sure that they are exactly the same that's very very important it's kind of a bummer because it's very slow right I mean it seems very slow we'll talk if I have some time we'll talk about you know what we can do with this but this is the basic thing that we're gonna do there's one one more step that's kind of important I'm just gonna go really quickly over this in situations where we have a parameterised query we first have to walk over this thing find where there's a parameter because the parameter is going to be different each time and replace it with a placeholder so we're going to punch holes so to speak inside our tree and at that point after we've parameterised we've removed these parameters then we have trees which are structurally identical and can be compared for caching I hope that point is clear if not it's not the end of the world it's not that important and what we have now for the query pipeline is something like this we have the expression tree going through parameter extraction so this we go over the thing we find those parameters and we punch those holes we extract them out at that point we're gonna check is it cached the first time it's not cached so we're gonna go into what's called compilation this is the compiler this is really the EF core compiler the outputs from this whole thing after pre-processing translation close processing is sequel and a shaper as always the next time that the same structural query comes in maybe with a different parameter it's gonna go through expression tree parameter extraction we're gonna ask is this thing cached and the answer is gonna be yes this time okay so we've bypassed this whole thing this is important because it allows us to maintain a compilation pipeline we don't have to worry about speed that much in the compilation pipeline this is no longer our hot path we don't have to be crazy you know about making it run fast because we have a caching layer before it anyway however if we're really like hyper minded people which we are cuz we're on Metis every time we still have to do the parameter extraction even if it's cached we still have to do parameter extraction we obviously there's a dictionary for the cache so we have to do a hash code calculation over the entire tree which is super non cheap and then we when we actually do the structural comparison do well till we find that tree that matches or sorry or we don't find it so there's an answer to this as well okay and if we don't want to do this if we're super hyper minded we can use a feature called compile queries in order to show how this works I first have to show how we actually run a query without compile queries so this is just a very small thing anny of core in general there's something called a context you can think as the car of the context is basically kind of like your database connection to vastly simplify things so you have to use your context when you actually want to run your query your query runs you see this link query runs on the context in the end okay so this is a non compiled query this is the thing that's gonna go through the whole pipeline maybe through the caching maybe not but you can also use the compiled query feature which does something completely different I'm gonna use a special API called AF compiled query okay I'm gonna put the same lambda inside here and I'm gonna get back a funk that accepts a context an instance of a context say again a database connection a parameter and the thing that it returns just look at how I use it and it's gonna become abundantly clear I have a thing a ready-made function that I simply call the parameter the sorry the context as a parameter I pass the actual ID of the post that I want and I get that thing what we've just done is bypassed the whole beginning of the query pipeline which is parameter extraction cashing blah blah blah blah blah so this is basically kind of like the ultimate performance you can get to I hope that's that's kind of clear at this point I'm just gonna say some closing words this is I pretty much managed to do what I wanted so what did we learn about this from this kind of talk so in Oran is obviously a very complex thing I'm amazed I I've been doing this like inside the EF core pipeline itself I've been doing it for several months I've been working on databases and and so on for a long time but actually doing this stuff it's incredibly complex just be aware when you're cursing at your Orang because it's not doing exactly what you want everybody hates RMS just be aware how difficult the job is it really is quite difficult it's always important once again going back to that quote be aware of what's happening one layer underneath you I hope that I managed to get some concepts across with regards to perf and caching to understand a little bit how he of core works there are some mistakes which make it extremely inefficient so it's worth knowing about this and always always look at the SQL being generated under the hood never trust anything definitely not in orem okay one thing I have to say so the actual heroes of this story is first of all Schmitt Patel which is the architect of the new career pipeline for a 3.0 again it got rewritten a huge undertaking and it's it's really quite mind-boggling that we managed to do this so this is largely his work and also Mauricio Markovski who is also another part of the query pipe the guys working on the precogs line also a Polish guy so deserve as I mention and of course the entire EF team everybody kind of interacts with with pipeline because it's such a central component and with that thank you very much for listening thanks this if you're interested in the presentation you can go to my blog and it's it's there you can watch it so yeah it's it's basically already online we've got a time for one question from the audience amazing okay here raise your hand if you want to ask some more he will be with doing the break of course also after well I have a question about a caching compiled queries because I think I've heard I think it's a new feature but I'm not sure if it's really if I heard correctly that there is a way of caching queries not in run time but in sort of the build phase of a project and save the cache output in some form of a text file or something like that so that it's not compiled during runtime is it until frame work or feature no we'd love to have this this is one of the things we are dreaming about actually is to take the whole compilation thing offline to do it ahead of basically we're talking about compilation right this is a JIT what you've just seen so you run your program and when the query is needed it gets cheated and what you've just described is an AO T approach to an ORM to a query compiler it's amazing how these concepts map well to the world of compilers jets and IOT and all this to answer your question if core doesn't have this this feature at all at the moment we really hope that we do have it having said that yes it means that like with general like the whole point of JIT it's the same point the problem is startup so you're doing this only once it's important to understand if your program is well written you have a startup time that's gonna be long because you're compiling all your queries and from that time the stable state again this is the way we talk also in the jet world the stables we've reached stable state and we don't have to compile anything anymore unless you've done something like dynamically generate queries which are different every single time and then your compile all the time and that's a very very faulty design so I'd love to there's still a lot of value in what you asked for and it's not there yet ladies and gentlemen shy Perugia is cute thank you very much [Applause]
Info
Channel: Dotnetos
Views: 5,736
Rating: 4.9191918 out of 5
Keywords: .net conference, .net performance, dotNetconference, dotNetprogramming, developing, dotNetperformance, codeperformance, coding, Dotnetos, DotnetosConference, Shay, Rojansky
Id: r69ZxXgOIK4
Channel Id: undefined
Length: 49min 24sec (2964 seconds)
Published: Mon Dec 23 2019
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.