DjangoCon 2019 - Prefetching for Fun and Profit by Mike Hansen

Video Statistics and Information

Captions
[Music] So, a little bit about myself: I'm a software engineer at Rover. For those of you who are not familiar with Rover, it's an online marketplace for pet services like dog walking, dog boarding, grooming, things like that. Rover actually got started about eight years ago at a Startup Weekend as a Django project, so we've been with Django for many, many years, and it has continued to serve us well.

Today I'm going to be talking about prefetching. Over the last couple of years there have been a number of times where we've encountered things in our code that were maybe not necessarily the best, but by digging down into Django's internals and learning how the prefetch_related implementation works, we were able to come up with, I think, good solutions to the problems we were having.

To give you a brief overview of what we're going to be talking about today: in the very first section we'll talk about N+1 query problems. This is the main thing that prefetch_related is trying to solve, so it's good to make sure we're familiar with it and know what we're trying to avoid.

The next section is a prefetching-under-the-hood section. The things I hope to cover here are things which (a) are not covered by the Django documentation and (b) are focused on where Django's prefetch_related code interacts with external code. These are the types of things you'd want to know about if you were designing some code to interact with Django's prefetch_related system. One thing I want to say is that this section will probably be fairly brisk, and it is not necessarily my intention for you to internalize every line of code that we go over. The main thing I want to accomplish there is to call out what the main moving parts of prefetch_related are and how they interact, so even if you don't quite understand
a section of code, or you forgot what I covered earlier, that's not a big deal.

Then, knowing how prefetch_related works, I'm going to go over a series of what I'm calling case studies. These are scenarios where what Django provides out of the box isn't necessarily a good solution, and with a little bit of work we can use Django's internals to come up with something better.

That being said, talks are for the audience, so I wanted to cover at the beginning what I want you to get out of this talk, because I think knowing that may make the talk more effective for you. Given that this is a talk about prefetching, I want you to come away with an understanding of the key points of how Django implements prefetching. I would also like you to think about the code that you work with on a day-to-day basis and what problems you've encountered with respect to data fetching or N+1 query problems. And finally, I'd like you to see how you can use knowledge of Django's internals to improve your own code.

More generally, I think one way we can grow as engineers and become better software developers is to dig into the libraries or frameworks that we depend on every day, especially if we encounter something that isn't behaving as we expect. That's a great time to say, hey, I'm going to set aside some time to really dig in and see what's going on there, even though it may be quicker to check Stack Overflow and find some quick solution. Being able to dive down into the code that you work with is a great way to become a better software engineer.

So, first thing, like I said, I'll talk about the N+1 query problem. What is this? The N+1 query problem is what they're calling a data access anti-pattern, and this
commonly occurs when using ORMs. How it works is: an initial database query is done and it fetches multiple rows of data. This is the +1 in the N+1 query problem. Then, for each of those rows, an additional query or queries are done to fetch more data related to that particular row, and these are the N queries in the N+1 problem.

For example, if in Django we run a query to fetch a list of 100 dogs, and then for each of those dogs we want to display the name of their favorite toy, then if we do it naively we may get into a situation where we run a hundred queries to fetch all of that data. If each database query takes, say, 10 milliseconds, then you're already spending one second of the request-response cycle just executing database queries. That is what we are trying to avoid.

So what do N+1 query problems look like? I'm going to go over a couple of examples of how these commonly come up in our code. Let's say we have a Django template. Passed in as context to the template, we'll say dogs is equal to all of our dog objects. When we iterate over this queryset, Django calls the database and fetches all of the dog objects. Then, as we go through and print out their name and the name of their favorite toy, when we access dog.favorite_toy, behind the scenes Django does an additional query. So the top one is our +1 in the N+1 problem, and dog.favorite_toy gives us the N different queries.

Another example, where this comes up a lot for us in particular, is using Django REST framework, which many of you may have used before. This is a pretty simple serializer; for right now we're just serializing the ID of a dog. Then we have a ListView that uses that serializer, and for the queryset we say: get all of the dogs that belong to a particular owner, and
great, so this doesn't have an N+1 query problem. Here we have our +1 query, which fetches all of the dogs, but our dog serializer doesn't do any additional queries, so we're in the clear. But let's say a month later you say, hey, in our dog serializer we want to know the name of that dog's favorite toy, and so we add something like this. Here, when you specify the source, we're saying that's the favorite toy's name, and behind the scenes Django will do a query there. With this setup you run into an N+1 query problem. Some of these are harder to see initially, because maybe you define your views in different files from your serializers, and so when you go and change the dog serializer you don't necessarily know that you have to change all of the views that use it to make sure they fetch the dog's favorite toy.

Cool, so how does Django solve this problem? With prefetch_related. Instead of passing Dog.objects.all() in to the context, we can call prefetch_related and say, hey, we want to prefetch favorite_toy. Then when we iterate through all of the dogs and print their favorite toy's name, Django doesn't do an additional query for each dog.

So how does that work? What Django does is one additional query for each prefetch that you specify; in this example it was the favorite toy. Behind the scenes it fetches all of the dogs, then iterates through that list, collects the IDs of their favorite toys, and then executes a query something like this, where you select all of the toys whose ID is in the list of favorite toy IDs that you just collected.

Great, so how does Django do this? What is it actually doing under the hood? That's what we'll talk about next.

We'll start at the beginning, with how we specify prefetches. Every time prefetch_related is called on a
queryset, Django returns a new queryset. There's an attribute on the queryset called _prefetch_related_lookups, and basically Django takes everything that you passed into prefetch_related and appends it to the end of that. For example, if we define our queryset as Dog.objects.prefetch_related('favorite_toy'), we can look at _prefetch_related_lookups and see that favorite_toy is there. If we then go and add an additional prefetch_related lookup, we can see that the favorite toy's manufacturer is now part of that _prefetch_related_lookups attribute. So that's where Django stores the prefetches that you want on a given queryset.

Then, when the queryset gets evaluated, say you iterate over all of the values, somewhere along the line _fetch_all is called. It's an internal method. The first part sets _result_cache; if we were iterating over all of the dogs, _result_cache would be a list of dog instances. After it has fetched all of the dogs, it checks whether you've specified any prefetch_related lookups, and if you have, and it hasn't run the prefetching process already, it calls self._prefetch_related_objects(). If we trace that through, this internal method calls prefetch_related_objects and passes in the result cache, which in our case is the list of dog instances that we've fetched, along with the prefetch_related lookups that we've specified along the way.

We'll continue to drill down and take a look at prefetch_related_objects. This is the primary function that Django uses to perform prefetching. You can find this function in django.db.models.query. The slides are available afterwards, and they have GitHub links to all the places in the source code where these things are happening. This function handles a lot
of the bookkeeping during the prefetch_related process. For example, if you have favorite_toy__manufacturer, the prefetch_related_objects function keeps track of: okay, these are all the favorite toys you've fetched, and now you have to get all the manufacturers for those. We're not going to dig too much into that, but one thing that prefetch_related_objects does do is call this get_prefetcher function, and the primary purpose of this function is to get a prefetcher.

That makes you wonder: what is a prefetcher? Well, a prefetcher is just any object which defines a get_prefetch_queryset method. That's all. For example, let's say we have our Dog class and we access favorite_toy from the class, as opposed to from an instance of Dog. Then we get something Django calls a ForwardOneToOneDescriptor. Okay, that's some object, but you can look and see that the object defines a get_prefetch_queryset method, so Dog.favorite_toy is a prefetcher. If there's one thing to take away from this section, it's this get_prefetch_queryset method; it is the most important thing in defining how prefetching is going to work. If you grep through Django's codebase for it, you can find all of the places where Django defines some prefetching behavior.

When talking about prefetch_related you'll see these things called descriptors coming up, and it's kind of hard to work on defining new prefetchers without knowing about descriptors. A descriptor, for our purposes, is an object which implements __get__; there are a couple of other methods that an object can implement to become a descriptor. What they do is customize what's returned when the object is accessed as an attribute on a class or instance. That's kind of a
lot to parse, but if we use Django, we're using descriptors all of the time. Descriptors are what make Django's ORM work. For example, when we have an instance of the Dog class and we access favorite_toy, we get a Toy model instance. In order to get that, Django can go behind the scenes, fetch it from the database, instantiate it, and return it. That's different from what happened when we accessed favorite_toy from the class, and how we get this difference in behavior is through the descriptor protocol. Descriptors are just everywhere in Python. For example, this is how methods work in Python: if you have some function and you call __get__ and pass in an object, what you get is a bound method. That's why accessing a method from the class gives you a different type of thing than accessing that same attribute from an instance of the class.

Okay, going back to prefetch_related_objects. Now we have a prefetcher; one of the jobs of prefetch_related_objects is to find some object which implements get_prefetch_queryset. prefetch_related_objects then calls a helper function called prefetch_one_level, and this is really where all of the work that we care about for the purposes of this talk is done. This function basically has this outline. What we're passed is a list of instances; in our example this would be a list of dog instances. Then we get a prefetcher, which is the thing we just talked about, and which implements this get_prefetch_queryset method. The first thing prefetch_one_level does is call that method on the prefetcher. This is the interface that get_prefetch_queryset has, and it's basically where all the interesting stuff happens with respect to prefetching. What get_prefetch_queryset is supposed to return is a six-tuple with these six pieces of
information, so let's talk about what they are. The first thing in the tuple is basically an iterable of related objects. When we are doing Dog.objects.prefetch_related('favorite_toy'), this rel_qs will be a queryset of all of those favorite toys.

The next two things in the tuple are functions which take one argument. I like to think of them as returning something like a join value: some value which is used to associate the instances, which are the list of dogs, with the related objects, which are the toys. The first one takes one of the related objects, a toy, and returns the join value. Here our join value is going to be the primary key of the toy, so given a toy we just return its ID. The next one takes one of the instances, in this case a dog, and returns the primary key of the favorite toy. Using these values is how Django knows, oh, I should associate this toy with this dog. It's based on these join values.

The next three things that are returned basically tell prefetch_one_level, or the prefetch_related subsystem: now that I know how to associate a toy with a dog, where do I stick that data? The first one is whether or not there is a single toy object associated with a dog. In this case each dog has one favorite toy, so single is True. The next is a cache name. This is a name related to the prefetcher, and we'll see how it gets used when we take these values and store them on the instance. Finally there is a boolean called is_descriptor. This applies in the case where single is True; we will see how it gets used, but it just changes the place where the related object is stored on the instance.

Okay, so this is what get_prefetch_queryset returns, and it defines really the interesting behavior for prefetch_related. So we'll go through, later
in the talk, and see lots of other examples of these six-tuples and how you can get different behavior in Django by customizing the values returned there.

Okay, so we've called get_prefetch_queryset and we get this six-tuple. What prefetch_one_level does next is go through all of the related objects and figure out which instance they're related to. The very first thing we do is get all_related_objects, where we just call list on the first thing that was returned from get_prefetch_queryset. This is basically the only thing that the prefetch_one_level code does with this rel_qs variable, so it doesn't even have to be a queryset; it just has to be some iterable. Django doesn't rely on specific properties of the thing that's returned there. It really just calls list on it and goes from there. In our example this rel_qs was going to be all of those favorite toys: we took the toys and filtered them such that their ID was in the list of favorite toy IDs that we collected.

The next thing we do is define a cache dictionary, iterate over all of the related objects that we've gotten, and get this join value. Remember, rel_obj_attr was a function which took in one of these related objects, a toy, and produced the join value, and in this particular case that was the primary key of the toy. Once we have that, we use that value as a key in the cache dictionary, and the value associated with that key is going to be a list of all of the related objects, the toys, that have that join value. Because in this instance the toy ID is a primary key, you're not going to have lists of more than one object, because there's only one toy per ID.

Okay, so we've gone through and
we've built up this cache, which maps our join values to lists of related objects. The next thing that prefetch_one_level does is store the objects on the instances. Once we've fetched all the related objects, like I said, we need to store them somewhere. So it iterates over all of our instances; this is the list of dogs again, which we fetched in our initial query. Here instance_attr is the function we talked about earlier, which takes a dog and produces the join value; like I said, this is the primary key of the toy. What we do is get this variable called vals: we look in our cache to see if the join value is in there. If it is, we get that list of related objects; if it's not, we get the empty list. So now for every object we have a list of related objects: for every dog we have a list of favorite toys.

Here's where those final three values from the six-tuple come into play. If single=True was returned from get_prefetch_queryset, then we fall into a conditional block of code that looks like this. We know there's going to be at most one related object, so we get that as a singular val; if there are no objects in the list, we get None. Then we fall into three cases. The first one is if, when you specified the prefetch, you specified a to_attr. That is an optional argument where you say, hey, once you fetch this, stick it in this attribute. So we call setattr on our object with the to_attr and our val, which is a toy. Otherwise, if the is_descriptor boolean returned was True, then you basically set the cache_name attribute on the object to be this singular value. Finally, if neither of those is the case, there's this fields_cache dictionary which lives on the object's _state, and we put the value in that dictionary under the key cache_name.

Cool, so this is what happens if you return
single=True: you can control where prefetch_related puts your related object. If single=False was returned from get_prefetch_queryset, then we fall under two cases. Again, if you've specified a to_attr when you specified the prefetch, then you set that attribute on the object to be the list of related objects. Otherwise we have to do a little more work, but at the end of the day you can see on the final line that you have a _prefetched_objects_cache attribute on our instance, like a dog, and in it you have the cache name as a key, and you set that to be a queryset. Two lines above, on that queryset, you've set the result cache, which says, hey, I've already fetched this data and these are the values.

Cool, so that was a lot, but to go over everything that was done in prefetch_one_level: we called get_prefetch_queryset and got this big six-tuple which defines what it's going to do. We took that data and went through and associated all of the instances with their related objects. And finally we stored those related objects on the instances, and how we did that depended on the values returned from get_prefetch_queryset.

Great, so once those are there, how does Django make use of them? This depends on which particular prefetcher is used, but in the case where we have Dog.favorite_toy, this descriptor will call self.field.get_cached_value, and this looks in the fields_cache dictionary to see whether or not the favorite toy has already been fetched. So the code in this descriptor knows where the prefetch_related code is going to put the related object. Similarly, managers on instances will check the _prefetched_objects_cache within their get_queryset method. So if you were to do something like toy.dog_set.all(), which is all of the dogs that have this particular toy as their favorite, that
will check the toy's _prefetched_objects_cache to see if a queryset that already has all the dogs has been fetched.

So those were the main landmarks in the prefetching process. When you specify the prefetch lookups, they get stored in _prefetch_related_lookups on the queryset. When it's evaluated, it calls prefetch_related_objects. This function goes through and finds a prefetcher, which is one of these things that defines get_prefetch_queryset. Then prefetch_one_level takes that data and joins the initial objects you have with the related objects that it just fetched, combines those together, and finishes its job. And then there's additional code, say in these descriptors or these managers, that knows the right places to check to get access to these prefetched values.

Okay, so that was a bit of a whirlwind tour, but now I want to talk about these case studies and how we can actually make use of this information in our code.

The very first one: suppose we store our user model in a different database than the dog model. Here we have in database one our Django User model, and over in database two we have our Dog model, which has a user ID field that is a positive integer. We can't use a foreign key, because you can't have foreign keys across databases; Django doesn't like that. But we would still like to be able to use this in a way that feels like it's actually a foreign key. In particular, we'd like to be able to do things like: for all of the dogs in user.dogs.all(), print the dog and the dog's favorite toy's name. And because we want to be able to do things like that, we also want to be able to do things like User.objects.prefetch_related and fetch all of the dogs and all of their favorite toys. Great, so this is our scenario.
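To make the goal of this scenario concrete, here is a minimal sketch that uses only Python's sqlite3 module rather than Django. Two separate in-memory databases stand in for database one and database two; the table names, column names and sample rows are assumptions made up for illustration and are not from the talk's slides. It shows the batching that a custom prefetcher would need to reproduce: collect the user IDs from the first database, run a single IN query against the second, and group the dogs by user ID.

```python
import sqlite3
from collections import defaultdict

# Two independent in-memory databases stand in for "database one" and "database two".
users_db = sqlite3.connect(":memory:")
dogs_db = sqlite3.connect(":memory:")

users_db.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT)")
users_db.executemany(
    "INSERT INTO user VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")]
)

# No foreign key is possible across databases, so user_id is a plain integer column.
dogs_db.execute("CREATE TABLE dog (id INTEGER PRIMARY KEY, name TEXT, user_id INTEGER)")
dogs_db.executemany(
    "INSERT INTO dog VALUES (?, ?, ?)",
    [(10, "Rex", 1), (11, "Fido", 1), (12, "Spot", 2)],
)


def dogs_by_user(user_ids):
    """Run ONE query against the dogs database and group dog names by user_id."""
    placeholders = ", ".join("?" for _ in user_ids)
    rows = dogs_db.execute(
        f"SELECT name, user_id FROM dog WHERE user_id IN ({placeholders}) ORDER BY id",
        list(user_ids),
    ).fetchall()
    grouped = defaultdict(list)
    for dog_name, user_id in rows:
        grouped[user_id].append(dog_name)
    return grouped


# Query 1: all users. Query 2: all of their dogs, in a single batch.
users = users_db.execute("SELECT id, name FROM user ORDER BY id").fetchall()
grouped = dogs_by_user([user_id for user_id, _ in users])
dogs_for_user = {name: grouped.get(user_id, []) for user_id, name in users}
print(dogs_for_user)
```

However many users there are, the sketch issues two queries in total, which is the behavior the case study wants from User.objects.prefetch_related.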
So how do we do this? There are a couple of steps. The first thing we have to do is define a manager which implements this get_prefetch_queryset; this is our prefetcher. This is the thing you'll get when you access user.dogs: you'll get one of these managers, so that you can call .count() or .all() and do all that with it. The two methods we'll be interested in are, like I said, get_prefetch_queryset, to define the prefetching behavior, and also get_queryset, so that once the prefetching is done the dogs manager actually makes use of it.

The next piece we have in this puzzle is a descriptor. Like I said, this implements the __get__ method. When it is accessed from the class, so capital-U User.dogs, it returns the descriptor, but if it's accessed from a user instance, like lowercase user.dogs, it returns one of these dog managers, and we pass in that instance so it knows which particular user you're talking about. Then finally we set dogs to be our dogs descriptor on the User.

Going back and filling in the missing pieces, we have to define this get_prefetch_queryset. It takes in a list of instances, in this case a list of users, and an optional queryset. The main thing we do is go through all of the users and collect their IDs into a list. So owner_ids is a list of integers, and we return the six-tuple. The very first thing is going to be basically Dog.objects.filter where owner ID is in this list of integers. We don't have to have a foreign key there: this query executes on database two, and you're just getting all of the dogs that have their owner ID in this list of integers. Django can do that, no problem.

Now we have to come up with the join values, and we're basically going to join on the primary key of the user. So when we get a related object, like a dog,
we return the owner ID, and when we get a user we return the user's ID. A person can have multiple dogs, so we return False for single. We'll cache it under the name dogs, and is_descriptor doesn't really apply when single is False, so we return False here. This is basically everything you have to specify to the prefetch_related system in order to get the things we want to work.

Finally, we also need to implement this get_queryset method. The first thing it tries to do is access the dogs key in the user's _prefetched_objects_cache attribute. When we perform the prefetching we get a list of dogs, and for each user the prefetch_related system puts it under the dogs key in their _prefetched_objects_cache attribute. If it's not there, then we apply the default behavior of the manager.

With those things in place, you can do things like this, where this is only going to do three queries: one to fetch the users, one to fetch those users' dogs, and one to fetch those dogs' favorite toys. This looks exactly like it would if the dog objects and the user objects were living in the same database and you had a foreign key.

The next scenario I want to look at is that there's really nothing special about the integers we used as a join key in the previous example. You can basically join any two models as long as you can get values that are equal between them. Specifically, suppose we define a dog's classmates as all of the dogs that were born in the same year as that dog. In code like this, maybe we want to iterate through all of a dog's classmates and print each classmate's favorite toy's name. Here, if you aren't able to do any prefetching, you get an N+1 query problem. I'm going to skip all of the parts with the manager and
descriptor from before and just go to the main piece, the get_prefetch_queryset method. How does this work? We get a list of dogs, and then we collect all of the years of their birthdates, so years will be a list of integers. Then we define the data. The first bit will be all of the dogs which have a birthdate in one of the years that we've collected. And how are we going to associate a dog with its classmates? Well, we're just going to use the birthdate's year. Because the instances and the related objects are both dogs, we use the same function to get both join values. Again, you can have multiple classmates, we cache it under the name classmates, and is_descriptor doesn't apply. So other than the boilerplate to get the managers in place, this is really all you need to do to be able to prefetch all of the dogs that were born in the same year as the particular dog you're interested in.

Okay, the next one I want to look at. We recently had to add support for translations of Django model fields. Rover expanded into Europe, and we needed to support a lot more languages. We have some models with fields which store English text, and we want to be able to present that text to users in their preferred language. We went with an approach that uses one table to store all of the translations. The main things in this table are a generic foreign key to the object that we're translating, a language code to specify which language it is, a field name to specify which field on the content object we're translating, and then value, which is the translation itself.

One particular example of that in our system is a dog breed. A dog breed has a name, and we want to be able to translate that. We can use the reverse relationship of the generic foreign key, using GenericRelation, and, say, we could
prefetch all of those, but that has the disadvantage that you're prefetching the translations for every language even though you may only need one. So we'd like to have something called active translations, where we only prefetch the subset of translations which correspond to the active language in the request-response cycle. If a user's preferred language is French, we only want to fetch the French translations. Additionally, since a large percentage of our traffic uses English as its primary language, we don't want to take this additional query overhead if we're not actually going to use any of the translations.

Cool, so how do we get prefetch_related to do this? In the relevant get_prefetch_queryset method, we check whether the active language is equal to the default language, which we assume is the language that the values in the table are stored in. If it is, we bail out early and return an empty list. It doesn't really matter what we return for our join values, because there's nothing to join. In this case we can prefetch the active translations, but Django won't do any database queries; we don't have to go to the database and come back. Additionally, for the manager, we get the queryset and then just have to filter it by the active language during the request-response cycle. So we're able to transparently get field translations without having to do a whole lot of extra work and extra queries.

The final one I want to talk about is mainly to showcase the flexibility within the prefetch_related system, as opposed to being a pattern that I would recommend in your code. Let's say there is a third-party dog date service which provides an HTTP API that lets
you fetch possible play dates for your dog. A request looks like this: we make a GET request and pass our dog ID as a query parameter, and we get back basically a list of dictionaries, with the dog ID as a key in there and then playdates, which is our list of URLs. Their HTTP API also happens to support multiple dogs per call, so if you pass multiple dog IDs you get multiple dictionaries back with their IDs and their play dates.

Suppose we have code which looks like this. In our Dog model we have a dog dates external ID: for the dog in our system, what is the ID in their system? Then we have an available play dates property which, when you access it, makes a GET request to this URL and returns the playdates field from the response. So now you have something like an N+1 request problem. You're not doing queries to a database, you're making HTTP requests, and if you wanted to go through all of a user's dogs and print the available play dates, you'd get one HTTP call per dog.

This looks like something we can solve with prefetch_related, and we can actually do that. In the get_prefetch_queryset here, we get passed a list of dog instances, so we can build up one of these URLs that has all of the dog IDs as query parameters. The first thing we return is the result of making an HTTP call and getting the JSON response, which is a list of dictionaries. Now we have to find some way to associate that list of dictionaries with the dogs, and we'll use the dog ID in there. The first function takes one of these dictionaries, gets the dog ID, and turns it into an integer, and the second one takes one of our dogs and returns the dog dates external ID. We know there's exactly one of these dictionaries per dog, so single is True, and we'll store it in, say, _available_playdates.
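The batching idea in this case study can be sketched in plain Python, without Django or any real HTTP traffic. Everything named here is an assumption for illustration: fake_dog_dates_api is a made-up stand-in for the third-party service, and the attribute names dog_dates_external_id, _available_playdates and available_playdates are guesses at how the spoken names were spelled on the slides. The sketch makes one batched call for all dogs, keys the response dictionaries by integer dog ID, and stores the matching dictionary on each dog.

```python
class Dog:
    def __init__(self, name, dog_dates_external_id):
        self.name = name
        self.dog_dates_external_id = dog_dates_external_id
        self._available_playdates = None  # filled in by prefetch_playdates()

    @property
    def available_playdates(self):
        # Read from the prefetched dictionary; an empty list if nothing was prefetched.
        cached = self._available_playdates
        return cached["playdates"] if cached else []


api_calls = []  # records every simulated request so the batching is visible


def fake_dog_dates_api(dog_ids):
    """Hypothetical stand-in for the batched API: one dictionary per requested dog."""
    api_calls.append(list(dog_ids))
    return [
        {"dog_id": str(dog_id), "playdates": [f"https://example.invalid/playdates/{dog_id}"]}
        for dog_id in dog_ids
    ]


def prefetch_playdates(dogs):
    # One request for all dogs instead of one request per dog.
    response = fake_dog_dates_api([dog.dog_dates_external_id for dog in dogs])
    # Join value for a response dictionary: its dog ID converted to an integer.
    by_dog_id = {int(item["dog_id"]): item for item in response}
    for dog in dogs:
        # Join value for a dog: its external ID. At most one dictionary matches (single).
        dog._available_playdates = by_dog_id.get(dog.dog_dates_external_id)


dogs = [Dog("Rex", 7), Dog("Fido", 9)]
prefetch_playdates(dogs)
for dog in dogs:
    print(dog.name, dog.available_playdates)
print("requests made:", len(api_calls))
```

In a real prefetcher, the response list and the two join functions would be returned from get_prefetch_queryset as part of the six-tuple, and Django's own prefetch code would do the matching and storing.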
Now you can do something like this, where you do Dog.objects.prefetch_related('available_playdates'), and behind the scenes that will make a batch request to this API, and you no longer have the N+1 problem that you would have if you didn't do something like this.

So that was mostly what I wanted to go over today. This week I'm going to be working on getting some of the boilerplate code that you need to write for your own custom prefetchers into a third-party package called django-prefetch-utils. I'll be working on that this week at the sprints, and I'll be around; if you're interested in things related to data fetching or prefetch_related, anything like that, I'm happy to talk about it. So thank you.

[Applause] [Music]
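As a recap of the prefetch_one_level walkthrough from earlier in the talk, here is a simplified plain-Python model of the join-and-store step. It is not Django's source code. Toy, Dog, TOYS and FavoriteToyPrefetcher are made-up stand-ins, a plain fields_cache dictionary replaces the cache Django keeps on each instance, and a list replaces the queryset Django builds when single is False. The six-tuple has the shape described in the talk: related objects, two join-value functions, single, cache name, and is_descriptor.

```python
from collections import defaultdict


class Toy:
    def __init__(self, pk, name):
        self.pk, self.name = pk, name


class Dog:
    def __init__(self, name, favorite_toy_id):
        self.name = name
        self.favorite_toy_id = favorite_toy_id
        self.fields_cache = {}               # stand-in for Django's per-instance field cache
        self._prefetched_objects_cache = {}


TOYS = {1: Toy(1, "ball"), 2: Toy(2, "rope")}  # stand-in for the toy table


class FavoriteToyPrefetcher:
    """A prefetcher is any object that defines get_prefetch_queryset."""

    def get_prefetch_queryset(self, instances, queryset=None):
        toy_ids = {dog.favorite_toy_id for dog in instances}
        rel_qs = [toy for pk, toy in TOYS.items() if pk in toy_ids]  # any iterable works
        return (
            rel_qs,
            lambda toy: toy.pk,               # rel_obj_attr: join value of a related object
            lambda dog: dog.favorite_toy_id,  # instance_attr: join value of an instance
            True,                             # single
            "favorite_toy",                   # cache_name
            False,                            # is_descriptor
        )


def prefetch_one_level(instances, prefetcher, to_attr=None):
    rel_qs, rel_obj_attr, instance_attr, single, cache_name, is_descriptor = (
        prefetcher.get_prefetch_queryset(instances)
    )
    all_related_objects = list(rel_qs)
    rel_obj_cache = defaultdict(list)          # join value -> related objects
    for rel_obj in all_related_objects:
        rel_obj_cache[rel_obj_attr(rel_obj)].append(rel_obj)
    for obj in instances:
        vals = rel_obj_cache.get(instance_attr(obj), [])
        if single:
            val = vals[0] if vals else None
            if to_attr:
                setattr(obj, to_attr, val)
            elif is_descriptor:
                setattr(obj, cache_name, val)
            else:
                obj.fields_cache[cache_name] = val
        elif to_attr:
            setattr(obj, to_attr, vals)
        else:
            # Django stores a queryset with its result cache pre-filled; a list stands in here.
            obj._prefetched_objects_cache[cache_name] = vals
    return all_related_objects


dogs = [Dog("Rex", 1), Dog("Fido", 2), Dog("Spot", 1), Dog("Stray", 99)]
prefetch_one_level(dogs, FavoriteToyPrefetcher())
for dog in dogs:
    toy = dog.fields_cache["favorite_toy"]
    print(dog.name, toy.name if toy else None)
```

Passing to_attr changes only where the result is stored, which mirrors the point from the talk that the last three values of the tuple control placement while the first three control matching.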
Info
Channel: DjangoCon US
Views: 1,776
Keywords: django, djangocon, djangocon us, python, 2019, Prefetching
Id: QYDixnGetTI
Length: 48min 39sec (2919 seconds)
Published: Fri Oct 18 2019