DataLoader and the Problem it solves in GraphQL

Captions
When you first start to learn GraphQL, there's a pretty good chance that your first attempt at creating a GraphQL API will have one big problem that you really shouldn't live with. Now, I bet many of the seasoned developers who start learning GraphQL will recognize the problem I'm referring to, but there's a whole lot of developers who probably won't even notice it, because, well, the problem isn't obvious. This problem won't give you any errors, at least not during development. And honestly, you'll probably end up with this problem because it's the result of an intuitive solution. Unfortunately, the intuitive solution is also a naive solution. So what is this problem I'm referring to? Well, instead of just telling you, how about I show you?

Okay, imagine that you and I are working on a book review app that looks something like this, and we've already created the beginnings of a GraphQL API for this app. Here's a pretty simple GraphQL query that you can use with our server to fetch a set of books, which includes each book's unique ID and title. It also fetches a set of reviews for each book, which includes the star rating and the title left by the person who wrote the review. Now, the GraphQL server that we're about to look at has the problem I've been hinting at, but the problem isn't obvious. I mean, watch: I'll run this query, and guess what, it worked. The server didn't blow up, and we got an array of books, and each individual book includes some book reviews.

Let's go take a look at how this GraphQL server is set up. But before I show you, I've got a challenge for you: I want you to pay close attention over the next few minutes and try to figure out what the problem is with the code I'm about to show you. Okay, to start, I'm going to simplify this GraphQL query by commenting out the reviews field and its sub-selection. When this query gets sent to the server, a resolver function named books gets called, which you see right here. Resolver functions should
generally be thin. In other words, as much as possible, you should avoid putting much logic in them. So is this resolver function thin? Well, yeah, it's pretty thin. It's simply returning the value you get when you call the allBooks function.

Let's go take a look at the allBooks function, which is defined in the book.js file. Okay, there's not a lot going on in the allBooks function. Right here you see a SQL statement that's querying all the columns from the hb schema's book table. Then on this line the query function is called, which will send the SQL statement to the database and return the response from the database. Well, actually, the query function does one other thing just for this tutorial: it's logging the SQL query to the console, which you'll see in a bit.

Now, in a moment we'll run this GraphQL query, but before we do that, I'm going to clear the terminal session that is running the GraphQL server. We'll come back to this terminal in a moment and see what SQL queries get logged when we execute our GraphQL query. Now, back in the browser, I'll run this query, and as you can see, it worked: we see an array of 10 books. Let's head back to the terminal and see what SQL commands were sent to the server. Okay, well, it looks like one SQL query, which is the query we just looked at, was sent to the server, which is probably what you expected.

All right, now let's expand this query to include the reviews. Now, the book ID and title fields you see here are stored in the book table, but the book reviews are stored in a different, related table named review. So to fetch these fields, another SQL query will need to be sent to the database server. Before we run this query, let's take a closer look at what will happen on the server when we include these new fields. When the server receives this modified query, the books resolver gets called again, just like it did the last time. But then, after the list of 10 books is returned from the database, a new resolver function is called for
the reviews field, and of course this resolver has the same name as the field, which is reviews. So how many times do you think this reviews resolver function will get called? Now, keep in mind the reviews field is a child of the book field, so it's going to get called once for each book. In our example, it'll get called ten times, and as you can see here, the parent book object gets passed into this resolver.

Now, this reviews resolver isn't doing much. It's just calling the reviewsByBookId function, passing in the book's ID. If we peek at the reviewsByBookId function, we see that it's querying the hb schema's review table for reviews where the book ID is some value, represented by the $1 you see here. Then on this line you see the params constant, which is set to be an array that contains the passed book ID. When the database is sent this SQL command and this array of parameters, it safely replaces the $1 with the first element of the array, which is the book ID parameter.

Okay, so this probably seems pretty straightforward to you, but do you see a problem with what's happening here? To demonstrate the problem, let me clear the terminal again, and I'll execute the modified GraphQL query. As you can see, it seemed to work: we see the ten books, and each book has around three or four reviews. So let me ask you this: how many SQL queries do you think were sent to the database server? Let's take a look in the terminal and see.

Well, at the top of the terminal we see a query for the book table, just like we saw last time. Then we see a query for the review table, for the reviews left for the book with an ID of 1. Then we see a nearly identical query for the book with an ID of 2, and then another nearly identical query for the book with an ID of 3. If we continued to look at every log, you'd see that ten nearly identical queries were sent to the database server. So in total we're sending eleven queries. So is sending eleven queries a problem? Well, it can be a
real performance killer for a few different reasons. First of all, the ten nearly identical queries consumed ten separate database connections. So should you care that ten database connections were used just for this one part of our GraphQL query? Yep, it really should be a concern. Database connections are a pretty expensive resource, and generally speaking, most small and medium-sized database servers should only have a small number of connections, usually around 10 to 20. Technically you could allow a lot more connections, but it's expensive, and it typically leads to worse database performance than allowing a smaller number of connections and just queueing requests if and when there are no available connections to use.

Now, additionally, every time a query gets sent to the server there's a little bit of overhead. For example, the database query planner has to execute once for every single query, and guess what: often the time it takes to run the query planner is longer than the time it takes to actually run the query.

By the way, the problem we're seeing here has a name that you might have heard of: it's called the N+1 problem. In other words, we're having to execute one query to get a list of books, then we have to execute N additional queries, where N is the number of books returned from the first query, which in our case happens to be 10.

Now imagine if we tweaked this query to include the user field, which, by the way, would need to query the user table. How many additional queries would our database need to execute in this scenario, assuming we're following the same naive approach? Well, for discussion's sake, let's say that we run the book query, which returns ten books. Then we have to run ten more review queries, because there were ten books. And now let's say there are around four reviews per book. That means we need to run four additional user queries for each review query, which would give us a total of 40 user queries. So we'd end up running a total of 51 SQL queries (1 + 10 + 40) for the
simple GraphQL query. Now, it's important to keep in mind that most real GraphQL queries will be much more sophisticated than this one, so as you can probably imagine, the number of database queries your server might end up attempting could quickly get out of hand.

So how do we avoid the N+1 problem that we're experiencing? Well, what if instead of sending these 10 nearly identical queries, we sent one query that looks like this? This single database query will fetch the same review data as these ten queries, but it's only going to use one database connection, and as you can imagine, it's much more efficient. Now the question is, how do we get from the current solution of 10 queries to this greatly improved solution with one query?

This is where the DataLoader library from Facebook comes in handy. DataLoader is a small but brilliant library that allows you to perform per-request batching and caching of queries. Well, that definition probably seems a bit vague, so let me explain with a visual. Right now the reviews resolver is directly and immediately calling the reviewsByBookId function, passing along each book ID. But what if we added something in between these two functions that could be sent the book IDs, but didn't immediately act on them, and instead waited some small period of time, collecting all the book IDs? Then, after the book IDs have been batched together, an array of IDs could get sent to another function, which could create the more efficient database query we're wanting to use. So what is this thing in the middle? Well, you probably guessed: it's the DataLoader.

Okay, I bet some of you are wondering how the DataLoader knows how long to wait to collect all the IDs. Well, all it really needs to do is wait one tick of the JavaScript event loop, which is enough time to collect all the book IDs.

All right, let's try to use the DataLoader. In other words, let's put the DataLoader in between the resolver and the data access layer function that will actually look up
the book reviews. Now, a moment ago I said that the DataLoader does per-request batching and caching, and I want to draw your attention to the first part of that description, the per-request part. What this means is the DataLoader only lives and does its job for a single server request. In other words, when the server receives a new GraphQL query, we'll need to create a new instance of the DataLoader and use it for the duration of the request. But once the response is sent from the server, the DataLoader instance will fall out of scope and be garbage collected.

I bet some of you are thinking, wait a minute, how can the DataLoader do caching if it's so short-lived? Well, you might be falling into a common misunderstanding. It's important not to confuse the caching the DataLoader does with the caching done by something like Redis or Memcached. The DataLoader library isn't a replacement for these types of caches; you'd use the DataLoader in addition to them.

Okay, so how do we use the DataLoader? Well, first of all, I'll install it by typing npm install dataloader. Next, we'll open up the review.js file, and I'll import DataLoader from the dataloader package. You'll need to create an instance of the DataLoader for each new server request by typing new DataLoader and passing it a callback function. Now, to add an ID to the DataLoader instance, you'll call its load method, passing in the ID. Then, after one tick of the JavaScript event loop, all the IDs that were passed into load will be sent to the callback function.

Okay, let's do this. In the review.js file, I'll create a new exported function named reviewsDataLoader, which will simply return a new instance of the DataLoader, and we'll pass it a new callback function named reviewsByBookIds, which doesn't exist yet, but we'll create it right now. So I'll create the function named reviewsByBookIds, which gets passed an array of book IDs. Next, I'll just copy the body of the
existing reviewsByBookId function, and I'll paste it into our new reviewsByBookIds function. Now I'll modify the SQL command, adding a call to the Postgres ANY function around the $1, and then I'll tweak the params array to include the book IDs, not just an individual book ID.

So do you think we need to do anything else with this function? I mean, can we just return the array of reviews that we get from the database from this SQL query? Well, not exactly. Let's say, for discussion's sake, the bookIds parameter contained an array of three book IDs: 1, 2, and 3. Now let's say the server returned four review rows that look like this. If we just returned these four review objects to the DataLoader, how would the DataLoader know which books these four reviews are associated with? I mean, all four reviews could be for one book, or there could be some other distribution of the reviews between the three books, but currently there is no way for the DataLoader to know which review goes with which book. So obviously we'll need to do something so that the DataLoader can know which review belongs to which book.

Here's what we need to do to make this work: we need to map the reviews returned from the database to the input array. Here's what I mean. If the first element of the bookIds array is 1, then we should return an array that contains the reviews for the book with an ID of 1 in the first element, and we should do the same thing for book IDs 2 and 3 as well. Now this new array is sequenced so that the DataLoader can figure out which reviews go with which books.

Okay, so obviously we'll need to do a little bit of data manipulation in this function to properly sequence the array we return. Here's how we'll go about handling this. I'm going to use one of my favorite JavaScript libraries, named Ramda, which is sort of like Lodash or Underscore, but it better supports functional programming concepts. So I'll import two functions named groupBy and map from the Ramda
library. So what does this groupBy function do? Well, let's say you've got an array with four book reviews that looks something like this. I'd like to transform this array, which includes the book ID in the review data, into an object where the properties are the book IDs and the values are the arrays of reviews. Why do I want to do this? You'll see in a moment. Here, Ramda's groupBy function will do this exact data transformation for you. To get from this array to this object, all you need to do is call groupBy, passing in a function that returns the value you'd like to have the resulting object grouped by, which in our case will be the review's book ID. Then you'll just pass in the array to act on, which in our case is the array of reviews.

Okay, let's go use the groupBy function. Instead of returning the promise you get by calling the query function, I'll create a new constant named reviews, and I'll set it equal to the value I get by awaiting the promise returned from calling query. And of course I'll need to make this function an asynchronous function. Next, I'll create a constant named groupedById, and I'll set it to the value I get by calling Ramda's groupBy function. I'll skip the first parameter for a moment, and then I'll pass in the reviews array as the second parameter. The first parameter will be a function that takes an individual review from the reviews array, and I'll have it return the review's book ID.

Next, we'll do the mapping of the input book IDs to their corresponding reviews, as we discussed a moment ago. To do this, I'll return the value you get by calling Ramda's map function. I'll skip the first parameter for a moment, and I'll pass in the array of book IDs to map over as the second parameter. Now, the first parameter should be a function that receives an individual book ID from the bookIds array. What's returned should be the reviews for the passed book ID, which we can get by typing the groupedById constant we created a moment ago. Then, using bracket notation, we'll
access the reviews for the passed book ID. This expression right here is why we used the groupBy function. Okay, that's all we need to do in the new reviewsByBookIds callback function.

All right, we're ready to create and use the DataLoader, but the next question is, where do we use it? In other words, where in our codebase should we call the reviewsDataLoader function? Let's head over to the server.js file and I'll show you. In this simple Node.js server, I'm using the Apollo GraphQL server. We're going to add a new property to this config object named context, which I'll set to be a function. This function will get called every time the server receives a request, and whatever we return from this function will get passed into each resolver function on this server. In other words, this context function will allow us to create a new DataLoader for every request, and it'll send the DataLoader into every resolver function, which is exactly what we need.

So I'll return a new object with a loaders property, which I'll set to be an object with a reviewsLoader property, which I'll set to be a new DataLoader instance. But first I'll import the reviewsDataLoader function from the review file. Then I'll call the reviewsDataLoader function to create a new DataLoader.

Okay, cool. Now all we need to do is head over to our resolver function and use the DataLoader that's stored in this context object. So how do we access the context object in our resolver function? Well, it's one of the provided parameters. The first parameter is the book, and the second parameter contains the field arguments, which we don't actually need in this case, but the third parameter is the context object that we need. Next, I'll just unpack the loaders from the context, then I'll unpack the reviewsLoader from the loaders object. Now I'll get rid of the call to the reviewsByBookId function, and instead we'll return a call to reviewsLoader.load, passing in the book's ID. Okay, that's all we need to do to use the DataLoader. Let's see
how many queries get sent to the database server now. So first I'll clear the terminal, then I'll execute our GraphQL query, and we'll take a look at what was logged in the terminal. All right, last time we ran this GraphQL query, 11 SQL commands were sent to the server, but this time we're only seeing two queries. Nice!

What if the book query here returned a hundred books instead of ten books? How many SQL queries would be sent to the server in this scenario? Well, we'd still only see two queries get sent to the database. The only thing that would change is that this list of IDs would be much bigger, as it would contain 100 book IDs.

I want to show you one more thing by tweaking this GraphQL query. First, I'll copy the entire books field and its sub-selection. Then, just below the books field, I'll create an alias named books2, and I'll paste in what I just copied, the books field and its sub-selection. So how many queries do you think will get sent to the database server in this case? Well, let's try it and see. First I'll clear the terminal, then we'll run this query, and we'll check out what was logged. Okay, we see three queries. The book query was sent twice, but the reviews query, which is using the DataLoader, was only sent once. The reason the reviews query was sent once is that the DataLoader performs per-request caching, so there was no need to query the review table more than once. Nice!

Hey, if you liked this video and you're interested in learning more about GraphQL, you should check out my course, GraphQL for Beginners with JavaScript. It's jam-packed with over 50 videos and more than five hours of content that will get you up and running quickly with GraphQL.
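
For readers following along without the video, the batch function described in the walkthrough can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code shown on screen: the bookId field name, the hb.review table and book_id column in the SQL string, the sample rows, and the in-memory query stub are all hypothetical, and plain JavaScript stands in for the Ramda groupBy and map calls so the snippet runs without any packages installed. It also returns an empty array for a book with no reviews, which the walkthrough does not cover.

```javascript
// Hypothetical sample data standing in for the review table.
const reviewTable = [
  { id: 1, bookId: 1, rating: 5, title: "Loved it" },
  { id: 2, bookId: 3, rating: 4, title: "Solid read" },
  { id: 3, bookId: 1, rating: 3, title: "It was okay" },
  { id: 4, bookId: 3, rating: 2, title: "Not for me" },
];

// Hypothetical stand-in for the tutorial's query(sql, params) helper.
// It logs the statement, then filters the in-memory table instead of
// sending anything to PostgreSQL.
async function query(sql, params) {
  console.log(sql, JSON.stringify(params));
  const bookIds = params[0];
  return reviewTable.filter((review) => bookIds.includes(review.bookId));
}

// Group the rows by book ID, then emit one entry per requested ID in the
// same order as the input array, so position i of the result holds the
// reviews for bookIds[i].
function orderReviewsByBookIds(bookIds, reviews) {
  const groupedById = {};
  for (const review of reviews) {
    if (!groupedById[review.bookId]) {
      groupedById[review.bookId] = [];
    }
    groupedById[review.bookId].push(review);
  }
  // Assumption: a book without reviews gets an empty array.
  return bookIds.map((bookId) => groupedById[bookId] || []);
}

// Batch function: one query for all requested books instead of one per book.
// In the real server a function like this would be handed to the dataloader
// package, for example: new DataLoader(reviewsByBookIds).
async function reviewsByBookIds(bookIds) {
  // Assumed table and column names; adjust to the actual schema.
  const sql = "select * from hb.review where book_id = any($1);";
  const reviews = await query(sql, [bookIds]);
  return orderReviewsByBookIds(bookIds, reviews);
}

// Usage: three book IDs go in, and three arrays of reviews come back
// in matching order.
reviewsByBookIds([1, 2, 3]).then((result) => {
  console.log(JSON.stringify(result));
});
```

Keeping the ordering step in its own pure function makes the contract easy to check in isolation: the returned array must have the same length and order as the array of IDs that was passed in, which is how the loader matches each result to the load call that requested it.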
Info
Channel: knowthen
Views: 22,406
Keywords: JavaScript, GraphQL, DataLoader, N+1 Problem, PostgreSQL
Id: ld2_AS4l19g
Length: 21min 41sec (1301 seconds)
Published: Wed Aug 22 2018