Things every developer absolutely, positively needs to know about database indexing - Kai Sassnowski

Captions
Welcome, everyone. Let me start my timer really quickly. So, welcome everyone, I am super excited to be here. It's been an awesome event so far; I've really been enjoying myself, talking to all sorts of cool people. I'm slightly mad at the previous speaker for setting the bar so high, but anyway, I am here to talk to you about things that I believe every developer absolutely, positively needs to know about database indexing. That is, by the way, a riff on the title of an old blog post by Joel Spolsky, which was called something like "things every developer absolutely, positively needs to know about Unicode"; I just changed the last words because I thought that was cool.

My name is Kai Sassnowski. You can find me at my blog, where I post once every two years, on GitHub, or on Twitter, where I post once every three to four months. I work for an awesome group of people: we are a software development agency in Berlin, and we are hiring, so if you're interested, come talk to me after the talk.

Now, when I interview people, I like to ask the question: what is a database index, and what do we use it for? Most people answer something like this: "Well, an index makes reading faster, it improves query performance." And yes, that is actually correct; that is why we index our databases. The next question is: why do we want fast reads? It turns out slow queries are one of the most common, if not the most common, causes of poor application performance. I believe we have all experienced, at least once in our careers, that one query that, when it runs, just grinds everything to a halt. A slow query can be the difference between a site being super snappy and being literally unusable, and sometimes the difference between these two extremes is just one good index away.

So if most developers can already answer why we use an index, to improve read performance, and we can all agree that slow queries are a real problem, then what am I doing here? Why am I talking to you right now? The reason is that developers do not know enough about indexing. Indexing is a lot more nuanced than throwing an index at every column in your WHERE clause and hoping that one of them sticks. If you don't know what you're doing, a bad index can actually make performance worse. Excuse me, I got a little emotional there.

So here's what's going to happen. I won't be able to make you into an expert; there's simply not enough time for that. My main goal for this talk is to show you how much there actually is to this topic. If you leave this talk thinking "man, I should really read up on this indexing stuff", then I'm happy to have done my job. Nevertheless, we are going to learn a few practical things, but first there's going to be some theory. We are going to talk about what an index actually is, and I'm talking data structures. This is, to me, the most important part of the talk, because everything else, everything about how an index behaves in certain situations, why it sometimes works and sometimes doesn't, can be deduced from understanding how the underlying data structure works. You don't have to memorize everything that's coming up in the rest of the talk; if you understand this part, you're in very good shape. Points two and three are understanding the execution plan and common pitfalls when designing indices, and those will happen roughly at the same time: we'll explore both as we go through the examples.

Okay, so what's an index? When I ask people what an index is, most of them like to use the mental image of a phone book. I don't know how much longer I'll be able to use that example until people go "a phone what?" Case in point: I could not find a picture of a phone book on the stock site I was using, so I improvised; here's a phone in a book. If you think about a phone book, for simplicity's sake let's say it has three pieces of information: the first name, the last name, and the phone number. A phone book is almost always ordered by last name, so if I give you someone's last name, you can find the person pretty quickly, because the book is ordered for exactly that type of search query. If I only give you someone's first name, you're going to have to look through every single entry, and you'll probably end up with a whole bunch of results. So you could say a phone book is like having an index on last name.

More generally speaking, an index is an ordered representation of the indexed data. Now, why is order so important? What we're doing when we query our data is searching for a subset of that data, and it turns out that searching an ordered input is a lot more efficient than searching an unordered one. Think binary search: very, very fast, but it assumes your input is sorted.

Okay, enough with the mental models and helper images; the training wheels are about to come off. We're going to look at what an index actually is. Introducing the B-tree. What does the B in B-tree stand for? It is not binary. Thank you for saying binary; it is in fact balanced. It is a balanced tree. We're going to look at what exactly that means and how it is balanced. I wasn't sure how I was going to react if someone said balanced, so thank you, whoever said binary.

So what does a B-tree look like? Surprisingly, it looks a lot like a tree. If you look at the very top node, 23, this is called the root node. You can see that it has two subtrees, a left subtree and a right subtree. All the values in the left subtree are less than 23, and all the values in the right subtree are greater than or equal to 23, and this is true for every node in the tree: the left subtree is always less than, the right subtree is always greater than or equal to. This is how the tree is ordered. What we can also see is that the nodes at the very bottom, called the leaf nodes, don't have any further subtrees.
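The point about ordered input can be sketched in a few lines of Python; the standard `bisect` module does the binary search, and the names here are made up for illustration:

```python
import bisect

# A phone book "index": last names kept in sorted order.
last_names = sorted(["Miller", "Adams", "Young", "Baker", "Smith", "Lopez"])

def contains(sorted_items, value):
    """Binary search: O(log n) comparisons instead of scanning every entry."""
    i = bisect.bisect_left(sorted_items, value)
    return i < len(sorted_items) and sorted_items[i] == value

print(contains(last_names, "Smith"))  # True: found in about log2(n) steps
print(contains(last_names, "Jones"))  # False: absence is ruled out just as fast
```

Searching by first name in this model would still mean checking every entry, which is exactly the unindexed case from the phone book analogy.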
The leaf nodes are all at the same level, the same depth of the tree. This is important, because it guarantees that it takes the same number of steps to find any value in the tree. If one side of the tree were way deeper, if it were unbalanced, then searching for a value in that part of the tree would take longer, which is not what we want. So this is why it's a balanced tree: the leaf nodes are always at the same level.

The leaf nodes also form what's called a doubly linked list, which you can see by the tiny arrows I added pointing back and forth. Each leaf node has a pointer to the next leaf node and a pointer to the previous one (so there should actually be two more arrows on either side, pointing to null). Why is this important? It turns out that when we're searching for data in an index, we're going to spend a lot of our time down in those leaf nodes, scanning through them. If we didn't have a doubly linked list, then as soon as we hit the end of one leaf node we'd have to go back up the tree, find the next leaf node, start scanning, go up again, and so on. That's not very efficient. With the doubly linked list, we can scan through the leaf nodes sequentially without going back up the tree every time.

An interesting question now is: what data is actually being stored in the index? Say we have a very simple table with just an employee ID and a first name, and we put an index on the name column. The index we end up with contains only the names; the employee ID is nowhere to be found. In other words, an index only contains the values of the columns you actually put the index on. This sounds very dull, but I've heard people describe an index as "a copy of your table that is ordered by a different column". That is incorrect, and it has consequences; we're going to see why it's important to know what data is actually stored in the index.

This begs the question: what if I need data that is not in the index? How do I get back to the table if I actually want to find out, say, the employee ID of Taylor? The image I showed is actually incomplete: the database stores one additional piece of information in the index for every value, the so-called row ID. It's not literally the string "row ID"; it's a database-internal identifier for a particular row in a table. It is not your primary key. The database stores that bit of information for every value in the index, and that way it can go back to the table and find the row to which a value in the index belongs.

Okay, so much for the data structure. What are the consequences of choosing it? First of all, searching for a value is very fast, because essentially what you're doing is binary search: with every step you go either left or right and eliminate a large portion of the remaining data, so you narrow down your search very quickly and either find the value or find out that there is no result. And it scales really well, because it has logarithmic scalability, which is a fancy way of saying it pretty much doesn't matter how large your input gets, this will keep performing.

So if indexing makes our queries faster, we should just put an index on every single column, right? Chance to redeem yourself, whoever said binary. Right, no, of course not. There are no silver bullets; there are only trade-offs. Using an index improves read time, yes, but it makes writing slower. With every INSERT, UPDATE, and DELETE you perform, you also have to insert into, update, and delete from the index, which means the index may have to be rebalanced, and that can take a while. If you have a whole bunch of indices, your writes are going to be very, very slow.
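As a toy model (not a real B-tree), the index can be pictured as a sorted list of (value, row ID) pairs: looking up a name means a binary search in the index followed by an extra hop back to the table via the row ID. All names and row IDs here are invented for illustration:

```python
import bisect

# Toy model of the employees table: row_id -> full row. The row_id is the
# database-internal identifier, not the primary key.
table = {
    101: {"employee_id": 1, "name": "Adam"},
    102: {"employee_id": 2, "name": "Taylor"},
    103: {"employee_id": 3, "name": "Jeffrey"},
}

# An index on `name` stores only the indexed values plus the row ID,
# kept in sorted order; it is not a full copy of the table.
name_index = sorted((row["name"], row_id) for row_id, row in table.items())

def lookup(name):
    """Binary-search the index, then follow the row ID back to the table."""
    i = bisect.bisect_left(name_index, (name,))
    if i < len(name_index) and name_index[i][0] == name:
        return table[name_index[i][1]]  # the extra hop: fetch the row itself
    return None

print(lookup("Taylor"))  # {'employee_id': 2, 'name': 'Taylor'}
```

The extra hop from index entry to table row is exactly the cost that comes back to bite us later in the talk.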
So, as a rule of thumb: have as many indices as necessary, but as few as possible.

Okay, moving on: understanding the execution plan, and in tiny font it says "MySQL version". This is because the output of the execution plan looks different for each database vendor, and I've made a gamble that most of us here are using MySQL. Who here is not using MySQL or MariaDB? Okay, two people. Sorry, you'll have to do a bit of extra work. The concepts I'm going to talk about are the same across database vendors; they just use different terminology.

The execution plan is the set of steps the database needs to perform to execute your query. Here's how you get it, and this is pretty uniform across database vendors: you prepend EXPLAIN to your query. So instead of saying SELECT *, you say EXPLAIN SELECT *, and in MySQL you get a result that looks something like this. Do note that this is actually one row; I've just formatted it this way so it fits better on a slide. The columns are id, select_type, table, partitions, and so on. That's a lot of columns, and we're not going to look at all of them, but we are going to look at the most important one right now, which is the type column. Here it says type: const. If you look up this column in the documentation, it's called the "join type", which I think is not the best name for it, because in this example there are no joins, so I'm not really sure what it's referring to; it's a bit confusing to me. A better name would be "access type", because what this really tells us is how the database is going to access our data: how exactly it is going to use an index, or not use an index, to execute our query.

So we're going to go through the possible values you might encounter in that column and talk about the characteristics of each one really quickly. Don't worry if this is all a bit much right now; we will come back to them as we go through the examples. This is just so you've seen them and have a rough understanding of how they differ.

Starting with const and eq_ref. These are actually two distinct values that may appear on their own in the EXPLAIN output; I've grouped them together because, for our purposes, they access the data in the same way, so I'll treat them as one. With const and eq_ref, the database performs a B-tree traversal to find a single value in the index tree; basically, you're doing binary search. This can only be used if you can somehow guarantee uniqueness of the result, meaning there can be at most one result. There are two ways to do that: have a primary key on the column (which is what I had in the example, a WHERE clause on the ID), or have a unique constraint on the column. Fun fact: LIMIT 1 does not guarantee uniqueness, because you might still fetch more than one row; you're just discarding the rest afterwards. So, basically, you start at the top of the tree and go left or right, depending on whether the value is less than or greater than, until you either find the value or find out there is none. As you can imagine, that is super fast. If you see const or eq_ref in your EXPLAIN output, stop optimizing; it's not going to get any faster.

Moving on: ref and range. Again, these are two distinct values that we're grouping together because, for our purposes, they behave the same way. They're known as an index range scan: the database performs a B-tree traversal, but instead of finding a single value, it finds the starting point of a range and then scans from that point on. Say we had a query with WHERE id greater than 15 and less than 30. This would do a B-tree traversal to find the first value greater than 15, and from that point on it starts scanning through the leaf nodes (remember, this is where the doubly linked list comes in) until it hits the first value greater than or equal to 30. Those rows are the only rows the database has to look at. So what a range or ref does is limit the total number of rows the database has to inspect to perform your query, which is a good thing: less work for the database.

Next one: index, also known as a full index scan. We are still using the index, but we are not using it to limit the number of rows like we just did. We literally start at the very first leaf node and scan through all of them until we hit the very last leaf node. There's no filtering going on; you simply traverse the entire index. But you're still using the index.

And lastly, there is ALL, the so-called full table scan, and it is every bit as bad as it sounds. You do not want this. Think of ALL as "avoid at all costs". A full table scan is pretty much never what you want: you do not use an index at all, you load every row and every column of the table into memory, go through them one by one, and then emit or discard them depending on whether they fulfill the filter criteria.

So, to recap. We have const and eq_ref: basically binary search for a single value, super fast; stop if you've got that, it's perfect. Next we have ref and range: we use the index to limit the number of rows we even have to look at to a certain subset. Then we have index: we still use the index, but not for limiting or filtering; we just scan through every value in the index. And lastly we have ALL, avoid at all costs, the full table scan where we just load up everything and go through it.
Okay, so this is where the live-coding part comes in; this is where it gets scary. Let me see, you don't see anything, I'm going to have to mirror my displays so I don't break my neck. There we go. Oh no, I'm giving everything away.

So, let's see. I have prepared an example database that contains a single table called orders. It has roughly 2.5 million rows in it, and it's basically garbage, randomly generated data. It has four columns: the id; the total, which is the total of the order in cents; the user_id (there's not even a users table); and the created_at date, when the order was placed.

Now, here's what we need to do. Management comes to us and says: we want a report, the total sum of all orders that were placed in 2013. Sounds easy enough. So we're going to do something like this: select the sum of the total column from the orders table, where... okay, where the order was placed in 2013. You might do something like this, and I have seen this: there is a YEAR function in MySQL that extracts the year part from a date and returns it as either a number or a string, I'm not quite sure. So you can say WHERE YEAR(created_at) = 2013. You run that, it returns a value, looks plausible enough, you commit, you force push, you go home.

Next day, your boss comes to you and says: yes, we got the report, the numbers look good, but oh my god is it slow, can you please optimize? So you say: okay, I need to improve read performance, that means I need an index. I only have one column in my WHERE clause, created_at, so I'm just going to put an index on that. Let's do that. I'm going to use the GUI for this because it's faster. I add an index (I don't care what it's called), and I want it on the created_at column of the table. I save that, and it actually takes a few seconds, as you can see, because it has to build up the index for the first time.
Now we rerun the same query, and it's still around 600 milliseconds; it did not change at all. At this point you think back to Laracon EU 2018: there was this guy talking about database indexing, and there was something about EXPLAIN statements, and maybe that can help. So let's look at the query execution plan. I prepend EXPLAIN to the query, and I get back this output, this time as a row, not as a bunch of columns like on the slide. Right away we can see it says type: ALL. We are doing a full table scan. We are loading all 2.5 million rows into memory and going through them one by one. That sounds terrible.

What's weird is the two columns next to the type, called possible_keys and key. possible_keys contains the names of the indices that the database says could potentially be used for this query, and key contains the name of the index it actually used. Now, the interesting thing, and this is kind of hard to see because for some reason it cuts off half the text, is that possible_keys is null. So even though we have an index on that column, it is not even being considered. The database thinks this index cannot work at all. What is going on?

Which brings me to the first pitfall: functions. If you have a query whose WHERE clause says WHERE YEAR(created_at) = something, what the database essentially sees is a function call; it doesn't matter what goes into the function, the column will not be seen by the database at all. This is because you cannot guarantee that the output of the function has anything to do with the indexed values. Let's say that instead of YEAR you have a function that calculates the number of characters in a string, so the output is a number, but your index is on the string field. That can't possibly work: you're trying to use an index on a string column with the result of a function call that returns a number. There's no connection between those two values.
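The function pitfall is easy to reproduce. Here's a sketch in SQLite (Python stdlib `sqlite3`), where `strftime('%Y', ...)` stands in for MySQL's YEAR(); the orders table is a tiny made-up stand-in for the demo database:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER, created_at TEXT)")
con.execute("CREATE INDEX idx_created ON orders (created_at)")
con.execute("INSERT INTO orders (total, created_at) VALUES (500, '2013-06-15 12:00:00')")

def plan(sql):
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Wrapping the column in a function hides it from the index: full scan.
print(plan("SELECT sum(total) FROM orders "
           "WHERE strftime('%Y', created_at) = '2013'"))

# Filtering on the bare column can use the index: range search.
print(plan("SELECT sum(total) FROM orders "
           "WHERE created_at BETWEEN '2013-01-01 00:00:00' AND '2013-12-31 23:59:59'"))
```

Same data, same result set, but only the second query gives the planner a chance to use the index on created_at.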
So if you use a function on a column in a WHERE clause, you can no longer use an index on that column. By saying YEAR(created_at), we have lost the ability to use the index on created_at. This sucks. What can we do?

There are a couple of options. In Postgres there are function-based indices: instead of putting the index on the created_at column, you put the index directly on YEAR(created_at). It's like a computed index, and then this behaves correctly. MySQL doesn't have that; even MySQL 8 (I checked this morning) doesn't have function-based indices. It does have something a bit similar called generated columns: you add a new column to your table, and in its declaration you say that the value of this field is the result of calling YEAR(created_at) on that row, a generated, computed field if you want. Since that is a regular column in your table, you can put an index on it, and it works.

But in our case, please do not do this. This is not how you work with dates in a database. Do not use a year column, a month column, a day column, an hour column; there are ways to deal with dates that are much better suited to this. There's a reason there's a datetime type in a database. So we're going to get rid of the function and think about what we want: all orders placed in 2013. Why not just define the explicit range we are looking for? Why not say WHERE created_at BETWEEN the first possible value in 2013, the 1st of January at midnight, and the last possible value in 2013, the 31st of December at one second before midnight? This is how you deal with dates, because now you can do much more interesting things: you can say, give me all rows between the 3rd of April at 7:00 in the morning and the 28th of August at 3:00 in the afternoon. You can't do that with a year column.

So it looks like we just solved our problem: we removed the function, it should be blazingly fast, the place is going to go wild. Let's just look at the EXPLAIN statement for now. Oh boy, still a full table scan. But something changed: the possible_keys column now at least lists our index as something that could potentially be used. For some reason, though, the database is deciding not to use it. This is the point where, if you don't know what's going on, you might be tempted to make a mistake like this: you say, piece-of-crap database, I know better than you, I just designed this handcrafted index for exactly this query, you should really use it. In MySQL, and this is do-not-try-at-home territory, you can do FORCE INDEX and give it the name of the index. If it begins with FORCE, it's probably a bad idea. So let's run the EXPLAIN statement with the FORCE INDEX. Take a look at that: we changed from a full table scan to a range scan. Thinking back to the earlier slides, a range scan means we are now using the index to find the beginning of a range and then scanning until we hit the end of the range, without looking at anything else.

We can actually double-check that. MySQL's EXPLAIN output has this column back here called rows, which is terribly named, like so many things in this output, because rows is not the number of rows in your result set; it is the number of rows the database estimates it has to look at to execute your query. In this case it's around 466 thousand, about a fifth of our rows. If I remove the FORCE INDEX again, to see the full table scan, rows says 2.4 million, so we have to look at everything.
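The Postgres-style function-based index can be sketched in SQLite, which supports indexes on expressions. Here `substr(created_at, 1, 4)` extracts the year as a stand-in for YEAR(); note the planner can only match an expression index when the query uses the identical expression. This is an illustrative stand-in, not the MySQL demo from the talk:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER, created_at TEXT)")
con.execute("INSERT INTO orders (total, created_at) VALUES (500, '2013-06-15 12:00:00')")

def plan(sql):
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# substr(created_at, 1, 4) pulls out the year, standing in for YEAR().
q = "SELECT sum(total) FROM orders WHERE substr(created_at, 1, 4) = '2013'"
print(plan(q))  # full scan: no index matches the expression

# Index the expression itself; now the planner can match it.
con.execute("CREATE INDEX idx_year ON orders (substr(created_at, 1, 4))")
print(plan(q))
```

This mirrors what a Postgres function-based index (or an indexed MySQL generated column) buys you, though as the talk argues, a proper range on the datetime column is usually the better design.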
With the range scan, we only have to look at 466 thousand. So we've just proven the database wrong, everything is better now. We remove the EXPLAIN from the query, run it, it's going to be amazing, we're going to get promoted, and... take a look at this. Oh my god, it is still running. It just took 4 seconds. So we have clearly just discovered a bug in MySQL; we should go open a ticket and flame the maintainers about this piece of software. Don't do it. I see some people pulling out their phones; don't, this is a joke.

But this is confusing. If you don't know what's going on, this is super weird, and this is what I meant: if you don't know what you're doing, you can make performance worse. It just went from 600 milliseconds to 4 seconds; I'm bad at math, but that's more than twice as slow, I think.

What's going on comes back to what data is actually stored in the index. We have an index on only the created_at column. The query, however, says SELECT sum(total): we are also referencing the total column, and the total column is not in the index. So for each of the 466 thousand rows it estimates it has to look at, the database has to take the row ID, go back to the table, and fetch the corresponding row. That is a read from disk. Fetch the row, take the total column, add it to the sum, and do that 466 thousand times. That's bad.

You might say: well, isn't a full table scan even worse? It has to do that 2.4 million times. (Dramatic pause while I drink.) No, it isn't. The database is smart enough to know: if I have to do a full table scan anyway, I know from the get-go that I have to read everything, so it's not going to read rows one by one. It batch-reads them, a couple thousand at a time, and the amount of disk I/O is way less. It turns out that in this example that is actually way faster, even though it still has to look at five times as many rows, because you cut down on the number of disk reads, and in total that is much faster than using the index. So the database rightly decided that, in this case, a full table scan is actually faster than using the index.

Now we know what's happening, but it still sucks. What can we do? If we look at our WHERE clause, we've kind of exhausted our options there; we already have an index on created_at. But if the issue is that not all the data is present in the index, why not put the data in the index? Why not put the total column in the index as well? So this is what I'm going to do: I take our existing index and change it to also include the total column. Save that; it'll take a second or two or three. There we go. Now let's rerun the EXPLAIN statement. There we go: now it is voluntarily choosing our index, because it can see it doesn't have to read from disk anymore.

The interesting thing to note is this Extra column, which I think is the worst column in this entire output, because it's not quite clear what is "extra", and the values that appear in it are usually really weird, like "Using index". What this really means is "using index only". We have just put all the data this query needs into the index, which means the operation can now be performed entirely in memory, because MySQL keeps its indices in memory; there are no reads from disk at all. This is what's called an index-only scan, and it's one of the most powerful tools you can use when designing an index. It is also one of the most aggressive. What do I mean by aggressive? We have just created a very specific index that probably only works for this one query, because not only have we indexed for the WHERE clause, we have indexed for the SELECT clause as well.
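The index-only scan is visible in SQLite too, which labels it "COVERING INDEX" in the query plan. A sketch with a tiny made-up orders table (stdlib `sqlite3`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER, created_at TEXT)")
con.executemany("INSERT INTO orders (total, created_at) VALUES (?, ?)",
                [(100 * i, f"201{i % 5}-06-15 12:00:00") for i in range(10)])

def plan(sql):
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

q = ("SELECT sum(total) FROM orders "
     "WHERE created_at BETWEEN '2013-01-01 00:00:00' AND '2013-12-31 23:59:59'")

con.execute("CREATE INDEX idx_created ON orders (created_at)")
print(plan(q))  # uses the index, but must still fetch `total` from the table

con.execute("DROP INDEX idx_created")
con.execute("CREATE INDEX idx_created_total ON orders (created_at, total)")
print(plan(q))  # "COVERING INDEX": everything the query needs is in the index
```

With the second index, every column the query touches lives in the index itself, so the table is never visited; that is the index-only scan.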
And thinking about the rule of thumb, you should try to keep the number of indices to the minimum you need. Introducing an index like this, one that can probably only be used by a single query, is a trade-off you have to consider, but it can dramatically improve performance. If we now run the query without the EXPLAIN, it takes 92 milliseconds. From 600 milliseconds to below 100 milliseconds. Great, you go home. Don't forget the force push.

Next day, management comes to you and says: great job on the report, we have a feature request. And the hair on the back of your neck starts standing up. They say: we want the same report, but now for only a single user. You sigh in relief; that's easy. For illustration purposes, let's take a random user ID, this one: 1136. You say, well, that is easy, we just add AND user_id = 1136. You run it... it's still running. It took two seconds. Okay, let's look at the EXPLAIN. What's going on here? Oh my god, we're doing a full table scan again.

Luckily, the solution seems pretty simple, because we basically just reintroduced the problem we just solved: we added a column to the query that is not in the index, so we have to do all that reading from disk again. If the problem is the same, why not go with the same solution: just add the user_id column to the index. I mean, this indexing stuff is easy, really. So I put user_id in the index as well, save it, wait for it to finish, and rerun the EXPLAIN statement. There we go, we are back to using the index, we're doing a range scan.

Something is still not right, though. The rows column says it still thinks it has to look at around half a million rows. That doesn't seem right; that's the same number it thought it had to look at when we didn't have the user_id constraint, and that should be a much more limiting constraint: a user should only have maybe a couple dozen orders. So why does it think it has to look at the same number of rows as before? This tells me that yes, it's using the index, but not to its full extent. Which brings me to the next pitfall.

This is what our index looks like right now, as a table: we have created_at, then total, then user_id. The index is sorted first by created_at; if two entries have the same created_at value, they're sorted by total; and if they have the same created_at and total, they're sorted by user_id. Here's what you need to know about multi-column indices: you can use a multi-column index from left to right. You can use this index for a query that filters on created_at; on created_at and total; or on created_at, total, and user_id. You cannot skip columns. What we're doing is filtering on created_at and user_id, skipping total, and that doesn't work. Why not? Because user_id by itself is not sorted; it is only sorted with respect to created_at and total. If we leave out the middle column, the user_id column is essentially unsorted, so the index is still being used, but only up to created_at. The point being: column order in an index matters. An index on (a, b) is not the same as an index on (b, a).

So if this is the problem, and we have a WHERE clause on created_at and user_id, the solution seems to be to move user_id into second place, so we can use the first two columns of the index. Let's do that: I rearrange the columns in the index, put user_id into the second position, save it, and rerun. It should now have to look at way fewer rows than before. It doesn't; it actually has to look at more rows, but that's just because this is an estimate, so it fluctuates a bit. It still says it has to look at four or five hundred thousand rows. Nothing changed.
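The left-to-right rule falls out of the sort order, which a toy model makes visible: treat the composite index as a Python list of (created_at, total, user_id) tuples sorted left to right. The leftmost column ends up globally sorted, but skipping the middle column leaves user_id effectively unsorted. The data is made up:

```python
# Toy model of a composite index on (created_at, total, user_id):
# entries are sorted by the columns from left to right.
rows = [
    ("2013-01-05", 700, 42), ("2013-01-05", 200, 7),
    ("2013-02-11", 500, 42), ("2013-01-05", 900, 3),
    ("2013-02-11", 100, 99),
]
index = sorted(rows)  # sort by created_at, then total, then user_id

def is_sorted(values):
    return all(a <= b for a, b in zip(values, values[1:]))

# The leftmost column is globally sorted: a query on created_at can
# binary-search it.
print(is_sorted([e[0] for e in index]))  # True

# Skip the middle column and user_id is effectively unsorted: the index
# cannot seek on created_at plus user_id alone.
print(is_sorted([e[2] for e in index]))  # False
```

This is why an index on (a, b) is not the same as an index on (b, a): only left-to-right prefixes of the column list are ordered in a way the database can search.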
correct, right? The column order matters. There's actually another problem with this query, which brings me, funnily enough, to my last pitfall: inequality operators. Yes, you can use an index from left to right, but as soon as you have an inequality operation on one of the columns in the index, it's as if the index just stops at that column. And we do have an inequality operation: we have a BETWEEN on the created_at column, and created_at is the first column in our index. So it's as if our index just stops there. This is why the result is basically unchanged whether the user_id is in the query or not: the database never even gets to that column. The index can only be used up until created_at, because we have an inequality operation on it.

So again, if this is the problem, then we should just put user_id into first position, right? Because we have an equality operation on user_id, and then we put created_at into second position, because we have an inequality operation on that. So let's try. I'm going to change the index one last time: take out total, put user_id into first position and created_at into second position. This should mean we can use the index to narrow down to just the rows that the user ordered, and then narrow them further to only the orders that were placed in 2013. So if everything went according to plan and we rerun this query now... there we go: 886 rows. Now it's using the index to its full extent, using both columns in the index to filter the number of rows we have to look at. If we remove the EXPLAIN statement, the query runs in under a millisecond. We went from four seconds to under a millisecond. That is really, really fast.

So you commit, you push, you go home. The next day, management says: cool, the user report is really fast now, but we noticed that the report on all orders seems to have gotten slower again. And you're like, no, this isn't happening. Let's have a look. Remove the user
ID, run the query, and... oh god, we're back to 600 milliseconds. What is going on? Let's take a look at the EXPLAIN. Oh, this is a new one: a full index scan, we haven't seen that one before. We are still using the index, but we're not using it to limit the number of rows we have to look at. We are basically starting at the very first leaf node and then traversing through all 2.5 million rows, which we can see back in the rows column, where it says: yes, I do have to look at every single row. So what just happened? Well, we changed the order of the columns. Now user_id is in first place, this query doesn't filter on user_id, and since we can only use an index from left to right and we cannot skip columns, we basically can't use the index anymore, because created_at is now the second column.

So the point is: there is no single index that can satisfy both of these queries. And this is sort of the moral of this entire talk. Indexing is a developer concern. It is not the concern of your database person, or the guy sitting in the corner over there who seems to have MySQL Workbench open all the time. It is a developer concern, because an index and a query always go together. You do not design an index in a vacuum; you always design an index for a query. And only we as developers know what our queries actually look like and how we are accessing the data, so only we are in a position to write a good index. In this case, you would actually have to decide: do I introduce a second index? Or is the report for all orders maybe only run once a year, so it really doesn't matter if it takes 600 milliseconds? It's something you can only decide if you know the context and how your data is being accessed.

So to close this talk, everyone repeat after me: I am a developer. Indexing is my concern. Indexing is my concern. Thank you all for listening. [Applause] [Music]
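The "sorted first by one column, then by the next" structure the talk describes can be sketched in plain Python with a sorted list of tuples. This is only an illustrative model, not how a real B-tree index is implemented; the column values are made up. It shows why an equality filter on the leading column can narrow the search, while a filter that skips the middle column still has to walk every remaining entry:

```python
from bisect import bisect_left, bisect_right

# Toy model of a composite index on (created_at, total, user_id):
# entries are kept sorted exactly like the talk describes.
index = sorted([
    (2013, 10, 7), (2013, 10, 9), (2013, 25, 3),
    (2013, 40, 7), (2014, 10, 7), (2014, 30, 2),
])

def slice_for_created_at(year):
    # An equality filter on the LEADING column maps to one contiguous
    # slice of the sorted list, found by binary search.
    lo = bisect_left(index, (year,))
    hi = bisect_right(index, (year, float("inf"), float("inf")))
    return index[lo:hi]

slice_2013 = slice_for_created_at(2013)
print(len(slice_2013))  # narrowed from 6 entries to 4

# But user_id (the third column) is NOT sorted within that slice once we
# skip total, so we must inspect every entry and filter by hand:
matches = [entry for entry in slice_2013 if entry[2] == 7]
print(matches)
```

The binary search mirrors what the index does for created_at; the final list comprehension mirrors the residual row-by-row filtering the database has to do for the skipped-column condition.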
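Finally, the trade-off at the heart of the talk's conclusion (one index cannot serve both reports) can be sketched the same way. Again, sqlite3 stands in for MySQL and all names are assumptions: with an index on (user_id, created_at), the per-user report can seek into the index, but the all-orders report no longer starts with the leading column and falls back to reading every row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (user_id INT, created_at TEXT, total INT)")
con.execute("CREATE INDEX idx ON orders (user_id, created_at)")

def plan(sql):
    return " | ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Per-user report: the leading index column is constrained, so the
# planner can seek into the index.
plan_per_user = plan("SELECT total FROM orders "
                     "WHERE user_id = 7 AND created_at >= '2013-01-01'")
print(plan_per_user)

# All-orders report: created_at is no longer the leading column, so the
# planner resorts to a scan of the whole table.
plan_all_orders = plan("SELECT total FROM orders "
                       "WHERE created_at >= '2013-01-01'")
print(plan_all_orders)
```

This is the situation the talk describes: the same index that made one query fast made the other unable to use the index at all, and only someone who knows both queries can decide how to resolve it.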
Info
Channel: Laracon EU
Views: 87,806
Rating: 4.9780402 out of 5
Keywords: laravel, database indexing, php, laracon, amsterdam
Id: HubezKbFL7E
Channel Id: undefined
Length: 41min 49sec (2509 seconds)
Published: Fri Jan 25 2019