when indexes are useless | The Backend Engineering Show

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
in this episode of the backend engineering show i want to discuss three situations where database indexes are useless how about we jump into it welcome to the back engineering show with your host hussein nasser and quick updates before we jump into today's episode uh i'll be going into vacation uh visiting my family back in bahrain and so you won't get this set up you know where i have a camera so you might get a few videos from my phone so i apologize of the quality when i was there when i will be there time's gonna be a little bit tight so expect a little bit of a delay when it comes to content uh with that out of the way how about we jump into it and talk about indexes sometimes are useless so what what do we mean by a useless index right when i when i say useless in this particular situation i really mean you have added this additional thing that was supposed to add performance but it ended up not doing anything for you at all in fact it could you can argue that it you added some overhead with this index right so this is what i mean so in order to talk about these instances we really need to talk about what do we mean by an index very quick we added this concept of indexes because of a problem that we have in database systems you see we have regardless of the way you store your data whether it's a relational table of columns and rows or documents or blob storage this data is huge in your table you're going to have millions and millions of rows in your documents you're going to have thousands and thousands of documents and your blob stores you're going to have these content these plop pictures videos whatever contiguous one after the other but searching becomes interesting because how do you find something in this pile of stuff you get a scan one by one these million rows or thousands of documents or tens of thousands of these blob images which is obviously not ideal because you have to loop one by one and you have to visit apparently work with millions of rows to find what you're looking for so so smart people found a way to eliminate working with millions of millions of rows and instead work with as few or rows as possible to find exactly what we're looking for the the best solution to search billions of rows is to avoid searching billions of rows right that's that's the best solution so the trick here is to take shortcuts so indexes was well born and there are many many types of indexes that helps you you know find exactly what you're looking for or at least narrow the search not exactly gives you one i o or one cost big of one not necessarily but it helps you you know narrow that search that's the goal of an index so regardless of the type of other b3 you know uh just regenerate search tree or bren the goal is just let me work with as let me eliminate things that is absolutely doesn't contain my results right just like concept of a bloom filter so that's an index so i have an index and you search this index so you still need to search but it's a much much much smaller data structure compared to the millions of rows that you have on your actual data blob store documents right usually you keep we call this area the heap right the heap is a data structure where you just dump everything right it's just a dumping ground index is an optimized data structure that helps you pick what column in your document or in on your row on your table and exactly index that column and the goal is to gives you give you uh a row set right a sets of rows that are candidate that may give you the results after you search the index you get a list of candidate results right it could be one or could be more you still in most situations you still need to go to jump to the heap to the big table which contains actual data and fetch the content because the index will only have a one column or two right the heap will have everything in it so you jump back but the index tells you exactly where to jump and that is the trick here the index tells you hey jump to page 179 and pull row 73 in that page it tells you exactly that right now it depends on the database the databases sometimes does an additional filtering upon fetching the row itself because the there are situations where the row might have been deleted but the index wasn't updated and and we need to see if i'm supposed to see this row or not so whether whether you're in a multi-version currency control or not so this is an index it's cool sounds like a good data structure right beautiful let's always add an index why not but careful with that what did we say we said that the index gives you a candidate rows so that you go to the table and fetch exactly what you want right cool keep that in mind because this is what i gotta go instance number one where the index is useless when the column that you have created the index on has 99 percent of the same value then the index is useless or more sometimes like 100 percent then this is absolutely useless let's say you have an orders table and you have a status and this order stable is the archived order table right where completed orders are usually placed in in this table right so the status of the order in this case in this particular table is always completed right so adding an index on this particular status column is useless because if you if you have this stable and it has like i say 100 million rows and all of these rows or orders are completed right then if you do a search uh or give me whatever order blah blah blah where status is equal completed yeah the database will say oh you have an index there but if when it goes to fetch the index guess what all the values that it's going to search it's all have the same value so the candidate rows are literally the table right so that's an absolutely useless use case for an index so pay attention to these kind of queries it really depends on the value that you're searching for right another example is gender right if you have male female or unspecified right then when you search right you really need to know the statistics of these values in your particular table right let's say if you're if you're uh whatever this table is and most of your let's say youtube viewing history 99 of your viewing history is me are males right searching where value is equal male is not going to give you much benefits because the search space is going to be so large that you're going to get a huge result that the database eventually is going to decide not to use the index at all because the databases are smart they have they keep track of these statistics and decide oh is it worth it to use the index or not because guess what scanning the index is not cheap if you're scanning a lot of pages right at the end of the day so that's another situation that you just just keep in mind right but if you for example if you flip that query and you search for females then the square is optimal because oh the the amount the the number of females that i have is so small right that this the search result is so small as a result i'm going to fetch smaller results and i i might find what i'm looking for much much quicker effectively so yeah keep keep that in mind when you're creating an index and not only just creating the index what kind of queries are you hitting [Music] your database what kind of where clause what kind of values are you doing because that will affect the decision for the database to go to the index or not if the database saw that the statistics in your table i don't believe we talked about what statistics are statistics are a collection of numbers that the database accumulate on your table to give it a hint about how to go about planning your query effectively right so in our example if it says like oh 99 of your rows are you know males and one percent is female so if you search by mail is the database will say you know what it's not really worth searching the index because the the i'm gonna end up scanning more rows anyway it might be faster to do a sequential scan or a full table scan right in that particular cases and the database is always doing these kind of decisions all the time so if you think about it this way and there's really just logic here there's no there's no magic at the end of the day right all these things that the databases have to make these decisions all the time and these decisions to plan are also costly but that's an episode for another day all right so that's a one incident where your index is absolutely useless when you're the most the volume of what you're searching for is it constitutes most of the index effectively which will result in a candidate selections that are so huge that will cost me two hops right to go to the index and fetch the candidate rows and then go back to the table and then fetch that tables that's obviously in postgres in mysql everything is organized the table is organized around the index well it depends on the type of the index right uh if you're searching a secondary index then you search you get a candidate rows of the primary keys and you hit the index organize table the cluster table and you fetch those in those rows and boy they're going to be all over the place so you'll be doing scatter shot ios not quick not fast just just just something to keep in mind all right the second instance where when the index is useless is is i think it's it's well known when you have a string field and not all databases have that but most do if you let's say you have a first name right field and you create an index on first name and you want search for a person named rec if you create an index as is create index on first name that's it then whatever case sensitivity in the column will be preserved and copied in the index and that is how the index will be built so if your name is capital r i c k that is how the index will be built right so ones and zeros at the end of the day our capital is a different letter than our lowercase but i've been guilty of doing this in the past years and i i believe some of the listeners and more people who watch this actually might be as well where when i do a query i'll always use the function the built-in function that is called upper or lower so say i don't know really the i know the user type in rec small letter or rec are a capital or capital so what do you do you throw in hand the front end just sent you a string and you don't know what the backing is right so what do you do you do where upper first name equal uk's the string that you were given what's wrong with that well the moment you have used upper as a function that will have just skipped the index because well says hey dude you changed your expression i have yes i have an index on first name but i have no i don't have an index on upper first name that's a completely different content first name and upper first name is a different thing so people end up do doing that and finding that their queries become slower and slower slower the larger the table because they're doing just the table the database will say hey i never have an index on upper first name let me do a full table scan leaving the index rotting there completely useless and just taking space and really just becoming an overhead for us the moment we insert the new first name the moment we update a new first name we have to go to this index to this useless index and update it with the new values that's we keep maintaining this index that we never use this is truly useless index how do you fix this there are many many ways you can create an index on the expression upper first name if you know that this is how you're going to search your query create an next one upper first name don't create an action first name right another solution is just if you want to handle any case sensitivity you create a generalized search index or gen index inversed index all these kind of things i'm i'm really just dipping my toe into these different data structures but the goal is just gives you heuristics into how uh how close words look like like rick capital with rex small letter are kind of given a score and they are it's within a value then you get a hit effectively sometimes those hit are correct sometimes are false you get false positives and you have to clear remove those force positive false positives effectively all right and and the final one final one when i say final one i don't mean these are the only three there are many many more that i can't think of some of you might actually comment on this video or send me emails that say i'm saying you mean you can mention this and this and this there are many many uh situations of course and depends on the type of the index where the index can be useless effectively so number three uh is specifically when it comes to composite indexes what are composite indexes composite indexes are when when you have an index on two or more columns right at the same time it says create uh index on uh first name and last name something like that right so this will create a single index data structure with both these values and and the databases have contracts when when you have two columns it favors the left hand side column over the right and it really matters where you put first because it will really build the index from left to right left to right left to right this is very important to understand because it will it will it will favor the left-hand side which means when you query this index you better provide the left-hand side column when you do a query right because you're going to rely on the left hand side column in order to filter out the right side so the index become useless when you only let's say you have an index on a first name and last name and you issue a query where you're searching by last name right most databases will do a full tables cancels well yeah i have an index on last name and first name and you only give me last name is i cannot use this index to search for last name because the index is not built that way it's built from left to right right so the database end up end up using the uh doing a full table scan so the index becomes only useful when you actually do an and say hey where first thing we call this endless this is the best case scenario when you do an and because an end means both of them are true right so the first column equal this and this call me call this and there will be an entry laying in the index b3 that have that entry so searching for it will be so fast another situation where the index becomes useless in this particular is if if you actually do an or right and really depends if you do an or what does that mean before we jump into the or if you just provide the first name you're searching for the first name and you have an index on first name and last name then the index will be used because again we're searching from the first name from the left first left columns first so i can use this index despite it having more yes slightly costly right if you're only searching for the left columns but regardless you can use the index and you're going to give get result the problem becomes when you have an or when you have first name equal this or last name i call this that becomes a problem yeah first name equal blah can use the index and it will use it right but then you're going to get a candidate rose because the results that you have from searching the first name is not enough right why because you have an or or what what does or mean or means that first name or last name could be true so both of them could be true one of them could be true if both of them are false then you have to eliminate and that's the problem that means you have to search where the values are only last name and the moment you say only last name that cannot use the index so you have to go to the table the moment the database thinks oh i'm going to the table anyway this index scan that i just did is just useless so why do i double search on the index and the last name might as well just hit the uh the table directly hit the heap directly and fetch everything and start shuffling right the device will make all sorts of decisions this way sometimes the decisions will make sense sometimes will be dumb and most of the time when it's dumb decisions is when the statistics are out of date like like when when the last name column is i don't know it's it thinks that all the last name values are nulled out right then it will make a decision based on that and the decision might be good or bad if it's nulled out then oh it's null so i can assume everything is null if that's the case i might be actually good just searching the index but the databases really have to make good decisions not to give you incorrect result it can slow you down but it never can give you in correct result otherwise you'll be inconsistent and you'll be violating the c in acid right it will be really bad so yeah guys i i think these are the three of many uh situations where the index um become useless i think and uh what do you guys think do you did you encounter cases where your index just became a bloat and you never used it you might say how do i know i'm using the index or not one way is just to add explain at the beginning of the query or take your query and go to the sql developer in oracle or ssms and my sql server microsoft's echo server or mysql php my admin is that that phpmyadmin that's what it's called running for my sequel wait a second item is that something else i might that's my sequel i don't know man i forgot so much stuff uh p sequel for my secret for a pg admin for postgres and then you you just paste that query and then do an explain that that will force the planner to spit out the plan that it used and it says hey i used for this particular query this particular value i used uh this plan i did an index scan i did i did an index only scan i just i did a a table scan i did the full tables guys i did a multi-threaded multi-processing multi-worker table scan all this stuff is just is going to be uh displayed to you then you can make the decision uh based on that and if you have the the dangerous thing is when you start your app and you build your back-end and everything is fast but you actually when you put data and you grow to the millions and millions of rows uh you won't feel the slowness when you have like three records right while it is doing full table scan databases might always do a full table scan when you have like a thousand rows or 200 rows because it says there is no point of searching the index i might just cache everything in memory let's just put everything in the pop-for pool dammit i don't really need to use the index right it just makes all these decisions but all of a sudden when you put billions and millions and millions of rows your applications start to slow down that's what always happened hey what happened hey it worked in my machine it was fast but about production and all slow right because now you're going to start seeing the plan in in in the development environment you had like three records you won't feel it because you never looked at the plans and as a result you you didn't see that the indexes were getting skipped your indexes were useless right and all right guys that's it for me today a quick plug if you're interested in databases stuff check out my course introduction to database engineering i talk only about the fundamentals the very very basic fundamentals that all databases have and that will grow a foundational strong base for you to build great backend applications head to database dot husseinossler.com for a discount coupon right now over 18 hours worth of content sheesh that's a lot that's a lot of lectures all right guys see you in the next one goodbye
Info
Channel: Hussein Nasser
Views: 9,591
Rating: undefined out of 5
Keywords: hussein nasser, databases, oracle, sqlserer, postgres
Id: oebtXK16WuU
Channel Id: undefined
Length: 26min 24sec (1584 seconds)
Published: Sat Oct 30 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.