How do indexes make databases read faster?

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
so we know indexing makes your database faster in this video we'll go through how exactly indexing makes your database faster we'll go through how indexes help you minimize the disk ios that needs to be done by constructing a very simple index basic mathematics to just understand what happens behind the scene so let's jump into it a database to be very simply put is just a collection of records like now if it's a sequel thing it could be the collection of rows grouped into tables if it's let's say mongodb or any document based database it would be like a collection of json documents but the idea is all of those documents needs to be stored in this so needs to be serialized and then stored onto the disk right so how does that serialization happen so in this one we'll take an example of a very simple uh a a very simple sequential serialization like here like what we typically see with sql databases right so let's take an example of sql database will not go dive will not go deeper into b plus tree and we'll see how it fits into the scheme of things but just a basic understanding of how indexes work so let's say we have a users table and it has five columns id name h bio and total blocks so every column or every attribute over here will take up some space onto the disk which means that when that attribute or when this row needs to be serialized every attribute gets its particular size or its particular set of bytes stored onto the disk right so here id being an integer let's say it takes up four bytes name i put it as 60 bytes age again an integers of 4 bytes bio could be a big one let's say we put it as 128 bytes and total number of blocks as in how many blocks has this user published would be an integer so four bytes so in all when i am trying to store one row onto the disk it would take up 200 bytes to store one disk uh to store one uh row okay so here we saw user table five columns 200 bytes to store one row onto the disk now let's say my table my users table has 100 rows like this so what would be the total size of this table 200 for each row 100 rows like this so 20 000 so basically 20 000 bytes is what will require so it would look something like this let's say these are like dummy entries in my table 1 name is a age is 23 some bio and then total number of blocks 7 id 2 name is b age is 21 bio something something something total blocks is 10 and so on and so forth right so let's say we are playing with this six entries and there are many more entries like this totaling up to 100 rows right now let's go through and understand how disk how reads from the disk actually happen so wherever anything is being read from the disk like even if you read one byte from the disk it's not that only that byte is red it's a block is read so your entire disk beat magnetic storage bit ssd it's split into blocks right and these blocks are consecutive in nature now a block can be like a standard configuration of block is typically 4 kb but you can change it you can make it you can make it bigger or you can make it smaller each one has its own set of consequence in this example let's take an example of our block size being 600 bytes which means if you have if you have let's say 1gb of hard disk on that 1gb of hard disk 600 bytes cup blocks uh like it is split into 600 byte blocks right and whenever like let's say i have this as a space and it is split into four blocks for now and let's say if i want to read these bytes which is highlighted in uh which is highlighted in yellow over here if i want to read these bytes it is not that i'll only be reading these bytes so when i'm doing a disk io the disk io will fetch the block that is contained or rather within which the bytes that i am requesting are content so here if i am requesting these bytes what your disk io will do is it will go to the disk like it will hit the disk read that particular block load it in memory read these bytes and then send it to the user right and basically send it for the further processing right so no matter how many bytes you would want to read it's not that only those bytes are read but the entire block is a brought into memory you read that many bytes and then you basically continue your processing right so block is the unit of data read from the disk right now let's see how our users table that we just discussed how that fits into my block string so we know that each record or each row of my users table is 200 bytes long we know that a block size for this example is 600 bytes right which means that in each block like if i am serializing this entire table in each block i can contain three rows so my users table would look something like this on disk so on disk i have these blocks the four blocks that we just discussed right so the first three rows will go and sit in the first block the second three rows will go and sit in the second block and so on and so forth because a block has like let's say a block has 600 bytes cup width like that's the size of the block each record is 200 bytes long so how many rows would we fit three rows so in this block we are putting in three rows in this block we are putting in three rows in this block we are putting a three dot and so on and so forth so to represent 100 no rather to store hundred rows on the disk how many blocks would you require we would require 100 by 3 it is 33.3 but we cannot like the block is not divisible so we need 34 blocks to store this entire table this entire table onto the disk this entire table onto the disk stored serially as a as a series of blocks which is total number of blocks required is 34 right now how much time now let's say if i am going through this table row by row how much time would it require for us to read this entire table so to read this entire table what would we require is we would have to iterate table row by row request a particular row the the your disk engine or your cpu in that case will identify hey i want to read this particular block from the disk it will go to the disk uh sorry it would say i want to read this particular row from the disk it would find the byte offset it would go on to the disk read that block keep in memory extract the record and basically use it for further processing so the amount of time that we would require to go through the entire table will be equal to the amount of time required for us to read those many number of blocks so because like hyper that very hypothetical example but let's say if i if i like when i do a disc i o and i'm reading one block let's say hypothetically it takes one second for us to read one block from the disk hypothetically it's it's much lower but hypothetically let's say it's one second so to read the entire table or to i trade through the entire table because we are reading 34 blocks how much time would it require for us to do it it will be 34 seconds because each block one second total 34 blocks so 34 seconds right so if i'm just iterating my table row by row it will it is taking me 34 seconds to do it now let's say we want to evaluate a query the query goes like this that find all the users with age 23 so what i typically have to do because up until now we have not added any indexes to this table what are we doing to do we will iterate the table row by row we will read the record we will see if this record has age 23 if it has we would we would basically put it in and buffer uh and then we would basically collating that buffer so we'll be iterating it row by row we would find all the rows with age 23 will check if the row has uh if the record has age equal to equal to 23 if yes we would add that record to an output buffer if no we would discard that particular record and then after the iteration of the table is done we would then written the output buffer as part of the response of my query right so now to do this how much time would it require because we are iterating the entire table we are rotating the entire table row by row which means we'll be accessing all 34 blocks so time taken for us to answer this query is same as time taken to read all the blocks of my table so because there are the d4 blocks it would require us 34 seconds to answer this query right because we are reading we are going through all of this thing now let's see how indexes make them make this particular flow faster so what exactly is an index index are very small referential tables that holds the row references against the indexed value right it's very simple concept so let's say we because we are querying or we want to query all users with h23 uh what we would want is we would want to create an index on column h so how does that index actually be stored onto the disk see this only knows block right so somehow your index will also be serialized and be stored onto the disk as blocks so an index is very similar to like an index to be honest is just a two dimensional oh sorry two-column table where the first column is h and the second column is id which means it's just a it's just like an it's just like a mapping from the indexed value to the row id or or to the row to in which that particular value is present so for example here age 23 so so age 23 is present for row id1 h21 is present for row id2 h22 is present in row id3 so here it's that only so h21 row id2 h22 raw id3 h22 is present at multiple places so h22 is present in row 85 is 23 is present in row id 1 age 23 is also present in row 84 and so on and so forth if you closely observe the column like your sorry your index is sorted is sorted by your indexed column which means it will be sorted by the uh by age in this case so 21 minute 23 24 so on and so forth right so this is exactly how your index will also be serialized will also be serialized and stored onto the disk right now let's see how much uh size or what's the width of like how much size would it require how many bytes would it require for you to store one entry of this index right so one entry of this index is what one h and one id age is an integer id is an integer so 4 plus 4 is equal to 8 bytes so each entry of this index is 8 byte bix so there will be how many entries in the index every row one entry in the index so total number of addresses is equal to 100 uh totals which means the total size of index is equal to 8 for each entry into 100 entries is equal to 800 bytes our one disk block is 600 bytes so how many disks how many disk block would we require to disk block right so for the as many entries as it can fill in one block it will put it all remaining address so first 600 bytes will be gone in first block remaining 200 bytes will go in second block so your index in order to represent your index onto your disk like if you are serializing your index and storing it onto the disk it will only require you two disk blocks to store that now let us evaluate the same query with index and see how the flow would happen so here now that we have an index on id column what we do in worst case so this is not how sql database or nosql database does it it's a very simple flow that i am taking you through uh database actually does it with b plus tree and it has its own set of importance but just to understand how indexes power your database let's go through the worst case and say that every time someone queries every time someone queries anything i'll go through my complete index like i'll go through my complete index without any optimization right so what we do is we iterate through the complete index path so let's say i am firing the queries find all users with h equal to 23 the first thing that i'll do is i'll iterate through all the entries in my index right and find out who for which h is 23 i'll collect all the ids that i have like if i am talking about this index what i'll do is i'll go through this index like every entry one by one and i'll find all the entries whose age is equal to 23. uh then i'll get the id and i'll hold it in a uh and i'll hold it in an in memory buffer all of these ids are something just which which i am interested in because for these ids age is equal to 23 right so i'll iterate index uh so i'll iterate my index uh entry by entry which means block by block i'll check if age is equal to 23 in the entry if yes add the id in a buffer if no discard right so how how many disk blocks would i have to read to prepare this set of ids that i am interested in because i am going through the entire index an index is represented in two blocks it will require me two risk ios that's it so two discretes and i'll be done so two discretes and i'll be break and i'll be reading the entire index uh filtering out all the values like also so i'll be collecting all the ids that i am interested in which means whose age is equal to 23 and then i'll have this id and i'll have this buffer prepared right after i trading through the index i'll have the list of ids that i am interested in so first phase now because i already have the ids that i'm interested in because of the index that i have and now my next step would be to because to user i have to send the entire record so my next page my next step would be using this id i would want to get the actual row data right so the next phase would be for all the relevant ids that i have collected after i trading through my index i would want to read the corresponding records from the main table right so i would be reading the main the the actual record from the main table of all the ids that i uh that i found to be relevant for h equal to 23 and i'm basically collecting it in the buffer and then sending it out to the user as an uh as a response right now here for the second phase of it the amount of time that will require to fetch this to fetch these records uh to fetch actual records from the id would be equal to the number of disk blocks that we would have to read to answer this thing so let's say we just take an example of our of our example where age is equal to 23 so first space iterating through the index finding out relevant ids reading the entire index two blocks required two blocks means two seconds per say like our hypothetical use case so number of blocks read here to prepare that ide skull list is equal to two block reads right now now that i have id so for h equal to 23 two row ids were imported row id one and row id four like id1 and id4 both of them had had ag is equal to 23 right so these are the addresses of my of my record right so these two records i'm interested in now these two records are present in how many blocks we know that my table is sequentially stored first three blocks were stored in first block second three rows were stored in second block so how many blocks so if i want to read this one block one record as i told at the beginning of this that even if you want to read one byte from it your disk will always perform a block read it would read that entire block in memory extract that byte and then discard the block like it is basically catching it but let's say basically discarding the block so if i am reading this particular i want to read this particular record which means i have to read this entire block so i will read the first three entries i'll read the block here which means it would load these three records i would then find out okay for this age is equal to 23 because i have the offset i can directly go and read that record and add it to my buffer similarly for id equal to 4 i'm finding that it is present in the second block so i'm reading the second block from the disk in memory finding out that record uh adding it to the buffer and then discarding it so to answer this query how many disk blocks to get the actual record from the reference id how many disk blocks do i have to read to disk blocks right so to read the actual record two disk blocks to read the index to find the relevant id is two disk blocks so total number of so total number of uh disk clocks that we read to answer this query with index is equal to four blocks right and we know that approximately hypothetically we took that one disk block uh reading is equal to one second so the total number of time that we would require to answer this query would be equal to 4 seconds right so 2 for index 2 for record so now if we compare the time the time taken without so time taken to answer the query without the index was equal to 32 seconds because we had to read 32 blocks iterating through the tables step by step while with index with index we only had to read four blocks two blocks for the index and two blocks for the actual data right now if we see only with this with this example we were able to gain 8x performance out of like just by creating an index we were able to get 8x performance out of my system right so imagine at scale how things would happen right because at scale this indexes are such big 8x improvement is not a small improvement it's a massive massive massive improvement right so if it if a query was taking eight seconds to execute now it is just taking one second to execute like that's insane amount of speed that you get and this which is precisely why whenever you are querying on any data on any database a database always like you always have to keep an eye on are you querying on a column that is not indexed if that is not index it in worst case it may lead to i it may lead to sequential iteration across like a full table scan to be honest in worst case worst case right so that is where you need to understand that how these blocks are you need to understand the importance of indexes and how it makes things faster there are so many ways to implement indexes this is one very typical way of implementing indexes and basically fetching on it but it totally depends on the database that you have at hand like for example a few optimization that you can do with index scanning is you may not need to read entire index every time if you can create a multi-level indexing so multi-level indexing is what b is what b3 and b plus three are famous for right so they minimize the number of blocks that you are like here right now you are iterating on the complete index like here you are iterating on the complete index to find out the relevant ids this also can be optimized by just by doing an mba search on a b plus 3 to find out the blocks that you are interested in basically nothing else right otherwise in this simple example only you can just stop iterating once you cross 23 like you requested for aging is equal to 23 we know that index is sorted you can start reading block by block and as you know that hey you br you crossed age equal to 23 here so then you can just stop reading it so you can you'll just be reading one block so you'll be saving one more disc block read if you know that now you don't need to iterate like there are always optimizations to do that and that's what database do it well so that's what the job of the storage engine uh storage engine of the database is on how it processes the index how it stores the index or not but this would give you a very comprehensive idea on how indexes make your database faster what's the idea behind it how disk reads happen in block center right okay that's it from me i hope this video added a value and basically cleared up uh the the most fundamental thing of how indexes make your database faster and in turn uh help you understand the importance of indexes while firing a query so while using a sql database or any database in general always see you have the right set of indexes otherwise your database will very will will very quickly uh be set on fire right because one one rubbish query can take down your entire database so just so just so just keep an eye on any time your query just keep an eye on that any query that you fire any column that you are using you are indexing you it is properly indexed because that's when you'll get the best performance of your outer database nice so that's it from me in this video uh if you folks like this video give this video a thumbs up if you guys like the channel give this channel a sub and i'll see in the next one thanks a time
Info
Channel: Arpit Bhayani
Views: 54,315
Rating: undefined out of 5
Keywords: Arpit Bhayani, Computer Science, Software Engineering, System Design, Interview Preparation, Handling Scale, Asli Engineering, Architecture, Database Indexing, DBMS, Database Secondary Indexes, Database Internals, Database Indexing Internals, Database Indexing How It Works, Database Indexing Explained, Database Indexing Tutorial, MySQL Indexing, NoSQL Indexing., Database Indexing Basics, INtroduction to database indexing, Database Engineering
Id: 3G293is403I
Channel Id: undefined
Length: 23min 25sec (1405 seconds)
Published: Wed Mar 16 2022
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.