Why do databases store data in B+ trees?

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

so we all know most databases store the data in B plus trees but how in this video we answer this very question and go through the evolution of storage from a naive implementation to an optimized B plus tree we will talk about why there was a need to use B plus trees how table data is actually stored in B plus trees and how is this tree serialized and stored on the disk but before we move forward I'd like to talk to you about a course on system design that I've been running for over a year and a half now the course is a code based course which means I won't be rambling a solution and it will not be a monologue at all instead a small focused group of 50 to 60 Engineers will be brainstorming the systems and designing it together this way we build a very solid system and learn from each other's experiences the course is enrolled by 800 plus Engineers spanning 12 codes and 12 countries ingenious from companies like Google Microsoft GitHub slack Facebook Tesla Yelp Flipkart dream11 and many many many more have taken this course and have some wonderful things to say the course is focused on Building Systems the way they are built in the real world they will be focusing heavily on building the right intuition so that you are ready to build any and every system out there we will be discussing the trade-offs of every single decision we make just like how you do in your team we cover topics ranging from Real Time text communication for slack to designing our own toilet balancer to cricbuzz's live text commentary to doing impressions counting at scale in all we would be covering roughly 28 systems and the detailed curriculum split week by week can be found in the course page linked in the description down below so if you are looking to learn system design from the first principles you will love this course I have two offerings for you the first one is the live cohort based course and the second one is the recorded offering the Live code base course happens once every two months and will go on for eight weeks while the recorded course contains the recordings from one of the past cohort's assets if you are in a hurry and want to learn and want to binge learn system design I would recommend going you for the recorded one otherwise the Live code is where you can participate and discuss the systems and its design life with me and the entire cohort the decision is totally up to you the course details prerequisites testimonials can be found on the course page arpitbani dot me slash masterclass I repeat ad with many dot me slash masterclass and I would highly recommend you to check that out I've also put the link of this course page in the description down below and I'm looking forward to see you in my next cohort so we all know SQL databases are known to store data in B plus trees right but it is true that even non-relational databases they leverage B plus fees to store the data for example mongodb does that mongodb stores The Collection data on disk serialized in terms of B plus 3 so their storage engine called wired tiger does that now in this one what I want like to do is to build an intuition around why there was a need to introduce a data structure like deepestry and how it is actually serialized and stored like the data serialized and stored on the disk now in order to do this let's start with something really simple and then we see why is there even a need to do that now let's say the most simplistic way to store data let's say I have a table I have bunch of rows in the table the most simplistic way to store the data is to store it in a single file one row after another dead simple right now when you have something like this when we store data in a file sequentially one row after another literally one after another in a file let's see how insert works now here just to think that when I say row it does not just imply row but even a document anything any entity that we are storing in the table could be placed over here so SQL tables typically call it rows no not relational databases typically call it documents and what not but idea holds true across the database Spectra right so here what I would want you all to do is to focus on the need and the intuition rather than the specifics then everything would make sense right okay so let's start with that let's say I have a bunch of rows over here and I would want to insert a new row now given that all the rows that I have is placed in that file sequentially which means one after another inserting a new row inserting this new row at the end of the file is really easy you open the file independently mode and add the row at the end right but now for example we know that relational database or most databases store the data ordered by primary key which means that let's say I have defined a particular column ID as my primary key and I'm inserting Row 1 2 then 5 then 4 and then 3. now I'll have Row 1 2 and 5 inserted then I would when I want to insert the row in between because the table is ordered by that for me to insert the row in the between what would I have to do given that I cannot just insert a line in middle of a file I just cannot do that because when I would write something end between the offset the line does not Auto or the rows automatically would not move down the file right that's not a classic problem on a disk based solution is right it's not in memory buffer that we've seen code editor where we just hit enter and it adds a new line it's on disk storage so when I would want to write something in between it is definitely trying to override the existing content right that I cannot do which means adding a new line in a file in between rows is not an efficient solution because now what we'd have to do is every time the insert happens I would have to first locate the position at which I would want to insert then copy all the lines all the rows before that in a new file add this new line then copy the remaining file on this new or basically copy the remaining lines in this new file and it this way my insert becomes order and become every time I am inserting something I would have to create a new file that that's the only way to do it because if whenever you write at a particular location whatever is written at that location in the file gets overrated right you cannot just insert in between in one file on any disk right so that's the limitation number one and that's why insert becomes order n okay now let's take a look at how update looks like now let's say I want to update Row three right now in order for me to do that I would have to look for the row with ID3 right so then worst case I would have to do linear scan across or the entire file and find the rows that and find the row that I'm interested in then once I discover that row I would start writing at that specific location and overriding the content now here are the challenges given my other rows are already occupying some space when I am writing something from this location starting from this location I would have to write the number of bytes equal to the width of this row I cannot go beyond that because as soon as I go beyond that I would be infringing in the location in the storage location of the next row I'd be overriding this contact it cannot move automatically forward just like in ide in ID it can move forward because it is Ram you're storing that file in you are loading that file in Ram and then making the update when you press Ctrl s then it gets saved on the disk right here you cannot do that so if let's say this number of bytes were 100 and if I would want to write 120 bytes so 100 bytes would be written over here and 20 bytes would be written over here oh infringing in the space of the row id4 right so now for you to do an update you first have to locate right and then assuming that you would be updating the exact same width you can start overwriting over here but if you want to do more than this width you cannot do that you'd have to create a new file enter like copy the rows a similar to user copy the rows write this new row and then update and sorry and then this will copy the rest of the rows right that's why file IO is so so interesting because it you just cannot do it intuitively right you need to know the constraints that you are playing with right okay now let's say I would want to find the neuro because in even an insert you wanted to hunt for location even in order you would want to hunt for location right so now let's see how hunting for a location look like how do you find one in the five for you to do that the only way to do it given because basically given right now there is no indexes even in case of indexes you cannot do it efficiently if you go with the structure but now here what you would have to do is you have to literally go through file row by row and see the and find the row that you're interested in there is no way for you to optimize it binary search you can do it but when you have right set of indexes but how we will take a look which is where B plus 3 comes in right but a naive implementation linear scan is the only way to do it okay now another common operation that you've seen databases is range queries now range queries are quite popular like give me all the rows which are present in this range like let's say 1 to 100 right now for you to do it range queries you can do it only you can do it efficiently in this case why because first of all let's say I want to find rows between 2 and 5. right so I start from here I'd find the row with id2 I found this then I'll continue to iterate one after another until I find row five and then I would just Club the results and send it out to the user right so range queries are efficient but in order to find the first row itself the worst case is ordering you'd have to do a linear scan right so range queries are once you find the first row then the range query becomes like a linear scan but before that that complexity to find the first row itself is ordering because it's a find one operation right so even find our operation is order and you would like not getting enough performance out of your database not the final operation delete right delete as you might have guessed already when you are deleting a particular row it's you would have to create a new file every time you delete given this knife implementation because what you do is you'd create a new file copy all the rows until you flick first of all you'd have to find a row that you would want to delete find one operation order in worst case then you copy the rows up until that location into a new file right and then you skip that row and then you copy rest of the rows over here that's how your delete the row because it's pile IO you cannot just delete something without and not reclaim the space because then you are effectively occupying that row right so you would anyway have to do it so all the operations that we discussed are ordering insert update find delete everyone is ordering so how does B plus T solve this problem so given that order and insert update delete cannot work well with in a transactional database it's not performant enough we have to find another solution which is where V plus t comes into the picture now what is a B plus now will not go into the data structure B plus 3 will understand how B plus 3 is leverage so in case you don't know what B plus 3 is just a Google search away but let me talk about how B plus fees are leveraged over here so in a table there are bunch of rows right a table consists of rows a collection consists of documents now when you have all of these rows you Club these rows in one B plus three note so for example if one B plus 3 node that I have if it is 4 KB big which means I can store 4 kilobyte of data in this one B plus free node then in that case if my average row size or average row document length average length of the document of the row is let's say 40 bytes in a SQL table if I may call it the schema that you have you define fixed width for every single column you exactly know the length of the row right that I am talking about if the length of the row is equal to 40 bytes and if my B plus 3 node is 4 KB big each node will contain 4 KB divided by 40 roughly 100 rows which means that in this B plus 3 node the number of rows that I can put in is 100 rows so this is Row 1 Row 2 Row 3 Row 4 and so on and so forth right so this is exactly what B plus 3 stores when it says that I am my table is storing the data in B plus 3. this is the leaf node of B plus T where you are actually storing the rows of the table right okay so now typically typically the why did I pick 4 KB as the size of this V plus 3 node in most typical configuration the size of a disk block is 4kb now what is this disk block when you do a disk IO you read a write a delete do something on the disk the most granular width in which it is done is 4 KB it is basically called as a disk block size so even if you would want to redo one byte from the disk you cannot just read one byte you have to read the entire disk block and then pick the byte that you're interested in and then like the operating system does this for you but in reality because your operating system is reading that entire block that is the most granular width that you operate on that entire one that entire 4 KB block is red it picks one byte and discards everything else now given this given that the most granular operation that you are doing is on 4 KB you keep the size of a B plus 3 node equal to 4 KB which means whenever you are reading you are reading that one B plus 3 node as is Right which means you're effectively reading roughly 100 rows over here I'm not going into nitty-gritty software like exact structure but you get the idea right there are spaces allocated for pointers and all so it's not exactly 100 rows a little slightly lesser than that but you get the idea right okay so given this is how my rows are structured in B plus 3 nodes now B plus now my table can be represented as a set of B plus T nodes which are connected to each other now how how are you connecting it for example for example I have we know that one B plus 3 node as per our assumption 4 KB this block size equal to 4 KB size of B plus 3 node each row is 40 bytes so I can store 100 rows right okay so given that I know that each node can store 100 rows I will have let's say I have 400 rows so first node will contain 1 to 100 second row will contain 100 1 to 200 third row third block third node would contain 201 to 300 and then the final node would contain 301 to 400. right okay now these nodes can be present in any location on the disk now what your database does internally it does not just allocate a random disk block on it it has a proper structure in a file where it knows from this is this this particular block is there from this offset to this offset this particular block is there I am drawing it in a randomish way so that you understand this it might not be sequentially arranged you cannot assume that they are arranged one after another right okay that's why I'm just drawing it like this so that you don't fall into the Trap and thinking that all of my rows are stored one after another they are not they might be but need not be right okay now each node contains a small pointer exactly a disk offset like at this offset in this file this is where my this B plus 3 node is present right it will show now this might not be individual files this may be part of the same file itself but in that file you split a file into different regions e each of the same size this is the same as 4 KB and you just note down the offset of that and storing that offset becomes a pointer to that node as similar to C where pointer is exactly the address of the memory location here the pointer is the offset in that specific this specific file right now this is exactly how your data is stored or your table data or your document data is stored serialized on the disk these nodes become the leaf node of a wave cluster you would have you you would remember that the leaf node of the B plus 3 also stores the data this is exactly what they're talking about these are the leaf nodes of my BP of my B plus 3 which is actually storing the row data that you have right okay but is this even making your insert update delete efficient now B plus 3 does not just contain Leaf node right it contains other nodes so what is stored in other nodes let's go through that a simple table when visualized as a B plus 3 a very simplistic visualization for you to understand the concept is this so B plus 3 may contain n levels I'm taking example of three that becomes easier to understand now here this is the root of the B plus three node then you have first level and then you have the actual Leaf notes where the actual table data is stored right here you can see how our table data is stored in B plus 3 row or rather ID 1 to 100 101 to 200 200 to 1 to 300 and so on and so forth are stored over here now let's see what each node holds in B plus b now again each of these this is a very nice visual way to see it on this they might be spread anyhow right in the table file it depends so depending on the database what storage engine they are using for example MySQL DB has my SM inodb and whatnot they may choose a different way to store this or different way to allocate this B plus T node on the disk in that region we don't have to worry about it that's the implementation specific but in a nice visual representation this is how we can visualize it very easily now every B plus 3 node is serialized and stored on the disk so all of these nodes that you are seeing they are not stored in memory they are all on disk they may be brought into memory for performance gain but the worst cases they are all stored on disk right okay non-leaf nodes they hold the routing information when I say routing information it's not something nodes routing what I'm talking about is it tells that in the child node which node holds what range of data for example here this node non-leaf node stores 1 and 1 0 1 which means that the left node starts with one ID and the right node starts with id101 right here the left node starts with two zero one and right node starts with three zero one so from 201 to 301 you can find it over this and so on and so forth right this is the offset or the routing not offset but the routing information that the range information which might be the better but range information of what the leave B plus P node holds and white so as level increases they keep holding the range information of their child right okay now for you to make any insert update delete you now you can very clearly see oh sorry before that before we go into the operation side of things as always in B plus three node will all the leaves are interlinked or they are connected linearly like each Leaf is connected like like this they are linearly connected right so the first note second note not really linearly connected and they are also part of the B plus feed now we'll see how beautiful this solves a problem right okay Leaf node as I will discussed holds the actual rows these are row ID 50150 but this is actual raw data and leaf nodes are connected linearly not now let's take a look at operations how this makes sense right let's start with find one by ID operation find one like given an ID find me the row now for you to do that in order to in order for you to find a particular row with a particular ID what would you need to do let's say I would want to find row with id4 what will I do I'll start with the root node now this node is present on disk so first thing what I'll do is I'll go to the disk read this particular block I'll read this particular block interpret the routing information and I would know that id4 is present in this range I would go over here and how would I know that I would just store the the left and the right like the least and the most value over them through which I'm writing like here 1 and 1 0 1 is what I'm storing I'm not don't want to store every single rows information right because its range is what I'm operating on correct so now here I would have information which is let me just be very specific on that I would store one 2 0 1 and 4 0 1 kind of like this information would be stored so this means that 1 2 2 0 1 if I'm requesting for this ID I have to go over here if I'm requesting something for 2 0 between 2 0 1 2 4 0 1 I would have to go over here four zero one greater than four zero and I have to go over here right that's typically what we store in B plus right okay so now what do we do let's say we are requesting for row ID 3 right so now what I'll do 3 lies between 1 and 2 0 1. so it would first read this plot understand hey I'm looking for a three it's present in between 1 and 2 0 1 so it would go to the first node here it would say 3 1 and 1 0 1 it's over here it would come over here and read so read this block come over here then read this block from the disk read all these hundred rows again it's a discrete it would read all these hundred rows and find which one is row ID3 get that and return it so now what your linear scan was is now just boil down to one two and three discrete that's it where you are doing this multiple discretes at first now it just boils down to three discretes and you get the row that you are interested in dead simple right now here you beautifully see no matter which row your request you would get it in this specification in exactly three discreets not one extra one for this block one for this block one for this block extract this row and return it to the user right this is how your reads would work right find one by ID now let's take a look at next operation which is insert now how would insert work insert what we would first have to do is let's say I'd want to insert row 4. what I will do read this block understand where the four y would lie in the leaf node I'd come over here insert or find Row 4 where 4 would lie I'd find this I would read this entire block find the place where I can place the Row 4 it would be after 3 and because I'm loading it in memory I can do that Arrangement add something in between and move other forward right I can do that array movement right you can do that if you are storing it in arrays you put it in between and then flush the entire thing on the disk which means you read this you read this you read this you put your Row 4 and then you put it and you flush this B plus T node on the disk so one disk read two disk read three disk grid update in memory and one disk flush right so this way you have your insert can happen in between you don't have to rewrite the entire file in linear sequential way that we were streaming we would have to rewrite the entire file here we don't have to here what we are getting is we are getting this benefit that without touching any other disk block any other B plus pre node I am just reading the blocks I am interested in updating them and flushing it down right and obviously the B plus T is rebalancing and all comes into the picture so that specifications of B plus 3 as a data structure go through that that's out of the scope I'm talking about how database actually stores the data right okay that's about to insert similarly you can Envision updates you read read update the row let's add one to update row 202 right let me take a different example update row 202 I would read this I would know you come over here read this I would have to come over here read this in memory now I have the entire disk the entire B plus reload in memory I would update whatever I want to flush it down on the disk right if I want to do arrange and move it and whatnot I can do all of that but without altering any other block I am just doing the updates that I'm interested I may have to split and rebalance the B plus T that is a different problem but it makes my operation so simple right similarly when you are deleting it what you do let's not want to delete row 401 I read this read this read this in memory delete this block I have literally if it's I'm storing it in an array I would just delete that part in memory and then flush it onto the disk the way I do it when I flush it the row is gone and that's my hard duration then I may have to rebalance the tree in order to maintain the M by n factor that you know like standard B plus three practices but now here you see how beautiful your operation gets right so simple so efficient without touching anything else you get the raw power of the disk right the final operation I want to talk about is range query let's say I want to find all the rows whose ID line the range 100 to 600. now for you to do this see how beautifully that that leaves that are connected that solves this problem so beautifully now let's add one to read Rose whose ID is in range 100 to 600 right okay what I'll do I'll want to read 100 right so I'll start from here where does hundred lie here I'll come over here where does 100 lie over here so let's start from here from 101 I read this block I'd say did I read 600 no so then I have to read this block but now instead of coming from here here until here because this nodes are connected like this you can linearly travel through this so I read this and they read this with this point I read this until I reach 600. as another 8600 I stopped so now what we are doing with this leaves directly connected with each other what you get is you don't have to iterate through this from top to bottom again and again once you reach at the leaf you can just linearly Traverse like this that's the beauty of this data structure now the number of discreets is one two three four five six seven the bare minimum that you have to do login to reach to the first point and then linearly depending on the number of rows that you are reaching such a beautiful implementation that's why data structures this B plus 3 is so so so powerful because of this very reason why B plus 3 forces you to store data at the leaf node now you get it B trees allow you to store data in the middle in the in the non-leaf node also but B plus T forces you to store data in the leaf node this is precise lever because it makes your range queries super efficient it gives you predictable times to read any Row from your database right this is the power of B plus 3 and this is why most databases out there they use B plus 3 to store and hold their actual data right and it's not just specific to SQL no SQL databases like mongodb in The Wire tiger storage agent they do this right this is the beauty of B plus 3 and how database and Y database stores the data in B plus 3 and this is what I wanted to cover I hope I really hope you found it interesting now I think most of your questions around why database USB plus trees to hold the data would be solved but once again explore on this in bit more detail understand B plus three in case you are unaware right but this is precisely why database USB plus T2 store the data again I hope you found it interesting I hope you found it interesting amusing sorry and uh that's it for this one I'll see in the next one thanks [Music]

Info

Channel: Arpit Bhayani

Views: 31,205

Rating: undefined out of 5

Keywords: Arpit Bhayani, Computer Science, Software Engineering, System Design, Interview Preparation, Handling Scale, Asli Engineering, Real-world System Design, System Design for Beginners, How Systems Scale, System Design Simplified, System Design Interview, B+ Trees, Database Internals, Why B+ Trees, Why database use B+ trees, Who uses B+ trees, B+ trees applications, How B+ trees are stored on the disk, How database performs insert update and delete, why is database fast

Id: 09E-tVAUqQw

Channel Id: undefined

Length: 29min 43sec (1783 seconds)

Published: Sun Mar 19 2023