Understanding Database Trade-offs - The RUM Conjecture - Paper Explained

Captions
So whenever a database is designed or conceptualized, we try to minimize three things: the read overhead, the update (write) cost, and the memory or storage overhead. This is what is called the RUM conjecture. When we say a database is "designed", we mean we are defining how the data is organized on disk or in memory, how it is laid out, and how it is accessed. Whatever data structure or algorithm we pick while designing the database, this conjecture holds.

Now what does the conjecture say? It says that when we try to optimize two out of these three, the third is impacted negatively. If you are building a system that is optimized for reads and writes, you have to take a hit on storage or memory consumption; if you are optimizing for reads and memory, then writes need to take the hit. In no way can we get all three, so at most we get two out of the three.

Let's take a practical example to understand. Say we are designing a database in a log-structured format, which means whenever I insert a record, it simply gets appended to a file, one record after another. If you look carefully, because records are just appended to the file, updates are highly optimized. Now, if we also want to minimize the storage requirement, we give up the additional index we would otherwise create on top of this file, so we are optimizing for memory as well. But as soon as we optimize for writes and for that additional memory, look carefully at how reads would work: if I am looking for a particular key, I have to go through the entire file, record by record, until I find it. So by optimizing for updates and for memory, I am giving up on read cost. (A small sketch of this append-only design follows below.) In no way can we get all three: read, update, and memory form a competing triangle in which one has to be given up in order to maximize the other two. At most we can get two out of these three, and that is the entire RUM conjecture. When you are visualizing it, it is very easy to picture it as a triangle.

Now we'll take a look at different databases taking different trade-offs to optimize for a certain type of workload. Let's start with read-optimized workloads. When we talk about databases or data structures that are read optimized, they are optimized for low read overheads, and they choose to trade off additional storage and write cost to get there. A practical example: B-trees, which most relational and a lot of non-relational databases use to store data, are heavily optimized for reads, so that you can do an O(log n) read when you are reading by a primary key or doing an index lookup. When we do this, we take on the additional intermediate nodes, that is, we need an auxiliary data structure to manage, and writes become more expensive because multiple disk blocks now have to be accessed and updated to reach the right place. Tries and skip lists also fall into this category. (A rough sketch of this read-optimized trade-off also follows below.) So when we are optimizing for one thing, we have to give up something else. In some cases, say I want to give users great read latency by making the data redundant, so that I can support a large number of reads. In that case I am increasing my storage requirement, and my writes now have to go to multiple places: I am doing dual writes or re-writes in order to improve my read latency. So this competing triangle always comes in.
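Below is a minimal Python sketch of the append-only, log-structured design described above. The class name and API (AppendOnlyLog, put, get) are hypothetical, and an in-memory list stands in for the on-disk file; the point is only the asymmetry: O(1) appends on write, a full O(n) scan on read.

```python
# Minimal sketch of a log-structured key-value store (hypothetical class,
# not from the video): writes are O(1) appends, reads scan the whole log.
class AppendOnlyLog:
    def __init__(self):
        self.records = []  # stands in for an append-only file on disk

    def put(self, key, value):
        self.records.append((key, value))  # cheap: just append

    def get(self, key):
        # No index, so we scan every record; the latest write wins.
        result = None
        for k, v in self.records:
            if k == key:
                result = v
        return result

log = AppendOnlyLog()
log.put("user:1", "arpit")
log.put("user:1", "arpit b")  # an update is just another append
print(log.get("user:1"))      # "arpit b", found via a full O(n) scan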
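And here is a rough sketch of the read-optimized side of the triangle. A sorted in-memory index built with Python's bisect module stands in for a B-tree (which is disk-oriented and far more involved): lookups become O(log n), but we pay for the auxiliary structure in memory, and every write pays to keep it ordered, loosely analogous to a B-tree write touching multiple blocks.

```python
import bisect

# Illustrative stand-in for a read-optimized index (not a real B-tree):
# O(log n) lookups, at the cost of extra memory and costlier writes.
class SortedIndex:
    def __init__(self):
        self.keys = []    # the auxiliary structure: extra memory
        self.values = []

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value  # overwrite existing key
        else:
            # Insertion must keep keys sorted: O(n) shifting here,
            # loosely analogous to a B-tree write touching several nodes.
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)  # binary search: O(log n)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None
```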
Now let's take a look at write-optimized workloads. As we saw with log-structured storage, where we were simply appending key-value pairs one after another, we typically do the bare minimum on the write path. For example, LSM trees accept writes in memory and then periodically flush them to disk. This is very write optimized because we are writing to memory instead of disk, but when we do this we require an additional buffer before we store the data, so an auxiliary storage requirement is added. You are optimizing for update cost, but you now require an additional buffer, an auxiliary data structure, to do it. So we typically get great write latencies, be it with in-place or auxiliary structures, by trading off read latency and memory overhead, again forming a competing triangle. You get some, you lose some; that's the ultimate reality. (A toy LSM-style sketch follows below.)

The third category, if you look at it, is space optimized, which means we do not want to give up additional space: we want to be highly space efficient. For example, Bloom filters. What are we giving up in the case of Bloom filters? We are giving up accuracy. It is not strictly part of read or update cost, but again something has to take a hit, and here accuracy takes it, because false positives are possible. Take another example: a sparse index. If you create a dense index, you would have one index entry for every single record, so it is not space optimized. A sparse index reduces that, but now you are increasing the number of lookups you need to do to reach a certain key, because you have to do multi-level lookups. This is a case where you are optimizing on space, so you have to make your reads slightly more costly. You get some, you lose some; these are space-optimized workloads. (Sketches of a Bloom filter and a sparse index follow below.)

But there is a fourth category, which sits right in the middle. We looked at examples of read-optimized, write-optimized, and storage-optimized workloads, but there is a class of data structures and algorithms that fit in the middle region: adaptive access methods. These are workloads that are adaptive in nature and can adapt depending on the workload or its access pattern; flexible data structures designed to solve exactly this problem. You can choose to go deeper into this; a few examples are database cracking, adaptive merging, and adaptive indexing. (A simplified cracking sketch follows below.) In most cases these come down to a bunch of tunable parameters that you can use to tune existing systems: below a certain load do this, above a certain threshold do that. There are some data structures that adapt on their own, out of the box, without needing to be tuned, but in most cases it is the tunability that makes it possible for a database, a data structure, or an algorithm to adapt to any kind of workload. And this entire thing is the RUM conjecture: it is not proven, but it is something that is observed, which is why it is called a conjecture.
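Here is a toy, illustration-only LSM-style store in the same spirit: writes land in an in-memory buffer (a memtable) and are flushed as sorted, immutable runs; reads have to consult the memtable and then every run. Real LSM engines add compaction, write-ahead logging, and per-run Bloom filters, none of which is sketched here.

```python
# Toy LSM-style store (illustrative only): fast memory-buffered writes,
# at the cost of an auxiliary buffer and slower, multi-run reads.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}   # the auxiliary write buffer
        self.sstables = []   # each flush produces a sorted, immutable run
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # fast: memory-only write
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Newest run first; in a real LSM this would be written to disk.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Reads pay the price: check the memtable, then every run.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:
            for k, v in run:
                if k == key:
                    return v
        return None
```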
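Next, a bare-bones Bloom filter sketch, assuming hashlib for the hash functions and deliberately tiny parameters so the space/accuracy trade-off is visible: a "no" answer is always correct, but a "yes" can be a false positive.

```python
import hashlib

# Bare-bones Bloom filter: very space-efficient membership test that
# trades accuracy -- no false negatives, but false positives are possible.
class BloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size  # the entire storage: just `size` bits

    def _positions(self, item):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def might_contain(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("alpha")
print(bf.might_contain("alpha"))  # True
print(bf.might_contain("beta"))   # usually False, but may be a false
                                  # positive -- never a false negative
```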
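And a small sketch contrasting a sparse index with a dense one: instead of one index entry per record, we keep one entry per block of a sorted file, so the index is much smaller but every lookup pays one index probe plus a short scan inside the block. The data layout here is invented for illustration.

```python
import bisect

# Sparse index sketch: one (first_key, offset) entry per block of a
# sorted file, instead of one entry per record as in a dense index.
records = sorted((f"key{i:03d}", i) for i in range(100))  # the "file"
BLOCK = 10
sparse = [(records[off][0], off) for off in range(0, len(records), BLOCK)]

def lookup(key):
    # Binary-search the small index to find the right block...
    keys = [k for k, _ in sparse]
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None
    _, off = sparse[i]
    # ...then pay an extra sequential scan inside that block.
    for k, v in records[off:off + BLOCK]:
        if k == key:
            return v
    return None

print(lookup("key042"))  # 42, via one index probe plus a short block scan
```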
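Finally, a highly simplified sketch of database cracking, one of the adaptive methods mentioned above. This is a loose illustration of the idea, not the actual algorithm from the literature: the column starts unsorted, and each range query physically partitions ("cracks") the data it touches, so an index emerges incrementally from the workload itself.

```python
# Loose illustration of database cracking: range queries incrementally
# partition the column, so later queries on similar ranges get cheaper.
class CrackedColumn:
    def __init__(self, values):
        self.values = list(values)
        self.cracks = {}  # pivot value -> index where values >= pivot begin

    def _crack(self, pivot):
        if pivot not in self.cracks:
            # Stable partition around the pivot (one pass over the data).
            smaller = [v for v in self.values if v < pivot]
            larger = [v for v in self.values if v >= pivot]
            self.values = smaller + larger
            # Recompute all crack positions for simplicity; a real
            # cracker maintains them incrementally.
            for p in list(self.cracks) + [pivot]:
                self.cracks[p] = sum(1 for v in self.values if v < p)
        return self.cracks[pivot]

    def range_query(self, low, high):
        lo = self._crack(low)      # the first query pays to reorganize...
        hi = self._crack(high)
        return self.values[lo:hi]  # ...repeated queries become cheap slices

col = CrackedColumn([7, 2, 9, 4, 1, 8, 3])
print(col.range_query(3, 8))  # [7, 4, 3] -- order within a piece is arbitrary
```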
So this is a very nice framework: whenever you are looking at any database, anywhere you are storing things and accessing things, it is a very clean lens to apply. Is the database you are looking at read optimized, write optimized, or memory optimized? What is it trading off and what is it not? What are its strong suits and its weak suits? It helps you go deeper into the database you are already using, and whenever you are making a choice to pick a database for your architecture, you can make a very informed decision. So whenever you are reading about any database, try to categorize it: what kind of trade-off is it taking, what is it gaining, what is it losing? Then you will form the understanding to make the right choice with your database. This entire video is a very short summary of a paper called "The RUM Conjecture"; I'll link that paper in the description down below. Do go through it; it covers some other things in detail which I'm sure you would love to explore. So yeah, this is all I wanted to cover in this one. I hope you found it interesting, I hope you found it amazing. That's it for this one; I'll see you in the next one. Thanks!
Info
Channel: Arpit Bhayani
Views: 10,546
Keywords: Arpit Bhayani, Computer Science, Software Engineering, System Design, Interview Preparation, Handling Scale, Asli Engineering, Architecture, Real-world System Design, database trade-offs, database taradeoffs, LSM TRees, How to pick a database, database in system design, how databases are designed, database internals, distributed databases, how databases work
Id: ZxvulmKXIto
Length: 8min 17sec (497 seconds)
Published: Fri Feb 23 2024