How row oriented and column oriented db works?

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hello everyone my name is Marion any decision let's understand what exactly is Roe oriented database and column-oriented database is all about so if you go to any database website they usually mentioned that particular database is either row oriented database or column-oriented database but have you ever wondered how the performance of your application will get it packed it by choosing row oriented database over column-oriented database or vice versa so to understand that in this session let's discuss how exactly the data of your database is stored in the memory and how raw or into database handles it and how column-oriented database handles it by understanding that you will exactly can search for any given use case what kind of databases you have to use in your systems so let's go back to white board and understand that so I have taken one use case suppose you have an application called user management system and obviously will have a database or a table for to save all the users information so let's take this as a users table and the other table which I have is the dashboard or a back-end admins dashboard data so this particular data contains different minutes as a time and the total number of requests are the traffic to your system and also the total number of 500 errors which you saw in the system at any given point of time in in minutes granular time so this database is used to get the data for your dashboards so the dashboards will usually have all the traffic information say this one will be a request or minute in any given duration of the time so usually in the user management system how we use this users data is will usually have forms in which you show the different fields and we usually the queries the kind of queries we actually execute on this particular table is say if I want to show our users specific profile information what I what I usually do is I basically query data for that user I basically query at Suffolk record of the user by the user ie so in this case user ID 1 and I get all of this information basically when I'm querying I get name place age and even many attributes specific to that user I pull that record and show that on the form so it would be like a UK and trend is what we show in the form now the user can update this data and and save it while saving we usually use transactions so usually this kind of workload is called as transactional workloads and we're in this kind of workload where you have a lot of data and we usually fetch a specific column or one or more columns in the table with any given duration in this case if admin wants to see all the total number of traffic all all the traffic to my system he usually choose a duration say I want to see all the traffic for past friend for us in that case maybe the query looks something like ok 10 a.m. today - 10 a.m. yesterday so we'll have to fetch all the records from 10 a.m. today to yesterday 10 a.m. and we are usually interested in specific columns data we won't be fetching all of that all of the tables data we are interested in specific column one or more column it could be when we are showing one or more dashboards or one or more graph so in this case I'm just interested in the traffic to my system so I usually query all the requests per minute column from 10 a.m. to 10 a.m. yesterday so I usually get all of this data over here ok so if you see the way we are accessing data clearly sees that here we are accessing row wise here we are accessing column wise so to make your application perform better we have to save the data in the same fashion okay even though we need a table representation and a row like abstraction here and here both the cases here we usually access a specific record we there are very rare cases where we access giving all the you know places of the users or give me all the ages of the but here we will usually do that give me all the records of a specific column within a duration but here we don't usually fetch a data give me a record of specific ie - and all this metrics because that's not really our business use cases so business use cases is what actually decides what kind of data stores unit used whether it's a row oriented datastore or column-oriented data store or databases so if you see here it's a row written now if you see here it's column-oriented so this is how you basically decide which one to use now to make your application perform better we have to save the data in similar fashion if this database is saving this all data in memory it should save all of this attribute together to make the performance better so when we access the data we can fetch all of that information together so usually so if if this was the data if this was a row oriented database how this database saves is one a UK and 10 or if this information together this also together to be us and 20 all of this together in the database file so when we access the specific users information we get all of this information together and it's much faster so in this case we shouldn't be saving this data like this so how we should save this data in this table in the memory to make the performance better so in this case the database is columnar database how it actually saves is it saves all of this column data together so how it looks is so only this information together three hundred three thirty two hundred all together these columns data is also saved together one two ten and whatever the data which was supposed to be here it all is together so if I want to render you know error by time duration then we'll be accessing all this data together so what if if this data was stored like raw oriented data then what would have happen in that case so how how this data is stored in the day is as it looks like here then 301 1001 330 and 2 and so on so now if I want to get the same data to render request for a minute then I'll have to go every row keep on going until the last file and just grab this information over here over here over here over here so if you look at the time complexity it is like you are touching all of the rows in the table so it will be like art of n time complexity so that's like bad right so by saving this data by one span or one you know you will have to obviously keep reading but all of this data is together you are actually reading very less data because in this case we will roll either we would have read every row and then we would have fetch the specific column but in this case we are just fetching the columns data n which is stored all together so it is much faster let's understand how this data is also fit into the hard disk HDD or spinning hard disk to understand even much better let's understand how the data is written into disk so the first question usually is why are you still using HDD that is hard disk drive so this is a disk what can't you just use solid-state devices are Ram because they are much faster than hard disk and you don't need to worry about too many problems be spinning disk poses because stds and rams are much costlier even the distributed systems definition says that we build the distributed system using commodity hardware when it says when it comes to hard disks the commodity Hardware is history because they are much cheaper and also if you see our big data of workloads that is huge reader like petabytes of data and hardness are much suitable because they are much cheaper so we can still store much and much more data using these not cheap hard disk so the next question you need to understand is what is a memory block a memory block is a unit of a memory which usually has a fixed size and when the hard disk reads the data it either rates it completely or it doesn't reach at all and why do we need a blocks is basically to distribute the data much better in the hard-disk or in the memory so operating system will actually take care of you know distributing the data into memory whatever in the in the disk we usually call these as sectors but sectors inside will actually have blocks or whatever but for understanding we will consider these as blocks and see how the data is written so the next thing you need to understand is how I go is performed on the blocks so as I said when the i/o is happening there is a head in the hardest it actually moves and then reads the data from these blocks it either reads it completely or it doesn't read anything so usually these kind of hard disk prefers to read sequentially like he keeps on reading and reading reading and the performance of you know randomly seeking or randomly reading the data between the blocks is not really good like for example reading the data over here and then jump over here read some data and bring some bit over here read some bit out here and all like that so usually these kind of dis prefers to be reread the data in a sequential manner because it is much better so that said let's understand how the data is written in case of row oriented databases let's take an example so this is an IE so user ID a and age is 20 okay and the place is UK and how and let's take one more example 30 and u.s. so how is this data written into the risk as I mentioned earlier that we have to say this together so we can find all of this different attributes of a same user together so this data are usually stored something like this in the in the in the block a 20 and UK and one more thing you need to understand is there is a fixed size for the block usually how databases prefer is to store record into a block so so it could be saving something like this and this next record will go to one more block to be 30 and us maybe if this could have ready to hear completely then this day develop also written here since there's not enough space maybe chose to write it over here and similarly if you have n number of our know Rose over here it actually keeps on writing over here and when you want to update specific record what happens is heartless basically goes there and then updates okay and then if you want to update one more it actually goes here and updates something like this so this is like seeking between data of blocks or memory blocks to update so that's why if you know about right ahead loss they usually doesn't update the record in the memory they keep appending the updated data into end of the file because for the simple fact I explain instead of going and updating somewhere here they keep updating the updated data over here over here over here so this this keeps on spinning in a sequential order instead of just seeking somewhere else so the performances will be much better in the databases which actually use right ahead logs kind of implementation but that's like out of context right now so this kind of data is written like this so if you see here we are not efficiently using the memory so the memory consumed in row oriented database is little higher because we are not fully writing into the complete block and also when you're reading the data you have to read the complete data out of it so even if you want to just read name okay just the name with a ID one we still have to go here read the whole block so we are essentially reading all the attributes for that users record so there is more I go operation you are doing even though this is one I go operation you're reading more data from the disk and loading it to RAM and then just discarding everything except a because you can't just read this part from the block because blocks are designed that way when the Iowa happens it should read everything at once so if you had analytical workload the queries are something like we have to just read Pacific column of data in more number of rows so that what and since the disk will go here and real Jeff just to read a user it should read everything go here just to read I use the name it has to read everything so storing data in this way is not really good for analytical workloads where we just need just one column of data for many number of rows but storing the data in this way is really good for transactional lord because once we go here we will read the complete data and then utilize it or even if you want to write we just go here and update over here and it's good so let's understand how the data is written into the disk in case of analytical workload or columnar data so let's take an example of columnar data say we had ID 1001 we had like requests per minute that's 200 and failures are five to one more record 1002 300 requests per second and there were like four failures how this data is written into the disk so in case of Road into data we saw that everything is written but but in this case it is saved in the columnar database so what it does is it shows it actually gets all of the data for a specific column and keeps it in sequential order so in this case it saves all the time data sequential e 1002 I'll take more data just to show a little bit more so if I have something like this it actually saves 1003 and so consider we only have three rows now so it actually it's the next column and fills it in the next sector somewhere somewhere here like 200 300 and 350 something like this and then the next column ok start from somewhere else or maybe I'll just pick it here 5 4 and 6 ok so this way it actually stores all the column data together so when we want to read now if I just make a query ok get me all the requests per second or requests per minute data from 10 a.m. to 10 maybe we have three regards what happens is it just goes here and when it leads this block it actually gets all the information which we needed or maybe if this data was spread into multiple blocks then all it has to do is keep on seeking seeking is much faster it keeps on seeking through multiple blocks and it reads all the columns which we needed here if you see the memory which we read is completely utilized we don't need to discard anything because this is the column we are interested in so when it reads we are reading everything we need in that specific column and I fit much efficiently using spilling hard disk so that way it is much efficient and performs better and also if you see the data in the block is completely consumed there is no space left even if there was a space lab when the next data comes in we can still fit in here because we are going to read it all together so the space utilization is much efficient in this case of columnar oriented databases and also one more important thing to know is when we have similar kind of data in each block operating system our databases can use a specific compression algorithm to store this data and compress and store this data in blocks in case of I know row oriented database we didn't have a homogeneous data type data in a block we had a mixture of data because say for example just to represent the row oriented date of block what we had is ID name of a user age of a user and place so if you see there is this in and this is string this is again maybe if it was a int or load or whatever in the string so if you see there is heterogeneous data types we can't use a efficient algorithm to compress the data because of the data types are different so in this case all of the data type is similar it could be like if the date and time was represented in a string a string string string and here if you see int int int so we can use int integer compression algorithm better and then come the data and save it here so if you see the facts you know the compulsion the space utilization is like we can save three times more data in columnar data than the space consumed in row oriented databases so that's a kind of performance we get when you are using columnar oriented databases so one more thing you need to understand is when it writes any row say for example when the new row comes in 1004 407 so this is the record so this is the new record we need to insert it how it dual Dez's if it was a rower integral base it just takes all of this and puts it in one block but since it is called Cardinal orient data we'll have first the data base splits into each separate column so it gets four separately in zero for supper key 400 separately you know separate chunks and it goes here so wherever this idea was written I didn't show it okay it's fine so let's just write a date our time it goes it takes this data and goes to the block wherever it was if it is full it just right over here 1004 and now we need to update the request per minute data so it goes here and prints in over here and this one goes here over here so this is how it actually does it just that it has to break and then write it it could take three different IO operation in case of columnar databases but if it was a row oriented database with one I've operation Ephrem written all of this information here it's like it took three times more i/o operation but if you see the you know analytical workload once we write the data we never update it because this is all kind of you know historical data right who will go back and update the request per minute or the number of failures in back in time we don't do that right so it's okay even if we sacrifice couple of I goes when you are updating the data it is efficiently written in the disk so the reads are much optimized because we can read it much more efficiently so that's why when you are deciding which kind of you need to use or is it row or a column or a oriented database you need to understand what is the business requirements if your business requirements require more of a column reads too much of a column reads then it should we should actually go for corner databases or if your databases is transactional workload where you have to specifically weed lock it and record and read that information it's better to go for row oriented databases and also one more last thing you need to understand is how the partitions are done so in case of row oriented databases you'll be basically partitioning by a rose okay like how we doing my sequel but if it if in case of columnar oriented databases we basically partition by columns itself that means that all the data for this specific column is save in one machine all of the data for this particular column is saved in one more machine called this caller data for this column might have been saved in a different machine if it was a row oriented database we basically split the records by kind of values say all the data from one to three store it in one machine all the data from four to ten storage into another machine and all the data beyond that soul into a different machine so that's one more important differentiation I guess I have explained all the important things you need to know about row and column oriented databases internals so if you like this video please subscribe to the channel please tell your friends about these content and please leave a comment if you think that I should be improving something or if you want some your topics of video to be made on this channel thank you
Info
Channel: Tech Dummies Narendra L
Views: 40,946
Rating: undefined out of 5
Keywords: Amazon interview question, interview questions, interview preparations, algo and ds interview question, software interview preparation, developer interview questions, Facebook interview question, google interview question, Technical interview question, software architecture, system design, learn System design, row vs column oriented db, how database memory works, row vs column memory management
Id: uMkVi4SDLbM
Channel Id: undefined
Length: 21min 26sec (1286 seconds)
Published: Fri Apr 17 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.