Introduction to NoSQL databases

Video Statistics and Information

Captions
Hi everyone. Today we'll be talking about where NoSQL databases can be used and why they're so popular.

Yes, we will be talking about NoSQL databases and why they're the best in the world, and we're really excited about actually talking about this topic.

Yes, there are a lot of scenarios where NoSQL databases are used. There is an equally high number of scenarios where NoSQL databases are not used, so it's important to know when to use them and when not to use them.

Well, yes: whenever you're building toy apps, that's when you need your RDBMS, but whenever you need scale, it's NoSQL databases.

All right, that's not entirely true, actually. It's not that scalability demands a NoSQL database. There are certain scenarios where these databases tend to do well, and we'll be getting to them in this video.

Could you give us an example?

Well, YouTube doesn't use NoSQL databases. Stack Overflow doesn't use NoSQL databases. Instagram doesn't use NoSQL databases. WhatsApp doesn't use NoSQL databases.

WhatsApp doesn't have a database. Let's start with the video.

So what is the difference between SQL and NoSQL? Well, have a look at the database schema that we have for an example of a person. They have an ID, that's the user ID, and we have the name, address, age and role. Now the address is a complicated object, so the way I am going to be storing it is in a separate table. The address ID 23 corresponds to this row, which means that the address is Munich, in Germany, and the district is blank. So you're seeing that there is some sort of foreign key mapping here, and that's how we store data in SQL.

This is how you store it in NoSQL. You have the ID 123, and you just have this big fat blob of data. This is JSON, and the way it stores it is that the column name maps to a value: the name is John Doe. The same thing over here: the address is no longer a foreign key; the address is another object within this object. So that is
just JSON nesting. We have the address ID, city and country, and because there's a null value for district, we actually don't even store that. We have the age and the role also defined, and you are seeing that this is this big blob of data over here.

So what makes NoSQL so efficient? The key is to think about how we are storing and retrieving data. When we are storing data, it's usually never the case that a user registers and sends their age later on, or the role later on, or the address later on. It's all together. So when there is an insertion, usually all the fields are inserted together, which means that this entire fat blob could have been written at the API. When the request was coming in, all this information was there, and you could have done this in a single insert.

And whenever you are pulling out information about any user, usually you will need all the information about that user. SELECT * is so common that people don't even think about listing the column names these days, unless of course it's a very big table, or you have some column which is pretty big and you want to avoid it; that's a separate scenario. But usually SELECT * is very, very common. Because you need all the relevant data of the user all the time, this entire blob will also be pulled out all the time. So insertions and retrievals both require the whole blob; why not keep it together?

You see, when you're running a query on the SQL database, usually the pointer comes to, let's say, this ID, this row, and then it has to sequentially read all these columns. Not just that, you also don't have a clean way to denormalize things. This address column could store the string, but this database is not built for denormalizing things, so you might need a join, which is pretty expensive considering that most of the time you'll need both pieces of data. It's cheap over here, and that's the first benefit of using NoSQL: all your relevant data is contained together in one block, so it's a little easier to insert and retrieve.

The second thing is that this schema is flexible. We saw that the district was null, and in the SQL approach we have to add a column for it; although we don't need it, we still do it. Over here, if the address is null, if the address is entirely blank, that's fine, because this blob doesn't care about schema. All it cares about is a JSON document. So there'll be name John Doe, comma, then straight away age 30, and the role is SD. So the schema is very flexible in this case, and not so much over here.

In fact, whenever you are adding a new attribute, let us say salary, we have to actually add a new column to this SQL database, which is a very expensive operation, because you need some sort of locks on the table, and it's also risky to maintain consistency at this time. I mean, if you want to maintain consistency then you need the locks, and that's the reason why it is expensive. Over here, if there is something that you're adding which you don't need for all the older users, you can just start adding it straight away, because like I said, the schema doesn't care. The older schema doesn't know that there's anything called salary. Okay, so the second advantage is that the schema is easily changeable.

The third advantage of NoSQL databases is that they have horizontal partitioning built in. Most of the time they expect a lot of scale to come in; I mean, the users of these NoSQL databases expect a lot of scale, so what they tend to do is horizontally partition this data. You can have a look at the sharding video to get a better understanding of horizontal partitioning. And of course, when it allows this kind of partitioning, it's more focused on availability, which is a good thing. A lot of
systems actually require availability over consistency. Yeah, so that's the good thing: built for scale.

The fourth and final major advantage of NoSQL databases is that they're built for aggregations. When a person is storing data in a NoSQL database, they're usually expecting to get some important information out of that data. For example, what is the average age? What is the total salary? These kinds of databases are built for finding metrics and getting intelligent data. So that's the fourth one: built for aggregations.

Okay, so these are the advantages that we have with NoSQL databases. What are the disadvantages? Not too many updates are inherently supported, so if you have lots of updates, this is not really nice. What are the possible problems here? Well, the data may not be consistent, meaning that two nodes may have different data for the same ID, while SQL databases usually give you something called ACID properties, by which you can contain this issue. So that is a problem, and I'll just write it down: consistency is a problem, which basically means ACID is not guaranteed. If ACID is not guaranteed, you can't have transactions using NoSQL databases; at least you can't have the same transaction properties as ACID. That's a big reason why financial systems don't use NoSQL databases for their transactions, because it doesn't make much sense.

The second problem is that these databases are not read optimized. If I asked you to find me the ages of all the employees that we have in the company, what's going to happen is it's going to go to these blocks, and each time it's going to read the entire block, then filter out the age, and do that for every row, then return you the result. In a SQL database, all you need to do is come to this column. I mean, it won't be that easy; the reader has to actually go to that column and then read that column, but this is more efficient than that. So these are not read optimized. Read times are
comparatively slow.

Of the last two problems I can see here, the first is that this does not have implicit information about relations. In an RDBMS, the R stands for relational. Now 23, the address ID, maps to this point, and what that tells you is that this row is somehow related to this row across the two tables. In NoSQL there's no easy way to do this. If you had a separate table for, let's say, all the values of the addresses, then the information would be implicit; you couldn't enforce a constraint like a foreign key constraint, which would say that this column 23 can only exist if there is a corresponding column in the employee table. So relations are not implicit.

The fourth and final problem, which is a major problem, is that joins are hard. If you have two NoSQL tables, let's say, then when you're joining those two tables, what you need to do is run through every block of data here, find the relevant column on which you are joining to the other one, whose relevant column you need to find again, of course, and then you need to merge them together. Joins are actually all manual, so to speak, in a NoSQL database. There's no intelligence behind these kinds of joins. You can try to improve on them, but there's only so much you can do. SQL databases are, to some extent, built for joins. You have inner join, outer join, left outer join, all the things that we didn't read in college. Those kinds of things are very common when you are dealing with SQL databases, because they have inherent relations in them.

So these are the advantages and these are the disadvantages that we have with NoSQL. When do we actually use NoSQL? Well, it depends on these things, of course. It depends on whether your data is a block, you are making few updates, and you want to keep all of it together. If you're building something which has to be write optimized, where there are a lot of writes coming in, maybe NoSQL is the way to go. There are scenarios where you might
want inherent redundancy or aggregations in the data, in which case NoSQL provides that for you in a really nice way. Of course, you can see all the disadvantages, and that's one of the reasons why applications like YouTube or Stack Overflow still don't use NoSQL databases. But it's really nice, and we are going to be taking an example of Cassandra to understand these databases in detail.

So this is the Cassandra architecture that we'll be talking about. The requests will be coming into this Cassandra cluster, which is going to have five nodes, and it's a pretty expensive thing to actually host a Cassandra cluster. It's going to have request IDs distributed across this cluster, so any request between 0 and 100 will fall on node 1, between 100 and 200 will fall on node 2, and so on and so forth. So there are five nodes, and you can see that there's a request ID 123, so it should fall somewhere over here, or rather, it should fall somewhere over here.

Request IDs may not always be numeric. It might be a UUID, or it might be a person's name, or something like that. So instead of thinking about IDs, in NoSQL databases things are often considered as keys. Have a look at the sharding video too, to get a better understanding of how these keys are mapped, but basically we just take the hash of 123. It's passed through a hash function (this might be a string, this might be anything you like) and we get a value, which I'm going to take as 256. This hash is then used to map this request to a particular node in this cluster. 256 falls between 200 and 300, so it falls over here. Or rather, I should take it as: wherever it falls, I'm going to take the clockwise next node, so I'm going to pick up 4.

If the hash function is nice, meaning that it's uniformly distributed, what we can assume is that if there are a lot of requests coming in, they'll fall with equal probability on any of the nodes. So all the
nodes should have an approximately equal share of the load. The advantage of this is that if you have a lot of requests coming in, you of course want to make sure that all the nodes are being used to their full capacity. Because this is a random distribution, all of them will have equal load, and they can actually go up to their full capacity instead of one node having too much pressure.

When can one node have too much pressure? When your hash function is not really nice. Let's say your hash function is that anything less than 100 maps to 0 and anything greater than 100 maps to 1. What will happen is that all requests greater than 100 (the range is from 0 to 500, we said, so that's around 400 of the requests) will fall on hash value 1, which let's say falls on node 2, and the other requests fall on node 1, and the rest of the nodes are not even touched. In this case, the moment you hit your load limit for 2, your entire cluster is effectively fully loaded, because 2 is going to crash.

To avoid this you need a good hash function. Or, if your hash function is bad and you can't change it for some reason (maybe you can't do a hot change of the hash function), you can do something like a two-layer cluster. When the request goes to 2, it doesn't actually store it in its own database. Instead it sends it to another cluster, which has five nodes, and you run this request through a different hash function, H dash, which gives you a different value. So this hash function is bad, but this hash function can be really nice, with a uniform distribution, and therefore these five nodes will have an approximately equal distribution. Using this technique of multi-level sharding, so to speak (just have a look at the sharding video), you should be able to survive. But of course this is not a very good idea. Why have multiple levels of hash functions? Well, why not? If you are a user, let's
say of Google Maps, and you're in India, then maybe your hash function over here is hashing on the basis of country. The country ID is the only thing that you're looking at, and if you are sending requests to one place based on country ID, it's possible that one of the countries is going to have a tremendous amount of load for certain festivals. Let's say on Diwali everyone is using Google Maps, everyone's going somewhere. What will end up happening is that this node will have too much pressure, and in those cases what you can do is go for multi-layer sharding. All right, so these are the major advantages of using a hash function.

The thing that comes in very intuitively with the hash function is that you have a node where you are going to be sending the request, let's say 2, and you also want to make sure that this data is persisted in a way that you don't lose it if 2 goes down. Because this is important data, if 2 crashes you don't want this data to be lost from the entire cluster. So you want to make copies of it; you want to make replicas of that data. Who do you choose to hold those replicas? Because of this hashing concept, you can just ask 3 to also have a copy. If the request falls on 2, the node after it should have the copy. If the request falls on 5, 1 should have the copy.

You have two nodes which are storing the data, which means that the probability of you losing the data is lower. Also, when a person is making a query, what you need to do is hash this request, figure out where it falls, and any one of the replicas can actually answer. So if you are keeping two replicas, 1 or 5 can answer; if you are keeping three replicas, then 5 or 1 or 2 can answer, and so on and so forth. So your read queries are optimized, and your writes are also more guaranteed, and could also be optimized, because if 5 misses the write then you can just write it to 1 and keep working. So through this, Cassandra gives us two
features. The first is load balancing (you can have a look at the description for a good link on this; it's a video in the system design playlist), and the second is redundancy. Redundancy, or let's say replication, is slightly different, but it gives you a data guarantee and it gives you speed in reading. Because, like we said, we are going to be distributing the reads and making sure that the writes happen really well, we have both of these features in the Cassandra cluster.

One of the very important concepts when it comes to NoSQL databases is the idea of distributed consensus. What I mean by that is: there are five nodes, and let's say the replication factor is three. So if a request falls on 5, then 5, 1 and 2 are going to be copying that data. They need some sort of mechanism to agree on a particular value to return to the user. Why is that the case? Well, let's say I write on 5, so there is some data appended here. Concurrently, I am going to be writing on 1 and 2 also. However, let's assume that 1 and 2 are a little slow, so they actually haven't got the write yet.

If that is the case and I make a read operation now (let's say I added my profile on 5, and I'm expecting 1 and 2 to have it too), I make a read operation on my profile and 5 crashes. Nothing to worry about, because 1 and 2 should have all the data that 5 has. So I go to 1 and ask for my profile. It doesn't exist, so 1 returns an error. The application now assumes that this profile doesn't exist, so it returns a "user not found" error. I will get confused: I just made my profile, so why isn't it there in the database?

To avoid these kinds of issues, what Cassandra should be doing is returning a database error, so that the application knows there is something wrong in the database and tells the user: hey, there's something wrong with our database, wait for some time. To do that, what we need is some sort of distributed consensus, and one of the ways to achieve this is
quorum. Quorum is a way in which multiple nodes that are related to a particular query accept a particular value; they decide on, or vote for, a particular value. What do I mean by that? Let's say 5 did crash, and we went to 1. 1 said "I don't have this data" and 2 said "I have this data" (let's say the concurrent write did reach it). If that is the case, I will be picking up the data with the latest timestamp (the version ID, the timestamp, whatever you like to call it) and returning that to the user. In this way the user is happy that the profile is created.

However, let's assume that even 2 does not have the profile created yet. In this case both of them will agree that there is no profile created, and unfortunately the user will be given a "no user profile found". If the quorum value is equal to two and the replication factor is equal to three, that means that if two of the three nodes, which is a majority of the nodes, accept a particular value, then we take that to be the truth. So in this case, unfortunately, if 1 and 2 both do not have the writes replicated on them, that will result in a wrong error being sent to the user.

We do mind this a little bit, but it is really rare. The possibility of 5 crashing and 1 and 2 not having the writes before they get a read operation is really low. So this is a risk that we are willing to take when choosing a NoSQL database, and we just move forward with availability instead of consistency.

What are the other scenarios? The good scenarios are that 1 has it and its timestamp is more recent, so 2's data won't be taken. Finally, maybe both of them have it. Why don't we become optimists as engineers? That would also result in the correct data being returned, and that's the reason why quorum is an important concept. What it allows you to do is take a risk, but in most cases it is correct. A quorum of two is highly unlikely to fail.

What if I make it a quorum of three, so that three nodes have to agree with the
replication factor of three? In this case this query will fail, because 5 has failed. You need three nodes to agree on a value. 1 and 2 will return a particular value, but 5 is not returning a value, and therefore the query fails.

I'm also taking a special case where I am picking up the latest timestamp. If the quorum factor is equal to two, it's very, very likely that unless both of them agree on some value, you're going to fail the query. I have taken the timestamp because in this case it clearly shows that you can still work around a one-versus-one split, where basically there's no majority. But most of the time it's going to be: if they don't agree on a value, just fail the database query and tell the user that we are unavailable for some time for their particular request.

Now, if you want the details of how quorum works, I'll be making a video on this a little later. It's distributed consensus, so there'll be Paxos, there'll be the gossip protocol. But in general you can just assume them to be sending all that information to a central server. Let's say 3 is the one they all send information to, and 3 then counts the votes, chooses one value and returns it to the user. Now what happens if 3 fails? That's the master, so to speak, in this cluster. There's this leader consensus that we'll be doing in a future video, so that should answer that.

The final way in which Cassandra stands out as a NoSQL database (even Elasticsearch has this feature) is the way in which it stores data and the way in which it writes data. If you have a request coming in to Cassandra and you have this key-value pair, assume this table to be existing in memory, because you need to write it somewhere. Cassandra will be storing all of these records in memory as a log file. The reason I'm calling it a log file is that whenever there's a request for some write, it's going to be writing in a sequential fashion. So if a new
request comes in, it goes to the next point; a new request, the next point. In this way you are actually storing all the data like a log. This is efficient because all you need to do is go to the point where you have the current pointer and just write down the data, instead of searching for anything. So this is fast.

Periodically, this memory is dumped into something called an SSTable, a sorted string table. Why is it a sorted string table? Because the keys are sorted in this table. So if I have some data over here, the keys are going to be sorted and the values are going to be per key. Now this is persistent storage, which means that it's going to be stored on one of these cluster nodes. This concept comes from a very famous Google paper, which is the Bigtable data structure that Google made; you can have a look in the description below.

The special thing about a sorted string table is that it is immutable. This data is not going to change. So every time Cassandra has some data in its memory, it flushes it into a new sorted string table. Now you can imagine that, because these requests are coming in, after a few days you are going to have a lot of sorted string tables all over your cluster, and these are going to be taking up a lot of space. Why? Because of updates. Let's say the key is 123, and two days later you got an update on that key 123, so some data in it has changed. Maybe the name has changed from John Doe because a middle name has been added. In this case you have an update on that key, and the latest record is this record. It's in some other sorted string table, because that was created later on, when it was flushed to the SSTable. Effectively, what has happened is that you have multiple records for the same key.

If you have multiple records for the same key, it's not a problem for correctness. You can always use a timestamp: this record will have a timestamp, and you can use the latest timestamp to get the data. The
problem is not consistency; the problem is data usage. You are going to be using a lot of storage with these duplicate keys. If you have 10 records for the same key, then you are using 10 times the storage required.

So Cassandra and Elasticsearch provide a feature called compaction. What we do is take different sorted string tables and merge them. You can imagine this to be a merge sort: you have two sorted arrays and you are just merging them. This is an order n operation, and the space complexity is the minimum of m and n, where the sizes of the two arrays are m and n. I have actually taken this from that Timsort video, which nobody saw. If you just want to know how this works, it's just the merge sort algorithm. If you want a detailed explanation of why the space complexity is so low, you can have a look in the description below for the Timsort video.

So that's the important thing. Basically, we have sorted string tables which are immutable, so they are really fast to flush to disk. You don't need to worry about whether there are duplicate keys or anything, and later on, like a batch process, you are going to be compacting these SSTables to optimize for space.

How do you get rid of deleted records? Well, you can go to the deleted record and place what Cassandra calls a tombstone. You probably set a flag, and the tombstone says that this record is dead. For any read operation on that key, if there are three or four records and you see a tombstone on the latest timestamp, you consider this record to be dead, and all of them are killed. If there's an update on that key again and you see a tombstone, then you know that an update is impossible, and therefore you fire an exception like "record doesn't exist".

So in general this is how NoSQL databases work. We have picked up an example of Cassandra specifically, but there are a lot of concepts that are actually extensible to Elasticsearch, extensible to Amazon DynamoDB, and so on and
so forth.

This video was created by going through a lot of blog posts and a lot of videos on the Internet. There's a really good course for system design, which is Grokking the System Design Interview. The best thing is that this community is getting 10% off with the promo code GKCS. Do check it out; it's definitely worth the price and it's a great introduction to system design.

There's also a link for computer programmers below. The International Labour Organization is collaborating with a company called SoundRocket, so they reached out to me and they were talking about how to compare programmers: how do they work, how do they do this stuff? I'd really like a lot of you to go and take this survey. It takes just two or three minutes, and it's interesting to see if this helps our community grow.

This is a lot to digest, and if you have any doubts or suggestions then you can leave them in the comments below. If you liked this video then hit the like button, and if you want notifications for further such videos, hit the subscribe button. See you next time.
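The key placement, replication and quorum read described in the captions can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not Cassandra's actual implementation: the names Ring, replicas_for_position, quorum_read and Unavailable are hypothetical, node n is assumed to sit at position n x 100 on a ring of size 500, and MD5 is only a stand-in for whatever uniformly distributed hash function a real cluster uses.

```python
import bisect
import hashlib


class Unavailable(Exception):
    """Raised when fewer replicas respond than the quorum requires."""


class Ring:
    """Toy hash ring: a key belongs to the clockwise-next node, then its successors."""

    def __init__(self, node_positions, ring_size=500):
        # node_positions maps node name -> position on the ring (assumed layout).
        self.ring_size = ring_size
        self.nodes = sorted((pos, name) for name, pos in node_positions.items())
        self.points = [pos for pos, _ in self.nodes]

    def hash_key(self, key):
        # MD5 is only a stand-in for a uniformly distributed hash function.
        digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
        return int(digest, 16) % self.ring_size

    def replicas_for_position(self, position, replication_factor):
        # First node at or after the position, wrapping around the ring,
        # followed by the next nodes clockwise until we have enough copies.
        start = bisect.bisect_left(self.points, position) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)][1]
                for i in range(replication_factor)]

    def replicas(self, key, replication_factor):
        return self.replicas_for_position(self.hash_key(key), replication_factor)


def quorum_read(responses, quorum):
    """responses maps node -> (timestamp, value), or None if the node has no record.

    Crashed nodes are simply absent from the mapping.
    """
    if len(responses) < quorum:
        raise Unavailable("not enough replicas answered")
    records = [r for r in responses.values() if r is not None]
    if not records:
        return None  # every responding replica agrees there is no record
    return max(records, key=lambda r: r[0])[1]  # latest timestamp wins


ring = Ring({1: 100, 2: 200, 3: 300, 4: 400, 5: 500})
owners = ring.replicas_for_position(450, replication_factor=3)  # [5, 1, 2]
```

With a replication factor of three, a request that lands on node 5 is copied to 5, 1 and 2, as in the captions. Calling quorum_read({1: None, 2: (10, "profile")}, quorum=2) returns the newer record even though node 1 has not received the write yet, while a quorum of three with node 5 down raises Unavailable instead of reporting a missing profile.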
Info
Channel: Gaurav Sen
Views: 527,153
Rating: 4.8787456 out of 5
Keywords: system design, interview preparation, software interview, problem solving, design interview, system design interview, nosql, why use nosql, why nosql, cassandra, what is cassandra, cassandra architecture, nosql database, nosql architecture, nosql benefits, nosql problems, nosql drawbacks, nosql explained, database, cassandra database, database interview, nosql features, cassandra features, gaurav sen, sorted string table, bigdata, big data
Id: xQnIN9bW0og
Length: 27min 0sec (1620 seconds)
Published: Fri Feb 08 2019