Database Design Tips | Choosing the Best Database in a System Design Interview

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
Hi everyone. Let's talk about databases now. So in your system design interview how good your design is and how well it can scale depends very much on the choice of database that you've used. Now databases generally do not impact your functional requirements, so whatever your functional requirement is you can satisfy that by using any database that you want. But normally the non-functional requirements are the ones which are impacted by the choice of database. So let's just say you have certain query patterns, or certain kind of a data structure or certain kind of a scale to handle, there are a different set of databases optimized for different kinds of such things, right. So based on the choice of database it would impact how well your design can scale up to the requirements that are given as part of your non-functional requirements(NFRs) so in this video what we'll do is we'll go over some very common use cases that normally you might come across in your System Design Interview and then look at some of the potential solutions or possible set of databases that you can use to handle those scenarios. Now normally the choice of database depends on a couple of factors. The very first thing is the structure of the data that you have. Whether it's a very structured data or a totally non structured data that is one of the factors. The other factor that impacts the choice of database is the query pattern that you have. And the next obvious one is the amount of scale that you need to handle. So all these three factors would make a difference in to the choice of database that you want. Now in this video we'll look at some common obvious ones in the beginning and then towards the end we'll look at the SQL-NoSQL comparison and a combination of certain kinds of databases that can help you solve a real-world problem. Let's talk about Caching Solutions. So whichever system you are designing in an interview, you would definitely have to use some caching solution there. Now there are a lot of use cases for caching. Let's say you are querying a database and you do not want to query the database a lot of times, you could cache the value in a cache, right. Alternatively, if you are making a remote call to a different service and that is having a high enough latency, you might want to cache the response of that system locally at your end in a caching solution And there could be a lot of use-cases for caching. Now generally the way caching works is you have a key and you have a value. So key normally is whatever your where clause is in the query or whatever your query param or request params are when you're making an API call and value is basically the response that you are expecting from the other system, right. So all of these kind of values would be stored in normally key value stores. Now very common used solutions are "Redis". You could also use "Memcached", you could use "etcd". "Hazelcast" is picking up a lot these days. So you could use any of them. Normally in the videos that I make I tend to use Redis because that is a very battle-tested solution and it is used by all the big companies in the world and it is fairly stable. So in whatever the System Design you have, where you have to use a caching solution, you could use a Redis. I'll just note it down so that you can refer it Redis is the solution for caching. Now again this does not mean that you cannot use anything else, you could use any similar solution and that would work just fine. Next let's look at some File Storage Options. So let's just say you are designing a system like Amazon where you are having various products that you are selling. Now the sellers would be uploading product images, maybe products videos, right. You need a data store to store those images and videos. Similarly you could be potentially building a system like Netflix which has videos all together. And you need a storage to support videos So wherever you have a image/video kind of a thing, there we'll use something called as a Blob Storage. Now these are not really databases. Databases are fundamentally meant for things where you can query on. Now a file is not something which you normally query on. You will just serve it as it is, right. So then is when Blob Storages come in. Now there are a lot of providers for that one of the most common ones and the and a fairly cheaper(Cost Effective) one is Amazon S3 and it is a fairly good system used by a lot of company, so whenever you have a requirement of storing images/videos and anything of that sort, you could use Amazon S3 as a data store. Now along with S3 you might want to use something called as a Content Delivery Network which is often called as a CDN. Now CDN is generally used for distributing the same image geographically in a lot of locations. So let's take a simple example that you have a product image that is stored in Amazon S3 as a primary data source and there are a lot of people coming from throughout the globe who are accessing that product. So you might want to distribute that image into various servers across the globe so that individual people can query them in a much faster way as compared to querying an S3 which is probably located in a couple of locations. So for all Blob kind of content you would then be using something like a for all blob you would be using something like the S3 plus a content delivery network and that should be fairly fine on all your interview questions. Now the next very common use case that you might encounter is for example if you're building a product like Amazon and you need to provide text searching capabilities on various products. So the seller has uploaded a product with a particular title some description and you want users to search that, right. Now that search thing would be provided on a text of title and on the text of description. Very similar use cases would be when you have a product like Netflix to build and you want to... give the option for the users to search or movie name, movie titles, genres, maybe cast and crew names and all of that. Or alternatively you might be designing something like an uber or Google Maps kind of a thing where you want to provide text searching capability with support for fuzzy search. So for all of these kind of use cases you would be using something called as a Text Search Engine. Now a very common implementation of a Text Search Engine is provided by Elastic Search and Solr, and both of them are built on top of something called an Apache Lucene now Lucene fundamentally provides these text searching capabilities and that is then being used by both of these products Elastic Search(ES) and Solr. So the next use case and quickly write it down is your text search which is Elastic Search and Solr. Now one more thing that they support is something called as a Fuzzy Search. So what is that is.. I will quickly go over that. Let's say if you are searching for the word Airport and let's say is the user typed in "AIRPROT" with the wrong spelling, okay. this O and R are interchanged right. Now, If a user searches for this and you do not return back any result, then that's a bad user experience, right. So you need your database to be able to figure out that the user did not really meant this thing [AIRPROT], the user actually meant AIRPORT, right? Now how does that database identify, so this word can be converted into the correct spelling of airport by changing two characters, right. R needs to be converted into O and O needs to be converted into R, right? so this is at an edit distance of two. So you can provide a level of fuzziness that your search engine needs to support. This has a fuzziness factor of two, which is the Edit Distance. Normally there are a lot of other factors also that come in but this is roughly how fuzzy searching is implemented in most of the solutions. So wherever you have any search capabilities there you use either Elastic Search or Solr. Now one very important thing about both of these are, these are not databases. These are search engines. So the difference between a Search Engine and a Database is whenever you write something in a Database, Database gives you a guarantee that that data wouldn't be lost. Now both of these data stores don't give you any such guarantee. They claim that they are giving a good enough availability and a redundancy and all of that but potentially data could be lost so you should not keep any of these as your primary source of truth Your primary data store should be somewhere else and it you could load that data into either of these systems to provide the searching capabilities and these two are very efficient at search. Next, let's look at what do you do if you want to store some metrics kind of a data. So let's say if you are building a system like Graphite, Grafana or Prometheus which is basically an application metrics tracking system. So let's say if the use case that you are given is a lot of applications are pushing metrics related to their throughput, their CPU utilizations, their latencies, and all of that. And you want to build a system to support that. Then is when comes something called as a Time Series Database. Now think of Time Series Database as an extension of relational databases but with not all the functionalities and certain additional functionalities. So regular relational databases that you have would have the ability to update a lot of records right or they would also give you the ability to very random records but whenever you are building a metrics monitoring kind of a system you would never do random updates. You would always do sequential update in append-only mode. So if you have put an entry at time T1 the next entry would be a time T 2 which is greater than T1. The next entry would be at time T3 which is greater than T1 and T2 right. So it's an append only write mode. Also the read queries that you do, they are kind of bulk read queries with the time range, right. You query for last few minutes of data or few hours of data or few days of data, right. But you don't do a random read or a random update. Now time series databases are optimized for this kind of a query pattern and input pattern. So there are a lot of time series databases InfluxDB is one of them, openTSDB is one of them so you could use either of them if you have that kind of a use case. I will quickly write this thing also. Eor time series, you use something called as an openTSDB, it stands for Open Time Series Database. The next use case is when you have lot of information and you want to store all of that information of a company in a certain kind of a data store for various kind of analytics requirements. So let's just take an example of something like an Amazon or Uber or any system design where you want to provide analytics on all the transactions. You might want to provide analytics like how many orders I'm having, what geographies are giving me what revenues, which is the most sought after item, stuff like that. So where you want to do analytics on the data of the whole company there you need something like a Data Warehouse. That basically is a large database in which you can dump all of that data and provide various querying capabilities on top of the data to serve a lot of reports. Now these are generally not used for transactional systems, these are generally used for offline reporting. So if you have that kind of use case then you can use something like a Hadoop which can very well sit in for that purpose, where you put in a lot of data from various transactional systems and then build a lot of systems that can provide reporting on top of that data. Now let's look at slightly tricky scenarios where you might want to choose between a relational and a non-relational database. So the very first thing that helps you decide what kind of a database do you want to use, is the structure of the data. So if you have a very structured information then possibly a relational database make sense. Now what is a structured information? it would be an information that you can easily model in form of tables and tables would have rows and columns of information. For example if you want to store a user information something like a user profile on any social network it would have name, email address, city, phone number, and bunch of very standard information that each user will have That would be a structured information. So if we try to make a flowchart of how to decide which database to use this is how it would look like. The very first choice that we need to make is whether we have a structured data or an unstructured data. Now let's say if we have a structured data the next question that we need to ask ourself is, do we need any atomic City or transactional guarantees from the database or not? So, let's just say that you're building a payment system which supports a feature like somebody can transact money from their account to somebody else's account, right. Now fundamentally at the very core of it, it will have two queries, one of them will reduce the amount from the person's account and the other query would add the amount into the beneficiary's account, right. Basically reducing from account A and adding into account B. Now your database should be able to provide you certain guarantees which wrap both these query into transactional boundaries, saying either both of them would execute or both of them would not execute. But it should never happen that amount has been debited from account A but not credited in into account B or credited into account A but not debited from account B. Something of that sort should not happen. Also it should provide you some consistency saying if you have done a transaction the next call that you made to fetch the account balances it should reflect the amount. It should not be that sometimes it is reflecting the correct amount and sometimes it is not reflecting the correct amount. If you have that kind of requirement where in your building of payment system or an inventory management system where you have the count of number of products that you have while people are buying them and the count needs to reduce, for all of those kind of scenarios, that would fall into that yes you need an atomicity you need consistent, you need transactions, then basically you need to use a relational database, okay. Now there are multiple providers of relational database you could use any of them. Some of the very common ones are MySQL, Oracle, SQL Server, Postgres and there are a lot of them. I've just mentioned the common ones it does not mean that you cannot use any other database of your choice. Feel free to use any database which provides you ACID Guarantees. Now let's say you have relational data but you do not need ACID Guarantee. Let's say you are just storing user information which does not have any use case of these atomicity requirements. You could still choose to use a relational database or you could choose to use a non relational database, it wouldn't make too much of a difference, because normally you would be easily be able to map a structured data into a NoSQL model, so either of these scenarios would be fine. Now let's say you do not have structured data - so what do you do? So there are a bunch of scenarios in which your usecase might fit in. So maybe you are trying to build a catalog kind of a system for an e-commerce platform like Amazon which has information of all the items that are available on that platform, right. So let's just say if you are building that catalog for Amazon and if we take certain examples, so let's say there is the item like shirt. Now each item would have certain attributes. A shirt would have an attribute like a size: could be large, a color: could be red, something of that sort. If you have a refrigerator, it would have a volume like 200 liters/400 liters or whatever. It would also have a power saving mode like 3-star/5-star/what not. So those are the attributes of a refrigerator, right. Similarly, something like a milk, would have quantity and an expiry date, right. Now normally when you are on an e-commerce platform you not only need to see these attributes if it is just about seeing them you could kind of dump it as a Json and store it in any databases. But normally you would also want to query that. Now querying on a JSON or random attributes is a bit tricky on the relational databases, but there are certain kinds of databases that are optimized for those kind of queries. So these are the databases where you have a lot of data not just data in terms of volume but in terms of structure. So if you have lot of attributes that can come in and a wide variety of queries that can come in, then you fall into this category, and if that is the case then you need to use something called as a document DB. Now there are a lot of providers of document DBs, MongoDB, Couchbase are some of them. Earlier we looked at Elastic Search and Solr for text searching those are also special cases of Document Database. Now let's just say your your data is not relational and you do not have complex queries, you have a couple of straightforward queries you could still use a document DB if it doesn't fall into the third category. Now what is the third category? Let's say if you have an ever-increasing data. What do I mean by ever-increasing data? So let us take an example of Uber. So all the drivers of Uber are continuously sending location pings. so let's stay there are some number of drivers and they kind of translate into X number of location records per day. So there would be X number of records inputted per day. But this X would not be a constant, it would be a growing number why because the number of drivers of uber are increasing day by day right so this data would become probably X one day one. Additively 1.1X on day two, 1.2x on date three, so on and so forth. So it would not be increasing in a linear fashion it would be increasing but in a more than a linear fashion. So that is what I am calling an ever increasing data. Plus if you have finite number of type of queries. So let's if you want to track location pings of drivers the most important query that you will do is find all the locations of a driver whose driver ID is something right. So if you have less number of type of queries(maybe high volume) but a large amount of data then these kind of databases would be the best choice for you. This is something called as a columnar DB or a Column oriented DB and Cassandra and HBase are the most used and most stable options out there for these kind of scenarios. Now there would be a lot of other alternate as well but these are the most stable ones and have been battle tested again for a lot of years I would recommend to use either of these two in such kind of scenario. Now in my videos I generally prefer using Cassandra over HBase, the only reason being Cassandra is not very heavy to deploy HBase generally tends to have a lot of components that it comes with, but performance wise both of them are roughly similar. Each of them have their pros and cons but in a design interview it wouldn't matter which one do you use. Now, what if you don't have any of these what if you don't need ACID, you don't need these wide variety of data types and query types and you don't have an inner ever-increasing data? Then you can use any data base of your choice. Then it would basically be a low scale system having a very small number of queries on a very small number of attributes on a very small size of a data set. You would generally not get these kind of requirements in a system design interview but if you get you could use any of these unless it's something that we've already talked about which are one of the special cases in some other scenario. So if it's not that you could use any of these data bases and that should be fine. But normally when you are in a system design you would not get a design that you will be satisfied by just one of these databases. So let's look at slightly more trickier real-world scenarios. Now let's take an example of us building an e-commerce platform something like an Amazon. Now when you are managing inventory on that side you want to make sure that you are not over selling items. So let's say there is just one quantity of a particular item left and there are 10 users who want to buy that you want to have this ACID properties there to make sure that only one of the users should be able to commit the transaction other users should not be able to commit the transaction, right. You can put constraints and all of that on there. It would make sense to use a RDBMS database for the inventory management system of a system like Amazon or maybe an order management system right But if you look at the data of Amazon it is an ever interesting data. Why? because the number of orders are additive each day new orders are coming in they cannot purge the data due to lot of legal reasons plus the number of orders are also increasing so it naturally fits into this model right with you where I am recommending together Cassandra. So what you could use is a combination of both of these databases. You could use MySQL or any other RDBMS alternate for storing data about the orders it has just placed and not yet deliver to the customer. Once the item is delivered to the customer you can then remove from this RDBMS, and put it into Cassandra as a permanent store. I walked through this implementation in much more detail in the Amazon system Design, so I would recommend to have a look on how a combination of database can become a very power choice there. Now let's look at another example again taking an example of Amazon, let's say if you want to build a reporting kind of a thing which lets you query something like get me all the users who have bought sugar in last five days now sugar is not just a product. There are a lot of sellers selling different sugar alternates of different companies maybe, of different qualities maybe, right. so sugar would then be like a lot of item IDs right and on top of those a lot of item IDs there must be a lot of orders. Now again I am saying that orders would either be in Cassandra or in this(RDBMS) right but if you are doing random queries where some you might want to query on who bought sugar, who bought TV, who bought TV of certain quality, who bought a fridge of certain quality, that kind of logically goes into a model where I was recommending to use a Document DB. now what you could do is you could store the querying part over here(Document DB). You could basically say that I'll store a subset of order information into a MongoDB which says that user ID so and so, had an order ID so-and-so, on some particular date, which has these ten item ID in the certain quantity, right On this database you could run a query which will return you a list of users and list of order right and then you could take those Order IDs and query on both of these systems(RDBMS+Cassandra) right so here we are using all the three systems in combination to provide various querying capabilities like with user bought what kind of use case, right. So in any real world scenario you would have to use a combination of such databases to fulfill the functional and non-functional requirements that you have. Now all of these being said this is just an indicative cheat sheet of what database fits well in what kind of a scenario. when you are in an interview you could use this to kind of get done with it but if you are actually using it in a real-world system I would recommend to read a bit more about it to choose which out of these few do you want to actually use, right. Also there are a lot of other databases that I have not talked about which are basically examples of RDBMS's or example of columnar databases. It does not mean that these are the only options. There are a lot of other options out there you could explore them and use those as well. so yeah I think this should be a good enough place for a database choice given a certain use-case.
Info
Channel: codeKarle
Views: 99,279
Rating: 4.9583445 out of 5
Keywords: System Design, System Architecture, Interview, codekarle, Code Karle, System, Architecture, Design, Database, SQL, NOSQL, Elastic Search, Solr, Cassandra, MySQL, MongoDB, system design tutorial, tutorial, system design interview question, Grokking, FAANG, interview questions, Software Engineer, Software, DB
Id: cODCpXtPHbQ
Channel Id: undefined
Length: 23min 27sec (1407 seconds)
Published: Sat Jun 06 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.