Hi everyone. Let's talk about databases now.
So in your system design interview how good your design is and how well
it can scale depends very much on the choice of database that you've used.
Now databases generally do not impact your functional requirements, so whatever your
functional requirement is you can satisfy that by using any
database that you want. But normally the non-functional requirements are the ones
which are impacted by the choice of database. So let's just say you have
certain query patterns, or certain kind of a data structure or certain kind of a
scale to handle, there are a different set of databases optimized for different
kinds of such things, right. So based on the choice of database it would impact
how well your design can scale up to the requirements that are given as part of
your non-functional requirements(NFRs) so in this video what we'll do is we'll go
over some very common use cases that normally you might come across in your
System Design Interview and then look at some of the potential solutions or
possible set of databases that you can use to handle those scenarios. Now normally the choice of database depends on a couple of factors. The very first
thing is the structure of the data that you have. Whether it's a very structured
data or a totally non structured data that is one of the factors.
The other factor that impacts the choice of database is the query pattern that you have.
And the next obvious one is the amount of scale that you need to handle.
So all these three factors would make a difference in to the choice of database
that you want. Now in this video we'll look at some common obvious ones in the
beginning and then towards the end we'll look at the SQL-NoSQL comparison and
a combination of certain kinds of databases that can help you solve a
real-world problem. Let's talk about Caching Solutions. So whichever system you
are designing in an interview, you would definitely have to use some caching
solution there. Now there are a lot of use cases for caching. Let's say you are
querying a database and you do not want to query the database a lot of times, you
could cache the value in a cache, right. Alternatively, if you are making a remote
call to a different service and that is having a high enough latency, you might
want to cache the response of that system locally at your end in a caching solution And there could be a lot of
use-cases for caching. Now generally the way caching works is you have a key
and you have a value. So key normally is whatever your where clause is
in the query or whatever your query param or request params are when you're making an API call
and value is basically the response that you are expecting from the other system, right.
So all of these kind of values would be stored in normally key value
stores. Now very common used solutions are "Redis". You could also use "Memcached",
you could use "etcd". "Hazelcast" is picking up a lot these days. So you
could use any of them. Normally in the videos that I make I tend to use Redis
because that is a very battle-tested solution and it is used by all the big
companies in the world and it is fairly stable. So in whatever the System Design
you have, where you have to use a caching solution, you could use a Redis. I'll just
note it down so that you can refer it Redis is the solution for caching.
Now again this does not mean that you cannot use anything else, you could use any
similar solution and that would work just fine. Next let's look at some File Storage Options.
So let's just say you are designing a system like Amazon where
you are having various products that you are selling. Now the sellers would be
uploading product images, maybe products videos, right.
You need a data store to store those images and videos. Similarly you could be
potentially building a system like Netflix which has videos all together.
And you need a storage to support videos So wherever you have a image/video kind
of a thing, there we'll use something called as a Blob Storage. Now these are
not really databases. Databases are fundamentally meant for things where you
can query on. Now a file is not something which you normally query on.
You will just serve it as it is, right. So then is when Blob Storages come in.
Now there are a lot of providers for that one of the most common ones and the and
a fairly cheaper(Cost Effective) one is Amazon S3 and it is a fairly good system used by a lot
of company, so whenever you have a requirement of storing images/videos and
anything of that sort, you could use Amazon S3 as a data store. Now along with
S3 you might want to use something called as a Content Delivery Network
which is often called as a CDN. Now CDN is generally
used for distributing the same image geographically in a lot of locations.
So let's take a simple example that you have a product image that is stored in
Amazon S3 as a primary data source and there are a lot of people coming from
throughout the globe who are accessing that product. So you might want to
distribute that image into various servers across the globe so that
individual people can query them in a much faster way as compared to querying
an S3 which is probably located in a couple of locations. So for all
Blob kind of content you would then be using something like a for all blob you
would be using something like the S3 plus a content delivery network and that
should be fairly fine on all your interview questions. Now the next very
common use case that you might encounter is for example if you're building a
product like Amazon and you need to provide text searching capabilities on
various products. So the seller has uploaded a product with a particular title some
description and you want users to search that, right. Now that search thing would
be provided on a text of title and on the text of description. Very similar use
cases would be when you have a product like Netflix to build and you want to... give the option for the users to search or movie name, movie titles, genres,
maybe cast and crew names and all of that. Or alternatively you might be
designing something like an uber or Google Maps kind of a thing where you
want to provide text searching capability with support for fuzzy search.
So for all of these kind of use cases you would be using something called as a
Text Search Engine. Now a very common implementation of a Text Search Engine
is provided by Elastic Search and Solr, and both of them are built on top of
something called an Apache Lucene now Lucene fundamentally provides these
text searching capabilities and that is then being used by both of these
products Elastic Search(ES) and Solr. So the next use case and quickly write it down is
your text search which is Elastic Search and Solr. Now one more thing that they
support is something called as a Fuzzy Search. So what is that is.. I will quickly go
over that. Let's say if you are searching for the word Airport and let's say is
the user typed in "AIRPROT" with the wrong spelling, okay.
this O and R are interchanged right. Now, If a user searches for this and you do not
return back any result, then that's a bad user experience, right. So you need your
database to be able to figure out that the user did not really meant this thing [AIRPROT],
the user actually meant AIRPORT, right? Now how does that database identify,
so this word can be converted into the correct spelling of airport by
changing two characters, right. R needs to be converted into O and O needs to
be converted into R, right? so this is at an edit distance of two. So you can
provide a level of fuzziness that your search engine needs to support.
This has a fuzziness factor of two, which is the Edit Distance. Normally there are a lot of
other factors also that come in but this is roughly how fuzzy searching is
implemented in most of the solutions. So wherever you have any search
capabilities there you use either Elastic Search or Solr.
Now one very important thing about both of these are, these are not databases.
These are search engines. So the difference between a Search Engine and a
Database is whenever you write something in a Database, Database gives you a
guarantee that that data wouldn't be lost. Now both of these data stores don't give you any such guarantee. They claim
that they are giving a good enough availability and a redundancy and all of
that but potentially data could be lost so you should not keep any of these as
your primary source of truth Your primary data store should be somewhere
else and it you could load that data into either of these systems to provide
the searching capabilities and these two are very efficient at search.
Next, let's look at what do you do if you want to store some metrics kind of a data.
So let's say if you are building a system like Graphite, Grafana or Prometheus
which is basically an application metrics tracking system. So let's say if
the use case that you are given is a lot of applications are pushing metrics
related to their throughput, their CPU utilizations, their latencies, and all
of that. And you want to build a system to support that.
Then is when comes something called as a Time Series Database. Now think of Time
Series Database as an extension of relational databases but with not all
the functionalities and certain additional functionalities. So regular
relational databases that you have would have the ability to update a lot of
records right or they would also give you the ability to very random records
but whenever you are building a metrics monitoring kind of a system you would
never do random updates. You would always do sequential update in append-only mode.
So if you have put an entry at time T1 the next entry would be a time T 2 which
is greater than T1. The next entry would be at time T3 which is greater than T1
and T2 right. So it's an append only write mode. Also the read queries that
you do, they are kind of bulk read queries with the time range, right.
You query for last few minutes of data or few hours of data or few days of data, right.
But you don't do a random read or a random update. Now time series
databases are optimized for this kind of a query pattern and input pattern.
So there are a lot of time series databases InfluxDB is one of them, openTSDB is one of them so you could use either of them if you have that kind of
a use case. I will quickly write this thing also. Eor time series, you use something
called as an openTSDB, it stands for Open Time Series Database. The next use case is
when you have lot of information and you want to store all of that information of
a company in a certain kind of a data store for various kind of analytics
requirements. So let's just take an example of something like an Amazon or
Uber or any system design where you want to provide analytics on all the
transactions. You might want to provide analytics like how many orders I'm
having, what geographies are giving me what revenues, which is the most sought
after item, stuff like that. So where you want to do analytics on the data of the
whole company there you need something like a Data Warehouse. That basically is
a large database in which you can dump all of that data and provide various
querying capabilities on top of the data to serve a lot of reports. Now these are
generally not used for transactional systems, these are generally used for offline reporting. So if you have that kind of use case
then you can use something like a Hadoop which can very well sit in for that
purpose, where you put in a lot of data from various transactional systems and
then build a lot of systems that can provide reporting on top of that data. Now let's look at slightly tricky scenarios where you might want to choose
between a relational and a non-relational database. So the very first thing that
helps you decide what kind of a database do you want to use, is the structure of
the data. So if you have a very structured information then possibly a
relational database make sense. Now what is a structured information? it would be
an information that you can easily model in form of tables and tables would have
rows and columns of information. For example if you want to store a user
information something like a user profile on any social network it would
have name, email address, city, phone number, and bunch of very standard
information that each user will have That would be a structured information.
So if we try to make a flowchart of how to decide which database to use this is
how it would look like. The very first choice that we need to make is
whether we have a structured data or an unstructured data. Now let's say if we
have a structured data the next question that we need to ask ourself is, do we
need any atomic City or transactional guarantees from the database or not?
So, let's just say that you're building a payment system which supports a feature
like somebody can transact money from their account to somebody else's account, right.
Now fundamentally at the very core of it, it will have two queries, one of
them will reduce the amount from the person's account and the other query
would add the amount into the beneficiary's account, right. Basically
reducing from account A and adding into account B. Now your database should be
able to provide you certain guarantees which wrap both these query into
transactional boundaries, saying either both of them would execute or both of
them would not execute. But it should never happen that amount has been
debited from account A but not credited in into account B or credited into
account A but not debited from account B. Something of that sort should not happen.
Also it should provide you some consistency saying if you have done a
transaction the next call that you made to fetch the account balances it should
reflect the amount. It should not be that sometimes it is reflecting the correct amount and sometimes it is not reflecting the correct amount.
If you have that kind of requirement where in your building of
payment system or an inventory management system where you have the
count of number of products that you have while people are buying them and
the count needs to reduce, for all of those kind of scenarios, that would
fall into that yes you need an atomicity you need consistent, you need
transactions, then basically you need to use a relational database, okay.
Now there are multiple providers of relational database you could use any of them.
Some of the very common ones are MySQL, Oracle, SQL Server, Postgres and there
are a lot of them. I've just mentioned the common ones it does not mean that
you cannot use any other database of your choice. Feel free to use any
database which provides you ACID Guarantees. Now let's say you have
relational data but you do not need ACID Guarantee. Let's say you are just
storing user information which does not have any use case of these atomicity
requirements. You could still choose to use a relational database or you
could choose to use a non relational database, it wouldn't make too much of a
difference, because normally you would be easily be able to map a structured data
into a NoSQL model, so either of these scenarios would be fine.
Now let's say you do not have structured data - so what do you do? So there are a
bunch of scenarios in which your usecase might fit in. So maybe you are trying to
build a catalog kind of a system for an e-commerce platform like Amazon which
has information of all the items that are available on that platform, right.
So let's just say if you are building that catalog for Amazon and if we take
certain examples, so let's say there is the item like shirt. Now each item would
have certain attributes. A shirt would have an attribute like a size: could be
large, a color: could be red, something of that sort. If you have a
refrigerator, it would have a volume like 200 liters/400 liters or whatever. It
would also have a power saving mode like 3-star/5-star/what not. So those are the
attributes of a refrigerator, right. Similarly, something like a milk, would
have quantity and an expiry date, right. Now normally when you are on an
e-commerce platform you not only need to see these attributes if it is just about
seeing them you could kind of dump it as a Json and store it
in any databases. But normally you would also want to query that.
Now querying on a JSON or random attributes is a bit tricky on the relational
databases, but there are certain kinds of databases that are optimized for those
kind of queries. So these are the databases where you have a lot of data
not just data in terms of volume but in terms of structure. So if you have lot of
attributes that can come in and a wide variety of queries that can come in, then
you fall into this category, and if that is the case then you need to use
something called as a document DB. Now there are a lot of providers of document DBs, MongoDB, Couchbase are some of them. Earlier we looked at Elastic Search and
Solr for text searching those are also special cases of Document Database.
Now let's just say your your data is not relational and you do not have complex
queries, you have a couple of straightforward queries you could still
use a document DB if it doesn't fall into the third category. Now what is the
third category? Let's say if you have an ever-increasing data. What do I mean by
ever-increasing data? So let us take an example of Uber. So all the drivers of
Uber are continuously sending location pings. so let's stay there are some number
of drivers and they kind of translate into X number of location records per
day. So there would be X number of records inputted per day. But this
X would not be a constant, it would be a growing number why because the number of
drivers of uber are increasing day by day right so this data would become
probably X one day one. Additively 1.1X on day two, 1.2x
on date three, so on and so forth. So it would not be increasing in a linear
fashion it would be increasing but in a more than a linear fashion. So that
is what I am calling an ever increasing data. Plus if you have finite number of
type of queries. So let's if you want to track location pings of drivers the most
important query that you will do is find all the locations of a driver whose
driver ID is something right. So if you have less number of type of queries(maybe high volume) but a large amount of data then these kind of databases would be the best choice for
you. This is something called as a columnar DB or a Column oriented DB and Cassandra and HBase are the most used and
most stable options out there for these kind of scenarios. Now there would be a
lot of other alternate as well but these are the most stable ones and have been
battle tested again for a lot of years I would recommend to use either of these
two in such kind of scenario. Now in my videos I generally prefer using
Cassandra over HBase, the only reason being Cassandra is not very heavy to deploy
HBase generally tends to have a lot of components that it comes with, but
performance wise both of them are roughly similar. Each of them have their
pros and cons but in a design interview it wouldn't matter which one do you use.
Now, what if you don't have any of these what if you don't need ACID, you don't need
these wide variety of data types and query types and you don't have an inner
ever-increasing data? Then you can use any data base of your choice. Then it
would basically be a low scale system having a very small number of queries on
a very small number of attributes on a very small size of a data set. You would
generally not get these kind of requirements in a system design
interview but if you get you could use any of these unless it's something that
we've already talked about which are one of the special cases in some other
scenario. So if it's not that you could use any of these data bases and
that should be fine. But normally when you are in a system design you would not
get a design that you will be satisfied by just one of these databases. So let's
look at slightly more trickier real-world scenarios. Now let's take an
example of us building an e-commerce platform something like an Amazon.
Now when you are managing inventory on that side you want to make sure that you are not over selling items. So let's say there is
just one quantity of a particular item left and there are 10 users who want to
buy that you want to have this ACID properties there to make sure that only
one of the users should be able to commit the transaction other users
should not be able to commit the transaction, right. You can put
constraints and all of that on there. It would make sense to use a RDBMS
database for the inventory management system of a system like Amazon or maybe
an order management system right But if you look at the data of Amazon it
is an ever interesting data. Why? because the number of orders are additive each
day new orders are coming in they cannot purge the data due to lot of legal
reasons plus the number of orders are also increasing so it naturally fits
into this model right with you where I am recommending together
Cassandra. So what you could use is a combination of both of these databases.
You could use MySQL or any other RDBMS alternate for storing data about
the orders it has just placed and not yet deliver to the customer. Once the
item is delivered to the customer you can then remove from this RDBMS, and
put it into Cassandra as a permanent store. I walked through this
implementation in much more detail in the Amazon system Design, so I would
recommend to have a look on how a combination of database can become a
very power choice there. Now let's look at another example again taking an
example of Amazon, let's say if you want to build a reporting kind of a thing
which lets you query something like get me all the users who have bought
sugar in last five days now sugar is not just a product. There
are a lot of sellers selling different sugar alternates of different companies
maybe, of different qualities maybe, right. so sugar would then be like a lot of
item IDs right and on top of those a lot of item IDs there must be a lot of
orders. Now again I am saying that orders would either be in Cassandra or
in this(RDBMS) right but if you are doing random queries where some you might want
to query on who bought sugar, who bought TV, who bought TV of certain quality, who
bought a fridge of certain quality, that kind of logically goes into a model
where I was recommending to use a Document DB. now what you could do is
you could store the querying part over here(Document DB). You could basically say that I'll store a subset of order information into a MongoDB which says that user ID so
and so, had an order ID so-and-so, on some particular date, which has these ten
item ID in the certain quantity, right On this database you could run a query
which will return you a list of users and list of order right and then
you could take those Order IDs and query on both of these systems(RDBMS+Cassandra) right so here we are using all the three systems in combination to provide various querying
capabilities like with user bought what kind of use case, right. So in any real
world scenario you would have to use a combination of such databases to fulfill
the functional and non-functional requirements that you have. Now all of
these being said this is just an indicative cheat sheet of what database
fits well in what kind of a scenario. when you are in an interview you could
use this to kind of get done with it but if you are actually using it in a
real-world system I would recommend to read a bit more about it to choose which
out of these few do you want to actually use, right. Also there are a lot
of other databases that I have not talked about which are basically examples of
RDBMS's or example of columnar databases. It does not mean that these
are the only options. There are a lot of other options out there you could
explore them and use those as well. so yeah I think this should be a good enough place
for a database choice given a certain use-case.