Hi everyone!
Welcome to CodeKarle. My name is Sandeep. In this video we will be looking at a
very common design interview problem, one that a lot of companies have been
asking lately: how do we design a hotel booking system, something very similar
to booking.com or Airbnb? Just one thing to call out: we will be looking at a
very high-level architecture of the whole system, and not at lower-level class
diagrams and all of that. So before we jump into the problem, let's first look
at the functional requirements, then the non-functional requirements of what we
want to achieve, and then the design. We have two major consumers of this
application: one is the hotel side of users, and then there are the consumers
who want to book the hotel.
So for the hotel managers we'll have these three major functionalities:
1) They should be able to onboard onto our platform.
2) They should be able to update their property. For example, they might want
to add a new room, change the pricing, add new images and stuff like that.
3) They should be able to see what bookings are there and, along with that,
get some insight into the revenue numbers and all of that.
From a user standpoint, they should be able to search for a property in a
particular location with a couple of search criteria. For example, they might
want to filter within a price range, or on some aspects of the property, like a
five-star property or a beachfront property and stuff like that.
Then they should be able to book that hotel, and once they have booked they
should be able to look at their bookings. Okay, these are the major requirements.
We should also design it in a way that leaves scope for some kind of analytics
to be done. So these are the functional side of things. From a non-functional
side of things, we need this platform to run at a very low latency, and it
should give very high availability and very high consistency. By high
consistency I mean that if a user books a hotel, they should be able to see that
booking immediately. Now, from a scale standpoint, what kind of scale do we want?
So a quick Google search tells me that there are roughly 500,000 hotels in the
whole world at this point in time, with roughly 10-12 million rooms across all
of them. That averages out to only about 20-24 rooms per hotel, but to be on
the safe side you can assume up to a thousand rooms in a particular hotel. There
are some hotels which have more than 7,000 rooms, but those are edge cases, and
we should be able to handle them. The reason I am talking about this thousand
number is this: let's say a hotel has a thousand rooms. These rooms will be
booked over the course of many days, so there will practically never be a
situation where there is just one room available and thousands of users are
trying to book it. At most, there will be one room and two or three users
trying to book it, and we will be able to leverage that assumption at a later
point in time. Now let's look at the overall design of the whole system and how
the data flows between the components, and then we'll look individually into
some of the components.
So the whole business flow starts at this point, which is basically a UI that
we give out to the hotel managers. It could be either a website or a mobile
app, but through this UI they would onboard onto our platform, and the same UI
would be used by them to modify the property.
So let's say they want to add a new image or they want to add a new room or
if they want to make any modifications this is the UI that they talk to.
Now this UI talks to a Load Balancer through which it talks to a hotel service.
This is basically a service which manages the hotel part, which is basically
the onboarding and the management. All right, now let's just say there's a spike
in traffic: multiple nodes of this Hotel Service could be added here, so this
becomes a horizontally scalable component. Now, hotel data in itself is very
much relational data. Plus, given the number of hotels we talked about earlier,
it's not too many, so it doesn't even have a scale problem. So we'll be using a
clustered MySQL here with one master and multiple slaves; slaves can be added
as and when required. Let's say there's a huge spike in read traffic: we can
add more slaves. But this data resides within a MySQL database. Now let's just
say an image is added. Hotels can add images of the rooms, of their whole
building and all of that. All those images would be stored in a CDN, and the
reference to the CDN, which is basically a URL of the image, would be stored in
the database. That URL would be sent out to customers, and whenever they want
to render an image it would be looked up directly from the CDN. Now, what is a
CDN? It's basically a geographically distributed data store which we will be
using for sending out images throughout the whole world. So let's just say I'm
connecting from India and somebody else is connecting from the US, and we both
want to look up an image of a particular hotel. I'll look it up on the CDN
server which is in India; the other person will look it up on the CDN server
which is in the US. So this becomes the hotel life cycle management. The next
thing is this: each time a modification happens to a hotel, or let's say a new
hotel comes in, we want to bubble it up to the users who are going to search
for it, right? Now, there are multiple ways in which we can send this
information to the search piece, right.
I'll be using Kafka here. Each modification that happens within the Hotel
Service will flow through a Kafka cluster, and there'll be multiple consumers
sitting on top of this cluster which will populate their own data stores for
serving the search traffic, and other traffic as well, right. One of the
consumers will be the search consumer. What happens is, let's say a hotel gets
a new room, for example. There will be a payload put into Kafka which has all
the information that is required. The search consumer pulls the payload from
Kafka and stores it into its own database, and this database would be used to
power the search on the website.
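To make that consumer step concrete, here is a minimal sketch of what it might do with each payload. It assumes JSON messages with "type", "hotel_id" and "data" fields (hypothetical names, not a schema from this design), and it uses an in-memory dict as a stand-in for the search database, so no Kafka client is involved.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SearchIndex:
    """In-memory stand-in for the search consumer's own data store."""
    docs: dict = field(default_factory=dict)

    def upsert(self, hotel_id, partial_doc):
        # Merge the changed fields into whatever is already indexed.
        self.docs.setdefault(hotel_id, {}).update(partial_doc)


def handle_message(index, raw_message):
    """Apply one payload, as the consumer would for every record it reads."""
    event = json.loads(raw_message)
    # The event type names below are assumptions made for this sketch.
    if event["type"] in ("HOTEL_CREATED", "HOTEL_UPDATED", "ROOM_ADDED"):
        index.upsert(event["hotel_id"], event["data"])


index = SearchIndex()
handle_message(index, json.dumps({
    "type": "HOTEL_CREATED", "hotel_id": 42,
    "data": {"name": "Sea Breeze", "tags": ["beachfront"]},
}))
handle_message(index, json.dumps({
    "type": "ROOM_ADDED", "hotel_id": 42,
    "data": {"room_types": ["deluxe"]},
}))
```

Event types the consumer does not know about are simply ignored here, which is one way to let other consumers share the same topic.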
Okay, now for search, I am using Elasticsearch. Elasticsearch is basically a
search engine built on top of Apache Lucene. Instead of Elasticsearch you could
also use Solr here. Both are similar components; ideally it would depend on
what infrastructure is already being used in your company. But the idea of
using Elasticsearch is that I want this piece to support fuzzy search. Let's
say a user is searching for a hotel in Maldives. The user might not know the
correct spelling, right. If they type in a wrong word, I don't want them to get
no results. I want this to handle all the typos and spelling mistakes and all
of that, plus I also want it to return similar matches, so that's the reason
I'm using Elasticsearch here. So all the data of each individual hotel flows
through Kafka, via the search consumer, into this Elasticsearch cluster. On top
of this Elasticsearch cluster sits the Search Service. Now again, let's just
say there's a spike in traffic: I can increase the number of nodes in the Kafka
cluster, I can increase the number of search consumers here, and I can increase
the number of nodes in the Elasticsearch cluster. So whatever we have talked
about till now is again horizontally scalable, right.
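As a rough illustration of the fuzzy search idea, here is a sketch that builds an Elasticsearch query body as a plain Python dict. The field names "location", "tags" and "price" are assumptions for this example, and combining the filters with AND is my choice, not something the design fixes.

```python
def build_search_query(location_text, tags=(), price_min=None, price_max=None):
    """Build a query body: fuzzy match on location, exact filters on the rest."""
    filters = [{"term": {"tags": tag}} for tag in tags]
    price_range = {}
    if price_min is not None:
        price_range["gte"] = price_min
    if price_max is not None:
        price_range["lte"] = price_max
    if price_range:
        filters.append({"range": {"price": price_range}})
    return {
        "query": {
            "bool": {
                # "fuzziness": "AUTO" lets a misspelled term still match.
                "must": [{
                    "match": {
                        "location": {"query": location_text, "fuzziness": "AUTO"}
                    }
                }],
                "filter": filters,
            }
        }
    }


# A misspelled "Maldives" plus two tags and a price ceiling.
query = build_search_query("maldievs", tags=["five_star", "beachfront"], price_max=300)
```

The resulting dict is what you would send as the body of a search request; the date-range availability filter is left out of this sketch.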
And again, coming to the Search Service, this is the service which powers the
search on the website. Now, I'm using "website" as a generic term. Sometimes
I'll say website, sometimes I'll say UI, but it's basically all the modes of
communication through which a user can come in. That could be an app, that
could be a website, right. So the user talks, again through a load balancer, to
the Search Service whenever they want to search for a particular hotel. They
will give a date range and a location, for example, as search criteria, and
along with that they could also provide some tags. Now, those tags would be the
properties of the hotels. So again, going back to my previous example, a
five-star property is a tag. A beachfront property is a tag. The search on
Elasticsearch would happen on these tags and on the ranges that are provided,
basically the date range, price range and all of that. Okay, so this takes care
of the search flow. Now, once the user has seen some of the results on the
website, they would want to book a hotel.
The booking again happens through this UI. I've labelled this as a
search-and-book UI; normally it will be the same app or the same website
through which they search and then book, right. Now, a booking request again
comes to this load balancer and talks to the Booking Service. The Booking
Service again sits on top of a MySQL database. Now, these are two different
MySQL clusters. I am purposefully not using the same cluster here, although we
could use the same cluster and have two different databases in it. But because
we are talking about a fairly large system with a good enough amount of scale,
I would want to keep different clusters so that each can be scaled
independently of the other. Now, whenever a booking happens, that booking gets
stored into this MySQL. We'll go over the exact flow of booking when we go over
the implementation details of the Booking Service, but essentially it stores
the data into this MySQL and it talks to a Payment Service. Normally what will
happen is: a booking request comes in, the service stores something, it sends
the request for payment, and once that succeeds it marks the booking as
confirmed. Now again, whenever a booking happens, the data flows into the same
Kafka, right. Why? Let's just say there was just one room available in a hotel,
right, and that room is now booked. I want to make sure that this hotel no
longer shows up in search for that same date range, because it's not available.
So all of that information is again sent to the same Kafka, which is read by
the Search Consumer, and it takes care of removing the hotels which are now
completely booked.
Now, as you can see, there's something called an Archival Service here. What I
have done is, I am storing just the live data in MySQL. By live data, I mean
the bookings that have been made but not yet completed, thereby making sure
that the scale is low enough for MySQL to handle easily. Once a booking moves
to a terminal state, so let's say the booking is cancelled or completed, it
will move through the Archival Service to a Cassandra cluster.
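Here is a minimal sketch of that archival step, with plain dicts standing in for both the live MySQL table and the Cassandra cluster. Keeping two archive copies, one keyed by user_id and one by hotel_id, is an assumption on my part to match the two lookups (bookings by user, bookings by hotel); the hotel_id field on a booking is also assumed for the example.

```python
TERMINAL_STATUSES = {"CANCELLED", "COMPLETED"}


def archive_terminal_bookings(live_bookings, archive_by_user, archive_by_hotel):
    """Move bookings in a terminal state out of the live store.

    live_bookings: dict of booking_id -> booking dict (stand-in for MySQL).
    archive_by_user / archive_by_hotel: dicts keyed by the partition key
    (stand-ins for two archive tables).
    Returns the booking ids that were archived.
    """
    archived = []
    # Iterate over a copy because entries are deleted inside the loop.
    for booking_id, booking in list(live_bookings.items()):
        if booking["status"] in TERMINAL_STATUSES:
            archive_by_user.setdefault(booking["user_id"], []).append(booking)
            archive_by_hotel.setdefault(booking["hotel_id"], []).append(booking)
            del live_bookings[booking_id]
            archived.append(booking_id)
    return archived


live = {
    "b1": {"user_id": 1, "hotel_id": 10, "status": "BOOKED"},
    "b2": {"user_id": 1, "hotel_id": 11, "status": "COMPLETED"},
    "b3": {"user_id": 2, "hotel_id": 10, "status": "CANCELLED"},
}
by_user, by_hotel = {}, {}
moved = archive_terminal_bookings(live, by_user, by_hotel)
```

After this run only the BOOKED booking is left in the live store, which is what keeps the MySQL side small.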
The reason I'm using Cassandra here is that it is a very good database for
handling a huge amount of reads and writes. It has a constraint, though: it
needs a partition key on which all the queries should happen. So let's say I
want to search by "booking_id": my partition key has to be "booking_id", and in
that case I cannot efficiently do any other kinds of queries on Cassandra.
Therefore I did not use Cassandra as the source-of-truth database, because on
that database I need to do a large variety of queries. We'll come to all of
those when we go into the details of the Booking Service, but once a booking is
archived we just need to do GETs on it, so Cassandra makes good enough sense
over here. Now, once the booking is done, all of that is fine, but now we need
to notify all the people, right? So then comes the Notification Service.
Whenever a booking is made, or any changes happen to a booking, or it moves
into a terminal state, there'll be a Notification Service that consumes events
from this Kafka and notifies the people. For example, on each booking we need
to notify the hotel, right. Whenever a booking is cancelled by the hotel we
need to notify the consumer, and in fact on each booking we need to notify the
consumer with an invoice, right. So all of that is taken care of by this
Notification Service.
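The routing rules just described can be sketched as a small function. The event type names and the message wording are made up for this example; only the who-gets-notified rules come from the discussion above.

```python
def plan_notifications(event):
    """Return (recipient, message) pairs for one booking event."""
    kind = event["type"]
    if kind == "BOOKING_CONFIRMED":
        # Every booking notifies the hotel, and the user gets the invoice.
        return [
            ("hotel", "New booking " + event["booking_id"]),
            ("user", "Booking confirmed, invoice " + event["invoice_id"]),
        ]
    if kind == "BOOKING_CANCELLED_BY_HOTEL":
        return [("user", "Booking " + event["booking_id"] + " was cancelled by the hotel")]
    # Anything else on the topic is not this service's concern.
    return []


plan = plan_notifications(
    {"type": "BOOKING_CONFIRMED", "booking_id": "b1", "invoice_id": "inv-9"}
)
```

Actually delivering the messages (email, SMS, push) would sit behind this planning step and is not shown.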
Now, coming back to the UI for hotels and users. Each time a booking is done,
or even without that, a user might want to see their old bookings, or a hotel
might want to see all the bookings that they have. This is more of a read-only
view for them, right? That will be powered by this Booking Management Service,
which talks to two data sources. It talks to the MySQL cluster for all the
active bookings, which are to happen sometime in the future, and to the
Cassandra cluster for the bookings that have already happened, right. Now, I am
adding a Redis on top of this MySQL to reduce the load on it, so Redis will act
as my cache. Whenever I have a query, for example something like "get bookings
of a user", I can cache the result in this Redis. It'll be a write-through
cache, so whenever a new booking comes in, the cache gets updated as well. All
right. Now, this is the functional flow. The bigger component here is how we do
the analytics on this. Let's just say a business person wants to know how much
revenue I'm making, or how many bookings I'm having, or what my best-performing
hotels are and stuff like that. So they need to do a lot of analytics. Now,
while designing the system we won't always know what kind of analytics will be
required, right. So what I've done is I've used a Hadoop cluster into which I'm
pushing all the events that go into my Kafka, which is basically information
about all my hotels, all my bookings and all the transactions that happen in my
system. There will be a Spark Streaming consumer running somewhere that reads
from this Kafka and puts all the data into the Hadoop cluster, on which I can
run Hive queries or any other kind of queries and build up a lot of reporting.
So this is overall how the system looks and how the data flows. Now let's go
into the details of some of the components. Let's look at what the Hotel
Service is internally.
So it's not a very complicated service. It is basically a CRUD service which
provides Create, Read, Update and Delete operations on the hotel data store,
and it is the source of truth for hotel data. Now, what you see here is not an
exhaustive list of either the APIs or the DB schema; there will be a lot more
things, but this will give you a feel of how it should be. So let's look at
some of the APIs.
1) There'll be a POST API, POST /hotels, to create a hotel, which will be part
of the onboarding process.
2) There will be a GET API with an id, GET /hotel/{hotel_id}, which will give
back the information of the hotel, which can be rendered on the screen for the
hotel manager to see.
3) There will be a PUT API, PUT /hotel/{hotel_id}, which will be used to update
any information of a hotel.
4) Similarly, there will be a PUT API, PUT /hotel/{hotel_id}/room/{room_id},
which would be used to update the room information or create new rooms and all
of that.
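A minimal in-memory sketch of those four operations follows. The class and method names are hypothetical, the HTTP layer is left out, and a dict stands in for the MySQL cluster; each method is annotated with the endpoint it would sit behind.

```python
import itertools


class HotelService:
    """In-memory CRUD sketch; a real service would sit on the hotel database."""

    def __init__(self):
        self._hotels = {}
        self._ids = itertools.count(1)

    def create_hotel(self, data):                  # POST /hotels
        hotel_id = next(self._ids)
        self._hotels[hotel_id] = {"id": hotel_id, "rooms": {}, "is_active": True, **data}
        return hotel_id

    def get_hotel(self, hotel_id):                 # GET /hotel/{hotel_id}
        return self._hotels[hotel_id]

    def update_hotel(self, hotel_id, changes):     # PUT /hotel/{hotel_id}
        self._hotels[hotel_id].update(changes)

    def put_room(self, hotel_id, room_id, room):   # PUT /hotel/{hotel_id}/room/{room_id}
        # PUT acts as an upsert: it creates the room if it does not exist yet.
        self._hotels[hotel_id]["rooms"][room_id] = room


service = HotelService()
hotel_id = service.create_hotel({"name": "Sea Breeze"})
service.update_hotel(hotel_id, {"description": "Beachfront stay"})
service.put_room(hotel_id, 1, {"display_name": "Deluxe", "quantity": 10})
```

In the full design each of these writes would also publish an event to Kafka so that the search side picks up the change.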
Now, this is not an exhaustive list; there'll be a lot more APIs that you can
add as and when there's a requirement. Now let's look at how the DB schema
might look.
So there are a couple of important tables. This is again not an exhaustive list
of the tables. There's one hotel table in this hotel DB. But before that:
everything in red here is either a primary key or a foreign key, and everything
in blue is just a column. Now, this hotel table contains your very standard
things: id, name, locality_id (which is a foreign key to the locality table),
description, original_images, display_images and is_active. Why do I have two
columns, for original and display images? original_images is basically the
artifact that people have uploaded. display_images could be a compressed
version of that, or it could be the version that we have uploaded to the CDN;
it could be something different from the original image, but we still need to
keep both of them, so we have stored both here. is_active is basically a
soft-delete flag. Then coming to the rooms table. It has a room id, obviously;
a hotel_id, which references the hotel table; a display_name, which could just
be an identifier to tell the customer what kind of room it is; is_active, again
a soft-delete flag; quantity, which tells how many such rooms there are in the
hotel; and a price_min and a price_max. Now, why do we have two prices?
Remember the Hadoop cluster in the original design. It has a lot of data about
various kinds of things. We might as well run a machine learning model on it,
do some supply-demand analytics and come up with the optimal price, right?
Let's say supply is low and there's a lot of demand, with just a few rooms
left: might as well increase the price! Or let's say there are too many rooms
and very few customers: might as well reduce the price. So price_min and
price_max could be the range which the hotel provides, within which the system
is allowed to vary the price. A good starting point could be the average of
these two prices, right. Then there's a facilities table, which is basically a
list of all the facilities that a hotel or a room can possibly have.
hotels_facilities and room_facilities are mapping tables: they hold the
many-to-many relationships between a hotel_id and a facility_id, and between a
room_id and a facility_id. Again, the is_active flag everywhere is basically a
soft-delete flag. Now again, this is not a full list of tables, and a lot of
information is missing. I've skipped the auditing information; I've skipped the
bookkeeping information like created_on, updated_time and all of that. But this
will give you a fair enough idea, and it will be a good starting point for you
to come up with a DB schema for this.
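The tables described above could be written down roughly like this. This is only a sketch: it uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere, and the column types, the locality and facilities columns, and the price_min <= price_max check are my assumptions rather than part of the schema shown.

```python
import sqlite3

SCHEMA = """
CREATE TABLE locality (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE hotel (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    locality_id INTEGER REFERENCES locality(id),
    description TEXT,
    original_images TEXT,   -- assumed: serialized list of image URLs
    display_images TEXT,    -- assumed: serialized list of CDN URLs
    is_active INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE rooms (
    id INTEGER PRIMARY KEY,
    hotel_id INTEGER NOT NULL REFERENCES hotel(id),
    display_name TEXT NOT NULL,
    is_active INTEGER NOT NULL DEFAULT 1,
    quantity INTEGER NOT NULL,
    price_min REAL NOT NULL,
    price_max REAL NOT NULL,
    CHECK (price_min <= price_max)
);
CREATE TABLE facilities (id INTEGER PRIMARY KEY, display_name TEXT NOT NULL);
CREATE TABLE hotels_facilities (
    hotel_id INTEGER NOT NULL REFERENCES hotel(id),
    facility_id INTEGER NOT NULL REFERENCES facilities(id),
    is_active INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (hotel_id, facility_id)
);
CREATE TABLE room_facilities (
    room_id INTEGER NOT NULL REFERENCES rooms(id),
    facility_id INTEGER NOT NULL REFERENCES facilities(id),
    is_active INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (room_id, facility_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

The composite primary keys on the two mapping tables are what make them many-to-many: one row per (hotel, facility) or (room, facility) pair.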
One more thing to note here: if you remember the original design, I did not
keep a Redis cache on top of this MySQL database, but I did keep a Redis cache
on the other MySQL database, the Booking DB. Now, why is that? We could have
kept a cache on top of this one too, and all these GET APIs could have been a
bit faster, right. But this is not in the critical path of any high-throughput
business interaction. Customers are not querying this database or this service;
they are always querying the Search Service. So if this service is a little bit
slow, that's okay, whereas adding a Redis cluster is a cost. You need to do a
trade-off analysis between the infrastructure cost you are adding and the
benefit it brings. If it is worth it, you might as well go and add a Redis
cluster here, but I don't think it is worth it, and that's the reason I did not
add it. Now let's look at the internal functioning of the Booking Service.
We'll start off by walking through the DB schema. Again, it's not a
full-fledged schema; a lot of details are missing, like bookkeeping information
such as created_time, updated_time and all of that, but let's focus on the
meaty part here.
So it has a table called available_rooms, which has a room_id, a date, an
initial_quantity that comes from the Hotel Service, and an available_quantity.
available_quantity is basically the number of rooms remaining for that
particular room_id on that particular date. Now, it has a constraint saying it
cannot go negative. This is where we are utilizing the true power of MySQL, and
that's the reason why I chose to use MySQL here. The other table here is the
booking table. It has a booking_id, which is the primary key here and which
will be referenced across the whole system. It has a room_id, which again comes
from the rooms table; a user_id; a start_date and an end_date for the booking;
number_of_rooms, which is how many rooms the person has booked; a status; and
an invoice_id. Looking at this design, we can clearly see that one booking
cannot contain different room types. You can have multiple rooms of the same
room type, but you cannot have, say, one deluxe room and one regular room in
one booking. If you want that, there'll be a small change required, but I think
that's a minor detail that can be taken care of easily. The important part here
is the status column. It has these four values: RESERVED, BOOKED, CANCELLED and
COMPLETED. CANCELLED and COMPLETED are the terminal statuses here. A booking
first gets created in RESERVED status. Then, based on the payment outcome, it
moves to either BOOKED or CANCELLED. And once the user has completed their
stay, it moves to COMPLETED. Now, you can add more statuses depending upon your
conversation with your interviewer, but these four statuses are the main ones
that will help us achieve what we initially thought of.
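The status lifecycle can be pinned down as a small transition table. This is a sketch: allowing BOOKED to move to CANCELLED is my assumption (it covers a hotel cancelling a confirmed booking), and the function name is hypothetical.

```python
from enum import Enum


class BookingStatus(Enum):
    RESERVED = "RESERVED"
    BOOKED = "BOOKED"
    CANCELLED = "CANCELLED"
    COMPLETED = "COMPLETED"


# Which statuses each status may move to; terminal statuses map to nothing.
ALLOWED = {
    BookingStatus.RESERVED: {BookingStatus.BOOKED, BookingStatus.CANCELLED},
    BookingStatus.BOOKED: {BookingStatus.COMPLETED, BookingStatus.CANCELLED},
    BookingStatus.CANCELLED: set(),
    BookingStatus.COMPLETED: set(),
}


def transition(current, target):
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


status = transition(BookingStatus.RESERVED, BookingStatus.BOOKED)
status = transition(status, BookingStatus.COMPLETED)
```

Routing every status change through one function like this makes it hard for a late event to resurrect a booking that is already in a terminal state.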
Now, let's look at the API signature. This will have one important API called
the book API. It will be a POST API which takes these five attributes: a
user_id, a room_id, a quantity, a start_date and an end_date. Again, if you
want to book multiple room types with different quantities, we'll have to
change it a bit to take an array, but let's stick to this for now. The price
will come from somewhere else, let's assume for now. It'll actually come from
the data store which contains the price for the room at this point in time. We
generally don't want to take the price from the user, because then the request
can be tampered with, and that's not really a good design. Okay, now let's do a
quick revision of the design, because I skipped some important details in the
earlier, larger diagram, and we'll go over those now. So the way the Booking
Service actually works is this: when it gets a request to do a booking, it
first of all queries the available_rooms table and checks whether or not we
have that many rooms remaining. If there are not enough rooms left for that
particular room_id on those particular dates, there's no point in proceeding,
so we can error out from there. But in case that check succeeds and we have
rooms, then we go ahead with blocking the room: we block it temporarily, and if
the payment succeeds, we actually book the room. I'll do a quick dry run of
what actually happens.
Assume this is the request that came in: user_id: 1 | room_id: 5 | quantity: 1,
for some date "dt" to "dt + 1". The room_id: 5 on that particular date "dt" has
7 available rooms, so our first check is a success: we have enough rooms. What
essentially will happen is that a row will be created in the booking table with
booking_id: (some_uuid) | room_id: 5 | user_id: 1 | start_date: dt |
end_date: dt + 1 | number_of_rooms: 1 (the quantity in the request) |
status: RESERVED. invoice_id at this point in time would be NULL, because no
invoice has been created yet. Now we have a record, and along with that we also
decrement the quantity here. Here again we are utilizing a very important
feature of MySQL, which is ACID transactions. We are creating a record here
[booking table] and we are reducing the quantity here [available_rooms] to 6.
What we are essentially doing is binding both of these into one transaction. So
let's say there was just one room left and two or three requests came in: only
one transaction would succeed in doing both of these things, that is, inserting
this record and reducing the quantity, because we have this constraint sitting
over here which says that the quantity cannot be negative.
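Here is a runnable sketch of that reserve step. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the reserve function, the column types and the sample date are assumptions for illustration. (If you try this on MySQL itself, note that CHECK constraints are only enforced from version 8.0.16 onward.) The two requests below arrive one after the other; with real concurrent connections the database's locking plays the same role.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE available_rooms (
    room_id INTEGER NOT NULL,
    date TEXT NOT NULL,
    initial_quantity INTEGER NOT NULL,
    available_quantity INTEGER NOT NULL CHECK (available_quantity >= 0),
    PRIMARY KEY (room_id, date)
);
CREATE TABLE booking (
    booking_id TEXT PRIMARY KEY,
    room_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL,
    start_date TEXT NOT NULL,
    end_date TEXT NOT NULL,
    number_of_rooms INTEGER NOT NULL,
    status TEXT NOT NULL,
    invoice_id TEXT
);
INSERT INTO available_rooms VALUES (5, '2024-01-10', 1, 1);
""")


def reserve(conn, user_id, room_id, quantity, start_date, end_date):
    """Insert a RESERVED booking and decrement availability in one transaction.

    Returns the new booking_id, or None if there are not enough rooms.
    """
    booking_id = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO booking VALUES (?, ?, ?, ?, ?, ?, 'RESERVED', NULL)",
                (booking_id, room_id, user_id, start_date, end_date, quantity),
            )
            cur = conn.execute(
                "UPDATE available_rooms "
                "SET available_quantity = available_quantity - ? "
                "WHERE room_id = ? AND date >= ? AND date < ?",
                (quantity, room_id, start_date, end_date),
            )
            if cur.rowcount == 0:
                raise sqlite3.IntegrityError("no availability rows for these dates")
        return booking_id
    except sqlite3.IntegrityError:
        return None  # the constraint fired, so the INSERT was rolled back too


# One room left, two requests: only the first one gets a booking.
first = reserve(conn, 1, 5, 1, "2024-01-10", "2024-01-11")
second = reserve(conn, 2, 5, 1, "2024-01-10", "2024-01-11")
```

The point to notice is that the failed second request leaves no half-written booking row behind, because the insert and the decrement succeed or fail together.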
Okay, so only one of the transactions will succeed, only one room will be
booked, and no two users will be redirected to payment for the same room. With
that taken care of, what is the next step? I have written down the steps here
if you want to look at them. What we have gone through till now is step number
one and step number two: we've inserted into booking and reduced the quantity
in available_rooms. Step number three is something that I did not cover as part
of the larger design review because it was getting too cluttered. Now, we
cannot keep this room reserved for an infinite amount of time. What we can say
is: if the payment succeeds in the next five minutes, well and good; if not,
we'll assume that the payment will not go through, and we'll unblock the room
so that somebody else can book it, okay.
So there are multiple ways to implement that. What I chose to implement here is
something using the TTL (Time To Live) of Redis. Because we are anyway using a
Redis, we can utilize the same Redis cluster for this use case as well. What
we'll do is put a key in Redis saying that some booking_id expires at some
timestamp. Now, the expiry time could be a configurable number: it could be
fixed across the board, or it could be country-specific, say five minutes for
India and four minutes for the US, something of that sort. Whatever that time
is, we'll insert it into Redis. Now, Redis has something I'm calling callbacks
here; in Redis itself the feature is called keyspace notifications. It was
introduced in one of the later versions of Redis and has to be enabled in the
configuration. Whenever a key expires, subscribers get a notification, okay.
And you can do whatever you need to do at that point in time, right.
So, if you get a success notification from payment, well and good. A success
notification means the payment has gone through, so you will mark the booking
as BOOKED. But if, before that, you get a callback from Redis saying that the
key has expired and you've not got the success from payment, you will mark the
booking as CANCELLED. Alternatively, you could also get a failure response from
the Payment Service saying that, for whatever reason, the payment didn't go
through; in that case again you mark it CANCELLED.
Now, if you want to distinguish between the varieties of CANCELLED, you can
maybe make multiple statuses, like cancelled because of invoicing, cancelled
because of payment, cancelled because of expiry, whatever, right. Or you could
maybe add a status_reason column or something of that sort. But that's a very
minor detail; we'll skip that for now, okay.
So let's go over all the possibilities here and how each of them behaves, okay.
The first, very simple case: what happens when the payment is a success? In
that case everything remains the same; just the status becomes BOOKED. We also
get an invoice_id: the Payment Service gives us an invoice_id whenever a
booking succeeds, and we'll just update the invoice_id there. The regular Kafka
events would also be sent, saying the booking is now confirmed, in case
somebody wants to do something on that. What happens when the payment fails? We
just have these four statuses, so the booking status will become CANCELLED.
There would be no invoice_id in this case. Why? Because if the payment did not
go through, there obviously is no invoice generated. Everything else remains
the same, but if the payment did not go through we need to revert the
available_quantity, so in that case it would go back to seven. Now let's say
your key expired: the user was redirected to the payment screen and there was
no response from the Payment Service for whatever reason. What happens then? We
get a callback from Redis, and based on that callback we can say that the
payment has not gone through. We will follow the same process as for a payment
failure: we will mark the booking CANCELLED and we'll increment the
available_quantity so that the room is available for somebody else. Again, in
that scenario there is no invoice generated. But this you do only if the status
is RESERVED.
Why? That brings us to the next case: what happens if both (3) and (1) happen,
that is, you get a key expiry event and the payment is also a success? There
are two orderings. If the payment has already succeeded and the booking has
already been moved to BOOKED status, and after that you get the key expired
event, then you don't do anything, because the key was bound to expire anyway,
right? But what if it happens the other way around? What if the key expired
first, you moved the booking to CANCELLED state, and then you get a
notification saying the payment is a success? Now, there are multiple
directions you could take, based on your conversation with your interviewer and
the non-functional requirements, and in fact even the functional requirements
for that matter. You could do two or three things. You could refund the
payment, saying that for whatever reason we were not able to book the room, so
here's your money back. Alternatively, you could do something even smarter. You
could say: I have anyway got the payment from the user, so I can check if there
are rooms still available and book them, right? This could be done based on
what the requirement is, and you could talk to your interviewer and implement
it either way.
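The four cases can be sketched as three event handlers over an in-memory store. All names here are hypothetical, a single counter stands in for available_quantity, and returning "REFUND" for a late payment is just one of the policies mentioned above, picked for the example. In a real deployment on_key_expired would be driven by a subscriber to Redis expired-key notifications; here it is called directly.

```python
class BookingStore:
    """In-memory stand-in for the booking and available_rooms tables."""

    def __init__(self, available):
        self.available = available   # available_quantity for one room and date
        self.bookings = {}           # booking_id -> status, rooms, invoice_id

    def reserve(self, booking_id, rooms):
        if rooms > self.available:
            return False
        self.available -= rooms
        self.bookings[booking_id] = {
            "status": "RESERVED", "rooms": rooms, "invoice_id": None,
        }
        return True


def on_payment_success(store, booking_id, invoice_id):
    booking = store.bookings[booking_id]
    if booking["status"] == "RESERVED":
        booking["status"] = "BOOKED"
        booking["invoice_id"] = invoice_id
        return "BOOKED"
    # Expiry won the race and the room was already released.
    # Assumed policy for this sketch: ask for the payment to be refunded.
    return "REFUND"


def _cancel_if_reserved(store, booking_id):
    booking = store.bookings[booking_id]
    if booking["status"] != "RESERVED":
        return False                 # already BOOKED or CANCELLED: do nothing
    booking["status"] = "CANCELLED"
    store.available += booking["rooms"]   # give the rooms back
    return True


def on_payment_failure(store, booking_id):
    return _cancel_if_reserved(store, booking_id)


def on_key_expired(store, booking_id):
    return _cancel_if_reserved(store, booking_id)


store = BookingStore(available=7)
store.reserve("b1", 1)
paid = on_payment_success(store, "b1", "inv-1")   # normal path
late_expiry = on_key_expired(store, "b1")          # ignored: already BOOKED
store.reserve("b2", 1)
expired = on_key_expired(store, "b2")              # room goes back to the pool
late_payment = on_payment_success(store, "b2", "inv-2")
```

The status check inside _cancel_if_reserved is the "only if the status is RESERVED" guard, and it is what makes the two orderings of expiry and payment safe.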
All good so far, but there are a couple of caveats here. The TTL that we have
talked about is not a very precise measure. Let's just say that a key was
supposed to expire at 10:00, okay. You will probably never get the callback at
exactly that point in time; it will always have some delay. Now, in this case
it doesn't matter too much: if you get it at 10:01 instead of 10:00, that's
okay, so it's not too big of a problem. The reason for the delay is the way
expiry is implemented in Redis. I'll not go too much into the detail of that,
but a key is only actually removed either when someone accesses it or when a
background process in Redis, which periodically samples keys that have not been
accessed, gets to it. The notification fires when the key is removed, so it is
not necessarily at exactly the time the TTL runs out. But let's say you wanted
it to be really precise. Then you could tweak the implementation a bit and do
it slightly differently: instead of a TTL-based approach, you could implement a
queue within Redis and have a poller that checks the head of the queue every
second and handles whichever entries have expired. That's obviously more
precise, but it comes at a cost. You'll have to build a polling mechanism, so
that's additional development effort, and then it will be continuously hitting
Redis every second, so there's a lot of CPU being utilized on both sides, on
the poller side and on the Redis side. Possibly you'll have to add more nodes
to the Redis cluster and also on the side where the poller is deployed. So
that's a trade-off: do you want to be notified almost immediately when the keys
are supposed to expire, at the cost of additional hardware? That trade-off you
can again make in conversation with your interviewer. But all of this being
said, I would still go with the TTL-based approach, because in this particular
example it doesn't really matter so much.
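For comparison, the queue-and-poller alternative can be sketched with a min-heap ordered by expiry time. This is an in-memory stand-in, and the class and method names are hypothetical; holding the queue in a Redis sorted set scored by expiry timestamp would be one way to do it for real, but that mapping is my assumption.

```python
import heapq


class ExpiryQueue:
    """Min-heap of (expires_at, booking_id); the earliest deadline is on top."""

    def __init__(self):
        self._heap = []

    def add(self, booking_id, expires_at):
        heapq.heappush(self._heap, (expires_at, booking_id))

    def pop_expired(self, now):
        """Remove and return every booking_id due at or before now."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            expired.append(heapq.heappop(self._heap)[1])
        return expired


queue = ExpiryQueue()
queue.add("b1", expires_at=600)
queue.add("b2", expires_at=300)

# A poller would call pop_expired once a second with the current time;
# here the clock values are passed in by hand.
too_early = queue.pop_expired(now=299)
on_time = queue.pop_expired(now=300)
later = queue.pop_expired(now=1000)
```

Each poll only inspects the head of the queue, so an idle tick is cheap, but the once-a-second polling is exactly the extra load the trade-off above is about.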
Now a couple of optimizations you could do. So, let's just say payment is success. You know that key will expire after some time,
for sure, right? because it's there in Redis. You don't need to keep that key there you know the payment is success you can
evict the key right, even if the for the payment
failure case you know that payment has failed it will
expire after five minutes might as well delete the key then and there, right.
So these are certain optimizations that you could do on top of
this implementation to make it even better. But by and large this is how the booking flow works. Now, again
reiterating, we have used a couple of important features of MySQL,
and that is what helps us keep the code on the application side much
simpler. Had we used some other database which
doesn't provide them, for example if we were using Cassandra here,
we would not have had access to transactions, constraints and all
of that, and we would have had to implement them on the
application side. That's additional effort on our side to make sure things
are consistent. In this case I would rather leave it to
MySQL to take care of all of those things.
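To make the point about transactions and constraints concrete, here is a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs on its own, and the table and column names (`available_rooms`, `bookings`, `available`) are assumptions made for illustration. The idea is that the database, not the application, refuses to let the room count go below zero, and the transaction rolls back the booking row when that happens.

```python
import sqlite3

# SQLite in memory, standing in for MySQL (assumption for a runnable example).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE available_rooms (
        room_id   INTEGER PRIMARY KEY,
        available INTEGER NOT NULL CHECK (available >= 0)
    )
    """
)
conn.execute(
    "CREATE TABLE bookings (booking_id INTEGER PRIMARY KEY, room_id INTEGER NOT NULL)"
)
conn.execute("INSERT INTO available_rooms (room_id, available) VALUES (1, 1)")
conn.commit()


def reserve(room_id: int) -> bool:
    """Insert a booking and decrement inventory in one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO bookings (room_id) VALUES (?)", (room_id,))
            conn.execute(
                "UPDATE available_rooms SET available = available - 1 WHERE room_id = ?",
                (room_id,),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative count; the booking row is rolled back too.
        return False


first = reserve(1)   # takes the last room
second = reserve(1)  # rejected by the constraint
booking_count = conn.execute("SELECT COUNT(*) FROM bookings").fetchone()[0]
remaining = conn.execute(
    "SELECT available FROM available_rooms WHERE room_id = 1"
).fetchone()[0]
```

Notice the application code has no "is there a room left?" check at all. With a store that has no transactions or constraints, that check and the cleanup of the half-written booking would both have to be written by hand.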
Now, coming back to the same architecture again, I just want to call out that all of the
components that you see here are individually horizontally scalable.
So let's say there's a traffic spike on one of the components:
we could increase the number of nodes in that particular service,
or maybe that particular database, and then that should work just fine.
As far as the Kafka and Hadoop clusters are concerned, we could add more nodes to
them as well, and they should also scale to a much larger
volume than what we need. Cool, so now let's look at what kind of
alternatives we could have used instead of these particular design choices.
So first of all, why MySQL? We could use any other relational database here.
We could use Postgres, we could use SQL Server;
anything which provides ACID guarantees should be fine here.
As far as Redis is concerned, we could use Memcached or any other cache
instead of Redis and that should also be good.
Cassandra, I would still stick to Cassandra because
that is exactly what we need here. Now technically, in place of Cassandra we
could also use HBase, and that would also work fine, but it has a
lot of operational overhead in terms of deploying and maintaining it over time,
so that's the reason I would prefer Cassandra over HBase or any other similar
system. The way Cassandra works is that all data in
Cassandra is sharded by a partition key,
so each query has to happen on a partition key. Now the
queries that we are doing are of just two varieties:
1) get bookings by hotel, or 2) get bookings by user. There is no third variety.
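The two query patterns above map onto two partition keys. A minimal sketch of that idea, using plain Python dictionaries rather than Cassandra itself: every booking is written twice, once keyed by hotel and once keyed by user, so each query touches exactly one partition. The names `BookingArchive` and `node_for`, the three-node cluster and the MD5-based placement are all assumptions for illustration; real Cassandra uses its own partitioner to place partitions on a token ring.

```python
import hashlib
from collections import defaultdict
from typing import DefaultDict, Dict, List

NODE_COUNT = 3  # hypothetical cluster size


def node_for(partition_key: str) -> int:
    """Pick a node from a stable hash of the partition key (simplified)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODE_COUNT


class BookingArchive:
    """Two copies of each booking, one per query pattern."""

    def __init__(self) -> None:
        self.by_hotel: DefaultDict[str, List[Dict[str, str]]] = defaultdict(list)
        self.by_user: DefaultDict[str, List[Dict[str, str]]] = defaultdict(list)

    def add(self, booking: Dict[str, str]) -> None:
        # Write the same row under both partition keys.
        self.by_hotel[booking["hotel_id"]].append(booking)
        self.by_user[booking["user_id"]].append(booking)

    def bookings_by_hotel(self, hotel_id: str) -> List[str]:
        return [b["booking_id"] for b in self.by_hotel.get(hotel_id, [])]

    def bookings_by_user(self, user_id: str) -> List[str]:
        return [b["booking_id"] for b in self.by_user.get(user_id, [])]


archive = BookingArchive()
archive.add({"booking_id": "b1", "hotel_id": "h1", "user_id": "u1"})
archive.add({"booking_id": "b2", "hotel_id": "h1", "user_id": "u2"})
archive.add({"booking_id": "b3", "hotel_id": "h2", "user_id": "u1"})

hotel_bookings = archive.bookings_by_hotel("h1")
user_bookings = archive.bookings_by_user("u1")
```

Because neither query ever needs to scan across partitions, a store that is organized around partition keys fits this access pattern well.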
So we basically have two kinds of data, distributed by two different
partition keys, on which the queries are happening. So Cassandra is a very good choice here. In place of Kafka,
we could have used ActiveMQ or RabbitMQ or any other queueing
mechanism. There's Amazon's queue (SQS) as well; we could
have used that too, but I think Kafka scales much better than most of them, so
I think it's a fairly good choice here. Other than that, in general we definitely
need to monitor how our CPU and memory are behaving.
So if we have a CPU spike at certain points in time, that is something we need
to look at. Across the whole infrastructure we
need to keep an eye on what the CPU usage percentage is, what the
memory usage percentage is, what the disk usage for Redis is, what the disk
usage for Elasticsearch is. All of these things are what we need to
monitor. Now, monitoring could be done through a
Grafana kind of tool on which we can set up alerts. So
let's say a particular metric has some threshold:
the moment we cross that threshold, or under certain conditions,
we could send out an alert, and the team gets notified that something is
potentially wrong and they need to look at it.
This will help us make sure that in the end we achieve the
NFRs that we talked about, low latency and high availability.
Because let's say something goes wrong, let's say
memory is utilized more than we expected:
eventually it will lead to some machines going down, and that will lead
to us having lower availability than we expected. So yeah, these are the things that we need to monitor and alert on.
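The threshold-based alerting described above boils down to a very small rule check. Here is a minimal sketch of it in plain Python; in practice a tool like Grafana would evaluate rules of this shape for you. The metric names, the threshold values and the host name `redis-1` are all made up for the example.

```python
from typing import Dict, List

# Hypothetical alert thresholds, in percent.
THRESHOLDS: Dict[str, float] = {
    "cpu_percent": 80.0,
    "memory_percent": 85.0,
    "disk_percent": 90.0,
}


def check_alerts(host: str, metrics: Dict[str, float]) -> List[str]:
    """Return one alert message per metric that is above its threshold."""
    alerts: List[str] = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{host}: {name} at {value:.0f}% exceeds {limit:.0f}%")
    return alerts


alerts = check_alerts(
    "redis-1",
    {"cpu_percent": 45.0, "memory_percent": 92.0, "disk_percent": 60.0},
)
```

Here only the memory metric crosses its threshold, so only one alert is produced, and that is the kind of early signal that lets the team act before the machine actually goes down.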
Now in the next section let's look at how this whole thing would be spread across geographies. For example, let's say
there's an earthquake at one of the data centers and everything just goes away out of the blue. What do we do?
So let's look at that next. Let's say we have these four data
centers, data center 1, data center 2, data center 3 and data center 4, which
are located in different geographical regions across the globe, okay.
Now we want to create a topology such that we get low latency and high availability.
One very simple approach is to say that DC1
is our primary and the other three DCs are our secondary data centers,
and data is replicated to all three of them in near real time. Okay, so that's good enough, but it's not very
good to be honest, because we are using just 25 percent of
our capacity as the active primary, and the other
three data centers are sitting idle and not really doing anything.
So let's try to improve on that. What we could instead do
is divide the data centers, and thus the globe, into two parts.
What we could say is this is region one and this is region two. Now the
countries or people accessing our services
who are closer to this region (R1) will connect to this region (R1),
and the people who are closer to that region (R2) will connect to that region (R2).
Now how are we able to do that? The data in a hotel management system is fairly specific to a geography, so all the hotels in, let's say, India can be separated from all the hotels in the US.
Similarly, all the rooms and all the bookings are specific to hotels and thus
specific to a geography, so we can split the data
by geography, which gives us the leverage to
divide the system into two halves. Now what will happen here?
Let's say DC1 is the primary in this region (R1) and DC3 is
the primary in R2. All the data in DC1 is getting replicated to DC2 in near
real time, so if DC1 goes down, DC2 can become
active. And how will the clients who were connecting
to DC1 reach it? There will be a bunch of clients
connecting via some DNS to DC1; if DC1 goes down and that link is broken, DNS can flip and connect them to DC2. A similar thing can happen on the other side. So this way what we are doing is basically
dividing our infrastructure into two halves, so that
clients who are closer to a region are connecting to the servers that are
closer to them, thus giving them lower latency.
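The routing and failover just described can be sketched as a tiny lookup. This is a simplified model of what the DNS layer would do, not a real DNS configuration: the `resolve` function is hypothetical, and which country maps to which region is an assumption. The DC layout follows the description above, with DC1 primary and DC2 secondary in R1, and DC3 primary and DC4 secondary in R2.

```python
from typing import Dict, List, Set

# First data center in each list is the primary, the next is its replica.
REGION_DCS: Dict[str, List[str]] = {"R1": ["DC1", "DC2"], "R2": ["DC3", "DC4"]}

# Assumed mapping of client countries to their nearest region.
COUNTRY_REGION: Dict[str, str] = {"IN": "R1", "US": "R2"}


def resolve(country: str, down: Set[str]) -> str:
    """Return the data center a client should connect to, skipping any that are down."""
    region = COUNTRY_REGION[country]
    for dc in REGION_DCS[region]:
        if dc not in down:
            return dc
    raise RuntimeError(f"no healthy data center in {region}")


normal = resolve("IN", down=set())           # primary of R1
failed_over = resolve("IN", down={"DC1"})    # DNS flips to the replica
other_region = resolve("US", down={"DC1"})   # R2 is unaffected by R1's outage
```

A failure in one region never changes where the other region's clients go, which is exactly what keeping the two halves independent buys us.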
Now we could go even one step further. We could say that we'll divide things into four regions, and we could go
as deep as we want with this to
reduce the latency and
increase the availability. But I think for all practical purposes,
at least for a hotel booking system, this R1/R2 split is more than sufficient
to give us good enough latency and very high availability.
So I think yeah that should be it for a Hotel Booking System.