Hi everyone! Welcome to CodeKarle.
In this video, we'll be looking at another very common System Design Interview problem,
which is to design an ecommerce application, something very similar to
Amazon or Flipkart or anything of that sort. Let's look at some functional and
non-functional requirements(NFRs) that this platform should support.
So the very first thing is that people should be able to search for whatever product they want to buy, and along with searching, we should also be able to tell them whether we can deliver it to them or not.
Let's just say a particular user is in a very remote location where we cannot deliver
a particular product; then we should clearly call that out right on the search page.
Why? Because let's say a user has seen a hundred products, goes to the checkout flow,
adds one to the cart, and only then do we tell them that we can't deliver;
that's a very bad user experience. So on the search page itself we
should be able to tell them that we cannot deliver, or, if we can,
then by when we will be able to deliver it to them. The next thing is
there should be a concept of a cart or a wishlist or something of that sort so
that people can basically add a particular item into a cart.
The next thing is people should be able to check out, which is basically making a
payment and completing the order. We will not look at how the payment
exactly works, like integrating with payment gateways and all of that, but we'll broadly
look at how the overall flow works. The next thing is people should be able to
view all of their past historical orders as well: the ones that have been delivered, the
ones that have not been delivered, everything. From the non-functional side,
the system should have a very low latency. Why? - because it will be a bad
user experience if it's slow. It should be highly available and it should be
highly consistent. Now high availability, high consistency, and low
latency all three together sound a bit too much, so let's break it down.
Not everything needs all three. Some components, mainly the ones
dealing with payments and inventory counting, need to be
highly consistent, at the cost of availability. Certain components,
like search, need to be highly available, maybe at the cost of
consistency at times. And overall, most of the user-facing components
should have a low latency. Now let's start with the overall architecture of
the whole system. We will do it in two parts. First, we'll look at the home screen and the search screen, and then we'll look at the
whole checkout flow. Before we get to the real thing, let's set up a bit of
convention. Things in green are basically the user interfaces. It could
be browsers, mobile applications, anything. These two are not just load
balancers but also reverse proxies, with an authentication and authorization layer
in between that will authenticate all the requests coming from the outside
world into our system. All the things in blue that you see here are the
services we have built; they could be Kafka consumers, Spark jobs,
anything. And the things in red are the databases or clusters,
be it a Kafka cluster, a Hadoop cluster, or any public-facing thing that we've used.
Now let's look at the user interfaces we'll have.
We mainly have two. One is the home screen, which would be the first
screen a user comes to; it would by default show some recommendations based
on the past record of that user, or, in the case of a new user, some general recommendations.
The search page would just be a text box wherein the user can put in
some text and we will give them the search results. With that, let's look at how the data flow begins.
So a company like Amazon would have various suppliers that
they have integrated with. These suppliers would be managed by
various services on the supplier front; I am abstracting all of them out as something called an Inbound Service.
What that does is talk to the various supplier systems and get all of the data. Now let's say a new item has come in,
or a supplier is onboarding a new item. That information comes
through a lot of services, and through the Inbound Service, into a Kafka. That is basically the
supplier world feeding into the whole search and user side of things. Now
there are multiple consumers on that Kafka, which will process that
particular piece of information to flow into the user world.
Let's look at what that is. The very first thing is
something called an Item Service. The Item Service has a lot of components talking to it, and if I start drawing all
the lines it becomes too cluttered, so I've not drawn any connections to the Item
Service. But it is one of the important consumers listening to this
Kafka topic, and what it does is onboard new items.
It is also the source of truth for all items in the ecosystem.
So it will provide various APIs: get an item by item ID, add a new item, remove an
item, update details of a particular item. It
will also have an important API to bulk GET a lot of items: a GET API that takes a
lot of item IDs and, in response, gives details about all of those items.
Now the Item Service sits on top of a MongoDB. Why do we need a Mongo here?
Item information is fundamentally very unstructured. Different item types
will have different attributes, and all of that needs to be stored in a queryable
format. If we try to model that in a structured form in a MySQL kind of
database, that would not be an optimal choice. So that's why we've used a Mongo.
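To make that concrete, here is a minimal sketch of what such schemaless item documents could look like, using plain Python dicts in place of Mongo documents. The field names (`item_id`, `attributes`) are illustrative assumptions, not the actual schema.

```python
# Hypothetical item documents: each product type carries a different
# attribute set, yet all of them can live in the same "items" collection.
shirt = {
    "item_id": "I101",
    "name": "Red Cotton T-Shirt",
    "attributes": {"size": "L", "color": "red"},
}
tv = {
    "item_id": "I102",
    "name": "55-inch LED TV",
    "attributes": {"screen_size_inches": 55, "display": "LED"},
}
bread = {
    "item_id": "I103",
    "name": "Wheat Bread",
    "attributes": {"type": "wheat", "weight_grams": 400},
}

def matching_items(items, **filters):
    """Filter on arbitrary attribute keys, the way a Mongo query would."""
    return [
        i for i in items
        if all(i["attributes"].get(k) == v for k, v in filters.items())
    ]
```

A structured MySQL schema would need either a column for every possible attribute or a separate key-value table; a document store sidesteps both.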
To take an example, look at the attributes of some products.
A shirt would have a size attribute and probably a
color attribute, like a large-size t-shirt of red color.
Something like a television would have a screen size, say a 55-inch TV with an LED
display. There could be other things like bread, which could have
a type and a weight, like 400 grams of wheat bread or
sweet bread or something of that sort. So it's fairly unstructured data, and
that's the reason I've used a Mongo database here. Now, coming to the other consumers that will
be sitting on that Kafka. So on this Kafka, there is something
called as a Search Consumer. What this does is basically whenever a new item
comes in, this search consumer is responsible for making sure that item is
now available for the users to query on. So it reads through all the items that
are coming in, puts them in a format that the search system understands, and
stores them in its database. Now, search uses a database called Elasticsearch,
which is again a NoSQL database, very efficient at
doing text-based queries. This whole search will happen either on the product
name or product description, and maybe with some filters on the product
attributes. Those kinds of things are something that Elasticsearch is very
good at. Plus, in some cases you might also want to do a fuzzy search, which is
also supported very well by Elasticsearch. Now on top of this Elasticsearch,
there is something called as a Search Service. This Search Service is basically
an interface that talks to the front end or any other component that wants to
search anything within the ecosystem. It provides various kinds of APIs to filter
products, to search by a particular string or anything of that sort. And the
contract between the Search Service and the Search Consumer is fixed, so both of them
understand the type of data that is stored within Elasticsearch. That is
the reason the consumer is able to write it and the Search Service is able to search for it.
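As a rough sketch of that shared contract, assume both sides agree on a tiny document shape; an in-memory list stands in for Elasticsearch here, and the event fields are assumptions.

```python
# The Search Consumer writes documents in a shape the Search Service
# also knows how to query. A list stands in for the Elasticsearch index.
index = []

def consume_item_event(event):
    """Search Consumer: flatten a supplier item event into a search doc."""
    index.append({
        "item_id": event["item_id"],
        "name": event["name"].lower(),
        "description": event.get("description", "").lower(),
    })

def search(query):
    """Search Service: naive substring match standing in for an ES query."""
    q = query.lower()
    return [d["item_id"] for d in index
            if q in d["name"] or q in d["description"]]
```

In the real system the service would issue match and fuzzy queries against Elasticsearch rather than scanning a list, but the fixed document shape is what makes both ends interoperate.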
Now, when a user wants to search something, there are two main things to it.
One is figuring out the right set of items to display,
but there is also an important aspect: we should not
show items that we cannot deliver to the customer. For example, if a customer
is staying in a very remote location and we are not able to deliver big
items like refrigerators to that pincode, we should not show that
search result to the user at all, because otherwise it's a bad experience:
they will see the result but will not be able to order it, right?
So the Search Service talks to something called a Serviceability and TAT (Turn Around Time) Service.
This does a lot of things. First of all, it tries to figure out
where exactly the product is, in which warehouses. Then, given one or some of those warehouses, it tries to see: do I have a way to deliver products from
this warehouse to the customer's pincode? And if I do, what
kinds of products can I carry on that route? Certain routes can carry
all kinds of products, but certain routes might not be able to carry things like big products, right? All of
that filtering stays within the Serviceability and TAT Service.
It also does one more thing: it tells you in how much time we will be able to
deliver, in some number of hours or days; it can say that
we will probably be able to deliver in 12 hours,
or 24 hours, or something of that sort. Now, if serviceability says we cannot deliver, Search will simply filter
out those results and return the remaining ones.
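A hedged sketch of that filtering step, where `serviceability()` is a stub for the real network call to the Serviceability and TAT Service; the pincodes and item IDs are made up for illustration.

```python
def serviceability(pincode, item_id):
    """Stub: pretend bulky items cannot reach remote pincodes."""
    remote_pincodes = {"999999"}
    bulky_items = {"FRIDGE-1"}
    if pincode in remote_pincodes and item_id in bulky_items:
        return None                      # not deliverable
    return {"tat_hours": 24}             # deliverable, with a promise time

def filter_results(results, pincode):
    """Search drops anything serviceability says it cannot deliver."""
    deliverable = []
    for item_id in results:
        promise = serviceability(pincode, item_id)
        if promise is not None:
            deliverable.append((item_id, promise["tat_hours"]))
    return deliverable
```

In practice Search would batch the serviceability lookups rather than call once per item, but the shape of the decision is the same.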
Now Search might talk to the User Service. There's something called a
User Service here; we'll get into that later, but the User Service is basically
the source of truth for the user data, and Search can query it
to fetch some attributes of the user, probably a default address or something
of that sort, which can be passed as an argument to the
Serviceability Service to check whether we can deliver or not.
Now the Search Service returns the response to the user, which can be rendered so
people can see whatever they want to see. Now, each time a search
happens, an event is put into Kafka. The reason behind this is that whenever
somebody searches for something, they are telling you an intent to
buy a kind of product, right? That is a very good source for building
recommendations. We'll look at how recommendations can be built later on, but
this is an input into the recommendation engine, and we use Kafka to pass it along. So each search query goes into
Kafka, saying this user ID searched for this particular product. Now, from the
search screen, a user can wishlist a particular product, or add it
to the cart and buy it, right? All of that is done using the Wishlist
Service and the Cart Service. The Wishlist Service is the repository of
all the wishlists in the ecosystem, and the Cart Service is the repository of all
the carts in the ecosystem. A cart is basically a shopping bag into which people
put products before they check out. Now both these services
are built in almost exactly the same way. They provide APIs to add a product to a
user's cart or wishlist, get a user's cart or wishlist,
or delete a particular item from it. They would have a very similar
data model, and they both sit on their own MySQL databases. From a
hardware standpoint, I'd like to keep these two on separate hardware, just in case,
for example, the wishlist data becomes too big and needs to scale out; then we can
scale that particular cluster accordingly. But otherwise,
functionally, both of them are totally the same. Now, each time a person
puts a product into a wishlist, they are again giving you a signal. Each time they
add something to their cart, they are again giving a signal
about their preferences, the things they want to buy, right?
All of those events could again be put into Kafka for a very similar kind of analytics.
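Since the Cart and Wishlist services expose the same operations, one sketch covers both; each instance would just point at its own MySQL cluster. A dict stands in for the database here, and the API names are assumptions.

```python
# One class, two deployments: cart and wishlist differ only in which
# MySQL cluster backs them, so the API surface can be shared.
class ItemListService:
    def __init__(self):
        self._store = {}          # user_id -> set of item_ids

    def add(self, user_id, item_id):
        self._store.setdefault(user_id, set()).add(item_id)

    def get(self, user_id):
        return sorted(self._store.get(user_id, set()))

    def remove(self, user_id, item_id):
        self._store.get(user_id, set()).discard(item_id)

cart_service = ItemListService()      # backed by the cart MySQL cluster
wishlist_service = ItemListService()  # backed by the wishlist MySQL cluster
```

Keeping the two on separate hardware, as described above, means either one can be scaled out without touching the other.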
Now let's look at what those analytics would be. From this Kafka,
there would be something like a Spark Streaming consumer.
One of the very first things it does is come up with some kind
of reports on what products people are buying right now. Those would be things
like a report on the most bought item in the
last 30 minutes, or the most wishlisted item in the last 30 minutes,
or which product in the Electronics category is the most sought-after.
All of those would be inferred by this Spark Streaming job. Other than that,
it also puts all the data into Hadoop, saying this user has liked this product,
this user has searched for this product, anything that happens, right? On top of it,
we could run various ML algorithms. I've talked about those in detail in another
video (the Netflix video), but the idea is: given a user and the kinds of products that they like,
we can figure out two kinds of information. One is what other
products this user might like, and the other is how similar this
user is to other users; based on products that those other users have already
purchased, we can recommend a particular product to this user. All of
that is calculated by this Spark cluster, on top of which we can run various ML
jobs to come up with this data. Once we calculate those recommendations, this
Spark cluster talks to something called a Recommendation Service, which is basically the repository of all the recommendations. It has various kinds of recommendations. One: given a user ID,
it stores general recommendations, saying these are the most recommended products
for this user. It will also store the same information per category,
saying for electronics, these are the recommendations for this user; for
food, these are the recommendations for this user. So when a person
is on the home page, they will first see just the general recommendations,
and if they navigate into a particular category, they'll see the
recommendations specific to that category. Now, we have skipped a couple of
components. The User Service is a very straightforward service;
you can look at any of my other videos that have talked about it in detail (the Airbnb video),
but it's a repository of all the users. It provides various APIs to get details
of a user, update details of a user, and all of that. It sits on top of a MySQL
database with a Redis cache on top. Now let's say the Search Service wants to
get details of a user. The way it will work is: it will first query Redis
to get the details of the user. If the user is present, it will return from there.
But if the user is not present in Redis, it will then query the MySQL
database (one of the slaves of that MySQL cluster), get the user information,
store it in Redis, and return it. So that's how the User Service will work.
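That read path is the classic cache-aside pattern. Here is a minimal sketch, with dicts standing in for Redis and a MySQL read replica; the user fields are illustrative.

```python
# Cache-aside read path: try the cache, fall back to the database,
# then populate the cache so the next read is served from memory.
redis_cache = {}
mysql_users = {"U1": {"user_id": "U1", "default_address": "Bangalore"}}

def get_user(user_id):
    user = redis_cache.get(user_id)        # 1. try Redis first
    if user is not None:
        return user
    user = mysql_users.get(user_id)        # 2. fall back to a MySQL replica
    if user is not None:
        redis_cache[user_id] = user        # 3. populate the cache
    return user
```

The Search Service's lookup of a user's default address for serviceability checks would go through exactly this path.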
Now there are some other components here. One is the Logistics Service, and one is the Warehouse Service. Normally these two
come in once the order is placed. But in this design, the Serviceability
Service might query either of these two services, not at runtime, but
beforehand, caching the information, to fetch various attributes. For example, it
might query the Warehouse Service to get a repository of all the items in each
warehouse, or it might query the Logistics Service to get details of all the
pincodes that exist, or details about the courier partners that operate
in a particular locality. With all of that information, the
Serviceability Service will basically build a graph kind of a thing, saying
what is the shortest path to go from point A to point B, and in how much time
can I get there. Now, we have not covered this service in
detail. I have made another video on the implementation of Google Maps; this is
very similar to that, so if you want the details you can look at
that (the Google Maps video). But just to call it out: this service doesn't really do any calculation
at runtime. It stores all the information in a cache, and whenever
anybody queries, it reads from the cache and returns the results,
with no runtime calculation, because those kinds of
calculations are fairly slow. It pre-calculates all possible requests that can come to it: if there are N pincodes and M warehouses, it will do M x N calculations, covering all possible combinations of requests that can come
to the service, and store them in a cache. Now let's look at what happens when a
user tries to place an order. The user interaction to place an order is
represented as this User Purchase Flow. This could be mobile apps,
web browsers, or anything. Now, whenever the user says, I am ready to place an
order, take me to the payment screen, which is the last piece in the app flow,
the request goes to something called the Order Taking
Service. Think of the Order Taking Service as the part of the Order Management System
which takes the order. Now, the Order Management System sits on top of a MySQL database. Why MySQL? Because if you look at orders in table form,
there will be a lot of tables: some with order information, some with customer information, some with item information, and there are a lot of
updates that happen on the Order Database for each order.
We need to make sure those updates are atomic and can form a transaction,
so that there are no partial updates. Because MySQL provides that out of the box, we leverage it here. So that's why MySQL.
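A minimal sketch of why that matters, with SQLite standing in for MySQL and illustrative table names: the order header and its line items commit together or not at all.

```python
import sqlite3

# Two tables that must stay consistent: the order header and its items.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE order_items (order_id TEXT, item_id TEXT)")

def place_order(order_id, item_ids):
    try:
        with db:  # one transaction: all inserts commit, or all roll back
            for item in item_ids:
                db.execute("INSERT INTO order_items VALUES (?, ?)",
                           (order_id, item))
            db.execute("INSERT INTO orders VALUES (?, 'CREATED')",
                       (order_id,))
        return True
    except sqlite3.IntegrityError:
        return False  # e.g. duplicate order_id: item rows roll back too
```

Without the transaction, a failure between the two inserts would leave orphaned item rows, exactly the partial update the design wants to rule out.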
Now, whenever this gets called, the very first thing that happens is a
record gets created for the order, and an order ID gets generated. The
next thing we do is put an entry into Redis saying this order ID was
created at some point in time, and this record expires at some point. Why that is
being used, we'll come to later. But think of it as an entry that says something like:
order ID O1 gets placed at 10:00, let's say, and it expires at 10:05.
Something of that sort. Now, the record that goes into the
MySQL database has an initial status; think of it as a status of "CREATED".
So the record that goes into the table is something like:
order O1 was created at 10:00 with status "CREATED". You can have a different
set of names also, that's fine. The third thing that happens is we basically
call the Inventory Service. The idea of calling the Inventory Service is that we want to
block the inventory. So let's say there were five units of a television
that a user wanted to purchase; we reduce the inventory count at this point
in time, and then send the user to payment. Why do we do that? Let's just say there
was just one TV, and three users come in at the same time. We want to
make sure only one of them buys the TV, and for the other two it should show that it
is out of stock, right? We can easily enforce that through this Inventory
Service, by having a constraint on the table's 'count' column saying
that 'count' cannot go negative. Now, each time somebody wants to place an
order for a TV, if the count is one, only one of the users will
be able to do the transaction where the count reduces. For the other two, it will be a
constraint violation exception, because the count cannot be negative. So only one
of them will go through with the inventory, right? Now, once the inventory
is updated, the user is taken to something called a Payment Service.
Think of it as an abstraction over the whole payment flow, wherein this service
talks to your payment gateways and takes care of all the interaction with the payment.
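Before moving on, the inventory-blocking guard described above can be expressed directly as a database CHECK constraint. This is a sketch with SQLite standing in for MySQL; the table name is illustrative.

```python
import sqlite3

# The constraint "count cannot go negative" lives in the database, so
# concurrent buyers of the last unit cannot all succeed.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inventory (item_id TEXT PRIMARY KEY, "
           "count INTEGER CHECK (count >= 0))")
db.execute("INSERT INTO inventory VALUES ('TV1', 1)")

def block_inventory(item_id):
    try:
        db.execute("UPDATE inventory SET count = count - 1 "
                   "WHERE item_id = ?", (item_id,))
        return True
    except sqlite3.IntegrityError:       # count would go below zero
        return False
```

Of three users racing for the last TV, exactly one decrement succeeds; the other two hit the constraint violation and see "out of stock".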
We will not go into the details of how that works, but the interaction
with the Payment Service can lead to a few outcomes. One is that the Payment
Service could come back saying the payment was successful. Or it could say that the
payment failed, for any of a lot of reasons. Or the person could
just close the browser window after going to the payment gateway and never come back, in which case the Payment Service would not
give a response. So there are these three possibilities. Now let's see
what can happen. The first thing that can happen is, let's just say at 10:01, we got a confirmation from
Payment saying the payment was successful. In that case, the order gets
confirmed, right? So let's say we call that status "PLACED", or any
other name for that matter. Once the payment has been successful, we
update the status of the order in the database, saying the order is
PLACED. Now the work is done. At that point in time, an event would be sent into
Kafka, saying an order has been placed, with so-and-so details.
Another option: the Payment Service could come back saying the payment failed.
Let's say there was no money in the account, or whatever happened.
Once the payment has failed, we need to do a couple of things. First of all, we need
to cancel the order, because the payment has not happened. So the other
possible status is that the order could be CANCELLED, right? And if the payment has failed, we need to increment the inventory count
again. So we'll call the Inventory Service again, saying: you
decremented the quantity for this particular order ID, now increment
it back. That's a rollback-transaction kind of a thing happening here.
We'll also have a Reconciliation
Service that sits on top of this whole system, which reconciles, basically checks
every once in a while, that if there were ten orders overall, we do have the correct
inventory count at the end, just to handle cases where, for some reason, we missed an
inventory count update or a write failed. All of those scenarios can be
handled by the Reconciliation flow. Now there is one more scenario that can
happen: the Payment Service might not come back at all. The user closed the
browser window, the flow ends there, and the Payment Service doesn't respond back to us.
Now what do we do? We can't keep the inventory blocked. So let's say there was
just one television available; this user has gone to the payment screen, closed the
window, and gone away. We still have that one television physically available
in our warehouse, but the database says it's not available.
We need to bring it back, right? That is where Redis comes in.
At 10:05, this record in Redis will expire.
We'll have an expiry callback on top of Redis which will
be invoked, saying this particular record, whatever you
inserted, has expired; now do whatever you want. At that point in time, the
Order Taking Service will catch that event and say: this particular record has expired, so I'll follow the same flow that
was followed for payment cancellation. Basically, I'll time out the payment and
mark the order CANCELLED. So at 10:05, this order would
be moved to the CANCELLED state in the database, and the inventory would get incremented back
again, right? So everything is good, but there are a couple of scenarios here.
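The lifecycle sketched so far can be summarized in code, with dicts standing in for MySQL, Redis, and the Inventory Service; statuses, names, and minute-based timestamps are all illustrative.

```python
# Order lifecycle: CREATED on take, PLACED on payment success,
# CANCELLED (with inventory restored) if the Redis timeout fires first.
orders = {}            # order_id -> row in the MySQL orders table
expiry_keys = {}       # order_id -> expiry time (a Redis key with a TTL)
inventory = {"TV1": 1}

def take_order(order_id, item_id, now, timeout=5):
    inventory[item_id] -= 1                   # block the inventory
    orders[order_id] = {"item": item_id, "status": "CREATED"}
    expiry_keys[order_id] = now + timeout     # SET ... EX in real Redis

def on_payment_success(order_id):
    orders[order_id]["status"] = "PLACED"
    expiry_keys.pop(order_id, None)           # suppress the expiry event

def on_expiry(order_id):
    order = orders[order_id]
    if order["status"] != "CREATED":
        return                                # payment already settled
    order["status"] = "CANCELLED"
    inventory[order["item"]] += 1             # give the blocked unit back
```

The status check inside `on_expiry` is what makes a late expiry event harmless for an already placed order.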
So there could be potential race conditions. What happens if your payment success
event and your expiry happen at the same time? What do you do then? There
are a couple of scenarios in which this would happen. The first scenario is that
payment success comes first, and then the expiry happens. That is a
scenario that will always occur: whenever a payment succeeds,
say at 10:01, the expiry would still fire at 10:05,
right? So this is bound to happen for all orders.
One optimization we could do here is: each time we get a payment
success or payment failure event, we delete the entry from Redis, so as to
make sure its expiry event does not come in. The other scenario is that
the expiry comes first, and then payment success. Let's say
the expiry came at 10:05 and payment success happens at 10:07. When
we got the expiry event, we would have cancelled the order and would
have incremented the inventory count back. But now we got to know
that the payment has happened. Here we could do a couple of things. We could refund the money
back to the customer, saying that for whatever reason we are not able to
process the order, here's your money back. Alternatively, we could say: we have anyway got the money from the customer, and we know what the person was trying
to order; due to an issue on our side the order didn't go through. So we could create a new order, mark this payment against
that order, and put it directly into the PLACED status. That is another
thing we could do. Assuming that for all orders, as soon as
we get a payment success or payment failure, we delete these entries
from Redis, these race conditions will not happen too frequently. That will also help us save the memory footprint, because the
less data we have in Redis, the less RAM it consumes,
and the more efficient it is. Now, one thing I want to call out about
this Redis expiry: it is not very accurate. Let's say a record was
supposed to expire at 10:05; we might not get the callback exactly at 10:05. We might
get it at 10:06, 10:07, or some time after 10:05, because of the
way Redis expiry works. Redis doesn't expire all the records at that exact point in
time. It checks the records every once in a while for expiry,
and whenever it finds keys that have expired,
it expires them. So the callback might come a few minutes late. In
this scenario it doesn't really matter, so I think we are okay with this
kind of approach. But if you are trying to do something that is much
more mission-critical, then you might need a different approach.
Either way, you should talk about this with your interviewer, and if needed,
you might change the implementation a bit: you might implement
a queue and keep polling that queue every second, or something of that sort.
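A sketch of that polled-queue alternative: pending payments go into a min-heap keyed by deadline, and a poller that runs every second pops whatever is due. This is an assumed design, not a prescription.

```python
import heapq

# Min-heap of (deadline, order_id): the earliest deadline is always
# at the front, so the poller only inspects the heap top.
pending = []

def schedule_timeout(order_id, deadline):
    heapq.heappush(pending, (deadline, order_id))

def poll_due(now):
    """Called every second; returns order IDs whose deadline has passed."""
    due = []
    while pending and pending[0][0] <= now:
        due.append(heapq.heappop(pending)[1])
    return due
```

Unlike Redis's lazy expiry, this gives you timeouts accurate to within one polling interval, at the cost of running the poller yourself.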
Okay. Now, whether the payment happened or not, you
would put all of those events into Kafka anyway. There is one more reason why you
want to put these events in. Let's say there was one television
left, and somebody purchased it. Now you want to make sure that
nobody else is able to see that television, because the inventory count has
become zero and you cannot take an order, right? So you
need to make sure it is removed from search.
Remember the Search Consumer that we looked at in the previous section?
That Search Consumer would also take care of inventory kinds of things.
So as soon as an item goes out of stock, it would remove that element
from the listing; basically, it will remove the documents pertaining to that
particular set of items. Now, there is one problem in this whole thing that we have
built: this MySQL can very quickly become a bottleneck. Why?
Because there are a lot of orders. A company like Amazon will probably
have millions of orders in a day, a couple of million at least, and
that number could bloat up significantly over a year. And normally you have to
keep order data for a couple of years for auditing purposes, right? So this database is going to get massively big; we need to do something about it. Remember, the
whole use case of having this MySQL is to take care of atomicity and transactions,
basically the ACID properties, for orders that are being updated. But for orders that
are not being updated, we don't really need any of those guarantees, right?
So what we can do is, once an order reaches a terminal state, let's just
say it is DELIVERED or CANCELLED, we can move it to Cassandra.
So there is something called an Archival Service, a cron kind of a thing
which runs at some frequency, every day or every 12 hours, and pulls all the
orders that have reached a terminal status from this MySQL and
puts them into Cassandra. Now how does it do that? You can see these two
services: the Order Processing Service and the Historical Order
Service. These two are again components of the larger Order
Management System, which do certain things of their own. The Order Processing
Service takes care of the whole lifecycle of an order: once
the order has been placed, any changes to that order happen through it. It will
also provide APIs to get orders if somebody wants order details.
Similarly, the Historical Order Service will provide all kinds of APIs to
fetch information about historical orders. Now, the Archival Service will
query the Order Processing Service to fetch all the orders that have reached a
terminal status, get all of their details, call the Historical Order
Service to insert all of those entries into its Cassandra, and once it has got a
success, saying I've inserted into Cassandra and replicated in all the right
places, it will call the Order Processing Service to delete those orders. Let's
say there is some error and something breaks in between; it will retry the
whole thing, and because the operation is idempotent, it doesn't really matter: we can replay
it as many times as we want. Cool. Now, once the user has
placed the order, all of these things will happen; logistics and all can happen behind
the scenes. The user can then go and see their past orders, right? That will be
powered by this Orders View. There will be a service behind the Orders View,
a back end, but drawing it would clutter
the diagram too much; assume there's a service here which talks to the Orders View, and that service talks to these two services. So what it will do is
query the Order Processing Service to fetch information about all the live
orders that are in transit, and call the Historical Order Service for all the
orders that have been completed. It will merge the data and return it back
to display on the app or website or wherever. Cool. Now, why did we use a
Cassandra here? The Historical Order Service will be used with a very limited set
of query patterns. Some of the queries that will run are:
1) Get information for a particular order ID
2) Get all the orders of a particular user ID
3) Possibly, get all the orders of a particular seller
So there will be a small number of query types that run
on this Cassandra, and Cassandra is very good when you have a finite number of
query types but a large set of data. That's the reason we have used Cassandra.
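The Archival Service hand-off described earlier, copy first and delete only after the copy succeeds, can be sketched like this, with dicts standing in for the two stores:

```python
# Archival loop: upsert terminal orders into the historical store, then
# delete them from MySQL. Keyed on order_id, so a crashed run can be
# replayed safely: re-upserting an already copied order is a no-op.
mysql_orders = {
    "O1": {"status": "DELIVERED"},
    "O2": {"status": "PLACED"},
    "O3": {"status": "CANCELLED"},
}
cassandra_orders = {}
TERMINAL = {"DELIVERED", "CANCELLED"}

def archive_once():
    done = [oid for oid, o in mysql_orders.items()
            if o["status"] in TERMINAL]
    for oid in done:
        cassandra_orders[oid] = mysql_orders[oid]   # copy first
    for oid in done:
        del mysql_orders[oid]                       # delete only after
```

Ordering the copy before the delete is what makes the job idempotent: a failure between the two phases leaves duplicates, never lost orders.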
Now, whenever an order is placed, we want to notify the
customer that your order has been successfully placed and will be delivered
in 3 days, or something of that sort. You'll possibly have to notify the seller,
and you might have to notify somebody else when something happens; let's say
an order is cancelled by the seller, you want to notify the customer. All of
those notifications would be taken care of by this Notification Service. Think of it
as an abstraction over all the various kinds of notifications, like
SMSs, emails, and all of that. I've made another video (on the Notification Service design)
which talks in detail about how to implement a Notification Service,
so you can go have a look at that as well. While a user is placing all these
orders and all of those things are happening, all the events are going into this Kafka,
on which we are running a Spark Streaming consumer. It does a lot of things.
One of the very first things it does is publish reports
saying, in the last one hour, which items have been ordered the most,
or which category has generated the most revenue, like electronics or
food, right? So it can publish some of the reporting
metrics that we want to look at quickly. Other than that, it also puts the whole of
the data into the Hadoop cluster. Now we have real order information of a user, which
is a very good input for recommendations. So on top of it, we'll have some
Spark jobs which run some very standard ALS kind of algorithms, which
predict, given that this particular user has ordered certain
things in the past, what the next set of things this person might order is.
Or, because this customer looks like that customer, and that customer
ordered so-and-so product, this customer might also order so-and-so product.
So there are a lot of kinds of models you can run to predict the
right set of recommendations for this user, which would then be sent to the
Recommendation Service that we looked at in the last section. It will store them in its Cassandra, and the next time the user comes in, we'll show them those
recommendations. It would not just have recommendations based on
your orders but also on similarity; taking the same example, if you have
ordered a whiteboard, you will definitely want to order a marker. So all
of those things would also be taken care of by this.