Amazon System Design | Flipkart System Design | System Design Interview Question

Video Statistics and Information

Captions
Hi everyone! Welcome to CodeKarle. In this video, we'll be looking at another very common system design interview problem, which is to design an ecommerce application, something very similar to Amazon or Flipkart or anything of that sort. Let's look at some functional and non-functional requirements (NFRs) that this platform should support. The very first thing is that people should be able to search for whatever product they want to buy, and along with searching we should also be able to tell them whether we can deliver it to them or not. Let's say a particular user is in a very remote location where we cannot deliver a particular product; then we should clearly call that out right on the search page. Why? Because if a user has seen a hundred products, gone into the checkout flow, added items to the cart, and only then we tell them that we can't deliver, that's a very bad user experience. So on the search page itself we should be able to tell them that we cannot deliver, or, if we can, by when we should be able to deliver it. The next thing is there should be a concept of a cart or a wishlist or something of that sort, so that people can add a particular item into a cart. The next thing is people should be able to check out, which is basically making a payment and completing the order. We will not look at how the payment exactly works, like integrating with payment gateways and all of that, but we'll broadly look at how the flow works overall. Finally, people should be able to view all of their past orders as well: the ones that have been delivered, the ones that have not been delivered, everything. From the non-functional side, the system should have very low latency, because it will be a bad user experience if it's slow. It should be highly available, and it should be highly consistent.
Now, high availability, high consistency and low latency all three together sounds a bit too much, so let's break it down. Not everything needs all three. Some of the components, mainly the ones dealing with payments and inventory counts, need to be highly consistent, at the cost of availability. Certain components, like search, need to be highly available, maybe at the cost of consistency at times. And overall, most of the user-facing components should have low latency. Now let's start with the overall architecture of the whole system. We will do it in two parts: first we look at the home screen and the search screen, and then we look at the whole checkout flow. Before we get to the real thing, let's set up a bit of convention. Things in green are the user interfaces; they could be browsers, mobile applications, anything. These two are not just load balancers but also reverse proxies, plus an authorization and authentication layer in between that will authenticate all the requests coming from the outside world into our system. All the things in blue are the services that we have built; they could be Kafka consumers, Spark jobs, anything. And the things in red are the databases or clusters, such as a Kafka cluster or a Hadoop cluster, or any public-facing thing that we've used. Now let's look at the user interfaces. We mainly have two. One is the home screen, which is the first screen a user comes to; by default it shows some recommendations based on the past record of that user, or, for a new user, some general recommendations. The search page would just be a text box where the user can type in a query and we give them the search results. With that, let's look at how the data flow begins.
A company like Amazon would have various suppliers that they have integrated with, and these suppliers would be managed by various services on the supplier front. I am abstracting all of them out as something called an Inbound Service. What that does is talk to the various supplier systems and get all of the data. Now let's say a new item has come in, or a supplier is onboarding a new item. That information comes through a lot of services, and through the Inbound Service, into a Kafka topic. That is the supplier world feeding into the whole search and user side of things. There are multiple consumers on that Kafka topic, which will process that piece of information as it flows into the user world. Let's look at what they are. The very first one is something called an Item Service. The Item Service has a lot of callers, and if I drew all those lines the diagram would become too cluttered, so I've not drawn any connections to it, but it is one of the important consumers of this Kafka topic, and what it does is onboard new items. It is also the source of truth for all items in the ecosystem. It provides various APIs: get an item by item ID, add a new item, remove an item, update the details of a particular item, and, importantly, a bulk GET for many items, that is, a GET API that takes a lot of item IDs and returns the details of all of those items in the response. The Item Service sits on top of a MongoDB. Why do we need Mongo here? Item information is fundamentally unstructured: different item types have different attributes, and all of that needs to be stored in a queryable format. If we tried to model that in a structured form in a MySQL kind of database, it would not be an optimal choice. That's why we've used Mongo.
To take an example, if you look at the attributes of a product like a shirt, it would have a size attribute and probably a color attribute, like a large t-shirt in red. Something like a television would have a screen size and a display type, say a 55 inch TV with an LED display. Other things, like bread, could have a type and a weight, say 400 grams of wheat bread or sweet bread. So it's fairly unstructured data, and that's the reason I've used a Mongo database here. Now, coming to the other consumers sitting on that Kafka topic. One of them is something called a Search Consumer. Whenever a new item comes in, this Search Consumer is responsible for making sure that the item is now available for users to query. It reads all the items that are coming in, puts them into a format that the search system understands, and stores them in its database. Search uses Elasticsearch, which is again a NoSQL database and very efficient at text-based queries. The search will happen either on the product name or the product description, maybe with some filters on the product attributes, and those are exactly the kinds of things Elasticsearch is very good at. In some cases you might also want to do a fuzzy search, which Elasticsearch supports very well too. On top of this Elasticsearch there is something called a Search Service. The Search Service is the interface that talks to the front end, or to any other component that wants to search anything within the ecosystem. It provides various kinds of APIs to filter products, to search by a particular string, and so on.
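To make this concrete, here is a toy sketch (not the video's code) of heterogeneous item documents and a naive search over them. Plain Python dicts stand in for MongoDB documents and the Elasticsearch index, and every name and field below is made up for illustration:

```python
# Item documents carry different attributes per item type, which is why
# a schemaless store fits; the search function does a naive text match
# plus an attribute filter, roughly what Elasticsearch does at scale.
items = [
    {"id": "I1", "name": "Red T-Shirt", "desc": "large cotton t-shirt",
     "attrs": {"size": "L", "color": "red"}},
    {"id": "I2", "name": "55 inch LED TV", "desc": "smart television",
     "attrs": {"screen_size": 55, "display": "LED"}},
    {"id": "I3", "name": "Wheat Bread", "desc": "400 g wheat bread",
     "attrs": {"type": "wheat", "weight_g": 400}},
]

def search(query, **filters):
    """Match the query against name/description, then filter on attributes."""
    hits = [i for i in items
            if query.lower() in i["name"].lower()
            or query.lower() in i["desc"].lower()]
    for key, value in filters.items():
        hits = [i for i in hits if i["attrs"].get(key) == value]
    return [i["id"] for i in hits]

print(search("t-shirt", color="red"))  # matches only the red shirt
```

A real Elasticsearch index additionally handles tokenization, relevance scoring and fuzzy matching, which a substring match like this cannot.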
The contract between the Search Service and the Search Consumer is fixed, so both of them understand the shape of the data stored in Elasticsearch. That is why the consumer is able to write it and the Search Service is able to search on it. Now, when a user searches for something, there are two main concerns. One is figuring out the right set of items to display, but there is also the important aspect that we should not show items that we cannot deliver to the customer. For example, if a customer is in a very remote location and we cannot deliver big items like refrigerators to that pin code, we should not show those search results at all, because otherwise it's a bad experience: the user sees the result but cannot order it, right? So the Search Service talks to something called a Serviceability and TAT (Turn Around Time) Service. This does a few things. First of all, it figures out where exactly the product is, meaning in which warehouses. Then, given one or more of those warehouses, it checks: do I have a way to deliver products from this warehouse to the customer's pin code, and if I do, what kinds of products can I carry on that route? Certain routes can carry all kinds of products, but certain routes might not be able to carry things like large products. All of that filtering stays within the Serviceability and TAT Service. It also does one more thing: it tells you in how much time the delivery can happen, in hours or days, say 12 hours or 24 hours or something of that sort. If serviceability says an item cannot be delivered, search will simply filter those results out and return the rest.
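The serviceability check can be sketched as a lookup into precomputed answers. The pincode and size-class model below is an assumption for illustration, not the actual service contract:

```python
# A stand-in for the Serviceability & TAT Service: answers come from a
# precomputed table, and search results that cannot be delivered to the
# user's pincode are dropped. All data here is made up.
SERVICEABLE = {
    ("560001", "small"): 12,   # (pincode, size class) -> delivery hours
    ("560001", "large"): 48,
    ("999999", "small"): 72,   # remote pincode: only small items
}

def can_deliver(pincode, size_class):
    """Return the TAT in hours, or None if the item is not serviceable."""
    return SERVICEABLE.get((pincode, size_class))

def filter_results(results, pincode):
    """Drop undeliverable items; annotate the rest with a delivery time."""
    out = []
    for item in results:
        tat = can_deliver(pincode, item["size_class"])
        if tat is not None:
            out.append({**item, "delivery_hours": tat})
    return out

results = [{"name": "Fridge", "size_class": "large"},
           {"name": "Book", "size_class": "small"}]
print(filter_results(results, "999999"))  # the fridge gets filtered out
```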
Search might also talk to the User Service. There's something called a User Service here; we'll get into it later, but the User Service is the source of truth for user data, and Search can query it to fetch some attributes of the user, probably a default address or something of that sort, which can be passed as an argument to the Serviceability Service to check whether delivery is possible. The Search Service then returns the response to the user, which can be rendered so people see whatever they want to see. Now, each time a search happens, an event is put into Kafka. The reason is that whenever somebody searches for something, they are telling you about an intent to buy a kind of product, right? That is a very good signal for building recommendations. We'll look at how recommendations can be built later on, but this is an input into the recommendation engine, and we use Kafka to pass it along. So each search query goes into Kafka, saying this user ID searched for this particular product. Now, from the search screen, a user can wishlist a particular product, or add it to the cart and buy it. Those actions are handled by the Wishlist Service and the Cart Service. The Wishlist Service is the repository of all wishlists in the ecosystem, and the Cart Service is the repository of all carts. A cart is basically a shopping bag: people put products into it and then check out. Both these services are built in almost exactly the same way. They provide APIs to add a product into a user's cart or wishlist, get a user's cart or wishlist, or delete a particular item from it. They have a very similar data model, and they each sit on their own MySQL database. From a hardware standpoint, though, I'd like to keep these two on separate hardware.
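Since the two services share the same contract, a single sketch covers both. This is an illustrative in-memory version, with a dict standing in for each service's MySQL store:

```python
# One class backs both the Cart Service and the Wishlist Service: the
# add/get/delete API and the data model are the same, only the stores
# (and, in production, the hardware) differ.
class ItemListService:
    def __init__(self):
        self._by_user = {}                 # user_id -> set of item_ids

    def add(self, user_id, item_id):
        self._by_user.setdefault(user_id, set()).add(item_id)

    def get(self, user_id):
        return sorted(self._by_user.get(user_id, set()))

    def delete(self, user_id, item_id):
        self._by_user.get(user_id, set()).discard(item_id)

cart, wishlist = ItemListService(), ItemListService()  # separate stores
cart.add("u1", "tv")
wishlist.add("u1", "bread")
cart.delete("u1", "tv")
print(cart.get("u1"), wishlist.get("u1"))
```

Keeping the two behind identical contracts but on separate hardware means whichever one grows faster can be scaled out on its own.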
That is just in case, for example, the wishlist data becomes too big and needs to scale out; then we can scale that particular cluster accordingly. Otherwise, functionally, both of them are exactly the same. Now, each time a person puts a product into a wishlist, they are again giving you a signal. Each time they add something to their cart, they are again giving a signal about their preferences, about the things they want to buy. All of those events can again be put into Kafka for a very similar kind of analytics. So what would those analytics be? From this Kafka there would be something like a Spark Streaming consumer. One of the very first things it does is come up with reports on what products people are buying right now: things like the most bought item in the last 30 minutes, or the most wishlisted item in the last 30 minutes, or the most sought-after product in the electronics category. All of those would be inferred by this Spark Streaming job. Other than that, it also puts all the data into Hadoop: this user liked this product, this user searched for this product, anything that happens. On top of that, we can run various ML algorithms. I've talked about those in detail in another video (the Netflix video), but the idea is: given a user and the kinds of products they like, we can figure out two kinds of information. One is what other products this user might like, and the other is how similar this user is to other users, so that, based on products those similar users have already purchased, we can recommend a particular product to this user. All of that is calculated on this Spark cluster, on top of which we can run various ML jobs to come up with this data.
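The "most bought in the last 30 minutes" style of report mentioned above can be sketched as a windowed count. Here is a toy in-memory version; Spark Streaming would do the same thing over a distributed stream, and the event shape is made up:

```python
# Count purchase events inside a 30-minute sliding window and report
# the most frequent item. Events are (timestamp_seconds, item_id).
from collections import Counter

WINDOW_SECONDS = 30 * 60

def top_items(events, now, k=1):
    """Return the k most-bought items within the window ending at `now`."""
    window = [item for ts, item in events if now - ts <= WINDOW_SECONDS]
    return Counter(window).most_common(k)

events = [(0, "tv"), (100, "bread"), (200, "bread"), (1500, "tv"),
          (1700, "bread")]
print(top_items(events, now=1800))  # bread leads within the window
```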
Once we calculate those recommendations, this Spark cluster talks to something called a Recommendation Service, which is the repository of all the recommendations, and it holds various kinds of them. One: given a user ID, it stores general recommendations, the most recommended products for this user. It also stores the same information per category: for electronics, these are the recommendations for this user; for food, these are the recommendations. So when a person is on the home page, they first see the general recommendations, and if they navigate into a particular category they see the recommendations specific to that category. Now, we have skipped a couple of components. The User Service is a very straightforward service; you can look at any of my other videos that cover it in detail (the Airbnb video), but it's the repository of all the users. It provides various APIs to get the details of a user, update the details of a user, and so on. It sits on top of a MySQL database with a Redis cache on top. Let's say the Search Service wants to get the details of a user. The way it works is, it first queries Redis. If the user is present there, it returns from there. If the user is not in Redis, it then queries the MySQL database (one of the slaves of that MySQL cluster), gets the user information, stores it in Redis, and returns it. That's how the User Service works. Now there are some other components here: a Logistics Service and a Warehouse Service. Normally these two come in once the order is placed, but in this design the Serviceability Service might query either of these two services, not at runtime, but before caching the information, to fetch various attributes.
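The User Service read path described above is a classic cache-aside pattern. A minimal sketch, with plain dicts standing in for Redis and a MySQL replica:

```python
# Cache-aside read: check Redis first, fall back to the MySQL replica,
# then populate the cache so the next read is served from memory.
redis_cache = {}
mysql_replica = {"u1": {"user_id": "u1", "default_address": "560001"}}
reads_from_db = 0                          # instrumentation for the demo

def get_user(user_id):
    global reads_from_db
    user = redis_cache.get(user_id)
    if user is not None:
        return user                        # cache hit
    reads_from_db += 1                     # cache miss: read the replica
    user = mysql_replica.get(user_id)
    if user is not None:
        redis_cache[user_id] = user        # populate for the next read
    return user

get_user("u1")        # first read falls through to MySQL
get_user("u1")        # second read is served from Redis
print(reads_from_db)  # only one database read happened
```

The Serviceability Service's pre-cached lookups follow a similar spirit: do the expensive work ahead of time and serve reads from the cache.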
For example, the Serviceability Service might query the Warehouse Service to get a repository of all the items in each warehouse, or it might query the Logistics Service to get details of all the pin codes that exist, or details of the courier partners that work in a particular locality. With all of that information, the Serviceability Service builds a graph of sorts: what is the shortest path to go from point A to point B, and in how much time can I get there? We have not covered this service in detail. I have made another video on the implementation of Google Maps, and this is very similar to that, so if you want the details you can look at that (the Google Maps video). But just to call it out: this service doesn't do any calculation at runtime. It stores all the information in a cache, and whenever anybody queries, it reads from the cache and returns the results from the cache itself; there is no runtime calculation, because those kinds of calculations are fairly slow. Instead, it pre-calculates all possible requests that can come to it. Basically, if there are N pin codes and M warehouses, it computes M x N, all possible combinations of requests that can come to the service, and stores them in a cache. Now let's look at what happens when a user tries to place an order. The user interaction to place an order is represented by this user purchase flow. It could be mobile apps, web browsers, anything. Whenever the user says, "I am ready to place an order, take me to the payment screen," which is the last step in the app flow, the request goes to something called the Order Taking Service. Think of the Order Taking Service as the part of the Order Management System that takes the order. The Order Management System sits on top of a MySQL database. Why MySQL?
Because if you model an order in table form, it spans a lot of tables, some with order information, some with customer information, some with item information, and there are a lot of updates that happen on the order database for each order. We need to make sure those updates are atomic and can form a transaction, so that there are no partial updates. MySQL provides that out of the box, so we leverage it here. That's why MySQL. Now, whenever the Order Taking Service gets called, the very first thing that happens is a record gets created for the order, and an order ID gets generated. The next thing we do is put an entry into Redis saying this order ID was created at some point in time, and this record expires at some later point. Why this is used, we'll come to; but think of it as an entry like: order ID "O1" was placed at 10:00, and it expires at 10:05. The record that goes into the MySQL database has an initial status; think of it as a status of "CREATED". So the record in this table says something like: order O1 was created at 10:00 with status "CREATED". You can use a different set of status names; that's fine. The third thing that happens is we call the Inventory Service. The idea of calling the Inventory Service is to block the inventory. Let's say there were five units of a television that a user wanted to purchase; we reduce the inventory count at this point, and then send the user to payment. Why do we do that? Let's say there was just one TV and three users come in at the same time. We want to make sure only one of them buys the TV, and for the other two it should show out of stock, right?
We can easily enforce that through the Inventory Service, by putting a constraint on the table's 'count' column saying that 'count' cannot go negative. Now, each time someone wants to place that order for a TV, if the count is one, only one of the users will be able to complete the transaction that reduces the count. For the other two, it will be a constraint violation exception, because the count cannot go negative. So only one of them gets the inventory. Once the inventory is updated, the user is taken to something called the Payment Service. Think of it as an abstraction over the whole payment flow: this service talks to your payment gateways and takes care of all the interaction with payments. We will not go into the details of how that works, but the interaction with the Payment Service can lead to three outcomes. One: the Payment Service comes back saying the payment was successful. Two: it says the payment failed, for any of a lot of possible reasons. Three: the person just closes the browser window after going to the payment gateway and never comes back, in which case the Payment Service does not give a response at all. So there are these three possibilities. Now let's see what can happen. The first case: let's say at 10:01 we get a confirmation from the Payment Service saying the payment was successful. In that case the order gets confirmed. Let's call that status "PLACED", or any other name for that matter. Once the payment has been successful, we update the status of the order in the database to PLACED. Now the work is done. At that point an event is also sent into Kafka, saying an order has been placed, with so-and-so details. The second option: the Payment Service could come back saying the payment failed.
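The constraint-based enforcement described above can be sketched with SQLite standing in for MySQL. Table and column names are assumptions; the point is that the database itself rejects any decrement that would drive the count negative:

```python
# One TV left, three buyers: the CHECK constraint guarantees that only
# one reservation succeeds, with no application-level locking.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE inventory (
    item_id TEXT PRIMARY KEY,
    count   INTEGER NOT NULL CHECK (count >= 0)
)""")
db.execute("INSERT INTO inventory VALUES ('tv', 1)")  # one TV in stock

def block_inventory(item_id):
    """Try to reserve one unit; return True on success."""
    try:
        with db:  # transaction: commits on success, rolls back on error
            db.execute(
                "UPDATE inventory SET count = count - 1 WHERE item_id = ?",
                (item_id,))
        return True
    except sqlite3.IntegrityError:  # the count would have gone negative
        return False

results = [block_inventory("tv") for _ in range(3)]
print(results)  # only the first reservation succeeds
```

Once the reservation succeeds, the user proceeds to payment, which of course may still fail.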
Let's say there was no money in the account, or whatever else happened. Once the payment has failed, we need to do a few things. First of all, we need to cancel the order, because the payment has not happened; so another possible status is "CANCELLED". If the payment has failed, we also need to increment the inventory count again. So we call the Inventory Service again, saying: you decremented the quantity for this particular order ID, now increment it back. That's a rollback-style transaction happening here. We'll also have a Reconciliation Service sitting on top of this whole system, which checks every once in a while that, if there were ten orders overall, the inventory count at the end is correct, just to handle cases where, for some reason, we missed an inventory update or a write failed. All of those scenarios are handled by the reconciliation flow. Now there is one more scenario. The Payment Service might not come back at all: the user closed the browser window, the flow ends there, and the Payment Service never responds to us. What do we do? We can't keep the inventory blocked. Let's say there was just one television available; this user went to the payment screen, closed the window, and walked away. We still have that one television physically in our warehouse, but the database says it's not available. We need to bring it back, right? That's where the Redis entry comes in. At 10:05, that Redis record will expire; Redis keys have an expiry. We'll have an expiry callback on top of Redis, which gets invoked saying: this particular record you inserted has expired, now do whatever you want.
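A hedged sketch of this payment-timeout flow, simulated in-process rather than with real Redis key TTLs and keyspace notifications; the function names and data shapes are made up:

```python
# A pending order gets a TTL entry; a payment event deletes the entry
# (so no expiry fires), while an expired entry with no payment cancels
# the order and frees the inventory.
import heapq

expiry_heap = []    # (expires_at_seconds, order_id)
pending = {}        # order_id -> True while awaiting a payment outcome
orders = {}         # order_id -> status

def take_order(order_id, now, ttl=300):
    orders[order_id] = "CREATED"
    pending[order_id] = True
    heapq.heappush(expiry_heap, (now + ttl, order_id))

def on_payment_event(order_id, success):
    pending.pop(order_id, None)             # "delete the Redis key"
    orders[order_id] = "PLACED" if success else "CANCELLED"

def run_expiries(now):
    """Fire expiry callbacks for orders whose timeout has passed."""
    while expiry_heap and expiry_heap[0][0] <= now:
        _, order_id = heapq.heappop(expiry_heap)
        if pending.pop(order_id, None):     # never paid: time it out
            orders[order_id] = "CANCELLED"

take_order("O1", now=600)              # created at "10:00"
take_order("O2", now=600)
on_payment_event("O1", success=True)   # O1 paid at "10:01"
run_expiries(now=900)                  # "10:05": only O2 times out
print(orders)
```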
At that point the Order Taking Service will catch that event and say: this particular record has expired, so I'll follow the same flow as for a payment failure, that is, time out the payment and mark the order CANCELLED. So at 10:05 this order moves to the CANCELLED state in the database, and the inventory gets incremented back. Everything is good, but there are a couple of scenarios here: potential race conditions. What happens if your payment-success event and your expiry happen around the same time? There are two orderings in which it can happen. The first is that the payment success comes first and then the expiry happens. In fact, that is the ordering that will always happen: if the payment succeeded at, say, 10:01, the expiry would still fire at 10:05. So this is bound to happen for all successful orders. One optimization we can do here: each time we get a payment-success or payment-failure event, we delete the entry from Redis, so that its expiry event never fires. The other ordering is that the expiry comes first and then the payment success. Say the expiry came at 10:05 and the payment success arrives at 10:07. On the expiry event, we would have cancelled the order and incremented the inventory count back. But now we learn that the payment actually happened. Here we could do a couple of things. We could refund the money to the customer, saying: for whatever reason we were not able to process your order, here's your money back. Alternatively, we could say: we've got the money from the customer anyway, and we know what the person was trying to order; due to an issue on our side the order didn't go through.
So we could create a new order, attach this payment to that order, and put it directly into the PLACED status. That is another thing we could do. Assuming that for all orders we delete the Redis entry as soon as we get a payment success or failure, these race conditions will not happen too frequently. That also helps the memory footprint: the less data we hold in Redis, the less RAM it consumes, and the more efficient it is. One thing I want to call out about this Redis expiry: it is not very accurate. If a key was supposed to expire at 10:05, we might not get the callback at exactly 10:05; we might get it at 10:06, 10:07, or some time after, and that is because of the way Redis expiry works. Redis doesn't expire all the records at exactly their expiry time; it checks records every once in a while, and whenever it finds keys that have expired, it expires them then. So the callback might come a few minutes late. In this scenario that doesn't really matter, so we should be okay with this approach, but if you are building something much more mission-critical, you might need a different approach. Either way, you should talk about this with your interviewer, and based on that you might change the implementation a bit, for example implement a queue and keep polling it every second, or something of that sort. Now, whether the payment succeeded or failed, you put all of those events into Kafka anyway. There is one more reason to publish these events. Let's say there was one television left and somebody has purchased it. You want to make sure that nobody else is able to see that television, because with the inventory count at zero you cannot take an order. So you need to make sure it is removed from search. Remember the Search Consumer we looked at in the previous section? That Search Consumer also takes care of inventory-driven changes. As soon as an item goes out of stock, it removes that item from the listings; basically, it removes the documents pertaining to that set of items. Now, there is one problem in this whole design: this MySQL can very quickly become a bottleneck. Why? Because there are a lot of orders. A company like Amazon probably has millions of orders in a day, a couple of million at least, and that number can bloat up significantly over a year. Normally you also have to keep order data for a couple of years for auditing purposes, so this database is going to get massively big, and we need to do something about it. Remember, the whole point of having MySQL here is atomicity and transactions, basically the ACID properties, for orders that are being updated. But for orders that are no longer being updated, we don't really need any of those properties. So once an order reaches a terminal state, let's say it is DELIVERED or CANCELLED, we can move it to Cassandra. There is something called an Archival Service, which is a cron kind of a thing that runs at some frequency, say every day or every 12 hours, pulls all the orders that have reached a terminal status from this MySQL, and puts them into Cassandra. How does it do that? You can see these two services: the Order Processing Service and the Historical Order Service. These two are again components of the whole larger Order Management System, each doing certain things of its own.
The Order Processing Service takes care of the whole lifecycle of the order: once the order has been placed, any changes to that order happen through it. It also provides APIs to get order details if somebody needs them. Similarly, the Historical Order Service provides all kinds of APIs to fetch information about historical orders. The Archival Service queries the Order Processing Service to fetch all the orders that have reached a terminal status, gets all of those details, calls the Historical Order Service to insert those entries into its Cassandra, and once it gets a success back, meaning the data is inserted into Cassandra and replicated in all the right places, it calls the Order Processing Service to delete those orders. If there is an error, if something breaks in between, it retries the whole thing, and because the flow is idempotent it doesn't really matter; we can replay it as many times as we want. Cool. Now, once the user has placed an order, logistics and everything else can happen behind the scenes, and the user can go and see their past orders. That is powered by this Orders View. There will be a service backing the Orders View (drawing it would clutter the diagram too much), and it talks to these two services: it queries the Order Processing Service for all the live orders that are in transit, calls the Historical Order Service for all the orders that have been completed, merges the data, and returns it for display on the app or website. Cool. Now, why did we use Cassandra here? The Historical Order Service is used with a very limited set of query patterns.
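The archival hand-off described above can be sketched as follows; dicts stand in for the MySQL and Cassandra stores, and the names are assumptions. The key property is that deletion happens only after a successful copy, and replaying the loop is harmless because the insert is a keyed upsert:

```python
# Idempotent archival: copy terminal orders to "Cassandra", then delete
# them from "MySQL". A crash between the two steps just leaves a row to
# be re-copied (a no-op upsert) on the next run.
TERMINAL = {"DELIVERED", "CANCELLED"}

mysql_orders = {"O1": "DELIVERED", "O2": "IN_TRANSIT", "O3": "CANCELLED"}
cassandra_orders = {}

def archive_once():
    terminal = [o for o, s in mysql_orders.items() if s in TERMINAL]
    for order_id in terminal:
        cassandra_orders[order_id] = mysql_orders[order_id]  # upsert copy
        del mysql_orders[order_id]  # delete only after the copy succeeded

archive_once()
archive_once()  # replaying is safe: the flow is idempotent
print(sorted(cassandra_orders), sorted(mysql_orders))
```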
Some of the queries that will run are: 1) get the information of a particular order ID, 2) get all the orders of a particular user ID, 3) possibly get all the orders of a particular seller. So there is a small number of query types running against this Cassandra, and Cassandra is very good when you have a finite set of query patterns (by type) but a large amount of data. That's the reason we used Cassandra. Next: whenever an order is placed, we want to notify the customer that their order has been successfully placed and will be delivered in, say, 3 days. You might also have to notify the seller, or notify somebody else when something happens; say an order is cancelled by the seller and you want to notify the customer. All of those notifications are taken care of by this Notification Service. Think of it as an abstraction over all the various kinds of notifications: SMS, email, and all of that. I've made another video (on notification service design) that talks in detail about how to implement a Notification Service, so you can go have a look at that as well. While a user is placing orders and all of this is happening, all the events are going into this Kafka, on which we are running a Spark Streaming consumer. It does a lot of things. One of the first is that it can publish a report saying which items have been ordered the most in the last one hour, or which category, like electronics or food, has generated the most revenue. So it can publish some of the reporting metrics we want to look at quickly. Other than that, it also puts the whole of the data into the Hadoop cluster.
Now we have real order information for a user, which is a very good input for recommendations. On top of it, we'll have some Spark jobs that run some very standard ALS-kind of algorithms, predicting, given that this user has ordered certain things in the past, what the next set of things is that this person might order. Or, because this customer looks like that customer, and that customer ordered so-and-so product, this customer might also order so-and-so product. There are a lot of kinds of models you can run to predict the right set of recommendations for a user. Those recommendations are then sent to the Recommendation Service we looked at in the last section; it stores them in its Cassandra, and the next time the user comes in, we show them those recommendations. These would not just be recommendations based on your own orders but also on related items: taking the same example, if you have ordered a whiteboard, you will very likely want to order a marker. All of those things are also taken care of here.
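As a flavor of what such jobs compute, here is a toy "bought together" co-occurrence recommender. This is not ALS and not the video's implementation, just item co-occurrence counting over made-up order history:

```python
# Count how often pairs of items appear in the same order, then
# recommend the items most often bought alongside a given item.
from collections import Counter
from itertools import permutations

order_history = [
    ["whiteboard", "marker"],
    ["whiteboard", "marker", "eraser"],
    ["marker", "notebook"],
]

co_counts = Counter()
for basket in order_history:
    for a, b in permutations(set(basket), 2):
        co_counts[(a, b)] += 1

def recommend(item, k=2):
    """Items most often bought alongside `item`, best first."""
    scored = [(n, other) for (i, other), n in co_counts.items() if i == item]
    return [other for n, other in sorted(scored, reverse=True)[:k]]

print(recommend("whiteboard"))  # the marker co-occurs most often
```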
Info
Channel: codeKarle
Views: 90,064
Rating: 4.8848286 out of 5
Keywords: System Design, System Architecture, Interview, codekarle, Code Karle, System, Architecture, Design, Amazon System Design, Flipkart System Design, System Design Amazon, System Design Flipkart, FAANG, system design tutorial, tutorial, system design interview question, interview questions, grokking, systemdesign, amazonsystemdesign
Id: EpASu_1dUdE
Length: 33min 0sec (1980 seconds)
Published: Tue May 19 2020