Grokking the System Design Interview: How to Design a Social Network

Video Statistics and Information

Captions
In this video I'm going to help you understand how to attack a system design question in a technical interview at Facebook, Google, Amazon, Microsoft, or Apple. We'll be designing a social network, something like Facebook or Twitter. If you just landed on this channel, like and subscribe, and if you want more content like this, give me feedback on which parts of the video work and which need to be improved.

So we're designing a social network, which is a pretty typical question in a major tech company interview. Ours will be similar to Facebook: it has a search feature and a news feed for each user, and the news feed can contain text, images, URLs, videos, and generally anything the user wants to put on it. The first thing we should ask the interviewer is: what are the constraints? Here they are: 300 million users, which roughly covers the entire population of the United States; one third of those users are daily active, which is 100 million; and 200 million daily posts, so this is a very read-heavy service, since most people aren't posting, they're reading other people's posts. We need very low latency, around 300 milliseconds, and every user's news feed should hold at least 300 posts that are readily available so the user doesn't have to wait. And, like other distributed systems, it should be highly available and highly consistent. The features we need are a news feed, a new-post feature, and search over posts.

Now we have to come up with the overall design, and the architecture I'm going to use is a pretty standard distributed-system setup. Clients interact with a web server, the web server is linked to an application server that handles our main application logic, and behind that we have a database. We can make this a SQL database because the data is relational: every user has posts, and every post is linked to a user. (If we didn't need to know which posts belong to which users, a document-based NoSQL database could work, but in this case we need SQL.) We also need a file store for large files, such as HBase, and two additional components: the search index and the news feed cache. Why are these separate add-ons? In a small-scale application the logic for these parts would be self-contained within the application server, but once an application gets large enough, the independent features that drive most of the functionality can't be kept on a single server, because their storage requirements span several terabytes. Moving them into separate services makes the system more fault tolerant and gives us higher uptime, which is key for a distributed system that should be highly available and highly consistent.
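Before defining the tables, it helps to turn the constraints above into a quick back-of-envelope calculation. This is a minimal sketch in Python; the totals are the numbers stated above, while the feed-refreshes-per-user figure is my own illustrative assumption.

```python
# Back-of-envelope numbers for the stated constraints.
# Derived per-second figures are daily averages, not peak values.

TOTAL_USERS        = 300_000_000            # ~population of the United States
DAILY_ACTIVE_USERS = TOTAL_USERS // 3       # 100 million
DAILY_POSTS        = 200_000_000
SECONDS_PER_DAY    = 24 * 60 * 60

writes_per_second = DAILY_POSTS / SECONDS_PER_DAY            # ~2,300 new posts/s

# Assumption (not from the video): each daily active user refreshes
# their feed about 10 times per day.
FEED_REFRESHES_PER_USER = 10
reads_per_second = DAILY_ACTIVE_USERS * FEED_REFRESHES_PER_USER / SECONDS_PER_DAY

print(f"average writes/s: {writes_per_second:,.0f}")   # ~2,315
print(f"average reads/s:  {reads_per_second:,.0f}")    # ~11,574 -> read-heavy
```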
Now that we have the overall architecture, we can go ahead and define the SQL tables. The users table has an ID at 8 bytes. Four bytes would work, since 2^32 − 1 is enough to cover 300 million users, but the user base is likely growing and we should scale the system for the future, and running out of ID space is an issue that comes up more often than not, so we'll use 8 bytes. Then there's an email at 128 bytes, which covers 128 characters; a full name at 128 bytes (this could honestly be scaled down to 40 bytes, but 128 is a pretty standard size); and a list of preferences, which enables the news feed functionality. Next is the post metadata table. Its ID is also 8 bytes, because users post a lot: a 4-byte ID covers about four billion posts, and at 200 million posts per day it would be used up within 21 days, whereas a 64-bit ID gives us roughly 92 billion days of uptime, far more than we need. Then we have the user ID, which links the post back to its user; the body at 1,024 bytes, which is 1,024 characters; an HBase URL, which points at the file store; and the post time, which is a timestamp. Finally, we have a relationship table between users, mapping a source user to a destination user.

Now that the basic system is in place, we need to define what the news feed cache looks like. To visualize how a post flows through the system: a post comes in, hits the web server, goes on to the application server, gets stored in the database, and is then sent to the search index as well as the news feed cache, to see whether it's relevant to any of the people whose news feeds are currently in the cache. For the cache we need a data structure, and the one we'll use is a priority queue. We rank posts based on the preferences stored on each user, and match posts to the users whose preferences they fit using a match score, which could be calculated as an aggregate function of the relationship between the users, the user's past preference history, and the post's own ranked preferences. Each user's queue holds 300 posts, because, remember, every user should have at least 300 posts in their news feed ready to go; this is what keeps latency low. Loading anything past those 300 posts takes some time, but most people don't spend that long on the site, and we can load more in the background. When a new post comes in, we compute its relevance to a feed and insert it into that feed.

Given the constraint of keeping 300 posts per user in the cache, the size works out to 1,300 bytes of metadata per post, times 300 posts, times 100 million users (only the daily active users need to be cached, not all 300 million), so instead of 117 terabytes we're actually using 39 terabytes of cache space. We can get this down even more: applying the 80/20 principle, which states that 80% of consumption comes from 20% of the users, we only need about 7.79 terabytes of cache space.
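Here is a minimal sketch of what one user's entry in the news feed cache could look like, assuming a bounded priority queue of 300 posts keyed on a precomputed match score. The class and field names are my own, and computing the match score itself is left out.

```python
import heapq
from dataclasses import dataclass, field

FEED_CAPACITY = 300   # minimum number of posts kept ready per user

@dataclass(order=True)
class RankedPost:
    match_score: float                     # aggregate of preferences, relationships, post ranking
    post_id: int = field(compare=False)    # posts are compared on score only

class UserFeedCache:
    """One user's feed in the news feed cache: a bounded min-heap keyed on match score."""

    def __init__(self, capacity: int = FEED_CAPACITY):
        self.capacity = capacity
        self.heap: list[RankedPost] = []   # least relevant post sits at heap[0]

    def add(self, post: RankedPost) -> None:
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, post)
        elif post.match_score > self.heap[0].match_score:
            # Evict the least relevant post to make room for a more relevant one.
            heapq.heapreplace(self.heap, post)

    def top_posts(self) -> list[RankedPost]:
        # Highest-scored posts first, served straight from memory.
        return sorted(self.heap, key=lambda p: p.match_score, reverse=True)

# Usage: new posts arrive and are scored against this user's preferences.
feed = UserFeedCache()
feed.add(RankedPost(match_score=0.82, post_id=123))
feed.add(RankedPost(match_score=0.47, post_id=456))
print([p.post_id for p in feed.top_posts()])   # [123, 456]
```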
The 80/20 assumption works because most people aren't going to be using the service every day, and most of the heavy usage comes from the same users over and over again. If we give those users the best experience, we optimize our storage space, cut down on cost, and still make sure 80% of the usage is covered; that's how we got to the final figure of about 7.79 terabytes. To decide which users stay in the cache we can use an LRU eviction scheme. We could do more intelligent things like IP tracking and per-user usage tracking, but for our purposes an LRU cache is enough: every time a user shows up they're inserted into the news feed cache, and one least-recently-seen user is evicted.

That covers the news feed cache; now we need to understand how the search index works. The search index is similar to the news feed cache in that we can implement it with a data structure, and the one we'll use is a prefix tree, also called a trie. If you're unfamiliar with it, a trie is a way of indexing the words of a dictionary where every character is a node in the tree that links to its subsequent characters as children. If you wanted to find "hey", for example, you'd start at "h", check whether "e" is one of its children, and then check whether "y" is a child of that "e". By traversing the tree you spell out "hey", and you can also see that "hei" (the Nordic "hi"), "hello", and "him" are in the tree as well, because they branch off through different children.

To define this, each tree node holds: the value, which is 1 byte for the character; the children, at most 26 of them, times 4 bytes for a pointer; and the location on the database server, which in this design is 8 bytes for the server's IP address plus 8 bytes for a pointer on that server, repeated for however many servers this node's data is located on. As for the constraints: there are roughly 300,000 words in the English dictionary and the average word is 8 characters long, so at roughly 832 bytes per node times 26^8 possible nodes in an eight-level tree, the search index comes to about 173 terabytes. It could grow even larger if we add more database servers, but the tree lets us break the search index across multiple instances, and because it's kept in cache memory we can access it quickly and we won't have to rebuild it often; we can just keep adding new items to it. Rebuilding the search index would take a lot of time, so ideally we avoid doing that. Retrievals, on the other hand, take O(8) in the worst case, because the worst case is walking eight levels down the tree, so looking up words is extremely fast. It does take a lot of storage space, and there are other solutions, but this one works, it's quick, and it helps us hit our 300 millisecond latency goal.
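As a rough illustration of the prefix tree just described, here is a minimal Python sketch. The node mirrors the fields above, except that the list of (server IP, pointer) pairs is a simplified stand-in for the per-server location data, and the names are my own.

```python
class TrieNode:
    def __init__(self, value: str = ""):
        self.value = value                            # the single character at this node
        self.children: dict[str, "TrieNode"] = {}     # up to 26 child nodes
        self.locations: list[tuple[str, int]] = []    # (server IP, pointer) pairs for matching posts
        self.is_word = False

class SearchIndex:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str, location: tuple[str, int]) -> None:
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, TrieNode(ch))
        node.is_word = True
        node.locations.append(location)

    def lookup(self, word: str) -> list[tuple[str, int]]:
        # One hop per character, so an average eight-letter word costs ~8 steps.
        node = self.root
        for ch in word.lower():
            if ch not in node.children:
                return []
            node = node.children[ch]
        return node.locations if node.is_word else []

# Example: index the word "hey" for a post stored on a hypothetical server.
index = SearchIndex()
index.insert("hey", ("10.0.0.7", 42))
print(index.lookup("hey"))    # [('10.0.0.7', 42)]
print(index.lookup("hello"))  # [] -- not indexed yet
```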
That's essentially the overall architecture for designing a social network. Some other considerations that are good to bring up: how do you scale this further? The database is going to grow extremely quickly; with 200 million posts coming in per day at roughly 1,300 bytes of metadata each, that's about 260 gigabytes per day, so we need to start thinking about sharding the database. One way to shard it is to break it into multiple database instances and use consistent hashing to decide which instance a given key lives on (there's a small sketch of this after the transcript); that way, when we spin up more databases, we don't have to re-figure out where everything is and we don't mess up the rest of the system. We also need to think about load balancers, because we're handling a lot of requests: 100 million daily active users averages out to well over a thousand requests per second, and during times of peak usage it goes much higher, so individual servers would get overloaded extremely quickly. Load balancers are a good way of breaking the application apart and making sure we distribute the load so the application can keep scaling to a million users and beyond at peak. Otherwise, this system is pretty much done; implementing it would give you a working social network at roughly the scale of Facebook, and this is the attack plan for how to go about explaining it. If you enjoyed this video, hit the like and subscribe buttons down below; that tells me you enjoyed it and that I should focus more on these system design style questions. Otherwise, good luck on your interview, and thanks for watching. Peace.
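As a footnote to the sharding discussion above, here is a minimal consistent-hashing sketch in Python. The shard names, the virtual-node count, and the use of MD5 are illustrative assumptions, not details from the video.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Maps keys (e.g. post or user IDs) onto database shards via a hash ring.
    Adding a shard only remaps the keys between it and its ring neighbours."""

    def __init__(self, shards: list[str], vnodes: int = 100):
        self.ring: list[tuple[int, str]] = []   # (hash, shard) points, kept sorted
        for shard in shards:
            self.add_shard(shard, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_shard(self, shard: str, vnodes: int = 100) -> None:
        # Each shard gets many virtual points so keys spread evenly.
        for i in range(vnodes):
            self.ring.append((self._hash(f"{shard}#{i}"), shard))
        self.ring.sort()

    def shard_for(self, key: str) -> str:
        # Walk clockwise to the first ring point at or after the key's hash.
        idx = bisect_right(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

# Example with hypothetical shard names:
ring = ConsistentHashRing(["db-0", "db-1", "db-2"])
print(ring.shard_for("post:123456"))   # one of db-0 / db-1 / db-2, stable across calls
```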
Info
Channel: Justin Barnett
Views: 4,639
Rating: 4.8684211 out of 5
Id: CMlT1EouFYk
Length: 17min 5sec (1025 seconds)
Published: Tue Jul 07 2020