Design Reddit: System Design Mock Interview

Captions
How would you design Reddit's homepage feed?

Hi everybody, and welcome to another mock interview with Exponent. Today we're here with Kevin. Kevin, would you like to introduce yourself? Hi, nice to meet you. My name is Kevin. I'm a senior at Vanderbilt studying computer science, and I've interned at a number of different companies, both large and small, such as Google, Tesla, and a couple of smaller startups. I'm excited for today. Fantastic, we're excited to have you, Kevin. Today we're going to be doing a system design question, and my question to you is: how would you design Reddit's homepage feed?

Gotcha. Reddit's homepage feed, as in the first page that you see when you visit Reddit? Yes, exactly. Let me think about it for a second. Sure. First of all, how many daily users are on Reddit? Let's say 50 million. Okay, that's quite a sizable number. From what I know about Reddit's homepage, it shows a certain number of posts from various subreddits, right? Yes, that's correct. I can't remember how many posts are on the homepage, though. Let's go with 25 for now. Gotcha, so 25 posts on the homepage. Each post is comprised of some kind of content as well as a link to the full post page; it also has a title, it shows the number of comments on the post, and a variety of other metadata. The post content can be one of several types: a text post, a video post, or an image post. Should we design for all of these post types? I think let's focus on text and images for now. Sounds good.

Actually, let me mark this down. You sent me a Whimsical link earlier, right? Yeah, just write your notes on that and share it. Okay: 50 million users, 25 posts on the homepage per page, and we only have to care about image and text posts right now. Yes. Also, from what I remember, on Reddit there are several tabs at the top where you can sort the feed by different settings, like hot, top, and controversial. Should we care about all of these feed sorting methods? Let's not worry too much about the ranking for now. Let's just pretend that whatever posts you see are the top posts, from the subreddits the logged-in user subscribed to, in the past 24 hours. Okay: top posts, last 24 hours.

One more thing: the Reddit homepage looks different depending on whether you're a logged-in user or not. Posts come from various subreddits, and if you're logged in, you're probably subscribed to different subreddits than another user, or than a logged-out user, and only posts from your subscribed subreddits should appear on your homepage. Should we care about building this kind of homepage feed for logged-in users, or is it fine to just handle anonymous users? Let's go with logged-in users. Okay, sounds good.
So we have to consider logged-in users during the system design as well. Cool, I think that's all my questions for right now, and I can move on to calculations. That sounds good.

Let's make another text box here. There are 50 million daily active users who visit Reddit's homepage, and each of them might visit the homepage around five times a day, if that's a reasonable number (per user, that is). That would be around 250 million page hits on the homepage per day, which is pretty large. Each page hit requires loading 25 different posts, so in a single day we would have to load 250 million times 25 posts. Let me calculate that real quick: roughly 6.25 billion posts per day.

For every single post we should also calculate how much data, or how much memory, it takes to load it. Every post has a title, the number of comments, the number of awards or gold, and a timestamp. If it's a text post, text posts usually have around a couple hundred characters, so maybe 300 characters. How are you arriving at the 300-character value? Just from my experience visiting Reddit and clicking on text posts; 300 characters is roughly enough for one or two small paragraphs. That's just been my experience reading text posts, but correct me if that's off. No, that's fine, I just wanted to know how you were approaching it. And is this an excerpt that you would see of a post, or the entire contents? It would be the entire content. I see, okay. If it's an image post, it has an image attached to it instead. There's also the hyperlink to the full post, for when you click on the post from the homepage.

So let's start calculating how much memory each piece of metadata requires. The title would be around 50 characters, and if we consider each character to be around two bytes of storage, that's around 100 bytes for the title. The number of comments is just an integer value, which is four bytes, and the number of golds is also an int, so four bytes too. The timestamp would be a float or a bigint, so eight bytes. The text content is 300 characters times two bytes, or 600 bytes. An image is usually around 100 or 200 kilobytes, so let's say 100 KB. The hyperlink to the full post is maybe 50 characters, or on the high end 100, so let's set it at 100 characters times two bytes, which is 200 bytes.
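To make the arithmetic above easy to check, here is a small back-of-envelope sketch (the five-visits-per-day figure and the per-field sizes are the assumptions stated in the conversation, not measured values):

```python
# Back-of-envelope traffic and per-post metadata numbers from the discussion above.
daily_active_users = 50_000_000
visits_per_user_per_day = 5        # assumption: each user loads the homepage ~5 times a day
posts_per_page = 25

page_hits_per_day = daily_active_users * visits_per_user_per_day   # 250 million
posts_loaded_per_day = page_hits_per_day * posts_per_page          # 6.25 billion

# Per-post metadata sizes, assuming 2 bytes per character.
title_bytes      = 50 * 2     # ~100 B
comment_count    = 4          # int
gold_count       = 4          # int
timestamp_bytes  = 8          # bigint / float
text_body_bytes  = 300 * 2    # ~600 B, the entire text content
post_link_bytes  = 100 * 2    # ~200 B hyperlink to the full post
full_image_bytes = 100_000    # ~100 KB if the post carries an image

print(f"{page_hits_per_day:,} page hits/day, {posts_loaded_per_day:,} posts loaded/day")
```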
When I visit Reddit, most of the posts I see are image posts. Would you like the actual distribution of image versus text posts? Not really; I don't have a number off the top of my head, but let's assume that in our world it's 50/50: at least half of the posts have an image in them. Sounds good, so half of the posts have an image and half don't. For a single post, then, the average amount of data would be 0.5 times (100 + 4 + 4 + 8 + 600 + 100,000) bytes. Let me plug this into my calculator real quick. Honestly, the size of the image takes up most of the data required to load a post: it comes out to about 50.4 KB, which we can round down to 50 KB. So that's roughly 6.25 billion posts per day times 50 KB, which is a pretty large number. Let me see how bad that is.

Is there anything you can do to reduce this number off the bat? Yeah. First I was thinking about how big this total would be; it's well past a gigabyte, into the terabytes and beyond. I think we should first think about how to reduce the value for a single post. If we have to load around six billion posts a day and each post is around 50 kilobytes on average, that would be pretty significant and would not be good for outgoing bandwidth. When I visit Reddit, I usually don't see a full-size image next to an image post; instead I see a tiny thumbnail, and I can click a button below it to expand the image if I want to. I usually don't click on every single image post, just a few that I'm interested in, and I'm guessing that's the same for most other users. So perhaps, instead of initially sending a full-size image for every single post on the homepage, we can send the big image on demand once the user decides to enlarge it, and by default just send a smaller thumbnail.

Great. How is that going to affect your calculation? If we change it to a thumbnail, maybe 10 kilobytes for a low-resolution image, that reduces the image size by about 10x, so each post on average would be around 5 kilobytes instead of 50. One additional piece of metadata we then need is a link to the full-size image in our object store. And I assume the size for that would be similar to the other hyperlink? Right, some hundred bytes or so.

Is there anything else I should do for the calculations? I think this sounds reasonable. You have a sense for how big each post would be, and for how much bandwidth you're going to need if you're loading 25 posts every time someone visits the homepage.
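Continuing the sketch, a rough estimate of how the thumbnail change affects daily outgoing bandwidth (this mirrors the in-interview arithmetic, including the simplification of halving the metadata along with the image; the 10 KB thumbnail size is an assumption):

```python
# Average post size and daily bandwidth, with full images vs. thumbnails.
metadata_bytes = 100 + 4 + 4 + 8 + 600   # title, comment count, gold count, timestamp, text body
full_image     = 100_000                 # ~100 KB full-size image
thumbnail      = 10_000                  # ~10 KB low-resolution thumbnail (assumption)
posts_per_day  = 6_250_000_000

avg_full_image = 0.5 * (metadata_bytes + full_image)   # ~50.4 KB per post
avg_thumbnail  = 0.5 * (metadata_bytes + thumbnail)    # ~5.4 KB per post

print(f"full images: {posts_per_day * avg_full_image / 1e12:.0f} TB/day")   # ~315 TB/day
print(f"thumbnails:  {posts_per_day * avg_thumbnail  / 1e12:.0f} TB/day")   # ~33 TB/day
```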
This sounds good; let's move on to the high-level design. I'm thinking there will probably be quite a few services involved in loading the homepage, so it might be good to split the design into a microservice architecture, where we have multiple different services on different servers talking to one another. That's better in general for scalability, fault tolerance, and separation of concerns between the services. Okay, let's go with that.

As for the different services we would need for this homepage: a basic web server that talks to the client; some application servers that handle the backend logic; a ranking service in charge of ranking the different posts (in our situation, picking the top posts); a feed service in charge of creating the Reddit feed for a user; a user service in charge of updating and fetching data from the user database; and a post service that keeps track of new posts and fetches posts to display on the homepage. For storage, we would need a user database, a post database, and an object store to hold all the images. I think that's pretty much it for the high-level design, but I also have some thoughts about the post database. Sure, what are your thoughts on that?

Given the number of posts Reddit has, which I'm guessing is in the hundreds of millions, we would have to be careful how we create this post database. We would probably need horizontal scalability, or it might not be possible to fit that many posts into the database. I'm also thinking of something like a NoSQL store, because when displaying the feed on the homepage it's not as important that everything you see is the freshest, most up-to-date, or most consistent data. So: a NoSQL store with horizontal scalability, scaled by sharding. We can distribute the posts across multiple database shards based on their post ID, which would give us a roughly equal amount of data across all of our shards. What's the criteria for sharding here? I was thinking of a simple formula: if the post ID is an auto-generated integer, we can use something like modulo to decide which shard a post belongs to. If you have 100 shards, it would be the post ID mod 100, and the post belongs to that shard.
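As a minimal illustration of that modulo scheme (the shard count and function name are just for the sketch):

```python
# With a fixed number of shards, an auto-generated integer post ID maps to a
# shard by taking a modulo.
NUM_SHARDS = 100

def shard_for_post(post_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Return the index of the post-database shard that owns this post."""
    return post_id % num_shards

print(shard_for_post(1_234_567))   # -> 67
```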
However, sharding with something as simple as modulo is usually not a good solution if you ever need to repartition the database, for example if you need to add a new shard because the existing shards are getting too big. Adding a shard shifts all of the modulo values: modding by 101 instead of 100 results in totally different shard assignments. There's something called consistent hashing that we can use to avoid that, so we can increase the number of shards without having to repartition a lot of our data. Okay, I think that strategy makes sense: you're accounting for when your shards get too big, and you have a strategy to deal with that if it happens. All right, sounds good.

Before getting to the feed service, I want to draw out this system; that would help me visualize it a little bit. We have our web servers here and our app servers over here. The client talks to the web servers, which talk to the application servers, and the application servers pull the homepage feed from the feed service. The feed service then talks to a couple of different services: the ranking service, to rank the different posts, and the post service, to get the posts themselves. The app servers also talk to the user service. Let me connect the lines; the arrows just represent making requests. Then we have the user database here, the post database here, and the object store for post images, which the client can talk to directly.

Oh, I'm also missing load balancers, and I definitely want them in here. We want load balancers in front of the web servers and in front of the app servers. Honestly, each service would probably consist of multiple servers, so there should be load balancers in front of most of these arrows, because we don't want the system to have a single point of failure where one server going down means the entire Reddit homepage can't be loaded. That's the overall design.
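To make the consistent-hashing point concrete, here is a minimal sketch of a hash ring (the class and method names are hypothetical; a real system would typically use a vetted library). Adding a shard only remaps the keys that land between the new shard and its neighbour on the ring, instead of reshuffling almost every key the way `post_id mod N` does:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hashing ring with virtual nodes."""

    def __init__(self, shards, vnodes: int = 100):
        self._ring = []                    # sorted (hash, shard) points on the ring
        for shard in shards:
            self.add_shard(shard, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_shard(self, shard: str, vnodes: int = 100) -> None:
        # Virtual nodes spread each shard around the ring to smooth the distribution.
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{shard}#{i}"), shard))

    def shard_for(self, key: str) -> str:
        # Walk clockwise to the first shard point at or after the key's hash.
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing([f"post-db-{i}" for i in range(4)])
print(ring.shard_for("post:98765"))   # mapping stays mostly stable when a shard is added
```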
Okay, so how are you going to calculate the top posts with this system? What I'm thinking is: when the client makes a request for the Reddit homepage, it goes to the web servers, and then the app servers pull the data from the various other services. The app server talks to the user service to determine which subreddits this user is subscribed to, then talks to the feed service, passing in those subreddits. The feed service pulls the appropriate posts for those subreddits from the post service, ranks them appropriately, and returns them to the client.

With this flow, though, in order to pull posts across different subreddits, and because we are sharding the post database by post ID, we would basically have to query every single shard to get the posts across all the subreddits, which could be expensive. Maybe there's a better way: instead of sharding by post ID, we can shard by which subreddit a particular post belongs to. Okay, that sounds reasonable. So if two posts are created in, say, the funny subreddit, they would both live on the same post database shard, and if a user is only subscribed to a few subreddits, one of them being funny, or is only subscribed to funny, then the feed service only has to query that one shard instead of querying all the different shards. Let me update that: shard by subreddit ID.

A problem with sharding by subreddit, though, is that some subreddits have a lot more posts than others, which can result in some shards being much larger than other shards. In that case we can have some functionality where, if a shard gets too big, we split it into two shards, and with the consistent hashing we talked about earlier it wouldn't be too difficult to move some of that data from the old shard onto the new shard. Okay, that sounds good.
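A rough sketch of the request flow just described, assuming posts are sharded by subreddit (the service-client objects and their method names are hypothetical placeholders, not a real API):

```python
from typing import Dict, List

POSTS_PER_PAGE = 25

def build_home_feed(user_id: str, user_service, post_service, ranking_service) -> List[Dict]:
    # 1. Ask the user service which subreddits this user subscribes to.
    subreddits = user_service.get_subscriptions(user_id)

    # 2. Because the post database is sharded by subreddit, only the shards that
    #    own these subreddits are queried for posts from the last 24 hours.
    candidates = []
    for subreddit in subreddits:
        candidates.extend(post_service.get_recent_posts(subreddit, window_hours=24))

    # 3. The ranking service orders the candidates; the top 25 make up the page.
    return ranking_service.rank(candidates)[:POSTS_PER_PAGE]
```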
How would you avoid stale data? Can you clarify that a little bit? We're saying that we want the top posts in the past 24 hours, right? So first, what's the strategy you're using to collect those 25 posts, and then how would you make sure that, if the user is visiting Reddit five times a day, each time those 25 posts are fresh? Fresh as in they should be different from one another? If there are changes to what the top 25 posts are for this user, then yes, they should be different. Gotcha, I see what you mean.

The easy solution would be to generate the feed on demand: when the user visits Reddit, at that point in time the feed service calls the post service, gets the latest posts from the past 24 hours, ranks them accordingly, and sends them back to the user. Doing that generation on demand ensures the user never sees stale data, but it might also be taxing for our backend, given the number of page hits we have. So we could implement it in a better way. Off the top of my head, we could have a cache: when a user visits Reddit for the first time that day, we do this online generation of their feed and cache the result, and if the user visits again in the next minute or two, we just serve that feed from the cache.

Okay, so do you have any strategies to not tax the backend so much? Right. The first thing is that we probably don't have to look at all the posts in the post database. We can cache the top posts per subreddit, refreshed on a daily basis or every couple of hours. Then, when a user requests the Reddit homepage, we just fetch those cached top posts for all the subreddits they're subscribed to, instead of looking at the entire post database across hundreds of millions of posts, and rank those top posts accordingly. We could have a crawler do this caching, crawling every couple of hours or so per subreddit. Actually, quick question: do you know how many subreddits there are on Reddit? Let's assume for our purposes around two million, but not all of them are active at the same time, so assume around a hundred and thirty thousand or so are active. Okay, gotcha. That's really important to know: only a small portion of all subreddits are active. So maybe we only crawl the active subreddits, crawling their top posts from the last 24 hours and keeping track of those; non-active subreddits wouldn't have any new posts, so we can exclude them from being fetched for the homepage.

And how are you differentiating between whether a subreddit is active or not? We could have another layer of crawling for this, maybe once a day, at a much slower cadence, that checks whether a subreddit has any new posts. If it does have a new post, we mark the subreddit as active and move it into the active category; otherwise it's considered non-active. So with that, let me add the top-posts cache in right here, along with a crawler that populates this cache from the post service. I think that's pretty much it for the design.
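A sketch of the two crawler layers described above (the cache client, the post-service methods, and the 100-post cut-off are illustrative assumptions):

```python
def refresh_top_posts_cache(post_service, cache, active_subreddits, top_n: int = 100):
    """Fast crawl, every couple of hours: refresh the top posts of each active subreddit."""
    for subreddit in active_subreddits:
        posts = post_service.get_recent_posts(subreddit, window_hours=24)
        top = sorted(posts, key=lambda p: p["score"], reverse=True)[:top_n]
        cache.set(f"top:{subreddit}", top, ttl_seconds=2 * 60 * 60)

def find_active_subreddits(post_service, all_subreddits):
    """Slow crawl, roughly daily: a subreddit with any post in the last 24h counts as active."""
    return {s for s in all_subreddits
            if post_service.count_recent_posts(s, window_hours=24) > 0}
```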
Do you have any other questions? I think this is a good place to stop. I want to ask you, Kevin: what do you think is the most challenging part about approaching questions like this? I think just keeping track of all the different components of the system and all the different requirements is pretty challenging, but the more system design you do, the more experience you have, and you get better over time. Yeah, and the devil is in the details for system design, so I think you did a good job of highlighting the little things you wanted to add to this system: starting from the load balancers that you addressed right from the beginning, the sharding strategy that you wanted to use, and then caching at the end. It's these little things you add to a system that convey to an interviewer that you've thought about this, you know what you're talking about, and you have ideas for how to design a really robust system. Thank you. Thanks so much for being with us today, Kevin; this was great, and I hope you build the next Reddit. Good luck with your upcoming interviews.

Thanks so much for watching. Don't forget to hit the like and subscribe buttons below to let us know this video was valuable for you, and check out hundreds more videos just like this at tryexponent.com. Thanks for watching, and good luck on your upcoming interview.
Info
Channel: Exponent
Views: 21,827
Rating: 4.4973264 out of 5
Id: KYExYE_9nIY
Length: 41min 31sec (2491 seconds)
Published: Mon May 24 2021