System Design Introduction For Interview.

Video Statistics and Information

Captions
Hello friends, my name is Tushar, and today I'm going to talk about an introduction to system design. Pretty much on a daily basis I get tons of emails and messages asking, "How do I prepare for system design? What are the topics to prepare for system design?" Hence I decided to create this video. System design is a very, very subjective topic, so whatever I'm going to say here is just based on my experience and on talking to friends and peers. In this video I'm going to talk about the basic things you need to do in the interview, but most of my time is going to be spent on the topics which are very important for system design.

So next, let's start with the ABCDs of system design.

A stands for ask good questions. It's your responsibility to ask good questions and define the minimum viable product for the system design problem. A problem could have many features, and it's your responsibility to work with the interviewer and figure out which features he cares about and which features he does not care about. Remember, you're working under a very strict time frame, so make sure that your feature set is small and that you go deep into this feature set. The second thing to ask about is how much the system should scale: for example, how much data you need to store in the database, how many requests per second need to be handled, or what kind of latency is expected from the system.

B stands for don't use buzzwords. Suppose tomorrow is your interview and you read about consistent hashing for 10 minutes today. Don't just go to the interview and start throwing those words around. It might work sometimes, but it does not work most of the time; in fact, it backfires. So make sure that whatever technologies or concepts you're mentioning in the interview, you have some sort of in-depth knowledge of them.

C stands for clear and organized thinking. Before jumping into the minor details of the problem, first try to define the 50,000-foot view of the problem. Make sure you have
defined all the APIs, make sure you draw the right boxes, and make sure you understand who the actors are for the system. Once you have defined all those things, then go deeper into the details, working with the interviewer.

And D stands for drive discussions. I have a very simple 80-20 rule: you should be talking 80% of the time, and the interviewer should be talking 20% of the time. So make sure you lead the discussion, and make sure you anticipate the problems in your solution and fix them preemptively.

Obviously, the ABCDs are much easier said than done. You can improve on them in three ways. First is your personal experience: if you are working on a high-scale system, it's much easier to improve on those things or solve those things on a whiteboard. Second is practice: think about a system design question, work with your friends and peers, brainstorm ideas, and see what technologies you can use to solve the problem. And third is gaining knowledge through reading blogs, going through videos, and things like that.

So next, let's talk about some of the basic things which are required in system design problem solving.

The first thing you need to work out with the interviewer is the features. This goes back to defining the minimum viable product by asking the interviewer good questions. For example, if the interviewer asks you to design Facebook Messenger, then some features you would want to include are one-to-one chat and showing that the other party received the message and read the message, and so on. Some features could be excluded, like group chat or security and those things. So features are something you have to work on with the interviewer, to figure out what he or she cares about and what can be excluded.

The second thing is defining APIs. Now that you are set on the features, you need to figure out what the APIs are for your service, which are going to implement those features.
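
To make the "defining APIs" step concrete for the Messenger example above (one-to-one chat plus delivered and read indicators), here is a minimal Python sketch. It is not from the video: the names `ChatService`, `send_message`, `mark_delivered`, `mark_read`, `get_status`, and `get_conversation` are hypothetical, and an in-memory dictionary stands in for whatever storage a real service would use.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import count
from typing import Dict, List


class Status(Enum):
    SENT = "sent"
    DELIVERED = "delivered"
    READ = "read"


@dataclass
class Message:
    message_id: int
    sender: str
    recipient: str
    text: str
    status: Status = Status.SENT


class ChatService:
    """Hypothetical API surface for a one-to-one chat MVP (illustration only)."""

    def __init__(self) -> None:
        self._messages: Dict[int, Message] = {}  # stand-in for a real database
        self._ids = count(1)                     # simple increasing message ids

    def send_message(self, sender: str, recipient: str, text: str) -> int:
        """Store a new message and return its id."""
        message_id = next(self._ids)
        self._messages[message_id] = Message(message_id, sender, recipient, text)
        return message_id

    def mark_delivered(self, message_id: int) -> None:
        """Called when the recipient's device has received the message."""
        message = self._messages[message_id]
        if message.status is Status.SENT:  # never downgrade READ to DELIVERED
            message.status = Status.DELIVERED

    def mark_read(self, message_id: int) -> None:
        """Called when the recipient has opened the message."""
        self._messages[message_id].status = Status.READ

    def get_status(self, message_id: int) -> Status:
        return self._messages[message_id].status

    def get_conversation(self, user_a: str, user_b: str) -> List[Message]:
        """Return all messages exchanged between two users, oldest first."""
        pair = {user_a, user_b}
        return [m for m in self._messages.values()
                if {m.sender, m.recipient} == pair]
```

Writing the calls down this way also exposes what the sketch leaves out, such as authentication and group chat, which matches the features excluded from the minimum viable product.
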
So what are the APIs, who is going to call these APIs, how are they going to call them, and things like that are what you need to figure out in the second step.

Third is availability. Now that you have come up with a service, you need to figure out how available this service is. For example, if a host went down, is the service still going to be available? Or heck, if the entire data center went down, would the service still be available? You have to discuss with the interviewer to figure out how much availability he cares about in that system.

Fourth is latency performance. If it's a background job, then you do not care too much about the latency. On the other hand, if it's a customer-facing request, then obviously you want your system to be super fast, and based on the requirements you might want to add a cache and things like that to improve your latency. So that's one of the aspects you need to care about while designing your system.

Then we have scalability. Now you design a service and it works for a hundred users, but the question is, is it going to work for a thousand users, or is it going to work for a million users, and so on. So that's scalability: is it going to scale as we add more users or more requests? Does it continue to have good latency performance? Does it continue to be available as we add more and more users? That's, again, something you need to consider in your design.

Then we have durability. Durability might be important for some interviews and might not be important for others. Durability is the fact that data can be stored in a database securely, and data is not lost and data is not compromised. Sometimes it's okay to say, "Hey, I'll use this database, and that database will do all the job for me." On the other hand, in other interviews where you are designing the database itself, that's the place where durability plays a central role, and you need to make sure that your system is durable enough.

Then we have class diagrams. So
sometimes you get questions like design a parking lot or design an elevator system, and in those questions the interviewer is sometimes interested in knowing how you would design the classes and what object-oriented principles you would use for solving those problems.

Then we have security and privacy. Again, in most interviews you will probably not care about security and privacy. But let's suppose the question you get is to design an authentication system; in that question, security and privacy will play a central role.

And finally we have cost effectiveness. Whatever solution you suggested, is it a cost-effective solution? Is there an alternate solution which would be more cost effective? You have to discuss some pros and cons of different solutions as far as cost is concerned.

So now that we know some of the basic things we need to do in an interview, let's jump on to the concepts and topics which you should know before going into a design interview. These are some of the concepts you need to know to improve your system design problem solving. Although it's a big list, by no means is it an exhaustive list. What I'm going to do today is go through them one by one and give a one-liner explanation of each concept. Obviously I cannot go into too much detail because, frankly speaking, each of them deserves a video of its own.

So let's start with vertical versus horizontal scaling. If you need to scale up your system, either you can do vertical scaling, which means that you add more memory, CPU, and hard drive to an existing host, or you can do horizontal scaling, which is to keep each host small but instead add another host. Vertical scaling can be expensive, and there is also a limit to how much memory and CPU you can add to a single host, but it does not have distributed systems problems. On the other hand, with horizontal scaling you can infinitely keep adding more hosts, but you have
to deal with all the distributed system challenges. So obviously horizontal scaling is preferred over vertical scaling.

Second is the CAP theorem. CAP stands for consistency, availability, and partition tolerance. Consistency says that your read has the most recent write. Availability says that you will get a response back; it might be the most recent write or it might not be. And partition tolerance says that between two nodes you could be dropping network packets. What the CAP theorem says is that you can only achieve two of these three things. You have to have partition tolerance, because networks do drop packets, so basically you're choosing between consistency and availability. Traditional relational databases choose consistency over availability, which means that they could be less available but their data is always consistent. On the other hand, NoSQL databases prefer availability over consistency, if you choose to configure them that way.

Next up is ACID versus BASE. ACID stands for atomicity, consistency, isolation, and durability, while BASE stands for basically available, soft state, eventual consistency. ACID is used more in terms of traditional relational databases, and BASE is used more for NoSQL databases. You need to understand the differences, because once you start using more NoSQL databases you need to understand which part of the ACID properties you're willing to sacrifice.

Then we have partitioning, or sharding, of data. Let's suppose you have trillions of records. There is no way you can store these trillions of records in one node of a database, so you need to store them across many different nodes, and that's where sharding comes into the picture. How do you shard or split this data such that every node of the database is responsible for some part of those trillions of records? One technique used heavily is
consistent hashing, and you definitely need to know how consistent hashing works and what advantages it brings to the table.

Then we have optimistic versus pessimistic locking. Let's suppose you are doing a database transaction. In optimistic locking, you do not acquire any locks, but when you are ready to commit your transaction, at that point you check that no other transaction has updated the record you are working on. On the other hand, in pessimistic locking you acquire all the locks beforehand and then you commit your transaction. Both of them have their advantages and disadvantages, and you need to understand which of these locking strategies to use in which scenario.

Next is strong versus eventual consistency. Strong consistency means that your reads will always see the latest write, while eventual consistency means that your reads will see some write and eventually will see the latest write. Strong consistency is obviously used in relational databases. In a NoSQL database you have to decide whether you want strong or eventual consistency. The benefit of eventual consistency is that it provides higher availability, and this all goes back to the CAP theorem.

Next up is relational database versus NoSQL database. These days I see that most people prefer to use NoSQL databases, and that's fine, but do not discard relational databases just yet. Remember, a relational database provides all these nice ACID properties, while a NoSQL database scales a little bit better and has higher availability. So depending on the situation, depending on the problem, try to see which one of the two fits better.

Types of NoSQL database: the first one is key-value. These are the simplest kind, where you have a key and you have a value, and it stores this key-value pair in the database. The second one is the wide-column database. In a wide-column database, a row can have many different formats, many
different kinds of columns, and it can also have many, many columns; that's why it's called a wide-column database. Then we have the document-based database. If you have semi-structured data, or if you have XML or JSON data, and you want to persist it into the database, then you would use a document-based NoSQL database. And the final one is graph-based. Let's suppose you have entities and you have edges, or relationships, between those entities. So basically, if you have a graph, a graph-based NoSQL database is used to hold that graph.

Caching is used to speed up your requests. If you know that some data is going to be accessed more frequently, then store it in the cache so that it can be accessed quickly. Caching is of two types. In the first, every node does its own caching, so the cached data is not shared between nodes. The second is called a distributed cache, where the cached data is shared between different nodes. If you're using caching, you have to consider a few things. First, a cache cannot be the source of truth. Second, cached data has to be pretty small, because a cache tends to keep all its data in memory. And third, you have to consider eviction policies for the cache.

Then we have data centers, racks, and hosts. This is just saying that you should be aware of how data centers are architected today. Data centers have racks, and racks have hosts. You have to understand what the latency is when talking across racks, across hosts, or even across data centers, and what the worst case is that can happen if a rack goes down or, heck, if the entire data center goes down.

Then we have CPU, memory, hard drive, and network bandwidth. All of these are limited resources, so when you design your system you need to consider how to work around these limitations and how to improve throughput and latency and scale your system within these limited resources.

Then we have random versus sequential read
and write on the disk. We know that reading and writing on a disk is usually slow, but sequential reads and writes are actually amazing for the disk. So you should design your system around sequential reads and writes, while trying to avoid random reads and writes, which are an order of magnitude slower than sequential reads and writes for that disk.

Next up is HTTP versus HTTP/2 versus WebSocket. HTTP is the request-reply kind of architecture between client and server; pretty much the entire web runs on HTTP. HTTP/2 improves on some of the deficiencies of HTTP; for example, it can do multiple requests over a single connection. And then we have WebSocket, which is fully bidirectional communication between client and server. It would be good to know some of the differences between them and some of their inner workings.

Next up is the TCP/IP model. There are four layers in the TCP/IP model, and it's good to know what each layer does.

Then we have IPv4 versus IPv6. IPv4 has 32-bit addresses and IPv6 has 128-bit addresses. We are running out of IPv4 addresses, so the world is migrating towards IPv6, and it's good to know some of the details around that, and also how IP routing works.

Then we have TCP versus UDP. TCP is a connection-oriented, reliable protocol, while UDP is connectionless and unreliable. If you are streaming video, then you are better off using UDP because, although it is unreliable, it is super fast. On the other hand, if you are sending documents, then you are better off using TCP.

Then we have DNS (Domain Name System) lookup. If you type www.facebook.com in your browser, the request goes to DNS, which translates this address into an IP address. It's good to know how those things work, what the hierarchy around them is, how they do caching, and things like that.

Next is HTTPS and TLS. TLS is Transport Layer Security, so it is used to secure communication between client and
server, both in terms of privacy and data integrity, and when used with HTTP it pretty much becomes HTTPS.

Next is public key infrastructure. This is used to manage your public keys and your digital certificates. A certificate authority is a trusted entity which tells you whether the public key is from the correct party. For example, if you type www.facebook.com in a browser and this is going over HTTPS, then you will get a public key back, and the certificate authority tells you that this public key is definitely coming from Facebook and not from a third party who has sat between you and Facebook.

Then we have symmetric and asymmetric encryption. Asymmetric encryption is computationally more expensive, so it should be used to send small amounts of data, preferably a symmetric key. An example of asymmetric encryption is public-private key encryption, while an example of symmetric encryption is AES.

Load balancers sit in front of a service and delegate the client requests to one of the nodes behind the service. This delegation could be based on round robin or on the load average of the nodes behind that service. Load balancers can operate at L4 or L7, and these are layers of the OSI model. An L4 load balancer considers both client and destination IP addresses and port numbers to do the routing, while an L7 load balancer, which works at the HTTP level, uses the HTTP URI to do the routing. Most load balancers operate at layer 7.

Then we have CDN and edge. CDN is content delivery network. Let's suppose you are watching Netflix from Seattle. What Netflix will do is put the movies and series in a content delivery network close to you in Seattle, so when you're streaming the movie, it can be streamed right from the CDN close to you instead of all the way from the data center. This helps both performance and latency for the end user. Edge is a very similar concept, where you do processing close to the end user.
Another advantage edge provides is that it has a dedicated network from the edge all the way to the data center, so your request can be routed through this dedicated network instead of going over the general internet.

Bloom filters and count-min sketch are space-efficient probabilistic data structures. A Bloom filter is used to decide whether an element is a member of a set or not. A Bloom filter can have false positives, but it will never have false negatives. So if your design can tolerate false positives, you should consider using a Bloom filter, because it's very space efficient. Count-min sketch is a similar data structure, but it is used to count the frequency of events. Let's suppose you have millions of events and you want to keep track of the top events; then you can consider using a count-min sketch instead of keeping the count of all the events. For a fraction of the space, it will give you an answer which is close enough to the actual answer, with some error rate.

Then we have Paxos, which is used to derive consensus over distributed hosts. Before Paxos came along, finding consensus was a very hard problem. An example of consensus is doing a leader election among distributed hosts. I do not expect you to know how Paxos works internally, but it's good to know some of the use cases which Paxos solves.

For design patterns, things like factory methods and singletons are good to know, while for object-oriented design, things like abstraction and inheritance are some of the things you should know.

Virtual machines are a way of giving you an operating system on top of a shared resource, such that you feel like you are the exclusive owner of the hardware, while in reality that hardware is shared between different isolated operating systems. Containers are a way of running your application and its dependencies in an isolated environment. Containers have become extremely important, and they run a lot in production environments these days.

Then we have publisher-subscriber over a queue. So
you have some publisher that publishes a message to a queue, and a subscriber receives that message from that queue. This pattern has become extremely important in system design these days, and you should definitely use it whenever you have an opportunity. One thing to remember is that customer-facing requests should not be directly exposed to a pub-sub system.

Then we have MapReduce, which is used to do distributed and parallel processing of big data. Map is filtering and sorting the data, and reduce is summarizing the data. This is something which is very important if you're working in the big data field.

And finally we have multithreading, concurrency, locks, synchronization, and compare-and-set semantics. These are all very important to know in the world of multithreading. Some programming languages like Java come with these things built in, while for other languages like C you have to depend on platform-specific implementations.

So this is all I had to talk about for the concepts. Next, let's look at some of the actual implementations of these concepts. These are some of the tools which are useful not just for the system design interview but also in real life, if you're going to work on a high-scale system. Obviously this is a very small list, and there are many, many other tools out there, but in the interest of time I have kept it restricted to this small number of tools.

The first one is Cassandra. Cassandra is a wide-column, highly scalable database, and it's used for different use cases like simple key-value storage, storing time series data, or just your more traditional rows with many columns. Cassandra can provide both eventual and strong consistency. Under the hood, Cassandra uses consistent hashing to shard your data and also uses gossiping to keep all the nodes informed about the cluster.

The second is MongoDB or Couchbase. If you have a JSON-like structure and you want to persist it, then MongoDB works perfectly fine.
They provide ACID properties at a document level, and they also scale pretty well.

If you have a more traditional use case with many tables and relationships between these tables, and you want the full set of ACID properties, then I would go ahead and use a MySQL database. MySQL also has a master-slave architecture, so it also scales pretty well.

Memcached and Redis are distributed caches, and they hold the data in memory. Memcached is simple, fast key-value storage. Redis can also do key-value storage, but it does a lot of other things as well. Redis can also be set up as a cluster, so it can provide things like more availability and data replication. Redis can also flush data to the hard drive if you configure it to do so. Two things to remember when using a distributed cache: first, it should never be the source of truth, and second, it can only hold a limited amount of data, which is limited by the amount of memory on the host.

ZooKeeper is a centralized configuration management tool. It is also used for things like leader election and distributed locking. ZooKeeper scales very well for reads but does not scale that well for writes. Also, since ZooKeeper keeps all its data in memory, you cannot store too much data in ZooKeeper. So if you want to store a small amount of data which should be highly available and which has tons of reads, then ZooKeeper is what you should be using.

Kafka is a fault-tolerant, highly available queue used in publisher-subscriber or streaming applications. Depending on your use case, it can deliver a message exactly once, and it also keeps all the messages ordered inside a partition of a topic.

NGINX and HAProxy are load balancers and are very efficient. For example, NGINX can manage thousands or even tens of thousands of client connections from a single instance.

Next up is Solr and Elasticsearch. Both of them are search platforms on top of Lucene. Both of them are highly available, very scalable, and fault-tolerant search
platforms, and they provide things like full-text search.

Next is blob store. Let's suppose you have a big picture or a big file and you want to store it somewhere on the cloud; then a blob store can be used. A very popular blob store is Amazon S3, which is provided as a part of the AWS platform.

Docker is a software platform for providing containers inside which you can develop and run your distributed applications. These containers can run on your laptop, in the data center, or even on the cloud. Kubernetes and Mesos are software tools used to manage and coordinate these containers.

Hadoop has many things going on inside it. One of them is MapReduce, and we already talked about MapReduce, which is parallel processing of large data. If you want a faster version of that, then you use Spark, which does all the MapReduce in memory. HDFS is a Java-based file system which is distributed and fault tolerant, and Hadoop relies on HDFS for doing all its processing.

This is it; this is my introduction to system design. We went through the ABCDs of system design, then we talked about some of the basic things you need to do in the interview, and then we went through tons of concepts and tools. I know I went through the concepts really fast, but my intention today was not to give you too many details but instead to introduce them so that you can read about them in your own time. I'm going to put all these details and a lot of other references in the description section of this video. Please like this video, share this video, comment on this video, and check out my Facebook page. Thanks again for watching this video.
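
The captions single out consistent hashing as the sharding technique you definitely need to know, and mention that Cassandra uses it under the hood. The video itself shows no code, so the following is only a minimal Python sketch of the general idea, not Cassandra's implementation. The class name `ConsistentHashRing`, the choice of SHA-256 as the hash, and the default of 100 virtual nodes per host are all assumptions made for illustration.

```python
import bisect
import hashlib


def _position(key: str) -> int:
    """Map a string to a point on the ring (first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    """Illustrative hash ring; each node owns several points (virtual nodes)."""

    def __init__(self, nodes=(), vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._owner = {}       # ring position -> node name
        self._positions = []   # sorted list of ring positions
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            pos = _position(f"{node}#{i}")
            if pos in self._owner:   # already placed (or an unlikely collision)
                continue
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove_node(self, node: str) -> None:
        for i in range(self._vnodes):
            pos = _position(f"{node}#{i}")
            if self._owner.get(pos) == node:
                del self._owner[pos]
                self._positions.remove(pos)

    def get_node(self, key: str) -> str:
        """Return the node owning the first ring position after the key's hash."""
        if not self._positions:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_right(self._positions, _position(key))
        idx %= len(self._positions)  # wrap around the end of the ring
        return self._owner[self._positions[idx]]
```

The advantage this brings to the table is that adding or removing a host only reassigns the keys next to that host's ring positions. With a plain `hash(key) % number_of_nodes` scheme, by contrast, changing the number of nodes would move most keys.
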
Info
Channel: Tushar Roy - Coding Made Simple
Views: 429,851
Rating: 4.9535041 out of 5
Keywords: system design, cassandra, consistent hashing, system design interview questions, system design interview
Id: UzLMhqg3_Wc
Length: 27min 22sec (1642 seconds)
Published: Sun Nov 26 2017