System Design : Design a service like TinyUrl

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hello friends my name is Tushar and today I'm going to talk about how to design a tiny URL this is my first design video and I take this question because it's an extremely popular interview question why is the popular entry question there are a couple of reasons first the problem statement is extremely simple which is you have to design a service where user will give you a long URL and you need to return a short URL and when he gives you the short URL you need to give him their original long urine the second reason if you will of this question is because this question has some very interesting challenges specifically try to design it at scale so before we go into the solution there are a couple of things I need to talk about first design questions of extremely subjective in nature so take this solution with a grain of salt take more about it and try to come up with your own solution second I might not be able to cover all the areas around this question so hopefully I will try to make individual videos in the future covering those areas third I would like to know your feedback especially how you have designed this question if you were asked this in the interview and finally I would like to thank my peers at work and my friends who helped me design this who helped become the solution for this question so next let's see how we are going to design this how we are going to design tiny your when an interviewer is asking this question I'm fairly certain he is not looking for a solution where you take a longer URL generate a shorter URL for it store it in a map and then return the longer yours from the map each Miro is asking you this question to test your knowledge on scalability and durability this solution of using a map is neither scalable nor durable when I'm designing a service I think about these things straightaway first is a PR API is how the user is going to interact with my service second an application layer answer is the persistence layer for this question API is extremely simple you have a creates tiny which takes a long URL and you two tiny oil and then you have get long which takes a tiny oil and return to the long yours if you need to add features like exploration time or ability to provide shorter URL you can easily do that and you can easily do that by updating this API so we won't spend a lot of time on the API and next let's jump on the application layer when interviewer asks turn your question I think he is looking for one of the two things first is how do you generate a tiny URL which is unique remember the actual URL could be hundreds of characters long while we need to generate a tiny world which is seven or eight cap is long and it has to be unique or interviewer is interested in see how you would design the persistence layer basically where and how you would use towards the short oil and the long yodel in this video I am going to concentrate mostly on the first part which is how do you generate a tiny oil which is as unique as possible and differ the discussions of persistence where for some other video having said that I will discuss little bit what persistence where in the end so when until you ask you turn your question at the minimum is expecting you to have a block diagram like this let's quickly run through this diagram so here is the customer or a client he talks to the service through either REST API or HTTP or bunch of other open-source software available in the market and if you know to what rest is I highly recommend reading about it customer talks to the servants through a load balancer so load balancer is the front-end for the service load balancer is either software or software plus fiber combination which at the at the minimum does the delegation of delegation of the request to one of the worker threads load balancer could also do a bunch of other fancy things it totally depends on depends on how much money you're willing to pay to get the load balancer this water threads or this worker hose what they do is they take the URL generator tinyurl for it and store both of them in the persistence layer and when the get request comes he'll take the get the longer yours for the shorter URL and return it to the client we'll also have a caching layer which which could be memcache or reddit or bunch of other caches available in the market and we discussed little bit about them in the end of the video so next let's come to the gist of this problem which is how do you generate a tiny or other widgets as unique as possible let's see how many characters we need in our tiny URL before that let's see what are the characters we can have into an URL we can have a to Z which is 26 characters then we can have our case a to Z which is another 26 characters and then we can have 0 to 9 which is 10 characters this is total of 62 characters so if my time URL is 7 character long in that case I can have 62 raise to 7 combinations and this is approximately about 3 point 5 trillion combinations if your service is generating thousand tiny orders per second in that case it would take you 110 years to exhaust with 3 point 5 trillion combinations on the other hand if your service is generating million tiny orders per second then you exhaust this three point four trillion in about 40 days so if you're doing more requests per second then you need to increase your number of characters in the Tanyard for this video we will assume that our service is supporting thousand requests thousand thousand requests per second so we are happy with seven character long URL also notice that of any number from zero to three point five trillion can be represented by 43 bits so next let's see what are the different techniques to generate a seven character along your as unique eight these are some of the techniques to generate tinyurl by no means this list is exhaustive but it's going to give you a good idea on how to generate highly URL so before we go into the techniques let's talk about the database what would the table schema look like the table schema would have key as a tiny oil and value as the long the longer URL and unless you need to add more features this is more than enough for the table schema so now let's let's talk about the first technique the first technique is to generate the tiny order and then check the database so what happened here is that you get a request from a user who says he generated tiny world for my long URIs the worker host gets the request then he generates a random tiny URL after he given silent on URL he could do one of the three things here so the first thing it does is he doesn't get on that time URL and checks to see if it only exists in the database or not if that sign of town URL does not exist in the database then you can put this time URL and longer yours into the database does this work it works most of the time but sometimes it won't work why because after the worker is done getting the time yours and why is putting the time you order and a long URL another worker thread for another request could be putting the same tiny URL for some other long URL into the database and one of those puts is going to win and it's going to corrupt the longer for the other requests so stable connection system will not deploy this technique let's look at the technique number two the technique number two is saying is that put is absent into the database so this requires support from the database so here what is saying is that hey why you put this time yours and long yours into the database if there is no key whose value is equal to this time you are so for relational databases like Mike in Oregon who support food acid properties at obesity isolation consistency for them is a trivial operation for no sequel databases they might or might not support put is absent so why would you use an article databases or relational database because no sequel databases skills really well compared to relational databases on the other hand some relational databases like Oracle is extremely expensive so the technique number two works if you have database support let's look at the technique number three what I'm doing here is again worker generates a random tinyurl then he puts its tiny words and long URL into the database then he doesn't get of the tiny order and checks if the longer order the test with this title is same as the one which he puts if that's the case then he is done otherwise he generates a new random time yours and puts that again and then does it get again to verify that the long URL matches or not so he can theoretically keep doing this infinitely until he finds a random yours which does not exist in the database or not so if the if the if the function generating random tiny yours is smart and generates random is an randomness is good in that case it should not take more than one attempts to get this done but the problem with this approach is there that for every put you're still doing at least one get or more gets depending on if there is a collision or not so this is this is the three different ways if you generated a right-hand turn URL next let's look at the md5 approach in md5 approach we calculate the md5 of the longer URL then take the first 43 bits of that md5 and then use that to generate the tiny URL or is it md5 and defined that hashing function which are is 128-bit long hash and out of that we will take the 43 bits and then you're a tiny URL from that so what is the power greatest collision there is a probability of collision if you take more bits of that 128 bits that the probability of collision will get lesser but your time you others will get will get longer on the other hand is to take you less bits of that 128 bits then the probability of collision will get get more for your tinyurl will get shorter in our chains since we want to generate seven times along time yours we will take for the three bits of that and define to be foolproof what we are going to do is you know take this 43 bills generate a ton URL and then apply our checking the database technique we discussed before so what is the advantage of doing this or the right of your L so this has an advantage of saving some space in the database what suppose to users want to generate a tiny URL for the same long urine in the first case in the first technique we will generate two random tine URLs so they'll be tools in the database in the final technique both the longer URL will have the same false simplifies so you'll have the same false for three bits which means that we can get some video being which means that we will end up saving some space since we only need to store one row instead of tools in the database for the viewers who not clear how this for three bits will convert into a into seven country longest swing let me show that quickly so for three bits this one is zeros and ones suppose we have 43 bit long zeros in one so this is Winery you can convert this into decimals by doing 2 raise to 0 to list 1 and adding them so suppose this number comes out to be 1 million sometimes something once you get this number then all you have to do is convert that into base 62 so my basics to do is means that you divide by 62 and percentage 62 this is same as your convert this number into binary but instead of binary you are converting to basic c2 so when you convert this into basic c2b you get numbers from 0 to 61 so it could be 65 13-0 something like that then what you do is then you match these these characters through a den you map these numbers to characters so remember we talked about our our time URL that have a to Z into Z n 0 to 9 so what we would say here in a maps to 0 B maps to 1 and on and then capital averages a maps to 26 and so on so then you can take if I hit the 60 will be number 8 and this 5 will be e and this is 30 could be uppercase B and so on so this way you can get the tiny URL this we can generate seven character long tiny yours from this 43 bit so next let's talk about this counter based approach counter based approach has different ways to solve the problem so first is a single host so in a single house approach a single host is responsible for maintaining a counter this single host could be a database or it could be a zookeeper instance if you don't know what zookeeper is I highly recommend reading about it so what happened here is that when worker hosts get a request with your little tiny URL they talk to this counter host the host which is maintaining the counter and he returns then this one unique number and then he increments the counter eternally and when the next request comes he turns him that number and then again increments the counter so that way every worker holds get a unique number and then they can generate this tiny URL based of that unique number the problem with this approach is single point of failure and single point of borderline so is this host went down it might take some time for another host to take over his responsibilities also it's possible that if the number of requests per second is extremely high then this counter host might not be able to handle that kind of load so then we move on to the second approach where every host internally tries to maintain this counter so how would that work so suppose we have 64 a water hose so how many bits you need to present a number from 0 to 63 you need 6 bits so so you assigned unique 6 bits to every host then you take a current time step which is 32 bits so 6 month a 2 is 38 and we have 5 more bits to reach for entry so then you can add 5 bits of either random or incremental or random or incremental value so here so here leading to total of 43 bits so in this case or the probability of collision I would say pretty high if your number of requests is thousand per second it's an emergency as a thousand percent it means that every worker thread is doing 20 requests per second so in a second this is not going to change this is not going to change this here can be total of 30 can you represent max of thirty two numbers and giving twenty per second there is a good chance that if you generated this randomly you will have a very high you will have a collision also in terms reading randomly if you were just incrementing it if you happen to have do more than 32 requests per second and that is also this useful also if you're adding holds or removing horse water hose then this will not be stable because not you want to add a sixty-fifth host then you have to take out if your ethnicity so then you have to then the 6 bits representation is not by the word so so to make this work you have to pick more than four three bills or you have to place the techniques where you use less you were very don't waste 32 bits on a time stamp but you sum all these parts of the time stamp and increase your randomness or your counter so so this is all they all host approach where every host is trying to maintain an internal counter but it has a little bit higher probability of collision next let's look at a range based of remember when I said that 7 characters generate up to three point five trillion combinations so range based approach tries to use as much as many combinations as possible and it scales really well so how does that work so first so already three point five trillion combinations we'll just worry about the first billion combination initially so what we do is we take we divide this billion combinations in two thousand ranges of 1 million each so forth which will go from 1 to million secondly it will go from million 1 to 2 million all the way to 1 billion and we'll have thousand such ranges and this information will be maintained in zookeeper a zookeeper is highly available and durable so what happened is a worker threads or worker host they come to zookeeper and they say that hey give me an unused wage so when you start new all the ridges will be unused a two billion worker one will come he will claim range one two thousand five million so this will be used smart use and then he was internally whenever he gets a tiny credit annual request he will internally work between this one two million range keep incrementing will maintain an internal counter keep incrementing and keep generating a unique number and then generate a tiny URL based of that so where the two comes he gets range a million one to two million and so on so what it does is it guarantees that there is no there is no collision because all the older worker hosts are working within a Ridge and it has any point of time the exhauster range what they could do is they could again come back to the zookeeper and say hey give me a new unused range so the best thing here is that you would have workers you you can add new water threads without worrying about where well they would go because the newer concern would come in and you and say hey give me a new unused range and zookeeper can give him a new one use range also lesser water treads dies in between so all it happens is that we have we just lose type whatever you need not use we just lose that many combination and I think since in the worst case you're losing a bill in combination it's not a big deal because we have up to three point five trillion combinations to work with now when we have done using all the ranges till billion what we do is we say that hey our lower mark is now billion video we have nothing below combinations left till billion so now we'll take 1 billion to two billion create another thousand such ranges and then this worker threads will work on those ridges so you can keep doing this until you reach three point five trillion combination so this works really well as long as you have a service like zookeeper which is able to maintain this configuration information there is one little one little problem with this approach is that we are generating a URL in a sequential manner and suppose some could say that this is a security threat because because because hackers can predict what the next what the next URL would be so to weather on that and all of the counters based approach has the problem if there is no randomness so the water on that what we can do is we can take we can take us we can take this counter this number unique number and then add some random bits to the end of it like let's say add another 10 2010 15 bits and the end of it and then generate the time you are so in this case you have 43 bits plus another 10 bits so 53 bits and then use those two bits three bits to generate with higher ed which gets annual review of lengths and nine or ten instead of seven so this is these are all the three techniques based on counters where you can generate very and generate chanyeol so finally let's summarize all what we discussed about Stan you are well create our your request comes in it goes to the load balancer your balance and picks one of the worker hosts that the create journal goes to the worker hosts the worker who is if he is working on a range based counter then he's going to see if his country is exhausted or not if it is exhausted and it talks in the zookeeper gets a new range on base of that regen she has a new number base of that number created short URLs on oil and then source a tiny URL and long yarn into the database at the same time this world is also will distort this data into a distributed cache one hatch because the nature of business for tiny oil is that if you put it on your than a Twitter it's going to be accessed a lot initially and after 2 hours nobody is going to care about it so it's very important to cache the unicredit tiny URL into a it's some sort of it is a bit of cash when the get request comes in so when someone comes with a tiny URL I get again goes to the load balancer from load balance to wonder thread what which I will first see is that information is there in a cache or not if it is there then it's going to respond the long URL from the cache otherwise is going to go to the database and try to get the long URL from there one more thing I want to talk about is global usage suppose you're suppose you're this system server system is in u.s. but let's say your users are you are in Europe or in India or any other country on the world then how do you how do you work in that kind of system for target specific use case I think that the create URL can be slow but the get URL get the longer your for the shorter it should be extremely fast so what we do is for during the get URL you go all the way to your request will go all the way to this server system you will put you create a tiny oil for that and then you will cache this time you others and long yours in the CDN in your local country in your local city so what happens next is that when the gate comes in the gate can be responded from that CDN itself once again seeding is a content delivery network I I would again ask you to read more about this so this is all have to talk about tiny your these are some of the terms which are used but did not get chance to go deeper into it and maybe in a future video I might look into them more again and looking back to your feedback specifically how you would have created tiny your times for what it is
Info
Channel: Tushar Roy - Coding Made Simple
Views: 460,475
Rating: 4.8015423 out of 5
Keywords: tinyurl, shorturl, system design, large scale system, scalable system
Id: fMZMm_0ZhK4
Channel Id: undefined
Length: 24min 9sec (1449 seconds)
Published: Sat May 27 2017
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.