Google Maps System Design Interview Question

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
Hi everyone!! Welcome to CodeKarle. My name is Sandeep, and in this video we'll go over how do we design a navigation application like Google Maps. Now let's go over some Functional and Non Functional Requirements(NFRs) that we want this platform to support. So the very first thing is when a person wants to move from point A to point B, this platform should be able to give them a couple of information . 1) What is the route that they should follow 2) Given that route how much distance they'll have to cover and how much time it will take. As an enhanced variant we could also build a system in which we show them 2-3 routes and then let them choose that they want to minimize the distance or minimize the time. 3) The next thing is that it should have a very pluggable model. So let's say we make a very basic design which just has no traffic data or anything of that sort. Now let's say if you want to plug in traffic information it should be easy enough to add it. Now post that let's say if you want to add information about weather or accidents or construction going on somewhere or road blockages or something of that sort, it should have a very easy way to input this data and not make a change in whole of the architecture. So that's the main idea of making a pluggable. 4) There's one more thing that we'll cover in a very brief thing, that is basically how do you identify roads. So there are two three ways in which we can identify new roads. So one very basic or the most efficient way is to just ask government sources to give us all the information about all the roads data that they have. Now they might not have all the data but at least we'll have a good starting point. Now beyond that we can capture a lot of organic data. So let's say if a lot of users are going on to a place or a route kind of a thing which we don't identify as a road, then we can easily say that it there is a road that we don't know of and we can kind of onboard that Road onto our system. Now we can use some data like number of people traveling on it, which sides are they traveling to, what is the volume of traffic, what is the speed and correlate that with the existing roads that we have and try to even come up with how wide the road is!! Is it a single lane road or if it's a four lane/six lane kind of a road. All of those information could be figured out. But we will still not go into this in much depth over the video. Mainly what we'll focus on is How do we efficiently find the route between two points? Now coming to the non-functional requirements the very first thing is the system should be always Available. This should not go down. The next thing is it should have a good enough accuracy. Now let's say if you are not giving the very best route but if even if you are giving a good enough route that should be okay, unless it's not a terrible route, okay. And it should not be too slow so let's say if it takes 1 second/2 seconds/3 seconds to do all the calculations and come up with output that should be okay. It's not that it should respond back in a millisecond with the best route between two points. That's a bit too much to ask for, a couple of seconds one, two, three seconds should be okay. But it should not be too slow. Like it should not take 15 seconds to come back with the response, okay!! Coming to the scale, again I do not know if these numbers are authentic, but from what I've heard and read, Google Maps roughly has a billion MAU (Monthly Active Users) and these users generally access Google Maps a couple of times in a day or maybe a couple of times in a week at least. What that means is we'll probably have roughly five to ten billion requests coming in a month for time, distance and route between two points from users. Then there are also five million companies which use Google map so companies could be companies like Uber who are using Google Maps to come up with the Navigation and Pricing System, or it could be smaller companies which have a very similar use case. But that means that we need to keep these things in mind that we have a very large volume of users who are coming into system who will be using this platform and design it accordingly. now building this Navigation App is a very hard problem to solve. Not because of the kind of Algorithms that we'll use. But more because of the kind of requirements this has. So for example, if we try to calculate the number of roads in the world there are various theories, which you know come up with various numbers. In general people cannot claim that there are somewhere close to 50-100 million roads in the world. It could be you know even a much larger number. Now if we try to model it as a graph it would probably have somewhere close to 50 million vertexes and maybe you know hundreds of millions of edges if we divide the roads into multiple chunks and all of that, right. So that is a very very massive data set.Now a lot of companies don't even have access to this kind of a data, where in what road is connecting from where to where. That kind of information is kind of non-existent throughout the whole world and there is no single place where you can fetch that, which makes it a very hard problem to build an navigation system like this. The next thing is, it is very hard to even quantify a lot of attributes that will impact the ETA. For example traffic and weather condition and Road quality. Things like these are very hard to quantify so there is not a very easy mathematical formula that you can come up with to calculate the ETA that will be required to cover a particular distance. One more challenge this is, a lot of things into the whole road systems are very unpredictable. There could be and a random accident happening somewhere, there could be a road closure due to any kind of reason and those things are not really predictable and that's the main reason why you know a lot of companies have not been able to build this kind of system successfully. We will be using an ideology of Dynamic Programming while we try to solve this problem. So basically what we will do is, we will try it for a very small area, and then we will try to solve it for a larger area, and then even larger area and then across cities and countries and so on and so forth. So before we get started let me introduce you a concept called "Segment". Now this is not an industry standard term I just made up this name on my own, so just keep that in mind. A Segment is basically a small area that is small enough that we can easily operate on to it. Think of it like this we can have segments of one kilometer by one kilometer throughout the globe. So let's just say if you have a city, something like this, what we'll do is we'll divide the city into multiple segments. Now ideally there will be a lot more segments than what we are drawing but for understanding this should be good enough, okay. Now the idea behind these segments is that it will have four coordinates which are the four corners of a Segment, through which we can identify the segment boundaries and each segment would have an identifier. You can say this segment s1, this is s2, something of that sort. Now the thing with segments is whether a point lies within a particular segment? - is something we should be easily be able to calculate. So for example think of it like a coordinate system you have these x axis and y axis, this is your point (0,0), this is your point (0,1), this is your point (1,0) and let's say this is your point (1,1). This is your square of one each, right. Now whether a point lies within this or outside this we can easily identify based on what its coordinates are right and similarly for this based on its XY coordinates we should be able to identify whether this lies in this particular segment or not. So what we will try to do is each user will try to map it to some segments basis their coordinates, right! Now one good thing about these segments is because these are functions of Lat-Long, we should easily be able to identify some approximate distance between two segments. So let's just say if you want to identify this distance between these this segment and this segment, right. So basically let's say if we come up with the center points of both of these and these are two lat/longs we can easily find the distance between (aerial distance) between these two points. We use this in the later sections but it's a very easy mathematical calculation given two lat/longs, find the aerial distance between these, right? Now what we'll actually do is, let's say the whole globe for example, we will actually divide the whole globe into multiple segments some of the segments would just have water and nothing else but most of the segments would have some localities and we will be solving for one segment each and then we will extend the solution for multiple segments segments which will then cut across multiple cities and countries and whatnot, okay!! Now the way we model the road network is like a graph. So think of two points let's say this is point A and this is point B, and then the road between these two points, okay. So this becomes one vertex(A), this becomes another vertex(B) and the road between them can be represented as an edge. Now each road would have a weight. Now will not have one weight like you normally have in usual graphs, will have multiple weights. so let's say this road is of length 2 kilometers so one of the weights on this road would be 2 We could have a bit more weights along with that. We would have a traffic weight, we could have a time-based weight. So let's just say we come up with the average number thing in general it takes 150 seconds to cross this road, so 150 becomes another weight. okay! So this is how we represent a particular road and similarly there could be other weights that we can add. Now a road, even though it looks like a non-directional thing, but it is actually are directed relationship. Normally how would you represent it? - you'll say that this one Road from A to B which is 2 kilometers long, takes 150 seconds. There's also another road which is from B to A, which is again two kilometers long, but this one probably takes 200 seconds to cover, right! Why do we need a directed relationship here? - Firstly in order to handle if there is a different ETA or different traffic pattern or something of that sort. Plus more importantly, if this says that it takes infinite amount of time to reach here that's basically a way to say that it's a one-way Road. Only you can go from point A to point B but not from B to A, right! So we can use all of these ways to track various things like one ways or various different kinds of roads. Now even though if it's a directed graph but for all the like talking within this video I will draw a single line thinking of it as a non directed relationship, because it makes the drawing easier and explanation easier, but when you are actually going to implement it, you will implement it as a directed relationship. Now let's look at what happens when you want to find the distance between two points within a segment or even ETA or something of that sort? So let's say this is a graph of all the roads, within a particular segment and there are these ABCDEFGH... are the junctions or the points where you might want to start or end from, and everything else is basically in edge. And edges could have weights, assuming that all the edges have certain weight. Let's say this edge has a weight of, first attribute is the length which could mean 2 kilometers the next attribute might say that how much time it takes to cover that 2 kilometer. This would have a length of 5 kilometers and would take 700 seconds to cover that. Now let's say if you want to go from point A to point C, how can we go? - There could be multiple routes. So one of the routes could be from A to B to C. There could be another route from A to B to E to C there could be another route from A to I to H to G to B to C and there could be another, a lot more number of routes. How do you find the best route from going from point A to point C? This is a very standard problem in the Graph Data Structures, there are various algorithms to solve it. We can use anything like a Dijkstra or a Bellman Ford Algorithm to come up with the shortest path between point A to point C. Once you have calculated this and we might as well decide to cache it. Because within a segment, it's good to know how can you move from point A to point B. Normally what you would do is you would run something like a Floyd-Warshall Algorithm, which calculates the shortest path between all possible edges in all possible vertexes within a particular segment and then store that information so that you do not have to recalculate it. So what it will end up being is, there are these all normal roads which are edges and then there could be a calculated edge also which says that, for A to C, the shortest path takes ten kilometers, it takes two thousand seconds and it is via Point A then B then E then C, something of that sort could be a calculated edge as an output of Floyd-Warshall Algorithm, which can be stored in another Data Store which can then be looked up if you want to quickly look at how much time does it take from reaching a point A to point C. Now what we looked at earlier was given a junction A and given a junction C how do we calculate the shortest distance between those two junctions. What if the point with what we are looking at is not really a junction which is the meeting point of two roads but any random point in the globe which just comes on a particular road. Let's say that point is X, okay. And now we want to find the shortest distance between X and C. What that translates is basically find this distance (X->A), lets you call it . Find this distance (X->B) let's say we call it J. Basically find that two short... two or three closest junctions from that point. Normally if that is on a road it'll just be two junctions right, and find the shortest distance from those two junctions. So whatever is the distance from A to C and B to C, add I and J respectively into that and then whichever is the shortest one that becomes your answer. Now if you want to extend it further and say that I want to find out the distance between X and Y? - What you basically mean to solve is A to C, the distance between A to C, the distance between B to E, and then these two good respective distances add it. So this (C->Y) is let's say i`, this(E->Y) is j` So what you want to do is a comparission between [ i + i` + distance(A,C)] or this the distance of [ j + j` + distance (B,E) ] So this is how you can handle any point even if it is not a junction. Now whenever you're calculating remember we looked at that we can use a Floyd-Warshall Algorithm and then calculate all the possible distances within a segment, right. What we should also do at that time is look at what all possible exits are there some a Segment. Why that's important? we'll come to. Let's say there are these two roads which are exiting a particular segment. Let's say this is a Exit 1(E1). Let's say this is a Exit 2(E2). And the segment Id is let's say S1. So what we say is this particular Junction is S1E1 and this particular Junction is S1E2. What we'll also calculate is distance of each Junction not from just other junctions but also from these two points. So we will also know that what is the shortest route from reach from C to S1E1 and from C to S1E2. And we'll store this also as a cached information. Now let's come to how do we solve it across multiple segments. Now let's say if you have this point A. From now on we will just look at junctions because that makes the calculation easier okay. You want to calculate distance from point A to point B and all these rectangles that you see are unique Segments and these two points are in different Segments. Then how do you actually calculate the distance between the two points and the same logic would be used to calculate the ETA and anything between these two points okay. So what we are essentially saying is, we have this point A and we have this point B. Now given the lat/longs we would be able to identify an aerial distance between them. Let's just say that the aerial distance between them is 10 kilometers, okay. We also know that each segment is one kilometer by one kilometer in length and width right. So what we can say is we'll at max go... so because this is 10 kilometers on aerial distance, what we can say is.. adding some buffers we will not cross more than 20 Segments on each directions either going from North to South or going from East to West. That 20 number is basically a buffer that you can come up with based on your conversation with your interviewer. But the idea is this number(10 KMs) will give us some input onto how many segments do I want to look at. So 20 segments on the North-South side and 20 segments on East-West side is what we'll look at to come up with the optimal route from point A to point B. Because normally what would happen is there will be millions of roads into the world. And if you start doing a Dijkstra across the whole globe just to calculate distance between these two points, that will be very un-optimal. So we have to break that Dijkstra at some point in time. So this number would be used to break that Dijkstra saying that we will not go further that beyond this point. Now one obvious way to solve this is you have all the roads connecting the whole city which cuts these two points as well. Run a Dijkstra across the city and then come up with the shortest distance between point A to point B. But if you look at it and if you look at it that way that there are probably millions of users querying at the same time to find distance between any two points, Running those many number of Dijkstra's Algorithms would be heavily inefficient. So we'll do some slight optimization there. What we'll do is, there are... we'll basically say that there are two exits out of this block(Segment having A). Basically to reach this block(Segment having B) we need to exit this block and there are these two ways. Let's say if we have calculated that these two are the only exists within this segment. Let's say this a segment S1. So A belongs to S1 and these two are the exits, this is S1E1 and this is S1E2 and they have some great W 1 and W 2 let's say. Now what we'll do is, this segment... this junction is basically having two points. One is S1E1.. two identifiers basically, S1E1 and S2E1. this is segment ID S2 and this is exit number one of that segment so this has another identifier for S2. What we are saying then is this has another point on the same route which is S2E1 and weight between them zero. similarly there'll be a lot of points. This graph will fan out. On the other side for B, let's say B belongs to segment S4 there are another two exits through which you can reach B. One is S4E2 and one is this one which is S4E1 even. Now if we just connect all the exits to exits for ten blocks(Segments) on the North side, ten blocks(Segments) on the South side, ten blocks on this side and ten blocks on that side, we'll basically be able to come up with a crisscross of a lot of Road which connects these four points right and there'll be a lot of junctions within that, there'll be a lot of roads, but these will be basically just entry points and exit points of the segments. We don't really care what are... what is happening within the segment when we are cutting across segments. Why? - because to reach this segment we anyway need to cross all of the other segments right. Now we don't care if this route which we are calling as let's say this is segment S6 and this point is S6E1, this point is S6E2 Now this is basically a calculated path from these two exit points. This exit point to that exit point basically. This real path would look something like this right but we don't really care what's happening within the segment. All we care about is from this point(S6E1) to this point(S6E2), I know the distance, and I'll use that. Then from this point to this point I know the distance I'll use that. So we'll basically be using all the exit points distances and use them as weights, and come up with this graph at runtime. Now given like ten segments on each side, we'll probably have this graph of a couple of hundreds of edges at max right. Now we can run a Dijkstra on this graph staying with that what is the shortest path to go from here(A) to here(B). This Dijkstra runs on top of the graph which was created by just entry points and exit points in everything else, and entry point and start point and entry point and end point in the starting and ending segments. Everything else was just a connector from individual end point to individuals basically exits of a particular segment. Now this is fine when we have to travel for like ten odd kilometers we can still create a graph of these 100-200 edges and calculate. What if you want to go inter-city? What if you want to go from City 1 to City 2 and then you want to calculate it? Then it becomes a big trouble to... because let's say the distance between those two city is thousand kilometers then there are possibly thousands of segments multiple thousands of segments that you'll look at, right. That would too complicated to run a Dijkstra at runtime. Now like we extended the solution from one segment to multiple segments, we can basically create a Mega-Segment, right? We can say this whole thing that you see here is 1 Mega-Segment and there could be this exit to this Mega Segment this exit to this Mega Segment and probably an exit here to this Mega Segment, right? And then what we can say is, let's say you have this whole map of a country. I know it looks terrible but let's say this is one country, what you can say this is 1 mega segment you can basically divide this country into mega segments. And then recursively solve for it even further. Now when you were to if you are to solve in the mega segment world you have basically converted this whole country into possibly 50 mega segments and now if you want to go from here to here, you basically need to just connect the exit points of individual Mega Segments. You don't really care what is happening within a mega segment, you don't really care what is happening within segments under those mega segments and roads under those segments. All you care about it from my start point how do I get to the nearest exit or all the exits of that mega segment and from these exits of mega segments how do I get to further mega segments recursively and then somehow reach this end point, right. Now you can have n levels of nesting. Ideally three levels of nesting is more than enough. Now what is the ideal size of this mega segment? - Again there is no right answer you can come up with based on some calculation with your interview again on this. But the idea if you still want to calculate this thing across cities or across states, this segment approach would still not scale you need to do an abstraction on top of segments which can be represented as a Mega-Segment and then solve it recursively. All good so far but we have not really talked about how do we come up with the weights on an edge and how do...what all weights do we even consider? So the very first obvious weight is the amount of distance between two points right so that is one thing that will definitely have. Along with that we will also keep an ETA that normally under normal traffic conditions, in the normal weather conditions, how much time does it take for a user to go from point A to point B. Now when we have both these information we can obviously keep a next obvious thing saying what is the average speed of vehicles. Speed = (distance/time) so basis that will be able to come up with an average speed. Now you might think that the next obvious you things are weights like traffic and weather condition and all of that but I would definitely not recommend that. Even... let's assume this is what we have build already and now somebody comes up and asks you to implement traffic data into this okay now if you want to add traffic as a weight onto your edges that'll be a very very bad design. Why? - Because with an additional attribute that you are adding you are basically changing the logic of your underlying Dijkstra which runs on a segment, right. Now let's say tomorrow somebody comes up asking you for a weather information you will again change the whole logic of your Dijkstra. That piece is a bad design, okay. So ideally.. basically the set of things that a good design should entail is basically that it should not restrict addition of new features plus each change should be small enough and contained. It should not basically impact the whole feature. So what we'll say is traffic and all all of those in things are basically attributes which impact your speed okay and basis whatever your average speed is you can calculate the ETA as a function of distance and average speed, right? So all we'll say is traffic, weather, roadblocks all of those they'll never be weights on your graph, okay. They would just be attributes that impact your average speed. So average speed would be a function of traffic, it could be a function of weather it could be a function of any number of fields that you want. Let's say if you want to add a accident, or construction work going on someplace or whatever you want to add. So all of those would basically impact your average speed. Now normally you don't have to do all these fancy things, never ever! Why? - Because it's for a company like Google Maps they know... let's say there are hundreds of people going from point A to point B. They anyway know how much time they took right plus a very important feature about this kind of data is, this traffic data would always be normally distributed. So if we want to plot, Let's say this is basically any metrics let's say ETA of how much time it took for people to go from point A to point B on a graph it would always look like a normal distribution, okay. What that means is you do kwow that there is this point which most of the users are hitting and as a... with some statistical calculations of one standard deviation from each of these point you will be easily be able to calculate this range that normally for most of your users at least 50-60-70% of your users, this is the window of time that it will take. Let's say this is 5 seconds this is 7 seconds and let's say average is 6 seconds for example so statistically you can come up with this number as a function of lot of users who are crossing the roads. So you don't really care how much is the traffic, what is the weather, when you anyway have this real-time information So for all the geographies where you have people using your application, there you have organic data from real users at this point in time so you don't really care about traffic and all of that. Traffic and all other things will come into picture only in places where you can either not get enough data from users or there is some other reason why you're maybe legally not allowed to get information of users right because of which if you don't have real data users, then is when you will bother about using traffic and weather and all those attributes to come up with this average speed number basis that you can even know the distance from point A to point B, you can come up with a ETA, okay. Now how do we quantify traffic? you cannot say that if there are 805 vehicles, it will take this much amount of time, right. The number of vehicles on a road and things like that cannot be calculated at all. No matter how good technology you build right. So when these things cannot be quantified how do you use them as a measure? So what we can say is basically create multiple "Tiers". Let's say this is traffic measurement and this is Weather Management. All you can say is traffic is low or it is medium or it is high and whether it is good or it is bad, right. So you will have these kind of Tiers for each of the attributes that you have and all you can say is every time traffic goes from here(Low) to here(Medium), it has a 20% increase onto average speed or something of that sort. right every time traffic goes from low to medium, average speed reduces by 20 percent. Basis these transitions of traffic volume... now you can always say that traffic is low or high based on the number of users that are probably using your application or from some other sources, right. Let's say you have this kind of a measure, so there are companies like Waze and all, which will give you the kind of information that how's the weather, how is the traffic in a very abstracted form. Now once you have this information you can calculate average speed as a relative number from the original value calculated. so let's say with the good weather, with the medium traffic you knew the average speed was later 20 kilometers per hour Now if the traffic goes to high, you kind of reduce it by 20%, so it basically becomes 16 kilometers per hour. Basis is this kind of a calculation, you will calculate the ETA. Now as and when inputs are coming, you are making changes into your weights on the graph. How does it impact the overall system? So coming back to the previous drawing that we made remember we had something called as a segment and we had lots of roads within the segment and let's say the traffic increase we got a signal from some third party saying traffic has increased on a particular Road. Now do whatever you want to. Now let's say there was this one exit point and one exit point and distance between this was some Let's say there are some X weight to this particular theoretically calculated road. And let's say it was actually using this particular road on which the traffic is increased, let's say it actually was following this kind of a graph, something of that sort. Now similarly all other roads are getting changed. So all you can say is remember in this all the theoretically calculated roads that we have to be calculated we were caching that what all real roads are a part of this theoretically calculated road. Which is... it is not a real route, but a calculated edge. So each time this thing changes, weight changes on any real road, you basically try to infer, what all actual cached roads are impacted. And then you basically make a change into the cached roads also. So you say that okay the time taken initially was 10 seconds, for this path, now it has become 15. So let me increase the value of this particular calculated road from point E1 to E2 by 5 seconds Now let's say you do this for one segment, then what basically happens is you recursively bubble it up over a lot of cached things. So let's say if there was a mega segment who's... who had this kind of an entry-exit point that this is ME1 and this was in ME2 and you know that ME1, the route from every ME1 to ME2 uses this E1-E2. So what you will do is you will basically bubble up the amount of time for this route by 5 seconds. So what we are essentially saying is as then when you get traffic data you basically update the ETA on a particular road. Then update the ETA on all the calculated routes that this road is a part of. And update and then basically that becomes at a segment level. And then do the same thing recursively for all the segments that are a part of a mega segment. So basically what you are doing is you're bubbling of all the calculated values and then updating whole of the graph. Now these things are not happening at once not all the signals are coming up at one, right. So as in when something changes you bubble up till the level where it goes and then stop. So let's say this road was not a part of this ME1 ME2 for example, then it will stop at the segment level, right. But if it is a part of a mega segment road also, then it will bubble up in the mega segment level. Now, other than just traffic and weather, there's also one very important factor that you can use. So you can also use historical ETAs to come up with a real or a predicted ETA for a particular Road. And this historical ETA is basically calculated in a way that you basically look at the day of week and the hour of day so let's say Monday, 5 p.m. this road takes 20 seconds to cover, but on Sundays 5 p.m. is this road takes just 10 seconds to cover so something of that sort is what you can cache at a level of a day of week and hour of day. And then use that historical data to infer. So let's say if you do not have traffic information also, then you can use this kind of an information to somehow come up with the predicted ETA. Now coming back to the segment, due to some activities within the segment so let's just say there are a lot of roads on the segment. There is there are these two exit points and there are these points which have some roads now let's say we've calculated that the ETA of fastest way to go from point SE1 to SE2, involves 500 units of time. let's say 500 seconds, okay. And it uses the route SE1 to A to B to C to SE2, okay. Now let's just say that there's a traffic increase on A to B. Now we could identify it by a lot of attributes. Let's just say people moving between A to B their average time increases right. Or we get a signal saying you know the traffic is increasing significantly. For whatever reason let's say we increase the time of basically the ETA for going from point A to point B. So it would impact the ETA from SE11 to SE2. Now there is a limit to which it can bubble up right so let's say it increases from 500 to let's say 550 due to some increase in A to B so all that is okay. But let's just say that we have real data of users, and for some reason it is taking a lot of time for some people but we put to go from point A to point B and let's just say this increases to let's say 850. Now a jump from 500 units of time to 850 units of time is a very big jump. It is having a.. roughly a 70% right. So now it is possible that the fastest route from SE 1 to SE2 does not include this route ABC! It could be some other route. So what we will do is if ETA increases or decreases, generally in case of increase only we'll do this. if ETA increases by greater than some percentage. It's a configurable amount. Let's just say we have it set that if the ETA increases by 30% then we recalculate things within that segment. Because all the updates that you are getting are against a segment only. So whenever a particular road's time increases you can calculate everything within that segment, right. Which means recalculating paths from SE1 to SE2. Let's just say, that we figured out that now the new shortest path is not this path and it is basically the pass from SE1 to A to H to G to F to E and to SE2. So, this basically becomes your new shortest path, right. Now if it is the shortest path, then we need to bubble up that information, right. First of all we need to say that whoever was traveling from SE1 to SE2, now needs to follow a new path right which is a path from SE1-A- H - G - F - E - SE2, right. This is one part of the things. But what if there was a mega segment over here which encompasses a lot of segments, whose exit points were also including this particular path. This is let us say ME1 and let's say ME2. Now if the exit point path of this Mega Segment also includes this part then even that mega segments path would change so the part where they use ABC now that would get converted into this, right. So these are the things that you'll have to do each time a weight changes. so when the increases beyond the threshold then you have to do. So let's just say increase happen from 500 to 510 then all of this calculation is probably not even worth it. Because the best you can get is probably somewhere between 500 to 510. Now let's say 10 seconds would not be too much of an enhancement right. But if it is such a big jump from 500 to 850 for example then you might as well recalculate everything within the segment and then bubble up whatever is required. Now you'll not just calculate with respect to exit points, you'll also calculate the shortest path to reach from let's say SE1 to A, or SE1 to B. All of those will also be calculated but for representation this was a good enough way to explain you the concept. Now let's look at the architecture of the whole system. So we will do it in two parts: First we'll look at the flow wherein all users are connected to the system wherein we are capturing their location information from all the users even if they are not using our map service. This is basically for improving our map service altogether and a bit of user profiling. In the next section we'll look at how does the actual navigation flow works for the user who want to find the shortest route between two points and what are the systems required for that. And at the end we'll look at some of the analytics. With that let's get started. A bit of a convention to start with: Things in green are basically user devices could be mostly these will be mobile phones, could be web browsers as well. Things in black are load balancers plus reverse proxies plus an authentication authorization layer in between which will authenticate all the external requests coming to the system. Things in blue are the things that we have built on our own. It could be web services, Kafka consumers or anything of that sort. Things in red are either some infrastructure component that we've used or some databases or some clusters like Kafka, Hadoop and all of that, cool. So let's look at how the user flow begins. The very first interaction with the user is when they when they have the app installed on their phone, with their location services turn on, so we get regular pings from a user's device every few seconds. Now if the user is stationary, the device will have the logic to not send the location ping every five seconds and maybe send it every five minutes or ten minutes or something of that sort, because that will reduce the load on our system if the user is anyway at the same point + Better battery life. But while the user is in transit, we will increase the frequency, so as to make our data more accurate. This talks to.. this basically keeps the persistent connection with the device. The reason for a persistent connection is, so that we don't have to create a new connection everytime. Also this would be generally a Websocket connection, because sometimes we might need to send some information on to the user devices as well. So this WebSocket handler is basically the service that talks to all the user devices. Now there would normally be thousands of such WebSocket handlers for talking to all the users that are currently online. So we need somebody who can manage the fact that which handler machine is talking to which user. So then comes something called as a Web Socket Manager. It basically keeps a track of which machine(read user) is connected to which WebSocket handler, so which user device is basically connected to which machine of this WebSocket Handler Service. It basically keeps that information in a Redis. If let's say it lost that information is lost we can kind of pull that information again from the servers. So we don't need to kind of store that information in a persistent Data Store. The next thing what happens is, as in when users are sending their locations the frequency of the location pings is handled by the device and then the device location comes to us basically coming at the Location Service. Now Location Service is basically a repository of all the location related information . It basically has some endpoints through which you can ingest some users information, something like a user ID with the timestamp with a particular Lat/Long. That kind of a structure. It puts all of that information into a Cassandra as a permanent data source to keep a track of the fact that which user was at what location at what point in time. Remember this is not just for users who are using our app right now for navigation, but for all the users in the world. How this data would help us we'll come to in a while. Now, as and when it's getting location pings, it is also putting all of those pings into a Kafka. All the location pings that are going into this Kafka, are basically read by this Spark Streaming Consumer. This does a lot of calculations, some of them are here and some of them we look at in the later sections. First of all what it does is, if let's say a lot of users are traveling at a place where.. which we do not identify, okay. But their movement pattern tells us it looks like a road. So this basically is a job which would be a Streaming Job which would look at last ten minutes of data and see how the users are moving and basically it'll be used to add new roads into our system. Now as and when a new road is added each time a new road is added that would basically impact the particular Segment within which the roads are added and then basis that all the Segment related information that we calculated earlier that we talked about earlier that would be recalculated. So that might impact a lot of segments and Mega Segments recursively if that's a very critical new road that we've found. The next thing is basically an average speed job. What that does is, it basically looks at again if we look at within a segment a lot of people moving from point A to point B, it tries to figure out what is the average speed that these users have over let's say last fifteen twenty minutes and use that information to as a proxy for a real-time information and then suggest that and basically it will then update the weights of the graph that we talked about earlier. Again that would bubble up to various Segments and Mega Segments if that's a part of a critical road. Now how would that bubble of process happen? - So this Spark Streaming Job, individual jobs would actually write back things into Kafka. So let's say if a new road is identified, it would write that to a separate topic saying I've identified a new road which connects point X to point Y. Now there will be a map update service which listens to that particular topic and it updates it in a Data Store. It updates it in a graphs Cassandra, which is managed by this Graph Processing Service, so each time a new road is found, Map Update Service would invoke Graph Processing Service to tell that I have found a new road just store that in your Datastore which sits on top of a Cassandra. Now what this service is, we'll look at in the next section in more detail. Now, each time average speed job finds out there is some change that has happened it would again put an event into Kafka in a separate topic now and there'll be something called as a Traffic Update Service that listens to this topic which would update the traffic related information again into the same Cassandra through Graph Processing Service. There is something called as a Hotspot Identifier. So let's just say there's a particular locality on which there is some average number of people per given area kind of a metric that we have. Now let's have suddenly a lot of people start gathering in that area it gives us a signal that there is some activity happening at that point. It could be that there is some social event, some sports thing happening or but something is happening there so that could be able to identify hotspots within the graph. Because eventually once we identify that hotspot we'll know that after probably a while, there will be a lot of traffic in that area as well right. So that could help us in certain ways. Now for certain things Spark Streaming would itself insert a lot of information and put out events but in general it will dump all of that data into Hadoop cluster, which has location pings of all the users across a lot of time. Now we could do some ML Jobs onto it which classify these certain other things. So let's say if you've identified a new road but we do not know what kind of a road that is. If it's a one-way, if it's a two-way Road, if it's a single lane road or maybe a six-lane Road. So all of those things would be identified by this Road Classifier and this particular job could now put that event into that road topic and Map Update Service could now store that additional information saying now that I have not just identified that road but now I also know that this is a probably a six-lane Road or something of that sort There could be something called as a Vehicle Identifier. While you are getting location pings, if you start getting those pings at a very high frequency let's say every one second if you are getting that you can input a lot of information about the vehicle in which the person is traveling. So let's just say that the traffic is going normal at certain pace but one particular person is stopping every few kilometres let's say, okay. Now what that could mean that could mean that either there's a red light or there's a traffic jam there, possibly. But we know what are the junctions you know where the red lights can be we know how the traffic jam are because we know the data for the user as well. Now if it's not any of these two criteria then that means that the person is traveling in a public transport something like a bus and then the bus is stopping at each bus stop, all right. Other kind of things that we can infer is if let's say the ride is bumpy or if the person is going left and right here and there very frequently or having a very fast acceleration and very fast breaking it's probably a two wheeler. if not in all other scenarios if it's a smooth enough right stable enough acceleration going straight in one road it is very likely a four-wheeler. Now we do tell information to Google specifically that we are actually traveling in a two wheeler or four wheeler while we select a map but if let's say that option is not available then this kind of data would be good enough to infer which kind of vehicle a person is traveling in, which will help us in a lot of other attributes. So these individual jobs what all they identify they could put that into back into the same Kafka in probably a separate topic for each job and then those information could be analysed even further. Now let's look at the user journey when they want to actually get the navigation from point A to point B. So normally what people do is they search for a particular location which then gets converted into a particular lat/long right. So all of those is being done by Area Search Service. This basically does two kind of things. One is, it has some areas within Elastic Search on which it provides fuzzy searching and once the person has search for a particular area or locality or location anything. It then gets converted into a lat/long. So basically it stores the lat/long along with the place name, description all of that, in a document. The other thing is it tries to dynamically figure out: Given an address kind of a thing which lat/long that address points to. So it does that address resolution into a lat/long as well. Now let's say... so post the response from this user will have a start point and end point as a latitude longitude both of them and then they would want to request the path, okay. First let's quickly go over what is this Navigation Tracking Service. So while the person is actually started the navigation and they're going towards the destination all of their location coordinates will be tracked as part of Navigation Tracking Service. This does two main things: 1) If the user is deviating from the route that we suggested, we will kind of inform the app back and then that will probably show a pop-up kind of a thing, saying that you are probably going wrong, maybe you want to course-correct to the right path. 2) It will also store all the data so it will push all of the data into Kafka, while the person is traveling then that data could be used for later analysis saying, at this point in time a person searched for a route from point A to point B, and this was the route that we recommended. It could be used for figuring out how good the recommendation of the path is. That being said, let's jump to the main piece which actually figures out how to connect point A to point B, okay. So all of those requests are being handled by this Map Service. This Map Service over here is basically an interface service which provides multiple interfaces for people to request the directions from some point to some point. One of the interface is given to the user devices. A similar interface could be given out to companies where in all the rate-limiting and their throughput calculation against their account information is being calculated and stored, okay. It's not an intelligent service it just forwards the request to something called as a Graph Processing Service. Graph Processing Service basically does a lot of things. First of all it queries Segment Service which sits on top of its own Database of Cassandra, in which it stores all the segments and their corner coordinates along with a lot of information of those segments. So let's see if they if the Graph Processing Service wants to figure out that what are the segment's for the start point and the end point, it will get that information on the Segment Service. Now if it so happens that both the points are within the same segment then it's a short duration ride. Then the Graph Processing Service might choose to do a couple of things it might look up that, Do have this data already cached? if yes, it will directly return to the Map Service saying there is a response. If not, it'll then decide what do I do next? It can now if it is the same segment it can probably say that I will run a proper Dijkstra's algorithm and then try to figure out the shortest path between the two points. It can do that and then return the response. Alternatively, if it's in different segments then it will try to fan out into the whole basically it will try to create a pick the sub graph of the main whole world's graph that we want to look up as part of the Dijkstra and run it, okay. Now for doing that it will look up on a lot of things. First of all it will look up for all the roads within those segments, the entry-exit points of the segments, into its Cassandra. It will also have live traffic information which would have updated the ETAs into the graph again from the same Cassandra. Now it would have also got some input from something called as a Third Party Data Manager. Now it doesn't query Third Party Data Manager at runtime. Third Party Data Manager in fact, pushes data into Graph Service which then impacts the live traffic information so let's say if it says that there is a huge amount of traffic at certain point it'll update that. Let's say if it says that there is a massive kind of a rain happening somewhere, it will update the live traffic information thing there is some rain and that would probably cause some delays and the average speed would slow down and ETA would increase, something of that sort. Now using all of that information it basically creates the whole graph that we looked at in the previous sections and then it either looks up in the cache and returns the result or it might run the Dijkstra's Algorithm or some Algorithm to basically figure out the shortest path between two points and then return the result. let's say if it doesn't have any traffic information, any ETA information it might also query Historical Data Service to find out at this point in time how much ETA can expect from point A to point B. if it doesn't have that information for some of the points. Now Historical Data Service sits on top of its own Cassandra which basically stores the information of the form that at some day, day of the week like Sunday, Monday, Tuesday in some hour, let's say 4 PM, 5 PM, something of that sort, from point A to point B, how much was the average speed of people normally and how much ETA did it take. So all of that kind of information would be stored in Historical Data Service. Bases all of that, Graph Processing Service will return back the response to Map Service and then to the user going by the exact same set of logic that we looked at in the previous sections. Now all of this is done a lot of events have gone into Kafka. One of the common events was that by the Area Search Service which we didn't go over. So each time a search is happening the service will put an event into Kafka. Now what that would do is it will basically give us a list of areas that are being searched most often. So we can do some optimizations for them. Another thing would happen is, so not we can do a lot of analytic on top of this data okay that is coming into Kafka. One of the very first thing we need to do is to figure out that out of all the third party that we are integrated with which of them is actually giving right information and which of them is actually given the wrong information. How do we do that? So let's say some third party gave us some information saying from point A to point B, there is a lot of traffic, and ETAs would increase. But our data if it says that ETAs have not been impacted at all, that means that particular information was incorrect, right. Now if that happens from the same Third Party a lot of times, then we can safely say that, that third party has a wrong information or it gives us wrong information. right. Now this Third Party Data Manager, abstracts out on a lot of third party companies, so we could be pulling data from a lot of companies all of that is abstracted under this Third Party Data Manager but the reporting would be done at an individual organization level. That being said, there are a lot of other kinds of analytics that we might want to do. First of all we might want to identify what is the road that a particular person is on, right? so as part of navigation you want to probably have a audio kind of a thing saying turn right in ninety meters to get on Mahatma Gandhi Road,or something of that sort, right. All of that inferences could be done by the Spark Streaming saying where exactly I am in how much time am I about to reach a junction and what is the road name on which I will get to next. So all of that could be done by this. There are a lot of other analytics that we could also do on top of this data. Let us look at it next. So out of all the events that are being put into this Kafka, there is a lot of analytics that we can do and we should do. So one of the very important things to analyze is how accurate our ETA predictions are? So let's just say we predicted that it will take 40 minutes to cover a distance from point A to point B and if it takes something around 38-39 minutes then that's fairly accurate. But if in real world it takes somewhere around 20 minutes then that's a fairly bad prediction. So we need to have this data on how accurate our predictions are so that we can figure out that we need to invest more time into you know upgrading that algorithm that comes up with the ETAs. Along with that it will also give us some information about which of the routes are the ones that we are recommending to people, but people are not taking. Maybe there is something wrong with the routes that exist in our database, right. So all of those things could also be inferred. Some common things that we can infer is what are some common hotspots in the city that people gather at. These could be some social gathering places or things of that sort. So that will kind of give us some points of interest that we can build over time using this data to start with. Other than that some obvious things that we can also figure out is, what are the home locations and work locations of some of the people. We could also infer user profile data. So for example if a person is going to a lot of pubs for example then they are more of a social kind of a person. If a person is traveling outside the city very frequently on weekends, the person likes travelling. So things like that can also be inferred basis just a location data. Coming to the last thing, now this normally wouldn't be a part of a design interview but there's something you should be aware of because this is an important thing that companies like Google Maps have to take care of. So how do you handle disputed areas? So while in Google Maps there is a clear-cut boundary of states and countries and all of that but if there is a disputed region how does Google Maps decide where to make the boundary of the country or state right. So a classic example of this would be a dispute between the countries of India, Pakistan and China in the state of Jammu and Kashmir. So India claims that whole of the land is under their territory. China claims that some of the parts of that state belongs to China and Pakistan claims that some of the parts of that state belongs to Pakistan. So now how does Google decide or any such mapping provider decides, that how where do they define the country boundaries, right. Now this becomes very tricky for Google because they cannot take anybody's side and whichever decision they make they are going to kind of piss off some of the countries. Plus that's even worse in this scenario because these three countries combined together have more than one third of the world's population. So that's a very rich source of revenue for Google. So how can you handle? - so what Google does is something very smart and also very tricky. So what they do is, based on the country where you are coming from they'll show you a different country boundary. So for people who are coming from India, to all of those people they show that the whole part belongs to India. Now to the people of Pakistan they show that the part that they are claiming belongs to Pakistan and the parts that are disputed between India and China are shown as the dotted line. Similarly for people coming from China they show the part that China claims is theirs they show under China's territory and remaining disputed area between India and Pakistan is shown as a dotted line with no country boundary specified. So this is something if if you come across this kind of the thing in an interview could try to replicate but keep this out of the scope of the interview at least, but still something good to know. So yeah I think this is it for Google Maps kind of a system.
Info
Channel: codeKarle
Views: 46,733
Rating: undefined out of 5
Keywords: Google Maps, System Design, System Architecture, Code Karle, codekarle, Interview, grokking, Google Maps System Design, System Design Google Maps, system design tutorial, tutorial, FAANG, system design interview question, googlemapssystemdesign
Id: jk3yvVfNvds
Channel Id: undefined
Length: 61min 6sec (3666 seconds)
Published: Tue May 19 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.