So I am Gaurav, I am a Cloud Architect at Hotstar. I manage the infrastructure, security and monitoring. A quick show of hands, how many of you were actually watching the India vs New Zealand game on Hotstar? Wow, thank you. So this graph is the actual concurrency pattern, the traffic pattern, and the people who were watching are part of this graph. What we can see here is the entire cricket journey, so you will be able to relate to all of this. This is the toss. It was a 3 pm match, since it was happening in the UK, and the toss used to happen around 2:30. The first spike you see, the small one, is the toss, and usually around three, four million people come at that time. Then there is a half-an-hour break and the match used to start at three. The next spike over here is the start of the
game. That particular match, it was spread across two days because it was rain affected. That is why
you see day one and day two, look at how quickly the traffic is growing. And the other key thing to
notice here is that it is not like a one-off spike where you touch 10 million just once. If you look at the graph, most of the time it is above 10 million. So that shows you how resilient the platform is, how stable the infrastructure is. And the dips that you see here: if you are an IPL follower, these are the strategy timeouts. Since this is a One Day International, there is a drinks break, which often happened at the 16th over and the 32nd over. And this is the graph of New Zealand batting; they were able to go up to 13.9 million during their innings. Then what happened
is, unfortunately, it started raining. And then there is a sudden dip. This is the most
interesting part of that day, and this is an event we had never seen before. When the rain started, there was a dip from 13.9 to 5 million. There was nothing on the screen, right? It was raining, they were showing some highlights, there was no cricket being played, and still 5 million people were glued to the screen, waiting for the rain to stop and the match to begin. And this happened not for two or five minutes; this happened for three, four hours. The rain started around 6:30 in the evening, and the match was called off around 10. So for three, four hours, four or five million people were stuck on their screens waiting for the match to start. This is the first time we have seen this on Hotstar; people were that eager to watch that match and see India win. On day two, what happened is New Zealand still had like three, four overs left. So the
first peak that you see over here is when they came out to bat for the remaining overs; then they quickly went inside, India padded up and came out to start their innings. This is the most critical part for us. The sudden spike that you're seeing: the platform was at around maybe 1.5 to 2 million, and it went almost close to 15 million. And then there is a sudden drop. This is, you can say, harmful for backend services because of the spiky nature of the traffic. All your services need to be scaled up well in advance; you cannot rely on auto scaling because it's slow in nature. And this entire scale-up is almost 1 million per minute. That's the growth rate that we are looking
at. And then this is India batting, and there were regular falls of wickets, so the traffic didn't grow that much. But then Dhoni came on to bat and he was playing really well, and all of us were thinking, okay, now we'll win the match. That's when the marketing team and everyone started sending push notifications to bring in more users. And the spike that you see here is almost 1.1 million users being added to the platform per minute. So within a span of 10 minutes, we almost went from 13 million to 25 million. That is the scale that we are talking about. And unfortunately, at this point, Dhoni got out. So, two interesting things, not interesting, I will say, but that is the time when Hotstar made a global record. But at the same time your country is losing, so you cannot really go out and celebrate that you have made a world record. So we had to manage those mixed feelings as well. The drop from
25 to here is what kills most of the platform. If you're an Android or an iOS developer, you can
quickly relate to this. What happens is, when you are watching a video, you are just requesting a playback file; you're just watching the cricket match, you're not doing much. There will be a bunch of API calls, but not that many. But when you shift from the video to the homepage, now you're making homepage calls: your masthead, your continue-watching tray, then there is personalized content; based on what you have watched previously, it will recommend you new content. All of these are API calls. When the match is being played, those API calls are not made, because you are just requesting the video playback file and a few APIs related to heartbeat, concurrency, stuff like that. So suddenly, if all 25 million users exit the app by clicking the home button, then it is okay. But if they click the back button, all these 25 million users are going to come to the homepage, and at that moment your homepage needs to handle that load. And that's something that you have to prepare for in advance, because you don't have time; you don't know when the traffic will drop, or what the peak will be. So there is a certain hit and beating that your backend infrastructure and the application take. These slides need no
introduction, but I'll just go over the points. On the day the India New Zealand match was being played, in a single day we had 100 million unique users, which is a first of its kind in the world. And it has increased our concurrency record 2.5x. Before this, in last year's IPL, we were at 10.3; then this year's IPL we went to 18.6. So from the previous year to this year, our concurrency has grown about 2.5x. And let's talk about scale: that 25.3 million number. If you were watching the match on Hotstar that day, you must have seen this number, or rather, you are part of this record. This is the peak concurrency that we spoke about. Then, around 1 million requests per second. This is again an interesting number: 10 terabits of video bandwidth being consumed every second. If I have to relate it to India's bandwidth, this is almost 70 to 75% of the total internet bandwidth available in India, and we were consuming upwards of 10 terabits every second. These are all the clickstream messages, your social chat, the metrics that your app sends to the backend; those were around 10 billion messages. And usually we do around 100 hours of live transcoding every day. So this is the scale that we are talking about. People may ask why 25.3 million is a big
number. To give some perspective, before we hit 10.3, the global concurrency record was with YouTube. There was a space jump event by Red Bull which was live-streamed on YouTube. The peak concurrency at that time was 8 million, and that happened in 2012. From 2012 to 2018 there were many significant events: the Royal Wedding; the Super Bowl, which happens every year and is the biggest sporting event in the United States; then there was the Donald Trump inauguration, which peaked around 4.1 to 4.7 million. But in all these six, seven years, no one could break YouTube's record until IPL. At that time we were at 10.3. This year we went to 18.6, and in the World Cup, 25.3. So if you look at the closest competitor, we are 3x of them. And how do we prepare? It is not like a one-off event where auto scaling can save you and you can handle 25 million or more users; a lot of work goes on in the background to prepare the platform for such an event.
So we do a lot of game days. And what happens is there is a very large scale load testing that goes
into preparing the platform. These are network tests, these are application tests: how much each application can handle. And we have an in-house project called Project HULK. This is the amount of infrastructure that goes behind that load testing. These are all C5.9xlarge machines; if you are aware, each C5.9xlarge machine has 36 vCPUs and 72 GB of RAM. You multiply that by 3000 and you will get this number. So we use 3000 or more C5 machines just to generate the load, and this load generator then hits our API services and applications that we saw on the first graph, because you have to prepare the platform for those spikes. And this is the
network out that is generated due to those load-gen machines. The funny part was, whenever we used to do load testing, other customers used to get impacted, because in a public cloud environment you share the network with all the customers. So our CDN partners' regions, or the edge locations, used to get overwhelmed. To avoid those things, and other customers being impacted, what we did was move to geo-distributed load generation. So now, instead of a single region, we have our load-generation machines in eight different AWS regions, and all of them generate the load together so that not a single edge location or region is overwhelmed when we perform our load testing.
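To give a rough idea of what geo-distributed load generation can look like, here is a boto3 sketch; the region list, counts, AMI ID and tags are placeholders rather than the actual setup (and in practice the AMI ID would differ per region).

```python
# Hypothetical sketch: fan load-generator instances out across several AWS
# regions so no single edge location or region absorbs all the synthetic load.
# Region list, AMI ID and counts are illustrative placeholders.
import boto3

REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1", "ap-south-1"]  # subset of 8, assumed
PER_REGION_COUNT = 375                     # e.g. 3000 machines spread over 8 regions
LOADGEN_AMI = "ami-0123456789abcdef0"      # pre-baked AMI carrying the load scripts


def launch_load_generators():
    for region in REGIONS:
        ec2 = boto3.resource("ec2", region_name=region)
        instances = ec2.create_instances(
            ImageId=LOADGEN_AMI,
            InstanceType="c5.9xlarge",     # 36 vCPUs / 72 GB RAM, as in the talk
            MinCount=PER_REGION_COUNT,
            MaxCount=PER_REGION_COUNT,
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "role", "Value": "loadgen"}],
            }],
        )
        print(f"{region}: requested {len(instances)} load generators")


if __name__ == "__main__":
    launch_load_generators()
```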
The beauty of this project is that apart from load generation, it helps us do performance and tsunami testing. Tsunami is the graph that we saw in the first slide, the sudden surge and the dip; it can kill any application unless you are prepared for it. It also helps us do a lot of chaos engineering; I'll talk about that a bit more later on. And this project also helps us generate traffic patterns using ML. With all the information available to us, we know at what concurrency how much traffic each application was handling; based on that, we go back to the drawing board and figure out what the breaking point of each system is. This also helps us decide traffic patterns: what will happen if India bats first, what will happen if two favourite teams are up against each other. You get all those answers by analyzing this raw data. This is what
the load-generation infra looks like. Very simple, nothing fancy: C5.9xlarge machines distributed in eight regions, going over the internet to the CDN and to the load balancers, ALB or ELB. The other important thing is the ELB. Like the name says, elastic load balancer, but it's not really elastic in nature. Each load balancer has a limit to the peak it can handle, and for this load test the individual capacity of one load balancer is not enough. So we actually have to shard load balancers: for each single application we use four or five load balancers and then control them using weighted routing, so that the load is distributed and we are able to scale our applications, which are hosted either on EC2 or on Kubernetes.
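As a rough sketch of that weighted routing, assuming Route 53 weighted records (the exact mechanism is not named in the talk), with placeholder zone, record and ELB DNS names:

```python
# Hypothetical sketch: spread one application's traffic across several sharded
# load balancers using Route 53 weighted CNAME records. All IDs/names are placeholders.
import boto3

HOSTED_ZONE_ID = "Z0000000EXAMPLE"
RECORD_NAME = "api.example.internal."
ELB_SHARDS = {
    "shard-1": "app-shard-1-1234.ap-south-1.elb.amazonaws.com",
    "shard-2": "app-shard-2-1234.ap-south-1.elb.amazonaws.com",
    "shard-3": "app-shard-3-1234.ap-south-1.elb.amazonaws.com",
    "shard-4": "app-shard-4-1234.ap-south-1.elb.amazonaws.com",
}


def set_weighted_shards(weight=25):
    """Create/refresh one weighted record per LB shard; equal weights split traffic evenly."""
    route53 = boto3.client("route53")
    changes = [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": RECORD_NAME,
            "Type": "CNAME",
            "SetIdentifier": shard_id,   # distinguishes the weighted records
            "Weight": weight,
            "TTL": 60,
            "ResourceRecords": [{"Value": dns_name}],
        },
    } for shard_id, dns_name in ELB_SHARDS.items()]
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Comment": "shard app traffic across LBs", "Changes": changes},
    )


if __name__ == "__main__":
    set_weighted_shards()
```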
So what does scaling look like? We saw that the growth rate is 1 million per minute, so we have to scale up in advance. I cannot have my concurrency at 10 million and only then start scaling up, because by the time your EC2 instance is provisioned, it boots up, your application becomes healthy and registers itself under a load balancer, five to six minutes are wasted. And in a live match you cannot afford that, because in those five, six minutes your traffic can increase by five or six million, and the ladder that you are scaling for might not be sufficient any more. So you have to scale up proactively, in advance, and we keep some buffer while scaling up. The application boot time is around a minute, and 90 seconds is the reaction time that we have in hand to make a scaling decision, whether to scale up now or to
wait. If there is a strategic timeout which is scheduled, we know that traffic is going to drop, so we have some breathing room there. Also, push notifications are something the marketing team sends out, especially if there are interesting moments happening in the match, like Dhoni hitting sixes or someone taking a hat-trick. These push notifications go out to a user base of 150 to 200 million users. Even if you talk about a two to four percent conversion, we are talking about four to six million users getting added to the platform in a very short span of time. And these push notifications can go out at any time, because there can be an interesting moment at any time during the game. So we have to account for this in the buffer that we keep, so that if a push notification goes out, we can handle that spike as well.
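To put rough numbers on that buffer (the ramp window below is an assumption, not a figure from the talk):

```python
# Illustrative back-of-the-envelope for the push-notification spike.
# Reach and conversion figures come from the talk; the ramp window is assumed.
def notification_spike(reach, conversion, ramp_minutes):
    added_users = reach * conversion
    return added_users, added_users / ramp_minutes


for reach, conv in [(150e6, 0.02), (200e6, 0.04)]:
    added, per_min = notification_spike(reach, conv, ramp_minutes=2)  # assumed ~2-minute ramp
    print(f"reach {reach/1e6:.0f}M, {conv:.0%} conversion -> +{added/1e6:.0f}M users, ~{per_min/1e6:.1f}M/min")
# 2% of 150M ~= 3M extra users; 4% of 200M ~= 8M -- bracketing the 4-6M figure quoted above.
```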
And we use fully baked AMIs. There was an interesting BoF session going on earlier during the day on infrastructure as code where we were discussing this as well. So we use fully baked AMIs; we don't use any configuration tools, because tools like Chef, Puppet, anything which provisions or does any configuration after the server is up, add delay to your application becoming healthy. So to save time we use fully baked AMIs, and even the container images carry whatever is required to run the system within themselves, so that they don't have to wait for any Ansible script, or Chef or Puppet, to configure the application or make it healthy. These are a few of the reasons why we don't use auto scaling, the traditional one
that is available with AWS. We get a lot of insufficient-capacity errors; anyone who has tried to launch a server has faced an error that says capacity is not available in a particular AZ. So when you want to go from 10 million to 15 million, you have to request a lot of servers from AWS. Let's say you're operating at 400 and you now want 600 servers, so you're adding 200 servers. But what if you only get 50 and you don't get the remaining 150? Those kinds of problems are there. Then, single instance type per Auto Scaling group; this was before EC2 Fleet. A single ASG can only support one launch configuration, and one launch configuration can only have one instance type. So if c4.4xlarge is not available and gives an error, I cannot scale my application. This is a limitation: even if there is a different instance type that has more capability, I could scale up using that, but since an Auto Scaling group only supports one single instance type, I'm blocked by that. Next is step
size during auto scaling. This is again an interesting problem that only happens at scale. When you request or increase your target or desired capacity in an ASG, it adds servers in a step size of 10, 20, that way. So let's say you scale up application A from 100 to 800; what it will do is try to add 10 or 20 servers across the availability zones. And this process is very slow, because if you want to go from 100 to 800, and this happens every 10 seconds or 30 seconds, then to provision those servers it's going to take around 10 to 15 minutes, which is simply not acceptable when you're running a live game.
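As an illustrative reconstruction of that math, using the approximate step sizes and cycle intervals quoted above:

```python
# Illustrative estimate of how long ASG step-based scaling takes.
# Step size and cycle interval approximate the numbers quoted in the talk.
import math


def scale_up_minutes(current, target, step, cycle_seconds):
    cycles = math.ceil((target - current) / step)
    return cycles * cycle_seconds / 60


print(scale_up_minutes(100, 800, step=10, cycle_seconds=10))   # ~11.7 minutes
print(scale_up_minutes(100, 800, step=20, cycle_seconds=30))   # ~17.5 minutes
# Both land around the 10-15+ minute range quoted above -- far too slow when
# traffic is growing at roughly 1 million users per minute.
```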
And there is a lot of API throttling when activities like this happen. You can ask AWS to increase your service limits and allow you to scale, or increase the step size to maybe 100 or 200, but now you're doing more damage to the system. If 200 servers launch in one go, you are making 200 control-plane API calls. And these are multiple calls: the run-instances EC2 API call, then calls to attach those EC2 instances to the load balancer, then some monitoring calls that go to CloudWatch, because now your system is healthy and it starts recording CPU and network metrics. Then there are disk-attachment API calls. All of this happens in the background, but it all uses your control-plane and data-plane APIs, which is transparent to the user, but at this scale all those are fixed limits which cannot grow. So you have to operate within those constraints. And the Game of
Availability Zones. I hope everyone is a fan of Game of Thrones; this is a reference to that. This again is an interesting problem, so I'll give you an example. You have three availability zones: 1a, 1b and 1c. If 1c has less capacity and you try to increase the target capacity of your Auto Scaling group, AWS, with the internal algorithm it has, will try launching servers in all three AZs equally, so it will launch 10, 10, 10, which will be successful. Let's say 1c only had 10 servers left, which we got now. In the second attempt, the second cycle, it will launch 10 in 1a and 1b, but in 1c it won't be able to launch 10 because that particular AZ is now out of capacity. Provisioning a server in a particular AZ is not something in our control, because that is taken care of by AWS's internal algorithm. In this case, it will still try to launch servers every time in 1c. What
happens is, when you face an error, it adds an exponential backoff. First it will try every 10 seconds; if it fails, the duration increases to 30 seconds, then one minute, then five minutes, ten minutes, and it goes on increasing. How this harms our scaling is that, one, your infrastructure becomes skewed: you have more capacity in 1a and 1b, and your 1c only has 10 servers. If something happens to 1b, that AZ goes down or there is some fault, now all your traffic is being served through AZs which don't have enough capacity to handle all the load, because ideally it should be divided across three AZs, but now most of the load has come onto 1a. The second big problem is, even if 1c doesn't have capacity, the Auto Scaling group is still trying to launch servers there, which increases my scaling time. So if I have to go from 100 to 800, instead of getting the servers provisioned in under five minutes, we have seen times where this increases to 25 minutes. In a live match where your traffic is rapidly increasing, you cannot wait on servers being provisioned, because at the end of the day I cannot come and tell you, as a Hotstar customer, that there is an EC2 capacity issue and that is why I'm not able to show you the match. So this is what we do: pre-warm the infrastructure before the match, and keep buffers. Proactive, automated scale-up: we don't use the automatic auto
scaling that AWS provides. Instead, we have developed our own auto-scaling tool. What it does is, instead of scaling on default metrics like CPU or network, it scales on request rate and concurrency. So we get the concurrency, the total active users on the platform, and based on that we have ladders defined: at 3 million, each application will have this many servers; at 10 million, each application will have this many servers. That data is already fed into the system. Whichever metric is high: if the concurrency is high, it will scale that way; if the request count per application is high, it will scale using the request count as a metric. Because CPU, unless it impacts your customer or increases your latency, shouldn't be a metric to scale upon at this scale. Your CPU can be high, it can be 60%, 70%, but if it is not increasing your latency, if it is not creating any problem for your users, CPU should not be a metric that you scale upon. Instead, we have benchmarked how much each server or each container can serve, the rated RPM for each container, and based on that we take the decision. So if at 2 million platform concurrency I have application A which is doing 50k TPS, and that is enough, I'll stay at that ladder. Let's say it is doing 75k TPS; then I'll scale up beforehand, because now my request rate is more than my application can handle. So that is how we take the decision: it is based either on the request rate or on your platform concurrency.
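A minimal sketch of that ladder-plus-request-rate decision; the ladder steps, rated capacity and headroom factor are invented for illustration, not the real tool's numbers:

```python
# Hypothetical sketch of ladder/request-rate based scaling decisions.
# Ladder steps and rated capacity are illustrative, not Hotstar's real numbers.
import bisect
import math

# concurrency ladder: (platform concurrency, desired servers for this app)
LADDER = [(1_000_000, 40), (3_000_000, 120), (10_000_000, 400), (25_000_000, 1000)]
RATED_RPS_PER_SERVER = 500   # benchmarked capacity of one server/container


def desired_from_concurrency(concurrency):
    """Pick the ladder step at or above the current platform concurrency."""
    thresholds = [c for c, _ in LADDER]
    idx = min(bisect.bisect_left(thresholds, concurrency), len(LADDER) - 1)
    return LADDER[idx][1]


def desired_from_request_rate(app_rps, headroom=0.7):
    """Enough servers to keep each one below `headroom` of its rated throughput."""
    return math.ceil(app_rps / (RATED_RPS_PER_SERVER * headroom))


def desired_capacity(concurrency, app_rps):
    # Scale on whichever signal demands more; CPU is deliberately not a signal.
    return max(desired_from_concurrency(concurrency), desired_from_request_rate(app_rps))


# e.g. at 2M concurrency but 75k RPS for this app, the request-rate signal wins:
print(desired_capacity(2_000_000, 75_000))
```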
Next is the secondary Auto Scaling group; this is for the problem of the single instance type. Since you cannot have multiple instance types in one ASG, we spin up a secondary Auto Scaling group. So if the primary has a C4 machine, there will be a secondary ASG having an M4 machine or some other instance type. ASG also gives you a scaling notification in case it is unwilling or unable to increase the capacity. We take that notification, push it to SNS, which triggers a Lambda function, which automatically scales the secondary Auto Scaling group for that application. This way, even if a single instance type is not available, we are still able to scale up using the secondary instance type for that application.
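A hedged sketch of that SNS-to-Lambda glue; the notification type is the standard ASG launch-error event, while the ASG name mapping and bump size are placeholders:

```python
# Hypothetical Lambda handler: when the primary ASG reports a launch error via
# SNS, add capacity to a secondary ASG that uses a different instance type.
# ASG name mapping and the bump size are illustrative placeholders.
import json
import boto3

SECONDARY_ASG = {"app-a-primary-asg": "app-a-secondary-asg"}  # primary -> secondary
autoscaling = boto3.client("autoscaling")


def handler(event, context):
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message.get("Event") != "autoscaling:EC2_INSTANCE_LAUNCH_ERROR":
        return  # only react to failed launches on the primary ASG

    primary = message["AutoScalingGroupName"]
    secondary = SECONDARY_ASG.get(primary)
    if not secondary:
        return

    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[secondary]
    )["AutoScalingGroups"][0]
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=secondary,
        DesiredCapacity=group["DesiredCapacity"] + 10,  # assumed bump size
        HonorCooldown=False,
    )
```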
Last is, we use Spot Fleet. The advantage of Spot Fleet is two things: one is cost saving, and the second is that a single Spot Fleet can allow 15 different instance types in a single configuration. This is again before EC2 Fleet came into the picture, which now allows you a mixture of on-demand and spot; but this is what we have been following since then. Because with Spot Fleet you are able to configure more than one instance type, you can mix and match compute- and memory-intensive families: for example c4.4xlarge, c4.8xlarge, c5.9xlarge and c5.18xlarge. This can be your base configuration, and then you spread it across
three availability zones. So now you have more options, you are diversifying your infrastructure, and the chances of getting a capacity error are lower: even if one particular instance type is not available in one AZ, there are other instance types and AZs to fill in the capacity that you require.
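A trimmed sketch of such a Spot Fleet request; the AMI, subnets, fleet role and target capacity are placeholders, and a real configuration could list up to 15 instance types:

```python
# Hypothetical Spot Fleet request mixing instance types across subnets/AZs so a
# single capacity shortage cannot block the scale-up. All IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")

SUBNETS = ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"]   # one per AZ
TYPES = ["c4.4xlarge", "c4.8xlarge", "c5.9xlarge", "c5.18xlarge"]


def instance_spec(instance_type, subnet_id):
    return {
        "ImageId": "ami-0123456789abcdef0",
        "InstanceType": instance_type,
        "SubnetId": subnet_id,
    }


response = ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
        "TargetCapacity": 200,
        "AllocationStrategy": "diversified",   # spread across types and AZs
        "LaunchSpecifications": [instance_spec(t, s) for t in TYPES for s in SUBNETS],
    }
)
print(response["SpotFleetRequestId"])
```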
These are the ingredients for chaos. People are aware of what chaos engineering is? Okay. If I have to explain it in one line, chaos engineering is finding out the breaking points in your system, or the art of breaking things, so that you know, okay, a failure is about to happen, and how you can overcome it without impacting users. That's keeping it simple; there are more technical definitions of it. So these are the ingredients for chaos. Push
notifications: we already spoke about this. This is not in our control; there can be an interesting moment in the match at any time, and the marketing team might decide to send a notification to 200 million people. Our infrastructure and backend services need to cope with that spike. Increase in latency is another problem. Even if one application in your entire user journey is impacted, it has a cascading effect on other services. Let's say my content platform APIs have increased their latency by 50 ms. There are other services that consume these APIs to show content: on your homepage you have the personalization engine, you have the recommendation engine, which shows what content you have watched and what content you should watch. These APIs depend on the content platform API, and if that API has increased latency, these in turn will work slowly, which in turn will load the homepage slowly or may increase the app startup time. So a single increase in latency anywhere in the system can have a cascading effect on the entire application. Network failures are another scary thing. So
streaming at this scale, you depend a lot on your CDN providers. If an edge location or POP location goes down, or is overwhelmed, they have to reboot or shift traffic. Now all the traffic that was being served from a particular edge location closest to your home or ISP, all those requests will come to the origin. Think about it: you're operating at 10 million, and the edge location closest to your home is down. Now your requests will directly come to the midgress layer or the origin endpoint. If it comes to origin, even if this is only 5% of 10 million, we are talking about half a million users. If your application is not scaled up to handle that origin traffic, maybe your DB goes haywire or your application goes crazy, because the backend is not provisioned to handle that sudden spike. All applications have an appetite for linear growth, wherein you gradually increase the traffic; but if you have a flood of requests, let's say half a million requests coming to your origin, it can actually
bring down your application. Delayed scale-up is another problem, and that's the reason why we don't use auto scaling. When a live match is happening and I need to scale up, I need those servers. If they are not available, or my scaling decision or scaling group takes time to scale up the infrastructure, users may be affected or they might get a bad experience. Tsunami traffic is the first graph that we saw: the sudden surge and the sudden dip, both are equally bad. Your application you can still scale, but think about your backend stores, ElastiCache, RDS; these are not scalable on the fly. You cannot just go and increase your RDS; you can do it, but there can be downtime associated with it, and it's not something you can do in between a live match. Bandwidth constraints are another problem. With more users coming in, you consume a lot of video bandwidth, and with the stats that we saw on the first slide, it's more than 10 terabits per second and almost 70% of India's capacity. So there is very limited room to operate in terms of adding more users. Let's say the concurrency goes from 25 to 30, 40, 50 million, any number: is there enough internet bandwidth available to serve customers? Because we can push out the video bits, but you may be living in an area where there are latency concerns, the last-mile latency especially, or your ISP is choked or throttled because maybe a thousand users are watching Hotstar on the same ISP in your local area. So those are some of the things that constrain us
from a bandwidth point of view. And what are we looking for from all these chaos engineering exercises and game days that we do? We try to discover hidden patterns and issues. The main goal of chaos engineering is not to just bring down a system, but in fact to find out: what happens if one of your availability zones goes down? Or if 30% of your compute capacity is taken away? Or if there is some network-level issue between your EC2 and your DB, will your application still perform? Those kinds of tests we perform in a controlled environment, to figure out the worst-case scenarios that can happen if anything goes wrong in the system. We find bottlenecks and choke points. This is again related to things that cannot be scaled in time, especially data stores and backend systems, which need to be provisioned to peak capacity because they are not that elastic in nature. The breaking point of each system: I spoke about rated capacity earlier. We have developed a tool in-house which understands how much load each application can take, so what the benchmark number or rated capacity is for each microservice or application. What it helps us identify is at what RPM or TPS application A can go down, and then we take action to avoid reaching near that number, or we see how we can offload some of the work, maybe by introducing a caching layer in between, or maybe some API action that is not required to be done in the moment can be done later on, like a backfilling process, something
like that. Traffic waves, again: that sudden spike, adding 1 million, 2 million users every minute, has its own upper ceiling, right? It cannot be an infinitely scalable system. So for this one, we were prepared for 50 million concurrency, though the traffic only went to 25 million because, unfortunately, Dhoni got out. Otherwise, even if the concurrency had gone to 35, 40 million, we would have been able to take that beating. But let's say for this World Cup it had gone above 50, which is a bit unreal in nature; at that point we might have started seeing issues, or that is territory for which we were not prepared. And failures can happen at any level: network, servers, application. So by doing all these chaos engineering exercises, we try to find the hidden patterns in each of these areas, and then we try to fix them. The other good thing about the rated capacity is, once you know your user journey, once you know that, okay, when you open the Hotstar app, these are the X number of API calls it will make, you can scale your backend for a sufficient number of users. If you don't know your user journey, you won't be able to scale or make your infrastructure resilient. So knowing your user journey is very important. Once you know it, you can take decisions like: okay, if application A can only handle 10 million load, maybe I'll turn off that application before it reaches 10 million to avoid an outage, or stuff like that. Or maybe I'll offload the processing to a CDN or introduce an ElastiCache,
something like that. And if those things are not successful, the last resort is panic mode. What happens here is, there is a very basic principle that your key services should always be up and degradation should be graceful. The customer or the users should not know that, okay, our RDS is down. We should not show an error message saying we cannot connect to a DB; handle that gracefully, work out a solution wherein you can maybe still allow the user through. I'll give more examples on that. But panic mode essentially is when you turn off your non-critical services, thereby making room for the critical ones. When 25 million people are watching cricket, that's almost 99.99% of the total traffic active on the platform at that time. So for the remaining 0.01%, it does not really matter if they get personalized content or the other non-essential things, which can be turned off, thereby making room for the critical API services which deliver your video, your ads, your concurrency numbers, your key health-check metrics. P0 services must always be up. For us, P0 is video, advertisement, subscription, payment systems; all of this is very essential. Non-essential services like recommendation and personalization can be turned off for maybe half an hour during a key or interesting moment of the match. So you reduce that traffic, and you make that bandwidth available for the P0 services. And graceful
degradation is for applications like the example I gave: an application can only handle 10 million traffic, and it doesn't make sense to scale it up more by just throwing hardware capacity at it unless it adds business value. So at that time you can cut off that system, maybe return a 200 OK response, so that your client doesn't throw an error message or show the user a bad experience. The other thing is, clients are also smart enough to know an application is in panic mode through a custom error code being returned. What happens in this case is, let's say your payment application's DB has some issues, due to which you are not able to complete the financial transaction, which is not the user's fault at all, right? So what we do is put the payment system into panic; now the client, if it retries, will know that, okay, there is an issue at the backend and not a customer-related thing. It will allow you to bypass the payment and just watch some ads. Same thing with the login system: if your login system, anything like a DynamoDB in the backend or an ELB, has an issue, or is hitting an error or a bug, we can put the entire login system into panic. What this does is allow the user to watch a video without even asking them for a login or a valid subscription, because we know that, okay, there is an issue at our end. So we allow users to bypass those services for a particular time till that issue is fixed. Once it
is fixed, we disable the panic mode so that the normal system flow can continue. And yeah, at every ladder, like after 10 million, there are decisions taken to understand, one, whether this service or application is necessary for the business, and two, whether it is near its rated capacity. The tool that we have created shows us in real time at what level of the rated capacity we are operating. So if we are around 80 or 90% of it, we manually put that service in panic and degrade it gracefully, so that it doesn't have any cascading effect on other systems that rely on it, because it will now return 200 OK; the other applications will think that the system is fine, and the clients will not show an error message or a pop-up, which would impact your user experience.
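A minimal sketch of what such a panic-mode switch could look like on the backend; Flask, the flag variable and the JSON shape are illustrative assumptions, not the actual contract:

```python
# Hypothetical panic-mode toggle for one backend service (e.g. payments):
# while panicked it returns 200 with an explicit marker instead of a 5xx,
# so clients bypass the feature gracefully instead of showing an error.
from flask import Flask, jsonify

app = Flask(__name__)
PANIC_MODE = False   # in reality this would live in a central flag/config store


def set_panic(enabled):
    """Operator/automation toggle, e.g. when the service nears its rated capacity."""
    global PANIC_MODE
    PANIC_MODE = enabled


@app.before_request
def short_circuit_if_panicked():
    if PANIC_MODE:
        # Custom "panic" payload with 200 OK: the client knows the problem is on
        # the backend and lets the user through (e.g. skip payment, show an ad).
        return jsonify({"panic": True}), 200


@app.route("/charge", methods=["POST"])
def charge():
    # Normal payment path; only reached when panic mode is off.
    return jsonify({"status": "charged"}), 200
```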
So the key takeaways are: prepare for failure. I cannot emphasize enough how important this is, because at this scale, especially if you are operating in a public cloud, failures are bound to happen, and they are usually not in your control. There will always be a factor you cannot do anything about, but it's your job to overcome those and design your application or system in such a way that it can handle those failures; that is why chaos engineering is so important. And understand your user journey: unless you understand what API calls happen in your system when a user opens the app, you won't be able to make your infrastructure stable or resilient. You have to understand each API. When you tap on the homepage, what all actions happen in the background, which database it touches, which API call goes from where to where. You need to know, when you click a play button, what all APIs are called. If you know that journey, you can then script it, and you can also create your load-testing patterns. And it is okay to degrade gracefully; we should avoid showing errors to users, which impacts the user experience and adds a bad name to your brand. So whenever possible, degrade gracefully without the user knowing that your system or infrastructure has issues. Cool,
that's all I have for today. I had to watch the match the next day; I wasn't really watching the match as it happened. Sometimes we don't even get a chance to look at what is happening; we look at the match from a different angle. People enjoy the match, and we just look at the scorecard and who is playing, so that we can take a decision: okay, Dhoni is going to come in to bat, so there can be an increase of two or three million, so we have to think from that lens, that okay, be ready to scale up because a favourite batsman is coming in. The same thing happens in IPL as well. The top three teams, at least in terms of fan base, are CSK, MI and RCB; whenever these teams are playing, our traffic is crazy. And there are so many Dhoni and Kohli fans that whenever they get out, our traffic drops by two, three million. So we look at it from that angle. Sometimes people ask me whether I saw that fantastic catch, or saw him hitting a fifty; in that tense moment, we don't even remember that such events are happening. Yes, almost two years. Because there is always learning; on the first attempt you cannot always succeed in scaling systems. And for many of the problems, we didn't have any references, like okay, someone has done this somewhere in the world and we can copy that idea. Some of the problems we encountered were very unique in nature, due to the fact that no one has crossed this scale; even with CDN partners, no one is streaming at this scale. So it adds its own unique challenges. And there are failures as well, but you learn from them and you try to do better next time. So I will say it has taken almost two years' worth of effort to reach this point. And even though IPL is still far away, our preparations have already started.
already started. I have a small doubt here. So here, you talked
about the game day, right? So how were you able to mock such a huge load onto your systems during the
game day that you're doing? So like I said, it uses more than 3000 you finance
slot machine, these machines are scripted to generate equivalent load that 50 million
concurrency users will add to the platform and logs and the access pattern gives us values like
if I say at 5 million concurrency not all application will handle 5 million payments will
have lower video API's much more must be having higher if everyone is watching match, my
personalization will not have that much hits. So at every ladder, each application added own. You
can say RPM RPM advocate operates. If you find that ratio, you can corrupt your load testing to
mimic the entire graph that we saw in the first slide. Did you actually automate the chaos testing that
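A small sketch of that ratio-based calibration; the per-application request rates below are invented for illustration, while the real values would come from production logs:

```python
# Illustrative calibration: derive per-application target RPS for a load test
# from the request rates observed in production at a known concurrency.
# The observed numbers below are made up, not the real traffic mix.
OBSERVED_CONCURRENCY = 5_000_000
OBSERVED_RPS = {            # measured at 5M concurrent users
    "video-playback": 900_000,
    "heartbeat": 400_000,
    "payments": 8_000,
    "personalization": 30_000,
}


def target_rps(target_concurrency):
    """Scale each application's observed rate by the concurrency ratio."""
    scale = target_concurrency / OBSERVED_CONCURRENCY
    return {app: rps * scale for app, rps in OBSERVED_RPS.items()}


# Calibrate the load generators for a 50M-concurrency game day:
for app, rps in target_rps(50_000_000).items():
    print(f"{app}: drive ~{rps:,.0f} requests/second")
```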
Audience: Did you actually automate the chaos testing that you have done?

Gaurav: Sorry?

Audience: The chaos testing that you have, the chaos engineering.

Gaurav: Yeah, it's only scripted.

Audience: So can you tell me, are there any frameworks for this automation?

Gaurav: Mostly it's Python based. So it depends. Chaos engineering in our case is, you can say... there are tools available in the market which do this for you, open-source tools, but what we try to achieve from chaos is two things. One: errors that have happened in the past, if they happen again, what will be the impact? That is why we simulate network failures, we simulate the DB not being available, just to see how it will impact us at 10 million versus 25 million. And sometimes we just randomly go and do stuff that is not even thought of: maybe, if a system is talking to another system through VPC peering, you delete that network connectivity, or you go and change the route table settings, just to see whether the application can handle it, either the load or the performance impact, if that network connectivity is
not available. Sorry, it's all homegrown Python, not using any third party. These are literally Python Boto scripts, user scripts: whatever you want to perform, say you want to change the route table, that is easily doable through Python Boto.

Audience: Does the QA team do it on its own?

Gaurav: No, no, not yet. But it's very simple; removing a route table entry through Python Boto is just three or four lines of code.
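For illustration, a chaos script of that kind, with a hypothetical route-table ID and CIDR, really is only a few lines of boto3:

```python
# Hypothetical chaos script: drop the route that carries traffic to a peered
# VPC, then watch how the dependent service behaves. IDs/CIDR are placeholders.
import boto3

ec2 = boto3.client("ec2")
ec2.delete_route(
    RouteTableId="rtb-0123456789abcdef0",
    DestinationCidrBlock="10.20.0.0/16",   # route towards the peered VPC
)
```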
Audience: About the panic mode, you talked about bypassing other services like payments or something. So is it just a feature toggle that you have implemented?

Gaurav: Kind of. It is feature toggling at the client level, and at the backend level there are custom error codes which return an OK response instead of a 4xx or 5xx.