So I am Gaurav, I am a Cloud Architect at Hotstar. I manage the infrastructure, security and monitoring. A quick show of hands, how many of you were actually watching the India vs New Zealand game on Hotstar? Wow, thank you. So this graph is the actual concurrency pattern, the traffic pattern, and the people who were watching are part of this graph. What we can see here is the entire cricket journey, so you will be able to relate to all of this. This is the toss. It was a 3 pm match, since it was happening in the UK, and the toss used to happen around 2:30. The first spike you see, the small one, is the toss, and usually around three, four million people come at that time. Then there is a half-an-hour break and the match used to start at three. The next spike over here is the start of the
game. That particular match, it was spread across two days because it was rain affected. That is why
you see day one and day two, look at how quickly the traffic is growing. And the other key thing to
notice here is that it is not like a one-off spike where you touch 10 million just once. If you look at the graph, most of the time it is above 10 million. So that shows you how resilient the platform is, how stable the infrastructure is. And the dips that you see here: if you are an IPL follower, these are the strategy timeouts. Since this is a One Day International, there is a drinks break, which often happened at the 16th over and the 32nd over. And this is the graph of New Zealand batting; they were able to go up to 13.9 million during their innings. Then what happened
is, unfortunately, it started raining. And then there is a sudden dip. This is the most
interesting part of that day, and this is an event we had never seen before. When the rain started, there was a dip from 13.9 to 5 million. There was nothing on the screen, right? It was raining, they were showing some highlights, there was no cricket being played, and still 5 million people were glued to the screen, waiting for the rain to stop and the match to begin. And this happened not for two or five minutes; this happened for three, four hours. The rain started around 6:30 in the evening, and the match was called off around 10. So for three, four hours, four or five million people were stuck on their screens waiting for the match to start. This is the first time we have seen this on Hotstar; people were that eager to watch that match and see India win. On day two, what happened is New Zealand still had like three, four overs left. So the
first peak that you see over here is when they came out to bat for the remaining overs; then they quickly went inside, India padded up and came out to start their innings. This is the most critical part for us. The sudden spike that you're seeing: the platform was at around maybe 1.5 to 2 million, and it went almost close to 15 million. And then there is a sudden drop. This is, you can say, harmful for backend services because of the spiky nature of the traffic. All your services need to be scaled up well in advance; you cannot rely on auto scaling because it's slow in nature. And this entire scale-up is almost 1 million per minute. That's the growth rate that we are looking
at. And then this is India batting, and there were regular falls of wickets, so the traffic didn't grow that much. But then Dhoni came on to bat and he was playing really well, and all of us were thinking, okay, now we'll win the match. That's when the marketing team and everyone started sending push notifications to bring in more users. And the spike that you see here is almost 1.1 million users being added to the platform per minute. So within a span of 10 minutes, we almost went from 13 million to 25 million. That is the scale that we are talking about. And unfortunately, at this point, Dhoni got out. So, two interesting things, not interesting, I will say, but that is the time when Hotstar made a global record. But at the same time your country is losing, so you cannot really go out and celebrate that you have made a world record. So we had to manage those mixed feelings as well. The drop from
25 to here is what kills most of the platform. If you're an Android or an iOS developer, you can
quickly relate to this. What happens is, when you are watching a video, you are just requesting a playback file; you're just watching the cricket match, you're not doing much. There will be a bunch of API calls, but not that many. But when you shift from the video to the homepage, now you're making homepage calls: your masthead, your continue-watching tray, then there is personalized content; based on what you have watched previously, it will recommend you new content. All of these are API calls. When the match is being played, those API calls are not made, because you are just requesting the video playback file and a few APIs related to heartbeat, concurrency, stuff like that. So suddenly, if all 25 million users exit the app by clicking the home button, then it is okay. But if they click the back button, all these 25 million users are going to come to the homepage, and at that moment your homepage needs to handle that load. And that's something that you have to prepare for in advance, because you don't have time; you don't know when the traffic will drop, or what the peak will be. So there is a certain hit and beating that your backend infrastructure and the application take. These slides need no
introduction, but I'll just go over the points. On the day the India New Zealand match was being played, in a single day we had 100 million unique users, which is a first of its kind in the world. And it has increased our concurrency record 2.5x. Before this, in last year's IPL, we were at 10.3; then this year's IPL we went to 18.6. So from the previous year to this year, our concurrency has grown about 2.5x. And let's talk about scale: that 25.3 million number. If you were watching the match on Hotstar that day, you must have seen this number, or rather, you are part of this record. This is the peak concurrency that we spoke about. Then, around 1 million requests per second. This is again an interesting number: 10 terabits of video bandwidth being consumed every second. If I have to relate it to India's bandwidth, this is almost 70 to 75% of the total internet bandwidth available in India, and we were consuming upwards of 10 terabits every second. These are all the clickstream messages, your social chat, the metrics that your app sends to the backend; those were around 10 billion messages. And usually we do around 100 hours of live transcoding every day. So this is the scale that we are talking about. People may ask why 25.3 million is a big
number. To give some perspective, before we hit 10.3, the global concurrency record was with YouTube. There was a space jump event by Red Bull which was live-streamed on YouTube. The peak concurrency at that time was 8 million, and that happened in 2012. From 2012 to 2018 there were many significant events: the Royal Wedding; the Super Bowl, which happens every year and is the biggest sporting event in the United States; then there was the Donald Trump inauguration, which peaked around 4.1 to 4.7 million. But in all these six, seven years, no one could break YouTube's record until IPL. At that time we were at 10.3. This year we went to 18.6, and in the World Cup, 25.3. So if you look at the closest competitor, we are 3x of them. And how do we prepare? It is not like a one-off event where auto scaling can save you and you can handle 25 million or more users; a lot of work goes on in the background to prepare the platform for such an event.
So we do a lot of game days. And what happens is there is a very large scale load testing that goes
into preparing the platform. These are network tests, these are application tests: how much each application can handle. And we have an in-house project called Project HULK. This is the amount of infrastructure that goes behind that load testing. These are all C5.9xlarge machines; if you are aware, each C5.9xlarge machine has 36 vCPUs and 72 GB of RAM. You multiply that by 3000 and you will get this number. So we use 3000 or more C5 machines just to generate the load, and this load generator then hits our API services and applications that we saw on the first graph, because you have to prepare the platform for those spikes. And this is the
network out that is generated due to those load-gen machines. The funny part was, whenever we used to do load testing, other customers used to get impacted, because in a public cloud environment you share the network with all the customers. So our CDN partners' regions, or the edge locations, used to get overwhelmed. To avoid those things, and other customers being impacted, what we did was move to geo-distributed load generation. So now, instead of a single region, we have our load-generation machines in eight different AWS regions, and all of them generate the load together so that not a single edge location or region is overwhelmed when we perform our load testing.
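To give a rough idea of what geo-distributed load generation can look like, here is a boto3 sketch; the region list, counts, AMI ID and tags are placeholders rather than the actual setup (and in practice the AMI ID would differ per region).

```python
# Hypothetical sketch: fan load-generator instances out across several AWS
# regions so no single edge location or region absorbs all the synthetic load.
# Region list, AMI ID and counts are illustrative placeholders.
import boto3

REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1", "ap-south-1"]  # subset of 8, assumed
PER_REGION_COUNT = 375                     # e.g. 3000 machines spread over 8 regions
LOADGEN_AMI = "ami-0123456789abcdef0"      # pre-baked AMI carrying the load scripts


def launch_load_generators():
    for region in REGIONS:
        ec2 = boto3.resource("ec2", region_name=region)
        instances = ec2.create_instances(
            ImageId=LOADGEN_AMI,
            InstanceType="c5.9xlarge",     # 36 vCPUs / 72 GB RAM, as in the talk
            MinCount=PER_REGION_COUNT,
            MaxCount=PER_REGION_COUNT,
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "role", "Value": "loadgen"}],
            }],
        )
        print(f"{region}: requested {len(instances)} load generators")


if __name__ == "__main__":
    launch_load_generators()
```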
The beauty of this project is that apart from load generation, it helps us do performance and tsunami testing. Tsunami is the graph that we saw in the first slide, the sudden surge and the dip; it can kill any application unless you are prepared for it. It also helps us do a lot of chaos engineering; I'll talk about that a bit more later on. And this project also helps us generate traffic patterns using ML. With all the information available to us, we know at what concurrency how much traffic each application was handling; based on that, we go back to the drawing board and figure out what the breaking point of each system is. This also helps us decide traffic patterns: what will happen if India bats first, what will happen if two favourite teams are up against each other. You get all those answers by analyzing this raw data. This is what
the load-generation infra looks like. Very simple, nothing fancy: C5.9xlarge machines distributed in eight regions, going over the internet to the CDN and to the load balancers, ALB or ELB. The other important thing is the ELB. Like the name says, elastic load balancer, but it's not really elastic in nature. Each load balancer has a limit to the peak it can handle, and for this load test the individual capacity of one load balancer is not enough. So we actually have to shard load balancers: for each single application we use four or five load balancers and then control them using weighted routing, so that the load is distributed and we are able to scale our applications, which are hosted either on EC2 or on Kubernetes.
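As a rough sketch of that weighted routing, assuming Route 53 weighted records (the exact mechanism is not named in the talk), with placeholder zone, record and ELB DNS names:

```python
# Hypothetical sketch: spread one application's traffic across several sharded
# load balancers using Route 53 weighted CNAME records. All IDs/names are placeholders.
import boto3

HOSTED_ZONE_ID = "Z0000000EXAMPLE"
RECORD_NAME = "api.example.internal."
ELB_SHARDS = {
    "shard-1": "app-shard-1-1234.ap-south-1.elb.amazonaws.com",
    "shard-2": "app-shard-2-1234.ap-south-1.elb.amazonaws.com",
    "shard-3": "app-shard-3-1234.ap-south-1.elb.amazonaws.com",
    "shard-4": "app-shard-4-1234.ap-south-1.elb.amazonaws.com",
}


def set_weighted_shards(weight=25):
    """Create/refresh one weighted record per LB shard; equal weights split traffic evenly."""
    route53 = boto3.client("route53")
    changes = [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": RECORD_NAME,
            "Type": "CNAME",
            "SetIdentifier": shard_id,   # distinguishes the weighted records
            "Weight": weight,
            "TTL": 60,
            "ResourceRecords": [{"Value": dns_name}],
        },
    } for shard_id, dns_name in ELB_SHARDS.items()]
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Comment": "shard app traffic across LBs", "Changes": changes},
    )


if __name__ == "__main__":
    set_weighted_shards()
```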
So what does scaling look like? We saw that the growth rate is 1 million per minute, so we have to scale up in advance. I cannot have my concurrency at 10 million and only then start scaling up, because by the time your EC2 instance is provisioned, it boots up, your application becomes healthy and registers itself under a load balancer, five to six minutes are wasted. And in a live match you cannot afford that, because in those five, six minutes your traffic can increase by five or six million, and the ladder that you are scaling for might not be sufficient any more. So you have to scale up proactively, in advance, and we keep some buffer while scaling up. The application boot time is around a minute, and 90 seconds is the reaction time that we have in hand to make a scaling decision, whether to scale up now or to
wait. If there is a strategic timeout which is scheduled, we know that traffic is going to drop, so we have some breathing room there. Also, push notifications are something the marketing team sends out, especially if there are interesting moments happening in the match, like Dhoni hitting sixes or someone taking a hat-trick. These push notifications go out to a user base of 150 to 200 million users. Even if you talk about a two to four percent conversion, we are talking about four to six million users getting added to the platform in a very short span of time. And these push notifications can go out at any time, because there can be an interesting moment at any time during the game. So we have to account for this in the buffer that we keep, so that if a push notification goes out, we can handle that spike as well.
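To put rough numbers on that buffer (the ramp window below is an assumption, not a figure from the talk):

```python
# Illustrative back-of-the-envelope for the push-notification spike.
# Reach and conversion figures come from the talk; the ramp window is assumed.
def notification_spike(reach, conversion, ramp_minutes):
    added_users = reach * conversion
    return added_users, added_users / ramp_minutes


for reach, conv in [(150e6, 0.02), (200e6, 0.04)]:
    added, per_min = notification_spike(reach, conv, ramp_minutes=2)  # assumed ~2-minute ramp
    print(f"reach {reach/1e6:.0f}M, {conv:.0%} conversion -> +{added/1e6:.0f}M users, ~{per_min/1e6:.1f}M/min")
# 2% of 150M ~= 3M extra users; 4% of 200M ~= 8M -- bracketing the 4-6M figure quoted above.
```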
And we use fully baked AMIs. There was an interesting BoF session going on earlier during the day on infrastructure as code where we were discussing this as well. So we use fully baked AMIs; we don't use any configuration tools, because tools like Chef, Puppet, anything which provisions or does any configuration after the server is up, add delay to your application becoming healthy. So to save time we use fully baked AMIs, and even the container images carry whatever is required to run the system within themselves, so that they don't have to wait for any Ansible script, or Chef or Puppet, to configure the application or make it healthy. These are a few of the reasons why we don't use auto scaling, the traditional one
that is available with AWS. We get a lot of insufficient-capacity errors; anyone who has tried to launch a server has faced an error that says capacity is not available in a particular AZ. So when you want to go from 10 million to 15 million, you have to request a lot of servers from AWS. Let's say you're operating at 400 and you now want 600 servers, so you're adding 200 servers. But what if you only get 50 and you don't get the remaining 150? Those kinds of problems are there. Then, single instance type per Auto Scaling group; this was before EC2 Fleet. A single ASG can only support one launch configuration, and one launch configuration can only have one instance type. So if c4.4xlarge is not available and gives an error, I cannot scale my application. This is a limitation: even if there is a different instance type that has more capability, I could scale up using that, but since an Auto Scaling group only supports one single instance type, I'm blocked by that. Next is step
size during auto scaling. This is again an interesting problem that only happens at scale. When you request or increase your target or desired capacity in an ASG, it adds servers in a step size of 10, 20, that way. So let's say you scale up application A from 100 to 800; what it will do is try to add 10 or 20 servers across the availability zones. And this process is very slow, because if you want to go from 100 to 800, and this happens every 10 seconds or 30 seconds, then to provision those servers it's going to take around 10 to 15 minutes, which is simply not acceptable when you're running a live game.
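As an illustrative reconstruction of that math, using the approximate step sizes and cycle intervals quoted above:

```python
# Illustrative estimate of how long ASG step-based scaling takes.
# Step size and cycle interval approximate the numbers quoted in the talk.
import math


def scale_up_minutes(current, target, step, cycle_seconds):
    cycles = math.ceil((target - current) / step)
    return cycles * cycle_seconds / 60


print(scale_up_minutes(100, 800, step=10, cycle_seconds=10))   # ~11.7 minutes
print(scale_up_minutes(100, 800, step=20, cycle_seconds=30))   # ~17.5 minutes
# Both land around the 10-15+ minute range quoted above -- far too slow when
# traffic is growing at roughly 1 million users per minute.
```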
And there is a lot of API throttling when activities like this happen. You can ask AWS to increase your service limits and allow you to scale, or increase the step size to maybe 100 or 200, but now you're doing more damage to the system. If 200 servers launch in one go, you are making 200 control-plane API calls. And these are multiple calls: the run-instances EC2 API call, then calls to attach those EC2 instances to the load balancer, then some monitoring calls that go to CloudWatch, because now your system is healthy and it starts recording CPU and network metrics. Then there are disk-attachment API calls. All of this happens in the background, but it all uses your control-plane and data-plane APIs, which is transparent to the user, but at this scale all those are fixed limits which cannot grow. So you have to operate within those constraints. And the Game of
Availability Zones. I hope everyone is a fan of Game of Thrones; this is a reference to that. This again is an interesting problem, so I'll give you an example. You have three availability zones: 1a, 1b and 1c. If 1c has less capacity and you try to increase the target capacity of your Auto Scaling group, AWS, with the internal algorithm it has, will try launching servers in all three AZs equally, so it will launch 10, 10, 10, which will be successful. Let's say 1c only had 10 servers left, which we got now. In the second attempt, the second cycle, it will launch 10 in 1a and 1b, but in 1c it won't be able to launch 10 because that particular AZ is now out of capacity. Provisioning a server in a particular AZ is not something in our control, because that is taken care of by AWS's internal algorithm. In this case, it will still try to launch servers every time in 1c. What
happens is, when you face an error, it adds an exponential backoff. First it will try every 10 seconds; if it fails, the duration increases to 30 seconds, then one minute, then five minutes, ten minutes, and it goes on increasing. How this harms our scaling is that, one, your infrastructure becomes skewed: you have more capacity in 1a and 1b, and your 1c only has 10 servers. If something happens to 1b, that AZ goes down or there is some fault, now all your traffic is being served through AZs which don't have enough capacity to handle all the load, because ideally it should be divided across three AZs, but now most of the load has come onto 1a. The second big problem is, even if 1c doesn't have capacity, the Auto Scaling group is still trying to launch servers there, which increases my scaling time. So if I have to go from 100 to 800, instead of getting the servers provisioned in under five minutes, we have seen times where this increases to 25 minutes. In a live match where your traffic is rapidly increasing, you cannot wait on servers being provisioned, because at the end of the day I cannot come and tell you, as a Hotstar customer, that there is an EC2 capacity issue and that is why I'm not able to show you the match. So this is what we do: pre-warm the infrastructure before the match, and keep buffers. Proactive, automated scale-up: we don't use the automatic auto
scaling that AWS provides. Instead, we have developed our own auto-scaling tool. What it does is, instead of scaling on default metrics like CPU or network, it scales on request rate and concurrency. So we get the concurrency, the total active users on the platform, and based on that we have ladders defined: at 3 million, each application will have this many servers; at 10 million, each application will have this many servers. That data is already fed into the system. Whichever metric is high: if the concurrency is high, it will scale that way; if the request count per application is high, it will scale using the request count as a metric. Because CPU, unless it impacts your customer or increases your latency, shouldn't be a metric to scale upon at this scale. Your CPU can be high, it can be 60%, 70%, but if it is not increasing your latency, if it is not creating any problem for your users, CPU should not be a metric that you scale upon. Instead, we have benchmarked how much each server or each container can serve, the rated RPM for each container, and based on that we take the decision. So if at 2 million platform concurrency I have application A which is doing 50k TPS, and that is enough, I'll stay at that ladder. Let's say it is doing 75k TPS; then I'll scale up beforehand, because now my request rate is more than my application can handle. So that is how we take the decision: it is based either on the request rate or on your platform concurrency.
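A minimal sketch of that ladder-plus-request-rate decision; the ladder steps, rated capacity and headroom factor are invented for illustration, not the real tool's numbers:

```python
# Hypothetical sketch of ladder/request-rate based scaling decisions.
# Ladder steps and rated capacity are illustrative, not Hotstar's real numbers.
import bisect
import math

# concurrency ladder: (platform concurrency, desired servers for this app)
LADDER = [(1_000_000, 40), (3_000_000, 120), (10_000_000, 400), (25_000_000, 1000)]
RATED_RPS_PER_SERVER = 500   # benchmarked capacity of one server/container


def desired_from_concurrency(concurrency):
    """Pick the ladder step at or above the current platform concurrency."""
    thresholds = [c for c, _ in LADDER]
    idx = min(bisect.bisect_left(thresholds, concurrency), len(LADDER) - 1)
    return LADDER[idx][1]


def desired_from_request_rate(app_rps, headroom=0.7):
    """Enough servers to keep each one below `headroom` of its rated throughput."""
    return math.ceil(app_rps / (RATED_RPS_PER_SERVER * headroom))


def desired_capacity(concurrency, app_rps):
    # Scale on whichever signal demands more; CPU is deliberately not a signal.
    return max(desired_from_concurrency(concurrency), desired_from_request_rate(app_rps))


# e.g. at 2M concurrency but 75k RPS for this app, the request-rate signal wins:
print(desired_capacity(2_000_000, 75_000))
```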
Next is the secondary Auto Scaling group; this is for the problem of the single instance type. Since you cannot have multiple instance types in one ASG, we spin up a secondary Auto Scaling group. So if the primary has a C4 machine, there will be a secondary ASG having an M4 machine or some other instance type. ASG also gives you a scaling notification in case it is unwilling or unable to increase the capacity. We take that notification, push it to SNS, which triggers a Lambda function, which automatically scales the secondary Auto Scaling group for that application. This way, even if a single instance type is not available, we are still able to scale up using the secondary instance type for that application.
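A hedged sketch of that SNS-to-Lambda glue; the notification type is the standard ASG launch-error event, while the ASG name mapping and bump size are placeholders:

```python
# Hypothetical Lambda handler: when the primary ASG reports a launch error via
# SNS, add capacity to a secondary ASG that uses a different instance type.
# ASG name mapping and the bump size are illustrative placeholders.
import json
import boto3

SECONDARY_ASG = {"app-a-primary-asg": "app-a-secondary-asg"}  # primary -> secondary
autoscaling = boto3.client("autoscaling")


def handler(event, context):
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message.get("Event") != "autoscaling:EC2_INSTANCE_LAUNCH_ERROR":
        return  # only react to failed launches on the primary ASG

    primary = message["AutoScalingGroupName"]
    secondary = SECONDARY_ASG.get(primary)
    if not secondary:
        return

    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[secondary]
    )["AutoScalingGroups"][0]
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=secondary,
        DesiredCapacity=group["DesiredCapacity"] + 10,  # assumed bump size
        HonorCooldown=False,
    )
```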
Last is, we use Spot Fleet. The advantage of Spot Fleet is two things: one is cost saving, and the second is that a single Spot Fleet can allow 15 different instance types in a single configuration. This is again before EC2 Fleet came into the picture, which now allows you a mixture of on-demand and spot; but this is what we have been following since then. Because with Spot Fleet you are able to configure more than one instance type, you can mix and match compute- and memory-intensive families: for example c4.4xlarge, c4.8xlarge, c5.9xlarge and c5.18xlarge. This can be your base configuration, and then you spread it across
three availability zones. So now you have more options, you are diversifying your infrastructure, and the chances of getting a capacity error are lower: even if one particular instance type is not available in one AZ, there are other instance types and AZs to fill in the capacity that you require.
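A trimmed sketch of such a Spot Fleet request; the AMI, subnets, fleet role and target capacity are placeholders, and a real configuration could list up to 15 instance types:

```python
# Hypothetical Spot Fleet request mixing instance types across subnets/AZs so a
# single capacity shortage cannot block the scale-up. All IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")

SUBNETS = ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"]   # one per AZ
TYPES = ["c4.4xlarge", "c4.8xlarge", "c5.9xlarge", "c5.18xlarge"]


def instance_spec(instance_type, subnet_id):
    return {
        "ImageId": "ami-0123456789abcdef0",
        "InstanceType": instance_type,
        "SubnetId": subnet_id,
    }


response = ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
        "TargetCapacity": 200,
        "AllocationStrategy": "diversified",   # spread across types and AZs
        "LaunchSpecifications": [instance_spec(t, s) for t in TYPES for s in SUBNETS],
    }
)
print(response["SpotFleetRequestId"])
```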
These are the ingredients for chaos. People are aware of what chaos engineering is? Okay. If I have to explain it in one line, chaos engineering is finding out the breaking points in your system, or the art of breaking things, so that you know, okay, a failure is about to happen, and how you can overcome it without impacting users. That's keeping it simple; there are more technical definitions of it. So these are the ingredients for chaos. Push
notifications: we already spoke about this. This is not in our control; there can be an interesting moment in the match at any time, and the marketing team might decide to send a notification to 200 million people. Our infrastructure and backend services need to cope with that spike. Increase in latency is another problem. Even if one application in your entire user journey is impacted, it has a cascading effect on other services. Let's say my content platform APIs have increased their latency by 50 ms. There are other services that consume these APIs to show content: on your homepage you have the personalization engine, you have the recommendation engine, which shows what content you have watched and what content you should watch. These APIs depend on the content platform API, and if that API has increased latency, these in turn will work slowly, which in turn will load the homepage slowly or may increase the app startup time. So a single increase in latency anywhere in the system can have a cascading effect on the entire application. Network failures are another scary thing. So
streaming at this scale, you depend a lot on your CDN providers. If an edge location or POP location goes down, or is overwhelmed, they have to reboot or shift traffic. Now all the traffic that was being served from a particular edge location closest to your home or ISP, all those requests will come to the origin. Think about it: you're operating at 10 million, and the edge location closest to your home is down. Now your requests will directly come to the midgress layer or the origin endpoint. If it comes to origin, even if this is only 5% of 10 million, we are talking about half a million users. If your application is not scaled up to handle that origin traffic, maybe your DB goes haywire or your application goes crazy, because the backend is not provisioned to handle that sudden spike. All applications have an appetite for linear growth, wherein you gradually increase the traffic; but if you have a flood of requests, let's say half a million requests coming to your origin, it can actually
bring down your application. Delayed scale-up is another problem, and that's the reason why we don't use auto scaling. When a live match is happening and I need to scale up, I need those servers. If they are not available, or my scaling decision or scaling group takes time to scale up the infrastructure, users may be affected or they might get a bad experience. Tsunami traffic is the first graph that we saw: the sudden surge and the sudden dip, both are equally bad. Your application you can still scale, but think about your backend stores, ElastiCache, RDS; these are not scalable on the fly. You cannot just go and increase your RDS; you can do it, but there can be downtime associated with it, and it's not something you can do in between a live match. Bandwidth constraints are another problem. With more users coming in, you consume a lot of video bandwidth, and with the stats that we saw on the first slide, it's more than 10 terabits per second and almost 70% of India's capacity. So there is very limited room to operate in terms of adding more users. Let's say the concurrency goes from 25 to 30, 40, 50 million, any number: is there enough internet bandwidth available to serve customers? Because we can push out the video bits, but you may be living in an area where there are latency concerns, the last-mile latency especially, or your ISP is choked or throttled because maybe a thousand users are watching Hotstar on the same ISP in your local area. So those are some of the things that constrain us
from a bandwidth point of view. And what are we looking for from all these chaos engineering exercises and game days that we do? We try to discover hidden patterns and issues. The main goal of chaos engineering is not to just bring down a system, but in fact to find out: what happens if one of your availability zones goes down? Or if 30% of your compute capacity is taken away? Or if there is some network-level issue between your EC2 and your DB, will your application still perform? Those kinds of tests we perform in a controlled environment, to figure out the worst-case scenarios that can happen if anything goes wrong in the system. We find bottlenecks and choke points. This is again related to things that cannot be scaled in time, especially data stores and backend systems, which need to be provisioned to peak capacity because they are not that elastic in nature. The breaking point of each system: I spoke about rated capacity earlier. We have developed a tool in-house which understands how much load each application can take, so what the benchmark number or rated capacity is for each microservice or application. What it helps us identify is at what RPM or TPS application A can go down, and then we take action to avoid reaching near that number, or we see how we can offload some of the work, maybe by introducing a caching layer in between, or maybe some API action that is not required to be done in the moment can be done later on, like a backfilling process, something
like that. Traffic waves, again: that sudden spike, adding 1 million, 2 million users every minute, has its own upper ceiling, right? It cannot be an infinitely scalable system. So for this one, we were prepared for 50 million concurrency, though the traffic only went to 25 million because, unfortunately, Dhoni got out. Otherwise, even if the concurrency had gone to 35, 40 million, we would have been able to take that beating. But let's say for this World Cup it had gone above 50, which is a bit unreal in nature; at that point we might have started seeing issues, or that is territory for which we were not prepared. And failures can happen at any level: network, servers, application. So by doing all these chaos engineering exercises, we try to find the hidden patterns in each of these areas, and then we try to fix them. The other good thing about the rated capacity is, once you know your user journey, once you know that, okay, when you open the Hotstar app, these are the X number of API calls it will make, you can scale your backend for a sufficient number of users. If you don't know your user journey, you won't be able to scale or make your infrastructure resilient. So knowing your user journey is very important. Once you know it, you can take decisions like: okay, if application A can only handle 10 million load, maybe I'll turn off that application before it reaches 10 million to avoid an outage, or stuff like that. Or maybe I'll offload the processing to a CDN or introduce an ElastiCache,
something like that. And if those things are not successful, the last resort is panic mode. What happens here is, there is a very basic principle that your key services should always be up and degradation should be graceful. The customer or the users should not know that, okay, our RDS is down. We should not show an error message saying we cannot connect to a DB; handle that gracefully, work out a solution wherein you can maybe still allow the user through. I'll give more examples on that. But panic mode essentially is when you turn off your non-critical services, thereby making room for the critical ones. When 25 million people are watching cricket, that's almost 99.99% of the total traffic active on the platform at that time. So for the remaining 0.01%, it does not really matter if they get personalized content or the other non-essential things, which can be turned off, thereby making room for the critical API services which deliver your video, your ads, your concurrency numbers, your key health-check metrics. P0 services must always be up. For us, P0 is video, advertisement, subscription, payment systems; all of this is very essential. Non-essential services like recommendation and personalization can be turned off for maybe half an hour during a key or interesting moment of the match. So you reduce that traffic, and you make that bandwidth available for the P0 services. And graceful
degradation is for applications like the example I gave: an application can only handle 10 million traffic, and it doesn't make sense to scale it up more by just throwing hardware capacity at it unless it adds business value. So at that time you can cut off that system, maybe return a 200 OK response, so that your client doesn't throw an error message or show the user a bad experience. The other thing is, clients are also smart enough to know an application is in panic mode through a custom error code being returned. What happens in this case is, let's say your payment application's DB has some issues, due to which you are not able to complete the financial transaction, which is not the user's fault at all, right? So what we do is put the payment system into panic; now the client, if it retries, will know that, okay, there is an issue at the backend and not a customer-related thing. It will allow you to bypass the payment and just watch some ads. Same thing with the login system: if your login system, anything like a DynamoDB in the backend or an ELB, has an issue, or is hitting an error or a bug, we can put the entire login system into panic. What this does is allow the user to watch a video without even asking them for a login or a valid subscription, because we know that, okay, there is an issue at our end. So we allow users to bypass those services for a particular time till that issue is fixed. Once it
is fixed, we disable the panic mode so that the normal system flow can continue. And yeah, at every ladder, like after 10 million, there are decisions taken to understand, one, whether this service or application is necessary for the business, and two, whether it is near its rated capacity. The tool that we have created shows us in real time at what level of the rated capacity we are operating. So if we are around 80 or 90% of it, we manually put that service in panic and degrade it gracefully, so that it doesn't have any cascading effect on other systems that rely on it, because it will now return 200 OK; the other applications will think that the system is fine, and the clients will not show an error message or a pop-up, which would impact your user experience.
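A minimal sketch of what such a panic-mode switch could look like on the backend; Flask, the flag variable and the JSON shape are illustrative assumptions, not the actual contract:

```python
# Hypothetical panic-mode toggle for one backend service (e.g. payments):
# while panicked it returns 200 with an explicit marker instead of a 5xx,
# so clients bypass the feature gracefully instead of showing an error.
from flask import Flask, jsonify

app = Flask(__name__)
PANIC_MODE = False   # in reality this would live in a central flag/config store


def set_panic(enabled):
    """Operator/automation toggle, e.g. when the service nears its rated capacity."""
    global PANIC_MODE
    PANIC_MODE = enabled


@app.before_request
def short_circuit_if_panicked():
    if PANIC_MODE:
        # Custom "panic" payload with 200 OK: the client knows the problem is on
        # the backend and lets the user through (e.g. skip payment, show an ad).
        return jsonify({"panic": True}), 200


@app.route("/charge", methods=["POST"])
def charge():
    # Normal payment path; only reached when panic mode is off.
    return jsonify({"status": "charged"}), 200
```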
So the key takeaways are: prepare for failure. I cannot emphasize enough how important this is, because at this scale, especially if you are operating in a public cloud, failures are bound to happen, and they are usually not in your control. There will always be a factor you cannot do anything about, but it's your job to overcome those and design your application or system in such a way that it can handle those failures; that is why chaos engineering is so important. And understand your user journey: unless you understand what API calls happen in your system when a user opens the app, you won't be able to make your infrastructure stable or resilient. You have to understand each API. When you tap on the homepage, what all actions happen in the background, which database it touches, which API call goes from where to where. You need to know, when you click a play button, what all APIs are called. If you know that journey, you can then script it, and you can also create your load-testing patterns. And it is okay to degrade gracefully; we should avoid showing errors to users, which impacts the user experience and adds a bad name to your brand. So whenever possible, degrade gracefully without the user knowing that your system or infrastructure has issues. Cool,
that's all I have for today. I had to watch the match the next day; I wasn't really watching the match as it happened. Sometimes we don't even get a chance to look at what is happening; we look at the match from a different angle. People enjoy the match, and we just look at the scorecard and who is playing, so that we can take a decision: okay, Dhoni is going to come in to bat, so there can be an increase of two or three million, so we have to think from that lens, that okay, be ready to scale up because a favourite batsman is coming in. The same thing happens in IPL as well. The top three teams, at least in terms of fan base, are CSK, MI and RCB; whenever these teams are playing, our traffic is crazy. And there are so many Dhoni and Kohli fans that whenever they get out, our traffic drops by two, three million. So we look at it from that angle. Sometimes people ask me whether I saw that fantastic catch, or saw him hitting a fifty; in that tense moment, we don't even remember that such events are happening. Yes, almost two years. Because there is always learning; on the first attempt you cannot always succeed in scaling systems. And for many of the problems, we didn't have any references, like okay, someone has done this somewhere in the world and we can copy that idea. Some of the problems we encountered were very unique in nature, due to the fact that no one has crossed this scale; even with CDN partners, no one is streaming at this scale. So it adds its own unique challenges. And there are failures as well, but you learn from them and you try to do better next time. So I will say it has taken almost two years' worth of effort to reach this point. And even though IPL is still far away, our preparations have already started.
already started. I have a small doubt here. So here, you talked
about the game day, right? So how were you able to mock such a huge load onto your systems during the
game day that you're doing? So like I said, it uses more than 3000 you finance
slot machine, these machines are scripted to generate equivalent load that 50 million
concurrency users will add to the platform and logs and the access pattern gives us values like
if I say at 5 million concurrency not all application will handle 5 million payments will
have lower video API's much more must be having higher if everyone is watching match, my
personalization will not have that much hits. So at every ladder, each application added own. You
can say RPM RPM advocate operates. If you find that ratio, you can corrupt your load testing to
mimic the entire graph that we saw in the first slide. Did you actually automate the chaos testing that
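A small sketch of that ratio-based calibration; the per-application request rates below are invented for illustration, while the real values would come from production logs:

```python
# Illustrative calibration: derive per-application target RPS for a load test
# from the request rates observed in production at a known concurrency.
# The observed numbers below are made up, not the real traffic mix.
OBSERVED_CONCURRENCY = 5_000_000
OBSERVED_RPS = {            # measured at 5M concurrent users
    "video-playback": 900_000,
    "heartbeat": 400_000,
    "payments": 8_000,
    "personalization": 30_000,
}


def target_rps(target_concurrency):
    """Scale each application's observed rate by the concurrency ratio."""
    scale = target_concurrency / OBSERVED_CONCURRENCY
    return {app: rps * scale for app, rps in OBSERVED_RPS.items()}


# Calibrate the load generators for a 50M-concurrency game day:
for app, rps in target_rps(50_000_000).items():
    print(f"{app}: drive ~{rps:,.0f} requests/second")
```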
Audience: Did you actually automate the chaos testing that you have done?

Gaurav: Sorry?

Audience: The chaos testing that you have, the chaos engineering.

Gaurav: Yeah, it's only scripted.

Audience: So can you tell me, are there any frameworks for this automation?

Gaurav: Mostly it's Python based. So it depends. Chaos engineering in our case is, you can say... there are tools available in the market which do this for you, open-source tools, but what we try to achieve from chaos is two things. One: errors that have happened in the past, if they happen again, what will be the impact? That is why we simulate network failures, we simulate the DB not being available, just to see how it will impact us at 10 million versus 25 million. And sometimes we just randomly go and do stuff that is not even thought of: maybe, if a system is talking to another system through VPC peering, you delete that network connectivity, or you go and change the route table settings, just to see whether the application can handle it, either the load or the performance impact, if that network connectivity is
not available. Sorry, it's all homegrown Python, not using any third party. These are literally Python Boto scripts, user scripts: whatever you want to perform, say you want to change the route table, that is easily doable through Python Boto.

Audience: Does the QA team do it on its own?

Gaurav: No, no, not yet. But it's very simple; removing a route table entry through Python Boto is just three or four lines of code.
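For illustration, a chaos script of that kind, with a hypothetical route-table ID and CIDR, really is only a few lines of boto3:

```python
# Hypothetical chaos script: drop the route that carries traffic to a peered
# VPC, then watch how the dependent service behaves. IDs/CIDR are placeholders.
import boto3

ec2 = boto3.client("ec2")
ec2.delete_route(
    RouteTableId="rtb-0123456789abcdef0",
    DestinationCidrBlock="10.20.0.0/16",   # route towards the peered VPC
)
```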
Audience: About the panic mode, you talked about bypassing other services like payments or something. So is it just a feature toggle that you have implemented?

Gaurav: Kind of. It is feature toggling at the client level, and at the backend level there are custom error codes which return an OK response instead of a 4xx or 5xx.