Prime Video Swaps Microservices for Monolith: 90% Cost Reduction

Captions
The engineering team behind Amazon Prime Video has released a blog post detailing how they moved one monitoring service, which detects things like frozen frames and audio clicks in the live-streaming portion of the Prime Video app, from serverless and microservices to a monolith. They cut 90% of the cost and they were able to scale better. How about we read this article and then discuss?

This comes from the Prime Video tech blog, written by Marcin Kolny: "Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%. The move from a distributed microservices architecture to a monolith application helped achieve higher scale, resilience, and reduce costs." So how did they actually move from microservices to a monolith? How did that happen? Let's discuss.

I've got to warn you: this article is not very detailed, unfortunately. There is a lot missing. This article could have been great; it's just okay, to be honest. The reason is that there is so much background information we have no idea about, the diagrams are not well designed in my opinion, and the text doesn't explain the architecture well. It's just a bunch of boxes talking to each other, and we have no idea what the use case is, what the workflow looks like, or what I'm doing here as a customer. Maybe it's explained somewhere else; I looked in other articles and couldn't find it. So I'll try, based on my understanding, to explain what I think is happening. Of course, I might be wrong.

Amazon Prime, if you don't know it, is a service where you pay something like $120 a year and get a bundle of things: two-day delivery from Amazon, Prime Video, Prime Music, and many other services. You can also subscribe with Twitch to support your creators. Almost everyone in the US has it, and plenty of people outside the US too, so Prime is a very popular thing. The article opens: "At Prime Video, we offer thousands of live streams to our customers." Live streams here have nothing to do with Twitch, by the way. This is apparently their own thing, where they live-stream sporting events and you watch them in the Prime Video app. Don't confuse it with Twitch; Twitch is owned by Amazon, which is where my mind went at first, but this has nothing to do with it. Prime Video live streaming is, if you will, their higher-quality streaming offering, and they're spending good money on it.

"To ensure that customers seamlessly receive content..." If I'm a paying Prime customer, then unlike Twitch, which is free and anyone can watch, I'm actually paying for this, so it had better be good. To summarize the article: they have an architecture to monitor and detect the quality of the user's experience, because the experience of watching a live stream on a PS5 versus an iPhone versus an Xbox versus an Android phone is all different.
The reason is that the device does the decoding of the video and the audio: it takes the packets, and client-side logic executes and does more work. There could be a bug in the decoder that shows you frozen frames or clicks in the audio; in other words, a bug in the client app, and Amazon wants to detect those bugs in the client apps. That's my understanding. Of course, it could also be the encoder: the data you receive depends on which streaming quality tier you're on; I think they categorize it as something like good, better, and best. Based on those quality tiers and bitrates, there could be problems at the source as well. I don't know whether this service detects those too; maybe both, actually.

Here's what I think this tool does, and I say "think" because they don't state it: the client application, once it decodes the stream, will re-upload the stream (or the part of the stream you just watched) back to Amazon, exactly as you decoded it. I believe there's an option where you can set up this monitoring; I didn't see it in my app, but maybe there's a place I missed. The point is that whatever the user saw is what Amazon will see, and that's the only way you can verify what the user is actually seeing, because the client does extra work after receiving the packets. This uploaded stream then goes into the microservices architecture they're optimizing. A media conversion service converts the stream the user just uploaded into a bunch of frames, which are images, and the audio into buffers of bytes. Those are fed into something called the detector service, another microservice. The detectors use machine learning, trained on labeled data, to say: these frames are actually frozen, and this audio is actually clicking. Once a detector finds something, it issues a notification to a service and records that this frame is bad. That's how the monitoring works, as far as I can tell; a sketch of that end-to-end flow follows below.
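To make that concrete, here's a minimal sketch of the pipeline as I just described it. This is my reconstruction, not Amazon's code; the article publishes nothing like this, and every name and shape below is hypothetical.

```python
# Minimal sketch of the monitoring pipeline as described above. All names
# and data shapes are hypothetical; the article does not publish any code.
from dataclasses import dataclass

@dataclass
class Defect:
    stream_id: str
    position: float   # seconds into the uploaded sample
    kind: str         # e.g. "frozen-frame" or "audio-click"

def media_converter(uploaded_sample: bytes) -> tuple[list[bytes], list[bytes]]:
    """Split the decoded stream the client uploaded into video frames and
    decrypted audio buffers (the actual demux/decode is elided)."""
    frames: list[bytes] = []
    audio_buffers: list[bytes] = []
    ...  # an ffmpeg-style demux/decode step would go here
    return frames, audio_buffers

def run_detectors(stream_id: str, frames: list[bytes],
                  audio: list[bytes]) -> list[Defect]:
    """Apply ML-based defect detectors to the converted media."""
    defects: list[Defect] = []
    ...  # e.g. a frozen-frame model over frames, a click model over audio
    return defects

def notify(defect: Defect) -> None:
    # Stand-in for the notification service the article mentions.
    print(f"{defect.stream_id}: {defect.kind} at {defect.position:.1f}s")

def monitor(stream_id: str, uploaded_sample: bytes) -> None:
    frames, audio = media_converter(uploaded_sample)
    for defect in run_detectors(stream_id, frames, audio):
        notify(defect)
```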
Now that we understand that, let's continue reading and discuss the distributed architecture versus the monolith version. "Our video quality team at Prime Video already owned a tool for audio/video quality inspection, but we never intended nor designed it to run at high scale" — which is understandable; when you first design something, you rarely think about that — the goal being "to serve thousands of concurrent streams." Here's another thing I didn't fully understand: what is a concurrent stream in this context? My read is that they're monitoring streams coming back from the customers: the customer watches something, downloads it, and then turns around and uploads that part of the stream back to the service. The reason I say this is that it's exactly what the diagram shows: the customer is uploading something; it doesn't come straight from the source. Which also makes sense, because the decoding happens on the device, and if there's a bug in the client application, you want to detect it after the decoding. I have no idea how much bandwidth that upload takes. Again, I might be way off here; the article isn't detailed enough for me to know whether what I'm saying is correct.

So these concurrent streams are being monitored. "While onboarding more streams to the service, we noticed that running the infrastructure at high scale was very expensive." The more streams they onboarded — the more customers whose experience they monitored, is my guess — the more the bottlenecks showed. The initial version was a service consisting of distributed components orchestrated by AWS Step Functions, the serverless workflow engine that, I believe, coordinates their Lambda functions. "The two most expensive operations were the orchestration workflow and when data passed between distributed components." That checks out: you have these decoded frames, and you're passing them around between microservices; of course it's going to be slow. "To address this, we moved all components into a single process to keep the data transfer within the process memory, which also simplified the orchestration logic," because all operations are compiled into a single process. Once you do that, all the single process has is its heap memory: you store the frames in the heap and just access them. It doesn't get any faster than that; distributed systems carry overhead.

Here's where we get into the details. "Our service consists of three major components. The media converter converts input audio/video streams to video frames or decrypted audio buffers that are sent to detectors." How does the media converter get the data? The text doesn't explain, but the diagram does: it has a box labeled "customer," which is a very bad name for it. It should be called the client application — the Prime Video client app. I might be exaggerating, but if you say "customer," it's just not clear what it is. And do you see the arrow labeled "audio/video stream" going from the customer to the media conversion service? If you're watching something, the content flows from the source down to you; the customer consumes the stream. But here the customer is uploading something, and that's what they never mention: there is a monitoring concept here. I think the app itself has a feature where you can opt in, maybe on any platform, to monitor the quality of your stream. If you do, the app will probably sample your stream periodically — it's surely not going to upload everything, I hope — and send it back to the media conversion service we just talked about.
The media conversion service takes that raw stream — well, it's not really raw; it's the converted stream as it appeared in your app, whatever you saw. And again, anything I say here is my assumption, because it's not stated in the article, and I could be wrong. The stream received by the media conversion service is converted into frames and decrypted audio buffers. The article says these "are sent to the detectors," but how are they sent? They're actually written to an S3 bucket. Why? Because there's a separate flow the customer effectively triggers to say "start monitoring now." There's an arrow that isn't labeled in this diagram, but from the other diagram I deduce it's called "start analysis." So a start-analysis call invokes a Lambda function, which then starts the conversion; when the customer uploads, that alone doesn't start the conversion, it apparently just stores the content. Then this explicit call from the client app says: now convert. I don't know why it's like that. The conversion runs and stores the result, and that's the orchestration: okay, go ahead and convert; now let's wait; did you finish converting? The media conversion service has to send something back — an acknowledgment, "I just finished the conversion," maybe done asynchronously — and of course that's not in the diagram. Once it's done, the orchestrator calls the next step: hey, detectors, go read from the S3 bucket whatever the media conversion service wrote — those frames and those decrypted audio buffers — run your beautiful machine-learning thing, and once you have the results, write them to the notification service.

Now, who reads those notifications? Probably not the customer. As a customer, why do I care to be told that my frame froze? I want you to fix it. These notifications feel like they're for Amazon, not for me, so I don't understand why it's labeled as a real-time notification topic for customers. I don't care that my frame froze; I saw it freezing, I know it's freezing, so why would you tell me? There is so much missing here, and maybe there's one component I'm not seeing that would make everything make sense. There's also a result aggregation function that collects the detector results and writes the aggregation to another S3 bucket, and if you want to learn more about the machine-learning detectors, they link another article describing them.
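To picture the hand-off they're describing, here's a minimal sketch of the distributed design's frame exchange through S3, assuming — as the diagrams suggest but the text never spells out — that the converter writes frames as objects and a detector reads them back. The bucket and key names are made up.

```python
# Sketch of the distributed design's S3 hand-off: the converter writes each
# frame as an object and a detector reads it back. Every write and read is
# a network round trip plus I/O, which is exactly the cost being discussed.
# Bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")
BUCKET = "va-frames-bucket"  # hypothetical

def converter_publish_frames(stream_id: str, frames: list[bytes]) -> list[str]:
    keys = []
    for i, frame in enumerate(frames):
        key = f"{stream_id}/frame-{i:06d}.jpg"
        s3.put_object(Bucket=BUCKET, Key=key, Body=frame)  # write + network
        keys.append(key)
    return keys

def detector_consume_frames(keys: list[str]) -> None:
    for key in keys:
        frame = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()  # again
        ...  # run the defect model on `frame`
```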
Now they talk about the problems. "We designed our initial solution as a distributed system using serverless components" — and the whole thing is almost all serverless, except the media conversion service; they don't say what that runs on. "However, the way we used some components caused us to hit a hard scaling limit at around 5% of the expected load." They just couldn't scale past that, so of course there's a bottleneck. "The main scaling bottleneck in the architecture was the orchestration management that was implemented using AWS Step Functions." Because this whole flow — start conversion, the customer triggered this, now you go read that, now you detect, now you aggregate — that's orchestration, and that's the expensive part. There's always a delay: when do you know it's time to orchestrate the next step? Do you do it asynchronously? Do you have a timer? It's an interesting problem. "Our service performed multiple state transitions for every second of the stream, so we quickly reached account limits. Besides that, AWS Step Functions charges its users per state transition."

And here's the third odd thing: "charges users"? Why are you charging me for something you're responsible for? You're Amazon; you own Prime Video and you own Step Functions. It reads like an engineer writing about the problem purely as a user of AWS, which is a strange register when the underlying engineering also belongs to the same company. It's not written like a product piece; it's written like an engineer explaining a billing pain, and that's just odd.
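To see why per-second state transitions hurt, here's some back-of-envelope arithmetic. The article gives no numbers, so every input below is an illustrative assumption; the only external figure is the published Step Functions Standard Workflows rate, which is roughly $25 per million state transitions.

```python
# Back-of-envelope: orchestration cost when the workflow makes several state
# transitions for every second of every monitored stream. All inputs are
# assumptions; the article publishes no figures.
TRANSITIONS_PER_STREAM_SECOND = 5     # "multiple transitions per second" (assumed)
CONCURRENT_STREAMS = 1_000            # the scale the article talks about
USD_PER_MILLION_TRANSITIONS = 25.0    # ~published Standard Workflows rate

per_hour = TRANSITIONS_PER_STREAM_SECOND * CONCURRENT_STREAMS * 3_600
cost_per_hour = per_hour / 1_000_000 * USD_PER_MILLION_TRANSITIONS
print(f"{per_hour:,} transitions/hour -> ${cost_per_hour:,.2f}/hour")
# 18,000,000 transitions/hour -> $450.00/hour, for orchestration alone
```

At that shape, the orchestration bill grows linearly with every watched second, before you pay a cent for actual compute — consistent with it being one of their two most expensive operations.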
"The second cost problem we discovered was about the way we were passing video frames (images) around the different components. To reduce computationally expensive video conversion jobs, we built a microservice that splits videos into frames and temporarily uploads images to an Amazon S3 bucket." So they convert once and then distribute the images, instead of passing the stream directly to each detector, because it would be expensive for every compute unit to convert and analyze. That part I'm with them on; converting once is a good idea. What I don't think is a good idea is putting the output in an S3 bucket. Instead of writing to S3 and then reading it back, count the costs: you incur the cost of a write, the cost of network bandwidth (S3 is not the same machine), another I/O for the read, and more network on top of that. And these frames are not small; they're huge. Even if this were HTTP with compression — gzip or whatever — you're still moving really large payloads, uploading and downloading over and over, and S3 has its own limits and tiers too. Notice, again, the odd framing: they talk about S3 and AWS as if it were somebody else's product, when they own the whole thing.

One more thought — just look at this diagram. You could have eliminated the bucket, and I think the orchestration too, by having the media conversion service talk directly to the detector: have a serverless function that takes the frames as input. Once the video is uploaded, the media conversion service converts it, buffers the result in memory, and invokes the detector Lambda directly: here are the frames, go detect. That scales, because Lambda is a scalable function, and this way you don't even need S3 or the start-analysis orchestration. I don't know why they didn't do it this way; it's just one option. It's always expensive to write to S3. Again, I'm speaking as an armchair architect here. I'm not in the midst of this, there is so much missing information, and it could be a far more complex process that doesn't allow what I just described.
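Here's a minimal sketch of that alternative, with a hypothetical detector function name. One honest caveat that may well explain their S3 choice: Lambda caps asynchronous invocation payloads at roughly 256 KB, so full-resolution frames wouldn't fit directly; this only works for small or downscaled batches.

```python
# Sketch of the alternative floated above: the converter invokes a detector
# Lambda directly with the frame batch instead of staging frames in S3.
# Function name and payload shape are hypothetical. Real-world constraint:
# async Lambda payloads are limited (~256 KB), so batches must be small.
import base64
import json
import boto3

lambda_client = boto3.client("lambda")

def on_segment_converted(stream_id: str, frames: list[bytes]) -> None:
    payload = {
        "stream_id": stream_id,
        # base64 so raw image bytes survive the JSON envelope
        "frames": [base64.b64encode(f).decode("ascii") for f in frames],
    }
    lambda_client.invoke(
        FunctionName="defect-detector",  # hypothetical
        InvocationType="Event",          # async: fire-and-forget, scales out
        Payload=json.dumps(payload).encode(),
    )
```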
Okay, so they decided the distributed approach was the bottleneck and made the bold decision to re-architect. What did they do? Conceptually, everything stays the same; they just put everything into a single process. They still talk to an orchestration layer, which I still think is unnecessary, to be honest, and it still says: okay, now start the analysis. And see, the user still uploads the stream from their client app to the media converter, so there's still an endpoint that accepts uploads. The diagram now shows an ECS task; what's the difference here between the instance and the ECS task? Are they two different things? I'm not sure. Either way, that whole box is now one beautiful process, and the components live inside it. When you call start analysis, it calls start conversion, which converts everything the user uploaded, and then there's a "new audio buffer" arrow going back toward the orchestration. I think the dotted lines are the content and the solid lines are the requests and responses, so that arrow is saying: hey, I have a new buffer. But then why isn't the same labeling done everywhere? Why is the diagram incomplete? That's sloppy, I'm sorry. This is Amazon tech we're talking about; you've got to produce a good piece of content, and this is not acceptable. Anyway, I called it earlier: there must be an acknowledgment coming back, and here they do show it — new audio/video buffer — good. Then the next orchestration step kicks in: all right, now let's analyze what we have.

And what did the media conversion step do with its output? It wrote the frames and the decrypted audio into memory. Beautiful, because in memory means in this process's heap. Even if the components were separate processes, that would be fine: the memory could be a shared memory pool that multiple processes access, and it's still fast, because whether detection is a separate process or not, the whole thing is on a single machine. We still have direct, hot memory access. And we don't care about persistence: if the process crashes, we lose that work, and who cares? Durability is not one of their goals here, I think, and that's fine, because it's a monitoring service; losing a sample is not the kind of thing you need to guard against.
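Here's a minimal sketch of that in-memory hand-off, assuming — the article implies it but never states it — a converter and detectors sharing one process. The heap replaces S3, and losing in-flight frames on a crash is acceptable.

```python
# Sketch of the monolith's hand-off: converter and detectors share a process,
# so frames move through an in-process queue on the heap. No S3, no network,
# no orchestration call between steps; a crash loses in-flight work, which
# is fine for a monitoring service. Names are hypothetical.
import queue
import threading

frame_queue: "queue.Queue[bytes]" = queue.Queue(maxsize=1024)

def media_converter(decoded_frames: list[bytes]) -> None:
    for frame in decoded_frames:
        frame_queue.put(frame)        # heap-to-heap; nothing leaves the box

def analyze(frame: bytes, detector_name: str) -> None:
    ...  # run the defect model; results go on to notification/aggregation

def detector_worker(detector_name: str) -> None:
    while True:
        frame = frame_queue.get()     # direct, hot memory access
        analyze(frame, detector_name)
        frame_queue.task_done()

# One worker per detector; daemon threads so a sketch run can exit cleanly.
for name in ("frozen-frame", "audio-click"):
    threading.Thread(target=detector_worker, args=(name,), daemon=True).start()
```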
So we have detector one, detector two; detection runs, the detection results get written, notifications go out, and the final aggregated output is still written to an S3 bucket. Cool. "Conceptually, the high-level architecture remained the same." They didn't change it, and I think that's why they kept the orchestration: they didn't want to rewrite that code. The orchestration is still there; they just changed how the components talk to each other, effectively making them local calls. All the components are still there. "In the initial design, we could scale detectors horizontally, as each of them ran as a separate microservice" — right, each was effectively its own serverless function that could spin up independently. Now they can't. Well, I'd argue you still could, if you run each detector as its own process, but you're limited by the compute power of that one machine either way, so fair enough. I think each detector is its own process, or a thread, and I wish they had talked about this. I really wish these details were explained; this blog could have been great, and it's just okay. Why not say "every detector is now a process" or "the video detector is a thread"? I apologize, but sometimes these things hurt my heart, because this is a really good piece of work and the blog post doesn't do it justice.

"However, in our approach the number of detectors only scales vertically, because they all run on the same instance. Our team regularly adds more detectors to the service, and we already exceeded the capacity of a single instance." Again, they don't explain what a detector is; I'm assuming processes. "To overcome this problem, we cloned the service multiple times, parameterizing each copy with a different subset of detectors." It's a very simple thing: the whole ECS task — the machine with everything in it — is now its own unit, and it's cloned as a group. Everything inside still talks to everything else; they just added another layer on top to load-balance and forward requests across the clones. Think of each clone as that whole orange box from the diagram: everything lives in there, and you horizontally scale the box; one clone runs two detectors, another runs three, and you load-balance across the whole thing. A sketch of how each clone could be parameterized follows below.
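Here's a minimal sketch of that parameterization, assuming each clone is the same build configured through an environment variable. The article doesn't say how each copy receives its subset; the variable name and detector names below are made up.

```python
# Sketch of "clone the monolith, parameterize each copy with a subset of
# detectors". Every clone runs the same code; an env var (name assumed)
# selects which detectors this copy runs, and a front layer spreads the
# streams across clones. Detector functions are hypothetical stubs.
import os

def detect_frozen_frames(sample: bytes) -> None: ...
def detect_block_corruption(sample: bytes) -> None: ...
def detect_audio_clicks(sample: bytes) -> None: ...

ALL_DETECTORS = {
    "frozen-frame": detect_frozen_frames,
    "block-corruption": detect_block_corruption,
    "audio-click": detect_audio_clicks,
}

# e.g. clone A: DETECTORS="frozen-frame,audio-click"
#      clone B: DETECTORS="block-corruption"
enabled = set(os.environ.get("DETECTORS", "").split(","))
ACTIVE = {name: fn for name, fn in ALL_DETECTORS.items() if name in enabled}

def process_sample(sample: bytes) -> None:
    for fn in ACTIVE.values():  # this clone runs only its configured subset
        fn(sample)
```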
This is the case where, if you will, they effectively built macro services, not microservices: they grouped everything together, and for this particular case that's the perfect solution, because all of these pieces talk to each other tightly. If components are that coupled, put them together in one monolith, or somehow destroy the coupling if you can; there's no point separating them.

Before I forget, here's what I think will break in the future. Currently there is only one consumer, if you will, of the converted frames, and that's the detectors. If another set of consumers ever needs the decrypted audio and the frame images, it's going to get interesting, because the only place a new detector type can live is inside this big monolith, and that's the cost they'll have to incur. This is where Kafka and other pub/sub systems come in handy. If there are ever more consumers of the media conversion output than just these detectors, the new detector types will have to be added to this cluster, and the only way to scale is to scale the whole thing — even though the media conversion component doesn't need to scale, you'd incur the cost of scaling it anyway, because it's baked into the same cluster and process. That's exactly the selling point of microservices for this situation: initially, media conversion was its own microservice and detection was another, so if one needs to scale more than the other, you scale just that one. In the monolith you don't get a choice; you scale both. That might be fine. It will just be interesting to see what happens when they add more detector types.

Results and takeaways: "Microservices and serverless components are tools that do work at high scale, but whether to use them over monolith has to be made on a case-by-case basis." I agree with that statement 100%. It's all case by case; it depends on what you're trying to do. "Moving our service to a monolith reduced our infrastructure cost by over 90%." That's because everything is simpler now: the S3 traffic, which was killing them, the bandwidth, and the orchestration cost are gone; the whole thing is a single process, or maybe a few processes, depending on how they built it. "It also increased our scaling capabilities. Today, we're able to handle thousands of streams, and we still have capacity to scale the service even further. Moving the solution to Amazon EC2 allowed us to use the Compute Savings Plans that will help drive costs down even further. Some decisions we've taken are not obvious, but they resulted in significant improvements. For example, we replicated a computationally expensive media conversion process and placed it closer to the detectors, whereas running media conversion once and caching its outcome might be considered a cheaper option. We found that this is not a cost-effective approach." That's what I was getting at earlier, and it's very interesting: they would rather recompute the conversion than cache it. They considered converting once and caching the outcome, and apparently it didn't work out for them; an interesting trade-off indeed. "The changes we've made allow Prime Video to monitor all streams viewed by our customers, and not just the ones with the highest number of viewers. This approach results in even higher quality." Okay, so that confirms they're monitoring the streams viewed by customers — although that statement still leaves me unclear: are they monitoring the raw stream being produced, or how the stream is consumed by the customer? That's the step I'm still not clear about.

All right guys, that's it for me today. I hope you enjoyed this video. What do you think about all this? Let me know in the comment section below. See you in the next one. Goodbye!
Info
Channel: Hussein Nasser
Views: 153,328
Keywords: hussein nasser, backend engineering, amazon prime microservices, microservices vs monolith, software architecture, backend amazon
Id: dV3wAe8HV7Q
Length: 35min 9sec (2109 seconds)
Published: Sat May 06 2023