Streaming a Million Likes/Second: Real-Time Interactions on Live Video

Captions
Good morning and welcome to QCon. As Nikki mentioned, we're going to start with something that has a 90% chance of failure, so I want to request all of you to take out your phones or laptops and go to the URL on the screen, tiny.cc/qconlive. You can also scan the QR code. Log in with your LinkedIn credentials, and if you don't have a LinkedIn account, please make one right now. You should see a live stream of this very talk, delayed by a minute, and my lovely wife, sitting here in the audience, is streaming the demo live from a phone. If you're successful, you should see something like this on your screen. How many of you are able to see something? Awesome. I guess the rest of you are getting there; you're still making a LinkedIn account, right? This comments area is where you can interact with each other: you can post comments, you can ask questions, and I will try to answer some of them at the end of this talk. The only thing I request is that you don't share the stream, because this is just a demo and I will delete it after the end of the talk.

Let's start with a question: does anyone know what the largest live stream in the world was? Think of something that grinds an entire nation to a halt. Yes? Oh, that was the second largest. Yes, cricket. Believe it or not, it was the semi-final of the Cricket World Cup last year between India and New Zealand. More than 25 million viewers watched the match at the same time, and overall the match crossed 100 million viewers. As this gentleman said, the second largest was the British royal wedding, with more than 18 million concurrent viewers. Remember this; we'll come back to it.

This is me and my team. We call ourselves the Unreal Real-Time team, and we love cricket and we love coding. And we believe in something: problems in distributed systems can be solved by starting small. Solve the first problem, and add simple layers in your architecture to solve bigger problems. Today I'm going to tell you the story of how we built something called the Real-Time Platform using that principle, to solve how many, many people can interact simultaneously on live videos. I hope that a lot of you will be able to learn something from this and also apply it to your own systems.

So what is a live video? A live telecast, a live conference broadcast, a sports match: all of these are examples of live videos. How many of you have interacted with a live stream on YouTube or Facebook? Perfect, so you know the general idea. The difference with these streams is that they allow users, or viewers, to interact with each other. So what is a real-time interaction on live video? The easiest way to understand is to just look at it, or try it on your phones with the demo: people are able to like, comment, ask questions, and interact with each other. This is an example of the LinkedIn iOS app showing a LinkedIn Live video, and on desktop the experience is very similar, with many rich interactions. This particular live stream is from the LinkedIn Talent Connect conference that happened in September of last year.

I want to talk about the simplest interaction here: how do these likes get distributed to all these viewers in real time, all at the same time? Let's say that a sender S likes the video and we want to deliver that like to a receiver A. The sender S sends the like to the server with a simple HTTP request. But how do we send the like from the server back to the receiver A?
Publishing data to clients is not that straightforward, and this brings us to the first challenge: what is the delivery pipe to send things to the clients? By the way, this is how I've structured the presentation today: I'm going to introduce problems, and I'm going to talk about how we solved them in our platform, and each solution is a simple layer of architecture that we added to the platform.

As discussed before, the sender S sends the like, with a simple HTTP request, to what we call the likes backend, which is the backend that stores all these likes. This backend now needs to publish the like over to the real-time delivery system; again, that happens with a simple HTTP request. The thing we need next is a persistent connection between the real-time delivery system and the receiver A. So let's talk a little more about the nature of this connection, because that's the part we care about here. Once again, log in at linkedin.com (hopefully you've created an account by now) and go to tiny.cc/realtime. This time you will actually see what's happening behind the scenes. For those of you who were successful, you should see something like this on your screen. This is literally your persistent connection with LinkedIn; this is the pipe we have to send data over to your phones or laptops.

And this thing is extremely simple. It's a simple HTTP long poll, which is a regular HTTP connection where the server holds on to the request; it just doesn't disconnect. Over this connection we use a technology called Server-Sent Events (SSE), which allows us to stream chunks of data over what is called the EventSource interface. The client doesn't need to make subsequent requests; we can just keep streaming data on the same open connection. The client makes a normal HTTP GET request, as simple as a regular HTTP connection request; the only difference is that the Accept header says event-stream. The server responds with a normal HTTP 200 OK, sets the Content-Type to event-stream, and the connection is not disconnected. Chunks of data are sent down without closing the connection: you might receive, for example, a like object, and later a comment object, with the server just streaming chunks of data over the same open HTTP connection. Each chunk is processed independently on the client through the EventSource interface. As you can see, there is nothing terribly different from a regular HTTP connection, except that the content type is different and you can stream multiple chunks of body on the same open connection.

What does this look like on the client side? On web, the client first creates an EventSource object with the target URL on the server, and then defines event handlers that will process each chunk of data independently on the client. Most browsers support the EventSource interface natively, and on Android and iOS there are lightweight libraries available to implement it. So we now know how to stream data from the server to the client, and we did this by using HTTP long poll with Server-Sent Events.
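To make that concrete, here is roughly what the exchange described above looks like on the wire. This is a sketch: the request path and the JSON payloads are invented for illustration; only the `Accept`/`Content-Type` headers and the `data:` framing (with a blank line ending each event) come from the SSE protocol itself.

```http
GET /realtime/connect HTTP/1.1
Host: www.linkedin.com
Accept: text/event-stream

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache

data: {"type":"like","video":"live:1","actor":"urn:li:member:123"}

data: {"type":"comment","video":"live:1","text":"Great talk!"}
```

The response simply never ends: each `data:` block is one chunk that the client's EventSource handler processes independently.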
What is the next challenge? Think of all the thousands of fans trying to watch cricket: the next challenge is multiple connections, maybe hundreds of thousands of them, and we need to figure out how to manage these connections.

At LinkedIn we manage these connections using Akka. Akka is a toolkit for building highly concurrent, message-driven applications. Anyone familiar with Akka Actors? Wow, a room full. So yes, not Hollywood actors; an actor is this very simple little thing, and it's the only concept you need to understand the rest of the presentation. Actors are objects which have some state and some behavior; the behavior defines how the state should be modified when they receive certain messages. Each actor has a mailbox, and actors communicate exclusively by exchanging messages. An actor is assigned a lightweight thread every time there is a message to be processed; that thread looks at the behavior defined for that message, modifies the state of the actor based on that definition, and once that is done, the thread is free to be assigned to the next actor. Since actors are so lightweight, there can be millions of them in the system, each with its own state and behavior, and a relatively small number of threads, proportional to the number of cores, can serve those millions of actors all at the same time, because a thread is assigned to an actor only when there is something to process.

In our case, each actor manages one persistent connection; that's the state it is managing. As it receives an event, the behavior defines how to publish that event to the EventSource connection. That is how many connections can be managed by the same machine using this concept of Akka actors.

So let's look at how Akka actors are assigned to EventSource connections. Almost every major server framework supports the EventSource interface natively. At LinkedIn we use the Play framework, and if you're familiar with Play, we just use a regular Play controller to accept the incoming connection, then use the Play EventSource API to convert it into a persistent connection and assign it a random connection ID. Now we need something to manage the life cycle of these connections, and this is where Akka actors fit in: we create an actor to manage each connection, and we instantiate the actor with the connection ID and a handle to the EventSource connection it is supposed to manage.

Let's step back out of the code and see how this concept allows you to manage multiple connections at the same time. Each client connection is managed by its own actor, and all the actors are in turn managed by an Akka supervisor actor. So let's see how a like can be distributed to all these clients using this concept. The likes backend publishes the like object to the supervisor actor over a regular HTTP request. The supervisor actor simply broadcasts the like object to all of its child actors, and those actors have a very simple job: they take the handle of the EventSource connection they hold and send the event down through that connection. The code is something very simple, essentially eventSource.send with the like object, and that sends the like down to the clients. And what does this look like on the client side? The client sees a new chunk of data, as you saw before, and simply uses it to render the like on the screen. It's as simple as that.
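Here is a minimal Scala sketch of the actor-per-connection idea just described. It is not LinkedIn's actual code: the `SseConnection` trait is a stand-in for the handle that Play's EventSource API would give you, and the message shapes are invented for illustration.

```scala
import akka.actor.{Actor, Props}

// Stand-in for the handle that the Play EventSource API provides for one
// open connection; send() pushes one SSE chunk down the open socket.
trait SseConnection {
  def send(data: String): Unit
}

// One actor per persistent connection: its only state is the connection id
// and the handle to the EventSource connection it manages.
class ConnectionActor(connectionId: String, conn: SseConnection) extends Actor {
  def receive: Receive = {
    case event: String => conn.send(event) // push the chunk down the pipe
  }
}

// The supervisor creates a child actor per incoming connection and fans a
// published event out to every child.
class SupervisorActor extends Actor {
  def receive: Receive = {
    case ("connect", id: String, conn: SseConnection) =>
      context.actorOf(Props(new ConnectionActor(id, conn)), s"conn-$id")
    case event: String =>
      context.children.foreach(_ ! event) // broadcast to every connection
  }
}
```

Note that a send is just a message to a mailbox: no thread is parked per connection, which is exactly why one machine can hold so many of them.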
So in this section we saw how an EventSource connection can be managed using Akka actors, and therefore how you can manage many, many connections on a single machine. What's the next challenge? Fan-out is one, but that comes later. Mailbox queue size? We're already talking about big scale; even before that, there's something simpler. I'll give you a hint: my wife and I always want to watch different shows on Netflix. What we did just now was blindly broadcast the like to everybody, without knowing which particular live video each client is actually watching. So the next challenge is different clients watching different live videos: how do we make sure that a like for, say, the red live video goes to the red clients and a like for the green live video goes to the green clients?

Let's assume that this client with connection ID 3 is watching the red live video, and this client with connection ID 5 is watching the green live video. What we need is a concept of subscription, so that the client can inform the server which particular live video it is currently watching. When client 3 starts watching the red live video, all it does is send a simple subscription request, using a simple HTTP request, to our server, and the server stores the subscription in an in-memory subscriptions table. Now the server knows that the client with connection ID 3 is watching the red live video.

Question: why does in-memory work here? There are two reasons. First, the subscriptions table is completely local: it only holds entries for the clients that are connected to this machine. Second, the connections are strongly tied to the life cycle of this machine: if the machine dies, the connections are lost anyway. Therefore you can safely store these subscriptions in memory inside the front-end nodes, and we'll talk a little more about this later.

Similarly, client 5 subscribes to live video 2, which is the green live video. Once all the subscriptions are done, this is the state of the front end of the real-time delivery system: the server knows which clients are watching which live videos. So when the backend publishes a like for the green live video, all the supervisor actor has to do is figure out which clients are subscribed to the green live video, in this case clients 1, 2, and 5, and the corresponding actors send the like to just those client devices. Similarly, when a like happens on the red live video, the supervisor actor can determine that it is destined only for connection IDs 3 and 4 and send them the likes for the video they're currently watching. So in this section we introduced the concept of subscription, and now we know how to make sure that clients only receive likes for the videos they're currently watching.
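A sketch of what the subscription step adds to the supervisor from the previous snippet. The message shapes are again invented; the point is the local, in-memory map from live video to connection actors.

```scala
import akka.actor.{Actor, ActorRef}

// Supervisor with a local, in-memory subscriptions table:
// live video id -> connection actors currently watching it.
// In-memory is safe because the table and the connections both
// die together with this front-end machine.
class SubscribingSupervisor extends Actor {
  private var subscriptions =
    Map.empty[String, Set[ActorRef]].withDefaultValue(Set.empty)

  def receive: Receive = {
    case ("subscribe", videoId: String, conn: ActorRef) =>
      subscriptions += videoId -> (subscriptions(videoId) + conn)
    case ("publish", videoId: String, like: String) =>
      subscriptions(videoId).foreach(_ ! like) // only subscribed connections
  }
}
```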
So what's the next challenge? Now we can go back to the gentleman here: somebody already said there could be millions of connections, more connections than a single machine can handle. That's the next challenge. We thought really hard about this; this is where we were a little stuck (and that's us thinking really hard). And we finally did what every backend engineer does to solve scaling challenges: you already know it, we added a machine. So we add a machine, and we start calling these front-end servers (the animation worked!), and we introduce a real-time dispatcher, whose job is to dispatch a published event between the newly introduced front-end machines, because now we have more than one.

Now, can the dispatcher node simply send a published event to all the front-end nodes? Yes, it can; it's not that hard. But it turns out not to be very efficient if you have a small live video with only a few viewers that are connected to just a few front-end machines. There's also a second reason, which I'll come back to a little later. For now, let's assume that the dispatcher can't simply send a like to all the front-end machines blindly. Given that, the dispatcher needs to know which front-end machines have connections that are subscribed to a particular live video, and so we need the front-end machines to tell the dispatcher about it.

Let's assume that front-end node 1 has connections subscribed to the red live video, and front-end node 2 has connections subscribed to both the red and the green live videos. Front-end node 1 then sends a simple subscription request, just like the clients were sending to the front-end servers, telling the real-time dispatcher that it has connections watching the red live video, and the dispatcher creates an entry in its own subscriptions table to track which front-end nodes are subscribed to which live videos. Similarly, node 2 subscribes to both the red and the green live videos.

Now let's look at what happens when an event is published. After a few subscriptions, assume this is the state of the subscriptions table in the real-time dispatcher, and note that a single front-end node can be subscribed to more than one live video, because it can have connections watching multiple live videos at the same time. In this case, node 2 is subscribed to both the red and the green live videos. This time the likes backend publishes a like on the green live video to the real-time dispatcher, and the dispatcher looks up its local subscriptions table to learn that nodes 2, 3, and 5 have connections subscribed to the green live video, and it dispatches the like to those front-end nodes over a regular HTTP request. What happens next you've already seen: those front-end nodes look up their own in-memory subscriptions tables to figure out which of their connections are watching the green live video, and dispatch the like to just those clients. Makes sense?

So we now have this beautiful system that can dispatch between multiple front-end nodes, which in turn dispatch to the many, many clients connected to them, and we can scale to almost any number of connections. But what is the bottleneck in the system? The dispatcher is the bottleneck. It never ends. The next challenge is that we have this one node we're calling the dispatcher, and if it gets a very high published rate of events, it may not be able to cope. That takes us to challenge number five: a very high rate of likes being published per second. Once again, how do we solve scaling challenges? We add a machine. Engineers just do the laziest thing, and it usually works out pretty well. So we add another dispatcher node to handle the high rate of likes being published.
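Putting the dispatcher's two jobs together, here is a hedged sketch of a single dispatcher node. Endpoint paths and payload shapes are invented for illustration; note the table is still local to this one node, which is exactly the limitation addressed next.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// One dispatcher node: front-end nodes subscribe per live video, and a
// published like is forwarded over plain HTTP only to the front-end nodes
// that have at least one viewer of that video.
class DispatcherNode {
  private val http = HttpClient.newHttpClient()

  // live video id -> front-end nodes that asked for it (local, for now)
  private var subscriptions =
    Map.empty[String, Set[String]].withDefaultValue(Set.empty)

  def subscribe(frontEndNode: String, videoId: String): Unit =
    subscriptions += videoId -> (subscriptions(videoId) + frontEndNode)

  def publish(videoId: String, likeJson: String): Unit =
    subscriptions(videoId).foreach { node =>
      val req = HttpRequest.newBuilder(URI.create(s"https://$node/publish"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(likeJson))
        .build()
      http.sendAsync(req, HttpResponse.BodyHandlers.discarding()) // fire and forget
    }
}
```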
Something important to note here: the dispatcher nodes are completely independent of the front-end nodes. Any front-end node can subscribe to any dispatcher node, and any dispatcher node can publish to any front-end node. There are no persistent connections here; the persistent connections are only between the front-end nodes and the clients. And this results in another challenge: the subscriptions table can no longer be local to just one dispatcher node, because any dispatcher node should be able to access it to figure out which front-end nodes a particular published event is destined for. Secondly, I tricked you a little bit before: this subscriptions table can't really live in memory in the dispatcher node. It can live in memory in the front-end node, but not in the dispatcher node. Why? Because even if a dispatcher node is lost, say this one just dies, we can't afford to lose the entire subscriptions data. For both of these reasons, we pull the subscriptions table out into its own key-value store, accessible by any dispatcher node at any time.

So now, when a like for the red live video is published by the likes backend to one random dispatcher node, and a like for the green live video to some other random dispatcher node, each of them can independently query the subscriptions table residing in the key-value store, because the table is completely independent of the dispatcher nodes and the data is safe there. The dispatcher nodes then dispatch the likes, based on what's in the subscriptions table, over regular HTTP requests to the front-end nodes. Cool.

So I think we now have all the components to show you how we can do what I promised in the title of this talk. Anyone remember the title? Oh, thank God, I'm not that boring. If a hundred likes per second are published by the likes backend to the dispatcher, and there are 10,000 viewers watching the live video at the same time, then we are effectively distributing a million likes per second.

I'm going to start from the beginning and show you everything in one flow, because everyone tells me I have to repeat myself if I want you to remember something when you walk out of this talk. A viewer starts to watch a live video, and the first thing the viewer does is subscribe on the front-end node to the topic of the live video they're currently watching. The client sends a subscription request to the front-end node, the front-end node stores the subscription in its in-memory subscriptions table, and the same happens for all such subscriptions from all the clients. Back in our overall diagram, the subscriptions have now reached the front-end nodes. Each front-end node, as I said before, now has to subscribe to the dispatcher nodes, because during the publish step the dispatcher needs to know which front-end nodes have connections subscribed to a particular live video. So the front-end node sends a subscription request to the dispatcher, which creates an entry in the key-value store that is accessible by any dispatcher node. In this case, node 1 has subscribed to live video 1 and node 2 to live video 2.
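The key-value-store version of the dispatcher described above might look like this. It's a sketch: the store trait is hypothetical and could sit on any key-value store (Couchbase, Redis, and MongoDB are mentioned later in the talk).

```scala
// The dispatcher made stateless: the subscriptions table moves into an
// external key-value store shared by all dispatcher nodes, so any node
// can serve any publish, and a dying dispatcher loses nothing.
trait SubscriptionStore {
  def add(videoId: String, frontEndNode: String): Unit
  def get(videoId: String): Set[String]
}

class StatelessDispatcher(store: SubscriptionStore,
                          post: (String, String) => Unit) {
  def subscribe(frontEndNode: String, videoId: String): Unit =
    store.add(videoId, frontEndNode)

  def publish(videoId: String, likeJson: String): Unit =
    store.get(videoId).foreach(node => post(s"https://$node/publish", likeJson))
}
```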
So this is the end of the subscriptions flow. Now let's look at what happens during the publish flow. The publish flow starts when viewers actually start liking a live video: different viewers are watching different live videos and continuously liking them, and all these requests are sent over regular HTTP to the likes backend, which stores them and then dispatches them, with a regular HTTP request, to any random dispatcher node. The dispatcher nodes look up the subscriptions table to figure out which front-end nodes are subscribed to those videos and dispatch the likes to the subscribed front-end nodes. The likes have now reached the front-end nodes, and we have the last step, which we began the presentation with: sending them to the right client devices. Each front-end node looks up its local subscriptions table (this is done by the supervisor actor) to figure out which actors to send the like objects to, and the likes are dispatched to the appropriate connections. And done: we just distributed a million likes per second with a fairly straightforward, iteratively designed, scalable distributed system.

This is the system we call the Real-Time Platform at LinkedIn. By the way, it doesn't just distribute likes: it also does comments, typing indicators, seen receipts; all of our instant messaging works on this platform; and even presence, those green online indicators you see on LinkedIn, is driven by this system in real time.

So everything is great, we're really happy, and then LinkedIn adds another data center. This made us really stressed; we didn't know what to do. So we went back to our principle and asked how we could use it to make our existing architecture work with multiple data centers. Let's take the scenario where a like is published to the red live video in the first data center, DC1, and let's also assume that there are no viewers of the red live video in DC1. This is where the subscriptions in the dispatcher that I spoke about help again, because now we may prevent a lot of work in DC1: we know whether we have any subscriptions for the red live video in DC1. We also know that, in this case, there are no viewers of the red live video in DC2, but there are viewers of the red live video in DC3. Somehow we need to take this like and send it to that viewer over there, really far away.

So let's start. The likes backend gets the like for the red live video from the viewer in DC1, and it does exactly what it was doing before, because this is not the likes backend's responsibility; it's the platform's responsibility. We are building a platform here, and therefore hiding all the complexity of multiple data centers from the users of the platform. So it just publishes the like to the dispatcher in the first data center, just like before; nothing changes there. Now that the dispatcher in DC1 has received the like, it checks for any subscriptions, again just like before, in its local data center, and this time it saves a ton of work, because there are no viewers of the red live video in DC1. But how do we get the like across to all the viewers in the other data centers? That's the challenge.
Any guesses? No, don't add another dispatcher; we already have too many dispatchers. Okay, so we could do cross-colo subscriptions, cross-data-center subscriptions. What's another idea? Good: broadcast to every DC. There is a trade-off between subscribing in a cross-data-center fashion and publishing in a cross-data-center fashion, and it turns out that publishing across data centers is better here; we'll touch on why a little later. So yes, this is where we do a cross-colo, or cross-data-center, publish to the dispatchers in all the peer data centers, so that we can capture viewers that are subscribed to the red live video in all the other data centers. The dispatcher in the first data center simply dispatches the like to its peer dispatchers in all the other data centers, and in this case a subscriber is found in DC3 but not in DC2; that's the only one. By the way, each peer dispatcher is doing exactly what it would have done if it had received this like locally in its own data center; there's nothing special about it. And the viewer in DC3 simply gets the like, just like it normally would, because the dispatcher found the subscription information in DC3, while the viewer of the green live video does not get anything. So this is how the platform supports multiple data centers across the globe: by keeping subscriptions local to the data center while doing a cross-colo fan-out during publish. Makes sense?
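A sketch of the cross-colo publish rule just described, under the assumption (consistent with the talk) that forwarded events are fanned out locally only, so nothing loops between data centers. `peerDispatchers` is a hypothetical config value.

```scala
// Likes that originate in this data center are fanned out locally AND
// forwarded once to a dispatcher in each peer data center; likes arriving
// from a peer are fanned out locally only, never re-forwarded.
class CrossDcDispatcher(localPublish: (String, String) => Unit,
                        peerDispatchers: Seq[String],
                        post: (String, String) => Unit) {

  // Called for likes arriving from the likes backend in this DC.
  def publishFromBackend(videoId: String, likeJson: String): Unit = {
    localPublish(videoId, likeJson) // may be a no-op if no local viewers
    peerDispatchers.foreach(peer => post(s"https://$peer/publish", likeJson))
  }

  // Called for likes forwarded by a peer DC: local fan-out only.
  def publishFromPeer(videoId: String, likeJson: String): Unit =
    localPublish(videoId, likeJson)
}
```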
Finally, I want to talk a little about the performance of the system, because everybody here is here for scale, right? We did an experiment where we kept adding more and more connections to the same front-end machine, to figure out how many persistent connections a single machine can hold. Any guesses? A million? Wow, no, not that many; we're also doing a lot of other work. It turns out we were able to hold a hundred thousand connections on the same machine. Yes, you can go to a million, but because we're also doing all this work, and because we use the system not just for distributing likes but for all the other things at LinkedIn, we got to 100,000 connections per front-end machine.

Anyone remember the second-largest live stream? The royal wedding. The royal wedding had 18 million viewers at peak, and we could serve that with just 180 machines: a single machine holds 100,000 connections, so with 180 machines you can have persistent connections for all 18 million viewers streaming the royal wedding. Of course, we didn't get to this number easily. We hit a bunch of file descriptor limits, port exhaustion, even memory limits, and luckily we documented all of that at tiny.cc/linkedinscaling. I hope you'll get something out of reading it; these are regular scaling challenges, it's just that we hit them in the context of expanding the number of connections we could hold on a single machine.

How about other parts of the system? How many events per second can be published to a dispatcher node? Before you answer, I want to point out something really important about the design of the system, which makes it massively scalable: the dispatcher node only has to publish an incoming event to, at most, the number of front-end machines. It doesn't have to worry about all the connections those front-end machines are in turn holding. It only cares about this green fan-out, the number of front-end machines this dispatcher can possibly publish an event to; it doesn't have to worry about the red fan-out, because that part is handled by the front-end machines, and they do it with in-memory subscriptions and Akka actors, which are highly efficient at this. So with that context, what do you think is the maximum number of events you can publish to a dispatcher per second? Very close; that's a very good guess. For us, the number turned out to be close to 5,000: 5,000 events per second can be published to a single dispatcher node, and so we can effectively publish 50,000 likes per second to the front-end machines with just 10 dispatcher machines. And this is just the first part of the fan-out: those 50,000 likes per second are then fanned out even further by the front-end machines, which do it very efficiently. That multiplicative factor is what results in millions of likes being distributed per second.

Lastly, let's look at timing, because everybody really cares about latency; if you're building a real-time system, you have to make sure things are super fast. For end-to-end latency, we record the time t1 at which the likes backend publishes a like to our real-time platform, that is, to the dispatcher machines, and the time t2 at which we have sent the like out over the persistent connection to the clients. We measure it there because you can't really control the latency outside your own data center; you have some control over it, but that boundary is the part the platform really cares about. The delta turns out to be just 75 milliseconds at p90. The system is very fast, as there is just one key-value lookup here, one in-memory lookup there, and the rest is just network hops, and very few of them.

So those are some performance characteristics of the system. Now, this end-to-end latency measurement is itself a very interesting thing: how do you actually do it? Most of you are familiar with measuring latency in a request-response system, where you send an incoming request and the same machine can measure when the response is sent out, so you can say it took this much time. In this case, multiple systems are involved: you go from the dispatcher to the front-end node and then to the client. How do you measure latency for such one-way flows across many systems? That is also a very interesting problem, and we wrote about the system we built for it using nearline processing with Samza; Samza is another technology we use at LinkedIn, and you can use it to measure latencies across end-to-end flows spanning many machines. We wrote about it at tiny.cc/linkedinlatency. I don't have time to dive into it here, much as I would love to, but I hope you get something out of reading it, and if you have a system where you want to measure latencies across many different parts of the stack, you can use something like this.
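A hedged sketch of the one-way measurement idea: each hop emits a timestamped event keyed by a unique like id into a shared stream, and a downstream job (Samza at LinkedIn; a simple in-memory join below) pairs t1 from the dispatcher with t2 from the front-end node. Hop names are invented, and clock-skew handling is omitted.

```scala
// One timestamped marker per hop, keyed by the like's unique id.
final case class HopEvent(likeId: String, hop: String, epochMillis: Long)

// Joins "dispatcher-in" (t1) with "frontend-out" (t2) and records t2 - t1.
class LatencyJoiner(record: Long => Unit) {
  private var pending = Map.empty[String, Long] // likeId -> t1

  def onEvent(e: HopEvent): Unit = e.hop match {
    case "dispatcher-in" =>
      pending += e.likeId -> e.epochMillis
    case "frontend-out" =>
      pending.get(e.likeId).foreach { t1 =>
        record(e.epochMillis - t1) // feed a p90 histogram, for example
        pending -= e.likeId
      }
    case _ => // other hops could be tracked the same way
  }
}
```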
So why does the system scale? It scales because you can add more front-end machines or more dispatcher machines as your traffic increases; it's completely horizontally scalable. The other thing I mentioned at the beginning of the talk is that we also extended the system to build presence, the technology that tells you when somebody goes online and offline. Now that we have these persistent connections, we know when they were made and when they were disconnected, so we also know when somebody came online and when somebody went offline. But it isn't that easy, because mobile devices are notorious: they will sometimes just have a bad network and disconnect and reconnect without any reason. So how do we average out, or reduce, all that noise to figure out when somebody is actually online and when they're offline, rather than jittering between the two simply because of the network? We wrote about that at tiny.cc/linkedinpresence, where we use the concept of persistent connections to understand how somebody goes online and offline, and we built the presence technology on top of the exact same platform. I hope that is also useful to you.

All right, that was probably a lot to consume in the last few minutes, so I'll try to help you remember some of it. Real-time content delivery can enable dynamic interactions between the users of your apps: you can do likes, comments, polls, discussions. Very powerful stuff, because it really engages your users. The first piece you need is a persistent connection, and there is built-in support for EventSource in most browsers and most server frameworks, plus easily available client libraries for iOS and Android. Play and Akka Actors are powerful frameworks for managing connections very efficiently; they allow millions of connections to be managed on the server side, and therefore millions of viewers to interact with each other. So everyone remember: our actors are way cooler than Hollywood actors. The principle I started this presentation with is that challenges in distributed systems can be solved by starting small: solve the first problem, then build on top of it, adding simple layers in your architecture to solve bigger challenges. That is all we did throughout this presentation. And when you hit a limit, horizontally scaling the system is usually a good idea: add a machine and distribute your work. The real-time platform I described can be built on almost any server or storage technology: you can use Node.js, you can use Python; all of these server frameworks support some way of maintaining persistent connections. And for the key-value store you can use Couchbase, Redis, MongoDB, anything that makes you the happiest, anything you're already using. Most importantly, you can do the same for your app. Real-time interactions are very powerful, and I feel that if you use some of the principles I shared with you, you can build some pretty interesting, dynamic experiences in your own apps.
Thank you everyone for attending the session. I'm a proud Indian, I work at LinkedIn in the US, and I'm so glad I got this opportunity to talk to you here at QCon London. This talk and all of its slides will be available at tiny.cc/qcon2020, I'm assuming very soon, and there is also an AMA session at 1:40 pm, happening at Guild, where you can come and ask me anything, not just related to this talk. One more thing: now that we have about nine minutes left, I wanted to check on the stream. I don't know if it's still playing. It is, and nobody said anything. Wow. All right, questions.

[Applause]

It looks like we're going to have an orderly queue; if you can just come up here, I think that's how it works.

Hello, thanks for that, very interesting. Do you have any fallbacks for clients that don't support Server-Sent Events, or do you just say modern browsers are your focus?

Great question. The question was: do we have any fallbacks if Server-Sent Events don't work? The beauty of Server-Sent Events is that it is literally a regular HTTP request; there's absolutely no difference from what a regular HTTP connection would do. WebSockets, in fact, sometimes get blocked by firewalls in certain systems, but we have never experienced a case where Server-Sent Events don't work: because it's a regular HTTP connection, most firewalls will not block it and most clients understand it.

Hello. How do you synchronize your video stream with your likes in time?

Oh, I see. I think the question is: once these likes have happened, how do you make sure that the next time somebody watches this video, the likes show up at the same time. Is that what you're asking?

Yes, and also, I guess the video streams are delayed a little bit on different servers, and your likes are happening in different places.

I see. So yes, you must have noticed here that there is a delay, and the question is: I liked at moment x, but maybe the broadcaster sees it at moment y. Yes, there is a delay. Some of it is simply natural causes, just the speed of light, and some of it is something we deliberately do so that the broadcaster can cut something off if something is seriously wrong. The good thing is that once somebody has pressed like, it shows up for all viewers almost instantaneously; you can actually try it right now, and if you press like you should see it almost immediately. So the distribution is real time, but yes, there may be a delay between when you think the broadcaster said something and when you actually liked it, and that is natural. I think there are also security reasons to do it this way.

Hi, first of all, thanks for your talk. My question is: do you have any consistency guarantees, especially in view of dispatcher failure, even across data centers?

Yes, great question. What about consistency; how do you guarantee that a like will actually get to its destination? The short answer is that we don't, because what we're going for here is speed, and not complete guarantees that something will make it to the end. Having said that, we measure everything.
We measure the cross-colo dispatch, we measure the dispatchers sending requests to the front ends, and we also measure whether something that was sent by the front end was actually received by the client. If we see our four nines or five nines falling, we figure out the cause and fix it.

Now, since you asked this question, I want to share something else: Kafka. A natural question is, why not just do this with Kafka? If you do it with Kafka, then yes, you get those guarantees. The way you would do it is that the likes backend would publish a like to a live-video topic defined in Kafka, and then each of these front-end machines would be a consumer of all the live-video topics that are currently running. You already see a little bit of a problem here: the front-end servers are now responsible for consuming every single live-video topic, and each of them needs to consume all of them, because you never know which connection, subscribed to which live video, is connected to which front-end server. But what this gives you is guarantees: you cannot drop an event anywhere in that part of the stack. You can still drop an event when sending it from the front-end server to the client, but you can detect that; in fact, the EventSource interface provides built-in support for it. There is essentially a number that tells you where you are in the stream, and if things get dropped, the next time the client connects it tells the front-end server that it was at point x, and the front-end server can resume consuming from the topic at point x. What you give away is speed, and also the fact that the front-end servers stop scaling after a while, because each of them needs to consume from every stream; adding front-end machines doesn't help with that, because each new machine still needs to consume all the events from all the Kafka topics.

Thanks a lot.

Hello. So you have 100,000 connections per machine to the clients, and some of those clients are very slow; they might not be consuming your data properly on the pipe. How do you ensure that you don't have memory exhaustion on the servers?

Great question. Notice that when the front-end server sends data over the persistent connection to the client, it is actually fire-and-forget: the front-end server does not block on sending the data to the client; it just shoves it into the pipe and forgets about it. There is no process or thread waiting to find out whether the data actually reached the client. So no matter what these clients are doing, whether they're dropping events or not accepting them because something is wrong on the client side, the front-end server is not impacted. The front-end server's job is to dispatch the event over the connection and be done with it, again because we are going for speed and not for guarantees. Yes, thank you.
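To make the Kafka-based alternative discussed in the Q&A concrete, here is a hedged sketch of why it stops scaling: every front-end node must consume every live-video topic, because any of its connections might be subscribed to any video. The topic naming convention ("live-video-*") and the routing comment are assumptions for illustration, not LinkedIn's design.

```scala
import java.time.Duration
import java.util.Properties
import java.util.regex.Pattern
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object FrontEndAsKafkaConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    // A unique group per front-end node, so each node sees ALL events:
    // this is the part that does not get cheaper as you add machines.
    props.put("group.id", s"front-end-${java.util.UUID.randomUUID()}")
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Pattern.compile("live-video-.*")) // every live video
    while (true) {
      for (rec <- consumer.poll(Duration.ofMillis(500)).asScala) {
        // route rec.value() to the local connections subscribed to rec.topic()
      }
    }
  }
}
```

The upside, as the talk notes, is delivery guarantees and resumable offsets; the downside is that each front-end machine carries the full firehose regardless of what its connections actually watch.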
Info
Channel: InfoQ
Views: 15,415
Rating: 4.9349594 out of 5
Keywords: Software Architecture, LinkedIn, Performance, Distributed Systems, Case Study, Scalability, InfoQ, QCon, QCon London, Transcripts
Id: yqc3PPmHvrA
Length: 49min 35sec (2975 seconds)
Published: Wed Nov 18 2020