ElixirConf 2021 - Jace Warren - Distributed, Scalable, Fault-Tolerant Video Streaming

Captions
Hi everyone, my name is Jace Warren and I'm currently a software engineer at Savvy Solutions. It's an honor to have the opportunity to speak with you, and today I will be talking about distributed, scalable, fault-tolerant video streaming with Riak Core, Elixir, and Kubernetes. That's quite a mouthful, but again, a big thank you to ElixirConf and to my company for giving me the opportunity to do this. Let's jump into it.

To quickly tell you a little bit about myself: I recently moved to Dallas, which is now my home. It's a beautiful place and I love the warm climate. As far as my experience goes, I started working in Elixir three years ago, and even though I've been in the industry for ten years, I really love the language. It's beautiful, it's fun to work with, and I haven't looked back at any other languages. I don't do very much on social media, but if you would like to find me, here are some of my links.

To give a little background about us as a company: we are a small engineering team that does a lot of things centered around video. We provide the customers you see here, and many more, with a video management system which enables customers to easily access, download, and play all of their location video in the cloud. We also do interesting things with that captured video, like tying it to point-of-sale data to provide high-risk transaction audit reports. We provide an event search which allows customers to filter and view video clips related to major events throughout their day, and we do a lot of other fun and exciting things as well. If any of you are looking for work, we are hiring, and if any of the items listed here look challenging, feel free to reach out to us.

So why even build a video streaming platform to begin with? There are many third-party providers with much more funding and far more engineers, set up to do a much better job than us. Just to name a few: Mux, which has an incredible presence online; AWS Elemental MediaStore; a company called Video Experts Group; and many others with great platform APIs already in place and ready to go. Also, video streaming at scale is notoriously difficult and can be very error prone; things can get very resource intensive and expensive very quickly. And, as mentioned earlier, we are a small team and our engineers are already pretty leveraged as it is.

The answer to that question, as to most questions, is money. I'm going to be a bit vague here, but we've been very fortunate as a company: we have a great overall team and have experienced tremendous organic growth. That growth, coupled with some very promising partnerships, was stacking up in such a way that we would need to build a video streaming platform that could support hundreds of thousands of concurrent camera streams in a relatively short time frame. So why the need to build this video streaming platform ourselves? Unfortunately, at that scale, every third-party provider proved too expensive for our business use cases. As a result we had to try something else, and fortunately, after running cost analysis after cost analysis against various cloud providers and tech stacks, and building proof of concept after proof of concept, we found a promising way to build the video platform our customers will need at the necessary price point. And then, to reference my manager,
there's also the programmer's credo on why we do things: we do things not because they are easy, but because we thought they were going to be easy. Which is very much how this process has gone for us.

Now I want to talk a little bit about the world of video streaming. In this world, especially if you work with surveillance and closed-circuit television cameras, it is not often that video will play straight out of the camera in a device's video player. Why? In most cases, the video stream chunks transmitted out of the camera are not natively supported for playback in browsers, mobile devices, computers, and televisions. As a result, there are usually multiple stages at play. Initially there is what's called the first-mile delivery stage, which involves capturing content from a camera to an encoder, which in turn transmits the video over a network to a series of intermediary processes or servers. From there, there is usually a series of pipeline operations performed on the video, including transcoding, transmuxing, producing multiple bitrate outputs, captioning, thumbnail generation, CDN population, and more, in which the video is ultimately prepared for device-compatible playback. Once the video is prepared for device-compatible playback and the content gets transmitted to the end viewer, that is known as the last-mile delivery stage.
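As a purely illustrative sketch of that middle, pipeline stage, the processing steps can be pictured as a chain of functions. The module and function names below are hypothetical placeholders, not something from the talk; in a real system each step would wrap an external encoder, packager, or storage service.

```elixir
defmodule VideoPlatform.Pipeline do
  # Hypothetical sketch of the pipeline operations described above.
  # Each step is stubbed out; real implementations would call out to
  # external tooling or services.
  def process(segment) do
    segment
    |> transcode()            # convert the camera codec into a playable one
    |> package_hls()          # transmux into device-compatible HLS chunks
    |> generate_renditions()  # produce multiple bitrate outputs
    |> generate_thumbnails()  # extract preview images
    |> publish_to_cdn()       # push prepared segments out for last-mile delivery
  end

  defp transcode(segment), do: segment
  defp package_hls(segment), do: segment
  defp generate_renditions(segment), do: segment
  defp generate_thumbnails(segment), do: segment
  defp publish_to_cdn(segment), do: segment
end
```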
Now let's talk a little bit about streaming protocols. A streaming protocol is a standardized method of delivering different types of media over the internet; essentially, a video streaming protocol sends chunks of content from one device to another. Streaming protocols are often used at the first- and last-mile delivery stages we saw in the last slide. In building a scalable video streaming platform, we had to evaluate the pros and cons of various streaming protocols and choose the right ones for our business use cases.

For example, HLS was developed by Apple and will play back on most if not all devices; it's also arguably the most widely used last-mile protocol. However, compared to other last-mile protocols it can come with increased latency. There is also MPEG-DASH, also mostly used in the last-mile delivery stage. It was developed as an open-source, industry-standard alternative to HLS, but it's not supported by iOS or Apple TV. Next we have RTMP, mostly used in the first-mile delivery stage by companies such as YouTube, Twitch, Facebook, and more. However, because it was developed by Adobe, playing it in the browser requires Adobe Flash, which is no longer supported in major browsers, so it's not a great last-mile candidate. There is also RTSP, which is great for first-mile delivery because it remains the standard for most surveillance and closed-circuit television cameras, but Android and iOS devices don't have RTSP-compatible video players out of the box. We also have SRT, the new kid on the block, which competes directly with RTMP and RTSP as a first-mile solution but is still being adopted as encoders, decoders, and players add support. And finally we have WebRTC, which will work on all major browsers but was designed for video conferencing rather than the type of scale we would need. There are others, but they are not included in this presentation for brevity.

So which approaches did we end up going with in our video streaming platform? Really, a mixed bag. As far as the first-mile delivery stage goes, RTSP made a lot of sense for us because it remains the standard for most surveillance cameras and NVR devices, and we also have open-port and third-party proxy scenarios which enable our platform to reach out and pull footage directly. The RTMP push protocol made a lot of sense for us as well because, as my colleague Austin Hammer mentioned in his 2020 ElixirConf talk, we have a fleet of edge devices sitting behind our customers' firewalls, and RTMP enables those devices to push footage out from behind the firewall into our video streaming platform. As far as last-mile delivery goes, even though HLS can have increased latency, it makes the most sense here because it will play back on most if not all devices. We also found that we can get HLS latency into an acceptable range for our customers by tweaking camera settings such as the keyframe interval.

Now that we have had a refresher on the world of video streaming, let's talk a little bit about Elixir and what makes it such a great fit at the application layer, the pipeline processing layer. For starters, our team was already using Elixir, and we are big fans of its Ruby-like syntax, pattern matching, and functional nature. Elixir is scalable: in our video streaming platform we had to plan for literally hundreds of thousands of concurrent video stream processing pipelines. Elixir scales very nicely vertically, by being able to run hundreds of thousands of processes on a single machine, and it scales very nicely horizontally, because those processes can also communicate with processes on different machines in the same network and coordinate video pipeline work across multiple nodes. In video streaming, customers expect the video to always be up and accessible; Elixir provides supervisors, monitors, and more, which help us monitor and restart these video pipelines back to a known initial state that gets the process back into working order, so it's very fault tolerant. And finally, Elixir runs on the Erlang VM, which gives developers complete access to the Erlang ecosystem, meaning we can import and use a library like, say, Riak Core.

Riak Core: what is that? Many who hear the word Riak immediately think of the distributed NoSQL key-value datastore of the same name. However, Riak Core is actually a distributed systems framework that forms the basis of how Riak, the database, distributes data at scale. This means that, similarly to how developers use the Phoenix framework to build web applications, developers can use Riak Core to build a distributed system. It provides the basic building blocks for distributed services: consistent hashing, routing, sharding, replication, distributed queries, and more. What makes Riak Core especially interesting is that it implements ideas from Amazon's Dynamo architecture and exposes that infrastructure as a reusable library, allowing easy adoption in any context that can benefit from decentralized distribution of work. For example, we use Riak Core as a distributed video process handling system: Riak Core ensures video pipeline work is always evenly distributed and handled by the same nodes across our distributed cluster. It also helps to know that when you use Riak Core you are in good company; it is battle tested and widely used by the companies you see here, and it doesn't hurt that it has near-linear scalability.

Everything in Riak Core revolves around a circular key space, a.k.a. the ring, which is what we see here. You'll notice the ring is divided into a fixed yet configurable number of partitions, also known as virtual nodes, or vnodes for short. A vnode is nothing but an Erlang process responsible for one partition of the ring, not unlike a GenServer. A physical node, represented by the different colors in the image, can and usually will host more than one vnode at a time. It is important to bear in mind that vnodes are not bound to a particular physical node; they can be relocated to other nodes, as new nodes are added to or existing nodes are removed from the system, via handoff. Riak Core uses consistent hashing to map keys to specific positions in the key space. This mapping determines which partition is responsible for a key, and hence also which vnode is responsible for that key. Part of the strength of Riak Core is how it enables scalability with small operational effort.
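As a rough illustration of that key-to-vnode mapping, here is a minimal sketch of how an application built on riak_core typically routes a command for a given key to the vnode that owns it. The bucket name, service atom, and vnode master name below are hypothetical, application-specific assumptions; the riak_core calls themselves are the library's standard API.

```elixir
defmodule VideoPlatform.Router do
  # Hash a key onto the ring, look up the primary vnode that owns that
  # position, and send it a command. Assumes the application registered a
  # riak_core service (:video_platform) and a vnode master
  # (VideoPlatform.VNodeMaster) at startup.
  def route(camera_id, command) do
    # Map {bucket, key} to a position on the 160-bit ring.
    doc_idx = :riak_core_util.chash_key({"cameras", camera_id})

    # Ask for the preference list: the primary vnode for that position
    # (an n_val of 1 here, i.e. no replicas).
    [{index_node, _type}] =
      :riak_core_apl.get_primary_apl(doc_idx, 1, :video_platform)

    # Synchronously send the command to that vnode, wherever it lives.
    :riak_core_vnode_master.sync_spawn_command(
      index_node,
      {command, camera_id},
      VideoPlatform.VNodeMaster
    )
  end
end
```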
In terms of handoff, there are really two types of handoff that can happen here. There is ownership handoff, which happens when a physical node joins or leaves the cluster; in this scenario Riak Core reassigns which physical node is responsible for each vnode and executes a handoff to move the vnode data from its old home to its new home. Then there are hinted handoffs, which provide vnode redundancy: when the primary vnode for a particular part of the ring is offline, Riak Core still accepts operations on it and routes them to a secondary vnode. When the primary vnode comes back online, Riak Core uses handoff to sync the data from the secondary back to the primary, and once the primary is synchronized, operations are routed to it once again. It's also worth noting that ring sizes can be changed dynamically. When the ring size is changed up or down, the number of partitions in the key space goes up or down too, and Riak Core will figure out how to move the vnode data around your cluster members as it conforms to the new partitioning directive, using the resize handoff type to achieve this.
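To make handoff a little more concrete, here is a trimmed sketch of the handoff-related callbacks a riak_core vnode implements. The module name and state shape are hypothetical (the talk does not show its vnode implementation), and a complete vnode also implements init/1, handle_command/3, handle_handoff_command/3 (which folds over the vnode's data), and several other callbacks omitted here.

```elixir
defmodule VideoPlatform.StreamVNode do
  @behaviour :riak_core_vnode
  # Hypothetical vnode whose state looks like %{data: %{key => value}}.
  # Only the handoff-related callbacks are sketched; the remaining
  # riak_core_vnode callbacks are omitted for brevity.

  # Serialize one entry before it is shipped to the vnode's new owner.
  def encode_handoff_item(key, value), do: :erlang.term_to_binary({key, value})

  # On the receiving node, decode each shipped entry and fold it into state.
  def handle_handoff_data(binary, state) do
    {key, value} = :erlang.binary_to_term(binary)
    {:reply, :ok, %{state | data: Map.put(state.data, key, value)}}
  end

  # If the vnode holds no data, Riak Core can skip handoff and delete it.
  def is_empty(state), do: {map_size(state.data) == 0, state}

  # Called once all data has been handed off and the vnode can be dropped.
  def delete(state), do: {:ok, %{state | data: %{}}}
end
```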
Now let's talk a little bit about Kubernetes: what is it, and why is it useful within video streaming? For those of you who aren't that familiar with Kubernetes, also known as k8s or kube, it's an open-source container orchestration platform that automates many of the manual processes involved in deploying, managing, and scaling containerized applications. This means you can essentially take a pool of servers, VMs, or hardware, turn it into one big resource pool, and use the Kubernetes API to deploy your containers onto it.

So what about Kubernetes specifically is useful for video streaming? Just to name a few things: it's really nice that Kubernetes is cloud agnostic, meaning we can build and run our cluster the same way on virtually any cloud provider. For Elixir applications there is an extremely useful abstraction called StatefulSets that I will get into in a moment. If you're not familiar with Services in Kubernetes land, Services can be external or internal load balancers for routing traffic to your apps across your Kubernetes nodes. What is particularly useful here, as far as video streaming goes, is that you can use different load-balancing algorithms. For example, if you're routing an RTMP TCP connection and passing it through to an Elixir application, you can set the internal load balancer within your Kubernetes cluster to use a least-connections algorithm rather than, say, round robin, which is a much better fit in that scenario. Next up is Ingress; I'll go into more detail about it in a second, but there is a lot of power worth mentioning there. And as we can see here, Kubernetes also provides useful tools like volumes, autoscaling, and these things called Helm and operators. I've listed quite a few things here, and really we're just touching the tip of the iceberg.

Getting into pods, deployments, and StatefulSets: I'm not going to go too deep here, just for time, but pods are basically the smallest deployable unit of computing that you can create and manage in Kubernetes. In terms of Docker concepts, a pod is similar to a group of Docker containers with a shared namespace and a shared filesystem. Usually you don't need to create pods directly; instead you create them through deployments. Deployments are basically a declarative way to define pods and ReplicaSets, with ReplicaSets specifying how many instances of a pod you want running in your Kubernetes cluster. On top of that you have this really interesting concept of StatefulSets. A StatefulSet is the workload API object used to manage stateful applications; it manages the deployment and scaling of a set of pods and provides guarantees around the ordering and uniqueness of those pods. Whereas with a typical deployment your pod gets a random unique IP and a random name, with StatefulSets everything is predictable, orderly, and deterministic, which is very nice when we're trying to connect Elixir nodes, as we'll see in a second.

All right, so, just to give a high-level overview of the power of Ingress. One thing that is really cool is that, as I mentioned earlier, Kubernetes is cloud agnostic, but thanks to the APIs defined when you deploy your Kubernetes cluster, it will auto-provision a load balancer in whatever cloud provider you specify, depending on how you set it up, which is really nice. You can even, in some circumstances, name those load balancers, so you could have zone one, zone two, and so on. When setting up those load balancers, you can write your configuration once and it will deploy the same way across multiple cloud providers, and you can select your round-robin or least-connections routing algorithm, which is nice, especially for RTMP TCP connections. Another cool thing about Ingress is that within your markup you can define unique routes to all of your different applications, and you can pair it with cert-manager, which will communicate with Let's Encrypt and generate SSL certs so it can handle all of your SSL termination. It's a very powerful tool to have in your video streaming platform.

Next up is volumes, and this gets kind of interesting. Like I mentioned earlier, volumes in Kubernetes land are similar to volumes in Docker land, but you can also set them up in such a way that they survive restarts in your Kubernetes cluster. Let's say you have an Elixir application that starts up, accumulates some state, and then fails: when Kubernetes goes to restart that application, the volume mounted to it will still be there and accessible. One thing that's also really neat is that two containers can share the same volume: you can have one container writing to disk and another co-located container reading that same disk and producing something else with it.
One other thing that's kind of neat is that your volumes can be completely in-memory, which can yield some interesting performance gains.

Then there's autoscaling. If you want to dynamically scale your Kubernetes cluster both vertically and horizontally, you can. The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment, ReplicaSet, or StatefulSet based on observed CPU utilization; basically, you put in some constraints based on CPU and RAM, and it will create replicas of your app to keep up with demand within your cluster. Then there's also the Cluster Autoscaler, which increases the size of the cluster itself: say you have pods that failed to schedule on some of your nodes due to insufficient resources; this autoscaler can bring up a new VM on which those pods can schedule and run. There are really some great options here. For example, say you have variable bitrates coming through your video streaming platform: at night there's not as much movement in the video frames, so the bitrates are lower because less changes between keyframes, and in that case you just don't need as many VMs running. You can configure all of this to be managed dynamically for you.

Okay, so this is where Kubernetes gets crazy. For those of you who don't know what Helm or operators are: Helm is essentially a package manager. With Helm you can install charts, which help you define, install, and upgrade even the most complex Kubernetes applications, like databases, message brokers, and more; you can do a helm install of an entire database. Operators extend the functionality of the Kubernetes API to create, configure, and manage instances of complex applications. What that basically means is that operators give you the ability to define and create anything you can think of and tie it into Kubernetes lifecycle events. I believe at one ElixirConf I remember seeing someone create a to-do list with operators, which was really incredible.

For example, in our cluster we have found great use in installing and leveraging the kube-prometheus-stack Helm chart. This chart does quite a few things, like installing Grafana and Alertmanager, but it also installs a Prometheus operator. What the Prometheus operator does is create, configure, and manage Prometheus database clusters inside of Kubernetes. This means Kubernetes has been taught what a Prometheus database is, and you can basically tell your Kubernetes cluster to dynamically spin up and tear down Prometheus databases, which is really dynamic and very powerful. Also worth mentioning: thanks to all the hard work of Alex Koutmos (I'm sorry if I mispronounce that) and his PromEx library, all you have to do is install and configure PromEx and leverage telemetry, and with our kube-prometheus-stack we now have complete visibility into our Kubernetes cluster nodes, pods, and Elixir applications with pretty minimal effort.
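As a rough sketch of what that PromEx setup typically looks like, you define a PromEx module, pick the plugins and dashboards you care about, and add the module to your supervision tree. The application name and plugin choices below are assumptions for illustration, not taken from the talk.

```elixir
defmodule VideoPlatform.PromEx do
  # Hypothetical PromEx configuration for an app named :video_platform.
  use PromEx, otp_app: :video_platform

  @impl true
  def plugins do
    [
      # Built-in plugins that expose application and BEAM telemetry
      # as Prometheus metrics; add Phoenix/Ecto/etc. plugins as needed.
      PromEx.Plugins.Application,
      PromEx.Plugins.Beam
    ]
  end

  @impl true
  def dashboards do
    [
      # Pre-built Grafana dashboards shipped with PromEx.
      {:prom_ex, "application.json"},
      {:prom_ex, "beam.json"}
    ]
  end
end

# In the application supervisor:
# children = [VideoPlatform.PromEx, ...]
```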
So, tying it all together: these are some patterns we found for using Kubernetes, Elixir, and Riak Core together and getting everything to work as expected in our environment. What we have done is use a StatefulSet instead of a deployment for deploying our Kubernetes pods and containers, join our ring via a connector GenServer, which I'll get to in a second, and leave the Riak Core ring via what's called an Erlang signal handler, which I'll also explain in more detail.

One of the first things we want to do is ensure our Elixir application is no longer configured as a Kubernetes deployment but instead as a StatefulSet. The wonderful thing about StatefulSets is that they give our application some guarantees within Kubernetes. No longer will Kubernetes spin up our Elixir applications with random unique IDs and network identifiers; instead, applications will be started with stable, unique network identifiers. Our applications can now have stable persistent storage which persists across rolling updates, and we also get ordered, graceful deployments and ordered, automated rolling updates. What this means is that we can now reasonably query the Kubernetes API server, determine which of our applications is the oldest, and sort our applications lexicographically. This means we can always be certain which Elixir app is the best ring node to join within Riak Core, and having a predictable ring node helps us avoid netsplits, in which two separate Riak Core clusters of two or more nodes cannot join each other.

Also worth noting: within a StatefulSet, it takes some time to shut down a Riak Core application, because it has to hand off all of its vnodes to another node before shutting down. So what we want to do here is update terminationGracePeriodSeconds, which tells Kubernetes how long to give an application to shut down before actually sending a SIGKILL. Like I mentioned, we want to set a value that gives our Riak Core application ample time to hand off its virtual nodes and shut down; otherwise the state in those virtual nodes can be lost forever.

Next up is an approach we found extremely useful in allowing our Elixir application to join our Riak Core ring. What we ended up doing was setting up a connector GenServer process that is supervised by the application supervisor of our Elixir app. Upon starting up, this connector GenServer queries the Kubernetes API server to get a list of existing Riak Core applications it can join. Because we set up these applications via Kubernetes StatefulSet config files, we can filter our applications by the oldest and easily determine which ring node we want to join, so as to avoid netsplits. Once we have selected the ring node, it is as simple as calling :riak_core.join, :riak_core_claimant.plan, and :riak_core_claimant.commit. What is essentially happening here is that when we call join, we are letting the existing ring know that we are here and would like to join. By calling plan, we are telling Riak Core to build a new plan for redistributing the vnode partitions across the new node in our cluster. Occasionally the Riak Core ring might be busy processing another node, depending on how many nodes we're trying to roll out at once, and in that scenario we simply wait, via a delayed message send to the process, and try again. Once the plan is finally ready, we commit, and upon commit Riak Core gets busy and starts redistributing vnodes via handoff across the nodes that make up our Riak Core cluster.
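Here is a minimal sketch of what such a connector GenServer might look like. The Kubernetes discovery step is stubbed out (the talk queries the Kubernetes API server for the StatefulSet's pods; how you make that HTTP call is up to you), and the module and node names are hypothetical; the riak_core calls are the ones named in the talk.

```elixir
defmodule VideoPlatform.RingConnector do
  # Sketch of a GenServer, supervised by the application supervisor, that
  # joins this node to the Riak Core ring on startup and retries while the
  # claimant is busy with another node.
  use GenServer

  @retry_ms 5_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    send(self(), :join_ring)
    {:ok, %{}}
  end

  @impl true
  def handle_info(:join_ring, state) do
    # Hypothetical helper: query the Kubernetes API server for the pods in
    # our StatefulSet, sort them lexicographically, and pick the oldest node.
    target = oldest_ring_node()

    case :riak_core.join(target) do
      :ok ->
        send(self(), :plan_and_commit)

      {:error, _reason} ->
        # e.g. the target node is not reachable yet; wait and retry the join.
        Process.send_after(self(), :join_ring, @retry_ms)
    end

    {:noreply, state}
  end

  def handle_info(:plan_and_commit, state) do
    # The claimant may still be busy processing another node; if planning or
    # committing fails, wait a bit and try again, as described in the talk.
    with {:ok, _changes, _next_rings} <- :riak_core_claimant.plan(),
         :ok <- :riak_core_claimant.commit() do
      :ok
    else
      _busy_or_error ->
        Process.send_after(self(), :plan_and_commit, @retry_ms)
    end

    {:noreply, state}
  end

  defp oldest_ring_node do
    # Placeholder for the Kubernetes API lookup described in the talk.
    :"video_platform@video-platform-0.video-platform-headless"
  end
end
```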
Just as a precaution, we also schedule the connector GenServer to occasionally diff the existing ring against any deleted StatefulSet applications, to ensure there are no orphans.

Finally, we need to make sure that upon shutdown, Riak Core successfully hands off all of its virtual nodes before stopping. One approach we have found useful is to use the Erlang gen_event behaviour, which consists of a generic event manager process with any number of event handlers that can be added and deleted dynamically. Once we swap the default signal handler with our own gen_event handler, we can intercept and receive all SIGTERM events sent by our Kubernetes cluster. This means that when Kubernetes is performing a rolling update and gives our application the signal to shut down, we can intercept that signal and manually trigger a Riak Core leave, in which our application hands off any existing vnodes, and once they are successfully handed off, it shuts down.

With this functionality in place, when everything finally comes together, Riak Core works beautifully. For a long time before setting up Riak Core we were actually using Horde, and unfortunately, even though we were leveraging the delta CRDT, intercepting SIGTERMs, and trying to give it enough time to offload state before shutting down, our distributed cluster would still drop video streaming processes during rolling updates. We actually ended up having to build a boot process that, upon every fresh deploy and rolling update, would ensure that each video streaming process was still up and running. We also noticed that if we tried to deploy more than two applications at a time, Horde could get into a permanently broken state. However, once we replaced it with Riak Core, as long as we don't experience any SIGKILLs, our application keeps our distributed video streaming processes up and running indefinitely without dropping a single process, which is incredible. But, disclaimer: all the problems we experienced with Horde were quite a while ago, and I've since heard and read that those issues have been fixed. Riak Core has been working so well, and we have been so busy, that I haven't had time to go back and validate. I am still excited about the Horde project and I do want to take it for another spin at some point; I just haven't had the time.

So that is pretty much everything. I want to say thank you to ElixirConf, and thank you to Savvy Solutions, my employer, for allowing me the time to speak. And again, I want to mention that we are hiring.
Info
Channel: ElixirConf
Views: 920
Keywords: elixir
Id: HGm4rLup0fw
Length: 32min 23sec (1943 seconds)
Published: Sun Oct 24 2021