Going FaaSter: Function as a Service at Netflix

Captions

We're good? Cool. So, Netflix. I don't have any free shows today, but I swear there's a free subscription; it's called a free trial. Go sign up, it's good. Anyway: a hundred and thirty million subscribers. Just a little bit about us: we're in 190 countries, something crazy. 140 million hours of streaming content every day, most of it from my household, apparently. So it's good.

Our biggest thing is that you can watch Netflix on basically any device with a screen and an internet connection, and that's largely our promise. Over the years we've developed some really nice, responsive user interfaces. This is the latest iteration, where everything is responsive and videos play in real time. This is probably what you're familiar with if you're a subscriber. Does anyone remember this one? This is 2009; this is what the site looked like, so it's a little bit different.

The question is how we got from this 2009 view to this one. How do we go from that really dated view of the site to this? We do a lot of A/B testing to get there. We don't just make changes to the site randomly; we actually A/B test everything. Every little change you make on the site, every little product design improvement, anything that happens to the algorithms: they're all A/B tested. In fact, we do a lot of A/B testing at Netflix. When I say a lot, I mean really a lot, like thousands of A/B tests a year, each with ten or twenty variations. So there are a lot of changes happening to the service at one time.

As most of you know, the service is made up of a bunch of microservices on the back end. Here's a live view of Netflix traffic in one of our data centers, and there are a lot: some 200 services make up the Netflix data graph. Each of them is responsible for a small part of Netflix, not all of it. And so
the question I want to pose to the audience is: how do the clients, of which we have quite a few out in front, talk to all the back-end services? Do they all talk to each other at once? 200 services, lots of devices: probably not the best idea. We thought so too. So instead we have something called the Netflix API, which both the back-end services and the front-end devices talk to. This gives us the flexibility to integrate the front end and the back end without each of them having to make changes for the other, and you'll see what I mean in a second. The API decouples clients from the back-end services, and it provides an integration point for both services and clients.

What's really important is that we use something called the BFF pattern, the backend-for-frontend pattern. Who's heard of that? Quite a few folks, great. The big premise is that you have a client, and you have a service in front that serves that client, and these are very tightly coupled. If you imagine the Netflix experience, your iOS app looks very different from what you see on your TV, which looks very different from what you see in your browser. So you want these things to be tightly coupled, so that when client engineers are pushing code to the devices, they're also pushing their services in lockstep. Using the Netflix API we can also abstract away the rest of the Netflix back end, so you can make individual changes to your client and your service without affecting the rest of the Netflix stack, and move at a higher velocity.

These BFFs, like we said before, are maintained by UI teams, since they're tightly coupled to their UI. What this means is that client engineers don't necessarily have the expertise to run services at scale. In fact, we hire them not to do that; we hire them to write responsive user
interfaces. And it's really hard to write and maintain services at scale generally, even if you're a back-end engineer. So I want to talk a little bit about the API requirements and how they fit in with function as a service and JavaScript. We want really high velocity for these BFFs, since we're making two or three changes a week to each and every client. They need to be really reliable: if one of these things is down, then the entire client is unavailable, and that's not good for our customers. It needs to be really easy to use, and we want to abstract away the operations, because again the target audience is client engineers who have no experience writing services. I think this is where function as a service, JavaScript, and Node really come in handy. We'll talk today about how we're building a function-as-a-service platform at Netflix, and how we use it to unlock the ability for client engineers to run their own services that are highly available, low latency, and super performant.

Before we talk about function as a service, I want to talk a little bit about the evolution of function as a service, and cloud computing in general, in the industry. In the very beginning, everyone owned their own data centers. Before the cloud, before EC2, you owned the literal data center itself, the walls and the machines. You also owned the services platform, the application, and the business logic. Somewhere in the 2000s, AWS came along, and now you were able to abstract away the data center itself using infrastructure as a service. Instead of having to build your own data center and get your own infrastructure, you could just provision it using EC2. But you were still responsible for the services platform, the application, and your business logic. Then we saw the industry move to platform as a service, so now we
abstract away the services platform as well. But again, you still own the application and your business logic, and you still have to own the operations side of it. And then finally, today, I think a new paradigm is emerging with function as a service, where we abstract away everything. Everything is managed for you except the business logic that you want to write. This is helpful to think about because it's really important for our client engineers: they don't really care about any of the infrastructure, the operations, or the platform. They just care about the business logic that's specific to their app. I think this probably finds a lot of common ground with some of you here who do that kind of work.

When we were thinking about building versus buying our own function-as-a-service platform, a few years ago, a lot of questions came up. One of the biggest was: how come you aren't using some third-party function-as-a-service solution? At the time, there wasn't really anything on the market that could support our scale, integrate with our ecosystem, or that was built for request-response services. Most of the offerings out there were really for event-driven workloads. So we decided to build our own.

We'll cover the platform today in three sections: the runtime platform architecture, the developer experience, and the management and operations of the function platform. So let's talk about the runtime platform architecture. Before we talk about the function-as-a-service platform, I want to go back to this diagram, because I think we need to talk about how we build infrastructure and platform in general at Netflix. Starting with infrastructure as a service: as most of you probably know, we're completely hosted in the cloud using AWS, and EC2 is the common base level of infrastructure that
we use to run every single service that exists. However, as we were building the function-as-a-service platform, we decided that instead of just using VMs, we would use containers as the foundation. The reason we decided to do that was that it gave us a bunch of advantages, which made sure our platform was ergonomic, was efficient, and gave us really great developer velocity. Here are some examples. Containers start much, much more quickly than your classic VM would, so we can get deployments and startup times on the order of seconds instead of minutes. They're really portable across multiple environments, so you can take a Docker image and run it locally, or run it remotely in the cloud, or vice versa, and that's really helpful for debugging, as an example. And for our cloud infrastructure costs, we were able to bin-pack more efficiently using containers, because we could do the scheduling ourselves on top of the EC2 instances we already owned.

As a result of our decision to use containers, we built a technology called Titus, which is our own container management platform. It's a really high-scale, open-source platform that allows us to schedule and launch millions of containers per day. I encourage you to check it out. It's really its own talk, so we're not going to get into the details here, but it is the substrate that allows us to run containers at scale at Netflix.

So let's talk about the platform: what does it take to actually run an application at Netflix? We've created our own reliable, open-source services platform. Some of you have probably already seen it or heard of it; again, it's open source and online. It's really a set of components that you need to be able to run a reliable service: things like service discovery, RPC, configuration, metrics, et cetera. There's a whole bunch of stuff that's built in, really
too many to list. All of them are open source, and you can check them out. Sometimes I get asked the question: why are you building function as a service if you already have this really robust platform? Why can't teams just assemble all this stuff on their own and use it? Well, it's kind of like going to IKEA when you want to buy a chair. You see the chair at the store and think, this is a great chair, I'm going to take it home. And it comes home as a pallet of parts, and you have to assemble those parts yourself. It's like that with services. We have this great store of really robust platform pieces, but you have to assemble them yourself as a team, and often you do it wrong. This is what we're trying to avoid. We have two or three hundred engineers who are going to be building these services, and we want to spare them that pain.

Some of the disadvantages: you have to keep the components updated to the latest versions yourself. We know, for example, that any time you update your application and change one of the modules or dependencies, one, it's painful; two, it's not something you really want to be doing; and three, there's always the likelihood that you could introduce bugs. You also have to make sure that the metrics and dashboards are created for your service, and none of these things are automatically generated, so again, getting visibility is up to you. And ultimately, because you own the infrastructure and the platform, you're on the hook for managing and operating the infrastructure. None of these things is ideal for the use case we're looking at, which is that we have client engineers who need services but don't want to have to operate them. You shouldn't have to start from scratch every time when literally all you want to change is the business logic. So let's look at the function-as-a-service platform, and
the one that we've built. Here are some of the requirements that we have: no assembly should be required, all the parts update themselves automatically and reliably, you get observable metrics and visibility into the service, and the operations are managed for you.

Ultimately, the overview of the platform is that it's a services container that we've preassembled with all those components we just talked about, ready for a production service. We start with a Docker container, and inside of it we drop a whole bunch of daemons that we need to be able to run a service reliably: things like discovery, metrics, and log rotation. We also bring in Node, because the function-as-a-service platform runs JavaScript. Inside of our Node process, we have a bunch of libraries that we load for you as the application developer, so everything you need to run a service reliably. Then we have an HTTP server, and really, the only things inside of there that you have to worry about are the ones highlighted in red, which are your route handlers. Everything else outside of those little red boxes is already preassembled for you. I think this is a really powerful idea of function as a service, and serverless in general: you're abstracting away all the common pieces, so that all that's left is the differentiating part, which is your business logic.

We also package and version the platform as a single entity. Instead of those 15 different components that you'd have to keep up to date yourselves, we do it for you and release them as a single version of our function-as-a-service platform. What that means is we can upgrade the fleet uniformly, and we can make sure that everything is well tested, so that every time we do a platform upgrade our customers aren't affected. And because we control the runtime, we can make sure that the platform emits a
consistent set of metrics for every function that's installed in the platform. So again, when you go and operate the system, all the metrics and alerts are already set up for you. We set out to build a FaaS platform to try to solve these issues, and I think we did, in terms of not having to assemble anything yourself, getting automatic updates, and having observable metrics. We'll talk about the managed operations aspect in the third section, when we talk about operations.

So let's talk about the developer experience. This is also really important: how do you actually develop on this platform, in this new serverless, function-as-a-service paradigm? We can talk a little bit about the development API. Functions are managed by our configuration API. Most of these fields are optional, but they give you a way to declaratively state what your API looks like. There are things like the name of your service and the platform version. What's really important, I think, is the ability to declare your own functions: you can declare their HTTP verbs as well as their sources. We give you the ability to add additional source code that you might want to bring in as part of those functions. And then there's stuff like configuration and lifecycle management, which are really important. That gives you a way to, one, configure your app, but also introduce additional things into your functions at the beginning of process start, which we'll get into in a little bit.

And then the business logic itself: the functions are literally just Connect-style middleware, which everyone here should be super familiar with; otherwise you're at the wrong conference. They're literally your standard Express, Restify, whatever middleware that everyone is super familiar with, and you just have to export one of these out of your JavaScript files, which we import. And so
this is a vanilla example of it, but it can get pretty complicated if you want to add additional business logic. One of the nice things about our platform is that, by default, all of the platform components, such as your metrics agents, your loggers, and your RPC clients, are initialized and ready to go when your function is loaded. You don't have to initialize them, so by the time your function executes, everything is there and ready to go. Again, this is really important, because you no longer have to do the setup. Back to the analogy of the chair: you could assemble it wrong, and here everything is just set up for you. This also improves the developer velocity that we're looking for.

Here's another example, where you may want to bring additional components into your functions. Traditionally, you would do that in your regular app, because you have access to index.js, or whatever your main function is, and you can start things there. Because you can't in this environment, we give you the ability to do that via startup and shutdown hooks, where you get access to the function environment. You can start any set of additional components that you want, and those live for the rest of the lifetime of the process. These are good ways to add things like database drivers, additional loggers, or whatever else you need. The hooks are initialized before the platform starts, and they have access to all the platform components, so it's really easy to integrate any additional libraries that you need. Additionally, any external dependencies can be imported from npm, which is really nice. As a developer you usually use a lot of other people's code, so all of that can be imported into your functions.

Our goal here was also to create a local function development experience that improves the
software development lifecycle for developers as well. We have this great runtime API for you to write code against; how do you actually develop on your local desktop? What we've done is create a workflow toolkit called NEWT, which simplifies and facilitates common developer tasks. The typical example I would give: you start at a new company, and what's the first thing you have to do before you start coding? You have to install your entire environment and all the dependencies. This tool helps with one-click setup to get your developer desktop consistent with everyone else's, and it installs all of the dependencies that you might need to start developing.

We've also created a development function-as-a-service platform for local development, so now you can iterate on and test your functions in seconds, reducing friction and increasing velocity. The way that works is you have local functions that you start developing inside of a git repo somewhere. What we do with Docker is drop a local container on your desktop with the entire FaaS platform loaded. Instead of having to rebuild the Docker image every time, which takes forever, you do a live reload across your host and the guest Docker container. That way, when you make changes to your code, they get file-synced into the local container, which restarts, and then you have access to the service as it's running locally. This is really nice because you can now look at logs, attach debuggers, and hit the endpoints. Everything is there locally, and you get this truly native experience.

In addition to being able to work on stuff locally, you can also integrate your local container right in with the Netflix cloud. A lot of our devices, for example, need their SSL terminated, or they need authentication, or they need DRM decryption, and we have
services in the cloud that do that. So we can route device traffic into the cloud and back out to your local container for you to test on, and that routes back into the rest of those back-end services for you to integrate against.

But sometimes you also want to be able to test functions in isolation, without having to connect to or depend on upstream or downstream services. How do we enable that use case? We let you test your isolated local functions by providing mocks and a unit test API. The hard thing about unit tests is that a lot of the components you'll need generally require you to talk to a downstream service, so we've provided mocks for these services that let you easily mock them out and run these tests in isolation. Here's an example where we've got a whole bunch of mocks coming in, and now you can test your functions in isolation without having to make network connections. This development platform can also be easily deployed to Jenkins, which unlocks the CI/CD aspects of it for teams, so they can have nice CI pipelines set up to make sure that we're improving quality and reducing bugs.

Lastly, we'll talk about the management and operations aspects of the platform. This is where we really want to make the platform truly serverless and abstract away the infrastructure for everyone. Generally speaking, getting your code to production involves these steps: you need to publish your code, deploy it to production, and then operate it. The way publishing works with the platform is that you literally just publish your functions; we create an image of your function and version it in a centralized registry. There's a command that lets you publish, and what's happening underneath the hood is that we combine a Docker image of our base platform with your functions and
create a new Docker image, which is then added to our centralized function registry for all the teams. Every team has a bunch of FaaS apps, and each of them will have different versions. Every time you publish, these versions are unique and are immediately saved to the registry.

Now we can deploy these functions to the cloud via the NEWT commands that we talked about earlier; this is our suite of developer tooling. Here you can see a one-liner that deploys a set of functions to the cloud really quickly: in the span of a few minutes, we're live. These functions are deployed using Titus, which is the container orchestration system I talked about earlier. What happens is we take the function from the registry, and then the Titus container scheduler goes ahead and schedules it into the cloud for us. We use things like canaries and canary analysis as part of the deployment, which helps us minimize outages and increase availability, and these are all automatically added to your function deployment process. Whenever you deploy, you can choose to have canaries, and they're a good indicator of whether a new build is a good candidate or not. We do an automated analysis where we look at specific metrics to make sure that your new functions look just as good as, if not better than, the old ones.

Every single deployed function version can be managed by our control plane. I've blanked out some of the RPS numbers, but here, for a specific team, you can see the various versions of their functions that are running. You can see how many instances of each one you have, you have timestamps, and you have links back to the git repository. All the detailed historical information and deployment and management activity is available to aid debugging as well. So the next time something happens, you'll see, oh, I did a deployment
and that's what's causing failures for you. We also use auto-scaling to automatically scale the infrastructure for each function, which saves costs and increases availability. All you have to do is provide an initial configuration for each function that you're deploying. And like I said earlier, metrics and dashboards are automatically generated for each function, so you have full visibility into your service. Alerts are also generated, and they're tied into PagerDuty for each of the teams. The goal here is really to automate and provide a lot of value out of the box. We have things like real-time and historical observability for what you deploy, profiling, and post-mortem tools. A lot of the work that we've been doing with the Diagnostics Working Group, we make all of that available for these services as well.

The other thing I would add is that the infrastructure and the operations of this platform and the application are actually handled by the centralized platform team. The UI teams are only responsible for managing their own individual functions, and they're only on call if there are problems with that business logic. If there are problems with the infrastructure or the platform itself, that's handled by a centralized team, and this gives you that serverless operations environment.

So today we talked about our FaaS platform. We talked about the runtime platform architecture, and how we use Node, Docker, and the Netflix platform to enable people to write functions as a service. We talked about the developer experience: a way to give you a native development experience locally, and our development APIs. And we talked about management and operations, and how we provide a whole suite of tools and metrics for you to easily manage your functions. I want to leave you with some of the lessons that we've learned along the way. One of them is that
you need really solid foundations. Before we could build function as a service, we needed really robust infrastructure-as-a-service and platform-as-a-service platforms; if we didn't have those, we wouldn't have been able to build function as a service. So if you're thinking of going down this route, it's really important to first invest in your infrastructure and your platform.

The developer tools are also really, really important. To unlock the developer velocity that you're looking to get with FaaS, you want to make sure you invest in the development experience: how do you make the development phase as native as possible?

And then lastly, invest in the operations tooling. As you go up the stack and abstract a lot of the infrastructure and platform away from your developers, they no longer get access to the actual instances, or they lose the ability to use some of the tooling. So think about how to provide tooling up the stack, like the way we can automatically generate flame graphs, or give them access to metrics or logs via a web UI, to make it easier for them to operate their service.

So I talked to you today about our FaaS platform. The thing I'll leave you with is that JavaScript, Node, and function as a service are, I think, a natural evolution of where the web is going. It's worked great for our developers, and it's allowed us to get really great velocity and improve operations and availability for the service. I really do think there's something to being able to run JavaScript on the client and JavaScript on the service, and really unlock the potential of your developers. Thank you very much. [Applause]
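
The slides with the code examples did not survive in the captions, so here is a rough sketch of the shape the talk describes: a function written as Connect-style middleware, declared in a configuration object, with startup and shutdown hooks and a mock-based test run in isolation. Every name in it (the config fields, `env.components`, `metrics.increment`, `invokeWithMocks`) is an assumption made up for illustration and is not the actual Netflix platform API.

```javascript
// Hypothetical sketch only: every name here (config fields, env.components,
// metrics.increment) is an assumption, not the real Netflix platform API.

// Startup hook: runs once at process start and may attach extra components.
function startup(env) {
  env.components.greetings = new Map([['en', 'Hello'], ['fr', 'Bonjour']]);
}

// Shutdown hook: releases whatever the startup hook created.
function shutdown(env) {
  env.components.greetings.clear();
}

// Business logic as Connect-style middleware: (req, res, next).
function hello(req, res, next) {
  const { greetings, metrics } = req.components;
  const greeting = greetings.get(req.query.lang || 'en');
  if (!greeting) {
    return next(new Error('unsupported language'));
  }
  metrics.increment('hello.requests'); // supplied by the platform in this sketch
  res.statusCode = 200;
  res.end(JSON.stringify({ message: `${greeting}, ${req.query.name || 'world'}` }));
}

// Declarative description of the service: name, platform version,
// functions (route, HTTP verb, handler), and lifecycle hooks.
const config = {
  name: 'example-bff',
  platformVersion: '1.0.0',
  functions: [{ route: '/hello', method: 'GET', handler: hello }],
  hooks: { startup, shutdown },
};

// Run one function in isolation, replacing platform components with mocks
// so that no network connection is needed.
function invokeWithMocks(query) {
  const result = { calls: [] };
  const env = { components: { metrics: { increment: (n) => result.calls.push(n) } } };
  config.hooks.startup(env);
  const req = { query, components: env.components };
  const res = {
    statusCode: 0,
    end(body) { result.status = this.statusCode; result.body = body; },
  };
  config.functions[0].handler(req, res, (err) => { result.error = err; });
  config.hooks.shutdown(env);
  return result;
}

console.log(invokeWithMocks({ name: 'Ada' }).body); // prints {"message":"Hello, Ada"}
```

The handler only touches components handed to it through the request, which is what makes the mock-based run possible: the test swaps in a fake metrics client and never starts an HTTP server.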

Info

Channel: Coding Tech

Views: 13,912

Keywords: node.js, nodejs, functions as a service, faas

Id: 66PxX3oGVCA

Length: 25min 41sec (1541 seconds)

Published: Wed Dec 05 2018