Building a Modern Machine Learning Platform on Kubernetes | Lyft


Captions
My name's Saurabh. I'm an engineer on the machine learning platform team at Lyft, and today I want to speak a little bit about the technical decisions we made around building a scalable machine learning platform. I want to start by talking about the problems we're trying to solve and then get into some of the details of how we approached the solutions.

How many people out here have used Lyft before? A raise of hands, please. Okay, so most of the audience. Let's talk about the classic Lyft experience: you open your app, you put in your pickup spot and destination, select a payment method, and call a ride. Underlying this process there's a lot of machine learning happening. We can smartly pick the pickup spot from very noisy GPS data to know exactly where you might be. We can also predict the most likely destinations you want to go to based on your ride history. Based on your previous rides, we might automatically pick a payment method if you have several of them, maybe a different card during commute hours compared to a business trip or a weekend ride. We can run machine learning models to optimize which driver gets dispatched to you, to predict the ETA of your ride from source to destination, or to pick the right matching passengers if you are taking a shared ride. So every part of the experience has machine learning powering it. We have hundreds of machine learning researchers working on just the classic Lyft experience, and aside from that we have a ton of people working on autonomous driving, using deep learning extensively for computer vision, planning, and mapping.

My team's job is to build a machine learning platform that makes it easy to develop new models and prototype many different approaches to a machine learning problem. Once you have a few models selected, we want to make it easy to try out different training algorithms; for example, you might want to tweak the learning rate on your model, try out different learning rates, or try out different optimizers. How can we make it easy to scale out one model you've developed and run many executions of it with different hyperparameters? Once you have a model that performs best, we also make it easy to ship that into production. There are different paths to production: a model that runs during the web request for serving a ride needs to be served as a web service; other models do offline prediction on a daily or weekly basis, for example figuring out the right population to send an incentive to; and there is a third type of inference, a machine learning model running on the edge, for example in an autonomous car. So how can we make it easy, across all these different types of deployments, for research scientists to go through this life cycle of model development, training, and production?

Research scientists spend a lot of time doing non-modeling tasks today. If this audience is full of people who are building their own models, you might realize you spend a lot of time collecting the right data set, querying it, filtering it, slicing and dicing it to generate the correct training set. And if you're training really compute-intensive models like deep learning models, you need to provision the right hardware, you might want to run many different iterations of those models, and you need good
logging and monitoring. Our team's job is to make all of that simple, so I'm going to get into some of those requirements.

To start, we wanted to build a system that's framework agnostic. The current landscape of machine learning is rapidly changing: there are some really great frameworks like TensorFlow and PyTorch that have come up in the last few years, then there is the whole scikit-learn ecosystem, boosted trees, and certainly any new libraries that might come up. We want to support any arbitrary framework and not be tied to an implementation for a specific framework.

Next, support for GPU workloads. Almost all deep learning models are extremely compute-intensive, several orders of magnitude more than traditional machine learning approaches, and we want to serve those internal customers at Lyft who are training very large-scale models that consume imagery or video data and whose training can run for days. How can we make their job easier and support single-GPU, multi-GPU, or distributed training workloads?

We also wanted the system to be cloud agnostic, and the main reason is that, along with the software ecosystem, the hardware ecosystem for machine learning is rapidly evolving. A study by OpenAI shows almost a 10x improvement year over year in ML compute performance, which is extremely fast, and custom accelerators and ASIC-based model training systems are being developed. NVIDIA has the Volta GPU, the current state-of-the-art GPU for model training, while Google is working on TPU chips, which are custom ASICs built for model training. We wanted to be able to leverage any of this hardware, because we cannot anticipate exactly how the ecosystem is going to move, so we wanted to be cloud agnostic.

Now, with a lot of power comes a lot of responsibility, and we have many levels of users. We have researchers who come from an academic background and don't really want to deal with infrastructure problems; they don't want to get into the nitty-gritty of how certain infrastructure is set up to launch their jobs. However, for some advanced use cases, such as distributed training, or if you want to build custom ops for your TensorFlow model, we also have very advanced software engineer users who want maximum flexibility in defining their models. So we wanted to support both kinds of use cases.

We decided to use Docker containers as our primary abstraction for most of our jobs. This solves a lot of problems around dependency management and makes it easy to be framework agnostic: as long as you can package your code into a Docker image, you can run it on our system. With containers, we needed to pick a good execution engine, and the Kubernetes ecosystem has really blown up in the last few years. Kubernetes provides a very rich set of APIs for launching very complex workloads, you can scale out very easily on any of the cloud providers, and it provides a cloud-agnostic, API-first way to launch jobs that can then translate to AWS, Google Cloud, or any custom deployment. It handles a lot of infrastructure challenges that we would otherwise have to solve on our own, and I'll get into some of the details. One of the challenges with Kubernetes, however, is its power: it is very complex, with very low-level APIs that most users may not want to deal with. So I'll show how we create the right abstractions to help our users launch all kinds of workloads. We decided to go the
route of providing an easy path that is fully UI-based. We have our own UI-based system internally, which we call Lyft Learn, that allows a data scientist to go from developing in Jupyter notebooks, to containerizing their models, to running very large-scale hyperparameter search, and then to deployments. For advanced users, we provide a much lower-level API using the command line to give them full flexibility for defining their machine learning workloads. So let's go through some of those features.

Notebooks are the entry point to any researcher's machine learning development flow. You want an interactive environment where you can easily query datasets, build out different stages of your model development, and try things out, and Jupyter is the de facto standard right now for this kind of model development. This is Lyft Learn: on the left side you see the different functionality we provide, and the first is environments, which are Jupyter environments for developing either Python or R models. We are moving quickly toward Python as our standard programming language for most workloads, just because it gives us really great end-to-end integration from development to productionizing. In this environment I can quickly pick a software environment that I would like to launch. We provide some pre-baked software environments; here you can see I have four different environments, one of which is the standard Docker image containing most of my commonly used Python libraries, and next to it is a deep learning plus TensorBoard image. In the standard image we only launch a Jupyter notebook, but with the deep learning image we launch both notebooks and TensorBoard. TensorBoard is a visualization tool, used with TensorFlow and PyTorch, for tracking your model's learning performance over time. Similarly, we allow different teams within Lyft to provide their own software environments in the form of Docker images, so if you care about a set of requirements that is different from most users, and you want to pin different versions of scikit-learn or use an RC branch of TensorFlow, you can do that and provide a custom environment right here.

Next you pick the hardware resources you want to run this job on. We provide some pre-configured hardware environments; you can choose many different configurations of CPU, memory, and GPU for your notebooks. Having some fixed environment choices makes it easy for users to quickly go through this flow, but you can use the command-line tools to create arbitrary environments, which translate to Kubernetes resources behind the API. You click Next, and now you have a new environment set up for you. You can see we already provide a link to the notebook right there in the UI; you click on it and it takes you to your Jupyter environment where you can start developing your models. These are some of the links that are available to you once you launch the job: the notebook interface where you can develop any model, and TensorBoard for visualizing any model training you might be running.

Behind the scenes we're creating Kubernetes objects, the primary one being a pod. Pods are the lowest execution units inside Kubernetes, and you can configure many different containers to run inside a pod. Here we are launching a pod with a given set of Python requirements along with the resource requests that the user submitted; it picks up a certain image, and using a Docker image gives you full flexibility to pick the image you would like to run.
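To make the pod-creation step concrete, here is a minimal sketch, using the official Kubernetes Python client, of how a platform service might launch a notebook pod with a user-selected image and resource requests. The image name, namespace, labels, and helper signature are illustrative assumptions, not Lyft's actual implementation.

```python
from kubernetes import client, config


def launch_notebook_pod(user: str, image: str, cpu: str, memory: str, gpus: int = 0):
    """Create a notebook pod with the user's chosen image and resource requests."""
    config.load_kube_config()  # use load_incluster_config() when running inside the cluster

    limits = {"cpu": cpu, "memory": memory}
    if gpus:
        limits["nvidia.com/gpu"] = str(gpus)  # GPUs are requested via the extended resource

    container = client.V1Container(
        name="notebook",
        image=image,  # e.g. a pre-baked deep-learning image with CUDA and TensorFlow
        command=["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser"],
        ports=[client.V1ContainerPort(container_port=8888)],
        resources=client.V1ResourceRequirements(
            requests={"cpu": cpu, "memory": memory}, limits=limits
        ),
    )

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name=f"notebook-{user}",
            labels={"app": "notebook", "owner": user},  # hypothetical labels
        ),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="ml-platform", body=pod)


# Example: a 4-CPU, 16 GiB, single-GPU notebook environment.
# launch_notebook_pod("alice", "registry.example.com/deep-learning:latest", "4", "16Gi", gpus=1)
```

The `nvidia.com/gpu` request is the standard Kubernetes extended resource that the scheduler uses to place pods onto GPU nodes.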
Along with the pod we attach certain volumes. A volume can represent temporary disk space the user might want for loading datasets or for generating training artifacts such as intermediate checkpoints for their models. We also mount NFS into these pods: we have AWS EFS, which we mount as a persistent volume on each notebook. The primary use case is that a user can start a new notebook, begin developing a model, save their notebooks on EFS, terminate the environment once they're done, come back the next day, and pick up where they left off. It makes good use of the resources by letting you checkpoint all your work and return to it later. We also provide GitHub integration, so let's say you've done all your iteration and now you want to push this code out for review; you can log into GitHub from your notebook interface directly and push it out as a new pull request or a new commit. We also launch Services and Ingresses, other Kubernetes object types that allow external access to things running on the cluster. In this case my notebook URL is actually a Kubernetes Ingress backed by an Elastic Load Balancer, which lets a user access the notebook over the internet.

So what are some of the challenges we faced while building this? Some of these machine learning images are extremely big: on average our notebook images are between 12 and 15 GB, because they contain the CUDA drivers, the whole CUDA toolkit, all the binaries for TensorFlow and PyTorch supporting Python 2 and 3, and all the other frameworks users might want. Loading a new job can become a bottleneck because it can take a long time to download these images. So we maintain a warm cache of the most commonly used images on every single node in the Kubernetes cluster; pre-caching them directly on every node eliminates about half of our boot-up time. Our Kubernetes clusters are also auto-scaling: based on unfulfilled requests, the cluster can scale up the number of AWS instances needed to fulfill them. However, starting a new instance can take a while as well. We use Packer, an open-source packaging tool for generating AWS AMIs and many other kinds of images, to pre-bake the VM image with what's needed, and that cut node provisioning down from around 20 minutes to around 6 minutes.

There's one more thing we do. For notebooks you want the launch experience to be really fast, but launching a new node can take six or seven minutes, and downloading the image, if it's not pre-cached, can take ten minutes. We solve both problems by keeping a warm cache of extra nodes available for both CPU and GPU instance types, with a configurable limit. Kubernetes provides a great construct called priority classes, which we use to create lower-priority jobs that can be preempted by real workloads. This lets us hold extra capacity with jobs that are just sitting there, waiting to be preempted by real workloads, and it makes the notebook experience extremely fast.
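The warm-capacity trick described above is the common Kubernetes over-provisioning pattern: placeholder pods with a very low priority keep nodes warm, and real workloads preempt them immediately. A hedged sketch of that pattern with the Python client might look like this; the priority value, pool size, resource shape, and namespace are assumptions.

```python
from kubernetes import client, config


def create_warm_gpu_pool(size: int = 2):
    """Hold warm GPU capacity with low-priority placeholder pods that real workloads preempt."""
    config.load_kube_config()

    # A priority class lower than the default of 0, so any normal pod evicts these placeholders.
    client.SchedulingV1Api().create_priority_class(
        body=client.V1PriorityClass(
            metadata=client.V1ObjectMeta(name="overprovisioning"),
            value=-10,
            global_default=False,
            description="Placeholder pods holding warm capacity",
        )
    )

    placeholder = client.V1Container(
        name="reserve",
        image="k8s.gcr.io/pause",  # does nothing, just reserves the resources
        resources=client.V1ResourceRequirements(
            requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"}
        ),
    )

    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="warm-gpu-pool"),
        spec=client.V1DeploymentSpec(
            replicas=size,  # the configurable amount of spare capacity
            selector=client.V1LabelSelector(match_labels={"app": "warm-gpu-pool"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "warm-gpu-pool"}),
                spec=client.V1PodSpec(
                    priority_class_name="overprovisioning",
                    containers=[placeholder],
                ),
            ),
        ),
    )
    client.AppsV1Api().create_namespaced_deployment(namespace="ml-platform", body=deployment)
```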
We're able to provide consistent notebook launch times within 10 seconds for these 15 GB images on Kubernetes. That makes the data scientists super productive, and that makes us happy.

Remember I mentioned that your notebook code resides on NFS. To productionize something, we don't want to rely on code sitting on NFS as our source of truth, so we provide good GitHub integration. From the UI a user can set up their GitHub credentials: they go to GitHub, generate a new token for their account, and register it on the credentials page. These credentials get translated into Kubernetes Secret objects. The Kubernetes Secret is a construct that lets you store arbitrary secrets on the backing data store, etcd; they are encrypted at rest and can be mounted into any job launched on the cluster. So once you set up your GitHub token, your Hive credentials, and your Presto credentials, we make them automatically available in your container during any notebook or training job creation. You can access external data sources to retrieve your training data sets, and in the same way the credentials are used for publishing new code to GitHub. A GitHub token is one form of credential, access to Hive is another, and these are all configurable.

All right, let's talk about training and deployment. Once you've developed your model, you may want to try out many different variations of it with different optimizers or a different learning schedule; for deep learning models it's very common to try out different numbers of iterations for mini-batch training. It involves a lot of exploration, and we want to make that exploration really fast. All of these jobs can run in parallel: you could launch hundreds of jobs with different learning rates, different optimizers, or any other hyperparameters your model takes. How do we do this? We create a really simple contract for users to implement. As a research scientist, your notebook just needs to contain a model class, with no extra imports or anything: a class named Model and a handful of methods that automatically become your entry points for the different types of jobs you want to launch. So let's go through those. Here's my Model class and the training method; I'll go into all of these in more detail shortly.
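The talk doesn't show the actual contract code, but based on the description (a class named Model, a dictionary of default hyperparameters, and train, batch_predict, init_predict, and predict entry points), a minimal sketch might look like the following. The hyperparameter names, the scikit-learn estimator, and the private helper methods are illustrative assumptions.

```python
class Model:
    # Default hyperparameters; the platform reads this dictionary to know which
    # values a job launch may override (names and values here are illustrative).
    hyperparameters = {"learning_rate": 0.1, "n_estimators": 100}

    def __init__(self, hyperparameters=None):
        self.hyperparameters = {**Model.hyperparameters, **(hyperparameters or {})}
        self.clf = None

    def train(self):
        """Entry point for batch training; one run per hyperparameter combination."""
        from sklearn.ensemble import GradientBoostingClassifier

        X, y = self._load_training_data()  # e.g. query Hive/Presto using mounted credentials
        self.clf = GradientBoostingClassifier(
            learning_rate=self.hyperparameters["learning_rate"],
            n_estimators=self.hyperparameters["n_estimators"],
        ).fit(X, y)
        self._save_checkpoint(self.clf)  # e.g. upload the trained model to S3

    def init_predict(self):
        """Called once when a prediction job or web service starts; heavy setup goes here."""
        self.clf = self._load_checkpoint()  # e.g. download the trained model from S3

    def predict(self, features: dict) -> dict:
        """Entry point for real-time inference, served behind /invocations."""
        row = [features[name] for name in sorted(features)]
        return {"score": float(self.clf.predict_proba([row])[0, 1])}

    def batch_predict(self):
        """Entry point for scheduled offline (ETL-style) predictions."""
        X = self._load_scoring_data()
        return self.clf.predict_proba(X)[:, 1]

    # Placeholder helpers for data access and checkpointing (not part of the contract).
    def _load_training_data(self): ...
    def _load_scoring_data(self): ...
    def _save_checkpoint(self, clf): ...
    def _load_checkpoint(self): ...
```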
From the notebook we also provide a custom plugin that takes your notebook code, along with any other code around it under that base path on NFS, and packages it all up as a Docker container. As a data scientist I don't need to deal with any kind of CI, I don't have to involve an engineer to convert my code and package it into a Docker image, and I don't need to understand how Dockerfiles are written. I just say: here's my model, here's my logical name for it and a version, now go build a Docker image with that. Users pick a base image to put their code on top of; these base images represent certain Python package dependencies. Let's say you're using TensorFlow in your model and you want the base image that has the CUDA drivers for GPU execution and a certain version of TensorFlow: you pick that base image as the import image in your Dockerfile and then layer your code on top of it. So for a user, all you have to do is click a button to generate Docker images. We use Docker-in-Docker, running a Docker process inside Kubernetes pods, to build an image and then push it to a registry.

All right, let's get into the details of this model contract. It all starts by initializing a model class with a default set of hyperparameters. We automatically detect the hyperparameters set up on a model based on the hyperparameters dictionary, where you define the default hyperparameters and their values, and during job launch you automatically get a chance to override those values if you want. The train method becomes your entry point for any kind of batch training: if you define some default hyperparameters and want to run ten different versions with different values, it will invoke the train entry point for each of them. Going back to the UI, you see the training tab over there; using the training tab you can pick a model you defined, invoke a new offline run for it while overriding certain hyperparameters you care about, and that in turn invokes the train method.

We provide two types of predictions for your models. Let's say you trained a model that performs really well, you're happy with it, and you want to generate daily predictions from it, for example predicting a driver's likelihood of driving at a given hour every day. I can do that offline; it doesn't happen in real time, it's an ETL job. The ETL job just calls the Lyft Learn API with a batch-predict call, which runs the batch_predict method as the entry point for your model. Similarly, if you want to run a web service on top of your model and make it available for real-time inference, you just create a predict method in your model class that takes a dictionary of features as input. If a user defines the predict method, we automatically run a Gunicorn and Flask server on top, which serves the predict method at a canonical endpoint, /invocations. So as a data scientist, still no engineers are involved in taking your model from development to either offline prediction or web-based prediction, and these tools make it extremely fast for any user to go from an idea for a model to running it as a product feature.

What's the init_predict method? The init_predict method is useful if you want to do some kind of heavyweight initialization for your predictions; for example, you might want to load model checkpoints from S3, and that can take a while. The init_predict method is called once at the initialization of your prediction job, which lets you do heavy operations, create in-memory state, and then run many predictions really fast. If you look at this contract, it's extremely generic: we make no assumptions about what kind of training job is running or what kind of prediction is happening; we just provide hooks for a user to run these jobs on a given schedule, or manually through the UI or the command line. We also provide easy command-line tools for users to build Docker images locally with their models and push them into a dev registry, from where they can start using the same features from the UI.
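For the real-time path, the platform is described as wrapping the user's predict method in a Gunicorn and Flask server listening on /invocations. A hedged sketch of such a wrapper, assuming a hypothetical user_model module exposing the Model class, could look like this:

```python
# serve.py -- a minimal sketch of wrapping a user's Model.predict behind Flask.
from flask import Flask, jsonify, request

from user_model import Model  # hypothetical module containing the user's Model class

app = Flask(__name__)

model = Model()
if hasattr(model, "init_predict"):
    model.init_predict()  # one-time heavy setup, e.g. loading checkpoints from S3


@app.route("/invocations", methods=["POST"])
def invocations():
    # The request body is a JSON dictionary of feature name -> value.
    features = request.get_json(force=True)
    return jsonify(model.predict(features))


@app.route("/healthz", methods=["GET"])
def healthz():
    # A simple liveness endpoint, added here as a common convention.
    return "ok", 200
```

It would be started inside the model's container with something like `gunicorn --workers 2 --bind 0.0.0.0:8080 serve:app`.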
Moving on to hyperparameter search. So far I've talked only about manual hyperparameter search, where the user specifies, either through a Python script or through the UI, the different runs they would like to launch. However, for certain models like XGBoost, or any kind of boosted trees, you have a very large number of hyperparameters, and you run very large exploration studies across them; those explorations can yield significant improvements to your model. For such models, which require very large-scale iteration and experimentation, we integrated an open-source tool called Katib, part of the Kubeflow project, for automatic hyperparameter search. A user can perform many different types of hyperparameter search, such as giving us a grid of values. For example, you define the metrics you would like to optimize, and you specify each hyperparameter either as a numerical range with some kind of step count or as a dimension with a discrete set of values. Once you specify those, you can also specify what kind of instance you want to launch these jobs on: a high-memory instance, a GPU instance, a four-GPU instance. We're also working on adding a cost limit where users can specify how much money they are willing to spend on these jobs. The study will automatically run many iterations of the model with different combinations of hyperparameter values, listen to the final performance metrics emitted by each job, and a controller will then decide how to tune the hyperparameter values in the direction where the model performs better and better. It's a fully pluggable strategy which allows choosing different kinds of controllers: you can use a really simple controller that just takes a bunch of different hyperparameters and runs every combination, or something more like a real machine-learning optimizer that tries to fit the values and find the hyperparameter settings that perform best. It looks like this: this is the UI that comes with the Katib project, which is an implementation of Google's automatic hyperparameter search system called Vizier. It can trigger a very large number of iterations on Kubernetes, listen to the metrics it's trying to optimize, and based on that define the next set of hyperparameters to launch. We've seen really big gains in research scientist productivity from these tools, because they automate the process of discovering the right hyperparameters, which can take several days if done manually; you don't have to keep track of all the different job states, and you don't have to handle failure scenarios; the controller takes care of all of that.

Now, scheduling. I spoke a little bit about batch predictions, and the way we integrate them is by triggering the Lyft Learn API on a schedule using external schedulers. At Lyft we use Airflow heavily, which is a very successful open-source scheduler, and we've provided custom operators inside Airflow that hit the REST API for Lyft Learn on a given schedule and execute certain models. We also have some other internal schedulers at Lyft, and since the REST API for Lyft Learn is the primary interface to all the functionality, it's really trivial to integrate with any custom scheduler we decide to use.
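As an illustration of that scheduling integration, a daily Airflow DAG could hit such a REST API with Airflow's stock SimpleHttpOperator. The connection id, endpoint path, and payload below are hypothetical; the talk only says that custom operators call the Lyft Learn API on a schedule.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator  # Airflow 1.x import path

default_args = {"owner": "ml-platform", "retries": 2, "retry_delay": timedelta(minutes=10)}

dag = DAG(
    dag_id="driver_hours_batch_predict",
    default_args=default_args,
    schedule_interval="@daily",
    start_date=datetime(2019, 1, 1),
    catchup=False,
)

# One task: ask the platform to run the model's batch_predict entry point.
batch_predict = SimpleHttpOperator(
    task_id="trigger_batch_predict",
    http_conn_id="lyftlearn_api",                   # assumed Airflow connection to the REST API
    endpoint="models/driver_hours/batch_predict",   # hypothetical endpoint path
    method="POST",
    data='{"model_version": "latest"}',
    headers={"Content-Type": "application/json"},
    dag=dag,
)
```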
For future work, we want better integration with GitHub so we can have stronger code review when a model is developed; we want to hook into the code review process more deeply so we can track what code is running for a given model in production. We also want to make it easy to create custom environments for your notebooks: let's say you care about certain libraries that no one else does but you want to quickly start using the full functionality of the system, you can do that, and we want to make it easy for users to register new base images. The way we've designed the system, the API layer is highly pluggable: you can push new images to register as software environments, or push new hardware configurations as hardware resource options. Everything underneath runs on Kubernetes, and there's a database that stores metadata about historical job executions. Kubernetes is not very good at storing large amounts of historical state; the etcd data store is designed for current workloads, not as a long-term data storage system. So we use a metadata store, a Postgres-type database, for storing historical runs, their status, and the hyperparameters submitted for each training job, so that any user can replicate a training job created at any time in the past. By capturing metadata about model execution, running it all on Kubernetes, and using the Docker abstraction, we can replicate a model any number of times. That prevents a lot of errors commonly seen in machine learning: not being able to replicate a result, the data changing underneath, or the model code changing over time and affecting model performance. By containerizing those models, we freeze the state of the algorithm that was run to produce a certain result. We're working on open-sourcing this so it can be available to a bigger community; if anyone's interested, definitely reach out to me in the office hours, I'm happy to talk more about it. Thanks for listening.

Q: Do you still have people at Lyft running notebooks locally, or has this presented itself as the obvious solution for running notebooks, with everybody all in on it?

A: We launched the very first version of this system in May, and early on people were still hesitant because they had a very well-set-up local environment; they just didn't have big data sets, they sampled enough that they didn't care about having a remote system. But very quickly the adoption has been insane. We have a very high NPS, and almost 80 to 90 percent of the research scientists use the system now, because when they finally switched over and tried it for their new use cases, they realized they could unlock a huge amount of power: you get persistence of your code over time, you can get any arbitrary machine configuration very easily, and your code is shared through NFS across all your instances, so you can develop on a smaller instance and then scale out on a very high-memory or multi-GPU instance. People really love that kind of flexibility, and they don't have to worry about closing their laptop and losing state; all those problems are gone. So it has pretty much eliminated the need for local development completely.
We also provide code-syncing tools to make it easy to develop in your favorite editor and continuously sync your code into the Kubernetes pods. That was one of the big asks blocking a lot of people: not everyone cares about developing in Jupyter notebooks, they want to use their favorite editor, so that was probably one of the biggest features that moved people away from local development. Other questions?

Q: Thanks for the talk. I was curious, do you also have engineers using this, or is it just data scientists?

A: That's a good question. As I said before, we have multiple tiers of users. The researchers are less programming-oriented; they care more about the machine learning problem and don't want to get into infrastructure details, so the UI is used heavily by the research scientists. The engineers seem to prefer the command-line tools, so we have a ton of engineer users who use the command line more than the UI. One other thing: at Lyft, from the beginning, we've had a distinction between a software engineering role and an ML role, and those distinctions are getting more and more merged now, they're going away, so we see more engineers using the ML tools.

Q: Do you still have a productionization step? There's the development in a Jupyter notebook, and it seems like this could automatically spin up a service that could be hit from production, but do you still have people taking a model developed by the ML researchers and productionizing it to make it faster?

A: That's a common problem we saw a lot in the past. We actually have the ability to directly spawn a Kubernetes deployment based on the model you generated, and it calls the predict method I talked about earlier. However, we've put that slightly on the back burner while we work on better integration with CI, Jenkins, and deployment pipelines, so we have good traceability of what code is running for a web service, how to roll back, and the ability to inspect what's running inside that code. We want to tie it closely to GitHub commits. Right now data scientists push the code to GitHub, and we're working on making it really easy to use the CI process at Lyft to launch production services from that, generate automatic monitoring, and manage the SLA for rolling out those models.
Info
Channel: Data Council
Views: 5,796
Rating: 4.869565 out of 5
Keywords: machine learning, Kubernetes, ai, Saurabh Bajaj lyft, data science, kubernetes architecture
Id: CbiJ_dYZXnQ
Length: 40min 7sec (2407 seconds)
Published: Wed Jan 02 2019