Building ML Products With Kubeflow - Jeremy Lewi, Google & Stephan Fabel, Canonical

Video Statistics and Information

Captions
Welcome, everybody, to our Building Machine Learning Products with Kubeflow session. My name is Stephan Fabel, I'm from Canonical, and this is Jeremy Lewi from Google. We're glad you're here and showing interest in how to build products that have something to do with machine learning on top of Kubernetes.

Going through the agenda a little bit: first we're going to talk about the background of Kubeflow and the rationale, why Kubeflow, what problem does it solve. Then we're going to go through an end-to-end example, starting with deployment options for Kubernetes and Kubeflow and the NVIDIA enablement that's necessary to make machine learning happen on-prem as well as in the cloud. Then Jeremy is going to walk you through building a product on top of Kubeflow, showing you an end-to-end example of what this would look like. Finally, we'll go through a summary and a roadmap of Kubeflow and how it's going to evolve over the next couple of months, the next cycle, in anticipation of our 1.0 release later this year.

There are several talks regarding Kubeflow, I don't know if you've noticed, but we're all about machine learning these days. You've missed a couple already if you haven't attended them, but there are a couple more: Jeremy is going to do a deep dive into Kubeflow internals a little later today, there's another talk about data management, and finally tomorrow there's a keynote that I would encourage you to attend.

First of all, the recognition is that machine learning is everywhere these days. You see this in connected cars, in smart cities, and at home. The whole translation engine of Google is built on machine learning, and the connected devices at home, the Echos, the Siris, the natural language processing, the image processing, the image recognition, the computer vision, all of those things are now backed by machine learning. Self-driving cars are a huge effort: Waymo put in, I think, three hundred thousand miles last year, Ford has around a hundred fifty thousand miles, and GM is following closely behind. Machine learning is behind all of those decisions.

Now, there are a couple of issues, as you can imagine, as you roll this out at scale in a distributed fashion. People assume that when you deploy a machine learning product, it is really mainly about machine learning code. You would expect that the majority of the code you write has something to do with machine learning, and that the other things gathered around this central block would just be taken care of: your machine learning code gets deployed into this bubble, and all those tools, the process management, the feature extraction, the monitoring, the infrastructure and so on, are just there. You assume they're there, but actually this is very far from the truth. The machine learning code is actually very, very small, because the intelligence is often expressed in only a couple of lines of code, as Jeremy will show you later, and really most of the problems you're going to have are in those other boxes you thought weren't important: how do I deploy this thing, how do I manage it, how do I monitor it, how do I manage its configuration, and so on?
So we asked ourselves this question: what would be a good tool for DevOps, what would enable us to actually solve all those other problems that are really important before we get to the good stuff? Well, we thought that containers and Kubernetes would be a nice approach to this problem, and so consequently we started this project called Kubeflow.

Yeah, so with Kubeflow our goal was to take advantage of Kubernetes to solve a lot of these DevOps challenges. We started building Kubeflow, which is a Kubernetes-native platform for ML, and our tagline and our goal is really to make it easy to use Kubernetes to build ML products that are portable, so you can run them wherever Kubernetes runs.

So what specifically is Kubeflow? It's really two things. First and foremost, we're a community: a community of data scientists, researchers, DevOps engineers, and product managers, and what we're doing as a community is coming together to build a community-driven platform that takes advantage of Kubernetes and makes Kubernetes the best platform for running ML. The reason we're doing this is that there's a growing recognition that building an ML platform is a really big problem to take on, and it makes much more sense to work together as a community to build it. That inspiration comes from Kubernetes, where we've seen a similar effort to build a container platform work well as a community-driven platform, as opposed to everyone trying to do it all themselves.

And what exactly does it mean to be a Kubernetes-native platform for ML? First and foremost, we want to run wherever Kubernetes runs. We're taking a hard dependency on Kubernetes, and we assume you have Kubernetes as the substrate for running your ML platform and solutions. The reason is that we're going to rely on Kubernetes to move your platform from, say, your local machine for early development and testing, to your on-prem cluster or your cloud clusters to productionize it, and that's why we need Kubernetes.

The other aspect of being Kubernetes-native is that we're taking advantage of Kubernetes APIs and patterns to solve challenges in machine learning. A concrete example is that we've built custom controllers and custom resources in Kubernetes to do things like run distributed training for TensorFlow. If you're running distributed training for TensorFlow, you have multiple processes to manage: you have masters that coordinate training, you have workers that actually do the work, and then you have parameter servers that act as a distributed data store for the parameters you're learning. You have all these processes to coordinate, so we built a custom controller for managing all of that. Other people in the community, like Seldon, have contributed custom controllers for managing model deployments and rollouts.

The other thing we're doing is adopting machine learning patterns and Kubernetes patterns, such as microservices and declarative infrastructure management, and to do this we use ksonnet to package our infrastructure and make it easy to deploy our ML platform. Our goal is to support multiple ML frameworks, including TensorFlow, PyTorch, scikit-learn, XGBoost, and others.
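To make the master/worker/parameter-server split above more concrete, here is a small sketch of what those roles mean at the TensorFlow level, using the TF 1.x-style APIs that were current at the time of the talk. The addresses and role assignment below are illustrative; on Kubeflow it is the TFJob controller that creates one pod per replica and wires this information up for you.

```python
import tensorflow as tf  # TF 1.x style APIs (tf.compat.v1 on TensorFlow 2)

# Every process in a distributed TensorFlow job is started with the same
# cluster definition plus its own role and index. Illustrative addresses:
cluster = tf.train.ClusterSpec({
    "master": ["localhost:2221"],                    # coordinates training
    "worker": ["localhost:2222", "localhost:2223"],  # do the actual work
    "ps":     ["localhost:2224"],                    # hold the learned parameters
})

# This process announces itself as worker 0 and starts listening; the other
# replicas do the same with their own job_name / task_index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
print(server.target)
```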
And finally, as part of Kubeflow, we want to provide end-to-end solutions illustrating how ML products are built on top of Kubeflow, and that's one of the things we'll be demoing today. So the first piece of deploying on Kubeflow is actually deploying Kubeflow, and that's what Stephan is going to demonstrate.

Sure. When you get started creating your product, the first and most important thing is to have a stable Kubernetes substrate that you can build on top of, and then a repeatable way to deploy Kubeflow on top, to enable the platform wherever you run it. You can run it either on-prem or in the cloud. This is one option: you deploy the Canonical Distribution of Kubernetes, you deploy it on-prem, you select your credentials, you select a substrate, you select the networking you would like, you fill in your required attributes, including which ksonnet version you're using, and off you go. The cluster builds itself in the background, and Kubernetes is deployed to the cloud of your choice or on-prem.

Once you have Kubernetes deployed and Kubeflow installed, which is part of the installer you just saw, you then have the problem of establishing a data scientist or developer workflow. What does that workflow look like in the context of Kubeflow? If you break it down into the most basic components: you identify a problem, then you experiment, train your model, and see if the deployed model actually works the way you'd like, whether it does the things you thought it would do, often on a subset or a filtered set of the data rather than the full data set. You keep doing this until you're happy with the way your model is working out, and finally you deploy it at scale, and you have to operate it once it's deployed. I think Jeremy is going to go through the end-to-end example now.

Okay, so let's identify a problem. In this particular case, the problem we're going to look at is GitHub issues with uninformative titles. If you're managing a project on GitHub, or even with your internal issue tracker, a not uncommon problem is that people file issues with uninformative titles. It's not uncommon for a user to report a bug with a very generic title. As one particular example, a user in Kubeflow reported a bug and titled it "some setting problems, a new guy needs a little help". Then you go and do a little bit of triage, you find out what the actual issue is, you update the issue on the thread, and eventually you identify what the particular problem is and would like to go back and update the title, to maintain the hygiene of your project. In this case we identified the particular problem, it was a problem with the ksonnet configuration in our template, so we went ahead and updated the title to reflect that.

Keeping GitHub issue titles up to date, though, is a lot of toil, a lot of manual labor, so this is an area where we'd like to think about automating things using machine learning. That's the goal, so let's switch to the demo. The first thing is that you need to enable your data scientists and machine learning researchers to do experimentation and exploration, to see whether this problem is actually solvable.
The tool of choice for this type of exploration in the ML community is Jupyter notebooks. Jupyter notebooks are an interactive Python environment that allows for easy, interactive, ad-hoc analysis. One of the things we deploy as part of Kubeflow is JupyterHub. JupyterHub is a multi-user server for Jupyter notebooks, and it has already been integrated with Kubernetes, so it was very easy for us to incorporate that package into Kubeflow and make it part of our deployment.

After you've deployed Kubeflow, as Stephan showed you, you can log in, and what you get is the spawner window, which allows you to launch your individual notebook environments. One of the advantages of running Jupyter on Kubernetes is that you can take advantage of containers to get reproducible environments for doing data science: you can load up your Docker images with all the libraries you need for your particular problem and give all of your data scientists reproducible environments. As part of Kubeflow we're providing curated runtime environments; right now we provide runtime environments for TensorFlow, and we're going to provide runtime environments for other frameworks like PyTorch as that support emerges in Kubeflow. We're also integrating this with other packages in Kubeflow, like Pachyderm. Pachyderm provides a data repository, basically Git for data, as well as a pipelining solution, and that's a nice tool to integrate with Jupyter. The other thing you can do is specify the resources, taking advantage of Kubernetes to launch notebooks with more CPU or memory. Then you can click spawn, and what you end up getting is a notebook environment, and from there you can go ahead and select your particular notebook.

So I've gone ahead and loaded up a notebook. This is a notebook that was developed by Hamel Husain; he's a data scientist at GitHub who originally had the idea to tackle this problem, and this is the notebook he developed to do that. If we scroll down here, the first step is downloading a bit of data. In this case we're pulling data from the GitHub Archive, which publishes an archive of all the event data on GitHub, so we can get issues from that and use them to train.

The next step is processing the data, and one thing I'll note here, because it's going to be important later on, is that we're using a small subset of the data: we're down-sampling it. The reason we're down-sampling is that we're running interactively in a notebook and we want to get feedback quickly, so we can iterate quickly and see whether the analysis we're doing is actually promising. You can see the next step: he's printed things out, taking advantage of Jupyter to start looking at the data and get a sense of what it is, so he can figure out what sort of transformations and pre-processing he needs to run. Then down here he's doing the actual pre-processing: he's doing things like removing stop words that aren't informative, and he's also computing some basic statistics, like the average length of a GitHub issue in terms of the number of words in the body and in the conversation, because those are going to be used as parameters for his model.
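As a rough illustration of that exploration step, here is a minimal pandas sketch in the spirit of the notebook. The file name and column names are hypothetical stand-ins for the GitHub Archive data used in the talk, and the stop-word list is deliberately tiny:

```python
import pandas as pd

# Hypothetical CSV of GitHub issues extracted from the GitHub Archive;
# the file name and column names are illustrative, not the talk's schema.
issues = pd.read_csv("github_issues.csv")

# Down-sample aggressively so the notebook gives feedback in seconds.
sample = issues.sample(frac=0.05, random_state=42)

# Strip a few uninformative stop words from the issue bodies.
stop_words = {"the", "a", "an", "and", "or", "is", "to", "of"}
sample["body_clean"] = sample["body"].fillna("").str.split().apply(
    lambda words: " ".join(w for w in words if w.lower() not in stop_words))

# Basic statistics like the ones the speaker mentions: average word counts,
# later used to size the model's inputs.
print("mean words per body: ",
      sample["body_clean"].str.split().str.len().mean())
print("mean words per title:",
      sample["issue_title"].fillna("").str.split().str.len().mean())
```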
You can see there's some pre-processing going on here, and then down here he actually defines the model architecture. This is a TensorFlow model written using Keras, so he's going ahead and defining the TensorFlow model, and if we scroll down he prints out stats about the model in terms of the different layers and the number of neurons, or shapes, in those layers, and he's got a graph down here of what that model looks like. He's taking advantage of the interactivity of the notebook to make it easy to develop his model. Then down here he actually trains the model, and we can see we run training for a small number of steps, again because he only wants to get feedback quickly. Once we have the model trained we can actually generate predictions with it, and you can look here: we have the body of an issue, we have its original title, and then we have the title that was generated by the machine learning algorithm.

This is pretty great: as a data scientist he's proved that you can build a model to solve this. But there are a couple of problems. The first is that he only trained on a subset of the data, so if you're actually going to put this into production you'd probably want to train on all the data and scale out, to see if you can improve the quality of your model by doing that. The second is that while a notebook is great for experimentation and analysis, it's not a great way to productize this model and make it consumable. So we're going to use Kubernetes to solve both of those problems.

Kubernetes makes it really easy to scale out. For pre-processing, you could run it as a batch job on Kubernetes, asynchronously, and let it run for a long time. And then we can take advantage of our TensorFlow job operator to run training as a distributed job on Kubernetes and scale out horizontally, and that's what I'm going to show next.

If we look here, no, that's the wrong one, this is our TFJob spec for our TensorFlow jobs, and it looks very similar, let me see if I can zoom in, very similar to your normal Kubernetes manifest. We have a pod template spec, and we specifically provide a pod template spec because we want people to take advantage of the familiar Kubernetes API: by using a pod template spec you can use volumes to attach storage to your pods, you can set environment variables to inject credentials like GCP service accounts or S3 credentials, and then, as I said, you basically have multiple replicas, the master replicas, the worker replicas, and the parameter server replicas, and for each of those you specify the number of replicas you want, and that's how you scale out horizontally.

While our model is training we would like to see how well it's doing, and for that we launch and run TensorBoard. TensorBoard is part of the TensorFlow ecosystem, and it's the package for visualizing all the metrics. As part of Kubeflow we've made it really easy to deploy TensorBoard on Kubernetes: we provide ksonnet packages for doing that, and we integrate it into our ingress story so you can easily access it remotely, see the metrics from your runs, and see how well your model is doing.
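For a concrete feel of the pieces just described, here is a minimal sketch of an encoder-decoder of the general kind the notebook defines, trained briefly on random stand-in data with a TensorBoard callback so there are metrics to look at. The layer choices, sizes, and log directory are illustrative assumptions, not Hamel's actual architecture:

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding, GRU, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import TensorBoard

# Illustrative sizes; the real values would come from the statistics computed
# during preprocessing (average body / title lengths, vocabulary size).
vocab_size, latent_dim = 8000, 300
body_len, title_len = 70, 12

# Encoder: embed the issue body and compress it into a single state vector.
body_in = Input(shape=(body_len,), name="body_tokens")
x = Embedding(vocab_size, latent_dim, mask_zero=True)(body_in)
_, enc_state = GRU(latent_dim, return_state=True)(x)

# Decoder: generate the title token by token, seeded with the encoder state.
title_in = Input(shape=(title_len,), name="title_tokens")
y = Embedding(vocab_size, latent_dim, mask_zero=True)(title_in)
y, _ = GRU(latent_dim, return_sequences=True, return_state=True)(
    y, initial_state=enc_state)
title_out = Dense(vocab_size, activation="softmax")(y)

model = Model([body_in, title_in], title_out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()  # the layer/shape table the speaker scrolls through

# Train briefly on random stand-in data, logging metrics for TensorBoard.
bodies = np.random.randint(1, vocab_size, size=(256, body_len))
titles = np.random.randint(1, vocab_size, size=(256, title_len))
model.fit([bodies, titles], np.expand_dims(titles, -1),
          epochs=1, batch_size=32,
          callbacks=[TensorBoard(log_dir="./logs")])
```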
Right, so once you've done that and you've trained a model, you actually want to put that model into production, and to do that you want to build a model server. A model server is basically a server that gets the input to the model in the body of a request and returns the prediction in the body of the response, fairly straightforward. For that we're taking advantage of a package from Seldon called Seldon Core, which provides a package for building model servers.

Let me refresh. To take advantage of Seldon, they provide a model server, and to use it you basically just have to implement the predict method by defining a Python class. That predict method takes the function signature they've defined, and you define the body: it calls your actual model and then returns the data in the form they expect. Once you've defined this class, you can use tooling provided by Seldon to build a Docker image that contains your model and the server, and that image can be used to serve your model. Then, to actually deploy your model, you can take advantage of Seldon's custom resource, which takes care of rolling out your models and handling model deployments.

Let me zoom in here. This is the Seldon deployment spec used by Seldon to control model deployments, and really the thing that distinguishes a Seldon deployment from a regular Kubernetes deployment is that it allows you to express inference as a graph of microservices. You can have multiple models that are combined to generate the actual prediction, and this allows you to do some very complicated things very easily. As an example, you can take advantage of this to do A/B testing; you can also use it to do things like multi-headed models, so you can have, say, a deep learning model and a decision tree model, and use the outputs from both of those models to generate the actual prediction.

So we use this to actually deploy the model, and then, to make it consumable, you might want an actual web app, so we built a simple web app to illustrate this and deployed it on Kubernetes. It takes input from the user, in this case the user enters a GitHub issue; the web app goes and fetches the relevant data from GitHub, then makes the RPC to the Seldon server to actually generate the prediction. If we do this and click "generate title", we fetch the body of the issue, and down here we have the title that was actually generated by the machine learning algorithm, which is pretty accurate in this case: the title is "training stuck on waiting for workers". So that's the entire product.
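As a minimal sketch of the wrapper-class idea described above: Seldon's tooling wraps a plain Python class in a model server and calls its predict method for each request. The class below is a self-contained stand-in (it fakes the model by truncating the body); the class name and the exact predict signature should be checked against the Seldon Core version you deploy:

```python
class IssueSummarization(object):
    """A minimal sketch of a Seldon Core Python model wrapper.

    Seldon's builder wraps a class like this in an HTTP/gRPC server and
    calls predict() for every request; check the Seldon Core docs for the
    exact class and method contract in the version you use.
    """

    def __init__(self):
        # In the real product you would load the trained Keras model and its
        # text preprocessors here, once, when the server starts. This sketch
        # uses a trivial stand-in so the example stays self-contained.
        self.max_title_words = 8

    def predict(self, X, feature_names):
        # X is an array built from the request body; we assume its first
        # column carries the raw issue body text.
        bodies = [str(row[0]) for row in X]
        # Stand-in "model": take the first few words of the body as a title.
        return [[" ".join(b.split()[: self.max_title_words])] for b in bodies]
```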
If we go back to the diagram to summarize, this is what the actual system looks like to build and serve this product. We end up with a distributed system that consists of multiple microservices, and this is what we have to manage and deploy in order to build the product. To go over it: we started by deploying Jupyter to give our data scientists and ML researchers the tools they needed to build and develop their models. Once they were done developing their model, they needed to train at scale, and for that we provided the TFJob controller, which they had to go ahead and deploy. To support training, we also have a UI coming up in Kubeflow that's going to help you manage your TF jobs, and we had to launch and manage TensorBoard. Then we had to deploy the Seldon controller to manage our model deployments and inference, roll out the model server for this particular model, and connect it with our web app. And of course we need things like Istio, which we're integrating with, to provide monitoring and logging for your apps.

We have multiple web applications that users need access to, so to support that we need a good ingress story. As part of Kubeflow we've integrated a reverse proxy using Ambassador, and one of the capabilities Ambassador provides is external authentication, so we can access these services securely from outside the cluster. Then we have to connect that with a load balancer or some other type of ingress to give us access to these services from outside the cluster. So as you can see, we end up with significant DevOps challenges in order to build and support this application.

This is our roadmap for Kubeflow. We just released our 0.1 release in April, and it contains some core components. We have a bunch of existing packages we're integrating with: Argo for workflow management, JupyterHub for notebooks, we also have our TFJob operator, we're integrating with Seldon and with TF Serving, which is another package for doing model serving, and we also integrate with Pachyderm. For our 0.2 release we have a bunch of new components coming: Katib for hyperparameter tuning, the PyTorch operator, hopefully batch inference, Horovod, which is a framework that uses MPI for distributed training, and a central UI that we're adding to Kubeflow, plus improvements to our existing functionality and packages. Then we're targeting a 1.0 release, hopefully at KubeCon USA, which will be in about December. The core CUJ (critical user journey) we're targeting for our 1.0 release is continuous integration in the ML space. What continuous integration means in ML is that every night, as you get more data, say your latest web logs, you'd like to retrain your model, and if your latest model is doing better than your current model in production, you'd like to automatically push it into production.
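That nightly loop can be summarized in a few lines. The sketch below is only an orchestration skeleton under stated assumptions: train, evaluate, and deploy are hypothetical callables standing in for the real pipeline steps (for example, an Argo workflow that launches a TFJob and then updates a Seldon deployment):

```python
def nightly_retrain(train, evaluate, deploy, latest_data, production_model):
    """Retrain on the newest data and promote the candidate only if it wins."""
    candidate = train(latest_data)                      # e.g. a TFJob run
    if evaluate(candidate, latest_data) > evaluate(production_model, latest_data):
        deploy(candidate)                               # e.g. update the Seldon deployment
        return candidate
    return production_model                             # keep what is already serving traffic
```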
There we go. All right, thanks Jeremy. So, in summary: the first problem you have to solve is deploying Kubeflow, and deploying it repeatably. There are a couple of options. You need a Kubernetes somewhere; that can be on-prem or it can be in the cloud. If it's on-prem, perhaps because you want to do your training with a limited data set that has to stay in-house, or because you have compliance concerns, then there are a couple of things that are important to consider. First, the compatibility of your Kubernetes with the Kubernetes in the cloud that you'll eventually deploy to: is it the same version, does it support the same primitives, is that a smooth story? That's number one. Number two, you need to expose the GPUs in the correct way, so that Kubeflow works on-prem as well as in the cloud, and this is a little bit more complicated: it requires deep kernel integration, CUDA library compatibility, and the right way to expose those NVIDIA GPUs inside your Docker containers running on top of Kubernetes. All of that work needs to be done correctly, so these are some of the data points you should look at when deploying Kubernetes on-prem for that purpose, to make that story smooth.

Once you've deployed, you can then go through the data science process that Jeremy just illustrated in this end-to-end example: experiment in Jupyter, build the Docker image, train at scale, and then deploy wherever you like with the model server of your choice. Here are a couple of links where you can find out more: the code for the end-to-end example is at the top, that's the GitHub issue summarization example, you can try it on Katacoda, there's a blog post about the application we just showed, and there's also the Kubeflow mailing list as well as the GitHub link. All right, thank you. Are there any questions?

Q: As I remember, there was something like TF Serving for serving models, which was included in TensorFlow last I looked, about a year ago. Is that planned to be included in Kubeflow as well, or is it going to be replaced by your own model serving services?
A: We support TensorFlow Serving as well. We didn't use it in this particular example, but we support it: we provide Docker images for TensorFlow Serving and ksonnet packages that make it easy to deploy, so you can choose which model server you want to use.

Q: What was that really cool installer you used?
A: That's our conjure-up installer. Thank you for asking.

Q: You said that Jupyter was Python, but do you support Julia and R and some of the other things that are getting added to Jupyter notebooks, or is there a runtime limitation there?
A: We're just running Jupyter, so if Jupyter supports R and these other kernels, we get that support out of the box, and we can add other packages. If there were people who wanted to integrate, say, R support, because that's a very popular tool in the data science community, we'd be happy to integrate that if there was a good package to do so. And there's a good combination there: you can go from R to TensorFlow, there are good interfaces available. You can also specify your own image when you spawn a Jupyter notebook, so if you have an image with an R kernel somewhere, it will just work.

All right, I think that's it. Thanks!
[Applause]
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 8,187
Rating: 4.860465 out of 5
Id: sC8Ce9vUggo
Length: 29min 53sec (1793 seconds)
Published: Sun May 06 2018