Managing Machine Learning in Production with Kubeflow and DevOps - David Aronchick, Microsoft

Video Statistics and Information

  • Original Title: Managing Machine Learning in Production with Kubeflow and DevOps - David Aronchick, Microsoft
  • Author: CNCF [Cloud Native Computing Foundation]
  • Description: Managing Machine Learning in Production with Kubeflow and DevOps - David Aronchick, Microsoft Kubeflow has helped bring machine learning to Kubernetes, ...
  • Youtube URL: https://www.youtube.com/watch?v=lu5zHvpQeSI
Captions
Hi. I apologize, I was working on my slides until a minute ago; I'm not usually one of those presenters, but I was running very late, so I apologize. Thank you so much. I am David Aronchick. I lead open source machine learning strategy at Microsoft and Azure, I was previously the lead PM for Kubernetes, and I helped start the Kubeflow project. I'm here to talk to you about how to bring your machine learning to production using Kubeflow and MLOps.

So, at Microsoft (and because the widget is not working, I will be operating from my laptop), we have a lot of experience bringing ML to production. We bring together your data, the cloud, your models, and our job is really to help customers, large customers, small customers, whatever it is, move through this, and it's something we have a lot of experience doing. We have a lot of internal experience from Microsoft Research and ML generally, with many of the most recent benchmarks and achievements in ML coming out of Microsoft Research. We're really proud of that, and of course we give all of that back to the research community in the form of open papers, notebooks, and data. The reality is that ML touches every aspect of Microsoft today, literally every one of these logos and many more, and every kind of client, rich clients, thin clients, whatever they may be: Xbox, phone, you name it. We're using ML in all of these places, and we're using it at enormous scale: 180 million Office users use AI features in Office every day, 18 billion queries are asked of Cortana, which is obviously rich NLP among other things, and 6.5 trillion security events are evaluated every day. There is simply no way we could operate and process this data without something like machine learning.

So this is the point in the talk where everyone says, wow, that does sound really great, except they also say this: ML is hard. And it's really on us, those of us who build these platforms in the ML community, to help folks get there, because we know something that a lot of people new to ML don't know, and it's the following. Today a lot of people new to ML think it's all about the model, and I understand why: every news article out there says Google just released BERT, or Microsoft released this, or AlphaGo did that, and it's all about the amazing model they built. But it's not. It's about the data processing and cleaning and all the various things involved in actually bringing something to production, because that's the nature of machine learning. It is many, many microservices, each of which has very specific functionality, often very specific tooling, that do very specific things well but then need to be coupled together in an intelligent way. If you just focus on the model, you're going to be in trouble.

And I know what you're saying: you're a data scientist and you don't care. I believe you, a little bit, but I'm here to tell you that you actually do, and the reason is tweets like this one: models are relatively easy to build, but they are very hard to roll out. More often than not, data scientists operate in a way they are familiar with; they understand their tools and they build using their tooling, local laptops, and local clusters, but they don't know how to reach out and roll it to production. And the reason is that you have this separation.
The data scientist over here is trying to iterate as quickly as she can. She wants to use frameworks and tooling she understands; she wants to mix and match tools, the absolute latest build of TensorFlow or PyTorch or ONNX or whatever it may be. Somebody from Carnegie Mellon just launched a brand new tool around reproducibility and interpretability, and she should have the ability to use it. She also doesn't want to worry about management, because it's just her laptop: if something goes wrong, she can flatten it and restart. And on the other hand she wants unlimited scale; she's got a paper due Thursday at 5:00 p.m. and she wants all the GPUs in the world to achieve the results she needs. On the other side you have the SRE, and she needs consistency, she needs observability, she needs to reuse tooling that's already been approved by her organization, because it has support built in, and she needs uptime. If things are constantly changing under the hood with no record of what's going on, that's not going to work. I'm here to propose that we can bring them together, and we're going to do it through MLOps.

Before I begin, I want to talk about what the end-to-end machine learning lifecycle looks like. You saw a bunch of boxes connected together with arrows there, but the main things I think we need to identify and solve for in an end-to-end lifecycle are what you see here. First you have the development and training of the model. Then you have to package it in a way that can be used and migrated to production. You want to validate the model's behavior before you roll it out. Then you deploy the model, and then you monitor it, and monitoring is not just about having it out there and making sure it's up; it's about taking all of that data and feeding it back into the original pipeline so you can train again and be smarter about it.

You might say you've heard this before, and the reality is you have. You heard it several years ago when things started getting kicked off around GitOps. That was the idea that you could start with Git and record everything you were doing relative to your overall pipeline, iterate very quickly on that pipeline, and then, once that pipeline had passed all its tests and humans had looked at it and said yes, this is ready to go, you trigger a second cycle, and that second cycle rolls it out to production. But again, it's all driven off of Git; you don't have someone in the middle introducing new changes, because those wouldn't have been tested and observed and shown to pass all your bars. What that really gets you is those lines at the bottom: you get velocity plus security.

So what we need is MLOps, and MLOps is going to look like this. You start with the data scientist. The data scientist is able to iterate exactly as quickly as she could before, and potentially even faster, because you're giving her best-of-breed tools and putting her in a system where she doesn't have to think about whether this Python dependency works with that other package she had over there; you give her a path to install and manage these things in a very regular way. And then from that you start the second half: she finishes, she checks in, and the work moves forward through dev, integrating it into the application itself, and then integrates again when it comes to rolling out to production.
And when you do that, you get these benefits. You get great observability: you know exactly what's being rolled out, so you're able to observe it using standard tooling. You get validation, because the work is checked in and known good: you can do static validation, and you can do runtime validation by rolling it out to canaries in a very regimented way. And then of course you get reproducibility and auditability: whatever that query was, that prediction that happened at the end of that very long cycle, you're able to trace back exactly what code rolled it out at every step of the way. In this way you get velocity plus security for ML.

Some of the specific components here, and I'll walk through this really quickly: like I said, you have the data scientist, and she checks in to Git. At that point you do code, dataset, and environment versioning; you snapshot all of those things. From there it automatically builds the app and trains the model, and in that you're able to do hyperparameter sweeps and all of those kinds of things in an automated way. You validate the model at the testing point, using things like model validation and certification. You release the app, and then finally you roll it out. That's what we're going to try to get to here today, and we're going to do it right here on stage for you. Well, technically.

So the question is: this sounds pretty good, I'd like to take it on, what should I do to achieve it? Well, I have a suggestion for you: go join one of these big companies, because they already have it. At Google it's TFX, at Facebook it's FBLearner Flow, at Uber it's Michelangelo, and Microsoft has our own, which is called Aether. What these are is a very standard data science platform, where a data scientist can come in, interact at a very high level, see other people's experiments, see her own experiments, iterate on and version her own experiments, and then basically download and begin to use those experiments rather than starting from scratch every time; she can start there and move forward rather than resetting everything.

So you have rejected my suggestion; that's very harsh of you, you don't want to work at one of those companies. So I'm going to help you build your own MLOps platform: first with Kubeflow, second with Git in the middle, whether that's GitHub or GitLab or Bitbucket or whatever it might be, and finally with a CI/CD platform. By the way, I did my best to be very neutral here, all logos are the same size. And if you are thinking of inventing a new CI/CD platform: there are so many, there are so many CI/CD platforms. It's enough. Just go help one that already exists, I promise you.

Okay, joking aside, let's do MLOps in the real world. I'm going to take on a challenge: many people, and I don't know if Corey is here, Corey, are you here? I know you're attending KubeCon but I'm not sure you're in the room. Corey doesn't believe that multi-cloud exists. I'm going to say that it does, and I'm going to say it does in the following way: multi-cloud really does exist in the real world, and I'm going to lay out a very standard scenario and then solve it for you on stage. When I say multi-cloud, this is what it looks like: at the top you have cloud, whatever cloud it might be, and at the bottom you have on-prem.
On-prem here is really a special case of a second cloud: it could be a second cloud, it could be on-prem, it could be your local laptop, it doesn't matter. And in the middle you have Git. The reason you've chosen that top cloud is because it offers distribution your data center can't offer you today: maybe it's uptime, maybe it's locality for regulatory reasons, maybe it's just closer to your IoT deployments and you don't want big latencies, and so on. The reason you still have stuff on-prem is because you actually have a lot of data that you've been collecting for many years, and it's a big pain in the butt to move it to the cloud. We see this scenario all the time, whether it's a poor-latency connection or you just don't want to spend many millions of dollars on petabytes of data being held in the cloud when you already have a bunch of disks doing the job.

Then we have our two cast of characters. We have our data scientist: she wants to iterate very quickly on the model and make sure it works, and she's going to want to train next to the data. The SREs and the ML engineers, on the other hand, are going to want to host in the cloud. So let's do this. First, she checks in her code to Git. Git is watched by the CI/CD pipeline for a new check-in, and that automatically triggers a Kubeflow pipeline. That Kubeflow pipeline will process the data first, in whatever way it needs processing, or, if it's already been processed and checkpointed, it can skip to the next step, and Kubeflow is smart enough to do that. It then runs a training run automatically on that processed data, again using whatever parameters you've passed in via Git; this isn't something where the data scientist is picking them on the fly, you want to be extremely prescriptive about this. Once that's done, it registers the model at a central point, from which it can be picked up and rolled out to the cloud, and in this case you would roll it out to something like a canary endpoint or a staging endpoint to make sure it runs. A human being, often after the tests have passed, will come by and make sure everything is operating properly, and then that human being will say, okay, it's time to trigger the next step, and at that point the model gets rolled out to a public service endpoint and used in production. Okay, so that's the high level.
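As a rough illustration (this is not code from the demo repo), a pipeline like the one just described could be sketched with the Kubeflow Pipelines Python SDK roughly as follows; the container images, paths, and parameters are hypothetical placeholders, and the volume/blob mounts are elided for brevity.

```python
import kfp
from kfp import dsl


@dsl.pipeline(
    name="preprocess-train-register",
    description="Sketch of the flow described above: preprocess, train on a GPU, register.",
)
def ml_pipeline(data_path: str = "wasb://data@myaccount.blob.core.windows.net/images",
                epochs: int = 5):
    # Step 1: preprocess the raw data (a cached/checkpointed step can be skipped).
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="myregistry.azurecr.io/preprocess:latest",   # hypothetical image
        arguments=["--data-path", data_path, "--output", "/mnt/processed"],
    )

    # Step 2: transfer-learning training run, scheduled onto a GPU node.
    train = dsl.ContainerOp(
        name="train",
        image="myregistry.azurecr.io/train:latest",        # hypothetical image
        arguments=["--input", "/mnt/processed", "--epochs", epochs,
                   "--output", "/mnt/model/model.h5"],
    )
    train.set_gpu_limit(1)
    train.after(preprocess)

    # Step 3: push the trained model to a central registry.
    register = dsl.ContainerOp(
        name="register",
        image="myregistry.azurecr.io/register:latest",     # hypothetical image
        arguments=["--model", "/mnt/model/model.h5"],
    )
    register.after(train)


if __name__ == "__main__":
    # Compile to a package that the CI/CD system can hand to the Kubeflow Pipelines API.
    kfp.compiler.Compiler().compile(ml_pipeline, "pipeline.tar.gz")
```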
Let's see what this actually looks like when it's running. This is going to be a bit awkward because I'm blocking the screen the whole time. The one downside of doing demos once you move into ML is that you have to record all your videos, because everything takes forever. I promise you, you can look at the date (anyone see a date on there?), this was literally run last night, and you can go to this repo, which is public, download all the code, and have a good time.

Okay, is this running? It is. So first, what you see here is a very standard repo, and it's got all the code in it. Let me show you exactly what this code looks like. They always say never show code when you're doing a demo because everyone's eyes glaze over, but this is a very smart room, so I'm going to trust that your eyes will not glaze over. What you have here is a very, very standard CI/CD pipeline. This is using Azure DevOps, which is what we use, hosted DevOps CI/CD on Azure, but the YAML is going to look super familiar because it follows very standard CI/CD practices; it looks just like Jenkins. At the top you see a trigger, which means it triggers off master once you check in, and then you have individual steps. This one does a build of a container, and we do three builds here, because that's what Kubeflow requires: Kubeflow needs every one of your steps to be built into a container, and then you roll those forward into your Kubeflow pipeline. What the pipeline looks like is this, and actually, I'll get to that in just a second.

So we'll go back to our video here. Like I said, what you see is a standard repo, the code I was pointing at earlier, with standard pipelines, and this is a very important one: it does burritos versus tacos. You upload an image and it'll tell you whether it's a burrito or a taco. It's a harder problem than you would think: a half-open burrito, you get the idea. Moving on, you can see that the code in the repo is doing transfer learning using a very standard model called MobileNet, and we layer on top of that. Like I said, this is just straightforward pipeline code, and these are the pipeline steps you'll see here; I'll get into those in a second, but the idea is to show how, when you check in, it kicks all of these things off.

So we come back here and we hit merge on our pull request, and it kicks off, there we go. This is the UI for Azure DevOps; all we did was hit merge and all of this kicked off. You can see right there that build number nine has just kicked off, and it's dated May 21st, so I'm being honest here, and you can see how things are transpiring. In this case it's going to go through those three build steps, where it builds each one of the containers using standard Azure build tools, and again, this could be Docker, this could be any builder, anything that makes sense for you; you've all seen docker build many times, so I'm going to fast-forward through those. Then this last step is where you take all those containers and upload them, in this case to ACR, but it could be any Docker registry.

Now that that's done, we have the interesting bit, because that was the build pipeline. Our steps here are, first, a build pipeline that produces those containers, and then what's called a release pipeline, and you want those to be separate: you want to make sure the build is complete and all the artifacts pass before you do the release. The release pipeline is where it reaches out to Kubeflow and executes, and you can see it here: it's reaching out using a Swagger client that we wrote to connect directly to the Kubeflow Pipelines API. You can see it forwarding the connection, and here you can see the runs we had previously; we were at run 84, we had a lot of debugging, not Kubeflow's fault, our fault, and presto, we're at run 85. Again, I want to call this out, because it's a small but special thing: all I did was use Git, all I did was check in and merge my final PR, and it kicked off all of this goodness.
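For context, the demo's release stage talks to the Kubeflow Pipelines API through a hand-written Swagger client. A hedged sketch of the equivalent call using the standard kfp Python client (the endpoint, experiment name, job name, and parameters below are assumptions, not values from the demo) might look like this:

```python
import kfp

# Hypothetical endpoint: the Kubeflow Pipelines API service, for example exposed via
# `kubectl port-forward svc/ml-pipeline 8080:8888` or an ingress.
client = kfp.Client(host="http://localhost:8080")

# Group runs under an experiment, then start a run from the compiled package,
# passing the parameters that came from the Git check-in.
experiment = client.create_experiment("burritos-vs-tacos")
run = client.run_pipeline(
    experiment_id=experiment.id,
    job_name="ci-build-9",                      # e.g. the CI build number
    pipeline_package_path="pipeline.tar.gz",
    params={"epochs": 5},
)
print("Started Kubeflow Pipelines run:", run.id)
```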
So now we're stepping through, and you can see it running. This is a standard Kubeflow pipeline, as I already described. It does preprocessing here, mounting in Azure Blob, and you can see it watching the logs right there, and you can see the artifacts. Once that's done, it automatically moves forward: the container has been created, and here it's actually doing a training run. It's using the latest TensorFlow 2.0, and not just TensorFlow 2.0, let me highlight that right there: it's using GPUs natively on Kubernetes, you can see it directly. And it runs quite fast, much faster than it ran on my local machine, so we'll skip forward.

Now it's doing registration, and this is where you start to mix and match between things running in Kubeflow and things you may want to run in the cloud. In this case it's going to do this last register step in Azure. We have a service called the model management service, which is basically a sophisticated storage location that understands machine learning concepts in a first-class way. By that I mean it understands whether something needs versioning, how to compare versions with each other, and what the performance of something is, and it lets you visualize all of that in a first-class way. So what it's going to do here, at this last step: here are three versions that we ran already, and once it's done you can see it appear in the model registry service, it's going to kick off any minute. You can see here it's using, in this case, the Azure SDK natively inside of Kubeflow; it's able to reach out, and you can see it using our service principal, the service password, and the subscription information. It takes that .h5 file and simply pushes it up to the model registry service, which then kicks off the second half of the solution. And there, roughly, is what it does when rolling things out; here is where you can see some of the richness we're able to upload and the metadata we have attached to these models.
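As a hedged sketch of what a register step like that does (the workspace name, identifiers, and tags below are placeholders, not values from the demo), the Azure ML SDK lets a pipeline step authenticate with a service principal and push the trained .h5 file into the model registry:

```python
from azureml.core import Workspace, Model
from azureml.core.authentication import ServicePrincipalAuthentication

# Headless authentication inside the pipeline step: a service principal instead
# of an interactive login. All identifiers below are placeholders.
auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    service_principal_id="<app-id>",
    service_principal_password="<client-secret>",
)

ws = Workspace.get(
    name="my-ml-workspace",                  # hypothetical workspace name
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
    auth=auth,
)

# Push the trained Keras .h5 file into the registry; registering the same model
# name again creates a new version, which enables version-to-version comparison.
model = Model.register(
    workspace=ws,
    model_path="model.h5",
    model_name="burritos-vs-tacos",
    tags={"framework": "tensorflow-2.0", "run": "85"},
)
print("Registered", model.name, "version", model.version)
```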
So that's what MLOps looks like in reality. From that point forward, I just rolled out a service endpoint; everyone has seen that, nothing special there. But that's what we're talking about: this is a setup where you had a Kubeflow cluster that was in no way connected to any of the more sophisticated Azure or other components, just running standard Kubernetes and standard Kubeflow. The reality is that this gives you the opportunity to do the training in a way that makes sense for your organization. You don't have to move your data, you don't have to change the way your processes work, you don't have to adopt some crazy new standard just because it looks slightly faster. This is an open platform where you can pick and choose and plug in the components that make sense to you, and that's the power that things like open source and things like Kubeflow provide.

Now, I know what you're saying: it sounds like a lot of work. I'd like to push back on that. You know what's a lot of work? Eleven months sitting around without your model shipping to production. That's a lot of work. This? Okay, you have to spend a few minutes making sure that service principals are able to cross clouds, but other than that you'll be fine. And you're not just fine, you're actually moving faster, and that's what we're really trying to provide here. Data scientists understand their problem areas, but too often they have thrown things over the wall and asked production to deal with a lot of the hard parts. What we're saying is that, through things like MLOps, we want to bring the same advances that GitOps brought to app developers. We want to help data scientists be part of how things get to production, to have end-to-end ownership, and even better, to help them learn real software engineering best practices. For too long we've said, hey, data scientists, don't worry about best practices, come back when you're older. No, that is not okay. These are software engineers; they want to help get things into production, they want their models to be used just as much as you do. We need to help them bridge the gap, but not by forcing them to figure out Git branching strategies on their own.

On top of that, it enables continuous delivery. The automated pipeline you saw earlier can be run every night, and that is something we learned over years: all of those big production teams you saw earlier, the ones able to continuously iterate using these MLOps platforms, are all moving to continuously training these models. The reason is that, unlike a lot of software, models go stale very, very quickly. You need to continuously train them, you need to continuously watch your data and make sure things haven't drifted one way or another. People change, code changes, situations change; continuous training is the only way you're going to be able to deliver value at scale over time.
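One hedged way to wire up that kind of nightly retraining with the Kubeflow Pipelines SDK (not something shown in the talk; the endpoint, experiment name, and cron schedule are assumptions) is a recurring run created from the same compiled pipeline package:

```python
import kfp

client = kfp.Client(host="http://localhost:8080")        # hypothetical endpoint
experiment = client.create_experiment("nightly-retraining")

# Schedule a retraining run every night at 02:00; Kubeflow cron expressions
# include a leading seconds field. The package and params are the checked-in ones.
client.create_recurring_run(
    experiment_id=experiment.id,
    job_name="nightly-train",
    cron_expression="0 0 2 * * *",
    pipeline_package_path="pipeline.tar.gz",
    params={"epochs": 5},
)
```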
And then finally, it's impossible to overstate how much value you get by instituting lineage and auditability. Unless your code is checked in and everything moves forward from that point, you will not be able to defend, in a court of law, or to your execs, or to the software engineer sitting right next to you, exactly what's going on in production. And again, because we haven't yet spent the time to make MLOps easy, we're putting that onus on the software engineers, the developers, and the data scientists to build their own solutions, and oftentimes they're going to miss something. So it's up to us to give them a great platform to do that.

I always like to end my talks with a slide like this, because it's so important. The reality is that the people in this room, and the people watching on video, you're already there; you already know a lot of this stuff and you're working towards it. Data science will affect every industry in the world; ML and AI will affect every industry in the world. It's up to us to figure out how to bring our data science to these people and help them, the people who have the domain knowledge, be smarter about what they're doing. Kubeflow is open; we would love for you to come and contribute to all of the work we're doing around Kubeflow and MLOps. You can see a bunch of it here; everything I demoed on screen is open source and ready for you. It is built on Azure Pipelines, but like I said, that's very portable to other pipelines, and we would love to help you do that. We actually have an entire repo dedicated to MLOps; right now it's hosted under the Microsoft organization, and we would love to host it in a more general place where people can share. And then of course, within Kubeflow we're talking about MLOps constantly; the work we're doing around pipelines is specifically designed to help solve this very critical problem. Let me also give a quick shout-out to the other talks here at the conference around Kubeflow, please go see them, and other than that, thank you very much for the time.

Okay, with that, we have a few minutes for questions. What can I answer for you? If anyone has a question, make yourself visible and I will bring the microphone. This one here. (Oh, everyone's running in production? I don't believe any of you.)

Q: Do you have any examples of hyperparameter tuning, sample runs, using Katib?

A: There are examples right now in the Kubeflow repo for how to run Katib, but we haven't built a specific MLOps pipeline for that yet. I think that's a wonderful thing, and if you're interested I'd love to collaborate with you; that's a great thing we can work on soon.

Q: What's your recommended approach to versioning large datasets, when they are huge amounts of images or videos?

A: Yeah, we get that question all the time. So the question was: what's your best practice for versioning datasets, checkpointing, and things like that? Most data providers today will have a way to do snapshotting or versioning, and a lot of it comes down to the underlying data store. Large data providers, Spark, SQL Server, MySQL, and so on, will have ways to accomplish this in their own systems, but a lot of it comes down to what the immutability of your data looks like. If you have a best practice around, for example, audit trails for your database, then that might be enough: you could simply say, okay, I'm going to do SELECT * FROM the table I like, between January 1st 1970 and January 1st 1980, and that could be enough. As long as that dataset hasn't changed, you know you can always go back and recreate it. If you need things like deltas and more complicated structures, our recommendation is to work with whatever data provider you have, because most of them have something. There are also a number of open source tools doing very interesting things around this: Pachyderm has some stuff, DVC has some stuff, there's a bunch of really interesting work in this space right now. But the net is, I would start with wherever you're storing your data today and go forward from there. We do get that question a lot, though, so if you have thoughts around it, I'd love to pull them into Kubeflow and some of this other work.
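To make the date-bounded-query idea concrete, here is a minimal, hedged sketch (the table, columns, and local database file are hypothetical) of pinning a training set to an immutable slice and recording a digest so the run can be reproduced later:

```python
import hashlib
import sqlite3  # stand-in for whatever warehouse actually holds the data

# Pin the training set to an immutable, date-bounded slice (as suggested above)
# and record a content digest so the exact slice can be verified or re-pulled
# for a later retraining run. Table and column names are hypothetical.
QUERY = """
SELECT * FROM events
WHERE event_date >= '1970-01-01' AND event_date < '1980-01-01'
ORDER BY event_id
"""

conn = sqlite3.connect("warehouse.db")
rows = conn.execute(QUERY).fetchall()
conn.close()

digest = hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()
print(f"{len(rows)} rows, snapshot digest {digest}")
# Commit QUERY and the digest alongside the pipeline code: if the source data is
# append-only and audited, the same query reproduces the same training set.
```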
Q: If I understand correctly, there are two main pipelines, one for training and one for serving. How do you version the model created by the training pipeline?

A: So the question was: of the two pipelines I demoed on stage, how do I version the model? As you saw, what I used is a service from Azure called the model management service. There are many services out there; ModelDB is one, and it's already integrated into Kubeflow, and so on. More often than not, you'll want your model registry service, whatever it may be, to be separate from your Kubeflow deployment, because oftentimes those exist in different planes and one has a much longer persistence than the other: Kubeflow is often brought up and torn down quite regularly, whereas you want your model registry around forever. To be clear, though, I demoed two pipelines on stage; that could have been one pipeline, that could have been six pipelines, it's whatever makes sense to you. Generally speaking, the best practice is to have one pipeline per environment, if you will, or per particular function. What you didn't see here is that oftentimes you'll also have a data pipeline for doing your overall sharding and processing of your data and moving it to its final location; I did pre-processing as part of the pipeline you saw on stage. And like I said, you can use the Azure model management service, nothing about Kubeflow is required there, it's just an open endpoint, or you can use anything you would want to host yourself.

Q: [Partially inaudible] Do you really need another service for that? Isn't the model just a file?

A: Yes and no. The model is just a file, that is absolutely correct, but as you saw on stage, what you'll frequently want is more information about the model, or more sophistication about how you store it. As you saw, I recorded that, even though there were in fact four different models there, they were all part of a single model training job, if you will, and they had different versions. The reason you want to do that is because it gives you the opportunity to do inter-model comparison for continued, ongoing training runs and so on. Again, that is absolutely something you can achieve yourself; I talked about Azure, but this is something you can definitely do on your own, it's just a matter of whether you want to host it yourself or have someone else host it. Thanks. Okay, other questions? One here.

Q: I noticed at the beginning of the Azure pipeline there was a preliminary step to build some containers; I guess those are the images used in the Kubeflow pipeline. What do you think about bringing the image-building step within the Kubeflow pipeline itself, so that the pipeline could become entirely portable?

A: The important thing to take away here are the steps in the pipeline; the unimportant thing is the exact tools we chose. For example, there is already a container builder inside of Kubeflow that could be perfectly great for you; we just happened to choose this one because the software engineer and I who were working on it knew how to do it very quickly, and that's basically it. The more important bit is that you want to store both the commands you used to execute it and the artifacts, and you want to store them separately, so that they persist beyond the single run. Does that answer your question? Okay, great.

I think we're pretty much at time. We are at time, but I'll be up here and I'll be around for whatever you'd like. Thank you very much. [Applause]
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 7,332
Keywords:
Id: lu5zHvpQeSI
Length: 33min 30sec (2010 seconds)
Published: Fri May 31 2019