Accelerating the Machine Learning Lifecycle with MLflow 1.0 | M. Zaharia, A. Davidson, G. Buehrer

Captions
Hi, welcome everyone. It's great to see you here on the second day. I'm excited to kick off today with an update on MLflow, the open source machine learning lifecycle framework that we launched less than a year ago at Spark Summit.

First of all, why have a machine learning lifecycle framework at all? What's the problem? The problem is that machine learning development is very complex, and anyone who has tried to build a production application with machine learning can tell you a lot about that. It's a lot more complicated than a traditional software application. So I want to explain some of the challenges, then explain how a framework like MLflow helps, and then of course give you an update on the things that have been happening in the project.

To understand why machine learning development is complex, let's look at a typical machine learning lifecycle. Machine learning applications have this kind of continuous loop: you take in data, you prepare the data, you do model training, and then you do deployment. Once the application is deployed, if it's doing something important, you probably want to monitor it, and you always want to collect new data and update your model based on what's happening in the world. Machine learning applications have to learn from data, so you need to continuously feed them recent, high-quality, relevant data. Keeping this cycle going is the first challenge.

The second challenge is that each of these steps involves many different software systems and libraries. Your data can come from a variety of storage systems, and you really want to combine all the data available to get a good model; you can't just use one source. Your data preparation and training can use hundreds of open-source libraries and algorithms, and in machine learning you're always experimenting, trying to find the best model and the best pipeline, so you can't just say you'll only use one thing there. Likewise in deployment: once you build a model, you may want to deploy it to a web server, deploy it to mobile phones, use it in batch jobs, or compare it with previous models, and you need to support all these modes.

To make things even more complicated, some of these steps have tuning parameters, so not only do you have to keep track of which algorithms you're using, you also have to keep track of the many parameters that affect performance in order to get the right result. Then there's the challenge of passing your models between these steps: as you build a model using lots of data sources, data prep functions, and so on, how do you get it to work the same way later when you have to rebuild it? How do you deploy it and make sure it's actually performing the same computation once deployed? And finally there are monitoring and governance: how do you make sure this whole thing keeps working, how do you make sure it's compliant, that you understand what's going into it, and that you can fix it if it's not using the right data or is doing something wrong?

So it's a pretty complicated process, and organizations often have to go through all these steps and come up with a separate solution for each ML application they build, which really makes it difficult to build these applications.
So what are people doing about this problem? One trend that has really started in the past few years is companies building custom machine learning platforms. Some of the best known ones are the internal platforms at large tech companies, such as Facebook FBLearner, Uber Michelangelo, and Google TFX, but every company we talk with is thinking about this problem and designing some kind of platform to help with at least some stages of the machine learning lifecycle.

These platforms are very powerful. What they do is standardize the data preparation, training, and deployment cycle, so as long as you work with the APIs inside the platform, you get an application or a pipeline that can easily be deployed, monitored, and updated. They're being used for hundreds of machine learning products at these companies and others, and they really accelerate development and production use, so designing an ML platform is a good idea. But there are also some challenges with the way people are doing this today. The first challenge is that each platform is often limited to a few algorithms or frameworks, whatever the platform developers are able to support and want to put in, and this goes against what I showed earlier: in machine learning you always want to be able to experiment with the latest and best algorithms and frameworks. You don't want to be blocked on a team because they don't have the bandwidth to support the latest deep learning framework, so it creates a tension. The second issue is that each platform is very customized and tied to each company's infrastructure, so there's no sharing of common work across them.

About a year ago we looked at this space and thought: this is a good idea, but how can we make it better? We asked whether we could provide similar benefits to these platforms, but in an open manner, and that's why we started MLflow, which is basically the first open-source, end-to-end machine learning platform. The thing that's different between MLflow and a lot of the existing platforms is that it's designed to integrate with whatever machine learning library, programming language, deployment tool, or internal process you already have. We call this the open interface design philosophy: all the interfaces are things like REST APIs, command-line interfaces, or files that are easy to add around existing code, and the project is built out of components where you can easily swap things in and out, or just use some of the components, even if you have an existing pipeline that works well.

The first version of MLflow had three components: Tracking for experiments, Projects for packaging reproducible runs, and Models, which is a way to package and ship models around. That's what we started with, but we're also expanding the project to include other things. When we launched this last June as an alpha release, we really didn't know what the reaction would be: is an open-source ML platform a good idea, is the way we set it up with these components and interfaces a good way to do it, and will people actually use it or will it just get in their way? We weren't sure what would happen, and we've been super excited to see the growth of the community since then.
Ten months after launch, we already have 80 contributors to the open-source project. Just for comparison, our team working on this at Databricks is about eight engineers, so a lot of people from outside are contributing, and we see hundreds of companies using it and achieving really good results. I'll talk more about the community later, but we've been really excited with the growth and with the contributions that have made the project better.

To give a little more background on MLflow, I'll quickly show you a few of the components; you'll see them in more detail later anyway. I'll start with the tracking component, just to show the philosophy of the project. One of the first problems we saw everyone having was keeping track of experiments and eventually being able to reproduce them later. A lot of things affect the quality of a machine learning model: the code you use to build it, of course, but also the data that goes in, various tuning parameters, and so on, and people had all kinds of manual processes and found it difficult to keep track of these. So we built a very simple API you can call from your existing code to report metrics, parameters, and artifacts like objects you've built up. The MLflow tracking API also automatically collects information about your environment, for example the Git revision of the code you're running, so you can find the exact same code later. You can use this API anywhere your code runs, whether it's notebooks, local applications on your laptop, or something on a server; it's just a REST API that you call from your code. It collects the information in a central tracking server, and you get a user interface and an API to share and compare results across teams.

So what can you do with this UI? You can see all your past runs, click on a run and see its parameters and the things it produced (runs can produce complex artifacts like images and files), and you can take notes about a run later, so if you want to remember additional details you can add them there and share them with your team, and you can sort and compare runs in various ways. We've seen people use this in a lot of different manners. For example, some people use it for training: they establish a kind of leaderboard and try different models, or people on the team even compete on specific models. Some people use it for production monitoring: you've got something being retrained every night and you want to quickly compare the metrics and artifacts produced. So it's very simple; the concept is an agnostic way to log things, but there are many ways to use it to make machine learning development more reliable.

The other two components are also pretty simple. MLflow Projects is just a way to specify, alongside your code, what dependencies it needs in order to run. You add a file to your repo that explains what it needs, and then other people can run it either locally or by submitting it to a cluster, and you don't have to spend two days emailing back and forth with someone to figure out the environment needed to run their project; you can just run a command and get the exact same environment. So it's really good for sharing code.
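To make the tracking and projects pieces concrete, here is a minimal sketch of what the API described above looks like in code. The server URL, experiment name, parameter names, and file names are made-up examples, not anything from the talk.

```python
# Minimal MLflow tracking sketch (illustrative values only).
import mlflow

mlflow.set_tracking_uri("http://my-tracking-server:5000")  # hypothetical central server; defaults to ./mlruns locally
mlflow.set_experiment("churn-model")                        # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)    # tuning parameters
    mlflow.log_metric("auc", 0.91)           # metrics, which can be logged repeatedly as training progresses
    mlflow.log_artifact("roc_curve.png")     # arbitrary files: plots, data samples, and so on
    # The Git commit, source file, and user are recorded automatically when run from a Git repo.
```

For MLflow Projects, the same repo just needs an MLproject file describing its entry points and environment, after which anyone can reproduce it with something like `mlflow run <git-or-local-uri> -P n_estimators=100`, locally or against a cluster.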
The last component, MLflow Models, is a similar packaging system, but for packaging models, and in our case the models can contain arbitrary code, whatever you want to put in there. The idea is that you can use many common libraries to set up a model, or even write your own code, and you get a little package that is just a folder of files. Then there are built-in tools that can take the same model and deploy it to REST serving, batch and stream processing, or debugging tools down the line, and all of those tools can be agnostic to which library was used to build the model. This is really great for handing off a model between users and teams with different expertise and getting the same result.

So that's a quick tour of the project. To go back to the community: as I said, the community has already grown quite a bit. We like to track how many people actually send patches, and we've got 86 contributors already from more than forty companies. The other exciting thing is the kind of patches people are sending: we've gotten a huge number of large external contributions from companies outside Databricks. These include the database storage backend, integrations with lots of libraries, Docker packaging for projects (contributed by an energy company in Brazil), the plugin system, the R API (contributed by RStudio), and lots of new visualizations (contributed by a financial company in New York). So there are many interesting features, a lot of things we hadn't thought of, that make the project better, coming from around the world as people use it.

To put this in context, I was also involved early on in Apache Spark, working with early users and trying to build a community around it, and for Apache Spark, even though people were using it early on, it took about three years from launch to get to 80 contributors. For MLflow we got there in about nine months, so we're really excited about the interest, and we want to keep growing this project and fostering a great community for everyone involved.

Another way to showcase the growth of the project is the integrations that are built into it. When we launched last year, MLflow only supported Python, it had a handful of libraries it could automatically capture models from, and it had a couple of storage backends and a few deployment systems it could deploy to. Now all of these have grown: we also support Java and R, many more libraries are built in so you can easily capture models, there are many more storage backends, and a lot of deployment systems (in fact, the next one, which I'm really excited about, is currently in development in a pull request, so that's also exciting to see), and I think a year from now we'll see even more logos on this slide.

And a little bit about some contributors. The ones at the top have given public talks about it: Edmunds and Brandless are two Databricks customers who spoke at the MLflow meetup here in February, and Comcast, Showtime, and Gojek are all talking at this summit. Comcast is actually talking right after me, so I won't steal their thunder, but they're doing really cool things with it. Gojek is also a very cool company, a huge ride-hailing and everything-else app in Southeast Asia, and they've been using MLflow as well.
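To make the MLflow Models idea above concrete, here is a hedged sketch of logging a scikit-learn model and loading it back through the library-agnostic pyfunc interface; the toy model and names are made up, and the API names follow recent open-source MLflow releases.

```python
# Sketch: package a model in the MLflow Models format and load it back generically.
import mlflow
import mlflow.pyfunc
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])    # toy model for illustration

with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "model")                # writes a folder with an MLmodel descriptor

# A downstream tool can load it without knowing it was built with scikit-learn:
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
print(loaded.predict([[0.7]]))
# Or serve it locally over REST with: mlflow models serve -m runs:/<run_id>/model
```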
And of course we have other tech platform companies using it: we heard from Microsoft yesterday, RStudio has contributed a lot, and Splice Machine and Faculty are two other companies building products based on it and contributing back. The one major announcement on this front at this summit is Microsoft joining the MLflow community and adding the MLflow tracking API to Azure Machine Learning as a way to capture results. We're really excited to see that, and to talk more about it I'd like to call up Greg, the chief architect of Azure Machine Learning, to give you a quick demo of MLflow over there.

Hi, I'm Greg Buehrer, chief architect of Azure Machine Learning. I'm very much on the dev side of things, so talking in front of this many people with brand-new software and doing a live demo should be pretty interesting. So what is Azure Machine Learning? Essentially it's a set of managed services that handle the machine learning lifecycle you heard Matei talk about a little bit ago. There are a lot of complicated steps, as he said, and we have a lot of custom things at Microsoft that try to make that easier for you, including FPGAs, custom hardware, special featurizers, automated ML, hyperparameter tuning with HyperDrive, a lot of services like that. One of the really important services, as Matei was pointing out, is experiment tracking. When we started our experiment tracking service two years ago, there really wasn't an open-source standard, but in working with Matei and Andy and the team over the last year or so, mostly on deployment of MLflow models on Azure, we realized our philosophy around experimentation was really quite similar. Looking at that, we said: how can we snap to this open-source standard? It has a ton of energy and it really fits what we're looking for. So over the last couple of months that's what we've done, and what I want to demo today is native support for MLflow in Azure Machine Learning for your experiment tracking.

The way I'm going to do that is with this notebook. It's a Databricks notebook running on a Databricks cluster, and I'm going to launch two training jobs: one will run natively in the cluster, and one will run on a remote MPI cluster. What I want to show is that it sort of doesn't matter where you train these things; you can move all of them to a common experiment tracking platform in Azure and then leverage the power of Azure across all your products.

The first thing you need to do is set the URI of your tracking server; if you've used MLflow before, you know this is the first step. I've linked my workspace in Azure ML to my workspace in Databricks, so all I have to do is make this one function call, and now I'm set to track to Azure. The next thing I do is set an experiment name. This is really just a string that lets you group runs so they're easy to compare against each other; if you've used MLflow you're very familiar with this, and you can see I'm just using the standard MLflow client that you can get in the open-source community. The actual experiment we're going to run is basically MNIST: it looks at handwritten digits and tries to detect which digit each one is. It's not super important for the demo, but I thought you should know. I'm going to use PyTorch.
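Greg's setup boils down to a couple of calls. Here is a sketch assuming the azureml-mlflow integration; the exact helper used in the demo isn't visible in the transcript, and the workspace configuration and experiment name are placeholders.

```python
# Sketch: route open-source MLflow tracking to an Azure ML workspace.
import mlflow
from azureml.core import Workspace   # from the azureml-sdk / azureml-mlflow packages

ws = Workspace.from_config()                             # assumes a local config.json describing the workspace
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())    # the "one function call" to start tracking to Azure
mlflow.set_experiment("pytorch-mnist")                   # just a string used to group runs
```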
The important things to look at here are that I'm logging my metrics with MLflow here, I'm logging more metrics with MLflow here, and when I run, I use mlflow start_run, which is standard MLflow procedure. When I'm done, I log my model, my PyTorch model, using MLflow. So let's get that kicked off; it appears to be running, so it's off and running.

The next thing I want to show is launching a remote job. It's essentially the same job, using the same MLflow APIs, but it runs on this remote cluster, so I have to do a little gobbledygook here to set up the environment. I've actually already provisioned the cluster, so if you look at these three green lines you'll see it has been set up already; I'm not going to run them again. The cluster's name is CPU cluster, and then all I have to do is say experiment equals new experiment, experiment dot submit, and now I have a job running off in that remote cluster.

The purpose of this fairly busy slide is to tell you that with MLflow you can train anywhere: you can train in Databricks, you can train with any framework, we'll then package that up in a Docker container, and you can host that container anywhere. You can host it inside Azure's cloud, on the edge, on heavy-edge devices, or locally; you can put it pretty much anywhere and still use a common framework to collect your telemetry.

So let's jump over to our experiment tracking service UI and refresh it. I've run this job a lot of times, so you'll see there are a lot of experiments in here; we've run it 67 times, and this is the amount of time each one took and the average loss, which is automatically tracked. You can see the two jobs I just launched: one of them, the local run, finished, and the remote run is still running. Let's get a look at that: you can see it's logging in real time, which is pretty cool. I can come up here and look at my different compute targets. As mentioned, we track these compute targets, so we know which compute target each job was trained on. If you generated a model, you can register it; here I have three models registered, and the versions increment, so you know exactly which model was trained with which data. You can then take that model and deploy it, and as I showed before it's really easy to deploy. I deployed one earlier today, and I don't want to take up too much time doing that now, but here's the deployment, and if you click on it, it will tell you which models were in that deployment and which version of each model, so all of this is tracked for you.

Now, back to our demo: when I built this scoring container, I just used the standard MLflow scoring container API from MLflow and then deployed it on Azure using Azure ML. Now I need to get a handle to it, because it's actually a real live URL. I've got the URL, and I've made these helper functions, so let's see if it actually works. I have a thousand random images of digits, I'm just going to pick one, it's a seven I guess, and let's see if we score it right. And we scored it as a seven, so it worked.
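The logging pattern Greg highlights is just the standard open-source MLflow API; here is a rough sketch of it, not the actual demo notebook, with a placeholder network and placeholder values.

```python
# Sketch: log metrics during training, then log the trained PyTorch model.
import mlflow
import mlflow.pytorch
import torch.nn as nn

model = nn.Linear(784, 10)   # stand-in for the real MNIST network

with mlflow.start_run():
    for epoch in range(3):
        loss = 0.5 / (epoch + 1)                      # placeholder for the real training loss
        mlflow.log_metric("train_loss", loss, step=epoch)
    mlflow.log_metric("average_loss", 0.17)           # placeholder summary metric
    mlflow.pytorch.log_model(model, "model")          # packaged in the MLflow Models format
```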
Just to summarize: we have all these services in Azure ML that are really powerful and that we feel really help customers, and now we have an open-source API through MLflow, which we're really excited to be contributors to. If you have any questions, just ping me; if you want to try it out, just ping me; and thanks for your time.

Thanks a lot, Greg. That's super exciting to see, and the Azure team has been contributing a lot to the things going into the next version of MLflow. The other thing I want to talk about in my presentation is what's next for the project, and I want to mention a few different things.

First of all, the next release coming out is MLflow 1.0. This is already in active development, it's going to be finalized this May, and it basically stabilizes the APIs of MLflow for long-term use. It fixes a few small things we discovered as the project was being developed, and it also adds many frequently requested features: for example, a better search UI, Hadoop file system storage (which has been merged in), x coordinates on metrics, simpler commands, pagination, Windows support, and various other things people have been asking for. It's in active development and we're excited to see it coming soon.

The other thing we want to do as people adopt the project is add new components. We've been trying to get a lot of feedback from the community; for example, we ran a survey at the start of the year on our website and got a lot of responses, and of course we talk to a lot of active users as well. When we started the project we designed it to have modular components precisely so that we could add new ones in the future, and today I'm excited to announce two components that we've been prototyping and developing at Databricks and that we plan to roll out later this year.

The first component is MLflow Workflows, which is a way to easily share and edit multi-step pipelines. With MLflow Workflows you can define a multi-step pipeline in code by calling a few different projects, but you can also see it in the UI, actually edit it, and resubmit it with new parameters. We found that people often want to quickly change a few parameters, or change which step they're using, without going back to the code and figuring out how to rebuild everything from scratch, and we want to make that easy. The workflow system can also automatically reuse results: if you ran a step before with the same parameters, we can reuse it. And when building this, like the rest of MLflow, we wanted it to be easy to integrate with existing infrastructure. We didn't want to create yet another workflow system, so instead we designed it to run on existing job schedulers like Apache Airflow and various others, so you can plug it in and run these workflows on your platform of choice. We think this will help a lot with sharing and editing complex pipelines.

The second component, which was actually the most requested one in our survey, is the MLflow Model Registry. This is a set of new features in the tracking server that lets you manage, tag, and version models, and also keep track of where they're deployed, which version is deployed, and so on. You can take anything packaged in the MLflow Model format and register it as a registered model, you can have different versions of it, you can deploy those versions to various inference systems, you can see who's using the model, and you can add metadata about it, like notes. This is also something we want to be able to plug into existing registry systems people have, so it's not yet another thing you have to manage if you already have a way of managing models.
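The registry shown in the next demo was still a prototype at this point, so purely for illustration, here is roughly what registering and promoting a model version looks like with the registry API as it later shipped in open-source MLflow; names and details may differ from the prototype being described here.

```python
# Illustrative only: register a logged model and move a version through stages.
import mlflow
from mlflow.tracking import MlflowClient

run_id = "..."  # a run that logged a model under the artifact path "model"

result = mlflow.register_model(f"runs:/{run_id}/model", "tweet-recommender")  # creates the next version

client = MlflowClient()
client.transition_model_version_stage(
    name="tweet-recommender",
    version=result.version,
    stage="Production",       # e.g. None -> Staging -> Production
)
```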
So these are two things we're building, and to show you some prototypes of them in a demo, I'm excited to introduce Aaron Davidson, who is a staff engineer at Databricks and will give you a demo of these new components.

Hello everyone, excited to be here. I'd love to stop and chat, but we have a lot of material to work through, so let's get started. Today we'll be working on a relatively simple recommendation use case, but we'll see how even in this simple case we run into some pretty complex problems when we try to build a production-ready, end-to-end machine learning system. We'll be recommending tweets to users. I already have an endpoint deployed, but right now it ignores the user name and just returns popular tweets. For example, we see a haiku, we see a tweet about Beyoncé's Google Calendar, and we see a motivational quote. These are all perfectly good tweets, but they don't really speak to me. I'm a very simple person: I like two things, companies going IPO and cats, and this is neither of those.

So how can we make a model that speaks to me? We might train a model like this: we take tweets and produce tweet embeddings, mapping tweets into a vector space such that semantically similar tweets are close to each other. We might then take those tweet embeddings and produce user embeddings, so that users with similar preferences are also close to each other in that same vector space. We might then create a training dataset based on how users have interacted with tweets in the past, and with all three of these inputs we can produce a ranking model that recommends specific tweets to specific users.

Even in this simple workflow, though, some problems arise. What if different teams own different steps? What if different parts of the pipeline are in different languages, like Python versus Scala, or different frameworks, like TensorFlow versus Spark ML? What if some parts of the code are in GitHub, other parts are in a local IDE, and another is in Jupyter? How do we string together all these different steps to make a reproducible pipeline and get the best results? And once we have good results, how do we actually deploy them to production? How do we do that deployment, how do we update it once it's deployed, and how do we track what's going on where?
Let's start with the multi-step workflows. This is the new component coming in MLflow, and you can see it's a very simple definition language for defining DAGs. In this case we have a step called tweet-embedding; it points to a Git URI, and at that URI is an MLflow project, which Matei talked about. That project contains the code for the step and any dependencies, such as a conda YAML or a Dockerfile. We can also pass in parameters; in this case we pass in today's date, because we want the most recent tweets. We have another step for creating a training dataset and another step for user embeddings. That one is interesting because it depends on the tweet-embedding step: this syntax says that the tweet-embedding step will output its artifacts to a cloud storage location, like S3, Azure Blob Storage, or maybe HDFS, and the user-embedding step will pull those in when it runs. Finally, we have our training step, which also lives in GitHub, takes all the other inputs as parameters, and has its own regularization parameter as well.

To give you an example of one of these steps, let me dive into the training step. This is the GitHub code that underlies the training step in the pipeline. You can see that we take the user, tweet, and scoring information and produce a combined embedding, and on this combined embedding we train a simple linear regression model: we learn the weights of each feature of the embedding. Given this model, we log the F1 score using MLflow (the F1 score is just a combined precision/recall metric), and then we log the actual model as an artifact, which we'll use to do the serving at the end.

So that was a single step; now let's run the entire pipeline. We open the MLflow UI, and we can see this new visualization showing how the DAG is executing. In this case the tweet-embedding and training-dataset steps are executing in parallel, because they have no dependencies between each other, and as they complete, the next steps start running. We can also see this graph in JSON form (the same information, just as JSON), and we can see it as an Airflow pipeline if we want to run it in that context.

As these steps complete, we can observe the metrics and parameters that came out. In this case, for example, we see an F1 score of 0.44, which is actually pretty bad. This might be because the regularization is 300, an insanely high value. So what I can do now is come up here and rerun the pipeline with different parameters: I hit this button, I see an editor where I can edit the JSON, and I change the regularization from 300 to something more reasonable. I hit OK and it submits a new run of the pipeline. If we go to that run, something interesting to note is that the first three steps were all cached: if you recall, they were keyed on today's date, and because we've already run them for today, MLflow automatically reused those results and only reran the training step. This allows us to iterate much more quickly on the parts that changed.
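The training step Aaron describes (combine the user and tweet embeddings, fit a regularized linear model, then log the F1 score and the model) might look roughly like the sketch below. The demo's code uses Spark ML; this sketch uses scikit-learn instead, and all names and data handling are made up for illustration.

```python
# Illustrative training step: score (user, tweet) pairs from their combined embeddings.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def train_step(user_emb, tweet_emb, labels, regularization):
    X = np.concatenate([user_emb, tweet_emb], axis=1)                   # combined embedding per pair
    model = LogisticRegression(C=1.0 / regularization, max_iter=1000)   # C is the inverse regularization strength
    model.fit(X, labels)
    with mlflow.start_run():
        mlflow.log_param("regularization", regularization)
        mlflow.log_metric("f1", f1_score(labels, model.predict(X)))     # combined precision/recall metric
        mlflow.sklearn.log_model(model, "model")                        # the artifact later registered and served
    return model
```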
So we can open the training step, and now we see an F1 score of 0.86, which is good enough to go to production. I'm going to scroll down to my artifacts; if you recall, we logged that Spark ML model as an artifact, and here it is. I can now register this artifact as a model: I'll create a new model called "tweet recommendation", and this registers the first version of it. So this is the MLflow Model Registry. It's very simple, but you can see that it's pulling from that run: there's one version of this model, based on the run I created, and I can take it and deploy it. I can deploy it to my favorite cloud service, like Azure ML or SageMaker, or in this case Azure Kubernetes Service. I hit deploy, and this spins up a container in Azure ML and creates an endpoint that I can then ask questions like "give me recommendations for this user."

As this is happening, though, I remember something my good friend Brooke told me: she said that neural nets are way cooler than linear regression. I also happen to know that she wrote an MLflow project that takes tweet embeddings and user embeddings and produces recommendations. So although I now have my endpoint, let me go back and train another model and see if it does better. Here's my workflow: I change the name of the step, I change the code from pointing at the linear regression to pointing at Brooke's code, and instead of a regularization parameter I have a hidden-units parameter, which I set to five. I run a new version of this workflow, I come back to the MLflow UI, and again I see that the first three steps are cached and we're only rerunning the training step. One cool thing to note here is that Brooke wrote her code in TensorFlow, while my code, if you recall, used Spark ML. As long as both take the same inputs and produce an MLflow model, it all works out; I don't have to understand what the code is actually running.

Here we see an F1 score of 0.93, which is even better than my 0.86 from before. So I come down here and I see an MLflow model. This one is a TensorFlow model, but it has the same input/output schema, so I register it to the same model, and this creates the second version of it. I come over here, I see the second version, and I update my endpoint from version one to version two; this just takes the existing endpoint and upgrades it to that version. I can now copy the endpoint URL for this new model as it gets deployed, come over to my inference code, and paste in the new URL.

Now, if you recall, I only like two things: companies going IPO and cats. So let's see how this does. The first tweet is about Pinterest going IPO: pretty solid. The second tweet is about petting a cat: also solid. The third tweet is about an IPO of a company called Crypto Blockchain Kitty Catnip, which is surprisingly relevant. So we've seen how we can use MLflow, especially the new components, the multi-step workflows and the model registry, to create an end-to-end, production-ready machine learning system. Thank you.

Alright, thanks a lot, Aaron.
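Aaron's inference cell just posts a request to the deployed endpoint. For reference, MLflow's model servers accept JSON input at an /invocations route, so a hedged sketch of calling such an endpoint might look like the following; the URL and input schema here are made up, and a real deployment may wrap the model behind a different route.

```python
# Sketch: query a deployed MLflow model server over REST (URL and schema are placeholders).
import json
import requests

url = "http://<endpoint-host>/invocations"                # placeholder for the real endpoint URL
payload = {"columns": ["user_id"], "data": [["aaron"]]}   # pandas "split" style JSON input

response = requests.post(url, data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
print(response.json())                                    # e.g. recommended tweets or scores
```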
Just to wrap things up, I have one more quick announcement: for users of Databricks, managed MLflow on Databricks is now generally available on both Azure and Amazon Web Services. If you're not familiar with that, the way we're integrating MLflow into Databricks is that it's the same client library and the same code as anywhere else, but we've integrated it into the graphical Databricks workspace. For example, if you're in a Databricks notebook using the MLflow API and creating runs, there's a sidebar where you can immediately see your results, and you can click on any past run in the sidebar. It's also integrated with notebook version control, so it shows you a snapshot of the notebook as it looked when you created that run. It's a very simple integration, and it's the kind of thing we hope to see in other products and IDEs as well, to really bring these ML lifecycle pieces to the forefront as you work with your code.

Finally, if you want to get started with MLflow, it's very easy: just pip install it, and there are a lot of docs and tutorials at mlflow.org. We'd love to hear what you end up doing with it. Thanks a lot.
Info
Channel: Databricks
Views: 21,427
Rating: 4.9349594 out of 5
Keywords: #ApacheSpark, #SparkAISummit
Id: QJW_kkRWAUs
Length: 34min 32sec (2072 seconds)
Published: Thu Apr 25 2019