Introduction to Kubeflow Pipelines - Dan Anghel | ODSC Europe 2019

Captions
[Music] [Applause] Thank you for the applause, and I hope you will not be disappointed. Thank you everyone for coming to this workshop. For the next hour and a half I will talk to you about how we try to run production-grade machine learning pipelines. Just a quick show of hands, two questions: how many of you have already run machine learning in production? Awesome. And how many of you found it as easy as you expected? Not that many. So I'm here to try to ease your pain and hopefully give you some solutions for making that task easier.

I'm here today to speak to you about Kubeflow Pipelines. Kubeflow Pipelines is a framework to run machine learning frameworks on Kubernetes, which is a framework for managing infrastructure — you get the idea. More seriously, Kubeflow is a curated set of components, tools and artifacts that, grouped together, provide you the foundation to run machine learning workloads wherever you are. Whether you're running on your laptop, in your development environment or in production, Kubeflow lets you use basically the same artifacts and the same components, so that your life is much easier — the goal is to enable you to forklift machine learning code from development to production.

During this workshop we will be talking about a stack made of three layers: application, platform and infrastructure. Going from bottom to top, we will use Google Cloud Platform as the infrastructure, but it is by no means the only place where you can run the code I'm demonstrating today — you can also run it on other cloud providers, or in your own data center. The platform we'll be using to run machine learning workloads is Kubeflow. And the application — the machine learning application we'll be building — is one probably everyone knows: GitHub issue summarization. It's a sequence-to-sequence model which takes the body of a GitHub issue and tries to predict its title — a short sentence recapping what has been said in the issue. We're not going to focus that much on the model itself; it's something I think you all know very well. Instead we'll focus on building a machine learning pipeline from A to Z that lets you train a model, serve the model, and consume it from an application.

Kubeflow is open source, and it runs everywhere Kubernetes runs: on Google Cloud on GKE, Google Kubernetes Engine; on AWS; on AKS. The community is open to new ideas, and it's a very fast-evolving tool, with the target of becoming an enterprise-grade tool rather quickly — I would say probably sometime early next year, in the first half of the year.

So the problem — I think you're all facing this — is to forklift machine learning code from development to production. You don't want to have to rewrite everything every time: you write some code that runs and trains nicely, and you shouldn't have to rewrite it to deploy it in production. The solution we propose is a packaged infrastructure made of components that you can bind together: you separate concerns, focus on each step of the pipeline, then package the steps together and deploy them anywhere. Because the challenge is that running machine learning code on your laptop is not the same thing as running it in a data center, where you probably want to run distributed training on 32 or 120 GPUs — the constraints are fairly different from code that runs fairly well on your laptop. Machine learning is complex — especially the pipeline is complex, and we'll go into that in more detail — and the solution to this complexity is, as I said, a set of components that you can easily articulate together to build your pipeline.

When I say machine learning is complex, I mean that the perception of machine learning is the model — but there's a bunch of other stuff you need to worry about. What attracts interest to the field are names like convolutional neural networks, generative adversarial networks, Mask R-CNN and so on. But that's the modeling part. When you want to run your pipeline in production, the model is roughly 10% of what you need to worry about: you need to worry about data, about features, about model analysis, about the infrastructure, about hardware accelerators if you want to use GPUs, about monitoring — there's a lot.

If we look quickly at a typical machine learning pipeline — it won't encompass everything you may think of, but these are the basic steps — the first one is gathering the data: if you don't have data, you don't have machine learning. And it's not just ingesting data so you can train your model. You need data validation, because your model will be retrained frequently in production — every day, every week or every month — and before retraining you need to check: is my data distribution still the same? Do all the assumptions I made when I built my model in my development environment, based on historical data, still hold when I retrain on new data coming from the real world? Then you need to transform the data, because coming out of your operational systems it doesn't go straight into your model. And of course you need to create your datasets: training, validation, test. That was the data part.

Now, to train your model you need features, so there's feature engineering — normalization and stuff like that — and you need to be able to chain all these operations, one by one, every time you rerun a training. And training is not just "train my model": you train your model and your precision, recall, whatever metric you're using, comes out at 89% — is that good or bad? Do I put it into production or not? So you need to introduce model validation too. You may also want to understand whether your model has bias, so most probably you want to do model analysis: you look at how your model's metrics perform on different subsets of your data. And guess what — in production you need distributed training: you need to be able to run it on multiple machines, make sure it doesn't crash, and funny things like that.

Then, a model that has only been trained is practically worthless if you don't do anything with its predictions — if you don't integrate it into an application that derives business value out of them. That means you need to build an application, probably something with a UI, that calls an endpoint where your model serves predictions. And you don't do all of this in thin air: you run it on a platform, in production. Which means you need to worry about funny things like authentication, logging, monitoring, CI/CD — all the stuff that makes a great story. I'm joking; security is not nice, it's a pain.

So there is all this complexity, and with Kubeflow we want to solve it through composability — I've already said it ten times. We'll look at what components you can build to run your pipeline, and because you're building components, you can also reuse them, so next time you don't need to code anything from scratch. Then you need maintainability. Say you have a problem in production: someone from the business says "hey, this prediction is wrong". How do I troubleshoot this? How do I trace how this model was trained, from what data, making what assumptions, so I can understand why exactly this prediction was wrong? And once I've figured that out, I need to be able to quickly iterate — retrain if needed — and deploy the new version in production. Again, using components usually helps with that.

The last thing: when you train a model versus when you serve a model, you have completely different resource usage patterns. When you serve a model you'll have spikes in usage — Black Friday is coming, you get a lot of hits that day, and traffic goes back down afterwards — so you need an infrastructure that can scale up to absorb spikes. When you train a model, the resource usage is completely different: you need, say, sixteen nodes with GPUs for 12 or 24 hours. So you need something flexible enough to adjust and allocate resources depending on the activity. That's why Kubeflow has been built on top of Kubernetes: you basically don't need to do much, the underlying Kubernetes framework manages these resources, and you just tell it "give me 32 GPUs". Also, Kubeflow's goal is to be usable by everyone: you don't need to be a DevOps person to use it. You can be a data scientist, a data engineer, a DevOps person — it doesn't matter; it's supposed to be easy to use, portable and scalable for everyone.

So in Kubeflow the entire stack is portable, from the infrastructure to the model to the application. Scalability is embedded in Kubernetes, so you don't have to worry about that. You have one tool, Kubeflow Pipelines, to run your whole machine learning process, covering the whole product lifecycle — development, testing, production — and you can use specialized hardware, GPUs and TPUs: you just ask Kubernetes and it gives them to you. As I said, Kubeflow is intended to be for everyone. The current release is 0.6.2. You can see it as a collection of portable components that you reuse to build your pipeline, and it's there to provide the platform to run your machine learning pipeline, so you don't have to worry about things like "how do I get my GPUs". It's also based fully on Kubernetes: you have operators for distributed training inside Kubeflow, and you can reuse all the notions and patterns from Kubernetes for managing infrastructure and microservices. There is a packaging framework — it's open source — used to package the builds and move them from one environment to another. And you don't always have to use TensorFlow: you can use PyTorch or scikit-learn or XGBoost or whatever you want — as long as you can package it into a container, it runs on Kubernetes. Disclaimer: the demo will be in TensorFlow. [Audience question about R] I would say: can you containerize it? If yes, it runs. I don't have much experience with R, to be honest. When we demo the pipeline you'll get a better understanding of what you'd need to follow along with R, I hope.

For the next hour — let me check my time, okay — I'll go through a codelab. If you want to note it down, the codelab is available on the Google developers website. I strongly encourage you not to do it at the same time as me, because it takes quite some time, and you may hit problems — do you have all the rights on GCP to create a Kubernetes cluster, and so on and so forth. So I suggest you note it down and play with it whenever you want, and I will try to do it in front of you. Fingers crossed — this morning it worked, so I still hope it's going to work.

Okay, let's do this. First things first, I need to create a Kubeflow cluster — basically a Kubernetes cluster with a ton of pods running in it. There is a very simple interface, which I'm going to demonstrate for Google Cloud: you just enter a few pieces of information and it automatically creates a cluster inside your Google Cloud project. So I'm going to start with that. I have a Google Cloud project ready here, so I need the project ID; I will call my deployment kubeflow-codelab; and I will log in with GCP IAP. IAP is the Identity-Aware Proxy — it helps manage authentication through a public endpoint on Google Cloud Platform. Kubeflow integrates nicely with it and gives you the authentication and security mechanisms provided by Google Cloud; something similar exists on AWS. So I have everything I need here. Now please don't look, because this is the OAuth client ID and secret — I should have made it black on black. And I will install it in europe-west1-b, version 0.6.2.

Okay, the deployment has started. Let me explain quickly what is going to happen: this will take about 20 minutes, and I'll use that time to give you more details about the pipeline we're going to build, and hopefully, if everything goes well, by the time I've told you everything about the pipeline the cluster is up and running. Fingers crossed. So, I'm on Google Cloud — who here is familiar with Google Cloud? Perfect. Kubernetes Engine, as the name says, is the managed service to run Kubernetes clusters. Right now my project has no cluster, and my expectation is that Kubeflow will deploy one — it's in progress. The way it deploys it is through a Google Cloud feature called a Deployment Manager configuration. There's actually a ton of things happening behind the scenes: the Kubernetes cluster is being created, firewall rules are being created so you can connect to the cluster through the Identity-Aware Proxy, and some service accounts are created. The Deployment Manager configuration packages all of this for you, and it ensures that if you go and delete your cluster while the configuration is still active, it will recreate the cluster. That's the goal of doing it this way.

Okay, so the cluster deployment is in progress; we'll leave it installing and I'll move to another Kubeflow cluster that I have running, so I can show you the pipeline we're going to build today. [Music] If you look at the URL, it's from Cloud Next London, which was yesterday — I'm cheating, because I ran the same demo yesterday. The resolution is not great, but let's try to do this. This is how you see a pipeline you've built in Kubeflow Pipelines: it's a web interface, and you have the steps.

The first step is "copy training checkpoint". Why do we do that? Imagine you're doing transfer learning: you start from an already trained model — say, trained on ImageNet — and you want to build your own object classification because you're not happy with the classification ImageNet provides. So you take the model up to the layer before the last one — the one that generates the embeddings — as a TensorFlow checkpoint, and you build your own classification layer on top of it, with your classes. We're going to do something similar, because we don't have much time and the model takes long to train: we'll start from an already saved TensorFlow checkpoint, so the training will be quick compared to what it would take in reality. So the first step copies that checkpoint from TensorFlow Hub — TensorFlow Hub is an open library of TensorFlow models — onto a Google Cloud Storage bucket, and the training will start from this checkpoint and finish the training.

The second step is called "log metadata". I told you about troubleshooting — when someone asks why a model running in production gives a bad prediction. The way Kubeflow tries to solve this problem is by providing a metadata store where you can log information about each step of your training, link each step into what's called a run, an execution, and group multiple executions into a workspace. This gives you valuable information regarding lineage: by logging this metadata at each step, you will know that the model at version 0.1.17 in production was trained on this data, coming from those BigQuery tables, with this query, and that you did those transformations, and so on. At every meaningful step in the training you can log metadata for the current execution of the pipeline, and this gives you the lineage — you should be able to reconstruct all the actions that were taken to train that model and push it to production.

After copying the checkpoint, we train the model. The outcome of the training is another checkpoint, and again we log some metadata: at this step I trained the model, and this is where it lives on Google Cloud Storage. Through the run identifier and the workspace you can link it to the previous step, copying the checkpoint. After you train the model, you want to serve it. In this example everything happens inside the Kubeflow cluster, but if you want to use something else for training — SageMaker or Cloud AI Platform, which are managed services for running machine learning training jobs, or your own Kubernetes cluster — you just write the training-step component so that it launches the training in that remote place. You don't have to run everything in your Kubeflow cluster; many people use the Kubeflow cluster for orchestration, for workflows, for experimentation — I'll demonstrate the experimentation feature, which is quite nice — and prefer to run their trainings elsewhere. That's perfectly feasible.

The same goes for serving: in this example we're going to run a TF Serving pod inside the Kubeflow cluster that publishes an endpoint serving predictions from this model, but again, you can deploy your model anywhere you want. And then, based on a condition, we will deploy a web UI which allows us to make predictions using the model we trained and served. This is a super simplistic machine learning pipeline, but imagine that you can have whatever steps you want: data ingestion, data transformation, data validation, training, model analysis, model validation — everything. I talked about the application, so let's see what the end result is supposed to be.
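To make the metadata-and-lineage idea above concrete, here is a minimal, library-free sketch of the concepts — workspaces grouping executions, executions logging dataset and model artifacts. This is not the actual Kubeflow metadata SDK; the class names, URIs and properties are illustrative stand-ins for how such a store lets you trace a production model back to its training data.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Toy stand-in for the metadata store described above: a workspace groups
# executions (runs), and each execution logs dataset and model artifacts
# so a model can later be traced back to its training data.

@dataclass
class Artifact:
    kind: str                       # "dataset" or "model"
    uri: str                        # e.g. a storage path (hypothetical)
    properties: Dict[str, str] = field(default_factory=dict)

@dataclass
class Execution:
    run_id: str
    step: str
    artifacts: List[Artifact] = field(default_factory=list)

    def log(self, artifact: Artifact) -> None:
        self.artifacts.append(artifact)

@dataclass
class Workspace:
    name: str
    executions: List[Execution] = field(default_factory=list)

    def execution(self, run_id: str, step: str) -> Execution:
        ex = Execution(run_id, step)
        self.executions.append(ex)
        return ex

    def lineage(self, run_id: str) -> List[str]:
        """Everything logged under one run, in step order."""
        return [f"{ex.step}: {a.kind} @ {a.uri}"
                for ex in self.executions if ex.run_id == run_id
                for a in ex.artifacts]

ws = Workspace("github-issue-summarization")

copy_step = ws.execution("run-01", "copy-checkpoint")
copy_step.log(Artifact("dataset", "gs://my-bucket/checkpoint",
                       {"query": "SELECT body, title FROM issues"}))

train_step = ws.execution("run-01", "train")
train_step.log(Artifact("model", "gs://my-bucket/model",
                        {"framework": "tensorflow", "version": "0.1.17"}))

for line in ws.lineage("run-01"):
    print(line)
```

The key design point mirrors the talk: artifacts are linked to a run identifier, so asking "how was version 0.1.17 trained?" becomes a simple lookup over everything logged under that run.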
okay [Music] Oh so as I said github issue summarization so we'll enter here a the description of a github issue and we will try to run the prediction and see what the title comes ok Oh conveniently I have a guitar shop it's from tensorflow ok so it says using skip on a dataset that contains corrupted data and then applying ignore errors because it's fit method to Hank before the validation step the feed method uses data set for training and validation ok ok so believe me I clicked on the button something is happening right oh yes so the title is validation error handling not that bad actually and the actual title is you skip ignore error because training Hank you know it cannot do better than the date that you give it cool so where are we in the in the deployment right so as you see it already tells you oh I created a an end point and that's the end point published by the identity where proxy on Google Cloud that exposes the cube foe UI right let's go quickly to see how my cluster appeared so this is what the Installer created and if I click on so I have a you know a nice cluster with two nodes by the way is it too small we wanted me to okay it's too small it is better perfect okay some people in the back saying yes then I'm happy okay so there are two nodes cluster is up let's see the workloads okay so you have a ton of workers that's a ton of positive running that has been deployed by kubernetes by you a cute flow not all are working of course you know probably you recognize khatib to do a hyper parameter tuning what else is interesting yeah Nvidia driver installer if you want to work with GPUs and you have also operators for sensor board TF job and PI torch right or I said not only tens of Pi tortures okay so I have the I have the URL hmm okay let's try to access it okay it's still it's still getting up and running okay so what I'm going to do next I'll wait for the cluster to pop up and I'll try to walk you through the through the code that we wrote to build 
this pipeline right and as soon as the cluster finish is creating then we will demo it so now I'm on another cluster that they have running so please bear with me for a moment until the other one is created one cool thing that you have in queue flow pipelines is you have jupiter notebooks right so you create your you can create your your own jupiter notebook server and you know you can do you can do very nice notebooks random enya's inside your inside your cluster of course you're not forced to run demand in your cluster you can run them anywhere else and like this notebook demonstrates the pipeline which is this one so I'm going to go through the Python code that we wrote to create this pipeline I'll first go through the components because that's the most important part and then we'll see if my cluster is up and running okay so the most important thing in the in the pipeline are the components so a component does one step execute one step in your pipeline and you have basically two ways of creating a component in cube flow so either use the container up constructor which basically as its name says it allows you to create a container on and run it on on kubernetes right so you know we can say this is my image run this image with I don't know how many workers how many GPU resources you can you can specify this it's kind of four resources right so it becomes completely transparent for you how these those resources are allocated or even how cube flow works like you don't need to write the manifest it just creates it for you a a container and runs it on on the kubernetes cluster so for we use container up for model serving and for so yet to serve the model and to run the front-end the web front-end which is this one and you can also create your own reusable components and there is one here which is of a lot of interest which is log metadata so basically you're calling the same component twice because you want to log you know it's basically you love the same information 
only on two different objects so you don't want to write your Python code ten times to talk the same thing so the way to do that is by creating your own components in a component definition file so let's have a look at the one for the metadata component well not very fancy you have the image you know what docker image you want you need to run with what parameters so here you have as I mentioned the run name the execution of your pipeline the identifier of your execution the workspace and you know the data URI and model your eyes where where I saved my checkpoint at the start of the training and where I said save my model at the end of the training of course lock type in this case is either model or data set right so if we look at what information can you log about the data set so here is the schema of what you can write in the metadata database with regards to your data set I will make this larger okay so you can track things like where is the file that contains your your training data and something very that I'm very interesting is the query that you used to extract that data if you're using a sequel database also the time you know who's the owner so those are crucial information about how your data was collected to train your model right so this is the first thing you want to know when you're wondering why that model in production version blahblah is giving that prediction it's like how was the data collected you have here the information I love this it's amazing I mean I'm excited because like you know these are simple questions and for I don't know what reason it's so difficult to answer them and like this metadata store which is not tensorflow dependent it's I mean you can use it with absolutely every framework you want it provides you a very simple way to retrieve lineage and information like this so just want you to have a look quickly at the model so this is the model metadata that you can you can write you know where is the you know what kind of model 
model type training framework hyper parameters this is great this distancing gives you exact information on how you train the model again no owner information creation time whatever annotations you want to add where is the model did you arrive so in the case of the metadata component you know you create your own reusable and you call it as many times as you want and then there are two other components defined like this the data copy the component copies the the checkpoint at the beginning of the training and the training component so each one has a its own component definition file again what image am I going to run with what parameters okay so this is the component that copies the checkpoint I need to know where is the checkpoint directory where is the data directory so I can copy it and you know for this workshop of course all the containers have already been built we're not going to build docker containers here but just if you want to like look what does it look like so source code you know simple Python function that copies all the files from one cloud storage bucket to another and then I have a basic docker file I'm installing you know start from the tensorflow image it's installing Google Cloud tools because you want you know you're going to use gsutil and things like that to copy the data and the final container file that actually built the containers is using the previous the base docker and just runs you know train model dot Pi which can if you run it with different parameters will copy the checkpoint or run the training right so the actual training is in the same file a bit below so we were lazy in this case so we just ran T to t trainer so it's from tensed from the tensor 220 tensor framework so like I said we didn't focus now too much on writing fancy TF estimator code we want to to have something up and running but you know if you want to sit like this is code you're very familiar with as as long as you can docker eyes it you can write on capable 
pipelines okay so I will give the deployment under look the expected time I'm in 14 minutes okay let's see how all my workloads are if everything is green except this one that's one we always read I need to understand why okay maybe we can try now to hey we may have a cluster up ironic awesome okay so I will go back to the Installer I have my great okay so just two so you're sure that is the one I just created look at the URL it says ODS see London 2019 right so it's the one I just created I'm so happy okay I can close everything else for the moment a bit of housekeeping first right so this is the queue flow UI you have your pipelines inside the pipe lines you have experiments now this is a very nice feature because it helps you group several executions of your pipeline and compare the results there's compare runs button on top right so you want to understand between two different runs that you know give probably different results your model has different performances you want to understand what's different okay so you can compare all the parameters you used to run those pipelines and see where the difference comes from comes from then to go back you have Jupiter notebook servers and last you have the artifact store where you will find information about the meta data that we just lost some you have also a nice visualizer that lets you let you see you know the metadata that you're not doing your your pipeline we would use for this training a GPU so for the moment my node has sorry my greatest cluster has only two nodes and none of them has a GPU so what I'm going to do next is I'm going to create a new node pool with one node with a GPU so we can run the training okay so first thing first I'm going to run a clock shell so the cloud shell is a small VM Linux VM in Google cloud that has all the g-cloud tools that allow me to interact with all the Google cloud services probably this is again too small is this better hi okay okay thank you now if my commands have den 
line don't worry it's not okay so I will create some variables and configure my g-cloud commands so my deployment name is called cube no code lab does the name of the class term by project a Google called project I will run everything in Europe West 1b then let me see if I have this bucket so I need a Google Cloud Storage bucket where I can store my checkpoint and my model after the training so we see this already exists yes it does perfect then I will configure cube CTO to work with Michael Burnett this cluster so this is a geek out command that gets the credit credentials for my kubernetes cluster that I just created I will set the default namespace to cube flow okay so let's have a look at nodes so I am expecting to have two nodes up and running nice now I need a node pole with GPUs so I will do something else first I need to set the right access rights to the pipeline runner service account so I can do everything I want in my cluster for the moment and before creating the node pool with GPUs I will activate Auto provisioning and I will tell you in a bit what Auto poisoning is so what I'm saying is to the cube cube kubernetes cluster is that you know if you need to scale up to create new nodes because the pods you know I need to run new pods and there are not enough resources you can scale up to 48 CPUs and up to 24 GPUs nvidia tesla k18 so that's a that's a very nice feature that will help you provide the necessary resources without having to have them all up and running all the time right if GPUs are expensive we all know that you don't want to have you know 16 nodes running all the time so Auto provisioning will create new nodes automatically when you need them so I'm ready now to create my node pool and I will ask my note pool to contain one note machine time n1 hime m8 so that's like a 8v CPUs 64 gigabytes something like that with one Tesla k80 GPU okay once I have this note pool I can run my pipeline sorry mic training by the pipeline per se doesn't need 
GPUs, but the training step does. So let me go to the dashboard. Okay, I only have the demo pipelines configured there, so the next step is to upload the pipeline that we just saw. How does this work? Here I'm using "Upload by URL"; in the notebook I will demonstrate how to upload it programmatically, by calling an API. I will call it the GitHub issue summarization pipeline. [Audience question: what is the file format?] Right, that was my next point; thank you, it's a very good question. It's a tar.gz file. I wrote the pipeline in Python, with a DSL provided by Kubeflow; then there is a compilation step, which takes the pipeline and generates this YAML file, an Argo file. Argo is the Kubernetes orchestrator. Nothing fancy; you could even write this yourself, but if something does it for you automatically, that's much nicer. So, the pipeline is there. Let me see: does my cluster have a GPU? Nope... yes! Right now I have a new GPU node pool, and inside this pool there's a node with one Tesla K80 GPU. Here it is. Nice. So we can run the pipeline. First I'll create an experiment from the pipeline that I just uploaded. I will call the run "run-01", because it's the first one today. I just need to specify some parameters: the first one is my project ID, which I have here, and my working directory is my bucket (look at this one: it's gs:// plus the bucket name). The checkpoint we're going to copy is in a public bucket, so that's where we'll copy it from, and the data we're going to train on is in a different public bucket. And I can start the pipeline. Next we'll go into the run and follow the execution in real time. The first step is currently being executed
Okay, the step is pending with the message "container creating"; fair enough, we need to wait a bit until the container has been created. In the meanwhile, let's go back to the notebooks, and to save time I will create a new notebook server. You can even bring your own image, and you can say how much CPU and how much memory you want; we just want a very small notebook server, we don't need it for much. Okay, what happened? I thought I lost the internet for a moment there. Okay, so I have one run in progress. I don't think I can zoom this, so the people in the back will have to believe me: the first step is running, and you can see all the files being copied from the bucket containing the checkpoint to our bucket. There are model.ckpt files; you recognize them, those are TensorFlow checkpoint files. So this is going to run, and then we will start the second step, which is training, and it will also log the metadata. In the meanwhile, my notebook server is being created, and as soon as it's done I can upload my notebook and we'll continue the demonstration there. Okay, my first step finished. You can see (again, probably too small for the people in the back) the input and output parameters. Everywhere in your pipeline you can pass parameters from one step to another, so in one component you can reference output parameters from a previous component: for example, where did I copy my checkpoint? That's useful for the training step. And the input parameters are the checkpoint directory and the data directory. Okay, the log-metadata step is again waiting for container creation. How is my... okay, my notebook server is up and running, so I will upload my notebook. Is it still too small? Okay. [Music] So again, the components that I just described: a few lines to load each
predefined component from its file. Now let's go quickly over the code that creates the actual pipeline. I create each step one by one. The first one, copy-data, is created from the copy-data component that we loaded from file, and you pass just the input parameters it needs. You can also pass run information: dsl.RUN_ID_PLACEHOLDER gives you the ID of the current run, the current execution of the pipeline. That's super important when you want to log metadata, for example, or when, like here, you want to create a dedicated folder on Cloud Storage for this run; having a run ID is super useful. Then I create a log-dataset step from the metadata-log component: I say I want to log information about a dataset, with the log type, the workspace (I give it a name, ws_gh_sum), and, also for logging purposes, the ID of the current execution and the data directory where the data has been saved; likewise, metadata for the checkpoint. Then there's a training step, created from the train component we loaded earlier. About passing information from one step to another: here we use the copy-output-path output parameter of the copy-data step to say, start from the checkpoint that is saved there. If I take the name of my step, its outputs, and the name of the parameter, I can fetch a parameter from a previous step. Then again a log step, this time log-model, logging information about my model. And for the serve step we're using the ContainerOp constructor: basically we're telling Kubernetes, run this container. Pretty straightforward: you just give it the image. The image is already built, but if you want to have a look at the Kubeflow manifests, what we're doing here (let me zoom) is pretty straightforward:
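The output-to-input wiring described above can be illustrated with a small conceptual sketch. This is plain Python, not the kfp SDK; it only mimics the idea that each step publishes named outputs as placeholders that a later step can consume, the way the train step consumes the copy-data step's copy-output-path:

```python
# Conceptual sketch of how Kubeflow Pipelines wires steps together.
# NOT the kfp SDK: it only mimics the idea that a step's outputs are
# placeholders, resolved by the orchestrator (Argo) at run time.
class Step:
    def __init__(self, name, inputs=None, output_names=()):
        self.name = name
        self.inputs = dict(inputs or {})
        # Each output is a placeholder string until the pipeline runs.
        self.outputs = {
            key: "{{tasks.%s.outputs.%s}}" % (name, key)
            for key in output_names
        }

# copy-data exposes where it put the checkpoint...
copy_data = Step("copy-data", output_names=["copy-output-path"])

# ...and the train step consumes that output as one of its inputs.
train = Step(
    "train",
    inputs={"checkpoint_dir": copy_data.outputs["copy-output-path"]},
)

print(train.inputs["checkpoint_dir"])
# prints {{tasks.copy-data.outputs.copy-output-path}}
```

In the real DSL this reads roughly as train_op(checkpoint_dir=copy_step.outputs['copy-output-path']), where copy_step and train_op are hypothetical names standing in for the demo's step objects.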
we are running the TensorFlow Serving image, we tell it where the saved model is, and we tell it to serve the model on this port as a REST API. Again, pretty straightforward. So we defined the serve step, and in a few lines you just link the steps together: you say that log-dataset comes after copy-data, log-model comes after train, and so on. Then I need one GPU for my training, and you don't have to know anything about Kubernetes manifests to do this; you just say: I want 1 GPU and 40 gigabytes of memory for my training. And at the end, if my condition says to launch the web app (the server serving the UI for predictions), I run one more step, which I also create with ContainerOp, running an image that we pre-built. If you want a quick look, it's a Python Flask application. Nothing fancy: it just knows the URL it needs to call to get predictions, and the serving host and the model name are read from two environment variables. And with this you can do everything. Let's quickly go back to the execution of our pipeline. The log-metadata step finished, so if everything went well, in my artifact store... yes, I have information about my dataset, which I just logged. You can find the run information, the workspace (in which we group several runs), and where my model checkpoint has been saved. That's the first metadata log, and the training is in progress. If we look at the logs, we are at step two-million-something; it's running. And under Artifacts you will soon see, for this training step, a "Start TensorBoard" button, and you should be able to run TensorBoard while the training is happening. TensorBoard will run in the same cluster and will read all the event files from Cloud Storage, where the training is saving its data, so you can follow your model's performance metrics while
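The Flask web app described here only has to assemble the TF Serving REST endpoint from its two environment variables and POST a JSON payload. A minimal sketch; the variable names TF_SERVING_HOST and MODEL_NAME, their defaults, and the feature key are assumptions, not the demo image's actual names:

```python
import json
import os

# Hypothetical env var names; the demo app reads the serving host and
# model name from two environment variables along these lines.
host = os.environ.get("TF_SERVING_HOST", "model-serving")
model = os.environ.get("MODEL_NAME", "mymodel")

# TF Serving's REST API serves predictions at /v1/models/<name>:predict
# on its HTTP port (8501 by default).
url = "http://%s:8501/v1/models/%s:predict" % (host, model)

# Request body: JSON with an "instances" list. The shape of each
# instance depends on the model's serving signature.
payload = json.dumps({"instances": [{"issue_body": "app crashes on empty config"}]})

print(url)
```

A POST of that payload to the URL returns a JSON response with a "predictions" list, which the web app renders as the generated issue title.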
the training is happening. Okay, let's go back to the experiment run... it needs a bit of time; it's probably waiting for the first eval step. Let's go back to the code. Where were we? We coded our pipeline: a pipeline is a simple Python function, you use the Kubeflow DSL, and basically you just connect containers, so you can do anything you want. This is a fairly generic way of doing things, because it supports any framework and any kind of data pipelining you want to implement, but there are also dedicated ops: if you want to do distributed TensorFlow there are ops for that, there are ops for MPI, and there are ops for Python functions, which again makes your life much easier, because you don't have to build containers every time. Now, running the pipeline. What I uploaded in the Kubeflow UI was a tar.gz file, which is my pipeline compiled: transformed from Python to an Argo definition file. How do I do this? There is a compiler package in Kubeflow Pipelines, and I just call compile with the name of my function and the output file; and if I want to upload this pipeline to my cluster, I just call client.upload_pipeline with the name of the compiled file. So let's run this... yes, of course, I forgot to run the most important thing first; please bear with me. Let's go back to the training and see how it goes. We are at the eval step, so hopefully after the eval step you will see the TensorBoard button appear here. Okay: import kfp, now let's load the components... oh, something happened: "unexpected keyword 'metadata'". Nice. Let me switch; I'll show you on the other cluster. I know what the problem is, but in the interest of time... I already have a notebook server running in my namespace there. Let's create the pipeline.
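In the kfp SDK the two calls just mentioned are, roughly, kfp.compiler.Compiler().compile(my_pipeline, "pipeline.tar.gz") followed by kfp.Client().upload_pipeline("pipeline.tar.gz"). The compiled artifact itself is simply a gzipped tar holding the generated Argo workflow YAML. A stdlib-only sketch of building and inspecting such a package (the YAML content is a stand-in, not real compiler output):

```python
import io
import tarfile

# Stand-in for the Argo workflow YAML that the kfp compiler generates;
# the real file is much larger and fully describes every step.
workflow_yaml = b"""apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: gh-summarizer-
"""

# Package it as pipeline.tar.gz, the format the Kubeflow UI accepts.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo("pipeline.yaml")
    info.size = len(workflow_yaml)
    tar.addfile(info, io.BytesIO(workflow_yaml))

# Re-open it and list the members, like inspecting the uploaded file.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    names = tar.getnames()

print(names)
# prints ['pipeline.yaml']
```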
So, clear the output and run it. [Music] Oh yes, of course, the error handling is not the best. Okay, so we call the pipeline-creation API again. In response, I get all this JSON describing my pipeline, and this super important ID that I will use to run my pipeline further down the road. Now I can create an experiment with the same client object; let's run this. Okay, the experiment has been created; let me copy this ID. Let's check that we have the pipeline: yes, it's here, the ODSC pipeline, the one that I just uploaded, and I will run my pipeline in the newly created experiment. It's running here. So if I go to Experiments, into the ODSC experiment, this is my run; it's linked. It's exactly the same thing I did before, just triggered programmatically. What does this mean? It means that in production you can schedule your pipelines: you don't need to go to the UI and click the Run button; you can do everything from whatever scheduler you're using, Airflow or anything else. Let's go back to the training... oh, did you see it? The "Start TensorBoard" button appeared, right on cue. So: Start TensorBoard, then Open TensorBoard. You see TensorBoard is accessible through the same Identity-Aware Proxy URL at the top. So you have the Kubeflow Pipelines UI, the web application exposed through the same URL, and now also TensorBoard; you had nothing to do, it just works. [Music] So my pipeline is finished, the web app is deployed. I am on the cluster I just created, and now I want to check that my web app has been deployed. However, before that, I see that the second log-metadata step has executed, so I will check what appeared in my artifact store, and I have
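Under the hood, the programmatic run goes through the Kubeflow Pipelines REST API. A hedged sketch of the request body that a call like client.run_pipeline(experiment_id, run_name, pipeline_id=...) posts to /apis/v1beta1/runs; the exact schema can differ between KFP releases, and every ID below is a placeholder:

```python
import json

# Placeholder IDs; in the demo these come back from the earlier
# upload_pipeline and create_experiment calls.
pipeline_id = "<pipeline-id>"
experiment_id = "<experiment-id>"

# Approximate shape of the v1beta1 run-creation request: the run
# names a pipeline (by ID, plus parameters) and is owned by an
# experiment through a resource reference.
body = {
    "name": "run-04",
    "pipeline_spec": {
        "pipeline_id": pipeline_id,
        "parameters": [
            {"name": "project", "value": "<gcp-project>"},
            {"name": "working_dir", "value": "gs://<bucket>"},
        ],
    },
    "resource_references": [
        {"key": {"type": "EXPERIMENT", "id": experiment_id},
         "relationship": "OWNER"},
    ],
}

print(json.dumps(body, indent=2)[:60])
```

This is why the run shows up linked to the experiment in the UI: the resource reference makes the experiment the run's owner, exactly as when you click Run by hand.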
the second artifact, which is my model metadata: again very useful information, the execution ID, the workspace, and the path where the model has been exported. So let's go back now to the pipeline; I'd love to see TensorBoard. Hey, nice! Probably not the nicest learning curves you ever saw, but the point is that all these components work together to let you run a fairly complex machine learning pipeline seamlessly and without much effort. The other run, the one I triggered programmatically, is running and doing exactly the same thing as this one, so we're not going to follow it any further. The last thing I want to check: in Experiments, in my first run today... okay, my web app has been deployed, so the URL should be the same base URL with something like "webapp" at the end. This is a different training, so we probably get a different result. Let's take the same GitHub issue and generate the title... I think it's the same result anyway. So that was the small, fairly small, Kubeflow Pipelines demo. A real machine learning pipeline will be much bigger, but I hope you now have all the elements you need to imagine how you would build it from components, and how fairly straightforward that is. I'll go back to my presentation for the last five to ten minutes. So: we went through setting up the environment and creating a Kubeflow cluster; we ran a pipeline from the dashboard, just by clicking; and we ran a pipeline from a Jupyter notebook, programmatically. Clean-up: let's do the clean-up while we're here. I keep mentioning Deployment Manager on Google Cloud: you actually see two Deployment Manager configurations, one for the storage (because we want persistent storage for our Kubeflow cluster) and one for the actual Kubeflow cluster. So it's easy. Don't go and kill your
Kubernetes cluster; please don't do that. Just delete the Deployment Manager configuration nicely, and it will clean up everything that was installed: the Kubernetes cluster, firewall rules, service accounts, everything. It's the nice way of managing things. At the end, I just wanted to give you an insight into what's under the hood; I think we touched on many of these things. You have an ingress point: a proxy that manages all access from outside your cluster to all your UIs, be it the Kubeflow Pipelines UI, TensorBoard, or a web app that you deploy in the cluster. You have a dashboard with pipelines and notebooks. Argo is the tool that orchestrates the steps of the pipeline. Katib you can use for hyperparameter tuning. You have operators for different frameworks, such as PyTorch and TFJob, so you can run code for those frameworks directly on your cluster; if the operator you need doesn't exist yet, you build a container and run anything you want. And for serving, as we demoed, you can use TF Serving (just give it the image name, where the model sits on Cloud Storage or wherever it is, and the port number), or, if you want to serve something other than TensorFlow, there's Seldon. Now, the community. Kubeflow is open source with a very vibrant community. The current version is 0.6.2, and the goal of the team is to make it enterprise-ready, and to do that very soon, because clearly there is a need for such a tool at an enterprise readiness level: things like handling authentication, logging, alerting, monitoring. These are the kinds of things enterprise products really need; they are what makes a product enterprise-grade. Kubeflow has been around for probably almost two years now, and every component in it has been around for a long time, but
making it enterprise-ready means providing the quality that helps you troubleshoot problems whenever they appear, integrate seamlessly with the security of your systems, and so on. So if you want to contribute, the project is open and very much in need of new ideas and contributors; please don't hesitate. That was all from me. I think we finished about seven minutes early, so do you have any questions you may not have asked so far? Yes? [Audience question: why was provisioning the Kubeflow deployment so slow compared to creating the GKE cluster itself?] A very good question. The provisioning of the cluster was fast; that's not the thing that takes time. What takes time when you deploy Kubeflow Pipelines is that you not only create a cluster, you also install a bunch of pods, around 50 of them, and not only that: you create a Deployment Manager configuration, firewall rules, service accounts, and there's the configuration of the Identity-Aware Proxy so you can access it. What you didn't see is that when I access that proxy, I am authenticated by Google Cloud. Let me show you: if I go to an incognito window, it says bye-bye, I don't know you. So there's a bunch of things happening in addition to just creating a cluster; there are SSH keys being generated, and that's not particularly fast. Let me try, because this is interesting: this is the URL of my cluster, I am authenticated on Google Cloud, and... it doesn't work. This is awesome. Sorry, maybe it's the network... oh yes, thank you, I have a different one. Thank you very much. Okay, so I am authenticated to Google Cloud, and you didn't see any username and password pop-up. Now let's try it in an incognito window, fingers crossed... okay. So that means authentication, security, access rights: everything is handled. Any other questions? [Audience question about sharing and collaborating on models, partly inaudible] Okay, so I think
the primary goal of Kubeflow is running the pipeline; there are other tools that provide that kind of functionality. I can only speak for Google Cloud: there is TF Hub, to start with, and there is AI Hub on Google Cloud, which is similar to TF Hub and whose purpose is exactly the kind of collaboration you describe. You can also share pipelines in AI Hub, not only models; basically you can put any kind of notebook there. But yes, those features live outside Kubeflow; on Google Cloud it's AI Hub that provides them. You're welcome. [Audience question about running notebooks with custom images inside Kubeflow, partly inaudible] As long as you have the image, you can create it, so in that sense it can run any image. But you can also use notebooks outside Kubeflow Pipelines. Intuitively, I wouldn't necessarily use a notebook inside Kubeflow Pipelines, because I consider that my pipeline-manager tool; on Google Cloud you would use something like Cloud AI Platform Notebooks, for example, which is probably more focused on collaboration. And as long as the image is on Google Cloud you can run it; it's one gcloud compute command, that's it. [Audience question, partly inaudible] That would be very specific to Google Cloud, and Kubeflow, being open source, has the goal of running everywhere, so I don't think the team will implement such a thing as a tie-in to Google Cloud. But if you want it, it's one line of code; you can write your own component. Does that answer your question? Yes? Oh, excellent question: what's the story about Kubeflow and
TFX, and how do they relate? Kubeflow is for orchestration and for providing an environment where you can run your machine learning workloads. TFX, first of all, is TensorFlow-focused, and it's for implementing the individual steps of your pipeline: you have TFX Data Validation, TFX Model Analysis, TF Transform, and so on. So I would use TFX to implement the steps of my pipeline. For example, to generate my TFRecords I would use TF Transform, which on Google Cloud runs on Dataflow, so it runs at scale, and that would be one step in my Kubeflow pipeline. For model validation I would use TFX Model Analysis, which is awesome: it creates a very nice notebook where you can explore and do feature slicing, I think that's the name, and that would be another step in my pipeline. But I would use Kubeflow to orchestrate the pipeline. You can also use Airflow, which is another orchestrator, but the difference is that Airflow is multi-purpose and generic; you should be able to do anything with Airflow, while Kubeflow is super focused on machine learning. One feature that exists only in Kubeflow, and exists for machine learning, is experimentation: I want to be able to compare my runs, how they perform, and understand the difference in configuration between them. That kind of thing doesn't exist in Airflow; but you can totally run your pipeline on Airflow, and TFX would implement steps in that pipeline too. What I would like to see from Kubeflow in the next versions is for it to be as rich as TFX: TFX has everything out of the box to run a proper machine learning pipeline, while on Kubeflow you still sometimes need to write containers. Ideally we'd have the same richness of components in Kubeflow as in TFX. You're welcome. Any more questions?
Yes. So, the metadata store, as I said, is not framework-specific; it's SQL-based, and it can sit either in a file or in an actual database. There's a MySQL instance, I believe, running inside Kubeflow Pipelines, and the artifacts you can see in the metadata are the things that you're logging. [Audience question about serving large models, partly inaudible] Do you embed the model into the serving image? No, you never embed the model in the image: when you run the training you always save the model to storage somewhere, so you don't have to embed it anywhere. Oh, you mean when you serve it? Okay, when you serve it you may need a big machine: you're running TF Serving, and it picks up the model from a Cloud Storage location when it starts up, so it needs to load it into memory. Here's the thing: using ContainerOp, you can tell the container, I need 40 gigabytes of memory because I know I will need to load a huge model. That should be much easier than going and writing a Kubernetes manifest. Okay, that's the time we have for today. Thank you very much. [Music]
Info
Channel: Open Data Science
Views: 5,729
Keywords: Machine Learning, ODSC, Kubeflow
Id: i8CrqPUWBI4
Length: 90min 55sec (5455 seconds)
Published: Tue May 19 2020