Machine Learning on Kubernetes | Salman Iqbal

Video Statistics and Information

Captions
Welcome everyone, thanks for coming. This is the second-to-last session, so I'm going to try and keep it light. We're going to be talking about machine learning on Kubernetes, but first I just want to show you the extent of machine learning and AI, how good it is. I don't know if you can read that; that's Donald Trump in Mario Kart 8. This is from a project called DALL·E mini: you give it some text and it produces some art for you. I don't want to be biased towards America, so this is Boris Johnson in Mario Kart 8. I know it's a little bit old, but before we start maybe we can play a little game: you can guess what prompt was given to produce each of these. You can shout out. Yes, that's very good: a court sketch of Godzilla on trial. How about this one? Yes, Connect Four with cucumbers, excellent. What about this one? I like this one. Oh, this guy's on it, I need to give you something, you deserve that. I came with a prize for whoever answered a few of those, so you came to the right session, basically, is what I'm saying. I have it somewhere in my bag... there you go, a Lego Tux, you know, Tux from Linux. Well done, sorry I didn't catch your name; well done to Drew. So there you go: that's a happy toaster in a bubble bath, you were guessing that one, good. What about this one? Gandalf doing vape tricks. Anyway, this is machine learning, right? This is the peak of machine learning and data science. There are quite a few weird ones out there; you can go on this Twitter page, and again, I'm not going to make any jokes about Twitter today. This one is "Edward brunch".

All of the stuff that you just saw came out of OpenAI, an organization that has been doing a lot of work in artificial intelligence, data science, and machine learning. It's a combination of a few projects that have been coming out: GPT-3, DALL·E, and CLIP, for example. With GPT-3 you can take some text, feed it in, and it produces a SQL statement, which is pretty good; you can use it as a general-purpose model, well, not quite general purpose. For example, you can give it some emojis, train it, and it will guess that they mean Star Wars; I don't really know if this is correct, but apparently this is Star Wars. In another one, you give it an image and it tells you what's happening inside the image. The point, the reason why I'm showing this to you, is that OpenAI trained all of these models on Kubernetes. Their clusters have around seven and a half thousand Kubernetes nodes, and they needed all that compute to be able to show you Donald Trump in Mario Kart 8.
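To give a feel for how that kind of compute gets requested, here is a minimal sketch of a Kubernetes training Job that asks the scheduler for a GPU node. The image name, command, and node label are hypothetical placeholders; the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster:

```yaml
# Hypothetical training job; image, command, and node label are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        accelerator: nvidia-gpu        # only schedule onto GPU nodes
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest
          command: ["python", "train.py"]
          resources:
            limits:
              nvidia.com/gpu: 1        # request one GPU via the device plugin
```

The Job completes and frees the GPU when training finishes, which is exactly the bursty, scale-out-then-release pattern the talk describes.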
So, my name is Salman Iqbal. I'm an MLOps engineer, basically DevOps for data science. I work for a company called Appvia and I'm also a Kubernetes instructor. You can find me on Twitter, and I also have a YouTube channel if you want to check out some short videos. This is Appvia's website: Appvia is a cloud consultancy company, and we have a platform that lets you manage multiple Kubernetes clusters; you can check that out at appvia.io. You can contact me on Twitter, or if I don't reply, just leave a comment on YouTube and I'll get back to you. Before I start, thanks to KCD UK for having me, and thanks to all the organizers and volunteers; it's really awesome.

Now, who here does machine learning, really quickly? Anybody? Okay, excellent. Are you all interested in machine learning? Great. Everybody does Kubernetes, I'm guessing, right? Okay, cool, so I don't need to explain what containers are. We take the runtime, the configuration, the dependencies, stick it all together, and we've got a container, and then we deploy them on Kubernetes. There are challenges, you know: you can't really manage containers one at a time, so how do you run multiple containers? That's what Kubernetes gives you, with service discovery and all that kind of stuff.

Before I go into why machine learning on Kubernetes, very briefly: you can break machine learning down into two areas, supervised learning and unsupervised learning. In supervised learning you have a bunch of data; in this case you're seeing a fashion dataset, and it's usually labelled, so each image is a shirt or whatever it is. You don't just get a little bit of data, you get a lot of data, and what you're trying to do is come out on the other end with a model. A model is basically a file, depending on the framework that you use, and you want to train this model to recognize certain types of patterns. So you train a model over a set of data, providing an algorithm it can use to reason: is this a shirt, is this a car, whatever it might be. The algorithm could be anything; in this case we have neural networks, and usually you do not write the algorithms from scratch, you use one of the frameworks that's available. You put your data in and hopefully it trains you a model, and once it's all trained up you can serve the model so people can submit requests. For example, if I submit an image of a shirt, it comes back and says "this is a t-shirt". That's great, right? That's super-high-level machine learning.

But you don't just have one model. As an organization you'll probably end up with a number of models, a recommendation engine, whatever it could be, and models get updated all the time depending on the data that's coming in. You update them daily, monthly, whatever it might be, and you do it quite often: Monday, Tuesday, Wednesday, Thursday, Friday, and if you work for Elon Musk, Saturday and Sunday. I knew I wasn't going to make any jokes, but it's a good point. If you do all of this manually it's quite hard, and if this was me on a weekend trying to figure out why the model is not training, this would be me. I don't smoke, but this would definitely be me.

But that's not where the story ends for machine learning and data science; it's where it starts, because there's quite a lot you need before you can even train your models and serve them so people can submit requests. You need to grab the training data, split up the data, train the model, and evaluate the model. That's the core of data science, but what actually happens when somebody tries to do data science is they have to do a lot more: figure out how to collect the data, do feature extraction and data verification, and the stuff that we really care about, like machine resource management. How do we get a bunch of machines so we can train our models? How do we serve them? How do we scale all of this, because people are going to be submitting tons of jobs? And how do we analyse it all? You need to do a lot of this stuff before it becomes beneficial for users.

Now, we've all seen this, right? The Cloud Native Computing Foundation landscape. If you want to do anything in a cloud native fashion it's very easy: you go here and pick a project. And if you want to do machine learning, there's something similar. I've made this small on purpose, I don't want you to read the words on here, but this is from MLReef; they put together a landscape of machine learning frameworks and platforms that can help you do all the stuff I just showed you. It's quite massive, but we're going to focus on a couple of things.

So, really briefly, why should you use Kubernetes as the platform to run your machine learning models, or get your data scientists to use it? A few things, as we all know. If you're training a model, Kubernetes gives us self-healing: if something crashes, it comes back up, no problem. I just stuck a sticker meme in so you don't get bored. We can do smart scheduling as well: sometimes you have workloads that require GPUs and you don't want them running elsewhere, and we can do that with Kubernetes; we just saw Marco showing us labels and node selectors and whatnot. If we have something that needs to run as low priority, we can do that too. And if we're running out of pods to serve incoming requests, we can use the built-in Horizontal Pod Autoscaler: you have a deployment, and the autoscaler queries some metrics and, based on those metrics, scales up or down; or you can use any of the other projects that help with that. That's quite useful, because a lot of machine learning tasks come in bursts: you want to scale out, do the thing you need to do, and scale back down. And beyond scaling pods, as in the talk you just heard from Marco about the cluster autoscaler, you can use Karpenter or whatever you like: once our pods take up all the space in the cluster, we can spin up more nodes to run our apps.

So how do we put all of this together? I'm sure we all agree Kubernetes is the best platform for running machine learning and data science, right? At the end, I'll talk about why you might not want to do that. I'm going to focus on only one or two things, because there's quite a lot: some of the famous frameworks, the ones people use a lot, like PyTorch and TensorFlow, and how you can take these and run them inside Kubernetes. Usually in machine learning and data science it all starts with data. You want to grab some data: extract it, transform it, load it. That looks like one box on the diagram, but in most cases it ends up looking something like this, so it's quite a complicated part of your whole pipeline. And that's not where it ends, because once you've got the data you need to train on it as well, and that also gets complicated: there are a lot of steps you need to look after and understand, and this can lead to a situation where you're trying to figure out where something went wrong and you end up staring at it like this, not knowing what's going on. Especially if you've fired all the people; then you're kind of stuck and nothing is going to happen. All right, no more Twitter or Elon Musk jokes, we'll move on. Then we analyse, then we do some predictions; production is where we serve the model. And then you keep going around in circles. So what kind of frameworks are available to do all these steps?

One of them is an open source project called Kubeflow. Who's heard of Kubeflow? Yeah, excellent, everybody knows, so we don't need to go into it in depth. Basically, the most popular projects out there come bundled together, so you can deploy them on your cluster. You install it on Kubernetes and you get JupyterLab notebooks, the notebooks data scientists use to do their job, and you can plug into OIDC to do multi-tenancy. Then you have pipelines as well, because you've got to set up those ETL pipelines, and you can write them in Python; you don't have to write YAML files. I know it sounds weird writing pipelines in anything other than YAML, but you can do it in Python, which is useful for data scientists and machine learning people. Then, I guess the most interesting bit, the operators: operators extend Kubernetes functionality, and we're going to focus on this in a few minutes, but quite a few operators come included that extend Kubernetes in really useful ways. There's also some other stuff: with neural networks you usually have hyperparameters you need to configure and tune, like how many layers you have or what the connection weights are. You can do that manually, or automatically using a project called Katib. You probably can't read it, but there's also a metadata component, for tagging information about your models, versioning and that kind of stuff. And then you can do serving with KFServing, which is based on Knative; Knative basically allows you to serve models, scale them out, and scale back down when not needed. And you can run any of your favourite machine learning frameworks on top of Kubernetes this way. So Kubeflow is good, but it installs a lot of operators and a lot of CRDs on your cluster, and that brings challenges of installing and managing: if you don't need all the options in here, it might not be the solution for you, because it installs everything. You can install specific bits of the framework on your cluster, but that's still something you have to install and look after.

Kubeflow Pipelines is probably the piece data scientists and machine learning people use the most: you write a multi-step workflow, all in Python, and then you can productionize it. You can run it on your laptop, say, the same way you'd run it in any Kubernetes cluster, and each of the blocks you see in the pipeline is a pod, so you can scale them out and back down depending on your requirements. Anyway, that's Kubeflow, and as I said, it might not be the solution for you if you don't need everything it provides. So what do you do then, if you need to use some of these frameworks?

The easiest thing, of course, is to take any framework you like, for example TensorFlow here, and write your code; this is just model code that you can serve. And we all know that if you want to run this in Kubernetes, we can stick it in a Dockerfile: the model itself, all the code, whatever we need, or a training job, and run it in Kubernetes as part of a deployment. Unfortunately you can't really see it, but this is a normal Kubernetes Deployment, and this one is just serving a model. Once it's serving a model, you can expose it using an Ingress so people can start calling it, and you can stick a pod autoscaler in there so it scales out if more requests come in and scales back down when not needed.

I guess the most interesting bit is Kubernetes operators. We all know operators, we love them; I'm sure I don't need to explain what they are, but imagine I submit a deployment file that creates a couple of pods. I submit it to the API server, which stores the information in a database called etcd. Then there's a component called the controller manager that listens for changes in etcd; once it sees there's a deployment that needs to be created, it takes action and starts creating pods, and the pods actually get created on the worker nodes. Kubernetes understands pods, of course, and it understands a lot more: deployments, replica sets, services, quite a few things. But what if you wanted something new? What if you wanted, say, Postgres, a Postgres database? You can define a custom resource definition in your clusters, as Marco was showing earlier on: he had a custom resource definition for the cluster that needed to be created. Then you deploy an operator into the cluster, which is kind of like the controller, listening for changes in etcd: if somebody submits a resource of kind Postgres, the operator kicks in and says, okay, you're asking for a database, runs its own logic (this is super high level, by the way), and eventually goes and spins up a Postgres database in the cluster. I'm not saying you should run Postgres in a cluster, but this is just an example.

Along the same lines, we have a bunch of machine learning frameworks. This is MXNet; has anyone heard of MXNet before? Yeah, excellent. It's a deep learning framework: you can define deep neural nets, and you can run it in the cloud or on mobile devices, so a lot of vision work is done in MXNet, and NLP, natural language processing; there are a lot of algorithms in there that can be used. One of the things MXNet supports is what's known as distributed training, because training jobs can require a lot of resources, and to get faster training we want to split the job up so it runs on multiple nodes, which is quite beneficial. So we have this concept of a scheduler; these are all machines, by the way, virtual or physical, and the scheduler is responsible for bootstrapping the cluster and making sure each node knows what job it needs to run.
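The operator pattern described above can be sketched for this framework as a custom resource. This assumes the MXJob API from the Kubeflow training operator (`kubeflow.org/v1`); the image name and training script are hypothetical placeholders:

```yaml
# Sketch of an MXJob custom resource; image and command are placeholders.
apiVersion: kubeflow.org/v1
kind: MXJob
metadata:
  name: mnist-train
spec:
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1              # bootstraps the cluster of pods
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: registry.example.com/mxnet-train:latest
    Server:
      replicas: 2              # parameter servers holding shared job state
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: registry.example.com/mxnet-train:latest
    Worker:
      replicas: 4              # workers that do the actual training
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: registry.example.com/mxnet-train:latest
              command: ["python", "train_mnist.py"]
```

Each replica type (scheduler, server, worker) becomes a set of pods, which is what lets the pod and cluster autoscalers apply to distributed training.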
To set that up yourself, you'd have to provision your own machines and install MXNet on each of them. You get the servers, which basically just store information about the job being run; the server connects to the scheduler, the scheduler understands how the job has been split up; and then you have the workers, the bits that actually do the training. Each worker gets its piece of the job, does the training, and comes back with a response saying your job is complete. Doing that by hand means setting up all the machines yourself, but in Kubernetes you can use the MXNet operator, and then all of these components become pods. That's great, because all you need is a cluster: you install the operator, and you can start submitting jobs. The operator comes with Kubeflow, but you don't have to install the whole of Kubeflow; you can install it standalone on your cluster (you can write your own operator too, of course). Then you define the kind of job you want to run. It's probably quite small for you to read, but in here you define the architecture: how many schedulers you want, how many servers, how many workers, and what job you're trying to run, usually a Python file running inside the container. You can also stick an autoscaler on it, so if you need more workers, no problem, they spin up. You submit this job to the cluster, and you can see we have a scheduler, a server, and workers; we can configure all of this, and with a pod autoscaler or cluster autoscaler it can scale out and scale back down when it needs to, which is quite useful.

There are quite a few frameworks that make use of this operator pattern, and I'm not going to touch on all of them because there's quite a lot. This is Spark; who uses Spark here? Yeah, excellent, we've got a few people. Spark is basically a real-time and batch processing framework for big data: you can do data jobs like ETL, IoT jobs, ML jobs; it's quite good. The way it works is you have what's known as a Spark driver, a Java process that runs the main of your application (I'll show you an example in a second). Then you have what's known as a cluster manager, which lets you decide what kind of underlying cluster you want to use: standalone, so you can run it on your own machine, or Kubernetes. Then you have worker nodes, and on the worker nodes there are executors that actually run the tasks of your Spark application and report back to the driver. The cluster manager is really just there to spin things up; the executors talk directly to the driver. Now, here's the thing: you can run this in Kubernetes, Spark has an operator as well, and guess what, they all become pods. And if they all become pods, again, same story: we can scale up and scale back down whenever we need to, and you can submit your Spark applications and monitor them like any other Kubernetes resource. This is what a SparkApplication looks like: it has some information about what Docker image you'd like to run, what the actual file you want to run is, and any volumes or anything like that; there are quite a few options in here that we can use. And Spark is not the only framework that lets you do this; there are many more that use Kubernetes as a platform because of the scalability, and in the cloud there are managed versions too.

So, we're coming to an end, and I just wanted to mention a couple of things. We saw how we can use Kubernetes, to some extent, to do machine learning, but when should we not use Kubernetes for machine learning jobs? I shouldn't really say "don't use Kubernetes" at a Kubernetes community day, but we have to look at every side. There's a lot of complexity: you're running operators, and if you manage a cluster you might understand the operators themselves, but when Spark breaks, you've got to understand what Spark is to be able to fix it, because people will come up and say "we need to run this". So there's a lot of complexity to deal with, plus releasing and all that stuff. Sometimes it might be easier to use cloud services, maybe AWS SageMaker. Also, if you don't need any of the autoscaling, then it doesn't make as much sense. If you're already using Kubernetes for the rest of your organization, you might still want to extend it out to your data scientists; but if not done correctly, Kubernetes can cost a lot. I don't know if you'll agree, but that does happen. Before I end, I just want to show one thing: if you do machine learning on laptops, it sometimes gets you, because there's a lot to learn; if you do machine learning on Kubernetes, the outcome is the same, but you maybe look a little bit cooler while doing it. Same output, but again, there's a lot of complexity to deal with.

To summarize really quickly, I know we've only got a minute: of course you can use Kubernetes to build machine learning platforms, and it has benefits and drawbacks; as long as you know how to manage Kubernetes, that's fine, but a lot of data science and machine learning organizations don't have that capability, and I'm talking about small to medium scale, not large scale. I'm not going to make a Twitter joke. Anyway, check out the Appvia website, and also check out the learnk8s.io website, where we've got a bunch of blogs. If you have any questions, you can catch me here; I'm going to be sticking around today and tomorrow, and you can follow me on Twitter. Thank you all very much for listening. [Applause]
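The SparkApplication resource mentioned in the talk, as handled by the spark-on-k8s operator, looks roughly like the sketch below. This assumes the `sparkoperator.k8s.io/v1beta2` API; the image, application file, and service account are hypothetical placeholders:

```yaml
# Sketch of a SparkApplication; image and file paths are placeholders.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-job
  namespace: default
spec:
  type: Python
  mode: cluster
  image: registry.example.com/spark-etl:latest
  mainApplicationFile: local:///opt/app/etl.py   # the actual file you want to run
  sparkVersion: "3.3.0"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    instances: 2          # executors become pods that run your tasks
    cores: 1
    memory: 512m
```

Once submitted, the driver and executors show up as ordinary pods, so you can watch them with `kubectl get pods` like any other workload.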
Info
Channel: Kubernetes Community Days UK
Views: 2,957
Keywords: KCDUK, Kubernetes, K8s
Id: r7YMDBmlGh4
Length: 25min 44sec (1544 seconds)
Published: Thu Dec 01 2022