Machine Learning Algorithms with H2o and PowerShell by Tome Tanasovski

Captions
Hello everybody, welcome to another day of PowerShelling and Summiting. First of all, this session is AI/ML with H2O and PowerShell, so hopefully you're in the right room.

As far as who I am and why I'm up on this stage talking about this: number one, I work in New York City for a Fortune 100 company downtown. I spend my time working on distributed computing and cloud computing platforms and those types of systems; that's where I spend most of my time. I do not work as a data scientist, nor do I work as a machine learning expert. I am simply a hobbyist. I've taken the Stanford Coursera course, and I've implemented a bunch of these algorithms by hand, from scratch, just for academic reasons, and what I found is that this stuff is actually accessible now. What we're going to be talking about today is stuff that you can embed into your applications: building data models, working out predictions, clustering-type algorithms, to help in whatever way you think might fit the applications you're developing. Within my company I enjoy looking at the infrastructure side of this, so I like spending time with data scientists to understand their problems, so that we can figure out how to build the cloud computing systems underneath, how to distribute these problems more easily, and how to help them get their processing to happen faster. But again, that's not my day job.

The third thing is that I do spend a lot of time with PowerShell. Some of you may know me through the years: I'm an eight-time MVP award recipient, I founded the New York City PowerShell user group, and I also founded the Techstravaganza event that we do in New York City every now and then, which has an entire PowerShell track. I spend a huge amount of time bringing things that are happening outside the Microsoft ecosystem into the PowerShell community, helping people see that there are frameworks out there that people are leveraging all the time. So today we're going to be talking about H2O; tomorrow I'm doing a talk on Apache ZooKeeper and writing clustered applications with it. That's why I'm standing here right now. If you drill me on a lot of the data science questions I will probably flounder a bit, but I'm going to do the best I can to at least give you some starting points.

This is a soup-to-nuts session: you don't have to have any experience with machine learning, and you don't have to have any experience with H2O. PowerShell will definitely help, because we're going to be invoking a lot of REST methods and leaning on the web cmdlets (I think there's an hour-and-45-minute talk happening right now on exactly how to use them); that's the stuff I'm leveraging to access this platform.

So what is the platform? Why are we even talking about this? What is H2O? Most importantly, it's an open source platform. H2O is a company; there is an enterprise version that they build on top of the open source, adding all this polish to make things super easy for data scientists. That's where they're headed, but I'm not talking about that. I'm talking about what's open source, what you can leverage today, what you can embed in your applications and your PowerShell code. The next thing that I really like about it is that it's API-driven.
There are a couple of SDKs out there: Python and R come out of the box, and all the books you're going to read use those two languages. There is no PowerShell one. All the prototype code I'm going to be showing today could actually be converted into a module very easily, and if anybody's willing to take that on as a project, I'm happy to hand over the code. It's all MIT licensed, so feel free to play with it and do something with it. It's a fantastic project even if you just want to understand how to build a module out of something that's REST-driven.

The next thing is that, because it's API-driven, it's cross-platform, and not just cross-platform: you can have multiple languages using the H2O framework at the same time. I can have a Java application set up a data model and then have PowerShell connect to that instance and leverage that data model, or vice versa; it's completely interoperable. H2O is a Java application on its own that spins up as a server, so it will maintain state while it's living, and when it goes away we lose the state of everything that was there, unless you serialize some of that stuff out. But the point is that it's cross-platform and it will work from multiple languages.

The next thing that's kind of interesting is that there's a GUI for doing all of this. What I'm going to be showing you today is a series of demos, and we're going to start in the GUI, because the REST API translates one-to-one from the GUI. If I train a data model in the GUI using certain parameters, then when I go to PowerShell later I will use those same parameters and go through the entire workflow the same way. The GUI is actually pretty nice for prototyping things quickly: you can learn how to do something and then turn it into a script that you can repeat over and over.

Another thing that's nice about it is that it's self-contained, meaning it's not dependent on a cloud vendor. The API-driven idea is not new; plenty of people are doing this. Azure ML is getting into that space, AWS SageMaker kind of works this way to a degree, and there's a proliferation of libraries coming out. The point is that while the API-driven thing isn't new, it means I don't need a cloud vendor for it. I can run this on-prem, I can run it in a cloud, I can do whatever I want with this open-source bit.

The next thing that's super interesting is Sparkling Water; got to love that library name. Sparkling Water is a combination of H2O and Spark. If you don't know Spark, it's a distributed computing platform, so you can have clusters of Spark instances, and what Sparkling Water does is this: for any algorithm you pick and start to train a data model on, if the work can be done in a distributed fashion, it will leverage your Spark cluster to do so. If you happen to have Spark clusters, this is fantastic, or if you can spin up Spark clusters on demand for this type of processing, it's really neat. I toyed with doing all my demos with Sparkling Water, but I only have my laptop, so we're going to do a single-instance H2O for all of this,
but it's pretty simple to spin up.

The next thing that I really like is that there's a path to production. Because the company H2O is in the business of making it super easy to productionize the data models built on the platform, they had another open source project called Steam, which is now deprecated, but it still exists and you can fork it. There's one particular bit of service that's open source there that will convert the data models in H2O into REST services, and it's really cool. It's a way to export out of H2O and get a WAR file that you can run as a web service somewhere, which you can then invoke via PowerShell. So you don't necessarily even need H2O after you've trained up your data model.

And, I don't know if this is the final one, but: grid search and AutoML. The other thing that's really nice about this platform comes up when you start working with these algorithms. You're going to see in a second, when I show you the first algorithm and we look at all the parameter options, that it's blinding: there are so many parameters, they all do things, and we don't know what half of them actually do. Data scientists, when you talk to them, say "oh, I like to play with these parameters, I like to try these things," but there are no hard-and-fast rules that this will always work. The answer to that is what's called grid searching. (There's another technique called Bayesian optimization, which is actually even better, that H2O doesn't have out of the box, though there is a third-party library for it.) The idea is that I can test a whole bunch of different parameter combinations against my algorithms, try them all, see which ones turned out the best, and evaluate which model I want to use. That functionality makes this all really easy, and AutoML is the easiest of them all; it just takes a while to run. It's a no-brainer: AutoML, here's my data, find me the best thing possible, and a couple of hours later it comes back with a bunch of data models that you can look at and evaluate. At that point you could say: all right, this algorithm seemed to work best for this data set, it seems pretty consistent, so in the future I won't run all of them, I'll just do grid searches on that one algorithm. It simplifies things for you, and it's very easy.

So this entire demo, where we're going to train up some data models and look at some of the open datasets that exist, could almost all be done in about five minutes. But I'm going to take a lot of time: we're going to go through each of the arguments, talk about what's actually happening under the covers, and get a little bit deeper with it.

As far as what we're going to cover: we start with H2O itself. I'll show you how we download it, spin it up, and access it. We'll talk about the user interface, H2O Flow, and we'll train our first data model in it. At that point we'll also look at REST hacking, so I'll show you how to translate from the web UI into PowerShell. We'll then generate some data models for a pair of open datasets: the Iris data, which is
predicting the flower type based off the dimensions of the flower, as well as the MNIST data, which is handwriting recognition for digits. We'll look at both of those. We'll also look at, well, not really the mathematics; I'm not going to go into the mathematics of this, but what I have found is that understanding, from a visual perspective, some idea of what the math is actually doing is very helpful in understanding what you're actually doing. There will be no numbers and no symbols during the mathematics part, but you will see some charts and graphs, and I'll explain what's happening as we go through them.

We'll also do a demo on unsupervised learning. This is the idea of finding clusters and groups of common patterns within your datasets. The greatest example of this is the Netflix recommendation engine: say I think there are fifty different types of people in this world, and I want to bucket the world that way and find out which movies each group has been liking together, so that we can recommend additional movies to them. We'll do a demo of grid search, the parameter tuning. I'll show you a demo of productionizing these data models and turning them into REST services that you can access from PowerShell much more easily. And then we'll do a demo of AutoML and look at additional resources. Big day, big day, everybody, get ready, hands up. That's the end of the slides, I promise; the rest of it is pretty much all in code until we get to some of the math stuff.

All right, so the first thing that we're going to do is start H2O. You download H2O from their website, h2o.ai; there's a latest stable release you can get, and the instructions to install are pretty simple: unzip, then run java -jar against the JAR file they ship. That's kind of hard to see, but it's three easy steps. What it looks like over here is simple: I have a folder where I've already downloaded a whole bunch of files, so let me show you what's in it. In this directory I'm going to highlight the things that didn't come with H2O, so anything that's unhighlighted is what came with it. The h2o.jar file is 86 MB, so it's not even that bad. And then it looks like this: java -jar, and we run H2O. Obviously you have to have Java installed; don't ask me about versions of Java or anything like that, I really know nothing about Java, I just know how to execute and compile. This usually takes only a second, so hopefully I'm not getting the demo curse... looks like we're all right.

So it spins up H2O, and then it gives you this H2O Flow browser. Don't ask me about securing this at the moment; there are plenty of documents you can find online about all that. I'm simply going to pretend that there is massive security and that I'm the only one who can access it. This is what the interface looks like; let me make it a little bigger so you can see. If you have played with Jupyter notebooks or Zeppelin notebooks or any of the web notebooks that exist, it's similar in its idea. These are what are called cells.
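For reference, the download-and-run step described above boils down to something like this (a minimal sketch: the JAR name varies by release, and 54321 is H2O's usual default port, so verify against your own download):

    # Minimal sketch: unzip the release, then launch the single-node server.
    # The jar file name varies by version; adjust to match your download.
    Set-Location C:\h2o          # illustrative path
    java -jar .\h2o.jar          # Java must be on the PATH
    # Once it's up, browse to http://localhost:54321 for the Flow UI.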
In each of these cells I can clear things out and re-run any command that I've run: I can come back up to here, hit Shift+Enter, and it runs again. I can also run that assist command just by hitting this button, and every time I do something it basically adds something to the bottom. It's called a flow because you're showing the flow of how you're doing your data manipulation, modeling, processing, and all of that.

Before we get into anything real, I also want to show that the help on the right-hand side has a decent amount of material. There are examples (I'll be showing one of them, the k-means clustering example, later on), and there's also the H2O REST API documentation, which is embedded right in here and is super helpful when you're using this with PowerShell. Okay, so this is the interface.

The first thing that we want to do is import some data (it's really hot in here), so let's talk about the data that we're looking at. It's called the Iris data. It's a very small data set, which is really interesting when you think about it, because we're talking about 150 rows. So it's not a huge data set, and all that's in it (this is a CSV file) is the sepal length and sepal width, which are dimensions of the flower, the petal length and the petal width, and then a definition of what the flower actually is. In this case you can see that this is an Iris setosa, and if you come down here you'll see that there are three categories of flowers: Iris setosa, Iris versicolor, and Iris virginica. What we're going to do today is train on this data set to say: with these four input columns, we know this row is going to be this class, so build me a model where, if I give you those dimensions in the future, you can predict what class of flower it is and tell me with what confidence.

So let me pull this data into H2O. The way we do this is by importing files. I'll take one second to show the web hacking here: Ctrl+Shift+I in Chrome, Inspect; the Network tab is the one you want to watch when you're looking at this stuff. You can see it's doing a whole bunch of lookups right now, which I really don't care about; what I care about is the next thing that happens. After I add this (let me clear, and I import), you can see over here on the right the path, the URL that was used. I can come in and look at the actual POST data that was sent, as well as the response data, pretty easily. What you'll find when working with H2O specifically is that there are some things the JavaScript web GUI does and implies, so you don't realize that something is actually a two-step process; it looks like a one-step process instead. Probably the biggest example of this is the next call, the parse.
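As a hedged sketch of that translation, the importFiles cell in Flow corresponds to a plain POST against the v3 REST API (endpoint and field names per H2O's embedded REST docs; the local path here is illustrative):

    # The importFiles cell, replayed from PowerShell. Watch the Network tab
    # in DevTools to confirm the exact endpoint and body for your version.
    $api = 'http://localhost:54321/3'
    $import = Invoke-RestMethod -Method Post -Uri "$api/ImportFiles" `
        -Body @{ path = 'C:\data\iris.csv' }      # hashtable bodies get form-encoded
    $import.destination_frames                    # the frame key(s) H2O assigned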
So: we've imported the file, and now we have to parse it, because it's a CSV. When I hit Parse over here, it runs this parse setup function, which is the function that was called here. Parse setup is kind of interesting in that it makes some predictions about what types of data you have: it finds that these columns are numeric, it gives me an enum for the class column; it's done all of this for me. The catch is that when I click Parse, I have to take all of the outputs returned from that parse setup call and pass them into the parse function. I'll show you the helper code in PowerShell in a second that makes this a little easier, but for now just know that there are a lot of cases where you have to take the return value from one function, manipulate it in PowerShell, and then send it off to the REST interface for the next one.

So we now have this Iris data in here, and when you have data in H2O (let me close this so it's not so cluttered), the first thing that happens is you get some statistics on it. We can see a little bit about the data: I can see that the min here is 4.3 and 7.9 is the max, the averages, etc. I can also see in the enum column that there are 50 labeled zero, which is actually expected. Generally, missing or zero values are something you might be concerned about; it's more about data quality, making sure you have a good data set. In this particular case, though, the enum is 0, 1, 2 because it's ordinal, and there are 50 of one class; that's why it shows up as 50 zeros.

This does raise a very good point, though: PowerShell is one of the best languages ever for data manipulation. Because you have your filtering, your conditionals, your Where clauses, all built into the pipeline, data manipulation is actually better to do in PowerShell than in something like H2O. I'll also say, having done this in pandas on Python and used some of those data-frame technologies, that PowerShell blows them out of the park for usability and for making data manipulation simple. So what I would suggest is: anything you have to do to fix the data, do it in PowerShell, send it out to CSVs, and then import it into H2O. It just makes your life easier, and there's less figuring out which settings and data types to use.

Okay, so I have data in here. The next thing I want to do, before anything else, is split this data into two sets. I'm basically going to say I want 90% of it to be one set, which we'll call our train data, and 10% to be my test data. This is a pretty common pattern: split up your data sets randomly when you're training your data models, so that you can build your model from the 90% train data and then validate the model against the test data to see exactly how well it did. We're going to talk a lot later about overfitting, which is the idea that you've made the model fit every bit of train data perfectly while nothing on test data ever works, because it's so hyper-fit to the one particular set of data; that's something you absolutely have to avoid.
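Picking up the earlier point about fixing data in PowerShell before handing a CSV to H2O, here's a tiny illustrative example of that pattern (file and column names are hypothetical):

    # Hypothetical cleanup: drop incomplete rows and tidy the label
    # column, then write a clean CSV for H2O to parse.
    Import-Csv .\iris_raw.csv |
        Where-Object { $_.petal_length -and $_.sepal_length } |   # drop rows missing measurements
        ForEach-Object { $_.class = $_.class.Trim(); $_ } |       # normalize the label values
        Export-Csv .\iris_clean.csv -NoTypeInformation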
Anyway, at this point I've got my train data and my test data. The next thing I would do in H2O (I'm going to call my assist menu again so I can get my options back) is build a data model out of this train data. Make sure I can do it from here... oh, I'm sorry, one other thing I wanted to show you is that there's also the ability to view the data, so if you want to actually see what's in there without loading it into Excel, you can get a sampling from the system.

Once I have that, with my train data right here, I can now build a model. This is where the fun happens: you have to choose an algorithm. What do we choose? Anyone want to pick one? The honest truth is that it's very confusing to figure out what to pick, especially when you have no understanding of what any of these are doing. What I can say is that there is an algorithms page in the H2O documentation, right here, that lists the common parameters used by all of these, and then the algorithms are bucketed into three groups: supervised, unsupervised, and miscellaneous. Miscellaneous is really NLP, natural language processing, which is basically taking documents and tokenizing them in a way that lets them be used by machine learning; I'm not going to talk about that today. We're only going to look at supervised and unsupervised. Supervised basically means there is a category in the data set, something we know: in the Iris data we knew what class of flower each flower was. Unsupervised is like that Netflix recommendation idea: you want to find groupings, but you don't know in advance what they're going to be, and what you want is a model where, if I send it some data, it tells me which group that data belongs to. That's the goal with those.

Now, as far as choosing one: they're all kind of crazy and different, and there are a lot of parameters. This is the documentation for just one algorithm, and if you look, there is a lot of stuff that can go in here. I don't know what half of this stuff does, to be completely frank. It's a matter of tweaking: you can read up on these things as you play with them, try a different parameter each time, and expand your understanding of each as you look for better data models. But for the purpose of our experiment today, I'm going to show you the simplest way of doing these, and I'm going to choose this gradient boosted machine, although, to be honest, the interfaces are all the same; that's the interesting thing we're going to see.

What I'm doing right now is the very basics. The training frame, I know, is going to be this train data; the validation frame is the test data I created. nfolds is probably the most interesting one to think about: it's one of the techniques of cross-validation where, without having to keep separate train and test data, it does some interesting randomizations to try to prevent overfitting on its own. I'm going to leave it alone for now, though. And finally, I have to tell it which column is the thing we're predicting.
I don't have to ignore that column, because it knows that's the one I'm predicting, so it won't include it in the inputs. At a very basic level, that's all the information I need to submit. Like I said, there's a whole bunch of other options, and we'll talk about grid searching and how to tweak these later, but let me just show you how we build a model very quickly... and that's done. No, really, it's done; that data model now exists.

So I can look at that data model and see some statistics on it. First, I can see how it was built and all the parameters that were used. I can see logloss, which is what's used here; another one you'll see me pull up is MSE, the mean squared error. In all of these cases it's a matter of trying to get the number as close to zero as possible, because that basically means the model is fitting nicely. Here you can see what the data model is actually using, and this is really interesting: petal length and petal width. It feels it needs that information to make the prediction, while sepal length and sepal width are not as important to the classification of this flower. That's a fascinating fact that you might not have gleaned from this data by looking at it yourself, but as these models interpret it, you see what's being used underneath.

You can also see how it was performing. This one is actually performing pretty poorly, to be honest: even on the training data it's getting about four wrong here and three wrong there. On the validation data it's actually done pretty well, where it's only getting two wrong out of the full set. But anyway, like I said, eventually we're going to be tweaking this and trying to get better and better. For now, we have a data model and we can do predictions; the work is done. Now the question is, how do we use it?

Using it is just as simple as importing some new data that we want to evaluate and then predicting off of it. Let me do that right now. I have a file that I created in PowerShell (I'll show you the code) that is just one row. It has these inputs, 5.1, 3.5, 1.4, and 0.15, and the model is going to tell me what flower that is. The way I do that is: first, import that data, so we go through all the rigmarole we did before, which is fantastic when you have PowerShell, because you can just do it quickly. What was it called? Predict, okay. So we import the data, run parse setup, parse the files, and then the job runs. Another thing to note: any time the job screen comes up, that's an asynchronous call. I'll show you in code in a little bit how I turn asynchronous calls into synchronous calls; there's basically a callback lookup URL where you can see the status of the job, and a whole job subsystem built into H2O for this purpose.

Okay, so I've imported the data, and now I'm going to go back to my data model. I get my model, and I'm going to predict using it: I say, okay, use that predict data set that came in, and predict. Now I can look at the result, and I can see that there is one row and four columns, which was expected, and what comes back says it's an Iris setosa, and then I've got three numbers here, which took me a long time to figure out.
What was going on, especially in code, got really confusing, but these are percentages of confidence. In this case it's saying that, with one hundred percent confidence based off this data model, this is Iris setosa, and it's impossible for it to be versicolor or virginica. You'll see in other data models, when you run them, different percentages for what it thinks its confidence is. That's it: we did the bulk of the hard work, and now you can actually leverage this thing and use it.

So what's next? The next thing is: how do we do this in PowerShell? For this, I'm going to do the same exact thing, but we're going to use a different algorithm. This time I'm going to do a neural net, because neural nets are the hotness, right? That's the "real" machine learning; I'll explain why that is in a bit. But I want to start from scratch, so I'm going to restart my H2O, and when I do that, like I said, it doesn't serialize its state down, so in a second this will come back blank and we'll be starting from scratch. There it goes. If I run getFrames now, I should see nothing, and if I run getModels, I should see nothing. This is a blank H2O with nothing in it right now.

The next thing to do is build up my PowerShell environment. What the heck is that, Windows Ink? No. All right, we're layered here. Apologies, I know you guys aren't used to vi, but I am. I'm going to put the code and the scripts we're looking at up at the top, and down at the bottom is where I'll be pasting and executing, that kind of stuff.

Okay, so there are two functions I have to talk about before we get into actually using the REST API. The first is ConvertTo-FormData. This is the bit I was describing before: I take the return value from, say, that parse setup function and pass it to the parse function. All this function does is take an object and look at its properties. If a property is boolean, then the fragment that gets posted into the form data says property equals true (or false), and I add the ampersand. Then I do the same thing for arrays, so collections get formatted with the proper syntax. Finally, anything else is just a key-value pair, so I output property equals value with the ampersand at the end. At the very end of the function I remove the last ampersand, so that I have a complete POST body that I can throw at the REST API. Is that clear? Sound good? All right.

The next one is how I turn the asynchronous calls into synchronous calls, or at least give you a hook to poll and look at them. The way I do it is with Wait-H2OJob, which takes a path. What I'll show you in a bit, when we look at the return values, is that H2O provides you a URL; it's not really a full URL, it's the path part of the API URL, and we add it to the REST API base. This way you have the callback, and I query it right here every half second. Actually, I'm going to change that to 2 seconds; I think it makes my machine crash less when we do some of the really deep neural net stuff in a bit.
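Here is a minimal reconstruction of those two helpers as described (the names match the talk, but the bodies are my approximations; treat the blog post's prototype as the real reference):

    # ConvertTo-FormData: flatten an object's properties into a
    # url-encoded body string, as described above.
    function ConvertTo-FormData {
        param([Parameter(Mandatory, ValueFromPipeline)]$InputObject)
        process {
            $pairs = foreach ($p in $InputObject.PSObject.Properties) {
                if ($p.Value -is [bool]) {
                    '{0}={1}' -f $p.Name, $p.Value.ToString().ToLower()   # booleans as true/false
                }
                elseif ($p.Value -is [array]) {
                    # arrays as bracketed lists; string elements quoted for H2O
                    $items = $p.Value | ForEach-Object {
                        if ($_ -is [string]) { '"{0}"' -f $_ } else { $_ }
                    }
                    '{0}=[{1}]' -f $p.Name, ($items -join ',')
                }
                else {
                    '{0}={1}' -f $p.Name, $p.Value                        # plain key=value
                }
            }
            $pairs -join '&'    # joining with & also avoids the trailing ampersand
        }
    }

    # Wait-H2OJob: poll the jobs subsystem until an async call finishes.
    function Wait-H2OJob {
        param(
            [Parameter(Mandatory)][string]$JobPath,      # e.g. '/3/Jobs/<key>' from a response
            [string]$BaseUrl = 'http://localhost:54321'
        )
        do {
            Start-Sleep -Seconds 2                       # the 2-second interval from the talk
            $job = Invoke-RestMethod -Uri ($BaseUrl + $JobPath)
        } while ($job.jobs[0].status -in 'CREATED', 'RUNNING')
        $job.jobs[0]
    }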
The next thing I have is the URL; this is the base of the API. I'm just constantly using localhost. I could have parameterized it; there's so much I could have done. This is prototype code, please: not production-grade code, just proving the idea. If you're not familiar with this syntax (hopefully everybody is, because this is pretty much how I do all of my string manipulation), it's the old format-string templating within your strings. This bracket-zero syntax: when you use the -f operator with variables, the first variable goes into all the {0} placeholders, and the next one goes into the {1}s. It's just a way to quickly template and add content to strings while keeping fidelity.

The next thing is my iris URL. The original demo (the code sample I'm showing right now is on my blog, powertoe.wordpress.com, the last blog post) pulled this data from the internet, grabbing it in real time, but just because I was worried about demo gods and all of that, I downloaded it locally, which actually required some of the later calls to change a bit. This parse setup body had to be changed a little; it's a syntax thing, basically to do with the destination frames H2O interprets, and the names. I'll spare you the details; it's a little bit of nonsense, and anyway, we should be abstracting those into better functions, which I do a little later.

The next thing I have here is the first function that gets called. In the flow we did earlier, we ran the importFiles function, and here you can see me calling that same import. All I'm doing is setting the body, the path, equal to the iris URL, which is the local one, and then invoking the REST method and getting a return value. That return value holds most of the data. I'm going to run this right now; actually, I'll run it up to here, because the second command after this is parse setup. Parse setup is the one where I needed the return data back, so that's why I want to pause there, so you can see the output. Let me comment the rest of this out and run it. Chuckle at my environment all you want; this is easier for me, it's vim-style.

All right, so I have the stuff here, and what I really want to show is this return value. If you look in here, there's all the data that gets pulled back. What I'm going to do in a second is take that output, select these parameters from it, and pipe that into ConvertTo-FormData, which converts it into the POST body I'm going to submit back to the REST API. Everyone follow? This is the actual PowerShell stuff; you guys should know this. All right, let me comment this back out, and here I'm going to actually invoke it, up to here. What I'm next going to do is call the parse function and invoke the REST method.
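A hedged sketch of that handoff, building on the helper sketches above (endpoint and parameter names follow H2O's v3 REST docs, and the source-frame flattening is the syntax tweak the talk mentions):

    # Take what parseSetup returns, massage it, and hand it to parse.
    $api = 'http://localhost:54321/3'
    $setup = Invoke-RestMethod -Method Post -Uri "$api/ParseSetup" `
        -Body ('source_frames=["{0}"]' -f $import.destination_frames[0])
    $setup.source_frames = @($setup.source_frames.name)     # flatten objects to plain frame names
    $parseBody = ($setup |
        Select-Object source_frames, parse_type, separator, number_columns, single_quotes,
                      column_names, column_types, check_header, chunk_size |
        ConvertTo-FormData) + '&destination_frame=iris.hex&delete_on_done=true'
    $parse = Invoke-RestMethod -Method Post -Uri "$api/Parse" -Body $parseBody
    Wait-H2OJob -JobPath $parse.job.key.URL | Out-Null      # async: wait on the job subsystem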
I'm going to get the return value, and what I want to show you before I run it is this returned job key URL. This is the path I mentioned that gets returned by asynchronous calls. If I look at it, it's just a path into the job subsystem; this is the job running in the background. And now, if I run my Wait-H2OJob on that, it's actually already done (this took a couple of seconds), but just for the sake of completeness, it says it's done, and you can see that it's been applying the same pattern: it just adds that path to the REST API call.

The next thing we do is call the REST API for splitting the frames. The rest of this is all pretty straightforward; I'm really just doing everything that was done in Flow, and the way I found out what parameters to supply was by inspecting the network messages, seeing what was actually getting passed when I did the thing I wanted to do, looking at what it was supposed to look like, and then applying that in the code here. If anybody wants to do the open source project on this, it's really a matter of taking each of these API calls and turning them into functions; it's pretty straightforward.

So we're going to split the data, and then finally we're going to train, this time using a deep learning model (the more black-box type of stuff); we'll pull that data back, and then we're going to predict. This prediction is the exact same thing we did before; the difference is that we're going to predict off of the test data, to tell me how well the model is performing. I'm going to just run this whole thing again, and I'll probably get a couple of errors because I've already imported some of this data; actually, you know what, for completeness let's just restart. Okay, that's up, and this time I'm going to run it up to the point where we train the data model. Loading the data... parsing the data... splitting the data... training the deep model... and it's done. And if I look, it's validated. That just did a neural net for me in, what, a couple of seconds? The neural nets do take longer with larger data sets; the MNIST data we'll look at in a bit, which is 20 MB or something, is going to take ten minutes, but it's actually not that bad. And here you can see the MSE is about 3.1. This is actually a really bad data model, one I would never use; you'd need to tweak it, because it's not that accurate. I only know that because I've done this dataset so many times and seen what good data models look like, so I can tell you that's not good.

Finally, the last thing we're going to do is the prediction. You can imagine this is one way you can have these data models living in H2O (we can also serialize them out and reload them), but the idea is that they're living in H2O now, so if this data model exists there and I want to ask it for a prediction, it's as simple as this code. All I'm doing is creating my CSV file for the thing I want to test; I create it as an ASCII file right here, predict.csv, and then I call all the same functions as before.
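Condensed into a hedged sketch, the split-train-predict sequence just described looks roughly like this (endpoints per the v3 REST docs; frame and model names are mine, and the job-key field locations are worth verifying against the actual responses):

    # Split 90/10, train a deep learning model, then score the test frame.
    $split = Invoke-RestMethod -Method Post -Uri "$api/SplitFrame" -Body (
        'dataset=iris.hex&ratios=[0.9]&destination_frames=["train.hex","test.hex"]')
    Wait-H2OJob -JobPath $split.key.URL | Out-Null         # async job; key location per v3 docs

    $train = Invoke-RestMethod -Method Post -Uri "$api/ModelBuilders/deeplearning" -Body (
        'training_frame=train.hex&validation_frame=test.hex&response_column=class&model_id=iris_dl')
    Wait-H2OJob -JobPath $train.job.key.URL | Out-Null

    # Predictions land in a new frame; naming it makes it easy to fetch back.
    Invoke-RestMethod -Method Post -Uri "$api/Predictions/models/iris_dl/frames/test.hex" `
        -Body 'predictions_frame=iris_dl_scores' | Out-Null
    (Invoke-RestMethod -Uri "$api/Frames/iris_dl_scores").frames[0].columns |
        ForEach-Object { '{0}: {1}' -f $_.label, ($_.data -join ', ') }   # predict + per-class probs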
Import, parse setup, and parse; finally I do the prediction, using that same data model, and at the end we get a result back. With all of that said, let's actually run the whole thing now. I'll wipe it out completely... okay, we've wiped it, and we'll rerun it and see if the MSE gets a little better. Splitting the data... training the deep learning model... ooh, that's because the file already exists and was open, sorry... the data is serializing down to predict... and here you can see that it predicted, with 99%, that it is Iris setosa. Remember I said it took me a while to figure out these numbers; these are the probabilities. It was especially that I just kept looking at the e-05 and not realizing how small a number that was. But anyway, that's there, and, as I'll show you a little later when we export these data models, this is how it always returns values to you: it shows you with what probability it thinks the input is each of those things. Pretty cool stuff, right? Feeling good?

All right, the next thing I want to do is look at a more complex data set, and then I'm basically going to run the exact same thing, but with the MNIST data. I love showing what this looks like. The first thing we do with any of these projects is actually look at the data, so here I have MNIST, and I have some code in here called import-and-render. Import-and-render (let me just start it) is a function that imports the CSV file and creates headers, because there are no headers in this data set.

Let me explain what's in the data set first. The first column is a number from 0 to 9; that's the number the pixels represent. The next set of columns are individual pixels: it goes from the upper left all the way across, then repeats row by row, and it tells you the shading value of each pixel. So what I did was write a very quick parser that visualizes what that looks like. Here I've taken the whole data set and I'm getting one sample out of it. The sample looks like this... actually, no, the sample does not look like that... oh, I didn't dot-source it; we've got to do it again. Okay, dot-source, let's let that run.

Anyway, the point is, here you can see that I'm getting one sample just to look at the first value. I'm splitting it so I can look at the value as well as the pixel array, so I've got those two variables right now, value and pixel array. Value is, I think, the number 7, and the pixel array is going to be that list. Here you can see it rendering what the number 7 looks like in ASCII versus the real world; there are way better visualizations out there for this, but this is roughly what they look like: hand-drawn numbers. And the goal here is that this data is not something we can interpret well. Maybe some people can; for example, here's the pixel array that's the number 7, obviously, right? It's just gradient shading on each of the pixels, and we know that equals the number 7. So instead, what we're going to do is train the same way we did with the Iris data.
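A rough sketch of that import-and-render idea (the 28x28 layout and the shading threshold are my assumptions about this CSV variant; the real demo code lives with the talk materials):

    # MNIST CSVs here have no header row: column 0 is the digit, then 784
    # pixel intensities (28x28, row by row). Layout assumed; verify.
    $headers = @('value') + (0..783 | ForEach-Object { "p$_" })
    $rows   = Import-Csv .\mnist_train.csv -Header $headers
    $sample = $rows | Get-Random
    $pixels = foreach ($i in 0..783) { [int]$sample."p$i" }

    "Label: $($sample.value)"
    for ($y = 0; $y -lt 28; $y++) {
        # crude ASCII shading: anything brighter than 128 becomes a '#'
        -join (0..27 | ForEach-Object { if ($pixels[$y * 28 + $_] -gt 128) { '#' } else { ' ' } })
    }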
But it's a much larger set of columns, with not much commonality, and we don't really understand this data set that well, so we're going to apply a neural net to it, because the neural net will figure that out for us; it will do some black-box work and put it all together. The goal, again, is to build a data model where, if I supply a new pixel array, a new image, it tells me what number it thinks that is. That's the goal of what we're doing here.

For this, I'm going to run the training; like I said, it takes about 10 minutes. Actually, let me just make sure we clear things; I want to clean up H2O so it isn't cluttered. Oh, I meant to show you that the data model we just built is in Flow right now, but I just killed it, so I can't show it to you; I'll do it next time, when we do the deep learning one. Okay, there it goes, MNIST. Let me show you what that code does. This is the code we're going to run. It has the two functions, ConvertTo-FormData and Wait-H2OJob. There's also one additional function I wrote for this one, Import-H2OData. Those three import steps, which were really tedious to look at (like 10 lines of code each time), I turned into more of a function, so you can say Import-H2OData, give it the path to where the data exists, give it the name you want to call the data frame (I've been using that word, data frame; I don't know if I explained it, but a data frame is really just a data set, everyone gets that, okay), give it the API URL, and, optionally, supply a list of columns if we know the file doesn't have a column header. It then runs those three steps for us, so we don't have to look at all that code anymore; it's simplified down to this.

So we set up the headers and we import the two data sets. The way this MNIST data works is that they give you two data sets: a train set and a test set. These are open data sets; you can download them, and there are open source licenses for most of them. Kaggle is an interesting place to look for some of these data sets and the challenges that exist out there. But you can download this stuff and play with it yourself. So here I'm going to import this data, call one test and one train, and then do the whole thing again, just with the neural net. That's all it's going to do, so let's run that, and we'll take a little bit of a break while it works.
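While that job runs, here is a minimal reconstruction of the Import-H2OData wrapper just described (the parameters match the talk; the body is my approximation, built on the earlier helper sketches):

    # One call that chains importFiles -> parseSetup -> parse and waits.
    function Import-H2OData {
        param(
            [Parameter(Mandatory)][string]$Path,     # file or URL to ingest
            [Parameter(Mandatory)][string]$Name,     # destination frame name
            [string]$Api = 'http://localhost:54321/3',
            [string[]]$Columns                       # optional headers for header-less CSVs
        )
        $import = Invoke-RestMethod -Method Post -Uri "$Api/ImportFiles" -Body @{ path = $Path }
        $setup  = Invoke-RestMethod -Method Post -Uri "$Api/ParseSetup" `
            -Body ('source_frames=["{0}"]' -f $import.destination_frames[0])
        $setup.source_frames = @($setup.source_frames.name)   # same flattening tweak as before
        if ($Columns) { $setup.column_names = $Columns }      # override the missing header row
        $body = ($setup |
            Select-Object source_frames, parse_type, separator, number_columns, single_quotes,
                          column_names, column_types, check_header, chunk_size |
            ConvertTo-FormData) + "&destination_frame=$Name&delete_on_done=true"
        $parse = Invoke-RestMethod -Method Post -Uri "$Api/Parse" -Body $body
        Wait-H2OJob -JobPath $parse.job.key.URL -BaseUrl ($Api -replace '/3$','')
    }

Usage then collapses to something like: Import-H2OData -Path .\mnist_train.csv -Name 'train.hex' -Columns $headers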
All right, that's going to take a few minutes, so everybody stand up and stretch; I'm not kidding, let's get up. Another thing: this is Mrs. Whiskers. Mrs. Whiskers was sent by my daughter; she insisted that I had to take Mrs. Whiskers with me. So what I'd like to do is get a little picture with all of you guys and Mrs. Whiskers. Everybody raise your hands up, I think that would look nice. Actually, wait, even better: how about everybody, on the count of three, say "Mrs. Whiskers says hi from Washington." One, two, three: "Mrs. Whiskers!" All right, thank you guys, appreciate that.

Now that the blood's moving a little bit, let's talk about some mathematics. As I promised, I'm not really going to do too much on the math side, but let's pull that up. Okay, a public domain image; I love these things, I can actually put them in the presentation. So here's the situation: right now I'm training a neural net against a massive data set on my laptop, so flipping through these images might be a little slow. I should also mention that we've been calling this talk artificial intelligence and machine learning; technically, some of that only applies to the deep-learning-type stuff. A lot of this is just analytics, just mathematics that we've been doing for a long time. What I like about these visualizations, though, is that they explain what we're actually trying to do.

In this example, imagine that the column on the left is our value, the thing we're trying to predict, and the axis on the bottom is the input. So there is a number, and what we want to say is: okay, at 0.5, where do we think the dot will be for that 0.5? What you wind up doing is building these functions. In this case it's a linear function; all it's doing is trying to find the easiest way through the data. This gets really complicated, though, and it's actually fun to go through the mathematics on these, because you'll find things like: imagine if you had an outlier, say a data point up here (that just killed my zoom; okay, let's say there's a data point up here). Imagine what that would do to the line: the whole line would shift up towards it, and you'd have completely skewed results. So there's actual mathematics in figuring out where those outliers are and reducing their effect; all sorts of things apply here. But at the end of the day, all that's happening is we're creating a function and saying: give me the input, and tell me where it's going to sit on that line. That's what a data model actually is, in this context.

Another thing that can be done with this type of model: here's a line, and we could say that on one side of the line everybody has cancer and on the other side nobody has cancer. That's genuinely how it works; these functions can be applied to categorization just as much.

I want to talk about a couple of other things. The line is really easy to understand; it's a little more complicated when you add more variables, basically because how the heck do you put that on a two-dimensional chart? That's not possible. So what I can show you is an article here; I already have it open, so I'm just going to pull it up. That's going to take a second... fortunately it's only ten minutes for that job to run... any day now... okay, there we go. This is a visualization of what it would look like to map out multiple dimensions: down at the bottom you can see two dimensions, and your value is at the top. At the end of the day, these are just mathematical functions (well, they're not always mathematical functions, but in this case we're looking at them that way). You're finding lines that fit, so that when you put the inputs into that function, the point where it lands on the graph is the prediction.
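To pin down what "finding the line" means, here is the conventional least-squares form (the talk never names the exact objective, so this is just the standard setup):

    % A linear model and the squared-error objective it minimizes; the
    % squared term is why a single far-off outlier drags the whole line.
    \hat{y} = \theta_0 + \theta_1 x,
    \qquad
    (\theta_0, \theta_1) = \arg\min_{\theta_0, \theta_1}
      \sum_{i=1}^{m} \bigl(\theta_0 + \theta_1 x_i - y_i\bigr)^2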
It's very easy to understand it from that perspective. This came from a research page at ordination.okstate.edu... that's going to take too long, apologies, but I do have to give the credit... any day now... okay: ordination.okstate.edu. Sorry that took so long.

The next one I'm going to talk about is overfitting. Now that we're drawing these lines, it's easy to see what we were talking about earlier. Here's an example; let's call this the categorization one: the blue dots have cancer, the red ones don't. A bunch of data goes in, and what you're seeing here are two lines, a black line and a green line. The black line is a decent function: it gets the majority of them right, and yes, there are outliers, some blue dots in the middle of the red ones that don't seem to make sense, but you know what, it's fairly accurate, and it's acceptable to predict incorrectly on those. Now, of course, knowing the data set and how it's going to be used directly affects your confidence requirements: if this really were cancer, we would be very, very sensitive about false positives and false negatives. If it's something like looking for anomalies within your logging data with PowerShell, it's a much different story; you're okay with some false positives and false negatives there.

The overfitting idea is this: what if (and some of these algorithms actually do this) the function is drawn so tight that it fits completely around the data set, like this green line? The best example is in the bottom portion right here. Imagine, in that green bubble, a point between the black line and that green boundary. That would likely be a blue dot, right? Chances are, just by looking at this, we can all agree that even if it were directly next to that red dot, it would probably be blue. This model, when it runs against this data set, will be perfect, 100% perfect, but then when I start adding new data points that show up a little differently, it's going to miss them, because it's overfit. It's something to be mindful of as you look at these things. That's why it's really important to split your data into training and validation sets, so you can test it, and you constantly want to randomize those splits, look at them more deeply, and evaluate whether or not you're overfitting, because it's a real problem.

The next one I want to talk about: we've been talking a lot about mathematical functions and models, but there's another type, what's called decision trees. Decision trees are actually pretty simple. This is a GNU-licensed image from Wikipedia that I have to give credit for... oh yeah, when is this job going to finish?... I have to credit this one too, and I'm sorry, it's not GNU:
this one is under the Creative Commons Attribution-ShareAlike 4.0 International license. So with this image I have an example of multiple decision trees. What is a decision tree? Very simply, think of it this way: imagine I'm going to decide what to do with you based on a series of questions. First of all, how much money do you have? Zero to five hundred dollars in your pocket: we don't want to talk to you. A grand in your pocket: we want to talk to you. That's the first decision. Once we find out you have a grand in your pocket, we want to find out how many assets you have (what do you own, what can I take from you), so we find out which buckets you fall into, yes or no, whether you meet these thresholds, and if you're a big whale, I want to go after you and sell you some investments or something. So a decision tree is very simple to put together; we build them all the time, and they're just yes/no logic flows. The challenge is in tuning them: why did I choose $50? What if I chose $49; would it work better? What would that actually look like? There's a whole branch of this mathematics that focuses on that, and the big one is the distributed random forest. This is basically an idea of how you combine multiple decision trees and put them together; you can think of it as kind of averaging them together, coming up with something in between all of them, with some confidence, and it constantly tries all sorts of different combinations of decision trees to get a good feel for which ones are working and which aren't. I just want to highlight that some of these models are not necessarily pure mathematical functions.

Now, the next one is the complex one; this is the fun one. For this one I'm going to reference an image from XRDS, the ACM's magazine for students, from a really good article by Hosny (I have the link in the deck) that talks about what the neural nets are actually doing and how it all ties together. What I really want to show is this image: this is a neuron. I think they pull this from the Stanford Coursera course. The dendrites (everybody knows what a dendrite is, right?) are where you get your inputs: your sensors have inputs, so your ear hears something, it turns into an electrical signal that gets applied to the dendrite, and a neuron dedicated to hearing processes it in the nucleus. It takes that signal, does something to it, and puts it on the wire, into an axon. The axon is connected to additional neurons, so there's a whole chain of signal manipulation that happens in our brains, and what the deep learning modeling is doing is mathematically trying to model some of that.

The really fascinating thing is that there was a research paper published where they took (I can't remember what animal it was) the neuron that was used for hearing in an animal, disconnected it from the ear, and connected it to the eye, and the neuron transformed into a seeing neuron. Which is really magic, right? That's the whole point of these things: they're actually learning. They're learning what the sensor is, figuring out how to process the signal in a way that lets the information be used later, and that's what the neural nets do.

Now, the problem is that they're complete black boxes. Whereas the other models were easy for us to visualize with some charts and graphs, these are harder. This is an example of the neuron as it's drawn in the math, and it's basically the same thing: the inputs come in, I do my processing, and I get something out. Then we look at chains of these; these are called layers, a term you'll hear, and some of these neural nets may have way more than three layers. They may have lots of hidden layers, and you don't know exactly what's happening in there; they're all communicating with the neurons along the way, transmitting the signal and manipulating the data.
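For reference, the standard textbook form of that artificial neuron (conventional notation, not anything H2O-specific):

    % One neuron: weighted inputs plus a bias, pushed through a
    % nonlinearity; a layer applies many of these in parallel, and a
    % network chains layers together.
    a = \sigma\!\Bigl(\sum_{j=1}^{n} w_j x_j + b\Bigr),
    \qquad
    \sigma(z) = \frac{1}{1 + e^{-z}} \quad \text{(one common choice)}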
Honestly, this is probably one of the most interesting areas to watch from a PhD perspective, because there are a lot of people trying to figure out how we can make sense of the data and the metrics coming out of these types of algorithms, to help us understand them better. But that's way above my pay grade.

All right, so that's the neural net, and fortunately I think our neural net has actually completed. Let's take a look: you can see here that it has predicted a whole set of data, and I'm going to pull this up in H2O Flow now so you can see what it did. Remember, I ran the PowerShell code; this is that MNIST data I was running the deep learning, neuron-type training against. So now if I run getModels, I see the neural model that I have, and I can also look at the predictions that were run. This was the one that ran against my test data, so I want to look at it more closely. First, I can see that the MSE was 0.5, which is okay, and I can inspect this data... oh, sorry... oh, there we go... okay, I'm having problems with this; hold on, let me just do it in PowerShell. For some reason I can't find it in Flow right now, but I know it's there somewhere, and I happen to have all of this data here, so I'll just use it here.

The last thing I was doing was looking at the returned frames; this is the output from the last command I ran, which is looking at the predicted data. You can see here, in frames.columns, there's the data. I'll just look at the top 10... sorry, jeez, that was weird, I don't normally do that. Anyway, here are the results, and I'll just open them in Excel, which will take less time at this point, even though normally I'd do this in Flow. So this is the test data that came back. What it was doing is looking at the pixel arrays (oh, I had it loaded in memory, sorry), so here are the pixel arrays again. This is the prediction from the run I just did, and what you can see is: the first number was the number 7, and our model predicted 7.29, which we round down to 7. The next one, 1.6, rounds to 2, so it predicted that the hand-drawn number was a 2. And here, this one worked out pretty nicely: the test data has a 0, and the prediction comes down to 0. So that's what I did: I just bulk-tested this entire test set to see what the predictions look like.
I was just comparing them manually there, even though the MSE already gave me the mathematical summary. At this point I can, through PowerShell, submit a pixel array and ask, "what number is this?" Now you can start thinking about practical applications: think about your logging data, think about anything you work with; the creativity is in your hands at this point. If you know your data sets and you think there's an interesting application, there probably is one. Okay, so that was the neural net. I've got one more thing to show you on the math side, and then I'll demo it. This illustration is under the GNU Free Documentation License, credited to Chire, and what it's showing is a mathematical model finding clusters. This is the Netflix idea again: how do I take all of these data points and find the commonality? The way these algorithms work is that you usually tell them how many groups you're looking for; in this case we're going to do a demo where we bucket the data into three categories, and this is just a mathematical representation of what that looks like. As usual, it gets more complicated with more inputs and you wouldn't be able to visualize it as cleanly, but the idea is sound: the algorithm finds the points at which to break the data into three groups and then calls each of those a cluster. The animation itself actually reflects the algorithm's implementation, in how it shifts the boundaries over time, but we don't need to go into that; it's just a visualization of what's happening. So what does this look like in H2O? I'm going to do this one in H2O Flow rather than PowerShell, because by now everything we do in H2O Flow you can do in PowerShell through those REST methods. I'll pull up an example flow. Also, I don't know if everybody noticed, but I was loading that data in PowerShell and now I can see it in H2O: all the data frames, all of that. Okay, the k-means example; let's load it up. This is also a great example of how you can take these flows and use markdown: you could hand this to somebody and say, "run through this; this is actually how we train these data models." In this case the example gives you the instructions, tells you to run Assist, and then shows you the function to run, so you can either click the Assist Me button and follow along, or call the functions directly. If I call importFiles it gives me that dialog again, but the flow also provides the parameters, so I can simply import the files from here. I'm importing another data set, the seeds dataset; it's small and similar to the iris one, so we're not going to go too crazy on it. Okay, so that data is in there; next we parse it.
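As an aside, here's a hedged sketch of what that import-and-parse pair looks like from PowerShell. Flow's import is really three REST calls — ImportFiles, ParseSetup, Parse — per the H2O-3 REST docs; the file path and frame name below are placeholders, not the demo's exact values.

```powershell
# Sketch of the three-step import that Flow wraps. Depending on your H2O
# version, Parse may also require column_names/column_types copied from $setup.
$base   = 'http://localhost:54321'
$path   = 'C:\data\seeds_dataset.txt'   # hypothetical local copy of the seeds data
$import = Invoke-RestMethod -Uri "$base/3/ImportFiles?path=$path"
$setup  = Invoke-RestMethod -Uri "$base/3/ParseSetup" -Method Post `
          -Body @{ source_frames = '["' + $import.destination_frames[0] + '"]' }
# Feed the guessed setup back into Parse to materialize the real frame
Invoke-RestMethod -Uri "$base/3/Parse" -Method Post -Body @{
    destination_frame = 'seeds.hex'
    source_frames     = '["' + $import.destination_frames[0] + '"]'
    parse_type        = $setup.parse_type
    separator         = $setup.separator
    number_columns    = $setup.number_columns
    check_header      = $setup.check_header
    delete_on_done    = 'true'
}
```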
We run setupParse and then parseFiles; I'm just using the functions the example provided. The interesting part is building the model: when I hit Build Model and choose k-means — the clustering algorithm — there are a couple of parameters to note. We're telling it to ignore one column, C7 — actually, I think it's C8 — because that's the column where the data is already categorized, and K is set to 3, meaning we want three buckets, three groupings out of this data. There are a couple of other settings here too, but I'm just going to run it and see if my change breaks anything, because I've never tried that. The model exists, and I can now predict against it. The important thing is what the output looks like: it shows me that it created three centroids, which are our clusters, and how many rows fall into each one. I can look at this data — give me a second to find it. There are some statistics about the data, and then here's the actual data, and it's basically doing the same thing we've seen: predicting. The difference is that whereas in the past we were training with known labels — with the iris set it was the species, with the MNIST data it was the digits — in this case we don't know what the groups are; they're just 0, 1, and 2, so everything comes back as a 0, 1, or 2, which simply means those rows are grouped together. Now I can take a new set of data, run another prediction against it, figure out which group it belongs to, and then go back, look at that group, and tell the end user, "hey, these are the ones you associate with." Make sense? All right, we're good on the k-means.
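For reference, here's a hedged sketch of driving that same build from PowerShell. The /3/ModelBuilders/kmeans endpoint comes from the H2O-3 REST docs; the frame name and the ignored column are assumptions based on this seeds demo.

```powershell
# Hedged sketch: the same k-means build, driven over REST instead of Flow.
Invoke-RestMethod -Uri 'http://localhost:54321/3/ModelBuilders/kmeans' `
    -Method Post -Body @{
        model_id        = 'seeds_kmeans'
        training_frame  = 'seeds.hex'
        k               = 3                # three buckets, as in the demo
        ignored_columns = '["C8"]'         # the pre-labeled category column
    }
```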
We're in the home stretch; it's 10:03. So we've done unsupervised learning, and the next thing I want to show you is the grid search. Grid search, as I mentioned before, is the idea of taking one of these complex algorithms, not knowing the exact parameters but having an idea of what we want to try, then iterating through all the combinations and comparing the resulting models. Let me restart H2O so we have a clean instance. We're going to do exactly what we did before: quickly import the iris data, set up the parse, then parse. With the iris data imported, I split it in two — 0.9, which we'll call train, and 0.1, which we'll call test — and now I've got my two frames. Now let's try the grid search. I take my training data and say I want to build a model, as we normally do. I'll pick distributed random forest (DRF) just for fun, because we haven't done one of those: select my training frame, select my validation frame, set my response column to class — the same things I normally do — and this is the point where I'd typically just hit run and create the model. What you see over on the right, though, is a grid question mark with checkboxes: anything that supports a grid search has a checkbox. For example, take number of trees: when I check it, notice that a semicolon appears after that 50, which means I can put more values there. So now I can say I want to try 50, 100, 150, and 200 trees, and let's make max depth a grid search too — we'll try 20 and 40. I think that's enough as an example of what's going to happen, but I could do this with all of these parameters. As I mentioned, there's a newer technique people are using, if you've never heard of it: Bayesian optimization for parameter tuning, which is really, really interesting. A managing director from Two Sigma gave a talk on it at QCon in New York two years ago that's online and highly recommended. He goes into the mathematics of how it works, and his prediction is that basically anything that accepts parameters will have this technology embedded in it over time. The practical examples he gave beyond this were tuning your Java parameters and tuning your cloud templates — the options you submit to Azure or AWS — actually working out the ideal, optimal VM configuration and so on. It's a really interesting space. But this isn't that; all grid search does is bulk iterate. I gave it a few options, and it's going to try this combination, then that combination, and so on. Let's run the build. You can see it's taking a little longer than the other one, because right now it's building a bunch of different models, scoring them, and trying to figure out which is the best. It's finished, and I can see it actually created eight data models. Now I have to figure out which one I want to use, so I can look at a few statistics: there's a scoring history, so I can see what the RMSE was for each one, and it's telling me which one it thinks is best. So then I can say, all right, let me predict with that one against my test data directly and see how well it performs — and here you can see it's performing fantastically: against my test data the MSE is 0.000, so we finally got a really good model for this data set. That's the point; I just wanted to show you that grid search works. Again, to do this part from PowerShell, you'll need to figure out all the parameters you want to try, inspect the network traffic, look at what the POST data is for those grid searches, and then apply it. Yeah — it had zero error; it's pretty incredible.
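As a starting point for that inspect-the-network exercise, here's a hedged sketch of what such a POST might look like. H2O-3 exposes grids under /99/Grid/&lt;algo&gt;, but treat the exact body shape as an assumption and verify it against a request you capture yourself.

```powershell
# Sketch of a DRF grid-search POST; frame and column names match the iris demo.
Invoke-RestMethod -Uri 'http://localhost:54321/99/Grid/drf' -Method Post -Body @{
    training_frame   = 'train'
    validation_frame = 'test'
    response_column  = 'class'
    # the semicolon-separated values from Flow become a JSON hyperparameter map
    hyper_parameters = '{"ntrees":[50,100,150,200],"max_depth":[20,40]}'
}
```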
Okay, that was grid search. The next thing I want to talk about is productionizing these data models — yes, POJO. Each of these models can be serialized out of the system, and I'm going to take the one we just built. First I'll say, get me my models, and we'll take the one that was pretty good. I can say Download POJO; POJO stands for "plain old Java object," and it's basically a Java class you can embed in your application — or, as I'm going to show you, convert, using the Steam interfaces, into a WAR file, turn that into a web service, and then use it from PowerShell without needing H2O at all. Download POJO grabs the Java file, and I'll show you what's in there in a bit, because using it manually is kind of cool. The other thing you need is the gen-model jar, which you download from H2O — each version has its own; it's basically just a jar file — and I've already downloaded it. So in my POJO directory I've got the h2o-genmodel jar and the POJO file I just downloaded; the one we're working with is grid_6466f4, and I want to remember that. Next, I mentioned there's an open source project, now deprecated, called Steam, which you can fork and build, and there's one very specific folder in it — let me show it in GitHub — the prediction service builder, which lets you upload a POJO plus the jar file we just downloaded and spits out a WAR file. I can get this to work on Windows — I've had it working in the past — but right now I couldn't get the classpaths to work, so I cheated and I'm doing it in the Ubuntu shell. You have to grab it, download it, and run Gradle commands to build it; once it's built, there's a WAR file in there — you can see the ROOT.war — and then you run java -jar with one of these jetty runners (I honestly don't know the differences, so I'm running the one that looks most recent) and pass it that ROOT.war file. The reason the project is deprecated, by the way, is that they've turned it into more of a commercial product; they left the open source one, but most of the magic they're doing now is in the paid version. There we go — it was just the version of jetty runner I was using — and it's running at localhost on port 55000. I select the POJO here, passing in the POJO file we created, select the h2o-genmodel jar I downloaded, and hit Build. What it actually does is create a WAR file for me; it takes a few seconds, and now it's downloading the WAR — I have no idea why 8 megs is taking this long. Okay, we've got the file, so I'm going to take that WAR file and copy it into my POJO directory.
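Roughly, the build-and-serve steps look like this — a sketch only, where the paths, the port, and the jetty-runner jar name are all assumptions; use whatever your clone of the Steam repo actually contains.

```powershell
# Rough command sequence for the Steam prediction-service-builder step
# (run here from a shell; he used an Ubuntu shell for the build).
cd ./steam/prediction-service-builder
./gradlew build                                    # produces a ROOT.war
java -jar jetty-runner.jar --port 55000 ROOT.war   # serve the builder UI
# then browse to http://localhost:55000, upload the POJO .java file plus
# h2o-genmodel.jar, and click Build to get back a deployable .war
```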
Okay, we made it: I've got this grid_6466f4.war file, so I'll open another shell, and this time — within Windows — I'm going to call the jetty runner from the Steam directory, where I downloaded and built Steam, against this WAR file directly: java -jar, the jetty runner, and then the path to the WAR file. Let me also make sure I kill the other service that was doing the conversion, just in case we clobber any ports. Now I run this service; it listens on port 8080 by default, so let's check out what it looks like. This is really cool: at localhost:8080 I now have a web interface. I can say my sepal length is 3.0, my sepal width is 1.5, and the petals are 0.1 and 0.03 or some weird values like that, and now I can hit Predict. This is not running in H2O anymore; this is just a web service I pulled out as a WAR file, and you can see it comes back with my data: it says setosa with 0.879 probability, and there's a very small chance, less than 1%, that it's iris virginica. The other nice thing is that I can craft these URLs myself — this is a complete REST service — so I can take this and call Invoke-RestMethod in PowerShell. The reason I show this demo is that I like to think about the pipelines you need to productionize these things: you could take that jetty-runner conversion step, make it part of your build process, build the model into the service, and then run tests against it. Anyway, here's the Invoke-RestMethod call against it — we're going to need quotes, because there are ampersands and all sorts of nonsense in there — and you can see it comes back with the exact same data. So now I can access it within PowerShell; I can access it from anywhere.
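A small sketch of that call — the standalone POJO service, with no H2O instance involved. The parameter names mirror the iris columns, and the response's field names vary by builder version; both are assumptions here, so check the query string your service's own page generates.

```powershell
# Quote the URL because of the ampersands in the query string
$uri = 'http://localhost:8080/predict?sepal_len=3.0&sepal_wid=1.5&petal_len=0.1&petal_wid=0.03'
Invoke-RestMethod -Uri $uri   # returns the predicted label and class probabilities as JSON
```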
It would be kind of interesting if we could figure out POJO-to-PowerShell or POJO-to-.NET — I think there were some projects people were working on, but it's way too complicated. What I will show you is what the POJO looks like, because the other way to use it is to quickly compile a Java application that loads this Java class and calls the function in it. This took me a long time to figure out, so I'm very proud of it. The POJOs are enormous amounts of code — crazy numbers of functions, weird identifiers, divisions everywhere — and this is the magic: this is the mathematical model, or the algorithmic model, I suppose. The point is that there's a score0 function, and that's all you have to call. If you load this class and build a Java application around it, score0 takes a reference to a data structure — a list of the inputs — plus a reference to an array for the outputs, and when you run the prediction it loads the results into that output collection. This is the place where it showed you that 0.9999 earlier. The other thing that's kind of strange about it is that it doesn't tell you which class it picked: you're forced to know that the order in which the classes are listed — if I remember correctly, the order you see them in the file — is the order they come back in. The score0 function doesn't give you that information; it just gives you those three numbers, and you have to interpret them. Cool stuff. All right, let's see what we have left — AutoML, the last one. We're really burning through this; I swear this list took me an hour forty-five last night. So, AutoML. For this one I do want to start from a clean instance again, because it's going to generate a ton of models. As I mentioned, this is the interface for when you don't know which algorithm to use and you're not sure what to even try: run the AutoML and see what it comes back with. It will take some time — even against the iris data this can take a good hour or so — but there are parameters to help it exit early, and you can always kill it and look at the models generated so far, which is what we're going to do. There's a separate function called runAutoML. Let me load the data again — importFiles, set up the parse, parse — okay, the iris data is imported. This time I'm not going to split it, just to make things a little faster. runAutoML takes a training frame and the same things the other builders did: which column is the response, and do you have a validation frame — in this case I don't, because I didn't split, which may make things look a bit weird when I examine the results, but that's all right. There are some additional settings, like early stopping — you can say "only run for this amount of time" — so you can make sure it doesn't get unwieldy and out of control, and then I just hit Build Model. Now it's trying a series of algorithms, in parallel, and if I had Spark behind the scenes and were running Sparkling Water, it would be distributing this across my whole cluster: everything out there would be training models, testing, validating, and deciding which one is actually best. Again, because I didn't provide validation data, it's not really going to be a fair comparison. Since this happens in parallel, I can watch the models as they're generated — these are all the models it's trying, which is kind of cool — and I can look at any of them individually, compare them, and see how they do out of the box. There's more you can do with this if you really get into it: as you lose the ability to visualize mathematically how a data model works, you can apply mathematics to the various models themselves to visualize which ones are working best, which is kind of fun. That's where data scientists spend their time: tuning these things and making sure they have it right.
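If you'd rather kick off the same AutoML run from PowerShell than from Flow, here's a hedged sketch. The /99/AutoMLBuilder endpoint and the JSON body shape follow the H2O-3 docs for recent builds; treat both as assumptions and capture Flow's own POST if your version differs.

```powershell
# Sketch: start an AutoML run over REST with an early-stopping time budget
$spec = @{
    input_spec    = @{ training_frame = 'iris.hex'; response_column = 'class' }
    build_control = @{ stopping_criteria = @{ max_runtime_secs = 600 } }
} | ConvertTo-Json -Depth 5
Invoke-RestMethod -Uri 'http://localhost:54321/99/AutoMLBuilder' `
    -Method Post -Body $spec -ContentType 'application/json'
```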
The next things that come into this world are weirder: what happens when I introduce new data sets that completely blow away my old model, so that it no longer makes sense and is no longer valid? And how do I productionize these things properly — if I'm putting them through a CI/CD system, how do I test this? If I make a change to one of these models, that can have serious impact depending on what the code is actually doing. For most of the PowerShell use cases, hopefully not much, but if any of you get really creative and do some interesting practical things with this, you may want to think about those issues. Okay, that's going to keep running, so I'll just kill it for now and we can take a look. I'll cancel it — it's up to the GBMs; it's going to try pretty much all of the model types that are out there. I can now look at the results, and it tells me which ones are best. Unfortunately, because I didn't have the validation set, we can't really tell which of these are actually working; they're all just overfit to the training data at the moment. You can also dig into a lot of log information, and for each of the models you can see the parameters that were set: right here I can click a model's parameters and see exactly what was used when it was created. So now I can go find the ones I liked, and in the future, rather than running this whole AutoML thing again, I'll just run those in a grid search and see which one comes back the best.
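Here's a small sketch of that model survey from PowerShell — enumerating everything AutoML produced (this uses the standard /3/Models endpoint) so you can pick candidates for a later grid search:

```powershell
# List every model on the cluster with its algorithm
$models = (Invoke-RestMethod -Uri 'http://localhost:54321/3/Models').models
$models | ForEach-Object { '{0}  ({1})' -f $_.model_id.name, $_.algo }
```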
That was a lot of information and we used up a lot of time, so the last thing I want to do is leave you with resources; there are a few out there that are worth your time. First and foremost is the h2o.ai portal — that's their documentation — and there's the H2O algorithm documentation I showed you the link to, which covers basically all the parameters and what you can set. Then there's my blog post with the code samples; I'll be posting this talk's material to that blog, since it looks like there's no central place for it, and right now you can grab the iris and deep learning scripts straight from there. I think there's an MIT license on them — I can't remember where I stored them; I think they're in git — so feel free to use them any way you see fit. The post walks through a lot of what I talked about today — the REST API, how we build the models, how you use them, and so on. It doesn't go terribly deep into everything, but it's a reference example to get yourself started. I also highly, highly recommend the Stanford Coursera course if you want to understand this a little better; it will have you implementing the algorithms from scratch, and in some cases it's a lot of matrix math. What I will say, if you do take that course: the teacher is Andrew Ng — the guy behind the famous experiment that ran categorization algorithms against YouTube videos to find cats — and he tells you very clearly to do all the homework, except that when you get to the neural net section, don't feel obligated to do that homework. I spent two and a half months working out the mathematics and trying to get the implementation right; I did get it to work, but looking back at it while preparing for this talk, I have no idea what I did. So I'll second what he says and recommend skipping that part — there's great material in the neural net section, with videos describing how the math actually works, but don't do the homework unless you're a masochist like I am. This next one is new to me: apparently H2O has put out a Coursera course. I haven't tried it yet, but since it exists, we might as well use it. And then there's an O'Reilly book, Practical Machine Learning with H2O; the code samples there are usually in Python and R. One thing to keep in mind when you look at code samples in Python and R is that they don't map directly onto the H2O Flow model, because the SDKs and libraries have abstracted all of that away — you see a single import-file function instead of the three calls we've been making over and over today. And with that — my Twitter handle is up there — I actually have time for questions. I don't know if I'll have answers, but all right; I guess I'm giving you back fifteen minutes. Thanks, guys.
Info
Channel: PowerShell.org
Views: 1,266
Keywords: powershell, windows powershell, techsession, powershell summit
Id: M0OYzJT6uLk
Length: 85min 45sec (5145 seconds)
Published: Fri May 17 2019