Scalable Automatic Machine Learning in H2O

Okay, so let's get started. This talk is called Scalable Automatic Machine Learning in H2O, and we're going to talk a little bit about what automated machine learning is — what that term means — and then we're going to show you how to do it with H2O. By the way, my name is Erin LeDell. I work at H2O as a machine learning scientist, which means I'm making the algorithms that you see here, and the AutoML project is the main thing that I work on now. So this is all stuff that I've created, along with two other people I work with on this, Ray and Navdeep, and it's our solution to automating machine learning.

I'm going to briefly tell you what H2O is, because that's the framework we're working in today. H2O is the name of a company — the company that I work at — and it's also the name of the platform that we put out. H2O the company has been around for about five years, and our main product is this open-source machine learning platform, although we're starting to do other products as well; we're in Mountain View, California. The software (I'll go into a little more detail later) is open source, so you can very easily download it and use it on your laptop or on a cluster, and if you're familiar with any of these languages — R, Python, Scala, Java — then you can use H2O. And if you're somebody who'd prefer to work with a GUI, a graphical interface, we also have a web interface for H2O, and you can do basically all the same stuff there.

The whole point of H2O is that back in the early days of big data — the early 2010s — there were all these new big data platforms (Hadoop was a big one at the time, and a couple of years later Spark became very popular), and there weren't good machine learning libraries for them. All of the machine learning libraries available at the time were things you'd run on your laptop, or on a single machine, where everything happens in memory on that one machine. So if you have a data set that's, say, a billion rows — that might be a big data set — you maybe can't fit a billion rows into the memory on your laptop. What you have to do is distribute that data across a bunch of machines, and that's what H2O does: you give it your data set, it knows how to distribute the data across machines, and then it knows how to do machine learning across that distributed data. The real technological feat that H2O accomplishes is doing machine learning without having the whole data set in memory all at once — you have to change the way the algorithm works so the machine learning part is still possible — and that's what we do. The other focus, because a company made this software, is that our customers are companies that pay us for enterprise support, so they care a lot about putting these models in production. Once you've trained a machine learning model, what do you do with it? Many of these other open-source libraries don't have a good solution for taking a model and putting it in production.
So another thing H2O focuses on is productionizing the models.

These are three scientists that we work with. If you're at all familiar with the machine learning space, the first two wrote a book called The Elements of Statistical Learning. There's another book — I forget the name of it — that's the more introductory version of that one, and it's a free PDF, so if you go to their website you can read it. That's one of the main textbooks in machine learning, and these are the authors; they're at Stanford. The third, Stephen Boyd, is also in the machine learning community.

Okay, so that's a little bit about what H2O is and what the company is. What we're going to talk about today is these things: what automatic machine learning is and what that means; one particular subset of AutoML called Bayesian hyperparameter optimization (try to say that fast three times — it's really hard to say); and then random grid search and stacked ensembles, which is the approach we've taken in our AutoML. You can combine all of these things; they're just different techniques for automating the training of machine learning models to get a good result. Then I'll introduce the H2O machine learning platform at a little deeper level, and I'll show you how to use it — if you brought a laptop, I'll point you to where the code is and you can go through it with me if you like.

Okay, so what does AutoML mean? There are lots of different pieces of AutoML, but the real goal is to reduce the amount of work the human — the data scientist — has to do. There's a lot involved in training models: you have to have some idea of how the algorithm you're working with operates, what all the important hyperparameters are, and what values they need to be set at. Basically, when you train a machine learning model, you can imagine a big board with a lot of knobs on it, and you have to keep turning those knobs to different values until you hit the precise setting that makes the best model. It all depends on the data set that goes in: the knobs will be different, and the best values of the knobs will be different. So part of what a data scientist does is either manually tweaking those knobs based on their own knowledge of what worked well in the past, or using something called a grid search, where you write down a big list of values, try every combination of all the different numbers, and pick the best model that way.

That's the goal: to have a process that can go off and automatically tune these knobs for you, either because you don't have the knowledge to do it on your own, or because you just don't want to do it — and I can tell you that after you've done it a million times, it's not fun anymore. Around the 900th model you've trained, you think: I just want this to do itself; I've learned everything I need to know about this type of model; I'm over it. It'll also save you a lot of time — if you're a data scientist, you can spend that time doing other things.
Or you can use this as a first pass: it'll get you a pretty good model, and then if you want, you can fine-tune it later. So it's not just for people who don't know how to do machine learning; it's really for people who appreciate the value of automating things. I basically built this tool for myself — I got sick of writing all the code a hundred million times — so I thought about what my process is when I'm trying to get the best model, all the things I try and how I try them, and wrapped that into one big function that does it automatically.

All right, so that's the goal of AutoML: automatically find a good model. There are different parts to that, and you can break it down into three major categories. Data preprocessing is something you have to do in machine learning anyway, not just in AutoML, and we can automate some of it. Some of these words might not mean anything to you if you're not super familiar with machine learning. There's imputation, which just means that anytime you have missing values in your data, you have to deal with them — some algorithms can handle missing values automatically (tree-based algorithms have ways to essentially work around them), but most of the time you have to fill the missing values in. Then there's something called one-hot encoding, for categorical data. If a column in your data represents a category — say, the type of animal, with values cat, dog, or pig — then in H2O you can give that to the model as-is and it will do something under the hood, and what it does is one-hot encoding: it takes the categorical column and expands it into numeric columns. If there are three categories, you end up with three binary columns — one for cat, one for dog, one for pig — and if the value was pig, the numbers that go in are 0, 0, and 1. You're asking: was this a cat? 0 means no. Was this a dog? 0, it was not. Then you get to pig and you put a 1 there. It's a way to turn categorical data into numeric data, because machine learning algorithms understand numbers — they don't understand abstract concepts. These are some of the things you have to do to get the data into a format the algorithm can understand; the good news is that H2O does all of this for you, so you don't have to learn it if you don't want to (in a lot of other libraries you have to do it yourself, and that gets old pretty quickly). And then standardization: for some algorithms you have to normalize your data. These are all things you have to do regardless of whether you're doing AutoML or not — a small sketch of the one-hot encoding idea follows.
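Here's a minimal pandas version of that cat/dog/pig example — just to show the shape of the result; H2O does the equivalent for you under the hood:

```python
import pandas as pd

# One categorical column becomes one 0/1 column per category.
df = pd.DataFrame({"animal": ["cat", "dog", "pig"]})
print(pd.get_dummies(df, columns=["animal"], dtype=int))
#    animal_cat  animal_dog  animal_pig
# 0           1           0           0
# 1           0           1           0
# 2           0           0           1
```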
Okay, then there are other things you can do. There's something called feature selection: maybe you have a lot of columns representing things you've measured in your data set, but most of it is noise and might just confuse the model if you put it all in, so you trim it down to the good columns. Generally I wouldn't recommend doing that — I like to have the algorithm handle it on its own — but feature selection is the term for subsetting the features down to the good ones. Feature extraction is a slightly different term: given your original data set, can you extract some new knowledge from it? There are methods for that. All of these are things that enrich your data and make for better models. Then there are some other techniques called count, label, or target encoding — I'm not going to go into what those are, but they're another way to get more value and more predictive power out of your categorical features.

So there's all this data work you can do to help your models to begin with. Then, in any AutoML process, you generate a whole bunch of models. That's part of what a data scientist does: you don't know in advance what the best model is going to be and train it once; you go through a whole bunch of different models and eventually pick the best one. That's what I mean by model generation. There are different ways to do it. One is grid search: if your model has, say, three hyperparameters — knobs — you might do a grid search over those three knobs, trying all the different combinations of the values. Iterating over all those combinations is a grid search, and at the end you end up with a bunch of models. Say you have three options for one parameter, two options for another, and four for another: the total number of combinations is three times four times two, and that's the total number of models you'll end up training. At the end you pick the best model of the group. You're finding the best parameters by trying every combination, but that's very computationally expensive — you could try every possible model, but then you're training a million models and it takes forever. So we try to think of smarter ways. One is to do exactly that but randomly sample the combinations, so you get a random sample instead of the whole exhaustive search; that's called random grid search, and there's a small sketch of the difference below.
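A minimal sketch of full grid search versus random grid search — the parameter names here are made up for illustration, and train_and_score stands in for actually fitting a model:

```python
import itertools
import random

# A hypothetical grid of three "knobs": 3 x 2 x 4 = 24 combinations.
grid = {
    "max_depth":   [3, 5, 7],
    "learn_rate":  [0.01, 0.1],
    "sample_rate": [0.6, 0.7, 0.8, 1.0],
}
combos = list(itertools.product(*grid.values()))
print(len(combos))  # 24 -- a full grid search would train all of them

# Random grid search: only train on a random sample of the combinations.
random.seed(1)
for values in random.sample(combos, 5):
    params = dict(zip(grid, values))
    print(params)  # in real life: train_and_score(params)
```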
Another technique is called Bayesian hyperparameter optimization, and the gist of it is that you start off by training a model and measuring its performance, and then you use some other model that tells you the next set of parameters you should try. You try those, report back to the other model — "I tried your parameters; here's how they performed" — it gives you a new set, and you go back and forth. Yes — question? So "Bayesian" is kind of a fancy word for having prior knowledge about something. The way it's used here, it just means you're using the information about the previous models you trained, and how they performed, to inform your next decision. That's why it sounds fancier than it is: you look at some previous results, incorporate them into a model, and use that to make new choices. I'll talk a little more about that in a minute.

Another thing you can do is what's called tuning a model on a validation set. In machine learning you usually have a training set, which trains the model, and a test set, where you test the model; but there's also a third set you can use just for what we call tuning the model — and tuning just means playing with the knobs until it looks good. The way you use the validation set is: you turn a knob a little and measure how the model did on the validation set, then turn the knob the other way and measure, and measure again. In the process you kind of use that data up, so you have to set it aside — you can't reuse it to test the model, because you've already incorporated it into your modeling choices. When you test the model on the test set, that's data it has never seen before, so the measured performance is an honest evaluation of your model. (In H2O, getting the three-way split is one line of code — sketched below.)
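A hedged sketch of that three-way split with H2O's Python API — the file name is a placeholder; any CSV with a response column works:

```python
import h2o

h2o.init()  # starts (or connects to) a local H2O cluster

# Hypothetical file name.
df = h2o.import_file("fraud_claims.csv")

# 70% train, 15% validation, 15% test, in one call.
train, valid, test = df.split_frame(ratios=[0.70, 0.15], seed=1)
```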
The last thing is that once you have a bunch of models, one thing you can do to improve performance is make an ensemble: you combine the models together into a sort of group-think. It's like those shows where you can ask the audience — the audience members might not be very smart individually, but collectively they give a good answer. That's ensembling: individually most people don't know the answer, but if you ask enough people and average the results, you get a good one. Averaging is just one way to combine models; there are others. One of the ways we'll talk about is called stacking — it's also called super learning, but that's primarily a term used at Berkeley, so if you hear that word it probably means you're at Berkeley, or with people from Berkeley who are now at Harvard; stacking and super learning are the same thing. There's another type of ensemble called ensemble selection, which is a different way of doing it — essentially a greedy approach where you keep adding models until the performance stops increasing, and then you stop. Stacking, by contrast, takes all the models and learns the best combination: if there are bad models in there, it should learn to ignore their input. It's just a different way of arriving at an ensemble.

Does anyone have any questions so far? This might be a lot if it's your first machine learning talk. Yeah — that's a good question. I think it can depend on your situation — a huge number of models versus a small number. If you have only five models, you might end up with similar results either way. In any machine learning process, it's probably good to try both and see what works best for your data. There's not as much software that does ensemble selection, and in machine learning a lot comes down to availability — what's easy to use and what exists. We've implemented stacking, and maybe a little later we'll do the other one as well. So probably: if your software has it, try it; if not, you probably don't want to write it up yourself, unless that's the kind of thing you like to do. [In response to an audience comment:] Yeah, so that's another technique — oh really? Very interesting. You can do all sorts of things. Sometimes your data is a mishmash of several data sets together: say you run a web company and you have user data — all the data from one user is more similar to itself than to other users'. Sometimes people create models on subsets of the data — individual per-user models — which then maybe feed into an overall model. Usually when you end up doing something like that, it means the rows of your data are not what we call i.i.d. — independent and identically distributed observations. That's something to look out for. I would always go to stacking, because that's what I'm more familiar with; that's not necessarily the right answer, but it's what I would do.

Okay — so these are all different things you can do, and the goal is always to get the best model. Actually, that's not always the goal: sometimes it's to get the most interpretable model — maybe you just want a model you can understand, like a decision tree or a linear model — so I shouldn't assume. But typically your goal is the best model. Sometimes you have operational constraints: if you're running some sort of web business and need sub-millisecond prediction time, you might not choose a big ensemble, because you have to generate predictions from each of the individual models first and then combine them, which takes longer. For most people and most use cases, you're not in a situation where that timing matters so much; it's just certain productionized use cases where you might go a different way.

All right, so we'll talk about Bayesian optimization of hyperparameters. This is probably a little technical — I'm including it in this talk because it's an important piece of AutoML, but if you're new to data science you can tune out for a minute, or try to follow along; it's a little complicated. Here's a fairly formal definition of what it does: you develop a statistical model that maps from the hyperparameter space to what's called the objective function. In machine learning you always measure performance with some loss function or error function; the formal term is an objective function, which you minimize or maximize — mean squared error, for example, is something you might measure a model's performance with, and which metrics apply depends on the type of problem, binary classification versus regression and so on. So basically what you're creating is a function that takes hyperparameter values as input and predicts what the performance of the model would be. That would be the ultimate thing to have, because then you could just plug in settings and have it predict the outcome — you wouldn't actually have to do the model training part. That's the goal; a toy sketch of the loop follows.
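This is just an illustrative sketch of that loop, using a Gaussian-process surrogate from scikit-learn — not H2O code. The objective function here is a stand-in for "train a model with this hyperparameter value and return its validation error":

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Stand-in objective: pretend the best hyperparameter value is 0.3 and
# that each evaluation (a model training run) is expensive and noisy.
def objective(x):
    return (x - 0.3) ** 2 + 0.05 * np.random.randn()

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(3, 1))             # a few random initial trials
y = np.array([objective(v[0]) for v in X])

candidates = np.linspace(0, 1, 200).reshape(-1, 1)
for _ in range(10):
    surrogate = GaussianProcessRegressor().fit(X, y)   # the "other model"
    mu, sigma = surrogate.predict(candidates, return_std=True)
    # Pick the candidate that looks best (low predicted error) but is
    # still uncertain, then actually evaluate it and report back.
    nxt = candidates[np.argmin(mu - 1.96 * sigma)]
    X = np.vstack([X, nxt])
    y = np.append(y, objective(nxt[0]))

print("best hyperparameter value found:", X[np.argmin(y)][0])
```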
Then there are different ways you can do it — these are some of the names of the approaches, so if any of this looks interesting you can dig into it. Sometimes people call this sequential model-based optimization, because, if you remember how I described the process — you train a model, report the results to this other model over here, it tells you something new to try, you try that, you report back — that's a sequence of events. This is not happening in parallel, and that's probably the main drawback of the method: you can't just take a big cluster, try a bunch of models at once, and be done; you have to wait for them to train in sequence, so it takes longer. But otherwise it will get you to the best model in fewer iterations than a massive grid search would. The benefit of a grid search is that the models are independent of each other, so you can train them all at once and be done in one step if you have enough computing power. Depending on your setup, you might choose one approach over the other. The models you send the results to are what we call surrogate models — the surrogate figures out what to try next — and there are different approaches for it; the way I'm describing it is general enough to cover all of them. And this is just a technique people use to train their models, so it could apply to anything — if you have software that does it, you can use it on your own models.

On the next slide is a list of some of the software that does this. If you're an R person, there's a very new package called mlrHyperopt, which is a higher-level wrapper around another one called mlrMBO. Also in R there's a package called caret — C-A-R-E-T, which stands for classification and regression training, I think — and another package called mlr. Those two are like the R equivalents of scikit-learn: a package with a whole bunch of algorithms in it and a common interface to all of them, so you can try a bunch at once, with grid search and model evaluation built in — toolkits, essentially. The first two I mentioned are built on top of the mlr package; I don't think there's anything like that for caret — I could be wrong, but I haven't seen anything. In Python there are a whole bunch of different ones; some are good, some are really slow — again, one of the issues with this approach is that it can be very slow, and I've heard Spearmint is quite slow, so maybe try one of the others. There's a Java tool as well. And there's a SaaS tool called SigOpt — a San Francisco startup with an API service: you train your models locally, report back the results, the SigOpt service figures out the next best set of parameters to try, and you do that loop against the API. Obviously you have to pay for that, so if you just want to try this out, start with one of the other options.
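To make this concrete, here's what using one of the Python libraries in this space feels like — hyperopt is a widely used option, though I should note the talk doesn't name it specifically. The objective is again a stand-in for training and scoring a model:

```python
from hyperopt import fmin, tpe, hp

# Stand-in for "train a model with this learning rate, return its error".
def objective(lr):
    return (lr - 0.3) ** 2

best = fmin(fn=objective,
            space=hp.uniform("lr", 0.0, 1.0),  # the search space
            algo=tpe.suggest,                  # the surrogate strategy
            max_evals=50)                      # the sequential budget
print(best)  # e.g. {'lr': 0.30...}
```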
So that's an overview of that technique — not very in-depth. Now we'll talk about the other approach. The goal with Bayesian hyperparameter optimization is to get the best model for a single algorithm: the best random forest you can get, or the best GBM. The goal isn't really to get the best model in general — which could be an ensemble or something else — just the best version of that single model. But on the flip side, you can often get the same performance with a different technique: training a big random grid search and ensembles together. This is the approach we take in H2O. It's what I decided to build, because it's all stuff we already had in H2O — start with what you have, and maybe someday we'll add some of the other stuff to improve the AutoML. So the goal was first to build the best AutoML with the tools I already had in H2O: we have random grid search, which you can apply to any model, and we also have what are called stacked ensembles — stacking, or super learning.

Why do these two things together? Stacked ensembles do well when they have diverse sets of models. You might have heard that in tech there aren't enough women, and you get these very homogeneous teams of men — and there have actually been studies showing that if you diversify your team, you get better results. Same idea here: it's always a good idea to have diversity, in all ways. Ensembles do well when they have diverse inputs, because the idea is that when one model isn't doing well, the others make up for it. You don't really know in advance what your data is going to look like or what type of model will be best for it, and it might even change within the data set itself: some rows might be predicted better by a random forest, some rows better by a deep neural net. If you work with only one model, you're choosing the model that does best on average across all the rows; with a stacked ensemble, you can effectively use the best model on a per-row basis — that's roughly what happens internally. So the more diverse the models you include, the better, and one way to get diversity is simply to train random models. By random models I just mean: pick random hyperparameter values, train a model, pick another random set, and try a whole bunch of things. Individually, maybe those models aren't that great — maybe one, or a couple, out of the bunch are really good — but it doesn't matter: you put them all in the ensemble and let the ensemble figure out what to do.

More formally, what's going on is that ensembles perform well if the models themselves are good and they make uncorrelated errors. If they're different, diverse models, then what's really going on underneath is that the errors they make are uncorrelated. If you've taken any statistics class you might know what uncorrelated means, but it's not really important; you just have to know to put in a whole bunch of different types of models — random forests, GBMs — and, even within the GBMs and the random forests, to try a whole bunch of different parameter settings. (A tiny demonstration of why uncorrelated errors help is below.)
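A toy numpy demonstration of the uncorrelated-errors claim — averaging five "models" whose errors are independent cuts the error by roughly a factor of 1/sqrt(5):

```python
import numpy as np

rng = np.random.default_rng(0)
truth = np.zeros(100_000)

# Five "models" whose predictions are the truth plus independent noise.
preds = truth + rng.normal(0.0, 1.0, size=(5, 100_000))

rmse_single = np.sqrt(np.mean((preds[0] - truth) ** 2))
rmse_average = np.sqrt(np.mean((preds.mean(axis=0) - truth) ** 2))
print(rmse_single, rmse_average)  # ~1.00 vs ~0.45 (about 1/sqrt(5))
```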
What's happening underneath is that stacking actually uses another machine learning algorithm to learn what the best combination is. That's what machine learning algorithms are good at — learning things — so you give it a whole bunch of examples of how these models perform, and it learns how to combine them. If you use a linear model as the meta-learning algorithm, it learns a linear combination — weights — which is what people often choose for the metalearner; but GBMs also do well, and random forests. It depends on how many models you have and whether they're good or bad, so you might want to play around with different metalearners. You can't actually do that yet in H2O — we're adding it soon — so right now it just uses a linear model.

I'm going to explain how the super learner, or stacking, algorithm works. You don't really have to know this, but since we're recording we might as well get it on film and share it with other people — and it's actually fairly simple. The setup is the same as any machine learning problem: you have the training data, which we call X; the number of observations is n and the number of features is M; and then you have the response column, which we call y. Even if you're new to machine learning, this is what you have to know: you have to get your data into this format, where the rows correspond to individual observations — training examples — and the features correspond to different things you're measuring about the data. Take a fraud example: say you work at a health insurance company and you want to predict whether or not a claim is fraudulent. You might put in features that have to do with the claim: how many previous claims has this person made in the last five months, how many in the last year, what type of claim is it, what's their zip code — all the different things you think might have something to do with the outcome. Sometimes, even if you don't think something has to do with the outcome, throw it in anyway. This is how the machine learning algorithm works: it sees a whole bunch of measurements about a particular thing and learns whether that thing is a yes or a no — in this case, binary classification. Once it's seen a whole bunch of examples, it has learned. That's the setup of essentially all supervised machine learning problems.

All right — with stacking, you start there, and you choose a set of what are called base learners; those are the models in the ensemble. Say you have a random forest, a GBM, a deep neural net, and a GLM, or maybe multiples of each. That's your job as the user: just decide what you're going to put in the ensemble. Then what you do is what's called k-fold cross-validation — and if you don't know what that is, maybe ignore the next slide.
If you are familiar with k-fold cross-validation: it's a process of training k models on the data. You split the data into k pieces, row-wise; you train a model on k−1 parts, test on the held-out piece, and then rotate which piece is the holdout. So you get k models and average the results, which is a little better than training on one piece of data and testing once — if you average over a bunch, you get a better look at your whole data. One of the things you get when you do k-fold cross-validation is what are called cross-validated predictions: when you train on the k−1 chunks and make predictions on the test chunk, and you do that k times, eventually you can fill in predictions for your entire original data set. Usually the goal of running cross-validation is a better estimate of your model's performance, but a byproduct is a set of predictions on your original data — n rows of predictions. You do this for each one of your models; we said we had L models, and now you can stack their prediction columns — I think that's where the stacking term comes from — squishing them together into a new matrix. That matrix is the setup for the meta-learning algorithm: if you train a machine learning algorithm on this new data set, it takes as input the predictions from the other models and tries to predict the actual output. Now you have a combiner algorithm that knows how to take the output from the models, combine it, and make a prediction. That's really all that's going on in the super learner algorithm. It's not complicated, but if it's your first time seeing it — and if cross-validation is new to you — it'll take a couple of passes to work through in your head; it helps to have the diagram. My point in showing this is that some of this stuff is not that hard; it's just moving things around. And if that was over your head, don't worry — it's all downhill from here. (A compact code sketch of the recipe is below.)
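Here's a compact scikit-learn sketch of that recipe — not H2O's implementation, just the same steps: cross-validated predictions from each base learner become the columns of the stacked matrix Z, and a linear metalearner is fit on Z:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, random_state=1)

base_learners = [
    RandomForestClassifier(n_estimators=100, random_state=1),
    GradientBoostingClassifier(random_state=1),
]

# Column l of Z holds base learner l's k-fold cross-validated predictions
# on the original n rows -- the stacked n x L matrix from the slide.
Z = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_learners
])

# The metalearner learns how to combine the base learners' outputs.
metalearner = LogisticRegression().fit(Z, y)

# To score new data: refit the base learners on all of X, take their
# predictions on the new rows, and feed those into the metalearner.
```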
Oh — I probably should have mentioned this earlier. There was a question about the difference between stacking and ensemble selection; I've explained it already, but I'll highlight it again. With ensemble selection, your resulting ensemble is just the good models, and you average their results together. With stacking, you can put in good and bad models — you don't want to put in models that are too bad, but it's okay if some bad ones are in there — and the metalearner will learn to ignore their input if it's consistently bad. When it combines all the predictions, it effectively says: okay, you've been consistently bad, you don't get a say anymore. That's the difference. And to reiterate something else: I mentioned we use a GLM — a linear model — as the metalearner by default, but GBMs also work really well. It depends: if you have really good models, and that's all you have in there, definitely try a GBM. My experience is that if you don't really know in advance what kinds of models are going in, the GLM will be a little more reliable — you're probably not going to get the very best result, but you're probably not going to do a bad job either. I've sometimes seen GBMs do a bad job, though in general they probably do better than a GLM. You should just try different ones — and, as I mentioned, we haven't exposed this feature in the H2O ensemble yet, but we will, and then it will be easy to try the different metalearners.

Okay, I want to hurry up a bit. I already mentioned what H2O is, and this is just reiterating it: it works on all the different platforms and in all the different languages, and the way it's different from most machine learning libraries is that you can run it not just on a multi-core machine, making use of all the cores, but on a multi-node cluster. Yes — the question was about GPUs. That's a separate project: there's a little bit of GPU support in H2O, but there's another project called H2O4GPU, which is basically a GPU version of H2O. That's just getting started — we only have a few algorithms right now — but that's probably what you want to look at. And true — we have XGBoost, which is the name of a package that's really well known for GBMs; it's very popular, and we actually started shipping XGBoost inside of H2O, I don't know exactly when — within the last six months or so. It has GPU support, and you can use it in H2O right now.

In general, when H2O was built, most of what people were using at the time was commodity clusters, like Amazon EC2, where you pay very little money to rent fairly weak machines — they do have some nice instances, but the idea is to get a whole bunch of cheap machines, run the models on those, and it's fairly cost-effective. One mistake I see people new to H2O make is thinking they can just add nodes and it will magically make their models go faster. If your data isn't big enough to require that, you're actually slowing things down: if your data fits into the memory of a single machine, you want to use one machine, because otherwise you're splitting up your data and the nodes all have to communicate with each other, which slows it down. The rule of thumb we tell people is that H2O will require about three to four times the on-disk size of your data in RAM to run the algorithms — which is actually fairly low compared to a lot of other software, where a lot of unnecessary copies of the data get made. So if you have a 16 GB laptop, then as long as your data is less than about 4 GB, you can probably run it on your laptop.

All right, let me introduce two terms we use in H2O that have to do with distributed computing. One of them is what we call the H2O cluster. That's not a physical machine; it's basically the Java
process, running on your laptop or on a cluster, where all the H2O stuff happens. All the models are trained in the cluster, all your data sits in the cluster, and it's all in RAM, so it's fast. There's no limit on the cluster size — you can keep adding nodes if you want — and each node sees only some of the rows of your data set; the data is split by rows. That brings us to the next term. If you're an R or Python user: in Python you're probably used to pandas DataFrames, and in R we have data frames, we have data.tables, and now we have tibbles, which are yet another thing. It's the same idea — there are rows, there are columns, you can subset, you can do whatever to the data frame — and it feels like you're using the native-language version of these things, but under the hood your data is split across a bunch of nodes (or, if you're just on your laptop, it behaves like a normal pandas-DataFrame kind of thing). So what we call an H2OFrame is actually just a regular data frame, except split across your machines. That's jargon you'll need to know. And the first thing you always do in H2O is start up your H2O cluster.

Okay — I did say it's downhill from here; things get a little easier now. These are the general things you can do with H2O. We have supervised and unsupervised machine learning algorithms — some acronyms: GBM is gradient boosting machine, RF is random forest, DNN is deep neural network, GLM is generalized linear model — a whole bunch of algorithms. Then there's all the stuff I mentioned at the beginning, which is annoying in practice if you have to keep writing code for it all over the place: imputation, normalization, one-hot encoding — it just happens automatically in H2O. There's also a term called early stopping — in H2O, and in machine learning in general. Some models are trained iteratively, and you don't know exactly in advance when you're going to stop. What you do is take a separate validation frame, measure the performance as you go along, and at a certain point look at the performance and decide the optimal time to stop. If you don't do this, you'll overfit your model: it keeps learning and learning and overfits to the training set. The goal of a machine learning algorithm is to learn from the data, but you don't want it to learn so much that it's not useful on any other piece of data — if you let a model learn too much, it overfits the data; that's the term you should know. So we have this thing called early stopping: you use this holdout set and keep measuring to make sure it's doing okay, because the point at which your validation error starts to go back up is the point at which your model is starting to overfit. That's why we have this extra piece of data, and you can do it all automatically in H2O. It's one of the more advanced, tricky parts of tuning a model, so it's nice to have some help doing it — a small sketch follows.
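Roughly what turning early stopping on looks like for a single H2O GBM in Python. This assumes a running cluster and the train/valid frames from the split sketch earlier, with a response column named "fraud" — all placeholders:

```python
from h2o.estimators import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    ntrees=500,                # a deliberately high ceiling
    stopping_rounds=5,         # look back over 5 scoring events
    stopping_metric="AUC",
    stopping_tolerance=0.001,  # stop once AUC improves less than this
)
# Validation performance is monitored on `valid`, and training halts
# early -- before all 500 trees are built -- once it stops improving.
gbm.train(y="fraud", training_frame=train, validation_frame=valid)
```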
Then there's all the other stuff — cross-validation, grid search, random search — we have all of that. And once you've trained a model, you can look at it and actually see which variables were important in predicting your outcome. That's knowledge gained — it's not necessary for the training, but it's a nice thing to get out of the model when you're done. Plus performance metrics and plots. The point is, it's a general-purpose tool that can do all sorts of things.

Okay, so now that we've talked about all these other things, let's talk about what AutoML is in H2O. This is the slide from before, but I've highlighted the things that make up H2O AutoML. We do all the preprocessing — imputation, one-hot encoding, standardization — in all the algorithms; it's not specific to AutoML. Then we do a random grid search over a bunch of different types of algorithms, and within each of those models we also individually tune them using a validation set — that's the early stopping I was referring to — and then we do stacking of the models at the end. Along with all the models you've trained, one of the things the AutoML object returns is what we call a leaderboard. That's from Kaggle language — Kaggle is a data science competition platform where you can compete to win money and prizes; if your model does best, you win the prize. So they have a leaderboard — I mean, the term is generic, but that's where we took it from. And they have forums where people post their scripts, so it's a really good place to go if you want to watch people in action, like a fly on the wall, and maybe start learning how to do some of these things. It's fun. So this is what you get back: a leaderboard — all the models, in order of how well they did.

Okay, now I'm going to show you how small the amount of code is that you have to write to do this. This is in R: you load the library, you start the H2O cluster, you import some training data — maybe it's in a CSV file; it could be in all sorts of formats, but for simplicity we'll say CSV. And then this is the interface. There are some options not being shown right now, but these are the only ones you need: you say, here's my training data (as an H2O frame), and then whatever the name of your response column is — if this is the fraud example from before, maybe the column is called "fraud". You have a question? Yes — you can connect to any database that has a JDBC connector, and I think MySQL is supported; I don't know 100%, but it probably is. It just depends where your data is coming from. To be able to do all these computations in memory, you obviously need to get the data from disk, or from wherever it is, into the H2O cluster memory — and if you have a billion rows, importing the data into the cluster could maybe even be the thing that takes the longest. This particular function right here, import file, is a parallel reader, so it can take advantage of all the cores on your machine and read the data in all together; it's fairly fast. I don't know if I answered your question or not. (A hedged sketch of the database route is below.)
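For the database route, here's a rough sketch using H2O's Python API — every connection detail here is a placeholder, and the JDBC driver jar has to be made available to H2O when it starts (the documentation page mentioned next covers the specifics):

```python
import h2o

h2o.init()

# Pulls a table into the cluster over JDBC; all values are placeholders.
claims = h2o.import_sql_table(
    connection_url="jdbc:mysql://localhost:3306/insurance",
    table="claims",
    username="user",
    password="secret",
)
```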
I haven't worked with that a lot myself, but — I'll point you at the end to where the H2O documentation is — there's a page that lists all the different ways you can get data into H2O, and it explains how to do it if you have a database, or if your data is on S3, or in a Hadoop file system, or whatever. We try to support all the popular formats and places people keep data.

Yes — that's a good question. One thing that affects which types of algorithms you end up choosing is whether your data is dense, meaning it's just a bunch of numbers, versus sparse, which means it's still numbers but a lot of them are zero. One of the ways you get sparse data is when you encode text into matrix format. The most simple way people do that — they generally do something more complicated — goes like this. Say you want to build a tweet predictor, to predict whether a tweet came from Donald Trump's own cell phone or his staff's phone. Somebody did this kind of funny analysis: they looked at all of Donald Trump's tweets, and the Twitter API reports the client that was used — Android versus iPhone — and they knew he had an Android and his staff had iPhones, so you can train a model to predict which is which, because that's the label, basically: Android means Trump. And they showed that his tweets are kind of crazy compared to the more subdued staff ones. Anyway, that was a side note. What you do is take all the tweets, gather all the words that ever appeared in any of them, and line those words up as the columns of your features. Each word that exists in your dictionary is a column, and everything is zero except where that word occurred in that tweet — so if a tweet is only five words long, the row is all zeros except in five places (or a two, if the word occurred twice). This type of encoding is called bag of words; it's the most simple thing to do. It's a little similar to one-hot encoding, but different — almost your entire matrix is zeros. The way that data looks is very different from dense numbers that all mean something in every column, and you end up choosing different algorithms: something like a GLM is probably what you want to use there, maybe a deep neural network, but if you have a hundred thousand columns you probably don't want to train a random forest or a GBM. These are things you should think about when you're building an AutoML system. We don't have anything right now that checks for sparsity and takes a different pathway inside the program, but that's something we'll probably add at some point. It's probably more common, I think, that people have dense data, so that's the assumption we make: we train a whole bunch of random forests, GBMs, deep neural nets, and GLMs. (A toy bag-of-words encoding is sketched below.)
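A toy version of that bag-of-words encoding using scikit-learn — the "tweets" here are made-up strings:

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["the crowd was huge believe me",
          "meeting with business leaders today",
          "huge meeting today"]

vec = CountVectorizer()
bow = vec.fit_transform(tweets)      # sparse matrix: 3 tweets x n words
print(vec.get_feature_names_out())   # the "dictionary" -- one column per word
print(bow.toarray())                 # mostly zeros; counts where words occur
```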
One thing we haven't added yet, but will, is the ability to turn off certain algorithms in the AutoML process. So if you have sparse data, you could say: I don't want any random forests, just do GBMs — or maybe only do GLMs — and it will train only those. We'll first give the user the ability to do that, and then probably add something to detect whether we should do it automatically — maybe a warning message that says: beware, this is going to take forever to train if you have a hundred thousand columns. So that's a good question. Other than that, the only other assumption we're making is that your data is i.i.d. — independent and identically distributed observations — so if you put time-series data in there, that would not be good. Time series is a very different type of data, and we'd probably want to either build that in as a separate thing, or maybe just have a flag: let us know if this is time series, and tell us which column is the timestamp. Most of what we're talking about here is non-time-series data.

Another question? Yes — well, that's part of it. When you use AutoML, you don't have any control over what it's doing inside; we've decided all of that for you. Right now, the version that's out trains one ensemble on all the models. What we're adding is ensembles on different subsets of the models, chosen in a smart way, so you'll maybe get a better ensemble — hopefully we'll cover all the different cases. Maybe we'll have five or six different types of ensembles at the end, and whichever one works best for your data will pop up to the top of the leaderboard. You'll know in retrospect what happened, and what each of these models is and how well it did, but you basically let the AutoML process decide all these things and do it for you.

I think the best use case is any machine learning use case where you want a good model and you don't want to do all the tuning yourself. You'd probably get the best results on data that's dense, with, say, fewer than a thousand features — maybe ten thousand — and it doesn't matter how many rows you have. Those are some basic conditions where this would excel. If you have weirdly shaped data, or sparse data, or something more unique, you could try it and see if it works, but you might want to look into what types of algorithms are good for that type of data and train something yourself. The good thing is you don't have to do any work: as long as you have a machine where you can run it, you can say, maybe I don't think this is going to do well on my data, but I'm just going to give it a shot and let it go — and if it returns anything useful, I'll use it.

Yes — so the question was basically: can we use this on Kaggle? I think so. One thing to note about this is that it doesn't do any feature engineering, at least right now. The way people win on Kaggle comes down to two things. One, they do
a whole bunch of manual feature engineering — figuring out all these different new data sources they can append to the data, and things like that. That's a lot of really manual work that in theory can be automated, and that's actually a different product we're making right now, called Driverless AI, that does that. The other thing they do is ensembles. I think to win at Kaggle you need both. Personally, I enjoy the modeling side of things much more than the feature engineering side, so this is a nice tool for me: I can say, you know what, I'm just going to enter this Kaggle competition, not do any work, let this go, and see where it goes. That's probably not going to get you a great score, because you're missing out on all the feature engineering people do. The best way to use it is to do the feature engineering yourself, then give the result to AutoML and let it do all the modeling for you. And if you want to take it even further, figure out what the best model it produced is — it's probably going to be an ensemble — and look at what's in there; you could try taking some of the models it trained and building different ensembles, things like that. Once we add the more advanced features to the stacked ensemble function in H2O — the customized or alternative metalearners, probably in the next month or two — we'll add that into AutoML as well, and then you'll be able to do all of this more automatically. One thing I might do, if I were really trying to push this current version on Kaggle right now, is take all the models and build a stacked ensemble by hand using a GBM metalearner — but I probably wouldn't, because I'll just wait until we have it in the software and do it that way; it gives me good motivation to finish it. So yes, this absolutely could be used on Kaggle. I would love to see somebody do a really good job of feature engineering, then run this, and get a good score — maybe somebody will do that, or maybe I'll do it, I don't know. Okay — any questions?

All right. Oh — the thing I didn't mention is what this last argument is: max runtime seconds. It just means how long you want this to run. There are three different ways you can tell AutoML how long to do its thing. One is time-based: you say, ten minutes — go off, get some models, and be done. Another is the number of models: you could say, train 100 models and then be done, or 30 models. The third way is a little more sophisticated: it's early stopping for the AutoML process itself. You keep track of the performance of the models in the collection you're building, rank them in terms of where the best model is, and watch how much you're improving by adding more models to the thing. It's like early stopping inside an algorithm: you give it a couple of settings to control how sensitive you want it to be — for example, if the AUC of the best model stops increasing by more than some amount, say 0.001, after five iterations, then cut off the process.
We have defaults for all of that, so you can just turn early stopping on and it should do something reasonable, but you can also control how sensitive it is. Model performance can be a little shaky at the beginning of a run, so if your tolerance setting is too big it might stop too early; what you basically want to say is, if after five rounds of measuring performance the metric still hasn't increased (or decreased, depending on what you're looking at), then stop. You can do the same with AutoML: if adding more models isn't getting you much more value, it stops.

This is what it looks like in Python (sketched below). Our Python interface is a bit more Pythonic, it looks more like scikit-learn, but it's basically the same, and almost all of the function names match our R interface. So: import the file, make an AutoML object, and the only thing you tell the object is something about the run, say 10 minutes, i.e. 600 seconds. Then you call train, and that's when you tell it what to train on: the training set and which column is your response. To look at the leaderboard at the end, it's stored right on the object.

Before we get to the leaderboard, here's what this looks like in our GUI. The H2O GUI is called Flow, and you might not be able to see this too well, but what I'll point out is that it's all the same options, plus some of the hidden options I didn't show earlier. We have the training frame and the response column. There's a fold column, which you don't have to worry about: we do cross-validation in this process, and if you want to specify your own custom folds, i.e. which row lands in which fold, some people want or need to do that, you can do it there. I just added a number-of-folds option too; it doesn't show up in this screenshot, but it will be there. If you have weights in your data, there's a way to tell it which column is the weights column. You can give it a validation frame, and there's also something we call the leaderboard frame, which is like a test set, except it's the set the models are scored on to build the leaderboard. Right now, if you don't supply a validation frame and a leaderboard frame, it will just chop them off your training data for you, so you don't have to split the data yourself, even though splitting is only one line of code; supplying them yourself is mostly for reproducibility. Then there's max models, if you want to control the number of models, max runtime seconds again, and three settings that control the early stopping. So really, all you need is the training frame, the response column, and one of the stopping options, and if you don't set one it will just run for an hour. Then you hit the Build Model button.

This is what a leaderboard looks like for a binary classification problem. It shows AUC, the area under the ROC curve, which is a standard metric for measuring binary classification models, and log loss, which is another common one, so we show you both, and we'll probably add all the metrics at some point.
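Here's roughly what that end-to-end Python workflow looks like; the file path and column names are placeholders, and the optional keyword arguments mirror the Flow options just described (names per the H2O docs):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")    # placeholder path

aml = H2OAutoML(max_runtime_secs=600)   # a 10-minute budget
aml.train(y="response", training_frame=train)

# The Flow options are optional keyword arguments on train():
#   aml.train(y="response", training_frame=train,
#             validation_frame=valid,    # held out for early stopping
#             leaderboard_frame=test,    # scored to rank the leaderboard
#             fold_column="fold_id",     # custom cross-validation folds
#             weights_column="weight")   # per-row observation weights

aml.leaderboard                          # view the ranked models
```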
The leaderboard shows you the models from the AutoML run. Typically the stacked ensemble wins; if it doesn't, that usually means your data is really small. With fewer than a thousand rows or so it can start to slide down the rankings, but with enough data the stacked ensemble will win, and in my experience a GBM will usually come second. DRF stands for distributed random forest; then there's another GBM; and XRT is actually the same as a random forest except it's called extremely randomized trees, meaning it's more random than a random forest, a variant that does well sometimes. It's not its own function: you get it by setting certain options on the random forest function (there's a short sketch of this below), and we should probably add a wrapper so you can just ask for an extremely randomized forest directly. If you look in our user guide, on the random forest or the GBM page, I forget which, there's a note in the FAQ or at the bottom that explains how to train extremely randomized trees.

In this example we have fewer than ten models. The GLMs pretty much always do the worst, and one of the things we're adding is an ensemble that leaves out the GLMs, because as you can see their performance is quite different from the other models: looking at the AUC column, the GLMs sit at about 0.68, then it jumps to 0.72 for the next model, then 0.73, 0.73, 0.75, and 0.77. The GLMs are much worse than the tree-based models, and I've seen that putting too many really bad models into an ensemble is not a good thing. So in future versions of H2O there will be multiple stacked ensembles built on different subsets of the models, not just one.

I'm not sure why deep learning is missing here; this might be a screenshot from when deep learning wasn't working yet, or it got skipped. AutoML is supposed to distribute the time budget among the different algorithms, not evenly but in proportions we specify: you can train many GLMs in the time it takes to train one GBM, so we give the GLM grid only a little time because it doesn't need much, deep learning a bit more, and so on, apportioning the time in a way we think is sensible. We'll run it once right after this and see if deep learning shows up.
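As a side note on XRT: based on the note in the H2O user guide mentioned above, extremely randomized trees are trained through the regular random forest estimator by randomizing the split thresholds. A minimal sketch, reusing the train frame from the earlier example:

```python
from h2o.estimators.random_forest import H2ORandomForestEstimator

# Extremely randomized trees (XRT): a DRF whose histogram split points
# are chosen at random, making each tree more random than standard DRF
xrt = H2ORandomForestEstimator(histogram_type="Random", ntrees=50)
xrt.train(y="response", training_frame=train)
```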
And this is a slide you might want to take a picture of if you're collecting resources; it's where all of our stuff lives. The documentation, including the user guide, is there, and that's the best place to start. All of our code tutorials are there, all the slide decks, including this one, go there, and video presentations, including this one, go on our YouTube page. There's also a page, I think it's still active, where we list our events: mostly meetups, but also conferences and other things we do. And if you want to get hold of me, that's where to find me on GitHub and Twitter, that's my email address, and that's my website, my old Berkeley site; at some point I'll get a new one.

Okay, so then quickly, let's pick one language for the demo. Are there more Python or R people here? Python, raise your hand. I think that's going to be the winner. Any R people? Yay, take an R-Ladies sticker when you leave; I brought some. I don't know if there are any PyLadies or R-Ladies members here; they're women-in-tech groups. No R-Ladies? Okay, so let's do Python.

Let me show you how to get to the documentation, because that's where the code is that we're going to run. If you go to docs.h2o.ai and scroll down far enough, you'll see code examples; we'll click on Python and paste those into a terminal. Oh, and I should also mention how you get H2O in the first place. In Python you can pip install h2o, but I'll warn you that if you do that right now you will not get AutoML, because that version is behind. It really makes me mad, but PyPI has a limit on the size of the wheel file, we've exceeded it, and we're working on getting the limit increased; apparently if you ask nicely they'll raise it for you. In the meantime you can always get it from our website at h2o.ai/download.

One gotcha: the client and cluster versions have to match exactly, down to the sub-sub-version; if they don't, they won't work together, and yes, a mismatch even on the third decimal place will throw a version mismatch error. One thing you can do in that case is use an option in the h2o.init function called strict_version_check; set it to False and it will proceed anyway. That's a loophole, though. If you really want the right version, go to the download page and grab the latest stable release, and the Python install instructions are right there. Our release pages are also versioned, so if you know the release you're looking for, you can type the URL in directly. Otherwise, go to GitHub, to the h2o-3 repository under our h2oai organization, and open Changes.md: it lists the changes for every version and links to the download page for each one. So if you really need, say, version 3.12.0.1, you can hopefully find it and follow the link on that page. But if you're in Python, just install from the download page and you'll be fine.
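That strict_version_check workaround looks like this; it's a real h2o.init option, but treat it as a stopgap rather than a fix:

```python
import h2o

# Skip the client/cluster version check (e.g. when versions differ on
# the third decimal place). A loophole only -- installing matching
# versions is the right answer.
h2o.init(strict_version_check=False)
```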
I've already installed H2O, so let me make the terminal text bigger and start IPython. I'll do import h2o, then go back to the docs and copy-paste the rest. We import H2O, then import some training and test data. The way you tell H2O which column is the response is a variable we call y, and x is a list of the predictor columns. Usually people have a data set where one column is the response and everything else is a predictor; if you need to remove something like an ID column, you can specify exactly which columns you want in x, and otherwise you can leave x blank and it will assume everything else is a predictor. We try to make everything as easy as possible.

What you're seeing first is the printout you get when you start up an H2O cluster. This is just on my laptop, and it prints some information: the version number, that this version was released 24 days ago, the number of nodes, which is one since we're on a laptop, and so on. By default it gives you a four-gigabyte cluster, but you can specify how much memory you need; if you have more memory, give yourself more to use, and if you need to be sparing, start small and increase it if that's not enough. It also shows how many cores you have. I've also pasted the commands to get the data in. Now this next line will run AutoML for 30 seconds, and we'll sit here while it runs; if anyone has any questions, now is a good time to ask.
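To recap what was just pasted, here's a sketch of the setup steps. The file paths and column names are placeholders; max_mem_size and asfactor are real H2O options, and the asfactor conversion is the classification-versus-regression point that comes up again in the Flow demo at the end:

```python
import h2o
from h2o.automl import H2OAutoML

# Default is a ~4 GB cluster; ask for more memory if you have it
h2o.init(max_mem_size="8G")

train = h2o.import_file("train.csv")   # placeholder paths
test = h2o.import_file("test.csv")

y = "response"        # the response column
x = train.columns     # start with every column as a predictor...
x.remove(y)           # ...drop the response...
x.remove("id")        # ...and anything like an ID column (if present)

# A 0/1 integer response is treated as numeric (regression) unless you
# convert it to a factor, which tells H2O to do classification instead
train[y] = train[y].asfactor()

aml = H2OAutoML(max_runtime_secs=30)   # the 30-second demo run
aml.train(x=x, y=y, training_frame=train)
```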
Someone asked whether H2O will detect and use GPUs. There's an integration we call Deep Water, which is kind of a separate thing, and there's a function that checks whether you have GPUs, but that's specific to Deep Water; I think if you're using XGBoost there's also a function that checks for GPUs and will use them. Deep Water is a wrapper around TensorFlow, Caffe, and MXNet, which are three deep learning libraries. We haven't put it into AutoML; we might at some point, but we also might not, because those libraries are typically for a different kind of data, images or time series or something like that, rather than the dense numeric data this is designed for. You can use those libraries on this type of data, but if your data is dense and numeric, you could probably just use H2O's own deep learning algorithm, which is just called h2o.deeplearning. The point of Deep Water is that if you're already living in the H2O ecosystem, trying out those other libraries is just a matter of using a different function. If your goal is to use only TensorFlow and you already know what you want to do, you might as well just use TensorFlow; Deep Water is meant for multidisciplinary work, where sometimes you want a deep neural net and sometimes a GBM. A lot of the time, when people are using something like TensorFlow, they have a particular type of data that needs it and that wouldn't work on something like an H2O GBM, so it depends on what you're trying to do.

Deep Water is a separate download, so how does H2O know which one you want when you call h2o.init? If you use the Deep Water bundle, you're running an H2O that knows about Deep Water, because it's already installed and available; if you don't have the Deep Water version, it simply doesn't know about it. It's like an extension package: when we started H2O, you saw a list of what we call extensions, and Deep Water wasn't among them. I don't remember whether it shows up as an extension in a Deep Water-enabled build, but it is a separate distribution. There are functions built into H2O to access Deep Water, but if it's not there, they just don't do anything. The reason we do it that way is that these are huge libraries to bundle with H2O; most people don't need them, and we don't want to ship the entire deep learning universe as part of regular H2O.

Okay, so let's look at the leaderboard. We've trained seven models: six base models and one stacked ensemble, and we find that the stacked ensemble is doing the best. If you want to see what the settings were for, say, this particular GBM, these are model IDs, so you can call h2o.get_model to bring the model over and do things with it. That might give me way too much information, but it lists all the parameter values for this particular GBM; things you might be interested in, like the sample rate, which here was 0.8. If you want to learn more about which parameters do well and study that kind of data, you can look at this.

Another thing I'd actually recommend: if you're just learning H2O, go to GitHub and our h2o-tutorials repo and scroll down; there are a bunch of tutorials in there, so try some of them. For example, under Python there's a notebook that shows how to do a grid search. The reason I'm showing it now is that if you print out a whole grid, you get a nicer display: all the grid values, then the name of each model, then its performance, ranked by how good the model is. It's actually a little inconvenient that the performance metric lands on the next line, but you can read it. Interesting: the top grid model used the same sample rate we found in the AutoML model, 0.8. Maybe that's a good value if you keep seeing it do well, but it's totally dependent on your data, so I would not read too much into it and conclude that sample rate 0.8 is always the best. (If you're wondering what that means: in a GBM, every time you build a tree you sample a subset of the rows, and 0.8 just says that subset is 80% of the rows.)
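A sketch of both steps just shown: retrieving a leaderboard model by ID and running a small hand-rolled grid search. The hyperparameter values are only illustrative, and aml, x, y, and train come from the earlier sketches:

```python
import h2o
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Pull a model off the leaderboard by its model_id and inspect it
lb = aml.leaderboard.as_data_frame()     # pandas view of the leaderboard
gbm = h2o.get_model(lb["model_id"][1])   # e.g. the second-ranked model
print(gbm.params["sample_rate"])         # default vs. actual value

# A manual grid search over a couple of GBM knobs -- choosing ranges
# like these is the part AutoML handles for you under the hood
hyper_params = {"max_depth": [3, 5, 9],
                "sample_rate": [0.8, 1.0]}
grid = H2OGridSearch(H2OGradientBoostingEstimator(ntrees=50),
                     hyper_params=hyper_params)
grid.train(x=x, y=y, training_frame=train)
print(grid.get_grid(sort_by="auc", decreasing=True))  # best models first
```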
Anyway, go to that tutorials repo if you want to learn more. So that's what AutoML does: it goes off and does all the things you would otherwise have to learn to do by hand. If you use the grid search functionality directly, you have to set up all the hyperparameter values yourself, which is hard to do if you don't know anything about GBMs: what does max_depth mean, should it be a million, should it be a hundred? These are things you learn as you progress in data science, like the fact that a GBM typically wants somewhat short trees while a random forest wants deeper ones, so the right max_depth range differs between them. And it matters, because if you don't pick the right ranges you're not going to get the best models. So we try to pick good ranges for our grid searches so that you don't have to; all of that manual grid training is what AutoML is doing under the hood.

That concludes our session, but I did mention I'd show you how to do this in Flow, so let me just point you to where it is. If you run AutoML in Flow, it brings up this form. I've already run AutoML, so there are actually a bunch of frames in there, which is a little confusing, but here's the one that's the training set, here's where I say which column is the response, and then I say run for 20 seconds. Hopefully this works, I haven't done it recently... and now we're training in Flow, no code at all. It gives you a little update about what's going on: it's training some GBMs, and it'll be over in about ten seconds. Okay, we're done, and now we can view the leaderboard.

Oops, I forgot: I used the wrong data set, so it actually thinks we're doing a regression problem, which is why you see different metrics up here. One of the things we did in the Python code was convert the response column, which was encoded as zeros and ones, into a factor. That's how you tell H2O whether you want regression or classification: for classification, the response has to be a factor, i.e. categorical, column; otherwise it sees zeros and ones, thinks they're numbers, and does regression. A lot of the time your data is already encoded as a factor, say yes and no, and then it automatically does the right thing, but that's why you saw a different model there. Okay, so thank you very much for coming, and I'll leave the resources page up for anyone who needs it. Thanks.
Info
Channel: H2O.ai
Views: 1,373
Rating: 4.8571429 out of 5
Keywords: Meetup
Id: 9nHXyLR8XEw
Length: 96min 58sec (5818 seconds)
Published: Mon Nov 20 2017