Intro to Driverless AI - Hands-on Training - #H2OWorld 2019 NYC

Captions
Okay, so we'll have a few people going around helping you get into Aquarium. Let me plug in here really quickly... and it looks like you can see my screen. Before I start, let me mention a couple of other things coming up. We have a Dive into H2O training event — a full-day training in New York — happening on November 4th. That's on Eventbrite, and you'll be able to find it on our home page. I'd invite any of you who are interested in more in-depth training to attend; it's going to be a really good one-day event, and our training team and a number of our expert data scientists will be there to support it. That's the first thing I want to mention. The second is that, as you're getting to your login on Aquarium, there are a number of tutorials available online, and these tutorials are all hands-on. We have a team that has been working very hard creating them — everything from getting started with Driverless AI Test Drive, to a financially focused machine learning interpretability tutorial, a time series tutorial, and so on. There's a lot of great material there, and it's another place you can come to learn more about all of our software, including Driverless AI.

Once you've logged into Aquarium — and let me pause for a second: how many people are actually logged into Aquarium right now? Okay, a good number of you. Once you're logged in, go over to Browse Labs, click on it, and you'll see a list of labs. By the way, now that you have a login to Aquarium, feel free to come back later, revisit this, and use it as you will; there's a lot you can explore in there. What I'd like you to do is come down to David Whiting's Driverless AI 1.8.0 lab and click View Detail. There's a lot of information there, so feel free to come back and read through it; I'll just give you the voiceover. Click at the very bottom, click Start Lab, and you should see an instance come up. If you're doing this on your own it will take a little while for these instances to come up; we've pre-started a number of them. Click on that URL, agree to the evaluation license, and — here's the important part — when you get to the login, use "training" as the username and "training" as the password.

What you're seeing as you first log in — and I know some of you are still logging in, so I'll talk a little before we move on — is what Driverless AI looks like after you've been working in it for a while: we have a number of datasets loaded, a number of projects built, visualizations, experiments, deployments, and so on. We're going to do a high-level overview, but you're actually going to be building things today. How many people are at this page right now? Okay.

Let me talk about the dataset we're going to use, really quickly, as a motivating example. All of the datasets we use in training are open source, so that we can freely distribute them; most of them you can find in places like the UCI Machine Learning Repository or on government sites, and a lot of them find their way into Kaggle as tutorials. This credit card example is one such case: it's a dataset that comes from a bank — a lender — in Taiwan, and we've got about six months' worth of data from 2005.
The goal is to build a model that will determine who's going to default on their next month's credit card payment — a very typical situation. We've got some information on historical payments, demographic factors, a credit history, and so on: a nice, basic dataset to work on. It's available in the UCI Machine Learning Repository and on Kaggle. I got very confused when I first used it, so I went back and relabeled things; my labels will look different from what you'll see if you get the original data. And again, you're always welcome to come back and visit Aquarium later to redo this.

So let me talk through the data really quickly, and then we'll look at it within Driverless AI. My columns are: an ID for each customer; an indicator of whether they defaulted on their next month's credit card payment; their credit limit — this is in New Taiwan dollars, so the numbers will look really big; knock a couple of orders of magnitude off and they'll be roughly comparable; some demographic information — sex, education, marriage, age; and then these things labeled 1 through 6: status, bill amount, and payment amount. Let me explain what those are. The 1 through 6 refers to how many months ago the information is from. Bill amount and payment amount are pretty straightforward: bill amount is what their bill was last month, two months ago, three months ago, all the way back to six months ago, and payment amount is how much they paid against it. Status is the interesting one — it's their payment status. Any value of zero or less is what we'd call good customer behavior in financial services: zero means they paid at least their minimum balance, minus one means they paid their balance in full, and minus two means they didn't use their card that month, so they had no balance to pay. Positive integers are how many months late they were on that particular payment.

Now, because I'm predicting for next month, you'll notice the status data are censored. Six months ago, if status-6 is one, that means I was one month late six months ago and the payment showed up five months ago. But for last month, the most I can know is that I haven't seen that payment yet — so the largest value status-1 can take is one, status-2 can be at most two, and so on. Do the data make sense? I know it's a little hard to project. Is everyone here now — or pretty much everyone?
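For anyone who wants to poke at the raw table outside of Driverless AI, here is a minimal pandas sketch of the checks just described. The column names follow my relabeled schema (default, status1 through status6, and so on) and the file path is illustrative, so treat both as assumptions rather than the original UCI/Kaggle headers.

```python
import pandas as pd

# Load the relabeled credit-card default data (path and column names are assumptions).
df = pd.read_csv("CreditCard.csv")

# Roughly 22% of customers default on next month's payment.
print(df["default"].mean())

# status1 is censored: the most recent month can be at most one payment late.
print(df["status1"].value_counts().sort_index())

# status6 (six months back) ranges from -2 (no balance that month) upward.
print(df["status6"].value_counts().sort_index())
```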
So first, let me show you quickly how to load a dataset — you can navigate around and play with this; we have a number of different datasets available here. I come up and click Add Dataset, and as Arno mentioned earlier we have a number of connectors — four of them here. Let's use File System, which is the internal file system of your own instance, so it's going to be a clone of mine... was there a question? Okay, let me start once more really quickly: I'm going to click Add Dataset and then click on File System, which is the first connector. After clicking that, I click on the data folder, and within data there's a whole bunch of material — like I mentioned, there are a lot of tutorials online and a lot of help. All of the datasets we use in this training are in the folder at the bottom called training, and here's that credit card data I just referenced. We could click Import Selection and it will try to import it — actually, Driverless AI is smart enough to say it has already loaded that dataset, so it won't load it again — but the point is that loading a dataset into Driverless AI is very, very easy.

Once the dataset is loaded — I'll have to go to page two here — there's the credit card CSV data. (No, you don't need to load it; it's all pre-loaded, but you'll find it on the second page.) You'll notice "click for actions" under Status: you can click anywhere along the row and you'll see Details, Visualize, Split, Predict, Download, and Delete. Let's look at Details very quickly. Details is basically how the data was loaded and what it looks like. Here's my status-1 column with minus 2, minus 1, 0, 1; and my status-2 column — and this is why it's nice to explore your data a little beforehand — has a bunch of zeros, a bunch of twos, and not many ones. Come all the way over to the right-hand side to default: with the summary information on the left, I can see that approximately 22% of the records in this dataset have a default. So it's a nice training dataset — obviously not a rare event, a little unbalanced but not bad at all. That's Details: nice summary information with some graphics.

So let's come back — and by the way, pause me or raise your hand if you're stuck or if I'm moving too fast; I just want to get to the good stuff. Come back to the credit card CSV and click Visualize. As Arno mentioned, the visualizations here are wonderful because a lot of what we do in visualization is filter out what you don't need to see. For instance, this dataset has about 25 variables, approximately 20 of which are numeric. If you've ever done a scatter plot matrix of all the scatter plots you "should" pay attention to, a dimensionality of 20 — if you do the combinatorics — gives you 190 different plots to look at; but it turns out only one of them is important. How did we decide that? You can always come down to Help, and the Help will describe what's happening, reference the literature if it's something less well known, tell you what the plot is for, and give you some advice at the end, which you can choose to follow or not. In this case, anything with a squared Pearson correlation greater than 0.95 is what shows up in this plot, and in terms of building a predictive model, that's really the only thing we care about right there. That's not to say there aren't lots of uses for visualization in other use cases, but this is all about building a model that's great at predicting, and understanding the data underlying it. Again, very quickly, we've also got some nice outlier plots, and you can look at the plots that have anomalous points or outliers.
If you see, say, a negative value that shouldn't be there, you can click on the point and it will bring up the underlying data, so you can explore it and try to understand that space. There are also parallel coordinate plots, for understanding your data in a multivariate sense, and a radar plot, which, if you're familiar with it, is the same thing as a parallel coordinates plot with a polar transformation, so I can see all my variables in one spot. You'll notice that anything flagged as an outlier is shown in red, which is a nice feature: this one individual is responsible for a lot of the highest values we saw in those univariate outliers. The idea behind these plots — they come out of engineering, and it's a great idea — is just to visually get a sense of what your data is doing. Most of the time, if I click on one of these lines, it brings up a bunch of different data points that the line represents; but every once in a while there are one or two individuals who may be responsible for a lot of the extreme values in the data. The purpose is that you understand your data going in, so you know what your model is trying to do. We're giving no recommendations on what to do with this data; that's up to you. If you've got business rules that would filter it out at the front end, that part's up to you. This just allows the analyst, the data scientist, the modeler to very quickly hone in on any potential data issues.

One of my favorite things here, really quickly before we move on, is Recommendations. If you remember your intro regression class somewhere in college, you probably had a list: when data look like this, here's the transformation to use; when data look like that, here's the transformation to use; and so on. Driverless AI will look through the data and give suggestions on what transformations to make. But unlike a black box — and you can treat Driverless AI as a black box — the real idea is to empower the data scientist to understand and make decisions. You can overrule a suggestion and say no, that doesn't make sense, or you can say, that's a good idea, I'm glad Driverless AI found it for me. So the Recommendations at the end are just that: recommendations on how to use the data.

By the way, if you come back and visit — and I'd encourage you to — you'll see there are a number of other datasets we have visualizations for, some of them fairly famous, like who survived the Titanic sinking. There are some plots in the Titanic one that aren't available for the credit card data because they weren't applicable: the Titanic dataset has missing data, so there's a missing-data heat map, and so on. So again, feel free to come back and revisit.

Audience: My question is basically on visualization of large graphs — how does it scale up? If I'm doing text summarization and I want to build it in Python, the comparison would be NetworkX, which doesn't scale. Can I use this visualization for visualizing large, distributed graphs?

In general — and if I say anything wrong I'll have Arno correct me, because he understands this better — the purpose of the visualizations here is not to be a general-purpose package; it's for the predictive models you're building, and in that context these visualizations scale very well with the data.
I'm not sure we have the specific plot you're talking about as one of the options here, because these are built for me automatically — well, not quite automatically: they're only built if I come to the credit card CSV and say Visualize; when I click Visualize it goes ahead and creates them. Arno?

Arno: So these plots are for numbers and strings only; text really isn't handled that well. It would be treated as categorical, and then it's just some kind of mapping of whether it's correlated or not to an ordered categorical, maybe, or to another numeric column. It won't do any recipes for that — you'd have to wait for Q, the other product shown earlier, which will have custom visualization recipes where you can do anything. Now, this is scalable: 100 million rows visualizes in something like 30 seconds, because we have a smart aggregator inside. It's basically clustering with epsilon bubbles: you pass an epsilon bubble around, everybody who's close enough gets sucked into the bubble, and a counter goes up — every bubble carries a count of how many points are inside it. That means the actual dataset being visualized is small, but you still keep all the outliers, because an outlier is just a bubble with a count of one, while a hundred points sitting together become one bubble with a count of a hundred. So it's a smart way to visualize big datasets — but not for text; we'll have to see. In Driverless this part is written in Java elsewhere, it's not part of the customizable recipes, so it's a little harder to customize; that's what Q is for.

Audience: I had a quick question around pruning the data. Say we have a certain set of outliers visualized here — is there a way to just remove those outliers and re-visualize? And for the heat maps, if we have highly correlated variables, is there customizability where I can look at a heat map of only the 10 variables I know are of value, so I'm not looking at a 10,000-by-10,000, or even a 50-by-50, which is hard to read?

Yes — the new version has a facility for custom data preparation inside Driverless AI. I showed you earlier that live code preview: you can take your dataset and say, drop everything but those 10 columns; then you get a new dataset and you say Visualize. There is a UUID for every dataset, but we don't have full lineage yet.

So that's not available in the version we're using today, but it will be available shortly. If you're working on 1.8.0 or earlier, this was essentially just to let you hone in; I would say that if you're using the Python client or the R client, you would do the filtering there and reintroduce the dataset.

In general, as a process, there is actually a Python client for AutoViz: you can make custom graphs using this smart aggregation facility. You can say, on 100 million rows, I want a scatter plot of X versus Y, from Python, and back in your Jupyter notebook you'll get a nice plot. It's like having Leland Wilkinson's visualization program for custom plots, so you're not just stuck with the plots that come out by default. That came out in 1.7, so this version has it as well, and it's in the documentation on the website.
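To make the epsilon-bubble idea concrete, here is a toy sketch of that style of aggregation. It is purely illustrative — the real aggregator inside Driverless AI is a Java implementation of Leland Wilkinson's work and is certainly more sophisticated than this greedy loop.

```python
# Toy version of "epsilon bubble" aggregation: points within epsilon of an
# existing bubble get absorbed into it (count += 1), otherwise they start a
# new bubble. Outliers survive as bubbles with a count of one.
import numpy as np

def epsilon_aggregate(points, eps):
    centers, counts = [], []
    for p in points:
        if centers:
            d = np.linalg.norm(np.asarray(centers) - p, axis=1)
            j = int(d.argmin())
            if d[j] <= eps:
                counts[j] += 1
                continue
        centers.append(p)
        counts.append(1)
    return np.asarray(centers), np.asarray(counts)

rng = np.random.default_rng(0)
pts = rng.normal(size=(20_000, 2))
pts = np.vstack([pts, [[8.0, 8.0]]])      # one extreme outlier

centers, counts = epsilon_aggregate(pts, eps=0.5)
print(len(centers), "bubbles instead of", len(pts), "rows")
print("smallest bubble count:", counts.min(), "(the outlier survives)")
```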
Okay, let me show you the next action — and again, I apologize that we're flying through this to give you a quick overview. We've done Details and we've done Visualize; now let me show you the data splitter. The idea is to create a training dataset and a test dataset — and you always, always, always want to do that. Down at the bottom you've got a slider, or you can just enter a number like 0.8, meaning 80% goes on the left and 20% on the right. Then select the target column: go down to the bottom and click on default. Because that target column is boolean, Driverless AI knows it needs to do a stratified sample rather than a simple random sample — so this is just telling Driverless AI how to do your split. We'll go ahead and save the data. And a quick convenience note: you may have noticed there's a seed there, so if you want the split to be reproducible, rather than saving the test and train datasets off somewhere, you can just redo it with the same seed.

Now that that's done, come over to page two, if yours looks like mine, and go to card_train — because the next thing we're going to do with the credit card data is Predict, and since we've split it into card_train and card_test, let's make sure we're all using the same data. Click Predict, tell it you don't want the tour at this point, and — let me zoom in a little — you'll be presented with a blinking target column and a name field. You can call it "my card experiment" or whatever name you want to give it; if you don't give it a name, one will be created randomly for you. Before clicking the target column, I'd suggest you come over to Test Dataset — let me zoom back out so you can see it — and choose card_test. Number one, you always want a test dataset attached here; the reason will become obvious in a couple of minutes. Then select your target column: go to the bottom and select default. Once this happens there's a whole bunch of spinning things, the display gets very busy, and there's a lot going on, so we often forget about what's happening above.

So let me tell you: there are multiple levels of using Driverless AI. If you want to use Driverless AI as a complete black box, the way you do it is just come down here and click Launch Experiment. Believe me — with the suggested settings and everything else Driverless AI has given us, it has some very nice guardrails and it's still going to build a very, very good model for you. But that's using it at its most black-box-ish, and most modelers, data scientists, statisticians, econometricians, and so on would like to understand it a little better than that. I call that my "summer intern mode" — I have a ten-year-old son who is a bit of a Python coder now, and when he gets to use Driverless AI, that's the button he pushes. It will still build a good model.

The next level is these knobs. I should also quickly mention the scorer that's blinking at you: it has selected AUC, which is the default for a non-rare-event classification model. Feel free to change it and select a different one — or, as you'll hear later today, create your own via a recipe. So if you've got your own custom scorer, or, for instance, a specific cost for false positives and false negatives within your business that you want the models scored on, you can use that.
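The recipe session later today covers the actual scorer interface; here is just a sketch of the kind of scoring logic such a recipe would wrap — a business cost for false positives and false negatives at a hard-coded decision threshold. The costs and the threshold are made-up numbers for illustration, not anything from the session.

```python
# Plain-Python sketch of a cost-based scoring rule a custom scorer could wrap.
import numpy as np

FP_COST = 1.0     # cost of flagging a good customer (assumed value)
FN_COST = 5.0     # cost of missing a defaulter (assumed value)
THRESHOLD = 0.5   # hard-coded decision threshold (assumed value)

def business_cost(actual, predicted_prob):
    pred = (np.asarray(predicted_prob) >= THRESHOLD).astype(int)
    actual = np.asarray(actual)
    fp = np.sum((pred == 1) & (actual == 0))
    fn = np.sum((pred == 0) & (actual == 1))
    # Lower is better, so a recipe built around this would be minimized.
    return fp * FP_COST + fn * FN_COST

print(business_cost([0, 1, 1, 0], [0.2, 0.9, 0.4, 0.6]))
```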
Now, it's going to get a little complicated down here, so before I get to these settings I want to point you to Resources: go to the bottom and click on Help. I'll show you where all of the details live, and then give you a sense of what's happening. In Help, come down to "Using Driverless AI" on the left — let me make the font a little bigger — click on Experiments, then Experiment Settings, and here's a section on those knobs. Each of these knobs goes from 1 to 10, and because there's a knob labeled Accuracy, a knob labeled Time, and a knob labeled Interpretability that you can scale up and down, that gives you a thousand unique recipe combinations to build a model from — and you can build some really terrific models this way. On the left-hand side the documentation gives you a summary of what's happening with each knob.

Very quickly, then: Accuracy. I need to point something out — it's maybe hard to find the right name for this knob, because in reality, even if Accuracy is at its lowest setting of 1, we're still going to build the best model we can based on the score; we're still optimizing that, so in essence we're always building the most accurate model we can. This knob is really more a sense of the amount of effort — how deep we're going to dig. Accuracy impacts a couple of things: how we do feature evolution, and what the final model will be. If Accuracy is low, the final model ends up being an individual model — in this case a LightGBM or an XGBoost; if Accuracy is high, it ends up being some kind of ensemble. So it figures into both feature evolution and how the final model is built. The Time dimension basically says how long I'm going to let my features evolve: if you think of an evolutionary algorithm where features that do well survive and features that don't die off, Time is in essence telling us how long that runs. Once more, all of the details are in the Help — Time is pretty straightforward, but Accuracy gets fairly complex; there's a lot going on in between.

Let me make one suggestion, though, in general — and this isn't 100% true, and we can play with it; I'll show you some examples in a minute. If Driverless AI suggests, say, a 6-4, it's fine to move these around and experiment to find the right place; I would just personally avoid going to something like a 10/10 — generally that's overkill and you won't end up with a better model. So let's come back to roughly what it was; I think I had a 5-4 or a 6-4.

Last thing: Interpretability. This knob, in essence — and again, under the hood it's quite complex; you can go back to the Help and look at what it's doing — works like this: if Interpretability is 1, we're going to throw the kitchen sink at it. We'll try all the model types and do every transformation; I don't care whether I can interpret it or not, I just want something that works, black box or not. As I increase Interpretability, I start constraining the kinds of things we allow to happen, to make the result more interpretable. Personally, from my experience in insurance and financial services, if I were in a space with a regulated model, I'd tell you to start at about a 7 — a 6 or 7; I probably wouldn't start below that. One little reason — let me zoom in a bit — is that at 7 I get monotonicity constraints I can add, so in essence I'm no longer building a black box model; it's more like a white box model, because I'm enforcing monotonicity on my gradient boosting machines and so on.
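To see what that kind of constraint means outside of Driverless AI, here is a minimal sketch using LightGBM's `monotone_constraints` parameter on synthetic data. The choice of features and directions is an assumption purely for illustration (for example, a worse recent payment status should only push default risk up, and a higher credit limit should only push it down).

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))               # stand-ins for status1, limit, age
y = (X[:, 0] - X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# One constraint per feature column, in order:
# 1 = effect forced increasing, -1 = forced decreasing, 0 = unconstrained.
model = lgb.LGBMClassifier(monotone_constraints=[1, -1, 0])
model.fit(X, y)

# The fitted trees are now guaranteed monotone in the first two features,
# which is the kind of "white box" behavior higher Interpretability turns on.
```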
That being said, there are lots of different combinations we can do here. I've been talking for a bit, so move your knobs to somewhere comfortable and click Launch Experiment, and we'll let this spin for a minute.

While the experiment is launching, let me show you where all these experiments live. We have the Datasets tab up here, and the AutoViz we talked about; all of the experiments live under this Experiments heading, and it gets fairly complex because we've got lots of different experiments from different datasets. While that's starting to spin, come over to Projects: I've created a project for all of these card-default models we've built, and you'll notice I've pre-built a bunch of them. Let's quickly link the experiment we're working on right now — I click Link Experiment, and here's my card experiment — so that one is now a member of this leaderboard, even though it's still building. Let me come back to it quickly and see what's happening as it builds. First off, each one of these dots is an individual model being built, so already we're building a number of different models — 10 of 14 parameter-and-feature-tuning models, and so on. Here's the variable importance; here's my ROC curve with all of my metrics and the confusion matrix. There's a lot going on here as this builds and evolves, and we're going to let it evolve as best it can based on the recipe we chose. Was there a question? Let's get a microphone.

Audience: Can you set up a recall threshold before you start the experimentation — say, recall for class one of at least 95 percent?

Yes — that would be a custom scorer. You can make a custom scorer, hard-code the threshold at 95, and get your metric based on that; otherwise a scorer like AUC looks across all thresholds.

Was there another question? I saw another hand.

Yes — every single yellow dot is a full pipeline that's fitted, so not just features, but features plus model. We change everything: not just the features, but the features and the parameters for both feature engineering and modeling, so the whole pipeline is tuned over and over again. There's a race going on — like a genetic algorithm, but also particle swarm and Monte Carlo — so it's a smart search, let's say. Bad individuals, the losers, get eliminated over time; the good ones survive and mutate, and then they listen to others that have won in the past and say, let me mimic your behavior. They all exchange information about the feature engineering — for example, these three groups are good to group by and compute mean, median, min, max, and so on. So they share information; it's not that every model is on its own in random-search land.
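Purely as a caricature — this is not the actual Driverless AI search, which was just described as mixing genetic, particle-swarm, and Monte Carlo ideas — here is a toy sketch of the survive-and-mutate loop on made-up feature subsets:

```python
# Toy evolutionary loop: score a population of candidate feature sets,
# keep the winners, mutate them, repeat. Fitness values are invented.
import random

random.seed(0)
FEATURES = ["limit", "age", "status1", "bill_amt1", "pay_amt1"]

def score(candidate):
    # Stand-in fitness: pretend status1 and limit carry most of the signal.
    useful = {"status1": 0.5, "limit": 0.3, "pay_amt1": 0.1}
    return sum(useful.get(f, -0.05) for f in candidate["features"])

def mutate(candidate):
    feats = set(candidate["features"])
    f = random.choice(FEATURES)
    feats.symmetric_difference_update({f})    # toggle one feature in or out
    return {"features": sorted(feats) or [f]}

population = [mutate({"features": ["age"]}) for _ in range(8)]
for generation in range(10):
    population.sort(key=score, reverse=True)
    survivors = population[:4]                # losers are eliminated
    population = survivors + [mutate(s) for s in survivors]

print(sorted(population, key=score, reverse=True)[0])
```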
Yes, sir? Cross-validation — that's tricky. As Accuracy goes up to seven, eight, nine, and so on, you can look at the preview on the left side. The preview is always accurate: whatever the preview says is what's going to happen. Our documentation might not always give the exact numbers, because when you turn the knobs it may do different things for different sizes of data. If you have a hundred million rows, it will do something different at Accuracy 6 than it would for twenty thousand rows, because at Accuracy 6 on a hundred million rows maybe all you need is a one-third split of the dataset for validation, or a quarter, whereas for small data you need more folds, more repeats, and so on to get good accuracy. So we do whatever it takes for a given dataset, and the preview on the left side is always what will actually happen.

So one of the wonderful things — as a user, with credit to all of the great makers who built Driverless AI — is the amount of work that gets done here very quickly. Based on my settings, I have control over whether it's going to do cross-validation, whether it's going to do a one-third validation split, things like that. But before we go on to the results, come with me to Experiments really quickly — actually, to make sure we're all in the same place, come to Projects and then card_default. This model that was pre-built up here, card_default, was basically just the default model. If you come to the right you'll see three dots; click on those and say New Model With Same Parameters. This is how I like to build my second model after I've built my first, because everything is already filled in. And let's open up Expert Settings for a second. Expert settings give me my third level: the first level was complete black box, just hit go; the next level is the three dials I can mess around with; the third level is complete control. Here are the models — as Arno mentioned earlier, I can turn some on and turn others off. I have feature engineering controls: how much feature engineering do I want to do? If I'm working on a time series or NLP problem, I again have great control over what happens there. And then Recipes: right here I can include or exclude certain transformers. Maybe you're working in a regulated space and you've got a regulator who says, I don't want you doing this type of target encoding — you can turn those off so they're not available to Driverless AI at all, and go ahead and build your model. It's customizable. And that's the next level. The ultimate level, which was referenced earlier and which we'll cover in a training a little later this afternoon, is building your own recipes: you can create your own specific model, or wrap a model you've found out there, like a CatBoost, and include that. (You can go ahead and click Done here.) You can do the same with a scorer, and you can include your own transformer — I believe the text sentiment one was built from TextBlob.
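If TextBlob is indeed what that transformer wraps, the core logic is roughly the sketch below — a plain function emitting a polarity score, shown here outside the recipe interface (which the later session covers), so treat the framing as an assumption.

```python
# Sketch of the feature a text-sentiment transformer could produce.
from textblob import TextBlob

def sentiment_feature(text: str) -> float:
    """Polarity in [-1, 1]; a transformer recipe would emit this as a new column."""
    return TextBlob(text).sentiment.polarity

print(sentiment_feature("The customer service was terrible and I will not pay."))
print(sentiment_feature("Great experience, payment processed smoothly."))
```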
If you find a great Python package and you want to include something from it, while using the whole infrastructure of Driverless AI underneath to build that best evolutionary pipeline — that's how you'd go about it. So there's a lot more you can do in terms of details and customizing your analysis.

Real quick, the reason for showing that preview: let me click back — here we are at Projects. I just want to explain what's happening in this project, because there are some nice comparisons we can make. Go ahead on your own and click Select Scoring Dataset, click card_test, and then select the scorer — in this case we used AUC. Once this is done you'll notice it's now populated: I've got my validation score, my test score, and the time it took. The last one we built hasn't been validated yet; it hasn't completed. But let me explain quickly. Here's card_default: this was my black-box approach — Driverless AI suggested something and I built that model. Let's rank-order these, and it turns out that in this case Driverless AI actually built the best model available without me doing any tweaking. Now, maybe there are other reasons for tweaking, because best predictive power isn't always the only thing you want in a model: it has to be explainable to a certain level, it may have to include or exclude certain techniques, and so on. So the next thing I did was build something called card_monotonic, where I moved Interpretability from six up to seven. It took about five minutes to build card_default and about twice as long to build card_monotonic — not that much longer — with a comparable validation score and comparable test score (I don't know what the variability around these is), though not quite as good. Then I said, okay, from card_monotonic let me really crank Accuracy up to eight and Time to six — and cranking that up, it took an hour and 11 minutes, call it seventy minutes, roughly seven times as long. And it improved a little, but not very much for all that additional time — which I'd argue suggests that the 6-4-7 settings were sufficient to extract most of the interesting information. So let's go ahead and compare these three models.

Audience: Since we can adjust different knobs, how many attempts should we make? Otherwise we could probably just keep trying — I'm not sure how far we should go.

That's a hard question to answer, because I don't know what your use cases are. As an end user, in this case with this dataset, after I've done five or six models I have an idea of the best I can get out of it. Maybe I then take that to my business partners, my model review folks, my regulatory people, and they say, show me what you did — and, guess what, you can't use this; or, I don't want an ensemble at the end — build the same thing but strip the ensemble. In terms of making it easy for you to go back, remodel, and redo things, the product is wonderful in that respect. All I can say is that you'll get a sense, as you work on a specific problem, of how many to do.
I don't have a good recipe for that beyond my own experience using it — it doesn't take that long. If it's a big, complex problem, a time series problem, or an NLP problem with other things going on, I'm not sure I know how to answer that, except that I wouldn't just build one model, trust it, and move on; after you build a few, you'll start to see what's adding value and what's subtracting value. Arno, did you have something?

Arno: I would say the way we'll flesh that out in the next version is in the Projects page — you want to see that leaderboard. Right now we're telling you to make those seven models and compare them; we want to add a button that does that for you, with baseline models, so you'll get a GLM, a single decision tree, an XGBoost, a Random Forest, all these baseline models, maybe a TensorFlow model, maybe an H2O model. Once you have that leaderboard out of the box, you at least know where you stand, and then we can run the defaults — and if that's much better, you should say, wow, there's something we missed here; and if it's all about the same, you can say, okay, we've squeezed enough. But you will never, ever know what's truly possible with a given dataset unless you crowdsource it for a few months across the whole planet, like Kaggle — and only then will you find out there was a leak in the data and you have to start over, or something like that. So it's always a tricky question; it's really an unsolved problem to know how much juice is in the data. At least this will help you make a leaderboard fast — and you can automate the leaderboard creation, of course, but it's our job to do that for you.

There are a couple of points I need to make. When you rerun another experiment on the same dataset, the model gets smarter, because it already knows what features work — there's a brain built into the system, on your hardware; we don't have that brain, it's your brain, it's your IP — and the system gets smarter over time. If you run ten experiments, they all get smarter; at some point it saturates. Another thing you can do, on every single experiment, is say Retrain Final Pipeline, on the right side — we should probably show that at some point; that's a good one. You can retrain any experiment, whether it's finished or not, and say, make me one more — and then you disable LightGBM, or you enable TensorFlow, or you drop everything about target encoding, and you can redo the final pipeline with just those changes. You can even change the dataset or the columns you're dropping — you can change anything — and say Retrain Final Pipeline, and whatever still works from the past will be reused, and anything else will be filled in. So if you add ten more columns and say retrain, it will do the same feature engineering on the older columns, but for those ten new columns it will start fresh. You can also restart from the last checkpoint and continue some more iterations to evolve. By default, if you say Retrain Final Pipeline, it will just make the final pipeline and not evolve any more, but you can always say, evolve a little longer. It never starts from scratch unless you turn off that feature brain. Basically, the system is there to get smarter, and if you add or remove transformers or models, or add custom recipes, and the whole thing flatlines after a week of trying everything under the sun, then you can say, I did my best — and you can never say more than that anyway.
But we have built in these checkpoints to handle power outages, server restarts, and so on: you just restart and it will finish; it won't actually have to start from scratch.

Audience: The brain — it maintains a lineage. Is that at a user level, an installation level, or a company level?

Installation level. Every Driverless AI install has a folder with up to 20 gigabytes of pickle files from previous experiments, and they get rotated — so it's actually not something that's 100% reproducible, because that state can change. If you really want a reproducible experiment, you click the little reproducible button between Time and Interpretability; that turns off the feature brain and starts from scratch, so that every time you run the same experiment you get the same output.

Audience: You mentioned the final pipeline that you can rerun — what's part of that final pipeline? Does it take all the features, everything it built?

A final pipeline is everything you need to put the model in production, and also to get the scores on the test set and on the training data. It basically fits the best genetic-algorithm winner — or winners — on the full dataset and makes the model you would put in production. It's the output at the end: that last final screen, the number you see, comes from the final pipeline. If you say, make me another final pipeline, it will redo that last stage — and ideally you'd then change something so it doesn't do exactly the same work again: add some features, drop some columns, and see what happens. Often a good idea is to drop the best feature and redo it, to see how much you can squeeze out of the other features and whether you can make a more diverse model.

So — we've only got about five more minutes — let me show you quickly where to find two more things, and from there feel free to come ask me questions; that's what this whole thing is about. We run lots of trainings online, there are lots of tutorials, and we've got that next training event, so this is just a chance to get in and look at it quickly. Click on Interpret This Model: all of the MLI work you're hearing about today, which Patrick Hall and his team are working on, is available here. If you come to the Dashboard, it gives a really nice view: this is a K-LIME surrogate model to explain what your black-box model is doing, here's a decision tree over the top of it, feature importance, and partial dependence plots — again, all so that you can understand your model. The one thing I wanted to point out quickly is that under the Driverless AI model we have feature importance and we have Shapley — if you're familiar with using Shapley values to explain things — and these are all in terms of the transformed features rather than the original features. But I want to show you where disparate impact analysis is, because we've highlighted it a few times and it's really important in a lot of regulated industries. Disparate impact is here — let me see, I can barely read this — here we go. In this dataset I've got education, so I want to see whether, across the different levels of education, there's adverse impact — and here it is: accuracy, true positive rate, and so on.
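Done by hand, the same kind of disparate-impact check looks roughly like the pandas sketch below: compare accuracy and true positive rate across education levels. The column names again follow my relabeled schema and the tiny example frame is made up, so treat both as assumptions.

```python
import pandas as pd

def group_metrics(df, group_col="education", actual="default", pred="prediction"):
    def metrics(g):
        tp = ((g[pred] == 1) & (g[actual] == 1)).sum()
        fn = ((g[pred] == 0) & (g[actual] == 1)).sum()
        return pd.Series({
            "accuracy": (g[pred] == g[actual]).mean(),
            "true_positive_rate": tp / max(tp + fn, 1),
            "n": len(g),
        })
    return df.groupby(group_col).apply(metrics)

# Illustrative toy frame only -- real use would pass the scored test set.
toy = pd.DataFrame({
    "education":  [1, 1, 2, 2, 3, 3],
    "default":    [1, 0, 1, 0, 1, 0],
    "prediction": [1, 0, 0, 0, 1, 1],
})
print(group_metrics(toy))
```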
So we have a whole tool built in here for you to quickly see whether your model — maybe due to the data, or other things you've got in the model — shows disparate impact; this gives you a quick view of it.

Now, by the way, let me come back to Aquarium really quickly and show you a couple of things. We've got a number of tutorials, and as you browse the labs you're also welcome to use the open-source MLI workshop — a workshop with materials that Patrick and his team have put together. If you want to know what's going on under the hood, it's all done in Python, so you can see the Python code for all of these things, and for some of us that makes it easier to understand. So feel free to come and use that resource, and our tutorials as well. I apologize for doing a flyby of MLI; usually we spend quite a bit more time explaining it, explaining what Shapley values are and how they work — and we have sessions today discussing Shapley, and the fact that with a tree-based model you can compute Shapley values very efficiently, which is kind of magic.

The last thing I want to show you is the most important thing. If you're a data scientist who builds models but nothing ever goes into production, then whatever you call yourself, you're in R&D. Without things going into production, where they impact an end user, you're not really having an impact on the business — and for a data scientist that's not a very fun place to be. So if you come here, you'll see something called Build MOJO Scoring Pipeline; in my options I can turn that on so it happens automatically, but I'm going to go ahead and build my MOJO scoring pipeline now. And now that I've built it, I'm going to deploy it: click Deploy. This is beautiful — it's a one-click deployment. Let's deploy it to a REST server; this is a local, internal REST server on the same EC2 instance. Part of the reason for doing it this way is that it's nice and convenient, and also, if you have engineer friends you work with and show them this, they immediately understand how it works and say, oh, I can do that — just point them to the documentation. This is going to take about another 30 seconds to deploy, but the thing I want to show you is that we've gone from building a model to literally being able to deploy and score — and deploy means not using any of the Driverless AI software any more. I've got this artifact, this piece of code, that I can now throw my data against, and it will return predictions for me. In just a second this will finish, and while I'm waiting we've got a couple of minutes for a couple more questions.

One thing to stress is that this pipeline that goes into production is the full deal: full ensemble, full feature engineering, no shortcuts. It's not a surrogate model, it's not a RuleFit, it's not anything like that — it's the full XGBoost, LightGBM, TensorFlow ensemble, whatever it is, productionized as C++ or Java with R and Python bindings, or command line. Custom recipes are coming as a Python package first, because we might not have a MOJO or Java runtime for, say, your ARIMA model with your custom logic. If you have a pip install somewhere in your custom recipe, we can make a Python build that is shipped fully self-contained, but it's still going to be Python, because we need to call your Python code.
But if you work with us, then we can make a Java version of it, like we did for all of our transformers, so that it's a purely Java or purely C++ runtime without any Python dependencies. Anything is possible — we've done about as much as anyone in the world here already: PyTorch and so on run as standalone C++, for example, without the need for Python.

So it looks like our time is up. Again, this was a quick flyby, and I would love to do a deep dive. We have a community site that we'll be launching in the next few weeks, and on that site we'll have a list of virtual trainings you can attend if you want to learn more. The two things I'd put up side by side are our tutorial site and our Dive into H2O event on November 4th: we spent a little more than an hour doing a hands-on here, but there we'll go from 9:00 to 5:00 doing a full hands-on day, where we can dive deep into all of this — well worth the trip and the price of attending. Anyway, thank you for your attendance; I think we have other people coming in for the next session, but I appreciate it, and feel free to come back and use Aquarium any time you'd like. [Applause]
Info
Channel: H2O.ai
Views: 822
Rating: 5 out of 5
Id: HvYWE6YFoR0
Length: 58min 14sec (3494 seconds)
Published: Wed Oct 23 2019