Hands-On with H2O Driverless AI by John Spooner

Captions
So, just before everyone comes in, let me do a quick introduction to who I am. As Ian said, my name is John Spooner, and my role at H2O is Director of Solution Engineering for Europe, so anything technical that we have to do at H2O in Europe, me and my team look to sort out. I've been in the space of data science for over 20 years now, in fact way before the term data science existed. I've worked for a number of organizations; the last company I worked for was a software company called SAS, where I was the head of data science. I've been at H2O for just coming up to a year now, and I'm really excited about what we're doing as a company and about growing our footprint within the European region.

What I'm going to have the pleasure of doing for the next couple of hours is to walk you through a hands-on experience of Driverless AI. When I first started in software I actually started in education, teaching training courses, and the biggest training course I ever taught had about 15 people on it, so this is by far the biggest training course I've ever taken. Do bear with us with regards to any technical glitches; we think we've solved them all, but those are famous last words. We're going to get you up and running on Driverless AI and work through some data sets and some demonstrations. It's going to be: I do a bit, then you do a bit and can ask some questions, then I do a little bit more, then you do a little bit more, and we'll break it up like that rather than me drivelling on for minutes and minutes.

First of all, hopefully you've all managed to get a network connection; it is crucial. We're not going to be installing software on your laptops because we don't want to waste time; we want to get you up and running quickly. The game plan for this session is as follows: we're going to get you up and running on Driverless AI, we're then going to introduce you to the data sets you're going to play around with, and then really it's showtime — showtime for me, showtime for you, time to get your hands on the product.

First you need to get onto the training environment, which is called Aquarium. You may now have started to notice a bit of a pattern in our naming convention: we've got Sparkling Water, we've got Aquarium, and we've also got an environment called Puddle, so no prizes for guessing the theme. If you navigate to aquarium.h2o.ai, that should get you to a logon screen that looks like this. There's a button that allows you to create a new account; it will ask you for a few details: your name, your organization, your country and your email address. Your email address is important because that's where your password is going to get sent. So: go to aquarium.h2o.ai, create a new account, get a password sent to you, and get to the point where you can log on to Aquarium and see the dashboard. Once you've done that, raise your hand — it's my sophisticated way of knowing how many people in the room have actually got to this stage. As soon as we get to about seventy-five percent, we'll continue.
Once you've got to the dashboard, in the top left-hand corner you will see Browse Labs. Click on Browse Labs and you will go to a screen with a number of lab IDs on it. Click on David Whiting's Driverless AI 1-GPU lab, which is lab ID 4. It's crucial you go into that lab ID, because I haven't set up the other labs. Click on View Detail, choose that lab, and once you're there you will have an option to start the lab. Click on Start Lab, and as you scroll down that gives you access to an AWS machine name: your AWS environment with Driverless AI on it. Click on that and it will ask you to log in to the environment. If you do have challenges, please don't click on End Lab; just give it a little bit of time and then click on the link again.

So hopefully you've got to the point where you've started the lab and clicked on the Driverless AI URL, which gets you to this logon screen. There are two logons, but the logon I want you to use for this training already has some data sets on it, already has some experiments, and already has some visualizations and interpretability results as well. To get that, log on as user "training" with password "training" — quite easy to remember. Click on Sign In; you'll probably have to agree to some terms and conditions (I've already done that), and then you should get to the Datasets overview. Once you get to that point, we'll start the training. Has everyone progressed to that point? Excellent, just 30 more seconds.

OK. While you're logging in, let me explain some of the features we're going to be looking at. We're going to build a machine learning model in the way that Marius talked about, and what Marius did was a great job of explaining the inner workings of Driverless AI. For some people it seems a lot like magic, and there aren't many people that want to run their business on magic, so hopefully he gave you confidence that there's a lot of science behind the building of this product. In the way you interact with it, though, we want to make it as simple as possible. There are lots of different settings you can play around with in Driverless AI, and we're only going to touch the surface in this training. The first key thing you really need to think about is how to tune experiments: you as data scientists are the mechanics to our software. It's up to you to fine-tune the software to run quickly and get to a highly accurate result. You can run it out of the box and you'll get really good results, but your skill is in tweaking it to really get those performance gains. One of the ways you can tweak the software is through some dials that you will see when we create an experiment.
These dials give you the ability to tweak the accuracy of the model, the time it takes to build the model, and how interpretable you want the model to be. One thing we always find is that a new user of Driverless will ramp the accuracy up to ten, ramp the time up to ten, and turn the interpretability down to one — "give me the most accurate model, I don't care how long it takes, and I really don't care how interpretable it is" — and then they wait, and they wait, and they wait a little bit more, and then they phone tech support and say "I've broken it". It's a bit like getting into a car, putting your foot on the accelerator and wheel-spinning. So I would recommend you don't go straight for the 10/10/1 option. We try to stop people doing that by setting some sensible default values when you bring in your data set, but we find everyone just changes those anyway. Just to let you know, there is some science to those settings.

All of this is documented: if you go to docs.h2o.ai, feel free to have a look at the user guide. You can see everything that's going on within the product; we're incredibly transparent about what we're doing. You can see that as you increase your accuracy level we do less and less sampling, so if you've got very large data tables, maybe with billions of rows, the lower accuracy settings will sample the data down. Accuracy also controls the level of ensembling — the joining together of models — the parameter tuning levels, and the number of folds for cross-validation. The time dial controls the number of iterations we go through (we'll see this in more detail), and interpretability controls the different types of transformations used: basically, the more interpretable you want a model, and probably the more regulated the industry you work in, the higher you want that dial. So that's the first thing to remember.

What are we going to go through in the next hour and a half? Really these five areas: we're going to look at some data, we're going to visualize it, we're going to build some models, we're going to do some interpretability, and then we're going to do some freestyling, i.e. you can do whatever you want. We're doing it based on two key data tables. One is some credit card data that looks at how likely someone is to default on a loan; the data set we're going to look at is card_train. There's an opportunity to do some sales forecasting as well: because we are based in Mountain View, we're actually looking at a very unusual data set called cannabis_train. California is a US state where marijuana is legalized, so we're going to look at some cannabis data and forecast sales across a number of different organizations.

So let's go into Driverless AI. This is the bit where I do a bit and navigate through Driverless AI; feel free to follow along if you want, but at the end of each bit I will stop and let you play around a little more, and then we'll go back into Driverless and show you some more. First things first, we need to bring our data into Driverless AI, and we've already done that to save some time.
One of the questions that we did have — let me just go to Slido; if you do have any questions, do feel free to ask them on Slido, and I'll be going back and forth to it — was: does Driverless connect directly with databases or data repositories such as Amazon S3, cloud storage or BigQuery? The answer is yes, it does. When you click to add a data set, you get a link to all the connections that have currently been configured for your system. In this system we can see it's been connected to Amazon S3 and Hadoop, but we can also connect it to Azure Blob storage and BigQuery, and you can upload CSV files into this Datasets area.

So this is where you have the data you want to analyze. You click on a data set and you have a number of different options. Let's have a look at card_train: we can see it's two megabytes in size, nearly 24,000 rows, 25 columns. To see a bit more about that table, click on Details; you get some summary statistics and some graphical objects whose size and shape are determined by the data that's in there. This is a really good start for you as a data scientist just to understand what data you've got. As Marius said, you do have to have your data in a certain format: an analytical data format with a target variable — the thing you're trying to predict — and some features, so one row per account, one row per customer, or one row per time point if you're doing time series forecasting. You don't have to impute missing values before bringing data into Driverless AI; it will do the imputation for you. You also don't have to convert character values into numerics: as Marius talked about, you have to do that conversion for machine learning models, but Driverless will do it for you automatically. So for the column sex, where we've got male, female and other, we don't have to code it as one, two or three before loading. You can also have a look at the data set itself by clicking on Dataset Rows and navigating through it. So that's the first thing you can do: really understand the details of the data.
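As an aside, everything shown here through the UI can also be driven programmatically. Below is a minimal sketch using the driverlessai Python client; the address and file name are placeholders for the Aquarium lab instance, and exact argument names can vary between client versions.

```python
# Minimal sketch with the driverlessai Python client (pip install driverlessai).
# The address and CSV path are placeholders for the lab environment.
import driverlessai

client = driverlessai.Client(
    address="http://<your-lab-instance>:12345",  # URL from the "start lab" screen
    username="training",
    password="training",
)

# Upload a local CSV into the Datasets page, then inspect it.
ds = client.datasets.create(data="card_train.csv", data_source="upload")
print(ds.shape)      # (rows, columns) -- about (24000, 25) for card_train
print(ds.columns)    # column names, including the 'default' target
ds.head(num_rows=5)  # first few rows, like "Dataset Rows" in the UI
```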
The next thing you may want to do as you move through building models is to visualize your data. If you click on Visualize, we run a visualization, and if I click on the one at the top we start to see the visualizations that Driverless AI felt were useful for this particular data source. What we've done is really switch visualization on its head. Traditionally, with visualization tools such as Tableau or Qlik — or the best business intelligence tool on the planet, Excel — you have to take the data, select what objects you want to use to visualize it, look to see if there are any patterns, and then work out other ways of potentially visualizing it. What we've done here is say: let's look at lots of different visualizations and, given the data, choose the best ones. So when you run this on different data sets, you will get different visualizations available to you.

Let's look at the outliers plot. In the outliers plot you will see columns — or variables, or features, call them what you want — but only the ones that have outlying values associated with them. If a column doesn't appear in that gallery, it doesn't have an outlier. So it enables you as a data scientist to focus in on the variables that are potentially problematic, or the variables that have relationships between them. Rather than go through every single plot, I'm going to stop for a few minutes and let you play around with some of these visualizations. Feel free to use some of the different data sets and to look at different types of visualizations. I like the correlation graph, because you can spin things around and see where variables are highly correlated with each other; as you scroll through, you can pick out the key correlations. So have a play with that. If you have any questions, there will be some H2O people wandering around to help you, and if you can't grab one of us, do write it on Slido; I'll try to use that as a central pool of questions. At the end of this we'll be able to collate all the questions and answers and get those to you as well. Play with the visualizations for about five minutes, just to get familiar with the interface, and then we'll come back and build some models.

Just a couple more minutes finishing that off, and then we'll make a start on the next bit. Hopefully you've been up and playing around with the product; let me just go through a few of the questions I had while wandering around. One person was in the correlated scatterplot and asked: can you control the axes that are being displayed? Remember, this is a data-driven visualization tool, so this object only appears when a correlation exists between two variables; there's no control at this stage. If there were more than one correlation between variables, you would see more than one plot; at the moment we're just looking at linear combinations. The size of the dots is important: the larger the circle, the more observations are there. One thing to point out about these visualizations: we're dealing with a fairly small data set at the moment, but they will scale. If you've got millions of data points, the visualizations will automatically scale themselves to keep those relationships easy to interpret. One of the challenges with visualization tools is that sometimes the scale goes a bit doolally when you're dealing with real data; the work that Leland Wilkinson has put into visualization over a number of years gets surfaced within this tool. Also, one thing to note with this data: I don't have any missing data, and the reason I know that is because if I did, I would see a missing-values heat map as well.
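If you want to sanity-check what the correlation graph surfaces against your own copy of the data, a rough offline equivalent in pandas — assuming a local card_train.csv — is simply to rank the pairwise correlations:

```python
# Rank the most highly correlated numeric column pairs, roughly what the
# correlation graph shows interactively. Assumes a local card_train.csv.
import pandas as pd

df = pd.read_csv("card_train.csv")
corr = df.corr(numeric_only=True).abs().stack()
# Drop self-correlations and mirrored duplicates by ordering the pair names.
corr = corr[corr.index.get_level_values(0) < corr.index.get_level_values(1)]
print(corr.sort_values(ascending=False).head(10))  # strongest pairs first
```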
Let me just quickly check whether there are any other questions linked to visualization... no, there aren't. So at that point, let's go out of the visualizations and into the experiments. You can get to different parts of Driverless AI in different ways: you can power everything from the Datasets menu at the top and drive your whole analysis off a data set, or you can go into the individual areas — so if you wanted to drive your visualizations, you can create a new visualization from that screen. What we're interested in doing is creating a new experiment, so I'm going to click on Experiments. One thing you do get in Driverless AI is a lot of help: help in terms of the online documentation, but in-product help as well, such as a tour that walks through the main features. In this case, rather than that help, you've got me to guide you through the product.

We're still going to stick with the credit card data, so let's go into card_train; this is the data set a model will be built upon. The first thing it asks for is a target column. Driverless isn't super, super clever; we do have to guide it, so we have to specify which column we want to predict. That could be sales if we're dealing with a forecasting problem; it could be a categorical target variable or a continuous target variable. I'm going to select the field called "default". That opens up the Driverless AI dashboard, and from there you've got a number of different things you can do.

One thing Driverless does is prevent you from overfitting models. We want to give you the guardrails so you don't build what you think is a fantastic model that, when fitted on some holdout data, completely bombs and leaves you looking a little bit daft in front of your stakeholders. There are internal metrics that control that: behind the scenes it will do cross-validation. But if you've already got a validation data set — other software vendors also have the ability to bring in a validation data set to evaluate the model as it's being built — you can do that by clicking on the Validation Dataset area and choosing one of your data sets. You can also have a completely neutral data set with regards to your model building: a test data set where we know the outcome, but Driverless will not use it at any point in the process of building the model, evaluating features or evaluating algorithms. For this particular data we've got a test data set with the same structure — a target variable and features — but it's not going to be used in the building of the model. Driverless will look at all the columns in the data set and consider them for use unless you specify dropped columns; it may be that you know there are some variables or features you don't want to utilize, and you can drop those.

So how do you interact with Driverless? Here are those dials I talked about earlier: you can click on the pluses and minuses to alter the settings, and behind the scenes they're controlling different parts of the model building process. On the left-hand side we give you indicators of what is actually happening as you change the dials up and down: as you increase the accuracy, different training data sets and different algorithms are going to be used to evaluate the features and the models. When we run through the model build process we need to evaluate it on something, calculating an accuracy value or goodness-of-fit statistic — in our terminology, we evaluate it against a scorer — and there are a number of scorers available for you to use. You may have built models before where you wanted to maximize the accuracy, or maximize the Gini value, or maximize the area under the curve; you as a user can change the scorer the model evaluates on. So that's the simplest way of interacting with Driverless AI.
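For reference, the same set-up can be expressed through the Python client. This is a hedged sketch: the dial and scorer names mirror the UI, but the exact keyword arguments (test_dataset, drop_columns and so on) are assumptions that may differ across client versions.

```python
# Sketch: launching the experiment from the Python client instead of the UI.
ds_test = client.datasets.create("card_test.csv", data_source="upload")

experiment = client.experiments.create(
    train_dataset=ds,         # card_train, loaded earlier
    test_dataset=ds_test,     # held out: never used to build the model
    target_column="default",  # the thing we are trying to predict
    task="classification",
    accuracy=5,               # the three dials from the UI
    time=5,
    interpretability=5,
    scorer="AUC",             # or "GINI", "LOGLOSS", ...
    drop_columns=["ID"],      # features you know you want excluded
)
print(experiment.metrics())   # validation/test scores once it finishes
```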
While we want Driverless to be a black-box approach to running a model, we don't want it to be a black box in terms of tuning; we want you as a data scientist to have different levels of tuning in the process. The way we open that up is through the Expert Settings area. If you go to Expert Settings, you will see all the options you can change for when an experiment runs. Actually, that's technically incorrect: it's the second level of things you can change. There is a config file underneath Driverless AI called config.toml that lets you control every single bit of Driverless and how it runs, and this is really beneficial for data scientists who want to squeeze the extra juice out of the product and really tune Driverless AI: they can get to its inner workings. What we've tried to do is give you a stepped way of getting involved: to start with, the high-level settings of those dials, then the expert settings, where you can turn algorithms on and off and decide what configurations you want the feature engineering to go through. There's a whole range of options in there; I'm not going to go through all of them in this training because I want you to get hands-on playing around with it.

This accuracy of 5 is being driven by the data. If this data set were very large, the accuracy would be controlled by some subsampling of the data. Let me just go back to this slide: an accuracy of 5 basically says the maximum amount of data that goes through the algorithm would be 200 million — and that's not 200 million observations; the combination of rows and columns has to come to 200 million. It's going to do one level of ensembling, so if we didn't want an ensemble model we would just turn that down to 4. It will look at a target transformation, there will be some hyperparameter tuning going on, and there will be cross-validation with three or four folds. "Only first fold model" — not sure what that is. Then it does a distribution check, and this distribution check is really important, because it says: let's look at the distribution of the data in my training data set and make sure it's the same as the distribution in my validation data set and in my test data set — because if it's not, any discrepancies I see within the results could be due to the distribution shift. So as you change your accuracy levels, it depends on which internal settings get changed. And really, accuracy and time link together, because the higher the accuracy setting, the longer the thing will take to run. Time controls how many iterations it goes through (I'll talk about iterations in a minute), and interpretability again depends on what transformations and feature engineering are being used behind the scenes. All of this is available within the documentation, but what we try to do is give a high-level view on that left-hand side of what is actually changing; sometimes when you change a dial, depending on the data set, things don't look like they've changed.
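Driverless AI's internal shift check isn't shown here, but the underlying idea can be reproduced with "adversarial validation": train a classifier to tell the training rows from the test rows. The sketch below is not H2O's code, and it assumes local copies of the two CSVs with the same columns.

```python
# Adversarial validation: if a model can tell train from test (AUC near 1.0),
# the distributions differ and evaluation results may be distorted.
# AUC near 0.5 means no detectable shift.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("card_train.csv").drop(columns=["default"])
test = pd.read_csv("card_test.csv").drop(columns=["default"])

X = pd.concat([train, test], ignore_index=True).select_dtypes("number")
y = [0] * len(train) + [1] * len(test)  # label = which data set the row came from

auc = cross_val_score(RandomForestClassifier(n_estimators=100), X, y,
                      scoring="roc_auc", cv=3).mean()
print(f"train-vs-test AUC: {auc:.3f}")
```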
When I first brought this data set into the experiment, those dials were set based on some high-level settings: Driverless looks at the original data set and says these are good default values. So what I would always recommend is: bring your data in, go with the defaults first of all — don't think you can outsmart Driverless — launch the experiment, and then start to tune. Once you're at a point where you're happy, you click on Launch Experiment, and off the experiment goes.

It starts to pick up certain patterns within the data, and at the bottom you get notifications of what's actually happening. In this particular case it says it has recognized an ID variable and is dropping it from the analysis. And here we go — 27 seconds, I think it was — and we have one model. Now it's going through the process that Marius expertly talked through: engineering features that make this model more predictive. We can see how well the model is performing with regards to an accuracy statistic; we've also got an ROC curve and an area-under-the-curve statistic being quoted here, and we're iteratively going through that process. In the middle we can see which variables are the important drivers for this model, and at the top we can see what stage we're at: here it's an XGBoost model going through a parameter and feature tuning method. If you look at the variable importances, you can see some features with a prefix associated with them: those are variables created by the automatic feature engineering capability of Driverless AI. The raw variables are there too — the bill amount and the status columns, the original variables — but the main variable is a numeric-to-categorical target encoding of marriage. And we're just going through that process of building the models.
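Those "TE" features are target encodings. Driverless AI's actual transformer is more elaborate, but the core idea — replace a categorical level with its out-of-fold mean target rate, so the target doesn't leak into its own features — can be sketched in a few lines. The column names here are assumptions about the credit card data.

```python
# Out-of-fold target encoding, the idea behind features like NumToCatTE.
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(df, col, target, n_splits=5):
    """Encode each row's category as the mean target rate,
    estimated on the *other* folds to avoid target leakage."""
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, enc_idx in kf.split(df):
        means = df.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = df[col].iloc[enc_idx].map(means).to_numpy()
    return encoded.fillna(df[target].mean())  # unseen levels: global mean

df = pd.read_csv("card_train.csv")
df["marriage_te"] = target_encode(df, "marriage", "default")
```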
What you don't have to do is leave that window open and go and get yourself a cup of tea before you can continue working; unfortunately for you, you can run everything in parallel. At this point you can see the model I kicked off is still running, but you can also see other experiments that have already been built, so let's go to this model someone has already created on this data set. This gives you a good view of the accuracy; it has a test score associated with it — we evaluated on AUC and it had a value of 0.646. Once a model has finished running, the wheel at the top converts into a set of buttons, and you get summary information in the bottom right-hand corner plus some variable importance statistics. If you're following along with me, feel free to go into a completed experiment. If you want to understand at a high level what has happened, look in the bottom right-hand corner: you can see information about the settings, what data the model was built on, what the test data set was, and what the target column looked like.

The thing I find really amazing every time I demonstrate this product is the feature engineering part. How long did this model take to run? In total, about two minutes. The data set originally had 25 columns, but as part of its automatic feature engineering Driverless created 822 new features, and the engine felt that 38 of them were predictive enough to go into its final model. This is where you as a data scientist really get the performance gain: creating 822 features in code would have taken you a long time. So we're now at a point where we've got a model built in two minutes, and it's a model that is hopefully highly accurate. We'll explore it in more detail a little later on, but we're at a point where we have an experiment.

One thing I do want to show you is how it's different — and it's subtly different — if you want to apply a forecasting recipe. I'm talking about this concept called a recipe: a thought process that the Kaggle Grandmasters have gone through to solve a particular data science problem. The problem we were just solving was a traditional predictive modelling problem, but we may want to solve a forecasting problem where we've got time as a dimension. Again, you need to make sure your data has the right structure, and this cannabis data does: it has one row per time period of sales of a product. The variable I want to forecast is this demand variable, so I select that. The only thing you have to change from the methodology I showed you earlier is to specify a time column. If you specify the time column to be the sales date, you get some further options in the top left-hand corner, because we've now switched from a traditional predictive modelling problem to a time series forecasting problem.

There are some additional things we'll want to set. You may want to produce a forecast per store, or per store and per product line, so you're able to create grouping columns: we're grouping on sales date, but we may also want to group by organization — in this case, retailer — so we choose those and click Done. We also need to give it a clue about how far into the future we want to forecast; here we can specify a number of time periods, so given I've got daily data, I may want to forecast three days into the future. Then something else comes up, referred to as a gap: basically, after how many days do you want to start predicting? We're at a time point here, and we're defining how many periods should pass after that time point before we start forecasting. This is important for retail use cases where I can't forecast tomorrow, because tomorrow is too soon: I have a specific lead time, so I'm at time point t and I can only start forecasting maybe a week in advance, meaning I want a gap of seven before my time series forecast starts. You can change that, and then you click Launch Experiment.

The reason it's crucial to specify the sales date is that the challenge with forecasting problems is keeping the date in a consistent form when you do sampling and create cross-validation data sets. You can't just randomly select cross-validation folds across multiple dates; you have to keep your date dimension in place and use various rolling windows, and that's what the time series recipe does: it makes sure you solve the problem while maintaining date as a factor. Everything else is exactly the same: you launch your experiment, it looks at features and creates new ones, but because we're now dealing with time series data it creates features based on the date. It will start to extract maybe the day of the week, the month, the year, and it will consider lags of particular features and lags of the target variable when creating additional new features.
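The horizon and gap settings map directly onto how those validation splits are made: rolling, time-ordered windows. scikit-learn's TimeSeriesSplit illustrates the same mechanics on thirty days of dummy daily data, with a three-day forecast horizon and a seven-day gap after the training window.

```python
# Rolling time-ordered splits with a lead time, mirroring horizon + gap.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

days = np.arange(30)  # 30 consecutive daily time points
splitter = TimeSeriesSplit(n_splits=3, test_size=3, gap=7)
for train_idx, test_idx in splitter.split(days):
    print(f"train up to day {train_idx[-1]}, "
          f"then a 7-day gap, then score days {test_idx[0]}-{test_idx[-1]}")
```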
So let's go back to hands-on. What I want you to do now, for the next ten minutes, is have a play around with Driverless AI, building some of these experiments. Let's get you building a predictive model: feel free to use the credit card data, feel free to use the cannabis data, but focus on building an experiment. Play around with those expert settings, get an understanding of them, and get at least one experiment to the point where you see these big buttons. Once you get to the big buttons, we'll restart and I'll explain some of them to you. Technical term, that: "the big buttons". Any questions, feel free to raise a hand or ask on Slido.

Just one thing to point out with regards to asking questions: fortunately, I don't have to answer every single question you put up there — Marius is answering questions at the back as well. So when you go onto Slido you will see some of the questions you've asked with replies to them. I would love to take credit for all of those; however, it's Marius at the back answering them.

So, if we just spend a couple more minutes on that, you'll be at the point where we've built a model with a certain level of accuracy. Having wandered around, that gives you a quick initial view of how to build a model, and we are going through this at a rapid pace; as I said, right at the end we'll have an opportunity to consolidate all of that learning. I'm very conscious that Driverless has a lot of key features, and I want to bring out a lot of them for you. If you go into an experiment — and if you haven't managed to complete one, feel free to open an experiment that has run and has the status of "done" — you can see those big buttons.

So what do those big buttons allow you to do? The first one I want you to look at is Download Experiment Summary. It takes everything that you've done — well, everything that Driverless AI has done — and packages it up into a report that you can share with your external stakeholders. I believe most of you are data scientists. Who has to write documentation as part of their model building process? The majority of you. Who likes doing it? Probably none of you. So this is the feature that ticks that checkbox: you have to write documentation, but you get someone else to do it for you. When you download the experiment summary you get a zip file, and when you open it up you get access to a lot of the underlying results in various file formats.
But the magic document you want is a document called "report". This is a Word document that goes through everything Driverless AI has built. You have information about the experiment — a quick overview of how many features were considered and how many were built into the model — and then what data was used, what settings were used, and what version of Driverless was used. A lot of this documentation was built with a number of the first adopters of Driverless AI technology, and a lot of those early adopters were financial organizations. Financial organizations are heavily regulated, specifically around their credit scoring models, so what this enabled them to do is have information about the data that was fed into Driverless; to make sure any shifts were detected; and to record what the assumptions of the model were, what the process was, how many models were considered, how long it took to build, and all the configuration options. Some of those configuration options you may look at and think, "well, I didn't set those options" — and you'd be right: you didn't set them, but the system did, and those are the config settings that sit behind the scenes. If you have trouble sleeping, feel free to download this report and have a read through it. But regulators love this stuff, and it allows you to back up why the model was built. What is also important: sometimes you look at the screen within Driverless AI, see a variable name like the num-to-cat TE one, and wonder what that variable actually is. In this report you get an English translation of what that variable represents — well, when I say an English translation, it is real English, but sometimes it's a bit geeky even for me. You can see all of the variables in the model, what the final model was, how it was built, its performance, and some goodness-of-fit statistics as well. Very, very comprehensive, and all I've done to create it is press a button.

Another thing data scientists always hate: once you've built a model, it's only of value if you can then deploy it. If it still takes weeks to deploy a model, all you've done is shift the bottleneck from building a model to deploying a model. So, as Marius said, we have the capability to turn that model into either a Python scoring object — the button you press is Download Python Scoring Pipeline — or a MOJO, which allows you to deploy the model as a Java object. Anywhere this model needs to execute, all it will need is either Python, if you're deploying into Python, or Java. You don't need to go back to an H2O environment to do the scoring. Why is that important? You may be building machine learning applications or models that you want to deploy away from a cloud environment, maybe directly to where the data is being generated, i.e. on a mobile phone. Your mobile phone isn't going to be running H2O, and it isn't going to be running Python, but it probably can run a Java object in some way. So you now have an object you can deploy into an application that runs on a specific device, which gives you flexibility and doesn't hook you into doing scoring in a cloud environment.
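Once downloaded, the Python scoring pipeline is used roughly as follows. The package name is generated per experiment, so the import below is a placeholder; the README inside the downloaded scorer bundle gives the real one, and the input file is a stand-in.

```python
# Sketch of scoring with the downloaded Python pipeline. The module name
# scoring_h2oai_experiment_<id> is experiment-specific (placeholder here).
import pandas as pd
from scoring_h2oai_experiment_abc123 import Scorer  # hypothetical id

scorer = Scorer()
new_data = pd.read_csv("new_customers.csv")  # same columns as training data
predictions = scorer.score_batch(new_data)   # default probabilities per row
print(predictions.head())
```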
In the next version of Driverless AI we will also be deploying the model as a C++ object: a number of IT departments have said "we want that model as a C++ object", and we'll be able to do that in 1.7, which is hopefully available very soon. There are a few other options too: you can download the predictions to look at in another visualization tool, if you wanted to integrate with Tableau or Qlik, and you can score another data set from within Driverless AI using this particular model. That's what I call the ad hoc scoring method — you're not going to productionize that way, but you may want to score a data set like that.

The other key button you want to press is Interpret This Model. At the moment we've taken data, worked out the best features to go into a predictive or machine learning model, and decided what algorithm we're going to use and what hyperparameters are associated with it, and we've done all of that in a completely — well, semi — lights-out way: you've seen the logic it applied, and we've got a highly accurate result. What we then want to do is interpret that model: really start to understand how those predictions have been generated. Clicking on the Interpret This Model button allows us to do that. In true Blue Peter fashion — this dates me, and it also shows that I watched a lot of kids' TV when I was younger — here's one I prepared earlier. If I go into the MLI page of Driverless, I get to the interpreted models, and if I click on one of these it gives me a wealth of information with regards to interpretability. This can seem fairly overwhelming at first, so I'm going to guide you through some of the important bits of this screen.

The first bit I want to take you to is the dashboard, which gives you a summary of some of the key interpretability measures of this model. The key thing we want from all of these machine learning models, especially if we want more and more adoption, is to make sure they're fair, that they're accountable for the decision-making they do, that they're transparent, and that they're explainable, and all of these graphs help you on the journey to hitting those four dimensions.

Let's start with the graph in the top right-hand corner. It's very similar to the graph you saw within your experiments area, but the difference is that rather than showing the feature importance of the newly created variables — which could be incredibly complicated; as Marius went through earlier, there's a lot of feature engineering going on, and you as a data scientist have to explain this back to a business user, and even "the key feature in the model is the interaction between gender and income" is quite hard to explain — this feature importance goes back to the original variables that went into the model, to show the key driving factors, ranked from highest to lowest. In this experiment the feature status_1 was the most predictive variable, followed by status_2, status_3 and status_4.

A business user will say, OK, that's useful, but the next question they'll ask is: you've said status_1 is important, but what is the relationship between status_1 and the thing we're trying to predict? If we increase the value of status_1, what impact does that have on my predictions: do they go up, do they go down, or do they stay the same? To get the answer, you go to the bottom right-hand corner, where Driverless AI creates a partial dependence plot for you, and you can navigate through each of the variables to see how that variable relates to the target once the model has been built. In this case I've selected status_3: as I increase the value of status_3, the probability of defaulting on a loan goes up a little bit and then starts to level off. So that's the second way we explain the results.
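Driverless AI draws the partial dependence plot for you; for a model of your own, scikit-learn produces the equivalent. The feature name "status_3" below is a stand-in for whatever the column is actually called in your copy of the data.

```python
# Partial dependence: average model prediction as one feature is varied,
# with everything else held as observed.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

df = pd.read_csv("card_train.csv")
X = df.select_dtypes("number").drop(columns=["default"])
model = GradientBoostingClassifier().fit(X, df["default"])

PartialDependenceDisplay.from_estimator(model, X, features=["status_3"])
plt.show()
```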
The third way we explain the results is to use predictive modelling — machine learning algorithms — to predict the predictions, and that's what you see in the bottom left-hand corner: a decision tree that is being used to predict the predictions. Why is this important? You can start to see business rules that drive you to a high prediction or a low prediction from the model. You can now go to the business and say: if a customer had a status_1 value of one, and a status_2 value of 0.2, and a status_6 value of two, three, four or something else, then their probability of defaulting is quite high; if instead you go down the left-hand branch, their probability is quite low. That's what a decision tree allows you to do in this context.

Then at the top you have a K-LIME explanation. Has anyone come across LIME or K-LIME before? A few hands. It's another approximation method that takes the outputs of your model and sees if we can create logic, or reason codes, behind them. Let's look at that in a bit more detail. If I go and find the surrogate models, here's the K-LIME plot. What I see is what my younger daughter would say just looks like a mass of stars — a galaxy — and she'd be right, I suppose; it is a lot of dots. But what are the important things to take home here? The first is this R-squared value. For the reason codes that come out of this K-LIME plot, you want a high R-squared, because what that says is that this surrogate is a good approximation of the probabilities coming out of the Driverless model: the closer it is to a hundred percent, the better the approximation. The second thing you want to pull out is the explanations: if you click on Explanations, you get information about the positive contributions to the score and the negative contributions. What we can see is that if someone has a status_1 value of one, their probability on average increases by about 0.22; a status_3 value of one increases it by 0.16, and a status_6 value adds a further positive contribution. So there we've got the highest contributions, and there are negative contributions too: in this case a status_2 value of minus two decreases the probability of defaulting on a loan by 0.13.
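Conceptually, K-LIME clusters the rows and fits a simple linear surrogate per cluster that predicts the complex model's predictions; the surrogate's R-squared says how far you can trust its reason codes, and coefficient times feature value gives each contribution. Below is a bare-bones sketch of that idea — not H2O's implementation — assuming X is a numeric feature matrix and model_preds are the complex model's probabilities.

```python
# K-LIME-style surrogates: k-means clusters + one linear model per cluster,
# each fit to the *model's predictions* rather than the original target.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

def klime_style(X, model_preds, k=14):  # k=14, as in the demo data
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    surrogates = {}
    for c in range(k):
        mask = clusters == c
        lm = Ridge().fit(X[mask], model_preds[mask])
        r2 = lm.score(X[mask], model_preds[mask])  # want this near 1.0
        surrogates[c] = (lm, r2)
    return clusters, surrogates

# Local reason codes for row i: surrogates[clusters[i]][0].coef_ * X[i]
```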
If you click on Go To Additional Features, you get a whole list of them, so you can see the reason codes for the overall model. These are what we call global reason codes: reason codes for how the model makes its decisions overall. But we know this data probably has groups of people in it — groups that are similar in their characteristics — and the K represents how many groups, or clusters, we could create to make a better approximation. Every person gets put into one of K groups; in this case we've got 14 groups, and again, Driverless has looked at the data and worked out how many segments we have. What this allows us to do is go to an individual data point, if we want, and ask: what are the reasons specific to that data point? I'll just select a data point at random. This data point in the top left-hand corner is customer number 30,664; they've gone into cluster twelve, there are a few people in this cluster, and the R-squared is high, which is good. Now when I go to my explanations, I get not only global reason codes but local reason codes as well: the reason codes specific to this particular customer that gave them a high score or a low score. At this point I can see that the actual value for this customer was that they defaulted on a loan; the model said they had a 41% chance of defaulting, and the LIME approximation said 0.42. So the accuracy of these LIME values is very good, and they can be taken quite literally as good approximations. These are the values this customer had: a status_1 value of one, which increased the probability by 0.22, and a status_5 value of two, which only increased their probability by 0.046. These are reason codes that are fairly easy to explain back to the business: not just how was the overall model decision made, but how was an individual decision made as well.

This is really important when you're scoring transactions in real time, where you potentially want these reason codes calculated in the same way as your model predictions are created. You may be building a model that predicts whether or not a transaction is fraudulent; once you identify a transaction as fraud, you then want reason codes for how the model made that decision, so you can feed them into an operational system that goes to a human to explain why a particular transaction has been rejected. So that's how K-LIME works and what's available through it.

A lot of this work around model explainability is new and constantly evolving; at the back there's a red book — I can't remember the title of it — that covers the theory behind a lot of what you're seeing on the screen, so feel free to grab a copy. Remember, though, that the methodologies we've talked about here are approximations. Some organizations, specifically financial organizations, don't want an approximation; they want the mathematical inner workings of the models. There has been a lot of work on how to get those mathematically exact values, and the SHAP value, or Shapley value, allows you to see the exact contribution of each of the features of the model.
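Driverless AI computes these Shapley contributions natively; for a tree model of your own, the shap library's TreeExplainer gives the same exact (TreeSHAP) values. A sketch on a stand-in model:

```python
# Exact per-row, per-feature contributions for a tree model via TreeSHAP.
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("card_train.csv")
X = df.select_dtypes("number").drop(columns=["default"])
model = GradientBoostingClassifier().fit(X, df["default"])

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one contribution per feature per row
print(shap_values[0])                   # local contributions, first customer
```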
These are really important specifically within credit scoring. In credit scoring there's been a lot of work done in just the last 24 months on using machine learning algorithms and opening them up, allowing the decisions to be explained to the regulators, and that's where Shapley values have been incredibly useful and valuable. The downside with them — because there's always a downside; there's no such thing as a free lunch — is that they take a very long time to calculate. If you've got hundreds of thousands of accounts, or millions of accounts, you do have to wait some time for them to execute, and what we're working on is making that process more efficient running across a large CPU estate or a large GPU estate.

So what I've hopefully done is give you a view into Driverless AI. We've got you to a point where you've brought in some data, you've run an experiment, and you've had a look at the experimental results. It's 20 past seven now, so feel free to go through any of these tasks: try those data sets, try different experiment settings, try running some different interpretability results and understanding how they're built up — just freestyle it, just play with it. You can't really break it, but if you feel you've broken it, the Amazon server will come down anyway in an hour's time. Any burning questions, anything you want me to show while I'm up here?

OK, we've got two questions. Yes — so it's always going back to the original data. You're right: if there were some sort of labelling or mapping... the reason you have such a small number is that you only need, for example, eight bytes of storage for a 1, whereas a full description needs more storage space. But it's a good point; we maybe want to put in some sort of mapping or labelling. That's a good idea, and I'll feed it back to engineering. Having worked in the software industry for a while, one big difference we have is how connected our customers are with our engineering: as soon as you become a customer of H2O, you really are the product managers of how our software evolves.

There was another question — someone had their hand up, but they might have just discovered the answer. "I was just a little bit confused about the Shapley value plot: you've got a global score and a local score, and I wasn't sure what the local score was, but then I noticed in the top left you've got a box where it says row number." Yes — the local score is calculated for that specific observation. Is that right? That's right: the local score is assigned to that specific observation, and to get rid of it you just delete that number — there we go — and then you're back to the global Shapley values. "Got it, fine." Excellent.

So, as I said, it's up to you now. We want you to have as much hands-on time with Driverless as possible, and I wanted to make sure I navigated you through it, so hopefully you found that useful. We're going to be around for quite a bit more time, so feel free to play. Marius sits at the back answering the hard questions, only because he's more likely to have the ability to answer them than I have.
Thank you very much, and we'll be around.
Info
Channel: H2O.ai
Views: 1,090
Rating: 5 out of 5
Keywords: H2O.ai, H2O, Driverless AI, H2O Driverless AI, Technical Training, Intel AI, Intel
Id: 5t2zw4bVfsw
Length: 89min 19sec (5359 seconds)
Published: Wed Jun 19 2019