What's new with H2O Driverless AI?

Captions
Thanks for coming today. H2O Driverless AI is our flagship product, it's doing well in the market, and a lot changes every few weeks or months; whenever we talk to people there's new stuff to announce. Today I'm going to present the introductory slides from H2O AI World in London a month ago, and even since then a lot has changed.

H2O.ai is a machine-learning company; that's all we do. We write machine-learning software and our users use it. There's open source and there's closed source. H2O-3 is the open-source platform on which you can build scalable machine-learning models on up to terabytes of data, even on Spark systems with Sparkling Water: essentially unlimited size, very fast training, accurate models. We have deep learning, gradient boosting, random forests, linear models, and unsupervised methods such as clustering, dimensionality reduction, and anomaly detection. That's H2O-3.

Today we're talking about Driverless AI, the proprietary product that takes it another step further and does the feature engineering for you. If that term isn't familiar, feature engineering is creating new columns for the dataset. You usually have a big Excel-style table, say a hundred million rows by a hundred columns, and you make extra columns out of those existing hundred columns. That's the art of feature engineering, and when you have domain knowledge and do it right, you can get a huge lift in your model's performance.

For example, say the outcome you want to predict is a person's salary, and all you have is their job title and their zip code. That alone is very hard for a model, because every combination of zip code and job title has a different salary, sometimes several; you could just predict, say, the mean of all engineers in Palo Alto, and that's a prediction. But maybe you can do something better: take only the people in this zip code with two kids and four cars, compute the mean salary of those people, but do not include the person themselves in that mean. For every person in the dataset you put in the number computed from everybody else in a similar situation. That gives you an out-of-fold estimate, and that's what's called a good feature, because it's a very valuable number if you can compute it from the dataset. You are, in effect, leaking the target column y into X, the matrix of features, and that is an art: if you do it wrong, you put the answer into the data, you overfit, you basically cheat yourself, and nothing works. But if you do it right, carefully, with out-of-fold estimates where you never use your own row to compute your own feature, you suddenly get a much better model.
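To make that out-of-fold target-encoding idea concrete, here is a minimal sketch, assuming a pandas DataFrame with hypothetical job_title, zip_code, and salary columns; it only illustrates the concept, not the actual Driverless AI transformer.

```python
# A minimal sketch of out-of-fold target encoding. "job_title", "zip_code", and
# "salary" are hypothetical columns used for illustration.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, group_cols, target, n_splits=5, seed=42):
    """Encode a categorical interaction as the mean target of *other* rows."""
    oof = pd.Series(np.nan, index=df.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        # Group means are computed only on the "fit" fold...
        means = df.iloc[fit_idx].groupby(group_cols)[target].mean()
        # ...and applied to the held-out fold, so no row ever sees its own target.
        keys = list(map(tuple, df.iloc[enc_idx][group_cols].values))
        oof.iloc[enc_idx] = [means.get(k, np.nan) for k in keys]
    # Combinations unseen in a fit fold fall back to the global mean.
    return oof.fillna(df[target].mean())

df = pd.DataFrame({
    "job_title": ["engineer", "engineer", "nurse", "nurse", "engineer", "nurse"],
    "zip_code":  ["94301", "94301", "94301", "10001", "10001", "10001"],
    "salary":    [150, 160, 90, 95, 140, 100],
})
df["salary_by_job_zip"] = oof_target_encode(df, ["job_title", "zip_code"], "salary", n_splits=3)
print(df)
```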
Those kinds of tricks are what the Kaggle grandmasters use. Kaggle, as you know, is the website Google acquired that does nothing but online data-science competitions, and there are only about a hundred Kaggle grandmasters in the world; six of them work for us here at H2O.ai. That's a real joy for us, because they know what they're talking about. You don't need to explain to them what an out-of-fold estimate is; they come in already knowing it, they've done a hundred competitions, they know exactly what the mistakes are, so when they show up they basically tell us what to do, what's missing in the product, how to make it better. And they can code: Python is of course something they've mastered. When we write a product like Driverless AI, we write it in Python, with C++ and CUDA as the engine underneath, but the actual algorithms are controlled from Python, which means everybody who knows Python can help. That's great for a company, because now we can have fifty contributors and not just five; if you had to do all of this in C++, it would be much harder to get a big team working together.

This was a year ago, when we got the Editor's Choice award and then the Technology of the Year award with Driverless AI. Basically, we were able to create good models out of the box without any human interaction. You might say anything that just calls fit is good enough, but sometimes you can get a two or three percent improvement just from this extra column generation, this out-of-the-box feature engineering, and to some people that two or three percent matters a lot, especially in fraud prevention: if you can reduce your fraud from ninety million to eighty million a year, that one model was worth ten million. There definitely are companies with that scale of benefit if they apply AI properly.

AI, to us, is something that is smart about how it deals with the data, and the data doesn't have to be images or text. In our case it actually cannot be images; Driverless AI does not handle images yet. It handles strings, text fields such as a description, a tweet, or a document; numbers; and categoricals, for example your gender, the type of car you drive, or how many kids you have. So it's a big Excel-style table, if you like. It can be hundreds of gigabytes, but typically the sweet spot is on the order of one to ten gigabytes. It takes this dataset, enriches it, and tunes hundreds of models with hundreds or thousands of extra features that it creates; it figures out the right combination of this enrichment and the model fitting, it even builds an ensemble of models at the end, and then it produces a deployment package, called the scoring pipeline, that you can take outside of Driverless AI to productionize your model. For example, there is a Java version that is completely standalone, pure Java, nothing to do with Python or C++, which does exactly the same thing you saw on your screen while training in Driverless AI: a one-to-one mapping of the model we fit in Python into Java. We're working on a C++ version as well, and of course there is a Python version that you can import into your Jupyter notebook to make predictions. These are all standalone, so you don't need Driverless AI on the instance where you import the model that was previously trained in Driverless AI.
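The key idea is that the exported artifact bundles the learned feature engineering and the model together, so scoring needs nothing from the training environment. As a rough analogy only, not the Driverless AI scoring package itself, here is what that concept looks like with a plain scikit-learn pipeline that gets serialized and then loaded "somewhere else" for scoring:

```python
# Rough analogy of a standalone scoring pipeline (NOT the Driverless AI artifact):
# the learned transformations and the model travel together as one object.
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.random.default_rng(0).normal(size=(200, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# "Export" the whole pipeline; the bytes can be loaded elsewhere, with no training server.
blob = pickle.dumps(pipeline)
scorer = pickle.loads(blob)
print(scorer.predict_proba(np.array([[0.5, -1.2, 0.3, 2.0]])))
```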
Driverless AI, then, is the platform where you train things, and you take the result outside to make predictions. This is the architecture; it looks a bit complicated at first, but the middle is just the training brain, and you connect to it from the bottom left, either with a Jupyter notebook or simply through the web page. It has a web server built in, so you connect to a port, just like with H2O-3, and you see a website. That's what you see here: I'm connecting to port 80, which we map to port 54321 inside that web server, so it looks like a website, but it's actually Driverless AI. This is a listing of the datasets; I can also list the experiments and other things, if I've created any. I can go to Messages, which shows that version 1.4.2 is out and lists what's been added. It's hard to read from the audience, but we have support for IBM Power in this 1.4.2 version, so everything is compiled both for Linux x86 and for Power systems. We ship it as a Docker image or as a native RPM or Debian package that runs anywhere, and on Power it's probably also an RPM package, whatever the build has to be to make it work there. The improvements in this release are various; there's always something. You can read the release notes to see what's going on; I'll do a demo later and we'll go through these points as time allows. There's a lot of installation guidance, the key features, all the different versions, and the changelog; let me scroll down: this is November, this is October, a lot of changes; back in September things like LightGBM, XGBoost, and GLM, which have been in for months; support for compressed files; support for data ingestion from Google BigQuery, from Minio, from S3, from Hadoop, from Parquet files, for example; support for Snowflake, Redshift, KDB, you name it. Basically whatever you want to import, you can import.

So the data goes into the system and you tell it what to do, which usually is just to go and fit a pipeline, not just a model but the whole data-engineering pipeline together with the models, so it can make predictions in the end. When you're done, you can deploy it using pure Java, Python, or soon C++, as I mentioned, or you can use the embedded scoring service, where we spin up a server and you just post HTTP requests and get an answer back, or even a binary protocol where you send Thrift messages asking for a score. Thrift of course comes with code generators for all kinds of languages, so you can send a request from, say, a C# client that wants to score a certain model.

Now, run this on a dataset that was on Kaggle three years ago. Kaggle has these competitions where you have two months to compete, and at the end you see whether you're good or not. While you're making submissions over those two months you can see how you're doing on a public leaderboard, which is scored on a special piece of the holdout data: a random third of the test set (you don't know which third), while the other two-thirds is held back for the final scoring.
No matter how well you do on that public third, it doesn't mean you'll do well on the other two-thirds; the real moment of truth is at the very end, when you get scored on the two-thirds that nobody has seen the answers for. You saw the data, but not the actual answers. The challenge, then, is to throw the training data into Driverless AI, let it run, give it the test set at the end and say "predict for that", and submit the result. Kaggle looks at the secret two-thirds and tells you how well you did. When you do this today, you land at roughly tenth place out of the three thousand teams that competed, out of the box.

And this is a dataset where feature engineering mattered a lot. You need something like an eight-way interaction of categoricals: how many kids, how many cars, which zip code, which income group, and so on. Once you have that interaction, that subgroup, the question is whether those people will need help with this claim-management process or not; there was a binary outcome related to those high-cardinality categorical features, plus numeric features. It's a regular Excel-style table, if you like, but one where the columns interact a lot. You could say a neural net, or gradient boosting alone, can figure out the interactions, but not quite: if you throw this into a plain algorithm like TensorFlow you'll get about a 0.45 log loss, with XGBoost maybe 0.445 or so, but with this system you get 0.43 or even below. It's not a huge difference, but it's significant enough to make money for those who care, and that difference moves you roughly 2,000 positions up the leaderboard. Everybody can fit a LightGBM out of the box, let's say, but not everybody can make these interactions explicitly and compute the out-of-fold estimates of the outcome for each group, that is, for every group, compute the outcome from everybody who is not in that particular fold. It's a concept you have to get used to, but once you are used to that kind of thing, it helps a lot.

Now, I ran the same experiment here; it's still running, about to finish up. I started it about an hour and a half ago and it ran through all these iterations. On the left you see this chart where the yellow points keep going lower and lower; that's the error, in log loss. Log loss measures how well I get the probabilities right for a yes/no prediction. If you predict yes and the truth is no, your error for that row is about 30 (we clip it there), so if you got every single point maximally wrong, the mean log loss over all the rows would be 30. In this case the number is actually measured internally on the training data you gave us: we make our own splits to figure out how well we're doing. We can't just fit on everything and say "good luck"; we fit on two-thirds or so and measure on the other third, and we do this over and over on different two-thirds to get an idea of how well we're doing on this dataset. If we got every probability of yes or no exactly right, the log loss would be zero. We're at a log loss of about 0.4, which means the error in the probabilities is relatively small.
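A minimal sketch of binary log loss with the kind of clipping described here; the clip value (so that a maximally wrong answer costs about 30) is used for illustration, not necessarily the exact constant the product uses:

```python
# Binary log loss with clipping, so a completely wrong prediction costs ~30 per row.
import numpy as np

def log_loss(y_true, p_pred, max_penalty=30.0):
    eps = np.exp(-max_penalty)                      # clip so -log(p) never exceeds ~30
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

print(log_loss([1, 0, 1], [0.9, 0.1, 0.8]))            # good probabilities -> small loss
print(log_loss([1, 1, 1], [1e-15, 1e-15, 1e-15]))      # maximally wrong -> about 30
```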
The number can blow up because of the logarithm: as the predicted probability for the true class goes toward zero, the log goes toward negative infinity, but we typically cut it off at around 30 in machine learning. As I said, if you get it totally wrong, the log loss is about 30, because it's the negative of the log, and the log of 1 is zero, which is where you want to be: if the truth is a yes you want to predict close to 1, and if you predict 0 instead you'd have an infinite error without the clipping. So where does a log loss of about 0.4 put you? Very roughly, it's like predicting around 0.9 for the right answer on average, being off by ten percent or so in probability terms, even though the actual answer is always exactly 0 or 1; roughly nine out of ten right, one out of ten wrong, something like that. You can make your own rough guess, but it helps to have an intuition for what a log loss of 0.4 means if you're a data scientist: 0.4 is a pretty good log loss here, 0.45 is also not bad, and 0.43 is just slightly better. If all you want to predict is whether this fan is going to blow up in my face next year, it's not that important to get the probability exactly right; anything above 0.3 or so will scare you. But if you do this in a systematic fashion across millions of transactions and you really need to figure out which ones matter, then you care about that last digit.

What the system did is make features like this one, which reads: out-of-fold mean of the response, grouped by four columns. We take those four columns, say age, zip code, income group, and age group, where income group and age group are computed automatically from the numeric columns by binning them; that four-way interaction has a mean outcome, and that number is computed over five folds, where for each fold we use the other four folds to compute it. We substitute the actual four-way interaction with that number and add it as a new column in the dataset. That's just one transformation; there are many others. You can do clustering and compute distances to centroids, you can compute dimensionality reductions, you can do the binning I mentioned, you can have other models make estimates that become inputs, you can take simple differences or ratios between columns. And of course we have to figure out which of these interactions are important; how do you know which group to make? It's driven by the algorithm itself: it figures out which columns matter and then does more with those columns. It's a smart way to explore the space; it's not super greedy, so we don't only exploit what we can already see, we still explore, and there's a balance between exploration and exploitation. There's a genetic algorithm at work: all those yellow points at any given moment mean there are something like eight individuals in this race, and at every step half of the racehorses die because they're worse than the other half, and they get re-cloned with a small mutation, if you like.
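To make that tournament-and-mutation idea concrete, here is a toy sketch of that style of evolutionary search; the population size, mutation scheme, and scoring stub are illustrative assumptions, not the actual Driverless AI internals:

```python
# Toy sketch of tournament-style evolution over "individuals" (here: feature subsets).
import random

CANDIDATE_FEATURES = [f"feat_{i}" for i in range(40)]

def score(individual):
    # Stand-in for "train a model on these features and return validation log loss".
    rng = random.Random(hash(frozenset(individual)) % (2 ** 32))
    return 0.45 - 0.002 * len(individual) + rng.uniform(0, 0.01)

def mutate(individual, flip_prob=0.05):
    child = set(individual)
    for feat in CANDIDATE_FEATURES:
        if random.random() < flip_prob:
            child.symmetric_difference_update({feat})   # randomly add or drop a feature
    return child

population = [set(random.sample(CANDIDATE_FEATURES, 5)) for _ in range(8)]
for generation in range(10):
    ranked = sorted(population, key=score)              # lower (log loss) is better
    survivors = ranked[:4]                              # the worse half of the racehorses dies
    population = survivors + [mutate(s) for s in survivors]   # re-cloned with a small mutation

best = min(population, key=score)
print("best individual:", sorted(best), "score:", round(score(best), 4))
```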
The mutations aren't purely random, either; sometimes the individuals can learn from each other, so if one racehorse is consistently better than the others, we make more of those, but we still mutate them a little to give them a chance to get even better. There's also some randomness at play, Monte Carlo if you like; a random number generator is used in various places for statistical tricks. There's also a reusable holdout, which tells you how well you're doing on a holdout set. If you hold out a third of the data and build the model on the other two-thirds, you don't just compute one score on that third and say my log loss is 0.45; that's not enough. You measure the log loss many times on subsamples of that third, which gives you a distribution of log losses, and that distribution is a much better estimate of how you'll do on an unseen dataset than a single number. Only if that distribution shifts significantly between models on that holdout third do we say yes, we found something that's actually better, and that's when we take a step down on the chart on the left. These are all systematic changes that are actually significant, and they each have an error bar; you can't see it from here, but there is an error bar on every yellow point, so we know we really got better.

Right now it's putting together the final blend for this ensemble. There's an ensemble of, let's see, three times four or three times five, so twelve to eighteen models; I didn't look up the exact numbers in this case, but there will be something like a four- or five-fold cross-validation model, done three times: three different XGBoost and LightGBM models get blended together, and each has its own five-fold cross-validation scheme so it can tune its parameters on data it hasn't seen. So there's a lot of logic going into this, data-science expertise applied automatically for you, and when it's done, that model will be very predictive, because it has been tested to do well out-of-fold on data that looks like the training data. If the training data suddenly changes next month, say it's spring versus fall, or the campaigns change completely, you might need a new model; you'd have to measure this model's performance on a new dataset and see whether it's still good, and if not, retrain. We're not doing that in the GUI yet; the GUI doesn't have a facility to help you with that, but it's coming soon.

Let me show you what you can do when this is completed. You can see the GPU usage, the gains and lift curves, the precision-recall curve, and the ROC curve. This point on the ROC curve is where we have the best F1 score, and here you can see the confusion matrix associated with it; there's a way to step through different thresholds, and every threshold has its own confusion matrix.
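A minimal sketch of that threshold sweep, assuming you already have held-out labels and predicted probabilities (the toy numbers below are made up):

```python
# Sweep thresholds, find the one with the best F1, and show its confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
p_pred = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55, 0.7, 0.05])

thresholds = np.linspace(0.05, 0.9, 18)
f1s = [f1_score(y_true, (p_pred >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]

print("best-F1 threshold:", round(float(best_t), 2))
# Rows are actual (0, 1), columns are predicted (0, 1): TN FP / FN TP.
print(confusion_matrix(y_true, (p_pred >= best_t).astype(int)))
```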
What you can't do yet is freeze a chosen threshold for this model so that, when it makes predictions, it uses that threshold to produce labels. Normally this binary model gives you probabilities for class 0 and class 1 that add up to 1, and right now we say it's your problem to pick the threshold that turns those into a yes-or-no outcome. Once you decide your threshold should be 0.75, because you like this position on the chart, the true positives, false positives, true negatives, and false negatives, that is, you like the confusion matrix at that point, then every probability above 0.75 means yes to you. We currently don't make that transition to labels for you; we just say the probability is 0.7 or 0.9, and it's your problem to decide that it means yes. In a future version you should be able to freeze that threshold from the GUI, so that subsequent predictions use it. Does that make sense? Is that a feature people would request, or am I the only one assuming it would be useful? Any feedback? Anybody here dealing with binomial models on a daily basis? Machine learning on a daily basis? Eating pizza on a weekly basis?

[Audience comment:] I really do like Driverless AI; it's very easy to preload data, run one automatic model, and generate code and a deployment.

Yes, and there's a free version for academic use. If you're a university student or a professor, and we have someone here who might be able to say more if you're interested, there's a totally free version for academic use, as long as you promise not to use it commercially and you have an affiliation with an educational institution.

So what can you do? You can look at a dataset's details and inspect it, look at the rows as a preview, just to check that the dataset is what you expect it to be. Oh, there's a question.

[Audience questions, partly inaudible.]

Thanks for the questions. On the first one, unsupervised versus supervised: we have both. We have clustering, truncated SVD, different unsupervised methods, but we don't expose purely unsupervised models; you only get supervised models right now, and we use the unsupervised techniques to make a better supervised outcome. For supervised models we have XGBoost, LightGBM, TensorFlow, GLM, and RuleFit, those five, and usually the gradient boosting models, XGBoost and LightGBM, are the most predictive for these kinds of tabular datasets. Once you have images you'd need neural nets, which we don't support right now. But you can have text: multiple text columns in your hundred-column dataset, and for each text column we can either make TF-IDF count-based statistical features, or TensorFlow convolutional-neural-network or LSTM based features. We have TensorFlow built in and use it for text columns, and once a text column is converted to numbers, those become features for XGBoost, say. So we mix these unsupervised and supervised techniques to make the model as predictive as possible, but the end product is always supervised.

Now, for data ingest, you see here there's file-system upload, Hadoop, Amazon S3, Google Cloud Storage, Google BigQuery, Minio, Snowflake, and KDB, with Redshift coming soon, so we have a fair set of ingestion capabilities.
If you click on Amazon you immediately get access to all kinds of things there, and of course this system is running on Amazon, so that's faster than it would be if it were running on Google and the data had to move across. It's always about where you put your data. Banks typically have it on premises, and then the data is also on premises, maybe on the same machine or on an NFS mount nearby. The data size is usually gigabytes, tens of gigabytes, or hundreds of gigabytes at most, and with a reasonably fast file system the upload takes less time than the modeling; modeling will take longer.

Another thing we have: you can click on a dataset and say Visualize. This builds an unsupervised exploratory screen that shows you everything that's interesting about the dataset; you don't have to do anything, you just wait for it to finish. What it does first is aggregate the data, which is a way of clustering that keeps the outliers alive: think of a small epsilon environment around every point; if you have neighbors close to you, you gobble them up, and if you don't, you remain alone, but you stay in the dataset. In the end you're left with a bunch of epsilon bubbles that are all actual points: they're not centroids, not centers of mass. With ordinary clustering, the centroid of two points is in the middle, which is not an actual point, but here the representatives, the exemplars, are actual points; if I gobble you up, I stay where I originally was. (A toy sketch of this exemplar idea follows at the end of this passage.) This aggregator is implemented in datatable; you may have heard of data.table in R, and we're now building a Python version of datatable. We have the main datatable authors for both R and Python in the company, and we use datatable a lot for feature engineering and data munging; it's a C++ backend controlled from Python.

So now we have an unsupervised display of this dataset, and we never had to provide a target column. You can click on Outliers, for example, and it shows you only the distributions of columns that have outliers. V6 has three outliers; you can click on those, and one plotted point might actually contain multiple outliers, because these three rows sit in one exemplar. These are the three actual rows that are odd in this V6 column: their V6 value is about 20, while everybody else is below 5. So you now know there's something interesting in that direction, and you didn't have to do anything; it's not like Tableau, where you have to plot everything manually. It plots only what you should see. The same goes for missing values: you can see there are a lot of missing values in this dataset, lots of interesting distributions, and here is a heatmap. We'll do more interactive things with this, ways to zoom in and out, color by certain columns, group by, a lot of cool visualization coming soon; this is just the first version, but it's already a very useful tool to quickly check your data.

Meanwhile, this experiment here is finishing up; it's probably scoring right now, making predictions on the test set, and I can then download those predictions and submit them to Kaggle. We won't do that right now; I'll just show you how it works.
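Here is the toy sketch of the exemplar idea mentioned above: greedily keep a point as an exemplar unless it falls within epsilon of one already kept. It only illustrates the "actual points, not centroids" property; the real aggregator in datatable is far more sophisticated.

```python
# Toy exemplar-based aggregation: every kept representative is a real data point,
# and isolated points (outliers) survive instead of being averaged away.
import numpy as np

def aggregate(points, epsilon):
    exemplars, counts = [], []
    for p in points:
        for i, e in enumerate(exemplars):
            if np.linalg.norm(p - e) <= epsilon:
                counts[i] += 1          # gobbled up by an existing exemplar
                break
        else:
            exemplars.append(p)         # stays as its own (actual) point
            counts.append(1)
    return np.array(exemplars), counts

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.5, size=(200, 2)),   # dense blob
                  np.array([[20.0, 0.0]])])            # lone outlier, like the odd V6 rows
ex, counts = aggregate(data, epsilon=1.0)
print(len(ex), "exemplars; outlier kept:", any(np.allclose(e, [20.0, 0.0]) for e in ex))
```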
But let me show you an experiment from the beginning. I'll go to the file system; I have a server here with a local file system, and I can import, say, this credit card dataset. You can do this while your other experiment is running; you can do multiple things at once. When it's imported, I can say Predict and choose a target column; I can predict anything I want, and this last column is "default on payment next month": will I default on my payment or not, can I pay my bill next month or not. This is a credit card dataset, also from Kaggle, from 2005 I think, from Taiwan. Every field here has a mouse-over, and there's also an assistant you can click on that explains each cell in more detail, but basically you just need to provide the target column.

Sometimes you can also provide a weight column, if you have observation weights for every row in the dataset; sometimes you care more about weekends, or about expensive cars, or about people with a special status. You can also provide a validation dataset if you want to make your own train/validation split. The validation dataset is used for model tuning; the test set is what you use at the very end to score the model and see whether it's actually good, and you should only do that once. If you keep measuring the model's performance on the test set over and over, it becomes a validation set used for tuning, and then you have nothing left to tell you whether the model is actually good. So: validation to tune, test to check. If you don't provide a validation set, we cut one out of the training data internally, and as I said earlier we can make multiple splits and use a reusable holdout to measure performance, so the number that says "this is how well I'm doing" is not a single pass over one fixed holdout split; it's the best guess we can make from the actual data about how well we'll do on similar data in the future.

There's a fourth column, the fold column, which is for stratification or, more precisely, grouping. Say you have sixty different people in your dataset but a hundred recordings of each person. You don't want the same person in both training and validation, because then you learn a signal from one recording and in validation you say, oh, it's the same person, I know the answer, which is a bit of cheating; you might just be memorizing the wrong thing about that person. There was a famous case with lung X-rays, where you had to say whether there's a disease: if the same person has a hundred X-rays and you put half into training and half into validation, the model just memorizes that person's bone structure and associates it with the sickness; it's not the actual image of the disease that told you. So you want to split the data such that the same person is entirely in training or entirely in the holdout (validation); you do not want to mix them up. If you have that kind of separating column, you should provide it, because unfortunately it's hard for us to guess which one it is; you can give us a CSV with 6,000 columns and we won't know which of them should act as a fold column.
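Scikit-learn's GroupKFold expresses exactly this constraint; a minimal sketch, assuming a hypothetical person_id column playing the role of the fold column:

```python
# Group-aware splitting: all rows of the same person stay on one side of the split.
# "person_id" is a hypothetical grouping column, like the fold column described above.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)                            # 12 recordings
person_id = np.repeat(["alice", "bob", "carol", "dan"], 3)  # 4 people, 3 rows each
y = np.random.randint(0, 2, size=12)

for train_idx, valid_idx in GroupKFold(n_splits=4).split(X, y, groups=person_id):
    assert set(person_id[train_idx]).isdisjoint(person_id[valid_idx])
    print("validation fold people:", sorted(set(person_id[valid_idx])))
```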
If you don't provide the fold column when it matters, the model is basically junk; it could be really bad, or it could be kind of okay, depending on the data. If you don't give us a weight column, that's not so bad; it just means the model treats every row equally. And if you don't give us a validation set, that's totally fine, because we split our own.

Now, if time matters a lot, say you have a fall dataset and a spring dataset, you could provide fall for training and spring for validation, forcing us not to cut up the training data internally but to train on the first dataset and hold out the second one for tuning, that is, select the pipeline parameters such that you do well on the new campaign in fall while training on spring. Otherwise it just gets good at being good at spring and then does badly in fall again. So validation can be provided by the user if you know there's a shift. And if there is a shift, and we see, for example, that a stock price changes dramatically and that feature would give away the answer because the target rises with it, say a house price correlated with some stock price, then maybe it's not the best thing to keep that stock price in the dataset, because it alone immediately tells you what the house price is. Maybe you want a model that learns the other things, not this cheater column that gives everything away because it's so highly correlated. We can detect that: between training and validation the stock price changes so much that it's not a good feature; why build a model on a low stock price if all my later scoring will be on a high stock price? We can automatically drop features like that, so it's a good idea to give us a test set.
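As a simple sketch of one way to flag that kind of train/validation shift (Driverless AI's actual detection is more involved): compare each feature's distribution across the two datasets, for example with a Kolmogorov-Smirnov test, and drop features that differ too much. The columns and threshold below are made up for illustration.

```python
# Flag features whose distribution differs a lot between train and validation,
# e.g. a "stock_price" column that drifts upward. The 0.3 threshold is arbitrary.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
train = {"income": rng.normal(50, 10, 5000), "stock_price": rng.normal(100, 5, 5000)}
valid = {"income": rng.normal(50, 10, 5000), "stock_price": rng.normal(160, 5, 5000)}

for col in train:
    stat, _ = ks_2samp(train[col], valid[col])
    flag = "DROP (shifted)" if stat > 0.3 else "keep"
    print(f"{col}: KS statistic = {stat:.2f} -> {flag}")
```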
[Audience question, partly inaudible, about how long prediction takes and how much data is needed.]

Correct, yes. And the small-data recipe is actually the hardest. On the question of how much data you need: about a hundred rows is the lowest we'd consume; below that we just say it's too small, and even with a hundred rows you usually don't get a good model. We do our best, splitting it up many ways, doing repeats and so on, to give you a reasonable estimate, but at a hundred rows a statistician will beat a machine-learning model. I'd say thousands or tens of thousands of rows is usually the starting point where you can be somewhat blind, where you don't need to know the domain too deeply, assuming an i.i.d. dataset, where every row has the same distribution as every other row, all just t-shirts of different sizes and colors. If instead the rows are correlated, stock prices from one day to the next, then you have to keep that time-series nature in mind, and for that you have to provide a time column.

There can also be many different groups within the time column. You could have one dataset of all sales for all your people, say every Starbucks franchise in a single one-gigabyte dataset, with the sales numbers for every person playing the sales role: one column is the date, the next column is which store, the next is which salesperson, and the next is the daily sales. You throw this dataset in and say: predict, for this date, this store, this person, what's going to happen. Driverless AI can do that, but it has to figure out automatically that there are groupings by store and person, find those different time series, and check that the data is evenly spaced in time, for example a sales number every day; a missing weekend might be fine, but if there are too many holes in the data we might complain. There are a lot of tricks and subtleties in time-series forecasting, and we're trying to build more and more generalizable methods that work in any situation. This kind of sales-per-day-per-person problem works very well, and we're working on other improvements, such as the ability to update the data inside the model so it can make predictions far into the future, or on a rolling daily basis. Right now the time-series model has to remember all of its state from the training data; as soon as you deploy it, you can't update that data, so it looks back at the training period to make predictions for the test set. If the test set is very long, say you're making predictions for two years, then at the end of those two years you're still looking back at training data from today, which isn't a good model; to predict two years from now you should be looking at yesterday's values. That's something I'm working on: refreshing the model with fresh data, for cases like stock prices where there are new numbers every day and yesterday's price for a ticker is a much better predictor than the price at training time.

Anyway, here you can pick the scorer you want: you can optimize GINI, or MCC, or F0.5. These are data-science-specific terms, and if you don't know what they mean you can mouse over and it explains a little, for instance Matthews correlation coefficient, which looks at all four cells of the confusion matrix, false positives, false negatives, true positives, true negatives, and will make a good model. Let's take that for fun and run with it. You can also change the interpretability, the accuracy, and the time you want to give it: you can ask for a more interpretable model, spend less time, and still want it super accurate. That gives you a model that uses the original features more, rather than all these generated features that may be less interpretable; the genetic algorithm also doesn't have much to do because time is set to one, and since we want it to be very interpretable there's not much room for feature engineering, but we still want the highest accuracy, so we might do a lot of model tuning to find the best hyperparameters for that dataset.

Let's look at the other model and see whether it's done. Oh wow, 0.4369; that's pretty good, I'm almost tempted to submit this to Kaggle. Sorry, the Wi-Fi is probably a little slow here; the web server has to upload all that information to my laptop. When it's done, we can also interpret the model; there's a full interpretation suite, and I might be able to
show you that in the slides as well. So, we showed the automatic visualization; this is the interpretation. There's a booklet you can pick up, maybe in the other office if it's not in this one, about interpretability of machine-learning models. If you want interpretable models, you can either turn the interpretability dial up or look at this page to get more of a feel for the model. You can see, for example, that everybody with a low predicted outcome seems to have a high number of missed payments, for the credit card example: if they missed a payment in the last month or the last three months, the model thinks they're going to default, they're not going to pay next month. You can see why the model picks that up; it basically exposes bias in the data, if you like, it tells you what's going on in the data that leads the machine-learning model to think something is a yes or a no. It also gives you decision trees and partial dependence plots, so you can ask, for example, what path most people go down to end up with a high probability of default, or run what-if scenarios: if I change my income, am I more or less likely to default? If that jumps around, you know it's not a good model, because more income should usually mean a lower chance of default, unless something interesting is going on, in which case you dig deeper and ask why somebody with more income would have a higher chance of default; maybe they became more speculative, started to gamble, or are entrepreneurs, and then maybe it makes sense again. There are all these drill-down methods, and we can look at them in the GUI in a second.

It's free for a 21-day trial, and then you can contact us for another license key; usually by that time people are signing up. There's also, as I mentioned, a free academic version, and support for Linux, Windows, Mac, and IBM Power, with RPM packages; there's also just a tar file you can unzip into a folder and run, like a local install in your home directory if you're not root and can't install the RPM.

For the audience online: on the left side are the feature-engineering variables generated for this dataset, which is an i.i.d. dataset, not a time-series problem, so these are just interactions of the columns. I.i.d. means independent and, as I mentioned earlier, identically distributed: the observations are independent of each other and come from the same generating function, all points from the same kind of population. On the right side you have time-series feature engineering, which looks more like lags. A lag is a value at a prior point in time: a lag of 1 means yesterday's value if my time interval is days, and a lag of 52, if the unit is a week, means last year's value. In this case the task was predicting sales numbers for stores and departments, and we automatically grouped by store and department and learned that, for a given department-and-store combination, last year's number is one of the most predictive numbers, obviously, if every Christmas or Thanksgiving there are more sales.
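A minimal sketch of grouped lag features of this kind with pandas, using hypothetical store, dept, and weekly_sales columns:

```python
# Per-(store, dept) lag and exponentially smoothed lag features, the kind of
# time-series features described above. Column names and data are illustrative.
import pandas as pd

df = pd.DataFrame({
    "week":  list(range(1, 105)) * 2,
    "store": ["A"] * 104 + ["B"] * 104,
    "dept":  ["toys"] * 208,
    "weekly_sales": [100 + (w % 52) for w in range(208)],
}).sort_values(["store", "dept", "week"])

grp = df.groupby(["store", "dept"])["weekly_sales"]
df["lag_1"]  = grp.shift(1)    # last week's sales for this store/dept
df["lag_52"] = grp.shift(52)   # the same week last year
# Exponentially smoothed history, shifted so the current value is never used.
df["ewm_4"] = grp.transform(lambda s: s.shift(1).ewm(span=4).mean())

print(df.tail(3)[["store", "dept", "week", "weekly_sales", "lag_1", "lag_52", "ewm_4"]])
```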
Then there's another feature that takes the exponential smoothing of an interaction of these lags: maybe last Christmas, last Thanksgiving, and the same dates two years ago, with an exponential-smoothing kernel over them, which gives you a smooth smear of those five-or-so numbers, and that's the best guess we have to make a feature. The gradient boosting method can then say: all the stores that had roughly this number a year ago, and that number two years ago, that value groups my stores, and when I split on it, bigger or smaller, I separate the population nicely and predict each side's mean, which is a good prediction. Put all the big-store sales into the right bucket and all the small-store sales into the left bucket; I've separated them, not by the current sales, but by a smoothed version of past sales for each group. Every store and department has its own time series, and for each one we compute the exponentially smoothed average of the past and use that as a feature, and that was among the best.

There are also cluster-distance features. For example, we can take the departments and cluster them: some departments, like toiletries, don't have much revenue because toilet paper is cheap, and other departments, like cosmetics, are more expensive, so you can group the departments, and how far you are from a certain group also gives you a good idea of what your sales are going to be. Or we can do interactions of lags; you can compute the median of past sales, or the maximum, also a good feature sometimes: if the sales are spiky but the maximum is not that large, you know you're not that spiky, you're more like this kind of store with this kind of sales. These are all numeric features generated from the past, and they can give you more insight than a neural net: a neural net just multiplies things and you have no idea what it actually multiplies, whereas this tells you that the one-year-ago value mattered a lot, so it's more interpretable by itself, without doing anything extra.

Okay, that's pretty good: 0.4369 is the last model here, the final point, and that's the ensemble. You can see the model got better and better over time, the winning model of this horse race, and at the very end we made a model that is a combination of fifteen models: three different XGBoost and LightGBM models, and maybe a GLM or a TensorFlow model, so we take three things, and each has five folds of cross-validation, which gives fifteen splits that we fit and predict. Each of those fifteen pieces makes a prediction on the test set, and then you blend them with weights that were also optimized using the original training data, so we get the right combination of those fifteen models. And they all have different features: every one of the fifteen is fitted on different data, because the three model types each have their own pipeline that determines which features are made, and each of the five folds sees a different 80% of the data, so each one has a different feature-engineering pipeline and a different model, and then they make predictions on the test set, which is transformed in the same way.
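As a toy sketch of that final blending step, here is one way to optimize non-negative weights over a few models' validation predictions to minimize log loss; the prediction vectors are made up and the real blender is more elaborate.

```python
# Find blending weights for several models' validation predictions by minimizing log loss.
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss

y = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
preds = np.array([                      # rows: three models' predicted P(y=1) per sample
    [0.2, 0.7, 0.8, 0.3, 0.6, 0.4, 0.9, 0.7, 0.1, 0.2],
    [0.3, 0.6, 0.9, 0.2, 0.7, 0.3, 0.8, 0.6, 0.2, 0.3],
    [0.1, 0.8, 0.7, 0.4, 0.5, 0.5, 0.9, 0.8, 0.2, 0.1],
]).T

def blended_loss(w):
    w = np.abs(w) / np.abs(w).sum()     # keep weights non-negative and summing to 1
    return log_loss(y, preds @ w)

result = minimize(blended_loss, x0=np.ones(3) / 3, method="Nelder-Mead")
weights = np.abs(result.x) / np.abs(result.x).sum()
print("blend weights:", np.round(weights, 3), "log loss:", round(blended_loss(result.x), 4))
```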
It's a fairly sophisticated technique, and this model will now be in the top 15 on Kaggle, guaranteed; I'm more than happy to upload it later for those who stay around and show you. This just saved two months of time, because our own grandmasters actually spent two months to get there in the past. This is the private leaderboard, and we get this 0.42999 number, which is tenth place, and this is Dmitry Larko, who is probably sitting in the audience somewhere around the corner; he's one of the main contributors to the code that does this feature engineering, so he has essentially automated himself. Two years ago it took him 192 submissions and two months of playing time against 3,000 teams, and we're now in the top 0.3 percent out of the box. So imagine what you can do on your own datasets; it's a good benchmark tool, for sure.

It may not do this well on every problem, because sometimes leakage is present in these Kaggle problems. For example, the ID, the row number, matters: it shouldn't, but it does sometimes, because the data was collected such that a later point in time had more events or something, so just knowing which row you are helps you predict. Sometimes the Kaggle community figures that out very early, and they rerun the whole competition to get rid of that exploit; sometimes it's figured out at the very end, when it's too late to change, and then all the grandmasters have competed by gaming the system, so to speak, to get higher accuracy, which is useless to the business, because it's a leak that was exploited, not an insight gained. So these Kaggle competitions, even with a lot of care, are often not what they were intended to be, because of such leaks and because you can see the whole test set; even though there are no answers in it, seeing all that data is already more than normal. In the real world you wouldn't have next week's transactions, so it's also not a perfect match for reality. That's why Driverless AI does not need the test set: if you don't give it the test set, it still fits the same model and then makes predictions just fine. You definitely should not give us the test-set answers, that would be cheating, and we won't look at them even if you do. But we might look at the distribution shift: if we can see that next year's stock market is totally different from this year's, we might drop that as a feature. So you benefit from giving us a test set, because it lets us check that the distributions match and avoid giving you a bad feature, but we're not going to cheat by looking at the outcome.

Okay, a little bit more about the feature engineering. In this case, for this Paribas dataset, back then it took nineteen thousand features and a thousand models; let's see what we did today. It says here we tried twenty-two thousand features, so we made twenty-two thousand columns, and we trained about 440 or 450 gradient boosting models on those twenty-odd thousand features, in an hour and a half, all automatically, to figure out the best pipeline. You could say that's wasteful, why not just build the one pipeline that's really good, and yes, ideally we'd be even faster and build only the right features, but the final pipeline here still uses about a thousand features.
Those fifteen models together use about a thousand different transformations across their fifteen pipelines; they share some and have some unique ones, so roughly a couple of hundred each go into the models. This dataset is one where you have to look at everything, all these interactions, and a long tail of features that are all helpful, so we can't drop much; with different accuracy and interpretability settings we would say, look, this feature isn't useful enough, and drop it for you. So you can give us more features than you think you need and we'll trim them down automatically, and you can also tell us in the expert settings up front how many features you want: no more than 10, no more than 1,000, whatever it is, and we'll respect that. We have overrides, so you can run the same experiment again; any experiment can be restarted from the last checkpoint, or you can make a new model with the same settings, which come preset to what you just ran, and then change them, for example to try something less interpretable. Or you can go to expert settings and switch these toggles; there are a lot more settings there, all documented. You can say I don't need the distribution-shift detection, or I do want TensorFlow enabled, or I don't want LightGBM, or I want no more than 50 features in the pipeline, don't engineer more than 50, or don't use more than 10 original features, so you can give us a thousand-column dataset, say use no more than 10, and we'll do the best we can. You can also enable TensorFlow for NLP, these convolutional networks that turn text into numbers, or turn that off if it's too time-consuming and just use the TF-IDF counts, which is faster.

There was a question in the audience; can we get the microphone, please? The question was what CPU and GPU resources were used for this experiment, for this part of the competition. It shows it here: Docker on Linux, 240 gigabytes of memory, 32 CPU cores, and four GPUs, so most likely a p2.8xlarge or similar instance, 32 virtual cores, probably 16 physical cores with hyper-threading; that's something like a dollar or two an hour, maybe two to four dollars, I'm not sure exactly. It also works on CPUs only; you don't need GPUs. GPUs are very useful if you have big data, say a twenty-gigabyte dataset where each model takes five minutes to fit; we do what we can to shorten that time with a higher learning rate and so on for the genetic-algorithm part, and then GPUs make a difference of maybe 3x to 5x, sometimes 10x. If you don't have GPUs you're still fine; you just get insights faster with them than you would doing all this by hand. One thing that also matters is the number of CPU cores: 30 cores will be much faster than 8, because there's parallelism we can exploit in the feature engineering, each feature is engineered on a separate core, so it's not just about GPUs, CPUs matter too. And there's an ongoing project with NVIDIA to move everything to the GPU, including the feature engineering, which we're looking forward to, but it's not at the point where we can
integrate it right now. At some point we might do everything on the GPU, at which point it will be faster and the memory limit will be set by the GPUs rather than the CPUs; right now we use datatable, which can use all the main memory and all the cores of the system, so you can do big data with that.

We do deep learning and, of course, non-deep-learning. Deep learning is something we were part of innovating: when I joined this company five years ago I was tasked with deep learning, and I wrote the H2O deep learning package, which is widely used and very easy to use; it runs in pure Java, so it runs on Windows laptops, anywhere you want, and it builds deep learning models that are quite accurate for these kinds of datasets, but it's not meant for images. We're not a replacement for Keras or TensorFlow, and we're not a tool just for deep learning; we use Keras and TensorFlow inside our own tool, which is actually a pretty good idea because it's Pythonic and fast, so there's no value for us in re-implementing deep learning itself.

For time series, we make the holdout splits internally based on time, so we don't mix up the future and the past; we split by time, respecting causality. If your test set is in the future, we ask how far ahead it is from the training data: if training runs up to today and testing starts next week, there's a one-week gap, which is roughly how long it takes you to put the model in production. We can mimic that gap during training, so that when the model goes to production it won't ask for yesterday's value, only for values at least a week old, which is what you actually have; we basically live with the same constraints you have. If you put an IoT device somewhere that needs past values, it has to ask for values old enough to have been available when the model was trained. But if you're a hedge fund that always has the last millisecond's values, this will not work, because we look back to the training period, not back one millisecond from any given moment in the future; for that you'd have to keep giving us the latest values at scoring time, which means more than a single row. Currently we take one row at a time and predict; that row can contain a timestamp, the store, the manager, and so on, and we tell you exactly what the sales will be for that combination in the future. But if the model needs to know every millisecond of the past, it's not a good model to fit in Driverless AI today, because we don't have that last-millisecond number available at test time: we operate on the assumption that once the model is finished training, it gets deployed in isolation and nobody gives it fresh data ever again; it has to be able to score in isolation. If that's not true for you, because you really do have fresh data every millisecond, you'll need to wait for us to release the next version, which will handle that. So don't run time-series problems where you always have fresh data; run ones where you build a model, score it for a while, say a year of training data and then a week of scoring in production, then refit on fresh data and score for another week. For that kind of model it's fine.
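A small sketch of that gap idea: when building lag features for a model that will be deployed with, say, a one-week delay and a one-week forecast horizon, only lags at least as old as the gap plus the horizon should be generated, because shorter lags would not actually be available at scoring time. The columns and numbers below are illustrative.

```python
# Respect the deployment gap when creating lag features: skip lags that would not
# be available once the model is deployed in isolation.
import pandas as pd

gap_weeks, horizon_weeks = 1, 1            # time to deploy, and how far ahead we predict
min_allowed_lag = gap_weeks + horizon_weeks

df = pd.DataFrame({
    "week": range(1, 61),
    "sales": [100 + w % 10 for w in range(60)],
})

for lag in [1, 2, 4, 52]:
    if lag < min_allowed_lag:
        print(f"skipping lag {lag}: it would not be available in production")
        continue
    df[f"sales_lag_{lag}"] = df["sales"].shift(lag)

print(df.columns.tolist())
```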
That's just to let you know, so that you're not running it in a regime where it's not as useful. It would still be better than a random model or a mean model or something, but I wouldn't be confident saying it's the best. Now, as I mentioned earlier, we automatically detect those groups — the store and department combinations, or salesperson, or whatever it is — that each have their own time series. We can separate those time series automatically, so we'll detect that there are 17 time series in your data set because you had 17 different stores.

Now for text: you can have multiple text columns in the data set, such as tweets, or descriptions, or the last conversation you had with the person, or doctor notes, something like that. We will tokenize them into separate words — so any language is fine, it doesn't have to be English — remove the stop words and the punctuation and so on, and then do TF-IDF counts: count how many times each word shows up in this document versus in the global corpus of all documents in the data set, and say, okay, wow, this word shows up a lot here, so it gets a higher weight. Then we have, let's say, fifty thousand words, so we have fifty thousand counts for each document, and that's a sparse matrix, which we then compress down with truncated SVD, for example, into fifty components. Those fifty components are now fifty columns that we use as input features for our model, so we turn the text into fifty columns: each text field turns into fifty numeric fields. Another way is to take those fifty thousand counts and throw that sparse matrix into a linear model and say: predict the outcome from those zeros and ones — or counts, since these are relative counts, so they can be any numbers, it doesn't have to be yes or no. From that sparse matrix, which is mostly zeros but has lots of non-zeros for the words that actually occur, you can predict the outcome, and you can do that out of fold. Then you have a good column, which is the likelihood of the outcome given these words, and that likelihood is a good replacement for the string that was originally there. The gradient boosting or TensorFlow models can then take that. You can also use TensorFlow to make word embeddings that take these sentences and stream them through a neural net, which has an awareness of the order of words, not just of how many times each word shows up — "I am rich" versus "rich am I" is different, right? Then you can see whether the model actually learned something from those interactions and from the order, and when you have the out-of-fold estimate of that convolutional neural net, that's a good replacement for a string as well. That can then go into another neural net — the actual model that takes all the other numbers and strings of the data set — or into an XGBoost model, LightGBM, or a GLM. So we're just turning the text into a number first and then doing the next step with a regular model; we're just getting rid of the text, if you want. And once that's done one time, we cache it: once the text is converted to these numbers, we can reuse that feature from then on instead of the text. We don't have to redo this for every experiment, for every dot on the iteration plot — we just have to do it for every split in the cross-validation folds, but not for every single model fit.
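A minimal sketch of the TF-IDF plus truncated-SVD path just described, using scikit-learn (the tiny corpus here is made up; on real data you would keep all 50 components):

```python
# TF-IDF counts over the corpus, compressed with truncated SVD into a handful
# of numeric columns that a downstream GBM or GLM can consume.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # hypothetical free-text column, e.g. notes or tweets
    "late payment two months in a row",
    "paid the full balance on time",
    "requested a credit limit increase",
    "missed the minimum payment again",
]
tfidf = TfidfVectorizer(stop_words="english", max_features=50_000)
X_sparse = tfidf.fit_transform(docs)                 # sparse matrix, mostly zeros

# tiny corpus here, so cap the components; use 50 on a real-sized corpus
n_comp = min(50, X_sparse.shape[1] - 1, len(docs) - 1)
svd = TruncatedSVD(n_components=n_comp, random_state=0)
X_text = svd.fit_transform(X_sparse)                 # dense numeric columns per document
```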
All right, so in the end you of course get documentation that tells you what happened. There's a full PDF coming out telling you: okay, we did five folds, we used these features, these are the top features, this is the description of all the features, this is the pipeline that came out — there are many models and many features — and this is how well it does. There are all these metrics — Gini, MCC, F1, accuracy, log loss, AUC and so on — and each has an error bar and a number for the training data, the internal validation splits, the test set data and, if you have an external validation set, that's also possible to see. So there are all these numbers, and we're working on making all these metrics more visible in the GUI, so you can compare different models on different data sets and see which one has what ROC curve and what the differences are. We also want to add more model debugging tools — that's all on the roadmap for the next few months.

There are also going to be new models, such as FTRL, which is a streaming algorithm similar to a linear model — a stochastic-gradient-descent linear model would be the closest. You take one row at a time, make a prediction with a linear model, and whatever is wrong corrects the current weights for each of the predictors: how many times age, plus how many times income, plus how many times zip code gives the right answer. Basically you keep correcting those weights until you get a good model. Now, FTRL will do the same thing, but multi-threaded and with higher-order interactions, so the weight is not just for age, income and zip code; there's a weight for age times income, or age times zip code, let's say. There's also the ability to handle strings: if you have a word like "hello world", that also gets a coefficient — how many times "hello world", how many times this URL, how many times "married" or "single", each gets a coefficient. And of course, if the URL shows up only six times in the whole data set, the weight might be relatively small because it didn't matter too much; but if the word is "male" or "female", then it's a more important number to keep, because every single row will adjust that number until it's correct. So you get a bigger number, like 0.3 times gender or whatever it is — who knows if it helps or not — but it's a linear model that has a weight for male and a weight for female among its coefficients. And you can have a weight for male times age, where different age groups fall into different bins, and for each of those — male times a given age group, or female times a given age group — there's a coefficient: if you're in that age group the coefficient is applied, if you're not, you get a zero. So for everybody in the room here there would be a coefficient that says how much their combination contributes to the default likelihood. All these coefficients are trained automatically, in parallel, fast, in a streaming fashion, so it uses no memory at all other than to store the coefficients; you can throw a terabyte data set at it and it will just stream through, and in the end you get these coefficients. That's FTRL. It's similar to deep learning except that the weights are clearly defined for those interactions, so it's not a deep neural net — it's like a one-layer neural net, if you want — but we can do it very fast because it uses a hashing trick to turn the strings into an index for which we have a weight. It's a long weight vector indexed by the outcome of a hash function applied to the input: whether it's a number or a string doesn't matter — whatever you throw at it as a feature gets hashed into an integer, a modulo is applied, and that's the vector entry you change with the update.
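A toy sketch of that hashing trick and streaming update, assuming a plain hashed logistic-SGD learner; real FTRL-proximal additionally keeps per-coordinate learning-rate accumulators and L1 regularization, which this stripped-down version omits:

```python
# Hashed streaming linear model: every feature (and pairwise interaction) is
# hashed into a slot of one long weight vector; each row nudges those weights.
import math

D = 2 ** 20                        # length of the weight vector
w = [0.0] * D                      # the only memory the model needs


def hashed_indices(row: dict) -> list[int]:
    # strings and numbers are treated the same; interactions like "age x zip"
    # get their own hashed slot (production code would use a stable hash)
    feats = [f"{k}={v}" for k, v in row.items()]
    feats += [f"{a}&{b}" for i, a in enumerate(feats) for b in feats[i + 1:]]
    return [hash(f) % D for f in feats]


def predict(idx: list[int]) -> float:
    z = sum(w[i] for i in idx)
    return 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))


def update(row: dict, y: int, lr: float = 0.05) -> None:
    idx = hashed_indices(row)
    g = predict(idx) - y           # gradient of log loss w.r.t. the margin
    for i in idx:
        w[i] -= lr * g             # fix the weights a little after every row


# streaming: one row at a time, so a terabyte can just flow through
update({"age": 34, "zip": "94301", "status": "married"}, y=1)
```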
That's a really cool algorithm — it's what Google used for AdSense — and soon it will be available in Driverless.

So let's see, we have the demo here. I want to go to the credit card data set, and here I'm going to interpret this one. What we have here is the machine learning interpretability module, which will show us, for every row in the data set, why we made that prediction, so we can get a feel for how the model operates and what the thinking is behind the decisions. Sometimes you'll see that, oh wow, the ID mattered a lot or something, and then you should have dropped the ID. Now, we automatically drop IDs, but if something that looks like an ID but not quite sneaks in, remains in there and is super predictive, and you get an AUC of one, then you know why: the feature importance listing on the first page will show you a very high importance, which means you're probably overfitting on that feature — something is wrong with it. We will not overfit, but we will say, wow, this is a very predictive feature, and you should then decide: is that actually a good thing? Is it true that this feature is so predictive, or did I just make a bad data set?

So we go to the MLI page — yes, MLI is machine learning interpretability. This now says: okay, we did this model, it took 52 seconds, and we built a bunch of surrogate models and clusters. Let me go to the dashboard; it's usually the easiest way to start. The dashboard shows you four different things. One of them — let me get rid of this X here — is the top-left chart. What it does is sort all the predictions and the actual values of the data set. It has all the zeros and ones, so there are two lines: all the zeros on the bottom, all the ones on top, and it sorts them by prediction, so there's a yellow line going from the bottom left all the way up to the top right — it sorts all the points by what we think is the default probability. Now, the points on the right should really be mostly yeses and the points on the left should be mostly nos if you have a good model, and that seems to be the case: the dot density at the top right is high, and the density at the bottom left is high. What you also see are these sprayed-in white dots, and those are the predictions of an approximate model — a linear model. Instead of our ensemble of XGBoosts and LightGBMs and TensorFlows, I now have a linear model that only has a bunch of coefficients, and it comes up with roughly the same answer, because it follows this yellow line: for the points that are high here, the linear model also predicts high numbers. These are 15 different clusters that we found in the data, and for each of those 15 clusters we fitted a linear model — this one is basically the model of those who default a lot, and here is a different model for those who don't default — and each one has its reasons. These are reason codes: they're just beta-times-x values that add up to the final number along the x-axis, and then it goes up the sigmoid function and becomes a probability. The bigger the sum, the further right you go; a negative contribution moves you back left. The further right, the higher the probability; the further left, the smaller. So these numbers tell you how much each feature adds to the overall probability — that's the reason code: you can say PAY_0 added this much to the probability.
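A toy illustration of those reason codes — a per-cluster linear surrogate's beta-times-x terms summed and pushed through the logistic function; the coefficient names and values are made up, not taken from the actual model:

```python
# Reason codes from a surrogate linear model: each beta * x term is an
# additive contribution along the x-axis; the logistic function turns the
# total into the predicted probability.
import math

coefs = {"PAY_0": 0.9, "PAY_2": 0.4, "BILL_AMT1": -0.1}    # hypothetical per-cluster betas
intercept = -1.2
row = {"PAY_0": 2.0, "PAY_2": 1.0, "BILL_AMT1": 0.5}        # standardized feature values

reason_codes = {k: coefs[k] * row[k] for k in coefs}        # beta * x per feature
margin = intercept + sum(reason_codes.values())
probability = 1.0 / (1.0 + math.exp(-margin))

print(reason_codes)   # e.g. PAY_0 pushes the prediction to the right (higher risk)
print(probability)
```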
Now, if you pick a point, we get feature importances for that specific point — the gray bars that show up are the importance of the features for this particular row. So you can see that, compared to everybody else, this point has a higher importance for PAY_2. These PAY columns say: did I pay that month or not? PAY_0 is the most recent time I should have paid my bill — did I pay or not? — and it matters the most; PAY_2 is the time before that, PAY_3 the time before that, and so on. Now you see PAY_5 and PAY_4 are mixed up here: the model thought PAY_5 mattered slightly more than PAY_4 even though it was further back. Maybe that's just an artifact of the small data — it's only 24,000 rows — because you would think that further back means less important and closer in time means more important. So that's some insight we can get here: the data signal is not super strong. There are also partial dependence plots: let's say, if I had paid more three months ago, what happens to my outcome? It just shows you, if I change this number and predict, what do I get out — is my model sensitive to that change or not? It gives you a kind of what-if scenario. Now, it's not quite right, because if you actually had a higher PAY_3 you wouldn't be the same point anymore, so the rest of the model might not apply either; the model never saw that kind of point, so why should it be good at predicting for it? If you change it too much you make it artificial, and the model never saw those artificial points, so how can it predict for them? It's hard to extrapolate too far outside the data, but maybe a little bit — you can get a feel for it. You can also download these reason codes and scoring pipelines; all of this can run outside of Driverless in Python. There are a lot of other things you can do. You can get Shapley values, which are the exact contributions to that x value I mentioned earlier, before you go up the logistic function. So for a given point you can see that the contribution of PAY_AMT6 was this much — it actually made my probability slightly lower — while PAY_0, PAY_3 and PAY_2 made it much higher. PAY_2 made my likelihood of defaulting much worse: it added a big positive contribution, going right on the x-axis, which increases my probability. These are additive terms: if you add up those gray bars, the pluses and the minuses, you get to some point on the x-axis, and then you go up the logistic function and that's the probability. So it's super easy to understand these contributions. This contribution here is from a random forest — that's a surrogate model — but you can also do it on the actual model. Our actual model also has these Shapley contributions, so you can ask our full ensemble: what is the contribution of this particular feature to the final outcome probability?
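For comparison, here is a hedged sketch of the same idea with open-source tools — per-row additive Shapley contributions on a tree model via the shap package, on synthetic data rather than the credit card data set or the Driverless AI ensemble:

```python
# Per-row additive contributions (Shapley values) for a gradient boosting
# model; contributions plus the expected value give the log-odds margin,
# which the logistic function turns into the predicted probability.
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)
contribs = explainer.shap_values(X)   # one additive contribution per feature per row
# For binary XGBoost this is in log-odds space:
# contribs[i].sum() + explainer.expected_value ~= margin of row i.
```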
Now, the random forest numbers for Shapley, and LOCO, give contributions for the original features, the ones you gave us; this one here uses the features that were transformed by us, because, as I said, we first enrich the data and then give it to XGBoost, so the XGBoost models only ever see transformed features. So here you get a list of transformed features. Because I set the interpretability very high, I kept them the same as the originals here, so now we have the actual impact of the original features in our tuned models. This is probably the most useful one, because it tells you exactly what the contribution is for this particular row — eleven thousand one ninety-seven — how much did this feature contribute to the outcome? And this is an original feature, because we didn't transform it in this model. Of course, if your original feature is a text field, we do have to transform it somehow, and at that point it becomes more complicated: it would say it's the CNN's likelihood encoding of the text for this row, and then you have to go look up that row's text and say, ah, because the text was such-and-such, it means this high probability. So you have to go back a step for those cases, but we could make it easier to highlight the original features for the row — any feedback you have, I would love to hear.

Now that we have this model, we can download the automatic report that is being built. There are other buttons here: we did "interpret"; we can score on another data set — you can even upload a test set and score on it. You can also give us another data set that we will transform, basically if you want to see inside our feature engineering; we'll spit out a feature-engineered version of that data set. Then you can say: I want to download the Python scoring pipeline, the standalone one, or I want to make a MOJO version, which is the Java version of that scoring pipeline. It just built it; now we can download it, and if you open it, it actually has a jar file — see, it's a pure-Java standalone scoring pipeline. You can also download the experiment summary and the logs. The experiment summary has a bunch of files in it — text files and JSON files. They load as pandas frames, since it's JSON, so you can import them: they'll show you all the features, what the importances of the different features are, and the tuning tables, so you can see what different models were tried — XGBoost and so on — and how well they're doing. There's a full leaderboard with hundreds of models and you can see exactly how they're doing; you can of course turn that into a pandas frame — it's JSON — so you can quickly look at it in a Jupyter notebook and do whatever you want with it. And there's a report PDF, which is the auto-doc report I mentioned earlier, for regulatory use. It's also given to you as a markdown file, so you can change and modify it if you want to add text and say: this is what I did, and this is why I did it. We're working with banks to customize those reports so that they can have their own version of the report that they want, and it will tell you how many models were built for each phase, what the parameters were, and so on.
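A hedged sketch of pulling those experiment-summary JSON files into pandas in a notebook; the file and column names below are placeholders, since the exact names in the summary archive vary by version:

```python
# Load experiment-summary JSON files (placeholder file names) into pandas
# frames for quick inspection in a notebook.
import pandas as pd

features = pd.read_json("features.json")          # engineered features and importances (placeholder)
tuning = pd.read_json("tuning_leaderboard.json")  # model/parameter tuning table (placeholder)

# assuming an "importance" column exists in the features table
print(features.sort_values("importance", ascending=False).head(20))
```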
By the way, if you're interested in how Driverless AI was built, you can go to blog.h2o.ai; there's a full blog post on the making of Driverless AI, which goes into much more detail about the history of the product and how we originally started thinking about it: how the different pieces such as datatable or MLI were conceptualized and then implemented, the GPU algorithms and the collaboration with NVIDIA, the Kaggle grandmaster ideas, the overall application stack, the GUI and how it was first designed and came together, automatic visualization, the first promising Kaggle results, the booklets that were made, the architecture, all the different features that were added in rapid succession, the time series thinking behind the whole recipe, and also the TensorFlow text stuff I mentioned. And there was a whole conference in London last month where we gave a lot of presentations and lots of customers talked about it, so there's lots of good information there — a Kaggle grandmaster panel, all the videos on YouTube, everything's public, you can look at it.

And then the future — this is the roadmap we're going to talk about. The integration of the H2O-3 platform with Driverless AI, this convergence, is happening now. You might want to, for example, create a good H2O-3 model using Driverless: Driverless can tell you what to do, and H2O-3 will do it on petabytes of data on a Sparkling Water cluster, for example — this is a good idea and we will do that. Another idea is to say: now that I have a Driverless model MOJO, the standalone Java scoring artifact, can I put that into Sparkling Water and score on a terabyte, to make faster predictions than pushing it through a single-threaded scorer — can I use my cluster as a scoring mechanism? Or can I train Driverless from inside the open-source product? In the open-source product you might want a button that says "fit Driverless on this data set"; if you have Driverless running you can just connect to it, and it will make you another H2O-3 model, because Driverless can produce that MOJO artifact externally and then we suck it into Java and just score. All the pieces are in place; we just have to wire it up, so we can make a call-out and say: if it's a Driverless model, give me the MOJO back, stick it into the Java MapReduce, and have a parallel scoring engine that acts like an H2O-3 model, for which you can compute metrics, score it, call it from R or Python, deploy it as a MOJO, and so on. This is a very natural interaction between the two platforms. We will make a multi-node, multi-user version — it's already multi-user but not multi-node yet — so there will be more scalability: more people can run experiments, share experiments, score the models on different data sets and see how they're doing. There will be new scoring runtimes, such as a C++ runtime, not just Java, so you can have TensorFlow models; that will come in February for our next H2O World in San Francisco, which you should sign up for if you're interested in these developments. There are also going to be more debugging tools, where we can see things like residual analysis and so on — see where the models are doing poorly at certain points and what's driving those mistakes. And the thing I also like is the manual fine control. Let's say a pipeline comes out and it says this is a really good feature engineering pipeline, and you look at it and say: no, it should not be doing age minus zip code, that's a bad feature, it doesn't make sense — but we automatically figured it out as a useful feature.
You should be able to X it out and say: go away. So there's this kind of final manual inspection of your pipeline, where you can mess around with that state and say: no, this feature, age plus income, is actually good, but not age minus income, or something — you can just play with it and then say: now submit, fit the model again. This is also on the roadmap: being able to control your final model, so we make a guess and then you approve it, so to speak. That way you have more control. Right now your only control is which buttons to press and what data to give us, which is still nice, because you don't have to worry about the hyperparameter tuning and the feature engineering, but sometimes you do care about the features that are being engineered, so you can explain them to the end user, and sometimes it's useful to be able to say: everything that happens here I can vouch for — not just some black box. So we definitely want to make that better. Here are some links, so if there are any questions, this would be a good time, I think; if I forgot anything I'll be more than happy to go back and show you.

There's a question here. The question was about the time series and the correlations: is it all done with a single time series — for every group you say there's one time series, so if you have a store and a department and two years of data, that is one series to us? Yes, that's a good point, and that's exactly what's happening, because in one data set we have all the different time series. All the different groups — stores, or let's say sensors — each have their own time series. For every date there is a sensor ID, so you have the same date a thousand times, for all of the thousand sensors, each with its own value. The order doesn't matter, but each row has a date stamp, the sensor ID and a value; for the same day there are a thousand rows — same day, different sensor, different value — and the next day another thousand sensor IDs with different values, and so on, maybe for ten years. So we actually have a thousand time series going on, and we can make lag features for all of them at once after we do some grouping in datatable; see the sketch after this answer. Those thousand time series are handled all at once: it's one time series per group, but the data set has all thousand of them in it. And then when XGBoost looks at them, of course it picks up all the interactions, so it can see: if the sensor ID is seventeen and the value six months ago was five, and this other sensor's value five months ago was this and this, then group them together with a split in XGBoost, for example. So yes, they definitely interact — that's one of the strengths of this recipe: we don't make a thousand models, we have one massive model that looks at all the data together. [Question: so if something happened in the first time series, maybe something similar will happen in the second one?] Yeah — we do have the lags for all the different groups, and I think it is presented to the algo already, because you have all the lags for all the different groups at once and you can present that to XGBoost, let's say; if it splits them properly, that will happen. But you maybe have to make ratios and not just values.
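Here is a minimal pandas sketch of those per-group lag features: all of the sensor or store series live in one long table, and lags (and lag ratios) are computed within each group so that a single model can see every series at once. Column names are illustrative, not the Driverless AI recipe itself:

```python
# Per-group lag features on a long table of many time series.
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2018-12-01", "2018-12-01", "2018-12-02", "2018-12-02"]),
    "sensor_id": [17, 42, 17, 42],
    "value": [5.0, 3.0, 6.0, 2.5],
}).sort_values(["sensor_id", "date"])

for lag in (1, 7, 28):   # shift by rows within each group; with daily data a row is a day
    df[f"value_lag_{lag}"] = df.groupby("sensor_id")["value"].shift(lag)

# ratios of lags (the "next-level" trick mentioned here) also stay within a group
df["lag_ratio_1_7"] = df["value_lag_1"] / df["value_lag_7"]
```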
That kind of thing is the next-level secret sauce: you don't want to just predict the value, you want to predict a change of value, and maybe you want ratios of changes from the past between different lags, and so on. Are there interrelations between lags and groups? We do that between lags for the same group, so we can take, for example, the ratio of the value a year ago and three years ago — the autocorrelation part. But cross-correlation to other groups, at the same lag or at different lags, can only happen as an XGBoost effect, not intentionally by us directly. If you had a neural net as the main model — which we do have — then the neural net could do more than simple if-else splits, so it could come up with that, but we don't materialize it as an extra feature. Yes, especially LSTMs with a lot of windows — you have to pad everything, and even 32-gigabyte Voltas aren't enough anymore if you want that densification of time series for small, millisecond latencies and so on. Any other questions?

One thing I didn't mention is that the time series MLI also shows you the actual and predicted values against each other, so there's a time series visualization of the actual outcome. It's a nice feature: you can see where and how well you're doing for each group, so for each store you can see predicted versus actual plotted in the GUI, and you can also click on any point in time and it will tell you why it made that prediction — it's because of this and this feature. So we have the whole Shapley importances for time series, for each group, in a visual way as well. Cool — thanks a lot for your attention, and I hope to hear more of your feedback. Thanks.
Info
Channel: H2O.ai
Views: 2,719
Rating: 5 out of 5
Id: FI8qj7xcI8M
Length: 96min 46sec (5806 seconds)
Published: Tue Dec 11 2018