Introduction to H2O Driverless AI by Marios Michailidis

Video Statistics and Information

Captions
Just a few words about me: I work as a competitive data scientist for H2O.ai, a company that primarily creates software in the predictive analytics and machine learning space, so my main job is to make our products as predictive as possible. I did my PhD in machine learning at University College London; my focus was on ensemble methods, that is, how we can combine all the different algorithms available out there in order to get a stronger, better prediction. I'm not sure if people have heard about Kaggle — Kaggle is the world's biggest predictive modelling platform, where different companies post data challenges and ask data scientists to solve them, and there is a league, a ranking, for those who solve them best. At some point, having won multiple competitions, I was able to get to the top spot there, out of, I don't know, half a million data scientists. But what I take away from this is that I have been able to participate in a lot of different challenges, see the different problems companies have in the machine learning space, and incorporate at least some of that back into our products to make them more efficient and more predictive.

As was mentioned before, H2O's goal is to democratize AI: to make certain that people can use and leverage the benefits of AI without feeling that the field is too difficult to enter. We are very proud to have a very big community, powered primarily by our open-source products — around 200,000 data scientists use them, along with some major organizations and banks. Broadly, we have two main product lines. On the open-source side there is the core H2O library, available from different programming languages like Python, R and Java, which contains many machine learning algorithms implemented in a distributed manner. We have also adapted it to work efficiently in a Spark environment, which we call Sparkling Water, and there is a version that is more optimized for GPUs. What we will be focusing on today is the tool we call Driverless AI, which tries to automate many steps of the machine learning process and to give you a good result fairly quickly; we will go deeper into what this specific product does.

Using this tool we have had some success in the competitive environment I mentioned before, Kaggle. For example, there was a competition hosted by BNP Paribas where Driverless AI was able to get into the top ten out of three thousand teams within two hours. I know that was super hard, because I also participated, and it took me around two or three weeks to get near where Driverless AI got within two hours. That gives you an idea of how much predictive power you can get by using a tool like Driverless AI.

Generally, the typical workflow when you work on a data science problem looks like the one on the screen. You normally have a data integration phase, where you try to gather and collect all your data from different data sources — maybe different tables, different SQL databases — and, after doing multiple joins, you put it together into one tabular file, where every row in that dataset is one sample, let's say one customer.
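As a rough illustration of that data-integration step, here is a minimal pandas sketch. The table names, columns and values are made up for illustration, not taken from the talk; it joins a customer table with an aggregated transactions table so the result stays one row per customer.

```python
import pandas as pd

# Hypothetical source tables pulled from different systems
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [25, 41, 37],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [10.0, 25.0, 40.0, 5.0, 7.5, 12.0],
})

# Aggregate the child table so the join stays one row per customer
spend = transactions.groupby("customer_id")["amount"].agg(["sum", "mean", "count"])
spend.columns = ["total_spend", "avg_spend", "n_transactions"]
spend = spend.reset_index()

# One tabular file: every row is one customer
dataset = customers.merge(spend, on="customer_id", how="left")
print(dataset)
```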
Then a very iterative process normally starts from a machine learning, or data science, point of view, where you do multiple experiments, playing with different algorithms and training after defining different validation strategies, and you keep looking at the results you get and keep iterating until you improve on the problem and get the best result possible. This is where Driverless AI primarily sits: once you have a dataset with some property you would like to predict and build algorithms around, Driverless takes over and uses multiple machine learning applications in order to get the best result possible given some constraints.

To be a bit more specific, the way it works is this. You start with a tabular dataset — you can imagine something in an Excel-like format — and you normally have a target variable, something you try to predict out of this dataset. For example, can I predict somebody's age based on some characteristics? That is a regression problem. Or can I predict whether someone will default on his or her loan given some past credit history data? That is essentially a binary classification problem. Driverless can handle multiple different types of what we call supervised problems.

The next thing you need to do is define an objective, a measure of success: do I want to maximize a form of accuracy, or do I want to minimize a form of error? There are various objectives you can specify to make your model focus on specific areas. Then you allocate some resources. Obviously you are bound by the hardware you are running Driverless on, but you also have the ability to control how much intensity the software puts on maximizing accuracy and how much time is spent doing that, so you can make accuracy a function of time: if you don't have much time, you can essentially tell Driverless to do the best it can quickly, and if you have a lot of time, Driverless can normally reach higher accuracy. Given that accuracy setting, the hardware limitations and the time available, Driverless uses all of this to start giving you outputs.

Those outputs come in multiple forms. There are general insights and visualizations. There is what we call feature engineering: most machine learning algorithms prefer the data in a certain format — I will explain more about this later — and through Driverless you have the option to extract this transformed view of the data that can maximize your accuracy and, for example, try your own algorithms on it. You can also get the predictions for your problem based on the algorithms we use. And there is a module called machine learning interpretability, which is becoming increasingly important because it promotes accountability: it essentially tries to explain how a model makes its predictions in humanly readable terms — can I understand how my black-box model works in simple terms? It is a very important area that gets a lot of focus, in order to make certain that businesses can trust AI in today's world.

On visualization, we have a great guy behind it, Leland Wilkinson. He wrote "The Grammar of Graphics" and was one of the first people at Tableau, and he has built a really clever automated visualization process which consists of several algorithms that
scan through your data and try to find interesting patterns. As I said before, and I want to highlight it again, this process is fully automated. It will actually search for everything, but it will not show you everything: it will only show you patterns within the data that are important. For example, some graphs focus on outliers, showcasing and highlighting features in your data that contain outliers and pinpointing them, so that you can see them and decide whether you want to keep them or not — whether they are mistakes or just extreme cases. Other graphs focus on correlations and clusterings within your data; others could be heat maps. In general it is a comprehensive and detailed process which is completely automated and aims to give you quick insight about your data. So if you don't know what to look for, this visualization process can surface some good patterns quickly and give you a general sense of what your data is about.

Now I'm moving on to the phase where Driverless AI actually spends quite some time, because, also from my experience, most of the predictive power is in your features, and how you transform them is very important and critical for getting good results. Let me give you an example. Say you have a feature in your data called "animal", which takes distinct values like dog, cat and fish, and a target where you try to predict cost, and this is just one of your input features. There are multiple ways to represent this variable — most algorithms understand numbers, not letters, and even the applications that accept letters use a numerical representation under the hood. One way to transform it is frequency encoding, where you count how many times dog or cat appears in your data and replace the value with that count, so you get a variable that says how popular each animal is. Or you could use something quicker, called label encoding, where you sort all the unique values of animal and incrementally assign a unique index to each. Or you could use dummy coding, also known as one-hot encoding, where you treat each distinct category of animal as a binary feature: is it a dog, yes or no; is it a cat, yes or no. And, as I mentioned before, since cost is your target variable, something else you can do is estimate the average cost within each of your categories — the animals, in this case — and create a feature that maps this. So you now have a feature that maps the target, and quite often, especially if you have lots of categories, this kind of representation can really help algorithms converge to a good result more quickly. There are many different flavors of all these transformations; I'm just showing you, at a high level, the different transformations we might consider. We always search for the best ones, and the answer is not always clear — sometimes you really need to go through all of them in order to find which one works best.
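As a rough sketch of those four encodings on a toy animal/cost example (the data values are made up, and in practice a target encoding like Driverless AI's is computed out-of-fold to avoid leakage):

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["dog", "cat", "dog", "fish", "dog", "cat"],
    "cost":   [10.0,  7.0,  12.0,  3.0,   11.0,  6.0],
})

# Frequency encoding: how often each category appears
df["animal_freq"] = df["animal"].map(df["animal"].value_counts())

# Label encoding: an arbitrary integer index per category
df["animal_label"] = df["animal"].astype("category").cat.codes

# One-hot (dummy) encoding: one binary column per category
dummies = pd.get_dummies(df["animal"], prefix="is")

# Target encoding: mean of the target per category
df["animal_target_enc"] = df["animal"].map(df.groupby("animal")["cost"].mean())

print(pd.concat([df, dummies], axis=1))
```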
Another type of transformation to consider: imagine you have a continuous variable, age, and you try to predict income. This is often not a very straightforward relationship. When you are young your income is low, and then it increases at quite a fast pace; when you reach middle age the income still increases, but at a slower pace; and once you go towards retirement, income starts decreasing. So there are shifts in the relationship between your input feature and what you try to predict, which is income. Being able to spot this and create features that specifically point to these changes in the relationship — for example through binning, transforming the numerical feature into a categorical one so that instead of a numerical value you say "this is the band this age falls into" — can really help some algorithms achieve better performance.

Another form of transformation is how you replace missing values: you could use the mean, the mode or the median, or you could treat missing as a separate category in a categorical context. There are other transformations to consider too, like taking the logarithm or square root of a numerical feature; sometimes these forms of scaling help minimize the impact of extreme values and help some algorithms converge faster and give better results.

Another type of feature we consider is interactions among features: can I create a more focused and more powerful feature by combining two together? For example, we could multiply or add two features, or apply other mathematical operations. If you have two categorical features, we might concatenate them into one single string. If we have a numerical and a categorical feature, we can explore interactions in the form of a group-by statement: we can estimate the average age per animal, in this case, and create a variable that captures this — and you don't need to limit yourself to averages; it can be the maximum, the standard deviation, any form of descriptive statistic can go here. Or you could even bin the numerical feature, with the technique I showed you before, convert it to categorical, and then use the concatenation technique to make it one bigger string.

Similarly, text has its own ways of being represented in order to get the most out of it with machine learning algorithms. Something we quite often do is this: out of all the possible words you have in your data — say all your rows have a field called "description" — we tokenize, that is, we break each word down into a single feature, a single variable essentially, and then we count how many times each word appears in each row out of all the possible words. We call that the term frequency matrix; it comes in different versions and flavors, but the basic idea is that some words are very indicative of what the sentence is trying to say. There are other techniques for pre-processing the text as well: for example, applying stemming, which is removing the suffixes from words — you might have "playing", but the core of the word is "play", so you can just use that in your analysis; spell checking; trying different combinations of words; and removing words that are repeated very often but do not add much value, like "the" and "and". And there are techniques that help you compress this huge matrix of all the possible words down to fewer, more explainable dimensions — you can use something like singular value decomposition to do this.
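Here is a minimal sketch of that text pipeline — a term-frequency matrix with stop words removed, compressed with truncated SVD — using scikit-learn. The example sentences are invented, and Driverless AI's own text transformers are more elaborate than this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

descriptions = [
    "dog playing in the park",
    "cat sleeping on the sofa",
    "dog barking at the cat",
]

# Term-frequency matrix: one column per word, counts per row,
# with common stop words ("the", "and", ...) removed
vectorizer = CountVectorizer(stop_words="english")
tf = vectorizer.fit_transform(descriptions)
print(vectorizer.get_feature_names_out())

# Compress the sparse word matrix down to a few dense dimensions
svd = TruncatedSVD(n_components=2, random_state=0)
tf_compressed = svd.fit_transform(tf)
print(tf_compressed.shape)  # (3, 2)
```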
Another technique is word2vec, which is based on deep learning and tries to represent each word with a series of numbers in a way that lets you do mathematical operations between words: if from the word "king" I subtract the word "man", the closest result that comes out is the word "queen". I have seen it work this way — it doesn't always work quite so neatly — but this representation can definitely give you very good insight about what a word is really about, and features derived from it can help you a lot in NLP problems.

Other feature engineering is applied to time series data. In the simplest form you may just decompose the date: which day of the month it is, which year, the week number, whether it is a holiday. But quite often the features are derived from the actual target variable versus time: I want to predict sales today, so can I use the sales from yesterday, or the sales from two days ago, as my features — essentially lag one and lag two? I could even take aggregated measures or windows based on these lag values, for example creating moving averages over the same periods. This is extremely basic — I'm only touching the surface of what the software does — but it gives you a high-level idea of the different features the software will explore when trying to make better predictions for different problems.

These are some of the packages we use. They are not the only ones, but they are some of the best known. The key point is that we obviously capitalize on our open-source heritage and use many of our own libraries, but at the same time we also use other open-source tools which have done extremely well — they have won multiple awards and performed extremely well in competitive contexts — like LightGBM from Microsoft and XGBoost for gradient boosting applications, random forests from scikit-learn, and Keras with the TensorFlow backend for our deep learning implementations. A lot of our data munging happens with NumPy and pandas, but we are also slowly transferring to datatable. It's an open-source tool that H2O develops; it doesn't have the depth of pandas yet in terms of functionality, but it is extremely efficient and very quick, handles memory extremely well, and supports most of the major operations. I advise you to have a look if you haven't tried it — it's available in R as well as Python.

Obviously, just picking a machine learning algorithm out of the box is not going to give you the best results. All these algorithms are heavily parameterized: they contain a lot of hyperparameters you need to tune in order to make them perform well on a specific problem. Consider an XGBoost algorithm, which is essentially a weighted ensemble of decision trees built sequentially: for each tree you can control how deep you should grow it, which loss function you use to expand the trees, what the learning rate should be — how much each tree relies on the previous ones when it gives you predictions — and how many trees you should put in the ensemble. These are just some basic parameters; there are a lot more. To get good results you need to find good values for these parameters, and this is something Driverless AI also does automatically, just as it does for the feature transformations you have seen.
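As a rough illustration of that kind of automated hyperparameter tuning, here is a generic scikit-learn/XGBoost sketch with made-up parameter ranges — it is not Driverless AI's actual search, and it assumes the xgboost and scikit-learn packages are installed:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A few of the knobs mentioned above: tree depth, learning rate, number of trees
param_distributions = {
    "max_depth": [3, 4, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "n_estimators": [100, 300, 500],
    "subsample": [0.7, 0.9, 1.0],
}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions,
    n_iter=20,        # try 20 random combinations
    scoring="roc_auc",
    cv=5,             # evaluate each combination with 5-fold cross-validation
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```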
To make good decisions within Driverless AI, we try to create a good internal testing environment, so that we can try a lot of different things — a lot of different transformations and algorithms — and have confidence that they will work well on unobserved data. For example, in a time series setting, where we know time is very important, we can use different variants, but the basic idea is that we always train on past data and validate our models on future data. There can be various flavors of this; one we like to use a lot is validation with many moving windows, or rolling windows, where we build multiple models on different periods, always shifting the validation window through time and building the models only with data from before that window, as a way to make certain we have a model that can generalize well in any period.

When the data is essentially random with respect to time, we will most probably use a form of k-fold cross-validation. What this says is: I am going to divide, to separate, my dataset into K parts — they don't need to be sequential like in my example, but take it as an illustration. Then, K times, you take one part of the data out, you fit an algorithm, or try different hyperparameters, on the rest, and then you make predictions on that held-out part and save the results — how well you did, for example in terms of accuracy. You repeat this process with a different part of the data acting as the test, or holdout, each time, until essentially every part of your data has been part of the holdout at some point, and then you get an aggregated metric of how well you have done. From that you can judge how good the algorithm was, how good the hyperparameters you selected for it were, and whether the feature transformations you tried were good enough.

So how do we decide on all of these things? Theoretically, the space of combinations you can form from different algorithms, different features and different hyperparameters is really, really huge, so we found a way to optimize this in an evolutionary way in order to get good results fairly quickly. Let me go inside one Driverless AI iteration and show you what it does to come up with good models, good features and good parameters. Imagine you have a very simple dataset in this format: four numerical features and one target you try to predict. What Driverless will initially do is take those four features, decide on a cross-validation strategy — normally based on the accuracy setting you chose at the beginning, how much accuracy you want — then pick an algorithm semi-randomly, put some initial parameters on it, tune those parameters a little based on cross-validation, and get an X percentage of accuracy from this test framework, for example the k-fold cross-validation. It will also come back with a ranking stating which features are the most important. We can use this ranking to reinforce, to make better decisions, once we start the next iteration: for example, from this ranking maybe I can infer that the feature x1 doesn't seem to be so important, so going forward I'm not going to spend so much time on it, whereas x2 and x4 seem a little more promising.
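As a rough sketch of what one such evaluation step looks like, here is a generic scikit-learn stand-in — score a candidate algorithm with k-fold cross-validation and then pull out a feature ranking to guide the next iteration. This is an illustration only; it is not Driverless AI's internal code, and the feature names x1–x4 are just the toy example from the talk.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Toy dataset: four numerical features and one target
X, y = make_regression(n_samples=500, n_features=4, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)

# Score the candidate algorithm with k-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print("mean CV score:", scores.mean())

# Fit once to get a feature ranking for the next iteration
model.fit(X, y)
for name, importance in zip(["x1", "x2", "x3", "x4"], model.feature_importances_):
    print(name, round(importance, 3))
```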
So once I start the second iteration, I am going to capitalize more on the features that seem to have more promise, either by trying better individual transformations of them or by exploiting their interactions, but at the same time I am going to allow some room for random experimentation. I don't want to get trapped in a very directed approach of drilling into the data; I always want to leave some room for searching, in case I find some other interesting pattern. And the process continues: I will pick an algorithm, which could be the same one or a different one; I will pick some parameters for that algorithm, which again could be similar to the ones before or different; I will slightly tune those parameters based on the validation strategy that was selected; we will get a new percentage of accuracy; and it will come back with new rankings of which features are the most important. And this is not limited to features — the ranking extends to algorithms and to hyperparameters — so after a few runs we have a good idea of what is working and what is not, and we keep optimizing where we see there is more juice, again always allowing some room for random experimentation. So it is an exploration–exploitation optimization approach, which has its roots in reinforcement learning.

OK, briefly — maybe I'll skip part of this — we obviously do a lot of work to determine which features are the most important ones in your data, so let me quickly mention how. As you saw, our process always comes back with a ranking, and the way we understand how good a feature is, and create this ranking, is as follows. Assuming I have a dataset, I can split it into training and validation; I fit an algorithm on my training data and, with this fitted algorithm, I make predictions, trying to predict the target on the validation data, and that gives me an X percentage of accuracy — let's say 80%. What I do next is take that first column, that first feature, in the validation data and randomly shuffle it. Now I have one feature which is wrong in my data while everything else is correct, so if I repeat the scoring with the same algorithm, I expect the accuracy to drop — and how much the accuracy drops is essentially how important that feature was. Normally this ranking is very intuitive and very powerful for understanding which features really are the most important to include in your algorithms in order to get the best results. You then repeat this process for every other feature. It's a good, quick way to understand the main key drivers in your dataset.
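Here is a minimal NumPy/scikit-learn sketch of that shuffle-one-column idea (permutation importance). It is a simplified stand-in for what was just described, not Driverless AI's implementation; scikit-learn also ships a ready-made `sklearn.inspection.permutation_importance` that does the same thing more robustly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = accuracy_score(y_valid, model.predict(X_valid))

rng = np.random.default_rng(0)
for j in range(X.shape[1]):
    X_shuffled = X_valid.copy()
    # Corrupt one feature only, leave everything else intact
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    score = accuracy_score(y_valid, model.predict(X_shuffled))
    print(f"feature {j}: importance = {baseline - score:.3f}")
```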
Then we use a process called stacking. Because this whole process runs iteratively, we come up with various models and feature transformations that could work well, so Driverless has a process that tries to combine all of these — it tries to find the best way to combine them in order to get the best result possible. In simple terms, imagine I have three datasets, A, B and C: A could be my training dataset, B my validation dataset, and C the dataset I eventually want to make predictions for, the test dataset. What I can do is take an algorithm, fit it on the training dataset A, then make predictions for dataset B and dataset C, and save these predictions into new datasets. I can continue this with another algorithm: I pick a different algorithm, again fit it on dataset A, make predictions on B and C, and stack these predictions onto the newly created datasets. I can keep doing that until I essentially have a dataset that consists of the predictions of multiple different algorithms. Now I can use the target of the validation dataset with another algorithm to find the best way to combine all these models, all the different algorithms I used: I essentially pick one new algorithm, fit it on this stacked dataset of predictions, and find the best way to combine all the different algorithms to give a final prediction for the test dataset. This is a well-established approach called stacking, or stacked generalization, introduced by Wolpert in 1992, and it can normally give your predictions a good boost.

The last part, before I pass over to my colleague — or maybe before the break — is machine learning interpretability. This is a very important area for us, because it can promote accountability and bridge the gap between black-box models and something people can feel comfortable with and understand. I think there are two main approaches colliding, if I can use this term. There is the approach that says: I want something which is 100% interpretable. For example, I look at my data and see that everybody who is less than 30 years old has a 30% chance to default on his or her credit card payment, while everyone who is more than 30 years old has a lower chance, maybe 20%; that is my model, and this is the model I want to put in production. I have measured these values on historical data and I am 100% certain how the model works, so there is clear accountability. But I could probably get much better accuracy if I combined more features and built something a little more complicated that can search for deeper patterns in the data — although then I cannot have an exact explanation of how it works.

So what we do is use approximate explanations. The idea is that you take the predictions of your complicated model — in this case the Driverless AI model — and you try to predict them with a simpler model: you use a simpler model in order to understand the complicated one. That simpler model could be a regression model or a decision tree, which can give you an approximate understanding of how the complicated model works, and you can build different reason codes and representations that help you understand, not only at a global level but even on a per-row, per-sample basis, why a case has been scored the way it was. For example: this case had a 70% chance to default — 30% because he or she missed a payment last month, add 20% more because they missed a payment two months ago, add a little bit more because they are very young, and so on. Using these approaches, which are essentially called surrogate models, you can get an understanding of how the complicated models work and gain very good insight into how the predictions are made, at a global as well as a local, per-row level.

And the nice thing about Driverless is that once it has built this whole pipeline — transforming the features, building the different models, combining them — you can get different deployment artifacts: one is based on Python, and another, based on Java, is called the MOJO, and you can put them in production and do your scoring through them. That's basically what I wanted to say. I'm happy to take any questions; if you'd like to connect, these are my details, and thank you for the opportunity to present to you. [Applause]
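As an aside before the Q&A, here is a minimal sketch of the surrogate-model idea just described — fitting a shallow decision tree to the predictions of a more complex model. This is a generic scikit-learn illustration on synthetic data, not Driverless AI's actual MLI module, and the "fidelity" check simply measures how well the simple tree mimics the complex model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)

# The "complicated" model whose behaviour we want to explain
complex_model = GradientBoostingClassifier(random_state=0).fit(X, y)
complex_preds = complex_model.predict_proba(X)[:, 1]

# The surrogate: a shallow tree trained to mimic the complex model's predictions
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0)
surrogate.fit(X, complex_preds)

print("fidelity (R^2 vs complex model):", surrogate.score(X, complex_preds))
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(6)]))
```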
"So, will Driverless AI negate, or at least dilute, the need for Kaggle competitions?" I don't think so, because once you raise the bar, people can get to the next stage, and I think it can push things even further, which is good. At the same time, there are various elements of these competitions where a tool like Driverless AI is at a disadvantage, and I say this because it is a tool made to be production-ready. For example, it does not look at the test data in order to improve the model, because in a real-world situation you might only have training data — you never know when your test data will arrive in the future. Kagglers, on the other hand, use this to their advantage: they can see the structure of the test data, which they already have in advance, in order to get a better score. So what I am trying to say is that Kaggle is, in a way, a slightly different world. It's amazing that we have been able to do so well even with these disadvantages against competitors, but no — as I said, you raise the bar and people push it even further, which is good.

"If you need to carefully format data before passing it to H2O, is there any help available as to the best transforms to apply to improve accuracy?" In principle we like the data in raw format, as long as it is tabular, because we iterate through different transformations and try to find the best one. For example, you might have a categorical feature and decide to pass it in as multiple dummy variables, but there might have been another transformation that would have worked better — so actually we prefer people not to do much cleaning. There might be some special cases, particularly in time series, where certain pre-processing of the data can actually help, but that is a bigger discussion. In principle we like the data raw; we are comfortable working with missing values and with unstructured data such as text, and finding good representations for them.

Here is a question I can answer, to give Marios's voice a rest: "Is it flexible with different cloud solutions?" Driverless AI is available on all the major cloud environments — Microsoft Azure, Google Cloud Platform and Amazon Web Services — within their marketplaces, so feel free to go there. There is also a learning environment we encourage you to use, which we will be using for this particular training, called Aquarium. The great thing about the Driverless technology is that you can get up and running with it in a matter of minutes because of those cloud environments it is installed on. Shall I take the next one, Marios? OK, no more rest for me.

"Does Driverless provide an opportunity to manually set or restrict the list of models we want to use for an experiment?" Absolutely. Driverless gives you full control of the parameters you want to set, the models you want to try, and the different feature transformations you might want to block or allow, so if you want to have some control, you can still have it. And with our next version, which is coming soon this month, you should also have the option to add your own models, your own feature transformations and your own metrics through Python. So yes, if you want to have that control,
we can expose it — we are happy to do that.

The next question is about how proactively we accommodate open-source updates. As Marios described, Driverless AI sits on top of a number of open-source packages, and when Driverless AI gets installed, those packages are updated along with it, so every iteration of the product automatically has those open-source packages available to it. One other point I want to highlight, which Marios talked about: when building products, H2O utilizes the open-source community, but what we also want to do as an organization is give back to the open-source community. A really great example of that is the datatable package Marios mentioned, which is now available in Python. That was a package available in R; we felt that to accelerate Driverless AI's data preparation capabilities we needed something better than pandas on the backend, and that the data.table package, if it were in Python, would help accelerate Driverless AI. But rather than take that out of the R world and turn it into something proprietary for H2O and Driverless, we said: let's put that package back into the open-source world — and there is no point putting it back into the R open-source world, because it already exists there. So what we have done is create it, utilize it as part of Driverless AI, and give it back to the community, so you can use datatable in your Python workflows. As Marios said, it doesn't yet have all the bells and whistles of other data-manipulation frameworks in Python, but we are continually developing it and will keep adding to it in the open-source community. I just wanted to add that to really emphasize the first phrase Marios talked about: H2O's mission is to democratize AI — to create tools, whether open source or commercial software, that accelerate that process, but also to give back to the community.

We'll have one more question: "What maturity level are driverless car models at currently?" I have to say that although we are called Driverless, driverless cars are not necessarily our specialty — we haven't worked on this problem. But based on the reports I have seen, performance is actually very good; it seems we can already achieve better-than-human performance, at least in terms of the rate of accidents. The problem — and I think this is where the process is stuck at the moment — is that when there is an accident, there is the problem of accountability: who is at fault, and why the accident happened. This is where we need to work a little more to be able to fully integrate such an AI within society, and that's also why I highlighted the importance of interpretability, something we as a company have obviously taken very seriously.

OK, so if there are no more questions, that takes us nicely on to a break. We'll break for 15–20 minutes and refuel — there are some pizzas downstairs, I believe — so feel free to grab some pizza. Don't eat too much, because I don't want that dreaded graveyard shift where everyone comes back and feels very sleepy. Go down, have some pizza, and when we come back we'll start to explore the product and get hands-on, looking at all the concepts Marios has talked about and how they are integrated into the tool.
Info
Channel: H2O.ai
Views: 8,321
Rating: 4.9615383 out of 5
Keywords: kaggle, h2o.ai, driverless ai, h2o, machine learning, automatic machine learning
Id: GMtgT-3hENY
Length: 45min 23sec (2723 seconds)
Published: Wed Jun 19 2019