Feature Engineering with H2O - Dmitry Larko, Senior Data Scientist, H2O.ai

Captions
[Applause] This is going to be a sales talk, maybe one out of many, at least the first of two. I'm going to give an introduction to feature engineering, and the next talk will be about some more advanced feature engineering. As Sri mentioned, I do Kaggle for a living; they also force me to build the product at H2O, and I do my best in both fields. I've spent the last five years of my life on Kaggle, competing in different competitions, usually not that well, as you can see, but still.

The topic of this talk is feature engineering, and why I think it's very important. A lot of people across the machine learning community, people who are well known in machine learning, agree on one thing: feature engineering is extremely important. But what exactly do we mean by feature engineering? A very simple explanation is at the bottom of the slide: it's how you transform your input so a machine learning algorithm can actually consume it and build good predictions. That's the easiest and simplest explanation of what feature engineering means that I was able to find.

This slide is some sort of motivation. Say we have a 2D space with red and blue points, and we would like to build a linear classifier that can distinguish them correctly. It's actually not possible to build that kind of linear classifier on this data: there is no way to split the two classes with just a line. But if you transform the Cartesian coordinates into polar ones, they immediately become very easy to separate with a single line. Of course you've applied a fairly complex transformation, and you can think of that transformation as feature engineering: you engineered a feature to feed the model, and it lets you build a simpler, more lightweight model compared to, say, trying to fit a random forest to the raw data.

A typical machine learning workflow might look like this: you have a data integration step, then data quality checks and transformation, and after that you have a table you can feed into your favorite machine learning algorithm. Each row in this table represents a single event, and each row has a target you would like to train the model to predict. That's exactly the place where feature engineering takes place. Of course you can argue that the earlier part is feature engineering as well, and in a sense it is, but for this talk we consider only this part. The earlier one is much more complex, and I would say it's a vast area for future research, because there hasn't been much work on how to combine the available structured data to get a good dataset for prediction.
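To make the polar-coordinates example above concrete, here is a minimal sketch on synthetic data: two concentric rings that are not linearly separable in (x, y) become separable by a single threshold on the radius. The radii, noise level and threshold are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_circle(radius, n=200, noise=0.05):
    """Points scattered around a circle of the given radius."""
    angles = rng.uniform(0, 2 * np.pi, n)
    r = radius + rng.normal(0, noise, n)
    return np.column_stack([r * np.cos(angles), r * np.sin(angles)])

inner = make_circle(1.0)   # class 0 ("red")
outer = make_circle(3.0)   # class 1 ("blue")
X = np.vstack([inner, outer])
y = np.array([0] * len(inner) + [1] * len(outer))

# Feature engineering: Cartesian -> polar
r = np.hypot(X[:, 0], X[:, 1])        # radius
theta = np.arctan2(X[:, 1], X[:, 0])  # angle

# In polar space a single threshold on r already separates the classes
accuracy = ((r > 2.0).astype(int) == y).mean()
print(f"accuracy of a single threshold on r: {accuracy:.2f}")
```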
So what is not feature engineering, from my point of view? The initial data collection is not. The creation of the target variable is not, because that should be business-driven: you have to have some business need to predict something in order to fulfill some business goal. Removing duplicates, handling missing values and fixing mislabeled classes is data cleaning; of course a machine learning model should be more or less stable if you feed it a duplicate or a missing value, but handling them is not the goal of feature engineering, it's data cleaning. Scaling and normalization is not feature engineering by itself either; it's more like data preparation for specific models. For neural nets, for example, you always have to scale the data, otherwise gradient descent won't work the way you expect. Feature selection is not feature engineering per se, but I'm going to mention it in a couple of places in this talk.

So this is the feature engineering cycle: you have a dataset, you have a hypothesis to test, you validate the hypothesis, and you apply it; by applying it you create new features based on the existing ones, and you repeat this process over and over in pursuit of a better model. Where do the hypotheses come from? Obviously, if you have domain expertise, that's a significant source of knowledge; that's how you can build different features out of your existing features. If you don't have domain knowledge, you can use your prior experience and the nature of the data: is a field numerical or categorical, how many category levels do you have, how are your numerical features distributed, and so on; exploratory data analysis can help you with this. And, my favorite part, you can use specific machine learning models, and by analyzing the model itself you can get insight into how the data is structured and what kind of feature engineering transformations could give you a better model. And validation, of course.

Feature engineering is a hard problem, especially if you try to apply a powerful transformation like target encoding, because in that type of transformation you explicitly encode the categories using information about the target. That can introduce leakage into your data and your model: the model will fit the training data very well but be completely useless in real-life usage. Again, domain knowledge is extremely important, especially if you have a specific intuition about the nature of the data. For example, in shale oil, if you analyze well data, how a well is actually drilled, that's a physical process; the results of the physical processes happening inside can be expressed with formulas, and that's knowledge you can put into your model as well. And of course it's time-consuming, especially if you have a lot of data, because you have to run your model against it, you have to test how good each feature is, and to do that you may run thousands of experiments, especially if you rely on EDA or prior experience.

As I mentioned, simpler models tend to give you better results. Ideally it would be nice to find some golden features and just fit a linear model on top of them; that would be the best possible scenario, because I always prefer simplicity over complexity of the model. In real life it's never the case: you still have to apply quite complex models like random forests, gradient boosting or neural nets to get results. But good features can still help the model converge faster. We can discuss three key components of this: target transformation, feature encoding and feature extraction.
Target transformation is something you can use to transform your target variable, and it's especially useful for regression problems. Say your target is not normally distributed, it has a skewed distribution. In that case you can apply some transformation to make the distribution of the target more like a normal, bell-curve shape; a log transform, for example, usually proves to be very good. This came up in a few Kaggle competitions, for example the Liberty Mutual property inspection prediction, where you try to predict the outcome of a property inspection. On the x-axis you see different experiment runs with random parameters, and on the y-axis you see the score, normalized Gini in this case, and you can see how much the models vary. The green line is the model trained on the log-transformed target, and its variation is smaller compared to the other models; the standard deviation of its results is much smaller, which is good, because it means the model is more stable, even though in some cases the other models outperform it. In the end we're looking for stability, and stability is usually better than the best score at some single point.

Feature encoding is another interesting topic: how do you encode your categorical features? Most machine learning algorithms expect you to provide numerical data, numbers basically, and a category like gender or color is obviously not a number. The easiest way to handle this is one-hot encoding, which you're probably all familiar with. You can also do label encoding, a very simple technique where you just replace each category with some integer number. That can be a bad representation, though, because you introduce an order into your data, and in most cases there is no order; there's no order in colors, especially if the integers are assigned randomly. So one-hot encoding is usually a good idea, but if you have a lot of levels in your category it becomes too huge. As an example of label encoding: you have three categories, say A, B and C, you map each category to a specific integer and replace it with that number in your dataset. For one-hot encoding, with the same data, you have to create three different columns to represent those three levels. One advantage of one-hot encoding over label encoding is how it handles a new category: it will simply be all zeros, so if you fit a linear model you're left with just the bias.

You can also do frequency encoding: you encode your categories using their frequency. You count how many times you see each category in your dataset and normalize it, dividing by the total number of rows, to get the frequency. You can think of it as the probability of meeting this particular category in your data. In that case you highlight the less frequent categories quite well: category C appears just two times in the dataset, which means it has a very low frequency compared to the rest.
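Here is a minimal pandas sketch of the three encodings just described, label, one-hot and frequency encoding, on a made-up A/B/C column:

```python
import pandas as pd

df = pd.DataFrame({"feature": ["A", "A", "A", "A", "B", "B", "B", "C", "C"]})

# Label encoding: map each level to an arbitrary integer
df["label_enc"] = df["feature"].astype("category").cat.codes

# One-hot encoding: one binary column per level
one_hot = pd.get_dummies(df["feature"], prefix="feature")

# Frequency encoding: share of rows carrying each level
freq = df["feature"].value_counts(normalize=True)
df["freq_enc"] = df["feature"].map(freq)

print(pd.concat([df, one_hot], axis=1))
```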
The disadvantage of that approach, of course, is that if two categories have the same frequency in your data, the model won't be able to distinguish between them; say B and C both have the same frequency. And yes, I've again introduced an order here, but in this case the order actually means something, because it is the frequency. That kind of order is usually a good thing: I use this technique with tree-based ensembles, and GBMs and ensembles are looking for the best splitting point, so I'm looking for the right order so my model can split the levels easily.

The next approach is target mean encoding. Given the outcome you would like to predict and the feature you would like to encode, you replace each level with the mean of the outcome for that level. In the case of A here, three out of four outcomes are ones, so the probability of the outcome being one given feature A is 0.75; in the case of B it's 0.66; and in the case of C, because we only have ones, the probability of the outcome being one given category C is always one. This seems like a very good approach, but you immediately see the problem: for less frequent categories the information is not very reliable. If you only have two examples, it's not statistically significant; even though both of them are ones, that doesn't tell you much, because it could easily have happened by chance.

To deal with that, instead of just encoding with the mean, we introduce a weighted average between the mean of the level and the overall mean of the dataset. We also have a function lambda that depends on how often you see this level in your dataset: the bigger the count n, the bigger the lambda, and the more the weighted average relies on the mean of the level rather than the mean of the dataset. It's a kind of smoothing, and usually a sigmoid-like step function is used to model this idea. Here x is the frequency, how often your category level appears in the dataset; k is the inflection point, the point where lambda equals 0.5: for the red line k equals 20, for the blue line k equals 2. And f controls the steepness: the smaller f, the steeper the function, so the f of the blue line is smaller than the f of the red line. If f equals zero you get a step function, which is zero until the inflection point and then jumps to one after it.
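A minimal sketch of that smoothing, assuming the common sigmoid form lambda(n) = 1 / (1 + exp(-(n - k) / f)); the k, f and example values below are illustrative rather than taken from the slides.

```python
import numpy as np

def smoothing_weight(n, k, f):
    """lambda(n): close to 0 for rare levels, close to 1 for frequent ones, 0.5 at n == k."""
    return 1.0 / (1.0 + np.exp(-(n - k) / f))

def smoothed_encoding(level_mean, global_mean, n, k, f):
    """Blend the level mean with the overall dataset mean using lambda(n)."""
    lam = smoothing_weight(n, k, f)
    return lam * level_mean + (1.0 - lam) * global_mean

# Category C from the toy example: seen n=2 times, level mean 1.0,
# dataset mean 7/9 (seven positives out of nine rows)
for k in (2, 3):
    enc = smoothed_encoding(level_mean=1.0, global_mean=7 / 9, n=2, k=k, f=0.25)
    print(f"k={k}: encoding for C = {enc:.3f}")
```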
That's the function I was trying to explain. If we run it for different k, from zero to four, you can see the curve shifting with the inflection point, and f controls the steepness: the bigger f, the smoother the function. The case f equal to zero is the simplest one; it basically tells you: if I have fewer than k examples of my category level, I use the mean of the dataset, and if I have more than k examples of my category level in the data, I use the mean of the level, and that's it. By adding steepness you just smooth the result around the inflection point.

Let's go back to the example I showed you. Here f is kept fixed and I'm just playing with k, shifting it from 2 to 3, and you can see how the encodings vary with this change, especially for C. If k equals 2, which is exactly the number of examples I have for C, lambda is 0.5 and I get an even weighted average between the mean of the category level and the dataset mean. If I move k to 3, I immediately get a very small weight for the mean of C and a very big weight for the mean of the dataset. So the bigger k, the more conservative the model becomes and the closer the encoding is to the mean of the dataset; the smaller k, the closer it is to the level mean.

What else can be done? Even in this case, even if you apply Bayesian smoothing, a very powerful algorithm like XGBoost can still find leakage in this dataset. So what can you do? You can join the categories that have a small frequency in your dataset, merging them together to create a bigger category level; that's one approach. The second approach is to introduce some noise into the encoding. Instead of blindly encoding with the mean, you can use a leave-one-out approach: you encode each record using the rest of the records. For row one, which has feature A, you find all the other A rows and aggregate them to get the mean for this particular row; for the second row you do exactly the same but exclude the second row itself, so you always leave one record out, and you repeat this operation for all records in your dataset. It's quite time-consuming, I must say, but it's very reliable; if you have a small dataset, it might be your weapon of choice. As you can see, the encoded value here is slightly different for each row, and you can think of it, as I mentioned, as a way to introduce noise into your data, which helps prevent the model from overfitting: a model trained on this dataset has to be more careful and cannot blindly rely on one single column.

As another technique, instead of doing leave-one-out, you can just add some noise to the computed encoding. Of course the noise is supposed to be independent for each row, and I'm pretty sure it should not be normally distributed; it should be uniformly distributed, because a normal distribution is something machine learning algorithms are designed for. They are very good at approximating anything whose expected value is normally distributed, which is basically what you rely on with normalization anyway.
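A minimal sketch of the leave-one-out encoding with optional uniform noise, on toy A/B/C data whose targets are made up to match the 0.75 / 0.66 / 1.0 example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feature": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "target":  [ 1,   1,   1,   0,   1,   1,   0,   1,   1 ],
})

grp = df.groupby("feature")["target"]
sums, counts = grp.transform("sum"), grp.transform("count")

# Leave-one-out: mean of the level computed without the current row
# (levels that appear only once would need special handling)
df["loo_enc"] = (sums - df["target"]) / (counts - 1)

# Optional: multiply by small uniform noise so the model cannot rely on the column blindly
rng = np.random.default_rng(0)
df["loo_enc_noisy"] = df["loo_enc"] * rng.uniform(0.95, 1.05, size=len(df))

print(df)
```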
The other thing about random noise is that you have to work out the right range for it. For binary classification it's fairly easy to see what range you would like to apply, but for a regression task it's not that easy to figure out what kind of random noise you should add to your data in order not to overfit.

The last technique for categorical encoding is a well-known technique in the banking domain called weight of evidence. It's very easy: for a given level you take the percentage of non-events (your negative events) divided by the percentage of positive events, and take the natural logarithm of that. To avoid division by zero you just add a small number to the number of non-events in the group, or events in the group. To illustrate what I mean, here's a simple example. Say category A has one single non-event, a zero, and three events, which are ones. Across the whole dataset we have nine records and seven positive examples. I forgot to check that number, so the slide is supposed to show slightly different values: for the positives it should be 3 divided by 7, the total number of positive events, and for the negatives it should be 1 out of 2, since we have only two negative events in the whole dataset, so 50%. You repeat the same procedure for every category level, and then take the natural logarithm of the ratio, which gives you the weight of evidence. So that's another way to encode the levels of your categorical data.

What else can be done? Weight of evidence has a nice companion called information value. You calculate it as the difference between the percentage of non-events and the percentage of events, multiplied by the weight of evidence, summed across the levels, which gives you a single number. This number can be used to select features; basically it's a feature importance, so you can use it to pre-select the features or categories you would like to use. The rule of thumb here is quite simple: if the information value is less than 0.02, the feature is not useful for prediction at all; from 0.02 up to 0.1 it has weak predictive power; from 0.1 to 0.3 it's medium; from 0.3 to 0.5 it's strong; and if it's more than that, it's a very suspicious column and you might have some leakage in your data, so you have to take a look at what exactly you did. Maybe you made a mistake, or maybe your data has a very long tail of infrequent categories that are always, say, 0 or 1. Some investigation is supposed to be done before you just feed this feature into your machine learning model.
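A minimal sketch of weight of evidence and information value as described, with a small epsilon to avoid division by zero, on toy data matching the example above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feature": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "target":  [ 1,   1,   1,   0,   1,   1,   0,   1,   1 ],
})

eps = 1e-6
stats = df.groupby("feature")["target"].agg(events="sum", total="count")
stats["non_events"] = stats["total"] - stats["events"]

# Share of the dataset's events / non-events falling into each level
pct_events = (stats["events"] + eps) / df["target"].sum()
pct_non_events = (stats["non_events"] + eps) / (len(df) - df["target"].sum())

# Weight of evidence per level, and the overall information value
stats["woe"] = np.log(pct_non_events / pct_events)
iv = ((pct_non_events - pct_events) * stats["woe"]).sum()

print(stats)
print(f"information value: {iv:.3f}")
```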
Another nice property of this encoding is that you can use the same approach for numerical values, because for a numerical feature you can also calculate the information value. Say you have a numerical feature: you bin it, using quantiles for example, and then you can play with these quantiles and merge some of them based on the information value of the whole column. For example, you quantize your numerical value and calculate the weight of evidence; if two of your bins have the same weight of evidence, you can merge them, because for the model there is no difference between them, they have exactly the same ratio inside, so why on earth keep them separate? That's extremely useful when you're trying to encode a numerical value as a categorical one.

The good thing about numerical features is that you can use them as they are, especially for tree-based models: you don't have to scale them, you don't have to normalize them, they're good as they are. But you can also treat them as categorical by binning them: you can use bins of the same width, which gives you a histogram, or bins with the same population inside, which gives you quantiles. Then, as I mentioned, you can encode the bins using any categorical encoding schema, or you can replace the value with the bin mean or median. That approach is usually very handy for sensor data, because sensors oscillate all the time and you get tiny little changes due to measurement error; if you don't need the exact value but only some level of approximation, that's the way to do it.

You can also apply dimensionality reduction to several numerical features to get a smaller representation of the same features. The intuition: say you have three numerical features; you can use truncated SVD to reduce them to one single feature, and that can be useful for tree-based models, because tree-based models are usually quite bad at approximating linear dependencies, while SVD and PCA, because of their linear nature, approximate that sort of linear dependency quite well, so they give good support to tree-based methods. Also for numerical features, you can cluster them: use k-means to cluster them, and then you have cluster IDs which you can treat as a categorical feature and encode using any categorical encoding schema. What is also very useful: for each row in your dataset you can calculate the distance to the centroids, and that gives you a whole set of new features. Not to mention simple features like squares, roots, or even additions and multiplications. We all know that a random forest is a very good approximator, it can approximate literally anything, but to approximate this kind of quadratic dependency a random forest needs a lot of trees; if you provide the squared feature alongside the raw feature, it will require far fewer trees and approximate it much better.
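A minimal sketch of two of the numerical-feature ideas above, quantile binning and k-means cluster IDs plus distances to the centroids as new features; the data is synthetic and the number of bins and clusters is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(0, 1, 300),
    "x2": rng.normal(5, 2, 300),
})

# Equal-population bins (quantiles); the bin label can then be encoded like a
# category, or replaced with the bin mean/median
df["x1_bin"] = pd.qcut(df["x1"], q=4, labels=False)

# k-means: cluster id as a new categorical, distances to centroids as numerics
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df[["x1", "x2"]])
df["cluster_id"] = km.labels_
dist = km.transform(df[["x1", "x2"]])  # distance to each centroid
for i in range(dist.shape[1]):
    df[f"dist_to_centroid_{i}"] = dist[:, i]

print(df.head())
```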
So that was an introduction to the next part. Basically we've had an overview of the toolset: what exactly we can do with features and how we can represent them in numerical form. The question is how we actually find the right representation, how we find the right feature interactions to help our model converge faster and maybe produce a smoother decision curve. Of course it requires domain knowledge, and you can also analyze the machine learning algorithm's behavior: you can analyze GBM splits, you can analyze a linear regressor's weights, for example. That's something I will discuss at the next meetup on advanced feature engineering.

So how can you encode different feature interactions? For numerical features you can apply different mathematical operations; you can also use clustering, or say k-nearest neighbors, to create features for you. If you have a pair of categorical features, you can combine them and treat the pair as a new categorical level, and then encode it with any of the schemas we discussed. If you would like to encode the interaction between a categorical and a numerical feature, then for each categorical level you can calculate different statistics of the numerical feature, like the mean, median or standard deviation; usually mean and standard deviation are quite helpful, and min and max might be, but it depends on the problem you have at hand.

Feature extraction is when, given the raw data, you want to extract some features out of it. For example, if you have GPS coordinates, you can download a third-party dataset that maps GPS coordinates to zip codes, and knowing the zip code you can get information about population or other census data. If you have a time field, you can extract the year, month, day, hour and minute, and if you have a holiday calendar, which is usually helpful for retail chains, you can add a flag: is it a holiday or not, and what kind of holiday. As I mentioned before, you can also bin your numbers; binning age into ranges, for example, is usually quite helpful. For textual data the classical approach is bag of words, obviously, but we all know that, so I won't stop there. You can also use deep learning for it, word2vec or doc2vec embeddings, and the beauty of it is that you don't have to train anything yourself: you can use pre-trained word vectors and use them as they are in your model. If you have short documents in your dataset, you can transform the words into vectors and then calculate an average vector, and that will be a representation of your document. And I think that's it, I'm happy to take questions. Thank you. [Applause]
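Here is a minimal pandas sketch of the interaction ideas from the talk: combining two categorical columns into one new level, and encoding a categorical-by-numerical interaction with per-level statistics. Column names and data are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat_a": rng.choice(["red", "blue"], 500),
    "cat_b": rng.choice(["small", "large"], 500),
    "num":   rng.normal(100, 15, 500),
})

# Categorical x categorical: treat the pair as a new categorical level
df["cat_a_x_cat_b"] = df["cat_a"] + "_" + df["cat_b"]

# Categorical x numerical: per-level mean and standard deviation of the numeric
stats = df.groupby("cat_a")["num"].agg(["mean", "std"]).add_prefix("num_by_cat_a_")
df = df.merge(stats, left_on="cat_a", right_index=True, how="left")

print(df.head())
```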
Dmitry is speaking again as well, but there are a lot more speakers: we're going to have the Kaggle number one on our panel, and Marios, Kaggle number three and a former Kaggle number one Grandmaster, is going to give a talk and be on the panel as well, and I'm going to be on the panel and give a talk too.

Yeah, the leave-one-out, I mean, yes, we can do that, that's a nice idea. All right. So, Dmitry, here's what I'm thinking we could do. Hi everyone, I'm Rosalie, I'm our director of community, thank you all for coming tonight. I figured I'd just ask you the questions, if you'd like. Perfect.

Okay, first question for you: what are the cases where something took forever to compute, and what did you do about it? Are there specific hardware configurations so it does not take so long? Well, it depends on what exactly you're trying to compute. If it's something like target encoding, usually that's because I wrote sloppy code and I just have to rewrite it. If it's a model I would like to speed up, I might switch to a smaller model or sample the data, or, because I work at H2O, I have access to a cluster: I can spin one up and run on the cluster, and that's what I do all the time, I just borrow the cluster from H2O.

Next: is frequency encoding similar to Huffman encoding, which is used in communication systems? Yeah, I think so, very close. I think that's exactly how I came across it the first time.

Perfect. How can one decide which is the best category encoding? By checking all of them and finding the best one. It requires some gut feeling, but in my experience target encoding usually proves to be the best; let's say it's the weapon of choice.

With respect to which metric? Okay, that's a good one. Obviously the metric depends on the business problem you're trying to solve; it has nothing to do with the features themselves. The whole setup is: you have raw data, you do different feature engineering, you feed it to your model, you check it on your validation set using a metric, and that's how you decide. If you get a significant improvement, you keep the feature; if you don't, you try again and again. So choosing the metric is really a business question. A good example: a company wants to predict something, and at the beginning of the quarter it's okay if you over-predict or under-predict, but as soon as you get close to the end of the quarter, over-prediction costs them a lot of money, so they want it to be smaller, and that requires designing a metric yourself. Again, that has nothing to do with the feature engineering: you can use exactly the same features and just see how your model fits the metric of choice. Mostly, what you can do is this: say you have AUC as your metric and a binary classification problem; in that case you can encode your categories and immediately check the encoding against AUC, because that's very easy to do. The same goes for classical metrics like RMSE if you want to encode a category, because your category encoding, especially target encoding, is a small, simple model by itself, so you can measure it directly if you like. It's not something I highly recommend, though, because in most practical cases, even if this simple comparison looks good, the final model can still be bad, so there's no guarantee.
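A minimal sketch of that "check the encoding directly against the metric" idea for AUC: a target-encoded column is itself a tiny model, so its values can be fed straight into roc_auc_score. Toy data only; in practice this should be checked on held-out data, and, as noted above, a good column score does not guarantee a good final model.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.DataFrame({
    "feature": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "target":  [ 1,   1,   1,   0,   1,   1,   0,   1,   1 ],
})

# Plain target mean encoding, then score the single encoded column as if it
# were a model's predictions
encoding = df.groupby("feature")["target"].mean()
df["target_enc"] = df["feature"].map(encoding)

print("AUC of the encoded column alone:", roc_auc_score(df["target"], df["target_enc"]))
```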
Another question from that attendee: they're curious what to do with a feature that has very many categorical values. You mean a lot of levels? You can apply both techniques at the same time: smoothing and leave-one-out. Leave-one-out is extremely expensive, but instead of leave-one-out you can do cross-validation: you split your data into five chunks, for example, compute the encoding on four chunks, apply it to the fifth, and repeat this for all five folds. Obviously the fastest way is just to add random noise instead, which could be a replacement, but in that case you have to design it carefully. I didn't include random noise in the slides for one reason: I don't have a good understanding of exactly what kind of noise to add or what the rules are for choosing it, so that's why it's not in my slides.

Awesome, so many great questions coming in, thank you all. Another question: it's claimed that neural networks will make feature engineering obsolete. What do you think, and why? In image recognition, in speech recognition, in text, so for unstructured data, definitely; that's exactly what we see. As soon as you have unstructured data like images or sound, that's the best way, because the networks design the features for you. If you have a structured dataset, the kind of data we usually work with, where you have a database and a lot of structured data, not so much: they usually perform more poorly than tree-based methods, especially if the tree-based methods are empowered by smart feature engineering. You can still use neural nets on structured data, but it requires you to carefully prepare the features for them. It's an active area of research, I'm pretty sure, but for now we're still waiting for final results.

Awesome. Any techniques on feature extraction for time series or historical data? Yes, a lot of them, but most of what you do there is different lag features; that's obviously what you can do with time series data. The key to success is to carefully apply a validation schema. Say you would like to predict something two weeks ahead: in that case, given the data, you can't use the last two weeks of your dataset, your features should be created on the data before that. The point is that with your validation setup you're trying to model the actual real-life scenario. Beyond lag features, exponential smoothing is very helpful; it means your most recent events are more important than the older ones, which suits tree-based methods well. And of course you can apply techniques like ARIMA as well.

Awesome. I'm seeing: will you post the presentation online? Yes, we will; you can find it on our YouTube channel in a couple of days.
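A minimal sketch of the fold-based target encoding mentioned in the answer about high-cardinality categories above: each fold is encoded using statistics computed only on the other folds. Toy data; three folds instead of five because the example set is tiny.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "feature": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "target":  [ 1,   1,   1,   0,   1,   1,   0,   1,   1 ],
})

global_mean = df["target"].mean()
df["te_oof"] = np.nan

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, valid_idx in kf.split(df):
    # Level means computed on the training folds only
    fold_means = df.iloc[train_idx].groupby("feature")["target"].mean()
    # Applied to the held-out fold; unseen levels fall back to the global mean
    df.loc[df.index[valid_idx], "te_oof"] = (
        df.iloc[valid_idx]["feature"].map(fold_means).fillna(global_mean).values
    )

print(df)
```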
What are the methods of encoding a list of elements? That depends. If the list has a sequence, I mean if the elements are ordered somehow, you can create them as separate columns. If there is no sequence, you can obviously one-hot encode them; that's what you usually do in marketing, and that's how you deal with this kind of events data. It's a very common task there: given a set of events you're trying to predict user behavior, or given a user's clicks you're trying to predict something about the user. I do have a couple of other ideas, but they need to be checked before I can share them; it's obviously not an easy problem compared to the ones we discussed today.

Awesome. Can we pile up all these methods together, run the model, and find out the best one? That's something you can do, although it's usually very expensive. Say you have a dataset with 500 columns and you would like to know which of them are useful: if some of them are categorical, you have to encode them using every categorical schema I showed you, and then apply feature selection methods, which could be recursive feature elimination, or you can fit a linear model with L1 regularization and keep the features that get non-zero weights. No, I don't think it's a good idea on its own; if you do it iteratively you get better results, because the feature selection methods available right now are not very reliable. If one of them says "hey, I found ten useful features out of your one hundred", that doesn't mean it actually found all of the right ones: you might still have some random features inside, and some useful features might be missing. So I wouldn't recommend that; going feature by feature, iteratively, usually gives more stable solutions.

Great. Can you do target mean encoding with H2O Flow? No. We can do it via our product, which I'm forced to work on... I was actually instructed not to mention it.

Another great question, thank you all, I see more coming in. Have you come across examples where feature engineering was applied in a domain like banking? In banking, yes, say fraud detection and anomaly detection, but they usually apply a different toolset to do that kind of work. We should really invite someone who can talk for days about that; I'll make a note on my Trello.

A question here, let's see: is there any good feature encoding tool, for example a tool to help compute target mean encoding? Yeah, I think there are some scripts in the wild for Python. I don't remember the links, but you can find Python scripts for sure on the Kaggle site, just google "target mean encoding", and you can also search GitHub for target encoding, which should find something. If I find the package I'm referring to, I will post the link on the meetup page.
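For reference, a minimal sketch of the L1-regularized selection idea mentioned in the answer about piling up methods: fit a linear model with an L1 penalty and keep the features with non-zero weights. The data is synthetic, and, as the answer says, this is not fully reliable on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
# Only features 0 and 3 actually drive the (synthetic) target
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_scaled, y)

selected = np.flatnonzero(model.coef_[0])
print("features with non-zero weights:", selected)
```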
Awesome. How to deal with complex objects having partially unknown attributes? Can I have an example of what is meant by that? Oh, you're talking about imputation, right: say we don't have some information about a user, it's missing; say we don't have an age for one particular user but we have it for everyone else. Well, it depends. You can apply different imputation techniques. The simplest one is to replace missing values with the mean, which is kind of dumb but works in some cases. It also depends on why the value is missing: sometimes missing values are due to a mistake, the value was not supposed to be empty and the emptiness is a mistake in your data gathering process, and sometimes a missing value actually means something, it carries the information "this value is missing" and there could be some reason behind that. In that case you can leave the missing values as they are, and modern methods like XGBoost handle missing values pretty easily: you just pass them in, and as it grows the tree, at each split it decides whether to send all the missing values left or right depending on the loss function, which is a pretty powerful technique. With the encodings I described, for categoricals you just treat missing values as a separate category, and the same goes for numerical values, they get a separate bin. But one rule of thumb I would try: for numerical features, replace the missing values with the mean and also add a binary feature that tells you "this was a missing value", zero or one, so instead of one column you have two. That's usually a good starting point, especially for neural nets and linear models, because you keep both pieces of information: you've replaced the missing value with the mean, but you've also kept the fact that there was a missing value there. It's usually very helpful.
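A minimal sketch of the rule of thumb from that answer: replace missing numerical values with the mean and add a binary "was missing" indicator, plus the separate-level trick for categoricals. Toy data only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34.0, np.nan, 52.0, 41.0, np.nan]})

# Keep the fact that the value was missing, then impute with the mean
df["age_was_missing"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(df["age"].mean())

# For categoricals, the equivalent trick is a dedicated "missing" level
cat = pd.Series(["red", None, "blue"]).fillna("MISSING")

print(df)
print(cat)
```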
Awesome. Do you prefer certain models for categorical data, such as random forests? Mostly, yeah, but again, it depends. In the case of events, say you have a list of events: what I would do immediately is replace everything with one-hot encoding and fit a linear model to see what happens, and because event data is very sparse and very wide, that's often enough, it's already probably a very good model.

Okay, any other questions? I was just going to pass the microphone around if anybody wants to ask something; if you do, raise your hand and I'll come over.

[Audience] You were talking about combining categories into one if they have almost the same predictive power. That makes sense if your model is purely statistical, but does it mean you can combine them by the same predictive power only after you've made sure your features are independent? Well, think of it from the model's point of view, because that's what the model actually sees in the data. Since the model always sees the same weight of evidence number, the model itself has effectively already combined them: there is no way for it to distinguish whether a 0.4 value is the 0.4 for category A or the 0.4 for category B, they go together, so it's already done for the model. But it can be a good idea because it can bring you good insight, mostly for numerical features, because it shows you which bins you can combine. Although, ideally, it makes most sense to combine neighboring bins; if, say, bin one and bin twenty have the same weight of evidence, it's kind of strange to combine them. This March at the GTC conference there was a paper about detecting anomalies using recurrent neural nets, and that's exactly what they did: they binned the numerical features, but they forced the neural net to always keep the same order of the bin embeddings. Normally, if you use embeddings right away, the weights can shift around however they like; it's kind of a strange idea, but they force the numerical values to keep the natural order they're supposed to have in real life.

All right, who's next? [Audience] When we are using frequency encoding for some feature and then we want to apply the model and predict, and the real input comes in, how do we get the frequency? During training you learn the mapping table, and when you have test data you just apply this lookup table, learned on the train data, to the test data. That table never changes during inference: you learn the frequencies from the train data, and when you have a new dataset, like test data you're trying to predict, you never recalculate anything, you just use the values you found before. That immediately raises the question: what if I have a new level in that dataset? Well, you can treat it as a missing value, if your machine learning model is able to handle missing values, or you can just use zero, because you never saw this level during training.

All right, I'm coming up here. [Audience] How would you compare H2O's use of categorical variables, which it can learn from directly, to using feature encoding? Not sure I'm following the question. Well, H2O supports categories as inputs. Yeah, but under the hood it just encodes them: it can do one-hot encoding, it can do one-hot encoding with dimensionality reduction so you don't have to do it yourself, and the third way, it can do label encoding. That's what it does inside, you just don't see it. I just never quite figured out what it does. So what you're referring to is: hey, we have a lookup table, why don't we just learn the rates? There is an encoding like that, but H2O doesn't do it; it still does one-hot encoding, but it can do it on the fly, I think for trees it does. It doesn't use the categories directly: you still need some numerical representation, because that's how it can build a histogram and use that histogram to find the splitting point. There is one approach that tries to use categorical values directly: there's a company called Yandex, and they recently released CatBoost decision trees, and they state that CatBoost is able to handle categories as they are, as raw categorical data. I haven't checked it myself, and the results are kind of controversial: in some cases it performs extremely well, and in some cases it performs extremely badly, nothing in between.
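A minimal sketch of the lookup-table answer above: the frequency (or any other) encoding is learned on the training data only and then applied to new data, with levels never seen in training falling back to zero or a missing marker. Toy data only.

```python
import pandas as pd

train = pd.DataFrame({"feature": ["A", "A", "A", "B", "B", "C"]})
test = pd.DataFrame({"feature": ["A", "C", "D"]})   # "D" never seen in training

# Learned once, on the training data, and never recomputed at inference time
freq_map = train["feature"].value_counts(normalize=True)

train["freq_enc"] = train["feature"].map(freq_map)
test["freq_enc"] = test["feature"].map(freq_map).fillna(0.0)  # unseen level -> 0

print(test)
```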
[Audience] Yeah, I have a question. You mentioned gradient boosted decision trees, or regression trees, and I have a question which is probably not directly related to this lecture, but I think it could be of interest to many people. When I used to work on query-URL ranking, gradient boosted regression trees worked quite well, and recently, about a couple of weeks ago, I tried to use this kind of boosting with deep learning in computer vision. Basically, after we trained the network, I ran inference on the whole training set, and the pictures with the worst results were prioritized, so they were randomly selected more frequently, and this didn't work at all. What is your opinion?

There was one paper about how to boost CNNs; you can google it on arXiv, I think it's called Boosted CNN or something like that. They learned a specific architecture and got pretty good results; I tried to reproduce them and it was pretty good, but I wouldn't say it's best in class. There is another technique you might find helpful, I think it's called online hard example mining, and that's something you could try. The idea is quite simple: given the batch, you calculate the loss and you have the gradients, but you backpropagate only the gradients of the hardest examples, not all of them. Of course there is the question of how exactly to define that proportion and so on, and I haven't studied the paper closely.

[Audience] The difference between what you're saying and what I was mentioning is that I tried to determine the worst ones by doing full inference, while what you're suggesting still runs in training mode, which means there's dropout. Yeah, I know, the loss is not the real one. You know what I think, if you ask my opinion, which isn't supported by anything, just my gut feeling: in boosted trees, each and every tree is different, it's shaped differently depending on what exactly it's trying to predict. In neural nets you always keep the same architecture at each step, the same connections, just different weights. I think that's the biggest difference: yes, it's a different model because the weights differ, but that's not what makes a truly different model in this case. That's my gut feeling, I don't know for sure, because I did exactly the same thing you did: I tried fitting a neural net inside the boosting schema, which is kind of easy to do in scikit-learn, and it didn't work at all. The first one is good and that's it, basically.

Any other questions? If you raise your hand I'll bring you the mic. Anyone? Okay, let me check Slido real quick to see if folks maybe used that as well... let's see here... okay, I think that's all the questions. Dmitry, awesome, thank you so much for your talk tonight, and thank you all for coming. Thank you. [Applause]
Info
Channel: H2O.ai
Views: 10,762
Rating: 4.9196787 out of 5
Keywords: machine learning, feature engineering, Product
Id: irkV4sYExX4
Length: 59min 28sec (3568 seconds)
Published: Wed Dec 06 2017