Learn Kaggle techniques from Kaggle #1, Owen Zhang

Video Statistics and Information

Captions
…exactly because you have so many degrees of freedom. But that is only the most common way of overfitting; there are many other ways. One very subtle way of overfitting is that you keep building different models and you test each model against the answer, which is your public leaderboard. Whichever model works well, you keep; whatever doesn't work well, you drop. In a lot of situations you can keep doing this, and you will find that your model's performance on the public leaderboard just keeps improving, almost without limit. The reality is that you are just overfitting the public leaderboard. That happens quite often.

There is a reflection of this in statistics. I'm not a statistician, so let me know if I say something wrong, but the convention is that we use a five percent threshold for whatever significance test: if, under the null hypothesis, you would have less than a five percent chance of observing what you have observed, then you say, okay, this is significant. But then we run into the multiple comparisons problem, and it follows you very often in building models: if you try very hard, if you randomly try 100 things, then five of them will turn out to be significant, statistically speaking, just because you tried that many things. So the philosophy here is: think more, try less.

The next slide is the secret, the "secret" slide. People ask me what the secret of doing well is; it's not that complicated. These are the secrets. There is a longer version on the left and a shorter version on the right. The slides are going to be hosted somewhere, either on the event's website or on meetup.com, so there is no need to take notes on the slide content.

The short version: first, be disciplined. The temptation to overfit the public leaderboard is everywhere, and it's strong; sometimes, psychologically, I cannot control myself and I just overfit the public leaderboard because it makes me feel good. We are human; that's how it is. But we need to be disciplined, because we don't want to look embarrassed after the private leaderboard comes out, when all the positions flip and suddenly you can't find yourself anywhere near the top.

The next one is to work hard. Kaggle, or at least some part of the Kaggle system, rewards effort; you never lose anything by participating. You can rank absolutely last in a contest and you are still, relatively speaking, doing better than those who did not participate. The more you participate, the better your results get, and even within a single competition, being disciplined and working harder is always helpful. We'll touch on how to properly work hard later.

The next one is to learn from everyone. This also goes a little against human nature: we all have our pride, and you may think, this time I want to try everything myself, on my own; I don't want to do what all the other people do. Yes, there is a little bravery in that, but on the other hand, people other than you always know more than you do individually,
because there are so many of them combined against only one of you; it's pretty hard to compete against everybody. I always feel that it's really by learning from others that you can improve, and there's really no pride lost in not doing everything by yourself. I'm not sure how many people have personally re-invented all the mathematics they learned in elementary school; I think that would take more than a lifetime. There's really no such thing as "I learned this all by myself."

The next one is luck. A lot of the time, in what we are doing, we operate on very noisy data, and there is really no absolute way to differentiate between noise and signal. The only way to really tell them apart is to have more data, but within a given competition you don't get more data; the signal-to-noise ratio is pretty much fixed. So there is always non-zero luck. If you participate in something and you don't do well, don't worry: it's probably just bad luck. And if you do well, that's all skill.

I always remind myself, before I go to give a technical presentation, to make sure I bring something concrete that people can take away. So the rest of the presentation is the concrete material, but I personally feel that how you approach the problem, the philosophy and the strategy, is actually more important than the specific techniques. A lot of the time it's not about which single thing you do right. There are a lot of specific things that are necessary to do well and build a good model, and not only in a competition; in reality, on a project, you want to build a good model too. The key is actually how to allocate your effort. We all have a time budget: there's a deadline, and there are only a limited number of hours per day. So it's worth thinking consciously about how much time you want to allocate to feature engineering, how much to trying different models, and how much to doing the hyperparameter tuning.

Now, back to the technical tricks, the concrete stuff. Gradient boosting machines: I GBM everything that I can GBM. There are things you can't GBM; one example would be very large data. If you have, say, 100 million observations, you probably don't want to GBM it, but for anything that can be GBM'd, I probably will GBM it. Why is GBM so good? By now most people probably know GBM already, but let me cover it anyway, just in case. GBM automatically captures, to a large degree, the nonlinear transformations and the subtle and deep interactions in your features. GBM also gracefully handles missing values. On my side, I use the R implementation of GBM, the gbm package, very often.
Two years ago there was only the R implementation that was good; nowadays you can use either the R package or the scikit-learn implementation, and if you want to go parallel you can use the H2O version or XGBoost. They are all very good implementations, each with its own quirks, but they are all able to do the nonlinear transformations and the interactions. Those two things are actually what people spend most of their time on when building linear models or generalized linear models. In the days when we had only logistic regression, or only Poisson regression (in the insurance industry, where I worked), whenever we were building models we were not really building models, because we always ran the same logistic or Poisson regression; it's always the same model. What we were actually doing was removing outliers and trying transformations (a U shape, a hockey-stick shape, whatever shape) and trying this interaction and that interaction. GBM does those things for you automatically; that's why it gets you so much. If you haven't tried it, please do try.

Like all the other modern machine-learning algorithms, GBM has a few tuning parameters, and when your data set is small you can just do a grid search. But GBM has three separate, pretty orthogonal, tuning parameters, and if you want to grid search over all of them you need a pretty big rig. So here I'll review some rule-of-thumb tuning. It's mostly for saving time: if you do a very smart search you can probably do better than the rules of thumb, but the rules of thumb save time.

The first pair of parameters is how many trees you build and the learning rate. They are reciprocal to each other: a higher learning rate lets you get away with a smaller number of trees. To save time, I just target 500 or 1,000 trees and tune the learning rate: start from 0.1, then go up and down a little. The next one is just the common tuning parameter for decision trees (GBM is based on decision trees, so it needs the tree parameters): the minimum number of observations in a leaf node. Just look at your data to get a feel for how many observations you need to get a good mean estimate. If the data is very noisy, you may need more; if it is pretty stable, you probably need less. I just use that number directly, and it turns out to be pretty good. Then there is the interaction depth. This is an interesting one: for the R version of GBM, the interaction depth basically describes how many splits each tree makes, not depth in the textbook sense. An interaction depth of ten means about ten splits and roughly ten leaf nodes per tree; it does not mean the thousand leaf nodes a full depth-ten tree would have. So don't be afraid to use ten; I use ten very often.
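To make those rules of thumb concrete, here is a minimal sketch, assuming scikit-learn's GradientBoostingClassifier rather than the R gbm package the talk refers to; the data set and the grid values are made up, and max_leaf_nodes and min_samples_leaf stand in only roughly for R's interaction.depth and n.minobsinnode.

```python
# Rule-of-thumb GBM tuning: fix the number of trees, pick the leaf size
# from the data's noisiness, and tune only the learning rate around 0.1.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=1000,    # target 500-1000 trees and leave this fixed
    max_leaf_nodes=11,    # ~10 splits per tree, akin to interaction.depth = 10
    min_samples_leaf=50,  # enough observations per leaf for a stable mean
    random_state=0,
)

grid = GridSearchCV(gbm, {"learning_rate": [0.03, 0.1, 0.3]},
                    scoring="roc_auc", cv=3)
grid.fit(X, y)
print(grid.best_params_)
```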
There is one thing that GBM doesn't do. Actually, there is one thing that all the tree-based algorithms do not do well, which is dealing with high-cardinality categorical features. If you have a feature which is a categorical variable with many, many levels, throwing it into anything tree-based is a bad idea, because the tree will either run super slowly (if you one-hot encode all the levels) or overfit that feature. So for high-cardinality features, to use them in trees you have to somehow encode them as numerical features, and there are a few different ways of encoding. I just want to mention that we see high-cardinality features very often: zip codes, or in medical data the diagnosis codes (tens of thousands of ICD-9 codes), and text features are also high-cardinality features.

One well-known approach to encoding them is to build a ridge regression. If you are fitting a logistic model it's very simple: you add an L2 penalty to your logistic regression on the categorical features, and then you use that model's prediction as a numerical input to your GBM. I'll cover a particular example of how to do this in more detail later. It's actually a pretty standard idea; it's what people call stacking. Basically, you build a stack of different models, and each subsequent stage of models uses the previous stage's outputs as its inputs. As I described: if you have text features and numerical features, you put the text features into your ridge regression model, make a prediction, put that prediction side by side with all your raw numerical features, and feed this into a GBM. This usually works pretty well, and if you haven't tried it with text, you can replace the text with high-cardinality categorical features and do the same thing. In a lot of Kaggle competitions, if you just do this you will rank in the top 25 percent, at least. For people who rank in the bottom half of a Kaggle leaderboard, this is a way to come up, because it is really quite easy to rank in the top half; there are many beginners, and people who are not really into it, in the bottom half.

The one particular risk here is this: if you use the same data to build the first-stage model and then use the same data again for the GBM, model two, then very often prediction one will be overfit inside model two, because prediction one has already used the target variable. There is leakage in there; you have leaked information from the actual target. So whenever you put prediction one together with the raw numerical features, prediction one will be given too much weight, and the more you overfit model one, the more weight it will be given. It can be quite bad. What you want to do is use different data for model one and model two. You can simply cut the data in half: use one half for model one and the other half for model two. That already gives you a better model than using all the data for both. And you can swap them: use half A to build model one, apply it to half B to build model two there, then swap the roles, build another pair, and average the two final models. That way you still use all the data.
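Here is a minimal sketch of that half-and-half, two-stage idea, assuming invented data and column layout: a ridge-penalized (L2) logistic regression turns the high-cardinality feature into one numeric column on half A, and the GBM trained on half B consumes it next to the raw numeric features.

```python
# Two-stage stacking with a half/half split so the target is never leaked.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 4000
cat = rng.integers(0, 500, size=n).reshape(-1, 1)  # high-cardinality categorical
num = rng.normal(size=(n, 5))                      # raw numerical features
y = (rng.random(n) < 0.5).astype(int)              # toy 0/1 target

idx_a, idx_b = train_test_split(np.arange(n), test_size=0.5, random_state=0)

# Stage 1 on half A: L2-penalized logistic regression on the one-hot levels.
enc = OneHotEncoder(handle_unknown="ignore")
stage1 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
stage1.fit(enc.fit_transform(cat[idx_a]), y[idx_a])

# Stage 2 on half B: GBM sees the raw numerics plus the stage-1 prediction.
p1 = stage1.predict_proba(enc.transform(cat[idx_b]))[:, 1]
stage2 = GradientBoostingClassifier(random_state=0)
stage2.fit(np.column_stack([num[idx_b], p1]), y[idx_b])
# Swapping the halves and averaging the two stage-2 models uses all the data.
```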
I've done this half-and-half swap, and it seriously gives very good results. And again, if you're not trying to win a competition, then for practical purposes I actually feel that kind of model is already good enough for a lot of practical applications.

You can take this one step further. If you have a smaller data set and can't afford to leave half the data out, you can do something like cross-validation: split the data into ten folds and build ten different model ones, each one trained on nine folds and predicting the remaining one. You then concatenate those predictions so that each record has a so-called out-of-fold prediction from model one (sketched below). That way you avoid the overfitting problem in model two. So that's how you deal with categorical features, and once you can deal with categorical features, I'd say GBM is good for 95 percent of all general predictive-modeling cases.

You can make a GBM model a little bit better still. By "a little bit better" I mean maybe a slightly higher Kaggle ranking, maybe one percent higher AUC; it may or may not be worth it for you, depending on the problem. The reason is that GBM can only approximate interactions and transformations, and while it is approximating them it cannot fully separate noise from signal, so it is inevitably going to pick up some noise. If you know that there is a certain nonlinearity in your problem, or you know that there are strong interactions, it is better to code them explicitly and feed them to the GBM, while still using GBM. The next point is that GBM's feature search is just a greedy search, so it cannot really find very specific transformations. For example, a lot of the time when we build a sales forecasting model, the best single predictor of sales is the previous year's sales for the same period. For GBM to automatically discover that is, I think, pretty difficult. So if you know something like that, code it explicitly.

That covers GBM. The second most often used tool for me is the linear model: the generalized linear models, or rather the regularized generalized linear models. I used to use glmnet very often; it is a very popular R package. From a methodology perspective, regularized generalized linear models are kind of the opposite of GBM. Generalized linear models are global models, assuming everything is linear; tree models are local models, essentially assuming everything is a staircase shape. So they complement each other very well. Linear models become quite important when your data is very, very small or when your data is very, very big, for different reasons. When your data is very small, you just don't have enough signal to support nonlinear relationships and interactions. It's not that they aren't there (there always are nonlinear relationships and interactions), but if you don't have enough data to detect them, it is better to stick with the simpler model. That's when the linear model works well. On the other end of the spectrum, when you have, say, billions of observations, only linear models are fast enough; everything else won't finish in our lifetime. That's why you go back to linear models when the data is very large.
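Returning to the ten-fold, out-of-fold idea a moment ago: a minimal sketch, assuming made-up data, where scikit-learn's cross_val_predict returns, for each row, a stage-one prediction produced by a model that never saw that row.

```python
# K-fold stacking: every record gets an out-of-fold stage-1 prediction,
# so stage 2 can train on the full data without target leakage.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
cat = rng.integers(0, 500, size=4000).reshape(-1, 1)
num = rng.normal(size=(4000, 5))
y = (rng.random(4000) < 0.5).astype(int)

X1 = OneHotEncoder(handle_unknown="ignore").fit_transform(cat)
stage1 = LogisticRegression(penalty="l2", max_iter=1000)
oof = cross_val_predict(stage1, X1, y, cv=10, method="predict_proba")[:, 1]

stage2 = GradientBoostingClassifier(random_state=0)
stage2.fit(np.column_stack([num, oof]), y)
```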
Back to the linear models: they also matter when you do model blending, which I'll cover later. When you ensemble models together, GBM and glmnet complement each other very well; they make a very nice pair of opposites in performance. The downside of glmnet, or anything similar to this kind of model, is that it requires a lot of work. All the things that GBM does for us automatically, you have to do yourself: deal with the missing values, the outliers, the transformations and the interactions. It really takes a lot of time.

The last thing to cover here is regularization. My personal conviction, and my personal bias, is that in this day and age, if you are building linear models without any regularization, you must be working somewhere really special. Seriously: always regularize. It's required. There are two very popular regularization approaches, L1 and L2. Basically, L1 gives you sparse models, and L2 just makes every parameter a bit smaller. The sparsity assumption is a very good assumption. The book The Elements of Statistical Learning, which is a very good book, discusses why that is the case. The reasoning is almost like Pascal's wager: if your problem is sparse in nature, assuming sparsity will get you a much better model than otherwise, and if your problem is not sparse in nature, the assumption won't hurt much. So the lesson is: always assume sparsity, unless you know the problem is not sparse.

There are problems that we know are not sparse: things like text mining, or, say, a zip-code-based model. If you build a zip-code-based model, you don't think sparsely in the sense of geography; it's not the case that three zip codes are special, each with its own parameters, while everything else is all the same. It's the same in text mining: if you assume that 500 words in the English language are very special and all the other hundred thousand are useless, that's not plausible. So in those cases, text mining and very high-cardinality categoricals, you do not assume sparsity; other than that, it's pretty safe to assume sparsity.
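A tiny illustration of the two penalties, on synthetic data that is sparse by construction: L1 (the lasso) zeroes out most of the coefficients, while L2 (ridge) only shrinks them.

```python
# L1 vs. L2 regularization on a problem with 5 informative features out of 100.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=100, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))  # a handful
print("nonzero ridge coefficients:", np.sum(ridge.coef_ != 0))  # all 100
```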
So let's go to text mining a little bit. First, I admit that I'm not a very good text miner, if that's a word. I just Google things, I read what others did, and then I try to follow their examples. My approach to text mining is that the simpler stuff really works better. As far as we have come in machine learning, the n-gram-based approaches and the naive Bayes approaches actually work surprisingly well, and a lot of the time it is quite hard to do better. So on text mining my view is always: make sure you get the basics right; the fancy stuff is less likely to work. The basics are the n-grams, and then the trivial features: things like how long the text is, how many words are in the text, and whether there is any punctuation. Those kinds of features are very important. And the last point is that a lot of problems that appear to be about text are not necessarily primarily text problems.

The KDD Cup DonorsChoose problem is an interesting example of this. DonorsChoose was the sponsor; they run a website where you can donate to your local schools' teachers' projects. If you are a teacher in an elementary school or a high school, you can post a project saying, I'm doing this reading club with my students, and I need a hundred dollars to buy this carpet for the reading hour. If you sympathize with that effort and find the content relevant, you can donate. That's why they are called DonorsChoose: the donors actually pick what to sponsor. So they held a competition to see which projects are more likely to gather attention from people and actually get funded. The teachers put up essays and summaries of their projects; they are very messy essays, not academic writing, and all of that text data is provided together with the cost and the type of the project. And there is a lot of text: whenever you have unstructured data, the text is always considerably bigger than the structured data, so I always feel an obligation to use the text a lot. But it turned out that the simple features mattered much more. For example, projects that cost less are much more likely to be funded, and you don't need the essay to tell you that: if you ask people for a thousand dollars for a laptop, no essay will save you; it's much easier to get funded asking for math books. The essays didn't help us much there.
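A minimal sketch of that "basics first" recipe: word n-grams plus the trivial features, feeding a naive Bayes classifier. The toy documents and labels below are invented.

```python
# n-grams + trivial text statistics + naive Bayes.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["We need $100 for a reading-hour carpet!",
        "Requesting one laptop for the classroom.",
        "Math books for thirty students, please.",
        "A small grant for art supplies."]
y = np.array([1, 0, 1, 1])  # toy "funded" labels

ngrams = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
trivial = csr_matrix([[len(d),                       # text length
                       len(d.split()),               # word count
                       sum(c in "!?$" for c in d)]   # punctuation count
                      for d in docs])

model = MultinomialNB().fit(hstack([ngrams, trivial]), y)
```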
The next one is blending. If you are serious about a competition, you will learn this: it is almost never the case that a single model wins a competition. People build a variety of models and blend them. Blending is just a fancy word for a weighted average of models. You can do something a little fancier, but not much fancier; I usually just use a weighted average. My rationalization for why blending works is that all our models are wrong. As George Box said, all models are wrong, but some are useful. There are lots of good quotes in statistics, and I like this one very much. There is another one I would like to mention here; I think it goes, it is very hard to make predictions, especially about the future. I forgot who said that, but it is very closely related.

When we study regression analysis, or something similar, in school, we always start our models with the i.i.d. Gaussian assumption: independently, identically distributed Gaussian errors. I have never seen real data like that. It just never happens; I don't know where it is supposed to happen, but you never have it. Our assumptions are never right: whatever you assume, unless you simulated the data set yourself, is always wrong. So the hope is that our models are wrong in different ways, so that when you average them, the errors kind of cancel each other out. That is how I justify why blending works.

There are a few things to keep in mind. One is that simpler is better when things are uncertain: if you are not sure, just taking (model one + model two + model three) / 3 is usually better than tuning the weights, unless you have a lot of data. The next one is that there is a useful strategy, which is to intentionally overfit the public leaderboard; if you do it carefully, it may actually give you a nice boost. The way you do this: you build many different models, you submit them to the public leaderboard individually, and then, based on their scores, you keep the better ones, average them, and that is your blend. This will, at least on the public leaderboard, give you a nice boost; the blend will work better than any individual model submitted. Whether it also works better on the private leaderboard depends on how much data there is behind the public leaderboard. If the public leaderboard holds a lot of data, this will actually help you, because when you do this you are implicitly using the public leaderboard data as training data, so you actually have an advantage over people who don't do it. But if the public leaderboard is not large enough, you are overfitting it, and it will look terribly bad when the private results come out. So you have to be careful about it.

One thing that is very useful in blending: you are not going after the strongest individual models; you want models that are different. Sometimes an intentionally weak model helps a lot when you add it to a strong model. The key is diversity. I'm sure the HR people will be very happy to hear this: here, at least, it is demonstrably true that diversity is good. There are many ways of building diverse models; here are a few examples among the infinitely many ways of doing it. You can use different tools. You can use different model structures. You can use different subsets of features. You can use different subsamples of observations. You can build models with different weightings of the problem, and you can build both weighted and unweighted models: usually the unweighted one will work worse than the weighted one if the problem calls for weighting, but a blend of 90 percent of the weighted model and 10 percent of the unweighted one will quite often work better than 100 percent of the weighted model.
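A minimal sketch of blending as a weighted average of diverse models, assuming made-up data: a staircase-shaped local model next to a global linear one, averaged on held-out predictions.

```python
# Blend two deliberately different models with a straight average.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [GradientBoostingClassifier(random_state=0),  # local, staircase-shaped
          LogisticRegression(max_iter=1000)]           # global, linear
preds = [m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in models]

blend = np.mean(preds, axis=0)  # (p1 + p2) / 2
# Tuned weights, e.g. 0.7 * preds[0] + 0.3 * preds[1], only pay off when
# there is enough data to estimate them reliably.
```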
The point here is: try to build those models for the blend without looking at the answer. The public leaderboard is noisy; you always want to test against it, but try not to use it as a guide to filter out your models, unless you are confident that the data set is large enough that you can afford to. It is a total judgment call based on the noisiness and the size of the data.

Before I go to the real examples, let me try to address one question, and I will try to address it objectively, although that is not really possible. People sometimes ask me, you obviously enjoy doing competitions, but other than being a fun game, is there anything really useful about competitions? First of all, let's acknowledge that it is a fun game, and a fun game is itself entertaining, so that is real value; we shouldn't treat "fun game" and "useful" as mutually exclusive. But beyond that, there are two ways to look at this. The modeling competition covers only a very small portion of the work necessary to make data science really useful in the world. To make a model useful, we need at least another three pieces. One: we need to select the right problem to solve. If you find a problem that is interesting only abstractly, with no real-world implication, then there is actually no value in it. Two: we need to have good data. Models are a garbage-in, garbage-out business; if you don't have good data, you cannot expect good models. Three: we need to make sure the models are used the right way. A lot of the time it is possible to build very good models and then have them implemented wrong, even trivially wrong. It happened to me in the past: you build a pricing model, and in the implementation they swap the parameters in the model, the X1 parameter goes to X2, and then you figure out that it doesn't work and you see all this effort wasted; everyone is unhappy. So with all these three pieces, plus the right generalizable model, you have a real solution. Building the model, the fun Kaggle part, is only a small portion, and we all need to keep that in mind in reality.

That said, competitions help in many different ways. There are two ways a competition helps a sponsor: if you have a company, maybe a startup, putting up a modeling competition on your own data set is very helpful. The first way it helps you is to measure, to some degree, the signal versus the noise in your data. If you put up reasonable prize money, you can be quite sure that 99 percent of the signal that can be squeezed out of your data is going to be squeezed out, because the people on Kaggle are quite good at that. The second is that if there are any flaws in your data, the crowd will find them for you. A lot of the time, because of issues in the data-collection process, you may have predictors that are not real predictors, such as: whenever a particular field is missing, the answer is always yes or always no. If you have anything like that, the crowd will find it, and you can go back and fix it; the competition model may not be useful because of that problem, but at least you can fix your data.

As a participant, I learned two major themes from Kaggle. One is to build generalizable models: a model that merely overfits what you already know is useless, and that is a hard-learned
discipline. The other is to fully realize, basically every day, that there are other approaches and other people with better ideas. It keeps me on my toes and keeps me learning new things; otherwise it is quite easy to just hide in my own corner thinking, oh, I do really good modeling work, and then go out there, try, and get beaten, which happens quite often.

So let me give two examples of competitions to make this concrete. The first is the Amazon employee access competition. This is one of the most popular competitions; it used to be the most popular, but recently the Higgs competition drew more participants. This one had about seventeen hundred teams participating, and I got second place. It was actually an interesting experience for me, because I was not chasing the public leaderboard; I didn't try to overfit the leaderboard (sometimes I do, but this time I didn't try), and I still lost, somehow, to another team. I never figured out why. The problem is to use anonymized features to predict whether an employee access request will be granted. If you work in a fairly large company this happens all the time: you need access to that folder, and then someone says yes or no. All the features are categorical: resource IDs, manager IDs, user IDs, department IDs, all just numbers, and many of the features have many, many levels. But I wanted to use GBM, so this is where I converted all the categorical features into numerical ones.

There are two ways that I used to encode a categorical feature as a numerical one. One is to count how many times the level appears in the data set; you can do this for all your categorical features and for all the interactions of categorical features. The other is to use the average response, basically the average of the target for that level, as a predictor. Here you have to do something slightly more complex than a straight average, because a straight average will lead to overfitting on thin levels; I will show an example on the next slide. Beyond those encodings of the data, the final model is a linear combination of three different kinds of trees, plus glmnet, plus some trees based on subsets of the features. It is a blend that I weighted manually. At that time I did not fully understand online learners like Vowpal Wabbit, so I couldn't use them; you might expect that if I had used those kinds of things I would have ended up with an even better model. This particular competition had a requirement that the winners publish their code, so my code is up on GitHub. Please don't evaluate my software-engineering skills when you read it.

Now, here is something that is very easy to do for encoding categorical features by the mean of the response. This is a very simple data file: we have one categorical feature, the user ID, and for the level A1 we have six observations, four of them in the training data and two of them in the test data. For the training data you have the response variable, which is 0, 1, 1, 0; for the test data you don't have the response variable. Here is how to encode this categorical variable into a numerical one.
For each observation in the training data, you calculate the average response of everything in the same level except that observation itself. For the first one, whose response is 0, there are three other observations in the same level (numbers two, three, and four), and two of the three have response 1; that is why its encoded value is 0.667. The second one also has three other observations, with responses 0, 1, 0, so its value is one out of three, 0.333. You do not use the observation itself: if you use it, you will be overfitting the thin levels. Sometimes it also helps to add a little random noise to your training-set encoding; it helps smooth over the very distinctive values. For example, if you look at these values, 0.667 and 0.333, and you throw those kinds of numbers into GBM, it can go nuts, because it treats them almost as special values; if you add a small noise on top, it makes them look more like real data. You do not need any such special treatment for the test data: the test data just gets the straight average of the response values for that level in the training data, 0.5 here. That allows the information to flow to the test set. This is all you need to do to use categorical features in GBM; it is much easier than building a separate ridge regression, and I do this very often.
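A minimal pandas sketch of the two encodings, reproducing the worked example's numbers (0.667/0.333 in training, 0.5 in test); the column names are made up, and a level with a single training row would need a fallback that is omitted here.

```python
# Level counts, plus leave-one-out mean response with optional noise.
import numpy as np
import pandas as pd

train = pd.DataFrame({"user_id": ["A1"] * 4, "y": [0, 1, 1, 0]})
test = pd.DataFrame({"user_id": ["A1"] * 2})

# Encoding 1: how many times the level appears in the data.
counts = pd.concat([train["user_id"], test["user_id"]]).value_counts()
train["uid_count"] = train["user_id"].map(counts)
test["uid_count"] = test["user_id"].map(counts)

# Encoding 2: leave-one-out mean response for the training rows.
grp = train.groupby("user_id")["y"]
level_sum = grp.transform("sum")
level_n = grp.transform("count")
train["uid_mean"] = (level_sum - train["y"]) / (level_n - 1)  # 0.667, 0.333, ...

# Optional small noise so GBM does not latch onto the distinctive values.
rng = np.random.default_rng(0)
train["uid_mean"] *= 1 + 0.01 * rng.standard_normal(len(train))

# Test rows get the straight level mean from the training data: 0.5 here.
test["uid_mean"] = test["user_id"].map(grp.mean())
print(train, test, sep="\n")
```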
So that's the Amazon competition: a very simple data set, mostly feature engineering on anonymized categorical features, with a 1/0 response.

The Allstate competition was difficult in a different way: it had very structured target variables. It has seven correlated targets, A through G, which represent the options people chose when buying personal auto insurance policies; for example, you want comprehensive coverage, with or without some limit, and so on. This turned out to be a very difficult competition, for two reasons, and a lot of people hated it. The first reason is that the evaluation criterion is all-or-nothing: you have to predict all seven options correctly to get a point, and if you predict anything wrong you get zero, so partial answers get no credit. People hated that a lot; we had a long debate on the forum, but the sponsor didn't budge. I heard that even the host was upset, and even the sponsor wasn't entirely happy. The second reason is the baseline model: there is a trivial solution to this problem which is very good. The problem is to predict the final purchase options based on earlier transactions, but the options quoted in the last known transaction are already a very good predictor of the final purchase. If you just use the last quoted options as the prediction, with no change, you are right about 53.269 percent of the time; you really need a lot of decimals to see the differences here, hence all the decimals. I got third place in this competition: I was able to improve on that baseline by about 0.4 percent, and the number-one solution improved on mine by roughly another 0.03 percent. I am proud to say that this is statistically significantly better than the baseline. Whether it is practically significantly better, I don't know, but statistically speaking it is significant.

The challenge is driven mostly by the structure of the target variables: A through G are not independent of each other. Each one of them individually can be predicted very well, more than 90 percent accurately, but when you put them together there are actually quite interesting structures, and the models have to capture the correlation in order not to lose to the baseline. So I built chained models for the dependency. You first build a freestanding F model; then, given F, you build a G model; then, given those, you build the next one, and so on. You can also arrange the order of the models, putting the freestanding models first and the dependent models later, to make the whole thing better. To my mind this is a little more theoretically appealing because it is one systematic model; there is no hand-tuning of "I want this combination or that combination." The next element is that you do not want to lose to the baseline. So basically you build a two-stage model: you build one model to decide whether to go with the baseline or to use your own prediction, and only when that model says to use your prediction do you deviate from the baseline. That way, at least in the cases where you can't beat it, you do as well as the baseline, and where you can find improvement, you improve on it. I think that's how you can beat a very strong baseline.
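A minimal sketch of the chained-models plus don't-lose-to-the-baseline idea. scikit-learn's ClassifierChain feeds each target's model the earlier targets' predictions, which matches the chaining described here, though it is not the talk's own implementation; the data, the stand-in baseline, and the confidence gate are all made up.

```python
# Chain seven correlated binary targets, then only deviate from the
# baseline prediction when the chain is confident.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=1000, n_classes=7,
                                      random_state=0)  # stand-in for A..G

chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order=list(range(7)), random_state=0).fit(X, Y)
proba = chain.predict_proba(X)

baseline = Y.copy()                      # stand-in for "last quoted options"
confident = np.abs(proba - 0.5) > 0.45   # hypothetical confidence gate
final = np.where(confident, (proba > 0.5).astype(int), baseline)
```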
To finish, here are a few trivially useful pointers. One thing I highly recommend: if you want to learn how to build generalizable models, really participate in the competitions; just watching doesn't count. From my own perspective, the more frustrated I am doing it, the more I learn afterward. It's like: I couldn't solve this, I really hate myself, and then I see the answer and say, oh, that's how they did it. You really learn well that way. If you're just observing, as a spectator, you don't really get into the problems. And really read the forums: people post very useful notes there, and they post links to papers and to books. The next pointer: I never had a formal machine-learning education. My PhD was in engineering; it's related, but it was neither in statistics nor in computer science, so I learned a lot by just reading books and reading the Kaggle forums. One book that is very good is The Elements of Statistical Learning, from the Stanford guys. I highly recommend it; it's just about the Bible of machine learning, in a way, and it is also freely and legitimately available on the Internet. It covers a lot of things really well, and when you find that your knowledge is missing something, take a look at it; it is a very good survey book for statistical machine learning. The other pointers are pretty simple ones, such as the scikit-learn package in Python, and Vowpal Wabbit, which is actually quite useful; I've been using it a lot recently. With that, and hopefully I didn't keep you guys too long, we can take a few questions.

Q: What's your take on using neural networks in competitions, for data sets that aren't all images?
A: If you are doing image recognition, please use a convolutional neural network; without one you're dead, so for images, do use them. For other competitions you can use neural networks too, but neural networks have a disadvantage, beyond the general difficulty of training them: they had been out of fashion for a while, so there aren't a lot of very good packages for building regular neural networks outside of the convolutional stuff. That's why they're not as common.

Q: You told us how you would mitigate fields with values like NA in the test data. But what if the training data they give you doesn't contain any of those values while the actual test data does? Do you have to think through all those situations and mitigate them?
A: Yes, you do, but on Kaggle you are always given the predictors for the private leaderboard data, so you can examine them; there is nothing hidden to worry about, as long as the full feature data is provided. You can have a field that has no NAs at all in the training data and then does have them in the test data, and because you are given the private-leaderboard features, you can tell whether that is the case and mitigate it. If they didn't give you the test features, you couldn't do that, but in a competition you are given all the feature data up front, so there is no uncertainty on that point.

Q: In my experience, across choosing different models, tuning, and feature engineering, it's the feature engineering that gets you most of the distance toward the goal, and the models only a little, especially in business applications. To win a Kaggle competition the very fine differences between the top people may matter, but in business the difference between GBM versus trees versus lasso or ridge matters less to me than getting the one-hot encoding of the different predictors right and making sure the analysis runs on the right data. It's almost like asking which of food and water is most necessary. What's your take?
A: I tend to agree. Feature engineering is certainly very important, especially if you know there are good features to engineer for that particular problem. But I think in a lot of situations the line between the model and the feature engineering is not really that clear. The GBM, for example, actually does the interactions for you, which makes it much less
necessary for you to code the interactions yourself.

Q: What about explaining the model, or using simpler models for interpretation?
A: First, on Kaggle we have the luxury of not needing to explain anything. That said, it is actually quite common to build two models: one prediction model, and a separate, different, simpler model to explain things. That costs something, but I think it can be a reasonable approach. Beyond that, it really depends on what your model's purpose is. For supervised prediction models there are really two kinds. One is where you only need the predicted outcome; an example would be pricing insurance, where you just need to know who is more likely to have an accident, and in that case the complex model is perfectly fine. The other kind is what you might call an interventional model: based on whatever your model says, you will do something different. In that case you are actually leaving predictive modeling and entering the territory of causality, and there I strongly prefer simpler models and avoiding all kinds of complex interactions, because it is very difficult to untangle correlation and causality in observational data; a simple, understandable model is much better there.

Q: What are the ways of doing time series?
A: One is the proper time-series way, and all the proper time-series techniques are fundamentally linear models. The other way is to convert the time-series problem into a non-time-series problem.
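A minimal sketch of that second route, assuming a synthetic series and invented lag choices: lag features turn the series into an ordinary supervised table that any tabular model can fit.

```python
# Recast a time series as a supervised problem via lag features.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

s = pd.Series(np.sin(np.arange(300) / 10.0))  # toy series
df = pd.DataFrame({"y": s,
                   "lag_1": s.shift(1),
                   "lag_2": s.shift(2),
                   "lag_52": s.shift(52)}).dropna()  # e.g. same period last year

model = GradientBoostingRegressor(random_state=0)
model.fit(df[["lag_1", "lag_2", "lag_52"]], df["y"])
```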
I think we have to wrap up here; if you have more questions, you are welcome to come chat.

[An attendee describes a Citi Bike project: predicting how many bikes and docks will be available in five, ten, and fifteen minutes. The value from the last five minutes actually gives the best prediction, and the simple techniques are very hard to beat as a baseline.]

Host: Okay guys, we have a traditional way to evaluate our nice speakers. This means he's awesome; this means he's super awesome; and this means I don't want to see him again. Okay, we're voting honestly. Wow, you have all the hands in the air. Also, a request from the host of this location: if you have cups or those kinds of things, please take them out with you, because, as I said, there is a very early session here tomorrow.

[Informal questions afterward.]

Q: For the test data, is that just a straight average?
A: Yes, for the test data it is the straight average of the responses of all the observations in the same level in the training data. Say these rows were all level A2 and A2's training responses were 0, 1, 1, 0, 1, 1; then every A2 test row would get 0.667, because there are four ones and two zeros. Every test row with the same level always gets the same value: you go back to the training data, find all the rows of that level, and take the straight average of their responses.

Q: How many hours do you put into a Kaggle competition to win one, and how big are your EC2 instances?
A: Actually, I don't use EC2 at all; I have a pretty good computer at home. But to do well, to compete properly, it does take, I'd say, a couple hundred hours.

Q: What is the intuition for leaving the observation's own response out of the encoding?
A: It's really intuition, and it might be retrospectively created intuition. The idea is to make the training data as similar as possible to the testing data. If you think about it, when you look at a test record, you are using information from other records to predict it, because it doesn't have its own response. You want the same for the training data: for every record, you use the other records to predict it.

Q: And in the table, what does each row mean?
A: This is the user ID. This particular level happens to have six observations; another level might have ten, and another only one. Every particular level always maps to the same value at prediction time, so this just converts a categorical into a numerical; in that sense it is a kind of dimensionality reduction.

Q: I was wondering what other techniques you use. Do you use a lot of others?
A: I always try everything to see what works. The simpler things may actually work sometimes: things like one-way correlations, keeping the most correlated features, and so on. An SVD-style decomposition sometimes works better than, say, a big neural network, and sometimes the neural network works better; you lose something either way, so it is worth trying both.

Q: For the Allstate competition with the chained models, how did you decide the order of the chain?
A: You can estimate the dependence among the targets. F was the least correlated with anything else, so it went first as the freestanding model, and the most dependent ones came at the end. That's how I ordered the chain.

Q: How do you handle transaction data, where the sequences differ from person to person?
A: You have to collapse it in some way: features like the last transaction and the previous transaction, and also similarity features, such as whether my last transaction is similar to the ones before it. How to aggregate transactions into one row per user is, honestly, most of the feature engineering in those problems.

Q: What do you think of deep learning: the promise that you just push the raw data through and it learns the features, instead of doing feature engineering by hand?
A: First, how I see deep learning; I'll try to give my opinion. Deep learning is just a fancy name for neural networks, given the bad name that neural networks had.
The word "feature" means something different in the deep-network context compared to the kind of work I described. For vision, the researchers found that learned features beat hand-designed ones, and I don't argue with that. But in this kind of feature engineering things are very different: a feature like "the amount of the last transaction with the same vendor" is not something you see picked up automatically. For visual recognition the data is much larger, and images have their spatial and temporal structure, so there are only so many sensible transformations; with transactions there is an infinite number of ways to collapse them.

Q: Shouldn't there be automation for the feature and model selection that statisticians still do by hand?
A: Sure. The company I work for is building a system that has a lot of those things in it.

Q: This is about the Avazu competition, which you are currently leading. I know it's still going on, so I'm sure you can't say much, but are you using that fast solution that was posted on the forums at all?
A: No, I'm not using it. That is FTRL, follow-the-regularized-leader, which is the core of that code; it's online learning, which is very much in the same family. [The rest of this exchange, and a further question about down-sampling large data for nonlinear models, is largely unintelligible in the captions.]

Q: What kind of hardware setup do you have?
A: I have a quite good... [captions end here]
Info
Channel: NYC Data Science Academy
Views: 26,889
Rating: 4.9408865 out of 5
Keywords: owen zhang, data science, machine learning, kaggle, kaggle competition, nyc data science academy
Id: LgLcfZjNF44
Length: 67min 46sec (4066 seconds)
Published: Fri Oct 09 2015