Marios Michailidis: How to become a Kaggle #1: An introduction to model stacking

Video Statistics and Information

Captions
[Music] I'm Marios Michailidis, nice to meet you, and thank you for coming. I work as a manager of data science at dunnhumby, and I would like to talk with you about my journey to getting to first place on Kaggle. Obviously this is linked a lot with model stacking, which we will go through in detail later on, but it all started from here, so I think it is nice to begin the talk where it all started. I'll tell you a little bit about what Kaggle is — I presume most people here know, but a good way to describe it is as the world's biggest predictive-modelling competition platform. It has a lot of members — more than half a million right now — and various companies like Microsoft or dunnhumby host different modelling challenges: they supply a set of data and a question they want answered, which is then outsourced to lots of people who try to find the best solution. For example, dunnhumby might try to predict when a customer will next visit the shop, or Microsoft might try to find which virus has infected a certain file. All very diverse, all very interesting problems, and all of them can be solved with machine learning. I could give you a list of the challenges that are common, but I don't want to go into much detail at this point — I'll come back to it later.

Let me tell you how I started with Kaggle, because you might find it inspiring — I found it inspiring. I'm originally from Greece, and I came here six years ago to do my master's in risk management in Southampton. At the end of the year I was trying to decide what to do next, and I used to attend these entrepreneurship-type talks where people present their ideas for a business. There was this guy — his name is Kelvin — who presented his idea: after he finished his studies he started going to the horse races, but he wasn't betting, he was collecting data. He was collecting all sorts of data: weather conditions, who had won the last X number of races, where people bet the most — anything you can think of. He came up with all sorts of features, monitored them on a daily basis, and was able to build a predictive model that gave him more than a 50% chance of picking the winner, and he made lots of money with it. The point was that, the way he explained it, it seemed like a superpower. I thought that was kind of cool, I wanted to do it — and that is actually how it started. I was really curious. Back then SAS and SPSS were dominating the market, so if I wanted to learn about algorithms and tools I had to invest in those. Over time, as I became more passionate, I felt I had to dive deeper — I had to be able to code these things myself. I had zero programming experience at that point; my background is economics and then risk management, and the most I had ever done was some functions in Excel. So I picked up some programming skills. I specifically started with Java — there is a side story there, because I actually started with C and found it too difficult, so I thought maybe Java would be better — and later I also picked up Python, which was quite a nice transition because Python is in a way easier. I started learning a lot and coding a lot, and once I had built lots of algorithms I tried to apply them to different sorts of problems.
Then I decided to package everything and make it available, so I created a tool — more like a GUI — for simple credit-scoring applications, simple analytics applications you could say. It contains lots of algorithms and I named it KazAnova. The name comes from ANOVA — analysis of variance — plus the prefix "Kaz", which is my mother's last name. The idea of the logo is that the nose is quite big, because you have to be nosy with the data in order to get the most out of it. My sister made the logo; I try to include all my family in this — maybe a typical Greek thing. Before I joined dunnhumby I found out that they had already hosted two Kaggle competitions, trying to solve different problems, and people were talking a lot about it, so I thought I should upskill myself to see what exactly that was and how I could reach that level. I started with a few competitions and initially I didn't do very well, but it never felt that way, because I was learning a lot — I think that's the point. It didn't stress me; it was really nice because people were open, they were sharing things, there was a lot of collaboration, and I learned lots. Over time — I think after about 100 competitions — I also found a way to do consistently well. I do want to highlight that sharing and collaboration are great, and I encourage everybody to give them a shot. The end result of three years of actively participating in Kaggle competitions: I have played in more than 100 very diverse challenges, and in about half of them I have collaborated with different teams — different nationalities, different ages, completely different backgrounds: data scientists, engineers, biologists, lawyers, even people looking to make a career transition, since data science is a hot topic. I had a couple of top-ten finishes — obviously it didn't happen immediately; it took time to develop the sort of pipeline that can get you there, plus a lot of reading and a lot of collaboration, but eventually I got there — and I was a prize winner a couple of times. I have not participated only on Kaggle; there was a period when Kaggle was not enough — at some point I was so eager I was participating in everything that was out there. The result of all this is that at some point I managed to get first in the global ranking list. The way it works is that with every competition, based on how you place, you get different points — a bit like in tennis, where the more you participate and the better positions you achieve, the more ranking points you are awarded — so there is a sort of league. After three years of competing I managed to get to first. You may see the Greek flag there simply because the morale of my country is quite low at the moment with the financial crisis, so I thought I should give it a boost.
Now I'm slowly getting into what I think is the more interesting part for you: what wins competitions? In a very rough summary, first of all you need to understand the problem. You need to understand what you are trying to optimize, what you are trying to predict, and the data you have: do you have dates, categorical data, numerical data, data that is very specific to the kind of problem you are solving? You really need to spend time understanding what you are trying to solve. I can give you an example: I remember an insurance competition where you had to predict which policy customers would pick, but in 99% of cases they were simply renewing — picking the same policy again. So it was actually more valuable to predict whether they would renew or not, rather than trying to predict the exact policy straight away — to map it into a two-state problem. This kind of thorough investigation of the problem can really help you break it down into smaller ones and get better solutions. Second, you need discipline, especially in how you test. Kaggle has a leaderboard where you submit your scores and see how well you have done; however, in most situations what separates the winner from the rest is how well you manage to replicate that testing environment internally. If you create an internal test environment that is very reliable, you can try all sorts of things without being afraid — explore lots of possibilities, every permutation or combination of your data. You need to be very strict about this, because it is easy to build models that do not generalize — that underfit or, to use the term, overfit — so you need to follow strict guidelines about how you define model selection, hyperparameter optimization and so on. You also need to try problem-specific things: unless you try a convolutional neural network — with the previous Atari example, say — you are probably not going to do very well, and it is the same with image classification. Unless you try what is state of the art right now, you will not perform competitively, and that is the mentality you need: for every problem it is good to have at least an idea of what solves it best. And have the tools, which is the next thing I wanted to say: you need access to the right tools — hardware and software. You need to stay up to date, have the latest releases of the right packages, and always put in the time and the hours to try new things. I remember that in the stretch where I managed to get to first on Kaggle I was putting in something like 60 hours per week on top of everything else I was doing — not good for my health. Finally, collaboration is very important, because apart from being able to divide the work so that everybody focuses where they are more skilled or talented, everybody sees the problem from a different angle and may see things that you don't see.
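To go back to the point about replicating the testing environment internally: below is a minimal sketch of such an internal validation setup using scikit-learn. The dataset and the choice of model are placeholders, not something from the talk — it only illustrates the idea of scoring every change against the same fixed, seeded folds.

# Minimal sketch of an internal validation environment (illustrative only;
# the dataset and model are placeholders, not from the talk).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Fixed, seeded folds: every idea is scored against the same splits, so
# improvements in this internal score stay comparable over time.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=300, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("internal CV AUC: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))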
So in that sense collaboration, apart from any other obvious benefit — it makes things more interesting and more fun — also definitely helps performance. The last element, which is the one I will come back to and focus on, is ensembling. Ensembling refers to how we combine many different models to get a better result — how to create a very powerful model out of smaller or weaker models — and I think mastering this is what really gave me the edge in most of the competitions. I'll go into more detail on this later on.

So what am I doing after Kaggle — or rather alongside Kaggle, because it's not that I have stopped, although I don't participate as much as in the past? I'm doing a PhD at UCL on using ensemble methods to improve recommender systems, sponsored by dunnhumby, and as part of this I have developed StackNet. StackNet is a meta-modelling methodology — the same methodology I have used on Kaggle and have won multiple competitions with — and I'll tell you more about it now.

So what is StackNet? It is a meta-modelling methodology. By "meta-modelling" I mean that we use other models as inputs to a new model — it is an extra modelling phase on top of the initial one — and it uses Wolpert's stacked generalization. This is basically model stacking, in case you have heard the term: a way to combine different generalizers, different algorithms, to make a stronger model. But StackNet does so in a kind of neural-network architecture, where you build multiple models in different layers; the output of each model becomes an input to a new model, and you keep doing that until you have squeezed out all the information there is with regard to your target variable — whatever you are trying to predict. There are various elements to this that we will go into in detail later, but you can imagine it as a neural network where each node, instead of being a simple perceptron as in a normal neural network, can be any machine-learning algorithm — it can be arbitrarily complex; and by complex I don't mean complex numbers as in mathematics, I mean non-linear, complicated. The first version of StackNet — it is a methodology, but there is also a software implementation — already seems to be working quite well, although it has only a few algorithms for people to play with for the moment. It is written in Java, and I'll tell you more about the inspiration behind it now.

The first inspiration was stacked generalization, which Wolpert introduced in 1992. In very simple terms, stacking involves four stages. First, you take a training dataset and divide it into two parts. Second, on the first part you train various algorithms and make predictions for the second part. Third, you take that second part — for which you also have the labels, the correct answers — and the predictions from the previous stage now become the inputs to a new model. Fourth, you train that new model against those labels. That is basically it: the predictions on the second dataset form a new dataset which, together with the labels of that held-out part, lets you train a new model. This can also be seen as a way to correct the errors of the algorithms — of the predictions — you have already used.
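As a rough illustration of those four stages — not the StackNet implementation itself, just a minimal scikit-learn sketch in which the dataset and the base models are placeholder assumptions:

# Sketch of Wolpert-style stacked generalization with a simple two-part split.
# Dataset and model choices are illustrative, not taken from the talk.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=25, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)

# Stages 1-2: train the base models on part A and predict part B.
base_models = [RandomForestClassifier(n_estimators=200, random_state=0),
               GradientBoostingClassifier(random_state=0)]
meta_features = np.column_stack([
    m.fit(X_a, y_a).predict_proba(X_b)[:, 1] for m in base_models
])

# Stages 3-4: the predictions for part B become the inputs of a new model,
# trained against part B's true labels.
meta_model = LogisticRegression().fit(meta_features, y_b)

# At prediction time, the base models' probabilities on the test set are
# stacked the same way and passed to meta_model.predict_proba(...).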
The second inspiration is neural networks. Neural networks have actually been around for a while. Rosenblatt was the first to create the perceptron — a very simple neural network, we could say. Then for a period of time they were almost dead; then back-propagation came into the mix and the perceptron took its revenge for a while; and then, I think, the computational resources needed to make these models really predictive were simply not there. But now, with the usage of GPUs, we have reached a state where we can revisit them, make them deeper and more predictive, and they have re-emerged as the state of the art in predictive modelling for many problems. It is this kind of architecture — which has various advantages — that I wanted to use through the stacking mechanism, because as you can see it makes it very easy to build models, it is scalable, and by using the information across multiple layers you can squeeze out more and get a better prediction.

I built the first version of StackNet in Java. People ask me why Java and not C, which is more efficient, or Python, which has better integration with other tools. It's not C because C is not very easy to use — it doesn't really fit a data scientist's workflow, because it is too verbose. Java is also verbose, but not as much, and I personally find it quite easy to work quickly in Java, even for something like a Kaggle competition — in my first two years of competing on Kaggle, the first competition I won, I won entirely in Java with my own tools. It's not as easy as Python, but it is definitely easier than C. Every operating system has it, and I think that's an advantage, because when you build something you know it can scale — Java has a very rich ecosystem. It is also statically typed, which I think is good when you really want to build a solid software application — especially for me, since the way I code I am very prone to making mistakes, so this helps ensure I don't. Java doesn't have something like scikit-learn, and part of what I had in mind when developing this tool was exactly that: StackNet is one part of it, the algorithms it contains are another part; it can be used in a similar way, with the same kind of API, so people who would like to work in Java can focus on using this. It has good predictive algorithms and some basic data pre-processing steps — more will be added — but that was the idea.

So how does it work? Imagine a neural network, where each of the neurons can basically be described as a regressor. This is the perceptron: a weighted sum of inputs plus a bias, so it is almost a combination of regressions. Here we just replace that with any modelling function. The way you combine the input data from the previous layers doesn't have to be linear: it could be a random forest, it could be a gradient boosting machine, it could be anything you want. So instead of a linear regression, we replace the node with any function. If you want to see it more visually, picture model 1, model 2, and so on forming the first layer of models. All of this has been created from scratch; it doesn't rely on any other implementation — it is just pure Java code based on the papers or on other reference implementations.
Now, how does the training work? I think this is the interesting part, because in a traditional neural network you normally have something like back-propagation in order to train it. Here, you have to consider that a neural network is already very easy to overfit as it is — much more so if you combine things like random forests, gradient boosting machines and all these algorithms together. I haven't explored this exhaustively myself, but I would say it is extremely difficult to apply the same back-propagation logic to a network where any node can be any machine-learning algorithm. This is where stacking really comes into play.

Before I get to that, I'd like to point out a limitation of Wolpert's approach as he described it. In his formulation of stacked generalization, you have to divide your dataset into two parts; but that means that if you want to keep going, you always need to divide your data in two again, so if you wanted to combine algorithms over multiple levels you would always be training your models on much, much less data. This is a problem because — as you will see when you try it — you do get an uplift by adding more layers (obviously you get more from the first layers, but you can keep gaining), yet if you make the data too small you just cannot get significant results. So StackNet instead uses what I call a reusable holdout mechanism — don't confuse this term with the "reusable holdout" that has been used elsewhere recently for something completely different. What I mean here is that we take the training data and do K-fold: say we divide it into five parts — the number of folds is a hyperparameter — and, assuming 5-fold, we use four of the five parts to train a model and make predictions for the fifth. We do that five times, so we always predict in a holdout fashion — we always make predictions on data the model has not seen, as if it were test data, so our predictions are unbiased. Once everything has been scored, we just put it back together and carry that prediction forward to the next layer. This is basically how StackNet works, and it works really well in practice — it doesn't overfit, as long as there are no very strong temporal elements in your data (for instance, using data from the distant past to predict the far future).

There are many ways you could organize the training, but for now I have included two modes. One is exactly like a simple neural network: you build the first layer of predictions, then you move to the second layer and use the previous layer's output as the inputs to the new models, and so on. However, as explained earlier, here you don't have the notion of convergence over multiple epochs that you have in neural networks — every model is immediately trained on your outcome variable. So if you want to revisit the initial data, a good way to do it is to "re-stack" everything: when you are in the third layer, you include not only the immediately preceding layer's predictions but also everything else — even the original input data, if applicable.
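Here is a minimal sketch of that K-fold, out-of-fold way of building the next layer's inputs — again an illustration in scikit-learn rather than the actual Java implementation, with placeholder data and models:

# Sketch of the "reusable holdout" idea: every training row gets a prediction
# from a model that never saw it. Data and model choices are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=5000, n_features=25, random_state=1)
base_models = [RandomForestClassifier(n_estimators=200, random_state=1),
               GradientBoostingClassifier(random_state=1)]

kf = KFold(n_splits=5, shuffle=True, random_state=1)   # the fold count is a hyperparameter
level1 = np.zeros((len(X), len(base_models)))          # out-of-fold predictions

for train_idx, hold_idx in kf.split(X):
    for j, model in enumerate(base_models):
        model.fit(X[train_idx], y[train_idx])
        level1[hold_idx, j] = model.predict_proba(X[hold_idx])[:, 1]

# Normal mode: the next layer sees only the previous layer's predictions.
meta = LogisticRegression().fit(level1, y)

# Re-stacking mode: the next layer sees the predictions *and* the original input.
meta_restack = LogisticRegression().fit(np.hstack([level1, X]), y)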
This re-stacking mode works well in practice — in many situations it actually performs better, because allowing an algorithm to see the data again means it may find information that wasn't found initially, helped by the other models that have already explored part of the information that was in there. So you have these two modes available, and it can go either way: in some competitions one has worked better, in others it has been the other one. There are other possible modes, but for now StackNet comes with these two. [Applause]

Now I would like to talk a little about the actual implementation — the software and how it can be used — and later I'll give you an example, with code on GitHub, of how you can use it to get a top score in a Kaggle competition. The software package has commands you can run from the command line; the vision is that later there will be a GUI, a graphical user interface, but for now it is command-line only. Normally all you need to specify is a training file and a test file: "this is my training file, this is my test file". It expects the first column of your training file to be your target variable — what you are trying to predict. Then you have various options: how many folds to use for cross-validation; a seed for the randomized procedures, so you can replicate results; whether to use the re-stacking mode (the "stack data" option); the metric you want to monitor (for now it supports only four, more will be added); and how many models to run in parallel — the stronger your machine, the more models per layer can run in parallel, and you can also give threads to those models individually, so you could, for example, run two random forests in parallel where each one also builds its trees in parallel. That way you can leverage your cores and allocate them optimally. A few other details: StackNet supports both sparse and dense data. In case you are not familiar with sparse data, it is when you don't store the zeros in your data — you keep only the non-zero values — which is extremely useful when you have lots of dummy variables, for example lots of binary features, so you only track the ones. StackNet also requires a parameter file. The parameter file is normally a txt file: a blank line separates the levels, and within each level you list the models you want to add, each with its hyperparameters. For example: I want to build a StackNet with two levels; in the first level I want a logistic regression that uses regularization with a C value of two, a gradient boosting forest classifier with a given shrinkage and number of estimators, and a random forest; then I want the outputs of these models to be fitted by a random forest classifier with 1,500 trees. That is what it looks like. Then you just run the train command, point it at the parameter file and the training and test files, and it produces the probabilities — your predictions — in a score file, however you define it. It is as simple as that — or not so simple, I don't know how you find it.
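To make the shape of this concrete, here is what a two-level parameter file and a train command could look like, pieced together from that description. Treat it purely as an illustration: the option names, model keywords and hyperparameter spellings below are assumptions made for readability, and the exact syntax is the one documented in the StackNet repository.

params.txt (a blank line separates level 1 from level 2):

    LogisticRegression C:2.0 threads:1 seed:1
    GradientBoostingForestClassifier estimators:1000 shrinkage:0.05 threads:1 seed:1
    RandomForestClassifier estimators:1000 threads:1 seed:1

    RandomForestClassifier estimators:1500 threads:1 seed:1

and an assumed train command in the same spirit:

    java -Xmx3g -jar stacknet.jar train train_file=train.csv test_file=test.csv params=params.txt pred_file=predictions.csv folds=5 seed=1 metric=auc threads=4 sparse=false stackdata=true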
Now I want to take you through an example that I actually finished a couple of hours before coming here. Earlier I mentioned that when I started on Kaggle four years ago I initially didn't do very well — this was the very first competition I joined. It was sponsored by Amazon, and the task was, based on some categorical data, to predict whether a certain employee would be granted access — some sort of access, for example administrative rights or the ability to look at transactions — using features like who your manager is, what your role in the company is, what kind of resource you are requesting permission for, that kind of thing. With a little bit of work — a bit less than two hours of computing time — you can get a top-10 score in this competition using StackNet and a little Python code that I'll show you for running these experiments. It was a popular challenge back then: when I first joined there were about 1,700 teams and I finished somewhere around 100th after spending three weeks. This time I spent more like two days — and only because I wanted to make sure everything looked okay; I think I could do it much faster — and it can get you to a top-10 score within a few hours. What was really interesting about this competition is that it has only eight variables, so you would think: how difficult can it be, just eight variables? But you actually see huge differences in the scores, because all the variables are categorical with extremely high cardinality — thousands of categories each — so it is a battle against underfitting and overfitting: which categories to keep, and obviously you need to explore lots of different interactions in this dataset to do well. I have to admit I was cocky when I entered that competition — I said, come on, eight variables, how difficult can it be? — but I learned my lesson quickly. It was quite tough; I managed to finish around 100th, and only after reading a lot in the forums and receiving a lot of help from the other competitors. That's why I recommend Kaggle: this is where I really found out, after all those hours I had put into developing software, where I actually stood when tested in an environment with so many skilled people. So, in this problem you have to predict whether an employee will be granted access, and you have to optimize the area under the ROC curve — I hope people are familiar with it; if not, it is simply a metric that rewards a strong correspondence between your predictions and the target variable, so the higher your predicted score, the more likely the person truly is to be granted access. To run this experiment you need to download the data from that competition — hopefully we will share the link and the slides later so you can try it yourself — and you can find the detailed example in the package's official GitHub repo. Initially you need to prepare the dataset: you run a Python script which produces several files for you — modelling datasets — which you then feed into StackNet. I built two different StackNets for this, because I wanted people to understand the different options they have.
The first StackNet uses sparse data. What I did was take these eight variables and create all possible interactions up to four-way: variable 1 by variable 2, variables 1-2-3-4, then 1-2-3-5, then 1-2-3-6, and so on — but in a forward manner: I estimated the AUC in cross-validation, added an interaction, repeated the same cross-validation, and if the score improved I kept it, otherwise I removed it. This keeps only the important ones — a brute-force, forward interaction search, a kind of feature selection — and I used logistic regression to do it. The output then comes out as a sparse file: for all these interactions I did something in Python called one-hot encoding, so every value of every feature in these interaction subsets becomes a binary feature — 1 or 0, is this role present; 1 or 0, is this combination of role and role family present — and this created something like half a million columns. Unless you find a way to compress this data you cannot really run a model; in reality most algorithms ignore the zero elements — they only care about when something is present, when it has the value 1 — and that is exactly what sparse data does: it keeps only the non-zero values. I'm just explaining this in case you are not familiar with it. Once you have produced these sparse files — they follow something called the SVMLight format — you can run the command. It may look a little intimidating, but the first line says: run Java and allocate 3 gigabytes of memory; this is my training file; this is my test file; I have a parameter file with nine algorithms (I'll show it to you later); produce an output prediction file called amazon_linear_pred; use 5-fold cross-validation; my data is sparse; use one thread here; this is the seed; and monitor AUC at each step. Here is the performance of the nine models selected: some logistic regressions — LibFM is very good, factorization machines work very well on this problem — some neural networks, and gradient boosting. You can see the AUC of each of these models, and I selected a random forest as the level-2 classifier, fitted on the outputs of all the other models; it scored an AUC of 0.901, which is about a point better than the best model of the previous layer, a logistic regression. Generally this is what stacking does — it doesn't do magic. It can give you a better result than your best single model, but the size of that boost depends on many things, such as diversity — how correlated the models you created in the previous layer are — and it can give you the edge, especially in competitions where you are trying to win by very small margins. [Audience question: how do you know when to stop — when to leave the model as it is rather than add the next level?] That's a good point — we don't. The way I built it is: first I choose only one model and tune it via cross-validation — that is how each one is tuned — then I select the second model, and so on. Obviously, the more models you add, the better the performance you can expect, and you would hope that even if some of the models you added are redundant, your second-layer model will be able to figure that out — but there is no way to know in advance that you should stop here. The way it works is "the more the merrier", and you can hope to discard some of them later on in the meta-modelling phase.
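A rough sketch of that preparation step follows — building the categorical interactions, keeping them with a greedy forward search on cross-validated AUC, one-hot encoding the result into a sparse matrix and dumping it in SVMLight format. Paths, column names and the selection loop are illustrative; the actual script the talk refers to lives in the StackNet GitHub example.

# Sketch: categorical interactions -> forward selection -> sparse SVMLight file.
# Paths, column names and the simple selection loop are placeholders.
from itertools import combinations
import pandas as pd
from sklearn.datasets import dump_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

train = pd.read_csv("train.csv")              # Amazon access data (placeholder path)
y = train.pop("ACTION").values                # target column name is an assumption
cat_cols = list(train.columns)                # the eight categorical variables

# Build interaction columns up to four-way by concatenating category values.
for k in (2, 3, 4):
    for combo in combinations(cat_cols, k):
        train["_x_".join(combo)] = train[list(combo)].astype(str).agg("_".join, axis=1)

enc = OneHotEncoder(handle_unknown="ignore")   # very wide, but stored sparse

# Greedy forward search: keep an interaction only if it improves the
# cross-validated AUC of a logistic regression (brute force, and slow).
kept, best = list(cat_cols), 0.0
for col in [c for c in train.columns if c not in cat_cols]:
    X = enc.fit_transform(train[kept + [col]].astype(str))
    score = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                            cv=5, scoring="roc_auc").mean()
    if score > best:
        best, kept = score, kept + [col]

X_final = enc.fit_transform(train[kept].astype(str))
dump_svmlight_file(X_final, y, "amazon_sparse_train.svm")  # SVMLight format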
So this part is automatic, but you can also bring in some manual judgement: for example, StackNet can output all the predictions of the previous layers, so you can check for yourself whether to stop or add more models. In general, though, there is no way to determine the right architecture; in most situations the more you add the better, and at some point it converges — you see in cross-validation that you cannot improve any more. There is no easy mathematical rule that says "stop here, you cannot add any more value" — normally, the more the better. I think we can take questions later. All right — so that was the best-performing model. A quick word on what the output looks like: it's just a bit ugly right now. You type the command on the command line, it tells you it has loaded the data, that the data has so many elements and so on, and then it starts printing the results of each model for each fold — and that is also how you tune them: you look at the results and try to get the best score per fold for each model.

The second experiment uses dense data, and the difference from the first one is that here we provide the folds ourselves. Previously you supplied a single file and StackNet ran the cross-validation internally; here we supply five separate pairs of training and validation data for StackNet to use. The reason for doing this is that sometimes we need to create certain features within the cross-validation, which is common, especially on Kaggle: we use likelihood features — or "weight of evidence", if you have heard that term, which is also commonly used in credit. These features are not safe to create on the whole dataset; it is better to create them within cross-validation. So sometimes it is not practical to run StackNet unless you create those features yourself; this mode is just a different way to run StackNet, on your own folds, but here you are responsible for controlling overfitting yourself. For this dataset I built all the three-way interactions, but this time with no feature selection; I used the counts of each of the categories and also created these likelihood features, so that each category is represented by a score reflecting the target variable — essentially, what is the probability of being granted access. You run the commands; what is different is the data-prefix option on the second line, which says we will supply our own folds — our own pairs of training and validation data — and because we specified five folds it expects five pairs of files: amazon_counts_train0, train1, train2, train3 and train4, so you can run StackNet manually. Here I had a different selection of models: on this type of data gradient boosting performed best, but the models were more correlated with each other, so the boost from the level-2 meta-classifier was actually smaller — only about 0.4 of a point — although overall this model is stronger than the previous one. Then what I did was take these two predictions, blend them, and submit the result to Kaggle.
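Those likelihood features are essentially out-of-fold target-mean encodings, and the reason they must be built inside the cross-validation is leakage: a category's encoding should never be computed from the same rows it will be used to predict. A minimal sketch of that idea follows; the column names and the smoothing are assumptions, not necessarily what the talk used.

# Out-of-fold likelihood (target-mean) encoding: each fold's encoding is
# computed only from the other folds. Names and smoothing are illustrative.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_likelihood(train, col, target, n_splits=5, prior_weight=20, seed=1):
    """Return an out-of-fold target-mean encoding for `col`."""
    encoded = np.zeros(len(train))
    global_mean = train[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        stats = train.iloc[fit_idx].groupby(col)[target].agg(["mean", "count"])
        # Shrink rare categories toward the global mean (the smoothing is an assumption).
        smoothed = ((stats["mean"] * stats["count"] + global_mean * prior_weight)
                    / (stats["count"] + prior_weight))
        encoded[enc_idx] = (train.iloc[enc_idx][col]
                            .map(smoothed).fillna(global_mean).values)
    return encoded

# Usage sketch (file and column names are placeholders):
# train = pd.read_csv("train.csv")
# train["resource_likelihood"] = oof_likelihood(train, "RESOURCE", "ACTION")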
You can run this example and you will get a top-10 score, and there is plenty of room for you to go after the top spot — I think you have all the tools you need, but I'll leave that up to you. Some things to be careful about: StackNet is not magic. It may have sounded like it, but what it actually gives you is somewhat better results than your best single prediction model — and normally, most of the time, better than any simple ensembling method such as averaging or voting. It will perform better, at least better enough to win Kaggle competitions, and sometimes it is of great value in real life too — although the computational cost is something to consider for large-scale applications. StackNet may also underperform when there are strong temporal elements — when you have past data and must predict the far future — because it squeezes so much information out of your target variable that it might overfit your training data no matter how carefully you have done your cross-validation. So for such problems you may need to do your stacking quite manually: possibly a single split, where you make certain the validation data lies well into the future, so you recreate the same conditions you will be tested on. And when you are supplying your own folds, as in the second experiment, you are responsible for controlling overfitting: you need to be very careful not to have cases overlapping between training and validation data, and not to have information in your validation data that you wouldn't normally have in a real test environment.
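As a small illustration of that temporal caveat, a single time-aware split keeps every validation row strictly later than the training rows, so it mirrors the conditions of the real test period. The file and column names below are placeholders.

# Sketch of a single time-based holdout: validate only on data that lies in
# the future relative to the training rows. Names are placeholders.
import pandas as pd

df = pd.read_csv("data_with_dates.csv", parse_dates=["date"]).sort_values("date")

cutoff_idx = int(len(df) * 0.8)            # last 20% of the timeline held out
train_part = df.iloc[:cutoff_idx]
valid_part = df.iloc[cutoff_idx:]

# Any likelihood/target-based features for valid_part must be computed from
# train_part only, so no information from the "future" leaks backwards.
print(len(train_part), "training rows up to", train_part["date"].max())
print(len(valid_part), "validation rows after that date")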
StackNet has already been used in some challenges. It actually won the Dato competition, where you had to crawl websites in order to determine whether they contained a certain kind of advertising or not — so it was more of a text-classification problem; that was the first time I used the name. Since then it was also used to win the Homesite quote-conversion challenge on Kaggle, and I released the software during a competition sponsored by Two Sigma, also on Kaggle, where a lot of people in the top 10 have reported that they are using it and that it gives them good scores — it's a little bit of a shame I released it, because the moment I did, everybody passed me on the leaderboard. I can give you a visual example of how stacking was used for that Dato competition: we built four layers, with two different input datasets similar to what I described before, and from each layer we kept building models until we had four levels, and it gave us enough of a boost to win the competition — you can look at it more analytically later on.

As for what I plan to do next with StackNet: for now it contains implementations of algorithms I developed myself — I would argue they are quite efficient — but what I want to do next is make certain it contains the top tools, so that when you build your own StackNet you can choose any of the tools you like or love. It also has to be accessible from other languages: right now it is just a Java tool, so I'll try to make a wrapper for Python, and possibly R, so that more people can use it. For now it covers just the modelling part, but in the future I would like to add at least some basic feature-engineering steps: converting categorical variables directly to different formats — counts or one-hot encoding — and possibly exploring some interactions for the linear models. Hyperparameter optimization is currently very manual, although it is fairly quick because you can resubmit and see your results; ideally there will be some support — I don't believe completely automated hyperparameter optimization is possible, to be honest, but you can at least get some decent initial values. Some feature selection too — right now there is none — and I would like to extend that to model selection as well, which comes back to your earlier point: after we have produced a lot of models, hoping that things converge, we can also have a phase where the redundant ones are made to disappear — we could refer to this as dropout.

I would like to close with a few more tools, some of them already mentioned — my favourite tools on Kaggle; I won't go into much detail. Scikit-learn is really good; it has pretty much everything. For gradient boosting I like XGBoost; LightGBM, developed by Microsoft, is extremely fast, and recently I have found it actually performing better. Vowpal Wabbit is extremely good for fast linear models — I found it very useful for some competitions that had billions of rows. LIBLINEAR is also very quick and very efficient for linear models; LibFM is very good too; and libFFM — field-aware factorization machines — has been extremely good for clickstream-type problems. Weka in Java has a lot of tools; GraphLab too, although I'm not sure whether it is still open source, and I like open source. Keras and Lasagne I do like; to be honest I don't personally enjoy working directly with TensorFlow or Theano — I'm lazy, so I prefer something like Keras that makes it easy to construct my neural networks. And when I have tasks that are specifically about ranking — given a list, can you order it from top to bottom by which item is more likely — then I like to use RankLib. And yeah, that was it — thank you for bearing with me. [Music]
Info
Channel: Data Science Festival
Views: 33,101
Rating: 4.9278197 out of 5
Keywords: Data science festival, data science, machine learning, kaggle, model stacking
Id: 9Vk1rXLhG48
Length: 54min 25sec (3265 seconds)
Published: Wed May 10 2017