CS480/680 Lecture 24: Gradient boosting, bagging, decision forests

Captions
Okay, good morning everyone, we're going to get started. This is our last lecture; I have some still interesting material to present and then we're going to wrap up the course. Today I'm going to talk about gradient boosting, we're going to come back to bagging, and in particular we'll discuss decision forests. Two lectures ago I introduced ensemble learning and boosting, with one particular algorithm known as AdaBoost. That algorithm is mostly for classification, so now we're going to see one that can be used in the context of regression, called gradient boosting. Then we're going to come back to bagging, where we're going to see how we can make it work in terms of satisfying some of the assumptions that are needed.

So, gradient boosting: as the name suggests, we're going to do boosting, but through some gradient method, and this will be particularly useful for doing boosting for regression. If you recall the AdaBoost algorithm, it is great for classification: you've got a base learner, you learn a simple classifier that doesn't have to be perfect, and depending on where it makes mistakes you reweight the data set, apply the same base learner to get a second classifier, and continue to update in this fashion. The idea is that at the end you get a set of classifiers that can be combined together. But AdaBoost was specifically designed for classification, and we need to do something different if we want to use boosting in the context of regression. For this, the gradient boosting algorithm will be our solution; in fact it's one of several possible solutions.

So what is gradient boosting? The idea is that we start with some predictor f_k. Here f_k is essentially a regression technique: it's a hypothesis, a function that makes a prediction for some input. At stage k we have a predictor f_k, and it has some loss, measured by a loss function L that compares f_k(x) to the desired output y. Gradient boosting can be applied with a wide range of loss functions, which is why I'm simply using L to denote a loss function in general. Given some loss function, the idea is that we compute the negative gradient of the loss with respect to our current predictor f_k, and then we fit our next predictor to this negative gradient. This is a little bit counterintuitive, because when we do regression we normally fit the predictor to the desired output y, but in this case we fit the predictor to approximate the negative gradient. The reason is that if I already have a predictor f_k and I compute a new predictor h_{k+1}, then f_k plus h_{k+1} times some multiplier eta_{k+1} will emulate a step of gradient descent. Normally f_k is used to make a prediction; the problem is that f_k is not perfect and I want to improve it. Instead of fitting another predictor that would also try to predict y, if we fit a predictor to approximate the negative of the gradient, then when we add this predictor we're effectively taking a step in the direction of the negative gradient, so we're doing a form of gradient descent. That's the intuition behind gradient boosting.

So at every step we have some predictor that is not perfect, it has some loss, some error, and we fit our next predictor to approximate the difference between the target and what is currently being predicted, which is effectively the negative gradient. Then, when we take the sum of our predictor with the approximation of the negative gradient, we're effectively taking a step in the negative direction of the gradient, and therefore we're doing gradient descent. Any questions regarding this slide? Okay, let's continue.

Now consider a specific loss function. In the case of regression the most popular loss function is the squared loss, so L would be 0.5 (f(x_n) - y_n)^2, where the factor of 0.5 is there to cancel the 2 that comes down when I take the derivative. If I take the negative gradient of the squared loss with respect to f, I obtain the difference between my target y and my predictor f, and this difference is known in the literature as the residual. So when we fit our next hypothesis to approximate the negative gradient, which sounds kind of weird, what we are effectively doing in the context of the squared loss is fitting the difference between our target and our current predictor. When we add this new hypothesis to our predictor, hopefully it fills in this gap, reduces the difference and improves the overall accuracy.

The next step is to train a base learner h_{k+1}, and we train it to predict the residual for every data point instead of the target y. I'm going to call this data set the residual data set: normally a data set is a bunch of (x, y) pairs and I try to approximate y with a predictor, but in this case I approximate the residual. If I do a good job of approximating the residual, then I can add this to my current predictor and it should reduce the overall error. In terms of the base learner, I can use any type of base learner, but for this to be effective I want it to be a nonlinear predictor: if I already have a predictor f that is linear and I add another linear predictor to it, the sum is still linear and I would not really improve; if my new predictor is nonlinear, then there is a chance that I will improve. So this is gradient boosting instantiated with the squared loss: we train a predictor to approximate the difference, the error between the target and the current predictor, so that when we add it to the predictor the error is reduced.
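To make that single step concrete, here is a minimal sketch in Python, assuming scikit-learn is available and using a small regression tree as the nonlinear base learner; the data X, y and the current predictor f_k are toy placeholders, not anything from the lecture:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Squared loss: L = 0.5 * (f(x) - y)^2, so the negative gradient w.r.t. f(x)
# is simply the residual y - f(x).
def pseudo_residuals(y, f_x):
    return y - f_x

# Toy data and a current predictor f_k (here: just a constant prediction).
X = np.random.rand(100, 3)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(100)
f_k = np.full(100, y.mean())

# Fit the next base learner h_{k+1} to the residuals, not to y itself.
r = pseudo_residuals(y, f_k)
h_next = DecisionTreeRegressor(max_depth=2).fit(X, r)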
If we put all of this together into one algorithm, we obtain the following pseudocode. Initially we set our first predictor to be a constant, by finding the constant c that minimizes the sum of the losses over all data points. Again, we have a loss function that compares our prediction to the target; the target is y and the prediction here is just the constant c. So for the first step I start with a predictor that is just a constant. It's not a good predictor, but it has the advantage of being simple, and it also puts us in the right range: if the values I'm trying to predict are fairly large, then I'll get a c that is large, and if they should be small, c will be small. This is a simple optimization, because there is only one variable, c.

After this, I do a loop where at every iteration k I compute a new predictor by following these steps. First I compute what I will call a pseudo-residual. I call it a pseudo-residual because the notion of a residual only really makes sense for the squared loss; with some other loss function the negative gradient gives some value that typically has something to do with the error being made, but it is not necessarily a residual, which by definition is the difference between the target and the prediction. So a pseudo-residual is more general than a regular residual, and it depends on the loss function. Once I have a pseudo-residual for every data point, I form a new data set, the residual data set, and I train a new base learner h_k on that data set. This base learner tries to find a hypothesis that approximates the residual, or more generally the negative gradient.

After that, I add to my predictor the new hypothesis multiplied by some step length eta, and because the new hypothesis is an approximation to the negative gradient, I can think of eta as a step length: I'm effectively doing a form of gradient descent. Now it's not clear what the best step length might be, but what we can do is simply pick the step length eta_k that minimizes the loss of the combination: at this point I have my previous predictor and my new hypothesis, I want to combine them by taking a step in the direction of the negative gradient that will hopefully reduce the error, and for this I simply optimize the step length, an optimization with respect to eta only. Once I have eta_k for this particular hypothesis, I am ready to combine h_k with the previous predictor f_{k-1} to obtain a new predictor f_k = f_{k-1} + eta_k * h_k.
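Putting those steps together, here is a compact sketch of the full loop under the squared loss (my own illustration, not the lecture's code), again assuming scikit-learn for the base learner; the coarse grid search over the step length stands in for the one-dimensional optimization of eta:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=50):
    # Step 1: initial predictor is the constant that minimizes the squared loss -> the mean.
    c = y.mean()
    f = np.full(len(y), c)
    learners, etas = [], []
    for k in range(n_rounds):
        # Step 2: pseudo-residuals = negative gradient of the squared loss at the current predictor.
        r = y - f
        # Step 3: fit a base learner to the residual data set {(x_n, r_n)}.
        h = DecisionTreeRegressor(max_depth=2).fit(X, r)
        h_pred = h.predict(X)
        # Step 4: choose the step length eta_k that minimizes the loss (coarse 1-D search).
        candidates = np.linspace(0.0, 1.0, 21)
        eta = min(candidates, key=lambda e: np.mean((y - (f + e * h_pred)) ** 2))
        # Step 5: update the predictor by adding eta_k * h_k.
        f = f + eta * h_pred
        learners.append(h)
        etas.append(eta)
    return c, learners, etas

def predict(model, X):
    c, learners, etas = model
    return c + sum(eta * h.predict(X) for h, eta in zip(learners, etas))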
Now, if you look carefully at this algorithm, can anybody tell me at what step we are updating some weights in a predictor? Normally in machine learning we have some parameters, some weights, and we update them, so where is the weight update happening in this algorithm? When we compute the negative gradient, we are computing a scalar, a value for each data point, but that is not a weight update per se; we will use those values to update some weights at some point, but that's not where the update happens. The weight update happens when we train the base learner: h_k has some weights, it could be a neural network, a decision tree, any type of regressor, so it has some parameters, and this is where they get updated.

What about the last step, where we update the predictor: do we update any weights there? Who would say we update some weights here, and who would say we do not? Most people are in the uncertain mode. A few people said we are not updating any weights, and that's right. f_{k-1} is a predictor; it's fixed, it has some weights but they're fixed. h_k we already optimized above, it has some weights and it's fixed, and eta_k is a step length we already found, so it's fixed as well. All of these things are fixed. What I'm really doing in that step is combining several hypotheses together, not updating or combining their weights in any way. When I say "update the predictor", what I really mean is that I take a weighted combination of several predictors, and when I apply the result to a data point I apply each predictor to that data point and combine their predictions through that weighted combination.

This is interesting because the weights of f_{k-1} might not be optimal; presumably there is some error, and that's why we're finding a new hypothesis. We're not updating f_{k-1}, we leave it as is, and h_k does not predict y directly, all it does is predict the difference; we then simply combine the predictions. We should be able to convince ourselves that this leads to a predictor that gradually improves: at every iteration, when we optimize the step length, we already have a predictor and a new hypothesis. The step length can be set to zero; if we use a step length of zero, then f_k is the same as f_{k-1} and the predictor does not improve. But if we choose a step length different from zero, it means we found a way to reduce the loss function by choosing that eta, and therefore the combination will have a lower loss overall. Will this lead to a loss of zero in the long run? Not necessarily; it depends on the space of predictors we're considering and on how much noise there is in the data.
If we don't have noise, or at least we don't have two data points with the same input but different outputs, then there is a possibility of finding some function that fits the data perfectly, and if we choose hypotheses h from a space that is expressive enough, then perhaps in the long run we will gradually reduce the error and it might become zero. If we don't have these right conditions, the error will decrease but not necessarily to zero. Notice that each predictor has some weights, but we never change those weights; what we do is simply add more predictors to create a larger ensemble. This is a different way of thinking: in everything else we've done so far in this course, we had some parameters and we updated the parameters. Here we don't change the parameters after the step where we learn with the base learner; all we do is combine more hypotheses together, but they compensate for each other, and we can guarantee that at every step f_k has a lower loss than f_{k-1} on the training set.

What about overfitting? Is there a risk of overfitting with this approach? Who would say there is a risk of overfitting, and who would say this is robust and there is no risk? I had a grand total of one person answer each question. Okay, let's think a little bit about this. We've got an algorithm, and I explained that at every step we come up with a predictor that improves, at least with respect to the training set; we reduce the loss at every step. But this is with respect to the training set, so is there a risk of overfitting or not? A few more hands this time, but the majority is still unsure. Among those who said there is a risk of overfitting, what's the rationale? Right: at every step we're reducing the error with respect to the training set, at least the training loss. So yes, there is clearly a risk of overfitting, because if our base learners are expressive enough, at some point we might reach a loss of zero, and it would look like everything is perfect, but if there is any noise in the data, we are fitting that noise, so we're overfitting. This is interesting because it's in contrast to the AdaBoost algorithm we saw, which was actually fairly robust to overfitting; this one is susceptible to overfitting, it's prone to overfit.

Given what you have learned in this course, what could we do here to reduce or perhaps prevent overfitting? Any ideas? One suggestion: don't apply the new functions that we learn completely, but scale them by some kind of constant. Okay, or more generally we can introduce some form of randomization, and indeed we're going to see this in a moment when we talk again about bagging. Anything else besides randomization? Since overfitting happens because at every step we're reducing the training error, maybe we shouldn't do so many steps, so stop early. Now how do we choose K? Trial and error? Well, how can we choose K so that we don't overfit too much, with some empirical evidence that we'll get a good loss on a test set? Yes, let's use a validation set.
Essentially, at every step, when we add a new hypothesis and improve with respect to the training set, let's also verify how well we're doing with respect to a validation set. If the loss keeps decreasing on the validation set, everything is good, but when it starts increasing again on the validation set, that's a sign we're overfitting, and we can stop when the loss gets low enough. Very good.

This gradient boosting technique has become quite popular, and there is a package worth mentioning here known as XGBoost, which stands for "extreme gradient boosting". You might ask what "extreme" means here; the truth is that it's some form of marketing, it doesn't mean much, except that the authors want to advocate that this package has been optimized for performance: it's meant to be efficient and also accurate, and in terms of engineering it has been really well optimized. This package has been used by many people in all kinds of situations, in particular on Kaggle, where there are lots of challenges in which people won a competition and reported afterwards that the main algorithm they were using was actually XGBoost. If you go to the GitHub repository for XGBoost, there are at least twelve cases listed where the winning entry was based on XGBoost. And if you recall when we talked about the Netflix challenge, which was one of the first early challenges where boosting, or ensemble learning, was used: whenever you want to improve the accuracy of any type of classifier or predictor, ensemble learning is useful, you can often boost the accuracy by a few percentage points, and indeed in many subsequent competitions boosting has been the key. Any questions regarding gradient boosting or XGBoost? Good.
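As a rough usage sketch (not from the lecture), here is what gradient boosting with validation-based early stopping looks like with the xgboost Python package, assuming it is installed; note that the exact placement of early_stopping_rounds varies a bit across package versions:

import numpy as np
from xgboost import XGBRegressor

# Toy regression data split into train/validation sets.
rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(1000)
X_tr, X_val, y_tr, y_val = X[:800], X[800:], y[:800], y[800:]

# Gradient-boosted trees with early stopping: training stops once the
# validation loss has not improved for 20 boosting rounds.
model = XGBRegressor(
    n_estimators=500,          # upper bound on the number of boosting rounds K
    learning_rate=0.1,         # shrinkage: a fixed step length applied to each new tree
    max_depth=3,               # small trees as weak base learners
    early_stopping_rounds=20,  # in older versions this argument is passed to fit() instead
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("boosting rounds actually used:", model.best_iteration + 1)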
Okay, let's continue. What I'm going to do now is write on the board a brief summary of some important differences between bagging and boosting. If you recall, we started with bagging last class, and then we talked about AdaBoost and gradient boosting; these are the two main paradigms for ensemble learning, and we're going to come back to bagging for a few more slides.

For bagging, we make the prediction by doing a majority vote, and we had a few assumptions: first, we need independent hypotheses, and second, we need hypotheses with similar accuracies. With bagging we saw that if we have a bunch of hypotheses and each one makes errors that are independent of the others, it's a bit like having many people vote for a candidate: if everyone makes an independent choice, each with some probability of making a mistake, then the majority will in general tend to do better. The trouble is that this only works if the hypotheses are independent and have similar accuracies. In the context of boosting we instead have weighted predictions: for AdaBoost we take a weighted combination of the classifications, and for gradient boosting we take a linear combination of the predictions. The benefit of this is that it allows some correlated hypotheses, and also hypotheses with unbalanced accuracies. Boosting is more flexible in the sense that, because it combines the predictions with weights, the weights can be used to bring down the importance of correlated hypotheses: if you have hypotheses that are correlated, or even identical, the trouble is that they might gain too much importance, and to counter this we can decrease their weight. And if some hypotheses don't have the same accuracy, or the accuracy varies a lot, it makes sense to give a higher weight to the better hypotheses and a lower weight to the worse ones; we don't want the worse hypotheses to dominate. So boosting is nice in that sense.

Now it looks like boosting is clearly practical because it doesn't have many restrictions, whereas bagging is making assumptions that are unlikely to hold most of the time in practice. But an interesting question is: could we engineer some setup where those assumptions hold, or at least approximately hold? It turns out that yes, we can. If we want somewhat independent classifiers or predictors, we can do two things. We can do what is known as bootstrap sampling, where we sample a random subset of the data (classically, sampling with replacement): each time we train a base learner to produce a hypothesis, we give that base learner a subset sampled from our original data set. Because we feed different subsamples of the data to the base learner, this reduces the correlation between the hypotheses. The hypotheses are still likely to be correlated, because there is certain to be some overlap in the data, so they cannot be completely independent, but at least their correlations are reduced through this scheme. We can do this with the data, and we can also do something similar with the features; this is known as random projection. The idea is that normally we have a set of features and we want to use all of them to make a prediction, but that is the paradigm where we're trying to obtain the best predictor possible. In the paradigm of ensemble learning the key is not to obtain the best predictor; the key is to obtain a simple predictor that is better than random, and to get it as quickly as possible. So it doesn't hurt to simply subsample the features. This again reduces the correlation: if I have lots of features and one predictor is trained on one subset of features, the next one on a different subset, with perhaps some but not too much overlap, then we obtain predictors that are less correlated. In this way we obtain a set of predictors that are trained on effectively different data, because I subsample the data points and also the features.

Just to be clear, let me draw a picture. Say I've got a fairly large data set, which I'll represent as a rectangle where the rows are the instances and the columns are the features. When I say I subsample some data points, it means I choose at random some rows of this matrix.
So I might choose the following data points, drawing some lines at random: say X3 and X4, then X10 and X12, and perhaps X15, X17 and X18. I've subsampled some data points, and then I do the same thing with the features: perhaps I obtain f5, f9, f10, f14 and f19 (I just made those up). Now the rows and columns I've subsampled can be brought together to form a smaller data set, with rows X3, X4, X10, X12, X15, X17, X18 and features f5, f9, f10, f14, f19, and that's what I feed to my base learner. Each time, I'm going to subsample different data points and different features, which gives me different data sets, and therefore my base learner is going to produce different hypotheses. Depending on the amount of overlap, these hypotheses are going to be less correlated; if there's not too much overlap, I'm reducing the correlation, and then bagging should work better. Any questions regarding this?

All right, so now we can put those steps together to obtain the following bagging algorithm. As for gradient boosting, I have a loop, and I create K predictors. For each predictor, I first subsample some data, I also subsample some features, and together that gives me a smaller data set; I feed that to a base learner and it produces a hypothesis h_k. Then, depending on whether I'm doing classification or regression: for classification I simply take the majority of the classes produced by the hypotheses, so the class that obtains the most votes is the one I return as the prediction; for regression I just take the average of the predictions. In terms of the base learner, I can use any type, but what tends to be popular in the literature is a decision tree. Whenever we do bagging in this fashion with decision trees, we obtain a bag of decision trees, and one way of referring to that is to call it a forest; because this forest was obtained by sampling some data points and some features, so with some randomization, we call it a random forest. When you look at different packages you will sometimes see an option to select a random forest as your classifier or regressor, and this is what it corresponds to: you are creating a bag of decision trees, and each decision tree is trained on a subset of the data and the features. Any questions regarding this?

Yes, very good point: in practice, one of the beauties of this framework is that because we're training lots of small weak learners that don't have to be very good, we might as well distribute them over lots of cores or even a GPU, with simple algorithms producing those hypotheses, so this lends itself very well to distributed computing.
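Here is a small sketch of that bagging loop (my own illustration, assuming scikit-learn; in practice you would usually just use an off-the-shelf random forest implementation). Each tree is trained on a random subset of the rows and a random subset of the columns, and the predictions are combined by majority vote (for regression you would average the predictions instead):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, n_trees=25, row_frac=0.7, feat_frac=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    forest = []
    for _ in range(n_trees):
        # Subsample data points (rows, with replacement) and features (columns).
        rows = rng.choice(n, size=int(row_frac * n), replace=True)
        feats = rng.choice(d, size=max(1, int(feat_frac * d)), replace=False)
        tree = DecisionTreeClassifier(max_depth=3).fit(X[np.ix_(rows, feats)], y[rows])
        forest.append((tree, feats))
    return forest

def predict_majority(forest, X):
    # Each tree votes on its own feature subset; return the most common class per example.
    votes = np.stack([tree.predict(X[:, feats]) for tree, feats in forest]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

# Toy usage
rng = np.random.default_rng(1)
X = rng.random((300, 10))
y = (X[:, 0] + X[:, 3] > 1).astype(int)
forest = train_random_forest(X, y, rng=rng)
print("training accuracy:", (predict_majority(forest, X) == y).mean())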
Let me give you an example of an application that used random forests, at least at its beginning. Some of you may have bought a Kinect, produced by Microsoft several years ago, and played with it; at the beginning, when Microsoft started manufacturing the Kinect, it did use a random forest. What's interesting about the Kinect is that it was one of the first gaming consoles that did not require the player to hold a controller, like a joystick, to issue commands. The way you play with the Kinect in various games is that you act in front of a camera; in this example two people are playing volleyball by making the moves to play volleyball, and the Kinect tracks their motion, infers their posture, and therefore infers what they're doing. I should mention as well that when this came out it was before the deep learning revolution, so convolutional neural networks were not yet popular, and there was an important problem here of recognizing the posture and motion of people; it turned out that at the time the key was to use a form of ensemble learning.

So how did Microsoft design a solution for this? It produced what ended up being one of the first mass-produced, low-cost depth cameras. A depth camera is a special camera that does not return just the RGB color pixels but also depth information: it has an additional pixel value that indicates the depth, the distance of that pixel from the camera. The way this works is that if you look carefully at the Kinect, one of the circles corresponds to a projector, an infrared projector that produces a cloud of points in the infrared spectrum; another circle corresponds to an infrared camera that perceives this cloud of points; and the last circle corresponds to a regular RGB camera. The idea is that if we want to measure depth, one possibility is to project a cloud of points: if an object is near the camera, the points are going to be closer to each other, so the cloud of points is denser, whereas for something further in the background the points are further apart, so they're sparser. Based on this we can infer the depth information, and in fact inside the Kinect Microsoft built in hardware that does this calculation in real time, inferring depth from the distribution of points produced by the infrared projector. Normally the infrared image is not visible to the naked eye, because we don't see in the infrared spectrum, and that's kind of nice because it's not disturbing: imagine if you were playing a game and the Kinect were shooting visible points everywhere, that could be a little disturbing, but because it's in the infrared spectrum you don't see anything and it doesn't matter. With the right camera, an infrared camera, you can see the cloud of points, and that's what it looks like. From this cloud of points we can infer distances: as you see, objects that are closer tend to have a darker gray shade, indicating that they are closer, while things that are further away are lighter. This is all calculated in real time by the hardware, so it was quite an advance at the time.
Any questions regarding the Kinect? Okay. So the Kinect produces this depth map, which is quite useful for telling us about the shape of things, but for the purpose of playing a game where you're doing some motion we need to go further: we need to infer where the body parts are, and perhaps the motion of each body part. For this, Microsoft ended up treating the problem as a pixel labeling problem: you've got the depth map of somebody in some posture, and we'd like to identify which pixels are part of the hand, which are part of the wrist, the arm, the face, the hair, and so on. There are different parts of the body, and we'd like to label every pixel with the body part it corresponds to. Microsoft tackled this classification problem with a random forest: it trained a set of trees, a bag of trees, where the idea is that to classify which body part a pixel belongs to, I can look at nearby pixels and compare the depth value of my pixel to those neighboring pixels. Now there are a lot of neighboring pixels, essentially the entire image, and this is where the subsampling becomes useful, because you can simply subsample tens or dozens of neighboring pixels (and "neighboring" doesn't have to mean immediately adjacent, just in the region). For instance, if I try to classify a pixel that corresponds to my nose: my nose sticks out of my face, so the distance for my nose compared to other points on my face is shorter, and that gives a clue that this pixel might be part of my nose. Just by comparing depths of different pixels at different locations we get some clues. These are weak classifiers, so they're not very good in isolation, but we generate lots of them, and in fact the Kinect was also one of the first pieces of hardware that used GPUs and distributed computing effectively, not just for rendering images but also for processing them, in particular for doing the body part recognition. Any questions regarding the Kinect and body part recognition? Good.
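To give a feel for this kind of feature (this is only a toy illustration of the idea of comparing a pixel's depth to a few offset locations, not Microsoft's actual feature definition; the offsets, image size and depth values below are all made up):

import numpy as np

def depth_difference_features(depth, px, py, offsets):
    # Toy depth-comparison features for one pixel (px, py): the difference between
    # the pixel's depth and the depth at a few offset locations.
    h, w = depth.shape
    d0 = depth[py, px]
    feats = []
    for dx, dy in offsets:
        qx = min(max(px + dx, 0), w - 1)   # clamp to the image border
        qy = min(max(py + dy, 0), h - 1)
        feats.append(d0 - depth[qy, qx])
    return np.array(feats)

# Hypothetical usage: a fake 480x640 depth map and a handful of random offsets.
rng = np.random.default_rng(0)
depth = rng.random((480, 640)).astype(np.float32)
offsets = [tuple(rng.integers(-30, 31, size=2)) for _ in range(10)]
x = depth_difference_features(depth, px=320, py=240, offsets=offsets)
# A random forest would then be trained on such feature vectors, one per labeled pixel.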
All right, so as was suggested before, one of the nice advantages of bagging, boosting and ensemble learning in general is that instead of having one large, computationally intensive classifier or predictor, we can have lots of very small, lightweight predictors and distribute them, so this easily utilizes distributed architectures. The other advantage is that if we have lots of data, perhaps so much that it doesn't all fit in the memory of one computer, then since we subsample the data and the features, we can obtain smaller portions of the data that do fit in the memory of individual machines, and this helps us scale to large data sets.

When it comes to computation, as we've seen in the course, especially with respect to deep learning, GPUs have become key to parallelizing the computation. In this course you've done two assignments, one using Keras and another one using PyTorch, but there are actually several frameworks for working with GPUs, and all of them essentially provide some basic operations to manipulate vectors, matrices or tensors, where vectors are just one-dimensional arrays, matrices are two-dimensional arrays, and tensors are arrays with any number of dimensions. Operations on these multi-dimensional arrays can often be parallelized on a GPU, and these frameworks provide what's necessary so that you don't have to worry about the parallelization. With this we can obtain significant speedups, orders of magnitude, and that has become essential if we want to do large-scale computation in machine learning.
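For example, a toy PyTorch snippet (assuming PyTorch is installed; the GPU branch only runs if CUDA is available) showing that the same tensor operations are parallelized for you:

import torch

# Create two random matrices ("2-D tensors") and multiply them.
a = torch.randn(2048, 2048)
b = torch.randn(2048, 2048)

device = "cuda" if torch.cuda.is_available() else "cpu"
a, b = a.to(device), b.to(device)

# The same call runs on CPU or GPU; the framework handles the parallelization.
c = a @ b
print(c.shape, c.device)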
Sometimes as well, without necessarily having GPUs, we just happen to have a system with multiple cores, and then it is tempting to take a large data set and subdivide it so that different cores train different predictors, and then combine them. That partly makes sense; in fact that's what we've seen in the context of bagging: we do end up with multiple predictors and we combine them. But an interesting question is how we should combine those predictors. Should we take the average of the parameters of those classifiers or predictors? Even though it is very tempting to take the average of the parameters, this is not a good idea; you should not do this. You will encounter packages out there that suggest doing exactly that, and it's a common, very intuitive thing to do, but there are problems with it.

Let me illustrate with a concrete example. Consider a simple neural network that computes the exclusive-or (XOR) Boolean function. I'm going to design it so that I have one set of weights that corresponds to XOR, and another set of weights that is also XOR, and then I'll take the average of those weights; the predictor obtained by averaging the weights will not compute XOR. More generally, whenever you've got a machine learning system with hidden layers or hidden variables, like a neural network or a hidden Markov model, and you take the average, there is often a risk that you end up with something that does not have the same properties as your individual systems. Taking the average can be problematic.

Let me draw the example on the board. I start with inputs x1 and x2; these are Boolean variables, taking values 0 and 1. In my architecture I use thresholding units. With a single layer of thresholding units we cannot compute XOR directly, so we need two layers: two hidden threshold units, each taking a constant 1, x1 and x2 as inputs, and then an output threshold unit that combines the two hidden units together with another constant. Let me label the edges with weights w1 through w9: w1, w2, w3 into the first hidden unit, w4, w5, w6 into the second, and w7, w8, w9 into the output unit. Now I claim that with the following values I get exclusive-or. Encoding XOR means the truth table gives outputs 0, 1, 1, 0: if one of x1, x2 is 1 and the other is 0, the output is 1, and otherwise it is 0.

With the weights set up in this example, the first three weights, w1 to w3, correspond to checking that x1 is 1 and x2 is 0; only for this pattern does the first threshold unit produce a 1. The second set of weights, w4 to w6, corresponds to verifying the pattern 0 1, which again produces a 1. So I've set up the weights so that the first hidden unit checks for the pattern 1 0 and the second checks for 0 1; these are the two patterns that should give a 1. Then I combine them through another thresholding unit whose weights w7, w8 and w9 compute the OR of its inputs: as long as one of the two patterns is triggered, the output is 1. That's the first set of weights that works.

Now I could change those weights by swapping w1, w2, w3 with w4, w5, w6, essentially exchanging the upper and lower hidden units. That doesn't change anything: whether I check for the pattern 1 0 first and 0 1 second, or the other way around, it's the same thing; these are two symmetric solutions. But if I take the average of those two solutions, I obtain values where the weights w2 and w3 are 0, which means the first hidden unit is not looking at the inputs x1 and x2 at all; no matter what, that thresholding unit produces 0. The same thing happens for the bottom unit, since w5 and w6 are both 0. Since both hidden units output 0 and the output unit encodes the OR of the two, it also produces 0. So no matter what the input is, I get 0, and that does not correspond to exclusive-or.

This is not a rare example; I didn't have to work hard to construct it. In practice, whenever you've got a neural network with hidden units, or a hidden Markov model with hidden variables, you have symmetric solutions, and if you simply combine two symmetric solutions by averaging you will typically end up with something that doesn't have the same properties, that doesn't compute the same thing. That's why taking the average of the weights is not a safe thing to do; lots of packages promote it, but it's a dangerous thing to do. Any questions regarding this? Yes, I don't remember the references off the top of my head, but I know that in deep learning this has been advocated by a lot of people: if you don't have a machine that's big enough, divide your computation, get different predictors, and then average their parameters. The appeal is that averaging the parameters gives you a single predictor at the end, which is attractive if you want to deploy this on a system that doesn't have a lot of resources, but there is a risk whenever you do that.
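To make this concrete, here is a small self-contained sketch (my own construction: the specific weight values are one choice consistent with the story above, not the lecture's exact numbers). The two symmetric networks both compute XOR, the weight-averaged network does not, while combining their predictions instead still gives XOR:

import numpy as np

def threshold_net(W1, W2, w_out, x):
    # Two-layer threshold network: W1, W2 are weight vectors (bias, x1, x2)
    # for the two hidden units; w_out is (bias, h1, h2) for the output unit.
    z = np.array([1.0, *x])
    h1 = float(np.dot(W1, z) > 0)
    h2 = float(np.dot(W2, z) > 0)
    return float(np.dot(w_out, [1.0, h1, h2]) > 0)

# Two symmetric weight settings that both compute XOR
# (the hidden units detect the patterns (1,0) and (0,1), in opposite order).
net_a = (np.array([-0.5,  1, -1]), np.array([-0.5, -1,  1]), np.array([-0.5, 1, 1]))
net_b = (np.array([-0.5, -1,  1]), np.array([-0.5,  1, -1]), np.array([-0.5, 1, 1]))

# Averaging the weights kills the hidden detectors (the input weights become 0).
net_avg = tuple((wa + wb) / 2 for wa, wb in zip(net_a, net_b))

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a, b = threshold_net(*net_a, x), threshold_net(*net_b, x)
    avg_weights = threshold_net(*net_avg, x)   # average of the WEIGHTS: no longer XOR
    avg_preds = round((a + b) / 2)             # combining the PREDICTIONS: still XOR
    print(x, "net A:", a, "net B:", b, "avg weights:", avg_weights, "avg predictions:", avg_preds)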
So what is the safe thing to do when we combine predictors? We should not average the weights; instead, we should combine the predictions. In the case of classification that means doing a majority vote over the classes, and in the case of regression it means taking the average of the predictions. This is effectively what we've been discussing in the context of bagging and boosting: both bagging and boosting do the right thing, or at least the safe thing, whereas out there you will see people advocating simply averaging the parameters, and again, that carries some risks.

This concludes what I wanted to talk about in terms of the material for the course, but let me now tell you a little bit about other courses, in case you enjoyed this one and want to learn more about machine learning. At the University of Waterloo we have lots of great courses related to machine learning, so I've got a full slide of courses that you can take. You can start by taking a course in artificial intelligence: AI is broader, so you can think of machine learning as a subset of artificial intelligence, and in the AI course you will do a little bit of machine learning too. For instance, we haven't covered decision trees, and that gets covered in the AI course; we haven't talked about reinforcement learning, and that also gets covered there. So I highly encourage you to take the AI course. There is also a course called computational linear algebra that I would highly recommend; in fact, ideally I would have recommended that everyone take it before taking this course, because as you've probably noticed, linear algebra is quite useful here, and for those of you who want to continue with machine learning, as you go deeper and deeper, linear algebra really is the foundation of everything we do, so having good knowledge of that topic is important.

We also have another machine learning course called theoretical foundations of machine learning. In contrast to this course, which is an introductory course where we covered a lot of topics without going into much depth on any specific topic, especially the theory, that course looks at an important question: data complexity in machine learning. When we talk about complexity in computer science we typically think about time complexity and space complexity; in machine learning we have a third type of complexity, known as data complexity. In other words, whenever you've got a task and you want to predict something, how much data do you need to ensure a reliable prediction? In some applications you might say you don't have a choice because the data is given to you and you can't create more, but there is also the flipped question: if you have a fixed amount of data, what level of accuracy can you expect? In industry, somebody might ask you, "Do you think it's possible to predict X; could we predict the stock market?", and the next questions are whether you have enough data for this and whether your data is informative enough. So there are interesting questions about deriving the math that guides us in determining the amount of data needed to achieve a desired level of accuracy, or vice versa: given the amount of data, what level of accuracy can be achieved.
So this is a very interesting course; it's more theoretical, but I encourage you to take it. In our course we also talked a little bit about optimization. I did not assume anybody had a background in optimization, but as you saw, most of the algorithms are essentially derived by doing some form of optimization. We talked about gradient descent a lot, and a little bit about convexity, but we did not go much further than that, and there is a lot more to discuss, because at the end of the day most machine learning techniques boil down to some form of optimization. For this we have two courses: optimization for data science, and fundamentals of optimization. You'll notice that those two courses are at the 700 level, which means they are graduate-level courses. So what if you're an undergrad and you're interested in those courses? Here's a little secret: as an undergrad you can take graduate courses too. Most people don't realize this, but all you have to do is go to the instructor and request to take the course; usually there's a form to fill in, but most instructors will be okay with it as long as you're dedicated and willing to work hard. You might say, "If I'm with a bunch of graduate students, isn't this course going to be much harder, and shouldn't I avoid it?" The reality is that the main difference between undergrads and grads is that the grad students are maybe just a little older; everyone who is a grad student was an undergrad a year or a few years before, so you'll get there, and there isn't that much difference. From that perspective you can take graduate courses, and it's mostly a matter of going to the instructor and making a request.

We also have some 800-level courses, for example on deep learning and its applications, and I also teach a course specifically on reinforcement learning. The main difference with 800-level courses is that these are not regular courses: they are offered once in a while, with no regular offerings. They were offered recently, and I'm not sure when they will be offered again, but I list them as examples, and you should watch out for new 800-level courses going forward as well. Outside of computer science you can also take some courses in statistics; we've got lots of great ones. The reality is that there is some overlap between this course and most of those statistics courses, but the angle taken in statistics is definitely more theoretical and mathematical, whereas here we took a more algorithmic approach to everything; still, I highly recommend those courses too.

Beyond courses, we also have some data science programs that you can enroll in, both at the undergraduate and the graduate level. In fact some of you might be in one of those programs already, but if you're not and you're interested, you can go to this webpage. Data science generally means that you'll be taking courses and learning about topics at the intersection of AI, machine learning, data systems, statistics and optimization, and concretely, in those programs you will be taking mostly courses from this list (and a somewhat broader list), but this is a good sample of the courses you would take in a data science program.
We've got a Bachelor in data science, and also a Master in data science; the Master has a course-based option as well as a thesis-based option. The main difference with respect to our general computer science master's degree is that here you get to specialise in courses that are related, or at least relevant, to data science, in other words those courses and other courses related to them, whereas a general master's degree has breadth requirements where you have to take courses across different topics that go much broader, so you don't have the opportunity to take as many courses in this area. That's the main difference. Any questions regarding those courses or the data science programs?

Yes: I did not teach those two optimization courses, so I'm not sure what the precise difference is. My guess is that the first one, optimization for data science, is probably more applied, and the second one is probably more mathematical, looking at optimization in general, not just for data science but for all kinds of other problems. Yes, very good, so to summarize: the fundamentals course looks at the fundamentals of both continuous and discrete optimization, for everyone who does not have any background in optimization, whereas 794 is tailored to data science and is mostly continuous, mostly convex, optimization. All right, any other questions? Okay, in that case let's stop here. I want to thank everyone for your interest in the course, and I hope you've enjoyed it. [Applause]
Info
Channel: Pascal Poupart
Views: 3,858
Id: B3sN1BymGdc
Length: 74min 54sec (4494 seconds)
Published: Thu Aug 01 2019