Class 02 - The Learning Problem and Regularization

Captions
I'm Lorenzo Rosasco, and you'll be stuck with me for the next 15 classes. As Tommy was discussing in the first lecture, these are more or less the core material: we cover a lot of classical material, going up to more recent material. A word about actual support. There is a mighty draft of a book, which I'm not sure is ever going to be a real book, but it is definitely a bunch of notes for you guys, and we're going to give the whole thing to you right away. Chapter one is good for a first read, at least, and I'm going to try to update the notes throughout the semester, so you might want to go back to it every once in a while and check the last date on the file, because hopefully every week I'm going to update the whole thing. It roughly covers the 15 classes I'm going to teach, and probably a little more. This year we're trying to put this course on OCW; specifically, the 15 classes are going to be on OCW, so we're actually preparing slides which are more or less what I write on the board. I'm going to teach the whole course on the board, and we'll try as much as possible to also provide you those slides, which are basically a summary of the whole thing. The purpose is to give you enough material to study most things, and also to avoid scribbling like crazy: the idea is that you actually come to class, attend as much as possible, and spend more time listening, using this time to understand. I think today, once the computer is set up, we're going to start giving out the material. Let me just repeat two things I said at the end of last class, in case somebody missed them. The idea of this course is to fill what we feel to be somewhat of a gap in the current courses offered on machine learning. The majority of courses fall into one of two categories. One is what I would call the toolbox, bag-of-tools kind of course: it goes from everything to everything, graphical models, SVMs, deep networks, whatever, and the idea, since machine learning is such a broad field, is just to give you a quick overview of everything. By definition any course like this is not going to go deep into any concept; it just provides a bucket list of all the things you might want to try later in your life. As soon as you switch to more theoretical courses, people typically go very deep into the statistical aspect of the story, which is by far the most developed part: Vapnik-Chervonenkis theory, laws of large numbers, the usual concentration inequalities, nonparametric statistics, all of that. There are plenty of courses of both kinds that are very good, and we didn't feel there was any need for an extra course doing that. I don't want to name names now, but we list a few on the website, there are many others you can look at, and if you're not sure you can just ask us for more. So what do we try to achieve with this course? We want to convey the feeling that, while machine learning is such a broad field, it is not just a horizontal list of things that are completely disconnected; there are a bunch of ideas that you can follow throughout different areas. Whether you
do graphical models, regression, or whatever other estimation problem, you typically have to deal with certain basic ideas: the data might be a lot, but not an infinite amount; there might be some notion of noise; you need to trust the data, but only up to a point; and if you want to get everything out of the data, you have to make some kind of assumptions. The broad pillar connecting all these concepts, in our view, is the idea of regularization: an intrinsic trade-off, whenever you try to learn something, between fitting the data and getting something which is stable, in the sense that if you repeat the same operation a couple of times, the result you obtain doesn't change too much. The whole idea of regularization builds around this, and we want to see how it developed and what some basic instances of the idea are. So the goal is to introduce some basic principles, see how they develop into algorithms, and then go into some detail on the ingredients of this story that we think are worth discussing at least once in your life. The course is not meant to be introductory: it assumes a certain level of maturity from a mathematical point of view. Basically, you have to know your linear algebra pretty well and you need basic probability theory; then we're going to sprinkle on a little bit of functional analysis. You should know your functional analysis, or at least some of it, at Wikipedia level; that's the level we're talking about, but you have to know it, so go back and take a look. We're going to sprinkle a bit more on top, because we will use operators, we will sometimes ask what a functional derivative is, and maybe some convex analysis: we're going to say that there is not only the gradient, there is another thing called the subgradient that you can use where things are not differentiable. We're going to try to make this self-contained, and we're also going to try to bring relatively complicated things down to layman's terms, clearly not doing justice either to the twenty of you who want to go rigorous or to the twenty of you who just want an intuitive explanation. We found our own balance, and you have to stick with it, because there is no ideal one. All right, these are all the premises I wanted to make, so we can get started. The class today and the next two classes are the foundational part: today is about statistical learning theory, and then we spend two classes introducing function spaces, reproducing kernel Hilbert spaces and the like, not because we're in love with kernel methods, but because reproducing kernel Hilbert spaces (they are not kernel methods) are a useful tool when you're dealing with functions. Since 90% of what we're going to talk about is learning functions, it makes sense to know what a function space is and how you can build spaces of functions; that's why we spend two classes on the topic. Any questions? If you have questions, now or later on, just interrupt me; I usually say more than is actually needed, so we can definitely skip things. The usual problem with my writing is that it tends to start well and then decays exponentially in quality, so complain
at some point, and usually we can bring the quality back up for a while. All right, so today is going to be a somewhat detailed description of the ingredients of the learning problem as seen in the statistical learning theory framework. Let me start with simple stuff that Tommy already discussed: the problem of learning functions. We assume we have some inputs X and some outputs Y, and we want to find a function from X to Y. Let's make a drawing to begin with. Assume the space X, where all the x's live, is just the real line, and Y is also the real line, and we've got a bunch of values: these are the points x1, x2, x3, x4, and these are the corresponding y's. The goal, given this data, is to find a curve that describes the data. Already you can stop one second: what does it mean to describe the data? You can make an immediate distinction between fitting, interpolating, the data, and actually trying to get something beyond the data. And that is already puzzling, because what is beyond the data? What does it mean to do something else? We have to start thinking that there must be something going on that we haven't described yet, and our goal is exactly that: we don't want just a fit, we want something that is good for something else. Here I'm just going to write "new": we're going to get a new point. We're given this data today, but we also know that George is going to collect more data tonight and provide us some new data; say he's very lazy, so he's just going to get us one more point, and we need to say where we want to put our little circle. What is the correct way? Eventually he will also spend some time figuring out the answer himself, because he's our teacher in this game, and then we'll have to figure out whether we did a good job or not. So given this data, we're going to try to produce a function that depends on this data, and we're going to use an n subscript, or maybe a little hat, to denote quantities that depend on the data. We typically denote the data S; it is what we call the training set, it is what you're given today, and your goal is to do well on future data. Now, this is a simplistic picture, for a bunch of reasons. First of all, I'm putting four points here; there are still many applications where the number of points is not that large, but there are applications where the number of points is ten million or more. So that's the first thing to keep in mind: this picture is misleading because I drew n points with n equal to four, whereas n can be of the order of hundreds of thousands, millions, or more. The other misleading thing is that I made the input just one number, whereas it's actually kind of hard to find an application where you only have one number as your input: think of any application you want, and typically the number of input variables gets quite large. One running example we can keep in mind for the rest of the semester is images, where you want to make statements about images: say, if these are images of objects, you want to state what is in the image. The simplest thing you can do is assume each image is described by the values of its pixels, and you immediately see that even then you go up to the
order of tens or hundreds of numbers to describe each image, and it's not uncommon to find applications where the number of numbers per image is of the order of thousands or even millions, because in some sense it's up to you how many parameters you want to put in there. So these are two things to keep in mind: we often deal with massive data sets ("big data" became a trendy word a few years ago), and we are dealing with high-dimensional problems, where the variables can be exceedingly large. Aside from this couple of comments, calling this problem "learning" is perhaps a bit of a stretch, because at this level of generality you find the same problem in what people call approximation theory, in sampling theory, in classical statistics, and so on. Let's make another example, slightly more interesting but still simplistic: we have a bunch of points and we want to classify them into class +1 or -1. Don't be misled: this new drawing is not X and Y, it is just X. This is the space X, now made of two coordinates rather than one; Y would be somewhere else. We want to find a function such that, if you look at its zero level set, where the function changes sign, you get a line that ideally puts the crosses on one side and the circles on the other side. This is the classical binary classification problem; images of cats and dogs, or whatever, where you want to discriminate between them, is an example of this. Again, this is the case where the images have just two pixels, which is not very interesting in practice, but it is what our intuition can handle. All right, that's the basic idea. What we want to do is get a mathematical formulation out of this, and one thing we're going to try to do is always make clear what we mean by a problem and what we mean by an algorithm to solve that problem; we're hardly ever going to start from the algorithm. We're also going to distinguish between statistical algorithms, which are effective abstract procedures that might not be directly implementable on a computer, and actual numerical algorithms that you can implement on a computer. This distinction will become clear, but at this level of generality just think of statistical requirements and computational requirements, and when we think of an algorithm we want to keep an eye on both. Now, one of the nice things about supervised learning is that it is one of those problems where we are at a good level of understanding, and the problem itself is defined by two ingredients and can be written in three lines. Ingredient one is a probability space: we take the pairs of inputs and outputs and we think of them as living in the space X × Y, with a probability distribution rho across this space. It's so short that for now I won't comment much on it; we're going to spend the rest of the class doing that. In most of what we do, we will not try to estimate the probability distribution itself, just a function, and so we will need to measure error. That's the second ingredient: a function that goes from
the output space times the output space to the real line; it is what we call a loss function. It takes pairs of outputs and returns a real number which is positive; you can think of it as a cost. From a technical point of view we typically assume the loss function has some nice integrability properties with respect to its entries; we'll comment on that in a minute. Given these two ingredients, we define the problem as follows: solve the minimization of a certain quantity which we call the expected loss, or expected risk. This is the integral of the loss function with respect to the distribution; if you are allergic to integrals but okay with random variables, it is the expectation of the loss with respect to the pair (x, y). Our goal is to minimize this, because the semantics of the object is that we want the error on any possible pair of points to be small; that is exactly what we would like to do. The caveat is that we don't have access to rho: rho is fixed but unknown, and all we have is the training set, written over there, which we assume to be sampled from the product of n copies of rho. So the whole game is essentially this one line; this is why we call the thing learning. The notation is condensed, but that's it; this is the problem we're going to discuss for 15 lectures or more. And the notation still has a lot of interesting stuff happening. One thing is the little symbol saying that each pair comes from rho and that we take the product distribution; remember, the product distribution is the one where things are independent. So we assume the samples all come from the same distribution and are all independent of each other, and, most important, we assume the distribution is fixed but not known: all you're given to solve the problem is these n pairs. You see immediately how we tried to encode the notion of today and tomorrow, of predictivity, of generalization: the definition of the problem says do well tomorrow, do well on future data, but all we are stuck with is what we have today. We also see that there is no way to solve this problem perfectly; in general that is very unlikely, so we will have to settle for an approximate solution, and when we measure the quality of a solution we will not be able to say this is a perfect solution; we'll have to say there is some level of precision at which we can solve the problem, and then make sense of what it means to be precise when everything is stochastic, when we are looking at functions, and so on and so forth. Okay, this is it; this is the whole problem. A question from down here: over which functions do we minimize? I'm being a bit vague here: these are all the functions from the input space to the output space. That clearly doesn't make too much sense as stated, though it's almost correct. What do you need? When you take the expectation of a random variable, you have to be sure the expectation is well defined; strictly speaking the composition has to be a random variable, and if you look at the Wikipedia entry it says the function needs to be measurable. That's all you need: the space is not the space of all possible functions in the world, but the measurable ones. And if you're a normal human being who hasn't studied too much math in your life, you don't worry about measurability too much.
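To collect the "three lines" in one place, here is a minimal LaTeX rendering of the problem as just stated; the symbol used for the expected risk is a notational choice here, not necessarily what was on the board:

```latex
% The supervised learning problem in three lines.
% (1) Minimize the expected risk over measurable functions f : X -> Y:
\[
\inf_{f:\,X \to Y}\; \mathcal{E}(f),
\qquad
\mathcal{E}(f) \;=\; \int_{X \times Y} \ell\big(f(x),\,y\big)\, d\rho(x,y)
\;=\; \mathbb{E}_{(x,y)\sim\rho}\!\left[\ell\big(f(x),y\big)\right].
\]
% (2) rho is fixed but unknown; all that is given is the training set,
%     n pairs drawn i.i.d. from rho (i.e., from the product measure rho^n):
\[
S \;=\; \big((x_1,y_1),\dots,(x_n,y_n)\big) \;\sim\; \rho^n .
\]
```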
So that's all there is. Another question, about rho(x, y): this is just the integral with respect to the distribution; if you're used to the notation p(x, y) dx dy, it's the same thing, and I'm going to comment in a minute on why I use this notation, which I'll keep throughout the course. And rho is fixed but unknown. Anything else? What is rho^n? It's the product measure: I have a probability distribution and I take a sample from the same distribution n times, so the distribution of the pairs is the same distribution to the square, to the third, to the fourth, and so on. The n is there because I am assuming everything to be independent. What do we want to do next? We keep this definition up there and just explode these few lines, discussing them for fifty minutes. So let's list the ingredients: the data space, the data distribution rho, the loss function, and the expected risk, plus, I guess, the idea that we have to settle for an approximate solution, a remark I'll come back to. These are the main ingredients, so let's zoom in and take a look. The data space is the product of two spaces, and notice it's very democratic: in our heads we think of X as the input and Y as the output, but so far I could switch their order; nothing distinguishes them yet. Let's make examples; for now they're just abstract spaces. What is the first example of an input space X that comes to mind when you say something? The real numbers, or R^d. But let's be fancier than real numbers and lists of numbers; let's give mathematical examples: functions. You can think of X as a space of functions from some domain to the reals; if nothing else, an image, for you, is a function from R^2 to R rather than a collection of pixels. Fair enough; there are actually a lot of cases where this makes sense. What else? Strings: the space of inputs can be a space of strings, say binary strings over {0, 1}. That one is really nice, and it starts to be a bit different. What if the space of inputs is a space of vectors where all the entries sum up to one, so that each input is effectively a probability distribution? The notation for this is the simplex: all vectors with entries that sum up to one, so that I can interpret the entries as probability masses. And you can go continuous: you can assume the inputs are actually distributions over something, which is a special case of functions, but functions with some extra properties. What if the inputs are graphs? Each input is a graph, and I have to distinguish between different kinds of graphs, the good graphs and the bad graphs. Again, I'm not going to give a ton of examples here, because this is not what the class is about, but you can imagine there are tons of situations that
fit any of these, and you can keep going: you can take spaces of matrices, where each input is a matrix. Now, in this mess of examples there is one distinction it actually makes sense to make: these examples and many more, versus those examples and many more, have one fundamental difference. What is it? Countability? Not really: some of these are uncountable too, and from that point of view this guy and this guy are essentially the same. Bounded? No, I can bound either or not. Measurable? That's complicated, and not it either. I'm looking for one very basic mathematical property that we either have or don't have. Yes: being a vector space. What do I mean by that? Not vectors as in arrays of numbers; we think in terms of linear spaces: if you take any two of these objects and sum them, you still get one of these objects, and if you multiply one by a number, you get one of these objects. That is the structure of a linear space, and it is the ingredient that makes your life nice: we know what linear spaces are, they're simple, they're nice, we're going to put more structure on them; vectors are what we know how to handle and what we think about 90% of our time. The other column doesn't have a linear structure. Linear structure really means that you can sum two things or multiply by a number: if you take any two probability distributions and sum them up in the trivial way, ignoring that they are probability distributions, you don't get a probability distribution; you'd have to renormalize, and of course you can do that, but then it is not just a linear space anymore, and the same goes for multiplying by a number. Take two graphs: you don't even know how to sum them; it's not an obvious thing to do. Same for the other examples. So to some extent there is this big distinction between spaces that are linear, where we already know a lot of things we can do, and spaces that are not. This is not the only property that is useful to have in practice, but it makes a big distinction. A student comments on having more properties; he is pointing out, in words we don't want to use just yet, that linear structure is one thing, but you can have other structure: for example, strings we don't know how to sum, but we have a bunch of ways to measure how distant two strings are. Having a distance is the very basic property called a metric; if you know nothing at all about the space, there's nothing you can do.
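As a toy check of the "no linear structure" remark for the simplex example, here is a short sketch in NumPy (the vectors are made up): the naive sum of two probability vectors leaves the simplex, and fixing that by renormalizing is exactly the non-linear operation alluded to above.

```python
import numpy as np

# Two points in the probability simplex: nonnegative entries summing to one.
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.6, 0.1, 0.3])

s = p + q                        # the naive vector-space sum
print(s, s.sum())                # [0.8 0.6 0.6] 2.0 -> not a probability vector

s_renorm = s / s.sum()           # renormalizing puts us back on the simplex...
print(s_renorm, s_renorm.sum())  # ...but renormalization is not a linear operation
```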
Let me make two comments before going back to that remark. First, I can make the same distinction for the Y's: give me examples of output spaces, and they are kind of the same list. Second, the funny thing is the naming: whatever is below, whatever is messier, is what we actually call a structured problem. Ironically enough, the spaces that lack the linear structure are what the literature typically calls "structured", structured input or structured output: structured prediction is typically about solving problems where either the input or the output does not have a linear structure, but does have some structure, because, as he was saying, you have to have something, for example a way to measure how distant two things are, a metric. So this is the first taxonomy in this world: we have an input space and an output space, and the typical distinction is whether we have a linear structure or not. You will see that once you have a linear structure you immediately know how to do a bunch of things; when you don't have it, you have to think more. The other interesting thing is that, while a different input space is not enough to give the problem a different name, a different output space, interestingly enough, is. So we can give names. If the output space is just R, what is the name of the learning problem? Regression. What if it is R^d? Multivariate regression. What if the output space is a space of functions? We call it functional regression, and you can keep going; we get different names. What about discrete outputs? If the output is just {0, 1}, or {-1, 1} in the usual convention, we call it binary classification. If the output is a list of numbers or symbols, say from 1 to T, we call it multi-category classification. And you can take subsets: binary is the case where each input can be of one of two kinds, multiclass is one out of T cases, and you can also assume each input belongs simultaneously to more than one class, for example an image that contains a cat and a dog, but not me, in which case I assign two labels to that image; that is the case where the output is basically a subset of the label set, the labels I have and the ones I don't, and you can keep going. We haven't learned much so far: just that there is this one distinction to keep in mind, that there are a bunch of different problems, and you see how general things can be. We will try to mark the point where X is just a vector space and where it's not, because the list is long: R^d, which is what we typically like to think about, is just one of many, many examples. And we discovered that, depending on the structure of the output, we typically give the learning problem a different name; it's still supervised learning, but with a different name. All right, so, ingredient number two: the distribution. The distribution is rho, and we need one basic assumption, which we almost always take for granted, but which at this level of full, crazy generality really is an assumption: that we can decompose the distribution into a marginal and a conditional distribution. It takes a while to find cases where this is not true, but we will always assume it is. This decomposition breaks the symmetry between x and y.
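In symbols, the assumed factorization (with rho_X denoting the marginal, the notation used in the rest of the lecture) reads:

```latex
% Decomposition of the data distribution into conditional and marginal:
\[
d\rho(x,y) \;=\; d\rho(y \mid x)\, d\rho_X(x),
\]
% rho_X      : the marginal distribution of the inputs,
% rho(.| x)  : the conditional distribution of the output given the input x.
```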
Now we are pointed in the direction of seeing x as the input and y as the output, because I am conditioning in this particular way, and the idea is that given any input I want to predict an output. But this also starts to tell me a bunch of things: the relation between input and output is intrinsically non-deterministic in general; given an input, I actually have a whole distribution of possible output values. You can interpret this in a bunch of ways. For example, you can think of it as some kind of noise: I would like to say what the output for this x is; I give you this image, and you would like to know whether it's a cat or a dog, but with a certain probability I'm going to say something different. Or I have a certain distribution of values, say for the temperature: I would like to say exactly what the temperature will be tomorrow, but there is some noise and I'm going to get some perturbation. That is the intuition behind this object, but a lot of other stuff can go in there too. You can have pictures of me and George, or of somebody who looks a bit more like me, and we look alike: so we have a whole situation with a distribution over things that are similar, and this is encoded in there; the fact that two people can look alike is encoded in there, and so is the amount of resolution with which you encode images, because if you blur an image enough you can confuse me with somebody else. It's all in there: we basically dump a lot of what we know about the problem into the assumption that the data are probabilistic. There are a couple of examples of conditional distributions you might want to keep in mind, because in that list of problems, real-valued regression and binary classification are the most studied ones, and for the rest we are still trying to figure things out. In these two cases there are classical special examples of the conditional. The first one is regression: you assume there is some special function, call it f*, from the input space to the output space, where the outputs are now the reals, and you are given a random variable epsilon which is Gaussian. Then you do the following: you sample an input point from some distribution (we'll discuss which in a minute), you sample one of these noise variables, you evaluate the function at that point, and you add the noise. All you are given in the end is the input and the noisy output, but this is one way to define a priori the distribution of the output. And what is that distribution? The noise is a centered Gaussian, and all I am doing is re-centering it: the conditional is still a Gaussian, but instead of mean zero it is now centered at f*(x_i), with noise level sigma. You can be fancier than that, but I'll just assume sigma is fixed; this is an illustrative example, and the point is not to be fancy. The point is that in this one example it really makes sense to think of rho(y | x) as describing the noise, because here it is exactly the noise. Ideally I would like the noise to be zero and have a deterministic relation, but I actually add some perturbation, and that changes my output. So this is the classical regression model, and this is why, in certain situations, you can think of the conditional distribution as the noise.
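A minimal sketch of this sampling model in NumPy; the particular f*, the noise level sigma, and the uniform input distribution are all made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    # The unknown "true" regression function (an arbitrary choice for the demo).
    return np.sin(2 * np.pi * x)

n, sigma = 20, 0.3                  # sample size and (fixed) noise level
x = rng.uniform(0.0, 1.0, size=n)   # x_i ~ rho_X   (here: uniform, just an example)
eps = rng.normal(0.0, sigma, n)     # eps_i ~ N(0, sigma^2)
y = f_star(x) + eps                 # y_i | x_i ~ N(f_star(x_i), sigma^2)

# The training set S: these n pairs are all the learner ever sees.
S = list(zip(x, y))
```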
Now consider classification; when I say classification without saying anything else, I always mean binary classification. Here the situation is different: it is not obvious at first sight what a good f* for classification would be; you might not even think in terms of an f, only in terms of a decision boundary or something like that. One thing to notice is that the conditional rho(y | x) is very simple here: in binary classification you only have two choices, so you can think of a full distribution, but it is just two point masses. You have rho(1 | x) and rho(-1 | x), and that's all: the probability of class 1 and the probability of class -1. How to think of this as measuring the noise of the problem is not completely obvious; in fact the standard way of thinking about noise doesn't quite hold anymore. Again: this is binary classification, the case where the output is either 1 or -1; you have images of cats and dogs, and these are the probability that an image is a cat and the probability that it is a dog. What I'm trying to say is: ponder for one second that viewing this as noise is not completely obvious. A mental picture you might want to keep, going from 1-d or 2-d intuition, is that there is a distribution for class 1 and a distribution for class -1, and something like how much these classes overlap can be seen as a notion of noise. Indeed this can be formalized, and when you try to make things quantitative, quantities like this start to play a role. We are not going to do it now, and we won't really need it, but it clearly tells you how much things get blurred out in the output space; in particular, there is a notion of noise, or more generally of uncertainty, attached to the problem. I'm going a bit fast because I'm reviewing things I assume you've seen; I'm just adding a few cherries here and there, no substantially new thing. It is slightly more interesting to think a bit about the marginal distribution. Why do I need to talk about the marginal distribution at all? If you look at a lot of statistics, you often see what is called the fixed design setting, where the inputs are assumed to be fixed once and for all. This reflects the fact that it is mathematically nice to study (if you want to prove stuff, it makes your life easier), but also the fact that, especially in low-dimensional scenarios, it was sometimes actually possible to pick the input points. Suppose you really have a one-dimensional problem like the one on the board: you might be able to pick the points as you like, for example regularly on a grid, equispaced, or you might decide there are certain regions of the space you want to sample more or less. But in many situations that is simply not up to you; the sampling is not something that you control.
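A matching sketch for binary classification: the label is a draw from the two point masses rho(+1 | x) and rho(-1 | x). The function eta(x) = P(y = +1 | x) below is hypothetical, chosen only so that the two classes overlap (the overlap being the classification analogue of "noise"):

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # Hypothetical conditional probability of class +1 given x.
    # Values strictly between 0 and 1 mean the two classes overlap.
    return 1.0 / (1.0 + np.exp(-4.0 * x))

n = 20
x = rng.normal(0.0, 1.0, size=n)                   # x_i ~ rho_X (example choice)
y = np.where(rng.uniform(size=n) < eta(x), 1, -1)  # y_i in {+1,-1}, from rho(y|x)
```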
For example, the samples can come with noise: I ask for the function here, and the point lands roughly here, or about here. And rather than noise, there are really constraints in the sampling process, under which certain inputs are more likely to be sampled than others, and there may be more noise in some regions than in others. You have to account for this, and one way to do it is to assume that you have a distribution over the inputs, and that your inputs are sampled according to this one distribution. Again, one could be more specific and try to exploit extra assumptions; for example, you could posit a concrete model where the sampling is something plus some noise. But this formulation is fairly general: like the conditional examples I erased, the marginal is more general than either of those specific models. So the marginal distribution typically accounts for our ability to sample the input space more or less well. Now, an interesting observation. Consider the case where X is R^d; as we said, we typically have in mind situations where d is very big, say d is 10^5. A recurring theme of this course will be that in such high dimensions, if you don't have much structure, it is unlikely that you will be able to get anything out of your data; if the data are really fully high-dimensional, in a lot of situations it will be impossible. But in most situations there is actually a lot of structure, and that structure is what allows you to get something useful out of your data. The marginal rho_X is the first place where we can take a peek at this. What's the idea? Assume first that the input is one-dimensional, and let's do the regression example, but suppose I have bad luck and I am not able to sample everywhere equally well: this is x, this is y, this is the true function, and say this is my distribution over the x's, meaning I am more likely to sample points in the middle than points in the outskirts. I'm not sure the example makes too much sense, but notice what it implies: the samples are going to look lopsided. Suppose instead the distribution is uniform: with ten points you get, roughly speaking, a sense of the whole curve. If you sample from the peaked distribution, how does the sample look? You get a bunch of points here and almost nothing out there; you might have just one point here and nothing there, or maybe one point and no sense of the bump, so you would never imagine the bump going up, because in some sense this distribution doesn't suggest there is much going on out there. This is just to point out that even with no noise at all in the output, unevenness in the input is an annoying thing you have to worry about; especially because this is a very simplistic one-dimensional picture, and you are actually going to look at situations where d is 10^5.
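To make the picture concrete, a small sketch with a made-up function and two made-up marginals: under the peaked marginal, the samples cluster in the middle and the "bump" is likely to go unseen.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # True function with a "bump" away from the center (made up for the demo).
    return np.exp(-((x - 2.0) ** 2)) + 0.2 * x

n = 10
x_uniform = rng.uniform(-3, 3, size=n)    # uniform marginal: covers the range
x_peaked = rng.normal(0.0, 0.5, size=n)   # peaked marginal: mass near the middle

print(np.sort(x_uniform))  # points spread out; the bump near x=2 gets sampled
print(np.sort(x_peaked))   # points cluster near 0; the bump is likely missed
```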
The other thing that might happen shows up already in two dimensions. Let me forget about the output and just look at a two-dimensional input. How would a distribution look? Who knows, it's complicated, but let's make one observation. Take pictures, say pictures of my face, and suppose you plot them in the high-dimensional space. How is that going to look? A mess. But take any image: say I take pictures at 20 by 20 pixels, very small, so each image is a 400-dimensional vector. Now take an arbitrary 400-dimensional vector: is it going to be a face of mine? Of course not. So if you go into this 400-dimensional space, my faces occupy some region of the space, not the whole thing. And this region has structure, because whatever you find here, you will find something very similar nearby: these are just frontal faces with no background, there is a lot of symmetry in my face, and this suggests that my faces have a kind of lower-dimensional structure; not every coordinate is independent of every other. The simplest way to abstract this is to say: this ambient space is 400-dimensional, and my faces have to lie in some sub-region. If that sub-region had everything independent, it would be a blob inside the bigger world; but if things are correlated enough, dependent enough, then whenever you know what's happening in one coordinate you know what's happening in another, which means you have one dimension less, and you can keep killing dimensions by assuming more dependence. So instead of a blob I actually get a curve: I go down in dimension because things are dependent on each other, and I am left with some complicated curve. What is the marginal distribution then? A distribution over that curve. The curve, to catch up with my little drawing, lives in the input space, which is R^2 here but which you should think of as 10^5-dimensional instead. The set where the distribution lives is what you call the support of the distribution: it is contained in R^d, but it is a smaller set. So the first thing to keep in mind is that when you write rho_X, you are not writing something that has to fill the full-dimensional space; rather, it lives there, but it might occupy just a smaller region, and the simplest mathematical objects describing such a region are a surface, or a manifold if you want to be more general. So rho_X is where the structure of the input is somewhat encoded.
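A toy version of this picture, with made-up ingredients: a one-dimensional curve embedded linearly in a 400-dimensional ambient space. Each sample has 400 coordinates, yet the data fill only a tiny, low-dimensional sub-region of R^400:

```python
import numpy as np

rng = np.random.default_rng(2)

d = 400                               # ambient dimension (e.g. 20x20 images)
A = rng.normal(size=(d, 2))           # a fixed linear embedding R^2 -> R^400

n = 1000
t = rng.uniform(0, 2 * np.pi, n)      # one intrinsic degree of freedom
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
X = circle @ A.T                      # n points in R^400, all on a 1-d curve

# Despite 400 coordinates per point, the data matrix has (numerical) rank 2:
print(X.shape, np.linalg.matrix_rank(X))   # (1000, 400) 2
```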
And the fact that we use the notation d rho(x, y) rather than p(x, y) dx dy, the notation we will use throughout, is because of exactly this kind of situation. When the distribution sees the whole ambient space, it is an innocuous change of notation; but when you are in this situation, for example data supported on a circle while you look for a density over R^2, it requires some thought, because remember that the area of a curve in the plane is zero, so the density, the local volume along this curve, doesn't really make sense as a density in R^2. We don't want to go into the details, but if you play this game where things live in a high-dimensional space while lying on a lower-dimensional set, the notation you are used to, the one with densities, requires some thought; so rather than entering into those details, we use this notation (sorry, there should be a d there). When I write that integral, instead of writing p(x, y) dx dy, I write d rho(x, y), so that you keep in the back of your head that things may have a lower-dimensional structure: this notation is correct regardless, whereas the density notation may or may not be, and you have to pay attention. It's just a reminder, and it's not too different, so you can deal with it. A question: which one is the support? This set is the support of the marginal, and the marginal tells you which regions have more mass than others. Here the support is this curve, literally just the curve, and the marginal tells you on which regions of the curve you have to put more mass: if I venture into drawing it, it would be something saying that this part of the curve has more mass than the rest. This is all in the input space; there is no output in this game. Makes sense? Okay. So, we are going slowly here: we introduced the input space, the output space, and the probability distribution; we discovered that we can split the distribution into a conditional and a marginal; there are a bunch of cases to think about for the output, classification and regression being the two main examples; and for the input, the marginal distribution is a reminder of the potential structure in these high-dimensional spaces.
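For the record, the measure-theoretic point behind the notation, stated once in symbols (a standard fact, not something specific to this course):

```latex
% If rho_X is supported on a set M in R^d of Lebesgue measure zero
% (a curve, a surface, ...), no density p can satisfy
%     d rho_X(x) = p(x) dx,
% since that would give rho_X(M) = \int_M p(x)\,dx = 0 while rho_X(M) = 1.
% Integrals written against the measure itself,
\[
\int_{X \times Y} \ell\big(f(x),y\big)\, d\rho(x,y),
\]
% remain correct whether or not a density exists.
```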
Any questions about this stuff so far? A student asks: at what point do you draw the line between a region of just low probability and a region kept out of the support altogether? So, here I am saying you have a lower-dimensional thing, there is a support, and it is not the whole space; in practice you are not going to have this cleanly. Even my faces are not exactly such a set: if there is some noise, if I perturb pixels, and pixel noise is pretty wild, my faces end up all over the place; they will be mostly concentrated around something small, but there will be stuff everywhere. The question takes this a step further and asks: where do you say what the actual support is and what is the rest? And I don't know. The point of this discussion is just to make you think about the fact that this is the most simplistic situation: I am telling you to suppose there is a clear cut, between what is my face and what is not, which typically there is not. In practice this is a vast simplification, and you would have to be fancier: you might take, say, a distribution which is a convolution, the sum of two random variables, one being my face and one being noise, and then figure out what's going on. At this level of generality, I am just pointing in the direction of this kind of structure; actually answering the question of what it means to be exactly in the support or not is basically the problem called anomaly detection, or support estimation, which is a learning problem in its own right, so I won't go on discussing it. This was not meant to answer a fancy question like yours, just to make you think, if you haven't before (which you probably have), about the fact that rho_X lives in a large space but may be concentrated on a portion of it. Other questions? What we want to do next is get closer to the actual problem, and the main point is this. Everything so far is probabilistic: we deliberately allowed ourselves a probabilistic framework to encompass uncertainty, sampling, noise, similarity, quantization, and whatnot. Yet from line one, our declared goal is not to estimate the probability distribution, but just to be able to make a statement about an input: we want a functional relationship, so that given an input we spit out one answer. But by what is written up there, there is no such thing as one answer: there are many answers, some more likely than others. We are going to make mistakes; we have to make mistakes, because we stick to a deterministic, fixed functional relation, whereas the output is actually probabilistic. That is the reason we need the second ingredient, the loss function. The loss function is the simplest thing we can do: it says, if you tell me f(x), and then I reveal to you y, the loss function tells you the price you pay for saying f(x) when you should have said y. It is a pointwise measure of the error: it doesn't tell you what is going on on average; it tells you, for every single decision you make, the price you are going to pay. Let's make examples. The standard case, again, is regression, where loss functions typically have a special form. They are typically symmetric (it doesn't matter whether you put f(x) before or after y), so let me just use the notation (y, y'), remembering that the loss function is a function of any two outputs; I write it like this because I don't want to keep writing f(x) for ten minutes. And in regression, it turns out, loss functions are typically not arbitrary functions of the two outputs: they depend on the difference y
minus y'. Examples? The most famous one is the square: take (y - y')^2. It's not obvious why you want the square, right? There is no particular reason to prefer one side or the other, and actually, if you ask anybody for a first example, the square is not the first answer: the first answer is to take the absolute value, |y - y'|. Then you discover that the absolute value is annoying because it has a kink, and you have to worry a bit about it; the square becomes the first choice once you realize that whenever you have to take a derivative, it's good to have a square. These are the two main examples: the square loss (the least squares loss) and the absolute value loss. There is another loss, which I don't know is used all that much, but which is often mentioned in statistical learning courses: the epsilon-insensitive loss. Let's draw it, because then it's obvious why it's called that. On the axis of y - y', the square loss is the parabola and the absolute value is the straight vee; the epsilon-insensitive loss looks like the absolute value, but not around the origin: it is flat on the interval from -epsilon to epsilon, and then it goes up. It is the maximum of |y - y'| - epsilon and zero. Unlike the other losses, this one depends on a parameter, which really opens up the question of how you pick that parameter; the basic intuition is that when errors are really small I want to be robust, because I don't know what's going on, so I set them all to zero, and only when errors are big enough do I start to care. In that sense these two losses, the absolute value and the epsilon-insensitive, are more robust than the square: the square starts to pay a lot quite soon and keeps growing, while these two try to pay as little as possible for as long as they can. And clearly the exact slope doesn't really matter; it's the shape that makes the difference here.
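The three regression losses in code, each as a function of the pair (y, y'); the value of epsilon is arbitrary:

```python
import numpy as np

def square_loss(y, y_pred):
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    return np.abs(y - y_pred)

def eps_insensitive_loss(y, y_pred, eps=0.1):
    # Flat (zero) inside the tube |y - y_pred| <= eps, linear outside.
    return np.maximum(np.abs(y - y_pred) - eps, 0.0)

r = np.linspace(-2, 2, 5)          # a few residuals y - y_pred
print(square_loss(r, 0), absolute_loss(r, 0), eps_insensitive_loss(r, 0))
```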
All right, classification. In classification you can also write the loss as a function of one variable, but the relevant quantity is not the difference: the difference is not that meaningful here; it is the product y y'. Why? Take two labels: if the product is positive, they are either both positive or both negative, the same class; when the product is negative, they disagree. So classification losses are functions of y y', and people often put a minus sign in front, just by convention. Now, what is the most common example, the obvious way to measure error in classification? (Someone suggests the hinge loss; no, not yet, but hold that thought.) If we had to design a system this afternoon, we would decide: if you get it right you count zero, and if you make a mistake you count one. Why one? You could pick any real number, but one is good enough. This loss looks exactly like a step: you can write it as theta(-y y'), where theta, the so-called Heaviside step function, is defined to be one when its argument is positive and zero otherwise (I put the minus there because of the way theta is defined), or, if you prefer, you can write it as the indicator of the event that y is different from y'. To some extent this is the most natural choice of loss for classification: pay nothing when correct, pay something, say one, when wrong; clearly it doesn't matter whether you charge fifteen or three instead. From a mathematical point of view, though: we saw that the square loss is nice because it is continuous and differentiable, so you can take derivatives easily; the absolute value is continuous but has a kink in the middle; this one is worse, because it doesn't just have kinks, it has a jump: it is not even continuous, let alone differentiable. And there is more, which we haven't mentioned yet: convexity. Those other losses are convex; this one is not. Non-convex stuff is fashionable these days, but you still might want to stay away from it whenever you can, because it is immediately a mess, and this one is definitely a mess: it is a step function; try to differentiate it and set the gradient to zero, and anything like finding a minimizer typically drops NP-hardness on your head, because the problem becomes combinatorial immediately; you have to try all possibilities. So this loss is beautiful, and from the information-theoretic, learning point of view it is the right choice, but it is also the one you want to stay away from as much as you can. What you typically do is try to keep everything you like and kill the one thing you don't like, the lack of convexity. Our friend earlier actually suggested the hinge loss, which is one such choice: you say, I would like to be zero out there, but I also try to make the thing convex, or at least convex-ish; I want to be flat, I don't want to bend too much, but at some point I have to start paying on the positive side. Here you typically decide to anchor things at one: past one you don't pay; below one you already start to pay, even on the correct side, because you want to pull the thing up; and from there you go up linearly. This is the hinge loss, |1 - y y'|_+, which simply means the maximum of 1 - y y' and zero. So you look at the product, but instead of comparing it to zero you compare it to one: smaller than one, you start paying; bigger than one, you don't. It is a bit of what people in this field like to call a margin: there is a region around zero, in spirit similar to the epsilon-insensitive idea, where you refuse to be satisfied and you already start charging. Now, this loss is convex, but it is not differentiable. Suppose you want to make it differentiable: how would you change it? A couple of things you can do: one is to take the kink and massage it; or you can give up on the loss being exactly zero anywhere and get something that looks similar, and this is what people call the logistic loss, log(1 + e^(-y y')). It turns out this expression looks exactly like that shape: the log and the exponential somewhat balance each other, so there is no super-slow or super-fast behavior; you get something basically linear on one side, something essentially flat on the other, and a smooth transition around zero. So: the hinge loss and the logistic regression loss. And I would like to put the square loss back into the game; it's my favorite loss. First, the usual check whether somebody has seen it before: do you see why, in classification (not in general), the square loss can also be written as a function of the product rather than the difference? It becomes (1 - y y')^2; it's the same thing. While you think about why that is true, let me draw it: it pins things at one again, and the plot looks a bit funny: it is a parabola that touches zero at one and then goes up again on the right. From the discussion we have been having, the left branch kind of makes sense, although it is a bit steep, you start to pay a lot quite soon; but the right branch looks weird, because it looks like you are paying even when you are correct. This picture is somewhat misleading, though, and there are many other reasons to see that this is actually a fine loss.
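And the four classification losses just discussed, each written as a function of the product m = y y' (the "margin"):

```python
import numpy as np

def zero_one_loss(m):
    # theta(-m): pay 1 on a mistake (m < 0), 0 otherwise.
    return (m < 0).astype(float)

def hinge_loss(m):
    # |1 - m|_+ : zero past margin 1, linear below it.
    return np.maximum(1.0 - m, 0.0)

def logistic_loss(m):
    # Smooth, convex, never exactly zero.
    return np.log(1.0 + np.exp(-m))

def square_loss_classification(m):
    # (1 - m)^2, valid when the labels are +/-1 (see the identity below).
    return (1.0 - m) ** 2

m = np.linspace(-2, 2, 5)  # a few values of the margin y * y_prime
for loss in (zero_one_loss, hinge_loss, logistic_loss, square_loss_classification):
    print(loss.__name__, loss(m))
```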
So keep that in mind: this is my favorite loss — this is very important to everybody — and there are other reasons, which will hopefully convince us, that it is a fine loss for classification; it is shooting for more than just the correct classification.

Do you see why I can do this rewriting? Here y is just plus or minus one, so I can factorize it out. There are a million ways of doing it, but one way is to take y and factor it out of the difference: y² is one, because y is 1 or −1; you get a 1/y in the process, but 1/y and y are the same thing, again because y is just plus or minus one. My Italian proof — but since it is one line, here it is: (y − y')² = (y·(1 − y·y'))² = y²·(1 − y·y')² = (1 − y·y')².

So these are the most standard examples, and they are actually quite useful — because once you get out of this world, out of regression and classification, remember that I listed three gazillion examples of input and output spaces, and good luck writing down the standard loss function for any of those. That is actually a matter of research. If I give you an output space which is a space of functions, well, you can generalize some of these, but it is not obvious; and if you go to, say, multi-class classification, multi-label classification, or structured prediction of various kinds, a big part of the business is exactly which loss function you want to minimize. So while the loss function is the key ingredient — it is the one ingredient we assume to be given, the one that defines the problem — the exact nature of the right loss function is something debatable. Here I just showed you some examples; I am not even telling you which one you should use, which is already a hard question, and I am telling you that if the output is more complicated, I am not giving you examples at all. All of this is to suggest that the matter becomes complicated when you go to more complicated domains.

[A question from the audience: should there be a sign in the definition — should I put the sign of the function there or not?] There is a little subtlety here, and I am not sure what the best way to resolve it is, but let's discuss it for three seconds. The way I defined it, this is a loss between y and y', both living in the output space: if this is regression, two real numbers; if it is binary classification, two binary values — 0/1 or, here, −1/+1. So I set things up so that both arguments are binary. He is asking because, in practice, one of these two arguments is going to be a function, and your functions are typically real-valued even in classification — you are not able to minimize over boolean functions; we don't handle boolean functions well. So typically one of the two arguments is going to be a real number, and the other is going to be a binary label, and we will discuss in a minute why you typically take the sign of the real-valued function. In regression you just take both to be real numbers and you are good. In classification, if one argument is really f(x), then we have to discuss a bit what that means, and typically, in practice, you take the sign — but the product form still works fine. That is why we write the whole real line: we already had in mind the situation where one of the two arguments can span the whole real line. These are somewhat important subtleties, but I wouldn't indulge too much in them right now. Does it make sense? Good.
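A tiny illustration of that last point, on a toy setup of my own (the linear score and the numbers are made up): the margin losses are evaluated on the real-valued score, while the actual prediction is its sign.

```python
import numpy as np

w = np.array([0.5, -1.0])    # hypothetical linear model
x = np.array([2.0, 0.5])     # one input point
y = 1                        # true label in {-1, +1}

score = w @ x                # real-valued, spans the whole real line: 0.5 here
margin = y * score           # the quantity the classification losses look at
prediction = np.sign(score)  # the actual binary prediction: +1, correct here

# hinge still pays max(1 - 0.5, 0) = 0.5 even though the prediction is correct
```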
[Another question: so you would have to change the problem right at line one? Right now we are just doing this — what he is asking is whether one can put in the assumption that things are probabilistic.] Yes — in some sense you would have to take a generative point of view. The idea is that what you have is now not just a function, but something random — a draw depending on some parameter, or a stochastic process — and then you would have to define the problem so that it accounts for that: how do you want to treat the probabilistic aspect of your function? Typically, from a Bayesian point of view, one can do that: you assume that the function f is itself a random variable, and then an extra expectation can be put in — or some fancier notion that controls the probability. You can do that; we are not going to do it, at all, ever, just because it makes the material more complicated.

All right, so now we can keep going, because we are almost done. Now that the problem is set up, we can define one basic object: this thing called the expected risk, or expected loss, or expected error, which is this quantity here. Literally, what it says is: I have this measure of error that I picked — one of those losses — and what I want to do is sum the errors over all possible pairs, but as a weighted sum, where pairs that are more likely to show up count more. That is the usual meaning of an expectation. From a semantic perspective, you can think of this as encoding, in theory, both the past and the future — what I have today and what I don't have yet; "the full population" is one phrase that is often used for it. Where is this defined? Again, you need to make sure that this quantity is well defined, and typically we are going to define it over all the functions from X to Y — or let me write ℝ — that are measurable. Whenever you see measurability you might want to go back and look it up, but it is basically the one property that makes sense of that integral — that ensures the integral is well defined.

This quantity is also sometimes simply called the risk. The reason is that in classification, when you assume the output is just plus one or minus one, you can check that E(f) is also the probability of the pairs (x, y) such that f(x) is different from y. So if you take classification, and L is the misclassification loss — the first one in my list — then you can read this quantity as the probability of the two labels not being equal, or the probability of a mistake if you want. So it is the risk of making a mistake, and that is why people also call this thing the risk. Expected loss, expected risk, expected error: different names for the same thing.

All right, what do we want to do? We want to minimize it. But before discussing for a minute what we mean by finding a good solution, let's observe one thing. If you loosen things up for a minute and assume that you have knowledge of this distribution, you can actually get a little more insight into the nature of the loss function and how it impacts the definition of the problem. In fact, if you have to solve this minimization, I can write it like this: you have a function whose input is itself a function — that is what is called a functional. How many of you know how to minimize a functional? Yeah — that is the usual percentage, like three. Why? Because these are genuinely more complicated objects.
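For reference — the board formulas do not survive in the captions — here is the quantity just defined, written out; this is my transcription, with ρ denoting the data distribution on X × Y:

```latex
\mathcal{E}(f) \;=\; \int_{X \times Y} L\big(y, f(x)\big)\, d\rho(x, y),
\qquad f : X \to \mathbb{R} \ \text{measurable},
```

and, with the misclassification loss and y ∈ {−1, +1},

```latex
\mathcal{E}(f) \;=\; \rho\big(\{(x,y) : f(x) \neq y\}\big) \;=\; \mathbb{P}\big(f(x) \neq y\big).
```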
They are functions: you have to put a bit more structure here and think for a second about what is going on. We are going to discuss this throughout the course, because basically half of this course is about minimizing functionals — we have error functionals, we want to find functions, and so we have to learn how to minimize them. The good news for you guys is that typically what we are going to do is replace the function with a vector. So now I can ask: how many of you know how to minimize a function of d variables? And now everybody raises a hand — that is what you studied, right? You have a function of d variables and it is fine. We are going to try to get back to that setting, because that is what I assume everybody knows (unless the lack of raised hands is real, in which case, go away).

But there is one thing — one trick that we are not going to use for computation, but only for theoretical reasons. It is the following observation. It turns out that you can write this integral in two steps: you first integrate with respect to y given x, and then with respect to x. (Here you see one of my writing skills, which is to squeeze things in an unreadable way: I just split the integral — this part is the loss, and this part is the conditional distribution.) Of course you can always do this. Now, one thing that can be shown is this: if I look not at the whole thing but just at the integrand, then f(x) in there is not a function anymore, because I fixed x — when I look inside the integral, x is fixed. Call this value a. Then what I have is some quantity, depending on x, that is just a function of one real variable. Makes sense? I took the integral, I split it in two, and I looked at the integrand: inside, it is just a function of one variable. It is what is sometimes called the inner risk, or inner error. Notice that, in general, the fact that minimizing over f is equivalent to minimizing over a, pointwise, is not obvious at all — but it turns out that under very general assumptions one can do that, mathematically. So you can turn the study of this functional, which is complicated, into the study of a real-valued function, which I assume is something you know how to handle.

I am not saying this because we are going to use it to design an algorithm, but because now we can ask the following question. Fix a loss function; I do not know the exact shape of the distribution, but assume it to be known, so that I can handle it as if it were known. Can I derive the form of what you might want to call the minimizer — not only the min, but the argmin? If you minimize this, there is some function — let me call it f*, the best possible function; it might be one, it might be many — and I want to know how it looks. This point of view allows us to reduce the question to somewhat simpler terms, because now we just have to minimize a real-valued function. What do we achieve with this? Well, whatever we find is going to depend on the loss and on the distribution, so it is going to be as useless, computationally, as the original problem — we do not have the distribution, and it is going to be hard to compute — but at least we will be able to see whether choosing one loss function or another makes any difference, because the answer will depend on the loss function.
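Written out (again my transcription of the board, in the same notation as above), the splitting trick and the pointwise minimization it licenses are:

```latex
\mathcal{E}(f)
= \int_X \Big( \underbrace{\int_Y L\big(y, f(x)\big)\, d\rho(y \mid x)}_{\text{inner risk: set } a = f(x),\ x \text{ fixed}} \Big)\, d\rho(x),
\qquad
f_*(x) \in \operatorname{argmin}_{a \in \mathbb{R}} \int_Y L(y, a)\, d\rho(y \mid x).
```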
For example, we can ask: what is the target function — this is the name I want to give to that f* — if we take the square loss? Take L to be just the square in the definition over there; this is regression. You try to minimize this: the inner integral, with respect to ρ(y|x), is of (y − a)², because, as I said, I replace the second argument by a generic number a. Then, since there is a square in there, I just differentiate with respect to a and set the derivative equal to zero. Again, what I am doing: I want to look at the minimizer of the functional; that is complicated, so I use the property that tells me that, under weak assumptions, I can look at the minimum of the integrand instead — and now I can just take derivatives. So I take the square loss, I plug it in, it is just a function of one variable, and I minimize it. What do you get? If you do a bit of calculation — this is an easy case: differentiable, smooth, convex, whatever — you find that the optimal a, which is nothing but f*(x) if you want, is just the conditional mean, ∫ y dρ(y|x). You can almost do it with your eyes. This is what is called the regression function: the average of all the outputs. So, looking at the plot: for a given x, I have not one value but a whole distribution of possible values — and I draw the Gaussian thing, because that is the only one I can draw — and the best thing to do, if you give me the distribution (which in practice you do not have), is to pick the average of all those values. That is the ideal solution; that is what least squares is doing for us.

What if — and it does make sense to cheat here; this is just what you might call an information-theoretic, or decision-theoretic, point of view, where I assume I know the distribution and just want a quick look at the effect of my loss function — what if I were to take not the square loss but the absolute value? Would I get the same thing? No. In fact you do not, and — I guess this might actually be a piece of a problem set — what you get is the median of the conditional distribution of the output: the one point that splits the distribution into two parts of equal mass. And you can keep going and check what happens for the other losses.

This is the case of regression, and the take-home message is this: if you look at the minimizer of the true risk — the one you would call the target function, because that is your target — you find that, depending on the loss function you choose, you get a different solution. By choosing a loss function, you are actually choosing a different problem. And that is fine, because defining the problem was exactly the job of the loss, one of our two ingredients.
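A minimal numerical check of these two facts — a toy example of my own, not something from the lecture: for a fixed x, minimizing the average square loss over a constant a recovers the (sample) mean, while minimizing the average absolute loss recovers the median.

```python
import numpy as np

rng = np.random.default_rng(0)
ys = rng.normal(loc=2.0, scale=1.0, size=10_000)  # stand-in for rho(y|x) at one fixed x

a_grid = np.linspace(-1.0, 5.0, 2001)
square_risk = [np.mean((ys - a) ** 2) for a in a_grid]
abs_risk = [np.mean(np.abs(ys - a)) for a in a_grid]

print(a_grid[np.argmin(square_risk)], ys.mean())   # both close to 2.0 (the mean)
print(a_grid[np.argmin(abs_risk)], np.median(ys))  # both close to the median
```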
What about classification? Suppose I do the same thing, but assume that the output y is just plus or minus one. Then this expression becomes much simpler, because the inner integral is now a sum over all possible y's, weighted by their probabilities — and how many y's do I have? Two: 1 and −1. So I can write the relevant quantity as ρ(1|x) − ρ(−1|x). If you want, you can also use the fact that these two probabilities sum to one and express one in terms of the other, but that does not really matter; let's just look at this.

How is this related to classification — does it make sense? The interpretation is that for each x I get a whole distribution of values and I take the mean of that distribution. (Again, do not be misled: I always draw Gaussians because I have a small brain and it is complicated to think about anything else.) But for classification, that is not, on the face of it, a useful interpretation — why would this difference of probabilities make sense? Well, somebody was saying it before: in practice, when you solve a problem where you would want to predict 1 or −1 but your function actually predicts real values, what do you do? You take the sign. Why is that a good idea — just because somebody told you? Actually, if you take this point of view, what is the sign of this difference? It tells you which of the two probabilities is bigger. So it makes a lot of sense: you are basically saying that, if you could solve the problem exactly, you would find the difference between the two conditional probabilities, check which one is bigger, and go for that class.

Indeed, if you do exactly this calculation, but with the misclassification loss instead of the square loss — that crazy, non-convex loss — you find that the target is not the difference of the probabilities, but exactly its sign. So you see what I did by choosing the square loss: it is a relaxation. I wanted binary functions, but I allowed myself to look for real-valued functions; I do not get a boolean function, but I get something closely related to the boolean function I should be shooting for.

You can play the same game for each of these losses — yes, in the problem set — and it turns out that the logistic loss is, in some sense, estimating the conditional distributions of the two classes, while the hinge loss does some kind of magic, which is one of the reasons it is actually a good loss for classification. Notice that the calculation above already tells you that the square loss is a perfectly fine loss for classification: it shoots for the conditional difference, and you take the sign — why not? You would have to give me a reason why not, because so far we only have a misleading plot; one just feels that the hinge should be better. It does feel better, intuitively, and here is what you can check: if you do the calculation for the hinge loss, you do not get the difference of the conditional distributions — you really just get the sign. The hinge loss returns a boolean function: it has the same target as the misclassification loss. So even if you relax the problem and go from boolean functions to real-valued functions, this particular loss function does not really care, and ideally spits out a boolean function. In some sense, this shows that it is a tight relaxation of the problem — a much closer relaxation of the problem. It is 2:27 now, so this might not be the friendliest piece of information I have given you today, but it is actually kind of interesting, and there is a problem set to help you appreciate it.
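To summarize the targets that come out of these calculations (the first three were derived or assigned above; the logistic entry is the standard log-odds result, stated here without derivation):

```latex
\begin{aligned}
\text{square loss (regression):} \quad & f_*(x) = \textstyle\int_Y y \, d\rho(y \mid x) \quad \text{(conditional mean)}\\
\text{absolute loss:} \quad & f_*(x) = \operatorname{median}\big(\rho(\cdot \mid x)\big)\\
\text{square loss } (y = \pm 1)\text{:} \quad & f_*(x) = \rho(1 \mid x) - \rho(-1 \mid x)\\
\text{logistic loss:} \quad & f_*(x) = \log \frac{\rho(1 \mid x)}{\rho(-1 \mid x)}\\
\text{misclassification and hinge:} \quad & f_*(x) = \operatorname{sign}\big(\rho(1 \mid x) - \rho(-1 \mid x)\big)
\end{aligned}
```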
So let me use the last three minutes — no, two, because we started late and I think it is illegal to use them all — to make two comments. First, appreciate how I said absolutely nothing about algorithms; that should be refreshing if you are used to an algorithmic course. Second, I left out the question: which loss function should you use for classification? For regression, changing the loss changes the problem; for classification, they all seem to shoot for the same target. So you might want to think about that question on your way back home tonight. What we are going to do next is basically set up the stage for how we build algorithms, and the first thing will be to handle classes of functions — the class of all measurable functions will not be particularly useful for any practical purpose. Today, the whole goal was just to unpack this problem into actually meaningful objects; next, we will start to see how you would solve it, and what is needed to solve it. All right, I would say we are done — if you have questions, I will take them offline.
Info
Channel: MITCBMM
Views: 6,028
Rating: 4.9384613 out of 5
Keywords: CBMM, Center for Brains Minds and Machines, Artificial Intelligence
Id: SFxypsvhhMQ
Length: 85min 4sec (5104 seconds)
Published: Wed Sep 13 2017