Class 03 - Reproducing Kernel Hilbert Spaces

Video Statistics and Information

Captions
This is the summary of where we were at the end of last class. Spot the difference: between last class and today there is one change, a Y instead of an R (or an R instead of a Y); I am going to comment on it in a minute. I never know which way is better, and now that we know more, it is worth discussing.

So this was the beginning, and the end of last class was essentially an explosion of every symbol into some meaning: discussing at length the input space, the output space Y, the probability distribution rho, and why we take this decision-theoretic point of view, where you do not consider the problem of estimating a probability distribution but only a function. Then the necessity of introducing a measure of error and defining the problem: finding a function which is good on future data, a concept captured by the idea that ideally we want to make the expectation of the loss with respect to the underlying distribution small. And then the crucial point that makes the use of the word "learning" right: we do not actually have access to the distribution, but only to a bunch of points. We assume we know what the input and output spaces are; rho is unknown; the loss ell, on the other hand, is known.

We made a few examples and tried to get a grasp of the meaning of the loss function and how to compare different loss functions. As your colleague immediately noticed, last time I introduced the loss as a map from Y x Y into the nonnegative reals, and typically that is not quite the setting; it depends a bit on the story. The prototype example is regression: you typically look for real-valued functions, and the data are also real-valued, but maybe not the whole real line, so Y is just a subset of R. Then you have binary classification, where Y is {+1, -1}. Here we have to decide: if you take a purely decision-theoretic point of view, you can say you are looking for boolean functions, and consider a loss that takes only two values in each entry. But most loss functions used in practice do not do that: they have a disparity between the first entry and the second. The first entry is just plus or minus one (or 0/1), but the second entry is a real number.

If you go to more exotic situations, say multi-class classification or structured prediction, this becomes a more complicated business: the design of loss functions beyond regression and classification is an active research topic, and not something where we have completely clear ideas. Oftentimes you still have to put two spaces into the game: an output space which is structured, lacking linear structure for example, and some other, nicer space that you use to estimate functions. Later in the course we will see examples. For this class I decided to go with real-valued functions, because from now on that is what we are going to do: either a regression problem or a classification problem where, in any case, we look for real-valued functions.
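As an aside (not from the lecture): a minimal numpy sketch of the losses just mentioned, showing the asymmetry between the two entries, the first a label in {-1, +1}, the second an arbitrary real prediction.

```python
import numpy as np

# Standard losses ell(y, f(x)) for y in {-1, +1} and real-valued f(x).
def square_loss(y, fx):    # usable for regression and classification alike
    return (y - fx) ** 2

def hinge_loss(y, fx):     # margin loss behind support vector machines
    return np.maximum(0.0, 1.0 - y * fx)

def logistic_loss(y, fx):  # loss behind logistic regression
    return np.log(1.0 + np.exp(-y * fx))

def zero_one_loss(y, fx):  # the purely decision-theoretic 0-1 loss
    return (np.sign(fx) != y).astype(float)
```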
So from now on (and I will probably repeat this line many more times) we go with this definition; it does not change anything substantial.

One piece of information I gave you last time is that you can get a grasp of the difference between loss functions by massaging this a little and deriving the actual minimizer, what we call the target function. The target function is the solution of the ideal problem: minimize the expected risk over calligraphic F, all possible functions for which the expectation is well-defined. Neither knowledge of rho nor the ability to handle this largest space of all possible functions is anything practical, but it still makes sense, from a theoretical point of view, to derive an analytic expression, look at it, and see what it is.

We did this by introducing one trick, which is described in detail in the notes and is part of the pset, where you are going to meet this idea again. The idea is to first use the fact that the distribution can always be decomposed into the marginal and the conditional, and then reduce the problem of computing the minimizer, which is a function, to finding the value of the function at every possible point: we fix one point, but it is an arbitrary point. The basic idea was to consider what we call the inner risk, which turns the problem of finding a whole function into the problem of finding the value of f at a given point; that value is just a real number, so the problem becomes easier. We are not going to repeat the derivation: it is just a trick to do an analytic computation, and it is possible because of a fundamental result that we also do not want to go into; that is not the interesting part of the discussion. The interesting part is what you get out of using the trick.

If you remember, we came to realize that by choosing the loss function you actually change the target of the learning problem. For the square loss, the target function is the conditional mean of the y's at every point; for the absolute value loss, the conditional median; and you can keep going. In regression this means we literally change the target: we consider different functions. For classification the story is slightly different. Long story short, the different losses are different flavors of the same move: shooting for a loss function which is convex and yet yields a decent approximation of the Bayes rule, which just says: if you are doing classification, look at the conditional probabilities P(1|x) and P(-1|x) and go for whichever label has higher probability; if P(1|x) is bigger than P(-1|x), go for that. We are going to check that the square loss, the logistic loss and the hinge loss do exactly that: they provide target functions that mimic, to some extent, the Bayes rule. I do not want to discuss it too much here, because this is something you are going to bang your head on during the problem sets, but it is the first insight into what the role of loss functions is and how you choose them.
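As an aside, a small numerical check of the claim above (hypothetical data: a skewed conditional distribution at one fixed x): minimizing the inner risk over a grid of real values recovers the conditional mean for the square loss and the conditional median for the absolute loss.

```python
import numpy as np

rng = np.random.default_rng(0)
# Samples of y | x for one fixed x; exponential, so mean != median.
y = rng.exponential(scale=2.0, size=50_000)

grid = np.linspace(0.0, 6.0, 601)  # candidate values a = f(x)
square_risk = np.array([np.mean((y - a) ** 2) for a in grid])
abs_risk = np.array([np.mean(np.abs(y - a)) for a in grid])

print(grid[square_risk.argmin()], y.mean())    # ~ conditional mean (2.0)
print(grid[abs_risk.argmin()], np.median(y))   # ~ conditional median (~1.39)
```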
We will see another one: different losses entail different forms of computation to actually find an estimator.

That is what we want to start discussing now: how on earth do you solve this problem, and what are the rules of the game? It is actually going to take us a few more classes to define algorithms. Today we want to begin, very briefly, by deciding what we even mean by solving this problem. It is worth a discussion, because we are asking ourselves to solve a problem to which we have very little access: we do not have access to the distribution, only to the data, so the hope of giving a precise, exact solution is slim. In which sense can we hope to solve this problem? What would be a good or a bad solution? An exact solution is out of the game.

Before we do that, let me add one little remark. There is something I never thought much about until recently, and it turns out it happens all the time. I mostly have in mind the situation where rho is fixed but unknown; this is the classical learning situation, say in solving perception problems like vision or speech. But in many branches of the applied sciences, when machine learning is used, it is not because you have no access to the distribution: you actually do. Physics is one example, geophysics another. You actually know the process, but if you were to simulate it every time you had to solve your problem, it would take forever. So what you do is take a bunch of samples and try to find a function that you can compute quickly, instead of running a ginormous simulation every time you have to make a decision. So there are situations where you have access to the distribution, but it is very costly to sample it all the time, so you sample it n times and build a good proxy. In high energy physics, for example, that is exactly the situation: they do know the standard model that gives rise to the phenomenon, but they still reduce it by synthesizing it into a function learned from samples. End of the remark.

So what do we mean by solving this problem, and what would be a solution? Broadly speaking, we are choosing a map from the data to some function, some estimate. When you see a hat and an n, it is typically an estimate from the data; S here is just the sample, what we sometimes call the training set. Some terminology: we typically call the map from the data the learning algorithm, and we call its output, a potential solution to the learning problem, an estimator. One thing I would like you to start to appreciate is that the term "learning algorithm" generates some confusion, because there are two different aspects: a statistical aspect and a computational aspect. We should worry about both, but we will see that there are situations where one defines a procedure that makes sense statistically but does not entail a specific computation, so you still have to worry about computation separately. I imagine many of you are familiar with support vector machines or logistic regression. Is logistic regression an algorithm? It depends who you ask: a statistician will probably say yes.
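As an aside, a sketch of the statistician/optimizer distinction. The ERM problem that "logistic regression" names is one thing; an algorithm that solves it is another. Here the algorithm is plain gradient descent; the step size and step count are hypothetical choices, not prescribed by the problem.

```python
import numpy as np

def logistic_erm_gd(X, y, steps=500, lr=0.1):
    """The *problem*: min_w (1/n) sum_i log(1 + exp(-y_i * w.x_i)).
    The *algorithm* (one of many possible): plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of the empirical logistic risk
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

# Usage sketch: X is an n x d array, y an n-vector of labels in {-1, +1}.
# w_hat = logistic_erm_gd(X, y)
```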
An optimization person will say no: it is a problem, not an algorithm, because logistic regression gives rise to a minimization that you still have to solve. So this is something we want to start thinking about: when we talk about learning algorithms, sometimes they are actual computational procedures and sometimes just statistical procedures. We might be confused for now, but this will make sense for sure within a class or two.

So how do we build something that goes from the data to some actual proxy of the solution? The first thing you want to decide is what makes a good or a bad algorithm. A good algorithm should provide a good proxy to the solution of the ideal problem. How do we write that down in one equation? What would be a good f-hat, an f found from data? One possibility: take the loss on the data; let's give it a name, the empirical error. Then I say I am happy if this number, for the specific guess I am making, is close to the actual error. That is one possible requirement: a small deviation between the empirical error, something you can compute on the data, and the true error. The deviation of these two quantities is related to what one might call the generalization error, and it is a pretty common quantity that appears, in some form, in fields of optimization where people study convergence of empirical functions, epi-convergence and so on, for those of you who know it.

Can you see another requirement? Here is the ideal problem, and I would like to find an approximate solution. From a mathematical perspective, the most natural thing seems to be: I want to minimize this objective function, so let me check how big the objective is at my guess, compared with its optimal value. I would like to minimize it, but I cannot; still, I would like to find a function for which the difference is small. It might be confusing the first time you see it: the empirical quantity you can compute, fine, but from a theoretical perspective you want the other quantity, the one you cannot compute in practice, to be small. Here we are just defining what a good solution is from a theoretical point of view, and this is the natural thing to look at. In fact it is just the natural metric of success whenever you minimize a function: you find an approximate solution and you check how far the value of the function is from the optimal one. If you ever look at a proof in optimization, that is what it does; if you want, this is nothing but what people in optimization call stochastic optimization. This quantity is what is called the excess error, or excess risk.

We are almost done introducing the point, but we still have to worry about something: this quantity depends on the data. I could ask, for example, that it go to zero as the number of points goes to infinity. But how can I make that precise? Can I just write a limit here and say that this goes to zero? No, otherwise we would not be asking the question. But why not?
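As an aside, a toy simulation (assumed model y = 2x + noise, squared loss) where the true risk is available in closed form, so the empirical risk, the true risk and the excess risk of a least-squares estimator can all be compared.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n = 0.5, 200
x = rng.normal(size=n)
y = 2.0 * x + sigma * rng.normal(size=n)

w_hat = (x @ y) / (x @ x)                  # least-squares fit (ERM here)

emp_risk = np.mean((y - w_hat * x) ** 2)   # computable from the data alone
# For x ~ N(0,1): E[(y - w x)^2] = (2 - w)^2 + sigma^2, minimized at w = 2.
true_risk = (2.0 - w_hat) ** 2 + sigma**2
excess = true_risk - sigma**2              # E(f_hat) - inf_f E(f) >= 0

print(emp_risk, true_risk, excess)
```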
Why can't you do it? Because f-hat is random, so I have to worry about what it means for this quantity to go to zero. We have to require something like convergence in probability; somebody else might prefer convergence in expectation, but certainly we have to agree on something: this is a random variable, so if we want to say that it goes to zero, we have to agree on how it does so.

One possibility is to study the probability that the excess risk exceeds epsilon, and study its behavior. Notice that I do not put any absolute value, because the infimum is the smallest of all possible values, so the difference is nonnegative. This is essentially the exact way in which we measure the quality of a solution. Learning theory, the theoretical part of learning that studies the statistical properties of procedures, is exactly about studying this quantity, and there are typically two kinds of notions. You can ask the quantity to go to zero, just convergence, and that is what is typically called consistency. Or you can ask a more quantitative requirement: not only that it goes to zero, but how fast it goes to zero with the number of samples. This means bounding the quantity explicitly in terms of the number of points and the precision we want for our solution. It will depend on how many points we have, for sure, and also on how much precision we want: for very high precision we will probably need more points, for low precision fewer. The two questions are related: one is plain convergence, the other is convergence with rates. Results of this kind are sometimes called tail bounds, because you can think of this as the tail of a distribution, but also finite sample bounds, in the sense that they provide a quantitative answer for any finite number of points rather than a purely asymptotic statement.

Why look at this particular notion? This is one possibility; another thing you can do is take an expectation, or you could ask for almost sure convergence. Convergence in probability is a very weak requirement: it is implied by pretty much any other one you can list. But notice that the moment you actually get an explicit bound, it becomes a very strong statement that implies most other kinds of convergence. So, to wrap up the question of why we look at this quantity going to zero in probability: yes, you could look at the expectation, at almost sure convergence, at pointwise convergence, and this is the weakest of them. I write this one down because it turns out we can actually get rates for this kind of convergence, meaning I can get an explicit bound, a concentration result, a finite sample bound.
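As an aside, continuing the same toy model: a Monte Carlo estimate of the tail probability P(excess risk > epsilon) as n grows, an empirical stand-in for the delta(n, epsilon) term discussed next.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, eps, reps = 0.5, 0.05, 2000

for n in [10, 40, 160, 640]:
    excess = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=n)
        y = 2.0 * x + sigma * rng.normal(size=n)
        w_hat = (x @ y) / (x @ x)
        excess[r] = (2.0 - w_hat) ** 2   # excess risk; nonnegative by design
    print(n, np.mean(excess > eps))      # estimated delta(n, eps), shrinking in n
```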
This is actually a very strong kind of result: if the delta term is good enough, I can derive from it most of the convergence statements you could want.

What do I mean by a rate? An explicit functional form for this delta term: an explicit expression which should be, maybe not monotonically, but a decreasing function of the number of data points, for any epsilon. A question from the audience: a random variable is a function on a probability space, so literally f-hat is a function on a probability space? Yes; and the quantity we study, the excess risk of f-hat, is then a real-valued random variable, while the infimum itself is a deterministic quantity. Why not just ask whether f-hat converges somewhere? My main goal here is to say when a map from the data into a function is good or not. The data are random; the map itself is deterministic. It could be randomized, but we assume we deal with deterministic algorithms, just to simplify. Still, because the input is random, the output is by definition a random variable, and that is what we have to worry about. We are spending too long on this; we will have classes devoted exactly to this topic. The job of learning theory is to prove, when you invent an algorithm, or to provide indications of how to invent algorithms, that these kinds of quantities can be controlled; and that is actually true for a big chunk of statistics anyway.

For now, the question we want to start asking is: how do you start designing learning algorithms? The basic idea starts from looking at the ideal problem and asking what you do not like about it. There are two things: one, you do not have access to the probability distribution; two, you do not have access to all possible functions in the universe. Can we start fixing these two problems? Rather than the ideal problem, we can consider another one. You know the data, so instead of the true error you can take the empirical error: replace the expectation with the average of the loss over the n data points, and minimize that. But if you want something useful in practice, you also need an accessible space of functions: the space of all possible functions is too much; we need a parameterization. So the next idea is to replace all possible functions with a more manageable space, a space with some structure, which can be very large but has to be manageable. This is one possibility; it is not the only way to do things, but it is a good start and points us in the right direction. This is what is called the empirical risk minimization principle, ERM, sometimes just called risk minimization, and it is one of the main principles.
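As an aside, ERM in its most literal form, for a hypothetical finite hypothesis space H (a grid of candidate slopes) and the square loss.

```python
import numpy as np

def erm(x, y, hypotheses, loss):
    """Return the hypothesis minimizing the empirical risk
    (1/n) sum_i loss(y_i, f(x_i)) over a finite 'space' H."""
    risks = [np.mean(loss(y, f(x))) for f in hypotheses]
    return hypotheses[int(np.argmin(risks))]

# Hypothetical H: linear functions f_w(x) = w * x, for a grid of slopes w.
H = [lambda x, w=w: w * x for w in np.linspace(-3.0, 3.0, 121)]

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.5 * x + 0.3 * rng.normal(size=100)
f_hat = erm(x, y, H, loss=lambda y, fx: (y - fx) ** 2)
```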
If you go this way, the thing you have to worry about immediately is: I can see how to compute the empirical error, but I have a lot of freedom in how to replace the true space with another space. And you can immediately see that something is going on, because different things happen if H is big or if H is small.

What do I mean by H being big or small? I have the space of all possible functions, and say I take the space of polynomials of degree 2 or 3: that is presumably a much smaller space. Or I take a large space: all possible polynomials of any degree, plus Fourier series, plus a bunch of other functions made of Gaussians or whatnot; that is clearly a bigger space. That is the distinction I am making. What might you expect to be the benefit of picking H big? As your colleague says: with a big H I have a lot of options, so I am more likely to be able to find the right function. How do we formalize this? Consider the situation where you are lucky enough to have a ton of points, in fact infinitely many: then the empirical risk just becomes the true risk, but I am still replacing the true space with another one. Let me use a shorthand: the minimization of the true risk over my space H. If H is big, the minimum over H gets close to the true problem, the minimum over all functions. That is nice: I take a big space and I am not changing things too much.

What is bad? In practice you do not have infinitely many data; what you have access to is just the empirical version. The problem is that if you only have finitely many data and you allow yourself a big space, you can explain anything you want. Remember the famous quip (attributed to von Neumann): give me a few parameters and I can fit an elephant. It is the same story here: with a ginormous space I can do pretty much anything; in particular, I can drive the error on my data to be very small, and yet the true error can be whatever. So we can have situations where the empirical error is small, but the true value is much bigger, and we get nothing useful. That is one extreme.

Then there is the other extreme: if you take H to be very small, do the two effects not simply get reversed? If H is very small, there is not much room to look for functions, so what I find on finite data is roughly the same as what I would find with a lot of data. I am happy because my minimum over H is roughly the same whether computed on data or in expectation; but the minimum over H may be far from the minimum over all functions. With a small H the empirical and true problems are likely to be very similar, because there are very few functions to look into, the opposite of what was said a minute ago; on the other hand, the right function might not be there, and if it is not there, you do not find it.
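As an aside, a standard overfitting demonstration of the point just made (hypothetical data; np.polyfit may warn about conditioning at high degree): as the hypothesis space grows from degree-1 to degree-14 polynomials, the empirical error is driven toward zero while the test error blows up.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 15
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.2 * rng.normal(size=n)
x_test = rng.uniform(-1, 1, size=5000)
y_test = np.sin(3 * x_test) + 0.2 * rng.normal(size=5000)

for deg in [1, 3, 14]:                       # small H ... huge H
    coef = np.polyfit(x, y, deg)
    train = np.mean((y - np.polyval(coef, x)) ** 2)
    test = np.mean((y_test - np.polyval(coef, x_test)) ** 2)
    print(deg, train, test)                  # degree 14 interpolates: train ~ 0
```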
So the moment you start thinking about this space, what is called the hypothesis space, the space where you are picking your functions, you have to worry about how big or small it has to be. And that is where the key word that we are going to use fifty billion times in this course enters the picture: the idea of regularization. The way we will use it is often the following: we build relatively large hypothesis spaces, and then we explore them. Pick H large; then regularization, abstractly, will be a procedure to explore H while controlling the "complexity" of functions. Complexity is in quotes because for now it does not mean anything precise; it is just an intuitive notion. H is a big thing, and I am going to explore it by being able to say what lives in a smaller subspace and what lives in a larger one, what is simple and what is complicated.

That is basically what we are going to do for the next two classes. Today and next class we introduce function spaces, get a feeling for what it means to have large function spaces, and find a way to explore them. We will typically explore them by introducing norms: functionals that attach a number to each function. If the number is big, the function is complicated; if the number is small, the function is simple. I will try to make sense of how this can be done; it will not be just one way, but many ways. This is our strategy: once we have it, we will be able to design learning algorithms. I would say that, at some level or another, this is how the majority of learning algorithms are developed: you fix some potentially large function class, and then you find a way to explore it. Again, this is a bit abstract, but we have essentially thirteen and a half classes left to make it exact.

A question about what it means to explore functions "efficiently": I am just saying I want to be able to explore them in such a way that I know when I am looking at a complicated function or a simple one, and, within the large space, to consider smaller or larger subsets in some sense. Let me give an example. I take the space of all possible polynomials, and I consider polynomials of degree two, then three, then four, and so on. Or I take harmonic functions made of sines and cosines: first very smooth functions with only low frequencies, then higher and higher frequencies. The biggest space I can reach is what I call a large H, and the degree of the polynomial, or the number of frequencies, is a strategy to parameterize and explore it. You can keep going for a while with this kind of example. For now we are not talking about computation; we will have to worry about how to do it computationally later, as shown in the sketch below, and that will put a twist on the whole story.
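As an aside, anticipating later lectures: one concrete instance of "pick H large and control complexity by a norm" is Tikhonov/ridge regularization. A minimal sketch, assuming a feature matrix Phi has already been built.

```python
import numpy as np

def ridge(Phi, y, lam):
    """Explore a large space while penalizing the norm ||w||:
    solve min_w (1/n) ||y - Phi w||^2 + lam * ||w||^2 in closed form."""
    n, p = Phi.shape
    return np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(p), Phi.T @ y / n)
```

Large lam confines the search to functions of small norm (the "simple" ones); lam -> 0 recovers plain empirical risk minimization over all of H.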
What we want to do today is start introducing function spaces and start to understand how we can describe how simple they are. We start from the simplest functions, the space of linear functions, because it turns out to be the paramount example of how we build function spaces: using the same kind of ideas needed to build the space of linear functions, we will be able to build many, many other function spaces. The main character of today is going to be H, and I want to discuss two or three things: one, linear functions and functions built on a finite dictionary; two, the notion of reproducing kernel Hilbert space and reproducing kernel; and, I am not sure we will manage, but I would like to touch on the idea of positive definite kernels, with one big example, translation invariant kernels.

Let me address now, because it is a good point, a question one might ask before we get into the business: why talk about reproducing kernel Hilbert spaces if the only thing that exists is neural networks and kernel methods are dead? That is legitimate. Well, because RKHSs are not just about kernel methods: it is actually very hard to meet a function space which is not a reproducing kernel Hilbert space. They appear everywhere in signal processing, everywhere in statistics, everywhere in stochastic calculus, where they are called Cameron-Martin spaces, and you can keep going. You will see in a minute: we take a path that shows these things are not very specific; they are actually pretty general. It just so happens that if a function space has two properties, you get a bunch of things for free. So in summary it is more of a foundational discussion, where you meet an object that is going to be useful whatever you do: even when you study function spaces that are not RKHSs, say those given by neural networks, you typically try to get back to things like this, because they have enough structure to be studied. That is why we do them, and also because they allow us to introduce the classical methods: linear methods, generalized linear models, models based on features, and nonparametric models. We will also touch on Gaussian processes, since we are introducing the whole story of kernel methods in its various forms. These will be our main tools for a bunch of lectures.

We start the whole story with linear functions and dictionaries, because they introduce a lot of things that we like. What do we like? Spaces of functions that are big, but not too big, and that allow us to control the complexity of a function, in a sense we still have to make precise. Also, at some point we have to stop pretending we are just statisticians, as I have been doing so far, only discussing minima of expected risks; somebody will eventually ask about computations, and we have to worry about that too. We have to set up a space H that is manageable in some precise sense. So let's start with the first example.
Consider the space H of functions from X to R such that there exists w in R^d with f(x) = w^T x for all x. Clearly I am assuming here that my input space is R^d. You can ask: this is nice, but I want to generalize to more general sets, graphs, strings, probability distributions, and so on; for now we stick to this.

Why do we love this space? It is not very rich, but why is it so nice? It is linear, and more than that, it is linear and finite-dimensional, so although we talk about functions, we are really dealing with vectors. Let's record a few properties. One: it is a linear space. More than that, there is a natural correspondence between a w and an f, and it is a one-to-one correspondence: whenever you talk about a function you are actually talking about a vector, so effectively you can pick whichever you want. Typically you like vectors better than functions, because you know how to think about vectors. So modeling-wise we think about functions, but operationally we work with vectors, and we know what we can do with vectors.

We can also define an inner product on functions: for any two functions f and f' in H, just because every function is a vector, it is completely trivial (we are going to make this more complicated in a minute, so appreciate that it is easy now): define the inner product of f and f' as w^T w', where w is the vector associated to f and w' the vector associated to f'. For those of you who were not around yesterday, we will often use the angle-bracket pairing notation for inner products; it is just a change of notation, and I will use it all the time, so get used to it. With an inner product we of course also get norms, distances, and so on.

And we can notice another small thing: f(x) = w^T x, so we can use the Cauchy-Schwarz inequality (as I told you, there are two inequalities you should know in life; we will meet the other one later, but this is certainly one you want to know) to get |f(x)| <= ||w|| ||x||. What does this mean? At this point it is not particularly deep, but it means that if two weight vectors are close, the corresponding functions are close at every point: |f(x) - f'(x)| = |(w - w')^T x| <= ||w - w'|| ||x||. From this point of view it is quite obvious: you go from a notion of vicinity of weights to a notion of vicinity of functions at every point. This will turn out to be useful.

This space is fantastic, because it has all the properties you might need and it is super simple. Notice that it is still used: whenever you do not have many points but you have a lot of descriptors per point, it might be enough. That is something we will see a lot: if the number of points n is not so large and d is very large, the first thing to try is to stick to the linear model, because it might already be enough, it might already be complicated enough. But generally speaking you might want to go to something more complicated.
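As an aside, the pointwise bound via Cauchy-Schwarz, checked numerically on random vectors (all data hypothetical).

```python
import numpy as np

rng = np.random.default_rng(4)
d = 10
w, w2 = rng.normal(size=d), rng.normal(size=d)   # two linear functions f, f'

for _ in range(5):
    x = rng.normal(size=d)
    lhs = abs(w @ x - w2 @ x)                          # |f(x) - f'(x)|
    rhs = np.linalg.norm(w - w2) * np.linalg.norm(x)   # Cauchy-Schwarz bound
    assert lhs <= rhs + 1e-9   # weights close => functions close at every x
```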
So one possibility is the second case, the case of a finite dictionary. What is a dictionary? It is just a set of functions phi_j from X to R, with j from 1 to p, where p is a finite number, not infinity. A finite set of functions is what we typically call a dictionary; call it D. The classical example: if you have a space on which you can define bases in some precise sense, "dictionary" is just a name that covers bases and things more general than bases, as we will see in a minute; it is just a collection of functions.

The elements of the dictionary have names attached to them: they are called atoms, or features. "Atoms" I do not know where it comes from, but it is the classical name in harmonic analysis and signal processing; "feature" is the machine learning name, and it is kind of a cute point of view. It says: look, this is a function that takes the input and returns numbers, so you can think of having descriptors of the inputs. Say the input is an image: phi_1(x) picks out one feature, a number associated to the image; phi_2(x) is another number; and so on. These are descriptors, numbers that describe the input, and we can think of them as features. They are just names, but that is all they are.

What is the idea? We build the space H of functions from X to R such that there exists w in R^p with f(x) = sum over j from 1 to p of w_j phi_j(x): linear combinations of the features. Do I need to keep the restriction that X = R^d? No: it is all hidden in the phi_j, which I have not specified, and they can be defined on whatever. Suppose X is a space of graphs, and George tells you he has ten functions that are amazing at describing graphs: those are your features, and that is it. So X can be whatever. Of course, a lot rides on how you choose the phi_j; clearly a ton of prior information goes into designing them, and that is something we will want to discuss. For now, suppose somebody gives them to you: what do you do with them to build functions? The obvious thing: take the linear construction you liked, and generalize it. You can view this as pre-processing of the data followed by linear functions, as in the sketch below.

Now let's look at the properties of this space. Is it a linear space? Everybody is nodding, because it is. What about the second property: can I define an inner product? Here it is trickier: yes, but in general it might be complicated. There is one complication; what is it? Look at what we wrote. If the phi_j are linearly independent, there is just one way to write each function as such a combination, so when we write the inner product we pick, for every f, exactly one w and one w'. Then defining the inner product as w^T w' makes sense, because there is just one such vector. Suppose instead the atoms are not linearly independent: then the same function can be written in many possible ways, so what would it mean to take "the w corresponding to f" when f has many possible w's?
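As an aside, a hypothetical dictionary of p = 4 nonlinear atoms on X = R, and the resulting function space: functions linear in w, nonlinear in x.

```python
import numpy as np

# A hypothetical dictionary D of p = 4 atoms (features) on X = R.
dictionary = [
    lambda x: x,                  # the identity: the linear case survives
    lambda x: x ** 2,             # a squared coordinate
    lambda x: np.cos(x),          # a harmonic atom
    lambda x: np.exp(-x ** 2),    # a Gaussian atom
]

def feature_map(x):
    """Phi(x) = (phi_1(x), ..., phi_p(x)): the descriptors of the input."""
    return np.array([phi(x) for phi in dictionary])

def f(x, w):
    """f(x) = sum_j w_j phi_j(x): pre-process with Phi, then go linear."""
    return w @ feature_map(x)
```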
We can actually deal with that case, and we really will, but for now I want to make our life easy, so we assume this is a dictionary of linearly independent atoms. It is an assumption we make for now; interestingly, by introducing reproducing kernel Hilbert spaces we will be able to remove it, and we will also be able to remove the assumption that p is finite and take infinite dictionaries.

Why would you not be happy with this assumption? Because it is a restriction, and a restriction limits your choices; the issues are richness and redundancy. The classical example: I give you a function that is sometimes very smooth and sometimes has very big spikes. Which basis do you use? Sines and cosines are very smooth; spikes capture the spikes; so you say, I will just put them together. But if you take a basis and another basis and merge them, the result is not a basis anymore. And more generally, suppose I give you images, or texts, or strings, and you invent features: you may well have correlations, collinearity among features, because you just pulled them out of experience. So redundant dictionaries are interesting both for ease of design and for richness; this is typically why it is interesting to go beyond bases. Any other questions? If I forget, remind me to repeat questions, because these lectures get recorded and you do not have a microphone, so without the repetition the video will be interesting to watch in the future.

All right, the first two properties are the same as before, so let me repeat them: the correspondence between f and w is one-to-one, so I can define the inner product in the same way, as w^T w', only now w lives in R^p rather than R^d. And I have the same kind of pointwise property: by exactly the same reasoning, |f(x)| <= ||w|| sqrt(sum over j of phi_j(x)^2). As long as this sum is finite, which is always true for a finite dictionary, I have the same property as before: from the vicinity of the weights I know the functions are close at every single point.

So what we have now is not bad: these are spaces in which you can do a lot of operations. Suppose I ask you to compute gradients: if you have to minimize something over this space, it is not about a function anymore, it is equivalent to finding a vector, so you can just do multivariate calculus. And because you have an inner product, you have a notion of orthogonality, so you can do many different things: you basically have all of linear algebra at your disposal, and that is what you are shooting for. Moving from the linear case to this also gives us freedom, because we can build nonlinear features: the phi_j do not have to be linear. You can take the first coordinate and square it, take a cosine, take a Gaussian. Of course you can ask: how do I choose them? I do not know; that is a big problem we will have to face.
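As an aside, the same hypothetical dictionary again: minimizing the empirical squared risk by setting the gradient to zero (least squares on the feature matrix), then checking the pointwise bound |f(x)| <= ||w|| sqrt(sum_j phi_j(x)^2).

```python
import numpy as np

rng = np.random.default_rng(5)
feature_map = lambda x: np.array([x, x**2, np.cos(x), np.exp(-x**2)])

x = rng.uniform(-2, 2, size=50)
y = np.cos(x) + 0.1 * rng.normal(size=50)
Phi = np.stack([feature_map(xi) for xi in x])      # n x p design matrix

# Setting the gradient of the empirical squared risk to zero is exactly
# the least-squares problem on the coefficient vector w.
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Pointwise bound via Cauchy-Schwarz: |f(x)| <= ||w|| * ||Phi(x)||.
x0 = 0.7
fx0 = w_hat @ feature_map(x0)
assert abs(fx0) <= np.linalg.norm(w_hat) * np.linalg.norm(feature_map(x0)) + 1e-9
```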
But for now you can see that you get a lot of flexibility: by using the machinery of linear algebra, which is what linear functions give you, you can actually consider nonlinear models. Notice that d is the dimension of the data, which I am assuming to come in vectorial form, while p is whatever you choose it to be: three gazillion or three, it is up to you. So I can get a lot of richness by increasing p; at some point I will hit a computational bottleneck, because the number of coefficients gets large, but in principle, for any number of points, you can choose a number of features much larger or much smaller than the number of points.

There are a few things we would like to achieve now. We would like to see whether there are more ways to build function spaces, and we would like to kill the one main restriction we have made so far, namely that p is finite: so far, we fix a priori a finite number of parameters, and that is it. We would like to allow ourselves to work in larger spaces, a very large H, for example a space which is infinite-dimensional, not just finite.

If you were here yesterday, this is basically what I discussed: if you have a linear space with an inner product and you want to generalize to infinite dimensions, you need a technical condition, essentially that certain series of functions converge, and then you are good to go. The name we give to a space with these properties, a linear space plus an inner product plus completeness (the property that Cauchy sequences converge, which ensures limits stay in the space), is a Hilbert space. So the most natural move is to say: I want to go infinite-dimensional and keep everything I had, so give me a Hilbert space.

It turns out that this is not quite enough: in the examples above I had more, I had structure connecting the norm to pointwise values. If you take a generic Hilbert space, you have to sweat a bit more to get that. There is one other assumption which gives us a lot of juice, and it is a bit more technical; of the ways to introduce these spaces, this is by far the most general. It is the property that the evaluation functionals are continuous. It is a very abstract requirement. Consider the map, call it e_x, that goes from H to R by taking f and returning f(x): a map that takes a function and evaluates it at the point x. Requiring this map to be continuous says: for all epsilon > 0 there exists delta > 0 such that for all f, f', if ||f - f'||_H <= delta then |f(x) - f'(x)| <= epsilon. (I write f(x) rather than e_x(f), but it is just the evaluation functional.) It simply says this map is continuous.
This is not the friendliest definition you might think of, but we go this way on purpose, to appreciate that there is a very axiomatic way to arrive at these spaces; we will spend essentially the rest of the story making it practical. It is intentionally, brutally impractical, to show how small the story is when you take the extreme point of view: it is just a Hilbert space with one extra condition, seemingly technical, that turns out to give us a lot of things for free.

Let's convince ourselves that the assumption is not trivial by considering two basic function spaces. (To answer a question: yes, H here is a Hilbert space of functions from X to R.) Rather than keep staring at the definition, which is not that informative yet, let's check why it is not a trivial requirement. Take X = [0, 1], and check whether L2([0, 1]) works. Is it a Hilbert space? It is a linear space: I can sum square-integrable functions. Does it have an inner product? Yes. Is it complete? Yes, believe me, or check it. Do I have the property that the evaluation functionals are continuous? A consequence of the definition (just put f' = 0) is the inequality |f(x)| <= C ||f||, up to a constant: if f has small norm, then f(x) is small at every x. That is a compact way of rewriting the requirement.

Is this true for L2 functions? What am I asking? If you take a function on [0, 1] and control its integral, can you control what the function does at every single point? No, because integrals do not care about single points: you can take any single point, indeed any finite number of points, and shoot the function up to infinity there without changing the norm. So L2 is not enough: it is too big, there is too much freedom. L2 satisfies the Hilbert space requirements, but the integral of a function does not control what it does at every point.

Can you think of a function space whose norm does control what the function does at every point? Continuous functions: C([0, 1]), same example. What is the norm? Exactly the maximum: the classical norm on this space is the maximum over x of |f(x)|, which certainly controls what the function does at every point, since it takes the worst one; if the max goes to zero, everything goes to zero. It is a linear space: the sum of two continuous functions is continuous. But is there an inner product that gives me that norm? No: this is one of the prototype examples of a norm not induced by an inner product. So I am again violating the requirements: I got almost all the way, I have the linear space and the norm that controls evaluation, but not the inner product.
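As an aside, the spike argument in numbers: a bump of height h on an interval of width 1/h^3 has L2 norm 1/sqrt(h), which vanishes as h grows, while the value at the point grows without bound. So the L2 norm cannot control evaluation.

```python
import numpy as np

# Spike of height h on an interval of width 1/h^3 around x = 1/2:
# integral of f^2 is h^2 * (1/h^3) = 1/h, so ||f||_2 = 1/sqrt(h) -> 0,
# yet f(1/2) = h -> infinity.
for h in [10.0, 100.0, 1000.0]:
    l2_norm = np.sqrt(h**2 * (1.0 / h**3))
    print(h, l2_norm)   # pointwise value blows up while the norm vanishes
```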
got this guy but I didn't get this guy that is just to appreciate how this simple thing rule out that come to your head okay l2 and seiect not okay too big too little okay so this is what we call a reproducing kernel Hilbert space they write it for once typically right Shawn enters arcade chess again these are these these the definition is on purposely abstract test to see how somewhat general it is one probably that is pretty useless and he actually doesn't explain why we call this guy reproducing kernel Hilbert space okay Gilbert space yes but while reproducing kernel and clear okay well it turns out that whenever you have this assumption you actually get one special function for free which is going to be useful and it's what we call a reproducing kernel so we're going to now introduce the function and show that is somewhat it comes for free and is equivalent to having a function space like that so we define a reproducing kernel as so it's a function k that goes from the input the reals and it has two properties if I now pick six one and three okay so you take K X and you leave the other entry 3 this is a function of one variable now I fixed X and I leave the other one free so this function has to be longer so sorry reproducing kernel of Hilbert space of functions age okay so I'm assuming that ever function space and I call and I said that this function space has a reproducing kernel if there is a function like this with two properties if I fix one entry the function that I obtain belong to the space okay and if I take the inner product of any function in this space with this function here and this gives me f of X okay again he's an absolute finisher I basically say I say that the function that the hilbert space as a reproducing kernel if in that space there is a special function with two property fix an entry you get the function in the space take the product with that function you obtain by fixing an entry and multiply by any other function you are evaluating that other function at the point okay the whole story is gonna be abstract for a little more okay and then we'll see that he gets more constructive for now it's gonna be more serious structural assumptions now basically turns out that the assumption of evaluation function will be continuous or the existing of a reproducing kernel are equivalent so whenever you make this assumption it means that you have this special function whenever you assume that you have this special function you actually have this second property these two things are completely equivalent the first you don't need to discuss because it's the same in both cases okay which one do you want if your suffering is okay that's kind of the moment where we try to make you suffer the most so that you go away we have more seats and let's be set to correct but is it this moment is design on purpose okay that is gonna cannot be much worse than this in this course it's not gonna be any worse than this in this course all right so just because always forget the names I don't remember the names but I just write what I just told you okay the only one I cannot pull up all right so just remember arc HS is this definition that guys are reproducing okay oh there is a theorem that is easy to prove in both direction but one is trivial they don't use a super-duper to result if H is RK chess you okay let me just short an intervallic to write 15 times it just wanted to basically if ever Huber space which contains values team function then in that space there is a special function like 
Part two, the converse: if a Hilbert space of functions has a reproducing kernel, then it is an RKHS. (The shorthand "there exists a unique" is just my laziness for "there exists, and it is unique".) A clarifying question from the audience: yes, in the second direction H is a Hilbert space, I tell you nothing about the evaluation functionals, but I tell you it has a reproducing kernel, with its two properties; then it is a reproducing kernel Hilbert space, meaning it also has the continuity property. And in the first direction, a Hilbert space with continuous evaluation functionals contains exactly one reproducing kernel. The two notions are the same.

Good, let's do the proof; it is literally two lines. Direction (b) is elementary, meaning you use nothing: you literally write down the definition of reproducing kernel, and it is almost obvious. What do I need to show? That the evaluation functional is continuous, that is, that the norm of f controls f(x) at every point. If I have a reproducing kernel, then the evaluation functional can be written as f(x) = <f, K_x>, by assumption, because that is what the reproducing kernel gives. How do I control it? I just apply Cauchy-Schwarz: |f(x)| = |<f, K_x>| <= ||f|| ||K_x||. The norm of K_x might be big, but it is finite, because by assumption K_x belongs to the space. So the norm of f controls f(x), which is exactly what I need. In this sense it is elementary: all you use is the definition of reproducing kernel and Cauchy-Schwarz; that is the kind of proof we do in this course. A question: in the definition, strictly speaking, this line is the continuity of the evaluation functional e_x: it says that if you give me close inputs, the outputs are close. Another question: what if f(x) lands outside some range? Remember, a square-integrable function is a function whose norm is finite; and yes, every element of a Hilbert space has finite norm, which is what I am using.

The other direction, (a), is not elementary: it uses what is called the Riesz representation lemma, which is actually a deep result, and we are going to skip it for now, because it would just leave you with even more confused eyes; we will maybe do it next time, and it does not really matter for now. What I want to do now is show you some examples instead, so that we know what on earth we have been talking about: linear spaces for an hour, and then abstraction for another hour. Let's do an example, and then we can come back. Direction (a) is short as well, one or two lines; the reason people find it odd is
that it is puzzling, because it uses a complicated result. Well, I introduced the theorem also because it makes it a bit easier to check that a space we are dealing with is a reproducing kernel Hilbert space. You see that once I have this theorem, I can give an equivalent definition of reproducing kernel Hilbert space: something is an RKHS if (a) it is a Hilbert space of functions and (b) it has a reproducing kernel. I have basically traded evaluation functionals for reproducing kernels, and I gain a bit of substance: evaluation functionals are abstract, kernels are something I can start to touch.

So, some examples. Is the finite dictionary space a reproducing kernel Hilbert space, and if it is, what is the kernel? I claim it is an RKHS: it is clearly a Hilbert space (it is linear and has the inner product we defined); I just have to check that it has a reproducing kernel, and I claim that the kernel is K(x, x') = sum over j from 1 to p of phi_j(x) phi_j(x'). How do we prove it? We check the two properties (this is not the best use of the blackboard I could have made): first, fix one entry and check that what remains is a function in the space; second, check that f(x) is the inner product of f with that function.

Let's do it. First, delete one entry: K(., x') = sum over j of phi_j(x') phi_j(.). The phi_j(x') are just numbers, the phi_j are the functions I am combining linearly, so this does belong to H; it gives me another function in the space. Second, what is the inner product between f and K_x? Remember, there is a one-to-one correspondence between functions and coefficient vectors, so you take the inner product of the coefficients: the coefficients of the first function are the w_j, the coefficients of the second function are the phi_j(x), and their inner product is sum over j of w_j phi_j(x), which is exactly f(x). So whenever you deal with a finite dictionary made of linearly independent elements, you get a reproducing kernel Hilbert space, and the kernel is this one. In fact, we will see you can also go the other way around: from a kernel like this you recover the correspondence and the space.

What about a special case? Consider phi_j(x) = x_j, just the linear functions. Then I get what is called the linear kernel: K(x, x') = x^T x', the inner product of the two vectors. So the space of linear functions is an RKHS with the linear kernel, and the finite dictionary space is an RKHS with the kernel given by the inner product of feature vectors.
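As an aside, the two defining properties checked numerically for the dictionary kernel (the dictionary is the hypothetical one used earlier).

```python
import numpy as np

rng = np.random.default_rng(6)
phis = [lambda x: x, lambda x: x**2, np.cos, lambda x: np.exp(-x**2)]
feat = lambda x: np.array([phi(x) for phi in phis])   # Phi(x) in R^p

x0 = 0.3
# Property 1: K_{x0} = K(., x0) lies in H; its coefficient vector is feat(x0).
c_x0 = feat(x0)

# Property 2 (reproducing): <f, K_{x0}>_H = w . c_x0 must equal f(x0).
w = rng.normal(size=len(phis))                        # coefficients of f
f_x0 = sum(w[j] * phis[j](x0) for j in range(len(phis)))
assert np.isclose(w @ c_x0, f_x0)
```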
Let me give you one more example, which we will discuss more next time. I just told you that L2 is not okay; can we make L2 okay? Consider H to be a subset of L2, and let's work in one dimension, on the real line. Consider the functions f for which the Fourier transform f-hat(omega) is defined; I am assuming you know the definition of the Fourier transform and not much more: you go from the space (or time) domain to the frequency domain, the omega space, which you can think of as a description in terms of harmonics, sines and cosines. (I assume I am not saying things you have never seen; otherwise Tommy would tell you to just go away.) Now, among those, I do not take all possible functions: I take only the ones whose Fourier transform is supported on [-a, a]. If you interpret omega as frequency, it means I allow frequencies only up to a maximum frequency, and nothing beyond. Of course these are not all possible functions, not even all square-integrable functions; it is some subset. You can check easily that it is a subspace of L2, so it is a linear space, and you can define the inner product of any two functions to be just the inner product in L2.

It turns out that this is actually a reproducing kernel Hilbert space, and the kernel is the sinc function. This is the first example of what are called translation invariant kernels: they are called translation invariant because they do not depend on x and x' separately, but only on the difference x - x'. There is a whole family of kernels you can define this way: the Gaussian kernel, the exponential kernel, the multiquadric kernel, and so on. They are an interesting class of kernels, and they start to show some insight, which is that reproducing kernel Hilbert spaces are large but not huge: they contain functions that have to be simple in some sense. Here, for example, the functions are simple because they do not have arbitrary frequencies, only frequencies up to a given prescribed maximum. Think for a second how you would prove this. The proof is actually not too long: you use the fact that for a function in here the norm in the frequency domain sits on an interval, not on the whole line; you then use the inverse Fourier transform; and you remember that the indicator of an interval and the sinc function are related by Fourier analysis.

Let's stop here. We started from simple spaces, went to crazy abstract ones to show how general these objects are, then came back to linear spaces and recovered what we did before; next time we take a step further.
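As an aside, the band-limited kernel in code (the normalization depends on the Fourier convention used, so take the constants as an assumption); Gram matrices of such translation invariant kernels are positive semidefinite.

```python
import numpy as np

a = 5.0   # maximum frequency: f_hat supported on [-a, a]

def sinc_kernel(x, xp):
    """K(x, x') = sin(a (x - x')) / (pi (x - x')): translation invariant,
    i.e., a function of x - x' only. np.sinc(u) = sin(pi u) / (pi u)."""
    return np.sinc(a * (x - xp) / np.pi) * a / np.pi

x = np.linspace(-3, 3, 50)
G = sinc_kernel(x[:, None], x[None, :])        # Gram matrix on a grid
print(np.linalg.eigvalsh(G).min() >= -1e-8)    # numerically PSD
```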
Info
Channel: MITCBMM
Views: 13,743
Rating: 4.8 out of 5
Keywords: CBMM, Center for Brains Minds and Machines, Artificial Intelligence
Id: 9-oxo_k69qs
Length: 80min 49sec (4849 seconds)
Published: Wed Sep 13 2017