Curve Fitting and Regression with L1 and L2 norms

Captions
All right, we're going to cover a new topic and perform a little more magic. Here's what I want to talk about: Ax = b. I am, supposedly, a professional applied mathematician, and I'll tell you right now that most of you probably think you know something about Ax = b, but there is always more to learn. It's actually a hard problem. We don't often think about it because we just hit backslash in MATLAB: whatever, backslash it, it'll give me something. And it usually does.

A lot of what we talk about in data science comes down to the shape of that matrix A. Either it's tall and skinny (lots of data, only a few parameters) or it's short and fat (a few measurements, but lots of unknowns). So Ax = b is going to be either massively overdetermined or massively underdetermined. That's the generic scenario in data science.

Now, how do you solve it? Most of you have been solving it with backslash without giving it a second thought, so today we give it a second thought. In the overdetermined case you have a ton of constraints and very few variables, so in general there are no solutions; you simply can't satisfy all of those constraints. In the underdetermined case you have only a few constraints but a huge number of unknowns, so there are infinitely many solutions. Remember what it means when the determinant of a square matrix is zero: you either get infinitely many solutions or none. That's exactly what we have to live with for these over- and underdetermined systems, infinity or none. So the question is: how is backslash even working? Have you thought about that? You hit backslash on a system that is impossible to solve in the usual sense, because there is either an infinite number of solutions or none, and it hands you something anyway.

What we're going to do is solve Ax = b subject to an additional constraint. This is exactly what we did for independent component analysis. Remember, I could not pull those two signals apart; I hadn't been given enough information to do so. Then we imposed a constraint, we said the signals are statistically independent, and that let us make progress.

This almost seems beneath you, maybe. I thought it was beneath me, and then I started learning about it and I was humbled, so this is your moment to be humbled too, unless you know it all. Ax = b is not something you write home about. Your parents would say, "You're still on that? I thought you covered that as an undergrad." Yes, Mom, I'm still solving Ax = b. But we will move past that and I'll show you all the wonderful things you can do with it.

So I'm going to start this Ax = b discussion on the road toward something called compressive sensing.
What we're going to look at first is just curve fitting. Here's the setup: I have a bunch of points and I want to fit the best curve through them. (I would draw this by hand, but I can't trust you to tell me when I've got marker all over my face; I've walked out of this classroom before with big black marks on my face and nobody standing five feet away said a word.) I have a bunch of points and I'm trying to determine two parameters, and it turns out this is just Ax = b once you set up the least-squares fit.

So we're going to do some least-squares fitting, and the thing I want to focus on is that there are a variety of options for how to do this fit, including which norm you use. Least-squares fitting you've done for a long time: you take your points, you posit a function, call it f(x), you look at the difference between where your function says you should be and where the points actually are, and you minimize that. We typically take the two-norm, and I want to show you there are options here that make a big difference in how the fit behaves.

Here's the precursor. Suppose those are my points, and suppose all of a sudden one point sits way off in the distance: the day you brought your Chipotle lunch into the lab, got excited, bumped the table, and now you have that data point. Remember, it's illegal to just remove points. You can't say "I don't like that point, I'll take it out of my data set." So you keep it, and what does least-squares fitting do? It takes the difference between the fit and that far-away point and squares it. You made a mistake, and now you're squaring the mistake. Squaring mistakes is usually a bad thing, and what it gives you is a line that's way off, all because of one bad data point and because you chose the L2 norm.

So we're going to talk about the norms you get to pick. This matters because it's generic for Ax = b and for how you want to solve Ax = b. For under- and overdetermined systems, when you write down a solution you have choices to make. Out of the infinite number of solutions, which one are you going to give me, and in which norm? Or, for the unsolvable problem with too many constraints, you're still going to minimize something for me, so what is the norm you're going to use to hand me a solution?
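For reference, here are the two fitting objectives being compared, written out as formulas (these are exactly the quantities described above):

    E_2(m, b) = \sum_k ( m x_k + b - y_k )^2      (least-squares, L2 error)
    E_1(m, b) = \sum_k | m x_k + b - y_k |        (L1 error)

A single far-away point contributes its distance squared to E_2 but only its distance to E_1, which is the whole story of the outlier example.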
All right, so let's pull up some MATLAB. (Wake up, wake up... seriously? My machine decided to reboot itself: "your computer was restarted because of a problem." Ignore. Give it a second while I unmute and bring up the code we were working on last time, and then let's just clear it all out.)

Okay, here we go. First I'm going to give you a set of points, and what we're going to do with them is a least-squares fit. x is a handful of x-coordinates (2.8, 3.0, 4.0, 4.3, 4.4, 4.9, among others; I could have typed these beforehand, but since my computer died it wouldn't have mattered) and y is the associated set of y-values (2.0, 2.2, 2.8, and so on). Plot them with red circles and there's our data. Our job now is to fit a line through it.

To do that I'm going to solve a minimization problem. By the way, for least-squares fitting there are explicit formulas; it's not that hard. You're minimizing the squared L2 error, the sum over the data of (f(x_k) - y_k)^2. Call that E. How do you minimize an error? You hear "minimize" and you think: set the derivative to zero. There are two parameters here, since the fit is f(x) = mx + b, so you take the derivative of E with respect to m and set it to zero, and the derivative with respect to b and set it to zero, and that leads you to a 2-by-2 system of equations. How do I know these are minima and not maxima? Because in this kind of curve fitting there is no maximum error; you can make the error infinite if you want by moving the line off to infinity. Kind of like life: there's no upper bound on your errors, only a best you can do. So this gives you a very simple 2-by-2 system to solve. It's even built into Excel; tell Excel to fit a line and this is the little system it solves.
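As a side note, here is a minimal MATLAB sketch of that 2-by-2 system (the normal equations). The variable names are mine; x and y are assumed to be the data vectors defined above:

    % Least-squares line fit y ~ m*x + b via the 2x2 normal equations,
    % obtained by setting dE/dm = 0 and dE/db = 0.
    M   = [sum(x.^2)  sum(x);  sum(x)  numel(x)];
    rhs = [sum(x.*y);  sum(y)];
    mb  = M \ rhs;                 % mb(1) = slope m, mb(2) = intercept b
    mb_check = polyfit(x, y, 1);   % MATLAB's built-in degree-1 fit, for comparison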
But what we're going to do instead is use the fminsearch command, because I want to play around with different norms inside that error and find the parameters by direct minimization. So: I get the coefficients of an L2 fit, the least-squares fit, by handing fminsearch an objective function for the L2 error of the line fit, an initial guess of one for both parameters (the slope and the offset), a couple of tolerance settings, and the data. And I'm also going to get coefficients for an L1 fit.

What's an L1 fit? Instead of squaring the difference, the L1 objective takes the absolute value of the difference. No square. A couple of simple implications right away. One, it's still measuring a distance. Two, do you remember differentiating absolute values? They're kind of a pain; you have to keep track of the sign, so absolute values are not as easy to work with analytically. The squared version is trivial to differentiate; this one is not. Three, go back to the picture I drew. With the L1 norm, if I get a point way out in the distance, the fit still says there's a problem here, but it counts that distance, not that distance squared. It reacts to the point, but it's not going to drag the whole line off the fit.

Why is that important? Because points like that are called outliers, and a lot of what people want to do is robust statistics: not rejecting outliers by deleting them (we've talked about that; you can't just erase a point you don't like) but keeping them in the data set while making sure your fits and regressions don't overreact to them. The L1 norm makes a huge difference here, as you'll see.

So let's look at the two objective functions. Hey, maybe I even have them already written... look at that, I do. The L2 one is a function that returns an error E, the thing fminsearch minimizes. You pass in x0, the two parameters, the slope m and intercept b from before, initially guessed to be one. E is the sum of (m times x plus b, minus the actual data points) squared. fminsearch iterates on that guess until the iterations converge, meaning each iteration changes the solution by less than 10^-6, then it quits and sends you back new values of m and b. The only difference between that code and the L1 code is an absolute value instead of a square. Everybody agree this is super simple? You may be amazed we're covering it in class, but it has a point.
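Here is a minimal sketch of what those two objectives and the fminsearch calls look like. The lecture uses separate function files; the names below and the use of anonymous functions to pass the data in are my assumptions:

    % L2 (least-squares) and L1 objectives for the line fit c(1)*x + c(2)
    linefit_L2 = @(c, x, y) sum( (c(1)*x + c(2) - y).^2 );
    linefit_L1 = @(c, x, y) sum( abs(c(1)*x + c(2) - y) );

    opts    = optimset('TolX', 1e-6, 'TolFun', 1e-6);   % iterate until changes < 1e-6
    coef_L2 = fminsearch(@(c) linefit_L2(c, x, y), [1 1], opts);
    coef_L1 = fminsearch(@(c) linefit_L1(c, x, y), [1 1], opts);

The only difference between the two objectives, exactly as described, is the square versus the absolute value.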
So let's run them, get the two sets of coefficients, and build the fitted lines: xs is a linspace from 0 to 5 with 100 points, the L2 fit is coef_L2(1)*xs + coef_L2(2), which is mx + b with those coefficients, and the L1 fit is the same expression with the L1 coefficients. The data points are already plotted, so hold on and plot xs against both fits.

There's L2 versus L1, and they're not that different here. The blue line is the L2 fit and the orangish-red one is the L1 fit. But I want to highlight what's going on. You see that point over there? That's what's tilting the blue line down. L2 is trying to minimize these distances, and it's incurring a lot of penalty from that point because it has to take that distance and square it, so it says: if I can make that a little smaller, my score goes down. So it tips the line toward it. The L1 fit more or less stays put; somehow it doesn't react much to that point. At this stage you might still say, big deal, they're not that different. So what?

All right, let's make it a bigger deal. Let's throw in some outlier data, one point here and one here, and run it again. Look at the L1 fit: it hardly changed. It just says, I see you, you get a penalty score, and that's it. The L2 fit starts shifting quite a bit, because it's squaring those distances. So you can start to see that the difference between L2 and L1 can be significant.
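To reproduce that experiment, a short sketch; the specific outlier coordinates below are made up for illustration, not the ones typed in the lecture:

    % Append two corrupted points and refit with both objectives
    x_out = [x 0.5 4.6];
    y_out = [y 7.5 0.3];
    c2 = fminsearch(@(c) sum((c(1)*x_out + c(2) - y_out).^2), [1 1]);
    c1 = fminsearch(@(c) sum(abs(c(1)*x_out + c(2) - y_out)), [1 1]);
    xs = linspace(0, 5, 100);
    plot(x_out, y_out, 'ro', xs, c2(1)*xs + c2(2), xs, c1(1)*xs + c1(2), 'LineWidth', 2)

The L2 line tips toward the corrupted points; the L1 line barely moves.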
Now here's going to be my point: how do outliers get into your data set in the first place? Let me give you some examples I've seen in my own data analysis. I was working with temperature recordings from the Rio de Janeiro airport; we were looking at dengue outbreaks versus temperature, at how climate associates with them. In that data set there's a daily temperature record, a morning reading and an afternoon reading, and somewhere in the summer, for about a two-week period, there's no data. The person taking the recordings went on vacation. Now, what's interesting is that if you just load the data, you don't see that. Oftentimes MATLAB and other programs will fill in a missing value for you: a zero, a NaN, an infinity. Do you know what your program does? Because if you don't, you have data plus missing data that's being silently filled in, and all of a sudden your summer temperatures, which should sit at high values, get filled in with zeros and the whole fit gets pulled down by them. That's one possibility. Another one: fat fingers. You're typing in 31.5, then 31.6, then 32-something, and you slip and type 71, or 311. Or you copy and paste and you thought you highlighted all of 113, but you only got the 3. I'm not saying you aren't flawless, but as you contemplate your flaws, you might have mistyped something.

These things happen. This is the nature of data science: you have a lot of data, and it's almost always going to be flawed by something like this. That's the nature of real data, and if you have it, you're going to have problems unless you build in some robustification. It could be you're downstream of this data and have no idea how it happened; there may be a very simple explanation two levels above you that you'll never get back to. You have to deal with the data as it is, and you can't just say "I don't like this point, I'll throw it away." I've said this over and over: you can't do that. Here the bad points stay in the data, and you see the L1 fit hardly reacts to them.

So what we're talking about is a tremendous difference between the L2 and L1 norms. Up to this point in your life you've probably dealt mostly with L2, with least-squares fitting. Now we're going to talk about L1 and what you can do with it, because L1 is going to be super important for us.

In the curve-fitting context this is often called robust statistics: methods that, just like the name sounds, robustify your regressions and estimates against corrupt and missing data, which you always need some way to handle. John Tukey, the John Tukey who invented the fast Fourier transform, who more than fifty years ago was already calling for data science to become its own discipline, and who also invented robust statistics, said roughly the following (I'm getting this second- or third-hand): it doesn't much matter what kind of robust statistics you use, just make sure you use some. That's something you should always think about. This is just one way to do it.

All right. Now we're going to move on from curve fitting, which is itself an Ax = b problem, to straight-up Ax = b, and we're going to look at the implications, because they have a big impact on this idea of compressive sensing we're building toward. Any questions? Does that make sense to everybody? What I want you to start internalizing is that different norms make a big difference.

Let's throw all of that code away (I hope I don't need it again, because I just deleted it) and make a matrix A. I'll define two parameters, m and n, and A will be an m-by-n matrix of random numbers with 100 rows and 500 columns. If you remember the matrices I drew, this is the short, fat one: 500 unknowns and only 100 constraints, so there's an infinite number of possible solutions. The right-hand side b is randn(m, 1); those 100 entries are the constraints in Ax = b, and I'm going to solve for x.
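Here is a minimal sketch of that setup plus the first two solves and their histograms; the variable names mirror the lecture's, and histogram is used in place of the older hist:

    m = 100;  n = 500;            % 100 constraints, 500 unknowns: underdetermined
    A = randn(m, n);
    b = randn(m, 1);

    x1 = A \ b;                   % backslash
    x2 = pinv(A) * b;             % pseudo-inverse: the minimum-2-norm solution

    subplot(2,1,1), histogram(x1, 40), title('backslash solution')
    subplot(2,1,2), histogram(x2, 40), title('pinv solution')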
So let's talk about solution techniques for this. Here's one: backslash. That's what you would normally do without a second thought; call it the pre-Nathan approach, as opposed to post-Nathan (I should come up with something better). Here's another way to get a solution: pinv(A)*b, the pseudo-inverse. Those are two techniques, and I'm going to throw in more: x3 from lasso, x4 from ridge, x5 from robustfit. Did you know MATLAB had this many Ax = b solvers? That's kind of my point. People think about Ax = b all the time, in more sophisticated ways than we usually teach in engineering; if you're in statistics these things are fairly common.

Then, in a loop over the five solutions, I'm going to put a histogram of each one in a subplot. In other words, I'm not going to show you the solution vectors themselves; I'm going to show you the distribution of the values each one is loaded with. After all, it's a random matrix with a random right-hand side, so the solution doesn't mean anything in itself, except that the histogram of its entries turns out to be an interesting diagnostic for how each solver builds its answer.

(And, as predicted, the live demo crashes: I typed a capital X, and then it turns out lasso, ridge, and robustfit all want extra arguments; ridge, for instance, needs a scalar k. This is not going as well as I had planned, in case that wasn't clear. So rather than fuss with those right now, let me tell you what I was trying to do and then just set the optimization up explicitly with CVX.)
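For completeness, hedged sketches of the toolbox calls that stalled the live demo. Each requires the Statistics and Machine Learning Toolbox, and the penalty values below are placeholders I've chosen, not values from the lecture notes:

    x3 = lasso(A, b, 'Lambda', 0.1);   % L1-penalized regression, one column of coefficients
    x4 = ridge(b, A, 0.1);             % L2-penalized (ridge) regression with k = 0.1
    % robustfit(A, b) is better suited to the tall, overdetermined system coming up,
    % where there are more rows (observations) than columns (unknowns).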
So here are two different optimizations, backslash and the pseudo-inverse, and they give you two distributions of values. Depending on whether the system is over- or underdetermined, they load up the coefficients in slightly different ways. The other thing I want to do is explicit L1 penalization, which is what lasso does: lasso gives you a solution by penalizing the L1 norm.

In other words, think about what it means to solve this system. How many constraints? 100. How many unknowns? 500. I can't do it straight up; there's an infinite number of solutions. So which one gets picked? A natural thing to do, and it's what the pseudo-inverse does, is to say: out of the infinite number of choices, give me the one with the smallest L2 norm, so the answer is unique. (We'll see in a moment what backslash actually does.) We could just as well have asked for the solution with the smallest L1 norm instead. What's the difference? Lasso will get us there eventually, but right now I don't want to fuss with its interface, so let's just use CVX.

CVX is a convex optimization package (that's what the name stands for) that you can download and use inside MATLAB. You declare a variable, say x3 of size n, you write down what you want minimized, here the one-norm of A*x3 - b, and you close with cvx_end. I can do the same thing with the two-norm so we can compare it against the solutions above, and then put each result in a subplot and histogram it.
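Here is a reconstruction of that CVX block, assuming the cvx package is installed and on the path. The first form is what was typed live and minimizes the residual; the second is the formulation that produces the sparse solution discussed below, and (as it turns out at the end of the lecture) the one actually intended for this underdetermined case:

    cvx_begin quiet
        variable x3(n)
        minimize( norm(A*x3 - b, 1) )   % as typed live: one-norm of the residual
    cvx_end

    cvx_begin quiet
        variable xL1(n)
        minimize( norm(xL1, 1) )        % minimum-1-norm solution...
        subject to
            A * xL1 == b                % ...among all exact solutions of Ax = b
    cvx_end

Histogramming xL1 should show roughly 400 entries at zero, matching the backslash picture described next.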
(More live-demo flailing here: a mistyped variable, a complaint about types, and then it optimizes, but the L1 result that comes out is clearly not right, so let me set it aside for a minute and just look at the top two.)

So first of all, how did I solve those top two? One is the pseudo-inverse and one is backslash. Now, it turns out that backslash doesn't actually do least-squares regression here; it effectively projects onto a QR decomposition, which has a sparse solution determined by the number of constraints. Look at the distribution of values, the loadings of the x vector, and count how many of them are zero: 400. In other words, I have 100 constraints and 500 unknowns. If I handed you this 100-by-500 system and said nobody leaves until it's solved, you might think: that sucks, but there are 500 unknowns, so I'll just set 400 of them to zero, and then I only have a 100-by-100 system left to solve. You might feel better about that. That's exactly what backslash does: set 400 of them to zero and move forward.

What you should be seeing in the L1 solution is the same kind of thing, 400 of them at zero. And here's what happens with the L2 penalty, the pseudo-inverse: look at it, it spreads everything out. Why does it do that? The L2 norm squares things, and if you take a really small number and square it, it gets even smaller. So the L2 norm wants to make everybody super small. It sets none of them to zero; everybody eats a little bit of the solution, as small as possible, because anybody who is big gets squared and creates a problem in the score.

(Now, the L1 histogram on my screen is puzzling me. There has to be a simple mistake, because this is exactly how it's coded in the notes, and in the notes the L1 result looks a lot more like the backslash picture at the top. Does anybody see it? Because I don't. I'm on fire today; I'm going to go home after this class and go back to bed. You're going to have to trust me on this one.)

All right, now let's do the opposite case. Make the matrix 500 by 100, so now we have 500 constraints and only 100 unknowns. There's no way to satisfy 500 constraints with 100 unknowns unless many of those constraints happen to coincide, and that's generically not the case, because I made a random matrix. There is no solution; nothing satisfies Ax = b. So what do I do instead? I say: I can't satisfy it, so I'll mostly satisfy it. I'd love Ax = b to hold, it never can, so I'll make Ax - b as small as possible. That's what happens in an overdetermined system. And again the question is: small in which norm, L2 or L1? So let's set up those matrices and solve, again with backslash and with the pseudo-inverse.
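A sketch of that overdetermined comparison; the setup mirrors the description above, with robustfit added back in now that the system is tall, and the code details are mine:

    m = 500;  n = 100;            % 500 constraints, 100 unknowns: overdetermined
    A = randn(m, n);
    b = randn(m, 1);

    x_l2  = A \ b;                % least squares: minimizes norm(A*x - b, 2)
    x_rob = robustfit(A, b);      % robust regression (first entry is an intercept term)

    cvx_begin quiet
        variable x_l1(n)
        minimize( norm(A*x_l1 - b, 1) )   % one-norm residual: far less sensitive to outliers
    cvx_end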
And now, when I do this, I know why things looked off. Hold on. Remember how all my papers fell on the floor to start off my awesome morning? Two pages got shuffled, and I'm just starting to realize it. What I was solving up above, minimizing the norm of Ax - b, is what you would do for the overdetermined case. In the underdetermined case I can actually satisfy Ax = b exactly, because with only 100 constraints and 500 unknowns there are infinite ways to satisfy it, so what I really want there is to minimize the norm of x itself subject to Ax = b. It's going to take some time to fix, so let's pretend today maybe didn't happen. But if you look at the notes, all the code is there and it works out for both cases; you can download the CVX package, and you should be reading the notes anyway.

So what's my main point? My main point is this: there are two norms we're going to work with, L1 and L2, and even though it unfortunately didn't show up on screen today, what L1 does is promote what's called sparsity. The L1 norm gets penalized for every nonzero entry, so it asks how many entries it can set to zero and leaves as few as possible nonzero. L2 regression, on the other hand, tries to make everything small, because it gets to square things, and things that are small become even smaller when you square them. So what L2 and L1 solutions look like is fundamentally different. And what the L1 norm is going to allow us to do in the next lecture is this idea called compressive sensing: take a small number of measurements and recover an entire signal. It seems like an impossible trick, but I'm going to show you that it all depends on that L1 norm. All right, awesome day. Enjoy your day.
Info
Channel: Nathan Kutz
Views: 1,523
Rating: 4.9354839 out of 5
Keywords: compressive sensing, robust statistics, curve fitting, kutz
Id: n2_1BW8dTBA
Length: 47min 41sec (2861 seconds)
Published: Sat Feb 06 2021