The Bernstein Basis

Captions
Welcome to Mutual Information. My name is DJ. In this video we're going to learn about the Bernstein basis for constrained curve fitting. To see why that's a good idea, imagine you were faced with this data and you'd like to fit a function to it, but you want that function to obey some constraints. For example, let's say the function has to be monotonically increasing. How might you do that? Or what if you came across data like this, and you want your function to be constant over a certain range? How might you do that? Or again, what if you came across data like this, and you want your function to take the same value at either end of the range? How would you do that?

There are multiple ways to achieve this, but a great option is to use the Bernstein basis. If I had to sum it up in one sentence, I'd say it's a specialized basis expansion used in a regression model where the coefficients are approximately the outputs of the model at certain points, and since coefficients are easy to control, it's easy to control the model's output while fitting data. So in this video I'll first explain what the Bernstein basis is, then I'll show how to use it for fitting univariate data while obeying constraints. After that, I'll generalize to the multivariate case, which will make this idea more flexible and powerful. And if that isn't enough for you, at the end I'll point to my sources, which take this idea to the moon and back. Sound good? Well then, let's get to modeling.

Okay, our starting point, as always, is our data. In our simple univariate case, each observation is a pair of two numbers, x_i and y_i. Now, before I can tell you about the Bernstein basis, I need to mention it has a requirement on our data: it requires all x values to fall between 0 and 1. Fortunately, that's not that limiting; all you need to do is rescale your data so everything falls within that range. For example, you could do min-max scaling. Just make sure your mins and maxes aren't crazy outliers, or else your data will be crammed into a tiny range.

Cool. Okay, now, the Bernstein basis is a type of basis expansion. That means it's going to map each x_i to a vector, and then we're going to regress the y's on those vectors. If this idea sounds unfamiliar, check out my video on linear regression and ordinary least squares; otherwise, let's stay focused. We can define the Bernstein basis as follows. First, you need to pick a positive integer, which we'll represent with a capital M. This number determines how many elements are in the output vector. With that, the m-th output is given by this expression. It turns out that if you compute all these functions, you get something which maps a number between 0 and 1 to a vector of length M + 1, where each element is also between 0 and 1.
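The expression shown on screen isn't reproduced in the captions, but the basis being described is the standard set of Bernstein polynomials of degree M: b_m(x) = C(M, m) x^m (1 - x)^(M - m) for m = 0, ..., M. Here is a minimal NumPy sketch of that number-to-vector mapping (the function and variable names are illustrative, not from the video):

```python
import numpy as np
from scipy.special import comb

def bernstein_basis(x, M):
    """Map x in [0, 1] to the length-(M + 1) vector of Bernstein polynomials
    b_m(x) = C(M, m) * x**m * (1 - x)**(M - m), for m = 0, ..., M."""
    m = np.arange(M + 1)
    return comb(M, m) * x**m * (1.0 - x)**(M - m)

# Each element lies in [0, 1]; picking a larger M just gives a longer vector.
print(bernstein_basis(0.3, M=2))   # [0.49 0.42 0.09]
print(bernstein_basis(0.3, M=4))   # [0.2401 0.4116 0.2646 0.0756 0.0081]
```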
To picture that, let's pick M = 2 and plot each basis function as a function of x. Okay, how about M = 4? And M = 6? All this shows is how selecting an M gives a different number-to-vector mapping. Next, let's make two observations. First, each basis function may be associated with one of M + 1 equally spaced points. Second, at any point, if you sum the basis function outputs, you get a value of one. Very roughly, you can stitch these ideas together to recognize that when you feed a basis function a number, its output tells you how relevant one of those M + 1 equally spaced points is to the point you fed in. That'll be useful intuition to keep in mind. In fact, that observation lets us get to the punch line of the Bernstein basis. Ready?

Okay, for a given well-behaved coefficient vector theta, we get this nice property, which I would call "the nice property" because it's very nice. What it means is: if you plug one of those equally spaced points into the basis expansion and then take its dot product with the coefficient vector, you get back approximately the coefficient of the basis function associated with that point. Okay, not a simple sentence, so I'll say it differently: if you plug m over capital M into the basis expansion and then take a dot product with the coefficient vector, you get something close to the m-th coefficient.

Now, at this point I suspect two questions: what do we mean by well-behaved, and how good is this approximation? Well, to answer that, check out this. What I'm saying here is: if you think of the coefficients themselves as outputs of some continuous function f, then well-behaved means f isn't too curvy. The quality of the approximation depends on a few things. It depends on the curviness of f, but also on capital M, which means it depends on the length of the vector. It turns out that if you hold f fixed, the approximation improves as M gets large. But in practice you may find this approximation to be pretty bad. That happens if you're trying to model something complicated. In this case you should consider the coefficients as attractors: values your output moves towards but can't quite reach. It can't reach them because the output function can achieve only so much curvature for a fixed M. The good news is that a poor approximation doesn't prevent us from applying constraints; the same idea works if the coefficients operate merely as attractors.

With that, I can say why this is useful for modeling. In linear regression, this dot product is our model's prediction, so this relates, in a super simple way, the model's predictions at certain points to certain coefficients. That leads to the following idea for modeling. First, make the linear regression assumption: y_i is the dot product of our basis expansion of x_i with our coefficient vector, plus some noise. Then form this loss, which is mean squared error plus a term that measures how much our coefficients deviate from our constraints. If that term is zero, our coefficients, and therefore the model's outputs, are abiding by our rules.
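To make the nice property concrete, here's a small numerical check of my own (not from the video): set the coefficients to the values of a smooth function f at the M + 1 equally spaced points, and compare b(m/M) · theta against theta_m.

```python
import numpy as np
from scipy.special import comb

def bernstein_basis(x, M):
    m = np.arange(M + 1)
    return comb(M, m) * x**m * (1.0 - x)**(M - m)

M = 10
f = lambda x: np.sin(2 * x) + 0.5 * x        # a smooth, not-too-curvy f
theta = f(np.arange(M + 1) / M)              # coefficients = f evaluated at m / M

# The nice property: b(m / M) . theta is close to theta_m (exact at the endpoints).
for m in [0, 3, 7, 10]:
    approx = bernstein_basis(m / M, M) @ theta
    print(f"m={m:2d}  theta_m={theta[m]:+.3f}  b(m/M).theta={approx:+.3f}")
```

For a curvier f, or a smaller M, the two columns drift apart, which is exactly the "attractor" behavior described above.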
Now, to make this idea tangible, let's walk through an example. Before I show the data, let's first walk through how the model makes predictions. First, we need to choose that number M; let's go with M = 4, which gives us these basis functions. Next, let's make up some coefficients. Remember, the coefficients may be interpreted as approximate model outputs for equally spaced inputs, so we can plot them on the x-versus-y graph. The coloring here shows the pairing of each coefficient with each basis function. Now, to make a model prediction at a point, we pass it into our basis functions, giving us a vector. Then we take the dot product of that vector with our coefficient vector, giving us a single model prediction. In fact, we can do better: let's consider all predictions with this view. Notice how the curve gets real close to the coefficients. This is the nice property we talked about; this is why we're using the Bernstein basis. That said, you could have coefficients where the approximation is bad. In that case the coefficients act more like attractors: if you move one, the model gets pulled in that direction.

With that, we're ready to plot the data. To start, let's fit the data without any constraints. That's a good-looking fit; it's clear the unconstrained Bernstein basis is pretty flexible. But now let's make it monotonic. The first step in meeting that constraint is to measure violations of it. That means we need to ask: what is a measure of our coefficients that gives a value of 0 if our model is monotonic, but a large positive number if our function is very non-monotonic? Well, since these coefficients are basically function outputs, we can penalize non-monotonicity by taking the difference between each coefficient and the next one, but only where it's positive, and then summing all those up. That means our loss is this expression. Also, we've set our lambda parameter to 100, which means we have virtually no tolerance for violations. Now, if we optimize this, our model obeys our constraint and fits the data like this. Pretty cool, right? I'm not sure about you, but I find this all very satisfying.

Okay, one question I expect to hear is: how do you optimize this user-defined loss function in the general case? Well, I'd like to leave the answer intentionally open-ended. There are many approaches to optimization, and the efficient ones will take advantage of the specific form of your loss. But for practical purposes, I'll tell you I'm using PyTorch, which provides a huge amount of flexibility in optimizing losses.
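The video's code isn't shown in the captions, but here's a minimal PyTorch sketch of the kind of fit being described: mean squared error plus lambda times the summed positive differences between consecutive coefficients. The toy data, optimizer settings, and names are my own, not the video's.

```python
import torch

def bernstein_design(x, M):
    """Rows are b(x_i): the degree-M Bernstein basis evaluated at each x_i in [0, 1]."""
    m = torch.arange(M + 1, dtype=x.dtype)
    log_binom = (torch.lgamma(torch.tensor(M + 1.0))
                 - torch.lgamma(m + 1) - torch.lgamma(M - m + 1))
    return torch.exp(log_binom) * x.unsqueeze(1) ** m * (1 - x.unsqueeze(1)) ** (M - m)

# Toy data: mostly increasing, with mild dips an unconstrained fit would chase.
torch.manual_seed(0)
x = torch.rand(200)
y = x + 0.2 * torch.sin(6 * x) + 0.1 * torch.randn(200)

M, lam = 10, 100.0
B = bernstein_design(x, M)
theta = torch.zeros(M + 1, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.05)

for _ in range(2000):
    opt.zero_grad()
    mse = ((B @ theta - y) ** 2).mean()
    # Monotonicity penalty: positive part of theta_m - theta_{m+1}, summed over m.
    penalty = torch.relu(theta[:-1] - theta[1:]).sum()
    loss = mse + lam * penalty
    loss.backward()
    opt.step()
```

Any other constraint that can be written as a penalty on theta slots into the same loop in place of the monotonicity term.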
Okay, now this feels like a natural place to stop, right? No, absolutely not. You can't stop at the one-dimensional case. If we want to be prepared for the wilderness, we must handle the multivariate monsters. But be prepared: the notation here gets a little heavier, so muster up some courage. If it's hard to follow, just wait; the examples afterwards should clear it all up. Ready?

Okay, just like last time, we start with our data. The only difference this time is that x is now a length-d vector. In the previous case, the expansion mapped from a number to a vector; in this case, it maps from a length-d vector to a tensor with d dimensions. We'll call that tensor B(x). To calculate it, you need to decide on a univariate basis expansion for each element of x. That means when you plug those elements in, you get d vectors. Then B(x) is the outer product of all those vectors, granted you generalize the idea of an outer product to work with more than two vectors and produce a tensor. For example, if x has length three, then B(x) will be a 3D tensor. Now, if you'd like to understand that idea precisely, read this big chunk here, where I'm explaining the logic you'd use when implementing this idea in code. I'm also using the idea of broadcasting, which enables the outer-product idea to work on multiple vectors.
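The on-screen code isn't included in the captions, but here's one way that generalized outer product could be implemented with NumPy broadcasting (a sketch under my own naming; the per-dimension degrees are free choices):

```python
import numpy as np
from scipy.special import comb

def bernstein_basis(x, M):
    m = np.arange(M + 1)
    return comb(M, m) * x**m * (1.0 - x)**(M - m)

def multivariate_basis(x, Ms):
    """Map a length-d point x (each element in [0, 1]) to the d-dimensional tensor
    B(x) of shape (M_1 + 1, ..., M_d + 1): the generalized outer product of the
    d univariate basis vectors."""
    B = np.ones(())                      # start with a 0-d tensor
    for xj, Mj in zip(x, Ms):
        v = bernstein_basis(xj, Mj)
        # Add a trailing axis to B so the new vector broadcasts against it,
        # accumulating the outer product one dimension at a time.
        B = B[..., np.newaxis] * v
    return B

# Example: d = 2 with degrees (6, 8), as in the 2D example below, gives a 7 x 9 matrix.
B = multivariate_basis(np.array([0.3, 0.7]), Ms=[6, 8])
print(B.shape, B.sum())   # (7, 9) 1.0, up to float rounding
```

A model prediction at x is then just (B(x) * theta).sum() for a coefficient tensor theta of the same shape, mirroring the univariate dot product.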
With that, we can review the nice property. Before you study this equation, let me describe what it's saying. First, consider the space where x lives, which is the unit cube in d dimensions. Now consider an evenly spaced lattice of points covering that space, and pick one of those points. Plug that point into B, giving us back a tensor in d dimensions; I'll call this B of the grid point. Next, imagine we provide a coefficient tensor of the same shape. Then we flatten B of the grid point and theta into long vectors and take their dot product. That gives us a number which is close to a value within theta: the one associated with the grid point we plugged into B. In other words, you can think of theta as listing out, approximately, all the values you'd get if you took its dot product with B of the different grid points.

Okay, so that covers the multivariate Bernstein basis, but we still need to bring it back to the world of modeling. Fortunately, that's easy. We just assume y_i is the dot product of B(x_i) and theta, plus some noise, and then we proceed just as we did earlier. As in the previous case, it'll be easy to control the outputs of this model by adding a term to our loss function which penalizes values of theta.

So now, are we done? No, absolutely not. I can't let this explanation float away in the abstract; I need to nail it to the floor with an example. So let's say we have this data. In this case each x_i is a length-two vector giving us the coordinates of these points, and their color tells us the y_i values. As you know, the goal is to fit a function to this data, though to explain the Bernstein basis approach we'll need to take a little detour, so let's drop the data for a second. The first thing we need to do is determine a univariate Bernstein basis for each element of x. That means we need to pick M1 and M2; let's go with 6 and 8, which corresponds to this grid over the 2D space. With that, we can form B(x). To show it, I'll pick an arbitrary x point on the left and show you the tensor it gives us. In fact, in this 2D case it gives us a 2D tensor, which is just a matrix. Notice that if x moves, the matrix changes. You'll notice it looks like it weighs the relevance of the whole space to the given point. Next, we need to bring in theta, which gives us enough to form model predictions. The model's prediction at x is given by creating B(x), multiplying it element-wise with theta, and summing it all up. But we care about all predictions, so let's view those as contour lines. Now, the nice property tells us you can consider the theta matrix as a rough approximation to those contours, or at least as the attractive force pulling those contours around.

Next, we can execute our objective, which is to fit our data. For now, let's ignore constraints and just fit the best we can according to mean squared error. Okay, not bad, but we can do better. My eyeballs suggest there's some symmetry we aren't taking advantage of, so we could add that as a constraint. We just penalize theta for any deviation away from top-bottom symmetry or left-right symmetry, and we add that to our objective. Now let's see. Oh, way better. So I hope this sells you a bit on the Bernstein basis: it creates an easy way to merge what the data tells you with what you assume to be true.

That said, I should comment on when you should use this approach. Basically, it's useful when you have strong opinions on the nature of the model's output and those opinions aren't already captured in the data. Also, this approach is for low-dimensional problems. The main reason is that your parameter count grows with the dimensionality of your input space; if you have 50 dimensions, you'd be handling a 50-dimensional tensor, which can't be done on any normal computer. That said, there are ways of building multi-dimensional models out of low-dimensional ones. For example, generalized additive models provide such a mechanism, so if that's your use case, check that out.

But before we end, I must mention my sources. I originally came across this idea from the clever folks I work with at Lyft, so in the description I include the papers we passed around to get comfortable with this idea. There you'll see the idea fleshed out super thoroughly, so if you're going to use it, make sure to check those out. And finally, thank you for your focus. If you enjoyed this video and would like to continue learning about statistics and machine learning, please like and subscribe. Content like this is content I can keep making, especially if I can get your support.
Info
Channel: Mutual Information
Views: 492
Rating: 5 out of 5
Id: DNFhI_Op4y0
Length: 14min 6sec (846 seconds)
Published: Tue Jun 15 2021