Vincent Warmerdam: Gaussian Progress | PyData Berlin 2019

Captions
Cool, so hi everyone, welcome to my talk. I'm going to talk about this thing called Gaussian Progress which, bit of a pun intended, is the most normal topic I could come up with. Gaussian processes, I think, are this wondrous thing, but when I was thinking about it, the feeling you often get as a machine learning professional is that there are so many algorithms out there; that might be your feeling on a day-to-day basis, and there's this pressure that you need to keep up with all this new tech, and that can be really demotivating. This talk is not about downplaying that feeling, I think it's just there, but I do want to show a really lovely hack, and that is understanding a mother algorithm. Many, many algorithms tend to share a component, and if you can really understand one single component that's being used everywhere, then everything else might become easier to understand. The Gaussian, I think, is one of those core concepts, because it turns out that if you appreciate what the Gaussian distribution can do, there are lots of algorithms that are just much easier to grasp. This talk is an attempt at explaining the power of the Gaussian by stepping up the ladder of complexity of algorithms.

So today I will literally introduce Gauss, then I will explain the Gaussian trick for classification, clustering and outlier detection. I will then show how you can make neural networks get properties that might make them better by using the Gaussian, and then I will attempt to live-code a Gaussian process. This will be tricky, but I think we will manage.

To introduce what Gauss is, there's a bit of a story about a preschool. You have to imagine there's a town, there's a preschool, and there's a teacher, and the teacher is lazy. The teacher doesn't really feel like teaching any of the kids anything, because the teacher really just wants to read the newspaper. So what the teacher says is: hey children, please add the numbers one up to a hundred. That way the children are busy doing arithmetic and the teacher can read the newspaper; that was sort of the plan. But then there was one student who looked at the task and said: all right, this is the number one, and two, and all the way up to a hundred, but what I can also do is rewrite it. I can take the numbers one up to fifty, and I can take the numbers fifty-one back up to a hundred, and if I reorder the numbers this way, one thing you then see is that 1 + 100 is 101, 2 + 99 is 101, all the way up to 50 + 51, which is 101. Just by reordering things a little bit, 101 times 50 is a whole lot easier than doing all the arithmetic with all the additions. So there was a student in the classroom who figured this out, and once the student figured this out, the student got bored and thought about it some more. The student said: ah, another thing I could do is write the numbers one up to a hundred, and underneath them a hundred down to one; each column again sums to 101, but now you've got a hundred of them even though you're only interested in fifty, so you divide by two, and it's the same number. And you can also ask: does this generalize? If I add this all the way up to n, can I still apply the same trick? The answer is yes, it's the exact same idea.
But by telling you this story, it's a lot easier to remember, and then you sort of realize: ah, this is what mathematics is supposed to do; math is kind of like a compiler, but for numbers, in a way. The reason it's relevant to share this story is that this actually happened, and the kid who did this, that kid's name was Gauss. This story is actually attributed to the guy who came up with the normal distribution. The thing I like about this story is how, by thinking about this building block, you can do arithmetic better, and by understanding Gauss you can understand machine learning a whole lot better. The one thing I'm always a little bit bummed out about, though, is that this is a really cool story, it's inspiring, and it helps you understand what is happening, but when you open a math book, or a book on any topic like this, they typically explain it by showing you a dry formula, which is a bit of a bummer. So what I'm going to try to do now is go in depth into this Gaussian thing, literally, but I'm going to omit the formula and focus on the picture. That does mean I'm going to ignore a whole bunch of details, which in the end are important, but what I'm trying to achieve here is intuition.

So there's a certain shape: this is the normal distribution, and the idea is that there's a lot of stuff in the center and some stuff on the sides. A lot of things in human nature resemble this in some way, not perfectly, but like human heights: that's a sort of Gaussian-like distribution. Human weight, tons of stuff. Again, this is the shape, this is the formula, but why is this useful? Well, because you can do some clever hacks with it. Suppose I've got two of these Gaussian distributions. Let's say I have the height of people, and there's one subgroup of people with distribution A and another group with distribution B. If I have some person with height x, then by looking at the difference in likelihood between one Gaussian and the other, I can attempt a kind of classification, and this is a concept we're going to have to remember: you can compare two Gaussians to say, hey, if this Gaussian represents one group of people in terms of height, and this other Gaussian represents another group, you can really compare the two groups by looking at the likelihoods. Another convenient thing you can do with the Gaussian is to say: there's a lot of mass in the center, but on the outskirts there's some threshold where we can say, hey, that's an outlier. At some point the point is so far away from the bell curve that you literally can't hear it ringing anymore, and that's also a useful concept. So let's keep these two concepts in mind.
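As a rough sketch of those two ideas in code (the group parameters and the outlier threshold below are made-up illustrations, not numbers from the talk):

```python
# Compare the likelihood of a point under two Gaussians, and flag points
# that are unlikely under both.
from scipy.stats import norm

group_a = norm(loc=170, scale=7)   # e.g. heights of group A (made-up parameters)
group_b = norm(loc=185, scale=6)   # e.g. heights of group B (made-up parameters)

def classify(x):
    """Pick the group under which x is more likely."""
    return "A" if group_a.pdf(x) > group_b.pdf(x) else "B"

def is_outlier(x, threshold=1e-4):
    """Call x an outlier if it is unlikely under *both* Gaussians."""
    return max(group_a.pdf(x), group_b.pdf(x)) < threshold

print(classify(172))      # closer to group A's bell curve
print(classify(190))      # closer to group B's bell curve
print(is_outlier(250))    # True: far away from both bell curves
```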
What I'll now try to do is show you that with these two concepts, also in higher dimensions, you can really solve most problems. I should say there's also a two-dimensional version of the formula out there, but to skip that math, I'll just show you how this helps when you have a couple of points in a point cloud. A Gaussian distribution in higher dimensions essentially says: in some higher-dimensional space there's a blob of points, there's some sort of mean, there's always some sort of highest point, and you can see that here: there's a mu1 and a mu2, a little bit of mathematical notation, I apologize, but this is the idea, this is something that describes a Gaussian in higher dimensions. What's equally important is not just the means but also the spreads: there's a spread on the x1 axis and a spread on the x2 axis, and that's also very definitive of a Gaussian. The final thing that makes it definitive is this notion of correlation, or covariance: the direction in which the Gaussian is pointing. That's depicted in this covariance matrix, and the idea is: if I have these numbers, then I have this shape well defined, and that is what a Gaussian essentially is. The correlation can flip, so it can point in a different direction, and you can also say there's no correlation whatsoever, but you can express a Gaussian by knowing the mean of the distribution and saying something about its variance; that defines the Gaussian, if you will.

So let's see if we can use a Gaussian to already make another algorithm better, and I'll take k-means as an example. K-means, if you're unfamiliar with it: the idea is that I have some data set and I would like to cluster it. The way this typically works is that you say, ah, there are probably three clusters, and I want an algorithm to automatically find them. You start with a couple of centroids; these centroids go out there and look for their nearest neighbours, so every point is closest to one centroid. We allocate all of these points to a centroid, then we take the mean of the points assigned to each centroid, and we move the centroid so it sits nicely in the middle of all those points. The means then shift a bit, you repeat this, and at some point the algorithm converges, and then you can say: all the points that are closer to one centroid belong to one cluster, all the points closer to the other centroid belong to another cluster. There are a couple of cool things about this algorithm: it will always converge, because every time you do this the total distance between the centroids and their neighbours decreases, so you can prove that it converges. It's also a very nice two-step approach: there's one step where you say "I'm looking for all my neighbours", and another step where you say "given the neighbours that I have, I want to make a good move in the right direction".
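As a rough sketch of that two-step loop (assign every point to its nearest centroid, then move each centroid to the mean of its points), here is a minimal k-means in numpy; the blob data and iteration count are made up for illustration:

```python
import numpy as np

def kmeans(X, k=3, n_iter=20, seed=0):
    """Minimal k-means: alternate hard assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 1: every point gets a hard assignment to its single nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # step 2: move each centroid to the mean of the points assigned to it
        # (ignoring the empty-cluster edge case for brevity)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# three made-up blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)),
               rng.normal([6, 0], 1, (50, 2)),
               rng.normal([3, 5], 1, (50, 2))])
labels, centroids = kmeans(X, k=3)
```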
But this algorithm doesn't make that much sense if you think about it, because there's a lot of stuff that's iffy about it. Suppose I have this as a starting point and I have these two centroids. Sure, I can divide the points up, but for all of these blue dots in the middle, I don't think a hard allocation to either cluster makes sense; it might be soft clustering at best. At that point I think it's a really weird idea to say: you, member of my population, belong to one cluster and one cluster only. It might make more sense to say you can be part of one cluster for, like, 30 percent and part of another cluster for 70 percent. So how about this: I'm going to replace the notion of a centroid and just put a Gaussian in there. Instead of saying there's a centroid and all the points belong to it, I'm going to do the exact same thing, but I'm going to be moving a Gaussian around in this space, and the nice thing is that because I have a distribution that maps to some sort of likelihood, maybe I can do soft clustering, and I get some other benefits as well.

So the idea is that you start out with two Gaussians; they can be out there somewhere in space, and what you're then going to do is figure out, for every single point, which Gaussian distribution the point might have belonged to. In this example, and I hope you can see it on the beamer, there are a lot of red dots, which are obviously closer to the red distribution than the green one, there are a couple of bright green dots which are obviously part of the green distribution, and there are some teal, turquoise, blueish dots in the middle where we're a little bit unsure. Now that I've figured out which neighbours are associated with which Gaussian, I can use a weighted moving average, if you will, to move each Gaussian in the next step. It's again just the same as k-means, except I'm moving a Gaussian around; that's the difference. The nice thing is that you can keep moving this around, and at the end you can say: hey, I've again clustered my things, but you can also say, well, this is one member that's not really part of any cluster, and this is a way of clustering that also allows for covariances, so you can say: hey, it's not a perfect circle like in k-means, it's a different shape. This is basically k-means but, in my opinion, just better, and it's just a consequence of the fact that there's this distribution we can go ahead and use. Again, the intuition is just this, and there's also a gif on Wikipedia that shows how it works.
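The same assign-and-update idea with Gaussians instead of centroids is what scikit-learn's GaussianMixture implements; a small sketch with made-up blob data, where predict_proba gives exactly the 30%/70% kind of soft membership described above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),
               rng.normal([4, 4], 1.5, (200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

hard = gmm.predict(X)         # hard labels, like k-means would give you
soft = gmm.predict_proba(X)   # soft membership: e.g. 30% one cluster, 70% the other
print(soft[:3].round(2))
```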
The downside is that if you really want to look up this Gaussian mixture model, if you go and Google what it is, what you're going to get is this formula, and this formula. If you understand Gaussian distributions, that's great, because that is the deep dive: once the intuition is there, you can use it to go to the lower level and figure out what exactly is happening and how this algorithm might be efficient. But if you're trying to learn how to do clustering, but better, and this is the first thing that hits your retina, then not only is it going to confuse and demotivate you, it's also distracting you from what I think is the coolest observation: this is hugely applicable. What I've just shown you is that we can use it for clustering, but it has so many ridiculous applications; you can even make neural networks better with this, and I'll have a bit of a rant on that later.

Remember I said in the beginning that if I've got one Gaussian for one class and another Gaussian for another class, I can compare the likelihoods, and that way I can say a point might belong more to one class than the other. Well, I can fit a Gaussian mixture on each class, and then, using a model that's typically used for clustering, I can also do classification, and the same thing holds for outliers. So suppose I had this data set, and it's a bit contrived, but I have some group A, some group B, and some group C. The idea is: take everyone from group A, take them separately, train a Gaussian mixture on that, and that gives an impression of how people from group A are distributed. You can do that for group B, you can do that for group C, and that means that for any new point within the realm of x1 and x2 you can ask: what's the likelihood of it belonging to group A, to group B, and to group C? This is the one-dimensional representation of that, and this is the two-dimensional representation. And, oh yeah, by the way, if you do this you get outlier detection for free: at some point you can say, this point is so far away from any distribution of the classes I've seen that I can simply yell "this is an outlier, this is not something I can deal with". The funny thing is, this is a huge safety mechanism in machine learning. There are very few algorithms out there that have a way of saying: look, this data point is so far away from anything I've seen that maybe I don't want to automate this decision. Being able to say "hey, this may be an actual outlier, it's really far away from anything I've ever seen" is a really convenient property. And this is just fitting a bunch of Gaussians to data, but already we see it can be used for clustering, classification and outlier detection in one; very few algorithms actually do that. By the way, this is all probabilistic, so if you're into probability theory you get all sorts of properties you can really use, and if you're into math, because everything is a Gaussian here, there are lots of convenient tricks and the math is quite solvable. A final detail: the Gaussian mixture can fit pretty general shapes of data. You might have to tweak the parameter that says how many Gaussians we throw into the mix, so that's a fair problem, it's a bit of a detail, but very weird and arbitrary shapes can be fitted with this; if your data is not linearly separable, this trick will totally work.

And I'm not just saying this, I actually implemented it. What I have here is an example with the two-moons dataset from scikit-learn, and what you can see, I hope, on the beamer, is that I've essentially said: hey, I'm just fitting a bunch of Gaussians here, and once it's done fitting, this is the likelihood distribution. You can see there are some peaks here, those are the peaks of the Gaussian distributions, if you will, and together they form a really nice arbitrary shape. Once this is fitted I can say there's some threshold around it and please do outlier detection: any yellow point here is considered an outlier, any purple point here is considered good enough to maybe make a prediction for. You can tweak how far you want the spread to go, and you can tweak when you call something an outlier or not; again, because it's a probability distribution, you can quite easily do this.
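A rough sketch of that recipe with plain scikit-learn on the two-moons data: fit one mixture per class, classify by comparing log-likelihoods, and refuse points that are unlikely under every class. The component count and the threshold below are arbitrary choices for illustration, not the values used in the talk:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

X, y = make_moons(n_samples=500, noise=0.08, random_state=0)

# one small Gaussian mixture per class: each mixture describes
# "how are points of this class distributed?"
mixtures = {label: GaussianMixture(n_components=8, random_state=0).fit(X[y == label])
            for label in np.unique(y)}

def predict(x, outlier_threshold=-10.0):
    """Classify by comparing per-class log-likelihoods; refuse far-away points."""
    x = np.atleast_2d(x)
    scores = {label: gm.score_samples(x)[0] for label, gm in mixtures.items()}
    best = max(scores, key=scores.get)
    if scores[best] < outlier_threshold:      # unlikely under every class
        return "outlier"
    return best

print(predict([0.0, 1.0]))    # a point on the upper moon
print(predict([4.0, 4.0]))    # far away from both moons -> "outlier"
```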
Another thing you can say is: suppose one of the moons was from one class and the other moon from another class; then you also get a nice classification boundary, and you can still work with probabilities, and because everything is again a Gaussian distribution, you can do math with this stuff afterwards as well. If you feel like playing with this: me and a bunch of colleagues and friends, and we even had a guy from Brazil contribute, have made this project called scikit-lego, and the idea is that we have some Lego bricks that fit well into the scikit-learn ecosystem but might be missing from it. It's kind of an opinionated package, basically as opinionated as: if me and Matthijs agree, then we add it. But if you want to play with the GMM classifier or the GMM outlier detector, go ahead, it's implemented, we'd love for people to use it, definitely have a play with it, it is scikit-learn compatible. So it's not every day that you come across an algorithm that can do all of this stuff, and I think that's fairly impressive, if not inspiring, because if you can just understand the Gaussian, then you can pretty much already do half of what you want to do in data science.

But these days it's fairly hip, you know, so we should talk about deep learning. What I want to show now are just two examples of how you can use this Gaussian mixture knowledge to make neural networks behave maybe better. One thing you can do is say: how about I just glue some Gaussians on top of the neural network, what will happen? Here's an example of something called a mixture density network. The idea is that some x goes in, but before the final output there's an intermediate layer where some of the output nodes are called mu nodes, some are called sigma nodes, and some are called pi nodes. The idea being that you have a mu_1, for example, which is the mu of a Gaussian, a sigma_1, which is the sigma of that Gaussian, and pi then says: for this particular x, does this Gaussian have a lot of influence on the prediction, yes or no? By doing this, the neural network suddenly has a multimodal output: it's not giving a single point estimate, it's actually being trained to learn a proper probability distribution. Not a whole lot of neural networks can do this, but by just putting some Gaussian sauce on top, you get properties you might want. Now, there are still numerical downsides to this approach, I'm not suggesting it works for everything, but the whole act of applying some Gaussian sauce to the mix gives a neural network properties that you might need, and there are the mathematical details.
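For a feel of what such a network is trained on, here is a framework-free sketch of the mixture negative log-likelihood that a mixture density network head minimises, given mu, sigma and pi outputs per component; the array shapes and toy numbers are made up for illustration:

```python
import numpy as np

def mdn_negative_log_likelihood(y, mu, sigma, weights):
    """Mixture negative log-likelihood for a 1D target.

    y:       (n,)   observed targets
    mu:      (n, k) predicted component means
    sigma:   (n, k) predicted component standard deviations (kept positive by the net)
    weights: (n, k) predicted mixture weights (the "pi" nodes, softmaxed to sum to 1)
    """
    y = y[:, None]
    # Gaussian density of each target under each component
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    # weighted mixture density per sample, then the average negative log-likelihood
    mixture = np.sum(weights * comp, axis=1)
    return -np.mean(np.log(mixture + 1e-12))

# tiny fake "network output": 4 samples, 2 mixture components
y = np.array([0.1, 0.9, 1.1, -0.2])
mu = np.full((4, 2), [0.0, 1.0])
sigma = np.full((4, 2), 0.3)
weights = np.full((4, 2), 0.5)
print(mdn_negative_log_likelihood(y, mu, sigma, weights))
```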
Another thing you can do, which I think is the cooler trick: you can also add probabilistic properties with a mixture. Take an autoencoder. Typically, stuff goes in and the same thing should come out, but you squeeze it down to a latent state, and typically things that are different end up in different clusters within that space. If that happens, one thing you can do is train a Gaussian mixture on that. So there's a bunch of Gaussians trying to learn what that embedded state looks like, and if you do that, it essentially means that when the encoder takes something and puts it into a latent state, you have a low-dimensional representation of a data point that you can give to the Gaussian. You can do classification with that, you can also do outlier detection with that, but what's even cooler is that you can say: hey, suppose I want to decode one of these circles, I want to sample. Because they're all Gaussians, it's very easy to sample from them, so it's quite easy to say: I just want a random zero in this particular case, could you please generate one for me?

That sounded like a cool idea, so I figured I'd just build it. There's this data set called MNIST, which I used for this, and another data set called Fashion-MNIST, which is sort of the same thing but for fashion. What you see here are some sample outputs: for every digit I trained a six-dimensional Gaussian embedding, and I said, okay, sample from the "zero" Gaussians and put that into the decoder, and here are some examples of the stuff I'm sampling. It's definitely not perfect, but, you know, close enough. And here are some fashion samples: there's a bunch of shoes, and I do want to mention it's not really a perfect rendering, but I'm definitely on to something here, and there are settings of the neural network, again, that you can tweak. As opposed to a variational autoencoder, what I like about this idea is that the encoder is just trying to do the best job it can at encoding, and once that's done, then I introduce the probabilistic aspects; it's a two-step approach, that's all. Another cool thing is that I can say: how about I now sample not from the Gaussians but just from a random point in that space? When I decode that, what comes out is gibberish. So this mixture that's sitting in the middle is definitely doing something; maybe I shouldn't say it's a manifold, because it's not the textbook definition, but it's a manifold-ish thing, a way to represent how data points are distributed in that space, and that's a convenient property. And again, the only thing I did was take a textbook autoencoder and put a GMM in there, and suddenly I have all these very cool properties. Note the shape: it can be a non-normal shape in here, because it's a mixture that tries to fit an arbitrary shape; again, a better algorithm because I was doing something with the Gaussian. A final thing that's kind of cool: suppose you have all the zeros in latent space; if you fit a six-dimensional Gaussian mixture on that, then the mean of every single one of those Gaussians might represent a slightly different style, because it might be the case that the zeros also have clusters within them. (That's fifteen minutes, by the way.) The nice thing is that by going over all of these means you also get a glimpse of the different styles: apparently there are a couple of zeros that are more skewed like this, and a couple of zeros that are perfectly round circles, which from an artistic perspective is also kind of interesting. This is a way of saying: hey, I'm classifying and clustering and generating at the same time, and again, I know of no algorithm that can do all four of those things besides this approach, and again, the only thing I did here was just do some stuff with Gaussians.
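A sketch of that two-step recipe, assuming a trained encoder and decoder already exist; the `encode` and `decode` names and the component count below are hypothetical placeholders, not the talk's own code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Assumed to already exist (hypothetical names):
#   encode(images) -> latent codes, shape (n, latent_dim)
#   decode(codes)  -> images reconstructed from latent codes
#   X_zeros        -> all training images of the digit 0

def fit_latent_mixture(X_zeros, encode, n_components=16):
    """Step two of the two-step approach: fit a mixture on one class's latent codes."""
    codes = encode(X_zeros)
    return GaussianMixture(n_components=n_components, random_state=0).fit(codes)

def sample_new_digits(gmm, decode, n=10):
    """Sample latent codes from the mixture and decode them into brand-new images."""
    codes, _ = gmm.sample(n)
    return decode(codes)

def latent_styles(gmm, decode):
    """Decode every component mean: each mean tends to show a slightly different style."""
    return decode(gmm.means_)
```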
So we're nearing the part where I want to do some live coding, but before I attempt that, I want to mention one thing which I think is the weirdest property of a Gaussian, and it's this weird meta thing, and I have about ten minutes to explain it, so if you have coffee, now is the time to drink it, because what I'm going to do now is talk about super high dimensions when you're doing something with the Gaussian. But I do want to point out already: just adding some Gaussian sauce made some algorithms better here, and that's a cool thing.

Another way of looking at a Gaussian is this: here's a two-dimensional Gaussian, right? If I look at one point from this two-dimensional distribution, I can plot it in a different way: a two-dimensional point can also be shown as two points along one axis, and the only thing I'm doing is making sure the dimensions are ordered. Essentially this is just a different way of representing the same thing. There's correlation here as well, right, so it would be really weird for me to sample an x1 over here and then an x2 that's super high; so if there's correlation in this plane, there's going to be correlation on this axis as well. It's going to be hard to draw a five-dimensional Gaussian in the first kind of plot, but it's quite easy for me to draw a five-dimensional Gaussian in this second view. (Questions later. So this is the value: x1 has this value, right, and that value is the thing that's on the y-axis here. Okay, that was a good question, great, thank you.) But then you've got to wonder: okay, so how can I change this? Because this kind of looks like a time series, it feels like something is changing over time, right? So this is interesting, and the main thing you can tweak is the covariance matrix: is there something I can do with that covariance matrix so that it is forced to have properties that I'm interested in? And how about this, and this is where it gets really meta: what I'm going to do is look at the distance between the points. These two points are way closer to each other than, say, this point and that point. So one thing I could do is say: how about I take the Gaussian shape, look at the distance between two points, give that distance to the Gaussian, and use that in the covariance matrix. The question is: what will happen? And that is something I'd like to live-code, because I think it's very hard to explain without coding it.

So what I've got here is just an instance of JupyterLab, and what I can do is type numpy.random.multivariate_normal and sample a multivariate point. I'll say I've got some variable called mu, which is just np.zeros, and let's say that K is equal to 2, so this is an array with two zeros in it, a row of zeros, and I'll do something like Sigma as well, np.eye, and I give that a K too. Then if I put mu and Sigma in here, this is a multivariate normal, and I've just sampled one point out of it; that's the only thing I've done. What I can do now is type plt.plot, and I can do this a bunch of times, and this is the view we had before, the alternative view of a two-dimensional Gaussian.
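A reconstruction of roughly what is being live-coded here (variable names and plotting are guesses at what was typed, not a verbatim copy): sample points from a K-dimensional Gaussian with zero mean and identity covariance, and plot each sample's values against their index.

```python
import numpy as np
import matplotlib.pyplot as plt

K = 100                      # dimension of the Gaussian
mu = np.zeros(K)             # mean vector: all zeros
sigma = np.eye(K)            # identity covariance: every dimension independent

# one sample is a single K-dimensional point; plotted along its index
# it just looks like white noise, because nothing is correlated
for _ in range(5):
    sample = np.random.multivariate_normal(mu, sigma)
    plt.plot(sample)
plt.show()
```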
What I can then do is say: you know what, increase that, so now I sample one point from, say, a 10-dimensional Gaussian, and it kind of looks like this, and if I do a hundred of these points, it's basically white noise, because the covariance matrix I've got here is independent: all the different things I'm sampling are sampled independently of each other. But now, how about I change that? What I can do is type: for i in range(K), and for j in range(K), and say: hey, take that Sigma thing you've got there, take that one entry of the matrix, and I'm going to put something in there that kind of resembles a Gaussian. I'm going to divide i by 10, just to make it a bit smoother, but I'm basically saying this should be the distance, by the way, the distance between i and j. So if this is between the twentieth index and the hundredth index, that distance is going to be huge, and when you square it, put a minus in front, and give that to the exponential, you get something that's very, very close to zero. So let's just demo that with maybe six points. This is a different thing I've sampled, but if I look at the Sigma matrix, you can see that near the diagonal, stuff that's close together has a high number, and stuff on the outskirts has a lower number. What I'm going to do now is just increase the number of points I'm sampling and look at what happens. Suddenly, this is still super random, a perfectly random thing that I've sampled, but there's also a pattern in it, a process, and that's the idea behind a Gaussian process. The weird thing is that if you sample from a high-dimensional version of this, one single point can represent something over time, because of the way I'm changing this covariance matrix.

But here's the freaky thing: I can also change this a bit and look at the distance between the two points and give that to, like, the sine function; then a single point from a Gaussian distribution now represents a sine wave. To me this was super unintuitive when I saw it the first time, but by programming it with just a couple of lines of code, I was already in this moment of: okay, I'm at least on to something here. So why does this happen? I made a picture to try to explain it, but the idea is: if I have a covariance matrix that looks like this, then points that are close to one another will influence one another; they have a high covariance, that's what it means. And if I have a sine function in there, then the high covariance basically says: you're going to co-vary with something that is maybe ten steps away, and every ten steps there's going to be high covariance again. So by defining what people like to call a kernel, the function that I put in here, I force properties onto the time series that's being sampled. And then I figured, let's do this in a really meta way, so I've made an object here called a kernel, and it's just an object, but the nice thing is that I can say: take the kernel and take, let's say, a linear function, and these are just examples of kernels I came up with. This was the kernel I typed before, here's the function that does the sine wave, I have another function that returns a constant, and another function that just takes the difference between two points.
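A reconstruction of roughly this next step: fill the covariance matrix from a kernel on the distance between indices. The RBF kernel below matches the exp-of-negative-squared-distance idea just described; for the repeating pattern, a cosine of the distance is used instead of a plain sine, since that is guaranteed to give a valid covariance matrix (a small substitution on my part, not the talk's exact function):

```python
import numpy as np
import matplotlib.pyplot as plt

K = 100
mu = np.zeros(K)

def rbf_kernel(i, j, scale=10.0):
    """Covariance falls off smoothly with the distance between indices."""
    return np.exp(-((i - j) / scale) ** 2)

def periodic_kernel(i, j, period=10.0):
    """Covariance repeats with the distance between indices."""
    return np.cos((i - j) * 2 * np.pi / period)

def covariance_from(kernel, n=K, jitter=1e-8):
    """Fill the full covariance matrix from a kernel on index distance."""
    sigma = np.array([[kernel(i, j) for j in range(n)] for i in range(n)])
    return sigma + jitter * np.eye(n)   # tiny jitter keeps it numerically valid

for kernel in (rbf_kernel, periodic_kernel):
    sigma = covariance_from(kernel)
    for _ in range(3):
        # each sample is one point from a 100-dimensional Gaussian,
        # yet it traces out a smooth (or repeating) curve over the index
        plt.plot(np.random.multivariate_normal(mu, sigma))
    plt.show()
```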
Now, what I'd like to show you is this: if I take a linear function like that, the difference between the two points, that's what defines the covariance matrix, and this is what it looks like, and if I take the RBF thing that I had, this is what it looks like. But the funny thing is: what happens if I take the linear kernel and multiply it by the linear kernel? I get a polynomial. So the weird property here is that by giving a function that defines the covariance matrix, you can force the covariance matrix to take a certain shape, but the function itself can be used as a Lego brick, because one thing I can now do is say: I just want the sine thing to be added to the linear thing, and maybe I should zoom out a bit, but — I'm multiplying here, sorry, this should have been an addition, my bad — now there's a linear pattern being added to something that repeats. And the cool thing is that you can also take that linear pattern and multiply it by something that repeats, and then I get a seasonal aspect that changes amplitude over time. So you can do stuff to these functions and they're kind of like Lego bricks, and there's sort of a cookbook, if you will: there's actually a PhD student who wrote a thing called the Kernel Cookbook, and the only thing it does is explain all these different kernels and how you can multiply them together to get arbitrary shapes.

But then, how do you make predictions with this? That's the really mind-boggling thing. If I have a normal two-dimensional Gaussian like this, and I have an x_i value over here, then that constrains what values this x_j value can take: if my x_i value is over here, x_j is not going to be down here, there's a constraint on that, and it's probabilistic; that's something the Gaussian provides for you. But this will also happen in the other view, so you can say: suppose this point is given, then that determines the probabilistic shape that all the other points can make. As a final demo, because of time: what I've got here is this kernel that I've made, essentially, and some data points that it's trying to fit. I can say: look, these are some data points I want you to fit, and if you're going to fit these points, I want the kernel to have a linearly increasing relationship and something that's seasonal. I can also say: well, I don't know about that linearly increasing thing, but I do know for certain there should be something of a sine in there, and this is the best thing it can fit: it has fitted the best sine function here. And I can also say: just take the linear thing, which — ah, live demos — wait, I think I can do this, right — yeah, so you can do linear regression this way.
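What "conditioning on the observed points" looks like in code is the standard Gaussian-conditioning formula; here is a small sketch with made-up observations and an RBF kernel (the textbook Gaussian process regression equations, not the talk's own demo code):

```python
import numpy as np
import matplotlib.pyplot as plt

def rbf(a, b, scale=1.0):
    """RBF kernel between two sets of 1D inputs."""
    return np.exp(-((a[:, None] - b[None, :]) / scale) ** 2)

# a handful of made-up observations
x_obs = np.array([-3.0, -1.0, 0.0, 2.0])
y_obs = np.array([-1.2, 0.3, 0.8, -0.5])
x_new = np.linspace(-5, 5, 200)

noise = 1e-4                                       # tiny observation noise / jitter
K_oo = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
K_no = rbf(x_new, x_obs)
K_nn = rbf(x_new, x_new)

# standard Gaussian conditioning: posterior over f(x_new) given the observed points
alpha = np.linalg.solve(K_oo, y_obs)
post_mean = K_no @ alpha
post_cov = K_nn - K_no @ np.linalg.solve(K_oo, K_no.T)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0, None))

plt.plot(x_new, post_mean)
plt.fill_between(x_new, post_mean - 2 * post_std, post_mean + 2 * post_std, alpha=0.3)
plt.scatter(x_obs, y_obs)
plt.show()
```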
So what did I just do? Well, I'm running out of time, but what I hope I've been able to do is give you some intuition on things that might not have been obvious, and that is that the Gaussian is actually sort of everywhere in data science. It's this distribution, it's a thing, it's basically everywhere, and knowing about it really helps out, and it can be mind-blowing to witness its full applicability, but understanding this mathematical Lego brick really allows you to recognize algorithmic parts in other algorithms. So if you're being overwhelmed by all of these algorithms you could learn, one thing you can also do is just spend an afternoon in a notebook really trying out all these freaky things you can do with a Gaussian, because understanding that one thing to its fullest extent is going to make everything else a whole lot easier, so it might be a better investment. And if you feel like you don't fully, totally grasp this, that's super normal: I've skipped a lot of details in favour of intuition. The reason I skipped those details is that those details are the things that are really demotivating. If you type it into Wikipedia, this is how Gaussian processes are explained; if you go to a book, the first thing you see is "a Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution". Full stop. No pictures or anything. And if you read the papers it's worse: you get these algorithmic explainers that don't really help you. So what might help is this thought: mathematics is super useful when there's context and intuition, but it's downright terrible when it's just a bunch of symbols being dumped onto your retina, and it's worse if that occurs when you're young. The best advice I have here: I was lucky when I was a kid, really lucky, because my parents actually found a really convenient way to prevent this intellectual paralysis from happening to me. The story about Gauss that I told you in the beginning, that very first story, I read in a book called The Number Devil, and it's the weirdest thing: it's a book meant for kids, for eight-year-olds. If you've got kids, trust me, best book ever, really, it's still one of my favourite books. It's about a boy who's afraid of maths, and in his dreams he gets a visit from the number devil, and there are all sorts of fairy tales, one of them being the story about Gauss I just told you. If you really want people to understand — hey, maybe focus on the intuition first and then on the details — there simply isn't a better introductory book than this. Everyone in my class who read this book as a kid got straight A's in their calculus exams, and all my math professors see this as a pattern. This book can really help you gain this intuition of what a number actually is, and that's going to help you a whole lot. Thanks for listening. [Applause]

Okay, so, any questions? Questions, questions. Yes, here's one.

Q: Thanks for the talk. Maybe it's a stupid question, but if you go to the parallel coordinates, you say there's a distance between the dimensions, right? It looks like a timeline, but it's not a timeline; the distance between the dimensions is arbitrary?

A: Yes, so technically it could also be spatial; this one axis can also be spatial, and the intuition still applies. You could even have higher dimensions upon higher dimensions, and the trick still applies: the idea is that you have some sort of kernel that says, given the distance between two points, what can I do to the covariance matrix such that it behaves in a way that I deem important. That also works for higher dimensions, though I chose not to discuss it due to the serious amount of confusion I might introduce by doing so.
But the distance is an interpretation: time is just an easy thing to explain, and it can also be other things than time. If you're reading about Bayesian optimization, for example, that distance is just distance in hyperparameter space.

Q: Can it be random?

A: No, I think... well, the thing here is we're talking about maths, so there might be someone in the audience who says "ah, I know a detail that you don't", so I'm a bit afraid to say it, but no, I would argue that you need to have some sort of meaning going into the kernel in order for the covariance matrix to really mean anything. If you have a cool counter-example, come to me afterwards, I'd enjoy that.

Q: Thanks a lot for the talk, I really appreciate the intuitive approach. My question is about slide 51, I think — well, the general problem: I think you said you have sixteen Gaussians, or six? Sixteen. So that's sixteen means and sixteen standard deviations, and you've got maybe a few hundred data points, right?

A: That's a bit of a mathematical detail, but here's what happens: for every single point here I can ask "what's your likelihood value?", because there are Gaussians in there, so there's some likelihood value, and that gives me a distribution over the likelihood values, and I can set a threshold on that. There's a documentation page on the open-source package that explains this in more detail, but that's one of the ways you can do it. Good question.

Q: Hi, the Gaussian autoencoder business that you did — you mentioned variational autoencoders very quickly, in half a sentence, and this is not a variational autoencoder. Could you say a little bit about the relationship?

A: Yes. The actual history of this was that I figured, hey, I want to add something probabilistic to my autoencoder, and then I came up with "Vincent's autoencoder", but unfortunately that name was already sort of taken; that's part of the story. But the thought here is: a variational autoencoder changes the cost function, right, so the steps the gradients take are influenced by the fact that the center space has to be a Gaussian distribution, and note that, I believe, typically it has to be a standard normal distribution, so not a mixture, it has to be a single Gaussian; that's what the variational autoencoder typically does. What I'm doing here is saying: I don't want to constrain the autoencoder in any way, so let's just train that, and once that's a given, take that latent state and train a Gaussian mixture on it. So it's more of a two-step approach, I'm not adapting the cost function whatsoever, and I believe this approach has merits, there might be some potential benefits, but yeah, "Vincent's autoencoder" was not going to be a good paper.

Okay, any more questions? Questions, questions. No questions? Okay — yes.

Q: Thanks for the talk. Might be a stupid question, but you nicely showed how the sine function and the linear function work; could you give more intuition on the RBF kernel? That's something I've always been missing, and not just in this talk.

A: So in the end it's just a shape, right, and the main thing to take away is: whatever function you give it, the one thing that function will do is make the covariance matrix behave in a different way. RBF stands for radial basis function, and it basically
just means it's kind of like a hump, and there are different humps you can use, you can normalize stuff, you can add extra parameters, but the gist, first and foremost, is: whatever function you have, it takes the distance between two points and fills in the coordinate of the covariance matrix, and that's already a kernel, and there's a huge cookbook of stuff you can do. So it's not too important that you have the exact right RBF; it's more the general shape that matters. Does that answer the question? Okay.

So we have time for one more question. Yes — and for people who are interested, actually go and Google this thing called the Kernel Cookbook. This character, David, he's actually kind of clever, he writes about some cool stuff, and he wrote sort of a cookbook on how you can combine all these different kernels and what properties they have. It's actually a good read if you're a nerd, but make sure you get the intuition first.

Q: Can you go back to slide 71 — or, you're only at 70 — so here you show how you take the x1 and x2 values from the distribution, but what about the distance between x1 and x2?

A: So this is a two-dimensional distribution, which I can draw; a five-dimensional distribution is kind of hard to draw in two dimensions. But this red dot has a very high x1 value and a very high x2 value, and the x2 value is slightly higher than the x1 value; for green it's the other way around, and for blue they're both super low. So the interpretation of the values here is — oops, you can see the mouse, right — whatever value I have here on the x1 axis, that's the value you see there. The only thing I'm doing is putting x1 before x2; I'm only ordering along the x-axis. The actual value, the value where the dotted line hits this axis, that's the value you have here. Okay?

Okay, thank you very much, Vincent, for the great talk. Another round of applause, please. [Applause]
Info
Channel: PyData
Views: 4,402
Rating: 4.9512196 out of 5
Keywords: Data Science, Python, Artificial Intelligence, Algorithms, IDEs/ Jupyter, Machine Learning, Statistics
Id: aICqoAG5BXQ
Length: 42min 22sec (2542 seconds)
Published: Thu Dec 19 2019