CS480/680 Lecture 17: Hidden Markov Models

Captions
Okay, so in this set of slides we're going to talk about hidden Markov models. This is a different model, and you might be wondering what the connection is with the neural networks we've been talking about. The idea is that everything we've covered so far assumes the data is independent: whenever you do classification or regression for one data point, it's done independently of the next data point, and so on. But there are lots of applications where the data really comes from a sequence, and the prediction you make for one data point is correlated with the prediction for the next. A classic example is speech recognition: you've got an audio signal, and when you're trying to recognize a phoneme or a word, it's correlated with the next phoneme or the next word, so if you know one of the two, it helps you figure out what the other one might be. These correlations are very useful; we can improve accuracy a lot by exploiting them, but for this we need a new kind of model. Hidden Markov models are essentially one simple type of model that can exploit these kinds of correlations, and we're going to see later on that a generalization of hidden Markov models corresponds to recurrent neural networks, which can also deal with sequence data and can propagate information between the different points in a sequence. So that's why, for this lecture, we're going to take a step back to traditional methods, namely hidden Markov models, and after that we'll come back to neural networks, specifically recurrent neural networks. Okay, so as I was explaining, so far we've assumed that the data instances we're predicting for are all independent, but in many applications that's not necessarily the case. Consider weather prediction, robot localization, speech recognition, or activity
recognition: these are all tasks where at any point in time you need to make a prediction, but the prediction at the next time step is going to be correlated. In weather prediction, if you want to predict the temperature at a certain hour, then knowing the weather the hour before helps you predict the weather for the hour after, because the weather changes smoothly; so we want to take that into account. Same idea with activity recognition: if you're trying to predict what somebody is doing at any point in time, knowing what the person was doing at the previous time step helps you disambiguate, because people tend to do an activity for a certain period of time before they change. Okay, so as a concrete example, consider speech recognition. We might have data in the frequency domain, along with the amplitude; this would be our input, and what we would like to produce as output is a sequence of phonemes or even a sequence of words. For the sequence of phonemes: say you get good recognition accuracy for some of those phonemes, but for others the signal is not clear and you're not sure what they might be. You may still be able to guess what those phonemes are simply based on the neighboring phonemes, because if I give you a sequence of phonemes with one of them missing, you can probably guess what it is: those phonemes are expected to form a sequence of words, so you're looking for phonemes that lead to proper words. Same thing with a sequence of words: they typically form a proper sentence, so if you have a bunch of words with some missing in between, you can often guess the missing words, and that helps. So there are essentially correlations between these outputs, and we want to
exploit that. Okay, so in terms of classification we've seen several models, including mixtures of Gaussians, logistic regression, and feed-forward neural networks. All of those models do what I would call independent classification: you take a data point, you make a prediction for that data point, and if you have an entire data set you don't care about the order of the data points; each prediction is independent. But we can exploit correlations with extensions of those models known as hidden Markov models, conditional random fields, and recurrent neural networks. In this lecture we're going to see hidden Markov models, which are really just a generalization of mixtures of Gaussians. There's a corresponding generalization of logistic regression known as the conditional random field, and we'll also talk in upcoming lectures about a generalization of feed-forward neural networks, which is essentially recurrent neural networks. The benefit of all the models on the right is that whenever we've got sequence data, we can exploit the correlations and boost accuracy. Okay, so to define a hidden Markov model, let me draw a picture that will also help us see how it extends a mixture of Gaussians. For a mixture of Gaussians we have some inputs X and the output Y, and what happens is that we express a distribution over Y, so Y is sampled from a multinomial distribution. Now, once we have Y, which gives us the class, we can express a class-conditional distribution for the input X: X given Y follows a Gaussian distribution. Because X is conditioned on Y, we can draw an arrow that indicates which variable depends on which variable, and this arrow is sometimes a little confusing, because when we think about the
prediction task, we start from the input X and produce Y, so it's tempting to say the arrow should go up. But from a probabilistic perspective, the way the model is constructed, we initially have a prior distribution over Y and then a conditional distribution of X given Y, so in that sense X is produced based on Y. At least in terms of how the samples arise in nature, the generative model is that you have the class, and then the input data that is revealed to us is obtained conditioned on Y; that's why the arrow goes down. Okay, so this is for a mixture of Gaussians. Now let me draw the equivalent graphical model for hidden Markov models. We'll have the inputs x1, x2, x3 and the outputs y1, y2, y3. In the same way, we have arrows indicating that each xi is sampled given yi according to a Gaussian, so this is the same as before. The main difference is that instead of each Y being sampled independently from a multinomial distribution, there are correlations: we say that y_{i+1} depends on y_i according to a multinomial, and the very first y, y1 in this case, also just follows a multinomial. We treat these as random variables, and the arrows, in the language of probabilistic graphical models, simply indicate probabilistic dependencies. So the main difference compared to the mixture of Gaussians is that we consider several input-output pairs together, and we indicate that the outputs, the Y's, are correlated using these arrows. Any questions regarding this? Good. Okay, so when we design a hidden Markov model we typically make two assumptions. The first is that the process is stationary, which means that the transition and emission distributions, the probability of X given Y and the probability of Y given the previous Y, are independent of time. You see, if the emission
distribution for X given Y is the same at every time step t, then in that sense the distribution is stationary: it doesn't change. And each Y depends on the previous Y, the previous hidden state, so there's a transition that occurs among the Y's, and that transition distribution is also stationary: it's the same distribution regardless of which time step t we consider. The second assumption we make is that the process is Markovian. In principle, a given Y could depend on all the previous Y's, but in practice it is often sufficient to assume that a given Y just depends on the previous Y. This is reflected in the arrows: each hidden state depends only on the previous hidden state, not on all the other hidden states. That's the Markov assumption: the next state is independent of all the previous states given the current state. If I know the current state y_t, then I can ignore the previous states y_{t-1} down to y_1, because they provide no further information about y_{t+1}. Any questions regarding those assumptions? Good. Okay, so to summarize: this gives us a probabilistic graphical model, essentially the one I drew, and then we have a parameterization. The parameterization corresponds to the distributions: there's an initial distribution, a transition distribution, and an emission distribution, and together they let us write a joint distribution. Let me go back to the board. I've already drawn the probabilistic graphical model for a hidden Markov model, but just to make it clear, the distributions that give us the parameters are as follows. We have the initial distribution,
that is, the distribution for y1, which will typically be a multinomial. Then we have a transition distribution, the probability of y_{t+1} given y_t, which is also a multinomial. And finally there's the emission distribution, the probability of x_t given y_t. In the generalization of mixtures of Gaussians, this emission distribution is also Gaussian, but if we have data that is discrete rather than continuous — as happens for instance in natural language processing; say we want to do machine translation with a hidden Markov model, so the data corresponds to sequences of words, and words are essentially discrete quantities — then we could not use a Gaussian and instead would use a multinomial. So the emission can also be a multinomial depending on the type of data: a Gaussian for continuous data, a multinomial for discrete data. Okay, so those distributions together allow us to write a joint distribution, a distribution over all the X's and all the Y's. Here I'm using y_1 through y_T to indicate that I've got a whole sequence of Y's, and similarly x_1 through x_T for a whole sequence of inputs. The joint is the probability of y_1, times a product over i of the probability of y_{i+1} given y_i, times a product of the probability of x_i given y_i. To be precise, the transition product goes from i = 1 up to T - 1, and the emission product goes from i = 1 up to T:

P(y_1, ..., y_T, x_1, ..., x_T) = P(y_1) × ∏_{i=1}^{T-1} P(y_{i+1} | y_i) × ∏_{i=1}^{T} P(x_i | y_i)

You can take these distributions, multiply them together, and this gives you the joint distribution. Any questions regarding this? Good.
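As a quick sanity check, the joint distribution above can be computed directly. Here is a minimal sketch with a toy two-state model and discrete (multinomial) emissions; all of the states, symbols, and numbers are made up for illustration, not taken from the lecture:

```python
# Toy HMM with discrete emissions -- all names and numbers are illustrative.
states = ["rain", "sun"]
obs_symbols = ["umbrella", "no_umbrella"]

init = {"rain": 0.4, "sun": 0.6}                        # P(y1)
trans = {"rain": {"rain": 0.7, "sun": 0.3},             # P(y_{t+1} | y_t)
         "sun":  {"rain": 0.2, "sun": 0.8}}
emit = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},  # P(x_t | y_t)
        "sun":  {"umbrella": 0.2, "no_umbrella": 0.8}}

def joint_probability(ys, xs):
    """P(y_1..T, x_1..T) = P(y1) * prod_t P(y_{t+1}|y_t) * prod_t P(x_t|y_t)."""
    p = init[ys[0]]
    for t in range(len(ys) - 1):          # transition product, i = 1 .. T-1
        p *= trans[ys[t]][ys[t + 1]]
    for y, x in zip(ys, xs):              # emission product, i = 1 .. T
        p *= emit[y][x]
    return p

p = joint_probability(["rain", "rain"], ["umbrella", "umbrella"])
# 0.4 * 0.7 * 0.9 * 0.9 = 0.2268
```

Summing this joint over all state and observation sequences of a fixed length gives 1, which is a useful check that the three distributions were entered consistently.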
Okay, so hidden Markov models can be used for all kinds of applications, and an important one was in fact the problem of robot localization. Mobile robots have an inherent problem: if you put them in a location, give them a map, and tell them to go to another location according to the map, then without a proper model to monitor their location they might get lost. In fact, many years ago when I was a student, this was one of the biggest problems in robotics. We used to have labs with nice, fancy mobile robots, and sadly, you would start your robot, it would head off in some direction, and you would use the odometer, say, to keep track of the motion; the robot could guess, or at least believe, where it was according to its odometer. But the problem is that the odometer would have some drift, and within a few minutes the robot would get lost. As a concrete example, say we have a robot at some location and you tell it: go forward once, twice, three times, then make a left turn, go forward again once, twice, three times, then make another left turn, and so on. If you record different experiments and simply put a dot at the precise location the robot ends up in — each time executing the same program, the same sequence of commands — it ends up in slightly different locations, and these errors build up. The point cloud on the slide indicates the locations where the robot ends up across different experiments when you give it that same sequence of commands every time. So if you're using an odometer just to keep track of the motion of the robot, which in principle should tell you where the robot is, the problem is that the exact location might differ quite a bit due to these inaccuracies: the gears of the robot might have some play, and it doesn't necessarily travel exactly the same distance at exactly the right angle each time, and that's how it ends up in different locations. When these things compound like this, a robot that only has an odometer
will eventually get lost; it just takes a few minutes, and then you have this nice little robot that is completely useless because it doesn't know its location. So this was a fundamental problem many years ago, and the solution was essentially to use a hidden Markov model, using observations from sensor measurements to try to disambiguate the actual location of the robot, which is treated as a hidden state. Okay, so here we can specify a hidden Markov model where the Y's, the hidden states, are the coordinates of the robot on a map, and the X's, the inputs, are measurements from sensors: it could be a laser rangefinder, it could be a sonar, or today it could even be a camera. So we put sensors on the robot that can measure different things about the environment, compare that with the map, and essentially infer the robot's location from that. Now, if we specify a transition distribution that corresponds to the motion of the robot, we need to account for uncertainty, because even if you tell the robot to go forward by one meter, as we saw on the slide, there's a little bit of slippage, and it doesn't end up at exactly the same location each time. So we want to specify this with a distribution that captures the probability of ending up in different locations; that's what the transition distribution is for. We also include an emission distribution, a distribution over the measurements obtained by the sensors. The idea here is that if the robot is in front of an obstacle and has a sonar or a laser rangefinder that measures the distance to the obstacle, the reality is that those sensors are not perfect, so the measurements are not going to be accurate, and
it's also good to model this with a distribution; that's what the emission distribution is for. Now, when you define your hidden Markov model in this way, you can use it to do localization, which corresponds to computing the probability of y_t, the location at time step t, given all the sensor measurements up to time step t. More generally, with a hidden Markov model we can consider several tasks, which fall into four broad categories. The first is monitoring — robot localization is an example — where you'd like to infer the hidden state at time step t given all the observations up to time step t. In other situations, what we really want is not just to monitor the hidden state but to predict what the state will be in the future, maybe k steps ahead: we only have observations up to time step t, but we're trying to predict y_{t+k}. In yet other situations we already have observations up to time step t, and what we're trying to do is disambiguate what the hidden state was at time step k, where k is less than t. That means we have past observations but also future observations that follow time step k, and we want to use all of that, the entire window, to recover what the state might have been at time step k. And the last one, most likely explanation, asks for the most likely sequence of hidden states, the y_1 through y_T with the highest probability under the posterior distribution of y_{1:T} given x_{1:T}. Okay, so these are four common tasks, and the next question is what algorithms we can use to answer these probabilistic queries and solve these tasks. If we start with the monitoring task, we already saw robot localization as an example; another example could be patient monitoring — essentially any problem where you're simply trying to estimate the hidden state given
all the previous measurements. One way of computing this is to recursively decompose the query so that it is expressed in terms of the probability of the previous hidden state given all the previous measurements: you can compute P(y_t | x_{1:t}) based on P(y_{t-1} | x_{1:t-1}). There's a little derivation on the slide that shows how to expand this and get a recursive definition. Based on this recursion we can design an algorithm known as the forward algorithm. It starts by computing the distribution of the first hidden state given the first measurement, then computes the next hidden state given the measurements up to that time step, and keeps extending the sequence like this, going forward in time. Okay, let me draw something on the board to illustrate. Say I've got a hidden Markov model with four time steps, and I'm interested in the probability of y_4 given x_1 through x_4. The computation starts by computing the probability of simply y_1 given x_1; this is the first step of the algorithm. The second step looks at the probability of y_2 given x_1 and x_2, so the first iteration of the for loop gives us y_2 given x_1 through x_2: we're going forward in time. The next iteration looks at y_3 given all the measurements so far, and the final step computes the probability of y_4 given everything before. Any questions regarding this algorithm? Good, let's continue.
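The forward (monitoring) recursion just described can be sketched in a few lines. This uses the same kind of toy two-state model as before, with made-up numbers; the derivation alternates a predict step through the transition model with an update step that multiplies in the new evidence and renormalizes:

```python
# Forward recursion for monitoring: P(y_t | x_1..t), computed left to right.
# Toy two-state model; all numbers are illustrative.
init  = {"rain": 0.4, "sun": 0.6}
trans = {"rain": {"rain": 0.7, "sun": 0.3},
         "sun":  {"rain": 0.2, "sun": 0.8}}
emit  = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},
         "sun":  {"umbrella": 0.2, "no_umbrella": 0.8}}

def forward_filter(xs):
    """Return the belief P(y_t | x_1..t) for every time step t."""
    beliefs = []
    # base case: P(y1 | x1) proportional to P(y1) P(x1 | y1)
    b = {y: init[y] * emit[y][xs[0]] for y in init}
    z = sum(b.values())
    b = {y: p / z for y, p in b.items()}
    beliefs.append(b)
    for x in xs[1:]:
        # predict: P(y_{t+1} | x_1..t) = sum_y P(y_{t+1} | y) P(y | x_1..t)
        pred = {y2: sum(trans[y][y2] * b[y] for y in b) for y2 in init}
        # update: multiply in the new evidence, then renormalize
        b = {y: pred[y] * emit[y][x] for y in init}
        z = sum(b.values())
        b = {y: p / z for y, p in b.items()}
        beliefs.append(b)
    return beliefs

beliefs = forward_filter(["umbrella", "umbrella", "no_umbrella"])
```

Renormalizing at each step keeps the numbers from underflowing on long sequences and directly yields the conditional P(y_t | x_{1:t}) rather than the unnormalized joint.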
Okay, the second task is prediction. It's similar to monitoring; the main difference is that we're trying to predict Y some k steps into the future. Examples would be weather prediction and also stock market prediction. For weather prediction, you see, we have measurements up to the current time step, and typically we want to predict tomorrow's weather, or the weather a certain number of days in the future; for stock market prediction it's the same idea, predicting what prices are going to be in the future. Okay, so to compute this we do a similar type of recursion. This recursion essentially lets us predict y_{t+k} based on y_{t+k-1}: we consider the previous state and keep recursing like this. Based on this recursion there is a similar forward algorithm. The main difference is that the forward computation now has a part that corresponds to monitoring, where we compute the probability of y_t given x_{1:t} — just regular monitoring up to the current time step — but after that we don't have any further measurements, so that's when we start doing prediction: we keep increasing the Y's, t+1, t+2, up to t+k, but always given just the previous measurements. So the forward algorithm still goes forward in time, but it has two phases, a monitoring phase and a prediction phase. Okay, let's draw something to illustrate. Again we have y_1 through y_4 and x_1 through x_4, and suppose we're interested in computing the probability of y_4 given x_1 and x_2; this is two time steps into the future. The first phase is monitoring: we compute the probability of y_1 given x_1, then the probability of y_2 given x_1 and x_2. At this point we've reached the current time step in terms of observations, time step 2, so now we start predicting: we compute the probability of y_3 given x_1 and x_2 only — we don't have x_3 — and after that y_4, again given x_1 and x_2 only. Okay, so that's the picture corresponding to the forward computation done based on just x_1 and x_2.
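The two phases above — monitoring up to the last observation, then pushing the belief forward through the transition model with no further evidence — can be sketched as follows, again on a toy two-state model with made-up numbers:

```python
# k-step prediction: P(y_{T+k} | x_1..T). Toy model; numbers are illustrative.
init  = {"rain": 0.4, "sun": 0.6}
trans = {"rain": {"rain": 0.7, "sun": 0.3},
         "sun":  {"rain": 0.2, "sun": 0.8}}
emit  = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},
         "sun":  {"umbrella": 0.2, "no_umbrella": 0.8}}

def predict(xs, k):
    """Monitoring phase over xs, then k transition-only prediction steps."""
    # monitoring phase: forward filter up to the last observation
    b = {y: init[y] * emit[y][xs[0]] for y in init}
    z = sum(b.values())
    b = {y: p / z for y, p in b.items()}
    for x in xs[1:]:
        pred = {y2: sum(trans[y][y2] * b[y] for y in b) for y2 in init}
        b = {y: pred[y] * emit[y][x] for y in init}
        z = sum(b.values())
        b = {y: p / z for y, p in b.items()}
    # prediction phase: no new evidence, just apply the transition model k times
    for _ in range(k):
        b = {y2: sum(trans[y][y2] * b[y] for y in b) for y2 in init}
    return b

p4 = predict(["umbrella", "umbrella"], 2)   # P(y_4 | x_1, x_2), as on the board
```

Note that each extra prediction step blurs the belief toward the stationary distribution of the transition matrix, which is why long-horizon forecasts from an HMM become uninformative.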
[Student] Could this be used for generating new text — say, guessing what's in the next paragraph based only on the first few? Yes, absolutely; we could use a hidden Markov model with prediction like this for text generation. If we want to essentially complete a paragraph or complete a sentence, we have text up to a certain point and want to predict the future text. One small difference is that in the context of text generation, we could think of the X's as the observable text, and what we're predicting is not really a hidden state; it's more about predicting what the next X's are going to be, so that's a slight difference. But still, we could use a hidden Markov model for that purpose. Although, the reality is that today hidden Markov models are no longer the state of the art for these sorts of things; we're going to see how recurrent neural networks tend to be much more effective. Still, they are a precursor to recurrent neural networks. Any other questions? [Student question about the circled variables.] Okay, so here all I'm doing is circling the variable we're trying to predict given some evidence, and you're absolutely right that when I compute the prediction for y_4, I take into account my prediction for y_3. In the recursion, this is exactly what happens: y_{t+j} depends on y_{t+j-1}. All I've done is circle the variables that appear in one term; at every iteration I compute the probability of some variable given some evidence, and that's what I've circled, but obviously these things are linked together. Okay, so the third task, called hindsight reasoning, is computing the probability of y_k when k is less than t. That means we're looking at a distribution over a past state given some observations that occurred
before that past state, and also observations that occurred after it, so we've got the benefit of both the past and the future to determine that state. This can be quite effective in some applications: for instance, in activity or speech recognition, it is often good to delay the recognition so that you can also observe the data that happens afterwards and use that as well to help disambiguate the state. The computation is again done recursively: we can look at the probability of x_{k+1} through x_T, which depends on the corresponding quantity one step later, so there's a recursion there too. The main computation is divided into two parts by the chain rule: a forward part, where we're effectively doing monitoring, and a backward part, where we essentially take the future observations and propagate their information back. That backward part can be computed recursively using the derivation on the slide. Naturally, this leads to what is known as the forward-backward algorithm: there's a forward phase, where we propagate information from the past measurements up to time step k, and a backward phase, where we propagate information from the future observations down to time step k. Okay, let's draw a picture for this one too. Say we're interested in computing y_2 given x_1 through x_4; you see, 2 is less than 4. The first phase is simple monitoring, so as we did before, we compute the probability of y_1 given x_1, then the probability of y_2 given x_1 and x_2; this completes the forward phase, corresponding to the first part. Now for the backward phase: we start with the probability of x_t given y_{t-1}, and gradually increase the sequence of X's based on earlier Y's. So here we're going to start with x_4 given y_3; this is
what the first computation does. On the next step we have x_4 and x_3 given y_2, which corresponds to the next part, and then at the end we simply multiply what we computed in the forward phase and the backward phase together to give us our answer. Okay, the last task, most likely explanation, is quite important, for instance in speech recognition; we could also use it for machine translation, even though the state of the art there, again, is not hidden Markov models but recurrent neural networks. The idea is that you've got a sequence of inputs, and you're trying to compute the most likely sequence of outputs, the Y's, treated as hidden states. In speech recognition you've got your audio signal, and the hidden states are the sequence of phonemes or words; in machine translation you've got a sequence of words in one language, and the hidden sequence is essentially the words in the language you're translating into. The computation is carried out again in a recursive manner, and the main thing is that we now have a maximization; to carry out this maximization, we can use a dynamic programming procedure, either in a forward way or in a backward way. The algorithm for this is known as the Viterbi algorithm, and it computes an increasing sequence; the way I've written it down, the algorithm works in a forward way, increasing the length of the sequence one time step at a time. All right, I'm not going to bother drawing this one, but you should get the idea: again it's a forward process, and the main difference is that with the maximization we have to use a form of dynamic programming. Any questions regarding those algorithms? Good.
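The Viterbi algorithm just described can be sketched as follows. It keeps, for each state, the probability of the best path ending there (instead of the sum used in the forward algorithm), plus back-pointers for recovering the path. The model is the same toy two-state example with made-up numbers:

```python
# Viterbi: most likely hidden sequence, argmax over y_1..T of P(y_1..T, x_1..T).
# Toy two-state model; numbers are illustrative.
init  = {"rain": 0.4, "sun": 0.6}
trans = {"rain": {"rain": 0.7, "sun": 0.3},
         "sun":  {"rain": 0.2, "sun": 0.8}}
emit  = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},
         "sun":  {"umbrella": 0.2, "no_umbrella": 0.8}}

def viterbi(xs):
    """Dynamic program over max-probabilities, then backtrack the best path."""
    # delta[y]: probability of the best state sequence ending in y at time t
    delta = {y: init[y] * emit[y][xs[0]] for y in init}
    back = []                                   # back[t][y]: best predecessor
    for x in xs[1:]:
        new_delta, pointers = {}, {}
        for y2 in init:
            best_y = max(delta, key=lambda y: delta[y] * trans[y][y2])
            pointers[y2] = best_y
            new_delta[y2] = delta[best_y] * trans[best_y][y2] * emit[y2][x]
        delta = new_delta
        back.append(pointers)
    # backtrack from the most probable final state
    last = max(delta, key=delta.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

path = viterbi(["umbrella", "umbrella", "no_umbrella"])
```

The only change from the forward algorithm is replacing the sum over predecessor states with a max (plus remembering the argmax), which is exactly the dynamic-programming trick mentioned above.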
All right, let's talk about an application; I'm going to look at a case study on activity recognition. In fact, this is based on some research I did roughly 10 years ago with other researchers here at the University of Waterloo, in collaboration with researchers in kinesiology. The task was to infer the activities performed by a user of a smart walker, where we have as input some sensor measurements, and what we'd like to predict, or infer, is the activity of the user at any point in time. A walker is a device that older people often use to help keep their stability while walking, and this is quite important for preventing falls. Now, about the interest in predicting the activities of a person with a walker: obviously the main thing the person does is walk with the walker, but still, there are different things the person might be doing — walking forward, walking backward, turning left, turning right, but also sitting on the walker — so there are different types of maneuvers possible with a walker. The interesting question is which maneuvers are most likely to lead to, or trigger, a fall. What's counterintuitive here is that a walker is supposed to help with stability and prevent falls, but what some people noticed is that in some situations it might actually be more dangerous and induce a fall. For instance, several years ago I was visiting a retirement facility where they had actually forbidden people from sitting on their walkers. What happens is that walkers typically have a lever that lets you lock the wheels, and then you turn around and sit on the walker; but if you forget to lock the wheels when you're about to sit, the walker might just roll away from behind you, and you fall on the ground. So in that facility they prevented people from sitting on walkers because it was too dangerous. And generally speaking,
falls are an important risk and an important problem for older people. We'd like to prevent them, and we'd also like to understand under what circumstances they occur. This is where, if we can get a sense of the circumstances — the activities a person was performing at the time of a fall — then we can get a better understanding of the risks and how to prevent them. Okay, so we took a walker and modified it: we included a camera that looks backward at the legs, as well as another camera that looks forward, and we also included some sensors. There was a 3D accelerometer providing forward, lateral, and vertical acceleration; there were load sensors inserted in each leg of the walker that measured the load, the weight the person puts on each leg; and finally a wheel rotation counter that tells us the distance covered by the person. So we took those eight channels, recorded the data at 50 Hz, digitized it, and used that to do activity recognition. Okay, so when we did this work, we carried out an experiment with some walker users and also people who were not walker users. We went to Winston Park, a local retirement facility, recruited people who were between 84 and 97 years old, and asked them to go through an obstacle course with their walker. Now, when I say obstacle course, this is not the Olympics; there's nothing really dangerous. All they were asked to do is go through what I should rather call a walking course, performing maneuvers that are common in daily activities. For instance, they would walk straight, make a right turn, get into a confined space that might correspond to an elevator, do a U-turn, and come out, and so on; there was a place where they had to pick up an object from the floor, and there was also a ramp where they would go up and down, so
this is similar to getting up and down a sidewalk. Then there was also going through some doors, which is actually not easy: even in my case, if I'm pushing a walker and I have to go through a set of doors, it's not obvious how to do this. In any case, we asked people to go through this walking course; these are the activities we were expecting, and what we wanted was to automatically use a hidden Markov model to recognize the activity performed by the person at any point in time. The idea is that you can get people through this walking course, but if you have to have somebody manually label what the person was doing at any point in time by watching a video, that is labor-intensive, and you can't carry out experiments like this with a lot of people. Ideally, you want an algorithm that will do the recognition automatically, and that's where a hidden Markov model comes in handy. So we designed a hidden Markov model where the activities were the hidden states and the eight sensor channels correspond to the inputs, or observations, X. You'll notice that because we had eight sensors, there are eight measurements associated with every time step; that's still a hidden Markov model, it just has more than one observed variable per time step. We had the following distributions: the parameters, denoted pi, theta, and phi, correspond to the initial, transition, and emission distributions, and we learned those parameters by maximum likelihood. We collected some data and then estimated those parameters by maximizing the likelihood. We tried this in two ways. One was supervised, which means that we had the X's, but we also had somebody manually label the Y's in order to give us a supervision signal. What's more convenient is to do it in an unsupervised fashion, where you only have the sensor measurements but you don't have
the labels Y. That makes it harder for the algorithm, but it's more convenient because nobody has to label anything; as a result, you can also expect the accuracy to be lower. In any case, we did this, and I had a student develop a little interface to help us visualize what was happening, so let me start the video here. In this demo, what you're looking at is the backward view from a camera located under the seat of the walker, so you can see the legs of the person as the person is walking. As the person is maneuvering, you'll notice the eight channels, the curves here, fluctuating based on the measurements recorded by the sensors. These sensor inputs are used to infer the activity that the person is performing, and in this right panel we have the 13 activities. There's a red square that indicates the correct behavior, based on somebody manually labeling the videos: I had one student watch those videos and manually indicate, at different points in time, what the person is doing. But this was just for evaluation and training purposes; the goal was really to have an algorithm do the prediction. The blue square here is the prediction of the algorithm based on a hidden Markov model; in fact, in this work we did both a hidden Markov model and a conditional random field. You can see that whenever the blue square matches the red square, it means we have the right prediction, and when they're not matching, the prediction isn't correct. That tends to happen whenever there is a switch from one activity to the other, because the timing of the transition is not easy to capture; it's somewhat subjective when one activity stops and the next one starts. But in any case, you can see that otherwise the algorithm is tracking fairly well what the person is doing. Yes? Okay, so the question
is whether it would be possible to have the person who's maneuvering with the walker simply indicate what they are about to do and then just do it. In theory, yes. In practice, we did these experiments with older people, 84 to 97 years old, who have different abilities, and sometimes it's actually easier to have an outside observer just look and see what the person is doing, because often people think they're doing a certain activity when they're actually doing something else. So you can't rely on self-labeling, because people are often not good judges of themselves. In this particular case it would also be awkward to have somebody go around with a walker and say all the time, "I'm about to do X," and then do it. So yes, it would be a little bit awkward. Yes? Very good point. If we do unsupervised learning here, we could get an algorithm that infers some activity but cannot name what that activity is; it essentially clusters activities together, but we would not be able to name them. So I guess it cannot be completely unsupervised: if the goal is really to recognize an activity and name it, then we have to provide at least a few samples so that each cluster can be given the correct name. Yes, great question. Here we did not start with the unsupervised approach. In fact, as for what the activities are, because we were working with researchers in kinesiology, they already had some idea of the activities they wanted to capture, since in their case this would provide them with information about the circumstances that might trigger falls. They already knew that they wanted to monitor, for instance, when a person is making a left turn versus a right turn, and so on. They gave us the list of activities, and we just went ahead with that. But otherwise, if it was a task where with our human
knowledge we could not tell what the activities might be, then doing it in an unsupervised way like you're suggesting, first getting some clusters, then looking at them and saying this cluster corresponds to this type of activity, and so on, would be the way to go. Okay, so let me stop it. So in this work, as I explained, we did both supervised and unsupervised learning. For supervised learning, the Y's are known: we had somebody manually label all those activities, and the objective was to maximize the likelihood of the data, that is, the likelihood of both the X's and the Y's given our parameters pi, theta, and phi. So we're essentially trying to learn those parameters pi, theta, and phi. Now, as we've done many times in this course, the traditional approach is to simply compute the derivative, set it to zero, then isolate the parameters, and you get some estimates for pi, theta, and phi. As a concrete example, I'm going to show you how to do this: here's the derivation, and then we isolate the parameters. Let's say that we only have two activities, two classes, c1 and c2, and, just to keep things simple, that the measurements are binary, so we only get to observe a bit: it's going to be either value v1 or value v2. In practice, the measurements, especially in this case, were continuous, so we would not have just two values; we'd have real numbers in general. But for the coming example, that's what we're going to use. So we start with multinomial emissions, because if I assume that my outputs, my sensor measurements, are just bits that take value v1 or v2, then I would use multinomial emissions, and the joint distribution of my model will essentially correspond to the product that we saw earlier, which I can then expand into this derivation. Here, don't worry too much about the details of
the equation. The point is that when we compute the derivative and set it to zero, we're doing a maximization with respect to a lot of parameters, and we can decompose it into a separate maximization with respect to each parameter. Once we set the derivative to zero, we end up with answers that are actually quite natural, similar to what we've done with mixtures of Gaussians, except now we're doing it with a hidden Markov model. The distribution over classes is just going to be the relative frequency: the number of instances in our data that correspond to each class, divided by the total. So it's just the relative frequency count of what we observe in our data. The other parameters, which correspond to the transition distribution and the emission distribution, also correspond to relative frequency counts. If instead we have Gaussian emission distributions, which was the case here with the smart walker application (in fact, for most applications in practice the data tends to be continuous, so you would use Gaussian emissions), we end up with similar solutions: the initial distribution and the transition distribution are the same, but for the emission distribution, because we've got Gaussians, you end up with mean and variance parameters that correspond to the empirical mean and the empirical variance of your data. So it's very similar to what we talked about for mixtures of Gaussians: with mixtures of Gaussians we essentially had the same solutions for the initial distribution and the emission distribution. What's new is that we also have a transition distribution, but it too corresponds to a notion of relative frequency count, where you count how many times you observe a certain transition and divide by the total number of transitions. Maximum likelihood naturally gives us those solutions.
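To make the closed-form solutions above concrete, here is a minimal sketch (not the code from the lecture or the walker project) of supervised maximum-likelihood estimation for an HMM with the joint distribution P(y1) ∏ P(yt | yt−1) ∏ P(xt | yt). It assumes labeled sequences and one-dimensional Gaussian emissions for simplicity (the walker data actually had eight channels); the function name and data layout are illustrative assumptions.

```python
import numpy as np

def supervised_hmm_mle(sequences, n_states):
    """Closed-form MLE for an HMM from fully labeled sequences.

    Each sequence is a pair (x, y): x is a (T,) array of 1-D observations,
    y is a (T,) array of integer state labels in {0, ..., n_states-1}.
    Returns the initial distribution pi, transition matrix theta, and
    Gaussian emission parameters (means, variances). As in the lecture,
    pi and theta are relative frequency counts, and the emission
    parameters are empirical means and variances per state.
    """
    pi = np.zeros(n_states)                 # initial-state counts
    theta = np.zeros((n_states, n_states))  # transition counts
    obs = [[] for _ in range(n_states)]     # observations grouped by state

    for x, y in sequences:
        pi[y[0]] += 1                       # count which state starts the sequence
        for t in range(len(y) - 1):
            theta[y[t], y[t + 1]] += 1      # count each observed transition
        for xt, yt in zip(x, y):
            obs[yt].append(xt)              # collect observations per state

    pi /= pi.sum()                                 # relative frequency of initial states
    theta /= theta.sum(axis=1, keepdims=True)      # row-normalize transition counts
    means = np.array([np.mean(o) for o in obs])    # empirical mean per state
    variances = np.array([np.var(o) for o in obs]) # empirical variance per state
    return pi, theta, means, variances
```

For example, two toy sequences labeled with two states would yield a pi that counts which state each sequence starts in, a theta that counts observed transitions row by row, and per-state empirical moments. For multinomial (binary v1/v2) emissions as in the two-class example, the only change is that the emission estimate becomes a relative frequency count of v1 versus v2 within each state instead of a mean and variance.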
Info
Channel: Pascal Poupart
Views: 5,942
Rating: 5 out of 5
Id: 9EHWTHJkqUY
Length: 61min 30sec (3690 seconds)
Published: Thu Jul 04 2019