Pascal Poupart, Sum Product Networks, Oct 17 2017

Captions
All right, good afternoon everyone. I'll be talking about sum-product networks, and here's the outline for today. I'm first going to introduce what sum-product networks are, then we'll see how we can use them to do inference. We'll talk about some applications, and then we'll get into more depth: I'll explain the relationships with other types of probabilistic graphical models such as Bayesian networks, we'll talk about specific algorithms for parameter estimation as well as structure estimation, and then how we can do some of the computation in an online and distributed fashion. As I get going, feel free to interrupt me at any time — I welcome questions, and it'll be more interesting if you stop me and ask questions along the way.

So what is a sum-product network? A sum-product network is essentially a deep neural network with certain properties; in fact you can think of it as a restricted class of deep neural networks where the internal nodes are only sums and products. This was proposed in 2011 by Pedro Domingos at the University of Washington, and since then there has been a lot of interest, especially here at the University of Waterloo, where my research group as well as Ali's research group have been advancing the state of the art. There are some interesting reasons for that, as we'll see today: sum-product networks have clear semantics, they have interesting properties from a probabilistic and statistical perspective, and they do not suffer from several of the problems that other types of deep neural networks have.

In any case, you've got here a picture of a sum-product network. Beyond the fact that we have sums and products, the other thing to notice is that the network itself is a directed acyclic graph, and in the leaves we have indicator variables or, more generally, univariate distributions. This is different from other types of deep neural networks, where normally you just feed in features. The indicator variables here can indeed be features, just like in any other neural network, but more generally they can be univariate distributions, which will allow us to think of the network as a form of hierarchical probabilistic model.

So in terms of introducing sum-product networks, I've already said we can view them as a type of neural network, but one that comes with clear semantics — we're going to see this in a moment. The other view is that we can think of them as a probabilistic graphical model, just like Bayesian networks and Markov networks, but with the nice advantage that they form a class of tractable probabilistic graphical models.

All right, let's delve into the details, starting from the deep neural network perspective. Some of you might say: we've got sums and products here, but how is that really a neural network? The classical view of a neural network is that every node computes a linear combination of its inputs followed by a nonlinear activation function. So if I want to see every node here as a unit, as in any other deep neural network, then what is the nonlinear activation function, and how can I see each node as a neuron just like in any other neural network?
The answer is that I can rewrite the computation. A sum node just takes a linear combination of its inputs — here, with a plus node and weights 2 and 8, it computes 2 times this child plus 8 times that child — so it looks like a linear combination with no activation function. What I can do is work in the log space: apply a log activation to bring the sum node's output into the log space, and then reinterpret each product node as simply a sum in the log space, because the log of a product is the sum of the logs. So product nodes become sums of their inputs, and afterwards I can use an exponential activation to get back out of the log space into the regular space. With this view we can show that, mathematically, a network of sums and products is the same as a neural network that alternates between layers with a log activation and layers with an exponential activation.

[Question] Ah, very good question. That is not the main purpose, but it is actually a very good purpose: numerical stability. When we implement sum-product networks, the natural thing is to code up the sums and products directly, and then you run into underflow, overflow and so on. Once you work in the log space you avoid all of those problems, and in fact all the stable implementations of sum-product networks do use this transformation and work in the log space. For instance, we're going to see in a moment that a sum-product network can be used to compute probabilities. These are between 0 and 1, but when you multiply lots of probabilities together you can get a number so close to 0 that at some point the computer just rounds it down to 0, and that ruins everything afterwards. If you work in the log space, the log of that number is simply a very large negative number, which you can work with without any problem. In other cases the network might not compute probabilities but just some scalar, and with lots of sums the numbers can grow; again, working in the log space prevents this kind of blow-up. This is common in other types of neural networks too, but here this transformation essentially mitigates the problem.

[Question] Okay, that's a great question, and this is the right time to answer it. Sum-product networks essentially encode multilinear functions. The parameters are the weights — the numbers on the edges — and we have weights only on the edges below the sum nodes, so that we can interpret each sum as taking a weighted combination of its inputs. If we think of the sum-product network as a function of the inputs, then we can show that it is a multilinear function of the inputs.
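Going back to the log-space implementation point for a second: here is a minimal sketch (not code from the talk — the tiny network and weights are made up for illustration) of evaluating an SPN bottom-up in log space, where product nodes add log-values and sum nodes take a log-sum-exp of log-weights plus children's log-values.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def eval_log(node, leaf_log_values):
    """Evaluate an SPN bottom-up in log space.

    node: ("leaf", name) | ("prod", [children]) | ("sum", [(weight, child), ...])
    leaf_log_values: dict mapping leaf name -> log of its value (log 1 = 0.0, log 0 = -inf).
    """
    kind = node[0]
    if kind == "leaf":
        return leaf_log_values[node[1]]
    if kind == "prod":                      # product node -> sum of logs
        return sum(eval_log(c, leaf_log_values) for c in node[1])
    if kind == "sum":                       # weighted sum -> log-sum-exp of (log w + log child)
        return logsumexp([math.log(w) + eval_log(c, leaf_log_values)
                          for w, c in node[1]])
    raise ValueError(kind)

# A tiny made-up SPN over one binary variable x1, with indicator leaves x1 and nx1 (= not x1).
spn = ("sum", [(2.0, ("leaf", "nx1")), (8.0, ("leaf", "x1"))])

NEG_INF = float("-inf")
# Query x1 = true: indicator x1 -> 1 (log 0.0), indicator not-x1 -> 0 (log -inf).
print(math.exp(eval_log(spn, {"x1": 0.0, "nx1": NEG_INF})))   # 8.0 (unnormalized)
# Normalization constant: set every indicator to 1.
print(math.exp(eval_log(spn, {"x1": 0.0, "nx1": 0.0})))       # 10.0
```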
So it is a restricted class in that sense. We can make the class richer: with some shared weights or shared components we can obtain polynomials, and at that point it is actually a very rich class — there is already work showing that polynomials of high enough degree with enough components can approximate arbitrary functions. Any other questions? [Question] Right — here I'm assuming binary variables just to keep the example simple, which is why there is an indicator for not x1 and one for x1. But I could have arbitrary discrete variables, and in fact the leaves could be continuous univariate distributions such as a Gaussian, a Poisson, or anything from the exponential family.

In terms of advantages, we're going to see in a moment that sum-product networks provide clear semantics. I'll explain this in more detail, but basically it comes from the fact that they compute probabilities, and we can interpret what each node computes as a probability with respect to some subset of the features. We'll also see that sum-product networks are naturally generative models. There has been a lot of interest in recent years in generative models in deep learning, and a lot of work on taking non-generative models and modifying them to obtain something generative; with sum-product networks you don't have to do anything — they are naturally generative, and we'll see in a moment how that works. Then we'll talk about training. We can use many techniques beyond gradient descent: most packages out there offer optimizers that are typically based on gradient descent, but we'll see that we can also use the expectation-maximization algorithm, and I'll show how parameter estimation can be formulated as an optimization problem — a type of signomial program — for which we can compute various relaxations that lead to more effective algorithms with superlinear convergence, better than gradient descent. Finally we'll talk about structure estimation: the network has a certain graph structure that defines the computational graph, and where does that come from? We have algorithms that estimate a good graphical structure for the network from data.

All right, now let's move on to the probabilistic graphical model view. How many of you are already familiar with Bayesian networks or Markov networks? Okay, not everyone — that's not the end of the world. In the space of graphical models, the most popular probabilistic graphical models are Bayesian networks and Markov networks. They are typically used to represent a joint distribution over a set of variables, and this set can be quite large. What makes them interesting is that they can encode direct dependencies in terms of the edges between nodes, or they can encode correlations, again based on the edges between nodes. Probabilistic graphical models have been around for a long time and have been quite successful at modeling many things, but they tend to have an important drawback.
In general, if you want to do inference, it can be intractable. By inference I simply mean asking: what is the probability of one variable given some evidence? This is a basic query, and answering it when there are many variables with various probabilistic dependencies can be intractable — in fact it is known to be #P-hard. Who knows what #P is? Not too many. Who has heard of NP-hardness? Very good. NP-hardness is often used to define the boundary between tractability and intractability: it means there are problem instances that may take at least exponential time — at least, at the moment we don't have algorithms that are always polynomial time. Now #P is even harder than NP. The intuition is this: a classic NP problem is satisfiability, where you have a logical formula and you want to find a satisfying assignment; #P is essentially the counting problem, where you count how many satisfying assignments there are. You can see intuitively right away that this is harder, because not only do you have to find a satisfying assignment, you have to find all of them and count them. How is that related to inference? When you do inference, you are essentially counting all the joint assignments with their weights, and the answer to a query is essentially the sum of the probabilities of all those satisfying assignments. That's the relationship.

So for traditional probabilistic graphical models, in general we have this bottleneck: we can't do inference tractably, so we have to approximate. With sum-product networks, what's interesting is that we still have a probabilistic graphical model, but with a different view: the nodes do not necessarily correspond to random variables and the edges do not correspond to correlations or probabilistic dependencies. Instead, the nodes correspond to operations and the edges define the computational graph — how the computation is carried out. That is a departure from those networks, but it is very much in line with what happens in deep neural networks, where the network is also just defining a computational graph. And for inference, what's interesting is that this class of networks is guaranteed to be tractable: we'll see in a moment that answering probabilistic queries with a sum-product network takes polynomial time — in fact we can compute the exact answer in time linear in the size of the network. Any questions about this? All right, let's keep going.
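To make the weighted-counting view of inference concrete, here is a minimal sketch (my own toy example, with a made-up joint distribution) of computing a marginal by brute-force enumeration — which is exactly what becomes exponential as the number of variables grows:

```python
from itertools import product

def joint(assignment):
    """A made-up joint distribution over three binary variables (x1, x2, x3),
    given as an explicit table purely for illustration."""
    table = {
        (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.20, (0, 1, 1): 0.05,
        (1, 0, 0): 0.15, (1, 0, 1): 0.05, (1, 1, 0): 0.25, (1, 1, 1): 0.15,
    }
    return table[assignment]

# P(x1 = 1): sum the probabilities of every joint assignment consistent with x1 = 1.
# With n binary variables this enumerates 2^(n-1) assignments -- the source of intractability.
p = sum(joint((1, x2, x3)) for x2, x3 in product([0, 1], repeat=2))
print(p)  # 0.60
```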
Now let's talk a bit more about how we do inference, and how a sum-product network represents a joint distribution over several random variables. Here is a very simple example with two binary variables, x1 and x2. The network shows the computation I would do from the leaves to the root, and it turns out that, depending on how I initialize the leaves, I can compute the answer to a probabilistic query. For instance, if I'd like to know the probability that x1 is true and x2 is false, I simply set the leaves accordingly: with x1 true I set the indicator for x1 to 1 and the indicator for not x1 to 0, and with x2 false I set the indicator for not x2 to 1 and the indicator for x2 to 0. I start with this as input, propagate upwards, get numbers for the intermediate nodes, and finally at the root I get 34.8. Now you might say: that's not a probability, so what's going on? It turns out this is the numerator of the probability; we're going to divide by a normalization constant to get a probability. So this is a first pass, and I'm going to do a second pass through the network to obtain the normalization constant. Intuitively, the normalization constant sums the contributions of all possible worlds — all joint instantiations of x1 true or false and x2 true or false. To do this, I set all the leaves to 1: for binary variables, setting both indicators of a variable to 1 corresponds to marginalizing out that variable. Then I do a pass in the same way, I get 100 at the root, and there we go — that's my normalization constant.

[Question about continuous variables] That's a great question. What I've effectively used here are Bernoulli variables, but we can replace them with Gaussian distributions. With a Gaussian leaf, instead of an indicator I would evaluate the density of the observed value under that distribution, which gives a number different from 0 or 1 — that's the natural way to handle univariate distributions at the bottom. And when I want to marginalize such a leaf, I do the same trick and set it to 1, because marginalizing a distribution means integrating out that variable, and that integral is 1.

[Question about why setting leaves to 1 marginalizes] This is because what I'm effectively computing is the probability of x1 true and x2 true, plus x1 true and x2 false, plus x1 false and x2 true, plus x1 false and x2 false — all of those combinations simultaneously. It might not be apparent, but setting all the leaves to 1 is effectively doing that. This works in the binary case, and in fact it works whenever the leaves are distributions, because marginalizing or integrating out a variable at a leaf that is a univariate distribution gives 1 by default.

So with this example it should have become clear that to answer the query I did two passes through the network. As a result, answering a query takes time linear in the size of the network, so the complexity is definitely tractable — in fact it's just linear.
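Here is a small sketch of that two-pass procedure on a made-up two-variable SPN (the structure and weights are invented for illustration and do not reproduce the network on the slide); the point is only that a joint query and the normalization constant are each one bottom-up pass.

```python
def eval_spn(node, leaves):
    """Evaluate an SPN bottom-up (plain space for clarity; log space is preferred in practice)."""
    kind = node[0]
    if kind == "leaf":
        return leaves[node[1]]
    if kind == "prod":
        value = 1.0
        for child in node[1]:
            value *= eval_spn(child, leaves)
        return value
    if kind == "sum":
        return sum(w * eval_spn(child, leaves) for w, child in node[1])
    raise ValueError(kind)

# Made-up SPN over two binary variables, with indicator leaves x1/nx1 and x2/nx2.
s1 = ("sum", [(2.0, ("leaf", "nx1")), (8.0, ("leaf", "x1"))])
s2 = ("sum", [(4.0, ("leaf", "nx2")), (6.0, ("leaf", "x2"))])
spn = ("prod", [s1, s2])

# Pass 1: numerator for P(x1 = true, x2 = false) -- set the indicators to 0/1 accordingly.
num = eval_spn(spn, {"x1": 1, "nx1": 0, "x2": 0, "nx2": 1})
# Pass 2: normalization constant -- set every indicator to 1 (marginalizes all variables).
Z = eval_spn(spn, {"x1": 1, "nx1": 1, "x2": 1, "nx2": 1})
print(num, Z, num / Z)   # 32.0 100.0 0.32
```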
Now let's look at the semantics of the network. It turns out that in general we can interpret every node as computing the probability of some subset of the features that are part of the input, and this interpretation is valid as long as I satisfy two conditions, known as decomposability and completeness (completeness is also called smoothness). In general, if you just make up an arbitrary network of sums and products and apply the algorithm I showed you before, you can compute something, but will it be a correct answer, in the sense of giving a probability that is part of a consistent joint distribution, for every query? In general, no — we have to satisfy these conditions.

To define them I first need the notion of scope: the scope of a node is the set of variables in the sub-SPN rooted at that node. For example, the scope of the leaf x1 is obviously x1; the scope of this sum node is also x1; the scope of this product node includes x1 as well as x2; and so on. With scope defined, a product node is decomposable as long as its children have disjoint scopes, and a sum node is complete (or smooth) as long as its children have identical scopes.

You might wonder why on earth we need those conditions — why are they defined this way? There is a simple interpretation. If a sum node's children have the same scope, I can interpret the sum node as a mixture of its children. Here, for instance, each child is a univariate distribution over x1: this one is a degenerate distribution saying x1 is false, that one a degenerate distribution saying x1 is true, but both are univariate distributions over x1. When I take 2 times one of these distributions plus 8 times the other, I'm effectively computing a mixture — a weighted combination which, once the weights are normalized, really is just a mixture. I can get this mixture interpretation only if the children have the same scope: if one child were a distribution over x1 and the other a distribution over a different variable x2, how could I take a mixture of distributions with different scopes? That would not make sense; but if they have the same scope, it's a regular mixture and everything works out nicely.

For product nodes, the reason we require decomposability is that we interpret a product node as taking the product of marginal distributions with disjoint scopes, so that multiplying them gives a joint distribution that simply happens to be factored, assuming a form of independence. This product node up here has two children, one with scope x1 and one with scope x2, so I'm effectively taking the product of two marginals, one over x1 and one over x2, and the result is a joint distribution that factorizes with respect to x1 and x2. If the children did not have disjoint scopes, I would be multiplying two distributions that overlap; there might be ways to make a valid distribution out of that, but I could no longer think of it as simply a product of two marginals. That's why we require decomposability.
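As a sketch of how one might check these conditions mechanically (my own illustration, not code from the talk): compute each node's scope recursively, then test that every sum node's children share an identical scope (completeness/smoothness) and every product node's children have pairwise disjoint scopes (decomposability).

```python
def scope(node):
    """Set of variables appearing in the sub-SPN rooted at this node."""
    kind = node[0]
    if kind == "leaf":                          # ("leaf", variable, value), e.g. the indicator [x1 = True]
        return {node[1]}
    children = node[1] if kind == "prod" else [c for _, c in node[1]]
    out = set()
    for child in children:
        out |= scope(child)
    return out

def is_valid(node):
    """Completeness for sum nodes, decomposability for product nodes, checked recursively."""
    kind = node[0]
    if kind == "leaf":
        return True
    children = node[1] if kind == "prod" else [c for _, c in node[1]]
    if kind == "sum":                           # complete/smooth: identical scopes
        scopes = [scope(c) for c in children]
        if any(s != scopes[0] for s in scopes):
            return False
    if kind == "prod":                          # decomposable: pairwise disjoint scopes
        seen = set()
        for c in children:
            s = scope(c)
            if seen & s:
                return False
            seen |= s
    return all(is_valid(c) for c in children)

# Valid: a product of a mixture over x1 and a mixture over x2.
valid = ("prod", [("sum", [(0.2, ("leaf", "x1", False)), (0.8, ("leaf", "x1", True))]),
                  ("sum", [(0.4, ("leaf", "x2", False)), (0.6, ("leaf", "x2", True))])])
# Invalid: one x2 leaf replaced by an x1 leaf, breaking completeness and decomposability.
invalid = ("prod", [("sum", [(0.2, ("leaf", "x1", False)), (0.8, ("leaf", "x1", True))]),
                    ("sum", [(0.4, ("leaf", "x1", False)), (0.6, ("leaf", "x2", True))])])
print(is_valid(valid), is_valid(invalid))       # True False
```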
[Question about an invalid structure] As an example of an invalid scope: suppose I change this leaf from not-x2 to not-x1. Then that sum node would mix something over x1 with something over x2, which does not satisfy the completeness property, and the product above it would multiply something over x1 with something over x1 and x2, which does not satisfy the properties either — so the network would be invalid. [Question] Right — decomposability and completeness are the properties; scope itself is not a property, it's just a definition I use to state decomposability and completeness/smoothness. [Question about the weights] No, that's not a requirement — that was just for the sake of a simple example, because it's easy for me to think in terms of weights that sum to 10. There's no deeper reason.

So with this it becomes clear that sum nodes and product nodes are computing probabilities. Once I normalize by the normalization constant — I wrote here that this node would normally compute 8, but when we divide by the 10 from the second pass we get a probability of 0.8 — that's what this node is effectively computing. As a result, we can now understand much better what is being computed than in other types of neural networks. There, you compute some number, and what does that number represent? Who knows — it's always difficult, and at least there's no systematic way. Sometimes you can plot something that describes how a unit's value relates to the input, but there isn't a nice mathematical way of interpreting the numbers: to interpret them you'd normally need a scale and some units. Here we do have that: all these numbers are probabilities, and they are probabilities with respect to the node's scope, so we know what we're computing.

[Question: why is decomposability only for product nodes?] Right — decomposability is only for product nodes; for sum nodes we want completeness/smoothness instead, because that allows us to interpret them as mixtures. If we wanted to, we could ask whether a sum node is decomposable, but the answer would always be no: for a sum node we want identical scopes, and if the scopes are identical they cannot be disjoint. We could apply the definition, but it just wouldn't be interesting.

All right, now that we understand the semantics, the other interesting thing is to compare how computation works in a regular neural network versus a sum-product network. In a regular neural network you have some inputs, you go through the computational graph, and you produce some outputs; essentially the neural network is a function f from the inputs to the outputs — it's just a function. The issue is that this function is rigid, in the sense that you can't really change it.
The network is typically trained with input-output pairs to find a good function approximating whatever is needed to go from the inputs to the outputs, so you get f, or at least an approximation of f. But if you want to query the network differently — change what counts as input and what counts as output — then you have a problem: it wasn't trained for that purpose, and it's not flexible. With sum-product networks there is a lot of flexibility: I can define the inputs and the outputs at query time, not at training time. I train the network — we'll see how in a moment — and once I have it I can answer arbitrary queries of the form: the probability of some variable given some evidence. You can think of the evidence as the inputs, and the variable being queried is essentially the output.

So if I want x1 as input and x2 as output, I simply set the values of x1 and x2 in a way that allows me to compute that. A conditional query like this can be rewritten as a ratio: a joint query divided by a marginal. For the joint query — x2 false and x1 true — we've already seen how to do this: set the indicators accordingly, do a bottom-up pass, and that gives the numerator. For the denominator I do another pass, this time for the marginal of x1 true: I set the indicator for x1 to 1 and the indicator for not x1 to 0, and because I want to marginalize out x2, I set both the indicator for x2 and the indicator for not x2 to 1. Whenever I want to marginalize out a variable, that's what I do — set both of its values to 1 — and then another bottom-up pass gives me the denominator. This is very interesting because there is no notion of the inputs being at the bottom and the outputs at the top: in a sum-product network all the variables are always at the bottom, and at query time you decide what is really going to be the input — usually your evidence — and what is going to be the output — your query variable — and you can answer any query in this fashion. Any questions? All right, good.
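As a sketch of that query-time flexibility (again a toy network of my own, not the one on the slide): a conditional query is just the ratio of two bottom-up passes, and marginalizing a variable means setting both of its indicators to 1.

```python
def eval_spn(node, leaves):
    kind = node[0]
    if kind == "leaf":
        return leaves[node[1]]
    if kind == "prod":
        value = 1.0
        for child in node[1]:
            value *= eval_spn(child, leaves)
        return value
    return sum(w * eval_spn(child, leaves) for w, child in node[1])   # sum node

# Toy SPN over binary x1, x2 (indicator leaves x1/nx1 and x2/nx2).
spn = ("prod", [("sum", [(2.0, ("leaf", "nx1")), (8.0, ("leaf", "x1"))]),
                ("sum", [(4.0, ("leaf", "nx2")), (6.0, ("leaf", "x2"))])])

# P(x2 = false | x1 = true) = P(x1 = true, x2 = false) / P(x1 = true)
joint    = eval_spn(spn, {"x1": 1, "nx1": 0, "x2": 0, "nx2": 1})   # numerator pass
marginal = eval_spn(spn, {"x1": 1, "nx1": 0, "x2": 1, "nx2": 1})   # x2 marginalized out
print(joint / marginal)   # 0.4
```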
Now let me highlight the relationships between sum-product networks and other probabilistic graphical models. We've seen Bayesian networks and Markov networks, and it turns out we can take any sum-product network and convert it into a Bayesian network. Here are two examples. This sum-product network is equivalent to a naive Bayes model: a naive Bayes model has a set of features X, Y, Z and a hidden class H, and the equivalent sum-product network has this shape. The idea is that the hidden class really defines a mixture over the features: H is replaced by the sum node at the top, and the features are all at the bottom. If I want a product of naive Bayes models, it's very simple — I just duplicate this structure and take the product — and the corresponding Bayesian network would have one naive Bayes model here, another one there, and they would be essentially independent naive Bayes models. That's the representation.

In general, I can take any sum-product network and convert it into a Bayesian network, or likewise a Markov network, and in work we published in 2015 we showed that this conversion can be done without an exponential blow-up. So if anybody is uncomfortable with sum-product networks and would like to know how they correspond to other types of probabilistic graphical models, that paper gives a procedure for the conversion. You can also go the other way around, but then there is a blow-up: if you start with a Bayesian network and convert it to a sum-product network, the result may be exponentially larger. This is where there are interesting differences in the representations: a Bayesian network represents the correlations or probabilistic dependencies, whereas a sum-product network represents what we're going to compute. The conversion from a probabilistic graphical model to a sum-product network is simple in the sense that if you run any inference algorithm and simply write down the operations you perform, these operations are naturally sums and products, and if you organize them according to the order in which you would perform them, you obtain a sum-product network. That is one simple way of converting any probabilistic graphical model into a sum-product network, and the reason there can be an explosion is simply that inference in those models can take exponential time, so the corresponding sum-product network can become exponentially large.

To summarize, we can compare sum-product networks with other classes like Bayesian networks and Markov networks based on whether they are compact and tractable. I'll define compact as meaning that the space taken by the representation is polynomial in the number of variables of the joint distribution, and tractable as meaning that the time to do inference for any query is also polynomial in the number of variables. [Question] Yes, I mean memory: if you store the sum-product network or the Bayesian network, how many bytes do you need to represent it? The number of bytes depends on the precise representation, but in terms of the order of complexity, does it grow polynomially or exponentially with the number of variables?

In general, these probabilistic graphical models can all represent arbitrary distributions, so they all have the same expressivity — the outer circle here is the space of all joint distributions over some set of variables, and they can all represent that. The problem is that not all of those models are compact and tractable. The compact ones are obviously a subset: here we have the space of compact Bayesian networks, those you can represent in your computer with an amount of memory that grows only polynomially with the number of variables. The problem is that compact Bayesian networks are not necessarily tractable, so the tractable Bayesian networks form an even smaller subset.
They are the red circle here: tractable Bayesian networks represent distributions that can be represented efficiently and with which we can answer any query tractably. Now compare that to sum-product networks: for sum-product networks, compactness and tractability are the same, because answering a query takes time linear in the size of the network, so as long as the network is compactly representable you can also answer any query tractably. The sum-product networks that are compact — equivalently, tractable — are all in here. [Question] Yes — every tractable Bayesian network corresponds to a compact and tractable SPN. If it's tractable according to the definition, I can answer any inference query in polynomial time; answering an inference query means doing some multiplications and additions, and if I simply cache those operations in the order in which I perform them and write down the resulting expression, that is a sum-product network, and it takes space of the same order as the time it took to do the operations.

[Question] That's a good question. In theory we should be able to design such a procedure. In practice it might be a little more complex, simply because Bayesian networks and sum-product networks are just representations of a distribution, and there can be multiple representations of the same distribution. So sometimes you start with one representation, say a sum-product network, convert it to a Bayesian network, then convert that back to a sum-product network, and you end up with a new representation that is different but encodes the same distribution. That's why, in terms of implementation, you might not get back the same representation, but at least in theory a procedure exists that allows you to do that. [Question] That's a very good point. There is a notion of normal form for sum-product networks, but unfortunately it's not sufficient to obtain something unique. What we would need is a canonical representation, where each distribution has a single representation; the normal form is simply that all the weights are between 0 and 1 and sum to 1 — that's what is known as a normal sum-product network — but that's not sufficient, we'd need something even stronger.

All right, now let's talk about how we estimate parameters from data to obtain the weights of the sum-product network. Graphically, we can think of the problem as follows: I have my data, with a set of attributes and many instances — every column is a data point and every row is a different attribute — and based on this data I'd like to estimate the weights of my sum-product network; each question mark is simply an unknown weight I'd like to estimate. The parameter estimation problem is precisely how to get those weights. There are lots of algorithms one can use, and most of them are based on the principle of maximum likelihood.
You can use stochastic gradient descent; there is also expectation maximization, signomial programming, and several others. You can also use Bayesian learning techniques such as Bayesian moment matching or collapsed variational inference. What I'll do now is explain how we can view this as a form of signomial programming, and after that how we can use Bayesian learning with Bayesian moment matching.

Before I explain the details of the estimation, let me tell you a bit about language modeling, one of the applications. Over the years there have been several applications, including computer vision (image completion), activity recognition, language modeling, speech modeling, and mobile robotics — a wide range. For language modeling, in 2014 an interesting paper was published that proposed a sum-product network to essentially emulate an n-gram model. An n-gram model in natural language processing is a model where you have the last n words and you try to predict what the next word is going to be. This is a basic task in building a language model, and you'd like it to be as accurate as possible. Here we don't necessarily take just the last n words — we can take a window of previous words — feed them through the sum-product network, and compute a distribution over the next word. So this sum-product network was proposed, with some sums and some products; the paper explains the intuition behind the structure (I personally was not able to follow the intuitions, but in any case this is the structure they came up with), and they trained it using a form of gradient descent that did discriminative learning.

They obtained the following results. The sum-product network is down here, and there are alternative techniques that are either based on neural networks or on earlier, purely statistical techniques. What is being compared is the perplexity, so a lower number is better. Perplexity is related to the likelihood of a data point; intuitively it is a measure of how surprised the model is to see a particular data point. If the model is good and can explain the data well, it is not surprised, so in general perplexity should be low — that's what this shows here.
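For reference, here is a hedged sketch of how perplexity is typically computed from a model's per-word probabilities (the probabilities below are made up, and the cited paper may use a slightly different convention, e.g. a base-2 logarithm):

```python
import math

def perplexity(word_probs):
    """exp of the average negative log-likelihood per word: lower means the model
    is less 'surprised' by the data."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Made-up per-word probabilities assigned by two hypothetical models to the same text.
good_model = [0.20, 0.10, 0.30, 0.25]
bad_model  = [0.02, 0.01, 0.05, 0.03]
print(perplexity(good_model))   # ~5.1  (less surprised)
print(perplexity(bad_model))    # ~43   (more surprised)
```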
They report very good results, comparing for instance against latent Dirichlet allocation augmented with a recurrent neural network, a plain recurrent neural network, some more traditional neural networks, and other techniques from information retrieval and natural language processing. [Question] Great question — I did not do this work and I don't remember the details, but the paper has the citation and explains what that recurrent neural network was. At least at the time, yes. We now have some people in the lab actively trying to reproduce those results and, as you can expect with many papers, right now we're not able to reproduce them, so we're not sure what's going on; we're trying to figure out how those results were obtained.

[Question] Right — whenever you have sums feeding directly into sums like this, you could collapse them into a single layer of sums, because it's just a linear combination of a linear combination. Why did they do this? Maybe from a modeling perspective it was easier to comprehend, but from a computational perspective there's absolutely no reason. [Question] That's a good point — if there's no alternation of sums and products you wouldn't take the log twice like that, and I don't know whether that particular implementation really did the computation in the log space or not. This is more to explain the model and how the computation was done; there might be all kinds of tricks that are simply not specified. [Question about the Y's] The Y's up here — the idea is that we're predicting what the next word is going to be, so every word in the dictionary has a corresponding Y. At the very top we have a sum, which is essentially a mixture of the probabilities of all those possible words, and the one that produces the highest probability is the prediction. This is a common trick: if you want to predict a specific variable, normally all the variables are at the bottom, but you can have links from near the top to a set of leaves that you place near the top, and those are the variables you're trying to predict — here, the next word.

All right, so that was an application. Now let me go back to the question of how we do parameter estimation. With maximum likelihood, the objective is to maximize the probability of each input x given the weights, and the weights are what we're trying to optimize. Mathematically it decomposes into the first pass, which gives us the numerator, and the second pass, which gives us the normalization constant. This optimization problem is not an easy one: we can show that it is non-convex. In this formulation the weights in general have to be positive — if we had negative weights we could not interpret a weighted combination as a mixture — so that is one of the constraints. It might not be apparent, but you can show the problem is non-convex, and this comes from the sum nodes, which introduce mixture distributions; whenever you have latent variables with mixtures, that usually leads to a non-convex optimization problem.
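Written out (my reconstruction of the objective from the description above, with V(x; w) denoting the root value of the first pass on data point x and V(1; w) the value of the normalization pass with all indicators set to 1), maximum likelihood for an SPN is the constrained problem:

```latex
\max_{w \ge 0} \; \sum_{n=1}^{N} \log p(\mathbf{x}_n \mid w)
  \;=\; \max_{w \ge 0} \; \sum_{n=1}^{N} \Big[ \log V(\mathbf{x}_n; w) - \log V(\mathbf{1}; w) \Big]
```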
So how can we tackle this non-convex optimization problem? Last year, with one of my former students, Han Zhao, we had a paper describing various relaxations of this objective that lead to different algorithms. The most popular algorithm, gradient descent, is at the top; here we have to modify gradient descent so that the weights stay positive, which is why it's called PGD — projected gradient descent. It's just gradient descent where we project by clipping everything back to be positive, and you can show that gradient descent is really just making a linear approximation of the objective, so an update based on the gradient is an additive update. That is the simplest approximation you can make. There are others: the exponentiated gradient technique does a multiplicative update instead, and it naturally takes into account the fact that the weights have to be positive, so no projection is needed. Another one is sequential monomial approximation: as I mentioned earlier, the sum-product network can be seen as a polynomial, and that polynomial can be approximated by a sequence of monomials, which leads to another algorithm. Finally, the objective can be seen as a difference of two convex functions, so we can use a concave-convex procedure in which we construct a concave lower bound — and when we do this, it corresponds exactly to the expectation-maximization algorithm. So we had a paper describing how all of these algorithms arise.

In terms of results, this is what we obtain when we compare the algorithms. As you might expect, gradient descent is the simplest algorithm but it doesn't perform so well. All of these plots show the negative log-likelihood being minimized, so lower is better, and we'd like to converge as fast as possible. In red we have projected gradient descent, which does not perform very well; at the other extreme we have the concave-convex procedure, which corresponds to the EM algorithm and converges much faster — in fact in this case you can show it has a superlinear convergence rate. So this shows that we can obtain different types of algorithms — we don't have to stick with gradient descent — and we can obtain much faster convergence depending on what we use. [Question] That's a good question — can we do the parameter estimation just on the nodes at the bottom first? I'm sure we could do something local like that, but most of the algorithms, including gradient descent, do a complete pass where information is propagated from the root down and then back up again. You could do something local, but all of these techniques tend to be global in the sense of using information from throughout the network.
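To make the difference between the update rules concrete, here is a minimal sketch of a projected (additive) update versus an exponentiated (multiplicative) update on the weights of one sum node, given a gradient of the negative log-likelihood; the gradient values and step size are placeholders, not numbers from the paper.

```python
import math

def pgd_update(weights, grad, lr=0.1, eps=1e-6):
    """Projected gradient descent: additive step, then clip to keep the weights positive."""
    return [max(w - lr * g, eps) for w, g in zip(weights, grad)]

def eg_update(weights, grad, lr=0.1):
    """Exponentiated gradient: multiplicative step, positivity preserved automatically;
    renormalize so the sum node's weights stay a distribution (normal form)."""
    updated = [w * math.exp(-lr * g) for w, g in zip(weights, grad)]
    total = sum(updated)
    return [w / total for w in updated]

weights = [0.2, 0.3, 0.5]          # weights of one sum node (normal form: they sum to 1)
grad    = [0.4, -0.1, -0.3]        # placeholder gradient of the negative log-likelihood
print(pgd_update(weights, grad))   # additive update, clipped at a small positive value
print(eg_update(weights, grad))    # multiplicative update, still positive and normalized
```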
Now let me move on and talk about what happens if we have streaming data. You can use a sum-product network, or a neural network, to estimate a model from data, but with streaming data you would like to be able to update the model after you receive each data vector. If you're doing traffic classification, for instance, there are flows of data packets between different computers and you might want to make predictions based on that — traffic happens continuously, so you get data continuously. If you're doing recommendation, each time a user is presented with a recommendation and either accepts it or not, that's a new data point that can be used to update your model, and you'd like to simply update rather than recompute from scratch. So the challenge is: how can we update a model when we receive data one point at a time? This is interesting because with large models you don't want to recompute everything from scratch — you just want to update on the fly. Here I'll describe an online technique for sum-product networks that allows us to update the model, or at least the weights, on the fly as we receive the data. Being able to do things in an online fashion like this is interesting not only for sequential data but also for scalability: you can stream the data rather than loading everything at once, and you may also be able to distribute some of the data.

In classic neural networks you can use stochastic gradient descent, and that algorithm lets you update after processing every data point; there's no issue with that, except that gradient descent is not a very good algorithm. The other algorithms I mentioned earlier — the exponentiated gradient algorithm and the EM algorithm we just discussed — can all be turned into online algorithms, but they all suffer from the fact that they lose information when they process data in mini-batches. Normally these algorithms have some data, do updates based on each data point, and then make multiple passes through the data. You can turn them into online algorithms by splitting the data into mini-batches and feeding one mini-batch at a time, doing some updates, and moving on to the next mini-batch — but when you move on, you forget about the previous data, and that means information loss. So an interesting question is: can we do better?

Here we're going to get some help. Does anybody recognize who this is? Very good — this is a portrait of Thomas Bayes, or at least we don't know for sure whether the portrait is really of him, but it's all over the web, and whenever people talk about him this is the picture they show. So we've got Thomas Bayes — what is he famous for? Bayes' theorem. You might ask why we care about Bayes' theorem at this stage: Bayes proposed his theorem back in 1764, we're now in 2017, and how can something so old still be relevant today? Maybe it's foundational, so that's why it's relevant, but is there something about Bayes' theorem that has become quite interesting and gives us an advantage just recently? The answer is that we can in fact use Bayes' theorem to do parameter estimation in an online fashion with streaming data without any approximation. This is pretty neat, and it turns out we can also do distributed computation very naturally based on Bayes' theorem. There are lots of algorithms based on mini-batches, and whenever they use mini-batches there's information loss; I'm going to show you how to do online learning and distributed computation without information loss. In fact, people are now starting to appreciate this.
There was a paper in 2013 by Broderick and colleagues that essentially highlighted this in the community; most people still don't realize how important a result this is, but it is starting to become more and more understood. Now let me show you mathematically what I mean. For online learning from streaming data, if we're doing Bayesian learning, we normally start with a prior and update it to obtain a posterior based on the likelihood of the data — that's Bayes' theorem, which takes us from a prior to a posterior using the likelihood. If I want to do this computation with streaming data, I get one data point at a time: I start with my prior, and when I get my first data point I update to a posterior based on the likelihood of that first data point. When I get my second data point, I revise again — I just multiply my previous posterior by that likelihood and get a new posterior — and I can keep going like this. Because Bayes' theorem naturally factorizes — the likelihood of the data factorizes whenever you make the assumption that the data is i.i.d. — it means you can do online learning naturally using Bayes' theorem. This is beautiful.

[Question] Yes — when I said i.i.d., what I really mean is that the data should be independent conditioned on your model. If the data is indeed independent given the model, we're fine, and most models do make that assumption. You're right that with a sequential process one data point may be correlated with the previous one, and that's true regardless of the model because the data is sequential, so we might have to make further assumptions to break that. [Question] Yes — I did not include a normalization factor here, but it can always be computed on the fly, and it doesn't really affect the result. It depends on the goal: if your goal is classification, you're trying to make a prediction, and then you don't even need the normalization constant, because you just want to know the relative magnitude for every possible class. If you do need the normalization constant at every step, in the context of a sum-product network it corresponds to one extra pass; that extra computation does not depend on how much data you've seen before, only on the size of the model. [Comment] That's a fair point — if the model is, say, a Bayesian network or Markov network where the normalization constant is not easy to obtain, then yes, there would be some issues.

Now, for distributed computation something similar happens. I can take the same equation for the posterior and simply partition my data: the first core does this part of the computation, the second core that part, the third core another part, and afterwards I just multiply everything together.
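A minimal illustration with a conjugate model (a Beta-Bernoulli coin, not an SPN — the SPN case uses Dirichlets, which is where the intractability will come in): processing the data one point at a time, or splitting it across workers and combining, gives exactly the same posterior as a single batch update.

```python
def update(prior, data):
    """Beta(a, b) posterior after observing a sequence of 0/1 outcomes."""
    a, b = prior
    return (a + sum(data), b + len(data) - sum(data))

data = [1, 0, 1, 1, 0, 1, 1, 1]
prior = (1.0, 1.0)                                   # uniform Beta(1, 1) prior

# Batch update.
batch = update(prior, data)

# Streaming: fold in one data point at a time.
streaming = prior
for x in data:
    streaming = update(streaming, [x])

# Distributed: worker 1 gets the prior, worker 2 contributes only its data counts
# (equivalent to a flat prior); combining posteriors = adding the accumulated statistics.
w1 = update(prior, data[:4])
w2 = update((0.0, 0.0), data[4:])
combined = (w1[0] + w2[0], w1[1] + w2[1])

print(batch, streaming, combined)                    # all three are Beta(7, 3)
```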
You'll notice that the first core gets the prior and the other cores don't, but having no prior is essentially the same as starting with a uniform prior. If you wanted to make sure there's a prior everywhere, you could put your true prior on the first core and give the others a uniform (constant) prior, so that you're still effectively doing Bayesian learning; then you multiply everything together and there's no problem. So this is beautiful in theory — unfortunately, exact Bayesian learning will be intractable, so we're going to have to make some approximations. But at least in theory there's no issue: the questions of how to handle streaming data and do distributed computation are already answered, and the question becomes whether we can preserve those properties as well as possible once we approximate.

When we do exact Bayesian learning, we have a prior, we multiply by the likelihood, and that gives us a posterior. Depending on the underlying model — for instance, for sum-product networks over binary variables we could use the Dirichlet family — the posterior becomes an exponentially large mixture of Dirichlets. That becomes problematic because the mixture grows by a factor that depends on how much data I have — in fact it grows exponentially with the amount of data — so it's not going to scale.

For this we're going to need some more help. Who recognizes this other famous statistician? No? I guess his picture is not as widely used as Thomas Bayes'. It's not Bayes, it's not Box... his last name starts with P... yes, Pearson. What is Karl Pearson famous for? There's the correlation test, and the p-values, and so on, but it turns out Pearson also made some very important contributions to parameter estimation. When it comes to estimating parameters from data, if you take any statistics book, one of the first techniques described is the method of moments. With the method of moments, say you have some data and you'd like to estimate a Gaussian from it: you simply compute the empirical mean of your data and the empirical variance of your data, and then declare that the mean and the variance of your Gaussian are that empirical mean and empirical variance. It's called the method of moments because many distributions can be described by a set of moments, so estimating the distribution really corresponds to estimating the moments from your data and setting the moments of your model accordingly.
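A one-screen sketch of that textbook method-of-moments fit for a Gaussian (the data points here are made up):

```python
data = [2.1, 1.9, 3.2, 2.8, 2.4, 3.0, 1.7, 2.9]

n = len(data)
mean = sum(data) / n                                   # first moment
variance = sum((x - mean) ** 2 for x in data) / n      # second central moment

# Method of moments: set the model's moments equal to the empirical moments.
mu_hat, sigma2_hat = mean, variance
print(mu_hat, sigma2_hat)                              # 2.5 0.27
```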
But that optimization would be non-convex, might get stuck in a local optimum, and would not necessarily recover the true underlying parameters, because that would require convergence to the global optimum. Now, using the method of moments, we can formulate things differently, and we have been able to show that with enough data, in fact in the limit of an infinite amount of data, we can recover the true underlying parameters. So this is something quite powerful and quite important, both from a theoretical and a practical perspective.

In this work, what we did is combine Bayesian learning with the method of moments. That might not sound like a good recipe, because we are essentially taking a frequentist and a Bayesian and asking them to work together; but as practitioners we're not going to worry too much about this, and it turns out to be an interesting approach that can yield some good results. By combining these two techniques, we obtain an algorithm that is online and distributed, based on what we've seen for Bayesian learning, and it is also going to be tractable. The tractability will come from the method of moments, because we approximate our intractable posterior by matching some moments, obtaining a projection onto a tractable family of distributions.

The general approach is that we will obtain a mixture of products of Dirichlets that can be exponentially large, and at every step I'm going to project it down to a single product of Dirichlets by matching some moments, obtaining something tractable. The moments I'm going to match are just the first- and second-order moments. Graphically, the algorithm works as follows: I start with a prior, let's say a product of Dirichlets; then I do my Bayesian update, so exact Bayesian learning, and I obtain a mixture of products of Dirichlets. I'm not going to let that mixture grow too big: at every step I project it down immediately to a single product of Dirichlets. So this is where I make an approximation, but I make it in a way that matches the first and second order moments of my distribution, so that at least I retain some good statistics. This ensures that I don't have any intractability, and I'll show you in a moment some empirical results that show this works well.

Oh, the Bayes update is just applying Bayes' theorem, where we go from a prior to a posterior: we've got a prior, we have a data point, we compute the posterior, and that's just Bayes' theorem. Okay, I went over this quickly, let me go back. It's based on this equation here: the posterior is the product of the prior (the prior here is just this product of Dirichlets) times the likelihood, and the likelihood in a sum-product network is essentially a large sum of products of the weights. As soon as you've got sum nodes in your sum-product network, you're going to have a sum here, and when I multiply the prior by the likelihood, I can essentially collapse the products together, but the sums come out up front, and this will induce a mixture of Dirichlets, and that mixture is just going to keep on increasing and increasing. So that's the algorithm at a high level.
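Here is a minimal sketch (my own illustration) of the projection step described above: the mixture's first- and second-order moments are computed, and a single Dirichlet is chosen to match them. The exact system of equations used in the Bayesian moment matching work may differ; in particular, averaging the per-coordinate estimates of the Dirichlet precision is just one simple choice made here, and the full product of Dirichlets over all sum nodes of a sum-product network is reduced to a single Dirichlet for clarity.

```python
# Minimal sketch: project a mixture of Dirichlets onto a single Dirichlet
# by matching first- and second-order moments (assumptions noted above).
import numpy as np

def dirichlet_moments(alpha):
    """First and second (non-central) moments of a Dirichlet(alpha)."""
    a0 = alpha.sum()
    m1 = alpha / a0
    m2 = alpha * (alpha + 1) / (a0 * (a0 + 1))
    return m1, m2

def project_mixture_to_dirichlet(weights, alphas):
    """weights: shape (K,); alphas: shape (K, d), one row per mixture component."""
    m1 = np.zeros(alphas.shape[1])
    m2 = np.zeros(alphas.shape[1])
    for w, a in zip(weights, alphas):
        c1, c2 = dirichlet_moments(a)
        m1 += w * c1
        m2 += w * c2
    # For a single Dirichlet with precision b0: m2 = m1 * (m1*b0 + 1) / (b0 + 1),
    # which gives b0 = (m1 - m2) / (m2 - m1**2), one estimate per coordinate.
    b0 = np.mean((m1 - m2) / (m2 - m1 ** 2))
    return m1 * b0                       # parameters of the projected Dirichlet

weights = np.array([0.6, 0.4])
alphas = np.array([[2.0, 3.0, 5.0],
                   [1.0, 4.0, 2.0]])
print(project_mixture_to_dirichlet(weights, alphas))
```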
Yes, the way we're doing it is simply by matching the first and second order moments. I'm not showing you the math for this, there are some details, but it can be done simply by solving a system of linear equations.

Okay, let me show you the results and then we're going to conclude, because I think we're out of time. So the results are here; this was some work we did last year. We've got our approach in the middle; the numbers correspond to the likelihood of the data, so higher is better. Then we've got some alternative techniques: stochastic gradient descent, online EM, and online exponential gradient, and a downward arrow indicates that the result is worse than what was obtained by Bayesian moment matching, and worse in a statistically significant way based on the Wilcoxon signed-rank test. We also compared to LearnSPN, which is an approach that is not online, so it can go over the data multiple times, and it is possible for it to obtain better results; in general it does, but this just shows you the gap between an online and an offline technique. These are all data sets that you can download from the web and use for testing.

We did this on larger data sets as well; for instance, you see we've got data sets with hundreds of thousands of variables, and we were able to apply the online technique, online Bayesian moment matching. This column here is the distributed version, where we took the same algorithm but partitioned the data across ten different machines, got them all to compute a posterior, multiplied those together, and obtained the following results. That's the distributed version, and its results are slightly worse: in theory it should be identical if there were no approximation, but the computation is not exact, it's approximate, and that approximation gets amplified once we distribute the computation. You've also got the running times here, which show how this scales.

All right, I had another section on structure estimation, but we don't have time, so I'm just going to skip over it quickly and conclude. Oops, okay, here we go. So today we talked about sum-product networks. They are a special type of neural network, and hopefully I've convinced you that this special type has some clear semantics; they are also a type of probabilistic graphical model with some interesting properties. There's some other work that other researchers in my group have done, for instance extending sum-product networks for decision making, but also making them recurrent, similar to recurrent neural networks. What we're currently working on is a PyTorch library for sum-product networks, and I would also like to apply sum-product networks in the context of conversational agents. So I'll stop here, and perhaps I can take maybe one question or two, and then I think we have to vacate the room after that. Any questions? Okay, well, thank you very much.
Info
Channel: Data Science Courses
Views: 2,817
Rating: 5 out of 5
Keywords: Sum product networks
Id: Nm0jNqOnQ2o
Length: 81min 21sec (4881 seconds)
Published: Sat Oct 28 2017