Community detection in the stochastic block model - Ismael Lemhadri

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
at the ecole polytechnique one of our applied mathematics project yusuf amin was supposed to be here today unfortunately he cannot be here due to visa restrictions anyway I'd like to talk about community recovery in a specific random model called the stochastic block model so first the problem of community detection the basic idea is that you observe a certain number of connections between entities be they people objects whatever so you have this graph of connections and what you can see here is that there isn't that much structure in this graph and we'd like to be able to recover this structure and the way we do this is we were going to assume that there is a community structure and that people belong to different communities and depending on the community they belong to they will have a certain behavior so we would like to start from this observation and be able to recover the structure so all we have as information is the connection or absence of connection between people and eventually we'd like to be able to have something that looks like this and what you see here is that in this example you see that inside one community there are lots of connections and outside the communities what you see as the big diagonals there there are not fewer connections and once we have found these type of communities we were able to simplify it and make it look much in a much nicer way so this problem of community detection really has a lot of applications in real life if you just look at social networks so here I took the example it could be Facebook connections where you have friends and they they're connected or not and depending on their nationality or gender or whatever you can define these these clusters if you look at communication networks like just phone calls and you connects people who have talked to each other by phone then you could build another type of network or if you look at biological networks this morning we had a talk about RNAi targets prediction which is very close to a protein to protein interactions and then you start speaking about biological networks and you recover the same type of problems and in these three examples that I show here the community is actually natural to find but if you go to some more some different problems where you don't have obvious community structures but you can actually build one by yourself you engineer it to help you solve a specific problem for instance you have images and you'd like to cluster images that look alike and the way to do this would be to connect images that have some features in common or that look alike according to some criterion and in this way you managed to build a graph and you can then apply the same community detection procedures to this graph you also have webpage sorting the PageRank algorithm that Google uses advertisements problems and all sorts of other problems that come back to the same issue and the bottom line is that we'd like to be able to identify groups that behave in the same fashion so now to the stochastic block model it's a specific random graph model that relies on communities and the way it works is it has three parameters basically you will go you're going to need to know the number of people that you have and I'm also going to define G which is the membership matrix so it's a matrix of size n k k being the number of different clusters and Q is going to be the connectivity matrix we tells me the parameters of connections between two people so if you have an a someone in community one and another one in community two they're going to connect according to a Bernoulli variable a parameter Q 1 2 and this variable is going to be independent from all the others so this is the basic circuit o'clock model and just as an illustration let's say that we search from these nine individual and they're going to connect according to the community they belong to so first let's look at the two communities so the two communities are going to connect with each other like so using the corresponding parameters and now if you look at the inter community connections they're going to happen using the different parameter so it's all symmetric in my graph and the basic problem that I want to look at is I see this this is what I observe and I'd like to be able to recover the original structure which is this one to be able to tell this is the first group and this is the second up to a permutation of the groups okay so now that I have defined the problem so let's make it more specific and we'd like to be able to tell when are we able to solve this problem when are we able to recover exactly the graph to say all these people belong to this community and and and so on for all the communities and not to miss anyone so we'd like to be able to do this with high probability in N and we'd like to have some conditions or guarantees that if the matrix of connectivity and if the membership matrix satisfies some condition then we can solve this problem and how it means do we have efficient algorithms to do this in contrast with can we just do we just know that we can solve it information theoretically so next thing I'm going to show is just a simple example where you have two communities which is exactly the problem the red-green problem that I showed and present the results that I found on this example before moving into the general case of the SVM and given the conditions for exact recovery to be possible so when you have two symmetric communities you basically are going to have just two parameters because you have the intra community connection which is going to be the same for the two communities and then you have the int inter community which is given by the parameter Q and obviously you don't know what P and Q are you just assume that there are two communities and in this example our condition translates to this one so here P and Q are the probabilities and number of individuals K is a number of clusters so it's just going to be two and M is going to be n over K so n over two and so I'm showing this condition and I like to compare it to the information theoretic condition which tells you when is recovery possible no matter what algorithm you use because obviously there are cases where you cannot recover anything if if become if the connectivities are the same no matter what community you are in no matter inside or outside the community then obviously you cannot do much but you can still hope that if P and Q are distinct enough you should be able to find to solve the problem and basically we have this condition here and we need at least to be in this regime here so we have a lower bound that V and Q should be at least of order log n over N to actually have enough connections to to recover something and in this case with these amb you have this condition here and it should do the same and go back to our condition you can actually check that the two are equivalent actually constant which just tells you that our condition is optimal in the case of two symmetric communities now to the general case so the way we started this problem was to look at k-means k-means the most famous clustering algorithm probably just tells you that you'd like to be able to cluster your data points in order to minimize this criterion over here and this criterion can be rewritten as a maximization problem which is almost like a semi definite program when you look at this set over which you are optimizing you can see that you have these constraints over the matrices they are all almost linear except the last one so the last one is going to need some work and we're going to rewrite this v squared equal B to relax it in order to get a convex optimization problem so the way to do this and to connect it to our problem is to actually define this matrix P which is two which you are going to apply this SDP so our SBP transforms the k-means problem into this problem which which we define and we call peacock as penalized convex optimization of k-means and now we require that the eigenvalues of b be positive and less than 1 and for this problem now we have an SDP and we can actually solve this this SDP in a relatively efficient manner and this is the basic algorithm that we use and we can prove that if the optimal partition of course we don't know the partition but if it satisfies these conditions so we define this Delta which is a sort of discrepancy between the groups then you can actually recover the exact partition with a high probability so you can see that the probability tends to 1 when n tends to infinity and in this way we have managed to use this convex relaxation which has been applied in many different situations and to apply to the specific case of stochastic block models and to prove that it's that it works and that it's near optimal at least in the case of two symmetric communities thank you for your attention Thank You miles now we have time for a couple questions hi thank you and does your work covers overlapping communities you mean like when an individual can be member of several communities yeah actually this is a problem that we haven't looked at but I would assume that there is something to be done in here because this so this this this relaxation this convex relaxation has been applied to several different problems like mixture of coefficient variables for instance clustering of random variables and I will assume that if we if you manage to rewrite the the problem in a way that that that that actually allows you to to rewrite to to use this B then yes it could be used and also to other variant variants of the SVM where the connections between the people don't only depend on their community but on something else then you can consider these more refined models and try to do something that would be a generalization yes thank you very much for your table very interesting so so you have a more discriminative point of view I think because I mean what would be the generative probabilistic model underlying this the stochastic block model you see what I mean it's what would be the likelihood connection between the euro area and the likelihood of you observation I see so it's true I haven't defined formally what an SVM is is that what you actually mean yeah I didn't see yes so I should yeah I didn't give a mathematical definition so okay let's go back to SVM so what I assumed in my in my model that I so I know n G is the membership matrix so it has values either 0 or 1 on each line it has only one because each member because we don't have overlap in communities as dimension and a Q is a matrix of connectivity so it gives you probabilities so it's matrix between 0 & 1 symmetric that gives you the probability of connection between community I and J so what you assume is that first the connections are independent and second that if you take one individual from community let's say community one it's it's you know it only belongs to this community it's not random at this stage so we we already have J and G is not random so you have this individual here you have an individual in another community and they are going to connect so you're going to have you're going to build your connectivity matrix which is also 0 or 1 according to a Bernoulli variable of parameter Q and the indices are the Unity's to which the two individuals belong so q is symmetric because we have a undirected graph does that make things clear from this you could actually derive the likelihood function because you have all the information that you need okay thank you mr. picah again okay [Applause]
Info
Channel: Ismail Lemhadri
Views: 2,819
Rating: 4.8461537 out of 5
Keywords:
Id: Trq-1h-s4l0
Channel Id: undefined
Length: 16min 20sec (980 seconds)
Published: Fri Feb 16 2018
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.