Lecture 8. Community detection

Captions
All right, let's get started. The topic of today's lecture is network communities. When you look at the structure of a network, you notice that some parts of it look denser when you draw it: there are groups of nodes, clusters, that are much more connected among themselves than with the rest of the network. In social network analysis these are usually called communities. If I look at this picture, there is a group of nodes that is very tightly connected, with pretty high density; there is another group over here; there is a group that is separate from everybody else, just floating in space, which can also be thought of as a community; and for the rest it is not that clear, but you may still notice some higher-connectivity patterns here and there.

These are all examples of communities, but to be able to detect them, which is what we are going to learn today, we need at least some sort of definition that lets us search for them. The definition is the following: network communities are groups of vertices such that the vertices inside a group are connected by many more edges than between groups. The definition is a little vague and allows flexible interpretation, but there is a reason for that; it is the same situation as in data science or data mining with the definition of what constitutes a good cluster.

In this sense we can think of community detection as an assignment of vertices to communities. In this picture it is again obvious: following our definition, we clearly have three communities, these three groups. But it is not always the case. Sometimes you really do see a clear separation, where you can literally cut out some number of edges and get your communities. But we may also encounter a situation like this one, where a node lies exactly on the intersection of two groups: this is one community, this is another, and node D does not belong exclusively to either one; it can be assigned to both. And if I go back for a second to the slide we started with, that is precisely the situation there: a node that lies on the overlap of two communities.

So, by the definition, community detection is an assignment of vertices to communities, but overlapping communities are also possible, and then a node can be assigned to two communities. Looking for communities, we will distinguish the picture on the left, where each node is assigned exclusively to one community, from the picture on the right, where a node may be assigned to two. The left picture shows non-overlapping communities, the right one overlapping communities.
Speaking of non-overlapping communities, those of you who remember the previous lecture can clearly see that detecting them can be thought of as a graph partitioning problem, because here we can actually cut edges and that separates the graph into communities. On this side we can split the yellow community from the green and blue ones, but with a graph cut of this type we cannot split the overlapping communities. So we are going to talk about two different types of algorithms: one for non-overlapping community detection and another for overlapping community detection.

So what constitutes a community, what makes a community a community? This explanation comes from the social network analysis "bible" by Wasserman and Faust, where a community is described as a cohesive subgroup. First, mutuality of ties: almost everyone in the group has ties to each other. Not necessarily everybody to everybody, because in that case you get a clique, a fully connected subgraph; that is still a community, but you do not get it very often. So we expect many direct ties between members of a community, though not necessarily all of them. The second point is essentially the same: every node is reachable from almost every other node within the group. Then there is a higher frequency of ties within the group, and a separation, a low frequency of ties between members of the group and the rest of the graph. If you think about a social network, this is like any cult, or any tight group of friends: pretty much all of them know each other, and they have more internal connections than connections with the rest of society.

How can we quantify this notion of a higher density of edges, more edges within the community than between communities? A very straightforward approach is to use graph density. We define the density of a graph as the ratio of the number of edges that exist in the graph to the maximum possible number of edges. If you have a complete graph the density is one, if you have an empty graph it is zero, and everything else is in between. From this you can define the internal density of a community. Say we have a graph with some nodes and edges, and a candidate community inside it. We count the edges that belong to the community, that is, edges with both endpoints in the community (remember, the partition is determined by the node assignment), and divide by the maximum number of edges that could exist inside that community: that is the internal density. We can also compute the external density, using the edges that leave the community. Then we compare: we expect the internal density to be greater than the average density of the graph, and the external density to be lower. If that holds, the partition looks good.
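As a quick aside (an editor's sketch, not part of the lecture), these density quantities are straightforward to compute with networkx; the graph and the candidate community below are arbitrary examples:

```python
import networkx as nx

G = nx.karate_club_graph()            # a standard example graph
community = {0, 1, 2, 3, 7, 13}       # hypothetical community: just a node subset

n_c = len(community)
n = G.number_of_nodes()

# edges with both endpoints inside the community
internal_edges = G.subgraph(community).number_of_edges()
# edges with exactly one endpoint inside the community
external_edges = sum(1 for u, v in G.edges() if (u in community) != (v in community))

internal_density = internal_edges / (n_c * (n_c - 1) / 2)
external_density = external_edges / (n_c * (n - n_c))

print(f"graph density:    {nx.density(G):.3f}")
print(f"internal density: {internal_density:.3f}")
print(f"external density: {external_density:.3f}")
```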
So if we know which nodes belong to a cluster, we can compute these densities and compare them. But that knowledge does not help us identify which nodes belong to the cluster in the first place. As we discussed last time, even splitting a graph into two groups is already extremely difficult: there is a combinatorially large number of possible splits to check, let alone splitting into multiple groups. So given the groups we can verify them with density, but density will not help us detect which nodes belong to which group. On top of that, density is not sensitive enough to identify communities precisely.

Instead, people often use the notion of modularity. We already introduced this metric once, when we talked about assortative mixing. The idea is again to look at the fraction of edges within the clusters, but now we compare it not to the average density of the graph but to the expected fraction of such edges in a random graph with the same degree sequence as ours. A random graph should not have any communities: every pair of nodes is connected, or not, with a probability that is the same across the entire graph, so there is no built-in community structure. That gives us the baseline to compare against, and that is what the modularity score does. We take the actual connectivity pattern between nodes, A_ij, subtract the expected number of edges between nodes of degrees k_i and k_j in the random graph, which is k_i k_j / 2m, and add this up only over pairs of nodes that belong to the same community; delta(c_i, c_j) is the Kronecker delta, equal to one when c_i = c_j, and the factor 1/2m is just normalization:

Q = (1 / 2m) * sum over i, j of [ A_ij - k_i k_j / 2m ] * delta(c_i, c_j)

You can rewrite this formula per community: for each community you take the ratio of the number of internal edges to the total number of edges in the graph, which reminds you of density, and then you correct it by what you would have there if the graph were random, computed from the sum of node degrees within the community divided by twice the number of edges:

Q = sum over communities c of [ m_c / m - ( (sum of k_i over i in c) / 2m )^2 ]

This may look like a strange metric, especially when you realize that the modularity score goes roughly from -1/2 to 1, and that a single community covering the whole graph gets a score of exactly zero; that is just the way it is set up. But it works surprisingly well and is used a lot for community detection. Here are some examples. If we use that formula with the entire graph assigned to one community, the green one here, the modularity score is zero. If we instead assign, say, every node to its own community, the modularity score even becomes negative.
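To make the score concrete, here is a small editor's sketch (not from the slides) using the modularity function shipped with networkx; it reproduces the two boundary cases just mentioned. The two-group split at the end is a rough, hand-made partition, not the club's true factions:

```python
import networkx as nx
from networkx.algorithms.community import modularity

G = nx.karate_club_graph()

# the whole graph as one community: Q is exactly zero by construction
print(modularity(G, [set(G.nodes())]))

# every node in its own community: Q is negative
print(modularity(G, [{v} for v in G.nodes()]))

# a hand-made two-group split (only roughly the two factions of the club)
partition = [set(range(17)), set(range(17, 34))]
print(modularity(G, partition))
```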
We can also choose a partition like this one, which is obviously not the best, and compute the modularity score for it; it comes out lower than for the partition we understand to be the best clustering, the best communities, which gives the highest modularity score. In this sense, the higher the modularity score, the better the communities. This metric is very popular. It is not perfect: it becomes insensitive to small communities, so when you have very small groups of nodes it does not work too well, but otherwise it is quite robust. We can use it in two ways: either take the metric and optimize it directly, trying different partitions to maximize it, or use some other approach, some heuristic, and monitor the quality of the clusters with this metric. You could also check the quality of clusters with density, but modularity is used much more often. Any questions so far?

Okay. As I mentioned before, we can formulate community detection as recursive graph partitioning. We learned last time that there are ways, for example the spectral algorithm, to split a graph into two pieces, and we can continue cutting until we get smaller pieces, stopping, for example, when the modularity or the density of those pieces becomes high enough. As a result we get a split of the graph into communities. Graph partitioning in general splits the graph into two pieces, which are not necessarily communities, but done recursively it eventually splits the graph into communities and assigns every node to one of them. Last time we talked about the metrics that let us do this: the graph cut, which is just the number of edges you need to cut, the ratio cut, the normalized cut, and conductance. The algorithm we discussed in depth used the normalized Laplacian: we optimized the normalized cut by finding an optimal partition, and that allowed us to find clusters. That is one way to do it.

Today we are going to talk about several other methods. As I mentioned before, graph partitioning, or community detection, is an NP-hard problem, so there is no exact efficient solution. The Laplacian method we discussed before was an approximation: we formulated it as an integer optimization problem but eventually solved it with a relaxation, which is an approximation. Today we will talk about very different, heuristic approaches: you come up with some idea, try it, and if it works, great. These are usually greedy methods, which means they are not guaranteed to find the absolutely optimal partition, but in practice they find pretty good partitions, and we can monitor the quality of the partition through modularity.
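Before moving on to those heuristics, here is a minimal editor's sketch of the recursive-bisection idea described above: bisect by the sign of the Fiedler vector and accept a split only if the global modularity improves. The stopping rule and the min_size guard are assumptions made for the sketch, not part of the lecture:

```python
import networkx as nx
from networkx.algorithms.community import modularity

def spectral_bisect(G, nodes):
    """Split a node set in two by the sign of the Fiedler vector of its subgraph."""
    H = G.subgraph(nodes)
    fiedler = nx.fiedler_vector(H)                 # requires a connected subgraph
    order = list(H.nodes())
    left = {order[i] for i, x in enumerate(fiedler) if x < 0}
    return left, set(nodes) - left

def recursive_partition(G, min_size=4):
    """Greedily keep bisecting communities while global modularity improves."""
    parts = [set(G.nodes())]
    improved = True
    while improved:
        improved = False
        for i, p in enumerate(parts):
            if len(p) <= min_size or not nx.is_connected(G.subgraph(p)):
                continue
            left, right = spectral_bisect(G, p)
            if not left or not right:
                continue
            candidate = parts[:i] + [left, right] + parts[i + 1:]
            if modularity(G, candidate) > modularity(G, parts):
                parts = candidate
                improved = True
                break
    return parts

G = nx.karate_club_graph()
print(recursive_partition(G))
```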
One of these methods is called edge betweenness. It is similar to the graph-cut methods, because we are going to be cutting edges, but it differs in how the edges are selected. If you remember, in the previous lecture, when we talked about cuts, we either selected the partition that removes the smallest number of edges, or we normalized that number by the size of each partition. Edge betweenness is based on a similar idea, but from a slightly different angle. A few lectures ago we talked about a node centrality measure called betweenness centrality. Can somebody remind me what betweenness centrality is? Right: betweenness centrality tells us how many shortest paths go through a particular node. Mark Newman introduced a similar metric for edges and called it edge betweenness: for every edge we can calculate how many shortest paths go through it.

Think about Saint Petersburg for a second, and the bridges over the Neva river. If you want to get from one part of the city to another, you will most likely have to cross a bridge, so a bridge lies on lots and lots of shortest paths that take you from one point in the city to another. And if you think about the city again, this is a partitioning into clusters, because every part of the city can be thought of as a cluster in some sense. So what we want to do is find those bridges that have the most shortest paths going through them.

The way to do it is through the definition of edge betweenness: the edge betweenness of an edge is the number of shortest paths that go through it. We look at every edge, go through all possible pairs of nodes s and t, count how many shortest paths from s to t go through this edge, and normalize by the total number of shortest paths from s to t. So we pick a node here, this is s, and a node there, this is t, and count how many shortest paths from one to the other go through each edge: in this case none of them, but if I take this node and that node, the shortest path will go through here, or maybe through there, or maybe through both. That is how you count, for each edge, the number of shortest paths going through it, and that defines the edge betweenness. Then you start iteratively removing the edges with the most shortest paths going through them, because those are the bridges, and removing them is the best way to split the graph into pieces. Does that make sense? So we compute the edge betweenness for every edge, pick the edge with the largest value and remove it; then we recompute the edge betweenness, because the shortest paths have changed, remove the next edge, and so on.
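As an illustration (editor's snippet, not the lecture's code), the normalized edge betweenness from this definition is available directly in networkx:

```python
import networkx as nx

G = nx.karate_club_graph()

# fraction of all shortest s-t paths that pass through each edge
eb = nx.edge_betweenness_centrality(G, normalized=True)

# the "bridges" carrying the most shortest paths are the first removal candidates
for edge, score in sorted(eb.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(edge, round(score, 3))
```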
That is the whole algorithm: compute the edge betweenness, remove the edge with the largest betweenness, recompute, and keep going until the graph splits into two components. When that happens you have a bipartition of the graph, and if you want, you can continue recursively inside each part with the same procedure. We are not explicitly optimizing anything here, which is why it is a heuristic, but the process leads to a nice partitioning. If you run this algorithm here, this is the partitioning we get, and this is the dendrogram, the tree; you read it exactly the same way as for any hierarchical clustering algorithm. This is the first-level split: the first cut is here, then, going deeper, it splits the blue from the green and purple, so the next cut is there, and so on. That is the edge betweenness algorithm.

Interestingly, to see how it actually works, here is the Zachary karate club graph again. If we run edge betweenness, this is the result, the partition that the edge betweenness split produces. We can keep splitting further, and if we run it again it will split off those four nodes. What we can do is measure the modularity score as a function of the number of clusters. Remember, when we have one cluster, one community, when the graph is not split into pieces, the modularity score is zero; then we split into two, three, and so on, ten communities, twenty communities, and monitor the modularity. Notice that in this case the modularity score is largest when we have approximately five or six communities. That is a somewhat strange result, because on the next slide I will show you that split; according to the modularity score it is the best possible split, though maybe this other one is not bad either, and when the graph is small these metrics are not extremely sensitive. But it gives you an idea: when you keep splitting into many more pieces, which is obviously not right for such a small graph, the modularity score keeps dropping, getting smaller and smaller. So if not exactly six, then at least it is clear roughly where your optimal number of clusters, the optimal number of communities in this graph, lies. Again, this is very similar to the clustering problem, where you do not really know how many clusters are optimal in the data. And this is the split of the graph based on edge betweenness, together with its dendrogram.
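A short editor's sketch of this procedure on the Zachary karate club, using the girvan_newman generator in networkx and picking the level of the dendrogram with the highest modularity (the cutoff of ten communities is an arbitrary choice for the example):

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

G = nx.karate_club_graph()

best_q, best_partition = -1.0, None
# each item yielded by the generator is the partition after one more split
for communities in girvan_newman(G):
    q = modularity(G, communities)
    if q > best_q:
        best_q, best_partition = q, communities
    if len(communities) > 10:        # no need to go all the way to singletons
        break

print(f"best Q = {best_q:.3f} with {len(best_partition)} communities")
```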
The problem with this method, though it works pretty well and is robust, is that every time you need to recompute the split you have to go through lots and lots of pairs to calculate the betweenness: for each edge betweenness you go through all possible pairs of nodes, which is on the order of n squared, and then you go through all the edges, giving roughly m times n squared. So it is quite computationally intensive. There are ways to approximate the edge betweenness by randomly sampling nodes, but it still becomes expensive, and that is a problem for large graphs. Typically you do not care about community detection on a graph of thirty-odd nodes like the one we see here, but when you have several thousand nodes it already becomes quite computationally expensive. Still, it is a good algorithm to check things on smaller data, it is pretty robust, and the idea is intuitive. At the same time it is a heuristic, so there is no strict proof of why it works or a guarantee that it always will.

Now I want to talk about a method that is probably one of the workhorses of today's community detection. It is called fast community unfolding; it is also a multi-resolution, scalable method, and it is widely known as the Louvain method. Here is a demonstration of this method working on a pretty large mobile phone network with about 2 million nodes, so it is actually quite large. This method is also a heuristic: there is no global function that we optimize directly, but on every step we do something smart, we monitor modularity, and in a greedy fashion we make only those moves that increase modularity, because we know that a good partition has a high modularity score. It is a greedy algorithm, so it will not always get the best partition, but if we repeat it multiple times, or do something else smart, we can get a pretty good solution.

So how does it work? Modularity is recomputed on every step of the algorithm. The idea is the following: we start with every node assigned to its own community. In this sense it resembles agglomerative clustering; remember, with graph partitioning we went top-down, taking the graph and splitting it into pieces, while here it is the reverse process: we assign every node to its own community, its own cluster, and then grow the clusters by merging nodes. So we start with every node in its own cluster, and then we check, for a node and one of its neighbors, whether putting them into the same cluster increases modularity. We check this against all neighbors of the node, pick the one that gives the highest increase in modularity, and merge them into one cluster. Then we go to the next node and do the same, and then to that node's neighbor, and so on. If we look at the picture, this is what it looks like: we start with every node in its own cluster, and then we ask, for example for node 14 and its neighbor 8, whether merging them into one cluster increases modularity, or whether merging node 14 with node 10 does; we reassign accordingly, going through all the nodes that are connected.
It is not all possible pairs of nodes: you just follow the edges and see whether the move increases modularity, and if it does, you select the reassignment that increases it the most. If you run it that way, on the first iteration you see groups of nodes becoming assigned to common clusters: it started with every node in a separate cluster, but by following the edges and reassigning, it turns out that the highest modularity is achieved when these five nodes belong to one group, those nodes belong to another, that forms a third group, and that forms the last group. On the next step we take those groups and form super-nodes, literally collapsing each group into a single node, with weights that correspond to the numbers of edges being collapsed. Then we repeat the process again and again until we essentially end up with a single cluster for the entire graph. So again: in graph partitioning we started with the whole graph and kept splitting it until we got small clusters; here we start with single nodes and keep merging until we reach one big cluster.

(Question from the audience: is this the same idea as the minimum spanning tree algorithm?) Yes, it is the same kind of idea. So there are two phases. In phase one we evaluate the modularity gain from placing a node into the community of a neighbor and select the move with the best score; we keep doing this until there is no further improvement. In phase two we merge the nodes of each community into super-nodes and repeat the whole thing on this coarser graph. It is again a hierarchical approach.

If you think about it, you realize that the result of this algorithm depends on where you start. It can give different results: if you start with node 2, it may decide that node 2 is best joined with node 1, and then, comparing with other nodes, you will get a certain clustering; but if instead of node 2 I start with node 5, it could be that node 5 decides it is better off with some other node, and a different cluster forms. Eventually you get some clustering with a modularity score attached to it; you run multiple scenarios and select the one that gives you the highest modularity. Here is an example of this algorithm running on a quite complex graph.
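In practice you rarely implement this by hand: recent versions of networkx (2.8 and later) include louvain_communities, and the python-louvain package is another common option. A minimal usage sketch (editor's example); taking the best of several seeded runs mirrors the "run multiple scenarios" advice above:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.karate_club_graph()

# the result depends on node order and randomization, so fix seeds
# and keep the run with the highest modularity
runs = [louvain_communities(G, seed=s) for s in range(10)]
best = max(runs, key=lambda part: modularity(G, part))

print(f"{len(best)} communities, Q = {modularity(G, best):.3f}")
```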
(Question: on the previous slide, why does the blue super-node have weight 4? There are only three blue nodes.) Good question; I am actually not sure why that weight is 4. What matters for the computation are the weights between the super-nodes. Let us see what the slide says: nodes are merged into super-nodes and the weights on the links are added up. Initially all weights were one, so between the dark blue and the light blue super-nodes there is one edge and the weight is one; between red and blue there are three edges, so the weight is three; between green and blue there are four edges, and that is where the four comes from. Those numbers are clear. The numbers on the self-loops are the ones I am not quite sure about. It could be twice the number of internal edges: there are two edges inside the blue group, which would give four, and counting the seven edges inside the green group would give fourteen. Or it could be the sum of the internal degrees of the community, which amounts to the same thing, since the sum of degrees is twice the number of edges. That is quite possible, but please check the paper, where it is specified precisely, and write to the class with the correct answer; I do not remember off the top of my head.

All right, let's move on. Here is an example of this algorithm run on a pretty big graph with hierarchical structure, and it still works pretty well; this is the algorithm that essentially everybody uses on large graphs these days. So that was the Blondel, or Louvain, algorithm.
There are many more algorithms, but I want to point out another one that is also quite popular: agglomerative hierarchical clustering. Again the notion is to agglomerate clusters, but the idea is slightly different. For every pair of nodes we build a similarity matrix that measures the similarity between those nodes. The similarity the authors of the algorithm chose, and it happened to work well for them, is the number of common neighbors of the two nodes, divided by the minimum of their degrees plus one minus a Heaviside step function of the adjacency (the step function is zero for a negative argument and one for a positive one, i.e. one when there is an edge between the nodes). Here is a graph and this is what its similarity matrix looks like. Remember, we actually had a lecture about node similarity, where we used a slightly different metric, the Jaccard similarity; here they introduce their own version of similarity that, for some reason, works well for clustering. Again, this is a heuristic method: there is no particular function being optimized precisely, unlike the Laplacian-based graph partitioning we did last time; they just perform local procedures that they believe make sense.

The process is as follows. You assign each node to a community of its own, compute the similarity matrix, find the pair of nodes with the highest similarity (this is agglomerative clustering), merge them into a single community, then recompute the similarity between this community and the rest of the communities, and keep going. You get a tree, a dendrogram, and you cut it at a certain level to obtain your partition. This algorithm works well for smaller graphs; it is not going to work well for large ones, because repeatedly finding the most similar pair, comparing all similarities against all other nodes, takes a lot of time and computational effort. But here is an example of how it works on a smaller graph, and on a smaller graph it works pretty well; you get a nice, clean picture.
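A rough editor's reconstruction of this similarity-based agglomeration is sketched below; the exact similarity used on the slide may differ, so treat the formula here as an assumption in the spirit of what was described, and the three-cluster cut is arbitrary:

```python
import networkx as nx
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

G = nx.karate_club_graph()
nodes = list(G.nodes())
A = nx.to_numpy_array(G, nodelist=nodes)
deg = A.sum(axis=1)
n = len(nodes)

# assumed similarity: common neighbours, normalized by the smaller degree,
# corrected by whether the pair is directly linked (the Heaviside term)
common = A @ A                       # common[i, j] = number of common neighbours
S = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            S[i, j] = common[i, j] / (min(deg[i], deg[j]) + 1 - A[i, j])

D = 1 - S / S.max()                  # turn similarity into a distance
np.fill_diagonal(D, 0)
Z = linkage(squareform(D, checks=False), method="average")   # the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")               # cut into 3 clusters
print(labels)
```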
All right, moving on. Here is a very different idea, but actually a rather beautiful one. Think again about Saint Petersburg and its bridges, and imagine you are walking around the city doing a random walk, randomly choosing streets. The chance is actually not high that you will take the street that ends at a bridge and cross over to the other side. This corresponds to the observation that a random walk on a graph tends to get trapped in the parts of the graph with higher density. Imagine you start here and randomly select where to go: there is a one-third chance you walk here, then again you choose randomly, you end up here, then here, then maybe here; but once you are at this node, there are five ways out of the community, so there is only a one-in-five, twenty percent, chance that you actually leave it.

So that is the idea: run random walks and see where they spend most of their time; communities will be the groups of nodes where the random walk spends most of its time. It might sound strange to run random walks across the graph, but this can work quite well, and it is a very good approach when you have a very large graph and do not want to detect all the communities, just the community around some part of the graph you care about: you initiate a random walk there, see where it spends its time, and that is your community. That is the conceptual idea.

In practice there is a whole family of algorithms based on this idea; one of them is called Walktrap community detection. On every step you make a random move from one node to another. P_ij is the probability of getting from node i to node j: if they are adjacent it is one over the node degree k_i, otherwise zero. This is the same thing we did with PageRank, which was also a random walk. If you do t steps, the process is described by this matrix raised to the power t. The theoretical assumption behind the algorithm is that two nodes belong to the same community if the probability of getting from node i to node j in some number of steps is high; and remember there is mutuality here, because if you can get from one node to the other, you can also get back. So you can define a distance between nodes through these probabilities. You could use this idea even without the mathematics, by literally simulating random walks, but the math gives you a proof that the process converges. We compute the distance either by simulation, literally running random walks, counting how many times a walk started from node i ends up at node k after t steps, and approximating the probability as the ratio of that count to the number of walks, or by direct computation. Then we define the distance between nodes, and from it a distance between communities, which we can measure as well. And once we have that distance, we can apply the usual agglomerative clustering approach, now based on distances defined through the random walk.
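To make this concrete, here is a small editor's sketch that builds the t-step transition matrix and a Walktrap-style random-walk distance between nodes; the choice t = 4 and the example node pairs are arbitrary, and real implementations (for instance the one in igraph) are considerably more refined:

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
nodes = list(G.nodes())
A = nx.to_numpy_array(G, nodelist=nodes)
deg = A.sum(axis=1)

P = A / deg[:, None]                 # P[i, j] = 1/k_i if i ~ j, else 0
t = 4
Pt = np.linalg.matrix_power(P, t)    # probabilities of t-step walks

# Walktrap-style distance between nodes i and j:
# r_ij = sqrt( sum_k (Pt[i,k] - Pt[j,k])^2 / k_k )
def rw_distance(i, j):
    return np.sqrt(np.sum((Pt[i] - Pt[j]) ** 2 / deg))

# compare a pair inside one dense region with a cross-region pair
print(rw_distance(0, 1))
print(rw_distance(0, 33))
```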
So we start with each node assigned to its own community, compute the distance between adjacent communities, choose the closest pair in terms of this random-walk distance, merge them, and keep going. Again, the beauty of this approach is that you can center the random walk on the part of the graph you are interested in. The result, once more, is a tree, a dendrogram, formed by merging nodes based on these walking patterns. Does that make sense? By the way, on the one hand we can do the simulation exactly; on the other hand, if the graph is not very large, we can do these computations through repeated matrix multiplication. If you do it through matrix multiplication the matrix can be large, and you are dealing with the entire graph; with simulation you can focus on just the part of the graph you care about.

All right. So far we have talked about algorithms that detect communities in the non-overlapping case, where every node is put into a single community. Now, if the situation is such that we have overlapping communities, with a node sitting on the intersection of two communities, the methods that look for edges to cut will not work anymore: we cannot do graph partitioning by cutting edges, simply because some nodes overlap and can be assigned to both communities. In this picture, for example, there is a node that belongs to two communities, if you look at the communities according to our definition; we can even have several nodes, or a whole edge, in the intersection. There are different methods for this case. The algorithm we are going to discuss is quite interesting and again works well on small graphs; it is usually used in bioinformatics rather than in social network analysis. But the motivation for overlapping communities is clear from social networks too: you have your circle of friends, your family, your scientific community, and you belong to all of them, so you simply cannot cut edges to split them apart.

The algorithm is called k-clique community detection, and the idea is to use cliques, complete subgraphs: a k-clique is a complete subgraph with k nodes. Just to remind you: a single node is a complete subgraph, this is a complete subgraph on two nodes, this is one on three nodes, this is a clique of four nodes, and so on. The method works well for denser graphs, but that is exactly where you are most likely to have overlapping communities. The idea is to detect the cliques and then look at how they overlap.
For example, if I take 4-cliques, then A, B, C, D is a clique and B, C, D, F is another clique, and they overlap in three nodes. Here is the definition: two k-cliques are said to be adjacent if they share k − 1 nodes. The 4-cliques we just looked at have four nodes each and share three nodes, k − 1 = 3, so these two cliques are adjacent according to this definition, and the picture on the right shows various 4-cliques that are adjacent. Of course, to have 4-cliques at all the graph has to be pretty dense. Why 4-cliques? That is just how the algorithm is set up in this example; you could start with 3-cliques: here is a 3-clique, here is another 3-clique, and they are adjacent, sharing the two nodes A and D.

So how does the algorithm work? Look at this picture. We start with a clique; here I show a 3-clique, the one in green. Then we start "rolling" this clique: we shift it while keeping the required intersection, in this case two nodes, so we capture the next clique, and we keep doing it until we cannot do it anymore. It is like a clique that rolls slowly over the graph: if it is a 3-clique, it keeps two nodes in common and tries to replace one node with a new one, in such a way that what we get is again a clique. With triangles this is quite easy, because there are a lot of triangles in a graph, but with four nodes, for example, it is not always possible. So you roll until you get to a point where there is no way to move: if I drop this node and pick up that one, this is no longer a clique, because this edge is missing. That is the concept of the algorithm: you find a clique and keep rolling it as far as you can, and that traces out your community.

In the algorithm itself it is done slightly differently: instead of actually rolling the clique, you find the maximal cliques and look at their overlaps, thresholding the overlap at k − 1 nodes. If you work with 4-cliques you threshold at three nodes, and any overlap of fewer than three nodes just means you would not be able to roll through it; for 3-cliques you threshold at two nodes, and so on. The connected components of what remains after thresholding are your communities. So first you detect all the maximal cliques; a maximal clique is the largest clique that can be built on those nodes.
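networkx ships an implementation of this clique-percolation idea, k_clique_communities; a usage sketch (editor's example on an arbitrary graph):

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

G = nx.karate_club_graph()

# communities = unions of k-cliques connected through (k-1)-node overlaps;
# a node may appear in several of them (overlapping communities)
for k in (3, 4, 5):
    comms = [set(c) for c in k_clique_communities(G, k)]
    print(k, [sorted(c) for c in comms])
```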
This is what it looks like: we detect the maximal cliques, and the number shown is the size of each clique. For example, the blue one here has five nodes, the red one, right here, is a four-node clique, the brown one is a four-node clique, and so on. Then, for each pair of cliques, you record how many nodes they share. For example, the red clique and the blue clique have three nodes in common, so the overlap entry is three; the blue clique and the green clique overlap in two nodes, and that is what goes there. Then we threshold: with k = 4, that is, clique size four, the algorithm tells us to threshold at k − 1 = 3, so entries of three or more become one and entries below three become zero, and then we take the connected components. And there you go: based on this 4-clique percolation the graph has two communities, this one and the rest of the graph, and yes, it is an overlapping split: notice that this node belongs to both, which is why it is drawn twice, so you can see it. That is k-clique percolation.

It is actually quite an interesting algorithm, and if we look at these two examples of different graphs with the cliques selected, you notice that it works quite nicely: it splits the graph into overlapping communities. It is also clear that the performance and the result depend strongly on the chosen clique size, so the authors literally recommend running it with sizes three, four, five (size two you do not want) and seeing what happens; the denser the graph, the higher the clique number you can probably use. The problem with this algorithm is that it is insanely computationally complex, because you need to detect the cliques and then do everything else, and clique detection is an extremely expensive process. So again, this will work on small graphs, but it is one of the very few algorithms designed for overlapping communities. When you have large graphs you most likely do not have overlapping communities, and people typically use the graph partitioning approaches we talked about before; but if for whatever reason you rely on overlapping communities, this is what you do.

Now, going back to real-world networks, there is an interesting paper by Jure Leskovec and Kevin Lang, around 2010, where they analyzed real-world social networks, I think from either the Microsoft Messenger or a Yahoo graph. Remember, we looked at metrics like the normalized cut, and one of them was conductance. They looked at different partitions of the nodes, measured the conductance of those cuts, and found that in real-world graphs the best partitions, the best communities, occur at a characteristic size of maybe a couple of hundred nodes: for larger communities the quality drops, and for smaller communities the quality also drops.
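For reference, the conductance of a candidate community, the quantity that paper studies as a function of community size, can be computed as follows (editor's sketch with an arbitrary node set):

```python
import networkx as nx
from networkx.algorithms.cuts import conductance

G = nx.karate_club_graph()
S = {0, 1, 2, 3, 7, 13}      # hypothetical community

# conductance = cut(S, rest) / min(vol(S), vol(rest)); lower is better
print(conductance(G, S))
```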
So in real-world networks the best communities you detect are often around 200 nodes. And the last slide for today: this table is from a review, probably from 2010 or 2015, and if you look at the years, all the algorithms listed are from before 2010; there are many more algorithms out there now, this is maybe a third of what exists. But the really important column here is the right one, computational complexity. Notice that most of the algorithms have complexities like n cubed, and some even worse. This is really bad, because any real graph has at least thousands of nodes, and the complexity tells you how many operations you need: n cubed with n around a thousand is already on the order of 10^9 operations, which is almost non-computable, and for a million-node graph it simply does not work. But there are some better algorithms, for example one with complexity n log^2 n, and the one we considered, which is roughly linear in m. The reason is that we only use modularity based on edges, and because of the aggregation the complexity drops very quickly. If you look at the algorithm for overlapping communities, on the other hand, you get exponential complexity, so it really works only for very small graphs.

Bottom line. Community detection is defined as finding groups of nodes that are more tightly connected among themselves than with the rest of the graph. The definition is not precise, in the same way as with clustering: multiple solutions can exist, there is no single unique correct solution, and there are multiple algorithms that can produce one. Algorithms differ in how they operate; most of them are greedy heuristics, some are approximations, and there is no exact solution. They also differ widely in computational complexity, so before you run something on a large graph, check the complexity of the algorithm, because otherwise you might wait forever for it to finish. In practice the Blondel (Louvain) algorithm is typically the first thing you try; if it does not work well, you try a bunch of other algorithms and see which gives you better clusters according to your own understanding and your own definition.

I think with that we are done. If you are interested in this topic, look at the review on community detection in graphs by Fortunato, and I believe there is also a fresher review, probably from 2018, covering more of the algorithms. If there are no questions, we are done for the day. Thank you; there will be a seminar after the lecture. Thanks, guys.
Info
Channel: Leonid Zhukov
Views: 826
Rating: 5 out of 5
Id: W8sR_sq_Lec
Length: 70min 16sec (4216 seconds)
Published: Wed Mar 03 2021