Gephi Tutorial on Network Visualization and Analysis

Video Statistics and Information

Captions
So hopefully you've watched the videos on both network statistics and network visualization. What we're going to do here is actually look at a network, and I'm going to run you through the full basic process of loading up that network, running it through Gephi, pulling up the basic statistics, and doing an initial analysis. I'm going to do this at a relatively quick pace. I'll explain everything I'm doing, of course, but I'm not going to go really slow. Fortunately, this is on video, so you can pause it, rewind, and go back over things. So I'm going to do the whole process for you like I do all the time. Watch it a few times, pause it, check out all the details, and this should make it pretty straightforward for you to go ahead and do it yourself.

What I have open here are two different files. I've actually created these from a different file. Part of the difficulty in doing network analysis is actually getting data in the right format, so I've done a little bit of preformatting for you here, but I do want to talk you through these. So I have two files, nodes and edges, which you can see listed over here on the side.

Let's start with the edge file. You can see that it's just one number, comma, another number, so we have an adjacency list here. On the top line it says "Source,Target". This is specifically for Gephi: if you're importing an adjacency list, you need the first line to be exactly like this, with this capitalization. That allows Gephi to recognize your file as an adjacency list so it can populate the edges correctly within the program. There are lots of places where you'll be able to find adjacency lists online for all different kinds of networks, but if you do, you want to open them up in a plain text editor and make sure you have Source and Target up at the top. This is a comma-separated list, so I have a comma between those words. If this were tab-separated, for example, you would put a tab between them. This isn't the case here, but you may have additional information. So, for
example, your edges might have a weight. In that case you could go ahead and put Weight as a third value here, and then your weight could go as the next item in your list. Just make sure you give it a label explaining what it is, like Weight. The only ones that have to be exactly correct are Source and Target; you can make up other names to describe the other columns if you have them. I don't have weights, so I'm going to delete that, and that's our edges table.

Now, you might just be working with an adjacency list. Here this is given all in numbers, but you may have people's names or the names of other entities. I only have numbers here; the information about each of these nodes was given separately, so I have a separate nodes file. So my nodes are here. You can see that they have the number, and then there's a name in quotes afterwards. Again, if you look at the top row of this file, it's "Id,Label". Again, those are for Gephi, to let it know that the first column is the ID of the node, which is going to match up with the numbers that were in our edges file, and then Label is a keyword for Gephi that says, basically, this is the label that we're going to put on that node. So you won't always have separate edges and nodes, but you can, especially if you want to add extra information about your nodes. I decided to go with this more complex example that has the two files so you can really see how to get it all imported.

Okay, so we're going to come over here to Gephi. I've just launched Gephi, so this is our opening screen, and you can be tempted to just open a CSV file that has an adjacency list. Sometimes it'll work and sometimes it won't, so I want to show you how to do an import that will always work. Instead of opening a file, just start a new project, and you're going to have a blank workspace. Come to the Data Laboratory tab on the top. That's basically going to take you to spreadsheets that have your nodes and edges, and you can see that those are described here. We're going to start by importing the
edges. I clicked on Edges here, but it actually will work regardless of what's clicked on. What you want to look for now is this Import Spreadsheet. This is not the same as Open up here in the File menu. You do have Import Spreadsheet there, or you can click it here. Now we want to find the file to import. We're going to start by importing our edges. I happen to be in the right folder; you may have to navigate to the right folder. I'm going to pick our edges file and then do Open. And you often will have this: you want to check what kind of table you are importing. This says, oh, you're importing nodes, which isn't right. We're importing edges, so we have to make sure that we pick the edges table. The separator here is what you have delimiting, or separating, your columns. This is selected right, as comma, now, but you can see it has presets for semicolons, tabs, or spaces. We're going to leave it on comma, and then we get this little preview down here, so we can see that it has them labeled as Source and Target, and then we have all of our numbers in here. You may see an error if you have numbers or an actual data pair showing up in that top row. That probably means you forgot to add the Source and the Target. But this all looks good, so we're going to click Next. This has everything we need. This Create Missing Nodes is always a good thing to keep checked, so just leave that as it is and click Finish.

Now we can see we have a bunch of data here in our edges table. If we click over to the nodes table, we can see we have an ID, that's the number for all the nodes, but there's no label. So we want to import our nodes table that we had separately. Again, we're just going to click that Import Spreadsheet button, and we're going to go and pick the nodes table now. Now you can see there's an error here. It says we need Source and Target columns, but that's because it thinks this is the edges table, so we have to go back and tell it, no, that's the nodes table. And now our preview looks good. We have Id and Label
at the top, and then we have the IDs matching our numbers, and the labels are people's names. So make sure that these are matching ID and label. They are, and then click Finish. And now you can see our nodes table is populated with people's names to match up with all those numbers that were over here. So that's how we import our data: click on the Data Laboratory tab and use Import Spreadsheet.

Now let's go back to Overview, and now we have this kind of mess of a network. Before I actually go through the steps here, let's talk a little bit about the network that we're looking at. This is a co-citation network, or a collaboration network. We're looking at people who have cited a paper by Stanley Milgram. This is a small-world paper, which we're going to actually talk about later in the semester. So people who have worked together or cited one another are linked here.

So our network is a mess. The first thing we want to do is lay it out; that's always a good first step. I like to use the Yifan Hu algorithm. I always keep the defaults here. If you start to get more into looking at the different things to do with layout, you can of course change those, but they're pretty good at default. So pick that and then click Run. Okay, and there's some crazy stuff going on there, right? So we have our interesting little thing down here, but if you saw, we had a bunch of nodes fly off to the outside. You can use the zooming feature of Gephi, on a Mac by using two fingers up and down, or by using the scroll wheel on your mouse. So I can zoom out a little bit, and you can see those nodes across the edge. If you get uncentered, or if you just want to see the whole network, you can always also click this magnifying glass down here, which will reset the zoom and center the visualization.

So if we zoom in here on all these nodes on the outside, they're just these single nodes. These are here because these are people who have cited the paper but don't have any other connections. Those aren't interesting to
us, and so we actually want to filter those out. Filtering is one of the first steps that you'll do a lot of times, if there's anything to be filtered, and it's also something that you might do iteratively throughout your analysis. You may look at something and say, oh, actually only these kinds of nodes or these kinds of edges are important. So filters matter.

You'll see that there's a Filters tab over here, and if we click on that we get a bunch of different options. This Queries section is where we actually put the filters that we want to apply. I suggest you take some time, load up a sample network, and just explore all of these filters, but the one that I use the most is under Topology, and that's Degree Range. That lets us filter out nodes that have a degree greater than or less than some value. So find Degree Range, and then it says "drag filter here." Do that: drag the filter there, and now you have some settings. You have a slider down here at the bottom, so you can slide that, pick a value, and filter things out. I filtered that too much there, obviously. Slide it back down and things come back in. Sometimes it can be hard to slide it to exactly the value that you want. You know, I'm trying to slide it to one and it keeps going up. So this is a hidden little thing: if you just double-click that number, it looks like it's highlighted, but you can actually type the number that you want and hit Enter, and that will filter your network for you. Now, once I've done that, I can click the center button and we get this big cluster in the middle.

Just one other thing in the filters: you can also click this arrow here that shows us the parameters. Click that, it shows us the range, and then that can be a way to see the actual values if for some reason you can't tell down here. We could filter further, so we could take another filter, for example, and drag it, so after we filter by degree range we could do something else. But for now we're just going to leave that as degree range. So we filtered out all those singleton nodes, and
now we have our main cluster in the middle. So you want to do some analysis of that, and hopefully right off you can kind of see that we have two main clusters, one on the right and one on the left.

So let's go over to Statistics, and now we're going to run some basic statistics on this network. The Network Overview has a lot of really useful ones. Those are the ones that I always run myself. I start with Network Diameter. That's going to give us a lot of the main centrality measures. You can pick whether you want to do that as a directed or an undirected network. I'm going to treat this as undirected; I think it's easier to work with undirected networks. So do that and click OK, and it's going to run. This ran pretty quickly, but you may have noticed a little progress bar opened up here on the bottom of the Gephi window. You can always check there if you feel like it's running slow. Once you run it, you get this report. It's always pretty uninteresting, I think. It's got a few little points that show up on these graphs, but it's not really anything that provides a lot of insight. But what you get out of this is that Gephi has now computed a bunch of different centrality measures. So you can just close this window out, and now we can use those centrality measures to find other things in the network.

The way that I like to do that is to change the way the network looks. So there's this Appearance box up here at the top, and you can click on Nodes or Edges. We're going to look at nodes, and you can pick all sorts of different information about the nodes. The information is coded over here, and the two that we're going to look at are color and size. Those can be hard to remember, but if you just kind of hover over it, it will tell you that this is color and this is size. So let's click on size, and then you can choose if you want a unique size, so there you could just specify every node will be the same size, or you can adjust the size based on an attribute. So let's click on Attribute, and we're going to pick, we'll
start with betweenness centrality. You can set a minimum, so the smallest betweenness will get this size for the node, that's in pixels. The biggest betweenness will get this one, and it'll create a smooth gradient between there. So a node that has a medium value of betweenness centrality will get a size somewhere in the middle, between 10 and 50. You can adjust those however you want. These are values that I tend to use by default, because it makes it easy to see the big nodes but they don't get kind of overwhelmingly large. So from there we're going to click Apply, and now you can see we've got this one big giant node in the middle that is the most central. If I mouse over that, you can also see this other interesting feature here, which is this arc right there. That means the person has cited themselves; that's a self-loop. The first place I actually saw this was in food webs, where you see what species eat what, and you'll see a self-loop if they're cannibalistic. So I always kind of think of those as cannibalistic nodes. So we can see most of these nodes have a pretty small betweenness centrality, except for that one big node in the middle. That's something that we're going to come back and look at in a bit.

Okay, apologies. You can see that my network has changed a little bit, because I crashed my Gephi, so we edited that out. I re-laid it out, so now we're back here where we were. So we just noticed that we have this big node here, but everybody else looks about the same. So there's other ways that we can look at those other nodes. In addition to setting the size attribute, we can also set the color, and we can base that on an attribute. And if we look, there's all kinds of other things we can pick here. So we can do degree, total degree, or separate that by in- and out-degree. But I like to look at closeness centrality, so that shows us the more central of the nodes. That may end up being the same as betweenness centrality, or it may be different. So that's an interesting thing that you can
visually look at. In this case we've made the big nodes have higher betweenness centrality, so if they also tend to match the higher end of our color scale, then we know that the two types of centrality are sort of the same. But they won't necessarily be. So here we have a color scale that's just going from white to green. If you click on this little icon here, you can pick a whole bunch of different color scales. So let's actually do this one, this red to yellow to blue, so we can really see the color differences. So the lower closeness centrality nodes will be red, middle ones will be yellow, high closeness centrality will be blue. And so if we apply that, we see our big middle node with high betweenness centrality tends to be blue, and then we have kind of a mix. We've got a lot of these red nodes with low closeness centrality, and then some yellow ones scattered in here. So our closeness seems to pretty much track betweenness. There's a similarity there.

We could also compare this to degree centrality, and here we'll leave the white-to-green scale and apply that. And again we can see there's a pretty close tracking. Our highest-betweenness node is also the darkest green, and as you go out from there we see that the nodes get whiter. So basically that's telling us this node has a very high degree, high closeness, and high betweenness.

What is that node, right? That's the question. So we can't tell from here, and unfortunately there's not some nice feature where you mouse over it and it tells you. But we do have labels for that. So there's a few options. You can kind of pull this node off to the side if you want to, and then we really can see how many connections it has. In fact, it seems to be connected to almost everyone, which is interesting. So let's put that back where it was. By the way, I just use this little grabby tool over here. That should be your default that's selected. I had clicked on something else, which is why I had to go back to it. So that's our grabby
tool. Let's actually look at the labels now. So we have this bar down at the bottom, and if you click the T, it's going to turn the labels on. And this is usually what happens when you do that: you're overwhelmed with the amount of text that's there. There's this tiny little icon over here, and if you click that, it brings up this panel that gives you some control over these labels. So if we click on Labels here, one of the most useful features is under Size: you can pick Node Size, and you can see what that does is it scales the size of the text to match the size of the node. And so our big node in the middle has a big label and the other ones have smaller labels. That makes it easier to read. And of course we don't even need to do much. We can see that this is Milgram. He's the guy that this network is based on, right? We're looking at people who have cited his paper, and so of course he's going to be connected to all these nodes in one way or another. So that sort of becomes an uninteresting node to have in the network. We know that this is about people who are connected to Milgram, and so it's not all that useful to have him here.

From there we can sort of look in, and we can see on this side we've got Duncan Watts. There he is, Watts D. We're going to read some of his work in this class, so he's a pretty big one. And we can't really read a lot of the other labels. So now what do we do? What I would do at this point, and since I'm showing you exactly how I do this analysis: we need Milgram out of this network. He's just cluttering things up for us. So I'm going to turn these labels off. We know that Milgram is this big one, and if you Control-click on it, you can do Delete. That's going to take that node out of the network. So I'm going to say yes, and now we have a network without that big giant node.

Now, all of our previous statistics no longer apply, right? Because they were computed having that in the network. So the first thing I'm going to do is just repeat the process. I'm going
to use the Yifan Hu algorithm and lay the graph out again. It looks about the same, except our big giant node is missing. I'm going to rerun the Network Diameter, making this an undirected graph, and ignore the report. But now I have new values, and so for the nodes I can reapply the degree coloring, and even though the degrees haven't changed for these other nodes, the relative values have changed, right? Because our very highly connected node is gone, so nodes may fall in a different place on the scale. So we'll apply that; it's a little bit greener. For the size we'll still use betweenness centrality and apply it, and then we can see we get some more central nodes on this side and on that side appearing bigger.

Now, I feel like this is a little bit claustrophobic, like everything's kind of in on top of each other. And so there's an option here in Layout called Expansion, and all it does is spread the nodes out. It keeps their relative position; it just makes them further spread. So I'm going to click that and you'll see it get bigger, and we'll do that a few more times. So now there's just more space between our nodes. That makes it easier to see which of these are big ones and which aren't.

So this is our node that I believe was Duncan Watts. We can turn our labels back on. We have J. Travers and Granovetter up here. Duncan Watts we're going to read; he's written a lot about small worlds. Mark Granovetter we're going to read; he wrote about the strength of weak ties. And so we're already starting to see big players in this space of things that we're learning about.

What we also can tell is that we really do have separate communities. If we zoom in here a little bit, you can see, for example, this is Mark Newman, Strogatz here, Albert Barabási. These are all people who study small-world networks and network structure. We can also see a few little errors in our data. For example, here's Duncan Watts again, so he appears as two different nodes in the network. If we really wanted to clean it up, we'd merge those,
but since this one's so small, we can probably say, well, it won't affect what we're seeing. So over here, these are a community of people who all study network structure, small-world networks, and the kind of mathematics of what networks look like.

If we zoom out and then zoom in over on this side, what we can see is that we have a lot of people who study the sociology of networks. I know this because this is what I study, but you'll see this when we look, for example, at Granovetter; he's really a sociologist. And so these people are looking at the math of networks, and these people are looking at the sociology of social networks and how people use them. And so it makes sense that these communities are largely citing within themselves. The mathematical people are citing other mathematical people, the sociologists are largely citing other sociologists, but there is some overlap between them. You can see, if we look at Duncan Watts, he's got some citations over to this cluster, and if I kind of jiggle the node you can see those edges moving. And vice versa, our big nodes like Mark Granovetter have some citations over to that side. But then we do have some nodes that kind of fall in the middle. They haven't written a lot of papers with a lot of citations, but they point pretty much equally to both sides.

So that's one way that we can do this kind of analysis. You can see I kind of did this iteration between looking at who's important, running some calculations and some statistics, trying different sizing and coloring attributes, and then potentially filtering the network. In this case I deleted a node, but remember we've also filtered out those singletons that were around the outside. Then reapplying our statistics, looking at it again, and then doing an in-depth analysis, saying, let's go into this cluster, which we can clearly see is grouped together, and let's see who these people are. So I'm fortunate that I've read a lot of this work and I knew who these people are, so I could define the different clusters. But if
you didn't, you would just simply randomly pick some people out of here, probably pick the higher-betweenness ones first, but then also look at some others, and look them up and see what kind of characteristics they have in common. That can be a lot of work, but it's a really critical step in the hybrid process of looking at the structure and the content of networks, which is kind of the core of what we're doing in this class. So you eventually learn here the distinguishing characteristics of this group versus the distinguishing characteristics of our sociology group over here.

So that's a basic overview of the process that I often do, and that hopefully you'll get very comfortable doing when using Gephi to analyze a network. Run through this a couple of times. You'll see on the website I have links to the files that I've used here, so you can try it out yourself, and hopefully that will get you feeling a little bit more comfortable with the process.
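
As an illustration of the two input files described at the start of the walkthrough, here is a minimal Python sketch that writes an edge table and a node table as comma-separated text. The header spellings `Source,Target` and `Id,Label` are assumptions based on the headers Gephi's spreadsheet import conventionally uses, so check them against your Gephi version. The function name `write_gephi_tables` and the sample IDs and names are made up for illustration and are not the data used in the video.

```python
import csv
import io


def write_gephi_tables(edges, labels):
    """Return (edges_csv, nodes_csv) as strings.

    edges  -- iterable of (source_id, target_id) pairs
    labels -- dict mapping node id -> display label
    """
    edge_buf = io.StringIO()
    writer = csv.writer(edge_buf, lineterminator="\n")
    writer.writerow(["Source", "Target"])  # assumed header spelling
    writer.writerows(edges)

    node_buf = io.StringIO()
    writer = csv.writer(node_buf, lineterminator="\n")
    writer.writerow(["Id", "Label"])  # assumed header spelling
    for node_id, label in labels.items():
        # The csv module quotes a field automatically when it contains a comma.
        writer.writerow([node_id, label])

    return edge_buf.getvalue(), node_buf.getvalue()


# Hypothetical sample data, not the files from the video.
sample_edges = [(1, 2), (2, 3)]
sample_labels = {1: "Watts, D", 2: "Granovetter, M", 3: "Milgram, S"}

edges_csv, nodes_csv = write_gephi_tables(sample_edges, sample_labels)
print(edges_csv)
print(nodes_csv)
```

Saving the two strings to separate files gives one file to import as the edges table and one to import as the nodes table. An extra edge column such as a weight would simply be a third header name and a third value per row.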
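
The filter, measure, resize, delete, and recompute loop from the walkthrough can also be sketched outside Gephi. The following plain-Python approximation is not Gephi's implementation: it drops nodes below a minimum degree, computes unnormalized betweenness on an undirected, unweighted graph using Brandes' algorithm, and maps scores linearly onto sizes between 10 and 50. All function names and the tiny sample graph are hypothetical, and self-loops are ignored as a simplifying assumption.

```python
from collections import deque


def build_adjacency(node_ids, edges):
    """Undirected adjacency sets; self-loops are skipped (an assumption)."""
    adj = {n: set() for n in node_ids}
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def degree_range(adj, min_degree=1):
    """Keep only nodes whose degree is at least min_degree."""
    keep = {n for n, nbrs in adj.items() if len(nbrs) >= min_degree}
    return {n: adj[n] & keep for n in keep}


def remove_node(adj, node):
    """Return a copy of the graph without the given node."""
    return {n: nbrs - {node} for n, nbrs in adj.items() if n != node}


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes, unweighted graph)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                    # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while order:                    # accumulate dependencies backwards
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


def scale_sizes(scores, min_size=10, max_size=50):
    """Linear map from score to node size, like a min/max size ranking."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {n: min_size for n in scores}
    step = (max_size - min_size) / (hi - lo)
    return {n: min_size + (s - lo) * step for n, s in scores.items()}


# Hypothetical graph: hub "M" cited by everyone, two pairs, two isolates.
nodes = ["M", "A", "B", "C", "D", "X", "Y"]
links = [("M", "A"), ("M", "B"), ("M", "C"), ("M", "D"), ("A", "B"), ("C", "D")]

graph = degree_range(build_adjacency(nodes, links), min_degree=1)
sizes = scale_sizes(betweenness(graph))
print(sorted(sizes.items()))

# Deleting the hub invalidates the old scores, so recompute them.
without_hub = remove_node(graph, "M")
print(sorted(betweenness(without_hub).items()))
```

In this toy graph the hub is the only node that lies between other pairs, so it takes the maximum size and everything else the minimum. Once it is removed, the two remaining pairs are disconnected and every betweenness score drops to zero, which is a small-scale version of why the statistics have to be rerun after deleting a node.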
Info
Channel: jengolbeck
Views: 134,172
Rating: 4.9562659 out of 5
Keywords: gephi, visualization, network analysis, citation networks
Id: HJ4Hcq3YX4k
Length: 23min 0sec (1380 seconds)
Published: Sat Apr 30 2016