Social Network Analysis. Lecture 1. Introduction to network analysis

Captions
All right, okay. So the course is Social Network Analysis. Previously we called it network analysis for linguists, then we called it linguistic network analysis; somehow social network analysis got stuck as the name for this course. But in general, what we're going to be doing is talking about network analysis, the network approach to the analysis of both social networks and a lot of linguistic, text information, and whatever else can be represented as a network.

So the idea is that in several lectures I'm going to talk about the instruments, the tools, the approaches to network analysis. We're going to talk about what types of networks exist out there and their properties, and then we'll try to focus a little bit more on the linguistic side of things. That's our first module. We'll also have seminars where you're going to go through practical exercises working with text and networks. Then the second part of the course, the second module, we're going to run more as a project slash journal club, where we're going to discuss various techniques and research papers on the topic. So again, the first module is more about learning approaches, methods, and tools; the second module is a research type of discussion.

All right, can you see this? Can you see my screen? Yep. Okay, good.

All right, so class technicalities. I'll be teaching the class; Yamakaraf will be doing the labs and seminars, and I think they're scheduled right after the lectures. We have a Telegram channel; I think you've probably all subscribed to it. Regarding assignments, there will be labs that you'll need to complete, there will be paper discussions, and there will be a course project that you're going to do next module. The schedule is here. There is a website where I'm going to publish some papers, and the lecture slides are
going to be there; I'll probably share them in Telegram too. Recordings of the lectures will also be there, along with my email [unclear], and Zoom as the online collaboration tool.

Now, it's a Python-based course, so we're going to be using IPython notebooks. I strongly recommend you use the Anaconda distribution: it has everything and it just works. For text processing, of course, NLTK; I'm sure you're familiar with that. I might use a little bit of scikit-learn for a few things here and there, and NetworkX as the library for network analysis. Python is still not so good for visualizing large networks, so there are some freely available visualization tools: one is called yEd and the other is Gephi. These are two programs you can download and use for visualization purposes when you need to look at large networks. So those are the technicalities. Any questions about the process? Okay.

All right, so the topics we're going to cover, and there are literally four lectures. The first is today's lecture: it's an introductory lecture, plus we'll talk about statistical properties of networks. The next lecture will be about what I call network node and link analysis: how you look at the links, how you calculate certain types of ranking for the nodes, so literally how you analyze the local structure of the network. The next lecture will be about community detection. This is probably one of the most popular topics in network analysis: how to detect communities, groups of nodes that are tightly connected among themselves. And then we're going to talk specifically about network visualization and various other types of networks.

All right, so network science by itself is kind of a new
science, new in the sense that it started forming as a science and as a scientific community at the end of the last century, in the 90s. It formed from different sciences, and you'll actually notice the influence of those sciences on network science. You will notice that papers on network science have been published, and are still being published, in, for example, physics journals, mathematics journals, economics, bioinformatics. These are all sciences that the scientists now working on network science came from, and they just keep using those journals to publish things. At the same time, there are dedicated network science journals and conferences right now.

But again, if you think about the very origin of network science, it is of course in sociology: social network analysis, which goes back to mid-20th-century work. Then there is of course mathematics, where people talk about graphs; computer science, graphs and networks; physicists like to talk about complex phenomena, so they usually call them complex networks; and then there are economists, bioinformaticians, et cetera. So there are a lot of different disciplines that focus these days on using this network representation of knowledge.

Because network science comes from so many different disciplines, there is also some duplication in notation and terminology. You will often hear me referring to a network as a network or sometimes as a graph, and it's the same thing. You call nodes vertices, or actors if that comes from social network analysis. Links can be edges or relations. Sometimes people talk about communities, but sometimes people call them clusters. So there is a lot of this,
and eventually you will notice how that terminology propagated from different fields. By looking at what people call certain things in network analysis, you will actually understand what field of science they came from.

Now, a bunch of examples. In fact, when we started teaching this course, probably ten years ago, I had to show, okay, this is what a network is, and that's what we're going to talk about. These days everybody understands what a social network is, and I'm sure you have accounts on a bunch of the networks, right? So it's no surprise that you can see this example of a social network, where communities are color-coded, friendships are edges, and nodes are people. This is someone's friendship network, where the person is shown as the big node in the middle, and then there are different communities surrounding this person.

Here, for example, is a LinkedIn map, also an egocentric network, and there are communities, groups of nodes that are tightly connected to each other, color-coded.

Here is an example of co-authorship. If you notice, there is always a reference to where this comes from. This is the International Committee on Intellectual Cooperation; there are 1,723 nodes here on the slide, and they're connected when they worked on the same document, so it's co-authors. Again, please notice the way the visualization is done: the size of the nodes is proportional to the number of connections. And then there is this very tricky job of laying it out. Laying out means positioning the nodes in the picture, on this two-dimensional plane, such that
you can actually see something. You will notice later on, when you start loading networks, that this is probably the most challenging task in network visualization: how to place the nodes such that you can see the relationships, and see them clearly.

All right, here is another example. It's from Twitter... I'm sorry, that's not Twitter, this is political blogs, and this is how blogs refer to each other. Color-coded is conservative versus liberal. It's kind of entertaining for those who are in political science: you realize that conservatives of course mostly reference conservatives, liberals mostly reference liberals, and there is a very thin layer of blogs, meaning people, who look in both directions and try to keep it balanced.

Here is another example. This is Twitter, and this is how news spreads on Twitter. Here the edges are follower links that are activated when they retweet, and every node here is a tweet.

This is a very famous data set, called the Enron data set. Do you guys know what Enron is? Okay. So Enron was a U.S. company, energy traders, in the early 2000s, and there was a huge scandal because those guys were inflating prices, doing insider trading, and a lot of other things. Eventually the company collapsed and there was a big lawsuit, and because of that their internal communication, their emails, were presented as evidence in court, and so they became available online. That was like the first corpus of emails that became available online for researchers. Researchers were extremely happy, because the emails contained all the text of the communication, of course, but they also
contained all the links, right: who wrote what to whom, and when. So it became one of the favorite data sets for network science and language processing people who wanted to know, for example, how people influence each other and how they communicate. Anyway, this is a visualization of that Enron data set. You can search for it, just type "Enron email data set" and you'll find it easily. Here the nodes are people, the edges are communications, emails sent between people, and the size of a node is the intensity of communication, the total number of emails in the data set.

So when you look at those networks, you might notice a few properties. On the right-hand side there is one of those networks. You might notice that they're not regular, but at the same time they're not completely random. The topology, which means the connectivity patterns, is not trivial, in the sense that you don't see a very regular structure. And maybe you realize there is a little bit of a fractal structure: when you zoom in, you see quite a similar structure to when you zoom out.

The most amazing part about networks is that there are a lot of properties of those networks that are quite universal. Universal means there are networks that come from biology, there are transportation networks, email networks, language networks, you name it, but they have very, very similar statistical properties, amazingly similar. That's actually why physicists love the science of networks: physicists usually try to find universal laws, laws that describe phenomena, and it feels like something unique and universal happens
that generates networks in very different fields that have very similar properties. We're going to touch upon this.

That's also why it's a complex system. It's the same as with a human body: learning the properties of the cells of our body will not explain the behavior of the human; you need to study a human as a whole. So a complex system really means that we can study and investigate every node independently, we can look at every edge, but deconstructing it to that level is not going to help us understand the behavior of the network as a system. We need to study the network as a system, holistically, in its entirety. That's where the word "complex" in complex systems comes from.

As I mentioned, there are universal properties, and there are three major universal properties that complex systems, or complex networks, have. The first is a very specific node degree distribution; we're going to talk about it in a moment. The second is what's called the small-world property, or small diameter, or six degrees of separation; you've probably heard about this [unclear]. The third property is a high clustering coefficient, or high transitivity, of the networks. In fact, if there is one thing I want you to remember after this lecture, it's those three properties. When you talk about networks, there are usually these three properties that are quite universal across all the networks out there.

Okay, so we're going to dive right now into those properties, so you understand a little bit better what I really mean. So, first of all, distributions. Now, if we try to
measure, for example, take a group of people and try to measure the height of every person, then you're going to get some distribution. There will be a few people of maybe 170 centimeters, then some of 165, then 185, et cetera. This distribution might actually have two peaks, for males and females, but if we combine them there's going to be some distribution, and it's usually very similar to a Gaussian. The idea is that there is a very clearly expressed peak in the distribution, where the majority of people's heights are, and then there are very quickly decaying tails. So most people are within a sigma, or maybe a couple of sigmas, of the mean of the distribution.

In plain English, it really means that you can hardly meet somebody who is two and a half meters tall; maybe some person somewhere, but very rarely. And I'm sure you're not going to meet any human who is six meters tall. The same on the other side: yes, there are people with diseases who are maybe one meter tall, but again, that is a disease. For the majority of people the heights are centered around probably 1.70 or 1.75 meters. So this is a distribution, and a lot of things in the world are distributed that way, but not all of them.

The power law distribution is a very, very different distribution. Just to connect things with language, for example, the Zipf distribution is a type of power law. Did you guys hear about the Zipf distribution? Are you familiar with it? Can you remind me what that distribution says? What
is it, what Zipf's law is? Yes, you're absolutely right. The idea of Zipf's law is that if you take the frequency of word usage and rank all the words by frequency, then the frequency of appearance of a word is inversely proportional to its rank in this table.

But this is not only Zipf's law. For example, if you look at the populations of cities and just count how many cities have what population, you realize that there are several cities with a lot of people, but the majority of cities have a very small population. If you think about Russia, pretty much everybody lives in Moscow these days; the population of Moscow is clearly dominating. If you look around the world, it's exactly the same situation: there are a few cities that are really huge, and the majority are small.

This type of distribution is called a power law distribution. The general form says that the frequency of something (or the probability, if you normalize it; if you don't, it's frequency) is inversely proportional to that number raised to some power. Gamma is that exponent, and that's why it's called a power law. For example, gamma can be one, and then it is just C over k, and that's your Zipf's law. But gamma can be any other number. So I would say that the probability of encountering a city with, say, a thousand people is going to be proportional to some coefficient divided by this population to the power gamma. That's the idea.

Now, if you draw this law on a regular scale, it's going to look like the plot on the left. The interesting part is
that if you actually take the logarithm of this expression, then the power law will look like a straight line on a log-log scale. Does this make sense from these formulas, guys? Yes, no, sort of, kind of? Okay. What I'm trying to say here is that if your x-axis is log k and your y-axis is log f of k, then on these axes it's going to be a straight line. The reason is just that, if you look at the formula, the logarithm of a product is equal to the sum of the logarithms, and the logarithm of a number raised to a power is equal to that power times the log of that number. So log f of k is log C minus gamma times log k, which is a straight line with slope minus gamma. So, basic math. We're going to encounter this many, many times, so you'll get used to it.

There are plenty of those power law distributions out there in the world. Like you said, word frequency is clearly following it. And this is log-log scale, so you expect to see straight lines here. Word frequency: the x-axis is again a logarithmic axis, so it's the logarithm of word frequency, versus how many times it occurs. Citations: there are some works that are extremely popular, but there are not that many of them; the majority of works are not that popular. Books sold, web hits, earthquake magnitudes: there are tons and tons of phenomena out there that follow this power law distribution. Sometimes it's also called the long-tail distribution.

So again, there is a quantity of interest, for example a frequency. Now we're going to talk about networks, and for us what's important is the frequency distribution of node degrees. Node degree is how many neighbors a node has, so how many friends you have. What's of interest for us is the distribution of the number of friends each person has. You can think about the number of friends, or, if
you think about, I don't know, Twitter, it's the number of followers, or Instagram, it's the number of followers, or whatever. What do you think the distribution is? Actually you already know the answer, it's written here. There are celebrities who have tons and tons of followers, thousands and thousands, millions of followers, but there are not that many celebrities. The majority of people have few friends. How many friends does each of you have on Facebook?

Maybe 200.

200, okay. Who else? Who has more? 500? Who has 500? Oh, you guys are a shy crowd.

All right. So typically on Facebook it was really a couple hundred friends for a normal person. Then there are some people who are quite lonely out there, with very few friends, and there are quite a lot of those people; that's the majority. And there are a few people who have a lot of friends, but those are either fake friends or celebrities, again with fake friends. Anyway, that's the distribution, and that's a power law.

It's a very similar picture for citations of papers. When you take scientific papers, there are some papers that are extremely popular and have a lot of citations, but there are very few papers like that, and the majority of papers have very few citations. This was noticed in the paper by Derek Price in 1965, so this power law distribution goes back to 1965, and in fact this is also [unclear] 1961, where they studied the relative number of cited papers from the subset of papers published in 1961.

Now, can you think about a mechanism? Why would that be happening? Why do we get some papers that are extremely cited,
and others kind of obscure and almost not cited? What's happening?

From my perspective, the more significant the paper, the more it gets cited.

Yeah, I mean, look, of course there is this notion that somebody wrote a brilliant paper and it eventually gets cited. But what's interesting is how fast this happens.

And the more it's cited, the more people get to know about the paper, and it becomes more and more cited.

Yeah, exactly, that's what I had in mind. There's this sort of self-fulfilling prophecy, which is, you know, money goes to money, in some sense. A paper gets noticed and gets cited. There is another paper with maybe similar results, but it was not noticed. Now, for the paper that got cited, more people read the papers that are citing it and say, okay, that's the paper, so they start citing it too, sometimes without reading it. If you actually write papers yourself, you realize that people don't very often go and trace things down to the original source. That's the snowballing effect, and that's pretty much what lifts some papers to fame while others go to obscurity, though they all start at the same base level, no citations, and they might be quite similar in quality and discussion.

In fact, there are so many stories about this in science. If you think about the famous deep learning, there are people who are extremely famous now because their papers were cited, and now they are cited as the fathers of deep learning. But there is another person who wrote papers at the same time, and somehow he doesn't get the fame, though if you go 20 or 30 years back you realize that back then his papers were as influential as theirs. So that's the magic of this
sort of enhancement that happens due to citation, and then you get the power law. So that's where the power law comes from; at least in citations, I would say, that's the mechanism. It's much harder to understand what the mechanism is in biology or in other networks, but in citations this is probably why we get this very uneven distribution.

I really like this slide and this picture. In fact, this picture is from the textbook by Albert-László Barabási, and I strongly recommend that textbook for this course. It's available online, and it's actually quite good in terms of both the level of scientific rigor and the readability of the book.

What you see here on the right-hand side is an example of a power law network, which is the airline network. Well, at least it used to be, before COVID. You have major hubs, and there are very few of them in the US: LA, Chicago, Boston, Atlanta. Lots and lots of flights go there, so they have lots of connections. And then there are very many small cities that flights go to, but only one, two, three flights. It's the same thing in Russia: there's Moscow, which is a huge hub, everything flies to Moscow, but pick any other city and there will be very few flights coming into it. So it's a very uneven distribution in terms of the number of flights out of a city.

What's on the left is a simple visualization that helps you understand those connections. On the x-axis is the number of links, or the number of friends, the node degree. On the y-axis is how many nodes of that type exist. And so what it says is just,
if you go to the left on this picture... By the way, I can probably, just give me one sec. Oh no. All right, sorry, it's not allowing me to easily show you. All right, so if you go to the left of this picture, there are a lot of nodes with only a few links, and on the right it shows you a node with very many links, but it says there are very few of those nodes in the network. It's the same thing with a social network: a lot of people with a few friends, and very, very few people with many friends. So this is an example of a power law, and this is how you spot it on a graph.

This is actually a very cool slide: it's the history of philosophy. The way to understand it is that different philosophers are shown as nodes, and there is a connection from a philosopher to another philosopher if that other philosopher follows the first one, follows his teaching, his philosophy. If you look at this, I'm sure you know these names. Anything surprising? No surprises?

For me, honestly, and again, I'm not a philosopher, what was extremely surprising to see was that Karl Marx has as many followers among philosophers as, for example, Plato or Aristotle. Because here, again, the size of a node is the number of followers within the philosophy world, the number of philosophers that follow his teaching. So for me it was like, wow, that's pretty big.

Anyway, the point is, again, that when you have a power law distribution, there are some nodes, some people, who have very high degree, very many friends, very many followers, but the majority of the nodes in the graph have low degrees. So there is this very, very strong difference
between those who have everything and those who have nothing. If you think about economics, it's the same principle, 80/20 or even more skewed: a few people have the majority of the wealth, and the rest have very little money. So it's the same principle. It's Zipf's law, it's the Pareto law; there are different names for this. In networks it's expressed as a power law, which is, again, few nodes with very high degrees, meaning lots of connections, and the majority of nodes with low degrees.

Now, mapping it for a second onto language: think about every word being a node, with an edge placed between two words if those two words occur next to each other in a sentence. Think about this co-occurrence network. Which words would have high degree? The words that occur with many other words. So what are they?

Exactly, articles. In English it's probably "and", or "or", or "a". In Russian, I don't even know what it's going to be. Maybe you guys will have a chance to check, and that's part of this course, to actually try to analyze this and see the positioning of the words.

Okay, so that's one story, this power law distribution. The second story comes from sociology. The first person who noticed it and dedicated a lot of work to it is Mark Granovetter. He looked at social networks and published this famous paper, "The Strength of Weak Ties". It's actually a paper about job search: it suggests how you need to look for a new job, literally saying that you need to use what are called weak ties instead of strong ties when you're looking for a new job. I'll explain in a second why that's the case. What he said is that strong ties are those friendships that are used very often, so
they're strong, the connections are strong; those are your close friends. Weak ties are those that are used once in a while. These are the friends you probably have in your phone book, but you call them maybe once a year, to say happy New Year.

The point he was making is the following. You have a social network, and you have this kind of scenario: A, B and C are people, there is a strong tie between A and C, and there is a strong tie between A and B. That, he said, is impossible to have on its own: you cannot have strong ties A-B and A-C and not have a connection between B and C. So he was pretty much saying, guys, in social networks there are triangles. You're not going to have a strong friendship between A and B and between A and C while B does not know C. Just think about your friends: most of them most likely know each other. That happens because you were introduced to them that way, or you somehow introduced your friends to each other, or maybe you mention one person to another in your discussions. So if you have strong relations with people, they usually know each other, and that forms these triangles. There are a lot of triangles in social networks; humans do that.

So what Granovetter was saying in his paper is: look, don't look for a new job by following strong ties, because you remain within the same community of people all the time. You need to use long-range weak ties to look for new opportunities in life. But anyway, the main point of his paper is that when you look at a social network, it's about triangles, and you will very rarely see this kind of arrangement, this A, B, C, which is an open triad. He called it the forbidden triad.
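The triangle argument above can be tried out on a toy network. The sketch below is only an illustration and was not shown in the lecture: it uses plain Python, a made-up four-person friendship list, and the simplifying assumption that every tie is strong. For each person it looks at every pair of that person's friends and checks whether the pair is itself connected (a closed triad, that is, a triangle) or not (the open A, B, C arrangement).

```python
from itertools import combinations

# Hypothetical toy friendship network (undirected); the names are made up.
# Assumption for this sketch: every listed tie counts as a strong tie.
edges = [
    ("A", "B"), ("A", "C"), ("B", "C"),  # A, B and C all know each other
    ("A", "D"),                          # D knows only A
]

def build_adjacency(edge_list):
    """Return {node: set of neighbours} for an undirected edge list."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def classify_triads(adjacency):
    """Split every (centre, friend, friend) triple into closed and open ones."""
    closed, open_ = [], []
    for centre, friends in adjacency.items():
        for b, c in combinations(sorted(friends), 2):
            if c in adjacency[b]:
                closed.append((centre, b, c))  # the two friends know each other
            else:
                open_.append((centre, b, c))   # open triad around the centre
    return closed, open_

adjacency = build_adjacency(edges)
closed_triads, open_triads = classify_triads(adjacency)

print("closed triples:", closed_triads)
print("open triples:", open_triads)
print("fraction closed:",
      len(closed_triads) / (len(closed_triads) + len(open_triads)))
```

In this toy data the single triangle A, B, C is counted three times, once per centre, and the two open triples are both centred on A, because D knows neither B nor C. In a real analysis the same counts would normally come from a graph library rather than hand-written loops.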
Right, this is 1973, which is, what, 50 years ago. And here is an example from Facebook, from the data science team at Facebook, where they looked at friendships for one person. This is the central person, and they talked about "old friends" relationships and "maintained" relationships, maintained meaning that people communicate a lot. You can see lots of triangles in there: triangles here, triangles there, triangles, triangles, triangles. Here is a co-author network, which also has quite a few triangles, for the obvious reason that co-authors usually know each other pretty well. And here is an organizational network; in an organization you work with people, so people know each other, so there are also triangles. So social networks do have a lot of triangles.

Now, another word for this abundance of triangles in a network is transitivity. Transitivity really means that if A knows B and B knows C, then A will know C. It's the transitivity property from mathematics, and that's what is used here. So the point is that social networks have a high number of triangles, and another word for this is transitivity. And then there is yet another term: it's also called the clustering coefficient. That often causes confusion, but when people talk about the clustering coefficient they usually mean the fraction of triangles in the network, that is, the ratio of the total number of triangles to the total possible number of triangles. We're going to look at the formulas a little later, but that's the terminology.

Finally, there is a third property that I want you to know about: six degrees of separation, the small-world property. How many of you have heard about this experiment? Yep, one, two. Yeah, so it's a famous experiment by
Stanley Milgram. Milgram is actually famous for another experiment. Do you know which other experiment he's famous for? All right, this is the guy who set up the obedience experiment, and if you don't know about that one, read about it, because that's a scary, fascinating thing. I mean, I was amazed that he was even allowed to do this. Today I'm sure no university would allow anyone to run that type of experiment.

Now, this one is a much safer experiment. He just tried to understand how people know each other. What he did was publish an ad in a newspaper saying: look, I want you to participate in an experiment, I want you to send a message to a person. This is exactly his ad, the instructions for the participants of the experiment. He said: I want you to deliver a letter to a person you don't know. What you're allowed to do is send this letter to any of your friends, any people you know, if you think that will shorten the distance to the final recipient. That's what it says. And you're allowed to send it to the final person only if you know him on a personal basis. So the idea is really to trace the letter as it goes from friend to friend to friend.

Today it would probably be very easy to do: you can just keep forwarding emails until one reaches the right person, though most likely people will not do it because it looks like spam. But back in the day, and again this is more than 50 years ago, he actually published a real ad in the newspaper and asked the participants to fill out cards and send those cards out. The target person was
living in Boston, and for the people who were supposed to send him letters, he picked some number of people in Boston and some number of people in Nebraska, which is in the middle of the United States. The important part of the experiment was to keep a ledger of the letters, so when somebody received a letter, the letter would also say that it had gone through two people, or three people, or four people. And, amazingly, that actually worked. Oh, and by the way, this is a map.

So he recruited 296 volunteers, and 217 letters were sent. What they knew about the target was that he was a Boston stockbroker. There was no name; the address was known, but they were not allowed to send the letter to that address. It was known where he lived, where he went to college, what his hometown was. The goal was to deliver the letter to him without sending it to him directly, passing it instead through friends you believe might know him better or be closer to this person, whom you do not know. So this is literally an experiment to measure how well the world is connected through friendships. The amazing part is that about 29 or 30 percent of the letters actually reached the target, passing from hand to hand. The average chain length they measured was about 5.2, so a letter went through five people on average before reaching the target. That's where the phrase "six degrees of separation" came from, from this experiment.

After that there were many other experiments, and of course when online social networks started there were lots of experiments confirming this. For example, there was an experiment in 2001 on emails, using a lot of senders; there's of course the Microsoft Messenger graph at some point,
then there's the Facebook graph, et cetera. Interestingly enough, the Facebook one is of course the largest experiment. Back in 2012, when Facebook had a little less than a billion users, they measured the separation, the average distance from one person to another on Facebook, and it was about 4.7 handshakes, or 4.7 people on average. And it's interesting that it is shrinking: as Facebook grows, the Facebook world becomes smaller and smaller. So that's yet another property of the network, which is amazing, because there are almost a billion people, and what this experiment says is that on average, if you pick one random person and another random person, you need to go through only about four nodes to reach that other person. That is amazing, considering that there are 721 million people in there.

Now think about this. Say you have 100 people, each knows only four people, and they live on a grid. It will take on the order of 10 steps to go from one end of those cells to the other. And here you have 700 million, and it's only four steps; you're only four steps away. That really means you're, say, four to six steps away from any other person in the world on average. There are some people who might be much harder to reach, but this is amazing.

You'd think this is totally impossible, but, surprisingly, there are extremely simple mathematical models that show that this can in fact happen, so it's not entirely crazy if you think about it. Let's look at one very simple model. In this picture every person has four friends, and the only assumption I make here is that these friends do not overlap. Now, this assumption contradicts the
triangle story I told you before, but let's make that assumption for a second to keep the computations simple. If the first person has four friends, and each of his friends has four friends, then the second degree is going to be 16 friends. If each person on the next layer has four friends, then it's going to be 64 friends on the third layer. OK, now let's assume that the first person has 10 friends. Then on the second step it's going to be 100 people you can reach, on the third step 1,000 people, on the next step 10,000 people, and that's only four steps. Two steps later it's a million. So this kind of setup has extremely fast growth; it's exponential growth, in fact. It's the same way COVID grows: very, very fast. You can easily estimate that if I want to cover 6.7 billion people, and each of us has 50 friends and they don't overlap, it will take only about 5.8 steps to connect all of the people, if people were living in this type of centralized model.

Now, there is of course overlap, so maybe it's not 5.8, and on average you have more than 50 friends, but it just shows that this is in fact possible if the network has a very particular structure, and this is an example of that structure. Language networks and friendship networks do have that type of structure, and it is called small-world structure.

All right, so what do we see? We see that there are three major properties, and I forgot to put that slide in... where are they... here they are. There are three major properties that you want to know about networks. One is the power-law degree distribution, or what's called a scale-free network, which means that you have lots of nodes with low degrees and a few nodes with
high degrees, but those are very, very high degrees, so this is a very skewed distribution. The second is the small-world property: six degrees of separation, six handshakes that separate the world, whatever you call it. The fact is that within these networks you can very quickly get from one node to another, so the distance along the network is very short. That's the second property. And the third property is the high clustering coefficient: there are lots and lots of triangles in the network. It's also called, again, transitivity, so it's high transitivity. So these are the three important things you need to remember about those networks, and in fact most networks have these properties. Any questions?

All right, OK. You guys are linguists, so you probably know this better than me; I'm not a linguist. But I want to give you a few examples of networks and how they might work in linguistics. In fact, the goal of this course is for you to learn and to discover how the tools and the properties we're going to teach you, how those approaches, might work for linguistics. That's literally the goal of the course.

So I'll give you a few examples of networks. There is the word co-occurrence network, which is word precedence; then you can think about semantic networks; then you can think about syntactic networks. Let's look at a few of them. This is a co-occurrence network, where the nodes are the words and they're connected by an edge if one word occurs in the sentence after another word. Now, you can actually build very different types of networks here. You can count only those connections where one word follows another directly, or you can also count it when it follows through a few other words.
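As a rough illustration of the co-occurrence construction just described, here is a minimal sketch in plain Python. The function name `cooccurrence_edges`, the `window` parameter and the two example sentences are assumptions made up for this example, not something taken from the lecture slides. With `window=1` only directly adjacent words are linked; a larger window also links words that follow through a few other words.

```python
from collections import Counter

def cooccurrence_edges(sentences, window=1):
    """Count directed edges (w1, w2) where w2 follows w1 within `window` words."""
    edges = Counter()
    for sentence in sentences:
        words = sentence.lower().split()   # naive whitespace tokenization (assumption)
        for i, w1 in enumerate(words):
            for w2 in words[i + 1 : i + 1 + window]:
                edges[(w1, w2)] += 1       # edge weight = number of times the pair occurs
    return edges

# Hypothetical toy corpus, invented for illustration.
sentences = ["the cat sat on the mat", "the dog sat on the cat"]

direct = cooccurrence_edges(sentences, window=1)   # direct precedence only
wider = cooccurrence_edges(sentences, window=2)    # also allows one word in between

print(direct[("sat", "on")])
print(wider[("cat", "on")])
```

The edges here keep their direction, so word order is preserved; merging `(w1, w2)` with `(w2, w1)` would give the undirected variant.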
It doesn't matter; you can build it either way, you can remove stop words, you can do whatever you want. It's your network. And then you can build it as a directed network, and then you preserve the sequence of words in the sentence, or you can build it as an undirected network with pure co-occurrence. What you see here is word precedence relations, where we literally look at one word followed by another. And it's kind of funny: in this picture you can just start somewhere in the graph and start walking around, follow a random walk, and it will generate different sentences. Now, if you look here, the best connected, highest-degree nodes are the article "a" and the preposition "to", which is sort of obvious.

All right, so that's one type of network. This network was generated from a few sentences, which is why it's so small, but you can just take War and Peace and run it on that. In fact, that's what we did a few years ago: you build the networks for different authors, and you find that different authors have very different networks, because the network reflects how somebody uses language. Some people are much more, I'd say, fluent with language, and they can put together sentences that others will not. So you can actually distinguish authors by the average network built on their books.

OK, so of course you can also build syntactic networks, where you parse the sentence and then connect things. You can build semantic networks. This is less automatic, because I think there's probably still no good way to automatically
build semantic relationships. But there are thesauruses that contain semantic relationships. Correct me if I'm wrong: do you know if there's any way to automatically build networks that show you semantic concepts? I don't know. You guys being quiet means you don't know, or they don't exist?

"Maybe we can take dictionary definitions and use them as a text to make a co-occurrence network?"

No, no, I mean, of course, look: if you take a dictionary, yes, the dictionary will tell you that these are synonyms, or that this is an antonym. But it's us humans who create those dictionaries; they're not automatically created. We decide which words are synonyms and which are not.

"In some sense we can build the graph out of the vocabulary that was learned by the word2vec model, for example."

OK, that's actually an interesting thing. I'm wondering whether word2vec will be smart enough to group synonyms together. It feels like it should, but I don't know if it is good enough to do that. Again, the idea of word2vec is that it places your words in their context, and you would expect to be able to pick out synonyms if they're replaceable, if they appear within the same context, just as different words; then they might be synonyms. So it could be that it works. I don't know, I never tried, but this is one of the projects you guys can work on.

OK, so the point is, again, coming back to the story of the properties of networks. This is from one of the papers, where the authors look at different network types, I believe for English. So there are co-occurrence networks, syntactic networks, semantic
networks, and they look at different sizes of networks. This is just the size of the experiments, nothing else. Then they measured the average degree in the network, and you notice degrees of four to eight, five to ten, and so on, so the node degrees differ across those networks. Then the type of links: functional, meaning. Interestingly enough, notice that the path lengths in those networks are actually quite small, three to four.

Then they look at the clustering coefficient, and we're going to define in the next lecture how you measure that, but what they're doing is literally comparing the number of triangles in the network with the number of triangles there would be if the network were completely random. So imagine that you have a bunch of nodes, a bunch of words, and they're randomly connected, which means that when you generate sentences the words would follow each other randomly. And they found that the number of triangles is about a thousand times larger in those language networks than if there were meaningless connections between words.

And then the link distribution: this is exactly the power law, and gamma is the exponent. The probability of a node having a particular degree is proportional to one divided by the degree to the power gamma, so it is one over k to the power 2.2 or 2.4. So it is not Zipf's law, which is gamma equal to one; here gamma is 2.2. And, interestingly, they notice that in co-occurrence networks the hubs, and a hub is a node with a lot of connections, a high-degree node, are words with low semantic content, such as articles
and prepositions: low semantic content, but they're used in a lot of places, so they are the nodes of high degree, which are called hubs. In syntactic networks it's function words, and in semantic networks, obviously, it's polysemous words; those become hubs. Interestingly enough, when you build these types of networks there is the question of what happens if you remove some of those words from the network, let's kick them out of the language, and here, for example, are the results that causes.

So this is an example, in fact, of how you can use networks to study language. Ultimately, again, as I said before, the goal of this course is to give you instruments, tools and approaches to study language. We're going to teach you how to use those things, and you can apply them to language networks.

OK, so that's pretty much it for today. I thought I would actually have more time and would talk about some formulas; we're going to do that next time. To finish up, this is the book I would definitely recommend: Network Science by Albert-László Barabási. Google it; it's available for free online, and I also have it on my site. Then there are actually quite a few books that are collections of articles or papers on using networks to study language. There is a book on Structure Discovery in Natural Language, and there is Graph-Based Natural Language Processing and Information Retrieval. If you search long enough online you'll find those books; people put things online. So these are the books. Again, the first one, the Network Science book, is a must; that's definitely a must-read. The two other books are more for if you want to understand how this is used
for language studies. Ah, interesting. And I think this is it for today; I'll stop here. Any questions? OK, no questions. Let me stop the recording.
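As a quick check of the back-of-the-envelope small-world estimate from earlier in the lecture (6.7 billion people, 50 non-overlapping friends each, about 5.8 steps), here is a minimal sketch. It assumes the idealized model in which friend circles never overlap, so the number of people on layer d is simply the number of friends raised to the power d; the function name `steps_to_reach` is hypothetical.

```python
import math

def steps_to_reach(population, friends_per_person):
    """Solve friends_per_person ** d == population for d (no overlapping friends)."""
    return math.log(population) / math.log(friends_per_person)

# People on each layer when everyone has 10 non-overlapping friends.
layers = [10 ** step for step in range(1, 5)]
print(layers)

# The lecture's numbers: 6.7 billion people, 50 friends each.
estimate = steps_to_reach(6.7e9, 50)
print(round(estimate, 1))
```

Because real friend circles overlap heavily, this number is only a lower bound on what a real network with the same number of friends per person would need.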
Info
Channel: Leonid Zhukov
Views: 2,926
Rating: 5 out of 5
Id: vcDdv-EyTwg
Length: 63min 21sec (3801 seconds)
Published: Fri Sep 18 2020