Social Network Analysis - From Graph Theory to Applications - Dima Goldenberg - PyCon Israel 2019

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
[Music] hi everybody they don't mentioned my name is Dima Goldenberg I'm a data scientist in Tinley that booking.com but today I'm going to talk about social network analysis and cover topics from graph theory to applications with spite on basic I'm going to talk about the time that I spent in university working with this data ok great so that's me Dima so I just recently found out that on one of my works in high school the term data science was written so I can say that I did data science since high school and as I mentioned I did my master thesis on social network so that's where I'm bringing all the knowledge I'm going to talk now and I really believe in networking via communities so PI data is amazing community and it's great to have you all here and currently since 2017 I joined the new office of booking.com in tel-aviv which focused on machine learning so a bit of what about what I do now so booking.com is a quite an old company in terms of internet because it's established in 96 and it has around 17,000 employees but here in tel-aviv we have we are for the last year and a half so we have I guess about this amount of people 40 people here and we focus mostly on machine learning and different problems of welcome endows systems for than anything else and what I'm going to talk with you now as I mentioned is trying to cover all this path of discovering network analysis so let's start with network theory so you all might know already or seen graphs graphs are constructed from nodes and edges ok basically some components that connected between them in different links and we're trying to represent this in this mathematical way first of all nodes nodes might have different properties so it could be a person and has an A in have a name it could be a country and then it's going to have other properties like number of people maybe and it also has a degree so degree is actually something that the node gets from the edges and it's amount of edges amount of neighbors do not know this so let's talk about edges edges could be simple connection so be symmetric connections between two people for example so if I'm your friend and you're my friend but I also could be that I'm your friend but you don't know who I am it happens sometimes so that's about the direction it'll also can have a wait so besides saying how do we have a connection or not it also can say how strong the connection is so once we have these properties we can do a lot of different analysis on this data set so there are different real networks that we can try to convert to this data social networks obviously I guess it's very easy for you all you to imagine this now take Facebook LinkedIn or any other types of connections of people following other people but also virtual connections okay maybe trades between countries and any other data that we can represent in this way or even physical things like electricity networks traffic networks that actually represent some dynamics over different entities so the interesting part of real networks that they not just do random stuff okay they have some very common properties so one interesting property is a small world phenomena or you might have this about this as a syncs handshake six handshakes theory but basically everybody's kind of connected in the distance the social distance between each of us I guess it's even less than two or three okay we all know somebody who knows somebody who knows you I guess and another interesting thing is that in social networks we tend to see a lot of very central nodes with lots of connections we can say it like Instagram celebrities for example but we also have a really big amount of people with really small amount of connections and basically we have this long-term distribution something that we won't see if just generalize a network randomly so for example if you look at the degree distribution as I said number of neighbors of each node if I'll just create a random network I will just have this normal distribution okay many nodes have the same amount of degree in in the edges but if we look on real networks we have a small amount of really big influencers and then the rest are distributed with power low longtail so that helps us to understand that there are some central nodes that might have more interesting properties and when we trying to understand what is central we always lower already started to talk about degrees okay so number of connections so if we're talking about the most central node with with lots of connection we'll just look at the highest degree but if we're talking about other centrality properties for example a spade ring that could be that we can see here it's not only how central I am it's also how central are my connections and everybody who connected to me or maybe we can talk about between a centrality okay the nodes that connects between different components of the network and if we remove it we'll just create this connect basically this network so it centrality comes from different perspective and when we asking the question who is the most central in this network there's no clear answer because it really depends on the applications that we're looking for great so when we cover this network theory let's start to see how we can build this with our data okay so we'll try to do this kind of flow we'll take some raw data convert it into pandas data set that you I guess all supposed to know quite well and then try to create a network using network X package and the thing that we're going to try to do explain is how this happen how Israel won Eurovision last year I didn't talk the example of this year from obvious reasons yeah so let's start with the data set I just Google Eurovision roads and just the set of data set are found I found an excel file and loaded it into pandas it represents countries and the number of points it received from any other country in general in booking for real of countries data so this data set felt really native for me and we need to convert this into the data that we can fit to the network so what I want to do is to create an edge list okay list of connections between one country to another country so all I do is just melt transformation on this data and I create country ok receiving country Israel source country Albania in number of points that'll give Albania gave to Israel cool and then I just input this to my network expeced as I defined source country a source input gonna get sauce country target will get tangled country the weight that I'm going to give to my network is the number of points as obviously it has some information there and also know that it's going to be directed network the connections are not symmetric in this case amazing I already have a network so we have to do is just to visualize and it's quite simple I do just network x-drone networks G and that's what I get okay very informative we clearly understand it is only won this competition so yeah we'll try to in hey sit in hey sit somehow so the first thing I'll try to explain basically it works quite well everywhere will you try to tackle data problems it's trying to divide it and conquer again so I'm going to define all the parts separately I'm going to define position or pharmo nodes style and size separately and nicely we already have some properties of our nodes because remember these are countries and not just some random data so for countries I can just give them their physical location okay the longitude and latitude and convert them to this graph so just assign them X&Y properties moreover I also have a way to visualize countries because I can just assign to each country the flag of the country and also take the dominant color of the country to represent the edge that are exiting from this country and then the last part I want to set size to the countries so we already have some weight property they're talking about the points so the size of the country is going to be the number of points it received and the width of the weight of the edges going to be the amount of points in the connection amazing so I have all this it's still not enough to divide and conquer the next step is going to be to plot everything in part so just gonna run iterate edge by the edge and just give each edge its property and I'm gonna do the same with the countries don't worry this is just snippets I'll show you the code later so I'll just put all the countries give them the position give them the flag and any other properties and that's the result I get okay so first of all I had to move up steadily quite inside Europe because I discovered that Australia participate in Eurovision yeah so you see that Israel first of all is quite central in terms of receiving a lot of points and we also compete with Cyprus and we can also understand a bit of like communities already that are related to the geographical data and at least we can understand a bit better what happens here but already give you some spoiler here every time it's trying to visualize Network data you need to do some kind of trade-off between how much data do you want to put in your graph and how much information you actually want to get out of this picture because the more we put the less we can understand and another thing if we don't really know the positions of the graph so let's say I didn't know that countries has geographical position we can still use some kind of algorithms on graph to do some interesting stuff so for example I just can set this layout okay some is basically default layout that we have in network X that's going to put closer edges closer notes together and what we can see here is actually we can see some communities of countries that wrote to each other for example Cyprus Greece and Macedonia that's always bow to each other and kind of understand the geopolitical relationship between them cool so understood how to visualize the network now let's see what we can do with it with the data else so let's talk about information information something that flows a lot in the network especially social networks cuz we spread a lot of ideas rumors and everything else and it also will have to describe it as viral process because it just plays like reality okay if I just given this talk to all of you suddenly you all know this but I'm not forgetting it so we just multiply the number of information we have and it's going to spread all around the all around the graph and when it spreads we can basically define two very basic very simple models of how information can spread so one really common idea to define information spread is the threshold models so when I'm receiving some information when I'm hearing about some really cool TV show and I hear it from multiple people suddenly I have a threshold that says okay fine I have to watch the show now and I'm going and watching the show after lots of friends recommended it but I also can have an independent cascade model in which I have some ability to get infected to get this information from my friends just because we connected so one friend told me and he tried to infect me with the show didn't work somebody else and they do this independently until maybe I'll be infected with information maybe not okay so let's talk about the really cool show that I watched lately a game of thrones yeah but what we're going to talk is about the previous seasons because there were a bit more interesting and we're going to try to build a network out of the data of Game of Thrones which is based on the books the good stuff so how we even going to construct the network because we obviously can understand the relationships between different characters in the show but we need to extract it somewhere somehow and I'm not gonna label the whole show when I watch the videos so what we can do is actually take the books okay and create the relationship define them by coherence of two chapters so if I think if they appeal in the same sentence with this appear within 15 volts distance between each other we just say these two are connected and the number of coherences also they defines the weight of the connection then we already can get it into the edge list structure the same one that we just looked before and then we can convert it actually into network object and try to visualize it and try to run any other analysis on that so let's do this that's the network that I built on think top 60 something chapters in the in the books so we clearly can see central characters as Jon Snow or Tyrion in the middle and we can say see some other characters around but I decided to take this example because spreading rumors in this show is something that could be very interesting and I'm not talking only about spoilers but I'm also talking about other information so let's use the independent cascade model okay as I just said we have some chance to infect our neighbors if we got infected so this chance we have wait we have number of corpulence is between the character soil just made it make it proportional to the weight of the chance to the chance to be infected and let's see what's going to happen if I'm gonna try to split the woman or if I'm gonna try to hide a secret so let's say John knows nothing but then suddenly these two guys came to him and told him a secret so suddenly know something and we can see we can try to see what can happen with this secret so I'm gonna initialize my simulation all I'm gonna do is just say hey okay this blend and Sam knew the information and minus one and now John just discovered it at zero and then let's try to propagate this information let's try to run whew infection rounds and see where it goes with this okay John just discover it and then the next point quite more people a bunch of people discover it more people discover it and they'll some dead characters discover it and more people discover it and eventually we get to the point of everybody we could let this information know discover it we end up with something like that yeah and the information is quite far oh okay it can spread anyhow and when it's time that these models can help us really turns in where it goes but let's try to control it okay so instead of just letting the information spread like wildfire we can actually try to control it and give it some direction and that's the application we can talk about and that's actually what my research was about let's say think about it from the point of perspective or Marketo trying to spread my product my marketing message and trying to give it as many people as possible and your low today that influence marketing is already happens in East Instagram and we can spread rumors let's try to see who we can affect in the network with some limited the constraints so for example I have a budget to give something away to spread some rumors and see where it's gonna go so I'm gonna take the same network I'm gonna pick a few cheek characters and charters and in the network trying to spread my information that's called seeding and I probably want to pick the most central people in the network so remember we talked what is the most central we can have the most central in terms of I'm very connected we can have the Muslim trial because I'm connected to very central other people and it could be the most central because I'm connecting people from different parts of the world okay so we can look at this relative measures in the network so in our network we can see the Tyrian for example is super connected to many people but if I'm trying to understand between a centrality of the of the people Robert Baratheon and Brienne suddenly connect really different areas of the graph and now I'm going to just use these three sitting started just picking the top people from this list from each of the list and try to affect the network to see where it goes so just to remind you it's a random process so I'm gonna run this at the simulation numerous times and just see the average result of this sitting so when I try to run it that's basically what happens within one infection so if I just pick one random person and not random one the most central person in the in the graph I can achieve almost 50% of infection of the graph and then the marginal effect of each additional person that I add to my sitting set is quite monotonic Kyle okay we can see some advantage to between a centrality maybe but eventually picking the top to the top three most central people not given me this extra effort so I tried to brute force this dissipation just to pick the best couple I can find and the best couple got somewhere there okay 58 so bit more better than the best two in any other method but it took me in this small dataset of 63 node 41 minutes to calculate and it's really a unintuitive okay if I would need to pick myself I would would never pick calvo go and what about you I don't know the connection between them is maybe they both are kings but they helped me to spread the information across fifty-eight percent of the network and the point my point here is that finding the best City set settings that is very hard it's literally np-hard okay and yeah this is the same guy from Game of Thrones were from different show and that's opens a lot of opportunities trying to hack this approach just trying to kind develop new algorithms to find the estimations find something that's closer to the solution and there are actually lots of academic walls but also lots of actual industrial works trying to find the best set of influencers trying to split it as much as possible and we can talk about different real use cases both for influence maximization but for any other networks application so obviously influence maximization as I just presented it's great to spread my ideas to spread my political views to spread anything but I can also look at the network without knowing any anything else and just discover anomalies trying to find patterns in the data that are unusual to the regular way and I can also use it for recommender systems so we can see that people who are strongly connected share similar opinions so maybe I can recommend to strongly connected people strongly connected the products ok so to recap all of this we can start with some simple theory from networks networks tier is just representing our real life in nodes and edges ok mathematical representation try to convert it into networks in Python okay so we have Network X we can get stronger applications for a big data as well we can take the information flow models so I just presented two very simple models but we can also try to model something else like for getting information and add them to our applications and eventually we can run a lot of simulations and optimizations to find the best solution in our graph to spread my ideas so if you like this talk you might also like because there are kind of connected the these to talk so Dean is going to talk about variety and disease modeling tomorrow and duel is going to give a workshop about actually working with network X so if you're interested you can get deep dive into that and yeah it's basically based on recommendation system of network connections I've seen that we're very close in github or just sitting somewhere around here yeah and also let's stay connected so yeah add your meat to your network you can reach out for more information you can definitely check the codes examples that I used to create this graph so this bitly is actually going to get LOM and you can check out the booking guy blog to see about any machine learning worker that we do or just reach out to me if you enjoy said in work at booking or any research that we do cool so thank you and I'll take [Applause]
Info
Channel: PyCon Israel
Views: 13,765
Rating: 4.9305553 out of 5
Keywords:
Id: px7ff2_Jeqw
Channel Id: undefined
Length: 20min 38sec (1238 seconds)
Published: Mon Jun 24 2019
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.