Graphs, Vectors and Machine Learning - Computerphile

Video Statistics and Information

Captions
So today I want to talk to you about some machine learning algorithms that are really not much discussed. We talk a lot about how to work with text, for example: if I have some text and I want to predict maybe the next word, those are the large language models that we see a lot out there. Or maybe I'm interested in comparing images: say this is a dog (that's my very good impersonation of a dog) and this is a cat, and I want to predict whether a new image is a dog or a cat. We also have a lot of algorithms that work with tabular data, data in a table, where I want to find out whether a person is eligible for a loan. But what about doing the same thing with graphs?

So here are two graphs I'll show you: graph G1 and graph G2. G1 has six nodes; these coloured blobs are called nodes, and the connections between them are called edges. In this particular graph the nodes have colours, so they can be red or blue, and there are six nodes with six edges as well. Then we have a second graph which looks maybe similar: it also has six nodes, with the difference that it has three blues rather than two, and the connections are a bit different. They kind of look like molecule maps.

Are the colours important, the differences in colours? That's a great question, and it's exactly what I was going to discuss: why do we do these things? Molecules are the best example. Maybe you have something that looks like a molecule, say a carbon connected to an oxygen that's connected to a hydrogen, and it keeps going. I'm not a chemist, so I don't know how this is going to go, but I could imagine representing a carbon with a red blob and an oxygen with a blue one. For the chemists out there this is probably not a real molecule, but we don't care. You could absolutely model molecules as graphs, and you could model other things too: friendship networks, say, in which the colours represent people's preferences about a certain topic, or something that identifies them in some way.

So these are graphs in the true sense, with nodes and edges; we're not looking at plots here. Not plots, and that's the first important distinction we should make. But could you do plots like this as well, or is that different from what we're going to see today? Plots cannot be analysed with the same algorithms; what we're going to look at today is very specific to graphs in that sense of nodes and edges. A network is another way to refer to them, another name you may hear for the same thing. These could actually be parts of a cloud or something, couldn't they? They could; they could be knowledge graphs, which I know have been discussed on the channel before. They could be anything that you can represent with connections.

So there's a natural question here. Maybe I have a third graph, which I'm going to draw here, and I'm calling it, very creatively, G3. You can see it's a bit different from G1 and G2, and unlike the other ones it has seven nodes instead of six. Sean, maybe you can tell me reasons why G3 should be more similar to G1, and reasons why it should be more similar to G2. Obviously I'm looking at it upside down, but that probably shouldn't matter. It shouldn't, very good.
Looking at G1 and G2: you've got that kind of triangle at the top, from where I'm sitting anyway. But G3 feels more like G1 to me, because it's maybe just an extension, taking those two blues and making them three, and it's got those bits coming out of it, which G1 has too, just with an extra one. So maybe those parts are sort of similar. On the other hand, this part looks very similar to this part of G2. Oh yeah, the other way around of course; it's upside down, but maybe we don't care about that.

Of course these are toy examples, not realistic ones, but in real life these could be molecules again, coming back to that example. Maybe G1 belongs to a big group of molecules that are somehow good, they're drugs that solve your problem, they're active, and the other group is not good, inactive. You may want to come up with a new one and ask: should this belong to group one or group two? Before actually going and synthesising it, could I have models that tell me whether it looks more like this group or more like that one?

So today we're going to see two very, very simple models, which nonetheless are going to give us conflicting answers. And the reason they give conflicting answers is to highlight the fact that working with graphs is really hard, and we should never expect to have one final, only way of doing things. Because there are presumably all sorts of different ways of saying graphs are similar: how many edges, what the different colours are, how many of each; how do you measure it, I suppose, is the question, isn't it? You've touched on something that's quite profound. If you have two vectors, say one is (1, 2, 5, 7) and the other is (1, 2, 4, 7), and you want to know if they are the same, it's very easy: you just walk from the beginning of the vectors comparing the entries. If they're the same, keep going; if they're different, you know the vectors are different and you're done. That can be done in linear time, for those of you who like complexity of algorithms. Graphs are not that easy: the graph isomorphism problem is not known to be solvable in that quick way. So the point is that we can't even tell efficiently whether two graphs are the same, let alone measure the degree of similarity between them.

A degree of similarity is exactly what we're going to calculate, but it's not a unique answer, and maybe the one takeaway of this whole video is that when we do these similarity comparisons between graphs, we throw information away. What we throw away and what we keep is where the magic is. If you're applying this to some specific application, you may want to discard some information and keep some other information, and that's the important choice that you have to make.
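To make the rest of this concrete, here is a minimal sketch of how a small coloured graph like the ones in the video could be represented in code. The node colours and the edge count follow the description of G1 (two blue nodes, four red nodes, six edges), but the exact wiring of the edge list is a hypothetical stand-in, since the precise structure of the drawn graph isn't spelled out.

```python
# A coloured graph as two plain data structures:
#   colours: node id -> label ("red" or "blue")
#   edges:   list of undirected (u, v) pairs
# Colour counts match the video's G1 (2 blue, 4 red); the edge list
# itself is a made-up placeholder with the stated six edges.
G1_colours = {0: "blue", 1: "blue", 2: "red", 3: "red", 4: "red", 5: "red"}
G1_edges = [(0, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5)]  # hypothetical wiring

# Build an adjacency list so we can later ask "who are my neighbours?"
adjacency = {node: [] for node in G1_colours}
for u, v in G1_edges:
    adjacency[u].append(v)
    adjacency[v].append(u)

print(adjacency[2])  # neighbours of node 2
```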
We start with an algorithm called the vertex histogram. You probably remember histograms from when you studied statistics. If I look at G1, a histogram here simply means counting the number of occurrences of a given colour (a label, in this case). For G1 I have two blue ones and one, two, three, four red ones, so that's my histogram for the first graph. I won't draw the histograms as plots; I'm just going to write them as vectors, which is quicker. So I count the blues and the reds, and I call this VH, for vertex histogram. VH of G1 is a vector with two blues and four reds, (2, 4). And the vertex histogram of G2? Well, there are three blues and three reds, so that's (3, 3). You see what I'm doing: I'm creating a vector that sort of represents each graph. You're probably already thinking of the many things we're discarding by doing this, and we'll analyse that later.

At this point you could take these vectors, put them on a nice Cartesian plane, and run your favourite algorithm. This is a toy example, but imagine you had hundreds and hundreds of those points: you could use a nearest neighbour classifier, you could use anything you want. We're going to use something called kernel methods. We won't get into the details of kernel methods; we're just going to see one example of how they work. To use a kernel method I take the inner product between G3 and G1, and between G3 and G2, because I want to compare G3 with both of them; at this point I'm not comparing G1 with G2. If you haven't done inner products (dot products) before, it's just a sort of multiplication of vectors in which you multiply entry by entry and add the results together. So here I do two times four, which gives me eight, plus four times three, which is twelve; eight plus twelve gives me 20. If I do the same thing with G2 and G3, we don't have to write it down, we can just look: it's four times three, twelve, plus three times three, nine, which gives us 21.

Inner products are a good measure of similarity between two vectors, so for the purpose of this part of the algorithm we can conclude that G3 is more similar to G2 than to G1, because 21 is more than 20. You're probably thinking many things at this point, like "I can come up with so many examples of things that shouldn't be similar but would give me a high number here", and that's absolutely true. If one vector just has huge numbers and the other has small numbers, this product is going to be high, maybe even higher than comparing a vector with itself, which should be the utmost similarity. Nonetheless, inner products are an accepted measure of similarity, especially because we're going to use algorithms such as SVMs, support vector machines, in which inner products play a very central role in creating the boundaries between the classes. And if you're really bothered by the fact that these numbers can get very high, or that outliers could give you more similarity than you'd like, you can always normalise your vectors. It depends on your application: maybe normalising makes sense, maybe it doesn't; again, it really depends on the particular problem you're working on.

Okay Sean, there's a central piece missing in whatever I've done here. It's the edges, yeah? Exactly: what the nodes are connected to, not just how many there are but where they are, and all of that sort of stuff. Actually, we're not even counting how many edges there are; at this point we are completely ignoring them, not taking edges into account at all.
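Here is a short sketch of the vertex histogram kernel just described. The colour counts for G1 and G2, (2, 4) and (3, 3), are the ones stated in the video; G3's counts aren't stated outright, but (4, 3) is what the arithmetic 2·4 + 4·3 = 20 and 3·4 + 3·3 = 21 implies, so the dot products below reproduce the 20 and 21 above.

```python
from collections import Counter

def vertex_histogram(colours, order=("blue", "red")):
    """Count how many nodes of each colour a graph has."""
    counts = Counter(colours)
    return [counts[c] for c in order]

def dot(u, v):
    """Inner product: multiply entry by entry and add up."""
    return sum(a * b for a, b in zip(u, v))

# Node colour lists matching the three graphs:
# 2 blue + 4 red, 3 blue + 3 red, 4 blue + 3 red.
g1 = ["blue"] * 2 + ["red"] * 4
g2 = ["blue"] * 3 + ["red"] * 3
g3 = ["blue"] * 4 + ["red"] * 3

vh1, vh2, vh3 = map(vertex_histogram, (g1, g2, g3))
print(vh1, vh2, vh3)   # [2, 4] [3, 3] [4, 3]
print(dot(vh1, vh3))   # 2*4 + 4*3 = 20
print(dot(vh2, vh3))   # 3*4 + 3*3 = 21
```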
Nonetheless, this is a very simple example that I like doing because it gives us a flavour of what's going on. There is one thing that may be bothering you: does it mean that if you put this into a support vector machine, the fact that 21 is higher than 20 would automatically mean G3 gets classified into the same group as G2? Not necessarily. Support vector machines work with a lot of different parameters, and there are bias terms and all sorts of other factors that get taken into account. But the measure of similarity still works the same way: the higher the number, the more similar. It's sort of the opposite of a distance, where the higher the number, the more different.

Now we're going to bring edges into the whole thing and see what happens. This is the Weisfeiler-Lehman algorithm; in this case we're doing the graph kernel version of it, but it shows up in other parts of graph theory too. The basis of what we're doing now is the following. Look at this red node and this other red node: they are similar because they're both red, but they're also similar in the sense that they each have only one neighbour, and it happens to be blue. So they're not only the same colour, their neighbours have the same colour. Unlike, for example, this one and this one: the top one here is red, but its neighbours are a red and a blue, whereas this one is red and its neighbours are blue. For the purposes of this algorithm we're now going to say that the first two nodes are the same, but these two are not the same any more, because their neighbourhoods are different. We're going to analyse all the patterns that we see in this particular graph. I'm now writing WL, and I'm putting a number 1 on it. You may guess why the 1 is there; we'll discuss it properly later. Is that for one neighbour? Not quite one neighbour; I'll give you a spoiler: it's because I'm only looking at immediate neighbours, so neighbours at distance one. Distance one, exactly. So you can guess that WL2 is somehow going to look at neighbours of neighbours.

The first pattern I'm going to analyse is the one we just discussed, so it's "red, blue". I'm adding a comma to clarify that the first colour is the actual node's label, and what comes after the comma is the colour of its neighbours, because there could be more than one. We're going to do a histogram again, but with these new patterns. How many nodes in the first graph have that behaviour? This one is a red with a blue neighbour, and this one is also a red with a blue neighbour, but the other two reds are not the same, so there are two. Okay, so we tick off two reds with a blue neighbour. In the second graph it seems there's only one: this one is a red with a blue neighbour, and the other reds have reds as neighbours, so they're not the same pattern. So we have a one there. Now for G3 I actually have three, because this one, this one and this one are reds with blue neighbours. It's going to be a long vector, longer than just two entries, because now all the intricacies are being taken into account. So as not to get lost, because this can get confusing, I'm going to tick off all the nodes I've already processed. If you actually want to implement this, there are smarter ways of doing it: you'd use hash functions and all sorts of things. So let's get this done and then see what's happening.
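Here's a minimal sketch of that one-hop relabelling step. The graph below is a small hypothetical example, not the actual G1 from the video, but the idea is the same: each node's pattern is its own colour together with the sorted colours of its immediate neighbours, and the WL-1 feature vector is just a histogram of those patterns.

```python
from collections import Counter

def wl1_patterns(colours, edges):
    """One Weisfeiler-Lehman iteration: label each node by
    (its own colour, sorted colours of its immediate neighbours),
    then count how often each pattern occurs."""
    neighbours = {n: [] for n in colours}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    patterns = [
        (colours[n], tuple(sorted(colours[m] for m in neighbours[n])))
        for n in colours
    ]
    return Counter(patterns)

# Hypothetical toy graph: one blue node connected to three red nodes (a star).
colours = {0: "red", 1: "blue", 2: "red", 3: "red"}
edges = [(0, 1), (1, 2), (1, 3)]

print(wl1_patterns(colours, edges))
# Counter({('red', ('blue',)): 3,                # three reds whose only neighbour is blue
#          ('blue', ('red', 'red', 'red')): 1})  # the blue hub
```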
At this point we've ticked all the nodes in the first graph, so we know that the rest of this feature vector will just be a bunch of zeros in the first row, because there are no more patterns to pick up. You'll also notice that two plus two plus two equals six, as it should, because each node contributes exactly one tick here; the second row will also add up to six, and the last one will add up to seven, because G3 has seven nodes.

Right, we seem to have some feature vectors going on. You can see there are a lot more zeros now, because there are many patterns that we only find in one or two of the graphs. If you've done machine learning, you probably realise that something that potentially starts to happen here is called overfitting: as we capture more and more detail in the sort of patterns we count, it could be that we become too focused on the very fine details of our training data and fail to generalise, so when we see something new, everything looks different. How much of this fine-graining you do depends, again, on the application and on what you want to achieve. You might think, for example, that the vertex histogram would never be used, because it disregards all the edges, so why bother? Nonetheless, it can serve as a baseline for your model.

Let's go back and finalise our calculation, taking the inner (dot) products between these vectors. What happens if I take WL1 of G1 with WL1 of G3? I'm not going to rewrite the whole vectors; we can just look here and see what happens. The first product is going to be six, but it so happens that every other product is zero: two times zero is zero, then one times zero, and zero, and zero, and zero. So the inner product, the similarity using this algorithm in this case, is six.
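Because these WL feature vectors are mostly zeros (and can in principle keep growing as you add more patterns), a practical sketch is to keep them as pattern-to-count dictionaries and take the dot product only over the patterns the two graphs share. The "red with a blue neighbour" counts of 2 and 3 below come from the video; the remaining entries are hypothetical fill-ins, chosen only so the totals come to six and seven nodes as discussed.

```python
def sparse_dot(counts_a, counts_b):
    """Dot product of two histogram dictionaries: only patterns that
    appear in both graphs contribute; everything else multiplies by zero."""
    return sum(counts_a[p] * counts_b[p] for p in counts_a.keys() & counts_b.keys())

# WL-1 histograms keyed by (node colour, neighbour colours) patterns.
wl_g1 = {("red", ("blue",)): 2, ("red", ("blue", "red")): 2, ("blue", ("red", "red")): 2}
wl_g3 = {("red", ("blue",)): 3, ("blue", ("red",)): 2, ("blue", ("blue", "red")): 2}

print(sparse_dot(wl_g1, wl_g3))  # only the shared pattern contributes: 2*3 = 6
```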
I should mention at this point that many implementations don't use just this vector on its own, but concatenate the first one, the vertex histogram, at the beginning. So I would have something like 2, 4, followed by the rest, and in that case I wouldn't get a six, I'd get 20 plus 6. But for this example let's just stick with the second half. Finally I can take WL1 of G2 and compare it with WL1 of G3: I get 1 times 3, which is 3, then a one against a zero, which is zero again, and there's a one here and another one here, so I get three plus one plus one, which is five. Again, this is a measure of similarity with a lot of caveats; it's not the final say on anything, but it's an example of how, in this case, G3 now seems more similar to G1, whereas before it seemed more similar to G2. Before it was 21 versus 20, and now we have a five versus a six, with the six for the first graph. But obviously there's only one in it for both those examples, isn't there? I mean, they're all very similar. They are; it's just one, a very tiny difference, and again it doesn't mean your support vector machine is necessarily going to decide that way. We haven't discussed exactly how these numbers go into your algorithm. For those of you who know the kernel trick, it has to do with that: the idea that you don't necessarily need to know the feature vectors in order to make your prediction. In real life you wouldn't have only three graphs, you'd have many, many more; you'd train on many of them and test on some others, so this is just a toy example. If you actually implemented this with SVMs, I do think the first one would be classified as closer to G2, but again there are bias terms involved, and the raw number isn't necessarily predictive of what's going to happen; it's just to get a feel for how these things work.

The kernel matrix that comes out of this for WL1, which we're not going to go into in detail, is going to look something like this: if I compare graph 2 with graph 3, I get a five, so I also get a five in the mirrored entry, and I get a six if I compare graph 3 with graph 1, and I can fill in the other entries with other numbers. The magic is that in support vector machines, for example, you can just put this matrix into the algorithm and get predictions, so you don't necessarily need to know the feature vectors themselves. There are many graph kernel algorithms that never write them out, because those vectors may be infinitely long, but you can still get a number, and this number is what you're going to use.
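Once you have the kernel numbers, you can hand them to a support vector machine as a precomputed kernel matrix, without ever materialising the feature vectors. Below is a minimal sketch using scikit-learn's SVC with kernel="precomputed". It uses the vertex histogram vectors from the video, but the active/inactive labels are made up, and training on only two graphs is purely illustrative plumbing, not a meaningful classifier; in practice you'd have many training graphs and would typically normalise the kernel as mentioned above.

```python
import numpy as np
from sklearn.svm import SVC

# Vertex-histogram vectors from the video: G1=(2,4), G2=(3,3), G3=(4,3).
vh = {"G1": np.array([2, 4]), "G2": np.array([3, 3]), "G3": np.array([4, 3])}

# Pretend G1 and G2 are training graphs with known classes
# (0 = inactive, 1 = active; made-up labels), and G3 is the query.
train, labels = ["G1", "G2"], [0, 1]

# Kernel matrix of the training graphs: pairwise inner products.
K_train = np.array([[vh[a] @ vh[b] for b in train] for a in train])

# Similarities of the query graph to each training graph: [[20, 21]].
K_query = np.array([[vh["G3"] @ vh[b] for b in train]])

clf = SVC(kernel="precomputed")
clf.fit(K_train, labels)      # the SVM only ever sees the kernel numbers
print(clf.predict(K_query))   # predicted class for G3 (toy-sized, so indicative only)
```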
There's one last thing you may be wondering at this point. You might say: okay, look at these two red nodes again. They're both red, so they're similar; they both have a blue neighbour, so they're even more similar; but more than that, their blue neighbours each happen to have one red and one blue neighbour. So they're also similar if I go up to distance two, if you like: the neighbours of the neighbours happen to be the same. Unlike these two, which are both red and both have a blue neighbour, but the neighbours of their blue neighbours are different, because this one has a blue neighbour with a bunch of neighbours and this one has a blue neighbour with just two. So we can keep going, and we could keep going forever: I could just create WL2, in which case WL2 would have patterns that look a bit different. And at this point you may be wondering: am I going to get lost? Where do I stop? Neighbours of neighbours of neighbours... how do I even write it down?

I pre-drew this, because it can get really complicated. Each pattern that we saw before gets a new colour. So if a node is a red with a blue neighbour, it's now going to be, say, pink, and this one is going to be pink as well, and so on and so forth. Then if you want WL2, you look at patterns such as "pink, light green", which matches this node and this one, but not this one, because that's a pink with an orange neighbour. We do this so the whole thing can be computed efficiently.

One more thing: if, instead of the numbers you get from the kernels, you use the feature vectors as we had them before, like the (2, 4), you could run an algorithm on the vectors themselves and use some model to explain which feature was most important, and then bring that out and say, ah, this particular graph was deemed similar to G1, or to whatever class, because of that pattern. So you could try to explain decisions as well using this, but that's another topic.
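The pink/light-green/orange recolouring described above is essentially label compression: after each iteration, every (label, sorted neighbour labels) pattern is replaced by a fresh short label, so the next iteration can look one hop further without the patterns becoming unwieldy. Here is a rough sketch of that loop for a chosen number of iterations; the toy graph is again a hypothetical example, not one of the graphs in the video.

```python
from collections import Counter

def wl_histograms(colours, edges, iterations=2):
    """Weisfeiler-Lehman relabelling for a few iterations.
    Each round, a node's new label is a compressed code for
    (current label, sorted labels of its neighbours); the returned
    list holds one pattern histogram per iteration (WL-1, WL-2, ...)."""
    neighbours = {n: [] for n in colours}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    labels = dict(colours)
    histograms = []
    compression = {}                      # pattern -> short label (the "new colour")
    for _ in range(iterations):
        patterns = {
            n: (labels[n], tuple(sorted(labels[m] for m in neighbours[n])))
            for n in labels
        }
        histograms.append(Counter(patterns.values()))
        for p in patterns.values():       # assign a fresh label to each unseen pattern
            compression.setdefault(p, f"c{len(compression)}")
        labels = {n: compression[patterns[n]] for n in labels}
    return histograms

# Hypothetical toy graph (same star as before: one blue hub, three red leaves).
colours = {0: "red", 1: "blue", 2: "red", 3: "red"}
edges = [(0, 1), (1, 2), (1, 3)]
for i, hist in enumerate(wl_histograms(colours, edges), start=1):
    print(f"WL-{i}:", dict(hist))
```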
Info
Channel: Computerphile
Views: 94,206
Keywords: computers, computerphile, computer, science
Id: w0rOvNJW58o
Length: 23min 7sec (1387 seconds)
Published: Mon Aug 07 2023