Converting a Tabular Dataset to a Graph Dataset for GNNs

Captions
Hello and welcome to this quick tutorial on how to convert a tabular dataset to a graph dataset. I think this is the most frequently asked question, and because of that I thought it makes sense to build a simple notebook in which I explain how to convert a simple CSV file into a graph dataset in order to apply GNNs. This first video covers a regular graph dataset, and the second video I'll upload is all about temporal graph datasets, which are a bit more complex. Besides that, I also collected some more frequently asked questions, for example: what if you have pairs of graphs, what if you want to use images as nodes, and things like that. These are also questions that many people have asked.

With that, let's begin with the first section: how do you convert a regular CSV file into a graph dataset? I've grouped this notebook into seven steps until you arrive at a graph dataset. Step zero is to bring some creativity and not lose hope. It can be very challenging to convert a tabular dataset into a graph dataset, but just keep trying and also look at the examples in this notebook. I am sure you will figure out how to do it.

Step one is then to identify the nodes, the edges, the node features, the labels, and optionally the edge weights or edge features for your dataset. If you have a CSV file with some items, usually these items represent the nodes. This can be people, locations, cars, whatever, and the difficult part is usually to figure out how those are connected. Then you have node features, which are simply the attributes that describe these items. And then you have labels, and there can be three types of labels: either you have labels for the nodes, or you have labels on the edges, which means you do link prediction, or you have a label for the whole graph. So before starting with the implementation, I recommend first thinking about what your nodes are, what your edges are, and all of these attributes. This will give you a better feeling for what your graph will look like
in the end.

An important design decision is whether you have a homogeneous graph or a heterogeneous graph. A homogeneous graph has the same node and edge types throughout. For example, you have cars as nodes and then connections between these cars. In heterogeneous graphs you have different node types, for example users and items. Those different node types typically have different node feature vectors, and because of that they don't fit into one joint matrix. Instead you need to separate them, and this can be done in heterogeneous graphs. It's the same for edges: if you have different edge types, you also need to go with the second option here.

So I will just roughly go over two examples I picked out from the internet, and those are just random datasets I found. Note that I try to keep this video short, and because of that I will not go into detail on all of the code parts, but I've written a lot of comments and text to explain everything.

This first dataset is the FIFA 21 rating dataset. It's a dataset that consists of players, their skills, and their teams. I don't know if this dataset is suitable for a graph dataset, but I just thought I'd select any dataset I find and try to convert it to a graph. As mentioned previously, the first step is always to identify these properties. So what are the nodes here? Those are simply football players, and those are identified with a unique ID. Regarding the edges, we have to come up with a way to connect the players, and I discuss a couple of possibilities down here. The node features are simply some attributes about each of the players, for example ball control and dribbling. And then there's one column called the overall rating, and the task could, for example, be a node-level regression task: predict the rating for each of the players given some team assignments. The way I will connect the players here is simply by their team assignments. So if two players
play on the same team, there will be an edge between these players. This will eventually lead to one large graph, and because of that we only have one single graph. In other datasets it might be possible that you have multiple graphs or even temporal graphs that change over time. I've also discussed some options here for what you do in case you have multiple graphs.

The next step is to extract the node features, and that's usually a very simple step. For each of the items or entities in your dataset, you simply extract the features you have. Here we use things like ball control and dribbling, so some attributes that describe the entities in your dataset. Typically you have to apply some one-hot encodings here if you have non-numerical features, and this is also something I did here. One important point is that all your matrices in the end, so you have this node feature matrix now, need to be converted to NumPy in most of the libraries. You end up with a shape of number of nodes, so those are the football players, times number of features. Here you have 18 features.

The next step is to extract the labels. Here we have a node-level prediction problem because we want to predict the rating for each of these nodes, so you have as many labels as you have nodes. Because of that, our final shape is the same as here, but with just one attribute per node, and this is the overall rating. Of course you can also normalize this here, but this is generally how it's done. Again, we convert it to NumPy, and that's our second matrix.

Now this is probably the most difficult part: you need to connect the nodes somehow. Either you have some relational information in your dataset, or you have to come up with some clever way to connect the nodes. As I said before, here I'll just connect them according to their team assignments, but I also mention here that this is not the most sophisticated way to do it, and I wouldn't use a GNN with this graph because it doesn't
make a lot of sense. One important point is that you should always start your IDs from zero, because later, when you build the edge index based on these IDs, the ID with the number zero will refer to the first item in your node feature matrix. So here I simply build all of the permutations inside of the teams so that each player inside of one team is connected with all other players. The edge index is typically stored in what is called COO format, a representation of the adjacency information with a shape of two times the number of edges. The indices here correspond to the IDs in your node feature matrix, so node zero is connected to node seven, and because of that it's important to remap your IDs so that they start from zero, which makes it easier to build edges.

After that step you have basically everything you need, and I have some samples here of how you can build a data object in PyTorch Geometric and put it into a data loader. We have x, which is the node feature matrix of the football players, we have the edge index, which is based on the indices of these football players, and we have the labels y. With that we can build a graph in PyTorch Geometric and use it to apply GNNs. That was the first example in this notebook, a node-level regression problem on one single graph.

The second one is a heterogeneous dataset, and here I simply selected the Anime Recommendations Database. That's a simple recommender system database: you have some IDs for movies, you have some users, and then you have this rating matrix that tells you this user rated that movie with this score. Again, just like before, we first need to identify the properties of this graph. We have nodes, and because it's heterogeneous those are now users and items, so we have two node types. Then we have edges, which simply indicate whether a user has rated the movie and also what the rating is. Then we have node features, and here we have different features for the users
and movies, so that means we have two different node feature matrices. And then we have labels, which here come from the link prediction problem: we have the ratings between a user-item pair. Again we have one single graph here, and if you have multiple graphs, have a look at section 1.1, where I describe how to proceed with that.

So step 4: extract the node features. We first do it for the anime movies, and just like before we also re-index the movie IDs so that they start from zero, which makes it easier to build edges later. Here we just use some properties we have in the dataset. We one-hot encode the movie type and some other things, and with that we have a node feature matrix. Just like before we convert it to NumPy, and our node feature matrix has a shape of 12,000, meaning we have 12,000 movies, times 48, which means we have 48 features. I just realized I forgot to one-hot encode the type, but I'll just ignore it.

Then we do the same thing for the users, and here the problem is we don't have a user matrix. If you don't have node features, there are different ways to handle it: either you insert some dummy vectors like randomly sampled values, or you calculate some statistics about the users based on this rating matrix, or, if you have no information at all, you can also just use the degree or the neighborhood information or even a node2vec embedding to represent the nodes in your graph. In this example we use some statistics and calculate the mean rating for each user and how many times the user rated, and this allows us to assign some properties to each of the users. We convert it again, and our second node feature matrix now has a shape of 70,000, that's how many users we have, times two.

After that we extract the labels, and those are simply the ratings. Here we have a link prediction task, which means we have the labels on the edges. I plotted some distributions of the labels. Minus one means that the user has watched the movie but didn't rate it, and this
is the distribution of the ratings. Usually I would normalize this to improve the learning later. In this dataset, as I said, I selected a random dataset from the internet, and here there were also some caveats. In the rating matrix there were some movies for which I had no information in the movie database, so I dropped those. Another thing is that you have those movie IDs but far fewer movies in the rating matrix, and that means we only have labels for some of the pairs. If you think about a matrix, you want to do this matrix completion in a recommender system (I also have a video on that), and the thing is that we only have information for some of the cells. That's why we need to store, in addition, an index that tells us for which edges we have information, and that turns out to be exactly the edge index.

So first, of course, here we have the labels, and that's how many ratings we have. The edge index now has the same number of entries and tells us which indices, so which user-item pairs, have ratings. In order to do that, and that's the important part, your IDs need to start with zero. If your IDs start at one, ID one corresponds to index zero in your node feature matrix, and because of that it's important to do that remapping so that it looks like this. That means user 0 is connected to movie 10, and that corresponds to the actual node features in your node feature matrices. In the end you have this edge index with a shape of 2 times the number of edges, and that's exactly how many labels we have.

Also note that we have unidirectional edges here. That means we only have an edge from the user to this movie but not from this movie to the user. So if you want to have bidirectional edges, you of course also need to add the other direction here.

For heterogeneous datasets there is a special object type in PyTorch Geometric called HeteroData, and this data type can hold multiple node feature matrices, for example one for user and one for movie, and also multiple edge types. Here we have just a single edge
type, called rating, between user and movie. If you want to learn more about these heterogeneous graphs, there's a great tutorial in PyTorch Geometric. Also, if you specifically want to build a recommender system, there's also a tutorial on bipartite graphs.

Now that's all. As I mentioned, please read through the notebook, because I can't capture everything in this short video. This was the first part, on building heterogeneous or homogeneous graph datasets, and the second part is an example of building a temporal graph dataset, which is also a bit more challenging. Thanks for watching. I hope you find this helpful. Let me know if you have any questions or comments, and with that I wish you a great day.
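Editor's sketch of the first example (players connected by team). This is not the code from the notebook. It is a minimal illustration of the steps described in the captions: remap IDs to start at zero, build the node feature and label matrices, and connect all players within a team to get an edge index in COO format. The toy rows, the two feature columns, and all variable names are assumptions made up for illustration, and plain Python lists stand in for NumPy arrays.

```python
from itertools import permutations

# Hypothetical toy rows (NOT from the real FIFA dataset):
# (player_id, team, ball_control, dribbling, overall_rating)
rows = [
    (1001, "Team A", 90, 88, 91),
    (1002, "Team B", 75, 70, 78),
    (1005, "Team A", 82, 85, 84),
    (1010, "Team B", 68, 72, 74),
    (1020, "Team A", 60, 55, 66),
]

# Remap the arbitrary player ids to indices 0..N-1 so that index i
# refers to row i of the node feature matrix.
id_to_idx = {row[0]: i for i, row in enumerate(rows)}

# Node feature matrix, shape [num_nodes, num_features].
x = [[row[2], row[3]] for row in rows]

# Node-level labels (overall rating), shape [num_nodes, 1].
y = [[row[4]] for row in rows]

# Group the remapped node indices by team.
teams = {}
for row in rows:
    teams.setdefault(row[1], []).append(id_to_idx[row[0]])

# All ordered pairs inside a team: every player is connected to every
# teammate, and each connection appears in both directions.
sources, targets = [], []
for members in teams.values():
    for src, dst in permutations(members, 2):
        sources.append(src)
        targets.append(dst)

# Edge index in COO format, shape [2, num_edges].
edge_index = [sources, targets]

print("nodes:", len(x), "features:", len(x[0]), "edges:", len(edge_index[0]))
```

With PyTorch Geometric installed, lists like these could be converted with `torch.tensor(...)` and passed to `Data(x=..., edge_index=..., y=...)`, which matches the x, edge index, and y described in the captions. The tensor dtypes (float features, integer edge index) would need to be set accordingly.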
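Editor's sketch of the second example (users rating movies). Again, this is not the notebook's code, only a small illustration of the steps the captions describe: drop ratings for movies missing from the movie table, remap user and movie IDs separately so each starts at zero, derive user features from rating statistics (mean rating and number of ratings), and build a user-to-movie edge index whose columns line up with the rating labels. All data values and names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy ratings (NOT from the real dataset): (user_id, movie_id, rating)
ratings = [
    (1, 20, 8),
    (1, 24, 9),
    (3, 20, 6),
    (3, 79, 10),  # movie 79 has no entry in the movie table below
    (5, 24, 7),
]

# Hypothetical movie table: movie_id -> already-encoded feature vector.
movie_features = {20: [1.0, 0.0], 24: [0.0, 1.0]}

# Drop ratings that point to movies we have no features for.
ratings = [r for r in ratings if r[1] in movie_features]

# Remap both id spaces separately so that each starts at zero.
user_idx = {u: i for i, u in enumerate(sorted({r[0] for r in ratings}))}
movie_idx = {m: i for i, m in enumerate(sorted(movie_features))}

# User features from rating statistics: [mean rating, number of ratings].
per_user = defaultdict(list)
for user, _movie, rating in ratings:
    per_user[user].append(rating)
x_user = [None] * len(user_idx)
for user, values in per_user.items():
    x_user[user_idx[user]] = [sum(values) / len(values), float(len(values))]

# Movie features, ordered by the remapped movie index.
x_movie = [movie_features[m] for m in sorted(movie_features)]

# Edge index (user -> movie only) in COO format, shape [2, num_edges],
# and one rating label per edge, in the same order.
edge_index = [
    [user_idx[r[0]] for r in ratings],
    [movie_idx[r[1]] for r in ratings],
]
edge_label = [r[2] for r in ratings]

print(edge_index, edge_label)
```

In PyTorch Geometric, structures like these would typically go into a `HeteroData` object, for example `data['user'].x`, `data['movie'].x`, and `data['user', 'rating', 'movie'].edge_index`, after conversion to tensors. Adding the reverse movie-to-user direction would be a separate step, as the captions point out.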
Info
Channel: DeepFindr
Views: 16,286
Id: AQU3akndun4
Length: 15min 21sec (921 seconds)
Published: Wed May 04 2022