Build a custom dataset with LightningDataModule in PyTorch Lightning

Captions
Hey everyone, my name is Venelin, and in today's video I'm going to show you how to build a PyTorch dataset and use it within a LightningDataModule, so we can train or fine-tune a model with PyTorch Lightning. Let's get started.

In the last video we installed PyTorch Lightning and the Transformers library, and I showed you how to use the Electra tokenizer on a dataset called GoEmotions. In the next couple of steps I'm going to build a PyTorch dataset and use it in a PyTorch Lightning data module. But before all of this, I'm going to reframe the problem a little: each comment can have one or more emotions associated with it, so I'm going to take the most common emotion annotated for each comment and use that as the label for that comment. First, though, there is a newer version of PyTorch Lightning, so I'll start by installing that and then rerun all the code from last time.

Now that the notebook has executed again, here's a reminder of what we have: the text, the id of the comment, multiple people who annotated each comment, and 27 emotion categories plus a neutral category. To convert this into a single label per comment, I group the comments by their id. For each group I take the text at the first position (it's the same for every annotator). Then, to get the ratings for the comment, I take the group's emotion category columns, each of which contains a zero or a one, and sum them along the rows. For the example comment this gives a vector where "approval" was assigned by, I guess, the two annotators who rated it. Taking the argmax of this vector gives the index of the category with the most annotations, which here is 4: the "approval" column, counting from zero. That index will be the emotion label for this particular comment.

So from each group I collect the text and the emotion index, and build a DataFrame with a text column and an emotion column. This takes a while to execute, but you only have to do it once.
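As a rough sketch, assuming the annotations live in a pandas DataFrame named df with id and text columns plus one 0/1 column per category in the list emotion_categories (both names carried over from the previous video), the conversion looks something like this:

    import pandas as pd

    # Collapse multiple annotators into a single label per comment:
    # group by comment id, sum the 0/1 category columns, take the argmax.
    texts, emotions = [], []
    for _, group in df.groupby("id"):
        texts.append(group["text"].iloc[0])        # same text for every annotator
        ratings = group[emotion_categories].sum()  # votes per emotion category
        emotions.append(ratings.argmax())          # index of the most voted category

    text_df = pd.DataFrame({"text": texts, "emotion": emotions})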
Next I want to do a bit more preprocessing, which I've taken from the original GoEmotions example project by the Google team. The result so far is fine, but I want to add an emoji to each row. There's an emoji map, once again taken from the original GoEmotions example project, and to use it I assign a new column containing the corresponding emoji. This looks a bit complicated, but it really isn't: I map the emotion index to its name using the emotion categories list, and then map the name to the emoji using the emoji map. If we now look at the text DataFrame, we have an emoji associated with each text. Just a nice little touch, I guess.
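A sketch of that mapping, with an illustrative excerpt of the emoji map (the real map in the GoEmotions example project covers all 28 categories) and the emotion_categories list assumed from before:

    # Illustrative excerpt only; the full name -> emoji map comes from the
    # original GoEmotions example project.
    emoji_map = {"approval": "👍", "joy": "😃", "neutral": "😐"}

    # Map the integer label to its category name, then the name to an emoji.
    text_df["emoji"] = (
        text_df["emotion"]
        .map(lambda i: emotion_categories[i])
        .map(emoji_map)
    )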
Next, I'm going to show you how to create a dataset. Now that we have the DataFrame in a format that's useful for the text classification we're going to do, I'm going to create an EmotionDataset that inherits from the original PyTorch Dataset class. Checking the imports, I'm missing a couple: for the Dataset class to be available we need to import it from torch.utils.data, and I'll also import the torch module.

To build the dataset we need an __init__ method, the constructor, where I pass in the data (a DataFrame) and the tokenizer (the Electra tokenizer) and store them as fields. I override the __len__ method, which just returns the length of the DataFrame. Next I override the __getitem__ method, which receives the index of the current item: I take the row associated with that index and create an encoding with the tokenizer. I pass in the text and a max length of 64, which is taken from the exploration of the token counts we did earlier, so 64 should cover almost everything. I require truncation, and padding to the max length, so every sequence will have length 64. We want the special tokens added, return_token_type_ids set to False, the attention mask returned, and the return tensors to be PyTorch tensors. The final item is just a dictionary containing the input_ids (flattened), the attention mask (again flattened, from the encoding), and the label, for which I convert the emotion index into a torch tensor.

Now that we have this dataset, let me try it out. I create an instance of it, passing in all the data, and first print its length, which should match the length of the text DataFrame, and it does. Next I iterate over the dataset: an item should have a label and input_ids, and looking at just the first ten input ids we can see the special classification token at the start, followed by the tokens from the text, and the label of the first example matches as well. So this looks like it is working.
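Putting those pieces together, here is a minimal sketch of the dataset class just described, assuming text_df and the Electra tokenizer from earlier (the keyword arguments are the standard HuggingFace tokenizer options):

    import torch
    from torch.utils.data import Dataset

    class EmotionDataset(Dataset):
        def __init__(self, data, tokenizer):
            self.data = data            # DataFrame with "text" and "emotion" columns
            self.tokenizer = tokenizer  # e.g. the Electra tokenizer

        def __len__(self):
            return len(self.data)

        def __getitem__(self, index):
            row = self.data.iloc[index]
            encoding = self.tokenizer.encode_plus(
                row["text"],
                max_length=64,              # covers almost all comments
                truncation=True,
                padding="max_length",
                add_special_tokens=True,
                return_token_type_ids=False,
                return_attention_mask=True,
                return_tensors="pt",        # PyTorch tensors
            )
            return {
                "input_ids": encoding["input_ids"].flatten(),
                "attention_mask": encoding["attention_mask"].flatten(),
                "label": torch.tensor(row["emotion"]),
            }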
Next we're going to build a LightningDataModule. The purpose of this is to adapt our own dataset to what PyTorch Lightning expects. To do that we need to import pytorch_lightning, which by convention is imported as pl. The EmotionDataModule inherits from LightningDataModule (not LightningModule, but the data module), and in its constructor I pass in some parameters: the data, which is a DataFrame, a tokenizer, and a batch size, all of which I store as fields.

One of the methods we can override here is the setup method, and for its type hint we need Optional from the typing module. Why do we need this setup method? It's the place where we can do some preprocessing of our data, and in our case I'm going to split it into training, validation, and test sets right here. To do that I import train_test_split from scikit-learn, and to make this reproducible I call pl.seed_everything(42), which seeds all the random number generators we use. I split the data into train and test with a test size of 20%, then split that test portion again with a size of 0.5, so validation and test each get half, and assign the results to the train, validation, and test DataFrames.

Next we need to wrap all of this into data loaders for each stage of the training process. PyTorch Lightning gives us the train_dataloader method, which needs to return a DataLoader (which we've already imported). It accepts a dataset, which is going to be the EmotionDataset we created, built from the train DataFrame and the tokenizer; this is why we pass the tokenizer in. We need a couple more parameters: I do want this one shuffled, because it's the training data; I pass in the batch size from the constructor; and I want the number of workers to be the number of CPUs we have. For that I need to import the os module and call os.cpu_count(). This might be different on your machine, but on Google Colab we have two CPUs, or more likely two cores.

Then we create a validation data loader and a test data loader. Neither is strictly required (in fact, the LightningDataModule itself is not required to train a model with PyTorch Lightning), but I'd say they're good practice: they encapsulate very well the data you need to prepare for training your model, and one nice thing about writing these Lightning data modules is that you can swap out your dataset just by writing a new data module and reuse your old models on the new data. So I copy the train_dataloader, drop the shuffling (we don't want it for evaluation), and point the validation loader at the validation DataFrame and the test loader at the test DataFrame.

And this is pretty much it; we now have our LightningDataModule. Let me create an instance of it, which requires the text DataFrame, the tokenizer, and a batch size of, let's say, 32. Just to test this, I call train_dataloader, but first we need to call setup manually, because nobody else is calling it here. After fixing a couple of typos (num_workers, I should have used autocomplete), this finally returns the data loader, and if we grab a batch from it and check its length first, we have three items in the batch: the input ids, the attention masks, and the labels, each of size 32.
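Here is a minimal sketch of the whole data module plus the smoke test from the video, reusing the EmotionDataset above and assuming 42 is the seed (and, as before, text_df and tokenizer):

    import os
    from typing import Optional

    import pytorch_lightning as pl
    from sklearn.model_selection import train_test_split
    from torch.utils.data import DataLoader

    class EmotionDataModule(pl.LightningDataModule):
        def __init__(self, data, tokenizer, batch_size):
            super().__init__()
            self.data = data
            self.tokenizer = tokenizer
            self.batch_size = batch_size

        def setup(self, stage: Optional[str] = None):
            # 80% train, then split the remaining 20% evenly into val and test.
            self.train_df, test_df = train_test_split(self.data, test_size=0.2)
            self.val_df, self.test_df = train_test_split(test_df, test_size=0.5)

        def train_dataloader(self):
            return DataLoader(
                EmotionDataset(self.train_df, self.tokenizer),
                shuffle=True,               # shuffle only the training data
                batch_size=self.batch_size,
                num_workers=os.cpu_count(),
            )

        def val_dataloader(self):
            return DataLoader(
                EmotionDataset(self.val_df, self.tokenizer),
                batch_size=self.batch_size,
                num_workers=os.cpu_count(),
            )

        def test_dataloader(self):
            return DataLoader(
                EmotionDataset(self.test_df, self.tokenizer),
                batch_size=self.batch_size,
                num_workers=os.cpu_count(),
            )

    pl.seed_everything(42)                  # assumed seed, for reproducible splits
    dm = EmotionDataModule(text_df, tokenizer, batch_size=32)
    dm.setup()                              # called manually outside a Trainer
    batch = next(iter(dm.train_dataloader()))
    print(len(batch))                       # 3: input_ids, attention_mask, label
    print(batch["input_ids"].shape)         # torch.Size([32, 64])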
So the batch length is the number of entries in the batch dictionary: the attention mask, the input ids, and the label itself. Each of those should have 32 items; checking confirms it, and the same is true for the input ids and for the labels. So this is pretty much what we expect: 64 is the maximum sequence length we asked for, each batch has 32 items (the same is true for the labels, of course), and we have three entries in each batch, which are printed right here.

In this video we saw how to create a PyTorch Lightning data module, which, once again, abstracts away your dataset into a format that is, let's say, more general. In the next one we're going to build our model using a LightningModule: I'm going to show you how we can fiddle a bit with the original Electra implementation from the HuggingFace Transformers library and wrap it into a module with PyTorch Lightning. Thanks for watching guys, please share and subscribe, and go visit MLExpert to prepare for the machine learning interview. Have a good day, bye!
Info
Channel: Venelin Valkov
Views: 4,842
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning, PyTorch, PyTorch Lightning, Dataset, NLP, Transformers
Id: e47f__x7KSE
Length: 26min 31sec (1591 seconds)
Published: Wed Dec 08 2021