Creating Your Own Dataset In Hugging Face | Generative AI with Hugging Face | Ingenium Academy

Captions
All right, so in this video we're going to learn how to work with datasets from Hugging Face, and we're even going to create our own dataset and push it to our own account on the Hugging Face Hub, so we're actually going to make a contribution today. Let's go ahead and install transformers and torch, as well as datasets; we want to make sure datasets is installed so we can access datasets straight from Hugging Face. This will take a second to install, but it'll get there, and now it's done.

Hugging Face lets us load a dataset in just by calling load_dataset and giving it the path to the dataset; you can go under Datasets on the Hub and look this one up. I'm going to load it in right here, and you can see it returns a DatasetDict object containing a train split, which is a Dataset with two features, an act and a prompt, plus the number of rows. Every dataset in Hugging Face is a DatasetDict, so we load it in as such and can operate on it just like any other dictionary; it's simply a Hugging Face-specific one. This one only has a train split, but not all datasets have just a train split; some have train, validation, and test, like we'll see below.

Next we're going to load in the SAMSum dataset, a summarization dataset. Before we do, note that some datasets actually require us to pip install extra packages, so be aware that different datasets on Hugging Face have different requirements. Let's load this one in; it may take a second. You can see it has train, test, and validation splits and different features, so I definitely wanted to show you that not all datasets come with just a train split. Nonetheless, we're going to keep working with the first dataset.

Okay, for the data preprocessing section of this notebook we're going to do some preprocessing on this dataset, because we want it to have a test set as well and maybe change it a little bit. Like I said, you can access it just like a dictionary by calling dataset, indexing into "train", and then indexing into the first example to print it out. As you can see, the act is "Linux terminal" and the prompt is along the lines of "I want you to act as a Linux terminal." That's what this dataset is: an act, and then a prompt asking GPT to act a certain way.

Now let's say we want to shuffle this dataset because we're about to create a train and test split and we don't want any natural ordering. I don't think we have to worry about that here, but suppose we did: we can shuffle it, setting a seed for the random number generator, and after we shuffle we call select and take the first 100 examples, which returns a dataset with 100 rows. Then, since we want a test set as well, we call train_test_split on that dataset with a train size of 0.8 and the same random seed, and there you go. We kept 100 examples because maybe you wanted 100 instead of 150; you can imagine that if you had thousands or tens of thousands of examples and you were about to pre-train a model, you might want fewer so training doesn't take too long, and shuffling first lets you randomly sample from the dataset.
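Here is a minimal sketch of those steps, assuming the prompts dataset is "fka/awesome-chatgpt-prompts" and the seed is 42 (neither is spelled out in the video):

```python
from datasets import load_dataset

# Load the prompts dataset straight from the Hub
# (the dataset path is an assumption -- use whichever one the notebook shows)
dataset = load_dataset("fka/awesome-chatgpt-prompts")
print(dataset)               # DatasetDict with a single "train" split ("act", "prompt")
print(dataset["train"][0])   # first example, accessed just like a dictionary

# Some datasets need extra packages first, e.g. the SAMSum summarization set
# historically needed `pip install py7zr` before load_dataset("samsum") would work.

# Shuffle with a fixed seed, keep the first 100 rows, then split 80/20
subset = dataset["train"].shuffle(seed=42).select(range(100))
splits = subset.train_test_split(train_size=0.8, seed=42)
print(splits)                # DatasetDict with "train" (80 rows) and "test" (20 rows)
```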
And there we go: we were able to shuffle, select a subset, and split it into train and test. That's some basic preprocessing; we'll do more when we get to training models, but I wanted to cover at least that.

Now we're going to create our own dataset. This dataset comes from a machine learning archive that's fairly famous. I know the screen is cut off and I do apologize, but if you go to that archive and look up the Reuters-21578 dataset, you'll land on this page, and under "Data Files" there's a link. Instead of clicking on it, copy its link address; I'm going to go over here and just paste it in. This is the link to the file that contains all the data from Reuters, which is just a bunch of articles from Reuters from a long time ago, and we're going to use it to create our own dataset.

The wget command pulls it down into our local directory, so now we have this Reuters file right here. We then untar and decompress it, because it's a tar.gz, and it opens up into all this data. The only files we care about are the .sgm files; they contain the articles, and we access the articles inside them using Beautiful Soup, since they're SGML. This code right here just goes through each article, pulls out a title and a body, and appends it to a master list of Reuters articles. It gets through them fairly quickly, so you don't have to wait too long. Okay, that took a few seconds, but it finished, and this is what the articles look like: each dictionary has a title and a body. For example, the title "Bahia cocoa review" with a body starting "Showers continued throughout the week in the Bahia cocoa zone." These are the titles of the articles, and we can see that we have 21,578 of them, which explains the "Reuters-21578" name; that's how many articles are in the dataset.

Now I'm going to split these into train, validation, and test sets by just indexing at 80 percent and dividing them up accordingly, because I want my train set to be 80 percent of the data. The next piece of code saves each split in the JSONL file format; JSONL is a JSON file where each line is a separate JSON object. Each article is a Python dictionary right now, but it becomes a JSON object when I convert it and write it to its respective file, train.jsonl, valid.jsonl, or test.jsonl. So I end up with three JSONL files, train, valid, and test, containing lines for all the train, validation, and test articles.

The datasets library can create a dataset from several different file formats: CSV, JSON, JSONL, and so on. Here we're going to use the "json" loader, which handles JSONL as well, by giving it a data_files dictionary. We call load_dataset, tell it we're loading JSON files, and pass in the dictionary that maps each portion of the data, train, validation, or test, to its file, so it knows which file is the training, validation, or test split.
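As a rough sketch of that pipeline, assuming the .sgm files have been extracted into the working directory, that the split is roughly 80/10/10, and that the files are named train/valid/test.jsonl (the exact parsing code, proportions, and names in the video may differ):

```python
import glob
import json
from bs4 import BeautifulSoup
from datasets import load_dataset

# Pull a title and body out of every article in every .sgm file
reuters_articles = []
for path in glob.glob("*.sgm"):
    with open(path, "rb") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for article in soup.find_all("reuters"):
        title, body = article.find("title"), article.find("body")
        if title is not None and body is not None:
            reuters_articles.append({"title": title.text, "body": body.text})

# Split by index: 80% train, with the remainder divided between validation and test
n = len(reuters_articles)
splits = {
    "train": reuters_articles[: int(0.8 * n)],
    "valid": reuters_articles[int(0.8 * n): int(0.9 * n)],
    "test": reuters_articles[int(0.9 * n):],
}

# JSONL: one JSON object per line
for name, articles in splits.items():
    with open(f"{name}.jsonl", "w") as f:
        for article in articles:
            f.write(json.dumps(article) + "\n")

# The "json" loader reads JSONL too; data_files maps each split name to its file
data_files = {"train": "train.jsonl", "validation": "valid.jsonl", "test": "test.jsonl"}
reuters_dataset = load_dataset("json", data_files=data_files)
print(reuters_dataset)
```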
Then we call load_dataset, wait just a second, and it generates our train, validation, and test splits from the files we just created. Let's load it back in: this is what our dataset looks like, a DatasetDict with train, validation, and test splits. This is our own custom dataset, and we can print out the first example; as you can see, it has a title and a body.

Now the cool thing is, let's share it to our own Hugging Face Hub. Right here, from huggingface_hub you import notebook_login and call notebook_login; it's going to ask for an access token. At the beginning of this course I mentioned that we'd need an access token, and creating one is very straightforward. First you need your account; this is the Ingenium Academy account right now. You go to your profile, which you can access by clicking right here, open your profile settings, go to Access Tokens, click on New Token, and create your own token. We're going to name it "ingenium hf2" because I already have one, but I'm doing this for the sake of demonstration. Make sure it has the write role so that you can actually write to your own repo; if your access token doesn't have the write role, you won't be able to push this dataset to your repo.

So this is our dataset, and now we push it to our Hub as "reuters-articles". It's going to take a second, but it will push it to our Hugging Face Hub, and the dataset should show up there right away. I'm going to go to my account, and as you can see I've already created this model, but we'll get to that later; here we have the Reuters articles dataset, and it has created a dataset card for it. Then we can use the datasets library to load it back in, just like that. It may take a little while for your dataset to render like this, so give it a few minutes, or an hour or so, come back, and then you'll see your dataset shown the same way.

And so that is how we work with datasets: some basic functions we can use on them and how we can actually create our own. We'll do more advanced work once we start fine-tuning models, but we'll get to that in a later video.
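A minimal sketch of the push, assuming the repo name is "reuters-articles" and the token has the write role ("your-username" is a placeholder for your account name):

```python
from huggingface_hub import notebook_login
from datasets import load_dataset

# Prompts for your Hugging Face access token (it must have the "write" role)
notebook_login()

# Push the DatasetDict built above to your account on the Hub
reuters_dataset.push_to_hub("reuters-articles")

# Once it appears, it can be loaded back like any other Hub dataset
reloaded = load_dataset("your-username/reuters-articles")
print(reloaded)
```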
Info
Channel: Ingenium Academy
Views: 3,235
Id: enObIMzyaE4
Length: 10min 9sec (609 seconds)
Published: Tue Sep 19 2023