Preparing datasets for Fine-tuning || Operations on Datasets

Captions
Hi all. We often hear that data is the new oil, and working in the data industry we come across two major challenges: first, the availability of data, and second, even once the data is available, maintaining its quality and standard. Today in this video we are going to talk about the Hugging Face Hub, which hosts thousands of datasets for numerous tasks such as text classification, text generation, text summarization, image-related tasks, and more. Importantly, these datasets are available under open licenses, which means learners like us can readily use them to build various applications in generative AI and data science. So let us head over to our notebook and see what this datasets library has to offer. But before proceeding, if you are new to the channel, make sure to subscribe and hit the bell icon so you never miss a video and stay on top of your generative AI learning.

Now we need to look at some of the operations on these datasets, or basically how we can make use of them. In the first part we saw how to load these datasets and how to index and slice them; the next important thing is how to apply certain processing steps to them, such as sorting, filtering, sampling, shuffling, and a few others. Let me show an example. We take data_tr['label'] and look at the first 10 labels; you see they are 0, 0, 3, 2, 3, 0, 1, 5, and so on. Let me also print the corresponding texts so that there is no confusion later on (okay, I did not use the print statement here). Now, in order to sort this dataset based on its labels,
we need to call data_tr.sort() and pass the name of the column we want to sort by. Now let us load the first 10 labels of the sorted dataset: all the zero labels now come at the top, followed by the ones, then the twos, and so on, and the texts have been rearranged correspondingly as well. That is how sorting works. One more thing to note here: the original data object is a DatasetDict type, so to work with it we first need to pick the corresponding split and then operate on that in the same way; you can see that a single split is a Dataset type, which is what we have under data_tr as well.

Now let me show you another case, where we shuffle our dataset. To shuffle, we call data_tr.shuffle() and pass the seed parameter, say seed=42, and then load the first 10 samples after shuffling. You see we get 10 random samples, unlike the ones we originally had: originally the labels were 0, 0, 3, 2, 3, and after shuffling they are 4, 1, 0, 3, and so on. The dataset is randomly shuffled; that is how it works.

Another important operation is selecting from the dataset. I am going through all of these one by one, and at the end we will summarize everything. With data_tr.select() we pass the indices we want to pick, say 10, 30, 340, 21, and so on; this selects those particular records from the dataset and returns them, so from the entire set of records we get back just the few we selected. Similarly, we have a very important operation called filter. The filter operation is used to keep records that satisfy certain conditions; we can apply a lambda, say data_tr.filter(lambda x: x['label'] == 0). You
see, this returns all the records that have label zero; it is giving us every record with label zero, and we are just looking at 10 of them. That is how filtering works, and you can apply any condition here. Now, if you have loaded a dataset and you want to split it according to your own needs, say for this data_tr, we call train_test_split and pass the test size, just as we do in the scikit-learn package. You see, our entire data_tr has now been split into two parts with a test size of 10%: out of our 16,000 samples, 1,600 have been sampled as the test set and the remaining 14,400 as the train set. So this is how the dataset operations work: this is how we load a dataset and work with it, and we are going to use these datasets for all our other tasks, such as the fine-tuning we already talked about and building various generative AI applications and tools. If you found this content useful, give it a thumbs up. See you in the next lecture. Have a nice day, bye-bye.
Info
Channel: Datahat -- Simplified AI
Views: 500
Keywords: datahat, generative ai, data science, python, simplified ai, datasets hugging face, hugging face datasets, hugging face tutorial, load and download dataset from hugging face, hugging face hub, datasets, preparing dataset for fine tuning, python hugging face datasets, hugging face datasets tutorial, hugging face fine tuning, hugging face datasets overview, datasets download, datasets library, dataset library, dataset library hugging face
Id: i22tsGqw83k
Length: 6min 32sec (392 seconds)
Published: Sat Feb 03 2024