How to Download a Hugging Face Dataset Locally? Easily Access 1000+ Quality Datasets

Video Statistics and Information

Captions
Hi all. We often hear that data is the new oil. Working in the data industry, we come across two major challenges: first, the availability of data, and second, even once data is available, maintaining its quality and standards. Today we are going to talk about the Hugging Face Hub, which hosts thousands of datasets for numerous tasks such as text classification, text generation, text summarization, image-related tasks, and more. Importantly, these datasets are available under open licenses, which means learners and professionals like us can readily use them to build applications in generative AI and data science. So let us head over to our notebook and see what these datasets have to offer. But before proceeding, if you're new to the channel, make sure to subscribe and hit the bell icon to never miss a video and stay on top of your generative AI learning.

The first thing we need to do in order to use Hugging Face datasets is install the library called `datasets`. We have already installed it in our Colab environment, and we then import the `load_dataset` function from the `datasets` library. Now let us look at some of the datasets hosted on the Hugging Face Hub and select one of them. There are nearly 9,650 datasets currently hosted, which is an incredible number, under various task heads such as multimodal, computer vision, natural language processing, and many others. We shall go for a text classification dataset; under text classification, let us select the emotion dataset, `dair-ai/emotion`. It primarily comprises short texts, each classified with one of several emotions from categories such as sadness, anger, and love. So let us quickly head over and load this dataset.
In order to load any dataset from the Hugging Face Hub, what we essentially need is its directory name; here that path is `dair-ai/emotion`. We simply pass that path to the `load_dataset` function, and the dataset is loaded. Let us inspect what we got. Our dataset is of type `DatasetDict`, and this dict contains the various splits of the dataset, namely train, validation, and test, with the number of rows under each being 16,000, 2,000, and 2,000. If we look back at the page where this dataset is hosted, it likewise specifies that there are 20K rows in this subset and that the subset contains three splits: train, validation, and test. So we loaded it correctly.

Now, what if we want to load only a specific split of the dataset, or select samples from specific splits? For that there is a parameter named `split`, which we use in `load_dataset`. We see two specific examples here. In the first, we select just the split called train; this train split should have 16,000 records with two features, text and label. In the second case we sample the dataset, specifying that it should select the first 1,000 samples from the train split and the first 200 samples from the validation split and return a sampled dataset. Let us run these and see how it works. We get 16,000 records in the first case, which loads only the train split, and in the second case, where we sampled the dataset, we have 1,200 samples (1,000 + 200). Notice that the splits we saw earlier in the `DatasetDict` are no longer present, because we loaded a specific split.
Now let us see how indexing works. We index our dataset by passing an index number in square brackets, and if we want to load a range of values we pass a slice. For example, a slice loads each of the features as a list whose size matches the range we specified, so here we see five text values and the corresponding five labels.

Now, one of the primary questions people ask is how to download a dataset from the Hugging Face Hub to a local machine. For that, what we essentially need is to install Git LFS (Large File Storage). All the datasets on the Hugging Face Hub are hosted in Git repositories that use large file storage, so we can clone them directly. First we install and initialize Git LFS, then we clone the specific repository that we want to download. From the dataset's path we can directly clone any dataset repository to our local machine, and then load it in just the same way as we did with `load_dataset`.
Info
Channel: Datahat -- Simplified AI
Views: 3,162
Keywords: datahat, generative ai, data science, python, simplified ai, llm, datasets, hugging face, machine learning datasets, open source data, 1000+ datasets, download data, hugging face tutorial, how to use hugging face, huggingface datasets, hugging face course, hugging face models, what is hugging face, hugging face transformer, hugging face pipeline, data science datasets, load and download dataset hugging face
Id: -svlg240JXk
Length: 6min 15sec (375 seconds)
Published: Tue Jan 09 2024