How to Build a Bert WordPiece Tokenizer in Python and HuggingFace

Captions
Hi, and welcome. In this video we're going to cover how we can build a tokenizer for BERT from scratch. Typically, when we're using transformer models, we have three main components: the tokenizer, which is obviously what we're going to cover here, the core model, and a head. The tokenizer is what converts our text into tokens that BERT can read; the core model is BERT itself, the core which we build or train through pre-training; and then there's the head, which allows us to do specialized tasks, so a Q&A head, a classification head, and so on.

Now, a lot of the time what we can do is just head over to the Hugging Face website, where we have all these tasks listed; if I want a model for question answering I can click on that, and typically there's something we can use. But obviously it depends on your use case, your language, and a lot of different things. If we find that there isn't really a model that suits what we need, or one that doesn't perform well on our own data, that's where we would start considering whether we need to build our own transformer. In that case, at the very least, we're probably going to need to build or train from scratch the core model and the transformer head; we definitely need those two parts. Sometimes you'll find that we also need the tokenizer as well, but not always. Think about it: say our task is something to do with the English language, but the model doesn't perform very well on our specific dataset. That doesn't mean the text isn't being tokenized properly; the tokenizer can still tokenize that text pretty well as long as it's standard English. What we'll find is that the model just doesn't quite understand the style of language being used. For example, if BERT is trained on blogs from the internet, it's probably not going to do as well on governmental or financial reports. That's the sort of area where you think, OK, we probably need to retrain the core model so it can better understand the style of language used there. And then the head, like I said before, is where we train the model for a specific use case; for Q&A we'd probably want to train our model on a specific question answering dataset so it can start answering our questions. In that case, with English text, we probably don't need the tokenizer. But sometimes we do need a tokenizer, because maybe your use case is in a less common language, and in that case you probably will need to build your own tokenizer for BERT. That's really the sort of use case we're looking at in this video. So we'll cover building a WordPiece tokenizer, which is the tokenizer used by BERT, and we'll also have a look at where we can get good multilingual datasets from as well.

Let's move on to what the WordPiece tokenizer is and what it does. Like I said before, the tokenizer is called a WordPiece tokenizer, and it's pretty straightforward: it breaks your words into chunks, or pieces of words, hence WordPiece. For example, the word "surf" is most likely going to return a single token, which would be "surf", whereas for the word "surfing", the "ing" at the end of "surf" is a pretty common part of a word, in English at least, so what we'd find is that this word would probably get broken out into two tokens: "surf" and "##ing".
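To make the idea concrete, here's a minimal sketch using a pretrained English BERT tokenizer (bert-base-uncased, not the tokenizer we build later in the video); the exact pieces you get depend on the vocabulary, so they may not match the splits above token for token:

```python
from transformers import BertTokenizer

# Pretrained English BERT tokenizer, used here only to illustrate WordPiece splits.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for word in ["surf", "surfing", "surfboarding", "snowboarding"]:
    # Pieces that continue a word are prefixed with "##".
    print(word, "->", tokenizer.tokenize(word))
```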
Now, where we see this prefix, the double hash, that's the standard prefix used to indicate that this is a piece of a word rather than a word itself. We see that further down as well: "surfboarding" gets broken into three tokens. If we then compare that to "snowboarding": snowboarding and surfboarding are obviously kind of similar because they are both boarding sports, the difference being that one is on surf and the other is on snow. Before we even feed these tokens into BERT, we're making it very easy for BERT to identify where the similarities are between those two words, because BERT knows that one of them is surf, one of them is snow, but both of them are boarding. So this is helping BERT from the start, which I think is pretty cool.

Now, when we're training a tokenizer we need a lot of text data. When I say a lot, let's say two million paragraphs is probably a good starting point, although ideally you want as much as you can get. What we'll use for training is something called the OSCAR dataset, or OSCAR corpus. OSCAR is a huge multilingual dataset that contains an insane amount of unannotated text, so it's very, very good, and we can access it through Hugging Face, which is super useful.

Over in our code, if we want to download datasets, we need to pip install something called datasets, so: pip install datasets. I already have it installed, so I'm just going to write from datasets import load_dataset. Through that library we can also use the datasets.list_datasets() method, so let me import datasets as well: import datasets. This will give us a very big list, probably a little bit too big, showing us all the datasets that are currently available in the datasets library, which is quite a lot; I think it's a fair bit over a thousand now. So we have all of these, which is a lot, but how many exactly? Let me take len(datasets.list_datasets()). My internet is very bad at the moment, so it takes forever to download anything, but there are our datasets. That's one way of viewing the datasets available across Hugging Face, and there are new ones being added every day. An easier way is to go to Google, type in "Hugging Face datasets viewer", and click on the Streamlit Hugging Face result. This is a Streamlit app Hugging Face have built that allows you to browse through their datasets. We go over to the dataset box and I'm going to type in "oscar", because that's the one we'll be using, and on the right it should pop up. Within OSCAR we have all these different subsets; the first one here is Afrikaans, the language, and then you have all these other ones below. I'm going to use Italian as my example, and Italian has a lot of data: if I click on it, it doesn't actually show you anything, which is a little bit annoying, but that's because it's just a huge dataset and it can't show you everything. In fact, it's 101, 102 gigabytes of data, so it's a lot, but that's good for us, because we need a lot of data for training.
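If you'd rather browse from code than from the Streamlit viewer, a small sketch along these lines works (note that list_datasets has since been deprecated in newer releases of the datasets library, so on recent versions you may need the huggingface_hub client instead):

```python
import datasets

# All dataset names currently on the Hugging Face hub (well over a thousand).
all_datasets = datasets.list_datasets()
print(len(all_datasets))

# OSCAR shows up as a single dataset; its per-language subsets are chosen
# later via the config name passed to load_dataset.
print([name for name in all_datasets if "oscar" in name])
```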
So, if we want to download that dataset, we write load_dataset and assign the result to a variable (the name is just a variable name). In there we first need the dataset name, which is "oscar", and then we need to specify which part of the dataset we want. The subset we're using is deduplicated and unshuffled, so that's unshuffled_deduplicated_it; looks good. The other thing we can do is pass split and specify how much of the data we'd like. Now, when you use this split, it's still going to download the full dataset to your machine, which is a little bit annoying, but that's how it works, so I've found it isn't particularly useful unless you're just loading from your machine and saying, OK, I only want a certain amount of data. This subset is 101 gigabytes, which is a lot, and if you don't want to download all of that you can set streaming=True, and this is very useful: it creates an iterator, and as you iterate through the object it will download your samples one at a time. Because I already have my data downloaded onto my machine, I'm going to use the split approach, and I'm going to take the first, let's say, 500,000 items. Obviously you'd want to be using more samples than this, but I'm just going to use this many because otherwise the loading times on all of this are pretty long, and I don't want to be waiting too long. We also need to specify which split of the dataset we're using: typically we have either train, validation, or test splits. I think we always have the train split in there, and then we can have validation and test splits as well. So we'll load that.

Then what I'm going to do is create a new directory where I'll store all of these text files, because when we're training the tokenizer it expects plain text files where each sample is separated by a newline. So I'm going to go ahead and create that data for us: I'll make a directory called oscar, and then I'll loop through our data and convert it into the file format we need. The first thing I want to do is from tqdm.auto import tqdm; I'm using this so that we have a progress bar and can see where we are in the process, because this can take a while. Then I'm going to create a text_data list, which we'll populate with all of our text, and a file_count variable set to zero. This loop is going to go through and create all of our text files. So: for sample in tqdm(dataset), and for now I'm just going to pass. Let's run that, and we see that we get this tqdm progress bar. We're not even doing anything yet and it's already taking a long time to process the data, so I'm going to go down to 50,000 samples so I'm not waiting too long; let me modify that to 50,000, and that should be a little bit quicker. OK, much better.

Now, the first thing I want to do: we're going to be splitting each sample on the newline character, so I first want to remove any newline characters that are already within each sample, otherwise we'd be splitting our samples midway through a sentence. If I show you a sample, we have an id and then the text, and we obviously want the text, so I take sample['text'] and replace any newline characters, if there are any (hopefully there aren't), with a space. Then what we want to do is append that to our text data, so text_data.append(sample).
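Putting those options together, here's a sketch of what the loading call looks like; the 50,000-sample slice mirrors what we settle on below, and the streaming variant is the alternative if you want to avoid the roughly 100 GB download:

```python
from datasets import load_dataset

# Option 1: slice the train split. Note the full subset is still downloaded
# to disk first; the slice only limits what is loaded into the dataset object.
dataset = load_dataset(
    "oscar",
    "unshuffled_deduplicated_it",
    split="train[:50000]",
)

# Option 2: stream the data. Nothing is downloaded up front; samples are
# fetched one at a time as you iterate over the object.
streamed_dataset = load_dataset(
    "oscar",
    "unshuffled_deduplicated_it",
    split="train",
    streaming=True,
)
```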
Now, we could put all of this into a single file, but then that leaves us with one single file which is huge. I mean, with 50,000 samples it's not really a problem, but we're not typically going to be using that many samples; it's going to be more like 5 million, 50 million, or so on. So what I'd like to do is split the data into multiple text files. What I do is say: if the length of text_data is equal to, let's say, 5,000, at that point save the text data and then start populating a new file. So, with open(...): we need to open the file and save it into the oscar directory that we created before, and I'm just going to call it file_{file_count}.txt, converting this into an f-string (I'm not sure why it's highlighting everything here), and we're writing that file, so we pass 'w' as the mode (ah, that's why it was highlighting). Then we do fp.write and write our text data; but text_data is a list, and what we want to do is join every item in that list separated by a newline character. So we write that, and it creates our file.

Now, at this point we've created a file, but our text_data still has 5,000 items in it, and we're going to keep looping through and populating it with even more items. So what we need to do now is reinitialize, or empty, the text_data variable so that it's empty again and can start counting from zero all the way up to five thousand again. Also, at this point we're saving our file, which would initially be file_0.txt, but if we loop through and save again it's still going to be zero, so we need to make sure we're increasing file_count; otherwise it stays the same and we just overwrite the same file over and over again. What you can also do, if you want, is add another with open(...) block after the loop, just in case there are any leftover items at the end that haven't been saved into a neat 5,000-sample chunk. I'm not going to do that now, but you can add it if you want to. It looks pretty good; the only thing we do need here is to make sure the encoding is utf-8, otherwise I think we'll get an error if we miss that.

So that will, or should, create all of our data. Let me open the directory here on the left: we have this empty oscar directory, I'm going to run this, and we should see it get populated. It's pretty quick; there we go, we're building all these plain text files, and if we open one we see that we get all of these rows, where each row is a new sample, and as you can see it's all Italian. So that's our data, it's ready, and we can move on to actually training the tokenizer.

The first thing we need to do is get a list of all those files so we can pass them to our tokenizer. To do that we'll use the pathlib library: from pathlib import Path, and then [str(x) for x in Path('oscar').glob('*.txt')]. Here we need to specify the directory where our files will be found, which is just oscar, and at the end we add this glob pattern. In our case we don't strictly need it, because there are no other files in that directory besides the text files, but it's good practice: just in case there is anything else in there, we can use the pattern to say, within this directory, select only the text files.
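Here's a sketch of that whole file-building loop in one place; the file_{n}.txt naming is just how I'm writing it here, and the final flush for leftover samples is the optional extra block mentioned above:

```python
import os
from tqdm.auto import tqdm  # progress bar over the dataset

os.makedirs("oscar", exist_ok=True)

text_data = []
file_count = 0

for sample in tqdm(dataset):
    # Remove embedded newlines so one output line corresponds to one sample.
    text_data.append(sample["text"].replace("\n", " "))
    if len(text_data) == 5_000:
        # Write the current 5,000-sample chunk to its own plain-text file.
        with open(f"oscar/file_{file_count}.txt", "w", encoding="utf-8") as fp:
            fp.write("\n".join(text_data))
        text_data = []
        file_count += 1

# Flush any leftover samples that didn't fill a full chunk.
if text_data:
    with open(f"oscar/file_{file_count}.txt", "w", encoding="utf-8") as fp:
        fp.write("\n".join(text_data))
```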
OK, then let's have a look: in paths we should have all of our files, and let's see how many of those we have. We have 10 of them, so 50,000 samples in total, because we have 5,000 in each file. So now let's initialize our plain tokenizer. We want to do from tokenizers import BertWordPieceTokenizer; if you don't have tokenizers installed it's super easy, all you have to do is pip install tokenizers. Again, this is another Hugging Face library, like transformers or datasets, which we used before. So we load that, and then we initialize our tokenizer with BertWordPieceTokenizer, and in here we have a few different arguments which are useful to understand.

The first one is clean_text: this removes obvious characters that we don't want and converts all whitespace into spaces, so we can set that to True. We have handle_chinese_chars, which I'll leave as False; what it does is, if it sees a Chinese character in your training data, it adds spaces around that character, which, as far as I know, allows those characters to be better represented when tokenizing. I assume so, but obviously I don't know Chinese and I've never trained anything in Chinese, so I can't say for sure; that's what it does, though. Then there's strip_accents, and this is a pretty relevant one for us: it would convert an accented character like è into a plain e. Obviously for Romance languages like Italian those accents are pretty important, so we don't want to strip them, and we set it to False (it's strip, not string, by the way). The final one is lowercase: if we want an uppercase character to be treated as equal to its lowercase version, we set lowercase to True. For me, I'm happy to have capital characters treated as equal to lowercase characters, that's completely fine. So that initializes our tokenizer.

Now we train it: tokenizer.train(). In here we first need to pass our files, which is the paths list we built up above, so we're training with those. We want to set the vocab_size, which is the number of tokens we can have within our tokenizer; it can be fairly small for us because we don't have that much data in there. We set min_frequency, which I initially thought meant the minimum number of times a token must be found in the data for it to be added to the vocabulary, but it's not; it's actually the minimum number of times the trainer must see two different tokens or characters together in order for it to consider them, merged together, as a token by themselves. Typically I think people use 2 for that, which is fine. Then special_tokens: these are the special tokens used by BERT, so for that we'll have the padding token, the unknown token, the classifier token, which we put at the start of every sequence, the separator token, which we put at the end of a sequence, and the mask token, which is pretty important if we're training that core model. We also have limit_alphabet, which is the number of different single characters we can have within our vocab, and we'll go with 1,000. Finally there's wordpieces_prefix, which is what we saw before in the example: the two hashes, ##, which, like I said, just indicate a piece of a word rather than a full word.
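A sketch of the initialization and training call described above; the vocab_size of 30,000 is a placeholder value, so adjust it to suit your data:

```python
from pathlib import Path
from tokenizers import BertWordPieceTokenizer

# Gather every plain-text training file we wrote earlier.
paths = [str(x) for x in Path("oscar").glob("*.txt")]

tokenizer = BertWordPieceTokenizer(
    clean_text=True,             # drop control characters, normalize whitespace
    handle_chinese_chars=False,  # don't add spaces around CJK characters
    strip_accents=False,         # keep accents; they matter for Italian
    lowercase=True,              # treat upper- and lowercase as equivalent
)

tokenizer.train(
    files=paths,
    vocab_size=30_000,           # placeholder; total number of tokens in the vocab
    min_frequency=2,             # a pair must be seen together at least twice to merge
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    limit_alphabet=1000,         # max number of distinct single characters
    wordpieces_prefix="##",      # prefix marking a piece of a word
)
```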
And that should be it, actually; I don't think there's really anything else that's important for us. So we'll train that, and hopefully it will work. Again, this will take a little bit of time, even with our smaller dataset. Let's see what it's showing us... I don't know why, but I think I need to install something here, because I just get this blank output; I think it's supposed to be a loading bar. Then what we do at this point is save our new tokenizer. I'm going to save it as new_tokenizer: I just write tokenizer.save_model, and that is going to go to the new_tokenizer directory. That will save a vocab.txt file.

Let's have a quick look at what that has inside. We come over here, we have new_tokenizer/vocab.txt, and in it we can see all of our tokens. You can actually see how we used that limit_alphabet value: there are 1,000 single-character tokens, beginning at row 6 and stopping at row 1005, so those are the single characters that are allowed within our tokenizer. Now, the OSCAR data is just pulled from the internet, so you do get a lot of random stuff in there; we even have some Chinese characters, although we're dealing with Italian. But if we come down here we start to see some of those Italian word pieces, and these are the tokens. Our tokenizer is going to read our text and split it out into these tokens, and then the next step is to convert those tokens into token IDs, which are represented by the row numbers of those tokens. So if "fin" is in the text, the tokenizer would replace it with the "fin" token, and then it would replace the "fin" token with its row number, 2201.

Now let's see how that works. First, how do we load that tokenizer? We do it as we normally would: from transformers import BertTokenizer, and then tokenizer = BertTokenizer.from_pretrained(), where all we do is point it to where we saved it, the new_tokenizer directory, and that should load. First I'm going to tokenize "ciao! come va?", which is just "hi! how are you?", and we see that we get these token IDs. We have the number 2 here, which, if you remember, in our vocab text file at the top we had our special tokens, and at row number 2 we had the CLS token, the classifier token, which we always put at the start of the sequence; at the end we also have this 3, which is the separator token. Now, if we go and open the vocab file that we built, new_tokenizer/vocab.txt, and read it in with fp.read(), we want to split it on newline characters, because every token is separated by a newline. Then we can have a look at the special tokens: we have padding, unknown, CLS at position number 2, and the separator at position number 3. So if I index position 2 we get CLS, which aligns with what we have here. So what we can do is take all of these ID values and use them to look tokens up in this vocab. We could do it using tokenizer.decode as well, by the way, but I'm going to do it by indexing into the vocab file that we built, just to show that that's what the vocab file actually is and how it's used.
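And a sketch of the save, reload, and vocab-lookup steps described here; new_tokenizer is just the directory name I'm using, and the exact IDs you see will depend on your trained vocabulary:

```python
import os
from transformers import BertTokenizer

# Save the trained tokenizer; this writes vocab.txt into the directory.
os.makedirs("new_tokenizer", exist_ok=True)
tokenizer.save_model("new_tokenizer")

# Load it back with the standard transformers BertTokenizer.
bert_tokenizer = BertTokenizer.from_pretrained("new_tokenizer")

encoded = bert_tokenizer("ciao! come va?")
print(encoded["input_ids"])  # starts with 2 ([CLS]) and ends with 3 ([SEP])

# Map each ID back to its token by row index in vocab.txt, the same lookup
# the tokenizer performs in reverse when it encodes text.
with open("new_tokenizer/vocab.txt", encoding="utf-8") as fp:
    vocab = fp.read().split("\n")

print(" ".join(vocab[i] for i in encoded["input_ids"]))
```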
So if I write that out: I take the encoding, access its input_ids, and then say for i in that list, print vocab[i], with a space at the end. Then we get this: the CLS starting classifier token, "ciao", an exclamation mark, "come", "va", a question mark, and the separator token, which marks the end of our sentence, which I think is pretty cool.

Now let's try that with something else. Something I say a lot when in Italy is "ho capito niente", which means "I understood nothing", which is very useful. If I print that out, we see that we get CLS followed by the words of the phrase. What we're seeing here are full words; we're not seeing any word pieces. So let me find something that will hopefully return a word piece: "responsabilità". Let me try this; it will hopefully return a few word pieces. Yes, there we go: we see it gets split up into different word pieces, so it's separated into not just a single token but four tokens, which is pretty cool.

Now, I think that's it for this video. I don't think there's anything else we need to cover; that's pretty much everything we really need to know for building a WordPiece tokenizer to use with BERT. So, thank you very much for watching, and I will see you in the next one. Bye!
Info
Channel: James Briggs
Views: 387
Rating: 5 out of 5
Keywords: python, machine learning, data science, artificial intelligence, natural language processing, bert, nlp, nlproc, Huggingface, Tensorflow, pytorch, torch, programming, tutorials, tutorial, education, learning, code, coding
Id: cR4qMSIvX28
Length: 31min 20sec (1880 seconds)
Published: Tue Sep 14 2021