How to Load Custom Word Vectors into a spaCy Model

Captions

Hello and welcome to this video on how to take custom word embeddings, which you may have trained in Gensim or fastText, and inject them into a spaCy model so that you can then train a custom component off of those custom word embeddings. I have a video on how to do this in spaCy 2, but I've yet to do one for spaCy 3, so I thought I'd walk you through the steps, because they are a little bit different and you need to know the difference.

What we're working with here is JupyterLab. Within this directory we've got a couple of subfolders. We've got models, where we're going to be storing our models throughout this whole process, and we've got data. Within data we have our train.spacy and our valid.spacy, which are important for the training process, and we've got our word_vectors.txt. I'll come to this in just a second.

What we want to do is take word embeddings that we've already trained and inject them into spaCy, so I'm going to show you how to go about doing that. The first thing we need to do is import spaCy. This is going to allow us to use spaCy's functions in just a little bit. The next thing we need to do is use a terminal command. Within JupyterLab you do that by using an exclamation mark. If you don't want to use JupyterLab, you can follow along and do everything in the terminal; just don't use the exclamation mark.

So we're going to say python -m spacy, which loads up the spaCy library, and we're going to run the init command. This can take a couple of different arguments, but we're going to run init vectors specifically, and that's going to allow us to pass in some other arguments as well. The first thing we're going to pass is en, and note that every time I pass a different argument I have a space. en indicates the language that we are working with, in this case English. If you're working with German,
you would use de. So this is going to be your language, the two-letter code right there. I have a space, and the next argument we're going to pass is the location of the .txt file that has our word embeddings. In this case it's going to be data/word_vectors.txt.

Let's take a quick gander at this file. It looks like a typical word2vec file. It's a .txt file, and on line one you have the number of words, so the vocabulary size, and then the dimensions, so the dimensionality of each vector. Each line after that, in this case the next 332,546 lines, is a word in the first position, followed by a space, followed by the 10 dimensions of its word embedding. This is what a typical word vector file looks like. These vectors are not amazing; I kept them to 10 dimensions so that the file could be put onto GitHub and you can reproduce these steps on your own.

So that's what our word vector file looks like, but we need a couple of other arguments here. The next thing we need to specify is where we want to drop this new model that we're going to create, with a config file and everything. We're going to drop that in models/01.
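The file layout described above can be illustrated with a small sketch. This is only a toy under stated assumptions: the helper names write_word2vec_text and read_header, the three example words, and their values are hypothetical and are not taken from the video or its GitHub repository. It writes a header line with the vocabulary size and dimensionality, followed by one line per word.

```python
from pathlib import Path
import tempfile


def write_word2vec_text(path, vectors):
    """Write a {word: [floats]} mapping in word2vec text format.

    Hypothetical helper for illustration only.
    """
    dims = {len(values) for values in vectors.values()}
    if len(dims) != 1:
        raise ValueError("all vectors must share one dimensionality")
    dim = dims.pop()
    # Line one: number of words, then number of dimensions.
    lines = [f"{len(vectors)} {dim}"]
    # Every following line: the word, a space, then its values.
    for word, values in vectors.items():
        lines.append(word + " " + " ".join(f"{x:.4f}" for x in values))
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_header(path):
    """Return (vocabulary size, dimensions) from the first line."""
    with open(path, encoding="utf-8") as handle:
        n_words, dim = handle.readline().split()
    return int(n_words), int(dim)


# Toy data: three made-up words with 10 dimensions each, mirroring
# the 10-dimensional vectors mentioned in the captions.
toy_vectors = {
    "king": [0.1] * 10,
    "queen": [0.2] * 10,
    "castle": [0.3] * 10,
}
out_path = Path(tempfile.mkdtemp()) / "word_vectors.txt"
write_word2vec_text(out_path, toy_vectors)
header = read_header(out_path)
print(header)
```

Checking the header against the real number of lines is a cheap sanity test before handing a file to spaCy, since a mismatch usually means the export was truncated.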
If we look at our directory, we have an empty directory right here. If you're pulling this off of GitHub, it will have something in it; go ahead and delete that and clean it out, and clean out models/02 as well. The next thing we need to specify is a name. This is optional, it's a keyword argument, but we're just going to call this en_test. Again, this is not necessary.

We're going to execute that command. If you're doing this in the terminal, you're going to see some things printing off at this stage. This is spaCy reading in your word_vectors.txt file and converting it into its own format. It's creating the model, and it's saving those word vectors as a vectors file within the spaCy framework. If you're doing this in JupyterLab, it all gets printed out in one go, and we can see that the results were successful: we successfully loaded the vectors from the data/word_vectors.txt file.

Great. That means that we can now load up our model, because if we go over to models/01 we see that it's now populated. We have our config file and our meta.json file, and if we go into vocab we see our vectors. If you look at it on your end, you'll see that the vectors file is not your typical one-kilobyte file that comes with a standard spaCy pipeline without static vectors; instead it's your custom vectors, now in spaCy format.

So let's go back, load up that model, and take a look real fast. We're going to say nlp is equal to spacy.load, and we're going to load in models/01, which is where our new model is. It's not really a model so much as just a spaCy pipeline with some word embeddings stored. Once we've got this open, we need to add to that pipeline the components that we want to train. In this case I'm interested in training a named entity recognizer. You could be training a classifier at this point, it can really be anything, but you're going to follow these steps:
nlp.add_pipe, and you're going to add in a component that already exists. There are about 18 of them that you can add in here. We're going to add in an NER. Once that's added, we can save this new, updated model over the old one, so in this case we're going to say nlp.to_disk, which is the command to save in spaCy, and we're going to say models/01, and that's going to save it to disk.

Let's go back now and take a look at models/01. We see that we have an NER component here, and if we open up the config file, and this is important, we'll notice that NER has been added to the pipeline automatically. So we've not only created the config file automatically by initializing the vectors, we've now added the NER component as well. This is a little easier, in my opinion, than some of the other ways that are documented in the spaCy documentation.

The next thing you need to do is train your model, in which case you need to go back to the terminal or, if you're working in JupyterLab or Jupyter Notebook, you can do it again in a cell. Our training data is already converted for you, and again I'm providing this all on GitHub so you can follow along. Under data, our train.spacy file is right here and our valid.spacy file is right here, and we're going to be calling not only those two things but also the new word vectors that we have loaded in our model.

So let's go ahead and run it. We're going to do exclamation mark, python -m spacy train, so the same thing as before, but instead of init we're going to run train. Then we're going to specify where the config file is. Our config file is located in models/01, and it's config.cfg, so we're going to say models/01/config.cfg. That's our config file. The next thing we need to specify, and again this is a keyword argument, is the output, and we're going to have
a space there and say models/02. That's where it's going to save the best and last models. The next thing we're going to pass is the path of the training data, so we can say --paths.train. It's important to have those two dashes. Then a space, and then the location of that data, so data/train.spacy. Next we need to specify the path of the dev file, or the validation set, so we're going to do --paths.dev and then data/valid.spacy. I apologize for the zoomed-in nature here; this command is going to be a little longer and a little difficult to read, so I'm going to zoom out just a little bit.

The final thing we need to pass, and this is important, is the word vectors from the original model that we already created. That model has the vectors file, which is located in models/01 under vocab. We need to grab that information, and all we have to do is specify the model; spaCy will handle the rest and automatically find those vectors. So we're going to say --paths.vectors, a space, and then models/01, so it's going to grab those word vectors from the 01 pipeline.

You're going to execute that cell, and what's happening now is spaCy is going through and doing all of its training. Again, because this is happening in JupyterLab, you're not able to see the results in real time, so I'm going to pause the video and pop back after it's done.

Our model is now done training. If we look down here, we notice that we have trained the NER component, and if we open up where we saved it, which is under models/02, we see that we have model-best and model-last. If we click on one of those, we notice we still have the NER component, and under vocab we still have our vectors. We have now successfully injected custom word embeddings into a pipeline and then added an NER component that we customized and trained. That's going to be it for
this video. As always, thank you to everyone who supports this channel on Patreon. And now, a new feature: if you want to support this channel, you can do it within YouTube by simply clicking on the channel. You'll notice we now have this Join option right here, and you can join the channel and contribute however you feel best. Members receive some custom little perks, like custom emojis and things like that. As always, I try to keep all of my content free for everyone, so being a member is really just a way to contribute to the channel and help me keep it alive. Any contributions that you give go back into this channel. I'd like to thank my first supporter and first member, Janteen Anderwani. I hope I'm pronouncing your name correctly; if I'm not, I sincerely apologize. As always, thank you everyone, and have a great day.
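The three steps walked through in the captions (initialize the vectors, add an NER component, train) can be summarized in a rough Python sketch. It is an approximation rather than the presenter's exact notebook: the function names init_vectors_cmd, add_ner_and_save, and train_cmd are hypothetical, and the paths are written with forward slashes on the assumption that they match the folder layout described in the video. The two command builders only assemble argument lists, so nothing is executed until you pass them to subprocess.run yourself.

```python
import sys


def init_vectors_cmd(lang, vectors_txt, output_dir, name=None):
    """Build the `spacy init vectors` command as an argument list."""
    cmd = [sys.executable, "-m", "spacy", "init", "vectors",
           lang, vectors_txt, output_dir]
    if name:
        # Optional name for the vectors, as noted in the captions.
        cmd += ["--name", name]
    return cmd


def add_ner_and_save(model_dir):
    """Load the vectors-only pipeline, add NER, and save it in place."""
    import spacy  # lazy import: the builders above work without spaCy

    nlp = spacy.load(model_dir)
    nlp.add_pipe("ner")
    nlp.to_disk(model_dir)
    return nlp.pipe_names


def train_cmd(config_path, output_dir, train_path, dev_path, vectors_dir):
    """Build the `spacy train` command with config overrides."""
    return [sys.executable, "-m", "spacy", "train", config_path,
            "--output", output_dir,
            "--paths.train", train_path,
            "--paths.dev", dev_path,
            "--paths.vectors", vectors_dir]


# Paths assumed from the directory layout described in the video.
init_cmd = init_vectors_cmd("en", "data/word_vectors.txt",
                            "models/01", name="en_test")
training_cmd = train_cmd("models/01/config.cfg", "models/02",
                         "data/train.spacy", "data/valid.spacy",
                         "models/01")

# Show the commands as they would be typed after `python`.
print(" ".join(init_cmd[1:]))
print(" ".join(training_cmd[1:]))
```

To run the whole workflow, one would call subprocess.run(init_cmd, check=True), then add_ner_and_save("models/01"), then subprocess.run(training_cmd, check=True), with spaCy 3 installed and the data files in place.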

Info

Channel: Python Tutorials for Digital Humanities

Views: 192

Rating: 5 out of 5

Keywords: python, digital humanities, python for DH, dh, python tutorial, tutorial, python and the humanities, python for the digital humanities, digital history, Python and libraries, python tutorials, python tutorials for digital humanities, spacy, text analysis, word vectors, spacy word vectors, spacy custom embeddings, spacy custom vectors, spacy 3 vectors, spacy 3, spacy 3 word embeddings, spacy 3 init word vecs, spacy 3 word embedding, spacy 3 use custom embeddings

Id: JmLQedi80_Y

Length: 9min 53sec (593 seconds)

Published: Tue Sep 21 2021