Examining a spaCy Model in the Folder (Named Entity Recognition for DH 05)

Captions
Hello, and welcome back to this series on Python and spaCy named entity recognition for digital humanities. In the last video we looked at how to take the training set we cultivated with the spaCy EntityRuler and train a custom Harry Potter NER model, and we saw that we added an NER component, or pipe, to the pipeline of a blank spaCy model. In this video we're going to step away from writing code and move into something a bit more conceptual: analyzing what spaCy models look like under the hood.

To do that, we'll write a little code first. We say nlp = spacy.load() and load in one of the standard models to demonstrate what it looks like; because I don't want it to take up a lot of space, I'm going to use the small model here. Then I'm going to call nlp.to_disk() and save it as en_core_web_sm_demo, just so it doesn't overwrite the actual small model. This is going to be important for demonstrating why our model is good and also where it can be improved.

To demonstrate that a little better, I'm going to jump ahead to some code I've prepared for video 6 in this series, where we look more closely at the problems with off-the-shelf spaCy models when it comes to named entity recognition for domain-specific tasks. Here we have some code that prints off all the people our model finds in a test text and compares them to all the people found by the spaCy large model, the really good, roughly 780 MB model that spaCy ships. It takes a second to run, because while our model is small, those 780 megabytes take a moment to load.

If we look at the two result lists, the first entries are pretty similar. Both models have grabbed Harry James Potter, both have grabbed James, and both have grabbed Lily Potter. Ours, however, has also grabbed the word "phoenix", and I'm not entirely sure why; perhaps something in the training set had "phoenix" in it, maybe because "Fawkes the phoenix" was a pattern in our EntityRuler. I'd have to go back and check. As we go down the list, though, our model starts to perform a little better on certain domain-specific things. The spaCy model correctly identified Neville Longbottom and Harry, but here it identified "Voldemort" on its own and missed the entity "Lord Voldemort". I've already gone through to make sure the numbers line up: "Lord Voldemort" appears right here, and the spaCy lg model missed it entirely, though it did find this other one. Our model found "Lord Voldemort" as a single entity, which is correct, because that is a single person and he is referred to as Lord Voldemort consistently throughout the book, and we also see "Voldemort" further down the line.

This makes me think that maybe I should include in my EntityRuler things like "He-Who-Must-Not-Be-Named", the special cases that refer to Voldemort. I'm not sure those were in our training set; I'd have to go back and look. That is exactly what you do when you're training an NER model: you look at the results and think about what adjustments would make it perform better. Your goal is to get as close to 100% accuracy as you can, but that's going to be unobtainable. Sitting around 95% accuracy is considered quite good, and at 97% you're looking at a state-of-the-art domain-specific model.
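A minimal sketch of the steps described above, assuming spaCy v2 as used in this series. The model names en_core_web_sm and en_core_web_lg are real spaCy packages, but the custom model path "hp_ner_model" and the file "test_text.txt" are hypothetical stand-ins for the ones used in the video:

```python
import spacy

# Save a copy of an off-the-shelf model so we can inspect it on disk
# without touching the installed package.
nlp = spacy.load("en_core_web_sm")
nlp.to_disk("en_core_web_sm_demo")

# Compare PERSON entities from a custom model and the large model.
# "hp_ner_model" and "test_text.txt" are hypothetical placeholders.
custom_nlp = spacy.load("hp_ner_model")
large_nlp = spacy.load("en_core_web_lg")

with open("test_text.txt", encoding="utf-8") as f:
    text = f.read()

custom_people = [ent.text for ent in custom_nlp(text).ents if ent.label_ == "PERSON"]
large_people = [ent.text for ent in large_nlp(text).ents if ent.label_ == "PERSON"]

print("custom:", custom_people)
print("large: ", large_people)
```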
So if we keep going down the list, we see that our model does vary; in fact, I think our model has identified three more entities than the spaCy model has. But the spaCy model is able to find things our model cannot. For example, it has found Cedric Diggory as a person. Why has our model failed to identify him? It comes down to two things. Cedric Diggory's name doesn't appear, I believe, anywhere in book one of Harry Potter; he isn't introduced until a little later in the series, where he becomes one of the main secondary characters. And yet the spaCy model, which has not read Harry Potter, was able to identify him. Why? Because the spaCy model was able to do what we call generalization: it generalizes well. To understand why that's possible, we need to look under the hood of spaCy.

I'm going to open this up in my actual Windows Explorer so we can explore it in more depth, because the editor view is going to be a little illegible for those of you on smaller screens. If we open the en_core_web_sm_demo folder we saved, we find a few different folders: an ner pipe, a parser pipe, a tagger pipe, and a vocab folder. These are the actual pipes in the spaCy model's pipeline: in order to perform NER, the text needs to be parsed and tagged, and then it passes to the NER component and its model. I'm going to look at a couple of these in depth.

The first thing you'll notice is the meta.json file. I'm going to pull it to the side for a second and open it. It is clearly a dictionary: it starts with an opening brace and ends with a closing brace, standard JSON format. Up at the top you have a couple of important pieces of metadata: the language the model is in, the name of the model, the license, the author, a URL that links to the team, a description of what the model was trained on (which gives you some context for how it was trained), and the sources of that training data. Then it tells you what is in the pipeline. A pipeline is a series of steps in which data is processed or manipulated, and the further down the pipeline you go, the more the earlier stages affect what happens at the next stage. I hope that makes sense; I'm going to explain it in a lot more detail in a later video.

If you keep going down, it gives you the accuracy benchmarks for all the different things the model is able to detect, and if you keep scrolling you get to its speed and to n_words. n_words tells you the number of words in the model's vocabulary, and I want to make a very important distinction: the number of words does not necessarily reflect the number of word vectors. When you train a spaCy model, it will add new words it encounters to the vocabulary, but it won't automatically add vectors for those words; we're going to see that in just a second.
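A hedged sketch of reading that metadata programmatically rather than in a text editor. The key names follow spaCy v2's meta.json layout, and exactly which keys are present can vary by model version:

```python
import json

# Read the metadata file saved alongside the model.
with open("en_core_web_sm_demo/meta.json", encoding="utf-8") as f:
    meta = json.load(f)

print(meta["lang"], meta["name"])          # e.g. "en", "core_web_sm"
print(meta["pipeline"])                    # e.g. ["tagger", "parser", "ner"]
print(meta.get("labels", {}).get("ner"))   # entity labels, if this version records them
```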
The next thing you get to are the different labels that can be assigned at different stages of the pipeline. When we get down to the NER section, we see all the labels the model is able to recognize; I believe there are 18. You go from CARDINAL and DATE through EVENT, FAC, GPE, LANGUAGE, LAW, LOC, and so on down the list to PERSON. That tells you everything the model is able to label, and each of those labels will have its own accuracy.

Then you get to the area for the vectors, and notice that here we don't have any vectors represented. What are the vectors? If we open the vocab folder, there is a little file called vectors; you need to open it in something like Atom or another capable editor. You also have a file called strings.json. strings.json contains all of the vocabulary the model has encountered, and that is part of what allows the model, in theory, to generalize well on unseen data. If we scroll down we can see all of that, but we don't see anything in the vectors file.

So what I'm going to do is load in the spaCy lg model, which actually does have word vectors present. This takes a little longer because it's opening that 780 MB model, and it is saving it to my directory right here. Now if I open its vocab folder, I see a lot more under vectors. It takes a second to load, and you're about to see why: this is a massive text file. Those 780 megabytes largely come from this giant file of word vectors. I'm going to talk about word vectors in the next video in much more detail; for right now we're just taking a cursory look at this element under the hood of spaCy.

This is what the word vector file looks like. It's illegible to us, but it makes total sense to spaCy: it is a large matrix of word vectors, the mathematical representations of words. What we care about right now are these two pieces of information, the shape of the vector table. These two numbers are very important, and we'll see why when we start working with vectors, training a word2vec model, exporting word vectors, and importing them into spaCy. The first number, 684,831, tells you how many words are represented in the vectors: there are 684,831 different words or characters represented here. The second number tells you how deep the vectors are; in this case 300, meaning each vector has 300 dimensions. This is how the model understands word similarity, and it is instrumental in training machine learning models for NER, because good, domain-specific word vectors can radically increase the accuracy of NER models. We're going to be doing exactly that over the next few videos. If we go to our large model and pull up its vocab, we see a lot of the same stuff: all your words are in there.
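You can check the same shape from Python instead of squinting at the raw file. A small sketch, assuming en_core_web_lg is installed; the exact counts vary by model release, and the sample words are illustrative:

```python
import spacy

nlp_lg = spacy.load("en_core_web_lg")

# (number of vectors, dimensions per vector) -- roughly (684831, 300)
# for the release shown in the video.
print(nlp_lg.vocab.vectors.shape)

# The vectors are what make similarity judgments, and thus
# generalization, possible.
doc = nlp_lg("wizard sorcerer teacup")
print(doc[0].similarity(doc[1]))  # related words score higher
print(doc[0].similarity(doc[2]))  # unrelated words score lower
```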
But if we go to the metadata file for the large model, en_core_web_lg, we can see a lot more detail in the vectors area; I'll scroll down to it. Under vectors we have some very important information. The first thing is the name of the vectors, en_core_web_lg.vectors, which links up to the actual vector table under vocab. And we have three other pieces of information we didn't have with the small model: the width, which is the depth of the vectors (300 dimensions), the number of vectors, and the number of keys. Now, you will not always see vectors and keys be the same. As you train off-the-shelf NER models, the model adds new vocabulary and new keys as it encounters new words, but it won't add new vectors for those words, so that's something to pay attention to.

That's really what's happening with spaCy under the hood: these are what the files look like and what they actually mean. I encourage you to spend some time and just explore them. If you damage them it's not a problem, because you're not affecting the main spaCy model; save a copy into another directory and explore that.

You will also notice, within the ner folder (and we're going to see this a lot more in future videos), a cfg file. I'm going to open the cfg file right here. It becomes a lot more important when we start adding new vectors in, because it tells the NER model where the pretrained word vectors are. That's going to be very important when we try to get our model to generalize better on domain-specific data, in our case Harry Potter.
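In spaCy v2 these cfg files are serialized as plain JSON, so a quick way to peek at the NER component's configuration is something like the sketch below. The directory name is the demo copy saved earlier, and exactly which keys appear depends on the spaCy version:

```python
import json

# The NER pipe's cfg records component settings, including which
# pretrained vectors it expects ("pretrained_vectors" in spaCy v2;
# null for the small model, which ships without vectors).
with open("en_core_web_sm_demo/ner/cfg", encoding="utf-8") as f:
    cfg = json.load(f)

print(cfg)
```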
Once we get past all this, we're going to explore this process with real-world examples in two different domains: one for the Holocaust and another for medieval Latin. The idea is to serve two different groups of scholars who face two different kinds of problems with domain-specific data. Again, we'll get to that in future videos.

Hopefully you now have a better sense of what a spaCy model actually looks like from a folder point of view and what these files refer to. The model file, which you'll see inside the ner folder, is, as you might have guessed, the actual model. If you open it up you won't be able to make any sense of it, but it is the trained model that ships with spaCy and performs the NER. When you create a spaCy pipeline and add the NER component, that is what you're adding, and every time you train an NER model, that is what you're actually training.

That's going to be it for this video. Hopefully you have a little better sense of what spaCy models and pipelines look like in a folder setting, and I promise you, even though we didn't write a lot of code in this video, this is going to make a lot more sense and be a lot more helpful in the next few videos as we train custom models, import those word vectors, and start to do much more sophisticated and robust things with custom spaCy models. If you've liked it and found it informative, please like and subscribe down below.
Info
Channel: Python Tutorials for Digital Humanities
Views: 1,047
Rating: 4.8709679 out of 5
Keywords: python, digital humanities, python for DH, dh, python tutorial, tutorial, python and the humanities, python for the digital humanities, digital history, Python and libraries, python tutorials, python tutorials for digital humanities, spacy tutorial, spacy up close, spacy folders, named entity recognition, spacy models, spacy pipeline explained
Id: cuUf2H6uCA8
Length: 15min 6sec (906 seconds)
Published: Wed Dec 09 2020