Sujit Pal: Building Named Entity Recognition Models Efficiently Using NERDS | PyData LA 2019

Captions
A little bit about myself: I'm Sujit Pal, from Elsevier Labs. My role is a data scientist role, mostly self-taught. Prior to this I was doing search and natural language processing, and over the last few years I have been doing some machine learning and deep learning as well. I got interested in NERDS sometime last fall, when I was doing some work related to search and knowledge graph development; I wanted to build some custom named entity recognizers, and this seemed like a nice tool for it. A disclaimer: I am neither the author nor the original maintainer of NERDS. It was built by my colleague Panagiotis from our Amsterdam office, along with a bunch of contributors, all from the same group; my name isn't among them. It was open sourced about a year and a half ago. One thing I would like to say, though: I'm not just a happy user. I picked it up and made some changes, and they all live in my fork. I'm hoping to fold them in around January or February next year, and you'll find that my fork has more functionality than the original, so that's something I wanted to mention. The agenda is broken up into five little sections, so let's go.

First, why would you even care about NER? Some of you might care because you do it for a living, but for the rest of you, it's a foundational task for any NLP pipeline, just like part-of-speech tagging, phrase chunking, or even sentence splitting. There are good pre-trained named entity recognizers available that can detect standard named entities like people, places, and organizations, but what I am really looking for is custom NERs: things I can build myself. Named entities are also useful for downstream tasks like topic modeling and coreference resolution. For myself, I have found NER useful in information retrieval, both to chunk queries into meaningful pieces and to understand query intent; I actually built a system based entirely on NER that understood query intent. The other place you need it is knowledge graph construction. A knowledge graph is typically a bunch of triples (subject, predicate, object), and the subject and object are entities, so you need to find the entities before you can do relation extraction. And if you combine the two, you get the holy grail of search: concept search, or semantic search. Those are probably the reasons you would care.

Next, a whirlwind tour of NER techniques, for those of you who are not doing NER for a living as I am. The techniques go from the traditional to the neural. Traditional ones are rules, regular expressions, and gazetteers. A rule could be something like: names are proper nouns, so those words must begin with a capital letter. Regular expressions are useful for things like dates and times in different formats. Gazetteers are just dictionaries of terms: you take the dictionary and find instances of its terms in your text. It's old school, but don't knock it; it's very, very useful. On the statistical side, for those of you who come from search, you look for sequences of words whose empirical probability is higher than what you would get by multiplying the probabilities of the individual words together; the product is what you would expect if the words occurred independently.
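That collocation idea can be sketched in a few lines of plain Python. The corpus and the score (the ratio of observed to expected probability, which is PMI without the log) are purely illustrative:

```python
from collections import Counter

def collocation_score(tokens, bigram):
    """Ratio of a bigram's empirical probability to the probability
    expected if its two words occurred independently.
    A ratio well above 1 suggests the pair behaves as a unit."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    p_pair = bigrams[bigram] / (n - 1)
    w1, w2 = bigram
    p_indep = (unigrams[w1] / n) * (unigrams[w2] / n)
    return p_pair / p_indep

tokens = "new york is big and new ideas arrive in new york every day".split()
print(collocation_score(tokens, ("new", "york")))  # well above 1
```

On a real corpus you would threshold this score (or its log, the pointwise mutual information) to pick out phrases worth treating as entities of interest.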
If the empirical probability of the sequence is much higher than that product, the words are not independent, and the sequence is probably an entity of interest. It may not be a named entity, but it is some entity of interest. There are also sequence models, of which the most popular is conditional random fields (CRF), which we will talk about shortly. On the neural side there is the bidirectional LSTM, which is the basic building block of neural NER as well as a lot of other things, even part-of-speech tagging; but most real neural models for NER also include a CRF head, where CRF again stands for conditional random fields. And since the beginning of this year there has been a bunch of transformer-based models. The ones shown in orange on the slide are the ones NERDS supports out of the box; we'll talk about those a little more later.

One other thing I wanted to bring up (and I apologize to those of you who are already into NER and know all this) is the standard data format we use, called BIO tagging. BIO stands for begin, in, or out. In the example on the slide, a hypothetical tagger has found two entities, "Barack Obama" and "United States". In BIO tagging, "Barack" is tagged as beginning of a person (B-PER) and "Obama" as inside a person (I-PER), and similarly with "United" and "States"; every other token is tagged O, meaning it is outside any entity. There is a variant called BILOU, where B, I, and O mean the same, U stands for unit (a single-token entity), and L stands for last. In the BILOU scheme, "Barack" would be B-PER and "Obama" would be L-PER.
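To make the scheme concrete, here is a minimal sketch of decoding BIO tags back into entity spans (plain Python; the tags follow the example above, and the helper name is mine, not NERDS's):

```python
def bio_to_spans(tokens, tags):
    """Collapse BIO tags into (entity_type, start, end) spans,
    with end exclusive, in token offsets."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((etype, start, i))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]  # tolerate an I- without a preceding B-
    return spans

tokens = ["Barack", "Obama", "visited", "the", "United", "States"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))  # [('PER', 0, 2), ('LOC', 4, 6)]
```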
The other thing I wanted to point out is the CoNLL format, based on the CoNLL shared tasks, where the tagged data is represented in a two-column vertical format: the first column is the token, the second column is the tag, the columns are tab separated, and at the end of each sentence there is an empty line to mark the sentence boundary. The reason I bring this up is that this is the standard format I'm going to have NERDS use.

So, gazetteers, the first traditional technique we talked about. Most gazetteers in use today are built on the Aho-Corasick algorithm, first published in 1975 and still being used. Essentially, you take your dictionary and build a trie-like data structure from it, and then you stream your content in a single pass against that structure. Think of it as starting at the root node and finding tokens it can navigate along: for "Barack Obama" it navigates down through "Barack" and then "Obama", and when it cannot navigate any further it reports that it found a PER and returns to the root node. The nice thing about this is that it is a single-pass algorithm: regardless of the size of the dictionary, you only make one pass over the data.
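The dictionary matching described above can be sketched like this. This is a simplified version, without the failure links that make the real Aho-Corasick automaton strictly single-pass, but it shows the trie walk and the longest-match reporting:

```python
def build_trie(gazetteer):
    """Build a token-level trie from {phrase: entity_type}."""
    root = {}
    for phrase, etype in gazetteer.items():
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node["$type"] = etype  # marks the end of a dictionary phrase
    return root

def match(tokens, root):
    """Left-to-right scan; at each position, walk the trie as far as
    possible and report the longest dictionary phrase found there."""
    found, i = [], 0
    while i < len(tokens):
        node, j, last = root, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if "$type" in node:
                last = (node["$type"], i, j)  # longest match so far
        if last:
            found.append(last)
            i = last[2]  # resume after the match
        else:
            i += 1
    return found

gaz = {"Barack Obama": "PER", "United States": "LOC"}
trie = build_trie(gaz)
print(match("Barack Obama visited the United States".split(), trie))
# [('PER', 0, 2), ('LOC', 4, 6)]
```

For production use you would reach for a real Aho-Corasick implementation, which adds failure links so that overlapping candidates never force a rescan.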
The second technique is conditional random fields. The easiest way to think of a CRF is as a sequence version of logistic regression. Given a sequence of tokens, you can compute features for each token and predict the tag for each token independently, as in ordinary logistic regression. Now consider that you can also look at the previous label, the next label, and in fact any token in the sentence: the entire sequence is available to you, and you want the best labeling of tags over the whole sentence. The way you do it is to create feature functions, where each feature is a function of the sentence, the current position, the previous tag, and the current tag. You weight the features and sum them up for each token, then sum the token scores over the entire sentence, and you learn the weights using gradient descent. That's typically the approach. And for those of you who have struggled with HMMs and concluded that a CRF must be even harder: it's not true. That was me, I used to think that too.

Now the neural models. The simplest is the BiLSTM: basically two recurrent neural networks going in opposite directions, one reading front to back and the other back to front. Of course, in a neural network it's not just words going in: you create embeddings, so each input token becomes a vector, and on the output side you get a vector of probabilities over the tags, of which, in the example, B-PER has the highest probability. The nice thing is that unlike the CRF, where you have to do manual feature engineering, there is no manual feature engineering here; you trade compute and data for the extra manual work. Again, as I mentioned, real-world NERs have a CRF layer on top: everything is the same except for the additional CRF head, and the features this CRF consumes are the hidden states coming out of the two RNNs. The other thing I noted on the slide is pre-trained embeddings: you can either learn the embeddings from scratch, starting from random vectors that the training adjusts, or you can pre-populate them from something like word2vec.
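Back to the CRF's manual features for a moment: the per-token feature function a CRF consumes can be sketched like this (plain Python, modeled loosely on the feature dictionaries used by CRF toolkits such as sklearn-crfsuite; the exact feature set is illustrative):

```python
def token_features(tokens, i):
    """Features for tokens[i], drawing on the whole sentence:
    the token itself, its shape, and its neighbors."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalization cue for names
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],             # word-shape signal ("ly", "ing", ...)
        "prev.word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next.word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

tokens = "Barack Obama visited Chicago".split()
print(token_features(tokens, 0)["word.istitle"])  # True
print(token_features(tokens, 0)["prev.word"])     # <s>
```

In a real CRF pipeline you would compute these dictionaries for every token of every sentence and hand them, together with the tag sequences, to the trainer, which learns the weight for each feature.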
You look up the embedding for, say, "Barack", put it in, and training modifies it over time; that is what "pre-trained embeddings are observed to improve performance" means on the slide, and it is something to keep in mind if you are actually building one of these.

The next step beyond word embeddings, which consume words and predict tags, was to expose the signal in the characters themselves. Why? First, if you are looking up words in an embedding table, there may be words that are not in your embedding dictionary at all. Second, there is signal in the word structure itself: words that end in "ly", or start with "pre" or "post", mean something ("prepone", "postpone", that sort of thing). So the model takes a word like "United", decomposes it into a sequence of characters, and uses a neural network to extract a vector from them; the word vector and the character vector are then concatenated and fed into the BiLSTM. It is an additional component, and again all the weights are learned end to end. In NERDS, the neural network model can optionally take this character-level input. One thing to keep in mind: this is mainly useful with the older style of embeddings like word2vec and GloVe. Subword embeddings like fastText already contain much of the information the character model would capture, because they are built from character n-grams.
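For reference, the character n-grams fastText works from look like this (a minimal sketch in plain Python; fastText itself uses a range of n-gram lengths and hashes them into buckets):

```python
def char_ngrams(word, n=3):
    """Character trigrams of a word, with the boundary markers fastText
    adds; a word's vector is the sum of its n-gram vectors, so even
    out-of-vocabulary words get a representation."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("united"))  # ['<un', 'uni', 'nit', 'ite', 'ted', 'ed>']
```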
The next step after that is to replace the static word embeddings with contextual word embeddings. There is a model in NERDS that uses ELMo: you pass the entire sentence to ELMo and it gives you back a sequence of vectors, one per word. Think of a word like "bank": it could be a river bank, or a place where you put money. With GloVe or word2vec, the embedding returned is exactly the same in both cases; with ELMo, or any other contextual embedding system, the embedding depends on the sense in which the word is used. So this is another incremental improvement.

And finally, of course, we have BERT and its variants. There are two ways to use BERT. One is simply to replace the ELMo component with a BERT component, so you feed pre-trained BERT vectors into the BiLSTM-CRF; that is the easy way. The other way is to recognize that BERT itself is a language model trained on large amounts of text: it has learned the structure of sentences, and it has learned which sense a word is being used in. If you fine-tune that language model with your own data, which is basically tokens and tags, you end up with a model that is itself an entity extraction tool. That one is not in NERDS yet, but I'm working on it, and it's coming soon, maybe January or so. (Audience question.) Yes, on the BERT side all of this is built in; you are essentially saying "here is my language model", and there is no CRF. It's like a sequence-to-sequence model, except that everything comes in in parallel; it's an end-to-end kind of model.
(Audience question: what will be the algorithm you feed this data into?) Essentially what I'm doing is fine-tuning the model. The model is already a language model; it knows how to work with, let's say, English. I now fine-tune it to take in tokens and return tags: a slightly different task, but it still leverages the structure of the sentence. I haven't done this yet, but it's in the works, most likely in the January or February timeframe.

Here is some more information. I like this high-level overview, which is a series of blog posts; the rest are survey papers, slightly old but with a lot of coverage. I will share the slides, so you don't have to take pictures.

So, back to NERDS, and what the architecture is. According to NERDS's own README.md, it's a framework that provides easy-to-use NER capabilities for data scientists. How does it do that? It wraps popular third-party NER models; it doesn't have to be third party, you could write your own models and call them through NERDS, and new tools, your own or third-party, can be added. I think of it less as a data science thing and more as a software engineering thing: it simplifies data science tasks for people doing NER. And basically I am here because if you download it, use it, like it, and give it stars, we get support, and it is easier for us to tell management that we should put more effort into it. Bug reports, contributions, and ideas are also totally welcome.

How does it do this? A bit like a government, or the EU for instance: it caters to the lowest common denominator. We have a common data format.
Each of the wrapped models has its own data format that it consumes and returns, and NERDS has converters that translate the common format into each model's native format on the way in and back again on the way out. So you have one common API that you, as a data scientist, talk to, and internally it works across the different models.

The benefits: it's a consistent API. All models subclass the base NER model, which gives you four things: fit is for training, predict is for predicting, save persists your model to disk, and load retrieves it. It's very scikit-learn-like, although, as I'll tell you later, it is not exactly scikit-learn compatible. One big advantage for you as a data scientist is that you have to do your data preparation only once and can reuse it across different models. We have also provided some reusable training and evaluation code, which I am trying to get rid of in favor of scikit-learn's own. There is a familiar scikit-learn-like API, fit and predict and whatnot, and because all the models share the same API, Python's duck-typing idea (if it walks like a duck, it is a duck) lets us build ensembles out of these models, since you can call them all the same way.

The last one is probably dearest to my heart. If you have domain experts building your NER data for you, they will constantly come back and ask, "how much do you need?", and honestly you don't know; in most cases it depends on how difficult the task is. This gives you a nice way to find out: you say, "give me a thousand sentences", you run it, and you see what you get.
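The shape of that four-method contract can be sketched roughly like this. The class and method names below are hypothetical, and a toy dictionary-backed model stands in for the real wrapped models:

```python
import pickle

class BaseNERModel:
    """Minimal fit/predict/save/load contract, scikit-learn style."""
    def fit(self, X, y):            # X: lists of tokens, y: lists of tags
        raise NotImplementedError
    def predict(self, X):
        raise NotImplementedError
    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self, f)
    @staticmethod
    def load(path):
        with open(path, "rb") as f:
            return pickle.load(f)

class ToyDictionaryNER(BaseNERModel):
    """Memorizes which tag each non-O token was seen with in training."""
    def fit(self, X, y):
        self.lexicon = {}
        for tokens, tags in zip(X, y):
            for token, tag in zip(tokens, tags):
                if tag != "O":
                    self.lexicon[token] = tag
        return self
    def predict(self, X):
        return [[self.lexicon.get(t, "O") for t in tokens] for tokens in X]

X = [["Barack", "Obama", "spoke"]]
y = [["B-PER", "I-PER", "O"]]
model = ToyDictionaryNER().fit(X, y)
print(model.predict([["Obama", "laughed"]]))  # [['I-PER', 'O']]
```

Because every model exposes the same surface, the duck typing mentioned above works: ensemble code can call fit and predict on any of them without knowing which concrete model it holds.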
If it's not good enough, you either go back and say the tagging is not right, or you ask for more data. So it helps you baseline your data really fast.

Okay, so we talked about the common data format, and it's a really good idea, but when I started looking at the code I wasn't able to make the BiLSTM model run. The reason is that the old code was written against an older version of the Anago library, a third-party tool, and Anago had changed its API in the meantime. Even after I went in, I still could not make it run: at the end of five epochs it would give me some timestamp error, and I tried to fix it but couldn't. Ultimately, in frustration, I wrote my code against the third-party tool directly, taking the NERDS layer out in the process, and it worked. Then I started looking around and realized that the native format Anago consumed was exactly the same format a second model consumed, so two out of the four models already shared the same, simpler native format. The remaining ones were similar but span oriented: instead of a list of tokens and a list of tags, they took a list of spans as the labels. So I figured it was fairly easy to write converters for the ones that did not conform to this standard, and that's how I came up with the new format; I think it's simpler. My fork basically uses this simpler data format, and as a result I also got a new model for free: between the time NERDS was open sourced and the time I started looking at it, the Anago project had created a new ELMo-based NER model, and I could hook it up; it was like a five-minute job.
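One of those converters, from span-oriented labels back to token/tag lists, might look like this (a sketch; NERDS's actual converters will differ in detail):

```python
def spans_to_bio(tokens, spans):
    """Convert (entity_type, start, end) spans (end exclusive, token
    offsets) into a BIO tag list aligned with tokens."""
    tags = ["O"] * len(tokens)
    for etype, start, end in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

tokens = ["Barack", "Obama", "visited", "the", "United", "States"]
spans = [("PER", 0, 2), ("LOC", 4, 6)]
print(spans_to_bio(tokens, spans))
# ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'I-LOC']
```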
Now I'm going to walk through the NERDS usage pattern. For that I took data from the BioNLP 2004 bio-entity recognition task. It has five different kinds of entities, all tagged in BIO format and laid out in CoNLL format. I had about five hundred thousand training examples and about a hundred thousand test examples, and the entity distribution over DNA, RNA, cell line, cell type, and protein is, as you can see, quite unbalanced, although that does not hurt the good NERs much.

If you look at the calling pattern, it's actually quite simple. You instantiate your NER model, you train it, you can optionally save it and load it back (which I did, just to test the whole round trip), and then you take your trained model and generate predictions. It's about three lines, and then you can generate your classification report; there are some adapter functions of mine involved, but the classification report itself comes from scikit-learn. You can see we get about 0.65 F1 from a dictionary NER, which is not bad. Incidentally, dictionary NERs are usually the first line of defense, because most companies have ontologies and dictionaries for their domain, and that makes a very useful baseline, even though it's 1975 technology. The improvement I made here: this model had a different pipeline in the old NERDS, which I converted to the fit-to-train, predict-to-predict setup, and I also made it handle multiple entity classes. For some reason it handled only a single class, and there is no reason the Aho-Corasick automaton cannot handle more than one class at a time, so I just made that change.
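The per-model scores quoted here are span-level precision/recall/F1, and computing them from predicted and gold spans is straightforward (a sketch in plain Python; the actual report in the talk comes from scikit-learn's utilities):

```python
def span_f1(gold, pred):
    """Micro precision/recall/F1 over sets of (entity_type, start, end)."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # exact span-and-type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("protein", 0, 2), ("DNA", 5, 7), ("cell_type", 9, 10)]
pred = [("protein", 0, 2), ("DNA", 5, 6)]
print(span_f1(gold, pred))  # precision 0.5, recall ~0.33, F1 ~0.4
```

Note this is the strict convention: a predicted span only counts if both its boundaries and its type match the gold span exactly.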
The second model is the CRF NER. The improvement here is that it had a dependency on NLTK, and since I was using spaCy anyway, I just took NLTK out. The other big improvement was that the featurization of the tokens was not transparent: it was hidden inside the NER. I still keep that as the default, but in the new system, if you want to add more features, you can pass them in yourself. And if you look at the way you call it, it is identical to the dictionary NER; the only difference is that you instantiate a different model, everything else is the same, and it gives you a slightly better number, about 0.77.

The spaCy NER is based on spaCy's own model; it's not a BiLSTM-CRF, but it is a neural model, and very close to state of the art according to the documentation. Again, the calling sequence is identical, boring by now. The only change I made was to use minibatches, which makes it easier to work with larger datasets; I'm guessing the original approach was an oversight, so I just fixed that. This gives you about 0.78.

Then we have the BiLSTM-CRF NER. The only thing I did here was make it work, because it didn't; the calling sequence is the same, and you get about 0.78. The ELMo NER needs an additional static word embedding such as GloVe, which is why there is an existence check for it, but otherwise the calling sequence is again the same. This is the model I got for free, and it gives me about 0.80 F1.

And finally, here is an ensemble example. The only wrinkle is that I had already pre-trained those models, so I loaded them in, and then in the fit call I
pass in the estimators I want in the ensemble. Fit is basically for training, but since the models are already pre-trained, I pass a flag saying so and none of the usual fit parameters, so those are empty here. Then you call predict just as you normally would, and you get the results.

I wanted to show you a comparison across the different models. Notice that with three lines of code per model and no hyperparameter tuning at all, I'm getting F1 scores from 0.65 to 0.80, which is pretty good as a baseline; this can give you a very good starting point for your work. Even across entity types (this is from the ELMo model), the F1 scores range from 0.61 to 0.81. The idea is that you can leverage existing functionality and create your custom NERs really, really fast.

Future work: I have a bit. As I mentioned, the current API is only superficially scikit-learn-like, and it has issues with serialization when parallelizing with joblib's Parallel; essentially I cannot train the models in parallel, because they will not serialize correctly. It's not a big deal, because most of the time you train once and then just use the model for prediction. But if I can make it genuinely scikit-learn compatible, I can eliminate my own ensemble NER and use the scikit-learn voting classifier instead, and I can also leverage the randomized search and grid search meta-estimators from scikit-learn. That's one big structural change I want to make. I also want to add flair-based and BERT-based NER; the BERT-based NER is the "coming soon" one I mentioned.
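Going back to the ensemble for a moment: the simplest way to combine per-token predictions from several models is a majority vote (a sketch; NERDS's actual ensemble logic may break ties or weight models differently):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one tag sequence per model for the same sentence,
    all the same length. Returns the per-token majority tag."""
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*predictions)]

model_a = ["B-PER", "I-PER", "O", "O"]
model_b = ["B-PER", "O", "O", "B-LOC"]
model_c = ["B-PER", "I-PER", "O", "B-LOC"]
print(majority_vote([model_a, model_b, model_c]))
# ['B-PER', 'I-PER', 'O', 'B-LOC']
```

Because every wrapped model answers predict the same way, the voter never needs to know which concrete models it is combining.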
The other future-work item is brat. brat is a manual annotation tool used by domain experts to hand-annotate documents, and it creates an XML output; essentially I want to be able to adapt that output directly into my own format and use it with my models. So that's the future work, and that's all I had. If you want to go to the GitHub page, you can type the URL in or point your camera at this QR code, and in the README there is a link to the slides. I'm open to questions; if we cannot handle them within the time frame, you can catch me outside or email me. Thank you. [Applause]

(Audience question: did the ensemble boost the scores?) In my case I built the ensemble out of the four models, not the ELMo one, because ELMo takes a lot of time. The scores I got were actually the best in each category, almost. So overall it boosted the result, but not on the individual classes; I got about 0.78 F1 overall, which was the same as what I got with spaCy and the BiLSTM.

(Do you consider 0.80 F1 state of the art?) Pretty close to it these days, yes, though I would want to go higher before it is really usable.

(What's the highest you've ever gotten with NER?) About 0.87, but I would really like to get to 0.90 plus.

(Two questions. First, do you have a reference for the protein results you showed?) This is not a paper; I took it from the BioNLP 2004 task. If you go to the GitHub repository, under examples and BioNLP, it has the link to the dataset.

(Second, isn't ELMo a character-based embedding?) ELMo is not character based, I think; it's contextual. But it does have its own tokenizer that breaks words into sub-words: in my example, "44th" would be broken into "44" and "th". So yes, it breaks words into sub-words, but not consistently the way fastText does. fastText breaks every word, regardless of inflection, into character trigrams, and each word is the sum of the embeddings of those trigrams; this one is more inflection based.

(What is the volume and level of effort of annotating a training set from scratch?) It depends a lot, but my ballpark, as I mentioned, is that I usually say "give me about a thousand sentences", and it could go up to five or six thousand. As for effort: we typically have teams of three or four domain experts at a time, and a batch takes them about a week to a week and a half. Thanks. [Music]
Info
Channel: PyData
Views: 2,048
Rating: 4.8688526 out of 5
Keywords: Python, Tutorial, Education, NumFOCUS, PyData, Opensource, download, learn, syntax, software, python 3
Id: ilzFiK0nAh8
Length: 34min 31sec (2071 seconds)
Published: Sun Dec 29 2019