Hi, I’m Sofie. I am one of the developers of spaCy, an open-source
Natural Language Processing library. I have been working on spaCy since November
2018, and am now one of the core contributors, combining my passion for NLP with my love
for open-source. In this video, I want to introduce you to
spaCy's Entity Linking functionality that we have recently implemented. Entity Linking is the challenge of resolving
ambiguous textual mentions to unique concepts in a knowledge base. In this video, I will first give you a short
introduction about what this challenge entails. Then, we will create custom training data
for an example use-case. Finally, we will use spaCy to train a new
entity linking model from scratch, and apply it to unseen data. We will be distributing a notebook that will
contain the relevant code, so you can follow along while playing this video. So why do we need Entity Linking? Let's first have a quick look at a related
task - Named Entity Recognition or NER. An NER component basically identifies words
in text that have a specific name and refer to real-world objects, such as people or organizations. spaCy offers pre-built Machine Learning models
that perform Named Entity Recognition for a variety of languages. Let's have a look at this script. We have already loaded a pretrained English
model in the variable `nlp`. We can now apply it to some sample text, loop
through all named entities that were identified, and print, for each of them, the original
text as well as its label; a minimal sketch of this script is shown below. When we execute it, we see that this sentence contains a Person called "Emerson" and an Organization called "Wimbledon".
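Here is a rough sketch of such a script. The model name and the example sentence are my assumptions, not necessarily the exact ones used in the video:

```python
import spacy

# Load a pretrained English pipeline with an NER component
# (assuming the large English model is installed).
nlp = spacy.load("en_core_web_lg")

# An assumed example sentence; the exact text in the video may differ.
doc = nlp("Tennis champion Emerson was expected to win Wimbledon.")

# Print each named entity together with its predicted label.
for ent in doc.ents:
    print(ent.text, ent.label_)
```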
Unfortunately, there may be many people in the world called "Emerson", and this output still doesn't tell us exactly which one is meant. This is the challenge addressed by Entity
Linking, sometimes also called Named Entity Disambiguation or Normalization. It transforms an ambiguous textual mention
to a unique identifier by looking at the context in which the mention occurs. In this specific case, the sentence gives
us important clues: Emerson is clearly a professional tennis player. Searching the internet, we can establish that
this sentence is most likely talking about Roy Emerson, an Australian tennis player,
who, according to Wikipedia, won dozens of grand slam titles. We can now resolve this entity in this sentence
to its unique identifier from WikiData, which is a free and open, interlingual knowledge
source related to Wikipedia. Its unique IDs always start with a Q, and
"Roy Emerson" has the identifier Q312545. There are many business cases in which entity
linking is critical. For instance, in the pharmaceutical domain,
I've worked on several projects that required normalization of biomedical entities such
as proteins and disease names to standard identifiers in knowledge bases like UniProt
and OMIM. As an example, by augmenting a relation extraction
pipeline with gene normalization, we could connect information across millions of research
articles to reconstruct an important signaling pathway. Or maybe you are interested in identifying
company names in news articles, finding out who the CEOs are, and how spin-offs and subsidiaries
relate. Once you are able to consolidate different
spelling variants, synonyms and abbreviations, you can combine the normalized forms of these
entities with other data you may have, and depict meaningful graphs. For instance, this graph from openthebox.be shows a fraction of the economic landscape involving Belgian chocolate; it was produced by integrating structured data with textual information from Dutch articles. I hope I've convinced you that Entity Linking
can be quite powerful. Unfortunately, it's also quite a difficult
challenge. To implement an entity linking pipeline, we
need 3 different steps. Let's use our original sentence again with
Emerson hopefully winning Wimbledon. The first step, as we already saw, is Named
Entity Recognition, in which the mention "Emerson" is labeled as a "Person". Next, the extracted mention needs to be resolved
to a list of plausible candidates. In our case, we'll consider three different
people named Emerson. Typically, this list is created by querying
a knowledge base that contains various aliases and synonyms. In the final step, we need to reduce the list
of candidates to just one final ID that represents the correct Emerson. In the remainder of this video, I will show
you how to use spaCy to create a Knowledge base that will address the second step of
candidate generation. Then, for the final step, we will create a
new Entity Linking component, and train its Machine Learning model on some annotated data. The aim of this video is to help you get started
implementing your own Entity Linking functionality with spaCy. If you want to know more about the technical
details, check out the presentation I gave at spaCy IRL 2019. So the first step to perform Entity Linking is to set up a knowledge base that contains the unique identifiers of the entities we
are interested in. In this tutorial we will create a very simple
one with only 3 entries. For most use-cases, this obviously won't provide
sufficient recall, and you'll have to think about which entities you need in your knowledge
base depending on the problem and domain you're working on. You'll need to find a balance though: the
whole of WikiData for instance has more than 80 million items, so including all of that
could be impractical and unnecessarily large. In our case, we've simply defined our 3 entries
in a CSV file - let's have a quick look at its contents. We see that each line contains a different
"Emerson": There's an Australian tennis player, an American writer, and a Brazilian footballer. Let's go back to our Python script. We've defined a helper function to load the
information from this CSV file into dictionaries that map the Q identifiers to the corresponding full names and descriptions; a sketch of this helper is shown below.
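As a sketch, assuming a comma-separated file with one QID, name, and description per row (the file name and path are my assumptions):

```python
import csv
from pathlib import Path

def load_entities(entities_loc: Path):
    """Read the CSV file and map each QID to its full name and description."""
    names = {}
    descriptions = {}
    with entities_loc.open("r", encoding="utf8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter=",")
        for row in csvreader:
            qid, name, desc = row[0], row[1], row[2]
            names[qid] = name
            descriptions[qid] = desc
    return names, descriptions

# Assumed location of the CSV file with our 3 entries.
name_dict, desc_dict = load_entities(Path("input/entities.csv"))
```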
Let's remove our print statements from before, and load the CSV data. We'll now create our knowledge base. To do so, we'll use the same `Vocab` instance
as the one from our `nlp` object. We also need to define a fixed length for
the entity vectors that will be added to this knowledge base. As we'll see in just a second, these entity
vectors embed each entity in a high-dimensional space, in our case 300-D. We loop through the entities dictionary we
defined earlier, and grab the description for each unique ID. We embed each description by using the built-in
word vectors of our `nlp` model. The `vector` attribute of a document is by
default an average of the token vectors within that document. So in practice each entity in the knowledge
base gets an encoding representing the words in its description. We won't store the actual descriptions for
performance reasons. We add each record to our knowledge base by
calling `add_entity` and providing the unique QID as the defining entity, as well as the
encoded vector as the entity vector. We also need to provide a frequency. This is just a raw count of how many times
a certain entity appears in an annotated corpus. We won't really be using this frequency in our use-case today, so we'll just set it to an arbitrary value. A sketch of these steps is shown below.
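A minimal sketch of the knowledge base construction, assuming spaCy v2's `KnowledgeBase` API; the frequency value here is just an arbitrary placeholder:

```python
from spacy.kb import KnowledgeBase

# Create the knowledge base, sharing the vocab with our nlp object and fixing the
# entity vector length to 300, the size of the model's built-in word vectors.
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

for qid, desc in desc_dict.items():
    # Encode the description: the doc vector is the average of its token vectors.
    desc_doc = nlp(desc)
    desc_enc = desc_doc.vector
    # The frequency is an arbitrary value, as we won't be using it in this tutorial.
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)
```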
Once we have all the entities in our knowledge base, we want to specify aliases or synonyms. You can think of an alias as a specific mention
in text, that needs to be disambiguated. For instance, we're pretty sure that if a
text mentions the full name "Roy Emerson", that we need to disambiguate it to "Q312545". The prior probability of this entity-alias
pair is 100%, because we assume that there are no other Roy Emersons in the world. So we add this information to the knowledge
base by first looping through all the (QID, name) pairs, and then calling `add_alias`
for each unique identifier and its full name. This function takes an alias, then a list
of possible entities to resolve to, in this case only one, and a corresponding list of
prior probabilities for each entity. In our case, the full name is 100% likely
to resolve to its associated QID. Now what about when the text just mentions
"Emerson", and not the full name, like we saw in our original example? It could be that one of our three Emersons
is more famous, and thus more likely to be the correct one, regardless of context. If you have a large annotated corpus, you
can easily deduce those statistics and use them as prior probabilities. In our case however, we assume we have no
such prior information, and we'll just set the probabilities to be equal for each entity. Note that their sum may not exceed 1. So once more we call `add_alias`, but this
time we specify 3 potential entities that are relevant to the alias "Emerson", by listing
the three QIDs in our dataset. We define the prior probabilities as 0.3 for each; a sketch of these `add_alias` calls is shown below.
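A sketch of the alias registration, again assuming spaCy v2's API:

```python
# Full names are unambiguous: each one resolves to its own QID with probability 1.
for qid, name in name_dict.items():
    kb.add_alias(alias=name, entities=[qid], probabilities=[1.0])

# The bare mention "Emerson" could refer to any of the three entities; with no prior
# information, we give each an equal prior probability (their sum may not exceed 1).
qids = list(name_dict.keys())
kb.add_alias(alias="Emerson", entities=qids, probabilities=[0.3 for _ in qids])
```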
So this will be the knowledge base that we'll be using. We can check the entities and aliases that
are contained in it by calling `get_entity_strings()` and `get_alias_strings()`. Just to double check, we'll also print the
candidates that are generated for the full name "Roy Emerson", as well as for the mention "Emerson" and for the mention "Sofie"; a sketch of these checks is shown below.
Let's go to the console and execute the script to check its output. This looks alright: we have our 3 entities
and 4 registered aliases. If we query the Knowledge Base with the full
name of Roy Emerson, we get just that one correct candidate back. If we query it with the alias "Emerson", we
get three candidates. And finally, if we query it with an alias
it hasn't seen before, like "Sofie", then it just gives an empty list. Let's go back to our script. We'll remove these print statements and instead
store our results in an output directory. We save the knowledge base by calling the
function `dump` with an output location; we'll call it `my_kb`. We also store the `nlp` object to file by calling `to_disk`; we'll call it `my_nlp`. A short sketch is below.
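A sketch of saving both objects, with the output directory name as my assumption:

```python
from pathlib import Path

output_dir = Path("my_output")
output_dir.mkdir(parents=True, exist_ok=True)

# Save the knowledge base and the nlp pipeline so other scripts can load them back in.
kb.dump(output_dir / "my_kb")
nlp.to_disk(output_dir / "my_nlp")
```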
At the console, we can see that the output directory currently only contains the CSV file we used earlier. Now, we'll execute our Python script once more to create
the knowledge base from scratch and store it. We can see that our output directory now also
contains the knowledge base and the nlp directory. As a next step, we need to create some annotated
data that will be used to train an Entity Linking algorithm on. For some use-cases, this may actually be the
first step, because the distributions in your training dataset may help you define the frequencies
and prior probabilities of your entities in your knowledge base. However, as we'll see, we can get the Entity
Linker to work even with the default values we used earlier. To create the training data, we will use our
annotation tool Prodigy, but you could generate the data in whatever tool you like. For those of you watching the video and following
along in the Jupyter Notebook, we will now leave the Notebook for a second, but we'll make sure that the final result
is available as a file that can be loaded back in, so we can continue after the data
annotation step. For this tutorial, I scraped 30 sentences
from Wikipedia that have the entity "Emerson", without first name. The raw texts are stored in a simple .txt
file. We will now use Prodigy to annotate these
texts. Prodigy is a fully scriptable annotation tool
that lets you design custom annotation workflows in Python. It is a convenient tool for a fast feedback
loop between the annotation process and the training of the actual model. In realistic projects, you'll want to iterate
on both your data and your code, with as little overhead as possible. As a data scientist, it's useful to do some
annotation yourself, as it really helps you understand the complexity of the challenge. I will walk you through the process of creating
a custom recipe in Prodigy for our Entity Linking annotation work. First, we define our Prodigy recipe with the
recipe decorator. We'll call this recipe `entity_linker.manual`,
and we'll store our results in a specific dataset, defined by the user. As input, we take the raw sentences in a `txt`
file, the NLP model we'll use for NER, the knowledge base that we constructed earlier,
and finally a file with the additional information about the entities. By convention, the recipe needs to return
a dictionary of components, including the dataset where the results are stored, and
the stream of input examples that will be fed into the annotation interface. So this is pretty much what a skeleton Prodigy recipe looks like. Now let's fill in the body of this method:
We need to process the incoming text file and define the stream of examples. First, we'll load the NLP model from file
using `spacy.load`. Next, we'll also read in the knowledge base,
using the `nlp.vocab` argument and setting an arbitrary number for the vector length
- this will be overwritten soon. Now we call `load_bulk` which reads in all
the information from the knowledge base from file. We load the nlp pipeline into an `EntityRecognizer`
object from Prodigy. We now define the annotation stream by using
the TXT loader in Prodigy, which creates a generator. We set hash IDs for uniquely identifying the
different tasks with set_hashes, and then we apply the NER model on this input stream
of sentences. As you'll recall, we need the NER results
before we can do any Entity Linking, so this takes care of that. Next, we will use the Knowledge base to generate
the options that will be presented to the annotator. Let me just speed this up a little. What this function does is take
each annotated span found by the NER algorithm, and run the `get_candidates` method of our KB
to find the list of candidate identifiers. These candidates are then added to the stream. Then we basically overwrite our old stream
with this new one, where the options are added to each task. To avoid unnecessary annotations, we remove duplicates from the stream with Prodigy's `filter_duplicates` utility. A sketch of the recipe so far is shown below.
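The following is a rough sketch of the recipe at this stage. The argument names, the `_add_options` helper, and some of the Prodigy details are my assumptions, and the exact API may differ slightly between Prodigy versions:

```python
import spacy
from spacy.kb import KnowledgeBase

import prodigy
from prodigy.models.ner import EntityRecognizer
from prodigy.components.loaders import TXT
from prodigy.util import set_hashes
from prodigy.components.filters import filter_duplicates


@prodigy.recipe(
    "entity_linker.manual",
    dataset=("The dataset to save annotations to", "positional", None, str),
    source=("The source .txt file with raw sentences", "positional", None, str),
    nlp_dir=("Path to the saved nlp pipeline", "positional", None, str),
    kb_loc=("Path to the saved knowledge base", "positional", None, str),
    entity_loc=("Path to the CSV file with entity information", "positional", None, str),
)
def entity_linker_manual(dataset, source, nlp_dir, kb_loc, entity_loc):
    # Load the nlp pipeline and the knowledge base back from disk.
    nlp = spacy.load(nlp_dir)
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=1)  # overwritten by load_bulk
    kb.load_bulk(kb_loc)
    model = EntityRecognizer(nlp)

    # Build the stream: read the raw sentences, hash the tasks, and apply the NER model.
    stream = TXT(source)
    stream = [set_hashes(eg) for eg in stream]
    stream = (eg for _, eg in model(stream))

    # For each NER span, look up the candidate QIDs in the KB and add them as options.
    stream = _add_options(stream, kb)
    stream = filter_duplicates(stream, by_input=True, by_task=False)

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "choice",
    }


def _add_options(stream, kb):
    """Add the KB candidates for each NER span as multiple-choice options."""
    for task in stream:
        text = task["text"]
        for span in task["spans"]:
            mention = text[int(span["start"]):int(span["end"])]
            candidates = kb.get_candidates(mention)
            if candidates:
                task["options"] = [{"id": c.entity_, "text": c.entity_} for c in candidates]
                yield task
```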
OK, so let's see what we've got so far. On the command line, we call `prodigy` with the recipe `entity_linker.manual` that we just created, and we give it the correct arguments. First, a dataset to store annotations in,
we'll just call it `emerson_sandbox` for now. Then, the input text with the raw sentences
in a txt file. Next, the NLP pipeline that we stored to disk
and called `my_nlp`, the knowledge base we created earlier and saved as `my_kb`, and
then the CSV file with additional information about the entities. Finally, we use the `-F` option to tell Prodigy
in which file to find our custom recipe. This command starts the Prodigy server, and
we can go to our browser to start annotating our data. We see the first sentence with the results
from the NER step. Note that in this case, the prediction of
"Emerson" as a named entity is correct, but the assigned label should actually have been
"Person" instead of "Organization". We will ignore this for now, as the focus
of this video is on the Entity Linking step, but in reality you'll want to go back to your
NER model and perhaps retrain it, if you see a lot of these mistakes. Below the sentence, Prodigy shows the 3 candidates
that were fetched from our knowledge base. These 3 unique Wikidata identifiers refer
to the 3 Emersons we saw earlier: the tennis player, the footballer and the writer. Unfortunately, I haven't quite memorized their
Wikidata identifiers, so I don't really know which is which off the top of my head. We probably want to add a little bit more
information to facilitate the annotation process. It's always a good idea to make it as easy
and straightforward as possible to do the annotation work. So let's go back to our recipe. We read in the CSV file that holds the information
about our 3 Emersons. The results are stored in a dictionary mapping
the unique IDs to the corresponding full names and descriptions. Remember that we didn't store this information
in the knowledge base itself for performance reasons. Let's create a helper function that will print
more information for each option in the annotation interface. We want to provide a link to the corresponding
Wikidata entry. The full name and description are taken from
the dictionary we just created. We can wrap this in an HTML tag, which Prodigy
supports naturally. Here we create the HTML link and show the
ID and description, though we could have also chosen to print the full name instead. We change the option type from `text` to `html`,
and insert the HTML bit that is created with our helper function for each candidate in
the Knowledge base. For completeness, we also add a few more options
that allow the annotator to specify that none of the provided links was appropriate, or
perhaps that the context was too ambiguous to make a final decision. In the dictionary of components that our recipe
generates as output, we specifically select the `choice` annotation interface, and we
set `auto_accept` to `True` to speed up the annotation work. This will automatically accept the first answer you selected, and move on to the next example in the stream. A sketch of these additions is shown below.
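A sketch of these additions, replacing the earlier `_add_options` helper. The Wikidata URL format, the NIL option IDs, and the `choice_auto_accept` config key are my assumptions about the details described here:

```python
def _print_info(entity_id, id_dict):
    """Create a short HTML blurb for an option: a link to Wikidata plus the description."""
    name, descr = id_dict[entity_id]
    url = f"https://www.wikidata.org/wiki/{entity_id}"
    return f"<a href='{url}' target='_blank'>{entity_id}</a>: {descr}"


def _add_options(stream, kb, id_dict):
    """Add the KB candidates for each NER span as options, rendered as HTML."""
    # id_dict maps each QID to (name, description), read from the entity CSV as before.
    for task in stream:
        text = task["text"]
        for span in task["spans"]:
            mention = text[int(span["start"]):int(span["end"])]
            candidates = kb.get_candidates(mention)
            if candidates:
                options = [{"id": c.entity_, "html": _print_info(c.entity_, id_dict)}
                           for c in candidates]
                # Extra options in case no candidate fits, or the context is too ambiguous.
                options.append({"id": "NIL_otherLink", "text": "Link not in options"})
                options.append({"id": "NIL_ambiguous", "text": "Need more context"})
                task["options"] = options
                yield task

# ... and in the dictionary of components returned by the recipe:
#     "view_id": "choice",
#     "config": {"choice_auto_accept": True},
```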
Now our recipe is complete, and we can start up the Prodigy server again, with the same command as last time. We store the results in a dataset called `emersons_annotated`. Again, the Prodigy server is started, and
we can start annotating in our browser. We see that each option now has a link to
the appropriate WikiData page, and shows some additional information that will make our
job as annotator much easier. Because we set `auto_accept` to `True` in
our recipe, we just have to click on the correct option and Prodigy will automatically store
it and present the next example. So let's create some manual annotations. This is actually an interesting example. There is no direct clue about whether this
Emerson is a writer, footballer or tennis player, but I do think that Queensland is
in Australia. So what we can do here is check the Wikidata
page, which has a lot of structured information as you can see. And yep - this does mention that Roy Emerson
was born in Blackbutt, Queensland, so that is the correct one. Let's speed up the remainder of the annotations. Ok, we've run through the end of the stream,
and have annotated all 30 sentences with our "Emerson" entities. We can store the annotations to file by using
Prodigy's `db-out` command, specifying the dataset we created, and directing the output to a JSONL
file. Let's have a look at the results in this file,
by printing the first line. We see that the full text of the original
sentence is stored, together with a lot of detail about the annotation task. But the most important bit is stored with
the key accept at the end: this is the value of our manual annotation. For this specific sentence and this specific
mention, we selected the option with key Q312545. This is the gold link annotation that we'll
use to train our entity linker on. If you were following along in the notebook,
you can join back in as we'll be distributing this JSONL file with the video. To feed training data into our Entity Linker,
we need to format our data as a structured tuple. The first part is the raw text, and the second
part is a dictionary of annotations. This dictionary requires the `links` keyword, which holds another dictionary mapping a character offset to the correct identifier, as in the example below.
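For example (the sentence and character offsets here are illustrative, not taken from the actual dataset):

```python
text = "Emerson was a tennis player from Australia."  # illustrative sentence
offset = (0, 7)  # start and end character of the mention "Emerson"
links = {offset: {"Q312545": 1.0}}  # 1.0 marks the correct identifier
example = (text, {"links": links})
```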
Here, `1.0` signals that it's a correct annotation. This format also allows you to include negative examples of incorrect candidates, but those instances are currently not used by the Entity Linker, so we'll leave them out for simplicity. Let's go back to our script. To have it in the IDE as a proper pipeline,
we'll do a little bit of cleaning and introduce separate functions for each step. We also load the original NLP pipeline and
KB back from file, so the functions can be run independently. From our Prodigy annotations, we have a JSONL
file which is in a different format, so we'll first convert it to the training format spaCy
needs. To do so, we parse each line by running it
through `json.loads`. Then we extract the raw text, and the answer
to double check that we hit the accept button and not reject, and finally in the "accept"
field we'll find a list of the selected answers - in our case this is always just one ID,
so we can take the first element of the list. We also parse the start and end offset of
the NER annotation, because the entity linker needs to know which tokens the link refers
to. Finally, we compile everything into the required
format of nested dictionaries, keyed by the word "links", and we add the data point to our dataset. A sketch of this conversion is shown below.
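A sketch of this conversion, assuming the fields produced by Prodigy's `db-out` as described above; the file name is my assumption:

```python
import json

dataset = []
with open("emersons_annotated.jsonl", "r", encoding="utf8") as jsonfile:
    for line in jsonfile:
        example = json.loads(line)
        text = example["text"]
        # Only keep examples where we hit "accept" rather than "reject".
        if example["answer"] == "accept":
            qid = example["accept"][0]      # the single selected option
            span = example["spans"][0]      # the NER span the link refers to
            offset = (int(span["start"]), int(span["end"]))
            dataset.append((text, {"links": {offset: {qid: 1.0}}}))
```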
To check whether the conversion looks OK, we can just print the first sample in our dataset. We can also check some statistics: how many cases of each QID do we have annotated? We'll analyse this by storing an additional
list of gold QIDs, which will come in handy later on as well. We loop through the dataset once more, and
store the gold links if their annotation was marked as `True`. Then we can easily print the counts using the `Counter` utility, as in the sketch below.
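A sketch of these checks:

```python
from collections import Counter

# Collect the gold QIDs of all correct links in the dataset.
gold_ids = []
for text, annot in dataset:
    for offset, links_dict in annot["links"].items():
        for qid, value in links_dict.items():
            if value:
                gold_ids.append(qid)

print(Counter(gold_ids))
```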
By some happy coincidence, we got exactly 10 annotated sentences for each of our Emersons. In the script, we remove the print statements
and we'll now set aside 6 cases in a separate test set. Because our dataset is so small, we don't
take a random subset, but we carefully select 2 instances of each Emerson. We start by creating two new sets: a `train_dataset`
and a `test_dataset`. We loop through the 3 unique identifiers and
for each, find the indices of the corresponding data samples. And then we add the first 8
to the training dataset, and the last 2 to the test dataset. What is extremely important is that we shuffle the datasets again after this manipulation: otherwise some signal could be found in the order of the data samples, which is never a good idea. It's best to remove that bias as early as you can. A sketch of this split is shown below.
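A sketch of this split, using the gold QIDs collected above:

```python
import random

train_dataset = []
test_dataset = []
for qid in sorted(set(gold_ids)):
    # Indices of the annotated sentences that were linked to this QID.
    indices = [i for i, gold_qid in enumerate(gold_ids) if gold_qid == qid]
    train_dataset.extend(dataset[i] for i in indices[0:8])   # first 8 sentences for training
    test_dataset.extend(dataset[i] for i in indices[8:10])   # last 2 sentences for testing

# Shuffle both sets so no signal is hidden in the order of the samples.
random.shuffle(train_dataset)
random.shuffle(test_dataset)
```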
With our datasets now properly set up, we'll first run each of our sentences in the training dataset through the nlp pipeline with the
NER component. We store the processed docs in `TRAIN_DOCS`. Then, we'll create a new Entity Linking component
by calling `nlp.create_pipe` with the key `entity_linker`, and we'll also specify that it should not
use any prior probabilities, as we did not set these to realistic values in our knowledge
base. To initialize this component, we also have
to specifically set the knowledge base with `set_kb`, otherwise it'll throw an error
if you start using it without one. Finally, we can add this component to the pipeline with `nlp.add_pipe`. A sketch of these steps is shown below.
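A sketch of these steps, assuming spaCy v2's API, where the configuration key for switching off the prior probabilities is `incl_prior`:

```python
# Run the texts through the pipeline (with NER) so the entity spans are pre-annotated.
TRAIN_DOCS = []
for text, annotation in train_dataset:
    doc = nlp(text)
    TRAIN_DOCS.append((doc, annotation))

# Create the entity linker, attach our knowledge base, and add it to the pipeline.
entity_linker = nlp.create_pipe("entity_linker", config={"incl_prior": False})
entity_linker.set_kb(kb)
nlp.add_pipe(entity_linker, last=True)
```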
Next, we will run the actual training loop for the new component. To do so, we first have to disable the other
components of the pipeline. We loop through `nlp.pipe_names`, selecting
all pipes that are not the entity linker, and then disable those with `nlp.disable_pipes`. We then create an `optimizer` object by calling
`nlp.begin_training`, and we'll run some 500 iterations. In each iteration, we shuffle the training
data, and define a set of `batches` by using spaCy's `minibatch` utility with a `compounding` schedule, which creates a virtually infinite series of compounding values, increasing the batch size from 4 up to 32. We'll keep track of the loss in a dictionary. Then for each batch, we compile the input
text and the annotations, and run this through `nlp.update`. We'll set a dropout value of 0.2 to prevent
overfitting. We also provide the `losses` dictionary and
the `optimizer` we created earlier. To check on our progress, we'll print the loss every 50 iterations, and once more after all training is done. The loop is sketched below.
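A sketch of the training loop, again assuming spaCy v2's training API:

```python
import random
from spacy.util import minibatch, compounding

# Train only the entity linker: temporarily disable the other pipeline components.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for itn in range(500):
        random.shuffle(TRAIN_DOCS)
        batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,
                annotations,
                drop=0.2,        # dropout, to prevent overfitting
                losses=losses,
                sgd=optimizer,
            )
        if itn % 50 == 0:
            print(itn, "Losses", losses)
print(itn, "Losses", losses)
```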
Finally, we write the nlp pipeline, which now contains a trained entity linker, to file; we'll call it `my_nlp_el`. For use in the next step, we also write the test dataset to file with pickle. We run the script again, and this takes a
little while to execute. We see that during execution, the loss is
consistently decreasing and approaching zero - which means that the underlying neural network
gets better at representing the training data. The final loss is pretty small, which is a
good sign. But to truly verify whether our model generalizes
well, we need to test it on unseen data. So let's test our new component. We load the trained model back in from file. Let's first apply it on our original example
sentence, where we had the Entity Emerson as a Person, and Wimbledon as an Organization. Now that we have added an Entity Linking component,
we can run the example through the nlp pipeline again, print the text and label as before, but also print the disambiguated QID predicted by our Entity Linker. We access it by calling `ent.kb_id_`; a sketch is shown below.
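A sketch of applying the trained pipeline; the output path and the example sentence are my assumptions:

```python
import spacy

# Load the pipeline that now includes the trained entity linker.
nlp = spacy.load("my_output/my_nlp_el")

doc = nlp("Tennis champion Emerson was expected to win Wimbledon.")
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)
```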
Running this, we see that Emerson gets disambiguated to Q312545, and this time I have in fact remembered that this is our tennis player, so this seems
correct. Note also that the entity "Wimbledon" gets
the annotation NIL, which is basically just a placeholder value, showing that the NEL
component could not find any relevant ID for this entity. This happens because our Knowledge base and
the Entity Linking component have only been trained on "Emerson" examples, and are thus
quite limited. You'll recall that we also set aside 6 sentences
in the test dataset. Our model never saw these during training. So let's unpickle them and see what the Entity
Linking component would predict for them. We loop through the texts and the annotations
in the test dataset, and we print the raw text as well as the gold annotation. Then we process the text with our nlp pipeline,
loop through the entities in `doc.ents`, and print the prediction if the entity is in fact "Emerson", as we can expect all the others to be NIL at this point anyway. This check is sketched below.
Let's execute this again. Because we only have 6 sentences in our test dataset, we can quickly eyeball these results. We find that 5 out of 6 predictions are indeed
correct, which is about 83% accuracy. This is not a bad result at all, considering
that random guessing would have only achieved 33%. Again, we do notice that the label of the
entity is often wrongly predicted as "ORG" instead of "PERSON", but this is a mistake
of the NER algorithm and out of scope for this tutorial. I hope this tutorial has shown you how to
implement, and use, an Entity Linking component in spaCy. The knowledge base and training dataset used
in this example were kept small for demonstration purposes, but in reality you'll want to use
a much bigger, representative set of entities, perhaps from an ontology or dictionary that
is relevant to your use-case. Thanks for watching and I hope your next NLP
project will incorporate entity linking!