Hi, I’m Sofie. I am one of the developers of spaCy, an open-source
Natural Language Processing library. I have been working on spaCy since November
2018, and am now one of the core contributors, combining my passion for NLP with my love
for open-source. In this video, I want to introduce you to
spaCy's Entity Linking functionality that we have recently implemented. Entity Linking is the challenge of resolving
ambiguous textual mentions to unique concepts in a knowledge base. In this video, I will first give you a short
introduction about what this challenge entails. Then, we will create custom training data
for an example use-case. Finally, we will use spaCy to train a new
entity linking model from scratch, and apply it to unseen data. We will be distributing a notebook that will
contain the relevant code, so you can follow along while playing this video. So why do we need Entity Linking? Let's first have a quick look at a related
task - Named Entity Recognition or NER. An NER component basically identifies words
in text that have a specific name and refer to real-world objects, such as people or organizations. spaCy offers pre-built Machine Learning models
that perform Named Entity Recognition for a variety of languages. Let's have a look at this script. We have already loaded a pretrained English
model in the variable `nlp`. We can now apply it to some sample text, loop
through all named entities that were identified, and print, for each of them, the original
text as well as its label; a minimal sketch of this script is shown below. When we execute it, we see that this sentence contains a Person called "Emerson" and an Organization called "Wimbledon".
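Here is a rough sketch of such a script. The model name and the example sentence are my assumptions, not necessarily the exact ones used in the video:

```python
import spacy

# Load a pretrained English pipeline with an NER component
# (assuming the large English model is installed).
nlp = spacy.load("en_core_web_lg")

# An assumed example sentence; the exact text in the video may differ.
doc = nlp("Tennis champion Emerson was expected to win Wimbledon.")

# Print each named entity together with its predicted label.
for ent in doc.ents:
    print(ent.text, ent.label_)
```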
Unfortunately, there may be many people in the world called "Emerson", and this output still doesn't tell us exactly which one is meant. This is the challenge addressed by Entity
Linking, sometimes also called Named Entity Disambiguation or Normalization. It transforms an ambiguous textual mention
to a unique identifier by looking at the context in which the mention occurs. In this specific case, the sentence gives
us important clues: Emerson is clearly a professional tennis player. Searching the internet, we can establish that
this sentence is most likely talking about Roy Emerson, an Australian tennis player,
who, according to Wikipedia, won dozens of grand slam titles. We can now resolve this entity in this sentence
to its unique identifier from WikiData, which is a free and open, interlingual knowledge
source related to Wikipedia. Its unique IDs always start with a Q, and
"Roy Emerson" has the identifier Q312545. There are many business cases in which entity
linking is critical. For instance, in the pharmaceutical domain,
I've worked on several projects that required normalization of biomedical entities such
as proteins and disease names to standard identifiers in knowledge bases like UniProt
and OMIM. As an example, by augmenting a relation extraction
pipeline with gene normalization, we could connect information across millions of research
articles to reconstruct an important signaling pathway. Or maybe you are interested in identifying
company names in news articles, finding out who the CEOs are, and how spin-offs and subsidiaries
relate. Once you are able to consolidate different
spelling variants, synonyms and abbreviations, you can combine the normalized forms of these
entities with other data you may have, and depict meaningful graphs. For instance, this graph from openthebox.be shows a fraction of the economic landscape involving Belgian chocolate; it was produced by integrating structured data with textual information from Dutch articles. I hope I've convinced you that Entity Linking
can be quite powerful. Unfortunately, it's also quite a difficult
challenge. To implement an entity linking pipeline, we
need 3 different steps. Let's use our original sentence again with
Emerson hopefully winning Wimbledon. The first step, as we already saw, is Named
Entity Recognition, in which the mention "Emerson" is labeled as a "Person". Next, the extracted mention needs to be resolved
to a list of plausible candidates. In our case, we'll consider three different
people named Emerson. Typically, this list is created by querying
a knowledge base that contains various aliases and synonyms. In the final step, we need to reduce the list
of candidates to just one final ID that represents the correct Emerson. In the remainder of this video, I will show
you how to use spaCy to create a Knowledge base that will address the second step of
candidate generation. Then, for the final step, we will create a
new Entity Linking component, and train its Machine Learning model on some annotated data. The aim of this video is to help you get started
implementing your own Entity Linking functionality with spaCy. If you want to know more about the technical
details, check out the presentation I gave at spaCy IRL 2019. So the first step to perform Entity Linking is to set up a knowledge base that contains the unique identifiers of the entities we
are interested in. In this tutorial we will create a very simple
one with only 3 entries. For most use-cases, this obviously won't provide
sufficient recall, and you'll have to think about which entities you need in your knowledge
base depending on the problem and domain you're working on. You'll need to find a balance though: the
whole of WikiData for instance has more than 80 million items, so including all of that
could be impractical and unnecessarily large. In our case, we've simply defined our 3 entries
in a CSV file - let's have a quick look at its contents. We see that each line contains a different
"Emerson": There's an Australian tennis player, an American writer, and a Brazilian footballer. Let's go back to our Python script. We've defined a helper function to load the
information from this CSV file into dictionaries that map the Q identifiers to the corresponding full names and descriptions; a sketch of this helper is shown below.
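As a sketch, assuming a comma-separated file with one QID, name, and description per row (the file name and path are my assumptions):

```python
import csv
from pathlib import Path

def load_entities(entities_loc: Path):
    """Read the CSV file and map each QID to its full name and description."""
    names = {}
    descriptions = {}
    with entities_loc.open("r", encoding="utf8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter=",")
        for row in csvreader:
            qid, name, desc = row[0], row[1], row[2]
            names[qid] = name
            descriptions[qid] = desc
    return names, descriptions

# Assumed location of the CSV file with our 3 entries.
name_dict, desc_dict = load_entities(Path("input/entities.csv"))
```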
Let's remove our print statements from before, and load the CSV data. We'll now create our knowledge base. To do so, we'll use the same `Vocab` instance
as the one from our `nlp` object. We also need to define a fixed length for
the entity vectors that will be added to this knowledge base. As we'll see in just a second, these entity
vectors embed each entity in a high-dimensional space, in our case 300-D. We loop through the entities dictionary we
defined earlier, and grab the description for each unique ID. We embed each description by using the built-in
word vectors of our `nlp` model. The `vector` attribute of a document is by
default an average of the token vectors within that document. So in practice each entity in the knowledge
base gets an encoding representing the words in its description. We won't store the actual descriptions for
performance reasons. We add each record to our knowledge base by
calling `add_entity` and providing the unique QID as the defining entity, as well as the
encoded vector as the entity vector. We also need to provide a frequency. This is just a raw count of how many times
a certain entity appears in an annotated corpus. We won't really be using this frequency in our use-case today, so we'll just set it to an arbitrary value. A sketch of these steps is shown below.
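A minimal sketch of the knowledge base construction, assuming spaCy v2's `KnowledgeBase` API; the frequency value here is just an arbitrary placeholder:

```python
from spacy.kb import KnowledgeBase

# Create the knowledge base, sharing the vocab with our nlp object and fixing the
# entity vector length to 300, the size of the model's built-in word vectors.
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

for qid, desc in desc_dict.items():
    # Encode the description: the doc vector is the average of its token vectors.
    desc_doc = nlp(desc)
    desc_enc = desc_doc.vector
    # The frequency is an arbitrary value, as we won't be using it in this tutorial.
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)
```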
Once we have all the entities in our knowledge base, we want to specify aliases or synonyms. You can think of an alias as a specific mention
in text, that needs to be disambiguated. For instance, we're pretty sure that if a
text mentions the full name "Roy Emerson", that we need to disambiguate it to "Q312545". The prior probability of this entity-alias
pair is 100%, because we assume that there are no other Roy Emersons in the world. So we add this information to the knowledge
base by first looping through all the (QID, name) pairs, and then calling `add_alias`
for each unique identifier and its full name. This function takes an alias, then a list
of possible entities to resolve to, in this case only one, and a corresponding list of
prior probabilities for each entity. In our case, the full name is 100% likely
to resolve to its associated QID. Now what about when the text just mentions
"Emerson", and not the full name, like we saw in our original example? It could be that one of our three Emersons
is more famous, and thus more likely to be the correct one, regardless of context. If you have a large annotated corpus, you
can easily deduce those statistics and use them as prior probabilities. In our case however, we assume we have no
such prior information, and we'll just set the probabilities to be equal for each entity. Note that their sum may not exceed 1. So once more we call `add_alias`, but this
time we specify 3 potential entities that are relevant to the alias "Emerson", by listing
the three QIDs in our dataset. We define the prior probabilities as 0.3 for each; a sketch of these `add_alias` calls is shown below.
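A sketch of the alias registration, again assuming spaCy v2's API:

```python
# Full names are unambiguous: each one resolves to its own QID with probability 1.
for qid, name in name_dict.items():
    kb.add_alias(alias=name, entities=[qid], probabilities=[1.0])

# The bare mention "Emerson" could refer to any of the three entities; with no prior
# information, we give each an equal prior probability (their sum may not exceed 1).
qids = list(name_dict.keys())
kb.add_alias(alias="Emerson", entities=qids, probabilities=[0.3 for _ in qids])
```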
So this will be the knowledge base that we'll be using. We can check the entities and aliases that
are contained in it by calling `get_entity_strings()` and `get_alias_strings()`. Just to double check, we'll also print the
candidates that are generated for the full name "Roy Emerson", as well as for the mention "Emerson" and for the mention "Sofie"; a sketch of these checks is shown below.
Let's go to the console and execute the script to check its output. This looks alright: we have our 3 entities
and 4 registered aliases. If we query the Knowledge Base with the full
name of Roy Emerson, we get just that one correct candidate back. If we query it with the alias "Emerson", we
get three candidates. And finally, if we query it with an alias
it hasn't seen before, like "Sofie", then it just gives an empty list. Let's go back to our script. We'll remove these print statements and instead
store our results in an output directory. We save the knowledge base by calling the
function `dump` with an output location; we'll call it `my_kb`. We also store the `nlp` object to file by calling `to_disk`; we'll call it `my_nlp`. A short sketch is below.
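A sketch of saving both objects, with the output directory name as my assumption:

```python
from pathlib import Path

output_dir = Path("my_output")
output_dir.mkdir(parents=True, exist_ok=True)

# Save the knowledge base and the nlp pipeline so other scripts can load them back in.
kb.dump(output_dir / "my_kb")
nlp.to_disk(output_dir / "my_nlp")
```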
At the console, we can see that the output directory currently only contains the CSV file we used earlier. Now, we'll execute our Python script once more to create
the knowledge base from scratch and store it. We can see that our output directory now also
contains the knowledge base and the nlp directory. As a next step, we need to create some annotated
data that will be used to train an Entity Linking algorithm on. For some use-cases, this may actually be the
first step, because the distributions in your training dataset may help you define the frequencies
and prior probabilities of your entities in your knowledge base. However, as we'll see, we can get the Entity
Linker to work even with the default values we used earlier. To create the training data, we will use our
annotation tool Prodigy, but you could generate the data in whatever tool you like. For those of you watching the video and following
along in the Jupyter Notebook, we will now leave the Notebook for a second, but we'll make sure that the final result
is available as a file that can be loaded back in, so we can continue after the data
annotation step. For this tutorial, I scraped 30 sentences
from Wikipedia that have the entity "Emerson", without first name. The raw texts are stored in a simple .txt
file. We will now use Prodigy to annotate these
texts. Prodigy is a fully scriptable annotation tool
that lets you design custom annotation workflows in Python. It is a convenient tool for a fast feedback
loop between the annotation process and the training of the actual model. In realistic projects, you'll want to iterate
on both your data and your code, with as little overhead as possible. As a data scientist, it's useful to do some
annotation yourself, as it really helps you understand the complexity of the challenge. I will walk you through the process of creating
a custom recipe in Prodigy for our Entity Linking annotation work. First, we define our Prodigy recipe with the
recipe decorator. We'll call this recipe `entity_linker.manual`,
and we'll store our results in a specific dataset, defined by the user. As input, we take the raw sentences in a `txt`
file, the NLP model we'll use for NER, the knowledge base that we constructed earlier,
and finally a file with the additional information about the entities. By convention, the recipe needs to return
a dictionary of components, including the dataset where the results are stored, and
the stream of input examples that will be fed into the annotation interface. So this is pretty much what a skeleton Prodigy recipe looks like. Now let's fill in the body of this method:
We need to process the incoming text file and define the stream of examples. First, we'll load the NLP model from file
using `spacy.load`. Next, we'll also read in the knowledge base,
using the `nlp.vocab` argument and setting an arbitrary number for the vector length
- this will be overwritten soon. Now we call `load_bulk` which reads in all
the information from the knowledge base from file. We load the nlp pipeline into an `EntityRecognizer`
object from Prodigy. We now define the annotation stream by using
the TXT loader in Prodigy, which creates a generator. We set hash IDs for uniquely identifying the
different tasks with set_hashes, and then we apply the NER model on this input stream
of sentences. As you'll recall, we need the NER results
before we can do any Entity Linking, so this takes care of that. Next, we will use the Knowledge base to generate
the options that will be presented to the annotator. Let me just speed this up a little. What this function does is take
each annotated span found by the NER algorithm, and run the `get_candidates` method of our KB
to find the list of candidate identifiers. These candidates are then added to the stream. Then we basically overwrite our old stream
with this new one, where the options are added to each task. To avoid unnecessary annotations, we remove duplicates from the stream with Prodigy's `filter_duplicates` utility. A sketch of the recipe so far is shown below.
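The following is a rough sketch of the recipe at this stage. The argument names, the `_add_options` helper, and some of the Prodigy details are my assumptions, and the exact API may differ slightly between Prodigy versions:

```python
import spacy
from spacy.kb import KnowledgeBase

import prodigy
from prodigy.models.ner import EntityRecognizer
from prodigy.components.loaders import TXT
from prodigy.util import set_hashes
from prodigy.components.filters import filter_duplicates


@prodigy.recipe(
    "entity_linker.manual",
    dataset=("The dataset to save annotations to", "positional", None, str),
    source=("The source .txt file with raw sentences", "positional", None, str),
    nlp_dir=("Path to the saved nlp pipeline", "positional", None, str),
    kb_loc=("Path to the saved knowledge base", "positional", None, str),
    entity_loc=("Path to the CSV file with entity information", "positional", None, str),
)
def entity_linker_manual(dataset, source, nlp_dir, kb_loc, entity_loc):
    # Load the nlp pipeline and the knowledge base back from disk.
    nlp = spacy.load(nlp_dir)
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=1)  # overwritten by load_bulk
    kb.load_bulk(kb_loc)
    model = EntityRecognizer(nlp)

    # Build the stream: read the raw sentences, hash the tasks, and apply the NER model.
    stream = TXT(source)
    stream = [set_hashes(eg) for eg in stream]
    stream = (eg for _, eg in model(stream))

    # For each NER span, look up the candidate QIDs in the KB and add them as options.
    stream = _add_options(stream, kb)
    stream = filter_duplicates(stream, by_input=True, by_task=False)

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "choice",
    }


def _add_options(stream, kb):
    """Add the KB candidates for each NER span as multiple-choice options."""
    for task in stream:
        text = task["text"]
        for span in task["spans"]:
            mention = text[int(span["start"]):int(span["end"])]
            candidates = kb.get_candidates(mention)
            if candidates:
                task["options"] = [{"id": c.entity_, "text": c.entity_} for c in candidates]
                yield task
```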
OK, so let's see what we've got so far. On the command line, we call `prodigy` with the recipe `entity_linker.manual` that we just created, and we give it the correct arguments. First, a dataset to store annotations in,
we'll just call it `emerson_sandbox` for now. Then, the input text with the raw sentences
in a txt file. Next, the NLP pipeline that we stored to disk
and called `my_nlp`, the knowledge base we created earlier and saved as `my_kb`, and
then the CSV file with additional information about the entities. Finally, we use the `-F` option to tell Prodigy
in which file to find our custom recipe. This command starts the Prodigy server, and
we can go to our browser to start annotating our data. We see the first sentence with the results
from the NER step. Note that in this case, the prediction of
"Emerson" as a named entity is correct, but the assigned label should actually have been
"Person" instead of "Organization". We will ignore this for now, as the focus
of this video is on the Entity Linking step, but in reality you'll want to go back to your
NER model and perhaps retrain it, if you see a lot of these mistakes. Below the sentence, Prodigy shows the 3 candidates
that were fetched from our knowledge base. These 3 unique Wikidata identifiers refer
to the 3 Emersons we saw earlier: the tennis player, the footballer and the writer. Unfortunately, I haven't quite memorized their
Wikidata identifiers, so I don't really know which is which off the top of my head. We probably want to add a little bit more
information to facilitate the annotation process. It's always a good idea to make it as easy
and straightforward as possible to do the annotation work. So let's go back to our recipe. We read in the CSV file that holds the information
about our 3 Emersons. The results are stored in a dictionary mapping
the unique IDs to the corresponding full names and descriptions. Remember that we didn't store this information
in the knowledge base itself for performance reasons. Let's create a helper function that will print
more information for each option in the annotation interface. We want to provide a link to the corresponding
Wikidata entry. The full name and description are taken from
the dictionary we just created. We can wrap this in an HTML tag, which Prodigy
supports naturally. Here we create the HTML link and show the
ID and description, though we could have also chosen to print the full name instead. We change the option type from `text` to `html`,
and insert the HTML bit that is created with our helper function for each candidate in
the Knowledge base. For completeness, we also add a few more options
that allow the annotator to specify that none of the provided links was appropriate, or
perhaps that the context was too ambiguous to make a final decision. In the dictionary of components that our recipe
generates as output, we specifically select the `choice` annotation interface, and we
set `auto_accept` to `True` to speed up the annotation work. This will automatically accept the first answer you selected, and move on to the next example in the stream. A sketch of these additions is shown below.
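A sketch of these additions, replacing the earlier `_add_options` helper. The Wikidata URL format, the NIL option IDs, and the `choice_auto_accept` config key are my assumptions about the details described here:

```python
def _print_info(entity_id, id_dict):
    """Create a short HTML blurb for an option: a link to Wikidata plus the description."""
    name, descr = id_dict[entity_id]
    url = f"https://www.wikidata.org/wiki/{entity_id}"
    return f"<a href='{url}' target='_blank'>{entity_id}</a>: {descr}"


def _add_options(stream, kb, id_dict):
    """Add the KB candidates for each NER span as options, rendered as HTML."""
    # id_dict maps each QID to (name, description), read from the entity CSV as before.
    for task in stream:
        text = task["text"]
        for span in task["spans"]:
            mention = text[int(span["start"]):int(span["end"])]
            candidates = kb.get_candidates(mention)
            if candidates:
                options = [{"id": c.entity_, "html": _print_info(c.entity_, id_dict)}
                           for c in candidates]
                # Extra options in case no candidate fits, or the context is too ambiguous.
                options.append({"id": "NIL_otherLink", "text": "Link not in options"})
                options.append({"id": "NIL_ambiguous", "text": "Need more context"})
                task["options"] = options
                yield task

# ... and in the dictionary of components returned by the recipe:
#     "view_id": "choice",
#     "config": {"choice_auto_accept": True},
```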
Now our recipe is complete, and we can start up the Prodigy server again, with the same command as last time. We store the results in a dataset called `emersons_annotated`. Again, the Prodigy server is started, and
we can start annotating in our browser. We see that each option now has a link to
the appropriate WikiData page, and shows some additional information that will make our
job as annotator much easier. Because we set `auto_accept` to `True` in
our recipe, we just have to click on the correct option and Prodigy will automatically store
it and present the next example. So let's create some manual annotations. This is actually an interesting example. There is no direct clue about whether this
Emerson is a writer, footballer or tennis player, but I do think that Queensland is
in Australia. So what we can do here is check the Wikidata
page, which has a lot of structured information as you can see. And yep - this does mention that Roy Emerson
was born in Blackbutt, Queensland, so that is the correct one. Let's speed up the remainder of the annotations. Ok, we've run through the end of the stream,
and have annotated all 30 sentences with our "Emerson" entities. We can store the annotations to file by using
Prodigy's `db-out` command, specifying the dataset we created, and directing the output to a JSONL
file. Let's have a look at the results in this file,
by printing the first line. We see that the full text of the original
sentence is stored, together with a lot of detail about the annotation task. But the most important bit is stored with
the key accept at the end: this is the value of our manual annotation. For this specific sentence and this specific
mention, we selected the option with key Q312545. This is the gold link annotation that we'll
use to train our entity linker on. If you were following along in the notebook,
you can join back in as we'll be distributing this JSONL file with the video. To feed training data into our Entity Linker,
we need to format our data as a structured tuple. The first part is the raw text, and the second
part is a dictionary of annotations. This dictionary requires the `links` keyword, which holds another dictionary mapping a character offset to the correct identifier, as in the example below.
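For example (the sentence and character offsets here are illustrative, not taken from the actual dataset):

```python
text = "Emerson was a tennis player from Australia."  # illustrative sentence
offset = (0, 7)  # start and end character of the mention "Emerson"
links = {offset: {"Q312545": 1.0}}  # 1.0 marks the correct identifier
example = (text, {"links": links})
```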
Here, `1.0` signals that it's a correct annotation. This format also allows you to include negative examples of incorrect candidates, but those instances are currently not used by the Entity Linker, so we'll leave them out for simplicity. Let's go back to our script. To have it in the IDE as a proper pipeline,
we'll do a little bit of cleaning and introduce separate functions for each step. We also load the original NLP pipeline and
KB back from file, so the functions can be run independently. From our Prodigy annotations, we have a JSONL
file which is in a different format, so we'll first convert it to the training format spaCy
needs. To do so, we parse each line by running it
through `json.loads`. Then we extract the raw text, and the answer
to double check that we hit the accept button and not reject, and finally in the "accept"
field we'll find a list of the selected answers - in our case this is always just one ID,
so we can take the first element of the list. We also parse the start and end offset of
the NER annotation, because the entity linker needs to know which tokens the link refers
to. Finally, we compile everything into the required
format of nested dictionaries, keyed by the word "links", and we add the data point to our dataset. A sketch of this conversion is shown below.
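A sketch of this conversion, assuming the fields produced by Prodigy's `db-out` as described above; the file name is my assumption:

```python
import json

dataset = []
with open("emersons_annotated.jsonl", "r", encoding="utf8") as jsonfile:
    for line in jsonfile:
        example = json.loads(line)
        text = example["text"]
        # Only keep examples where we hit "accept" rather than "reject".
        if example["answer"] == "accept":
            qid = example["accept"][0]      # the single selected option
            span = example["spans"][0]      # the NER span the link refers to
            offset = (int(span["start"]), int(span["end"]))
            dataset.append((text, {"links": {offset: {qid: 1.0}}}))
```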
To check whether the conversion looks OK, we can just print the first sample in our dataset. We can also check some statistics: how many cases of each QID do we have annotated? We'll analyse this by storing an additional
list of gold QIDs, which will come in handy later on as well. We loop through the dataset once more, and
store the gold links if their annotation was marked as `True`. Then we can easily print the counts using the `Counter` utility, as in the sketch below.
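A sketch of these checks:

```python
from collections import Counter

# Collect the gold QIDs of all correct links in the dataset.
gold_ids = []
for text, annot in dataset:
    for offset, links_dict in annot["links"].items():
        for qid, value in links_dict.items():
            if value:
                gold_ids.append(qid)

print(Counter(gold_ids))
```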
By some happy coincidence, we got exactly 10 annotated sentences for each of our Emersons. In the script, we remove the print statements
and we'll now set aside 6 cases in a separate test set. Because our dataset is so small, we don't
take a random subset, but we carefully select 2 instances of each Emerson. We start by creating two new sets: a `train_dataset`
and a `test_dataset`. We loop through the 3 unique identifiers and
for each, find the indices of the corresponding data samples. And then we add the first 8
to the training dataset, and the last 2 to the test dataset. What is extremely important is that we shuffle the datasets again after this manipulation: otherwise some signal could be found in the order of the data samples, which is never a good idea. It's best to remove that bias as early as you can. A sketch of this split is shown below.
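A sketch of this split, using the gold QIDs collected above:

```python
import random

train_dataset = []
test_dataset = []
for qid in sorted(set(gold_ids)):
    # Indices of the annotated sentences that were linked to this QID.
    indices = [i for i, gold_qid in enumerate(gold_ids) if gold_qid == qid]
    train_dataset.extend(dataset[i] for i in indices[0:8])   # first 8 sentences for training
    test_dataset.extend(dataset[i] for i in indices[8:10])   # last 2 sentences for testing

# Shuffle both sets so no signal is hidden in the order of the samples.
random.shuffle(train_dataset)
random.shuffle(test_dataset)
```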
With our datasets now properly set up, we'll first run each of our sentences in the training dataset through the nlp pipeline with the
NER component. We store the processed docs in `TRAIN_DOCS`. Then, we'll create a new Entity Linking component
by calling `nlp.create_pipe` with the key `entity_linker`, and we'll also specify that it should not
use any prior probabilities, as we did not set these to realistic values in our knowledge
base. To initialize this component, we also have
to specifically set the knowledge base with `set_kb`, otherwise it'll throw an error
if you start using it without one. Finally, we can add this component to the pipeline with `nlp.add_pipe`. A sketch of these steps is shown below.
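A sketch of these steps, assuming spaCy v2's API, where the configuration key for switching off the prior probabilities is `incl_prior`:

```python
# Run the texts through the pipeline (with NER) so the entity spans are pre-annotated.
TRAIN_DOCS = []
for text, annotation in train_dataset:
    doc = nlp(text)
    TRAIN_DOCS.append((doc, annotation))

# Create the entity linker, attach our knowledge base, and add it to the pipeline.
entity_linker = nlp.create_pipe("entity_linker", config={"incl_prior": False})
entity_linker.set_kb(kb)
nlp.add_pipe(entity_linker, last=True)
```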
Next, we will run the actual training loop for the new component. To do so, we first have to disable the other
components of the pipeline. We loop through `nlp.pipe_names`, selecting
all pipes that are not the entity linker, and then disable those with `nlp.disable_pipes`. We then create an `optimizer` object by calling
`nlp.begin_training`, and we'll run some 500 iterations. In each iteration, we shuffle the training
data, and define a set of `batches` by using spaCy's `minibatch` utility with a `compounding` schedule, which creates a virtually infinite series of compounding values, increasing the batch size from 4 up to 32. We'll keep track of the loss in a dictionary. Then for each batch, we compile the input
text and the annotations, and run this through `nlp.update`. We'll set a dropout value of 0.2 to prevent
overfitting. We also provide the `losses` dictionary and
the `optimizer` we created earlier. To check on our progress, we'll print the loss every 50 iterations, and once more after all training is done. The loop is sketched below.
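A sketch of the training loop, again assuming spaCy v2's training API:

```python
import random
from spacy.util import minibatch, compounding

# Train only the entity linker: temporarily disable the other pipeline components.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for itn in range(500):
        random.shuffle(TRAIN_DOCS)
        batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,
                annotations,
                drop=0.2,        # dropout, to prevent overfitting
                losses=losses,
                sgd=optimizer,
            )
        if itn % 50 == 0:
            print(itn, "Losses", losses)
print(itn, "Losses", losses)
```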
Finally, we write the nlp pipeline, which now contains a trained entity linker, to file; we'll call it `my_nlp_el`. For use in the next step, we also write the test dataset to file with pickle. We run the script again, and this takes a
little while to execute. We see that during execution, the loss is
consistently decreasing and approaching zero - which means that the underlying neural network
gets better at representing the training data. The final loss is pretty small, which is a
good sign. But to truly verify whether our model generalizes
well, we need to test it on unseen data. So let's test our new component. We load the trained model back in from file. Let's first apply it on our original example
sentence, where we had the Entity Emerson as a Person, and Wimbledon as an Organization. Now that we have added an Entity Linking component,
we can run the example through the nlp pipeline again, print the text and label as before, but also print the disambiguated QID predicted by our Entity Linker. We access it by calling `ent.kb_id_`; a sketch is shown below.
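A sketch of applying the trained pipeline; the output path and the example sentence are my assumptions:

```python
import spacy

# Load the pipeline that now includes the trained entity linker.
nlp = spacy.load("my_output/my_nlp_el")

doc = nlp("Tennis champion Emerson was expected to win Wimbledon.")
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)
```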
Running this, we see that Emerson gets disambiguated to Q312545, and this time I have in fact remembered that this is our tennis player, so this seems
correct. Note also that the entity "Wimbledon" gets
the annotation NIL, which is basically just a placeholder value, showing that the NEL
component could not find any relevant ID for this entity. This happens because our Knowledge base and
the Entity Linking component have only been trained on "Emerson" examples, and are thus
quite limited. You'll recall that we also set aside 6 sentences
in the test dataset. Our model never saw these during training. So let's unpickle them and see what the Entity
Linking component would predict for them. We loop through the texts and the annotations
in the test dataset, and we print the raw text as well as the gold annotation. Then we process the text with our nlp pipeline,
loop through the entities in `doc.ents`, and print the prediction if the entity is in fact "Emerson", as we can expect all the others to be NIL at this point anyway. This check is sketched below.
Let's execute this again. Because we only have 6 sentences in our test dataset, we can quickly eyeball these results. We find that 5 out of 6 predictions are indeed
correct, which is about 83% accuracy. This is not a bad result at all, considering
that random guessing would have only achieved 33%. Again, we do notice that the label of the
entity is often wrongly predicted as "ORG" instead of "PERSON", but this is a mistake
of the NER algorithm and out of scope for this tutorial. I hope this tutorial has shown you how to
implement, and use, an Entity Linking component in spaCy. The knowledge base and training dataset used
in this example were kept small for demonstration purposes, but in reality you'll want to use
a much bigger, representative set of entities, perhaps from an ontology or dictionary that
is relevant to your use-case. Thanks for watching and I hope your next NLP
project will incorporate entity linking!