Hi, I'm Ines. I'm the co-founder of Explosion AI, a core developer of spaCy and the lead developer of
Prodigy, our annotation tool for machine learning and NLP. It's been really really
exciting to see Prodigy grow so much over the past year, talk to so many
people in the community, see what they're working on and discuss strategies for
creating training data for machine learning projects. Coming up with the
right strategy is often genuinely difficult and requires a lot of
experimentation. The most critical phase is the early development phase when you
try out ideas, design your label scheme and basically decide what you want to
predict. We've built Prodigy to make this process easier and to help you
iterate on your code and your data faster. In this video I'll be talking
about a few frequently asked questions that have come up on the forum. I'll also
include more details and links in the video description. In Prodigy, there are basically two
different ways to structure your task. You can make it a binary decision and
stream in suggestions that you accept or reject, or you can make it a manual
decision and ask the annotator to decide between several options or highlight
something by hand. For example spans in a text or
bounding boxes on an image.
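For text tasks, here's a rough sketch of what the two shapes can look like in Prodigy's JSONL task format. The field names follow the documented task format, but the texts, labels and file name below are just made-up examples:

import json

# A binary task: the annotator only accepts or rejects the suggested span.
binary_task = {
    "text": "Berlin is a city in Germany.",
    "spans": [{"start": 0, "end": 6, "label": "GPE"}],
}

# A manual choice task: the annotator picks one of several options.
choice_task = {
    "text": "Team wins the championship final.",
    "options": [
        {"id": "BASKETBALL", "text": "Basketball"},
        {"id": "FOOTBALL", "text": "Football"},
    ],
}

with open("tasks.jsonl", "w", encoding="utf8") as f:
    for task in (binary_task, choice_task):
        f.write(json.dumps(task) + "\n")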
But how do you decide which one to use? Well, there's not always an easy answer because every problem is different. But here's a rough
rule of thumb that we found to work quite well. Manual annotation is kind of
the standard way of labelling things. You take your data and just label everything
in order. This makes a lot of sense if what you're looking for is a gold-standard dataset and if your main objective is that you need every single
example in your data labeled. Similarly, if you're creating an evaluation set, you
usually want it to be gold-standard and have no missing or unknown values. If
you're training a new category from scratch you always need to get over the
cold start problem. Sometimes you can use tricks to make it less painful, other
times the only way is to label enough examples from scratch so you can pre-train the model. If your raw data set is very small, you probably also want to
give manual annotation a try. If there are only a few hundred texts it doesn't
make sense to use tricks here to find the most relevant examples. You can still
use some tricks to pre-select spans or labels, but you don't want to be skipping
examples in favor of better ones because there's just not enough there. So these
are all scenarios where a manual interface is probably the best choice.
The binary annotation workflow is useful if you already have something and you
want to collect feedback on it. For example, you can stream in the model's
suggestions, accept or reject them and then update the model in the loop. This
is what we're doing in the active learning powered recipes, for instance. Binary
annotation is especially helpful here to improve existing categories on new data.
It also makes it easy to focus and collect information faster and it's
really designed for automating as much as possible. For example, if you want to
label whether news headlines are about sports and if so, which type of sports,
you could of course combine this all into one multiple choice question. But
depending on the task, this can really put a lot of cognitive load on the annotator
and make the process slower and more error-prone. If you're designing it as a
binary task you can go through sports versus not sports first. This is
something you can do pretty quickly: sports, not sports, sports. At 2 seconds per
annotation that's almost 1,000 annotations in half an hour. Next, you can take the sports examples and label the sports
type. That's much easier now because all you have to think about is sports
instead of sports and everything else.
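As a rough sketch of that two-pass idea, you could export only the accepted examples from the first pass with Prodigy's database API and use them as the source for the second pass. The dataset and file names here are hypothetical:

import json
from prodigy.components.db import connect

db = connect()  # connects using your Prodigy database settings
examples = db.get_dataset("sports_vs_not")  # first-pass dataset (hypothetical name)

# Keep only the texts accepted as sports and write them out as the input
# for the second pass, where we label the type of sports.
with open("sports_only.jsonl", "w", encoding="utf8") as f:
    for eg in examples:
        if eg["answer"] == "accept":
            f.write(json.dumps({"text": eg["text"]}) + "\n")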
In summary, I would say if you can break down your task into binary decisions, go for it. It's usually faster to annotate
and much easier to do quality control because you'll be able to compare binary
answers. If you're starting from scratch and need to get over the cold start
problem, you can try and use patterns to help you pre-select entities or use the
manual interface to collect enough data so you can at least pre-train a
model effectively. The same goes for evaluation sets or if your data set is
very small. If you've used an active learning
powered workflow to improve a named entity recognition model you've probably
come across a situation like this. In order to help you find the most relevant
examples for training, Prodigy will get all possible analyses of a sentence and
suggest the entity spans that the model is most uncertain about. You can
then accept it if it's correct or reject it if it's wrong. But what if it's
half correct? For example, in this case we want to improve the model's "PRODUCT"
category. "iPhone" is suggested as a product entity and it's definitely a
product entity we want to recognize in general. But here the full span would be
"iPhone X". So should we accept this because iPhone is a product? This is one
of the questions where the answer is pretty straightforward: no this is
something you should definitely reject. Keep in mind that the feedback you're
giving is always on that particular span in that particular context. So if we
accept "iPhone", the feedback the model gets is: "Yes, this was the perfect parse
in this context please produce more like this" which is obviously not what we want.
If we reject the span it's not like we're telling the model "No, iPhone is
never a product entity". We're telling it "In this context, the analysis where
only the token iPhone is labeled as a product is incorrect, please produce less
of this and try again." So we're basically reinforcing other
more confident analyses of that sentence, for instance one that has "iPhone X"
labeled as the product. It might help to look at the analysis of the sentence in
the BILUO scheme, which is pretty much how named entities are represented
internally. "B" stands for beginning of an entity, "I" stands for inside an entity, "L"
stands for last token of an entity, "U" stands for unit – so basically single
token entities – and "O" stands for outside an entity. For these two analyses
here the BILUO tags are different so we want to be rejecting the analysis of the
tokens "iPhone X" as "U-PRODUCT" "O", update the model accordingly and move
the analysis towards "B-PRODUCT" "L-PRODUCT".
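If you want to inspect those tags yourself, spaCy can convert character offsets into BILUO tags. Here's a minimal sketch – the helper is offsets_to_biluo_tags in spaCy v3, and older versions have a similar function in spacy.gold:

import spacy
from spacy.training import offsets_to_biluo_tags  # spaCy v2: spacy.gold.biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp("The iPhone X was released in 2017.")

# Analysis 1: only "iPhone" is labelled as a product – the suggestion we reject
print(offsets_to_biluo_tags(doc, [(4, 10, "PRODUCT")]))
# ['O', 'U-PRODUCT', 'O', 'O', 'O', 'O', 'O', 'O']

# Analysis 2: the full span "iPhone X" is labelled – the analysis we want to reinforce
print(offsets_to_biluo_tags(doc, [(4, 12, "PRODUCT")]))
# ['O', 'B-PRODUCT', 'L-PRODUCT', 'O', 'O', 'O', 'O', 'O']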
I know that rejecting things that are almost correct isn't always easy. As you train your model in the loop you can
sometimes become a little attached to it and you want to reward it for almost
getting it right. But you really have to stay tough. Another question that sometimes comes up
is: Should I reject or skip? Prodigy lets you perform three main actions: accept,
reject and skip. Those will be added as the "answer" key to the created annotation
task. If you skip an example it will be excluded from pretty much everything. If
you train a model on the created data set or use it as an evaluation set, skipped
examples will be filtered out. So when would you want to do this? The skip
action is really mostly intended for very specific examples that shouldn't
even be there. For instance, if you're annotating comments straight from the
web you might end up with broken markup or one sentence in a different language.
Sometimes you also have examples that are confusing and difficult, so instead
of spending a long time thinking about it, it's better to just ignore it and
move on. That's especially true if you have lots of raw data: you really don't
want to lose momentum on one single stupid example. That said, if you choose
to ignore examples based on some objective – for example, broken markup or
the wrong language – you also shouldn't be evaluating against a set with those
types of examples in it. And if your runtime model needs to be able to
handle certain types of texts, those should be present in both your training
and your evaluation set. To give you an example: if your model needs to deal with
tweets in real time as they come in, you really want it to be trained on a
representative selection. If you filter out all the noise during annotation your
model never actually gets to see it and will likely perform pretty badly on it
or get very confused. By the way, one tip we often give people who are dealing
with lots of messy and noisy data: experiment with chaining together two
classifiers. Start by training one on a very simple binary distinction – not noise
versus noise. For example tweets with actual content you want to analyze
versus tweets consisting of an emoji, a link or just spam. Next train a
classifier for your actual objective that only runs on the data filtered by
the previous classifier. This is also much easier to annotate. You will only
have to assign your labels to the filtered set of texts that actually
matter and not reject hundreds of examples that are just noise.
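As a minimal sketch of that chaining idea – the model paths, label names and threshold below are assumptions, not real packages:

import spacy

noise_nlp = spacy.load("./noise_filter_model")  # binary text classifier: NOISE vs. CONTENT
topic_nlp = spacy.load("./topic_classifier")    # classifier for your actual labels

def classify(texts, noise_threshold=0.5):
    """Run the topic classifier only on texts the noise filter lets through."""
    results = []
    for doc in noise_nlp.pipe(texts):
        if doc.cats.get("NOISE", 0.0) >= noise_threshold:
            continue  # drop spam, emoji-only tweets, bare links etc.
        topic_doc = topic_nlp(doc.text)
        # pick the highest-scoring topic label
        results.append((doc.text, max(topic_doc.cats, key=topic_doc.cats.get)))
    return results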
Prodigy's interface is really designed for smaller chunks of text like sentences or single paragraphs. The built-in recipes will also try to split longer texts into sentences wherever possible.
But what if you're annotating named entities and you or your annotators just
need the previous paragraphs to decide if a label applies or not? Well, the thing
is, if you're doing named entity recognition, the model is actually
looking at the very local context, which means the surrounding words on each side
of the token. So as a rule of thumb, if you or your annotators are not able to
make the decision based on the local context, the model is unlikely to be able
to reproduce that decision. You definitely want to find out about things
like that as early as possible in your experimentation phase when you click
through a few examples yourself and try it out. You don't want to ask someone to
label thousands of documents only to find out later that your model isn't
actually able to learn any of the entity types you've come up with. Labeling at
the sentence level is always a good sanity check in that way. If you're doing
long text classification that's a little different because here your end goal is
to predict labels for the whole text. But still, most implementations for long text
classification usually predict those categories by averaging over the
predictions for smaller chunks like sentences. So when you're labeling data
for long text classification you might as well label it all at the sentence
level or paragraph level. Your annotators will be able to focus better, produce
higher-quality data and give you a lot more to work with: one label per sentence
at about the same cost. If you want, you can even experiment with automation and
pre-select the sentences with a higher information density. We can do all of
that stuff pretty well with NLP so there's really no need to waste the
human's time by asking them to read through thousands of irrelevant filler
sentences.
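Here's a minimal sketch of turning long documents into one task per sentence before annotation. The file names are made up, and the add_pipe call shown is the spaCy v3 API:

import json
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundaries, no model needed

with open("long_documents.jsonl", encoding="utf8") as f_in, \
        open("sentence_tasks.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        doc = nlp(json.loads(line)["text"])
        for sent in doc.sents:
            # one annotation task per sentence
            f_out.write(json.dumps({"text": sent.text}) + "\n")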
Essentially, what I'm trying to say is, when designing your annotation tasks you don't always need to annotate exactly what you want your runtime
system to output. What matters is how you break down a larger goal into smaller
solvable machine learning tasks. And the shorter the dependencies you're trying to
predict, the easier it usually is for the model to learn them. When you start a new NLP project you often need to decide: Do I start off with
a pre-trained model and fine-tune it, or should I train a new model from scratch?
Prodigy lets you implement workflows for both scenarios. But how do you decide
which one to go for? Fine-tuning pre-trained models is especially useful if
you need to predict the same categories and just want to improve accuracy on new
specific data. You'll need much less training data – sometimes even a few
hundred binary decisions can have a big impact. That's like 10 minutes of
annotation with Prodigy. But there's also a downside to using pre-trained models,
because by definition, they come with pre-trained weights. And whatever you do, you
always need to manage the existing weights. Every update you make interacts
with what's already there, often trained on millions of words. This can sometimes
lead to very confusing results. One common problem is what's often referred
to as "catastrophic forgetting". If you update a pre-trained model with examples of a
new category but you don't includes examples or what it previously predicted, it
may "forget" what it previously learned. There are ways to prevent this but it's
something you always have to think about and design around. You might spend a lot
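One common trick, sometimes called pseudo-rehearsal, is to mix examples annotated by the original model in with your new annotations, so the updates don't only pull the weights towards the new category. A very rough sketch of assembling such a mix – the texts, the "CARRIER" label and the entity offsets are placeholders, and the exact update API depends on your spaCy version:

import random
import spacy

nlp = spacy.load("en_core_web_sm")  # the pre-trained model we're about to update

# "Revision" examples: texts the original model already handles well, labelled
# with its own predictions so the update doesn't overwrite that knowledge.
revision_texts = ["Apple was founded by Steve Jobs.", "She lives in Berlin."]
revision_data = [
    (doc.text, {"entities": [(e.start_char, e.end_char, e.label_) for e in doc.ents]})
    for doc in nlp.pipe(revision_texts)
]

# New annotations for the category we actually want to add (placeholder example)
new_data = [("We ship with FastCourier.", {"entities": [(13, 24, "CARRIER")]})]

# Mix and shuffle before updating the model
training_data = revision_data + new_data
random.shuffle(training_data)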
You might spend a lot of time hacking around arbitrary side-effects of the existing weights and
maybe that time could be better spent creating a new dataset to train a model
from scratch. Training a model from scratch requires a lot more data but it
also gives you a somewhat blank canvas to start from. You can define your very
own label scheme or use your very own category definitions without having to
consider the existing weights. So if your labels and their definitions are very custom
and far off from any generic pre-trained model, you should consider training from
scratch. If you have enough raw unlabeled text, you can still use some automation
to speed up the labeling process. For example, you can stream in texts that are
already labeled by an existing pre-trained model and then extend those labels
manually. If you want to train your own named entity recognition model from
scratch but you do want to include the label "PERSON", you can have spaCy label
that part for you. Even if the model is only correct 70% of the time, hey, that's
70% less manual labeling work you have to do. Prodigy implements a workflow like
this in the "ner.make-gold" recipe by the way. Finally, if you've been reading
Finally, if you've been reading some of my comments on the forum, you might have seen me make this point
before. But I honestly think that one of the most powerful but underutilized NLP
techniques is combining generic statistical models with application-specific rules to extract more complex relationships.
A simple example of this is the following: the corpus spaCy's English
models were trained on defines a person as just a
person name, so without any titles like "Mr" or "Dr". This makes sense because it
makes it easy to resolve those entities back to a knowledge base. But what if you
need the titles? Trying to fine-tune the model to completely change its
definition of "PERSON" is probably going to be a very painful process. All its
weights are based on that definition and you probably need a lot of data to
change that. However, syntactically, there's one thing
all of these titles have in common, at least in English and similar languages.
They come right before the person name and there's a limited number of options.
So to check for the titles, we can take a predicted person entity span and look at
the previous token, the previous two or maybe the previous three. This lets us
capture "Mr", "Prof Dr" or even "Prof Dr Dr", which is actually
surprisingly common in Germany where people are really into titles. So in code,
the whole thing could look like this: we can expand the entity selection to include the title tokens, or add them as custom extension attributes which we can retrieve later on.
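Here's a sketch along those lines – the title list and the attribute name are just examples, and it assumes the model predicts the name itself as a PERSON entity:

import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

TITLES = {"Mr", "Mrs", "Ms", "Dr", "Prof"}  # example titles, extend for your data

# Custom extension attribute to store any title found before a PERSON entity
Span.set_extension("person_title", default=None)

def attach_person_titles(doc):
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            continue
        titles = []
        # look at up to three tokens right before the entity
        for token in reversed(list(doc[max(ent.start - 3, 0):ent.start])):
            if token.text.rstrip(".") in TITLES:
                titles.insert(0, token.text)
            else:
                break
        if titles:
            ent._.person_title = " ".join(titles)
    return doc

doc = attach_person_titles(nlp("Prof Dr Dr Müller will speak first."))
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.person_title)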
This example might seem quite basic, but there's actually a lot more complex stuff that you can do in a similar way. The part-of-speech tags and dependency parse hold so much information that you can use to go from generic labels to specific structured information. For more examples and ideas
check out the links in the video description. I hope you enjoyed this
video! Thanks for using Prodigy and for all the feedback and great NLP
discussions on the forum. If you haven't seen it yet, also check out our prodigy-recipes repo on GitHub, which includes a collection of recipe scripts for
various annotation workflows. They're also great starter recipes if
you're looking to build your very own custom pipelines. If you want to see
another video on different questions let us know on Twitter!