Training a Named Entity Recognition Model with Prodigy and Transfer Learning

Video Statistics and Information

Captions
Hi, I'm Ines! I'm the co-founder of Explosion, a core developer of the spaCy Natural Language Processing library and I'm also the lead developer of Prodigy. Prodigy is a modern annotation tool for creating training data for machine learning models. It helps developers and data scientists train and evaluate their models faster. It's been really really exciting to see the Prodigy community grow so much and also see the tool enable developers to collect data more efficiently. Spending time with data is probably something that every data scientist knows they should do more of. It's kinda like flossing or eating your vegetables. You know you should do it more. And one of the core philosophies that has motivated Prodigy is that creating training data isn't just "dumb click work". Data is the core of your application and it should be developed just like you also develop your code. Especially as we're able to train more accurate models with fewer labelled examples, this becomes even more important, because you really want to be finding the best possible training data for your specific application and run as many experiments as you can to find out what works best. In this video, I'll show you how to use Prodigy to train a named entity recognition model from scratch by taking advantage of semi-automatic annotation and modern transfer learning techniques. So, what are we going to do? Well, I thought I'd pick a topic that's interesting and that everyone can relate to, which is: food! Yay! So we're going to be using machine learning to find out how mentions of certain ingredients change over time, online using comments posted on Reddit. And at the end of it, we'll have one of those really cool bar chart race animations, so stay tuned! I'll be showing pretty much the whole end-to-end process of this in this video. So how are we going to do this? Well, essentially we want to be training a model to recognize ingredients in context. So we also need to be showing it ingredients in context to train it. And we'll be using the Reddit comments corpus and specifically, the past couple of years of comments posted to the r/Cooking subreddit. So basically where people talk about recipes and things they're cooking. And we want to go through a sample of these comments and we want to be highlighting ingredients whenever they're mentioned. It's a type of entity recognition problem where the phrases are mostly unambiguous. Most of the time when people are going to be talking about "garlic", they actually mean the ingredient garlic. But we need to find all of them. And some of those entities are going to be really common so we might as well have the annotation tool do some of the work for us and help us do this more efficiently. So here's what we're going to do. First, we're going to be creating a phrase list and match patterns for ingredients. Next, we're going to be labelling all ingredients in a sample of texts with the help of the match patterns. Next, we're going to be training and evaluating a first model, to see if we're on the right track and to check if what we're doing is working. Next, we're going to be labelling more examples by correcting the model's predictions. And after that, we can train a new model with improved accuracy. And then take that and run it over two million plus Reddit comments and count the mentions over time. And finally, we can select some interesting results and visualize them. 
To create the phrase list, we'll be using a word vectors model that includes vectors for multi-word expressions, like "cottage cheese" for example, and that was also trained on Reddit. So first, we download the vectors. We'll be using the vectors trained on all of 2015 because that's kind of in the middle and it's a pretty nice and small model package. And I think that's going to be enough for what we're trying to do. And we also need to install the sense2vec library so we can work with those vectors. Prodigy is a fully scriptable annotation tool and the most convenient way to interact with it is via the command line. Workflows are defined as Python functions which we also call "recipes". Prodigy comes with a bunch of built-in recipes for different use cases but other packages can also provide recipes. So here we're using the recipe "sense2vec.teach" which was provided by the sense2vec package. The first argument is the name of the dataset we want to save the annotations to. Datasets let you bundle annotations together so we can later add to them, export them or just use them within the tool. The second argument is the path to the vectors we just downloaded. And next we can provide some comma-separated seed phrases, which are going to be used to find other similar phrases in the vectors. And based on that, we're going to initialize Prodigy and spin up the web server so we can start annotating. When we navigate to the browser, we can see the first example. This actually looks very promising. We're definitely in the right vector space and "spinach" is suggested with a high similarity score. So we can now click "accept" or "reject", depending on whether we want to keep the term and use it in our match patterns, or if we want to exclude it and it's not a good fit. So here, as you can see with "feta cheese", we're also getting suggestions for multi-word expressions. So that's really good. That's also an interesting one. This misspelling is common enough that it made the cut and got its own vector. So that's pretty relevant. But it's also something that maybe you wouldn't have necessarily thought of, so that's definitely something we want to be having in our word list and in our match patterns. Okay, so here's one that we want to reject. If we came across "steamed broccoli" somewhere in the text, we'd probably want to be highlighting "broccoli" and not the whole phrase "steamed broccoli", because broccoli is the ingredient. So that's one that we're going to be rejecting. So I'll speed this up a little. Instead of clicking the buttons you can also use keyboard shortcuts and hit "A" for "accept" and "X" for "reject", so that's usually even faster. And as you can see, we're very quickly building up a nice list here. So yeah, that's it. Prodigy shows that it doesn't have any more tasks available, which probably means that there are no more suggestions for the target vector under a certain similarity threshold. So we could now start the server again with different seed terms, but I think we have enough phrases to move on and start creating our patterns. So let's hit the save button or command + S. That makes sure that all annotations are sent back to the server and saved in our dataset. We can now simply go back to our terminal, exit Prodigy, for example with control + C, and it will show us that the annotations we've collected were added to the dataset "food_terms". We can now reuse that dataset to create our match patterns. For that, we'll use the built-in recipe "terms.to-patterns". 
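As a rough sketch, the phrase-collection step described above might look like the following on the command line. The vectors archive name and the seed phrases here are assumptions for illustration, not taken from the video:

```bash
# Install the sense2vec library and download the 2015 Reddit vectors
# (assumed to be unpacked into ./s2v_reddit_2015_md).
pip install sense2vec

# Collect similar phrases into the "food_terms" dataset, starting from a few seeds.
# The seed phrases are just examples.
prodigy sense2vec.teach food_terms ./s2v_reddit_2015_md \
  --seeds "garlic, cumin, cottage cheese"
```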
We give it the name of the dataset with our phrase list and a label that we want to assign if a pattern matches. So here, that's the label I chose: it's short for "ingredient" so it doesn't take up that much space. And then we also give it a spaCy model. In this case, just a blank English tokenizer for tokenization. And we can then forward the output to a file, in this case JSONL, newline-delimited JSON, and it will save out our patterns to disk and we can keep working with that file. So here's the result, here's how it looks. Each line contains one match pattern. And if you've used spaCy's rule-based Matcher before, you might recognize this pattern format, because it uses the exact same format for the match patterns. So we're now ready to annotate the data. Here's the data we've extracted. We have one file with all comments from the past 7 years, which is over 2 million comments. And then we have a smaller sample of 10,000 random comments, which we're going to be using for annotation and for testing. And here's how it looks. We have the text, and of course we made sure to keep the timestamps so we can later compute how mentions change over time and see when each comment was posted. To start the annotation server, we can run the "ner.manual" recipe. The first argument is the name of the dataset that we want to save the annotations to. And then next we can pass in a spaCy model that will be used for tokenization, so in this case, "blank:en" will use the basic English tokenizer. I'll show you why this is super cool and super useful in a second. Next, we'll give it the path to the data that we want to annotate, so our little sample from the Reddit comments. And we also give it the label that we want to assign, the entity label, which obviously is the same as the pattern label. And if you have multiple labels you can also pass in a comma-separated list here, but we'll only be focusing on one. And of course we need the patterns that we just created. So if a text contains a match, Prodigy will automatically highlight it for us so we don't have to do that work, which is pretty cool. This will now spin up the web server again and we can navigate back to the browser. And as you can see, we already have a match! Yay! So this is really the very classic way of labelling and typically what people think of when they hear "annotation": just click and highlight spans in a text. But because the text is pre-tokenized, we already know where a word starts and ends. So you don't have to hit the exact characters. Instead, the selection snaps to the token boundaries, because the token boundaries are also what the model is going to predict. So there's no need to do pixel-perfect selection. Let's see that again! Yes! So this is great for making sure that the data you collect is consistent, and it allows you to move much much faster because you spend less time highlighting. And if you're annotating a single-token entity, you can also just double-click on the word to select it and it'll be locked in. So let's annotate the missing entities here and hit "accept" when we're done. This example doesn't have any entities, so we won't highlight anything and just accept it, because it's already correct. Showing the model examples of what's NOT an entity is just as important as showing it examples of what IS an entity. It's easy to forget, but you always want a lot of examples of things that are just not entities so you can really make sure your model is learning the right thing.
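For reference, a hedged sketch of the two commands described in this section. The label name INGRED, the file paths, and the example pattern line are placeholders, assuming Prodigy v1.9-era syntax as used around the time of this video:

```bash
# Convert the accepted phrases into JSONL match patterns.
prodigy terms.to-patterns food_terms --label INGRED --spacy-model blank:en > ./food_patterns.jsonl

# Each output line is one spaCy-style token pattern, roughly like:
# {"label": "INGRED", "pattern": [{"lower": "feta"}, {"lower": "cheese"}]}

# Manually annotate the sample, with pattern matches pre-highlighted,
# saving to the "food_data" dataset.
prodigy ner.manual food_data blank:en ./reddit_r_cooking_sample.jsonl \
  --label INGRED --patterns ./food_patterns.jsonl
```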
Okay, so here I'm not really sure if "salt" in "reduced salt" should be labelled, because is it really talking about salt as an ingredient that people use? So yeah, I don't know. I'll just skip it by hitting the "ignore" button or by hitting the space bar. We have so much data in our case that this one example really won't matter. If you have a good flow going on and you're really annotating fast, you don't want to get distracted by one random example. So if you don't know the answer, just skip. That's always better than stopping and losing your flow. And I'll also be skipping examples that are just single words or links, because we don't want to include them. I'm trying to have a strict policy here and really only annotate food terms that are actually ingredients. So let's see how that goes. This is actually one of the trickier parts that really doesn't get talked about very often, which is designing your label scheme and your annotation policy. Which is often much more important, in my opinion, than a lot of the modelling decisions you might be making about your model implementation. It's actually super common that you have to start over again because you've realized that, ah damn, the world just doesn't divide up so neatly into those categories that you've carefully made up. That's also why we designed Prodigy to allow this kind of fast iteration and doing your own quick annotation during development. Because you really don't want to scale up an idea that does not work or that can't be annotated consistently. Often the only way you find out whether it works is if you try it. Even labelling a very small number of examples can really help here. And if there are significant problems, you'll immediately realize: like, it doesn't work, I don't even know what to select here. Or you ask some annotators or some other people to annotate a few examples, and if they already disagree, that's a very clear sign that you probably want to adjust your label scheme or really write very careful instructions to make sure that there is an answer for every special case you might come across. So we'll also be speeding this up a little. But it's pretty efficient and the patterns are matching. But there's also a lot that we have to add manually. So this one is an interesting edge case. Because of the missing space, "salt" is not tokenized as its own token. So even if we were able to annotate it, our model probably wouldn't be able to learn anything from it. So it's really good that this came up because we saw this problem, and I'll just reject it to separate it from the other ones that I've skipped. Sometimes it can make sense to go through those examples afterwards and see if there are common problems that can maybe be fixed by tweaking the tokenization rules a bit. Especially if you're dealing with unusual punctuation or lots of special cases. So let's keep going and I'll speed it up a little. The patterns are definitely helping and we're getting quite close to 100 annotations already and it hasn't been very long. So this is pretty cool. This one is also interesting. Is "Guinness extra stout" an ingredient? Maybe I'm kind of being inconsistent here, I'm not sure. I guess in this case it is, because the comment talks about braising meat and beer is a perfectly fine ingredient for that. It's not like some of those American recipes that call for ingredients like "cheetos" and I'm like... wait, how is that an ingredient? Like, no.
I'm definitely curious to see if the model is going to be able to pick up on the quite specific "ingredient" distinction, as opposed to food in general. This post was actually great and I kinda want to find it and bookmark it now. It basically explains the various different sauces and dips that you can make with a few staples and what they mean. It's a pretty good summary so I kinda want to save that now. And I'm also getting a bit hungry. By the way, in the bottom right corner you can see a list of pattern IDs. Those are all the patterns that were matched in this particular example. The IDs map to the line numbers in the patterns file, so you can always go back and see what was matched. If you write a few more complex patterns and something is confusing you, you can always go back and see how a match was created and why it's there. So let's speed this up a bit more. We're now at almost 400 accepted annotations, which is pretty good. I'm always trying to hit the even numbers, which means that I sometimes end up having to annotate another 99 examples because I missed the target. But it is pretty satisfying to see the number go up. And we're starting to get a number of examples that we can actually work with, and that's enough to run a first training experiment, which is good. My goal is to do 500 in total. And that hopefully gives us a bit over 400 examples to train from, minus the ones that we've skipped. Also, don't forget that you always want to hold some of them back for evaluation, even if it's just a quick experiment. You always need some examples that you can evaluate your model on, so you'll always be doing a bit more than you need for training. But a few hundred can be a good stopping point, especially if we're taking advantage of some transfer learning later on. So that means that we can probably get by with that dataset for training and at least a very rough evaluation to see if we're generally on the right track and if our model is learning something. So let's stop here and save. We can now go back to the terminal again and exit the server. During the development phase it's important to have a very tight feedback loop. Like, I'm pretty confident in this case that the category we've defined here is something that the model can learn. But in general, you never know. Sometimes you want to go back and revise your label scheme. And you always want to make sure you're validating your ideas early so you're not wasting time on something that's just doomed to fail. So we're going to be training a temporary model that we can later build on top of. For instance, we can use it to suggest entities, and all we have to do then is correct its mistakes. This also gives us a very good idea of what the model predicts and where it's at, and also the common errors it makes. And it also just makes the process of collecting more data more efficient, because we already have a model that gives us something. Prodigy comes with a "train" recipe that's a relatively thin wrapper around spaCy's training API. It's specifically optimized for running quick experiments and for working with existing Prodigy datasets for the different types of annotations you might collect. So you can very quickly and easily run these training experiments on the command line and see where you're at and if you're on the right track. And to make the most of our very small dataset, we want to initialize the model with pretrained representations. Specifically, we want to use a pretrained token-to-vector layer.
We've used spaCy's "pretrain" command to pretrain weights on the Reddit corpus. The idea is pretty similar to the language model pretraining that was popularized by ELMo, BERT, ULMFiT and so on. The only difference here is that we're not training it to predict the next word; instead, we're training it to predict an approximation, in this case the word's word vector. This makes the artifact that we're training much smaller and it also makes the runtime speed much faster. So it's a pretty efficient compromise that we can use and that we've seen very very promising results with overall. To run a first training experiment, we can call the "prodigy train" recipe. The first argument is the name of the component that we want to train, in this case "ner" for the named entity recognizer. Next, we define the base model. Here, we'll be using the large English vectors package that you can download alongside spaCy, since those are also the vectors we've used for pretraining the token-to-vector layer. So this should always match. Actually, it needs to match, otherwise it's not going to work. The "init-tok2vec" argument lets us specify the path to the pretrained token-to-vector weights, so basically the output of "spacy pretrain". That's what we'll use to initialize our model. And we'll also specify an output directory and the "eval-split", which is the percentage of examples to hold back for evaluation. Here, we're setting it to 20%, which is not that much, and normally that's a bit low. It means that our evaluation is not going to be very stable, and it's not going to be the result you should be reporting in your paper. You should take it with a grain of salt. The training here is reasonably quick and I'm running this on CPU on my regular MacBook. But I'll still speed it up a little for the video because it's kinda boring to watch and not that much is happening. But as we can see, the accuracy is looking pretty good already. It's definitely going up. And after the training has completed, the model with the best F-score is saved to the directory that we specified. We can also see a very nice breakdown per label, which is very useful if you're training with multiple labels at the same time, so you can see if maybe one label does great and another doesn't do so well, and that's usually a very important indicator to have. And that's also the score we're hoping to beat later, after we've collected more annotations. If we can't beat that, then that's obviously a bad sign. But I think there's still some room. It also shows that with our pretrained representations we're getting pretty decent results even though we've trained with a very small dataset. And it looks pretty internally consistent. So that's a very good sign and it does show us that it's probably worth it to keep pursuing this idea. Before we keep going we want to run another diagnostic. Because a very important question is always: should I keep going? And also, should I keep doing what I have been doing, or should I do something else? There's obviously not always a definitive answer. But one experiment we've found pretty useful here is the "train curve". The "train-curve" recipe takes pretty much the same arguments as the "train" recipe. And it will run the training several times, with different amounts of data: 25%, 50%, 75% and 100% in the default configuration. So basically, what we're doing here is simulating growing training data size.
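For reference, a hedged sketch of the commands described above, assuming spaCy v2-era and Prodigy v1.9/1.10-era syntax as used around the time of this video. The file and directory names are placeholders:

```bash
# Pretrain a token-to-vector layer on raw Reddit text (done once, up front).
# Input file and output directory names are placeholders.
python -m spacy pretrain ./reddit_r_cooking_texts.jsonl en_vectors_web_lg ./pretrained_tok2vec

# Pick one of the pretrained weight files produced above (the filename depends on the epoch).
TOK2VEC=./pretrained_tok2vec/modelN.bin

# Train a first NER model on the manually annotated dataset,
# initializing the tok2vec layer with the pretrained weights.
prodigy train ner food_data en_vectors_web_lg \
  --init-tok2vec "$TOK2VEC" --output ./tmp_model --eval-split 0.2

# Simulate growing training data (25%, 50%, 75%, 100%) to see whether more data helps.
prodigy train-curve ner food_data en_vectors_web_lg \
  --init-tok2vec "$TOK2VEC" --eval-split 0.2
```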
So we're basically interested in finding out whether collecting more data will improve the model. And in order to do that, we're looking at whether more data has improved it in the past. That's the rough idea. The training here takes a bit longer because we're training 4 times, so I'll speed this up a little. For each run, the recipe prints the best score and it shows the improvement with more data. Or if there's no improvement or the accuracy goes down, you would see that printed in red. So you can see how more data is improving the model, using segments of the existing data. And here we can see that between 75% and 100%, we do see an improvement of over 1% in accuracy. So that's looking pretty good. As a rule of thumb, if the accuracy increases in the last segment, it could indicate that collecting more annotations, and specifically more annotations of roughly the same type as is already in our dataset, will improve the model further. So now that we have a temporary model, we can use it to do the labelling for us and only correct it if it's wrong. Prodigy's built-in workflow for that is called "ner.correct". The command looks very similar to the annotation command that we ran before. First, we give it the name of the dataset we want to save the annotations to. We could have used the previous dataset name here, but I think it's always a good idea to use separate datasets for separate experiments, because this also makes it easy to start over if things go wrong. You make a mistake, something isn't right, you don't like what you're seeing, you can just delete the dataset and try again. Separating annotations later can really be a pain if they're all mixed up in the same set, but merging them is easy: you can just run the train command with multiple dataset names. So it's always better to keep them separate. The second argument is the base model that already predicts named entities, so that's the temp model that we just trained and saved to that directory. We also need to pass in the path to our input texts again and the label that we want to annotate. And finally, since we're annotating the same input text file twice, we want to make sure that we're not asked about texts that we've already annotated before. So we can set the "exclude" argument to our previous "food_data" dataset. If an example is already present in that dataset, we won't see it again now when we're annotating (see the command sketch after this paragraph). After we hit enter, the server starts again and we can head back to the web app. And as you can see, we're right back where we left off and we can keep annotating. The highlighted entities you can see here are actually the model's predictions. And it's already looking pretty good. It looks like it has definitely learned something as we move through the examples and the entities here. But it also makes mistakes, and if it predicts a wrong span we can click on it and remove it from the example. And if it's missing one, we can just add it, pretty much like we did before with the pattern matches. That's a nice one and that's a nice mistake. And I kind of have a lot of sympathy for the model here. No, this is not an ACTUAL red lobster. But it's a nice try and we can just remove this by clicking on the entity and then hit "accept" to submit the correct annotation. So yeah, we can move through these and let's speed this up again.
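A hedged sketch of the correction workflow just described, again with Prodigy v1.9/1.10-style syntax; the dataset, model directory, label and file names are placeholders:

```bash
# Annotate by correcting the temporary model's predictions, saving to a new dataset
# and excluding texts that are already in the first dataset.
prodigy ner.correct food_data_correct ./tmp_model ./reddit_r_cooking_sample.jsonl \
  --label INGRED --exclude food_data

# Later, the separate datasets can simply be merged at training time, roughly like:
# prodigy train ner food_data,food_data_correct en_vectors_web_lg \
#   --init-tok2vec ./pretrained_tok2vec/modelN.bin \
#   --output ./food_model --eval-split 0.2 --n-iter 20
```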
It's a bit repetitive in places because the way people talk about food is quite similar and we're seeing a lot of the same entities. But it also shows that some of them are just much more common than others. Here we can actually see quite a few ingredients that definitely weren't in the training data and that are correctly recognized here. For instance, "apple juice". That's a good sign, too, and it means that we're really recognizing ingredients in context here. Because after all, that's the goal. That's what we want to do. We don't just want to find certain ingredients. We want the model to be able to generalize based on the examples we show it. And we want to find other similar ingredients if they're mentioned in similar contexts. Because that's how we're going to find – hopefully – the most interesting cases and the most interesting ingredients and maybe the things we hadn't thought of. So yeah, let's speed this up again so it's not too repetitive. There are quite a few examples here where the distinction between ingredient, dish or other mention of food was pretty difficult. And I think I haven't been fully consistent in my annotations either. So there's probably some room for improvement here. I'll be sharing all of the code and data from this video in the description. So maybe you also have some ideas for what else to do with the data and with the model, or have some things that could be improved. Definitely let me know on Twitter if you end up working with it. Because I think it's definitely a very cool topic and there's a lot more to explore still. So we're moving through the examples and I kinda want to get to 500-ish in total again. I think that's a good number. The most common mistakes I've seen the model make so far are definitely around dishes vs. ingredients, which is genuinely a tricky question. Like, in "mac and cheese" for instance, we're not considering "cheese" an ingredient, because it's the name of the dish and that's what the conversation is about most of the time. And that's a difficult distinction. So we'll hopefully be able to improve on those cases and also increase the accuracy once we're training on this data as well and are correcting the mistakes here. Another thing to keep in mind is that our goal here is information extraction where we can average over a lot of predictions. And that's very nice because it means that we can accept that the model may get some edge cases wrong if it's getting most of everything else right. I mean, that's the case for most machine learning use cases, because you know you're not going to get 100% accuracy on everything, always. So it's mostly about doing some error analysis and making sure the mistakes your model makes are mistakes you can live with or work around. So overall, the annotation part that I'm showing in this video has taken me maybe about 2 1/2 hours, which is pretty efficient. And that's also the kind of efficiency that we want to enable with a tool like Prodigy. Because machine learning projects can take a lot of time and they can also get pretty expensive. So being able to validate your idea in only a few hours, without having to schedule tons of meetings and get lots of people involved and plan lots of things, that's really cool. And that enables faster iteration and means you can try out more things and you can really focus on the stuff that looks the most promising out of all the experiments you run. And now we're at 550 in total because I missed the 500 mark and wanted an even number again.
So let's stop here and hit "save" to make sure that all of our annotations are sent back to the server, and then we can move on. We can again stop the server in our terminal, and now we have 2 datasets with over 1000 annotations in total that we can use to train the model. So that's a pretty good number. We'll be using the same setup again, but we're passing in the names of both datasets when we run "prodigy train" and run our experiment. And we'll be training from scratch again. There's really no point in updating our temp model again, even though it would be theoretically possible. It's always better to start clean. That lets you avoid a lot of potential side-effects and it's just a much better way to run experiments. You want to be training on the same data from scratch every time. So we'll start out with the vectors and the tok2vec weights, and we'll also give it a few more iterations, just to make sure we're not missing out on potentially better results if it takes a bit longer, which can be the case with these models that are a bit larger. So we'll see. And we'll let this run. It obviously takes a bit longer this time. But after 20 iterations we have increased our accuracy. It's pretty good, especially given that we were able to automate a lot and didn't spend that much time on doing the additional annotation. We still doubled our training data and got that nice improvement. And it could also mean that, well, if we wanted to spend a bit more time, we could maybe get a bit more out of this. I don't think that's the end of it. But we've now saved out the model, and I think the results are good enough, so now it's time to put it to work and see what we can analyze with it and what we can find out about ingredients on Reddit. As a quick recap, we want to be building one of those really cool bar chart races. And we want to be using the model that we just trained to extract all ingredients as named entities. And then we want to compute the counts over time. The data we have is 7 years of Reddit, so we probably want to count by month, because otherwise we'd only have 7 data points and that's not so interesting and makes it a lot more difficult to see what changes over what time periods. And here's a little Python script that I wrote to do this. It's pretty straightforward. We start with a dictionary of counters so we can count all timestamps for all entities. The model we saved out after training is a regular loadable spaCy model, so we can pass the path to "spacy.load" and start using it right away. Next, we load the full corpus, the full extracted Reddit data, as a stream. It's newline-delimited JSON, so we don't have to load it all into memory upfront. And spaCy has this very useful method called "nlp.pipe" that lets you process texts as a stream. That's also much more efficient at scale and it supports multiprocessing. So we can create an iterator of (text, example) tuples and set "as_tuples=True" when we call "nlp.pipe". What we get back is tuples of spaCy Doc objects paired with the original records. The Doc objects have all the annotations that were predicted by the model, and the original record has the original text, but also the meta information like the timestamps that we need. So we'll parse the UTC timestamp to a more workable year-month string, and then for each entity in "doc.ents", we'll increment the count for that given month.
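Below is a minimal sketch of the counting script described here and in the next paragraph. The file paths, the model directory, the timestamp field name and the label name INGRED are assumptions, and the real script may differ in its details:

```python
import json
from collections import Counter, defaultdict
from datetime import datetime

import pandas as pd
import spacy

# Load the trained model from the training output directory (path is a placeholder).
nlp = spacy.load("./food_model")


def read_stream(path):
    # Newline-delimited JSON: one record per line with "text" and a UTC timestamp.
    with open(path, encoding="utf8") as f:
        for line in f:
            record = json.loads(line)
            yield record["text"], record


counts = defaultdict(Counter)  # {entity text: {"YYYY-MM": count}}
data = read_stream("./reddit_r_cooking.jsonl")

# Process the comments as a stream; as_tuples=True passes the original record through.
for doc, record in nlp.pipe(data, as_tuples=True):
    # The exact location of the timestamp depends on how the data was exported.
    month = datetime.utcfromtimestamp(int(record["meta"]["utc"])).strftime("%Y-%m")
    for ent in doc.ents:
        if ent.label_ == "INGRED":  # label name assumed
            # Lowercase the entity text so different capitalizations are merged.
            counts[ent.text.lower()][month] += 1

# One row per ingredient, one column per month, missing months filled with 0.
df = pd.DataFrame.from_dict(counts, orient="index").fillna(0).sort_index(axis=1)
df.to_csv("./ingredient_counts.csv")
```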
And also, we'll be using the lowercase form of the entity so that we don't end up with duplicates for different casings and capitalization. And then finally, we're putting it all in a DataFrame and saving it out as a CSV that we can keep working with. And I guess I could have probably used a DataFrame all along. But I personally always like to start with vanilla Python and then see how I go. I don't know, that's just how I do things. We ended up running this script over the whole data. In our case, we ran 21 jobs with 2 CPUs per job, and that took about 8 minutes each. So in total, it took us 5.6 CPU hours. So if you had an 8-CPU server, it would take you about 45 minutes to process the 2 million plus comments from Reddit that we've extracted. So this is all totally doable and you don't need any fancy or expensive computing resources to complete a project like this. You can even let it run over night on your laptop and when you get up in the morning you have the results. So hopefully, if you want to replicate this, it'd be no big deal. And here's an excerpt of the resulting counts. We have the extracted ingredients and then we have columns for each month with the counts. One additional processing step that you could add, and that we haven't done here, is lemmatization. So basically, you would merge ingredients with the same base form into one and add up the counts. For instance, we currently have two records for "egg" and "eggs", and then you would have just one. And that's also something spaCy could help you with if that's something that you want to do. But just from looking at the data as it is, this is actually super interesting already. So now we just need to find the most interesting data points. If you look at the results, you quickly realize that many of the very common ingredients don't change very much at all. Like, people have always been talking about salt, and people will keep talking about salt, and they're still talking about salt, and if we were to visualize the top ingredients out of all of them it'd be incredibly boring. So what we ended up doing is, we calculated the variance, scaled by the average, so we can basically find the ingredients whose mentions changed significantly month to month (see the sketch after this paragraph). I looked around for how we could do this animation and this graphic, and I found this pretty cool site called Flourish, which lets you create interactive bar chart race animations. You've probably seen those on YouTube or in viral social media posts, and it's a very fun way to visualize how things that you can somehow count and enumerate change over a given time period. So I signed up for a free account and uploaded our selection of ingredients to visualize it. The first column you can see here, those are the ingredients, the labels, and the other columns are the counts for the respective months. We've selected the most variable ingredients and then we ended up removing the top 10 staples to make it a bit more interesting. We could have probably removed a bit more, but we also didn't want to cherry-pick it too much. So after adding the data, we can now navigate to the preview tab and here it is! Our bar chart race visualization, created using a custom NER model that we trained completely from scratch.
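As a hedged sketch of that selection step, using the same placeholder file name as the counting sketch above:

```python
import pandas as pd

# Load the per-month counts produced earlier (first column = ingredient name).
df = pd.read_csv("./ingredient_counts.csv", index_col=0)

# Score each ingredient by its month-to-month variance scaled by its average count,
# so very common but stable ingredients like "salt" don't dominate.
scores = df.var(axis=1) / df.mean(axis=1)

# Keep the highest-scoring ingredients for visualization (the cutoff of 50 is arbitrary).
top = df.loc[scores.sort_values(ascending=False).head(50).index]
top.to_csv("./ingredient_counts_top.csv")
```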
I hope you've enjoyed this example of how Prodigy can help you quickly complete a project, starting with raw text and only a few seed phrases, and then all the way to a model with sufficient accuracy to carry out your analysis. There's a lot more you can do with Prodigy as well, and it's fully scriptable. So it can really slot in right alongside the rest of your data science and machine learning stack. To find out more about Prodigy, check out the website and docs. I've added the links to the description below. And if you have any questions, feel free to talk to us on the forum.
Info
Channel: Explosion
Views: 32,911
Rating: 4.9689922 out of 5
Keywords: artificial intelligence, ai, machine learning, spacy, natural language processing, nlp, active learning, data science, big data, annotation, named entity recognition, ner, data annotation, text annotation, transfer learning, language models
Id: 59BKHO_xBPA
Length: 40min 27sec (2427 seconds)
Published: Mon Mar 16 2020