Hi, I'm Ines! I'm the co-founder of Explosion, a core developer
of the spaCy Natural Language Processing library and I'm also the lead developer of Prodigy. Prodigy is a modern annotation tool for creating
training data for machine learning models. It helps developers and data scientists train
and evaluate their models faster. It's been really really exciting to see the
Prodigy community grow so much and also see the tool enable developers to collect data
more efficiently. Spending time with data is probably something
that every data scientist knows they should do more of. It's kinda like flossing or eating your vegetables. You know you should do it more. And one of the core philosophies that has
motivated Prodigy is that creating training data isn't just "dumb click work". Data is the core of your application and it
should be developed just like you also develop your code. Especially as we're able to train more accurate
models with fewer labelled examples, this becomes even more important, because you really
want to be finding the best possible training data for your specific application and run
as many experiments as you can to find out what works best. In this video, I'll show you how to use Prodigy
to train a named entity recognition model from scratch by taking advantage of semi-automatic
annotation and modern transfer learning techniques. So, what are we going to do? Well, I thought I'd pick a topic that's interesting
and that everyone can relate to, which is: food! Yay! So we're going to be using machine learning
to find out how mentions of certain ingredients change over time online, using comments posted
on Reddit. And at the end of it, we'll have one of those
really cool bar chart race animations, so stay tuned! I'll be showing pretty much the whole end-to-end
process of this in this video. So how are we going to do this? Well, essentially we want to be training a
model to recognize ingredients in context. So we also need to be showing it ingredients
in context to train it. And we'll be using the Reddit comments corpus
and specifically, the past couple of years of comments posted to the r/Cooking subreddit. So basically where people talk about recipes
and things they're cooking. And we want to go through a sample of these
comments and we want to be highlighting ingredients whenever they're mentioned. It's a type of entity recognition problem
where the phrases are mostly unambiguous. Most of the time when people are going to
be talking about "garlic", they actually mean the ingredient garlic. But we need to find all of them. And some of those entities are going to be
really common so we might as well have the annotation tool do some of the work for us
and help us do this more efficiently. So here's what we're going to do. First, we're going to be creating a phrase
list and match patterns for ingredients. Next, we're going to be labelling all ingredients
in a sample of texts with the help of the match patterns. Next, we're going to be training and evaluating
a first model, to see if we're on the right track and to check if what we're doing is
working. Next, we're going to be labelling more examples
by correcting the model's predictions. And after that, we can train a new model with
improved accuracy. And then take that and run it over two million
plus Reddit comments and count the mentions over time. And finally, we can select some interesting
results and visualize them. To create the phrase list, we'll be using
a word vectors model that includes vectors for multi-word expressions, like "cottage
cheese" for example, and that was also trained on Reddit. So first, we download the vectors. We'll be using the vectors trained on all
of 2015 because that's kind of in the middle and it's a pretty nice and small model package. And I think that's going to be enough for
what we're trying to do. And we also need to install the sense2vec
library so we can work with those vectors. Prodigy is a fully scriptable annotation tool
and the most convenient way to interact with it is via the command line. Workflows are defined as Python functions
which we also call "recipes". Prodigy comes with a bunch of built-in recipes
for different use cases but other packages can also provide recipes. So here we're using the recipe "sense2vec.teach"
which was provided by the sense2vec package. The first argument is the name of the dataset
we want to save the annotations to. Datasets let you bundle annotations together
so we can later add to them, export them or just use them within the tool. The second argument is the path to the vectors
we just downloaded. And next we can provide some comma-separated seed phrases, which are going to be used to find other similar phrases in the vectors. Roughly, the full command looks like the sketch below.
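Here's a rough sketch of that command. The dataset name "food_terms" is the one the annotations end up in later in the video; the vectors path and the seed phrases are assumptions for illustration:

```bash
# Suggest phrases similar to the seeds from the sense2vec vectors and save
# the accept/reject decisions to the "food_terms" dataset
prodigy sense2vec.teach food_terms ./s2v_reddit_2015_md \
  --seeds "garlic, tomato, cottage cheese"
```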
Based on that, Prodigy will initialize and spin up the web server so we can start annotating. When we navigate to the browser, we can see
the first example. This actually looks very promising. We're definitely in the right vector space
and "spinach" is suggested with a high similarity score. So we can now click "accept" or "reject",
depending on whether we want to keep the term and use it in our match patterns, or if we
want to exclude it and it's not a good fit. So here, as you can see with "feta cheese",
we're also getting suggestions for multi-word expressions. So that's really good. That's also an interesting one. This misspelling is common enough that it
made the cut and got its own vector. So that's pretty relevant. But it's also something that maybe you wouldn't
have necessarily thought of, so that's definitely something we want to be having in our word
list and in our match patterns. Okay, so here's one that we want to reject. If we came across "steamed broccoli" somewhere
in the text, we'd probably want to be highlighting "broccoli" and not the whole phrase "steamed
broccoli", because broccoli is the ingredient. So that's one that we're going to be rejecting. So I'll speed this up a little. Instead of clicking the buttons you can also
use keyboard shortcuts and hit "A" for "accept" and "X" for "reject", so that's usually even
faster. And as you can see, we're very quickly building
up a nice list here. So yeah, that's it. Prodigy shows that it doesn't have any more
tasks available, which probably means that there are no more suggestions that score above a certain similarity threshold to the target vector. So we could now start the server again with
different seed terms, but I think we have enough phrases to move on and start creating
our patterns. So let's hit the save button or command +
S. That makes sure that all annotations are sent back to the server and saved in our dataset. We can now simply go back to our terminal,
exit Prodigy, for example with control + C, and it will show us that the annotations we've
collected were added to the dataset "food_terms". We can now reuse that dataset to create our
match patterns. For that, we'll use the built-in recipe "terms.to-patterns". We give it the name of the dataset with our
phrase list and a label that we want to assign if a pattern matches. So here, that's what I chose – it's short
for "ingredient" so it doesn't take up that much space. And then we also give it a spaCy model. In this case, just a blank English tokenizer
for tokenization. And we can then forward the output to a file, in this case JSONL, newline-delimited JSON, and it will save out our patterns to disk so we can keep working with that file. The full command looks roughly like the sketch below.
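A hedged sketch of that step. The label INGRED is an assumption (the transcript only says it's a short form of "ingredient"), and so is the output file name; depending on the Prodigy version, the output file may also be passed as a positional argument instead of redirected:

```bash
# Turn the accepted phrases in "food_terms" into match patterns and
# redirect the output to a JSONL file (label and file name are assumed)
prodigy terms.to-patterns food_terms --label INGRED --spacy-model blank:en \
  > ./food_patterns.jsonl
```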
So here's the result, here's how it looks. Each line contains one match pattern. And if you've used spaCy's rule-based Matcher before, you might recognize this pattern format, because it uses the exact same format for
the match patterns.
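For illustration, a couple of those pattern lines could look roughly like this (the label and the specific phrases are assumed):

```json
{"label": "INGRED", "pattern": [{"LOWER": "spinach"}]}
{"label": "INGRED", "pattern": [{"LOWER": "cottage"}, {"LOWER": "cheese"}]}
```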
So we're now ready to annotate the data. Here's the data we've extracted. We have one file with all comments from the past 7 years, which is over 2 million comments. And then we have a smaller sample of 10,000
random comments, which we're going to be using for annotation and for testing. And here's how it looks. We have the text and we made sure to of course
keep the timestamps so we can later compute things and check how things change over time
and when that comment was posted. To start the annotation server, we can run
the "ner.manual" recipe. The first argument is the name of the dataset
that we want to save the annotations to. And then next we can pass in a spaCy model
that will be used for tokenization, so in this case, "blank:en" will use the basic English
tokenizer. I'll show you why this is super cool and super
useful in a second. Next, we'll give it the path to the data that
we want to annotate, so our little sample from the Reddit comments. And we also give it the label that we want
to assign, meaning the label we want the entities to get, which obviously is the
same as the pattern label. And if you have multiple labels you can also
pass in a comma-separated list here, but we'll only be focusing on one. And of course we need our patterns that we
just created. So if a text contains a match, Prodigy will automatically highlight it for us so we don't have to do that work, which is pretty cool. Putting it all together, the command looks roughly like the sketch below.
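A hedged sketch of that command. The dataset name "food_data" comes up later in the video when it's used for the exclude logic; the input file name and the label INGRED are assumptions:

```bash
# Manual NER annotation with pattern suggestions; blank:en is only used
# to tokenize the text (input file name and label are assumptions)
prodigy ner.manual food_data blank:en ./r_cooking_sample.jsonl \
  --label INGRED --patterns ./food_patterns.jsonl
```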
This will now spin up the web server again and we can navigate back to the browser. And as you can see, we already have a match! Yay! So this is really the very classic way of
labelling and typically what people think of when they hear "annotation". Just like, click, and highlight spans in a
text. But because the text is pre-tokenized, we
already know where a word starts and ends. So you don't have to hit the exact characters. Instead, the selection snaps to the token
boundaries. Because the token boundaries are also what
the model is going to predict. So there's no need to do pixel-perfect selection. Let's see that again! Yes! So this is great to make sure that the data
you collect is consistent and it allows you to move much much faster because you spend
less time highlighting. And if you're annotating a single-token entity,
you can also just double-click on the word to select it and it'll be locked in. So let's annotate the missing entities here
and hit "accept" when we're done. So this example doesn't have any entities,
so we won't highlight anything and just accept it, because it's already correct. Showing the model examples of what's NOT an
entity is just as important as showing it examples of what IS an entity. It's easy to forget, but you always want a
lot of examples of things that are just not entities so you can really make sure your
model is learning the right thing. Okay, so here I'm not really sure if "salt"
in "reduced salt" should be labelled, because is it really talking about salt as an ingredient
that people use? So yeah, I don't know. I'll just skip it by hitting the "ignore"
button or by hitting the space bar. We have so much data in our case that this
one example really won't matter. If you have a good flow going on and you're
really annotating fast, you don't want to get distracted by one random example. So if you don't know the answer, just skip. That's always better than stopping and losing
your flow. And I'll also be skipping examples that are
just single words or links, because we don't want to include them. I'm trying to have a strict policy here and
really only annotate food terms that are actually ingredients. So let's see how that goes. This is actually one of the trickier parts
that really doesn't get talked about very often, which is designing your label scheme
and your annotation policy. Which is often much more important, in my
opinion, than a lot of modelling decisions you might be making about your model implementation. It's actually super common that you have to
start over again because you've realized that ah, damn, the world just doesn't divide up
so neatly into those categories that you've carefully made up and wanted to divide the
world into. That's also why we designed Prodigy to allow
this kind of fast iteration and doing your own quick annotation during development. Because you really don't want to scale up
an idea that does not work or that can't be annotated consistently. Often the only way you find out about this
and you find out whether it works is if you try it. Even labelling a very small number of examples
can really help here. And if there are significant problems, you'll
immediately realize. Like, it doesn't work, I don't even know what
to select here. Or you ask some annotators and you ask some
other people to annotate a few examples, and if they already disagree, that's a very clear
sign that you probably want to adjust your label scheme or really write very careful
instructions to make sure that there is an answer for every special case you might come
across. So we'll also be speeding this up a little. But it's pretty efficient and the patterns
are matching. But there's also a lot that we have to add
manually. So this one is an interesting edge case. Because of the missing space "salt" is not
tokenized as its own token. So even if we were able to annotate it, our
model probably wouldn't be able to learn anything from it. So it's really good that this came up because
we saw this problem and I'll just reject it to separate it from the other ones that I've
skipped. Sometimes it can make sense to go through
those examples afterwards and see if there are common problems that can maybe be fixed
by tweaking the tokenization rules a bit. Especially if you're dealing with unusual
punctuation or lots of special cases. So let's keep going and I'll speed it up a
little. The patterns are definitely helping and we're
getting quite close to 100 annotations already and it hasn't been very long. So this is pretty cool. This one is also interesting. Is "Guinness extra stout" an ingredient? Maybe I'm kind of being inconsistent here,
I'm not sure. I guess in this case it is, because the comment
talks about braising meat and beer is a perfectly fine ingredient for that. It's not like some of those American recipes
that call for ingredients like "cheetos" and I'm like... wait, how is that an ingredient? Like, no. I'm definitely curious to see if the model
is going to be able to pick up on the quite specific "ingredient" distinction, as opposed
to food in general. This post was actually great and I kinda want
to find it and bookmark it now. It basically explains the various different
sauces and dips that you can make with a few staples and what they mean. It's a pretty good summary so I kinda want
to save that now. And I'm also getting a bit hungry. By the way, in the bottom right corner you
can see a list of pattern IDs. And those are all the patterns that were matched
in this particular example. Those IDs map to the line numbers so you can
always go back and see what was matched. If you maybe write a few more complex patterns
and something is confusing you can always go back and see how this match was created
and why it's there. So let's speed this up a bit more. We're now at almost 400 accepted annotations,
which is pretty good. I'm always trying to hit the even numbers,
which means that I sometimes end up having to annotate another 99 examples because I
missed the target. But it is pretty satisfying to see the number
go up. And we're starting to get a number of examples
that we can actually work with and that's enough to run a first training experiment,
which is good. My goal is, I'm going to be doing 500 in total. And that hopefully gives us a bit over 400
examples to train from, minus the ones that we've skipped. Also, don't forget that you always want to
hold some of them back for evaluation, even if it's just a quick experiment. You always need some examples that you can
evaluate your model on so you'll always be doing a bit more than you know you need for
training. But a few hundred can be a good stopping point,
especially if we're taking advantage of some transfer learning later on. So that means that we can probably get by
with that dataset for training and at least a very rough evaluation to see if we're generally
on the right track and if our model is learning something. So let's stop here and save. We can now go back to the terminal again and
exit the server. During the development phase it's important
to have a very tight feedback loop. Like, I'm pretty confident in this case that
that category we've defined here is something that the model can learn. But in general, you never know. Sometimes you want to go back and revise your
label scheme. And you always want to make sure you're validating
your ideas early so you're not wasting time on something that's just doomed to fail. So we're going to be training a temporary
model that we can later build on top of. For instance, we can use it to suggest entities
and all we have to do then is correct its mistakes. And this also gives us a very good idea of
what the model predicts and where it's at, and also the common errors it makes. And it also just makes the process of collecting
more data more efficient because we already have a model that gives us something. Prodigy comes with a "train" recipe that's
a relatively thin wrapper around spaCy's training API. And it's specifically optimized for running
quick experiments and to work with existing Prodigy datasets for the different types of
annotations you might collect. So you can very quickly and easily run these
training experiments on the command line and see where you're at and if you're on the right
track. And to make the most of our very small dataset
we want to initialize the model with pretrained representations. Specifically, we want to use a pretrained
token-to-vector layer. We've used spaCy's "pretrain" command to pretrain weights on the Reddit corpus, roughly along the lines of the sketch below.
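A minimal sketch of that pretraining step, assuming spaCy v2's CLI, a JSONL file of raw Reddit texts (the file name is made up) and the large English vectors as the target:

```bash
# Pretrain a token-to-vector layer by predicting each word's vector
# (approximate language modelling); weights are saved to ./pretrained_tok2vec
python -m spacy pretrain ./reddit_cooking_texts.jsonl en_vectors_web_lg ./pretrained_tok2vec
```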
And the idea is pretty similar to the language model pretraining that was popularized by ELMo, BERT, ULMFiT and so on. The only difference here is that we're not training it to predict the next word; instead, we're training it to predict an approximation: in this case, the word's own word vector. So this makes the artifact that we're training
much smaller and it also makes the runtime speed much faster. So it's a pretty efficient compromise that
we can use and that we've seen very very promising results with overall. To run a first training experiment, we can
call the "prodigy train" recipe. The first argument is the name of the component
that we want to train. In this case "ner" for the named entity recognizer. Next, we define the base model. Here, we'll be using the large English vectors
package that you can download alongside spaCy since those are also the vectors we've used
for pretraining the token-to-vector layer. So this always needs to match, otherwise it's not
going to work. The "init-tok2vec" argument lets us specify
the path to the pretrained token-to-vector weights, so basically the output of "spacy
pretrain". And that's going to be used to initialize
our model with. And we'll also specify an output directory
and the "eval-split", which is the percentage of examples to hold back for evaluation. Here, we're setting it to 20%, which is not
that much and normally that's a bit low. It means that our evaluation is not going
to be very stable. And that's not going to be the result you should be reporting in your paper. You should take that with a grain of salt. Altogether, the training command looks roughly like the sketch below.
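A hedged sketch of the full training command. The dataset, base vectors and options follow what's described above; the weights file name and the output directory are assumptions, and the exact flag spellings may differ slightly between Prodigy versions:

```bash
# Train the "ner" component on the food_data annotations, starting from the
# large English vectors and the pretrained tok2vec weights; hold back 20%
# of the examples for evaluation (file and directory names are assumed)
prodigy train ner food_data en_vectors_web_lg \
  --init-tok2vec ./pretrained_tok2vec/model.bin \
  --output ./tmp_model --eval-split 0.2
```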
So the training here is reasonably quick and I'm running this on CPU on my regular MacBook. But I'll still speed it up a little for the
video because it's kinda boring to watch and not that much is happening. But as we can see, the accuracy is looking
pretty good already. It's definitely going up. And after the training has completed the model
with the best F-score is saved to the directory that we specified. And we can also see a very nice breakdown
per label, which is very useful if you're training with multiple labels at the same
time so you can see if maybe one label does great, one label doesn't do so well, and that's
also usually a very important indicator that you can have. And that's also what we're hoping to beat
later after we've collected more annotations. If we can't beat that, then that's obviously
a bad sign. But I think there's still some room. But it also shows that with our pretrained
representations we're getting pretty decent results even though we've trained with a very
small dataset. And it looks pretty internally consistent. So that's a very good sign and that does show
us that it's probably worth it to keep pursuing this idea. Before we keep going we want to run another
diagnostic. Because a very important question is always:
should I keep going? And also, should I keep doing what I have
been doing, or should I do something else? There's obviously not always a definitive
answer. But one experiment we've found pretty useful
here is the "train curve". The "train-curve" recipe takes pretty much
the same arguments as the "train" recipe. And it will run the training several times, with different amounts of data: 25%, 50%, 75% and 100% in the default configuration. A rough sketch of the command is below.
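Assuming the same arguments as the training run above, the train-curve experiment could look roughly like this:

```bash
# Re-train on 25%, 50%, 75% and 100% of the data to check whether more
# annotations keep improving the model (same caveats about file names and flags)
prodigy train-curve ner food_data en_vectors_web_lg \
  --init-tok2vec ./pretrained_tok2vec/model.bin --eval-split 0.2
```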
So basically, what we're doing here is, we're simulating a growing training data size. So we're basically interested in finding out
whether collecting more data will improve the model. And in order to do that, we're looking at
whether more data has improved it in the past, basically. That's kind of the rough idea. And the training here takes a bit longer because
we're training 4 times. So I'll speed this up a little. For each run here, the recipe prints the best
score and it shows the improvement with more data. Or if there's no improvement or the accuracy
goes down, you would also see that printed in red. So you can see how more data is improving
the model using segments of the existing data. And here we can see that between 75 and 100%,
we do see an improvement of over 1% in accuracy. So that's looking pretty good. And as a rule of thumb, if the accuracy increases
in the last segment, it could indicate that collecting more annotations, and specifically,
more annotations of roughly the same type as is already in our dataset, will improve
the model further. So now that we have a temporary model, we
can use it to do the labelling for us and only correct it if it's wrong. Prodigy's built-in workflow for that is called
"ner.correct". The command looks very similar to the annotation
command that we ran before. First, we give it the name of the dataset
we want to save the annotations to. We could have used the previous dataset name
here but I think it's always a good idea to use separate datasets for separate experiments
because this also makes it easy to start over if things go wrong. You make a mistake, something isn't right,
you don't like what you're seeing, you can just delete the dataset and try again. And separating annotations, that can really
be a pain if they're all mixed up in the same set. But merging them later on is easy: you can just run the train command with multiple dataset names. So it's always better to keep them separate. And the second argument here is the base model
that already predicts named entities. So that's the temp model that we just trained
and saved to that directory. And we also need to pass in the path to our
input texts again and the label that we want to annotate. And finally, since we're annotating the same
input text file twice, we want to make sure that we're not asked about texts that we've
already annotated before. So we can set the "exclude" argument to our
previous "food_data" dataset. So if an example is already present in that
dataset, we won't see it again now when we're annotating again. And after we hit enter the server starts again
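A hedged sketch of that command. The new dataset name and the input file name are assumptions; the temporary model directory matches the training sketch above:

```bash
# Let the temporary model suggest entities and only correct its mistakes,
# skipping any text that's already annotated in "food_data"
prodigy ner.correct food_data_correct ./tmp_model ./r_cooking_sample.jsonl \
  --label INGRED --exclude food_data
```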
After we hit enter, the server starts again and we can head back to the web app. And as you can see, we're right back where
we left off and we can keep annotating. And the highlighted entities you can see here,
they're actually the model's predictions. And it's already looking pretty good. It looks like it has definitely learned something
as we move through the examples and the entities here. But it also makes mistakes and if it predicts
a wrong span we can click on it and remove it from the example. And if it's missing one, we can just add it, pretty much just like we did before when we
did the pattern matches. That's a nice one and that's a nice mistake. And I kind of have a lot of sympathy for the
model here. No, this is not an ACTUAL red lobster. But it's a nice try and we can just remove
this by clicking on the entity and then hitting "accept" to submit the correct annotation. So yeah, we can move through these and let's
speed this up again. It's a bit repetitive in places because the
way people talk about food is quite similar and we're seeing a lot of the same entities. But it also shows some of them are just much
more common than others. Here we can actually see quite a few ingredients
that definitely weren't in the training data and that are correctly recognized here. For instance, "apple juice". That's a good sign, too, and that means that
we're really recognizing ingredients in context here. Because after all, that's the goal. That's what we want to do. We don't just want to find certain ingredients. We want the model to be able to generalize
based on the examples we show it. And we want to find other similar ingredients
if they're mentioned in similar contexts. Because that's how we're going to find – hopefully
– the most interesting cases and the most interesting ingredients and maybe the things
we hadn't thought of. So yeah, let's speed this up again so it's
not too repetitive. So there are quite a few examples here where
the ingredient vs. dish vs. other mention of food was pretty difficult. And I think I haven't been fully consistent
in my annotations either. So there's probably some room for improvement
here. And I'll be sharing all of the code and data
from this video in the description. So maybe you also have some ideas for what
else to do with the data and with the model, or have some things that could be improved. Definitely let me know on Twitter if you end
up working with it. Because I think it's definitely a very cool
topic so I think there's a lot more to explore still. So we're moving through the examples and I
kinda want to get to 500-ish in total again. I think that's a good number. The most common mistakes I've seen the model
make so far are definitely around dishes and ingredients, which is genuinely a tricky question. Like, in "mac and cheese" for instance, we're
not considering "cheese" and ingredient because it's the name of the dish and that's what
the conversation is about most of the time. And that's a difficult distinction. So we'll hopefully be able to improve around
those cases and also increase the accuracy once we're training on that data as well and
are correcting the mistakes here. Another thing to keep in mind is that our
goal here is information extraction where we can average over a lot of predictions. And that's very nice because it means that
we can accept that the model may get some edge cases wrong if it's getting most of everything
else right. I mean, that's the case for most machine learning
use cases because you know you're not going to get 100% accuracy always on everything
ever. So it's more that you want to do some error
analysis and make sure the mistakes your model makes are mistakes you can live with or work
around. So overall, the annotation part that I'm showing
in this video has taken me maybe about 2 1/2 hours, which is pretty efficient. And that's also the kind of efficiency that
we want to enable with a tool like Prodigy. Because machine learning projects can take
a lot of time and they can also get pretty expensive. So being able to validate your idea in only
a few hours without having to schedule tons of meetings and get lots of people involved
and plan lots of things, that's really cool. And that enables faster iteration and means
you can try out more things and you can really focus on the stuff that looks the most promising
out of all the experiments you run. And now we're at 550 in total because I missed
the 500 mark and wanted an even number again. So let's stop here and hit "save" to make
sure that all of our annotations are sent back to the server and then we can keep going. And we can again stop the server in our terminal
and now we have 2 datasets with over 1000 annotations in total that we can use to train
the model. So that's a pretty good number. We'll be using the same setup again, but we're passing in the names of both datasets when we run "prodigy train" and run our experiment. Roughly, that looks like the sketch below.
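Assuming the dataset names from above, passing both datasets is just a comma-separated list (output directory assumed):

```bash
# Train from scratch on both datasets, with the same vectors and tok2vec
# weights as before and a few more iterations
prodigy train ner food_data,food_data_correct en_vectors_web_lg \
  --init-tok2vec ./pretrained_tok2vec/model.bin \
  --output ./food_model --eval-split 0.2 --n-iter 20
```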
And we'll be training from scratch again. There's really no point in updating our temp model again, even though it would be theoretically possible. It's always better to start clean. That lets you avoid a lot of potential side-effects
and it's just a much better way to run experiments. You want to be training on the same data from
scratch every time. So we'll start out with the vectors, with
the tok2vec weights and we'll also be giving it a few more iterations. Just to make sure we're not missing out on
potentially better results if it takes a bit longer, which can be the case with these models
that are a bit larger. So we'll see. And we'll let this run. It obviously takes a bit longer this time. But after 20 iterations we have increased
our accuracy. It's pretty good, especially given that we
were able to automate a lot and didn't spend that much time on doing the additional annotation. We still doubled our training data and got
that nice improvement. And it could also mean that well, if we wanted
to spend a bit more time, we could maybe get a bit more out here. I don't think that's the end of it. But we've now saved out the model and yeah,
I think the results are good enough so now it's time to put it to work and see what we
can analyze with it and what we can find out about ingredients on Reddit. As a quick recap, we want to be building one
of those really cool bar chart races. And we want to be using the model that we
just trained to extract all ingredients as named entities. And then we want to compute the counts over
time. And the data we have is 7 years of Reddit,
so we probably want to count by month, because counting by year would only give us 7 data points and
that's not so interesting and makes it a lot more difficult to see what changes over what
time periods. And here's a little Python script that I wrote
to do this. It's pretty straightforward. We start with a dictionary of counters so
we can count all timestamps for all entities. And the model we saved out after training
is a regular loadable spaCy model, so we can pass the path to "spacy.load" and we can start
using it right away. And next, we load the full corpus or the full
extracted Reddit data as a stream. It's newline-delimited JSON, so we don't have
to load it all into memory upfront. And spaCy also has this very useful method,
which is called "nlp.pipe". And that lets you process texts as a stream. And that's also much more efficient at scale
and it also supports multiprocessing. So we can create an iterator of (text, example)
tuples and then we can pass them in, with "as_tuples" set, when we call "nlp.pipe". And what we get back out of it is tuples of
spaCy Doc objects, and those Doc objects have all the annotations that were predicted by
the model. And then we also get the original record,
which has the original text, but also the meta information like the timestamps that
we need. So we'll parse the UTC timestamp to a more
workable year-month string and then for each entity in "doc.ents", we'll increment the count for that given month. And also, we'll be using the lowercase form of the entity so that we don't end up with duplicates for different casings and capitalization. And then finally, we're putting it all in a DataFrame and saving it out as a CSV that we can keep working with. A condensed sketch of the script is below.
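Here's a condensed, hedged version of what that script could look like. The file names, the model directory and the exact key holding the UTC timestamp are assumptions; the overall flow follows the description above:

```python
from collections import Counter, defaultdict
from datetime import datetime

import pandas as pd
import spacy
import srsly

# The trained model is a regular loadable spaCy model
nlp = spacy.load("./food_model")

# One month counter per ingredient
counts = defaultdict(Counter)

# Stream the newline-delimited JSON so we never load it all into memory
stream = srsly.read_jsonl("./r_cooking_all.jsonl")
# (text, record) tuples: the record keeps the meta info like the timestamp
data = ((eg["text"], eg) for eg in stream)

for doc, eg in nlp.pipe(data, as_tuples=True):
    # Parse the UTC timestamp into a "YYYY-MM" string (key name assumed)
    month = datetime.utcfromtimestamp(int(eg["meta"]["utc"])).strftime("%Y-%m")
    for ent in doc.ents:
        # Lowercase so different capitalizations aren't counted separately
        counts[ent.text.lower()][month] += 1

# Rows are ingredients, columns are months
df = pd.DataFrame.from_dict(counts, orient="index").fillna(0)
df.to_csv("./ingredient_counts.csv")
```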
And I guess I could have probably used a DataFrame all along. But I personally always like to start with vanilla Python and then see how I go. I don't know, that's just how I do things. We ended up running this script over the whole
data. And in our case, we ran 21 jobs with 2 CPUs
per job. And that took about 8 minutes each. So in total, it took us 5.6 CPU hours. So if you had an 8-CPU server, it would take
you about 45 minutes to process the 2 million plus comments from Reddit that we've extracted. So this is all totally doable and you don't
need any fancy or expensive computing resources to complete a project like this. You can even let it run over night on your
laptop and when you get up in the morning you have the results. So hopefully, if you want to replicate this,
it'd be no big deal. And here's here's an excerpt of the resulting
counts. We have extracted ingredients and then we
have columns for each month with the counts. And one additional processing step that you
could add, that we haven't done here, is lemmatization. So basically, you would merge ingredients
with the same base form into one and add up the counts. So for instance, "egg" and "eggs" currently have two separate records, and after lemmatization you would have just one. And that's also something spaCy could help
you with if that's something that you want to do. But just from looking at the data as it is,
this is actually super interesting already. So now we just need to find the most interesting
datapoints. If you look at the results, you quickly realize
that many of the very common ingredients don't change very much at all. Like, people have always been talking about
salt, and people will keep talking about salt, and they're still talking about salt, and
if we were to visualize the top ingredients out of all of them it'd be incredibly boring. So what we ended up doing is, we calculated
the variance, scaled by the average. So we can basically find the ingredients whose mentions changed significantly from month to month. A tiny pandas sketch of that calculation is below.
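One way to read "the variance, scaled by the average" is the variance of the monthly counts divided by their mean; a minimal sketch, assuming the CSV produced by the counting script:

```python
import pandas as pd

# Per-month counts from the counting script (file name assumed)
df = pd.read_csv("./ingredient_counts.csv", index_col=0)

# Variance of each ingredient's monthly counts, scaled by its mean, so that
# ingredients with big relative swings rank highest
scores = df.var(axis=1) / df.mean(axis=1)

# The most variable ingredients, e.g. the top 50
print(scores.sort_values(ascending=False).head(50))
```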
I looked around for how we could do this animation and this graphic, and I found this pretty cool site called Flourish, which lets you
create interactive bar chart race animations. You've probably seen those on YouTube or in
viral social media posts and it's a very fun way to visualize how things that you can somehow
count and enumerate change over a given time period. So I signed up for a free account and uploaded
our selection of ingredients to visualize it. And the first column you can see here, those
are the ingredients, the labels, and the other columns, those are all counts for the respective
months. And we've selected the most variable ingredients
and then we ended up removing the top 10 staples to make it a bit more interesting. We could have probably removed a bit more,
but we also didn't want to cherry-pick it too much. So after adding the data, we can now navigate
to the preview tab and here it is! Our bar chart race visualization, created
using a custom NER model that we trained completely from scratch. I hope you've enjoyed this example of how
Prodigy can help you quickly complete a project, starting with raw text and only a few seed
phrases, and then all the way to a model with sufficient accuracy to carry out your analysis. There's a lot more you can do with Prodigy
as well, and it's fully scriptable. So it can really slot in right alongside the
rest of your data science and machine learning stack. To find out more about Prodigy, check out
the website and docs. I've added the links to the description below. And if you have any questions, feel free to
talk to us on the forum.