Hi, I'm Ines. I'm the co-founder of Explosion, a core developer
of the spaCy Natural Language Processing library and the lead developer of Prodigy. Prodigy is a modern annotation tool for creating
training data for machine learning models. It helps developers and data scientists train
and evaluate their models faster. Version 1.10 is probably our biggest release
so far, and it includes tons of new features. So I thought I'd record a video to give you
a little walkthrough. In this video, I'll show you some of the new
features that I'm most excited about, including: A completely new interface for manual relationship
and dependency annotation, and even joint span and relation annotation. New recipes for creating data for dependency
parsing, coreference resolution and fully custom semi-automated relations. New interfaces for annotating audio and video
files. Recipes for labelling audio segments and transcribing
audio files and videos. Recipes for creating training data for speaker
diarization. These recipes are powered by the awesome pyannote.audio
library, and even let you have a model in the loop. There's also a new and revamped manual image
annotation UI with support for modifying and resizing bounding boxes and shapes and various
new settings. And also, new settings for the manual NER
and span annotation UI, including character-based highlighting and a cool recipe for annotating
named entity recognition data for fine-tuning transformer models like BERT. Finally, I'll also quickly show a few smaller
features, like new recipe callbacks, UI config options and more. By the way, I've included the timestamps in
the video description, so you can skip ahead to the features you're most interested in. So let’s get started with one of the new
features: annotating dependencies and relations. The relations interface lets you assign relationships
and dependencies between tokens and spans. To assign a relation, click the two tokens
you want to connect. You can choose between viewing the text with
line breaks, which is especially nice if you have fewer, but longer dependencies, or in
one line, which looks more similar to a dependency tree. If you’re annotating relations between spans
like named entities, you can load in pre-labelled data with spans, or annotate dependencies
and spans jointly. Just switch into span annotation mode and
drag across the tokens you want to select. If your selection turns green, you’re good
to go. If it turns red, it doesn’t contain a valid
span – either because it includes disabled tokens, or because it overlaps with an existing
span. Alternatively, you can also hold down shift
and click the start and end of the span you want to add. This is useful if you’re annotating a span
over multiple lines. The interface also works pretty well on touch
devices. Alongside the interface, we’ve also included
a few recipes for annotating relations for different types of tasks. The dep.correct recipe lets you manually correct
a dependency parser using the labels you care about. Optionally, you can also let it update the
model in the loop with your annotations. The coref.manual recipe lets you annotate
data for coreference resolution, for instance to resolve references to a person to one single
entity. This recipe allows you to focus on nouns,
proper nouns and pronouns specifically, by disabling all other tokens. This also helps you enforce consistency and
speeds up the annotation process. And finally, rel.manual lets you build fully
custom semi-automated workflows that use a model or match patterns to pre-highlight noun
phrases and entities, and patterns to decide which tokens to disable. One thing that's always tricky about annotating
relationships is that the task is inherently complex and there are a lot of steps involved. One of Prodigy's core philosophies is to not
just provide an interface to do something, but to also try and reimagine the task to
make it more efficient, and provide ways to automate everything that a machine can do
just fine. When we tried out different types of relation
annotation, we realised that they typically had a few things in common that we could take
advantage of to make the process more efficient: First, not everything matters: for many tasks,
you can pretty easily define tokens that you know are pretty much never part of a relation
or entity. This could be articles, verbs or punctuation,
or even a pre-defined word list. Prodigy lets you define match patterns to
disable tokens so you can focus on what matters. This is also really helpful for data consistency:
if you have an annotation policy that states that articles like “the” should never
be part of entities or relations, you don’t have to rely on your annotators to remember
and implement that. You can just add a rule to disable those tokens
and make them unselectable. You typically also want to assign relations
to consistent units like tokens, phrases or named entities. Those afte often produced by an upstream process,
like a pretrained named entity recognition model. The relations workflows let you specify patterns
or use a model to pre-label spans and keep the units you’re connecting consistent. Here’s an example from the BioNLP Shared
Task on biomedical event extraction. We’re loading in data that’s pre-annotated
with genes and gene products. So if you already have existing named entity
annotations, you’ll be able to build on top of them and won’t have to start from
scratch. We’re also using patterns to disable all
tokens that are not nouns, proper nouns, verbs and adjectives, or not our pre-annotated genes
and gene products, because we know that those tokens will likely be irrelevant.
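To give you a rough idea, the disable patterns are just token-based match patterns, a bit like this sketch (the exact patterns and file layout we used may differ, so treat it as an illustration):

```python
# Illustrative sketch of token-disabling patterns, written as spaCy
# Matcher-style token patterns. In practice you'd store them in a JSONL
# patterns file and pass that file to the recipe.
disable_patterns = [
    # disable every token that isn't a noun, proper noun, verb or adjective
    {"pattern": [{"POS": {"NOT_IN": ["NOUN", "PROPN", "VERB", "ADJ"]}}]},
    # disable punctuation as well
    {"pattern": [{"IS_PUNCT": True}]},
]
```

When we start the server, we can now add more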
trigger words and spans in span highlighting mode, and then connect them in the relation
annotation mode. Don’t worry if you don’t understand the
annotation scheme or what any of this means in detail – I don’t either! But there might be similarly complex tasks
in the specific domain you’re working with, and you’ll be able to set up the annotation
workflow to fit your use case and domain and make it as efficient as possible. We currently don’t have an out-of-the-box
component to do general-purpose relationship prediction in spaCy, so you’d have to bring
your own implementation. But the data format you can export includes
all relevant information about the annotated relations, including the character offsets
into the text and the spans the relations refer to.
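To give you an impression, a single annotated example comes out looking roughly like this (the values and field layout here are a sketch, so check the docs for the exact format):

```python
# Rough sketch of an exported relations example (values are illustrative).
# Each relation stores its label, the head and child token positions, and
# the spans they refer to, with character offsets into the text.
example = {
    "text": "Acme Corp hired Ann Smith.",
    "relations": [
        {
            "label": "EMPLOYER",
            "head": 3,    # token position of the head
            "child": 0,   # token position of the child
            "head_span": {"start": 16, "end": 25, "token_start": 3,
                          "token_end": 4, "label": "PERSON"},
            "child_span": {"start": 0, "end": 9, "token_start": 0,
                           "token_end": 1, "label": "ORG"},
        }
    ],
}
```

Another cool addition in version 1.10 is the set of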
new interfaces and workflows for annotating audio and video files. This opens up many new annotation possibilities
for tasks like speaker diarization, audio classification, transcription and more. The audio.manual recipe lets you stream in
audio files and highlight regions for the given labels by clicking and dragging. Regions can overlap, and you can move and
resize them as you annotate. Each segment is saved as an audio span with
its start and end timestamp.
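In the saved task, that looks roughly like this (the key names and values here are a sketch, so double-check the docs):

```python
# Rough sketch of an annotated audio task (times in seconds, values illustrative)
task = {
    "audio": "interview_01.mp3",
    "audio_spans": [
        {"start": 2.4, "end": 7.9, "label": "SPEAKER_1"},
        {"start": 6.1, "end": 11.3, "label": "SPEAKER_2"},  # regions can overlap
    ],
}
```

You can also stream in pre-annotated data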
and correct the existing regions. Instead of audio files, you can also load
in video files by setting --loader to “video”. You’ll then see the video together with
the waveform of the audio track. This can be helpful if you’re annotating
who’s speaking, because the video can hold a lot of clues. The workflow is still called “audio”,
because what you’re ultimately annotating here is the audio track. The audio and audio_manual UIs can also be
combined with other interfaces using “blocks” – for instance, you can use an audio block,
followed by a text input to transcribe audio files.
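As a rough idea of what that can look like as a custom recipe (the Audio loader import and the text field options here are assumptions, so check them against the docs):

```python
import prodigy
from prodigy.components.loaders import Audio  # assumption: built-in audio loader

@prodigy.recipe("audio-transcribe-sketch")
def audio_transcribe_sketch(dataset, source):
    # Minimal sketch: an audio player block followed by a free-form text field
    blocks = [
        {"view_id": "audio"},
        {"view_id": "text_input", "field_id": "transcript",
         "field_rows": 4, "field_label": "Transcript"},
    ]
    return {
        "dataset": dataset,        # dataset to save the annotations to
        "stream": Audio(source),   # stream in audio files from a directory
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
```

This workflow is also available as the built-in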
audio.transcribe recipe. A nice little detail here is that you can
easily customise the keyboard shortcuts used to play and pause the audio, so it doesn’t
clash with anything you’re typing in the text field. Here, I’ve mapped it to command plus enter. As you know, Prodigy is a fully scriptable
annotation tool that you can configure with Python scripts, also called “recipes”. This lets you build custom workflows and automate
the annotation process using your own logic and even pretrained machine learning models. To take the audio workflows to the next level,
I’ve teamed up with Hervé, who is the developer of the pyannote.audio library. Not only did he provide a lot of valuable
feedback to improve Prodigy’s audio annotation capabilities, he also implemented a bunch
of experimental workflows and recipes that let you label audio data with a model in the
loop. pyannote.audio is an open-source framework
built on top of PyTorch that provides neural building blocks for speaker diarization – basically,
detecting whether speech is present or not, and segmenting audio by speaker identity. If you’re doing machine learning with audio,
you should definitely check it out! Here’s an example of one of the experimental
Prodigy workflows for speech activity detection – detecting whether someone is speaking
or not. The upcoming version of pyannote.audio will
ship with built-in Prodigy recipes, so if you have both packages installed, Prodigy
will automatically detect the recipes. Here, we’re using the sad.manual recipe
for manual speech activity detection. To help us annotate faster, the pretrained
model assigns the label SPEECH to the detected speech regions. We can then adjust the regions if needed by
resizing or dragging them, and hit “accept” when we’re done. As you can see, there’s a lot of potential
here for model-assisted audio annotation, and it’s pretty exciting to see a model
suggest the regions for us. If you’re interested in the topic and how
it works in detail, check out the links in the description. There are also experimental workflows for
annotating speaker change that you can try out on your data. As well as the two new interfaces, Prodigy
v1.10 also includes various updates to the existing interfaces, especially the manual
image UI. You can now resize and move existing shapes
by clicking and dragging. To select an existing shape, just click on
its label. If you’re annotating lots of boxes, you
can also toggle whether the labels are shown or not, to make sure they’re not covering
too much of the image. If you want to select a shape with no labels
visible, you can hold down the shift key – or any custom key you specify in the settings. We’ve also added a third annotation mode
for freehand shapes. Another cool new setting that can make bounding
box annotation faster and more efficient is the image_manual_from_center setting. If enabled, you can draw a bounding box by
starting with its center and then moving outwards. For many complex objects, the center is often
much more obvious than any of the corners, so starting from the center saves you time
and often means fewer adjustments and false starts. When you export your annotations, the JSON
representation includes all the information you need:
For bounding boxes, Prodigy now also outputs the width and height of the bounding box,
as well as its x and y coordinates and the bounding box center. So no matter which format your model needs
to be updated with, it should all be available in the data. All other shapes are represented by “points”,
the (x, y) coordinates of the path. The “type” field indicates whether the
shape is a rectangle, a polygon or a freehand shape.
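For example, a single annotated box might come out looking roughly like this (the coordinates and the exact "type" strings are illustrative):

```python
# Rough sketch of one annotated bounding box in the exported data
box = {
    "label": "CAR",
    "type": "rect",                      # polygons and freehand shapes differ
    "x": 120.5, "y": 84.0,               # top-left corner
    "width": 230.0, "height": 145.5,
    "center": [235.5, 156.75],
    "points": [[120.5, 84.0], [120.5, 229.5],
               [350.5, 229.5], [350.5, 84.0]],
}
```

Of course you can also stream in pre-labelled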
data, for example produced by your existing image model. It just needs to follow Prodigy’s JSON format. You can then see the model’s predictions
in the annotation UI and correct its mistakes. It’s also pretty efficient on touch devices,
by the way! Annotating data for named entity recognition
is probably one of the most popular use-cases of Prodigy. If you’ve worked with Prodigy before, you
know that it uses a pretty cool efficiency trick and lets you annotate tokenized text. This means that your selection can snap to
the token boundaries, and it also helps keep your data consistent. After all, you’re typically predicting labels
over tokens, and if your annotations don’t match the tokens, this makes your life a lot
harder. One cool example I created for 1.10 is an
example recipe for more efficient NER annotation for fine-tuning transformer models like BERT. Transformers typically use subword tokenization
algorithms like WordPiece or Byte Pair Encoding, also called BPE, that are optimized for efficient
embedding of large vocabularies. That’s also why the tokens don’t always follow
what’s typically considered a “word”. This recipe uses the tokenizers library
by Hugging Face under the hood and pre-tokenizes the text so you can make sure your annotations
are compatible with the tokenizer you’re using. Here, the tokenizer splits the word “bieber”
into two word pieces, also indicated by the hash symbols. That’s interesting, but also a bit ugly,
so we can go back and set --hide-wp-prefix to hide those prefixes and make the text more
readable. We can now annotate the entities and we’ll
know that spans we select will always be compatible with the tokenization. The data format produced by Prodigy retains
the original texts and offsets produced by the tokenizer, as well as the
encoded IDs. The recipe is not built-in — it’s available
as a separate script in our recipes repo. It’s still experimental and you probably
want to adjust it to your specific use case anyways. To use it, just point the -F flag to the path
of the recipe script. By the way, this recipe was made possible
by another new feature of Prodigy: tokens provided in the data can now specify a “ws”
key, mapped to a boolean value that indicates whether the token is followed by whitespace.
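For the word piece example from before, the pre-tokenized input might look roughly like this (the offsets and the exact split are just for illustration):

```python
# Sketch of pre-tokenized input using the new "ws" key: "bieber" is split
# into two word pieces, and "ws": False means no whitespace follows the token.
task = {
    "text": "justin bieber",
    "tokens": [
        {"text": "justin", "id": 0, "start": 0,  "end": 6,  "ws": True},
        {"text": "bie",    "id": 1, "start": 7,  "end": 10, "ws": False},
        {"text": "ber",    "id": 2, "start": 10, "end": 13, "ws": True},
    ],
}
```

By default, Prodigy will respect this, so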
you can keep your text readable, while still enforcing token boundaries. Of course, there are still use cases where
you might want to annotate characters instead – for instance, if you’re training a character-based
model, if you’re training a tokenizer, or if you want to create data to detect and correct
encoding and OCR issues in your text. The new --highlight-chars flag on the ner.manual
recipe lets you toggle character-based highlighting. Prodigy 1.10 also introduces new callbacks you
can use in your custom recipes to customise and configure your annotation workflows. before_db lets you modify annotated tasks
before they’re placed in the database, and validate_answer lets you validate an answer
and give live feedback to the annotator in the UI. One of Prodigy’s principles is that the
data saved in the database should always reflect exactly what the annotator saw in the UI. This means you’ll always be able to reconstruct
the original question and the annotation decision. That’s also why you should use the before_db
callback with caution, because you don’t want to accidentally destroy data. But it can still be useful to prevent database
bloat. For example, if you’re annotating images,
audio or video, especially with a model in the loop, you often want to convert the data
to bytes so it can be processed by the model, and then pass the data to the web app as a
base64-encoded string. However, saving that base64 string with every
example can easily lead to database bloat. The before_db callback lets you remove that
string and replace it with the image path.
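Here’s a minimal sketch of what that can look like in a custom recipe, assuming each task dict keeps the original file path under a “path” key:

```python
def before_db(examples):
    # Swap bulky base64-encoded images back to their file paths before the
    # annotations are saved to the database ("path" is an assumed key here).
    for eg in examples:
        if eg.get("image", "").startswith("data:"):
            eg["image"] = eg["path"]
    return examples

# Returned from the recipe as a component, e.g.:
# return {"dataset": dataset, "stream": stream, "view_id": "image_manual",
#         "before_db": before_db}
```

The validate_answer callback is another feature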
that we’re hoping can be very useful. It lets you define a function that is called
on every example that the annotator submits in the UI. If the function raises an error, the error
message is shown as an alert in the UI and the answer can’t be submitted. It’s a simple function that takes the task
dictionary, so you can perform any custom checks and either use “assert” or raise
any Python exception. Because it’s in Python, you can do pretty
much anything here – if it’s fast enough, you could even use your model for extra checks. Here’s an example of the callback we use
in the new dep.correct recipe to validate dependency parses. If the ROOT label is used, we show an error
if the annotated parse doesn’t contain exactly one root, because we know that’s invalid. This helps prevent mistakes and keeps your
data clean and consistent. Prodigy 1.10 also features various new UI
settings that let you customize the appearance and annotation experience. First, we introduced a “ui_lang” setting
to change the language used for descriptions in the annotation UI. We currently have translations for German,
Dutch, Spanish and Chinese, with more to come! Even if it’s just a few small descriptions
and tooltips, it can make a big difference and make your annotators feel a lot more comfortable. Here’s an example of the Chinese UI translation. Note that the “ui_lang” setting only affects
the descriptions and labels in the annotation app and has nothing to do with the language
of the text you’re annotating. You can annotate any text in any language,
together with any of the available UI translations. We also added more settings to customise the
appearance and what’s shown in the app. Using the “buttons” setting, you can change
the buttons that are displayed at the bottom of the screen. For example, you might want to disable the
“reject” action for some manual annotation tasks and only allow annotators to accept
or to skip. You can also use the “project_info” key
to customize the sections that are shown in the “project info” block in the sidebar. For example, the name of the dataset, the
name of the current recipe or the ID of the interface. You can reorder fields or remove them, depending
on what you need.
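Put together, those settings could look roughly like this (the exact “project_info” values are assumptions, so check the docs for the supported section names):

```python
# Sketch of the UI settings described above, shown as a Python dict.
# In practice they'd live in your prodigy.json or a recipe's "config".
config_overrides = {
    "ui_lang": "zh",                          # Chinese UI descriptions
    "buttons": ["accept", "ignore", "undo"],  # hide the "reject" button
    "project_info": ["dataset", "recipe", "view_id"],  # assumed section names
}
```

Finally, a few smaller but notable features: To load data from an existing dataset back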
in, you can now also use the “dataset colon dataset name” shorthand as the input source. An optional colon plus the answer lets you
load only examples with a given answer, like “accept”. In this example, we’re loading all accepted
named entity annotations from an existing dataset back into the manual relations recipe
to assign relations between the entities. The text_input interface now supports “field_suggestions”,
a list of auto-suggestions that are shown when the annotator selects the field, types
or hits the down arrow key.
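For example, a text input block with suggestions could be configured roughly like this (the field values are made up):

```python
# Sketch of a text_input block with auto-suggestions (values are illustrative)
block = {
    "view_id": "text_input",
    "field_id": "genre",
    "field_suggestions": ["news", "sports", "opinion", "satire"],
}
```

There’s a lot more to explore, and you can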
find the full list of new features and improvements in the changelog. I’m extremely excited about this release
and it was also a lot of fun to develop and test the new features. Thanks to everyone who helped with early beta
testing as well. If you’re already using Prodigy, I hope
you’ll enjoy trying out the new features. If you’re not using Prodigy yet but you’re
curious, I’ve added all relevant links to the description below. Thanks for watching, and hope to see you again
soon!