Hi, I'm Ines. I'm the co-founder of Explosion, a core developer
of the spaCy Natural Language Processing library and the lead developer of Prodigy. Prodigy is a modern annotation tool for creating
training data for machine learning models. It helps developers and data scientists train
and evaluate their models faster. Version 1.10 is probably our biggest release
so far, and it includes tons of new features. So I thought I'd record a video to give you
a little walkthrough. In this video, I'll show you some of the new
features that I'm most excited about, including: A completely new interface for manual relationship
and dependency annotation, and even joint span and relation annotation. New recipes for creating data for dependency
parsing, coreference resolution and fully custom semi-automated relations. New interfaces for annotating audio and video
files. Recipes for labelling audio segments and transcribing
audio files and videos. Recipes for creating training data for speaker
diarization. These recipes are powered by the awesome pyannote.audio
library, and even let you have a model in the loop. There's also a new and revamped manual image
annotation UI with support for modifying and resizing bounding boxes and shapes and various
new settings. And also, new settings for the manual NER
and span annotation UI, including character-based highlighting and a cool recipe for annotating
named entity recognition data for fine-tuning transformer models like BERT. Finally, I'll also quickly show a few smaller
features, like new recipe callbacks, UI config options and more. By the way, I've included the timestamps in
the video description, so you can skip ahead to the features you're most interested in. So let’s get started with one of the new
features: annotating dependencies and relations. The relations interface lets you assign relationships
and dependencies between tokens and spans. To assign a relation, click the two tokens
you want to connect. You can choose between viewing the text with
line breaks, which is especially nice if you have fewer, but longer dependencies, or in
one line, which looks more similar to a dependency tree. If you’re annotating relations between spans
like named entities, you can load in pre-labelled data with spans, or annotate dependencies
and spans jointly. Just switch into span annotation mode and
drag across the tokens you want to select. If your selection turns green, you’re good
to go. If it turns red, it doesn’t contain a valid
span – either because it includes disabled tokens, or because it overlaps with an existing
span. Alternatively, you can also hold down shift
and click the start and end of the span you want to add. This is useful if you’re annotating a span
over multiple lines. The interface also works pretty well on touch
devices. Alongside the interface, we’ve also included
a few recipes for annotating relations for different types of tasks. The dep.correct recipe lets you manually correct
a dependency parser using the labels you care about. Optionally, you can also let it update the
model in the loop with your annotations. The coref.manual recipe lets you annotate
data for coreference resolution, for instance to resolve references to a person to one single
entity. This recipe allows you to focus on nouns,
proper nouns and pronouns specifically, by disabling all other tokens. This also helps you enforce consistency and
speeds up the annotation process. And finally, rel.manual lets you build fully
custom semi-automated workflows that use a model or match patterns to pre-highlight noun
phrases and entities, and patterns to decide which tokens to disable. One thing that's always tricky about annotating
relationships is that the task is inherently complex and there are a lot of steps involved. One of Prodigy's core philosophies is to not
just provide an interface to do something, but to also try and reimagine the task to
make it more efficient, and provide ways to automate everything that a machine can do
just fine. When we tried out different types of relation
annotation, we realised that they typically had a few things in common that we could take
advantage of to make the process more efficient: First, not everything matters: for many tasks,
you can pretty easily define tokens that you know are pretty much never part of a relation
or entity. This could be articles, verbs or punctuation,
or even a pre-defined word list. Prodigy lets you define match patterns to
disable tokens so you can focus on what matters. This is also really helpful for data consistency:
if you have an annotation policy that states that articles like “the” should never
be part of entities or relations, you don’t have to rely on your annotators to remember
and implement that. You can just add a rule to disable those tokens
and make them unselectable. You typically also want to assign relations
to consistent units like tokens, phrases or named entities. Those afte often produced by an upstream process,
like a pretrained named entity recognition model. The relations workflows let you specify patterns
or use a model to pre-label spans and keep the units you’re connecting consistent. Here’s an example from the BioNLP Shared
Task on biomedical event extraction. We’re loading in data that’s pre-annotated
with genes and gene products. So if you already have existing named entity
annotations, you’ll be able to build on top of them and won’t have to start from
scratch. We’re also using patterns to disable all
tokens that are not nouns, proper nouns, verbs and adjectives, or not our pre-annotated genes
and gene products, because we know that those tokens will likely be irrelevant.
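To give you a rough idea, the disable patterns are just token-based match patterns, a bit like this sketch (the exact patterns and file layout we used may differ, so treat it as an illustration):

```python
# Illustrative sketch of token-disabling patterns, written as spaCy
# Matcher-style token patterns. In practice you'd store them in a JSONL
# patterns file and pass that file to the recipe.
disable_patterns = [
    # disable every token that isn't a noun, proper noun, verb or adjective
    {"pattern": [{"POS": {"NOT_IN": ["NOUN", "PROPN", "VERB", "ADJ"]}}]},
    # disable punctuation as well
    {"pattern": [{"IS_PUNCT": True}]},
]
```

When we start the server, we can now add more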
trigger words and spans in span highlighting mode, and then connect them in the relation
annotation mode. Don’t worry if you don’t understand the
annotation scheme or what any of this means in detail – I don’t either! But there might be similarly complex tasks
in the specific domain you’re working with, and you’ll be able to set up the annotation
workflow to fit your use case and domain and make it as efficient as possible. We currently don’t have an out-of-the-box
component to do general-purpose relationship prediction in spaCy, so you’d have to bring
your own implementation. But the data format you can export includes
all relevant information about the annotated relations, including the character offsets
into the text and the spans the relations refer to.
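To give you an impression, a single annotated example comes out looking roughly like this (the values and field layout here are a sketch, so check the docs for the exact format):

```python
# Rough sketch of an exported relations example (values are illustrative).
# Each relation stores its label, the head and child token positions, and
# the spans they refer to, with character offsets into the text.
example = {
    "text": "Acme Corp hired Ann Smith.",
    "relations": [
        {
            "label": "EMPLOYER",
            "head": 3,    # token position of the head
            "child": 0,   # token position of the child
            "head_span": {"start": 16, "end": 25, "token_start": 3,
                          "token_end": 4, "label": "PERSON"},
            "child_span": {"start": 0, "end": 9, "token_start": 0,
                           "token_end": 1, "label": "ORG"},
        }
    ],
}
```

Another cool addition in version 1.10 is the set of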
new interfaces and workflows for annotating audio and video files. This opens up many new annotation possibilities
for tasks like speaker diarization, audio classification, transcription and more. The audio.manual recipe lets you stream in
audio files and highlight regions for the given labels by clicking and dragging. Regions can overlap, and you can move and
resize them as you annotate. Each segment is saved as an audio span with
its start and end timestamp.
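In the saved task, that looks roughly like this (the key names and values here are a sketch, so double-check the docs):

```python
# Rough sketch of an annotated audio task (times in seconds, values illustrative)
task = {
    "audio": "interview_01.mp3",
    "audio_spans": [
        {"start": 2.4, "end": 7.9, "label": "SPEAKER_1"},
        {"start": 6.1, "end": 11.3, "label": "SPEAKER_2"},  # regions can overlap
    ],
}
```

You can also stream in pre-annotated data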
and correct the existing regions. Instead of audio files, you can also load
in video files by setting --loader to “video”. You’ll then see the video together with
the waveform of the audio track. This can be helpful if you’re annotating
who’s speaking, because the video can hold a lot of clues. The workflow is still called “audio”,
because what you’re ultimately annotating here is the audio track. The audio and audio_manual UIs can also be
combined with other interfaces using “blocks” – for instance, you can use an audio block,
followed by a text input to transcribe audio files.
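As a rough idea of what that can look like as a custom recipe (the Audio loader import and the text field options here are assumptions, so check them against the docs):

```python
import prodigy
from prodigy.components.loaders import Audio  # assumption: built-in audio loader

@prodigy.recipe("audio-transcribe-sketch")
def audio_transcribe_sketch(dataset, source):
    # Minimal sketch: an audio player block followed by a free-form text field
    blocks = [
        {"view_id": "audio"},
        {"view_id": "text_input", "field_id": "transcript",
         "field_rows": 4, "field_label": "Transcript"},
    ]
    return {
        "dataset": dataset,        # dataset to save the annotations to
        "stream": Audio(source),   # stream in audio files from a directory
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
```

This workflow is also available as the built-in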
audio.transcribe recipe. A nice little detail here is that you can
easily customise the keyboard shortcuts used to play and pause the audio, so it doesn’t
clash with anything you’re typing in the text field. Here, I’ve mapped it to command plus enter. As you know, Prodigy is a fully scriptable
annotation tool that you can configure with Python scripts, also called “recipes”. This lets you build custom workflows and automate
the annotation process using your own logic and even pretrained machine learning models. To take the audio workflows to the next level,
I’ve teamed up with Hervé, who is the developer of the pyannote.audio library. Not only did he provide a lot of valuable
feedback to improve Prodigy’s audio annotation capabilities, he also implemented a bunch
of experimental workflows and recipes that let you label audio data with a model in the
loop. pyannote.audio is an open-source framework
built on top of PyTorch that provides neural building blocks for speaker diarization – basically,
detecting whether speech is present or not, and segmenting audio by speaker identity. If you’re doing machine learning with audio,
you should definitely check it out! Here’s an example of one of the experimental
Prodigy workflows for speech activity detection – detecting whether someone is speaking
or not. The upcoming version of pyannote.audio will
ship with built-in Prodigy recipes, so if you have both packages installed, Prodigy
will automatically detect the recipes. Here, we’re using the sad.manual recipe
for manual speech activity detection. To help us annotate faster, the pretrained
model assigns the label SPEECH to the detected speech regions. We can then adjust the regions if needed by
resizing or dragging them, and hit “accept” when we’re done. As you can see, there’s a lot of potential
here for model-assisted audio annotation, and it’s pretty exciting to see a model
suggest the regions for us. If you’re interested in the topic and how
it works in detail, check out the links in the description. There are also experimental workflows for
annotating speaker change that you can try out on your data. As well as the two new interfaces, Prodigy
v1.10 also includes various updates to the existing interfaces, especially the manual
image UI. You can now resize and move existing shapes
by clicking and dragging. To select an existing shape, just click on
its label. If you’re annotating lots of boxes, you
can also toggle whether the labels are shown or not, to make sure they’re not covering
too much of the image. If you want to select a shape with no labels
visible, you can hold down the shift key – or any custom key you specify in the settings. We’ve also added a third annotation mode
for freehand shapes. Another cool new setting that can make bounding
box annotation faster and more efficient is the image_manual_from_center setting. If enabled, you can draw a bounding box by
starting with its center and then moving outwards. For many complex objects, the center is often
much more obvious than any of the corners, so starting from the center saves you time
and often means fewer adjustments and false starts. When you export your annotations, the JSON
representation includes all the information you need:
For bounding boxes, Prodigy now also outputs the width and height of the bounding box,
as well as its x and y coordinates and the bounding box center. So no matter which format your model needs
to be updated with, it should all be available in the data. All other shapes are represented by “points”,
the (x, y) coordinates of the path. The “type” field indicates whether the
shape is a rectangle, a polygon or a freehand shape.
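For example, a single annotated box might come out looking roughly like this (the coordinates and the exact "type" strings are illustrative):

```python
# Rough sketch of one annotated bounding box in the exported data
box = {
    "label": "CAR",
    "type": "rect",                      # polygons and freehand shapes differ
    "x": 120.5, "y": 84.0,               # top-left corner
    "width": 230.0, "height": 145.5,
    "center": [235.5, 156.75],
    "points": [[120.5, 84.0], [120.5, 229.5],
               [350.5, 229.5], [350.5, 84.0]],
}
```

Of course you can also stream in pre-labelled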
data, for example produced by your existing image model. It just needs to follow Prodigy’s JSON format. You can then see the model’s predictions
in the annotation UI and correct its mistakes. It’s also pretty efficient on touch devices,
by the way! Annotating data for named entity recognition
is probably one of the most popular use-cases of Prodigy. If you’ve worked with Prodigy before, you
know that it uses a pretty cool efficiency trick and lets you annotate tokenized text. This means that your selection can snap to
the token boundaries, and it also helps keep your data consistent. After all, you’re typically predicting labels
over tokens, and if your annotations don’t match the tokens, this makes your life a lot
harder. One cool example I created for 1.10 is an
example recipe for more efficient NER annotation for fine-tuning transformer models like BERT. Transformers typically use subword tokenization
algorithms like WordPiece or Byte Pair Encoding, also called BPE, that are optimized for efficient
embedding of large vocabularies. That’s also why the tokens don’t always follow
what’s typically considered a “word”. This recipe uses the tokenizers library
by Hugging Face under the hood and pre-tokenizes the text so you can make sure your annotations
are compatible with the tokenizer you’re using. Here, the tokenizer splits the word “bieber”
into two word pieces, also indicated by the hash symbols. That’s interesting, but also a bit ugly,
so we can go back and set --hide-wp-prefix to hide those prefixes and make the text more
readable. We can now annotate the entities and we’ll
know that spans we select will always be compatible with the tokenization. The data format produced by Prodigy retains
the original texts and offsets produced by the tokenizer, as well as the
encoded IDs. The recipe is not built-in — it’s available
as a separate script in our recipes repo. It’s still experimental and you probably
want to adjust it to your specific use case anyways. To use it, just point the -F flag to the path
of the recipe script. By the way, this recipe was made possible
by another new feature of Prodigy: tokens provided in the data can now specify a “ws”
key, mapped to a boolean value that indicates whether the token is followed by whitespace.
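For the word piece example from before, the pre-tokenized input might look roughly like this (the offsets and the exact split are just for illustration):

```python
# Sketch of pre-tokenized input using the new "ws" key: "bieber" is split
# into two word pieces, and "ws": False means no whitespace follows the token.
task = {
    "text": "justin bieber",
    "tokens": [
        {"text": "justin", "id": 0, "start": 0,  "end": 6,  "ws": True},
        {"text": "bie",    "id": 1, "start": 7,  "end": 10, "ws": False},
        {"text": "ber",    "id": 2, "start": 10, "end": 13, "ws": True},
    ],
}
```

By default, Prodigy will respect this, so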
you can keep your text readable, while still enforcing token boundaries. Of course, there are still use cases where
you might want to annotate characters instead – for instance, if you’re training a character-based
model, if you’re training a tokenizer, or if you want to create data to detect and correct
encoding and OCR issues in your text. The new --highlight-chars flag on the ner.manual
recipe lets you toggle character-based highlighting. Prodigy 1.10 also introduces new callbacks you
can use in your custom recipes to customise and configure your annotation workflows. before_db lets you modify annotated tasks
before they’re placed in the database, and validate_answer lets you validate an answer
and give live feedback to the annotator in the UI. One of Prodigy’s principles is that the
data saved in the database should always reflect exactly what the annotator saw in the UI. This means you’ll always be able to reconstruct
the original question and the annotation decision. That’s also why you should use the before_db
callback with caution, because you don’t want to accidentally destroy data. But it can still be useful to prevent database
bloat. For example, if you’re annotating images,
audio or video, especially with a model in the loop, you often want to convert the data
to bytes so it can be processed by the model, and then pass the data to the web app as a
base64-encoded string. However, saving that base64 string with every
example can easily lead to database bloat. The before_db callback lets you remove that
string and replace it with the image path.
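Here’s a minimal sketch of what that can look like in a custom recipe, assuming each task dict keeps the original file path under a “path” key:

```python
def before_db(examples):
    # Swap bulky base64-encoded images back to their file paths before the
    # annotations are saved to the database ("path" is an assumed key here).
    for eg in examples:
        if eg.get("image", "").startswith("data:"):
            eg["image"] = eg["path"]
    return examples

# Returned from the recipe as a component, e.g.:
# return {"dataset": dataset, "stream": stream, "view_id": "image_manual",
#         "before_db": before_db}
```

The validate_answer callback is another feature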
that we’re hoping can be very useful. It lets you define a function that is called
on every example that the annotator submits in the UI. If the function raises an error, the error
message is shown as an alert in the UI and the answer can’t be submitted. It’s a simple function that takes the task
dictionary, so you can perform any custom checks and either use “assert” or raise
any Python exception. Because it’s in Python, you can do pretty
much anything here – if it’s fast enough, you could even use your model for extra checks. Here’s an example of the callback we use
in the new dep.correct recipe to validate dependency parses. If the ROOT label is used, we show an error
if the annotated parse doesn’t contain exactly one root, because we know that’s invalid. This helps prevent mistakes and keeps your
data clean and consistent. Prodigy 1.10 also features various new UI
settings that let you customize the appearance and annotation experience. First, we introduced a “ui_lang” setting
to change the language used for descriptions in the annotation UI. We currently have translations for German,
Dutch, Spanish and Chinese, with more to come! Even if it’s just a few small descriptions
and tooltips, it can make a big difference and make your annotators feel a lot more comfortable. Here’s an example of the Chinese UI translation. Note that the “ui_lang” setting only affects
the descriptions and labels in the annotation app and has nothing to do with the language
of the text you’re annotating. You can annotate any text in any language,
together with any of the available UI translations. We also added more settings to customise the
appearance and what’s shown in the app. Using the “buttons” setting, you can change
the buttons that are displayed at the bottom of the screen. For example, you might want to disable the
“reject” action for some manual annotation tasks and only allow annotators to accept
or to skip. You can also use the “project_info” key
to customize the sections that are shown in the “project info” block in the sidebar. For example, the name of the dataset, the
name of the current recipe or the ID of the interface. You can reorder fields or remove them, depending
on what you need.
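Put together, those settings could look roughly like this (the exact “project_info” values are assumptions, so check the docs for the supported section names):

```python
# Sketch of the UI settings described above, shown as a Python dict.
# In practice they'd live in your prodigy.json or a recipe's "config".
config_overrides = {
    "ui_lang": "zh",                          # Chinese UI descriptions
    "buttons": ["accept", "ignore", "undo"],  # hide the "reject" button
    "project_info": ["dataset", "recipe", "view_id"],  # assumed section names
}
```

Finally, a few smaller but notable features: To load data from an existing dataset back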
in, you can now also use the “dataset colon dataset name” shorthand as the input source. An optional colon plus the answer lets you
load only examples with a given answer, like “accept”. In this example, we’re loading all accepted
named entity annotations from an existing dataset back into the manual relations recipe
to assign relations between the entities. The text_input interface now supports “field_suggestions”,
a list of auto-suggestions that are shown when the annotator selects the field, types
or hits the down arrow key.
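For example, a text input block with suggestions could be configured roughly like this (the field values are made up):

```python
# Sketch of a text_input block with auto-suggestions (values are illustrative)
block = {
    "view_id": "text_input",
    "field_id": "genre",
    "field_suggestions": ["news", "sports", "opinion", "satire"],
}
```

There’s a lot more to explore, and you can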
find the full list of new features and improvements in the changelog. I’m extremely excited about this release
and it was also a lot of fun to develop and test the new features. Thanks to everyone who helped with early beta
testing as well. If you’re already using Prodigy, I hope
you’ll enjoy trying out the new features. If you’re not using Prodigy yet but you’re
curious, I’ve added all relevant links to the description below. Thanks for watching, and hope to see you again
soon!