IMAGE CAPTIONING ANNOTATION with Prodigy & PyTorch: custom scriptable machine learning annotation

Captions
Hi, I'm Ines. I'm the co-founder of Explosion, a core developer of the spaCy natural language processing library, and the lead developer of Prodigy. Prodigy is a modern annotation tool for creating training data for machine learning models. It helps developers and data scientists train and evaluate their models faster. A key feature of Prodigy is that it's fully scriptable. This makes data loading and preprocessing really easy and also lets you add automation to make your annotation more efficient. You can even put together your own interfaces, which is especially helpful for very custom use cases. Of course Prodigy also comes with a variety of built-in workflows for different machine learning and NLP tasks, like text classification or named entity recognition. If you want to see those in action, check out the other tutorials on this channel. In this video, I'll show you how you can use Prodigy to script fully custom annotation workflows in Python, how to plug in your own machine learning models and how to mix and match different interfaces for your specific use case. And the use case we'll be working on in this video is image captioning. Image captioning technology has come a long way and we can now train models that are able to produce pretty accurate natural language descriptions of what's depicted in an image. This is very useful for assistive technologies because we can generate alternative text for images so that people who are using screen readers can get a description of an image. And we want to create a dataset of image descriptions, use an image captioning model implemented in PyTorch to suggest captions and perform error analysis to find out what the model is getting right and where it needs improvement. So I was looking around for images that we could work with and I ended up downloading over a thousand pictures of cats from this dataset on Kaggle. So yeah, we'll be captioning cats! And let's start with the basics and let's get our cats on the screen. Prodigy isn't just a command line tool and web app, it's also a Python library that you can import and work with. Under the hood, Prodigy is powered by recipes, which are Python functions that define annotation workflows. Recipes can implement logic to stream in data or define callbacks that are executed when the server receives new answers. You can turn any function into a recipe by adding the "@prodigy.recipe" decorator. The first argument is the recipe name. This is how you're going to call your recipe from the command line. And here we're calling it "image-caption" so when you run the recipe later you can type "prodigy image-caption" on the command line. And any arguments of that function will be available as command line arguments. So you can use them to pass in settings and to really make your recipes fully reusable. So what arguments do we need? At a minimum, we should be able to pass in the name of the dataset to save the annotations to. Datasets in Prodigy let you group annotations together so you can later reuse them in the application or export them to use them in a different process. And then we also want to pass in the path to the directory of images that we want to annotate. And now, how is the recipe going to tell Prodigy what to do and how to start the annotation server? Well, a recipe returns a dictionary of components that define the settings for the annotation server. So this is the other convention. A recipe needs the recipe decorator and it returns a dictionary of components. 
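A minimal sketch of that skeleton, assuming a file called "recipe.py", might look like this (it's not a complete recipe yet — the stream and the interface are added next):

```python
import prodigy


@prodigy.recipe("image-caption")          # the name used on the command line
def image_caption(dataset, images_path):  # arguments become CLI arguments
    return {
        "dataset": dataset,  # dataset to save the annotations to
        # more components (stream, view_id, ...) follow below
    }
```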
That's all you need to be able to run it with Prodigy. And one of the components we need to return is the name of the dataset to save the annotations to. So we can just pass that right through from our arguments. And next, we need a stream that streams in the examples that we want to annotate. And Prodigy doesn't make you import any data upfront if you want to annotate it. That's all done in the recipe. And for images there's a built-in image loader, which we can import from "prodigy.components.loaders". And it takes a path to a directory of images and then it loads them in the expected format. So we can pass in the path that we receive as an argument here and our recipe will be able to load any directory of images that we give it. And to make the stream available, we can return it as the component "stream". And now we also need to define how Prodigy should present the data by setting the "view_id" which is the name of the annotation interface to use. And there's a variety of different interfaces available that are built in and you can see examples of them and the data format they expect in the docs. For example, there's an interface to show plain text, there are interfaces for highlighting spans of text and there's also an interface called "image" which lets you present... an image! Yay. And the data it expects is of course exactly what our image loader creates. So we're going to be using that as the view ID. So let's take a look at what we have so far and start the server. We saved our recipe file as "recipe.py" so we can now navigate to the terminal and run the "image-caption" recipe, because that's the name we've defined via the recipe decorator. And if you remember, our function had two arguments, which we can use on the command line. First, the name of the dataset that we want to save the annotations to. So let's use a temp name just so we can experiment a bit. And the second argument is the path to our directory of images, which is our cats. And finally, we need to tell Prodigy where to find our recipe code, because that's just an arbitrary file and of course it can't just magically know that. So we can use the "-F" argument on the command line and point it to the file. And now when we hit enter, Prodigy will start up and it will serve our recipe. So yay, let's check this out in the browser. And as you can see, it worked. The images are here and we can go through them, we can accept them, we can reject them and yeah, we've successfully queued up our cats for annotation. And if your goal is to curate images then you could stop right here. That's all you need. But of course we don't just want to go through images, we also want to caption them. So we need a second interface which is the text input interface. And text inputs are something you should always use with caution. It's very very tempting to just make everything a free-form input and make the annotator type everything in and then call it a day. But the data you are collecting this way is very very unstructured and there's so much potential for human error and just for typos. And if your goal is to train a machine learning model you typically want to have as much control as possible over what's being annotated. And also how it's being annotated. So if you're creating training data and you can avoid free-form input, don't use free-form input, use something else. Make sure you can control the structured data that you're collecting. But for our use case we do want to create captions so we need a way to input them.
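Before moving on to the text input, here's a sketch of the basic image recipe and the command line call described above, assuming the built-in Images loader from "prodigy.components.loaders" (the dataset name and image path are illustrative):

```python
import prodigy
from prodigy.components.loaders import Images


@prodigy.recipe("image-caption")
def image_caption(dataset, images_path):
    stream = Images(images_path)  # loads and encodes images from the directory
    return {
        "dataset": dataset,   # dataset to save the annotations to
        "stream": stream,     # the examples to annotate
        "view_id": "image",   # built-in image interface
    }

# Illustrative command line call, pointing Prodigy at the recipe file with -F:
#   prodigy image-caption caption_test ./cats -F recipe.py
```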
And the best way to do that is to use a text input field. And to combine multiple interfaces and any custom content, Prodigy comes with a "blocks" interface. Here's an example where we have a manual span labelling interface that's followed by a free-form input and an arbitrary HTML block that embeds a song from SoundCloud. So this is all possible and you can pretty much do whatever you want here and freely combine multiple UIs and different pieces of content. But again, I always recommend against just cramming everything you need into a single interface. You'll get much much better results if you let annotators focus on a single concept at a time and you don't overwhelm them and present too many different tasks at once with too much cognitive load and too many steps that have to be performed in order. Because that's exactly the stuff that we as humans are bad at. And if you want to collect human feedback you should be focusing on the type of stuff that we're good at. And you don't want to additionally include lots of human mistakes this way and end up with even worse data overall. So, again, if you can keep it simple, keep it simple and avoid just doing everything in one interface. But in general, the "blocks" interface does let you combine a few interfaces and if that makes sense for your use case that's very cool. So if we go back to our image captioning problem, what we want to do here is, we want to show the image and we want to show a text input. So all we need to do is use the blocks interface and then add two blocks: one image block and then one text input block. And we can then set our blocks config in the "config" that's returned by the recipe. And the text input interface also allows a bunch of additional settings like the label that's displayed at the top of the field, the placeholder text that's shown if the field is empty and says something like "type something here", and whether the field should autofocus and also the ID of the field that's going to be added. And this is going to be the key that's used to store the text later on. So as you can see here, if the "field_id" is "user_input", the text the user types in will be available as "user_input" in the annotated task. So for our use case, let's change that to "caption" so the caption text that the user types in will be available as the key "caption". And we'll also set it to autofocus so that whenever a new task comes on we're immediately in the field and don't have to click on it. And the block here lets you override properties that apply to all tasks. And that's very nice because we don't want to bloat our data by adding "autofocus: true" to every single example that we annotate. So we can set that on the block and then it's applied to all of the tasks and it's not going to end up in our data. So now let's start the annotation server again and take a look. And yay, here we have our custom blocks interface and it shows the image block, followed by a text input block. Just like we wanted. So yeah, here's the first one that's just a simple white cat. So I'll try to keep the captions a bit simple. Then here we have two cats in the snow, I guess. That's how we would caption this. And here, cat, black cat, on a couch. That's pretty easy. And yeah, so we can kind of go through them here and I'll try to type something sensible for each of them. I don't know how detailed we want to have our captions here. So I try to keep them a bit uniform. 
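For reference, a sketch of the blocks-based recipe described above might look like this, with the image block followed by a text input block whose "field_id" is set to "caption":

```python
import prodigy
from prodigy.components.loaders import Images


@prodigy.recipe("image-caption")
def image_caption(dataset, images_path):
    blocks = [
        {"view_id": "image"},
        {
            "view_id": "text_input",
            "field_id": "caption",    # the typed text is stored under this key
            "field_autofocus": True,  # focus the field when a new task loads
        },
    ]
    return {
        "dataset": dataset,
        "stream": Images(images_path),
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
```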
Okay, so we got 14 annotations, so let's hit "save" and let's go back to the terminal and check out the annotations. I totally didn't expect that captioning images of cats would be so difficult, wow. So we can use the "db-out" command to export any dataset that we have. And we can either forward that output to a file or we can just pipe it to "less" so that we can take a look at the data quickly on the command line. And as you can see, we first have this really long string here, which is the base64-encoded image data. And by default, Prodigy's image loader will encode the image as a string. So that it can be saved with the examples. And this has two advantages: one is, it lets us load the images very easily in the browser. And it also means that the data is stored with the annotations. And this way, we can make sure that we never lose the reference to the original image. And of course you can change that and only store the file name of the image. But then you're in charge of keeping the files safe and making sure that they never change or move and you never actually lose the files and the references here. But in our case our cat images are pretty small so we'll just keep them with the data for now and let it convert it to a string, because it doesn't take up that much space. And, as you can see, with the image, Prodigy also stores some metadata and the caption that we typed in earlier. So this is all working pretty well already. Now, one thing we already noticed earlier is that writing all these captions from scratch is pretty tedious. And it's also much harder than you'd think when you hear "caption some cats". So maybe we can improve that and automate this a bit. Or maybe you already have a model that does image captioning and that you want to improve. So I had a look around online and I found this PyTorch tutorial and this example for an image captioning system. And I'll share the link in the description below if you want to take a closer look and of course I'm also sharing all code and data that I'm producing and working with in this video. So this image captioning system here uses a CNN as the encoder pretrained on the ImageNet dataset. And then we have an LSTM as the decoder. So essentially, what goes in is an image and then the model extracts a feature vector from it. And based on that, the decoder, which was trained with a language modelling objective, then outputs a sequence of tokens. So for this image here in this example, it will predict "giraffes standing next to each other". And what I liked about this example was that a) the code was very straightforward and well-written and I also didn't have to implement anything myself. And it also comes with a pretrained model that you can download from the repo. So it's very easy to try it out and get started. And I'm not going to go through the exact implementation in detail because that's going to be too much for this video and also, if you're serious about image captioning, you probably want to be using a different model implementation anyways. But if you want to take a look, I've combined it all into a very straightforward Python module and here I also wrote two functions that make it easy to access and use the model in our recipe script. We have one function here that loads the model and returns the encoder, decoder, vocabulary and image preprocessing transform. And then we have a second function that takes those objects and the base64-encoded image data, loads the image from bytes and generates the caption. 
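The model loading and caption sampling follow the PyTorch tutorial linked in the description, so I won't reproduce them here, but the Prodigy-specific part of the second helper — turning the base64-encoded image data from the task back into a PIL image — might look roughly like this (the function name is just for illustration):

```python
import base64
import io

from PIL import Image


def image_from_task(image_data):
    # Prodigy stores images as base64 strings, typically as a data URI like
    # "data:image/jpeg;base64,...", so strip the prefix before decoding.
    if image_data.startswith("data:"):
        image_data = image_data.split(";base64,", 1)[1]
    return Image.open(io.BytesIO(base64.b64decode(image_data))).convert("RGB")
```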
So let's go back to the editor and add another recipe for the model-assisted image captioning. We'll call this recipe "image-caption.correct", just like Prodigy's built-in recipes like "ner.correct" for model-assisted named entity annotation. And it takes the same arguments again: the dataset to save the annotations to and then the path to our directory of images. So here's the skeleton. At the end of it, we need to return a dictionary of components, starting with the dataset. And let's also copy over our blocks from the other recipe. Next, we need a stream. Streams in Prodigy are generators, so typically, functions that return an iterator that you can iterate over, one at a time. And if you haven't worked with generators before, one of the big advantages, in a nutshell, is that we don't need to consume the whole stream at once. Instead, we can do it in batches. So you can have gigabytes of data and process that with a model, which takes a while, and it still works fine because Prodigy only ever needs to process one batch at a time. And then it sends it out for annotation and then it needs to process the next one when the queue is running low and when it requests the next batch. So the best way to define our stream generator is to write a function. And first, we load the images again which gives us a stream of dictionaries in Prodigy's format. And each dictionary here will have an image key that contains the encoded image data. And now we can import our model helper functions from that other module that I showed you earlier and we can load our model. And I typically like to do that kind of stuff at the very top of the function so if something fails your script can stop as early as possible, basically. And now for each example in our stream, we can call our "generate_caption" function and pass it the image data. I always like to use the variable "eg", which stands for "example" because it's short and it's still descriptive enough. So that's going to be what I'm using for our loop variable here. But... what are we going to do with this? Remember, we were going to pre-fill our text box with that caption that's produced by the model. And the nice thing is that Prodigy always lets you pass in pre-annotated data and it will respect that and display the annotations to you, if it has the same format as the data that Prodigy produces. So since we're adding the caption that we type to the field "caption", we can just pre-populate that and it will show up in the text field. That's pretty cool, right? Now, because it's a generator, instead of returning something, we yield the example and send it out. And then to create our stream, we just need to call the function and return it as a recipe component. And don't worry, it's a generator, so calling that function won't just run it all at once. It will only run when we need it and it will run for one single batch. So yeah, let's try it out and run the recipe! So we've called this one "image-caption.correct" and it takes the same arguments as the previous recipe. And it's in the same file, so we don't need to change that. And you can have multiple recipes in the same file, that's absolutely no problem. They just need to have distinct names. And if we head back to the browser now, we can see the first caption. Wow! Our model actually predicted something and it's not even so far off. So that's really really cool. And if we change the caption here, this will also be reflected in the "caption" value of the annotated task that's saved to the database. 
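Putting that together, a sketch of the model-assisted recipe might look like this. The "model" module and the exact signatures of load_model and generate_caption are placeholders based on the description above:

```python
import prodigy
from prodigy.components.loaders import Images

# hypothetical helper module wrapping the PyTorch captioning model
from model import load_model, generate_caption


@prodigy.recipe("image-caption.correct")
def image_caption_correct(dataset, images_path):
    def get_stream():
        # load the model up front so the script fails early if something is wrong
        encoder, decoder, vocab, transform = load_model()
        for eg in Images(images_path):
            # pre-fill the text field with the model's suggested caption
            eg["caption"] = generate_caption(encoder, decoder, vocab, transform, eg["image"])
            yield eg

    blocks = [
        {"view_id": "image"},
        {"view_id": "text_input", "field_id": "caption", "field_autofocus": True},
    ]
    return {
        "dataset": dataset,
        "stream": get_stream(),  # the generator is consumed one batch at a time
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
```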
Let's do a few of them and correct the captions. I'm trying to stay as close as possible to the original captions here because I guess that's probably the best approach. But as you can see, we can just edit them and then move on to the next example. And what's pre-filled here, that's all coming from the model that we plugged in here and that I downloaded earlier. And one thing that would actually be pretty helpful is if we could keep track of what we're changing here. Because otherwise we won't necessarily remember what the original caption was and whether it changed or not and how it changed and how good the model was. So let's go back and add a key for the original caption to the examples. And Prodigy lets you add arbitrary properties to the examples that you stream in. They just need to be JSON-serializable, that's all. And the data will then be passed straight through and saved in the database with the example. So even as we change the value of the key "caption", the original caption will stay the same. So we'll always keep track of that. And another thing we can do is, we can keep a count of how often we change the caption and how often we keep the model's suggestion. And then when we exit the server, we could just print the results. So if later on you end up changing or updating your captioning model you can get a quick summary after each annotation session. And maybe you end up changing more or maybe you've accepted the model's captions more often. It's always good to have some real numbers at the end so you don't just have to rely on how you felt the model was doing. Prodigy recipes let you provide an "update" callback that is one of the components the recipe returns. And that function is called every time new answers are sent back to the server from the web app. And in some of the built-in active learning recipes this callback is used to update a model in the loop because that makes sense, right? But in our case, we can just use it to update our counts. And it receives a list of answers in Prodigy's format, so that's the same format as our stream, only that it contains the annotation. So in our case the potentially edited caption and the "accept", "reject" or "ignore" answer, depending on which button we clicked. So we can loop over the answers here and we'll only increment the counts if we actually accepted the annotation, so if the value of "answer" is "accept". Next, we can simply compare the value of the original caption to the value of the actual caption. And if it's different, that means that we've changed it in the UI, and then we can increment the "changed" counter. And otherwise, we know it wasn't changed and we increment the "unchanged" counter. So that's it. And to make the "update" callback available to Prodigy, we return it from the recipe under the name "update". And now we just need to output the results when we exit the server. So for that we can use the "on_exit" callback that's called... on exit! And it receives what we call the controller as an argument, which gives us access to a bunch of stuff related to the current annotation session. But we don't need that here so we can ignore it for now. And all we need to do is print our counts. And then we add the "on_exit" callback to the components returned by the recipe. Now let's see this in action and collect a few annotations! Yeah, those are all the ones that we already saw earlier. And the model obviously has a limited vocabulary which becomes pretty apparent here.
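A sketch of those additions might look like the following, assuming the stream in the previous recipe also stores the model's suggestion under a key like "orig_caption" (e.g. eg["orig_caption"] = eg["caption"]) before yielding each example:

```python
counts = {"changed": 0, "unchanged": 0}


def update(answers):
    # called by Prodigy every time the web app sends back a batch of answers
    for eg in answers:
        if eg["answer"] == "accept":
            if eg["caption"] != eg.get("orig_caption"):
                counts["changed"] += 1
            else:
                counts["unchanged"] += 1


def on_exit(controller):
    # called once when the annotation server shuts down
    print("Captions changed:  ", counts["changed"])
    print("Captions unchanged:", counts["unchanged"])


# Both callbacks are returned as recipe components, e.g.:
#   return {..., "update": update, "on_exit": on_exit}
```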
So I am trying to match that a bit and not make it too complicated and also leave as much as possible so that we're only correcting what's really necessary. So again, some of them are actually quite difficult or not immediately obvious. Yeah, I dunno... what is this cat even doing? I don't know. How detailed do we want to be here? It's kind of unclear. It also ultimately depends on your application and what you want to do with this model. Like, does it matter for your model that the cat is lying on its side or is all you care about the cat and roughly whether the cat's doing something or the color? Yeah, some of them, like here for example: wow. Yeah, I had to look twice, but it's actually a table. So it could just be a lucky... thing. Is that Gumpy Cat? I think it's Grumpy Cat, right? Or it's the same type of cat at least. That cat I guess is kind of... table-colored? What's going on here? Also, what's up with the model? Probably, I dunno, "remote" was something that was common in the dataset and common in the captions that the model was pretrained on. So I guess that's why it's sort of hallucinating these arbitrary items like a remote. Or it also tends to hallucinate other cats, which is quite interesting. But to be fair, I mean, it does get most of them right and most of them are cats. Okay, so I don't even know anymore. So let's stop here. Let's hit "save". And we've collected 25 annotations in total and in the data we're storing the changed caption and the original caption. And let's just take a quick look at the data again using the "db-out" command just to see. And here we have the image data again, followed by the caption we wrote, plus the original caption produced by the model. And I'd say the model was surprisingly good, considering it's just a machine predicting things, right? And it was also clearly struggling, but that's kind of expected of an arbitrary model downloaded off the internet that was not trained on any data that was specific to my problem at all. But let's imagine for a second that this was your model. And a model that you've carefully developed and pretrained and that you now want to fine-tune on more annotations. And at this point, you've maybe done enough work to convince your stakeholders that yes, computers can indeed generate image captions now and it's totally worth investing in this. But you don't yet have a good answer for what you're going to do to improve the model and what problems to even focus on. Like, yeah, we've corrected the captions, but it kinda felt like shooting in the dark. And collecting unstructured free-form text makes things more challenging as well, because we don't have an immediately obvious way to evaluate the data that we've collected. So I want to add a new workflow that lets us go through the annotations again and specifically those where we changed the predicted caption. And then I want to label why we've changed the caption. For example, because it said "polar bear" but it was a cat. So in that case the subject was different. Or maybe the description of the background was incorrect, or the items around it or maybe it hallucinated a cat that wasn't there. And doing this as a separate error analysis step makes a lot of sense here because we don't want to mix the data creation with the evaluation and with the error analysis. And also, we probably want to make this a multiple-choice question and after we've started labelling, we might want to change the options because we forgot something or something new came up.
And if this is all mixed in with the rest of the annotation, going back and re-doing parts becomes really really annoying. So this is a separate workflow and let's go back to the editor and call this recipe "image-caption.diff" because essentially, we want to diff the two captions and collect feedback on the differences. And our recipe takes two arguments for now: first, the dataset to save the annotations to because we also need to save the diffed data with the feedback. And then also the source dataset. Because in this case, we're not actually reading examples from a file or from a directory, but we want to read them from an existing dataset. And doing that in Prodigy is pretty easy because there's a Python API for the database that you can interact with. And the easiest way to connect is by using the "connect" helper. And then we get a database object and then we can call the "db.get_dataset" method to load all examples from a dataset. So now, we need our stream. So let's write a generator function again. And we want to loop over the examples in the dataset and only look at the ones we've accepted, not the ones we've skipped or ignored. And of course we're only interested in the annotations where the final caption is different from the original caption, so the ones that were edited in the UI. And only if those are different, we yield the example and send it out for annotation. So, how are we going to present this most effectively in the UI and in the app? Fun fact, Prodigy actually has an interface for visually diffing text. But it doesn't really work so well here because we have quite a few changes and typically it's several words that we remove and add, so a visual diff can actually make it a bit harder to read the diff, which kind of defeats the purpose. So I don't think it's a good use case for that. I think the diff interface is much better if you're doing something like spelling correction. So instead, let's just set up two blocks and use HTML and write our own little blocks. And the "html_template" feature here is pretty cool because it lets you write HTML and reference any properties in the example, as variables. So for the original caption, the variable references that value and that field in the annotation task dict. So you can structure your data however you like and then write an HTML template to define how to display it. And let's do something really simple here and make this caption half transparent so it shows up a bit fainter because that's the one that we changed. And then in the next block we use the value of "caption", which is the final edited caption. And if you want to style it, you can add some HTML around it, but I'll just keep it super simple for now and just use the HTML template here for consistency. And let's finish up the recipe skeleton and check this out in the app quickly to see where we're at and if everything displays correctly. And we call this recipe on the command line again and we have one dataset that we'll save the annotations to and then we use the name of our previous dataset that we added the annotations to. And then, here we go! The before and after captions, displayed together, in the "blocks" UI. Yay, so that's also working. Great! And now, we just need some multiple-choice options so that we can select the types of changes made to the caption. And for this we can use the "choice" interface. And it can render options as multiple or single choice, depending on the use case. And the format is actually pretty straightforward.
Each option has an ID, which is used under the hood and can be any integer or string, whatever you want. And then we have a text, which is the display text. That's what you or what the annotator is going to see. I always like adding emoji because it's a very very simple and effective way to add some visual distinction. So in our recipe specifically, we could then go and define the following options. So we have subject, which basically is the wrong subject. So "polar bear" instead of "cat", and so on. And then attributes, that's like, if it says "black" and the cat's actually white, stuff like that. Background, that's sort of everything around it, the setting. And then we have number, that's wrong number of subjects. If it says one cat but it's two cats, and vice versa, we would tick that. And I've added one for wording because maybe we will want to change just some general wording and spelling. So that should get its own point. And then I've added one for "other" that you can tick if something else was edited as well. So I'll also go ahead and add some emoji here to make it easier to tell them apart. That's always my favorite part. And you really only want to define options here that need human feedback and that you cannot determine programmatically. Like, there's really no point in doing stuff like "added words" or "removed words". That's just wasting the annotator's time. When designing custom annotation tasks I think this is always a very valuable question to ask yourself: What do I really need from the human? And what can I infer programmatically and not actually have the human do, because the human will likely do it worse than a machine? So yeah, here we have our options now. And we just need to attach them to each annotation task that goes out. And then in our recipe config we'll also want to set "choice_style" to "multiple" to allow multiple selections because multiple of these options can be true for any given caption. So let's restart the server and check it out in the app. Here we have our options, complete with emoji. And by default, the "choice" interface will display an image or a text if it's present in the task. And since we have an image, like the cats, it's displayed so we don't have to add it separately. And we already have it here together with the captions so we can always check what's going on. And if you don't want the image and you say, hey, I actually just want to focus on the text, I don't want that kind of distraction, then you could just go back and override it in the block. That's no problem either. And if you prefer, you can also use the number keys on your keyboard to select the options. And you can even map the options to custom keys if you like. So it's really up to you and what you find most efficient to work with. So now before we keep going, we might as well implement some counting again so that we can see results immediately when we exit the server and get a breakdown of the options that were selected, so after each session you can see what the most common problems were. And to do that, we just go back and add an "update" callback again. And let's also use a proper counter this time to make things a bit easier because we have lots of options and they might change. So we're just looking at the accepted annotations. And we also look at the options that were selected in the UI. And those will be added as a new key called "accept". And it's a list, and it includes the ID values of the options that were selected in the UI.
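Putting the error analysis recipe together — reading from the source dataset, filtering for accepted and edited captions, attaching the options, and counting the selected option IDs in the "update" callback, as described above and continued below — a sketch might look like this. It builds on the previous sketches (so "orig_caption" is assumed to exist), and the option IDs, emoji, HTML styling and block order are illustrative:

```python
from collections import Counter

import prodigy
from prodigy.components.db import connect

OPTIONS = [
    {"id": "subject", "text": "🐱 wrong subject"},
    {"id": "attributes", "text": "🎨 wrong attributes"},
    {"id": "background", "text": "🏞 wrong background or setting"},
    {"id": "number", "text": "🔢 wrong number of subjects"},
    {"id": "wording", "text": "✏️ general wording or spelling"},
    {"id": "other", "text": "❓ other"},
]


@prodigy.recipe("image-caption.diff")
def image_caption_diff(dataset, source_dataset):
    db = connect()
    examples = db.get_dataset(source_dataset)  # annotations from the source dataset
    counts = Counter()

    def get_stream():
        for eg in examples:
            # only accepted examples where the caption was actually edited
            if eg["answer"] == "accept" and eg["caption"] != eg["orig_caption"]:
                eg["options"] = OPTIONS  # attach the multiple-choice options
                yield eg

    def update(answers):
        for eg in answers:
            if eg["answer"] == "accept":
                for option_id in eg.get("accept", []):  # IDs of selected options
                    counts[option_id] += 1

    def on_exit(controller):
        for option_id, count in counts.most_common():
            print(option_id, count)

    blocks = [
        {"view_id": "html", "html_template": "<p style='opacity: 0.5'>{{orig_caption}}</p>"},
        {"view_id": "html", "html_template": "<p>{{caption}}</p>"},
        {"view_id": "choice"},  # renders the options (and the image in the task)
    ]
    return {
        "dataset": dataset,
        "stream": get_stream(),
        "update": update,
        "on_exit": on_exit,
        "view_id": "blocks",
        "config": {"blocks": blocks, "choice_style": "multiple"},
    }
```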
And then for each of those options that were selected, we increment the counts in our counter. And I think you could also just call "counts.update" here with the list of IDs, but writing it like this makes it a bit more obvious what's going on and what's happening here with our counter. And then, just as before, on exit we go and print those counts and kind of print them nicely so that we get a good overview. So let's start it up again and see, and do some annotation. Yeah, so here we have... wrong background or setting. And I guess also here, this is the setting, and it's also the number because it's two cats. Okay, so here, yeah, that was a bit unfair. I annotated that unfairly. So here we get to "no tasks available", which means we've annotated all of the examples that were in the previous dataset, minus the ones, obviously, where we just accepted the caption. So let's hit "save" here so that our answers are sent back to the server. And then let's check out the breakdown of the different options. So when we exit the server and we press control + c for example, we'll then get a breakdown of the different options and how often we selected them. And all of this information is, of course, also saved with the data in the database. So here's the very raw structured data again. We have the image data as a string, we have the captions, and then we have the feedback that we just collected in the "accept" key of the annotation task. And that maps to those IDs that we defined. And you can use that to produce some useful stats about your model's performance or even start and use the corrected captions as training data for your model and improve it. Or maybe you want to implement another recipe that takes the captions we wrote manually, finds words that are not in the model's vocabulary and then suggests alternative words that are in the vocab, using something like word vectors. Or maybe you want to write a recipe that suggests different possible captions, based on the highest-scoring candidates that your model produces. There are so many possibilities and of course there's a lot more that you can do with custom recipes in Prodigy. I hope that Prodigy makes it easy to implement your ideas in a way that fits your personal workflow and your machine learning stack. If there's something that you want to load or process and you can do it in Python, you'll be able to use it in your annotation workflows in Prodigy. I hope you enjoyed this video! To find out more about Prodigy, check out the website and docs. And as always, I've added the links and everything else you need to the description below. And if you have any questions, feel free to get in touch and talk to us on the forum.
Info
Channel: Explosion
Views: 4,238
Rating: 4.8241758 out of 5
Keywords: artificial intelligence, ai, machine learning, spacy, natural language processing, nlp, active learning, data science, big data, annotation, data annotation, text annotation, language models, pytorch, image captioning, image annotation, neural networks, python
Id: zlyq9z7hdUA
Length: 39min 25sec (2365 seconds)
Published: Tue Mar 24 2020