Active Learning: Why Smart Labeling is the Future of Data Annotation | Alectio

Video Statistics and Information

Captions
[Music] Hi everyone. Today I'll be discussing active learning and why it matters for data annotation and for building great training data sets. Let me start by introducing myself quickly. As was mentioned, I'm the VP of Machine Learning at a company called Figure Eight. Figure Eight used to be called CrowdFlower, so many people know us by that name. We're essentially the enterprise version of Amazon Mechanical Turk: large companies and AI startups come to us to get training data, because when you collect data you usually don't have labels for it, and you have to get those labels somewhere. Prior to joining Figure Eight I was at Atlassian as their chief data scientist; Atlassian, if you don't know it, is the company that makes Jira, Bitbucket, Confluence, and so on. Before that I worked at Walmart Labs, heading data science research for the search team. A lot of the work I've done has been related to getting good training data. I accidentally became the person in charge of it, and that's how I learned how important high-quality training data is and how difficult it is to get good labels. I'm also part of the expert network at the International Institute for Analytics, where I consult with many different types of companies about different topics, including data labeling. One advantage I have is that I've seen many different types of data and I know a little bit about most of the use cases out there.

So, back to Figure Eight specifically: our mission is to help AI companies and AI teams across the world get high-quality training data so they can train their machine learning models, and I'll show you what is difficult about getting training data today. Here is the agenda for this talk: I'll start by talking about big data and why it matters for machine learning, then we'll talk about the challenge of data labeling, and then I'll give you an active learning tutorial and some tips about how to get started with active learning.

So why big data? I think everybody knows we live in an era of big data; it's everywhere. People sometimes ask how come this is happening right now, and why it makes data scientists and machine learning specialists so happy. That's what we're talking about today. We all know the amount of data available worldwide is growing really quickly. One metric I like to cite is that roughly every 18 to 24 months we double the amount of data we collect. It's growing extremely fast: if you think in those terms, it means that every 18 to 24 months we generate as much data as the rest of humankind before us, which is completely crazy. You've probably also seen the "what happens in an internet minute" graphics: a lot is happening, because now we have social media and IoT, and a huge amount of data is generated simply because we're very active online. So you can ask: how come this is happening now?
It almost seems like we've suddenly become interesting as a species, but that's just not the case. Why do we see so much data today? Number one, there is a surge in new data formats. Ten years ago most of the data we collected was text; today you see many different types of data, like lidar (the kind of data a self-driving car collects), video, lots and lots of images, and audio, and so on. The landscape of the types of data we work with as data scientists has changed. It's also because we have new technologies, and they help us collect more data for several reasons. We have better and more storage, we have cloud technologies, and it's easier and easier to transfer data quickly from point A to point B. We also have social media, which empowers us to generate and share data with the rest of the world on a day-to-day basis and at every moment of our lives. Similarly, IoT helps us generate data at the personal level: your phone is now capable of recording pretty much everything you do, and the same goes for your Fitbit. An opportunity has been created to generate data at the personal level, whereas a few years back it was more of an enterprise-level activity. But there is another reason data is growing so quickly, and that is data duplication. This is important for the rest of the talk, so I'm going to spend some time here. When you search Google for a technical topic, you will often get the same type of results. I searched for "matplotlib installation" and got many different results, but if you click on each of them you get pretty much the same page, with exceptions around versioning or really minor changes. At the end of the day all of these pages carry the same information. You see the same problem when people retweet and re-share things over and over on social media. More data does not necessarily equate to more information, and that has become a problem for data scientists: big data, which was originally an opportunity, is also becoming a liability, because you have more and more data to process.

Now that we've talked about big data, I'd like to share my views on why I really think proper labeling is important. Computer vision is really hot these days, but it has actually been around for a while, and so has the theory that surrounds computer vision and deep learning. The concepts behind deep learning and neural networks were invented in the 1940s, and for computer vision there was a project started in academia in the late 1960s: people believed that over the course of one summer we would be able to have computers understand images better than humans. As we know, we have been trying to solve that problem for a long time, and it did not happen in the 60s, because the problem is much, much bigger than we thought it would be. But as we all know, there has been huge progress in computer vision very recently, starting around the beginning of the 2010s.
We have this challenge I think everybody here knows, ImageNet: a data set that people use to try to improve on the previous state of the art, and as of 2015 we have models that understand images better than humans; we have surpassed human capability in computer vision. How did this happen so recently? My guess is that it happened because of the famous labeled data set called ImageNet. ImageNet is a data set of about 14 million images labeled across roughly 20,000 different categories. You publish this data set and all of a sudden you have a flurry of new ideas and improved models in computer vision. That tells me that if you have high-quality training data you can actually do a lot with it, and it is why we could never do it before: getting 14 million labeled images is just not something you can do on your own.

All right, so now that I hope everybody is convinced that big data, and labeled big data in particular, is key to advancing machine learning research, let's talk about labeling and what happens when you label. We are facing a huge problem: our problem as data scientists a few years ago was to collect data, and now it has shifted to labeling data. I call that the big data labeling crisis. We have reached the point where even if every single human on the planet did nothing else but label data, we wouldn't have enough manpower to take care of everything. That's a problem, because you have all of this data available to you but you cannot do anything with it unless it's labeled properly. You have two ways to address the problem: either you label faster, using a machine learning algorithm, or you label smarter, and we'll talk about that in a minute. I'll start by explaining what labeling faster means. This is an example of a tool we developed at Figure Eight recently: our video annotation tool. Video has been a problem because if you are a self-driving car company — Tesla, Uber, Waymo — you need to put bounding boxes, basically rectangles, around the people and the other object types in your ontology on an image before you can train a model. This data has to be extremely high quality, because if you miss a pedestrian in your image, that can make the difference between life and death when you actually deploy the software in a self-driving car. The problem is that a typical video has 30 frames per second, and in a suburban scene like this you can have 20 to 30 different cars per frame. So for 10 seconds of video, that's 10 seconds times 30 frames times 20 boxes, which would already take about one hour and 40 minutes for a person to label manually, and that assumes one second per box, which is really unrealistic; in real life it would probably take around three to four hours. This is just not workable, because that was ten seconds, and if I'm Waymo I don't need 10 seconds, I need ten thousand hours of video annotated properly with different ontologies, et cetera.
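As a quick sanity check on that estimate, here is the arithmetic from the example as a tiny script (the frame rate, object count, and seconds-per-box figures are the illustrative numbers quoted above, not measured values):

```python
# Back-of-the-envelope annotation cost for the 10-second video example.
video_seconds = 10
frames_per_second = 30
objects_per_frame = 20      # a busy suburban scene: 20-30 vehicles per frame
seconds_per_box = 1         # optimistic; 2-3 seconds per box is more realistic

total_boxes = video_seconds * frames_per_second * objects_per_frame
minutes = total_boxes * seconds_per_box / 60
print(f"{total_boxes} boxes -> about {minutes:.0f} minutes of manual labeling")
# 6000 boxes -> about 100 minutes, i.e. roughly 1 hour 40 minutes
```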
What we've done is create a tool where you annotate the first frame, the algorithm tracks the objects across the following frames, and you just ask humans to validate what they see. It's much, much faster: it increases the speed of annotation by a factor of about 100. This is one example of the things we do, and we do the same for NLP and for images: we put predictive bounding boxes on images, we do automated segmentation, and so on. It shows that we have reached the point where you need to use machine learning to provide the training data for other people to do machine learning, which is a pretty cool area to be in. Typically you would use object detectors that already exist on the market, such as YOLO or SSD; these are open-source algorithms that help you identify objects in an image. You know the accuracy is never going to be 100%, so when you have a problematic box, like the one over there, you have a human fix it and put it in the right place, so that you reach a really, really high accuracy. That is how you label faster.
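To make the "label faster" idea concrete, here is a minimal sketch of pre-labeling frames with an off-the-shelf pretrained detector and routing low-confidence boxes to a human for correction. This is an illustration built on torchvision's Faster R-CNN, not Figure Eight's actual tool, and the 0.8 review threshold is an arbitrary assumption:

```python
# Pre-label frames with a pretrained detector; route low-confidence boxes to humans.
# Illustrative sketch only -- not Figure Eight's production pipeline.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

REVIEW_THRESHOLD = 0.8  # arbitrary cutoff: boxes below this go to a human annotator

def prelabel(frames):
    """frames: list of float tensors (C, H, W) scaled to [0, 1]."""
    with torch.no_grad():
        outputs = model(frames)
    auto_accepted, needs_review = [], []
    for frame_idx, out in enumerate(outputs):
        for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
            record = (frame_idx, box.tolist(), int(label), float(score))
            if score >= REVIEW_THRESHOLD:
                auto_accepted.append(record)   # keep as a predicted pre-label
            else:
                needs_review.append(record)    # send to a human to fix or confirm
    return auto_accepted, needs_review
```

In a production pipeline the accepted boxes would seed an object tracker across subsequent frames, which is the part that gives the large speedup described above.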
Is that good enough? In my opinion it's not, and there are other things you can do. There are other problems related to data labeling that are not just about volume; they are about the difficulty of getting the right labels. What I just showed — putting a box around a car in an image — is something anyone can do quickly. There are other cases where labels are difficult to obtain, because the label is something a doctor or another specialist needs to provide. In some cases labeling means running blood tests on a patient and waiting some time to get the results back for that specific person. It can be super expensive: if I want to label the position of oil in the ground, to validate my prediction of where I should drill my oil wells, it may cost hundreds of thousands of dollars just to put a label on a single data point. It can be extremely time-consuming: even something simple like identifying the topic of a document may require you to read the entire document, and if it's a long, ten-page document, even a simple label can take several minutes or more, which adds up quickly. It can even be dangerous: if you're trying to confirm whether there really is a landmine where you predicted one, labeling means somebody has to risk their life.

Now I'd like to talk about semi-supervised learning. This is something I guess everybody has heard of, but doesn't necessarily think about when building models. We're all used to supervised learning; it's something like 90% of the machine learning world today, and most of the deep learning applications we have are based on a supervised approach. Supervised learning is when you have labeled data: in this case, if I want a binary classifier between trucks and cars, I need someone — a person or a machine, called an oracle — to provide those labels for me. You can also rely on unsupervised learning. Unsupervised learning is powerful because it doesn't require labels, but you will only be able to cluster, to recreate the different categories for these objects: you see that all the trucks in the blue circle were grouped properly, but you still don't know that this type of object is called a truck. That is the weakness of unsupervised learning. There is a sweet spot in the middle that, in my opinion, we don't leverage enough: semi-supervised learning. It's the approach that consists of taking the labels you have and the unlabeled data you have at the same time and trying to get the best of both worlds by combining them. The approach we've taken so far is that when we have a training set we label everything and use all of it to train our models. But you can actually be smart about how much you label and when you label your data. You can decide to select the data you think is the most valuable for your model prior to building the model, which is called prioritization: you try to identify which of your data points are spam, which ones are harmful to your model, which ones are useless, and which ones are duplicates. Saying that you don't want to use a data point in your training set doesn't necessarily mean the data is bad; maybe it carries information you already have. You can also make the choice of using or not using a specific data row, and querying a label for it, while you're training the model; this is what active learning does. And finally, you can decide to build a model using less data and then use a human to override its decisions: I can build a model that I know is not going to be very accurate, but use a human-in-the-loop approach to take care of the cases where my model fails, or to validate them; it's pretty much the same option we talked about before.
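As one hedged illustration of the prioritization idea above (deciding before labeling which rows are near-duplicates and therefore add little new information), here is a small sketch using a nearest-neighbor search; the embedding features and the distance threshold are assumptions for the example, not a prescribed recipe:

```python
# Flag near-duplicate rows so they can be deprioritized before labeling.
# Sketch only: the 0.05 distance threshold is an arbitrary illustrative choice.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def near_duplicate_mask(features, threshold=0.05):
    """features: (n_samples, n_features) array, e.g. embeddings of the raw data."""
    nn = NearestNeighbors(n_neighbors=2).fit(features)
    distances, _ = nn.kneighbors(features)        # column 0 is the point itself
    return distances[:, 1] < threshold            # True = near-duplicate of something

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))
X_dup = np.vstack([X, X[:50] + 1e-3])             # inject 50 near-copies
mask = near_duplicate_mask(X_dup)
print(f"{mask.sum()} rows flagged as near-duplicates")
```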
I actually think active learning is more powerful than just labeling, and I'll show you why. This is the way we as data scientists used to think about our work: we would first build a data set. I even call that data science PTSD from ten years ago, when it was hard to get data: the mindset was "I need to gather data even if I don't know what I'm going to do with it." So you would not necessarily end up with the right training set, the right features, the right quantity, or the best practices you need around creating a data set, and only later would you take this data and build a model for it. You were forced to adapt the model to the data and not the other way around. Because we have big data today, we can get picky about the data we use, and because gathering and collecting data is becoming easier and easier, you can actually simultaneously build a model and the data set that goes with it. That is what active learning really offers you. So let me show you exactly how this works.

Active learning is basically a process where you incrementally add more and more data into your model as you train it. You start with an unlabeled training set, then you select some rows within this unlabeled set to label, and then, depending on your results, you add more and more. This approach is called pool-based active learning. Let me describe it, and then I'll show you a quick demo of how it works.

Active learning is about getting better learning curves. In the graph over there, the y-axis shows the accuracy, or quality, of a model and the x-axis shows the amount of labeled data. What you see in gray is what you would get if you threw in more data randomly, as incremental supervised learning: you incrementally add more and more data, but there is no intelligence in how you decide what to add; each sample is randomly selected. Of course the curve goes up, because we all know that more data usually means higher accuracy. Active learning is about allowing the model to learn faster: if you have the option of deciding what data to put into the system, and you know which data is going to be the most beneficial for your model, then the model learns faster and reaches higher accuracy sooner. In this case, if you look at the dots on the curve, you can reach 80% accuracy with that specific model using about 45% of the total size of the training set, versus about 70% if you were not using active learning. There are cases where the difference is much bigger; I've seen cases where you only need about 10% of the data to reach the same accuracy.

Here is how it works for pool-based active learning. You have a large pool of unlabeled data available to you; you've collected it. You have to select an initial sample, and the problem with the initial sample is that you don't know anything about the data, because it is not labeled and you've done nothing with it yet, so it's an explore-exploit problem: you have to guess at first which data points are going to be the most beneficial. You take this small sample and send it to an oracle — a human, or possibly a machine — that provides the labels. Now you have a labeled initial sample, and you use this data to train your model. Because you didn't use a lot of data, you don't expect a very high accuracy, but probably higher than if you had randomly selected your training set. So now you have a small labeled training set and a trained model, not with really high accuracy, but available to you. Guess what: you use this model to infer labels on the rest of the data; these are synthetic labels at this point. The cool thing is that since you ran inference, you now have metadata on essentially the entire data set: things like entropy, confidence levels, and the margin between the first-best and second-best predictions, and all this metadata helps you identify the most useful strategy for picking your next sample.
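Here is a minimal, self-contained sketch of that pool-based loop using scikit-learn on synthetic data, with least-confidence sampling as the querying strategy; the model, batch size, and number of rounds are illustrative assumptions, and the "oracle" is simulated by looking up the true labels:

```python
# Pool-based active learning sketch: label a seed set, train, score the unlabeled
# pool by uncertainty, send the most uncertain rows to the "oracle", repeat.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

labeled = list(rng.choice(len(X), size=20, replace=False))   # initial random seed
pool = [i for i in range(len(X)) if i not in set(labeled)]

BATCH, ROUNDS = 20, 10                                        # illustrative settings
for _ in range(ROUNDS):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y_true[labeled])
    proba = model.predict_proba(X[pool])
    uncertainty = 1.0 - proba.max(axis=1)                     # least-confidence score
    query = np.argsort(uncertainty)[-BATCH:]                  # most uncertain rows
    picked = [pool[i] for i in query]
    labeled.extend(picked)                                    # oracle labels = y_true here
    pool = [i for i in pool if i not in set(picked)]
    # accuracy evaluated on the full set, for brevity only
    print(f"labeled={len(labeled)}  accuracy={model.score(X, y_true):.3f}")
```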
Once you've picked that next sample, you send it to the human annotator so they can give you real labels. It's a very powerful technique because at no point are you actually training on synthetic labels; other semi-supervised approaches can be dangerous because you train on synthetically labeled data. Here you only use the synthetic labels to identify which data you should label, and at the end of the day you are using real labels, provided by humans or by a good machine learning model, to tell you what those images really were. Then you start over: you have more data, you add it into your model along with the data you had in the first place, and you hope your accuracy keeps going higher. Eventually you stop, either when you decide your accuracy is high enough, when you run out of labeling budget, or when you run out of time; there are different stopping criteria.

How does this work, intuitively? In this specific case, if I have lots of data points and run a linear regression over them, I get the line you see in the picture; if I add these extra few points over there, you can see the kind of difference it makes. If you were to randomly pick a sample out of all these points — the light blue points and the dark blue points — you would mostly get points from the large light blue cluster, maybe only light blue points, and you would end up with the dotted line instead of the solid line. This tells you that you have to be intelligent about how you sample, and it tells you that the dark blue points are actually more valuable: they put more constraints on your model, and so they help it learn faster. What you have to remember here is that not all rows are created equal.

Now let's go quickly through the different strategies. What you saw is called pooling. There is another type of strategy called streaming. With streaming, you decide whether or not to query a specific data point based on a threshold that you set: you might say, "I'm going to query a label for any data point for which my model had less than 80% confidence," and use that to retrain your model as you go. The problem is that you cannot really constrain the cost: if you're unlucky and all the data you inferred came back below 80% confidence, then you end up labeling all of your data, and you basically have a supervised learning approach again. Pooling is what we saw before, and it's really great for our business case, because people are trying to minimize their budget, and pooling helps you keep your budget under control. So how do you pick data smartly at every single active learning pool? There are lots of different ways to do this. One of them is uncertainty sampling, which is based on confidence levels, on margin sampling, or on the entropy of the predictions.
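A small sketch of those three uncertainty scores, computed from a model's predicted class probabilities (the example probabilities are made up for illustration):

```python
# Three common uncertainty scores used to rank unlabeled rows for querying.
import numpy as np

def least_confidence(proba):
    return 1.0 - proba.max(axis=1)                 # 1 - P(top class)

def margin(proba):
    top2 = np.sort(proba, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]                 # small margin = more uncertain

def entropy(proba):
    return -(proba * np.log(proba + 1e-12)).sum(axis=1)

proba = np.array([[0.95, 0.03, 0.02],              # confident prediction
                  [0.40, 0.35, 0.25]])             # uncertain prediction
print(least_confidence(proba), margin(proba), entropy(proba))
```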
Another strategy is query by committee. With query by committee you use several models, like an ensemble method, and you look at the cases where the models disagree. If they disagree, you consider that this particular data point is difficult to learn, and therefore you add it to your training set, label it, and move forward. Then you have much more sophisticated strategies based on expected model change. This is even cooler, because you take a specific data point and try to predict whether the information you would gain from labeling it would actually be beneficial for your model; at that point you're doing something closer to Monte Carlo simulation. You are basically trying to identify which data points would be the most beneficial for your model.

So why is this so complicated? We've seen that you have pooling and streaming; you also have something else called synthetic query generation, which I won't dig into. Picking the querying strategy is really hard; in fact there is no framework for it, and this is one of the things we started working on at Figure Eight. Usually the way you go about it is to pick something that seems reasonable — for instance, "I'm going to take the 100 lowest confidence levels among the inferred data as the query strategy in my active learning loop" — but you cannot know in advance whether it's going to work, and you won't know whether it's the best one for you. The same goes for the pool size, meaning how much more data you add at every loop, and for how many loops you run before you stop. For all of this there is no framework on the market and there is very little research on the topic; even in academia, people doing research tend to work from guesstimates: "I believe this querying strategy is going to work, I'll try it, and whether it works or not, that's what I report in my paper."

Really quickly, some misconceptions about active learning in general. Active learning is not a type of model; it's a wrapper, or a strategy, around a model. That means you can do deep active learning: you can leverage active learning regardless of the type of model you're using, so you don't have to change your model, you can use the model you have and improve on it. Another misconception is that it's a mathematical formula. It does not need to be: the querying strategy is a sampling strategy you use to identify what the best next pool of data is going to be, and it can be non-deterministic; it can be a specific way to sample that is not random sampling.

Let me give you an example of something that happens on the market. Typically people look at the confidence levels of the data they have inferred and query the data with the lowest confidence. This is a bad strategy, because you might actually be throwing spam into your system, and that would induce bias big time. What we've actually done at Figure Eight is to take a medium-confidence sample instead: within that sample we take some of the lowest-confidence points, some of the highest-confidence points, and some sampled at random in the middle, to make sure we don't induce any biases.
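A hedged sketch of that mixed sampling idea; the confidence band boundaries and the equal three-way split are illustrative assumptions, not Figure Eight's production values:

```python
# Query from a medium-confidence band, mixing its lowest and highest scores with
# a random slice from the middle of the band to avoid systematic bias.
# The 0.4-0.8 band and the equal three-way split are arbitrary illustrative choices.
import numpy as np

def band_sample(confidences, batch_size, low=0.4, high=0.8, seed=0):
    rng = np.random.default_rng(seed)
    band = np.where((confidences >= low) & (confidences <= high))[0]
    if len(band) <= batch_size:
        return band
    order = band[np.argsort(confidences[band])]      # band indices, sorted by confidence
    k = batch_size // 3
    lowest = order[:k]
    highest = order[len(order) - k:]
    middle = rng.choice(order[k:len(order) - k], size=batch_size - 2 * k, replace=False)
    return np.concatenate([lowest, middle, highest])

conf = np.random.default_rng(1).uniform(size=500)    # fake model confidences
print(band_sample(conf, batch_size=30)[:10])
```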
The other problem we've seen is that people believe that to reach the very maximum accuracy you still need to use plain supervised learning at the end of the day, because more data is better. However, if you use active learning, because you are in the business of identifying which data is helpful and which is not, you do a better job with overfitting, since you're ensuring diversity in your training set, and you may also be able to remove spam automatically. So in some cases you actually get a higher accuracy even if you go all the way through your data with active learning, simply because you inject your data in a different order.

So what do I want you to remember? Good data is important for machine learning, especially if it's labeled. We do have a big data labeling crisis these days. As we've seen, you can label faster, and my suggestion is to also label smarter. About active learning specifically: as you've seen, it allows you to improve your accuracy; it is an enhancement of your existing model, so you don't have to give up whatever model you're working on, you just add this piece of intelligence on top of it, because it's a strategy and not a model; and unfortunately it can be difficult, because there is currently no framework to identify the right querying strategy, but there is more research coming on that topic. Thank you. [Music]
Info
Channel: Data Council
Views: 14,854
Rating: 4.9277978 out of 5
Keywords: Active Learning, Smart Labeling, Data Annotation, jennifer prendki figure eight
Id: V33Ut36eUsY
Length: 31min 22sec (1882 seconds)
Published: Wed Jan 02 2019