Lecture 17: Bag-of-Features (Bag-of-Words)

Captions
Okay, so today we are going to talk about a very important method called bag of words, or bag of features. In the last few years this has attracted a lot of attention in the community and it works very well, and the same approach that works for images also works for video: in images we look at different objects, like chairs, bicycles, and people, and in videos we look at actions, human activities like running, walking, and jumping. These slides I got from Cordelia Schmid, a pretty famous researcher at INRIA in France, and I have modified some of them, so I want to give her credit. The contents are: we will go through interest point detectors, like SIFT or Harris, which you have implemented; then interest point descriptors, like HOG or the SIFT descriptor; then a new concept called k-means clustering, where we take a bunch of points and cluster them automatically (we already talked about clustering using mean shift); and then we will talk, very briefly, about a very important classifier called the support vector machine, SVM. In the last ten years or so SVMs have really influenced the fields of computer vision, machine learning, and so on, and almost everybody uses them, so it is important for you to get some insight into what an SVM does and how you can use it. Of course there are whole courses on machine learning one can take, and there are lots of good online lectures as well. The last topic will be how you evaluate methods that do classification, object classification, action recognition, and so on; in particular we will talk about precision and recall.
Okay, so let's get started. We have an image classification problem like this: we have an image and we want to assign a class label to it. We want to say: in this image a car is present, no cow is present, a bike is not present, a horse is not present. These are the classes (car, cow, bike, horse) and there can be several other classes. A finer version is to also localize the objects: say a car is there, with a bounding box on the left side, and there is a cow on the right side, so we have a location and also a category. Now, this is a difficult problem because of several factors: if you take a picture of a person from a different viewpoint the picture may look different; if you take a picture in different illumination it may also vary, as we discussed for face recognition; and objects may look different even within the same class. Therefore we will have a large set of images and we want to use them to come up with a method which can deal with these variations. So let's take the example of a chair: these are all chairs, and you see that within the chair class there is a lot of variation, and that is the problem. This is called intra-class variation; inter-class variation would be the variation between, say, chair and motorbike. So image classification is: given positive training images (in this case the object is motorbike, and there are several images that contain a motorbike) and also several negative images that do not contain that particular object (images like airplanes and so on), we want to classify a test image as to whether it contains that object class or not. This is a learning-based method: we learn what a motorbike looks like, what features a motorbike image will have, and then we do testing. There are two parts, learning and testing, and you need to understand that. Okay, so now we use this method called bag of features, the histogram of features, the distribution of features. As we talked about yesterday, a distribution is like a histogram.
In a histogram you count how many times a particular feature occurs, so you can get a histogram of an image over gray levels 0 to 255: in this image the number of pixels with gray level 50 is 100, the number of pixels with gray level 200 is 30, and so on. That is the distribution of the gray levels, and it is called a histogram. It is also like a bag: we have these bins, and we count what falls into each one. In the same way we will have a bag of features, a histogram of features. This idea comes from texture recognition. If you look at this image, there is a repetition of a primitive (most of the time this kind of filled circle is repeated), and that repetition is what makes it a texture image. In this other one we can identify primitives too: at a higher level, maybe this kind of primitive and that kind of primitive. These primitives can define the whole texture, and we can look at the distribution of the image in terms of these primitives. Here is another example, a little different; these primitives are texture elements, and just as a pixel is a picture element, these are also called textons. So, in a third example, these can be texture elements, and we can describe the images in terms of the distributions of these texture elements. Say we have an image and these are the textons, the texture elements; then we can compute the histogram of them for this image. This reduces dimensionality: we will have just one vector which represents the whole image, which is important. Another image may have more of this texture element, so its histogram looks like that, and likewise a third example. That is the basic concept which you need to be clear about.
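The gray-level histogram idea above can be sketched in a few lines of Python. The tiny image and its pixel values are made up for illustration; a real image would be, say, 256x256 with values 0 to 255, but the principle is identical:

```python
import numpy as np

# A tiny made-up image; the counts below are just for illustration.
img = np.array([[50,  50, 200],
                [200, 200, 50]], dtype=np.uint8)

# One bin per gray level, 0..255: the order-less distribution of pixel values.
hist, _ = np.histogram(img, bins=256, range=(0, 256))

# hist[50] counts the pixels with gray level 50, hist[200] those with 200.
```

The resulting vector `hist` is the single-vector representation of the whole image that the lecture describes.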
Now, this is also similar to documents. When you look at a document, like a Word document or a web page, it consists of lots of words; the words are the basic elements of the document, and you can come up with a representation of a document as a bag of words, a histogram of words. There is no order: you just say how many times each word occurs in this document. Take four words as an example, say bank, loan, water, farmer, and look at each document: in the first document this word occurs once, that one occurs once, and these two do not occur; in the second document this one occurs once, that word occurs twice, and so on. This is, in a way, the distribution of those words for that document, and similarly we get distributions for the other documents in terms of these words; we can normalize them and basically get a probability distribution. This is called the bag-of-words approach because here the unit is the word (just as we had textons for textures, here we have words for documents), and it is used a lot when you want to match two documents: you get the histogram of words for each and take the intersection to see how similar the documents are. If they are very similar the intersection is large; if they are very different the histogram intersection will be small. That is a very easy way to look at millions of documents on the web and quickly find out which document matches best. So we can apply the same idea to images. Of course, for documents it is easy, because we know the English language, we have a vocabulary, we know the words. But what are the words in images? Words in images are little picture elements, little patches; there is no English word there. So what we are going to do is detect features in the image (these will be Harris corners or SIFT points), and around each one we will extract a patch.
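As a sketch of the document example above (the four vocabulary words are from the lecture; the two documents are invented), counting word occurrences and comparing histograms by intersection might look like this:

```python
vocab = ["bank", "loan", "water", "farmer"]

def bag_of_words(doc, vocab):
    # Order-less word counts over a fixed vocabulary.
    tokens = doc.lower().split()
    return [tokens.count(w) for w in vocab]

def intersection(h1, h2):
    # Histogram intersection: large when the documents are similar.
    return sum(min(a, b) for a, b in zip(h1, h2))

d1 = "the bank approved the loan at the bank"
d2 = "the farmer drew water and more water"
h1 = bag_of_words(d1, vocab)   # [2, 1, 0, 0]
h2 = bag_of_words(d2, vocab)   # [0, 0, 2, 1]
```

Here `intersection(h1, h2)` is 0 because the documents share no vocabulary words, while a document compared against itself scores the full sum of its counts.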
Then we will have these local patches, as shown here, and we will compute some descriptor for each patch, as you have done, like HOG or SIFT, so each patch becomes a vector, the descriptor. Now we have lots of these vectors and we are going to cluster them. Typically a SIFT descriptor will be a 128-dimensional feature vector and a HOG descriptor can be around 3,000-dimensional, but just as an example let's say they are two-dimensional, to make it easy to show. Each patch, as a vector, maps somewhere in this space; these are the different patches, and then we want to cluster them: these are very similar, so they form one cluster, these form another cluster, and so on. Then we call each cluster a word. It is not an English word, but a word for images, in terms of the feature vectors which are the descriptors around the interest points, and that is how we can represent the image as a bag of words. These are also called visual words, not the words of a document. So we take an image like this, we get these patches, we get these words from them, and then we can represent the image as a histogram of these words, just as we represented a document as a histogram of words, which is very nice, and it actually works very well. There is a whole set of papers on this, and it really was a big change in computer vision when people started doing it, maybe five to eight years ago. This is also called bag of words, or bag of features, or bag of visual words; you need to understand these different terminologies, but it is all the same thing. Here is another example. Now, you have to understand that whenever we compute a histogram we lose the order; a histogram does not have any order.
For example, I can take an image that is half black and half white, and another image that is a checkerboard of alternating black and white: the histograms of both are exactly the same, because we are just counting the number of pixels at each gray level. A histogram does not carry any spatial information; it has no order, it is a distribution. Therefore, when you compute a histogram of words it is orderless: it does not tell you in which order the words occur and it does not care where they occur; it just says how many times each word occurs in the document. That is an important point. And so we can classify documents: in another example here we have four documents, and you see the words they share (common, people, sculpture, and so on), and we can come up with the bag-of-words representation: for the first document, common occurs twice, people three times, sculpture zero times; the second one like this, and so on. So the distribution for each document is a vector, which is very nice. The process for images is similar: we take an input image, we find some interest points, we compute descriptors, and we find clusters of those descriptors, as I explained; we do the same for the other training images, and then we train a classifier. These are the examples we have for this class, and we want to learn a classifier, and that is the support vector machine. In testing, given an image, we compute interest points, we compute descriptors, and we ask: is it the positive or the negative class? That is the way you can detect or recognize an object, or classify the image (is this an image of a person, a bicycle, a motorbike, or any other category) as long as you have learned those categories and you have training examples for them, and the training examples have to include positives and also negatives.
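The half-black/half-white versus checkerboard point can be verified directly. This is a small 4x4 toy case (not from the lecture slides), but it shows exactly the order-invariance being described:

```python
import numpy as np

# Left half black, right half white.
half = np.zeros((4, 4), dtype=np.uint8)
half[:, 2:] = 255

# Checkerboard of alternating black and white pixels.
checker = ((np.indices((4, 4)).sum(axis=0) % 2) * 255).astype(np.uint8)

h1, _ = np.histogram(half, bins=256, range=(0, 256))
h2, _ = np.histogram(checker, bins=256, range=(0, 256))

# The two images are completely different spatially, yet their histograms
# are identical: 8 black pixels and 8 white pixels in each.
```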
So the first step is to detect interest points, and you have a program to detect SIFT or Harris points, which is easy. Or, if you do not want to do that, as I said, just take every fifth or every tenth pixel as your interest point, which is called dense sampling; it actually works very well and saves computation. Then, for each interest point, compute a descriptor, like HOG or SIFT, and you are done. These are dense features: you take an image, you take every fifth pixel, you take a window around it, and those are your interest points; it actually works better than the Harris detector, and then you compute the descriptor there. So now, as I explained before, we have a descriptor for each interest point, in a high-dimensional space, and we want to cluster them to come up with the words, and that will be our vocabulary. The English language has had its vocabulary for thousands of years, but for visual words we do not have a vocabulary, so we have to come up with our own. In this example we have, say, a three-dimensional space, and we have these different descriptors from different images; we cluster them and may get one cluster here, another cluster here, a third cluster here, and these become word one, word two, word three. So how do we do the clustering? We have lots and lots of vectors and we want to cluster them. The simplest clustering method is called k-means: mean as in the sample mean, and k is how many clusters you have, so if you have three clusters it finds those three clusters. It is a very simple, intuitive method and it actually works pretty well. Once we do the clustering, each cluster becomes a word; then we find the histogram of an image in terms of those words; then we have a vector and we can train an SVM, and that's it. So let's look a little bit at k-means.
Say we want to do 3-means, which means we want to get three clusters. Let's say our data is two-dimensional again; it is easy to understand in two dimensions, but it works for a thousand dimensions too. We have these data points, shown as dots, lots of them, and it is not very obvious what the clusters are; for now ignore the lines and just look at the points. The way k-means starts is: if you know you want three clusters, you randomly pick three data points and assign them as the centers of cluster one, cluster two, and cluster three, shown here in yellow. Then we take the rest of the points and assign each one to whichever of the three cluster centers it is closest to: this one gets assigned here, maybe that one there, and so on, until every point is assigned to one of the yellow circles. At the end of this we get an initial partition: this is one cluster, this is another, this is the third. Then, because the points have been assigned, we can recompute the mean of each cluster. Initially the means were just randomly chosen points, but now we take all the points assigned to the first cluster and compute their mean, all the points assigned to the second cluster and compute their mean, and the same for the third. In the second iteration the means move, like this for the first one, like this for the second, like that for the third, and the partition changes accordingly; we keep repeating, do a third iteration, the means move again, and the partition settles. That is the k-means algorithm: at every iteration we update the means, and we know k in advance, which is why it is called k-means; in this case k is three. It is a pretty simple iterative method, but it really works. There are more complicated methods that may work a little better, but this one is very intuitive. So the algorithm is: given the data points, choose k data points to act as cluster centers; then repeat until the assignments do not change: allocate each data point to its nearest cluster center, and replace each cluster center with the mean of the elements in its cluster. Keep repeating, and when nothing changes, stop. Two lines of code and you are done; that is your k-means. If you do that, these are the visual words you can get: from examples of airplanes maybe you get a piece of fuselage or a wing; from motorbikes these little words; from faces maybe eyes and hair; from people maybe their faces; from bikes maybe tires, and so on. That is what you are trying to do. Ideally, maybe not all of this will happen, but the idea is to represent the image in terms of these local patches, each like a visual word, so the image corresponds to a document. Each word is a cluster center of the descriptors we computed: we have interest points, we compute descriptors over a lot of training examples, and then we cluster them, and each cluster center is a word.
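The two-step loop described above (assign every point to its nearest center, then move each center to the mean of its points) can be sketched as follows. This is a minimal illustration, not the lecturer's code; a real system would use an optimized library implementation:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: pick k random data points as initial centers,
    then alternate nearest-center assignment and mean updates."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Allocate each point to the cluster of the nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace each center with the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Two obvious 2-D clusters; in the bag-of-words pipeline the points would be
# e.g. 128-dimensional SIFT descriptors and k would be the vocabulary size.
pts = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]])
centers, labels = kmeans(pts, k=2)
```

After convergence, each row of `centers` is one visual word, and `labels` tells you which word each descriptor was quantized to.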
That is the representation. So at the end we can take this car image and represent it as a histogram, a bag-of-words representation of a car, where we came up with our own vocabulary. This is similar to the bag-of-words representation of a document in the English language, where the words are given (Shakespeare had his words), but for images we had to find the words, and we have one way to do it: detect interest points, get the descriptors, cluster them, and each cluster is a word. Very nice; so this is your representation. Each image is then represented by, say, a thousand- or four-thousand-dimensional vector, and now we do the SVM learning: we learn a classifier, which essentially learns a decision rule.
A student asks: since you are relying entirely on interest points, without taking the geometry of the picture into consideration, how well does that actually classify in practice versus designing features specifically? Yes, that is a very good question, and it is important. I will show you the results: in terms of performance (accuracy, precision, recall) it works very well. But in terms of understanding why the system recognizes something as a bike or an airplane, there is a problem, because you may get words which are not that meaningful. When you detect interest points and cluster them, there is no notion that this should be a tire, this should be a fuselage, this should be an eye or a face; it is just clustering. So some of the words are meaningful and some are not, and that is a serious problem, and there is a whole line of research on it. Yes, that's right: you also will not be able to localize easily, for example, because you take an image, find the histogram, and just say this is a bike image, an airplane image, a human image, and that's it. But this is a beginning: we just want to classify. We have billions of web images and we want to separate the car images from the bike images, and we can do that.
Okay, so we want to come up with a decision rule which will assign the bag-of-features representation of an image to different classes, and this again is very intuitive. We have a zebra image, shown in green, and these other images that are not zebra; we want to separate the positive examples from the negative ones. In this case we want to classify zebra, so the green ones are positive and the red ones are not. Whenever we want to learn, it requires training data, so we take positive examples (say, motorbikes) and find the representation as a histogram, the bag of words, one vector per example; then we take negative examples, which are not bikes; and then we learn the classifier, the support vector machine. SVM is one thing which has had a big impact on machine learning, computer vision, and many other areas; it is a very powerful tool, it works very well, and I am going to tell you where you can get code that you can use. My description here will be very high-level, just to give you a flavor of what a support vector machine is; there is a whole course on machine learning where this is covered in detail, and there are very nice YouTube videos on it as well: there is a professor at Caltech with a very good course who has a very good lecture on this.
So first: this is used for pattern recognition, object classification, detection, and so on. You want to train a classifier using a set of positive and negative examples, and the classifier learns the regularities in the data (all of these are motorbikes, these are not) and separates them. If the training is successful, you can then take an image the classifier has not seen and it should be able to tell you this is a bike and this is not a bike. That is called testing, and it is what we want, because ultimately computers need to do what humans can do. We have labeled examples (we know these are all bike images, we know these are not bike images) and we use them for training, but once training is done, the testing has to be blind: we give an image and the classifier has to decide whether it is a bike or not. That is the whole idea. This will first be a binary classifier: motorbike or not motorbike, human or not human; two classes, positive and negative, as the decision. Once you understand that, you can easily extend it to multiple classes.
So let's say we have positive and negative examples, the red and the blue ones, and each example is a point in the feature space. Here the feature space is two-dimensional; in your case it will be a thousand dimensions or so, but it is the same thing. The first example's feature vector is here, this one is there, and so on. Given these examples, we want to find the boundary: in this 2D case, the equation of a straight line, y = mx + c, in the plane; in the multi-dimensional case, a hyperplane which separates the two classes. If we can find this line, classification is very easy, because the line separates the two classes: on the line, w^T x + b = 0; on one side, w^T x + b > 0 for all the examples; on the other side, w^T x + b < 0. If you are on the line it is zero, on one side it is positive, on the other side it is negative. So classification becomes trivial once we know the line: take the unknown example, plug it into this equation, and look at the sign. If the sign is positive it is this class, if negative the other class, and that's it.
A student asks what the variables are. The variables are the parameters of the straight line. We have the equation y = mx + c; given these training points we want to find that line, that is, the m and c. A particular example gives us x and y, the two-dimensional vector; we plug it in and look at the sign. So the variables we have to compute when fitting the line are m and c; we compute the line from the training examples (what is the line which separates these two classes), and once we know m and c, we can take any (x, y), plug it in, and get a positive, negative, or zero sign.
Now, depending on how the positive and negative examples lie, it can be hard for a computer program to automatically find a really good separating line, because the problem can be nasty and there may be lots of possible lines. So which line do we pick, and which is the best line? There is this notion of margin, which is the distance of a point from the line. This is the closest point of this class to the line, this is the closest point of the other class, and so on; this one is further away. In general, to separate two classes we want a large margin: the more margin, the better, and the less we get confused. So we want to find the line that maximizes the margin, which means we want to find the vectors (actual examples, some from the red class and some from the blue class, positives and negatives) which lie on the margin. These vectors support the margin lines, and they are called support vectors; that is why the method is called the support vector machine, SVM. The classifier just has to find those vectors, in this case, say, three from here and three from there; you do not need all the other examples, because the support vectors alone define the classifier. That is why it is better than looking at all the data: just find the support vectors. You are given these red dots and blue dots (it is like Democrats and Republicans, and you select the ones which support the boundary, so you need some from both sides). So we want to find the examples which are the support vectors, such that the margin is maximized, and the support vector machine will do that for you. Of course, there is a whole optimization problem in finding them: it is a convex optimization, a quadratic program, pretty involved mathematically, and you can learn it in a machine learning class. But there is code you can get that really works and that everybody uses: it is called LIBSVM.
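Once w and b are known, the decision rule really is just a sign check, as described above. The sketch below hand-picks a separating hyperplane for four toy points; it does not run the quadratic-program training that LIBSVM performs, so w and b here are assumptions, not learned values:

```python
import numpy as np

# Hand-picked hyperplane w.x + b = 0 for a toy separable problem;
# a real SVM would learn w and b by maximizing the margin.
w = np.array([1.0, 1.0])
b = -3.0

def classify(x):
    # The sign of w.x + b decides the class.
    return 1 if np.dot(w, x) + b > 0 else -1

positives = [np.array([2.0, 2.0]), np.array([3.0, 1.5])]   # class +1
negatives = [np.array([0.0, 0.0]), np.array([1.0, 0.5])]   # class -1
```

Testing is exactly this cheap: plug the unknown feature vector into the equation and read off the sign, no matter how many training examples there were.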
LIBSVM has had a really big impact on computer vision: object recognition, action recognition, many areas. It is a very simple notion (describe a classifier in terms of support vectors, and find those support vectors automatically), and there are lots of interesting things here. Now, here you see the boundaries are linear, which is very nice, and you can find them; but in cases where the boundaries are not linear the space is very difficult, and the SVM can handle that too, using the kernel trick: you transform the data to a different domain where the nonlinear boundary becomes a linear boundary, which is another very interesting piece of work. It comes down to the distance measure, as you were asking: if you use the Euclidean distance you will not be able to deal with nonlinear boundaries, but if you use other kernels, like the radial basis function, then nonlinear boundaries are taken care of as well. It is pretty interesting stuff, very useful, and it really works.
Okay, so now we have all the pieces: we have the words, which we got by clustering the interest point descriptors, and we know how to train an SVM, so we can recognize the class. It actually gives pretty good results, even in the presence of background. These are the different classes: bikes, books, buildings, cars, people, phones, trees. You take lots of examples of bikes, with some variation (it is good to have many diverse examples), go through the process, train an SVM for bikes, train an SVM for books, for buildings, and so on, and then you test. Sometimes there are problems: books may get confused, maybe with faces and buildings; these building images, trees, and so on look rather similar, so some are misclassified as buildings, phones, or something else. So it has problems, depending on the data.
Now we want to talk about how we evaluate how well the classification does. There is a whole competition, the PASCAL Visual Object Classes challenge, which started in 2007 and is organized by a European consortium. They collected training data and test data, put it online, and distributed it to different teams; they got images from Flickr, there were 500,000 images, and they have 20 different classes (person and so on). You can get it, it is freely available, and they also have annotations, like where a particular object is, the bounding box and so on, and they have training images and testing images. It started in 2007 and has continued every year; there was a 2012 edition at the conference I went to in Florence. These are examples: airplane, bicycle, bird, boat, cow, chair, and so on, ten of them, five on the top and five on the bottom, and another ten here: dog, horse, dining table, sofa, and so on. It is pretty interesting, and if you are interested you can actually compete next year. So now: how do we evaluate how well a method works? People take this dataset with their different variations of these methods, and the organizers evaluate each submission; they may have 40 submissions, and they say which is the best. Then there is a workshop where the people in the top three give presentations and talk about their methods, which is a pretty interesting process. So let's talk about the basics of the evaluation metrics. You can only evaluate if you know the ground truth.
In PASCAL VOC you have 20 categories, so for each category you have ground truth: you may have, say, 50 images, and you know this one is a dog, this one is a person, and so on; that is called the ground truth. Then you look at what the method outputs. Say there were 50 dog images: submission one says that, of those 50, 30 are dog, and for the other 20 it says they are not dog. Those 30 are called true positives, because the algorithm says they are dogs and the ground truth says they are dogs; the 20 which it failed to recognize are false negatives, because they really are dogs but the method called them negative. So we have to go through this terminology to come up with the metrics. This is one slide, but it is very important, so listen carefully. Let's say the red circle is the ground truth: all the examples in it are of the true class, say dog. The blue circle is the set of examples which your algorithm says are of that class, dog. So the blue one is the result of your method and the red one is the ground truth, and you see they intersect and many examples are common, which is good. Also, whenever you want to test a classifier you need examples from other classes: it should not only say a dog is a dog, but also that a cat is not a dog, and those cat examples that the method correctly rejects are the true negatives. So you have the ground truth, you have the result of your method, and you have the true negatives.
Now, the intersection of the red and the blue circles is the true positives, because these examples were in the ground truth and your algorithm also said they belong to the class; they are correct, which is good, and generally we want to maximize that. Then there are examples the algorithm says are dog but which are not dog: those are the false positives. And then there are the misses: the ones the ground truth says are dog but which are not in the blue circle; those are the false negatives. Using these definitions we can now define precision. Precision is the intersection between the ground truth and your algorithm's output, divided by your algorithm's output (the blue circle): that is, the true positives divided by everything your algorithm returned, the true positives plus the false positives. It tells you how precise your classifier is: of the things it called dog, how many really are dogs. Then there is another measure called recall, which is again the intersection of the two, but divided by the ground truth: what fraction you were able to recall, that is, the true positives divided by the number of ground-truth examples, the true positives plus the false negatives. So these are the metrics, precision and recall, defined in terms of true positives, false positives, false negatives, and true negatives; you need to know them, and they make sense, it is very intuitive, and that is ultimately what you want to measure. And that is the end of this lecture.
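The precision and recall definitions from the last slide can be sketched as set operations. The 50/30 dog counts follow the lecture's example; the 10 false positives are an invented number, added so that precision differs from 1:

```python
def precision_recall(ground_truth, predicted):
    # precision = TP / |predicted|, recall = TP / |ground truth|,
    # where TP is the size of the intersection of the two sets.
    tp = len(ground_truth & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

gt = set(range(50))                            # 50 dog images in the ground truth
pred = set(range(30)) | set(range(100, 110))   # method flags 30 of them, plus 10 non-dogs
p, r = precision_recall(gt, pred)              # p = 30/40 = 0.75, r = 30/50 = 0.6
```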
Info
Channel: UCF CRCV
Views: 62,728
Keywords: Lecture17, bag of features, bag of words, histogram, bow, bof
Id: iGZpJZhqEME
Length: 47min 13sec (2833 seconds)
Published: Mon Nov 26 2012