Bag of Visual Words (Cyrill Stachniss)

Captions
Welcome. Today I want to talk about a technique called bag of visual words, an approach for finding similar images, for example in a database of images. That is a frequently occurring problem in computer vision, photogrammetry, and robotics whenever you need to make data associations: you want to find two images of the same place in order to do place recognition, or you want to find similar images in a database given a query image. These are all tasks where you look for similar images or similar data items, and bag of words is a popular technique for doing exactly that. As a preparation for this lecture you may want to watch the five-minute summary video that explains bag of visual words very briefly; if not, just lean back and enjoy what I am going to present here.

So what is bag of visual words? It is a technique for finding similar images in a database, either images among each other within the database or with respect to a given query image. Say you put a large set of images into a database and want to find an image in that database that is similar to a query image. The question is how to search for that, and bag of visual words allows you to use the image itself as your search query and returns images that look similar to the one you queried for. That is similar to Google image search, where you can drag and drop an image into the search window and it returns similar images; bag of words is one approach to do that. In addition, bag of words provides a quite compact representation of the image: in the end you do not need to store the image itself, it is sufficient to store a compact representation, which, as we will see, is simply a histogram of feature occurrences, and all comparisons can be done on those histograms.

You may wonder about the word "word" in "bag of words". The reason is that the technique stems from the text mining community. If you have a large number of text documents and want to find documents similar to a query document, or want to group documents into different types (legal documents, trade reports, robotics papers, papers from another discipline), you face a common problem from libraries and literature search, for example when you have a paper and want to look for similar papers for your related-work study. Bag of words, as it is originally called in the text mining community, reduces each document to its word occurrences and then identifies similar documents based on the distribution of those occurrences. The idea is simple: if certain words appear with similar frequencies in two documents, the documents are probably similar. Bag of visual words takes this concept and transfers it from text documents to images.

To stay with text documents for a moment: in the system we use to store our paper repository, where you see the summary of a paper, some images, and the title, there is a button "look for similar papers". If you press it, the database returns papers that are similar to the one you were looking at. It does so by checking how often individual words occur in all the documents, building a distribution of word occurrences over the papers, and returning the papers whose distributions are most similar to the query. That is the idea of the bag of words approach, and we will now translate it to a bag of visual words.

Consider an image of an object like the illustration here. We can break this image down into small components, small patches, regions, or features, and treat those small regions as our words. For example, we can cut the image into small sub-images and keep those sub-images as our set of words, counting how often each part appears. One patch may show the eye of a person, and because the person has two eyes this patch appears approximately twice, so it gets a count of two; other local patches, or similar-looking ones, may also appear multiple times in the image. This is what we use to generate our distribution. So the first step is to turn the pixel information into visual words, and bag of visual words does this with features and feature descriptors such as the SIFT features we discussed earlier. We break the image down into independent features and then work only with those features to represent the image; the local patches here are just an illustration because they are easier to visualize.

What we also need is a way to say which of those words, i.e. which visual words, image patches, or features, are actually good for describing images. We need a dictionary, similar to, say, the Oxford English Dictionary, which defines which words are allowed in the English language; the visual dictionary tells us which features are allowed for describing our images. So we need a codebook or visual dictionary that defines the allowed words, the words we take into account in our approach. Some of those sub-images may stem from the face example, others from bikes, others from musical instruments; you need a large variety of words to cover a wide range of images. This dictionary defines the allowed words for our comparisons. Given the dictionary, I can take different images as input and compute how often the individual visual words appear in each image. Each image then has its own statistics: the face image gets one histogram, the bike image another, the musical instrument yet another. The visual words of the dictionary, all the allowed words, sit on the x-axis of these histograms, and the y-axis counts how often each visual word appears in that image. The bag of words approach then simply takes those histograms, forgets about the image information, and decides whether two images are similar by comparing the histograms. You are basically comparing the distributions of visual words that appear in your images: with some distance metric or similarity measure you ask whether this distribution is similar to that distribution, and if the histograms are similar, the distributions of visual words are similar, which makes the images potentially similar. In the end, when querying for similar images in your database, you only need to compare histograms.

That was a brief summary of how bag of visual words operates, so let us dive a little deeper and see how it works on real data. The first step is to turn our images into histograms of feature occurrences. Take an image recorded from a camera mounted in a car, collected to solve visual place recognition tasks: you drive around on different days and under different weather conditions, in this example through Freiburg, taking images, and when you come back another time you want to be able to localize against the database of images. You take your current camera image and query the database: do you have images that look similar to what I see right now? Then you can run place recognition and conclude, for example, that you should be at the same place as image number 128 of the original sequence. To do this with the bag of words approach, the first thing we need is to turn the image into a histogram of visual words. Again, the x-axis of the histogram lists the allowed visual words, shown here for illustration purposes as different colors (in practice they are unlikely to be color values), and the y-axis is the count: the blue word appears five times, the red visual word twice, the green one once, and the yellow and the orange ones do not appear at all. This transition from the image to the histogram is the first step.

So we take the original input image and extract visual features; we can run any feature detector and descriptor from our toolbox, and SIFT is one of the standard choices. The SIFT detector computes keypoints, i.e. the locations of locally distinct points where features are computed, and the descriptor is computed from the local neighborhood of each keypoint. SIFT runs on a grayscale image, which is why the image is shown in black and white, and the circles and lines are the usual visualization of the keypoints: the center of a circle is the keypoint location, the line shows the orientation of the keypoint, and the size of the circle tells you at which scale the feature was detected. These are the raw SIFT features extracted from the image: locally distinct points with descriptors.
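As a minimal sketch of this feature extraction step (assuming OpenCV with its SIFT implementation is available, e.g. opencv-python 4.4 or newer; the file name is purely illustrative), the keypoints and 128-dimensional descriptors could be computed like this:

```python
import cv2  # assumes opencv-python with SIFT available (OpenCV >= 4.4)

# Load the image and convert it to grayscale, since SIFT operates on intensity values.
image = cv2.imread("frame_0128.png")               # hypothetical file name
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect locally distinct keypoints and compute a 128-dimensional descriptor per keypoint.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

print(len(keypoints), "keypoints")   # typically a few hundred per image
print(descriptors.shape)             # (num_keypoints, 128)

# The usual visualization: circle center = keypoint location, radius = scale, line = orientation.
vis = cv2.drawKeypoints(gray, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("keypoints.png", vis)
```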
What we then need to do is reduce those features to visual words. Not every feature turns into its own visual word; otherwise we would get a new visual word for every feature detection. The visual words we consider are a small set compared to all potential visual features: every word represents a set of similar-looking features, and the words are defined through the dictionary. For the moment, assume we already have a dictionary; in a few minutes I will explain how to compute it. We take the feature descriptors and assign each of them to the closest word in the dictionary, possibly with a maximum distance we allow; if a feature is further away than that from every word, it is simply not represented by any visual word. For illustration I use red, blue, and green boxes: in this example three visual words appear, a blue visual word here, here, and here, a green visual word, and two red visual words, so I am mapping the SIFT features I computed onto those visual words.

Once I have this assignment I can forget about the image; I do not need to consider it anymore. An image is now just something in which visual words occur. The colored boxes are only an illustration; what matters is which visual words appear, and I do not care about the actual pixel values any longer. Then I simply count: blue appears five times, green once, red twice. This gives exactly the histogram I showed before: the blue visual word has a count of five, the red one a count of two, the green one a count of one, and the other two do not appear, so the histogram is five, two, one, zero, zero. This histogram can be expressed as a vector (5, 2, 1, 0, 0), and of course the order matters: the order is defined through the visual words and must be the same for all images. With this approach I can turn every image into a histogram, in the end a sequence of counts telling me how often each visual word appears in that image.

To do that we need a feature descriptor, say SIFT, and we need to define our visual words, our dictionary. Where do the visual words come from, and who tells us which visual words to use? The dictionary is important here because it is the definition of the allowed words, the visual words considered for our comparisons. If a feature is not represented by a visual word in the dictionary, it is simply ignored; it does not matter. The visual dictionary and its words need to remain fixed, because they describe the words we are allowed to use and they define the x-axis of the histogram. The dictionary also has an order: blue is word number one in my dictionary, red is word number two, green is word number three, and so on.
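A minimal sketch of this assignment and counting step, assuming the dictionary is given as a matrix with one word descriptor per row and assuming an optional maximum assignment distance (both hypothetical parameters here):

```python
import numpy as np

def image_to_histogram(descriptors, dictionary, max_dist=None):
    """Turn an image's feature descriptors into a visual-word histogram.

    descriptors: (num_features, 128) array of SIFT descriptors of one image
    dictionary:  (num_words, 128) array of visual words (e.g. k-means centroids)
    max_dist:    optional threshold; features farther than this from every word are ignored
    """
    histogram = np.zeros(len(dictionary))
    for d in descriptors:
        # Distance of this descriptor to every word in the dictionary.
        dists = np.linalg.norm(dictionary - d, axis=1)
        nearest = np.argmin(dists)
        if max_dist is not None and dists[nearest] > max_dist:
            continue  # feature is not represented by any visual word
        histogram[nearest] += 1.0
    return histogram  # e.g. (5, 2, 1, 0, 0) for the example above
```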
So the allowed visual words and the order in which I consider them in the histograms are defined through the dictionary, and the dictionary is fixed; it is something that should not change. The dictionary can be learned once, for example from a large database: you build a dictionary for that database, and if the database is very large and diverse you may end up with a representative dictionary, but afterwards the dictionary remains fixed. It is typically learned from data, not specified by hand, and the question is how we can learn it. In the next few minutes I want to dive a bit deeper into how to obtain this dictionary of visual words.

Let us say we have a training dataset. This can be the database we want to work on, if that database is fixed, or it can be a large collection of images showing a large variety of content, say a million images recorded with different cameras and setups, showing completely different scenes, small objects, large objects, landscapes, humans, you name it. That would be a very diverse dataset for learning a dictionary. If you want to apply the approach to a very specific application, such as place recognition with images taken from a car in a certain city, then you would only take images recorded from a car in that city under similar conditions, because then you get a dictionary that is more specific to the task you actually want to solve. So your training dataset should be related to the type of data you will be working on later. The important thing to note when I say "training dataset" is that what we do next is unsupervised: you do not need to provide manual labels; you just put in the images and the approach sorts everything out on its own.

How does it work? We take our database of images, say the example images from the car dataset, and we extract visual features, for example SIFT features. Every image turns into a set of SIFT features, say 500 features for image number one, 200 for image number three, 700 for image number eight, and so on, so we collect a large set of SIFT features. We get thousands, tens of thousands, or millions of those feature vectors, depending on how many images we have and which parameters we use for the descriptor; roughly 500 features per image is a reasonable approximation. These are not yet visual words; they are just the visual features that appear in the training dataset. Each of them is a vector, in the case of SIFT a 128-dimensional vector, because that is the descriptor SIFT generates; other feature descriptors may have smaller dimensions, say 64 or 32, depending on what the descriptor returns.

Those vectors are nothing but data points in a high-dimensional feature space: all these vectors are points in a 128-dimensional space. I cannot visualize a 128-dimensional space, so let us stick with an illustration of a three-dimensional space, which is already hard to draw in 2D, but I think you get the idea: these are data points stored somewhere in a high-dimensional space. What we now want to do is group similar features together so that each group later forms one visual word. We take data points that lie close together and say that they probably represent very similar image content, because each feature is a description of the local surrounding of its keypoint, for example a distribution of gradients. Data points that are close together represent image regions with a similar distribution of gradients, so we can turn such a group into one visual word, for example by taking the mean of those data points. We do the same for the other groups, illustrated here by colors: all the red data points are combined to form one red visual word, the green ones form a green visual word, and so on.

This grouping of similar descriptors is done with a clustering algorithm. It clusters the data points, typically in an unsupervised way, so that I obtain clusters of feature descriptors in this high-dimensional feature space, and for each cluster I can find a representative: either one feature from that subset or, as is typically done, the mean feature descriptor computed over the cluster's data points. So we run a clustering algorithm such that all the points over here turn into the blue word, those over there into the yellow word, these into the green word, and so on. A standard clustering approach can be used for this; we have no special requirements, and the first thing we typically try is k-means.
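Before looking at k-means itself, here is a sketch of this overall dictionary-learning step, assuming scikit-learn is available for the clustering (MiniBatchKMeans is used here only as a practical stand-in for plain k-means on large descriptor sets) and assuming a hypothetical directory of training frames:

```python
import glob

import cv2                                    # assumes opencv-python with SIFT available
import numpy as np
from sklearn.cluster import MiniBatchKMeans   # assumption: scikit-learn is installed

def extract_sift(path):
    """Return the (num_features, 128) SIFT descriptors of one image, or None."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = cv2.SIFT_create().detectAndCompute(gray, None)
    return descriptors

# Hypothetical training images, e.g. the frames of the car dataset.
descriptor_list = []
for path in sorted(glob.glob("car_dataset/*.png")):
    descriptors = extract_sift(path)
    if descriptors is not None:
        descriptor_list.append(descriptors)
all_descriptors = np.vstack(descriptor_list)   # one big (N, 128) matrix

# Cluster the descriptor space into k groups; the k centroids become the visual words.
k = 1000
kmeans = MiniBatchKMeans(n_clusters=k, random_state=0).fit(all_descriptors)
dictionary = kmeans.cluster_centers_           # shape (k, 128), kept fixed from now on
np.save("dictionary.npy", dictionary)          # the codebook is computed once
```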
Let me give you a brief introduction to k-means. K-means partitions the data, in our case the feature descriptors, into k clusters, and we need to provide this k: k = 1000 means generate 1000 clusters from my dataset and return the 1000 so-called centroids, which are the centers of mass of the clusters, basically the mean over the feature descriptors that are supposed to form one visual word. The k clusters are represented by k centroids, the centroid typically being the mean of the assigned data points, and the objective of the k-means algorithm is to find k clusters and assign every data point to the closest centroid such that the sum of squared distances between the data points and the centroids they are assigned to is minimized.

If we use k-means for learning the dictionary of our bag of visual words approach, this means we partition the features that occur into k groups: the, say, one million features we found are grouped into 1,000 or 10,000 clusters, so with k = 10,000 I reduce my one million data points to 10,000 centroids. A large number of feature descriptors are assigned to each centroid and form a group; for all features grouped together I compute the mean descriptor, which is the centroid, and this centroid becomes a visual word in my dictionary. The output of the k-means algorithm is the set of centroids, the mean feature descriptors, and they form the individual visual words of the dictionary. If I later detect a feature in an image, I assign it to the closest centroid in the dictionary, possibly with a maximum allowed distance: if the feature is too far away from every word, I ignore it; otherwise I assign it to the closest visual word. So k-means generates k words, assigns the features to their nearest word, and does so in a way that minimizes the squared distances between the features and the visual words.

How does k-means work technically? Let us first describe it informally and then state the algorithm. We start with an initialization, typically a random one, for example by randomly sampling k data points and declaring them our initial centroids; we will then iteratively revise those centroids. You can choose another initialization, but randomly drawing from your data points is not a bad starting point. Then the k-means algorithm repeats two steps until convergence. In the first step, each data point is assigned to the closest centroid: we loop over all data points and assign them to the closest of the current centroids, which gives the data association of data points (feature vectors) to centroids (visual words). In the second step, the centroids are recomputed based on this new data association: for every centroid we take the data points assigned to it and set its new location in feature space to the mean of those features, which gives the new word, the new point in the high-dimensional feature space that represents this group of features. We repeat until convergence; convergence can mean, for example, that the data association does not change anymore, in which case the centroid locations will not change either. In the setup we use here for visual words this can be a little tricky, so typically you stop when the distances essentially stop changing, i.e. when the change falls below a threshold.

More formally, let mu_i be the k centroids and x_t the data points. We iterate over all data points and make an assignment using an indicator variable b_it that tells us which data point t is assigned to which cluster i: b_it is 1 if centroid i has the minimum distance to x_t among all centroids, and 0 otherwise. That is step one, the assignment step. Step two recomputes the means: mu_i = (sum over t of b_it x_t) / (sum over t of b_it), which is a weighted mean where the weights are just zero or one, the indicator telling us which data points to take into account when recomputing the mean location of the word.

As a small example in a two-dimensional feature space: these are my data points, and the red and the blue cross are my initial centroids, here just random locations that are not taken from the data points, although you may often want to do that. First we compute the assignment, shown by colors: all the red data points are associated to the red cross and all the blue data points to the blue cross. Then we recompute the locations of the blue and the red cross based on these assignments; that is the second step of the first iteration, and the new mean locations move accordingly. Then we do the next data association, step one of iteration number two, the separating line moves, we recompute the means, and we separate again; at some point the same data points are assigned and the separation line basically does not change anymore, so we say this is the final result: centroid number one with all the blue points and centroid number two with all the red points. That is how k-means works, illustrated for a 2D feature space and a dictionary of size two. In reality we have a higher-dimensional feature space, say 128 dimensions, and the dictionary size is not two but maybe a thousand, five thousand, ten thousand, or fifty thousand, depending on how many images you want to deal with, but the approach is exactly the same; it does not matter whether you run it on 100, 10,000, or 100,000 images, it just takes longer for larger datasets. K-means is the standard approach and probably the most popular clustering algorithm; it is fairly easy to implement, and the only thing you need to do is choose the number of clusters, give it a run, and k-means returns your centroids.
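Here is a minimal NumPy sketch of the two-step loop just described (random initialization from the data points, assignment step, update step, stop when the assignment no longer changes); an illustrative implementation, not the one used in the lecture:

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Cluster (N, D) points into k centroids with plain k-means."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: randomly draw k data points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignment = np.full(len(points), -1)

    for _ in range(max_iters):
        # Step 1: assign each point to its closest centroid (the indicator b_it as an argmin).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = np.argmin(dists, axis=1)
        if np.array_equal(new_assignment, assignment):
            break  # converged: the data association no longer changes
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of the points assigned to it.
        for i in range(k):
            members = points[assignment == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    return centroids, assignment
```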
There are a few things you need to know about k-means when you work with it. First, the initialization matters: the result you get depends on your initial centroids, because k-means updates them iteratively; it is not a direct solution but an iterative one, so you may want to think about how to initialize your clusters. One option is to randomly draw data points from the visual features you have detected, and perhaps use a smart way of spreading the initialization over the feature space so that not everything starts clustered in one local region, which may be suboptimal. Second, k-means is sensitive to outliers: if some features lie very far away from any reasonable cluster, k-means may place centroids at those locations, and you end up with very small clusters, sometimes a word representing a single feature, which is something you typically do not want and should get rid of. Third, k-means only converges to a local minimum, not the global one. These are things to take into account in order to build a meaningful dictionary, but using k-means is the standard choice for turning a large number of feature vectors into a small set of visual words, and it is typically the way to go.

So what does that mean in practice? We take our database, run the k-means algorithm more or less out of the box, and it returns, in this example, our five visual words. This dictionary is then fixed and will not be changed anymore; it is a preprocessing step, and when we run bag of words online, so to say, we keep the dictionary fixed and work only with it. That means our database of images turns into a database of visual word occurrences. In this example we have four images, so a very small database, and five visual words, and every image is turned into a histogram: a histogram for image one, image two, image three, and image four. For the first image we have, for example, five blue, two red, and one green occurrence; in another image four blue, no red, one green, one yellow, and no orange; and so on. Every image is turned into a histogram, and what also becomes obvious with this transition is that the location of a feature in the image does not matter anymore: it is irrelevant where in the image a word has been detected, as long as it has been detected at all. Every image becomes one of those histograms given the dictionary; the dictionary is always the x-axis of the histogram, and the heights of the bins tell you how often each visual word appears.

This bag of visual words model is a quite compact summary of the image content: with 5,000 visual words you reduce your image to 5,000 counts, 5,000 integers that tell you how often the visual words appear, so it can be a very compact representation of your image. The second relevant property, especially important for place recognition tasks, is that the spatial arrangement of the visual words in the image does not matter. Even if you change your viewpoint, rotate your camera, or shift your camera, you get basically the same result as long as you still see the same object or scene, or a very similar one, so you are largely independent of viewpoint changes and deformations. This holds, of course, under the assumption that the feature descriptor itself is at least to some degree invariant under viewpoint changes when you record a scene from two different viewpoints, which is true for SIFT features up to a certain degree. Once you have the visual words, the bag of words approach completely ignores the spatial arrangement and only looks at counts.

What is a bit problematic is the question of how to choose the size of the dictionary, the vocabulary. If it is too small, a lot of features are ignored or too many different things are grouped together into one visual word, and the representation is not very expressive. If we make it too large, then in the extreme case every feature turns into its own visual word, which is basically overfitting to one specific image: you lose all generalization and the performance degrades substantially. So you need to be somewhere in the middle. If you have a small database with a few hundred images, you may want to start with k = 1,000, and if your database gets much larger, you may want to increase it to 10,000, 20,000, or maybe 30,000 words; to justify something like 100,000 words you need a really large database.

What we have so far is the first step: we defined the dictionary and turned all images into histograms, so our database now only stores histograms. Now comes the final step: how do we actually compare image similarity, how do we find similar images, either given a query image or among the database itself? Let us look at the example of a query image. We have a database and a search query, and we ask: give me all the images in my database that are similar to my query image; that is, for example, exactly what you do in place recognition tasks. So the task is to find similar-looking images, the input is my database of images and my dictionary, so that the database can be turned into histograms, plus one or multiple query images, and the output is the n most similar images from the database, similar in the sense of similar to the query image or images. This is my query image, this is my database, and these might be the two most similar images that are returned.

How do we perform this similarity query? As we said, we do not need to store the images, we only store histograms, which means that the task of comparing images reduces to the task of comparing histograms: is this histogram the same as, or rather similar to, that one? So how can we compare histograms? There are different possibilities for computing similarities. Each histogram is just a vector of counts, in this case (5, 2, 1, 0, 0), or (4, 0, 1, 1, 0), or (1, 2, 1, 0, 0), and so on. One option is the Euclidean distance between two such vectors: we treat the count vectors as points, here in a five-dimensional space, and compute the Euclidean distance between, say, (5, 2, 1, 0, 0) and (4, 0, 1, 1, 0). An alternative is to take the angle between the two vectors into account: we again treat the histograms as vectors and look at the angle between them; if two vectors are orthogonal to each other they are very different, and if they point in the same direction they are very similar. This typically uses the so-called cosine distance, which takes the cosine of that angle into account; it is equivalent to the Euclidean distance if the vectors are normalized to norm 1, but otherwise it is not, and it turns out that the cosine distance, the angle between the vectors, ignores the length of the vectors, while the Euclidean distance is sensitive to it. We could also say: wait a moment, these things are actually probability distributions over word occurrences, at least if I normalize the histograms, so how do I compare probability distributions? The Kullback-Leibler divergence is one way of comparing probability distributions, so we might want to use that, or come up with some other distance metric. So which metric should we use? That is one valid question we have to answer.
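As a small illustration of two of these candidates on the example count vectors above (a NumPy sketch; the vectors are the ones from the running example):

```python
import numpy as np

h0 = np.array([5.0, 2.0, 1.0, 0.0, 0.0])   # histogram of image 0
h1 = np.array([4.0, 0.0, 1.0, 1.0, 0.0])   # histogram of image 1

# Euclidean distance: treats the histograms as points in a 5-dimensional space,
# and is therefore sensitive to the overall length (total count) of the vectors.
euclidean = np.linalg.norm(h0 - h1)

# Cosine distance: 1 minus the cosine of the angle between the vectors,
# which ignores the vector lengths and only compares their directions.
cosine = 1.0 - np.dot(h0, h1) / (np.linalg.norm(h0) * np.linalg.norm(h1))

# The Kullback-Leibler divergence would additionally require normalizing the histograms
# to probability distributions (and handling zero bins), so it is omitted here.
print(euclidean, cosine)
```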
The second thing, which has a very strong impact on the final result, is the question whether all the words we use are equally expressive: does the blue word tell me more or less than the green word or the yellow word? You can make an analogy to text mining, where this effect is perhaps more obvious. There are many words in a language that appear very often but do not really help you classify documents, such as the article "the", which will probably appear in every document you have; this word is not very expressive for categorizing documents, and the same holds for "is", "has", "he", "she", and so on, words that appear in a large number of documents and will not tell you much when you look for similar documents. So you may want to say that certain words are less relevant for you and other words are more relevant.

Let us make an example and go back to our four images. If you look at the green word, it appears in every image. So if I get a new image and the green word occurs in it, that does not help me narrow down whether I am more likely looking at image number one or image number two, because green appears everywhere; the word green is more or less useless for the similarity query we are doing here. The key question is how we can identify this, how we can say that a certain word is better for the comparison and another word is less relevant, so that we can up-weight the important words and down-weight the less important ones so that they do not matter that much when we compare the distances between our histograms.

This is done with a technique called TF-IDF, which stands for term frequency, inverse document frequency. What we are basically doing is recomputing the bins of our histogram: we weight each bin with a second term, the logarithm of the inverse of the frequency with which that word appears in the images of the database. So we recompute the bin of visual word i in image d with this formula, which rewrites the histograms such that words that are important for the comparison get a higher weight and those that are less important get a lower weight.

Let us try to understand what the formula does. The first part of the equation, the fraction, contains n_id, the number of occurrences of word i in image d, i.e. how often visual word i appears in image d; in our example the blue word occurred five times in the first image, so for the blue word in image number one this would be five. And n_d is the total number of word occurrences in image d; if there are overall eight visual word occurrences in the first image, the fraction is five divided by eight. That is the term frequency, how often this word occurs in the image, and it is nothing but our histogram normalized to one: if you ignore the second part and only look at the first expression, it just turns the histogram into a probability distribution. The second term is the inverse document frequency, and it enters as a logarithm: the capital N is the number of images in my database, and it is divided by n_i, the number of images in the database in which word i occurs at all. If I have 100 images and a visual word occurs in 50 of them, this is 100 divided by 50, which is 2, so the weight is the logarithm of 2.
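Written out compactly (a reconstruction of the formula just described, using the quantities defined above), the re-weighted bin value for visual word i in image d is:

```latex
t_{i,d} \;=\; \underbrace{\frac{n_{i,d}}{n_{d}}}_{\text{term frequency}}
\;\cdot\;
\underbrace{\log \frac{N}{n_{i}}}_{\text{inverse document frequency}}
```

where n_{i,d} is the number of occurrences of word i in image d, n_d the total number of word occurrences in image d, N the number of images in the database, and n_i the number of database images in which word i occurs at all.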
If, for example, a visual word appears in every image, the fraction is N divided by N, which is one, and the logarithm of one is zero, so a visual word that appears in every image gets a weight of zero: its bin is set to zero everywhere. This word does not matter anymore because it does not convey any relevant information. The maximum you can get is N divided by one, i.e. a weight of log N, which a word obtains if it appears in just one single image; such a word is super expressive, it tells me a lot. That is what this weighting is about.

What we can do now is rewrite all our histograms: we compute this TF-IDF weighting for the whole database. Here again are my four images in the database with the counts (5, 2, 1, 0, 0), (4, 0, 1, 1, 0), (3, 1, 1, 0, 2), and (1, 2, 1, 0, 0); these are the original histograms, the occurrences computed before, and the bin values are the n_id. Then n_d is the total number of word occurrences in each image: image number zero has eight occurrences, image number one has six, then seven and four, simply the sum of the individual counts. And n_i counts in how many images of the database a visual word occurs at all: the blue visual word occurs in all four images, the red word in three, the green one in all four, and the yellow and the orange one each appear in just one single image. With the TF-IDF formula we can now recompute all the histograms, taking this inverse document frequency weighting of the original distribution into account. Doing the TF-IDF computation with all those counts turns each of the four original histograms into a new, re-weighted histogram. So what TF-IDF does is turn every original histogram into a new histogram in which all the words that are important for the comparison get a higher weight and all the others get a lower weight. We are basically changing the weights within those histograms: words that appear in every image are weighted down to zero, which in this example holds for the blue one and the green one because they appeared in every image, and the others, those that occur more rarely, are enhanced. So this re-weighting is the first thing I do, and it says: I only want to take into account those words that actually convey information for my similarity query.
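A small sketch of this re-weighting on exactly those four count vectors (NumPy again; the numbers are the ones from the example, the function itself is illustrative):

```python
import numpy as np

# Original histograms of the four database images (rows) over five visual words (columns).
counts = np.array([[5, 2, 1, 0, 0],
                   [4, 0, 1, 1, 0],
                   [3, 1, 1, 0, 2],
                   [1, 2, 1, 0, 0]], dtype=float)

def tfidf(counts):
    N = counts.shape[0]                       # number of images in the database
    n_d = counts.sum(axis=1, keepdims=True)   # word occurrences per image: 8, 6, 7, 4
    n_i = (counts > 0).sum(axis=0)            # images containing each word: 4, 3, 4, 1, 1
    tf = counts / n_d                         # term frequency (histogram normalized to one)
    idf = np.log(N / n_i)                     # inverse document frequency
    return tf * idf                           # re-weighted histograms

weighted = tfidf(counts)
# The blue and green columns become zero (those words occur in every image); the yellow
# and orange words get the maximum weight log(4) because they occur in a single image only.
print(weighted.round(3))
```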
We still have not answered how to compare the histograms: Euclidean distance, the angle, or the Kullback-Leibler divergence? Since we have now re-weighted our histograms, they are no longer probability distributions, so the Kullback-Leibler divergence is probably not what we want to use, because it was intended to compare probability distributions. But we can still ask whether to use the Euclidean distance or the angle between the two vectors, and it turns out that in the bag of visual words approach, and bag of words in general, people tend to use the cosine distance, which takes into account the angle between the vectors, or rather its cosine, for making the comparisons. One can show that if we normalize our final histograms again, we can also use the Euclidean distance and obtain very similar results, but the standard choice is the distance based on the angle.

What we compute is the so-called cosine similarity and, from it, the cosine distance. The cosine similarity is the cosine of the angle spanned by the two vectors: if one histogram is x and the other one is y, the cosine of the angle between x and y is obtained by normalizing the vectors and computing the dot product between them. The cosine similarity gives a value of 1 if the vectors point in the same direction, i.e. are identical up to scale, and a value of 0 if they are orthogonal to each other. That is exactly the other way around from what we want from a distance, where zero should mean identical and larger values should mean more different, so we use the cosine distance, which is one minus the cosine similarity. As long as all vectors lie in the first quadrant, which is the case for us because they are computed from histograms with non-negative entries, this gives values between 0 and 1, where 0 means the same distribution and 1 means a completely different distribution.

So we can now compute the distance between our histograms using the cosine distance. Going back to our example: images number zero and number three look similar to each other, so they should have a similar distribution of features; let us see whether they come out as similar in our comparison. We build a distance matrix in which the histograms of images 0, 1, 2, 3 appear along both axes and every entry is the cosine distance between the corresponding pair of histograms. A few things can be seen. First, all the elements on the main diagonal are zero, which means an image compared to itself has a distance of zero; that is at least a sanity check that everything is set up correctly, because comparing an image to itself should give distance zero. Second, images zero and three are regarded as highly similar because they follow the same distribution of visual words. All the other values are pretty large, close to one, which means those images are very different from each other, not similar at all. We can also visualize this matrix as a plot: the main diagonal has distance zero, the off-diagonal elements for the pair (0, 3) are also close to zero, and all the others have higher values, close to one. That is our visualization: everything on the main diagonal is identical, and images 0 and 3 form the similar pair.
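Continuing the sketch from above, the pairwise cosine-distance matrix of the TF-IDF-weighted histograms can be computed like this (again an illustrative NumPy snippet, reusing the `weighted` array from the previous example):

```python
import numpy as np

def cosine_distance_matrix(histograms):
    """Pairwise cosine distances (1 - cosine similarity) between the row vectors."""
    norms = np.linalg.norm(histograms, axis=1, keepdims=True)
    unit = histograms / norms                 # normalize each histogram to unit length
    similarity = unit @ unit.T                # cosine similarities of all pairs
    return 1.0 - similarity                   # distance: 0 = same distribution, 1 = orthogonal

D = cosine_distance_matrix(weighted)          # 'weighted' from the TF-IDF sketch above
print(D.round(3))
# The main diagonal is (numerically) zero, the (0, 3) and (3, 0) entries are near zero,
# and the remaining entries are close to one, as described in the lecture.
```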
Here we can also check whether it actually makes sense to use this TF-IDF weighting, or whether we could do without it, by performing the same comparison with the same distance, the cosine distance, but on the unweighted histograms, i.e. on the original histograms without TF-IDF. If you compare the two resulting matrices, you see that without TF-IDF images zero and three are no longer recognized as similar, although they should be: the main diagonal is still zero, but the remaining values all lie somewhere between roughly 0.15 and 0.4, so this is not a good similarity matrix for the comparisons, whereas the TF-IDF-weighted one is. So the TF-IDF weighting really helps when making these comparisons.

We still have not looked into whether to use the Euclidean distance or the cosine distance; I only said that most people use the cosine distance. So let us try to identify the difference between the two. The main difference is that the cosine distance ignores the length of the vectors, whereas the Euclidean distance takes it into account. If the length of your vectors matters, you will probably tend to use the Euclidean distance; if the length of the vectors, i.e. the distance between their endpoints, does not really matter, then the cosine distance is probably the metric of your choice. Under the assumption that we have normalized vectors, we can actually show that the Euclidean distance and the cosine distance are related to each other. Start with the squared Euclidean distance: it can be written as (x minus y) transposed times (x minus y), and if we expand this expression we get x transposed x minus 2 x transposed y plus y transposed y. If we now exploit the assumption that both vectors have length one, so the norm of x is one and the norm of y is one, then x transposed x is 1 and y transposed y is 1, and the expression simplifies to 2 minus 2 x transposed y. And x transposed y is nothing but the cosine of the angle between the vectors, because both have length one, so the squared Euclidean distance becomes 2 times (1 minus the cosine of the angle), which is two times the cosine distance.
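Written out compactly (a reconstruction of the derivation just given, for unit-length histogram vectors x and y):

```latex
\|x - y\|^2
  = (x - y)^\top (x - y)
  = \underbrace{x^\top x}_{=\,1} \;-\; 2\,x^\top y \;+\; \underbrace{y^\top y}_{=\,1}
  = 2 - 2\cos\theta
  = 2\,\bigl(1 - \cos\theta\bigr)
  = 2\, d_{\cos}(x, y)
```

where theta is the angle between x and y and d_cos(x, y) = 1 - cos(theta) is the cosine distance.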
So if the vectors are normalized to one, the squared Euclidean distance is equal to two times the cosine distance; there is a direct relationship between the Euclidean distance and the cosine distance. With that we can compare the cosine distance and the Euclidean distance to see the effect of the different metrics. On the original, unnormalized histograms, both the Euclidean distance and the cosine distance are not very good at finding similarities; the cosine distance is a little better, since the entries that should be small are somewhat smaller, but they are still far from zero, so that is not very good. If we normalize those histograms, essentially only a scaling factor remains between the two matrices (the squared Euclidean distance differs from the cosine distance by a factor of two), and it is still not a good result; it is not what we want. If we then perform the TF-IDF re-weighting, we get the cosine-distance result we typically use; the Euclidean distances still look quite different, but you can see that the results using the re-weighted histograms also get better for the Euclidean distance. And if we then additionally normalize all vectors to one, artificially, the two distance matrices become nearly indistinguishable, not fully identical, but they do not differ much, because we only have values between zero and one. So, as we do not want to explicitly normalize our histograms (if we did normalize them all again, we could just as well use the Euclidean distance), one takes the cosine distance on the re-weighted histograms: compute the TF-IDF weighting and then run the cosine distance on it. That is what the standard bag of words approach does.

So, to perform our similarity queries, we take our database and store the TF-IDF-weighted histograms for all images in the database. When we want to find a similar image, we extract the features from the query image, i.e. run our SIFT detector, assign every detected feature to a visual word (or ignore it if it is too far away from every word), build the TF-IDF weighting for the query histogram, i.e. re-weight it, and then compute the cosine distance to all images in the database and return the n most similar histograms, or rather the images corresponding to those histograms. That gives me my most similar images, and that is how the approach works with TF-IDF.
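Putting the pieces together, here is a minimal end-to-end sketch of such a similarity query (it reuses the hypothetical helpers from the earlier snippets: `extract_sift`, `image_to_histogram`, the stored `dictionary`, and the raw `counts` matrix of the database; all names and file paths are illustrative):

```python
import numpy as np

# Offline: database statistics needed for the TF-IDF weighting and the comparison.
N = counts.shape[0]                                  # number of database images
n_i = (counts > 0).sum(axis=0)                       # images in which each word occurs
idf = np.log(N / np.maximum(n_i, 1))                 # guard against words that never occur
db_weighted = (counts / counts.sum(axis=1, keepdims=True)) * idf
db_unit = db_weighted / np.linalg.norm(db_weighted, axis=1, keepdims=True)

def query(image_path, top_n=2):
    """Return the indices of the top_n most similar database images."""
    desc = extract_sift(image_path)                  # SIFT features of the query image
    hist = image_to_histogram(desc, dictionary)      # assign features to visual words
    weighted = (hist / hist.sum()) * idf             # TF-IDF re-weighting of the query histogram
    unit = weighted / np.linalg.norm(weighted)
    distances = 1.0 - db_unit @ unit                 # cosine distance to every database image
    return np.argsort(distances)[:top_n]             # indices of the n most similar images

print(query("query_frame.png"))                      # hypothetical query image
```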
If you want to investigate the bag of words approach further, there is again the five-minute summary video that condenses this nearly one-hour lecture into five minutes; of course it does not go as deep, but the core material is in there. If you want to dive deeper, I suggest visiting the Jupyter notebook created by Olga Vysotska, who also made the nice drawings here for me, which you can find under the address shown here. It goes into more detail on the TF-IDF weighting, so you can see how all the individual parts are computed, which may help you understand the concepts a little better, and it also investigates the different distances. As further reading material there is the original paper by Sivic and Zisserman, called "Video Google", which had the idea of taking the bag of words approach from the text mining community and applying it to image and video data; that is the original publication from 2003. If you want further information about TF-IDF, you may, for example, have a look at Wikipedia.

To sum up: bag of visual words is an approach to compactly describe images and to perform similarity queries between images, and it is frequently used for searching for similar images in a database. This is relevant in computer vision, and a lot in 3D reconstruction and localization tasks, where you want to know which other images potentially contain correspondences to your current image, either for finding loop closure candidates in the context of the SLAM problem or for performing visual place recognition, which may then serve as an initial seed for a metric localization system; so there are many applications in robotics, computer vision, and photogrammetry. The idea of bag of words is to describe our images with so-called visual words, using a dictionary of visual words for this description, and to simply count how often the visual words appear in the original images, i.e. how often typical features, mapped to visual words, occur. Based on these word occurrences, the histograms, we perform our similarity queries. Because some words are more expressive than others, we use the TF-IDF re-weighting, which increases the weight of the bins, i.e. the visual words, that are more expressive for the comparison and down-weights the others, possibly to zero so that they do not matter anymore. The distance itself is computed with the cosine distance: we no longer take the images into account, we just take the histograms and compute the cosine distance between them, and that is basically it.
Info
Channel: Cyrill Stachniss
Views: 2,046
Keywords: robotics, photogrammetry, education, lecture, computer vision
Id: HWIkcMakz-s
Length: 58min 49sec (3529 seconds)
Published: Fri Aug 28 2020