What is Semantic Search?

Captions
[Music] Hi, I'm from Cohere AI, and this video is about semantic search. Semantic search is a really cool and powerful way to search large databases using the context of the query. Let me show you how it works. This video accompanies a post that you can find on the Cohere blog about semantic search, and there's also a code lab you can follow along with, where you create a semantic search model for a small dataset. Both links are in the description of the video.

Before we get into what semantic search is, let me show you what is not semantic search, that is, the methods that were used before it. Before semantic search, search methods resembled lexical search, which is basically word matching. To show you how lexical search works, imagine that you have a query, say "Where was the last World Cup?", and to find the response you have a dataset of sentences; let's say it's just these four. The idea is to search the dataset for the answer, or at least the closest thing to an answer. Say the possible answers are "The last World Cup was in Qatar", "The sky is blue", "The bear lives in the woods", and "An apple is a fruit". Lexical search works by word matching: it looks at which words from the query appear in each answer and counts them. The first sentence has five words in common with the query, the second one has one, the third one has two, and the fourth one has zero, so the first one wins, and this time it worked: it actually found the answer.

But as you can imagine, this creates some problems. Imagine that the dataset of sentences is instead: "The previous World Cup was in Qatar", "The cup is where you left it", "Where in the world is my last cup of coffee", and "An apple is a fruit". If you do lexical search and compare the number of words each sentence has in common with the query, you get four, three, five, and zero, so "Where in the world is my last cup of coffee" wins, even though it's not the answer and doesn't even have much to do with the query; it just has a lot of words in common with it. So counting words in common is not the ideal way to search, and we don't want lexical search.
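To make the word-matching idea concrete, here is a minimal sketch of lexical search (a self-contained illustration, not the code from the code lab): it scores each candidate sentence by how many lowercased words it shares with the query, using the sentences from the coffee example above.

def word_overlap(query: str, sentence: str) -> int:
    # Lexical search: count the words the query and the sentence have in common.
    query_words = set(query.lower().split())
    sentence_words = set(sentence.lower().split())
    return len(query_words & sentence_words)

query = "where was the last world cup"
sentences = [
    "the previous world cup was in qatar",          # 4 words in common
    "the cup is where you left it",                 # 3 words in common
    "where in the world is my last cup of coffee",  # 5 words in common
    "an apple is a fruit",                          # 0 words in common
]

# Rank candidates by word overlap; the coffee sentence wins even though it has
# nothing to do with the query, which is exactly the failure mode described above.
ranked = sorted(sentences, key=lambda s: word_overlap(query, s), reverse=True)
for s in ranked:
    print(word_overlap(query, s), s)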
Instead, we want something called semantic search, where instead of the words, the actual meaning of the sentence is taken into account, and for that we need something very important: embeddings. In the comments there's a blog post and a video that describe embeddings in a more detailed way, but let me tell you very briefly what a text embedding is. Here is a text embedding: everything on the left is a word, here are the words apple, banana, strawberry, all the way to car, and the idea is that to each one of these you assign two numbers, which are its horizontal and vertical coordinates. For example, banana has coordinates (6, 5). The idea is that similar words get similar points in the embedding, so the fruits are close by, and the houses, the sports, and the vehicles are each close by. There's more to an embedding than this, but you can imagine it as something like this; of course, a real embedding has many more numbers, not just two, sometimes hundreds or even thousands of numbers, but that's the idea. We can actually look at an embedding in the Cohere playground and put in some sentences, because sentences get embeddings too, not just words. As you can see, all the sentences that look like they're greeting somebody are over here, the sentences about how much you love your dog are here, and the sentences about how much you enjoy watching soccer are over here. So the idea of an embedding is that it turns a bunch of text into a bunch of numbers, and the numbers are similar if the text is similar, even if the pieces of text have no words in common; if two sentences mean something very similar, even though they share no words, the embedding will put them close by.

Now the question is how to search using a text embedding, and for that we're going to use something called nearest neighbors. Let's say you have these four sentences again, "The last World Cup was in Qatar", "The sky is blue", "The bear lives in the woods", and "An apple is a fruit", and the query is "Where was the last World Cup?". Let's locate everything in an embedding: the sentences get located, say, around here, and when we locate the query it lands close to the first sentence, "The last World Cup was in Qatar", because the embedding knows that these two sentences are semantically similar. Therefore the winner is "The last World Cup was in Qatar", and as you can see, this way of searching takes into account the meaning of the sentence and not just the words in it, so it's much more effective than lexical search. You can play with this in the Cohere playground: take the actual sentences and locate them in the embedding, and when you add the query "Where was the last World Cup?", it appears over here, so the answer is the closest sentence. You can put in a bunch more questions, and each question lands closest to its answer. In principle, that's how nearest neighbors works: it looks for the nearest neighbor and says "I think this is the answer", and it seems to work pretty well.

I talked about distance, but in reality we work with similarity. Similarity is a similar notion, except it's big when two things are close and small when two things are far. For example, the sentences "Hello, how are you?" and "Hi, how's it going?" are close, so they would have a high similarity, while "Hello, how are you?" and "Yesterday I saw an elephant" are far apart, so they would have a low similarity. In the comments there's also a video and a blog post that talk in much more detail about the different types of similarity we can use for these models.
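Here is a small, self-contained sketch of nearest-neighbor search with cosine similarity. The embed() function is a hypothetical placeholder that returns made-up vectors; in practice those vectors would come from a text embedding model, for example Cohere's embed endpoint.

import numpy as np

def embed(texts):
    # Placeholder for a real embedding model: it would return one
    # high-dimensional vector per input text. These numbers are made up.
    fake_vectors = {
        "where was the last world cup":    [0.9, 0.1, 0.0],
        "the last world cup was in qatar": [0.8, 0.2, 0.1],
        "the sky is blue":                 [0.1, 0.9, 0.0],
        "the bear lives in the woods":     [0.0, 0.8, 0.3],
        "an apple is a fruit":             [0.1, 0.1, 0.9],
    }
    return np.array([fake_vectors[t] for t in texts], dtype=float)

def cosine_similarity(a, b):
    # Similarity is large when two vectors point in a similar direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sentences = [
    "the last world cup was in qatar",
    "the sky is blue",
    "the bear lives in the woods",
    "an apple is a fruit",
]
query = "where was the last world cup"

sentence_vectors = embed(sentences)
query_vector = embed([query])[0]

# The nearest neighbor is the sentence whose embedding is most similar to the query's.
similarities = [cosine_similarity(query_vector, v) for v in sentence_vectors]
best = int(np.argmax(similarities))
print(sentences[best])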
Now, there's a problem with nearest neighbors: it's good, but it's slow. Imagine this: let's say I want to find that the answer to "Where was the last World Cup?" is "The last World Cup was in Qatar". What I have to do is calculate a bunch of distances, or a bunch of similarities; in fact, I have to calculate as many of them as there are points in the dataset, and if I want to find the answers to all these questions, I pretty much have to calculate a lot of distances. How many? Well, if we have eight sentences, then we have eight squared distances (some of them, eight to be exact, are zero, but it's still on the order of eight squared). If you had a thousand sentences, you'd have a million distances, because it's a thousand squared, and if you have n sentences, you have n squared distances. That's a lot to calculate, so there are some shortcuts we can use to avoid calculating the whole thing and still find the nearest neighbor, or at least something pretty close. Some solutions are the inverted file index, which first clusters the points and then searches around the close ones, so you don't need to search the ones that are very far away in another cluster, and hierarchical navigable small world (HNSW), which does something similar: it starts with a few points, searches there, and then keeps adding more points, reusing the information from what it had before. These are much more efficient ways of searching.

Now, this doesn't need to work in just one language; it actually works in many, many languages, and the Cohere multilingual model is very useful for that. If you look in the playground, you can play with the multilingual model and put in a bunch of sentences in different languages. For example, I have the same sentence in French, English, and Spanish, and you can see that the embedding locates them pretty close together. So you can have a dataset of responses in one language, ask something in a different language, and still find the best response; this is pretty much language agnostic.

Now here's a question: are we done with just embeddings and similarity? They seem to work pretty well, better than lexical search, but is that all? As you can imagine, there may be some potential problems. Here's one: let's say our query is "Where was the last World Cup?", which is over here, and the answers are "The last World Cup was in Qatar", "The previous World Cup was in Russia", and "The World Cup is in the moon". These are all similar sentences, and the embedding only cares about whether sentences are similar; it doesn't really check whether the actual answer was given. We would love for "The last World Cup was in Qatar" to be the answer, but "The previous World Cup was in Russia" is also very close to the query, and it gives something pretty close to the answer without being the answer. So looking for the closest sentence doesn't necessarily give us the answer; it gives us something pretty similar to the question. How can we fix that? How can we make search better? There are several techniques that we use.

The first one is re-ranking. Let's say the query again is "Where was the last World Cup?", and when we look using an embedding, the most similar sentences are these; some of them are answers and some of them are non-answers. So we have a different model that ranks them, a model that is trained to judge how good a question-answer pair is, and that model will select the actual answer, "The last World Cup was in Qatar". How does this work? Well, we have a dataset of questions and answers, and we train a model on them to give high scores to correct question-answer pairs and low scores to pairs formed by a question and a bad answer. Notice that you can make the bad answers really close to being right: the bad answers here are very close to the correct answers, except they are slightly wrong, for example "The sky is red", "The bear lives in New York City", or "An apple is a vegetable". They look like the answer, but they're not, so a model that's trained to tell good answers from bad answers like these will be very good at re-ranking. We then use an embedding plus re-ranking to search better: if you ask, for example, "Where was the last World Cup?" and the responses are the closest sentences given by the embedding, the model scores all of them and finds that some are not very good and some are pretty good, and based on these scores you rank them, and the one with the highest score is the answer. This has given great results in terms of search quality.
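Here is a sketch of that two-stage pipeline. fake_rerank_score() is a hypothetical stand-in for a model trained on good and bad question-answer pairs; its scores are hard-coded so the example runs on its own, whereas a real re-ranker would compute them from the text of each pair.

def fake_rerank_score(query: str, candidate: str) -> float:
    # Placeholder: a real re-ranking model scores the (query, candidate) pair jointly.
    made_up_scores = {
        "the last world cup was in qatar":      0.97,
        "the previous world cup was in russia": 0.35,
        "the world cup is in the moon":         0.05,
    }
    return made_up_scores.get(candidate, 0.0)

query = "where was the last world cup"

# Candidates returned by the embedding search: all similar to the query,
# but only one of them actually answers it.
candidates = [
    "the previous world cup was in russia",
    "the last world cup was in qatar",
    "the world cup is in the moon",
]

# The re-ranker scores each pair and the highest-scoring candidate wins.
ranked = sorted(candidates, key=lambda c: fake_rerank_score(query, c), reverse=True)
print(ranked[0])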
Another technique is to use something called positive and negative pairs. Just like before, we have a bunch of questions with correct answers on the left (those are the positive pairs) and a bunch of questions with bad answers, answers that are very close to the correct one but are not it (those are the negative pairs), and we're going to play with the embedding. For example, take the question "What color is the sky?", the actual answer "The sky is blue", and a bad answer "The sky is above". If a question and a bad answer form a negative pair, we move them apart in the embedding, and if a question and its correct answer form a positive pair, we move them closer together in the embedding. So we move the embedding around to optimize it for answering questions properly; this has also proven very useful for quickly improving search models, and there's a small sketch of the idea below.

And that's it for semantic search. As you can imagine, there are many, many great ways to improve these algorithms, and we're always working on them. Stay tuned for more similar videos. [Music]
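As a closing illustration of the positive and negative pair idea described above, here is a toy sketch (not an actual training procedure) that nudges an answer's embedding toward the question for a positive pair and away from it for a negative pair; real systems do this with a training objective over many pairs.

import numpy as np

def nudge(question_vec, answer_vec, positive: bool, step: float = 0.1):
    # Move the answer embedding toward the question for a positive pair,
    # and away from it for a negative pair.
    direction = question_vec - answer_vec
    return answer_vec + (step * direction if positive else -step * direction)

question    = np.array([1.0, 0.0])  # "what color is the sky"
good_answer = np.array([0.8, 0.3])  # "the sky is blue"  (positive pair)
bad_answer  = np.array([0.7, 0.4])  # "the sky is above" (negative pair)

good_answer = nudge(question, good_answer, positive=True)
bad_answer  = nudge(question, bad_answer, positive=False)

print(np.linalg.norm(question - good_answer))  # smaller: moved closer
print(np.linalg.norm(question - bad_answer))   # larger: moved away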
Info
Channel: Cohere
Views: 18,145
Id: fFt4kR4ntAA
Length: 11min 53sec (713 seconds)
Published: Thu May 11 2023