Vector search, RAG, and Azure AI Search

Captions
Okay, hello! I am recording myself giving the talk that I gave in SF a few weeks ago, because it wasn't recorded and some folks would like to see it. This is a talk about vector search, and specifically vector search inside Azure AI Search. I'm going to start off by describing why we need vector search for RAG and how vectors and vector databases work, and then really focus in on Azure AI Search and what it can do for us.

So let's start with my favorite thing, which is retrieval augmented generation, abbreviated as RAG. I like to say I'm an AI skeptic but a RAG fangirl; I really do love RAG. So what is it? Let's talk about LLMs first. An LLM is a large language model like GPT-3.5, GPT-4, or any of the other models like Mistral or Llama. These large language models are really impressive in what they can do and how much of a grasp they have of language, and not just English but other languages as well, because they've seen so much language in their training. But they do have limitations.

Here is an example where I was chatting with an LLM in a little app I have, just with GPT-3.5. (Oh, I see I need to add this to the stage; all right, this is my first time trying to go live.) In this example I say, "Write a model class using the latest version of Flask-SQLAlchemy," and it happily responds, "Here's an example of a model class using the latest version." However, it is not the latest version. I know it's not, because I'm a Flask-SQLAlchemy maintainer and I helped make the latest version; this is just out of date. So this is one big limitation of LLMs: they have outdated public knowledge, and their knowledge also tends to be skewed toward whatever they've seen the most. If a model has seen lots and lots of the old version of something, it's more likely to spit that out; it doesn't necessarily know what is latest. Either it doesn't have the data at all because of when it was trained, or it has seen something so rarely that it doesn't think it's the important thing to show.

Another example: I could ask, "Do my company perks cover underwater activities?" and it says, "Well, you should consult your own employee benefits package," because you should check documentation that's specific to you. LLMs do not have access to my company perks; they don't know what company I work for or what those perks are, so they can't answer a question about internal knowledge, because they don't have internal knowledge. These are limitations of LLMs, but they're also just a feature of how LLMs work: they are very good next-word predictors trained on public knowledge up to a certain point, so there are obviously going to be things they can't do.

So how can we incorporate domain knowledge with an LLM so that it can answer those questions? There are three main ways people go about it. The first is prompt engineering. For example, with that Flask-SQLAlchemy example, if the model actually had seen the newer version, I could say right in the prompt, "Make sure you're using the latest version of Flask-SQLAlchemy," spell out the version number, and put a lot of effort into that prompt to get it to use the most recent version. Sometimes that works.
It can work if that knowledge is somewhere inside the model's weights and just needs to be triggered, but it's only going to work if it's somewhere in its training. If the model has literally never seen it, prompting is just not going to work.

The next option is fine-tuning. That's where you start with the base model, come up with at least 200 or so examples of the things you want it to know or the skills you want it to learn, run a training job with those examples, and end up with a model with slightly tweaked weights. This is a fair approach and it is what some people use, but we usually caution against it, because it's expensive to do the fine-tuning and it's also expensive to use a fine-tuned model, at least on Azure infrastructure: using GPT-3.5 is going to be cheaper than using a fine-tuned GPT-3.5. So we generally think of fine-tuning as a last resort because of those drawbacks.

That leads us to retrieval augmented generation, which is a way of learning new facts temporarily in order to answer the current question. It involves retrieving the information and sending it to the LLM so that it can answer the question using that information. Let's see exactly how that works. Here I'm using a RAG chat app to answer the same questions. I say, "Write the model class using the latest version," and we see the latest version, and this time it's actually correct. The reason it's correct is that I'm giving it the latest documentation as I ask the question, so it's able to consult that documentation. Then I ask, "Do my company perks cover underwater activities?" and I give it the document that has the perks, and it can answer the question because it has that document. So now we're able to get around the limitations of LLMs by feeding sources to the LLM along with the user question.

How do we actually do that? We take the user question and search some sort of knowledge base: a vector database, Azure AI Search, an in-memory database. We search for the documents matching the question, using a search that's going to bring back good chunks of information. In this case the results might look like text extracted from a PDF. Then we take both the original user question and whatever documents we got back, send those to the large language model, and get back the answer.

In my experience this works really well. If you can retrieve the right documents, large language models are very happy to answer correctly according to those documents; they're very good at summarizing and synthesizing. You just need to get the right information to the LLM and let it do its magic. So it's all about retrieval: if you can retrieve the right documents, you will generally have a really nice experience getting correct answers. But that means we need to think hard about retrieval, and that's what we're talking about today.

So how do we do that retrieval? Traditional search has been keyword-based, but keyword search has the problem of the vocabulary gap. If I'm looking for "underwater activities" but the word "underwater" is nowhere in our knowledge base at all, then a keyword search would never match "scuba" or "snorkeling."
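To make that flow concrete, here is a minimal sketch of the retrieve-then-generate loop just described. It is not the real app's code: the endpoint, index, deployment, and field names (content, sourcefile) are placeholders, and it assumes the azure-search-documents and openai Python packages.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

search_client = SearchClient("https://<service>.search.windows.net", "<index-name>",
                             AzureKeyCredential("<search-key>"))
openai_client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
                            api_key="<openai-key>", api_version="2024-02-01")

def rag_answer(question: str) -> str:
    # 1. Retrieve: find a few chunks related to the question in the knowledge base
    results = search_client.search(search_text=question, top=3)
    sources = "\n".join(f"{doc['sourcefile']}: {doc['content']}" for doc in results)
    # 2. Augment and generate: send the question plus the retrieved chunks to the LLM
    response = openai_client.chat.completions.create(
        model="<chat-deployment>",
        messages=[
            {"role": "system", "content": "Answer using only the provided sources."},
            {"role": "user", "content": f"{question}\n\nSources:\n{sources}"},
        ],
    )
    return response.choices[0].message.content
```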
That's why we want to have vector-based retrieval as well, which can find things by semantic similarity. A vector-based search is going to realize that "scuba" and "snorkeling" are semantically similar to "underwater" and be able to return those. That's why we're talking about the importance of vector search today.

So let's go deep into vectors; it's actually really fun to talk about them. What is a vector, or, to use the longer term, a vector embedding? A vector embedding takes some input, like a word or a sentence or something even longer, sends it through an embedding model, and gets back a list of floating point numbers. The number of dimensions varies based on the model you're using. Here I have a table of the most common models. We have word2vec, which only takes a single word at a time as input, and the resulting vectors have a length of 300. word2vec has been around for quite a while, so if you've done any work with embeddings before, you may have used it. What we've seen in the last few years is models based on LLMs, and these can take much larger inputs, which is really helpful because then we can search on more than just single words. The one many people use now is OpenAI's ada-002: it takes text of up to 8,191 tokens and produces vectors that are 1,536 floating point numbers long. There are quite a few other embedding models out there. What you want to look at is what each one can encode, what it's known to be good at encoding, and what its limitations are. The important thing when using an embedding model is consistency: if you encode your data with a particular model, make sure that's also the model you use later when searching.

I see a question from Winston: "Doing RAG with PDF documents, the input docs have images and tables. How do you index PDFs intermingled with visual data?" That's a great question. Notice that the first three models in this table can only encode text, but this Azure Computer Vision model can encode an image or text, and that's what we've started using in our RAG app, which has experimental support for images. That's what I'd recommend, and I can point you to the documentation on it: you can pass it an image or text and get it encoded. When we use it in our RAG app, we actually turn the entire PDF page into an image, including the text, because the model is able to look at it that way. I think that's how we do it; I'd have to double check, but I can point you at that.

Okay, let's take a look at computing a vector. I have this Jupyter notebook here, and first it just sets up a connection to OpenAI; I'm using Azure OpenAI, of course, since I work for Azure and this whole talk is about Azure stuff. Then we have these functions that are just wrappers for creating embeddings using the ada-002 model, so we run that to set up the connection, and now we can get an embedding. We'll run this to get the embedding for "dog," starting with just a single word, and you see it takes a little bit of time; this is actually getting sent through the model on the Azure OpenAI servers.
We get back all those floating point numbers, and we can see there are 1,536 of them. We can also put in something a lot longer, like "I'm more of a cat person than a dog person, but actually these days I just like humans"; we can write a really long sentence and calculate the embedding again. That one was actually a lot faster; the first call probably just needed some startup time. Once again the dimension count is 1,536. This is an important point about the ada-002 model, and really all of these models: whatever you embed, the output dimensions are always the same. So we have a vector to represent "dog" in the space of the model, and we also have a vector that represents this whole long sentence.

When we're indexing documents for RAG chat apps, we're often going to be calculating embeddings for entire paragraphs, up to about 512 tokens, which is the best practice, so we're calculating embeddings for decently long things. You don't want to calculate the embedding for an entire book like War and Peace, first because that's above the limit of 8,191 tokens, but also because the more content you pack into a single embedding, the more nuance gets lost when you compare one vector to another. So we want an optimal chunk size, which input-wise is usually around 512 tokens. In English you can roughly think of a token as a word, but it doesn't map directly.

All right, so we've calculated an embedding, and now the question is: why are we doing all this? Why do we have these lists of floating point numbers? The whole point of calculating embeddings is so that we can calculate similarity between vectors, so that we can see that this vector is similar to another vector, or that a new vector is similar to some vector in our vector database. The way to calculate similarity is to use a distance measurement. The one we're going to use is cosine similarity; that's the recommended one for the ada-002 model. There are other ways of measuring distance, and some of them may be more appropriate for other models, but for ada-002 we're going to use cosine similarity. Roughly, you calculate it as the dot product over the product of the magnitudes, and what it tells us is how close two vectors are, that is, what the angle is between the two vectors in multi-dimensional space. Here I'm visualizing it in two-dimensional space, because I can't visualize 1,536 dimensions, but the idea is that if the vectors are really close, the angle theta is near zero and the cosine of the angle is near one. A similarity of one means it's the same vector, and anything near one is a very close vector. As vectors get farther and farther apart, the cosine goes down toward zero and potentially even to negative one.

So let's try doing some similarity comparisons. Here I've got a function to calculate cosine similarity, and I'm using numpy to do the math for me since that will be nice and efficient. I've got three sentences that are all the same, plus a couple of sentences that are different, and I'm going to get the embeddings for each of those sets of sentences and compare them to each other.
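Here is roughly what those helpers look like, as a sketch: an embedding call with the openai package against an Azure OpenAI deployment (the deployment name is a placeholder) plus a numpy cosine similarity function.

```python
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
                     api_key="<key>", api_version="2024-02-01")

def get_embedding(text: str) -> list[float]:
    # ada-002 always returns a 1,536-dimension vector, whatever the input length
    response = client.embeddings.create(model="<ada-002-deployment>", input=text)
    return response.data[0].embedding

def cosine_similarity(a, b) -> float:
    # dot product over the product of the magnitudes
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(get_embedding("dog"), get_embedding("cat")))
```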
One thing I wanted to point out is that you can get batch embeddings, and that's helpful: this function batches up the embeddings so we're doing just two calls instead of six. Batching is a very common thing to do when you're computing embeddings with the API.

What we see is that when two sentences are exactly the same, the cosine similarity is 1.0, which is what we expect. When a sentence is very similar, we see a cosine similarity of 0.91, and this other sentence is at 0.75. When you look at that, it's hard to know whether 0.75 means the sentences are actually pretty similar or actually pretty dissimilar. So I did a little exploration of this; let me show you the little website I made. Let's look at the word "dog." I did this comparison across both OpenAI and word2vec: this is a set of a thousand words, and I calculated the embedding vector for each word with both word2vec and OpenAI. We can see that "dog" is most similar to "god" in this set of a thousand words, with a similarity of 0.866, and if we look at the least similar, we see "isn" (as in part of "isn't") at 0.74. So within a set of a thousand words, the range of similarities is between about 0.74 and 0.87, and I made a similarity histogram so you can see that. Whereas if you look at word2vec, it ranges from 0.76 for "cat" all the way down to about negative 0.05. That's more what I was expecting going into this, because when you look at that cosine graph you think, well, cosine can go from negative one to one, so why aren't our similarities ranging from negative one to one? What you see when you do similarity with the ada-002 model is that there's generally this very, very tight range: about 0.65 is the lowest I've ever seen,
and the 0.91 we got here is actually one of the highest I've seen. So this 0.75 is actually really, really dissimilar, and I can prove it by putting absolute nonsense in there: that one gets basically the same score, 0.74. In the land of OpenAI vectors, that's very dissimilar. So when you're looking at vector similarity, you're usually not looking at the absolute values; you're looking at relative comparisons, like "this one is definitely more similar than that one." It's hard to look at the low scores and know exactly how dissimilar they are, so you really want to look at the relative differences, because the scores live in such a tight space. I looked into why they're in such a tight space, and apparently it's just because of the particular training process, where there's a pooling layer that ends up skewing all of the vectors in a similar direction. If you can visualize it in your head, imagine a cone, a three-dimensional ice cream cone, where all of the vectors live: they're not spreading across the whole space, they're staying within this ice cream cone of the space. I thought that was really interesting, and maybe it gives you a little intuition when you're looking at cosine similarity scores for the ada-002 model, and as you'll see, it definitely does vary between data sets.

I also did this with movie titles, because I thought it was fun. We have Disney movie titles, like "The Many Adventures of Winnie the Pooh," and I calculated the vector embeddings for all of the titles, just the titles, nothing else about the movies. The most similar is another Winnie the Pooh title, which makes sense, but the third most similar is "The Adventures of Huckleberry Finn." That's another observation I've had with OpenAI embeddings: the encoding definitely considers not just semantics but also similar spelling. We saw that with "dog," too, because the most similar word to "dog" was "god," and arguably we would have expected "cat," which is what we get with word2vec, which is purely semantic. So in the OpenAI similarity space, whatever is in its latent space, we get similarity both in terms of semantics and in terms of syntax, spelling, and how words are used. Here we've got "The Adventures of Huckleberry Finn" and "Adventures in Babysitting," so we see things that are semantically similar and things that are similar in spelling. And if we look at the least similar, once again the least similar score is about 0.71, so a really similar tight range of cosine similarity scores.

Okay, so that's vector similarity. The next step is to be able to do a vector search, because everything I just showed you was similarity within an existing data set. What we want is to be able to search for arbitrary things. So we take a user query, whatever the user is looking for, compute the embedding vector for that query using the same model we used to embed our knowledge base, then look in our vector database and find the K closest vectors to that query vector. We can do either an exhaustive search or an approximate search, and then we return those K closest vectors.
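Before the notebook demo, here is a minimal sketch of the exhaustive version of that flow, reusing the hypothetical get_embedding and cosine_similarity helpers from the earlier sketch and an in-memory list of pre-computed vectors.

```python
# Exhaustive ("brute force") nearest-neighbor search over (title, vector) pairs.
def exhaustive_search(query: str, items: list[tuple[str, list[float]]], k: int = 5):
    query_vector = get_embedding(query)  # embed the query with the same model as the data
    scored = [(title, cosine_similarity(query_vector, vector)) for title, vector in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:k]
```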
Let's take a look at how we could do that in code. Here we're going to do a vector search on the movie data I showed before. We just load it in as JSON; I've already computed the embeddings for all the movie titles. Now I can do a search. My query is "My Neighbor Totoro," because those movies were only Disney movies, and as far as I know Totoro is not a Disney movie; it's a Miyazaki movie, and Studio Ghibli is the studio. Since we're doing an exhaustive search here, we go through every single movie in those vectors, calculate the cosine similarity between the query vector and that movie's vector, and then create a data frame and sort it so we can see the most similar ones.

Let's take a look. For Totoro, it thinks the most similar title is "Toy Story," but it also has "Sen to Chihiro," which is "Spirited Away," an actual Studio Ghibli movie. What I realized after doing this search is that Disney became a distributor for some of the Miyazaki movies, so we actually see "Ponyo" and "Spirited Away" show up in this list. And most of these stories are about large toys or animals brought to life, so they do feel fairly similar to Totoro. This is purely based on the title, but remember, this is the ada-002 model and it has seen so much of the world; there's so much in those embeddings that we'll never fully understand, and we get really interesting responses even by embedding just the title. Now, if I were actually going to make a search engine for recommending related movies, I would also embed the descriptions of the movies, at least a paragraph describing each one, and I think that would give much better recommendations than going off the title alone.

As I said, this is an exhaustive search: it searches the entire space. I was able to do that because I only have about 500 movies here, so it ran pretty quickly, but as you grow to a database with many more vectors, you're going to have to use an approximate search, and we'll talk about options for that.

All right, moving on to how we store our vectors. We're going to want to store them in some sort of database, usually a vector database or a database that has a vector extension. We need something that can store vectors and ideally knows how to index vectors well. Here's a little example of Postgres code using the pgvector extension, which is a popular approach, especially if you're already using Postgres. We declare our vector column and say it's going to be a vector with 1,536 dimensions, then we can insert our vectors, and then we can do a SELECT that checks which stored embedding is closest to some new embedding we're interested in. We also need an index (we actually should create the index before the SELECT) so that we have an efficient way of searching, and this one is an index using HNSW, an approximation algorithm that we'll talk about.
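Here is a rough sketch of that pgvector flow driven from Python. The table name is made up, and it assumes the psycopg2 driver and the pgvector Python package for passing vectors as parameters.

```python
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=mydb")
conn.autocommit = True
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(1536))")
# HNSW index so similarity queries use an approximate search instead of a full scan
cur.execute("CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)")
register_vector(conn)

doc_embedding = np.array([0.0] * 1536)    # stand-in for a real ada-002 vector
query_embedding = np.array([0.0] * 1536)  # stand-in for the query's vector
cur.execute("INSERT INTO items (embedding) VALUES (%s)", (doc_embedding,))
# <=> is pgvector's cosine distance operator: smallest distance = most similar
cur.execute("SELECT id FROM items ORDER BY embedding <=> %s LIMIT 5", (query_embedding,))
print(cur.fetchall())
```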
That's the kind of thing you're looking for in a database: the ability to say that a particular field is a vector field, to build an efficient index for that field, and to query it, and ideally also to combine vector queries with other queries, access control, and so on.

On Azure we have several options for vector databases. If you already have your data in a database like Cosmos DB, we have vector support in Cosmos DB for MongoDB vCore and also in Cosmos DB for PostgreSQL, so you can keep your data where it is. For example, if you're building a RAG chat application on your product inventory, and your inventory changes all the time and already lives in Cosmos DB, it makes sense to take advantage of the vector capabilities there. Otherwise, we have Azure AI Search, a dedicated search technology that does not just vector search but also keyword search, has a lot more features, and can index things from many, many sources. This is what we generally recommend for really good search quality, and it's what I'll be showing for the rest of this talk: all of its features, how it integrates, and what makes it a really good retrieval system, because as I was saying, when we're doing RAG we want really, really good retrieval.

So let's go through these features in more detail. First of all, Azure AI Search does now have vector search. It actually didn't as of a year ago, but the team realized how important it was and added it. You can use it via the Azure Python SDK, which is what I'll use, but also with Semantic Kernel, LangChain, LlamaIndex, or whatever packages you're using; lots of them have support for using Azure AI Search as the RAG knowledge base.

Let's look at how to use the SDK to do a vector search. I start with my setup, which is just importing from the azure-search-documents package, creating a search client, and creating our OpenAI client. Everything's set up and credentialed, so now we can create an index. (I make so many of these indexes... here we go.) This is a teeny tiny index with just a couple of fields: an ID field, which is our primary key, and an embedding field, which is going to be a vector. We tell it how many dimensions it will have, and we also give it a profile, this embedding profile. Down below is where we describe that profile: we say this embedding profile is going to use this algorithm configuration, and that's where we describe what sort of algorithm, or indexing strategy, we want. We're going to use HNSW, which stands for hierarchical navigable small world. It's pretty much the go-to for vector search these days; there are a couple of other options, like IVF, but HNSW is what Azure AI Search supports because it works really well and they're able to run it efficiently at scale. So we say it's HNSW, we tell it what metric to use for the similarity calculations, and we can also customize other HNSW parameters if you're familiar with that algorithm and want to pass in more.
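Here is a sketch of what that tiny index definition looks like with the azure-search-documents SDK. The class names below are from the 11.4.x version of the SDK and may differ in other versions; the index, field, and configuration names are just illustrative.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration, SearchField, SearchFieldDataType, SearchIndex,
    SimpleField, VectorSearch, VectorSearchProfile,
)

index_client = SearchIndexClient("https://<service>.search.windows.net",
                                 AzureKeyCredential("<admin-key>"))

index = SearchIndex(
    name="tiny-vector-index",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchField(
            name="embedding",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            vector_search_dimensions=3,  # 3 for the toy documents below; 1536 for ada-002
            vector_search_profile_name="embedding-profile",
        ),
    ],
    vector_search=VectorSearch(
        profiles=[VectorSearchProfile(name="embedding-profile",
                                      algorithm_configuration_name="hnsw-config")],
        algorithms=[HnswAlgorithmConfiguration(name="hnsw-config")],
    ),
)
index_client.create_or_update_index(index)
```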
So we set that all up; this is telling it what kind of index it's going to use. Let's run it. Okay, it has created the index, and now we just upload these very simple documents; you can see there are three documents, each with a three-dimensional vector. Those are now in the index, and now we can do a search. The way we do a search here is to say we're not doing any text search, only a vector query: here's a vectorized query with a little query vector, we're asking for the three nearest neighbors, and we're telling it to search the embedding field, because you can actually have multiple vector fields. Then we run the search and output the score. The score in this case is not necessarily the cosine similarity, because the score can take other things into account as well, and there's documentation about what the score means in different situations. What matters is the relative ordering: we can see which result it thinks is most similar at the top, and then it goes down from there. If we put in a really different vector, we see much lower scores. I usually don't look at the absolute scores myself (you can); I look at the relative scores.

That was an example with a tiny index; now I want to show the same thing on a real index, and I'll actually show you that index in the portal. Let's go to the portal... there it is, that's the search service, and I can click on Indexes and then on this index. It has 691 documents, where each document is actually a chunk of a PDF, because of how we index them. Here I'm just searching for everything to show you what's in the index: an ID, the content (the text), the embedding (the vector), and then the page number and the file name it came from. That's what we're going to be searching.

Let's go back to the code. We take that "learning about underwater activities" query, compute an embedding for it, and search using a vector query. We're not searching text at all; this first argument, the None, corresponds to the search text, so we are not passing the query as text, only as a vector. We've turned it into a vector, we pass that in, and we've told it to get 50 nearest neighbors. (Fifty is actually quite a lot; normally I would do fewer, like five, so let's change that.) Here we can see it got back results, and the very top result has scuba diving lessons and ski and snowboard lessons. That's what we were looking for, and it's exactly what we can get with vector search that would not have worked with a keyword search, because a keyword search wouldn't have found "underwater" anywhere in here.
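As a sketch, the query side of that looks roughly like this with the SDK (it assumes the get_embedding helper from earlier; the index and field names are placeholders for the real index's names):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient("https://<service>.search.windows.net", "<index-name>",
                             AzureKeyCredential("<query-key>"))

query = "learning about underwater activities"
vector_query = VectorizedQuery(vector=get_embedding(query),
                               k_nearest_neighbors=5, fields="embedding")

# search_text=None means this is a pure vector query, with no keyword search
results = search_client.search(search_text=None, vector_queries=[vector_query])
for doc in results:
    print(doc["@search.score"], doc["content"][:80])
```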
So that's a vector search with the Azure AI Search SDK, and as you saw, we used an approximation search: HNSW is a specific example of an approximate nearest neighbor search, so you'll also hear the term ANN, and HNSW is a specific instance of that. There are others, like IVF. HNSW has very good performance and recall, and it has additional parameters you can tweak if you find you need different performance out of it.

You can also do an exhaustive KNN search, which means it will go through the entire set of vectors and find the true nearest neighbors. Typically this is not recommended, because it can take so long to go through and compare every single vector, but there are scenarios where you might want it. It can be useful for establishing a baseline: if you're trying to evaluate how well HNSW is working, you can first run an exhaustive KNN search for some example queries and then compare that to the HNSW results to see whether you're getting the same, or good enough, results. You could also use exhaustive KNN in scenarios where you have highly selective filters: if there's some pre-filter, like getting just the documents for a particular user, and you think that cuts things down to a reasonable level, say a thousand or ten thousand documents, that might be a situation where you can just do an exhaustive search. So it is certainly an option; just think carefully about how many documents you have, what sort of performance you need, and look at the quality of your results to see how it's working for you.

We have other capabilities when we're doing vector queries. We can combine vector queries with other filters, anything we could already do with Azure AI Search: filtering on a column, less-than or greater-than comparisons on dates, all of those still work in combination with a vector query. One thing you have to keep in mind is whether you should be doing a pre-filter or a post-filter. You almost always want a pre-filter, which means the filter runs first and then the vector search runs. Here's why: imagine you're filtering on something like tag equals "perks" as a post-filter. If you first do the vector search to get all the related documents and then apply the filter, you might end up not finding anything at all, because maybe nothing in the top 10 vector results had tag equals "perks." Instead you want to run the filter first, find all the documents that qualify, and then pass those to the vector search. So think carefully about that. It does default to pre-filter, meaning it runs the filter first and then does the vector search, but you do have the option of a post-filter if you need it.

We also support multi-vector scenarios. For example, you could have an embedding for the title of a document that's separate from the embedding for the body, and you can search those separately or at the same time with different queries, because you always specify which vector field a particular query runs against. The other time we use this a lot is multimodal queries: if we have both an image embedding and a text embedding, we might want to search both of those embedding fields at once.
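To make the pre-filter point concrete, here is a sketch of a filtered vector query with the SDK (VectorFilterMode is in azure-search-documents 11.4+; the category field and its value are made up for illustration):

```python
from azure.search.documents.models import VectorFilterMode, VectorizedQuery

vector_query = VectorizedQuery(
    vector=get_embedding("do my perks cover underwater activities?"),
    k_nearest_neighbors=5, fields="embedding")

results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    filter="category eq 'perks'",                    # OData filter applied to the index
    vector_filter_mode=VectorFilterMode.PRE_FILTER,  # filter first, then vector search (the default)
)
```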
We'll see that multimodal, multi-vector case when we do the image search. So, lots of options there, and of course, this being an Azure product, we've got data encryption, secure authentication, the ability to put it behind a private endpoint, and all the compliance certifications, everything that comes with being an Azure product.

Now, this is really exciting: we can now do searches on more than just text. ada-002 is just for text, but Azure Computer Vision has a multimodal embeddings API that can turn images and sentences into embeddings in a shared space that understands both images and text together, and it's incredibly cool and powerful. Let's see what that looks like in code. First we have setup: once again we connect to the search service, and then we create a search index for images. This one has an ID, a file name, and an embedding, and this time the vector search dimensions value is 1,024, because that's the dimensionality of the embeddings that come from the Computer Vision model; it's a slightly different length than the other one. Everything else is the same, with HNSW, and we create the index. I already made that index and populated it, so I'm not going to run that again.

Next we configure the Computer Vision calls: we're just making HTTP requests, sending in our authentication information. I've made a function that gets an image embedding from an actual binary image file, and another function that gets a text embedding from a string of text. Then we can use those functions to go through all of these local images I have here. These are actually pictures from an old Etsy shop I had back when I lived in the woods; we were near a beach, I collected all this driftwood, and I was trying to figure out what to do with it. I passed all of these images through the Computer Vision API to get an embedding for each image and then uploaded them to Azure AI Search with the ID, the file name, and the embedding. I did that already and won't do it again, since it takes a little time, so now I can query the index with new images.

I have another folder with separate images that have not been added to the index, so we can query with those. For example, here's one: tealights inside one of the driftwood pieces; we can take a look at what it looks like. We convert it into an image embedding and do a search, searching only on the embedding field, not on the file name. That runs the vector search on the embedding, and the top result it returns is in fact the same product, but photographed in nighttime lighting. It was able to see that those two images were very similar, and you might think, oh, it's just using the file name, but we didn't give it the file name at all; it just saw that the images were really similar.

And here's the cool part: we can also use text embeddings to search the image embeddings, and this is what really blew my mind. We can search for "sea candle," and it returns the sea candle: it thinks that image is the most similar to the text "sea candle."
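Here is a rough sketch of those two helper functions, calling the Azure AI Vision multimodal embeddings REST API directly with requests. The api-version and model-version strings change over time, so treat them, and the resource name, as placeholders.

```python
import requests

VISION_ENDPOINT = "https://<vision-resource>.cognitiveservices.azure.com"
PARAMS = {"api-version": "2024-02-01", "model-version": "2023-04-15"}
HEADERS = {"Ocp-Apim-Subscription-Key": "<vision-key>"}

def get_image_embedding(image_path: str) -> list[float]:
    with open(image_path, "rb") as f:
        response = requests.post(
            f"{VISION_ENDPOINT}/computervision/retrieval:vectorizeImage",
            params=PARAMS,
            headers={**HEADERS, "Content-Type": "application/octet-stream"},
            data=f.read())
    return response.json()["vector"]  # 1,024 floats

def get_text_embedding(text: str) -> list[float]:
    response = requests.post(
        f"{VISION_ENDPOINT}/computervision/retrieval:vectorizeText",
        params=PARAMS, headers=HEADERS, json={"text": text})
    return response.json()["vector"]  # same 1,024-dimensional space as the images
```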
We can also do this in different languages. Let's see if it works in Spanish, "vela de mar," I think, and yes, it got it. Remember, these are LLM-based models; they understand more than just English because they've seen quite a lot. Let's try the Spanish word for earrings, to make sure it's finding other things; yes, it found the earrings. And now the thing that really blew my mind: if I pass in "Lion King," what it gets back is this piece of driftwood with "Hakuna Matata" burned into it. To me that is so impressive, because it means the model could see the words "Hakuna Matata" that I scratched onto this thing, extract those words somehow into its space, and also know that "Hakuna Matata" is related to The Lion King, and so it gave back this image. It's just incredibly powerful, a really impressive model, and it's been pretty fun to play with.

We do have support for this in our RAG chat repo; that's this one, the chat application. I should show the GitHub here. It's on this GitHub repo, and there is a document specifically about how to try out GPT-4 Turbo with Vision. If you're interested, that uses both the Vision embeddings I just showed and GPT-4 Turbo with Vision in order to answer questions about images. I have an example where I asked it to identify the correlation between oil prices and stock market trends, and we can look at what it ended up finding: we found both text and image embeddings and sent all of that to GPT-4 Turbo with Vision. So, to answer Winston's question from the beginning: if you have PDFs that contain a lot of images, you could consider using this sort of approach to compute embeddings for the images themselves, but then you do have to pass them to GPT-4 Turbo with Vision, because it's the only model that understands images. It's pretty cool, but it's definitely experimental in our repo; I don't know anyone doing this particular thing with images in production. It's also a bit slower, and we have some ideas for speeding it up, in terms of being more efficient about what we send, but it's really interesting to see what it can do.

Okay, let's talk more about retrieval and relevance. We really want to get the most relevant chunks of information for our RAG app, because when we're sending information to an LLM, we want to send just the right amount of good information. If we look here, we can actually see what we're sending to the LLM: we're not sending entire documents, and we're not sending all of our documents; we're sending chunks from a document, and just the chunks that we think will answer the question. In this example we're sending three; we could maybe up that to five, but you don't want to send too much information. The reason is a paper called "Lost in the Middle": what they found is that if you send too much information to a model, even if it technically fits in the context window, eventually the model stops paying attention to things that are in the middle of the information, and things get lost.
So if you send too much information, you get degraded answer quality, because the model just stops paying attention to it. We really want to send just the right amount of information, and the right amount is usually between three and five chunks, where each chunk is about 512 tokens. According to the research, that's the right amount of information to send to an LLM so it can answer well.

So we've figured that out, but if we're only going to send three to five results, how do we make sure those three to five results contain the most important information? What we do is use every possible trick in the book, and that's what Azure AI Search does. It does vector search, it does keyword search, it can combine the results from the vector and keyword searches using a fusion step, and it has an additional machine learning model called the semantic ranker that takes those combined results and re-ranks them so the very best things are at the top, and then you can say, all right, I just want the top five from that. That's how you get optimal relevance, so those top three to five are absolutely the best results: by using all of these options together.

I'm going to demonstrate that in code, as always, because that's what will convince you. Here we have the setup again, setting up all the clients and authentication. First we'll look at a situation where vector search falls short, because you might think, why can't I just use vector search? Well, vector search does not work for all queries, and there are actually quite a few queries where it won't work. The classic example is exact strings: somebody is looking for an exact string. Here I'm searching for the string "$45"; maybe I remembered that something cost $45 and I'm looking for that. So I do a vector search, passing in the vector for "$45," and I do get back results. Here's the thing about vector search: it will always get you back vectors. As long as the space has at least one vector in it, it will find whatever is closest, even if it's very, very far away. So it does return some documents, but those documents do not contain the string "$45"; it basically didn't find what I was looking for. Now, if I do a text search, I immediately (and that was really fast) find the matching text: this was the exact string "$45" I was looking for. So here, vector search failed and keyword search succeeded.

Okay, so can we do hybrid? Yes. What we do is specify both the text query and the vector query, and Azure AI Search will merge them with that reciprocal rank fusion step, and here we can see that "$45" is at the very top, which is great: it was able to figure out which of those responses was best. The thing is, hybrid ranking alone is not always optimal. Here we're doing that underwater activities query we saw earlier, with both a search query and a vector query, so it's a hybrid search, and what I see is that skiing and snowboarding is in the results, but it's number three. That would probably be good enough for the LLM, but ideally it would be number one; we're risking it falling off the list. So how do we get it to number one?
What we do is use the semantic ranker, and these are the options to enable it. The semantic ranker is an additional machine learning model based on the Bing model (Bing uses it for re-ranking its search results, so it's good at re-ranking). It runs after the hybrid search results come back and are fused together, and it does additional re-ranking, and what you can see is that it did indeed bring snowboarding and scuba diving to the very top of the list. So, in conclusion, that's why we want them all: we want to do a vector search, a keyword search, a hybrid search combining both, and then the semantic ranker on top, in order to get the optimal ranking of the candidates. Then we can just take the top three, give those to the LLM, and they should contain what we need.

Now, one thing about the semantic ranker is that it does cost additional money, and that might be a reason some folks decide it's not worth using. We're hoping to bring those costs down, but you can try it out for free for the first 10,000 requests to see how it affects your answers. The team did some research in a really good blog post that I recommend everybody read: they looked at all these different metrics and comparisons to see what produced the optimal search, and it's also where I got that 512-token figure, because they tried to figure out the optimal chunk size, where to break chunks up, and what kind of overlap to have between chunks. They did a lot of really good research on how to prepare your data for RAG and how to search it, and what they found, across the board, is that hybrid plus semantic ranking generally gets the best results; we can see that in this chart, with hybrid plus re-ranking getting the best results overall. There's also another chart that breaks it down by query type, and that's really interesting because you can see there are areas where particular approaches are stronger and areas where they're weaker. That's something to think about for your particular use case: what kinds of queries are your users issuing, what do your documents look like, do they have lots of exact strings, what's going to work best for your situation, and where does it fit in this table?
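Putting the retrieval pieces together, a hybrid query with the semantic ranker enabled looks roughly like this in the SDK (a sketch: it assumes a semantic configuration already exists on the index, and its name here is made up):

```python
from azure.search.documents.models import QueryType, VectorizedQuery

query = "learning about underwater activities"
results = search_client.search(
    search_text=query,                        # keyword search
    vector_queries=[VectorizedQuery(          # vector search, fused with RRF
        vector=get_embedding(query), k_nearest_neighbors=50, fields="embedding")],
    query_type=QueryType.SEMANTIC,            # re-rank the fused results
    semantic_configuration_name="default",
    top=5,                                    # send only the best few chunks to the LLM
)
for doc in results:
    print(doc["@search.score"], doc["content"][:80])
```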
All right, now we can talk a little bit about data ingestion. I've touched on this briefly already, but you need to think about how you're going to ingest data into your knowledge base, in this case into Azure AI Search. Ingesting data means you have to extract the data from your documents in some way, and then you need to chunk that data into good sizes for the LLM, assuming your data is sufficiently long. The optimal size is about 512 tokens, which you can think of as roughly 500 words, so if your data is significantly longer than that, say you're indexing 50-page documents, you're definitely going to want to chunk it up. You need to chunk with an overlap, so there's actual overlap between each of the chunks, which helps with the LLM's understanding, and you need to think about where you cut things off: you don't want to cut in the middle of a sentence or the middle of a table; you want to preserve whole tables across chunks. There's a lot of effort that goes into figuring out how to chunk, and there are various libraries that help with it: LlamaIndex has a lot of great functionality for this, LangChain probably does as well, a lot of people have written their own, we've got our own chunker in our repo, and there are probably thousands of chunking scripts out there by now.

There's also now what I like to think of as chunking as a service: it's called integrated vectorization, and it's integrated into Azure AI Search, so you can have all of this done in the cloud. You just point it at blob storage, and it can crack a bunch of formats (PDFs, Office documents, and so on), extract the text, chunk that text into passages based on the best practices from the research, compute the vector embeddings for each of those chunks, and store them in AI Search. This is now offered as a cloud service, so you don't have to write the chunking code yourself or run all of this data ingestion on your own machine; you hook it up to some storage and all of this happens for you, which is really nice so that we don't all have to write our own chunkers. There's also integration for Azure AI Search in the new Azure AI Studio and in the Azure AI CLI, so check those out if you prefer a WYSIWYG approach or a CLI approach to doing things; there's some integration there.

I also just wanted to mention a few more use cases for RAG. I've seen lots of customers using our particular RAG chat app repo, the one I had open here, for all kinds of really interesting use cases, and actually putting it into production. We've seen people use it for public government data, which is great because governments have so many PDFs; imagine being able to actually search your local government's data. I've processed my own local government data just for fun, so that's definitely a use case. Internal HR documents are a huge use case, because everyone has these internal documents that are hard to sift through. Customer support requests, anything to help customers, product inventory, common questions, all of that is really commonly seen, and technical documentation, issue trackers, and product manuals can all be great things to feed into a search index and build a RAG chat on.

For next steps, here are some links where you can learn more about Azure AI Search. You can read that really good blog post about how to optimize your ingestion and your search, you can use our repo to deploy a RAG chat app for your organization's data, you can explore Azure AI Studio, and you can join us for next week's Microsoft AI Chat App Hack, from January 29th to February 12th, where we'll be building on this repo and will have tons of sessions going into more detail about how to make a RAG chat app using Azure AI Search. So there you go; check out the slides and the links, and I hope that was helpful. Oh, I see questions, cool.
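Before the Q&A, here is a toy illustration of the overlapping-chunk idea from the ingestion discussion above. Real chunkers (like the one in the repo) split on sentence and table boundaries and count actual tokens; this just approximates tokens with words to show the sliding-window mechanic.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks, approximating tokens with words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks
```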
Let's see: how well does vector search work on complex topics? Let me see if I can show it... okay: how does it work on complex topics, where a lot of contextual background information is required? Is three to five search results limiting? I haven't seen it be limiting yet, but that's why we do have it as an option, like in the developer settings here; it's a numeric option, so we could retrieve a lot more if we were seeing that it just wasn't getting good results. In the documentation I talk about the process of debugging quality, about customization and improving answer quality. You have to identify the problem first: is it that we're not getting good search results for the query, or that we're not getting a good answer based on the search results? What I always recommend when you're debugging a RAG chat app and you don't like the answers you're getting is to first look at what was actually passed to the LLM: is the answer in there? A lot of times what we find is that the answer isn't in there. In almost all the situations where I've helped customers, the answer wasn't in what got passed to the LLM, and that was because of something about their search configuration or the way they ingested the data. So that's what I would look at first. There definitely could be situations where you need to grab more than three, five, or ten results, or where you need smaller chunks. I also did a sample RAG app on diabetes research papers, and it actually seemed to do pretty well on that too. It's a great question, and that's why I encourage people to try these things out and then document them, because RAG is really new; I've only been doing this for about six months, and I think we're all still learning best practices. So if there's some particular situation, do some experimentation, and it's really helpful if you can document it.

And you're saying there's a law case where certain important facts need to be factored into a response: is vector search, or RAG alone, appropriate? You can think about doing other things as well, like searching for supporting facts or searching for relevant case law. I was actually just talking about this today, so I should have this open: function calling. The OpenAI SDK and APIs have this notion of function calling, or tools, where you tell the API what potential functions it could call, and it has to detect that it needs to call those additional functions, so you can use an LLM to make a decision about which path it's going to go down. If you find that doing just a search isn't working in all situations, sometimes the approach is to have the model pick between doing the search, getting background facts, or something else. It really depends on the specific scenario, but this is something to think about as well: there are definitely lots of situations where a straight-up search may not be what you want in order to answer a question, and then you have to figure out how you're going to chain together the process of answering the question.
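For reference, here is a minimal sketch of that function-calling (tools) idea with the OpenAI Python SDK (v1.x). The tool, a knowledge base search, is hypothetical; the model decides whether to call it or answer directly.

```python
tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search indexed documents for passages relevant to the user's question",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = openai_client.chat.completions.create(
    model="<chat-deployment>",
    messages=[{"role": "user", "content": "Do my company perks cover underwater activities?"}],
    tools=tools,
)
# If the model decided a search is needed, it returns a tool call instead of an answer;
# you then run the real search and send the results back in a follow-up message.
print(response.choices[0].message.tool_calls)
```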
If you look at LangChain, you'll see lots of ideas about different ways of chaining processes together. We did use some of those LangChain techniques and agents and then ended up simplifying and moving away from them, but I think there are lots of ideas there about other approaches. I'd also recommend looking into function calling and considering whether that sort of thing might help you out. But the first thing to do is just to see how it works, then start debugging, look at where the problem points are, try things out, see what works, and document it.

Cool, well, thank you for the questions. This is actually the first time I've used StreamYard on my own, but I wanted to get this talk recorded for other folks who wanted to see it. Oh wow, multiple people are actually here; that's great.

Okay, we had another question: is it possible to add a logger and log each step of the process? That's actually what we do here in the "Thought process" tab. If you like other approaches, you could also use Prompt Flow. Prompt Flow is an Azure tool (I don't have mine open right now), and the idea is a directed acyclic graph for describing your whole RAG process; let me see if I can get a picture of it... yes, you can describe the whole flow, so if you work well with graphs and a WYSIWYG approach, you might like Prompt Flow. It's in Azure AI Studio, and it visualizes things that way. What we do in this particular application is log each of the steps, basically in our own little dict in the Python code. The steps here are: first we get the user query; then we do a step where we make an initial chat completion call to turn the user query into a nice keyword search. That's not something you need to do for RAG (you could skip it), but we like to do it because it tends to get better search results: users tend to make spelling mistakes and put in fluff words, so this generates a nice clean keyword search that we can pass to Azure AI Search. We take that clean search, pass it into Azure AI Search, and these are the search results we get back, the chunks from the documents. Then we have the prompt, the actual thing we send to the API. I have the Vision tab open right now, so you can see where it sends images, but normally we're not sending images, so let's look at this one: we get back the search results, and then this is the full message we send to the API, where we can see the system message, the question, and the attached sources, and then we get back the response, which we can see there. So this is a form of logging; it covers the big steps of our RAG approach. Our code is Python, so you can of course add additional logging if you need it, but this really covers the core things, because if you're going to have quality issues, they'll be at one of these points: did the generated query come out weird, did you get back good results, and what did it look like when you finally sent it to the LLM? Great question, and thank you, Spencer. Cool, okay.
Well, I will end the stream now, but if you do have additional questions, you can put them in the comments of the video. You can also try out the repo, and you can come to the hackathon next week. Thank you for joining!
Info
Channel: Pamela Fox
Views: 13,489
Id: vuOA13Y_Qzk
Length: 64min 54sec (3894 seconds)
Published: Thu Jan 25 2024