Using Your Own Data with Large Language Models (LLMs) aka Making JohnBot!

Captions
Hey everyone. In this video I'm going to walk through creating JohnBot, or, more useful for you, how to use your own data, your own documents, with a large language model like GPT. This is a video I've been thinking about for many months, but it's taken a lot of preparation work to actually be able to create it.

Why does this matter? In a whole bunch of recent videos I've talked about technologies like vectors, embeddings, semantic meaning and semantic search. A typical large language model is trained on a certain amount of knowledge and is then essentially read-only; it doesn't learn new things. So if I want it to generate responses based on knowledge my company has, or that I personally have, I have to give it additional information along with the prompt that it can use to generate the response. That's how all the copilots work. With the Bing copilot, the extra information comes from web searches that get added to my prompt; with Microsoft 365, it uses the Microsoft Graph to get information about my documents or my emails. It's all about getting additional information and sending it to the large language model. What's critical is that the information I send needs to be as relevant as possible, because the more relevant the data I give the model, the more relevant and higher quality its response will be.

So, for fun, I'm going to create JohnBot based on some of the videos I've done on my channel. I want to stress this is just for demonstration purposes; you'd be far better served just using Bing, which has hooks into all the Microsoft docs, or the Azure copilot. This is simply to demonstrate how powerful this can be with your own materials.

For my setup, I created a blob storage account containing a whole set of transcripts from my videos. For a bit of fun, the way I did this: I started with the raw SRT files, which are mostly timestamps, so there's a huge amount of useless information with very few actual words. I wrote a little script that uses GPT: I told it to act like an editor, and for each file I use a regex to strip out the timestamps and then send the text to GPT to turn those endless sequences of characters with no grammar into slightly neater files. I ended up with a set of transcripts for my videos which I could then put into my blob container. The only other thing I did: some were too long, so I wrote another script to split them into parts, because, as we'll see later, different SKUs of Azure AI Search (which I'm going to use) have different limits on how many characters they will read from a single file.

So if you look at my storage account, it's just regular blob storage (not a Data Lake, though it could be), with loads and loads of transcripts in it. I did a couple of years of videos, and that's what took the time, including my virtual mentoring playlist so we can have a bit of fun asking it about that as well.
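For reference, here's a rough sketch of what that clean-up script could look like. It's just an illustration of the approach I described, not the exact script: the file paths, prompt wording, API version and deployment names are placeholders you'd swap for your own.

import os
import re
from openai import AzureOpenAI  # pip install openai

# Placeholder endpoint, key and deployment names - swap in your own values.
client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

# Matches SRT cue numbers and "00:01:02,000 --> 00:01:05,500" timestamp lines.
SRT_NOISE = re.compile(r"^\d+\s*$|^\d{2}:\d{2}:\d{2},\d{3} --> .*$", re.MULTILINE)

def srt_to_transcript(path: str) -> str:
    raw = open(path, encoding="utf-8").read()
    words_only = re.sub(r"\n{2,}", "\n", SRT_NOISE.sub("", raw)).strip()
    # Ask GPT to act as an editor and turn the run-on caption text into clean prose.
    response = client.chat.completions.create(
        model="gpt-4",  # your chat deployment name
        messages=[
            {"role": "system", "content": "You are an editor. Rewrite the caption text "
             "into clean, punctuated paragraphs without changing its meaning."},
            {"role": "user", "content": words_only},
        ],
    )
    return response.choices[0].message.content

for name in os.listdir("srt"):
    if name.endswith(".srt"):
        cleaned = srt_to_transcript(os.path.join("srt", name))
        out_path = os.path.join("transcripts", name.replace(".srt", ".txt"))
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(cleaned)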
Fundamentally, that's all this is: a whole bunch of text files. The only thing I added, to demonstrate some of what you can do, is a little bit of metadata on each blob: a proper document name (rather than the file name) and the original YouTube URL. So there's a little extra information configured on each of the documents, but it's still just a blob container with a set of files in it.

The whole point is that this is my corpus, my knowledge base. Yours could be your company's knowledge base, documentation, help information, whatever you want to use to enhance the capability of the large language model. You always want to be careful here: make sure it is data you should be feeding into a large language model as additional information, so go through all the proper channels in your organization. But this is my demonstration of "I've got extra knowledge the model was not trained on that I want to make available to it."

OK, so what's my next step? I've got all these documents; what I now want to leverage is Azure AI Search, so I'm going to create an instance of Azure AI Search (formerly known as Azure Cognitive Search). There are different SKUs available, and this is where that maximum number of characters it will read from any particular document comes in. If you look at the SKUs, it reads 32,000 characters per document on Free, 64,000 on Basic, 4 million on Standard, and up to 8 million on the larger SKUs.

In my environment I created a Basic instance. The reason I didn't use the Free SKU is that I want to use something called semantic ranking, which I'll explain, and that doesn't come with the free tier, so I needed at least Basic. I'm also going to want to create those vectors, those embeddings, so I need a deployment of an embedding model: in my Azure OpenAI resource I have an ada-002 instance available, and a GPT-4 as well, which obviously we're going to play with. If you were creating AI Search from scratch, you specify a resource group, a name, the pricing tier and how many scale units you want. I picked Basic, just for demonstration purposes, running one replica at about $75 a month; in my demo environment I'm only running it for a few days and then deleting it, so it will probably cost me five bucks in total. I did a previous video all about vectors, embeddings, semantic search and retrieval augmented generation, and I highly recommend watching that because it will make all of this make a lot more sense.

OK, great, so I have an Azure AI Search instance. One of the first things I'm going to do is enable managed identity for it, which means the resource has its own identity and I don't have to worry about tokens or secrets in my code. I'm going to give that managed identity the Storage Blob Data Reader role on the storage account so it can read the blobs.
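As a quick aside on that metadata from a moment ago: here's roughly how those two values could be set when uploading the transcripts. It's a sketch only, with the account, container, file name and URL as placeholders rather than my actual values.

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob azure-identity

# Account and container names are placeholders.
service = BlobServiceClient(
    account_url="https://<yourstorageaccount>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("transcripts")

with open("transcripts/expressroute-deep-dive.txt", "rb") as data:
    container.upload_blob(
        name="expressroute-deep-dive.txt",
        data=data,
        overwrite=True,
        # Blob metadata: a friendly document name and the original video URL,
        # which the indexer can pull into matching index fields.
        metadata={
            "doc_name": "ExpressRoute Deep Dive",
            "video_url": "https://youtu.be/<video-id>",
        },
    )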
But remember what else is going on: I've also got my Azure OpenAI instance, and I'm going to give that managed identity a role up there as well, because it's that Azure OpenAI instance that has the ada-002 model I want to use for my embeddings. The Azure AI Search service needs to be able to read from my blob storage and to call that ada-002 model to create the embeddings for the data I'm going to pull into it. This avoids me having any tokens or anything else embedded in any of this. And you can see it on my instance: I turned on the system-assigned managed identity, and if I look at the roles it has, exactly as I described, it has Storage Blob Data Reader on the storage account and the Cognitive Services User role on my Azure OpenAI service, which means it can get the API keys it needs when it actually wants to talk to it. So I'm all set up with my permissions and nowhere am I manually storing a key.

Now I want to import the data and do the vectorization, because I want to create the vectors. Remember, the whole point of a vector, the embedding, is that it represents the meaning of my data rather than the exact words used, and that's really the whole power. We think of things in three dimensions, so we could have a vector, a direction, with three dimensions, x, y, z; this is using 1,536 dimensions. It's that massive number of dimensions that represents the semantic meaning of whatever is sent to it to create the embedding. We're going to read in all of this data, and we'll do it at a certain interval, so it's not just a one-off; we can configure how often, and I'll come back to that interval.

This is super easy to do. Now that the roles are in place, I just go to the overview, and on the main page there's a very friendly "Import and vectorize data" option. I select my source, the storage account and container I created (I don't have to specify a folder), and I authenticate using managed identity because I set that up already. Because I do want the vectorization, this is where I have to tell it where there's an ada-002 embedding model to create the vectors, again using the system-assigned managed identity, and yes, I acknowledge there's going to be a cost, because every time it creates an embedding it's calling the service and that service charges me based on consumption.

Notice I could also extract text from images. I'm not doing that here, but I can absolutely hook into, for example, optical character recognition. One of the things this could do is crack open a PDF file; sometimes PDFs are actually images of text, someone took a picture of text and put it in a PDF, and with this it would take the image, send it to a different model, do the optical character recognition and index that as well. It's really powerful what it can interact with, but I don't need that, because I know mine are all text files. I'm going to use the semantic ranker, which we're going to talk about later.
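By the way, that call the indexer makes to ada-002 for each piece of content is just an embeddings call. Here's roughly what it looks like if you make the same call yourself; the endpoint, key handling and deployment name are placeholders (the search service itself authenticates with its managed identity rather than a key).

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",
    api_key="<key-or-use-managed-identity-in-a-real-service>",
    api_version="2024-02-01",
)

# "text-embedding-ada-002" is a placeholder for whatever you named your deployment.
result = client.embeddings.create(
    model="text-embedding-ada-002",
    input="ExpressRoute provides private connectivity between on-premises and Azure.",
)

vector = result.data[0].embedding
print(len(vector))  # 1536 dimensions for ada-002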
Then there's the schedule. I can customize this schedule, I have a lot of flexibility here, and one question you may have is what you should pick. Most of the time you're just going to pick 5 minutes, so every five minutes it goes and rechecks. So why are there any options other than 5 minutes? Realize that as part of my configuration I picked a certain number of scale units, and in my configuration I just have one. (Let's just finish the wizard, since there aren't many fields left: we have the schedule and the vector/index name, which I'd already done, and then it goes and creates everything.) If you look at the scale, you have a certain number of search units; I just have one because it's just me playing around and I don't need resiliency or anything like that. Those units give the service a certain amount of compute capability, and that capability gets used as part of the indexing process, but those exact same units get used when systems or users are performing searches, so you have the potential for a conflict. If I only have new data coming in a couple of documents at a time, the hit from indexing is tiny, nothing I'd worry about, so I could absolutely have it running every five minutes. But if a huge batch job dumps a lot of documents in all at once, indexing might consume that capacity for many minutes, and if I were trying to search during that time it could interfere. So if I didn't need the data instantly available, I might run it on a schedule during quiet hours at night so it doesn't interfere with the other uses. That's the reason you might not pick the smallest 5-minute interval; outside of that it's really not a problem at all to just use the minimum.

OK, great, so I select the vector configuration as I do the indexing, and it goes ahead and creates the index and the indexer. I'm going to really try to show everything it's doing. What it created was the indexer, which actually goes and does the work; I can see all the executions it's done in the past and all of its settings, and here I can see the schedule, which I've got set to 1 hour. And then we can see the index. The index it created has a few interesting things: a parent document ID, a chunk ID, the chunk itself, the title of the chunk, and the vector, which is where you can see those 1,536 dimensions. I then went ahead and added two extra fields, which aren't normally there: doc name and video URL. I added those because if the indexer sees metadata on the blob and I have a field for it, it will automatically bring it in, and I want to use those values as part of the data I send to my large language model. You need to make sure you set them as retrievable, filterable and searchable, which I've done.
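One quick aside on that schedule before we move on: the wizard sets it up for you, but here's roughly what changing the indexer's schedule looks like through the REST API. The service name, indexer name, key and API version are placeholder assumptions.

import requests

search_service = "https://<your-search-service>.search.windows.net"
api_version = "2023-11-01"          # assumed GA version; adjust to what you use
headers = {"api-key": "<admin-key>", "Content-Type": "application/json"}

indexer_name = "<your-indexer>"     # the indexer the wizard created

# Fetch the existing indexer, change only its schedule, and push it back.
indexer = requests.get(
    f"{search_service}/indexers/{indexer_name}?api-version={api_version}",
    headers=headers,
).json()

indexer["schedule"] = {"interval": "PT1H"}  # run hourly instead of every 5 minutes

requests.put(
    f"{search_service}/indexers/{indexer_name}?api-version={api_version}",
    headers=headers,
    json=indexer,
)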
OK, so why is there this chunk thing, why do I have chunk IDs, why isn't it just the document? We really have to think about the source data for a second. Some of my videos are many hours long and cover huge numbers of different aspects of technology, maybe different technologies altogether, and our goal is to create a vector that represents the semantic meaning. If I take that huge body of information, the semantic meaning is going to be way too generic to be useful, and which bit is even relevant? If I then tried to pull up the whole document, finding the relevant piece of information wouldn't be very useful either. So what it does is take a document, say one great big document, and break it into much smaller chunks. Imagine the first chunk is here, that's chunk one, and it's going to create a chunk two. But think about it: if I'm just splitting at a certain number of characters, maybe I cut off mid-sentence and miss some crucial piece of the meaning. So chunk two actually overlaps with chunk one a little bit, it takes the end of chunk one, and chunk three, you can guess, overlaps the end of chunk two, so each chunk takes a bit of the previous one every single time.

We can see that in the configuration. There's actually a skill, we have this concept of skill sets, which it creates for us automatically, and in the skill set there's a skill to chunk documents and generate the embeddings. In its configuration we can see the maximum page length is 2,000 characters with an overlap of 500, so it overlaps by 25%, and it's doing a text split mode of "pages". So it's breaking the document up into these chunks.

The only other thing, and I'll show it while I'm here, is this idea of projections: what gets projected from the main document into those child chunks. Remember I created those two fields; they belong to the parent, not the chunk, so what I did is add my own projection to push those two new fields from the document into each child chunk. That's the only extra thing I had to do when I added those fields, getting that metadata from the blob into my chunks as well, and we'll actually see that in a second.

So that's exactly what it's doing: splitting into chunks, and then for each chunk it can send the chunk to the embedding model and get back the vector, because this is an embedding model, it generates the embedding vector that represents the semantic meaning. So in my Azure AI Search I have my own index, which we just saw, and the index is made up of a whole bunch of fields; the key ones I'll focus on are the chunk, which stores the text of the chunk (remember, if this was an image it went and did the OCR, and if it was audio or video it does whatever is needed to get the text that represents it), the vector, and those other fields. It goes chunk by chunk: give me the vector that represents that part of the document.
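Just to make that chunk-and-overlap idea concrete, here's a toy version of the splitting. This is only an illustration of the concept, not the actual split skill the service uses.

def chunk_text(text: str, size: int = 2000, overlap: int = 500) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the last
    `overlap` characters of the previous one, so meaning that straddles a
    boundary isn't lost."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap          # step forward less than a full chunk
    return chunks

document = open("transcripts/expressroute-deep-dive.txt", encoding="utf-8").read()
for i, chunk in enumerate(chunk_text(document)):
    print(i, len(chunk))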
And I can see exactly that if I go and look at my index: there's a Search Explorer, and I'll just type a query so you can see it, "what is expressroute". The result is going to be really big because it shows you the embedding, and that embedding is 1,536 dimensions, which is kind of awkward to look at, so one of the things I can do is hide the vector values. You can see it's using the semantic ranker and doing a vector search. The interesting thing is how many chunks I have: the document count is my number of chunks, and while I only had fewer than 200 text files in my blob storage account, it has chunked those up into over 4,000 chunks. If we want to look at the makeup of a chunk, I can go to the JSON view, tell it to return 500 results and search, and in what it gives back you'll notice the chunk ID, the chunk's parent ID, the chunk itself (the text), and my two extra values, so it is adding those to every single one of my chunks. I can see all of that information right there, and it's also got these nice little answers at the top.

Again, I'm dealing just with text in my example, but if it was a PDF, a Word document or JSON, it would just crack that open and use it as well, and if there are images or audio or video I can opt to hook into other AI services, like OCR for images, and it will extract that too. You do have to point it at your own instance of those services for billing purposes; it's actually using its own instance to do the work, you just point it at yours because it has to bill somebody.

So we saw me doing this search, and one of the things you saw me select was that semantic ranker option. If we go back to the query view, I'm using the semantic ranker, so what's going on, why am I doing this, what is it really giving me? If you take away one thing from this entire video, this is probably the most important part. We have the Azure OpenAI service, where I have the embedding model, but what we're also going to focus on is that I'm also running a large language model, specifically GPT-4. When we interact with that GPT model, what do we actually do? We send it a prompt, and the way retrieval augmented generation works is that we send it some additional information, additional knowledge, along with our question that will help it answer that question. We want to send it the very best information. There's the whole "garbage in, garbage out"; well, I want gold in, gold out (I was going to say pizza in, but you don't get pizza out biologically, so we'll stick with gold in, gold out). If we send information that's not the most specific and relevant to what we're trying to do, the LLM still sees that as input and will consider it. And there are a number of factors to weigh. The large language model has an input token limit; I cannot send it infinite amounts of information. If you look at the different models, the brand-newest one takes 128,000 tokens in (it can only return 4,096, but 128,000 in is huge); the older ones were not that big. So I have to be mindful of how much data I'm trying to send it.
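Going back to that Search Explorer query for a second, here's roughly what the same call looks like if you make it yourself against the index. The service name, index name, key and API version are placeholders, and the field names should match whatever your own index definition shows.

import requests

search_service = "https://<your-search-service>.search.windows.net"
api_version = "2023-11-01"
headers = {"api-key": "<query-key>", "Content-Type": "application/json"}

body = {
    "search": "what is expressroute",
    "top": 5,
    # Leave the 1536-dimension vector out of the response - it's awkward to look at.
    "select": "chunk_id,parent_id,title,chunk,doc_name,video_url",
}

results = requests.post(
    f"{search_service}/indexes/<your-index>/docs/search?api-version={api_version}",
    headers=headers,
    json=body,
).json()

for doc in results["value"]:
    print(doc["doc_name"], doc["video_url"])
    print(doc["chunk"][:200], "...")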
Coming back to that token limit: there's a cap, I can't just send infinite amounts of information, so I want to make sure I'm sending the best information. But even if I can send huge amounts, even if some of it's not as good as it could be, every token I send costs me money. It's not just that I pay for the inference out; I pay for the number of tokens I send in, I'm paying for the size of the prompt. So why on earth would I want to send it data that's not that relevant? It could pollute the answer, which is terrible, and I'm paying more money for every bit of data I send. I only want to send the most relevant information: for the quality of the response, to make sure it fits within the limits, and so I'm not paying for more than I actually need to. The whole point is I want the gold, the very best information possible.

And if I think about the very best information possible, it's not just a vector search. Vector search is fantastic when I want the semantic meaning of the data, but what about when I'm searching for a product name, a particular SKU, something that has to be super specific? That's where keyword, lexical search is way better; I'll get better results when I'm trying to find an exact match. So this is where we get into hybrid search: I want to do both. You saw when I searched "what is expressroute" there were a whole bunch of different things going on, so let's talk about them, because I'm doing a hybrid search, a semantic search, a ranking, a semantic ranking, lots of different things happening here.

So here's my index, and I'm a user, let's say I'm over here, sending my service a query, "what is expressroute". Multiple things kick off in parallel. The first thing that happens is a vector search. Remember the way that works: the term I type in gets sent to the embedding model, so it sends my question, "what is expressroute", off to get a vector of my question, and then it uses mathematics, a cosine similarity, think of it as finding the smallest angle between my question's vector and the vectors representing the possible semantic meanings of the data, to find the closest match, the nearest neighbours, the best set of matches. The vector search obviously operates against the vector field and finds the matches with the closest semantic meaning, so I get a set of matches based on meaning; it's not that fussed about the specific words.
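Before we get to the lexical side, here's a tiny illustration of that cosine similarity idea, using toy three-dimensional vectors instead of the real 1,536-dimensional ones.

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 means the vectors point the same way (closest semantic meaning),
    0.0 means they are unrelated (90 degrees apart)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors; the real index compares 1536-dimensional ones.
question = [0.9, 0.1, 0.3]
chunk_about_expressroute = [0.8, 0.2, 0.35]
chunk_about_cooking = [0.1, 0.9, 0.05]

print(cosine_similarity(question, chunk_about_expressroute))  # close to 1 - nearest neighbour
print(cosine_similarity(question, chunk_about_cooking))       # much lower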
In parallel to that, it runs a lexical, text-based search, and what it uses for this is something called BM25, which operates against the chunk text that has been indexed and gets its own set of results. I assumed BM25 was some really advanced acronym, but BM is just "best match", it's not very modest, and the 25 is because it took twenty-five iterations to get right: there was a BM1, BM2, BM3 and so on, and they got it right on the twenty-fifth attempt. It's better than traditional text search. Imagine I search for "expressroute" and one document contains it three times and another contains it four times; you'd think the one with four is better, but if that second document is 100 times the length, then actually the one that matched three times in a much shorter document is probably more relevant, so BM25 considers the size of the document compared to the average. Additionally, one thing you sometimes see is term stuffing, where someone just repeats a term over and over again; what BM25 does very nicely is apply diminishing returns, so the first few occurrences of a term increase its relevance, but keep repeating it and the gain gets less and less until it's basically flat. So it also handles "I just have the same term over and over again", and it gives you the best text-based, lexical search available.

Well then, it has to combine them. You have these two different sets of ranked results, and it brings them together using RRF, Reciprocal Rank Fusion: it fuses them, looking for results that appear high up in both of the lists, so it's bringing them together based on where each result sits in both, and it creates its own ordered list with a new combined search score. We can actually see that: if we look at our search results, within the values we see the RRF score, and that is the combined hybrid score for that particular entry, the resultant set of both the vector search and the BM25 for its overall quality.
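Here's a small sketch of that Reciprocal Rank Fusion idea. The k constant of 60 is the value commonly used in descriptions of RRF, an assumption on my part rather than something pulled from the service.

def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: each document scores 1/(k + rank) in every list
    it appears in, so documents ranked high in BOTH lists float to the top."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_results = ["chunk_12", "chunk_40", "chunk_7", "chunk_3"]   # ranked by semantic meaning
bm25_results   = ["chunk_40", "chunk_99", "chunk_12", "chunk_55"] # ranked by keyword match

for doc_id, score in rrf_merge([vector_results, bm25_results]):
    print(doc_id, round(score, 4))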
Now you could think, OK, I'll stop right there, I've got them ranked, I've got this list, I'll send all of these values off to my prompt. But consider: how good are they really compared to the question that was asked? Remember, the initial query was a retrieval, find me the information; now I want to ask how relevant each result really is, a focused analysis against what was actually asked. This is where we run the final component (what colour should we use? we'll use orange): the semantic ranker. It's called a ranker; I'd argue it's really a re-ranker, because the results are already ranked, but what it does is move things around based on how relevant each one is to the actual question that was asked. What's interesting is that this is technology, "borrowed" may be the right word, that's used by Bing; it's the same type of technology, shaped with a more enterprise focus. It only looks at 50 results, 50 max, so I always want to make sure I pass it at least 50: when I do the vector search, that k, how many to respond with, I want at least 50 so I'm not missing out on where the potential best result might be when it does that re-ranking. So it takes up to 50 results, re-ranks them, and gives each one a relevance score, my ranker score, from 0 to 4, with 4 being the very best, and it sends that back in the response.

One of the nice things then, remembering our original goal of only sending the really good stuff to the prompt, is that I can say I want a minimum score of 3. I don't care if there are 50 results that all have some relevance; if it's less than 3 on that ranking, I don't want it. So think back to our goal of only sending the highest quality. What has the semantic ranker done? It took the hybrid-search merge, the BM25 lexical text search for exact matches on specific product names together with the vector search for semantic meaning, RRF merged them based on where things ranked in both lists, then the semantic ranker did another analysis against relevance to the question and gave each result a score, and now I only send results scoring 3 or above. It doesn't matter how much I could send; I'm only sending the very best, most relevant data, giving the model the best information while making sure it fits within its limits, and I'm not paying for more than I need. You get a certain number of semantic ranker queries free, I think it's 1,000 per month, and then you pay for it, but if you consider the cost of tokens, doing the semantic ranking and only sending the best data means far fewer input tokens, so overall you'll spend less money. It's just huge value.

We can see this in the results: the ranker score on this one is 3.13, so I like that, it's super relevant; this one is 3.08, I like that one as well; 3.02, that's good; now we get into the 2s and maybe I don't want those. What's really cool is the detail: it also gives me a textual highlight, the section of the text it thinks is the most relevant, a short piece that's part of the much bigger chunk, so it shows me the whole chunk but also the particular part of the document it thinks matters most. It's a really nice technology. Again, if you take one thing away from this video, I'd take this bit: the semantic ranker giving me a relevance score lets me care only about the stuff that's very high quality, and that's where I would actually spend a lot of the time.

OK, so now, without further ado, how do we create JohnBot? That's obviously what everyone is excited for. We've got the index, we now know how the searching works, and I want to bring it all together. Now, when we use something like this, we don't just send our prompt directly to the model; as part of the whole solution there's going to be some kind of orchestrator to do the smart things. For my demonstration orchestrator I'm going to use the Azure OpenAI playground. A proper app could use Semantic Kernel or prompt flow to enhance it even further, but I'm just going to use the playground so you can see all of these different bits in action.
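Before we jump to the playground, here's a rough sketch of that minimum-score idea if you query the index yourself: ask for semantic ranking and keep only results the ranker scored 3 or above. Service, index and semantic configuration names are placeholders.

import requests

search_service = "https://<your-search-service>.search.windows.net"
api_version = "2023-11-01"
headers = {"api-key": "<query-key>", "Content-Type": "application/json"}

body = {
    "search": "what is expressroute",
    "queryType": "semantic",
    "semanticConfiguration": "<your-semantic-config>",  # created by the wizard
    "top": 50,  # the reranker only looks at up to 50 results
    "select": "doc_name,video_url,chunk",
}

results = requests.post(
    f"{search_service}/indexes/<your-index>/docs/search?api-version={api_version}",
    headers=headers,
    json=body,
).json()

# Keep only the chunks the semantic ranker scored 3 or above (the scale is 0-4).
gold = [doc for doc in results["value"] if doc["@search.rerankerScore"] >= 3]
for doc in gold:
    print(round(doc["@search.rerankerScore"], 2), doc["doc_name"])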
So let's do this. We know the search is ready, fantastic, and you even noticed there were some answers at the start of it with a confidence score. I've gone ahead and created an Azure OpenAI instance, and if you select it you can open Azure OpenAI Studio. Looking at my deployments, I have the embedding model I showed you already and a GPT-4, specifically GPT-4 1106, which is GPT-4 Turbo with the 128,000-token limit. Now I can go to the Chat playground.

In this playground I can do a lot of different things. I could change the system message that gets sent along with the user message to the LLM; I'm going to leave that as the default. Notice it provides things like memory, so I can ask a question and then ask another based on the previous one, and it's tracking the last 10 messages. I can experiment with the temperature, its randomness, the maximum size of a response, and see the current token counts; I can tune all of these different things in here. So I have a lot of flexibility in everything I can do, but I'm not really going to play around with any of that.

What I am going to play around with right now is adding my own data, so I say add a data source. Notice it has a certain amount of native capability; it could hook into blob storage, Cosmos DB for MongoDB vCore, or URLs, but I'm going to use Azure AI Search because I want some of that richer functionality like the semantic ranker. I tell it the particular index I want it to use, and I add vector search; by adding vector search I have to give Azure OpenAI permission to use an embedding model, and I acknowledge that's going to cost money as well, to run an embedding on whatever I type in. Then I map the fields: for the title I want to use my doc name, the field I added through that projection to the chunk, rather than the file name; the content is the chunk; the URL doesn't do anything in the playground today, but in the future I could then link to the video this came from; and the vector is the vector, there's no choice there. For the search type I want hybrid and semantic, remember that's the very best, I'm doing the BM25, I'm doing the vector, I'm doing the RRF, and then the semantic ranking on top of that. You can opt not to do that, but I want the very best. It created the semantic configuration for me, and once again, yes, I know you're going to bill me money, go ahead and create.

Once that's done I can tweak a few things. Firstly, it's only going to respond based on my data, so it's not allowed to hallucinate or make use of other knowledge it might have; it will only respond based on the knowledge it gets from the searches against Azure AI Search and what's sent back to it. We'll see how confident it becomes when it's just based on stuff I've said. There's also a strictness setting and a setting for how many documents to retrieve. I'm going to make it even stricter and set strictness to 4; this scale is 1 to 5, so think of it against the ranker's 0-to-4 scale, where this 4 roughly corresponds to a ranker score of 3, so I'm being super strict. And I'm going to let it send back up to 10 documents.
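A quick aside before we try it: those playground knobs from a minute ago, the system message, the memory, the temperature and the maximum response size, map onto ordinary chat-completion parameters. Here's a sketch, with the endpoint, key and deployment names as placeholders.

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",
    api_key="<key>",
    api_version="2024-02-01",
)

# The playground knobs map to plain chat-completion parameters.
history = [
    {"role": "system", "content": "You are an AI assistant that helps people find information."},
    {"role": "user", "content": "What is ExpressRoute?"},
]

response = client.chat.completions.create(
    model="gpt-4",        # your GPT-4 Turbo (1106) deployment name
    messages=history,      # past messages = the memory the playground keeps
    temperature=0.7,       # randomness
    max_tokens=800,        # maximum size of a response
)

history.append({"role": "assistant", "content": response.choices[0].message.content})
print(response.choices[0].message.content)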
It doesn't really matter how many documents it's allowed to get back, because it will only send results whose score is above that threshold. So now I've configured those advanced settings and tied this all together, and we can just start using it: "What is ExpressRoute?"

While I'm asking that question and JohnBot is thinking, what's actually happening? The whole point of the orchestrator is that it needs to go and get extra information to do that retrieval augmented generation, because this model knows nothing about my knowledge, my company's in this case, my stuff, and for you it would be your company's information. So the orchestrator takes the query and, as part of this process, first goes over to the embedding model: "this is what they're asking", and it gets back the vector for the query so it can now do a search. And because we added the Azure AI Search service as a data source, that's the whole point of what we did, the orchestrator can now send the query and its vector over to the search, which goes through all of that work and gives back a list of the extra data. So now we have the prompt: the user's prompt, what the user typed in, plus the system prompt that gets added to it, plus that high-quality data we specified (remember, we essentially asked for a quality of 3 out of 4, so it only gives back the very best stuff). That all goes to the model, which can now do something with it and generate the response back to the user who's sitting over here somewhere actually asking the question and wanting an answer. So that's what's going on in the system: we hooked in a data source, the orchestrator takes my query, gets the vector for it, searches against the data, gets the responses back, sends them to the large language model, and gets an answer back based on that knowledge.

And hopefully it said something fairly useful: "ExpressRoute is a service provided by Microsoft", and so on, and notice it has a whole bunch of references, five of them, and I can see them; because each reference points to the chunk, I can see the actual part it's referencing and go and read more about it. So that's actually working, and we can also see it has a memory. That question was about ExpressRoute; now I ask "how does it integrate with a VNet?" I'm not saying the word ExpressRoute, I'm just asking how "it" integrates with a virtual network, but this is where the orchestrator comes in, because once again it's thinking about my intent. It has that memory of 10 messages, so it can go back and see that I previously asked about ExpressRoute, and it effectively changes my question to "how does ExpressRoute integrate with a virtual network?", taking that history and bringing it back into whatever I'm interacting about.
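Pulling that flow together, here's a condensed sketch of what an orchestrator like this is doing under the covers. It's simplified (a keyword query with semantic reranking; a full hybrid call would also pass an embedding of the question as a vector query), all the names are placeholders, and in practice the playground's own "add your data" integration, Semantic Kernel or prompt flow do this for you.

import requests
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",
    api_key="<key>",
    api_version="2024-02-01",
)
search_url = ("https://<your-search-service>.search.windows.net"
              "/indexes/<your-index>/docs/search?api-version=2023-11-01")
search_headers = {"api-key": "<query-key>", "Content-Type": "application/json"}

def ask_johnbot(question: str) -> str:
    # 1. Retrieve: keyword search with semantic reranking against Azure AI Search.
    hits = requests.post(search_url, headers=search_headers, json={
        "search": question,
        "queryType": "semantic",
        "semanticConfiguration": "<your-semantic-config>",
        "top": 10,
    }).json()["value"]

    # 2. Keep only high-quality chunks (ranker score 3+), mirroring the strictness setting.
    context = "\n\n".join(h["chunk"] for h in hits if h["@search.rerankerScore"] >= 3)

    # 3. Augment: system prompt + retrieved knowledge + the user's question.
    response = client.chat.completions.create(
        model="gpt-4",  # your GPT-4 deployment name
        messages=[
            {"role": "system", "content": "Answer ONLY from the provided sources. "
             "If the answer is not in the sources, say you don't know.\n\nSources:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask_johnbot("What is ExpressRoute?"))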
So if we go back and look at the answer: "ExpressRoute integrates with a virtual network through a process known as private peering", and again it's got a whole bunch of references from different things I've said in the past, five of them this time, including my AZ-104 study cram, and I can go and see the exact part of each one it's talking about. Then, because that response mentioned a gateway, I ask "is there any way for data to bypass the gateway?", and yes, there is, a feature called FastPath, and it tells me the gateway SKUs I would need to use FastPath; I was going to say, did it miss one? but no, Ultra Performance is there. And again it's showing me the references, so I know it's not hallucinating, and I could go and select them to see the detail. It's answering all of that based on exactly what's going on.

If I go back up and look at the parameters panel again (it resets because I just typed something), it shows me the tokens it's using as it tracks that memory. All right, let's try one from my virtual mentoring content: "How do I build discipline?" What we should see is the input token progress going up as it tracks the conversation. "Start now, set clear goals, be adaptable", thank you JohnBot, "balance enthusiasm with discipline", there we go, and again it's got the references, linked to the data that originally came from that blob, that was turned into chunks and went through all those different processes, and I can see how it's tracking those token counts to maintain the history of what it's doing. If I look at the raw JSON, notice it shows me the system part, the prompt I could modify on the left, the user part, and then the actual assistant responses as well. I could view the code, and it shows me how to hook into this from a bunch of different languages; I could even take this and deploy it to a web app or a Power Virtual Agent, or get more sophisticated and start building in Semantic Kernel and prompt flow, which I talked about and which hook into here.

This also has the safety features; I didn't really talk about that, but notice you have content filters. This is where I could start to build in custom protections on both the prompt and the response, the completion, across many different areas and sensitivities, so I can block certain things and really have a lot of power around that safe, responsible AI experience. All of that is available to me there.
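One more sketch on that follow-up handling, where "how does it integrate with a VNet?" effectively became "how does ExpressRoute integrate with a virtual network?". One way an orchestrator can do that is a small rewriting step like this; it's just an illustration of the idea, not how the playground implements it internally.

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",
    api_key="<key>",
    api_version="2024-02-01",
)

def rewrite_followup(history: list[dict], followup: str) -> str:
    """Turn a follow-up like 'how does it integrate with a VNet?' into a
    standalone search query, resolving pronouns from the conversation so far."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    response = client.chat.completions.create(
        model="gpt-4",  # your chat deployment name
        messages=[
            {"role": "system", "content": "Rewrite the user's follow-up question as a "
             "single standalone search query, resolving pronouns using the conversation."},
            {"role": "user", "content": f"Conversation:\n{transcript}\n\nFollow-up: {followup}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

history = [
    {"role": "user", "content": "What is ExpressRoute?"},
    {"role": "assistant", "content": "ExpressRoute is a private connectivity service..."},
]
print(rewrite_followup(history, "How does it integrate with a VNet?"))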
And I guess that was really my core goal for all of this. We went through a lot of different things, but if you boil it down to what I actually did, I didn't do very much; the technology was doing most of the work for me. The goal is about getting the highest quality additional, augmented information to my large language model. In my case I had it in a blob; Azure AI Search did all the work for me by breaking it into chunks, getting those vectors, and then giving me the hybrid search, so I get the best of both worlds, both the semantic meaning and the exact terms that may be relevant, bringing those together using RRF, and then the semantic ranker scores how relevant each result is to what I asked, so I can care only about the things above a certain quality. That makes sure I'm only sending the model the bare minimum that is the best, so I'm staying within the token limits, I'm not sending information that might actually decrease the quality of its response because it's not as relevant, and since I'm paying for the input tokens, I pay the minimum amount while maintaining the highest quality. That's what the semantic ranking is doing. And then you saw how easy it was to just hook in my own data source, restrict it to only my data, set the score I want, and you end up with JohnBot.

I hope that was useful. Again, the point is we don't retrain models on our knowledge; that's pointless, knowledge changes too frequently. There are some basic types of retraining we might do when we want to change how a model works and how it responds, but for knowledge, retrieval augmented generation is the best way, and that's exactly what we did, we used our own knowledge with it. Don't forget my caveat at the start: make sure you're not using knowledge you shouldn't be using, work with your security team, your company, your data team, and make sure we're not exposing things we shouldn't be exposing. But when I have content that is the right content to use to augment the model, this is so powerful, and I get the very best results in the most optimized way. As always, until the next video, take care.
Info
Channel: John Savill's Technical Training
Views: 15,707
Keywords: azure, azure cloud, microsoft azure, microsoft, cloud, artificial intelligence, AI, LLM, generative AI, RAG, azure ai search, RRG, BM25, GPT, OpenAI, hybrid search, semantic ranking
Id: D8N44J5-6TM
Length: 58min 11sec (3491 seconds)
Published: Mon Dec 11 2023