High-performance RAG with LlamaIndex

Video Statistics and Information

Captions
[Music] Hi everyone, and welcome to High-Performance RAG with LlamaIndex. My name is Greg Loughnane and I'm the founder and CEO of AI Makerspace. Thanks for taking the time to join us for this event today. It's 2 p.m. in Dayton, Ohio — where are you tuning in from today? We're so happy to have folks in our community joining us from all over the world.

During today's event you'll learn about RAG systems, how to use LlamaIndex to build them, how to evaluate the retrieval aspect of RAG, and even how to improve retrieval. If you hear anything during today's event that prompts a question, please follow the Slido link in the description box on the YouTube page; we'll do our best to answer the most upvoted questions during the Q&A portion of today's event.

Without further ado, I'm excited to welcome my good friend Chris Alexiuk to the stage. We'll be working as a team, as usual, to deliver today's lesson. Chris is the CTO of AI Makerspace and a founding machine learning engineer at Ox. He's an experienced online instructor, curriculum developer, and YouTube creator, and he embodies AI Makerspace's build-ship-share ethos. I couldn't be more pumped to kick it with him today. Chris, you ready to build some serious llama indexes today, my man? "I am ready to index some llamas, Greg, absolutely." Yes — well, we'll have you back in just a few minutes; we've got a little bit of setup today.

To make sure we're all teed up to build everything we need in order to talk about high-performance RAG, we first need to talk about RAG itself — we need to be super clear on what we're turning into a high-performance system and which aspects of RAG we're really going to focus on. We're going to look at LlamaIndex, the data framework, and its core constructs. We're also going to build a veterinary camelid index, which will have some llamas in it, and we'll use this particular application to show you how to fine-tune embeddings and exactly how to evaluate retrieval — a really evolving space within building with LLMs. Then finally we'll wrap up and have some Q&A.

So first: why RAG? Let's start from the top. The big reason is that hallucinations are a big deal. When we get confident responses from LLMs that are clearly false, that's not good — especially if we're doing things that get shown to our users, or to our customers rather. It's better if we can fact-check everything an LLM says against our own documents; we really want a keen eye looking at everything that goes into and comes out of our LLMs, especially when we're going to use them in business contexts.

This is where RAG comes in, because retrieval augmented generation is exactly about that: fact-checking what comes out of an LLM preemptively by putting references into the LLM. We use the retrieval process to find references, then we add those references to our prompts before we do our generations. If we do that retrieval correctly and augment the prompts with the best possible information, we get improved answers.

This is super important not just in general, but especially in really specialized domains — domains with a lot of jargon, a lot of mumbo jumbo that people use all the time, words that don't really mean anything elsewhere. If you talk to lawyers, if you talk to doctors, if you work in a government and you're dealing with acronym soup all the time — there are a lot of domains where you really need to learn a special language.
This is as true for humans as it is for LLMs, and in order to align with our application and context, we need to make sure those LLMs can really interact with that particular domain. Although it's not necessarily just the LLM that has to do this — it's the RAG system, because the RAG system, again, is about answering questions against documents, and if those documents have specialized language, we want to make sure we can look through them and understand that specialized language before we put anything into a very powerful large language model.

RAG can be broken down into two primary pieces. The first piece is retrieval: this is when you're actually using vectors to do the retrieval. The second piece is where you augment the prompt: you take all the information you've collected and put it into the context window of the LLM — this is the in-context learning piece. We're going to focus today on the retrieval aspect.

When we talk about retrieval, there are really only three core pieces: we ask a question, we search a database full of stuff and look for things that are similar to the question we asked, and we return that stuff and put it into our LLM. To do this today, the database we're using is often a vector database — or, more generally, we'll call that vector database an index, since there are many different types of indexes and a vector database is only one of them. When we build an index — when we build a vector DB — we split our documents into chunks, we create embeddings for each of those chunks by putting them through an embedding model, and then we store those embeddings within our index, within our vector store. That's what lets us look within our vector store as we ask a question, and this is where the retrieval aspect of RAG comes in, and where retrievers in any framework come in: we ask a question, we convert that question to a vector, we look for similar vectors within our vector database, and we return that context before we put it into our prompt.

So when we talk about dense vector retrieval: when we ask a question, it's a vector; when we search a database, it's a vector database; when we look for similarity, we're looking at vector similarity; but when we return the stuff we need, it's natural language again. This retrieval piece goes in, finds things, and returns them in natural language, just the way we need in order to interact with the LLM.

As an overview of the complete process, so we understand which aspects we're trying to advance in today's lesson: we have a query, we send it to the embedding model, and we look for similar pieces of information within our vector database. We take our query and set up our prompt template so it's ready to go into the LLM as soon as we find all of that similar information; we rank-order it and put it into the context we're adding to our prompt. Finally, we stuff it all into the LLM and get our answer. So the two pieces we're talking about here are dense vector retrieval on the one hand and in-context learning on the other. Again, we're focused on retrieval today, and on building out a really powerful index and retrieval process.
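To make that dense-vector-retrieval loop concrete, here is a minimal sketch (not from the event's notebook) that embeds a question and a few text chunks with a sentence-transformers model and returns the most similar chunk by cosine similarity. The model name and example strings are only assumptions for illustration.

```python
# Minimal dense-vector retrieval: embed chunks, embed the query,
# return the chunk whose embedding is most similar to the query's.
from sentence_transformers import SentenceTransformer, util

chunks = [
    "Llamas and alpacas are domesticated South American camelids.",
    "Keratin-associated proteins influence wool fiber quality.",
    "Vector databases store embeddings for similarity search.",
]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # any embedding model works here
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

query = "What affects the quality of llama wool?"
query_vector = model.encode(query, normalize_embeddings=True)

# Cosine similarity between the query vector and every chunk vector
scores = util.cos_sim(query_vector, chunk_vectors)[0]
best = int(scores.argmax())
print(f"Retrieved context ({float(scores[best]):.2f}): {chunks[best]}")
```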
This is where LlamaIndex comes in: it's a great tool for building very powerful indexes and doing retrieval very well. In fact, LlamaIndex is all about the data. It's not an everything framework — it's a data framework — and it's built on the idea that if you're any reasonably sized company, you're dealing with lots of private or domain-specific data that lives in different types of databases, in PDF files, in CSV file folders, and in many different places. All of these disparate data sources need to come together into one place that we can work with in a very easy way through natural language. That's what everybody wants, and that particular problem is what LlamaIndex is designed to tackle.

When we talk about LlamaIndex, there's one piece of key terminology we need to get clear on: the idea of documents. In a natural-language-processing sense we sometimes say "documents" and actually mean sentence-level information, but in LlamaIndex that sentence-level information is called a node. Nodes are the NLP-level document. So rather than having to say "source documents" versus "documents", as you often do in NLP, in LlamaIndex we reserve the word "documents" for the PDF-level document, and nodes are the important construct. Nodes are, in fact, a first-class citizen — LlamaIndex will tell you that a node is simply a chunk of a source document. They inherit metadata, they're able to track where they came from, and the way you interact with nodes is very similar to the way you think about interacting with vectors in a vector store. The nodes are what get stored within the index, the vector store; when you ask a question, you return the k most similar nodes, all of those nodes are passed through a response synthesis module that's built into LlamaIndex, and then you've got your response.

In order to create these nodes, we have to parse our data, and that's where node parsers come in. Node parsers are an easy way to play around with different retrieval methods, as is the sizing of the nodes themselves. The node parser is simply what takes your list of documents — PDF-level documents — and chunks them into node objects. You can do chunking in different ways, there are many different strategies for it, and that's definitely a great place to start if you're trying to move into more advanced retrieval.

But here's where the magic in LlamaIndex happens. Of course there are retrievers in LlamaIndex, just as there are in other frameworks like LangChain, and here they're focused on nodes, but the query engine is the really specific LlamaIndex construct and abstraction to get a handle on as you're getting into LlamaIndex. The query engine is as important to LlamaIndex as the chain is to LangChain: it's the generic interface that lets you ask questions over your data, over many different types of data. There are many types of query engines, and we're going to show you a couple today.

We're going to use data that will help us learn about camelids. If you don't know about camelids, you probably do know a little something about them: from llamas to alpacas to vicuñas to guanacos, there are many different types of camelids, many of whose names you've probably heard recently. What we've done is taken some data from the International Camelid Institute — shout-out to Ohio State — where they're focused on the veterinary domain and doing research on camels and camelids in a veterinary capacity.
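As a rough sketch of how those constructs fit together — assuming a LlamaIndex release from around the time of this event (roughly 0.8.x; module paths have moved in later versions) and a `data/` folder of PDFs standing in for the real corpus — the documents → nodes → index → query engine flow looks something like this:

```python
import os
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser

os.environ["OPENAI_API_KEY"] = "sk-..."  # needed for embeddings and generation

# Documents are the PDF-level objects; nodes are the chunks LlamaIndex stores.
documents = SimpleDirectoryReader("./data").load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)

# The index embeds and stores the nodes; the query engine is the generic
# "ask questions over your data" interface built on top of a retriever.
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=3)

response = query_engine.query("What are the main species of camelids?")
print(response)
```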
So there are many different research papers here, and you can imagine they're full of jargon and specific language that most people have never heard of — including us. To build this camelid index, we're going to load documents, chunk our nodes along with their metadata using the node parser in LlamaIndex, then create our embeddings and store those nodes in our LlamaIndex vector index.

There's one more thing I want to talk about before we get into the code. Given this RAG system, given this basic LlamaIndex setup, we also want to think about the purpose of today's event, which is advanced retrieval — high-performance RAG — how to actually improve your retrieval process. There are a few ways to do this. LlamaIndex put out a great chart just the other day about different ways to move from simple to advanced RAG. Of course the table stakes are there: use different node parsers, try different chunk sizes, and there are different search techniques and metadata filters you can imagine adding. We encourage you to play with all of those; we're going to show a couple of the more advanced techniques here today.

First and foremost we're going to focus on fine-tuning, and the reason is, again, that we have such specialized vocabulary in this veterinary research context — we want to make sure we can handle whatever mumbo jumbo is in those papers. To do fine-tuning (we've seen this before, when we did Smart RAG just a few weeks ago), we need to develop training, validation, and testing sets of question and retrieved-context pairs. Then we can use a loss function that takes all of the positive pairs and also automatically augments our dataset with negative pairs — meaning, if you ask a question, you could return the wrong, basically irrelevant, context: that would be a negative pair. This Hugging Face sentence-transformers loss is built right in, which is very nice, and it turns out it improves our results when we use something like the BGE small sentence embeddings from BAAI on Hugging Face. It's relatively straightforward and pretty cost-efficient as well.

So we're going to set out to build our RAG system, but first we're going to set up LlamaIndex so that we chunk out some nodes, create training, validation, and testing data to fine-tune our embeddings, then actually go fine-tune those embeddings and set up our camelid-specific embedding model, and see how it does with this very first piece of our advanced RAG setup. With that, I'll send it off to Chris to show us how all of this looks in code.

Thanks, Greg. Yes, so the idea is that we're going to improve our embeddings model by fine-tuning it on questions that we generate from our own training set — we're going to use our own data to build a dataset that gets our embeddings model better at understanding our data. Very fun, very exciting, and we'll get right into what we need to do in the code. First things first, we want to run some boilerplate just so we can run async inside our notebook, we're going to grab our dependencies — openai, llama-index, and pypdf — and of course we'll need an OpenAI API key.
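A minimal version of that setup cell might look like the following; the package list mirrors what was just described, but the exact versions are an assumption, and `nest_asyncio` is the usual trick for running async LlamaIndex calls inside a notebook.

```python
# Notebook setup: install dependencies, allow nested event loops, set the API key.
# (Package versions here are illustrative, not pinned to the event notebook.)
!pip install -q openai llama-index pypdf sentence-transformers

import os
import nest_asyncio

nest_asyncio.apply()  # lets async LlamaIndex calls run inside the notebook's event loop

os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key
```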
It should be noted that some of the evaluation pieces of this notebook are rather resource-intensive, so make sure you're not blowing up your rate limit or anything like that in the later stages of the notebook — I'll mention that again when we get to that step.

The first thing we're going to do is load data. We've set up a subdirectory in our data repository on GitHub which has all of the potentially relevant data. The data we're using today is a couple of zip files containing a bunch of papers on camelids: we have our test zip and our train zip, with two papers and eight papers respectively. These are all veterinary papers about camelids, so they contain knowledge that's very specific both to the veterinary domain and to camelids — the model might not have a great understanding of these things at the outset, because the language is hyper-specific to this domain.

So how do we actually build the dataset we need in order to train? The first thing we do is split our sources into nodes. We load the documents with the SimpleDirectoryReader and use our SimpleNodeParser to parse them into a bunch of nodes, where each node relates to a specific piece of context. You can see that we convert our two documents into 17 nodes and our eight documents into 155 nodes, which is quite a lot of nodes.

Now we actually make the dataset. What we have right now is just a bunch of nodes, and what we want is a bunch of question-context pairs. The way we generate these is to have OpenAI's GPT-3.5 Turbo, through LlamaIndex's generate_qa_embedding_pairs module, create questions that are relevant to our contexts: for each node, GPT-3.5 Turbo creates a question that can be answered by that node, so that each question is tied to a specific piece of context.

Once we have that, we're good to go. We're going to fine-tune using sentence-transformers, since it's neatly integrated into LlamaIndex, and we're going to fine-tune the BAAI bge-small-en-v1.5 embeddings model — you can read more about it on its Hugging Face model card. Since we're using sentence-transformers, having a GPU will make the training go a little faster, but it's not strictly necessary; you could do this on CPU, it will just take a little longer, and you can use the free GPU from Colab to get it done.

Let's quickly run through what we need to pass into our SentenceTransformersFinetuneEngine, a fantastic module provided by LlamaIndex: our training dataset; our model ID, which is the Hugging Face reference to our embeddings model; a model output path, which is where our model will live; a validation dataset, which is just a dataset we use to validate our embeddings model (we'll see it used in a second); and the number of epochs we want to train for. Two is a fine number — you could use three, but you shouldn't go too high: we don't want to overfit on our data, and we'd like to retain good general embeddings while getting better at our domain-specific embeddings.
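Put together, a sketch of that dataset-generation and fine-tuning step might look like this. It assumes the LlamaIndex fine-tuning helpers of that era (`generate_qa_embedding_pairs` and `SentenceTransformersFinetuneEngine` under `llama_index.finetuning`; paths and argument names may differ in your version), and `train_nodes` / `val_nodes` stand in for the node lists parsed from the train and test zips.

```python
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    SentenceTransformersFinetuneEngine,
)

# Have GPT-3.5 Turbo write one question per node, giving (question, context) pairs.
# train_nodes / val_nodes are the node lists parsed from the train and test zips.
train_dataset = generate_qa_embedding_pairs(train_nodes)
val_dataset = generate_qa_embedding_pairs(val_nodes)

# Fine-tune bge-small-en-v1.5 on the generated pairs; two epochs keeps the model
# from overfitting, so it retains its general-purpose embedding quality.
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en-v1.5",
    model_output_path="camelid_bge_small",
    val_dataset=val_dataset,
    epochs=2,
)
finetune_engine.finetune()

embed_model = finetune_engine.get_finetuned_model()  # the camelid-specific embeddings
```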
That loads everything up, and then we get to call finetune and watch the training happen. As you can see, this did not take very long — we train it for our two epochs — and then we extract our embedding model from the finetune engine. This is our now fine-tuned embeddings model, and next we're going to evaluate it.

We evaluate it using the InformationRetrievalEvaluator, which is part of the sentence-transformers library. The idea is that because we want to compare our newly fine-tuned embeddings model against our previous base embeddings model, we can use this evaluator, which gives us a bunch of useful statistics about how our retrieval process is working. The main stat you'll see output in the Colab cell is mean average precision at k, or MAP@k, but there are a number of other great metrics in this library — accuracy, precision, recall, MRR, NDCG, and the one we're using today, MAP — and all of them will be available in your Colab environment in the results folder; you can see in the chart that you have access to every one of them, but by default it outputs MAP@k. All this is doing is evaluating, storing the evaluation to an output path, and returning the evaluator, which is populated with the MAP@k.

You can see that our base retrieval pipeline — the un-fine-tuned bge-small-en-v1.5 — gets 0.77 MAP@k, and our fine-tuned version of the embeddings model gets 0.83 MAP@k, which is quite a significant increase. This is quite a heavy increase for such low-effort fine-tuning: it didn't take very long, we didn't use a lot of data — we really just generated 150 or so question-context pairs and trained for two epochs — and we still get a rather large increase in MAP@k, as well as in the other statistics if you want to look through them in the results folder.

This is a very high-return thing to do when you're dealing with domain-specific data, and even better when you're dealing with a single domain throughout the lifetime of an application. As we add more veterinary data we can train our embeddings on a larger corpus, but the gains we get from fine-tuning our embeddings model will persist as we add more and more data, since we're staying in the same domain. So it's a very powerful pattern, it's quite straightforward to implement thanks to LlamaIndex, and it gives you a better score — and a better score is good.
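For reference, a hedged sketch of how that comparison can be run directly with sentence-transformers is below. It reuses `val_dataset` from the fine-tuning sketch above; the `queries`, `corpus`, and `relevant_docs` fields follow the structure the LlamaIndex QA-pair dataset exposes, but treat the attribute names and file paths as assumptions to check against your installed versions.

```python
import os
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

def evaluate_embeddings(dataset, model_name_or_path, name, output_path="results/"):
    """Score an embeddings model on question -> relevant-node retrieval (MAP@k by default)."""
    os.makedirs(output_path, exist_ok=True)
    evaluator = InformationRetrievalEvaluator(
        dataset.queries,        # {query_id: question text}
        dataset.corpus,         # {node_id: node text}
        dataset.relevant_docs,  # {query_id: ids of nodes that answer the question}
        name=name,
    )
    model = SentenceTransformer(model_name_or_path)
    return evaluator(model, output_path=output_path)  # full per-metric CSV lands in results/

base_score = evaluate_embeddings(val_dataset, "BAAI/bge-small-en-v1.5", "base")
tuned_score = evaluate_embeddings(val_dataset, "camelid_bge_small", "finetuned")
print(base_score, tuned_score)  # roughly 0.77 vs 0.83 MAP@k in the run shown in the event
```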
So we'll go back to Greg to learn a little more about what we'll do on the retrieval side. Yeah, thanks Chris — super cool to see that great increase and improvement just from that simple, quick fine-tuning of the embedding model.

What we're going to do next, now that we've fine-tuned our embeddings to really align with our veterinary camelid domain, is pick out one of the advanced retrieval methods. This one is called small-to-big retrieval: the idea is that when we return context, we can look at the area around the context we've returned, and look through that as well before we return our final answer. We're going to use a tool called the sentence window node parser. What this tool does is split our documents into nodes, of course, but each node here is a sentence, and then we put a window around each node: we look at a few pieces on either side of the sentence in question, we return any relevant context there as well, and we can take all of that in one nice little package and return anything relevant.

There are a lot of applications for wanting to do this. The big idea is that once you return one sentence that is very similar to the query, that sentence is probably placed within a paragraph, and that paragraph is probably placed under a chapter heading — and especially in these research papers there are headings everywhere. So "context" is a murky term when we think about exactly the best way to retrieve any given node from any given type of document, but in this research-paper setting we'll see how the sentence window node parser can work very well and give us much better results.

We're also going to evaluate exactly how we're doing, and when it comes to evaluating retrieval, there are some recent libraries that have been built right into LlamaIndex that we want to show you, because they're super easy to implement and they're really a game-changer: where you used to have to pull in another tool for this, now it's built right in. When we talk about RAG evaluation in LlamaIndex, there are a few pieces of the puzzle: correctness, semantic similarity, faithfulness, and context relevancy. On the generation side — the answer side — we have correctness and semantic similarity: here we're comparing the generated answer to our reference answer. You might ask where the reference answer comes from — it's the ground truth, the "right" answer. Of course, a lot of the time we haven't had humans go through and create that ground-truth dataset, so we take the most powerful model we can find, GPT-4, and use it to create the reference answer; this is standard practice within the industry today, especially when it comes to RAG evaluation.

On the retrieval side, we've got two other metrics: faithfulness and context relevancy. With faithfulness we're really asking: are we seeing a lot of hallucinations — answers that aren't grounded in the context we've retrieved? With context relevancy we're asking: are the retrieved context and the answer both really relevant to the question asked? We want to make sure the question asked really is guiding the way. If you want to see this visually, think of a set of four pieces of information: the question, the generated answer, the retrieved context, and the reference answer (which, as I mentioned, we're creating with GPT-4). Context relevancy and faithfulness are the two retrieval metrics we're going to look at, and these are really important when we're analyzing retrieval.
However, when you improve retrieval, presumably the generations downstream will improve as well, so as we analyze generation while trying to improve retrieval, we should also expect to see better results there. And when we look at correctness and semantic similarity across generated versus reference answers today, you will in fact see some better results. Ideally, with any evaluation tools we pick up, we want to see numbers moving in the right direction, and it's pretty cool that with the camelid index — the llama index — we've created, we're able to do some really advanced indexing and get some great retrieval results with LlamaIndex. Chris, let's show them exactly how we did it.

Oh yeah. So the basic idea here is that we're going to improve the way we retrieve information. We've set up a process where we're better at retrieving correct — or at least relevant — information by fine-tuning our embeddings model, and now we want to leverage that to retrieve information in a smarter way as well. The method we're talking about today is sentence window retrieval, which is this idea of a small-to-big retrieval process. At a high level, we parse our documents into sentences, find the most relevant sentences, then add additional context based on a window around those sentences, and then pass that context to our LLM for the retrieval augmentation step in the RAG pipeline.

Let's look at this with an example — zooming out for a second. We have this block: "I went to Tosche Station. I bought a power converter. I live on a planet with two moons. My name is Luke Skywalker." If we break that apart by sentence, we get "I went to Tosche Station", "I bought a power converter", and so on; if we chunk that context instead, we get larger pieces of the block. The idea of this sentence-based retrieval process is that we want to improve our ability to get similar, correct information based on our query. Say we ask: "Who bought a power converter?" Under a chunking strategy we might get back just the context "I went to Tosche Station. I bought a power converter." — which doesn't really tell us who bought a power converter. Whereas if we use the sentence window approach, we would find the sentence "I bought a power converter", and — say our window is three — we would expand our context to also include "I went to Tosche Station", "I live on a planet with two moons", and "My name is Luke Skywalker". This retrieves the correct context for us, since the context we were looking for was near the query match but wasn't the match itself. That's why this is such a powerful retrieval pattern: basically, we're looking for a needle in a haystack and then expanding our window around it.

Another really relevant example: say you wanted to find an equation in a textbook. An equation doesn't self-define a lot of the time — it's built of variables and such — and the description of the equation might be in the paragraph preceding it. (This example is from Jason — they're awesome.) The idea is that we want to find the equation, so we look for the actual words — say, "Pythagorean theorem" — and then we expand our context to include the equation itself. A very powerful pattern, and thanks to LlamaIndex, extremely straightforward to use.
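Here is a tiny, library-free sketch of that idea using the Luke Skywalker block from the example: score each sentence against the query (word overlap stands in for embedding similarity purely to keep the sketch self-contained), pick the best sentence, then return it together with a window of surrounding sentences.

```python
# Toy small-to-big retrieval: find the best-matching sentence, then hand back
# a window of neighboring sentences as the context for the LLM.
text = (
    "I went to Tosche Station. I bought a power converter. "
    "I live on a planet with two moons. My name is Luke Skywalker."
)
sentences = [s.strip() + "." for s in text.split(".") if s.strip()]

def score(query: str, sentence: str) -> int:
    # Word overlap as a stand-in for vector similarity.
    return len(set(query.lower().split()) & set(sentence.lower().split()))

query = "Who bought a power converter?"
best = max(range(len(sentences)), key=lambda i: score(query, sentences[i]))

window_size = 3  # sentences to include on either side of the match
start, end = max(0, best - window_size), min(len(sentences), best + window_size + 1)
context = " ".join(sentences[start:end])

print(f"Matched sentence: {sentences[best]}")
print(f"Window context:   {context}")
```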
So we're going to use the SentenceWindowNodeParser from defaults, with a window size of six. This is another great place where we can tune our system: the window size shrinks or expands the amount of context we include in the window, so it can be used to control cost, or maybe we want to retrieve a bunch of different nodes with smaller windows as opposed to one node with a large window. We set the window metadata key to "window" and the original-text metadata key to "original_text". We need these for the second step of our sentence window pipeline, the metadata replacement: it associates each of these original texts with a window of size six around its context — each original text gets tied to a particular window, and that will be useful in a moment.

We create our simple node parser, we create our base LLM — GPT-3.5 Turbo, of course — and we set up some fine-tuned embeddings, some base embeddings, and the relevant contexts. This is just for the evaluation later: we want to compare these two pipelines against each other. Next, we create some nodes from our documents: first we load our data from the directory that has all those llama papers, then we parse out nodes using our sentence window parser and also our simple node parser, and then we convert those to vector store indexes, because of course we should.

After that we create our query engine. You'll notice we have this MetadataReplacementPostProcessor: we retrieve the top three most similar sentences and then replace each one, swapping its original-text version for its window version. That means we'll find three sentences and then actually use the expanded context from each window.

So let's see this thing in action. We'll ask it a query — "How do camelid genetics influence wool quality?", which apparently they do; there are a bunch of papers on it, so that's cool to know — and it answers that camelid genetics play a significant role in determining wool quality, and it goes on and on about keratin-associated proteins, hair follicle growth, all kinds of stuff I have no idea what it means. But that's why we built a retrieval augmented generation pipeline: it's able to parse through those papers and give us a very good answer.

Let's look at a visual representation of what happened. We found our original text — this is our zeroth source node, which is just the sentence, this little bit here — and then we expanded it, using our window, to this entire context here. As you can see, we added a lot more relevant context to our window by looking on either side, and it does look like we added a bunch of information that was helpful: the idea of which colors are impacted, that domestication was a factor, and all of this other context that helps give a really full answer to the question.
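A sketch of that sentence-window pipeline is below, again assuming a roughly 0.8.x-era LlamaIndex layout (the postprocessor and service-context imports have moved between releases, so treat the paths as assumptions), with the fine-tuned embeddings path carried over from the earlier sketch.

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor
from llama_index.llms import OpenAI

# Each node is a single sentence; a window of +/-6 sentences is stored in metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=6,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

documents = SimpleDirectoryReader("./data/train").load_data()
nodes = node_parser.get_nodes_from_documents(documents)

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo"),
    embed_model="local:camelid_bge_small",  # the fine-tuned embeddings from earlier
)
sentence_index = VectorStoreIndex(nodes, service_context=service_context)

# Retrieve the 3 best-matching sentences, then swap each for its surrounding window.
query_engine = sentence_index.as_query_engine(
    similarity_top_k=3,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
print(query_engine.query("How do camelid genetics influence wool quality?"))
```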
Then we can go on and look at the question and the response we get from normal chunking: we get these genetic variations, these acronyms, these high-glycine proteins — okay, that's great, all of this stuff is great, but we lose a lot of that additional relevant context, like the domestication process and all kinds of other things. That's why this method is so powerful: oftentimes what we're looking for is going to be around our context window — it might not be in it at the start.

The next thing we're going to do is evaluate just how much better this is. To do that, we set a number of evaluation nodes, grab a random sample of those nodes from our base nodes, and set up our GPT-4-powered evaluation context — this part uses GPT-4. We create our DatasetGenerator, which creates question-context pairs (again, you can see these generated here), and then we change things up a little by adding evaluators. These evaluators mark the various responses based on certain criteria. Look at, say, the correctness evaluator: if you look in the actual code, you can see there's a prompt that basically asks GPT-4 to mark, or score, how correct the response is based on its understanding. The same thing happens for each of these metrics — a very powerful pattern that helps us get a really good understanding of what's happening in these pipelines.

We're only going to evaluate on 15 samples. If you're using GPT-4 and you've just set it up, I'd recommend moving this down to, say, two samples so you don't hit any rate-limit issues. All we have to do now is have our query engine answer all of the questions from the question-context pairs we created, which we do here, and we do the same thing for our base RAG pipeline, which is just our base query engine with non-fine-tuned embeddings. We parse out the string responses and then have GPT-4 evaluate based on the metrics we set up earlier: correctness, faithfulness, relevancy, and semantic similarity.

Then we get to look at our results, and we see a rather impressive difference between the two: our base retriever with base embeddings scores much lower on correctness, relevancy, and faithfulness, and about the same on semantic similarity. The big benefit here is that maybe for a toy example these numbers are close, but when it comes to a production example spanning millions of queries, these kinds of metric increases are fantastic — they're a really powerful signal that our RAG pipeline is going to perform better at the end of the day than when we started. And that is how we create the advanced retrieval side of the pipeline, and how we can evaluate our pipeline using LlamaIndex.
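A rough sketch of that evaluation loop is below. It mirrors the LlamaIndex evaluators described above (question generation with `DatasetGenerator`, plus GPT-4-backed `CorrectnessEvaluator`, `FaithfulnessEvaluator`, `RelevancyEvaluator`, and `SemanticSimilarityEvaluator`), but the exact class names, import paths, and call signatures have changed across releases, so treat this as an approximation to verify against your installed version. `sample_nodes`, `base_query_engine`, and `reference_answers` (a dict of GPT-4-written reference answers keyed by question) are assumed to exist from earlier cells.

```python
from llama_index import ServiceContext
from llama_index.evaluation import (
    DatasetGenerator,
    CorrectnessEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    SemanticSimilarityEvaluator,
)
from llama_index.llms import OpenAI

# GPT-4 acts as the judge; it was also used to write the reference answers.
gpt4_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))

# Generate evaluation questions from a small random sample of nodes.
dataset_generator = DatasetGenerator(sample_nodes, service_context=gpt4_context)
questions = dataset_generator.generate_questions_from_nodes(num=15)

correctness = CorrectnessEvaluator(service_context=gpt4_context)
faithfulness = FaithfulnessEvaluator(service_context=gpt4_context)
relevancy = RelevancyEvaluator(service_context=gpt4_context)
similarity = SemanticSimilarityEvaluator(service_context=gpt4_context)

def evaluate_pipeline(engine, reference_answers):
    """Average the four metrics over the generated questions for one query engine."""
    totals = {"correctness": 0.0, "faithfulness": 0.0, "relevancy": 0.0, "semantic_similarity": 0.0}
    for q in questions:
        response = engine.query(q)
        ref = reference_answers[q]  # assumed GPT-4-written reference answer
        totals["correctness"] += correctness.evaluate_response(query=q, response=response, reference=ref).score
        totals["semantic_similarity"] += similarity.evaluate_response(response=response, reference=ref).score
        totals["faithfulness"] += faithfulness.evaluate_response(query=q, response=response).score
        totals["relevancy"] += relevancy.evaluate_response(query=q, response=response).score
    return {metric: value / len(questions) for metric, value in totals.items()}

print("sentence-window + fine-tuned:", evaluate_pipeline(query_engine, reference_answers))
print("base chunks + base embeddings:", evaluate_pipeline(base_query_engine, reference_answers))
```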
With that, we'll go back to Greg to close us out. Yeah, thanks Chris, that was totally awesome. Great to see how we can improve our pipelines and look at that both qualitatively and quantitatively. By simply combining that small-to-big approach with fine-tuning the embeddings, we were able to go from lower numbers on correctness, semantic similarity, faithfulness, and relevancy to higher numbers. The numbers didn't improve drastically in all cases — semantic similarity, for instance — but recall that we're focused on retrieval here, and the retrieval metrics are faithfulness and relevancy. Faithfulness went way up, so presumably we're decreasing hallucination a lot when we use this method, and relevancy also went up quite a bit. Again, we were focused on retrieval, but we also saw improvements in our generation metrics. Very cool to check this out and see how easy it is to get going.

The idea is, again, as Chris said: once you have many, many documents, you'll probably start to see some of these numbers level out, and you'll start to see that you're gaining a mastery over the language in your particular domain — with the embedding-model approach, or even with a type of retrieval chosen for a specific document type. Of course, putting all of this together across different types of documents that are structured in different ways, and looking across all of them, is another level as we get into more advanced retrieval and production. But for today, we've seen that there are many ways to enhance retrieval — we saw just a few, small-to-big and fine-tuning embeddings. Fine-tuning embeddings is really recommended for very specialized vocabulary, and this veterinary camelid — or llama — index was a great example of how to do it. We saw the generation metrics improve along with retrieval even though we were really only aiming at retrieval, and evaluation is easier than ever, although you do want to note that GPT-4 is often still used as ground truth. One of the common questions we get is: isn't it kind of sketchy to have GPT-4 analyzing the output of another LLM, doing that sort of self-assessment? Not really, because it turns out GPT-4 is actually very good at self-assessment in general, so this is considered okay and accepted in the industry today.

With that, we'll spend the last 14 or 15 minutes answering everyone's questions. We'll leave this Slido QR code up on the screen for a few moments while I welcome Chris back to the stage, and then a small QR code will show up on the right-hand side of your screen once this drops away. So Chris, let's go ahead and dive into it, man. Islam's got some popular questions here. He asks: in the case of documents containing graphs, tables, or images that hold important information, what happens to the nodes?

Yeah — they're parsed in whatever format you use, so PDFs are parsed to plain text. Images would in large part be ignored; basically, a PDF parser is used to parse those, in the case of PDFs, so it depends on which parser you're using. You can build your own parser, or a custom parser, that might be able to sort that information in a better way, or in a way you prefer, but for the basic parsers, things like images are ignored, tables are converted to plain-text tables, and graphs are, again, largely ignored.

Yeah, and this is something everybody asks, right Chris? How do I get everything in my PDF doc to show up beautifully, just right, with a simple node parser in LlamaIndex? It's a very, very hard problem — it's going to require digging down into each type of data within that PDF and figuring out what the (not-fun) data pipeline to get it right actually looks like.

All right, Islam asks another question: what would the main difference be between LangChain retrievers and LlamaIndex retrievers? To me they look almost the same — am I missing something? Not in particular, no. They're maybe implemented in different ways, or they have specific framework integrations.
LangChain's retriever, for example, is going to integrate very well into the LangChain ecosystem, but for the most part they're all doing the same thing. I would say there are some differences in terms of which retrievers are available and which methods they're implemented with, so you might have more performant retrievers in some cases between the two — LangChain might have a more performant implementation of a specific retriever, and LlamaIndex vice versa — but the basic retrieval processes are going to be the same. It comes down more to the availability of built-in retrieval pipelines: LlamaIndex has a lot of retrieval pipelines that work in a number of awesome ways, so as you move past the simple retrieval processes you're going to see some pretty significant differences between the two.

And it does seem like LlamaIndex is really doubling down on retrieval these days, right? So I would expect to see more and more all the time. And they play nice together — at the end of the day you can implement LangChain right inside LlamaIndex, so for the most part you can plug and play with either of them. They do have differences, and it can be important to know those differences, but the libraries play very well together. Yeah, and I think there's often this either-or framing about it, and it can be either, or it can be both; as you get more advanced you're likely to try out both and see what works best for your use case.

Someone asks for the URL to the Colab — got you, that should be in the YouTube chat. Mani (theneomatrix369) asks: could you please share with us free, open-source LLM models that we could use for performant embedding and chat models? What do you recommend, Chris, for people who are just getting started?

Well, you go to MTEB — the Massive Text Embedding Benchmark — you find the task you want to be good at, and then you click the one at the top. That's what I would suggest, unironically. In terms of your ability to get open-source embeddings models, we are spoiled for choice: you can get ones that are very good at a specific subtask, you can get different languages, you can get multilingual — the sky is the limit on embeddings models right now, and as long as you really understand what's best for your use case, you're going to be able to find a performant, free embeddings model, no problem. The benefit, of course, of all these open-source models is that you can fine-tune them to be very good at whatever domain you're in. So if you see one that's good at a task, use it, then fine-tune it, and boom, there you go.
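If you pick a model off the MTEB leaderboard, plugging it into a LlamaIndex pipeline is typically a one-liner. Here is a hedged sketch using the `HuggingFaceEmbedding` wrapper from that era of the library; the model name is just an example, the import path may differ in newer releases, and `nodes` is assumed to exist from an earlier cell.

```python
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding

# Any sentence-transformers-compatible model from the MTEB leaderboard can be dropped in.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

service_context = ServiceContext.from_defaults(embed_model=embed_model)
index = VectorStoreIndex(nodes, service_context=service_context)  # `nodes` from earlier
```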
Yeah, for sure. And I think, too, it depends on the use case. We've talked to a number of government agencies recently, and they're very averse to using embedding models that come out of Chinese companies or nonprofits, for instance, because it's actually against the rules in their particular domain, so it really depends. And I do still think — and this is true in LlamaIndex blogs and in our own work — that the OpenAI embeddings model is really good, right Chris? It's just very good. So if you're looking for performant embeddings models and performant chat models, it's hard to say don't start with OpenAI. If you really have to be open source, go look at that Massive Text Embedding Benchmark leaderboard, look at the Open LLM Leaderboard on Hugging Face, and away you go.

All right, let's keep it moving. Iona asks: what is a cost-efficient method to train the LLM on personal or company documents — in other words, to embed company knowledge and documents into the LLM?

Yeah, so there isn't one yet. The whole point of the RAG system — or end-to-end RAG, or DALM, or RA-DIT, whichever version you want to use — is that we don't really want or need to embed knowledge into our LLM. We want to use the retrieval pipeline to store that knowledge and then make our pipeline more robust, so we can fine-tune the LLM at being better at domain-specific retrieval. For instance, Arcee has a good pipeline for this with their DALM process, and RA-DIT, which just came out, is great for this. But we much more often want to improve the efficacy of our retrieval pipeline — through methods like better retrieval and fine-tuning our embeddings models — than fine-tune our LLM to memorize our company's knowledge. So that's how I'd answer that question, and hopefully the takeaway is that these pipelines, like RAG or whatever version you're using, are meant to bypass that step, though we can still fine-tune the LLM to be better at our retrieval stack.

Yeah, definitely. Figuring out exactly what the order of operations is when we're doing RAG, then fine-tuning embeddings versus fine-tuning LLMs versus all the other things we can do, is often a little confusing. But I think if you take the approach we took today — if you're looking at really specific language, start with the embeddings models and see how far you can get by really dialing in that RAG system.

Is any of this applicable in RAG with structured data involved, Chris — like SQL data? Sure, yes — mostly. With structured data we can use other techniques to, say, query a database or what have you, but we still oftentimes want to augment that with some kind of semantic information, or otherwise improve the response with some text context. So yes, you can use these. If you're purely going to use a natural-language interface for, say, a SQL database, then probably not — in that case we're just generating SQL and then communicating about it in natural language — but in what I'd call the standard RAG application, where you're going to further augment that SQL response with natural language, yes, it absolutely can help. Things like knowing which documents to retrieve based on what you get back from your SQL response, and adding additional semantic context via a separate retrieval pipeline — both of these add better, smarter context that can further augment the quantitative results from your SQL query.

Yeah, very cool. All right, we've got a quick one here: can you use an open-source LLM, say Llama 2 70B, to construct a training dataset? Why not — yeah, absolutely. Your mileage may vary — I'm not going to tell you it's going to be the best or the worst — but you can, yes.
Definitely — you can use any sufficiently large language model to create a good dataset; you can use any language model to create a dataset, and the quality is really what we're concerned with. Something like Llama 2 70B that's instruct-tuned, that's very good at instruction following, would probably do a great job. I haven't personally used this pipeline, so I don't want to speak to it exactly, and it's going to vary by the domain you're in and everything — the classic ML answer of "it depends" — but it can absolutely be a substitute for GPT-4 or GPT-3.5 Turbo, especially for the question-generation component. Yeah, absolutely, for sure.

We kind of touched on this already, but maybe this is a great place to end: how do you select an embedding model for a given application and for fine-tuning? There are so many of them — Chris, maybe you can just touch on how you selected BGE small today, for instance. I went to MTEB, I looked at the top results, and I found the first, best small embeddings model. I didn't want to use a very large one because I didn't want to necessitate a big GPU — and there you go, that's really it. I know it sounds silly, but they've already made the benchmark for us, right? So let's just use it.

There it is. And just one last comment to share, Chris, before we wrap up. Todd LLM says, "I love the use of 'embodied' for the Chris LLM" — smiley face, relaxed smiley. Shout-out to Todd, a great community member at AI Makerspace — thanks for engaging with us today. And Chris, thanks so much; we'll go ahead and head out of here.

All right, that was awesome. Thank you everyone for your participation — this brings us to the end of today's event, brought to you by AI Makerspace. We're really excited to announce our brand-new three-week course on LLM engineering, which focuses on the foundations of GPT models. It kicks off November 2nd, 2023, at 7 p.m. Eastern, 4 p.m. Pacific. It's designed for practitioners like you looking to master the why and how of LLMs, from architecture to unsupervised pre-training, supervised fine-tuning, and alignment techniques including RLHF. If you're looking to get started down the path of becoming an AI engineer, it's probably worth checking out. For everything else, follow us on LinkedIn and Twitter, and until next time, we'll keep building, shipping, and sharing — and we hope to see you do the same. Thanks so much, everybody.
Info
Channel: AI Makerspace
Views: 14,020
Keywords: llamaindex, RAG, Large Language Models
Id: wBhY-7B2jdY
Length: 59min 37sec (3577 seconds)
Published: Tue Oct 10 2023