Smart RAG: Domain-Specific Fine-Tuning for End-to-End Retrieval

Captions
Hi everyone, and welcome to Smart RAG: Domain-Specific Fine-Tuning for End-to-End Retrieval Augmented Generation Applications. My name is Greg Loughnane and I'm the founder and CEO of AI Makerspace. Thanks for taking the time to join us for this event today. It's 2 p.m. in Dayton, Ohio — where are you tuning in from? We'd love to hear from you in the YouTube chat, and we're so happy to have folks in our community joining us from all over the world.

Today's event is a special one, because the tool and the capabilities we're going to show you just came out this week. We're building with it, so we wanted to ship and share something with you ASAP, straight from the cutting edge. We connected with the great folks over at Arcee and got to know a little bit about their tool and their founding team. Trust us, they're the real deal — they've got the pedigree — and we're really pumped to share everything we've been learning with you about RAG systems and all of the underlying details.

Just a heads up: what we're covering today is quite advanced and not something many of you should expect to digest in one sitting. That said, if you're paying close attention you'll probably have many questions. When they come to mind, please follow the Slido link in the description box on the YouTube page. We'll do our best to answer all of the questions you throw at us today, and we'll certainly answer the most upvoted ones during the Q&A portion of the event.

And now I'm pumped to welcome my good friend, the LLM wizard Chris Alexiuk, to the stage to help further augment the context for this Smart RAG event. Chris is the Head of LLMs at AI Makerspace — a developer by day, teacher and YouTube creator by night — and he's always building, shipping, and sharing his work. Chris, we've been talking about RAG for quite some time now across the many classes we've been teaching in 2023. Behind the scenes, you've been referring to retrieval augmented generation, or RAG, as retrieval augmented question answering — RAQA, as we've been calling it in the cohorts — because true RAG actually requires a little more than question answering, or QA. What are most people missing when they think of RAG, at least when viewed from the perspective of the original RAG paper?

Yeah, the idea is that the RAG we know and love — the one we see being very popular and exciting, and to be clear, it is fantastic — is missing a step from the paper, which is this idea of end-to-end training for the retriever and the generator. We actually want to train this system on data, as opposed to just showing documents to it and getting relevant answers. That's still a powerful pattern; it just wasn't exactly RAG. It was close, but missing that extra little bit.

That extra little training, that extra little tuning, right? We're getting in there and really dialing in the models we're using for RAG — is that the big idea here? Yes. Perfect, Chris. We're going to have you back in just a few minutes as we get into the data-centric part of today's build.

So let's talk about Smart RAG, and let's give it some proper context by recalling what RAG is in general. Then let's make sure we're focusing ground-up, data first, and taking a data-centric approach. Chris will walk us through exactly what data we're using — it'll be AIM-style data.
Then we're going to talk a little bit about each of the critical pieces of fine-tuning that are going on. A lot of people are asking: is it RAG, is it fine-tuning, what should I do? Well, in fact, if you take RAG far enough, you're doing fine-tuning anyway. So oftentimes it's not an either-or; it's a both-and approach in really state-of-the-art, complex LLM applications, and the Smart RAG framework — which, again, came out just this week — is one way you can start building these types of RAG systems today.

So let's talk first — we're getting a little lost in the tools — about why RAG, and why fine-tuning of RAG, now. Is it just because it's cool and we're trying to keep modding systems to make them cooler and better? Well, the industry is speaking to us and pushing us in this direction. We're moving from a paradigm of "I just want ChatGPT for my data," where it can answer everything, do everything magically, and potentially answer all questions perfectly, to a more realistic perspective where we're suddenly very clear on the fact that ChatGPT for our data can't magically do everything all at once. We need a lot of data from specific industries, we need data from our specific company, we need to master the language of our customers and users, and we need to make our applications not just task-specific but domain-specific. And we need to do this in a cost-efficient way, with models that don't necessarily need to do everything. We're not really adding new knowledge to our models; we're putting the structure and the applications in place around them to generate as much business value as possible. So this new paradigm sends us back to being data-centric: fundamentally focusing on our own data and how we can best leverage it in RAG systems.

So what is a RAG system anyway, for those of you maybe tuning in for the first time who haven't gotten up to speed on this big idea of building your own open-source ChatGPT for your own data? There are a few key pieces. First, when you ask a question — when you put in your query, whatever you're trying to get to know — you turn that query into a vector. That vector representation is simply a string of numbers, and it allows you to go search a database full of your data: a vector database (and there are many companies out there creating vector databases today). You've got vectors in the database and you've got the vector you just created from your question; we try to match those vectors and find similar ones within the data we put in. Once we've found as much similar information in our data, in vector format, as we can, we take our initial user query and, before we pass it in to our LLM, we augment the context of our prompt. So after we retrieve, we augment, and the augmentation takes place within the prompt template — if you joined us for the last event, we talked a lot about prompt templating and how to set these up properly using something like OpenAI. The retrieved context goes directly into the prompt, and that is what's finally fed in to the LLM. The LLM comes into play only at the very last stage of a RAG system: the retrieval and the augmentation take place prior to the generation, which is ultimately the question-answering piece. This is classic RAG.
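As a minimal sketch of that retrieve-then-augment-then-generate flow — assuming an OpenAI client for generation and a generic `vector_store` object whose `similarity_search` method stands in for whatever vector database you use (both are illustrative placeholders, not a specific library's API) — it might look roughly like this:

```python
# Minimal retrieve -> augment -> generate sketch of the classic RAG flow above.
# `vector_store` and `similarity_search` are illustrative stand-ins for your
# vector database of choice, not a specific library's API.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """Use the context below to answer the question.

Context:
{context}

Question: {question}
"""

def answer_with_rag(question: str, vector_store, k: int = 3) -> str:
    # 1. Retrieve: embed the query and find the k most similar chunks.
    docs = vector_store.similarity_search(question, k=k)
    context = "\n\n".join(d.page_content for d in docs)

    # 2. Augment: inject the retrieved context into the prompt template.
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)

    # 3. Generate: the LLM only enters at the very last step.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```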
To take this to the next level, we're going to need to go back to first principles, because we're already getting pretty advanced. We mustn't forget that you always have to start from the perspective of your data. Today, with the new tools and the layer of abstraction we're working at, we can get crazy awesome stuff done in just two steps: if we create question-answer pairs connected to passages retrieved from our own data — so we essentially have question-passage-answer triples — we can actually train an end-to-end RAG system directly. First we're going to walk through exactly what these question-answer pairs look like, using one of our favorite texts, The Hitchhiker's Guide to the Galaxy, and Chris is going to break down exactly how we set this up so that we can get right into end-to-end RAG as soon as we're done being data-centric. So Chris, go ahead and show them how to chop up the data just right for this training today.

You know it. Okay, so we've got this notebook here that you should all have — you received a link for it in the YouTube chat — and the idea of what we're going to do today is create the domain-adapted language model using the end-to-end RAG training pipeline. We need to set up some data for this, and the way we do that, after we get our dependencies installed, is by constructing these triples. Now, if you've trained or fine-tuned an embedding model before, you'll know that you need a question-abstract, or question-context, pair — it's got a lot of different names — but the idea is a question and then some context we're pulling that is related to the question. With the end-to-end RAG system, we also need to include the answer — the correct answer to the question — the idea being that because we're also fine-tuning our generator at the same time, we need some way to guide it toward the correct response. So we have to build these triples.

Now, of course, like Greg said, we absolutely love The Hitchhiker's Guide to the Galaxy here, so we're going to use it to create some synthetic data produced from ChatGPT — or, sorry, OpenAI's GPT-3.5 Turbo endpoint. This process takes a while, so if you'd prefer, there is a provided toy dataset you can use instead, included in the domain-adapted language model repo.

First things first, we grab some dependencies to build our synthetic data — LlamaIndex, of course, and PyPDF, because we'll be ingesting this Hitchhiker's Guide to the Galaxy PDF, just the first book. We set it as our training file and create a LlamaIndex corpus, which is basically just going to be a collection of nodes — sorry, I'll zoom in a little bit more here. This is a lot of code, and all it's saying is: construct a LlamaIndex index of this particular data. It breaks the data down into chunks and then collects them in a format we can leverage to create questions and answers. You'll notice that we get 139 nodes out of our single PDF.

Now we're going to create the synthetic QA pairs. The first step is just to get some dependencies and provide an OpenAI API key, since we'll be using GPT-3.5 to create these questions and answers.
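The corpus-building step described here might look roughly like the following, assuming a recent LlamaIndex release — import paths and default chunking parameters have changed across versions, so treat the specifics (file name, chunk sizes) as illustrative:

```python
# Rough sketch of the corpus-building step: load the single PDF and chunk it
# into nodes that will serve as retrieval contexts. Exact imports and defaults
# vary between LlamaIndex versions; the file name here is a placeholder.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader(
    input_files=["hitchhikers_guide_book_1.pdf"]
).load_data()

parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = parser.get_nodes_from_documents(documents)

print(len(nodes))  # the walkthrough above ends up with 139 nodes
```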
The first thing we need to do: we have 139 pieces of context, and we need to create questions, where each question is related to its context. We use a particular prompt for this, where we provide our context string and then ask the LLM to create a number of questions related to that context string. This gives us the first half of the puzzle — it would be enough if all we were doing was fine-tuning the embedding model — because it gets us our question and related context, so we have something to compare against during training. We just create a Python function that does this: we parse out all the questions and return the queries and the relevant documents. You can see the process here; it does take quite a while, so again, if you want to use the toy example, that's totally fine. The idea is that for each of these nodes we create some questions and associate them with their contexts, and then we unwrap that into a list of question-context pairs.

Then we essentially do the same process again, but this time the prompt asks for the answer to the question, and this gives us the third piece of the puzzle: the answer. So now we have our question, our abstract or context, and our answer. Again, we just run this through GPT-3.5 to get it to answer all of these questions — we're only using a subset here so it can be completed in a reasonable amount of time — and once that's done, we're basically done generating our synthetic dataset. We just convert it to the format the repository expects.

This process is great because it lets you generate data you might not have and build datasets for your specific material. It is rather time-consuming and expensive, so you don't want to do it a lot, but as we'll see, it definitely helps at the end of the day. And that is the basics of how we create that synthetic data using LlamaIndex and GPT-3.5 Turbo. I'll pass it back to Greg to learn a bit more about what's coming up next.
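An illustrative two-pass version of that synthetic-data step — the prompts, parsing, and field names here are simplified placeholders rather than the notebook's exact code, and `nodes` refers to the LlamaIndex sketch above:

```python
# Two-pass synthetic data generation: first generate questions per context,
# then generate a reference answer for each (question, context) pair.
# Prompts and parsing are simplified; the notebook's exact prompts may differ.
from openai import OpenAI

client = OpenAI()

QUESTION_PROMPT = (
    "Context:\n{context}\n\n"
    "Write {n} questions that can be answered using only the context above. "
    "Return one question per line."
)
ANSWER_PROMPT = (
    "Context:\n{context}\n\nQuestion: {question}\n\n"
    "Answer the question using only the context above."
)

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

triples = []
for node in nodes:  # `nodes` comes from the LlamaIndex step above
    context = node.get_content()
    # Pass 1: questions tied to this chunk of context.
    questions = ask(QUESTION_PROMPT.format(context=context, n=2)).splitlines()
    for q in filter(None, (q.strip() for q in questions)):
        # Pass 2: the reference answer that will guide the generator.
        a = ask(ANSWER_PROMPT.format(context=context, question=q))
        triples.append({"question": q, "context": context, "answer": a})
```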
Yeah, thanks Chris. All right, so we're all set with our data-centric approach: we've got our index set up and our question-context-answer triples set up. Let's see exactly how all of this fits together before we see any more code.

When we look at end-to-end RAG systems, we can start with the original RAG we've been talking about — in this case we're leveraging all of the same pieces — but we want to zoom in on the embedding model and the chat model, because that's where most of the work of this fine-tuning happens. When we talk about fine-tuning the retriever, we mean fine-tuning the embedding model; when we talk about fine-tuning the generator, we mean fine-tuning the chat model. We actually need to fine-tune the parameters within these large neural networks — within the models — that's what we're really doing. And as we get into that, we also need to make sure the whole thing is connected in a coherent way. Ultimately, we want a system whose outputs are constantly being evaluated and fed back in to develop inputs that allow us to improve over time, and that make us resilient to questions we might not even have today but may need to answer tomorrow. This sort of end-to-end evaluation feedback loop is the level you want to be thinking about for the systems you're building within your companies today, because this is really where the competitive advantage is going to come from.

So let's zoom in on each piece of the fine-tuning. In general, there are two things we're fine-tuning: the embedding model and the chat model. In today's example, the embedding model we'll fine-tune is BGE-large-en from the Beijing Academy of Artificial Intelligence — a great model for retrieval and one that's relatively cost-effective to run. Similarly, EleutherAI's GPT-Neo 125M is a solid model we can run, we can get decent results with it, and it's not too big or computationally intensive — that's why we chose it. Of course, you can go pick up Llama 2, or you can always go to the Hugging Face LLM or embedding model leaderboards and pick your favorite; this works for any model. It's just about the compute and whether you're getting good enough results with the model you've chosen.

When we talk about embedding model fine-tuning, we're talking about taking the pre-trained model and aligning it better with the words we tend to use in our application. Think about domains with special words: lawyers, doctors — folks like these are always using very specialized language. If you're building something for people who use special words, that's where you really want to think about embedding model fine-tuning. Chat model — or LLM — fine-tuning is more of an input-output schema thing: you don't want your user to have to prompt too much into the LLM, whether that's system-prompt-level context or actual examples (one-shot, two-shot, few-shot, 10-shot, 100-shot). You want it to be simple for them, and you want it to be clear exactly how the user interface should look. So really dialing in the downstream task is the key point when you fine-tune a chat model.

Specifically, the how of embedding model fine-tuning comes down to generating those question and retrieved-passage pairs and choosing a loss function you can fine-tune with. Eventually, once you fine-tune, you're going to test this thing out and evaluate it for something called hit rate. Hit rate just means: did I actually nail the passages, based on my query, that I should have? Is my model crushing my training dataset? That's really all there is to it — it's back to classical machine learning here. The interesting thing about the particular loss function we'll share with you is that although we provide only positive examples — a query paired with relevant context — you can often improve performance by also having pairs where the query is matched with context that is not relevant, i.e. negative examples. The loss function in this case automatically generates some of those for you, so it's pretty cool the way it works under the hood.
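One common way to run that embedding fine-tuning step on its own is with the Sentence Transformers library and MultipleNegativesRankingLoss, which needs only positive (query, passage) pairs and treats the other passages in each batch as negatives. A minimal sketch, hedged on the exact checkpoint and hyperparameters (and reusing the `triples` built in the synthetic-data sketch above):

```python
# Minimal embedding fine-tuning sketch with Sentence Transformers.
# MultipleNegativesRankingLoss uses in-batch negatives, so only positive
# (query, relevant passage) pairs are required.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-large-en")  # the BGE model mentioned in the talk

# Positive pairs built from the synthetic-data step above.
train_examples = [InputExample(texts=[t["question"], t["context"]]) for t in triples]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=50,
)
model.save("bge-large-en-finetuned")
```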
That might look something like this, where we're focused purely on the embedding part of the fine-tuning within the RAG system. This is something you can do very easily today with tools like LlamaIndex; it's also coming out through OpenAI and their API, and you can do it on Amazon today. So embedding fine-tuning is something that is really catching fire all around the world right now in generative AI.

Of course, fine-tuning LLMs has been around for a while, and fine-tuning an LLM is done not with some super-fancy loss function but with a simple cross-entropy loss, just like classic ML. It can be done with question-answer pairs, and the evaluation is where people run into a bit of a snag: for a lot of the generation tasks we do, depending on how many times we generate and what the parameters are, you're not always going to generate an exact match of the training data you used. That's where the evaluation piece is currently evolving, and we're paying very close attention to it — look for future events from us related to evaluation. For now, though, we're talking about exact matches: word-for-word, character-for-character, string-to-string. That's the gold standard for evaluation, if you will, although how useful it really is depends on the application, and there aren't many applications for which it's incredibly useful. Fine-tuning just the LLM might look something like this, where you're focused on dialing in the parameters of the large language model itself based on those question-answer pairs.

Finally, the new innovation is the end-to-end RAG innovation: how to fine-tune the whole thing at once. This is the thing we're pretty excited about, and if you can understand it and start building with it, you are absolutely at the cutting edge. Here we take our question-answer pairs and our question-retrieved-passage pairs — those two sets of pairs — and combine them into one dataset of triples: question, retrieved passage, and answer. And instead of one loss function, as we used for embedding model fine-tuning or LLM fine-tuning, we use a combined loss function: the Hugging Face Sentence Transformers loss function from that library (there are many loss functions within it), plus the innovation from the arXiv paper on end-to-end RAG, a loss function called the marginalized loss. This builds in the idea that the retrieved passage is now directly related to the loss we calculate based on the LLM — that is the key aspect, under the hood, of how this whole thing works so well together.
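For reference, the marginalization idea traces back to the original RAG paper (Lewis et al., 2020). As a hedged summary drawn from that paper rather than from the talk's slides: in its RAG-Sequence form, the generator's likelihood is marginalized over the top-k retrieved passages, so the retriever's scores enter the generation loss directly:

```latex
% RAG-Sequence marginal likelihood and training loss (Lewis et al., 2020):
% p_eta is the retriever, p_theta the generator, z a retrieved passage.
p(y \mid x) \;\approx\; \sum_{z \,\in\, \operatorname{top-}k\left(p_\eta(\cdot \mid x)\right)}
  p_\eta(z \mid x)\,\prod_{i} p_\theta\!\left(y_i \mid x, z, y_{1:i-1}\right),
\qquad
\mathcal{L}(x, y) \;=\; -\log p(y \mid x)
```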
Of course, we can use the same evaluation techniques here: for retrieval — the embedding model — we talk about hit rate, and for generation — the LLM — we're again talking about exact match. Watch this space: we know for sure the folks at Arcee are, and we definitely know of some tools that are going to help take this to the next level — stay tuned to AI Makerspace on that. So if we now put this all together and talk about how to actually fine-tune the whole thing: through one single loss function we're tuning both the embedding model (the retriever) and the LLM (the generator), and we end up with something that works very well and is trained very efficiently. Chris is going to show us how to do it, as he always does. So let's see fine-tuning for end-to-end RAG and domain adaptation — Chris, over to you, man.

Oh yeah. So we have finally created our triples — our question-abstract-answer triples — and we're ready to get going. We'll do one thing first, which is make sure we have our Hugging Face token; we'll need it for evaluation, so we may as well get it out of the way now. Also, if you were using a beefy instance with a lot of GPU RAM, you might want to try this with Llama 2 7B, but for now we're going to stick with the smaller, more lightweight GPT-Neo 125-million-parameter model.

So how do we actually train this thing? Well, here — just like this. That's all you need to do. We pass in our target CSV, which has our triples; we point at the Hugging Face embedding model we want to use; we point at the Hugging Face causal model we want to use; and we supply an output directory. They're using the Hugging Face Accelerate library under the hood, so we set a few parameters there as well, and of course we want our batch size to be as high as we can given the GPU memory available. You call this, it goes through the process, it trains the system end to end — the whole thing is done here. And that's it; that's the whole process.

We can evaluate using the provided evaluation they've given us, where we map our columns to the correct pieces and then just run it. Again, like Greg said, the generator evaluation right now is very strict, so it's hard to get exact matches, but the retrieval evaluation — hit rate — is going to be great. You'll notice as well that we have these retriever and generator PEFT model paths; that's right, under the hood we're talking about PEFT, we're talking about LoRA — we'll get into it in the code in a second. For right now, we run the evaluation and see that even on our 100 samples we get a hit rate of 1 and a recall of 1, and we're feeling pretty good about all of that. The idea is that we're able to fine-tune this very quickly on a very limited GPU, it gives us excellent results, and it's all done and handled for us through that one training script we saw above.
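The shape of that training call, written as a purely illustrative sketch — the DALM repository's actual entry point, argument names, and defaults may differ, so every identifier below is a placeholder rather than the repo's real API:

```python
# Purely illustrative: `train_rag_end_to_end` is a hypothetical stand-in for the
# repo's training script, shown only to capture the shape of the call described
# above (triples CSV in, two Hugging Face model names, an output directory).
from typing import Any

def train_rag_end_to_end(**config: Any) -> None:
    """Hypothetical placeholder for the end-to-end RAG training entry point."""
    print("Would launch end-to-end RAG training with:", config)

train_rag_end_to_end(
    dataset_csv="hitchhikers_question_context_answer.csv",  # question/context/answer triples
    retriever_name_or_path="BAAI/bge-large-en",             # Hugging Face embedding model
    generator_name_or_path="EleutherAI/gpt-neo-125m",       # Hugging Face causal LM
    output_dir="./rag-e2e-output",
    per_device_train_batch_size=32,   # as high as GPU memory allows
    use_peft=True,                    # LoRA adapters via PEFT
)
```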
So let's dig into what's going on under the hood. It's awesome that we have this training script, but what's actually happening? Well, we have our classic training loop. The first thing you'll notice is exactly what Greg said: we have this total loss, and this is the key — we're actually using that loss to train our system as one unit, and that's the pretty awesome technology that comes from the RAG paper. There are some steps for when checkpointing is involved, but the basic idea is that we first get the loss for our retrieval pipeline — the retrieval half of the model, you could think of it as — and we keep those logits for later, because we want to include them when we compute the loss for our generator.

You'll see that we get our retriever contrastive loss. This is basically the same loss that's present in the Sentence Transformers library: the multiple negatives ranking loss. The idea is that we can get this despite only having positive examples, because we can use the other queries' responses as negative examples — each query is related to its specific context and not to the other contexts; that's the basic intuition. So we can use positive examples only, which is what we generated in our synthetic-data step, and still compute that loss, which is great.

Back to the loop: we get our generator logits, and you'll notice that when we calculate the marginalized loss, we pass in what we got from the retrieval step. This is the core idea behind the marginalized loss: we actually consider the output of our retriever when training the generator. Then, before we do everyone's favorite accelerator.backward, we combine our losses and produce the total loss. Not only are we literally combining them by adding them together; we're also considering the performance of the retrieval step in the calculation of the generator's loss, which is what makes this system so powerful and what really allows it to adapt to new domains in an efficient and effective manner.

And of course we're doing all of this with PEFT and LoRA. If we look at the auto model for RAG end-to-end — the model object — we've got a retriever and a generator name plus some additional parameters. The retrieval model is just stored inside this object; it's exactly two models shoved into a single model, with some communication between them. And of course we have our PEFT models if we're using the PEFT option, which again uses the LoRA process.

And that's really it. At the end of the day, this system is very powerful because of a few considerations: we're combining these systems into one — even though it's two distinct models, we treat them as a single system — and that helps us guide each of their outputs to a better place. Right now, when you look at the evaluation results, we see the retriever being really accurately fine-tuned — we get very good performance out of the retriever — but because the generator's evaluation metric is so strict, we're not seeing the same numbers there, though its performance has indeed increased. That is what's happening under the hood, that is how they're doing it, and this is how we implement it in our code — again, a very straightforward script that does a lot of the work for us. And with that, I'll send it back to Greg.
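A condensed, illustrative view of that training step — it mirrors the logic described here (contrastive retriever loss plus a generator loss weighted by the retriever's scores, summed into one total loss and backpropagated together), not the repository's actual code; `retriever.encode` and `generator.answer_nll` are hypothetical helpers standing in for the real models:

```python
# Illustrative combined-loss training step, not the repo's implementation.
import torch
import torch.nn.functional as F

def training_step(batch, retriever, generator, accelerator):
    # Embed the batch of questions and their contexts.
    query_emb = retriever.encode(batch["question"])   # (B, d), hypothetical helper
    passage_emb = retriever.encode(batch["context"])  # (B, d)

    # Score every query against every passage in the batch; these are the
    # "logits" that get kept around and reused for the generator loss.
    scores = query_emb @ passage_emb.T                # (B, B)

    # Retriever half: multiple-negatives-ranking-style contrastive loss.
    # The matching passage sits on the diagonal; every other in-batch passage
    # acts as a negative, so positive pairs alone are enough.
    labels = torch.arange(scores.size(0), device=scores.device)
    retriever_loss = F.cross_entropy(scores, labels)

    # Generator half: the answer's negative log-likelihood, "marginalized" here
    # in a simplified single-passage sense by folding in log p(passage | query).
    doc_logprobs = torch.log_softmax(scores, dim=-1).diagonal()  # log p(z | x)
    answer_nll = generator.answer_nll(batch)          # -log p(y | x, z), hypothetical helper
    generator_loss = (answer_nll - doc_logprobs).mean()

    # One total loss trains both models as a single system.
    total_loss = retriever_loss + generator_loss
    accelerator.backward(total_loss)
    return total_loss
```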
Yes — thanks, Chris. It's amazing how easy it is these days to get this end-to-end stuff up and running; you saw it's the next layer of abstraction for building AI applications. We've got a ton of great questions coming in, and we'll get to those in just a second, but the core takeaways of today are that RAG has the retrieval piece and the generation piece. After you build your simple RAG system, you're going to focus on fine-tuning it, and you'll do that by training or fine-tuning the embedding model as well as the LLM itself — and with this new open-source tool, you can actually do both at the same time. We're really excited to watch how the different evaluation metrics coming onto the scene, like RAGAS (RAG Assessment), and the different ways to look at retrieval metrics and generation metrics, will let us dial in these systems — both from the perspective of fine-tuning the embedding model and the LLM separately, and from fine-tuning them both at the same time. There's a lot of innovation happening in this space and lots of tools coming out all the time, so if you want to stay out on the edge, you've got to keep building with them, keep shipping with them, and keep sharing.

With that, we're going to get right into the great questions coming in on Slido, and I'd like to welcome Chris back up onto the stage, as well as Jacob, the CTO and co-founder of Arcee, who's going to help us answer some of these questions — they look like they go pretty deep today. So excited for this. Welcome, Jacob.

Hey there. Hey guys. Hey everybody.

All right, I'm going to kick it off with the first question, and we'll take them one at a time since nobody's really upvoting yet. Todd asks: Jacob, what's the deal with Arcee — can you tell us the origin story of the name? People are interested to know.

Yeah, sure. The name Arcee actually comes from the female Transformer — the female version of Optimus Prime. That's where it comes from, and that's why there's the pink theme and things like that all over our website.

Nice. Okay, I want to go deep into a technical question here. We basically have Anonymous asking about papers for their master's thesis: "Hi everyone, I'm developing multiple RAG frameworks for my master's thesis. Can you recommend papers or documentation to go through?" Jacob, as the guy who just built this, I'll start with you and then we'll go over to Chris — let's see how many foundational papers we can come up with today. I think this is a great way to kick things off.

Well, I think Chris is probably going to have more ideas than me, but the first few papers I'd point to are the original end-to-end RAG paper, and then the one on full end-to-end RAG training, where Shamane is re-indexing the vector DB during training with separate subprocesses, which I think is really interesting — I'll send those links over. And the last one I had in mind is the back-translation paper; I'm excited about that for synthetic data generation.

Nice. Chris, what have you got — papers you'd recommend? You know, I don't know that I'd add much to this list. The papers I'd be interested in are ones that talk about better or more intelligent ways to do retrieval, so you can extend this process a little further — perhaps adding additional steps to your retriever, and papers that discuss those steps in order to really get the maximum performance out of your application. I'll send some links that we can include in
the material afterwards, but for the most part, if you're in the RAG space, what Jacob mentioned is where you want to start.

Yeah, the OG paper — you can never go wrong getting to know that as well as anybody else; then you're in a good spot. I definitely spent my fair share of time even this week looking at it, and it's always helpful to go back to the source material.

The next question is a great one: is it a good idea to provide synthetic data, or should I manually annotate my own data? If it's for a Q&A setup, do I have to create my own Q&A data for training? Chris, what do you think?

It is a good idea. Maybe it's a controversial or hot take — I don't care — it's a good idea. The main thing to consider is that synthetic data is important because it helps us get more training points faster and cheaper. Ultimately, the gold standard is paying a human or humans to create a well-curated dataset for you, but to get from having a document to having a dataset very quickly, synthetic data gives you a very good return in performance increase at very low cost. As you build more robust applications, or as you delve into very specific or hyper-specific domains, it becomes less fruitful, because those domains might require a mastery of the domain-specific language that the LLM you're using to create the synthetic data doesn't have. But for the most part I'd say, at least for generating questions, always generate those with synthetic data and have experts answer them based on the context — it's such a light lift that, as we saw, really increases the performance of your application. Just beware, when you're using tools like OpenAI to create synthetic data, that there are some potential issues around commercializing the end product, so you might want to stick with open-source models for that generation, depending on what application you have in mind.

Hot take: good idea. Do you agree, Jacob — good idea for synthetic data?

Yeah, definitely. I think it also depends on the overall size of the context you're trying to model. If you're working with something like an 80-million-document context, then you absolutely have to do some synthetic generation there.

Speaking of which, Jacob — Travis asks: for domain-specific RAG, as mentioned (lawyers, doctors, etc.), is it advantageous to also supply a glossary of domain-specific terms? How are you dealing with this?

Oh yeah, that's interesting. That actually gets into some of the deeper work we're doing at Arcee with the DPT domain pre-trained models. There's been some work where people have started to see that training in the domain context and vocabulary produces good results — like BloombergGPT — and we're predicting there'll be a bit of an explosion of these in-domain generators that goes even further back, before the RAG training, contextualized in those vocabularies. But I suppose you probably could also put glossary entries into your vector database and make a call into there to pull them out, and we have seen people building systems with a chain of LLMs looking for acronyms and things like
that, to add some specific language into those queries before they go on to the next generator.

All right, you heard it here first, folks — we're predicting an explosion of domain-specific LLMs. Let's see, 2023 is almost up; I can't wait to see that thing explode. All right, a quick tactical question: Chris, what's the eval metric for the training process again?

So it's just exact match. If you're talking about what's used for loss, then it's — I'm forgetting the specific term because my brain is mush right now, so forgive me — I want to say moderated loss, but it's the marginalized loss. Marginalized — that's what's considered for fine-tuning the generator half of the model, which is just the loss you'd expect, marginalized by the information provided by the retrieval process. But the evaluation piece right now is exact match, which is why it's pessimistic — it's a very specific, very strict evaluation criterion.

Yeah, and if you want to nerd out on that, get into the end-to-end RAG paper, section 3.1 — I was looking at it earlier today. They've got equations there, everything you want; go back to classic stats and probability and get after it, Anonymous.

Next up, Harpreet asks: do we also need to worry about catastrophic forgetting when fine-tuning embeddings? Chris, is this a thing?

I think you always have to worry about catastrophic forgetting whenever you fine-tune anything. It's kind of like a bogeyman — it'll pop up when it does. For the most part, the process they're using, which is LoRA, makes it fairly difficult to get there, but you always have to worry about this kind of thing, and it comes down to balancing how long you train for and how much you train with. The plus side is that it's immediately obvious when it has happened, and you can restart your training and work from there.

I'd say it absolutely is a concern with some of the end-to-end RAG trainings we've seen. If you still need your model to have the generality of the general generator, then completely abandoning the generator you trained with end-to-end RAG and using the general generator with the trained retriever is a decent tactic. But hopefully we can get some better generation routines and things like that to combat this loss of generality.

Yeah, super interesting. You've got me thinking, Jacob — with the domain-specific explosion of language models coming, are they going to be able to do everything in one model? Will they take care of embeddings and generation, so if you just know you're in this domain, you pick that one up off the shelf?

Yeah — maybe we'll be making some of those.

Yeah, he probably already is. Okay, a super tactical question. Jacob, I'll send this one to you: for Q&A, is it better to work with a text file, a PDF file, an Excel dataset? Does it matter? What do you say to your customers when they ask about the data types they're trying to put into this thing?

Yeah, definitely. Right now you have to, at the end of the day, collapse it into a passage,
so if it's a PDF you'll have to parse it into some sort of text-string passage. If you're able to do that, then any data source works, but I personally don't know how the LLMs that work off spreadsheets have been doing — you guys might know more on that — I haven't seen any use cases there yet.

Yeah, so the natural-language-to-SQL stuff that we see in LlamaIndex and the like — Chris, is that to say that that stuff is not yet compatible with end-to-end RAG? A work in progress, something like that?

It sure is, but again, the issue is that we have to get to text somewhere in the pipeline. The key part, especially with the way it's built in the Arcee repo — it's very modular, and the actual process isn't expecting anything in particular, so you can build systems that extend it or add more functionality — is that at some point you need to get a string. Whether you can use these multimodal models or anything like that is a great question, but at some point you've got to get to a string, you've got to get to text. We're seeing a lot of very clever ways to turn everything that exists into text, and that's great, but we're not doing it very well within the model at the consumer level right now.

Okay, all right — watch this space, all documents coming soon. The next question is from the master's thesis — no, this is Islam, working on his master's thesis now: would it be expensive to fine-tune models? Is that an expensive endeavor? Chris, what's your hot take on this, real quick?

Super quick: it's the answer everyone looks for all the time — it depends. It can be expensive, definitely, yes, absolutely. It can also be — I'm never going to say cheap — but it can be inexpensive fairly straightforwardly. The system is clearly built with this in mind; they've already foreseen the application of things like PEFT and quantization. The tool we looked at today takes into account the fact that fine-tuning is generally an expensive endeavor; we have a lot of tools to make it less expensive, to work on smaller, more consumer-grade hardware, or to save you cash by not needing a big GPU farm. But at the end of the day, you've got to shell out some cash. The example we looked at today was done in Colab on a V100; you could do it on a T4, which is the free tier of Colab, so hopefully that gives you a benchmark. But for the most part, I'd say: it depends.

Yeah. Speaking of shelling out cash for fine-tuning, just curious, Jacob — do you have any fun stories there, any big numbers you've hit fine-tuning something?

Yeah, I mean, everybody wants more GPUs right now. We've actually had good luck finding GPU availability on RunPod — I don't know if you guys have used that.

Very cool — a quick ad for RunPod, I guess. I love Lambda Labs too; I've always loved spinning up P100s and stuff on Lambda Labs. But the nice thing about the repo Chris was showing today is that with the PEFT training you can run it on not too big of a GPU. If you have more GPU memory, like an 80 GB A100, that's kind of
ideal, but smaller GPUs work for that one too.

PEFT, LoRA, QLoRA — these things are really bringing it down to the consumer level very nicely. Okay — Manny, @neomatrix369, what's up Manny — actually, Manny's got two questions. First: what are the logits it retrieves and then reuses? I've seen this often in NLP code solutions. Chris?

Yeah, so it's just the output of the retriever. If you look at the actual repo — I don't have it up to pull up and share my screen very quickly — the idea is that it's generated in the training process. I want to make sure I'm saying exactly what it is — it might be better for Jacob if he knows it off the top of his head — but we're getting it from a cosine-similarity call. I'll let Jacob answer if he knows it more particularly off the top of his skull.

You know, I actually sort of take this one for granted from the Hugging Face libraries — it's basically just calculating the probability distribution across the query embeddings and passage embeddings.

Yeah, so it just gives us a way to tell how those two systems interact, and that's what we pass as context to do that marginalized step — and, of course, to calculate the actual loss of the retriever, which is found in the Sentence Transformers implementation of that multiple negatives ranking loss.
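As a tiny illustration of what those "logits" are — similarity scores between query and passage embeddings, turned into a probability distribution with a softmax — with toy random vectors standing in for real model outputs:

```python
# Toy illustration: cosine similarities between query and passage embeddings,
# softmaxed into a probability distribution over passages for each query.
import torch
import torch.nn.functional as F

query_emb = F.normalize(torch.randn(4, 768), dim=-1)    # 4 queries (random toy values)
passage_emb = F.normalize(torch.randn(4, 768), dim=-1)  # 4 passages

cosine_logits = query_emb @ passage_emb.T         # (4, 4) cosine similarities
doc_probs = torch.softmax(cosine_logits, dim=-1)  # p(passage | query) per row
```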
Oh man, Manny with the deep cuts — Manny's a three-time Kaggle expert, by the way, so Manny is deep-cutting us right now and I love it. Keep it coming. I just want to comment on a couple of questions coming into the chat. One: "I've been attending your sessions — is there some central place where you have links to videos, Colab notebooks, and everything in one place?" Great idea, love that; we'll work on it. And then, related to the next question I want to ask, there was a comment that it would be really great to show the input and output on the front end before the fine-tuning and then on the back end after the fine-tuning. Couldn't agree more — we probably should have found a law or healthcare example to show you first and then shown how it improves. We were scrambling to get this advanced new material out to you, and we'll definitely keep that in mind for next time.

Related to that, Manny's next question is: have you seen better-quality answers with fine-tuning the models as opposed to not doing it? Maybe sometimes it doesn't change, or it goes in the completely opposite direction. I see you nodding your head, Jacob — do you have thoughts you want to share with Manny and the group?

Yeah, I think that's in line with what I was saying earlier: there's still a lot of research to be done on creating a robust generator out of this process, and we've seen really good retriever results. So if you're actually putting this into prod, the generator that pops out might not be the generator you actually want to use. But hopefully, as research advances — and with this repo out there in the open under Apache, we can all develop on it — we'll come out with some good generators, because theoretically it should work well: your generator should really know that it can rely on its retrieval database. That just makes sense to me. It seems kind of bizarre that these separate stacks get spun up for general generators and vector databases; it seems like they should really be one thing. But I think it's still going to take a little while to get there.

Yeah, I love this perspective, because so many people come and ask, "Can I just do this in one shot?" Everybody wants to do everything in one shot, and oftentimes we say: actually, the engineering problem is breaking it down into individual chunks and then making sure each piece works together all along the way. I think that's still where we're at with RAG right now, but tools like yours are coming out and trying to solve this one-shot problem, and eventually we'll keep cranking away at it and it'll get solved. As Jacob said, it's an open-source project — you can contribute, and we can all contribute. If you understand this stuff out at the cutting edge, you're in the top percent of the 1% of folks working on it, so, hey, Jacob wants your help, and we'd love to see the AI Makerspace community help out with solving the end-to-end RAG problem as well.

A couple more questions before we close it out for the day. Travis asks, for domain-specific RAG — sorry, Concerned asks: is there any safeguard against these companies taking our hard work — Hugging Face, the Beijing embedding model company (BAAI), OpenAI, etc.? How can we develop things like this with security? Jacob, I see you nodding your head again — let's go to you on this; I bet a lot of people are asking you about it today.

Yeah, this is interesting — a really good question that cuts pretty deep. When we released this week, we actually almost thought about releasing it with an Elastic License, which means a competitor can't host the code behind an API, but anybody can take it and use it however they want. I know Hugging Face engineers have already become aware of this. The good news, though, is that they also develop out in Apache, so I think it would be okay with me if Hugging Face wanted to work on this as well. It is definitely a little tough to compete with the general players making general models, but I also don't think they'll be working on something like this — this is for domain-adapted models, where everybody's working on their own data and wants ownership of their own model in an open-source fashion. That's different. There is one company out there, Contextual AI, working on closed-source RAG and end-to-end RAG that will probably be looking at this repo, but as long as it stays out there, I think the best defense is to keep things out in the open.

Yeah, I love that — what a great way to wrap it up. The best defense is to keep everything — building, shipping, and sharing — out in the open. Thank you, Jacob; thank you so much, Chris. You guys rocked it today; really appreciate it. And with that, we're going to close out today's event. Thank you all for
your participation today. Today's event has been brought to you by AI Makerspace and Arcee. We are still very much just getting started at AI Makerspace, or AIM — our community just finished Cohort 1 and launched Cohort 2 this week. If you're interested in joining the next cohort, it's a little while out — we'll start right after Thanksgiving — which gives you plenty of time to create your first LLM application, a requirement you'll need to complete to get into our LLM Ops Cohort 3 course. We've also got a brand-new course coming soon, so watch out for that. Also, don't forget to give us a like and a follow on LinkedIn and Twitter, and connect with all of us personally, including Jacob. Until next time: keep building, keep shipping, and keep sharing. Bye everyone — we'll see you next time.
Info
Channel: AI Makerspace
Views: 17,357
Id: 0QaUqoICNBo
Length: 59min 52sec (3592 seconds)
Published: Fri Sep 22 2023