RAG with LangChain v0.1 and RAG Evaluation with RAGAS (RAG ASessment) v0.1

Video Statistics and Information

Captions
Hey Wiz, is there a way to know whether what comes out of any RAG application we build is right or correct? Well, it's really hard to say things like "it's absolutely right, absolutely correct, absolutely true." That's pretty difficult. Okay, so there are no absolutes. But is there a way to know whether changes we make to our RAG application make its performance better or worse? That we can know, absolutely. So you're saying there's a way to assess RAG systems? Yeah, like a RAG assessment. A RAG assessment, huh? Let's show everybody how to do that today. Let's do it.

All right, my name is Greg and we're here to talk RAG eval today at AI Makerspace. Thanks for taking the time to join us; shout out in the chat where you're calling in from. Today we're going to walk through a simple RAG system built with the latest and greatest from LangChain, their most recent stable release and most stable version ever. We'll also outline how you can assess your RAG systems using the RAG Assessment, or RAGAS, framework. Finally, we'll do some advanced retrieval: we'll pick one method off the shelf that's built into LangChain and show how to go about this improvement process. We're very excited to have the RAGAS co-founders and maintainers, Jithin and Shahul, joining us for the Q&A today, so definitely get your questions in the chat; anything you're curious about RAGAS, we have the creators in the house. And of course we'll see Wiz, aka the LLM Wizard and CTO at AI Makerspace, back for demos real soon.

So let's get into it. Today we're talking RAG evaluation, this black art that everybody is focused on as they build, prototype, and deploy these systems to production in 2024. As we align ourselves to this session, we want to understand what's up with this LangChain v0.1 that just came out, how we can build a RAG system with the latest syntax and then evaluate it (there are a lot of changes happening on the RAGAS side just as on the LangChain side), and finally how we can pick different tools to improve our application and then quantify that improvement using evaluation. First we'll go into LangChain, then a high-level view of RAG to see exactly where the different LangChain components fit in, and finally what you all came here for: the RAGAS metrics and how to implement the RAGAS framework. We'll be building, evaluating, and improving today, and the Q&A should be pretty dope.

So, LangChain v0.1.0. What's LangChain all about again? It's about enabling us to build LLM applications that leverage context, that are so-called context-aware: we can connect other sources of data, do lots of interesting prompt engineering, and essentially do things in the context window that make our applications more powerful. It's also about reasoning, the agentic behavior stuff; look for another event from us soon that focuses more on reasoning. Today we're focused on context, and we're doing that in the context of v0.1.0. The blog post announcing it said "the journey of a thousand miles always starts with a single step," and that's kind of where LangChain sees themselves today: langchain-core has come together, langchain-community has come together, and the versioning policy is now official.
They'll increment the minor version (v0.1 to v0.2) if there are any breaking changes, and they'll continue to support the previous minor version for a time. As bug fixes and new features come out, they'll increment the patch version, the third slot in v0.1.x, so pay attention to how quickly development moves from here, because I imagine there's a lot of great stuff on the horizon from LangChain. There was a lot of great stuff in the v0.1 release, and we're going to focus primarily on retrieval today, and also on langchain-core, which provides LCEL, the LangChain Expression Language. In terms of retrieval there's a lot you can check out and add after today's event, and then go assess whether it actually helps your pipelines, so I definitely encourage you to explore those in more detail afterward. For production components there's a lot we hope to explore in future events as well. Starting from the ground up, we want to focus on langchain-core and the LangChain Expression Language: a very easy, elegant way to compose chains. This dovetails directly into deployments with LangServe and into operating in production environments with monitoring and visibility tooling in LangSmith, so it really all starts here and lets you do industry-leading, best-practice work with these tools. Today we'll use langchain-core functionality and also leverage models and prompts as well as retrieval integrations from langchain-community. Chains, of course, are the fundamental abstraction in LangChain, and we'll use all of these pieces to build our RAG system. When we go and assess it, we'll take it to the next level with an advanced retrieval strategy, which will let us quantitatively show that we improved our RAG system.

Quick recap on RAG in general. The point of RAG is to help avoid hallucinations, the number one issue everybody's talking about: confident responses that are false. We want our applications to be faithful to the facts and fact-checkable, and we'll see that we can actually evaluate this once we build our systems and instrument them with the latest evaluation tools. The idea of RAG is to go find reference material, add that reference material to the prompt, augmenting the prompt, and thus improve the answers we generate. Visually: we ask a question, convert that question to a vector embedding representation, and then look inside our vector database, our vector store, the place where we store all of our data in vector format, for things similar to the vector of the question we asked. If we've set up a proper prompt template before we go into our LLM, something that says, for instance, "use the provided context to answer the user's query; you may not answer the user's query unless you have context; if you don't know, say 'I don't know,'" then into this prompt we inject these references: we augment the prompt. And where does the prompt go? Into the chat model, into our LLM, which gives us our answer and completes the RAG application's input and output. So again, RAG leverages models, prompts, and retrieval. In terms of models, we're going to use OpenAI
models today. One note on syntax: the chat-style models we use generally follow a system/user/assistant message scheme, and LangChain tends to prefer a system/human/AI naming instead, which personally I think is a little more straightforward. As for the prompt template, we already saw it; it simply sets us up for success so that we can inject those reference materials and generate better answers. What those reference materials contain and how they're ordered matters a lot, and that's going to be the focus of our evaluation. When we create a vector store, we're simply loading the docs (a document loader), splitting the text (a text splitter), creating embeddings (an embedding model), and storing the vectors in our vector store; then we wrap a retriever around it and we're ready to rock and RAG. Our build today leverages OpenAI models, specifically the Ada embeddings model and OpenAI's GPT models, and for data we're going to set up a RAG system that lets us query the LangChain v0.1.0 blog post. We'll read in that data and build a RAG application on the blog, so we can check whether we missed anything from this session that we might also want to take away about v0.1.0. To set up our initial RAG system, I'll send you over to Wiz to show us the LangChain v0.1.0 RAG setup.

Hey, thank you, Greg. Today we're looking at a very straightforward RAG pipeline: basically, all we're going to see is how we get context into our LLM to answer our questions, and later on we'll think about how we might evaluate that. The biggest change between this and what we might have done before is the release of LangChain v0.1.0, LangChain's first real minor version. The idea is splitting the core LangChain features out, which is exactly what Greg just walked us through. You'll see that we have mostly the same code you're familiar with and used to; we can still use LCEL as we always have, and that's staying part of the core library, but we also have a lot of different ways to add bells and whistles or different features to our LangChain application or pipeline. In this case we'll start, of course, with our classic dependency, langchain, and you'll notice we also have a specific package for OpenAI, for core, for the community integrations, as well as LangChain Hub. All of these let us pick and choose whatever we like from the LangChain ecosystem, which is huge: one of the things people often worry about with LangChain is that there's a ton of extra, unnecessary stuff in there, and this goes a long way toward solving that problem, which is awesome. So first, let's see which version we're working with, so if you're watching this in the future you can be sure: we're on version 0.1.5, so we're already at five; LangChain is hard at work over there. We'll need to add our OpenAI API key, since we're leveraging OpenAI both for evaluation and for generation, for powering the application; we're just going to use this one LLM provider today for everything.
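As an editorial aside, here is a minimal sketch tying together two points from above: the OpenAI key setup and LangChain's "system"/"human" role naming (LangChain's counterpart to OpenAI's system/user roles). This is not the notebook's exact code; the prompt wording and question are illustrative.

```python
# Small sketch of the key setup and LangChain's system/human message naming.
import os
from getpass import getpass

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

# One chat model reused throughout the walkthrough.
llm = ChatOpenAI(model="gpt-3.5-turbo")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),  # OpenAI's "system" role
    ("human", "{question}"),                     # LangChain's name for the "user" role
])

# LCEL composition: pipe the prompt into the chat model.
(prompt | llm).invoke({"question": "What is LangChain v0.1.0 about?"})
```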
When it comes to building our pipeline, it's very much the same as it's always been: we need to create an index, and then we need to use an LLM to generate responses based on the retrieved context from that index. We'll get started, as we always do, by creating the index. We can and will still use LCEL; it's important. One of the things we'll show in this notebook is that you don't have to use LCEL: they've implemented some abstractions in order to convert the base chains you're used to importing into LCEL format, so you still get all the advantages. But we'll look at LCEL today because it is an important piece of the LangChain puzzle. First, our first difference: we're going to load some data, and we're going to load it with the document loader from the langchain_community package, the WebBaseLoader. Importantly, this is not part of core LangChain; it's a community package, and it works exactly the same as it always has. Our WebBaseLoader lets us load this web page, which we do with loader.load(), and then we can check that we have our metadata, which is just for our web page; we're happy with that. Next, the second classic step of creating the index: we have a document (in this case just one), and we need to convert it into several smaller documents, which we'll do with the always-fun RecursiveCharacterTextSplitter. You'll notice this has stayed part of core, so it's in the base langchain package, hooray. We've chosen some very arbitrary chunk sizes and overlaps here and then split the documents; this session is less focused on specific LangChain RAG tricks and more on evaluation, so we're just choosing these values to showcase what we're trying to showcase. You can see we've converted that one web page into 29 distinct documents, which is exactly what we want from our splitting. Next we load the OpenAI embeddings model. You'll notice we're still using text-embedding-ada-002; we don't have to use this embeddings model, and it looks like very soon we'll be able to use OpenAI's latest models once tiktoken updates (there's a PR ready, just waiting to be merged), but until that change lands we'll stick with text-embedding-ada-002, the classic embedding model: nothing too fancy, just what we need. For our FAISS vector store, we need to get that from langchain_community, but otherwise it's exactly the same as it used to be; there's no difference in the actual implementation of the vector store, it's just coming from the community package. We pass in our split documents as well as our embedding model and away we go. Next we create a retriever, same as we've always done: .as_retriever() on our vector store. Now we can interact with it through the retriever API, and we can test it to see it working: "Why did they change to version 0.1.0?" And we get some relevant documents for that query that mention the 0.1.0 release, hooray.
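As a rough sketch of the index-building steps just described (loader, splitter, embeddings, FAISS store, retriever), assuming langchain, langchain-community, langchain-openai, and faiss-cpu are installed; the blog URL and the chunk settings here are illustrative stand-ins, not the notebook's exact values.

```python
# Sketch of the index-building steps described above; chunk settings are illustrative.
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load the LangChain v0.1.0 blog post as a single document.
loader = WebBaseLoader("https://blog.langchain.dev/langchain-v0-1-0/")
docs = loader.load()

# 2. Split it into smaller chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=750, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 3. Embed the chunks and store them in a FAISS vector store.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = FAISS.from_documents(chunks, embeddings)

# 4. Wrap the store in a retriever and sanity-check it.
retriever = vectorstore.as_retriever()
retriever.get_relevant_documents("Why did they change to version 0.1.0?")
```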
Now that we've got our retrieval pipeline set up, that's the R in RAG; next we need to look at creating the AG. What we'll do is showcase a few different ways to create a prompt template. You can just pull one from the Hub: there are lots of community-created and LangChain-created prompts there, and the idea is that you can pull one that fits your task. But the one we're showcasing is maybe not ideal, so we'll go ahead and create our own; you can still do that if you want, you don't have to use one from the Hub. We'll create this simple one: "Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know.'" That's a classic. We pass in our context, we pass in our question, away we go, and you'll notice this is exactly the same as it used to be. Let's go, LangChain. Now we'll set up our basic QA chain. I've left a lot of comments in the implementation of this LCEL chain to hopefully clarify exactly what's going on, but for now we'll just say we can create this chain using LCEL, and we want to pass out our context along with our response. This is important for the evaluations we're hoping to do with RAGAS: we need to make sure we return the retrieved context as well as the response. We'll look at another way to implement this chain a little later that makes this a bit easier while still getting the advantages of LCEL. You'll notice we're just using GPT-3.5 Turbo, that's it. Now we can test it out: "What are the major changes in v0.1.0?" The major changes are... it goes on and gives a correct answer, that's great. And "What is LangGraph?" Basically the response from the LLM is "I don't know," which is not necessarily satisfying, so later we'll see a way to improve our chain to get a better answer to that question. The next step, now that we have this basic chain, would be to evaluate it. But before we do that, let's hear from Greg about how we're going to evaluate it and what we're going to evaluate it with. And with that, I'll pass you back to Greg.
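Here's a rough sketch, not the notebook's exact code, of an LCEL chain like the one just described: it answers from the retrieved context and also passes the context through in its output so RAGAS can score it later. It assumes the retriever and llm defined in the earlier sketches; the prompt wording follows the one quoted above.

```python
# Sketch of an LCEL RAG chain that returns both the answer and the retrieved context.
from operator import itemgetter

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

rag_prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on the following context. "
    "If you cannot answer the question with the context, please respond with 'I don't know'.\n\n"
    "Context:\n{context}\n\nQuestion:\n{question}"
)

rag_chain = (
    # Retrieve documents for the question and carry both forward as a dict.
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # Generate the answer while keeping the retrieved documents in the output.
    | RunnablePassthrough.assign(response=rag_prompt | llm)
)

result = rag_chain.invoke({"question": "What are the major changes in v0.1.0?"})
result["response"].content   # the generated answer
result["context"]            # the retrieved documents, needed later for RAGAS
```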
Thanks, Wiz. That was LangChain v0.1.0 RAG; now let's talk RAG assessment. The RAGAS framework essentially wraps around a RAG system. If we think about what comes out in our answer, we can look at it, assess the different pieces within the RAG system that helped generate that answer, and use that information to decide on updates, on different things we might add to augment either our retrieval or our generation, and we can continue the process of improvement by continually measuring. But what are we measuring? This is where RAG evaluation gets particular; we have to make sure we understand the core concepts of RAG eval. To do this in an automated way we need four primary pieces of information. You're probably familiar with question/answer (input/output) pairs, and maybe even question/answer/context triples; for eval we also need a fourth component, the ground truth, the correct or "right" answer, so to speak. In practice it's often not feasible to collect a comprehensive, robust ground truth dataset, so, since we're not focused on absolutes here, we can create a ground truth dataset synthetically, and that's what we'll do today: we'll take the best model we can pull off the shelf, GPT-4, and generate the set of information that will allow us to do evaluation. We'll see how this works; it's pretty cool, and RAGAS has new tooling for it.

In terms of actual evaluation, once we finally have this data set up, we need to look at two different components. The first component is retrieval. There are two metrics that focus exclusively on retrieval: context precision asks "how relevant is the context to the question?", while context recall asks "is the retriever able to retrieve all of the context relevant to the ground truth answer?" On the generation side we also have two metrics: answer relevancy, which asks "how relevant is the answer to our initial query?", and faithfulness, which tries to address the problem of hallucinations and asks "is the answer fact-checkable from the context, or is it a hallucination?" So the four primary metrics in the RAGAS framework are those four: two for retrieval, two for generation. Let's dig a little deeper into each one so we really start grokking each metric individually, because they're similar but nuanced. Faithfulness tries to measure factual consistency. Let's look at an example: the question is "Where and when was Einstein born?" and the context is "Albert Einstein, born 14 March 1879, was a German-born theoretical physicist," etc. A high-faithfulness answer says he was born in Germany on 14 March 1879, whereas a low-faithfulness answer might get part of it right but hallucinate the rest; you want to avoid those hallucinations. We're looking at the number of claims that can be inferred from the given context divided by the total number of claims in the generated answer; to be 100% faithful to the facts, we want those two numbers to be the same.
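Written out, the ratio just described looks like this (a paraphrase of the idea from the talk, not necessarily RAGAS's exact implementation):

```latex
\text{Faithfulness} = \frac{\left|\text{claims in the answer inferable from the retrieved context}\right|}{\left|\text{claims in the answer}\right|}
```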
Answer relevancy, of course, tries to measure how relevant the answer is rather than how factual it is: we penalize the answer when it lacks completeness or, on the other side, when it contains redundant details. For instance, "Where is France and what is its capital?" A low-relevance answer is like talking to somebody who isn't paying attention to everything you said: "France is in Western Europe." Okay, but what about the other part of my question? You want the answer to be completely relevant to the input, just like a good conversationalist's answer would be. Context precision, as we get into the retrieval metrics, is a way to evaluate whether all of the ground-truth-relevant items are present in the context and how well ranked they are. What we're looking for is for the most relevant chunks returned from our vector database to appear in the top reference ranks: lots of good stuff ranked at the top, everything relevant to the question returned in our context and rank-ordered by relevance, just as you'd want if you were writing a book report or something. Finally, context recall is doing something similar to what we talked about before: we want to make sure we're paying attention to everything that's relevant and addressing everything that's asked. So if the question is again "Where is France and what is its capital?", the key here is that we're actually leveraging a ground truth answer as part of calculating this metric: "France is in Western Europe and its capital is Paris." A high-context-recall result addresses both parts, with each sentence of the output attributable to the retrieved context; you can think of it as the number of ground truth sentences that can be attributed to the context divided by the number of sentences in the ground truth. A low-context-recall result does the same thing we saw earlier: "France is in Western Europe, simple villages, Mediterranean beaches, the country is renowned for its sophisticated cuisine," on and on, but it doesn't address anything about Paris, which of course the ground truth does.
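The sentence-level ratio mentioned a moment ago, written out (again paraphrasing the talk's description rather than quoting RAGAS's implementation):

```latex
\text{Context Recall} = \frac{\left|\text{ground truth sentences attributable to the retrieved context}\right|}{\left|\text{sentences in the ground truth}\right|}
```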
Looking at these metrics holistically, we get some idea of how our system is performing overall, though it's generally difficult to get a perfect picture; these are the tools we have, and they work very well, as we mentioned, for directional improvements. Context precision conveys a high-level quality idea: not too much redundant info, but not too much left out. Context recall measures our ability to retrieve all of the necessary or relevant information. Faithfulness helps us avoid hallucinations. And answer relevancy asks: am I to the point, am I relevant to the question that was asked, or am I going off on a tangent? Finally, RAGAS also has a few end-to-end metrics; we're just going to look at one of them today, answer correctness. This is a great one for your bosses out there: you want to know if it's correct? Boom, how about we look at correctness, boss. It's potentially a very powerful metric to report to others, but beware of what's really going on; directional improvement is what we really want to focus on. It basically looks at how the answer relates to the ground truth. If we have a true, human-labeled ground truth dataset, this is probably a very useful metric; if our ground truth was generated by AI, we might want to be a little more careful about leaning on it too heavily. But if we have great alignment between ground truth and answer, we're doing a pretty good job. A quick example: we're looking at two things, factual similarity and semantic similarity. Using the Einstein example again, if the ground truth is "Einstein was born in 1879 in Germany," the high-answer-correctness answer is exactly that, and a low-answer-correctness answer gets something literally wrong. There is overlap between all of these metrics, and it's important to keep that in mind, but overall the steps for using RAGAS are: generate the question/answer/context/ground truth data (there's an awesome new way to do this, synthetic test data generation, recently released by RAGAS, and we'll show you how to get it done today), run the eval, and then go try to improve your RAG pipeline. We're going to take one simple retrieval improvement off the shelf from LangChain today: the multi-query retriever. It generates several queries from our single query, retrieves for all of them, and then returns the relevant context from each of those questions into the prompt, so we're actually getting more information. But you can pick any retriever off the shelf, go back, and check: did my metrics go up, did they go down, what's happening as I add more data or more advanced retrieval methods to my system? In this way we can combine RAGAS with RAG improvement, as Wiz will show us right now.

Oh yeah, Greg, can't wait, thank you. So RAGAS, the thing we're here to talk about: it's an amazing library that does a lot of cool, powerful things, but the most important one is that it gives us insight into the directional impact of the changes we make. While we might not be able to say these answers are definitely true, as Greg was expressing, we can say it appears as though these answers are truer than the ones we had before, which is awesome. Let's look at how to do this. First of all, in order to run an evaluation on all of the metrics, we need two important things: questions, which should be relevant to our data if we're trying to assess our retrieval pipeline as well as our generations, and ground truths. As Greg mentioned, we're going to use synthetically created ground truths; it might be more performant to use human-labeled ground truths, but for now we can let the LLM handle this. The idea is that we'll leverage RAGAS's new synthetic test data generation, which is very easy to use, much better than the process we had to do manually before. We'll go ahead and use it to create our test dataset. It's important to keep in mind that it uses GPT-3.5 Turbo 16k as the base model and GPT-4 as the critic, so we want to make sure we're not creating too much data, or if we are, that we stay very cognizant of the costs. The first thing we'll do is create a separate document pile to pull from; we're doing this to mitigate the possibility that we're just asking the same LLM the same questions with the same context, which might unfairly benefit the simpler method. So we create some new chunks with size 1,000 and overlap 200, giving us 24 docs, about the same as the 29 before. Then we use the test set generator; it really is as easy as constructing the TestsetGenerator with OpenAI (that's what we're using for the LLMs) and then generating with LangChain docs. You'll notice this is specifically integrated with LangChain; there's also a version for LlamaIndex. All we need to do is pass in our documents, the size of the test set we'd like, and the distributions.
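A minimal sketch of that generation step, assuming the ragas v0.1-era API and the separate 1,000/200 chunk pile described above (here called eval_chunks); the test set size is an illustrative placeholder.

```python
# Sketch of RAGAS synthetic test set generation (ragas v0.1-era API).
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# Defaults to GPT-3.5 Turbo 16k as the generator and GPT-4 as the critic,
# so keep an eye on cost for larger test sets.
generator = TestsetGenerator.with_openai()

testset = generator.generate_with_langchain_docs(
    eval_chunks,                 # the separate chunk pile built for evaluation
    test_size=20,                # illustrative size
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
test_df = testset.to_pandas()    # question, contexts, ground_truth, evolution_type
```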
The distributions part is quite interesting: basically, it creates questions at the given ratios from different subcategories, so it can test our system on a variety of question types. We have "simple," which is, as you might think, very simple; "reasoning," which requires more complex reasoning that might tax our LLM a little harder; and "multi-context," which requires multiple contexts, so our LLM has to pick up several of them in order to do well on that particular kind of task. The reason this matters is that we get not only an aggregate directional indication of how our system is improving, but also how it's improving across specific subcategories of questions. Very cool, very awesome; thanks to the RAGAS team for putting this in, it makes our job a lot easier. Looking at an example of the test data, we have our question, some contexts, our ground truth response, and our evolution type, which in this case is "simple." In terms of generating responses with the RAG pipeline, it's pretty straightforward. There is an integration between LangChain and RAGAS; it's currently being brought up to speed, so for now we'll just do this manually. We take our test set, which has our questions, contexts, ground truths, and evolution type (the distribution we talked about earlier); we grab a list of questions and ground truths; we ask those questions to our RAG pipeline and collect the answers and the contexts; and then we create a Hugging Face Dataset from those collected responses along with the test questions and ground truths. Each row in the dataset has a question with our RAG pipeline's answer, our RAG pipeline's contexts, and the ground truth for that response. Now that we have this dataset, we're good to go and can start evaluating. Greg has talked about these metrics in depth; the code and methodology can be found in the RAGAS documentation, which is very good. The ones we care about today are faithfulness, answer relevancy, context precision, context recall, and answer correctness, and it's as simple as importing them and putting them into a list, so that when we call evaluate we pass in our response dataset (the dataset we created above, with a row for every question) and our metrics. That's all we have to do. The test set generation is awesome and very useful, and another recent change is that RAGAS made its evaluation async, so this is a much faster process than it used to be; this run took around 42 seconds, which is much better than the times we used to see, so thanks to the RAGAS team for that change as well. We get our results: our faithfulness, answer relevancy, context recall, context precision, and answer correctness. You can see the system does all right, but these numbers in a vacuum aren't really indicative of what's happening; we want them to be high, but we're more interested in seeing whether changes we make to the system push them higher.
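A rough sketch of that evaluation loop, assuming the rag_chain and test_df from the earlier sketches and the ragas v0.1-era API: ask each test question, collect answers and contexts into a Hugging Face Dataset, and hand it to evaluate with the five metrics.

```python
# Sketch of scoring the RAG chain with RAGAS; rag_chain and test_df come from earlier sketches.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)

answers, contexts = [], []
for question in test_df["question"]:
    result = rag_chain.invoke({"question": question})
    answers.append(result["response"].content)
    contexts.append([doc.page_content for doc in result["context"]])

response_dataset = Dataset.from_dict({
    "question": test_df["question"].tolist(),
    "answer": answers,
    "contexts": contexts,
    "ground_truth": test_df["ground_truth"].tolist(),
})

results = evaluate(
    response_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness],
)
results.to_pandas()   # per-question scores, useful for slicing by evolution type
```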
Before we move on to making a change and seeing how it goes, let's look at another awesome part of RAGAS: the ability to look at these scores at a per-question level in a pandas DataFrame, so you can see all of our scores row by row. This is huge, especially because we can map these questions back to those evolution types and see how our system performs on different subsets of that distribution. Now we'll make a simple change: we're going to use the multi-query retriever, stock from the LangChain documentation, as an advanced retriever. The hope is that it retrieves more relevant context for us. We have our retriever and our primary QA LLM, so we're using the same retriever base and the same LLM base as before, and we're just wrapping them in this MultiQueryRetriever. Before, we used LCEL to create our chain; now we'll showcase the abstraction, which implements a very similar chain in LCEL without us having to write it all out. We first create our "stuff documents" chain, which takes our prompt (the same prompt as before, so we're not changing the prompt at all), and then we create the retrieval chain, which does exactly what we did before in LCEL, but without us writing all that LCEL. So if you're looking for an easier, abstracted method, here you go. You'll notice we call it in basically the same way, and the answer key is basically the response.content from before. We can see this is a good answer, makes sense to me, but we also get a better answer for the "What is LangGraph?" question. That heartens me; I'm feeling like maybe this will be a better system. Before, you might just have to look at it and say, "yeah, it feels better," but now with RAGAS we can go ahead and evaluate. We do the same process as before, cycling through each of the questions in our test set, getting responses and contexts for them, and then evaluating across the same metrics. You'll notice our metrics have definitely changed, so let's look a little more closely at how. It looks like we've gotten better on faithfulness, significantly better on answer relevancy (which is nice), a little better on context recall, and we've taken a small hit on context precision and a fairly robust hit on answer correctness. So it's good to know this change improved roughly what we hoped it would improve, and now we're left to tinker and figure out how to improve further so that answer correctness doesn't get hurt by this change. But at least we know in what ways and how, and we're now able to reason more intelligently about how to improve our RAG systems thanks to RAGAS. Each of these metrics corresponds to specific parts of our RAG application, so it's a great tool for figuring out how to improve these systems by providing those directional signals. With that, I'll kick it back to Greg to close us out and lead us into our Q&A.

Thanks, Wiz, that was totally awesome. It's great to see that we can improve our RAG systems not just by thinking "I think that's better, the LangGraph question got answered better," but by actually going and showing our bosses, our investors, anybody who might be out there
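A sketch of the multi-query upgrade and the higher-level chain helpers just described, assuming the same llm and retriever from the earlier sketches (LangChain v0.1-era API); note that create_retrieval_chain expects an "input" key and a document variable named "context", so the prompt below is adjusted accordingly.

```python
# Sketch of the multi-query retriever plus the abstracted chain helpers.
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Wrap the existing retriever: the LLM generates several variants of each query,
# and the union of their results becomes the retrieved context.
multiquery_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)

# Same instructions as before, but with the "input"/"context" variable names.
prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on the following context. "
    "If you cannot answer the question with the context, please respond with 'I don't know'.\n\n"
    "Context:\n{context}\n\nQuestion:\n{input}"
)

combine_docs_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(multiquery_retriever, combine_docs_chain)

result = retrieval_chain.invoke({"input": "What is LangGraph?"})
result["answer"]    # generated answer
result["context"]   # retrieved documents, again available for RAGAS
```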
listening: hey look, we have a more faithful system, check it out, we went from the base retriever to the multi-query retriever and improved our generations. Of course, as developers you want to keep in mind exactly what the limitations of each of these things are, but for all the folks who aren't down in the weeds with us, if they really want an answer, here's an answer. It's awesome that we can take things off the shelf that we were only qualitatively analyzing before and directionally improve our systems by instrumenting them with RAGAS and measuring before and after each small iteration to our application. So today we saw LangChain v0.1.0 used to build RAG, and then we actually did RAG on the LangChain v0.1.0 blog. Expect stable releases from here; it's more production-ready than ever. And you can measure not just faithfulness but different generation metrics, different retrieval metrics, even different end-to-end metrics. Big shout-out to everybody who supported our event today: shout-out to LangChain, shout-out to RAGAS, and shout-out to everybody joining us live on YouTube. With that, it's time for Q&A. I'd like to welcome Wiz back to the stage, as well as Jithin and Shahul from RAGAS, co-founders and maintainers. If you have questions for us, please scan the QR code and we'll get to as many as we can. Guys, welcome. Let's jump right in.

First question, I'll toss this one up to Jithin and Shahul: what's the difference between memorization and hallucination in RAG systems, and how can developers prevent hallucinated content while keeping the text rich? Yeah, you want to go for it? I didn't actually understand what's meant by memorization. You want to take a crack at this, Shahul? Yeah, what is the difference between memorization and hallucination in RAG systems? Where to draw that particular line, I don't know, but it seems like what's meant is the use of internal knowledge versus retrieved knowledge. There are situations in RAG where knowledge is a continually evolving thing: maybe the LLM thinks a person is still alive, but the person died yesterday. If that fact is then read from Wikipedia or somewhere, there will be contrasting knowledge between the LLM and what the ground truth Wikipedia says. That can be hard to overcome, because the LLM still believes something else; it's a hard problem to crack, and I hope there will be many future works on it. As for how we can prevent such hallucination: when using LLMs to build RAG, we can align the LLM so that it answers only from the given grounded text data and not from its internal knowledge, or at least so that there's a strong preference for the grounded text data over what's in the LLM's internal knowledge. That can be one of the solutions. Definitely. Wiz, any thoughts on memorization versus hallucination before we move on? I think the answer to the question was already provided. When it comes to memorization versus hallucination, you could maybe frame memorization as a slightly less negative form of hallucination, because it's likely to be closer to whatever the training data was, but in terms of a RAG
application, both are bad: we want the system to really take that context into account and stick to it. Okay, we've got a question from Yon Bors: I'm curious if you already have experience with smart, context-aware chunking; can we expect significant improvements in RAG results from smart chunking? What do you think, Jithin, is this something where we can expect improvements? Yeah, so one thing we see when building RAG systems is that how you're formatting the data is where most of the problems are. If you take some time to clean up the data and format it in a way that makes it easier for your RAG system, the performance difference is really great, because current models, if you're using a fairly capable model and you provide the correct context, will be able to use the information in that context. So all these tips and tricks to optimize, even the multi-context method Chris was using, are ways to make sure you get different context from different perspectives into the final answer. All of these different tricks can be used, and this is actually why we started this: we wanted to evaluate all the different tricks that are out there and see which works best, because it can differ depending on your domain. So yeah, smart chunking is smart. So you're saying it actually matters what data you put into these systems; just because they're LLMs doesn't mean the problem is solved for you? Yeah, it matters a lot, because what goes in comes out, so it's important that you format your data. That's right, the data-centric paradigm has not gone anywhere, people; you heard it here first: garbage in, garbage out.

Matt Parker asks, and maybe I'll send this one over to Shahul: can you compare TruLens and RAGAS? This is the first I've heard of TruLens, so maybe you can tell us a bit about what they're doing, what you're doing, and the overlap you see. Sure. TruLens has been around for a while for evaluating ML applications, and they're also doing LLM applications. RAGAS is currently mostly focused on RAG: we wanted to crack the application that most people care about, which is RAG, so we're mostly doing things that can help people evaluate and improve their RAG systems. We are not building any UI; on that side we're largely interested in providing integrations to players like LangSmith, so people can inspect their evaluations there rather than us building a UI on top of RAGAS. RAGAS mainly offers metrics and features like, as you've seen, synthetic test data generation to help you evaluate your RAG systems. I don't think TruLens has a synthetic data generation feature, which is something our developers have really liked because it saves a ton of their time: nobody really wants to go and label hundreds of documents, it's a boring job. So we're trying to double down on the points that we've seen developers really like, and we're trying to stay true to the open source community as well. Nice, okay, very cool.

Rad asks, and I'll send this one over to Wiz: can you combine the multi-query retriever with a conversational retrieval chain? Sure, yeah. Basically, LangChain works in a way where you can combine any retriever inside of any chain,
because the retriever is a kind of slot that we need to fill with something. So if you want to use a more complex retrieval process, or combine many different retrievers in an ensemble, you can do that with basically any chain. The conversational retrieval chain is looking for a retriever, and as long as yours can be accessed through the retriever API, it's going to work fine. I would add, though, that for the conversational retrieval chain you'll want to use the 0.1.0 version, which has been implemented with LCEL, but other than that you're good to go. Okay.

And back to this idea of smart chunking and smart hierarchies of data: we often talk in our classes about the black art of chunking; everybody asks, "what chunk size should I use?" So our next question asks, and maybe I'll send this one over to you, Jithin: I know chunk size matters; are there guidelines for chunking that you're aware of or that you recommend when people are building RAG systems? Yeah, so I don't have a very good guideline, maybe Shahul can back me up, but one thing I've seen personally from experience is, A, do the evaluations, and B, make sure you combine multiple levels: basically create a hierarchy where you have different chunks and then summarize them, so that all the core ideas are present in the hierarchy. That has actually been very helpful. So, exact chunk size: I haven't seen that show up in the metrics as such, but all the recursive summarization has helped, and I think LlamaIndex has a few retrievers for that. What do you think, Shahul? Yeah, just adding some more points: there is no one-size-fits-all chunk size that fits all types of documents and all types of text data; it's a relative thing. There are two ways to handle this problem. The general rule of thumb is to ensure there's enough context: the chunk should make sense even as an individual unit; if a person reads it on its own, it should make sense. How do you achieve this? You can write a set of heuristics, something like determining the document type and chunking based on that, and moving on from heuristics, I think we might even see smaller models, very small models, that are capable of determining chunk boundaries smartly, so that you don't really have to rely on the heuristics; it's a more generalizable way of doing it. I think that's where chunking is going in the future, and hopefully the problem gets solved like that. Yeah, I really like this idea of making sure each individual chunk makes sense before moving up a level and thinking about whatever hierarchical, parent-document, multi-whatever scheme you're doing; each chunk should make sense, and that's going to be dependent on your data. I really like that.

Okay, related to that, I want to go to this embedding model question in the Slido from Ron; it's similar to the chunking idea in that people always want the answer,
like "what chunk size?" Ron asks: which embedding models should I be using when I develop a system? Are there any emergent models or techniques where I can see significant improvements? Shahul, if you want to continue here. Sure. Again, there's no one-size-fits-all answer; it depends on a lot of factors. The first question is open source or closed source: you have a lot of open source players, even rivaling OpenAI, with their open source models. I think recently the Alibaba group released their M3 embedding, which is awesome, one of the most powerful open source embedding models we've seen, even rivaling OpenAI's embeddings. So it's a set of questions you have to answer. If you want an easy way to build a baseline RAG system, OpenAI's embeddings are a good place to start; you don't have to worry about anything else, and then you can iteratively improve it. That's also where RAGAS comes in: now you have an abundance of embeddings to choose from, so you want a way to compare them; use RAGAS, compare the different embeddings, choose the one that fits you, and you're done. There it is.

Just closing up this topic on chunks and embedding models: Wiz, I wonder, why did you choose Ada, and why did you choose, what was it, 750 for the chunking? Any particular reason? Zero. Zero thought was put into those decisions. We used Ada because it's the best OpenAI model that's currently implemented, and we used 750 because we basically wanted to show that those naive settings are worse than a more considered, more mindful approach, so we just kind of selected them. The thing I really want to echo from what we've heard so far is that when we're thinking about our index, our vector store, we really want to be able to represent individual quanta of information, and the closer we can get to that, the better it will be; then we can add that hierarchy on top. And I think what was said about using models to determine that at some point is definitely a future we can imagine we'll be living in soon. Yeah, and again we come back to this data-centric idea: it's easy to get the RAG system set up and instrumented with RAGAS, but you're going to get the improvements, you're going to get the thing really doing what you need it to do for your users, by doing the hard, kind of boring data work, the data engineering and data science on the front end, that you really just can't outsource to AI and have to deal with yourself.

Okay, one more "what's the answer" question; I want to send this one to Jithin: if somebody picks up RAGAS and builds a RAG system, which RAGAS metric should they use, which one should they look at first? Is there a starting point, a sequence you'd look at, or is the jury still out on this? Yeah, so we started off with the basic things: figure out what the different components are, the generator part and the retriever part, and then for the first pass just try everything out. Basically, once you know which components are doing
what, and what state all of these components are in, that gives you an idea of where you can make an improvement as fast as possible. If your generator is bad, maybe try out a few other LLMs; if your retriever is bad, figure out what's actually happening in the retriever part: is it the context relevancy, is it the recall that's bad? So to start off, try out all the metrics you have, and then for the ones that come out worst, once you understand what the metrics mean, you'll get an idea of what you can actually try in order to improve them. Cross off the low-hanging fruit first, and over time you progressively improve. But like I said, it's not the absolute values that matter, it's the trends, and you guys did a good job explaining that: go for the easiest things you can patch up fast and keep that trend moving in the upward direction. Yeah, I love it: if you're getting low retrieval metrics, maybe pay attention to retriever stuff; if you're getting low generation metrics, maybe try a different model. It's so simple when we can break it down like this. And just a shout-out to Manny: that was kind of an attempt to answer one of your many questions today; we'll see if we can get to more on LinkedIn. I think this idea of getting your system instrumented so you can start to look at it, chunk up different pieces of it, and try to improve them, there's a lot of content that needs to be made on this. These guys are open source first, open source forward; we'd love to see some folks in the community start to put guides together for how to break down and use RAGAS in sophisticated ways.

So, last question, guys, we're at time here, but what's next for RAGAS in 2024? Either of you, let us know what to expect from you going forward this year. Shahul, you want to take this? Yeah, good question. We want to go where the community takes us, so we're doubling down on things like synthetic data generation; there's a lot of interest there, and there's also a lot of interest in expanding RAGAS to other LLM tasks. There are all these interesting directions to take, and hopefully we'll get more signals from the community on which paths to pursue; we do have a lot of directions and a lot of feature requests coming in, so we have to make those decisions and move on. But as of now, the synthetic test generation is something that gets a lot of interest; we want to make it very stable and very useful, and make sure we push the limits of the closed-source models and the frameworks to build great test data that's very easy to create and easy to use. Anything to add, Jithin? Yeah, honestly, right now we have a good base, and we're very curious what we can do with evaluation-driven development, what the extremes of that are, so I'm curious to see what the community comes up with, what you all will come up with. Really excited for that. Yeah, let's see what everybody builds, ships, and shares out there, and
contributes. Well, thanks so much, Jithin, thanks, Shahul, thanks, Wiz. We'll go ahead and close it out for today, and thanks, everybody, for joining us. Next week you can continue learning with us: we're talking alignment with reinforcement learning from AI feedback. If you haven't yet, please like and subscribe on YouTube, and if you haven't yet but you liked the vibe today, think about joining our community on Discord, where we're always getting together, teaching, and learning; you can check out the community calendar directly if you're not a Discord user to see what's happening this week and in upcoming weeks. And finally, if you're ready to really accelerate LLM application development in your career or for your company, we have a brand new AI Engineering Bootcamp that covers everything you need to prompt engineer, fine-tune, build RAG systems, and deploy and operate them in production, using many of the tools we touched on today plus many more; you can check out the syllabus and download the detailed schedule for more information. Any feedback from today's event, we'll drop a feedback form in the chat. I just want to shout out Jonathan Hodges as well: we will get back to your question, and we'll share all of today's questions with the RAGAS team to see if we can get follow-ups for everybody who joined us and asked great questions. So until next time, and as always: keep building, shipping, and sharing, and we and the RAGAS team will definitely keep doing the same. Thanks, everybody, see you next time.
Info
Channel: AI Makerspace
Views: 5,509
Id: Anr1br0lLz8
Length: 64min 2sec (3842 seconds)
Published: Wed Feb 07 2024