Developing and Serving RAG-Based LLM Applications in Production

Captions
So, as you guys saw in the morning, we wanted to start actually building LLM applications ourselves, as opposed to only focusing on the infra and making it cheaper and faster. This way we actually experience the problems that you will, hopefully a lot sooner, and make the whole product experience a lot better. The first thing we did was build a RAG application, and this is a canonical use case, right? Everybody has their own data, a lot of tech companies have their own documentation, so this is usually the first use case that a lot of teams gravitate towards: just making it easier for people to work with their products. So we decided to do the same. And Ray, as many of you know, does a lot of different things, so for us it was very useful to build something like this on top of all the different capabilities Ray has, and help developers do things a lot faster and better as well.

On that topic, I want to emphasize that the two assets you need when you build such an application are the underlying documents (and there has been really great work by the Ray documentation team, including Angelina and others) and then the other one is users and the questions coming from users. If you have these two things, that makes it much easier to build this kind of application.

Actually, a quick show of hands: how many folks, internally or externally, have started building RAG-based applications at work? Okay, that's a lot of you. So I'd love to hear everyone else's insights tonight as well. The things we'll share are very empirically driven, so if you found a different insight, for example around chunk size, which we'll talk about, please definitely share it, because we're very early as a community in this space, and it would be great to hear different people's takes on all of this.

Obviously, starting simple, we began the whole application by just seeing how a base LLM would do. We tried GPT-4 and Llama 7B and 70B, and we would just ask a question. Very quickly we realized that these models have no context, or very little context, of how things work, and if they did, it would be outdated, a September 2021 cutoff sometimes, and Ray looked very different back then, if the model even had access to it. So we very quickly got to actually building the RAG app.

This is the high-level overview, but this is the forward pass once you have a query; assume the vector database is already built, and we'll talk about what that looks like in a second. Somebody asks a question, and the query gets embedded by an embedding model (you have a couple of options there). That embedding then gets passed to a vector database, where you have a couple of different options for how you calculate distance, and the embedded query is used to fetch the top-k contexts (you have options for how many top k as well). Once you get those contexts, you feed both the text from those contexts and the text from the query into the LLM. Now you've augmented the base LLM with this additional context so it can hopefully generate a correct response.

As for the actual vector database piece, we'll zoom into each of these, but basically we have a bunch of data sources. We started with our Ray documentation, and then we wanted to be able to load it, so this is very similar to what we did this morning.
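To make that forward pass concrete before getting into the indexing side, here is a minimal sketch. It assumes the chunk records (text, source, embedding) already exist, uses sentence-transformers with the gte-base model (one of the embedding options compared later), and treats the LLM as a plain callable. The helper names and prompt wording are illustrative assumptions, not the code behind the actual Ray docs bot.

```python
# Minimal sketch of the RAG forward pass described above. Helper names, prompt wording,
# and model choice are illustrative assumptions, not the exact production code.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("thenlper/gte-base")  # any embedding model works here

def retrieve_top_k(query: str, chunks: list[dict], k: int = 5) -> list[dict]:
    """Embed the query and return the k chunks with the highest cosine similarity.
    Assumes chunk embeddings were stored L2-normalized."""
    q = embedder.encode(query, normalize_embeddings=True)
    scored = []
    for chunk in chunks:  # chunk = {"text": ..., "source": ..., "embedding": np.ndarray}
        score = float(np.dot(q, chunk["embedding"]))  # cosine similarity on normalized vectors
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

def generate_response(query: str, contexts: list[dict], llm) -> str:
    """Augment the base LLM with the retrieved context and generate an answer."""
    context_text = "\n\n".join(c["text"] for c in contexts)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_text}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm(prompt)  # llm is any callable that maps a prompt string to a completion
```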
And this is the first step where we started to get a little experimental: the actual chunking. How do we want to represent our data? Maybe I'll let Philip talk about the different strategies we tried, but the naive thing everyone does is just chunk arbitrarily: set a chunk size of 100 or 300, a chunk overlap of 50, and go through all of your documents. That starts to not be as effective, so we started thinking about other ways to chunk the data more efficiently.

One thing we did was to use the sections of the HTML document. I would say there are two benefits to that. One is that in some applications you want to give references for where you got the information from, and this gives much more precise references: instead of pointing people to a whole long document, you can point them to the specific section, and when they click the link the browser goes right to that section, which is very valuable. The second is that the sections often give a good first sense of where something ends, so it makes sure I'm not chunking in the middle of a function. There are a lot of other strategies we could use here.

While we were doing all this chunking, we tried to keep it as generalizable as possible. We're still working towards this, actually, but we want to come up with a template, maybe even an open source solution, that would work for the vast majority of people's documents: not necessarily a library's docs, but any kind of HTML documents. After we chunk, we have all these different chunks to work with. We feed them into an embedding model (we'll talk about which ones we experimented with in a second), and once we have the semantic representation of all our chunks, we can index that into our vector database. The actual content we're putting into the database is the text, the source, and the embedding.

There are a lot of options for vector databases as well; in the last year and a half or so there has been an explosion of new databases. We stuck with Postgres: nice and simple, we've worked with it for many years, and even Postgres has a lot of up-and-coming features around this. Honestly, our advice here would be to go with what you're already familiar or comfortable with, or what your team uses, but a lot of the new ones are definitely worth looking at; they're coming out with a lot of LLM-app-specific features, which could be really interesting as well.

So now, when we repeat all of this across all of our docs, you have your vector database actually created (we'll talk about how to update it in a second). To actually do the retrieval, you have a query, you embed the query using, ideally, the same embedding model, and you have a query embedding. You pass that to the database and use a distance metric, in our case cosine similarity, to retrieve the top-k chunks. Once you have the chunks, you feed the text from the relevant sources and the query itself into the LLM and get the response.
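Before moving on, here is a rough sketch of the chunking-and-indexing side just described: split each page by its HTML sections (so every chunk maps back to a precise URL fragment), embed the text, and keep text, source, and embedding as the record that goes into the vector database. The Sphinx-style section anchors, file layout, and model choice are assumptions for illustration, not the exact pipeline used for the Ray docs.

```python
# Rough sketch of section-based chunking plus indexing. Assumes HTML docs pages with
# <section id=...> anchors (a common Sphinx layout); names are illustrative assumptions.
from pathlib import Path
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("thenlper/gte-base")

def chunk_page(path: Path, base_url: str) -> list[dict]:
    soup = BeautifulSoup(path.read_text(), "html.parser")
    chunks = []
    for section in soup.find_all("section"):
        section_id = section.get("id")
        text = section.get_text(" ", strip=True)
        if section_id and text:
            chunks.append({
                "text": text,
                # Precise reference: clicking this link jumps straight to the section.
                "source": f"{base_url}/{path.name}#{section_id}",
            })
    return chunks  # note: nested sections repeat their parent's text; dedupe if needed

def build_index(doc_dir: Path, base_url: str) -> list[dict]:
    records = []
    for path in doc_dir.rglob("*.html"):
        for chunk in chunk_page(path, base_url):
            chunk["embedding"] = embedder.encode(chunk["text"], normalize_embeddings=True)
            records.append(chunk)
    # Each record maps to one row in the vector store, e.g. a Postgres table with the
    # pgvector extension: INSERT INTO document (text, source, embedding) VALUES (...)
    return records
```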
Any questions so far on this V1? Yes? Oh, sorry, one second, we have a runner coming with the mic. [Audience] What are the pros and cons of building a vector DB on top of Postgres versus using something out of the box like Weaviate or Chroma DB? One pro is definitely if you already have expertise, and maybe you already have some data in there. There's a vector extension called pgvector that basically gives you a new data type, and then you can use all the existing machinery; you can even combine it with existing filters and things like that to filter down. One of the possible downsides is once you get to a very large scale: if you have a huge amount of documents, it might not be the right solution anymore. But it depends on your application. All of our Ray docs come out to less than a gig, if I recall correctly, so it really depends on your use case. For a database like Weaviate there are a lot of great integrations that we see on almost a weekly basis (I think Cohere re-ranking is now something you can get out of the box), so there are some amazing features. To get started, maybe don't experiment with everything; just go with what you're already familiar with. But as you start getting towards production, some of those more niche features might be worth exploring. There's also Elasticsearch, which I think is also coming out with more things in this direction. So it's worth looking at whether your existing tools can do it, and then looking at the other options. It depends on your use case.

Any other questions? Yes, where are you? [Audience] On the last slide, is there a limit on the number of tokens in the context? There is, and each model has different limits; we'll talk about these as well. Great question. When we do our experiments we try to treat them as independent, like the chunk size, but you can't, right? Each model is different, and the number of chunks times the chunk size together dictates how much context you can fill in. I'll talk about that, and we'll talk about the need for LLMs with larger context windows as well; generally that's the trend we should be going towards. There are also two things: some of the embedding models have hard limits, maybe 512 tokens or so, and they also might not work super well in different regimes, so it's best to experiment. And if your data dictates longer chunks, it's also worth experimenting with using multiple embeddings for each chunk and then retrieving the larger document based on your retrieval. Sorry, and the embedding models also have cutoffs as well.

Okay, so now, before we get to our experiments, we'll briefly talk about how we're performing evaluation. First we'll look at it component-wise. To us there are two major components we wanted to focus on; I think there are other pieces here as well. The first is the retrieval workflow itself. Assume that you have a golden source, and let's simplify and say there's one golden source for a particular query. I want to pass the query through our system and retrieve, say, the top-k contexts. If the golden source is in one of those top k, we count that as a success, a hit. So we use this metric to score just our retrieval process and isolate it away from what's happening with the LLMs.
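As a small sketch, that retrieval score is just a hit rate over labeled (query, golden source) pairs. This reuses the hypothetical retrieve_top_k helper from the earlier sketch, and the eval-set layout (one golden source per question) is an assumption that mirrors the simplification described above.

```python
# Sketch of the retrieval score: did the golden source appear in the top-k retrieved chunks?
def retrieval_score(eval_set: list[dict], chunks: list[dict], k: int = 5) -> float:
    """eval_set items look like {"question": ..., "source": <golden source url>}."""
    hits = 0
    for sample in eval_set:
        retrieved = retrieve_top_k(sample["question"], chunks, k=k)
        if any(chunk["source"] == sample["source"] for chunk in retrieved):
            hits += 1  # golden source appears in the top-k, so count it as a hit
    return hits / len(eval_set)
```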
Similarly, we wanted to isolate just the LLM piece. So forget about retrieving context: assume you have the best source and the text from that best source, and assuming it fits in the LLM context window, given that best-source text, how well can our LLM generate a response? This, as you may notice, is a bit more generative, definitely not as objective as the previous one, but we have these two scores to compare the components.

Now, for the quality score on just the LLM side, here's what it looks like. You have a question, we have the golden source, and you get the text from it. We ask a large language model like GPT-4 to answer using the source and the question: give us an answer, then score that answer and provide a reasoning for it. We can repeat this process across different evaluators, so GPT-4, Llama 70B, 7B, and so on. This was the first vibe check. For our eval set we had over 200 data samples, and this is again why it's really important to work with an application that you really understand: we knew the answers to a lot of these questions, we knew where they come from, and we knew what the answers should look like. So we were able to say, at the end of the day, that GPT-4 is a quality evaluator that we can then use for subsequent experiments. Phil, do you want to mention what the other ones looked like?

So basically what we did is we looked through the whole data set, we annotated everything with GPT-4, including the reasoning, and then we removed the data points where we thought GPT-4 was not doing a good job. We then used that as the golden comparison. We also tried to use Llama 70B for evaluation, and we had a feeling that the performance there was not as good, so there's still some leeway for open source models to become better.

Yeah, and I think someone posted this on social media, and I'm not sure if we're the ones who coined it, but there's a lot of nepotism going on with Llama 70B: it favors itself a lot, and you just see scores of four out of five across the board. So that's something to keep an eye on. Also on the scoring side: we picked one to five. We've worked with a lot of data sets where that's the case; I'm sure a lot of you have seen the Yelp reviews data set and things like that. Honestly, in terms of interpretability, maybe a binary score, did this work or did this not work, might have been better, but we wanted to understand, on a more granular level, how these LLMs score something like this and how the reasoning relates to the score, so we decided to do five. And actually, when we did our experiments and compared them, we were thankful that we had this kind of spread.

Yes? [Audience] What logic decides the scoring? So you just ask the LLM: you give it the context, that is, you annotate the question with the golden source of where the answer can be found in the documentation, and then you ask the LLM, given the context and the right answer, how would you evaluate the following proposed answer on a scale between one and five, and the LLM will respond with the score.
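A minimal sketch of that LLM-as-judge scoring step is below; the prompt wording and the JSON output format are assumptions for illustration, not the exact evaluation prompt used in the talk.

```python
# Sketch of LLM-as-judge scoring: give the judge the question, the golden context,
# and the proposed answer, and ask for a 1-5 score plus reasoning.
import json

JUDGE_PROMPT = """You are given a question, the reference context that contains the correct
answer, and a proposed answer. Score the proposed answer on a scale from 1 to 5 and explain why.
Respond as JSON: {{"score": <int>, "reasoning": "<why>"}}

Question: {question}
Reference context: {reference}
Proposed answer: {answer}
"""

def judge_answer(question: str, reference: str, answer: str, judge_llm) -> dict:
    """judge_llm is any callable mapping a prompt string to a completion string (e.g. GPT-4)."""
    raw = judge_llm(JUDGE_PROMPT.format(question=question, reference=reference, answer=answer))
    return json.loads(raw)  # e.g. {"score": 4, "reasoning": "..."}; add error handling in practice
```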
[Audience question] So in order to decide which LLM to use as the evaluator, we just read through everything and the scores it gave and compared, because we know Ray, right? We think about how well we think an answer does, and then we judged which one looked better. I mean, it's a bit like magic, LLMs here, black magic, but I would say it's a first pass. At the end of the day, we get an automatic pipeline, so we can generate new ideas and then automatically evaluate those ideas. Of course, as a second step, you then need to actually do evaluations with humans and with yourself and things like that, but it's a good way to get a cheap feedback loop on how well things are working.

Yeah, so we want to reduce this black magic as much as possible. This piece here is not to evaluate the whole system, but just to know which one of these LLMs is a good evaluator that we can use going forward. I'll show you what the overall evaluation looks like: given the golden source, which of these LLMs can generate a good answer and then actually attach a good or appropriate score to the answers it's generating? This way we can build trust in one of these LLMs to be used as a judge going forward. And by the way, we didn't necessarily come up with this strategy; I think the LangChain folks, LlamaIndex, and many other LLM developers online over the last couple of months have been using a similar philosophy of using an evaluator or judge as, at least, the first-pass evaluator.

So with an evaluator set, we can now do an overall evaluation. Maybe let me show this diagram, that might be a little better. Forget about the evaluator for a second. Say you have a certain configuration for your application: chunking logic, embedding model, whatever base LLM you're using. You're going to use that configuration of your RAG app to generate responses first. Then, with those generated responses, you use your evaluator, which you've previously vetted, and you ask it, for each of these generated responses: what's the quality of this response, what score do you give it, and what's the reason? So you first trust the judge, and then you trust the outputs of that judge across the different configurations you want to test.

I skipped this one, so let me just quickly talk about it. These are the experiments that we ran. There's a lot more we could do across different components as well; a few that aren't here are the distance metric you want to use in your vector database, or, for chunking, combining these strategies. But first we tried it with and without context at all, then the number of chunks, the chunk size, the embedding models, and then the base LLMs as well.
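Putting those pieces together, the experiment loop might look roughly like the following. The retrieve_top_k, generate_response, and judge_answer helpers are the hypothetical ones from the earlier sketches, the eval samples are assumed to carry the golden source text, and the config dict is just the set of knobs being varied (number of chunks, chunk size, models, and so on).

```python
# Sketch of the end-to-end evaluation loop: generate answers with one RAG configuration,
# then have the already-vetted judge score each answer; return the average quality score.
def evaluate_config(config: dict, eval_set: list[dict], chunks: list[dict], llm, judge_llm) -> float:
    scores = []
    for sample in eval_set:
        contexts = retrieve_top_k(sample["question"], chunks, k=config["num_chunks"])
        answer = generate_response(sample["question"], contexts, llm)
        verdict = judge_answer(
            question=sample["question"],
            reference=sample["golden_text"],  # assumed field: text of the golden source
            answer=answer,
            judge_llm=judge_llm,
        )
        scores.append(verdict["score"])  # 1-5 score from the judge
    return sum(scores) / len(scores)     # average quality score for this configuration
```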
So we're going to share a couple of things. Actually, before that: we were demoing this to somebody, we're collaborating with one of the co-founders of LlamaIndex, and he mentioned, hey, you guys have a rich, vibrant ecosystem, you have docs, you have people that understand this, and you have a lot of labeled data. What about folks who are just starting out, or don't have the time, or don't want to invest in creating data sets? There's a lot we can do in terms of cold start, and again this is where good chunking comes in handy. Let's say you've chunked your data: you can now use chunks of your data to generate questions. For this, we would take a specific chunk of text and ask a good-quality LLM like GPT-4, given this source of answers, to generate some queries. This is a very noisy approach, so there are a few additions we would make. First, isolate which chunks of data are actually being looked at to generate the questions. Second, actually look at the questions and take out the ones that don't make sense; some of them are just going to be super basic, things that your users will never ask. And the third thing we found is that the questions tend to be kind of basic, so maybe use some prompting to generate more intricate questions, like what users would actually ask. Usually it's like: here is fact A, and the question will be "what is fact A?", just a copy-paste, so you should be a little more creative. But this is a great way to start, and very quickly you can use it to seed a version one, or version zero, of your application, put it on staging, have real people use it, and then start using that to generate actual data and actually label it. So if you don't have a lot of time, this is still a great way to start working towards high-quality data sets.

And there's a nice bootstrapping aspect here. Unrelated to this, but at the beginning you start with a completely clean slate, and then you have this data set and you have to hand-label where the answer would come from. Once you have the first version, though, you can use the system itself to annotate a larger data set and then just check it; it's much easier to check whether the answer is actually provided in the context. So that's a good way to get things bootstrapped.

Okay, we actually just have ten minutes left. Yes? [Audience] How many examples do you need for a cold start? Oh, that's a good question. For our eval set we had over 200 samples, and for the classifier that we trained for LLM routing, which we'll talk about, we had around two thousand. I think it's really context-based. We have a training session on Wednesday where we actually teach how to build this, and we use 10 samples; you can't really do it with that, it's not a good idea, we just do it because of time. I think you're going to need at least a couple hundred to get a good sense, but more important than the number, you want a good spread of queries across the different parts of your product. For us, we want questions about core, infra, Train, all of these different pieces. It's like testing machine learning models: hopefully you can go back and have reports not just for the overall evaluation but for different parts of your product as well. So yeah, you want a good spread.
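Going back to the question-generation idea, here is a rough sketch of that cold-start step: use a strong LLM to generate synthetic questions from individual chunks so you can seed an eval set before you have real user queries. The prompt and the crude length filter are assumptions; as noted above, you still want to review and prune the output by hand.

```python
# Sketch of cold-start question generation from chunks; llm is any callable prompt -> text.
def generate_synthetic_questions(chunks: list[dict], llm, per_chunk: int = 2) -> list[dict]:
    samples = []
    for chunk in chunks:
        prompt = (
            f"Here is a passage from our documentation:\n\n{chunk['text']}\n\n"
            f"Write {per_chunk} questions a user might ask that this passage answers. "
            "One question per line, no numbering."
        )
        for question in llm(prompt).splitlines():
            question = question.strip()
            if len(question) > 10:  # crude filter; also manually drop trivial copy-paste questions
                samples.append({"question": question, "source": chunk["source"]})
    return samples
```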
Okay, we're running out of time, so maybe we'll do this part quickly. I was going to ask whether people think context helps or not. It does: RAG is definitely the right way to go here, with a big jump in quality. There are also a lot of sanity checks along the way; with no context, obviously, the retrieval score is zero. Chunk size: what do people think here, is bigger better, or does it taper off? Any predictions? Is it smaller? Okay, this gentleman says smaller. Anyone going for really big? Oh, okay, nice, so you've got some hints. No one's going for bigger is better for the chunk size? Oh, we have a couple of folks there. Okay, I guess that is true. So for us, in terms of retrieval, you can see it kept going up, and again this is empirical for our data set, it could be different for you, but in general we expect the retrieval score to go up, and then it starts tapering off. Quality actually continues to go up here, but the gains don't necessarily continue at the same rate as the chunk sizes increase. One thing that is definitely special about our data set is that there's a decent amount of code snippets, and if you get the whole code snippet, that's very good. So either you take a longer context to include the code snippet, or you have some special chunking logic that tries to capture the whole snippet.

Number of chunks: who thinks you shouldn't use too many, and who thinks you should use as many as you can? As many as you can? Okay, yes. Going back to that gentleman's question over there: we eventually stopped at seven chunks because we wanted to respect the context lengths of these LLMs. We could have continued to feed in more, but it would get truncated. In general, though, we found that more context, a higher number of chunks, is better, both for the retrieval score, obviously, but also for the quality score, even if the increase in quality starts to taper off as well. And in general, I think we already see, and we're going to see more of, a trend towards LLMs with larger and larger context windows. There's a lot of open source effort happening here as well; internally we're experimenting with techniques like RoPE extensions and others to try to extend this as much as we can. If other folks are working on this, definitely reach out to us, because it's one of the things that's top of mind for us.

Yes? [Audience question] Oh, great question, I forgot to mention. This is kind of like hyperparameter tuning, but component tuning as well. You could do the whole spread, and sometimes you'll have to multiply things out to make sure everything fits in the context window, et cetera. We decided to fix things along the way: first we experimented with context versus no context, then the chunk size, and once we decided which one was good, we fixed it there. You can certainly do it this way, but you can also open it up completely and search over everything. For the embedding experiments, we had fixed the chunk size at 500 at this point.

We did the same for embedding models. The big takeaway here is that if you look at the Hugging Face leaderboard, you'll find that gte-base is actually one of the smallest models in the top five (I may be wrong on the exact rank), but at least for our use case we found it to be more performant than the number one on the leaderboard. So I guess the takeaway is: don't strictly go with whatever is number one. Sometimes it's just a giant model that happens to perform well on the benchmarks they test, and they do test quite a few, I think five or six different dimensions and tasks, but try it on your own use case and see how it performs. We compared it with OpenAI's text embeddings as well, and we were able to decide to use the smaller open source one.
Finally, with the LLMs, we tested out these options here. Because everything else is fixed along the way, the retrieval score obviously doesn't change at this point, that logic is fixed, but for the quality score you can see it all here: GPT-4 was the clear winner, but actually Llama 70B and GPT-3.5 Turbo are not too far behind, and also there's no tuning of any kind yet, no fine-tuning on the embedding side or on these LLMs.

As for the cost analysis: for the ChatGPT models we're using OpenAI, and for the open source Llama models we're using Anyscale Endpoints. There's kind of a shocking factor here. The plot at the bottom shows quality score, and the y-axis is actually cost, but on a log scale. You can see that GPT-4 is much more expensive, but quality-wise the others are relatively close. As I mentioned in the morning, we wanted to combine the best of both worlds: we wanted to serve the most performant but also the most cost-effective option. That's when we employed this hybrid LLM routing approach. Philip, do you want to say a few words about this one?

So in this case we just annotated a data set with which model was better, and then we tried to classify. There are many different techniques, and it depends a lot on the data. Honestly, in this case the main difference was that if there's a lot of code involved, then GPT-4 does a lot better. If you study your examples a lot, you can come up with good rules here, and rule-based approaches also do pretty well.

Yeah, and we got some feedback on the blog post that I haven't updated yet: on the classifier part (number four in the blog post) I wrote that we use the classifier, but to be specific, we trained a classifier on around 1,800 data samples where we labeled, for a given query, which of these LLMs it should go to, and then we trained a supervised classifier to learn this. For this, Ray Train and Tune made all of that super easy. What did we end up using, Philip? We tried spaCy first, and then actually we just ended up using a simple logistic regression with a softmax on top. Depending on your use case, if there's more complexity or more bins, maybe you need something a little bigger, but still smaller than an LLM, like a BERT model, and tune that. We also tried BERT embeddings; I think our data set was a little too small, but we have more data now, so we'll try that again.

And someone this morning asked me: do we have to use classifiers for this? No, you could use an LLM here as well, but we don't want our users waiting two minutes for a response. We have, let's say, a certain SLA that we want to stick to for how long we think a user should wait, and we're never going to go past that, so to make that happen we use the classifier here. But as LLM inference gets faster, I don't see any reason why we can't use LLMs to make some of these judgment calls as well, especially when things can't be easily binned, or if you want to get responses from all the agents and then try to do something with all of them. So this is just the beginning; I think there's a lot more that can be done with the concept of routing and all the different components you can use in routing.
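A small sketch of what such a supervised router could look like is below: embed the query and train a simple logistic-regression classifier to predict which LLM should handle it. The feature choice (query embeddings) and the label names are assumptions for illustration; the talk only specifies roughly 1,800 labeled samples and a logistic regression with a softmax.

```python
# Sketch of a supervised LLM router: query embedding -> logistic regression -> model name.
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("thenlper/gte-base")  # reuse the same embedding model

def train_router(queries: list[str], labels: list[str]) -> LogisticRegression:
    """labels are model names, e.g. 'gpt-4' vs 'llama-2-70b' (illustrative)."""
    X = embedder.encode(queries, normalize_embeddings=True)
    clf = LogisticRegression(max_iter=1000)  # multinomial (softmax) for multi-class by default
    clf.fit(X, labels)
    return clf

def route(query: str, clf: LogisticRegression) -> str:
    """Pick the LLM that the classifier predicts will answer this query well enough."""
    x = embedder.encode([query], normalize_embeddings=True)
    return clf.predict(x)[0]
```

At serving time, route(query, clf) runs before the RAG forward pass, which keeps the routing decision well under the latency budget mentioned above.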
And I'll just end with this: you saw Sophia this morning with Anyscale Doctor, an application built on many components, including what we built as one of its many agents. This is another major theme that's already been happening, and I think we're going to see more and more of it. And now, using Ray and Anyscale to actually take something like this to production is going to be a big change in our field.

I think those are all the slides I wanted to cover today. We have about a minute left, but if people have questions we can do those. Definitely check out the blog post; all the code is open-sourced as well. And I think we're going to have a part two, and maybe more parts coming in the next couple of weeks or months. There are a lot of things that are top of mind for us, and we're going to focus on a few of them, but there's just so much that can be done here.

The big takeaway we want to leave everyone with is that iteration is key. We built something, we got it out, we got feedback, and you have to keep iterating on it. When Philip first mentioned this, it kind of reminded me of the Tesla flywheel: iteration is absolutely key, and eventually you can get to a state where the vast majority of use cases are covered and there are fewer and fewer touch points coming from us, but in the beginning there's a lot of hard work in terms of what it takes to build something like this that will actually answer people's questions.

Also, one thing here: using this as a way to improve your documentation, your underlying documents, can be very powerful. I've had multiple people say this now, and we've also seen it ourselves: if you see the wrong answers and you see which documents were fed in, sometimes you actually just uncover errors and things in the documentation, so that can be very useful. Awesome, so that's everything, and we'll be around if people have questions afterwards.
Info
Channel: Anyscale
Views: 18,288
Id: YO9jYy-HIRY
Length: 29min 11sec (1751 seconds)
Published: Thu Oct 12 2023