RAG But Better: Rerankers with Cohere AI

Video Statistics and Information

Captions
Retrieval augmented generation, or RAG, has become a bit of an overloaded term. It promises quite a lot, but when we actually start implementing it, especially when we're new to this stuff, the results are sometimes amazing but more often than not not quite what we were expecting. That's because RAG, as with most tools, is very easy to get started with and very hard to get good at. The truth is that there is a lot more to RAG than putting documents into a vector database and then retrieving documents from that vector database and putting them into an LLM. To make the most of RAG you have to do a lot of other things as well, which is why we're starting this series on how to do RAG better. In this first video we'll look at reranking, which is probably the easiest and fastest way to make a RAG pipeline better. I'll be talking throughout this series in the context of RAG and LLMs, but in reality this can be applied to retrieval as a whole. If you have a semantic search application, or maybe even a recommendation system, you can apply a lot (not all) of what we'll cover in this series, including the reranking we'll go through today.

Before jumping into the solution, let's talk a little about the problem we face with retrieval in general, and then specifically with LLMs. To ensure fast search times we use vector search: we transform our text into vectors, place them all into a vector space, and compare their proximity to a query vector, which is just the vector version of some query, returning the closest matches. For vector search to work we need vectors, which are essentially compressed representations of the semantic meaning behind the text. Because we're compressing that information into a single vector, we will naturally lose some information, but that is the cost of vector search and for the most part it's worth paying.

Vector search can give us very good results, but what I tend to find with vector search and RAG with LLMs is that I get some good results at the top, and then there's another result at, say, position 17 that actually provides very relevant context for the question I asked. What we typically do in RAG with LLMs is return something like the top three items, so we miss out on the relevant records further down. What can we do? The simplest option is to return everything and send all of it to our LLM, but LLMs have limited context windows, so we'd fill that context window very quickly. What we want is to return a lot of records, so that we get high retrieval recall, but then limit the number of records we actually send to the LLM, and that's where reranking comes in. By adding a reranker we still return all of those records from our retrieval component, but the records we actually send to the LLM are just the top three after the reranker has reordered them so the most relevant items sit at the top. The obvious question is whether a reranker really helps here, or whether we could just use a better retrieval model. We can use a better retrieval model, and that's something we'll talk about in a future video, but there is a very good reason why a reranker can generally perform better than an encoder (retrieval) model, so let's talk about that briefly.
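The retrieve-a-lot, then-send-a-few idea described above can be sketched in a few lines. Everything here is a stand-in for illustration: the vectors would really come from an embedding model, and `scorer` stands in for a cross-encoder style reranker; none of these function names come from the video's notebook.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, top_k=25):
    # First stage: fast vector search over precomputed document embeddings,
    # scored by cosine similarity against the query vector.
    scores = (doc_vecs @ query_vec) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return list(np.argsort(scores)[::-1][:top_k])

def rerank(query, docs, candidate_ids, scorer, top_n=3):
    # Second stage: a slower, more accurate scorer looks at just the
    # candidates and reorders them; only the top_n go on to the LLM.
    scored = sorted(candidate_ids, key=lambda i: scorer(query, docs[i]), reverse=True)
    return scored[:top_n]
```

With `top_k=25` and `top_n=3`, this mirrors the numbers used later in the walkthrough.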
This is what an encoder model, an encoder/retriever like text-embedding-ada-002, is doing. We have a transformer model, and the reason there are two of them on screen is that you use one inference step of the transformer to create the embedding for document A, which gives you vector A. That compressed representation is what we take across to our vector database, this point here in the vector space. In another inference step we do the same for document B, get vector B, and place that into the vector space as well. We can then compare the proximity of those two records to get a similarity score; the metric, the actual computation, would be either dot product or cosine (cosine in the case of ada-002). Now, the computational complexity of something like cosine similarity is far lower than one of these transformer inference steps. The reason we use this encoder architecture is that all of the transformer inferences can be done up front, when we build the index. That part takes a long time, because transformers are big, heavy models, whereas the cosine similarity step at the end, which runs at the moment a user makes a query, is very fast. In effect we do the heavy part of comparing documents at index-build time, so we can do very quick, simple computations at query time.

Reranking is different. Here the transformer itself is our reranker, and at query time, say document A is our query and document B is one of the documents in the database, we ask the transformer directly: how similar are these two items? To compare similarity in this case we run an entire transformer inference step. Because everything happens in a single transformer step, we don't lose as much information as we do when compressing everything into vectors, so theoretically we get a more accurate similarity score, but it's way slower. On one side you have fast and relatively accurate; on the other, slow but more accurate. The idea behind the reranking approach to retrieval is to use our retrieval encoder to filter the total number of documents down to, say, 25 in this example. Twenty-five documents is not many, so feeding them into the reranker is very fast, whereas if we fed every document into the reranker we'd be waiting a really long time. So instead we filter down with the encoder, feed the candidates into the reranker, and get three excellent results very quickly. That is how the reranking approach works; let's see how we'd actually implement it in Python.

We're going to work through this notebook. We need Hugging Face datasets, which is where we get our dataset; OpenAI for creating our embeddings; Pinecone for storing those embeddings; and Cohere for our reranker. We start by downloading our dataset, this AI arXiv set. It's pre-chunked into pieces of roughly 300 tokens, and it's basically a dataset of arXiv papers, a few of which you can see here, related to LLMs. I gathered it by taking some recent, well-known papers, like the Llama 2 paper, the GPT-4 paper, GPTQ, and so on, extracting what those papers referenced, then pulling in the referenced papers too, looping through the citations.
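The asymmetry just described, heavy transformer inference versus cheap vector comparison, is worth putting rough numbers on. The timings below are invented purely for illustration, not measured:

```python
# Assumed (made-up) costs: one transformer inference vs one vector comparison.
ENCODE_MS = 50.0    # per transformer inference step
COMPARE_MS = 0.001  # per cosine/dot-product comparison

n_docs = 41_000  # roughly the size of the dataset used later in the video

# Bi-encoder: all document inferences happen once, at index-build time.
# At query time: encode the query once, then n_docs cheap comparisons.
bi_encoder_query_ms = ENCODE_MS + n_docs * COMPARE_MS          # 91.0

# Cross-encoder over everything: one full inference per (query, doc) pair.
cross_encoder_all_ms = n_docs * ENCODE_MS                      # 2_050_000.0

# Retrieve-then-rerank: cheap first stage, cross-encoder over 25 candidates.
rerank_pipeline_ms = bi_encoder_query_ms + 25 * ENCODE_MS      # 1_341.0
```

Whatever the real constants are, the shape of the result is the same: reranking everything is orders of magnitude slower than vector search, while reranking a small candidate set adds only a modest, fixed cost per query.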
So we have a fair few records in there. It's not huge, but it's not small either: 41.5 thousand chunks, each roughly the size you see here. I'm just going to reformat the data into the format we need, which is basically Pinecone format: an ID, the text (which we'll convert into embeddings), and metadata. We won't use metadata in this example, but it can be useful, and maybe it's something we'll look at in a future video in this series.

Next we define our embedding function, the encoder model we'll use. I'm going with OpenAI's text-embedding-ada-002; it's easy to use and has fairly good performance, although there are better models, which is also something we'll talk about in the future. To run this I need to enter my OpenAI API key, which you can get from platform.openai.com. With that we can initialize our embedding model, which we do here. I'm not going to walk through all of these functions, because I've done it many times before and people are probably getting bored of that part of these videos, so I'll run through them quickly. I'll get my Pinecone credentials from app.pinecone.io, enter my API key first, then my Pinecone environment, which is shown next to the API key in the console. Mine was this; yours will probably be something like gcp-starter.

Here I create an index if it doesn't already exist. Mine does exist, and I'm not going to recreate it because it took a little while to build the other day, so you can see I already have the 41,000 records in there. You should see nothing in yours unless you've just run this or you're connecting to an existing index. This is the code I use to create and populate the index; it's pretty straightforward. The one part that is slightly more involved is where we actually create the embeddings. (I defined an embedding function up here and then ended up not using it, so just ignore that.) Here is where we do our embeddings, but wrapped in an exponential backoff function to avoid rate limit errors, which I was hitting a lot the other day. Essentially it tries to embed; if it gets a rate limit error it waits, and it keeps doing that for a maximum of five retries. You shouldn't be hitting five retries; if you are, something is probably wrong. If you are hitting rate limit errors, you might be waiting a while for this to finish; otherwise it should finish quite quickly. I was hitting tons of rate limit errors the other day and this took around 40 minutes, so just be aware of that; it depends on the rate limits set on your OpenAI account.

Now we want to test retrieval without Cohere's reranking model first, so I'm going to ask a question using get_docs. I'm just querying again, and for now I'll return only the top three records.
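The exponential backoff behavior described here (retry on rate limit, wait, give up after five attempts) can be sketched generically. `RateLimitError` below is a stand-in for whatever exception your embedding API raises; the retry logic, not the API call, is the point.

```python
import time

class RateLimitError(Exception):
    """Stand-in for the embedding API's rate-limit exception."""

def with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    # Retry fn on RateLimitError, doubling the wait between attempts,
    # up to max_retries attempts in total.
    def wrapped(*args, **kwargs):
        delay = base_delay
        for attempt in range(max_retries):
            try:
                return fn(*args, **kwargs)
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise  # out of retries: surface the error
                sleep(delay)
                delay *= 2
    return wrapped
```

In the notebook this would wrap the call that embeds each batch of chunks before upserting them to Pinecone.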
My question is: can you explain why we would want to do reinforcement learning from human feedback? RLHF is a training method, and it's a big part of why ChatGPT was so good when it was released, so I want to know why I'd want to use it. The scraping I did is not perfect, so apologies for that, but for the most part you can read the results. The first answer says it's a powerful strategy for fine-tuning large language models, enabling significant improvements in their performance by iteratively aligning the model's responses more closely with human expectations and preferences, and that it can help fix issues with factuality, toxicity, and helpfulness that cannot be remedied by simply scaling up LLMs. I think that's a good answer. The second one says it's an increasingly popular technique for reducing harmful behaviors, and that it can significantly change metrics, which doesn't really tell me any benefits. So the only relevant bit of information in this second result is "increasingly popular technique for reducing harmful behaviors", just one little piece. In the third I don't see anything that tells me why I should use RLHF; it tells me about RLHF but not why I'd actually want to use it. So these results could be better: number one good, number two somewhat relevant, number three not so much. Can we do better than that? Yes, we just need reranking.

So I'll come down here and initialize our reranking model. For that we need another API key, a Cohere API key. This one should be free; the Pinecone and Cohere accounts are free, while for OpenAI I think you need to pay a little, so just be aware of that. As I said, later in this series we'll also talk about alternatives to OpenAI for embedding models, which may actually be a fair bit better. So go to dashboard.cohere.com, find API keys, sign up and make an account if you need to, then generate a new trial key; I'll call mine "demo", create it, and paste it in here.

Now we want to rerank things, so let's try. I'm going to rerun the last query, because I only returned three records before, this time with 25, so we have many more now, and then rerank those 25. I want to see what was reranked, to compare the results. When we rerank we get back a rerank results object, and we can access the text of each result like this. The way I've set up the docs object returned from the retrieval step, it's a dictionary where each text maps to its position, and the reason for that is so I can very quickly see the reordered positions after reranking. You can see it kept position zero, the top result, but swapped out positions one and two for these two items here. Now I'll define this function, which does everything we've just gone through: it queries, gets the results, reranks everything, and compares the before and after for us. I'll set a top_k of 25, returning 25 records from the retrieval step, then return the top three from the reranking step, and run that same RLHF query. Position zero has remained the same, position one has been swapped for the record originally at position 23, and position two for the one at position 14.
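The text-to-position dictionary trick described here is easy to reproduce. `rerank_fn` below is a stand-in for the actual Cohere call (with the Cohere SDK of that era, roughly `co.rerank(query=..., documents=..., top_n=..., model="rerank-english-v2.0")`); the lookup logic is what the video is demonstrating.

```python
def compare_positions(query, docs, rerank_fn, top_n=3):
    # docs maps each retrieved chunk's text -> its original retrieval rank.
    # After reranking, look each text back up to see where it moved from.
    reranked_texts = rerank_fn(query, list(docs.keys()))[:top_n]
    return [docs[text] for text in reranked_texts]
```

A result like `[0, 23, 2]` then reads directly as "the top result stayed put, and the chunk originally ranked 23rd jumped into second place".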
The comparison won't show us the first result, because it hasn't changed, so we look at the others. The original second result is what we went through before, with its one kind-of-useful bit of information, "increasingly popular technique for reducing harmful behaviors in large language models", while the rest wasn't really relevant to our specific question, which is basically: why would I want to use RLHF? Looking at the record from position 23: it says RLHF can be used to train LLMs that act as helpful and harmless assistants. That's useful; that's a reason to use it. RLHF training also improves honesty; another reason. In other work on aligning LLMs, RLHF improves helpfulness and harmlessness by a huge margin; another reason. Three good reasons already. These alignment interventions actually enhance the capabilities of large models, combined with training for specialized skills without degradation in alignment or performance; yet more reasons to use it. So this chunk talks about RLHF like the previous number-two-ranked context did, but it's far more relevant to our specific question. That is why we use reranking models.

Now the other swap. In the original third result there was nothing relevant to our question. The reranked one does have some (and the LLM really reads all of this text, which is kind of impressive; I rarely manage to): it says the team switched entirely to RLHF to teach the model how to write more nuanced responses, which is a good reason, and that comprehensive tuning with RLHF has the added benefit that it may make the model more robust to jailbreak attempts, another benefit. It goes on to describe the process of collecting human preferences, where annotators write a prompt they believe can elicit unsafe behavior, then compare multiple model responses to the prompt and select the safest according to a set of guidelines, with that human preference data used to train a safety reward model; that part isn't directly relevant. So the relevant bits here are "make the model more robust to jailbreak attempts" and "teach the model how to write more nuanced responses". The rest is less relevant, but the chunk is still far more useful than the original, which told us no benefits of using RLHF at all.

Let's try one more: what is red teaming? It's a safety and security testing process applied to LLMs, like stress testing for LLMs. Again the top result hasn't changed, and I think the responses here were generally not as obviously better with reranking, but still slightly better. I'll just let you read them; you can pause on this one and read through if you want, and this one as well. I'm not going to go through them all again.

So that is reranking. I think it's pretty clear that it can help a lot; at least I have found it does. I don't have specific metrics on how much it helps, but from using it in actual use cases, it helps quite a bit, so I hope it's something you can also use to improve your retrieval pipelines, particularly when you're doing RAG and sending everything to LLMs. You should also test it, though, and make sure it is actually helping. For example, if you use an older reranking model, chances are it won't be as good as some of the more recent, better encoder models, so you could actually degrade performance. You always want to pair state-of-the-art rerankers with state-of-the-art encoders, and then you should see an impact similar to what we saw with the RLHF question. Anyway, as I mentioned, this is the first method I would reach for when optimizing an existing retrieval pipeline, and as you can see it's super easy to implement: you don't need to modify other parts of the pipeline, you just slot this into the middle. I'll leave it there for now. I hope this walkthrough has been useful and interesting. Thank you very much for watching, and I'll see you again in the next one. Bye!
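As a closing sketch, here is what "slotting the reranker into the middle" looks like when every component is injected as a plain function. All of the callables are hypothetical stand-ins for your embedding model, vector index, and reranker, not names from the video's notebook:

```python
def rag_retrieve(query, embed, search, rerank_score, top_k=25, top_n=3):
    # 1) Encode the query and pull a generous top_k from the vector index.
    candidates = search(embed(query), top_k)  # -> list of text chunks
    # 2) Reorder just those candidates with the slower, more accurate scorer.
    reranked = sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)
    # 3) Hand only the best few to the LLM as context.
    return reranked[:top_n]
```

Because the reranker only touches this one function, swapping it out (or removing it for an A/B comparison) leaves the rest of the pipeline untouched, which is exactly the "easy to implement" property described above.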
Info
Channel: James Briggs
Views: 43,463
Keywords: python, machine learning, artificial intelligence, natural language processing, nlp, Huggingface, semantic search, similarity search, vector similarity search, vector search, cohere ai, cohere ai tutorial, cohere ai demo, cohere llm, cohere vs openai, james briggs, pinecone tutorial, vector database, retrieval augmented generation, retrieval augmented generation tutorial, llm, chatbot using python, chatbot rag, openai, pinecone, text-embedding-ada-002, llm python, rag python, rag
Id: Uh9bYiVrW_s
Length: 23min 42sec (1422 seconds)
Published: Wed Oct 18 2023