RAG vs Context Window - Gemini 1.5 Pro Changes Everything?

Video Statistics and Information

Captions
I've been bullish on RAG for a long time, but after the Gemini 1.5 news last week, with the 1-to-10-million-token context window, and at the same time seeing this new Groq hardware that runs at 500 tokens per second, I definitely think in-context can be better for some LLM ops going forward. So today I want to take a look at using RAG versus the context window. Let's get started.

First I just want to quickly go through what a context window is and what RAG is. Here we have a model with a context window of 8K, so that's 8,000 total tokens it can process. We have our input tokens, let's say I put in 6,000 tokens, I follow up with the query, and we get some output tokens, so in total we now have 7,500 tokens. You can see that clearly fits inside our window, so everything here will be processed. You have to remember that the output tokens also count against your window. So when we query "what is the name of the YouTube channel?", and the text at the top says the YouTube channel is All About AI, we get back "the name of the YouTube channel is All About AI", because the model has that in context.

But what happens when we move over to the other model? It's the same setup, but we put in around 10,000 tokens. What I'm trying to showcase here is that this red part now falls outside the window, because we put too many tokens into the input, so the first tokens we put in are not counted; they are not included in context. When we follow up with "what is the name of the YouTube channel?", the model can't find the name because it slid out of the context window, so it has no knowledge that the YouTube channel name is All About AI. When we add this up, we end up at 11,500 tokens, which is way outside the model's context window. And if you call the API with too many input tokens, you will most likely just get an error message back. So this is something you have to remember when we talk about context windows: if you put in too much, it can slide outside the scope of the model.

And this is the problem RAG tries to solve. I just want to explain it pretty simply here. Let's say we have the same text, the input tokens from the previous example. What we can do with our context is use a model to turn it into vector embeddings, and those embeddings we can then store in a database. When we get our user query, "what is the name of the YouTube channel?", we of course turn that into an embedding too, and then we can compare the embedded user query to the embedded text to find the closest match. When we find the closest match, we return that chunk of text into our prompt. So here we found the chunk mentioning the YouTube channel All About AI, because it was the closest: we have "YouTube channel" in the query and we found "YouTube channel" in the text. Bingo, we put that into our prompt, we query again "what is the name of the YouTube channel?", and now the model can answer: "the name of the YouTube channel is All About AI". It answers the user's query with the fetched context.

So this is kind of a hack to try to solve the context window problem. If we could always bring in a very relevant chunk of text that matches our query, we could always get a good answer. But of course this brings other problems: we have a separate system that has to work well too, and it's a bit more technical than just using the context window.
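To make the token arithmetic concrete, here is a minimal sketch of the budgeting logic, assuming an OpenAI-style tokenizer via the tiktoken library; the 8K window and the reserved output budget are just the illustrative numbers from the example above:

```python
# Minimal sketch of context-window budgeting. The 8K window and the
# output budget are illustrative numbers, not any specific model's limits.
import tiktoken

CONTEXT_WINDOW = 8_000   # total tokens the model can process
OUTPUT_BUDGET = 1_500    # room reserved for the answer (output counts too)

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_window(document: str, query: str) -> bool:
    input_tokens = len(enc.encode(document)) + len(enc.encode(query))
    return input_tokens + OUTPUT_BUDGET <= CONTEXT_WINDOW

# A 6,000-token document fits; a 10,000-token one loses its earliest
# tokens, or the API simply returns an error.
print(fits_in_window(open("transcript.txt").read(),
                     "What is the name of the YouTube channel?"))
```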
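And here is the retrieval flow in compressed form: embed the chunks once, embed the query, pick the closest chunk by cosine similarity, and paste it into the prompt. A minimal sketch, assuming the OpenAI Python client and an in-memory list standing in for a real vector database:

```python
# A deliberately minimal RAG loop: embed chunks, embed the query,
# retrieve the closest chunk, and answer with it in the prompt.
# The embedding model is an assumption; any embedder would do.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunks = ["...the YouTube channel All About AI...", "...other text..."]
chunk_vecs = embed(chunks)                      # our "database"

query = "What is the name of the YouTube channel?"
q_vec = embed([query])[0]

# Cosine similarity between the query and every stored chunk.
sims = chunk_vecs @ q_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
best_chunk = chunks[int(np.argmax(sims))]

# Answer the query with only the retrieved chunk as context.
answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": f"Context:\n{best_chunk}\n\nQuestion: {query}"}],
)
print(answer.choices[0].message.content)
```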
So it has its pros and cons, of course, but the speed can be higher, the inference time can be lower, and the price can be lower, I think; we're going to take a look at that soon. Hopefully you now understand how RAG works, although this is of course a very simplified version.

The reason I've changed my mind a bit about RAG versus in-context lately is that I've seen a lot of these posts on Reddit and on X: "Gemini 1.5 Pro is still underhyped. I uploaded an entire code base directly from GitHub, and all of its issues. Not only was it able to understand the entire code base, it identified the most urgent issue and implemented a fix. This changes everything." If we look here, you can see there are very good responses, and remember, this is the 1.5 Pro version, not the Ultra version, and you can see potential fixes. So I'm really hyped that you can upload a full code base. It's different from having it in RAG, because the model gets the full context; that's a bit different from picking out what you think is most relevant. Of course, I'm no RAG expert, and I'm sure there are ways you can do this with RAG too, but the results I've seen so far with Gemini 1.5 Pro seem very promising. Of course, using a full context window of, say, 1 million tokens adds a lot of inference time, as we have seen in the Google examples.

But then I saw this from Groq. You can actually try this out for yourself now at groq.com; they have a new hardware system, an LPU or something. This is Mistral's Mixtral 8x7B, and we run "write an advanced snake game in Python". There are a lot of requests right now, so it takes a while to get through the queue, but it runs at lightning speed, around 500 tokens per second; you can see how fast this is, 518 tokens per second. If you could combine that kind of speed with this kind of model, we are really moving the boundaries, I think. Of course I understand we won't get 518 tokens per second running the big Gemini model, but if we could get something like 100 tokens per second, that would also change a lot.

And of course price is very important here. If you use the full context as input, you will of course have to pay for more tokens. I saw a post where someone argued this is why RAG is here to stay: with RAG you pay something like $0.0005 per call, because you only pick out what is relevant, but if you use Gemini 1.5 with 1 million tokens, you pay half a dollar per call, and that's going to be very expensive. But there has been some speculation that Gemini 1.5 will be something like 20 times cheaper than GPT-4. Let's say Gemini 1.5 ends up at $0.05 per million tokens, or per thousand tokens, I don't know, but per call I guess that's around the same price. Of course, with RAG we're going to fetch fewer tokens, but we all expect compute to get cheaper and prices to go down, so that might not mean so much in the future. If prices get low enough, it doesn't really matter if you save a dollar every 10 million tokens or something. If prices come far enough down, I think we can start using in-context for everything, if we have the inference speed to back it up, of course.

So now I want to run some examples where you can see the differences between using RAG and using in-context. I've lined up a few examples here, so you can see the strengths of RAG and the strengths of in-context.
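To keep the cost comparison straight, here is the back-of-the-envelope arithmetic as a tiny sketch; the prices are the speculative figures quoted above, not published rates:

```python
# Back-of-the-envelope cost per call. The prices are the speculative
# figures from the posts above, not official published rates.
def cost_per_call(tokens_sent: int, price_per_million_usd: float) -> float:
    return tokens_sent / 1_000_000 * price_per_million_usd

# RAG: retrieve only ~1,000 relevant tokens per call.
print(cost_per_call(1_000, 0.50))       # ~$0.0005 per call
# Full context: send the whole 1,000,000 tokens on every call.
print(cost_per_call(1_000_000, 0.50))   # ~$0.50 per call
```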
We have this text here; it's basically a video about Gemini 1.5 Pro that I transcribed and turned into a plain text file. Then I have this script that runs GPT-3.5 Turbo with the query "what does this text mean? write five bullet points", but here we feed the full text in context: the text file is fed in before we ask our query. On the other side we do the same with RAG: we run GPT-3.5 Turbo, turn the text file into chunks of 500, and then call with the same query.

So let's test this out with "what does this text mean? write five bullet points". You can see it was pretty quick: it "discusses the task of finding a specific passcode" and "mentions repeating certain characters". Yeah, it's not too good, but the question here was kind of bad; "what does this text mean? write five bullet points" is not really a RAG-optimized question. But let's try it now with the full context, just for comparison, and see what that brings us. Okay, the RAG one was pretty quick, and this didn't take that long either, but you can see the answers here are much better: "the text discusses the unveiling of Gemini 1.5 Pro" and "demonstrates exceptional capabilities in understanding and processing vast amounts of text". So this is what we want. You can see the differences here, but that is playing to the strength of in-context and the weakness of RAG.
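For reference, the two setups boil down to something like the sketch below. This is my reconstruction, not the actual scripts from the video; the file name, the 500-character chunks, and the word-overlap retrieval (a cheap stand-in for the embedding search sketched earlier) are all assumptions:

```python
# Reconstruction of the two test setups: full text in context vs.
# naive RAG over 500-character chunks. File name, chunk size, and the
# word-overlap retrieval are assumptions, not the video's actual code.
from openai import OpenAI

client = OpenAI()
text = open("gemini_transcript.txt").read()
query = "What does this text mean? Write five bullet points."

def ask(context: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"{context}\n\nQuestion: {query}"}],
    )
    return resp.choices[0].message.content

# 1) In-context: feed the entire transcript before the query.
print(ask(text))

# 2) Naive RAG: split into 500-character chunks and keep the chunk
#    sharing the most words with the query.
chunks = [text[i:i + 500] for i in range(0, len(text), 500)]
words = set(query.lower().split())
best = max(chunks, key=lambda c: len(words & set(c.lower().split())))
print(ask(best))
```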
So let's change the input a bit, and I'll show you what I think is a good use of RAG. Now let's ask: "what is the size of the context window in Gemini 1.5 Pro?" "The context window sizes in Gemini 1.5 Pro start at a standard 128,000 and go up to 1 million tokens." Okay, that's a pretty good answer, and it was pretty quick. Now let's do the same with the in-context one. I would say GPT-3.5 is pretty quick here too, and the answer is a bit better: it goes up to 1 million tokens, and the highest mentioned is up to 10 million tokens, but it's unclear how. So the answer is a bit better here, I think, but again, it's much cheaper to just run this with RAG. It presumably used "context window" and "Gemini 1.5 Pro" as keywords, picked out the relevant text, and fetched just that into context, while here we used the full context. You saw that with GPT-3.5 Turbo the answer was pretty quick, and I would say this answer is a bit better.

And of course we haven't talked about the multimodal features of Gemini 1.5 Pro. Here you can see they upload a 44-minute-long video into context, which translates to 696,000 tokens, and you can pretty accurately ask about a moment in the video when something happens and get a timecode back. I haven't looked too much into it, but I think RAG has some implementations where you can index images and such; I don't think I've seen anything about video yet, though I'm sure that will come. For now, I think Gemini 1.5 Pro is a bit ahead in this aspect.

But I still think RAG has some great use cases, and what I'm thinking about is document lookup. Let's say you have hundreds of thousands of documents you want to index. RAG could save you a lot of time if you have key phrases you want to put in and you know roughly what you want to search for, but you don't need the full context to get some reasoning over it; you just want to look up that document and get some information back. I think RAG is perfect for that, because once you have embedded the documents, they're stored; you don't have to put them into context each time (see the sketch below).

So I think RAG definitely has a future in this space, but I'm leaning more and more towards, when you have a very high-impact task, maybe a code base or something, uploading every single token into context. I think you could get better results doing that. Like I said earlier, I'm no RAG expert, but my intuition is that if you can put every single token in context and you don't get the lost-in-the-middle effect, and it looks like Google has fixed that, then in-context is better for highly critical things like code. But of course RAG still has a big part to play here.

Personally, though, I'm more excited about the long-context direction we've seen with Gemini 1.5, now up to 10 million tokens; that is just crazy. I just feel more comfortable knowing that every single token I put into context is going to be included when the language model processes my query. With RAG I'm not so sure what is actually going to be put into the prompt; I guess you can print it out and see for yourself, but I feel more confident that if I put the whole code base into context, I know everything is there. And like I said, if we don't have the loss we have seen in the middle, that could be great, a real leap forward, I think. So it's going to be really exciting to see what kind of response OpenAI is going to have for this. Are they going to bring out a 1 million token window? I'm not sure; we just have to wait and see.
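On that document-lookup point, the "embed once, store, reuse" property is easy to see in code. A minimal sketch, using a .npy file as a stand-in for a real vector database; the file name and embedding model are assumptions:

```python
# Sketch of the "embed once, reuse forever" property of RAG,
# with a .npy file standing in for a real vector database.
import os
import numpy as np
from openai import OpenAI

client = OpenAI()
INDEX_FILE = "doc_index.npy"

def build_or_load_index(docs: list[str]) -> np.ndarray:
    if os.path.exists(INDEX_FILE):
        return np.load(INDEX_FILE)     # embedded once, loaded ever after
    resp = client.embeddings.create(model="text-embedding-3-small", input=docs)
    vecs = np.array([d.embedding for d in resp.data])
    np.save(INDEX_FILE, vecs)          # pay the embedding cost only once
    return vecs
```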
But for now, I think this is a big step for Google, and there's a lot of excitement around it on X and Reddit, which is pretty cool. That was what I had for today, so thank you for tuning in, have a great day, and hopefully I'll see you again on Sunday.
Info
Channel: All About AI
Views: 16,794
Keywords: gemini, gemini 1.5 pro, RAG, context window, rag vs context window, retrieval augmented generation, token window, gemini context window, ai, openai, google, google ai, ai engineer, Groq
Id: ghJH2ZKQezY
Length: 13min 34sec (814 seconds)
Published: Wed Feb 21 2024