Really Long Context LLMs - 200k input tokens

Video Statistics and Information

Captions
This video covers language models that are capable of really long context lengths. I'll show you how to get good performance out of these models and how to run inference on them.

I'll start by giving you an overview of the Yi model, available in 6 billion and 34 billion parameter sizes and capable of going up to 200,000 tokens of context length. I'll talk about the chat formats for that model and how to get good inference out of it. I'll then touch a bit on long context models in general, because they're not all equal: even if they claim 100,000 or 200,000 tokens of context length, the question is whether they can do pass key retrieval and whether they can provide coherent responses, and I'll explain why both of those matter for good results overall. Thirdly, I'll show some performance evals: a summarization case at 107k of context length using two A100s, and then some inference using a one-click RunPod template I've developed, where I'll show just 16k of context, which allows you to run on an A6000. I'll briefly show some function calling models I've developed, in case you want to use these long context models in conjunction with function calling. Lastly, I'll cover some resources around inference and around chat fine-tuning, like what I've done to develop the Yi chat supervised fine-tuning (SFT) model.

Bear with me now as I talk you through the Yi model. It's a very powerful model available in 6B and 34B sizes, but of interest for this video, it's available in a version that's been trained with a 200,000-token input context window. Importantly, the initial model was not released with a fully open license, but as of mid-November the current model can be used commercially, although full details of that license have yet to emerge. The base Yi model is not immediately compatible with a lot of the packages used for inference, so there's a llamafied version on Hugging Face that basically moves it into a Llama 2 style format. Furthermore, even though this llamafied version is more compatible, it's not a chat fine-tuned model, so if you prompt it, it will in many cases just keep going without stopping, because it hasn't been fine-tuned to emit an end token and respond in a chat style. For that reason I've developed a series of chat fine-tuned models using open source data, so the model can still be used commercially. This is the Trelis 34B, 200k-context, llamafied model: it's compatible for inference with a lot of the packages, and it's further fine-tuned in a chat format so that it will respond to your query and then stop once it has provided the answer. This model is available in 34B and 6B versions for purchase.

Now, even if a model says it can handle 200k of context length, that doesn't necessarily mean it's a good model. There are two criteria I like to use for what I consider a good long context model. First, the model should be able to retrieve a pass key placed anywhere within that context length. If it can't retrieve the pass key, to me that means it's not actually seeing all of the text input, and this is often what you see with Claude: it can accept a long context, but it's not able to retrieve a pass key from everywhere within the text, which makes me question whether all of the text is really being processed. Second, even if a model can do pass key retrieval, it needs to be able to provide a coherent response.
There are some models, often ones that are not chat fine-tuned, that can do pass key retrieval on long contexts but can't provide a coherent response, which is obviously needed if you want to actually use a long context model. So a good model should be able to retrieve a pass key from anywhere, and it should also be able to provide a coherent response. To illustrate this, I'm going to show you a version of the Yi model that is not chat fine-tuned; we'll look at pass key retrieval and also summarization. I'll compare that with the chat fine-tuned model I've developed, and then I'll also show a quick example using Claude and its difficulty with pass key retrieval.

I'm running in a Jupyter notebook now with an A100, and I'm going to run the LLM comparison notebook, which is available in the inference repository that's for purchase; I'll show more about that at the end of the video. In this notebook we compare two models, but first let me connect with my Hugging Face token, which gives me access to gated models. The two models I'm comparing are the Yi model that's not chat fine-tuned and the Yi model that is chat fine-tuned, so let's accept those. Next I do a series of installs: notably Transformers; bitsandbytes, because I'm going to use NF4 quantization to make the model small enough to run on the A100; and Flash Attention. We'll be using v2 of Flash Attention, which includes flash encoding and flash decoding and is very helpful for improving speed at long context. That takes a moment to install, so in the meantime I'll move down to loading. I'm loading two models, Model A and Model B: A is not chat fine-tuned, B is chat fine-tuned. Notice that I'm quantizing both models with four bits so it's an apples-to-apples comparison, and I'm setting Flash Attention 2 to True. Those models will then need to download and run. Note that I'm using the 6B models just because I want a fast demonstration, but I do recommend the 34B models; in fact, a quantized 34B is going to be better than a 16-bit 6B model, so for the best summaries or the best long context performance I'd recommend the NF4 quantization of the 34B Yi model. You can see the models downloading: there are two safetensors files for each of these 6B models, about 12 GB in total per model, and once they're loaded we'll print out the maximum number of position embeddings, which will be 200k because that's the context length the model was trained on.

Just a note on setting up the tokenizer for Yi: there are two things you need to do differently than when loading Llama models. You need to install SentencePiece, the specific tokenizer package, and when you load the tokenizer you need to set trust_remote_code to True. I've already done that for the second model, but I need to do it for both. Once that's done I can run the tokenizers, and here we print out the beginning of sequence token for Model A and the same for Model B. Actually, the beginning of sequence token doesn't seem to be used in the Yi models; only the end of sequence token is used, at the end of the assistant response, and it's an end-of-text style token rather than </s> like Llama.
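To make that loading step concrete, here is a minimal sketch of what the notebook does. The repo name 01-ai/Yi-6B-200K is an assumption for illustration (the paid Trelis chat-SFT repo would be loaded the same way), and you'd need `pip install transformers bitsandbytes sentencepiece flash-attn` first.

```python
# Minimal sketch of the loading step described above, not the exact notebook code.
# Assumption: the base 200K repo "01-ai/Yi-6B-200K"; swap in the chat-SFT repo if you have it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "01-ai/Yi-6B-200K"

# NF4 quantization so the model fits comfortably on a single A100
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Yi needs trust_remote_code=True and the SentencePiece package installed
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # Flash Attention 2 helps a lot at long context
    device_map="auto",
    trust_remote_code=True,
)

print(model.config.max_position_embeddings)  # expect 200000 for the 200K models
print(tokenizer.eos_token)                   # end-of-text style token, not </s>
```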
Next, I want to do a quick inference test: asking a simple question, to list the planets in our solar system, and having the model respond with that list. Notice the prompt format here: the user turn is "Human:" (a colon and a space) and then "Assistant:", so it's quite a simple format. This is specifically what I fine-tuned the model on when developing my chat model, and if you're going to operate the raw Yi model I recommend you use this style as well.

The models have now been loaded: the weights have been downloaded onto the server and the shards loaded onto the GPUs. You can see the max position embeddings is 200k for both the base model and the chat fine-tuned model. The tokenizers have been loaded, and the end-of-text token is used to indicate the end of the output; that works well for the chat fine-tuned model but not so well for the base model, as we'll see. Down here we've run the simple question, list the planets in our solar system, and you can see very clearly the difference between the base and chat fine-tuned models. In the base model, we ask it to list the planets, and the assistant responds with "What is the name of the planet that is closest to the Sun?" and then continues to alternate between human and assistant turns, because it's not chat fine-tuned. In the chat fine-tuned model, the assistant lists the eight planets and even mentions two dwarf planets, one of which, Pluto, used to be considered a planet, so that's correct; and it gives a clean stop rather than rambling on. That illustrates the difference between the base and chat fine-tuned models at 6B, and the same holds for 34B.

Now that I've explained how the chat model compares to the base model, I want to show you pass key retrieval. Here I'm running only the chat fine-tuned model, asking it to respond only with the pass key contained within the text, and I'm helping it a bit by starting the answer with "the pass key is" and letting it complete the sentence. There are about 12,000 tokens passed in, and you can see right at the bottom that the pass key has been successfully retrieved; as a reminder, this is the 6 billion parameter Yi model. I then increased the number of input tokens all the way up to almost 200,000, and it's still able to retrieve the pass key, which is remarkable. But this is an example of a model that does well on pass key retrieval yet still struggles to hold a good conversation, as we'll see next with summarization.
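For reference, here is a rough sketch of what that pass key test looks like, under the same assumptions as the loading snippet above (`model` and `tokenizer` already loaded, and a long filler string such as the Berkshire Hathaway transcript in `filler`). The pass key value and prompt wording are illustrative, not the exact notebook code.

```python
# Sketch of the pass key retrieval test described above.
# Assumptions: `model`/`tokenizer` from the loading snippet; `filler` is a long transcript string.
import random

def build_passkey_prompt(filler: str, n_chars: int, passkey: str = "8391625") -> str:
    body = filler[:n_chars]
    insert_at = random.randint(0, len(body))  # hide the pass key at a random position
    needle = f" The pass key is {passkey}. Remember it. "
    stuffed = body[:insert_at] + needle + body[insert_at:]
    # Simple Human/Assistant format, with "The pass key is" pre-filled to help the model
    return (
        "Human: " + stuffed
        + "\n\nRespond only with the pass key contained in the above text."
        + "\n\nAssistant: The pass key is"
    )

prompt = build_passkey_prompt(filler, n_chars=48_000)  # roughly 12k tokens of filler
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print("input tokens:", inputs.input_ids.shape[-1])

output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output[0, inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```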
So let me show you what I mean when I ask the model to summarize instead of finding a pass key. Here I ask it to respond with a brief summary of the text below, again with a similar number of tokens going in, about 12,000. At the bottom there is indeed a summary, and it's reasonably good: it tells us the text is about Berkshire Hathaway's annual meeting. So the performance of the 6 billion parameter model is actually reasonably good around 12,000 tokens, and it continues to be okay up to about 20,000 tokens, although really I don't recommend using this model much beyond about 10,000 or 12,000 tokens.

Let me show you what performance looks like as I increase the amount of context I'm feeding in. Now I ask for a summary of 28,000 tokens, and scrolling right down to the bottom of the results, the model is already more or less refusing to respond. If I go even further, say to 42,000 tokens, it takes a little longer to run, but the point is that even though the model can retrieve pass keys, that doesn't mean it can consistently provide a summary. Here it's still having trouble, and if you keep rerunning at different lengths you'll find that sometimes it just starts repeating text and sometimes it produces nonsensical text. This illustrates that pass key retrieval is not enough on its own; I'd call it a necessary condition, but not a sufficient one. A rough sketch of this kind of length sweep is shown below.
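Here is a hedged sketch of that length sweep, reusing the same assumed `model`, `tokenizer`, and `filler` as the earlier snippets; the character counts are rough stand-ins for the token lengths mentioned above.

```python
# Summarization length sweep, same assumptions as the pass key sketch above.
def build_summary_prompt(filler: str, n_chars: int) -> str:
    return (
        "Human: " + filler[:n_chars]
        + "\n\nRespond only with a brief summary of the above text."
        + "\n\nAssistant:"
    )

# Roughly 12k, 28k, and 42k tokens of input (at ~4 characters per token)
for n_chars in (48_000, 112_000, 168_000):
    inputs = tokenizer(build_summary_prompt(filler, n_chars), return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=300, do_sample=False)
    summary = tokenizer.decode(out[0, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
    print(f"--- {inputs.input_ids.shape[-1]} input tokens ---")
    print(summary[:400])
```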
Next I'll show an example using Claude, and here you find the opposite behaviour. Claude is able to provide summaries — I ran almost exactly the same text through Claude at a similar length and it gave a coherent summary — but it's not able to do pass key retrieval, so it's the opposite of the 6 billion parameter model. As you can see, Claude doesn't recognise that there's a pass key within the text. What can happen with certain language models is that they essentially ignore, or don't see, parts of the input context; that lets them produce a consistent response by working with an effectively narrower window, but it also means they're not processing the full text. Interestingly, the recently released OpenAI GPT-4 Turbo model, which goes above 100k of context, seems to do pass key retrieval very well, and I'll show an example of that later in this video when we run the Yi model on an API endpoint using RunPod.

Just to recap so far: there are models that do pass key retrieval very well, like the 6B version of Yi, but can't produce clean summaries or coherent text beyond a certain length, probably around 20,000 tokens for the 6 billion model. There are also models, such as Claude, that produce coherent summaries but can't do pass key retrieval, which means they aren't fully synthesising the entire text. What you really want is a model that can both provide a coherent summary or response and do pass key retrieval throughout the full context. GPT-4 Turbo can do this up to about 100,000 total tokens, as I'll show a bit later. What's interesting is that the Yi 34 billion model achieves both: pass key retrieval all the way through, like the 6B model, while also providing coherent summaries at very long context.

So let me show an example of the Yi 34B model doing well on 107,000 tokens of context. Here we are with the exact same LLM performance comparison notebook, but this time I've loaded the 34 billion chat fine-tuned model, so it responds coherently. I'll scroll right down to the example I've run, which asks the model for a brief summary of about 107,000 tokens — in fact exactly 107,000, because that's what I measured. Scrolling to the bottom, right after "respond only with a brief summary of the above text", here is what Yi 34B says: Berkshire Hathaway's 2023 meeting was held on May 6th; the meeting was attended by thousands of shareholders; it began with a summary of the first quarter earnings; it goes on to list some other topics the meeting discussed; and overall the meeting provides shareholders with valuable insights into the company's operations and future plans.

I was able to run this, with over 100k of context, on two A100 GPUs. I'm running bitsandbytes NF4 quantization, which is the best quality quantization; it's not the fastest — it's slower than AWQ and slower than GPTQ — but it's better in quality than both, and you can see I get quite a good summarization response. I suspect that if I pushed even higher towards the 200k limit, which might require up to four A100s, I'd probably still get good summary quality. This is really impressive: an open source model that, with some further chat fine-tuning, can deliver a very good summary over a really long context. Roughly speaking, running in quantized form with bitsandbytes NF4, you can get up to maybe 40,000 tokens on a single A100, you need about two A100s to get out to 100k of context, and probably three or four A100s to go all the way to 200k.

Next up, I want to talk about inferencing these long context models, and I'll show you how to do it using RunPod. I'll put a link below to a template you can use to get started; it's called "long context 200k Yi 34B chat", and we're going to run it on an A6000, which is quite a good value server. I'll click deploy, but notice I'm going to make some edits. This is a 34B model: in full FP16, 16-bit format, it would take about 70 GB; in bitsandbytes NF4 it takes about half of that, roughly 35 GB. An A6000 has 48 GB, so 48 minus 35 leaves a little over 10 GB of space for the KV cache, which is what the sequence length needs. So I have to reduce the max input length: I'll set it to 16,000, put the max total tokens at 17,000 to allow myself 1,000 output tokens, and set the max batch prefill tokens to 16,000. You can also use AWQ, which I've made available as well for those who have purchased this SFT model; it's faster, probably by 2x, although the quality is a little worse than bitsandbytes, which has a better data format. So we set those overrides and get the pod running.

Here's the pod running, fully loaded; you know it's running when the host name defaults to the zeros. You'll see there's a pod ID, which you'll need for sending curl requests. The pod is running TGI (text-generation-inference), so when I send curl requests I get responses straight back from that server. For inferencing this model I'll be using some scripts from the Llama server setup repo, an inference repository that's available for purchase. It covers a number of things: setting up an AWS server, using RunPod, function calling, and now long context. It gives you a complete guide that lets you run a chat interface (chat-ui from Hugging Face) against the endpoint, or just query it directly with curl requests.
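If you'd rather query from Python than curl, here is a rough equivalent using TGI's /generate route. The pod ID and the exposed port in the URL are assumptions you'd replace with your own values from the RunPod console; this is not the repo's actual script.

```python
# Hedged sketch of querying the TGI endpoint on RunPod.
# Assumptions: POD_ID is your own pod ID, and TGI is exposed on port 8080 via the RunPod proxy.
import requests

POD_ID = "abc123xyz"  # hypothetical pod ID
URL = f"https://{POD_ID}-8080.proxy.runpod.net/generate"

payload = {
    "inputs": "Human: List the planets in our solar system.\n\nAssistant:",
    "parameters": {"max_new_tokens": 500, "temperature": 0.01},
}

resp = requests.post(URL, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["generated_text"])
```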
I've cloned that repository to my local machine, so let's take a look at some of the scripts. The first is pass key retrieval: passkey.sh simply takes the Berkshire text, injects a pass key, and asks the model to return it. We'll test this at 16,000 tokens, the maximum we defined on the server. I run passkey.sh and it sends a request to the server; it actually prints out the full prompt, which is probably a bit much because the prompt is quite long, and you can see it responds with the pass key, so it passes this test easily.

The next test is summarization. Again we send in about 16,000 tokens of the Berkshire text and ask for a brief summary, this time with summarize.sh (spelled with a z). This runs in real time, so you get a sense for how long it takes to get a response back on 16,000 tokens; note that it doesn't necessarily take much longer per generated token at longer context, because TGI uses flash decoding, which makes use of the full GPU and greatly increases speed for long context. Here's the response: the previous text is a transcript of the Berkshire Hathaway annual meeting, which took place on May 6th; the meeting was attended by shareholders; and it concluded with a discussion of the company's investment strategy and commitments. So we have a coherent summary, and again this is the 34B model.

As I said, there's also a commercial model from OpenAI, GPT-4 Turbo, that performs quite well on pass key retrieval. There's a script in this repo that lets you test it, so here we can run a quick test with an input of 60,000 tokens through OpenAI. I run the OpenAI pass key script, which does pass key retrieval on 60,000 tokens, and I believe it won't have a problem — and indeed, the content it responds with is the pass key.
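For reference, here is a rough Python equivalent of that OpenAI test using the official openai client (v1+). The model name, pass key value, and the `filler` transcript string are assumptions for illustration, not taken from the repo's script.

```python
# Hedged sketch of a GPT-4 Turbo pass key test.
# Assumptions: OPENAI_API_KEY is set, and `filler` holds a long transcript string.
from openai import OpenAI

client = OpenAI()

passkey = "8391625"
haystack = filler[:240_000]  # roughly 60k tokens at ~4 characters per token
midpoint = len(haystack) // 2
haystack = haystack[:midpoint] + f" The pass key is {passkey}. " + haystack[midpoint:]

response = client.chat.completions.create(
    model="gpt-4-1106-preview",  # assumption: the 128k-context GPT-4 Turbo snapshot
    messages=[{
        "role": "user",
        "content": haystack + "\n\nRespond only with the pass key contained in the above text.",
    }],
    max_tokens=20,
)
print(response.choices[0].message.content)
```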
To sum up for a third time, which is helpful because there's so much involved here: there are only a limited number of models that can both do pass key retrieval and provide good summaries. GPT-4 Turbo is one example; it can go to around 100k of context. Yi 34B is another: even in quantized NF4 format it works well at 107k of context, and it probably works well at 200k, but you'd need to be running about four A100s to get that kind of performance. What I just showed here, using an A6000, gives you good summarization on 16,000 tokens of context.

If you're interested in a function calling version of the Yi model, I have a chat fine-tuned version that's been further fine-tuned for function calling; you can find the details on Hugging Face if you'd like to purchase that model.

Zooming out a little to talk about long context model resources, let me go through a few things that might be useful. I'll put links below to the raw base models for you to try out yourself. If you're interested in the chat fine-tuned models, you can find them on Hugging Face by searching for Trelis Research and looking for "chat SFT" along with the Yi 34B or 6B names. As I mentioned, function calling models are also available; you can find them on Trelis.com or via the links below.

To develop the chat fine-tuned model, I went through a process of supervised fine-tuning using an Open Assistant dataset that I modified for the Yi chat format. If you'd like to chat fine-tune your own models, you can do so by purchasing the chat fine-tuning script, which is available on Trelis.com; you can also purchase the entire repository, which covers a wide variety of training: supervised, unsupervised, direct preference optimization, quantization, and more. Lastly, I talked a bit today about inference and how to make good choices if you're setting up an endpoint for long context; that's covered in the Llama server setup repo, which, despite the name, covers inference for a wide variety of open source models and can help whether you want to set up your own server on AWS or, which I find quite handy, just run quickly and cost-effectively on a service like RunPod.

In summary, it's now possible to get over 100k of context length with open source language models and the help of a little chat fine-tuning, thanks to the Yi 6 billion and 34 billion models. From my testing, with the 6B models you get good pass key retrieval, but coherent responses only up to around 20,000 tokens. If you want to go higher, to at least 100,000 tokens of context and probably more, maybe even 200,000, you should use the Yi 34B. My recommendation for inference is to use it with bitsandbytes NF4 quantization, or, if you can use more GPUs, run it in 16-bit precision. As I showed, you can run with 100k of context using bitsandbytes NF4 on two A100s, or with less context, 16k or maybe a bit more, on an A6000, which is quite good value at about 79 cents per hour on a service like RunPod.

Let me know your questions on long context. It's been exciting for me to see 100k of context become possible; up until now it's been challenging to go much beyond around 30k, especially for summarization and non-coding tasks. So let me know your questions below, and thanks for watching.
Info
Channel: Trelis Research
Views: 1,391
Keywords: long context llm, yi long context, yi llm, yi 200k context, 200k context llm, 100k context llm, 100k context large language model, yi llm chat
Id: cmf1lzbyxY8
Length: 25min 17sec (1517 seconds)
Published: Mon Nov 20 2023