Get started with Gemma, Google's NEW open-source LLM

Video Statistics and Information

Captions
Hey, Prince here, and today we're going to talk about Gemma. I'm going to show you a code example of how to run Gemma locally on your laptop. Google just announced Gemma, a lightweight family of large language models open-sourced by Google, meaning you can use it commercially or for research purposes. The model comes in two sizes, 2 billion and 7 billion parameters. One thing that is quite interesting is that the 7 billion parameter model actually outperforms Mistral 7B, which was the previous king of open source. For today's tutorial we are going to summarize Google's Gemma report using Gemma itself.

The first step is to download the model locally in a format that can run in our environment. In my case I'm running a MacBook M1, which means I need something like a quantized version, or MLX, which already has all of the abstractions built in. MLX is a library by Apple research that is optimized for Apple silicon. You can go to the mlx-community repo on Hugging Face, and you'll notice there are already a few contributions: the MLX quantized Gemma 2B, which is the base model (not instruction tuned), and the same for the 7 billion parameter model. Earlier today I contributed the instruction-tuned versions of Gemma to this repo, so you can find them by adding "-it" to the end of the name; you can check and see that the contributor was me, Prince Canuma. The way you run these models is with MLX itself; the button on the model page doesn't show the correct code to run this, but we'll get there in a second.

Let's go to our environment. The task today is to summarize the report using Gemma, and the tools we're going to use are Hugging Face and MLX. The first thing to do is install LangChain (you don't actually need langchain-community here, just base LangChain) and pypdf to be able to read the document. You can skip text-generation and langchain-openai; that was an experiment I was doing and I'll come back to finish it. After that, install mlx-lm. You'll also need huggingface_hub and hf_transfer; these two are essential for downloading the model, and we want the latest versions. The reason one line is commented out is that it's mostly for GPUs: if you're running, say, a Windows or Linux machine with NVIDIA GPUs, you might want to use that line and comment out MLX instead. I've already run this and installed these libraries, so I'm not going to repeat it.

Once you have those, you can import from mlx_lm. The first thing we import is the load function, which downloads the model and loads the model and the tokenizer. Then we use the generate function, which is an abstraction that produces results for us; it makes things easier because you don't have to write your own generation loop, it already does that under the hood. We also define an environment variable so that the download is faster. When I run it, it goes really fast because I already downloaded this model previously.
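For reference, here's a minimal sketch of what this setup might look like; the exact package list and the mlx-community repo name (I'm assuming mlx-community/quantized-gemma-2b-it here) may differ slightly from what's shown on screen:

```python
# Install the libraries used in this tutorial (run once):
#   pip install -U mlx-lm langchain langchain-community pypdf huggingface_hub hf_transfer

import os

# Enable the faster Hugging Face download backend before huggingface_hub is imported
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from mlx_lm import load, generate

# Downloads the checkpoint on first run and returns the model plus tokenizer
model, tokenizer = load("mlx-community/quantized-gemma-2b-it")

# generate() hides the token-by-token sampling loop
print(generate(model, tokenizer, prompt="Why is the sky blue?", max_tokens=100))
```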
The model is now loaded. Next is the chat template: the instruct version of Gemma takes a particular format, which is basically a start-of-turn token, the role, a newline, the content, and then an end-of-turn token. This is actually described in the report; if we go to the report we can find the supported instruction tokens, and you can see it uses a "model" and a "user" role, with one token to start a turn and one to end it. Once you define this template (I'm using Jinja here to format my input), we can use the OpenAI API standard, which is basically a message with a content and a role, and pass that in. Then we pass the model, the tokenizer, and the prompt to the generate function, and it generates everything for us. In this case the prompt is "respond like Snoop Dogg: who is Einstein?", and this is running locally without access to the internet. You can see it generating the response; it's pretty fast for an old M1. It did seven tokens per second, I think because I'm running a lot of apps right now; by default it does around 11 to 12 tokens per second, so a bit faster than that. The output says "yo, check it, Einstein the GOAT, the smartest dude on the planet, Einstein the genius", so it's creating a little rap verse about Einstein. Pretty cool.

Moving on, we want to load the document. I already downloaded the report, so we create our load function using LangChain — and it turns out langchain-community is involved after all, so you might want to put langchain-community back; it seems we're still using it. The load function basically loads the PDF, and if any page has more than 500 characters it creates smaller chunks so that we don't overflow our context window. We just pass the name of the file and the type of the document and it loads. Here we have a list of all the pages: page one has the date, the team and other information, and you can notice it repeats the title, "Gemma: open models based on Gemini research and technology"; if we look at the original file you can see that text is actually there, so it's not a bug, it's just how the document is laid out. We end up with a list of around 16 documents.

Now that we have those documents, we can start building our summarization. First I create the main prompt, which is "Create a summary of the following document". We could attach it to the documents up front, but we can just do that at generation time. Let's do it for one document first: doc is documents[0], and within it we want page_content, which is what's inside. Let's print the doc to check; there we have the text. Now let's generate one example. The generate function takes the model, the tokenizer, and a prompt.
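Putting the pieces from this section together, a rough sketch could look like the following. Instead of hand-writing the Jinja template, this sketch leans on the tokenizer's standard Hugging Face apply_chat_template (assuming the checkpoint ships with Gemma's chat template); the PDF file name and splitter settings are also illustrative assumptions:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Gemma's instruction format is "<start_of_turn>{role}\n{content}<end_of_turn>";
# the tokenizer's chat template emits these tokens for us.
messages = [{"role": "user", "content": "Respond like Snoop Dogg: who is Einstein?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))

# Load the Gemma report and split any page longer than ~500 characters
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = PyPDFLoader("gemma-report.pdf").load_and_split(text_splitter=splitter)
print(len(documents))
print(documents[0].page_content[:200])
```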
We apply the chat template to our prompt and build a slightly more elaborate prompt on top of it. The prompt becomes the content; we don't need the separate doc variable any more, we can pass it in directly, make it an f-string, define the role as "user", and wrap it in a list. Let's generate this and see how it goes. Hmm, the model says the context of the question is quite broad; wait, did we run this correctly, did it work? Let's print the prompt. All right, it works, so perhaps tweaking the prompt a little might help. Although this model is quite fast running locally, it can take a while, especially on an older machine like an M1; I think an M2 or M3 might be even faster, so props to all of you with a more advanced laptop. It actually worked now: "Sure, here's the summary of the document: the paper introduces Gemma, a family of models based on Google's Gemini models; these models achieve strong performance across various benchmarks." We can make this look better with pretty-printing; I didn't have to generate again, I could have just stored the output in a variable, but there we go, here's the summary of that page.

Now let's make this better by creating a for loop, "for prompt in prompts". Or even better, we could build each prompt at generation time, but let's keep it simple for now before we get more advanced. We add a list that will hold all the summaries, call it summaries, and append whatever we get. Okay, it works. Now we create the list of prompts by doing "for doc in documents", and we have one prompt per page; the pages are already in order when you load, so that's fine. Now we can generate all the mini summaries. The idea is to create a summary of each page, and then take all of those page summaries and turn them into a single, major summary. This can save you tokens: the context window of this model is around 8K, so by creating those smaller summaries we might not fill up the entire 8K, which is good. It's also a naive approach to summarization: you create smaller summaries and finally aggregate everything at the end. So now we generate the summaries; this might take a while, so I will cut to when it's done. Oh, there is a bug; we can fix it this way, because apply_chat_template expects a list. Let's restart; now it's going to work. One thing I noticed is that when you get to the final pages, which have a lot of paper citations and references, the model starts to generate summaries of the titles of those references.
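Here's one way the per-page loop could be written; it's a sketch that assumes documents, model, and tokenizer are the objects loaded above, and that the prompt wording matches the "Create a summary of the following document" prompt from the video:

```python
# Build one prompt per page (the "map" step of a naive map-reduce summarization)
main_prompt = "Create a summary of the following document:"
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": f"{main_prompt}\n\n{doc.page_content}"}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for doc in documents
]

# Generate a mini summary per page so we never feed the full report
# into the ~8K-token context window at once
summaries = []
for prompt in prompts:
    summaries.append(generate(model, tokenizer, prompt=prompt, max_tokens=512))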
So what I'm going to do now is regenerate everything but remove those pages. Let me check what page the references start on; this document has a lot of citations, so we can drop everything after page 10. We can just take the first ten documents (indices 0 through 9). Let's restart, and this time I want to add tqdm so we know where we are, and I don't want verbose to be true. I had to restart a couple of times because my MacBook's memory was overflowing and crashing the kernel, so I'm rerunning everything except a few things: we get the document, create the prompts, and now we can run and check our progress. It takes a lot of time; you saw that previously, 14 minutes in our previous run. You could argue that I could make the calls in parallel, but I don't have enough horsepower locally; if I had, say, an M1 Pro or M1 Max, perhaps I'd be able to do that, but I have the base M1, so I have to run them sequentially, which is why it takes a bit longer. If you do have a bigger computer, you can definitely run these in parallel. The first one is done and took around a minute, so it's going to take around 9 minutes to run all of this. Let's wait until the end, and then we'll instantiate the bigger, 7 billion parameter model to create the final summary: the smaller 2 billion parameter model creates summaries of each page, and then the bigger model, which is more powerful and more cohesive, generates the final summary.

There we go, it was actually a bit faster than I expected, so let's look at the summaries. Most of them start with "Sure, here's the summary of the document", which is not great. Let's see how many characters we have; I'll join them with a newline between each, that would be better. There are about 10,000 characters. That's not good: it will not fit into the model's context. We could partition these into chunks of 5,000 characters and summarize each, or just drop the rest (I think the meat of the paper is earlier on), or create even more summaries of these summaries and then merge them; in this case let's just take what we have. Is there a nice way to do this? We can use the text splitter; let me check how to use it with plain Python strings in LangChain. Ah, split_text does what I'm looking for. We can't reuse the splitter we already have since that one is used inside the loader, so we create a new one. Now that we know how many characters there are, and since we have a context window of 8K, we can size the chunks to stay under that, instantiate the text splitter, and call split_text. We create a new variable, summaries_text, so we have all those summaries in one place, joined into a single string.
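A sketch of these adjustments might look like the following; the page cut-off, chunk size, and variable names are illustrative assumptions rather than what's on screen:

```python
from tqdm import tqdm
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Keep only the first ten pages; the rest is mostly references
documents = documents[:10]

# Rebuild the per-page prompts and regenerate, with tqdm showing progress
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": f"{main_prompt}\n\n{doc.page_content}"}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for doc in documents
]
summaries = [generate(model, tokenizer, prompt=p, max_tokens=512) for p in tqdm(prompts)]

# Join the mini summaries and re-split them into chunks that fit under the ~8K context
summaries_text = "\n".join(summaries)
print(len(summaries_text), "characters")

summary_splitter = RecursiveCharacterTextSplitter(chunk_size=5000, chunk_overlap=0)
split_summaries = summary_splitter.split_text(summaries_text)  # returns plain strings
```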
Let's see; I think the splitter did the job. If we take the first chunk and check its character count, it has around 6,000 characters, which is good. Now we can create a summary of this first part. I'll call the result split_summaries rather than overloading the summaries variable, so we don't overwrite anything we might come back to. We can reuse the earlier code to generate a smaller summary. Do we want to use the bigger model for these intermediate summaries? No, we just need a summary of the first part, so we reuse the smaller model. We take the first chunk of split_summaries, which is the longest in this case, and build the prompt; I'll call it the prompt for the long summary. The chunk is the content, and the instruction is "Give me a summary of the following document". We could make this prompt a bit more robust and ask it not to say "Sure, here's the summary", but it's fine for now, just for demonstration purposes; the role is "user", and we get our prompt.

Let's generate it; this time we'll cap it at 3,000 tokens maximum, so we're taking roughly 6,000 characters and trying to fit the summary within 3,000 tokens. Of course, when you set max_tokens the result might get cut off, but that's fine: when you're doing summaries you always expect some information loss, because you're taking the full context and distilling it down, so you'll always lose a little bit of accuracy. Generating the summary only took a minute. Pretty-printing it gives a better idea of how it looks: it's short, and it gives us the main points, including the 2 billion and 7 billion models. Now we can join this long summary with the second part and see how many characters we have. Wait, let me quickly calculate: before, we had about 10,000 characters, and divided by four that's approximately 2,621 tokens, which fits, so this could all have fit in the first place. That was my bad: we didn't actually need the splitting, we can just pass summaries_text directly, which has everything, and that will become the input for the final summary. Before that, we need to delete the existing model and tokenizer to free up some memory for the next one, which is basically the same load call with 2B changed to 7B.
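For the model swap, the bookkeeping could look roughly like this; the 4-characters-per-token rule of thumb and the 7B repo name are assumptions:

```python
import gc

# Rough token estimate: ~4 characters per token
approx_tokens = len(summaries_text) // 4
print(approx_tokens, "tokens (approx.)")  # comfortably under the ~8K context

# Free the 2B model before loading the bigger one to avoid exhausting memory
del model, tokenizer
gc.collect()

model, tokenizer = load("mlx-community/quantized-gemma-7b-it")

# Final "reduce" step: one summary over all the page summaries
final_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": f"Give me a summary of the following document:\n\n{summaries_text}"}],
    tokenize=False,
    add_generation_prompt=True,
)
final_summary = generate(model, tokenizer, prompt=final_prompt, max_tokens=1000)
print(final_summary)
```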
Now we can load the 7 billion parameter model; hopefully it loads without crashing. There we go, it loaded, thank God, and now we can generate the final summary. We'll set a maximum of 1,000 tokens; 1,000 tokens is okay. Let's generate our final summary; this might take a bit long, but let's prepare the printing in the meantime. The 7 billion parameter version runs at around four tokens per second on my MacBook, but because I have a lot of things open right now it might be a bit slower. It took about 5 minutes to generate the summary over the roughly 2,000-plus input tokens. Now let's look at it. It says "Summary of the document: this document summarizes the key points of the paper", so it actually took parts of the previous summaries verbatim, including the "Sure" phrasing: it describes the Gemma family of models, that's good; key highlights: Gemma models achieve strong performance, that's correct; 7 billion and 2 billion, that's correct; independent developers, malicious actors fine-tuning Gemma, and so on. It's an okay summary; it's not GPT-4 level or even Mistral 7B level, but it's okay, and that makes sense, because the previous model generated summaries padded with words that weren't really necessary, and it was a smaller model. Perhaps if we had run everything with the 7 billion model it would have been better, but that's fine; this is just to showcase the power of these models and what you can do. Now imagine if the smaller model were fine-tuned to generate highly coherent and cohesive summaries; it would be even better. So think about how you can fine-tune these, and I will show in a future video how to fine-tune these models.

Now let's move on to the last part of this video, another interesting thing I learned about MLX today: you can actually run your own server that comes with pretty much OpenAI API compatibility, which means you can even use the OpenAI client library against a model running locally. You can run it easily by copying the server command and using it in your terminal. Let me get my terminal; I need to activate my environment, which is called llm-fine-tune. Once we run the command, it fetches the model and starts the server. There we go, the server has started. Under requests, it just takes a URL, a Content-Type of application/json, and the same format we used previously with role and content. I'm just sending "hi" with temperature 0.7 and max tokens 100, using the requests library, printing the result while excluding the end-of-sequence token that comes by default, and you also get the usage. If I run this, you can see the server got the request, and we have the response and the usage: the response is "Hi, it's nice to meet you, how can I help you?", and the usage is about nine prompt tokens and about 17 completion tokens, so 26 tokens in total. This is one of the things you can do with this library; of course, you can also run normal curl requests and still get your response. So this is what you can do running Gemma with MLX locally on your computer.
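A sketch of the client side could look like this; the server command, default port, and endpoint path are assumptions based on mlx-lm's OpenAI-compatible server, not copied from the video:

```python
# Start the server in a terminal first (command and port are assumptions):
#   python -m mlx_lm.server --model mlx-community/quantized-gemma-2b-it

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    json={
        "messages": [{"role": "user", "content": "hi"}],
        "temperature": 0.7,
        "max_tokens": 100,
    },
)
data = resp.json()

# Print the reply and the token usage reported by the server
print(data["choices"][0]["message"]["content"])
print(data["usage"])
```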
Of course, you can also use it with projects like Ollama, LangChain, and even Transformers if you have GPUs, any NVIDIA GPU. So this is it for the video, thank you.
Info
Channel: Prince Canuma
Views: 3,154
Keywords: Gemma, Gemini, mlx, mlx-lm, huggingface, LLM, Large Language Models, What is Google Gemma?, google/gemma-7b, google/gemma-2b, Google's new open LLM, Get started with Gemma models, Google’s NEW Open-Source Model, Is Google Gemma free?, AI Summarizer, langchain
Id: b1XCEetucSg
Length: 40min 18sec (2418 seconds)
Published: Sat Feb 24 2024