Run Google Gemma 2B & 7B Locally on the CPU & GPU

Video Statistics and Information

Captions
Hi everyone, it's Nono here, and this is a hands-on overview of Google's Gemma. This is a model that Google has just released; it's the equivalent of Llama at Meta (Facebook AI). Google is releasing open models, large language models or LLMs, which we can download directly from Hugging Face with the Hugging Face CLI or from the Hugging Face web interface, and they've released 2 and 7 billion parameter networks that we can use directly on our machine.

Google Gemma being open models means that they have trained these networks, they have benchmarks, and you can see how well they perform on certain tasks compared to other models like the Llama or Mistral families: "a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models." I have to say I haven't used them yet, but I have downloaded them to my machine. It's pretty easy, and you can do the same; we're going to see how to do that right now.

The first thing that I did is go to Hugging Face and search for google/gemma, and you get a series of models. The two models are 2 and 7 billion parameters, but there are two varieties of each, because each of them has the base model itself and the "it" variant, which is the instruction-tuned model. I had to accept the terms on Kaggle, and that gave me access; this is called a gated model, and accepting the terms gave me access on Hugging Face, so by logging in on the console and requesting the download I could download the model.

All right, so let's now code and take a hands-on look at how to actually download those models on a Mac, and note that this might work on Linux and Windows as well. Let's install the Hugging Face CLI with Homebrew. All you have to do in the end is brew install huggingface-cli; that's what we need to run in order to get the CLI installed on our machine. That works, and now we can do things like log in, log out, and more.
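For reference, a minimal sketch of that install step; the Homebrew formula is the one mentioned above, and the --help call is just to confirm the CLI is on the PATH:

```bash
# Install the Hugging Face CLI with Homebrew (macOS), as done in the video.
brew install huggingface-cli

# Confirm it is installed and list its subcommands (login, logout, download, ...).
huggingface-cli --help
```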
Let's now see how to log in with a token to the Hugging Face CLI. You have to go to huggingface.co/settings/tokens, which is this page here; this is a token I created today. Go precisely to that URL, then go back to your terminal and run huggingface-cli login. It will prompt you for a token, which I've just pasted, and it saves it: login successful.

Okay, so now we're going to see how to download a model from Hugging Face. We were talking about Google's Gemma; you get the results in the search, and we can go, for example, to the Gemma 2 billion "it" model and read how to run things and so on. I'm going to go to Files and versions so you can see the files they've put there. You could download one file specifically, or you could download the whole repository; let's see how to do both.

I'm going to go to the desktop. We have a folder for today's stream that already holds two downloaded models, but I'm going to create a test folder here and open it. Now I can run huggingface-cli download, set the local dir to the current directory, download from google/gemma-2b-it, and choose a file, for example config.json. We just pass config.json and we get the file here. We can preview the file in our terminal: in case you don't know, I'm cat-ing the file, which prints its contents, and then I pass it through jq, which gives us the formatted JSON. I could also ask for a single property, architectures for example.

So we can download one file, and as you can see the file is there; it's not a symlink. To get a symlink I can add the local-dir-use-symlinks option: let's delete that file, set the option to true, and download again. Now this is a symlink, and if I show the original it brings me back to the Hugging Face cache folder I showed before. If I don't want to use a symlink, I simply download the file again (removing the symlink first) and get the actual file.

Okay, that's for downloading one file. If we want to download the entire repo, we just leave the file list empty, and it starts downloading all the files to this folder. The really big ones, probably the ones with the LFS icon (Git LFS, Large File Storage), are going to be symlinked, because you don't want to re-download them every time. The symlinked ones are the really large files, and we could have symlinked everything by specifying the use-symlinks option. This is really nice, because it's just symlinking from the Hugging Face cache on my machine, since the files are already downloaded, so it doesn't have to wait. If you did have to wait, you would see something similar to this: it downloads the big files, and as each one completes it lands in the cache and gets symlinked here.

Let's go to the Hugging Face cache and see what files we have. These are probably files that didn't finish downloading when I tried before: I have 17 gigs for Gemma 7B, around 5 for the other one (so that one probably didn't finish), and about 2 for the one I just started downloading.
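Putting those commands together, here's a rough sketch of the sequence from this part of the video; the repo id and file name are the ones shown above, and the flag names follow the Hugging Face CLI as of early 2024:

```bash
# Log in with a token created at https://huggingface.co/settings/tokens (prompts for the token).
huggingface-cli login

# Download a single file from the gated google/gemma-2b-it repo into the current directory.
huggingface-cli download google/gemma-2b-it config.json --local-dir .

# Preview it formatted with jq, or pull out a single property.
cat config.json | jq
cat config.json | jq '.architectures'

# The same download, but symlinked from the Hugging Face cache instead of copied.
huggingface-cli download google/gemma-2b-it config.json --local-dir . --local-dir-use-symlinks True

# Download the entire repository; large Git LFS files end up symlinked from the cache.
huggingface-cli download google/gemma-2b-it --local-dir .
```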
So we've seen a brief intro to what Gemma is, really high level; we've seen that we can install the Hugging Face CLI really easily; and we've seen how to log in with a token, how to go to a repo, and how to put the download command together so we can download one file or multiple files, either as local copies or as symlinks. Remember that you will pretty much always be symlinking, because the files are really big.

So let's take a look at how to use Gemma locally on our machine. Here we have the 2B and 7B instruct models: "Gemma is a family of lightweight, state-of-the-art open models well suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources, such as a laptop, desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone."

Let's create a new environment so we can run Gemma: conda create, the name is going to be gemma, and it's going to be Python 3-point-something; I don't know offhand what the best-supported version is right now, and I'll say yes. So we install that, and it should be done already. All right, conda activate gemma, and we'll do what they told us to do, which is pip install --upgrade transformers. This is from Hugging Face, so you already know how to get these models from Hugging Face; because we already have this model in the cache, it might just be really fast, notice the model is already there, and use it directly.

Let's see how to run Google Gemma on the CPU. We import AutoTokenizer and AutoModelForCausalLM from transformers. The tokenizer is going to be AutoTokenizer.from_pretrained with google/gemma-2b-it, the one that we downloaded, and the model is AutoModelForCausalLM.from_pretrained with google/gemma-2b-it, so that should be the same, right? The input text is going to be "10 plans for tourists in Malaga, Spain." For the input IDs we call the tokenizer on the input text with return_tensors="pt", the outputs are model.generate on the input IDs, and then we print tokenizer.decode of the outputs.

Does this work? Let's take a look; we just have to run the script. Oh, okay: conda deactivate, it says we might have to restart. Let's take a look at the issue here: none of PyTorch or Flax have been found. conda activate gemma again, and then we pip install... is it "pytorch" or just "torch"? Just torch, I guess. So now we're installing PyTorch, and maybe that makes it work. Let's try again. Now something else is happening: did I mistype from_pretrained? All right, human error. One more time: downloading shards, loading checkpoints... and it says "visit the Picasso Museum." All right, it got it right, but the output is super short; why is it generating so few tokens? Right, the default generation length is tiny, so let's bump max_length and try again.

All right: "10 plans for tourists in Malaga: visit the Picasso Museum; take a walk along the Paseo; visit the Cathedral of Malaga; explore the Alcazaba; go on a day trip to Ronda; visit the Museum of Malaga; take a cooking class; visit the Botanical Garden; go on a hike in the Sierra Nevada mountains; visit the Aquarium of Malaga. These are just a few ideas for things to do in Malaga; there are many other things to see and do in this beautiful city." This is pretty awesome; this is working pretty well, and that was with the two-billion-parameter model. The full script, roughly as dictated here, is recapped below.
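Here is a minimal sketch of that CPU run, assuming the setup and script roughly as dictated above; the script filename, the conda Python version, and the max_length value are my guesses, since the video only says "Python 3 point something" and bumps max_length without naming the number:

```python
# run-cpu.py (hypothetical filename; the video runs an equivalent script).
#
# Environment setup (shell), roughly as in the video:
#   conda create --name gemma python=3.11   # exact Python version is not stated
#   conda activate gemma
#   pip install --upgrade transformers
#   pip install torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Tokenizer and weights come from the gated google/gemma-2b-it repo; if the model
# was already downloaded, this resolves from the local Hugging Face cache.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")

input_text = "10 plans for tourists in Malaga, Spain."
input_ids = tokenizer(input_text, return_tensors="pt")

# generate() defaults to a very short max_length (20 tokens), which is why the first
# run stopped after "visit the Picasso Museum"; 200 here is an assumed value.
outputs = model.generate(**input_ids, max_length=200)
print(tokenizer.decode(outputs[0]))
```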
Let's actually change this to the 7-billion-parameter model and run exactly the same code: conda activate gemma and go. The model is a lot slower, but we'll get the output. This is all running on the CPU, and it's taking a really, really long time. Okay, nice.

All right, so this one gives "Ten plans for tourists in Malaga, Spain" with sections: history and culture (explore the Moorish architecture of the Alcazaba of Malaga and the Cathedral of Santa Maria, visit the Roman Theater and the museum), immerse yourself in the vibrant flamenco culture with a show, beach and sun, relax, a day trip to Ronda, explore the orange blossom, a foodie adventure, shopping, nightlife, relax and enjoy. I don't really know why this took so long, but there's definitely a lot more detail.

Can we solve coding questions with this? Let's ask a different question: write a TypeScript React component with an OK button. Let's try that with the two-billion model, which is pretty fast, and I'm not sure whether the seven-billion one is going to be really slow. That was super fast; these graphs are the CPU cores while it was generating, but I allowed so few tokens that it couldn't really write anything, so let's try again with more tokens. Loading checkpoints... there's no rendering lag, so this computer can handle it pretty easily. Now let's see what happens. It also takes some time, maybe 10 or 20 seconds to answer. It seems like this peak you can see is the model running, so let's see whether these spikes stay or disappear when the output comes, if that really is the model using part of our CPU. This is taking longer because it's probably generating a lot more tokens; I don't know offhand what the default token limit is, but it's definitely not 200 or 500. All right, here's our component: an App with a handleClick, an onClick, an OK click handler, importing the useState hook from React, and it explains how to do all these things. Okay, that sounds good.

What I'm going to do now is ask with maybe 64 tokens, so it's not too much. You can see these mounds here are the CPU usage from running that, so it seems like this model makes use of the CPU, but in a good way. What we're going to see with the 7B is that my camera will probably lag a bit; I'm going to leave it running, maybe we miss some frames, and let's see how those graphs look and whether there is some GPU use as well. Let's run that. There you go: at first, while the model is being loaded, we lose some frames; we lost 50 frames due to rendering lag and 19 due to encoding lag. There is no stress on the GPU, because this is the example code for running on the CPU, and as you can see there's activity on the CPU performance cores. Then we generate, and it seems like we need to allow even more tokens. I'm going to remove myself from the scene, because there's going to be some encoding lag and I don't want you to see me frozen or blocky. So far we seem to be okay... oh, it's still generating. It seems like the CPU also suffers here, but it looks more heavily used with the two-billion-parameter model; I don't fully understand what's going on, or why one takes more CPU work than the other. All right, here it is, imports and all, so it's working, slowly; you can see it in these columns. Great, this works pretty well for being 7 billion parameters.

For running the model on a single GPU or on multiple GPUs, the docs say we need to pip install accelerate, which lets you run your raw training script on any kind of device.
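As I read the model card's GPU instructions, that is a single extra install, which the device_map="auto" snippet further below relies on:

```bash
# Accelerate lets transformers place model weights on the available device with device_map="auto".
pip install accelerate
```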
All right, so now we're trying to see if we can run Gemma directly on the MacBook Pro's Apple Silicon GPU. Let's just create a file, something like torch-devices.py. In it: import torch; if torch.backends.mps.is_available(), set an MPS device with torch.device("mps"), create x as torch.ones(1) on that device, and print x; else print that the MPS device was not found. I'm basically just following the PyTorch example here. Let's see what this does for us: running python torch-devices.py prints a tensor of ones on device mps:0. Great, we do have it. This code sample answers whether PyTorch can see the MacBook Pro's GPU; machine learning tooling has advanced a lot, and now we have native support without doing anything special.

What I want to do now is see where to plug this into the generate script. Let's take a look at where the GPU code sample was in the model card. Okay, that's great; let's close things. In the model card example the inputs go to CUDA; here we pass device_map="auto" to the model and move the tokenizer's return tensors to MPS instead. Let's try that, and let's not be too ambitious, so I'll start from the PyTorch CPU script; both snippets are sketched below. Now the only thing we need is to not get any errors. I don't really know what's going to happen; maybe we do get errors.
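A sketch of that device check, following the standard PyTorch MPS example; the filename is my guess from the audio:

```python
# torch-devices.py: check whether PyTorch can see the Apple Silicon GPU (MPS backend).
import torch

if torch.backends.mps.is_available():
    mps_device = torch.device("mps")
    x = torch.ones(1, device=mps_device)  # allocate a small tensor on the GPU
    print(x)                              # e.g. tensor([1.], device='mps:0')
else:
    print("MPS device not found.")
```

And a sketch of the GPU variant of the generation script, assuming the two changes described above: device_map="auto" on the model and the inputs moved to "mps". The prompt is the one used next in the video, and max_new_tokens is an assumed value:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

# device_map="auto" (via Accelerate) places the weights on the Apple Silicon GPU.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")

input_text = "Things to eat in Napoli, Italy."
# Move the tokenized inputs to the MPS device so they live where the model does.
input_ids = tokenizer(input_text, return_tensors="pt").to("mps")

# The video experiments with shorter and longer limits; 200 is an assumed value.
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
```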
But it seems to be working, and I'm not losing any frames here; there were some dropped frames due to the network, some lag I can't explain. Oh, nice, take a look now: this spike here is really nice, because it means the model is actually making use of the GPU, and it's not overloading my CPU at all, which is pretty cool. There are some spikes on the CPU, but a lot fewer than before, so that's super nice. We'll keep watching this graph, and I'll leave it up here. All right, so we got the output for "Things to eat in Napoli, Italy." Great. Maybe fewer tokens is faster. I have a timer here, so we'll count down... okay, 16 seconds, and we've got an output in Markdown. I can open iA Writer with an empty file, paste this in, and it's actually ready for print: it's Markdown text and we can see it in different formats, but in the end the important thing is that it's valid Markdown, and it's been generated by the two-billion-parameter model.

So let's actually go for the 7-billion-parameter model, see how it performs, and whether this is something we could use. The prompt is "Things to eat in Napoli, Italy"; that's the answer for two billion, wow, and this is the answer for 7 billion... oh, it didn't have space to write more. All right, let's let it talk and see how long this one takes. We're making use of the GPU a lot more than before and a lot less of the CPU, though it seems there is still some load on the CPU, and the GPU spikes. What's going on is that we're losing a lot of frames, because the computer is trying to generate tokens: it's actually doing inference on my machine, so it's not super performant, and I have three screens, so yeah, a lot of frames lost. The good news is that the two-billion-parameter model is something you can use locally without any problem. We're at 2 minutes 50 seconds... and at 3 minutes 50 it just finished generating. All right, let's grab this and put it in the same document, under "Gemma 7 Billion," and there it is.

This is what we got: "Napoli is a vibrant city steeped in history," blah blah blah, pizza, spaghetti, pizza napoletana, gelato, mozzarella, tips to experience the dishes, enjoy the relaxed atmosphere. Wow, okay, there is a difference; we got pretty different suggestions. Anyhow, this was a great experiment. We've been able to run Gemma on the GPU and also on the CPU and see how those differ. We lose frames, even on this expensive, powerful M-series Max MacBook Pro, because these models load up our CPU and GPU and things get frozen, but it's been pretty cool to be able to do this so quickly and so easily. I'll remind you that you can join the Discord community at nono.ma/discord, and don't forget to subscribe and click on the bell if you want to get notified when I go live next or when I upload new videos. I want to thank you for watching, and I'll see you next time. Thanks a lot. I'm Nono Martínez Alonso, host of this stream and also of the Getting Simple podcast. Bye.
Info
Channel: Nono Martínez Alonso
Views: 6,518
Id: qFULISWcjQc
Length: 25min 2sec (1502 seconds)
Published: Fri Feb 23 2024