Running Gemma using HuggingFace Transformers or Ollama

Captions
Okay, so in the last video we looked at Gemma — what it is and some of the basics around it. In this video I'm going to show you some simple ways you can do inference with it. I'm going to focus on the Hugging Face way of running the Gemma models, and I'm also going to show you how to do inference locally with Ollama if you've got that installed. I'll just mention there are some other ways you can do this too: there are docs out now for using Gemma via Keras that you can go through, and another option is gemma.cpp, which is basically a lightweight, standalone C++ inference engine for the Gemma foundation models that's come out of Google. So that might be something you're interested in looking at. But let's get started — I'll go through the Hugging Face way first, and then after that I'll go through the Ollama way to run the model.

All right, so in this example I'm going to go through how you can set up Gemma using the Hugging Face version. Now, you'll find that this is a gated model, so when you first come in here you're going to have to click accept, which will take you to Kaggle, where you basically opt in to get permission to download the weights — like we've done before for LLaMA 2 and some of the other models that have come out. You only need to do that once, and then you'll have access to all the Gemma models.

In this case we're going to be looking at the Gemma 7B instruction-tuned model. There is also a 2 billion parameter instruction-tuned model, which I'll go through in Ollama; the way you use both models is exactly the same, you just change the name. What I've done is put together a notebook so you can run this in the free version of Colab by loading the model in 4-bit. You can see I've basically just got a Tesla T4 here, and I'm going to load the model up as a 4-bit model. To do this we now need to pass in a quantization config — a bitsandbytes quantization config. You can see up at the top that you want to make sure you're using the Hugging Face Transformers install from GitHub to get the latest version, you want bitsandbytes, and you want hf_transfer for downloading the weights quicker — that's what the Hub environment variable we enable is for.

Once you've got that, you'll also need to set a key. Just off the screen here in Colab I've got my secrets, and I've set my Hugging Face token. The reason you need the Hugging Face token is that because this is a gated model, it needs to know that your account is the one downloading it and that you've already accepted the terms. So just go into Hugging Face, make an HF token, stick it in the Colab secrets, and leave it there — it's something you do once, it'll be associated with your account, and it will work every time.

All right, so first off we're just going to bring in the model and the tokenizer, using the quantization config with load_in_4bit set to true.
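Here's a minimal sketch of that loading step, assuming recent versions of transformers and bitsandbytes and the gated google/gemma-7b-it checkpoint; the exact arguments may differ slightly from the notebook shown in the video.

```python
# Minimal sketch: load the Gemma 7B instruction-tuned model in 4-bit so it fits on a T4.
# Assumes you've accepted the Gemma license and your HF token is available to the session
# (e.g. via Colab secrets).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-7b-it"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
    device_map="auto",  # put the weights on the GPU automatically
)
```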
The full 7 billion parameter model won't actually fit on a T4 unless you load it in 4-bit or 8-bit; if you've got an A100, though, you could load the full-precision one here. Anyway, I'm basically just bringing this in, setting low_cpu_mem_usage, and setting the device map so it puts the model on the GPU for me, and I'm bringing in the tokenizer here.

One of the really interesting things about this is that Gemma has a tokenizer with a vocabulary of 256,000 tokens, meaning the way words get split up is quite different — although it doesn't affect English as much. For comparison, LLaMA 2 has 32,000 tokens in its vocabulary. I think I'll make a separate video showing some really interesting things about the tokenizer, including how 6 trillion tokens on this tokenizer probably represents more text than the equivalent number of tokens on the LLaMA 2 tokenizer — for the same token count you're actually getting more information.

Next we want to set up the roles and actually generate something. I'll show you a simple example first, and then we'll put it into a wrapper. In the simple example you have a chat, which is a list of messages, and those messages are either a user message or a model message. In this case you can see I'm just putting in a user message: "What is the difference between llamas, alpacas and vicuñas?" — a common one that we've asked a lot of models. Once we apply the chat template — the instruction fine-tuning prompt template — we get out something that starts with start_of_turn, user, a new line, whatever we put in, end_of_turn, a new line, then start_of_turn, model, and a new line. If you come down here you'll actually see what the prompt format looks like: start_of_turn user, the query, end_of_turn, then start_of_turn model, and then it will generate — and it will generate an end_of_turn at the end, which acts as one of the stopping tokens. You'll see more of that in the Ollama example I'll show you afterwards.

Once we've got that, we can encode it and put it into the model. At that stage the outputs we get back are just token IDs, so we then decode them and display the result as Markdown. You can see here what we're putting in and what we're getting out, and this starts off something I'll talk about as we go through: the output of Gemma is very different from the other models we've looked at before.

There are some other examples of doing this — you can actually wrap whole conversations back and forth quite easily by just appending messages to the chat and passing the full chat in each time. If people are interested in that, maybe we'll look at making a chatbot example where we have that set up; maybe I'll do that with one of the fine-tuning examples.
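To make that prompt format concrete, here's a small single-turn sketch of the flow just described — build the chat, apply the template, generate, decode. The generation arguments here are assumptions based on the standard transformers API, not the notebook's exact code.

```python
# Single-turn example: Gemma chat template -> generate -> decode.
chat = [
    {"role": "user", "content": "What is the difference between llamas, alpacas and vicunas?"},
]

# This produces: <start_of_turn>user\n...<end_of_turn>\n<start_of_turn>model\n
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# The model returns token IDs; decode them back to text (the prompt is echoed at the start).
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```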
All right, I've got a text-wrapping helper just for wrapping the text so it fits nicely in Colab, and then I've got my generate wrapper function. What this does is take in the input text and a system prompt. Normally we'd have a role called system and the content would be the system prompt, but the problem is that in this particular model there is no system prompt — neither the Gemma models nor the Gemini models use one. To get around that, we basically take the system prompt and fold it into the first user prompt: the first user message's content is the system prompt, some new lines, and then the user's text input.

We then apply the chat template, run it through the tokenizer to encode it, and run it through the model to generate the outputs. You can play with the temperature and so on here. I've set it up so the max length is something you can pass into the wrapper — by default it's 512, but you can change that. Coming out of that, we decode the outputs and remove what we put in as the input; I'll show you some versions with and without this — I left a couple without it so you can see what happens if we don't strip it. Finally we wrap the text and display it as Markdown. One of the interesting things is that Gemma outputs Markdown by default, so it already has this nice formatting — bolding, bullet points, that kind of stuff.
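Here's a sketch of what a wrapper like that can look like — fold the system prompt into the first user turn, apply the chat template, generate, then strip the prompt off the front before rendering. The function and argument names are mine, not necessarily the notebook's, and the sampling settings are assumptions; the final call mirrors the first example discussed below.

```python
from IPython.display import Markdown, display

def generate(input_text, system_prompt="", max_length=512):
    # Gemma has no system role, so fold the system prompt into the first user turn.
    content = (system_prompt + "\n\n" + input_text) if system_prompt else input_text
    chat = [{"role": "user", "content": content}]

    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    outputs = model.generate(**inputs, max_length=max_length, do_sample=True, temperature=0.7)

    # Decode, then strip off the prompt so only the model's reply remains.
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    prompt_text = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
    reply = full_text[len(prompt_text):].strip()

    # Gemma replies in Markdown by default, so render it as Markdown in Colab.
    display(Markdown(reply))

# Usage: same user prompt, different "system" framing -> noticeably different style of answer.
generate(
    "Write a detailed analogy between mathematics and a lighthouse.",
    system_prompt="You are Gemma, a large language model trained by Google. "
                  "Write out your reasoning step-by-step to be sure you get the right answers.",
)
```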
So you can see here we've got: generate "Write a detailed analogy between mathematics and a lighthouse." I'm not going to worry too much about the prompts in here — I'll point out a few things as we go through, but I'd really encourage you to go through and try your own. For the system prompt I've just put "You are Gemma, a large language model trained by Google. Write out your reasoning step-by-step to be sure you get the right answers." And sure enough, it gives us this step-by-step answer. So it is quite responsive to these things, and it's very different from Mistral in the way it does it — Mistral likes to do "first, second, third" and so on, whereas this one only tends to do the steps when we ask for the step-by-step reasoning.

I then ask it the one about mathematics and music, and it gives us a similar sort of answer. You can see the answer is cut off here because I passed in max_length=256; you can probably get away with setting this to about a thousand, and most of the time the stopping token will just stop it automatically when it gets to the end.

We can look at the llama question — what's the difference between a llama, a vicuña and an alpaca — and you can see it's definitely been trained to do some of this step-by-step, chain-of-thought style thinking. In one of the future videos I'll look at some papers around Promptbreeder and some of this "discover" work — some of the ideas Google has been thinking about for getting these language models to do this. I have a suspicion that the new Gemini 1.5 Pro model is actually trained with that in it, and it could be that the Gemma models are too.

So we'll look at this idea of working out the key reasoning steps: first, I need to identify the characteristics of each breed — it gets those; then I need to compare those characteristics; then I need to identify the differences; and then I come back with a conclusion. This is definitely an interesting way of thinking — it's quite different, and sometimes it's useful and sometimes it's not. Here you can see I've asked it to write a short email to Sam Altman giving the reasons to open-source GPT-4. It doesn't really write an email — it just gives me the reasons, and I think that's because we've got the "reasoning step-by-step" instruction in there. I'd encourage you to play around with different system prompts, or no system prompt, and see what you get out of it. So here it gets the reasoning, but we don't really get it as an email.

If we don't ask for the step-by-step bit, though — you've probably seen some of my examples where I ask the exact same user prompt about the email, but give it a different system prompt: "You're Freddy, a young five-year-old boy who's scared AI will end the world." Notice there's no "reasoning step-by-step" in there. This one I actually output before I added the removal of the user and model parts, so you can see we've still got that original part in there — if you ran this now you'd get it back without that. The interesting bit is that now we don't get step-by-step; now we get something that actually is an email: "Subject: Please open source GPT-4. Dear Mr. Altman, my name is Freddy, I'm five years old. I'm really worried about your company's GPT-4. I heard that it might be able to destroy the world." So it's a little bit comical in its answers, and maybe not as nuanced as some of the other models, but when we didn't ask for the step-by-step reasoning we now get the actual email out.

Same with the one where we use the same user query but the system prompt is now "You're the Vice President of the United States" — now we get a much more formal email out. And you'll notice that when we did this with Mistral, and even the Mistral fine-tunes, we'd basically get a lot of "okay, first... and then second...", whereas here it's actually just a coherent email rather than jumping into the step-by-step stuff.

When I ask it here to write out its answer short and succinct for "What's the capital of England?", it's not so succinct: "Sure. Here's my answer: London is the capital of England." It would perhaps have been better if we just said "just give me the answer", something like that. I'd encourage you to play around with these — every model has its own way that it wants to be prompted. Far too often people dismiss models because they don't act the way the OpenAI models do when you prompt them; you really want to learn to prompt different models in different ways to get results out.

Okay, then we're asking it: can Geoffrey Hinton have a conversation with George Washington? Give your rationale before answering. We've got the step-by-step thing again — it goes back into the reasoning: historical context, physical presence (Washington passed away in 1799, Hinton was born in 1947 — I think this is the first model I've seen get that right; I'm pretty sure Hinton was born in '47), communication methods — it goes through a lot of steps to get to its conclusion at the end.
And you'll notice one of the other nice things: it's quite consistent with "reasoning" and then "conclusion". So we could actually make a parser — perhaps we'll look at doing this with LangChain — that just gives the conclusion back, or just gives the answer back, letting the model do the thinking and then only returning the answer to the user. So that's a nice consistency to see here as well.

For the short story writing, I actually think this is really quite nice and interesting. The reason is that we've had plenty of models that could write a story about the koala playing pool and beating the camelids, but one of the things I find interesting — and maybe this is because I originally came from Australia — is that the names are kind of Australian: Bernie, Barry, Bert, Alby. These are old-fashioned Australian names. I don't know if that's just coincidence, or whether it's got the sense that a story about a koala is probably going to be set in Australia and therefore the names should match. Anyway, I'd encourage you to play around with the story writing — there are a lot of interesting things that can be done here, and perhaps we'll look at using this for doing some editing and so on going forward as well.

All right, code generation. Again, we've got some nice things here: it's remembered to import math — a lot of models don't remember that; they give you the code but often leave out the import. It goes through and does this, it also has a nice explanation and a nice usage example, and then gives you some sample output. Here it's printing out the primes from 0 to 100 by the looks of it. In many ways this is very similar to the style of the actual Gemini models — the bigger Gemini models — so it's encouraging that we're getting consistency in the output. We'll be able to fine-tune for other things, but this is definitely interesting. There's another code example I'll leave for you to go through.

GSM8K — I got mixed results on this. We've got a nice example here with the cafeteria and the 23 apples: it simplifies the steps quite quickly to "step one: 23 minus 20 equals 3; step two: 3 plus 6 equals 9; therefore the cafeteria has a total of 9 apples." Perfectly correct. On the babysitting question it goes off a little bit — almost like it's going into LaTeX, wanting to turn this into a whole derivation. It's working out an hourly rate and trying to treat things as variables, and it doesn't get the right answer: it doesn't do the rounding, so it calculates the amount but misses that it should be the $10 she earns, not $9.96. The other one, about the deep sea monster that rises, it doesn't do a good job on at all — it works out the 847 part, but it divides by 3 where it really should be dividing by 7, so it comes to the wrong answer.

If we give it the mathematical version of this, though — where I say x plus 2x plus 4x, remembering we're trying to calculate how many people it ate in the first hundred years, we know the total was 847 over the three visits and it doubles each time, so the first amount doubled gives 2x for the second and 4x for the third, which is where the 7 comes from — it actually does a really nice job. The reasoning again comes out quite well: combine like terms, isolate x on one side, divide both sides by 7, and it comes to the right answer. So that one's interesting.
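As a quick sanity check on that algebra (my own working, not the model's output):

```python
# Deep-sea-monster puzzle: consumption doubles each visit and the three visits total 847 people,
# so x + 2x + 4x = 7x = 847 for the first ship.
total = 847
first_ship = total // (1 + 2 + 4)
print(first_ship)                                     # 121
print([first_ship, 2 * first_ship, 4 * first_ship])   # [121, 242, 484] -> sums back to 847
```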
All right, the last two I'm just going to go through quickly. One is about Singlish — Singlish is a hybrid version of English that some people use in Singapore — and it's interesting to see that most models don't know a lot about it. I was interested to see not only how it answers, but what reasoning it uses to get there, which I think is interesting here.

And the last one is basically: can it do translation, and not just translation to a Western language? The model will probably be quite good at translation for a lot of Western languages, but if we take something like Thai, which the tokenizer can actually handle quite well, we can see it does not do a great job at the translation. It's basically saying "tell me how to get to this road, can you help me, yes or no" — or actually a "can or not" sort of phrasing at the end. What I wanted to put this one in for is not necessarily about Thai per se, but to emphasize that this model is trained on 6 trillion tokens. 6 trillion tokens is a huge corpus — we haven't seen models trained on this much before; it's three times the training of LLaMA 2. So even when they're mostly trying to train on English, with that many tokens you'll get other languages slipping through. And this shows that even with not a lot of Thai slipping through, with 6 trillion tokens there's obviously enough coming through to do some kind of basic translation — I'm not going to say good translation, but basic translation. It's also very interesting to see that it basically claims it has called the Google Translate API to do the translation. I find that kind of amusing — it doesn't have access to the internet, it's made that up, but that's its reasoning for how it got to this point.

Anyway, have a play with the model in the Hugging Face Colab — the Colab is in the description as always. Next up, let's have a look at how to get the model running in Ollama.

Okay, so running Gemma on Ollama is actually very simple. They've already put it into their models — you can see it right at the top here, and already 11,000 people have downloaded it. You've got the option to use either the 2B or the 7B model: if you just do "ollama run gemma" you'll be using the built-in 2 billion parameter instruct model, and if you go for the 7B tag you'll be using the bigger model. Downloading the 2B model is surprisingly fast, so you can try it out very easily. Let's jump in and have a look.

So if I want to run the 2 billion model, I can just come in here and do "ollama run gemma", and it would pull the model down — except I've already got it installed, so I can use it straight away. If I ask it "how are you?" — remember, this is a 2 billion parameter model, so it will be extremely quick, but it's obviously not going to be great at a lot of tasks either.
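If you'd rather call the local model from code instead of the terminal, Ollama also exposes a local HTTP API. Here's a rough sketch of hitting it with Python — the endpoint, port, and fields are from Ollama's REST API as I understand it, so treat it as an assumption and check the Ollama docs for your version.

```python
# Rough sketch: query the locally running Gemma 2B through Ollama's HTTP API.
# Assumes "ollama run gemma" (or "ollama pull gemma") has already downloaded the model
# and the Ollama server is listening on its default port, 11434.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma",          # or "gemma:7b" for the larger model
        "prompt": "Tell me a bit about Google DeepMind.",
        "stream": False,           # return one JSON object instead of a token stream
    },
)
print(response.json()["response"])
```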
One of the key things you'll want to pay attention to — just like when I was talking about the Hugging Face version — is how you deal with the system prompt if you want to have one. It doesn't look like they've actually trained with a system prompt here, but it's something you can fold into the user query. If we come in here and look at the Modelfile, we can see what this actually is: we've got the model itself, and then we've got the template. The template is similar to what I showed you in the Hugging Face version, where you have start_of_turn, user, a new line, then if you're going to put in a system prompt it goes in here, followed by the normal prompt. Then you have end_of_turn, then start_of_turn model, you let the response come back, and there's an end_of_turn at the end. You can see that the stop parameters here are the start_of_turn and the end_of_turn, so both of these will stop the model from generating.

Okay, so if I come in here and do a query, you can see that Gemma by default likes to output Markdown, so you'll see a lot of things like bullet points and bolded characters. Here I'm asking it "tell me a bit about Google DeepMind" — "Sure", what they do, some key facts — and DeepMind wasn't created by these people. This is the sort of example you get with a 2 billion parameter model, where it's going to get a lot of facts wrong. With 2 billion parameter models, as I've talked about before in some videos, we're not looking for factual answers — we're looking for nice phrasing, that kind of thing. You could use this with RAG, or with something where you're really just using the language model to fix up the language and give a nice answer.

Because of this, you can try setting the system prompt in here, and you can try changing the other settings if you want — but don't forget that this particular fine-tune of the model is not really responsive to a system prompt in the same way that, say, a LLaMA 2 model is. We will see fine-tuned versions that do respond to system prompts — it'll just be a matter of time before we see a lot of different fine-tunes of Gemma out there. And probably in the next video I'll start showing you some of the things about how you can do your own fine-tunes.

Anyway, I'll leave it there for this video. I've shown you a couple of ways you can get started using Gemma that are quite simple. Ollama is definitely the way to go, I think, if you want to do something locally — I'm sure there are other things like LM Studio that you could use to run it locally as well, but I tend to use Ollama for the reasons I've shown in previous videos. As always, if you've got questions, put them in the comments below. If you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.
Info
Channel: Sam Witteveen
Views: 21,378
Keywords: Google Gemini, Gemini pro, gemini ultra, gemini 1.5, gemini 1.5 pro, gemini video analysis, 10 million tokens, google, gemini, gemini pro, gemini pro 1.5, gemini pro 1.0, gemini access, gemini pro 1.5 access, gemini 1.5 pro access, google gemini demo, gemini advanced, gemini 10 million, gemini ai, google bard, open source, Gemma, 6 trillion tokens, deepmind, 2B, 7B, keras, colab, gemma 7b, gemma 2b, hugging face, ollama, gemma.cpp
Id: 0xhZ2OhGNDg
Length: 23min 51sec (1431 seconds)
Published: Thu Feb 22 2024