Okay. So in the last video we looked at Gemma: what it is and some basics around it. In this video, I'm going to show you some simple ways that you can do inference with it. I'm going to focus on the Hugging Face way of doing inference with the Gemma models, and I'm also going to show you how to do inference locally via Ollama if you've got that installed. I'd just mention that there are some other options too. There are docs out now for using Gemma via Keras that you can go through. Another option you have is gemma.cpp, which is basically a lightweight, standalone C++ inference engine for the Gemma foundation models that has come out of Google, so that might be something you're interested in looking at. But let's get started. I'm going to go through the Hugging Face way first, and then after that I'll go through the Ollama way to run this model.
All right. So in this example, I'm going to go through how I can basically set up Gemma using the Hugging Face version. Now, you will find that this is a gated model. So when you first come in here, you're going to have to click accept, which will then take you to Kaggle, where you basically have to opt in to get permission to download the weights, like we've done before for Llama 2 and for some of the other models that have come out. You only need to do that once, and once you've done it you'll have access to all the Gemma models. Okay.
So in this case, we're going to be looking at the Gemma 7B instruction-tuned model. There is a 2 billion parameter instruction-tuned model as well, which I'll go through in Ollama; basically the way that you use both models is exactly the same, you just change the name in here. In this case, though, we're going to go for the 7 billion one. Now, what I've done is put together a notebook so that you can run this in the free version of Colab by loading the model in 4-bit. You're going to see that I've basically just got a Tesla T4 here. I'm just going to load it up, and I'm going to load the model as a 4-bit model.
To do this, we basically need to pass in a quantization config, a bits-and-bytes quantization config, in here. You can see up at the top that you need to make sure you're using the Hugging Face Transformers install from GitHub to get the latest version, you want to be using bitsandbytes, and you want to use hf_transfer for downloading the weights, which will help you download them a lot quicker; that's what we're enabling with the hub setting here.
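As a rough sketch, the setup cells look something like this. The package names are the standard ones, but the exact pins and the accelerate dependency are my assumption rather than a copy of the notebook:

```python
# Rough sketch of the Colab setup cells; the exact versions in the
# actual notebook may differ from what is shown here.
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U bitsandbytes accelerate hf_transfer

import os
# Enables the faster Rust-based downloader for the model weights.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
```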
All right. Once you've basically got this, you will also need to set a key.
you will also need to set a key. So just off the screen here in
CoLab, I've got my secrets and I've set my hugging face token. So the reason that you need
the hugging face token. is because a, because this is gated
model, it needs to know that your account is the one that's downloading it. And that you've already basically
accepted the terms, et cetera. for that. So just go into hugging face. make an HF token, stick it in the secrets
of CoLab and then just leave it there. it's thing that you just do once,
and then it'll be associated to your account and it will work every time. All right. So first off, we're just
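If you want to read that token out of the Colab secrets in code, a minimal sketch looks like this, assuming you saved the secret under the name HF_TOKEN:

```python
# Read the Hugging Face token from Colab's secrets panel and log in with it.
# Assumes the secret is stored under the name "HF_TOKEN".
from google.colab import userdata
from huggingface_hub import login

login(token=userdata.get("HF_TOKEN"))
```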
All right. So first off, we're just going to bring in the model. We're going to be using the quantization config with load_in_4bit set to true. The full 7 billion parameter model won't actually fit on a T4 unless you're loading it in 4-bit or 8-bit; if you've got an A100, though, you could load up the full-precision one here. Anyway, I'm basically just bringing this in, I'm setting low CPU memory usage, and I'm setting the device map so it just puts it on the GPU for me. And I'm bringing in the tokenizer here.
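Putting that together, the loading cell looks roughly like this. It's a sketch, assuming the standard google/gemma-7b-it repo id and that bitsandbytes and accelerate are installed; the exact config values in my notebook may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-7b-it"  # the 2B version works the same way, just a different name

# 4-bit quantization so the 7B model fits on a free-tier T4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # float16 because the T4 has no bfloat16 support
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    device_map="auto",  # puts the model on the GPU
)
```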
One of the things that's really interesting about this is that Gemma has a tokenizer with a vocabulary of 256,000 tokens, meaning that the way words are split up is quite different. This doesn't affect English as much; compare that to Llama 2, which has 32,000 tokens in its vocab. What I think I'll do is make a separate video showing you some really interesting things about the tokenizer, and showing how 6 trillion tokens through this tokenizer is probably going to be more than the same count through the Llama 2 tokenizer: for the equivalent number of tokens, you're actually getting more information.
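If you want to see that for yourself, a quick check looks something like this; the counts in the comments are approximate, from memory, not quoted from the model card:

```python
# Quick look at the size of Gemma's vocabulary and how it splits text.
from transformers import AutoTokenizer

gemma_tok = AutoTokenizer.from_pretrained("google/gemma-7b-it")
print(len(gemma_tok))  # roughly 256,000 entries; Llama 2's tokenizer has 32,000

text = "Tokenization differences matter most for non-English text."
print(gemma_tok.tokenize(text))  # see how the words actually get split up
```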
We then want to set up the roles for the chat. I'm just going to show you a simple example here, and then we're going to put it into a wrapper. In the simple example, you're going to have a chat, which is a list of messages, and those messages will either be user messages or model messages. In this case, you can see I'm just putting in a user message: what is the difference between llamas, alpacas and vicunas? It's a common one that we've asked a lot of the models. And you can see that once we convert that by applying the chat template, the instruction fine-tuning prompt template for this, we're going to get out something that starts off with start of turn, user, new line, whatever we put in, end of turn, new line, start of turn, model, new line. So if you come down here, you'll actually see what the prompt format looks like.
you'll actually see the, what the prompt format looks like. So I've got the start of turn user, I've
got the query and then we've got the end of turn and then we've got started turn
from model and then it would generate, and then it will generate out the end of
turn, which will be the stopping token. So that will be one of
the stopping tokens. And you'll see that more in the Ollama
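In code, that simple chat example looks roughly like this, using the tokenizer we loaded above. Treat the commented output as approximate; it's how I understand the Gemma format rather than a verbatim dump:

```python
# The chat is just a list of role/content dicts; Gemma only uses "user" and "model" roles.
chat = [
    {"role": "user", "content": "What is the difference between llamas, alpacas and vicunas?"},
]

# add_generation_prompt=True appends the opening of the model's turn,
# so the model knows it should start answering.
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(prompt)
# Roughly (after a leading <bos> token):
# <start_of_turn>user
# What is the difference between llamas, alpacas and vicunas?<end_of_turn>
# <start_of_turn>model
```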
All right. Once we've got that, we can basically just encode it and put it into the model. The outputs we get out of the model at this stage are just integers, the token ids. We then decode that and display it as Markdown. So you can see here what we're putting in and what we're getting out.
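The generation step itself is the usual encode, generate, decode loop, plus IPython's Markdown display so the formatting renders nicely in Colab. Something like this, again as a sketch rather than the exact notebook cell:

```python
from IPython.display import Markdown, display

# Encode the formatted prompt and move it onto the same device as the model.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The raw outputs are just token ids (integers) at this stage.
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode back to text and render it as Markdown.
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
display(Markdown(text))
```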
This starts to show something I'll talk about as we go through: you're going to see that the output of Gemma is very different from the other sorts of models that we've looked at before. All right. There are some other examples of doing this. You can actually wrap whole conversations back and forth quite easily by just adding them as objects to the chat and then passing in the full chat each time. If people are interested in that, maybe we'll look at making a chatbot example where we'd have that set up; maybe I'll do that with one of the fine-tuning examples.
All right. I've got my textwrap just for wrapping the text so it fits nicely in Colab, and then I've got my generate wrapper function. What this will do is take in the input text and the system prompt. Now, normally what we would do is have a role that is system, with the content being the system prompt. The problem is that in this particular model there is no system prompt; they don't use system prompts in either the Gemma models or the Gemini models. So to get around that, we basically have to take the system prompt and fold it into the first user prompt. You can see the first user prompt's content is going to have the system prompt, some new lines, and then the user's text input. We then apply the chat template, run it through the tokenizer to encode it, and run it through the model to generate outputs. You can play with the temperature and things like that here. I've set it up so that the max length is something you can pass into the wrapper; by default it's 512, but you can change that. Coming out of that, we basically just decode the outputs. We want to remove what we put in as the inputs; I'll show you some versions with this and without this, and I left a couple without it so you can see what would happen if we didn't have it. And then finally we wrap the text and display it as Markdown.
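Here's a sketch of what that wrapper looks like. It's not the exact code from my notebook, but it's the same idea: fold the system prompt into the first user turn, generate, strip off the prompt, wrap, and display as Markdown.

```python
import textwrap
from IPython.display import Markdown, display

def generate(input_text, system_prompt="", max_length=512):
    # Gemma has no system role, so fold the system prompt into the user turn.
    content = f"{system_prompt}\n\n{input_text}" if system_prompt else input_text
    chat = [{"role": "user", "content": content}]

    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # You could add do_sample/temperature here if you want to play with sampling.
    outputs = model.generate(**inputs, max_length=max_length)

    # Keep only the newly generated tokens so the prompt isn't echoed back.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    reply = tokenizer.decode(new_tokens, skip_special_tokens=True)

    wrapped = "\n".join(textwrap.fill(line, width=100) for line in reply.splitlines())
    display(Markdown(wrapped))
```

The calls you'll see below are then just generate(...) with different user prompts and system prompts.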
One of the interesting things is that Gemma by default outputs Markdown. So it's got this nice sort of formatting: things like bolding, things like bullet points, that kind of stuff. It can already do that because it's coming out in Markdown. So you can see here we've got generate with 'Write a detailed analogy between mathematics and a lighthouse.' I'm not going to go into the prompts too much here; I'll point out a few things as we go through, but I would really encourage you to go through and try your own ones. And then for the system prompt,
I've just put: you are Gemma, a large language model trained by Google; write out your reasoning step by step to be sure you get the right answers. And you can see, sure enough, it's giving us this step-by-step thing. So it is quite responsive to some of these things, and it's very different from Mistral in the way that it does things. You'll see that Mistral likes to do first, second, third, these kinds of things, whereas this one only tends to do the steps when we ask it for the step-by-step reasoning. So I ask it this one between mathematics and music, which again gives us an answer, and you can see the answer is cut off here because I passed in max length equals 256. You can probably get away with just setting this to about a thousand or something, and most times the stopping token will just stop it automatically when it gets to the end. We can look at the question about llamas: what's the difference between a llama, a vicuna and an alpaca? And you can see it's definitely been trained to do some more of this step-by-step thinking, this chain-of-thought style thinking.
In one of the future videos, I will look at some papers around Prompt Breeder and some of this discover-type work, and we'll look at some of the ideas Google has been thinking about for how to get these language models to do this. I have a suspicion that the new Gemini 1.5 Pro model is actually trained with that built into it, and it could be that the Gemma model is also trained with it. You can see it in this idea of working out the key reasoning steps: okay, first I need to identify the characteristics of each breed, and it gets those; then I need to compare those characteristics; then I need to identify the differences; and then I come back with a conclusion. So this is definitely an interesting way of thinking, and it's quite different. Sometimes it can be good and sometimes it's not useful. So here you can see I've asked it to
write a short email to Sam Altman giving the reasons to open-source GPT-4. It doesn't really do an email; it just gives me these reasons, and I think that's because we've got the reasoning step-by-step in there. I encourage you to play around with different system prompts, or no system prompt, and see what you get out of this. So here it does the reasoning, but we don't really get it as an email. If we don't ask for the step-by-step bit, though: you've probably seen some of my examples where I asked the exact same user prompt about the email, but we give it a different system prompt of 'you're Freddy, a young five-year-old boy who's scared AI will end the world.'
Notice there's no reasoning step-by-step in here. This one I actually output before I put in the removal of the user and model parts, so you can see we've still got that original part in there; if you went through and ran this now, you'd actually get it back without that. But you see, the interesting bit is that now we don't get step-by-step; now we get something that actually is an email. Subject: please open source GPT-4. Dear Mr. Altman, my name is Freddy, I'm five years old. I'm really worried about your company's GPT-4. I heard that it might be able to destroy the world. Okay.
So it's a little bit comical in its answers, and maybe not as nuanced as some of the other models. But when we didn't ask for the step-by-step reasoning, we're now getting an actual email out. Same with the one where we use the same user query, but the system prompt is now 'you're the Vice President of the United States.' Now we get a much more formal email out. And you'll notice that when we did this with Mistral, and even the Mistral fine-tunes, we would basically get a lot of 'okay, first' and 'then second', whereas here it's actually just a more coherent email, rather than jumping to the step-by-step stuff. Okay. When I ask it here to write out
your answer, short and succinct, for what's the capital of England, it's not so succinct: 'Sure. Here's my answer. London is the capital of England.' It would perhaps have been better if we tried it out and just said 'just give me the answer', something like that. I would encourage you to play around with these. Every model has its own way that you want to prompt it to get the best out of it; far too often people dismiss models because they don't act in the way that the OpenAI models do when you're prompting them. You really want to learn to be able to prompt different models in different ways to get results out. Okay. You can see that we're now asking it: can Geoffrey Hinton have a
conversation with George Washington? Give rationale before answering. We've got the step-by-step thing again; it goes back into the reasoning: historical context, physical presence. Okay, Washington passed away in 1799, Hinton was born in 1947; I think this is the first model to get that, and I'm pretty sure Hinton was born in '47. Communication methods. It's gone through a lot of stuff to get to this conclusion at the end. And you'll notice that one of the other nice things is that it's quite consistent in the reasoning and conclusion structure. So we could actually make a parser, and perhaps we'll look at doing this with LangChain, where we'd make a parser for this that would just give the conclusion out, or just give the answer out, but let the model do the thinking and then just give the answer back to the user.
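Just to illustrate the idea (this isn't LangChain, just a plain string-handling sketch, and it assumes the answer really does end with a 'Conclusion' heading like the ones here):

```python
def extract_conclusion(answer: str) -> str:
    # Naive sketch: return everything after the last "Conclusion" heading,
    # falling back to the full answer if the model didn't use that structure.
    marker = "conclusion"
    lowered = answer.lower()
    if marker in lowered:
        idx = lowered.rfind(marker)
        return answer[idx + len(marker):].lstrip(":*# \n")
    return answer
```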
So this is a kind of nice consistency to see in here as well. On the short story writing, I actually think this is really quite nice and interesting. The reason is that we've had plenty of models that could write a story about the koala playing pool and beating the camelids. One of the things I find interesting, and maybe this is because I originally came from Australia, is that the names are kind of Australian: Bernie, Berry, Bert, Alby. These are old-fashioned Australian names. I don't know if that's just coincidence, or if it's got the sense that okay, a story about a koala is probably going to be set in Australia, therefore the names would be similar to that. Anyway, I would encourage you to play around with the story writing stuff. There's a lot of interesting things that can be done here, and perhaps we'll look at using
this for doing some sort of editing and so on going forward as well. All right, code generation. Again, we've got some nice things here: it's remembered to import math. A lot of the models don't remember that; they give you the output and they often don't have the import. So it goes through and does this, and it also has a nice explanation and a nice usage example, and then gives you some sort of output. Here it's printing out the primes from zero to a hundred, by the looks of it. And in many ways this is very similar to the style of the actual Gemini models, the bigger Gemini models, so it's encouraging that we're getting consistency in the output. We'll be able to fine-tune for
definitely interesting in here. another example of code. I'll leave that one for you to go through. GSM8K. I got mixed results about this. So we've got a nice example here
of the cafeteria, the 23 apples. it just simplifies the steps quite
quickly to 23 minus 20 equals three step two, three plus six equals nine. Therefore cafeteria has
a total of nine apples. Perfectly correct. the babysitting question. Ah, It then goes off a little bit on
almost like it's going into latex. it's want to turn, wants to
turn this into a whole thing. So it's working out. an hourly. rate here. and it's trying to work these out as, Almost like it's working
these out as variables. and it doesn't get the right answer. it doesn't do the rounding. so it calculates this,
But it doesn't get that. Okay. We would basically have it so that it
would be $10 that she earns not $9 96. for this. the other one about the
deep sea monster rises. it doesn't do a good job here at all. So it does, it works out the 8 47 bit. but it's dividing it by three where
really it should be dividing it by seven. in here. so you can see that, that it
comes to the wrong answer. If we give it the, mathematical
If we give it the mathematical version of this, where I write it as x plus 2x plus 4x: remember, we're trying to calculate how many people it ate in the first year, and we know that it was 847 by the third year, doubling each time, so the first year is x, doubled for the second gives 2x, and doubled again gives 4x, which is where the seven comes from, since x + 2x + 4x = 7x = 847 and so x = 121. If we do this, it actually does a really nice job, and the reasoning again comes out quite well: combine like terms, isolate x on one side, divide both sides by seven, and it comes to the right answer. So that one's interesting.
All right, the last two I'm just going to go through quickly. One is about Singlish. Singlish is a hybrid form of English that some people use in Singapore, and it's interesting to see that most of the models don't know a lot about it. I was interested to see not only what it gets to, but what reasoning it uses to get there, which I think is interesting here. And then the last one is basically: can it do translation, and not just translation to a Western language? The model will probably be quite good at translation for a lot of Western languages, but if we take something like Thai, where the tokenizer actually can handle Thai quite well, we can see that it does not do a great job at the translation. It is basically saying 'tell me how to get to this road, can you help me, yes or no' kind of thing, with an 'actually can or not' sort of thing thrown in at the end.
The reason I want to include this one is not necessarily about Thai per se. One of the things I want to emphasize is that this model is trained on 6 trillion tokens. 6 trillion tokens is a huge corpus; we haven't seen models trained on this much before, and it's three times the training of what Llama 2 was. So even when they're just trying to train in English, when you've got that many tokens you will get other languages sort of slipping through in there. And this shows that even with not a lot of Thai slipping through, when you've got 6 trillion tokens, that is obviously enough coming through to do some kind of basic translation. I'm not going to say good translation, but basic translation. It is also very interesting to see that it basically thinks it has called a translate API, the Google Translate API, to actually do the translation. I find that kind of amusing: it doesn't have access to the internet, and it's made that up, but that's its reasoning for how it's gotten to this point.
Anyway, have a play with the model in the Hugging Face Colab. Of course, the Colab is in the description as always. And next up, let's have a look at how to get the model running in Ollama.
Okay, so running Gemma on Ollama is actually very simple. They've already put it into their models; you can see it right at the top here, and we can see that already 11,000 people have downloaded it. Basically, you've got the option to use either the 2B model or the 7B model. If you just do an ollama run gemma, you'll basically be using the 2 billion parameter built-in instruct model, and if you go for the 7B, you'll be using the bigger model. Downloading the 2B model is surprisingly fast; it's quite quick, and you can try it out very easily. Let's jump in and have a look at trying it out.
Okay. So if I want to run the 2 billion model, I can just come in here and do an ollama run gemma. It would basically pull it down, except I've already got it installed, and because I've got it installed I can use it straight away. If I ask it 'how are you?', remember this is a 2 billion parameter model: it will be extremely quick, but it's obviously not going to be great at a lot of tasks.
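As a side note, if you'd rather script against the local Ollama server than type into the CLI, a minimal sketch using its REST API looks like this; the server listens on port 11434 by default, and I'm assuming the 2B tag has already been pulled:

```python
import requests

# Minimal sketch: call the local Ollama server's generate endpoint.
# Assumes "ollama pull gemma:2b" (or an ollama run) has already fetched the model.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma:2b", "prompt": "How are you?", "stream": False},
)
print(resp.json()["response"])
```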
One of the key things that you will want to pay attention to, just like when I was talking about the Hugging Face version, is how you deal with the system prompt if you want to have one. It doesn't look like they've actually trained with a system prompt in here, but it's something that you can put into the user query. If we come in here and look at the
model file, we can actually see what this is. We can see we've got the actual model itself, and then we've got the template. The template is similar to what I showed you in the Hugging Face version: you're basically going to have this start of turn, user, new line, and then if you're going to put in a system prompt, you'll put it in here, followed by the normal prompt. Then you will have end of turn, then start of turn, model, and you'll basically let the response come back and have end of turn at the end. So you can see that the parameters for stopping on this are going to be the start of turn and the end of turn, and both of these will stop the model from generating.
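Reconstructed from what's shown on screen, the template section of that model file looks roughly like this; treat it as a sketch of the format rather than a verbatim copy of Ollama's published Modelfile:

```
TEMPLATE """<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"""
PARAMETER stop "<start_of_turn>"
PARAMETER stop "<end_of_turn>"
```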
Okay. So if I come in here and do a query, you can see that Gemma by default likes to output Markdown, so you'll see a lot of things like this, where you've got bullet points and you've got bolded characters. Here I'm asking it: tell me a bit about Google DeepMind. Sure: what they do, some key facts. DeepMind wasn't created by these people, though; this is the sort of example we get with a 2 billion parameter model, where it's going to get a lot of facts wrong. With 2 billion parameter models, as I've talked about before in some videos,
we're not looking for factual answers. We're looking for just nice phrasing, that kind of thing. You could use this with RAG, or with something where you're just using the language model really to fix up the language and give a nice answer. Because of this, you can try setting the system prompt in here, and you can try changing the other settings in here if you want, but don't forget that if you're trying to set the system prompt, this version of the fine-tuned model is not really responsive to a system prompt in the same way that, say, a Llama 2 model is. We will see fine-tuned versions where it will respond to the system prompts, so it'll just be a matter of time before we see a lot of different fine-tunes of Gemma out there. And probably in the next video, I will actually start showing you some of the things about how you could do your own fine-tunes for some of these things.
Anyway, I'll leave it there for this video. So I've shown you a couple of ways that you can get started using Gemma that are quite simple. Ollama is definitely the way to go, I think, if you want to do something locally; I'm sure there are other tools like LM Studio that you could use to run it locally as well, but I tend to use Ollama for a bunch of the reasons I've shown in previous videos. Anyway, as always, if you've got questions, put them in the comments below. If you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.