SPEAKER: Hi, everyone. In today's video,
we're going to learn how to use the Gemini API
in our Node.js applications. That means we're going to run
through all the basic models that are available for our use
and then showcase some scenarios of how those models
actually perform, so you can get a real taste
for the power that they have. So without any further ado,
let's jump into this tutorial, and let's set up our Gemini API. Now, before we actually start
working with the models, there are some
prerequisites in regards to being able to
use the Gemini API and specifically being
able to use the SDK. First of all, you
need to install NPM. Now, NPM is needed
because we actually will need to install
some packages to be able to run the generative AI. Secondly, you need to make
sure that your node version is 18 or higher. Now, how do you check your
node version on your machine? Well, it's very simple. All you have to do is
just type node dash v, and then you will get
your node version. Now, if your node version isn't at least 18, I highly suggest that you download the Node Version Manager, it's called NVM, and then you can play around and switch versions right there very, very easily.
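As a quick sketch, assuming nvm is already installed, checking and switching versions looks roughly like this (the version number 20 is just an example):

    node -v          # print the currently active Node.js version
    nvm install 20   # install a newer Node.js release
    nvm use 20       # switch the current shell to it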
Now, once our prerequisites are in place, the next thing we have
to do is get our API key. And where do we get that? Well, we get that from
the Google AI Studio. Now, you can go to
aistudio.google.com, and it's going to look
something like this. And then in order
to get your API key, you can just click on the Get
API Key in the top left corner, and then just create an API key. Now, because I've
already created one, I'm just going to copy
mine right from here. But if you don't
have one, you just have to go through the very simple process of creating one, which just
generates a new key for you. Now, once we have our API
key, what do we do with it? Well, we don't just put
it into our application because we have to
be security aware. What we do is, as part of
the package installation, we also install
something called dotenv. And I'll show you
that in a second. And then as part
of your project, you create a .env.local file, and in it you'll have your API key environment variable declaration, and then you'll just pass in your API key right there. That's how simple it is. Now, this will make sure that it stays secure, and then we can call and actually fetch it, and I'll show you how.
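As a sketch, the .env.local file could contain a single line like this (the variable name API_KEY is my assumption; use whatever name you plan to read back later):

    API_KEY=paste-your-api-key-here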
But first, we need to install some packages. And what packages do we need? Well, if I go back to the actual guide, you
can see that we need to install the SDK package called
@google/generative-ai. So we can just copy this,
go onto our terminal, and pass this in. But like I said, this is not the only package we need to install. We also need to install dotenv. Now, this will allow us to actually fetch our local environment variable and allow the Node.js application to actually read it. So let's go ahead and install these.
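In the terminal, that boils down to two install commands, matching the package names mentioned above:

    npm install @google/generative-ai
    npm install dotenv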
And once those are installed, we can actually initialize our Gemini API object. Now, I've created some already
pre-written files here, so it's going to be easier
to follow and teach you how to use these models. And that's one of them,
I have the geministart. Now, this is what
your initialization for using the Gemini API or
Gemini SDK should look like. First of all, we
import the dotenv, so we can actually read from
the local env variables. And then we also import our
GoogleGenerativeAI object from the
@google/generative-ai package. Now I configure and
initialize my dotenv variable. And here I declare a
new Google generative AI object passing in
our API key securely from the local env file. So now nobody knows about it, nobody sees it, and you can't commit it, which is very important.
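Putting the pieces just described together, a minimal sketch of that initialization file might look like this (the file name, the .env.local path, and the API_KEY variable name are assumptions on my part):

    // geministart.js - minimal setup sketch
    const dotenv = require("dotenv");
    const { GoogleGenerativeAI } = require("@google/generative-ai");

    // Load the key from .env.local so it never appears in the source code
    dotenv.config({ path: ".env.local" });

    // The client object that all of the later examples build on
    const genAI = new GoogleGenerativeAI(process.env.API_KEY);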
And that's about it when setting up the project. Now let's jump into each model one by one, starting with Gemini Pro, and see it in action
with some use cases. All right, so in
the first example, we're going to use
the Gemini Pro model. And as you can see, as part
of the new run function that we created here,
we declare the model. And as part of the gen AI,
we get the generative model and we pass in the model name. In this case, it's gemini-pro. Now, the gemini-pro is the
simplest model that you can use. It accepts text input,
and it will give you simple text output. Now, in this case, I wanted
to make it a little bit more creative. So rather than just
asking it, what's the weather going
to be like tomorrow or how to fold a
t-shirt, I wanted to make it a little bit more
complicated for the model. So I wrote write a sonnet
about a programmer's life but also make it rhyme. Now, for those of you that
don't know what a sonnet is, it's a poem that has 14 lines
and 10 syllables per line. So it might get
a little bit more complicated for the
gemini-pro model to create this or maybe not. And then as part of
this, I call the model, I call the generateContent function, I pass in my prompt, and it should reply to me with, well, a sonnet.
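A rough sketch of that run function, assuming the genAI object from the setup file is in scope, could look like this:

    // gemini-pro example: plain text in, plain text out
    const model = genAI.getGenerativeModel({ model: "gemini-pro" });

    async function run() {
      const prompt = "Write a sonnet about a programmer's life, but also make it rhyme.";
      const result = await model.generateContent(prompt);
      console.log(result.response.text());
    }

    run();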
So let's put this to the test, and let's run this code. I'm going to open up
my warp terminal, probably one of the best terminals I have
used so far, just a side note. And let's run node
gemini-pro.js. And the gemini-pro
is now thinking. It's running the code,
and we have an output. Well, let's check if it's 14 lines: 4, 8, 12, 14, it is. And I could only assume it's 10 syllables per line. And I cannot help myself but at
least read the first four lines. So, "in realms of code
where logic intertwines, a programmer's life a
symphony, a symphony of rhyme. With keyboards tap,
ideas take flight, as algorithms dance
in ethereal light." And then the last two lines: "So raise a glass
to programmers I pray whose tireless efforts
make our world today," which is quite true. And I think it did a great job. By the way, all the code
that you're seeing today, all of those Gemini examples,
I will post them on GitHub, so you'll be able to clone this
and try them out for yourself. Amend the prompts and
see how they work. So here we have an example
of a very simple text input with a very
simple text output or maybe a little
bit complicated. It had to understand
what a sonnet is and produce something
that actually rhymes. And now let's actually move
on to a more complex model. The next one I
want to talk about is called the gemini-pro-vision. Now, as you can see, basically,
most things are similar here. The only change is that I'm
now using a package called fs, which stands for file system. This is just so I can read
files from my computer. And the difference
is that here we're using the gemini-pro-vision. Now, what's the difference
between the gemini-pro-vision and the gemini-pro? Well, the gemini-pro-vision
is a multi-modal model. What does that mean? Well, it can not only understand and take text as input, it can also take images as input and produce a text output. So in this case, I have
prepared some tasks for it to do and see how well
it actually copes. One other thing to note here is that we have this function called fileToGenerativePart. And essentially we
pass in just the path to that specific image in this
case and the type of that image, whether it's a JPEG or PNG
and so on and so forth, so the gemini-pro-vision
model can understand it. And then we have our prompt. In this case, I have no prompt. And why do I have no prompt? Well, because the first use
case that I want to test, I composed this beautiful
mathematical function or calculation or a
Pythagorean theorem where I wanted to calculate x. Now, why did I not
include a prompt? Well, because I want to test its
capabilities of understanding the prompt I have actually
written on the piece of paper. And here it says,
given this triangle, solve for x, which
is the longest side. Now, I'm going to
pass this image in, and I actually have it here. It's called Pythagoras.JPEG. And as part of this
gemini-pro-vision file, you can see that I'm calling this fileToGenerativePart function, passing in the path and the MIME type, and I'm waiting for it to understand it, along with the prompt. Now, I have passed in no prompt because I want it to understand the prompt I've written on the page.
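Sketched out, and again assuming the genAI object from the setup file, the vision example might look roughly like this:

    // gemini-pro-vision example: image in, text out
    const fs = require("fs");

    // Turn a local image into the inline-data part the model expects
    function fileToGenerativePart(path, mimeType) {
      return {
        inlineData: {
          data: fs.readFileSync(path).toString("base64"),
          mimeType,
        },
      };
    }

    async function run() {
      const model = genAI.getGenerativeModel({ model: "gemini-pro-vision" });
      // No text prompt here: the instruction is written on the photographed page itself
      const imagePart = fileToGenerativePart("Pythagoras.JPEG", "image/jpeg");
      const result = await model.generateContent([imagePart]);
      console.log(result.response.text());
    }

    run();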
So let's run it and see how it works. Let's go back to warp. Let's clear this,
and in this case, I'll put gemini-pro-vision. And let's see if it will cope. By the way, the
answer of that is 10. So will it give me
the answer of 10? Let's have a look. To solve for x, we use
the Pythagorean Theorem. So it understood the assignment. So it understood the image. It knew that it has
to use Pythagoras. I didn't tell it to do so, which
is already very impressive. And then it gives me the
step-by-step solution on how to do it. So we know the Pythagorean theorem: c squared equals a squared plus b squared. It passes in all those values. It does the
calculation then does the square root of the final
answer to give x equals 10. Now, that is
extremely impressive for a multimodal model
to just understand what I've written
on a piece of paper. I wish that was here
with me when I was still at school because
oh, my goodness, it makes things so much easier. I have to do these things
myself without the calculator. Here, it just does it by
reading and understanding what you've drawn on the
page, very, very amazing. OK, now on to the second
use case I want to try here: I have two images. I took a picture of
a page and a pen. That means me holding
a pen and writing pen on an actual piece of paper. And then me holding
a pen in my hand and seeing how it interprets
the differences between the two. So if I go back to
gemini-pro-vision, rather than passing one fileToGenerativePart,
I'm going to have to pass two of those functions. I'm just going to
copy them both. And I'm just going
to pass them in. And then rather than
saying Pythagoras here, I'm going to say just pen. And then the other one
is called page and pen. All right, and now I am going to pass in a prompt this time. And I'm going to say, what is the difference between the two images? And we'll see what it actually comes back with.
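Inside that same run function, the changed call could look something like this (the two file names here are my assumption):

    // Compare two images with an explicit text prompt
    const prompt = "What is the difference between the two images?";
    const imageParts = [
      fileToGenerativePart("pen.jpeg", "image/jpeg"),
      fileToGenerativePart("page-and-pen.jpeg", "image/jpeg"),
    ];
    const result = await model.generateContent([prompt, ...imageParts]);
    console.log(result.response.text());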
So let me go back, not to iterm, but to warp. And let me clear this and run
the gemini-pro-vision model one more time with those two
images as prompts right now and an actual text prompt. And it says, the first image
is of a hand holding a pencil. Interesting. The second image is of the same
hand holding the same pencil, but the pencil is now writing on
a piece of paper, interesting. I kind of can't fault it for that, because this specific pen is very weird. It could have interpreted it as a pencil considering its color, because it's brass and metal. I'm not sure why it's saying
writing on a piece of paper. I wanted it to understand what's written on the paper. So let's change the prompt to just that: understand what's written on the paper. I'm interested. So let's go back to warp. Let's run it one
more time and see what the output is now that I've
made the prompt a little bit more clear. The first image is of
a hand holding a pen. The second image is of a
hand holding the same pen, but the pen is being
used to write the word pen on the piece of paper. Wow, look at that. So suddenly now, with this
slightly more accurate prompt where I said, well, look at
what's written on the paper. Now it knows it's a pen. Now it knows I've written pen. So now it had enough information
to be able to, essentially, understand the two images
and tell the difference. That is actually
very, very impressive. And it leads me
to think how this could be used in so many
very interesting ways. And I'll let you think
of those yourself. Anyhow, now this is
very interesting, but how can we take
these two models further? How can we use them to do
something more exciting? Perhaps, how can we use them to
build a multi-turn conversation, which means that
I write to Gemini, and Gemini answers
me back, and then it remembers everything
I've asked Gemini so far. And based on all of
that information, it can give me more outputs. So let's test that. Well, I have this
thing pre-written here, and I've called it gemini-chat,
and as with any Gemini model that we've worked with so
far, we initialize the object. We pass in the key. We add all of the imports
that are necessary. And in this case here, we have
another import called readline. This is just so I can actually
write things in my terminal and work in the terminal
and have a conversation inside of the terminal. And we have the
run function here where I'm using the
gemini-pro model. Now, why am I using gemini-pro? Because I'm only passing in
text, and I'm expecting text back. And actually, for these multi-turn conversations, at this moment in time, Gemini doesn't support multimodal input, which means
it doesn't support images. It's just text only. Now, as part of this
model declaration here, we have a startChat
function, and that function takes an object, which inside
of it, has the history. Now, the history is the
history of the conversation that we currently have. So you can already pass in an
existing conversation for it to interpret. In this case, I'm
having an empty array because I want the
conversation to be fresh, and then we have the
maximum output tokens, which means how long do
we want that output of Gemini to be? And then we have an asynchronous
function telling us, well, OK, I'm going to type as myself
in the terminal, passing the message. And then once it has my message, it's going to take that message that I've written and send it to
Gemini, which it's then going to add into the history of
the chat, and then reply to me and also add that reply
to the history of the chat. It all happens in the background. It's all magic. It works out of the box.
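A stripped-down sketch of that gemini-chat file, again assuming the genAI object from the setup and with the maxOutputTokens value picked arbitrarily, might look like this:

    // gemini-chat.js - multi-turn conversation sketch
    const readline = require("readline");

    const model = genAI.getGenerativeModel({ model: "gemini-pro" });

    // startChat keeps the running history for us; it starts out empty here
    const chat = model.startChat({
      history: [],
      generationConfig: { maxOutputTokens: 500 },
    });

    const rl = readline.createInterface({
      input: process.stdin,
      output: process.stdout,
    });

    function ask() {
      rl.question("You: ", async (message) => {
        // Each message and reply is appended to the chat history behind the scenes
        const result = await chat.sendMessage(message);
        console.log("Gemini:", result.response.text());
        ask(); // keep the conversation going
      });
    }

    ask();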
Now, what I want to do is actually have a chat with Gemini and see how well it
remembers that conversation. So usually with a
normal text input, you'd just write one thing. It will answer another. It will not remember what
you have written before. So let's go back to
our warp terminal. Let's clear everything before
and run the Gemini chat. And as you can see, I
have my prompt, you. That's me. And I want to ask a question. So I'll just be like,
hey, Gemini, how are you? And let's wait for the reply. As an AI language model,
I am not a Gemini. I do not have
feelings or emotions. I'm designed to understand
and generate human language and provide information. So maybe we can use this
to learn something new. And I'm just trying
to think of something that I really don't know how
it happens or what happens. How about, how is
olive oil made? Give me the three main steps. I have absolutely no idea
where that came from. It's completely random. I actually don't know
how olive oil is made. And here we have an answer. It says harvesting, extraction,
and separation and purification. Now, I want to test the
actual history aspect of this. Because now maybe I can refer
to something that Gemini said, and it can interpret it. But even better, without telling Gemini anything else, I can just say, well, Gemini, how about you translate
this for me into Spanish. So let's do that. Can you translate this
for me into Spanish? And I spelled translate wrong,
but surely it should understand. And if it has a history of
the log and that conversation, it should do it with no problem. And it does. And as you can see, this is a
very nice example right here of how this chat
history is being stored and how you can have that
back and forth conversation with a language model
and use it to do a lot of really cool things. But I see one limitation here. And that limitation is that
any time I pass in a prompt, I have to wait a
bit for it to reply. It's not instantaneous. Obviously, it takes time for
this data to be processed and for it to create an answer. So can we improve
that in any way? And the answer is
yes, yes, we can. And we can use streaming for
those faster interactions, which is already built
into the Gemini SDK. What does streaming mean? It means that you don't
wait for the full answer, but as soon as some part
of the answer is available, it will throw it back at you. And it will throw it back at you
until it has returned the full thing, so you don't have to wait
for the whole answer. You can just start
reading as it's producing that
output, which makes it much more human
friendly and much more feel like a real-life
interaction. And I guess that's
what all AI is starting to wander toward: making it feel like a human interaction. So in this case,
well, let's do that. Now, I already have
that pre-written. And I called it
gemini-streaming. Now, this piece of
code is much longer, and I'll let you understand
it once you clone it. But I'll just go
over the main things. In this case, we're still
using the gemini-pro, so the basic input text
output language model. We're still using the
startChat with a history. But one thing that
we're adding here, is something called
sendMessageStream. Now, this is just a function, part of the chat, where we're not just streaming the answer to a one-off, hey, this is my question, give me the answer in sections; we're doing the streaming as part of a multi-turn conversation.
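Inside the same kind of question loop as before, the streaming version swaps sendMessage for sendMessageStream; a sketch of just that part:

    // Print each chunk as soon as it arrives instead of waiting for the full reply
    const result = await chat.sendMessageStream(message);
    for await (const chunk of result.stream) {
      process.stdout.write(chunk.text());
    }
    process.stdout.write("\n");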
And of course, we have all these things to make sure that the prompt doesn't disappear and that we wait
for the whole answer to come back before we have
an option to speak ourselves. So, well, let's test it out
and test out Gemini streaming. So let's go back into
our warp terminal. Let's clear this. Let's exit the
current conversation that we're having, and
let's start a new one. So we'll have gemini
not chat, but streaming. So it's chat with
real-life responses. And then here is my prompt. And I'll say, hey,
Gemini, how are you? And what you should notice
is that those responses come back much faster. And as you can see, it wasn't
just AI answering us once, as a big prompt, but
it gave us one reply, then it gave us
another reply which is the remaining of the sentence. And then the third
reply, which is the final part of the sentence. How may I assist you today? I would like a basic
Spanish lesson, please. All right, I'm not sure
what it's going to produce. But it should come back
with specific steps, and it should be much faster. So you can see it coming
back, periodically, with more and more
and more information. So it's not just throwing
the whole chunk at you, but it's giving you a little
bit and a little bit more and a little bit more. As soon as it processes a
specific desirable section, it'll send it back to you. And I've got a basic Spanish
lesson por favor, please, gracias thank you, de
nada you're welcome, and so on and so forth. So it's great. And you can use Gemini
to learn a lot of things. You can use this as
your language tutor. So you can have a discussion. You can ask for things. You can try to write things. It's going to
probably correct you, and you can have this nice
back and forth conversation. I think I'm going to
leave it here for now. I think we've looked at
two main Gemini AI models, but we looked at
many ways to use them and many
different scenarios that you can use them in. And of course, if you like this
video, make sure you like it and make sure you tell me
down below in the comments how I should use these models. Because what I was
thinking is that maybe we can use these models to
actually write some nice UI, maybe create a whole app that
makes use of all of these to make the user experience of
a front-end application much, much nicer. So it could be something
that would be cool. So if you guys are
interested, make sure you tell me down below. But for now, I hope that
this was useful information. I hope you learned something. I hope you learned
how you can work with the Gemini API in Node.js. And I'd love to
see you guys try. And I'd love to hear what
you think of the Gemini API. But for now, as always, I'll
see you in the next video. [MUSIC PLAYING]