Master the Gemini API: A Node.js tutorial with real examples

Captions
SPEAKER: Hi, everyone. In today's video, we're going to learn how to use the Gemini API in our Node.js applications. That means we're going to run through all the basic models that are available for our use and then showcase some scenarios showing how those models actually perform, so you can get a real taste for the power that they have. So without any further ado, let's jump into this tutorial, and let's set up our Gemini API.

Now, before we actually start working with the models, there are some prerequisites in regards to being able to use the Gemini API, and specifically being able to use the SDK. First of all, you need to install NPM. Now, NPM is needed because we will actually need to install some packages to be able to run the generative AI SDK. Secondly, you need to make sure that your Node version is 18 or higher. Now, how do you check your Node version on your machine? Well, it's very simple. All you have to do is type node -v, and then you will get your Node version. Now, if your Node version isn't 18 or higher, I highly suggest that you download the Node Version Manager, it's called NVM, and then you can play around and switch versions right there very, very easily.

Now, once our prerequisites are in place, the next thing we have to do is get our API key. And where do we get that? Well, we get that from Google AI Studio. You can go to aistudio.google.com, and it's going to look something like this. And then in order to get your API key, you can just click on Get API Key in the top left corner, and then just create an API key. Now, because I've already created one, I'm just going to copy mine right from here. But if you don't have one, you just have to go through the very simple process of creating one, which generates a new key for you.

Now, once we have our API key, what do we do with it? Well, we don't just put it into our application, because we have to be security aware. What we do is, as part of the package installation, we also install something called dotenv. And I'll show you that in a second. And then as part of your project, you create a .env.local file, you add your API key environment variable declaration, and then you just pass in your API key right there. That's how simple it is. Now, this will make sure that it's stored securely, and then we can call and actually fetch it, and I'll show you how.

But we need to install some packages. And what packages do we need? Well, if I go back to the actual guide, you can see that we need to install the SDK package called @google/generative-ai. So we can just copy this, go onto our terminal, and paste this in. But like I said, this is not the only package we need to install. We also need to install dotenv. Now, this will allow us to actually fetch our local environment variable and allow the Node.js application to read it. So let's go ahead and install these.

And once those are installed, well, we can actually initialize our Gemini API object. Now, I've created some pre-written files here, so it's going to be easier to follow and teach you how to use these models. And this is one of them, gemini-start. Now, this is what your initialization for the Gemini API, or the Gemini SDK, should look like. First of all, we import dotenv, so we can actually read from the local env variables. And then we also import our GoogleGenerativeAI object from the @google/generative-ai package. Then I configure and initialize dotenv.
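For reference, a minimal sketch of what such a setup file might look like, including the client declaration covered next, is below. The file name, the .env.local path, and the API_KEY variable name are assumptions; adjust them to match your own project.

```js
// gemini-start.js -- minimal initialization sketch (file and variable names are illustrative)
import dotenv from "dotenv";
import { GoogleGenerativeAI } from "@google/generative-ai";

// Read variables from .env.local so the key never appears in source control
dotenv.config({ path: ".env.local" });

// Create the client with the key pulled securely from the environment
const genAI = new GoogleGenerativeAI(process.env.API_KEY);
```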
And here I declare a new GoogleGenerativeAI object, passing in our API key securely from the local env file. So now nobody knows about it, nobody sees it, and you can't commit it, which is very important. And that's about it when setting up the project. Now let's jump into each model one by one, starting with Gemini Pro, and see it in action with some use cases.

All right, so in the first example, we're going to use the Gemini Pro model. And as you can see, as part of the run function that we created here, we declare the model. On the genAI object, we get the generative model, and we pass in the model name. In this case, it's gemini-pro. Now, gemini-pro is the simplest model that you can use. It accepts text input, and it will give you simple text output. Now, in this case, I wanted to make it a little bit more creative. So rather than just asking it what the weather is going to be like tomorrow or how to fold a t-shirt, I wanted to make it a little bit more complicated for the model. So I wrote, write a sonnet about a programmer's life, but also make it rhyme. Now, for those of you that don't know what a sonnet is, it's a poem that has 14 lines and 10 syllables per line. So it might get a little bit more complicated for the gemini-pro model to create this, or maybe not. And then, as part of this, I call the model's generateContent function, I pass in my prompt, and it should reply by giving me a sonnet.

So let's put this to the test, and let's run this code. I'm going to open up my Warp terminal, probably one of the best terminals I have used so far, just a side note. And let's run node gemini-pro.js. And gemini-pro is now thinking. It's running the code, and we have an output. Well, let's check if it's 14 lines: 4, 8, 12, 14. And I can only assume it's at 10 syllables per line. And I cannot help myself but at least read the first four lines. So, "in realms of code where logic intertwines, a programmer's life a symphony, a symphony of rhyme. With keyboards tap, ideas take flight, as algorithms dance in ethereal light." And then the last two lines: "So raise a glass to programmers I pray whose tireless efforts make our world today," which is quite true. And I think it did a great job. By the way, all the code that you're seeing today, all of those Gemini examples, I will post them on GitHub, so you'll be able to clone this and try them out for yourself. Amend the prompts and see how they work.

So here we have an example of a very simple text input with a very simple text output, or maybe a little bit complicated. It had to understand what a sonnet is and produce something that actually rhymes. And now let's actually move on to a more complex model. The next one I want to talk about is called gemini-pro-vision. Now, as you can see, basically most things are similar here. The only change is that I'm now using a package called fs, which stands for file system. This is just so I can read files from my computer. And the difference is that here we're using gemini-pro-vision. Now, what's the difference between gemini-pro-vision and gemini-pro? Well, gemini-pro-vision is a multimodal model. What does that mean? Well, it can not only understand and take text as input, it can also take images as input and produce a text output. So in this case, I have prepared some tasks for it to do, and we'll see how well it actually copes.
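Before moving on to the vision examples, here is a rough sketch of what the gemini-pro text example just described could look like. The .env.local path and the API_KEY variable name carry over from the setup step and are assumptions; error handling is omitted.

```js
// gemini-pro.js -- text in, text out (sketch)
import dotenv from "dotenv";
import { GoogleGenerativeAI } from "@google/generative-ai";

dotenv.config({ path: ".env.local" });
const genAI = new GoogleGenerativeAI(process.env.API_KEY);

async function run() {
  // gemini-pro: the basic text-to-text model
  const model = genAI.getGenerativeModel({ model: "gemini-pro" });

  const prompt = "Write a sonnet about a programmer's life, but also make it rhyme.";

  const result = await model.generateContent(prompt);
  const response = await result.response;
  console.log(response.text());
}

run();
```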
One other thing to note here is that we have this function called fileToGenerativePart. Essentially, we pass in the path to that specific image and the type of that image, whether it's a JPEG or a PNG and so on and so forth, so the gemini-pro-vision model can understand it. And then we have our prompt. In this case, I have no prompt. And why do I have no prompt? Well, because for the first use case that I want to test, I composed this beautiful mathematical calculation, a Pythagorean theorem problem, where I want to calculate x. Now, why did I not include a prompt? Well, because I want to test its capabilities of understanding the prompt I have actually written on the piece of paper. And here it says, given this triangle, solve for x, which is the longest side. Now, I'm going to pass this image in, and I actually have it here. It's called Pythagoras.JPEG. And as part of this gemini-pro-vision example, you can see that I'm calling this fileToGenerativePart, passing in the path, passing in the MIME type, and I'm waiting for it to understand it, along with the prompt. Now, I have passed in no prompt because I want it to understand the prompt I've written on the page.

So let's run it and see how it works. Let's go back to Warp. Let's clear this, and in this case, I'll run gemini-pro-vision. And let's see if it will cope. By the way, the answer to that is 10. So will it give me the answer of 10? Let's have a look. To solve for x, we use the Pythagorean theorem. So it understood the assignment. It understood the image. It knew that it had to use Pythagoras. I didn't tell it to do so, which is already very impressive. And then it gives me the step-by-step working on how to do it. So we know the Pythagorean theorem: c squared equals a squared plus b squared. It passes in all those values, does the calculation, then takes the square root of the final answer to give x equals 10. Now, that is extremely impressive for a multimodal model, to just understand what I've written on a piece of paper. I wish that had been around when I was still at school, because, oh my goodness, it makes things so much easier. I had to do these things myself without a calculator. Here, it just does it by reading and understanding what you've drawn on the page. Very, very amazing.

OK, now on to the second use case I want to try here. I have two images. I took a picture of a page and a pen, meaning me holding a pen and writing the word pen on an actual piece of paper, and then one of me just holding a pen in my hand, and I want to see how it interprets the differences between the two. So if I go back to gemini-pro-vision, rather than passing one fileToGenerativePart, I'm going to have to pass two of those calls. I'm just going to copy them both and pass them in. And then rather than saying Pythagoras here, I'm going to say just pen. And the other one is called page and pen. All right, and this time I'm going to pass in a prompt. And I'm going to say, what is the difference between the two images? And see what it actually comes back with.

So let me go back, not to iTerm, but to Warp. And let me clear this and run the gemini-pro-vision model one more time, with those two images and an actual text prompt. And it says, the first image is of a hand holding a pencil. Interesting. The second image is of the same hand holding the same pencil, but the pencil is now writing on a piece of paper. Interesting. I kind of don't want to take credit away from it, because this specific pen is very weird.
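A sketch of the gemini-pro-vision example with two images and a text prompt might look roughly like this. The image file names here are taken from the walkthrough but are assumptions and may differ in the actual repository; the helper wraps each file as the inline-data part the SDK expects.

```js
// gemini-pro-vision.js -- images + text in, text out (sketch)
import fs from "fs";
import dotenv from "dotenv";
import { GoogleGenerativeAI } from "@google/generative-ai";

dotenv.config({ path: ".env.local" });
const genAI = new GoogleGenerativeAI(process.env.API_KEY);

// Wrap a local image file as an inlineData part the model can consume
function fileToGenerativePart(path, mimeType) {
  return {
    inlineData: {
      data: Buffer.from(fs.readFileSync(path)).toString("base64"),
      mimeType,
    },
  };
}

async function run() {
  // gemini-pro-vision: multimodal, accepts images alongside text
  const model = genAI.getGenerativeModel({ model: "gemini-pro-vision" });

  const prompt = "What is the difference between the two images?";
  const imageParts = [
    fileToGenerativePart("pen.jpeg", "image/jpeg"),           // file names are assumptions
    fileToGenerativePart("page-and-pen.jpeg", "image/jpeg"),
  ];

  const result = await model.generateContent([prompt, ...imageParts]);
  const response = await result.response;
  console.log(response.text());
}

run();
```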
So it could have interpreted it as a pencil considering its color, because it's brass and it's metal. I'm not sure why it's saying writing on a piece of paper, though. I wanted it to understand what's written on the paper. So let's make that the prompt: understand what's written on the paper. Let's try this prompt. I'm interested. So let's go back to Warp. Let's run it one more time and see what the output is, now that I've made the prompt a little bit more clear. The first image is of a hand holding a pen. The second image is of a hand holding the same pen, but the pen is being used to write the word pen on the piece of paper. Wow, look at that. So suddenly, with this slightly more accurate prompt where I said, well, look at what's written on the paper, now it knows it's a pen. Now it knows I've written pen. So now it had enough information to be able to, essentially, understand the two images and tell the difference. That is actually very, very impressive. And it leads me to think how this could be used in so many very interesting ways. And I'll let you think of those yourself.

Anyhow, this is very interesting, but how can we take these two models further? How can we use them to do something more exciting? Perhaps, how can we use them to build a multi-turn conversation, which means that I write to Gemini, and Gemini answers me back, and then it remembers everything I've asked it so far, and based on all of that information, it can give me more outputs. So let's test that. Well, I have this pre-written here, and I've called it gemini-chat. And as with any Gemini model that we've worked with so far, we initialize the object, we pass in the key, and we add all of the imports that are necessary. In this case, we have another import called readline. This is just so I can actually type things in my terminal and have a conversation inside the terminal. And we have the run function here, where I'm using the gemini-pro model. Now, why am I using gemini-pro? Because I'm only passing in text, and I'm expecting text back. And actually, for these multi-turn conversations, at this moment in time, Gemini doesn't support multimodal input, which means it doesn't support images. It's text only.

Now, as part of this model declaration here, we have a startChat function, and that function takes an object which, inside of it, has the history. Now, the history is the history of the conversation that we currently have. So you can already pass in an existing conversation for it to interpret. In this case, I'm using an empty array because I want the conversation to be fresh. And then we have the max output tokens, which controls how long we want Gemini's output to be. And then we have an asynchronous function where I type my message in the terminal, and once I've sent the message, it takes that message that I've written and sends it to Gemini, which then adds it to the history of the chat, replies to me, and also adds that reply to the chat history. It all happens in the background. It's all magic. It works out of the box.

Now, what I want to do is actually have a chat with Gemini and see how well it remembers that conversation. Usually with a normal text input, you'd just write one thing, it would answer, and it would not remember what you had written before. So let's go back to our Warp terminal, clear everything, and run the Gemini chat.
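A simplified sketch of the multi-turn chat described above, with a minimal readline loop standing in for the terminal conversation shown in the video. The "exit" quit command, the token limit value, and the env variable name are assumptions.

```js
// gemini-chat.js -- multi-turn conversation (sketch)
import readline from "node:readline/promises";
import dotenv from "dotenv";
import { GoogleGenerativeAI } from "@google/generative-ai";

dotenv.config({ path: ".env.local" });
const genAI = new GoogleGenerativeAI(process.env.API_KEY);

async function run() {
  const model = genAI.getGenerativeModel({ model: "gemini-pro" });

  // Empty history: start a fresh conversation; the SDK records each turn for us
  const chat = model.startChat({
    history: [],
    generationConfig: { maxOutputTokens: 500 }, // cap on the length of each reply
  });

  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });

  while (true) {
    const msg = await rl.question("You: ");
    if (msg === "exit") break; // quit command is an assumption for this sketch

    const result = await chat.sendMessage(msg);
    const response = await result.response;
    console.log("Gemini:", response.text());
  }

  rl.close();
}

run();
```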
And as you can see, I have my prompt, You. That's me. And I want to ask a question. So I'll just be like, hey, Gemini, how are you? And let's wait for the reply. As an AI language model, I am not a Gemini. I do not have feelings or emotions. I'm designed to understand and generate human language and provide information. So maybe we can use this to learn something new. And I'm just trying to think of something where I really don't know how it happens or what happens. How about, how is olive oil made? Give me the three main steps. I have absolutely no idea where that came from. It's completely random. I actually don't know how olive oil is made. And here we have an answer. It says harvesting, extraction, and separation and purification.

Now, I want to test the actual history aspect of this, because now maybe I can refer to something that Gemini said, and it can interpret it. But even better, I can avoid telling Gemini anything specific and just say, well, Gemini, how about you translate this for me into Spanish? So let's do that. Can you translate this for me into Spanish? And I spelled translate wrong, but surely it should understand. And if it has the history of that conversation, it should do it with no problem. And it does. As you can see, this is a very nice example right here of how this chat history is being stored and how you can have that back-and-forth conversation with a language model and use it to do a lot of really cool things.

But I see one limitation here. And that limitation is that any time I pass in a prompt, I have to wait a bit for it to reply. It's not instantaneous. Obviously, it takes time for this data to be processed and for it to create an answer. So can we improve that in any way? And the answer is yes, yes, we can. We can use streaming for those faster interactions, which is already built into the Gemini SDK. What does streaming mean? It means that you don't wait for the full answer; as soon as some part of the answer is available, it will throw it back at you. And it will keep throwing parts back at you until it has delivered the full thing, so you don't have to wait for the whole answer. You can just start reading as it's producing that output, which makes it much more human friendly and feel much more like a real-life interaction. And I guess that's where all AI is starting to wander, toward making it feel like a human interaction.

So in this case, well, let's do that. Now, I already have that pre-written, and I called it gemini-streaming. Now, this piece of code is much longer, and I'll let you understand it once you clone it. But I'll just go over the main things. In this case, we're still using gemini-pro, so the basic text-input, text-output language model. We're still using startChat with a history. But one thing that we're adding here is something called sendMessageStream. Now, this is just a function on the chat where we're not just doing plain streaming, as in, hey, this is my question, give me the answer in sections; we're doing this as part of a multi-turn conversation. And of course, we have all these things to make sure that the prompt doesn't disappear and that we wait for the whole answer to come back before we get the option to speak ourselves. So, well, let's test it out and test out Gemini streaming. Let's go back into our Warp terminal. Let's clear this, exit the current conversation that we're having, and start a new one. So we'll run gemini, not chat, but streaming.
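And a sketch of the streaming variant, which swaps sendMessage for sendMessageStream and prints each chunk as soon as it arrives. As before, the file name, quit command, and env variable name are assumptions, and the real gemini-streaming file in the video is described as longer than this.

```js
// gemini-streaming.js -- multi-turn chat with streamed replies (sketch)
import readline from "node:readline/promises";
import dotenv from "dotenv";
import { GoogleGenerativeAI } from "@google/generative-ai";

dotenv.config({ path: ".env.local" });
const genAI = new GoogleGenerativeAI(process.env.API_KEY);

async function run() {
  const model = genAI.getGenerativeModel({ model: "gemini-pro" });
  const chat = model.startChat({ history: [] });
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });

  while (true) {
    const msg = await rl.question("You: ");
    if (msg === "exit") break; // quit command is an assumption for this sketch

    // Print each chunk of the reply as it streams in, instead of waiting for the full answer
    const result = await chat.sendMessageStream(msg);
    for await (const chunk of result.stream) {
      process.stdout.write(chunk.text());
    }
    process.stdout.write("\n");
  }

  rl.close();
}

run();
```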
So it's chat with real-time responses. And then here is my prompt. And I'll say, hey, Gemini, how are you? And what you should notice is that those responses come back much faster. And as you can see, it wasn't just the AI answering us once, as one big block; it gave us one reply, then another reply, which is the rest of the sentence, and then a third reply, which is the final part of the sentence. How may I assist you today? I would like a basic Spanish lesson, please. All right, I'm not sure what it's going to produce. But it should come back with specific steps, and it should be much faster. So you can see it coming back, periodically, with more and more and more information. It's not just throwing the whole chunk at you; it's giving you a little bit, and a little bit more, and a little bit more. As soon as it processes a specific section, it'll send it back to you. And I've got a basic Spanish lesson: por favor, please; gracias, thank you; de nada, you're welcome; and so on and so forth. So it's great. And you can use Gemini to learn a lot of things. You can use this as your language tutor. You can have a discussion, you can ask for things, you can try writing things, it's probably going to correct you, and you can have this nice back-and-forth conversation.

I think I'm going to leave it here for now. We've looked at two main Gemini AI models, but we looked at many ways to use them and many different scenarios that you can use them in. And of course, if you liked this video, make sure you like it and tell me down below in the comments how I should use these models. Because what I was thinking is that maybe we can use these models to actually build some nice UI, maybe create a whole app that makes use of all of these to make the user experience of a front-end application much, much nicer. That could be something cool. So if you guys are interested, make sure you tell me down below. But for now, I hope that this was useful information. I hope you learned something. I hope you learned how you can work with the Gemini API in Node.js. And I'd love to see you guys try, and I'd love to hear what you think of the Gemini API. But for now, as always, I'll see you in the next video. [MUSIC PLAYING]
Info
Channel: Google for Developers
Views: 29,552
Keywords: Google, developers
Id: Z8F6FvMrN4o
Length: 20min 37sec (1237 seconds)
Published: Thu Mar 28 2024