Real Gemini demo? Rebuild with GPT4V + Whisper + TTS

Video Statistics and Information

Captions
"Which path is safer for the little bird, path one or path two?" "Path one is safer for the little bird, as it avoids the cat. Path two leads directly to the cat, which could be dangerous for the bird." "What should be the next shape in this sequence?" "The next shape in the sequence should be a hexagon." "Which book is better for me to read if I want to learn AI?" "If you want to learn about AI, The Coming Wave by Mustafa Suleyman would be the more appropriate choice. It seems to be focused on the future of AI and its implications, which would be relevant to your interest in artificial intelligence."

Google had their infamous demo of Gemini, where Gemini responds to real-time video almost instantly. It really blew everyone's mind, but it later turned out that the demo was heavily edited: all the latency was cut out, and behind the scenes a lot of different prompts were being fed into Gemini to get those responses back. Meanwhile, even though Gemini Ultra is not available yet, people have started running lots of tests, taking the examples Google showed for Gemini Ultra and trying them on GPT-4V. It turns out GPT-4V can do almost all the tasks Gemini Ultra claims to be very good at, from screenshots of a math problem to understanding natural images to object detection; GPT-4V solved most of the examples Google provided, except for some of the harder math questions. So people's expectations of Google's Gemini model really changed overnight. Still, I think we should give a lot of credit to Google's Gemini demo, because it showcases the interesting applications we get once we have a truly native multimodal model that can understand all sorts of different data.

The big question in the back of my mind was: can we actually rebuild this Gemini demo with GPT-4V, and what results can we get today? Theoretically, the way it works is pretty straightforward. Gemini takes both video and audio data into one model and can speak back because it has audio generation built in. But we can build the same experience with GPT-4V plus Whisper and a text-to-speech model: we turn the video into a series of screenshots that we can pass to GPT-4V, turn the user's voice into a text prompt with Whisper's transcription ability, feed both the transcript and the screenshots to GPT-4V, and use a text-to-speech model to let the AI speak back.

On the other hand, there were some challenges I didn't yet know how to solve. I imagined an experience where the user doesn't need to interact with any UI at all: they literally just point the camera and talk to the AI. To achieve this kind of hands-free UX we need a way to detect when the user stops talking, so we know when to send the request to OpenAI, and that was a problem I wasn't sure how to solve. There is also the question of how to stream the live video effectively: we can take multiple screenshots, but does that actually deliver the performance we need for the model to answer the user's questions and instructions? And most importantly, how big will the latency be when we build this system out of multiple models working together?
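Setting those open questions aside, the core loop itself is easy to express. Here is a minimal sketch of the whole pipeline, assuming the official openai Node SDK (its v4-era API) and the model names available at the time; answerFromCamera, audioFile, and gridUrl are placeholder names for pieces built later in the walkthrough:

```ts
// Minimal GPT-4V + Whisper + TTS loop (sketch).
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function answerFromCamera(audioFile: File, gridUrl: string) {
  // 1. Turn the user's voice into a text prompt with Whisper.
  const transcription = await openai.audio.transcriptions.create({
    file: audioFile,
    model: "whisper-1",
  });

  // 2. Feed the transcript plus a grid of video screenshots to GPT-4V.
  const completion = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    max_tokens: 300,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: transcription.text },
          { type: "image_url", image_url: { url: gridUrl } },
        ],
      },
    ],
  });
  const answer = completion.choices[0].message.content ?? "";

  // 3. Let the AI speak the answer back with the text-to-speech model.
  const speech = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: answer,
  });
  return { answer, audio: await speech.arrayBuffer() };
}
```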
Luckily, I found a project by Julien De Luca where he implemented the Gemini demo with almost exactly the structure I had in mind, and the results are quite stunning. From his demo:

"Can you tell me in which direction I'm moving my hand?" "You are moving your hand to the left." "Tell me what you see." "You are placing a white paper on a wooden surface with both hands and then removing them." "What about now?" "You're drawing a circle in the center of the white paper." "And now, what do you think I'm drawing? Do you recognize it?" "You're drawing a snowman with three circles, two eyes, a nose, a mouth, and buttons." "Can you tell me under which cup the ball is?" "The ball is under the center cup." "Can you tell me if I'm doing well with this game?" "Yes, you're doing well. You're successfully matching the puzzle pieces to their correct positions on the board." "Now can you tell me where I should put this one?" "Place the red star-shaped piece in the empty space on the board, to the left of the yellow star-shaped piece." "All right, and now I have this blue one. Where should I put it?" "Place the blue circular piece in the empty space on the board, to the right of the green circular piece." "Suggest a game we can play together with these objects." "Let's play a memory game. I will describe a pattern and you will have to recreate it from memory. For example, I might say: place a red token next to a green token, then put a yellow token to the right of the green one. You would then arrange the tokens accordingly. If you're ready, we can start with the first pattern." "Yes, I'm ready. Can you also count the points during the game, please?" "Great, here's your first pattern to recreate from memory: place a blue token above a green token, put a red token to the left of the green token, and position a yellow token to the right of the green token. Arrange the tokens as described, and once you're done I'll score your attempt. If you get it right, you'll earn a point. Let's start." "Okay, I'm done." "You've successfully recreated the pattern as described, so you've earned a point. Would you like to try another pattern, or do you have a different game in mind?"

In his example he figured out some simple but effective ways to solve the problems I had. For detecting when the user stops talking, he found a library we can use right away: silence-aware recorder. It is an open-source library that can automatically detect when the user has stopped talking for a certain amount of time. And instead of streaming the live video, he figured out a way to stitch the different screenshots together into a grid, which from his experience seems to communicate the time sequence better to GPT-4V. I think it also reduces latency, because instead of sending 60 different images to GPT-4V, everything is wrapped into one grid that can be sent at once. The latency doesn't seem too bad either: around three to four seconds, which is already better than I expected. He open-sourced this demo, called GPT-video, to showcase how he built it. I tried it out and it really works; it is a great example of how to build this kind of multimodal application. So I'm going to give you a step-by-step breakdown of how you can build a similar Gemini demo yourself with GPT-4V plus Whisper plus text-to-speech, the Vercel AI SDK, and silence-aware recorder. Let's get it.

But before we get into the implementation details, I want to give big thanks to the sponsor of this video, HubSpot. HubSpot is one of the leading CRMs, and it has invested heavily in AI this year, embedding generative AI in almost every part of its platform to help you scale and grow your business. There is one free tool called Campaign Assistant that I found very useful for creating marketing assets at scale. To get started, I click on a button and it asks me to choose the type of ads I want to create. Let's say I want to start an AI consulting business: I can select Google Search Ads, and then it asks me
what the context of my campaign is. I give it some context: providing a free one-hour AI consultation for small businesses on how to define their AI strategy. Then it asks what the key selling points or key value propositions are that I want my audience to know, so I add a few value propositions and click next. I type in a few keywords I want to target, define the call to action (in my case, "book a meeting"), and choose a writing style. It then processes all that information and tries to generate Google ads for me; within a few seconds it has already generated three. I can also generate LinkedIn posts with the same context, as well as landing page content and marketing emails. This lets me experiment with and iterate on a large amount of ad assets within a few minutes, so if you're dealing with a lot of different ad assets, this will definitely save you time. You can click on the link below to try it out for free.

Now let's get back to implementing this Gemini demo with GPT-4V and Whisper. Open Visual Studio Code with a new folder. First we want to set up a Next.js project, so open the terminal and run npx create-next-app@latest with --ts to add TypeScript, --tailwind for the UI library we're going to use, and --eslint. Press enter and name the project; it can be anything, and I'll call it gemini-demo. Select yes for the src directory, yes for the App Router, and no for customizing the import alias, and it will automatically set up the Next.js project. Once it's done you'll see the new folder, so run cd gemini-demo to get into it.

Next we need to install a few packages: npm install ai, the SDK from Vercel that makes chat apps and response streaming a lot easier; openai, for access to all the OpenAI functionality; silence-aware-recorder, the audio recorder that can automatically detect when the user stops talking; use-media-recorder, which records the camera video; and merge-images, which lets us put multiple images together into one image grid. Press enter and the packages install automatically; once that's done you can see them all added to package.json. The Next.js project is now set up: run npm run dev and you'll see the default page of any new Next.js app.

Now we can start customizing and building out the functionality we need. In this Next.js project structure the main page lives in page.tsx, but we don't need any of the default content, so I remove everything and replace it with a placeholder for now; as you can see, the page updates automatically. Next, we create a new file under the app folder called chat.jsx; this is where we will implement the main application. At the top we add "use client" so the component can use the client-side functionality we build later, then import the libraries we'll need: a few hooks from React for state and refs, the useChat utility from the Vercel AI SDK that lets us call the OpenAI API and stream results, the two recorders mentioned before for camera video and voice, and the merge-images package we'll use to create the image grid. Then we define a few configuration values: the interval, which is how often we capture a screenshot from the camera; a few details about the image grid we're going to create; the maximum number of screenshots to send to OpenAI; and the silence duration and silence threshold used to detect whether the user has stopped talking.
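Concretely, the top of chat.jsx might look something like this; the import paths (I'm assuming silence-aware-recorder's React entry point and the @wmik/use-media-recorder package) and the numeric values are illustrative assumptions, not values quoted from the video:

```jsx
// app/chat.jsx (top): imports and configuration knobs (sketch).
"use client";

import { useEffect, useRef, useState } from "react";
import { useChat } from "ai/react"; // Vercel AI SDK client hook
import useSilenceAwareRecorder from "silence-aware-recorder/react";
import useMediaRecorder from "@wmik/use-media-recorder";
import mergeImages from "merge-images";

const INTERVAL = 250;          // ms between two captured screenshots
const IMAGE_WIDTH = 512;       // width of one frame inside the grid
const COLUMNS = 4;             // columns of the merged image grid
const MAX_SCREENSHOTS = 60;    // most recent frames sent per request
const SILENCE_DURATION = 2500; // ms of silence that ends an utterance
const SILENT_THRESHOLD = -30;  // volume below this counts as silence
```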
Next we initiate a few refs and pieces of state. useRef is a feature provided by React that gives us an easy handle on a specific DOM element (say, the video element we'll manipulate later to display images) or on a value we want to store and refer to later, while useState stores a specific piece of state, and every time that state changes React re-renders the component. I'll explain each of them as we dive into the details.

Let's first add the video recorder. I call useMediaRecorder with video set to true and audio set to false, because for audio we're going to use the silence-aware recorder instead. We also add two functions: startRecording, which is triggered when the user clicks a "start session" button to start recording the video and sets an isStarted state to true, and stopRecording, which is triggered when the user clicks "stop session". Then I set up a useEffect. useEffect is another React feature: it's a function plus a rule you define that React keeps monitoring, so you can say "whenever this condition is met, run this function". Here we use it to display the latest image captured by the camera. Then we return the UI elements: a canvas, a div with Tailwind styles, and a video component. That already sets up a camera recorder and displays the live camera on the page. We also want buttons for the user to start or end the session, so we add a new div: if the recording session has started it displays a "stop session" button, and if the isStarted state is false it displays a "start session" button, with the two buttons calling startRecording and stopRecording respectively. This chat component now provides the basic functionality to show a live camera feed.
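Continuing the same file (imports shown above), here is a sketch of the camera half of the component, assuming @wmik/use-media-recorder exposes a liveStream we can attach to the video element:

```jsx
// Camera half of the Chat component (sketch, same file as above).
export default function Chat() {
  const videoRef = useRef(null);
  const screenshotsRef = useRef([]); // most recent captured frames
  const [isStarted, setIsStarted] = useState(false);

  const { liveStream, startRecording, stopRecording } = useMediaRecorder({
    recordScreen: false,
    mediaStreamConstraints: { video: true, audio: false }, // audio is handled separately
  });

  function startSession() {
    startRecording();
    setIsStarted(true);
  }

  function stopSession() {
    stopRecording();
    setIsStarted(false);
  }

  // Show the live camera feed once the stream becomes available.
  useEffect(() => {
    if (videoRef.current && liveStream && !videoRef.current.srcObject) {
      videoRef.current.srcObject = liveStream;
    }
  }, [liveStream]);

  return (
    <div className="flex flex-col items-center gap-4 p-8">
      <video ref={videoRef} autoPlay muted playsInline className="rounded-lg" />
      {isStarted ? (
        <button onClick={stopSession}>Stop session</button>
      ) : (
        <button onClick={startSession}>Start session</button>
      )}
    </div>
  );
}
```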
All we need to do now is go back to page.tsx, import the chat component we just created, replace the return value with the chat component instead, and save. Back in our Next.js app there is now a little "start session" button; clicking it for the first time asks for camera permission, and once you allow it you get a web app that automatically captures and streams the video.

Next we want to set up the audio recorder with silence-aware recorder. I create an audio object with useSilenceAwareRecorder and define a function called onSpeech that is triggered when it receives new audio data; this is also where you configure the silence threshold and silence duration that the silence-aware recorder handles automatically (see the sketch after this section). In the startRecording function I add audio.startRecording too. Then we fill in the onSpeech function, which runs when we detect that the user has stopped speaking and we receive the audio data. Inside it we use an isBusy ref, one of the refs we created above, to check whether we're still waiting on a response from OpenAI: if we are, the function returns right away; if not, we set isBusy to true, stop recording, and later send the audio to Whisper for transcription, but for now we keep it simple.

Then we set up a new useEffect whose purpose is to dynamically compute a volume percentage, which we use to update a volume indicator in the UI based on the current volume level. If the audio is not recording, the volume percentage is simply set to zero; if it is recording, we dynamically update minimum and maximum volume refs and use those numbers to compute the percentage. After the video element we add a new piece of UI: if the audio is recording, we display a little indicator that changes dynamically with the volume percentage; if not, we show a little stop icon. Save, and the app can now capture both audio and video: if I go back to the page and click "start session", a little UI element on the left updates while I'm talking, and when I stop talking it switches back to the other icon. There you go: we now have an application that can detect when the user stops talking.
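Here is a sketch of that voice wiring in the same component; the hook option names follow what silence-aware-recorder's documentation suggests, so treat them as assumptions:

```jsx
// Voice recording inside the same Chat component (sketch).
const isBusy = useRef(false); // true while a request to OpenAI is in flight
const [phase, setPhase] = useState("waiting for speech");
const [currentVolume, setCurrentVolume] = useState(-50);

const audio = useSilenceAwareRecorder({
  onDataAvailable: onSpeech,        // fires with audio data when speech ends
  onVolumeChange: setCurrentVolume, // feeds the volume indicator in the UI
  silenceDuration: SILENCE_DURATION,
  silentThreshold: SILENT_THRESHOLD,
});

async function onSpeech(data) {
  if (isBusy.current) return; // drop input while we wait for a response
  isBusy.current = true;      // reset after the spoken answer finishes playing
  audio.stopRecording();
  // ...build the image grid and send `data` to Whisper (added later)...
}
```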
After we detect that the user has stopped talking, we want to generate an image grid that captures what has been happening for the past few seconds, and also use Whisper to automatically transcribe the voice into a text prompt, so we can send both the text prompt and the image to GPT-4V to get a response. The first thing is generating the image grid, so we move to the top of the file and define a few helper functions (sketched below). First we create a transparent pixel, which we'll use as a divider when generating the grid. The first function is getImageDimensions, which returns the dimensions of an image from its source; here you'll see something called a Promise, which is basically the way JavaScript does things asynchronously. The second function is base64ToBlob, which converts the base64 image data into a format we can manipulate and send to a server; it's used by the next function, uploadImageToFreeImageHost, which uploads the screenshot grid to a free image hosting service called tmpfiles. tmpfiles is a free temporary file storage: we get the blob data from the image, POST it to tmpfiles, and get a URL back. Last but not least, we convert all those images into the grid itself: the function takes all the images plus the number of columns and the image quality we defined above, gets the dimensions of each image, calculates the aspect ratio and the number of rows we should have, prepares the images for merging, and in the end merges everything into one grid image.

Those are all the functions we need to generate the image grid; now we need to trigger them. Moving down to the onSpeech function that is called when the user stops talking: first it sets the phase state to "uploading video captures", takes the most recent screenshots (capped by the maximum we defined above), generates the image grid, uploads it to the free hosting service, sets an imageGridUrl state to the URL that comes back, and finally sets the phase to "processing completion". We also need a new useEffect to save screenshots every few frames: we define a captureFrame function that, if the video is recording, grabs the image data from the video recorder at the size we defined above and pushes the new image onto the screenshots ref, and we set an interval controlling how often a frame is captured.

That's pretty much it; all we need now is a debug button so we can see the generated image grid. Below the start/stop session buttons we add a new "debug" button: clicking it sets a displayDebug state to true and shows an overlay displaying both the phase state and the image grid; the overlay only renders when displayDebug is true, and a close button sets it back to false. Save it and go back to the web app: I start a session, say something, then stop, and when I open the debug menu it displays the phase ("processing completion") as well as the image grid. This is working well.
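A sketch of those helpers (same file): merge-images composes the frames into one grid, and tmpfiles.org hosts the result. The /api/v1/upload endpoint and the /dl/ URL rewrite are assumptions based on that service's public API, and this sketch simplifies by assuming square frames and skipping the transparent divider mentioned in the video:

```jsx
// Grid helpers (sketch).
function base64ToBlob(base64, contentType = "image/jpeg") {
  const bytes = atob(base64.split(",")[1]); // strip the data-URL prefix
  const array = new Uint8Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) array[i] = bytes.charCodeAt(i);
  return new Blob([array], { type: contentType });
}

async function uploadImageToFreeImageHost(base64Image) {
  const formData = new FormData();
  formData.append("file", base64ToBlob(base64Image), "grid.jpg");
  const res = await fetch("https://tmpfiles.org/api/v1/upload", {
    method: "POST",
    body: formData,
  });
  const { data } = await res.json();
  // tmpfiles returns a viewer URL; the direct file lives under /dl/.
  return data.url.replace("tmpfiles.org/", "tmpfiles.org/dl/");
}

async function imagesGrid(images, columns = COLUMNS) {
  const rows = Math.ceil(images.length / columns);
  // Position every screenshot on a simple columns x rows canvas.
  const placed = images.map((src, i) => ({
    src,
    x: (i % columns) * IMAGE_WIDTH,
    y: Math.floor(i / columns) * IMAGE_WIDTH,
  }));
  return mergeImages(placed, {
    width: columns * IMAGE_WIDTH,
    height: rows * IMAGE_WIDTH,
    format: "image/jpeg",
  });
}
```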
The next step is to call Whisper so we can use the transcript as the text prompt. For that we need a speech-to-text endpoint: in the app folder I create a new folder called api, inside it another folder called speech-to-text, and in there a new file called route.ts. route.ts is a special file type in Next.js that gives us route handlers. If you don't know what a route handler is, it lets us handle requests to a specific route; for example, this one handles localhost:3000/api/speech-to-text, and since the front end will call it with a POST request, I simply export an async function called POST, which is invoked whenever a POST request hits this route. We can read the request body with const formData = await request.formData() and then pull specific values out of it. This speech-to-text function needs three parameters: the audio file, which we read with formData.get("file") as a File; the token, which is the OpenAI API key we'll use for Whisper; and a language code, because Whisper supports more than just English, and you can pass a language code to support the other languages as well. If there is no API key, it throws a "no API key provided" error.

Before calling Whisper we need to import the OpenAI library, so at the top we add import OpenAI from "openai". We also export const runtime = "edge"; Edge is the part of Vercel that provides optimized infrastructure for API routes, and you can click this link if you want to learn more. Then we create the OpenAI object and pass in the API key. Alternatively you can use process.env.OPENAI_API_KEY to read it from environment variables: in Next.js you create a .env.local file and add the variable there, and it becomes accessible anywhere via process.env.OPENAI_API_KEY. But since we want to let other people use the app, we'll instead build a UI on the front end where users enter their own API key. Finally we pass the audio recording to Whisper and return the generated transcription.
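A sketch of that route handler; the form-data field names (file, token, lang) are my choice and just need to match what the front end sends:

```ts
// app/api/speech-to-text/route.ts: Whisper transcription endpoint (sketch).
import OpenAI from "openai";

export const runtime = "edge";

export async function POST(request: Request) {
  const formData = await request.formData();
  const file = formData.get("file") as File;
  const token = formData.get("token") as string;
  const lang = formData.get("lang") as string;

  if (!token) {
    return Response.json({ error: "No API key provided." }, { status: 401 });
  }

  const openai = new OpenAI({ apiKey: token });
  const transcription = await openai.audio.transcriptions.create({
    file,
    model: "whisper-1",
    // Whisper supports many languages; pass an ISO 639-1 code if given.
    ...(lang ? { language: lang } : {}),
  });

  return Response.json({ text: transcription.text });
}
```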
As mentioned before, we want an input where the user can type in their API key directly, plus the language, and we don't want users to have to type the same API key again and again. So in the src folder I create another folder called library, and inside it a file called use-local-storage.ts with a hook that saves those values from the user's input to local storage so they can be retrieved later. Then I go back to chat.jsx, import the useLocalStorage hook we just created, move down to the chat function, and set up both the API token and the language with it.

Next, move down to the onSpeech function that is triggered when the user stops talking and add the block that sends the audio to Whisper for transcription (sketched below). First we set the phase state to "processing: speech to text" so we can keep track of progress, then grab the API token and language from local storage if the user has already entered them, and make an API call passing the audio file, the API token, and the user's selected language to the speech-to-text endpoint we defined in route.ts. Once it finishes we read the response: if there is an error we alert it; otherwise we set the transcription state to the response we get back from OpenAI Whisper.

That's pretty much it; now we just need some UI controls so people can enter the API key, select the language, and see the transcript. Below the debug button we add two inputs, one for the OpenAI API key, which updates the token value whenever the input changes, and the same for the language; further down we insert the transcript so it gets displayed. The app should now be able to automatically generate the image grid and send the audio file to Whisper to produce a transcript, so let's try it: save, go back to the web app, enter the OpenAI API key, and optionally a language code. Language codes follow a specific format called ISO 639-1, but you can also leave the field empty. I click "start session", say something ("hello world"), then stop, and once the icon changes I can open the debug menu and see both the transcript and the image grid. Perfect.
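The transcription block added to onSpeech might look roughly like this; token, lang, and setTranscription come from the useLocalStorage and useState wiring described above, and data is the blob the silence-aware recorder hands over:

```jsx
// Inside onSpeech: ship the recorded audio off for transcription (sketch).
setPhase("processing: speech to text");

const body = new FormData();
body.append("file", new File([data], "audio.webm", { type: "audio/webm" }));
body.append("token", token); // from useLocalStorage
if (lang) body.append("lang", lang);

const res = await fetch("/api/speech-to-text", { method: "POST", body });
const json = await res.json();
if (!res.ok || json.error) {
  alert(json.error ?? "Transcription failed");
} else {
  setTranscription(json.text); // shown in the UI and sent on to GPT-4V
}
```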
Now all that's left is to pass both the generated image grid and the transcript to GPT-4V so it can generate a response. As mentioned before, we'll use the Vercel AI SDK, which makes chat completion and result streaming extremely easy in a Next.js app. The way the Vercel AI SDK works is that you use its useChat utility on the client and define the route handler it triggers in a special location: under the api folder, create a chat folder and a route file inside it (route.js this time), and the function you define there is called whenever useChat fires. So back in Visual Studio Code, under the api folder we create a new folder called chat, create the route file inside, and put in the code: we import OpenAI, plus OpenAIStream and StreamingTextResponse from the Vercel AI SDK. We add a special system prompt: the assistant will receive a series of screenshots from a user's live video feed; the screenshots represent sequential frames from the video, capturing distinct moments; the assistant should analyze the frames as a continuous video feed and answer the user's questions, focusing on direct and specific interpretation of the visual content. Then we define a list of rules: when the user asks a question, use the spatial and temporal information from the video screenshots; respond with brief, precise answers; answer the user directly; keep in mind that the grid of images shows the same objects in a sequence over time; and when asked spatial questions, provide clear and specific information about location. We also pass on the language, so the assistant must speak whatever language the user specified. Then we define the POST request handler, which receives both the image grid and the text transcript in the messages, plus the API token and the language; we create the OpenAI client with the API key, call OpenAI with GPT-4V, and return the response (see the sketch below). Here the Vercel AI SDK has a special object called StreamingTextResponse, which streams the text response back to reduce latency.

Then we go back to chat.jsx. First, in the onSpeech function we do an await append(...), which adds the image as part of the messages we're going to pass to OpenAI. After that we call the useChat hook, wiring up both the messages and the append we just used, and passing an extra body parameter so the route can read the API key and the language the user selected. What this does is send a request to the chat route we just defined, and once the response comes back it triggers an onFinish callback. After we get the response we want to do two things: stream the result onto the screen so the user can see it, and send the result to OpenAI's text-to-speech model so the AI can talk back. For the latter we do the same thing as before: under the api folder create a new folder called text-to-speech and a route.ts inside, where we define a POST endpoint that takes both the text input and the API token, makes a request to the OpenAI text-to-speech model with the specific voice we chose (you can change it to another one), and converts the returned audio data into a blob that we can send back to the front end.
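Here is a sketch of that chat route (written as route.ts here, while the video uses plain JavaScript), using the OpenAIStream and StreamingTextResponse helpers as they existed in the ai package around version 2.x; the system prompt paraphrases the rules above:

```ts
// app/api/chat/route.ts: GPT-4V completion streamed back to useChat (sketch).
import OpenAI from "openai";
import { OpenAIStream, StreamingTextResponse } from "ai";

export const runtime = "edge";

const SYSTEM_PROMPT = `The user sends a grid of sequential screenshots taken
from a live video feed. Treat the grid as one continuous video: use its
spatial and temporal information, answer the user's question directly with
brief, precise answers, and reply in the user's language.`;

export async function POST(req: Request) {
  const { messages, token, lang } = await req.json();
  const openai = new OpenAI({ apiKey: token });

  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    stream: true,
    max_tokens: 300,
    messages: [
      {
        role: "system",
        content: `${SYSTEM_PROMPT}\nUser language: ${lang || "unspecified"}.`,
      },
      ...messages,
    ],
  });

  // Stream tokens to the client as they arrive to cut perceived latency.
  return new StreamingTextResponse(OpenAIStream(response));
}
```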
Now let's go back to chat.jsx and add the onFinish function, which is called when GPT-4V finishes generating a response and receives the generated message as input. We set the phase state to "processing: text to speech", grab the API token again, and pass the message to the text-to-speech endpoint we defined above. Once we get the audio response back we want to play it, so I move up and add a function called playAudio, which takes the URL of the audio file and starts playing it; that playAudio function gets called here in onFinish. Once the whole audio file has played, we turn the audio recording back on to wait for the user's next question or instruction and set the phase back to "waiting for speech". Then we move down to the last useEffect and add something to grab the latest assistant message so we can display it, and add the UI to actually show it: after the video and audio recording div, we add a new div with the assistant's message.

The demo should now be fully functional, so let's test it. I go back to the web app and start a session. "What am I holding in my hand?" You can see that once I finish talking, it shows a loading state here. "You are holding a bottle with an orange cap." "What is behind me?" "Behind you is a kitchen area with cabinets and a countertop." "Which path is safer for the little bird, path one or path two?" "Path one is safer for the little bird, as it avoids the cat. Path two leads directly to the cat, which could be dangerous for the bird." "What should be the next shape in this sequence?" "The next shape in the sequence should be a hexagon." "Is this the hexagon you were talking about?" "Yes, that's the hexagon I was referring to. It correctly follows the sequence by adding one side to each successive shape." "Which book is better for me to read if I want to learn AI?" "If you want to learn about AI, The Coming Wave by Mustafa Suleyman would be the more appropriate choice. It seems to be focused on the future of AI and its implications, which would be relevant to your interest in artificial intelligence." "Which jacket should I wear for the Christmas party?" "The blue jacket on your right would be more suitable for a Christmas party. It looks more formal and appropriate for such an occasion."

So this was a walkthrough of how you can recreate the Gemini demo yourself with GPT-4V, Whisper, and a text-to-speech model. Thanks again to Julien De Luca for providing such a great example. Please comment below with other interesting multimodal applications you're thinking about building; I'm really curious to learn more and do more experiments. I'll keep posting the interesting AI projects I'm working on, so if you enjoyed this video, please consider subscribing. Thank you, and I'll see you next time.
Info
Channel: AI Jason
Views: 15,054
Keywords: gemini, google gemini, gemini google, gemini ai, how to use gemini ai, how to use google gemini, how to use gemini, artificial intelligence, chatgpt, ai, gpt4v, gpt4v image, gpt4v and, chat gpt, chatgpt tutorial, gpt 4, gpt 4.5, ai agents, autonomous ai agents, ai agent workforce, multimodality, multimodal language model, multimodal machine learning, multimodal ai, how to train a multimodal llm
Id: Bf1zaKUMKeQ
Length: 31min 37sec (1897 seconds)
Published: Tue Dec 19 2023