Building an AI assistant that listens and sees the world (Step by step tutorial)

Video Statistics and Information

Captions
"What teams are represented by the baseball hats hanging from my wall?" "Dodgers, Nationals, Red Sox, Astros, Cubs, and Royals."

I cannot wait to show you this. I just had lunch and I was a little bit sleepy, but I cannot wait to show you the code I built here for you. If you remember, three or four weeks ago OpenAI showed an assistant running on a phone, and that felt like magic in many different ways, because now you have a model that's actually interacting with the world: the model has access to the phone's camera and is seeing, or to be more accurate, is processing, the images coming through that camera and using them as context to answer questions from the people presenting the demo. So I asked myself one question: how hard could it be?

What I want to show you today is a quick demo I put together. It's obviously not as sophisticated as what a multi-billion-dollar company did, but it's a very short demo that does the same thing with the webcam your computer has. There are only 169 lines of code here, and I'm only using Python. I'm not using any external services, just Python plus API calls to a large language model (I'm using Gemini Flash here, and I'll explain why) and the OpenAI text-to-speech API; everything else is just a Python library. I'm planning to rebuild this whole thing later using services that will make my life way easier, but that's going to come later. For now, I'll show you the code of how I built this assistant, and then we're going to see a demo; I'm going to run it live. I hope you look at this and it inspires you. I posted about this on Twitter, and somebody told me, "Hey, somebody's going to grab your code, raise a million dollars, and go build something." I hope somebody does that. Just remember me when you make it; kick me five dollars or something.

Jokes aside, let's take a look at the code. Again, around 160 lines of code: how hard could this be? It's just a single file called assistant.py, and you're going to get access to this code through a link in the description of this video. While you're there, give the video a like so the algorithm tells me this type of content works for my audience and I keep making more.
Together with assistant.py I have a single requirements.txt file listing all of the libraries I'm using, and here are the main components. There are four main components to this assistant. The first component captures the audio coming through my microphone and transcribes it; for that transcription I'm using Whisper. So basically it captures the stream of audio, transcribes it, and gives me text. The second component captures the images coming through my webcam, and I'm using OpenCV to do that. OpenCV is a very popular library, especially for people doing computer vision, with all sorts of functionality for dealing with images; that's what I'm using to capture video, so now I have images. The third component is a large language model. Right now I'm using Gemini 1.5 Flash, Google's model. I also left the code ready so that if you want to use GPT-4o, you can, but I don't like it for this because it's way slower than Gemini and, for some reason I do not understand, less accurate at following my prompt; we'll get there in a second. So with images and text, I can send them to a multimodal model (it has to support images and text) and get an answer back. Now I need to synthesize that text answer into audio and play it through the speakers, and that is the fourth component; to do that, I'm using the OpenAI text-to-speech API.

All of those pieces come together to create something that feels magical, something that feels alive. By the way, I've heard from people a misconception that GPT-4o now supports video, because that's the impression you get when you watch one of these demos. No, it doesn't support video; what you're doing is just passing images. A video is just a collection of frames: if you have a 24-frames-per-second video, you have 24 frames within each second, so for every second of video you have 24 images. What the model is actually doing is processing images. The application, the glue, the wrapper around that model is what makes everything feel magical, because it combines all of those components: the audio coming from the user, the transcription of that audio, the frames grabbed from the webcam or the cell phone camera, the model answering questions from them, and the answers played back. All of those components together make this assistant feel alive in a certain way; it's not that the model is any different from what it was before.

That said, here are the requirements; you're going to have to install them, and I have a README file that tells you what to do. One thing to note: depending on whether you're using GPT-4o or Gemini, you'll have to grab an API key and create an environment variable with it; my code relies on those keys to access the large language model. Everything else is pretty straightforward. I'm using Python 3, and this is how you run the assistant; we're going to see that in a second.
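To make the earlier "a video is just frames" point concrete, here is a minimal sketch, not taken from the video's repository, that grabs a single frame from the default webcam with OpenCV and base64-encodes it, which is essentially all that a "video" request to one of these multimodal models contains:

```python
import base64

import cv2

# Open the default webcam (index 0) and grab exactly one frame.
capture = cv2.VideoCapture(0)
ok, frame = capture.read()
capture.release()

if ok:
    # Encode the raw frame as JPEG, then base64: a string like this is what
    # the multimodal model actually receives; there is no "video" involved.
    _, buffer = cv2.imencode(".jpeg", frame)
    image_b64 = base64.b64encode(buffer)
    print(f"captured one frame: {len(image_b64)} base64 bytes")
```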
Let's go to the code. I want to start by explaining the first class I created, called WebcamStream. I'm going to try to go as fast as possible here. WebcamStream is a class that runs a thread to keep capturing frames from my webcam. To do that, I'm using the VideoCapture class from OpenCV, passing the index of the camera. One computer might have multiple sources of video: right here my computer has its built-in webcam, sometimes I connect an external webcam, which would be a second source, the continuity camera from my phone would be a third source, and so on. I'm using index zero here; at this point, with nothing else connected, index zero represents my built-in webcam.

VideoCapture gives me access to a stream of images, and if I call the read function, it gives me back a frame, which is basically an image. What this class does is repeatedly capture those images on a thread: it grabs one image and updates an internal property of the class with the latest image captured. Why do I do that? Because the moment I finish asking the assistant a question, I want to grab the latest image that came from my webcam. That image should be ready at that point, so I can say "give me the latest image" and process it together with the prompt I spoke through the microphone. That is why this class does its work on a background thread.

Again, going really quickly: when I create an instance of this class, I create the stream and read the first image, then mark running as false (you're going to see why that's important), and create a lock. A lock acts like a traffic light to prevent other threads from accessing data while I need that data locked: whenever I'm going to make a change, I lock the data, do the operation, and then unlock it so different threads can access it again.

There is a start function that, when called on my WebcamStream instance, starts that background thread: if the stream is already running, it does nothing and returns; if not, it sets running to true, creates the thread, and starts it. Notice that the target function of the thread is the update function, another function I have here, meaning that when the thread starts, it calls update. Look at what update does: it runs in a loop that isn't quite infinite, because it loops while running is true, and each iteration grabs an image from the webcam. It takes one frame, stores it in a local frame variable, locks access (it's saying, "nothing should come and read now, because I'm about to do an operation"), uses that variable to update the property of the class holding the latest frame, and then releases the lock, so anybody waiting can now come in. That is what update does: it runs in the background, constantly updating the latest image from the webcam.

Then there is a read function, the function I call whenever I need that latest image. Whenever I'm ready to use an image, I say, "read from my WebcamStream instance and give me that image." I pass a parameter called encode, and if it's true, the function returns the image encoded in base64. Why do I need that? Because to send the image to GPT-4o or Gemini Flash, I need to pass it in base64 format. If your image were on a URL, public on the web, you could send the URL and it would work, since those models are capable of processing a URL; but I'm capturing the image right here on my computer, so I don't have a URL and I have to encode it and send it in base64. Here is what happens in read: first I lock access to the frame, so that while I'm reading, the update process can't change what I'm reading; I make a copy of the image into a local variable; and then I release the lock, so if something else wants to update the frame, now it can. After I have that frame, if encode is true I encode it in base64 and return it; if not, I just return the frame. Finally there's a stop function that stops the thread and joins it; nothing fancy here. At a high level, this class manages my webcam, grabs images, and keeps one image ready for whenever I want to use it. That's what WebcamStream is.
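Here is a hedged sketch of that WebcamStream class, reconstructed from the narration rather than copied from the repository, so names and details may differ from the actual file:

```python
import base64
from threading import Lock, Thread

import cv2


class WebcamStream:
    def __init__(self):
        self.stream = cv2.VideoCapture(0)   # index 0: the built-in webcam
        _, self.frame = self.stream.read()  # grab an initial frame
        self.running = False
        self.lock = Lock()                  # guards access to self.frame

    def start(self):
        if self.running:                    # already capturing: do nothing
            return self
        self.running = True
        self.thread = Thread(target=self.update, args=())
        self.thread.start()
        return self

    def update(self):
        # Runs on the background thread: keep replacing the latest frame.
        while self.running:
            _, frame = self.stream.read()
            with self.lock:                 # block readers while swapping
                self.frame = frame

    def read(self, encode=False):
        with self.lock:                     # block update() while copying
            frame = self.frame.copy()
        if encode:
            _, buffer = cv2.imencode(".jpeg", frame)
            return base64.b64encode(buffer)  # ready for a data: URL
        return frame

    def stop(self):
        self.running = False
        self.thread.join()
```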
I'm going to skip the Assistant class for one second, just so you see how I initialize the WebcamStream. You can see it on line 135: I create an instance and call start, so from that point on, my webcam is capturing one image after another.

Now let's go back to the Assistant class, the second class I have here, and that's pretty much it; everything else is just a small function that represents the glue for all of this. This beautiful Assistant class expects a model, so I need to provide a model instance to it. By the way, this assistant uses LangChain, for a few reasons. First, it makes my code way shorter: I don't have to deal with the verbosity that comes from using the APIs directly; LangChain makes my code beautiful. Second, it makes my code composable, meaning I can put together different components, mix and match, and build this application like Lego; you're going to see that in a second, and it is beautiful. And third, I can swap my model for another model without changing a single thing here: nothing inside this class, not how I put together the prompt, not how I invoke the chain, depends on one specific model. I can use Gemini Flash, GPT-4o, or any other model supported by LangChain, which is pretty much every model out there, and all of the details will be transparent to me. That is amazing; goosebumps whenever I talk about that.
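As a sketch of that model-agnosticism (assuming the standard LangChain integration packages), swapping Gemini Flash for GPT-4o is a one-line change, because the Assistant class only ever sees a generic LangChain chat model:

```python
from langchain_google_genai import ChatGoogleGenerativeAI
# from langchain_openai import ChatOpenAI

model = ChatGoogleGenerativeAI(model="gemini-1.5-flash-latest")
# model = ChatOpenAI(model="gpt-4o")  # swap models; nothing else changes

# assistant = Assistant(model)  # the rest of the code is untouched
```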
So my assistant expects a model, and as part of the initialization of the class, in its constructor, I call a function named create_inference_chain: basically, I create a LangChain chain that does the inference. Let's take a look at that, because that's where most of the good stuff happens.

The function starts with the system prompt. This is how I instruct the model about what it needs to do; it's the main configuration, the main set of instructions, that the model receives. These instructions have nothing to do with the chat I'm going to have with the model; I'll ask the model to read things, do stuff, look at my glasses, and that type of cool thing, but this has nothing to do with that. This is just the system prompt, how I configure the model. I'm asking the model to be witty, to be concise, to go straight to the point. Interestingly enough, GPT-4o does not like this system prompt: it's just chatty, it goes on and on and on in all of its answers, and for some reason it doesn't like to be concise or short. That is another advantage of Gemini 1.5 Flash for this application, besides the fact that 1.5 Flash is really, really fast compared to GPT-4o.

So that's my system prompt, and next I create a prompt template. The prompt template is the structure with all of the messages I'm going to be giving the model to answer a question. Obviously the prompt template contains the system prompt, and this is how you define that system message as part of a LangChain prompt template. It also contains the chat history. One important thing is that I want this assistant to remember what we just talked about three questions before; I don't want the assistant to forget, I don't want every prompt, every question, to start from scratch. So I'm enabling chat history, and you're going to see in a second how I enable it; not a big deal, but I'll be passing the chat history in. And finally, the template contains my prompt, whatever I speak through the microphone (I grab that text and pass it), together with the image I get from the webcam. So there are two parts to the human message (you can see the definition here is "human"). The first part is the prompt, which is of type text: I'm telling the model, "your prompt includes some text, and this is it," and what you see here is just a prompt variable that we'll replace in a second with the real text. The second part is of type image_url, and the URL of that image is "data:image/jpeg;base64," which is the prefix you have to add when you're passing a base64 image, followed by a variable that contains the actual bytes, the base64 conversion, of that image. So I have two variables in this prompt that I'll need to replace, and you're going to see that in a second.
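A hedged sketch of that prompt template (the exact system prompt wording here is mine, not the video's), using LangChain's ChatPromptTemplate with a system message, a chat-history placeholder, and a two-part human message:

```python
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Placeholder wording; the video's actual system prompt differs.
SYSTEM_PROMPT = "You are a witty assistant. Be concise and go straight to the point."

prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_PROMPT),
        # Past turns get injected here so the assistant remembers the chat.
        MessagesPlaceholder(variable_name="chat_history"),
        (
            "human",
            [
                # The text the user spoke through the microphone.
                {"type": "text", "text": "{prompt}"},
                # The webcam frame, passed inline as a base64 data URL.
                {
                    "type": "image_url",
                    "image_url": "data:image/jpeg;base64,{image_base64}",
                },
            ],
        ),
    ]
)
```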
Next, on line 124, I compose my chain, and it's beautiful. What LangChain did in allowing us to put all of these components together is syntactic sugar; that's what it is. I have my prompt, and basically what I'm doing is piping the output of my prompt into the model; that's what the pipe symbol means. The output of the prompt goes into the input of the model, whatever that model is (again, this class expects a model, which I haven't created yet). And the output of the model I send into a string output parser. This is something that matters when you're using a chat model: the way LangChain's chat model representation works, the output contains the conversation delimited by who said what, so you get things like HumanMessage and AIMessage going back and forth. I don't want any of that as part of my answer; I just want the straight answer, just the text. The StrOutputParser is one of the supported parsers for converting the output of a model into a specific format, in this case a plain string. If the model says "2 + 2 is 4," I don't need AIMessage("2 + 2 is 4"); I don't need all of that mumbo jumbo, I just want the text, and that's what I get here. So that is what a chain is: a bunch of components where each one pipes into the next.

One more thing is needed for this chain to work, and that is the chat history; I need to make sure it's there. LangChain offers a wrapper for when you want to implement chat history, and my case is very simple: I'm not storing the history anywhere, and I only want to keep one history, not multiple histories or anything like that. LangChain supports this through the RunnableWithMessageHistory wrapper: I take my chain, create an instance of RunnableWithMessageHistory passing the chain in, and the result is still a chain, but one that now supports chat history. Behind the scenes, LangChain does all the work of grabbing the answer from the model, adding it to a list, grabbing my next request, appending that to the list, and keeping the history going; I just need to come here and create this class. I'm not going to get into too many details; the most important thing you need to know is that this variable here, the history_messages_key, has to be the same as the variable in the template, so the prompt knows exactly where to get the chat history from. And by the way, there are multiple ways of configuring chat history in LangChain; this is the simplest one, which I wanted for this demo. The return value of this function is just the chain, which I can invoke with the actual data; whenever I'm ready to answer a question, I just invoke that chain.
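A sketch of that chain composition and history wrapper as the narration describes it, assuming an in-memory ChatMessageHistory since nothing is persisted:

```python
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables.history import RunnableWithMessageHistory

# Pipe the filled-in prompt into the model, then strip the chat wrapper
# (AIMessage and friends) down to a plain string.
chain = prompt_template | model | StrOutputParser()

# One shared, in-memory history: the session_id is accepted but ignored.
chat_history = ChatMessageHistory()
chain_with_history = RunnableWithMessageHistory(
    chain,
    lambda _session_id: chat_history,
    input_messages_key="prompt",
    history_messages_key="chat_history",  # must match the placeholder name
)
```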
Let's see how that works. Going back up, still within the Assistant class, I have two more functions to go through. The first one is the main function we're going to call; it's called answer. When I ask a question, I call this function to tell the assistant, "please answer this." The answer function receives the prompt, which is the text I speak through the microphone, and the image my webcam captures; I pass both. Here is what happens internally. First, if the prompt is empty, because for some reason I got no audio, I just return and do nothing. Then I print the prompt, so you can see in the terminal window what I asked, and I invoke my chain. When I invoke the chain, I need to provide values for the template variables; remember, my template expects a prompt variable and an image_base64 variable, so I need to invoke the chain with those two values. Here is the prompt, and here is the image in base64; I call decode on the image because it comes in as a byte array and I need to decode it into a string before I can pass it. So those are the two parameters. And again, for the chat history to work, I need to offer a session ID, which I'm not actually using; I won't get into it, it's a LangChain thing, and that's what the config is for. In fact, I think I could get rid of that config and everything would probably work the same; I can try that later. The final thing I do is strip out any newline characters or whitespace from the model's answer, since the model sends its answer followed by a bunch of newline characters; I'm just getting rid of those. That's it; at this point I have my response. Beautiful. I print the response, and if it's non-empty, I play it: by "playing it" I mean I call this text-to-speech function that I built, which takes the text response from the model and plays it out through the speakers of my computer.
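A hedged sketch of that answer method, assuming self.chain is the history-wrapped chain from above and that the text-to-speech helper is named _tts (a hypothetical name, not confirmed by the source):

```python
def answer(self, prompt, image):
    # Nothing transcribed? Nothing to do.
    if not prompt:
        return

    print("Prompt:", prompt)

    # Fill in the two template variables; the image arrives as a base64
    # byte array, so decode it into a string first. The session_id is
    # required by the history wrapper but otherwise unused here.
    response = self.chain.invoke(
        {"prompt": prompt, "image_base64": image.decode()},
        config={"configurable": {"session_id": "unused"}},
    ).strip()

    print("Response:", response)
    if response:
        self._tts(response)  # hypothetical name for the TTS helper
```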
Here is how that text-to-speech function works. It uses PyAudio, one of the libraries I'm installing here; PyAudio gives me access to play sound through my speakers. I configure PyAudio and open it, which gives me what I'm calling a player: something I can write bytes to, and those bytes are turned into sound coming out of my speakers. You can read more about the configuration in the PyAudio library; it's not that interesting. Then I call openai.audio.speech.with_streaming_response.create. (While recording this I actually cleaned the call up and imported openai directly, just so it's a little clearer where the function comes from.) Behind the scenes, this calls the OpenAI streaming API: you pass the text, and you can stream back the bytes that sound like a voice; basically, you're synthesizing that text into audio. The model I'm using is tts-1 (TTS, if I haven't mentioned it, means text-to-speech), and the voice is alloy. This is the API that previously had Sky, and I don't know if you're aware of all the drama around Scarlett Johansson's voice, but we no longer have access to Sky; we have access to alloy, so that's what this assistant is going to sound like. The response format is PCM (there are multiple supported formats), and the input is the response text I pass in. I grab the resulting stream, and now I can iterate through its bytes with a chunk size of 1,024, so I get 1,024 bytes at a time and play them through the speakers; the way you play them is just by writing to the player stream I created with PyAudio. That's it, very simple: grab the bytes of the audio and play them.
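A sketch of that text-to-speech function, assuming the OpenAI Python SDK's streaming speech API and the 16-bit, 24 kHz mono PCM stream that tts-1 emits for response_format="pcm":

```python
import pyaudio
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def text_to_speech(response: str) -> None:
    # A PyAudio output stream we can write raw PCM bytes into.
    player = pyaudio.PyAudio().open(
        format=pyaudio.paInt16, channels=1, rate=24000, output=True
    )
    # Stream the synthesized voice back and play it chunk by chunk.
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        response_format="pcm",
        input=response,
    ) as stream:
        for chunk in stream.iter_bytes(chunk_size=1024):
            player.write(chunk)
```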
What's left? Well, how do you put all of this together? We saw the webcam, we saw the model, we saw the audio; the only thing left is how you glue all of this together, and that happens right here, in just a few lines of code that almost fit on my screen. We already saw line 136, which creates the webcam, so now I'm capturing images. Then I create the model I'm going to be using: as you can see, I'm using the ChatGoogleGenerativeAI class and specifying that I want gemini-1.5-flash-latest. That's the model I decided to use, but if you want to try GPT-4o, here is how you do it: you uncomment that line, and now you have GPT-4o; you don't have to change anything else. Goosebumps. Not changing anything else; it's just going to work, thanks to LangChain. Then I create an instance of my Assistant and pass the model, in this case Flash.

Next, I'm using a library called SpeechRecognition, a beautiful library that made this so much easier: it handles my microphone, captures the audio, listens in the background, so I don't have to worry about any of that, and it gives me functionality to automatically transcribe the audio, so I don't have to use Whisper myself. That's not a big deal, but if the library does it for me, even better. Let's see how I use it. I create a couple of instances: a Recognizer, the instance that will let me transcribe the audio using Whisper, and a Microphone, representing this beautiful thing in front of me. I start the microphone by adjusting for ambient noise; I'm actually not quite sure what that does internally or how it adjusts, I'm assuming it does some math to feel out how much noise is in the room, but I'm guessing, I haven't looked into it too much. And then here is the main line for the microphone: recognizer.listen_in_background. That's it, just one line. That line enables the computer to start capturing audio from the microphone (the microphone itself isn't doing anything special; it's just transmitting the audio), and as soon as it hears a stop, as soon as I stop talking, babbling, however you want to call it, it grabs all of that audio and sends it to the callback function I define right here. That happens without me having to do anything else.

Now let's finish this code. while True: repeat this forever; display an image in a window with the title "webcam," with the image coming from the stream class we created (remember, that stream is processing in the background). I want to display the images so that, sitting there in front of the computer, you actually see the webcam working and know what you're pointing at. This happens thanks to cv2, the alias of the OpenCV library, and its imshow function, which creates a window with the title "webcam" and displays the image. Then I wait for a key: if you press either key 27, which is the Escape key, or a lowercase q, I break out of the loop. So whenever you run this assistant and that window shows up, you have to press either Escape or Q for the window to close. When you do that, I stop the thread that's capturing images, destroy all the windows OpenCV created, and stop listening. Notice that I pass wait_for_stop=False, meaning that even if you're speaking at that point, it stops listening right away; it doesn't wait for you to finish your sentence, it just tears everything down.

One more piece of glue: the audio callback function. Remember, the callback executes whenever I ask something using my microphone. I'll ask, "hey, how are you today?", and as soon as I make a pause, the library internally grabs all of that audio, says "stop right here," and sends it to the callback. Inside the callback, here is what I do. I take the audio, which comes in here, and the recognizer (that instance from before), and I call the function recognize_whisper, meaning "recognize using Whisper"; if you open the library's documentation, it supports a bunch of other engines, but I'm using Whisper in this case. I pass the audio and say I want the base model, which is fast enough. I think there's one that's even faster than the base model, maybe a tiny or a small model, I don't remember, and there are medium and large models that are way more accurate but way slower, so I want the base model. The language I'll be speaking is English; you can do a bunch of cool stuff like translating and whatnot, but that's not interesting right now. And recognize_whisper grabs that audio and gives me back text, because text is what I need to send to my assistant.
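Putting the glue together, here is a hedged sketch of that main block: webcam, model, assistant, background listening, the display loop, and teardown. The Assistant name and the callback wiring follow the narration; the details are assumptions:

```python
import cv2
import speech_recognition as sr


def audio_callback(recognizer, audio):
    # Transcribe the captured utterance with Whisper (base model, English)
    # and hand the text plus the freshest webcam frame to the assistant.
    prompt = recognizer.recognize_whisper(audio, model="base", language="english")
    assistant.answer(prompt, webcam_stream.read(encode=True))


webcam_stream = WebcamStream().start()
assistant = Assistant(model)  # the LangChain model created earlier

recognizer = sr.Recognizer()
microphone = sr.Microphone()
with microphone as source:
    recognizer.adjust_for_ambient_noise(source)

# One line: capture audio on a background thread and invoke the callback
# after each pause in speech.
stop_listening = recognizer.listen_in_background(microphone, audio_callback)

while True:
    cv2.imshow("webcam", webcam_stream.read())
    if cv2.waitKey(1) in (27, ord("q")):  # Escape or lowercase q quits
        break

webcam_stream.stop()
cv2.destroyAllWindows()
stop_listening(wait_for_stop=False)  # stop immediately, mid-sentence or not
```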
The next thing I do with that text is call the answer function we already talked about, passing the text and the latest image captured from my webcam; in this particular case I pass encode=True, meaning give me the base64 of that image. That's it; that is the whole magic of this application.

Now let me run it so you can see it working. The first time I ask a question, it takes a little bit of time because it's warming up; the second, third, and fourth questions go faster. I run it with python3 assistant.py. "What color are my glasses?" "Black." "What teams are represented by the baseball hats hanging from my wall?" "Dodgers, Nationals, Red Sox, Astros, Cubs, and Royals." "What is the title of this book?" "Human-in-the-Loop Machine Learning." "Who's the author?" "Robert Munro Monarch."

All right, that was pretty good. It works, and as you can see, it's fast enough; I mean, it's not lightning fast, but it's fast enough to be a little bit useful. A couple of things I wanted to mention, in case you want to run this on your computer, are ways you can improve it. Number one, you should be able to implement a way to interrupt the assistant; that's something OpenAI mentioned a bunch of times in their demo, and you should be able to do the same. It shouldn't be too hard: you just need a flag somewhere, and as you're playing audio back in the TTS function, you check that the flag is not set; if it is set, because the user started speaking again, you interrupt the playback. That would make it a little more dynamic. Number two, you need to make the application a little more robust: sometimes the OpenAI API might fail, or maybe Gemini isn't working, so handle that. And the third idea, which I think is a pretty good one, is streaming answers from the model back. Right now I'm using invoke, and what happens is that I send the API the text and the image and wait for an answer to come back; if that answer happens to be long, I have to wait for the model to generate the entire answer before I start playing it through the speakers. A way to improve that would be to stream the answer from the model, so I don't have to wait until the entire answer is generated and can start playing audio while the assistant is still generating it; that would be ideal (see the sketch below). Just by doing those three things, this is going to feel much more realistic. Obviously, playing with the system prompt is important as well, just to get it to behave the way you want; maybe make it a little more flirty, like OpenAI clearly did, or whatever you decide.

Anyway, I hope this is helpful. This kind of stuff is really fun to build. If you like this type of content, please subscribe or like this video so the algorithm lets me know you enjoy it, and I'll see you in the next one. Bye-bye!
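As a footnote to that third improvement, here is a minimal sketch replacing invoke with LangChain's stream so playback can begin before the full answer exists; play_text_chunk is a hypothetical helper standing in for the TTS hand-off:

```python
# Instead of waiting on chain.invoke(...), consume the answer as it is
# generated and forward each text chunk to the speech pipeline.
for chunk in chain_with_history.stream(
    {"prompt": prompt, "image_base64": image.decode()},
    config={"configurable": {"session_id": "unused"}},
):
    play_text_chunk(chunk)  # hypothetical: feed partial text into TTS
```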
Info
Channel: Underfitted
Views: 8,128
Keywords: Machine learning, artificial intelligence, data science, software engineering, mlops, software, development, ML, AI
Id: zVttVCQvACQ
Length: 37min 9sec (2229 seconds)
Published: Mon Jun 03 2024