Voice Conversations with ChatGPT using Whisper and Text-to-Speech

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hi there in today's video I'm working on my OK GPT project again in which I'm building a voice controlled device like Google home or Alexa and that is connected to chat GPT now so far we have the wakeup keyphrase detection ready and the command detection and we are using the open AI whisper to do this and not the API but locally so let me show you how this works so first of all we have this init.py script that basically listens for right now it's two times actually this should be five times so we listen to the wake up keyphrase we want to use and in this case I'm using OK GPT as the wake up keyphrase so this is similar to Okay Google or hey Alexa and this can be initialized to use any key phrase so let's initialize it so if I run Python 3 init.py then it is going to ask me five times to say the key phrase okay gbt okay gbt okay gbt okay gbt OK gbt and as you can see it detected it a little bit differently every time so most of the time it detected this as OK gbt and then once we got this version now actually it would be better to run this 10 times because there are still a few more variations it can generate but now that we have initialized this then what we can do is we can run our recognize. piy and what this will do is first it will listen for the wakeup keyphrase and once it detects the wake up keyphrase then it will listen to a command and that's all it does so basically it just says when it detects the wakeup key phrase and then it says what was the command that you gave so for example if I run this now if I say Python 3 recognize. Pi then now I can start speaking whatever I want here and nothing is happening until I say okay GPT and of course it doesn't work when I demo it and this because it actually didn't detect the okg PT although now it detected that I said this and now this what I speak is going to be the command so now you can see that we detected this as the command so if I try this one more time so I say okay gbt turn on the lights then now you can see that we actually detected this command turn on the lights after I said okay GPT now currently I'm using my laptop microphone for this not this one so it is very echoey and not very good so that's why it detects this very differently every time but what I would like to do in today's video is actually connect chat GPT to this so then I could say something like okay GPT what is the capital of France and then it should answer the question which in this case apparently was what is the cup of so of friends so maybe I should connect some sort of microphone to my laptop so that it would actually work better but I don't think I have a microphone handy right now anyway let's try to make this work now I did make some refactorings from the previous video so first of all I renamed my recorder module which I believe I called before just record but now it's recorder and I changed the name of the function in there to live speech and also I named this from phrases into wake up words and I changed the way in which we remove the duplicates from this list and in the recognize. piy I changed the is similar function into detect wakeup function so right now we take the command that we heard and then we take the wake up words that we are listening to and then we convert the command to lowercase and then we check for all the words in the wakeup words if the wake up word is in the command we gave then we are going to return true and we also convert to lowercase the wake up keywords and if we don't find it we return false and also I call the settings file now wake up words Json not phrases. Json and that's basically it that's all the changes I made and in the actual recorder which actually handles the listening to the audio and then transcribing it I did a couple of refactorings here nothing very interesting I renamed my I variable into frames recorded and I'm not sure if I did much of anything else here so basically what this live speech function does is it starts to listen to the microphone and then it detects when the RMS so the volume goes over 300 then it starts recording and then if the volume goes under 300 and we have already recorded the weight time number of frames now weight time might not be the best name for this variable because it just counts the frames anyway if we have recorded something then we SA save the audio and then we transcribe the audio with whisper and then we yield the text that we transcribed so that means we can use this as a generator here so we can just say four phrase in live speech and then we just get new phrases here for every iteration of this Loop and in this case we want just one phrase so we just break from this Loop when we get the first phrase and then this one is repeated five times so then we get five different phrases or the same phrase five times but perhaps transcribe differently now I will actually initialize this again with 10 iterations just to get all the different versions so it will recognize my voice better because the microphone is so bad okay so now I should have a couple more phrases it can detect I think here I started to say the next key phrase too early so something weird happened here but whatever it should work now so how do we integrate chat GB into this well of course we are going to have to import open Ai and then what we want to do is when we hear the command we want to send that command to chat GPT so let's do response equals open AI chat. completions do create and actually we are supposed to not use this open a library Direct directly anymore so we should actually do this we should say chat GPT equals um open AI doop AI or we can say from open AI import open Ai and then we can just do this and then we can call chat gp. chat. completions create and we are going to say that the model is going to be gbt 3.5 five turbo and the messages is going to be a list and let's pass at first just this message so let's make a dictionary here with the role user and the key sorry the content and this will be actually the message that we get from the transcription and of course we have to add a system message so that chity knows what we're doing here so let's set a rooll system and a Content you are a voice assistant or voice controlled assistant answer the users prompts as best you can I'm not sure if we have to say that part but let's just do that at first so then we should get the response and the response text should be response. choices zero message um content so we can say that the response text is going to be that and then let's just print response text and let's make this an F string and let's say chat GPT says that and we don't have to print our Command here and it would be nice to make some sort of sound when we detect the wake up keyphrase but let's do this at first and I'm going going to then do export open AI API key equals something I will get an open a API key and I'm going to put this here so now I should be able to use this thing so let's look over the code again so we initialize chat GPT and then we send a completion with a system message and a user message and then we get the response and then we print it and this of course has to be in curly races so let's see if that works so let's run Python 3 recognize. Pi now we get all of these error messages or some debugging here for some reason we have to get rid of this somehow but anyway if I now say OK GPT what is the capital city of France we get the response back which is the capital city of France is Paris so now it works OK GB T what is the distance from the Earth to the Moon the average distance from Earth to moon is about 2 38.8 55 miles okay now chat GPT or open AI recently released their text to speech model as well so let's actually try to use that one so how does this work it works something like this we have to say chat GPT which I I guess I shouldn't call it chat GPT because this is actually the open AI model but whatever I will call it chat GPT for now and here we should have audio do speech do create and we are going to give it the input and the model so the input is going to be actually the response text and the model is going to be it was something like TTS do they actually have like Auto completion for this TTS text to speech HD no text to speech one I don't remember so let's actually check it out in the documentation so in the API reference we have audio and create speech and it is in fact tts1 so let's do that tts1 and do we need something else the voice so let's just use Echo for now so let's say voice equals Echo and is there something else response format all right I think we don't have to give it it will be mp3 by default oh there's actually a speed here as well okay so let's do that and what is this this is the voice and what does this return httpx binary response content so I am going to check from my previous project which is is in do/ GPT November 2023 slash text to speech SL test.py so here I am actually testing it out all right so we do exactly this and then we have to import play sound and Os and then we can stream the audio to a file and then we can play the file and then we can remove the file I'm not sure if there's an easier way but let's try to do this so basically I will call this voice and I'm going to import play u sound and that should then work so let's try to see if we can actually talk to chat GPT now so let me make sure that I have some volume here so let's start this over recognize dop and then I can say okay gbt what is the capital city of France the capital city of France is Paris okay TBT what is 300 * 5 300 * 5 equal 1500 all right so now it is working okay gbt can you tell me a nice short poem certainly here's a short poem for you in a world of chaos find your calm like a gentle breeze or a soothing Palm embrace the beauty that surrounds where peace and serenity knows no bound I hope you enjoyed it let me know if you'd like to hear more poems or if there's anything else I can assist you with all right so now it is actually working now I think we should add some sort of sound effects when it detects the wakeup keyphrase and then when it has finished detecting the command what if we actually just create some voices so what if it says like yes or something when you say okay GPT and then when it starts to generate the answer it should say something like just a moment so we can actually do that with this speech and actually I will change to alloy I don't really like the echo sound and we can actually do this in a different file so let's do code voices dop and here let's import open Ai and then we can do this and I will call this model and this is going to be yes let's see what this sounds like and we're going to save it into yes. MP3 let's actually do sounds SL yes. MP3 and let's make directory sounds and let's create another one which is going to be just a moment and let's call this just a moment or maybe it could be called a detected and then this could be called processing so then you can put whatever sounds you want in these files and it makes more sense so let's create those so let's do Python 3 voices. py so now we should have two voices here and let's see if we can play them let's play a sound when we detect the keyphrase and this is going to be sounds SL detected. MP3 and let's to from path lib import path and let's make this path file. parent slash sounds slash detected and can I do something like this so that this will work on all devices not sure I know I can do the first part but can I then continue it like this let's see and then I'm going to play another sound when we actually start processing the message so let's say processing let's see if this works let's run again recognize. Pi okay gbt yes okay it works just a moment great I'm here to help what can I assist you with today all right so that now actually works and it answered me when I said it works okay GPT yes what is the pocket Sphinx python Library just a moment the pocket Spinx python library is a WAP for the pocket Sphinx speech recognition engine pocket Sphinx is a lightweight open-source speech recognition system developed by Cary melon University that can be used for both offline and realtime speech recognition applications all right that is enough I can't actually stop this for some reason all right so maybe we should actually instruct it somehow to give a very short answer because I think this is too long of an answer maybe we could actually tell it to only answer with one or two sentences and then ask the user if they want to hear more so let's change our system message answer only with two sentences at maximum if the question requires a longer answer ask the user first if they would like to know more now at the moment we can't just ask the user to provide another speech input so we should create some sort of function that chat GPT can call so that then we will directly ask for another command so now we have to just say Okay GPT again to give our answer and then let's add here after confirmation you can provide a full answer now maybe it would be nice if we could actually play the sound in like a different thread and then we could start listening to the commands again and then we could actually stop it somehow and I actually want the speed to be more so how do I set the speed it is a float 0.25 to four and one is default so let's do 1.5 and let's try this again okay gbt yes what is the capital city of France just a moment C Francis Paris okay that sounded horrible now wait a minute did I use alloy here yes why was it a different loudness that's kind of annoying so I will let it just be one whatever I'm not sure if the speed affect the loudness but let's try it again OK GPT yes what is the capital city of France just a moment the capital city of France is Paris okay gbt yes what is the I don't know what I was going to ask I was going to say what is the pocket Sphinx library for python just a moment the pocket Sphinx library is a python wer for the pocket Sphinx speech recognition engine which allows you to perform speech recognition in your python programs it can be used for tasks such as voice controlled applications or speech to text conversions is there anything else you'd like to know all right now those sentences were very long maybe I should only say one sentence or maybe count the words in this sentence so let's say that answer in 20 words or less 20 words or less if the question requires a longer answer ask the user first if they would like to know more okay let's try it again okay gbt yes what is the pocket Sphinx python Library just a moment I'm sorry but I don't understand what you mean by the pockets FS by the library could you please provide more context or clarify your question okay gbt yes you misheard me I said pocket Sphinx just a moment apologies for the misunderstanding how can I assist you with pocket Sphinx the speech recognition system oh and I forgot we don't have the context now so I have to say Okay GPT yes what is the pocket Sphinx python Library just a moment pocket Sphinx python library is a python library that provides access to the pocket API for saving and retrieving articles what so it misheard what I said and then I think it hallucinated this answer or is there a Pockets thinks python Library okay okay let's actually add some context to this chat so instead of passing this directly here we are going to say that this is going to be messages and let's define that here messages equals just the system message to begin with and then when we get the command we are going to say messages. append and we are going to append this user message and then send those messages to chat gbt and then when we get the response then we are going to add the assistant message to the messages so we have that context as well so let's try it out okay gbt yes what is the capital city of France just a moment the capital city of France is Paris okay gbt yes how about about Germany just a moment the capital city of Germany is Berlin all right so now we actually have the context of the previous messages now I guess we should reset the context maybe with like a timer or something because this is going to be running all the time in the device so at some point we have to forget what was the previous thing but let's try again okay gbt yes what is the pocket Spinx python Library just a moment I'm sorry but I couldn't find any information on a python Library called Pockets FS could you please provide more details okay gbt yes you misheard me I said pocket Sphinx just a moment I apologize for the misunderstanding Sphinx is a documentation generation tool for python projects it helps create structured and easy to navigate documentation for code all right so it again misheard me okay gbt yes I said pocket Sphinx just a moment I apologize for am confusion but I'm not familiar with a library called pocket spigset so I have to make sure that I get a good microphone for the actual device but anyway the context is working okay gbt yes can you not really get what library I'm talking about when I say those things because you are clearly misunderstanding what I'm saying what could be the python library that I'm talking about just a moment I apologize for any confusion without more specific information it is difficult to determine the exact python Library you are referring to could you please provide more details or context about the library okay gbt yes it is a speech detection library that is really bad just a moment apologies for the mixup based on the information provided the python Library you may be referring to is pocket Sphinx it is a library for speech recognition and is known to have some limitations okay it is known to have some limitations great it is working the next thing to do would be to add some sort of functions for chat GPT that it can run some tasks on maybe the computer but actually ultimately I want to run it in the Raspberry Pi and I would actually like to do some home automation with it so maybe like turn on the lights or something like this but the first thing we have to do is we have to actually test this in a Raspberry Pi so I am going to have to purchase one because I don't have one at the moment but this is going to be the end of today's video because finally I created a video in less than an hour in which I actually accomplished something so I might do another video on the function calling and then another video where I install this in the Raspberry Pi let me know in the comments what would you like to see next in this series and if you would like to help me out with that Raspberry Pi then you can leave me a super comment or click the link in in the description for my buy me a coffee page and buy me a raspberry pie anyway thanks for watching and I will see you in the next one
Info
Channel: Unconventional Coding
Views: 3,894
Rating: undefined out of 5
Keywords:
Id: xQdLiyCxyWQ
Channel Id: undefined
Length: 25min 37sec (1537 seconds)
Published: Tue Nov 28 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.