Ultimate AI Voice Assistant with ChatGPT API and OpenAI Whisper - Full Tutorial

Video Statistics and Information

Captions
Hey friends, welcome to 1littlecoder. In this video we're going to learn how to build your ultimate AI voice assistant in Python. When I say "in Python," we're going to use a lot of libraries, so let me break it down for you: first we're going to use the ChatGPT API, then OpenAI Whisper, then Coqui TTS for text-to-speech, and then we're going to put everything together in a Gradio application. I'm going to show you everything end to end. If you are not familiar with the OpenAI ChatGPT API, I would strongly encourage you to watch my other video where I've covered how you can use the API, so I'm not going to get into that particular detail here. But we are going to go through every single line of the code, and the code will be shared on Google Colab or probably on GitHub; I will link it in the YouTube description so it is easier for you to get started just by replacing the OpenAI API key.

The first thing we need to do is install all the libraries, but even before that, I'm going to show you a demo of what we are going to build. Let me quickly run my Gradio application. If I run it, hopefully without any mistake, it runs, and yeah, I have something. I'm going to ask a simple question: who is the greatest chess player of all time? Stop recording, submit. It's going to take a little bit of time because, as you can see, first our voice gets converted into text, then the text goes to ChatGPT, then the response comes back, and then it gets converted into voice again, so it's a long process. Okay, we have the answer. This is the ChatGPT text: "The title of the greatest chess player of all time is a subject of much debate." Let's hear how it sounds from the AI: "...player of all time is the subject of much debate and varies depending on..." Let me ask another question here. I'm going to close this and ask: how does he
compare with Magnus Carlsen? I hope I pronounced Magnus Carlsen correctly; it's always a tricky thing. Once again, here is what is happening: whatever I spoke gets converted into text first, then the text goes to ChatGPT on OpenAI, the response comes back, gets converted into audio again, and then it's played. Okay: "As an AI language model, I do not have personal opinions, but I can provide factual information about both players. Magnus Carlsen is considered one of the greatest..." Let's hear the same: "As an AI language model, I do not have personal opinions or biases; however, I can provide factual information about both players. Magnus Carlsen is currently considered one of the greatest chess..." Great, thank you so much, AI.

Now let's get into the code like Python coders and see how to build this. The first thing we need to do is install all the libraries. One thing for you to remember is that this is running on a Google Colab GPU, because we are doing voice transcription and also text-to-speech, and both of these will run faster if you have a GPU. If you do not have a GPU it will still work, and there are other ways to make it faster, like whisper.cpp, which is the C++ version of Whisper, but for now I've stuck with the Python version of Whisper and I'm running it on a GPU to make the inference faster. So that's something to keep in mind: this is a Google Colab notebook with the free GPU, the Tesla T4 machine that Google gives us.

With that setup done, install the required libraries. I'm installing TTS, which is the library from Coqui.ai that I just mentioned. While installing this library, it requires a specific NumPy version, 1.21, so I'm separately installing numpy 1.21, basically downgrading the NumPy version. Then I'm installing openai-whisper, gradio, and openai.
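The install steps just described can be sketched as a Colab cell like the following (package names as of early 2023; the exact numpy pin is what the TTS package expected at the time, so check the error message your install gives you):

```shell
# Coqui TTS first, then pin numpy to the 1.21.x line it expects
pip install TTS
pip install "numpy==1.21.*"

# Whisper for speech-to-text, Gradio for the web UI, openai for the ChatGPT API
pip install openai-whisper gradio openai
```

In a Colab notebook each of these lines would be prefixed with `!` to run as a shell command.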
OpenAI Whisper is for voice transcription: when I speak in English, it's going to transcribe it into English text. Gradio is for the web UI that we just saw, and openai is of course for the ChatGPT completion. Now import all the required modules: import whisper, import gradio as gr, import openai, and from TTS.api import TTS, which is going to help us build the text-to-speech model.

At this point we have installed and imported all the required libraries. Next we set up our environment: the text-to-speech model, the speech-to-text model, and the OpenAI key, all three at this point. To start with, we set up the text-to-speech model. You need to pick one, and for your information this library also works multilingually, so you could use a multilingual model, but right now I'm using an English-only model. You can see the first model in the TTS list is a multilingual model. In my line of code I'm selecting index nine, which is the tenth model, since Python is zero-indexed; if you put zero instead, that would give you the multilingual model. But I didn't want to use the multilingual model: it's heavy and takes a lot of time, and given that I'm making an MVP, a minimum viable product, and a demo, I went with English. You can select any model you want from this list. I've used the model called Glow-TTS, but you can always play with the different voices and see what you like. One of the other reasons I picked this library is that you can also clone your own voice with it, so when you talk to this, it could reply as, say, Morgan Freeman, or in your own voice if you are not bored of hearing it. But I'm not doing that; I'm just
using the default voice that is available. The first step is to list all the models; then, once you know which model to use, use that model name to create the TTS instance. This TTS instance is going to do the synthesis. For example, you can give it any text, like "I love playing chess," and save the output to output.wav in WAV format using the tts_to_file method. Run it; it saves output.wav, and you can see how little time it took. This is a really fast library; you can even deploy it on a Raspberry Pi, which is one of the reasons I picked it. Now let's play it: "I love playing chess." Great. So the first part of the setup, text-to-speech, is completely done.

The next thing to set up is speech-to-text. We are going to download the Whisper model that will transcribe the voice into text. I'm using the medium model; I've been a big fan of the OpenAI Whisper medium model since the start. They have large, large-v2, base, tiny; you can experiment with the different sizes, but I've stuck with medium because for me it balances the trade-off between speed and size. You are free to choose depending on where you want to deploy this. If you want to start with the smallest models, try base or tiny and play with those. I've got a separate playlist on OpenAI Whisper covering everything from speaker diarization onwards, and I strongly encourage you to check it out, but for now use the medium model. This will download the model and assign it to an object, and now our Whisper is set up.

So text-to-speech and speech-to-text are both set up. Now for the middle part: once you have the text, you have to send it to OpenAI, and for that you need the OpenAI key, which is exactly what we are doing here with openai.api_key. You set up the
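Putting the two model-setup steps together, a minimal sketch might look like this. The model index 9 and the Glow-TTS name depend on your installed TTS version's catalogue, so treat both as assumptions; the heavy imports are deferred into the function so the file loads without the models present:

```python
def setup_models(tts_model_index=9, whisper_size="medium"):
    """Load a Coqui TTS voice and a Whisper speech-to-text model.

    Requires the TTS and openai-whisper packages; the first call
    downloads model weights, and a GPU makes inference much faster.
    """
    from TTS.api import TTS  # heavy dependency, imported lazily
    import whisper

    available = TTS.list_models()          # index 0 is a multilingual model
    tts = TTS(available[tts_model_index])  # index 9 was an English Glow-TTS voice in the video
    stt = whisper.load_model(whisper_size) # "medium" balances speed vs. accuracy
    return tts, stt

# Usage (downloads models on first run):
#   tts, stt = setup_models()
#   tts.tts_to_file(text="I love playing chess", file_path="output.wav")
#   text = stt.transcribe("output.wav")["text"]
```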
OpenAI API key. Once again, if you are watching this channel for the first time and don't know anything about the OpenAI API key, I've got separate videos where I've covered in detail how to get the key and different ways to call the API; I strongly encourage you to check them out.

After you have set up the API key, the next thing you need is a function. Any time you build a Gradio application, it has three important components: input, output, and a function. The function is what we are going to build now. It takes the user's voice as input; the application works like this: I talk, as you have already seen, so the voice is the input, the function takes that voice, does a lot of things, and finally gives two things as output: one is the reply as text, and the second is the reply as voice. Those two things we need to build.

First we set the context for the chatbot. In this case we are saying "You are a kind helpful assistant." It could be any kind of assistant; you could build whatever assistant you want. Again, not to plug my other videos too hard, but I have made a separate video about how this assistant behaves in different contexts based on different setups, so check it out if you want to play with this part. For now we are building a helpful assistant, that's it. Then you take the user's voice, transcribe it, extract the text, and assign it to an object called user_message. You send that user_message into the messages list where you have already set the context for the chatbot, which in this case is the ChatGPT API. Here I'm just printing it for debugging; I'm an old-school print-statement-debugging guy. Yeah, that's bad practice, but I'm still doing it. The next thing is to send the messages, and this is the
API call. You send the messages to the OpenAI server, get the response back, store the response in an object, and extract only the message content, which I assign to reply. Then append the reply to the messages as well, so the chatbot keeps the context. That is why, when I asked "how does he compare with Magnus Carlsen?", the chatbot already had the context that we were talking about Garry Kasparov, so the pronoun "he" meant Garry Kasparov. That's quite important.

Next, now that you have the reply as text, you send it to the TTS system we set up, which converts it into audio. The function then returns the text string and the path to the audio file. So this function takes the user's input voice path from Gradio and returns the ChatGPT response as text plus the ChatGPT response stored as audio, via its file path.

Now we are going to actually build the Gradio application. A Gradio application can be built using broadly two different methods: one is the Interface method, the second is the Blocks method. We have already covered both on this channel, but for simplicity I'm going to focus on the Interface method. The Interface method, like I said, needs three things: your input, your output, and the function itself. The input is ideally your microphone, so an Audio component with source "microphone", storing the recording as a file path. The output is a text output and a voice output, so I've got gr.Textbox and gr.Audio. The function is voice_chat, the function we just built, and we are launching with debug=True so that if something goes wrong, we'll see it. One more thing you can
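The flow of that function (transcribe, append to the context, call the API, append the reply, synthesize) can be sketched with the heavy pieces passed in as callables, which also makes the conversation logic easy to test. In a real run, transcribe, complete, and synthesize would wrap Whisper's transcribe, openai.ChatCompletion.create, and tts.tts_to_file respectively; the function and parameter names here are my own, not from the video:

```python
def voice_chat(transcribe, complete, synthesize, messages, audio_path):
    """One assistant turn: audio in -> (reply text, reply audio path) out.

    messages is the running ChatGPT context, starting with the system
    prompt; appending both the user message and the reply is what lets a
    follow-up like "how does he compare with Magnus Carlsen?" resolve
    the pronoun.
    """
    user_message = transcribe(audio_path)
    messages.append({"role": "user", "content": user_message})
    reply = complete(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply, synthesize(reply)

# Stubbed usage showing the context growing by two entries per turn:
history = [{"role": "system", "content": "You are a kind helpful assistant."}]
reply, path = voice_chat(
    transcribe=lambda p: "Who is the greatest chess player?",
    complete=lambda msgs: "That is a subject of much debate.",
    synthesize=lambda text: "output.wav",
    messages=history,
    audio_path="input.wav",
)
```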
do when you launch is, I think, pass live=True, and then launch it so it runs live. Okay, that's wrong; I think I was supposed to give live inside the Interface. Let me try live=True... am I still okay? Live should be inside... sorry, live should be inside Interface, that's my bad; I'd passed it to launch. Okay, so now we have the Gradio output, and it's going to print the responses as we run the application. Let me ask a question, maybe about the first computer: who built the first computer? The moment you stop recording, you don't have to submit; that's what live does. At the start of the video you saw my demo where I had to hit submit, but for a voice assistant to feel intuitive, you shouldn't have to. "The first computer was built by a British mathematician named Charles Babbage in the early 19th century. He is credited with designing the first mechanical computer that could automate the process of mathematical calculations; however, due to various reasons, his design was not finished during..." I didn't want to listen to the entire history of who built the first computer, but as you can see, it has given the response, and this time it was quite fast because the sentence was quite small. Again, you can control a lot of things: how long the response should be, how short the reply should be. That will make your AI voice assistant much more efficient.

Once again, this is just an MVP; I wanted to build a minimum viable product. Am I going to use it every day? No; with the latency it has got, the setup is not perfect. But there are a lot of things you can do with this. For example, if you're using a Mac, you can hook this up into a developer utility on your Mac and then try to get the response and reply. There is a lot
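The corrected wiring, with live=True on the Interface rather than on launch(), might look like this using the Gradio 3.x component arguments from the video's era (newer Gradio versions changed `source` to `sources`); the import is deferred so the sketch stays lightweight:

```python
def build_app(voice_chat_fn):
    """Microphone in, (text, audio) out; live=True submits when recording stops."""
    import gradio as gr  # deferred: only needed when actually building the UI

    return gr.Interface(
        fn=voice_chat_fn,
        inputs=gr.Audio(source="microphone", type="filepath"),
        outputs=[gr.Textbox(label="ChatGPT reply"),
                 gr.Audio(label="Spoken reply")],
        live=True,   # live belongs on Interface, not on launch()
    )

# app = build_app(voice_chat)
# app.launch(debug=True)
```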
of potential in what you can do with this directly, but my intention in this video was to show you how to put all the pieces together and build the ultimate AI voice assistant using the ChatGPT API, Whisper, text-to-speech from Coqui (coqui.ai), and Gradio; I'm a big fan of Gradio. I hope this video was helpful to you in learning all these pieces to build the ultimate AI voice assistant. I don't want to bore you any further, but if you have any other questions, let me know in the comment section and I'll try to help you out. The code will be linked in the YouTube description so you can get started right away. See you in the next video. Peace.
Info
Channel: 1littlecoder
Views: 17,809
Id: cyd90Xb_4p8
Length: 14min 8sec (848 seconds)
Published: Fri Mar 03 2023