Build a voice assistant with OpenAI Whisper and TTS (text to speech) in 5 minutes

Captions
Hey everyone, Ralf here. In this video I'm super excited to show you how you can build your own voice-based chat assistant, where we'll use the Whisper API to take what you're telling your assistant and transcribe it, pass it through the chat completions endpoint, and then use OpenAI's TTS, or text-to-speech, to speak the response back to you. That way you can have a fully speech-based interaction with your chat assistant. We're going to use Node.js for it, and it will take you no more than five minutes to set this up, so let's get into it. All right, I'm going to walk you through all of the code, but first let me just show you what the end product looks like. I've got my terminal here, so let's run `node tts-chat.js`. The way it works is that you get a little welcome prompt, and then it says press Enter when you're ready to start speaking, so let's do that. Hello. "Hello, how can I help you today?" Could you just say hi to the audience watching this tutorial? "Hello everyone, welcome to the tutorial." Could you repeat that? "Hello everyone, welcome to the tutorial." So as you can see, it also has context of previous messages; that's built in. Now that you have an idea of what the end product looks like, let's go ahead and look at the code. To begin with, all of this code is available via a link in the video description, so you can go ahead and grab it, run it on your own computer, and play around with it. But let me walk you through it so you understand what is happening here. First of all, we have a bunch of required modules imported here, so make sure you install those with npm or yarn or whatever you prefer to use. I've had some difficulties getting ffmpeg to work on my machine, so I'm using this line here, with the help of this package, to make sure Node is aware of where ffmpeg lives on my machine. Then we just initialize an OpenAI API client. I have my OpenAI API key in an environment variable right here, so create that environment variable and store your OpenAI key under this name; that allows us to call the OpenAI API. Then I have a few variables that I set up here, and then I log the little welcome message that you see here. This code consists of five main functions, and I'll walk through each of them in the correct sequence. I also have a little diagram here if you're curious about how they all interface with each other. I'm not going to walk through the diagram; it's here in case it's helpful for you to look at it from a visual perspective. I'll instead walk you through the code and explain what's going on. The main function that works as the orchestrator of everything is the first one, which sets up a readline interface. What it does is really just set up an interface that lets us use Enter to start or stop recording, or any other key to terminate the program. We set that up right here, and then we have this logic that listens to key presses: if Enter or Return is pressed, we either stop recording or start recording, depending on whether we're already recording, and if any other key is pressed we just terminate the script. Now let's look at what happens when you hit Return or Enter. If the mic stream is set, which it isn't from the beginning (we'll set it in a second), it will stop recording; but if it isn't set, which is what happens when you first run it, it's going to start recording.
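A minimal sketch of what that key-press toggle could look like, assuming Node's built-in readline module; the function and variable names (setupReadlineInterface, micStream, startRecording, stopRecordingAndProcess) are illustrative guesses, not necessarily the ones used in the linked source:

```js
// Sketch: toggle recording with Enter, exit on any other key.
// Assumes startRecording() and stopRecordingAndProcess() are defined elsewhere
// and that micStream is non-null while a recording is in progress.
const readline = require('readline');

function setupReadlineInterface() {
  readline.emitKeypressEvents(process.stdin);
  if (process.stdin.isTTY) process.stdin.setRawMode(true);

  console.log('Press Enter when you are ready to start speaking.');

  process.stdin.on('keypress', (str, key) => {
    if (key && (key.name === 'return' || key.name === 'enter')) {
      if (micStream) {
        stopRecordingAndProcess(); // already recording, so stop and transcribe
      } else {
        startRecording();          // not recording yet, so start
      }
    } else {
      process.exit();              // any other key terminates the script
    }
  });
}
```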
When recording starts, it calls this function right here called start recording. What start recording does is create a new microphone and start writing its output to an output.wav file: this line starts writing the file, and this one writes the incoming audio data to it. We have some error handling here, and then we just console.log the "Recording... press Enter to stop" message that you see here. We will continue recording until the user hits Enter, and if you remember from up here, when Enter is hit we either start recording, which we just did, or, now that the mic stream has been set, we call stop recording and process. So let's go ahead and see what that does. It's a very small function: it stops the recording, it ends writing to the output.wav file, it logs the "Recording stopped, processing audio" message that we see here, and then it calls the next function, which is transcribe and chat. So let's look at that function. We can see here that we take output.wav as a file path and create a form, or form data, of things we want to append to the request in the axios call we're going to make: the file itself, which is stored in the same folder as our script; the model we're using, which is whisper-1; and then we say that what we want to get back is text. Then we use axios to call the OpenAI Whisper API, passing in our secret key, our API key, right here. This is an await call, so we wait until we get the response, and then we take the transcribed text that comes back from the Whisper API, which is what we spoke into the microphone, and console.log it. That's this right here: I said hello, or "you said hello". Next we want to pass that into the chat completions endpoint so we can get a response back that we can play to the user. So I'm creating this messages array right here, where we define the system message as "You are a helpful assistant providing concise responses in at most two sentences." You can change this if you want your assistant to do something else. I'm appending the chat history, so if we've had previous exchanges as part of our session they will be included in the next chat completions API call, and then we pass in the transcribed text we just got back as the user input, the question we're asking. We send that off to the chat completions create endpoint. I'm using GPT-3.5 Turbo; you can change this to another model if you want. Then we get the chat response text here, and we push it onto the chat history, which is a variable stored outside of this function, to make sure it's available when we run another request. Then we call streamed audio, which is our last function, with this chat response text, log that we have received it, and do some error handling, ending the mic stream, and so on.
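The record, transcribe, and chat flow described above could look roughly like the sketch below. It assumes the `mic`, `form-data`, `axios`, and official `openai` npm packages and an OPENAI_API_KEY environment variable; names such as startRecording, stopRecordingAndProcess, transcribeAndChat, and chatHistory are my own placeholders, and the actual code in the video description may differ:

```js
// Sketch of the record -> transcribe -> chat flow (not the author's exact code).
// Assumes: npm install mic form-data axios openai, plus OPENAI_API_KEY set in the environment.
const fs = require('fs');
const mic = require('mic');
const axios = require('axios');
const FormData = require('form-data');
const OpenAI = require('openai');

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
let micInstance = null;
let micStream = null;
let outputFile = null;
const chatHistory = []; // kept outside the functions so context survives between turns

function startRecording() {
  micInstance = mic({ rate: '16000', channels: '1', fileType: 'wav' });
  micStream = micInstance.getAudioStream();
  outputFile = fs.createWriteStream('output.wav');
  micStream.pipe(outputFile);                       // write the incoming audio to output.wav
  micStream.on('error', (err) => console.error('Mic error:', err));
  micInstance.start();
  console.log('Recording... press Enter to stop.');
}

async function stopRecordingAndProcess() {
  micInstance.stop();
  outputFile.end();
  micStream = null;
  console.log('Recording stopped, processing audio...');
  await transcribeAndChat();
}

async function transcribeAndChat() {
  // 1) Send output.wav to the Whisper transcription endpoint via axios + form-data.
  const form = new FormData();
  form.append('file', fs.createReadStream('output.wav'));
  form.append('model', 'whisper-1');
  form.append('response_format', 'text');

  const whisperResponse = await axios.post(
    'https://api.openai.com/v1/audio/transcriptions',
    form,
    { headers: { ...form.getHeaders(), Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
  );
  const transcribedText = whisperResponse.data; // plain text because response_format is 'text'
  console.log(`You said: ${transcribedText}`);

  // 2) Pass the transcription plus the running history to chat completions.
  const messages = [
    { role: 'system', content: 'You are a helpful assistant providing concise responses in at most two sentences.' },
    ...chatHistory,
    { role: 'user', content: transcribedText },
  ];
  const completion = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages,
  });
  const chatResponseText = completion.choices[0].message.content;

  // 3) Remember this turn, then speak the response back.
  chatHistory.push({ role: 'user', content: transcribedText });
  chatHistory.push({ role: 'assistant', content: chatResponseText });
  await streamedAudio(chatResponseText); // streamedAudio() is sketched further below
}
```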
But let's go ahead and look at this streamed audio function. Right above it I set two parameters: I set the input voice to echo, which is just one of the voices that OpenAI offers (there's a list of others you can use at this link), and I set the input model to tts-1. Again, there are other options available here; this is just the one described as the fastest to respond, and quality isn't really an issue for me in this case. The function takes those two as input, together with the input text we just got back from the chat completions endpoint. We define the endpoint we're calling, pass in our secret key as a header, and then define this data right here, which is the model, the input text, and the voice, and I say I want mp3 as the response format from OpenAI's TTS API. We send that off to OpenAI, and I set the response type to stream here. You can use another response type, but stream lets me start playing the response as soon as we start receiving it, rather than waiting for the entire audio to arrive before I can start playing it, so we can play the response to the user more quickly. I was not able to make this work with the official Node package, so that's one of the reasons I'm using axios for this. To play it, we set up a speaker here, and then we use ffmpeg to pipe data into it and play it. We have some error handling here, but otherwise we're just playing the audio to the user. And because this is running as part of the readline interface setup, the application keeps running in the background, so once this is completed the user can press Enter to ask another question, terminate the program, or do what I've been doing all along: just let it sit and wait, and then pick the moment when you want to ask something. So this has been running all along, not recording, but what I can do is: can you say goodbye to the audience? "Goodbye, and thank you for watching." So I hope this has been helpful. Hopefully it's a good starting point for you to understand how you can interact with OpenAI's Whisper API or their TTS, text-to-speech, endpoints. If it was helpful, please give it a thumbs up and consider subscribing to the channel. I will be putting out more tutorials on how to use the various OpenAI functionalities in the future. If you have any questions or comments, feel free to drop them down below; I'm super curious to hear what you're trying to build, or how I can help you. But that's it for this video. Thank you so much for watching, and hopefully I'll see you in a future video. Bye!
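To round out the walkthrough, the streamed audio function described above could be sketched roughly as follows. It assumes `axios`, `fluent-ffmpeg`, `speaker`, and `@ffmpeg-installer/ffmpeg` (one common way to point Node at an ffmpeg binary, matching the ffmpeg-path workaround mentioned in the video); the exact packages and parameter values in the linked source may differ:

```js
// Sketch of streamedAudio(): request TTS audio as a stream and play it while it arrives.
// Assumes: npm install axios fluent-ffmpeg speaker @ffmpeg-installer/ffmpeg
const axios = require('axios');
const ffmpeg = require('fluent-ffmpeg');
const Speaker = require('speaker');
const ffmpegInstaller = require('@ffmpeg-installer/ffmpeg');

ffmpeg.setFfmpegPath(ffmpegInstaller.path); // tell fluent-ffmpeg where the ffmpeg binary lives

const inputVoice = 'echo';  // one of OpenAI's built-in TTS voices
const inputModel = 'tts-1'; // the lower-latency TTS model

async function streamedAudio(inputText, voice = inputVoice, model = inputModel) {
  const response = await axios.post(
    'https://api.openai.com/v1/audio/speech',
    { model, voice, input: inputText, response_format: 'mp3' },
    {
      headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
      responseType: 'stream', // start receiving audio before the full file is generated
    }
  );

  // Decode the incoming mp3 stream with ffmpeg and pipe raw PCM into the speaker.
  const speaker = new Speaker({ channels: 2, bitDepth: 16, sampleRate: 44100 });
  ffmpeg(response.data)
    .toFormat('s16le')
    .audioChannels(2)
    .audioFrequency(44100)
    .on('error', (err) => console.error('ffmpeg error:', err))
    .pipe(speaker);
}
```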
Info
Channel: Ralf Elfving
Views: 8,976
Id: 6AiVUcDV5FE
Length: 11min 24sec (684 seconds)
Published: Wed Nov 15 2023