Speech Recognition & Voice Synthesis in React (Web Speech API)

Captions

Language is a critical component of our globalized world, and while I wish we all had a universal translator to help us out, especially as I'm trying to learn Portuguese, what if we built our own, so that when I activate it I can say something and it'll say it right back to me in another language? We're going to see how we can do this right inside of the browser using speech recognition and speech synthesis. We're going to use the OpenAI SDK to translate, and we're going to do this inside of a React app using Next.js.

I'm going to start inside of this project that I created, which has a little bit of placeholder UI that's supposed to be inspired by the Star Trek Enterprise universal translator. While there are a bunch of different options, this is probably the best one that just has a little bit of UI that I can kind of support in a visual way.

Now, before we dig in: browser support for this is actually pretty limited. Desktop Chrome, from my experience so far, has been the best. I heard that Android Chrome is pretty good. Safari has pretty limited support but is still able to work, and from what I was able to tell, Firefox wasn't able to work. So you want to make sure that you're using a browser that's actually going to be able to support these APIs as you're playing along. Generally I don't want people to build apps that only work in specific browsers, but sometimes we're kind of limited to that when we're working with these new browser APIs.

Now, back to the application. The first thing I want to do is be able to click this record button and have it take my voice and turn it into text. You might think that sounds pretty complicated, but really all we need is the Web Speech API, and specifically we're going to use SpeechRecognition. If you try to look at the examples section for this, they do have a demo, but it's a little bit complicated to read, so we're going to just dive in.

Inside of my Next.js application I have my main homepage, which is referencing this Translator component. Inside of this Translator component, I'm going to get rid of the sidebar, and we can scroll to the top and see that I'm already referencing "use client", as everything we're going to be doing here is inside the client, so I want to make sure that this component can actually access those client-side APIs. If you're interested in the styles and the UI that I created here, you can check it out with the demo that's inside the description. Right now I'm interested in this button: when we click it, we want to be able to start that recording instance.

So the first thing I'm going to do is add an onClick handler, which I'm going to call handleOnRecord. Then I'm going to jump up to the top of the file and add my new function, handleOnRecord. Once that click handler fires, we want to ultimately access that speech recognition API, so I'm going to reference window.SpeechRecognition. Because we're trying to have cross-browser support as much as we realistically can, and Safari actually requires a vendor prefix for this, we're going to also add webkitSpeechRecognition, which we're going to reference from the window as well. Then, to make it easier, we're going to simply add a constant, SpeechRecognition, so that we don't have to play around with having that "or" statement anywhere we want to reference it.

The first thing you might notice here, and let me hide myself, is that between SpeechRecognition and webkitSpeechRecognition we get these nice, or not nice, squiggly lines indicating that we're having a TypeScript issue. We don't get support out of the box for these types, but we can reference a types package that will give us those types. So I'm going to run the installation command: inside my terminal I can run npm install, and once I head back to the project we should see that those type issues are resolved.

So now that we have SpeechRecognition resolved, we're going to create a new constant, recognition, and set that equal to new SpeechRecognition(). Then, just to get things started, I'm going to run recognition.start(). If I head to the browser and click the record button, we should see something important: we need to give access to our microphone in order to make this work. I'm going to go ahead and click "Allow this time", and we can see that we do get that little indicator up there saying that it is currently recording. We're not currently doing anything with this, so we can't really see any of the results, but we know that we were able to successfully trigger it. So now we can start to listen to the audio that it's grabbing.

Back inside of my code, before I run this start method, I'm going to define a new callback handler. I'm going to say recognition.onresult is equal to an async function, and we'll see why I'm defining it as async later. That function is going to receive an event, so let's log out that event to see what's actually inside. If we head back to the browser and I click record again ("testing this out"), we can see that we get this event fired. If we start to look inside, we can see this results property, and if we dig into [0][0] we can see this transcript, which is exactly what I just said: "testing this out". We even see the confidence of that result, which is pretty high, and it got it pretty much right.

So let's store that result. I'm going to say const transcript is equal to event.results[0][0].transcript. Ultimately we want to store this value so that we can reference it later for other things, so I'm going to go ahead and import useState from React, and then I'm going to create a new instance of state. Let's just add it here: a constant with text and setText, set equal to useState (I can't type right now). Once we have that transcript, let's go ahead and store it using that transcript value. We also need to type out our state as a string. Now that we will have that text, let's display it on the page to make it a little bit easier to see what we just said. Under "Spoken Text", which I already have down here, or really wherever you want to display it on the page, I'm going to go ahead and add that text. Now, heading back to the browser, if I record again ("testing this again"), we can see that the text was added directly to the page this time.

So this was a great first step. We were able to successfully have the browser listen to what we were saying and then take that text and display it on the page, and it was a little bit simpler than you probably imagined. Next, let's take that text and translate it so that we can ultimately get it spoken in another language.

To do this we're going to use the OpenAI API, and specifically the Chat Completions API, where we're going to craft our prompt, send it through to that API, ask it to translate, and receive that response back. We can even see on their site that they have an example for translation where the prompt is pretty simple: "You will be provided with a sentence in English, and your task is to translate it into French." We then provide that sentence. We're going to mess around with that a little bit, but generally you'll be able to just copy and paste this example as is, and we'll start off by copying this example.

We're not going to walk through signing up for an account, so the assumption here is that you already have your OpenAI account and your API key, or that you're going to substitute this for your own translation service. In order to use OpenAI we're going to use the OpenAI SDK, so I'm going to head over to my terminal, hide myself, and run npm install openai.

Now, inside of the project, we're going to create an API route where we're going to run this code. Inside of the app directory, or wherever you're creating your endpoint, I'm going to create a new folder called api, a new route called translate, and then a file inside called route.ts. Inside route.ts (I'm going to hide my sidebar here) I'm going to export an async function called POST, where I'm going to receive an argument of request, and I'm going to type that out as Request. Ultimately I want to return a response, and I'm going to use NextResponse with the json method, which will help me return a JSON response. I'm going to also import NextResponse from next/server.

At this point we can start to copy in the code to use the OpenAI SDK in order to do the translation, and to start we're just going to copy and paste it as is from the example. I'm going to go ahead and copy this entire snippet and paste it in at the top of the file, where we can see that we get our OpenAI import. We're going to leave that as is. We're now initializing the OpenAI SDK using our OpenAI API key, so make sure that you're setting this environment variable in your .env.local or in your deployed environment, wherever you're going to be running this project. That's going to be the API key from your account. Once we have that, we can see that we're now using the OpenAI chat completions create method, so let's grab that create snippet and put it at the top of our POST request.

Before we actually do anything with this response, I want to tweak this a little bit, because we're not going to only provide a sentence in English. We want this to be multilingual, and while we might not get too deep into that in this tutorial, we want to set this up so it has that ability. The first thing I'm going to do is make this a little bit easier to read by breaking it down. I want to change the part that says we're going to be passing this in as an English sentence, so I'm going to just add a period and say we're going to be passing in a sentence. Now, what will our task be? I'm going to break that up into bullets, so I'm going to say "Your tasks are to:" and start to define these bullets. First, "Detect the language of the sentence" (spelling that right), and then "Translate it into", where I want to pass in a dynamic language, because remember, we want the person to be able to select which language they want to translate into. So I'm going to add a language variable there. We don't need the rest of this, so I'm going to get rid of it, but I'm going to add at the end "Do not return anything other than the translated sentence."

Of course, we're going to need this language variable so that we can dynamically pass it through, and we're going to grab that from the request object. I'm going to say const language, which I'm going to destructure from request.json(), and we need to be sure that we await that json method. That's going to be dynamically grabbed when we post our JSON body to this endpoint. We also need to pass in the text that we actually want translated. We can see here that the second message is the user message, where we can simply pass through the text, and we're going to again grab that from our request.json().

One thing to also note is that we have these additional settings that came along with the example. If you want to tweak them to something that is more appropriate for your use case, you can find all of that inside the OpenAI documentation. Ultimately we want to respond from this endpoint with our translated text, so I'm going to use this response. I already know what it's going to look like, so I'm going to cheat a little bit: we're going to set up a text property, and I'm going to say response.choices, where I'm going to grab the first choice, then the message from that choice, and then the content. That property chain is going to be the translation that we get from the SDK.

With all that, we should now have our translation endpoint, which we can start to integrate into the front end of the application. I'm going to head back over to our Translator component, to where we were grabbing that transcript, and I'm going to make a request to our API. I'm going to say await fetch and make that request to /api/translate, where we're going to set a method of POST and a body of JSON.stringify, passing in the text property, which is going to be that transcript. The next thing we need on top of that is the language. The way that we're going to set the language here is with language codes rather than language names, because that's the data we're going to more easily have available when we're working with it throughout the application. I'm going to set this as pt for Portuguese and then hyphen BR for Brazil: pt-BR. You can really use whatever language code you want here; ideally, later, we can dynamically set this so that it can change to whatever we want.

Once that fetch completes, we ultimately want to store that as a response, so I'm going to chain a then and turn this into my JSON response. We then want to store that response, or the text, so that we can use it in the application, similar to what we did with the transcript. I'm going to create a new instance of state and set that to translation, with setTranslation, which is also going to be a string. After that fetch request I can set my translation as response.text, and then I can take that translation down to where I have some text available for it and simply add my translation.

So now let's head back into the application and give this a go, where I'm going to record: "Show me what you got." We can see we got the text and we have the translation in Portuguese. So now we're done with our second step, where we're taking our voice, turning it into text, and translating that text.

Now let's have the browser speak that text out loud. To do this we're going to use SpeechSynthesis, which is part of the Web Speech API, similar to what we just used with SpeechRecognition. If we head down to the examples and check out what this looks like, we can see that we're going to create a new SpeechSynthesisUtterance and then use an instance of SpeechSynthesis to speak that utterance. Looking at SpeechSynthesisUtterance, we can see that there's a whole lot of different properties that we can add on to this, and we're going to add some basic ones soon, but let's start with a simple example. I'm going to copy and paste the example that they have inside the docs, and I'm going to paste it right after we set that translation. We want to make sure that we're accessing window.speechSynthesis (not windows, but window). If I head to the browser and test this out ("test", "hello world"), we can see in here that we got the spoken text, we got the translation, and we heard that "hello world" text that we pasted in.

If you noticed, when we spoke this it used an English voice, and what we would want to do is use a Portuguese voice instead. The way that we can do this is on our utterance: we're going to set utterance.voice to a different voice that we have available inside of the browser. Before we do that, in order to actually get the voices, we can use this window.speechSynthesis instance, where we're going to say const voices is equal to window.speechSynthesis.getVoices(). Before we do anything else, let's go ahead and console.log our voices. I'm going to comment out the utterance for now, and if I hit record again ("test"), we can see that we actually got an empty array of voices. What gives?

The funny thing with how this works is that usually the voices aren't going to be immediately available. The way that we can make this work is to listen for the voices to become available with a callback function, while also having an escape hatch: if the callback isn't available, we can still get the voices, just in a more synchronous way. To do this I'm going to set up a new useEffect where I'm going to try to get those voices. I'm going to first import it and set up useEffect, where inside I'm going to first try to get those voices by running const voices is equal to window.speechSynthesis.getVoices(). To check that they exist, we're going to say: if Array.isArray(voices) and voices.length is greater than zero, then I want to store this in a new instance of state. So I'm going to create that instance of state, voices and setVoices, and this time it's going to be an array whose contents are SpeechSynthesisVoice, and then we can set voices to our new voices.

If the voices exist, we're going to simply return afterward, because we don't need to do anything else. If they don't exist, we can use speechSynthesis to set a new event handler to detect when those voices have loaded. I'm going to say window.speechSynthesis.onvoiceschanged and set that equal to a new function, where inside we're going to do the same thing as before: we're going to try to get those voices, and if they exist we're going to set them, because at that point, since that callback ran, they should technically exist. The tricky thing is that onvoiceschanged isn't available everywhere, so we want to make sure that this only runs if it's actually available. I'm going to wrap that in a check for whether "onvoiceschanged" is in window.speechSynthesis, and only then am I going to actually run that code.

Now let's log out our voices to see if this is working in the first place. I'm going to take the voices and console.log them, and if we head back to the browser (it already refreshed, but I'll refresh again) we can see we have all these voices. If we start to inspect this, just to see what we're looking at, we can look at one of these voices and see that we get a few different attributes that describe it. We have a bunch of different names related to it, but we also have the language, which is the important part here.

Since I know the language that I currently want to set it to, I'm going to go ahead and update this language variable to Portuguese (Brazil). What I can do is check through all these voices and check the lang to see which ones are available in that language. I'm going to say const availableVoices is equal to voices.filter, and inside there I'm going to use the lang (let me hide myself to make sure I don't hide something) and check that the lang is equal to the language. Of course, the voices might not be available, so I need to make sure I add optional chaining here. Now that we have these available voices, let's log them out again just to see where we're at. If we head back to the browser, we should see that we have far fewer voices rather than all of them, and we can see that each of them is related to that language of Portuguese.

I'm going to hide myself again. We can see a bunch of different names through here. The ones that I think sound the best are either the Google Portuguese voice or Luciana, so I'm going to look for one that has Google in it to determine the active voice, and then we can fall back to Luciana. We're going to do something similar to what we did when we were looking for the available voices, where we want to find a specific voice that we want to use. The names between the different languages are pretty consistent, so we should be okay with that.

What I want to do now is find the constant activeVoice, and I'm going to do that in a few different ways. I'm going to use the find method in order to find the specific name that I want, so I'm going to use voices.find (I need to make sure that it actually exists first), and then on that voice I'm going to look for the name and say the name includes (I'm going to hide myself) "Google", just to start off. Now I realize that I used voices here; I want to actually use availableVoices. Let's log out these active voices to test this out. Heading to the browser, we can see that I now have that active voice that has a name that includes Google.

Now let's also set a fallback, just in case that doesn't exist, so we can have a little bit of cross-browser capability there. As I said, Luciana is a good voice to fall back to, at least for our particular use case here, so I'm going to set it up so that we fall back to that if the Google voice doesn't exist. It's going to be a little bit ugly (let me hide myself): I'm going to say that we're going to look for that Google voice, or I'm going to try to find the voice whose name includes "Luciana", or, if that doesn't exist as well, let's just grab the very first one, because we don't want to try to find one that we don't really know about. It's going to yell if I don't add the optional chaining here as well. I also want to correct this activeVoices to activeVoice, just to make sure that we have an appropriate name here.

Now that we have our active voice, let's use it with our utterance. I'm going to scroll down until we get to that utterance, and where we set our voice I'm going to set it to activeVoice. We can see that it's going to yell because activeVoice might not be available, so after this setTranslation I'm going to simply say that if activeVoice doesn't exist, we're going to return out. Then I want to make sure that I uncomment this speaking line here, and of course I want to make sure that I'm not just saying "hello world". I want to pass in the actual text that we want to be spoken, so I'm going to pass through response.text. Now if we head to the browser and test this out again ("testing this out"), we can see that not only did we get that translation, we got it spoken in a voice that's more familiar for Portuguese.

Before we go any further with this, let's check the browser compatibility quickly. If I head over to Safari, click record, and allow it ("testing this out"), we can see that nothing actually happened. We can see the mic is recording, so that part worked, but we never got a result. The weird thing with desktop Safari is that we need to actually stop the listening session in order for that to work. So, looking at our code, just as we're starting the recognition, we need to stop it if we already have an active session. The way that we're going to do this is to detect whether it's active by setting a state, isActive, where if it is active we're going to show a slightly different button (I'm going to have a stop button), and then we can cancel that speech recognition session.

Because we're creating a new instance of SpeechRecognition each time, we need to be able to access that same session. The way that I'm going to do that is to store recognition as a ref. At the top of the file I'm going to import useRef, create a new recognitionRef, and set that equal to useRef (ref, not rest). Then, down inside of my handleOnRecord, where I actually create that new instance, instead of creating a new constant I'm going to say recognitionRef.current is equal to new SpeechRecognition(). We can see that we get that red squiggly line there, and that's because we don't have that typed out, so I'm going to head to the top and type out that ref as a SpeechRecognition instance, and we can see it's now happy. We also need to update all the places where we used that constant, so I'm going to go ahead and update recognition; we can see that we have it with start as well.

As I alluded to, we want to be able to set that isActive flag. I currently just have a variable called isActive, but let's turn that into an actual state instance. I'm going to create isActive and setIsActive, which is going to be a Boolean and default to false, and I'm going to get rid of that isActive constant. Then, if I head back down to where I'm setting everything up, instead of simply setting it after the click, I want to set it to active only if it's actually active. The way that we can do that is to use another event listener on the recognition instance. If we go back to the SpeechRecognition events, we can look through the ones that we have available: we have audiostart and audioend, end, error, nomatch, result (which is what we're currently using), soundstart and soundend, and speechstart and speechend. For now we're just going to use the start and end handlers in order to listen for when it starts and when it stops, so that we can set that isActive flag. I'm going to create a new handler, onstart, equal to a function where I set isActive to true, and then I'm going to clone that for onend, where I set isActive to false.

Now, once I head to the browser, hit record, and say something, we can see that during that time, and for a moment after it stops as well, I had that bright red button that said stop, and I also had this nice little red circle that indicated that we were currently recording. So we had that red button, but now we want clicking it to set the session to not active, so that we can manually stop it, because Safari, as you remember, won't automatically shut down. All we need to do for that is head back to handleOnRecord, and when it's clicked, check if isActive is true, meaning we currently have an active session. If so, we're going to take the existing recognitionRef.current, call stop, and then return. Of course, we probably want to make sure that it exists in the first place before we try to run it. In addition to that, we can also call setIsActive with false, just to make sure that we turn off the UI.

Now I head back to the browser and try to test this out in Safari. I'm going to hit record, then allow: "testing this out". We can see that it actually didn't work, but let me give this one more try. I'm going to refresh the page, hit record, and wait a couple of seconds before I start talking. So I hit record, allow: "testing this out". We can see that it did get that text; it's just a little bit wonky in how it's able to get it. We had that delay before it was able to detect the voice, so it's not consistent and it's not reliable like what we saw in Chrome, but it still kind of works.

Keeping to the spirit of cross-browser testing, I went ahead and deployed this to Vercel so that I can pull it up on my phone. We can see it's iOS Safari, and if I try this out ("testing this out"), we can see that it actually did work; however, it took a little bit longer for it to stop. There was about a 3-second delay before it recognized what I said, and in this instance it didn't actually catch it. Let me try it one more time: "testing this out". We can see this time it picked it up, but the one thing you might have noticed is that it wasn't able to actually speak what I said.

Another tricky thing is that in order for the utterance API to work, it needs to be activated from user engagement. While we technically are doing that by using that click, the speaking is happening inside of a callback after the click, so it's not really connected to it. We have a little bit of a trick that we can use to fix this: we're going to use an utterance to speak an empty text, so it's not going to speak anything, but it's going to activate the session so that we can then speak something later.

To do this, the first thing I'm going to do is abstract this into its own function. Let's add a new function called speak and have it take in some text, which will be a string, so that I can just pass that in. Then we can run this speak function where I had that code before, after this setTranslation. I'm also going to paste the same thing at the top of where we invoke this handleOnRecord function, but this time it's just going to be passed an empty string.

Just to first make sure that it works in desktop Chrome: "Does this work?" It worked. So now let's deploy this. Now I have it pulled back up in the browser on my phone: "Does this work?" Did you hear that? It both worked and we got the text, even with that little delay, and we also got the voice, so now it completely worked, aside from that delay of course. I'll include a demo and the code inside of the description of this video so that you can see a few different event handlers and such.

I have one more thing that I want to show: let's make this language selector dynamic, so that we can include a list of all the languages we have available that somebody can select from. To start off, I don't want this language to be a static value; I want it to be yet another state instance. I'm going to go ahead and clone one of these to make language and setLanguage, which is going to be a string, and we can set the default to Portuguese (Brazil). I'm going to get rid of this language constant.

Then I want to create a list of all the languages that are available. I'm going to create a new constant called availableLanguages, and I'm going to create a new array from a new Set of all the different languages that are available. We can see here that I'm referencing some things that I haven't pulled in yet, but what I'm ultimately doing is breaking apart that language code so that I can match it to both the language code and the country code. That way I get a really granular look at which language, and which region that language is from, is actually available.

Inside of my project I already have this data directory where I have the country codes and the language codes. Again, you can find that in the demo code if you want to pull this out. I just grabbed it from Stack Overflow, Wikipedia, or wherever; these are pretty common codes that are widely available. Now I want to pull them into my app, so I'm going to go ahead and paste this in, where I'm grabbing that code data from those two different JSON files (just making sure I'm hiding myself in case you can't see this). We're going to grab those codes, and then I'm going to create a typed instance, because I couldn't figure out a way to do this natively. Anyway, I have the language codes and I have the country codes, and now we can see that everything inside of this availableLanguages array is defined. Ultimately the availableLanguages, which are also sorted, should give me that list, so let's console.log that out. If we look inside of the browser, we should see that we have all those available languages from that set of voices.

So now we can give somebody the option to choose from these. Down inside of my code I'm going to find that select element and replace it, where the replacement has two specifically important things. The first is that I'm going to map through all those available languages and create a new option for each one. Remember how I was talking about splitting apart the language code? This is important here, because now I can show the actual language name, which is a little bit easier to read than the codes, and then I can show the language code in addition to that, so that people have the ability to select a region if that's important to them.

Then we want somebody to be able to select that language. Let me hide myself. We have this handler at the end here, which I should probably break down: it's the onChange event handler, so when that value is changed, it updates the state to the selected language. We also see that we have the value set to that language so that this is a controlled input, and anytime it changes it's going to update through that state mechanism. So now we have our language input based off of this select element, and if we head to the browser we can see the select list that includes all these different languages.

Now we should have the ability to use this inside of that logic. If I head back to the code and look for all the instances where we hardcoded pt-BR, there should only be this one, where I can simply get rid of that pt-BR and have the language passed right in. If I head back to the application, let's first check out Brazil, just to make sure that's working: "testing this out". Okay, we're good with that. Now let's try a different language. Let's try Korean: "testing this out". We can see that we were able to translate it on the fly to the different languages that are available, based on the voices that are available inside of the browser.

This kind of thing is super handy, so that when I'm at BrazilJS Conf this year I can say "Do you like JavaScript?" and I have my universal translator to help me out.

Next up, I bet you know how to style active links inside of the Next.js App Router, right?
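The voice handling described in the captions (read the voices right away, fall back to the voiceschanged callback where it exists, then prefer a Google voice, then Luciana, then the first match) might be sketched roughly as follows. This is only an illustration, not the code from the video: the helper names loadVoices and pickVoice and the VoiceLike and SynthLike interfaces are assumptions introduced here so the sketch does not depend on DOM typings.

```typescript
// Minimal shapes (assumed for this sketch) so it runs without DOM typings.
// A browser's SpeechSynthesisVoice and window.speechSynthesis fit them structurally.
interface VoiceLike {
  name: string;
  lang: string;
}

interface SynthLike<V extends VoiceLike> {
  getVoices(): V[];
  onvoiceschanged?: ((...args: any[]) => any) | null;
}

// Hypothetical helper: report the voices immediately if they are already loaded,
// otherwise wait for the voiceschanged callback when the browser exposes it.
function loadVoices<V extends VoiceLike>(
  synth: SynthLike<V>,
  onReady: (voices: V[]) => void
): void {
  const voices = synth.getVoices();
  if (Array.isArray(voices) && voices.length > 0) {
    onReady(voices);
    return;
  }
  if ("onvoiceschanged" in synth) {
    synth.onvoiceschanged = () => {
      const loaded = synth.getVoices();
      if (Array.isArray(loaded) && loaded.length > 0) onReady(loaded);
    };
  }
}

// Hypothetical helper: keep only voices for the exact language code, then
// prefer a Google voice, then Luciana, then whatever comes first.
function pickVoice<V extends VoiceLike>(voices: V[], language: string): V | undefined {
  const available = voices.filter((voice) => voice.lang === language);
  return (
    available.find((voice) => voice.name.includes("Google")) ??
    available.find((voice) => voice.name.includes("Luciana")) ??
    available[0]
  );
}
```

In a client component, something like loadVoices(window.speechSynthesis, setVoices) inside a useEffect, followed by pickVoice(voices, language) before speaking, would presumably match the flow in the video; returning undefined when nothing matches lets the caller skip speaking, as the captions do when no active voice exists.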
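The reworked translation prompt from the captions (a system message that asks the model to detect the language and translate into a dynamic language code, plus a user message carrying the spoken text) could look something like this. The function name buildTranslationMessages and the ChatMessage type are made up for this sketch, and the exact prompt wording in the video's demo code may differ.

```typescript
// Message shape assumed for this sketch; chat-completion style APIs take an
// array of role/content pairs like this.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Hypothetical helper: builds the two messages described in the captions.
// `language` is a language code such as "pt-BR"; `text` is the transcript.
function buildTranslationMessages(text: string, language: string): ChatMessage[] {
  const system = [
    "You will be provided with a sentence. Your tasks are to:",
    "- Detect the language of the sentence",
    `- Translate it into ${language}`,
    "Do not return anything other than the translated sentence.",
  ].join("\n");

  return [
    { role: "system", content: system },
    { role: "user", content: text },
  ];
}
```

In the route handler, the returned array would be passed as the messages option of the SDK's chat completions create call, and the translated text read from response.choices[0].message.content, as described in the captions. Keeping the prompt in a small pure function makes it easy to check without calling the API.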

Info

Channel: Colby Fayock

Views: 1,996

Keywords: translate app javascript, speech recognition javascript, speechrecognition, speechrecognition javascript, speechsynthesisutterance, speechsynthesis javascript, speechsynthesis, speechsynthesisutterance javascript, web speech api, web speech api javascript, web speech api react, web speech api tutorial, web speech api text to speech, web speech recognition api, openai api, openai api translator, openai translator api key, nextjs openai, next js openai api, nextjs translator app

Id: JFfCDvKiJqU

Length: 31min 0sec (1860 seconds)

Published: Thu Feb 29 2024