Implement Speech-To-Text on Windows with .NET MAUI

Captions
In this third and last video about implementing speech-to-text in your .NET MAUI application, there is only one platform left to implement: Windows. We're going to finalize our speech-to-text functionality in our .NET MAUI app. In the past videos, where we implemented iOS and Android, we've seen how we defined our shared API with an interface, and how we implement the platform-specific code through the power of .NET MAUI single project. I highly recommend that you watch those first, at least the very first one, because that's where we also implement our shared code. In this video we're going to finalize the implementation, so speech-to-text for all the platforms, by implementing it for Windows. The other videos should pop up on your screen while I'm talking here, or you can find them down below in the links and the comments. So be sure to check those out. And with that, let's just head over to Visual Studio and get to it.

So here we are in Visual Studio 2022. This is the sample app that I have been using; you can find the link down below. Like all my videos, this one has a GitHub repository attached with all the sample code. So if you just found that out now: go to my GitHub profile, follow me, and check out all those repositories. Let me walk you through the code that we've implemented so far. I have a very minimalistic design with a label where we're going to show our recognized text, a button to start the listen session, and a button to cancel the listen session. I've mentioned this in the other videos as well: the behavior on the different platforms is a little bit different.
You will find that a lot if you're going into cross-platform development. On iOS, it will go word by word: whenever a word is recognized, it will put that in the label. Android will basically wait for the full text, or for some little timeout. That's also the case for Windows: it will wait for a timeout until you've stopped speaking, and then it will suddenly pop up with a lot of text. That's what we're going to see here, so take that into account for your own application.

In our MainPage code-behind, we have some code already implemented: the listen command, the listen cancel command, and the recognition text. We're working with data binding here, so we've got all that set up. For the listen, we're going to talk to the ISpeechToText interface, because that shared interface is our contract of code. We can use it in our shared .NET MAUI code, but the implementation will each time be platform-specific code. That's really the power of .NET MAUI: when something is not surfaced to the .NET MAUI abstraction layer, you can reach into the platform-specific APIs and still write your own code in C# and .NET, and surface it. That's basically what we're doing here. Through this ISpeechToText interface, we're going to request the permissions first; for Windows there's not much to request, so we're just going to fake that a little bit. Then we call Listen on our speech-to-text service and specify that we want to listen for English; you can do other things here as well. Whenever there is progress, we capture that through this little Action here. Actually, you can see a problem right here, because we're only handling the progress for Android and iOS. So for Windows, nothing is going to happen. Let's just fix that right now by doing the same thing we do for iOS.
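The shared listen command described here could look roughly like the sketch below. This is not the exact repository code; names like `RecognitionText`, `speechToText`, and `tokenSource` are assumptions based on the description in the video.

```csharp
// Sketch of the shared listen command in the MainPage code-behind
// (names are assumptions, not the exact sample code).
async Task ListenAsync()
{
    var isAuthorized = await speechToText.RequestPermissions();
    if (!isAuthorized)
    {
        await DisplayAlert("Permission Error", "No microphone access", "OK");
        return;
    }

    try
    {
        RecognitionText = await speechToText.Listen(
            CultureInfo.GetCultureInfo("en-US"),
            new Progress<string>(partialText =>
            {
                // Append each recognized chunk; this is the check the video
                // extends with DevicePlatform.WinUI for Windows.
                if (DeviceInfo.Platform == DevicePlatform.Android ||
                    DeviceInfo.Platform == DevicePlatform.iOS ||
                    DeviceInfo.Platform == DevicePlatform.WinUI)
                {
                    RecognitionText += partialText + " ";
                }
            }),
            tokenSource.Token);
    }
    catch (Exception ex)
    {
        await DisplayAlert("Error", ex.Message, "OK");
    }
}
```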
So let's say `DeviceInfo.Platform == DevicePlatform.WinUI`. Now this will also do it for Windows, or for WinUI, right? WinUI is the actual platform that we're using here, but Windows is the operating system that we're running on. Now we can't forget it; I was bound to forget it later in this video. It will take the recognized text and append it to the text that we already have in here. It keeps doing that until we use this CancellationTokenSource to cancel the cancellation token. Whenever an exception happens, we display an alert, and whenever the permissions are not granted, we also display a little alert. And for the listen cancel, we just call Cancel on the token source. So that's all set up: that's our shared code, basically the consumption of the interface.

The interface definition itself is right here. It's nothing that will blow you away. It has RequestPermissions to request the permissions, because we need permissions to do speech recognition and to use the microphone, at least on Android and iOS. And then we have this Listen method to actually start the listening session. So we've got all that. In our Platforms folder, we already have the SpeechToTextImplementation for Android, and the same thing for iOS. Now we're going to do the same for Windows. So let's add a new class here and name it SpeechToTextImplementation. You'll notice that I now have the same class three times with the same name. How does that work? Well, that's all due to the power of .NET MAUI single project. Each folder in here, iOS, Android, Mac Catalyst, Windows, Tizen, is kind of like a little separate world. Everything outside of the Platforms folder is shared code.
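The shared contract could look something like the following; the exact shape may differ slightly from the repository, but it matches what's described here: one method for permissions, one to start listening.

```csharp
// Shared contract in the .NET MAUI project; each platform folder
// supplies its own SpeechToTextImplementation of this interface.
public interface ISpeechToText
{
    Task<bool> RequestPermissions();

    Task<string> Listen(
        CultureInfo culture,
        IProgress<string> recognitionResult,
        CancellationToken cancellationToken);
}
```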
If I made the same class three times in the shared code, that wouldn't work. But because I'm in each platform folder, they don't know about each other; the code in a platform folder only gets compiled when you run Android, or when you run Windows. That's why I can have the same class three times.

What I want to do is change this namespace, because if you look here in MauiProgram, that's where I register my service for dependency injection. You'll see that we register this SpeechToTextImplementation, and in the top left, if I switch to the Android context, you'll see that it starts to work. By changing the namespace to nothing platform-specific, so just Platforms, I can leave all this code intact. Because now I have a using for the Platforms namespace, not Windows or Android, and I have that same class name. So I don't need to change any code here, or do #if WINDOWS or #if IOS with a slightly different class name; no need for all that. I can just reuse the same name and the same namespace. That's really powerful as well.

So I've got that set up. Let's make this a public class, and it is of course going to implement ISpeechToText. Let's have IntelliSense help us: implement interface. Now we have these two methods to implement. This RequestPermissions reminds me that I still need to set the permissions for Windows. So let's go back to the Solution Explorer, go to the Package.appxmanifest, and if I double-click that, you get a graphical editor. You can also edit it in the XML editor, but here you can go to the Capabilities tab. We want the Microphone one, and I think the Internet (Client) one as well. So let's check those two.
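The namespace trick boils down to this; a sketch assuming the sample's root namespace is something like `SpeechToTextSample` (the actual name may differ in the repository).

```csharp
// Platforms/Windows/SpeechToTextImplementation.cs
// The namespace deliberately omits ".Windows", so all three platform
// implementations share the same fully qualified type name and the
// shared code can reference it without #if directives.
namespace SpeechToTextSample.Platforms;

public class SpeechToTextImplementation : ISpeechToText
{
    public Task<bool> RequestPermissions()
        => Task.FromResult(true); // capabilities live in Package.appxmanifest

    public Task<string> Listen(CultureInfo culture,
        IProgress<string> recognitionResult,
        CancellationToken cancellationToken)
        => throw new NotImplementedException(); // implemented later in the video
}
```

The Android and iOS folders contain a class with this exact same namespace and name, which works because each platform folder is only compiled for its own target.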
Let me check in the XML variant whether those are actually the right ones. Let's save this; you can also do right-click, Open With, and pick the XML text editor. Scroll down and you can see these capabilities here: internet client and microphone. Those are the ones I want. All right, that's all set up.

The RequestPermissions code is not very interesting, because on Windows we don't really have a way to check the permissions at runtime like we do on iOS. So there's not a lot to do here; we just `return Task.FromResult(true)`, basically, and that's it. Oh, I see I copied something wrong; what's going on here? I have this weird namespace with extra brackets here. We don't need that. All right, this is better; I don't know how that happened, but we're back to a building state. So RequestPermissions just returns true. You want to make sure the capabilities are there in the manifest, but otherwise we can't really check these permissions at runtime. Then we have to implement the listening.

Before we do that, there is a library we're going to use: System.Speech. It's by Microsoft and it helps you with these speech recognition kinds of things. So let's go to the Solution Explorer, right-click on the project, and do Manage NuGet Packages. I'm going to search for System.Speech. At the time of recording, the version number is 7.0.0, so it's linked to .NET 7. By the time you're watching this and .NET 8 has been released, it will probably have an 8.0 version. We just want to install that. When that's installed, we have all the APIs to do something with speech recognition. The thing that's a bit weird here is that we also have Windows.Media.SpeechRecognition, I think the namespace is.
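In the Package.appxmanifest XML, the two checkboxes from the graphical editor end up as entries like these (note that the microphone is a device capability while internet access is a regular capability):

```xml
<Capabilities>
  <Capability Name="internetClient" />
  <DeviceCapability Name="microphone" />
</Capabilities>
```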
So there are a couple of duplicate type names, which will mess up our code a little bit, but don't worry, I'm going to show you. Here we have this Listen with the culture that's going to be passed in, the IProgress with which we report the progress, and the cancellation token to cancel our listen session. Let's start implementing our listening. I'm going to add a couple of private fields first, and you can already see that Visual Studio adds the right namespaces so that it recognizes these types. We have the SpeechRecognitionEngine and the SpeechRecognizer here. I'm not sure if it picked the right one; we'll find out in a little bit whether it took SpeechRecognizer from System.Speech.Recognition or whether we need the one from Windows.Media.SpeechRecognition. They have different constructors, so it will start complaining, and then we'll see.

Now for our Listen, we're going to do something interesting: we're going to check if we actually have internet, and if we do, we do ListenOnline, else we do ListenOffline. If you watched the other videos, you know that the other platforms have kind of limited offline access, yes or no. But here it seems we have full offline access that you can implement, which is pretty cool. Let's go with the ListenOnline one first. I'm going to paste in a bit of code here, so don't be alarmed. And here you can already see the complaint: SpeechRecognizer doesn't contain a constructor that takes one argument. So there is something wrong here. Let's see if IntelliSense can fix this for us; it probably cannot.
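The online/offline dispatch can be sketched like this, using .NET MAUI's Connectivity API to check for internet access; `ListenOnline` and `ListenOffline` are the two methods built in the rest of the video.

```csharp
// Listen dispatches based on connectivity: the Windows.Media recognizer
// uses an online service, while System.Speech works fully offline.
public Task<string> Listen(CultureInfo culture,
    IProgress<string> recognitionResult,
    CancellationToken cancellationToken)
{
    if (Connectivity.Current.NetworkAccess == NetworkAccess.Internet)
        return ListenOnline(culture, recognitionResult, cancellationToken);

    return ListenOffline(culture, recognitionResult, cancellationToken);
}
```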
So we want to figure out which SpeechRecognizer to use, and I think we can do that with a using alias: `using SpeechRecognizer =` and then something like System.Speech.Recognition.SpeechRecognizer. We can really point the compiler to the type: hey, this is the one you need to use. And it still doesn't recognize it, so we probably need the other one. Let's go to Windows.Media.SpeechRecognition.SpeechRecognizer instead. See? Boom, now this is the right one. So we're just saying: whenever you see this type name, use this one. That's a little trick you can do in C# and Visual Studio. There's another type here that it doesn't recognize, so let's use IntelliSense again, and it suggests `using Windows.Media.SpeechRecognition`; we're going to import that, it adds it at the top, and now we've got all the types set up as well.

We're still missing a couple of things: the ListenOffline right here, and the StopRecording right here. Those are the things we still want to implement. Let's start with the ListenOffline. My formatting is a little bit messed up here, so let me fix that. There we go. The ListenOffline is kind of like the same as the ListenOnline. I didn't really take you through that one yet, I just realized. So, ListenOnline: we pass in the culture, the progress again, and the cancellation token; it's basically the same signature as the Listen method we have here. What we then do is create a new SpeechRecognizer with the language that we want to listen for, and then we create that session, right?
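The disambiguation trick is a plain C# using alias at the top of the file:

```csharp
// Both System.Speech and the Windows SDK define a type called
// SpeechRecognizer; an alias pins the bare name to the Windows.Media
// one for this file, while the namespace import brings in the rest.
using Windows.Media.SpeechRecognition;
using SpeechRecognizer = Windows.Media.SpeechRecognition.SpeechRecognizer;
```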
Whenever there is a result, we add it and report it through our IProgress right here. And on this ContinuousRecognitionSession we also have the Completed event, for whenever the session is completed. So one event is for intermediate results, and the other is for when the session is done. In that Completed handler we say: hey, is it a success? Okay, set the result. Or did the user cancel? Then we set it to canceled. Or else we throw an exception. Then we're going to start that recognition session with StartAsync, so it actually listens to the speech. And whenever the cancellation token gets canceled, we call StopRecording, which we still need to implement, and then we call TrySetCanceled on the task completion source whose task we're returning from here. So that's what's going on here.

Then the ListenOffline is kind of like the same thing. We have the same culture, the same IProgress for the reporting, and the cancellation token. But now we use the SpeechRecognitionEngine with the culture that we have here. We load some grammar; I don't really know what that does, but it's necessary for the speech recognition. We hook up the SpeechRecognized event, and we report the results through our recognition result as well. So we're doing kind of the same thing, but a little bit different depending on online or offline. Of course, if you only want offline, just implement that one; if you only want online, do the other one. This just shows you both.

So now we still need StopOfflineRecording and StopRecording. Let's get both of these in here as well; I'm just going to copy some code again and paste that in.
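The online path described here can be sketched roughly as follows. This is a reconstruction from the description, not the exact repository code; in particular, what gets set as the final result on success is an assumption, and the sample may handle that differently.

```csharp
// Sketch of ListenOnline using Windows.Media.SpeechRecognition.
// Assumes: using Windows.Globalization; and the SpeechRecognizer alias above.
private SpeechRecognizer? speechRecognizer;

private async Task<string> ListenOnline(CultureInfo culture,
    IProgress<string> recognitionResult,
    CancellationToken cancellationToken)
{
    speechRecognizer = new SpeechRecognizer(new Language(culture.IetfLanguageTag));
    await speechRecognizer.CompileConstraintsAsync();

    var taskResult = new TaskCompletionSource<string>();
    var recognizedText = new StringBuilder();

    // Intermediate results: report each recognized chunk through IProgress.
    speechRecognizer.ContinuousRecognitionSession.ResultGenerated += (s, e) =>
    {
        recognizedText.Append(e.Result.Text).Append(' ');
        recognitionResult?.Report(e.Result.Text);
    };

    // Session finished: success, user cancellation, or failure.
    speechRecognizer.ContinuousRecognitionSession.Completed += (s, e) =>
    {
        switch (e.Status)
        {
            case SpeechRecognitionResultStatus.Success:
                taskResult.TrySetResult(recognizedText.ToString());
                break;
            case SpeechRecognitionResultStatus.UserCanceled:
                taskResult.TrySetCanceled();
                break;
            default:
                taskResult.TrySetException(new Exception(e.Status.ToString()));
                break;
        }
    };

    await speechRecognizer.ContinuousRecognitionSession.StartAsync();

    // When the caller cancels, stop the session and cancel the task.
    await using (cancellationToken.Register(async () =>
    {
        await StopRecording();
        taskResult.TrySetCanceled();
    }))
    {
        return await taskResult.Task;
    }
}
```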
The StopRecording basically just takes the SpeechRecognizer object that we created for the online one right here, and says: that ContinuousRecognitionSession, we want to stop that, please. We wrap it in a try-catch, because it might already have been stopped or whatever; we don't really care, we're going to assume that it stops. For the StopOfflineRecording, we take the SpeechRecognitionEngine and cancel that one as well. Now there's one more thing we want to do, which is implement DisposeAsync, so we really clean up the managed and unmanaged resources. That's something you want to do whenever you are really done listening to the speech, so you don't have a memory leak. Check out the GitHub repository where all this code is set up, and you can see exactly how to use it.

So basically we've implemented all our code here. Well, almost, because if we go back to MauiProgram, you'll see that we have these compiler directives here for #if ANDROID and #if IOS. If I switch the context here to Windows, you see that this is grayed out; it's going to say, hey, I don't have any implementation for the speech-to-text. So let's add WINDOWS to this as well. You'll notice that Mac Catalyst is missing; I'll get to that at the end of the video, so stick around to see what's up with that. In my MainPage, I already set up the code for adding the actual result to our text. With this, I think we can start running our Windows application, which is what I'm going to do: deploy that. I have a little microphone hooked up, so whenever we start the listening session, it should start listening to the microphone right here.
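The stop and cleanup methods described here could look like this sketch; `speechRecognitionEngine` is the System.Speech field used by the offline path.

```csharp
// Stop the online session; it may already have stopped, which is fine.
private async Task StopRecording()
{
    try
    {
        if (speechRecognizer is not null)
            await speechRecognizer.ContinuousRecognitionSession.StopAsync();
    }
    catch
    {
        // Already stopped or never started; nothing to do.
    }
}

// Stop the offline session.
private void StopOfflineRecording()
{
    speechRecognitionEngine?.RecognizeAsyncCancel();
}

// Clean up managed and unmanaged resources when done listening.
public async ValueTask DisposeAsync()
{
    await StopRecording();
    speechRecognitionEngine?.Dispose();
    speechRecognizer?.Dispose();
}
```

And in MauiProgram, registration then becomes `#if ANDROID || IOS || WINDOWS` around `builder.Services.AddSingleton<ISpeechToText, SpeechToTextImplementation>();`, which is the change the video makes next.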
Everything that I say should get transcribed on the screen. But like I said, with Windows it's kind of interesting: it waits a little bit until I stop talking for the text to actually show up. So bear with me. "Hello. This is the speech to text on Windows." See, it actually works. As long as I keep talking, it doesn't really add the result; I don't know if that's an implementation detail or just how Windows works, but you can see that if I stop talking, then suddenly, boom, the result gets added. That's why I kind of pause in the middle of my sentences here. But this implements it on Windows too, which is pretty amazing. So now you know how to implement speech-to-text on iOS, Android, and Windows.

But what's up with Mac Catalyst? That wraps up our little series: you've now learned how to implement speech-to-text on Android, iOS, and Windows. There's one left: Mac Catalyst. Mac Catalyst is basically the same code as iOS, but it adds a little bit more code, and there is a funny thing with the compilation there. I'm not going to create a full video on that; just check out the GitHub repository for all the code, and you will figure out the bits that belong to Mac Catalyst. If you don't know how to do it, let me know in the comments and I'll be sure to answer your questions there, or still make a follow-up video if you really can't figure it out.

Windows speech-to-text was something that was requested under my videos as well. If you have another topic that you'd like to see, let me know down in the comments below and I'll be sure to get back to that, hopefully. Thank you again for watching one of my videos. Please click the like button so that it will spread to other people and we can grow this channel and be an even bigger family. Please subscribe to this channel if you haven't done so already; I would really appreciate that as well. Thank you so much.
In the meanwhile, there's a full playlist of .NET MAUI videos already on my channel; go check that out here. The other videos for the speech-to-text are right here, and I'll be seeing you in my next video.
Info
Channel: Gerald Versluis
Views: 2,643
Keywords: .net maui, net maui, dotnet maui tutorial, dotnet maui, .net maui tutorial, speech to text, speech-to-text, maui speech to text, c# maui, c# maui tutorial, speech to text app, stt, .net maui speech to text, net maui tutorial, speech-to-text ios, windows speech-to-text, speech-to-text windows, recognize speech windows, C# recognize speech, .net maui app, learn .net maui
Id: LLvHJXvuPHs
Length: 18min 38sec (1118 seconds)
Published: Mon Jan 02 2023