OpenAI Whisper - MultiLingual AI Speech Recognition Live App Tutorial

Captions
In this Python tutorial we're going to talk about a very newly released model from OpenAI. OpenAI seems to be living up to their name, because this model is completely open source. It's called Whisper, and it's an ASR (automatic speech recognition) model that was launched just a few hours ago. I'm going to show you how you can use Whisper in your Python code, walk you through a Colab demo (the code is taken from the Whisper repository), and finally we're going to build a web application using Gradio that takes recorded audio and does automatic speech recognition for you using Whisper. We're going to demo it in multiple languages: first English in a U.S. accent, second English in my accent, and third Tamil, which is my language and of course not English. I was so excited that I could not not make a video about this. From what I've seen it's quite impressive, and I wanted to show you how impressive it is. Let's get started with the video.

First, I'd like to quickly show you the demo of the Gradio application I've built. It's a very simple application: there's a title, a button that says "Record from microphone", and an output. Let's see how it works on my own audio. I'll say: "This is a quick demo. I wanted to just make this video. I could have never slept without making this video. I'm super impressed by this new model, Whisper, from OpenAI. It is an ASR, automatic speech recognition, and most of the time when I speak in Indian English things do not work out, and most software does not do a good transcription of what I speak. But let's see what happens."

We have the audio here, and because this is live, you can see it literally took a couple of seconds, and it says: "This is a quick demo. I just wanted to make a video. I could have never slipped without making this video. I'm super impressed by this new model Whisper from OpenAI and it is an ASR, automatic speech recognition, and you know, like most of the times when I speak in Indian English things do not work out and most softwares do not do a good transcription of what I speak, but let's see what happens." I could not believe it. I'm not kidding: I could not believe that I could find a non-paid solution, not an API-based one, just a model running on Google Colab, that can do a very good transcription of my Indian English. I'm super impressed and blown away.

With that, let's start from the start. A few hours back OpenAI made a brand new announcement. You can see it's dated 21st September (it's just past midnight on the 22nd for me in India): "Introducing Whisper. We've trained and are open-sourcing a neural net called Whisper that approaches human-level robustness and accuracy on English speech recognition." What impressed me about Whisper is that it's not just English: it can also work with different accents and different languages. You can see it's multilingual and multi-task, and you can read more in the paper and all the demos, but you already know how impressive it is. The good thing with this team is that they actually went ahead and shared everything on GitHub, including the model weights. You can see a blog, a paper, a model card, and a Google Colab demonstration, and that's something OpenAI has not done much in recent times, so kudos to the team. There's also a very well written example that just simply works. I mean, how many projects get launched these days where you've got a couple of lines of Python code you can just copy, paste, and run? That's another kudos for the team. And Whisper can't just do transcription: it can also detect what language is being spoken if you don't specify it. With that, kudos and credits to the team, and thanks for sharing the Colab notebook.
I'm going to take you into the Google Colab notebook and show you how you can run Whisper in your Python environment. Currently I'm running it on a GPU. I don't know the inference time on CPU, but on GPU this is pretty, pretty good. The Whisper I'm running is the base model; in the paper, and even in the model card, you can see the differences between the different models. There's a tiny model, a base model, a small model, a medium model, and a large model. Currently I'm running the base model, but you can experiment with the others. This is quite early days (I'm going to make a lot of videos on Whisper), so for now I'm sticking to a very simple example.

The first step, after you copy this notebook (the notebook link will be in the YouTube description; please do check out the GitHub, and please make sure you like and subscribe, since 90% of our viewers do not subscribe and it means a lot), is to install Whisper directly from GitHub. Once you install Whisper, the next thing you do is import whisper and load the base model. You can see it has downloaded a 139 MB model and loaded it. Next, I just wanted to validate which device it's using, and it looks like it auto-recognizes the environment: model.device shows cuda.

Now I want to show you the audios I'm going to transcribe. The first one is one of my very favorite moments or dialogues in a movie, from Batman Begins (2005, I think): "It's not who you are underneath, it's what you do that defines you."
The clip plays, and you might hear only a faint voice, but it's the female lead saying that line to Batman. The next is an audio clip from an Indian Tamil movie, in the language that I speak. It's also one of the popular movie dialogues in Tamil, from a movie called Pokkiri. Let's listen to it. This used to be a huge trend when I was in university, around that time. So at this point you've listened to a native-English audio and a Tamil audio, a totally different language where you don't have a lot of ASR solutions.

Now let's start with English. I want to blow your mind step by step, not all at once (though maybe I did that at the start, but that's okay). So first you need to load the audio: whisper.load_audio helps you do that. The next thing to keep in mind is that Whisper works only on clips of a maximum of 30 seconds, so you need to figure out some way to handle longer audio. OpenAI itself has a helper class that can take a bigger audio and chunk it into 30-second clips; that helper script is available in their Colab, but for now I'm not going in that direction. I'm trying to fit my audio within 30 seconds, and the function pad_or_trim will pad or trim the audio to exactly 30 seconds. That's what we're doing here.

Once we do that, we need to make a log-Mel spectrogram and move it to the model's device: the device is the GPU here, or it would be CPU had it been a CPU environment. Then we try to detect the language. This is not mandatory, but let's say we want to detect the language in which the content was spoken: you take the spectrogram and feed it in, it detects the language, and the highest-probability one is what you print. Finally, you decode it. I'm going with very basic options right now, which is why I'm not specifying anything: you pass the model, the spectrogram, and the options, take the result, and print the result's text. That's it; a very simple API. I'm super impressed once again, and you might hear me saying that again and again.

I'm going to run this code, and you can see it in real time; I'm not editing this video, so you can see literally how many seconds it takes. By the time I finished saying that, it's done: "It's not who you are underneath, it's what you do that defines you." If you listen to the audio, it's not a TV clip or an interview. Of course it's a movie, made with professional recording equipment, but it's not like a podcast: you can literally hear background noise, and there's music. Despite that, it made this transcription possible. So now we know it works super fine for English.

I read a couple of articles and comments saying the word error rate doesn't necessarily match the state of the art, primarily because OpenAI is not just trying to reach the highest accuracy possible; they're also exploring different accents, multiple languages, and multiple tasks. Having said that, I still consider it state of the art. Now let's copy and paste the Tamil one. If I play it, I can actually tell you what it says, and if you don't know the language, that's fine. Let's run it.
You can see it detected very quickly that the language is Tamil, and then it produced the Tamil transcription. This is gold. The reason it's gold is that subtitling is a huge industry involving a lot of human effort, and this level of accuracy is something I have not seen in a lot of models. Maybe my knowledge is limited, but even so, this is a really great option.

Now, if you ask me as a programmer: fine, this is good, your code works, you've tried it with different audios, so what do you want to do next? I would simply say I want to build a web application where the user can record live and have it transcribed, and I'm literally going to do that in front of you using Gradio, with very few lines of Python code and a small modification of this code. You can drop off at this point if you don't want the web application; you've already learned how to use Whisper in your Python code to do automatic speech recognition. But if you want to go one step ahead, let's build a web application.

First I'm installing Gradio; of course, pip install gradio. Next I'm loading Gradio: import gradio as gr. Then I'm literally copying this entire code and putting it inside a function. If you're somebody who's been watching our videos for a while, you know Gradio requires three things: an input, an output, and a function, and the function gets called when something happens. So transcribe is a function that takes an audio input and returns text: your input is audio, your output is text, and transcribe is the function. I could actually return more things, for example the detected language, but for now I'm sticking to the existing behavior. Inside the function I'm loading the audio and trimming it to 30 seconds.
Then it makes the spectrogram, moves it to the model's device, detects the language (which has no impact on the final text here), decodes, and returns the text. That's it. The entire application looks like this: a Gradio Interface (I'm using the older Interface API, not Blocks) with a title; the function transcribe, which we just created; an input that says I want audio, the source should be the microphone, and the type should be filepath; the output should be a textbox; and live=True, which means you don't have to press submit. Then you launch it.

Let me first try English, then Tamil, and then my broken Hindi. Let me clear the output, since there's nothing there, and record from the microphone. What I'll say is: "This is my first attempt at this. This is not necessarily my first attempt, right, because I already tried. Let's see how it's going to work out." Stopping it, you can see the audio, and the output reads: "This is my first attempt at this, not necessarily my first time, right?" You see a question mark. I mean, how many ASR solutions work like this? Maybe I'm obsessed with this, but I have the punctuation too. This is amazing.

Next, I was actually watching a video where two movie directors were discussing, and I'm going to see how it captures that. So I record from the microphone: "And there's a moment where he looks at him. It's almost like he's talking to somebody who's now settled down, except that there is a life kind of a thing, right? The last sequence, you're right, he's in a good space, they're probably hiding." Okay, stop it. The output: "And there is a moment where he looks at, he is almost like talking to everybody, they're probably hiding from, is there a good space." The movie director was also talking in Indian English, and this is pretty amazing.

Now what I'm going to do next is a demo in Tamil, my language, and I've intentionally mixed Tamil with some English, multiple languages. What I've realized from the result is that when you mix English with Tamil, it kind of gets mixed up. So next I say something only in Tamil. That did not go very well either; what I was trying to say is that I'm 1littlecoder, like a computer engineer, but I think it works mostly fine. Let me try Hindi now. That didn't work out either, though that's probably because my Hindi is quite bad; maybe a subscriber who knows Hindi well can try it out.

Overall, I'm super impressed. I'll share this entire notebook on GitHub; you can take whatever part of the code you want, use it anywhere, and deploy the application. I might even diligently use it to create subtitles and other things for my videos, because I believe the transcription is really good for my accent, and it also works fine for Tamil, like I said. The other thing is that you can translate from one language to another: one more thing I saw OpenAI mention is that you can do any language to English, so you have that option as well. The entire release, I think, is super impressive, and I'm definitely blown away by the way they've released it: very simple, very detailed, with a paper, a model card, a Colab notebook, shared model weights, and a demo that just simply works.
Big, big kudos and thanks to the OpenAI team. If you have any questions, let me know in the comment section. Otherwise, I hope you found this video helpful in learning how to use OpenAI Whisper and how to build a web application that can do live transcription for you in whatever language you speak (I don't have the full list of languages, but it covers many). See you in the next video. [Music]
Info
Channel: 1littlecoder
Views: 36,783
Id: ywIyc8l1K1Q
Length: 19min 3sec (1143 seconds)
Published: Wed Sep 21 2022