OpenAI Whisper Demo: Convert Speech to Text in Python

Captions
Hey YouTube! In this video I'm going to show you how you can quickly convert any audio into text using a free, open-source Python package called Whisper. I'll show how I installed it, walk through an example of running it, and compare it to an existing library.

Starting off, you'll probably want to go to the Whisper GitHub repository that we're looking at here, where they give installation instructions. One thing to keep in mind: if you pip install just the name "whisper", it's not going to install the right package; we want to install from this git repository. So take that pip install command and run it in the environment where you're running Python. They also mention that you need ffmpeg installed; there are instructions for doing that, but I already had it on my computer.

Now that I have Whisper installed, let's make some audio I can test it on. I'm going to say some idioms; idioms are usually hard for models to understand, and even though this is just speech-to-text, this will be kind of fun: "I would love to be on cloud nine as a one-trick pony that wouldn't hurt a fly. I'd be like a fish out of water and as fit as a fiddle to be under the weather." Let's save this off; I'm going to save it as a WAV.

They do have instructions for running this straight from the command line once it's installed, but I'm going to show you how to use the Python API, which they document here. It's really simple: we import whisper, then we load the model called "base", and then, using this model object, we run transcribe on our audio file. I named it idioms, so let's use the WAV version, and we want this call to return the result. Notice that when I ran this before, I got an error about CUDA's HalfTensor and FloatTensor. I was able to solve it, so that's something to keep in mind if it doesn't work for you: you might need to set fp16 to False. You can see after it's
run here that it detected the language as English. The result object has a few different fields in it, but what we want to get out of it is just the text. We can see the result looks good: "I would love to be on Cloud9 as a one-trick pony that wouldn't hurt a fly. I'd be like a fish out of water in this..." It did mess up a little bit here, transcribing "fish out of water in as fit as a fiddle"; maybe I didn't say it clearly enough. Another thing to know is that when you first run this, it has to download the base model, so you might see a progress bar go across while that downloads. The docs say that transcribe actually takes 30-second chunks of your audio file and runs predictions on them.

There's also another, lower-level approach you can take, where you create the model, then create the audio object and pad or trim it. That just makes sure the audio chunk is only 30 seconds long, or pads it out to 30 seconds, since that's the input length the model expects. Then it makes a log-Mel spectrogram, detects the language, and decodes, and here we can provide a lot more options if we want to. If I run this cell, I again get that error, which I can now fix by setting fp16 to False in the decoding options. This time it actually looks like it got everything correct: "I'd be like a fish out of water and as fit as a fiddle."

So that's it for Whisper. I just want to compare it to an existing kind of model, and a popular library for doing this is the SpeechRecognition library. The way we run it is: import it, create a recognizer object, and then load our audio file with it. After that, the recognizer object has a few different recognize methods, and we're going to use the Google one. Let's see what the result is. It looks like it didn't add any punctuation, and
the "Cloud9" came out a little differently: "I would love to be on Cloud9 as a one-trick pony that wouldn't hurt a fly." The one thing to keep in mind is that this is actually calling the Google Speech Recognition API, whereas with the Whisper library you have the model downloaded locally and it's yours to use. I also recommend you take a look at the Whisper paper, which was released with this code; it goes into detail about how the model was trained and the architecture it uses. Whisper does work on a bunch of different languages, though they say the performance varies by language. You can go to the GitHub repo, where they have a plot showing which languages it performs best on; smaller bars are better and larger bars mean worse performance. It's still pretty impressive how many languages this model works on.
Info
Channel: Rob Mulla
Views: 90,240
Keywords: Convert Audio to Text in Python with OpenAI Whisper, open ai whisper, python speech recognition, python convert recorded audio to text, audio to text converter, audio to text in python, speech to text using python, openai whisper, rob mulla, python audio to text converter, how to convert speech to text using python, audio to text converter app free, convert audio to text using python, convert voice to text using python, deep learning audio, python speech recognition module
Id: HbY51mVKrcE
Length: 4min 58sec (298 seconds)
Published: Fri Sep 23 2022