OpenAI's Whisper is Amazing!

Captions
While the machine learning world is still busy partying with diffusion models, there's a new Transformer model on the block, released in late September, called Whisper. Instead of being yet another Transformer text generation model, it's an automatic speech recognition, or speech-to-text, model. There are many ways we can check out this model: it's fully open sourced for inference, and you can just download the model and use it. The team behind Whisper is... I'm sorry, there's got to be some sort of technical difficulty going on here, there must be a mistake: does that say OpenAI? Yeah, okay, it's actually an open-ish model from OpenAI that you can just download and use. Incredible.

So let's see some quick examples of this model's performance, and then we'll jump into the paper, which actually has a surprising amount of general AI insight beyond just telling us more about the model and project itself. I encourage you to at least try the Hugging Face web app implementation if you have any sort of mic available. I think theirs is running on CPU, so inference time is about nine seconds or so when I tried it with a few seconds of audio. On a GPU you should be seeing inference times of maybe 500 milliseconds to one or two seconds for multiple seconds of audio. Also, the Hugging Face implementation uses the small model; there's a variety of model sizes, and each of those is going to have varying performance in terms of accuracy as well as inference speed. So anyway, let's check those out now.

Here's a simple example and implementation using a sound sample that I recorded: "So just how good is this OpenAI Whisper? There's a lot of background noise in this recording. I haven't turned off my air conditioning to transcribe this." We can use the sample code from the GitHub, and again we have to respect how exceptionally simple this all is to use: it was one line to install everything and then just a couple of lines of code, essentially, to get the transcription. On the first run you might find everything takes a little more than, say, 500 milliseconds, but this is mostly because the model has to load onto the GPU; all your subsequent inferences will be much faster.

For the purposes of testing these models, I've made a range of audio samples of the exact same sentence at varying quality, from that baseline decent-quality example with some basic background noise that you've heard already, all the way up to this one: "So just how good is this OpenAI Whisper? There's a lot of background noise..." The way I made these is just by playing the previous variant out of my speakers and recording it with my microphone each time, so each re-recording sounds worse and worse: the microphone was about three feet from the speakers, the air conditioner is still running, and everything just slowly degrades. I stopped at four re-recordings because I feel like at that point people, or at least I, would start getting some if not many of the words wrong, so I didn't see a point in continuing the process into complete gibberish.

From here, I wrote a quick script to iterate through each of the model sizes and then each of the audio recording qualities, to see both how inference times compare between the various model sizes and to get a very basic, general look at how well they perform from decent-quality to very subpar-quality data. Again, this is just one sentence, one sample, a tiny number of words compared to the entire set of possible words, so this is obviously a very basic test, just a general, vague idea.
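A minimal sketch of that kind of test script might look like the following. The file names are placeholders rather than the actual recordings from the video, and the timing is just a rough wall-clock measurement, but the two whisper calls in the inner loop (load_model and transcribe) are the same couple of lines the GitHub README shows; installation was a single pip command, and ffmpeg needs to be available for loading audio.

```python
import time
import whisper  # installed with the one-line pip command from github.com/openai/whisper

# Placeholder names for the increasingly degraded re-recordings of the same sentence.
samples = ["sample_1.wav", "sample_2.wav", "sample_3.wav", "sample_4.wav"]

for size in ["tiny", "base", "small", "medium", "large"]:
    model = whisper.load_model(size)  # downloads the checkpoint on first use
    for path in samples:
        start = time.time()
        result = model.transcribe(path)
        elapsed = time.time() - start
        print(f"{size:>6} | {path} | {elapsed:.2f}s | {result['text']}")
```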
I found the results to be shocking. On the worst-quality sample, both medium and large did a pretty good job, really just confusing the "I have" versus "I have not turned off my air conditioning" part, but the rest it actually got quite right. Definitely very impressive results, and all models perform inference faster than real time, at least on a 3090 GPU; obviously this will vary depending on where you run the models. Even that tiny model is pretty good if the quality of the audio is also pretty good, and both the tiny and base models only require about a gigabyte of memory to run, which is pretty awesome. Even the largest model available so far is only about 10 gigabytes of memory, which is very comfortable for today's day and age.

So let's check out the associated paper, which I find most interesting not just for what it says about the model; really, most of the gold nuggets in here are the more generic implications and findings for machine learning and AI models in general.

First off, the Whisper model is what they're calling weakly supervised, which just means it's trained on data that isn't perfect, gold-standard training data: it's data with imperfect audio recordings, background noise, and all of that. Notably, there is way more, like orders of magnitude more, weak data than gold-standard-quality data for audio training. And when it comes to speech to text, I would say the reality of using speech to text is not gold standard most of the time. I don't really know, but in all the cases I can think of where people use speech to text, most of the time it's a poor microphone in a phone, or maybe those smart speaker devices (I don't want to say the name and trigger people's), and those are all very imperfect settings. So you really do want speech to text to work in an imperfect setting; I think that's the most common scenario when you're using speech to text.

Now, we have gold-standard training data amounting to about a thousand hours, and just a few years ago the gold-standard training data was only about 50 hours, so as time goes on we do have a very quickly growing gold-standard dataset, and with enough time and grad students this could eventually be hundreds of thousands of hours, maybe even millions. That said, the amount of audio available online that isn't gold standard is always going to outnumber the gold-standard-quality audio. And beyond this, there's a question, as we'll get to in this paper: do we really want only gold-standard data? Is that the ideal for training, especially for speech to text?

So the question is: can we use this data to make a good audio-based model, and if so, how might it compare to the current state-of-the-art models, in this case state of the art for speech to text? There are many questions and implications from the findings here that are very comparable to findings in, say, some of the latest image models.

The first insight here is a current issue with fine-tuning models. It's especially true of, or at least very evident with, speech models versus some of these other model types, because with a speech model there is a very right and a very wrong answer, whereas with typical GPT-style generation models there are many options that could have been the case, and the same goes for image generation models: there are many images that would match a prompt you might put in. So it can be tough to validate those models beyond checking that they produced grammatically correct output and that sort of thing.
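That right-or-wrong property is also what makes speech to text easy to score: you can compare a model's output against a reference transcript with word error rate. This isn't from the video, just a small illustration using the jiwer package with made-up strings:

```python
from jiwer import wer  # pip install jiwer

reference  = "so just how good is this openai whisper"
hypothesis = "so just how good is this open ai whisper"

# Word error rate: (substitutions + insertions + deletions) / number of reference words.
print(f"WER: {wer(reference, hypothesis):.2%}")
```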
So, the first insight is that when you're fine-tuning these, let's say, speech-to-text models, you might start with an exceptionally robust model that can differentiate between phonemes and different words and characters and do that super well. With speech to text there are many words that sound exactly the same, but based on the context of the preceding words we know what that next word should be. So you might have a model that performs exceptionally well, and then you go and fine-tune it on, say, a new speaker, and very quickly that model loses a lot of its robustness and is highly likely to overfit to that new speaker.

One tactic I've personally seen to handle this situation comes from NVIDIA (I don't know if they're the ones who actually came up with it; it's just the first place I saw it): basically, you fine-tune on new data mixed in with original data. This might be original data that was held out specifically for this purpose, but even though you're trying to fine-tune to some new speaker, you're mixing in the old data as well, and this seems to help with the overfitting to the new speaker (there's a rough sketch of this idea a little further down).

OpenAI wonders here whether the gains seen with image models, where datasets are going far beyond highly curated and perfected datasets like ImageNet to something more like websites such as DeviantArt, can apply to audio models as well. And again, I think there's an important difference here between speech to text and text to speech. For text to speech, it'll be interesting to see whether someone can use this kind of weakly supervised model to go from text to speech, because there you really want the produced speech to sound amazing, as high quality as possible, whereas speech to text doesn't have that problem: you're just trying to take speech and get the correct words that were said. So I'm very curious to see what comes out of that. Maybe you primarily train the model on imperfect-sound-quality data and then fine-tune it on the gold standard, maybe mixed in; I don't know, I have no idea. But just take note that this direction of using imperfect data, I feel, can only go one way: with noisy speech-to-text data you can still train a good recognizer, but you can't really train on text-to-noisy-speech and then expect the output to sound good in practice.

Anyway, the training dataset itself for this model is 680,000 hours of audio, 117,000 hours of which cover 96 languages other than English, and 125,000 hours of which are other-language audio paired with English text translations. It's interesting and cool to fold in those capabilities, but it's also fairly interesting that they're finding that for large models, and I would argue also with enough data, there seem to be no drawbacks and only gains to doing things like supporting multiple languages, so transcribing multiple languages all in the same model, and doing different types of tasks, so multitasking, all in the same model. In this case the model transcribes many different languages, but it's also capable of translating many different languages.
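Before moving on, here is that data-mixing fine-tuning tactic from earlier as a rough sketch. This is not NVIDIA's or OpenAI's actual recipe, just the general shape of the idea in PyTorch, with placeholder tensors standing in for real audio features and transcripts:

```python
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Placeholder (features, label) pairs; a real setup would use audio features,
# tokenized transcripts, and a pretrained speech model instead.
new_speaker = TensorDataset(torch.randn(200, 80), torch.randint(0, 10, (200,)))
held_out_original = TensorDataset(torch.randn(800, 80), torch.randint(0, 10, (800,)))

# The idea: fine-tune on the new speaker *mixed with* held-out original data,
# rather than on the new speaker alone, to limit overfitting to the new speaker.
loader = DataLoader(ConcatDataset([new_speaker, held_out_original]),
                    batch_size=16, shuffle=True)

model = nn.Linear(80, 10)  # stand-in for the pretrained model being fine-tuned
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for features, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
```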
The historical convention was to always keep things as narrow as possible for success, but with larger models, and arguably larger datasets for those models, we seem to be finding that mixing tasks, and the training data to support those tasks, adds a sort of generalization and robustness. As we'll see later, models with mixed tasks and the data to support them tend to simply perform better holistically, even on those single narrow tasks, than models trained to do just those narrow tasks. So rather than confusing the model by having multiple tasks in multiple languages, a little more generalization seems to actually help: even on, say, transcribing English, the models that were trained purely to transcribe English did worse than the models that were trained to transcribe English, transcribe other languages, and also translate those other languages back to English. (I'm probably going to mix up "translate" and "transcribe" somewhere in this video; I apologize in advance.)

As for the dataset and the training data in general, OpenAI did not curate the datasets very much. The audio quality varies a ton, and the speakers obviously vary a ton. The main focus seems to have been on making sure transcript quality was good and trustworthy, so they tried to remove cases where they could detect that yet another speech-to-text algorithm had been used to generate the transcripts. Somewhat ironically, I think this is the type of issue we're going to face more and more as time goes on, especially, for example, with image-based models. With the influx of all these new AI-generated images, these models, if we're not careful, are very quickly going to be training on themselves, and as that happens, the quality and the diversity and, I guess, the creativity of those images is highly likely to devolve over time. So I think we're going to start seeing the rise of models that can detect things like: was this transcript generated by an AI, was this image generated by an AI, was this audio generated by an AI, and so on, if not for many other reasons then for continuing to train those AIs. It's quite the ironic problem to be having.

To actually train this model, the training data was broken into segments 30 seconds in length, so later, when you're doing inference, audio longer than 30 seconds gets chunked out and then stitched back together in sections. The model itself is an encoder-decoder Transformer, and the tokenizer for the text is a byte-level byte pair encoder, the same one we've been seeing for quite some time now. The single model does the entire job of detecting the language, detecting the target task (like translate or transcribe), and so on. To control this model behavior, the language detection and whether we want to translate or transcribe, they're actually just using text tags like "translate" or "transcribe", which is very interesting to see for this sort of task. I wonder if we'll start to see this addition of tagged task types (that's a hard one to say) in future GPT-style models. What they found here is that, yes, mixing in these different tasks actually added robustness and generalization, so will future GPT models have these intended task types beyond just "generate text"? I think that would be very interesting.
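Those language and task controls are exposed in the released package. The lower-level usage below follows what the repository README showed at release; the file name is a placeholder, and task="translate" asks the model to produce English text from non-English audio:

```python
import whisper

model = whisper.load_model("small")

# Load the audio and pad/trim it to the 30-second window the model was trained on.
audio = whisper.load_audio("spanish_sample.mp3")  # placeholder file name
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and move it to the model's device.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The same model detects the spoken language...
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# ...and the task is chosen with a token: "transcribe" or "translate" (into English).
options = whisper.DecodingOptions(task="translate")
result = whisper.decode(model, mel, options)
print(result.text)
```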
And if future GPT-style models did have task tags, what would those tasks be? It's also seemingly questionable whether the specific task matters, as long as the model does it correctly. Does it matter which ones you throw into the model, or do you just want some different tasks purely for generalization's sake? I have no idea. I look forward to seeing, if they do have task tags, what they would be and how that would work. Anyway, it's very interesting to think about as we go forward.

Here we have the pipeline illustration for the model, to give you a possibly better understanding of how the model functions, but again, I think the biggest takeaway is the successful use of these tokens that denote what we want the rest of the generation to be doing: do we want it to transcribe or translate, what language is it, all of that, simply because that is the sequence the model has always seen. The fact that it can even detect the language this way is very interesting. I'm not totally shocked that this works, but mixing especially translation into a model that primarily transcribes, and having it work this well, is very curious, as is tossing in 96 languages other than English.

The model sizes vary from 4 to 32 layers and from 39 million up to 1.5 billion parameters; these are the models I tested earlier in the quality and inference-time section.

Section 3.3 is yet another important insight into generalization for AI models. The point here is that there's generalization to the entire training dataset you used versus generalization to the actual task itself, and this is where the in-distribution and out-of-distribution terms come from, and arguably an entire new field of research into getting models to be more successful out of distribution on tasks as well. Models trained on ImageNet, for example, a highly curated, gold-standard type of dataset, may outperform humans on various classification tasks on images held out from training but still part of that ImageNet dataset. So a model might have superhuman performance in that setting, but if you then grab random images from the internet as true out-of-distribution samples, you find that, for the same classification task, humans tend to do better. This is an open question, but there seems to be this effect of fitting to the gold standard, or to some other attribute of the dataset, such that as soon as you go to a real-world application, the AI suddenly underperforms, and this seems to be plausibly why. What I think OpenAI, and probably others, are proposing here is that this may be due to models overfitting to the dataset's style or quality or other attributes, that it may really be as simple as being overfit to a sort of gold standard and simply not being able to handle noise and reality, whereas humans do much better because humans have a more truly generalized approach to solving the problem of, say, image classification, or even speech to text for that matter.

One thing I can also verify is on page 8, where OpenAI notes that NVIDIA's speech-to-text model outperforms Whisper on the gold-standard, high-quality datasets but then underperforms compared to Whisper on datasets with more noise. I definitely agree with that and have experienced it myself, and I think it's exemplified at the beginning of this video, which shows how much we can distort the audio while Whisper still performs at what I would call a superhuman level, or at minimum a super-Sentdex level: it did get some words wrong, but I definitely would have gotten those words wrong too if I didn't already know the transcript.
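One way to make that distortion test more controlled than re-recording through speakers would be to mix white noise into the clean recording at decreasing signal-to-noise ratios and watch where the transcription falls apart. This is just a sketch, not something from the video or the paper; the file name is a placeholder and it assumes a mono recording:

```python
import numpy as np
import soundfile as sf  # pip install soundfile
import whisper

model = whisper.load_model("medium")

audio, sr = sf.read("baseline.wav")  # placeholder: the clean mono recording
signal_power = np.mean(audio ** 2)

for snr_db in [20, 10, 5, 0]:
    # Scale white noise to hit the target signal-to-noise ratio, then mix it in.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noisy = audio + np.random.randn(*audio.shape) * np.sqrt(noise_power)
    sf.write(f"noisy_{snr_db}db.wav", noisy, sr)

    result = model.transcribe(f"noisy_{snr_db}db.wav")
    print(f"{snr_db:>2} dB SNR: {result['text']}")
```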
The obvious question is what might happen if we continue to increase the model size. The largest Whisper is 1.5 billion parameters, versus something like GPT-3 at 175 billion parameters. I think the main concern at the moment is that too large a model would be able to overfit the data that's available. We can already see that English speech recognition performance doesn't really vary much from 769 million parameters to 1.5 billion, so it isn't really looking like simply increasing the model size is going to change much there. This would begin to suggest, at least to me, that a larger model might improve multilingual speech recognition and probably translation, but it seems like it will mainly come down to dataset size, which I think we can already see in Table 6. This is also curious because there's actually way less data for those tasks, and yet they seem to keep improving as the model size increases.

And again I start to wonder, getting back to these tags: if you wanted to do text to speech, could you use this weakly supervised dataset to get a good baseline audio model, and then, for producing audio, have something like a "clean audio" tag where only gold-standard audio sits under that tag? Maybe you just end up with an audio model that can only do speech to text; I'm not sure you can go the other way, I don't think (I'm thinking these things through live). But I think you could possibly have at least a text-to-speech model that could still do the translation, so maybe input English text and output, say, Spanish audio. I think you could probably still have that, and then to keep the audio clean, maybe you just have something like a "clean audio" tag. So the question is whether that would produce a better model. Based on what I'm reading in this paper so far, I think the logic holds that you could still train that baseline model, and then for the task of "clean audio only, please" you could use a much smaller subset of data, and that model should probably perform better. But I am curious whether the audio would actually sound clean. Anyway, I don't know.

Finally, OpenAI does go a step further with questions about model size and multi-task performance, where they note that for small models the incorporation of multiple tasks and multiple languages does seem to cause degradation compared to English-only. So again, the smaller models that are trained to just transcribe English do benefit from having their task be very narrow: only training on English transcriptions, only doing English transcriptions, the smaller models tend to do well that way. But interestingly, and most importantly, in the larger experiments the joint models, the ones that do transcription and translation and all that, outperform the English-only models. This is quite the insight, and I think an overall shift and trend we're going to be seeing for years to come in data quality, quantity, model size, and model scope, sometimes model scope purely for the purpose of generalization and nothing more.
In the end, I think we're going to wind up with far more out-of-distribution, generalized AI from this, and also more powerful narrow AI, thanks to this seemingly strange behavior of models somehow doing better when you just toss more tasks and more types of data at them. And again, I'm curious to hear what you think if you've gone through this paper and really thought about it, or if you know a bit more about mixing in tasks like this. What do you think about text to speech that uses this weakly supervised data to get a general text-to-speech model, and then maybe one of those task tags? In this case, the thing they verify and validate is how the English translation works; they did not, unless I need to go back and check, compare a model that did just, say, translation. Did a translation-only model also do better, or would it have been better off as a narrow AI given that there's way less data for it? That I don't know, and I wonder, because in my example of doing text to speech with a gold-standard tag, would that actually produce clean audio? That's super important for text to speech at least. Yeah, I don't know. Interesting questions.

Anyway, thank you to OpenAI for sharing your model and your insights with us; it's very cool to see from OpenAI, and I'll at least be using Whisper, I'm pretty sure, for all of my current speech-to-text needs, and possibly translation. There are a lot of really cool apps that can be made from these models, and again, they're pretty powerful yet pretty lightweight. So yeah, really cool. That's all for now; I will see you in another video.
Info
Channel: sentdex
Views: 467,741
Keywords: python, programming, artificial intelligence
Id: OCBZtgQGt1I
Length: 25min 51sec (1551 seconds)
Published: Thu Oct 06 2022