Fine-tuning Whisper for Speech Transcription

Video Statistics and Information

Captions
I'm going to explain how speech-to-text models work, and then I'll show you how to fine-tune a speech-to-text model like OpenAI's Whisper so that you can add new words it isn't familiar with, or even familiarize it with new accents or with languages that are lesser known. So this video is all about fine-tuning Whisper for speech transcription.

We'll start off with what Whisper is: a speech-to-text model by OpenAI. I'll give you a very quick demo with a tool that I think you'll actually find useful if you want to do transcriptions yourself, for example of YouTube videos. Then I'll talk about use cases for why you might want to fine-tune these models further: maybe you want to add new words to the vocabulary, as I'll show in this video, or maybe you want to fine-tune on a difficult accent or a difficult language. I'll have a little theory section explaining how transcription models work; they actually borrow a lot from, and are very similar to, the GPT-type language models I talk about in other videos. I'll then talk quickly about how we prepare for fine-tuning, in other words how to prepare a dataset to fine-tune a speech-to-text model. I'll give an overview of a repo that you can purchase access to, which gives you all the scripts you need; there are also many free scripts available online, and I'm going to put those in the description too. Then I'll go through a full worked example, all the way from creating sound snippets to fine-tune on, through training a whisper-small model, through evaluating the performance after fine-tuning, and I'll finish off with a few pro tips.

Whisper is a speech-to-text model by OpenAI. It's available under an Apache 2.0 license, which means it can be freely used for commercial or research purposes. We're going to focus on the whisper-small model. There are a few Whisper models available; the sizes go from a tiny model with 39 million parameters all the way up to a large model with 1.5 billion parameters. If you're familiar with language models, these are a lot smaller: the smallest Llama 2 model from Facebook is 7 billion parameters, while the largest speech-to-text model here is 1.5 billion. So you can already see that, as a task, converting speech to text is quite a bit easier in terms of complexity than trying to predict the next token in a sequence as we do with GPTs. Something I've found is that the small model actually performs very well, so in a lot of cases you can do transcription with a model that is about a quarter of a billion parameters and get very good performance.

Let's take a very quick look at the files in the model repo. The model itself should be the largest file; here it is, model.safetensors, just under a gigabyte in size at 967 megabytes. And if we look at the configuration file, it looks somewhat like a language model: it still has attention heads, layers, dropout. It has a lot of the features a language model has; in fact, in many ways it is a language model with an audio portion connected onto it that feeds in audio information, which is then used to predict the tokens. We'll go into that a little more later in the video.
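As a quick aside, you can inspect those configuration details programmatically. This is a minimal sketch (not from the video) using the Hugging Face transformers library; the attribute names are standard, but treat the printout as indicative rather than quoted from the config.

```python
from transformers import WhisperConfig

# Load the configuration for the whisper-small checkpoint
config = WhisperConfig.from_pretrained("openai/whisper-small")

# A few of the language-model-like settings mentioned above
print(config.d_model)                                              # hidden size
print(config.encoder_layers, config.decoder_layers)                # layer counts
print(config.encoder_attention_heads, config.decoder_attention_heads)
print(config.dropout)
```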
Now, just to get you into this, I want to show you an example straight away. This is a notebook that I'll link below, and you can check it out yourself for free. I actually use this notebook when I want to transcribe YouTube videos, which is what I'm going to show you today. Here I'm going to install whisper, which is an OpenAI library, and I'm going to install a quick package that allows me to grab audio from YouTube.

Over here I have a YouTube video. This is a video I made some time ago about the fine-tuning repository; it's a short video, so it's suitable for a quick demo. You can turn on captions in YouTube — these are the auto-captions that are generated on a word-by-word basis — and one drawback I find is that they aren't broken into sentences, so these automatic YouTube captions are not the easiest to read. I typically use Whisper to generate captions instead, and that's what we're going to do right here. I'm going to copy the short code of the YouTube video, head back over to the notebook, and paste in the address — or just that short code — right here. That's sufficient for it to download the audio for that YouTube video and then run Whisper using the small model, with English as the language. It should extract fairly quickly, because audio files tend to be fairly small in size; it just saves the audio locally to my folder structure, and then Whisper starts to transcribe. We're going to generate a transcript file in VTT, a typical format that combines timestamps with text.

And here you go — we're already generating the transcript, and you can see it's matching up with the content of this Advanced Fine-tuning video, and it's pretty quick in terms of speed. In just a few more seconds we'll have the full transcript complete. Notice as well that it completes the sentences nicely, adding in commas, so I find this much easier to read on a YouTube video than the automatic captions. When it's done, it will have created a variety of files, including a text file if you want to look at the text of that video — that's also helpful if you want to use a transcript for machine learning — and the VTT file, which is the one of most use: you can actually upload it to YouTube, since it's a caption format, and you can see that it gives timestamps in addition to the text. So this is really a beautiful model. It was released, I think, in 2022, so it's been out a while, but I find the quality is quite good.

Something you'll notice straight away in this transcript is that the model is not familiar with certain words. For example, the spelling of Trelis, at least for my website, is with one L, and here it's got two L's — but of course it doesn't know that, because that wasn't in the training data. So it might be nice if I fine-tuned the model on some audio containing "Trelis" and gave it a transcript with just one L; that would be a way to fine-tune it. It's also not going to get words around certain models, like Llama 70B: it just doesn't appreciate that Llama 70B exists as a model, so it would be nice to fine-tune this model so that it knows some of the more modern terms. You can see here that instead of "safetensors" it has "safe handers", which is the best it could do with my Irish accent. It does get "PyTorch" right, though.
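For reference, here is a rough sketch of the kind of notebook workflow just described: grabbing the audio from a YouTube video and transcribing it with whisper-small. The video doesn't name the exact download package, so yt-dlp and the file names here are assumptions.

```python
import whisper                 # pip install openai-whisper
from yt_dlp import YoutubeDL   # pip install yt-dlp (assumed downloader; the notebook may use another)

video_id = "XXXXXXXXXXX"  # placeholder for the short code copied from the YouTube URL

# Download just the audio track and convert it to mp3 (requires ffmpeg)
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "audio.%(ext)s",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
}
with YoutubeDL(ydl_opts) as ydl:
    ydl.download([f"https://www.youtube.com/watch?v={video_id}"])

# Transcribe with the small model in English
model = whisper.load_model("small")
result = model.transcribe("audio.mp3", language="en")

print(result["text"])             # full transcript text
for seg in result["segments"]:    # timestamped segments, usable for building a VTT file
    print(seg["start"], seg["end"], seg["text"])
```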
What I'm going to show you later on is an example of me fine-tuning on a series of terms — language-model terms like Mixtral 8x7B, or Mistral, or Llama 2 — all terms that would not be familiar to the model. We'll see how, by doing some simple tuning, we're able to accustom the model to a wider training set, let's say, so that it can generate more accurate captions for our application.

Now that you can see a little of what Whisper does, let's move on and talk about some of the use cases, before describing the technical approach to getting from audio to text, and then moving on to the full example going all the way from some audio for fine-tuning through to a fine-tuned model. There are a few use cases — I'm sure there are more — but the ones that come to mind for me are: first, adding new vocabulary that the model isn't familiar with, so new words or phrases. You can also fine-tune to improve on an accent; maybe a particular accent is difficult for certain models to understand — I actually think most models are fine with my accent — but if there is difficulty, you can always fine-tune on some pieces of audio of that accent. Likewise, there are languages that are not commonly supported; for example, it doesn't seem like there are many models for transcribing the Irish language, so there's an opportunity to fine-tune a model so that it's able to transcribe better in Irish.

Now let's move on to how transcription works. What I'll talk about here is how we convert sound, from some kind of audio file, into text — the captions I just showed in the demo — and there are roughly four steps that I'll break this into.

First off, sound is a vibration. Think of the membrane of a drum vibrating as it hears a sound: you can imagine measuring where the surface of that drum is every millisecond, and measuring how far it is from its middle position. If you record that displacement over time at a certain sampling frequency, you see the typical picture of sound, which is a kind of streaky line running along the x-axis. Maybe I can show you a diagram from this notebook: if you have the surface of a drum, a membrane that's vibrating due to sound, and you record its position over time at tiny increments, this is the kind of graph you get. Here's a very large displacement, then smaller displacements — it's a measure of the amplitude of how that membrane is moving. The more often you sample it — the sampling frequency — the better you can represent the movement of that drum. Typically for Whisper models the sampling is at 16,000 Hz; that's 16,000 recordings of that amplitude per second. So this is the first step to think about in transcription: we have a sound, and it's represented by a graph of amplitude over time.

The next step is to convert from that amplitude graph into frequencies, because underlying what looks like a mess there are actually different frequencies that are all overlapping. The best graph I have to show this is right here in this Fourier transform notebook. Here you can see that same kind of amplitude graph, and underlying all these amplitudes there are various frequencies.
We might have low frequencies and higher frequencies, and when you combine all of those frequencies together at the right amplitudes, you end up back at this kind of messy graph. But the key insight is that this messy graph of our voice can be broken into a series of distinct frequencies, and the technique used to do that is called the Fourier transform. By doing the Fourier transform you can convert the signal into individual frequencies. So instead of a messy graph like this, you can convert it into a graph like this, where the x-axis is frequency, and you see that the signal is actually made up of more distinct frequencies. Now, this example isn't a human voice — it's a lot cleaner; a human voice has many, many more frequencies blended in — but it shows how you can have three distinct frequencies and, by doing a Fourier transform on the raw sound, pull them out and identify what those frequencies are. All of this is well known, and we haven't gotten to language models or even to neural networks yet: we've got our sound, recorded as an amplitude, and it's then converted into frequencies.

There's one more twist, which is that the human ear doesn't hear all frequencies the same, and furthermore it doesn't hear loudness the same — at least not in a linear way. The human ear is more logarithmic: as frequencies increase, it's harder for us to distinguish the same absolute difference between two frequencies. We might hear a difference of 10 Hz between 100 and 110 Hz, but it would be very hard to hear a difference between 1,000 and 1,010 Hz. So the human ear is not linear in its frequency response, and it's not linear in loudness either. Because we listen to human voices, and that's what we're trying to transcribe, it makes sense for our model to use the data as the human ear experiences it, in a more logarithmic pattern. The transformation for this is called the Mel scale: rather than using a direct Fourier transform that just shows the frequencies, we show the frequencies as though it's a human ear that's listening. In this article there's another piece showing the relationship between frequency and pitch in mels, which is a measure of how the human ear responds.

Just to recap: we get the amplitudes of the sound versus time, we extract what the underlying frequencies are, and then we adjust the frequencies and the loudness so they're represented the way the human ear would interpret them. This is called a Mel spectrogram, and you can see one drawn out here: it's a snapshot of sound versus time, here are the frequencies, and here is the amplitude in decibels, which is a logarithmic scale. So at a given time you can see which frequencies are being heard at which decibel level. It's this representation of a snippet of sound that's going to be used as the input to our model: a series of frequencies in hertz and amplitudes in decibels. In the same way that a GPT model for recursive language prediction uses tokens, when we're doing speech-to-text those tokens won't represent subwords; they're going to represent, essentially, decibel levels at the key frequencies of the sound within a given time slot.
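To make that concrete, here is a small sketch of computing a log-Mel spectrogram in Python with librosa (not the exact code used in the video). The window, hop, and Mel-bin settings are the Whisper-style values (25 ms windows, 10 ms hop, 80 Mel bins); the file name is a placeholder.

```python
import librosa
import numpy as np

# Load the recording at Whisper's expected 16 kHz sampling rate
y, sr = librosa.load("audio.mp3", sr=16000)

# Mel spectrogram: ~25 ms windows (n_fft=400), ~10 ms hop (hop_length=160), 80 Mel bins
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)

# Convert power to decibels, i.e. the logarithmic loudness scale discussed above
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (80 Mel frequency bins, number of time frames)
```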
To say that once more: if you're thinking about a Llama 2 model for language, you might have a context length of, let's say, 512, so you have 512 subwords — different pieces of words like "a" and "the" — being fed in one by one. Whereas if you look at a model like Whisper, it's taking in tiny fractions of a second instead of subwords, and within each of those slices it's representing the sound as a combination of frequencies at certain decibel levels.

So I've explained how the sound that we hear is characterized: with frequencies and decibels, according to a Mel spectrogram. The next thing is that we need to input these into a Transformer, and the good news is that much of what we see is very similar to a GPT transformer doing causal inference, where we're predicting the next token. This is a really great diagram explaining how it works — I'm actually going to zoom out for once because I need to see the full diagram. On the right-hand side we have what is effectively a GPT: a causal Transformer where we're trying to predict the next token. You might have the tokens "the" and "quick" being input, and given "the" and "quick" the prediction will be "brown"; then once we have "brown", we can predict the following token, which I assume is "fox". So this is recursive token prediction with words, and it's the very same thing we do with transcription, except that we additionally put in information from the sound.

The sound, as I said, is represented by a log-Mel spectrogram — a representation of frequencies and decibels — and we have that representation for each tiny time step. I briefly mentioned that Whisper typically operates with a 30-second total window of input, and that's segmented into tiny increments that actually overlap, so that we don't cut off any key signals halfway. You can think of these as subwords, but they're really just audio representations. This log-Mel spectrogram data — frequencies and amplitudes — goes into what we call an encoder block, a neural network that processes it through many layers and then feeds that information into the decoder, which is our language model. So the language model has information from the input sound for that 30-second block, but it also has tokens coming in. The key thing is that on the very first pass there won't be any tokens coming in, because we won't have predicted the first token yet; to predict the very first token, it relies on the sound alone. But say the first word of the transcript is "the" and the next word is "quick": in that case there's the audio, plus "the" and "quick", as information that can be used to predict the next word, "brown".

There's a little more subtlety here, which is that when we train these models, we often train using data that has a prefix for the language and a prefix for the task. You can see here a token indicating to the model that the language is English, and here a token indicating that this is a transcription task. So when you run this model on an audio segment, there actually are some tokens you put in, but those tokens say what the task is and what the language is. You can also leave out the language token and get the model to predict it; the stronger models will even predict what language it is in some cases.
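Here is a small sketch of how those language and task prefix tokens appear in practice with the Hugging Face transformers API; this is illustrative rather than the video's own code, and the audio array is a placeholder you'd replace with real 16 kHz samples.

```python
import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Placeholder: one second of silence at 16 kHz; replace with real audio samples
audio_array = np.zeros(16000, dtype=np.float32)
input_features = processor(audio_array, sampling_rate=16000, return_tensors="pt").input_features

# Pin the first decoder tokens to <|en|> and <|transcribe|>; omit this to let the model guess the language
forced_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_ids)

print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```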
So you could basically start with very little other than the audio coming in as information, but I think more commonly — or more robustly — you would feed in a token to say what the language is and a token to say what the task is, and then you get the model to predict the first, the second, and a few more tokens for that snippet of audio.

That brings the whole process together. It starts off with a sound: you have a vibration that's recorded, it's converted to frequencies — importantly, frequencies in a human sense — and then those frequencies, the log-Mel spectrogram, are input into the model. They're sent in through an encoder, which then provides extra information to the language decoder, and that decoder is very similar to the GPT models I've talked about in many videos before.

With that overview of how a model like Whisper works, let's dive into the fine-tuning. The good news is that a lot of the fine-tuning is going to be quite similar to what I've already gone through for GPT models used for next-token prediction. As usual, we need a dataset if we're going to fine-tune, and the dataset needs both sound — which is what we're transcribing from — and high-quality text, a high-quality transcription that corresponds to that sound. More specifically, we need some kind of MP3 or WAV file, and we need a transcript, which we would typically save in VTT format. And for these sound-and-text pairs, we need a training set, and we also need a validation set to check performance as the training progresses.

I'm going to move over now to the Advanced Transcription repo. This is a paid repo, but as I said, you can try out some of the free Colab notebooks if you prefer, which I'll put in the description. As a very quick overview, this repo will let you prepare a dataset and push it to Hugging Face with a train and a validation split; it will take an audio file and automatically split it into 30-second chunks, with the corresponding transcript chunk for each of those sound chunks. Once you have that data, there's a whisper fine-tuning script, which I'll run through step by step in this video, to fine-tune a model and then evaluate the performance.
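For reference, a VTT transcript like the ones mentioned above simply pairs timestamps with text. Here is an illustrative snippet (the cue text and times are made up, not taken from the video's files):

```
WEBVTT

00:00:00.000 --> 00:00:06.500
Mixtral 8x7B is a mixture of experts model.

00:00:06.500 --> 00:00:12.000
Solar 10.7B is a Mistral model.
```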
What I'm going to do is clone this over to VS Code — I have the Advanced Transcription repo here — and set up some data that we're going to train on. Remember, we need some audio and we need a transcript. I'm going to show you a handy way to create some audio if you want to fine-tune for an accent — this could be fine-tuning for an Irish accent — or fine-tune for some specific words. What I want to do is get the Whisper model more accurate on some language-model terms, so I've gone through some different terms I found online, some technical model names — I'm sure you've heard of some of these, like Phi-2, or Mixtral 8x7B, or maybe OpenChat. I've put a whole series of words here, and I'm simply going to read them out into an audio file; that's going to create my training set. Then I'm going to transcribe that audio file, and that will give me the transcript that I need. I've also taken that text file and mixed the words up into a different order, and I'll read these out, with some comments in between, to create a validation set. By recording that and then transcribing it, I'll have created my validation sound and my validation transcript.

Actually, I've already done that, and I've got the MP3 here — I'm not sure if this will be audible: "SPIN, self-play fine-tuning that improves LLMs… Trixie, it's a form of fast in…" — so I have some sound there, and I've also recorded this two-minute validation snippet: "MMLU is a means of testing performance." So I have the sound ready, and next I'm going to run the whisper fine-tuning notebook, because this notebook will take in the sound I've just played — the train and the validation — and generate a transcript for that sound. It's easier if I just show you how that works.

You can run the whisper fine-tuning locally — the model is small, so you could probably find ways to run it on your laptop — but it's quite easy to run in a free Colab notebook, and that's the approach I'm going to take here. So basically, upload this to Google Colab, and you want to make sure you connect to a GPU for fast inference; if you're not connected to a GPU, just go to "Change runtime type" and select a T4 GPU. It looks like we're up and running and connected, so I'll scroll all the way to the top of the script and get started with some installation. Here we're setting up some handling of files, and we're setting up some of the Hugging Face libraries that will help us handle the datasets and the fine-tuning. As often in these tutorials, I'm going to connect to the Hugging Face Hub so that I can push and pull models, including from private repos; I've just clicked on this button so I can go over and grab a token for authorization.

Once authorized, we move down here and select the base model for training, which is going to be whisper-small. You could pick a larger model if you want to improve quality, but you'll see the quality is quite good. I'll set the language, and the task is set to transcribe. Let me just increase my screen size a little. Then I'm going to define, for later on, the paths for pushing the fine-tuned model: I've got the Trelis org — set that to your own org or username — and I've defined a repo called llm-lingo-adapters, and then llm-lingo, and then the repo, which is a combination of the org and the adapter name.

Once that's done, we'll move on and generate a transcript for the audio I've just created. I'd like to show you this live. What we're doing is creating a pipeline consisting of the Whisper model — the base model — and we're going to ask it to transcribe in 30-second chunks and to use CUDA, the NVIDIA GPU, if possible. Once we have the pipeline set up, we set up a function called process_audio_and_create_vtt; basically, it runs the pipeline on the audio file we specify and returns the text along with the timestamps, and with the text and timestamps we can create a VTT file just like this one. I'm going to go up here and log in, define this, run these cells, and next I'll run this function on the train MP3 and also on the validation MP3.
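A minimal sketch of what such a transcription pipeline and VTT-writing helper might look like is below. This is not the repo's actual code; the function name matches the one mentioned in the video, but the implementation details (timestamp formatting, chunk handling) are assumptions.

```python
import torch
from transformers import pipeline

# Whisper base model, transcribing in 30-second chunks, on GPU if available
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,
    device=0 if torch.cuda.is_available() else -1,
)

def format_timestamp(seconds: float) -> str:
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

def process_audio_and_create_vtt(audio_path: str, vtt_path: str) -> None:
    # return_timestamps=True gives chunk-level (start, end) times alongside the text
    result = asr(audio_path, return_timestamps=True)
    with open(vtt_path, "w") as f:
        f.write("WEBVTT\n\n")
        for chunk in result["chunks"]:
            start, end = chunk["timestamp"]
            end = end if end is not None else start  # the final chunk can lack an end time
            f.write(f"{format_timestamp(start)} --> {format_timestamp(end)}\n")
            f.write(chunk["text"].strip() + "\n\n")

process_audio_and_create_vtt("train.mp3", "train.vtt")
process_audio_and_create_vtt("validation.mp3", "validation.vtt")
```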
First, I need to upload those two files, so I'll just upload them here locally. You can see the files are now available locally and my pipeline is loaded, so I'm ready to run this cell, which will process the training audio, and then I'll run the same to process the validation. Now, I'm running this with the base model, so what we should expect is that it's not going to get the new words correct — that's what we're expecting. When that cell has run, we should see the VTT files appear here in the side panel. Let's take a look at the train file: here we have "spin, self-play fine-tuning that improves LLMs" — you can see "Trixie" is incorrect, Phi-2 is incorrect, and Mixtral is incorrect, and it will be the same with the validation file. The reason I've gone through this is that it's much faster to generate the VTT and then correct it than to try to write the transcript from scratch, so I recommend making the transcripts like this and then going in, reading through, and manually correcting them.

So, say, I'll type in "Phi-2" here — "an alternative to flash attention" — I'll just make a small correction to this, and Mixtral I'll fix up here: that should be "Mixtral 8x7B", and it's a mixture-of-experts model. Solar 10.7B — that's actually Mistral, not Mixtral — and it's giving me some annoying autocorrects here... there we go, I've fixed it: "Solar is a Mistral model", yep, that's fine; "OpenChat is a fine-tune of Mistral". You get the picture: I've taken the train.vtt that I generated, done the same with the validation.vtt, and that's allowed me to create a clean set of VTT files that will be paired with the audio snippets in just a moment in order to create our dataset.

It is possible to use a GPT to do the correction of your transcript. For example, you can say: "I want your help in correcting a VTT file transcript. I'll give a list of words that the ASR (automatic speech recognition) was not familiar with. Respond in a code pen with the contents of the updated VTT file." Then you give it the raw VTT from the base model, and the list of keywords I showed you right at the start — the keywords we're trying to fine-tune for. This isn't necessarily fully robust — it can mess up the timestamps, even with GPT-4 — so I recommend either writing a more elaborate, robust piece of code if you want to automate this, or just manually correcting it yourself.

OK, so we have these base files, and you can assume that you've now gone ahead and corrected the VTT files, as I have done. Here's something I prepared earlier: a fully corrected file. You can see that all the terms are correct, because I read through it — Microsoft Phi-2, GPT-4, Claude Instant 1 — all of the terms here should be in order. Once you have your VTT files and your MP3 files, you're ready for the next step. By the way, it's fine if you record in, say, m4a or wav — it doesn't matter; you can always convert the file online using a free converter, or adapt the scripts accordingly.
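If you'd rather convert recordings locally than with an online converter, one quick option (my suggestion, not from the video) is pydub, which wraps ffmpeg:

```python
from pydub import AudioSegment  # pip install pydub; requires ffmpeg on the system

# Convert an m4a (or wav) recording to mp3 before preparing the dataset
AudioSegment.from_file("validation.m4a").export("validation.mp3", format="mp3")
```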
Once that's done, you can just run the prepare_data.py file with Python. I'm in a virtual environment here — there are full instructions in the repo README for how you should set this up; you should always set up a virtual environment if you're going to install Python packages — so the instructions cover setting up a virtual environment and then installing the requirements from requirements.txt.

In the meantime, because I've run python prepare_data.py, let's take a quick look at what that script has done. It basically takes in these files — the train data and the validation data in MP3 and VTT format — splits them into 30-second segments, pairs the text with the audio, prepares everything as needed for Hugging Face, and finally pushes it to a target repo that I've defined on Hugging Face. We can take a look at that by going over to Trelis llm-lingo, the dataset I've set up, and you can see we do have these 30-second snippets — maybe I need to refresh — here we go, here's one of 23 seconds. By the way, I keep each line together: if adding a new line of the transcript would take you over 30 seconds, the code won't add that line, which is why everything is always under 30 seconds. You can see the text that goes along with each snippet, and there's a validation as well as a training set. This is a public dataset, so you can check it out yourself under llm-lingo.

To recap: we've used the fine-tuning notebook on our audio files to generate transcripts, and we've then corrected the transcripts, so we now have a dataset that has both audio and text. Now that we have that, we're in a position to load the actual training and validation sets from Hugging Face, so I'm going to go ahead and load those datasets.

Next, we go through a few more steps of loading. In a causal language model, where you predict the next token, you need to tokenize your inputs; but in an audio model you need to convert the MP3 file — the file of raw amplitudes — into the feature set, in other words the frequencies-and-decibels representation. So instead of a tokenizer for your audio, you need a feature extractor, and here's how we load the feature extractor for the base Whisper model. We do still need a tokenizer, because we have to decode — de-tokenize — the tokens that are predicted at the end of the model. And then there's also what's called a processor, which is really a wrapper for the feature extractor and the tokenizer, so we have that loaded as well.

Here I've printed one of the elements — the zeroth element of the training dataset — so we can take a quick look at what's in there, and let's run these cells as well. We can see that the dataset is indeed six rows of training data and then five rows of validation, which matches what I just showed on Hugging Face. And here you can see the audio: it has a path to a segment of audio — the first, or zeroth, segment — and then there's the sampling rate. This is sampled at 48,000 Hz; for Whisper it needs to be 16,000, so we'll downsample that later. Then you can see the text corresponding to that first snippet, and note that the text also has a start time and an end time — extra information I've pushed to this repo, start and end times for each snippet. You don't have to include those, but it's a nice feature.
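Roughly, the loading steps just described might look like the sketch below. The dataset id, language setting, and split names are assumptions based on the walkthrough rather than copied from the notebook.

```python
from datasets import DatasetDict, load_dataset
from transformers import WhisperFeatureExtractor, WhisperProcessor, WhisperTokenizer

base_model = "openai/whisper-small"
dataset_id = "Trelis/llm-lingo"  # assumed id for the dataset shown in the video

# Train and validation splits pushed by prepare_data.py
dataset = DatasetDict()
dataset["train"] = load_dataset(dataset_id, split="train")
dataset["validation"] = load_dataset(dataset_id, split="validation")

# Feature extractor turns raw audio into log-Mel features; tokenizer handles the text side
feature_extractor = WhisperFeatureExtractor.from_pretrained(base_model)
tokenizer = WhisperTokenizer.from_pretrained(base_model, language="English", task="transcribe")

# The processor is a thin wrapper around both of the above
processor = WhisperProcessor.from_pretrained(base_model, language="English", task="transcribe")

print(dataset)
print(dataset["train"][0])
```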
OK, so as I said, we need to downsample, because the data is at 48 kHz. First, we cast — we add a column to the dataset specifying the correct sampling frequency. Here we're adding a column to say that every row should be at 16,000 Hz, and if we run this, you can see the sampling rate now shows as 16,000 Hz. But now we actually need to resample the audio at 16,000 Hz, and this is why we run through a batch process using the feature extractor: when we reload the audio, it will automatically be loaded at 16 kHz, because that's what's specified in the audio column — this is all handled by the library in the background once we have the column set to the correct sampling rate.

The next thing we do, for each piece of audio — the amplitudes — is convert it into the Mel spectrogram, the frequency-and-decibel representation, using the feature extractor; that's where that comes in. Then we extract the text and tokenize it, and that's going to be called the labels. Basically, we want to convert our data rows into rows with two key columns: one column for the input features, which represents the audio, and one column for the labels, which represents the transcript. Moving on, we apply that to the full batch of data: I define that function, and here we have a dataset with audio, text, and start and end times, and we want to replace it with one that has the features instead of the audio, and labels — tokens — instead of the text. After we've applied that, we should get a train and a validation set with just input features and labels, which is indeed what we get, and when we print the dataset that's what it shows.

OK, so we have the data ready: we have the features for the audio and we have the tokens representing the text. We're now going to set up a data collator. This organizes the data into a batch as required; it sets up padding — if some audio is shorter than 30 seconds it will just pad with zeros — and it does the same with the tokens: if any padding is needed, it marks those padding tokens to be ignored when calculating the loss, because they're not what we're interested in for the purpose of updating the model. All of this is handled by the data collator, and we just initialize it like this and move on to evaluation.

One of the standard evaluation metrics in speech-to-text is the word error rate: basically, you compare each predicted word to what it should have been, check whether there's an error or not, and calculate an error rate from that. This is all prepackaged as a metric, so we just run a very short cell; there's a little bit of code around which tokens to consider — we don't want to count pad tokens — so we allow for that when we define the metric.
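Continuing from the loading sketch above (which defined dataset, feature_extractor, and tokenizer), the resampling, feature-extraction, and metric steps might look roughly like this; the "audio" and "text" column names are assumptions about how the dataset is laid out.

```python
import evaluate
from datasets import Audio

# Tell the dataset to decode the audio column at Whisper's 16 kHz; resampling happens on access
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

def prepare_dataset(batch):
    audio = batch["audio"]
    # Log-Mel input features computed from the raw waveform
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Tokenized transcript text becomes the labels
    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch

# Replace the audio/text/start/end columns with just input_features and labels
dataset = dataset.map(prepare_dataset, remove_columns=dataset["train"].column_names)

# Word error rate metric for evaluation (padding tokens are masked out before computing it)
wer_metric = evaluate.load("wer")
```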
Our data is all ready at this point, and we're prepared to load the model and pass it through the training process. As we do with causal language generation for next-token prediction, we load the model here. We're not going to load it in 8-bit — that's possible, but it's a small model, so I don't really see the need — and we're going to use the GPU, so I'll set the device map to auto. The base model here is whisper-small, and we set some defaults: we're not going to force any decoder token IDs. This is where you might want to force the model to always output the language first, or the task, but we're not going to bother with that; in certain cases it might help with training, because it guides the model in knowing initially what the language is, but we'll avoid it for now.

Next up, we're going to apply LoRA. I talk about this in many of my videos — I have a very short video on LoRA you can look at yourself — but let me briefly show you in the slides. Rather than training each of the large matrices in this model, we freeze each matrix and apply new, small matrices called a low-rank adapter. These two matrices have a lot fewer parameters, and we train them while keeping the original matrix frozen: when we backpropagate the information down through this matrix, we don't change anything there, we just update the smaller matrices, and when we're done at the end, we merge the LoRA on top. There's going to be a LoRA for the matrices in certain modules within the overall Transformer — specifically, two of the attention projections.

With that said, we set up the LoRA configuration. This is the rank of those smaller matrices — you can think of it like the height or the width of the LoRA matrices — and we're using a LoRA alpha of 64, which implies that the learning rate relative to the learning rate of the base model is going to be 64 over 32, so two: the effective learning rate is twice what we define in the trainer. The modules we're going to target are the Q and V projections of the attention, and we'll use some dropout. I'll just define that, and then we get a parameter-efficient fine-tuning (PEFT) model, which basically means we take the base model and set up these adapters off to the side that are going to be trained. By doing this, we only have to train 1.4% of the total parameters in the combined model plus adapters. Using LoRA is a way to improve the efficiency of training, and it actually turns out that training a small number of adapter parameters often performs better — it converges more quickly than trying to train every single parameter.

Next, we set up the training arguments. We're going to output the results of the training to this trained-model-name directory, saved locally, and we're going to use a batch size of three. Remember, the training set currently only has six rows, and with a batch size of three there are only going to be two steps in each epoch, but I like having a batch size of more than one because it averages things across multiple data points, and that smoothing can help with the stability of training. The learning rate is quite high, but this is a very small model, and typically the smaller the model, the higher the learning rate you can use. I've set five epochs, and we'll see how it progresses. What else should I highlight? I'm using a batch size of one for eval — you could use a batch size of five, since there are just five eval rows, but one is fine. The max generation length needs to be longer than any of the 30-second snippets, which it is. And for saving, we've got five epochs and save steps of 0.2.
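As a sketch, the LoRA setup and training arguments described here might look like the following (with `model` being the loaded whisper-small model). The rank of 32, alpha of 64, target modules, batch sizes, epochs, and save interval come from the walkthrough; the learning rate, dropout, and output directory name are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import Seq2SeqTrainingArguments

lora_config = LoraConfig(
    r=32,                                 # rank of the adapter matrices
    lora_alpha=64,                        # alpha / r = 2, i.e. an effective 2x learning-rate scaling
    target_modules=["q_proj", "v_proj"],  # the Q and V attention projections
    lora_dropout=0.05,                    # assumed dropout value
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only a small fraction of parameters are trainable

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-llm-lingo",  # assumed local directory name
    per_device_train_batch_size=3,
    per_device_eval_batch_size=1,
    learning_rate=1e-4,                    # assumed; "quite high" for a small model
    num_train_epochs=5,
    evaluation_strategy="steps",
    eval_steps=0.2,                        # fractions of the total run
    save_steps=0.2,
    predict_with_generate=True,
    generation_max_length=225,             # longer than any 30-second snippet's transcript
)
```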
With five epochs and save steps of 0.2, that means we save every 20% of the full run — so every one epoch. Next, we pass those arguments into the trainer: we pass in the model, the training and validation sets, and the function to compute the metrics, and here I'm pointing to what we should use as the "tokenizer" — it's really a bit misleading to call it a tokenizer, because it's the feature extractor that's being used here for converting the audio into features.

When we run the training, it's actually really fast — it takes about 1.35 minutes — and you can see that the training loss is falling, the validation loss is falling, and the word error rate is going down. You shouldn't expect the word error rate to go to zero, because we're really just training on a few words, which are a small proportion of the training dataset; the model is still going to make mistakes unless it's a really powerful model, and this is a small model, but you can see it's clearly improving.

OK, so we're done with the training, and what we're going to do now is grab one of these checkpoints. I actually had to pick up the recording again, so I've re-run the training, and in this case too you can see the best point according to the word error rate is the fourth checkpoint. When we check the files on the left-hand side, you'll see that all the checkpoints are saved: 2, 4, 6, 8, and 10 — there are two steps per epoch, because there are six rows of data in batches of three, so checkpoint 10 corresponds to the end of the fifth epoch, and checkpoint 8 is probably the best one. So I'm going to set the adapter to push as checkpoint 8. You could also load an adapter here from the Hub — maybe you want to compare with some other adapter and do a quick test on it — but generally you'll want to pick an adapter from the training you've just done. You can see the path when we print checkpoint 8. I'm going to apply that adapter onto the base model, using PEFT, the parameter-efficient fine-tuned model, and once that's done you can also push the adapter to the Hub if you'd like; I did that on an earlier run of the script.

Next, we merge the adapter onto the base model. This takes checkpoint 8 and merges it in — basically getting rid of the adapters by merging them onto the base model — so now we're back to having a simple base model. Well, not exactly simple, but we have a base model, and if you print it you can see all of the layers: there's the encoder, which takes the features from the audio and puts them through a number of layers, and then there's the decoder, which is a text-to-text model — it takes in tokens and then generates the predicted transcript. I typically like to save this merged model — you can see it appearing here in whisper-small-llm-lingo, all the files — and I've also saved the processor, which saves the feature extractor and the tokenizer, so that's all present here.

Next, we set up a pipeline to run an evaluation. The model we want to evaluate is the trained model, so I'm picking the trained model that's saved on disk here; we'll process it in chunks of 30 seconds and use the GPU if possible. For stability, you can force the first tokens in each chunk to be the language and the task, but the model is actually pretty robust, so we don't need to do that.
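The checkpoint-merging and saving steps described above — and the push to the Hub discussed just below — might look roughly like this sketch; the checkpoint path, directory, and repo names are placeholders, and `processor` is the one loaded earlier.

```python
from peft import PeftModel
from transformers import WhisperForConditionalGeneration

# Re-load the base model and apply the best adapter checkpoint from training
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
peft_model = PeftModel.from_pretrained(base, "whisper-small-llm-lingo/checkpoint-8")

# Fold the LoRA weights back into the base weights, leaving a plain Whisper model
merged_model = peft_model.merge_and_unload()

# Save the merged model plus the processor (feature extractor + tokenizer) locally
merged_model.save_pretrained("whisper-small-llm-lingo")
processor.save_pretrained("whisper-small-llm-lingo")

# Optionally push both to the Hub, with the weights stored as safetensors
merged_model.push_to_hub("your-org/whisper-small-llm-lingo", safe_serialization=True)
processor.push_to_hub("your-org/whisper-small-llm-lingo")
```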
So I've gone ahead and run this process_audio_and_create_vtt. What it does is take in the validation file — this is not what we trained on, it's the validation audio, the MP3 file — and, using the fine-tuned automatic speech recognition model that we've loaded here, we now get a transcript. When that has run, we can open up the transcript that's generated, evaluation.vtt, and you can see some improved performance.

I'll show you where it has done well and where it still has room to improve. It's getting a lot of terms correct: Yi 34B chat, it's getting that perfectly; it's picking up Mixtral 8x7B; again up here, it's picking up Mixtral and Mistral 7B, and it's picking up Claude 2.1, so it's getting all of this very well. It's getting Phi-2 — previously it was spelling that "phy" — so it's getting that correct now. "Trixie" it's still not getting quite right, and this bit here — "it by" — is actually an "eight": it's not recognizing me saying "eight" in Irish English. But you can see that, with just some quick fine-tuning, it's already getting very good performance, including on the added words. I've only said each keyword about once in the training set; I think if I made a longer transcript, where I said the same word in multiple contexts, I'm pretty sure it would be able to pick up pretty much all of the words I'm training it on.

With the model evaluated and working well, you can go ahead and push it to the Hub; you can use the safe_serialization parameter to push it as safetensors, and you'll also want to push the processor to the Hub so that people can make use of the model. And indeed, after you've done that, you'll see the model appear here: you can see the files — added tokens, the configuration — and the safetensors file is about a gigabyte, the same size as the base model, which makes sense because we've merged the model, so it should be the same size as the original; you can see the tokenizer as well. And that brings us to the end of the script on fine-tuning.

Before you go: you'll find all of the free Colab notebooks linked below if you want to check those out, and also a link if you want to pay for this repo. A few final tips. If you want stronger performance than what I showed, you can try out the whisper-medium or even the large model. If you want to improve the fine-tuning performance, I'd recommend doing more recording than what I did, which was just a two-minute sample: if you have a list of words that you want to fine-tune on, you can read out those words even more times, maybe in a different order and/or with different phrasing and different explanations between them. Actually providing a little context on what the words mean can also help, because the Transformer considers up to about 30 seconds of audio when it's decoding to produce your transcript output. As per usual, let me know if you have any questions down in the comments. Cheers!
Info
Channel: Trelis Research
Views: 15,118
Keywords: asr, automatic speech recognition, asr fine-tuning, whisper, whisper fine-tuning, automatic speech recognition ai, automatic speech recognition tutorial, how does speech to text work, speech to text, speech to text tutorial, speech to text explainer, whisper small, whisper small fine-tune, whisper lora
Id: anplUNnkM68
Length: 49min 25sec (2965 seconds)
Published: Fri Jan 19 2024