Getting Started With OpenAI Whisper API In Python | Beginner Tutorial

Captions
In this lesson we are going to learn how to use the OpenAI Whisper API to transcribe and translate audio files in Python. OpenAI Whisper is an automatic speech recognition model, and with the Whisper API we can now integrate speech-to-text functionality into our applications. Compared to the open-source Whisper Python library, the Whisper API service processes media files much faster, is a lot easier to use, has more configuration options, and can even return the output in different formats, such as a JSON file, a text file, or VTT/SRT files that you can upload directly to YouTube as captions.

So what are some of the benefits of using the Whisper API? You can use it to make your audio content more accessible to viewers who are deaf or hard of hearing by providing accurate transcriptions of spoken words, and developers can use the API to automatically generate transcriptions of audio and video content, making the content easier to search and index.

Before diving into the tutorial I want to quickly cover the pricing. Using the Whisper API costs $0.006 per minute, so transcribing or translating 10 minutes of audio content will cost you 6 cents.

To use the Whisper API you first need to sign up for an account. If you don't have one, simply go to openai.com, open the API page, then click Get Started to create an account. As a new user you get $18 of credit to use, but the credit expires three months after you sign up. Let me log into my account first. Once you are logged in, click on your username, then click View API keys. Click Create new secret key, and this is going to be your API key. Copy the API key somewhere safe; that can be a notepad, or you can save the API key to an environment variable. For demonstration purposes I'm going to save the API key in my Python script.
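As a quick sanity check on that pricing, a one-line sketch (the $0.006-per-minute rate is the one quoted above and may change):

```python
RATE_PER_MINUTE = 0.006  # Whisper API price quoted in the video, in dollars


def whisper_cost(minutes: float) -> float:
    """Estimated transcription/translation cost in dollars."""
    return round(minutes * RATE_PER_MINUTE, 4)


print(whisper_cost(10))  # → 0.06, i.e. 6 cents for 10 minutes of audio
```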
Here I'm going to create a variable called api_key and paste in the API key. Once we have the API key, we need to install the openai Python package, which we can do by running the command pip install openai.

For this lesson I'll be using these two audio files. The first media file is this Japanese podcast, an 11-second clip; I don't know if you can hear the sound, but basically the podcaster is speaking Japanese, and we're going to use the OpenAI Whisper API to translate the audio into English. The other media file is this WAV file of Steve Jobs' farewell speech, about one minute 35 seconds of audio.

Going back to the Python script, I'm going to import the openai module (let me increase the font size). Next I want to create a variable called model_id. The Whisper API uses a model called whisper-1, which is equivalent to the large-v2 audio model trained on 680,000 hours of audio data; this is the most powerful and accurate audio model OpenAI currently has. Next I want to create two variables pointing to my media files; this one is going to hold the file path of the audio file we are transcribing. To upload the media file we need to turn it into a binary object, so we open the file as read-binary.

Making the API call to transcribe the media file is easy: you simply call the openai module's Audio.transcribe method and name the output response. Inside the transcribe method we need to provide the API key, specify the model by passing the model ID, then supply the file, which is going to be our media file object. By default the output is returned in JSON format; to get the transcription in a different format, such as a text file or SRT/VTT, you use the response_format parameter and pass in a format type. JSON, text, SRT, and VTT are the most commonly used formats right now.

For demonstration purposes let's do SRT. I'm going to go ahead and run this code block and look at the output. We have the response, with a 200 HTTP code from the API, and here is the SRT output, created correctly. The API was only just released, so there are still a couple of issues, but I'm pretty sure they will get resolved pretty quickly. Oh, and one thing I forgot to mention, because it's pretty important: right now the Whisper API only supports files up to 25 megabytes. If your media file is larger than 25 megabytes, you will need to chunk the file into pieces, which I'll show you how to do in a moment.

Here I'm going to rerun the transcribe method, and I'll take out the response_format parameter. Because opening the media file earlier consumed the buffer, we need to reload the media file again. If I run the transcribe method and print response, it returns an OpenAI response object. To retrieve the transcription from this response object you reference its data like a dictionary, with text as the key, so I'm going to reference the text key to print the transcription.

That's it for the transcription lesson. Now, to translate a media file, I'm going to copy this code block into the other Python script. Again we need to load the file as binary; this time the file is going to be the japanese.mp4 video, and I'll grab this code block right here. To translate a media file you simply use the translate method instead of transcribe; everything else is going to be the same.
I want to show you one more thing in a second, but for now let me make this request. Hmm, it didn't like the media file format; let me see what's going on here... OK, I'm able to make the request this time. Now if we look at the translation and do a caption comparison, it starts with "Hello everyone, we're presenting Easy Japanese..."; looking at the first couple of sentences, everything is pretty good, and it was able to recognize that the original audio language is Japanese.

Just in case the translation comes out incorrectly, we can use the prompt parameter to guide how the model should translate the media file. For example, I can say "The podcast language is Japanese", and the model will use the prompt as a hint, basically extra detail about how the media should be translated. This is something I find pretty useful when it comes to translating an audio file, and other providers are not doing it.

The last lesson is how to chunk a media file into pieces. Here I'm going to create a new script and name it chunk_audio_file... actually let's call it chunk_media.py, and you can treat this script as a template; it's meant to be reusable for other purposes as well. To chunk a media file you will need to install a Python package called pydub, which you can install with pip install pydub. This line assigns the file path to the audio file path variable, then we extract the media file extension, then we load the media file using the AudioSegment class. You pick the loader based on the file type you're using; I'm using a WAV file, so I'm going to use the from_wav method and pass it the WAV file path. If you're using another file format, such as MP4, then... let's see... OK, so these are all the supported methods for the other file formats: we have MP3, WAV... I don't see MP4, so I guess MP4 files are not supported, only audio formats. So based on your file type, use the correct method to load the audio file.

When I split the media file I want to use this string as the file name, so I'll store the string in a variable called output_prefix. The pydub library handles time in milliseconds, so 1,000 is equivalent to one second and 60 × 1,000 is one minute; these are just some example units in case you want to chunk your audio file at a different interval. In this case I want to split the audio file into one-minute pieces, so first I need to figure out the total duration of the audio file, which I can do with the duration_seconds attribute of the AudioSegment object I loaded the audio file into. Once we have the total duration, I'm going to add one second to it just to adjust for rounding issues, then convert the unit into milliseconds by multiplying by 1,000.
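That millisecond arithmetic can be checked without pydub at all; a small stdlib sketch (the helper name is my own) that turns a file's total duration into one-minute slice boundaries:

```python
ONE_SECOND = 1000            # pydub-style units: milliseconds
ONE_MINUTE = 60 * ONE_SECOND


def chunk_windows(duration_seconds, chunk_ms=ONE_MINUTE):
    """Return (start_ms, end_ms) slice boundaries covering the whole file.

    One second of padding is added, as in the video, to absorb rounding.
    """
    total_ms = int((duration_seconds + 1) * 1000)
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


# ~1:35 of audio -> one full minute plus the remainder
print(chunk_windows(95))  # → [(0, 60000), (60000, 96000)]
```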
And finally, to chunk the file into pieces, in this case one-minute audio files, I'm going to create a variable called chunk_unit, which is the chunk interval I'm going to use; here it is one minute. If I wanted to split a large media file into, say, five minutes per file, I would multiply the chunk unit by five. Next you basically iterate over the total duration in steps of your chunk unit size, then slice the AudioSegment object: each segment runs from the current timestamp to the timestamp plus one minute. Then I export each piece using the export method; the output name is going to be the audio file name followed by the audio file extension, and we need to specify the audio format, in this case WAV.

Let me put this on the left and minimize the sessions. Now if I press F5 to run the script... here's the first file, minute one, but I don't see minute two. Oh, I know why: here it's chunking into five minutes per file, so let me rerun the script. This time I have two files. The Steve Jobs farewell speech is one minute 35 seconds of audio, so the first file is one minute and the second file is 35 seconds. If you want to transcribe a large audio file, this is how you chunk it into pieces under 25 megabytes each.

So this is what I wanted to show you in this video. Hopefully you guys found it useful; feel free to post your questions or feedback in the comment section below, and if you enjoyed this video please don't forget to like it and click the Subscribe button. I'll see you guys next time, bye!
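The chunking walkthrough above uses pydub, but for plain WAV files like the one in the video the same split can be sketched with only the standard-library wave module; this is an alternative of my own, not the script shown on screen, and the function and output names are assumptions:

```python
import wave


def chunk_wav(file_path, output_prefix, chunk_seconds=60):
    """Split a WAV file into pieces of at most `chunk_seconds` each.

    Pure standard-library alternative to the pydub approach in the video;
    returns the list of file names written.
    """
    written = []
    with wave.open(file_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = chunk_seconds * src.getframerate()
        index = 1
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            # e.g. "steve_jobs_farewell_minute_1.wav"
            name = f"{output_prefix}_minute_{index}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)  # header nframes is patched on close
                dst.writeframes(frames)
            written.append(name)
            index += 1
    return written


# Usage: chunk_wav("steve_jobs_farewell.wav", "steve_jobs_farewell")
# would produce a one-minute piece and a 35-second remainder.
```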
Info
Channel: Jie Jenn
Views: 5,999
Keywords: openai, openai whisper, openai api, openai whisper api, python
Id: BkcSJol59Rg
Length: 15min 10sec (910 seconds)
Published: Mon Mar 06 2023