AutoGen Tutorial | The Best AI Agent Workforce #02 - Transcribe/Translate Video

Video Statistics and Information

Captions
hey and welcome back to another video. Today we're going to be creating an AI agency that transcribes and translates audio from a video file. I'll quickly go over what we're going to be doing, then we'll go through the code, and finally we'll execute the example and see how it works. Without further ado, let's get started.

The first thing to note is that we're going to have two agents: a user agent and a transcribe agent. The flow works like this: the user agent chats with the transcribe agent, and then we have a series of functions. The first function extracts the audio from the MP4 video file, the second creates the subtitles from that audio using the Whisper model, and the third translates that subtitle text into a different language. We'll have inputs that let us choose which language we want to transcribe or translate the text to. So the task today is to take an MP4 file, transcribe the audio from it, and then translate that to a language of our choosing.

Let's go over the code. One thing to note: I'll have a requirements.txt file you can use to install everything, but you will also need ffmpeg installed on your computer in order for Whisper to work properly (AutoGen doesn't really tell you that you need this). I'll have links in the README for the Windows version and the macOS version. This took me a while to figure out, but these are the links I found useful, especially the Mac one, which is what I used. I just followed the steps and it worked, and then my model was finally able to transcribe what I needed it to.

The first file is the .env file. This is where we load the information we need to use the GPT-4 model. There are just two properties here; replace the OpenAI API key with your own and you'll be good to go.

Back to the main Python file, transcribe.py. We have two imports, autogen and dotenv, and we also import specific functions from our functions Python file, which I'll get to in a second. The reason I have that separate file is that we could put all the functions in this one file, but it would be a big file and a lot to look at at once. By separating them out, if we want to change a function we can just go to that file; it's less to look at.

Getting back to this code: we load the .env file so we can use its properties, we have our config list that grabs everything for the GPT-4 model, and then we have our llm_config. The only difference between this and most of my other projects is that we now also have to describe the functions to the model here, so it knows what to use. Now, I said we have three functions, but all I'm doing is describing two functions here.
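The .env loading and config list described above can be sketched roughly like this. In the video this is done with python-dotenv's load_dotenv(); here the environment is read directly, and the "gpt-4" model name and OPENAI_API_KEY variable name are assumptions matching the typical setup, not copied from the video's files.

```python
import os

# In the video, a .env file is loaded with python-dotenv (load_dotenv());
# here we read the environment variable directly.
def build_config_list() -> list[dict]:
    """Return an AutoGen-style config list for the GPT-4 model.

    The "gpt-4" model name and the OPENAI_API_KEY variable name are
    assumptions; adjust them to match your own .env file.
    """
    return [
        {
            "model": "gpt-4",
            "api_key": os.environ.get("OPENAI_API_KEY", ""),
        }
    ]
```

This config list is what gets handed to the llm_config later on.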
Well, the reason is that one of these functions is actually executed inside another one, so we don't need to describe it here. For the two functions we do describe, we give the name of the function (the name you use when you define it) and a brief description of what it does; this first one recognizes the speech from the video and transfers it into a text file. Then we have the parameters: the type is "object", which you can pretty much leave as the default, and then the properties, which are really the parameters of the function. Here it's called audio_path, it's just a string, and we also mark it as required in order for this function to work.

The next one is translate_transcript. After the first function recognizes the transcript and creates the text file, this one translates it into some language. The name of the function is translate_transcript, the description is that it translates the script, the type is again "object", and the properties are source_language and target_language, both of type string: one is the source language the transcript is in, and the other is the target language we want to translate it to. For this to work we require both of those properties, so the source language and target language are both required parameters of the function.

Finally, we give the llm_config the config_list property, which holds the GPT model and API key we described above, and set the timeout to two minutes.

Now we're at the agents. We have a user agent and an assistant agent. The assistant agent is the chatbot, and its system message says it should basically only use the functions it has been provided with. We'll end up registering these functions so that the assistant agent can use them.
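The two function schemas just described can be reconstructed like this. The description strings are paraphrased from the walkthrough rather than copied verbatim, and the exact function names are assumptions based on how they are referred to in the video.

```python
# Reconstruction of the two function descriptions passed to the model;
# names and description strings are paraphrased from the walkthrough.
functions = [
    {
        "name": "recognize_transcript_from_video",
        "description": "Recognize the speech from a video and transfer it into a text file",
        "parameters": {
            "type": "object",
            "properties": {
                "audio_path": {
                    "type": "string",
                    "description": "Path to the video/audio file to transcribe",
                },
            },
            "required": ["audio_path"],
        },
    },
    {
        "name": "translate_transcript",
        "description": "Translate the transcript from a source language to a target language",
        "parameters": {
            "type": "object",
            "properties": {
                "source_language": {"type": "string"},
                "target_language": {"type": "string"},
            },
            "required": ["source_language", "target_language"],
        },
    },
]

# The llm_config then bundles these with the config list and the timeout:
llm_config = {
    "functions": functions,
    "config_list": [],  # filled in from the .env-based config list
    "timeout": 120,     # two minutes, as in the video
}
```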
Then we give the assistant the llm_config we just created, so it knows what model to use. For the user agent, we give it a name, a termination message, no human input (I don't want to give it any input, I just want it to do its job), and max consecutive auto-replies of 10, which doesn't actually matter much here. The system messages, for what I'm doing here, serve no purpose, but I left them in case you want to change up your messages.

Now we register the two functions that we defined in the llm_config above: we call user_proxy.register_function with what's called a function_map, and give it both of the function names that were defined in the llm_config. If we don't do this, the functions would never actually be executed.

And then finally we initiate the chat. I have this wrapped in a function, so I define initiate_chat with three inputs (you'll see how this works when we start it): the target path, the source language, and the target language. You don't actually need the source language; I just have it here as an option, because, as we'll see in a minute, the Whisper model can take about 30 seconds of the video or audio and detect by itself which language it's coming from. Then we initiate the chat.

Now let me go over all the functions that we're actually going to execute. As I mentioned, there are three of them, and one of them, translate_transcript, calls translate_text for each line of text it gets from the transcript, to translate it into whatever language. The first function is recognize_transcript_from_video. It takes in the one parameter we just saw, audio_path, and here is where we load a Whisper model.
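The Whisper-based function just described can be sketched roughly like this. The whisper import happens inside the function so the pure formatting helper stays usable on its own, and the exact line format written to transcription.txt is my assumption, not copied from the video.

```python
def format_segment(sentence: str, start: float, end: float) -> str:
    """Format one transcript entry as a '<start> - <end>: <sentence>' line.

    The exact line format in the video's transcription.txt is an
    assumption; adjust it to taste.
    """
    return f"{int(start)} - {int(end)}: {sentence.strip()}"


def recognize_transcript_from_video(audio_path: str) -> list[dict]:
    """Transcribe the audio with Whisper and save transcription.txt.

    Requires openai-whisper and ffmpeg to be installed, as noted above.
    """
    import whisper  # imported lazily so the helper above works without it

    model = whisper.load_model("small")  # also available: tiny, base, medium, large
    result = model.transcribe(audio_path)

    # Collect the sentence plus start/end timestamps for each segment.
    transcript = []
    for segment in result["segments"]:
        transcript.append(
            {
                "sentence": segment["text"].strip(),
                "timestamp_start": segment["start"],
                "timestamp_end": segment["end"],
            }
        )

    # Save each entry to transcription.txt in the simple line format above.
    with open("transcription.txt", "w") as f:
        for item in transcript:
            line = format_segment(
                item["sentence"], item["timestamp_start"], item["timestamp_end"]
            )
            f.write(line + "\n")

    return transcript
```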
I have the size set to small here, but there are also tiny, medium, and large sizes (and one more), so you can choose what size model you want; the larger it is, I think the more accurate it's going to be with the timestamps and the actual transcription. Then we call model.transcribe on the audio path, and we have some variables here that we initialize. For each segment in the transcription result, we append the sentence, the timestamp start, and the timestamp end. So, say from second 0 to second 3, we record the sentence that was said, starting at second 0 and ending at second 3; later on it could say that from second 33 to second 35 we have some other sentence, with timestamp start and end of 33 and 35. You'll see this whenever we execute it. Each of these gets appended to the transcript variable, so we basically end up with an array of objects that each have a sentence plus a timestamp start and end. Finally, this function saves the transcript to a file: we create a transcription.txt file, opened with "w" for writing, and for each object in the array we grab the sentence and the start and end time and write them in this format to the text file. Then we return the transcript.

The next one is the translate_transcript function, which takes in the source language and target language that we just saw when we defined the function in the llm_config. We open the transcription.txt file, assuming it's there, and initialize an array. For each line in the text file, we take that line, get the timestamps and the actual text, and for the text we call translate_text from the source language to the target language. In that function, as you can see, we're using OpenAI directly; this is an example of how you can use AutoGen but also just call OpenAI yourself. We ask OpenAI to directly translate from the source language to the target language, and then we return that translated text. Back in translate_transcript, we take the translated text and append it to the array we initialized: we keep the timestamps that came with the line from transcription.txt, format the line (I know this is kind of a lot, and you'll see it when we actually look at the transcription text), and append it to the array. Then all we do is create the new target-language transcription.txt file we want to write to, and for each line in the translated array of text we write it to that file. That's it; we return that, and we're pretty much done at this point.

Okay, great, this worked wonderfully. As I said, there are three inputs. The first is the target path, so I just give it the full path to the video file (and yes, I know this is a Peppa Pig video). The source language I know is English, and I put French as the destination language. So it finds the video, recognizes the speech, transfers it to a script file, and then translates it. You can see it called the recognize_transcript_from_video function with the audio_path parameter, found the file, and executed it. As I mentioned before, it can actually detect the language first using the first 30 seconds, and it did that. Then, as you can see here, for each timestamp it has a sentence: the first three seconds have this sentence, then the next one, and so forth. It's a one-minute-and-three-second video, and it did correctly transcribe it. Next, we got the response back from the call: this is the array with the sentence and the timestamp start and end, which we saw being appended to the transcript variable in the function in order to return it. Once it did that, it called the translate_transcript function with English as the source language and French as the target language, and executed it. I'm assuming this is right, since I can't speak French, but it appears to have correctly translated the full minute and three seconds from English to French, as you can see in the transcription.txt file that was generated and then in the French transcription here. It did the same thing: it correctly translated and then saved it to a file.

This is great, because you can have it translate to a lot of different languages, pretty precisely too, especially if you use a larger model, which will be a little more accurate with the timestamps. You could even bring in multiple languages, translate them all at once, and save them in different files based on the translated text.

Thank you for getting this far and watching. I hope you were able to learn a little more about AutoGen, functions, and the Whisper model we introduced, which lets you turn audio into text. Now you can take one audio file and translate it into multiple different text files or multiple different translations, and you're not limited to audio files either: you now know how to translate text by itself, so you can take in documents and translate them, since you have the functions to do it. Thank you for watching, and I'll see you in the next video. Leave any comments or suggestions below with your thoughts, or if something didn't work; I'd be more than happy to help you. Thank you again, I'll see you next time.
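To recap, the translation step walked through above can be sketched roughly like this. The translator is passed in as a callable so the parsing logic stands on its own; in the video it is a small translate_text helper that calls OpenAI directly. The line format, the helper's signature, and the output filename are all assumptions for illustration.

```python
from typing import Callable

def parse_line(line: str) -> tuple[str, str]:
    """Split a '<start> - <end>: <sentence>' line into (timestamp, text)."""
    timestamp, _, text = line.partition(": ")
    return timestamp, text.strip()


def translate_transcript(
    source_language: str,
    target_language: str,
    translate_text: Callable[[str, str, str], str],
    in_path: str = "transcription.txt",
) -> str:
    """Translate every line of the transcript file, keeping the timestamps.

    translate_text(source, target, text) stands in for the video's
    OpenAI-backed helper; the output filename pattern is an assumption.
    """
    with open(in_path) as f:
        lines = f.read().splitlines()

    translated = []
    for line in lines:
        if not line.strip():
            continue
        timestamp, text = parse_line(line)
        translated.append(
            f"{timestamp}: {translate_text(source_language, target_language, text)}"
        )

    # Write the translated lines to a file named for the target language.
    out_path = f"{target_language}_transcription.txt"
    with open(out_path, "w") as f:
        f.write("\n".join(translated) + "\n")
    return out_path
```

In the video, the translate_text helper sends a prompt along the lines of "translate the following from English to French" to the OpenAI chat API and returns the response text; any translation backend with the same shape would work here.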
Info
Channel: Tyler AI
Views: 2,511
Keywords: ai, chatgpt, artificial intelligence, chatdev tutorial, ai agent, ai agents, autonomous ai agents, autogpt, build autonomous agent with python, chat gpt, gpt 4, tutorial, step by step, python ai chatbot tutorial, ai automation agency, how to setup autonomous ai, your first software ai team, ai tools, artificial intelligence and machine learning, microsoft autogen, autogen, auto gen, ai tutorial, open ai, whisper ai, ai agency, ai automation, saas, chatgpt 4
Id: l1TNbfZewC4
Length: 12min 37sec (757 seconds)
Published: Fri Dec 29 2023