Build an AI Voice Assistant App using Multimodal LLM "LLaVA" and Whisper

Captions
Hello everyone, welcome to the AI Anytime channel. In this video we are going to work on a very interesting project: a multimodal voice assistant. We will combine two models, a generative model and a speech-to-text model. Specifically, we will combine LLaVA, a multimodal LLM, with Whisper, an open-source model from OpenAI that handles speech-to-text tasks. The ultimate goal is a voice assistant for multimodal data: if you have images (or frames extracted from videos) and you want to retrieve information from them through voice, this video shows how. We will do this in a Colab notebook, but we will build a Gradio app on top so you can play around with the models and features, see whether it makes sense to scale further, and of course build it into a full app as well. We will probably rely on a T4 or V100 GPU in this video, but you can do this with a consumer GPU too, because we will use bitsandbytes to load the model in 4-bit. So let's build this voice assistant with LLaVA and Whisper.

If you look at my screen, I am on Google Colab in a notebook called "llava whisper". Before installing anything, change the runtime: go to Runtime > Change runtime type. I have Colab Pro, so I will go with V100 High-RAM, but T4 High-RAM will also work; even without Colab Pro you can do this on a T4 GPU.

Now let's install a few things. First Transformers, with the quiet flag because I don't want to see all the logs, and pinned to the specific version we need, 4.37.2. Then the other libraries: bitsandbytes, which lets you load the model in lower precision (for example 4-bit), and accelerate, version 0.25.0, to complement bitsandbytes. We also need Whisper, and the best way to install Whisper is straight from the GitHub source; if you are not familiar with the repository, it is at github.com/openai/whisper, and you install it with a git+ URL.
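The install steps described above can be sketched as follows; the bitsandbytes version is not pinned in the video, so leaving it unpinned there is an assumption:

```shell
# Pinned Transformers version used in the video
pip install -q transformers==4.37.2

# 4-bit loading support, plus accelerate 0.25.0 to complement it
pip install -q bitsandbytes accelerate==0.25.0

# Whisper, installed straight from the GitHub source
pip install -q git+https://github.com/openai/whisper.git

# UI and text-to-speech
pip install -q gradio gTTS
```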
Next we need a couple of other things. Gradio is a library that gives you a simple UI on top of Python to showcase capabilities, demos, and proofs of concept. And gTTS, because I also want to respond to the end user with voice, so this app has both speech-to-text and text-to-speech; we'll see that. By the way, the same setup also works for multimodal RAG, which many of you want to build with LLaVA: if you can answer from the context of a single image, you will probably be able to do it with a vector database as well. It's not rocket science; we'll see that too.

Once everything is installed, let's import a few things: torch, and from transformers both BitsAndBytesConfig and pipeline, because I'm going to use the image-to-text pipeline of Transformers for inferencing with LLaVA.

Now create the config for loading the model in 4-bit; this is the quantization config. I'll call it quant_config, built with BitsAndBytesConfig: load_in_4bit=True (a boolean), and a compute dtype of torch.float16. I use float16 rather than bfloat16 because bf16 requires GPUs based on the Ampere architecture, like the A100. With bnb_4bit_compute_dtype set to torch.float16 and load_in_4bit set to True, the quant config is done.

Now the model. I'm going to use the official LLaVA 1.5 weights in the 7-billion-parameter category. It's a multimodal LLM and one of the best open-source options right now; GPT-4 Vision is of course the best out there overall, but this one also does the job. Define model_id as the repo path, then create the pipeline: Transformers has many pipelines you can use, and here I pass "image-to-text" (lowercase, since that's the task name), the model_id, and model_kwargs containing the quantization_config, which is nothing but our quant_config. Loading the model will take a few minutes depending on your internet speed and the compute you have; it has to download around 15 GB of sharded model weights and configs. While we wait, let's start writing the next cell: import whisper.
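The quantization config and pipeline setup above can be sketched like this. The exact repo id is not spelled out in the video, so `llava-hf/llava-1.5-7b-hf` (the pipeline-compatible LLaVA 1.5 7B weights on Hugging Face) is an assumption:

```python
import torch
from transformers import BitsAndBytesConfig, pipeline

# Quantization config: load weights in 4-bit, compute in fp16
# (bf16 would require an Ampere-class GPU such as an A100)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Assumed repo id; the video just says "the official LLaVA 1.5 7B weights"
model_id = "llava-hf/llava-1.5-7b-hf"

# Image-to-text pipeline with the 4-bit config; downloads ~15 GB of shards
pipe = pipeline(
    "image-to-text",
    model=model_id,
    model_kwargs={"quantization_config": quant_config},
)
```

Because of the download size, this cell takes a few minutes on the first run.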
People have the perception that OpenAI only works on closed-source models, but they actually have a lot of open-source models you might not be aware of; please go have a look. Whisper is one of them, and they have others, like CLIP, which helps with vision embeddings. So OpenAI has made significant contributions to the open-source community as well, just not with the newest LLMs; and don't forget the earlier language models like GPT-2. They are among the best when it comes to making an impact with generative AI. Anyway: import whisper, then import gradio as gr (you could also use Streamlit, your choice), plus a few utilities like warnings, os, and json. Then from gtts import gTTS, since gTTS is the Python library that gives us text-to-speech, and from PIL import Image.

Once the imports are done, you'll see the model has finished downloading and loading. If you print pipe, it shows the pipeline object of Transformers.

Now let's bring in an image to run inference on; I'll upload one to the Colab session. You might want to build a voice assistant for many use cases, in healthcare, finance, or insurance, mainly customer-centric use cases where plain text-in, text-out is no longer enough and you are looking at the multimodal dimensions of data: images, videos, audio.

Set image_path to the uploaded file, 1.jpg, then use Pillow's Image to open it: image = Image.open(image_path). Evaluating image in a cell displays it inline in the Colab notebook. For this walkthrough, imagine we are building a voice assistant that can help a doctor get insights and findings from skin-related images; let's take that as the ultimate goal.

I also need something from the Natural Language Toolkit (people have forgotten about NLTK, by the way): import nltk, run nltk.download, and from nltk.tokenize import sent_tokenize; that's all we need.

Next, set some inference parameters (not hyperparameters): max_new_tokens, which I'll keep small, around 250. Then the prompt instructions, written as a docstring so they read well, something like: "Describe the image in as much detail as possible. You are a helpful AI assistant who is able to answer questions about the image. What is the image all about? Now generate the helpful answer."
This is the prompt instruction; you can of course write an even better prompt. Now, this is how the LLaVA prompt template is structured: you start with "USER:", bind the image with the "<image>" token followed by a newline, append the prompt instructions, and finish with "ASSISTANT:" so the model knows to respond as the assistant.

With the prompt done, let's get the output for this particular image: outputs = pipe(...), where the image goes in as the first input along with prompt=prompt. If you want more inference parameters, you pass them through generate_kwargs as a dictionary, and that is where max_new_tokens goes (you could also read it from an environment variable, but we'll reuse it in our function later).

When you print outputs, you see a bunch of things, including generated_text, and we have to parse the assistant's part out of it. Here is what it says: "The image features a young girl with a skin condition, possibly a skin rash or a skin disease. The girl has a visible bump on her ear, which is a noticeable feature of the image. The skin condition appears to be affecting her ear, and it is likely that the bump is the result of the skin condition. The girl's face is also visible, and it seems that she's looking at the camera." She is of course not actually looking at the camera; her face is turned
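The prompt template and the output parsing described above can be sketched like this; the `outputs` value below is a hypothetical stand-in for what the pipeline returns, so the block runs without the model loaded:

```python
prompt_instructions = (
    "Describe the image in as much detail as possible. "
    "You are a helpful AI assistant who is able to answer questions about the image. "
    "What is the image all about? Now generate the helpful answer."
)

# LLaVA 1.5 chat template: USER, the <image> token, the question, then ASSISTANT
prompt = "USER: <image>\n" + prompt_instructions + "\nASSISTANT:"

# Real call (requires `pipe` and `image` from earlier cells):
# outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 250})

# Hypothetical stand-in mirroring the pipeline's return structure
outputs = [{"generated_text": prompt + " The image features a young girl."}]

# The reply is everything after the ASSISTANT: marker
reply = outputs[0]["generated_text"].split("ASSISTANT:")[-1].strip()
```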
sideways; that much noise in the output is expected.

Now let me add a few more cells. For each sentence I'll use sent_tokenize on the generated text and print it. The first attempt raised "list indices must be integers", because outputs is a list: you have to index outputs[0] first. With that fix it works, and we get the output in a much more readable way, sentence by sentence, starting with the "ASSISTANT" part (which you could also strip off, not a big deal).

So this works, but how do we combine Whisper and wrap everything in Gradio so we have an actual voice assistant? Let's do that. First some housekeeping: warnings.filterwarnings("ignore") to suppress a few warnings. We already imported gTTS at the top; let me double-check. We also need NumPy for the image handling, so import numpy as np. Then a small utility for the GPU: check torch.cuda.is_available(), set device to "cuda" if CUDA is available else "cpu", and add a print that reports which torch version and device we are using.
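The device-selection utility just described, as a minimal sketch:

```python
import torch

# Pick the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using torch {torch.__version__} ({device})")
```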
Let's run it: it prints something like "Using torch 2.1.0+cu121 (cuda)", that is, torch 2.1 with CUDA 12.1. If you are doing this locally and face any error, make sure you have this matching combination of CUDA and torch.

Now the Whisper part. Whisper is one of the best speech-to-text models I have seen in recent years; it was revolutionary. Whisper ships in five different weight sizes: tiny, base, small, medium, and large. Feel free to use any of them depending on how much compute you have. I'm going to go with medium; large would be too big for us for now.

So: whisper.load_model, passing device=device because I'm using a GPU and want to bind the model to it. You can see it downloads about 1.42 GB. Checking the size table in the Whisper repository, the base model is only 74 M parameters, so I was actually thinking of small; base would also have been fine. If we hit any errors we can come back and switch.

Then a print that reports whether the model is multilingual, something like "Model is multilingual" (from model.is_multilingual, else "English-only") "and has N parameters", where N sums np.prod over the parameter shapes; Colab's autocomplete suggestion here was really handy.
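A self-contained sketch of the Whisper loading step above (downloading the medium weights takes a while on first run):

```python
import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"

# "medium" downloads ~1.42 GB; base (74 M params) or small also work
model = whisper.load_model("medium", device=device)

print(
    "Model is",
    "multilingual" if model.is_multilingual else "English-only",
    "and has", sum(p.numel() for p in model.parameters()), "parameters",
)
```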
It confirms the model is multilingual and shows the parameter count. What I'm trying to tell you is that you can build this voice assistant in any of the languages Whisper supports: if you go to the GitHub repository you can see the full list, and it supports a lot of languages. I've used that before, but it's not the focus here.

Let's keep going. Next, import the regular-expression module, and import datetime, because in the Gradio application we'll want a logger. For the log file, capture a timestamp: tstamp = datetime.datetime.now(), convert it to a string with str(), and replace the spaces so it is file-name safe. Then build the log file name: logfile = f"log_{tstamp}.txt".
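The timestamped log-file name described above can be built like this:

```python
import datetime

# Timestamp for this session, made file-name safe by replacing spaces
tstamp = str(datetime.datetime.now()).replace(" ", "_")
logfile = f"log_{tstamp}.txt"
```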
Now let me run that and write a function, writehistory, that appends each interaction to the log file (and by the way, with this structure you could build a multimodal RAG here as well). It opens the log file with encoding="utf-8" to avoid encoding errors, writes the text, and then closes the file. After that, import requests, since we're going to use Gradio.

Next, I'll put all the logic into functions, because Gradio calls functions, and I have prepared a few gists to save time writing the repetitive code again and again. The first function is the image-to-text one: it takes the input text and input image, loads the image, and uses writehistory to log the interaction. Then it builds the prompt: if the user asked a question, that question is appended to the prompt instructions; otherwise the default description prompt is used. I'll paste in my earlier prompt here and tidy the indentation of the docstring; it doesn't strictly matter, but it makes it more readable for the end user. The function then calls the pipeline, with max_new_tokens set to 250 for now, checks that the output is not empty, and extracts the reply with a regex match group.
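A sketch of the logging helper and the image-to-text function just described. The exact wording of the gist's prompt instructions is not shown in the video, so the strings here are assumptions, and `writehistory` takes the log-file path as a parameter (with a placeholder default) so this cell is self-contained; `pipe` is assumed to be the LLaVA pipeline from the earlier cell:

```python
import re
from PIL import Image

def writehistory(text, logfile="log.txt"):
    # Append each interaction to the session log; UTF-8 avoids encoding errors
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(text + "\n")

def img2txt(input_text, input_image):
    # Assumes the global `pipe` (LLaVA image-to-text pipeline) from earlier
    image = Image.open(input_image)
    writehistory(f"Input text: {input_text}")

    if input_text and input_text.strip():
        # Assumed wording: fold the user's question into the instructions
        prompt_instructions = (
            "Act as an expert analysing images and respond to this question: "
            + input_text
        )
    else:
        prompt_instructions = "Describe the image in as much detail as possible."

    prompt = "USER: <image>\n" + prompt_instructions + "\nASSISTANT:"
    outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 250})

    # Keep only the text after the ASSISTANT: marker
    match = re.search(r"ASSISTANT:\s*(.*)", outputs[0]["generated_text"], re.DOTALL)
    return match.group(1) if match else "No response generated."
```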
Let's run that; one function done. The next function handles transcription, which I have also prepared so we can paste it in faster; transcribe is fairly easy, not complex to understand. It takes an audio file and first checks whether the audio input is None: if it is, there is no transcription. The default language setting is English, but that line is commented out because the model is multilingual. We then load and pad the audio, compute the log-mel spectrogram, detect the spoken language from it, decode, and return result.text. You can find the same snippet in the Whisper repository; the mel-spectrogram and language-detection code we're using is exactly what they document.

The next function we need is the text-to-speech part, so let's get the TTS file here. It is again fairly easy: the language is English and we use gTTS (you could also use another library such as pyttsx3 if you prefer). Let's run it.

There is one more thing to set up: an ffmpeg command for the temporary audio file, which I've written for Ubuntu; it is very easy there, but on Windows you would set it up differently. Running it gives an error: "A UTF-8 locale is required."
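The two functions described above can be sketched as follows. The third-party imports are done lazily inside the functions so this cell runs even before the Whisper model is loaded; `model` is assumed to be the global from the earlier `whisper.load_model` cell, and the file names are illustrative:

```python
def transcribe(audio_path):
    # No audio, no transcription
    if audio_path is None or audio_path == "":
        return ""

    import whisper  # lazy import; assumes the global `model` exists

    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)

    # Log-mel spectrogram, also used to detect the spoken language
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)
    # lang = max(probs, key=probs.get)  # multilingual; English by default here

    result = whisper.decode(model, mel, whisper.DecodingOptions())
    return result.text

def text_to_speech(text, file_path="Temp.mp3"):
    # gTTS imported lazily; saving the mp3 requires network access
    from gtts import gTTS

    gTTS(text=text, lang="en", slow=False).save(file_path)
    return file_path
```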
Let me add a cell to investigate: import locale and print locale.getlocale(). It reports UTF-8, yet the error persists, which is strange. I have solved this before; checking my notes, it is a Colab quirk: even though getlocale() shows UTF-8, you have to override locale.getpreferredencoding with a lambda that returns "UTF-8". Surprising, but with that in place the ffmpeg cell works.

Now back to the gists: I have the Gradio part ready, and Gradio is fairly easy. There is a function to handle the audio and image inputs, since the user will record audio and upload an image in the UI. It calls transcribe, passes the transcription on to the image handler, and text_to_speech returns the path of the generated audio, Temp3.mp3, which is what the ffmpeg step produces. Then we create the interface with gr.Interface, passing fn=process_inputs, the inputs (your audio and your image), and the outputs: the speech-to-text transcription, the model's answer, and the audio file Temp3.mp3.
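The Colab locale workaround just described, in code:

```python
import locale

# Colab quirk: getlocale() already reports UTF-8, but ffmpeg-based cells
# still fail until the preferred encoding is forced to UTF-8
locale.getpreferredencoding = lambda do_setlocale=True: "UTF-8"
```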
I'll rename the "ChatGPT Output" label to "AI Output", label the transcription box along the lines of "Speech-to-Text (Whisper)", and title the app "LLM-Powered Voice Assistant for Multimodal Data". Now let's run it. It prints a public link; when you open it, the browser asks for microphone access, which you should allow ("on every visit").

Here's how to use it: upload the same image, and record what you want to ask about it. I clicked record and said, "Can you analyze the image and tell me what's wrong with this image?" and hit Submit. That's probably not the best phrasing, since nothing is wrong with the image itself; I should have asked what's wrong in the image, whether there is any anomaly, like a health condition. We got our output; let me enlarge it and play the audio first, because the text-to-speech side is the interesting part. We take speech in and give voice out, which is exactly what you need if you're building a voice assistant for healthcare, customer-service chatbots, insurance, finance, or legal. And you could see how fast it was. It's also worth looking at the GPU infrastructure this kind of capability needs: I'm running on a V100, which is not that costly compared with an A100. Now let me play the response.
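The Gradio wiring described above can be sketched like this. It assumes `transcribe`, `img2txt`, and `text_to_speech` from the earlier cells; the component keyword arguments follow Gradio 4.x (older 3.x releases use `source="microphone"` instead of `sources=["microphone"]`):

```python
import gradio as gr

def process_inputs(audio_path, image_path):
    # Speech to text first, then the multimodal answer, then voice out
    speech_to_text_output = transcribe(audio_path)
    if image_path:
        ai_output = img2txt(speech_to_text_output, image_path)
    else:
        ai_output = "No image provided."
    processed_audio_path = text_to_speech(ai_output, "Temp3.mp3")
    return speech_to_text_output, ai_output, processed_audio_path

iface = gr.Interface(
    fn=process_inputs,
    inputs=[
        gr.Audio(sources=["microphone"], type="filepath"),
        gr.Image(type="filepath"),
    ],
    outputs=[
        gr.Textbox(label="Speech-to-Text (Whisper)"),
        gr.Textbox(label="AI Output"),
        gr.Audio(label="Voice Response"),
    ],
    title="LLM-Powered Voice Assistant for Multimodal Data",
)

iface.launch(debug=True)  # prints a shareable link in Colab
```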
The model says: "In the image, there is a young girl with a skin condition, possibly a skin rash or a bacterial infection. The skin appears to be red and inflamed, with visible bumps or lesions on her face. The condition seems to be affecting her ear, as it is also red and inflamed. The girl's skin condition may require medical attention and treatment to improve her overall health and appearance." You could see how well that worked. gTTS offers different voices, pyttsx3 offers others, and you can use Azure or AWS voices as well; it depends on whether you are building this capability for your organization, as a hobby project, or as a college project.

Let me close this and try a new image, again in healthcare, because I believe that is the industry where LLMs can have the biggest impact; there is so much potential, from writing medical notes and summarizing clinical notes to medication recommendations and second opinions for doctors. I searched for images of dandruff, because I want to test LLaVA's capabilities and see whether it is a good model to work with. Many of the results are pictures an LLM probably would not interpret correctly, but I found one that makes sense.

The image is a WebP file, so I converted it to JPEG with an online webp-to-jpg converter and downloaded the result. I uploaded it (it's a fairly large image, it seems) and this time asked, "Can you tell me what is wrong with the lady's hair in this image?" On this infrastructure a response takes no more than about 15 seconds on average; this one took around 12. First look at the speech-to-text output: it transcribed the question perfectly. We are using the medium Whisper model, but you could use base or small, or even tiny if you want to run on a mobile device or a Raspberry Pi.

The answer says: "In the image, a woman is combing her hair with a comb. However, there is a noticeable issue with her hair: the hair on the back of her head is covered in a fine white substance, which appears to be a type of powder or dust. This unusual appearance might indicate that the woman has experienced an unexpected event or situation, such as an accident or an unforeseen circumstance." So it is not able to name the dandruff specifically, but it does notice that something is wrong. You also have to be very careful with false positives when you work in the medical industry, which is heavily regulated; you cannot return a wrong output.

That concludes the walkthrough. The entire notebook will be available on my GitHub repository. Most of the code pieces are adapted from a Packt book I was reading, so credit goes to the author, though I have also improvised a lot on top of it. If you are building your own voice assistant, this entire logic can carry over to your project. From here you could also build a RAG; it's pretty simple, as you basically just need a vector database, and we'll see that as well. If you have any thoughts, feedback, or comments, please let me know in the comment box. If you like the content I'm creating, please hit the like icon, and if you haven't subscribed to the channel yet, please do; it motivates me to create more videos. Thank you so much for watching; see you in the next one.
Info
Channel: AI Anytime
Views: 13,857
Keywords: ai anytime, AI Anytime, generative ai, gen ai, LLM, RAG, AI chatbot, chatbots, python, openai, tech, coding, machine learning, ML, NLP, deep learning, computer vision, chatgpt, gemini, google, meta ai, langchain, llama index, vector database, llava, llava 7b, llava 1.5 7b, llava mutlimodal llm, llava multimodal RAG, mutlimodal llm, multimodal RAG, gemini multimodal, idefics mutlimodal, ideifics 9b, multimodal chatbot, whisper model, AI Voice assistant, voice assistant using LLM, LLMs
Id: 77dJJBFPLpY
Length: 36min 46sec (2206 seconds)
Published: Sun Feb 25 2024