Generative AI Fine-Tuning LLM Models Crash Course

Captions
hello all, my name is Krish Naik and welcome to my YouTube channel. Here is a crash course to help you understand how you can perform fine-tuning, specifically with respect to LLM models. In this crash course we will discuss the theoretical intuition of what fine-tuning is and how to perform it, we will learn about concepts such as quantization, LoRA, QLoRA and PEFT, and we will implement multiple projects with Hugging Face open-source LLM models like Llama 2. We will also see how to fine-tune on your own custom dataset with the Google Gemma model. Most companies hiring for generative AI roles want this fine-tuning skill set: you should know how to fine-tune with your own custom dataset and how to solve different use cases. This crash course follows a step-by-step process that includes both theoretical and practical intuition, and we will also develop complete end-to-end solutions. I hope you enjoy this series; please watch it till the end and you will definitely learn a lot. Best of luck, let's go ahead and watch this series.

In one of my previous videos I had already shown you how to fine-tune the Llama 2 model with your own custom dataset, and there we saw code related to techniques like quantization, LoRA and QLoRA. All of these techniques are super important if you want to train or fine-tune your own LLM models on your own custom data. When I executed that code, many people requested an explanation of the in-depth theoretical intuition behind it, and that is what I am going to do. The best part is that learning this theory, which I have been doing for the past two to three months, feels like the machine learning era again, when I used to upload a lot of in-depth geometric intuition videos about various machine learning algorithms. Similarly, in this series of videos, in this video we are going to discuss quantization, and in the upcoming videos we will cover techniques like LoRA and QLoRA with all the maths intuition involved. All of these matter for fine-tuning: if I talk about generative AI, one of the most important interview questions will be related to fine-tuning and the techniques used behind it.

So what are we going to cover in this video? We are going to talk about quantization, specifically model quantization, because if you remember our Llama 2 fine-tuning code, we set some parameters regarding precision and the base model, and we spoke about quantization when downloading the model from a higher bit width to a lower bit width. Why do we do that? I will explain the theoretical intuition behind each of those parameters, and later you can go back to my previous coding video and everything will make sense.
So what exactly is quantization? We will discuss full precision and half precision, which is really about data types, about how data is stored in memory. When I say data in the context of LLM models, I mean the weights and parameters, because at the end of the day LLMs are also deep learning neural networks in the form of transformers, or models like BERT. Then we will discuss calibration, which is the name for the process used in model quantization, and we will work through some example problems of how calibration is done. Finally there are different modes of quantization; once the definition is clear, we will discuss two of them, post-training quantization and quantization-aware training. All of these are important for fine-tuning techniques.

Now let's talk about quantization and start with a definition. Quantization is the conversion from a higher memory format to a lower memory format. That is a very generic definition, so let me make it concrete. Consider any neural network: when we train it, the layers are interconnected, and the parameters involved are the weights. Weights are stored as matrices. Let's say one layer has a 3×3 weight matrix; every value in it is typically stored in memory using 32 bits. We denote this as FP32, a 32-bit floating-point format, which is also called full precision or single precision. So a number like 7.23 would be stored using 32 bits in memory.

Now understand what happens when you have a very big neural network, an LLM model. As you look at different LLMs, the parameter counts keep increasing; some have 70 billion parameters. If I consider Llama 2 with 70 billion parameters, that means 70 billion values in terms of weights and biases. Suppose I want to use this model and do some fine-tuning on the ordinary GPU that I have, and my system has a limited amount of RAM, say around 32 GB.
I cannot simply download that model and load it into my RAM, or into the VRAM available on my GPU, because the GPU also has limited memory; it will obviously require a huge amount of space. The other option is to take cloud resources, say create an AWS instance with 64 GB of RAM and the GPU I need, and load the model there. You can do that, but a lot of cost is involved, and the cost grows with the resources. And why is the model so big? Because every weight and bias among those 70 billion parameters may be stored in 32 bits. So what we can do instead is exactly the definition above, conversion from a higher memory format to a lower memory format: convert the 32-bit values into, say, INT8 and then download and use the model. After doing this I will be able to run inference on my own system. Obviously, if I want to fine-tune on a new dataset I will still need a GPU, but inference becomes much easier, because all the values that were stored in 32 bits are now stored in 8 bits. We are converting from a high memory format to a low memory format, and this is what is called quantization.

Why is quantization important? Because you will be able to run inference quickly. Inference simply means that if I have an LLM model and I give it an input, I should get an output, a response. When I give an input, all the calculations over the different weights have to happen. With a bigger GPU this inference happens quickly; with a GPU that has fewer cores the calculation takes time. But if I convert my 32-bit weights to 8-bit weights, will the calculation be different? Yes, it will be quite a bit faster. So quantization matters a lot for inference. And this is not only done for LLM models: in computer vision models and NLP models, wherever there are a lot of weights involved, we can quantize them. For example, if I want to use a deep learning model inside a mobile app, I will quantize the model I created from 32 bits to 8 bits and then deploy it on the mobile phone or on any edge device. You cannot deploy such a big model, with so many parameters, on those devices directly, so we perform quantization. I hope you can see that quantization is nothing but lowering the memory used by the weights.
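To make that memory argument concrete, here is a tiny back-of-the-envelope sketch in Python; the parameter count is the 70-billion figure mentioned above, and the bytes-per-parameter values simply follow from the bit widths:

```python
# Rough memory needed just to hold the weights of a 70B-parameter model
# at different precisions (ignores activations, optimizer states, etc.).
params = 70_000_000_000

for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name:>5}: ~{gb:,.0f} GB")

# FP32: ~280 GB, FP16: ~140 GB, INT8: ~70 GB, 4-bit: ~35 GB,
# which is why a 32 GB machine cannot hold the full-precision weights.
```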
For example, if an FP32 value is what would normally be used to store a weight in memory, I can convert it into FP16, a 16-bit value, or into INT8; that is also quantization. Usually all these values are stored as floating point. FP32 is called single precision or full precision, and converting down to FP16 like this is called half precision. You should be comfortable with these terms; in short, they are just floating-point number formats. Similarly, when you work with TensorFlow you will come across TF32, another format in which numbers are stored. These terminologies are important, but I hope you see the main motivation behind quantization: if I have a bigger model, I should be able to quantize it into a smaller model so that I can use it for faster inference, whether on mobile phones, edge devices or even smartwatches.

Now, with respect to LLM models, once we compress a model with quantization we can also perform fine-tuning on it. But there is one disadvantage: when we quantize, since we are converting from 32 bits to, say, INT8, there is some loss of information, and because of that there will be some loss of accuracy. There are different techniques to overcome this, and we will talk about them. But I hope you now have a clear picture of what quantization, full precision and half precision are.

Next, let's talk about calibration. Calibration is about how we actually convert the 32-bit values into INT8: what is the formula, what is the mathematical intuition required? This is super important to understand as a mathematical concept, because with TensorFlow you could perform quantization in four lines of code, but you should know how to do it manually. When we talk about the types of quantization, there are two: symmetric quantization and asymmetric quantization. An example will make the difference clear. Before that, recall batch normalization from deep learning: during forward and backward propagation we apply batch normalization between layers so that the values stay zero-centered, with the whole distribution centered around zero. That zero-centered, evenly spread distribution is exactly the situation that symmetric quantization assumes.
So let's work through the first example; once you see it, you will understand how symmetric quantization is performed. You already know what quantization is: converting from a higher memory format to a lower memory format. Now we will look at the mathematical intuition. The first technique is symmetric unsigned INT8 quantization.

Here is the goal. Suppose I have floating-point numbers between 0.0 and 1000.0. Imagine these are my weights: the values in my weight matrix range between 0 and 1000, and let's say these are the weights of a larger model, for instance an LLM with a lot of parameters, all stored in 32 bits. (In practice the weights will not be in this range; they will sit in a much smaller range. I am just using simple numbers so you don't get confused, and I won't keep referring to an LLM here.) My aim is to convert them to unsigned INT8, which means 8 bits. With 8 bits you have 2^8 = 256 possible values, and since the type is unsigned the range is 0 to 255. So I want to quantize my range of values from 0-1000 down to 0-255: picture a number line from 0 to 1000 being mapped onto a number line from 0 to 255, and the same thing is done to the weights during the quantization process.

One important aside: how is a single-precision FP32 number actually stored? One bit is used for the sign, positive or negative, and that bit is either 0 or 1. The next 8 bits store the exponent. The remaining 23 bits store the mantissa, the fractional part. So if I have a number like 7.32, it is positive, so the sign bit reflects that, the 8-bit exponent field holds its binary exponent, and the 23-bit mantissa field holds its fraction.
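If you want to see that 1 / 8 / 23 split for yourself, here is a small sketch using only the Python standard library (the helper name is mine):

```python
import struct

def float32_fields(x):
    """Show the sign, exponent and mantissa bits of a number stored as FP32."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]   # reinterpret the 4 bytes as an int
    bits = f"{raw:032b}"
    return bits[0], bits[1:9], bits[9:]                  # 1 sign, 8 exponent, 23 mantissa bits

sign, exponent, mantissa = float32_fields(7.32)
print("sign:    ", sign)       # 0 -> positive number
print("exponent:", exponent)   # 8 bits
print("mantissa:", mantissa)   # 23 bits (the fraction)
```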
For comparison, in FP16, the half-precision 16-bit floating-point format, there is one bit for the sign, five bits for the exponent and the remaining ten bits for the mantissa, the fraction after the decimal point. That is why FP16 takes less memory and FP32 takes more.

Now back to our goal: I have 32-bit numbers and I need to map them into the unsigned INT8 range, with no negative values, between 0 and 255. What equation do we need? I hope you have heard of the min-max scaler; I have repeated it many times in my machine learning videos, and the idea here is the same. We want 0.0 to be converted to a quantized value of 0 and 1000.0 to be converted to a quantized value of 255. The number of bits is decreasing, so quantization is happening, but we have to come up with a scale factor. The scale is defined as

scale = (x_max - x_min) / (q_max - q_min)

where x is the original floating-point range and q is the quantized range. I am showing you how quantization happens for a symmetric distribution, which means the data is evenly distributed around its range. Here x_max - x_min = 1000 - 0 and q_max - q_min = 255 - 0, so scale = 1000 / 255, which is about 3.92. That is the scale factor. To quantize any number from FP32 to unsigned INT8, I divide by the scale and apply a round function: q = round(x / scale). For example, what happens to 250? 250 / 3.92 is about 63.77, and rounding gives 64. So 250 in the original range gets converted to a quantized value of 64, and that is exactly what the code will be doing. This is symmetric unsigned INT8 quantization.
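Here is a minimal sketch of that symmetric, unsigned INT8 scheme, assuming the 0-1000 toy range from above (NumPy is used just for convenience):

```python
import numpy as np

def symmetric_quantize_uint8(x, x_max):
    """Symmetric unsigned INT8 quantization for values in [0, x_max]."""
    scale = x_max / 255.0                                   # (x_max - 0) / (255 - 0)
    q = np.clip(np.round(x / scale), 0, 255).astype(np.uint8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale                     # approximate reconstruction

weights = np.array([0.0, 250.0, 500.0, 1000.0], dtype=np.float32)
q, scale = symmetric_quantize_uint8(weights, x_max=1000.0)
print(scale)                  # ~3.92
print(q)                      # [  0  64 128 255]
print(dequantize(q, scale))   # shows the small rounding (quantization) error
```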
Now let's look at the other case. Suppose I have a different kind of distribution, an asymmetric one, and I again want unsigned INT8. Say my values range between -20.0 and 1000.0; these are my floating-point numbers and I want to quantize them into 0 to 255. In the asymmetric case the real numbers are not symmetrically distributed; the distribution may be right-skewed or left-skewed. If I apply the same formula, x_max - x_min is 1000 - (-20) = 1020, and dividing by 255 gives a scale factor of 4.0. Now if I take a number and divide it by the scale, say -20 / 4.0, and round it, I get -5. But look: my quantized range starts at 0 and goes to 255, so how can I force this -20.0 to map to 0? All you have to do is add the same amount back with a positive sign. That number, 5 in this case, is called the zero point. So there are two important parameters in quantization: the zero point and the scale. For the symmetric example above, since the distribution was symmetric, the zero point was 0 and the scale was 3.92; in this asymmetric example the zero point is 5 and the scale is 4.0. These two parameters are what we need to perform quantization.

These examples should give you an idea of how quantization actually happens; with these simple equations you can see how things work. At the end of the day quantization is simply the process of converting full-precision (single-precision) 32-bit floating-point values into fewer bits. It can be unsigned INT8, or signed INT8, in which case the range is -128 to 127, and you apply the formula accordingly.

We have now covered the first two topics; the other thing we wanted to discuss was calibration. The squeezing you just saw, from one range down into another, is exactly what calibration is: the process of working out the scale and zero point and squeezing the values from a higher format into a lower format. So that is calibration done as well.
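And here is the matching sketch for the asymmetric case, using the -20 to 1000 range; the scale and zero point it computes are exactly the calibration parameters just described:

```python
import numpy as np

def asymmetric_quantize_uint8(x, x_min, x_max):
    """Asymmetric unsigned INT8 quantization for values in [x_min, x_max]."""
    scale = (x_max - x_min) / 255.0                    # (1000 - (-20)) / 255 = 4.0
    zero_point = int(round(-x_min / scale))            # round(20 / 4.0) = 5
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-20.0, 0.0, 500.0, 1000.0], dtype=np.float32)
q, scale, zero_point = asymmetric_quantize_uint8(x, x_min=-20.0, x_max=1000.0)
print(scale, zero_point)   # 4.0 5
print(q)                   # [  0   5 130 255]
print(dequantize(q, scale, zero_point))
```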
Now let's see the different modes of quantization: post-training quantization and quantization-aware training. Both of these are super important, so let's see why. First, post-training quantization (PTQ). Here we already have a pre-trained model; the weights are fixed and we don't need to change them. We take those weights, apply calibration, squeezing the values from the higher format to a lower one using the weight data from the pre-trained model, and convert it into a quantized model. Once that is done we can use this quantized model for whatever use case we have. That is the whole mechanism of post-training quantization: download the pre-trained weights, calibrate, quantize.

The second mode is quantization-aware training, abbreviated QAT, as opposed to PTQ for post-training quantization. What is the exact difference between the two? The problem with PTQ is that when we perform calibration and create the quantized model, there is a loss of information, and because of that the accuracy also decreases for whatever use case we have. In quantization-aware training we again take the trained model and perform quantization, the same calibration process, but then the next step is fine-tuning: we take new training data, fine-tune the quantized model, and only then produce the final quantized model. So with PTQ, as we saw, there is some loss of information and accuracy, but with QAT we compensate for that, because we are adding more data and fine-tuning on it before the quantized model is created. That is why, for any fine-tuning technique, we do not use post-training quantization; we specifically use quantization-aware training, so that we do not lose much accuracy. All the fine-tuning techniques I will show you in future videos are of this type. I hope you now have a good idea of quantization, calibration and the two quantization modes.
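To make the distinction concrete, here is a small conceptual sketch (my own illustration, not a specific library API): PTQ quantizes the finished weights once, while QAT keeps a quantize-dequantize ("fake quantization") step inside the forward pass during fine-tuning so the model can adapt to the rounding error.

```python
import numpy as np

def quantize_int8(w, scale):
    return np.clip(np.round(w / scale), -128, 127).astype(np.int8)

def fake_quantize(w, scale):
    """Quantize and immediately dequantize: the rounding error stays in the forward pass."""
    return quantize_int8(w, scale).astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)      # stand-in for pre-trained weights
scale = np.abs(w).max() / 127.0

# Post-training quantization (PTQ): one-shot conversion of the fixed weights.
w_ptq = quantize_int8(w, scale)

# Quantization-aware training (QAT), sketched: during fine-tuning the forward pass
# uses fake-quantized weights, so the loss "sees" the quantization error and the
# new training data lets the weights adapt before the final conversion.
x = np.random.randn(8, 4).astype(np.float32)      # a batch of new training data
activations = x @ fake_quantize(w, scale)         # forward pass with quantized behaviour
# ...backward pass / weight updates would happen here, then quantize_int8() at the end.
```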
Going ahead, there are two more important techniques we really need to understand, LoRA and QLoRA, and we will also be understanding them with respect to fine-tuning. We are now in part two of the fine-tuning series, and in this video we are going to discuss the in-depth intuition behind LoRA and QLoRA. In the fine-tuning playlist I have already discussed quantization, and I hope you have seen that video; many people were also requesting LoRA and QLoRA, so let's discuss them with the complete in-depth maths intuition. I have gone through the research paper, and there are a lot of complicated things in it, but I will try to teach it in such a way that you at least understand what LoRA is and what QLoRA is, and then we will also see an example of how to do it in code. Trust me, these are some of the most important things in fine-tuning: if you go to any interview, you are going to get asked questions about this, because at the end of the day, for any generative AI project with LLM models, if you are working in a company they will give you fine-tuning tasks.

Let me quickly share my screen. I have already uploaded part one, where we discussed quantization, a good 32-minute video with all the mathematical intuition I could include about the topic. So what does LoRA mean? LoRA stands for Low-Rank Adaptation of Large Language Models. It is an amazing research paper, and as you go through it you will see a lot of equations and different performance metrics, but as usual I will do what I am good at: break all of this down and explain it with examples, with code and more.

So why are LoRA and QLoRA used? LoRA, low-rank adaptation, is specifically used in the fine-tuning of LLM models. Let me build this up with a diagram. Whenever you have a pre-trained LLM, say a model like GPT-4 or GPT-4 Turbo created by OpenAI, we call it the base model. This model has been trained with a huge amount of data: the data sources can be the internet, books, multiple sources. At the end of the day these models are described by things like "it supports 1.5 million tokens" or "it has been trained on this many tokens of words", and to predict the next word the model uses the context of those tokens to give you a response. All such models are base models, also called pre-trained models; examples again are GPT-4, GPT-4 Turbo, GPT-3 and GPT-3.5. Now we can take such a model further, and there are various ways of fine-tuning it. Please make sure you watch this video till the end; if you do, you will understand everything that is required with respect to fine-tuning.
There are multiple ways of fine-tuning. Let's say I take this base model and do some amount of fine-tuning on all the weights of the model; some of the applications generated this way are ChatGPT, or chatbots like Claude. This first way is called full parameter fine-tuning, where we retrain the entire set of parameters on the data we have, and after training we can build applications like ChatGPT or any custom GPT. Going further, you can also take these models and perform domain-specific fine-tuning: for example, fine-tuning a chatbot model for finance, for sales, or for other domains; the key word there is domain. One more way we can divide fine-tuning is specific-task fine-tuning: here I have different tasks, say task A, B, C, D, where one task might be a Q&A chatbot, another might be document Q&A, and so on, different applications; that is why it is called specific-task fine-tuning.

So those are the different ways of fine-tuning. To repeat: this is my base model, for example GPT-4 Turbo, GPT-3.5 or Gemini; we can take the base model and fine-tune it to create applications like ChatGPT (or, outside LLMs, models like Stable Diffusion); we can further fine-tune it for a specific domain like finance, sales or retail; and we can fine-tune it for specific tasks, say converting text to SQL or document Q&A.

Now let's talk about full parameter fine-tuning and its challenges, because this is where I am building up the story of where LoRA will be used. In full parameter fine-tuning the major challenge is that we really need to update all the model weights. Say I have a model with around 175 billion parameters, which means 175 billion weights; whenever I fine-tune this model I need to update all of them. With that many parameters, why can updating all the model weights be a challenge?
Because there will be hardware resource constraints. For the different tasks, if I really want to use this model, I need that much RAM just for inference and that much GPU, so downstream tasks become very difficult. What are downstream tasks? Examples are model monitoring and model inference. Add the GPU and RAM constraints we may have, and we face multiple challenges with full parameter fine-tuning. To overcome these challenges we use LoRA and QLoRA. LoRA, as I said, is low-rank adaptation, and QLoRA is sometimes described as LoRA 2.0; we will discuss both with the mathematical intuition so you get a complete idea of what I am trying to say.

Now what does LoRA do? Read this first point carefully, because you will find this equation in the research paper. LoRA says: instead of updating all the weights, as full parameter fine-tuning does, we will not touch them; instead we will track the changes, the changes in the weights that fine-tuning produces, and later combine those tracked changes with the pre-trained weights. So here you have the pre-trained weights from the base model, say Llama 2. If you fine-tune with LoRA, LoRA will track the new weight updates in a separate matrix of the same size: if the pre-trained weight matrix is 3×3, then the update matrix tracked during forward and backward propagation is also 3×3, and adding the two gives the fine-tuned weights.

At this point you might be thinking: Krish, we are still tracking a full-size weight matrix, so the resource constraint is still there. Fine for a 3×3 example, but what about weights and parameters where there are 175 billion, or 7 billion? Then the tracked matrix would be huge. This is where you need to understand how LoRA actually works: the tracked changes are not stored in that full 3×3 matrix. Instead a technique called matrix decomposition is applied, which means the same 3×3 matrix is stored as two smaller matrices, one of size 3×1 and one of size 1×3. When we multiply these two, we get back the full 3×3 matrix.
So from how many numbers do I recover the nine values of the update matrix? Just six parameters, because multiplying the 3×1 and 1×3 matrices reproduces all nine values. In short, LoRA performs matrix decomposition: a big matrix, of whatever size, is decomposed into two smaller matrices based on a parameter called the rank. How to calculate the rank of a matrix you can check in any linear algebra resource; it is a standard algebraic computation. In this example the rank is one, and if I use these two small matrices, the number of parameters I store is obviously less than storing the full matrix. Yes, there will be some loss of precision, but the scheme makes sure that when we combine the two matrices we recover the entire updated weight matrix. Now start imagining the real scale: say I have 7 billion parameters and I am fine-tuning them; whenever the weight changes are tracked, that huge matrix is decomposed into two much smaller matrices, which require far fewer parameters to store all those values. That is what makes the fine-tuning efficient and what really solves the resource constraint; this is the most important point.

So in the research papers you will see this equation: the fine-tuned weights are W0 + delta W = W0 + B·A, where W0 is the pre-trained weight matrix, delta W is the tracked change, and B and A are the two decomposed matrices (in our example the 3×1 and the 1×3). Multiplying B and A gives back all the tracked weight changes, and decomposing the bigger matrix into two smaller ones obviously requires fewer parameters. What happens if we keep increasing the rank? The number of parameters will also keep increasing, but it will always stay below the full count; even if I decompose 7 billion parameters into two small matrices with a larger rank, the parameters required remain much less. How can I say this? Because the research paper itself experimented with multiple trainable-parameter budgets. The paper compares several fine-tuning techniques, such as prefix-embedding, prefix-layer and adapters (adapters being a very famous approach used before LoRA), and in each case, as the rank or budget increases, the trainable parameters increase too: full fine-tuning has 175 billion trainable parameters, while adapter-style methods start at around 7.1 million with the smallest setting, and even as the budget grows, the percentage relative to 175 billion stays tiny. Similarly for LoRA, thanks to the matrix decomposition, the trainable parameters are reported for the Q, K and V matrices, the three transformer projections where the matrix multiplications happen.
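Here is a tiny NumPy sketch of the W0 + B·A idea, assuming a made-up 3×3 layer and rank 1 as in the example above:

```python
import numpy as np

d, k, r = 3, 3, 1                      # layer shape (3x3) and LoRA rank
W0 = np.random.randn(d, k)             # frozen pre-trained weights (not updated)

B = np.random.randn(d, r) * 0.01       # trainable low-rank factors: d*r + r*k = 6 numbers
A = np.random.randn(r, k) * 0.01       # instead of d*k = 9 for the full update matrix

delta_W = B @ A                        # reconstructed weight update (rank <= r)
W_finetuned = W0 + delta_W             # W = W0 + B·A, the equation from the paper

print("full update params :", d * k)           # 9
print("low-rank params    :", d * r + r * k)   # 6
```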
As we keep increasing the rank, through values like 4, 8 and 64, the trainable parameters also increase: from about 4.7 million up through 9.4 million, 37.7 million and roughly 301.9 million. Compare that with 175 billion parameters: even the larger figures are a tiny percentage of the full model, and that is only possible because of the matrix decomposition. So yes, the parameters grow with the rank, but compared with 175 billion they remain very, very small.

I have also made another table to show this for different model sizes and their trainable parameter counts. Take an LLM with 7 billion parameters: with rank 1 I have roughly 167K trainable parameters for the fine-tuned weights, meaning the two decomposed matrices together hold about 167K values; when they are combined with the base weights you are back to the full 7-billion-parameter model. Similarly, for a 13-billion-parameter model it is about 228K, for 70 billion about 529K, and for 180 billion about 849K. So as the model grows the trainable parameters grow too, but not by a huge amount; even if you push the rank up to 512 you are looking at around 86 million trainable parameters, versus billions for the full model.

Microsoft came up with the LoRA technique, and in the research paper a rank of 8 was used for fine-tuning and it performed absolutely well, so most of the time we select a value in that region. At the end of the day the exact choice does not matter that much, because the parameter count only increases slowly, so usually you pick a rank between 1 and 8 when fine-tuning. There are also scenarios where you should use a very high rank, and this can come up in interviews: if the model needs to learn complex new behaviours, things it was not really trained to do, those complex things can be handled by increasing the rank. So that is a simple question you may be asked. But I hope you now have the complete idea: this is the equation you will see in most of the research papers, and all LoRA is doing is decomposing the tracked weight updates into two smaller matrices with some chosen rank. When you fine-tune, the first thing you need to do is set that rank.
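As a quick sanity check of how the trainable-parameter count scales with rank, here is a small sketch; the layer dimensions and layer count are hypothetical placeholders, not the exact figures from the paper or from the table above:

```python
# LoRA adds B (d x r) and A (r x k) per adapted weight matrix, so the trainable
# parameter count per matrix is r * (d + k) and grows linearly with the rank r.
d = k = 4096          # hypothetical hidden size of one attention projection
n_matrices = 32 * 2   # hypothetical: 32 layers, adapting 2 projections (e.g. Q and V) each

for r in [1, 2, 8, 64, 512]:
    trainable = n_matrices * r * (d + k)
    print(f"rank {r:>3}: {trainable:,} trainable parameters")

# The count scales linearly with r, while the frozen base model
# (billions of parameters) stays untouched.
```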
In this particular example, if you go ahead and do the algebra you will find the rank is equal to one, and for the other small matrix the rank is also one. Similarly, if you chose rank two, the decomposed matrices would be a bit bigger, holding, say, twelve numbers between them; multiplying them still reproduces a much larger weight matrix, and for realistically sized layers the decomposition always stores far fewer parameters than the full update it reconstructs. So that is what LoRA is all about: because of this technique far fewer weights are actually trained, the main resource constraint is handled, and all the downstream tasks become much easier.

One more thing I really want to talk about is QLoRA. QLoRA simply means quantized LoRA. You already know from the first video what quantized means: in the QLoRA case, all the parameters that would be stored as 16-bit floats are converted into 4-bit, that's it. Once we do this we reduce the precision and therefore the memory required, which is why it is called the quantized LoRA technique. The best part is that QLoRA also has an algorithm that takes care of both directions: if there is a 16-bit float that I quantize to 4-bit, I can also convert it back to 16-bit.

With that explanation, let me show you one example. Here is the fine-tuning I showed using the Google Gemma model. The quantization here is done by BitsAndBytesConfig: it says load_in_4bit=True, which means we are going to convert the 16-bit model into 4-bit; the quantization type we are using is NF4; and all the further fine-tuning is done in bfloat16. Then there is the LoRA configuration: we are selecting the rank value of 8, the target modules where this decomposition should be applied, and the task type CAUSAL_LM. Once you set this up and execute everything, that is how the entire quantization and LoRA setup happens.
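The code on screen is roughly along these lines. This is a minimal sketch assuming the Hugging Face transformers / peft / bitsandbytes APIs; the model id "google/gemma-2b" is my placeholder for whichever Gemma checkpoint is used in the video:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the base model (QLoRA-style), as described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # load the base weights in 4-bit
    bnb_4bit_quant_type="nf4",               # NF4 quantization type
    bnb_4bit_compute_dtype=torch.bfloat16,   # further fine-tuning computed in bfloat16
)

model_id = "google/gemma-2b"                 # placeholder Gemma checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA configuration: rank 8, applied to the attention projections.
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # modules to decompose
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the low-rank adapters are trainable
```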
Now, in this next part we are going to see the step-by-step way of fine-tuning your LLM models; in this case I am going to take the open-source Llama 2 model and, with the help of a custom dataset, fine-tune it. We are going to learn the techniques practically rather than theoretically (if you really want the theory here, let me know in the comment section). We will use parameter-efficient fine-tuning, PEFT (often introduced through the "parameter-efficient transfer learning for NLP" line of work), which is an amazing way to fine-tune these LLM models that can easily be huge, 70 billion parameters and more, and we will see in the code how it actually happens. We will also use the technique called LoRA; if you search for the paper, it is called Low-Rank Adaptation of Large Language Models. There is a lot of mathematics behind these concepts; don't worry, in the upcoming videos I will cover every theoretical intuition about PEFT and about LoRA. Right now I am just going to show you a simple way of fine-tuning, because many people were requesting it.

So first we install some important libraries: accelerate, peft, bitsandbytes, transformers and trl. PEFT is parameter-efficient fine-tuning, and inside it you will find the LoRA technique, low-rank adaptation of large language models. bitsandbytes is used for quantization. What does quantization mean here? All these LLM models, trained with 70 billion or 13 billion parameters, have weights whose data type is by default floating point, 32-bit values. Since I am going to do this in Google Colab, where we get very little RAM, it is better to quantize those weights, convert them from float32 to something like INT8, and then, depending on the RAM size, you will be able to fine-tune quickly. Along with that we will also use transformers and trl. Go ahead and execute the cell and all these libraries will be installed.

In the second step, the main library we use is transformers, and internally we also use peft, which contains the LoRA configuration and the PEFT model. I know you may not fully understand PEFT yet; in short, when it applies transfer learning to these LLM models, PEFT freezes most of the model's weights and only a small set of weights is retrained, and based on that it can still give you accurate results on your custom dataset. How exactly that is done will come in a dedicated video on the mathematical intuition. In the imports you can see we import os and torch, we use a dataset (I will tell you which one shortly, since we are fine-tuning an open-source LLM), and from transformers we import AutoModelForCausalLM, AutoTokenizer and the bitsandbytes configuration; I will talk about these as we go ahead, so let me quickly execute it.

While the imports are running, let's talk about some important properties of Llama 2. For the chat models, Llama 2 uses a specific prompt template: the whole instruction is wrapped in [INST] tags, the system prompt sits inside <<SYS>> ... <</SYS>> tags, then comes your user prompt, and the model's answer follows after the closing of the instruction. This is the format in which the Llama 2 chat LLMs expect the system prompt, the user prompt and the model answer.
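Written out as a small helper, a sketch of the template just described (with hypothetical argument names) looks like this:

```python
def to_llama2_chat(system_prompt: str, user_prompt: str, answer: str) -> str:
    """Format one training example in the Llama 2 chat template described above."""
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_prompt} [/INST] {answer} </s>"
    )

print(to_llama2_chat(
    system_prompt="You are a helpful assistant.",
    user_prompt="What is quantization?",
    answer="Quantization converts weights from a higher to a lower precision format.",
))
```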
Any dataset you get needs to be converted into that format, and I will show you how to do it; there is a ready-made approach, and you can also write your own custom code, there are many ways. So what we will do is reformat our instruction dataset to follow the Llama 2 template. We are going to use the OpenAssistant Guanaco dataset (I hope I am pronouncing that right). In the raw data you can see rows like: Human: "Can you write a short introduction about the relevance of the term monopsony in economics? Please use examples related to this," followed by Assistant: "Monopsony refers to a market..." So the dataset is in a human/assistant form: the human asks a question and the assistant provides the answer, and you will find rows like this in different languages. We take this whole dataset and reformat it to follow the Llama 2 template, and out of all the samples, around 10K records, we take only 1K, because my aim here is just to show you how the fine-tuning is done. If you open the reformatted dataset you will see every row converted into that format: the instruction inside the [INST] tags, the answer after it, and the whole thing closed off. How do you do the conversion? There is a notebook linked that shows exactly how this dataset was created: it loads the dataset, takes the thousand records and transforms them, basically simple Python code to put the text in that specific format; with one click you can reproduce it, and all the links are given.

Again, understand that I will create a dedicated theory video covering all the maths behind these techniques; here we are just seeing how you can run your own fine-tuned model. Also note: you don't need to follow a specific prompt template if you are using the base Llama 2 model, but here we are using the chat model, so we do. These steps work with other models too, not only Llama 2, but the format of the instruction and of your prompts may change.
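A sketch of that reformatting step might look like the following; the dataset id and the field/turn markers are assumptions based on the OpenAssistant Guanaco dataset described above, so check them against the notebook linked in the video:

```python
from datasets import load_dataset

# Assumed dataset id for the OpenAssistant Guanaco instruction data.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
dataset = dataset.shuffle(seed=42).select(range(1000))   # keep 1K samples for the demo

def to_llama2(example):
    # The raw text alternates "### Human:" / "### Assistant:" turns; rewrite the first
    # exchange into the Llama 2 [INST] ... [/INST] format shown earlier.
    human, _, assistant = example["text"].partition("### Assistant:")
    human = human.replace("### Human:", "").strip()
    return {"text": f"<s>[INST] {human} [/INST] {assistant.strip()} </s>"}

dataset = dataset.map(to_llama2)
print(dataset[0]["text"][:200])
```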
optimizer states. Free Google Colab offers roughly a 15 GB GPU, which is barely enough just to store the Llama 2 7B weights, and on top of the weights you also need memory for the optimizer state, the gradients and the forward activations, so full fine-tuning is simply not possible here. That is why we need a parameter-efficient fine-tuning (PEFT) technique: PEFT freezes most of the weights of the LLM, here Llama 2, and only a small set of weights, after applying quantization, is actually trained. I will go through the PEFT research paper in my next video. The techniques we use are LoRA and QLoRA; LoRA is low-rank adaptation of large language models, and again, if you don't know the maths behind it yet, I will explain it in an upcoming video. So, first we will load the Llama-2-7b-chat-hf model, train it on the 1k samples, and that will produce a fine-tuned model we will call llama-2-7b-chat-finetune. For QLoRA we use a rank of 64 with a scaling parameter (alpha) of 16; both are hyperparameters you can tune. We will load the Llama 2 model directly in 4-bit precision, converting the 32-bit weights down to 4 bits for training. So the model name is the Llama-2-7b-chat-hf repository, the instruction dataset is the 1k-sample dataset we just discussed (both downloaded from Hugging Face), and after fine-tuning the new model name will be llama-2-7b-chat-finetune. Then come the QLoRA parameters: lora_r = 64 (the rank), lora_alpha = 16 (the scaling factor) and a LoRA dropout value. For quantization we use the bitsandbytes parameters: load_in_4bit = True activates 4-bit loading of the base model, the 4-bit compute dtype is float16, the quantization type is set to nf4 (rather than fp4), and nested quantization for the 4-bit model is kept off. These are the basic parameters used by the LoRA/QLoRA technique inside PEFT, and they are collected in the sketch below.
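Collecting those names in one place, here is a hedged configuration sketch; the variable names follow the common QLoRA fine-tuning notebook this walkthrough is based on, and the exact Hugging Face repository ID is an assumption.

```python
# Model / dataset names (the exact Hugging Face repos are assumptions)
model_name = "NousResearch/Llama-2-7b-chat-hf"
dataset_name = "mlabonne/guanaco-llama2-1k"   # the pre-formatted 1k-sample dataset
new_model = "llama-2-7b-chat-finetune"

# QLoRA parameters
lora_r = 64          # rank of the low-rank update matrices
lora_alpha = 16      # scaling parameter
lora_dropout = 0.1

# bitsandbytes parameters (4-bit quantization)
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False
```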
Now the training-argument parameters: the output directory is ./results, we run one epoch, and fp16/bf16 training can be enabled; bf16 should be set to True only on an A100, and right now I'm on a T4, but if you have the paid version of Colab with an A100 you can turn it on. Then come the per-GPU batch size for training (I hope you know what batch size is) and for evaluation, the gradient accumulation steps, checkpointing, max_grad_norm, the learning rate and weight decay, the optimizer (paged AdamW, a variant of Adam), and the learning-rate scheduler type, cosine. Maximum steps is -1, meaning the number of training epochs decides when to stop rather than a step count, and logging_steps is 25. For the supervised fine-tuning part you also need a few parameters: the maximum sequence length, packing, and the device map, which here loads the entire model on GPU 0. Don't worry about memorizing every parameter; all of this is provided on the official page and I have just copied it over, so let's execute it and all these parameters are set. Now comes the next step: load everything and start the fine-tuning process. First we load the dataset; ours is already pre-processed, but normally this is where you would reformat the prompts, filter out bad text and combine multiple datasets, since some amount of pre-processing is usually required. Then we configure bitsandbytes for 4-bit quantization: as I said, we go from 32 (or 16) bits down to 4 bits so the model needs less GPU memory for fine-tuning. Next we load the Llama 2 model in 4-bit precision on the GPU together with the corresponding tokenizer, and finally we load the QLoRA configuration and pass everything to the SFTTrainer, which is where the supervised fine-tuning happens. So: we load the dataset, resolve the compute dtype (bnb_4bit_compute_dtype looked up through torch), and build the BitsAndBytesConfig with load_in_4bit, the compute dtype and the nested-quantization flag; again, there is nothing new to learn here, because these exact arguments are in the official documentation. Then we check GPU compatibility: if the compute dtype is torch.float16 and 4-bit loading is on, the code checks whether your GPU also supports bfloat16, in which case you could accelerate training with bf16.
Then we load the base model: whenever we want to load a model from Hugging Face we can use AutoModelForCausalLM (that is why we imported it at the top), calling .from_pretrained with the model name, quantization_config set to the BitsAndBytesConfig we just built, and device_map so the model is mapped onto the GPU. We also set model.config.use_cache = False (you can make it True if you want) and model.config.pretraining_tp = 1. Next we load the Llama tokenizer: every LLM needs a tokenizer so that the input text can be converted into token IDs and embeddings, so again it is AutoTokenizer.from_pretrained with the same model name, plus trust_remote_code as an additional argument. We set the pad token to the end-of-sequence token (tokenizer.pad_token = tokenizer.eos_token, the token Llama itself uses) and set the padding side to "right", which fixes a weird overflow issue with fp16 training. All of these settings stay pretty much fixed; the only things you would really change are the configuration values. Then we load the LoRA configuration: LoraConfig from PEFT with all the LoRA values we set earlier, and that becomes our PEFT config. The most important piece is the TrainingArguments, where we set the output directory, the number of epochs, the learning rate, fp16/bf16 and so on, and report the logs to TensorBoard. Finally comes the supervised fine-tuning trainer: SFTTrainer takes the model, the dataset, the peft_config (which holds the LoRA config), the dataset text field, the tokenizer, the training arguments and the packing flag, and then we call the trainer. A condensed sketch of this whole loading-and-trainer setup is shown below.
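Putting the pieces of the last few cells together, here is a condensed, hedged sketch of the setup (it continues from the parameter sketch above). The hyperparameter values mirror the commonly used QLoRA notebook rather than anything confirmed on screen, and it uses the older trl SFTTrainer signature with tokenizer/dataset_text_field arguments; newer trl versions move several of these into SFTConfig.

```python
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

dataset = load_dataset(dataset_name, split="train")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0},          # put the whole model on GPU 0
)
model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"   # avoids an overflow issue with fp16 training

peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,            # assumed value, typical for QLoRA runs
    weight_decay=0.001,
    fp16=False,
    bf16=False,                    # set bf16=True only on an A100
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=25,
    report_to="tensorboard",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_args,
    packing=False,
)
```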
That call is where the supervised fine-tuning actually happens. Let me quickly recap the steps: we loaded the dataset, set the compute dtype, set up the quantization, checked GPU compatibility, loaded the Llama 2 model, loaded its tokenizer and set the padding, built the LoRA configuration as a PEFT config, wrapped the output directory, learning rate and the rest into the training arguments, set the supervised fine-tuning parameters (model, dataset, PEFT config, text field, max sequence length, tokenizer) and finally started training. With logging every 25 steps and a batch size of 4 this run goes for 250 steps, so let it start; you can see the model downloading and the dataset loading, and the table of contents lists all the steps side by side so you can read along. I know this looks a little tough, and I won't claim it is easy; the reason I'm showing the fine-tuning first is so you get the overall pattern in your head. Execute it, get a high-level picture of how things work, and in the next video I will break down every piece: what PEFT is, what quantization and precision are, what QLoRA is, what the low-rank index is and how to calculate it. The whole fine-tuning on the thousand records takes roughly 15 to 20 minutes, so let's wait and then check whether we get good results. And finally, the 250 steps have completed: it took about 25 minutes on free Colab, while on the paid version it would probably take 5 to 10 minutes. You can see the global step was 250, the training loss came down to about 1.36, and the metrics show the runtime, training samples per second and so on. Also note the term "flos" (total floating-point operations) in that output, because I will come back to it in the next video. Once training is done we save the trained model under the new model name, llama-2-7b-chat-finetune, and all the run results are available under the results directory as well.
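The two calls behind that, starting the run and saving the adapter weights under the new name, look roughly like this:

```python
trainer.train()                              # runs the 250-step fine-tuning
trainer.model.save_pretrained(new_model)     # saves the LoRA adapter, not the full base model
```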
One thing this fine-tuning also produces is something called an adapter model; please remember that word, because in the upcoming theoretical video we will discuss it in detail. So we save the model with trainer.model.save_pretrained under the new model name. You can also inspect the run in TensorBoard, but let me quickly show you how the fine-tuned model generates text. I create a prompt, "What is a large language model?", and use a pipeline: the pipeline ties together the task (text generation), the model we just fine-tuned and the tokenizer, and we can keep max_length at around 200 to 250. As I always point out for Llama 2, the prompt has to follow its format: an <s> token, the [INST] instruction tags, and my prompt inside them. The response comes back in the result variable as a list, and inside it there is a field called generated_text. Keep an eye on the RAM and disk usage panel while this runs, because the session is already using a lot of its space, and if you want quick responses you obviously need a good GPU; afterwards you can also free the VRAM. In a later step you can even push the model to the Hugging Face Hub, but I'll keep that for a complete project. And here is the output: "A large language model is a type of artificial intelligence...", along with some examples of large language models. Let's also take an example straight from the 1k dataset, "How to own a plane in the United States", paste it in as the prompt and run it again. The answer comes back based on the information in the dataset ("determine your budget..."), although since max_length is only about 200 I can only see roughly the first 200 tokens of it. A sketch of that generation cell is shown below.
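A hedged sketch of the generation cell, reusing the pipeline import from earlier; the prompt wrapping follows the Llama 2 instruction format described above.

```python
prompt = "What is a large language model?"
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=200,
)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]["generated_text"])
```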
You can keep trying it with different prompts as you go ahead. Now, one of the most interesting things in data science and generative AI is the amount of research happening right now; every day something new comes out that benefits the whole community working with LLMs. Today I saw an amazing research paper titled "The Era of 1-bit LLMs", so I'm going to talk about that paper: what a 1-bit LLM is and why it is so much more advantageous compared to 32-bit or 16-bit LLMs. One more thing I want you to take away from this video is how to read a research paper, what to highlight and how to approach it. You cannot understand a paper just by reading it cold; you need some basic knowledge first, and if you have been following my tutorials you know that whenever I make a video I go through the research papers, simplify the concepts and then explain them. So let's understand the 1-bit LLM. If you remember, in my previous video we already covered quantization. Say I have an open-source model like Llama 2 with 7 billion parameters, and by parameters I mean weights. If my machine does not have a high-end configuration, with limited RAM and GPU, we perform quantization: we take the Llama 2 model, which is stored in FP32, and convert it to int8. Once we do that, the model size shrinks, so we can load it and run any task, and we can also fine-tune it with LoRA and QLoRA, which I covered in the previous video (links in the description). Now the question is: what is a 1-bit LLM? With quantization we go from 32 bits down to 16 or 8 bits, but if we could go all the way down to about one bit, we would essentially never be resource-constrained again: with limited RAM, GPU and storage you could do everything from fine-tuning to inference, and that is what is so exciting about this.
Right now we only have the research paper, but once the implementations start appearing, trust me, it will be amazing for the whole community working with LLMs. That was the brief idea; now let's discuss what a 1-bit LLM actually is and, to be precise, why the paper says every large language model can be brought to 1.58 bits. There is a lot to go through, so please read along with me until the end; this will also give you a feel for how to read a research paper. The paper introduces a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary. So BitNet is the name of the 1-bit model, and ternary means each weight is no longer a 32-bit or 16-bit float; it can take only three values: -1, 0 or +1. The paper claims it matches a full-precision Transformer of the same model size and number of training tokens, both in perplexity (roughly, how well it predicts the next token for a given query) and in end-to-end task performance, while being significantly more cost-effective in latency, memory, throughput and energy consumption. Those are exactly the constraints every huge LLM suffers from, whether it has 7 billion or 70 billion parameters, and they apply both to inference and to fine-tuning; you will see in a moment why using just the three values -1, 0 and 1 improves things so much. So how are these ternary values used? Think of the weights of the original Transformer LLM: when we go to a 1-bit LLM, every one of those values is replaced by one of the three values -1, 0 or +1. That is what the paper calls BitNet b1.58, and it is described as a Pareto improvement; how exactly it happens I'll explain in a moment, but the conversion itself is done by a quantization step.
That quantization is applied to the weights to turn the original values into ternary ones. Now the most important point: what do we gain by converting the values like this? In any fine-tuning, forward pass or backward pass, the model weights are multiplied by the inputs and then summed to produce the output, and we also add a bias, so the operation is essentially y = sum over i of w_i * x_i + b. With 16-bit floating-point weights, every one of those weight-input multiplications actually has to be computed before the summation. But if all the weights are -1, 0 or +1, the multiplication step stops mattering: anything multiplied by 0 is 0, anything multiplied by +1 is itself, and anything multiplied by -1 is just itself with the sign flipped. So instead of multiply-then-add, you are effectively doing only additions and subtractions. And if you only need additions, you need far less GPU: the reason dense matrix multiplication is so expensive is that many multiplications have to happen across all the different weights and then all those products have to be summed. That is the Pareto improvement the paper talks about, reducing inference cost, latency, memory and energy while maintaining model performance, and this new computation paradigm of BitNet b1.58 calls for designing new hardware optimized for 1-bit LLMs. A tiny toy example of why the multiplications disappear is shown below.
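To make that concrete, here is a toy illustration of a single dot product with ternary weights (my own example, not from the paper): every term is either added, subtracted or skipped, with no multiplications.

```python
w = [1, -1, 0, 1]           # ternary weights, each in {-1, 0, +1}
x = [0.5, 2.0, 3.0, 1.5]    # input activations

# Full-precision style: sum(w_i * x_i). Ternary style: add, subtract or skip.
acc = 0.0
for wi, xi in zip(w, x):
    if wi == 1:
        acc += xi
    elif wi == -1:
        acc -= xi
    # wi == 0 -> that input is filtered out entirely

print(acc)   # 0.5 - 2.0 + 1.5 = 0.0
```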
I know this is a research paper and reading it line by line can feel a bit boring, but trust me, this is how you need to go through it. Let me pull out the points I have highlighted. LLMs have shown remarkable performance on a wide range of natural language processing tasks, but their increasing size poses challenges for deployment and raises concerns about the environmental and economic impact of their high energy consumption; that is the known problem with today's LLMs. One approach is post-training quantization to create low-bit models for inference (we already discussed this: quantization, LoRA, QLoRA); it reduces the precision of the weights and activations, significantly cutting the memory and compute requirements, and the trend has been moving from 16-bit to ever lower bit widths such as 4-bit. That is the current state of regular LLMs. Now, what can a 1-bit model architecture solve? Recent work on 1-bit architectures such as BitNet presents a promising direction for reducing the cost of LLMs while maintaining performance. Vanilla LLMs store weights in 16-bit floating point, and the bulk of any LLM is matrix multiplication, so the major computation cost comes from floating-point multiplication and addition; in contrast, the matrix multiplication of BitNet involves only integer addition, for exactly the reason I explained above. And since the fundamental limit on compute performance in many chips is power, that energy saving can be translated directly into faster computation. Because the ternary values -1, 0 and 1 give three possible states per weight instead of two, the information per weight works out to log2(3), roughly 1.58 bits, which is where the name b1.58 comes from. The paper also lists two additional advantages: first, the modelling capacity is stronger because the inclusion of 0 in the weights gives explicit support for feature filtering (anything multiplied by zero is dropped), which can significantly improve performance; and second, their experiments show it can match full-precision baselines on end-to-end tasks starting from a 3B model size. Finally, how does the transformation happen, that is, how are the original numbers converted into these three values? It is done by a simple quantization function the paper calls absmean quantization, and the formula is given right there; a small numeric reproduction of it is sketched below.
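As a quick numeric illustration (my own toy reproduction of the formula, not the BitNet implementation): scale the weight matrix by its mean absolute value, round, and clip to {-1, 0, +1}.

```python
import numpy as np

def absmean_ternary(W, eps=1e-5):
    gamma = np.mean(np.abs(W)) + eps              # average magnitude of the weights
    return np.clip(np.round(W / gamma), -1, 1)    # every weight becomes -1, 0 or +1

W = np.array([[ 0.23, -0.71,  0.05],
              [-1.20,  0.42,  0.02]])
print(absmean_ternary(W))
# [[ 1. -1.  0.]
#  [-1.  1.  0.]]
```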
Applying that function converts every weight to one of the three values -1, 0 or 1. There is also one architectural change to the Transformer: nn.Linear is replaced with a BitLinear layer, and the model is trained from scratch with 1.58-bit weights and 8-bit activations. Now let's look at the results the paper reports. Comparing a LLaMA model and BitNet at the same parameter count (700 million, 1.3 billion and so on), the tables show the memory footprint and inference latency dropping noticeably for BitNet while the perplexity stays comparable, and the same pattern holds as the number of parameters grows. There is also a plot of model size against latency and memory, with the blue curve for the LLaMA model and the orange one for the 1-bit model, and you can see how large the latency and memory gaps are. This is still just a freshly published research paper, but I am genuinely excited about it, because a lot is going to come out of this; so welcome to the era of 1-bit LLMs, and I expect Hugging Face will be among the first to implement it so you can easily build applications with it. Moving on: Google is back with a bang and has released its own open-source LLM, called Gemma, joining the race of open-source models; until now the most capable open model we had was Meta's Llama 2. In this part we will look at a practical application and then fine-tune with the Gemma model on a custom dataset. Why did Google build it? Their blog says Gemma is built for responsible AI development from the same research and technology used to create the Gemini models, and adds that at Google they believe in making AI helpful for everyone, with a long history of contributing to open source communities such as Transformers, TensorFlow, BERT, T5, JAX, AlphaFold and AlphaCode. Google has indeed been doing open-source research for many years, and Meta is in the same race; both are doing a fabulous job. So let's learn more about Gemma (I'm not sure of the exact pronunciation, but Gemma it is), and the main thing you really need to look at first is the performance metrics.
There are two Gemma models, a 7-billion-parameter one and a 2-billion-parameter one, while Llama 2 comes in 13B and 7B variants. Looking at the benchmark table, Gemma holds up very well: the blog shows the 7B model at 64.3 on the general benchmark, around 55.1 and 81.2 on the reasoning benchmarks, and 46.4 on math, which in those comparisons puts it ahead of the Llama 2 7B model; you can read the full details in the blog post. It is aimed at research use, you can build your own models with it, and both checkpoints, gemma-7b and gemma-2b, are already available on Hugging Face. These are gated models: when you open the model page you have to accept the terms and conditions (essentially a checkbox confirming you understand the license) before you are granted access. The aim here is to show you practically how to access the model, how to use it, and how to fine-tune it; this is a simple use case, but you can build some amazing things on top of it, and I plan to redo some of the use cases I have built with the paid OpenAI models using Gemma plus fine-tuning. So to start, I install the required libraries: bitsandbytes, peft, trl, accelerate, datasets and transformers. (If you haven't seen it, I already have a fine-tuning video using the Llama 2 model, and the same steps and process apply here.) I hit a locale/UTF error on the first install, so I restarted the runtime, disconnected and deleted it, reconnected, and ran the pip install again. To recap what these libraries are for: bitsandbytes is the library used for quantization, and I have already made a dedicated video on what quantization is and why it helps.
The idea is this: say you have a huge model, here the Gemma model with 7 billion or 2 billion parameters, and you want to load it in Google Colab. I'm on the premium version of Colab right now with roughly 50 GB of RAM and a couple of hundred GB of disk, but on the free tier you hardly get 15 GB of RAM, and loading the full model there is just not possible. What quantization does is convert the weights (and biases), which are normally stored as 32-bit floats, into 8-bit or 16-bit or lower values, so that far less memory is needed. That whole process, including the maths behind it, is explained in my quantization video, and the link is in the description. Next we import the libraries we need: os, transformers, torch, the Colab userdata helper, datasets, the SFTTrainer (used for supervised fine-tuning), LoraConfig from PEFT since we will use the LoRA technique, and from transformers AutoTokenizer, so that the right tokenizer for the model gets loaded, AutoModelForCausalLM, so we can fine-tune a causal language model, BitsAndBytesConfig for the quantization settings, and GemmaTokenizer, which is Gemma's own tokenizer (you don't strictly have to use it; AutoTokenizer works too). One more step before we can do anything: the Gemma model lives on Hugging Face, and to download a gated model we need an access token. Go to your Hugging Face settings, open the Access Tokens page, and copy the token. Then store it in Colab: click the key icon in the sidebar, add a new secret, and name it HF_TOKEN with the value you just copied. Once it is saved there, any notebook you work in can read that token.
You read it with userdata.get("HF_TOKEN") and assign it to os.environ["HF_TOKEN"]; the first time you do this Colab asks you to grant the notebook access to the secret, and from then on Hugging Face knows which account is pulling the model, so you can download any gated model you want, Llama 2, Gemma or anything else. Now let's load the model. I'm going to use google/gemma-2b; earlier I showed you the 7B page, but I'll load the 2-billion-parameter variant, since 7B works too but takes longer to fine-tune. Then I build the BitsAndBytesConfig. The first parameter is load_in_4bit=True, which means the Gemma weights, originally stored in 32-bit, get converted to 4-bit; that is the quantization step. The second is bnb_4bit_quant_type="nf4", the 4-bit NormalFloat type, which controls how the conversion to 4 bits is done (I've linked an article, "What is 4-bit quantization and how does it help Llama 2", if you want the details; it says Llama 2 because we are talking about open-source models). The third is bnb_4bit_compute_dtype=torch.bfloat16: quantization squeezes a big model into a small one and loses some information, so to balance that, the computations and the weight updates during fine-tuning are kept in 16-bit. With the model ID and this configuration in place, I load the tokenizer with AutoTokenizer.from_pretrained, passing the model ID and the HF token, and then the model itself with AutoModelForCausalLM.from_pretrained, passing the model ID, quantization_config set to the bitsandbytes config, device_map set to GPU 0 so everything goes onto the GPU the Colab session is connected to, and the HF token again. A hedged sketch of these cells is shown below.
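Here is a sketch of the token setup and the 4-bit Gemma load as described; the secret name HF_TOKEN and the token= keyword are assumptions (older transformers versions used use_auth_token= instead).

```python
import os
import torch
from google.colab import userdata
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")    # Colab secret created earlier

model_id = "google/gemma-2b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize the 32-bit weights down to 4-bit
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.bfloat16  # computations / updates in 16-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ["HF_TOKEN"])
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},                    # load everything on GPU 0
    token=os.environ["HF_TOKEN"],
)
```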
Once I run that, the model gets downloaded and everything is held in the model object, so let's test it. I give it a piece of text, "Quote: Imagination is more", set the device to cuda:0 (whatever GPU is available), and pass the text through the same tokenizer, which turns it into input tensors. Then I call model.generate on those inputs with max_new_tokens set to around 20. The output comes back as token IDs, so tokenizer.decode(outputs[0], skip_special_tokens=True) turns it back into text, with skip_special_tokens stripping any special tokens. Running it once I get "Imagination is more than knowledge. I'm a self-taught artist, born in 1985...", and running it a second time I get "Imagination is more important than knowledge. Knowledge is limited...", attributed to Albert Einstein, a quote you can find on the internet; so in one run I get the completion and in the other I also get the author name. Whatever text you put in, it will complete it based on what the model knows. The generation cell is sketched below, and after that I'll show you the fine-tuning part.
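A minimal sketch of that generation check, reusing the tokenizer and model loaded above:

```python
text = "Quote: Imagination is more"
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```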
First there is an os.environ entry for WANDB_DISABLED that the documentation sets; honestly I still don't fully understand why that particular flag is needed, but once I do I'll let you know, since I'm going to build a lot of projects with this. Then we do the fine-tuning with a LoRA configuration. What does it need? First a rank: LoRA is a rank-decomposition technique, and here the rank value is 8; you could pick 16 or 64 or any other number, and to really understand what rank decomposition means I'll make a dedicated theory video soon. Then the target_modules: the query/key/value/output projections plus the gate, up and down projections; again, the LoRA video will explain why these are the targets. And the task_type is kept as CAUSAL_LM, meaning it is set up for the causal language-modeling task. So that is the LoRA configuration; let's execute it and move on to the data. The dataset I'm going to use is Abirate/english_quotes on Hugging Face: it has two fields, the quote and the author. The Gemma model already knows some quotes from the internet along with their authors; the idea is to fine-tune on this dataset and then have the model complete a quote and identify its author. So I load the dataset, map the tokenizer over the quote field, and take the train split; if I print data["train"]["quote"] I can see all the quotes (and likewise the authors), "There is nothing noble in..." and so on. Whenever we do supervised fine-tuning we also need a formatting function that tells the trainer what the training text looks like: here it builds a string from example["quote"][0], the quote sentence, and example["author"][0], the corresponding author, so the quote is the input and the author is what we want the model to produce, and the function returns that formatted text. A sketch of the LoRA config, the dataset load and the formatting function is shown below.
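Here is a hedged sketch of those cells, modeled on the standard Gemma PEFT example; the exact target-module names and the "Quote:/Author:" format come from that example and may differ slightly from what is on screen.

```python
from datasets import load_dataset
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                   # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

def formatting_func(example):
    # Input is the quote, the expected continuation is its author
    text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}"
    return [text]
```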
Now it is time for the SFTTrainer. We pass the model, the training data data["train"] (its features are the quote, the author, the tags, and the tokenized input_ids and attention_mask), and the training arguments: per-device batch size 1, gradient accumulation steps 4, a couple of warm-up steps, a maximum of 100 steps, the learning rate, fp16=True, the output directory and the optimizer; the peft_config is our LoRA configuration, and the formatting function we just wrote is passed in so the trainer gets the text in that quote/author form. Execute it, call trainer.train(), and the 100 steps start running; notice how fast it goes, because this is a 2-billion-parameter model quantized to 4-bit and we are training on just a small sample. You can watch the loss coming down step by step, from around 3.9 to roughly 1.3; it wobbles up a little near the end, but the trend is downward, and if you still want to reduce the loss you can simply increase the number of steps. Now let's test it with a quote from the dataset, "A human is like a tea bag", whose author should be Eleanor Roosevelt, by tokenizing it on cuda:0 and calling model.generate on the fine-tuned model, just like before; the trainer setup and that test are sketched below.
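A condensed sketch of the trainer and the test generation, again following the standard Gemma PEFT example; the values mentioned above are kept, and anything not explicitly mentioned (learning rate, warm-up steps, optimizer name) is an assumption taken from that example.

```python
import transformers
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)
trainer.train()

# Test the fine-tuned model on a quote from the dataset
text = "Quote: A human is like a tea bag;"
inputs = tokenizer(text, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```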
Let's execute it, and I should get the author name back. And there it is: the completion comes out as "...you can never know how strong she is until you put her in hot water", which matches the quote in the dataset, and the author name comes through as well. Let me take one more example, "The opposite of love is not hate, it's...": the model completes it with "indifference" and continues with a related line about fear and freedom, so it is close to the quote in the dataset even if not word-for-word. One more, something unique enough that only the dataset would really have it: "Outside of a dog, a book is man's best friend", and again the completion plus an author comes back. Remember that max_new_tokens is only 20, so the model also has to finish the sentence within that budget, but you can see the pattern: the fine-tuning is working, and the more data and steps you give it, the better it gets. You can try it with all sorts of examples, and going forward I'll redo the applications we have built, like text-to-SQL and the invoice app, with this model and with fine-tuning. Now, one challenge I see when you are building an LLM application is the dependency on many different tools. Say I'm using an OpenAI model and I also want to integrate my LLM app with external sources such as documents, the Google Search API or the Wikipedia API; for each of those I need a separate integration and a separate key stored somewhere. If I want vector embeddings I need credentials for a vector database like ChromaDB or Pinecone, and if I want to connect to Cassandra on DataStax I need yet another connection point. Building that whole LLM pipeline from scratch takes a lot of time, and you have to manage all of that configuration yourself. So in this part I'm going to talk about an amazing platform called Vext, which simplifies this LLM Ops work and lets you build LLM pipelines without writing any code: document Q&A and vector embeddings are built in, and if you depend on external sources
like Wikipedia, or on Google Search, everything is available there too, so you don't have to store all the configuration or create all the APIs yourself; with one API you can build the entire LLM pipeline and use it in your chatbot or LLM application wherever you want. Let's walk through it; you can start completely for free. I'll log in (I've already signed in with my email) and show you everything step by step, because it's a genuinely useful platform, so please watch till the end. Once you log in you land on a dashboard; if you haven't signed up yet, do that first. Then go to the AI Projects section and create a new AI project. I'll name mine something like "rag system", since that is what I'm building, and make sure you enable it, otherwise the application won't start. In the pipeline you initially see two sections, Query and Output: Query is the event that starts the flow, and Output is the final response you get back. This is great for someone who does not know much coding; you can still build the whole LLM application and pipeline without needing a developer. Between Query and Output there is a plus sign, and clicking it gives you the options to add an action: "generate response" (a response from a specific LLM), "search dataset" (add your own data for a document Q&A or RAG system), or execute a function / smart function, which I'll explain later. Let me first add the search-dataset action: I have some PDF files and I want to build a document Q&A, that is, a retrieval-augmented generation system, over them. So I go ahead and create a dataset, and I'll call it "research papers".
Now, inside this research-papers dataset, I'll add the data itself. You have multiple options: you can upload plain text, upload a file, use a crawler, or even pull files from Google Drive, Notion, or Confluence, and more options are on the way. As I said, people who aren't very familiar with coding LLM apps can use this directly. Let me upload a file: I click "upload a file" and choose one of the research papers — this one is the PEFT paper, the parameter-efficient fine-tuning work behind the LoRA and QLoRA configurations we use when fine-tuning LLM models — and click upload, then "add resource", and the PDF appears in the dataset. Let me add one more: I click "add source" again and upload "Attention Is All You Need", the Transformer paper, as a second file. So inside this dataset I now have these two PDFs, and I could add any number of research papers — people always ask, "Krish, can we only add one?" — no, you can add any number, but each PDF must be at most 5 MB.

Now let me go back to my project, the RAG system, which so far only has the Query and Output steps in its pipeline. I add the "search dataset" action; as soon as I do, it asks me to select one of the datasets I've created, so I pick "research papers", and both files get attached. Then there is a Save button in the right corner; I click Save. Once saved, the data ingestion that is required has happened and the data is available inside the search-dataset step. Since I want a RAG or document Q&A system, I should now be able to ask any question and get a response from this data. Internally, the system has already created all the embeddings it needs and, where a vector store is required, it has created that too.
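For intuition, here is a rough sketch of the ingestion work a platform like this automates when you "add a dataset": chunk the documents, embed the chunks, and store the vectors so they can be searched later. This is not Vext's actual code — it is just an illustration using Chroma's default embedding model, and the chunk texts and IDs are placeholders.

```python
# Illustrative only: roughly what a no-code RAG platform does internally when
# you add a dataset -- embed document chunks and store them for retrieval.
import chromadb

client = chromadb.Client()  # in-memory vector store
collection = client.create_collection(name="research_papers")

# Placeholder chunks; in practice these would come from parsing the uploaded PDFs.
chunks = [
    "PEFT adapts large pretrained models by training a small number of extra parameters.",
    "The Transformer relies entirely on attention mechanisms, dispensing with recurrence.",
]
collection.add(documents=chunks, ids=["peft-0", "attention-0"])  # Chroma embeds these with its default model

# Later, a user query is embedded the same way and matched against the stored chunks.
results = collection.query(query_texts=["What is parameter efficient transfer learning?"], n_results=1)
print(results["documents"][0][0])
```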
Coming to the next step, let me add "generate a response", which means adding an LLM model: any question I ask against my dataset should be answered and summarized by whichever LLM I select here. You get several model options — Azure OpenAI GPT-3.5, Azure OpenAI GPT-4, Anthropic Claude Instant, Anthropic Claude, Google Gemini Pro — and more models such as Llama 2 are coming in the future. The nice thing is that these models are hosted in the platform's cloud and provided to you as a service. I'll select Azure OpenAI GPT-4. The next field it shows is "purpose", which is essentially where you write your prompt, i.e., how you want the model to behave. I'll write: "You are a helpful assistant; please answer the questions based on the context." There is also a "behavior" field for anything additional you want on top of that prompt — for example "be as specific as possible, be very helpful and assist the user". I'll just add something short like "be as helpful as possible" and save it with the Save button at the bottom right.

That is my entire LLM application: any query will first be searched against the dataset and then a response will be generated. To quickly check whether it works, there is a Playground. In the playground I can ask questions about the two research papers I added — one on parameter-efficient transfer learning, the other "Attention Is All You Need". So I ask "What is parameter efficient transfer learning?" and click Send. What happens now is: the query first searches the dataset, which is already stored as vector embeddings; whatever context is retrieved is summarized by the LLM according to the prompt; and finally we see the output. Keep in mind the answer is based on whatever context can actually be retrieved from the dataset. While waiting, I queue up more questions, for example about the experiments and the instantiation for the Transformer network. And here the response has arrived: parameter-efficient transfer learning
refers to a method used in machine learning, and specifically natural language processing, that allows for the transfer of knowledge — the full answer comes through. Let me also ask about "Attention Is All You Need", since I added that paper too: "please summarize Attention Is All You Need", and Send. The best part is that this same functionality can be integrated into any chatbot you want — a Telegram channel, WhatsApp, any channel — and I'll show that option as well. The answer arrives: "Attention Is All You Need" is a seminal research paper by Ashish Vaswani and colleagues. So we are getting answers grounded in the dataset, and notice that you did not have to write many lines of code — it was simply drag and drop, adding the response step and the dataset.

Now let me add one more kind of step. "Execute a function" gives you options like basic maths, Wikipedia search, Google search, and arXiv — say you want to query arXiv, where all the research papers are hosted. You can use any of these. There is also the "smart function", and inside a smart function you can add multiple functions together: I'll add Google search (the description says it allows users to perform searches on Google), and I could also add arXiv and Wikipedia. For this test I'll remove the search-dataset step from my first flow, since I don't want to use the dataset, and keep just the smart function. Back in the playground, if I ask "what is machine learning?" and press Enter, it performs the Google search and returns the answer. The very first request after setup takes some time, but subsequent requests are much faster. Here it is: machine learning is a branch of artificial intelligence, and so on. Let me also ask "who is Krish Naik?" — if it is using the Google Search API correctly it should be able to answer, and it does: Krish Naik is a YouTuber and data scientist known for his educational content on machine learning. All the answers are coming through nicely.

Now the obvious question: "Krish, this is all inside the platform, but how do I use this functionality in my own code, say inside my chatbot?" Let me show that. If you click on Output — and this is the most important part — you get the HTTP request, and you can also get
the entire POST request. Once you have that POST request, you just need to set the API key, include your question in the payload, and hit the given URL along with a channel token, which should be some unique name — I'll show you, and I'll also give you the code for making this POST request from Python. First I need an API key. To create one, I go to the API-keys section and click "create an API key"; I name it "test" and tie it to my AI project, the RAG system. Make sure you save the API key — I copy it and keep it aside. Then I go back to my AI project and take the POST request snippet. Notice that for this entire pipeline I only need this one API: with it I can do the POST and everything else, and I no longer have to worry about vector embeddings or about separate keys for Google Search, arXiv, or Wikipedia — just this one endpoint.

I've written the same cURL POST request as a plain Python script. In the code I set my API key (the one shown is an older key of mine — don't bother using it, I'll be deleting it); I write the query, "what is machine learning"; I set the headers with Content-Type application/json plus the API key, exactly as the snippet shows; and the payload carries my query. All I did was take that cURL POST, paste it into ChatGPT, and it gave me this Python version. Then there is the URL, and in place of the channel-token placeholder I put a unique value, "krishnaik06" — you can put anything there. Finally I call requests.post with the URL, the JSON data, and the headers, and print response.text.
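Here is a minimal sketch of that script. The URL pattern, header name, and payload field are assumptions based on the HTTP snippet the platform shows in its Output panel, so copy the exact values from your own project rather than from here; the API key, endpoint ID, and channel token below are placeholders.

```python
# Minimal sketch of calling the generated endpoint; copy the exact URL and
# header format from the HTTP request shown in your own project's Output panel.
import requests

API_KEY = "YOUR_VEXT_API_KEY"      # placeholder: create it under API keys
ENDPOINT_ID = "YOUR_ENDPOINT_ID"   # placeholder: part of the URL the platform gives you
CHANNEL_TOKEN = "krishnaik06"      # any unique value you choose

url = f"https://payload.vextapp.com/hook/{ENDPOINT_ID}/catch/{CHANNEL_TOKEN}"  # assumed pattern
headers = {
    "Content-Type": "application/json",
    "Apikey": f"Api-Key {API_KEY}",  # header name as shown in the platform's snippet
}
data = {"payload": "What is machine learning?"}  # the query goes in the payload field

response = requests.post(url, json=data, headers=headers)
print(response.status_code)  # 401 / unmatched-endpoint errors usually mean a wrong key or URL
print(response.text)
```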
The code really is that simple: set the API key, the query, the headers, and the data, exactly as the POST request specifies, and execute it. So let me run it: python test.py, asking "what is machine learning". The first attempt returns "unauthorized" — ah, I hadn't saved that step in the platform, no worries. After saving, I get a different error, "cannot match app with endpoint", because I had tried this out previously with an older endpoint. So I copy the URL again from the Output panel; the channel token stays almost the same, but the placeholder segment from the snippet has to be removed and replaced with my own channel token. Now it runs: "what is machine learning", and after the usual first-request delay I get the response — machine learning is a subset of artificial intelligence that involves the creation of algorithms, and so on. At the end of the day it is a simple Python POST call; how you wire it into your front end is up to you. This URL is my endpoint, the API key travels in the headers, and this one POST can be embedded in any application I want. That is the most amazing part: code-wise, you no longer have much dependency. One more thing I want to show is the App Directory: you can also connect with Google Drive, Confluence, Jira, Slack, and Teams (these are coming soon), and later you will be able to work directly with LLMs and bring your own LLM to the platform — there is support for AWS SageMaker, Bedrock, and Hugging Face as well.

So guys, next we are going to discuss how to fine-tune our own LLM models with our own custom data. For this we will use a platform called gradient.ai. Gradient provides a number of LLM models, and on top of them you can fine-tune with your own custom dataset: the data just has to be formatted in a specific way, and the fine-tuning itself happens quickly — when I run it, it takes hardly 7 to 8 minutes. That is the demo I want to show here; by the end, you'll see that if you have a custom dataset and want to create your own custom LLM model, you can definitely do it with gradient.ai. Now let's understand what gradient.ai
provides. It positions itself as one platform for unlimited AI: it does not only do fine-tuning, it also helps you develop, deploy, and run inference. You can create your own private LLMs — fully unlocked models — and build in any language; it provides SDKs for, I think, JavaScript, Python, and Java. We will use the Python SDK in Google Colab to fine-tune our own model. The site describes itself as the only AI platform that lets you combine industry-expert AI with your private data, and talks about accelerating AI transformation and deploying AI to solve mission-critical problems 10x faster with Gradient's AI cloud platform, with the infrastructure handled for you. So it is a nice platform overall: you can build a model, run inference on it, and fine-tune it.

To get started you first need to sign up with your details; I have already signed up, so I'll just log in. After logging in it asks you to create a new workspace — this is the first important step. Click "create a new workspace", give it a name, and submit; in my case I created a workspace called "test". Note its ID: that is the workspace ID, and I copy it because I'll need it in my code in Google Colab. The next thing you need is under Access Tokens: along with the workspace ID you need an access token. I have already generated mine and kept it in my coding environment; to generate one you click "generate new access token", enter your password, and submit, and it shows you the secret access token, which you copy. One nice thing about Gradient is the documentation: the guides include a Python SDK example with the full fine-tuning code, and with a minimal number of lines you can fine-tune a model quite easily. It also lists integrations with LangChain, LlamaIndex, MongoDB, Haystack, and more — for example, you can use MongoDB as a vector database and store all your vectors there. Now let me log in again; if I go to the workspace, this is the workspace I created.
Inside this workspace you can see the models I have already created for fine-tuning. If you go to Fine-tuning and click "create fine-tune", it offers three base models: Bloom 560M, Llama 2 7B Chat, and Nous Hermes 2. These are different open LLMs, and you can use any of them for fine-tuning. In this example I'm going to pick Nous Hermes 2, but rather than doing everything through the UI, I'll run the whole fine-tuning process with the Python SDK, and the fine-tuning job will then show up here as well.

I've opened Google Colab, so let's go step by step and see how to fine-tune the model (after fixing my spelling mistake in the notebook title). The first step is to install the SDK: pip install gradientai --upgrade. Once I execute it, the installation runs; here it shows the requirement is already satisfied. Next come the two important pieces of information I mentioned: the Gradient workspace ID, which I copied earlier from the workspace page, and the Gradient access token. I set up the environment using import os and then os.environ for GRADIENT_WORKSPACE_ID with my workspace ID, and the same for GRADIENT_ACCESS_TOKEN with my access token; I execute the cell and both environment variables are set. Now comes the main part, the fine-tuning code itself, so let me walk through it step by step. First we import Gradient from the gradientai package; this Gradient client is responsible for calling the base model and for fine-tuning. I've written a main() function, and inside it I first initialize Gradient. The client has a method called get_base_model, whose first parameter is base_model_slug — the slug says which model you are calling, and, as shown in the fine-tuning UI with its three options (Bloom 560M, Llama 2 7B Chat, Nous Hermes 2), I'm going to use Nous Hermes 2, so I pass its slug here. That line gets me the base model, and on top of that base model we then create a model adapter.
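Putting those setup steps together, a minimal sketch of the Colab cells could look like this. The workspace ID and access token are placeholders, and the exact base-model slug is an assumption — double-check the slug shown in your own Gradient workspace or in the SDK docs.

```python
# pip install gradientai --upgrade
import os

# Placeholders: copy these from your Gradient workspace page and access-token page.
os.environ["GRADIENT_WORKSPACE_ID"] = "your-workspace-id"
os.environ["GRADIENT_ACCESS_TOKEN"] = "your-access-token"

from gradientai import Gradient

gradient = Gradient()  # reads the two environment variables set above
base_model = gradient.get_base_model(base_model_slug="nous-hermes2")  # assumed slug -- verify in your workspace
new_model_adapter = base_model.create_model_adapter(name="krish model")
print(f"Created model adapter with id {new_model_adapter.id}")
```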
The reason we create a model adapter is that this adapter is effectively my own model — think of it as the "Krish model" — and it is this adapter that we will fine-tune further. So the first step on top of the base model is to create an adapter, and I name it "krish model"; that is the second line of code. Once the adapter is created, I print its ID using new_model_adapter.id.

Now comes the important step: the query. Look at the format of the sample query: the instruction is "Who is Krish Naik?". Gradient expects your custom data to follow a specific format, and the Python SDK documentation has tips on how the dataset should be structured, so refer to that to see how to prepare your own data. In this format, the query starts with three hash symbols and "Instruction:", followed by the question "Who is Krish Naik?", closed off with two newline characters, and then a "Response:" section that is left blank, because the response is what will come from the new model adapter, which wraps the Nous Hermes 2 base model. I print this sample query so you can see the instruction-and-response layout. Then I call the adapter's complete method: the first parameter is my sample query, and max_generated_token_count is set to 100, so it will generate at most 100 tokens of output; the generated_output attribute of the result gives me the text. So whatever question I asked — who is Krish Naik — I get the model's answer here. Note that I have not started fine-tuning yet, which is why the comment says "before fine-tuning". And before fine-tuning, if I ask who Krish Naik is, the model obviously won't know — unless I'm very famous, which I'm not — so it will just produce whatever it may have heard about the name somewhere.
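Continuing from the adapter created above, here is a sketch of that prompt format and the pre-fine-tuning completion call. The instruction text and the exact hash-prefixed template are just the values used in this walkthrough; check the SDK's own examples for the format it recommends.

```python
# Query format: an "### Instruction:" block followed by an empty "### Response:"
# block that the model is asked to complete.
sample_query = "### Instruction: Who is Krish Naik? \n\n### Response:"
print(f"Asking: {sample_query}")

# Completion from the adapter before any fine-tuning has happened.
completion = new_model_adapter.complete(
    query=sample_query,
    max_generated_token_count=100,
).generated_output
print(f"Generated (before fine-tuning): {completion}")
```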
So now I'll create my own sample data about Krish Naik, because before fine-tuning the model clearly cannot give the right answer; this becomes my custom training data. The first sample's instruction is "Who is Krish Naik?", and the response I've written is "Krish Naik is a popular mentor and YouTuber who uploads videos on data science and LLMs on his channel Krish Naik" — that is one input-output pair. Similarly, for the second sample I phrase the question differently, "Who is this person named Krish Naik?", and the response says that Krish Naik likes data science and AI, makes videos on YouTube, and is also a mentor. For the third, the instruction is "What do you know about Krish Naik?", with the response that Krish Naik is a popular creator who specializes in data science and whose channel is named Krish Naik. You can create many such inputs around a single context; if you had hundreds of contexts you would create this kind of input for each of them, and in a real-world, end-to-end project all of this would typically come from a database — MongoDB, SQL, whatever you use. So this is my sample data: essentially a list of key-value pairs.

With the input data ready and the before-fine-tuning check done, the next step is the fine-tuning itself. For that I define the number of epochs and initialize count = 0, then loop while count is less than the number of epochs: print the fine-tuning iteration (count + 1), call new_model_adapter.fine_tune, passing my samples — this is the method that actually fine-tunes on the data — and increment count so the epochs advance: iteration one, two, three, and the fine-tuning runs. After the loop, under the comment "after fine-tuning", I call the same complete method with the same sample query, "Who is Krish Naik?", and print the generated output. Finally I delete the adapter, since I won't need it afterwards (if you want to keep using it, you can leave it in place), and close the Gradient client; main() is then invoked at the bottom. In short: check the response before fine-tuning, create the sample data, fine-tune for a few epochs, and check the response again. Now let's execute it.
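Before running it, here is a sketch of the sample format and the epoch loop just described, continuing from the earlier snippets. The sample texts are only the toy examples from this demo; in a real project they would be pulled from a database.

```python
# Toy training samples in the instruction/response format; a real project would
# build this list from a database of question-answer pairs.
samples = [
    {"inputs": "### Instruction: Who is Krish Naik? \n\n### Response: Krish Naik is a popular mentor and YouTuber who uploads videos on data science and LLMs on his channel Krish Naik."},
    {"inputs": "### Instruction: Who is this person named Krish Naik? \n\n### Response: Krish Naik likes data science and AI, makes videos on YouTube, and is also a mentor."},
    {"inputs": "### Instruction: What do you know about Krish Naik? \n\n### Response: Krish Naik is a popular creator who specializes in data science; his channel is named Krish Naik."},
]

num_epochs = 3
count = 0
while count < num_epochs:
    print(f"Fine-tuning the model, iteration {count + 1}")
    new_model_adapter.fine_tune(samples=samples)  # one pass over the samples per iteration
    count += 1

# Ask the same question again, now against the fine-tuned adapter.
completion = new_model_adapter.complete(query=sample_query, max_generated_token_count=100).generated_output
print(f"Generated (after fine-tuning): {completion}")

new_model_adapter.delete()  # keep the adapter instead if you want to query it from the UI later
gradient.close()
```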
Once I start executing, it takes hardly a couple of minutes, since I'm running only three iterations. It prints that the model adapter was created with its ID, the instruction "Who is Krish Naik?" goes out, and it waits for the response; meanwhile, over in Gradient's fine-tuning UI, the model appears while the job runs and will disappear once the script deletes it at the end. The before-fine-tuning response comes back claiming that Krish Naik is a well-known Indian actor who has appeared in various films and television shows, best known for a role in a popular serial — I have never worked in any such industry, I'm not an actor at all, and that serial is a very famous one in India in which I certainly never acted; it even attributes other work to me. So that is the response generated before fine-tuning. Then the fine-tuning runs with the correct input-response data; you can see how fast the iterations go on the Gradient cloud platform, which is quite amazing — we're already on the third iteration. And here is the output generated after fine-tuning: Krish Naik is a popular YouTuber and data scientist known for his data science and Python tutorials on his YouTube channel Krish Naik. Just look at that: with only three or four sample sentences the fine-tuning has worked this well. Now imagine the power if you supplied a proper dataset and increased the number of epochs — the entire fine-tuning happens on the Gradient cloud, and within 5 to 7 minutes you can train it; preparing my input data hardly took any time. So this is one amazing application, and this is how you can do the fine-tuning; there is a small task here for you — try it yourself. If I reload the Gradient UI, I would have been able to show you the fine-tuned model there too, but I already deleted the adapter at the end of the script, which is why it is not visible under the models; had I not deleted it, you would see it listed, and you could open it and ask any question you want directly from the UI. So if you don't want to delete it, just keep it as is. I hope you liked this video and understood how you can fine-tune LLM models with the help of the Gradient AI cloud — it is quite amazing, and you should definitely try it and see whether you can do it yourself; just from the code, I think it is quite easy. That was it from my side; I'll see you all in the next video. Have a great day, thank you all, take care, bye-bye.
Info
Channel: Krish Naik
Views: 22,819
Keywords: yt:cc=on, how to finetune llm models, generative ai finetuning models, krish naik llm models finetuning
Id: t-0s_2uZZU0
Length: 156min 49sec (9409 seconds)
Published: Tue May 07 2024