Fine-tune Multi-modal Vision and Language Models

Captions
This video covers multimodal language models: models that let you interact with both text and images. In particular, I'll take you through the LLaVA model, which is open source, and I'll show you how to fine-tune it on a custom dataset. I'll start off with a very quick demo of how a multimodal language model works, comparing it with ChatGPT. I'll talk a bit about the applications where this is useful and why I think so many people will be interested in fine-tuning this kind of model. Then I'll talk through how these models work: they're the combination of a vision model and a text model that are adapted to work together. I'll cover three specific architectures: LLaVA 1.5, LLaVA 1.6, and IDEFICS, which is an earlier multimodal model based on the Flamingo architecture. I'll then prepare a fine-tuning dataset. We're going to be training a model on chess pieces: I want the model to be able to recognize different combinations of wooden chess pieces after I've fine-tuned it to recognize individual chess pieces. Then I'll show you a specific fine-tuning run on LLaVA 1.6. I'll do it on the version that's based on Mistral 7B and also on a larger, more performant one based on Yi-34B, which is a 34-billion-parameter language model. And as usual, I'll finish off with some final thoughts.

We'll dive right in with some examples. Here I am on the demo page for the LLaVA model, which is available at llava.hliu.cc, and you can check out the most performant of these models, based on the 34-billion-parameter Yi language model. I've uploaded an image, and you can see it's an image of a knight and a rook: a black knight and a white rook. I've asked the model what it sees in this picture, and it says the image shows two chess pieces on a wooden surface (I've just increased the font): on the left is a rook, a piece that moves in straight lines horizontally or vertically, and on the right is a pawn, a piece that moves forward one square at a time. Now, I don't know if it's referring to this piece here, which I think is actually a pawn, but really it should be calling out the knight, which it seems to be missing completely. The answer goes on: the chess pieces appear to be made of wood, and the lighting suggests the photo may have been taken in a room with natural light coming in from the left side. So it's certainly on topic, it knows about chess, and it gets the rook, but it doesn't quite get the knight, and that's roughly representative of the performance today of this open-source vision model.

The private models like GPT-4 are more performant. When we ask the exact same question, "What do you see in this picture?", ChatGPT says the image shows two chess pieces on a wooden surface: the one on the left is a dark brown knight, represented by a horse's head and neck, and the one on the right is a lighter-colored rook, with the distinct cylindrical shape and notched top that resembles a castle battlement. So GPT is getting this answer correct. Even with ChatGPT, though, you can push the limits. For example, I've put in this photograph of a full chess board and asked the model whether anything is out of place in the image. If you take a closer look, everything is in place except on the white side: the queen is not on her white square, so the king and the queen are in the wrong positions. Here's what ChatGPT says: there's a chessboard with a full set of black and white pieces arranged, and the white queen appears to be placed on a square that is not correct for the starting position of the game.
So it's kind of recognizing what's wrong, but it gets the wrong square: h5 is this square out here on the side of the board, so that's not the square where the queen is. And if we read on, ChatGPT says the queen should start the game on square d1, which is indeed the correct starting square for the white queen, but additionally that the black king and queen seem to be placed on the opposite squares of their usual starting positions, and that's incorrect, because the black king and queen are actually on the correct squares for their starting point. So you can see that GPT-4 with vision is stronger here, but it's still making some mistakes in interpreting this complicated chess image.

What I'm going to do, after describing how these models work, is fine-tune on a dataset of pieces. I'm not going to get the model to perform at the level of getting all the positions right on a full board. Instead, I'm going to take a series of photos of individual chess pieces, use that as training data, and then ask the model what it sees in a set of pieces: I might give it two or three pieces and ask which pieces it sees.

The reason there are such powerful applications for vision-and-text models is that they can be used so widely. For example, you could think of fine-tuning these models for medical applications, where you're looking at images and trying to detect patterns: issues, problems, healthy behaviors. You could use them for taking photographs of cars, seeing what kind of damage has been done, and assessing the cost of repair based on the images. There are almost endless applications, and by combining a language model with a vision model, which is what this video is all about, we get a very powerful and interactive tool for diagnosing issues, provided we have a good training dataset.

Here we have one slide showing how these vision-and-text models work. They are the combination of a vision model and a language model. In very simple terms, we have two forms of input: image pixels, so just the individual pixels of an image, and input text in the form of phrases or sentences. The image pixels go into a vision encoder, and through the vision encoder they're converted into a set of vectors that represent the meaning of what's in the picture. The input text, meanwhile, is also converted into a series of vectors, and those vectors, combined with the image vectors, go through the language model. Basically, we're taking a language model, which would normally just be the right-hand side of the slide, and we're additionally feeding in vectors that represent the image. So we have both image vectors and text vectors coming in, and we use that combination to predict the next token. You can see that one side of the slide is just how a transformer-based language model works, except that now we've additionally added in some vectors for the images.
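To make that idea concrete, here is a minimal, illustrative sketch (not the LLaVA code itself) of how image vectors and text-token embeddings can be concatenated and passed through a single language-model backbone to predict the next token. All module names, sizes, and the toy transformer are invented for illustration, and causal masking is omitted for brevity.

    import torch
    import torch.nn as nn

    class ToyVisionLanguageModel(nn.Module):
        """Tiny stand-ins for a vision encoder and a language model, for illustration only."""
        def __init__(self, vocab_size=100, d_model=64, patch_pixels=16 * 16 * 3):
            super().__init__()
            self.vision_encoder = nn.Linear(patch_pixels, d_model)   # one vector per image patch
            self.embed_tokens = nn.Embedding(vocab_size, d_model)    # text token embeddings
            block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.lm_backbone = nn.TransformerEncoder(block, num_layers=2)  # stand-in for the LM
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, patches, input_ids):
            image_vecs = self.vision_encoder(patches)             # (batch, n_patches, d_model)
            text_vecs = self.embed_tokens(input_ids)              # (batch, seq_len, d_model)
            combined = torch.cat([image_vecs, text_vecs], dim=1)  # image vectors first, then text
            hidden = self.lm_backbone(combined)
            return self.lm_head(hidden)                           # logits used to predict the next token

    model = ToyVisionLanguageModel()
    patches = torch.rand(1, 4, 16 * 16 * 3)      # 4 flattened 16x16 RGB tiles
    tokens = torch.randint(0, 100, (1, 5))       # 5 text tokens
    print(model(patches, tokens).shape)          # torch.Size([1, 9, 100]): 4 image + 5 text positions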
Now let's go one step deeper, to give you more granularity. On the input text side, the input is actually sentences and words, and those words are chopped into subword tokens. For example, the word "token" might be split into "tok" and "en". These subwords are the representation we take for the input text, and the vocabulary of subwords often has a size of around 32,000. So even though we only have 26 letters in the English language, you can come up with a vocabulary of 32,000 such subwords. Those subwords are then converted into vectors: the meaning of each subword is represented by a vector in a very high-dimensional space, not just 2D as you see here, but maybe 1,000 or 2,000 dimensions. The conversion is done by embeddings: embeddings are the matrices that we multiply the token representations by in order to convert them into vectors. So to summarize: we have words, they're converted to subword tokens, and those subword tokens, multiplied by embedding matrices, get converted into vectors. That text, converted into input vectors, is what goes into our language model.

On the left-hand side we have pixels. They go into an encoder, which I'll talk about in a moment, and that encoder produces output vectors that represent the images. Those vectors are then fed in in parallel, just as we feed in the text embeddings. As the output of this language model, I previously said we predict the next token. That's true, but more specifically we predict an output vector, and that output vector is then decoded (we reverse the embedding) to get from an output vector back to a representation of the most likely next token.

So let's dig in on the vision encoder side; I want to talk a bit about how we get from image pixels to vectors. The way we usually do that is by splitting the image up into patches. Here you can see this image of a dog, and I've split it into four patches. Often we'll have patches, or tiles, of 16 by 16 pixels. This image has four tiles, and for each tile we take the pixels, so here in the red tile we'd take all of these pixels, say 16 by 16, and multiply them by some matrices to get from pixels to a vector that represents the tile. You can think of this a bit like embeddings: in the same way that we go from words or subword tokens to vectors, we go from these tiles or patches to a vector that represents the patch. So this image would have four vectors, and these are trainable, by the way: the matrices that convert the pixels within the red patch into a vector are trainable matrices, trained through backpropagation.

So now we've got four vectors, and these are the vectors that are fed into our vision encoder. The vision encoder itself is a multi-layer neural net, and it makes use of attention, so it's architected in the same way that transformer language models are: we have feed-forward layers and attention layers, typically one of each in a block, and that block is repeated many times, maybe 12 times. What this does is take the input vectors and make them interact, so that we end up with an output representation that captures the meaning of what's in the image. The role of this whole block is to start with image pixels in red, green, and blue (each pixel has a certain amount of red, green, and blue, which is how you represent its color), tack on a position for each of the patches, and from the positions and the pixels generate a vector that represents both. These vectors then go through the vision encoder and generate an output that is representative of the overall image.
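Here is a minimal sketch of the patch-embedding step just described: cut the image into fixed-size tiles, project each tile's pixels to a vector with a trainable matrix, then add a position signal. The strided convolution is a common ViT-style way to do this; the patch size and dimensions here are illustrative rather than the exact values used inside LLaVA's CLIP encoder.

    import torch
    import torch.nn as nn

    patch_size = 16
    embed_dim = 1024   # illustrative hidden size

    # A strided conv is equivalent to: cut the image into 16x16 tiles, flatten each
    # tile's RGB pixels, and multiply by a trainable matrix.
    patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

    image = torch.rand(1, 3, 336, 336)                    # one RGB image, 336x336 pixels
    patches = patch_embed(image)                          # (1, 1024, 21, 21): one vector per tile
    patch_vectors = patches.flatten(2).transpose(1, 2)    # (1, 441, 1024): 441 patch vectors

    # Add a (learned) position embedding so the encoder knows where each tile sits.
    pos_embed = nn.Parameter(torch.zeros(1, patch_vectors.shape[1], embed_dim))
    encoder_input = patch_vectors + pos_embed
    print(encoder_input.shape)                            # torch.Size([1, 441, 1024])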
Here's a slightly different representation, just so you can wrap your head around the entire architecture; this is the whole vision-plus-text model again. On the left-hand side you can think of two image vectors, which represent two of the tiles of an image, and then we have some text vectors, say for the words "say" and "hello". We feed in the image vectors and the text vectors, and then we make a prediction for the next token by paying attention to all of the previous vectors. So the information from the image and the text all gets fed in to predict the next token, or rather the next vector, which gets converted into the next token.

Now there's one more step I have to show, which is needed if you're going to combine a pre-trained language model, for example Llama, with a pre-trained vision encoder. One of the key hacks with these vision-plus-text models is that instead of training everything from scratch, we take an off-the-shelf language model and an off-the-shelf vision encoder and see if we can make them work together. The way we do that is by recognizing that the vision encoder provides output vectors, effectively one for each tile of the image, while the language model expects input vectors. The language model is used to receiving input vectors that come from embedding text, because that's what Llama was pretrained on, so it's very accustomed to those; but it's not going to be accustomed to the output vectors of the vision encoder, because the two models were trained on disparate datasets, so they're not compatible per se. To make the language model compatible with the vision encoder, we add in an adapter. The adapter is also just a series of trainable matrices, and it converts, or rather makes compatible, the output vectors of the vision encoder with the input vectors of the language model. There are a few ways to train this adapter, which we're going to see right now.

Let me talk about LLaVA 1.5, which is the precursor to LLaVA 1.6 and one of the more recent open-source vision-plus-text models. It makes use of a CLIP encoder plus a Llama 2 language model; those are the two building blocks it uses. It then freezes the vision encoder and the language model and just trains the adapter. So LLaVA 1.5 takes the vision encoder and the language model (Llama 2), freezes those weights, and trains only the adapter, using text-and-image pairs; there are datasets available of images with captions. In other words, LLaVA 1.5 makes the vision encoder and the language model compatible with each other by freezing both and training only the weights in the adapter. That's the second step. The third step, for LLaVA 1.5, is to unfreeze everything, so the language model, the vision encoder, and the adapter are all trainable, and then use synthetic image-plus-text data to fine-tune the model.
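As a rough sketch of that staged training (adapter first, then everything), here is how the freezing and unfreezing typically looks in PyTorch. The modules below are small placeholders standing in for CLIP, the adapter, and Llama 2; the names and sizes are not the actual LLaVA attribute names.

    import torch.nn as nn

    # Placeholder building blocks standing in for CLIP, the adapter, and Llama 2.
    vision_encoder = nn.Linear(1024, 1024)
    projector = nn.Linear(1024, 4096)        # the adapter: vision-output vectors -> LM-input vectors
    language_model = nn.Linear(4096, 32000)

    def set_trainable(module, trainable):
        for p in module.parameters():
            p.requires_grad = trainable

    # Stage: train the adapter only, on image-caption pairs.
    set_trainable(vision_encoder, False)
    set_trainable(language_model, False)
    set_trainable(projector, True)

    # Later stage: unfreeze everything and fine-tune on synthetic instruction data.
    set_trainable(vision_encoder, True)
    set_trainable(language_model, True)

    trainable = sum(p.numel() for m in (vision_encoder, projector, language_model)
                    for p in m.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable}")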
The synthetic data is instruct data: it has a question-and-answer format, which is great because it encourages the model to perform well when we have question-and-answer style conversations. The way LLaVA generates this data is actually by using ChatGPT. Don't confuse that with LLaVA being trained using the vision version of ChatGPT; it's not. It actually uses GPT-4 as a language-only model, and it uses it to write detailed descriptions of images given details of their contents. Basically, there are datasets out there of images where the image has bounding boxes at different positions for each object, and each object is labeled. If you have an image with labeled objects at certain positions, you can feed that into GPT-4 and get out a detailed description of what's in the image. That's roughly how LLaVA makes its synthetic data, and it works well because it gives you this very natural question-and-answer format that engages the image within a text conversation.

Now, LLaVA 1.5 has been superseded by LLaVA 1.6, which has a couple of tweaks. One basic way you can improve these models is just to start off with better building blocks: instead of using, say, Llama 2, you use Mistral, or instead of using an earlier or smaller version of CLIP, you use a later or bigger one. LLaVA 1.6 does this in two ways. First, it uses Mistral 7B, and also Yi-34B for a larger version; these are larger and more powerful models than the base Llama 2 7B. It also uses a larger vision model, with a 336 by 336 pixel image input compared to a 224 by 224 input. So those are the first changes in 1.6: better initial building blocks, is one way to put it.

The second improvement is around the adapter. In LLaVA 1.5 the adapter is just a linear layer, a single layer of matrices, but in LLaVA 1.6 it's a multi-layer perceptron. In a multi-layer perceptron you have a linear layer and also an activation, using an activation function such as a SiLU or a GELU, which squashes small inputs towards zero and lets large inputs pass through. Using that multi-layer perceptron gives a denser mapping that can contain more information, plus that activation-style behavior. So basically, by putting in a more complex adapter, they're able to improve performance and improve the compatibility between the language model and the image model.
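To illustrate the adapter change between 1.5 and 1.6, here is a sketch of a single linear projector next to a two-layer MLP projector with a GELU in between. The 1024 and 4096 sizes are illustrative (roughly a CLIP-sized output feeding a 7B-sized language model), not values read out of the LLaVA code.

    import torch
    import torch.nn as nn

    vision_dim, lm_dim = 1024, 4096   # illustrative sizes

    # LLaVA 1.5-style adapter: a single trainable matrix.
    linear_projector = nn.Linear(vision_dim, lm_dim)

    # LLaVA 1.6-style adapter: a two-layer MLP with a GELU activation in between,
    # which lets the mapping from vision space to language space be non-linear.
    mlp_projector = nn.Sequential(
        nn.Linear(vision_dim, lm_dim),
        nn.GELU(),
        nn.Linear(lm_dim, lm_dim),
    )

    image_features = torch.rand(1, 441, vision_dim)   # one vector per image patch
    print(linear_projector(image_features).shape)     # torch.Size([1, 441, 4096])
    print(mlp_projector(image_features).shape)        # torch.Size([1, 441, 4096])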
Now there's one other model I'll briefly touch on, which is called IDEFICS. The IDEFICS model is a little bit different from the LLaVA-type models because it uses the Flamingo architecture, which I'll show you a diagram of in a second, and the training is different too: it doesn't use synthetic instruction fine-tuning; instead it trains on multi-image documents, very long documents that have multiple images with text interspersed between them. We can head over and take a look at the Flamingo paper, which was one of the original multimodal papers; the architecture used in Flamingo is the same as the one used by IDEFICS. What's unique about IDEFICS is that dataset, the long-document multi-image dataset, which is called OBELICS. These are all names from Asterix and Obelix, Idefix being the dog.

Here you have the image of how Flamingo is architected. It's somewhat similar to LLaVA: you still have a vision encoder, this time for each of the images, and you have the perceiver resampler, which you can really just think of as the adapter, so not too much change there either. However, rather than feeding the vectors from the perceiver resampler (the adapter) in the same way as the vectors representing text, these vectors are injected in parallel into each of the layers of the language model. So if you have, say, 12 layers in the language-model transformer, the vectors from the adapter are injected in parallel into those layers, which is different from injecting them once at the bottom. It's different because it adds extra parameters: you now need extra cross-attention between the text and the image vectors within each layer, which means that when you start off with a Llama 7B model it actually turns into a 9B model once you allow for these added parameters, and the 70B model that the large IDEFICS model is based on becomes an 80B model. The reason it doesn't become even bigger, say a 90B model, is that they don't include the injection in every layer; I think it's once every seven or eight layers, and that allows them to reduce the total number of parameters.

Basically, though, I think the architectures are evolving more towards the simpler LLaVA approach, because if you consider the vectors going in right at the bottom, that information will propagate through all of the layers anyway, so whether you really need to inject at each of these points is questionable. As I said, the information still gets in through the bottom via the adapter, and because you're fine-tuning the whole model you're allowing it to adapt in any case, so you're achieving much of the effect of adding in these extra cross-attention layers.
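Here is a very rough sketch of the Flamingo-style idea: a cross-attention block, gated by learnable tanh gates, that lets the text hidden states attend to the image vectors and that would be inserted every N decoder layers. This is a simplified illustration of the concept, not the actual Flamingo or IDEFICS implementation.

    import torch
    import torch.nn as nn

    class GatedCrossAttentionBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))
            # Gates start at zero, so at initialization the block is a no-op and the
            # frozen language model's behavior is preserved.
            self.attn_gate = nn.Parameter(torch.zeros(1))
            self.ff_gate = nn.Parameter(torch.zeros(1))

        def forward(self, text_hidden, image_vectors):
            attn_out, _ = self.cross_attn(text_hidden, image_vectors, image_vectors)
            text_hidden = text_hidden + torch.tanh(self.attn_gate) * attn_out
            text_hidden = text_hidden + torch.tanh(self.ff_gate) * self.ff(text_hidden)
            return text_hidden

    block = GatedCrossAttentionBlock()
    text_hidden = torch.rand(1, 10, 512)      # 10 text positions
    image_vectors = torch.rand(1, 64, 512)    # 64 vectors from the perceiver resampler / adapter
    print(block(text_hidden, image_vectors).shape)   # torch.Size([1, 10, 512])

Because the gates start at zero, inserting blocks like this leaves the frozen language model unchanged at initialization, and the extra cross-attention weights are the added parameters that turn a 7B base model into roughly a 9B multimodal model.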
You can check out the IDEFICS model for yourself on Hugging Face. It also runs with a version of Text Generation Inference, which is quite interesting, because as of yet there aren't great inference options available for these multimodal models. I do fine-tune the IDEFICS model, and it's included in the Advanced Vision repository, so if you purchase access to that repo you can check it out for yourself. In this video, though, I'm going to focus on fine-tuning LLaVA 1.6, which we'll get to right now.

I'm now going to use the Advanced Vision repository and the scripts within it to prepare a dataset for fine-tuning a vision-and-text model, and then fine-tune and evaluate the performance of the fine-tuned model. You can purchase access to this repo at Trelis.com, but I'll also provide enough information in the rest of this video so you can follow along if you want to take the steps manually by yourself. In the repo there are two folders, one for data preparation and one for fine-tuning. I've cloned the repo and opened it up in VS Code, and I'm going to start off in the data-prep folder.

My goal here is to fine-tune a model to recognize chess pieces. What I've done is take photographs of individual chess pieces as my training dataset, and then other photographs of combinations of chess pieces as my test set. We can take a look; it's a public chess-pieces repo, and you can check out the train split. For example, here's a white rook, and you can see I've paired it with a caption that describes the piece. Here you can see a knight and a bishop, and the caption is "a white knight with a white bishop behind and part of a white pawn in the foreground"; you can actually see the white pawn in the foreground. It's quite a small dataset, just 48 rows, which I think is insufficient if you want very high performance, but it's still interesting because it will show us what kind of performance you get even with a small dataset. Now let's take a look at the test split. Here I have combinations of pieces: for example, a black rook, a black bishop, and a white bishop all standing together, and this example here with a black knight and a white rook. These are the photographs I'm going to use when running evaluation on my model, which I'll be doing as training progresses to check that I'm not overfitting, and we'll also use them for some manual performance examples once the model has been trained.

To generate this dataset I simply took photographs using my phone. They are in HEIC format, so I put them all in the data folder and ran a HEIC-to-JPEG script, which converts my images into JPEG format and renames them so they have the prefix "chess_pieces"; you can see all the JPEG files here. The next thing I did was run the create-captions script, which gave me a template CSV file for the captions. I then made two versions of that captions file, one for training and one for test. The captions file automatically had a list of all the image filenames in it, and I just had to manually write in the required captions: here you can see my train captions, and here are my three test captions. Then, by running the create-HF-dataset script, I was able to convert this into a Hugging Face dataset and push it up to the Hub; that's how I was able to show you the dataset loaded as it is here, and once it's on the Hub we'll be able to download it in the fine-tuning script to run training.
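As a sketch of that preparation flow (not the exact scripts from the Advanced Vision repo), something like the following converts HEIC photos to JPEG, writes a caption template CSV, and pushes an image-plus-caption dataset to the Hugging Face Hub. The pillow-heif opener and the datasets "imagefolder" loader with a metadata.csv are standard, but the folder layout, file names, column name, and repo id here are placeholders.

    import csv, glob, os
    from PIL import Image
    from pillow_heif import register_heif_opener   # lets PIL open .heic files
    from datasets import load_dataset

    register_heif_opener()

    # 1. Convert phone photos from HEIC to JPEG, renaming them with a common prefix.
    os.makedirs("data/train", exist_ok=True)
    for i, path in enumerate(sorted(glob.glob("raw_photos/*.heic"))):
        Image.open(path).convert("RGB").save(f"data/train/chess_pieces_{i:03d}.jpg")

    # 2. Write a caption template CSV; the caption column gets filled in by hand afterwards.
    with open("data/train/metadata.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "caption"])
        for path in sorted(glob.glob("data/train/*.jpg")):
            writer.writerow([os.path.basename(path), ""])

    # 3. Load as an image dataset (metadata.csv columns become dataset columns)
    #    and push it to the Hub so the fine-tuning notebook can download it later.
    dataset = load_dataset("imagefolder", data_dir="data")
    dataset.push_to_hub("your-username/chess-pieces")   # placeholder repo id; needs `huggingface-cli login`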
After the dataset is ready, the next step is to move on to fine-tuning. I know the font is small on the screen here, but there are folders for each of the models: there's a fine-tuning notebook for IDEFICS (actually for both the 9B and the 80B model), then a folder for LLaVA 1.5, a folder for LLaVA 1.6 with the 7B model based on Mistral, and then one for the 34B model. We're going to take a look at both the 7B and the 34B models, and to do that I'm going to open them up in Jupyter Notebook.

When I do fine-tuning, I don't have enough VRAM on my laptop, so I typically rent a server using something like RunPod or Vast.ai. Let me show you that exact setup for the 7B model and also for fine-tuning the 34B model. One option is RunPod: if you go to the public Trelis Research install-guides repository, you'll find a markdown file called llm-notebook-setup, and you'll see some one-click templates for fine-tuning. I'm going to click the RunPod one-click template, which gives me a CUDA 12.1 image and a fine-tuning notebook by Trelis. If you're going to fine-tune the 7B model, I recommend choosing an A6000, which will give you 48 GB of VRAM. I'm going to fine-tune in full 16-bit precision; the script doesn't support 4-bit or 8-bit fine-tuning just yet, which would allow you to greatly reduce the amount of VRAM you use. So for now, with a 7B model, I would choose one of these. And if you want to fine-tune the 34B model, I found that I needed 300 gigabytes of VRAM to fine-tune in 16-bit, so I actually used eight of these, but probably you could get away with six or maybe seven. I click on deploy and then continue, and once that's deployed you'll be able to go ahead and open up the Jupyter notebook.

Here I am in Jupyter with the LLaVA 1.6 fine-tuning notebook; this is for the 7B model, and I'm going to take you through the script step by step. First off we have the installation of the packages. We're going to be fine-tuning using Transformers from Hugging Face, so we'll install that, and we'll also install bitsandbytes, although I'm going to fine-tune in 16 bits, so I won't be using quantization in this example. As in my recent videos, I like setting the HF_HUB_ENABLE_HF_TRANSFER environment variable, which allows for very quick downloads and uploads to the Hugging Face Hub; for that to work you also need to have installed hf_transfer, which is shown right here. Next I'm going to clone the LLaVA repo directly from GitHub, and there are a few changes that I already need to make to the scripts at this point in order to get the best form of training.

Basically, two things need to happen. The first is that I need to convert the model-loading script to 16-bit, but I need it to be brain float (bfloat16) rather than float16. The reason is that LLaVA can be tuned in 32 bits, which requires a lot of memory but keeps training stable; if you try to train in float16, the training often becomes unstable because of a lack of precision. When numbers are represented in digital format, a certain number of bits are allocated to the exponent and a certain number of bits to the precision. What brain float (bfloat16) does as a data type is use the same number of exponent bits as float32 and greatly squeeze the number of precision bits, so bfloat16 is able to cover the same range of numbers as float32 and maintain stability. Not every GPU supports bfloat16, only Ampere and newer architectures like an A6000, an A100, or an H100. If you're using an older GPU like a T4, then you're going to be forced to train in float32, because float16, which has a smaller number of bits for the exponent, covers less of a dynamic range and isn't as stable.
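You can check the range difference yourself with torch.finfo. This small sketch also shows the fast-download environment variable mentioned above, set before the Hugging Face libraries are imported; it is an illustration, not part of the Trelis notebook.

    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"   # faster Hub transfers; needs `pip install hf_transfer`

    import torch

    for dtype in (torch.float32, torch.bfloat16, torch.float16):
        info = torch.finfo(dtype)
        print(f"{str(dtype):15s} max={info.max:.3e}  smallest normal={info.tiny:.3e}")

    # float32 and bfloat16 both top out around 3.4e38 (same exponent width, fewer mantissa
    # bits in bfloat16), while float16 tops out around 6.5e4, which is why large activations
    # and gradients overflow and destabilize fp16 training.
    if torch.cuda.is_available():
        print("bf16 supported on this GPU:", torch.cuda.is_bf16_supported())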
So there are a few changes we need to make in the loading script. The loading script we're going to use is the one containing load_pretrained_model. I've just run this cell to clone the repo, and inside it, within the llava folder, within the model folder, you'll see a builder.py script. If you open builder.py, which is the script that loads the model, and search for float16, you'll see there are five instances where the model is being loaded in float16, and those all need to be replaced with bfloat16. That ensures the model is loaded in bfloat16 and not float16, so that we keep the memory reasonably compact but also keep the training stable.

There's one other change as well, for when the vision encoder is loaded. That's called the vision tower; we'll see it later in the modules. Right down here we have the vision tower, and as the code is structured, if the device map is not "auto", the vision tower is moved to the device (which will be CUDA, so the GPU) with data type float16. That would be updated to bfloat16, but there's an issue: often the model is loaded and this line of code isn't triggered. So we basically need to adjust the code so that the vision tower is always moved to bfloat16 and never left in float32 or float16. I've automated that with the script here (there's a rough sketch of this search-and-replace further below): I search for the vision-tower pattern and replace it with a pattern that forces the vision tower to be loaded in bfloat16, and then furthermore I replace all instances of float16, provided they don't already have a "b" before them, with bfloat16 throughout the script. So I've run the script to replace the vision-tower component, and then the script that searches for the float16 pattern and replaces it with bfloat16 as well. Just to note: you can leave the vision tower in float32 and training will generally still work, but replacing it with bfloat16 will reduce your VRAM a little.

After we've cloned the LLaVA repo and made those adjustments for bfloat16, we cd into the llava directory and make the installations recommended by the LLaVA repository itself. Once that's done, we can move on to loading the model. Here we set the device on which we load the model to CUDA, and we select the model we want to load; there's a selection of models, so you can load the Mistral 7B one, the Vicuna one, or the 34B model, which I'll show you in a moment. Next we load not just the model but also the tokenizer, the image processor, and the context length, all at once, using the model path we've specified above. We use flash attention to speed up fine-tuning, and we set the cache directory to be empty, which means the model is downloaded to the directory we're currently in; that download is quite fast thanks to HF transfer. After that we should have the model loaded onto the GPU. Notice that there's an error and a warning that are still unresolved when loading the v1.6 LLaVA models, but everything should still work fine despite them.
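Here is a rough sketch of that patching and loading step. The regular-expression replacement mirrors what was described (switching float16 to bfloat16 while skipping strings that already start with a "b"); the load_pretrained_model import reflects how the cloned LLaVA repo is used in the notebook, but treat the exact path, signature, and model id as assumptions that may differ between LLaVA versions.

    import re
    from pathlib import Path

    # Assumes the repo was cloned into ./LLaVA, as in the notebook.
    builder = Path("LLaVA/llava/model/builder.py")
    source = builder.read_text()

    # Replace every float16 that is not already part of bfloat16 (i.e. not preceded by a "b").
    patched = re.sub(r"(?<!b)float16", "bfloat16", source)
    builder.write_text(patched)
    # (The extra vision-tower tweak described above works along the same lines and is omitted here.)

    # Load the tokenizer, model, image processor and context length in one call.
    from llava.model.builder import load_pretrained_model     # assumed API of the cloned repo
    from llava.mm_utils import get_model_name_from_path

    model_path = "liuhaotian/llava-v1.6-mistral-7b"
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path),
        use_flash_attn=True,
    )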
Next up we're going to move to examining the modules. Here you can also optionally examine the model or the processor, but I like to check that there are no modules that are not in bfloat16, so I run this script, checking through all the modules to see that none are in float32, and indeed, because none are listed, we know that all the modules are in bfloat16.

Next we move to inference, and we're going to use some inference code that's again based on the code recommended in the LLaVA repository; there's an evaluation function you can check out within the llava folder. What we do first is create a prompt. The prompt needs to be set up as a combination of the image and the text itself, and it needs to be prepared in a specific format; that conversation format depends on the model type, so whether it's a Llama 2 base, a Mistral base, a 34B base, or an MPT base, you're going to need a different chat template, and that's what's being handled here. We have a function that processes and prepares images, which itself calls a function to load images, and then we wrap all of this in a function called eval_model, which takes in a tokenizer, a model, an image processor, the context length, the image file we want to ask a question about, the query (the text question), the model name, and also some parameters like the temperature, the number of beams, and the max new tokens, which are set to default values here. Within eval_model, the prompt is prepared, the inputs (the PyTorch tensors) are prepared using the tokenizer, and then we call model.generate with the images tensor and the image sizes, also passing the temperature, the max new tokens, and whether to use caching for inference. That finally gives us our output tensors, which are decoded using the tokenizer to give the final printed output tokens; there's a condensed sketch of this flow just below.

With all that done, we can run a quick example on that knight-and-rook image, which I've uploaded to a public URL. We pass in the image and ask, "What do you see in this picture?". Remember, this is the 7B LLaVA 1.6 model. It answers that the image shows a chessboard with a few pieces on it: there's a rook on the a1 square, a knight on the b1 square, and a pawn on the c1 square; the pieces appear to be made of wood, and the chessboard has a light-colored surface. You can see there are some pad tokens appearing here. That's because I've used a patched version of this model which adds in pad tokens; I actually think that's best not done, so if you're going to rerun this script, rather than using the patched model, which adds padding tokens, I recommend just using the base model, which doesn't display those issues. So we have a quick example, and you can see the model recognizes that it's chess, but it tells us the pieces are on certain squares, which doesn't make any sense; it sees the knight and it sees the rook, but it's not exactly understanding where they're standing.
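Here is a condensed sketch of that inference flow, assuming tokenizer, model, and image_processor came from the loading step above. The helper names (tokenizer_image_token, process_images, the conversation templates, the -200 image token index) come from the cloned LLaVA repo as described, but the template name, file name, and generation settings are assumptions for illustration; the notebook's eval_model wrapper does essentially this.

    import torch
    from PIL import Image
    # Helpers from the cloned LLaVA repo (names as used in its evaluation code).
    from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
    from llava.conversation import conv_templates
    from llava.mm_utils import tokenizer_image_token, process_images

    image = Image.open("knight_and_rook.jpg").convert("RGB")   # placeholder file name
    query = "What do you see in this picture?"

    # Build the chat-formatted prompt with the image placeholder token in it.
    conv = conv_templates["mistral_instruct"].copy()           # assumption: template for the Mistral base
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + query)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    # The custom tokenizer inserts the -200 image token index where the image goes.
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX,
                                      return_tensors="pt").unsqueeze(0).to(model.device)
    image_tensor = process_images([image], image_processor, model.config).to(
        model.device, dtype=torch.bfloat16)

    with torch.inference_mode():
        output_ids = model.generate(input_ids, images=image_tensor,
                                    image_sizes=[image.size],
                                    do_sample=False, max_new_tokens=128, use_cache=True)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))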
So next we're going to load the fine-tuning dataset, and we need to define a function to tokenize and create the labels. We take in a batch of data which has images and prompts, and the prompt is always going to be the same: "What do you see in this picture?". We tokenize the conversations together with the caption, which is the answer; the caption in the dataset is going to be the response. We need to tokenize the inputs without the response, and then tokenize the full inputs with the query and the caption as well. The reason for that is that when we calculate the loss and backpropagate, we're going to focus just on the captions, which are the outputs; we penalize the model and backpropagate on that basis, not on the full sequence of tokens that goes into the model. So with this we've got a function that sets up a tokenized version of the full input, including the captions, creates an attention mask, and includes a loss mask within the labels.

Just to make a specific comment on that: the labels are the targets. When we generate a predicted token, let's say the predicted token is "hello", we compare it to what it should have been, and what it should have been is the label. The way we get that label is by taking the inputs and shifting by one; the shift by one is done automatically by the trainer, so all we need to do to get a list of targets is copy the inputs over to the labels. Once we have that set of labels, which is just a copy of the inputs, we convert some of them to be ignored: we want to ignore the labels towards the start (the prompt), and we do not want to ignore the labels that correspond to the caption. You'll see this more clearly when I show a printed example of some data, and there's a minimal sketch of the masking just below. So I've run this here and prepared the dataset.
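Here is a minimal sketch of that label construction: copy the input ids to make the labels, then mask the prompt tokens and the padding with -100 so that the loss, and therefore backpropagation, only applies to the caption tokens. The -100 value is the index PyTorch's cross-entropy loss ignores; the token values in the example are made up.

    import torch

    IGNORE_INDEX = -100

    def build_labels(input_ids, prompt_length, pad_token_id):
        """input_ids: (batch, seq_len) tokens of prompt + caption + padding."""
        labels = input_ids.clone()                        # targets start as a copy of the inputs
        labels[:, :prompt_length] = IGNORE_INDEX          # don't penalize the prompt tokens
        labels[input_ids == pad_token_id] = IGNORE_INDEX  # don't penalize padding either
        return labels                                     # the trainer shifts these by one internally

    # Toy example: 4 prompt tokens, 3 caption tokens, 2 padding tokens (pad id = 0).
    input_ids = torch.tensor([[5, 6, 7, 8, 21, 22, 23, 0, 0]])
    labels = build_labels(input_ids, prompt_length=4, pad_token_id=0)
    print(labels)   # tensor([[-100, -100, -100, -100, 21, 22, 23, -100, -100]])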
Next up we're going to set up LoRA. With LoRA, low-rank adaptation, we fine-tune not the model itself but rather a low-rank representation of the model. We first print the model to see all of the layers within it. First off we have the Mistral model, so you can see the attention layers of Mistral (there are 32 layers, from 0 to 31), the multi-layer perceptron layers of Mistral, and the input layer norm and post-attention layer norm. Then you can see the vision tower: this is the vision model, a CLIP encoder. You can see the embeddings, and within the encoder itself there are 24 layers, which also have attention (k, v, and q projections) and linear layers; those are the CLIP multi-layer perceptrons, with fc1 and fc2. Last of all we have the projector, and the projector is the adapter I talked about: you can see it has a linear layer, a GELU activation, and another linear layer. This is the adapter from back in the presentation, represented in the code as the projector. So you can see each of the elements: up here the Mistral model itself, then the CLIP encoder, and then the adapter.

What we want to fine-tune is the attention layers and the multi-layer perceptrons, and by the way, when we specify the q projection it's also going to create a LoRA adapter for q in the vision encoder. So we're going to fine-tune the attention and the MLPs, the multi-layer perceptrons, in both the language model and the vision encoder, and we can also optionally train the projector itself if we want to train that adapter. Here you can see I've specified the attention layers, I've specified the multi-layer perceptrons, and then here the mm_projector, which is for fine-tuning the adapter (there's a sketch of this setup after this section). When we get that LoRA model and print the parameters, you can see that about 4% of the total parameters, now including the LoRA parameters, are trainable. So we're really just training a very small subset, but they are low-rank adapters, so they give a nicely smoothed update to the overall model's performance.

Before going further into training, let me quickly recap. We've done the installation, we've loaded the model, we've set up the functions that allow us to run inference before and after training, we've set up the fine-tuning dataset, and we've applied LoRA adapters to the model, so we have the adapters ready to train: we freeze the main model, train those smaller adapters, and then merge them back onto the main model. Now we're going to run evaluation before training, then run training, and then run evaluation after training.

I've reloaded the script with the base LLaVA 1.6 7B model, so we don't have the issue with the patched version loaded, and now we're going to run evaluation across each of the samples in the test dataset, of which there are three. Running that: first we have a bishop and a knight, and you can see the answer before training is that on the left there's a rook and on the right is a pawn, so that's not quite right. The next example is of two bishops and a rook, and the answer is that the image shows a wooden chess set on a wooden table with three pieces, a king, a queen, and a rook, so not quite right either. The third example is a rook and a knight, and the model sees a rook on the a1 square, a knight on the b1 square, and a pawn on the c1 square, so it's seeing squares even though the pieces are just on a table, not on a chess board. Let's see if we can improve performance now by running training.

For training we're going to use the standard Hugging Face trainer. We set the bf16 parameter to true, which means we're using brain float 16, not fp16, which would cause instability. We use a learning rate of 1e-4, we train for three epochs, and we keep an eye on the eval loss to make sure it's not rising, which would indicate overfitting. We use a train and an eval dataset, and the batch size will be four for training; eval uses a batch size of six, which is really not relevant because our eval set is only three rows, so you could just reduce that down to three. Now we run the training. It will automatically connect to Weights & Biases, which you have to sign in for. Looking at the training loss, you can see that it's dropping; it does jump a little bit here at the end, but because the validation loss is dropping I know the training is relatively stable. As you can see, it looks like I could probably train a little bit more; the loss is maybe asymptoting a tiny bit, but it certainly seems like there's room to train a little more.
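As a sketch of that setup with PEFT and the Hugging Face Trainer, the configuration below mirrors the settings discussed: LoRA on the attention and MLP projections (which also catches the matching projections in the vision tower), the projector trained in full, bf16, a 1e-4 learning rate, three epochs, and batch sizes of four and six. The target-module names, LoRA rank, and the collator/dataset variables are assumptions based on typical Mistral and LLaVA naming and on the data-prep step above, so they may need adjusting to the actual printed module names.

    from peft import LoraConfig, get_peft_model
    from transformers import Trainer, TrainingArguments

    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,     # assumed rank/alpha, pick to taste
        # Attention and MLP projections; the same names also match the CLIP vision tower.
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        modules_to_save=["mm_projector"],           # one way to also train the adapter itself in full
    )
    model = get_peft_model(model, lora_config)      # `model` from the loading step above
    model.print_trainable_parameters()              # roughly a few percent of the total parameters

    training_args = TrainingArguments(
        output_dir="llava-chess-lora",
        bf16=True,                                  # brain float 16, not fp16
        learning_rate=1e-4,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=6,               # capped by the 3-row test set anyway
        evaluation_strategy="epoch",
        logging_steps=1,
        report_to="wandb",                          # logs to Weights & Biases if you're signed in
    )
    trainer = Trainer(model=model, args=training_args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset,
                      data_collator=data_collator)  # datasets/collator from the tokenization step above
    trainer.train()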
Nonetheless, let's take a look at the results of evaluation after training. To evaluate after training, we run a very similar script as before and iterate through each of the three examples. In the first one we have a bishop and a knight, and the answer is a white king and a white queen, so the colors are correct but the pieces are wrong. In the second case we have a rook, a bishop, and a bishop, and the answer is a black king, a black queen, and a white king, so the colors again are correct but the pieces are not. And in the last case we have a rook and a knight, and the answer is a black knight and a white rook, so in this case it's getting the answer correct. Again, this is the 7B model; I'm going to show you the 34B model in a moment, but this gives you a sense of the kind of performance to expect, albeit on a very small dataset. I think using a larger dataset, taking images of the pieces from different angles to give more perspectives, would allow us to improve performance, perhaps combined with training for one or two more epochs, until we see that eval loss starting to rise again.

Before wrapping up, I want to clarify a few points. The first is around the amount of VRAM you need. I find that when running the version 1.6 Mistral model I actually need three A6000s, so over 100 GB of VRAM; two A6000s are not enough to load and fine-tune in 16 bits. If you want to do the 34B model, that's going to take over 300 GB of VRAM, and for that I used eight A6000s, but I think seven is probably enough.

The second point I want to clarify is around tokenization. Here are the tokens from a row of training data. What I can do is copy-paste this into a vector just for demonstration purposes, take out the -200 entry in it, and then decode it and print the output. This shows us what that row of data represents: it's the begin-of-sequence token, then the instruction start token, then the prompt "What do you see in this picture?", then the instruction end token, and then the caption, which is "one black rook". After that there's the end-of-sequence token, followed by padding tokens, which are the unknown token in this case. The reason I deleted the -200 is that it's used to represent the image itself; in fact it is the image token index, -200, and that's how LLaVA represents the image token within the model. That's why, when we see the first row of data like this, it actually corresponds to this prompt with the image token inserted right here. I should note that there's a custom tokenizer being used, tokenizer_image_token, which handles that special token; if you just use a normal tokenizer, that's going to cause issues for doing inference or fine-tuning on a LLaVA model.

The third point I want to highlight is around the tensors that are produced to represent the images. You can see here that we have a batch, and this is a batch of four rows of data, and here is the tensor representing the images. It starts with four because there are four rows in the batch. Then, working backwards, you can see 336 by 336: this is the number of pixels, and it corresponds to the input size for the CLIP vision encoder, which is 336 by 336 pixels. The number three represents red, green, and blue, the three colors that together represent the color of a given pixel. And then there's this further dimension of five, which means there are multiple representations of the same image being used. What LLaVA is doing in v1.6 is representing the image as a reduced version of the original image, but then also calculating patches for that image, so it might have one main image and then four patches that are superimposed over parts of that image. This is beneficial because it allows the model to see the image in five different ways: once as a whole, and then as four other patches. So that's why we have four rows, and for each row the image is represented five times, by four patches plus one big patch, then we have the three RGB channels, and then we have the pixels for each of those five images.
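Here is a small sketch of those two inspection steps: stripping the -200 image token index out of a row of input ids before decoding it, and checking the shape of the image tensor. The example token values are invented; the -200 value and the (batch, 5, 3, 336, 336) shape are as described above for LLaVA 1.6.

    import torch

    IMAGE_TOKEN_INDEX = -200   # how LLaVA marks the position of the image in the token sequence

    # A made-up row of training data: BOS, [image], prompt tokens, caption tokens, EOS, padding.
    row = torch.tensor([1, IMAGE_TOKEN_INDEX, 733, 16289, 28793, 624, 2687, 408, 1443, 2, 0, 0])

    # A normal tokenizer can't decode -200, so drop it first (the real code uses the
    # custom tokenizer_image_token helper when going the other way, from text to ids).
    ids_without_image = row[row != IMAGE_TOKEN_INDEX]
    # print(tokenizer.decode(ids_without_image))   # would show the prompt, the caption, EOS, padding

    # The image tensor for a batch of four rows in LLaVA 1.6:
    images = torch.rand(4, 5, 3, 336, 336)
    batch, views, channels, height, width = images.shape
    print(batch, views, channels, height, width)
    # 4 rows; 5 views of each image (one resized whole image plus 4 crops/patches);
    # 3 RGB channels; 336 x 336 pixels, matching the CLIP vision encoder's input size.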
And that's it for this very first primer from Trelis Research on vision models. This area is going to advance rapidly, with models that support even larger images and use more performant language models as the building blocks, so I'm excited to see where this can go. I think it allows for a lot of applications where you can take a dataset of images, whether that's medical, sports, or otherwise, and use it to fine-tune models that can then be used in a conversational format. As usual, let me know your comments down below. Cheers!
Info
Channel: Trelis Research
Views: 1,772
Keywords: llava, llava 1.5, llava 1.6, llava fine-tuning, llava 1.6 fine-tuning, llava training, llava multi-modal, idefics fine-tuning, idefics, llava 34b, llava 7b, multi-modal, gpt vs llava, gpt vision
Id: eIziN2QUt8U
Length: 51min 6sec (3066 seconds)
Published: Thu Feb 15 2024