Fine-tune my Coding-LLM w/ PEFT LoRA Quantization

Video Statistics and Information

Captions
hello community, let's fine-tune our own copilot, our own coding LLM. We will create a very personal generative AI for code generation, a code-generator LLM, or simply a code LLM, and we will do this with classical fine-tuning and with PEFT LoRA tuning. But let's start at the beginning: we need our training data set, we need content. So we go out, and I will show you the code: we go to GitHub repos, specific GitHub repos we have chosen, and from those repos we create our own data set of code sequences that will form the training base for our fine-tuning. I will show you the code for how to do this. And of course the beauty of doing it ourselves is that we choose the best GitHub repos out there for our specific task. This means, if you think of all the GitHub repos as little 3D cubes (and the code we extract is the little dust you see in this image), I do not need to extract from thousands and thousands of repos. Maybe those thousand repos cover a hundred different topics; I am focusing on my task, my actual topic, let's say it is how to fine-tune models, and I just want the best three or the best ten GitHub repos on this topic, repos that are up to date, updated just an hour ago.

So today we will do a full fine-tuning of a pre-trained coding LLM, and we will use StarCoder as our pre-trained code LLM. We will use some beauties like FlashAttention-2 and fully sharded data parallel processing, and I will explain why we need this. And of course, and this is maybe why you are here, we will do the same thing with a PEFT LoRA fine-tuning on our code data set; we will even go with quantized LoRA and multiple LoRA adapters. We are going to use Hugging Face Accelerate or DeepSpeed, and we will go through the code exactly. I will give you the code base so you can download it and try this on your specific tasks. Great, let's start.

Step one: we have to understand our foundation, the pre-trained LLM, which is StarCoder. What input data structure has it been trained on, what is the architecture of the model, and how can we optimize it in a later fine-tuning step or in a LoRA step? Let us use a demo on Hugging Face Spaces to find out what StarCoder is doing. This is a fine-tuned version of the StarCoder base model specifically focused on Python (it also shows strong performance in other languages, but we are talking about Python here). So I choose StarCoder, and this is my input: a very simple function. As you can see, I program something and then I just say "fill here": I have this code and I do not know how to continue; if the length of list one is bigger than the length of list two, how do I handle this? I say generate, and here is the output. You see, this was a code-completion task performed by StarCoder.

Or let's take another example for StarCoder. Our input is: X_train, y_train, X_test, y_test from the train_test_split functionality; train a logistic regression model; predict the labels on the test set; compute the accuracy score. We say calculate, and you say, wow, okay: we get a logistic regression model, a random forest model, a support vector machine model, a k-nearest-neighbors model, a decision tree model, a multi-layer perceptron model. So now we understand what StarCoder can do, and we want to optimize StarCoder for our own purposes.
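If you want to reproduce this kind of completion locally rather than in the Hugging Face Space, a minimal sketch with transformers could look like the following. The checkpoint id bigcode/starcoder and the fill-in-the-middle special tokens are what the BigCode model card documents; the concrete prompt and generation settings here are just illustrative assumptions, not the demo's exact code.

```python
# Minimal sketch: fill-in-the-middle completion with StarCoder.
# Assumes you have accepted the model license and have enough GPU memory
# (the 15B model in bf16 needs on the order of 30+ GB of VRAM).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bigcode/starcoder"  # assumed checkpoint; a smaller code model also works for testing
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The model generates the code that belongs between the prefix and the suffix.
prompt = (
    "<fim_prefix>def merge_lists(list_one, list_two):\n"
    "    if len(list_one) > len(list_two):\n"
    "<fim_suffix>\n"
    "    return merged\n"
    "<fim_middle>"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```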
We have our GitHub repos somewhere on servers worldwide, and we extract the code information from specific repos and put it on our compute infrastructure. We have StarCoder as the pre-trained model, and now we do the fine-tuning on our specifically extracted code data to get an expert system: a very specific coding system, not a coding system like Microsoft's GitHub Copilot, which is everything to everybody in this world. We will have an expert coding system that is really current.

So why do I show you an image of distributed GPU computing? Let's do a quick fly-over of GPU memory in general. If you look at the minimum GPU memory requirements for the full, classical fine-tuning, you come close to 250 GB of GPU memory; remember, we are using StarCoder, a model with 15 billion trainable parameters. And this already considers intermediate activation checkpointing, FlashAttention-2 and gradient checkpointing, which gives us a boundary condition on the memory occupied per GPU in our node: about 70 to 78 GB per GPU. This means that for the fine-tuning of StarCoder 15B we will use a compute node with eight A100 80 GB GPUs. You see why we need distributed computing, and we will use every optimization we can imagine, including the PyTorch fully sharded data parallel technique. Now you see why I am talking about distributed computing when we talk about full fine-tuning; this will change once we apply LoRA or quantized LoRA. If you want to understand what gradient checkpointing is, ask Bard to explain it in a simple way.

If you want to find out how much VRAM you need to train and to perform big-model inference, there is the Hugging Face Accelerate Model Memory Usage space. You just put in a specific Hugging Face model; I take Zephyr, the 7-billion-parameter model in its beta version, select the library, tick all four model precisions, and say "calculate memory usage", and you get the memory usage depending on the numerical precision. For float32, the full precision, you get the largest layer, the total size (we are talking about VRAM), and the training requirement with an Adam optimizer; training an LLM with an Adam optimizer takes about four times the size of the model itself. So for float32 with Zephyr 7B beta we need about 110 GB; with bfloat16 we need half of that, 55 GB; with quantization down to integer 8 we are at 28 GB of VRAM; and with integer-4 quantization we are down to about 14 GB, all calculated by the Hugging Face Spaces Accelerate estimator. Whatever Hugging Face model you put in, it gives you a rough estimate of the total size and of the VRAM needed for training with an Adam optimizer (see the back-of-the-envelope sketch below).

But you might say: hey, wait a minute, yesterday (I mean it is yesterday from my perspective, recording this video) there was the OpenAI developer day, and they said we can build our own little GPTs with a GPT Builder. We give the new GPT in our GPT store a name, a description of what we want it to do, an instruction, and you see this little GPT is already being created by OpenAI. Then there is the knowledge integration, the update: for example, we can upload files. This is great, but remember that even GPT-4 Turbo has its limits.
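Here is the back-of-the-envelope version of that "training with Adam is roughly four times the model size" rule of thumb. The numbers are deliberately rough: they ignore activations, CUDA overhead, and the batch- and sequence-length-dependent terms that the Model Memory Usage space also leaves out of its headline figure.

```python
# Rough VRAM estimate mirroring the rule of thumb quoted from the
# Hugging Face Model Memory Usage space: full training with Adam/AdamW
# needs roughly 4x the weight memory (weights + gradients + 2 optimizer states).
GIB = 1024**3

def rough_vram_gb(n_params: float, bytes_per_param: float) -> dict:
    weights = n_params * bytes_per_param
    training = 4 * weights  # weights + gradients + Adam moment estimates
    return {"weights_GB": weights / GIB, "adam_training_GB": training / GIB}

for name, n in [("7B (e.g. Zephyr-7B-beta)", 7e9), ("15B (StarCoder)", 15e9)]:
    for precision, bpp in [("fp32", 4), ("bf16", 2), ("int8", 1), ("int4", 0.5)]:
        est = rough_vram_gb(n, bpp)
        print(f"{name} {precision}: ~{est['weights_GB']:.0f} GB weights, "
              f"~{est['adam_training_GB']:.0f} GB to train with Adam")

# Recent accelerate versions also expose a similar estimate on the command
# line (`accelerate estimate-memory <model_id>`), which is what the Space wraps.
```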
Even with a 128k context length, the token budget is very limited, so if you want to upload, I don't know, 100 GitHub repos, you will run into trouble. And by the way, even if we click on code interpreter: the very latest code interpreter of GPT-4 Turbo has a knowledge cutoff of April 2023, so I lose everything that happened this summer; it is not included in my code interpreter. And this does not seem to be fine-tuning; it seems to be something like a RAG process, where some files are integrated in addition to the knowledge of the code interpreter. So I cannot feed in my complete, updated-as-of-today knowledge. This is a problem we have to face, and this is why I make this video on fine-tuning your own code LLM. Again: the code interpreter's knowledge ends in April 2023. If you want all the knowledge of the last half year, and there were a lot of papers, and if you are into AI research like I am and you encounter code from the authors of a research paper that is just three days old, five days old, twelve hours old on their GitHub repo, you have no way to use the classical OpenAI code interpreter to integrate the latest results from last week. So this is the reason I show you how to fine-tune the model of your choice on a new data set: the data from April 2023 to November 2023 is the data set we want, and we want it in a fine-tuned version, so that it is really integrated into the reasoning of our new code LLM.

I want to show you specific code, and I have chosen this code very carefully. I want to show you two professionals; they both work for Hugging Face, so I go with two Hugging Face wizards, as I call them: Sayak Paul and Sourab Mangrulkar. They have public repos on GitHub and they keep updating their code, which is very beautiful, so if you come back in one or two months you will have updates. This is why I have chosen their particular code implementations. And if you like it, why not give them a like, a hug or a follow; I think they would appreciate it.

Let us start by extracting from GitHub all the public repos we are allowed to look at. We follow Sayak Paul's script here: as you see, we have the organization "huggingface" and we take the public repos. There is a function get_repos that fetches the repositories for a particular GitHub user or organization, in our case huggingface. Then we mirror: we make a local clone of each repo, cloning with a subprocess, and then we mirror the repos. Great, and that is it: now we have our repos. From those downloaded repos we now have to prepare the data set, so we just go to the next file, prepare_dataset.py.
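The transcript only describes that script at a high level (a get_repos function that lists an organization's public repositories, plus a subprocess-based mirror clone run in parallel), so here is a hedged re-creation of the same idea rather than the authors' actual file. It assumes the PyGithub package and a GITHUB_TOKEN environment variable.

```python
# Rough re-creation of the workflow described in the video: list the public
# repositories of an organization (here "huggingface") and mirror-clone them
# locally with subprocess. This is NOT the original script, just a sketch.
import os
import subprocess
from multiprocessing import Pool

from github import Github  # PyGithub: pip install PyGithub

ORG = "huggingface"
MIRROR_DIR = "cloned_repos"


def get_repos(org_name: str) -> list[str]:
    """Fetch the clone URLs of an organization's public repositories."""
    gh = Github(os.environ.get("GITHUB_TOKEN"))  # a token avoids strict rate limits
    org = gh.get_organization(org_name)
    return [repo.clone_url for repo in org.get_repos(type="public")]


def mirror_clone(clone_url: str) -> None:
    """Make a local mirror clone of a single repository."""
    os.makedirs(MIRROR_DIR, exist_ok=True)
    subprocess.run(["git", "clone", "--mirror", clone_url], cwd=MIRROR_DIR, check=False)


if __name__ == "__main__":
    urls = get_repos(ORG)
    with Pool(processes=8) as pool:  # clone several repos in parallel
        pool.map(mirror_clone, urls)
```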
As you can see, we are working with the notebook format: we import the read and NO_CONVERT commands (from nbformat), we have the Hugging Face Hub, and again subprocessing, beautiful. So the data set: yes, we have created it, we serialize it in chunks, beautiful. And then we say: okay, from all this information in the GitHub repos we do not want the following. No images (PNGs, JPEGs or GIFs), no video, because we just want to extract the code, and no sources that are PDFs, DOC files or PowerPoint slides. We want to extract the code, so all the formats we do not want have to be defined. Then we move all the files matching the file formats to a folder and upload this folder, for example, to the Hugging Face Hub, or to whatever your particular drive is. Then we filter the code cells: we filter out the code cells that contain shell commands. Then we process the file: we open the file, read the content of the notebook, and then, you are not going to believe it, for each code cell we do code_cell_str += the source code in that cell (see the sketch below). So this is how we do it. Then there is a function that reads all our repo files, reading them from the locally cloned repos: find all the files within the directories, process the files sequentially, and here you have it, this is it. Now we have extracted all the code-specific information from all the different repos.

If you want to see the result, it is on Hugging Face: the hf-stack-v1 data set, version one of our stack. So what did we achieve? Let's have a look. We have our public repos; we go, for example, to the accelerate repo and its setup.py, and here you have the code of this Python file, cleaned up and ready to be used in our fine-tuning. And there we have close to 6,000 rows, so you see where we are going with this information.

However, what I would also like to show you: what is the easiest way if you do not know what is happening in a piece of code? Very easy: just copy it. I copy the Python file, paste it in to analyze it, and see what happens. The GPT-4 code interpreter, the classical code interpreter, tells me exactly what each import statement is: the os module, subprocess, Pool from multiprocessing, the GitHub package, the GitHub API, the constants. Then it explains what each function does (get_repos, and so on) and gives you an explanation. So whenever you are stuck, this is a good way to use the help provided by the knowledge of GPT-4. In my customization I also request that the analysis consider security, error handling, performance, dependencies, best practices and potential improvements. And if you want to go even further and you have a particular wish, you would like to optimize this code: "can you optimize this code sequence you just analyzed?" And certainly, the code can be optimized in several ways: reduce the API calls, improve the error handling, limit the number of concurrent processes, use GitPython for cloning efficiency, skip already cloned repositories; and it gives you an optimized code version. Now you have to check very carefully whether this is really an improvement, but you see, it helps you in understanding the code.
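The dataset-preparation walk-through mentions nbformat's read and NO_CONVERT, an extension blacklist, and skipping shell-command cells. Here is a minimal sketch of those pieces; the file paths and the exact extension list are assumptions for illustration, not the original script's values.

```python
# Minimal sketch of the dataset-preparation ideas described in the video:
# skip binary/document formats, concatenate the source of notebook code cells,
# and drop cells that are shell or magic commands.
import os
from nbformat import read, NO_CONVERT

# Formats we do NOT want in a code dataset (images, video, documents, ...).
ANTI_FORMATS = (".png", ".jpg", ".jpeg", ".gif", ".mp4", ".pdf", ".doc",
                ".docx", ".ppt", ".pptx", ".zip")


def notebook_to_code(path: str) -> str:
    """Concatenate the source of all code cells, dropping shell commands."""
    with open(path, "r", encoding="utf-8") as fp:
        notebook = read(fp, NO_CONVERT)
    code_cell_str = ""
    for cell in notebook.cells:
        if cell.cell_type != "code":
            continue
        if cell.source.lstrip().startswith(("!", "%")):  # shell / magic commands
            continue
        code_cell_str += cell.source + "\n"
    return code_cell_str


def collect_files(root: str) -> list[str]:
    """Find all files under the cloned repos that are not in the blacklist."""
    keep = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(ANTI_FORMATS):
                keep.append(os.path.join(dirpath, name))
    return keep
```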
It also helps you in optimizing the code, and if you give the GPT-4 code interpreter directions on how you want the code optimized, it can perform this as well: it tells you "these are some ideas I have", and you can immediately try out whether it really is an optimization, if you need one. Beautiful. That was it: we extracted our data, cleaned it, and built our data set for the fine-tuning, and now we are ready for the next step.

So, now that we have the data set for our fine-tuning, we have to optimize our complete system to the max. We use everything out of the book that we know; we have to bring down the memory requirement; this thing should run fast and should not cost us a house. We have a lot of possibilities: data parallelism, tensor parallelism, pipeline parallelism, and if you combine all three into a 3D-parallelism paradigm, welcome, this is exactly what we are going to use. I will focus on DeepSpeed and FSDP.

So what is DeepSpeed? It is already beautifully integrated for us; the code is done. We just go to the Hugging Face Accelerate library, a high-performance library for training large LLMs on multiple GPUs, on a distributed system, or even on TPUs (tensor processing units) from Google. Hugging Face Accelerate integrates the DeepSpeed ZeRO Stage 3 technique: offloading gradient computation state to CPU or GPU memory, further reducing the memory footprint of training. If you want to get an idea of DeepSpeed, here is something by Bard.

Now we need to configure our compute infrastructure for DeepSpeed, and this normally happens in a YAML file. YAML ("YAML Ain't Markup Language") is a text-based data serialization format commonly used for configuration files, and here is one possible example with a short explanation of what it does. It simply tells the system what to do: which distributed type to use (DeepSpeed), which dtype we run in (bfloat16), the optimizer we want (AdamW), the learning rate, the weight decay, the scheduler, all the parameters, plus ZeRO optimization stage 3, offload optimizer, offload parameters, everything. I will show you our exact DeepSpeed YAML file for our code, and a minimal programmatic sketch of the same settings follows below.

But of course, and this is now more than a year old, the Accelerate library also integrates another distributed training technique, especially for PyTorch: PyTorch Fully Sharded Data Parallelism (FSDP). If you do not know the term "sharding", it is borrowed from databases, where a large database is partitioned, or sharded, into smaller, more manageable pieces. So "sharded" refers to the partitioning of a model's internal states, its parameters, its gradients and its optimizer states, across the memory of multiple GPUs. This division allows each GPU to store and manage only a rather small portion of the model, thus reducing the memory burden on any single GPU.
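The usual route is the YAML configuration produced by `accelerate config` and consumed by `accelerate launch`, exactly as shown in the video. As a rough Python-side illustration of the same knobs, Accelerate also accepts these settings programmatically via plugin objects; the concrete values below are assumptions, not the video's exact config, and running it for real requires the deepspeed package and an `accelerate launch` environment.

```python
# Rough Python-side equivalent of the accelerate YAML configs discussed in the
# video: ZeRO stage 3 with CPU offload and bf16 mixed precision. Illustrative
# values only; the YAML route via `accelerate config` is the standard one.
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,                      # ZeRO optimization stage 3
    offload_optimizer_device="cpu",    # offload optimizer states to CPU
    offload_param_device="cpu",        # offload parameters to CPU
    gradient_accumulation_steps=1,
    gradient_clipping=1.0,
)

accelerator = Accelerator(
    mixed_precision="bf16",            # brain float 16, as in the video
    deepspeed_plugin=deepspeed_plugin,
)

# For FSDP instead of DeepSpeed, accelerate provides the analogous
# FullyShardedDataParallelPlugin; which one fits depends on your cluster.
```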
If you want a deep dive, there are three resources I would recommend. Number one is the official PyTorch post introducing the PyTorch Fully Sharded Data Parallel API, from March 14, 2022; it tells you exactly how this data-parallel technique works in PyTorch and gives you code examples, so go there if you are into distributed GPU computing. Number two is from Hugging Face: accelerating large model training using exactly this FSDP, from May 2, 2022; there you also see how to set up your FSDP config file and your compute environment, everything you need for multi-GPU FSDP, with all the code examples you need. And if you want to see this in action, the third one is also from Hugging Face: fine-tuning Llama 2 70B, so a huge model, using PyTorch FSDP, from September 13, 2023. There you have all the latest advancements and tricks applied to the Llama 2 70B model; they tell you exactly the hardware, how to start, all the notebooks, the data set, the prerequisites, the fine-tuning, the problems they encountered, and so on. It is a deep dive that takes, I don't know, two or three days to really get familiar with, but please only do this if you are actually working on a distributed GPU cluster or node. Great, those are the three resources I would recommend.

As I told you, here are the configuration files for our particular fine-tuning code: on the left side the complete config file for FSDP, and on the right side the one with distributed type DeepSpeed. These are the parameters you have to define for your particular configuration. Now, I am not an expert on when to use DeepSpeed and when to use FSDP, so I asked Google Bard, and Bard told me: DeepSpeed is a good choice for training larger models on more devices, on more GPUs, while FSDP is a good choice for training even larger models on fewer devices. If you are an expert and work with DeepSpeed or FSDP on a regular basis, please leave a comment under this video: would you agree with what Google Bard says, or do you have other experiences? It would be great to have some feedback on this point.

Great, now let's talk about optimization and memory reduction from the methodology side: parameter-efficient fine-tuning with quantization. We have parameter-efficient fine-tuning (PEFT), we use one methodology, low-rank adaptation (LoRA), and we use its quantized version (QLoRA). If we do the same thing with StarCoder and its 15 billion trainable parameters, but now use quantized LoRA instead of full fine-tuning, our trainable parameters shrink from the 15B model to just 110 million LoRA parameters. This is great, because suddenly the trainable tensors of our LoRA model are just 0.7% of the classical StarCoder 15B model (a minimal QLoRA setup sketch follows at the end of this section). You can think about how much memory you need: if you apply FlashAttention-2 and gradient checkpointing and take everything into consideration, the total memory occupied by the model now fits on a single A100, and just a 40 GB one, not an 80 GB one, because our complete memory requirement is about 26 to 27 GB. A long sequence length of 2K and a batch size of 4 for training lead to higher memory requirements, but remember: compared to the full fine-tuning, where the VRAM was close to 250 GB, we now only need about 26 GB. This is great. Yes, I know, unfortunately it does not fit on an RTX 4090, but it does fit on a single A100 40 GB, so this is really cheap. By the way, I have seen in the documentation that A100 GPUs have better compatibility with FlashAttention-2, and that classical gaming GPUs can have compatibility problems when you apply FlashAttention-2. So be careful: there seems to be an issue with FlashAttention-2 on the latest non-professional, non-data-center NVIDIA cards.
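To make those LoRA numbers concrete (roughly 110 million trainable parameters, about 0.7% of StarCoder 15B), a minimal QLoRA setup with peft and bitsandbytes might look like the following. The rank, alpha and target module names are assumptions for illustration, not the exact settings of the video's code base; verify them against the scripts and the model architecture you actually load.

```python
# Minimal QLoRA setup sketch: load StarCoder in 4-bit and attach LoRA adapters.
# Hyperparameters (rank, alpha, target modules) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prepares the quantized model for training

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Projection names assumed for the StarCoder (GPTBigCode) architecture;
    # check the loaded model's module names before relying on them.
    target_modules=["c_proj", "c_attn", "q_attn", "c_fc"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # should report trainable params well below 1%
```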
End of part one. Next video: Fine-tune Co-pilot, Part Two.
Info
Channel: code_your_own_AI
Views: 3,279
Keywords: artificial intelligence, AI models, LLM, VLM, VLA, Multi-modal model, explanatory video, RAG, multi-AI, multi-agent, Fine-tune, Pre-train, RLHF
Id: CUJexhbvBqM
Length: 29min 5sec (1745 seconds)
Published: Sat Nov 11 2023