Hello everyone, and thank you for joining today's session. My name is Rhys and I'll be your moderator today. We're going to kick off the session in a couple of minutes; we're just waiting so everyone has a chance to join. In the meantime, we'd love to hear from you, so let us know where you're joining from using the chat or the comments, depending on what platform you're watching on, and tell us something you'd like to get out of today's session. We are going to be using Google Colab today, so I'll be sharing a link so you can code along with us as we get started; keep your eyes peeled in the chat. If you haven't already got a Google account so you can use Colab, make sure you get that set up so you can code along with us in the session. I'll also be sharing a link now so you can get registered for the event, which means you'll get the recording emailed to you as well as any resources we send out. I'll be sending links so you can code along with us, and I'll send the solution notebooks at the end of the session, so make sure you stay to the end to get those as well. So, a quick reminder: get registered for the event today and we will send you the recording as well as any resources, and you'll get everything you need for the code-along today, as well as being able to do it on demand at home. Brilliant, I think that's everything from me for the moment, so I'll leave you with the background music and be back to repeat these messages for any new joiners shortly.

Hello everyone, and thank you for joining today's code-along. My name is Rhys and I'll be your moderator today. We're going to kick off the session in about a minute or so; we're just waiting so everyone has a chance to join. Let us know where you're joining from using the chat or the comments, depending on what platform you're watching on, and tell us something you'd like to get out of today's webinar. A few notes for anyone that's just joined: we are going to be using Google Colab today, so if you don't have a Google account and don't have Colab set up, make sure you get that sorted. I'll be sharing a link in the chat shortly so you can get set up with a notebook and code along today. If for any reason you need to leave, make sure you register for the event so that you get the recording sent to you, as well as everything you need to complete this code-along in your own time. I've sent a link into the chat, but you can also find it at datacamp.com/webinars. So if there's one thing you do right now, register for the event and we will send you all of the goodies you'll need to complete this on demand. Brilliant, I think that's everything from me. I'll hand you over to your host for today's session, Richie. Richie, please take it away.

Hi there, data scamps and data champs, this is Richie. Now, Llama 2 is one of the most powerful large language models available right now. But while even the most powerful LLMs are pretty good at generating text on a very wide range of subjects, if you want to solve a specific text problem, you get even better performance by fine-tuning. This works really well for things like text classification, or if you want to generate text with a specific tone of voice; then you're going to want to do fine-tuning. So that's what we're going to do today.
Now, of course, the key to success with AI is making sure you have great data to feed into your model, so you'll be doing some data prep work before the fine-tuning. Our guest today is Maxime Labonne. He's a senior machine learning scientist at JPMorgan Chase, he has a PhD in machine learning, and for the last four or five years he's been working on large language models and graph neural networks. He applies his expertise across a range of sectors, from R&D to industry, finance, and academia, and he's also the author of Hands-On Graph Neural Networks Using Python; we'll have a link to that in the chat. So without further ado, over to you, Maxime.

Thanks, Richie. Hi everyone. In this session we are going to cover two notebooks: the first one is dedicated to the creation of the dataset, and the second one is dedicated to the fine-tuning of the model. We're going to take a very beginner-friendly approach to this topic, and at the end of the session I will present some resources to go beyond that, so you can also use better tools and do other things. But in this session we're going to stick to the basics and really try to learn the theory behind it, so we're more aware of what's going on when we use automated tools.

In this first notebook we're going to talk about datasets and how to create a high-quality dataset. I want to mention that there are basically a few kinds of datasets we're interested in. The first kind is instruction datasets, where the inputs are instructions. This is basically how you use ChatGPT: you say "write me something" and the model is supposed to write you something, not just complete what you said, because that would be raw completion. Completion is part of the pre-training of these models: they are pre-trained on a lot of data with a next-token prediction task, so they are just there to predict what the next word is going to be. That's nice, but not really what we want if we're creating assistants. So instead we're going to use supervised fine-tuning to turn a base model into a useful chat model.

There are also preference datasets; I want to mention them briefly. These are used for reinforcement learning from human feedback, which is often applied after the supervised fine-tuning process. In these datasets you'll find different answers and some kind of ranking of those answers, to say "this one is more useful than the other one". We'll come back to this during the fine-tuning part in the second notebook. Finally, there are other types of datasets: it can be sentence classification, it can be code datasets with a fill-in-the-middle objective, where you want to fill in code that has some context before and after your cursor. We're not going to talk about those; it was just a general overview of the kinds of datasets you can expect in this space.

So, as mentioned, here we're just going to use supervised fine-tuning with an instruction dataset, and we're going to build our own by filtering an existing dataset. The one we're going to use today is the Open-Platypus dataset, which is actually a collection of different datasets. It has already been filtered against other datasets, but we're going to filter it even more. We can check what it looks like on its Hugging Face page.
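To make the two dataset formats concrete, here is a small illustrative sketch of what one record from each kind might look like. The field names are just common conventions, not something taken from the session:

```python
# Illustrative record shapes only; field names vary between datasets.

# Instruction dataset: one prompt, one reference answer.
instruction_sample = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "",          # optional extra context
    "output": "The paragraph argues that data quality drives fine-tuning results.",
}

# Preference dataset (for RLHF): one prompt, a ranked pair of answers.
preference_sample = {
    "prompt": "Explain what a token is.",
    "chosen": "A token is a small unit of text, such as a word piece, that the model processes.",
    "rejected": "Tokens are coins.",
}
```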
Here you have the instructions, for example "a board game spinner is divided into three parts", and so on, and the output that is expected from the model. This is basically what your model is going to learn during the supervised fine-tuning process: it learns to output this answer when it gets this instruction. We repeat that a lot of times, and at some point the model gets pretty good at understanding what it needs to do, which is to provide a useful answer to the instruction. You can find more information about the Platypus dataset on that page, and there's also a really nice paper about how they built it and how they trained their models, called Platypus; an interesting read if you want to know more about it.

First of all, let's start with the code. We need to install some libraries, so we're going to pip install a bunch of them: datasets, the Hugging Face library for handling data; transformers, which is very useful; sentence-transformers, because we're going to use it to create embeddings of our data (we'll see why in a few minutes); and finally faiss-gpu, a vector search library from Facebook. Here you can see the runtime: make sure you have a T4 GPU, or more if you can afford it, but the entire code should run on a free-tier Google Colab with a T4 GPU. For the sake of this exercise I'm using a V100 with high RAM, because it's just going to be a bit faster and we won't have to stare at the screen for 30 minutes.

The second step is really useful. This is the first time I've tried to use it, so please bear with me; it's quite a new feature in Google Colab. We now have a Secrets tab on the left, and as you can see, you can add a new secret and give it a name, like "huggingface", and a value. You can retrieve the value using this link if you have a Hugging Face account: I copy the token, write "huggingface" here as the name, paste the value, and give the notebook access. This should be a really clean way of managing your secrets, because they're shared across all your notebooks and they won't be shared with anyone outside your account.

Sorry to jump in. We just got a question from the audience, and I don't know how widespread this problem is, but do you know what a "file not accessible" error message might be?

Oh, sorry, I thought we had tested that. Let me just try out the links that are shared. OK, anyone with the link should be able to access them. Can some people confirm that you can access it? Richie, can you access the notebooks?

I can access the notebooks. One thing might be the case: you do need to be logged into Google in order to access Colab, so if you're not logged into Google, you probably need to do that. All right, we'll continue for now; if anyone else has any problems, please do let us know in the chat or the comments. Oh, James says it might be a company blocking it for security reasons. Yeah, if you can't access any Google products from your corporate laptop, then you'll have to find a different network unfortunately, or just watch for now and follow along when you get the recording.

Yeah, exactly. Sorry about that; Google Colab can be a bit difficult to use in some contexts. Here we're going to import our secret using google.colab's userdata, with userdata.get, so now the HF token has the correct value, and this is what we're going to use when we need it in the rest of this Colab.
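For reference, the setup cells described here might look roughly like this sketch; the secret name "huggingface" is an assumption, it is whatever name you typed into the Secrets panel:

```python
# Install the libraries used in the dataset notebook.
!pip install -q datasets transformers sentence-transformers faiss-gpu

# Read the Hugging Face token from Colab's Secrets tab.
# "huggingface" must match the name you gave the secret in the panel.
from google.colab import userdata

HF_TOKEN = userdata.get("huggingface")   # only needed later, to push the dataset
```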
This is optional: we only need it when we want to upload the dataset, so if you don't have a Hugging Face account, you don't need to create one. I recommend it, because it's nicer and you'll be able to upload your own dataset to your own account, but if not, that's okay. I will upload it to my account and you'll be able to reuse it from there.

So now we have the Hugging Face token, and the first thing we want to do is load the dataset we want to use. In this case it's the dataset mentioned earlier, Open-Platypus; we can copy-paste its name from the page, and it should download and load the dataset. We'll see the different columns inside it, namely input, output, instruction, and data source, and the number of rows, which is almost 25k.

To see a bit more about it, we can read it as a pandas DataFrame, and with Google Colab we can even convert it into an interactive table, which is a nicer way of looking at it. Unfortunately it can take some time for Colab to do the conversion, but it's nice to see that we have the instruction and we have the output, and that's all that matters here. If this were a real raw dataset, what we would want to do at this point is really read, not every line, because that's too much, but a lot of these lines, to get a good understanding of what's in the dataset and what we expect, and also to clean it: if there are samples that are not high quality, that have bad English, or that are just plain wrong, we want to remove them. This is very important; we want to create the best dataset possible, and that means filtering out a lot of samples. We have this dataset and we could read everything here, but it's a lot of data, so we won't do that in this case; it would take too much time.

Now, let me show you the code once again: load_dataset and then the name of the dataset. What we're going to do next is use the transformers library to import the tokenizer, which converts the raw text into tokens, then import the matplotlib library to plot the results, and also the seaborn library for the same reason. First we want to import the tokenizer. In our case we want the tokenizer not from BERT, as suggested here, but from Llama 2, because that's the model we want to use. We're going to use NousResearch's version of Llama 2 and not the official one from Meta. Why? Because if you don't have a Hugging Face account, you won't be able to access the official, gated version, so NousResearch simply re-uploaded the entire thing. Now that we have the tokenizer, we can test it; it will download from Hugging Face, and then we print it once it's done. You can see it with all its special tokens and so on. When that's done, we want to use it to tokenize our instructions and also our outputs. The way we're going to do it is to create a list called instruction_token_counts, and we're going to use the tokenizer to tokenize every sample in our dataset.
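As a rough sketch of the cells being described, loading the dataset and the tokenizer and counting the instruction tokens might look like this; the repository IDs are the ones I believe were used in the session:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Open-Platypus: ~25k rows with input/output/instruction/data_source columns.
dataset = load_dataset("garage-bAInd/Open-Platypus")

# Llama 2 tokenizer, via the ungated NousResearch re-upload.
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")

# Token count for every instruction in the train split.
instruction_token_counts = [
    len(tokenizer.encode(sample["instruction"])) for sample in dataset["train"]
]
```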
The autocomplete suggestion here is not correct, actually, which is why you should not always trust code LLMs blindly. In our case it needs to iterate over dataset["train"], and then we close it. So in this list we get the token counts for every instruction, and we repeat the process for the outputs: pretty much the same thing, but using the output field. Finally, we want to combine the instruction and the output, because together they form the entire sample. So we add instruction plus output, call it combined_token_counts, and merge the two lists with a zip over instructions and outputs. That gives us the sum of the tokens of both instruction and output.

Now that that's done, we'll write a little function to plot the distribution of our tokens, so we have an easy visualization. Here I'm going to use sns.set_style("whitegrid") and slowly but surely build everything; actually, sorry, I'm going to copy-paste it, it's a bit faster. Now we can plot the distribution for the instruction, output, and combined token counts. It's a very simple plot, and we call it using plot_distribution. First we use instruction_token_counts, to see the distribution of token counts for instructions only. This part takes a bit of time, unfortunately, because it needs to tokenize the entire dataset, but once you've done it you can basically comment it out. We repeat the process with the output token counts, so this time it's for outputs only, and finally the combined token counts, for instructions plus outputs. Here I'm going to comment some of this out so we get faster results.

Now we have the distribution for all of our data. You can see that the mean is around 500 tokens, and there's a long-tail distribution that goes up to about 5,000 tokens. Why does this matter? Because these models have a certain context window, and if a sample goes beyond this context window it's not going to be very helpful, so it's important to know the number of tokens in our dataset. We might also want to sample more from samples that have more tokens, because they're going to be more informative than others, but we'll get to that. Basically, I'm going to put a threshold for this dataset at 2K tokens. The max context size for Llama 2 is actually 4K, but this is just an example to show how we can set a threshold.

So here we want to filter out rows with more than 2,000 tokens. One way of doing it is to write: i for i, count in enumerate(combined_token_counts) if count is below the threshold. So we retrieve the index of every sample where the combined token count is lower than 2K, and now we can print how many of these we have. If we print the length of dataset["train"] minus the length of valid_indices, we see how many we removed. We only removed 31 of them, but that's fine; you can use a more aggressive threshold if you want. In this case we're only removing a few, as you can see. Then we extract the valid rows based on the indices: we take dataset["train"] and use the select method to keep only the valid indices, and then we get the token counts for the valid rows.
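Continuing the sketch, the output and combined counts, the plotting helper, and the 2,000-token filter might look something like this; the plotting details (bins, figure size) are my own choices rather than the session's:

```python
import matplotlib.pyplot as plt
import seaborn as sns

output_token_counts = [
    len(tokenizer.encode(sample["output"])) for sample in dataset["train"]
]
combined_token_counts = [
    i + o for i, o in zip(instruction_token_counts, output_token_counts)
]

def plot_distribution(token_counts, title):
    """Histogram of token counts for one view of the dataset."""
    sns.set_style("whitegrid")
    plt.figure(figsize=(10, 5))
    sns.histplot(token_counts, bins=50)
    plt.title(title)
    plt.xlabel("Number of tokens")
    plt.ylabel("Number of samples")
    plt.show()

plot_distribution(combined_token_counts, "Token counts (instruction + output)")

# Drop rows whose combined length exceeds a 2,000-token threshold.
valid_indices = [
    i for i, count in enumerate(combined_token_counts) if count <= 2000
]
print(f"Removed {len(dataset['train']) - len(valid_indices)} rows")
dataset["train"] = dataset["train"].select(valid_indices)
```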
We can also plot this distribution and see the token counts after filtering, exactly like before. Oops, there's an issue here, because I executed the cell twice, which I shouldn't have; if you execute it once it should be okay. The data was already filtered, which is why it raised an error. And here you see we get a very different plot, because all of the right-hand part has been filtered out.

Another thing we can do is near-deduplication using embeddings, which is why we installed the sentence-transformers library. What we want to do here is embed every row, every sample from our dataset: we translate it into a vector, called an embedding, using an embedding model. How do you choose the best embedding model? It's a popular question. One way of answering it is to look at the MTEB leaderboard on Hugging Face, where you can see a competition between all the embedding models on different tasks. It's funny, because since I made this screenshot there are new ones on top of it. The one we're going to use here is the GTE embedding model. It's a really good model; it's not the best model, but it's going to be faster than other options, which is why we use it here. We're going to use the embeddings to calculate the similarity between samples, and when two samples are too similar, we filter one out.

So how are we going to do it? We're going to use the sentence-transformers library and import the SentenceTransformer class. We also import faiss, the vector search library from Facebook; it's not the best vector database, but it's very simple and minimalistic, which is why I use it in this example. From datasets we use Dataset and DatasetDict, to be a bit fancy; we also use tqdm, to have a nice loading bar; and finally NumPy for some operations.

Here we're going to put all the code into one function that does everything. We pass it a dataset, the name of the embedding model we want to use, and a threshold, for example 0.95, which means that when a sample is 95% as similar as another embedding, it's fishy and we probably want to filter it out. As the sentence transformer we use the model passed as an argument. For the outputs, we take every example in the dataset: what we want to deduplicate here are just the outputs, not the instructions. We're fine with similar instructions; we just do not want similar outputs, because that's what the model is going to be trained on. Then we say we're converting text to embeddings: we use the sentence model to encode the outputs, and we can even show a progress bar, to be fancy. Then we get the dimension of our embeddings; you can see on the leaderboard that some models have 1,024 dimensions, some have 768, so they have different dimensions and we take that into account. We create our index using the vector library, an IndexFlatIP in this case, and we need to normalize our embeddings. Faiss already has a function to do this, but I don't trust it that much, so we're going to do it on our own, because I had a bad experience with it and I think it's going to be better that way.
So here we use NumPy to normalize the embeddings: we divide by the norm to normalize the whole embedding space, and then we add the normalized embeddings to our index. Then we say we're filtering out near-duplicates, and in this part of the code we use the index search on the normalized embeddings we just added, with k equal to two, so it returns at most two vectors; we don't need more in this case. We create a list of the samples we want to keep, called to_keep, and this is the main loop: range, length, and so on, plus something nice for tqdm so we have a nice loading bar. What we want here is the similarity between these embeddings, and if the cosine similarity is below the threshold, we keep the sample: we append its index to to_keep. Then we create the new train split using the select method of the dataset object, to keep only those indices, and we return it as a DatasetDict. It's not the most elegant way of doing it, but it's going to be fine for the purpose of this exercise.

Then we can call our function, deduplicate_dataset: we pass it the dataset and the embedding model we want to use, in this case, as I mentioned, gte-large, which we can just copy-paste here, and as a threshold I'm going to use 0.95. Be careful: if you switch the embedding model, you won't get the same distribution of cosine similarities, so with some models you might need 0.85 to get the same results, with others 0.999; it really depends on the embedding model you use. So now we can convert the entire dataset into embeddings. Here we're downloading the embedding model; it's not a big model, which is why it's pretty fast. The long part is actually comparing the embeddings.

I should mention why we're doing this with a vector index instead of a plain loop. I tried a very minimalistic version using two nested for loops, comparing every embedding to all the other embeddings, but it took a very long time. That's why we're using a vector index here: to make the computations more efficient and get the results faster, because otherwise it would take something like two hours, which was really too long, unfortunately. So here it's still downloading the model, and then with a V100 with high RAM it should take about three to four minutes to get all the embeddings and filter the dataset. In the meantime we can continue. I'm just going to show the code here in case you didn't have time to read it; this part might be a bit confusing, but I don't want to delve too deep into the details of the faiss vector index. OK, so now you can see it's converting the text to embeddings. Oh, it's going to take much longer than the last time I tried, unfortunately, but that's okay; we can stop it or come back to it later to finish it.
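A rough sketch of the deduplication function being described might look like the following. It keeps the simple rule from the session (keep a row only if its nearest other output is below the threshold), which means both members of a near-duplicate pair get dropped; the gte-large repository ID is an assumption based on the model named in the session:

```python
import numpy as np
import faiss
from datasets import DatasetDict
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

def deduplicate_dataset(dataset, model_name, threshold):
    """Drop rows whose output embedding is too similar to another output."""
    sentence_model = SentenceTransformer(model_name)
    outputs = [sample["output"] for sample in dataset["train"]]

    print("Converting text to embeddings...")
    embeddings = sentence_model.encode(outputs, show_progress_bar=True)

    # Normalize manually so that inner product equals cosine similarity.
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = normalized.astype(np.float32)

    dimension = normalized.shape[1]
    index = faiss.IndexFlatIP(dimension)
    index.add(normalized)

    print("Filtering out near-duplicates...")
    # k=2: the closest hit is the row itself, the second is its nearest neighbor.
    similarities, _ = index.search(normalized, 2)

    to_keep = []
    for i in tqdm(range(len(normalized)), desc="Filtering"):
        # Keep the row only if its closest *other* output is below the threshold.
        # (This simple rule drops both members of a near-duplicate pair.)
        if similarities[i][1] < threshold:
            to_keep.append(i)

    deduped = dataset["train"].select(to_keep)
    return DatasetDict({"train": deduped})

deduped_dataset = deduplicate_dataset(dataset, "thenlper/gte-large", threshold=0.95)
print(f"Original rows: {len(dataset['train'])}")
print(f"Rows after deduplication: {len(deduped_dataset['train'])}")
print(f"Rows removed: {len(dataset['train']) - len(deduped_dataset['train'])}")
```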
What we want to see once we have the deduplicated dataset is the number of samples that were filtered out. We can print the length of the original dataset, the length of the deduplicated dataset, and even the number of samples that were removed, which is the length of the original dataset minus the length of the deduplicated one. This tells us how many rows we removed; you'll be able to see it later in the solution notebook, and it's about 8,000 samples.

One thing we can do when that part is over is top-k sampling. In this case we still have too many samples: if we remove 8,000 samples, we still have about 16k, and maybe that's too much for what we want to do, so we can sample some rows. To do that, we create a new function called get_topk_rows. It takes a dataset, the token counts, and k, to know how many rows we want to keep. We sort the indices by descending token count and take the top k indices: sorted over range(len(token_count)), with key equal to lambda i: token_count[i] and reverse set to True. So we sort the dataset with the longest samples first, and then we keep only the top k indices. I said earlier we would do it randomly, but that's not true; we're just keeping the samples with the most tokens, sorry about that. Then we build top_k_data: for the instruction column we take the samples from the dataset that are in the top k indices, and we do the same thing with the output column. You could do something similar with the select method, but this time I wanted to be explicit about what's going on, so we have a for loop that selects all the samples in the sorted indices, and we keep 1,000 of them. Finally, we return a dataset built from the top_k_data dictionary.

So this is our function, but in order to call it we need new token counts, because we've now filtered out a lot of samples. We can just copy-paste what we did at the beginning: get the instruction token counts, the output token counts, and the combined token counts, and that's what we'll use when calling this function. Let's use a k of 1,000: the top-k dataset will be get_topk_rows with the dataset, the combined counts, and k equal to 1,000. Finally, we save it as a DatasetDict, to make sure we still have a train split; not very important, but cleaner. A sketch of this step is shown below. After we've done that, we can once again recompute all the token counts and plot the distribution, just to see what the new distribution looks like after filtering and top-k sampling; we'll see the result in a few minutes when it's done. We can also look at the samples themselves with pandas, like we did at the very beginning of this notebook, and see how many samples remain.
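The top-k selection sketch referenced above might look like this; it assumes the deduplicated dataset from the previous step and the tokenizer loaded earlier:

```python
from datasets import Dataset, DatasetDict

def get_topk_rows(dataset, token_counts, k):
    """Keep the k rows with the highest combined token counts."""
    sorted_indices = sorted(
        range(len(token_counts)), key=lambda i: token_counts[i], reverse=True
    )
    top_k_indices = sorted_indices[:k]

    top_k_data = {
        "instruction": [dataset["train"][i]["instruction"] for i in top_k_indices],
        "output": [dataset["train"][i]["output"] for i in top_k_indices],
    }
    return Dataset.from_dict(top_k_data)

# Recompute token counts on the deduplicated dataset before selecting.
combined_token_counts = [
    len(tokenizer.encode(s["instruction"])) + len(tokenizer.encode(s["output"]))
    for s in deduped_dataset["train"]
]

k = 1000
dataset = DatasetDict(
    {"train": get_topk_rows(deduped_dataset, combined_token_counts, k)}
)
```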
Finally, I want to mention chat templates. You need to define a chat template if you want to use your large language model as a chatbot, and there are different ways of doing it. Here's one way: you have a role, "user", with content "hi there", then a role, "assistant"; this is more like the raw data. There's also a format where you just write "user:", then the message, then "assistant:", "nice to meet you". In the case of Llama 2 you have this particular template: you have the <s> token, then the instruction wrapped in another token for instructions, then the system prompt inside the SYS tags, then the user prompt, and finally the model answer. It's quite a complicated template, but we don't need it to fine-tune our Llama 2 model, because we're fine-tuning the base model, and this chat template is only used in the chat version of Llama 2, which is not the one we use here. I also wanted to mention the ChatML template from OpenAI; it looks like this. It's the most popular and standard one, and you can see it in a lot of state-of-the-art open-source models, but we're not going to use it either, because it requires adding tokens, which is more difficult.

So the one we're going to use is quite simple. We create a function called chat_template that takes an example, and in this function we reformat the instruction: we write an "Instruction" marker, a line break, then the original instruction, two line breaks, then an "Output" or "Response" marker, let's go for "Response", and another line break. Why this one? Honestly, there's no good reason; you could imagine a lot of different prompt templates, but it will be nice to see that the fine-tuned model follows this one. Finally we return the example and apply it with the map method of the dataset object, which changes all of our instructions so they follow this template. We'll visualize it when it's done.

OK, it's done; let's go back a bit. We filtered everything that we wanted to filter: we filtered out about 8K samples, like I mentioned previously. Then there's the top-k sampling, where we said we only want to keep the top 1,000 samples in terms of token counts, the ones with the most tokens. You can see here the distribution of token counts for instructions only, and the distribution of token counts for instruction plus output: we don't have samples with fewer than about 1,000 tokens anymore, thanks to the top-k sampling, so hopefully that makes sense. So we have 1,000 samples with a lot of tokens, and they should be high quality because they're not close to each other; we near-deduplicated them, so they should be pretty far apart from each other. Here you can see all 1,000 rows; once again, you can click through them if you want a good overview of the samples we selected. And now they should follow our chat template, so let's check that this is correct. Here's the instruction; let's click on it, and we really have the instruction and the response formatted as intended: Instruction, Response, and then the response we want the model to produce.
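The template cell described above might look roughly like this; the exact marker strings ("### Instruction", "### Response") are my guess at the format being described:

```python
def chat_template(example):
    """Fold instruction and output into one prompt following our simple template."""
    example["instruction"] = (
        f"### Instruction\n{example['instruction']}\n\n"
        f"### Response\n{example['output']}\n"
    )
    return example

dataset = dataset.map(chat_template)
print(dataset["train"][0]["instruction"])   # spot-check one formatted sample
```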
All right, so this is done. The final thing we can do here is optional: if you have a Hugging Face Hub account and you've put your token in the secrets, you can just push the dataset to the Hugging Face Hub like this, specifying the token. I'm calling it mini-platypus, but you can call it whatever you want. This will upload it, and you can even check that it was uploaded correctly: if I go to my Hugging Face account and look at my datasets, this one was updated less than one minute ago, and here you can see our entire dataset. Cool, so we have everything now, and we're ready to move on to fine-tuning. I hope that was clear, and if not, I hope the solution notebook will help you create your own dataset.

To go further, you can create synthetic data using GPT-4; it's something that is used quite a lot and it creates really good datasets, so that's something you can play with. Otherwise, it's really about manual reviewing: you can import this dataset into Google Sheets and manually review every possible row, and maybe create some regexes to automate the process a little, but it's a very time-consuming process. It's also very nice, though, because then people can reuse your dataset if you share it.

Now let's go to the fine-tuning notebook; you should have it too. Let me check the chat. OK, yeah, you have the solution notebook, cool. So here we have Llama 2, and we're going to delve into the fine-tuning process. As mentioned previously, there are two main ways of fine-tuning these models. There's supervised fine-tuning, which is what we're going to do: we tune the model on a set of instructions and responses, and it helps the model focus where we want it to, so it's helpful and follows the chat template. And there's reinforcement learning from human feedback, where we want the model to maximize a reward signal. I'm not going to delve into that; there are a lot of good articles about it. It's able to capture more complex preferences, but it's also more difficult to implement, and in practice nearly all of the state-of-the-art open-source LLMs, with very few exceptions, just use supervised fine-tuning. So that's something to keep in mind. And once again, there's an example from a few months ago now, the LIMA paper, which shows that only 1,000 high-quality samples can really make the difference and get you very far, in their case with a 65-billion-parameter model.

I also want to mention the Open LLM Leaderboard; you might be familiar with it. It's quite useful for seeing which models are currently the best. I wanted to show this GodziLLa 2 70B model, because it's based on Llama 2 70B, and I saw that it was trained using my dataset. So when I tell you it's nice to share your datasets: sometimes they get reused by other people without you knowing anything about it, and I'm glad this one was useful.

So, what we're going to do here, as before, is start by installing all the libraries that we need. In this case we're going to pip install with -q, and we're going to upgrade, because Google Colab already has some of these libraries but we want the latest versions available, in this case for bitsandbytes and wandb. Oops, I'm going to disconnect this session; okay. Once again, you can use the default GPU for the entire notebook; I'm just going to use the V100 because it's going to be faster. We install transformers for the Transformers models, datasets for the dataset, accelerate to make things faster, and peft for the fine-tuning method we're going to use; I'll come back to it later. TRL is a wrapper that can be used for supervised fine-tuning or for reinforcement learning from human feedback, bitsandbytes is for quantization, because we're not going to use the model in full precision, and wandb is for reporting, so we can have a nice dashboard where we can track the progress of our model.
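For reference, the optional upload cell from the start of this section and the install cell for the fine-tuning notebook might look roughly like this; the dataset name "mini-platypus" is simply the one used in the session:

```python
# Last cell of the dataset notebook: upload the final dataset to your own account.
dataset.push_to_hub("mini-platypus", token=HF_TOKEN)
```

```python
# First cell of the fine-tuning notebook: upgrade to recent versions of the stack.
!pip install -q -U transformers datasets accelerate peft trl bitsandbytes wandb
```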
Once again we're going to use the Colab secret; I'm just going to copy-paste the cell from the previous notebook so we have our secret token and can access it. It's optional if you don't have a Hugging Face account, once again. Then we import a lot of libraries: we import os, we import torch, we import load_dataset from datasets, and from the transformers library we need a lot of classes, namely AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, and pipeline, for when we want to run the model once it's trained. From peft we also need to import a few things: LoraConfig, PeftModel, and something called prepare_model_for_kbit_training. And the last one is the wrapper for supervised fine-tuning from the TRL library, called SFTTrainer. I'm going to leave that here for a second and talk about the different ways we can fine-tune this model.

We have three options: full fine-tuning, LoRA, and QLoRA. With full fine-tuning we use the entire model, so we train all the weights in the model, which is very costly. Then we have LoRA: instead of training all the weights, we add some adapters to certain layers and only train these added weights. This really reduces the cost of training the model, because we're only training something like 1% or 2% of the total weights. And finally we have QLoRA, which uses LoRA but with a model that has been quantized: in this case we're not going to use the model in 16-bit precision, with every weight occupying 16 bits in memory; instead the weights are quantized to 4 bits. We lose a bit of precision, but there are mechanisms to make that less impactful, and in the end we can still get a really strong model using QLoRA.

A bit of calculation here: we have 16 gigabytes of VRAM with our GPU (you can see it's 16 GB), and the Llama 2 7B weights are seven billion parameters. If each takes up two bytes, that means we'd use 14 gigabytes, so we'd be almost using the entire VRAM, and in addition there's overhead due to optimizer states, gradients, and forward activations. So it's going to be challenging, but we can manage to fit it into only 16 GB of memory.

OK, so now we're really going to get into the code to fine-tune it. We reuse the NousResearch model, like previously, and we give a name to our new model; in this case I'm going to call it llama-2-7b-mini-platypus. We reuse the dataset that we just created, which should be called mini-platypus, using just the train split. Finally we have the tokenizer: here we use the tokenizer from the base model, the fast version of it. And we're going to do something that some people hate: we don't have a padding token for Llama, and this is a real problem, because our dataset has rows with different numbers of tokens, so we need to pad them so they all have the same length.
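A sketch of the imports and model/dataset setup being described might look like this; the repository ID for the mini-platypus dataset and the exact new-model name are assumptions based on what is said in the session:

```python
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer
from google.colab import userdata

HF_TOKEN = userdata.get("huggingface")   # optional, same secret as in the first notebook

base_model = "NousResearch/Llama-2-7b-hf"
new_model = "llama-2-7b-mini-platypus"           # any name you like for the result
dataset = load_dataset("mlabonne/mini-platypus", split="train")

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
# Llama 2 ships without a padding token; reusing the EOS token is the simple
# option discussed next, and it does affect generation behavior later on.
tokenizer.pad_token = tokenizer.eos_token
```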
There are different ways of doing it; here I'm using the end-of-sentence token as the padding token, and this will have an impact on the generation of our model. This is what we're going to use here, but it's definitely not the best option. If you want to learn more about it, I've linked an article by Benjamin Marie about two other ways of doing it, which is what we should really do; but for the sake of simplicity, I'm telling you about the problem and still using the end-of-sentence token for this fine-tuning.

Then we get to the configuration for QLoRA. Here bnb_config is the bitsandbytes configuration. The first thing we want is to load the model in 4-bit, otherwise it will not fit into memory. To do that we specify the quant type we want to use; in our case we'll use the NF4 format, which is the format introduced in the QLoRA paper. We also set the compute dtype: the weights are stored in 4 bits, but when we want to compute, we use 16 bits, so we get more accuracy. And we use something called double quantization, where even the quantization parameters are quantized, to take up even less space.

On top of QLoRA we also have the LoRA configuration, and in this case we have a bunch of parameters. One of them is alpha: alpha is basically the strength of the adapter, the impact it has on the model, because you can merge these adapters in a very weak way, with very little weight, or with a big weight. 32 is a pretty big value, but it's quite a standard value for this parameter. We also have the dropout: when we add these adapters, they have a little dropout, so we have a 5% probability of skipping these connections. Finally we have the rank, which is the dimension of the matrices used here. If you want to know more about LoRA and see a minimalistic implementation of it, I made a notebook called nano-LoRA that goes in depth into the theory behind it and will help you understand these parameters a bit better. We do not want to touch the biases (we have weights and biases, and we don't care about the biases here), and the task type in this case is causal LM, because the model is autoregressive. Then there are the target modules: we have a very long list of target modules. The target modules are something you can see here, in the LlamaAttention implementation; it's basically saying which modules you don't train directly but attach a LoRA adapter to. We're going to use a lot of them, because it's been shown to really improve performance: the more modules we target, the more parameters we train, but that's fine; we can afford it with our limited budget, and it will help us in the long run.

Then we load the model with from_pretrained, using the base model we typed earlier, the quantization config (the bnb_config we just defined), and finally the device map. Here we could also use "auto", but I'm going to use a setting that automatically detects the device, the hardware that you have.
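Putting the QLoRA configuration together, along with the model loading and the prepare_model_for_kbit_training call described just after, might look roughly like this; the rank value and the compute dtype are common defaults rather than values stated explicitly in the recording:

```python
# 4-bit quantization (QLoRA) configuration.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # format introduced in the QLoRA paper
    bnb_4bit_compute_dtype=torch.float16,    # stored in 4 bits, computed in 16 bits
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants too
)

# LoRA adapter configuration.
peft_config = LoraConfig(
    lora_alpha=32,          # strength of the adapters when applied to the model
    lora_dropout=0.05,      # 5% chance of skipping an adapter connection
    r=16,                   # rank of the adapter matrices (assumed value)
    bias="none",            # leave the biases alone
    task_type="CAUSAL_LM",  # autoregressive language modeling
    target_modules=[        # attention and MLP projections to attach adapters to
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map={"": 0},     # put the whole model on the first (and only) GPU
)
model = prepare_model_for_kbit_training(model)
```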
In our case it will detect the GPU and make sure we're using the GPU for training; otherwise it's not going to work. Finally, we call a function called prepare_model_for_kbit_training. This casts the layer norms to FP32, so to higher precision, makes the output embedding layer require gradients, and upcasts the LM head to FP32. What this means is that it takes some layers, some modules, and makes sure we use them with the highest precision possible, because that's been shown to really improve performance too. Some modules don't really matter and some matter quite a lot, so we want to be proactive about that, and in the end this helps us build the best model possible. So if I... oops, I forgot to execute that cell, and that one too.

Just while you're running that, sorry to interrupt for a second, because we're coming up to time and we are going to overrun a bit. Before you all jump off, for those of you who have to dash, I just want to say we've got three webinars coming up next week. On Tuesday we've got an introduction to Snowflake code-along, on Wednesday we've got a session on using AI in robotics, so if you're interested in a career in AI, that's definitely worth attending, and on Thursday we've got a session on best practices for putting LLMs into production. Three great sessions; please do go to datacamp.com/webinars for those. We have got some great questions from the audience as well, so I'm hoping to get to those afterwards. If you do have to jump, that's fine; please do catch up on the recording. For everyone else, I hope you're okay hanging around for a few more minutes, and with that I'll dash off and let you get back to it, Maxime.

Yes, sorry about that, we should be done in about ten minutes; I underestimated the time it takes to write the code. So here we have the QLoRA configuration, loading the model, and preparing the model for training. We are currently downloading the model. Here you can see the different modules in the LlamaAttention class, which are the ones we target, and also in the LlamaMLP; you can look at the Hugging Face implementation to get more details about how it's actually implemented.

Once it's done, we have more boilerplate code to type. This one is our training arguments. We have the TrainingArguments object, and we want to give it a bunch of parameters: where to output the results, so we specify a directory for that; how many epochs we want to train the model for, where I'm going to put one here, but let's say four or five; basically between three and five is pretty good for a Llama 2 model of this size. There's the per-device train batch size, which tells us the number of samples per batch at every step. We have gradient accumulation steps; we're not going to use it here. It's basically a for loop inside the training step, so you don't have to use more VRAM, but in this case we won't need it. We have the evaluation strategy; it's not going to be very useful here because we're not really going to evaluate this model, we just want to train it, and we'll talk about what evaluation looks like for these models a bit later.
We want to log every step. The optimizer we're going to use is the Adam optimizer, but a paged, 8-bit version, so it uses less memory. For the learning rate, we use this value; there are different learning rates you could use, so refer to the QLoRA paper, because QLoRA really affects the learning rate, and the model you're fine-tuning also really affects the learning rate you want. We have the scheduler, linear in this case, and warmup steps: they won't matter much here, but we'll set 10 steps to warm up the optimizer. We want to report to Weights & Biases. And finally, something I'm going to put here but that you should remove for real training: we're going to stop after two steps, otherwise it would take about an hour to train the model. Feel free to remove that line if you want a real training run. Those are the training arguments.

Then we also need the SFTTrainer, the wrapper I mentioned earlier. Here we just specify the model and the training dataset; we don't have an eval dataset, so I'm just going to reuse the same one. We pass the PEFT config we specified, and it wants to know the text field: the field in our dataset was called "instruction". For the max sequence length we'll go with 512. You could say "but we put a threshold at 2K"; unfortunately we don't have enough VRAM, it would take too much memory to fit everything, so we're just going to stop at about 500 in this example. And finally we give it the training arguments. That's what we have; when it's done we can start the training, and when the training is done, we can even save the model.
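A sketch of the training cells described here might look like the following, with the library versions available at the time of the session; the batch size and learning rate are common QLoRA defaults rather than values stated explicitly in the recording:

```python
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=4,                  # 3 to 5 is reasonable for a model this size
    per_device_train_batch_size=4,       # assumption: not stated explicitly
    gradient_accumulation_steps=1,
    evaluation_strategy="steps",
    logging_steps=1,
    optim="paged_adamw_8bit",            # paged, 8-bit Adam to save memory
    learning_rate=2e-4,                  # common QLoRA value; tune per model and method
    lr_scheduler_type="linear",
    warmup_steps=10,
    report_to="wandb",
    max_steps=2,                         # demo only: remove this line for a real run
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=dataset,                # reusing the train split, as in the session
    peft_config=peft_config,
    dataset_text_field="instruction",    # the column that holds the formatted prompt
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_arguments,
)

trainer.train()
trainer.model.save_pretrained(new_model)
```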
So the model has been downloaded, and now we are training it. This is the training loss and evaluation loss from Weights & Biases; as you can see, it's a very nice way of tracking the progress of the model. You can see the warm-up steps, where it's pretty bad, and then it gets better and better. The training loss is in blue; it's quite spiky, a bit noisy. The eval loss is in orange, and it's a lot less noisy because it's logged less frequently. Something you can observe if you train for, say, five epochs is that the eval loss will go up instead of down. Normally, traditionally in machine learning, this is a bad thing, but with large language models it's been shown time and time again that it's actually desirable: the best models actually overfit quite a lot on the training data, and this is not a problem; it actually makes them better. Here, as you can see, our model has only been trained for two steps; if you want, you can add more steps, or remove the limit and train on the entire dataset, but it will take a while. However, we can check Weights & Biases; we won't see a lot because we only have two steps, but just to show you, here's our run, and you can see the global step, the train runtime, and a train loss of 1.2. So this is what you would use if you run it on the entire dataset.

Finally, we can use our model now that it's been trained. We can prompt it and ask "What is a large language model?" We have to wrap the prompt in the right chat template, with the Instruction marker, the prompt, and then the Response marker, and we can use a pipeline from Hugging Face here, which is going to be pretty convenient: we pass the model, set tokenizer equal to our tokenizer, and restrict the generation length to 128 tokens. Finally we get the result and print the generated text, which is a field of the object that's returned, and we can do something slightly fancy and remove the instruction part. This is a question a lot of people ask me: why do I see the instruction in the generated text? You can just trim it; that's how the Hugging Face pipeline object works, but you can remove it manually like this. So now the trained model is going to answer the question "what is a large language model" and print the answer.

OK, we have the answer: "A large language model is a type of artificial intelligence model that is trained on a large amount of text data to generate human-like text", which is pretty good actually. Not thanks to our fine-tuning, because it was not intense enough, but it's a pretty good answer. Then you can see that it keeps repeating "Instruction", "Response", "Instruction", and this is because of our padding: because we used the end-of-sentence token as the padding token, the model just doesn't stop and keeps talking. If you don't want this behavior, please use a different padding technique, as mentioned previously.

Finally, we want to free a lot of things; this is specific to Google Colab. We want to garbage-collect the model and the objects in memory and in the VRAM, so we can merge the base model with the adapter that we trained. This is a piece of code that is difficult to understand: why do we need to call it twice? Honestly, I do not know; I just know that it works when I do it, and sometimes it still doesn't work so well. Worst case, you can just restart the Colab and execute only this part of the code. Here, for the sake of time, I'm going to copy-paste the code: we reload the base model, and we also load the adapter, the QLoRA adapter we created. You can see it here, with adapter_config and adapter_model; this is what we want to load. Hopefully it will work. And then we can push our model to the Hugging Face Hub: we push the model and the tokenizer, using the HF token; this is optional, of course, and it will upload everything to our Hugging Face account. So here we're merging and unloading the model, we reload the tokenizer just in order to save it too, and it's going to be uploaded.
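The inference, cleanup, and merge steps described in this last part might look roughly like this sketch; freeing memory before reloading the base model in 16-bit is the step that sometimes needs a runtime restart, as mentioned above:

```python
prompt = "What is a large language model?"
formatted_prompt = f"### Instruction\n{prompt}\n\n### Response\n"

# Generate with the freshly trained adapter (the PEFT-wrapped model held by the trainer).
generator = pipeline(
    "text-generation",
    model=trainer.model,
    tokenizer=tokenizer,
    max_length=128,
)
result = generator(formatted_prompt)
print(result[0]["generated_text"][len(formatted_prompt):])   # trim the prompt itself

# Free VRAM before reloading the base model in 16-bit for the merge.
import gc
del generator, trainer, model
gc.collect()
torch.cuda.empty_cache()

# Reload the base model, apply the saved adapter, and merge the weights into it.
base = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map={"": 0}
)
merged = PeftModel.from_pretrained(base, new_model)
merged = merged.merge_and_unload()

# Optional: push the merged model and the tokenizer to the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(base_model)
merged.push_to_hub(new_model, token=HF_TOKEN)
tokenizer.push_to_hub(new_model, token=HF_TOKEN)
```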
OK, so that's the end of this session; sorry, I'm running a bit late. If you want to go further, just know that you should be able to reuse this entire Colab notebook with Mistral 7B instead of Llama 2 7B. Mistral 7B is a better model, but the name of this talk was fine-tuning Llama 2, so I stuck with Llama 2; I'd encourage you to try it out, though. If you want a better fine-tuning tool, I recommend Axolotl. These Colab notebooks are really nice for understanding the theory behind everything and being able to implement the fine-tuning process on your own, but if you want to really fine-tune state-of-the-art open-source LLMs, I recommend Axolotl. It's a great tool, I've been using it, a lot of people have been using it, and it's quite easy to use, so that's a good recommendation. Then, what you can do with this model is evaluate it using an evaluation harness (you can even get on the Open LLM Leaderboard if you have a good model), or you can quantize it, which makes it easier to run on consumer-grade hardware, so you could use your fine-tuned model on your own GPU. So that's it for me. It will take a while for the model to be pushed to the Hub because it's quite big, but as you can see it's been merged and everything is working correctly. I hope you found it useful, and if you have any questions, maybe now is the time to ask.

All right, thank you, Maxime, that was fantastic; a lot to unpack there. I actually have a lot of questions for you myself, but if I start asking them we're going to go on longer than an Ed Sheeran concert, so I'm going to stick to audience questions instead. Let's go with this one first, from Praveen: maybe you don't need to show this, but how do you go about querying tabular data? This session was very much focused on having a lot of text; if it's tabular data, what's the difference?

Ah, for tabular data I would not recommend using LLMs, because they're really made for text.

Actually, I'm wondering, and maybe we can't get into too much detail, but what happens if you've got, say, a pandas DataFrame of text rather than a text file?

Yeah, this is a good question. Actually, you could see it when we uploaded our dataset to the Hugging Face Hub: it's kind of a data frame, but it only contains text. If you go back to the dataset we built, you can see it's basically a data frame with instruction and output columns, but it's not tabular in the usual sense; it's really text.

OK, interesting. All right, the next question comes from Kieran: we've been doing this fine-tuning on a GPU; do we also need a GPU at the point where we're hosting these things? Is it just the training that's compute-intensive, or is it also inference?

No, inference is also very intensive, unfortunately, and in general you definitely need a GPU. But in particular, if you use llama.cpp, and I can show it on my screen, you can run the model on a CPU; you can see it here, me running it, and it runs on a CPU. You'll have to compromise a little, because you need to lower the precision of the model so it's smaller and faster to execute, but this is something you can do on a CPU.
All right, very good. For anyone who's interested in llama.cpp, I know we've got a tutorial on that; perhaps Rhys can post a link in a moment. The next question comes from Meenam: it looks like you've got some fancy Copilot action or some sort of autocomplete going on in Colab; what is that tool?

Yeah, it's a tool called Codeium, and it works really well with Google Colab, as you could see. I don't know if it learned from my own code, but it was really accurate this time.

I suppose when you're practicing and rehearsing the tutorial, you probably typed the same thing a few times, so it's a good way to train it. You're in the training dataset now. Nice. OK, so Codeium is the tool. The next question comes from Wei: how important is parameter tuning during fine-tuning? I guess that's hyperparameter tuning.

Yeah, it's a really good question. Some of the hyperparameters are going to be very important, and for some of them it's more like 1% or 2% gains. For example, everything here: honestly, if you stick to traditional values, it's not going to make super meaningful improvements. Of course they're important, because 1% or 2% is good to have, but they're not that important. There are other parameters that matter a bit more; the learning rate is a really important one, and for that one I recommend checking the model you want to use. If you use Mistral instead of Llama 2, it changes things, and if you use QLoRA versus full fine-tuning, that also changes the learning rate you want.

All right, excellent. The next one is from Bede: can you show again how you save and load models? Maybe we'll skip the text-generation part, but if you just cover loading and saving models, I think that's useful.

OK, so you save the model by calling save_pretrained on the trainer's model with the new model name; that's just some code you need to know. And then there was the text generation, right? In that case I use the pipeline. There are a lot of ways of using these models, and it's not super pretty, but it doesn't take a lot of lines of code, and it's an object from Hugging Face's library which lets you use text-generation inference nicely.

Oh, I think the question is about how, once you've saved it, you load it back.

OK, basically you just reuse from_pretrained. And actually, the model is already uploaded on Hugging Face, and there's a usage section on the model page where I describe all the code you need to use it.

OK, so you're loading whatever the model type is with from_pretrained, pulling it back off the Hugging Face page. All right. The next question, from Arun: can you see what percentage of trainable parameters remains after QLoRA? I remember you saying QLoRA just changes some of the weights in the model, so I guess the question is what percentage that is.

It's an excellent question. I cannot show you right now, because I deleted the model, pipeline, and trainer earlier, but there's a command to do exactly that; I recommend looking it up, and it will show you the percentage and the number of parameters you're training with either LoRA or QLoRA.
All right, excellent. Oh, we've got so many more questions; let's just do a couple more. And the model is now uploaded, it finally finished. So, Alexander asks: how do you fine-tune an LLM so it can extract JSON from differently formatted inputs? That's maybe a little specific, but can you talk about how you'd apply it to, say, a CSV file or an Excel file of text? I guess, how do you standardize that data?

Yeah, I don't know if that's really a task for an LLM, because if it's just extraction, I would ask why you want to use an LLM and not something else. Other than that, there are different frameworks, like Jsonformer, or LMQL, which is even better. Let me show you LMQL: if it's really about generation, generating properly formatted JSON, this is a really good framework for it. There are a lot of them, but this one is currently among the most popular, and it's quite easy to use. I'm not sure it really answers the question, but I wouldn't use an LLM to extract this information; I would use it to generate a JSON, and to generate that JSON, this is the library I would choose.

OK, so LMQL, was it?

Yeah, LMQL.

All right, that's worth looking into then. One very last question, since we're well over time anyway: how can we improve the performance of LLMs? Basically, what are LlamaIndex and LangChain, and what's the difference between them?

Yeah, so you have fine-tuning on one side, and LangChain and LlamaIndex are more about creating retrieval-augmented generation. Fine-tuning is one way of customizing an LLM for your use case, and a RAG pipeline is another way of doing it: with LangChain and LlamaIndex you retrieve more context using a vector database, or regular databases that you have. As for the difference between them, I'm not going to delve into the details; LlamaIndex does fewer things but maybe goes more in depth than LangChain. I would actually recommend implementing both approaches: fine-tune your LLM and then use that fine-tuned LLM with a RAG pipeline, and that's where you'll get the best performance possible.

All right, fantastic. We're going to have to call it a day there. I know there are more questions, so sorry to everyone in the audience if we didn't get to yours. I just want to say thank you again, Maxime; that was incredibly informative, with lots of new things I think we need to explore. So, brilliant, thank you, and thank you to Rhys for moderating. Oh, sorry, go on, Maxime.

Thank you, Richie, and thanks everyone for your patience. I know it's been a lot, but I hope you found it informative.

All right, brilliant. And thank you to everyone in the audience who asked a question, and thank you to everyone who showed up today. Hope to see you all again soon; lots of exciting webinars coming up. Goodbye, and have a great weekend.