Llama 3 Fine Tuning for Dummies (with 16k, 32k,... Context)

Captions
Drumroll please; the long-awaited Meta Llama 3 model is finally here! This is not just any release; it's an exhilarating leap forward, with Meta now setting the pace for powerful large language models. A very quick introduction before we get into how to easily fine-tune this beast: the Llama 3 model currently comes in two sizes, from the compact 8 billion parameter model for smaller projects to the mammoth 70 billion parameter version for larger-scale AI applications. More Llama 3 variants, including a monstrous 400 billion parameter model, are said to be on the way.

A comparison of Llama 3 70B against some notable benchmarks and evaluation metrics shows solid performance. You'll find these benchmarks in the blog post linked here. Notice MMLU, where Llama 3 takes the lead, outperforming similarly sized contenders. It's also competitive on metrics like HumanEval and GSM-8K, although some models like GPT-4, Gemini Ultra, and Claude 3 Opus aren't included in the comparison. Presumably, the additional Llama 3 variants coming later will be competitive against those flagship models. These models should also be multimodal, multilingual, and accommodate larger context windows.

Enough with the setup! Let's roll up our sleeves and jump into how to quickly and easily fine-tune this model for your specific use case. After several rounds of testing and tinkering with diverse fine-tuning methods, we've found that Unsloth is arguably the best way to fine-tune these LLMs. If you're scratching your head wondering what Unsloth is, it's a library for very efficiently fine-tuning and serving models: less GPU memory usage, less training time, and fewer headaches. It's open source, so feel free to check out the Unsloth GitHub repository for their notebooks and documentation. We'll work from the Llama-3 8B model notebook for a start, but we'll need to modify it to use our own fine-tuning data. There's a well-documented commit history, carefully following the release of Llama 3; the same day the Llama 3 model dropped, we got a functional notebook for fine-tuning. We can scale to different context lengths in this notebook by using RoPE scaling. To make our fine-tuning process even more efficient, we'll employ quantized LoRA (QLoRA) fine-tuning layers. We'll top it off with a single, somewhat larger GPU, to allow for much larger context windows during training. As you'll see, you can run this notebook on GPU sizes as small as a T4, but remember: the larger the GPU, the larger the context window and the smoother the fine-tuning. In the grand scheme of things, our goal is to harness the Unsloth and Hugging Face libraries for our fine-tuning journey. All of this will happen in a Colab notebook, powered by a dedicated Nvidia A100 GPU.

I can't chat with you if you ain't AI-driven (Tech)
I'm all for the data, show me that you're livin' (True)
Workin' with the best, man, Unsloth is fresh
We don't need no multi-GPU compute mesh (Code)
Hugging Face... and Unsloth, let's go
Hugging Face... and Unsloth, let's go

In essence, what we're about to do is a light-touch Unsloth fine-tuning of the base Llama 3 model, to create our custom fine-tuned model. We'll use some very simple Python code to prepare our fine-tuning dataset using our own data. Once the model is fine-tuned, the exciting part begins: you can integrate it into a wide array of applications, from web apps to chatbots. Even if this seems a bit complex, I assure you it's actually a smooth process with minimal costs involved.
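As a preview of that setup, here's a minimal sketch of what the Unsloth model-loading step roughly looks like. The pre-quantized model name and the 16k sequence length are assumptions for illustration; we'll actually pick the sequence length from our own data in a moment.

```python
# Minimal sketch of loading Llama-3 8B with Unsloth for QLoRA fine-tuning.
# The model name and 16k context are example choices, not the only options.
from unsloth import FastLanguageModel

max_seq_length = 16384  # beyond Llama 3's native 8k window; Unsloth applies RoPE scaling

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized 4-bit base model
    max_seq_length=max_seq_length,
    dtype=None,          # auto-detect (bfloat16 on A100, float16 on T4)
    load_in_4bit=True,   # QLoRA-style 4-bit loading to save GPU memory
)
```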
Let's journey over to our notebook and get the ball rolling. Pretty early on, we need to make some decisions about our model, notably which base model we're going to use and our approach to the max sequence length. We can specify any context window, because Unsloth does RoPE scaling. However, we do want to make a data-driven decision about what sequence length to choose, so let's look at our actual data to set the max sequence length. You can bring in the example data provided in the video description directly, or as a zip that you unzip in a terminal if you have Colab Pro. Or, if you already have your own data, even better!

For our example, we're going to use a story summarization use case. In this data directory, we've got a bunch of stories, and we're going to summarize all those different stories in a certain style; I'm going to fine-tune the model to be able to do that. Let's start bringing in our data with some new code. Our files will live in the data directory, and we'll parameterize whether we're talking about a story or a summary. This is of course not a lot of data; generally you want a number of samples in the hundreds, at a bare minimum. However, it will work for our demonstration, and we're going to hold out our last story for testing the model afterwards, just to have something it wasn't fine-tuned on.

Now we want to get the sequence length of our stories so we know what to use for the max sequence length setting. Let's use the Transformers library for that task: we'll use the AutoTokenizer from the Transformers library, making sure to select the right model. Next, let's loop through our files and print out the number of tokens. At the top, we'll print a table of the token counts, looking at the total tokens, the instruction tokens, and the story and summary tokens. We'll also want some sort of instruction that we're going to use in our fine-tuning. After fixing a tiny typo in the tokenizer setup, we can count the tokens and present them in a neat printout. It looks like we can use a 16k context, which is beyond Llama 3's default context window and therefore requires RoPE scaling.

We've worked through the model configuration, and we're all set for quantization. Everything looks ship-shape, so on to the adapter layers! The parameter settings are up to you and have implications for your model. Taking guidance from the QLoRA paper, we'll opt for an R value of 64 and an Alpha value of 16; according to the paper, these values generalize quite well. The R value is related to the number of parameters in your adapter layers, so it has implications for computational resources, model complexity, model quality, and potential overfitting. The Alpha value is a weighting parameter: how much should the base model shine through versus your fine-tuning adaptations?

Our next mission is to prepare some data using the Hugging Face datasets library. However, we're focusing on using your own data, even if it isn't yet published on Hugging Face.

We don't need this code, so let's put it in a text cell
We don't need this code, so let's put it in a text cell
We don't need this code, so let's put it in a text cell
We don't need this code, so let's put it in a text cell

We'll create a Pandas DataFrame to house our instructions, stories, and summaries. As we sift through the files, we'll smoothly aggregate their contents into these DataFrames.
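Before we move on to building the fine-tuning texts, here's a minimal sketch of that token-counting step. The data/ file layout (story_1.txt, summary_1.txt, and so on), the instruction text, and the tokenizer repo are assumptions for illustration, not the exact notebook cells.

```python
# Sketch: count tokens per story/summary pair to choose a max sequence length.
from pathlib import Path
from transformers import AutoTokenizer

# Ungated Unsloth mirror of the Llama-3 8B tokenizer; the official
# meta-llama repo works too if you have access approval.
tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-bnb-4bit")

# Hypothetical instruction used for every fine-tuning example.
instruction = "Summarize the following story in the style of the provided examples."

def count_tokens(text: str) -> int:
    return len(tokenizer(text)["input_ids"])

print(f"{'file':<12} {'instr':>6} {'story':>7} {'summary':>8} {'total':>7}")
for i in range(1, 7):  # assuming six story/summary pairs in data/
    story = Path(f"data/story_{i}.txt").read_text()
    summary = Path(f"data/summary_{i}.txt").read_text()
    n_instr, n_story, n_summary = map(count_tokens, (instruction, story, summary))
    total = n_instr + n_story + n_summary
    print(f"story_{i:<6} {n_instr:>6} {n_story:>7} {n_summary:>8} {total:>7}")
```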
Then we'll combine them into fine-tuning examples, with a one-sentence prompt followed by a multi-sentence story and an expected summary. We could create the fine-tuning texts a bit more directly, but it can be nice to have all these DataFrames for any additional input analysis or preprocessing. It appears we veered a bit off track here; to correct that, we'll need to specify the exact column we require at the end. Now, that's much better, isn't it? Our instruction sits at the beginning, followed by our story, which draws to a close with the summary further down.

Equally important is incorporating this into our trainer specification. Here's a question for you: should we train using epochs or steps? Epochs are often the easier approach, since they ensure a comprehensive run-through of your data and are more independent of your fine-tuning dataset size. Here, we'll do just one training epoch, which saves us the trouble of calculating and configuring the right number of steps. Feel free to amp up the number of epochs or set an incredibly high max steps count; all you have to do is keep an eye on your loss, then use a model checkpoint where you have a good balance of training steps and training loss. For your own data, a reasonable starting value is around 10 epochs. Oops, a hiccup with the number-of-steps specification! No worries, let's adjust that quickly and get back on track.

Great, now the trainer is primed, and our GPU seems to be working efficiently. Our small dataset, the configured batching, and the low epoch count make for an extremely quick training run; your larger, real-world datasets could mean significantly longer fine-tuning times. We've now tapped into more GPU RAM, about 35 GB of the available 40.

Remember our story number 6? Yes, the one we deliberately excluded from training. That's next in line. Let's combine the instruction and story texts to construct an example. We've stumbled on a mistake: we included the existing summary when we should have left it a suspense-filled blank slate. Let's hit the retry button, shall we? Now that's more like it; we do have a summary. We have a story about a botanist on an adventure in the Whisper Peaks. Good stuff.

No fine-tuning tutorial would be complete without saving your model somewhere, so don't forget this critical last step. The first option here means that only the adapter layers get stored, not the full model. Also, don't immediately go for the quantized model save if you may want the full-precision model; even if you forget this, you'll get a warning message and would need to force the function. Consider saving the full float-16 model for more compatibility with inference frameworks (a rough sketch of these options follows at the end of the captions). Lastly, note that we have both local save and hub save options side by side in this cell. We'll need a token from your Hugging Face account to push the model, and saving the full model can take some time.

Alright, the moment of truth: let's check our Hugging Face profile. And look at that, we've got our beautiful Llama-3 fine-tuned QLoRA model ready to go for use in your web applications, chatbots, and pipelines! We'll include the diagram file, the example data, and also our custom Python cells in the video description. We'll keep an eye on the comment section for any questions. Thank you so much for watching and please enjoy responsibly.
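For reference, here's a rough sketch of the save and upload options described above, assuming the Unsloth save helpers used in their notebooks; the repository name and token are placeholders, and you'd typically run only the option you need.

```python
# Sketch of the saving options, using Unsloth's save helpers (assumed from
# their notebooks). Replace the repo name and token with your own values.
hf_repo = "your-username/llama-3-8b-story-summarizer"  # hypothetical Hub repo
hf_token = "hf_..."                                    # your Hugging Face write token

# Option 1: save only the LoRA adapter layers (small files, but the base
# model is still needed at inference time).
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# Option 2: merge the adapters into the base weights and save a full
# float16 model, which is generally the most compatible with inference frameworks.
model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")

# Option 3: push the merged float16 model straight to the Hugging Face Hub.
model.push_to_hub_merged(hf_repo, tokenizer, save_method="merged_16bit", token=hf_token)
```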
Info
Channel: Nodematic Tutorials
Views: 23,377
Id: 3eq84KrdTWY
Length: 23min 15sec (1395 seconds)
Published: Wed Apr 24 2024