Drumroll please; the long-awaited Meta Llama
3 model is finally here! This is not just any release; it's an exhilarating
leap forward, with Meta now setting the pace for powerful large language models. A very quick introduction before we get into
how to easily fine-tune this beast… The Llama 3 model currently comes in two sizes: a compact 8 billion parameter model for smaller projects and a mammoth 70 billion parameter version for larger-scale AI applications. More Llama 3 variants - including a monstrous 400 billion parameter model - are said to be on the way. A comparison of Llama 3 70B against some notable benchmarks and evaluation metrics shows solid performance. You'll find these benchmarks in this blog post here. Notice MMLU, where Llama 3 takes the lead,
outperforming similarly-sized contenders. It's also competitive on metrics like HumanEval
and GSM-8K, although some models like GPT-4, Gemini Ultra, and Claude 3 Opus aren't included
in the comparison. Presumably, the additional Llama-3 variants,
coming later, will be competitive against those flagship models. These models should also be multimodal, multilingual,
and accommodate larger context windows. Enough with the setup! Let's roll up our sleeves and jump into how
to quickly and easily fine-tune this model for your specific use case. After several rounds of testing and tinkering
with diverse fine-tuning methods, we've discovered that Unsloth is arguably the best way to fine-tune
these LLMs. If you're scratching your head wondering what Unsloth is, it's a way to super-efficiently fine-tune and serve models - less GPU memory usage, less training time, and fewer headaches. This is open-source, so feel free to check
out the Unsloth GitHub repository for their notebooks and documentation. Let's work with the Llama-3 8B model notebook
for a start, but we’ll need to modify this to use our own fine-tuning data. There's a well-documented commit history,
carefully following the release of Llama 3. The same day as the Llama-3 model dropped,
we got a functional notebook for fine-tuning. We can scale to different context lengths
in this notebook by using RoPE scaling. To make our fine-tuning process even more efficient, we'll employ quantized LoRA (QLoRA) adapter layers. We'll top it off with a single, somewhat larger
GPU, to allow for much larger context windows during training. As you'll see, you can run this notebook on
GPU sizes as small as a T4. But remember, the larger the GPU, the larger
the context window and the smoother the fine-tuning. Looking at the grand scheme of things, our
goal here is to harness the power of Unsloth and Hugging Face libraries for our fine-tuning
journey. All of this will be happening on a Colab notebook,
powered by a dedicated Nvidia A100 GPU.
I can't chat with you if you ain't AI-driven (Tech)
I'm all for the data, show me that you're livin' (True)
Workin' with the best, man, Unsloth is fresh
We don't need no multi-GPU compute mesh (Code)
Hugging Face... and Unsloth, let's go
Hugging Face... and Unsloth, let's go
In essence, what we're about to do is a light-touch Unsloth fine-tuning of the base Llama 3 model, to create our custom fine-tuned model. We'll use some very simple Python code to
prepare our fine-tuning dataset using our own data. Once the model is fine-tuned, the exciting
part begins, where you can integrate it into a wide array of applications from web apps
to chatbots! Even if this seems a bit complex, I assure
you, it's actually a smooth process with minimal costs involved. Let's journey over to our notebook and get
the ball rolling. Pretty early on here, we need to make some
decisions about our model, notably what base model we're going to use and our approach
for the max sequence length. We can specify any context window because
Unsloth does RoPE scaling. However, we do want to make a data-driven
decision about what sequence length to choose. So, let's look at our actual data, to set
the max sequence length. You can bring in the example data provided in the video description directly, or as a zip file you can unzip in a terminal if you have Colab Pro. Or, if you already have your own data, even
better! For our example, we're going to use a story
summarization use case. So, in this data directory, we've got a bunch
of stories and we're going to be summarizing all those different stories in a certain style. I'm going to fine-tune the model to be able to do that. Let's start bringing in our data here with
some new code. Our files are going to be in the data directory, and we'll parameterize whether we're talking about a story or a summary. This is of course not a lot of data - generally you want a number of samples in the hundreds, at a bare minimum. However, this will work for our demonstration, and we're going to save our last story for use in testing the model afterwards, just to have something that it wasn't fine-tuned on.
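As a concrete sketch of that loading step - the data directory layout and the story_N.txt / summary_N.txt naming are assumptions for illustration, so adjust them to match your own files - something like this works:

```python
import glob

# Assumed layout: data/story_1.txt ... data/story_N.txt with matching summary_N.txt files.
def load_file(kind: str, index: int) -> str:
    with open(f"data/{kind}_{index}.txt", "r", encoding="utf-8") as f:
        return f.read().strip()

num_examples = len(glob.glob("data/story_*.txt"))
stories = [load_file("story", i) for i in range(1, num_examples + 1)]
summaries = [load_file("summary", i) for i in range(1, num_examples + 1)]

# Hold the last story out of fine-tuning so we can test on something unseen later.
test_story, test_summary = stories[-1], summaries[-1]
train_stories, train_summaries = stories[:-1], summaries[:-1]
```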
Alright, now we want to get the sequence length of our stories so we know what to use for this max sequence length setting. Let's use the Transformers library for that
task. We're going to use the auto tokenizer from
the Transformers Library, making sure to select the right model. Next up, let's loop through our files and
print out the number of tokens. At the top here, we'll print out a table of
the token counts. We're going to be looking at the total tokens,
the instruction tokens, and the story and summary tokens. We'll also want some sort of instruction that we're going to be using in our fine-tuning here.
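Here's a rough sketch of that counting loop. The instruction string is a hypothetical stand-in for whatever instruction you actually fine-tune with, the tokenizer comes from Unsloth's 4-bit Llama-3 8B repo so it matches the base model we load below, and the train_stories and train_summaries lists carry over from the loading sketch.

```python
from transformers import AutoTokenizer

# Load the tokenizer that matches the base model we plan to fine-tune.
tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-bnb-4bit")

# Hypothetical instruction; use whatever instruction your fine-tuning examples will carry.
instruction = "Summarize the following story in the style of the provided examples."
instruction_tokens = len(tokenizer(instruction)["input_ids"])

print(f"{'story':>8}  {'summary':>8}  {'total':>8}")
for story, summary in zip(train_stories, train_summaries):
    story_tokens = len(tokenizer(story)["input_ids"])
    summary_tokens = len(tokenizer(summary)["input_ids"])
    total_tokens = instruction_tokens + story_tokens + summary_tokens
    print(f"{story_tokens:>8}  {summary_tokens:>8}  {total_tokens:>8}")
```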
After fixing a tiny typo in the tokenizer setup, we can count the tokens and present them in a neat printout. It looks like we can use a 16k context, which is beyond Llama-3's default 8k context window and therefore requires RoPE scaling. We've worked on the model configuration, and we're all set for quantization.
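For reference, the model-loading cell looks roughly like this sketch; the 16k max_seq_length reflects the data-driven choice we just made, and Unsloth handles the RoPE scaling behind the scenes.

```python
from unsloth import FastLanguageModel

max_seq_length = 16384  # chosen from our token counts; beyond Llama-3's native 8k

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # 4-bit quantized Llama-3 8B base
    max_seq_length=max_seq_length,
    dtype=None,          # auto-detect (bfloat16 on an A100)
    load_in_4bit=True,   # QLoRA-style 4-bit quantization
)
```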
Everything looks ship-shape, so onto the adapter layers! The parameter settings are up to you and have
implications on your model. Taking guidance from the QLoRA paper, we'll
opt for an R value of 64 and an Alpha value of 16. According to the paper, these values generalize
quite well. The R value is related to the number of parameters in your adapter layers, so it has implications for computational resources, model complexity, model quality, and potential overfitting. The Alpha value is a weighting parameter: how much should the base model shine through vs. your fine-tuning adaptations?
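In Unsloth, that adapter setup looks roughly like the sketch below; the target module list and the remaining settings mirror common Unsloth notebook defaults rather than anything prescriptive.

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=64,            # adapter rank: more trainable parameters, more capacity (and memory)
    lora_alpha=16,   # weighting of the adapters relative to the base model
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,  # helps fit long sequences in GPU memory
    random_state=3407,
)
```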
Our next mission is to prepare some data using the Hugging Face datasets library. However, we are focusing on using your own data, even if it is not yet available on Hugging Face. We don't need this code, so let's put it in a text cell. We'll create a Pandas DataFrame to house our
instructions, stories, and summaries. As we sift through the files, we will smoothly
aggregate their contents into these DataFrames. Then we'll combine them into fine-tuning examples,
with a one-sentence prompt followed by a multi-sentence story, and an expected summary. We could create the fine-tuning texts a bit more directly, but it can be nice to have all these DataFrames for any additional input analysis or preprocessing.
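Here's a sketch of how that comes together, carrying over the variables from the earlier sketches; the exact prompt template (the newlines and the "Summary:" marker) is an assumption you can shape however you like, as long as inference uses the same format.

```python
import pandas as pd
from datasets import Dataset

# One row per training example: instruction, story, and expected summary.
df = pd.DataFrame({
    "instruction": [instruction] * len(train_stories),
    "story": train_stories,
    "summary": train_summaries,
})

# Combine each row into a single fine-tuning text, ending with the EOS token
# so the model learns where a completed summary stops.
EOS_TOKEN = tokenizer.eos_token
df["text"] = (
    df["instruction"] + "\n\n" + df["story"] + "\n\nSummary:\n" + df["summary"] + EOS_TOKEN
)

# The trainer only needs the combined "text" column.
dataset = Dataset.from_pandas(df[["text"]])
```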
It appears we might have veered a bit off track here; to correct that, we'll need to specify the exact column we require at the end. Now, that's much better, isn't it? Our instruction sits at the beginning, unveiling
our story, which draws to a close with a summary further down. Equally significant is to incorporate this
into our trainer specification. Here lies a question for you: should we train
using epochs or steps? Training by epochs is often the easier approach, as it ensures a comprehensive run-through of our data and is less dependent on your fine-tuning dataset size. Here, we'll do just one training epoch, which saves us the trouble of calculating and configuring the right number of steps.
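A sketch of that trainer cell is below; the batch size, learning rate, and optimizer are typical Unsloth-notebook-style values rather than tuned recommendations, and the exact SFTTrainer arguments can shift between trl versions.

```python
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=1,              # train by epochs rather than a fixed step count
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)
trainer_stats = trainer.train()
```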
Feel free to amp up the number of epochs or set an incredibly high max steps count; all you have to do is keep an eye on your loss. Then, use a model checkpoint where you have a good balance of training steps and training loss. For a reasonable starting value, I'd recommend 10 epochs. Oops, a hiccup with the number of steps specification! No worries, let's adjust that quickly and
get back on track. Great job! Now, the trainer is primed, and our GPU seems
to be working efficiently. Our small dataset, the configured batching, and the low epoch count have meant an extremely quick training run. However, your larger, real-world datasets could mean significantly longer fine-tuning times. We've now tapped into more GPU RAM - about
35 GB of the available 40. Remember our story number 6? Yes, the one we deliberately excluded from
training. That's next in line. Let's combine the instruction and story texts to construct an example.
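Something along these lines, reusing the training prompt format but stopping right before the summary so the model has to write it; the generation settings are illustrative.

```python
# Build the prompt exactly as in training, minus the summary we want generated.
prompt = instruction + "\n\n" + test_story + "\n\nSummary:\n"

FastLanguageModel.for_inference(model)  # switch Unsloth into its faster inference mode
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, i.e. the summary.
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```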
We've stumbled on a mistake! We've included the existing summary when we should have left it a suspense-filled blank slate. Let's hit the retry button, shall we? Now that's more like it - we do have a summary. We have a story about a botanist on an adventure
in the Whisper Peaks. Good stuff. No fine-tuning tutorial would be complete
without saving your model somewhere! Don’t forget this critical last step. The first option here means that only the
adapter layers get stored, not the full model. Also, don't immediately go for the quantized model save if you might want the full-precision model. However, even if you forget this, you'll get a warning message and will need to force the function. Consider saving the full float-16 model for
more compatibility with inference frameworks. Lastly, note that we have both local save
and hub save options, side-by-side, in the cell here. We'll need a token from your Hugging Face account to push the model. Saving the full model can take some time.
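In code, the two flavors look roughly like this; the repository name and token are placeholders, and the merged float-16 push relies on Unsloth's helper for folding the adapters back into the base model.

```python
# Option 1: local save of just the LoRA adapter layers (small, but the base
# model must be available again at load time).
model.save_pretrained("llama3-story-summarizer-lora")
tokenizer.save_pretrained("llama3-story-summarizer-lora")

# Option 2: merge the adapters into a full float-16 model and push it to the Hub
# for broader compatibility with inference frameworks. Requires a Hugging Face token.
model.push_to_hub_merged(
    "your-username/llama3-story-summarizer",  # hypothetical repo name
    tokenizer,
    save_method="merged_16bit",
    token="hf_...",                           # your Hugging Face access token
)
```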
Alright, the moment of truth - let's check our Hugging Face profile. And look at that, we've got our beautiful Llama-3 fine-tuned QLoRA model ready to go for use in your web applications, chatbots, and pipelines!
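As a quick illustration of that last point, downstream code could pull the merged model straight from the Hub with a standard transformers pipeline; the repo name is the same hypothetical one from the save step, and the prompt format matches training.

```python
from transformers import pipeline

# Load the merged fine-tuned model from the Hub (hypothetical repo name).
summarizer = pipeline(
    "text-generation",
    model="your-username/llama3-story-summarizer",
    device_map="auto",
)

new_story = "Once upon a time in the Whisper Peaks..."  # any story you want summarized
prompt = instruction + "\n\n" + new_story + "\n\nSummary:\n"
print(summarizer(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"])
```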
We'll include the diagram file, the example data, and also our custom Python cells in the video description. We'll keep an eye on the comment section
for any questions. Thank you so much for watching and please
enjoy responsibly.