Embeddings vs Fine Tuning - Part 1, Embeddings

Captions
My goal today is to build a language model that knows the rules of touch rugby. It's a good choice because touch rugby is a fun but somewhat obscure sport: there's not a lot of information online, so models like Llama don't really know the rules that well. That gives us a nice base for comparing performance between an embeddings approach and a fine-tuning approach. Today we'll be focused on the embeddings approach to answering questions about the rules of touch rugby.

You're probably used to using Ctrl+F and searching for words, but embeddings allow us to search for meaning rather than directly searching for words or phrases. With embeddings, we represent the question by an arrow. We also take the document and represent each paragraph or phrase by an arrow, and the direction of these arrows is representative of the meaning. So the whole idea of embeddings is to find the closest arrows. We have an arrow for a question like "How many players are on a touch rugby team?", we have many arrows representing paragraphs in the rule book, and we look for the arrows that point in a similar direction, for example "Teams are made of 14 players with a max of six on the field." By grabbing those snippets whose arrows are similar in direction to the search term, we can grab relevant context.

Now, how do we get these arrows, and where do these embeddings come from? They come from the very first layer of a language model. A transformer architecture, like the one used in GPT-4 or Llama, has very many layers, but the first layer is responsible for converting the original tokens into an arrow representation: a vector in high-dimensional space whose direction roughly corresponds to the meaning and content of that phrase. The embeddings happen on a token-by-token basis, so you have a little arrow for every single token, but if you want an arrow for a phrase, one approach is simply to average across the tokens in that phrase: you average the smaller arrows to get one arrow for the phrase.

How can we make use of embeddings to answer questions about touch rugby? The standard approach is to just ask the question, "How many players are on a team?" Llama won't fare very well with this, because Llama is not familiar with the rules of touch rugby. But how about if we use a search for meaning to find the most relevant pieces in the rule book and feed them into Llama? For example, we can say, "How many players are on a team? Make use of the following info," and then, using an embedding search over the rule book, pull out paragraphs that we think are relevant and insert them as useful info. This will greatly improve the performance of Llama because it provides relevant context.

Now, there are two keys to success in getting embeddings to work, for the rules of touch rugby or for any of your use cases. First, you need a good embedding model, meaning a good way to calculate the arrows, and you need to make sure you're finding the closest arrows. Second, you need a good language model: even if you give the correct information to a language model, it still might make mistakes. We're going to see that Llama, even when you feed in the right embeddings, can sometimes just hallucinate; even GPT-4 makes mistakes, and I'll show you a few of those too. The key point is that you've got to distinguish between two aspects of performance: one, the performance of the embeddings (are you retrieving enough information, and the correct information, to answer the question?), and two, whether your model is strong enough to parse that information into the correct answer.
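[Editor's sketch] To make the token-averaging idea above concrete, here is a minimal sketch (not from the video) of pooling a phrase into a single arrow using only the embedding layer of a Hugging Face causal LM; the model id is an illustrative assumption, and any similar checkpoint would do:

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"  # hypothetical choice; gated on HF
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def phrase_embedding(text: str) -> torch.Tensor:
    # Run the tokens through only the first (embedding) layer, then average.
    with torch.no_grad():
        ids = tokenizer(text, return_tensors="pt").input_ids
        token_arrows = model.get_input_embeddings()(ids)  # (1, n_tokens, dim)
    return token_arrows.mean(dim=1).squeeze(0)            # one arrow per phrase

query_arrow = phrase_embedding("How many players are on a touch rugby team?")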
Before we wrap up this introduction, I want to mention one more point, which is around measuring the closeness of the arrows. Broadly speaking, there are two approaches. The first is a cosine approach, which literally measures the angle between the search vector and the closest vectors within the document. This is good if you want to measure the closeness of content of those two arrows; however, it doesn't tell you anything about the amount of content. For example, a short paragraph might mention the number of players on a team, whereas a long paragraph might go into greater detail, and generally we do want that detail, which makes it more useful to use what's called the dot product. The dot product is a projection of one vector upon another, and it tells us not just about the angle between the two but also about the quantity of information. So if we're just interested in the type of info, with a short query looking for short responses, we'd use cosine; but if we have a short query and we're looking to extract larger chunks of text, it's better to use the dot product, and it tends to perform better too, so that's what we'll use.

The data set we're going to use for fine-tuning in part two, and for our embeddings here in part one, is the official touch rugby rules. I've gone online and found these rules; there are about 24 pages. I'm going to download the file and save it as train.pdf. Throughout this video I'll be making use of the Trellis Research Llama fine-tuning repository. You can purchase access to the repo, or to individual scripts, in the description below, but I'll try to describe everything in this video so you can replicate it all from scratch yourself. Here we have an embedding notebook; I'll be using this for running the embeddings and also running the language model. But first we need to prepare some data on the rules of touch rugby, and for that I'll be moving to the data-prep branch, which has some scripts for preparing the data. So let's get started.

OK, I've just cloned the llama-fine-tuning repository, I've activated the virtual environment, and now I'm going to git switch into the data-prep branch so we can start with data preparation. You'll see a data folder containing train.pdf; I've put the training PDF in there, and now I want to convert it into text. I'm going to run the PDF-to-text script (python pdf_to_text.py), and it tells me it's converted train.pdf to train.txt. I didn't provide a test PDF for validation, so it's not converting that. So here we have, in text form, a copy of all of these rules.

Now, for embeddings, I want to split this into chunks and then convert each of those chunks into a vector, an arrow. To do that I'm going to run the create-dataset script, which splits the text into chunks. When I run the script I can choose the size of those chunks, and I also have to specify the overlap. I'm going to run with an overlap of 50, which means that instead of just chunking the text back to back, the chunks overlap. Basically, if a sentence runs over a split, you're not left with a cut in the data set where no single chunk covers the material straddling that split. So we'll use an overlap of 50 tokens here, and then I'll run python create_dataset.py.
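[Editor's sketch] Here's a rough reconstruction (not the repo's actual script) of what chunking with a 50-token overlap might look like; the tokenizer is an arbitrary stand-in:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for chunking

def chunk_text(text: str, chunk_tokens: int = 100, overlap: int = 50) -> list[str]:
    ids = tok(text).input_ids
    step = chunk_tokens - overlap
    chunks = []
    # Each window starts 'step' tokens after the last, so consecutive chunks
    # share 'overlap' tokens and no sentence falls through a gap at a split.
    for start in range(0, max(len(ids) - overlap, 1), step):
        chunks.append(tok.decode(ids[start:start + chunk_tokens]))
    return chunks

with open("data/train.txt") as f:
    snippets = chunk_text(f.read())  # ~48 rows of 100 tokens for 24 pages of rules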
Now I'm being asked to enter the number of tokens for each data row, so I'm going to enter 100: I'm cutting the text into chunks of 100 tokens. The last step is to make this data set available online. That's not absolutely necessary, but it's nice because you'll be able to replicate this in your own time. For this I'll run python push_to_hub.py — and I just made a mistake there: it's not push_to_hub, it's push_to_hf.py. It's asking me for my Hugging Face token, so I'll grab that, copy it, and paste it in, and then enter the path I want to push this data set to on Hugging Face. I've manually created a repo on Hugging Face, and the repo is Trelis/touch-rugby-rules-embeddings. OK, it's now uploading, and I should be able to view the data set here. So I have a data set which I've called train, although it's really just a data set containing the snippets — not vectors yet. You can see the rules have been split into 48 rows, each with 100 tokens and an overlap of 50 tokens.

So at this point I've just created a data set of snippets 100 tokens long from the rules. You might ask: why 100 tokens? It depends, and you can play around with what works well, but the basic logic is this. If you make the snippets too short, then when you're searching for the right arrow, a snippet may not contain enough information to deem itself relevant — particularly if there's a relevant word at the start of a passage and another at the end: if your snippet size is smaller than that span, you're never going to find an arrow that contains both pieces of information. So snippets need to be long enough to cover what you think is a typical relevant passage for answering your questions; typically that lands in the range of maybe 200 to 500 tokens. If you make snippets really long, the arrow now represents a lot of information, so it becomes more of an average over all of it, which means that when you compare your search term it's more washed out, less specific: you'll have an arrow for the question, but the information in the snippet is so diffuse that the search might not recognize that it contains similar information.

With the data set ready for the embeddings, we're going to open up a Colab notebook, which will allow us to run Llama first without the help of embeddings and then with the extra information injected. I'm going to run Llama 13B, because I'm using a free Colab notebook and 13B is the largest model I can fit without overwhelming the memory. I could use the 7B model, and it would be slightly faster, but I think it's worth the extra capability to run 13B. I'm going to run the model in GPTQ quantized format, which means it will run in 4-bit instead of 16-bit. This makes the model a lot smaller, at only a small loss in accuracy, and allows us to run it in a free notebook.
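[Editor's sketch] Loading a 4-bit GPTQ checkpoint might look like this (the repo id is my assumption — TheBloke publishes many GPTQ builds — and loading them through transformers needs the optimum and auto-gptq packages installed):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-13B-chat-GPTQ"  # assumed 4-bit GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                          # fit onto the free Colab GPU
    cache_dir="/content/drive/MyDrive/models",  # cache weights on Google Drive
)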
Since I'm running GPTQ using TheBloke's repository for the model, I don't need to log into Hugging Face, because it's not a gated repository the way Meta's Llama repo would be if I downloaded the original Llama 2 model. I will, however, connect my Google Drive by running all of these cells; the reason is that this downloads the model to my Google Drive so that startup is quicker the next time around. You don't have to connect Google Drive, but it means you'll be downloading the weights from Hugging Face every time. OK, Google Drive is now connected, and as I said, we're going to install GPTQ. It's a little bit faster than using bitsandbytes NF4 — in fact quite a bit faster — and maybe a tiny bit less accurate; it's hard to tell, depending on the data set. But we'll run with GPTQ, and I'll run the installation. Optionally, you could comment that out and run with bitsandbytes NF4 here instead.

After GPTQ is installed and the model has been loaded, we'll load the data set. Here is the data set: it's just a link to what we showed earlier, the touch rugby rules embeddings. Just a reminder: this is not the embeddings, not the arrows; it's just the text snippets. We'll be calculating the arrows in the Colab notebook. Moving on, once we've loaded the data set we're going to test the embeddings, but I'll come back to that, because first I want to show you how the model runs without any help from embeddings. In other words, let's just get Llama to answer the touch rugby questions without any training and without any extra injected information.

The way we'll evaluate performance, throughout this tutorial on embeddings and the next on fine-tuning, is by asking a series of questions about touch rugby that I made up manually, for example: "How many players are on the field on each team in touch rugby?" (the answer is six players). We'll also provide a system message — I'll scroll up here to find it — saying: "You are an expert on the rules of touch rugby. You provide succinct answers no longer than 10 words." So this is the test case on which we'll evaluate Llama's performance, and it's also how we'll evaluate using embeddings, and later using fine-tuning in the next video.

The GPTQ model has now downloaded — in fact I downloaded it earlier to Google Drive, so it was quite quick and I only needed to reload the shards. The tokenizer has downloaded as well, which means we're ready to run a test without the embeddings. I'm going to set up my test function here and then evaluate without the embeddings, setting use embeddings to false, meaning I'm not injecting any extra information.

Here's the first question: how many players are on the field in touch rugby? We have a warning here, then: "Six players per team." So Llama does seem to know the first answer. Does a forward pass result in a roll ball, a scrum, or something else? Llama says a roll ball, which is incorrect; it's a penalty. How many meters does a defending team retreat? Llama says five; that's wrong, it's seven. How many substitutions are allowed? There's no limit, but Llama says three. How long is halftime? Five minutes — correct. How does the game start? Llama says a kickoff; that's wrong, it's a tap. How far do you retreat for a penalty? 10 meters. How many touches before a turnover? It's six, but Llama says three. What happens if a player is touched before making a pass? Llama says there's a scrum — there are no scrums in touch rugby. And lastly, how many points for a try? It's only one in touch rugby (it's five in normal rugby).

So raw Llama, without any embeddings, scores three out of 10. Depending on the size of Llama, that actually doesn't change much: I think it's usually two or three, and even when I ran the 70B model it still got two or three. Llama probably has a pretty good understanding of rugby, but touch rugby is a bit obscure, so it only scores about two or three.
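[Editor's sketch] A minimal evaluation harness along the lines described above might look like this (the question list is abbreviated and the substring scoring is a simplification; in the video the answers are judged manually):

system = ("You are an expert on the rules of touch rugby. "
          "You provide succinct answers no longer than 10 words.")

qa_pairs = [
    ("How many players are on the field on each team in touch rugby?", "six"),
    ("How long is halftime in touch rugby?", "5 minutes"),
    # ... the remaining eight questions from the video
]

def evaluate(generate_fn, context_fn=None):
    # generate_fn: prompt -> model reply; context_fn: question -> snippets text
    correct = 0
    for question, answer in qa_pairs:
        prompt = f"{system}\n\n{question}"
        if context_fn is not None:  # the use-embeddings path, shown later
            prompt += "\n\n" + context_fn(question)
        reply = generate_fn(prompt)
        print(f"Q: {question}\nA: {reply}\n")
        correct += answer.lower() in reply.lower()  # crude substring check
    return correct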
So that's the baseline we're trying to improve on by injecting extra information. We're now going to inject embeddings within the prompt to improve the quality of the responses we get from Llama. The first thing to do when setting up embeddings is to check that, when we ask a question, we retrieve relevant paragraphs that will help in answering it.

There are three different ways I'm offering here for calculating the embeddings: Llama embeddings, OpenAI embeddings, and Marco embeddings. Llama embeddings simply use the first layer of the language model. This is an easy route: you don't have to download anything else, since we've already loaded the Llama model, so it's quite a quick script if you want to use Llama embeddings. You can also use the OpenAI embeddings, which means sending each snippet of text over to OpenAI, and they respond with an arrow for each snippet. This is a bit trickier, because I'm rate-limited to three calls per minute, which makes it very slow to get all the embeddings (although I could request a higher rate limit), plus I have to pay for each embedding. I tried this approach and generally it didn't perform better than some of the open-source methods — maybe it performs similarly — so I generally prefer open-source embeddings.

The open-source embeddings I'm going to use here are what I'm calling Marco embeddings. They're available from the SBERT.net website, from a model that's been tuned for the dot product. You'll remember in the intro I described that you can either measure the angle between the search query and the document's paragraphs, or take the dot product, which also accounts for the amount of information. This MS MARCO DistilBERT model has the highest performance there and is fine-tuned for the dot product. You'd probably get good performance with many of the other models on that site too, so I'm not suggesting this is necessarily the best, but it's a pretty good model, and I found it performs similarly to Llama embeddings but much quicker, and at no cost.

To use the embeddings we need two things: a table of all the paragraphs, and a query. We've already created a table on Hugging Face of all the paragraphs from the rules, so we import that here by loading the data set; you can see that the very first row of this training data set contains a snippet of 100 tokens, and there are 47 other rows like it. We've basically imported, into the train variable here, a list of snippets that we're going to calculate arrows for, and we're going to use Marco to calculate those arrows. Let's scroll down to Marco embeddings: we'll install sentence-transformers, which is the package needed to access that model, and using that model we're going to embed the full database.
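[Editor's sketch] Embedding the full database with sentence-transformers might look like this (the data set id, column name, and exact MS MARCO checkpoint are my assumptions):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

data = load_dataset("Trelis/touch-rugby-rules-embeddings")  # assumed repo id
snippets = data["train"]["text"]                            # assumed column name

# A dot-product-tuned MS MARCO DistilBERT model from SBERT.net:
encoder = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")
corpus_embeddings = encoder.encode(snippets, convert_to_tensor=True)  # 48 arrows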
So this is what's happening here: we take the train data set and run it through the Marco embedding process to calculate an arrow for each of the 48 rows. Then we pose a test question — how many players on a team are on the field at one time in touch rugby? — and calculate an arrow for that question using the same Marco approach. Now comes the dot-product step: we take the dot product between the question embedding and the training embeddings (the training embeddings being the rules of touch rugby). This tells us what the dot product is between the Marco arrow for the question and each of the other arrows, and from there we pick out the top three: I've set n_samples to three, meaning I want the three most relevant snippets, and then we print them.

Let's run that quickly, and we should get a list. It has to download the model and its tokenizer, but it's a very lightweight model relative to Llama, which is why it's so fast. Here you can see it's found similarity scores — the dot products — for each snippet, and remember, our goal is to find the number of players. You can see straight away that the most similar paragraph it found says a team consists of 14 players, with no more than six allowed on the field at any time. So Marco is working very well here: the very first of the three arrows it finds is the one we want.

What we're going to do now is run through each of the 10 questions and include embeddings within the prompts. Let me show you what that looks like. In addition to asking the question, if we turn Marco embeddings on, we inject the snippets: after asking the question we say, "Here are some snippets from the rules of touch rugby," we insert three of those snippets, and then we say, "Answer the following question succinctly, solely using the above snippets of text."
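[Editor's sketch] Continuing the earlier sketch, the retrieval and prompt-injection step might look as follows (the prompt wording paraphrases the video; util.cos_sim is shown only as the angle-based alternative discussed in the intro):

import torch
from sentence_transformers import util

question = "How many players on a team are on the field at one time in touch rugby?"
q_embedding = encoder.encode(question, convert_to_tensor=True)

scores = util.dot_score(q_embedding, corpus_embeddings)[0]  # dot product per snippet
# util.cos_sim(q_embedding, corpus_embeddings) would measure the angle only
top = torch.topk(scores, k=3).indices.tolist()              # n_samples = 3

context = "\n\n".join(snippets[i] for i in top)
prompt = (
    f"{question}\n\n"
    f"Here are some snippets from the rules of touch rugby:\n\n{context}\n\n"
    "Answer the above question succinctly, solely using the above snippets of text."
)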
Let's scroll down. We've already evaluated the model without embeddings, so now let's evaluate with their help. We input the questions and the answers, and we turn embeddings on. I'm going to inject just three snippets. If you inject more, you're giving more information, and generally that helps if you have a strong model; with a weaker model, more information can confuse it, because the snippets are ranked, so the more you include, the less relevant the extra ones become. So, somewhat counterintuitively, on a weaker model, putting in more information can make the answer worse. Then there's the embedding type: you could set it to OpenAI (you need an API key for that) or to Llama, but we're going to use Marco.

Let's run this evaluation and see how it does. On the first question — how many players on the field? — it answers six, so that's correct. The next question: does a forward pass result in a roll ball, a scrum, or something else? It says roll ball, so that's wrong. Now, is it wrong because the language model is wrong, or because the embeddings did a bad job of finding the right paragraph? To answer that, we can highlight some of the retrieved text, head over to ChatGPT, create a new chat, and paste in that question to see what ChatGPT comes up with. Here, the forward pass is not mentioned, so you can see that Marco was actually not successful in finding the right snippet for the forward pass: it's not really Llama's fault, it's the embedding model's fault in this case.

Moving on to the next question: in touch rugby, how many meters must the defending team retreat? Seven — that's correct. By the way, you can see how the question is posed: "Here are some snippets from touch rugby," then the relevant snippets, and then the question. How many substitutions in a game? It says eight substitute players are allowed during a game. That's correct as far as it goes, but it doesn't capture the fact that you can make unlimited substitutions, so I'd say it's only partially correct — though I wouldn't be confident GPT would answer that question much better. Indeed, it also doesn't exactly answer the question; maybe my question is poorly posed. Asking a follow-up: OK, that's correct — it doesn't specify. So that's a bit of an edge case; I don't think we can fault GPT too much there.

Next question: halftime is 5 minutes — correct. Question six: how does the game commence? It says it commences with a roll ball at the nearest point on the 7-meter line. That's incorrect; it starts with a tap on the halfway line. We can ask GPT what it says, and GPT does get the correct answer. So here you know the embeddings are correct — Marco found the right information — but Llama is not able to parse it properly.

Question seven: in touch rugby, how many meters must defenders retreat when there's a penalty? It gets that wrong: it has flipped the distances for a penalty and a touch. Let's see if GPT gets it right... OK, interesting: it thinks this applies after the touch — the snippet clearly says it must be 7 meters — and yet GPT and Llama both get confused and flip the order around. So even a powerful language model here is just not getting the facts correct, and I think that really puts things in perspective.

Question eight: how many touches is the team entitled to before a change of possession? Six — correct. Second-to-last question: what happens if a player is touched by an opponent prior to making a pass? The answer is a penalty, but the model gets it wrong, so let's see why. Pasting it into ChatGPT: yes, GPT also gets that answer wrong. Looking at the data that was provided, this is a case where the embeddings model is being stretched, and the language model is being stretched too. The embeddings model is stretched because there is very similar text about the ball being accidentally knocked from the hands of a player (the touch counts in that case), and this is close enough that I can see why the embeddings pick up that sentence; but there is also a sentence about being touched right before a pass that it should be picking out, and it isn't. And when you look at GPT's answer, it says the touch counts and the attacking team retains possession, but that's not exactly the question I asked, so GPT's reasoning on this question is also not 100%. For the last question — how many points for a try? — the model gets it correct: one point.
These 10 questions on touch rugby give us some really useful insight into the performance of the embeddings, of Llama, and even of GPT-4. To recap: we ran 10 questions through Llama 13B with no extra help, and it got about two or three out of 10 correct. Adding embeddings brought the number of correct answers up to about five or six out of 10. The remaining questions — let's say four — were wrong for different reasons. In one or two cases, the embeddings model wasn't able to find the correct snippet and provide it as context. In one or two other cases, Llama's reasoning was not as strong as GPT-4's: the embeddings model had given the right information, but Llama drew the wrong conclusion. Interestingly, in one or two cases we saw flaws in GPT-4's reasoning as well: even though the correct information was provided by the Marco embeddings model, GPT-4 was not able to reach the correct answer.

Before we wrap up, let me give you a few pro tips if you want to push performance a little further. We haven't fully optimized prompting, so you might be able to eke out some more performance with Llama and GPT-4. For example, you might ask the model to respond that it's unsure if there isn't enough information in the snippets; you might also ask the model to reason step by step. On a more technical note, we used a dot-product approach with the Marco model, but there are other, more detailed approaches, like ColBERT. I want to give a shout-out to the swyx podcast, which is on YouTube as well, for this recommendation. ColBERT is a slightly more advanced approach. I talked earlier about how you can embed each word into a little arrow and then average those arrows to get a large arrow for a sentence. Another approach is to individually compare those words with the query and then aggregate the comparisons. Said a little differently: in the approach I've taken — the standard approach — we average the embeddings across the tokens and then compare that large arrow for a paragraph with the arrow for the query; with the ColBERT approach, you compare smaller pieces of the paragraph with the query and then combine those comparisons before deciding which paragraph to choose. This allows you to keep a little more information and might lead to better results (a sketch of this idea follows at the end of these captions).

Let's finish with a few tips. Remember: the dot product generally performs better than cosine similarity if you're searching with a query for large paragraphs. Consider using Marco or a similar free embedding model from SBERT: it's basically free, and often it will be faster because of OpenAI's rate limits. As you'd expect, bigger models do perform better — we saw how GPT-4 gets maybe one more answer right out of 10 in today's case study — but even GPT-4 can make mistakes. It's important to separate the embedding performance (fetching the right pieces of text) from the language model performance, and figure out which part of your pipeline is or isn't working. Last of all, remember that no model is perfect, so I highly recommend writing out some manual examples and making sure to manually inspect the results you're getting from your model.

That's it for part one on embeddings. I hope you'll be around for part two, when we'll do the exact same thing with a fine-tuning approach. Let's see how close we can get with fine-tuning to 10 out of 10.
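[Editor's sketch] As a footnote to the ColBERT mention above, here is a rough sketch of the late-interaction scoring idea, contrasted with the pooled-arrow approach used in this video. (Note: the video describes averaging the token-level comparisons; standard ColBERT keeps the maximum similarity per query token and sums them. Shapes are illustrative; a real system uses a trained ColBERT model.)

import numpy as np

def pooled_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # The approach in this video: average token arrows into one arrow
    # per side, then take a single dot product.
    return float(query_vecs.mean(axis=0) @ doc_vecs.mean(axis=0))

def late_interaction_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # ColBERT-style MaxSim: compare every query token to every document
    # token, keep the best match per query token, then aggregate.
    sims = query_vecs @ doc_vecs.T        # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())  # retains more token-level detail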
Info
Channel: Trelis Research
Views: 3,701
Keywords: fine-tune Llama 2, language model embeddings, llama 2 embeddings, llama embeddings, how to create language model embeddings, how to use embeddings, embeddings huggingface, embeddings transformer, openai embeddings, openai embedding, word embeddings, text embeddings, large language model, semantic search, language model embeddings explained, embeddings explained, llama semantic search, openai api, Embeddings vs Fine Tuning, Fine-Tuning, Embeddings
Id: egnf8L-EUJU
Length: 31min 22sec (1882 seconds)
Published: Thu Sep 14 2023