[music] [applause] -Hello, everyone. I hope you all enjoyed the keynote. I know I did. I hope you all are enjoying
your time here at OpenAI's first developer conference. In this breakout session,
we're going to be talking about all the different techniques
that you can use to maximize LLM performance when solving
the problems that you care about most. Just to introduce myself,
my name is John Allard. I'm an engineering lead here
on the fine-tuning product team at OpenAI, and it's really been
a very exciting few months for fine-tuning at OpenAI. Back in August,
we launched 3.5 Turbo fine-tuning and we were just blown away
by the reception from the developer community. We followed that up
with a few important features, so there is fine-tuning
on function calling data. There was continuous fine-tuning
where you can take an existing fine-tuned model and continue
the fine-tuning process on it. We even launched a full UI
for fine-tuning within the platform. Over these last few months,
we've been able to work closely with developers
from all corners of industry. Solo indie developers,
developers from start-ups, and developers from some
of the largest enterprises on earth. We've been able to see what are
the problems that they're trying to solve, how are they trying to use LLMs
to solve these problems, and specifically how are they trying
to use fine-tuning of LLMs to solve these problems. I hope to share some of these insights
with you all today. With that said, I'm going
to hand it off to my colleague Colin to get us started. -Thanks, John. Hey, folks.
I'm Colin. Nice to meet you. I head up our solutions practice
in Europe, which basically means working with some of our strategic customers
to try and solve their most complex problems. What you'll probably be unsurprised
to hear is that over the last year, optimization has been
the most constant focus from everybody. Trying to get LLMs reliably
into production. Why is that such a focus? Optimizing LLMs is hard. Despite all the frameworks,
despite all of the content that everybody's releasing,
all the metrics and all the different tools
that people have provided, it's still one of the biggest focuses
and there's still no one-stop shop for how to optimize. It really depends
what category of problem you've got and how you approach it. What we're hoping to show you today is a framework to figure out what is going wrong
and how to approach it, and then the tools
that you can use to solve things. I guess, starting with the reasons why it's hard: it's hard to separate the signal from the noise, to know exactly what the problem is. That's the first thing. The second thing is that performance
can be really abstract and difficult to measure with LLMs,
so it can be really difficult to know how much of a problem you have. Even if you know
what the problem is and how much of a problem you have, it's also difficult
to know which approach you use to solve the problem that you've identified,
so that's really the focus for today. Today's talk
is all about maximizing performance. What we're hoping
you'll leave here with is a mental model of what
the different options are, an appreciation of when to use
one above the other, and the confidence to continue
on this optimization journey yourselves. Starting off, optimizing LLM performance
is not always linear. A lot of folks present a chart
like this where you start off with prompt engineering,
you then move on to retrieval-augmented generation,
and then you move on to fine-tuning, and this is the way
that you approach optimizing LLMs. This is problematic
because retrieval-augmented generation and fine-tuning solve different problems. Sometimes you need one,
sometimes you need the other, and sometimes you need both depending on the category
of issue you're dealing with. We think of it more like this. There are two axes you can optimize on. One of them is context optimization. What does the model need to know to solve your problem? The other is LLM optimization. How does the model need to act? What's the method that it needs
to carry out or what's the action that it needs to deliver
to actually solve your problem? A typical flow that you see is starting
in the bottom left with prompt engineering. With prompt engineering you can do both. You just can't scale it that well. Prompt engineering is always
the best place to start. You can test and learn very quickly,
and the first thing you should do is start off with a prompt, get to an evaluation, and figure out how you're going
to consistently evaluate your outputs. From there you can then decide, is this a context problem
or is this a how we need the model to act problem? If you need more context,
or more relevant context, then you go up to retrieval-augmented generation
or RAG. If you need more consistent instruction
following, then you go right to fine-tuning. These two things stack together,
they're additive, so sometimes your problem requires both. We're going to give you
examples of where folks have just used one or two and where folks
have used all of them to solve their problems. A typical optimization journey
often looks a lot like this. You start off in the bottom left corner,
you've got a prompt, you create an evaluation,
and then you figure out what your baseline is. Then a typical simple next step is to add few-shot examples: give the model a couple of input/output example pairs showing how you want the model to act.
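To make that concrete, here's what few-shot examples can look like in a chat request. This is a minimal sketch assuming the OpenAI Python SDK; the task and example pairs are invented for illustration.

```python
# A minimal sketch of few-shot prompting with the OpenAI Python SDK.
# The task and example pairs here are made up purely for illustration.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "Extract the city and sentiment from each review as JSON."},
    # Input/output pair 1: show the model the exact behavior we want.
    {"role": "user", "content": "Loved the food in Lisbon, would go back tomorrow."},
    {"role": "assistant", "content": '{"city": "Lisbon", "sentiment": "positive"}'},
    # Input/output pair 2.
    {"role": "user", "content": "Parking in Boston was a nightmare the whole trip."},
    {"role": "assistant", "content": '{"city": "Boston", "sentiment": "negative"}'},
    # The real input, which the model should now handle the same way.
    {"role": "user", "content": "The museums in Vienna were stunning."},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```

Let's say at this point,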
actually those few-shot examples increase the performance quite a bit,
so let's hook that up to some kind of knowledge base
so that we can industrialize that process, and that's where usually folks
will add some kind of retrieval-augmented generation. Let's say that it's now got context,
but it's not producing the output in exactly the format or style
that we want every time, so we might then fine-tune a model. Then a classic next step is then maybe
the retrieval is not quite as good as you want it to be. Maybe that content could be more relevant
to what the model needs. You then go back and optimize
the retrieval-augmented generation again. Now that you've optimized
the retrieval-augmented generation again, you want to fine-tune your model again
with these new examples that you've introduced with your updated
retrieval-augmented generation. Bit of an example here
of the classic optimization flow that we see. If I could summarize it
in the simplest possible terms: you try something, you evaluate it, then you try something else. Let's dive into each
of these quadrants now. We're going to start in the bottom left
with prompt engineering, then we're going to move on
to retrieval-augmented generation, fine-tuning. Then we're going to take all this
for a spin with a practical challenge that myself and John took on and show you
how this works in practice. Prompt engineering. Now, I know most of you
in the audience are going to be very, very familiar with this,
so we're going to skip through this at a fair rate,
but always best to start and just make sure everybody knows
the principles here. Prompt engineering. A few strategies here. This comes from the best practices
on our documentation, but just to recap them. First of all, writing clear instructions. I'll show an example of what that means,
but this is often where folks fall down in the first place. Secondly, splitting complex tasks
into simpler subtasks. If you can imagine that the model
is trying to make a prediction or a series of predictions
for every subunit or subtask that you're giving it
to solve, you should give it as specific instructions as possible
to break down that problem so that it has a better chance of carrying it out. Similarly, giving GPTs time to think. I'll give you
an example of a very common framework that people use to do that. The last thing-- I've alluded to this already,
but testing changes systematically. So many times we see our customers end up
in this sort of whack-a-mole situation where they change one thing, they change another thing,
they change another thing. They're just jumping
all around on their evaluation metrics and they don't feel like they're going
in the right direction. This is really where you need
a solid set of evals and typically some kind of LLMOps
so that you can just systematically measure these things as you change them. After that, the most common next step
is then to extend it to reference text or to give it access to external tools,
which takes us more into the field
of retrieval-augmented generation. First of all, let's recap
what these look like in practice. First of all, a couple of intuitions
for prompt engineering. Prompt engineering, said it
a couple of times, I'll say it again, best place to start and can actually be
a pretty good place to finish, depending on your use case. What is it good for? Testing and learning early, and, when paired with evaluation, providing a baseline to set up further optimization. It should always be where you start. What is it not good for? A few things. Introducing new information. You can pack quite
a bit of information into the prompt, and actually with GPT-4 Turbo,
you can now pack a ton of information into the prompt. That said, it's not a super scalable way
to do that using prompt engineering. We'll see a couple of ways
that we can approach that using other methods. Also, reliably replicating a complex style
or method, again, limited by the context window
in terms of the number of examples that you can actually show to the model. It's a great place to start, but depending
on the complexity of your task, might not get you there. The last thing is minimizing token usage, a very common problem
with prompt engineering. You keep hitting problems
and keep adding more and more facets to your prompt to actually deal
with those problems, and you end up using more and more and more tokens,
which then costs you in latency and cost, all these sorts of things. Again, prompt engineering is not
a great way of dealing with that particular problem. Quick recap of things not to do,
things to do with prompts. We've got a pretty bad prompt
here with some vague instructions and some fairly random output,
and just recapping a couple of ways to improve that. Clear instructions. Telling it exactly
what it will be presented with and what its task is. Giving time to think. This isn't
a particularly good example here. I'm telling it to approach the task step by step, but by giving it time to think I mean things more like the ReAct framework, where you get it to write out its reasoning steps; it's basically helping itself get to the answer. The ReAct framework is just one way that you can approach that, but giving GPTs time to think is a great way of dealing with cases where you have some very complex logical reasoning that you need it to do, because at the end of the day, it's a next-token predictor: it's printing the tokens that it needs to help it get closer to that answer, depending on the strength of your prompt.
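To make "time to think" concrete, here's a minimal sketch of a reasoning-steps prompt in the spirit of ReAct. The wording is an illustrative assumption, not a canonical ReAct prompt, and it assumes the OpenAI Python SDK.

```python
# A rough sketch of "giving the model time to think": the prompt asks for
# explicit reasoning steps before the final answer. Wording is illustrative.
from openai import OpenAI

client = OpenAI()

system = (
    "Solve the task step by step. Before answering, write out your reasoning as:\n"
    "Thought: <what you are working out>\n"
    "Repeat Thought lines as needed, then finish with:\n"
    "Answer: <the final answer>"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": "A train leaves at 9:40 and the trip takes 2h 35m. When does it arrive?"},
    ],
)
print(response.choices[0].message.content)  # Thought lines, then "Answer: 12:15"
```

The last thing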
is breaking down complex tasks into simpler tasks. In this case,
I mentioned thinking of each step almost as a prediction. In this case, just laying them out
as clearly as possible. On the right side, we can see
a fairly nicely formatted JSON output. Again, just the basics, but just wanted to recap
those before we move on. A common next step: with prompt engineering you're trying to tell the model how you want it to act, but often it's very difficult to know which of those tokens is actually influencing the model the most. A great way to start is actually
by approaching it as a show-rather-than-tell problem: just providing few-shot examples, giving it input and output pairs and actually showing it the behavior that you want it to have. This leads us on nicely to the next step. Typically folks see
some good performance improvement. We're going to see
in the practical section that that's what gives us
some very good lift with the practical tasks that we take on. They want to industrialize it
and they want those few-shot examples to be contextual
based on a user's question or based on whatever the
context of this particular problem is. That's where folks typically take
few-shot and move on to retrieval-augmented generation, RAG. Let's jump right in. Actually, before I jump into RAG, I just want to give you a quick mental model for how to think about where to go. We started with prompt engineering. We've evaluated,
we've identified the gaps, and now we're trying to decide whether it's
a retrieval-augmented generation that we need, or whether it's fine-tuning
that we need. It's sometimes useful to think of it as a short-term memory
versus long-term memory problem. Think of it as preparing a student for an exam. Your prompt is giving them the instructions that they need to complete the exam. Fine-tuning is like all the studying that they do beforehand to learn the methodology and the actual framework that they need to answer those questions. Retrieval-augmented generation
is like giving them an open book when they actually go to the exam. If they know the methodology
and they know what they need to look for, and then retrieval-augmented generation
means that they can just open up the book, go to the right page, and actually pull
the content that they need. That's why these two things
solve very different things. Without the methodology
and without the content, it can be impossible
to solve certain problems. In this case, we're assuming
that we've got a short-term memory problem. We want to give the model
the right context that it needs to answer the question. Retrieval-augmented generation, or RAG, is all about giving
the model access to domain specific content. Quick recap of what RAG is. I know, again,
most people in the room are going to be familiar, but I'm just going
to recap this for the benefit of all. You'll typically start with some
kind of knowledge base, some source of the content that you want to use
to answer these questions. In this case, we're going
to use a fairly typical flow, which is we've got some documents,
we're going to embed them, we're going to stick them somewhere. Again, I know folks
out there probably have their own search services,
all their sources of documents that they would use,
and that's fair enough. For this example, we will assume
that we have some documents, we embed them, we make a knowledge base. Then when the user comes in,
they're going to ask a question. Let's say, "What's
the population of Canada?" We're going to go, and instead
of giving that directly to the LLM, we're going to fire
that at our knowledge base using some kind of search. Let's imagine we do a similarity search. We're going to pull back some content. We have some content that says
what the population of Canada is. We're then going to combine
that with a prompt. We're going to give it to the LLM and say, "Here's the question. Here's the content. Answer this question with this content." We're going to end up with a, hopefully, correct answer. Quick recap of RAG.
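As a rough sketch of that flow, assuming the OpenAI Python SDK and a toy in-memory store (a real system would use a vector database and proper chunking):

```python
# A minimal RAG sketch: embed documents, retrieve by cosine similarity,
# then answer using only the retrieved content. Documents are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = [
    "Canada's population was estimated at about 40 million in 2023.",
    "The capital of Canada is Ottawa.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(docs)

question = "What's the population of Canada?"
q_vec = embed([question])[0]

# Cosine similarity, then take the best-matching document as context.
scores = doc_vectors @ q_vec / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec))
context = docs[int(np.argmax(scores))]

answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only the provided content."},
        {"role": "user", "content": f"Content: {context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```

As we did with prompt engineering,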
I want to share a little bit of intuition that we've developed in terms of
when you should use RAG and when you shouldn't. What RAG's good for, again, introducing new information
to the model to update its knowledge. This is one of the few ways
you can do that now. It's actually one of the biggest problems
that customers come to us with. They're like,
"Hey, I've got 100,000 documents. I want the model to just know
these documents." Unfortunately, right now there's
no super scalable way to take those 100,000 documents
and give the model knowledge of all those at one time. Retrieval-augmented generation
is probably about as close as you're going to get right now, which is we're going to give
some contextual knowledge to it based on the particular problem
that you want it to solve. Similarly, reducing hallucinations
by controlling content is one of the very common use cases
of retrieval-augmented generation. We'll see a bit later
how that pairs really nicely with fine-tuning. A typical use case
is you give the model content, you give it instructions to only use
that content to answer questions, don't use its knowledge. That's a typical way that folks
try and constrain the knowledge to a particular knowledge base
and reduce hallucination. What it's not good for. I alluded to it there, but embedding
understanding of a broad domain. Currently, retrieval-augmented generation
will not allow you to teach it what law or medicine is. Unfortunately, that's not
one of the things that retrieval-augmented generation
will let you do. Similarly, teaching the model to learn
a new language format or style. This is probably where you're more
in the fine-tuning realm, where you're trying
to teach it a methodology or a way of approaching solving
the problem. Again, reducing token usage. In fact, you're going to add
many, many more tokens in RAG. You're going to keep adding
input/output examples. I often see folks go prompt engineering
and then RAG because the first thing they're trying to do is get the accuracy
to a level that they're comfortable with, and then they'll try
and strip tokens back out of this process. John's going to tell you
a lot more about that later. With RAG, you're really just trying to optimize for giving the model as much context as it needs to answer the question. I wanted to share a success story here,
because with prompt engineering and RAG, it sounds like these things
can be quite simple, but they're really quite hard. It takes a lot of iterations
and a lot of testing and learning to actually make this happen for real. This example, a customer
had a RAG pipeline with two different knowledge bases
and then an LLM. Its job was to get a user's question,
decide which knowledge base to use, fire a query, and use
that to answer the question. When we started, we just implemented retrieval. We'd had loads of talks with them, and we were all really excited for how good embeddings were going to be and how easy this was going to be. Our baseline was 45% accuracy,
so not so great. What we then tried was
a whole bunch of stuff and I've put little ticks and crosses next
to them to show how many things we tried and how many things
actually made it into production. The things with ticks were things
that we actually took to production. The things with crosses were things
that we tried and discarded. We managed to boost it to 65% by trying
hypothetical document embeddings, where instead of doing a similarity search with the question, you generate a fake answer and do a similarity search with that. For certain use cases,
that works really well, for this one, it did not work well. We also tried fine-tuning the embedding,
actually changing the embedding space based on a training set that we had
to actually help the model get to the correct answer. Again, this actually worked okay
from an accuracy perspective, but it was so expensive and so slow
that we actually had to discard it for non-functional reasons. The last thing we did was chunking and embedding: trying different-sized chunks of information and embedding different bits of content to try and help the model discern which were the most relevant. Again, we got a 20% bump, but we're still fairly far from something we could put in front of customers. This was maybe 20 iterations
that we'd gone through to get to 65%. At this stage we're kind of like,
"Right, are we going to pull the plug on this thing?" But we stuck with it
and we then tried re-ranking: applying a cross-encoder to re-rank the results, or using rules-based approaches like, "Oh, well, it's research, maybe we want the latest documents," something like this. We actually got a really big performance bump from that.
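For the cross-encoder piece, here's a minimal sketch using the sentence-transformers library; the model name is a common public checkpoint, and the question and chunks are invented for illustration.

```python
# A sketch of cross-encoder re-ranking: score each (question, chunk) pair
# jointly and keep the top-scoring chunks. Model name and data illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

question = "What was revenue growth in 2022?"
chunks = [  # candidates from the first-stage similarity search
    "Revenue grew 12% year over year in fiscal 2022.",
    "The 2022 offsite was held in Lisbon.",
    "Operating costs fell slightly in 2022.",
]

scores = reranker.predict([(question, chunk) for chunk in chunks])
reranked = [chunk for _, chunk in sorted(zip(scores, chunks), reverse=True)]
print(reranked[0])  # the most relevant chunk goes to the model first
```

Also, classification: basically, having the model classify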
which of the two domains the question belongs to, and then actually giving it extra metadata in the prompt, depending on which domain it was classified into, to help it decide which content was most likely to be relevant. In this case, again,
pretty good bump, 85%. We're now looking like we're just
on the cusp of getting to production. Then the last thing we
tried was further prompt engineering. We went back to the start
and actually tried to engineer that prompt a lot better. We then looked
at the category of questions that we were getting wrong
and then we introduced tools. For example, we noticed
there were structured data questions where it needed to go and pull
figures out of these documents. What we decided to do
was instead just give it access to a SQL database where it would just put
in the variables and execute a query and then actually bring back
structured data answers. The last thing was query expansion, where somebody asks, say, three questions in one, and you parse those out into a list of queries, execute those all in parallel, bring back the results, and then synthesize them into one result.
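A minimal sketch of that query-expansion idea, with `search` standing in for whatever retrieval you already have; the prompts and JSON parsing are illustrative.

```python
# A sketch of query expansion: split a compound question into standalone
# sub-queries, run them in parallel, then synthesize one answer.
import json
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def expand(question: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
            f"Split this into standalone search queries as a JSON list of strings:\n{question}"}],
    )
    return json.loads(resp.choices[0].message.content)

def answer(question: str, search) -> str:
    queries = expand(question)
    with ThreadPoolExecutor() as pool:  # execute the sub-queries in parallel
        results = list(pool.map(search, queries))
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
            f"Question: {question}\n\nRetrieved results:\n{results}\n\n"
            "Synthesize a single answer from the retrieved results."}],
    )
    return resp.choices[0].message.content
```

These things together got us to the point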
where we got to 98% accuracy. At no point in this process
did we use fine-tuning. I want to call that out because, again, as we said at the start,
the assumption is often you need fine-tuning
to go to production. Actually, in this case, every problem we were dealing with was context. It was all either we weren't giving it the right context or it didn't know which of those context blocks was the right one. That's why it's so critical to know: what is the problem
we're solving for here? Because if at any point
we had gone to fine-tuning, that would've been wasted money
and wasted time. That's why this is a success story
that we're happy with. I guess I wanted to give
a slightly different-- Cool, thank you. [laughs]
[applause] Sweet. I wanted to give a cautionary tale as well because at times RAG-- RAG is great,
you have all this great content, the model will use that to answer the question, but it can also backfire massively. I'll give you
a different customer example. We had a customer
where they were trying to reduce hallucination by using retrieval-augmented generation. They told the model, "You are to use only this content." Then they had human labelers who would check and flag things as hallucinations. One day, a funny guy at the customer asked, "What's a great tune to get pumped up to?" The model came back with
Don't Stop Believin' by Journey.
[chuckles] The labelers were like, "Right,
this is definitely a hallucination," but fortunately
it was not a hallucination, it was actually their content. Somebody had written a document that said, "The optimal song to energize
financial analysis." Don't Stop Believin' was the answer. This, while sort of funny, is also an example of where RAG can fail: if you tell the model to only use the content and your search is not good, then your model has a 0% chance of getting the correct answer. The reason I call this out is that
when you're evaluating RAG, you've actually added
a whole other axis of things that can go wrong. It's like we have our LLM
which can make mistakes, and then we have search, which is not a solved problem. That's why I wanted to call out
a couple of the evaluation frameworks that the open-source community has come up with, and I especially want to call out the Exploding Gradients team. They developed this framework called Ragas, which is cool. It basically breaks down these different evaluation metrics, and you can pull it down from GitHub and use it straight out of the box, or just adapt it to your needs. Basically, there's
four metrics that it measures. Two of them measure how well
the LLM answers the question, and two of them measure how relevant
the content actually was to the question. If we start off with the LLM side,
the first one is faithfulness. It takes the answer
and it breaks it into facts, and then it reconciles each
of those facts to the content. If it can't reconcile a fact,
that's hallucination. Then, it returns a number,
and if that number is above a certain threshold, you block the response, because you've found hallucinations, basically. This is one very useful metric that comes out of it.
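To make the mechanism concrete, here's a rough homegrown version of the faithfulness idea; Ragas implements this properly, and the prompts here are illustrative.

```python
# A rough sketch of the faithfulness check described above: split the answer
# into atomic facts, then ask a model whether each fact is supported by the
# retrieved content. Prompts and the yes/no parsing are illustrative.
from openai import OpenAI

client = OpenAI()

def faithfulness(answer: str, context: str) -> float:
    # 1. Break the answer into individual factual claims.
    facts = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
            f"List each factual claim in this answer, one per line:\n{answer}"}],
    ).choices[0].message.content.splitlines()

    # 2. Reconcile each fact against the retrieved content.
    supported = 0
    for fact in [f for f in facts if f.strip()]:
        verdict = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content":
                f"Context:\n{context}\n\nIs this claim supported by the context? "
                f"Answer yes or no.\nClaim: {fact}"}],
        ).choices[0].message.content.lower()
        supported += verdict.startswith("yes")

    # 1.0 means fully grounded; lower values signal hallucination risk.
    return supported / max(len(facts), 1)
```

The other one is answer relevancy. A lot of the time the model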
will get a bunch of content, and then it will make an answer
that makes good use of that content, but actually, has nothing to do
with what the user originally asked. That's what this metric actually measures. If you find, "Okay, well,
it's all factually correct, but we have a very low relevancy,
that means we probably need to prompt engineer, we probably need to do something here to make the model pay more attention to the question and decide not to use the content if it's not relevant." That's on the LLM side, but the other side
is how relevant is the content, and this is where I found it most useful
for my customers, because as we alluded to earlier,
the classic anti-pattern with RAG is just putting more and more context into the context window. It's like, "Hey, if we give it 50 chunks, it'll get the right answer," but the fact is that this actually ends up getting the wrong result a lot of the time. There's a paper called Lost in the Middle showing that the more content you give a model, the more it starts to hallucinate or to forget the content in the middle. Actually, what you want is
the most precise pieces of content and that's where this metric evaluates
the signal-to-noise ratio of retrieved content. It takes every content block
and compares it to the answer and it figures out whether that content
was used in that answer. This is where you start
to figure out, "Okay, we're getting really high accuracy,
but we've maybe got a 5% context precision. Can we actually bring
back much less content and still get correct answers?" This is one of the areas I think
where it's really useful for folks to start thinking, because sometimes folks get to production or get close to production
and then the instinct is just more and more and more context. Actually, this metric gives you
a very solid way to calculate whether adding more context is actually helping us here. The last one is context recall,
so can it retrieve all the relevant information required? Basically, is the relevant information
that you need to answer that question actually in the content? This is the opposite problem. It's like, "Is the stuff that our search pushes to the top, the stuff we're actually putting in the context window, actually answering the question?" If this is very low,
this tells you that you need to optimize your search, you might need to add reranking,
you might need to fine-tune your embeddings, or try
different embeddings to actually surface the most relevant content. I guess I wanted to leave you with that
because that's like us trying to squeeze as much performance
as we can out of prompt engineering and RAG. Sometimes, again, the problem
that you're trying to answer is different. Sometimes, it's actually the task
that you're trying to perform which is the problem, and that's where you would take
a sideways step and actually try fine-tuning, and
that's where I'm going to hand you over to John
to take you further. [applause] -Let's talk about fine-tuning. Up until this point of the talk,
we've been focusing on the prompting family of techniques. This is where you figure out
clever ways of packing the context window of the LLM
at sample time in order to optimize the LLM's performance on your task. Fine-tuning is really
a different technique altogether from prompting. Just to start off with the definition,
so fine-tuning, and especially, in the context of large language models,
is when we take an existing trained model and we continue the training process
on a new data set that is smaller and often more domain-specific
than the original data set that the model was trained on. Fine-tuning is really
a transformative process where we essentially take a base model,
we fine-tune it, and we end up with a different model altogether. Just to step back for a second, the name fine-tuning is really
like a great description of this process, so we start off with
some model that has been trained on an enormous
and diverse data set. These models like 3.5 Turbo or GPT-4
have a lot of very general knowledge about the world. We take one of these very general models
and we specialize them and we essentially hone their abilities
to make them better suited for a task that we care about. Why would one fine-tune
in the first place? Now, I really want
to highlight the two primary and related benefits of fine-tuning. First off is that fine-tuning
can allow you to often achieve a level of performance
that would be impossible without fine-tuning. Just to plant
a little bit of intuition here, when you're using prompting techniques,
you're limited by the context size of the model
when it comes to the amount of data that you can show the model, right? At the low end,
that's maybe like a few thousand tokens, at the high end,
maybe it's like 128,000 tokens if you're using GPT-4 Turbo,
but really this is nothing compared to the amount of data that you can show
a model while you're fine-tuning. It's pretty trivial
to fine-tune over millions or hundreds of millions of tokens of data. You can just show many more examples
to a model while fine-tuning than you ever could hope to pack
into the context window of even the largest LLM. The second benefit is
that fine-tuned models are often more efficient to interact with
than their corresponding base models. There are two ways
that this efficiency shows up. To start us off, when you're interacting
with a fine-tuned model, you often don't have to use as complex
of prompting techniques in order to reach a desired level of performance
on that model. You often don't have
to provide as complex of instructions, you don't have to provide
explicit schemas, you don't have to provide
in-context examples. What this means is that you're sending
fewer prompt tokens per request, which means that each request
is both cheaper and generally quicker to get a response from. It's more latency- and cost-efficient
to interact with fine-tuned models. Next is that a common use case
for fine-tuning is essentially the distillation of knowledge
from a very large model like GPT-4 to a smaller model like 3.5 Turbo. It's always going
to be just more efficient from a cost and latency perspective to interact
with a smaller model than a larger model. Let's look at an example here. This is an example of a common task
that someone might want to solve with LLMs. What we're doing here
is we're essentially taking a natural language description
of a real estate listing and we're trying to extract
some structured information about that listing. If we were going to try to solve
this without fine-tuning, we would essentially open up
the toolbox of prompting techniques and we would write
some complex instructions. We would provide maybe an explicit schema
that we want the model to output. Here it's defined
as a Pydantic model in Python. We would maybe even provide
some in-context examples to the model. We would then give the model
a new real estate listing in natural language
and it would provide us some output. The output's pretty good,
but here, there's actually a mistake and it's a pretty trivial mistake. Instead of extracting the date
that we desired, it templated it to be the current date. This would be pretty trivial to fix. We could add a new rule,
we could essentially add a new in-context example,
and we could probably fix this problem.
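As a sketch of what that prompting-only setup might look like (the schema fields and listing text are invented for illustration, not the exact example from the slide):

```python
# A sketch of the prompting approach: explicit instructions plus a Pydantic
# schema for the structured listing. All names and values are illustrative.
from datetime import date
from pydantic import BaseModel
from openai import OpenAI

class Listing(BaseModel):
    bedrooms: int
    bathrooms: int
    price_usd: int
    available_from: date

client = OpenAI()

listing_text = "Sunny 2-bed, 1-bath condo, $450,000, move in from March 1, 2024."

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content":
            "Extract listing details as JSON matching this schema: "
            + str(Listing.model_json_schema())},
        {"role": "user", "content": listing_text},
    ],
)
# In a sketch we assume the reply is pure JSON; real code needs error handling.
parsed = Listing.model_validate_json(resp.choices[0].message.content)
print(parsed)
```

Let's see how we would approach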
this with fine-tuning. With fine-tuning, what we're going
to do is we're going to start with a relatively simple data set. Here we have examples,
and I want you to notice the simplicity of these examples. There's no complex instructions,
there's no formal schema, there's no in-context examples. All we're giving are natural
language descriptions of the real estate listing
and then the desired structured output. We take this dataset
and we fine-tune a model, then we take this fine-tuned model,
we give it a new real estate listing, and we can see that it essentially gets
the problem right in this case. This is just a simple example but,
in this case, the model
is both performant and efficient. At sampling time, we don't have
to provide the complex instructions, no in-context learning,
no explicit schema, and the model does better
than we were doing with just prompting techniques. Fine-tuning can be
a rather involved process, and so it's important to set
appropriate expectations about when fine-tuning is likely to work
for your use case and when it's not likely to work. Fine-tuning is really good for emphasizing
knowledge that already exists in the base model. An example of this might be
a text-to-SQL task. You have
these very powerful general base models like 3.5 Turbo and GPT-4,
and they really understand everything there is to understand
about SQL, the SQL syntax, the different dialects of SQL,
how databases work, all of these things,
but you might want to fine-tune the model to essentially emphasize
a certain SQL dialect or to coerce the model to not work
its way into edge cases that are specifically error-prone. You're essentially taking the knowledge
that exists in the base model and you're emphasizing a subset of it. Fine-tuning is also really great
for modifying or customizing the structure or tone of a model's output. One of the early killer use cases
for fine-tuning was to coerce a model to output valid JSON
because if you're trying to interact programmatically with a model,
getting out something that is valid JSON is very easy
to deal with programmatically. If it's invalid JSON,
that opens up many error cases. Finally, teaching
a model complex instructions, well, this is for the reasons
I mentioned earlier. It's just that you can show a model during the fine-tuning process
many more examples than you could ever hope to pack
into the context window of a model. On the other side, fine-tuning
is really not going to be good for adding new knowledge to the model. The knowledge that existed in an LLM
was impressed into that LLM during these very large pre-training runs. You're essentially just not going
to be able to get new knowledge into it during these limited fine-tuning runs. If you're trying to get new knowledge
into the model, you really want to look at something like RAG for all the reasons
that Colin just mentioned. Next, fine-tuning is not great
for quickly iterating on a new use case. If you're fine-tuning,
it's a relatively slow feedback loop. There's a lot of investment in creating the data set and all the other components of fine-tuning. Don't start off with it. I want to look
at a success story of fine-tuning, and this one comes
from our friends at Canva. The task here was essentially to take a natural language
description of a design mock that the user wanted, to give it to an LLM,
and have the LLM output a structured set of design guidelines. They could then use
those structured design guidelines to generate a full-sized mock and present
that to the user, so it's essentially a quick way to just throw out some ideas
and get a full-sized mock. Here the user says something like, "I want a red gradient. I want a profile photo maybe
in the style of an Instagram post." It goes to the LLM,
and it's supposed to output something that's very structured here. It has a title. It has a style with a few keywords
from a known set of keywords. It has a description of the hero image,
and then it has an actual search that they would give
to an image search engine to generate images for these full-size mocks. What Canva did
is they essentially started off with 3.5 Turbo as the base model, and then with GPT-4. They wanted to see what the performance was on this task, with outputs evaluated by expert human evaluators. The performance wasn't great. What they found was that
while these models could produce sensible outputs, those outputs were actually irrelevant when looked at from a design point of view. They then went on to fine-tuning. They essentially fine-tuned 3.5 Turbo
for this task and were really blown away by the result, so it not only beat
the base 3.5 Turbo model, but it actually drastically
outperformed GPT-4. What we're seeing here on the scale
is that while 3.5 Turbo and GPT-4 often output sensible,
but irrelevant design mocks, fine-tuned 3.5 Turbo was often outputting
rather good design mocks when evaluated
by expert evaluators within Canva. If you want to think
about why this use case worked, to start off, no new knowledge was needed. All the knowledge needed to solve
this problem existed in the base model, but the model needed to produce a very specific output structure. Then they used
very high-quality training data, and they had really good baselines
to compare the two. They essentially evaluated 3.5 Turbo,
they evaluated GPT-4, they understood where they were succeeding
and where they weren't, so they knew that fine-tuning was going to be a good technique to use for this task. I want to talk about a cautionary tale
for fine-tuning. There's
this author of a great blog post that I really liked. They had been experimenting with AI models as writing assistants. They tried ChatGPT, they tried a few base models from the API,
they were impressed, but they were disappointed
that these models weren't capturing their tone. They have a very specific tone
that they use when they're writing a blog post or social media posts
or drafting emails, and the base models
just weren't capturing this tone. They had a great idea and they said,
"I'm going to download two years worth of Slack messages," 140,000 messages
in total. They wrote a script to format
these Slack messages into a data format that's compatible with fine-tuning,
then they fine-tuned 3.5 Turbo on these Slack messages. It's a long process: you've got to collect the data, aggregate the data, massage it into a format that's compatible with fine-tuning, and then fine-tune the model. They finally go through this process,
they get a fine-tuned model, and they ask it to do something. They say, "Can you write me
a 500-word blog post on prompt engineering?" This model,
this personalized writing assistant responds, "Sure, I'll do it in the morning." [laughter] -I'm sure a little surprised and shocked, he follows up, and he's like,
"I prefer you wrote it now please." [laughter -The model says, "Okay,"
and then does nothing else. [laughter] -We really got a kick out of this
on the fine-tuning team and the author was a really great sport, but if we take a step back
for a second, fine-tuning really worked here. Essentially, what the author
wanted was a model that could replicate their writing style. What they got was a model
that could replicate their writing style, but their Slack writing style. If you think
about how you communicate on Slack, it's very terse. It's in a stream of consciousness style. You're often foregoing punctuation,
you're foregoing grammatical correctness. What they got was a model
that replicated that. While fine-tuning a model
to replicate your tone is actually a relatively good use case
for fine-tuning, the error here was they
didn't fully think through whether the data
that they were providing the model really replicated
the end behavior that they wanted from that model. What they probably should have done
here was take a hundred Slack messages,
200 Slack messages, fine-tune the model, experiment with it, and see if it's moving in the right direction. Is it getting closer to the tone
that I want the model to replicate? They would've seen pretty quickly
that that was not the case, then maybe they would've gone
and fine-tuned it on their emails or their blog post
or their social media posts, and maybe that would've been a better fit. We've seen some examples,
we've developed some intuition, so how does one actually go
about fine-tuning a model? Like any ML problem, the first step is you got
to get your data set, and like with most ML problems,
this is actually the most difficult part. Some ways of getting a data set,
you can download an open-source data set, you can buy data on the private markets,
you can pay human labelers to essentially collect
the data and label it for you, or you can often distil it
from a larger model, if the terms of service of that model support that specific use case. One way or another, you have to come up with a data set to fine-tune on. Next, you're going to actually kick off
the training process. This varies a lot depending
on how you're trying to do the training. If you use a turnkey solution
like the OpenAI fine-tuning API, this can be relatively simple.
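As a minimal sketch of that turnkey path (the file name and data are illustrative):

```python
# A minimal sketch of the OpenAI fine-tuning API. Each line of train.jsonl
# holds one training example in the chat format, e.g.:
# {"messages": [{"role": "user", "content": "<listing text>"},
#               {"role": "assistant", "content": "<structured output>"}]}
from openai import OpenAI

client = OpenAI()

# Upload the training data.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

# Kick off the fine-tuning job.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)  # poll this job; the result is a new model id you can sample from
```

If you're trying to fine-tune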
an open-source model, totally doable. You're just going to have to get
your own GPUs, use a framework, it's a little bit more involved. It's important while you're training
to essentially understand the hyperparameters
that are available for you to tune during the training process, right? Are you more likely to overfit,
are you more likely to underfit? Are you going to fine-tune it to the point of catastrophic forgetting? It's important to just understand
the available hyperparameters and the impact that they have
on the resulting fine-tuned model. Next, I want to point out
that it's really important to understand the loss function. When you're fine-tuning an LLM, really when you're looking
at the loss function, it's a proxy for next token prediction. This is great
when you're fine-tuning an LLM, but next token prediction is not often super well correlated
with performance on the downstream tasks that you care about. If you think about code generation,
there are many different ways to write code to solve a single problem,
and so if you're just doing next token prediction
and exact token match, the loss or change
in the loss function for a model might not correlate to the change
in performance on the downstream task. It's important to understand that. Next you want to evaluate the model. There's a few different ways of evaluating
the model. You can essentially get
expert humans who look at the outputs and actually rank them on some scale. Another technique is
that you can essentially take different models, generate outputs from them
and then just rank them against one another. Not having an absolute ranking but doing
something like an Elo score, like you get from chess. You can also have a more powerful model rank the outputs for you. This is really common: using GPT-4 to rank the outputs of fine-tuned open-source models or GPT-3.5 Turbo.
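Here's a rough sketch of that model-as-judge pattern; the prompt is illustrative, and real setups randomize the A/B order to avoid position bias and aggregate many comparisons, for example into Elo scores.

```python
# A sketch of model-as-judge evaluation: GPT-4 picks the better of two
# candidate outputs for the same task. Prompt and parsing are illustrative.
from openai import OpenAI

client = OpenAI()

def judge(task: str, output_a: str, output_b: str) -> str:
    verdict = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
            f"Task: {task}\n\nResponse A:\n{output_a}\n\nResponse B:\n{output_b}\n\n"
            "Which response is better? Answer with exactly 'A' or 'B'."}],
    ).choices[0].message.content.strip()
    return verdict

# e.g. judge("Summarize this ticket...", fine_tuned_output, base_model_output)
```

Finally, you want to actually deploy it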
and then sample from it at inference time. These last three points can form
something of a data feedback loop. You can essentially train the model,
evaluate it, deploy it to production, collect samples from it in production,
use that to build a new data set, down sample the data set,
curate it a bit, and then fine-tune further on that data set and get
something of a flywheel going here. We've talked about a few
of these up until this point, but I want to formalize
some of the best practices that we recommend when it comes
to fine-tuning. First off is just start
with prompt engineering and few-shot learning. These are very simple, low-investment techniques. They're going to give you some intuition
for how LLMs operate and how they work on your problem. It's just a great place to start. Next is that it's really important
to establish a baseline before you move on to fine-tuning. This ties back to the
success story for Canva. They experimented with 3.5 Turbo. They experimented with GPT-4. They got a really good understanding
for the ins and outs of their problem that they were trying to solve. They understood
the failure cases of those models, they understood where the models
were doing well, so they understood exactly
what they wanted to target with fine-tuning. Finally, when it comes
to fine-tuning, start small, don't download 140,000 Slack messages
and then just do it in a single shot. Develop a small high quality data set,
perform the fine-tuning, and then evaluate your model and see
if you're moving in the right direction. You can do something
like an active learning approach here. Where you fine-tune the model, you look at its outputs,
you see in what areas it's struggling, and then you specifically target
those areas with new data. It's about investing very intentionally, and it's really important to remember that when it comes to LLMs and fine-tuning, data quality trumps data quantity. The data quantity part
of the training process was done in pre-training. Now it's like you really want to focus
on fewer high-quality examples. Just to talk about fine-tuning and RAG,
so if you want to combine these together, for certain use cases,
it can be the best of both worlds. Oftentimes how this works is
that you fine-tune a model to understand complex instructions
and then you no longer have to provide these complex instructions or few-shot examples at sample time. You essentially fine-tune a model
that's very efficient to use. What this means
is that you've essentially minimized the prompt tokens that need
to be provided at sample time, because you no longer need
to do complex prompt engineering. It's baked into the fine-tuned model. This means that you have
more space for retrieved context. You can then use RAG to inject
relevant knowledge into the context and the context that's available
has essentially been maximized at this point. Now, of course, you have to be careful not to oversaturate the context with material that might have spurious correlations with the actual problem that you're trying to solve, but this essentially opens up room in the context window to be used
for more important purposes. With that said, we've been talking about theory up
until this point in the talk, we're now going to talk
about application of the theory. I'll turn it back
over to Colin to get us going. -Thanks, John. [applause] -Cool.
Let's take all this theory for a spin. The problem we decided to take
on was the Spider 1.0 benchmark, so effectively given
a natural language question and a database schema, can you produce
a syntactically correct SQL query that answers that question. An example looks something like this. Given this database schema
and given this question at the bottom, can we produce
that SQL query on the right? A classic problem, with lots of different attempts at it. What we did was follow the advice
that we've given you folks. We started
off with prompt engineering and RAG. If I just share
some of the different methods we used, just to get
into the details of what we tried, we started off
with the simplest possible RAG approach. We started with simple retrieval: just cosine similarity, using the question to find SQL queries which answered
similar questions basically. Do a similarity search with the question. We also tried formatting
the embedding differently. We tried a bunch of prompt engineering
just with a couple of isolated examples and our results were, as you'll see in a second, not super good. What we did was we thought
about this problem and we're like, "Actually, a question could have
a totally different answer if it has a different database schema. Doing a similarity search with a question
doesn't make a lot of sense for this problem, but using a hypothetical
answer to that question to search might give us
actually much better results for this problem." What we did was use
hypothetical document embeddings. We generated a hypothetical SQL query and then used that for the similarity search. We actually got a really large performance bump with that for this particular problem.
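A minimal sketch of that HyDE step for this task, with the vector store assumed rather than specified (names and prompts are illustrative):

```python
# Sketch of hypothetical document embeddings (HyDE) for text-to-SQL: instead
# of embedding the user's question, generate a hypothetical SQL answer and
# embed that, since the stored examples are themselves SQL queries.
from openai import OpenAI

client = OpenAI()

question = "How many singers are from each country?"

hypothetical_sql = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content":
        f"Write a plausible SQL query answering: {question}"}],
).choices[0].message.content

# Embed the hypothetical answer and search the example store with it,
# rather than with the raw question.
query_vector = client.embeddings.create(
    model="text-embedding-ada-002", input=[hypothetical_sql]
).data[0].embedding
# similar_examples = vector_store.search(query_vector, k=5)  # store assumed
```

We also tried contextual retrieval,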
which was just simple filtering: we ranked the hardness of the incoming question and then only brought back examples of equal hardness in our RAG, if you see what I mean. That got us slightly better improvements. We then tried
a couple of more advanced techniques. There's a couple of different things here. You could try chain of thought reasoning. You maybe try and get it to identify
the columns and then identify the tables and then build the query at the end. What we settled on
was actually fairly simple. We went with a self-consistency check: we got it to build a query, run the query, and then, if it messed up, we gave it the error message with a little bit of annotation and got it to try again. It was actually sort of funny getting GPT to fix itself, and it's something that we see work fairly well if you have a use case where latency or cost is not a huge concern.
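A rough sketch of that self-correction loop, using SQLite as a stand-in database; the database, prompts, and retry budget are illustrative.

```python
# Sketch of the retry loop described above: run the generated SQL, and on
# failure feed the error message back to the model for another attempt.
import sqlite3
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("concert_singer.db")  # hypothetical Spider-style database

def generate_sql(question: str, feedback: str = "") -> str:
    prompt = f"Write a SQLite query answering: {question}"
    if feedback:
        prompt += f"\nYour previous attempt failed with this error, fix it:\n{feedback}"
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def query_with_retries(question: str, max_attempts: int = 3):
    feedback = ""
    for _ in range(max_attempts):
        sql = generate_sql(question, feedback)
        try:
            return db.execute(sql).fetchall()
        except sqlite3.Error as err:
            feedback = str(err)  # hand the error back for the next attempt
    raise RuntimeError("Could not produce a runnable query")
```

The results we got looked something like this. I'm going to come over here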
to talk through this. On the far right was where we got
to with prompt engineering, not so great. We started off with 69%. Then we added few-shot examples and got
a couple of points of improvement. That told us that RAG could actually give us further improvement
here. We tried with the question and you can see
that we got a 3% performance bump. Then using the answer,
the hypothetical document embeddings, we got a further 5%, which is pretty cool. Actually, just by using
a hypothetical answer to search rather than the actual input question,
we got a massive performance bump over what we started with. Then all we did was just increase
the number of examples and we got up to four points shy
of the state of the art with this approach. Again, this is just
a couple of days hacking around, starting off with prompt engineering,
moving to RAG. Shows you just
how much performance you can squeeze out of these
very basic starting approaches. At that point we decided to turn
over to fine-tuning and see whether we could take it any further
and that's where I'll hand over to John. -For fine-tuning, we turned it over
to our preferred partners for fine-tuning at Scale AI. They started off by establishing
a baseline as we recommend. The same baseline that we saw
in the previous slide of 69%. This is just
with simple prompt engineering techniques. They then fine-tuned GPT-4
with simple prompt engineering techniques where you just reduce the schema
as it goes into the example. A very simple fine-tuned model, a little bit of prompt engineering, and they got all the way
up to close to 82%. We're now within striking distance
of state of the art. They then used RAG with that model
to essentially dynamically inject a few examples into the context window
based on just the question, actually. Not even very advanced RAG techniques. They got 83.5%, which got us
right within striking distance of state of the art. I think the thing I want
to highlight here is that if you look at the Spider leaderboard on the dev set,
the techniques used are very complex. There's a lot of data pre-processing,
a lot of data post-processing, often hard coding edge cases
into the script being used to actually evaluate the model. We didn't actually have
to use any of that here. Just like simple fine-tuning,
simple prompt engineering, just following the best practices,
and we really got within striking distance
of state of the art on this really well-known benchmark. It shows the power of fine-tuning
and RAG when combined. Just to recap, when you're working on a problem
and you want to improve your LLM's performance, start off
with prompt engineering techniques. These are very low investment. They allow you to iterate quickly
and they allow you to validate LLMs as a viable technique to approach
this problem that you're trying to solve. You iterate on the prompt until you hit
something like a performance plateau, and then you need to analyze
the type of errors that you're getting. If you need to introduce new knowledge
or more context to the model, go down the RAG road. If the model is inconsistently following
instructions or it needs to adhere to a strict or novel output structure,
or you just generally need to interact with the model in a more efficient manner,
it's maybe time to try fine-tuning. It's important to remember
that this process is not linear. That's really what we want to stress. It might take 49 iterations to get
to a point that you're really happy with, and you're going to be jumping back
and forth between these techniques on your journey. With that said, we hope you enjoyed this talk. Colin and I will be here
for the rest of the day if you have any questions. [applause] [music]