LLMOps (LLM Bootcamp)

Video Statistics and Information

Captions
Okay, welcome back. I'm really excited for this next one, because this topic is core to the whole ethos of Full Stack Deep Learning. We got started five years ago, in the last AI hype cycle, around that old-fashioned technique called deep learning, because Sergey and I observed that there were a lot of classes that taught you how to build things with neural networks, but not many that taught you how to make them actually work in the real world and get them into production. That's a philosophy we've carried through as we've developed other courses like this one. A lot of what we've talked about today and yesterday is how to think about building applications with language models; the focus of this lecture is the stuff around the edges that you need to think about to make these things work in production. As undeveloped and fast-moving as the space of building applications with language models is, the space of thinking about how to build real production systems with them is even less developed. So the flavor of this talk is going to be a grab bag of topics you should familiarize yourself with as you start building these applications. I'll try to give you high-level pointers about what to think about, some basic initial choices you can make, and where to go to learn more. It will feel like a bunch of assorted topics at the beginning, and at the end I'll try to tie it together into a first pass at a theory of how to think about LLMOps, which I haven't really shared with anyone yet, so I'm excited to get to that.

The first thing we need to do, if we're building an application on top of LLMs, is choose which LLM to build on. The TL;DR is that there's no single best LLM; the right one for your use case depends on a number of trade-offs: how much you care about out-of-the-box quality for your task, speed and latency of inference, cost, fine-tunability, data security, and license permissibility. Those trade-offs determine the right model for your use case, but the overall conclusion is that most of the time you should probably start with GPT-4, so don't overthink this.

One of the main questions people ask is whether to use proprietary models, like GPT-4 or Anthropic's, or open-source models. The way to think about it is that proprietary models today are better: they're higher quality, and many models billed as open source may not actually be usable as such because of licensing friction. Serving open-source models also creates a lot of problems that you don't have to deal with if you use a proprietary model: serving and training these large models is genuinely difficult, while calling an API is much easier. But for a lot of use cases you really do need open source: it's much easier to customize, the APIs you get from providers are quite limited, and, probably most importantly, it respects data security. Still, if you can use a proprietary model, you should.
If you do want to use an open-source model, one thing to think about is licensing. At a high level there are three types of licenses. Permissive licenses are what most of us are used to in open-source software: licenses like Apache 2.0 that basically let you do whatever you want with the software, or in this case the model. In the LLM world there are also restricted licenses, which place restrictions on commercial use but don't prohibit it entirely; for these, you'll need to draw your own conclusions about whether they work for your use case. Finally, a bunch of quote-unquote "open source" models are released under non-commercial licenses, which unfortunately rule out commercial use entirely; it's arguable whether you should even consider these open source at all.

Let's double-click on the proprietary contenders. What we have here is a list of some of the most important proprietary options, ranked on a few criteria: the number of parameters in the model, the size of the context window, what the model was trained on (I'll come back to that in a second), a subjective quality score, the speed of inference, and how fine-tunable the model is. The number of parameters is important to understand, especially in open source where it's more widely known, because the parameter count and how much data the model was trained on are decent proxies for how high quality the model is going to be. The context window matters because, as we spent a long time on yesterday and this morning, the amount of data you can put into the model's context plays a big role in how useful it is for your downstream applications.

The training column refers to the type of data the model was trained on; I list four types. "Diverse" means massive-scale internet data, and it matters because it's why language models exist to begin with: if you train a large model on more or less all of the internet, what you get is roughly a GPT-3-quality model, as of around 2019.
To get up to modern GPT-4 or GPT-3.5 quality, you also need other data sources. The most important is code; code seems to be a really critical ingredient in why these models perform well. You also need instructions, meaning examples of what a human wanted and what a good response to that instruction looks like, and you need human feedback. Very few models available today are trained on all four of these data sources, and even the ones trained on more than internet scrapes tend to carry more license restrictions.

The next column is quality. How do you assess the quality of a large language model? There are benchmarks you can look at, and they're helpful for a coarse-grained sense of how well a model works, but there's no substitute for trying the models out and playing with them, in tools like nat.dev, which Charles and Sergey mentioned earlier, and getting a feel for them on your own task. The quality score here is, let's say, a proprietary subjective quality score, an algorithm secretly developed by the organizers of Full Stack Deep Learning. What I mean by that is that I played around with all of these models pretty extensively, and this is my overall feeling for how high quality they are. Speed and fine-tunability are more self-explanatory.

Going through the options: GPT-4 is the highest-quality large language model on the market today; nothing else is in the same quality category, so if the restrictions we talked about don't apply to you, this is what you should be using. If you want something faster or cheaper than GPT-4 but still really high quality, GPT-3.5, which powers the original ChatGPT, is an extremely strong model that is second only to GPT-4 and significantly faster and cheaper. The best model on the market outside of the GPT family, according to our proprietary algorithm, is Claude from Anthropic. One reason it's so high quality is that it's one of the only other models on the market trained on the full gamut of useful data types; it's probably comparable in quality to GPT-3.5, and which one is better will depend on the specifics of your use case and what kind of output you prioritize. If you want a really high-quality model that's still fine-tunable, I think the best option is the largest model from Cohere. Part of Cohere's strategy has been to emphasize making their models, even the larger ones, fine-tunable, whereas OpenAI and Anthropic have moved away from that. Cohere's model is not trained with reinforcement learning from human feedback, so in my view its quality isn't quite as high as the best models from OpenAI or Anthropic, but it's still pretty good, and you can fine-tune it. The rest of the models on this list trade off quality in the name of speed and cost, so if you want something cheaper or faster than the models at the top, these three are good options.
Among these, I think the best is the offering from Anthropic. The reason is that most of the other fast, cheap models are trained on simpler data sets, whereas Anthropic has done the full training regimen on their smaller models as well. But if you're already using OpenAI or Cohere, their cheaper models are perfectly adequate too. So that's an overview of the proprietary options for large language models.

You can also pick from different open-source options, and here we'll use a slightly different set of criteria: we'll still look at parameter count, context window, what the models were trained on, and our secret quality score, but we'll also look at license permissibility. The way to read this is that green licenses let you use the model for whatever you want; yellow means you need to make up your own mind about whether the restrictions work for your use case; and red means no commercial use at all, so those models are really just for tinkering, not for any production use. The other thing to note is that for a lot of these models there are two options listed, for example T5 and Flan-T5. The way many open-source models are released is that the base model, trained on internet data, comes out separately from a fine-tune of that model on instruction data; the first name is the base model, the second is the instruction-tuned version, and the quality score refers to each separately.

Going through these: T5 and Flan-T5 are your best bets if you want a truly foolproof, permissive license and decent-quality results. I'd say Flan-T5 isn't quite at the quality level of Cohere's best offerings, or of the best semi-open-source models at the bottom of this list, but it's pretty decent and better than most of the other Apache-licensed options. Pythia, and fine-tunes of it like Dolly, have made a pretty big splash recently; this came out in just the last few weeks, so we're still tuning our quality algorithm on it, but the early reputation is that it's really high quality. Unfortunately the fine-tunes of Pythia, at least the ones I know about, don't have licenses that let you use them for anything useful. An even more recent option, out this past week from Stability AI, is StableLM, which also has an instruction-tuned version. It's too early to say much, but it's likely a good alternative to Pythia and Dolly, and they're planning to release larger versions as well as RLHF-trained versions going forward, so it's a good offering to pay attention to. LLaMA, and its fine-tunes like Alpaca, Vicuna, and Koala, are really the ecosystem pick right now; LLaMA was one of the first models where people could get their hands on weights of reasonably high quality.
The community has built a lot of interesting fine-tunes and applications on top of LLaMA, so if you're really looking for something to tinker with that's well supported by the community, it's probably your best bet. But these all have restricted licenses, so if you're building for production I wouldn't touch them. Lastly, OPT is worth mentioning: it's an older model, trained in the way most similar to the original GPT-3, so if what you want is to do research on a model that looks like that, you can pull OPT, but it's not as high quality as the other options. You'll probably also hear about models like BLOOM and GLM; in my opinion these aren't really worth considering: the quality just isn't very good, and the licenses aren't very helpful either. So that's the quick overview of the available open-source models.

A couple of notes on assessing the performance of LLMs. Really, the only way to know which LLM will work best is to evaluate it on your own task, and we'll talk a little later about how to do that effectively. I'm not saying don't look at benchmarks, but they can be misleading, because it's genuinely difficult to assess models that are designed to do so many different things.

Some recommendations. For most projects, you should probably start with GPT-4. It's the fastest way to build a proof of concept, and the most reliable way to tell whether your task is even feasible: if you can't solve it with GPT-4, you're probably not going to solve it with a simpler model, at least not without a lot of extra effort like fine-tuning. The metaphor here is the common engineering adage that you should prototype in whatever language is fastest for you to develop in; a lot of teams prototype in Python and then move to a lower-level language like C or C++ once they know the thing works and they just need to squeeze out more performance. If cost or latency is a factor, consider downsizing below GPT-4: GPT-3.5 and Claude are both great, and you can go even faster and cheaper than that. Among the commercial providers, Cohere is the best for fine-tuning. My expectation at this point is that by the end of this year open source will probably catch up to roughly GPT-3.5-level performance, but today I would only recommend using open source if you truly need it.
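To make that concrete, a minimal sketch of a provider-agnostic wrapper might look like the following, so that "prototype on GPT-4, downsize later" is a configuration change rather than a rewrite. The helper names here (LLMConfig, complete) are made up for illustration, and the OpenAI call uses the pre-1.0 SDK style, so check your SDK version for the exact signature.

```python
# Sketch: keep the base model a config choice, not a code change.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class LLMConfig:
    provider: str            # e.g. "openai" or "local"
    model: str               # e.g. "gpt-4", "gpt-3.5-turbo"
    temperature: float = 0.0

def _call_openai(cfg: LLMConfig, prompt: str) -> str:
    import openai  # pip install "openai<1.0"; newer SDK versions use a different call style
    resp = openai.ChatCompletion.create(
        model=cfg.model,
        messages=[{"role": "user", "content": prompt}],
        temperature=cfg.temperature,
    )
    return resp["choices"][0]["message"]["content"]

def _call_local(cfg: LLMConfig, prompt: str) -> str:
    # Stand-in for an open-source model behind your own inference server.
    raise NotImplementedError("wire up your own serving endpoint here")

_PROVIDERS: Dict[str, Callable[[LLMConfig, str], str]] = {
    "openai": _call_openai,
    "local": _call_local,
}

def complete(cfg: LLMConfig, prompt: str) -> str:
    """Single entry point the rest of the app calls, regardless of provider."""
    return _PROVIDERS[cfg.provider](cfg, prompt)

# Start the proof of concept on the strongest model...
poc = LLMConfig(provider="openai", model="gpt-4")
# ...and downsizing for cost or latency later is just a different config.
cheaper = LLMConfig(provider="openai", model="gpt-3.5-turbo")
```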
Great. Next topic: you've chosen your base model, and now you're going to develop the prompts for your task. How can you iterate on prompts in a way that's a bit more engineering-minded than just writing in a text file? As you work on your prompts and chains, how do you save your work?

First, why does this even matter? Think back to traditional deep learning, all the way back in 2015, for those of you who were in the field then. What it felt like was that every time I trained a model, I'd write down the hyperparameters I used in a spreadsheet and save the trained model in a file somewhere on my laptop. There was no way for me to reproduce what I'd done before, no way to share my work with the team, and all of this just got lost and was incredibly hard to keep track of over time. So there was a lot of lost work, rework, and things that were simply impossible to collaborate on within an organization. Now, the way companies build deep learning systems is that every time you run your model.train call, it's instrumented and you automatically get a log of the experiment run with all of its hyperparameters. Those runs are comparable with each other, shareable with other folks on your team, and fully reproducible. So every time you call model.train, you're not worried about whether you'll be able to find that training run again in a month, or whether, if it actually worked, you'll have to do a ton of extra work just to convince your team of it.

Prompt engineering today feels a little like deep learning did in 2015. Every time I change my prompts, a lot of the time I'm just playing around in the playground. The old prompts I ran are lost to time, because I don't have a great way of keeping track of them, and there's no way for me to reproduce an experiment I ran before, share the prompt with the team, or anything like that. So it feels like we're missing some tooling to make prompt engineering feel more like engineering and less like ad hoc experimentation. What should that tooling look like?

A reasonable question to ask is whether prompt engineering even needs this. Is it the same as deep learning? Do we need advanced tools to keep track of this stuff? I want to talk a little about why experiment management was so impactful in the deep learning world. The core reason is that you constantly need to go back and check your old experiments, and in deep learning there are two reasons for that. First, experiments take a long time to run; training a model can take hours, days, even weeks, so it's important to be able to shut your laptop, walk away, come back, and know you can refresh your state on what's happening. On top of that, you often run many experiments in parallel, like hyperparameter sweeps or different model architectures, so there's a lot more to keep track of, and if you're not careful you end up repeating yourself and running the same experiments over and over. I caught myself doing that many times back when I was training a lot of models.

Prompt engineering today doesn't really have the same dynamics that made experiment management so impactful in deep learning. Experiments are quick to run; it feels more like writing code than training a model. You change a few words in your prompt, rerun the thing, and it takes maybe a few seconds, not hours. Experimentation is also usually sequential: normally, when I'm iterating on prompts, I make a change, rerun, see if it fixes my problem, go back and change the prompt again, following that back-and-forth loop. I'm not often running a lot of experiments in parallel.
And most of the time, the amount of experimentation you do is pretty limited: you're changing a few things in your prompt here and there, maybe trying a different model parameter, but, at least from what I see, most folks aren't trying thousands and thousands of different prompt variations. One thing I'll note is that I don't think this will be true forever. A core reason experimentation is so sequential in prompt engineering is that we don't really have a great way to evaluate new prompts; if you could automatically tell whether a change to your prompt is better than the old version, then you could parallelize prompt experimentation, and that would change this dynamic. So the overall conclusion is that you probably don't need advanced tools for this today, but if evaluation improves, this might become really important.

I'll describe three levels of keeping track of your experiments with prompts and chains. Level one is doing nothing: make your prompts in the OpenAI playground, and if they seem to work, copy and paste them into your file and you're good to go. That's fine for a P0, but I don't think it's quite enough for building real applications. Level two is tracking your prompts in git, which is honestly what most teams should be doing right now: it's super easy and fits into the workflow you already have. In certain cases, though, you might want something more advanced, like a specialized tool for tracking prompts. The places where that starts to make sense are if you're running a lot of parallel evaluations, since git isn't a great tool for keeping track of many parallel experiments; if you need, or want, to decouple changes to your prompts from deploys; or if you want to involve internal non-technical stakeholders in the prompt iteration process. Your product manager probably doesn't appreciate having to use git to tweak a couple of words in a prompt, and ChatGPT has let the cat out of the bag: product managers now know they can sometimes make good changes to prompts, so they may expect to be able to.

If you do go for a specialized prompt-tracking tool, what should you look for? First, it should be decoupled from git; if it isn't, I don't think it adds much value, and you should just use git. Beyond that, a really desirable feature is the ability to take the prompts you're iterating on locally and execute them not only in code but also in a UI. The reason that's valuable is, again, your product manager who wants to contribute to making prompts better: if you create an initial version of a prompt and they have a UI where they can interact with it, try changing words around, and convince themselves that you actually came up with the right answer to begin with, that's going to make your life a lot easier. I also think it's really helpful if the tool can be connected to visualizations of the executions that people actually interacted with in the product.
I'll come back to why that matters. There's a lot of movement in this space right now; in the past week alone, tools have been announced by all of the major traditional ML experiment-tracking providers: Weights & Biases, Comet, and MLflow. I'm expecting a lot more movement here, so this is a place where the tools will look totally different a year from now than they do today. Concrete recommendations on prompt management: I would just manage your prompts and chains in git. But if you have a lot of interaction and collaboration with non-technical stakeholders, or you're starting to really focus on automating your evaluation, then it's worth building your own experiment-management tool, trying out one of the offerings from the providers I just mentioned, or simply keeping an eye on this space, since I imagine a bunch of new products will launch here soon.
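As a sketch of what level two can look like in practice, prompt templates can simply live as files in the repository and be loaded by name; the file layout and helpers below are illustrative, not any particular library's API.

```python
# Prompt templates live in the repo (e.g. prompts/summarize_v2.txt), so every edit
# is a commit you can diff, review, and roll back, and every deploy pins an exact version.
from pathlib import Path

PROMPT_DIR = Path("prompts")

def load_prompt(name: str) -> str:
    """Read a prompt template tracked in git, e.g. load_prompt('summarize_v2')."""
    return (PROMPT_DIR / f"{name}.txt").read_text()

def render(template: str, **variables: str) -> str:
    """Fill str.format-style placeholders like {document} or {audience}."""
    return template.format(**variables)

prompt = render(load_prompt("summarize_v2"), document="<doc text>", audience="executives")
```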
So the next question is: okay, I've made a change. Maybe OpenAI updated their base model, or I've changed a prompt, tweaked the language, or changed how I put information into the context. How do I know whether that actually worked, whether it's actually better? Why does this matter? LLMs make tons of mistakes; that's the nature of the technology we're all building with now. Just because your new prompt looks better on the handful of cherry-picked examples that got your product manager excited does not mean it's better in general, and it's super common to make the prompt better in one way while degrading performance in others. If you're not measuring performance on a wide range of data that represents what your end users will ultimately put in, you're going to miss regressions in performance. This matters in particular on the human side of building applications with LLMs, where a really important component of user retention for AI-powered applications is trust. It's easy to blow people away with a cool new AI tool, and it's easy to attract users these days because everyone is looking for tools to try, but at the end of the day, if they get the wrong answer often enough, if they don't trust the output you're producing, they'll go back to the tools they were using before.

As I did in the last section, I want to compare this to how testing machine learning models worked in the pre-LLM, pre-generative era. There, you start with data from your training distribution. You have a training set, the data you actually train your model on, and you compute a single metric on it, say accuracy. Then you hold out some data from that same distribution and evaluate the trained model on it, which gives you another measurement of the same metric. The difference between those two measurements is overfitting: a measure of how much your model has just memorized the exact training set you produced. Then you have other data coming from production, which may come from a different distribution than the one your model was trained on. You hold out some of that production data to test on and compute the same metric again; the difference between your accuracy on that production test set and on the held-out set from your training distribution is a measure of domain shift, that is, how much worse your model gets when it moves outside the training distribution. Finally, you can measure accuracy on the live production data itself, and the difference between the held-out production evaluation and the live production evaluation is a measure you can think of as drift: how much worse the model has gotten on the production distribution.

Why doesn't this work for LLMs? First, you're calling an API from OpenAI; they don't even tell you what data they train on, much less give you access to it, so there's no way for you to know what your training data is. And, as a corollary, when you deploy in production your production distribution is always going to differ from the training distribution, no matter what; you're never evaluating on the same data OpenAI trained on. Second, in traditional ML the metrics you compute are deterministic. A classic problem is classification: is this image a cat or a dog? You measure it with a metric like accuracy, comparing predictions to ground-truth labels and counting how often the model got it right. In generative ML, the predictions are text. Say the prediction is "this is an image of a tabby cat" but the label is "photograph of cat": is that a good output or a bad one? It's genuinely hard to say, so an additional challenge with generative models is what metric to even look at. A final challenge is that traditional ML is about task-specific performance, like telling cats from dogs, while language models are often built into much more general-purpose systems. Even if you can measure the accuracy of a question-answering system, your users might ask about startups, dogs, food, or physics, and performance might be totally different across those subjects. Is it really fair to summarize the model's performance with a single metric, or do you need to capture a more diverse picture of the behaviors the model is supposed to perform well on? To summarize: your model is trained on the internet, so there's always drift; the output is qualitative, so it's hard to put a single number on success; and it's supposed to handle a diverse set of behaviors, so aggregate metrics don't work. So let's talk about how you can actually think about testing language models.
I think there are two key questions to ask yourself here: first, what data do you test on, and second, what metric do you compute on that data.

Here's how I think you should approach building an evaluation set for your task. There are roughly four parts to it. First, start incrementally: begin building the evaluation set from the very beginning, as you're prototyping. Second, since we're working with this magic technology, you can ask your language model to help you out. Third, as you roll out to a broader set of users, treat building the evaluation set as an ongoing process: keep adding data as you discover new failure modes and new patterns in the model's behavior. And lastly, since all of this feels very ad hoc, is there any way we can formalize it?

Start incrementally. The first thing you'll do as you play around with a language model is evaluate it ad hoc. Say you write a prompt that asks for a short story about a given subject; you might try the subject "dogs", then "LinkedIn", then "hats", getting an informal feel for whether the model works across different inputs. As you find interesting examples during this exploration, start organizing them into a small dataset; then the next time you change your model, rather than just playing around with scattered examples, you run it against every example in that growing dataset. What counts as an interesting example? Two heuristics. Hard examples are interesting: if you find an input the model really doesn't handle well, add it, because you want to make sure future versions improve on it. And different examples are interesting: if you, or your users, use the model for something totally unlike the rest of the data in your dataset, that's also worth adding.

Second, you can use your language model to help. Amazing fact: language models can help you generate test cases. There's a cool open-source library by an FSDL alum called auto-evaluator that uses this approach for question-answering data. The way it works is that you create a prompt that takes into account the task you're trying to solve, and then you ask the language model to generate diverse examples of input-output pairs for that task. It can help you bootstrap test cases much faster.
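A hedged sketch of that bootstrapping idea, assuming `complete` is any prompt-in, text-out function (such as the wrapper sketched earlier) and with prompt wording that is purely illustrative:

```python
import json
from typing import Callable, Dict, List

GENERATE_CASES_PROMPT = """You are helping build an evaluation set for a question-answering system.
Given the document below, write {n} diverse question/answer pairs a real user might ask about it.
Return only JSON: a list of objects with "question" and "answer" fields.

Document:
{document}
"""

def generate_test_cases(complete: Callable[[str], str],
                        document: str, n: int = 5) -> List[Dict[str, str]]:
    raw = complete(GENERATE_CASES_PROMPT.format(n=n, document=document))
    try:
        return json.loads(raw)   # models don't always return valid JSON...
    except json.JSONDecodeError:
        return []                # ...so treat parse failures as "no cases this round"

# Hypothetical usage: eval_set = generate_test_cases(my_llm_call, open("docs/faq.md").read())
```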
Third, as you roll this out to more and more users, keep adding data to your evaluation set. Some heuristics for what to add: what do your users dislike? If you have annotators, think of them as users too, and ask what they dislike. You can also ask another model, that is, do self-critique: write a prompt that assesses the quality of your model's outputs, and add the outputs that the second model doesn't like to your dataset. Those are ways of finding hard data; you can also add different data, like data that's an outlier relative to your current evaluation set according to some metric, or topics, intents, or documents that your users are producing but that are underrepresented in your eval set today.

So those are some intuitions for building your evaluation set incrementally as you build your model. Now I want to talk about an idea I'd put in the category of speculation rather than accepted practice: is there any way to make this feel more quantitative, to actually quantify the quality of your test set? One way to think about it is the notion of test coverage. In software engineering, a metric teams often look at is test coverage, basically the percentage of lines in your codebase that have an associated unit test. Is there anything analogous we can come up with in ML? I think the analogue of the lines of code you're testing is the data points in your production distribution. A good evaluation set, by this intuition, is one with good coverage of the kinds of things your users are actually trying to do with the system. A low-coverage evaluation set is one where a lot of your production data, the inputs users send as they interact with the product, isn't anywhere close in embedding space to the examples in your test set; a high-coverage set is one where pretty much anything users send in production lives near one of your test cases in embedding space. There are many ways to formalize that mathematically, but that's the intuition.

One note for those coming from a more traditional ML background: you might be familiar with the notion of distribution shift. Test coverage and distribution shift are analogous. Distribution shift measures how far your test data is from some reference distribution, and you use it to see whether your data is changing relative to that reference; test coverage measures how well your evaluation data covers your production data, which is a kind of dual notion, and it's used for something different: helping you find more useful evaluation data to add to your set. Is this enough? I think the main ways a notion like this would fail are if you have great coverage but hard data is underrepresented, or if you have great coverage but your metrics don't really matter, meaning they have nothing to do with how your users react to the outputs in the real world.
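One way to make the coverage intuition concrete is to embed a sample of production inputs and the eval examples, then ask what fraction of production points land near an eval case. The embedding model and distance threshold below are assumptions to be chosen per application; this is an illustration of the speculative idea, not an established metric.

```python
import numpy as np

def test_coverage(prod_embs: np.ndarray, eval_embs: np.ndarray,
                  threshold: float = 0.25) -> float:
    """Fraction of production inputs whose nearest eval example is within `threshold`
    cosine distance. Both arrays are (n, d) and assumed L2-normalized row-wise."""
    sims = prod_embs @ eval_embs.T          # (n_prod, n_eval) cosine similarities
    nearest_dist = 1.0 - sims.max(axis=1)   # distance from each prod point to its closest eval case
    return float((nearest_dist <= threshold).mean())

# Low coverage means a lot of production traffic looks nothing like your tests;
# the uncovered production points are good candidates to add to the eval set.
```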
So I think you would also need another notion, something like test reliability, that measures the difference between online and offline performance. Again, this part is a little speculative, but hopefully it helps build some intuition about what makes a good evaluation set as you start building your model.

The next thing I want to talk about is evaluation metrics for language models. The key idea is that there are plenty of quantitative metrics you can define for LLMs if you know what the right answer is; if you don't, that is, if the correct answer is subjective, the main technique in the toolkit is to define a prompt that asks another model whether this is a good answer to the question. There are different ways of setting that up. If there is a correct answer to your problem, you can just compute normal metrics like accuracy, as in regular ML. If there's no correct answer but you have a reference answer, an example of what one good answer looks like, you can use reference-matching metrics. If you don't have a reference answer but you do have a previous answer, say from the previous version of your model, you can use "which is better" metrics. If you don't have a previous answer but you do have human feedback, you can use "is the feedback incorporated" metrics. And if you have none of that, you can still define static metrics on the output itself.

To go through a few of these: there are regular evaluation metrics like accuracy. Reference-matching metrics include things like semantic similarity, or you can ask another language model whether two answers are factually consistent. If you have answers from two different versions of the model, you can ask a language model which of the two is better according to whatever criteria you want. If you have feedback on the model, you can ask a language model whether the new answer incorporates the feedback given on the previous version. And if you have none of that, you can do two things: verify that the output has the right structure, for example that a model expected to produce JSON actually produced well-formed JSON, or ask a model to grade the answer on a scale of one to five, something like that.

So, can you evaluate language models automatically? If you could, it would be really powerful: it would unlock parallel experimentation and help us move a lot faster. But the reality today is that you still need to do some manual checks, and as you do them you should also be gathering feedback from the people doing the checking; that kind of feedback is analogous to the feedback you'll gather in production.
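A hedged sketch of two of these: a "which is better" comparison judged by another model, and a static structure check. The judging prompt wording is illustrative, `complete` is again assumed to be any prompt-in, text-out function, and a real setup should also guard against things like answer-ordering bias.

```python
import json
from typing import Callable

JUDGE_PROMPT = """Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Which answer is better? Reply with exactly one character: A or B."""

def which_is_better(complete: Callable[[str], str],
                    question: str, answer_a: str, answer_b: str) -> str:
    """Ask a judge model to compare two answers; returns 'A', 'B', or 'unparseable'."""
    verdict = complete(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b)).strip().upper()
    return verdict if verdict in ("A", "B") else "unparseable"

def has_valid_structure(output: str) -> bool:
    """Static check when you have no reference at all: if the task demands JSON, verify it parses."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```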
The next topic is deployment, and it's going to be the lightest one here, because if you're just using an LLM API, most of the time this isn't really a hard problem: you can just call the API from your front end. Even with an LLM API, things can get more complicated if you have a lot of logic behind the API call, like complicated methods defining your prompts or a messy chain; in that case you might want to isolate the LLM logic as a separate service. Deploying open-source LLMs is a whole other thing that's beyond the scope of what we're trying to do today. There are a couple of references I'd recommend below, including the deployment lecture from our previous course and a great blog post from Replit covering how they built their own language models; and we have Reza from Replit coming in to talk this evening, so he'll probably cover their approach as well.

The one other topic worth touching on with respect to deploying language models is that, when you're running the model in production, there are techniques you can use to improve the quality of its outputs. A few ideas to be aware of. First, self-critique: you ask a second language model to critique the output of the first, which often leads to a better answer, in a way that's analogous to chain-of-thought prompting; there's a really good open-source library called guardrails that has some of this functionality built in. Second, rather than sampling once from the LLM, you can sample many outputs and choose the best one; there's a whole spectrum of techniques for doing that. Third, you can sample many times and, even if you don't know which output is best, average them together in an ensemble. Again, this just scratches the surface, but these are all ways that, in production, if you really care about quality and reliability, you can trade some additional cost and latency for reliability by adding more language model calls to your chain.
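A minimal sketch of the sample-many-and-pick-best idea: draw several completions at a nonzero temperature and keep the one a scoring function prefers. Both callables are assumptions here; `sample` is any LLM call made with sampling turned on (for instance through the wrapper sketched earlier), and `score` could be a self-critique prompt, a guardrails-style validator, or a task-specific heuristic.

```python
from typing import Callable, List

def best_of_n(sample: Callable[[str], str], score: Callable[[str], float],
              prompt: str, n: int = 5) -> str:
    """Trade extra cost and latency for reliability: draw n samples, return the top-scoring one.
    `sample` should call the LLM with a nonzero temperature so the candidates actually differ."""
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical wiring: best_of_n(lambda p: complete(sampling_cfg, p), score=my_scorer, prompt=user_prompt)
```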
The next topic is monitoring. You've deployed your model; how do you know whether it's actually solving the problem your end users need solved in production? The most important signals to look at when monitoring any machine learning model are the outcomes you're aiming for: are your end users happy, and is the model helping them solve the tasks they want to solve? If you don't have access to those, or you want additional signals, model performance metrics can also be really useful. You can define proxy metrics; for example, you might notice that users tend to prefer shorter responses to longer ones, so you monitor response length. And lastly, you can measure the things that actually tend to go wrong with these models in production.

Let's double-click on a couple of these. On gathering feedback from users: users are, frankly, lazy, and they won't give you feedback unless you make it really easy for them. Good feedback collection is as low-friction and as high-signal as possible. The best case is when it's already part of the user's workflow; accepting changes, or giving a thumbs up or thumbs down, are also relatively lightweight ways to gather feedback. You can also ask users for longer-form feedback, like "tell me why this answer is wrong"; that can be worth doing, because even if not many users respond, the ones who do can give you really high-quality signal.

Next, the things that actually tend to go wrong with LLMs in production. Most commonly, the issue has nothing to do with the language model itself; we're building a product, and the most common issues are with the user interface. Latency is an especially common one, and it can be hard to detect, because users often don't know what they're missing with lower latency. Other common problems are incorrect answers and hallucinations; long-winded answers; and, with RLHF-trained models, a tendency to dodge questions with responses like "I'm just a language model, I can't answer that", which users mostly don't like. Then there are prompt-injection attacks and toxicity or profanity. For each of these, there are signals you can monitor to tell whether things are going wrong.

The last topic: we've gotten the model into production and we're monitoring it; now we need to figure out what's actually going wrong and how to improve the model based on the signals we're getting from users. At a high level, user feedback can be used in two ways: to make the prompt better, and to fine-tune the model. Using feedback to improve the prompt works roughly like this: you look for themes in the feedback, meaning problems users are having with the model. Most commonly these days, people just read through the feedback and manually categorize what's working and what isn't; then you do more prompt engineering, or change the context, to adjust the prompt to respond to those themes. There's a big open question about how much of this can be automated: can you automatically surface the themes in your end-user feedback, and can you even automatically change the prompts to account for them? I think we'll see a lot of exploration of those questions in the coming months.

I wasn't going to talk about fine-tuning LLMs, but I heard so many questions about it that I put together some slides; now I'm running low on time, so I'll mostly skip it, and maybe we can come back to it if there are a few minutes, or folks can ask about it. At a high level, there are two main approaches: supervised fine-tuning, which is actually starting to get relatively easy thanks to techniques like low-rank adaptation (LoRA) that make fine-tuning faster, and fine-tuning directly from human feedback, which today is still really difficult, and not many organizations are doing it on their own.
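For reference, here is a deliberately minimal sketch of supervised fine-tuning with LoRA via the Hugging Face peft library. The base model, toy dataset, and hyperparameters are placeholders, and a real run would use instruction/response pairs mined from your interaction data; exact APIs vary a bit across library versions.

```python
# pip install torch transformers peft datasets accelerate
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType

base = "EleutherAI/pythia-1.4b"            # placeholder: any causal LM you can serve yourself
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small low-rank adapter matrices instead of all of the base weights.
model = get_peft_model(model, LoraConfig(task_type=TaskType.CAUSAL_LM,
                                         r=8, lora_alpha=16, lora_dropout=0.05))
model.print_trainable_parameters()          # typically well under 1% of the parameters

# Toy stand-in for the instruction/response pairs you would mine from interaction data.
pairs = [{"text": "Instruction: say hi\nResponse: Hello!"}]

def tokenize(example):
    ids = tokenizer(example["text"], truncation=True, max_length=512)
    ids["labels"] = ids["input_ids"].copy()  # causal LM objective: predict the same tokens
    return ids

train_dataset = Dataset.from_list(pairs).map(tokenize, remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=1e-4),
    train_dataset=train_dataset,
).train()
```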
So, to conclude. The style of lecture I wanted to give here reflects the fact that this is a rapidly evolving field with no real best practices yet, so I wanted to give you a tour of some of the main questions you'll run into as you build these applications and point you to easy first steps and places to learn more. That said, I'm optimistic that there is a way to formalize the process of building production applications with LLMs in a much more structured way, and I wanted to take a first pass at what that might look like. The way I've started to think about a more formal process for developing LLM applications is as test-driven, or behavior-driven, development.

What do I mean by that? We start with the prompt-development or chain-development workflow: as the model developer, I'm trying to build a chain that solves a task. I start from the base LLM I chose, iterate on my prompts and my chain, and when I have something I'm happy with, I test it and deploy it. Once it's deployed, I start gathering feedback from end users, and in this initial phase of the project I am the end user: I'm developing this by myself, I haven't even shared it with my team yet, but I'm still providing feedback on how the model is doing. From that user feedback we get a stream of interaction data; right now it's just coming from me, but over time it will come from a broader and broader set of users. That interaction data feeds a logging and monitoring workflow: we take all of it, the questions users are asking the system and the inputs they're putting into the prompts, and use it to identify themes in the feedback. Where are users having trouble with the system? Where is it going wrong? From those themes we extract test data, by pulling in the examples whose outputs users didn't like, or perhaps by asking a language model to help generate additional examples like them. That test data feeds the testing process for future versions of the prompts, and we move back to the prompt-iteration stage, where we try to make the prompt better in response to this feedback, now with a broader, more enriched set of data to test the model on. Finally, as this gets more complex and we accumulate more interaction data, we can optionally add a fine-tuning step to the workflow: from the interaction data we extract not just evaluation data but training data, fine-tune the model on it, and loop back to the start with a new base LLM, which, it's worth noting, will require us to revisit how we did the prompting.

So this is the overall test-driven development workflow: you make changes to a prompt, you roll those changes out to your users, the users give you feedback on whether it's working, you synthesize that feedback, take the themes from it, use them to generate more test cases, and create a virtuous cycle.
Your tests get more and more robust over time as you gather more interaction data from your end users, and the process repeats. I go through this loop as an individual developer when I'm working on a model by myself and I'm the only end user. Then, as I get confident with the model and think it might be ready to go out into production, I share it with my team, and we go through the exact same process, except now I'm not the only user: my team are users too, they're still generating interaction data, and I'm still using that data to make the model better. Finally, once we're ready to roll it out, we keep doing the same iteration process, but with our end users in the loop. So that's an initial pass at how you might think about a more systematic way of developing LLM applications. I'll wrap there. Thank you.
Info
Channel: The Full Stack
Views: 56,587
Keywords: deep learning, machine learning, mlops, ai
Id: Fquj2u7ay40
Length: 49min 10sec (2950 seconds)
Published: Thu May 11 2023