Fine-Tuning LLMs: Best Practices and When to Go Small // Mark Kim-Huang // MLOps Meetup #124

Captions
Thanks for coming, everyone. It's been a while since our last one, and I'm excited everybody's here. We have a very exciting session today. Obviously, in the large language model space things are getting huge, people are very excited, there's a lot to do and a lot to see, and we have somebody who's really a professional in the space. His name is Mark, co-founder and head of AI at Primo, and we're going to learn a lot about fine-tuning large language models: the challenges, how to get started in this space, and how to advance in that world. I'm going to drop a link in the chat to an event that's coming up. If you're interested in large language models and all of these kinds of tools, we're going to have a full event, LLMs in Production, with over 50 speakers. It'll be fantastic, and you'll hopefully get to see Demetrios play a little bit of guitar. So check that out and register as soon as you can. Without further ado, we'll bring Mark on. Thanks for coming.

Hey Ben, thanks for having me on. Really excited to chat with the community and share knowledge about best practices and what we do at our company.

Awesome. I'm going to let you jump right into it, because unfortunately we're 10 minutes under schedule today, so we're going to have a 50-minute session. I'll jump in if there are any questions from the audience, and then I'll come back on at the end and we can chat about what you taught us.

Sure. So obviously everybody came here to learn how to use the OpenAI APIs, right? I'm just kidding. The talk today is about how you can fine-tune large language models, and the implication is that you'll be using open source base models. I think there's a lot of momentum in the community to understand what the best practices are, and also how to contextualize and understand when the smaller, more traditional models might actually be more appropriate for your use case. So I'll just get into it. At Primo, what we do is make it really easy to build quality fine-tuned models, and our mission is to help democratize the accessibility of these models and allow people to ship AI-powered applications.

I'll start off by contextualizing the approaches for embedding large language models, or language models generally, into AI-powered applications. The first set of models I want to talk about are closed source models. Everybody's pretty well aware of these: you have OpenAI with GPT-4, you have Cohere, you have Anthropic's Claude, and Google has just thrown its hat into the ring with PaLM 2. These closed source models are typically accessible only through endpoints and APIs, where they provide the completions or the embeddings for you, and they tend to be orders of magnitude larger than most other models in the open community. The second set of models you'd probably try to embed into your applications are fine-tuned open source models. It's been quite exciting over the last one to three months to see how fast the community has been jumping on fine-tuning a lot of base models. Two examples that have recently come about are StarCoder, a collaborative effort between ServiceNow and Hugging Face in the BigCode group, where they released both
the base model and the StarCoder fine-tuned coding model, and Replit, which just released its own fine-tuned code model as well. The final set of models you'd probably approach are small, specialized models. I like to joke that they're the Sesame Street of models, because you have BERT and then all of its descendants. These are a lot smaller and have very specific use cases. The focus of today's talk, since everybody came here to hear about fine-tuning, will mostly be the second set of models.

Let's go into a bit more detail on the closed source models. We have a lot of users who come to us, fairly new to the field, and ask: why can't I just use OpenAI and have a GPT for medical, a GPT for pharma, a GPT for X? They want to start with the best-in-class models and see where to go from there. The unfortunate part is that the only OpenAI models available for fine-tuning are the older base models. This is taken directly from their fine-tuning documentation: the only available models come in four different flavors, in different sizes, and they are all just GPT-3 variants.

Beyond that restriction, why would you want a custom large language model over a closed source model? The three main aspects to think about are, first, model ownership: practically speaking, you own the company's intellectual property within the model, and you control the SLAs the model sits behind. A lot of times, if OpenAI's endpoints are down, or Cohere's, or any model provider's, you won't be able to meet your SLAs because you're routing through and sitting behind their endpoints. So model ownership, and being able to run the model in production, are pretty important for delivering an AI-powered application. The second aspect is domain expertise. As with any person or any enterprise, it's all about being a subject matter expert, and a custom fine-tuned LLM gives you that expertise on the set of tasks you actually care about, the ones that drive your company's business value. The final bit is obvious but probably one of the most important: security and privacy. In a lot of sensitive environments it just isn't possible to move data around. In healthcare, for example, HIPAA compliance precludes you from sending data to some external endpoint where you can't govern the data the model is trained on or the completions it creates. Particularly in finance and healthcare, you need custom models to ensure security and privacy.

Finally, where do small models actually work well? They work really well for simple tasks; they became state of the art for a reason. I have some examples in this table, taken from Papers with Code, which is a really great resource. I recommend everybody go over to that website and look around at all the benchmarks and resources they provide; it's the first place I go to understand, for a task oriented a particular way, what the benchmarks and the state of the art are. And you'll see there isn't much reason to reach for a large language model when your task is really simple: these small models are already state of the art, they have essentially solved the tasks they were dedicated to, and you get near-perfect performance anyway. Why do you need to scale to something orders of magnitude larger?
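Editor's note: a minimal sketch of the "small models already solve simple tasks" point above, assuming the Hugging Face transformers library. The model shown is the standard small sentiment classifier used by the pipeline API; swap in whatever small specialized model fits your task.

```python
# For a simple, well-studied task like sentiment classification, a small
# off-the-shelf encoder model is often already near state of the art,
# no large language model required.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # ~66M parameters
)

print(classifier("The fine-tuned model shipped without a single regression."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```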
Now that we've contextualized the three approaches and the three types of models you might work with, let's go a bit more into fine-tuning. For me, and for a lot of folks, you can break fine-tuning down into four categories. First there's multi-task fine-tuning, where you want the model to perform on a range of tasks, and your datasets are built specifically to encompass that ensemble of tasks. There's few-shot fine-tuning, where the training data itself contains examples: each training example includes a few shots, or exemplars, along with its completion. Then there's domain-specific fine-tuning, where you want the model to perform particularly well on a particular subject matter. And finally there's prompt-based fine-tuning, which is the primary focus of today's talk, mostly because instruction fine-tuning has recently been the state-of-the-art approach for creating the right interface to expose state-of-the-art performance on downstream tasks.

I'm going to go through a few strategies for improving fine-tuning performance. At a high level, they break down into defining your task, prompt engineering, and fine-tuning efficiently. Here's the first set of challenges I think about. You start with a model, say a really large one like Davinci from OpenAI. From there, gathering the data involves either going for quantity, in which case you might go to a third-party data vendor for a large volume of synthetic data, or going for quality, which means human labeling and is a lot more arduous. This usually leads to four different paths; these are the scenarios I most often see for people who approach it this way. You often get data in a raw format: it's not instruction data, and it may not even be well formatted enough to be consumed by the model at all. Second, you might get poor quality data that's misaligned with the goal and the primary task you want the model to perform, so it just won't perform at what you'd think is state of the art. Finally, with human labeling, you either get too few examples, in which case you can't even reasonably fine-tune (it's been shown that with too few examples, fine-tuning is unstable on downstream tasks and performance becomes less predictable), or it becomes way too expensive, because paying a bunch of human labelers for a sufficient number of fine-tuning examples adds up fast.
One of the solutions, and what I encourage everybody to do, is to define your tasks really clearly within a framework. At our company we like to think about how we break down the different task types. Fundamentally, we separate them into knowledge-based and reasoning-based tasks. For knowledge-based tasks, assuming you actually want the capabilities of a large language model, you either want to fine-tune it to perform a wide range of fairly simple tasks, or you want it to be really good at a few specific, complex tasks. On the reasoning side, you can break it down into coding and math. There are obviously other types of reasoning capabilities, but focusing on those two: for coding, you want the model to be performant at code generation, or at explaining and debugging code; for mathematics, it's more an exercise in embodying axiomatic logic and reasoning, and starting to learn higher-order abstractions from there.

Once you have an idea of your task framework, I recommend everybody look at Stanford's initiative called HELM, the Holistic Evaluation of Language Models. It was the first of its kind to standardize and evaluate models across a number of scenarios while holding the prompts and the tasks fixed, and it's a living, breathing benchmark you can go to to understand the state of the art across models and to figure out which tasks you're actually interested in benchmarking and fine-tuning on. You want to start from the task and work backwards from there.

One little trick that's pretty interesting and important when figuring out your tasks is task clustering: deliberately diversifying the training data, which has been shown to improve out-of-domain performance. The visual I'm showing is a breakdown taken from the FLAN paper, released about two years ago. In that paper they showed that if you train on a diverse set of task clusters, the model performs better at inference on out-of-domain tasks than if you don't have a wide diversity of task clusters in your fine-tuning data. One caveat is that the model needs to be fairly large, probably greater than about 10 billion parameters, for that behavior to emerge. A rough sketch of what assembling such a mixture might look like is below.
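Editor's note: a minimal sketch, under my own assumptions, of building a fine-tuning mixture from several task clusters. The cluster names and example records are hypothetical placeholders, not the talk's or the FLAN paper's actual data.

```python
# Build an instruction-tuning mixture by sampling a capped number of examples
# from every task cluster, so no single task type dominates the data.
import random

task_clusters = {
    "summarization":   [{"instruction": "Summarize the passage.", "input": "...", "output": "..."}],
    "classification":  [{"instruction": "Label the sentiment.",   "input": "...", "output": "..."}],
    "code_generation": [{"instruction": "Write a Python function that ...", "input": "", "output": "..."}],
    "qa":              [{"instruction": "Answer the question.",   "input": "...", "output": "..."}],
}

def build_mixture(clusters, examples_per_cluster=250, seed=0):
    """Interleave examples drawn from every cluster, then shuffle."""
    rng = random.Random(seed)
    mixture = []
    for name, examples in clusters.items():
        take = min(examples_per_cluster, len(examples))
        mixture.extend(rng.sample(examples, take))
    rng.shuffle(mixture)  # avoid long runs of a single task type
    return mixture

training_data = build_mixture(task_clusters)
```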
Once you've identified your tasks and have the task clustering in place, I'd suggest thinking about prompt engineering. I know prompt engineering has a certain connotation today, but I'm going to take you through how we think about it and how you can approach it in a rigorous fashion. Maybe we should first talk about what a prompt actually is.

At a high level, a prompt is comprised of three parts: the meta prompt, the template, and the exemplars. The meta prompt is used in some of the techniques I'll elaborate on in a couple of slides to come up with more data; you can actually prompt the large language model to synthesize more data. An example of a meta prompt would be "come up with a series of tasks": you're asking the model, with a high-level instruction, to produce more tasks as part of its completions. Once you've established the meta prompt, you can establish the template you'll be using. If anybody's used LangChain, they have a bunch of templates there as well; the specific template I'm showing is from the Self-Instruct paper, which I'll go over in a second. You want the model to come up with a series of tasks, so the template provides eight tasks and leaves the ninth blank for the model to fill in. Once you have the template, you fill in the tasks themselves; these are the exemplars, and I have some examples here of exemplars you'd use to generate more tasks.

Once you have the prompts you want to use for instruction fine-tuning, you often hit a bottleneck: it's actually kind of difficult to construct, say, thousands of prompts by hand. It takes too much human time, the same problem you run into with human labeling. So we typically use synthetic data techniques, and I'll elaborate on one we like a lot, which is Self-Instruct. To synthesize more data, you use the Self-Instruct approach to seed the tasks: you come up with a series of seed tasks, enter them into this prompt, and feed it through a task pool in a pipeline to generate more tasks. Not every generated task is going to be viable for your fine-tuning process, so there's a filtering mechanism to decide whether or not to keep the tasks the model generates. You basically seed the large language model with a set of human-curated tasks and run it through Self-Instruct to produce more tasks. In the paper, they were able to generate 82,000 task instances from just 175 seed tasks. Why this matters: within the industry right now, this is the exact method Alpaca used to fine-tune LLaMA, and that a lot of the other descendant models that have arisen in the last couple of months used to get more training data. A minimal sketch of that generate-and-filter loop is below.
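Editor's note: a minimal sketch of a Self-Instruct-style loop, under my own assumptions. The `generate` callable, the seed tasks, and the toy filter are hypothetical placeholders; the actual paper's pipeline filters generated tasks far more carefully than this.

```python
# Seed a task pool with human-curated tasks, prompt the model with a few
# exemplars plus one blank slot, and keep the generations that pass a filter.
import random

seed_tasks = [
    "Write a short apology email to a customer.",
    "Classify a movie review as positive or negative.",
    "Explain what a hash map is to a beginner.",
    # ... the paper starts from 175 human-curated seed tasks
]

META_PROMPT = "Come up with a series of tasks:\n"

def build_prompt(task_pool, n_exemplars=8, rng=random):
    """Template: show 8 exemplar tasks, leave the 9th blank for the model."""
    exemplars = rng.sample(task_pool, min(n_exemplars, len(task_pool)))
    lines = [f"Task {i + 1}: {t}" for i, t in enumerate(exemplars)]
    return META_PROMPT + "\n".join(lines) + f"\nTask {len(lines) + 1}:"

def keep(candidate, task_pool):
    """Toy filter: drop empty, very short, or exact-duplicate generations."""
    c = candidate.strip()
    return len(c) > 15 and all(c.lower() != t.lower() for t in task_pool)

def self_instruct(generate, rounds=1000):
    """`generate(prompt) -> str` is whatever LLM completion call you have."""
    task_pool = list(seed_tasks)
    for _ in range(rounds):
        candidate = generate(build_prompt(task_pool))
        if keep(candidate, task_pool):
            task_pool.append(candidate)  # new tasks feed later prompts
    return task_pool
```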
To give some tips specifically for prompt engineering: for your prompts, you typically want to focus on the rationales over the correct answers. Think of it like taking a math exam; it's not about getting the right answer, it's about having the right reasoning path and adopting a reasoning pattern you can apply to a lot of other questions. In traditional machine learning what's important are the labels, your X and y, but with large language models what's actually more important is the reasoning path, and providing that to the model so it can learn better. Another tip, which I should have alluded to before, is to push the context limit with your examples. It's staggering how much longer context limits are these days; I think Anthropic just updated Claude so it can take in a context length of a hundred thousand tokens. If you can push the context limit, you're providing more examples, which goes back to few-shot or in-context learning and helps contextualize what you want the model to do. And a final tip: try to flip the script, because you can get the model to generate the rationale for you. One strategy that has come about recently is chain-of-thought reasoning, where you make the model spell out the reasoning path it used to get to its answer, and that's really helpful for increasing the model's capacity to learn your specific task.

So that covers the data-centric approaches for improving the quality of your fine-tuning. Once you've established how to get the training data and how to put it into the form of instruction prompts, you actually have to train the model. This part is about training efficiency; I'll give some guiding principles and introduce a technique to help you fine-tune more efficiently for your use case.

Let's go through an exercise to figure out the memory footprint of fine-tuning a 15-billion-parameter model. I'm going to apply pretty naive data parallelism with mixed precision: PyTorch ships with distributed data parallel utilities, and all NVIDIA GPUs have mixed precision baked in as an option; it's been shown to give more efficient training without any degradation in results. So, for the 15-billion-parameter model: the first chunk of memory you have to allocate is the model weights, 15 billion parameters at two bytes per parameter, because we're using half precision. There is no free lunch, unfortunately, even though it's called mixed precision: it reduces the memory footprint, but you still have to keep a master copy of the model in memory, which is another four bytes per parameter during the fine-tuning process. What actually occupies most of the memory is the optimizer state, which has to be kept in full precision. If you're using something like Adam, you keep two states per parameter, momentum and variance, in memory, and you take a really big hit from holding those. Finally, you have to keep the gradients in full precision as well. That amounts to about 270 gigabytes of VRAM, not even including activations, so you're looking at needing four 80-gigabyte A100s. The back-of-the-envelope arithmetic is written out below.
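Editor's note: the arithmetic above, written out as a sketch. The per-parameter byte counts follow the mixed-precision-with-Adam setup described in the talk, and activations are deliberately ignored, matching the talk's numbers.

```python
# Back-of-the-envelope memory math for naively fine-tuning a 15B-parameter
# model with mixed precision and Adam (activations not counted).
params = 15e9

bytes_per_param = {
    "fp16 weights":        2,  # the copy used for forward/backward passes
    "fp32 master weights": 4,  # mixed precision still keeps a full-precision copy
    "fp32 gradients":      4,
    "Adam momentum":       4,
    "Adam variance":       4,
}

total_bytes = params * sum(bytes_per_param.values())  # 18 bytes per parameter
print(f"{total_bytes / 1e9:.0f} GB of VRAM")          # -> 270 GB
print(f"~{total_bytes / 80e9:.1f} x 80 GB A100s")     # -> ~3.4, i.e. four cards
```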
And we're in a bit of a GPU shortage today, so good luck either allocating those GPUs or paying for them, because they're going to be really expensive. There are strategies like offloading that can improve your training efficiency; if you want to do some further reading, one of my scientists, Lucia, has a post on Medium that goes through how to think about resolving the bottlenecks you'll encounter in mixed precision training.

Let's move on to the solution I actually suggest. You have naive data parallelism, but when you fine-tune a model you can think of it as taking context-target pairs and optimizing an objective function so the model becomes better tuned to a downstream task. This visual is taken from the LoRA paper, "Low-Rank Adaptation of Large Language Models." What's happening in the summation is that you take a language model and optimize it through backpropagation, minimizing the loss on the objective function so it performs better on downstream tasks. The key observation is that you don't actually need to update all of the initialized weights; you just need to change the weights incrementally, by some delta. Let's go into a little more detail. You can reformulate full fine-tuning, where you change all of the weights in the model, into a version where you freeze a portion of the weights. This was the original full fine-tuning formulation, and in the paper the authors reformulate it as a combination. How do we think about this? If you remember backpropagation, you have your forward, backward, and update steps. In the update step, you break the update into a set of frozen weights plus a set of weights you actually want to change, and the set you want to change is expressed as a low-rank matrix multiplication applied on top of the frozen weights. In a way it's subtraction by addition: you add more parameters into the process, so it's kind of interesting that it reduces the number of parameters you need to deal with during fine-tuning. But this works because of your choice of rank: downstream tasks are intrinsically low rank. What does that mean? Rather than needing every single weight, the entire dimensionality of the large language model, to perform well on the tasks you care about, you can re-parameterize it and map it down to a set of weights that matter more for your tasks. That's where rank comes into play, and often you can map a one-billion-parameter model down to a rank of eight for your fine-tuning process using LoRA, which leads to an extremely large reduction in the memory footprint of the model when you need to fine-tune it. A minimal sketch of a LoRA-style layer is below.
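Editor's note: a minimal PyTorch sketch of the low-rank reparameterization described above (a frozen base weight plus a trainable B·A update of rank r). This is an illustration of the idea under my own assumptions, not the paper's or any library's exact implementation; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # W0 stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # delta starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # frozen path plus low-rank update: only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 4096 = 65,536 vs. ~16.8M weights in the full layer
```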
So, given that set of principles, you can use LoRA to reduce the memory footprint and fine-tune in a more efficient manner: right here we break the objective function down into the frozen parameters and the low-rank set of parameters you actually want to fine-tune. There are a few caveats and things to note about fine-tuning with LoRA. The decomposition can really only be applied to the attention layers; some folks have tried applying it to the fully connected layers, but more often than not it's applied to a subset of the attention projections, which, if you're familiar with attention, are the representations of the query, key, and value matrices. Another thing to keep in mind is that a higher rank is needed, and therefore you get less reduction in parameters, when the downstream tasks are more dissimilar. The pre-trained language model was trained on a particular corpus, so if the corpus of the tasks you want it to perform better on is quite different from the corpus it was pre-trained on, you need to increase the rank (the parameter r), and you won't see as drastic a memory reduction benefit from LoRA. Maybe the final interesting aspect of a parameter-efficient fine-tuning method like LoRA is that it can outperform full fine-tuning in some use cases. That comes back to the intrinsic rank of some downstream tasks: you can think of it as focusing the model on the particular context of tasks you care about, so fine-tuning only the low-rank update on top of frozen weights helps it perform better on those tasks.

Bringing it back to the back-of-the-envelope calculation: there are a few libraries out there that folks can leverage for parameter-efficient fine-tuning, but when you apply LoRA, say with a rank of eight, the only thing you really have to keep in memory is the half-precision model parameters (in some cases people go to even lower quantization for that), while the rest of the parameters become tens of megabytes. You see roughly a 78% reduction in the memory footprint of the model, which means you could effectively fit it on a single 80-gigabyte A100 rather than needing four. Obviously there's some overhead that may still cause out-of-memory errors, so keep that in mind. There are other tricks to reduce memory too; I only talked about LoRA because I think it's the most approachable one with the most benefit when you're fine-tuning, but you can also shard the gradients, the optimizer states, and the model parameters, and good examples of that are DeepSpeed ZeRO Stage 3 and Fully Sharded Data Parallel. A rough sketch of wiring LoRA up with an off-the-shelf library is below.
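Editor's note: the talk mentions "a few libraries" for parameter-efficient fine-tuning without naming one; a common option is Hugging Face's peft library, used here as my own illustration. The base model name and the target module names are placeholders and depend on the architecture you fine-tune.

```python
# Wrap a base causal LM with rank-8 LoRA adapters on the attention projections.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-org/your-15b-base-model")  # placeholder

config = LoraConfig(
    r=8,                                  # the rank discussed above
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # which attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```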
So I just want to summarize what I went over, the best practices to keep in mind when you want to fine-tune. The first thing to do, like I said, is define your tasks, and really leverage task clustering: the diversity of the underlying training data will let the model perform better on out-of-distribution tasks and let you train it toward exactly the capabilities you want it to have. The second thing, after you've put your data into an appropriate format, is to use Self-Instruct or similar methods to generate more data; you can use the model, or another model, to help you generate more training data to fine-tune on. And finally, apply parameter-efficient techniques to reduce the memory footprint of the model, and combine them with other techniques so you can fine-tune in a reasonable amount of time. It's really a combination of best practices for getting the best quality results and having the most efficient training process when you're fine-tuning. That's all I wanted to talk about today, so I think we can open it up for some discussion and questions now.

Awesome, that was fantastic, thanks so much Mark. The chat is open if anybody has questions and we're happy to answer them. While we're waiting for everybody to fill in their questions, and I'm sure we'll get a lot, can you tell us a little bit about where you came from, your journey, and how you got excited about LLMs and got deep into this space?

Sure. I started out as a machine learning researcher, and most recently I was at Splunk working on streaming analytics. I always felt like machine learning is just very hard, and with language models it became that much harder. What really interested me, and got me to start a company and go down this journey, is that traditional machine learning was about starting with the simplest model and then adding more capacity so it performs better, or figuring out all the corner cases where a model is lacking; you went from regression to tree-based methods and so on. The script got flipped with language models: now we've compressed all this capacity and ability into these extremely large models, and we just need to learn how to interface with them. It reinvigorated a bit of passion for machine learning in me, because now I really need to learn about the models themselves, understand them, and focus on that. We've really not fully tapped the potential of these models, and we're just figuring out ways to work with them better. I almost think of fine-tuning as a way of interfacing and working with a model.

That's super cool, and that makes a lot of sense. And in that answer we've received six questions, so I'm going to start working through them. First question: does task clustering mean that when I have, for example, a dataset for fine-tuning on a specific type of NLP task, I should be mixing it with some existing datasets?

You can do both, I would say.
With respect to task clustering, the punch line was really about diversifying the dataset you're fine-tuning on. Whether it's existing datasets or brand new datasets with a greater diversity of applications, it's just better to have more clusters of tasks in there. A really good example that's come up in the community is the realization that if you combine language data with code data, the symbolic mapping between the two helps improve performance on both language tasks and coding tasks, which is actually kind of weird, right? You combine these two ensembles of data types in the fine-tuning process and all of a sudden it does better out of the box. So yes, it's that diversity of fine-tuning training data that actually helps.

That's very interesting; it's pretty surprising to hear, but very cool. Here's another question about LangChain's auto-evaluator, if you're familiar with that tool. I haven't used that part of LangChain yet, but somebody's asking: what about evaluating all the LLMs, has anybody explored using, for example, LangChain's auto-evaluator?

Unfortunately I haven't used LangChain's auto-evaluator before. It sounds like a promising approach; the only caveat I'd note is that evaluating generation from these large language models is actually pretty difficult.

Yeah, and evaluating them isn't a clear-cut thing you can do yet. There are scores people are coming out with, but there aren't very quantifiable metrics to measure which model is doing better, especially if you're doing a text generation task rather than, for example, NER or text classification.

I certainly agree with that, and I alluded to HELM earlier as well. I was really excited about that whole release, mostly because, you know, the LLaMA model release from Meta was really important for the community, but there has been criticism around the fact that they didn't release all the prompts they benchmarked on, so it's not really an apples-to-apples comparison of how models performed. We all know from few-shot learning that the prompts and the examples actually matter a lot.

Yeah, it's very interesting, and it makes me wonder what the state of prompts is going to be as we advance. Some people are of the opinion that the largest models are just going to become so good at intuiting what you're trying to get at that prompting will become less important, but obviously there are people far on the other side who think being able to craft perfect prompts is critical, so it'll be interesting to see how that develops. We have another question that I think is going to be subjective, but it'll be interesting to hear your thoughts. The first part is: what costs are most important when you're looking at cloud platforms for these techniques, maybe anything beyond per-token or per-character charges? And then: which cloud is the best, or which of the cloud providers is your favorite, for doing these tasks?
That's a really good question. This is sort of data platform 101 in a way: when you're building a product, what is your data access pattern really going to look like? With respect to large language models, the per-token cost is important, but at the same time you need to think about what access patterns you expect from your customers over time. If you're a semantic search company, for instance, you're going to be hitting the model much more often, and you expect a certain latency out of it. That's on the provider side. With respect to the clouds, that's a bit of a subjective question; it really depends on what set of primitives you expect from your cloud provider. Not every cloud provider behaves the same with respect to regional distribution, so if you're scaling really large, regional co-location of your GPUs is going to be really important, egress costs for moving data to feed the compute matter, and then there's just the bandwidth wiring. If you really want the best bandwidth, maybe you say, and this might be unexpected, maybe I want to go with Oracle, because they literally have a wire connecting your multi-node GPUs. So it all depends; at the end of the day you should work backwards from your goals. If the goal is to iterate rapidly, maybe you choose the easiest cloud to work with.

That's a great answer; that's actually how I've been doing a lot of things recently, and I've been loving GCP, but that's again super subjective and totally dependent on the person. Here's a great one, I think another subjective one: what vector database are you using right now? And the second part of that question: do you even need a vector database to do LLM fine-tuning?

That's a good question. I'll first talk about what we're using internally: we haven't really settled on a particular vector database. We've played around with Pinecone and with some open source vector databases as well, such as Qdrant, and even LanceDB, which is a new one coming out in open source. That's mostly to achieve retrieval-augmented language models: your language model is only trained up to a certain data cutoff, so anything that happened in, say, the last three weeks might not be encompassed in its knowledge. But to do fine-tuning you don't really need a vector database; that's mostly for injecting more context and for stringing together or caching context across different requests. One thing I will say, though: fine-tuning and vector databases are usually considered separately, but as we become more sophisticated in how we interface with and use large language models, there's a composability and caching aspect within model routing where you'd want to leverage both a vector database and fine-tuning to achieve fairly complex use cases. A rough sketch of the retrieval side of that is below.
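Editor's note: a minimal sketch of the retrieval-augmentation idea mentioned above, under my own assumptions rather than Primo's actual stack. The `embed` and `complete` callables stand in for whatever embedding model and LLM endpoint you use; a vector database replaces the in-memory matrix at scale.

```python
# Embed documents, find the nearest ones to a query, and prepend them as context.
import numpy as np

def retrieve(query, docs, doc_vecs, embed, k=3):
    """Return the k documents whose embeddings are closest to the query's."""
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def answer(query, docs, doc_vecs, embed, complete):
    context = "\n".join(retrieve(query, docs, doc_vecs, embed))
    prompt = f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
    return complete(prompt)
```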
Yeah, that makes a lot of sense; I've experienced really similar things with that dichotomy and then bringing the two together. Cool, we have two more questions. I'm going to try to make sure I interpret this one correctly: when you're trying to find the reasoning path of a model when doing prompting, or when fine-tuning your prompts, is there a systematic way to get at this, something like explainability? I think the broad question is: have you found systematic or reproducible ways to continuously make your prompts better?

Yeah, this is funny; there's a lot in the open community and research that has come out on this, probably within the last three months. I alluded to Self-Instruct before, and one type of post-processing that's used commonly is self-consistency. What you do is resample the model, say, ten times, find which answer occurs most frequently, and then go backwards: you take that most common answer, which you don't actually know is correct, and select the reasoning paths that correspond to it. So it's kind of working backwards, and it's one type of reasoning-path selection. Within our research team we're exploring more methods to improve on that, because it's a fairly simplistic approach, but you can ensemble these things together into an actual pipeline for approaching explainability and reasoning for these models.

How have you been managing cost when you're doing that? If for every input you're calling the model ten times, do you have ways to manage cost in those kinds of experiments?

Yeah, absolutely, there are a few knobs for that. Everybody's probably seen, working with OpenAI, that the temperature setting can help: increasing the temperature increases the diversity of the outputs, so you can effectively mini-batch some of your generations that way, by increasing temperature and letting it sample out more reasoning paths. We've also played around with different batching mechanisms we want to use there. But as you said, it's an open question. I think inference is the main problem on everybody's mind with these language models, because it's one thing to say I can get good quality results and evaluations out of a model, and another thing to say this is going to cost me a hundred times more than just doing a query on a regular, traditional database.
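Editor's note: a sketch of the self-consistency resampling described just above, under my own assumptions. `generate(prompt, temperature)` and `extract_answer` stand in for your own LLM call and answer parser; they are not from the talk.

```python
# Sample several completions at a higher temperature, vote on the final answer,
# and keep the reasoning paths that led to the winning answer.
from collections import Counter

def self_consistency(prompt, generate, extract_answer, n_samples=10, temperature=0.8):
    completions = [generate(prompt, temperature=temperature) for _ in range(n_samples)]
    answers = [extract_answer(c) for c in completions]
    best, _ = Counter(answers).most_common(1)[0]              # majority-voted answer
    reasoning_paths = [c for c, a in zip(completions, answers) if a == best]
    return best, reasoning_paths                              # paths that support the winner
```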
Yeah. I'm curious whether you have any ideas on this: now that Claude has a hundred-thousand-token context window and even GPT-4 has a 32,000-token context window, do you think these larger companies are ever going to offer, for example, monthly fees, or do anything other than per-token charging?

That's a really interesting question. As a business we've thought about that as well; the token is their version of consumption-based pricing, to an extent. To me it really comes down to first principles with respect to the hardware: innovations in hardware will unlock the unit economics that are going to be important in the future. A good example I give is that if a 200-gigabyte GPU card existed tomorrow, a lot of the sharding and offloading tricks we currently employ wouldn't be as relevant. Although I also know that, as human beings, the moment we get a 200-gigabyte card we're going to want a 100-trillion-parameter model, and then we'll be back to sharding things again. But in terms of pricing and unit economics, I do see it as possible; fundamentally, I'd have to see either changes in the way hardware can be used for these use cases, or innovations in the research, to unblock those unit economics.

Cool. I think for the last question I'll leave it with a very fundamental one that you've probably been asked a bunch of times and probably enjoy answering: when you're approaching a new task, how do you decide whether you should fine-tune versus prompting with a context window?

Ah, that's a really fundamental and interesting question. I come back to one of the earlier slides: if you really want domain expertise, if you really want the model to be much more performant, and possibly, in terms of the economics and cost you mentioned, if you want more zero-shot ability out of it, then you really want to fine-tune. Also, if there's a set of expertise that isn't already embedded in the pre-trained model, you really do need to fine-tune. With respect to prompting, the whole beauty and the whole argument of "one model to rule them all" is that you can expose the capabilities embedded in the model just by providing it the language to perform the task. So when I look at both, it's an evaluation of how the model performs on the things I actually care about, and then working backwards: is that viable for me to support on a cost basis, or do I just need it to work zero-shot, out of the box?

Awesome, that was a very elegant and well-thought-out answer. Well, I think those are all the questions, perfect timing. Mark, thanks so much for coming on, and thanks everybody for listening. This was really awesome. I'll drop the link in the chat once again: if you have the ability, you should come to the conference; we're going to have a lot more talks like this, plenty of people, awesome learnings, and great networking opportunities. But yeah, Mark, thanks so much for coming on.
Info
Channel: MLOps.community
Views: 8,453
Id: 1WSUfWojoe0
Length: 53min 48sec (3228 seconds)
Published: Fri Jun 02 2023