End-to-End LLM Workflows with Anyscale

Captions
Hey, I'm Robert. I'm one of the co-founders and CEO of Anyscale. And I'm Goku, and I work on machine learning here at Anyscale. We're going to talk about solving some of the core infrastructure challenges for building LLM and generative AI applications. A lot of what we're going to show you today is based on Ray. Ray is an open source project that started out of UC Berkeley. It's been used by companies like OpenAI to train GPT-4, and it's being used by companies like Uber and Pinterest to run all their deep learning, so it's really starting to become the heart of scaling AI workloads at a lot of tech companies. That spans the spectrum from getting started on your laptop writing your Python script all the way through the most cutting-edge foundation model use cases out there.

Yeah, I've actually heard about Ray since 2019, which is half a decade ago now, and it's amazing to see the adoption behind it. Goku, do you want to talk a little bit about some of the challenges teams face when doing AI and trying to build AI applications and get them into production? Sure — regardless of what kind of machine learning application you're trying to build, whether it's more traditional deep learning or LLMs and gen AI, two of the most important factors are velocity, meaning how quickly your team can move, and actually doing things at scale.

And how quickly your team can move is also related to how quickly the individual developers can move, right? We've seen a lot of obstacles to moving quickly as an AI engineer: you're hired to do AI and build AI applications, but a lot of these people find themselves having to manage clusters, scale workloads, think about GPUs and other hardware accelerators, and think about how to move from development to production. There can be a lot of challenges there. Exactly — you're not just on your local laptop anymore; in fact, these models just won't fit, so scale is something developers think about from day one, even in the development stage.

Yeah, scale is just a fact of life, and if you're only scaling in production and not scaling during development, then you have a different development environment and a different production environment — a different codebase, different frameworks — and that increases the gap between what's running in development and what's running in production. It means it's more likely that things will go wrong when you transition, and that there will be a more expensive handoff between those two stages: maybe a handoff to another team, a rewrite, or something like that. Absolutely, that's actually where almost all the mistakes and errors happen, and I'm sure you've heard the quote, "well, it worked on my laptop." That's exactly one of the biggest challenges.

Let's dive into data processing first. What we have here is an overall diagram. You may recognize this specific task as prepping your data to fill a vector database for RAG applications, but I'm just going to use it as a skeleton of a data processing workload. There are a lot of different reasons you need to prepare your data: it could be preparing it for pre-training, for fine-tuning, or even for some kind of batch inference task.
Right — and of course, I want to say there are a lot of interesting challenges here. This diagram is in some sense oversimplified: you have CPU compute, like the reading, writing, and pre-processing, but it's also typical to have GPU compute as well, doing inference, embedding computations, and so on. Exactly — you're going to want to work with all different kinds of compute resources, and your data could be anywhere: your local system, cloud storage, a data warehouse. You want to be able to read from and write to all of these different sources. And when you have mixed CPU and GPU compute, the challenges get even harder. For example, you may want to autoscale the CPU pool of workers and the GPU pool of workers independently — the workload may be more CPU- or more GPU-intensive — and different batch sizes may be appropriate on the CPU versus on the GPU. We'll see all of this in action, where for each workload within the data processing pipeline we'll have full control over exactly what compute strategy we want to use, and do it in a way where we're not maintaining the infrastructure ourselves or connecting these different devices together — we can just focus on the actual code we want to develop.

You mentioned that you may need to scale during development. Can you say a little more about that, and about the challenges of scaling during development and creating a genuinely productive development experience for data processing? Sure — obviously, if your data lives in a single file, you're not really thinking about scale; you can have it on your local laptop and it just works. But even during the experimentation and development phase we now have very large datasets — I'm not even getting to the models yet, just large datasets living in cloud storage like an S3 bucket. It's impossible to simply read from there if the data doesn't fit on your one machine, let alone do the processing and then write it back to cloud storage. These are things that won't fit in memory, and you really want the ability to select different sizes or different portions of the data. You often want to try running something with ten data points, or a thousand, or a million, and be able to go back and forth between them — run at different scales. Yeah, the traditional principles of doing machine learning that have held for decades still apply: you want to try it out on a small chunk first, but then you want to be able to apply it at scale to your entire dataset, especially before fine-tuning.

And even though you're doing this at scale, the usual principles of software development and debugging are still relevant. You may run something at scale on a hundred machines and then find out you forgot to import some Python library, or you need to install an extra dependency. So you pip install it — of course — but you need to get it onto all of those machines and make sure it's there. If you don't have a good setup for doing this, you're SSHing into a hundred machines and running pip install on all of them, and you're going to run into a lot of challenges.
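To make the "run at different scales" idea concrete, here is a minimal sketch of what that can look like with Ray Data. The S3 path and column layout are placeholders, not the dataset used later in the walkthrough.

```python
import ray

# Read a (potentially very large) dataset straight from cloud storage;
# Ray Data streams it across the cluster rather than loading it onto one machine.
ds = ray.data.read_parquet("s3://my-bucket/my-dataset/")  # placeholder path

# Iterate quickly on a tiny sample during development...
small = ds.limit(10)
print(small.take(5))

# ...then scale the exact same pipeline to a larger slice, or the full dataset.
medium = ds.limit(100_000)
print(medium.count())
```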
These are exactly the kinds of things that Anyscale features like Workspaces solve out of the box: you pip install once, the same environment is present on all the machines, and you can do that interactively at different scales. Yep — and we'll see Workspaces in action, where we just focus on our code and environment and have this seemingly infinite compute underneath that we can leverage. It's also important to have a good IDE story. A lot of people like developing in VS Code — what does that look like in a distributed setting, if you're running on a cluster? You want to be able to integrate that environment with your IDE so you have the same familiar development experience: you can set breakpoints and step through the code even if it happens to be backed by a cluster and running across a bunch of machines. Again, these are the kinds of things we've built into Workspaces, and they really move the needle for overall developer productivity.

Dynamically configuring the cluster is also important, because the last thing you want to do is get all your dependencies installed, set everything up just right, scale the cluster, run your application — only to figure out you used the wrong instance type, the wrong kind of GPU, the wrong hardware accelerator — and then have to tear everything down, spin up a new cluster, and reinstall everything. We've all been there. Exactly, and that's what you want to avoid. Or, on the other side, what a lot of data scientists and machine learning engineers do is grab the beefiest machine possible, and then usually have some questions to answer to their infrastructure platform team about why they wasted so many resources. Workspaces are great because you can start with a lean, simple head node, and then as you need to run different workloads — large-scale data processing or kicking off a fine-tuning job — each one runs as a separate workload, workers spin up for that specific task, and then scale back down to zero. You're very efficiently using compute only for the workloads that actually need it.

First off, Robert, we're going to jump into data processing. Let's do it. Obviously we're going to talk about everything in the context of a specific task, because I always think it's great to learn about things by looking at code that works. But just know that all of this easily translates to different data modalities, different types of models, and really any kind of machine learning workload. So what task do you have in mind? Today we're going to start with a pretty popular dataset called the ViGGO dataset. The first thing we're going to do is load this data — it's available through Hugging Face, and I'm going to have it locally available, since for this tutorial I don't want to spend too much time loading data from large sources; that's why we picked a dataset like this. If you did want to load a larger dataset, you'd be in a good position to do it here. Absolutely — because we don't think of this as one machine; I can read data and have it sharded across multiple workers.
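For reference, loading the ViGGO dataset from the Hugging Face Hub looks roughly like this; the `GEM/viggo` dataset ID is my assumption of the hub name for the dataset described here.

```python
from datasets import load_dataset

# Download the ViGGO dataset (train/validation/test splits come predefined).
# "GEM/viggo" is an assumed hub ID; substitute whichever ID you are actually using.
hf_dataset = load_dataset("GEM/viggo")
print(hf_dataset)
print(hf_dataset["train"][0])  # one example: a meaning representation and its sentence
```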
This looks like a regular notebook, but it's actually backed by an Anyscale cluster, with whatever compute resources you want — if you need a ton of GPUs or other accelerators, they're right there. Exactly. So this is Workspaces on Anyscale, and I'm showing VS Code here because that's my preferred IDE; this is in fact a notebook running in VS Code, but you also have the option of a JupyterLab experience or a terminal. As you mentioned, this feels just like my local laptop, but I've actually started this workspace on top of compute running in the cloud. I've picked a head node with 8 CPUs and 32 GB of memory — we'll talk about what else is available on top of this when we need it, but right now that's all I'm using; it's a very lean machine. And that's exactly what you want: if you're not running intense computation, you don't want a lot of machines started up. But I like the look of this "Auto-select worker nodes" button. We'll take a look at that in just a second.

OK, so I've loaded the dataset, and just like traditionally dealing with datasets, I'm going to split it — this one automatically comes with train, validation, and test splits. Let's look at a sample. What I want to focus on is the inputs and outputs. This dataset starts with something structured: an intent, plus a bunch of extracted entities, and the output is a very unstructured sentence composed from those structured inputs. What I'm going to propose is that we flip this: I want the unstructured sentence as the input, and let's see whether we can leverage large language models to extract the structured information. Everyone talks about LLMs in the generative space, but this is an example of going back to structured data. We've seen a lot of our customers — and frankly the world — leverage LLMs for applications that used to be reserved for traditional deep learning, first because they can get away with far less data and still have very performant models, and second because of other perks like open-domain knowledge: LLMs have a lot of underlying knowledge they can leverage for tasks like this. That's a good point. We often think of LLMs and generative AI as favoring unstructured data over structured data, but one thing people sometimes underestimate is the potential for generative AI and LLMs to make it easy to create structure from unstructured data, and actually lead to far more structured data. Absolutely.

Awesome. Before we can actually use this data to start fine-tuning a model, there are a couple of pre-processing steps we want to take, and I'm going to do this with Ray right from the start, because even though this dataset is small, we want to be able to extend this to a dataset of any arbitrary size. So the first thing we're going to do is import Ray and wrap this dataset with Ray. Here the dataset is already in memory, but this could also be done with data in cloud storage, where I read from cloud storage or any other source and have it sharded across workers.
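Wrapping the in-memory Hugging Face splits as Ray Datasets can look roughly like this (a sketch, assuming the `hf_dataset` object from the earlier loading step):

```python
import ray

# Convert each Hugging Face split into a Ray Dataset so downstream
# transformations can be distributed and streamed across workers.
train_ds = ray.data.from_huggingface(hf_dataset["train"])
val_ds = ray.data.from_huggingface(hf_dataset["validation"])
test_ds = ray.data.from_huggingface(hf_dataset["test"])

print(train_ds.take(1))  # peek at a single row

# The same pipeline works if the data lives in cloud storage instead, e.g.:
# train_ds = ray.data.read_parquet("s3://my-bucket/viggo/train/")  # placeholder path
```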
A typical place to store your data would be cloud storage like S3 or Google Cloud Storage, or Databricks or Snowflake — there are many options. Exactly. Once I've loaded the dataset, it works very similarly: I can take a look at the data and see what a data point actually looks like, and I'm going to prepare a couple of small functions. Thinking ahead a little, to actually fine-tune our model we need our dataset to follow a certain schema. This is very similar to how everybody does fine-tuning these days — OpenAI follows a very similar standard — and I'm going to wrap it in a function called preprocess that applies this schema to all of our data points. There's a system content here where I'm basically telling the LLM how to behave; because we want structured outputs, I'm telling it to first predict the intent and then extract these different types of entities, and the output can contain a subset of those entity types. Makes sense — and of course this is specific to the particular task we're doing right now. Exactly.

So far, if you look at the functions, everything is pure Python. To apply this at scale I don't have to do anything crazy: I take my function, preprocess, and our dataset, train_ds, and I use a Ray Data function called map_batches. I simply apply the function to the dataset, and I can specify things like function arguments — our documentation covers a lot more — and even different compute strategies: how many of which device type I want to use. You really want that, so you have full control over what resources you spend on this, how much it's going to cost, and so on. Now for the really cool part, Robert: when we actually run this, it's no longer running just on this head node. Because we've enabled auto-select worker nodes, then based on things like the data size, the workload I'm running, and the compute strategy we specified — in this case the default — the appropriate number of worker nodes get spun up, they work on this workload, and then they scale back down to zero. That's incredible. And you can see the logs that come out of this; at the end I take a sample, and you can see the pre-processing function has been successfully applied to all of our data. I'm then going to do the same for the other data splits, since we'll be using them later.

Finally, I'd love to save this data. I've already gone ahead and done that so folks can easily access it, and once it's there you can easily read it back from cloud storage. Something convenient about Workspaces is that every workspace comes with a default cloud storage location, and even shared storage if you're working with a group. It's just very convenient to be able to save not just data but any kind of artifact to storage and retrieve it later.
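A minimal sketch of what that preprocessing step can look like. The system instruction, the ViGGO column names (`target`, `meaning_representation`), and the choice to serialize each example as a JSON line are illustrative stand-ins, not the template's actual code:

```python
import json

SYSTEM_CONTENT = (
    "Given a target sentence, first predict the intent, then extract the "
    "relevant entity types and their values."  # illustrative instruction only
)

def preprocess(batch: dict) -> dict:
    """Reshape raw rows into the chat-style schema expected for fine-tuning."""
    rows = []
    for sentence, meaning in zip(batch["target"], batch["meaning_representation"]):
        rows.append(json.dumps({
            "messages": [
                {"role": "system", "content": SYSTEM_CONTENT},
                {"role": "user", "content": str(sentence)},     # unstructured input
                {"role": "assistant", "content": str(meaning)},  # structured target
            ]
        }))
    return {"example": rows}

# Apply the function across the whole dataset; Ray handles batching and scaling.
# map_batches also accepts fn_kwargs=... and resource options such as num_cpus=...
train_ds = train_ds.map_batches(preprocess)
print(train_ds.take(1))
```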
Totally — and that is one of the challenges with distributed computing in general: OK, I ran my application, some artifact was produced, it's on some machine somewhere, maybe I saved it to disk, but where is it now? If I need it on another machine, or in the next job I'm running, how do I get it? How do I hand it off between different pieces of my application? Just having a clear, consistent story for storage, for producing artifacts and using them downstream, is huge. Thanks for walking us through that notebook. Yep — and now we'll put it to good use.

All right, Robert, now we're going to leverage the dataset we pre-processed and actually use it to fine-tune some models. Let's do it — this is the fun part. Before we get into fine-tuning, I want to make clear that traditionally we wouldn't just jump straight to fine-tuning; we'd want to experiment with some base models and actually evaluate them to see what we're working with. Right — the whole point of fine-tuning is to improve model quality, so if you don't have a way of evaluating the quality of the fine-tuned model and saying, "hey, this is better than the base model," you're not going to know what to do with it. Exactly.

Now, for fine-tuning there are a couple of different methods, and we're going to talk about both. The two popular ones are LoRA and full-parameter fine-tuning, and on Anyscale we have a couple of different recipes available — and of course we let you do both. We have several recipes inside a training configs directory, which comes with the workspace templates for fine-tuning. These are basically recipes the community has worked hard to pull together every time a new model comes out, with a set of default configuration options that just work for that specific model. We've exposed all these recipes, and you can modify any of the values — you have full control over exactly what you want to change and experiment with.

We'll go through a few of these. The one thing you definitely want to change is the train and validation paths, especially if users are experimenting with their own datasets, so I've changed them to the cloud storage locations where we saved our dataset. Then there are the rest of the parameters: a few are model-specific, like context length, and a few are about how you actually want to train, so I'm setting the number of devices and the number of epochs. And the fact that you can just change the number of devices from 16 to 32, or whatever you want, and use more GPUs and have it run faster — that's awesome. After that you can play around with more traditional parameters like the learning rate, what kind of padding logic you want, and how many checkpoints to save — by default we save the best and the last checkpoint, but you have full control over that as well.
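To give a feel for the knobs being discussed, here is an illustrative sketch of such a recipe expressed as a Python dict. The key names are hypothetical and will not match the fine-tuning template's actual config schema; only the concepts (paths, context length, devices, epochs, learning rate, checkpointing, LoRA) come from the walkthrough above.

```python
# Hypothetical field names, for illustration of the knobs discussed above.
finetune_config = {
    "model_id": "meta-llama/Meta-Llama-3-8B-Instruct",
    "train_path": "s3://my-bucket/viggo/train.jsonl",        # placeholder paths
    "valid_path": "s3://my-bucket/viggo/validation.jsonl",
    "context_length": 512,
    "num_devices": 16,              # bump to 32 to spread training across more GPUs
    "num_epochs": 3,
    "learning_rate": 1e-4,
    "checkpoints_to_keep": "best_and_last",  # best and last kept by default
    "lora": {"enabled": True, "rank": 8},
}
```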
A couple more interesting configurations here. We use DeepSpeed, another open source library we leverage, which the community has worked hard on to make it easy to do things like distributed training, mixed precision, checkpointing, and gradient accumulation — all the different things that are really important for training. But as a developer I don't really want to think about managing those aspects; I want to focus on my training logic and work on the model itself. Our team, as well as the community, is always working on the next set of optimizations to enable for these kinds of workloads, and we make those available through these configurations as well — things like flash attention — and we stay on top of this and keep enabling it for our users. In terms of the compute strategy, you again have control: you choose the number of devices or workers you want, and you can also choose the accelerator type and so on. We'll see that in action when we actually run the workload.

Now, for the actual fine-tuning logic, we can either change the entire set of weights in our model — that's called full-parameter fine-tuning — or we can do something called LoRA, which stands for low-rank adaptation. That's a great strategy if you don't want to tweak the entire set of weights in the model, but instead learn these low-rank matrices and apply them on top of the original weights. From learnings with our customers and our internal applications: you always want to start with LoRA. It's a simpler, faster way to fine-tune, and then based on your task and what you learn, you can open up full-parameter fine-tuning, which is more powerful. They're both great techniques. Exactly — and of course you have full control here as well: you can control the rank and several other parameters around the actual weight matrices and the training logic.
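To make the LoRA idea concrete, here is a minimal sketch using the Hugging Face PEFT library. This is purely illustrative of the technique itself — it is not what the Anyscale fine-tuning template does internally, and the target modules and rank are example choices.

```python
# Illustrative only: a minimal LoRA setup with Hugging Face PEFT.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # which weight matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
# Only the small adapter matrices are trainable; the base weights stay frozen,
# which is why LoRA is the cheaper, faster starting point described above.
model.print_trainable_parameters()
```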
So with these configurations — and again, you have full control over them — to actually launch the job we have a script that takes in this configuration and the compute strategy and just executes it. Because we've enabled auto-select worker nodes, our head node is still this lean machine, but when we kick off this workload it automatically provisions the appropriate workers, the workload executes, and then those workers scale back down to zero. While this is running, we have a fantastic Ray dashboard that gives full observability into what's actually happening: we get to see GPU usage go up — GPU utilization, CPU utilization, what's happening on disk — and which workers are actually provisioned. It's great oversight into what's happening under the hood, and a way to make sure training is proceeding properly. When you're debugging or building these types of applications, there are typically two kinds of things you're trying to debug. One is correctness: is the thing running, is it crashing, and if it's crashing, what's the error message? That's one type of monitoring that's very important. The other is performance: it's running, it's correct, but the question you're asking is, why isn't it faster? In that case you really need to know which resource might be the bottleneck — GPU utilization, GPU memory, disk space, many different things like that — and having those metrics across the cluster at your fingertips is essential.

Once this is done — one of the most frustrating things for me is to run a large workload like this and then have my code fail right at the checkpointing step and have to restart. So again, as I mentioned in the previous template, Workspaces comes with a default cloud storage location — you can use that or anything else you want — and our models and key artifacts are automatically stored for us: the actual checkpoints, the best one and the last one, and the results are all automatically stored in the default cloud storage, and I can access them for inspection and for other workloads. That's brilliant. Here I'll quickly show what that looks like: this is the default cloud storage, I extract these artifacts, pull them locally, and then visualize, for example, our results.json file — and if you have TensorBoard or other integrations, you can visualize things there as well. I'll also show what it looks like to pull up the checkpoints and use them for evaluating this model, because we need to actually see that it worked. Excellent.

All right, Robert, now that we've fine-tuned our model, we're actually going to see whether it worked. So far, when we did the fine-tuning, we had a metric out of the box, typically perplexity, and you can alter that as well. But while perplexity is a meaningful metric — it's based on how perplexed the model is when it sees a new token — in most scenarios it's not going to be representative of the underlying task. For example, in our scenario we want to go from an unstructured sentence to extracted structured outputs. I'm sure we can think of a lot of different ways to evaluate this, but picking a generic distance- or entropy-based metric isn't going to be truly reflective. So to get started, we're going to set up Ray again, and we'll have a Hugging Face token that we pass in and use to load our base model — so far we jumped straight to fine-tuning, but I also want to do evaluation on the base model, just to see what it looks like. Sounds good.

I'm going to load the dataset we had, and if you recall, we also created a test set. The model did not have access to this, which is very important — it only saw the train and validation splits. Once we have the test set, I'm going to separate it into inputs and outputs. To recap, the inputs are the system and user content we had before, and the outputs are the actual structured outputs that the LLM hopefully generates. Here's one example. I'm also going to load a tokenizer: if you recall, the model we ended up fine-tuning was Llama 3 8B, the Instruct version. I'm loading the tokenizer from Hugging Face because there's something called a chat template. During fine-tuning, we created our dataset and pre-processed it to fit a certain schema; when that text is actually fed into the model, it gets templatized.
During pre-processing for the LLM itself, some special tokens get added — like a beginning-of-sentence token and an end-of-sentence token — and we want to make sure we apply the same template when we feed our test samples into the same model. That's exactly why we need the tokenizer. What I'm going to do is load one of the model artifacts from our fine-tuning run. Which artifact? From our fine-tuning, the model artifacts were stored in cloud storage — it saves all the artifacts in one location — and inside that bucket there's a file called tokenizer_config, which has an attribute called chat_template. So, very simply, I load up the chat template and apply it to all the examples in our test set; this way I have the same inputs available for offline inference with our models. Here's a quick example: this was originally just the input, and now we've applied the chat template to it.
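Applying a chat template with the Hugging Face tokenizer looks roughly like this (a sketch; the model ID and example messages are stand-ins for the artifacts pulled from the fine-tuning run in the walkthrough):

```python
from transformers import AutoTokenizer

# In the walkthrough the chat template is read from the fine-tuning artifacts'
# tokenizer_config; loading the base model's tokenizer yields the same template.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "Extract the intent and entities from the sentence."},
    {"role": "user", "content": "Dirt: Showdown is a sport racing game from 2012 ..."},
]

# Render the messages into the exact prompt string the model was trained on,
# including special tokens, and leave room for the assistant's generation.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```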
From this step, we're going to kick off batch inference. Just to recap, evaluation is one of those tasks that is really ripe for batch inference: you have a large set of inputs and a model — in this case a large model — and I don't want to feed the inputs one at a time; I want to take the entire batch, because it's all available right now, feed it to my model, and look at the entire set of outputs. And this is just one example of many kinds of batch workloads you may come across. You might do batch workloads for evaluation, but also for processing a large amount of data you already have — earlier we talked about taking unstructured data and producing structured data, which is a very natural batch task to perform on a large dataset you've collected. Absolutely.

If you recall, when we applied map_batches from Ray Data before, we applied a specific function to our entire dataset. This is a great time to show how offline batch inference can follow very similar logic — the same map_batches call — except that instead of a function, we'll pass in a class. So why a class this time? It comes down to the operation we want to apply to the dataset. We've taken the test dataset and applied the chat template, and now I want to use a class because there are certain attributes I want available for every operation on this dataset. The class will hold a few pieces of information: the LLM, the sampling parameters (which I'll talk about in a bit), and our LoRA path. These are things I don't need to change between function calls, but they are pieces of information I want to instantiate my class with, and the function the class is responsible for lives under its __call__ method. So you're running the LLM on all of the data — which of course is a function. Exactly. You mentioned there's information you want to share between all invocations of that function, but I think the key is that this information is large and potentially expensive to initialize, and if we were to do that repeatedly, every single time, that would be pretty bad. Exactly — so this is a very efficient way to apply our model to the dataset.

I've written this class — again, really relying on open source — using vLLM, which is one of the most popular, reliable, and well-adopted inference engines. And here at Anyscale we've actually contributed quite a bit to vLLM. Absolutely — it's a fantastic library that has really matured over the last couple of months. I'm using it here to easily switch between our base model and LoRA. If you look at the actual call that's executing, I have a self.llm that uses the offline LLM class from vLLM, and to generate the outputs I just pass in my inputs along with the sampling parameters, which I'll cover in a second. It's the exact same thing if I want to generate outputs from a LoRA version: all I have to pass in additionally is the LoRA path. What's happening is that for the base model we have a base set of weights, and for our fine-tuned LoRA model it uses the same base weights and just merges in the LoRA weights saved from our fine-tuning run. So it's as simple as passing in an extra path to use our fine-tuned model. In terms of the outputs, again, full control: I want access to the prompt (the inputs that went in), the expected outputs (which come from the batch that was passed in), and the generated outputs from the model — all three pieces of information. And are you expecting the expected outputs and the generated outputs to be identical? They should match, exactly.

First, let's apply this using our base model. The sampling parameters control how you want the model to behave — there are many more parameters you can set, but the two I care about right now are temperature, set to zero, and the maximum number of tokens. This depends on the model you use, but I know my outputs are not going to be very large, and it's something you'll want to set if you want to place constraints on how large the output can be and where generation should stop. So I apply the base model to our dataset. It looks very similar: I take the test dataset with the chat template applied, call map_batches, and pass in the class. Before, we used the default compute strategy, which just allocated workers for the job, but now I'm going to decide how many LLM instances I want, how many GPUs per worker, and the batch size, and we can adjust these based on what we care about. I love this — the fact that you can just pass in the class or the function, specify the concurrency and the GPUs with a single number, plus the batch size — and of course there are defaults for all of these — and then, as a consequence, you can scale the same computation across all of your data on a huge cluster. That's fantastic. Exactly, and of course you can play with the accelerator type as well.
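A sketch of what that predictor class and map_batches call can look like with vLLM and Ray Data. This is illustrative, not the template's actual code: the class name, the "prompt"/"expected_output" column names, and the LoRA path are assumptions.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

class LLMPredictor:
    def __init__(self, lora_path=None):
        # Expensive state is created once per worker, not once per batch.
        self.llm = LLM(
            model="meta-llama/Meta-Llama-3-8B-Instruct",
            enable_lora=bool(lora_path),
        )
        self.sampling_params = SamplingParams(temperature=0, max_tokens=256)
        self.lora_request = (
            LoRARequest("ft-adapter", 1, lora_path) if lora_path else None
        )

    def __call__(self, batch: dict) -> dict:
        outputs = self.llm.generate(
            [str(p) for p in batch["prompt"]],
            self.sampling_params,
            lora_request=self.lora_request,
        )
        return {
            "prompt": batch["prompt"],
            "expected_output": batch["expected_output"],
            "generated_text": [o.outputs[0].text for o in outputs],
        }

# One LLM replica per GPU; concurrency, GPUs, and batch size are under our control.
results = test_ds.map_batches(
    LLMPredictor,
    fn_constructor_kwargs={"lora_path": "/path/to/lora_checkpoint"},  # placeholder
    concurrency=2,
    num_gpus=1,
    batch_size=32,
).take_all()
```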
Awesome. I just want to look at a couple of examples. Here's the bad part, I guess, of using one of these base models for a complicated task like this: I feed in the prompt, and you can see the expected outputs — the intent and the different entities — but our model has never seen this before, so the generated text is actually a very long story that's pretty meaningless. In fact, if I print it, it's writing out a class for us — it has completely misunderstood the task. This is definitely expected behavior: we haven't done any prompt engineering here. Something we could have done to help the base model a little is provide a few-shot set of examples, and that might work, but using the raw base model certainly isn't giving the right output.

Next we're going to do the same thing with our fine-tuned model. It's very similar, and you'll notice the only thing I'm adding is a LoRA path; by passing that in, it uses the same base model, merges the LoRA weights, and the same call happens. Looking at a few examples, you can see the expected output and the generated text are, in this case, exactly the same. Let's go ahead and do a quick eval. I'm going to keep it simple but a little strict: let's look at exact matches. Any time the expected output exactly matches the generated text, count that as a match. Obviously you can use many different metrics here — recall, something more lenient, it's completely up to you — but I wanted to quickly show the effectiveness. Even with this strict evaluation there's almost a 96% match, which is great to see from just a few epochs of training.

The last thing I wanted to mention about evaluation is that we had the luxury of working with structured outputs, but in a lot of LLM or gen AI applications the outputs are very subjective, so you're not always going to have clean-cut metrics like this, and you won't always be able to rely on out-of-the-box distance- or entropy-based metrics. This is where a very popular technique — people call it LLM-as-a-judge, or a GPT judge — comes in, where you use a much larger, sometimes even proprietary, model to do the evaluation. And this is just wild — it's beautiful in some ways. It is beautiful, and also a little anxiety-inducing, because you don't really have control over what's happening. In this scenario you'd pass in, say, your fine-tuned LLM's output and ask the larger judge LLM to assess its quality; often people also add a suggested or golden answer and ask it to compare and rate on a scale of one to five, and things like that. Either way, you want some strategy for quantifying your evals — otherwise you're just relying on sniff tests, and those fall apart pretty quickly. Well, if you don't quantify it, you're never going to automate it, so that part is essential. Exactly.
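For reference, the strict exact-match check described above amounts to a few lines of Python (the "expected_output" and "generated_text" column names are carried over from the earlier batch-inference sketch):

```python
# Strict exact-match evaluation over the batch-inference results.
matches = sum(
    row["expected_output"].strip() == row["generated_text"].strip()
    for row in results
)
accuracy = matches / len(results)
print(f"exact match: {accuracy:.1%}")
```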
Now we're finally at the last template, where we get to serve the model we've evaluated. Obviously there are a lot of different ways to serve these models, and before we talk about Ray Serve and all that, I want to make it clear that we do have guides for using vLLM or any other inference engine with Ray Serve to serve your models. But today I'm going to demonstrate some of the Anyscale capabilities for serving these models using those inference engines, with a lot more features on top — multi-LoRA, tensor parallelism, the kinds of production-grade features our users care about. And just to explain the distinction: text-generation engines like TGI, vLLM, and TensorRT-LLM handle the single-machine GPU performance — optimizing the inference engine on a single piece of hardware or a single machine — whereas Ray Serve and Anyscale handle multi-machine scaling, autoscaling, scale-to-zero, high availability, and all the operational aspects of running serving in production. Exactly. In our documentation we have a lot of tutorials, from serving more traditional models — deep learning, BERT-based models — to what we'll focus on today, serving LLMs. Everything we're going to show today you could create yourself, configuration-wise, but we've built a really useful CLI for the service template, so I'm going to run that to show what it looks like. I like how "traditional models" now refers to deep learning. That's right, times have changed.

The first thing it asks for is the model ID you want to serve. There are a couple we support out of the box, but you can really pass in any model from the Hugging Face Hub. We fine-tuned Meta Llama 3 8B Instruct, so I'll paste that ID. It will also ask for your Hugging Face token, if you haven't provided one already, in case you need it for access, and for the GPU type — I'm going to choose an A10 to serve this model. For tensor parallelism I'll go with the default value. Then: enable LoRA serving? Eleven instead of one? Well, let's do eleven here — again, that's the flexibility you have; this is a demo, but we'll show what that looks like. We fine-tuned our model with LoRA, so we have that set of weights available, so I do want to say yes. And while you're doing that, I just want to point out that we talked about some of LoRA's advantages for fine-tuning in terms of training speed, but LoRA also has a lot of advantages for serving. In particular, if you have many different fine-tuned models you want to serve, do you want a separate pool of GPUs for each of those models, or do you want to share the same base model across all of the fine-tuned models and just keep a small number of adapters? This is something we enable out of the box on Anyscale, and it can be way more resource efficient. Absolutely, extremely efficient. Then it asks where those LoRA weights are, so I paste in our location, and then the maximum number of replicas you want. This is a demo, so I'm not going to choose too many — you're welcome to have as many replicas as you want, but I'll keep it simple and have two. Then it asks whether you want to further customize your config.
Obviously you can do that later on your own as well — it includes things like tensor parallelism and other configs; I'll talk about a few other changes users may want to make at the end — but I'm going to say no for now. It then gives you the option to start the service locally, spinning it up so we can query it. I'll say no right now because I actually want to run it myself. This created a configuration file — the name is just a timestamp — so let's look at one I've already created. You can see it has the LoRA path, and if you asked for any other components, it'll have those too. Now we're ready to actually serve this. I'm going to serve the model locally; all I have to do is run it — oops, I'm actually going to run this in the terminal. So now this kicks off locally, following the configuration we have. I believe we chose two replicas, so it spins up two instances of the LoRA model, and now we can query it.

You can query this however you want, but I'd like to stick with the same schema convention we have from OpenAI: I pass in a system content and a user content, and because our model has the chat template, we don't have to do that templating manually — we can just pass in text as we normally would, and all the templating is done automatically under the hood. So convenient. Again, I'm going to set temperature to zero, and I'll set stream to true so we can see the outputs as they come. Now let's query this locally — you saw the inputs above; this is local, so I don't need to pass an actual key here, but we'll show that in action in a bit. I passed in one of the inputs from the dataset and asked it to generate the structured outputs. Now we run this call — the logs are on the right side — and you can see the output; I actually printed the output as well, which is why you see it twice, but it looks like the right structured output we're looking for.

Now, to make this an actual production service, you can again alter the configuration — maybe locally when you're testing you used two replicas, but in production you want 200 — you can play with the configuration, and to go to production all we have to do is deploy the service. With Anyscale we have a command that uses the same configuration and deploys it. When you run a service like this from a workspace, you get a little modal down here saying the service is deployed; you can click to view the service, or from the Anyscale dashboard you can go to Home and then Services and you'll see it there too. This takes you directly to the main central dashboard for the service. You can see it's still spinning up, but any time you or anybody makes a request to this service, you'll see the QPS, any errors coming up, and the latency numbers as well. One of my favorite parts is the full logs, which I can filter by component, log severity level, time period, and even regular expressions.
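Querying the endpoint with the OpenAI-style convention described above can look roughly like this — the base URL, token, and model name are placeholders standing in for the values shown on the service's query page:

```python
from openai import OpenAI

# For a local run the key can be a dummy value; for the deployed service,
# use the base URL and bearer token from the service's "Query" instructions.
client = OpenAI(
    base_url="https://my-service.example.anyscale.com/v1",  # placeholder
    api_key="MY_SERVICE_TOKEN",                              # placeholder
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct:my-lora-adapter",  # placeholder model id
    messages=[
        {"role": "system", "content": "Extract the intent and entities from the sentence."},
        {"role": "user", "content": "Dirt: Showdown is a sport racing game from 2012 ..."},
    ],
    temperature=0,
    stream=True,
)

# Stream tokens as they arrive.
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```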
Nice, that's fantastic. So let's go ahead and actually send a quick request. The neat part about the service page is that up top I have a little query button with information about the actual URL it's running at. I'll copy that — obviously you can curl it as well. We've queried it locally so far, so now I'm going to fill in these credentials using the query instructions here, replacing the URL and the token, and use the same query function, now passing in these credentials instead. We can see the output is the same, and it's the correct output. Nice.

All right, Robert, now we've seen all four workloads together. Any big takeaways? Well, it's just amazing to me: typically, if you go back to before Ray or before Anyscale, each of these different boxes would have been a different framework or a different distributed system. The fact that you can handle the data workloads, the training workloads, the serving workloads, and the evaluation all in the same framework, in the same system — that's mind-blowing to me. Yeah, and I didn't have to switch contexts at all — you're just writing Python code. Exactly. There are obviously many things we haven't had a chance to show yet, which hopefully our users will find through the documentation and future webinars, but each of these pieces is completely isolated and can be combined or kept separate, and we can launch production versions of all of them. For example, we took the model we served locally and went to production with one line — and in a highly scalable and reliable manner. We could have done the same for the other workloads, like our data pre-processing step or our model fine-tuning: again, one line to take the script that does that workload and run it as an Anyscale Job. These are isolated workloads that people can execute manually, or attach webhooks to trigger on certain actions, and it's a great way to build a solid CI/CD workflow around all of this.

I think for someone casually watching this, it may have felt like just writing Python code on your laptop — developing in a notebook or in VS Code — but whatever resources you needed were there: when you needed a GPU or multiple GPUs for training, they were there; when you needed a larger cluster for data processing, it was there; and when you don't need them, they're gone. The fact that you can enable people who know how to program in Python to do all of these workloads at scale without needing to learn about scheduling, fault tolerance, and distributed systems — to me that's one of the most impressive parts. Exactly — and I didn't have to make any compromises on the tools I could use. Coming from a local-laptop world to running large workloads, I still get to use my favorite tools, both from the IDE standpoint and specific libraries — PyTorch, Hugging Face, integrations with great tools — so I bring my world of machine learning with me and run these same workloads at very large scale. Absolutely. And you showcased this, but open source is really the foundation of the AI ecosystem — you mentioned PyTorch, of course, and DeepSpeed, vLLM, Ray —
all of these different tools, along with Hugging Face, really work together to make AI possible. Awesome. Robert, we just went through all of this, but how can our users actually get started and use it for their own applications? That's a great question. Goku, you put together these fantastic notebooks, and now I personally want to try them out. Anyone watching can go to anyscale.com, where there's a button to try this out. Click on that button and you'll have access to everything Goku just showed you: you'll be able to do the data prep, fine-tune the models, deploy the models, and do evaluation — we have templates for everything, so it's going to be fantastic. This was a pretty lean tutorial where we weren't using very large datasets, but it still involves training models and so on — do users get some kind of free credit? Yes, there are free credits to get started and try things out. Hopefully our users found all of this useful, and we look forward to sharing more capabilities in the future. Yeah — lots more to come.
Info
Channel: Anyscale
Views: 651
Keywords: anyscale, ray, llms, ml-platform, fine-tuning, genai
Id: xw-F6Nk-KdE
Length: 45min 53sec (2753 seconds)
Published: Mon Jun 24 2024