Monitoring LLMs in Production using OpenAI, LangChain & WhyLabs

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

[Music] hello everyone I think we should be live give me a one second to get set up here cool and I already see some people are watching it's always good for me to double check just to make sure you can hear me and see me and uh I'll go ahead and share my screen as well and if you want to let me know in the chat that you can hear me that'd be great and we'll get started here in just a minute and while we're waiting to get started and more people to roll in um if you want to say hello in the chat that's always fun and maybe where you're watching from and if you want to share a little bit more about yourself also share why you're interested in um monitoring large language models or what you're building with llms or if you've ever used an llm before or or Lang chain today we're going to be using link chain and openai and Link it and while we're waiting actually I'll share some links to again we'll get going here in just a little bit um while we're waiting a little setup piece if you're watching on YouTube all these links should also be in the description below so also if you're coming back and watching the recording and you're looking for these links the easiest place is going to go um to get them is going to be the description but I'm going to share some of the chat right now while we're waiting just a couple minutes before we get going one we're going to be using our open sourced um Library called link it today and a little bit of the Y Labs platform to look at those profiles kind of over time for monitoring so if you want to follow along on the workshop piece you should create a free ylabs account there's no card or anything required so it should just take less than a minute I think you just have to verify your email and you'll be good to go and then also if you want to check out our open source library that we're going to be using a lot and Diving deep into today check out link it um on GitHub and we always appreciate a star as well and then I'll go over this again when we actually get to the coding portion um so don't worry if you do miss this link this second but this is going to be the collab notebook that we'll be running through today so if you want to go ahead and grab that code you're more than welcome to and then also I won't be reading messages in here during the workshop but if you want to stay connected after the event I definitely recommend joining the slack channel it's going to be a good place to ask questions in and connect with other industry professionals and if you're trying to you know use anything that we learned today and you want to ask questions about it on your own project the slack channel is going to be a good place for that and then I'll just share these real quick in LinkedIn as well because I am I only posted them over to YouTube stream streaming to YouTube and Linkedin right now so give me one second and then we'll go ahead and get going since it's the free count again link GitHub and again if you're watching on YouTube all these should be in the link or in the description as well and again for people just joining if you want to say hello in the chat and maybe where you're watching from and if you want to add a little bit more details about yourself why you're interested in monitoring large language models or maybe what you're currently building with large language models and what you're what you're using would be fun to hear about and it looks like we are streaming in all the right places someone said they're watching from Dallas and I'm uh I'm in Seattle and they are using blank chain and index GPT for summarization very cool someone said they're watching from London they're a software engineer an AI Enthusiast and experimenting with LMS very exciting it's a awesome uh space to be in I mean there's so much stuff you can do and build and experiment with right now it's really exciting so I'm going to go ahead and present oh one last link I'll share real quick is um the YouTube link on LinkedIn in case people want to stream over there I know um sometimes people want to go directly to the YouTube stream all right so now we should be good to go someone said they're watching from New Jersey very cool I'm all right so this is monitoring large language models in production with Lang chain and openai and Link it and um uh the quick agenda is going to be we're going to do some quick intros a little bit about setup and then we'll talk a little bit about large language model uh pain points and and why you might want to monitor them and how we'll do that with link it and then most of this hopefully is going to be Hands-On examples we're going to be using a Google collab notebook and running code today so we're going to quickly go through some slides just to kind of catch everyone up with all the concepts if you're not familiar with them and then we'll um go ahead and mostly do Hands-On today so feel free to ask questions as we go in the chat as well I'm kind of monitoring it sometimes it gets really busy and I'm not able to see everything so feel free to ask that question again if I didn't answer it and also I shared the link for the slack Channel you can stay connected in the slack Channel and ask that question later as well or DM me in there a quick introduction about myself my name is Sage Elliot I'm a machine learning and ml apps evangelist at ylabs we build Tools around Ai observability and are basically on a mission to make AI more robust and responsible you can check out what we do at ylabs.ai and for over the last decade I've worked in hardware and software uh mostly in startups and sometimes with agencies around Seattle and also Central Florida often there's someone in here from Florida um and then if you want to stay connected with me the best place to do that is on LinkedIn you can feel free to ask me questions there as well um or in the slack and in general I would just love making things with technology always experimenting with side projects as well which sounded like some of the other people are also doing that and someone said they're in Seattle awesome and and uh I'm in Seattle now so that's where I'm streaming from a little bit about you so I saw more people join in feel free to say hello in the chat where you're watching from if you want to add more details please share you know what you're building with the llms or maybe you're just getting started if you have any favorite libraries anything you're working on it's always really fun to hear a little bit about the audience and like what they're doing and then again if you want to stay connected later or connect with each other you can do so in the slack Channel and there's an introduction slack channel so if you want to make a more permanent introduction that might be a good place for it foreign these but I know more people came in and joined in so I don't want you to miss out go check out our open source Library this is where we're going to be focusing on a lot today it's called link kit it's built on top of our other open source Library called why logs we'll talk about them in a little bit and then again we'll be using our platform piece for a little bit of the Hands-On part today so you can go ahead and create the free account there's no card or anything required I think you just verify your email and you should be good to go and then we are going to be using Lang chain and open AI today so if you um don't have any open AI API account you can go create one there and then this is going to be the collab and again I'll share this again when we get to the code portion but I know some people like to open these up and get a head start on it so feel free to open up any of those links including the Google collab notebook which contains all the code examples and um if you like what you see today we have a promo code where you can get our expert tier in ylabs um free for 30 days and get a lot of the extra features that are specifically really cool for llms like monitoring in hourly buckets and again we'll see why you might want this a little bit later but there's a link to a forum you can fill out there or scan this QR code and I'll share the link again later um all right so what is AI observability and ml monitoring and specifically around large language models but I'm going to cover a couple of the concepts that maybe uh or they definitely do apply to other types of ml models as well so and you might be familiar has anyone done uh like ml monitoring before um in their models in production and specifically maybe around large language models and someone said they're working on robots as an application that's fine we should connect if we're not already connected because I also love working on robots all right so AI observability kind of at a high level looks something like this you'd have your production pipeline where you're doing like model inference training Etc and then you're gonna have some sort of tool that collects um AI Telemetry so that can be you know different things about your data that could be how you know in this case for large language models you can think of this as probably being the prompt and response and extracting specific metrics out of those and then you're often going to host that Telemetry in some sort of platform here where you're aggregating those statistics together and then with that you can do things like forecasting query that data monitor it and then detect when something is either drifting or meets a certain threshold that you set um so you get those output of like reports and dashboards alerts and notifications or triggers and workflows so if you have um you know a machine learning model and data drift occurs so you know the data that's going into your model in production no longer really matches the data that your model was trained on uh chances are your model might not be performing very good on that data and you can trigger something like a workflow with pagerduty which is an integration we have in ylabs and that could kick off something like an automated data annotation job then you take that newly annotated data retrain a model to play and ensure that it is performing better in production so if you have a well orchestrated or or designed mlops pipeline there's a lot of cool stuff you can do when something looks kind of off about your data or definitely some sort of error is happening you can trigger things to make it easier to kind of retrain and to play those models in production so why do we need Ai observability and again not just for large language models but you encounter things like data drift like I talked about where the input data no longer is really matching the training data and chances are your model is not performing as well you have things like concept drift where your model no really matches doesn't really match the real world outcomes anymore bias and fairness you know is your model you know making predictions for one class um much better and and and not doing well in other classes and then you obviously want to make sure your business kpis are being met and then with large language models again we'll go deeper into the into the specifics here but you have things like jailbreaking and security is a big concern that comes up a lot but basically most of most of the time you're kind of selecting a metric and then monitoring that over time and maybe you have a threshold set on that or your kind of automatically figuring out a threshold with something called like data drift um detection well and again we'll see this in action here in a little bit but we have a saying bad data happens to good models you put a model in production uh chances are you know the data is going to change at some point and and your model is going to do something that you weren't accounting for and just a hammer home kind of how you want to analyze uh change in an ammo application so one we know it's a machine learning application you have your ml Ops pipeline or your machine learning you know and data pipeline you want to establish where a change can occur and again with large language models and with more traditional models what I would say usually you want to look at is the input into the model and the output of the model if you have other statistics like ground truth and you want to calculate things like accuracy or F1 score those are good things to track as well but a bare minimum if you're tracking the direct input to a model and an output often you can catch a lot of things that you know could be going either wrong or maybe you actually want to track if something's going more right because you're changing like prompts like your system prompts which again we'll see in action here in a little bit and you want to improve those prompts over time you could be looking at you know sentiment score from your response on your large language model and that might be the metric you want to look at here and you know that a change might be occurring so you can select do you want to compare this change to training data previous data or a moving window like what is what is sentiment for our prompts look like in the last 30 days and then you want to figure out how to measure that so is that a distribution distance is it like missing values is it again like a sentiment score um and so we'd select what we want to measure and again we'll see all this in action here in a little bit but kind of tying it back into large language models specifically you're already probably pretty familiar with them it sounds like a lot of people you know are either experimenting or actually building things with them right now um you know you can use them to build agents chat Bots summarization q a those are kind of the top four I usually see I'm sure there's some other applications out there if you want to share what you're working on specifically definitely throw it into the chat um and you know we have open AI is what we're going to be using today I also do workshops with hugging face so if you are using hugging face I think in a few weeks I'll actually be specifically using like the transformer Library if you want to see that in action kind of similar to what we're doing today but with a different Library all right so what are some of the common pain points with large language models and again I would love it if you throw in the chat any pain point you've experienced with a large language model it would be really interesting to hear about you often have irrelevant or inaccurate responses um sometimes the responses can I don't know just you know make stuff up I'm sure we've all seen this before or it's not um giving you a good result when you're doing prompt engineering so most of the time and we'll see this in action with link chain today where we're going to give it a system prompt telling our model how to kind of behave right um and most people are probably doing something like this right now um a lot of times when I ask people like how are you choosing the prompts I mean you know you can choose one off intuition do some testing make sure it's good most of the time then people are changing them in production a little bit later because they think it's going to be behaving better um but there's a lot you can do to monitor that and it's kind of hard to track changes over time so most people and it'll be cool again if you um have done this in the chat like have you put a model out in production do you have some sort of system prompt and then you know maybe it's months later a week later or something like that you tweak it a little bit because you think it's going to be performing better I'd love to know how you validate that and there are different ways to do it and we'll see some of them in action again in a little bit here but it is often when I talk to people right now kind of the main driver of most large language model apps Behavior because most people are using GPT uh 3.5 or 4. I think they just came out with fine tuning on some of them but for a long time they didn't have any and so you'd really be you know using this prompt layer to change the behavior of your model and I still think it's one of the most popular things and then also um your output validation so validating the response you know is good or a big thing right now is kind of security um layer or verifying that your outputs don't contain information that it shouldn't so that could be things like some sort of pii in a healthcare application or um you know fake phone numbers I've actually had this happen to me where you know my model was putting out a phone number and I never told it to do that so it was just a random phone number and definitely not anyone that not any number I would want someone to call from my application um so how do we solve these problems uh we can set guard rails we can evaluate our model in production or and and prior to production you know while we're building it we can compare those prompts together we can improve our prompt engineering we can actually do model comparisons so if we are fine-tuning models you know we want to make sure we select the best one or the best prompt we can have multiple those out in production and basically see which one is working best in our production data and then we have General observability so you know like how is uh user sentiment or prompts happening over time or what is our our response sentiment overtime readability there's a whole bunch of different stats that we can collect and again we'll see what this looks like in action and a whole bunch of stats that we can use out of the box and then how to add your own custom metrics um but at guardrails you know if you've been using gpt4 you've probably noticed there's been a lot of talk and maybe you've experienced it where the behaviors kind of changed and a lot of people think that's because they're adding more and more guardrails because people you know they want to try to be very responsible with their AI application and for example like they don't want it to create malware so if you try to ask it to create malware you know they have something that monitors for that uh they have a guardrail in place and it you know says oh you're trying to do something bad with this model I don't want to do that so let's block it um and for example like you might have a tax bot and maybe it shouldn't be giving medical advice to people and so um solving this at scale um like I mentioned we're going to be looking at the inputs and outputs today we're going to be using the library link it the uh our open source library and basically looks like this you have your prompts it could be any large language model today again we're going to be using open Ai and link chain and we'll get the response from that and then with that we're going to track things like quality sentiment security and just you can enforce all the things I talked about you know that could be response quality Pia leakage toxicity both from the prompts or response Etc and so it looks like something like this you'd get your prompt coming in the user prompt and then it's extracting these metrics out of the box we have a whole bunch um around I think it's like 16 for the prompt and response and then a similarity score and so we'd get you know response relevancy um has patterns so that has patterns by default has things like does it contain a credit card does it contain email does it contain a phone number and you can again add your own custom metrics it's very easy to actually add a custom metric with a user-defined function so if we don't have the out of the box metric that is important for your large language model you can go ahead and add that in and we'll see an example of that today and because we're just extracting these kind of high level metrics about the text both on the prompt and the response here these profiles which we'll see what they look like in action but it just creates this kind of a profile of these extracted metrics and it's privacy preserving because it doesn't really contain the raw data anymore it's not the full prompts or response says it's just these extracted metrics and so this is a really cool feature for a lot of Industries like fintech and Healthcare where you know you might not be able to store or pass the the raw data out and again we'll see this in action here in just a second but just want to say it's easy to use you just pip install link kit in your python environment here's an example with Lang Lang chain the lane chain integration again we'll run code here in a second if you want to stick around for that and then we can extract those metrics and track those over time or gain insights out of them so like this is a has pattern example here where it could trigger like if there is a phone number if the data set's in Balance we can look at that we can also see things like the sentiment score toxicity if they're really high or low we can say Hey you know this is a really negative sentiment from your response maybe it shouldn't have been maybe it should be more positive and you can go tune your system prompt or fine tune your model on that and so we'll see all this in action here enough slides so people want to go ahead and run code open up this notebook and if somebody wants to let me know in the chat that they can open this up always get a double check as well the sharing should be on but just occasionally you know something might happen there so if somebody who is actually going to be running the code today with me open up that notebook and just let me know that works that be that would be great and then if you haven't already for part of the notebook we're going to be using the Y Labs platform we're going to be using the open source library and a little bit of the platform here to see everything in action so you can go ahead and create that free account as well if you haven't already and then again um we'll see it in action here but if you want to fill out this form get a promo code of using our Enterprise version you can fill out this form as well and then if you didn't want to run any code today there's kind of a demo org you can check out but I definitely recommend if you can open up that collab notebook and get ready to run some code when I think it should be pretty fun so you should have a notebook that looks like this and again if somebody wants to let me know that it opened for them that would be great once you're here you probably want to save a copy in drive this is going to create your very own copy that you'll be able to edit and do whatever you want with and then you can always just bookmark that original link if you want to come back to it so don't worry about breaking anything today and catching up a little bit on some chat messages here awesome a couple people said the notebook opened thank you so much somebody said um not for large language models I think this is when I questioned uh where I asked a question if people had you know seen any um issues with models and production someone said not for Realms but testing collecting handcraft metrics someone oh here's one someone said Upstream schema changes that break our ingest pipeline yeah so that's like a very common thing where you know you might somebody might update a library somewhere and then it changes a schema and your data looks a little bit different and you just weren't accounting for that in your model or future extraction and it can uh break your your pipeline awesome and I'll zoom in here a little bit too let me know I think that should be good for everyone but let me know and again um here's links to everything if you need them up here as well so if you want to go ahead and create the account or open up a openai account you can do that and if you've never used um Google collab before it's kind of Google's way of hosting a Jupiter notebook which is a popular tool in data science and machine learning where you can run code write documentation like you'll see that text around and see outputs kind of all in one visual place so it's a really cool tool for experimenting and and you know seeing what's happening and to run the code we just hit this little play button next to the code cell the first one we run is going to take a few seconds because it's actually initializing a whole little environment for us here you can also run code cells by um hitting shift enter when that cell is selected and then I'll go to the next one so you'll probably see me do that a lot and not um hitting the little play button and I saw a chunk of people just join if you want to catch up we're just getting to the code part and uh the the link to everything is in the description if you're watching on YouTube so if you're on the YouTube stream you can open up the description and the collab notebook link should be one of the top ones if you want to grab that and then just hit a save a copy and drive and then you should be good to go all right so let's go ahead and do a little bit of setup we're going to pip install link it and link chain Has anyone used um openai before or Lane chain so open AI most people probably have used in some way whether that's chat gbt or actually through an API and that's what we're going to use today to basically create kind of you know some form of chatbot and link chains a really handy kind of wrapper around that where you can do a whole bunch of different things one of the things we're going to be using today is something called a template so we're going to make a template to kind of tell our model how to behave but linkjain has a lot of cool features around large language models so my uh pip install finished and I just hit that little x and x all the output so it cleared it up awesome and I see people said yes and yes to both so it sounds like yeah I mean like chain seems very popular um and obviously open AI is extremely popular as well so let's look before we create our chatbot thing let's take a quick look at kind of these metrics that that get extracted with blanket the open source Library so I'm going to run this code cell the first time we run this LM metrics and it might take a few seconds because it's downloading some packages behind the scenes but the first time we run it only it should take this amount of time otherwise all those things are now downloaded and we don't have to worry about it again but we're just importing l a metrics from link it and then we're importing y logs as Y and so y logs is the library that link has kind of built off of it's our other open source library for logging data pretty much of any kind and it creates these statistical profiles and we'll see what that looks like for language metrics right now but you can think of it as you know for tabular data if you just had numerical tabular data it would contain things like them in the max the mean Etc and just with all these um kind of statistical summaries over your data set you're able to do things like monitor for data drift and all the things that we had talked about so now that it's done downloading these kind of behind the scene models here to extract metrics we're going to load in some sample chats to see what these metrics look like and then this is actually where we're doing the logging so um this is just kind of loading in uh some toy chats here to look at the metrics and then here we're calling y DOT log we're passing in the chats and we'll see what the chats looks like in a second but in this case it's just a data frame but you can log any language metrics here as long as you pass in a data frame or a dictionary that is basically in a format with a prompt and a response or just one so you could like just prompts and just responses and again we'll see this all in Action a little bit further here as we break it down and then we're just passing in the schema and the schema is what we initialized here with our elementrix so now we have a profile and I'm just going to import pandas and we're going to set it to Max column so we can see what everything looks like in our Panda's data frame this is the example of the chats so here we have this prompt and then the response and we can see what our profile looks like and it contains kind of all those things I talked about in the slide like your aggregate reading level the character count difficult words jailbreak similarity has patterns all these things are now kind of logged in here so we can also see that the cardinality estimate is um 47 so out of those 50. we can also see there's 50 prompts in here out of those 50 47 of them are pretty unique and if we scroll we have all these other metrics like um uh let's actually pick one with numbers here so we could just say aggregate reading level this first one down here we can now look at the distribution of it so like this Max mean median Min Etc and then quantile distribution of all these different statistics out of the box and we have some other things like was it a type of string the prompts were right um but we have all these things like reading level uh prompt sentiment and then basically you'll see all these for prompts and down below you'll see all the same things for responses and so these are out of the box metrics built in again we'll also show how to add custom ones so like we have something that we use to calculate jailbreak similarity um but you know right now if you've ever really looked into this there is really no one silver bullet for every large language model application right now so it might be pretty common where you want to add in your own metric on top of these ones so maybe you want to calculate your own jailbreak similarity score or different type of similar similarity score you can do that and we'll actually see an example of that a little bit later in the notebook using a vector database as well so you can you know have a whole clump of Json from something create a vector database and then calculate a similarity score and log that in here so we have a little preview of the types of metrics we're extracting and what they kind of look like in link it so let's go deeper in and see how we can use these to monitor our data for like text quality relevance and security and privacy all the things we already kind of talked about in these slides so let's set up our kind of chatbot application now so this comes at this point is where you want to get your open a AI API key it's kind of a tongue twister openai API key so I'm going to go ahead and create one for this Workshop here this is in my openai account not nothing to do with Y Labs here so if you have one you can just go ahead and create this right now if you don't have an account right now you could create one or there'll be a little bit more in the notebook you can run later without the open AI or you can just watch what I'm doing right now so we're going to go ahead and set this as our opening key in our environment here and then this is uh using Lang chain again it sounds like a lot of people here have probably already used it I know it's a really popular wrapper around setting up LMS here we're going to be using something called a template so we're basically going to pass in a message to our model every time it's run telling telling uh you know what we're expecting it to do or how we're expecting it to behave so in this case we're going to start out and say that you are a helpful assistant that rewrites the user's text to sound more upbeat and happy now if you're following along here feel free to adjust this and see what happens I'm going to go and Show an example during the workshop of it being positive and then going more negative but if you want to go ahead and play around with this template you're more than welcome to and I always encourage you know playing around with the code and seeing what happens so if you want to say um you know any other type of message or system prompt to your llm definitely feel free to change that right now but I'm going to go ahead and run this so we're telling our model kind of how to behave again this is super common if you have done some sort of LM application before you've probably done something like this and whether it was with link chain or something else and then here we're initializing our large language model from openai I think by default it's the 3.5 a GPD 3.5 and then here we're just going to create a little function and we're going to pass in a prompt we're going to pass in the template the reason why we're passing in the template is because we're going to use this function later on when we have another template as well and then we could compare how those two templates are performing in production like I was saying you know when you change this in production it's very common that you're going to start with something here and then maybe a month later or weeks later you want to change your system prompt to improve how your model is behaving and how do you monitor that like I talked about you can extract those metrics and then monitor the over time between multiple prompts so especially with prompt engineering is way cheaper than you know fine tuning all you're doing is really call an API twice in this case or you could do on like five different system prompts and then you could run those five in production for like a week or a month and see which one out of those five different prompts gave you the better responses or the the better metrics you're looking to improve and then choose that model and actually deploy that in production I think that's really going to be the future of how people are going to be monitoring system prompts because right now when I talk to most people it's kind of like well I changed it and I think it's better based on a few examples I've looked at but I actually don't know how it's performing in production on real data and if you have a different way of looking at it I would love to hear in the chat because that's something I'm always talking to people about and looking to improve but so I think having multiple of these kind of things and uh you know adjusting them in production or looking at like the top you know 10 or something or Top Model out of 10 um it's kind of going to be the future of creating better large language model experience for a lot of people so that's why we're going to pass in template so that we can do this with another template and then choose the best one in production and then it's going to return a dictionary that contains the prompt and response that I mentioned with link it you can pass in a dictionary or a data frame but it should be in the format of a prompt and response or just one of them but it should still be a dictionary format there so let's look at what this did with our template we're going to get one response and not log it with link it yet we're just going to look see what it looks like so it says I don't want to work today um it's supposed to make it way more upbeat and then here we have our dictionary so this is what we're going to log um I said I don't want to work today was the prompt and the response is I'm feeling a bit unmotivated to work today so that's I guess a little bit more upbeat way of saying I don't want to work today let's try it again real quick too because I think it should give slightly different ones probably nope so you could change the template here and get different responses for sure all right so now that we have a function that is outputting this dictionary with a prompt and response in the format we want uh we can extract metrics like we did before um with link it but let's actually do it on our um on our function now so we have this prompt and response that we created here we're going to call after initializing language metrics again and you can see this time it didn't have to download anything because it already did that before so now we're going to profile that prompt and response again all we're doing is passing in that dictionary format here and then telling the schema which in this case is defined here we're just using the out of the box metrics and then we're going to call profile view and look at that in pandas and that's going to give us kind of what we saw before but now in our specific prompt and response generated from our openai and Lang chain function here it's all those different metrics we talked about I kind of like just for showing this off like looking at things like prompt sentiment and response sentiment right that's how positive or negative it is and that's a really good metric to kind of just look at in general but also just for showing it right now it's easy to kind of understand so we can see that in this case The Prompt is uh slightly negative but not super negative I think it you know goes from like zero is neutral and you know up to one I think is incredibly positive and up to negative one is really negative negative so we're gonna look at prompt there and we can also look then at our response sentiment which in this case um is you know a little bit more neutral as well even though we told it to write it upbeat so maybe we should adjust our template to improve this uh prompt sentiment if we wanted to so we can also add multiple profiles together so here I'm just enumerating through these three examples here and it's rewriting them all uh with our GPT model and then we already had that profile created above and I'm calling this dot track on it and that's just adding these so now we should have four profiles together and we can view the aggregate statistic for all those profiles now so this is pretty common we're going to batch these in certain time periods whether that is days or hours and then look at these aggregate statistics so here's what the aggregate statistics looks like four four of them you can see the count was four and each one was pretty unique so the cardinality is four and then we'd get all those um different metrics we looked at already let's look at that response sentiment again and we can see in this case now it's mostly positive here so it's actually um if we looked at this is the max the the maximum one is actually a very positive score so um now that we have all these in our data frame uh or the profile here we can also look at how to extract and like use these different metrics and again we'll be going into other examples of this further down in the notebook but this is a way of kind of accessing that data that is stored in our profile so we're converting it to a profile dictionary here we're saying we want to look specifically at the aggregate reading level in this case and then we also want to look at just the distribution Max but we could grab things like the minimum or the mean Etc and then here I'm just printing it out but we'll see a little bit further in the notebook how we can use this to set up a guardrail in our local environment and say well you know if the jailbreak similarity is very high don't pass it over to our large language model or if the response toxicity is pretty high you know don't pass that response to our user so let's uh um look real quick how to get kind of more value out of these profiles where we can monitor them over time we can visualize them and then also we can set up monitors so if something does pass a certain threshold or does drift quite a bit we can go ahead and have an alert trigger and that alert can either just you know let us know something looks different and we should go look at it and that could be like email and slack or like I mentioned you can set up an integration with something like pagerduty and actually kick off maybe a more automated job in your ammo Ops pipeline so to do this we're going to go ahead and set up the Y Labs account and hopefully you already made um the account by now if you want if you're following along and I'll grab that link for anyone who didn't yet you can set up here and again there's no card or anything required and you get a couple free models and then there's that promo code if you want to get some more Enterprise features here but here's what I'm going to do is uh once you log in your um dashboard will look something like this you won't have you know probably these numbers in here yet because I've been playing around with it what I want to do is create a new resource though and you might have a couple test models in here from uh kind of getting started anyway go to create a resource that's going to bring us to this model and data set management page I'm going to create a new model but first I'm going to go ahead because I'm I'm on the free plan here as well I'm going to go ahead and delete a couple of these models I'm going to delete these two test ones so if you have um kind of a default demo model and it's saying you're over the limits you can also delete them as well so give it a name I'll call this um Lang chain Workshop you call it whatever you want and resource type I'm going to select a large language model and I'm going to hit add model or data set and now down here I'm going to have this model ID and I'm going to grab that so 270 is mine yours is probably going to be like one or two I'm going to go back to the notebook and paste in model ID make sure that you don't accidentally include any new line spaces or spaces in your string here it should just be a string and then from the same page I'm just going to go ahead and hit access tokens create a new one here again you call it whatever you want I'll say link chain Workshop create access token copy this token over I'm going to put that in the API key area and then my org ID is actually attached to the end so you can just copy this here and paste it into the org ID there or you can always find your org ID on the same page as well org ID here and org ID will be populated here so I'll give everyone a minute to get their keys and let me know if you want me to go over that again um no matter where you are in the platform you can always go to this hamburger menu go to settings go to model data set and management type in create a new model name here once that's created grab that model ID paste that in and then from the same window just go to access token type in a token name and then you'll get a token there and the org ID is on this page as well so I think there is a question and I'll get to that well I give everyone a minute to get their ID set up and then it should look something like this when you paste them in and also make sure you put them in the right place so API key should be in this one data set ID should be here and org ID should be there so instead what do you think are the metrics folks should be most attentive to Beyond sentiment perhaps more macrometrics like time to solution resolution success rate user satisfaction great question and again I think it depends on the application I don't think there's one single metric at least from what I've seen that you know Works super well even the stuff I've seen talking about like prompt and and response relevancy like we calculate a score on that as well looking at embedding space between two but even that depends on the application and and the way you do that may not work as well for for your specific application like from papers I've seen people will be like well we calculated scores and uh they just don't work that well so it depends on on what you're trying to do but I like what you said too like you can also calculate um like time to resolution so that could be maybe like how many people are prompt or how many times is someone prompting it or how long does the response rate or response take to get to the user that could be something you log and then I like your resolution success rate that could be interesting how you measure that is that when someone leaves your application or when someone clicks that thumbs up kind of thing like if you're really prompting someone to say hey did this model work for you or not yes that is something that you could log in with a custom metric someone said is it typical to aggregate micro and macro or leaving lagging metrics into an aggregate view of the system I'm not sure what you mean by aggregate in this case so like if you're aggregating them all into one platform view which we'll see here in a minute yes I think it's good to kind of be tracking all the all the different things you think you want to be looking at if you're talking about putting them onto one single value um I'm not I'm not sure yeah I think so far it's like from what I've seen it just depends on the the type of application you're using so you know if you want your again we're using sentiment today it's kind of an easy thing to see and understand uh but it could be like you want to make sure your model isn't outputting any sensitive data like I mentioned that's a big thing when I talk to people right now is like making sure that your models aren't basically leaking any data is a big deal I think that LM security is going to be a really big field in the next several years and looking at stuff like that's going to be really important um and yeah whatever your custom metrics are too around like jailbreaking or like if you have something that does a good job of showing the response and prompt uh relevancy like again we have an out-of-box score from it but you might want to adjust that with a custom metric um yeah I think that right now it's like it's a good question it's like what do you Monitor and right now if you don't know I'd recommend monitoring as much as you can and then maybe adjusting that later if you're like I actually didn't care about readability score uh sometimes you might though you might want to make sure your response is always um giving you know readable responses all right how to bridge the ml performance side with the product and business performance yeah so you could log all those things I think into one thing so like if you wanted to do the latency of how long is it taking to respond or something um you could definitely have that into one platform but you could also um um log them in a separate project if you wanted to as well and Dean asked what was the code to get extra features if you're talking about the custom features it's going to be down below um so if you're not seeing for the whole thing you can definitely look uh further down in the collab notebook and you'll see a section that talks about adding custom metrics and in this use case we use a vector database but you can do something much more simple as well so let's go ahead I'll keep running through the code in so I'm going to set my environment variables here and then um to write to ylabs we're just importing the ylabs writer and then we're importing link it again here again because we imported this earlier in the notebook we probably don't actually have to do it but this is what it would look like if you're taking this and setting up just the whileabs a writer for yourself so we set the schema just like we've seen before so that's going to tell us how to log it and then here we're creating our Telemetry agent called or from ylabs writer this is just the profiling like we've seen before but instead of writing out our profile locally which it does create a local version of it but now we're going to write that up to ylabs with the API Keys we've provided so you can go ahead and run this on your single profile and it says true and then we'll go ahead and look in our ylabs account and see that single profile but I'll get more interesting when we add multiple profiles in here and we can actually see what it would look like in like a Time series View so I'm going to go in here um might take a second to upload cool so I refresh and I can tell that it uploaded just because I can see there's 33 inputs now so like I was saying that's uh a whole bunch for prompts and responses and so we can click into this project now we have a little dashboard here this gets more interesting with the more data that gets uploaded of course we can see this little profile Tab and again we don't have much data in here right now so it's not giving us a little chart here but it's a cool way of kind of um looking at your data and we'll see when we upload all of our data what happens here but let's go look at inputs real quick so these are all those metrics The Prompt and responses are all kind of in this one tab here right now we actually have an update coming out soon that's going to make this uh broken down a little bit better which I'm really excited for but let's go ahead and look at um that prompt sentiment one and we can see that again this isn't maybe the most interesting thing right now we just have one profile in here but for that one prompt and response that we uh um uploaded we can see that this is the score for prompt sentiment so this is the one that was like oh it's kind of slightly negative or pretty negative actually maybe because I think it's all the way to negative one and so 0.5 is like pretty negative there and then let's go back to our notebook and we're gonna run this next code cell which this is uh seven different lists and each list contains I think it's three prompts here and what we're going to do is kind of um simulate this being in production for seven days and so we're going to change the date time for each of these lists and kind of write three prompts for each day so I'm going to import date time here and all this is doing is the same thing we did above except for it's enumerating through that list and then it's subtracting the day from each time it enumerates through the list and then we're just overwriting the um date time stamp with um that y logs profile and then we're writing it up but everything else is really just these few lines of code you have your prompted response you're logging it um and then we're profiling it and in this case uh we're also running the um the GPT model for each one so we're passing in each of those prompts we're getting the prompt and response in that dictionary format and then we're logging prompt and response here oops I probably have a little bit of mistake here I think I can just uh delete this part and then it would log that prompt and response but I think this should still work but maybe our data will be a little bit off but I think if you decided this it's going to be in that prompt and response format there of that dictionary and it takes a few minutes or not even probably a few minutes it feels like it sometimes when I'm presenting it's probably like a minute or less in a minute because it's looping through each of those passing that to GPT profile on it and then um logging it up so now it's done we can go back to our y Labs platform and if I refresh I'm still on that prompt sentiment tab if I refresh this and now we can see what our prompt sentiment looks like over time in this time series view so this is again like seven days of data now so we're gonna see um and again I'll zoom in a little bit but if you're following along you should be able to read it pretty good you just look at the median in this case or something for each of these um prompt sentiments we can see one from like 56 to 46 to 56 to 38 to 54 and then it dropped really hard on one of these days and then went back up to that kind of like 60 50 median range this is really cool this gives us Insight right like kind of some of observability around how maybe in this case just this one metric how users are interacting with our product you know for some reason on one day it went really negative maybe we did some sort of product update and people were using this as a chat bot and they really didn't like that update and they were expressing that and we can see that their sentiment went way down but we could look again at any of these other metrics that we kind of talked about and depends on what you want to measure um in your application so this is cool but again maybe I don't want to always be um looking at the the dashboard every day or every hour and what I can do is set up a monitor and so by default with Y Labs on the free account you're going to get the daily batches so everything's going to be aggregated into a day with the Enterprise tier that I shared the promo code for uh you can do things like hourly batches so if you want to see kind of the performance over every single hour uh then you can do that there but let's go ahead and set up a monitor so from this page I'm going to hit setup monitors on this button here and then we have these presets we have more presets coming soon specifically around some LM stuff which is really exciting but for now let's go ahead and create this one for data drift on inputs and we're going to hit configure if you really want to you can get really into configuring everything with Json as well so you know if there isn't a preset for you here you can still probably do what you're looking for you just have to configure it with Json you can look at the docs for that for now let's hit configure on this data drift one and we're going to edit some of these items so we're looking at the input here and then I want to change this to a 9. what this is doing is we're using something called hellinger distance to calculate a drift between in this case we're going to be looking at the trailing windows seven days of data but this could be against some sort of set of like golden prompts or something like that or responses and saying you know when the data changes this much um trigger an alert and again we can do different things with those alerts like email a slack us or set up a pager 2D integration by going to point 0.9 instead of 0.7 I'm making it um less sensitive so saying you know only when a big drift occurs on however we're calculating this you know trigger that alert otherwise we want to make it more sensitive this could be a lower number so I'm going to use this seven days where we're just going to be looking at those seven days of data and Trigger when something drastically changed in those seven days but again you could be using something like a reference profile where that's a static profile that could be a set of like a common use cases if you have a traditional model you have a training data set you might want to be comparing all new data to that training data set and does a good job of knowing if the distribution of the input data doesn't really match what the model was trained on large language models are a little bit different I feel like people use them uh you know like you don't have it's harder to have a set to compare every input against but if you did have a training Corpus or something like that you could set that as a reference profile and look at the distance I guess of inputs to that which might be something you want to do and then here I'm just going to leave this default by now where it would email me when something goes wrong but again you could set up different Integrations like with the page or duty to trigger something in your pipeline so I'm going to hit save we'll go back to our inputs tab here and let's look at just that prompt sentiment again and then I can hit preview Now by default this is going to run every 24 hours so it would run in six hours and trigger an alert again preview though see how it would perform so we're gonna see on this day where it drastically dripped uh uh or dropped we got our alert triggered here so this is really cool too if you have some production data in here already or some um data to backfill like we did this can act as a really good way to kind of tune your model so you know when it's going to trigger somebody asked is this uh session recorded yes so if you're watching on LinkedIn it should be at this the link that you're watching later as well but I will also share the YouTube link again and the recording is going to live at the same link so yeah if anyone has to actually drop or anything like that hopefully not hopefully you can stick around for the rest of this um but if you do have to go come back to this link and you can watch the recording later as well it's going to live with the same YouTube link or the same LinkedIn streaming link so now we have our alert triggered and again like we're saying this would let us know by email or slack or however way to set up our integration I you know obviously like building a some some form of mops pipeline where hopefully you can trigger something to do something cool like you know annotate the data or something like that maybe not for the LM use case but for computer vision or something you could trigger a crowdsource data annotation take that new data retrain your model deploy that model in production maybe even compare that performance of the new model versus the old model as like a shadow deployment and then pick the best one that fits and one thing you could actually do maybe is so for if we're just you know using again sentiment As the metric here it could be any other metric is uh it could trigger this right and let's say this was actually we were looking at like response um sentiment instead of prompt one and we're like oh that triggered maybe that could actually trigger something in pagerduty that would automatically go and change the system prompt to try to make it more positive sounding so you can actually kind of use this as a little bit of a fitting function to adjust your um your system prompts and we'll see that actually uh we'll we're going to go ahead and look at another system prompt here in the The Notebook um and see kind of how they compare against each other any questions so far though we still have a fair amount to go through or you know like we'll go through pretty fast but we'll show uploading another model with a different template and then we'll show guard rails and then custom metrics but does anyone have any questions that I missed so far um around what we've seen or does it seem cool like a useful tool where you can now kind of monitor all these different metrics extracted out of your prompts and responses on large language models and see what's going on over time one feature that I didn't show an action yet is so now that we have more profiles we have data in our profile tab where we can kind of get some summary statistic statistics about them but also what's really cool here is we have this show Insight Tab and right now this might not look super interesting um it'll get I think more interesting with our next round of data potentially but here this is going to automatically kind of look at interest what we think are going to be interesting insights around your model so in this case it says the the mean sentiment score is 84 out of one which is real you know indicates that everything's been really positive and that's because we had that system template at the beginning you know where we said make everything upbeat and positive um this will also catch those patterns out of the box patterns like if a phone number or credit card came in that's actually how I found out um that my my model is generating a phone number is I looked at the profile here and I was like wait phone number and then I went back and tied it or like in and looked at my responses for that specific day or hour batch that I was looking at and I found that it was generating a random phone number how do you precisely impact something like readability score or sentiment any tips for prompt engineering great question we could probably do a whole course on prompt engineering I know there are whole courses on it but in general like if you ran this template before um I mean in this case we are definitely um changing sentiment because we're telling it to be upbeat and positive right but um if you know so like doing things like that saying you know your helpful assistant that rewrites users text to be sound more upbeat and happy um definitely really influences how well our mod or how our sentiment is on the responses and we'll actually see this in a second here so let's go ahead and create another model I'm going to go back to my y loves platform I'm going to go back to this home page by clicking on that ylabs icon I'm going to create another model I'll call this uh Lang chain two again selecting large language model type here add model I'm going to use this new number which is that 271 and I'll paste this in in here so now we have a new model that we're going to write to we don't have to do the API keyword ID again because those are already set as environment variables now here's a new template you are a unhelpful assistant that rewrites the user's text to sound more depressing and negative what do you think is going to happen here so it's passing the same thing I don't want to work and it says I am I am burdened by the thought of having to engage in labor that does sound a little bit more depressing I think then I don't want to work so let's do the same thing we did before we're just looping through those prompts but instead of um passing in the other template I'm passing in this new template to our function which I should rename this to be depressed or sad instead of SAS and so it's going to write all those profiles like we did before but now we're going to be looking at um I mean it's not a spoiler alert because you saw the template it's going to be way more depressing Chris said good afternoon from St Petersburg welcome Chris um and if you're just joining all the links to everything like the this workbook and stuff are in the description on YouTube so again this will just take a minute because we're backfilling profiles all seven days and if you go back to the platform we'll have our new model ing two and I'll give it um a minute here to finish up refresh and let me give it another minute I think I wrote I think I use the right model name right 271. very possible I made a mistake doing things live all right cool 271 should work um yeah and so someone was asking about like you know um how do you how do you precisely impact things like that I think you know doing those Pro System prompts obviously makes a big change knowing exactly what to do is really hard and that's actually why being able to extract these metrics and look at them is really valuable because you could be saying well I think this is going to make a better you know sentiment score for my model but then for some reason it doesn't um oh so I know I messed up because here I didn't run this so it actually just uploaded to my old model that's going to kind of throw off what I show a little bit here so let me run this again if you ran your code cell here uh like I did not and you got a new model and ran that it should have uploaded to the new model so I made a mistake here and the comparison won't be as good because now we'll actually be looking at uh both data in in one model yeah has anyone had um really good success with how they change system prompts for for their models one common way too is you might have a set of kind of golden prompts where you know it could be I don't could be any number but maybe of like 20 different prompts that you always want to be testing on and see what the metrics look like so again it could be sentiment it could be readability Etc one cool thing about that as well is I'm not even just for your system prompts but as We Know like we've seen where gpt4 performance changes um and someone else is completely managing that model not us right so we don't know exactly what they're doing you can be monitoring all these different metrics over time on you know that same set of system prompts and always expect a certain metric for each of them and if it changes drastically you might be like ooh like gpd4 did some change and now my model isn't performing as well let me go back and update my system prompt or fine tune it to try to make it perform better so it's actually a kind of extra important I think when um like you're you're not maintaining that model and you don't know exactly what the other people are doing with it someone that said they have to run to the next meeting but looking forward to digging in more of how this might be integrate into to a system awesome yeah thank you so much for coming out and again the recording will be here as well and you have the collab notebook so you can go run this later um all right so I'll go a little bit quicker here because I know we're are a little over time so I have this second model now in here and I did upload the um new data here so now we could look at inputs and we could go to the second page here which is going to contain our response sentiment and we could see that our response sentiment is uh let's see a little bit more negative probably than our other other model or it should have been more negative than our other model however if we look at the other model because I accidentally uploaded all that data into here it's not going to have that much of a change so but this is a cool way of like now you can see that it's all over the place of like all all these um you know Min and Max's kind of or the place where you can see this distribution because I had the really positive one and I uploaded the negative one here but if we didn't do that we'd have this positive kind of really positive metric and then we could look at oh our other one since we changed the prompt obviously in this case it was kind of a contrived case where we purposely made it more negative we could be looking at that metric over time between two different system prompts or like I said it could even be like 10 different system prompts and you can really optimize for that whatever metric is that you're trying to do by viewing them in the platform all right so um I'll go ahead this shouldn't take too long to go through the rest here people are sticking around let's go ahead and look at how we can use those metrics extracted from linkit locally for guard rails so here we're going to initialize our metrics just like we did before and then we're going to create a quick little function where that function is going to do is it's going to get the prompt The Prompt only so we're actually only logging The Prompt here and then we're going to extract metrics from it like we saw before and then we're going to get just the toxicity score and we're going to give the max toxicity score here and then we're basically saying you know if the score in this case is above 0.5 we're going to say that is toxic so the function is called is not toxic it will return false that means it is toxic otherwise it'll return true so we have this function let's pass in a phrase again feel free to experiment with these as well so these are just pre-defined ones but if you want to have a little bit more fun feel free to change these strings to something else and in this case see where the toxicity score is so I said you know do you like fruit the toxicity to that question is extremely low 0.00014 and then if I say you dumb and smell bad the toxicity for that is really high so again this is kind of fun to see like you know what um what phrases or uh cause things to go up and down so in this case toxicity uh definitely feel free to play around with that so here's kind of how we could tie this into our model um this is a really simple use case of a guardrail you could set this up very differently but here we have that prompt do you like fruit we're going to say um you know if it's not toxic then go ahead and put that prompt into our model and return the response so it wasn't toxic it was that same question about fruit and it gave us a response this case I'm saying you dumb a smell bad that's extremely toxic we're going to even skip passing the prompt into our model and we're just going to say as a large language model dot dot dot and then it's up to you really how you want to use these types of guardrails though so for example in this case we skipped passing the prompt into our model at all uh sometimes maybe I'd actually want to go ahead pass the prompt in just to see how my model would behave so again I could extract those metrics or look at the response later but maybe I'd only pass you know the string as a large language model to the user but that might give me a little bit more data if someone is trying to jailbreak or be more toxic to my model what is the response from that maybe I want to see that data and then you can tie these types of guardrails into much more than just like a toxic prompt so you can you know extend this into all those security use cases like you can set up regex and look for you know specific things um you know like some people are sending guard guardrails internally at their company to make sure no sensitive data goes into any model that can be really tricky but you can set up things like you know um regex looking for a company name or something like that in there or any sensitive data and then also all the things we talked about already kind of repeatedly credit card phone numbers any type of pii you could create um some sort of guardrail like this you know does it have a pattern yes don't return that response to the user so in this case we also just looked at the prompt coming in but you could do the same thing for guardrails as responses right like if the response contained a phone number maybe we don't want to get that to the user or if it contained a phone number and it didn't match this specific phone number that we're expecting you know don't give it to the user so there's a lot of cool ways of extending that kind of into security so one of the last things we'll run through here is how to use a UDF or user-defined function for custom metrics so again like I mentioned people are asking questions like what's the best metric Etc um I don't think there's one to you know rule them all but we have a lot of the out of the box ones which are really great to kind of get started and I can tell you a lot about how your models are performing if you do need to add any custom ones you can easily do that with a user-defined function and we're going to actually look at a use case today that um is kind of similar to like a lot of of what a lot of people want is using a vector database we're going to be using phase which is a Facebook's Vector database and what we're going to do with this is pass it some Json like several pieces of Json here and basically we're going to then look at any input and see what the embedding space is between um between that Json that we had given it in this case we'll use it for like jailbreaking but it could be any other type of custom metric and again we'll see how to do the custom metrics where if you don't want to use the vector database it doesn't have to be as complicated as we're going to see the example today as well um so we're going to import import some stuff from the hugging face Transformers Library here this is basically to get our Vector space um so if we run this again this will just take a minute we're seeing a distilbert tokenizer from hugging face and then here is where we're going to pass in several lines of Json or or sorry several strings you could basically you know like in this case it's kind of just a small set but you might have a um like if you have a model in production and you've been running it there for like a year you might have a really good kind of data set of like in this case jailbreak examples like when you've noticed someone's been trying to make your model perform a way that you don't want it to you could be saving those examples and then use it like we're going to use today with the vector database to then see kind of any incoming prompts how similar are they to this data set basically is where we're going to be looking at I'm here they weren't going to index with the vector database and we're basically just going to be getting um this kind of uh Vector from it and here then we're just going to calculate a distance in this function so we're going to take the query we're just looking at one in this case and we're just going to kind of extract a distance from here and then so if we run this now we get this distance score so basically uh this is very close I think if it's zero that means it's exactly you know matching to to the data set in there um this is a very close score the higher or the the higher negative number this is the further away it is from the strings that we passed in so again like this is kind of a jailbreakie thing where I'm saying it to ignore but if I say like hello I like this product it's going to give me a much higher negative number so again you could set up a a user-defined function now to take this score and if it's a certain number you know say it's a jailbreak or it's not a jailbreak so here's just the actual part about using custom metrics so you don't have to do all this Vector database stuff if you don't want to um only this is just a common use case that I see so we're importing this register metric UDF um and then we're all we do is create this decorator basically around a function and our function here is Vector similarity distance so it's just getting that distance score that we saw and we're returning that so that's actually going to be logged now into our uh language metric data frame and then we're doing it again so this is actually creating two custom metrics here so the one is just returning that score that we already saw and then this other one is actually adding a label to that score so we're saying if it's um greater than negative three we're going to count that as a Jailbreak in this case um and if it if otherwise it's going to be not a jailbreak and again that would be up to you kind of how to tune this and and adjust what the score is so now we called uh the cell which has those functions and those decorators around them now once we do that all we have to do is reinitialize our schema with the LM metrics in it and now these custom metrics are going to be added in so now if I profile uh with this ignore previous directions and do something else as the prompt and if I look at that profile and we just did The Prompt in this case that's why there's not the response over here we can see that there's different ways to add them in but in this case the way we did it here we just added them to to this prompt column so I scroll all the way over we have this Vector similarity now in here and we also have the label so I can see that if I go over this was we just passed in one string here so the max the mean Etc is all going to be one number because we're just subtracted from one prompt but we can see our similarity score if I keep scrolling over we'll also see that our label is added in as well which in this case is did I pass it well we'll see it in the platform in a second or how to extract it but the labels in here where it says you know jailbreak let's actually just look at it down here it's a little bit easier so here um we're going to access that data from that data frame or our profile just like we've seen in the other ones but now it's on our custom metrics so we're looking at prompt and then we're getting this UDF Vector similarity Max and then the UDF Vector similarity label which in this case returned the jailbreak and return the vector space there and then you could write this to ylabs and this is just going to go I think to our second profile and it won't have a lot of data because we just did it on one thing but just to show you what it would look like is now we had 33 features in here previously but if we go look at our profile now it's gonna have 37 features because it added in the the new ones with the um similarity score so we can actually see you here prompt Vector similarity and Vector similarity distance so again this won't have a lot of data here but I can see just for my one that it was not a Jailbreak in this case and then you could look at your single score here and again just like we saw with the profiles this would look a little bit more interesting over time but we get the uh the the median the maximum Etc from that and I'm not going to run this but I just want to also call out that we actually have an official link chain callback integration it so if you want to write a little bit less code to send your data to ylabs you can use this uh ylabs callback Handler here just initialize it right before you call your LM with link chain and then you'll be able to log the results even easier so just as you pass them in to your model and link chain it's going to write them up and then it's using something called a rolling logger so it'll actually write them up I think by default it's something like every 20 minutes and otherwise you can call Dot flush and I'll write them up immediately so if you did run this code if you set your own API Keys Etc run this code because you're saying why labs.flush it'll push them up immediately and then that's kind of all I have for the quick rundown today um I know it's kind of a crash course into a lot of stuff there's a lot of stuff to explore further there's more stuff in the wildlabs platform there's more stuff that lane kit can do but I think I kind of showed you the basics and how to get started with it using openai and Lang chain hopefully this is super interesting and some of you had fun and learned something new there's a lot more resources as well so you can go check out these if you have the notebook open where you can um you know find a whole bunch of examples and Link it repo including another get started example the langchain integration go check out GitHub that has a lot of stuff on it and our y logs GitHub as well again we always appreciate a star if you want to check out our other open source prod project that link it's built on top of ylabs.ai we have a lot of stuff going on there including events all the time almost every week so if you're curious about just ML observability and responsible AI there's a lot of cool events that we do um I do a workshop all the time so like next week I'm actually doing another one not specifically around LMS but just monitoring AI models for bias and fairness with segmentation which is something you can do with LMS but this one will be focused on a different data set and then again if you want the promo code um let's grab it from the slide real quick if you thought this was interesting and you want to fill out the form for the Enterprise um version of ylabs you can fill that out here otherwise I'll hang out for just a couple minutes for any questions that I didn't get to and again let me know in the chat too if you thought this was interesting if it was a tool that you find valuable you're going to look at how to integrate into your projects I would definitely love to know I know some people had to run to other meetings which I totally get that's why I have to leave in a few minutes as well yeah but uh if this seems like something that um you know interesting you want to implement it into your large language model applications definitely let me know and join the slack Channel if you haven't already and you can ask questions there later so if you are starting to implement this and you have a question about metrics or anything like that ask in there so some people were asking really good questions around like what metrics to choose and stuff um and again I think it depends on your application and what you're trying to do but that's going to be a good place to ask questions later too and get input not just for me right like other really smart data scientists and Engineers are in there foreign well if there aren't any other questions I'm going to go ahead and end the stream but please definitely ask them later uh whether that's on LinkedIn with me or in the slack channel so I'm going to go ahead and wrap up the stream thank you again everyone for coming out this is really fun I know I I loved uh showing off uh Lane kit and using it with link chain and open AI so hopefully you found it interesting as well and uh hopefully I'll talk to some of you later have a good day everyone

Info

Channel: WhyLabs

Views: 623

Rating: undefined out of 5

Keywords:

Id: IbKuNU8r6tw

Channel Id: undefined

Length: 78min 3sec (4683 seconds)

Published: Thu Sep 21 2023