The REAL cost of LLM (And How to reduce 78%+ of Cost)

Video Statistics and Information

Captions
On the 1st of December, during a normal Friday afternoon, we received an email from OpenAI telling us that we had reached our API usage limit of $5,000 for the month. But that Friday was the 1st of December, the very beginning of the month, which means we burned $5,000 USD of OpenAI bills in a single Friday afternoon. That is not our normal usage pattern, and when we looked into the details we started to understand why.

I was building an autonomous sales agent: I simply give it a list of clients and it automatically researches them, does outreach, and even auto-replies and follows up when the person responds. I actually made a video about inbox agents as well, which is a very similar concept of letting an agent take over your email inbox and automatically reply. The one we were building is a lot more powerful, though: the agent does a huge amount of research about each prospect from Google, LinkedIn, Apollo, web scraping, and a few other data sources, so the outreach and responses are hyper-personalized and of much better quality. There is some cost to running such a sales agent, but it is not a lot. The problem occurred when I accidentally got one sales agent to reach out to another sales agent, which could also respond and reach back out. That was the trigger for the massive bill from that single afternoon: the two agents just started talking to each other back and forth, creating an infinite loop that led to the huge OpenAI bill we received. Previously we always joked about the scenario where all these different agents just debate and compete with each other; this was the first time I saw it happen in the real world, and it really hurts.

Very quickly we put together practices and boundaries to monitor and alert on all of our large language model usage so that this type of situation can never happen again. But it was also a huge reminder for me that if you are building an AI product, you have a large language model cost that will continuously grow as your usage and users grow. It is a new type of cost that traditional software companies don't really need to care about, but as an AI startup you have to understand it properly. It also introduces a lot of nuance. For example, people are used to subscription-based pricing models that charge $99 per month for unlimited usage, but for an AI startup with large language model costs factored in, the pricing strategy becomes a lot harder, because the cost fluctuates with usage and it is hard to predict what kind of usage patterns your users will have at the beginning.

I learned this personally in another side project I did. I don't know if you remember, but in mid 2023 one of the hottest AI startup segments was AI girlfriends, or AI companions. It all started with an influencer called Caryn, who made a digital version of herself as a Telegram bot. Here's a quick demo: "Hey there John, it's lovely to meet you. I just spent the morning grabbing brunch at the Flowering Tree Cafe in West Hollywood, it was absolutely amazing. What are you up to today?" "Hey there, yeah, I've just been working all day, but trying to get some relaxation in this evening." "That sounds really nice. Have you tried doing some yoga or meditation for relaxation? It's a great way to unwind and clear your mind. For this evening maybe we can plan a virtual dinner date or watch a movie together, what do you think?" This was launched in May 2023, where they charged $1 per minute for this voice chat
and claimed to get more than 70,000 sales in just one week. This got me interested, because the whole project feels very straightforward to build with all the powerful models we have: you can very quickly build an AI companion that can also talk in voice by connecting different models together with speech-to-text, a large language model, and text-to-speech. So I launched a copycat version very quickly, and as someone who has worked in SaaS and software companies before, it felt obvious to me: okay, let's just offer a free trial so anyone can speak with it, and I picked a random number of 60 seconds of free voice chat for everyone. But back then I didn't consider the model cost at all, and that made this product almost impossible to break even.

The cost of the free trial looked something like this: on average, a 60-second voice chat is around 500 to 1,000 characters, which translates into large language model and text-to-speech costs of roughly 12.6 cents per new user. At that cost, 1,000 signups work out to roughly $126 in total, and if I get a 1% conversion from free to paid, each paying user has to earn at least $12.60 just to break even on the model cost, which is not trivial for a consumer-facing product. So even though new users were coming in every day, the project didn't seem to generate a good amount of profit and margin, and I shut it down fairly quickly. Meanwhile, there were projects in this AI companion space that became extremely successful without offering any text-to-speech model at all, just pure text, which meant their model cost was significantly lower. That allowed them to figure out a much more flexible business model and funnel the capital back into product development and marketing.

Through those different projects I started realizing how important it is for an AI startup to be aware of and properly calculate its large language model cost. It is an important skill to balance cost against performance and user experience. So this topic got me really interested: what kind of levers can we pull to reduce large language model cost while ideally also maintaining performance? I started doing some research and experimentation into this, and the result is pretty good: with just one afternoon's worth of optimization we were able to reduce the large language model cost of the sales agent we were building by 35%, and with a bit more time I'm pretty confident it could be dropped by 50 or 60%. That is an immediate 30 to 50% of additional profit the company can capture, or cost reduction it can pass on to users. I want to share what I learned so that you can try it in your own AI application as well. So let's get into it.

But before I dive in: the best way to reduce the cost of a large language model application is not only technical know-how but also a deep understanding of the business workflow, so that you know which steps and which data are absolutely necessary. Recently I was trying to learn how marketing teams are actually adopting AI and what their workflows look like, because I think marketing has a huge opportunity for AI automation startups, and HubSpot Academy just released a new free course called AI for Marketers which helped me a lot in understanding how world-class marketers think about and adopt AI in their real workflows. They give detailed examples of how top marketers use AI to analyze and collect a wide range of data, from transactional data, social media activity, and product usage to CRM data, to provide a hyper-personalized experience for your customers,
from website and content personalization at scale to recommendations, dynamic pricing, and targeted ads that drive the end-to-end customer experience, and even using AI for predictive modeling to get real-time, actionable customer insights, backed by case studies from real-world, top-tier marketers. It was extremely useful for me to understand the actual workflow of marketers, so I definitely recommend checking it out if you're building AI products for marketers; you can click the link below to get free access.

Now, back to the actual tactics for reducing large language model cost. At a high level there are two ways to reduce it: one is finding a smarter way to choose the right model for different tasks, and the second is reducing the number of tokens either sent to or generated by the large language model. I'll talk through some good practices for both categories, with examples.

First, changing models. This might sound obvious, but what you might not realize is how dramatically the cost differs between models, which enables some creative solutions. For example, if you compare the cost of GPT-4, the most powerful but also most expensive model, with the Mistral 7B model, GPT-4 is almost 200 times more expensive than Mistral. To quantify it a bit more, that basically means the cost for GPT-4 to generate just one paragraph is the same as the cost for Mistral 7B to generate a whole book. That's why Mistral is so popular: they focus on squeezing performance out of small models, which reduces cost by multiple orders of magnitude and also lets us deploy AI on our laptops or even mobile phones.

So one strategy is to use a powerful model first to launch your product and start collecting training data, then fine-tune a smaller model for your specific tasks. For example, at the beginning you can use the most powerful model, which is not GPT-4 Turbo but GPT-4 32k, to build your initial product. From early users you save the results GPT-4 32k generated, then use them to fine-tune a smaller model like Mistral or Llama 2, which can achieve genuinely comparable results. This method can deliver more than 98% in cost savings. That's why, when you launch the product, you should make sure you have mechanisms in place to save all the results it generates and provide ways for your users to mark results as good or bad; this will allow you to do the fine-tuning later and achieve huge savings on your large language model cost. The limitation of this method is also very clear: it only works for scenarios where you have a very specialized task, for example extracting data from text invoices or classifying financial news. If your app is more general purpose, like a chatbot, then when a user asks a question that is not covered by your training data, the result from the fine-tuned model is going to be much worse than from a powerful model like GPT-4.

That's why smart people have explored more sophisticated solutions, for example the large language model cascade. The concept is simple: if the cost difference between big and small models is so dramatic, what if we chain model calls in a cascade? When a user asks a question, a cheap small model like Mistral or GPT-J tries to answer it first; if the confidence score is high, accept the answer, but if not, pass it on to the next model. Repeat this process multiple times and only use GPT-4 for the very complicated questions. This method really leverages the fact that model prices are dramatically different: the cost of running GPT-4 once can pay for the smaller model to run a hundred times, so even though the cascade makes more large language model calls, the total cost is still cheaper.
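To make the cascade idea concrete, here is a minimal sketch. It is not from the video: the model names, the self-reported confidence prompt, and the 0.8 threshold are illustrative assumptions, and production cascades typically use a dedicated scoring model rather than asking the model to rate itself.

```python
# Minimal LLM cascade sketch: try the cheap model first, escalate only when
# its self-reported confidence is low. Model names and threshold are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CASCADE = ["gpt-3.5-turbo", "gpt-4-turbo-preview"]  # ordered cheapest to priciest

def answer_with_cascade(question: str, threshold: float = 0.8) -> str:
    text = ""
    for model in CASCADE:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": (
                    "Answer the question. On the final line output only "
                    "'CONFIDENCE: <number between 0 and 1>'."
                )},
                {"role": "user", "content": question},
            ],
        )
        text = response.choices[0].message.content
        answer, _, conf_part = text.rpartition("CONFIDENCE:")
        try:
            confidence = float(conf_part.strip())
        except ValueError:
            confidence = 0.0
        # Accept the cheap answer if it is confident enough; otherwise the loop
        # falls through to the next, more expensive model.
        if confidence >= threshold:
            return answer.strip() or text
    return text  # last model's answer, regardless of confidence

print(answer_with_cascade("What is the capital of France?"))
```

The escalation rule is the whole trick: most traffic never reaches the expensive model, so even the occasional double call still comes out cheaper overall.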
Obviously, though, this is not optimal; ideally we would know in advance whether a question is complex or simple. This is where the third method, the large language model router, shines. The concept is to use a cheaper model just to classify the request: is it a simple request like "hi", which can probably be handled by a smaller model like Mistral, or a complex math question that should be handled by a powerful model like GPT-4? Theoretically this can even enable better performance, because if we have a list of high-performing specialized models, we can route different requests to different expert models.

This concept was brought up by Hugging Face last year when they introduced HuggingGPT. It basically uses a large language model as a controller to break the user's question down into subtasks and delegate them to different models that solve the problem together. So if the user asks "read the image example.jpg for me", where the image is an invoice, HuggingGPT can break that down into subtasks like an image-to-text model and a text-to-speech model and produce a final result based on them. There are also companies commercializing these solutions with a specific focus on cost and performance, like Martian and Neutrino AI. For example, here is a comparison between GPT-4 and the Martian large language model router for generating some code: the Martian router took about 7.8 seconds and less than 0.06 cents, while GPT-4 was still running after 30 seconds and eventually gave results after 38 seconds at a cost of 2.3 cents, which is 38 times higher than the router. And here's another example from Neutrino, which has a playground to showcase how the router works behind the scenes: if I send a request like "hi", it uses GPT-3.5 Turbo to generate the response, but if I ask a math question like "what is 93 multiplied by 3.46", it switches to GPT-4 for better accuracy. You can also create a custom router by selecting the different models you want to use, although at the moment it doesn't seem to support custom fine-tuned models. I think it would be really cool if someone built an open-source router model so that anyone could bring their own fine-tuned models in for the best performance. So the large language model router is definitely an interesting thing to try and see if it fits your use case.
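As a rough illustration of the routing idea (this is my own minimal sketch, not Martian's or Neutrino's implementation; the classification prompt and model names are assumptions):

```python
# Minimal LLM router sketch: a cheap model classifies the request, and the
# request is then answered by either the cheap or the powerful model.
from openai import OpenAI

client = OpenAI()

def route_and_answer(question: str) -> str:
    # Step 1: cheap classification pass.
    verdict = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": (
                "Classify the user's request with exactly one word: 'simple' for "
                "greetings or small talk, 'complex' for math, coding, or multi-step reasoning."
            )},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content.strip().lower()

    # Step 2: route to the model that matches the difficulty.
    model = "gpt-4-turbo-preview" if "complex" in verdict else "gpt-3.5-turbo"
    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content
    return f"[{model}] {answer}"

print(route_and_answer("hi"))                  # stays on the cheap model
print(route_and_answer("What is 93 * 3.46?"))  # escalates to the big model
```

The extra classification call costs a fraction of a cent, which is easily paid back every time a simple request avoids the expensive model.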
Last but not least, the team at AutoGen has also experimented with a related method. You set up multiple agents, one with a GPT-4 model and another with a cheaper model like GPT-3.5 Turbo. It chooses the cheaper agent to complete the task first, but if that fails it invokes the next agent. Every time it succeeds, it saves the result to a database, so the next time a new question comes in, it can pull relevant examples of how similar problems were solved before and give the cheaper model both the new question and those past examples. Because it is given examples, even the cheaper model can achieve comparable results. From their blog post, for different tasks like asking for the weather or generating a stock chart, most of the time this multi-agent setup took only 20% of the cost of a GPT-4 assistant but achieved a similar or even better success rate. So this is definitely another innovative solution you can try out; comment below if you want to learn more about these methods, and I can make a video deep-diving into this agent architecture if you're interested.

So those are all the major methods for reducing cost simply by switching to different models. Some of them might feel complicated, but one thing I highly recommend is to just try swapping your current GPT-4 for GPT-4 Turbo and see whether you get a similar level of results, because if that works you can already achieve a 30% cost reduction.

Those are the methods for reducing large language model cost by finding the most suitable model, but that's not all: there is also a huge amount of optimization you can do by reducing the tokens you send to the large language model in the first place. One pretty representative method is LLMLingua from Microsoft. The core idea is that natural language is not efficient: quite often the prompt we give a large language model includes a huge amount of noise, words that are redundant and don't really contribute much to the final result it generates. It's similar to how humans consume information: if I read a whole book, there are probably only two or three chapters that matter most, and the same applies in many other scenarios. For example, if the model's task is to summarize a call transcript or answer specific questions based on an interview, there is probably a huge amount of content that doesn't contribute to the final answer but still consumes a huge number of tokens. So the concept here is, again, to use a small model to remove the unnecessary tokens and words, so that we only send necessary and relevant data to the high-performance large language model to generate the answer.

The results are pretty stunning. In the call-transcript example we just showed, the normal method of passing in the full raw material consumes roughly 30,000 tokens, and the answer the model generates can even hallucinate, probably because we are passing it too much information. With LLMLingua, that original content is compressed down to just a small paragraph of around 100 tokens, almost 175 times smaller. The method can even be used for chain-of-thought prompting: in that prompt we give the model an example of how it should think step by step and then a new question to answer; the original prompt consumed 2,300 tokens, but with a similar compression of the reasoning in the example it only consumes roughly 100 tokens, about 20 times smaller. It even applies to code completion, where a small model extracts only the relevant part of the code, so token consumption dropped from 21,000 to about 1,600 tokens, surprisingly with even better accuracy.

So cleaning the prompt before it is sent to an expensive model like GPT-4 is a great way to reduce cost, and we can adopt the same principle to optimize tool inputs and outputs if we're building agents. For example, if you're building a research agent, the agent will probably have a tool to scrape websites, but the raw data returned by the web scraper usually includes a huge amount of noise and unnecessary text like "skip to main content" or footer links. Instead of sending that raw content back as the output of the tool, which would waste a huge number of tokens in the agent's memory, we can do a similar thing inside the tool function itself: use a small model to summarize and extract only the core information as the tool's final output, which makes the memory a lot cleaner and probably easier for the large language model to consume as well.
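A minimal sketch of that idea for a scraping tool (my own illustration: the scraper here is a plain HTTP fetch, the 4,000-character cutoff is arbitrary, and the cleanup prompt is an assumption; the video itself uses a dedicated scraping service instead):

```python
# Clean noisy scraped text with a cheap model inside the tool, so the expensive
# agent model never sees raw navigation, footers, and boilerplate.
import requests
from openai import OpenAI

client = OpenAI()

def scrape_website(url: str, objective: str) -> str:
    # Simplest possible fetch; a real agent would use a scraping service here.
    raw = requests.get(url, timeout=30).text

    if len(raw) < 4000:  # short pages can be returned as-is
        return raw

    # Cheap model strips the noise and keeps only what serves the objective.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[{
            "role": "user",
            "content": (
                f"Research objective: {objective}\n\n"
                f"Scraped website content:\n{raw[:40000]}\n\n"
                "Remove noise and boilerplate. Return only the content relevant to "
                "the objective, keeping links and figures as references."
            ),
        }],
    )
    return response.choices[0].message.content
```

The agent's memory then holds a few hundred clean tokens per page instead of tens of thousands of noisy ones.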
This is a great segue into the second optimization I found really useful if you're building agents: optimizing the agent's memory. Because of how large language models and agents work, every time a chatbot or agent generates a new response, it takes all of the past conversation as input tokens, so the longer the memory, the more expensive it is to generate the next token. The goal is to give the agent just enough memory to have a good grasp of what has been talked about before, instead of sending everything. By default you are probably using something called conversation buffer memory, which tries to keep every single word of the past conversation, so the more interaction there is between the user and the agent, the more tokens are consumed, until eventually you hit the token limit. That's why you almost always want a smarter way to handle the memory window.

The common alternative is conversation summary memory. Instead of keeping the whole conversation history as memory, summary memory sends the chat history to a large language model and has it generate a summary of what has been discussed, so the tokens passed on to the agent are much, much smaller. It also means the memory won't grow infinitely; it can be kept to a certain number of tokens so it never exceeds the model's context window. The downside is that, because it summarizes, some details can be lost in the process. That's why the one I always use is summary buffer memory: the agent remembers the most recent part of the conversation exactly, word for word, say the last 200 words or so, but summarizes the earlier chat history. You still have full context on the recent things the user discussed, plus a high-level understanding of what was discussed before that. There are more sophisticated agent memory setups you can use to optimize further; there's a pretty comprehensive blog post on agent memory optimization that I definitely recommend checking out if you're interested.
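Here is a minimal sketch of that summary buffer setup using LangChain's ConversationSummaryBufferMemory (import paths are as of the langchain 0.1.x releases and may differ in newer versions; the token limit, model choice, and example turns are illustrative):

```python
# Summary buffer memory: recent turns are kept verbatim up to max_token_limit,
# older turns are folded into a running summary written by the LLM.
from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

memory = ConversationSummaryBufferMemory(
    llm=llm,               # the model that writes the running summary
    max_token_limit=200,   # keep roughly the last 200 tokens word for word
    return_messages=True,
)

# Simulate a few turns; anything beyond the limit gets summarized.
memory.save_context(
    {"input": "Hi, I'm researching how to cut LLM costs for my agent."},
    {"output": "Great, there are two levers: model choice and token count."},
)
memory.save_context(
    {"input": "What's an LLM router?"},
    {"output": "A cheap model classifies each request and routes it to a cheap or powerful model."},
)

print(memory.load_memory_variables({}))  # recent turns verbatim + summary of the rest
```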
So those are the methods I have been using to bring down large language model costs significantly. But both categories of method require you to have a good understanding, and a good log, of where costs occur in your current application, so that you know which parts to optimize. That's why observability is critical for building AI products, and there are a few platforms that let you monitor and log the cost of every large language model call. At this point those products are all fairly similar, so I'll take you through a quick example of using one of the most popular ones, LangSmith, to monitor where the cost occurs in your AI agent and how to use that to optimize it. LangSmith is a platform introduced by LangChain that logs every time the agent tries to complete a task; for each task completion it shows how long it took, how many tokens it consumed, and a detailed breakdown of token consumption for every single model call, so you can see exactly where to focus your cost optimization. I'll quickly walk through how to use this information to analyze the cost, with a real example of cutting the cost of this research agent by more than 70%.

So let's first set up LangSmith. I open Visual Studio Code, which is where we'll put together this research agent and the monitoring, and click the button at the top right to open a terminal. The first thing to do is run pip install -U langchain openai; the langchain package includes the LangSmith SDK. Next we need to define a list of environment variables so we can connect to LangSmith. Back in Visual Studio Code I create a new file called .env, and inside it define the variables: LANGCHAIN_TRACING_V2 and LANGCHAIN_ENDPOINT exactly as I have here, plus the LangChain API key and project. You need to create an account on LangSmith, click the API key button at the bottom left, create a new API key, and paste it in here. For the LangChain project you can give it whatever name you like for the project you're running, in my case "researcher agent"; you can also comment that line out and all results will be logged under the default project. Finally, put your OpenAI API key in as well.

Next we set up the research agent and the tracking. I create a new file called app.py. If you're using LangChain to build your agent, there is nothing extra you need to import: as long as the LangChain API key and project name are set in the .env file, everything is logged automatically. But if you're not using LangChain, you can still use LangSmith to log all of this information by importing traceable from the langsmith library. All you need to do is wrap the function you want to track with the @traceable decorator, and you can define different run types, such as chain or tool. Here I simply call OpenAI and get a response back, so I run my chain with "Who is Sam Altman?", save the file, and run it with Python from the terminal.
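A minimal sketch of that non-LangChain setup, with the .env variables shown as comments (key values are placeholders, and the project name is just the one used in this walkthrough):

```python
# Tracing a plain OpenAI call with LangSmith's @traceable decorator.
# Expects these environment variables (e.g. loaded from a .env file):
#   LANGCHAIN_TRACING_V2=true
#   LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
#   LANGCHAIN_API_KEY=<your LangSmith API key>
#   LANGCHAIN_PROJECT="researcher agent"   # optional; falls back to "default"
#   OPENAI_API_KEY=<your OpenAI API key>
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable(run_type="chain")  # each call shows up as a run in the LangSmith project
def my_chain(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(my_chain("Who is Sam Altman?"))
```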
If I go back to LangSmith, you can see a new project called "researcher agent" has shown up, and inside it has logged this specific run, my_chain, with the actual input "Who is Sam Altman" and the output returned from OpenAI. So it does the basic logging, but LangSmith is a lot easier to use if you are using LangChain. For example, here I create a research agent that can access Google, scrape websites, and do research. I import a list of libraries and create a few tools: one tool for scraping websites, where I use Browserless as the scraping service, and another tool for Google search, which has a search query and a description, using Serper as the service. I create a tool list referencing the two tools above and a custom system message for the researcher, which I've already covered in a few other videos on research agents (check them out for more details if you want). Then I define the memory, using the conversation summary buffer memory we mentioned before with a max token limit of 3,000, put together the prompt, create the agent and the agent executor, and try running it with "What is the latest release version of LangChain and what is it about?" Save this and run python app.py.

Back in LangSmith there is a new log called AgentExecutor, which is the run we just did. Clicking into it shows the user request (what's the latest version of LangChain and what is it about) and the final result returned, and on the left it displays how long the request took and the total number of tokens it consumed. You can see it consumed 20,000 tokens in total, the majority of them input tokens. Clicking the stats button shows the actual breakdown, and from that we can tell that up to the web scraping step the cost is minimal; after web scraping, the cost jumps up hugely. If you check the output, the main reason is that the web scraping tool returned a huge amount of noise and empty space, which takes up a lot of token space and made the completion for the last step extremely long, taking 14 seconds. A quick calculation: since I'm using GPT-4 Turbo, the cost is around $30 per million input tokens, and with a total of 20,000 tokens the cost of this research run is roughly 60 cents, which is quite expensive. (The calculation is simplified, since I'm treating everything as input tokens; the output was only 233 tokens, but you can get more granular if you want.) My plan is to use a cheaper model like GPT-3.5 Turbo to summarize the scraped website content, so that far fewer input tokens go into the more expensive GPT-4 Turbo.
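Before walking through the actual changes, here is a rough sketch of the kind of summarization helper this involves, based on LangChain's "stuff" summarize chain (API as of the langchain 0.1.x releases; the chunk size, model variant, and exact prompt wording are illustrative assumptions):

```python
# A cheap model condenses raw scraped content before it ever reaches GPT-4 Turbo.
from langchain_openai import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate

def summary(objective: str, content: str) -> str:
    llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)

    # Keep only the first 50,000 characters and split them into chunks.
    splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=500)
    docs = splitter.create_documents([content[:50000]])

    prompt = PromptTemplate(
        input_variables=["text", "objective"],
        template=(
            "{text}\n\n"
            "Above is scraped website content. Remove the noise and extract the key "
            "content that helps with this research objective: {objective}. The summary "
            "should be detailed, with references and links to back up the research."
        ),
    )
    chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt)
    return chain.run(input_documents=docs, objective=objective)
```

The scraping tool can then return summary(objective, raw_text) instead of the raw page whenever the content is long.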
To implement this, I go back to app.py, import a few new packages we're going to use, and add a new function called summary. I pass in an objective, which is the goal of the Google search, so that the language model knows what kind of information it needs to extract, as well as the raw content. It first does a quick truncation to take the first 50,000 characters, breaks them down into small chunks of documents, and creates a prompt. The prompt looks like this: here is the website content; above is scraped website content, please remove the noise and filter out the key content that will help with this research objective; the summary should be detailed, with lots of references and links to back up the research, as well as additional information to provide context, and extract the key content. Then I create the prompt, define a large language model chain with GPT-3.5 Turbo, and use the stuff-documents chain, which is a chain LangChain provides specifically for generating summaries, and get the summary back. After that I also define a class called ScrapeInput, because I want to add more descriptions for the inputs now that I've added a new objective input; I give it the description "the objective of the research". For the scrape-website tool I add a bit more detail: I give it the name scrape_website and pass in the ScrapeInput schema, so that when the tool is passed to the agent, those input descriptions are passed along as well, and I change the input to include the objective. For the web scraping function itself, I replace the plain return of the text with a condition: if the content is quite long, call the summary chain and return just the summary; otherwise return the raw text. Everything else stays the same.

Now I run this again with python app.py and return to LangSmith, and there are two new records. This one is the run using GPT-4 Turbo, where the token count is now reduced to just 4,300 tokens, and there is also a new record called StuffDocumentsChain, which is the chain we used to summarize the information; it took 14,000 tokens, but on GPT-3.5 Turbo. If we calculate the approximate cost, it comes to around 15 cents, which is more than 70% cheaper than the original method. And if you compare the outputs, the new result is actually even better than the original one, since we used a cheaper model to do one round of content filtering first.

So that is a quick real-world example of how you can use these monitoring platforms to log and optimize the cost of your large language model app. I'm very keen to hear about any other cost optimization methods you know, so please comment below if I've missed any. I will continue to share interesting AI projects and AI product-building knowledge, so please subscribe if you enjoy this content. Thank you, and I'll see you next time.
Info
Channel: AI Jason
Views: 84,889
Keywords: gpt 4 turbo, chat 4 gpt turbo, gpt turbo 3.5, chatgpt, chatgpt turbo event, llm cost calculation, generative ai, gen ai, large language model, llm, gpt 4, gpt4, artificial intelligence, open source llm, mixtral 8x7b, uncensored mixtral, mixtral ai, ai
Id: lHxl5SchjPA
Length: 27min 20sec (1640 seconds)
Published: Tue Jan 30 2024