Locally-hosted, offline LLM w/LlamaIndex + OPT (an open-source, instruction-tuned LLM)

Video Statistics and Information

Captions
Hello, welcome back to another video in the LLM (large language model) series. In this video we'll be doing something more adventurous, which is to download a custom large language model — that's right, have it on your own computer — and implement code using LangChain and LlamaIndex to build a summarization and Q&A bot powered by this custom LLM on our own machine. If you do not need the natural language response that comes back at the end of this pipeline, you can completely drop OpenAI's GPT and do all of this fully offline, completely free. I'll also show you how to use GPT for the natural language response when you query your index, but it's worth pointing out that you can just use the embeddings to do the lookup and then use your own custom LLM to generate the response. This tutorial will understandably be a little longer than the usual format, but I'll try to keep it as concise as possible. You will need to download an LLM from Hugging Face — I'll show you how to do that — and you want to pick one that can fit into your computer's RAM. When it comes to choosing an open-source LLM there are a few options out there, and if you've seen my other videos you've seen me discuss them in my tutorials or in some back and forth in the comment section. The gist of it is that GPT-4 and GPT-3 are both closed-source models: you can't download them and use them on your own machine as you please. GPT-3 has 175 billion parameters, and the largest open-source model that I know of is Meta's OPT (Open Pre-trained Transformer), which also has 175 billion parameters, so it's roughly equivalent to GPT-3. In my other videos I've also used a few other open LLMs, one of them being GPT-2, which has 1.5 billion parameters. For this video you can use any open LLM of your choice; initially I used Meta's OPT with 30 billion parameters, which comes in at around 60 gigabytes in size. I can foresee that it's not very feasible for everyone to be downloading 60 gigabytes of pre-trained weights, and while doing this recording it also caused some serious lag from the RAM usage by the model as well as the screencasting application, the audio recording from the mic, and so on. So to make things a little more manageable for the video I'll switch to the 1.3-billion-parameter OPT model, but the code is essentially the same — just change 30 billion to 1.3 billion. For the code on GitHub I'm still going to publish the one with 30 billion, so if you want to go to GitHub and fork the repo to follow along with this entire series (I think we're on video number six or seven now), you can do that. All right, let's get started and code this up. Open a new file wherever you want it to be — I'm going to have this on my desktop as demo9.py, and this will also be on GitHub. Some of this scaffolding has become quite standard: you can go ahead and say `from dotenv import load_dotenv` and call it. This will look for a file called .env and bring in everything in there, so your .env file will contain your keys and such, and it loads all of them. I also want `from llama_index import PromptHelper`: this utility helps us fill in the prompt, split the text, and fill in context information according to the necessary token limitations. If you've seen my other videos you know how I talk about this context — you set up a context and you pass it on to your LLM — so it's a really useful tool to have.
I'm going to bring that in, and I also want to bring in my SimpleDirectoryReader, then do a quick format pass to tidy the imports. You're probably wondering, "I've never really seen you use PromptHelper before — what's the point of it?" It's useful in this demo because we're using the prompt helper to customize the prompt sizes, input sizes and so on. You're going to be using your locally hosted LLM, and every model has a slightly different context length, so you want at least some control over that. When you're using OpenAI's GPT, for example, you know what the input and output sizes will be, but here you're using your own self-hosted LLM, so this is a nice interface to control that. So let's set up the PromptHelper and initialize a couple of things. For example, I can set my maximum input size; I'll set it to 1024 (if you're using the DaVinci model the default is 4096). Here I'm setting certain constraints: I'm saying the maximum input size should only be this large. I also want to set the number of outputs — this is the number of output tokens, and it's usually set to a pretty low number by default; for example, with OpenAI the default is 256. You could change this up to maybe 512, or bump it to 300, whatever you need. For now let's stick with the default of 256; we can change it later once we run the model. The last thing I'd like to set is the max chunk overlap, which I'll set to maybe 20. Understand that this is a maximum — it's not forcing the chunks to overlap. When the text is split into chunks they're allowed to have a bit of overlap, but the maximum of that will be 20.
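Here's a minimal sketch of the scaffolding described so far, assuming the LlamaIndex API roughly as it was around the time of this video (mid-2023); newer releases have since reorganised these imports and parameter names:

```python
from dotenv import load_dotenv
from llama_index import PromptHelper, SimpleDirectoryReader

load_dotenv()  # reads keys (e.g. OPENAI_API_KEY) from a local .env file

# Constrain prompt construction to fit the locally hosted model.
max_input_size = 1024    # DaVinci's default is 4096; keep this small for a local OPT
num_output = 256         # number of output tokens to reserve for the response
max_chunk_overlap = 20   # upper bound on the overlap between adjacent chunks

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
```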
It's usually nice to have some overlap to maintain continuity between the chunks. If you're familiar with computer vision or have done some deep learning in the past, you can think of it as the equivalent of a sliding window: you want some overlap between the chunks to maintain some form of continuity. You probably won't need the commented-out lines, so just remove them. Now, how do you actually implement your own custom LLM? For this we need to bring in a base class that we inherit from, so we say `from langchain.llms.base import LLM`. So we're using LlamaIndex to create the index, and we're using LangChain to create our LLM, and this will be a custom LLM. You create a class; I'm going to be using OPT, the Open Pre-trained Transformer model by Meta (Facebook), so if you want to use something else you just have to change this. I'm going to call it LocalOPT, to remind myself that it's a custom LLM that I'm implementing. The first thing I want to define is the model name. I said earlier that if you want a high-performance model — one with results pretty similar to GPT-3 — then you want to use the full OPT model: GPT-3 has 175 billion parameters, as we discussed at the beginning of the video, and OPT also has 175 billion parameters, so it's pretty much equivalent; you can read the papers, which I'll link in the description. But if you don't have hundreds of gigabytes lying around on your computer, you may want to consider a smaller model. A pretty good trade-off is something like facebook/opt-iml-max-30b. Even this is a 60-gigabyte model, so it's not a small model at all — it's only small relative to GPT-3 and the other very large ones. I'm running this on a Linux box and I have enough RAM to run it, but when I need to record along with the video editing and all these other things competing for the same RAM, it gets a bit laggy, so I'm going to switch one level down during the recording of the demo and then push the code up with max-30b. Also note that I'm not using the plain Facebook OPT model — I'm using OPT-IML; the URL is huggingface.co/facebook/opt-iml-max-30b. What does IML mean? Instruction Meta-Learning: OPT-IML is a set of instruction-tuned versions of OPT, trained on a collection of more than 2,000 NLP tasks gathered from eight NLP benchmarks. You can read about the limitations and biases: OPT-IML models outperform baseline OPT on extensive evaluations, but they carry their own risks around factual correctness, generation of toxic language, and enforcing stereotypes. So what is instruction tuning, generally? It's the act of getting humans to use natural language instructions to guide the language model toward a specific task or outcome — using natural language to guide and prompt it — with the goal of inducing what you hear called zero-shot capability, where a model can generate a specific output without any task-specific training data. Let me walk through an example.
Take a sentence like "Supertype is a great company that gathers the top software engineers in Indonesia to build full-stack, machine-learning-powered applications for the world." That's a useful sentence, but if you want this model to output, say, a sentiment — to do some sort of sentiment analysis — then you want to help the model a little by giving it more details on the task. The task instruction might look like this: you give it the review ("Supertype is a great company..."), then you ask, "How does the reviewer feel about Supertype?" and you give the LLM a pool of options — tell me whether this is positive, negative, or neutral. You specify all of this in natural language in your instructions, and you get your LLM to pick only one of those options. So instead of just putting in a big block of text, you give it the block of text and then further instructions: "tell me whether the above review sounds positive, negative, or neutral, and only pick from this pool of three choices" (a small sketch of such a prompt follows below). A good resource to learn more about this is a PDF I'll also link in my description: instruction tuning fine-tunes a language model on a collection of NLP tasks described using instructions. For example, here's a review — "best rom-com since Pretty Woman" — then you ask the model, "Did this critic like the movie?" with the options yes or no, and you keep doing this, maybe ten or twenty thousand times, and the LLM picks up this behavior. If you want to do question answering, you give it a question, an expected output, and perhaps some choices; the same goes for rule-based tasks. So that is instruction tuning in general; I'll leave all of these links somewhere in the video description. Once you pick the model, pay attention to what kinds of tasks it supports — here it supports the text-generation task.
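As a concrete illustration of the instruction-tuning idea above, here is a hypothetical instruction-style prompt; the wording and the option list are made up for illustration, not taken from OPT-IML's training data:

```python
# A hypothetical instruction-tuned prompt: the review, a natural-language
# question about it, and an explicit pool of allowed answers.
prompt = (
    "Supertype is a great company that gathers the top software engineers in "
    "Indonesia to build full-stack, machine-learning-powered applications.\n\n"
    "How does the reviewer feel about Supertype? "
    "Answer with one of: positive, negative, neutral.\n"
    "Answer:"
)
# An instruction-tuned model such as OPT-IML should tend to reply with one of
# the three offered options rather than with free-form text.
```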
Then go back to your editor. I can immediately create my pipeline using a utility from transformers, so I'm going to bring that in as well: `from transformers import pipeline`. Now I'll build my pipeline, and remember the task I told you to pay attention to — it's text generation — so I pass "text-generation" as the task, and then I give it the model, which is just the model_name we defined above as a variable. If you want to pass in a few more things, there are a few optional parameters I'll show you quickly as well. I can set the device so it runs on CUDA, that is, my GPU — you can omit this if you don't have CUDA set up, and if you don't know how to set it up I have videos on that as part of my GPU deep learning series using PyTorch. If you want to learn Transformers you'll want to learn PyTorch, and I have a whole series about that, including setting up your Nvidia card and downloading the drivers, so go take a look at those videos. Finally, you can pass some optional model parameters as well — for example the torch dtype. If I want it to be a 16-bit float I can say bfloat16, and that comes from torch, so I'll also need `import torch`; this just makes things a bit more efficient. Do some formatting and make sure you don't miss any commas. Now, the usage pattern: because this is a class, you initialize it — `llm = LocalOPT()` — and then you call it. But what happens when you call it? So far you've only defined the model name and the pipeline, and the model name is only used inside the pipeline. So what do you want it to do when somebody calls it? This goes back to OOP basics: you define a `_call` method, and it takes two things — the prompt, and a `stop` argument that can default to None. If you want to add some type hints, you can say the prompt should be a string, just so anyone reading this code knows what to pass in; `stop` is optional because it has a default of None; and it should return a string, so you can add that return type as well. So it takes a string and an optional stop, and it outputs a string. Inside, create a response variable: take the pipeline we defined above, pass in the prompt, then pass in `max_new_tokens` — you'd want this to be the same as the number of output tokens we set up earlier, so what we can do is define a max token variable and set it to 256.
Then, instead of hard-coding the 256, you can use that variable. The pipeline returns an iterable, so you only want to take the first item — index [0] — and from that item you specifically want the "generated_text" key. I'll show you what happens if you don't do that: it gives you back a bunch of things, which is too much; I only want the first item and then the generated text. Now you're returning exactly what you want, but this will also include the prompt itself. Since this is text generation, the return value contains all of the input text and then the newly generated text after it. One way to get around that is to index into the string and drop the first len(prompt) characters, so you return `response[len(prompt):]` — this is called slicing or indexing: take from the length of the prompt all the way to the end. If the length of the prompt is, say, 62 characters, start from position 62 and go to the end, which means "don't repeat the prompt back to me." If this is getting a bit much, you can add a comment that says something like "only return the newly generated tokens, without the prompt." Now there are two more things to do to complete this LocalOPT class, and both are properties. The first is `_identifying_params`, a private property that returns the name of the model — and again, it returns the model_name variable rather than hard-coding the name and repeating it everywhere; define it once, like we did up top, and reuse it. The second property tells your code that this is a custom LLM type: `_llm_type` just returns "custom". That completes the class; a sketch of the whole thing follows below.
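Putting the pieces together, here is a sketch of the custom LLM class as described above. It follows the LangChain custom-LLM pattern that was current in mid-2023 (subclassing `langchain.llms.base.LLM` and implementing `_call`, `_identifying_params` and `_llm_type`); later LangChain versions have changed this interface, so treat it as illustrative rather than definitive:

```python
import torch
from langchain.llms.base import LLM
from transformers import pipeline


class LocalOPT(LLM):
    # The video develops with the 1.3B variant and ships the 30B one;
    # swap in "facebook/opt-iml-max-30b" if you have ~60 GB to spare.
    model_name = "facebook/opt-iml-max-1.3b"
    pipeline = pipeline(
        "text-generation",
        model=model_name,
        # device="cuda:0",                      # omit if you don't have CUDA set up
        model_kwargs={"torch_dtype": torch.bfloat16},
    )

    def _call(self, prompt: str, stop=None) -> str:
        max_tokens = 256
        response = self.pipeline(prompt, max_new_tokens=max_tokens)[0]["generated_text"]
        # The pipeline echoes the prompt, so only return the newly generated tokens.
        return response[len(prompt):]

    @property
    def _identifying_params(self):
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self):
        return "custom"
```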
Now I could start writing the indexing code right here, but to keep things nice and tidy, each in its own function, let's create another helper function called create_index; this function will be responsible for creating the index. I'll need a few more imports, so let's go back up to the top (it's getting a bit long up there, so I'll tidy it up). I need LLMPredictor, which is a wrapper around an LLMChain from LangChain — we've seen a lot of LangChain demos, the whole idea of chaining things together, agents and so on — so let's use that. We also want to bring in ServiceContext, another interface that acts as a container for the predictor, the prompt helper and so on: if you read the help text it says the service context container is the utility container that holds all of this together — it holds your LLM predictor (that's going to be our custom LLM) and it also holds the prompt helper, which is something we set up as well, so might as well bring it in. And I'll also bring in GPTListIndex, since I know I need to create an index. Then we go back down to where create_index lives. First, use the wrapper around the LLMChain from LangChain: `llm_predictor = LLMPredictor(llm=LocalOPT())`, the same as we've done before. Then, remember we talked about the service context acting as a container for your LLM, your index and your queries, so let's bring that in: `service_context = ServiceContext.from_defaults(...)` with the llm_predictor we just created and the prompt_helper we created further up. If you want to read more about the service context you can look at the documentation — I know nobody actually wants to read documentation, but it explains that the service context container is a utility container holding the LLM predictor, the prompt helper and several other objects; I'll put a link to that as well. Now you have your LLM and your prompt helper, so what else do you need? You need the index — and where does the index come from? It comes from your documents: you have the documents and you create the index using them. The docs part is something you've seen me do countless times before: it's just a SimpleDirectoryReader pointed at a directory — in my case a folder called "news" in the current directory — followed by load_data(), which loads everything into docs. Creating the index is also really easy, and you've seen me do this before as well: `index = GPTListIndex.from_documents(docs, service_context=service_context)`. Sometimes this can take a while depending on how big your data is, so you could add a timer, or at least print "starting to create index" before and "done creating index" after. And of course you want to return the index — there's no point doing all of this if you don't return it. Then, under `if __name__ == "__main__":`, you call create_index(), it runs this code and returns an index, you assign it to a variable called index, and now you can use the index to run a query — for example, `response = index.query("Summarize Australia's coal exports in 2023")` — and then print the response. That is the simple backbone structure; a sketch of it is below.
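Continuing from the snippets above, here is a sketch of the create_index backbone just described, again using the mid-2023 LlamaIndex API (LLMPredictor, ServiceContext and GPTListIndex have since been renamed or removed in later releases):

```python
from llama_index import GPTListIndex, LLMPredictor, ServiceContext


def create_index():
    print("Creating index")
    # Wrap our custom LLM so LlamaIndex can drive it.
    llm_predictor = LLMPredictor(llm=LocalOPT())
    service_context = ServiceContext.from_defaults(
        llm_predictor=llm_predictor, prompt_helper=prompt_helper
    )
    # The documents live in a ./news folder next to this script.
    docs = SimpleDirectoryReader("news").load_data()
    index = GPTListIndex.from_documents(docs, service_context=service_context)
    print("Done creating index")
    return index


if __name__ == "__main__":
    index = create_index()
    response = index.query("Summarize Australia's coal exports in 2023")
    print(response)
```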
If you want to improve on this a little, you can add some timer functions to make sure the calls aren't taking too long, especially since this model is going to be pretty huge — if it's the 60-gigabyte model, just downloading it will take a while depending on your internet speed, probably 40 minutes to an hour. So let's add that: `import time`, and on my machine I have a nice utility decorator in my clipboard that exists purely to time functions — I show how to build it in one of my videos, so if you're interested in learning about decorators you can definitely go check that out. My helper function is just called timing, and the way you use the decorator is to wrap the whole function with it: it records the start time, executes the function to get the result, prints the time it took from start to end, and returns the result. The query itself isn't going to take very long, but if you want, you can also pull it out into a separate helper function called execute_query, with `response = index.query(...)` — basically the same thing as before, so copy it over and return the response. The main block can then just call execute_query and print the response; I'll comment out the old version. For the query you can do a couple of things. For example, you can use exclude_keywords to speed up the process a bit by telling it, "if any of the nodes contain any of these keywords, just ignore them completely." In this case I'm asking about coal exports, so for this demonstration I'll exclude "petroleum" — and I could also exclude "gas" or "oil" or whatever; I can pass a whole list of keywords. What this says is: among all the nodes in the index, if a node contains any of these terms, exclude it. That can speed up the query a bit, save you some money and some time. The opposite of exclude_keywords is required_keywords: since I'm asking about coal, clearly I want that word to appear. A lot of these are optional — you can choose not to use them, so maybe comment that one out. I'll put a question here: maybe I'll ask "Who does Indonesia export its coal to?" and exclude the keyword petroleum. With the combination of exclude_keywords and required_keywords you can preemptively filter out nodes that either do not contain the required keyword stems or do contain the excluded keyword stems, so you're reducing the search space, saving time and cost, and making the query overall a little faster. Since we already have the timing decorator, we'll apply it here too. While we're talking about querying the index: sometimes you do not want the text back — you just want it to tell you which docs, which nodes, to look at, and then you generate your own text. In that case you can also set a response_mode and specify that you don't want any text back: response_mode="no_text". It won't try to generate a nice natural language response; it will generate no text at all. I'll show you how that works once we save all of this — a sketch of the timing decorator and the execute_query helper is below.
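The exact timing decorator isn't shown in the video, so the one below is just a reasonable way to write it; the exclude_keywords, required_keywords and response_mode arguments follow the old index.query signature used in this tutorial:

```python
import time
from functools import wraps


def timeit(func):
    """Print how long the wrapped function took to run."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.2f}s")
        return result
    return wrapper


@timeit
def execute_query(index):
    return index.query(
        "Who does Indonesia export its coal to in 2023?",
        exclude_keywords=["petroleum"],   # skip nodes that mention petroleum
        # required_keywords=["coal"],     # optional: only keep nodes mentioning coal
        # response_mode="no_text",        # optional: return source nodes only, no answer text
    )
```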
One small thing I'll do before I run the demo9 script: I'm going to go back into LocalOPT and, since I'm just trying this out, comment out the 30-billion model and switch down to the 1.3-billion one. I'm still in development mode, just testing things, and running with the 60-gigabyte model would be very heavy and take a long time. The 1.3-billion-parameter model is about a 2.6-gigabyte download, so it should be substantially faster — at least maybe 25 times faster. To save myself even more time, I'm also going to use the os module so that the first time the script runs it creates the index, saves it locally, and from then on just uses the cache. So `import time`, then `import os`, then go all the way down to the end of the file and say something like `if not os.path.exists(...)` — this is demo nine, so the file is demo9.json. If the saved index isn't found, print "no local cache of the model, downloading from Hugging Face" just to tell the user to be patient — it's going to download from Hugging Face and that will take a while — then indent the create_index() call inside, and save the index to demo9.json. In fact, since I'm already repeating myself, a better thing to do is to define the file name once as a variable and reuse it wherever it's needed — try not to repeat yourself, so you don't make mistakes; it's also better practice. So if it can't find demo9.json in the local path, it goes ahead and builds the index, but if it does find it, it saves you some time — that's our fancy cache — and prints "loading local cache of the model" (of the index, really). The result should also be assigned to index, and here you use the same GPTListIndex, but instead of from_documents you load from disk, using the same file name variable — no repeating, no copy and paste. A sketch of this caching logic is below.
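And a sketch of the caching logic, using the save_to_disk / load_from_disk calls from the LlamaIndex version used in the video (newer versions persist indices via a StorageContext instead):

```python
import os

filename = "demo9.json"

if __name__ == "__main__":
    if not os.path.exists(filename):
        print("No local cache of the index found; downloading the model from "
              "Hugging Face and building the index. This will take a while...")
        index = create_index()
        index.save_to_disk(filename)
    else:
        print("Loading local cache of the index")
        index = GPTListIndex.load_from_disk(filename)

    response = execute_query(index)
    print(response)
```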
Now save all of this, clear the screen, and run demo9.py. Sure enough, it creates demo9.json for us, and if you click into the JSON you can see it. The reason it runs a little faster for me is that the model weights are already somewhere on my computer, so it finds them and uses those instead of re-downloading from the internet. That's your index, and all of this stuff in it — the index, the summaries and so on — I've explained in my other video, the embeddings video; it's about a 30-minute video and goes into a lot of detail about how this is done and the logic behind it, so if you want to study that, go and watch that video instead — I'm not going to repeat myself in this lesson. If you're wondering what any of this does, go look at the embeddings video. So let's go back to demo9. Now that demo9.json has been created, you might ask: what is the point of response_mode="no_text"? With no_text it doesn't try to generate a natural language answer to your question, and this is where you can swap in a different LLM: you keep the same index but change the LLM that handles this downstream step, so that once you have the index you can query it with no_text and use your own LLM to write the reply. So let's get back in here, and I'm going to remove response_mode="no_text". Now it is going to generate some text — it's going to use the OpenAI token to generate it — so I can expect a natural language response to the question. I'll save that and run demo9.py, and this time it should be faster because it's able to find demo9.json: it says the file exists, so it loads the cache, and sure enough it prints "loading local cache of the embeddings". And it says that Indonesia is the largest exporter of coal globally, with a 32.2% share in 2022. You can quickly do a Google search to verify the information — search for the largest exporters of coal globally — and if you read some fairly recent news, say from November 2022, four or five months ago: Indonesia exported about 478 million metric tons of coal, making it the world's largest coal-exporting nation, and Australia is the second-largest exporter of coal; Indonesia, Australia and Russia are the main providers of coal worldwide. So you can see it's doing a pretty reasonable job. The first part of the response is a bit confusing, though: it says Indonesia does not export coal, but then it says Indonesia is the largest exporter of coal globally with a 32.2% share in 2022, and that its main export destinations are Japan, India and China. So what's up with that first part? I have a theory: this data was collected over the last two years, and it's a very small snippet of it — the full corpus would be so much larger — and there was a coal export ban in Indonesia, a very temporary ban to hold back exports until domestic needs were met. That may have confused the LLM a little. You can tune this a bit: change things around, adjust the prompt helper — maybe change the chunk overlap, the input size, the output size — so I'm going to change the output to 512. Up to this point we've been doing a lot of Q&A, so let's change it up a bit by actually running a summarization task. Let's put the summarization prompt here — "Summarize Australia's coal exports in 2023" — but I also want to pass in a response_mode and say that this is a tree-summarize task (the call is sketched below). Nothing else changes; save all of that and run demo9.py. It's building the index from two chunks, two nodes, and let's see what it says when it's done running. Okay, let me bring up the terminal a little more so you can see the number of LLM tokens being used and the number of embedding tokens, which is going to be zero because it's just using the local cache of the embeddings — it's not generating new embeddings again, it's only generating words for you. And you'll notice my text is much, much longer, because I've also changed the max output from 256 to 512.
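The summarization call would then look something like the sketch below. I'm assuming the "summarize" response mode referred to here is LlamaIndex's tree_summarize, and that num_output in the PromptHelper was bumped from 256 to 512 to allow the longer answer:

```python
@timeit
def execute_summarize(index):
    return index.query(
        "Summarize Australia's coal exports in 2023",
        response_mode="tree_summarize",  # summarize over the matched nodes
    )
```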
So let's go and take a look at it. In 2023, Australia is expected to experience a decrease in coal exports due to the adoption of alternative markets by China; despite this, coal exports are still predicted to reach 128 billion dollars in 2023, and relations between China and Australia have mended, allowing coal shipments to resume. If you remember, there was a small episode between China and Australia where they started imposing trade restrictions on each other, so the relationship had cooled and has since warmed a little. You can continue reading: Japan was the top export destination for Australia in 2021, while exports to India rose by 13.6%; China is projected to be the largest coking coal buyer from Australia in 2023, with total coking coal import shipments expected to see a minor increase due to import growth from Australia and Indonesia. Does this summarize everything? It does summarize pretty well, right? If I'm a very busy executive working in this industry and I just want a quick snapshot of what's happening, I could rely on this — of course we'd want to fact-check all of it, especially when it comes to numbers like 128 billion or 13.6 percent, but it does a pretty reasonable job. So there are the Q&A use cases, there's summarization, and you can also pass in no_text. If you pass in no_text, then very likely what you want to do is print `response.source_nodes` and `response.get_formatted_sources()`. You may want to play with these two, because if you print the source nodes you can see which source node the information is being drawn from — maybe you have 500 nodes, and you want to know which of them contains the information. We talk about all of this in the embeddings video; if you haven't really had much exposure to LLMs in general, I recommend you watch that video — it will give you the right foundation, and then all of this will make a lot more sense because we dive into these ideas there. So, source nodes: if you want to know where the answer's information came from, take a look at those. I'll end with some further resources you can go and take a look at. The first one is a video — the thing about this video is that it's in Mandarin Chinese, so if you don't understand Mandarin you're going to have a hard time following it, but some of the slides are pretty self-explanatory. For example, slide 28 on instruction tuning: "I went to Jolin's concert last night and I really love her songs and dancing; it was ___." Here you're giving instructions, but you're prompting in natural language: continue to fill in the blank using one of these three options — positive, negative, or neutral. You're trying to teach the LLM that you expect a response from this pool of choices. Depending on how much time I have over the next few weeks, I may create some videos that dive into these concepts and explore LLMs at a more theoretical, less coding-heavy, more conceptual level. But for now, I hope you enjoyed this series and that you learned something new. If you have any questions, feel free to leave them in the comment section and I'll try to answer them as best I can. Have a nice day, see you!
Info
Channel: Samuel Chan
Views: 21,655
Id: qAvHs6UNb2k
Length: 32min 27sec (1947 seconds)
Published: Tue May 02 2023