LangChain: Run Language Models Locally - Hugging Face Models

Video Statistics and Information

Captions
Hugging Face provides a large number of open-source models that you can use in your own applications, and this includes a number of large language models. LangChain, on the other hand, is a powerful framework that lets you build apps on top of large language models. In this video I want to show you two ways in which you can use large language models from Hugging Face with LangChain. The first option is to access models hosted on Hugging Face through an API: in this case the Hugging Face Hub is the server that hosts the models, and we make API calls to access them. The second option is to download the models locally and then use Hugging Face pipelines to interact with them.

Using the models through API calls is great because you don't have to install anything locally; you just make API calls, you get the responses from the models, and you have a wide selection of models to choose from. For example, they even provide API access to StableLM from Stability AI or Dolly from Databricks. On the other hand, in some cases you might want to run these models locally: if you want to fine-tune them, you need to do that locally, or if you have a powerful enough GPU you can also run inference locally. In this video I'm going to show you how to interact with these models using both options in LangChain. Let's look at the Google Colab notebook; I will put a link to it in the description of the video.

There are two types of models that Hugging Face supports. The first one is text2text-generation models; these are sequence-to-sequence models, also called encoder-decoder models. The second one is text-generation models, or decoder-only models. If you want to see what types of models are available on Hugging Face, go to Hugging Face, click on Models, and under NLP (natural language processing) you can look at the different task types that are available. You have the text-generation models and the text2text-generation models, and for each of them there is a huge selection to choose from.

First we install the required packages. In this case we're installing langchain, huggingface_hub for the API calls, and, since we also want to do a local install, transformers, plus accelerate and bitsandbytes for loading the models. We don't need sentence-transformers here; that is useful when you are doing embeddings, but I am not going to be discussing embedding models in this case.

Next you want to get your Hugging Face API token and set an environment variable using that token. If you don't know how to get the token, go to your Hugging Face account, click on Settings, then go to Access Tokens. Click on New token, give it a name, and click Generate token; that will generate a token for you. After that you can simply click the copy icon to copy the token to your clipboard, then come back here and paste it in.
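As a minimal sketch of that setup: the package list comes from the captions, the token value is a placeholder you replace with your own, and HUGGINGFACEHUB_API_TOKEN is the environment variable that LangChain's Hub wrapper reads.

```python
# Packages mentioned in the video:
#   pip install langchain huggingface_hub transformers accelerate bitsandbytes

import os

# Paste the access token generated under Settings -> Access Tokens on huggingface.co.
# Keep the real token secret; "hf_..." here is only a placeholder.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_..."
```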
First, let's look at an example of how you can access a model from the Hugging Face Hub through the API. In this case we are going to use a prompt template. If you don't know what a prompt template is, think of it as a simple f-string in Python: you pass in the template text, and inside the curly brackets you have the variable for the question or prompt that you want to provide to the model. Whenever a user provides that specific variable, it gets pasted in there and the rest of the prompt stays the same. To build the prompt template you pass in the template and the input variable you're expecting. I have a detailed video describing prompt templates, so I will put a link to that video.

Next we define a chain. The chain has two components: the first is the prompt, that is the prompt template we designed, and the second is the large language model that we want to use. In this case I am using google/flan-t5-xl. If you don't know where that ID comes from, let me show you: go back to Hugging Face, click on Models, and since this first example is a text2text-generation model, filter for that task. Here is a list of all the available encoder-decoder models — sequence-to-sequence models, text2text-generation models, all these terms are used interchangeably. Select one of them and simply copy the model ID by clicking the copy icon next to the name. The page you land on is the model card, where you can read about the whole model. For example, this one supports multiple languages; it's a sequence-to-sequence model, and one common application of sequence-to-sequence models is language translation, which is why it supports multiple languages. That's how you can choose a model based on your requirements, and you can even run an API call right there on the page to check what the inference looks like: it makes an API call and shows you the response.

Going back to our Colab notebook, we simply create a chain for our large language model. In order to run it, you pass a question, which becomes part of your prompt template, and then you call the run function on your LLM chain and it generates a response. In this case the question was "What is the capital of England?" and the response is "London is the capital of England. The final answer is London."

One great thing about accessing Hugging Face models through the API is that you can use pretty large models, because you're not running them locally. You can use the extra-extra-large flan-t5-xxl model, for example, and there is no constraint on your side; the only constraint is on the server side. If you were hosting it yourself, say on Spaces, you would want a powerful GPU running there, otherwise it's going to generate responses on CPU only. Here is another example: I asked "What is Area 51 famous for?" and the response is "Area 51 is famous for being the location of the U.S. government's secret space program. The U.S. government has been constructing a space station in Area 51. The final answer is space station." The quality of the response is going to depend on the quality of the model that you're using through the Hugging Face API.
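Putting those pieces together, here is a minimal sketch of the API option. It assumes the langchain interface from around the time of the video (PromptTemplate, HuggingFaceHub, LLMChain); the exact template wording and the model_kwargs values are illustrative, not copied from the notebook.

```python
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFaceHub

# The {question} placeholder gets filled in at run time;
# the rest of the template stays the same for every call.
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

# repo_id is the model ID copied from the model card on huggingface.co/models.
hub_llm = HuggingFaceHub(
    repo_id="google/flan-t5-xl",   # swap in google/flan-t5-xxl for a larger model
    model_kwargs={"temperature": 0.1, "max_length": 64},
)

llm_chain = LLMChain(prompt=prompt, llm=hub_llm)
print(llm_chain.run("What is the capital of England?"))
```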
Next, let's look at how you can run these models locally using Hugging Face pipelines. In this case we import HuggingFacePipeline from LangChain, and from transformers we import AutoTokenizer, AutoModelForCausalLM, pipeline, and AutoModelForSeq2SeqLM. Local inference is a little more involved compared to just using the API call. First you need a model, which is defined by the model ID; in this case I am using the flan-t5-small model, because I don't really have enough VRAM to support a bigger one, so for this experiment we're going to choose a very small model.

Next you need a tokenizer trained specifically for this model; that's why we call AutoTokenizer.from_pretrained and pass the model ID of the model you want to use. Every model has its own tokenizer, so you need to be careful to initialize the tokenizer specific to that model. Then, depending on the type of model you chose — whether it's a text2text-generation model or a text-generation model, meaning an encoder-decoder model or a decoder-only model — you need to choose the corresponding model class. For this case we're using AutoModelForSeq2SeqLM, which is the encoder-decoder class, and in from_pretrained you pass the model ID. If it's a big model you probably want to load it in 8-bit rather than the full precision in 16 or 32 bits; if it's already a quantized model, say quantized to 4 bits, you don't want to do either. The device map is "auto" in this case; it depends on how many GPUs you have, and if you have multiple GPUs and want to use them you can provide the device map here. So you simply initialize your model like this.

Next we put everything together in a pipeline. First we define the type of pipeline: is it text2text-generation or text-generation? The model we're using is a text-to-text model, so the pipeline task is text2text-generation. Then you define the model, the corresponding tokenizer, and the max length. Next, using the HuggingFacePipeline class from LangChain, we simply create a local large language model. In order to get a prediction from this model there are two ways you can do it: either you pass the prompt directly to the local LLM that we created, or you use a chain. Again I'm using a chain, and if you run this, the response is "England is the capital of England. The capital of England is London. So the answer is..." Keep in mind we are running a very small model; that's why the answer is not really coherent.

In this last example we will look at how you can run a decoder-only model. For this example I am using the gpt2-medium model, which is a pretty old model, so it's not really great, but I just want to show you how to use a text-generation model. The tokenizer again needs to be pre-trained for this specific model, so we pass the model ID, but now the model class is AutoModelForCausalLM, not a sequence-to-sequence LM, and you pass the model ID again. In the pipeline we have text-generation now instead of text2text-generation, then the model, the tokenizer, and again the maximum length that you want. You simply define the local LLM using HuggingFacePipeline from LangChain, and from there the rest of the prediction process is very similar: you define a chain with the prompt and your local LLM, and then you simply run the chain on your prompt.

I hope this video is useful. If you have any questions, please comment below; I'll try my best to answer them. As usual, if you like this video, consider subscribing to the channel and liking the video; it helps with the algorithm. If you want to learn more about LangChain, I have a crash course on it, so check out the next video. Thanks for watching, see you in the next one.
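For reference, here is a rough end-to-end sketch of the two local-pipeline variants described in the captions above. It assumes the transformers and LangChain classes named there; the prompt template is a stand-in, the 8-bit flag is shown commented out because it needs bitsandbytes plus a GPU, and the max_length values are illustrative.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    AutoModelForCausalLM,
    pipeline,
)
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

# Simple stand-in template; the {question} slot is filled when the chain runs.
prompt = PromptTemplate(
    template="Question: {question}\n\nAnswer:",
    input_variables=["question"],
)

# --- Encoder-decoder model (text2text-generation), e.g. flan-t5-small ---
t5_id = "google/flan-t5-small"
t5_tokenizer = AutoTokenizer.from_pretrained(t5_id)  # tokenizer must match the model
t5_model = AutoModelForSeq2SeqLM.from_pretrained(
    t5_id,
    device_map="auto",      # spread the model over whatever GPUs/CPU are available
    # load_in_8bit=True,    # for larger models; requires bitsandbytes and a GPU
)
t5_pipe = pipeline(
    "text2text-generation", model=t5_model, tokenizer=t5_tokenizer, max_length=128
)
local_t5 = HuggingFacePipeline(pipeline=t5_pipe)
print(LLMChain(prompt=prompt, llm=local_t5).run("What is the capital of England?"))

# --- Decoder-only model (text-generation), e.g. gpt2-medium ---
gpt2_id = "gpt2-medium"
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_id)
gpt2_model = AutoModelForCausalLM.from_pretrained(gpt2_id, device_map="auto")
gpt2_pipe = pipeline(
    "text-generation", model=gpt2_model, tokenizer=gpt2_tokenizer, max_length=128
)
local_gpt2 = HuggingFacePipeline(pipeline=gpt2_pipe)
print(LLMChain(prompt=prompt, llm=local_gpt2).run("What is the capital of England?"))
```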
Info
Channel: Prompt Engineering
Views: 35,096
Keywords: prompt engineering, Prompt Engineer, natural language processing, GPT-4, chatgpt for pdf files, ChatGPT for PDF, langchain openai, langchain in python, embeddings stable diffusion, Text Embeddings, langchain demo, long chain tutorial, langchain, langchain javascript, openai, vectorstorage, train gpt on your data, train openai model, train openai with own data, langchain tutorial, langchain ai, chatgpt prompt engineering, prompt engineering course, prompt engineering chatgpt
Id: Xxxuw4_iCzw
Length: 12min 9sec (729 seconds)
Published: Tue Apr 25 2023