Run Huge GPT Models on Any Device with Torrent-GPT

Captions
today we're going to be looking at an amazing project called Petals that lets you run large language models at home, BitTorrent style. Just like BitTorrent, the idea is very simple: you share your resources with other people. You load a small part of the model and then team up with people serving the other parts to run inference or fine-tuning. As a result, you can run huge models like LLaMA 65B, Guanaco 65B, or even BLOOM, a 176-billion-parameter model, at a reasonably good speed on the Petals network, because people are pooling their resources. Here, for example, you can see the different models being served: the Llama 2 70-billion-parameter model, LLaMA 65B, BLOOM, BLOOMZ, and so on. It's an interesting idea, and if you want to learn more about it I'd recommend the paper "Petals: Collaborative Inference and Fine-tuning of Large Models", where the authors go into a lot more detail about the architecture and how both inference and fine-tuning are run.

in this video I'll first show you an online demo hosted by the authors, and after that I'll show you how to run this locally on your own machine. We're talking about huge models such as LLaMA 65B, which you cannot fit on a consumer GPU, but thanks to this project we will not only be able to run it, we'll get pretty decent speed. So let's get started.

first we'll look at the hosted demo, which you can access at chat.petals.dev; I'll put the link in the description of the video. You'll notice there are three families of models hosted here: Llama 2, the original LLaMA, and the BLOOM models, which are the biggest of them all. For this simple demo I'm going to select Llama 2, and we'll use the chat model, not the base model. Here is a simple question: "What is the capital of Canada?" The purpose of this experiment is just to look at the generation speed; I'm not really looking at accuracy or the length of the generated text. The speed is actually pretty good: it was almost six tokens per second, which is amazing considering we're running a 70-billion-parameter model. The great thing is that you can actually chat with the model, so you can ask follow-up questions and it will generate responses. For this demo there seems to be a limit on the maximum number of tokens it can generate, so it won't produce huge responses, but considering it's running on a distributed network of GPUs that people are contributing, the speed is absolutely great.

the real benefit, though, is that you can use this through the Python API and integrate it into your own apps, so let me show you how to install it and run it from your own Python code. For that we go back to the official repo of the Petals project, which says "run large language models at home, BitTorrent style" and promises fine-tuning and inference up to 10 times faster than offloading. So if you only have a modest GPU, this is going to be much faster than trying to run these models locally with offloading. Personally I haven't experimented with the fine-tuning option, but if you can fine-tune these models, that's going to be absolutely amazing. To run the models you first install the official petals package, and then you can use it in pretty much the same way you use a regular Transformers model.
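As a quick orientation before the walkthrough, here is a minimal setup sketch based on what's described above; the package and class names come from the Petals project, and everything else about your environment (Python, a working PyTorch install, ideally a GPU) is assumed:

```python
# One-time setup: install the Petals client (it pulls in Transformers and PyTorch).
#   pip install petals

from transformers import AutoTokenizer              # tokenizer comes from Transformers as usual
from petals import AutoDistributedModelForCausalLM  # distributed drop-in for AutoModelForCausalLM
```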
you'll still use the AutoTokenizer from Transformers, but to run the model you use the AutoDistributedModelForCausalLM class from the petals package. It's a community-run system: people share their GPUs to serve the models. To run a client you'll want a GPU yourself, and whenever you're not using it, you can connect it to the Petals network and contribute to the network's capacity; the more people contribute, the better the capacity and inference speed get. Since I don't have an NVIDIA GPU on my M2, I'm going to try this in Google Colab, but the process should be pretty similar if you're running it locally on your own GPU.

okay, so the authors have provided a really detailed Google Colab notebook that walks you through the process step by step. First of all, I'd recommend saving a copy of the notebook and making sure you're running it on a GPU runtime, because you need a GPU; then simply hit connect and you'll be connected to your Colab instance. I've already created a copy here and it's already connected to a GPU. Once you're connected, run pip install petals to install the petals package, and then you can start experimenting with the models. At the same time, I think it's a good idea to monitor the resources available in Colab, both system RAM and GPU RAM, to see what kind of load this puts on your machine.

okay, so let's look at how you use the models. Again, you simply import the torch library and AutoTokenizer from the Transformers package, and then you load the AutoDistributedModelForCausalLM class from the petals package. According to the authors, your machine downloads only a small part of the model weights and relies on other computers in the network for the rest of the model. So you're not downloading the whole model, just a part of it, and since you're part of the network, you use resources from other computers and other GPUs to run these models. The rest of it is very similar to the plain Transformers package. In this case we're downloading the original 65-billion-parameter LLaMA, so let's run this and see what happens. If you notice, it literally downloaded about a one-gigabyte file; the rest of the layers live on the network, which it will access at generation time. Here we're using, as I said, the original 65-billion-parameter model, but you can use the Llama 2 models if you want.

to do text generation you simply call the generate function on the model object and provide the inputs. The input text is "The capital of France is", so this is text-completion mode, and we're asking the model to generate up to five new tokens. The response it generated is "The capital of France is Paris". When you run the first query it takes a little while because it has to connect to the network, but once it's connected, the generation speed is pretty fast, as you can see here. In terms of resources, we're using only a fraction of both the system RAM and the GPU RAM, and in the background we're running a 65-billion-parameter model, which is simply outstanding.
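Putting those steps together, here is a minimal sketch of the completion example described above. The exact checkpoint name ("enoch/llama-65b-hf") is my assumption; any LLaMA, Llama 2, or BLOOM checkpoint served on the public swarm should work the same way:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Checkpoint name is an assumption; swap in a Llama 2 or BLOOM repo if you prefer.
model_name = "enoch/llama-65b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Only a small slice of the weights (~1 GB in the video) is downloaded locally;
# the remaining transformer blocks are served by other peers in the swarm.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Text-completion mode: ask for up to five new tokens.
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)  # the first call also connects to the swarm
print(tokenizer.decode(outputs[0]))  # -> "The capital of France is Paris"
```

The first generate() call is the slow one because the client has to discover and connect to peers; subsequent calls reuse those connections and run noticeably faster.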
the great thing about this project is that it's not just a demo: you can build powerful applications by running these models in a distributed fashion. The notebook provides a simple example of how you would create a chatbot with this model: it accepts input from the user and generates a response (a rough sketch of such a loop follows below). So let's ask the same question as before: "What is your name?" You can see the model generates a response; it says "My name is HAL 9000", which is funny, since it came up with a reference to 2001: A Space Odyssey. You can keep asking questions from there, so this is a simple example of how you can build applications on top of these models and integrate them into your own projects. It also says "I was created by humans", which is great.

I'd recommend everyone check out this amazing project and see if you can contribute your resources. The Google Colab notebook goes into a lot more detail, and if there is interest I can create follow-up tutorial videos, maybe even figuring out how to fine-tune models this way. One thing I do want to highlight is that you can create your own private swarm if you're working with sensitive data. In that case it's essentially a sub-cluster that is private to you and run by people you trust, which is a great feature to have: you can still run models in a distributed fashion, but only among people you trust with your sensitive data. I hope you found this video useful. Check out our Discord server to stay updated on everything happening in the generative AI space. Thanks for watching, and see you in the next one.
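For completeness, here is a rough sketch of the chatbot loop mentioned in the captions, reusing the tokenizer and model loaded in the earlier snippet. The "Human:/Assistant:" prompt format and the 64-token reply limit are my assumptions, and the official notebook uses a more efficient inference session rather than re-running generate() over the whole history each turn:

```python
# Minimal chat loop: keep the dialogue in a string and ask the model to continue it.
# Assumes `tokenizer` and `model` are already loaded as in the previous snippet.
history = ""
while True:
    user_input = input("You: ")
    if not user_input:            # empty line ends the chat
        break
    history += f"Human: {user_input}\nAssistant:"
    input_ids = tokenizer(history, return_tensors="pt")["input_ids"]
    outputs = model.generate(input_ids, max_new_tokens=64)
    # Decode only the newly generated tokens for display.
    reply = tokenizer.decode(outputs[0, input_ids.shape[1]:]).strip()
    print("Bot:", reply)
    history += f" {reply}\n"
```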
Info
Channel: Prompt Engineering
Views: 10,778
Keywords: prompt engineering, Prompt Engineer, natural language processing, GPT-4, train gpt on your data, llama 65b, llama llm, petals, petals ml, torrent gpt, ai, llm, large language model, openai, chatgpt, gpt4, llama-70B, llama 2, bit torrent style, decentralized.
Id: 1iepzvhYeg4
Length: 9min 35sec (575 seconds)
Published: Fri Jul 28 2023