Run a ChatGPT-like AI on Your Laptop Using LLaMA and Alpaca

Video Statistics and Information

Captions
Hey there, my name is Gary Sims and this is Gary Explains. Now obviously we're going through quite a peak period with generative AIs: we've got ChatGPT, you've got Bing, you've got Bard, which was a video I did yesterday. And the thing is, all these chat models, all of these machine learning models, run up in the cloud, on Azure and platforms like that, on big GPUs. But did you know it is possible to run a pretty good ChatGPT equivalent on your laptop? No need for any special graphics card, no need for any special hardware, just a normal laptop. So if you want to find out more, please let me explain.

So we're going to take a quick look at the current language models and the new models that can be run on a laptop, then I'm going to give you a demonstration of one running so you can see how good it actually is, and then we're going to talk about why this is important. Trust me, it's more than just an interesting exercise that somebody was able to hack together; there are actually some important consequences to this. Okay, let's dive in and have a look.

Okay, so to get these generative pre-trained transformers, GPTs, these language models, running on a PC and not on a huge supercomputer, we're going to be looking at LLaMA and Alpaca. We'll get into those in a moment; let's just cover some of the basics. All machine learning models, all neural networks, are used in two phases. First there's the training stage, where the model is fed huge amounts of data which trains the model to understand whatever it is it's being trained for: is this a cat, is it not a cat; is this a hot dog, is it not a hot dog. Then second there's the inference stage, where the model is interrogated or queried and it gives a response, in effect an answer, based on the input that you give it. So up until now that's mainly been about classification: is this a picture of a cat, yes or no? But of course we also have things that generate an output, so you translate audio into
text. So when we speak to the various assistants there are on Android and on these smart home devices and so on, at some point there has to be a translation from the audio into some kind of text or some kind of internal representation so that the device can reply. And the real hot topic at the moment, of course, is these models that can generate text and generate images, and I'm sure we're going to get other types of generation coming out all the time. I saw a recent thing about how you can generate robotic movements for factories, for picking up boxes and things: based on an input, it generates the right sequence of movements. So: generate something based on the input.

Now, training is a single-user task, in that while the model is being trained it isn't required to do anything else; it just sits there and learns. So there's one model and it's just being trained. It can take thousands of hours and it can take crazy amounts of hardware. Some of the numbers floating around, I think this was for GPT-3, are something like 285,000 CPU cores and over 10,000 NVIDIA GPUs, with 400 gigabit-per-second network connectivity, all humming along for hours and hours, days and days, as all of this data is fed into the system and it is trained. Obviously you're not going to be doing that on a PC of any kind, because it's a huge undertaking.

Now, inference is a multi-user task, in that lots of queries are happening at once over multiple copies of the trained model. This requires lots of hardware, mainly because of scale, but also because of the sheer size of these models: 175 billion parameters, and I think we mention that again in a later slide. ChatGPT is often unavailable, or has been in the past, due to demand, because it needs huge amounts of hardware to run, simply because millions of people are using it. Now, this is a common
example of what would be used for both training and also for inference. And just as a side note, NVIDIA really are coming out on top of this whole current era of GPTs and that kind of stuff, because everyone is using NVIDIA GPUs. This is the NVIDIA H100 system, with 640 gigabytes of total GPU memory, and it's got four NVSwitches so it can connect all the different parts inside of it, 56-core fourth-generation Intel processors, and 30 terabytes of SSD storage. This is one pod that you can buy, and then these can be built together into a huge supercomputer; there could be many rows of these. Of course this is all happening somewhere in a data center, and it becomes available to companies like OpenAI via Microsoft Azure, or via other services that are offering these in the cloud.

Now, large language models: ChatGPT 3.5 and GPT-4 are examples of what they call LLMs, large language models. There are 175 billion parameters in GPT-3.5. A parameter means a weight or a bias, and the number of parameters in a neural network relates directly to the number of neurons and the number of connections between them. So the more parameters there are, the denser and the greater the network is, and you need supercomputers, as I said, for both training and for inference at 175 billion parameters.

But now some new language models have appeared. First we're going to talk about LLaMA. This is the one from Meta, that's the Facebook people: a collection of foundational language models ranging from 7 billion to 65 billion parameters, so at least two and a half times smaller than GPT-3.5. And Meta says that LLaMA with 13 billion parameters outperformed GPT-3 while being 10 times smaller. So this is the point: we've got these huge language models and they're working surprisingly well; can we optimize them and bring them down to smaller sizes, can we fine-tune them, can we cut away anything that's not really needed, so that we get a more efficient model?
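To make the "parameters" idea concrete, here's a tiny sketch of how weights and biases add up in a fully connected network. The layer sizes below are invented purely for illustration and aren't taken from any real model:

```python
# Parameters in a fully connected layer = weights + biases:
# one weight per connection, one bias per output neuron.
def dense_layer_params(n_in, n_out):
    return n_in * n_out + n_out

# A toy network with layers 512 -> 1024 -> 1024 -> 256
sizes = [512, 1024, 1024, 256]
total = sum(dense_layer_params(a, b) for a, b in zip(sizes, sizes[1:]))
print(total)  # 1837312 -- under 2 million, versus ~175 billion in GPT-3.5
```

Even this toy network racks up nearly two million parameters, which gives a sense of how quickly the count explodes at LLM scale.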
Now, if it's 10 times smaller, then even just using simple math, you're going to need a computer that's 10 times smaller, so maybe you don't need a supercomputer to run one of these language models. And there is an open-source project called llama.cpp; it's a C/C++ program that allows you to query the LLaMA model, and the goal of the author was to run the model on a MacBook. So it's a plain C/C++ implementation without any funny dependencies; Apple silicon is a first-class citizen, but it does also support things like AVX2 for x86 architectures. I've tried this personally on some Jetson hardware, the Jetson Orin and the Jetson Xavier NX, and it works great on those. It also runs only on the CPU, so it will be interesting to see if people take this and make a GPU version of it; for now, it only runs on the CPU.

And the way it can run on the CPU is by using 4-bit quantization. So what is 4-bit quantization? Let's have a look at the next slide. It is a process of reducing the precision of the parameters, again, that's the weights and the biases we talked about, so that they consume less memory. In other words, quantization is the process of taking a neural network, which generally uses 32-bit floats to represent the parameters, and converting them to a smaller representation, in our case 4-bit integers. Obviously there is a lot less precision there. If you want an analogy, think back to the days of computers in the 80s and 90s with 4-bit color: if you showed a picture in 4-bit color, it looked pixelated and the colors were not very good. Nowadays we use 24-bit or 32-bit color in just about everything, so that's not a problem. Take that idea to a neural network: more blocky, fewer colors, not as good a representation. However, if you see a 4-bit color picture of a cat, you still recognize it as a cat.
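The idea can be sketched in a few lines of Python. This is a deliberately simplified, whole-tensor version of round-to-nearest quantization, just to illustrate the principle; llama.cpp's actual scheme works on small blocks of weights, each with its own scale factor:

```python
def quantize_4bit(weights):
    """Map floats to 4-bit signed integers (-8..7) plus one scale factor."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Precision is lost: you only get back an approximation of the originals.
    return [x * scale for x in q]

w = [0.12, -0.5, 0.33, 0.07]
q, s = quantize_4bit(w)
print(q)                 # [2, -7, 5, 1]
print(dequantize(q, s))  # roughly the original values, with rounding error

# Memory payoff: 32 bits -> 4 bits per weight is an 8x reduction, so a
# 7-billion-parameter model drops from ~28 GB to roughly 3.5 GB
# (plus a little overhead for the scale factors).
print(7e9 * 4 / 8 / 2**30)  # about 3.3 GiB
```

That back-of-the-envelope 3.5 GB figure is consistent with the roughly four gigabytes of RAM the laptop demo uses.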
Now, the result is that the model is smaller, faster and more efficient, but the drawbacks are accuracy degradation and lower precision, which means you can also lose information. So when we come to the test in a minute, you do have to remember that this is a model that's been reduced down to four bits, and also that it's running on a PC; let's see what we can achieve.

So here's a question for you: what case do you use on your phone? I recommend Phoenix cases; they are ultra slim. Why put on a big thick chunky case and ruin all that design the manufacturer has spent so much time perfecting? You really should check out Phoenix cases, and there's a link in the description below. If you do use it, it's an affiliate link, which means you help out this channel.

And the other model we're going to look at, in fact the one I'm going to use for the demo, is Alpaca. LLaMA, Alpaca, they're all following the same naming theme. So Alpaca is the LLaMA model, but it's been fine-tuned to make it an instruction-following model: you give it an instruction and it follows it. It was trained on 52,000 instruction-following demonstrations. And there's another project called alpaca.cpp, a set of modifications to llama.cpp to work with Alpaca, along with a chat interface. All the links to these GitHub repositories will be in the description below. They're really easy to set up if you know anything about compiling a piece of code from GitHub, and downloading the files that you need for the models, the weights and parameters we were just talking about, is actually pretty easy; the instructions are pretty good.

Okay, so we're going to do this now. We're going to run Alpaca, and I want to point out that this is just running on an Asus laptop, not a very expensive one. It's got a Ryzen 5 5500U, a 6-core, 12-thread processor, with 16 gigabytes of RAM, and it's running Linux. In fact, when it's running it uses about four
gigabytes of RAM. Now, when you use some of the other LLaMA models they can get quite big; I did run the 30-billion-parameter one on a Jetson Orin, because it has 32 gigabytes of memory, and that did work on there. But just to show you how good Alpaca is on an Asus laptop, let's do that now.

Okay, so here we are inside of the project. There are just some C files and some binary files and some other things that you need here, and basically there's this little program called chat. So you just run chat, and it will go ahead and load the model into memory, which as I said takes about four gigabytes, and then it'll give you a prompt. So now we can type. This is just like ChatGPT, in that I can just ask it questions, and I've prepared some. Let's go with the first one: what are the best places to visit in New York? Remember, I'm asking this of a language model that's just running on a laptop in four gigabytes of memory, so if this can come up with anything that's even reasonable English, and even has something in it that's actually correct, that is quite good, this is astounding. So here we go: the top destinations for tourists include Times Square, Central Park and the Metropolitan Museum of Art, among others. Well, there you go. Not a very long answer, but it is a correct answer, and I'm running this on a laptop, not on a supercomputer somewhere up in the cloud.

Okay, let's make things a bit more difficult. What we're going to ask about is Pinocchio, which is something I've been doing in my other videos on ChatGPT and Bard and so on: explain the plot of Pinocchio in one sentence. Let's see. Now, I haven't told it what Pinocchio is; I'm assuming it knows, that it's got the story built into it. There: Pinocchio is a wooden puppet who dreams of becoming human, but must prove himself worthy by completing various tasks before he can be transformed into a real boy; the story follows his adventures and misadventures, as well as the lessons that he learned along the way. Absolutely right. So there you go; this is so amazing, it's running on my laptop. I don't need all those huge things running up in the cloud. I'll talk more about why this is important at the very end, so do stick around for that, because it really matters for knowing where these generative models are going and what it means for us as consumers.

Anyway, let's ask another one: what is Wi-Fi 6E? Now, I don't know what the training cutoff date for this model is, so how much does it know about Wi-Fi 6E? Wi-Fi 6, or 802.11ax, it says, is a representative upgrade in wireless technology offering faster speeds. Now that's all true, but I asked about the 6E version. Wi-Fi 6, it goes on, is an extension that has support for up to eight antennas on each access point. Of course, the real thing about 6E is that it runs at 6 gigahertz, not 2.4 gigahertz or 5 gigahertz. So okay, the model didn't know that, but the text it generated is readable English, it's about Wi-Fi, and it's got some interesting facts in there. Remember, this is just running in four gigabytes of memory, so I'm impressed. Of course this isn't good enough for a final product, but it shows you the direction things are going. As I said, wait until the end and we'll talk more about what this means.

Next one: it's 5 a.m., you're asleep, and there's a loud knock at the door. It's a friend who has come over to have breakfast. In your fridge are eggs, bread, milk, juice and a jar of honey. What do you open first? I've used this in the other videos about ChatGPT, Bard and so on. Of course, one answer is that you should open the front door first, because your friend is standing there; it's a kind of kids' riddle. Other people reply that you should open your eyes first, because you are asleep. Anyway, let's just see what this does; the point is, if this can even understand a little bit of what I'm saying here, that is just absolutely amazing. It may not get the trick, the joke, about the fact that you should open your eyes first or open the door first. However: opening up the fridge for food can be tricky when there's so much variety in one place, it says. Absolutely. If I had to choose between opening an egg or some freshly baked bread from my friend's visit, it would have to go with eggs; egg whites are a great source of protein and vitamins, blah blah blah. Okay, so now it's spewing out a bit of extra text here that's come from the inference process: instruction, write down three things you can do today in order to make your life easier tomorrow, and so on. So clearly there's some work to be done here about not letting this thing wax lyrical, because it's just generating text and it can kind of trick itself into generating more text, which is what we're seeing now.

Okay, the final test for this model: write a list of five names for characters in a sci-fi novel. Let's see whether it can understand what a sci-fi novel is and whether it can generate names for me or not. One: Dr Catherine Smith, PhD, a brilliant scientist who has dedicated her life to finding the cure for an epidemic that is ravaging humanity's population. Okay, so yep, that's a good character. Two: Dr John Smith, MD, Catherine's husband. Oh, look, it's made a relationship between the first character and the second; he's also trying to find a cure. This is very interesting; it's connected the two together. Three: Captain James Jones, a brave captain. And remember, this is all coming out like ChatGPT, one word at a time, but it's all coming out of my laptop. Four: Dr Sarah Smith, PhD, Catherine's daughter. We've got a whole family connection thing going on here; this is quite clever, it's creating a plot for me. Maybe this is a real book, I don't know; some of you sci-fi fans who've read more than I have, maybe you'll know if these are real characters. And number five: Commander John Jones, a brave commander of a mission to explore new worlds. Fantastic. I think that's absolutely amazing: we expect so much from ChatGPT, but this is running on my laptop.

Anyway, let's talk about why it's important that this is just running on my laptop. Well, in the general history of computing we always get this swinging back and forth between running things locally and running things remotely. Way, way back, when we had mainframes, there were terminals, and then there were workstations that connected to the mainframes, and then ultimately we had the era of the PC, where the power came back to the local user. And then, as we go on, we've now got cloud services, so everything's done up in the cloud. But even then, with Google's Pixel phones, they want to bring a lot of the processing back to the actual phone itself and not up to the cloud. So there's always this swinging back and forth. Now, at the moment we're in this great boom period for generative AIs, and it's all running in the cloud. But if we can create models that are small enough, particularly models that are good for a particular domain, so they're only experts in one little area, and have those running on a smartphone, on a laptop, as an assistant, even in some kind of consumer gadget, that would be really, really good. And the way this research is going, it's showing that it may not only be that these models run up in the cloud; we could actually run them usefully, and use them effectively, on a local device. And I find that really interesting.

That's it. My name is Gary Sims, this is Gary Explains, and I really hope you enjoyed this video. If you did, please do give it a thumbs up, and if you like these kinds of videos, why not stick around by subscribing to the channel? Okay, that's it,
I'll see you in the next one
Info
Channel: Gary Explains
Views: 83,695
Keywords: Gary Explains, Tech, Explanation, Tutorial, Chat-GPT, GPT-4, Bing, Bard, LLaMA, Alpaca, Instruction-Following Model, Language Model, LLM, chatbot, large language model, generative pre-trained transformer, GPT, Machine Learning, OpenAI, Meta, Stanford University
Id: OLg2BKhwaSc
Length: 17min 11sec (1031 seconds)
Published: Fri Mar 24 2023