I Ran ChatGPT on a Raspberry Pi Locally!

Video Statistics and Information

This is an LLM, except it isn't running on one of OpenAI's servers in the cloud somewhere. In fact, it's actually running locally, right over here on this Raspberry Pi, which fits in the palm of my hand. That means it's 100% private, secure, and doesn't even need to be connected to the internet in order to work. And if that scares you, join the club.

Today I'm going to show you step by step how to run massive large language models on the tiniest of computers. We'll also discuss the most common mistakes technologists make when looking to incorporate gen AI, and how this approach will fix them. We'll also cover the hidden issues of ChatGPT that most people ignore. By the end of this video, you'll be able to deploy state-of-the-art LLMs on basic computers like laptops and even SBCs. You won't need to download the weights off of sketchy BitTorrent; I have them for you here. Nor will you need to send your ChatGPT queries to a private company in the cloud somewhere. You won't even need internet. You also won't have to set up accounts or generate API keys with OpenAI.

But this video is much deeper than just the tech. It's about the future and how AI is poised to transform the world as we know it. It's probably an understatement to say that gen AI is hot right now. It's literally everywhere. The technology is not just a game changer, it's a revolution in the making, and it's disrupting industries left and right. This isn't just another fleeting trend, it's the future. In fact, those last two sentences? Yeah, I didn't write those. An LLM that's baked directly into my note-taking app, Notion, did. Pretty cool, right?

But there's a problem. See, the compute resources and money needed to bring this to life were fairly substantial. In fact, it took nearly $5 million to train ChatGPT, and it's estimated that GPT-4 is roughly 800 GB and requires something like 50 GB of VRAM in order to load the model. These technical characteristics mean it needs to run on specialized hardware in a data center in the cloud, and usually
requires several A100-class GPUs, each of which retails for over $8,000. Not to mention, the model is proprietary and closed source, which is fine, except it means that to leverage these models we need to send our telemetry data to a private company, which could be a problem.

But here's where things get interesting. See, back in February, Facebook shared its collection of private LLMs, dubbed LLaMA, many of which outperform GPT-3. So this was essentially an open-source version of ChatGPT. But even though the weights were now open source, the models were still big and unwieldy. Enter Georgi Gerganov, who heroically ported the LLaMA model to C++ and, using quantization, greatly reduced the size of the model. Quantization is when you take something perfectly beautiful, like this photo of a cat, and then delete a bunch of pixels in an effort to make it smaller, and hope that it still resembles the original image. We can do this with the model files themselves, and it kind of works. And people are taking this advancement to run models on tiny devices like Raspberry Pis. By people, I mean me and this guy here who tweeted about it. So I couldn't help giving this a shot.

This is my Raspberry Pi 4 Model B, featuring a quad-core CPU and 8 GB of RAM, and I'm going to run an LLM on it just to prove the haters wrong. First, we're going to load stock Ubuntu Server 23 64-bit onto this micro SD card. The Debian-based version of Raspberry Pi OS should work, but I don't really trust that all the packages will be available, so I'm just going to go straight to a more tried-and-true Linux distro. I'm going to go ahead and open up Raspberry Pi Imager, and then for operating system we're going to use Ubuntu Server 23 64-bit, so let's go ahead and add that. For the storage, we're going to select the 32 GB micro SD card, and I'm going to go ahead and seed it with my Wi-Fi credentials. Then we're going to select Write.

Okay, so Ubuntu has been written to our micro SD card. We do need the model to be available to our Raspberry Pi. It's a little
over 4 gigs. In theory you could put it on the micro SD card if there's enough space, but the read/write times on the micro SD card are really slow. So what I'm going to do is throw it on an external drive, my Samsung SSD, and then connect the SSD to the Raspberry Pi. It'll have the model available, and the read/write time will be really fast. So I'm going to download this file; I'm downloading the model to my Samsung 1 TB SSD. Let's go ahead and eject the micro SD card. [Music]

Okay, so now I'm in my Warp terminal, and let's see if we can connect to the Raspberry Pi. All right, so this should be our device. I'm going to go for the IP address. It looks like it's returning ICMP traffic, so let's try to connect. Let's do SSH, data slayer at... Okay, so I'm in the Raspberry Pi.

The SSD is connected, but it's not available yet because I have to mount it, so I'm going to do that real quick. Basically, I'm in this /mnt directory, and I'm just going to make a directory called SSD, and then I'm going to mount the USB drive to this directory. To double-check how to do that, I'm going to go over to ChatGPT and ask, "How do I show USB drives on Linux?", because you just need to get the name of the device that the drive is on. So it's sda1, because we can see right here, Samsung T5. I believe mounting it is as easy as mount and then the drive, so it's /dev/sda1, and then the destination is going to be what I just created, /mnt/SSD. Now if I go to /mnt/SSD, okay, I'm inside my Samsung terabyte drive, and you can see the models right here. So we can actually work out of this drive, and it'll be faster because the read/write times are faster than the micro SD card.

So I'm going to do apt-get update, and then we're going to run this command. All right, and then I'm going to download some additional packages. The Git repo that we want is this guy here, so we are
going to git clone this guy, and do it in the same space here. All right, so now we have alpaca.cpp, and there's a bunch of files here. We need to compile it, so we just run make chat. Then all we have to do now is move the model into the Git repo directory, so I'm going to do mv, take this, and then go to alpaca. All right, it should be in there.

So now we can kick off the LLM. It's going to be using the 7-billion-parameter, 4-bit quantized Alpaca model. Run the chat file, and I have htop open here on the right so you'll be able to see how the CPU gets utilized. All right, looks like it's ready to go. The cores are pinned.

Let's give it a difficult one: "What is quantization?" You can see it's a little bit slow. I'll speed it up, but this is the speed that it's actually coming out at. It's not that bad. "Quantization is the process of converting continuous values into discrete or numerical values. This can be done by taking a set of data points and grouping them together based on similarity in order to reduce noise and overfitting when training machine learning models with limited data set sizes. Quantizing also helps improve model performance as it reduces variance between different parameters used within the same model architecture."

But see, if I try to get cute with it, if I say something like "explain quantization in pirate speak", this is where it starts to fail. With GPT-3.5 or GPT-4, you can start to get these more creative responses from the model. You can say "put it in this language", or "say it as if you're this person", or "act like a Linux terminal" or something, and it can actually handle all that. It looks like this model isn't quite sophisticated enough to do that. And just so you can see the difference, if I pose that question to GPT-4, it has fun with it: "Ahoy, matey, gather around and lend an ear while I spin you a tale of quantization." And it will provide real analogies, like: "Imagine, if you will, a vast stretch of water with waves of all heights,
but say you want to describe those waves using only a handful of measurements: small, medium, big." This is beautiful. This is really a beautiful way of describing a rather complex topic.

But the local model is pretty good at stuff where the answer is pretty concrete. If I said, "What is the Linux command to delete a folder called TMP recursively?" So I think that's right. I usually throw the f flag in there, but maybe that's right. All right, and then just for comparison, let's take the same prompt and bring it to GPT-4. Still talking in pirate speak, that's okay. Yeah, that's what I would have used, honestly, so I think this is a little bit better. And then it gives these verbose explanations of each flag. You can see GPT-4 is just a lot stronger.

Similarly: "Write a JavaScript function to calculate the average of an array of values." So I think this should work. I don't know why it's rounding anything, but okay. So now I have this average function. If I pass 78, the average should be seven, and it is. Yeah, it writes valid JavaScript code. All right, and then let's just bring this guy to GPT-4 and see what it comes up with. Did it say "arr" as a pirate pun? I can't tell. That would have been amazing if it did. GPT-4 is good, I've got to say. It's so good. It gives an example and uses the reduce function. Very elegant.

Let's ask it one more broad question: "What is the most significant innovation in history?" "The invention and widespread adoption of the internet has been one of the greatest technological advancements ever, revolutionizing communication, commerce, entertainment, and more. It has enabled global collaboration on an unprecedented scale, connecting people from all corners of the world with ease." Yeah, so I think that's a really solid answer. I think it'd probably be hard pressed to come up with something much better.

So the model is a little slow, but it's pretty good with more concrete questions, like how to write a Linux command or write a JavaScript function. Once you get into more heady, abstract
concepts, it begins to lose the plot a little bit. And since it's a little bit slower, it's best to pose your questions as yes-or-no questions.

So you might be wondering, why don't we accelerate our model inference using a specialized GPU or TPU compute stick, like the Google Coral AI USB accelerator? Unfortunately, LLMs don't run on TPUs, and they're actually restricted by virtual memory more than anything.

The pace of innovation in the space is super exciting, and the possibilities to leverage LLMs at the edge, to power sophisticated chat interfaces for robot waiters or other new use cases, are super compelling.
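
Editor's sketch: the transcript compares quantization to deleting pixels from a photo and mentions a 7-billion-parameter, 4-bit quantized model. The Python below is a minimal illustration of that idea under stated assumptions. It uses a simple symmetric scheme with one scale per block of weights, which is not necessarily the format alpaca.cpp uses, and the function names and sample weights are hypothetical.

```python
# Illustrative only: a simple symmetric 4-bit quantizer with one scale per block.
# The names and the scheme are assumptions for this sketch, not alpaca.cpp's format.

def quantize_block(values, bits=4):
    """Map floats to small signed integer codes plus one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1                     # 7 when bits=4
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    # Round each value to the nearest code and clamp it into [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def estimated_size_gb(n_params, bits_per_weight):
    """Rough size of the weights alone, ignoring scales and file metadata."""
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical sample weights, chosen only to show the round trip.
weights = [0.12, -0.70, 0.33, 0.06, -0.28, 0.61, -0.02, 0.44]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("worst round-trip error:", worst_error)
print("7B weights at 16 bits (GB):", estimated_size_gb(7e9, 16))
print("7B weights at 4 bits (GB):", estimated_size_gb(7e9, 4))
```

Each weight ends up stored as one of 15 integer codes, so the restored values are close to the originals but not identical, much like the cat photo with pixels removed. The size estimate for 7 billion weights at 4 bits is in the same range as the "a little over 4 gigs" file mentioned in the video; the difference plausibly comes from scale factors and other data stored alongside the weights.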
Channel: Data Slayer
Views: 158,813
Keywords: Data Slayer, chat gpt on raspberry pi, chatgpt, openai, ai, artificial intelligence, raspberry pi, chat gpt, python, gpt, chatgpt tutorial, machine learning, ubuntu, large language model, llama, alpaca, alpaca.cpp, ggml, 4bit, quantization, llama.cpp, warp terminal
Id: N0718RfpuWE
Length: 10min 59sec (659 seconds)
Published: Tue Aug 29 2023