Running SDXL on the Raspberry Pi 5 is now POSSIBLE!

Video Statistics and Information

Captions
Hello friends, welcome to AI Flux. We've covered some edge AI device topics here before; we've covered the new iPhone and the new Apple Watch, both of which have neural cores. Technically speaking, though, you don't always need a neural compute engine or a dedicated processor to do AI things. We've already seen LLMs run on the Raspberry Pi, and with the release of the Raspberry Pi 5, which is almost twice as fast and boasts up to four times as much RAM, there's even more we can do, and someone has already shown us what that is. A developer just released a version of Stable Diffusion XL that can run in only about 300 MB of RAM, and that unlocks new capabilities for edge AI. Of course, we've seen LLMs run on the Raspberry Pi before, but it was pretty slow, so let's get into it.

If you want to watch a great video going over some of the incredible improvements the Raspberry Pi Foundation made with the Raspberry Pi 5 release, I recommend the one by Jeff Geerling. His entire channel is about Raspberry Pi compute projects and he's done some pretty wild things; he's even plugged GPUs into Raspberry Pis. I've linked his channel up above, so definitely check that out.

This new project that lets you run Stable Diffusion XL on a Raspberry Pi is called ONNXStream, partly because it uses some new approaches to running Stable Diffusion; in some ways the weights you're inferring against are streamed rather than held in memory. ONNX has been pivotal in the past when it came to making these models smaller and letting lesser hardware than a GPU run them, and ONNXStream is based on the idea of decoupling the inference engine from the component responsible for providing the model weights. That gets into a lot of software terminology, but the idea is to reduce how much of the model actually has to sit in RAM for whatever processor is working on it; it's a clever way of fetching only what the engine really needs, and it also means you can run this on a CPU. Generally speaking, ONNXStream can consume up to 55 times less memory at runtime while only being about 0.5 to 2x slower, and that's on CPU.

To be fair, there were ONNX forks of Stable Diffusion 1.5 that made this possible back in July, and the wildest part was that you could run them on a Raspberry Pi Zero 2. Granted, generating an image took about two hours, but it was possible. What's cool now is that on the latest Raspberry Pi you can basically run SDXL in near real time, and by real time I mean you'll be waiting around one to three minutes for an image to show up, but the improvement is crazy.

I think it's worth going over what ONNX actually is. ONNX has been around for about six years, and my friends at Paperspace (I know the founders) have a really good explanation of why it matters and what it does. ONNX stands for the Open Neural Network Exchange format. Back in 2018 there were a ton of different model formats, even within the PyTorch realm, and the idea was to create a standard for all deep learning models, with some base runtimes that made them portable, and to try to prevent vendor lock-in.
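To make the format idea concrete, here is a minimal sketch (in Python, with a made-up toy model and file name) of the generic ONNX round trip: export a model from PyTorch, then run it on a plain CPU with onnxruntime. This is just the standard workflow the format enables, not ONNXStream's own code; ONNXStream is a C++ project that goes further by streaming the weights from disk instead of loading the whole graph into RAM.

    # Minimal sketch: export a small PyTorch model to ONNX, then run it on CPU
    # with onnxruntime. The model and file name are made up for illustration.
    import numpy as np
    import torch
    import onnxruntime as ort

    class TinyNet(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(16, 4)
        def forward(self, x):
            return torch.relu(self.fc(x))

    model = TinyNet().eval()
    dummy = torch.randn(1, 16)

    # Export to the portable ONNX format.
    torch.onnx.export(model, dummy, "tiny.onnx",
                      input_names=["x"], output_names=["y"])

    # Load and run the exported graph with a plain CPU execution provider,
    # no GPU or neural accelerator required.
    sess = ort.InferenceSession("tiny.onnx", providers=["CPUExecutionProvider"])
    out = sess.run(None, {"x": np.random.randn(1, 16).astype(np.float32)})
    print(out[0].shape)  # (1, 4)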
If that standard hadn't been created as early as it was, it would have been really easy for Google or Amazon to come in and say, why don't we just make our own format that you can only use on Google TPUs, or only on GPUs in Amazon's cloud? Fortunately we've avoided that, and I think Hugging Face and a lot of other platforms that have worked hard on further formats, checkpoint models and so on, have also done a lot of work to make sure the important pieces of infrastructure for portable AI remain open source.

Going back to SDXL: generally speaking, the minimum RAM required for even Stable Diffusion 1.5 is 8 GB, and while that seems reasonable, it's still daunting for most embedded compute platforms, especially the Raspberry Pi. I know the Raspberry Pi 5 technically comes in a variant with 8 GB of RAM, but not all of it is usable, because you still have to run a Linux kernel. Major machine learning frameworks and libraries are generally focused on minimizing inference latency or maximizing throughput, and that comes at the cost of RAM. ONNXStream takes the opposite trade: what if we made it less fast, but able to run with less memory?

So this is what was possible around July: Stable Diffusion 1.5 could run on a Raspberry Pi Zero 2, and the key is that there were some precision losses. It was running at half precision, which basically means the detail and the guesses the model makes won't be quite as good, but reducing precision like this is quite common in fine-tunes that make these models smaller. This image used the VAE decoder at lower precision, and this one used slightly greater precision; you can see the image is just less muddy with a little more precision, because the model is making better guesses and understands a bit more of what it's creating. This other example is the lowest precision, which isn't even FP16.

Now we move on to Stable Diffusion XL, and I should note this is the base model. The ONNXStream Stable Diffusion example implementation now supports SDXL 1.0, granted without the refiner. The ONNX files were exported from SDXL 1.0 via Hugging Face's diffusers library, which I think they say here is version 0.19.3. SDXL is significantly more computationally expensive than Stable Diffusion 1.5; the most significant difference is that it generates 1024x1024 pixel images instead of 512x512. Just to give you an idea, generating a 10-step image with Hugging Face's diffusers takes about 26 minutes on a 12-core PC with 32 GB of RAM, and the minimum recorded VRAM for SDXL previously was right around 12 GB, so if you had a GPU with less than 12 GB of memory, the odds of it working were very low.

ONNXStream can run SDXL in less than 300 MB of RAM. That means you can run it on a Raspberry Pi Zero 2, and it also means it runs meaningfully faster on a Raspberry Pi 5, without adding more swap space and without writing anything to disk during inference, so it doesn't have to grab resources beyond what the Linux system you're running already uses. Generating a 10-step image takes about 11 hours on a Raspberry Pi Zero 2, and obviously it's somewhat faster on the latest board.
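For reference, here is roughly what the stock diffusers path behind those timing and VRAM figures looks like; a minimal sketch assuming the public stabilityai/stable-diffusion-xl-base-1.0 checkpoint and a CUDA GPU, with the prompt and output file name made up for illustration.

    # Rough sketch of the stock Hugging Face diffusers path for SDXL 1.0
    # (base model, no refiner). Assumes the public SDXL base checkpoint and
    # a CUDA GPU; on a CPU-only machine you would load in FP32 and land in
    # the roughly-26-minutes-per-10-step-image regime mentioned above.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,   # half precision roughly halves memory use
    ).to("cuda")

    image = pipe(
        "a photo of a raspberry pi on a workbench",  # made-up prompt
        num_inference_steps=10,      # the 10-step setting used in the comparison
        height=1024,
        width=1024,                  # SDXL's native resolution
    ).images[0]
    image.save("sdxl_10_steps.png")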
There have been some specific optimizations to make this possible. The same set of optimizations made for Stable Diffusion 1.5 has been applied to SDXL 1.0, but there are a few differences. To make the UNet model run in less than 300 MB of RAM, it required UINT8 dynamic quantization, limited to a specific subset of large intermediate tensors, so it's pretty trimmed down, to put it that way.

The situation for the VAE decoder is a bit more complex than for Stable Diffusion 1.5. In SDXL the VAE decoder is about four times the size and consumes 4.4 GB of RAM when run with ONNXStream at FP32 precision, so obviously precision takes a bit of a hit to make this work. In the case of SD 1.5, the VAE decoder is statically quantized to UINT8, and that's enough to reduce RAM consumption to around 260 MB. With SDXL, instead, the VAE decoder overflows when run with FP16 arithmetic; the numerical ranges of its activations are simply too large. So the real trade-off is that we're stuck with a model that consumes over 4 GB of RAM but can't be run at FP16 precision, which would normally be the trick we'd use to get that RAM figure down. One option, which hasn't actually been investigated yet, is running the VAE decoder in FP16 anyway and halving total memory use, so you'd be at roughly 2 GB of RAM instead of 4.4. Ironically, either of those would still fit on the latest Raspberry Pi, but we're trying to minimize memory use, and the developer here was targeting the Raspberry Pi Zero 2.

The inspiration for the actual solution came from an implementation of the VAE decoder in Hugging Face's diffusers library, in other words tiled decoding, and you can see in this image that there are clearly tiles, blocks of the work being done separately. The idea is divide and conquer, a common CS concept: why do it all at once when you can split it into a bunch of distinct work units? The result of the diffusion process is a latent tensor of shape (1, 4, 128, 128), and the idea is to split it into a 5x5 grid of overlapping regions and decode those tiles separately, so you reduce how much work has to happen at once and therefore how much RAM has to be used. The image shown here was generated with tile blending manually turned off, which is why you can see the transition points between the tiles; when you make the same image with blending turned on, it looks entirely normal, like something out of a run-of-the-mill Stable Diffusion XL run. Of course, this took 11 hours on a Raspberry Pi Zero 2, but I've been told (and I'm going to buy one of these and try it) that it can take less than an hour on the Raspberry Pi 5.
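To make the divide-and-conquer idea concrete, here is a toy sketch of tiled decoding with edge blending. The decode_tile stand-in just upsamples a latent tile instead of running a real VAE, and the tile size, overlap, and feathering are my own illustrative choices rather than the exact parameters ONNXStream or diffusers use; the point is that only one small tile ever has to be decoded at a time.

    # Toy sketch of tiled decoding with blending, in numpy. A real implementation
    # would decode each tile with the actual VAE; decode_tile here is a stand-in
    # that just upsamples, to keep the example self-contained and runnable.
    import numpy as np

    def decode_tile(tile):
        # Placeholder "VAE decode": turn a (4, h, w) latent tile into a
        # (3, 8h, 8w) image tile with 8x nearest-neighbour upsampling.
        return tile[:3].repeat(8, axis=1).repeat(8, axis=2)

    def blend_weights(h, w, overlap):
        # Feather the tile edges linearly so overlapping regions blend smoothly.
        wy = np.minimum(np.arange(1, h + 1), np.arange(h, 0, -1))
        wx = np.minimum(np.arange(1, w + 1), np.arange(w, 0, -1))
        wy = np.clip(wy / float(overlap + 1), 0.0, 1.0)
        wx = np.clip(wx / float(overlap + 1), 0.0, 1.0)
        return np.outer(wy, wx)[None, :, :]

    def tiled_decode(latent, tile=32, overlap=8, scale=8):
        c, H, W = latent.shape
        out = np.zeros((3, H * scale, W * scale), dtype=np.float32)
        acc = np.zeros_like(out)
        step = tile - overlap
        for y in range(0, H - overlap, step):        # 5 rows of tiles
            for x in range(0, W - overlap, step):    # 5 columns of tiles
                patch = latent[:, y:y + tile, x:x + tile]
                decoded = decode_tile(patch)
                w = blend_weights(patch.shape[1], patch.shape[2], overlap)
                w = w.repeat(scale, axis=1).repeat(scale, axis=2)
                ys, xs = y * scale, x * scale
                out[:, ys:ys + decoded.shape[1], xs:xs + decoded.shape[2]] += decoded * w
                acc[:, ys:ys + decoded.shape[1], xs:xs + decoded.shape[2]] += w
        # Normalise by the accumulated weights so overlaps blend cleanly.
        return out / np.maximum(acc, 1e-8)

    latent = np.random.randn(4, 128, 128).astype(np.float32)  # SDXL-sized latent (batch dim dropped)
    image = tiled_decode(latent)  # only one small tile is "decoded" at a time
    print(image.shape)            # (3, 1024, 1024)

For the real thing in Python, diffusers exposes essentially this trick through the autoencoder's tiled decoding support (the enable_tiling switch, if I remember the name correctly), which is where the developer says the inspiration came from.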
There are also some really cool features of ONNXStream if you're a developer, so I think you should definitely check it out if you work in this space. The performance numbers are quite clear: the more RAM you have, the better things get, and in theory the configuration shown here would be possible on the latest Raspberry Pi, where around 7 seconds per iteration is pretty reasonable. It's also cool to see attention slicing ending up in areas of application beyond people working on GGML; that was an approach initially developed almost explicitly for Apple silicon, and it's basically another clever way of slicing down what you actually need to infer against while producing images, or producing text with an LLM. I'm not going to get too far into those other applications.

What's cool is you can run this on basically Linux, Mac, or Windows; I'd recommend Linux or Mac. I'm probably going to run it on my Mac just to try it out, because it's kind of cool, and I'm still waiting on a Raspberry Pi 5 that I ordered on Amazon a few days ago. I'll load up some other images people have made with this. I think it's really cool that this is all possible with just a Raspberry Pi, and not even one of those expensive embedded Jetson TK1s from Nvidia; those are really cool, but it's awesome to see tooling that lets us detach a little from only being able to run cool AI and ML stuff on Nvidia hardware.

I hope you liked this video and learned something. If you like our content, please like, subscribe, and share our videos; it means a ton to us. If you want to try Vast.ai, check out the link in the description to try renting a super fast Nvidia GPU, and we'll see you in the next video.
Info
Channel: Ai Flux
Views: 4,205
Id: XVS8oiuU6sA
Length: 11min 15sec (675 seconds)
Published: Tue Oct 03 2023