6 Ways to Run a Local LLM (aka how to use HuggingFace)

Video Statistics and Information

Captions
Large language models like ChatGPT, Anthropic's Claude, or Google Bard are really powerful tools with lots of use cases. However, all these services share the same big drawback: privacy. What happens if we are dealing with sensitive data? We can't send it over the wire, whether for reasons of compliance or simply of trust, because we don't know what is going to happen to it. This is where having a private language model we can run on our own machines is really valuable. There are literally hundreds of open-source models that have already been trained and are ready to use, so join me in my quest to find a ChatGPT replacement I can run locally on my laptop.

Before we start, let's set expectations. There are hundreds of thousands of open-source models out there, some trained by individuals and some trained by corporations like Meta. To run some of these models we need a beefy machine with lots of memory and even a GPU; some smaller models can run on a laptop. But even with the best hardware, open-source models are smaller and a little less powerful than polished products like ChatGPT. After all, OpenAI has dedicated hundreds of engineers to maintaining the thing. The leaked document called "We Have No Moat, and Neither Does OpenAI" goes into great detail explaining how open source has solved problems like running language models on phones or multimodality. It also includes a comparison between closed- and open-source models: open-source models go through new iterations quickly, one or two weeks apart, and even though they are smaller they can do a very good job compared to closed-source models that may be a thousand times bigger.

So, you have probably heard about the site Hugging Face (huggingface.co).
It is the largest repository of open-source models. We can find projects and models uploaded by individuals and by corporations; you can find models from Microsoft and from Meta. This is a really busy page because it's not only about language models but also about things like image generation and other state-of-the-art AI. What we want for our use case, language models, is to go to Models and then filter by task on the left side. There is a whole lot of tasks, like audio or image processing; what we want is something under Natural Language Processing. This is the first thing we need to pay attention to, because the kind of work we can do depends on what task we choose here: selecting translation is not the same as selecting summarization. Ideally we pick the best model for our job. In my case, for example, I would like something to chat with, so I will pick Conversational, and that starts to filter the list.

The other big filter is the Libraries section, which covers the kinds of libraries and ways of running the language models. Most of these models are Transformers. Transformers is a type of model, it's the T in ChatGPT, and it's also a Python library from Hugging Face that simplifies setting up a model. We also have popular libraries like PyTorch, TensorFlow, and JAX, which is a Google library. Then there are some formats like GGUF, which is the format used by llama.cpp for Llama models, and Keras, which is another big library. We're going to pick something we know how to use, or want to learn how to use. For starters, let's pick Transformers, and then we can sort on the right side by trending or most downloads. Opening one of these models gives us information about it. Let's pick one that I know will work on my machine, because it's quite limited in memory; for example, Microsoft's DialoGPT. This is a GPT-2 model, which is a smaller and older model. Here we have the model card, which basically explains how the model works and, more importantly, how to use it, with some snippets of code you can copy and paste on your computer to run the model. If you don't find useful information there, you can always click "Use in Transformers" and you will find a couple of snippets; normally the first snippet is the one that bootstraps your program and picks the model. The Transformers library will download the model automatically to your machine and set it up, and obviously you need to complete the code with the logic you want for your project.

So let's go and pick one of these. To get started we are going to need a few tools, like a compiler, make, or CMake; these are all things I have installed on my machine, but you will probably need to install them so the libraries can compile. We start by installing PyTorch and TensorFlow, which will cover most of what we need for this example; we can use pip to install both libraries. Then we install Transformers with the sentencepiece extras, which is the complete package with all the tools you need to run Hugging Face Transformers. Finally, let's create a new file for our chat script and paste in the code from the model card.
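For reference, the model-card snippet for DialoGPT looks roughly like this. This is a minimal sketch: the prompt text and generation settings are my own, and the real model card wraps the same calls in a multi-turn chat loop.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloads the model on first run, then loads it from the local cache
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# Encode one chat turn, terminated by the end-of-sequence token
input_ids = tokenizer.encode("How are you doing?" + tokenizer.eos_token, return_tensors="pt")

# Generate a reply; prompt plus reply is capped at 1000 tokens
output_ids = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))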
This is essentially the snippet we find on the model card for this model. The first time it runs it's going to download the model, which in this case is about 3 GB. I already ran it, so it will start faster, but you're going to see there's a little bit of boot-up time before the model runs. This is a small model, GPT-2 medium size. We have a prompt asking how it's doing, and after a warning we get a response: "I'm doing well, how about you?" So the model is running using Python, TensorFlow, and the Transformers library from Hugging Face. As you can see, to use this you really need to code the behavior yourself, because Hugging Face only provides the part that deals with the model: how to start it, how to encode and decode the data. The rest of the application you need to develop yourself.

Regarding Hugging Face Transformers, the pros are that it handles the model download automatically, we have snippets on the Hugging Face site to use the models, it's the best thing we have for experimenting and learning about machine learning, and of course you can integrate the code into your own product. For the cons, you do need a solid understanding of machine learning and natural language processing; this is something you will need to learn to use these libraries efficiently. You need to code the application: the behavior, the logic, everything is on your side to do. You also need to know how models are configured for best performance, and it's not as fast as other alternatives we'll see a little bit later, so you really need a powerful machine to run models locally.

LangChain is a framework for building language applications on top of your models. It's an ecosystem that comprises connections to models, which can be local or remote, and all kinds of middleware to augment your application. It supports vector databases and all kinds of templating to create an application with Python that uses a language model as an engine. If we go, for example, to the components section, we will find all the large language models supported by LangChain. We can use a model from Hugging Face, there are integrations with Llama and similar tools, and we can even integrate with OpenAI, though that's not what we want, because we want something that runs on our hardware. If you want to run a model with LangChain you will still need the Transformers library and PyTorch or TensorFlow, depending on the model. You will also need to install Hugging Face Hub, a tool that's useful for managing models, and of course we have to install LangChain itself.

Now let's create the same example we did with Transformers, but this time using LangChain for the chat, and paste the code. We are using HuggingFacePipeline, which is part of the LangChain project, wrapping the pipeline from Hugging Face with the same model as before, DialoGPT-medium, and then we have the prompts. This is all in the documentation, but basically we use what we already had from Hugging Face Transformers and supply a question. Let's see if it works.
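A minimal sketch of that LangChain script might look like the following. The exact import paths and the max_new_tokens value are assumptions on my part, since they vary between LangChain versions; check the documentation for the release you install.

from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate

# Wrap a local Hugging Face pipeline around the same model as before
llm = HuggingFacePipeline.from_model_id(
    model_id="microsoft/DialoGPT-medium",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 64},
)

# A simple template; the question is filled in at run time
prompt = PromptTemplate.from_template("Question: {question}\nAnswer:")
chain = prompt | llm
print(chain.invoke({"question": "What is encephalography?"}))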
Note that this is not a chat: it's just going to reply to my question, which is "what is encephalography?" After about 10 or 20 seconds we have the answer, and it's working, so we can use this to build our chat application. LangChain successfully accessed the model from Hugging Face, which means it's working. And this is just the start: obviously we can use a lot more of the tools LangChain provides to build a very complex application with lots of features, but this is the basis for running a model on LangChain.

Going over LangChain, we can sum up the pros and cons in one list. The pros are that it's easy to use because it has an ecosystem, you can run local and remote models side by side, and while you need to be a developer, you don't need to be a machine learning specialist like with Transformers, so you can focus on the logic of the application. You still need to develop the application, though. The cons are the same as with Transformers: it's just as slow compared to other alternatives we'll see later, and it's still running on Python, so you still need a good machine to run a model locally.

The third way of running a local LLM is with llama.cpp. This is a port of the Llama models from Facebook and Meta to plain C/C++, and we are going to see that it is much more performant: it runs bigger models on smaller hardware. I'm able to run a bigger model on this laptop, something I couldn't do with Hugging Face or LangChain. It's basically a very optimized implementation of a model, and it has very good support for Apple Silicon, so it's ideal for me and for everyone using M1 or M2 chips. It's a good way of running a local LLM using fewer resources.
To use llama.cpp we basically need to clone the repository, and once it's cloned we can build the project with make. Now we can execute main, and we see that it doesn't run because we don't have a model. llama.cpp uses a special format designed for this project, GGUF. This is one way of storing a model, and one of the most modern: compared with, for example, PyTorch or TensorFlow, which use bin or H5 files, this format is better. So we are going to go to Hugging Face, locate one of the models in this format, download it, and then test llama.cpp.

Back on Hugging Face, let's pick a Llama 2 file; there are a few. If you want the official Llama files from Meta, you need to create an account on Hugging Face, the email must match the one you used with Meta, and you need to request access, which takes a couple of days. If we don't want to do that, there are also a few models uploaded by individuals. We can pick, for example, this Llama 2 model with 7 billion parameters in GGUF. You can see the model files under "Files and versions"; there are different types, and the Q comes from quantization, so it depends on how many bits the model uses. For testing we can download the smallest quantization. It will take a while, but once we have it we can execute it with llama.cpp.

Now that I have the model downloaded, I can use the -m flag and point to the file I have in my temp folder. It starts, and since I didn't supply any prompt, it just starts spouting random text, so let me cancel that. Let's start again, and now we're going to use the -p parameter, which initializes the model with a prompt, and we should hopefully start getting some information. There are a lot more options here; we could create a template, which would be the best way to interact with llama.cpp, but it seems it's generating some output, so let's see what we get. We got an answer, and for reference this is a model that wouldn't run with PyTorch or TensorFlow, so llama.cpp really is more performant. At the end we also get some benchmarks about timing, which is very interesting, and it allows me to run things I normally wouldn't be able to.

Going over llama.cpp, the pros are that it's a lot more performant; you saw that we can run models I couldn't have run with Python, thanks to it being pure C/C++ with a lot of optimizations, and it can run bigger models with less hardware. I can run them on the command line or as a browser application, and there's a good number of options to tweak the behavior of the models. If you want to build an application you can still do that, because it has bindings for various languages, so you can have your logic in, for example, JavaScript and Node.js while the model actually runs with llama.cpp.
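For reference, the whole llama.cpp sequence described above boils down to a few commands. This is a rough sketch: the model filename is just an example of a quantized GGUF file, not necessarily the exact one used here.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Run with a downloaded GGUF model, an initial prompt, and a token limit
./main -m /tmp/llama-2-7b.Q2_K.gguf -p "What is encephalography?" -n 256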
For the cons, it doesn't support all the models on Hugging Face, only Llama models and a subset of other supported architectures. You have to build the tool yourself, so you need some knowledge and some tools installed, and it's not as user friendly as some tools we'll see later on.

While we're talking about llama.cpp, there's another related project, from Mozilla, called Llamafile. Basically this is one portable file you can run anywhere: Linux, macOS, or even Windows. You download this file, download a model, and run it without having to compile anything. You can even embed the model in an executable file that runs anywhere, so you can share your models easily and other people can run them just by double-clicking the file, which is great. To use this project we have a few options. We can download a file with a model bundled in, and it will just work, but since I already have a model downloaded and don't want to download it again, I can use the project with external weights. I will download the binary, which works on both macOS and Linux (there's a separate one for Windows), and make it executable. Running it throws an error because it's looking for a model we don't have, but we can always pass -m and supply the model we already have from the last example, and it's going to load it. In this case it's not a command-line application: it opens a browser where we can interact with the model.

This is the starting page; as you can see, it's just llama.cpp's web interface, and we have a few options to tweak the model. But let's simply talk to the model and see what happens. It starts producing text; let's ask the same thing we asked before, and since the model is the same it should provide a very similar answer. The model works at more or less the same speed, but in this case we didn't have to build or clone a repository, just download one file and one model. You can see the answer is shorter because of the options we supplied when we started the model. We also have the option to upload images and ask the model about them; when the model is multimodal, that should work. So I think this is an easier alternative to running llama.cpp: you get the same benefits with an easier install.

Going over Llamafile, we can sum up the pros and cons in one list. The pros are the same as llama.cpp: it's fast and you can run big models on a small machine. The other advantage is that you can build a single executable file you can share with other developers or publish, and it will run on any machine. The cons are the same as llama.cpp: you don't have all models available, only the ones llama.cpp supports.
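The Llamafile flow with external weights looks roughly like this. This is a sketch: the release URL, version number, and model filename are placeholders of my own, so check the project's releases page for the current ones.

# Download the Llamafile binary, make it executable, and point it at a local GGUF model
curl -L -o llamafile https://github.com/Mozilla-Ocho/llamafile/releases/download/0.6/llamafile-0.6
chmod +x llamafile
./llamafile -m /tmp/llama-2-7b.Q2_K.gguf   # opens the chat UI in your browser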
And there's yet another Llama-related project here: Ollama (ollama.ai). This site gives you an installer for Linux or macOS, and once you download and install it on your machine, it installs a command-line tool and downloads Llama models directly, so I think this is the easiest way to run a Llama model: you just double-click, and then run your query in the terminal. Let's see how it works. With Ollama we download the tool, install it, and then pick one of the models listed on the site, for example Llama 2. Once we install Ollama we get a service in the taskbar and a command on the command line. The first thing we need is to pull a model; we saw that there's a llama2 model available, so that's the first step. Next we run ollama run with the model name, and it starts a command-line session where we can talk with the model. Let's ask the model the same thing we've asked so far to see if we get a good answer. We get a pretty similar answer; it's a little bit slower than llama.cpp or Llamafile, but it's really easy to get the installation going and to download models.

With Ollama, I think this is the easiest way to install and use a Llama model: it's just double-clicking, it installs a service on your machine, and then you open a terminal, select your model, run your prompts, and get results. You can run Llama and Vicuna models very easily, and it runs really fast. For the cons, you do have a limited set of models; there are fewer models than with llama.cpp, though maybe that changes in the future. The other thing is that you can't manage the models yourself: the tool manages downloading them, so if you already have models downloaded, you may need to download them again. There aren't a lot of options to tweak; this is really a tool dedicated to users who are interested in using the language model, not in tweaking the options. And there's no Windows version, so if you're on Windows this won't work for you.
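The terminal part described above comes down to two commands; the model tag here is the one shown for Llama 2 on the Ollama site at the time.

# Download the model, then start an interactive chat session with it
ollama pull llama2
ollama run llama2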
The last product we are going to see today is GPT4All. This is the most user-friendly option because it includes a GUI: it allows you to simply install the application, select one model (it downloads it automatically), and run your queries. You can even add your own documents to the application, and it will go and index them; once the documents are indexed you can query them, talk to your documents, and ask the model to summarize or translate them. This is a really good way of working with documents locally, without sending them over the wire, keeping them private and safe.

To get started we download the installer for our system. Once we install the application and start it, we get a window very similar to ChatGPT. The first thing is to go to the downloads section and select one of the available models. I already installed Mistral OpenOrca, but I could download any of the others; we can see the sizes, and we can also connect to OpenAI, but let's use the model I already downloaded. Another useful thing here is the options panel, where we can change the prompt for the model, the temperature, and all the other parameters to tweak it. One feature this application has is that you can add a path, a folder; for example, I use my documents folder here, and it will index it. Once it's indexed, I can simply enable this folder, and now the model has access to all my files in that directory. For example, this directory has a lot of books on CI/CD, and we can ask the model to give us a definition based on the files it finds in my docs, in my ebooks directory. As you can see, we get an answer, and it even provides the context: it identifies the book and summarizes a definition from my document. It's really a powerful feature. Of course, we can use the model normally like any other model, asking questions and everything, but with documents we have more alternatives.

For GPT4All, the pros are that it has a polished GUI and is the most user-friendly option we've seen today; we can use local models and remote online models like OpenAI; it handles 30-billion-parameter models easily on my machine, so it's performing great; and you can add your own documents for context, so you can keep your data safe on your machine. For the cons, I would say there's a very limited number of models, and some of the models can't be used commercially.

So the choice of how to run a large language model will depend on your needs. If you want a ChatGPT-like experience, you can use GPT4All, which is one of the best ones around. If you want to build a project on top of a language model, you will need something like llama.cpp or even Python. If your goal is learning machine learning, you need to learn Python and use the machine learning libraries. I hope I made a case for running your own large language model. The models are only getting better, and I think the gap between closed- and open-source models is getting smaller all the time. And there's especially a case for using local language models because you are in total control of your own data. That's all for now. I really do want to explore AI in more detail in the future, so if you have any requests feel free to leave a comment. If you liked this video, please leave a like and subscribe so you don't miss the next one. Thank you for watching and have a nice one.
Info
Channel: Semaphore CI
Views: 15,971
Id: 7jMIsmwocpM
Length: 28min 20sec (1700 seconds)
Published: Thu Dec 14 2023