I used LLaMA 2 70B to rebuild GPT Banker...and its AMAZING (LLM RAG)

Video Statistics and Information

Captions
This is LLaMA 2 70B. I used it to build the best open-source retrieval augmented generation engine, named Llama Banker. This is it running on a single GPU, answering questions, summarizing, and analyzing a 300-page annual report just as well as ChatGPT, but with unlimited tokens for a buck sixty-nine an hour. Since building version 1.0 of GPT Banker, it's been bugging me that I couldn't do the same with open-source models. That was until the announcement: Meta had been coding away in the background, building a better llama, and what's better than one llama? Twins. This should have been easy enough to do: swap out the models and we're good to go. It wasn't quite that simple, though. Let me explain.

To get started, I first needed to install PyTorch and a bunch of other dependencies. Easy enough: grab the command from the PyTorch website and throw it at a command line. But I wanted to try running the entire app on RunPod; that way I'd have a GPU fast enough to run LLaMA 2 70B. That would prove to be an absolute headache. While at it, I'm also going to install some other dependencies, including LangChain, Transformers, and SciPy, as well as some optimization libraries. Turns out I also needed some open-source embeddings, so I'll install some libraries for that as well; I'm using SentencePiece and sentence-transformers, more on those in a sec. Rather than going my usual route and relying on LangChain, I decided this time I wanted to do as much as possible with other libraries, so Transformers was big in play here: bring in AutoTokenizer, AutoModelForCausalLM, and a new class, TextStreamer. Wait till you see this bad boy running.

Time to bring in the model. First up, create a new variable called name; this holds the name of the model we're going to be using. Then another variable, auth_token, to hold the Hugging Face access token, seeing as I'm downloading the weights from their repo. Here's where I first ran into trouble: in order to use the original Meta weights, you need to be granted access. I did not know this. To get access, you need to go to the Meta website, give them your life story, and hopefully they'll accept you as legit. Two hours after applying, though, I was in.

Time to get this model up and running. There are two parts that I need: a tokenizer and the model itself. The tokenizer converts raw input strings to unique numerical identifiers, a.k.a. numbers, and then there's the model itself; this is where the Transformers classes came in handy. Aside from the name of the tokenizer, passing cache_dir ensures that the model is saved to a specified directory where I have space on the server. I learned this the hard way, after downloading 30 of the 150 gigabytes of weights and having to start again. use_auth_token grants access to the restricted Llama 2 repo. Getting the model up and running follows a similar process to the tokenizer: pass the name, cache_dir, and use_auth_token, as well as the torch data type. Now, this is the critical part of the whole thing, because it is the model; without it, there's no Llama 2. So I followed the same process outlined in the documentation, and that worked about as well as deploying to prod on a Sunday morning after a bender. Using the documentation code I was getting this weird indexing error, so I tried different versions of Transformers, checked different installations of PyTorch, questioned my existence as a developer. But then I found a thread on GitHub which pointed out that previous versions of the config didn't include the right pad token. I jumped into the config, set the pad token, and held my breath. Yeah, no, that didn't work either.
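For reference, here's a minimal sketch of that loading pattern with Transformers. The model name, cache directory, and token value are placeholders I've assumed; the transcript only describes passing a name, cache_dir, use_auth_token, and a torch dtype.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed values for illustration: the gated 70B chat repo on Hugging Face,
# a cache directory with enough disk space, and a placeholder access token.
name = "meta-llama/Llama-2-70b-chat-hf"
auth_token = "hf_..."  # Hugging Face token with access to the gated repo

# Tokenizer: converts raw input strings to token ids.
tokenizer = AutoTokenizer.from_pretrained(
    name,
    cache_dir="./model/",        # keep the ~150 GB download on a volume with space
    use_auth_token=auth_token,
)

# Model: same pattern, plus a torch dtype to keep memory down.
model = AutoModelForCausalLM.from_pretrained(
    name,
    cache_dir="./model/",
    use_auth_token=auth_token,
    torch_dtype=torch.float16,
)
```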
Then I had a bright idea: could I find another set of weights that did work? Turns out the Upstage weights followed a similar pattern, but they had added a rope scaling configuration and, more importantly, they were loading in 8-bit. So I swapped the Upstage model name back to LLaMA's and, boom, we are in business. I think scaling the model across two GPUs had something to do with it, because loading in 8-bit reduced the VRAM requirement, allowing me to run the model on a single A100 80 GB GPU.

To test it out, pass a prompt to the tokenizer and send it to the GPU using the model.device attribute. This is where the TextStreamer came in handy: you can use it to stream the decoded output of the model, which looks slick. But I still hadn't actually generated anything or passed the text to the model. To do that, I can unpack the inputs and pass them to the model.generate function, at the same time passing the streamer and use_cache and setting the max new tokens to effectively infinite in order to continuously stream output. To decode the output tokens, all I need to do is pass the results of the model.generate function to the tokenizer.decode method. This lets the model continuously generate a response to the original prompt.

Now, I could stop here. The model works, why mess with it? I should have stopped here, but I didn't. Got to pay the bills, lads: if you love coding and you're not sure where to start, check out my free Python course at go.coursesfromnick.com/python. But if you're a Python wizard, use YOUTUBE50 to get 50% off my full stack ML course at coursesfromnick.com/fullstackml; I'm currently adding LLM videos to it. Back to the video.

In order to take this one step further, I wanted to use RAG to answer questions, similar to how I built the OpenAI-powered investment banker. I felt it was only right that we do it with open-source models. Should be easy enough, famous last words: just pass the model to the existing LangChain workflow that I built up in that video and Bob's your uncle. But this went about as smoothly as unicycling up Everest while trying to install TensorFlow. LangChain would just hang when using the agent executor, so I did what any self-respecting developer would do: five hours of Succession and three minutes of Googling. I found a new library, well, at least new to me, called LlamaIndex. Sounded promising, but would it get the result I wanted? Would it be the Llama Banker I dreamed of? One way to find out.

First things first, install it using pip install llama-index. Then create a prompt using the SimpleInputPrompt from LlamaIndex; this allows prompt formatting similar to how prompt templates work in LangChain. While I'm at it, may as well throw in a system prompt to help guide the model to be ethical, respectful, yada yada. To use the prompt wrapper, we can grab it and use the format method to complete the prompt. So I'm good with the model itself; it's tested, it works. But I need a way I can use it with LlamaIndex. Thankfully, there's a class for that: it's easy enough to pass the model and tokenizer to the HuggingFaceLLM class, and while I'm at it, I can also specify the system prompt and the query wrapper prompt, as well as some token parameters. That's the LLM bit of the application done.
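Here's roughly what that wrapper looks like, assuming the llama_index package layout from around that time (mid-2023); the prompt text, context window, and max token values are illustrative rather than the exact ones from the video.

```python
from llama_index.prompts.prompts import SimpleInputPrompt
from llama_index.llms import HuggingFaceLLM

# Illustrative prompts; the video's actual system prompt is the usual
# "be ethical, respectful..." guidance.
system_prompt = "You are a helpful assistant that answers questions about the annual report."
query_wrapper_prompt = SimpleInputPrompt("{query_str} [/INST]")

# The format method completes the prompt, much like a LangChain prompt template.
print(query_wrapper_prompt.format(query_str="What was the FY2022 return on equity?"))

# Wrap the already-loaded Transformers model and tokenizer so LlamaIndex can drive them.
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    model=model,
    tokenizer=tokenizer,
)
```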
There's one bit I haven't quite addressed yet, and that's handling the document embeddings, which are needed for retrieval augmented generation. I'd had really good results using the sentence-transformers models from Hugging Face at work, so I want to use them here as well. There are two classes that I need: the first is LangchainEmbedding from the LlamaIndex embeddings submodule, and the second is the HuggingFaceEmbeddings class from LangChain. I know, I know, I didn't want to use it, but I had to. Once those are loaded, it's easy enough to create an embeddings instance by wrapping the HuggingFaceEmbeddings inside the LangchainEmbedding class.

Now, by default LlamaIndex is configured to work with OpenAI. We're building Llama Banker, not OpenAI Banker, so I need to switch this out. The way that's possible is by changing the service context for LlamaIndex; the ServiceContext class and the set_global_service_context function can be grabbed from LlamaIndex to help with that. The goal here is to tell LlamaIndex to use the model that I loaded from Transformers. To do this, I can create a new service context and use the from_defaults method to specify the LLM and the embedding to use; if you wanted to use different models, this is where you'd pass them in. The chunk size here is important: it tells LlamaIndex how to split up the document. A larger chunk size means chunkier blocks of text, a smaller chunk size means smaller blocks of text passed to the model for context. The service context is now created, but it needs to be set so that LlamaIndex recognizes it; that's done by passing the context that was created to the set_global_service_context function. Perfect.

Now, how about those documents, though? The documentation makes it pretty clear how to load a document using SimpleDirectoryReader. Testing this out with the PDF was a shocker, though: when I queried using this method, it looked like it was skipping pages. To be perfectly honest, I didn't dig into it, because I had another idea. First up, install Llama Hub; this opens up a range of other extensions that can be used to extend LlamaIndex. My goal was to try to load the PDF directly into the LlamaIndex vector store index and query from that. The VectorStoreIndex class can be imported from LlamaIndex; while at it, I'll grab the download_loader function, and Path from pathlib as well. I'll show why these are needed in a sec. The download_loader gives access to a bunch of other extensions for use with LlamaIndex; a quick search showed that PyMuPDFReader was available, so I set this as the argument and created an instance of the reader. Then it's easy enough to load the annual report using the load method: all I have to do is pass through the full path to the annual report and set the metadata argument to True.

Home stretch. I need a way to query the documents, though, because so far I've just loaded them. This is where the VectorStoreIndex comes in handy: by passing the documents to the from_documents method, I can store the PDF chunks in the vector store. This means I can now query it using the LLM, but would it work better than what I got from LangChain? To test it out, I need a way to query: I can access the index query engine using the index.as_query_engine method, and then all I have to do is use the query method, and it's now possible to query using Llama 2 and the annual report. This is it, the big test. If I query "what was the FY 2022 return on equity?", the model successfully returns the same result as what is shown inside the annual report. I'd call that a win by any measure.
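Putting those RAG pieces together looks roughly like the sketch below, again assuming the mid-2023 llama_index package layout; the embedding model name, chunk size, and PDF path are placeholders, not values confirmed by the video.

```python
from pathlib import Path

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import (
    ServiceContext,
    VectorStoreIndex,
    download_loader,
    set_global_service_context,
)
from llama_index.embeddings import LangchainEmbedding

# Open-source embeddings: wrap LangChain's HuggingFaceEmbeddings so LlamaIndex can use them.
# The sentence-transformers model name here is an assumption.
embeddings = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
)

# Tell LlamaIndex to use the Transformers-backed LLM and these embeddings instead of OpenAI.
service_context = ServiceContext.from_defaults(
    chunk_size=1024,        # controls how the document gets split up
    llm=llm,                # the HuggingFaceLLM wrapper from earlier
    embed_model=embeddings,
)
set_global_service_context(service_context)

# Load the PDF with the PyMuPDF loader from Llama Hub.
PyMuPDFReader = download_loader("PyMuPDFReader")
loader = PyMuPDFReader()
documents = loader.load(file_path=Path("./annualreport.pdf"), metadata=True)

# Store the chunks in a vector store index and query it with Llama 2.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What was the FY 2022 return on equity?")
print(response)
```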
Now, I could stop here, but that just wouldn't be me. As slick as it is, this ain't easy to deploy, because we're doing this on what is effectively a Jupyter notebook server via RunPod. Deployment isn't straightforward, but I want to try anyway, because that's what machine learning is all about: lifelong learning and loss.

Let's start by whipping up a quick Streamlit app. Create the title by using st.title and naming it appropriately, and create a prompt bar using st.text_input; this is going to return whatever is stored in it, so I can capture the prompt by setting that as a variable. Now, I've already written a bunch of code for generation, so I can copy that into the app. All I really need to do then is query the engine when there's a prompt; I'll throw the query inside of an if statement to achieve that. Then, to run the app, all I need to do is run streamlit run app.py. We're up and running, so if I click this link... uh, um, oh wait. In theory, the app is running locally inside of the Linux server Docker container that RunPod creates, but it's not exposed to the net, so reaching it through the browser is a bit of an issue, and this kind of sucks if I wanted to hand it out to users. Luckily, after a bit of digging and channeling my inner Cisco engineer, I realized I could port forward using the TCP mapping functionality in RunPod. This means I can take this IP and the external port and, boom, the app is running. At last, success.

Now, if I ask it for a summary of the performance of the company, I get... nothing. That should work, right? For whatever reason, the app still has issues. The error, thankfully, was one that I did recognize: it looked like a GPU out-of-memory error, which is absolutely nuts, because this exact same code ran inside a Jupyter notebook 20 minutes ago. What I did notice, though, was that every time I ran a query via the app, it appeared that the model shards were being reloaded into memory, as though the model wasn't cached. It turns out that was exactly what was happening. A few forums mentioned that the best way around this was to use the st.cache_resource decorator around a function that loads a machine learning model; this should prevent Streamlit from attempting to reload the resource into memory. So that's what I did. Prematurely anticipating success, I added two Streamlit expanders to write out the entire response object and also display the source text. Now all that was left to do was see if I'd sorted the application issue. So, here goes nothing: I refresh the app, the model looks like it's caching, prompt and enter, and... and Llama Banker lives.

Caching using the decorator stopped the model from over-consuming GPU memory. But most importantly, check out how flexible and powerful Llama 2 is: I can ask it to summarize the financial performance of the bank; perform entity extraction, here I asked what the four main departments are and we got the results; I could even ask it to break down the sentiment in the chairman's letter, and it looks on point. Well, it was a hell of a ride, but all signs point to huge amounts of promise in the field of open-source large language model development. If you want to see some tests that I did with Falcon 40B, click here.
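A minimal app.py along those lines might look like the sketch below. build_index() and the rag_setup module are hypothetical stand-ins for the notebook code sketched earlier (model, HuggingFaceLLM, service context, PDF loader, vector store index), and the expander contents are an assumption about how the response object gets displayed.

```python
import streamlit as st

# Hypothetical helper/module standing in for the setup code shown earlier;
# in the video that code is simply pasted into the app script.
from rag_setup import build_index

@st.cache_resource
def load_query_engine():
    # st.cache_resource runs this once per session, so the 70B model shards
    # are not reloaded into GPU memory on every Streamlit rerun.
    index = build_index()
    return index.as_query_engine()

st.title("Llama Banker")
prompt = st.text_input("Ask a question about the annual report")

if prompt:
    query_engine = load_query_engine()
    response = query_engine.query(prompt)
    st.write(str(response))

    # Expanders to inspect the full response object and the retrieved source text.
    with st.expander("Response object"):
        st.write(response)
    with st.expander("Source text"):
        st.write(response.get_formatted_sources())
```

Run it with streamlit run app.py, then use RunPod's TCP port mapping to reach the app from outside the container.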
Info
Channel: Nicholas Renotte
Views: 68,631
Keywords: llm, llama2, llama 2, retrieval augmented generation, rag, ai
Id: SedGB8m2XLM
Length: 11min 8sec (668 seconds)
Published: Fri Aug 18 2023