Hello everyone, Daniel here. In this video tutorial we're going to start from a blank notebook and end with a full retrieval augmented generation pipeline, or RAG pipeline, from scratch, and best of all it's going to run on our local machines. To showcase the power of AI on RTX, have a look at that GPU: NVIDIA have been kind enough to send me an RTX 4090, so shout out to Mark, thank you so much for that. What this means is that if you have a PC with an NVIDIA GPU, you've got a machine that's incredibly capable of AI workflows. I've got all the code on GitHub and we're going to write it line by line together in this video, so check the link in the description. This is also to celebrate the upcoming NVIDIA GTC conference, happening March 18 to 21 in San Jose, California. If you're watching this in the future and it's already happened, there are still heaps of sessions available virtually. I live in Australia, so I'll be attending virtually, and I've got three sessions lined up that I'm looking forward to. The first is a retrieval augmented generation overview of design systems, which is similar to the pipeline we're going to build in this video; it covers some NVIDIA software, TensorRT-LLM, which we don't explicitly use here but which I'd like to learn more about, potentially for a future video, so let me know if you want to see that. Then there's saving trees in forests all over the world using AI, a really cool application that identifies 600 different species of trees on a mobile phone with a macro lens. And finally, building a deep learning foundation model via self-supervised learning: self-supervised learning is the powerful technique large language models use to learn from large corpora of text, and this session brings it to computer vision. So once again, all the code is on GitHub, the links will be in the description, and you can follow along if you'd like to build your very own RAG pipeline from scratch running on your own PC. I might have to move the machine to the floor because it's quite loud running here. If you have any questions, leave a discussion on the GitHub page or a comment below, and let's build.

Hello, hello, welcome to the simple local RAG tutorial. We're going to rebuild this workflow and have it run entirely locally on our own GPU. The materials are on GitHub (link in the description), and if you don't have a local GPU available, the same workflow will also run in Google Colab via the link in the repo. You can use the reference notebook there as you follow along, but we're going to code this whole thing up from scratch. The reference notebook covers what RAG is, why RAG (we're going to be saying "RAG" a lot in this video), what kinds of problems RAG can be used for, why run it locally, key terms, what we're going to build, and so on, along with all the code and explanations. There are a lot of vector outputs in there too; that's where all those numbers come from, they're our embeddings, but we'll get to that. I'm just pointing it out so you can use it as a reference. So let's jump into an example workflow.
By the way, we're going to take this quite slowly and write every line ourselves, so if you have any questions or issues, leave a comment on the video, or a discussion or an issue on GitHub if something's wrong. If you want to get set up, all the instructions are in the repo, but we're going to go through this step by step.

So what is RAG? RAG stands for retrieval augmented generation. I've borrowed this flowchart from NVIDIA's blog; they have a great post called "Demystifying RAG 101", and that's roughly what I've based this workflow on, except we're not using a framework: we're going to write all of it ourselves. We're not using LlamaIndex or LangChain, which are two great frameworks that can help with these workflows, but we're going to see what it looks like to do it with just Python, PyTorch, Transformers and so on. That way, if you'd like to use LangChain or LlamaIndex in the future, you'll know how to do it yourself. If you want to read more, I'll link that blog post, and I'll also link the little whiteboard I'm using.

So what are we going to build? We'll break it down into two major parts: the first is document preprocessing and embedding creation, and the second is search and answer. Before I jump into the flowchart, let me set the scene. I've already gone through the setup steps on the GitHub: git clone, then Python 3.11 on a Windows 11 machine with the NVIDIA RTX 4090 that NVIDIA have kindly provided to create this video in partnership. The RTX 4090 is, at the time of recording, one of the best consumer cards you can buy, but this should work on most modern NVIDIA GPUs, and there's code in the repo so that if you've got an older GPU with less memory it should still work; we'll discuss that later when we get into model usage. So I've gone through the setup: created an environment, activated it, and installed the requirements (if you run into any issues there, I've put some notes in the repo).

There are two ways we can run locally. One is VS Code: typing "code ." brings up the same repo on my local machine, and I can start a new notebook there, call it something like video.ipynb, select the Python environment and run "import torch" with Shift+Enter. However, I've personally found (and maybe I'm doing something wrong, since I'm not used to Windows) that VS Code lags a little for me with a large notebook; if you know why, please let me know. So instead I'm going to start a Jupyter notebook with "jupyter notebook". The same process will run whether you're using a VS Code notebook or Jupyter, which should technically behave the same. Now I'm in a Jupyter server which has the same files as my VS Code setup.
I'm just going to use Jupyter; if you'd rather use VS Code, you can do that as well. So let's jump in. (Ah, an interesting error: I hadn't saved the notebook yet. I'm going to leave some of these errors in.) Now that I've saved it, it opens, we get torch, and I'll zoom in a little. I know that was a bit of an intro, but all I've done is go through the setup steps in the terminal: git clone, cd into simple-local-rag, create an environment, activate it, install the requirements, and then run jupyter notebook. If you'd like a dedicated setup video, let me know and I can make one.

Let's create a title: create and run a local RAG pipeline from scratch. Of course, it won't be 100% from scratch, since we won't be training any embedding or LLM models here, but we will be using them.

So, what is RAG? RAG stands for retrieval augmented generation. You may have been seeing it quite regularly in the world of LLMs and generative AI. Essentially, the goal of RAG is to take information and pass it to an LLM so it can generate outputs based on that information. That's a pretty broad statement, but that's on purpose, because RAG really can be that broad. (Let me turn this cell into markdown: Escape then M, by the way.) So that's what we want to do: we have information in one place and we want to pass it to an LLM so it can generate outputs based on that information plus our prompt. Let's break each of those words down.

Retrieval: retrieval is another word for search, so find relevant information given a query. For example, in this project we're going to build NutriChat, which is "chat with a PDF". We have a nutrition textbook that's about 1,200 pages long and we want to chat with it, so we might ask "what are the macronutrients and what do they do?" Retrieval then finds relevant passages of text related to the macronutrients from the nutrition textbook.

Augmented: take the relevant information we got from retrieval and use it to augment our input to the LLM (the input to an LLM is often called a prompt). We'll discuss why we'd want to do that in a second.

Generation: take the first two steps and pass them to an LLM for a generative output.

Beautiful: retrieval augmented generation. If you've ever used an LLM, ChatGPT is a great example. I can ask it "what are the macronutrients and what do they do?" and, as far as I know, ChatGPT is not using retrieval augmented generation here; it's just generating an answer based on what it learned from its training data. That's still helpful, but why might we want to use retrieval and then augment our prompt with relevant information? Let's discuss why RAG. The main goal of RAG is to improve the generation outputs of LLMs.
If we go back to ChatGPT, we got a pretty good answer. I've studied nutrition and I'd say this is a solid high-level response: macronutrients are the nutrients our bodies need in large amounts to function correctly, and they can be divided into three main categories, carbohydrates, proteins and fats, followed by a little breakdown. However, this is quite a general question; you'd find a lot of information about it on the internet, and that's what most large language models are trained on, internet-scale corpora of text, terabytes of text data. But what if you have private information, or factual information that an LLM won't have access to? That's where RAG can come in.

Number one: prevent hallucinations. LLMs are incredibly good at generating good-looking text, but good-looking doesn't mean factual. Say you're a business with a customer support page. Telstra is a telephone company in Australia (and I know this is a real project, because one of my friends is working on it), so let's go to Telstra's customer support and search something natural like "my bill was too much last month". That's very natural language, but what do we get back? A list of links: make a complaint, lots of support documents. What if we could get an answer instead? In other words, what if you're Telstra, or any other company with a lot of customer support questions, and a customer could ask in plain language and get a specific answer? If I ask ChatGPT the same question, "my Telstra bill was too much last month", I get: if your Telstra bill was unexpectedly high last month, there are several things you can do, review your bill, and so on. That's generic information. But what if we had a ChatGPT-like model (ChatGPT is, after all, a large language model) that was linked with all of Telstra's customer support documents, and we could talk to it when we have a problem? The generic answer is what you'd expect from a general model, but we've got a specific problem and we want it answered using Telstra's own customer support content. That's where RAG comes in handy.

So, hallucinations: the generated text can look good and read well, but that doesn't mean it's factually accurate. With RAG, we provide our LLM with factual information that we have stored somewhere and say, "hey, can you process this information to answer my query?" RAG helps LLMs generate answers based on relevant passages that are factual.

Number two: work with custom data. Many base LLMs are trained on internet-scale data, which means they have a fairly good understanding of language in general.
So if I ask ChatGPT about a problem with my Telstra bill, it'll give me some pretty good answers, but they won't be Telstra-specific. Whereas when I typed this into Telstra's website, in an ideal world they'd take all their customer support documents, put them into a system, a RAG pipeline, and I'd get an answer back in a similar style to ChatGPT, but based on Telstra's actual documents. That's the whole premise of RAG: use your actual documents, process them, and deliver them in a generative way. Retrieval: find the relevant Telstra documents related to my problem ("my bill was too much last month" is my query). Augmentation: take passages from those documents and put them into my query. Generation: an LLM like ChatGPT takes those passages and returns an answer, or options for me, in a legible, readable way. Of course there's much more to it, but this is a real problem that one of my friends at quite a large tech company in Australia is helping Telstra work on right now, so that's one example of where RAG gets used in the workplace.

So, base LLMs have a fairly good understanding of language in general, but a lot of their responses can be generic in nature. RAG helps create specific responses based on specific documents, e.g. your own company's customer support documents. And I want you to start thinking of "documents" as a very broad term: documents could mean thousands of PDFs, hundreds of customer support articles, heaps of text files, your emails.

Let's get into some use cases: what can RAG be used for? We've already discussed one. We've all been to a customer support website and had absolutely no fun trying to get help, like I just did (Telstra, I love you, but your search could use a little help).

Number one: customer support Q&A chat. Treat your existing customer support documents as a resource, and when a customer asks a question, have a retrieval system (we're going to build one later, don't you worry) retrieve relevant documentation snippets, pieces of text from the documentation, rather than making the customer wade through "what's an order estimate", "why have I received a final bill", "get help on your bills and payments"... I just want the answer; there are too many options. And I'm not ditching on Telstra specifically: for a lot of companies this is a big problem, because customer support is a very hard thing to do. So: retrieve relevant documentation snippets and have an LLM craft those snippets into an answer. You can think of this as a chatbot for your documentation. I know there have been a lot of chatbots in the past, but over the last couple of years they've gotten really good, so this is no longer a gimmick; it actually works. For example, Klarna's LLM customer support. Klarna is quite a big financial company, worth multiple billions of dollars apparently.
Klarna's AI assistant has handled two-thirds of its customer service chats in its first month, which is a big number: 2.3 million conversations. This is where RAG pipelines can help: you type a message to Klarna, it goes through its customer support documents for your problem, and it gives you an answer. It's based on OpenAI, so this is exactly the workflow I was just describing: a ChatGPT-like model plugged into your own documentation. And Klarna estimates it will drive a 40 million USD profit improvement in 2024, which is incredible. So let's keep going with our use cases; we want to learn how to build these systems.

Number two: email chain analysis. This is another project my friend is working on. Say you're a large insurance company and you have chains and chains of emails about customer claims. You could search through all of those emails by hand, which would take a very long time, or you could use a RAG pipeline to find relevant information from those email chains and then use an LLM to process that information into structured data: basically take the unstructured text and turn it into something like JSON or a spreadsheet.

Number three: company internal documentation chat. This is quite similar to customer support Q&A, but for a large company with an internal documentation set, or even your own computer. I'm still getting used to Windows, but on my Mac I'd really love it if the assistant (I won't say its name out loud because it might trigger something in the house, but it starts with an S and ends in "iri") could search all of my documents via chat instead of me having to use keywords.

Number four, which is what we're going to build: textbook Q&A. Say you're a nutrition student and you've got a 1,200-page textbook to read. You could build a RAG pipeline to go through the textbook and find passages relevant to your questions. The premise is the same across all of these use cases; the common theme is: take the documents relevant to a query and process them with an LLM. From this angle, you can consider an LLM a calculator for words. That's quite a different concept, because over the last few years LLMs have gotten very good at processing language in general. (And excuse me while I ignore Grammarly, which is invading everywhere; I'll turn it off at the next break.)

One more point: why would we want to run this locally if ChatGPT is already good? The way I think about it, running locally is like owning your own car: it's fun to drive. So first of all, fun: it's fun to build your own pipelines and software and run them locally. Then there are the practical benefits: privacy, speed and cost. Privacy: if you're a big company with private documentation, maybe you don't want to send that to an API.
Maybe you don't want to send your information to OpenAI, or Anthropic with Claude, or Google, or whoever; you want to set up an LLM and run it on your own hardware. Number two is speed: whenever you use an API you have to send data across the internet, and that takes time. You'll notice with ChatGPT, if I ask "that answer didn't quite help, can you search Telstra's customer support documentation for me?", we have to wait maybe five seconds for the browsing and the response to come back. It's pretty quick, but we still have to wait. Running locally means our data doesn't have to go off into the internet; it runs on the machine under my desk, on this particular GPU (nvidia-smi shows it right there), so it can go quite quick. Three is cost: if you own your hardware, the cost is already paid. Buying new hardware may have a large upfront cost, but over time you don't have to keep paying API fees. And probably another important one: no vendor lock-in. If you run your own software and hardware, then even if OpenAI or another large internet company shut down tomorrow, you could still run your business. I don't think OpenAI will shut down tomorrow, but with your own hardware and software it doesn't matter if they do. So those are four really big points (there are more arguments, but these are the big ones I picked out).

That's about enough for the intro: what is RAG, why RAG, what RAG can be used for, and why local. Now let's dive into the project. I'll pause the video here (I'll just write "pause" for myself for when I edit this); if you want to take a little break, come back in a couple of minutes and we'll go over the project we're going to build and write some code.

Okay, I'm back. We just covered what is RAG, why RAG, what RAG can be used for, and why local. Now let's discuss what we're going to build. I've got two resources: simple-local-rag, the GitHub repo, and the little whiteboard I've been using; I'll link both, and they'll be in the description as well. The GitHub notebook is the reference for all the code, by the way. I've skipped one thing: I haven't gone over the key terms, but you can read them in the repo and I'll explain them as we go: token, embedding, embedding model, similarity search / vector search, large language model (LLM), LLM context window, prompt. So there are some key terms to read, but we'll go through them as we go. Notebook open, beautiful.
So let's jump into what we're going to build; I want my little workflow diagram here. Let me zoom in: we have two major parts, document preprocessing and embedding creation, and then search and answer, and it's all going to happen on a local NVIDIA RTX 4090.

Say you have a dataset of documents. Again, "documents" is very broad: it could be a large collection of PDFs, customer support articles, internal documentation, email chains, almost anything to do with text (we're starting to get into the realm of images too, but we'll focus on text). I have a roughly 1,200-page nutrition textbook; the link is in the GitHub, it's an open source textbook, but this would work with almost any textbook. Human nutrition: I love food, I love nutrition, I'd love to learn more about it, so frankly I'm building this system for myself. We could also use all the nutrition articles on Wikipedia, but we're going to start with one PDF. I've downloaded the digital PDF and included it in the repo as human-nutrition-text.pdf (renamed; it's open source, so you can check it out, though it might be a bit too big to load in the browser). If we open it up, there are heaps of pages: about 1,200, though not all of them are body text; plenty are references and so on. We could search through it, but a general query like "what are the macro..." doesn't work; we can only do keyword matching, like "macronutrients". What if, instead of searching like that, I could chat with the document? That's what we're going to build.

So what we have to do is take that PDF and preprocess it (we're going to write code for all of this): we preprocess the text into smaller chunks. The text on these pages is actually fairly large, but rather than using a full page we might break each page into something like 10 sentences. I know that number because I've already made the materials, but how to chunk, that is, how to make your text smaller, is still an active area of research. Then we have smaller documents; these will become the context, the retrieved passages we pass to our LLM later on. Next, we embed these smaller chunks of text. An embedding is a numerical representation that makes data useful; that's all we need to know for now, and we'll see it hands-on shortly. Then we store the embeddings somewhere: we could store them in a PyTorch tensor (if you're not familiar with PyTorch, I have a video on it you might want to check out), or in a database, so we can use them later. That's part one: embedding creation and storage.

Now for search and answer. It starts with a cool person, and maybe that's you.
(I always laugh at the stock image: search for "person" and you get two people in sunglasses and business attire who look like they're in a spy movie.) So, cool person, maybe that's you, asks a query of your information: "what are the macronutrients and what do they do?" What we do is embed that query with the same model we embedded our documents with. Again, think of an embedding as a numerical representation, because computers don't deal with text as well as we do. It might appear that they do with ChatGPT, but behind the scenes, when I ask "how much does an elephant weigh?", ChatGPT turns that query into a numerical representation and then quizzes its weights, which are also numbers, along the lines of "have you seen numbers like these? Show me other numbers that look like a response to this." Computers deal with numbers even though it appears as text, and we're going to do this ourselves: turn words into numbers and back again.

So we embed our query, and this is important, with the same embedding model we used to embed our documentation. We could store that query embedding if we wanted, to cache it for later, but it's usually quick enough to embed a query as it comes in and search. Then we find similar embeddings, numerical representations that match our query; this is the retrieval part of retrieval augmented generation, and I'll show you how to do it later on, since we're going to write the code ourselves.

Then comes the augmentation part: we take our question, append the relevant passages from our text, and pass that to an LLM. Imagine we're talking to ChatGPT and we write "my Telstra bill is too much", then context item 1, context item 2, context item 3, and then "please create an answer to the problem based on the context". Those three context items could be three relevant articles, like the "check your account details" page (just an example, not necessarily the right article), plus a couple of others related to my problem. We ask our calculator for words to create an answer to the problem based on those contexts, rather than producing a generic answer. Then that answer comes back to us along with the resources it was based on. This is an important point: we get the answer, but also where the answer came from, so we can go and verify whether it's actually correct. In the case of the nutrition textbook, we could ask "what are the macronutrients and what do they do?", and our LLM could return a response and also say "this is where I got the information from: page five, macronutrients, and here are a couple of other pages that are helpful if you want to learn more" (I'm making those particular pages up, but that's the type of system we could build). Not only are we getting the answer, we're also getting the resources it came from.
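To make that flow concrete before we write the real thing, here is a high-level sketch of the retrieve-and-augment steps just described. Everything here is illustrative and assumed (the function names, `passage_embeddings`, `passages`, and the prompt wording are mine, not the final code); we'll build the actual versions step by step later in the video.

```python
# Illustrative sketch only: assumes we already have an embedding model (e.g. from
# sentence-transformers) plus a tensor of passage embeddings and the matching passages.
import torch

def retrieve_relevant_passages(query: str, embedding_model, passage_embeddings: torch.Tensor,
                               passages: list[str], k: int = 3) -> list[str]:
    # 1. Embed the query with the SAME model used to embed the passages
    query_embedding = embedding_model.encode(query, convert_to_tensor=True)
    # 2. Retrieval: score every passage embedding against the query embedding (dot product)
    scores = torch.matmul(passage_embeddings, query_embedding)
    top_k = torch.topk(scores, k=k)
    return [passages[i] for i in top_k.indices.tolist()]

def augment_prompt(query: str, context_items: list[str]) -> str:
    # 3. Augmentation: put the retrieved passages into the prompt alongside the query
    context = "\n".join(f"- {item}" for item in context_items)
    return (f"Based on the following context items, please answer the query.\n"
            f"{context}\n"
            f"Query: {query}")

# 4. Generation: the augmented prompt then gets passed to an LLM to produce the answer.
```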
You can imagine how helpful that is for a customer support system: you get an answer, plus "hey, read this documentation for more". Or if you're in a large company and ask "what do I do when I need to lodge an invoice?", the response says "you can lodge an invoice here, and this is our process", and so on. So that's our beautiful little diagram; we'll come back to it quite often, but let's write out some steps. We're going to build a project I'm calling NutriChat, to chat with a nutrition textbook, but again, keep your mind open to the fact that the "textbook" here could be anything: a collection of PDFs, text documents, customer support documents. RAG is still quite a new workflow and we're really only scratching the surface.

Oh, and one more thing while I have it in mind: if you want to read where RAG came from, see the paper from Facebook AI (it originated out of Facebook AI, a fantastic AI team). If we search for the RAG paper, we find "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". It's a machine learning paper, and if you've never read one they can be quite intimidating, but essentially it's a really technical way of describing what we've just described. I really love the last paragraph, which explains what RAG offers: the work offers several positive societal benefits over previous work, because it is more strongly grounded in real factual knowledge (in their case Wikipedia); their example used a large language model (GPT-2, maybe) generating outputs based on Wikipedia text, which makes it hallucinate less, with generations that are more factual, and offers more control and interpretability, since you know where the generations came from. RAG could be employed in a wide variety of scenarios with direct benefit to society, for example by endowing it with a medical index (so you could give it all the medical journals, or PubMed) and asking it open-domain questions on that topic, or by helping people be more effective at their jobs, because if you can get information quickly, you can work quickly. Let's put that in our notebook: it's a great paper to read, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". The version I'm looking at is dated 12 April 2021, so it's under three years since this paper was published and RAG is already taking over many different workflows. We're going to learn how to build a system like this ourselves.

Wonderful. Now let's get into what, specifically, we're going to code. I know we're going a bit all over the place, but that's all right; we're going slow, we're going to write every line of code ourselves, and I want to explain as much as I can. Step one: open a PDF document (you could use almost any PDF here, or even a collection of PDFs). Step two: format the text of the PDF textbook ready for an embedding model, because we want to turn our text into numbers with that model.
Step three: embed all of the chunks of text in the textbook, turning them into numerical representations (embeddings) which we can store for later. Step four: build a retrieval system that uses vector search (we'll talk about what that is later, or if you're trigger-happy you can skip ahead to the code) to find relevant chunks of text based on a query; vector search is one of my favourite techniques I've ever learned in machine learning, so I'm really excited to share it. Step five: create a prompt that incorporates the retrieved pieces of text (this is basically the text version of our nice diagram, and it would help if I could spell, so Grammarly wouldn't be such a hindrance). Step six: generate an answer to a query based on passages of the textbook with an LLM, all locally. So major part one is steps 1 to 3, which we'll call document preprocessing and embedding creation, and major part two is steps 4 to 6, which we'll call search and answer. That's what we're going to build: one, document/text processing and embedding creation; two, search and answer. I know I said we'd start coding in the last section, but how about in the next section we start on part one, document/text processing and embedding creation. I'll write a little "pause" note here for future Daniel; let's take a break and I'll be back in a couple of minutes.

And I'm back. Let's start step one: document/text processing and embedding creation. Let's treat this like a machine learning cooking show: we need a couple of ingredients. Ingredients: a PDF document of choice (note that this could be almost any kind of document; I've just chosen to focus on PDFs for now, and I'm going to sound like a broken record, but if you've got text, you can probably set up some sort of RAG pipeline with it) and an embedding model of choice (we'll get to that in a second; we don't necessarily need it ahead of time). And the steps: one, import the PDF document; two, process the text for embedding, e.g. split it into chunks of sentences; three, embed the text chunks with an embedding model; four, save the embeddings to file for later use. In cooking-show terms, embeddings will store on file for many years, or until you lose your hard drive; in essence they don't really have a best-before date, as long as you have access to the model you embedded them with, because if you have embeddings but not the embedding model, running a query against them is a problem. We'll see that later on.

So let's start by importing a PDF document. Of course, like any good cooking show, I've prepared ahead of time, but let's do things for completeness, since you may not have the PDF that I have. This is our PDF; I downloaded it from the open textbook website, and it's also on GitHub if you really want it.
But how about downloading it programmatically, which I like to do in case you don't already have the file? So, yes, we're finally writing some code; thank you for your patience if you've made it this far, and I promise the rest of the video will mostly be code. Let's get our PDF document path: pdf_path = "human-nutrition-text.pdf". We're only working with one PDF here; if you're working with a folder of PDFs you'd alter the code a little, but this is a 1,200-page PDF, so it's quite a large one. We already have the file, but I'm writing code so that if we didn't, we could get it. We start by importing os (Python's operating system module) and requests, which helps us download things off the internet; you can look up the Python requests documentation to see the available methods.

Download the PDF if the path doesn't exist: "if not os.path.exists(pdf_path):". The os.path functions help us check our filesystem for that path, and if it doesn't exist, we want to download it. I'll print a little info statement, "file doesn't exist, downloading...", because you'll find I like to do a lot of little prints to see what's happening. Then we enter the URL of the PDF. We could use the URL from the GitHub repo, but why not go to the actual website: on the textbook page there's a "digital PDF" link, and I can copy that link address. (By the way, if this overruns their server I might change the code to download from GitHub, so we use GitHub's bandwidth rather than the open textbook site's; I'll also grab the raw GitHub address in case we need it in a later section, if we get an email from Pressbooks saying we've overrun their server because a lot of people are downloading this PDF. It shouldn't be too bad.) The local file name to save the downloaded file can just be pdf_path; we don't strictly have to assign it again, but I like to be explicit. (Oh, I just realised I still forgot to turn Grammarly off. That's okay.) Then we send a GET request to the URL: response = requests.get(url). With the requests library we're essentially saying "hey, we're sending a message to this URL; does something come back?" Then we check whether the request was successful.
We do that with response.status_code. There's a whole bunch of different status codes for HTTP requests; you can look them up on MDN, and 200 means "OK, it worked", which is what we're looking for. So if the status code is 200, we open the file and save it: with open(filename, "wb") as file, where "wb" stands for write binary and "with" is a context manager, then file.write(response.content), the content that came back from our URL. Just note that this has to be the download URL, because that's what gives us the PDF; it can't just be the page URL, it has to be the specific copied link address. Keep that in mind if you're downloading other things from the internet. Then we print a little message, "[INFO] The file has been downloaded and saved as {filename}" (we do want an f-string here so the filename appears; print is your friend when it comes to debugging). If we don't get a 200 status code, we print "[INFO] Failed to download the file" along with the status code so we can see what happened, maybe a 400, or something else if we've overloaded the server; please let me know if you get an error and I'll add code to download directly from GitHub or another file store. Finally, an else branch: if the file already exists we don't have to re-download it, so we just print that it exists. (And note the "if not os.path.exists(...)"; I missed the "not" at first. Did you catch that?)

Since the file already exists on my machine, let's do the ultimate cooking-show test: delete one of our ingredients and run the cell. "File doesn't exist, downloading... the file has been downloaded and saved as human-nutrition-text.pdf." There we go, we get it back. Wonderful. That's the beauty of importing your files programmatically.
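For reference, here's that download cell consolidated into one sketch. The URL below is my best recollection of the "digital PDF" link address copied from the open textbook site, so treat it as an assumption and substitute the link you copy (or the GitHub copy) if it differs:

```python
import os
import requests

pdf_path = "human-nutrition-text.pdf"

if not os.path.exists(pdf_path):
    print("File doesn't exist, downloading...")
    # Assumed download URL (the "digital PDF" link address copied from the textbook site)
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"
    filename = pdf_path  # local file name to save the downloaded file as
    response = requests.get(url)
    if response.status_code == 200:
        with open(filename, "wb") as file:  # "wb" = write binary
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")
else:
    print(f"File {pdf_path} exists.")
```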
Now let's import the PDF (new cell; Escape then M for a markdown heading). To do this I'm going to use the PyMuPDF library. There are lots of different ways to open a PDF; I also tried pypdf, another good PDF library, but I found PyMuPDF returned the best text formatting for ours. This is where you'll probably do some experimentation of your own: there are settings I could play with, but what I like to do is try a few options, take the one that works best out of the box, and iterate from there. If you've gone through the setup steps on the GitHub, PyMuPDF should already be installed (it's in the requirements); otherwise it requires pip install pymupdf, and I'll link its GitHub if you'd like to read more.

We import it with "import fitz". That's another quirk: it's not "import pymupdf", it's fitz, which is just a legacy name for the library. I'm also going to use tqdm because I like progress bars (that's also pip-installable); we'll see how to use it shortly. And I'm going to make a little helper function. You'll notice this a lot throughout the notebook: I like turning things into helper functions, which is very helpful practice in code generally. I know ahead of time what needs formatting, so I'm cheating a little, but you'll often have to write preprocessing steps when you import text, because text formatting in PDFs varies a lot; look at the layout of these pages, the imported text won't necessarily come through perfectly word for word. So here's text_formatter, which performs minor formatting on text: cleaned_text = text.replace("\n", " ").strip(), that is, replace newlines with a space and strip the whitespace at the ends. Again, I know that's what's needed ahead of time, but you can put anything you want here, and potentially more text formatting functions can go here. Then return cleaned_text, because the better formatted the text you pass to an LLM, the better your potential responses. (Let me just close some tabs; there's a fair bit going on.)

Now let's open our PDF. I'll write a helper function for that too, so we could reuse it elsewhere: open_and_read_pdf(pdf_path: str), which returns a list of dictionaries, one of my favourite data structures, because you can turn a list of dictionaries into a DataFrame quite easily and explore it. To open a PDF with fitz/PyMuPDF we call doc = fitz.open(pdf_path). Then I create an empty list called pages_and_texts: this is the list we're going to return, and as we iterate through the document we'll create a dictionary of text, page number and other information about each page and append it to that list; we'll see it in action rather than just talking about it. So: for page_number, page in tqdm(enumerate(doc)). tqdm gives us the progress bar, and enumerate numbers the iterator, so if the document had 100 pages, page_number would go 0, 1, 2 and so on up to 99, because Python starts at 0. We get the text by calling the page's get_text() method, then format it with text = text_formatter(text=text), our function above. Then we append our information to pages_and_texts.
This is where we add the page number, and here's a little tidbit I noticed in this textbook (again, you'll need to do some experimentation of your own depending on the data you're working with). Why do we want the page number at all? Because if the model uses a specific resource, we want to be able to jump to that page and check it. In this particular textbook, I found that the printed page numbers actually start on page 43 of the PDF, so I subtract 41; that's just what I found by experimentation. What matters is whether our page numbers actually match the text, since the PDF reader's page index may not line up with the printed page numbers. Keep that in mind when importing your documents, if you want page numbers; you don't strictly need them, I just like to have them. Then page_char_count: the length of the text, so we know how many characters are on the page. This lets us do exploratory data analysis on our texts and find statistics like how many characters and words are on a page. page_word_count is a rough word count: we split the text on spaces and count the pieces. page_sentence_count_raw: we split on a full stop followed by a space. I call it "raw" because it's not an official way to split sentences, but it works pretty well; looking at the page, a full stop plus space really does mark a sentence boundary most of the time. And then page_token_count.

So what is a token? We've hit one of our key terms, and I said I'd explain them as we go. We estimate the token count as the character count divided by four, because one token is roughly four English characters. Where does that come from? From the repo's key terms: a token is a sub-word piece of text; for example "hello, world!" could be split into "hello", ",", "world" and "!". A token can be a whole word, part of a word, or a group of punctuation characters. One token is roughly four English characters, so 100 tokens is roughly 75 words, and text gets broken into tokens before being passed to an LLM. As we said at the start, people love words but machines love numbers: before your words go to ChatGPT, they get tokenized. OpenAI have a great explanation, "What are tokens and how to count them": tokens can be thought of as pieces of words, and before the API processes the request, the input is broken down into tokens. You'll hear this a lot, token count, context window, all that sort of jazz. One to two sentences is about 30 tokens, one paragraph is about 100 tokens, and 1,500 words is about 2,048 tokens. This matters when you're thinking about how much text you can pass to a model.
OpenAI also have a tokenizer tool, so let's try it: "Hello, my name is Daniel and I love machine learning. I'm giving a tutorial on RAG pipelines and would like to explain tokenization. What is tokenization?" That comes out at 33 tokens, and it shows the token IDs too. Wow, that's a great demo; the things you find as you work. So this is what the text looks like before it goes to our model: it gets tokenized, turned into numbers. We don't have to understand the individual IDs, that's for the machine learning model, but I wanted to show you what I mean by "token": we take the words, break them into sub-word pieces, turn them into numbers, pass them to the LLM, it gives us tokens back, and we decode them into words again.

Let's keep going. So that's our token count, and one more thing of course: we need the text itself in the dictionary. Then from the function we return pages_and_texts. Now let's call it: pages_and_texts = open_and_read_pdf(pdf_path=pdf_path), and look at the first couple of samples. There's tqdm working, nice and fast; how cool is that? The first entry is page -41, so how about some random samples instead? I like random samples, you never know what you're going to get: import random, then random.sample(pages_and_texts, k=3). So we've just read in all the pages of our PDF and now we have them in a list of dictionaries we can work with. One sample is page number 600, about phytochemicals: chemicals in plants that may provide some health benefit, carotenoids being one type. Let's search that text in the PDF to check that our page number is correct (this is where keyword matching comes in handy): are we on page 600? There we go. That's the sort of experimentation I mentioned, and now we have a way to interact with our text programmatically. We could potentially read the page number straight from the page itself, but I decided to enter the offset manually; the important thing is that the page numbers line up.

What should we do next? How about some stats on the text; this is where you'll see why I collected the token count, character count and so on. import pandas as pd (this is also why I made pages_and_texts a list of dictionaries, because pandas works great with that data type), then df = pd.DataFrame(pages_and_texts) and check df.head(). Beautiful: now we've got a good breakdown of the data we're working with: page number, character count, word count, raw sentence count (a rough estimate), token count, and of course the text. Then df.describe().round(2) gives us a summary to two decimal places. We have about 1,200 pages, quite the big textbook (maybe minus about 150 pages of references and such, but still over 1,000 pages of text), an average sentence count per page of about 10, an average token count of about 287, and the word count column tells us it's roughly 200 words per page.
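Putting those cells together, here's a consolidated sketch of the import-and-inspect code we just wrote; the dictionary keys and the -41 page offset follow the walkthrough above, but treat the exact naming as illustrative:

```python
import random

import fitz  # PyMuPDF: imported as "fitz" for legacy reasons
import pandas as pd
from tqdm import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip()
    # Potentially more text formatting functions can go here
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF and returns a list of dicts with per-page statistics and text."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = text_formatter(text=page.get_text())
        pages_and_texts.append({
            "page_number": page_number - 41,  # printed page numbers start at PDF page 43
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_sentence_count_raw": len(text.split(". ")),
            "page_token_count": len(text) / 4,  # rough rule of thumb: 1 token ~= 4 characters
            "text": text,
        })
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
print(random.sample(pages_and_texts, k=3))  # spot-check a few random pages

df = pd.DataFrame(pages_and_texts)
print(df.describe().round(2))  # per-page character/word/sentence/token statistics
```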
Now, why would we care about the token count? Let me write this down as a little teaser for what's to come. Token count is important to think about because, one, embedding models don't deal with infinite tokens, and two, LLMs don't deal with infinite tokens. We can't just pass all the text in the world to our embedding model or to our LLM; in fact, that would be quite computationally wasteful. For example, an embedding model may have been trained to embed sequences of 384 tokens into numerical space, and I'm not making that number up: that's the embedding model we're actually going to use, all-mpnet-base-v2 (how I know that off by heart, don't ask). It's from the sentence-transformers library, hosted on Hugging Face, which is a great resource for many different open source embedding models as well as LLMs; we'll use it for both of our models, actually. I've got it linked in the little whiteboard too, and we'll get deeper into embeddings later, but let me quickly show you what I mean by a token limit. In the sentence-transformers pretrained models list there's all-mpnet-base-v2. It may not be the state-of-the-art, latest-and-greatest embedding model, but it's easy for us to use; which embedding model to use is a very experimental choice, and I've just found this one works great pretty much out of the box for many problems. Its max sequence length is 384, which means it can only take in 384 tokens: if we put in a sequence of 500 tokens, it would get truncated to 384, so we might lose some information. If you then go to the MTEB leaderboard (Hugging Face's massive text embedding benchmark; remember, an embedding model is something that turns a string of text into a useful numerical representation), there are lots of different models to explore in your own time. Some can take in 32,000 tokens, which is a lot of pages of text, but they tend to have a much bigger model size; ours is about 400 megabytes, which is quite small. There are a lot of things to take into consideration when choosing an embedding model, and we'll talk about that later, but this is why we did our token analysis: to get the rough token count per page and check that we can embed a whole page of text with our chosen model, sentence-transformers all-mpnet-base-v2. We'll see it hands-on later, so don't worry too much about it now; the documentation is there if you want to read it, and the world of embeddings is quite a deep one.
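As a quick preview of that model (we'll load and use it properly later on), here's roughly what loading all-mpnet-base-v2 via the sentence-transformers library looks like; the example sentences are just placeholders I made up:

```python
from sentence_transformers import SentenceTransformer

# all-mpnet-base-v2: the embedding model discussed above (~400MB, 384-token max sequence length)
embedding_model = SentenceTransformer("all-mpnet-base-v2", device="cuda")  # or device="cpu"
print(embedding_model.max_seq_length)  # 384 - longer inputs get truncated

sentences = ["The macronutrients are carbohydrates, proteins and fats.",
             "Embeddings are numerical representations of text."]
embeddings = embedding_model.encode(sentences)
print(embeddings.shape)  # (2, 768) - each sentence becomes a 768-dimensional vector
```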
Now, as for LLMs, they can't accept infinite tokens in their context window either, and that's another key term I said we'd come across. The LLM context window is the number of tokens an LLM can accept as input. For example, as of March 2024, GPT-4 has a default context window of 32k tokens, about 96 pages of text, but can go up to 128k if needed. Let's look up the GPT-4 token limit: 4,096? Is that the documentation? September 2023. Anyway, let's try Claude 3, another model that's come out recently: Claude 3 pricing, Claude API... yeah, here's another thing to keep in mind, a 200k token context window. That means it can take in 200k tokens in this little chat box in one hit. But if we're getting charged per token, like we are here at $3 per million tokens, why would we want to use all of them in one hit? Wouldn't it be a lot smarter, and a lot more cost effective, to use fewer? And that's just with an API; we're not even talking about running locally yet. If we want our models to run fast, we want to use fewer tokens, or rather, we want to get more bang for our buck with our tokens. We don't just want to pack in as many tokens as we can, we want to pack in good tokens. Remember the saying in machine learning: good data in, good data out. So, we've imported a PDF document and we've got information on each page. What's the next step in our pipeline? Let's go back to the flowchart: we've done "collection of PDF documents", so next is "pre-process text into smaller chunks, e.g. groups of 10 sentences". How about we take a small break and do that in the next section? If you want to jump ahead, I'm going to write a little heading here, "Further text processing (splitting pages into sentences)"; try breaking each page down into groups of about 10 sentences and we'll tackle it together in the next section of the video. And we're back. Let's do some further text processing, or pre-processing. We want to split our pages into sentences. We've got a list of dictionaries up here with all the different pages of text, but maybe we want to chunk them up, say into groups of about 10 sentences. Now, we've already done this in a sense, because each page has about 10 sentences on average, but what if we had a PDF with much denser text? Let's search "PDF page with dense text"... that's not even as dense as what I was thinking, maybe something more like this. We want to split it into smaller chunks, partly because it's just more understandable; I don't know about you, but I look at dense pages of text and sometimes I get scared, so I like to split them into sentences. There are two ways we can do this: one, split on the full stop and space, ". ", which is a really hacky way of doing it but kind of works; and two, use an NLP library. NLP stands for natural language processing, which is exactly what we're doing, processing natural language with code, and we can use a library such as spaCy, one of my favourites and the one we're actually going to use, or NLTK.
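Just to show what option one looks like before we reach for spaCy, here's a quick sketch of the hacky split. I'm assuming the list of page dictionaries from the PDF import step is called pages_and_texts and has a "text" field, and that index 600 is a page with a decent amount of text; adjust the names and index to whatever you used:

```python
# Option 1: naive sentence splitting on ". "
# Quick, but it trips over abbreviations ("e.g. ") and decimals,
# which is why we'll use spaCy's sentencizer for the real pipeline.
example_text = pages_and_texts[600]["text"]
naive_sentences = example_text.split(". ")

print(f"Naive sentence count: {len(naive_sentences)}")
print(naive_sentences[:3])
```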
I'll just show you here: search "spacy install" if you want to get started with spaCy. We should have it in our environment already thanks to the requirements install, but there it is, a beautiful open-source NLP library. And NLTK is another very helpful one if you'd like to delve into the world of natural language processing that isn't all about LLMs; you do still need to pre-process text and prepare it for LLMs. So let's use spaCy. We start by importing from spacy.lang.en, because what we want to do is split our text into sentences so we know how many sentences are on each page. We import English, create an instance of English, and then add a pipeline component; spaCy, like our RAG pipeline, also deals in pipelines. I'm just going to close some tabs, we've got a lot going on here. So let's add a sentencizer. Oh, and you'll be very happy to know that I turned off Grammarly. Yep, I know, big battle, but it's off, so you can now see my typos in their full glory. We're going to add a sentencizer pipeline component with spaCy. The spaCy sentencizer (I've said it five different ways already) basically turns text into sentences; if you want to read the documentation, I'll link that, never underestimate reading the docs. So: nlp.add_pipe("sentencizer"). Now we'll create a Doc instance (that's what spaCy calls it) as an example, just to show you how the sentencizer works: doc = nlp(...), and we pass in a string with "This is a sentence.", then "This is another sentence.", and if we're feeling really lucky, "I like elephants." Three sentences in there. Let's now assert (this will throw an error if it's not correct) that len(list(doc.sents)) == 3, sents as in sentences, there should be three, and then print out the split sentences with list(doc.sents). What did we get wrong here?
Ah, the import: it's from spacy.lang.en import English, and I had two typos in the import line, did you notice that? There we go, now we get a list of three sentences back. So you might be going, Daniel, why don't we just split on ". "? That's a very good question, and it would also work for this example, however spaCy's sentencizer splits sentences based on rules and other statistics rather than just full stop plus space. I'm not sure of the exact methodology, all I know is that in practice it works quite well and it's very robust; I think their tagline is something like robust NLP pipelines. You don't have to use it, but let's use it. So: for item in tqdm(pages_and_texts), because we want a progress bar whenever we can, let's create the sentences. Remember, pages_and_texts is our list of dictionaries; I'll put a cell here to show what the first example looks like. It's just the title of the book, so the text isn't very long; if we grab one at about index 600, there we go, we get some more text, beautiful. So item["sentences"] is going to be a list of nlp(item["text"]).sents, that is, pass the text field through the nlp pipeline (the sentencizer we set up above) and get the sentences out. Oh, and I forgot that's a string field, and that's too many t's, maybe I shouldn't have got rid of Grammarly. Now we want to make sure all sentences are strings, because the default type is a spaCy data type and we don't need that for now; we can just turn them all into strings, otherwise we might hit some errors later on. Always work in standard data types if you can, unless you need otherwise. So, for sentence in item["sentences"], convert it to a string (there's probably a more efficient way than two loops, if you know one please let me know). And then count the sentences: item["page_sentence_count_spacy"] is just the length of item["sentences"]. Alright, let's loop through and see what's going on. "sentences is not defined"... "items is not defined"... excuse me, typos galore. Beautiful, nice and quick, there we go. Okay, let's now inspect an example with our good friend random: random.sample(pages_and_texts, k=1). This one is from page 198, there's the raw text, and you'll notice we've split it into sentences, thank you spaCy. The sentence count with spaCy is 13 and the raw sentence count is 13, so they line up, and they'll probably line up in many other cases, but let's stick with the spaCy count. Then we can inspect it again as a DataFrame, since we've got some new fields: df = pd.DataFrame(pages_and_texts), then df.describe().round(2). There we go. It looks like our raw sentence count came out quite similar to spaCy's; spaCy is slightly lower, but we'll stick with spaCy because it's probably a bit more robust than splitting on ". ". So that's our sentence count, and now we've got our text split into sentences.
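Here's the sentencizer code from the last couple of steps in one place, a light cleanup of what we just typed (again assuming pages_and_texts is our list of page dictionaries with a "text" field):

```python
from spacy.lang.en import English
from tqdm.auto import tqdm

nlp = English()
nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection

# Sanity check on a toy document: three sentences in, three sentences out
doc = nlp("This is a sentence. This is another sentence. I like elephants.")
assert len(list(doc.sents)) == 3
print([str(sent) for sent in doc.sents])

# Split each page's text into sentences and count them
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    # Convert spaCy Span objects to plain strings so later steps stay simple
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    item["page_sentence_count_spacy"] = len(item["sentences"])
```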
This page has 13 sentences, which would still fit in our embedding model, but you know what we're going to do: we're going to group the sentences into groups of 10. Now, that number is arbitrary; you can find out through experimentation what the best way is to group your text together. In fact, this concept of chunking, or text splitting, getting our text into smaller groups, is still an active area of research in terms of how best to do it. My advice: experiment, experiment, experiment. The concept of splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking, and there's no 100% correct way to do it. We'll keep it simple and split into groups of 10 sentences, however you could also try 5, 7, 8, whatever you like, and there could even be overlaps: we might put sentences 0 to 9 in one group and then sentences 8 to 18 in the next group, if you know what I mean, so there's a bit of overlap between groups. There are libraries that can help do this; if we search "LangChain text splitters", there we go, that's for LangChain, and there's a whole bunch of different options there. But we're going to do this with pure Python, because if we can do it with pure Python, we can do it with the frameworks later on. So let's write some code to split our sentences into groups of 10 or fewer. For this example page we'd get the first 10 sentences as one group and then another group of 3, so two groups of 10 and 3, and that will happen for every example we have. First, a reminder of why we do this: one, so our texts are easier to filter, since smaller groups of text are easier to inspect than large passages; two, so our text chunks fit into our embedding model's context window, e.g. the 384-token limit we saw before; and three, so the context we pass to an LLM can be more specific and focused. If we have a limited number of tokens we can pass to an embedding model, we want to make sure our chunks fit within that limit; our embedding model's limit is 384 tokens, other embedding models may have higher limits, and that's something to keep in mind when choosing one. Our LLM's context window may be quite large, but we don't necessarily want to use all of it, because that costs us processing time and, if we use an API, money. So let's define our split size for turning groups of sentences into chunks: num_sentence_chunk_size = 10. And let's create a function to split a list of text recursively into that chunk size; for example, a list of 20 sentences would go to two lists of 10, or if we had 25 sentences, it would go 10, 10, 5, just keep taking 10 until you run out. We'll create a function, def split_list, which takes an input_list of strings and a slice_size as an int, and returns a list of lists of strings. Now, we can do this in one line of Python, I believe; this is the power of list comprehension.
In the list comprehension we index our list from i to i plus slice_size, for i in range starting at zero, going up to the length of the input list (so if our input list had 25 items, that's 0 to 25), with slice_size as the step. That should work, and we can find out whether it does by running it over a test list. Let's create a list of 25 items, test_list = list(range(25)), and call split_list on test_list with slice_size set to num_sentence_chunk_size. "split_size is not defined", oh, slice_size, excuse me. There we go. We get one list of 0 through 9, one of 10 through 19... at first I thought we needed a plus one in the range and that we were getting an overlap, but it turns out my brain was mud: we don't need the plus one, it works exactly as written, and I was just counting wrong because Python is zero-indexed. I've been tripped up by an off-by-one error; isn't it funny, it's always the simplest errors. So now that we know our function works, let's run it over our sentences and split them into chunks. For item in tqdm(pages_and_texts), I love the progress bars, we add a new key to each item, "sentence_chunks", equal to our split_list function with input_list=item["sentences"] (comma, not full stop) and slice_size=num_sentence_chunk_size; again, you can customise that to whatever you'd like, we're using 10. Then we count the number of chunks for each page, item["num_chunks"] = len(item["sentence_chunks"]); the chunks will be groups of 10 or fewer sentences. Always getting stats. That was pretty quick, that's the power of pure Python. Now let's randomly sample from pages_and_texts with k=1. There we go, we're getting some rich data here: it's all the same data, just slightly manipulated, and this one has sentence_chunks with num_chunks equal to 1, so just one chunk. Let's keep sampling until we get one with more than one chunk (remember, on average we have about 10 sentences per page). Two chunks, there we go. But the second chunk doesn't have that much in it; it must be a reference group. This is where it becomes important, as you'll see later, that we can filter out samples like this that don't contain much information. The first chunk has a lot of helpful text ("critical that parents and caregivers direct children towards healthy choices", that's important), whereas the second one we could basically filter out. That means we've improved our data pipeline and improved a potential input to our model; we're using better quality data. It's not always about more data, often it's a balance between the two, quality and quantity. Let's get some stats while we're here: df = pd.DataFrame(pages_and_texts) and df.describe().round(2). Beautiful, so we have about 1.5 chunks per page on average.
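Here's that chunking step in one place, roughly as we wrote it (the type hints are optional, and num_sentence_chunk_size is the value we picked above):

```python
from tqdm.auto import tqdm

num_sentence_chunk_size = 10

def split_list(input_list: list[str], slice_size: int) -> list[list[str]]:
    """Split input_list into sub-lists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Quick test: 25 items -> groups of 10, 10 and 5
test_list = list(range(25))
print(split_list(test_list, slice_size=num_sentence_chunk_size))

# Apply it to every page's sentences
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])
```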
Keep in mind that 1.5 skews upwards: we have about 10 sentences per page, and whatever spills over, even if it's only one sentence, still counts as a whole extra chunk, so roughly 1.5 chunks per page was about what we'd expect. Let's keep going. What we want to do next is split each chunk into its own item. The amount of information per sample is getting pretty big, and if a sample has two chunks, we want each of those chunks to be its own item in the list of dictionaries; that way we can attach metadata to it and so on, and it's a lot easier to deal with each chunk as its own individual sample. So in our case, where a page has a sentence_chunks list with two chunks, we'd want one to be one item in our list of dictionaries and the other to be another item; effectively they become two samples in our dataset. Let's do that. I'll create a new heading, "Splitting each chunk into its own item", and note that what we're going to do later on is embed each chunk, so let's write that down: we'd like to embed each chunk of sentences into its own numerical representation. That will give us a good level of granularity, meaning we can dive specifically into the exact text sample that was used by our model; that's what RAG is all about, generation with references. So I'm going to import re, which is for regex (and I might need to go for a walk soon), and we're going to split each chunk into its own item. We create an empty list called pages_and_chunks, and then for item in tqdm(pages_and_texts): I'm going to take pages_and_texts and break it up, putting each individual chunk into this new list as its own item. So, for sentence_chunk in item["sentence_chunks"], a loop inside a loop: for each item, loop through its sentence chunks, and for each one create a chunk_dict, an empty dictionary, and enrich it with some information. I want to include the page number, so I know which page each chunk of text came from (by the way, when I say chunk, I mean a chunk of text, a group of 10 or fewer sentences). Now, here's a small but important point: our sentence chunks are lists of sentences, they're not actually a paragraph, and we want to turn them back into a paragraph. We're breaking the pages down into lists of sentences and then joining groups of them back into paragraphs; that might sound like a bit of work, but we're doing it programmatically, so it does it for us. Join the list of sentences into one paragraph-like structure: joined_sentence_chunk = "".join(sentence_chunk), you can join a list like that, passing in the sentence chunk, which should be a list of strings, and then I'm going to replace the double spaces with single spaces. Again, I've done this ahead of time, so I kind of know the pre-processing this text needs, but this is an experimental thing you'll need to work out depending on the data you're working with.
Then there's one more bit of clean-up to perform; well, we don't strictly need to, but we'll do it right after this, and I'll show you where the regex comes in. So in chunk_dict I put the sentence chunk, chunk_dict["sentence_chunk"] = joined_sentence_chunk, and then let's get some stats on our chunks (that just sounds funnier to say): the chunk character count. Now, it might seem like we're writing a lot of code, and that's on purpose; I wanted to be as verbose as possible in this tutorial, because we can functionalise a lot of it later on. Basically everything from importing the PDF to pre-processing the text could be a few functions joined together, but we want to do it from scratch, and we're doing it from scratch. So the character count is len(joined_sentence_chunk); we'll get the word count as well with joined_sentence_chunk.split(" "), just a crude word count splitting on spaces; and we'll get a rough token count, which is just the character count divided by four, len(joined_sentence_chunk) / 4, with a note that 1 token is roughly 4 characters. And then the last step: append our dictionary, pages_and_chunks.append(chunk_dict). Beautiful, now let's see how many chunks we have. Run that, "joined_sentence_chunk is not defined", oh, did you catch that? A little typo there. Wow, the power of Python, nice and quick: we have about 1,800 chunks of sentences. Now let me show you (hopefully it comes up in a random sample) why we might need to pre-process our text a little more. Again, we don't necessarily have to do all of this pre-processing, but: good data in, good data out. When we join our sentences together... where's a full stop... not too many examples here, let's get another one... oh, there we go. See how, because we've joined the sentences, the capital letter at the start of a sentence has been pushed right up against the previous full stop, with no space in between? We don't strictly need to fix this, it should mostly work itself out later on, but we want to put in the best quality text we can, so let's fix it with a little bit of regex. We're going to use re.sub, regex substitution, with the pattern r'\.([A-Z])' and the replacement r'. \1', applied to joined_sentence_chunk. I'm not the best at explaining regex, so I've pre-coded this and we'll ask ChatGPT to explain it for us in a second, but roughly: if there's a full stop followed immediately by any capital letter, replace it with a full stop, a space, and then that capital letter. Does that make sense? We want ".A" to turn into ". A", and it works for any capital letter. So let's ask our friend ChatGPT, "can you explain this regex with an example in Python code?", and while that generates, let's re-process our chunks and pull out another sample with a bit more text. There we go: spaces after the full stops now, space, space, space, looking good. Nothing gets me more excited than well pre-processed data.
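Putting the whole chunk-splitting loop together, roughly as described above. I'm assuming the page number field created by the earlier import step is called "page_number"; use whatever key your import loop actually created:

```python
import re
from tqdm.auto import tqdm

pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]  # assumed key name from the import step

        # Join the list of sentences back into a paragraph-like string
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ")
        # ".A" -> ". A" (put the space back after a full stop before a capital letter)
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Rough stats: character count, word count and an approximate token count
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # ~1 token per 4 chars

        pages_and_chunks.append(chunk_dict)

print(len(pages_and_chunks))  # roughly 1,800 chunks for this textbook
```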
So we've got our spacing fixed. Now, what has ChatGPT got to say? "This is a sentence.Here is another one", there we go. So: re.sub(pattern, repl, string), explained. The pattern looks for a literal period; the backslash escapes the period, since it's a special character in regex that matches any single character. ([A-Z]) captures a group consisting of an uppercase letter from A to Z. The repl, r'. \1', says that a period followed by a space should be placed in the text, where \1 refers to the first capture group from the pattern, in this case the uppercase letter that followed the period. And there's a little example. Beautiful, I love that, thank you ChatGPT. Now let's get a few more stats about our pages and chunks: df = pd.DataFrame(pages_and_chunks), then df.describe().round(2). Okay, the chunk token count is about 180 on average, and the max is 457. So there's at least one chunk that will have some information cut off in our embedding model; recall that our embedding model has a max token size of 384, which means for that max sample, 457 minus 384, we're going to lose roughly 73 tokens of information. Ideally all of our chunks would be 384 tokens or below so they fit into our embedding model, and the vast majority definitely are, but a couple will get truncated. That's just something to keep in mind; maybe you do a bit more recursive shortening to make sure all of your text chunks fit, but I'll leave that for you to explore. We also have a lot of chunks that may be below some minimum threshold. Remember how we saw some that were only about a sentence long? Could we set a minimum threshold? Because is that really valid information? If we were asking questions of a nutrition textbook, an actual question like "foods high in fibre", would a chunk like that return anything useful? Not really, and it would take up processing power we don't need to spend. So how about we create a filter for chunks under 30 tokens and get rid of them. Again, this is a limit I've experimented with; in your case you may want to experiment with your own minimum threshold and see what works for your data. Let's show some random chunks under 30 tokens in length. I'll write a note here: filter the chunks of text for short chunks, these chunks may not contain much useful information. Set min_token_length = 30, and then for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows(): basically, go to our chunk DataFrame, find the rows with a token count less than or equal to min_token_length, grab five random ones, and iterate through them. Then print them out with a nice f-string: the chunk token count from row[1]["chunk_token_count"] and the text from row[1]["sentence_chunk"]. Is that going to work? We may need different quote marks inside the f-string.
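As a sketch, here's the short-chunk inspection we just described, plus the filtering step that comes right after it (min_token_length = 30 is the threshold chosen in the video; adjust it for your own data):

```python
import pandas as pd

df = pd.DataFrame(pages_and_chunks)
min_token_length = 30

# Peek at a few chunks under the threshold; they're mostly headings, links and references
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

# Keep only chunks over the threshold, as a list of dictionaries
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
print(len(pages_and_chunks_over_min_token_len))
```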
Beautiful. Okay, so these chunk token counts are all under or equal to 30, and look, they're mostly headings. There's maybe one useful sentence, maybe another, so this is again where you'd do some experimentation, but most of them are just links and references. If we wanted to, we could make our chunking overlap: instead of splitting 10, 10 and whatever's left like we did, we could go sentences 0 to 9, then start the next chunk at sentence 9, and so on, so the chunks overlap and we're sure we're covering all of our information. But for now let's leave that and just filter the DataFrame to drop rows with under 30 tokens (a sentence or less): pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records"). That gives us back a list of dictionaries, one of my favourite data types. There are the first two, and let's, as always, get a random sample; I'll copy that cell down here with k=1. There we go: "where is your money invested", "what college do you attend", an evidence-based approach to nutrition, beautiful. So all of these chunks are over 30 tokens in length; some may be short, some may be long, but that's our data pre-processed. What we're up to next is embedding our text chunks. Holy smokes, this is exciting. If we come over to our RAG workflow, this is where we've gotten to so far. I've done it this way on purpose: we could do all of this in a couple of lines of code with a framework, but I wanted you to see the steps that go into taking a random piece of data and pre-processing it into smaller document chunks, because this is what you'll often have to do; your data won't always be in a perfect format. So we've written Python code to do this step and we've inspected our data along the way. Our next step is to take these smaller document chunks and turn them into a numerical representation, using an embedding model from the sentence-transformers library on Hugging Face. I'll cover that in the next section of the video; for now I'm going to take a little pause, go for a walk, move my body around, and I'll be back soon to embed our text chunks into numerical space. And we're back, it's now time to embed our text chunks. What do I mean by this? Let me write this down: embeddings are a broad but powerful concept. While humans understand text, machines understand numbers, so what we'd like to do is turn our text chunks into numbers, specifically embeddings. My favourite definition of an embedding is a useful numerical representation, and the best part about embeddings is that they're a learned representation; that's the most important part (I'll put it in italics so it looks nice). Instead of just mapping words directly, like a dictionary where we might map the very popular word "the" to zero and another popular word to one, and so on, embeddings learn their representation in a much higher-dimensional space, and they can work
on a single word or on a whole sentence; we're going to see this live in action in a second. If you want quite possibly the best, or at least the most in-depth, resource on what embeddings are, I highly recommend the essay by Vicki Boykis, one of my favourite machine learning engineers (check out her blog by the way, it's really good); there's a PDF version you can read, and it's a great resource on embeddings, but we're just going to see them in action. I'll note: for a great resource on learning about embeddings, see here. Alright, now there's one more thing I have to give you, and that's an embedding model. I'm going to go to the whiteboard: we're going to use an embedding model from sentence-transformers, which is an open-source library, however there are many different embedding models available on the Hugging Face massive text embedding benchmark (MTEB) leaderboard, so let's check that out. Now, it can get quite addictive to continually revisit this page and try to use the best model; for example, a recent one was released just the other day, the mxbai embedding model from Mixedbread AI (what a beautiful name), which achieved state-of-the-art results, so that might be another model you want to try. This is a great resource for different types of embedding models. We can see the size of each model, which is something to keep in mind when you want to deploy your system; the embedding dimensions, so for this one, if we put in a string of text it comes back as a 1,024-dimensional vector (we'll see what that means soon); and max tokens, which is how many tokens the embedding model can accept, 512 for this one, while ours can do 384. We could just as easily use something from here, and I'll let you inspect it in your own time, but let's go to sentence-transformers, which lives at sbert.net. It's a great open-source library; you can use many of the leaderboard's embedding models through it, and we can pip install it (it should already be in our environment). So let's see what happens when we create some embeddings. Back in the notebook: from sentence_transformers import SentenceTransformer, the class that's going to transform our sentences, and then embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2"). We're using this one because I've found it's a great all-round baseline that works on many different tasks, but we could just as well go to the pretrained models page and use any of the others; all-MiniLM-L6-v2 is a great option too. The one we're using, all-mpnet-base-v2, is listed as the best quality, while if you want something faster, all-MiniLM-L6-v2 is about five times faster and still good quality. And again, new embedding models come out all the time; this is March 2024, and that Mixedbread model came out two days ago as of recording, so trying it out can be an extension exercise for you. So that's our embedding model, look how simple it is to create, and we can also tell it which device we want it to live on: if you're familiar with PyTorch, you can use "cpu" or "cuda", and I'll let you guess which device will be faster for creating embeddings. Let's create a list of sentences to embed; this is just
going to be a little demo before we embed our actual data. So I'll write: "The Sentence Transformers library provides an easy way to create embeddings." Lovely. Then: "Sentences can be embedded one by one or in a list, so multiple at a time." And then: "I like horses!" Alright, now if you just look at these sentences, which ones would you expect to be closest? If you compared the first sentence to the second and scored them from 0 to 1, I'd say maybe 0.85, and if you compared either of those to "I like horses", they're not that similar at all; they have different meanings. That's what we're trying to do with embeddings: capture the meaning in a numerical representation. Sentences are encoded, or embedded, by calling model.encode(), so let's do that: embeddings = embedding_model.encode(sentences). Then we'll create an embeddings dictionary by zipping together our sentences and embeddings, and let's have a look; this is so exciting, embeddings are one of my favourite and most powerful machine learning techniques I've ever come across. We loop through embeddings_dict.items(), print the sentence, print the embedding it gets converted into, and print a little newline. It's going to download the model if we don't already have it, load it onto our CPU, and embed the list of sentences; the first run may take a little longer because the model has to be instantiated, but subsequent runs will be quicker. And there we go, wow. Do you understand what these numbers are? Because I definitely don't. What we've done is take each sentence and have the model (we'll look at what it's been trained on in a second) convert it into a high-dimensional embedding space: look at all these numbers. I told you we wouldn't understand them any more, unless you want to go through them one by one and see which one computes to which. Let's look at the first embedding's shape: embeddings[0].shape gives 768, which means we're now representing each of our sentences with 768 numbers. That's a lot more information. This number plays the same role as the embedding dimension we saw on the leaderboard; that model went up to 1,024, and again this is something you'll have to think about and experiment with going forward. More dimensions is generally better, but it takes more time to compute and more storage, so keep that in mind; 768 is a good place to start. Now, I said we'd look at what this model was trained on, and we can do that by googling the model's name on Hugging Face; there we go, I've already googled it, this is like a good machine learning cooking show. Here's the model card, and we can read about it: the project aimed to train a sentence embedding model on a very large sentence-level dataset, and it's been fine-tuned on over 1 billion sentence pairs, meaning it has seen a billion examples of sentence pairs and whether or not each pair is related.
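Here's that little demo as code, with a couple of extra lines at the end using sentence_transformers' util.cos_sim to score the sentences against each other; the cosine-similarity part is my addition for illustration, not something we typed in the video:

```python
from sentence_transformers import SentenceTransformer, util

embedding_model = SentenceTransformer("all-mpnet-base-v2", device="cpu")  # try device="cuda" if you have a GPU

sentences = [
    "The Sentence Transformers library provides an easy way to create embeddings.",
    "Sentences can be embedded one by one or in a list, so multiple at a time.",
    "I like horses!",
]

embeddings = embedding_model.encode(sentences)
for sentence, embedding in zip(sentences, embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding shape: {embedding.shape}")  # (768,) for all-mpnet-base-v2
    print("")

# Score how similar the sentences are to each other (higher = more similar in meaning)
similarity_matrix = util.cos_sim(embeddings, embeddings)
print(similarity_matrix)
```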
And that's why we can come up to the widget at the top of the model card and test it out. Let's try a source sentence, "my favourite animal is the cow", and compare it to "I love machine learning", "I'd like to raise some farm animals one day", and "my favourite colour of tractor is yellow", then compute the similarity between them. You can imagine the source sentence as the query in our pipeline: we come back up to our flowchart, we have a query and we want to compare it to our embedded chunks and return whichever is most similar; we'll do that later on. So: "my favourite animal is the cow" compared to "I love machine learning" gets about 0.266. "I'd like to raise some farm animals one day" gets the highest rating, and "my favourite colour of tractor is yellow" is actually pretty close as well, compared to "I love machine learning". And how about one that's not related at all: "my Telstra phone bill is too high". What happens? There we go, it's barely related at all, the score is very small. This is the whole idea of embeddings: we capture meaning, we're not just matching on words. Notice there are no overlapping words between the query and the best match; isn't that the beautiful thing? It means we can enter a natural language query and find relevant passages in our textbook that don't necessarily contain the exact words but match by meaning. That's the power of what's called semantic search, or vector search, and that's what we're going to do. So how about we embed one sentence at a time? How do you think we'd do that? Just as before: embedding = embedding_model.encode("my favourite animal is the cow"), and check out the embedding. Look how quick that was, very quick even though it's running on the CPU. Now how about we embed all of our sentence chunks? We've got our sentence_chunk field, and this is why we broke the text down into chunks: we want to embed them all. Let's time it, because this is the power of running locally, and specifically on a GPU; that's what we want, to accelerate our stuff (I did ask which device you think it'll run faster on, and I kind of gave it away). So, with the model on the CPU, let's embed each chunk one by one: for item in tqdm(pages_and_chunks_over_min_token_len), and we don't strictly need to use the filtered list, but we will, to keep the dataset high quality, item["embedding"] = embedding_model.encode(item["sentence_chunk"]). Let's run it. Okay, it's going at about 10 to 12 chunks per second, so it would take a few minutes. I could go and make a cup of tea and come back, but no, we're not going to do that, are we? We're engineers, we like to run things fast. So let's stop that, comment out the timing line, and try again: we were getting about 10 to 12 per second on CPU, so let's send the embedding model to "cuda", because I've got a GPU and we want to use the compute power we have, and run the same loop. I'm just going to copy the whole cell.
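The one-by-one embedding loop looks like this; the only change between the slow and fast versions is which device the model lives on (wrap the cell with the %%time magic in Jupyter if you want the timing):

```python
from tqdm.auto import tqdm

# Send the model to the GPU; use "cpu" instead if you don't have a CUDA device
embedding_model.to("cuda")

# Embed each chunk one by one and store the result back on the item
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])
```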
It's the same code, just with the model on a different device now. How much faster do you think this is going to be? I'll tell you that it is faster, but let's guess: ready, 3, 2, 1. It takes a little while for the model to move across to the device, but once it's there we can start using it, and, oh my goodness. Remember, I'm recording at the moment as well, so we're not getting all of the speed; you may see this run even faster on your own machine. Look at that, we don't even have to move, we can just let it run. But what if I told you there's a way to make it even faster? The GPU run took about 32 seconds (if I wasn't recording, that would probably be roughly halved). A really powerful thing about GPUs is that not only are they fast at crunching numbers and calculating embeddings, they can also be used in batch mode. That was one by one, but what if we batched it? We only have a small dataset here, about 1,700 examples, but what if you had a million? Which is not outlandish: if you wanted to encode, say, all of English Wikipedia, you might have well over a million samples. So let's batch it up. We'll make a list of text chunks so they're all in one big list, pulled from pages_and_chunks_over_min_token_len, and check an element, say number 419, beautiful. We have about 1,700 text chunks, the same number as our filtered list. And remember, when we first created embeddings we passed in a list; we can do the same thing with our text chunks. So, embed all the texts in batches: text_chunk_embeddings = embedding_model.encode(text_chunks, batch_size=32, ...), where batch_size means how many chunks the model looks at at one time. In our case 32 is a pretty good default (I think it actually is the default for encode), but you can experiment to find which batch size works best. Batch size won't influence the embeddings that come out, it only influences how fast the embedding runs. If you have a GPU with lots of memory, in terms of VRAM, you could probably increase it, and if you have a smaller GPU you might reduce it; my Nvidia RTX 4090 has 24 GB of VRAM, so if yours had, say, 12, you might drop the batch size to 16 or so. We also want to return the embeddings as torch tensors so we can use them later on, and then let's look at text_chunk_embeddings. So what speed-up did we get? The GPU loop was running at about 70 iterations per second versus about 10 on the CPU, so roughly a 7x speed-up from CPU to GPU. And what do we get by embedding in batches? Oh my goodness, about 3 seconds to embed roughly 1,700 samples, and that's while I'm recording, which eats a little compute too. So we went from around 30 seconds to around 3, roughly a 10x improvement; stack the 7x improvement from CPU to GPU with the 10x improvement from one-at-a-time to batches, and we've gone about 70 times faster than where we started. That is incredible.
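Here's the batched version in one place (convert_to_tensor=True is the encode flag that returns torch tensors; batch_size is the knob to tune for your GPU's memory):

```python
# Put every chunk's text into one big list
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]

# Embed all chunks in batches on the GPU and get the result back as a torch tensor
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32,          # tune for your VRAM
                                               convert_to_tensor=True)
print(text_chunk_embeddings.shape)  # (number_of_chunks, 768)
```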
So, now that we've created those embeddings, we're working through our pipeline: let's save them to file so we can import them later and don't have to create them again. Once we've imported our documents, chunked them and embedded them, we can save the result, load it later on, and just use it from there. I'm going to close these tabs, and let's save our embeddings to file. How can we do this? Let's create a DataFrame: text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len) (I could probably pick a shorter variable name). Every item in our list now has an embedding array, which is beautiful, and we can turn the whole thing into a DataFrame and save it to CSV. We could pickle it or use another format, but a DataFrame is easy to inspect. Then embeddings_df_save_path = "text_chunks_and_embeddings_df.csv", and text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False), since we don't need the index. Beautiful, once we run this it goes to file and we can use it later on. It's about 20 megabytes, so smaller than the PDF, and remember it also contains multiple copies of the text and other fields we don't strictly need to store, so we could compress it quite a bit if we wanted to; still, around 21 megabytes isn't too bad. Let's make sure we can import it and check what it looks like: text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path), then .head(). Wonderful, there are our embeddings. By the way, if you hear "vector", "vector representation" or "feature vector", those are other names for an embedding: one long sequence of numbers. Now, storing embeddings in a CSV isn't necessarily the most efficient approach; if you have a very large database of embeddings, you may want to look into a vector database.
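The save-and-reload round trip as a sketch (the filename is just the one chosen in the video):

```python
import pandas as pd

# Save: one row per chunk, with page number, text, stats and the embedding
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

# Reload and inspect (note: the embedding column comes back as strings; we fix that later)
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
print(text_chunks_and_embedding_df_load.head())
```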
For our use case we can just keep the embeddings in a torch.tensor, because our dataset is quite small, and as we'll see in a second, that's very fast even for 100,000-plus embeddings. But if you have, say, a million plus, you probably want to look into a vector database. A vector database is basically a database specialised for storing embeddings and for running approximate nearest neighbour search over them. We're going to do semantic, or vector, search in a second, but if you had a million-plus examples, or even 100 million, which is not actually that outlandish, searching over all of them exhaustively can get quite computationally expensive. What approximate nearest neighbour search does is, instead of searching over all 100 million, it says: I'll search over this one million, because I know they're going to be close to your query, so you're searching over something like 100 times less than the whole database. That's an extension you might want to look into; there are a fair few vector databases on the market now. I'll put a note in: if your embedding database is really large, e.g. over 100k to 1M samples, you might want to look into a vector database for storage, but as we'll see, we can get away quite well with our smaller dataset without one. If you want a little extension, there are some questions and answers in the notebook version of this (not the video version): go to the embedding section, scroll down past all the numbers, "saving embeddings to file", and there's a little Q&A. Which embedding model should I use? As we discussed, there are many different ones and they get updated all the time; we started with all-mpnet-base-v2, but Mixedbread AI (really cool company name) just released their embedding model and it can be used with sentence-transformers, so that's an extension you can try. Technically you generally want to use the best one available, but I like to start with a benchmark model that I've used plenty of times before, because I know it works well, and then if I get an upgrade later, hey, that's really good. What other forms of text chunking or splitting are there? There's a guide linked for that; we've just used sentences, keeping it simple. What should I think about when creating my embeddings? Well: how many tokens does the model take in, and what vector size does it output? Think about the size of the input: if you have longer sequences, you probably need to either chunk your sequences more carefully or choose a model with a larger input capacity. And think about how you're storing your embedding vectors: a larger embedding vector, ours is 768 dimensions, gives the computer more numbers, and so more chances, to represent your data. If we doubled ours to 1,536, we'd have twice as much room to represent the data, however that doesn't necessarily mean it will always be better, so it's just something to keep in mind. Size of model: our model, all-mpnet-base-v2, is about 420 megabytes; a larger model can generally produce better embeddings, but it needs more compute power and time to run. And then, of course, open or closed: we're using an open-source model because we want to run our system locally. Open models let you run them on your own hardware, whereas closed models can be easier to set up but require an API call to get
your embeddings, so you have to send your data externally and then get it back, which takes some time. Where should I store my embeddings? If you've got a relatively small dataset, say up to around 100,000 examples, you can use a NumPy array or a torch.tensor; that's what we're going to do, and we'll see later on when we do search and answer that it's very quick for our smaller dataset, even for 100,000 embeddings. But if you have over 100,000, or up to a million (there's no hard limit here, it will just take some experimentation), you might want to look into a vector database; I've put a link in the notebook for that. So, we're up to the next section. We have officially finished this part of the tutorial, document pre-processing and embedding creation. We've taken our time and coded it from scratch, and we could really functionalise it now to do it in one hit: import a document, split it into smaller chunks, embed those chunks, save them to file. We've done all of that. The next thing to do is RAG search and answer: retrieval, then augmenting our question with relevant passages, then generating a response. Let's do that in the next section; I'll meet you back here for RAG search and answer, see you soon. Welcome back. We've done the really fun step of turning all of our text into embeddings, so now let's use them. What's our goal here? The RAG goal is to retrieve relevant passages based on a query. What we're going to do is embed our query with the same model that we embedded our passages with, and then perform a comparison, just like in the all-mpnet-base-v2 widget: if our query is "I love dogs" and our candidates are "I love cats" and "my favourite car is a Tesla", we want to find which of the embedded passages is most similar to our query and retrieve it. In this case, which one will be most similar? The model is loading, I should have pre-baked this, cooking-show style... there we go, "I love cats", that's the most similar one. We want to do the same for our nutrition textbook, or, if we were Telstra, we could search "my bill is too much" and retrieve relevant passages from the documentation. That's the whole goal: retrieve relevant passages based on a query, and use those passages to augment an input to an LLM so it can generate an output based on those relevant passages. Okay, so to do this we want similarity search, and that's something you'll get very familiar with as you get into the world of embeddings. Embeddings as a concept aren't only for text, by the way: they can be used for almost any type of data, you can turn images into embeddings, sound into embeddings, text into embeddings, and so on. The takeaway is that we want a numerical representation of our data, because computers work best with numbers. When we compare embeddings, that's known as similarity search, or vector search (because all of our embeddings are vectors), or semantic search, which is what we're mostly going to call it. In our case, we want to query our nutrition textbook passages based on semantics, or what I like to call
vibe. So if I search for "macronutrient functions", I should get back passages relevant to that idea, whereas with keyword search, if I search "apple", I only get back passages that literally contain "apple". With semantic search, a passage can be relevant to "macronutrient functions" without containing those exact words. Let's test that: in the nutrition textbook, Ctrl-F for "macronutrient functions" and we get nothing. What else could we try? "Infant breastfeeding", an important topic in nutrition, and we get nothing back either (I'm not used to Windows; I think that sound was it telling me there were no results, or notifications I haven't turned off, excuse me). So let's see how to do this with embeddings. First, to do similarity or semantic search, let's read our file back in. We already have the embeddings as a variable, but if we wanted to start the notebook from here, we could: import random, because we want to randomly look at examples; import torch, because we're going to use some PyTorch methods later; import numpy as np; and import pandas as pd. Then set the device: device = "cuda" if torch.cuda.is_available() else "cpu". Now import the text and embedding DataFrame: text_chunks_and_embedding_df = pd.read_csv(...) with our saved CSV, and then convert it to a list of dicts: pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records"). Again, we don't strictly need this step because we already have the variable, but it means we can start the notebook from this point, since we're now working on the search-and-answer part of the workflow, after the embeddings have been created. Let's have a look... "no such file or directory", did we get the file name right? It's text_chunks_and_embeddings_df.csv, there we go. Now let's look at the embedding column, wonderful, and let's pull the embeddings out on their own: embeddings = text_chunks_and_embedding_df["embedding"].to_list(). Oh my goodness, and you know why this has happened: because we stored it as a CSV, the embedding column got converted into strings. So as we import it, we want to convert that column back into NumPy arrays. We'll do this together, because a lot of your time will be spent pre-processing data and getting it ready to be used in an embedding pipeline or a RAG pipeline. I'll write a note: it got converted to a string when it was saved to CSV. So we take text_chunks_and_embedding_df["embedding"] and apply a function to it; I think there's a method in NumPy for this, maybe a lambda x: np.fromstring(x). Will that work? Hmm, "fromstring"... "use frombuffer"... what did we get wrong here? "A bytes-like object is required"?
is required from string troubleshooting on the Fly here okay I think that looks okay let's have a look there there we go so we've got an array beautiful an array of arrays so maybe we go embeddings NP array can we just turn that into a single array oh let's let's look this up hey embeddings how do we turn this is a problem I have quite often how to turn list of arrays into single array into a single array there we go concatenate or stack beautiful okay let's try that out NP stack axis equals z all input arrays do we not have the same shape to list this is troubleshooting on the Fly this is fun okay we want NP stack are they not all the same shape hold on let me let's do a okay embeddings equals or we'll go here how to turn numpy turn string into array from string ah sep deprecated the default is deprecated that's what we need I think we just solved our problem so from string and the separator is going to be a space oh no let's just see what we get if we don't do this line ah what's the space two separations we're troubleshooting this on the Fly I told you we're going to take this slow oh no you know what I've already worked this out before so let's go into here have we got ah there we go we need to do something there let's just copy this line and see what I did before see troubleshooting on the fly right ah we need to strip that makes sense okay so excuse me I should have been more prepared in that case but we got to see how to create an numpy array from a string let's run that where did I never close that Lambda X do we need an extra bracket and now we're still getting something wrong here now brackets have missed from string comma there we go ladies and gentlemen we have now raise here and then we got that and then we got two list and then we got NP MP stack and we want to go axis equals z embeddings there we go okay now let's turn that into a torch tensor how about we do that we're going to copy this line isn't it fun see that's what real coding is actually like is you go to troubleshoot things you don't know what's going on you look up the documentation in fact if I didn't have that already prepared I probably would have spent quite a long time looking for a solution for that so let's go um convert our embeddings into a torch. tensor embeddings and then we want torch. 
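Putting the import-and-convert steps together, this is roughly what the cell ends up looking like. It's a sketch that assumes the text_chunks_and_embeddings_df.csv file we saved earlier, with the embedding column stored as a bracketed string of numbers.

```python
# Minimal sketch: read the saved CSV back in and rebuild the embeddings tensor.
import numpy as np
import pandas as pd
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# The embedding column became a string when saved to CSV, e.g. "[0.1 0.2 ...]",
# so strip the brackets and parse it back into a NumPy array.
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(
    lambda x: np.fromstring(x.strip("[]"), sep=" ")
)

# Stack the list of arrays into one (num_chunks, 768) array, then make it a
# float32 tensor on the GPU for fast similarity search.
embeddings = torch.tensor(
    np.stack(text_chunks_and_embedding_df["embedding"].tolist(), axis=0),
    dtype=torch.float32,
).to(device)

# Keep the metadata as a list of dicts so we can index back into it later.
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")
```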
Wonderful. Now if we check embeddings.shape we have a torch tensor: our 1,600-plus embeddings, each of size 768. So what would we like to do next? Well, let's create our sentence transformer model, our embedding model. We don't have to recreate it, but we will anyway, again so the notebook can be started from a few cells above. We'll grab the utils too (we'll see what they do in a second): from sentence_transformers import SentenceTransformer, util, and then embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2"). This is important: we want to embed our query with the same model that we embedded our passages of text with. We send it to the device, which is cuda... what have I got wrong here? Ah, it's sentence_transformers, plural. We create our model, beautiful.

So now that we have an embedding model ready, let's create a small semantic search pipeline. In essence, we want to search for a query, e.g. "macronutrient functions", and get back relevant passages from our textbook. We can do so with the following steps: one, define a query string; two, turn the query string into an embedding; three, perform a dot product or a cosine similarity function between the text embeddings and the query embedding (if you're not sure what those are, we'll go through them shortly, don't worry); and four, sort the results from step three in descending order. It's just like the search demo we did before: behind the scenes, the query gets turned into an embedding and then cosine similarity or the dot product is computed against the embeddings of the candidate sentences, and that's what the score is. I believe the output of our sentence transformer is normalized, so we can use the dot product instead of cosine similarity, but if you're not sure about those two, I'll go through a small example in a second. Let's see what the workflow looks like; this might be your first time doing semantic search, and I really love it.

Define the query: query = "macronutrients functions", just like before. We could search almost anything at all, but since we're focusing on our nutrition textbook we'll keep it aligned with that. Similarly, if Telstra embedded all of their documentation, you could add a function that makes sure the query actually relates to the documentation before the search starts, so rather than searching for, I don't know, the price of apples on Telstra's documentation, you're actually searching their documentation. Then embed the query: this is where we create query_embedding, and I want to stress this again. Note: it's important to embed your query with the same model you embedded your passages with. That might not be an obvious point: if our pipeline used one embedding model for our documents, we want to use the same embedding model for our query. So be aware that if a new model comes out on the leaderboard and you decide to switch to it, you'll have to re-embed all of your passages with that new model before you can start to query them, because a numerical representation from model to model
may be a little bit similar, but it will likely be quite different. So: embedding_model.encode, we encode our query and convert it to a tensor, a PyTorch tensor (even though strictly it's a single vector). Then number three: get the similarity scores with the dot product (we'd use cosine similarity if the outputs of the model weren't normalized; we'll see this in action before we discuss dot product and cosine similarity properly). I'm also going to time this for fun, because I want to show you how fast similarity search, or vector search, can be: from time import perf_counter as timer (I like to import it as timer), start_time = timer(), and then dot_scores, the dot product between our query embedding and our embeddings tensor, which is the tensor of all of our embeddings. Essentially we compare all the numbers in the query embedding against all the numbers in every embedding. For that we use util from sentence_transformers (I know it can get a bit confusing when we use lots of different functions, but util is just going to perform the dot product): util.dot_score with a=query_embedding and b=embeddings, the dot score between two tensors, and we take the zero index of the result. Then end_time = timer(), and we print a fun little info string: time taken to get scores on len(embeddings), about 1,700 embeddings, which will be end_time minus start_time in seconds to five decimal places. How long do you think it will take? The web demo was pretty quick, but that was only comparing against two sentences, on CPU, and took under a tenth of a second; how long will it take on our end, running on an NVIDIA GPU?

We're still going to do step number four, because dot_scores holds the dot product against our 1,600-1,700 embeddings, so we have 1,600-1,700 scores, and I only want the top k results. We'll keep it to the top five: torch.topk. If we look it up, it returns the k largest elements of a given input tensor along a given dimension, which in our case means "give us the five top scores". Let's see it in action: top_results_dot_product = torch.topk(dot_scores, k=5), our first vector search, similarity search, semantic search, three, two, one, go... it didn't work: expected all tensors to be on the same device. What did we get wrong? Our query embedding is not on the same device as our embeddings, so we'll send it to the device, to CUDA. Round two, three, two, one... ah, our embeddings aren't on the right device either. The embedding model is on CUDA, but the embeddings tensor isn't, so let's send that to the device too. Now we should be ready... no, we got something else wrong: expected float but got double. Okay, we're running into more errors, and these are errors you're going to hit all the time, so let's look at our query embedding: can we get the

dtype of this, please? torch.float32. Now let's check embeddings.dtype (this is a great lesson): it's float64. Why? Because we built our embeddings tensor via a NumPy array, and NumPy defaults to float64, as we can see down here. Could we just use torch.stack instead? No, that expects tensors, not NumPy arrays. So we convert our embeddings to float32, and I'll put a note here: to use the dot product for comparison, ensure the vectors are the same shape (e.g. 768) and the tensors/vectors are the same data type (e.g. both torch.float32). Shape mismatches and data type mismatches are two of the major errors you'll run into in machine learning in general. So in torch.tensor we pass dtype=torch.float32 and see if it works... boom, there we go. Our embeddings are now torch.float32, and look how fast that was on the GPU: the query is "macronutrients functions", and the time taken to get scores on 1,680 embeddings was about 0.00313 seconds. That is quick, and even though I'm recording (so it will be slightly slower), I've seen it faster than this in my own experience. What if we changed the query to something else, say "how long should you breastfeed an infant for"? Wow, okay, this is incredibly fast, and we're only using a torch tensor as our "database" here.

Let's change the query back to "macronutrient functions". torch.topk gives us our top results: one tensor holds the scores of each passage, and the other holds the indices into our embeddings array. If we check embeddings[42] we just get numbers, of course, because our query was turned into numbers and compared against the embeddings, which are also numbers. So we take those indices and ask: where can we index with them? Luckily we have our pages_and_chunks list, so let's look up number 42, the index with the highest score. What does the text say? "Nutrients that are needed in large amounts are called macronutrients. There are three classes of macronutrients..." Index 42 has the highest score. Isn't that cool? (42, I love that number.) Our search was "macronutrient functions" and we got back a passage directly related to our query that doesn't necessarily contain the exact words "macronutrient functions". That is so cool: semantic search. Let's try the other one, "breastfeeding timeline for infants", and look at the first result from topk: milk is the best source to fill nutritional requirements, an exclusively breastfed infant does not even need extra water, including in hot climates, and a newborn infant (birth to 28 days) requires 8 to 12 feedings. That's perfect, that's exactly the information we wanted. And not only that, we have the page number, so if we search for page 816 in the textbook... infancy. Oh my goodness, we land on the right page. That is the power of semantic search, I love that. Now we have the relevant resources, so if we wanted to study more we could read through the source pages ourselves. (I'll drop a consolidated snippet of this search flow just below.)
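To pull the pieces we just wrote together, here is roughly what the core search flow looks like in one place. This is a sketch that assumes the embeddings tensor from the cells above; the printout at the end just shows the raw scores and indices from torch.topk.

```python
# Rough consolidation of the query -> scores -> top-k flow above.
# Assumes `embeddings` is a float32 torch tensor on the GPU (one row per chunk).
from time import perf_counter as timer

import torch
from sentence_transformers import SentenceTransformer, util

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cuda")

query = "macronutrients functions"

# 1 & 2. Define the query and embed it with the *same* model used for the passages.
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# 3. Dot product between the query embedding and every passage embedding.
start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()
print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time - start_time:.5f} seconds.")

# 4. Keep the top 5 scores (torch.topk returns them sorted in descending order).
top_results_dot_product = torch.topk(dot_scores, k=5)
print(top_results_dot_product)
```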
We could type almost anything we want in here, but let's keep it to our original example, "macronutrient functions", and let's make the output a little prettier so we can automatically see what's going on. Before we do, though: that search was quick, but it's only over about 1,700 embeddings. What if we increased the size by 100x, to roughly 168,000 embeddings? Is it still going to be fast? You'll see a lot of information out there about which vector database to use, but if you have a GPU you can actually operate on a lot of data with just a torch tensor. So: larger_embeddings = torch.randn(...) to simulate having 100 times the amount of data we have, sent to the device. We'll print the shapes of embeddings and larger_embeddings, then perform the dot product across 168,000 embeddings and time it. It's the same query, but these are just random numbers, so it won't return anything meaningful; we only want to see how fast the operation is. Ready, three, two, one... oh my goodness: in only about three times as much time, we performed the same operation across 168,000 embeddings.

Now let's step it up a notch. Say that instead of one textbook we had a whole library of textbooks to search over, 1,000 times our data, so 1.6-1.7 million passages of text. How much text is that? If a chunk of 10 sentences averages, say, 150-180 words, that's roughly 250 million words. For comparison, the Bible is about 780,000 words, so we'd be searching over something like 300-400 Bibles' worth of text. Let's see how fast it goes with 1.7 million embeddings... it takes a while just to create that tensor and send it to the GPU (that was actually the longest part), but once it's created, we search over it in about a hundredth of a second using dot_score. Isn't that incredible? Roughly 300 million words, searched in a hundredth of a second. Now, if you had even more than that, say 100 times more, an incredibly huge database, that's where you'd probably want to look into a vector database. Let's decrease this back to 168,000, and let me write a note: searching over embeddings is very fast, so retrieval is not what's holding us back; the prohibitive time in this pipeline is probably the generation step, which we'll see later on. (The timing experiment is sketched just below.)
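Here is that timing experiment as a sketch. larger_embeddings is just random numbers on the GPU, so the scores are meaningless; we only care about how long the dot product takes. It reuses query_embedding from above.

```python
# Simulate a much larger embedding store to see how dot-product search scales.
from time import perf_counter as timer

import torch
from sentence_transformers import util

# ~100x our real data (roughly 168,000 "embeddings" of size 768), random values only.
larger_embeddings = torch.randn(100 * embeddings.shape[0], 768).to("cuda")
print(f"Embeddings shape: {larger_embeddings.shape}")

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=larger_embeddings)[0]
end_time = timer()
print(f"[INFO] Time taken to get scores on {len(larger_embeddings)} embeddings: {end_time - start_time:.5f} seconds.")
```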
But if you had, let's say, 10-million-plus embeddings, you'd likely want to create an index. An index is like the letters in a dictionary: if you wanted to look up the word "duck", you'd start at D, then find words close to "du", and so on; you're not going to scan every word. What we're doing here is exhaustive search (that's the term you'll probably hear): we compare our query embedding against every single embedding. That's very precise, because we check everything, but it's computationally wasteful, and if we had 10-million-plus embeddings and kept doing that, we'd be wasting a lot of compute. So, just like with the dictionary, an index helps narrow the search down. A popular indexing library for vector search is Faiss; there are a fair few libraries that do this, but I just want to show you probably one of the most popular. Faiss is Facebook AI Similarity Search: open source, very fast, "efficient similarity search and clustering of dense vectors". That's where you start to perform approximate nearest neighbour search, and I'll let you read more on that. Two things I'll leave as extra-curriculum: nearest neighbour search (one technique this library provides is approximate nearest neighbour search, ANN, which, if you've heard of KNN, k-nearest neighbours, is similar in spirit), and Faiss itself. With really large datasets you have to get a bit more sophisticated: imagine you're Spotify with millions or billions of songs and you want to find similar examples all the time; you'd use something like Faiss and approximate nearest neighbour search with the right settings. (There's a tiny Faiss sketch just below if you want a taste.)
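If you want that taste of Faiss, a minimal exact-search sketch looks something like this. It assumes you've installed faiss-cpu (or faiss-gpu) separately, since it isn't part of our requirements, and that our embeddings are normalized so the inner product behaves like cosine similarity.

```python
# Tiny Faiss sketch (extra-curriculum, not part of our pipeline).
# Requires: pip install faiss-cpu
import faiss
import numpy as np

# Faiss works on float32 NumPy arrays, so move the tensor off the GPU.
embeddings_np = embeddings.cpu().numpy().astype(np.float32)

# An exact (exhaustive) inner-product index, same maths as util.dot_score.
index = faiss.IndexFlatIP(embeddings_np.shape[1])
index.add(embeddings_np)

query_np = query_embedding.cpu().numpy().astype(np.float32).reshape(1, -1)
scores, indices = index.search(query_np, k=5)
print(scores, indices)

# For approximate nearest neighbour search on much larger datasets, look into
# index types such as faiss.IndexIVFFlat or faiss.IndexHNSWFlat.
```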
So we've seen it now with our example. Let's prettify this little search output, because we always want to see what things look like. I'll write here: let's make our vector search results pretty. I'm going to write a small helper function called print_wrapped, using Python's textwrap library, because when we print out long passages of text I don't want them on one single line; you'll see what I mean when I print with and without wrapping. It takes in some text and a wrap_length of 80 (sounds like we're making a rap song): wrapped_text = textwrap.fill(text, wrap_length), then print(wrapped_text). We'll see what it does in a minute. Now let's print our query. And I hope you don't mind me taking little asides here and there instead of just writing straight code, like I just did with Faiss and nearest neighbour search; leave a comment below if you're a fan of that style or if you'd rather I stick to the code, because I like to include little bits of information that will probably be helpful as you learn more.

So: loop through the zipped scores and indices from torch.topk. If we check the docs, torch.topk's output is a tuple of values and indices (the part about optional output buffers is a little confusing, but values and indices is what we get back); I like to call them score and idx. So for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]): print the score, maybe to four decimal places, and then print the text. I'll first print it without our helpful print_wrapped function, just to show you what I mean: pages_and_chunks[idx]["sentence_chunk"], the sentence chunk related to that score. This is why we turned the dataframe into a list of dictionaries: the indices returned by our embedding search line up with the indices of our pages and chunks, they all keep the same order, so we can index straight back into the metadata (if we didn't preserve that order, things would get messed up). And then, this is the power of RAG: we want the relevant resources. If I want to learn more, please show me some links; don't just give me the answer, I want to research it myself and find out what else is going on. So we also print pages_and_chunks[idx]["page_number"] (we need some inverted commas and a closing curly bracket in that f-string), and then a new line for fun, to make things look pretty.

Okay, are you ready? Let's get some relevant passages based on our query. We can set the query here and try a couple of different things: "macronutrient functions", run... "unterminated string literal" on the page number line, we didn't close our string, did you notice that? Fix it and boom, look how quick that is, I love that. For "macronutrient functions" this is our first result, but see how the text is all over the shop? Let's switch to print_wrapped... boom, now we get nicely wrapped passages, that's lovely. So for "macronutrient functions": nutrients that are needed in large amounts are called macronutrients, and there are three classes of macronutrients: carbohydrates, lipids and proteins. Beautiful, these can be metabolically processed into cellular energy. The next one: there is one other nutrient that we must have in large quantities, water; water does not contain carbon but is composed of two hydrogen and one oxygen, and more than 60% of your total body weight is water. We keep going: macronutrients, carbohydrates, protein and water; protein, amino acids, food sources of protein. Very good. So we get passages that are quite related to our query, and even better, we get the page number. Let's check it: the top result is page five of the textbook, which I believe is page 48 in the PDF due to our page numbering scheme... nope, one off, but there it is, macronutrients, and we can start reading from there. So we've essentially got an advanced, semantic search function. We could also team this up with keyword search: we could write a function that does not only semantic search but keyword search as well. (The consolidated pretty-printing helper is sketched just below.)
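Here is the pretty-printing version of the same loop, consolidated. It assumes the embedding_model, embeddings and pages_and_chunks objects from earlier, and that each chunk dictionary has the "sentence_chunk" and "page_number" keys we created when building the dataframe.

```python
# Pretty-print the top results for a query.
import textwrap

import torch
from sentence_transformers import util


def print_wrapped(text, wrap_length=80):
    # Wrap long passages so they don't print as one enormous line.
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)


query = "macronutrients functions"
query_embedding = embedding_model.encode(query, convert_to_tensor=True)
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
top_results_dot_product = torch.topk(dot_scores, k=5)

print(f"Query: '{query}'\n")
print("Results:")
for score, idx in zip(top_results_dot_product.values, top_results_dot_product.indices):
    print(f"Score: {score:.4f}")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    print(f"Page number: {pages_and_chunks[idx]['page_number']}\n")
```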
Now, there's another little tidbit I want to give you as a potential extension later on. We have relevant passages here, but there's a concept called reranking. We could take the top five here with their scores (or widen it to the top 25), and then use another model that has been specifically trained to rerank top results. So one model narrows things down from the 1,600-1,700 passages of text in our textbook, and then a second model takes our query and that smaller set and ranks them specifically for the query, because it has been trained for exactly that. The first-stage results will be pretty good to begin with, but they might be even better after reranking. So I'll write a note: we could potentially improve the order of these results with a reranking model, a model trained specifically to take search results (e.g. the top 25 semantic results, which is what ours are) and rank them in order from most likely to least likely for the query. It's just like the 10 results you get when you search on Google; that's what a ranker does. If you want to try it, the Mixedbread AI reranker has just been released as an open source model on Hugging Face: "boost your search with the crispy mixedbread rerank models". Their diagram shows the setup: a document store (we've built that), a search system producing first-stage search results (that's where we're at), and then a reranker that reorders those results. You can use it with sentence transformers and rerank with model.rank, so I'll link it here: see here for an open source reranking model, and there's a sketch of the idea just below.
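As a sketch of that reranking extension: recent versions of sentence-transformers expose a CrossEncoder.rank method, and the mixedbread rerank models can be loaded as cross-encoders. Treat the model name and the exact rank signature below as assumptions to check against the model card rather than gospel.

```python
# Optional extension (sketch only): rerank the first-stage semantic results
# with a cross-encoder reranking model. Verify the model ID and rank() usage
# against the mixedbread-ai model card before relying on this.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("mixedbread-ai/mxbai-rerank-base-v1")  # assumed model name

# Candidate passages = the top results from our first-stage semantic search.
candidate_passages = [pages_and_chunks[idx]["sentence_chunk"]
                      for idx in top_results_dot_product.indices]

# Rank the candidates specifically for our query and keep the best 5.
reranked = reranker.rank(query, candidate_passages, return_documents=True, top_k=5)
for result in reranked:
    print(f"{result['score']:.4f} | {result['text'][:100]}...")
```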
Okay, so we've got a way to semantically search our dataset. How beautiful is that? I'm just going to close a few things... so what should we do next, and where are we in our pipeline? You know what, I'm going to write some code so we can check our results, because evaluating your results is just as important as getting them. Say we wanted to build this into a more robust pipeline: what if we wanted to automatically surface the page of text related to our query, so instead of manually scrolling through the document looking for page five, we could just return it automatically? We're programmers, after all, and that's the power of RAG: retrieval means we have the source of where the information came from. So let's open the PDF document we already downloaded: doc = fitz.open(pdf_path), and then get our page with page = doc.load_page(5 + 41). Note that the page numbers of our PDF are offset by 41, so keep in mind that for your own documents the printed page numbers may not line up exactly with the PDF index. Then let's get an image of the page; fitz, the PyMuPDF library, has an option for that: page.get_pixmap(dpi=300), where the DPI controls how dense the pixels are (the higher the value, the better quality the image). Optionally, we could save the image with img.save("output_filename.png"), for example to collect all the resources relevant to a certain query and save them to file, but I'll comment that out. Then doc.close(), because we don't need the document any more, and we convert the pixmap (its own special image type) into a NumPy array so we can plot it in the notebook: img_array = np.frombuffer(img.samples_mv, dtype=np.uint8), because matplotlib likes uint8 values (we saw np.frombuffer a little earlier), reshaped to the image height, image width and number of channels. That gives us an array of pixel values, most of them white, because most of our PDF is white. If there's a better way to print a page of a PDF in a notebook, please let me know. Then with matplotlib: import matplotlib.pyplot as plt, plt.figure(figsize=(13, 10)), plt.imshow(img_array), and plt.title with an f-string containing the query, so we can check what we actually searched for, plus "most relevant page of our PDF document" (I got a bit trigger-happy on shift-enter before finishing that). We turn the axis off, because we like things to look pretty, and show it. There we go: the most relevant page for our query, "macronutrient functions".

Now how about another query: "good foods for protein", I'm trying to build some muscle. What do we get... three classes of macronutrients, the water passage, page five again? Oh, I see, we're still plotting the top results from before, so we have to rerun the search with the new query. Now, for "good foods for protein", the most relevant page is 411: dietary sources of protein, the protein food group consists of foods made from meat, seafood, poultry, eggs, soy, dry beans, peas and seeds. Wow, okay, beautiful: if you want more protein, there are some foods you could try. So what do we want to do next? We could go over similarity measures, but we have now done our step here: that's us, the cool person in the flowchart, we've asked a query, we've embedded that query, and we've used the dot product to search over our embeddings. (The page-rendering code is consolidated in the sketch below.)
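Putting that page-rendering check together in one place: this sketch assumes pdf_path and query are defined as above, and that the textbook's printed page numbers are offset by 41 from the PDF's zero-based page index.

```python
# Sketch: render the most relevant PDF page as an image in the notebook.
import fitz  # PyMuPDF
import matplotlib.pyplot as plt
import numpy as np

doc = fitz.open(pdf_path)
page = doc.load_page(5 + 41)  # textbook page 5, offset by 41 in the PDF

# Render the page to a pixmap; higher DPI = higher quality (and more pixels).
img = page.get_pixmap(dpi=300)
# img.save("output_filename.png")  # optional: save the page image to file
doc.close()

# Convert the pixmap into a (height, width, channels) uint8 array for matplotlib.
img_array = np.frombuffer(img.samples_mv, dtype=np.uint8).reshape((img.height, img.width, img.n))

plt.figure(figsize=(13, 10))
plt.imshow(img_array)
plt.title(f"Query: '{query}' | Most relevant page:")
plt.axis("off")
plt.show()
```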
Now I might just take a quick aside in the next section to show you the different forms of similarity search, because we've got the retrieval working. If you want to skip ahead to the notebook and see how we generate text, feel free, but I want to show you vector search built from scratch so you know what's going on when you're matching different vectors. After that, the next thing we'll do is take our query and our relevant passages and pass them to our LLM. We haven't got an LLM yet, but there are some resources in the notebook if you want to jump ahead and look at Hugging Face, Gemma, Mistral AI and Llama 2. So in the next section we'll go over similarity measures, the dot product and cosine similarity, and then we'll start to build the generation section of our RAG workflow. I'll see you soon.

Welcome back, let's talk similarity measures. Two of the most common similarity measures, specifically between vectors, are the dot product and cosine similarity. In essence, closer vectors will have higher scores and further-away vectors will have lower scores. Let's look at one of our embeddings: our embeddings have 768 dimensions, which is inconceivable to the human brain; we only operate in three, maybe four dimensions (or maybe more, if you're really smart). So what is a vector? It has magnitude and direction. Most illustrations are in 2D or 3D space: the arrow shows the direction the vector is going, and its length, say 8, is the magnitude. Nice and simple for a 2D vector, and while we're dealing with 768-dimensional vectors, the principle still stands: vectors have direction (which way is it going) and magnitude (how long is it). And we don't decide on these values; that's for our model to decide. They're learned values, and that's the power of machine learning: the whole idea is for the values to be learned without being explicitly programmed.

If you want a further breakdown, I've got a little comparison of similarity measures. The dot product is a measure of magnitude and direction between two vectors (we'll see this in code in a moment): vectors that are aligned in direction and magnitude have a higher positive value, and vectors that are opposite in direction and magnitude have a larger negative value. That may not make much sense as text on a page (if you're familiar with linear algebra you might already know this stuff), and there are many ways to compute it: torch.dot, np.dot and sentence_transformers' util.dot_score, which is the one we've been using; there are links in the notebook if you want to check them out. Cosine similarity is slightly different: the vectors get normalized by their magnitude, so they all end up the same "size", if you will (in our case our vectors already all have the same dimensionality, 768). Cosine similarity therefore focuses more on direction: vectors that are aligned in direction have a value close to 1, and vectors that are opposite in direction have a value close to -1. I know this can sound confusing, because we just turned text into vectors, and honestly it's still fascinating to me that this works. For text similarity you generally want cosine similarity, because what we're comparing is a semantic measurement, the direction the text is "going". I know that sounds strange, but imagine all the different topics on Wikipedia as slightly different directions on a plot: with a semantic measurement we want to figure out which direction, which topic, a passage of text trends towards, rather than its magnitude. In our case, our embedding model, all-mpnet-base-v2, outputs normalized embeddings; we know this from the Hugging Face model card, because the example code normalizes the embeddings (that code replicates the encode function), so the output is already normalized. So let's code up cosine similarity and the dot product from scratch and inspect the difference, because cosine similarity on values that have already been normalized is just the dot product.

If we look up cosine similarity, the formula is the dot product divided by the product of the L2 norms, also called the Euclidean norms: the square root of the sum of the squared values, which is what the double-bar notation means. If you're not familiar with the maths it can look intimidating, but it's much easier (for me at least) to understand as Python code than as mathematical symbols. So: import torch; define dot_product(vector1, vector2), which returns torch.dot(vector1, vector2). Once you know these similarity functions, your powers of dealing with embeddings are hugely increased, because embedding search isn't just for text, it's for anything you can represent as a vector, which in the modern world is many kinds of data. Then cosine_similarity(vector1, vector2): the dot product (we could use our function above, but I'll just use torch.dot) divided by the Euclidean/L2 norms. If you're not familiar with the dot product itself, I'll leave that as another extension to look up. So norm_vector1 = torch.sqrt(torch.sum(vector1 ** 2)), the square root of the sum of the squares of all the elements in vector 1 (exactly what the formula says for vector a), and the same for vector 2. I believe torch actually has a function that does this already, torch.linalg.norm (and there's a torch.linalg.vector_norm too), but we're doing it from scratch, so we return the dot product divided by (norm_vector1 * norm_vector2). Beautiful.

Now let's create some example vectors, nowhere near as large as our real ones, because we're just trying to illustrate a point. That's something I like to do with problems I don't fully understand: simplify them as much as I can and replicate them from scratch. vector1 = torch.tensor([1, 2, 3], dtype=torch.float32), vector2 is the exact same vector, vector3 is [4, 5, 6], continuing on, and vector4 is [-1, -2, -3], the negative version. Now let's calculate the dot product, comparing vector1 against each of the other three. What do you think the results will be? vector2 is identical, vector3 is roughly "bigger in the same direction", and vector4 is the negative version. We print the dot product between vector1 and vector2 using our dot_product function (we didn't really need to write a function, we could have just used torch.dot, but why not), and repeat for vector3 and vector4. Between 1 and 2, the exact same vector, we get 14, a large positive value; as we said, vectors aligned in direction and magnitude have a higher positive value. Between vector1 and vector3 we get an even higher positive value, because they point in the same direction and vector3 has a larger magnitude. Recall that with cosine similarity we normalize by the magnitude, so we'll see what happens there in a second. And between vector1 and vector4, the negative version, going in the opposite direction with the same magnitude, we get a large negative value. If we relate that back to text similarity, the dot product is kind of saying that vector3 is "more similar" to our original sample (which could be our query vector) than the identical vector is, just because it's bigger, and that's not really what we want when working with text. We

want [1, 2, 3] to be the highest value, because that's essentially the exact same text, but we're dealing with numbers, so this is where we bring in cosine similarity and normalize for the magnitude. We print the cosine similarity between vector1 and vector2, vector3 and vector4 (again, a very simple example, but that's the point, and it scales up quite well to our thousands of vectors)... and there we go, this is the beauty of simple examples, I love it when it works out. We get 1.0 for comparing vector1, our "query", with vector2, the exact same thing, which is exactly what we want. Vector1 and vector3 have the same direction but a different magnitude, so it's still a high value, but 1.0 is the highest here. (Note this is the same measure we've been using so far for our scores: 1 is the highest, meaning essentially identical, and the score can also go negative.) And vector1 versus vector4 gives -1, a vector in the opposite direction, so it's dissimilar. So that's the power of similarity measures: we've just replicated the dot product and cosine similarity, two of the most popular ones you'll see, and cosine similarity is generally favoured when doing semantic search on text. And just so you know, cosine similarity, as we've replicated here, is the dot product with the vectors normalized by the Euclidean/L2 norm: it removes magnitude from the equation and focuses solely on direction. I know that's a hard concept when you first start, but once you practise with it, you can really go: okay, this is semantic search, I've got this embedding, find me similar embeddings. And cosine similarity is what we'd use... actually, sorry, we can use the dot product, because the outputs of our model from embedding_model.encode are already normalized. (The full from-scratch comparison is consolidated in the sketch below.)
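Here is the whole from-scratch comparison in one runnable block, so you can play with the numbers yourself.

```python
# Dot product and cosine similarity from scratch, on tiny example vectors.
import torch


def dot_product(vector1, vector2):
    return torch.dot(vector1, vector2)


def cosine_similarity(vector1, vector2):
    dot_prod = torch.dot(vector1, vector2)
    # Euclidean/L2 norm: square root of the sum of the squared elements.
    norm_vector1 = torch.sqrt(torch.sum(vector1 ** 2))
    norm_vector2 = torch.sqrt(torch.sum(vector2 ** 2))
    return dot_prod / (norm_vector1 * norm_vector2)


vector1 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector2 = torch.tensor([1, 2, 3], dtype=torch.float32)   # identical to vector1
vector3 = torch.tensor([4, 5, 6], dtype=torch.float32)   # same direction, larger magnitude
vector4 = torch.tensor([-1, -2, -3], dtype=torch.float32)  # opposite direction

print("Dot product (1, 2):", dot_product(vector1, vector2))  # 14.0
print("Dot product (1, 3):", dot_product(vector1, vector3))  # 32.0
print("Dot product (1, 4):", dot_product(vector1, vector4))  # -14.0

print("Cosine similarity (1, 2):", cosine_similarity(vector1, vector2))  # 1.0
print("Cosine similarity (1, 3):", cosine_similarity(vector1, vector3))  # ~0.97
print("Cosine similarity (1, 4):", cosine_similarity(vector1, vector4))  # -1.0
```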
So just keep that in mind: if the outputs from your model aren't normalized by the L2 norm, you should use cosine similarity; if they are already normalized, you should use the dot product. You could still use cosine similarity and you'd get the same results, but you'd be doing the extra normalization computation, which slightly increases the time, so that's why we're sticking with the dot product. We've covered similarity measures fairly briefly there, so how about in the next section we functionize our semantic search pipeline? We've got a few steps above: we enter a query, go through our workflow, and find similar passages. I want to turn this into a function, because I don't want to write all that code every time; I want to input a query and get that output back. Let's do that in the next section. If you want to try it yourself first, please do, otherwise I'll see you there.

Alright, so we've got some code that can do semantic search, but let's functionize it to make it repeatable. We'll put all the steps from above into a function or two so we can repeat the workflow, because ideally, if we didn't have this notebook, we'd like a Python file of helper functions where we can input a PDF and it does all the text chunking, embedding creation and semantic search step by step. That's the ultimate extension to this project, and in fact I'm going to leave it as an extension for you at the end of this tutorial: turn it all into an app (and if you'd like to see a video on that, let me know). So: def retrieve_relevant_resources. In case the name doesn't make it obvious, we input a query, we embed that query, and we get relevant indices back from a list of embeddings. So we'll need embeddings, which will be a torch.Tensor. It could be a NumPy array, but we're going to use torch because we want to stay on the GPU: pure NumPy can't use a GPU, PyTorch can (there is CUDA-accelerated NumPy-style tooling out there, but that's another extension). Then model: SentenceTransformer = embedding_model, because we want to embed our query with the same model we embedded our passages with, and n_resources_to_return, how many passages we want back. These are all type hints, by the way: the query is a string, embeddings is a torch.Tensor, the model is a SentenceTransformer, and so on. Let's default to five resources; we could turn that into 10 and feed 10 paragraphs to the LLM, the number isn't set in stone, it's something you should feel free to experiment with. And print_time, because I love getting the search time back, like when you do a Google search and it tells you it took X seconds to return the results. The docstring: "Embeds a query with model and returns top k scores and indices from embeddings." Lovely. Then we embed the query: query_embedding = model.encode(query, convert_to_tensor=True); we've done all this before, so I won't spend too much time on it, we're just focusing on the functionizing. Then get the dot product scores on the embeddings (our embeddings are stored as a torch tensor; again, we could save them to file ahead of time and have a function that imports them): start the timer, dot_scores = util.dot_score(query_embedding, embeddings)[0], using dot scores rather than cosine similarity because our embeddings are already normalized, then end the timer. If print_time is set, we print the info string with how many embeddings we scored, len(embeddings), and end_time minus start_time to five decimal places in seconds (come on, you've got this)... we'll default print_time to True and can turn it off when we don't want it. Then scores, indices = torch.topk(input=dot_scores, k=n_resources_to_return), and of course return scores, indices.

Excellent, now we've got a function to get relevant resources from our embeddings given a query. Let's try it out: retrieve_relevant_resources with query "foods high in fibre"... is everything else pre-filled? Ah no, we still need to pass the embeddings in; that's something we could probably optimize. There we go, boom: we've now got a function that very quickly gets the scores across our 1,600-1,700 embeddings. How cool is that? This will be helpful for our pipeline later on, but it's not that helpful right now if we just want to quickly glance at some passages and read them ourselves. So let's make it helpful for us by creating another function, def print_top_results_and_scores, which takes a query string. This one is just for our own inspection; we could probably build it into the function above, but I'm going to make it a new function, why not. (A consolidated sketch of the retrieval function is just below.)
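For reference, here is the retrieval function we just wrote, consolidated into one block. It assumes embedding_model and embeddings exist from the cells above.

```python
# Consolidated sketch of the retrieval function described above.
from time import perf_counter as timer

import torch
from sentence_transformers import SentenceTransformer, util


def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer = embedding_model,
                                n_resources_to_return: int = 5,
                                print_time: bool = True):
    """Embeds a query with model and returns top k scores and indices from embeddings."""
    # Embed the query with the *same* model used to embed the passages.
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Dot product works here because our embeddings are already normalized.
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time - start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores, k=n_resources_to_return)
    return scores, indices


scores, indices = retrieve_relevant_resources(query="foods high in fibre", embeddings=embeddings)
```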
It also takes our embeddings as a torch.Tensor, and pages_and_chunks, because we're going to need our list of dictionaries for this one (that can default to pages_and_chunks, or wherever you've stored your text and metadata), plus n_resources_to_return again. We're doubling up on a few parameters here, which makes me start to think we could turn this into a class, or fold this functionality into the first function, but for now let's keep them separate. The docstring: "Finds relevant passages given a query and prints them out along with their scores." We might be able to cheat a little here: scores, indices = retrieve_relevant_resources with query=query, embeddings=embeddings (fixing a spelling mistake along the way) and n_resources_to_return=n_resources_to_return. Beautiful, from scratch. Then, instead of rewriting the print statement, we grab our helpful print loop from above; I know we like to write things from scratch, but this is basically the exact same code, so we just make sure the names line up: for score, idx in zip(scores, indices), and that should work, I hope. Let's set the same query, then call print_top_results_and_scores(query=query)... there we go, beautiful (did I not press run? there it is). The first function just gets the scores, and the second prints them out. So for our fibre query: "change it up a bit and experience the taste and satisfaction of other whole foods such as barley, quinoa and bulgur", "eat snacks high in fiber such as almonds, pistachios, raisins and air-popped popcorn". Lovely, look at that: we've now got a semantic search pipeline ready to go that we can run with one line of code. How cool is that? We could save these functions to a .py file and import them later, for when we want to feed relevant resources to our LLM.

And speaking of LLMs, how about in the next section we get an LLM and run it locally? This is super exciting, so I'll write a heading: getting an LLM for local generation. If you want to jump ahead to the source notebook and try it yourself, feel free, otherwise I'll see you in the next section, where we'll get our local LLM up and running and start generating some text.

Alright, now comes one of the most fun parts of a RAG pipeline: getting an LLM for local generation. Of course we're focusing on local generation because this is a simple local RAG tutorial, but this process will also work with an LLM API. So what is an LLM? Well, if you've ever used ChatGPT: "yo ChatGPT, how are you? I'm about to build a local RAG pipeline." We put in some text and we get some text back, really good, that sounds exciting, and it even knows what RAG is. So which LLM should you use? That's quite a changing question right now, because LLMs are constantly being updated, so we're just going to pick one and try it out, but I encourage you to always be experimenting with different ones. Two of the main questions to ask are: do I want to run it locally? In our case, yes. And if yes, one of the main things that determines whether you can run it on your own hardware is: how much compute power can I dedicate? I say that because LLMs are not exactly small files. For example, a 7-billion-parameter model, which is basically entry level these days (and again, this may change in the future; I'm recording this in March 2024), requires about 28 GB of GPU memory in float32. It's actually quite rare to load a model in float32 these days; float16 is the default for loading one of the common 7-billion-parameter models. A parameter is like a model weight: think of it as a small number that can learn a pattern in data. The more parameters (up to 175 billion and maybe even higher), the more opportunity a model has to learn, so generally more parameters means a better-performing model, but as you can see in the table, you need more memory too, even in 4-bit.

I'm jumping ahead a little, but the arithmetic is simple: one float32 value, e.g. 0.6942, requires 4 bytes of memory, and 1 GB is approximately one billion bytes, and that's where these rough numbers come from. As we decrease the numeric precision (precision in computing is worth looking up; I'll fix that link), we go from single precision to half precision and lower, which describes how many bits are used to represent each number. If we create torch.tensor(1), the default data type is int64; torch.tensor(1.0) gives float32. At lower precision we take up less storage: float32 is 4 bytes per value, float16 is 2, 8-bit is 1, and 4-bit is half a byte. What happens as precision decreases is that you often get a slight degradation in performance. It's usually not too dramatic, because there are a lot of techniques these days that make even 4-bit numbers work quite well, and because there are so many parameters, removing some of the bits that represent them doesn't hurt that much at such a large scale, but it is noticeable sometimes. So which LLM you use will be very experimental: the amount of memory it takes on your hardware depends on its size (the number of parameters) and the precision you load it in, and the model's performance also depends on the number of parameters as well as the precision it computes in. Models are very rarely loaded in float32 these days; I'd say float16 is the entry point, and 8-bit and 4-bit are becoming a lot more common. (There's a rough memory calculation sketched just below.)
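The back-of-the-envelope maths is simple enough to put in a tiny helper. This is a rough sketch that only counts the weights themselves and ignores activations, the KV cache and other overhead, which is why you need more memory than this in practice.

```python
# Rough estimate of how much memory an LLM's weights take at a given precision.
def estimate_llm_weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    # 1 GB is roughly 1e9 bytes (same approximation used in the notebook's table).
    return round((num_params * bytes_per_param) / 1e9, 1)


# A 7B-parameter model at different precisions (weights only, no overhead).
for precision, bytes_per_param in [("float32", 4), ("float16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"7B parameters in {precision}: ~{estimate_llm_weight_memory_gb(7e9, bytes_per_param)} GB")
```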
7 bill so thank you to mistol for open sourcing that Apache 2.0 which means that we can use it for whatever we'd like almost so perform really well and these are all available on hugging face and then of course there's llama 2 by Facebook you might have heard of that one llama 2 okay so Gemma is the most recent one of these and by Google's benchmarks outperforms the other two so that's what we're going to focus on so if we go back to here if we wanted to find the best performing open source llm we could go to the hugging face open llm leaderboard and there's also an account called the bloke on hugging face which provides quantized versions of different models so quantized means um made smaller essentially so you take a model in float 16 you quantize it it might be now in 4 bit so that's how I understand it essentially take take a big model make it smaller and accept the small performance um degradation that you're going to get so a lot of different models here again as you see yeah nearly 4,000 models so a lot to go through so that's why I just advise narrow your search base check out the leaderboard try out some popular ones that you see and just see how they work on your own um use case so for us um I have we're going to check our local GPU memory because that's going to influence what kind of model that we can use so let's say we pick a 7B cuz this is actually really popular sort of size model that you're going to be running locally at the moment again this is subject to change uh a lot of gpus have around 8 to 24 gig um local gpus that is 8 gig of RAM up to about 20 24 which is what the t uh Nvidia Nvidia RTX 490 has I have 24 so I should be able to run if this table is correct uh a 7B model in float 16 but if you have less say for example you only have uh I think 12 GB is another popular one of ram you might want to go the 8 bit or the 4-bit version of the 7B model but let's see this in action that's a lot of talking so what is an llm goes from text input to generate text output well sorry this is we're focusing on generative llms for now what is a generative llm beautiful so which llm should I use how much Hardware um vram do you have available and then two well that's do you want to run it locally that's probably if you run it locally that's probably the main question you have how much Hardware vram do you have available so with that being said let's now check our um local GPU memory checking our local GPU memory availability so there are a lot of tools now these days which will sort of automate and do this for you however we're going from scratch on this so we're writing out all of the code ourselves to check so import torch GPU we can get the memory in bytes by using we have a torch method for that so get memory in bytes we're going to go torch Cuda get device properties on zero. 
We can cross-check this number with nvidia-smi in a second. Printing it gives 24 GB available — beautiful. But that won't necessarily be the amount of free space we have: we're already using quite a bit of memory here, probably because we've instantiated a fair few embedding models, so we may run into a CUDA out-of-memory issue shortly — we'll face that when we come to it. In nvidia-smi there's my NVIDIA GeForce RTX: a total of about 25 GB of memory, of which I've already used just under 11 GB. So by the table above (that's in the source notebook), I should be able to run Gemma 7B. We're going to try Gemma because it's the newest of these 7B models — there might be a newer one out by the time you watch this, and if there is, please let me know; I'd love to see your results. Let's look up Gemma 7B-it — there we go, we've already liked it; this is Google's page on Hugging Face. Now, I've already got access to this model. If you haven't got a Hugging Face account you'll need to make one, and you'll need to accept the model's terms and conditions before you can load it. I've already been through that step — just go to the model page, there'll be a form, and once you click accept you should be able to start using it. So let me put a note here: to use Gemma 7B-it (or other Gemma models) you have to accept the terms and conditions on Hugging Face. And to actually download models from Hugging Face and run them locally, you may also need to sign in to the Hugging Face CLI — I've already done this, and I'll leave it to you, but it's a couple of steps: install the CLI and then run huggingface-cli login. Keep both of those in mind before you try to download the model; I've already done them (I'll turn these into notes). Now, I've also got a little piece of helper code, because I've already done some testing on the two sizes of Gemma, the 2 billion and the 7 billion. There are actually two variants of each: 2B and 2B-it, 7B and 7B-it. "it" stands for instruction tuned, which means the base language model has been fine-tuned to follow instructions. So if we ask "please create a markdown table of apple nutrition values", that's an instruction, not just "generate some text" — and it follows it. Instead of just chatting, it's "here's an instruction, follow it", which is typically what you want for workflows where you're asking an LLM to do something — you want the instruction-tuned version. I've also already done some testing on loading it in different precisions.
Of course, this will depend on which GPU you're running. There's a minimum memory value you need for the language model to run locally the way we're doing it, downloading directly from Hugging Face — but you actually need more than that minimum, because you also have to do the calculations, and those take extra GPU memory. For Gemma 7B in float16 we need about 19 GB of GPU memory, so it should fit on an NVIDIA RTX 4090; smaller GPUs such as the 4080 don't have 19 GB of VRAM, and for other GPUs that don't reach the requirement you should pick whichever option matches your hardware. Luckily I've got a little snippet of code for this: given the GPU memory value in gigabytes, it tells us what we should do. If the GPU memory is above 19 GB, we set use_quantization_config to False and set the model ID to "google/gemma-7b-it" — that's the model ID from the Gemma 7B-it page. A lot of steps, but it also works in Google Colab: I know we're running locally, but not everyone has a local GPU available, and you can run this whole notebook in Colab (there's a link — you might already be doing that). To show it works, I'll take this code, get a GPU on Colab (runtime type: V100, which I think has 16 GB of VRAM) and run both cells — a little less VRAM available there than on my local machine, but again, this is something you'll have to experiment with depending on your hardware. There we go: with 16 GB on a V100 it says we should use Gemma 2B-it, because you need quite a lot of space to run the 7B version, depending on which precision you load it in. With that being said, how about we load a Gemma model? Let's call this section "loading an LLM locally". I didn't want to rush straight into loading the Gemma 7B model specifically, because I didn't want you to run into errors if you can't load it on your hardware — that's why we went through the explanatory phase of which LLM to use: first, do you want it to be open source and run locally? Then you've got a few options, but it's highly dependent on the hardware you have available. So let's write this down: we can load an LLM locally using Hugging Face Transformers, a beautiful open-source library. On the Gemma 7B-it model page there's some helpful starter code — a similar workflow applies to many different LLMs — and we're going to do a modified version of it, using 4-bit precision where needed. We should already have Transformers installed because it was in requirements.txt (if not, pip install transformers; I believe it also comes preinstalled on Google Colab). So, import transformers. The model I'm going to use on my NVIDIA RTX 4090 is Gemma 7B. Again, adjust this: if you don't have 24 GB of VRAM available, run the snippet from the helper notebook (simple-local-rag) to find out which model you should use.
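As a sketch, that model-picking logic looks roughly like this (the exact thresholds in the helper notebook may differ; gpu_memory_gb comes from the memory check above, and the 12 GB cut-off for 4-bit is my reading of the earlier discussion):

```python
# Pick a Gemma model and quantization setting based on available GPU memory.
# Thresholds are rough guides, not hard rules.
if gpu_memory_gb >= 19:
    model_id = "google/gemma-7b-it"
    use_quantization_config = False   # enough room to load the 7B in float16
elif gpu_memory_gb >= 12:
    model_id = "google/gemma-7b-it"
    use_quantization_config = True    # load the 7B in 4-bit
else:
    model_id = "google/gemma-2b-it"
    use_quantization_config = True    # fall back to the smaller model

print(f"Model: {model_id} | use quantization config: {use_quantization_config}")
```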
Because if you try to load the 7B on a GPU that doesn't have enough memory, you're probably going to run into issues — like I'm about to in a second, but we'll go through that together. To get a model running locally we need a few things. Number one is a quantization config. This is actually optional, but it's a config for what precision to load the model in, e.g. 8-bit, 4-bit, etc. — the fewer bits, the less space the model takes up. If you have unlimited GPU memory available, of course you don't need a quantization config; but if you don't (which is most of us), you can create one and it will tell your model what precision to load in. Number two is a model ID: this tells Hugging Face Transformers which model and tokenizer to load. And number three, we need a tokenizer: this turns text into numbers ready for the LLM. Note that a tokenizer is different from an embedding model — a tokenizer is specific to your LLM. The good thing is that Hugging Face makes it easy with Transformers to pair the tokenizer and the model: we use the same model ID for our tokenizer and for our LLM, and the LLM is what we'll use to generate text based on an input. Wonderful — so let's load some things. I'll also mention a little extension here (I'll leave more extensions at the bottom of the notebook): once you get into the world of LLMs, you'll notice there are many tips and tricks for making them run faster. One of the best is Flash Attention — flash-attention-2, I believe, is the GitHub repo. Because LLMs are often based on the Transformer architecture, and the Transformer architecture has an attention module, Flash Attention 2 is a faster implementation of attention. However, it can be a little tricky to install on Windows, so keep that in mind if you're on Windows — I've got some notes in the setup steps; you can build it from source or install it from the repo. Essentially, Flash Attention speeds up the token generation of your large language models, and anywhere we can get speedups is really helpful — see the GitHub for more. One thing to remember, though, is that a lot of these speedups are, for now, only available on newer GPUs (Ampere, Ada, Hopper), but the good news is we can programmatically find out whether our GPU is capable, and we'll do that. I've personally got an RTX 4090, so I can use Flash Attention 2. Wherever you can find speedups that let your LLMs run with less compute, that's one of the best things you can do — again, I'll have some extensions at the bottom of the notebook with potential ways to speed up LLMs running locally, and if you'd like to see a video on that, please let me know; I'd love to make one. So we've got some starter code, but we're going to write it out step by step to load a model. We're going to use torch, and from Transformers — this is so exciting, honestly groundbreaking stuff: the Gemma model only came out about two weeks before I made this video, so we're some of the first people in the world to use it
for RAG pipelines — though it depends on when you're watching this, of course; there might be a better model out by now, but the workflow should remain the same as long as it's available on Hugging Face. By the way, I skipped a few imports: AutoTokenizer, which creates our tokenizer automatically given a model ID, and AutoModelForCausalLM — a causal LM is a causal language model, which is another way of saying a generative language model: based on the input, it causally generates a new output. There's also is_flash_attn_2_available, a little helper function from Transformers that checks whether Flash Attention is installed, which will help speed up our inference times. So, number one: create a quantization config, following the steps above. A note here: this requires pip install bitsandbytes and accelerate (accelerate being another Hugging Face library). The bitsandbytes GitHub is by Tim Dettmers — who also has one of the best guides on which GPUs to use, so I'd highly recommend checking that out — and bitsandbytes is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication, and 8-bit and 4-bit quantization functions. If you'd like to read more, the documentation is on Hugging Face as well — big thank you to the Hugging Face team and to Tim Dettmers. So: from transformers import BitsAndBytesConfig — this is what's going to help us quantize our model — and then quantization_config = BitsAndBytesConfig(load_in_4bit=True). I've read through some of the bitsandbytes documentation (excuse me, I thought my table was in this notebook — that's what happens when you have lots of notebooks open), and while I thought the 8-bit version might be the one to use, it seems 4-bit has had a bit more development in the bitsandbytes library. That may change in the future, but for now that's what I've found, so I'm going to use 4-bit for inference — if you find something different, please let me know via a discussion or issue on the GitHub. Loading in 4-bit is the smallest version of the model we can get loading it this way (otherwise TheBloke may have even smaller quantized versions). Then bnb_4bit_compute_dtype is where we say what compute dtype we want: torch.float16. So we load the model in 4-bit, but it computes in float16 — a little confusing, but that's just the config I've found works best with this kind of setup.
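Pulled together, the quantization config cell looks like this (the same settings as described above):

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization config (requires `pip install bitsandbytes accelerate`).
# Weights are stored in 4-bit, but matrix multiplications are computed in float16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
```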
Now a little bonus: Flash Attention 2, a faster attention mechanism that basically speeds up your Transformer operations — if you want to speed up your LLMs, you should use every trick you can get. However, Flash Attention 2 requires a GPU with a compute capability score of 8.0 or higher, essentially the newer GPU architectures: Ampere, Ada Lovelace, Hopper and above. We can look up our NVIDIA GPU's compute capability, and also calculate it with a torch function. I'll put the link in the notebook (it's all in the reference notebook as well): the "CUDA GPUs" page on the NVIDIA developer site lists CUDA-enabled Quadro and NVIDIA RTX cards. The newer ones, like the RTX 6000, have a score of 8.9, so they can run Flash Attention 2; and among the local cards you'll more often be running, there's the GeForce RTX 4090 (the GPU I have) at 8.9, the 4080 and 4070 Ti at 8.9, and then it starts to drop off — the 2080 is at 7.x. That doesn't mean you can't run these models on older GPUs; it just means you won't be able to leverage some of the newer tricks like Flash Attention for now — they may bring them to older GPUs in the future, but for now it only works on a GPU compute score of 8.0+. We can check that with torch.cuda.get_device_capability, which gets our GPU's score. Let me run it in a separate cell: torch.cuda.get_device_capability(0) — the 0 says which device to check, and I only have one device, so that's the first one — and it returns 8.9 as a tuple. So we can just check whether that first value is greater than or equal to 8: if is_flash_attn_2_available() returns True and torch.cuda.get_device_capability(0)[0] >= 8, we set attn_implementation = "flash_attention_2", else attn_implementation = "sdpa". We're doing this verbosely for now — once it's all set up you can just load things automatically and won't have to redo all of this — but I want to be as thorough as possible when loading a model, because we want it to work as well as possible; we don't want to leave compute power on the table. The fallback, "sdpa", is scaled dot product attention, which will be fast anyway because PyTorch 2.2 now integrates Flash Attention 2 into its scaled dot product attention — if you look at the PyTorch 2.2 release notes, there it is, and the SDP kernel documentation covers the different types of scaled dot product attention if you'd like to read more. So it's not much of a fallback: it's still going to be fast; Flash Attention 2 is just faster.
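Here's that check in one place (is_flash_attn_2_available lives in transformers.utils; "sdpa" tells Transformers to use PyTorch's scaled dot product attention):

```python
import torch
from transformers.utils import is_flash_attn_2_available

# Use Flash Attention 2 if it's installed and the GPU's compute capability is 8.0+
# (Ampere / Ada Lovelace / Hopper); otherwise fall back to PyTorch's scaled dot
# product attention, which is still fast.
if is_flash_attn_2_available() and (torch.cuda.get_device_capability(0)[0] >= 8):
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa"

print(f"Using attention implementation: {attn_implementation}")
```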
Okay, now let's pick the model we'd like to use. We've already done this: I'm going with Gemma 7B-it (yours might be slightly different depending on your hardware), and I'm not using the quantization config, so my model_id is already set — we could also just type the name in here, and by the time you watch this there may be a better version out; if there is, please let me know, I'd love to try it. Step three: instantiate the tokenizer. The tokenizer turns text into tokens, and we can do it with tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id) — that argument can also be a local path if you've already saved one, but we're going off the model ID. Something else to note: I've already downloaded this Gemma 7B model, so it loads from the cache; if you're downloading it for the first time (and remember, you need to have accepted the Gemma terms on Hugging Face and set up the Hugging Face CLI), it will take a while, depending on your internet connection. Then we instantiate the model, which is quite similar: llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id, ...). By the way, there are a lot of different parameters you can pass to these two classes, so please check the Hugging Face documentation for them if you'd like to learn more. We define the dtype with torch_dtype=torch.float16 (this gets overridden if we use our quantization config, but I like to put it in by default anyway), then quantization_config=quantization_config if use_quantization_config else None — this is where our little if/else case comes in handy; mine is set to False, yours might be True depending on how much memory you have available. Then low_cpu_mem_usage=False, because I want to use as much GPU memory as I can (low CPU memory usage relates to offloading: if you load a big model and don't have enough GPU space, some of it gets offloaded to the CPU, and we want to use our GPU wherever we can). Finally, attn_implementation set to the attn_implementation we chose above. That's a fair bit of code, but I wanted to be as thorough as possible on purpose, so that when you start loading your own models you know a little — well, a lot actually — about what's going on. We've gone from zero to loading an LLM; this is so cool. We do need one more thing: send it to our target device. So, if not use_quantization_config, send the model to the GPU — if we have set a quantization config, it takes care of device placement automatically (that's part of the bitsandbytes config plus Hugging Face Accelerate). Now, in all likelihood this is going to error, either from typos or because I don't have enough CUDA memory available, so let's try it. "You are attempting to use Flash Attention 2.0 with a model not initialized on GPU" — but that's what the next line is for, we send it to the GPU after loading. Loading checkpoint shards... is this going to work? Oh, it did work. Run nvidia-smi — we've taken up a fair bit of memory there, but we now have an LLM. Let's check it out: llm_model — boom, look at that, our Gemma model ready to go, 7 billion parameters, and it shows flash_attention_2 because I'm using Flash Attention. I might also print f"using attention implementation: {attn_implementation}" since I've got Flash Attention installed.
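Put together, the loading cell looks something like this (model_id, use_quantization_config, quantization_config and attn_implementation come from the earlier cells; it assumes you've accepted the Gemma terms and run huggingface-cli login):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Tokenizer and model share the same model ID so they're guaranteed to match.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)

llm_model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_id,
    torch_dtype=torch.float16,  # overridden by the quantization config if one is used
    quantization_config=quantization_config if use_quantization_config else None,
    low_cpu_mem_usage=False,    # don't offload to CPU, use the GPU where we can
    attn_implementation=attn_implementation,
)

# bitsandbytes + accelerate handle device placement when quantizing;
# otherwise we place the model on the GPU ourselves.
if not use_quantization_config:
    llm_model.to("cuda")
```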
If you're having trouble installing Flash Attention, please let me know on the GitHub or check out the resources I've put there — I'll print that line in the next cell so we don't have to reload Flash Attention 2. Beautiful. I've added a little note to the GitHub about it: there's a helpful GitHub issue thread for installing on Windows, and in Colab I found it installs pretty quickly. So we have a model — now what do we do with it? It's just this big LLM sitting in our notebook, waiting to generate some text. How about we calculate the number of parameters, for fun? So: def get_model_num_params(model: torch.nn.Module), and we return the sum of param.numel() (the number of elements in the parameter) for param in model.parameters(). This is so exciting — I can't believe we've got a local LLM ready to run. There's our function, and of course we can run it: get_model_num_params(llm_model). It turns out, after closer inspection — this is the power of doing your own experiments — that Gemma 7B actually has about 8.5 billion parameters, so I'm not sure about its comparisons to other 7B models, because they're rounding down quite a bit. But remember: the more parameters in a model, generally the more opportunity it has to learn. We can also work out how much memory it takes up: def get_model_mem_size(model: torch.nn.Module) is another helper function (feel free to use it wherever you like) — this is how I got the memory numbers for the different Gemma sizes. We get the model's parameter and buffer sizes: mem_params is the sum of param.numel() times param.element_size() for param in model.parameters(), and we do the same for the buffers: mem_buffers is the sum of buf.numel() times buf.element_size() for buf in model.buffers().
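Here are those two helpers in full, tidied up (including the byte-to-megabyte/gigabyte conversion we add next):

```python
import torch

def get_model_num_params(model: torch.nn.Module) -> int:
    """Count the total number of parameters in a model."""
    return sum(param.numel() for param in model.parameters())

def get_model_mem_size(model: torch.nn.Module) -> dict:
    """Estimate how much memory a model's parameters and buffers take up."""
    mem_params = sum(param.numel() * param.element_size() for param in model.parameters())
    mem_buffers = sum(buf.numel() * buf.element_size() for buf in model.buffers())
    model_mem_bytes = mem_params + mem_buffers
    return {
        "model_mem_bytes": model_mem_bytes,
        "model_mem_mb": round(model_mem_bytes / (1024**2), 2),
        "model_mem_gb": round(model_mem_bytes / (1024**3), 2),
    }

# Example usage once llm_model is loaded:
# get_model_num_params(llm_model)  -> ~8.5 billion for Gemma 7B-it
# get_model_mem_size(llm_model)    -> ~16 GB in float16
```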
There may be an in-built torch function for this already — if there is, let me know; if not, we'll keep doing it from scratch. Beautiful. Now let's calculate the model sizes: model_mem_bytes = mem_params + mem_buffers, which gives us how much memory the model takes up on our device in bytes. I like to know it in megabytes too, so model_mem_mb is that divided by 1,024 squared, and gigabytes is model_mem_bytes divided by 1,024 cubed — those are just the conversions — and then we return a dictionary with the bytes, megabytes and gigabytes. We don't strictly have to do this; it's just a little helper extension. A lot of the time you're going to run into memory issues when loading models locally, so I find it helpful to be able to figure out how much memory a model actually takes up. We'll round the values as well, just for prettiness — and don't worry, we're going to generate text soon, all locally. Let's see if it works: get_model_mem_size(llm_model)... there we go. We get a bit over 16,000 megabytes, which is about 15.9 GB — basically 16 GB. So to load Gemma 7B-it in float16 we need a bare minimum of about 16 GB of VRAM on our GPU. I'll put a note there: this means that to load Gemma 7B-it in float16 we need a minimum of roughly 16 GB of VRAM. However, due to the calculations that take place in the forward pass — when we actually pass text through the model — we need closer to 19 GB. That's something to keep in mind: just because you have the minimum amount of memory to load a model, you still have to do the calculations in the forward pass, and that takes more memory. That also means Gemma 7B-it in float16 wouldn't really fit on a V100, which has roughly 16 GB of VRAM — it might just load, but we wouldn't be able to do a forward pass. So in the next section, how about we generate some text with our LLM, all locally? You might want to skip ahead and try that yourself, but otherwise I'll see you there — we're going to generate text with an open-source LLM on our own computers, and it's going to be so much fun. Alrighty, put your hand up if you're ready to generate text with our LLM locally (or in Google Colab — it'll work there too). I've got my hand up; I know you can't see me, but it's in the air, you'd better believe it. So, let's generate text with our local LLM. Just one more thing to note before we start writing code: some models have been trained or tuned to generate text with a specific template in mind. When we use ChatGPT or something similar we can just type "hey, how are you" without formatting anything — that's because some nice things happen behind the scenes: ChatGPT has been trained in a specific way, and the text gets pre-processed in a specific way. We have to do a similar thing with our models. Strictly speaking we don't have to, but to get better results we should: if a model has been trained in a certain way, we should follow that when using it for inference. So, because Gemma 7B-it has been trained in an instruction-tuned manner,
we should follow its instruction template for the best results. All this information is available in the model card of Gemma 7B-it — yes, there's a chat template: the instruction-tuned models use a chat template that must be adhered to for conversational use. You load the model, then apply the chat template to a conversation — in their example the chat is "Write a hello world program", and you call apply_chat_template with add_generation_prompt=True. Let's look at what that produces. There's a beginning-of-sentence special token (to tell the model where the sentence starts), then the start of the turn for the user (in our case we're the user), then the prompt — the text that goes in — then the end of the user's turn, then the start of the turn for the model, and after that comes the generated text. Let's see what that looks like in practice — oh my gosh, I'm so excited; you'll never forget the first time you run an LLM locally. So, input_text = "What are the macronutrients, and what roles do they play in the human body?" — the question we asked before, just with a bit more text around it. This is going to be generation without retrieval, because our model is capable of that: it's trained on internet text and should be able to generate some pretty good answers if Google's benchmarks are correct. (These need to be curly braces in the f-string, excuse me.) There's our input text. Now we need to create the prompt template — and by the way, another key term, if I haven't mentioned it yet, is "prompt": a prompt is the common term for the input to a generative LLM, and the idea of prompt engineering is to structure a text-based (or potentially image-based) input to a generative LLM in a specific way so that the generated output is ideal. We'll see what that looks like in a second. Let's create our prompt template for the instruction-tuned model: dialogue_template is a list containing a dictionary with role "user" and content set to our input_text, and there's a helper method on our tokenizer that turns this into the correct template for our model automatically — many models have this built in on Hugging Face. So, apply the chat template: prompt = tokenizer.apply_chat_template(conversation=dialogue_template, tokenize=False — we don't need to tokenize just yet, I just want to show you the string — add_generation_prompt=True). Printing the formatted prompt: boom, there's our input text formatted with the chat template — beginning of sentence, start of turn, user, "What are the macronutrients and what roles do they play in the human body?", end of turn, start of turn, model. That's exactly the format we need as the input to our Gemma model. It may be slightly different depending on the model you're using — that's why we use the apply_chat_template method — because if a model has a specific input format, such as Mistral Instruct (another very popular model), the template might be slightly different.
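In code, that formatting step looks like this (input_text is our nutrition question; the template itself comes from the Gemma tokenizer):

```python
# Format a raw question into Gemma's instruction-tuned chat template.
input_text = "What are the macronutrients, and what roles do they play in the human body?"

dialogue_template = [
    {"role": "user", "content": input_text},
]

prompt = tokenizer.apply_chat_template(
    conversation=dialogue_template,
    tokenize=False,              # return the formatted string rather than token IDs
    add_generation_prompt=True,  # append the "<start_of_turn>model" marker
)
print(prompt)
```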
So pay attention to the chat template of whatever model you're using. Now what can we do? We can generate some text — let's do it, and I'm going to time it. First we need to tokenize the input text — turn it into numbers — and send it to the GPU. Well, it's our input text formatted into the prompt, so we want to tokenize the prompt. Let's go: input_ids = tokenizer(...) — the tokenizer we've already set up. I'll show you what it looks like: there are the special tokens, IDs 0, 1, 2, 3, 106 and 107 — pad, end of sentence, beginning of sentence, unknown, start of turn, end of turn. We pass in the prompt, set return_tensors="pt" for PyTorch, and send the result to the GPU. Let's look at it: there are our input IDs and there's our attention mask. We want all of the input to be visible, which is why the mask is all ones — we could mask out some of the input with a different attention mask by setting values to zero, but in our case we want our model to see all of these tokens. This will be the input to our model, and we want the model to generate some tokens based on it. One thing to note: an LLM doesn't actually output text, it outputs tokens — we have to convert them to text ourselves. So let's generate outputs from our local LLM (we may run into a CUDA issue here, though): llm_model.generate(**input_ids, ...) — the double star means unpack the keys, so input_ids and attention_mask both get passed in — plus a little parameter, max_new_tokens=256, which means: given these input tokens, generate a maximum of 256 more; we've limited how many new tokens we can get. Let's print the model's output tokens, outputs[0], and see what happens — running locally, generating tokens using an LLM on our own GPU. There we go, how cool is that? There are the output tokens. What does the model do? Tokens in, tokens out — that's an LLM — whereas ChatGPT does some processing on the back end so it goes text in, text out. We still have to convert these tokens back to text, but there we go: we've just run our first LLM generation locally together (maybe not your first, but still) — be pretty proud. It takes a few seconds, and of course the generation time depends on a number of things: the compute power you have available (I'm recording, so that's taking up some processing power), the GPU you have, how many tokens you have to process, the compute data type you're using (float16, 4-bit or whatnot), and a bunch of different optimizations. There are a few things I've left in the extensions that we could cover in a future video — there are a lot of videos I could make out of this, the topic is expanding rapidly — so if you'd like to see a video on optimizing LLM generation locally for GPUs, please let me know.
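That generation cell, in one place (prompt is the chat-formatted string from above):

```python
# Tokenize the formatted prompt and move the tensors to the GPU.
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate up to 256 new tokens on top of the input tokens.
outputs = llm_model.generate(
    **input_ids,          # unpacks input_ids and attention_mask
    max_new_tokens=256,
)
print(outputs[0])         # raw token IDs: tokens in, tokens out
```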
But now let's convert those tokens into text to see what our model output. Decode the output tokens: outputs_decoded = tokenizer.decode(outputs[0]) — instead of encoding, we decode, mapping the tokens back to text — then print the decoded model output. Oh my goodness, there's our model output. One thing you'll notice with the Gemma model is that it returns the prompt as well as the generated text: "Sure, here are the macronutrients and their roles in the human body: carbohydrates — provide energy for the body, help regulate blood sugar levels, provide fiber for digestion and gut health; proteins — build and repair tissues..." This is pretty good for a small model (small in LLM terms — 7 billion parameters). Let's do a little trick: I'll ask ChatGPT, "Can you rate the output of my local LLM above based on its answer to the question? Please rate it out of five" — because ChatGPT is a really good, basically state-of-the-art LLM. We got a four out of five — oh no, what did it miss? Clarity — I think it's a pretty good answer for just generating — "could be enhanced by adding more specific examples", "could be more engaging". That's actually a technique you could use to improve smaller models: get a larger model to rate the outputs of a smaller model, and then improve the smaller model's outputs based on the larger model's recommendations — we're getting really meta there. How about another generation, because this is fun. We're focusing on nutrition questions, so: how long should infants be breastfed? Okay, that was a lot quicker this time: "According to the American Academy of Pediatrics, the optimal duration for breastfeeding is 6 months, with continued breastfeeding for up to one year and beyond." These are nutrition-focused questions, but Gemma 7B is a general model trained on internet-scale text, so we could ask it almost anything — "How do I grow potatoes?" — and see what it comes out with: choose a suitable location, prepare the soil, plant the potatoes, water and fertilize. But I want to grow them organically! Okay, back to our macronutrient question: what are the macronutrients and what are their functions in the body? We could functionize this much better now that we've got the model loaded, and in fact we'll start moving towards that. If we go back to the source notebook and scroll down — we've generated some text, beautiful — I've already got some questions pre-loaded: I generated some nutrition-style questions with GPT-4 (basically, "can you generate some beginner nutrition-textbook-style questions"), and we're going to see how well our model answers them. So we've got a query list, and I've added some of my own: there's the infant breastfeeding one (the reason that question keeps popping up is that a couple of my friends have young children, I've studied nutrition in the past, and it's just a topic we've been talking about), what are the symptoms of pellagra, how does saliva help with digestion, and so on.
These are the sort of nutrition questions that should be answered by a graduate-level nutrition textbook. So there's our query list with 10 or so questions, and now we can test our retrieval pipeline and our generation pipeline by sampling from it, instead of rewriting a query every time. Import random, then query = random.choice(query_list) — and we could make this list a lot bigger; you can ask almost anything of these general models, but we specifically want to make our RAG pipeline for our nutrition textbook (if we were Telstra, we'd customize our RAG pipeline for customer support documents). Then we get just the scores and indices of the top related results. We've got our generation step ready, but we need our augmentation step — that's what we're working on next. Looking at our workflow: we've got an LLM and we've got a query; now we need to put some relevant passages into our prompt before we pass it to the LLM. So scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings), and let's check the scores and indices — "What is the RDI for protein per day?" — there we go, and it's really quick. So: we can retrieve and we can generate; next we need to augment our input to the model, and we can write a function to do so. Let's regroup — in the next section we'll go through the final step of our RAG pipeline, augmenting our prompt with context items. I can't believe we're running an LLM locally — Gemma 7B only got released a couple of weeks ago and we're already using it; this field is beautiful and exploding. I'll see you in the next section for the augmentation step of our RAG pipeline. Okay, who's ready to augment our prompt with context items, the final step of our RAG pipeline? We've done retrieval, we've done generation — time to augment, let's do it. The concept of augmenting a prompt with context items is also referred to as (and you'll see this a lot) prompt engineering. It cops a lot of flak for being called an engineering practice, but if you're continually modifying and experimenting with something, it is a form of engineering — and of science, and of art. We're still figuring out the best way to put inputs into an LLM and get ideal outputs, but RAG is probably one of the most promising and useful techniques for getting ideal outputs from an LLM. So, prompt engineering: it's an active field of research, and many new styles and techniques are still being discovered; however, LLMs in their current form have been around for probably two to three years now, and there are a fair few techniques that work quite well. I've got a couple of resources I'd highly recommend if you'd like to check them out: in our notebook there's promptingguide.ai,
and one of my favorites, Brex's prompt engineering guide — I really love that one; it's just a GitHub readme — these are all extra-curriculum for you. There's also "Prompt Design and Engineering", an arXiv paper that collects many of the useful techniques. We're not going to use all of them — that's something for you to experiment with — but if you want references, go and check those out; I'll put a couple of them in here, and they're all in the reference notebook too. We're going to use a couple of prompting techniques. One: give clear instructions. I know that doesn't even sound like a technique, but you'd be surprised how much better the outputs from a generative LLM can be if you just specifically ask for what you want. Two: give a few examples of input/output — e.g. "given this input, I'd like this output". Again, it sounds simple, but it really does help improve the outputs of your model: "hey, if I gave you this input, I want the output to look like this." And three: give room to think — e.g. create a scratchpad, a "show your working" space, or the common "let's think step by step". You know how your teachers used to get you to show your working when you solved problems? It turns out that kind of works with LLMs as well. I'm still working this out too, but these are three things I've found really help with LLM inputs and outputs. Anthropic also have a great article, "Prompt engineering for business performance" — Anthropic being the creators of Claude, and Claude 3, which just came out, is currently the best-performing LLM in the world. That article covers few-shot prompting (giving a couple of examples), thinking step by step (which we're using), and prompt chaining (which we won't really explore for now) — but it's worth a look, coming from one of the leading groups of LLM creators at the moment; I'll put that link here as well. Claude 3 now outperforms GPT-4 as of March 2024, which is really cool to see. With that being said, let's create a function right here to format a prompt with context items. We've got the ability to retrieve some context items using our retrieval function; now we want to take the text we get back and put it into the chat template we wrote up above. Let's see what this looks like — we'll start nice and simple: def prompt_formatter, taking a query (a string) and context_items (a list of dictionaries). We could make our prompt as complicated as we like, but you know me, I like to start as simple as possible, so for now let's return a prompt that is just the context. The context will come from our pages_and_chunks — and of course, how you format your prompt will depend on your data (wow, there are a lot of samples in there: we'll grab sample 420 of the roughly 1,800) — so we pass in pages_and_chunks, get the relevant indexes, and pass those sentence
chunks to our generative model. The context is going to be those chunks joined as dot points — I'll show you what it looks like as we print it — so each chunk becomes a bullet, a bit like: "Based on the following context: - ... - ... please answer the following query: what are the macronutrients and what do they do? Answer:". That's roughly the prompt we're aiming for, with one, two, three, four, five bullets or however many we want to put in, and we want to create those contexts programmatically. So we join the items with a newline and a dash for each item in context_items — this will take some experimenting, by the way, with how you want to format your prompt; I've had a bit of practice, so I roughly know how I want it — and for now let's just say prompt = context; we can adjust it in a second. Let's see a bare-bones version of what this looks like. We pick query = random.choice(query_list) (I'll get rid of the old cell, we don't need it any more), print the query, then get the relevant resources: scores, indices = retrieve_relevant_resources(...) — this is where our retrieval function really comes in handy, we don't have to write that whole pipeline again; we're really working towards a wonderful local RAG pipeline. Then we create a list of context items based on our indices: context_items = [pages_and_chunks[i] for i in indices]. Then we format our prompt with prompt = prompt_formatter(query, context_items), our function from above, and print the prompt. There we go — water-soluble vitamins — do we have good context items? That's all we're printing right now: just the context, a list of sentence chunks related to our query, with the line above joining them all into dot-point form — just like an exam question that says "based on these four or five paragraphs, please answer this question". Does that make sense? "It is recommended that users complete these activities..." — hmm — "all water-soluble vitamins play a different kind of role in energy metabolism" — there we go. You can see that the top resource here maybe isn't the best one to have first; that's where a reranking model could come in handy, as we discussed before — I'll leave that as an extension. So, water-soluble vitamins. Now, what happens if we just pass this straight to our model? We've got no other instructions — we're just going to give it a bunch of text and see what it outputs, in the name of experimenting.
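Here's a sketch of that bare-bones step (retrieve_relevant_resources, embeddings, pages_and_chunks and query_list all come from earlier in the notebook; I'm assuming the chunk text lives under a "sentence_chunk" key, as in the source notebook):

```python
import random

def prompt_formatter(query: str, context_items: list[dict]) -> str:
    """Bare-bones formatter: for now the prompt is just the retrieved context
    joined as dot points; the query and instructions get added next."""
    context = "- " + "\n- ".join(item["sentence_chunk"] for item in context_items)
    return context

query = random.choice(query_list)
print(f"Query: {query}")

# Retrieve the top related chunks and format them into a prompt.
scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings)
context_items = [pages_and_chunks[i] for i in indices]
prompt = prompt_formatter(query=query, context_items=context_items)
print(prompt)
```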
So we tokenize the prompt with return_tensors="pt" — note we haven't applied the chat formatting here, and the query isn't even in the prompt, but we're going to try this workflow anyway — get the input_ids, and then generate an output of tokens with llm_model.generate. We'll also set the temperature of the model. What is temperature? It's basically one of the most important parameters of an LLM; it typically goes from 0 to 1 for most cases (it may go beyond that in some setups, but in my experience it's 0 to 1), and the lower the value, the more deterministic the text; the higher the value, the more creative. Again, it's a very experimental parameter. If you want a great guide on the different settings for LLMs, the main ones are covered at promptingguide.ai under LLM settings, and Hugging Face also documents a lot of different settings for the generate method. On temperature: "in short, the lower the temperature, the more deterministic the results, in the sense that the highest-probability next token is always picked; increasing temperature could lead to more randomness, which encourages more diverse or creative outputs — you are essentially increasing the weights of the other possible tokens." We'll keep going: do_sample=True — whether or not to use sampling. If we set it to False (which I believe is the default), the model just chooses the most likely next token every time; sampling is a whole topic in itself, because an LLM takes in a sequence of tokens and generates the next sequence, and sampling decides how each next token is chosen. My favorite resource for this is Chip Huyen's write-up on text sampling — greedy decoding (always take the most likely token), temperature, and so on — if you want a resource on text sampling, go and check that out; again, these are all experimental. Then max_new_tokens=256, for how many new tokens we want to generate. Then we turn the output tokens into text: output_text = tokenizer.decode(outputs[0]), and print the query and our RAG answer. What's our query? Water-soluble vitamins — though note it isn't actually included in the prompt; our prompt is just a bunch of context, no question. That's deliberate: we're going to slowly build up our prompt formatting, our prompt augmentation. We also don't want the prompt echoed back, so we replace the prompt in the output text with an empty string and just look at the generated text. Let's see what happens: "This text includes information about water-soluble vitamins, including their role in energy metabolism, blood function and other bodily functions; it also discusses the absorption of fat-soluble and water-soluble vitamins and the importance of choline as an essential nutrient." Okay, that's pretty good — wonderful.
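In code, that experiment looks roughly like this (the temperature value is just an example setting; prompt is the context-only prompt from above):

```python
# Generate with sampling turned on: lower temperature -> more deterministic,
# higher temperature -> more varied/creative output.
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = llm_model.generate(
    **input_ids,
    temperature=0.7,      # experiment with this
    do_sample=True,       # sample from the token distribution instead of greedy decoding
    max_new_tokens=256,
)

# Decode and strip the echoed prompt so we only see the newly generated text.
output_text = tokenizer.decode(outputs[0])
print(output_text.replace(prompt, ""))
```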
Now, how about we put the query in there as well? Let's make a simple base prompt: "Based on the following context items, please answer the query", then the context items, then the context, then the query, and then "Answer:". Let's see what this looks like — prompt = base_prompt.format(context=context, query=query), setting the context and the query — and there we go, beautiful: "Based on the following context items, please answer the query". See how we're slowly augmenting our prompt? We now have the context items, the different chunks, and our query, "How often should infants be breastfed? Answer:". Let's pass that to our model: "According to the text, infants should be breastfed 8 to 12 times a day or more." Wonderful — it's answering from the text: retrieval-augmented generation. In fact, I think this answer could be a bit more detailed, so how might we improve that? While we're here, let's try another one: how does saliva help with digestion? "Saliva helps with digestion by secreting the enzyme salivary amylase, which breaks down the bonds between the monomeric sugar units of disaccharides, oligosaccharides and starches; salivary amylase breaks down amylose and amylopectin into smaller chains of glucose called dextrins and maltose." The benefit here is that we could research this answer — we have the resources, the context it came from; we could read through it or find the pages it came from. So we've created a RAG pipeline: we're retrieving, we're augmenting and we're generating, all running locally. How about we improve this base prompt slightly? I've got a pre-baked prompt that I've found works pretty well, so I'll bring it in and break it down: "Based on the following context items, please answer the query. Give yourself room to think" — that's the step-by-step technique — "by extracting relevant passages from the context before answering the query. Don't return the thinking, only return the answer. Make sure your answers are as explanatory as possible. Use the following examples as reference for the ideal answer style" — that's giving examples, which is very helpful for LLMs when you show them exactly what you want — then "Now use the following context items to answer the user query", the user query, and the formatted context. The examples are ones I wrote myself: "What are the fat-soluble vitamins?" with an answer, "What are the causes of type 2 diabetes?" with an answer, and "What is the importance of hydration?" with an answer. So the three takeaways incorporated here are: give clear instructions; give a few examples of input/output; and give room to think (e.g. a scratchpad, a "show your working" space, "let's think step by step"). Is it the best prompt? Probably not — it could probably be improved, and if you find a better way please let me know, I'd love to hear your tips and tricks — but there are lots more tips and tricks in those guides on how to best prompt models, or at least how to try, because this is still an experimental area.
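As a rough sketch, the improved base prompt has this shape (the exact wording and examples in the source notebook are longer — this is a hypothetical shortened version; what matters is the structure: instructions, room to think, a few example answers, then the context and the query):

```python
# Hypothetical shortened version of the improved base prompt.
base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.

Example 1:
Query: What are the fat-soluble vitamins?
Answer: The fat-soluble vitamins are vitamins A, D, E and K. They are absorbed along with dietary fat and can be stored in the body's fatty tissue and liver.

Now use the following context items to answer the user query:
{context}

User query: {query}
Answer:"""

prompt = base_prompt.format(context=context, query=query)
```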
I'll close those out, and let's see what happens when we run it. We haven't formatted it for conversation yet, so we've got example 1, 2, 3, the context items, the relevant passages, the user query ("What is the RDI for protein per day?") and "Answer:". Generate... hmm, what happened here? Oh — we haven't formatted it in the conversation style, so let's fix that. Going back up: we've got our prompt, but there's a special dialogue template that we've been using and that the model was trained on. This is something to keep in mind: if the outputs from your generative model aren't as good as you'd like, make sure your inputs are formatted the same way the model was trained. So let's create the prompt template: take the base_prompt, format it with the context and the query, and then create the prompt template for conversation for the instruction-tuned model. The dialogue_template is a list (this is on the model card — I almost know it off by heart now because I've done it a few times) containing a dictionary with the role and the content, where the content is our formatted base prompt. Then we apply the chat template with our tokenizer: prompt = tokenizer.apply_chat_template(conversation=dialogue_template, tokenize=False — we'll tokenize later with our tokenizer — add_generation_prompt=True). See how we've been slowly augmenting our prompt? We started with something really basic, then we gave it examples, clear instructions and room to think step by step, and we formatted it — that's the augmentation step — and now we're optimizing it for the instruction-tuned model we're using, Gemma 7B-it. Apply the template, return the prompt, and see what it looks like: "Explain the concept of energy balance and its importance in weight management." We've got our formatted prompt — the examples, the context ("balancing energy input with energy output to maintain weight" — beautiful, the context looks relevant), the query, and the model turn marker. Let's see what happens when we pass this through — this is always the fun part, waiting to see what it generates: "Sure, here is the answer to the query. Energy balance plays a key role in weight management. It is the state in which energy intake is equal to the energy expended by the body. When you are in a positive energy balance, the excess nutrient energy will be stored or used to grow; when you are in a negative energy balance, your body will need to use its stores to provide energy. Weight can be thought of as a whole-body estimate of energy balance. Therefore, maintaining energy balance is essential for maintaining a stable body weight: when you are in energy balance your weight remains stable, when you are in a negative energy balance you will lose weight, and when you are in a positive energy balance you will gain weight." I think that's a pretty good answer. So we've just built a RAG pipeline running completely locally, based on a nutrition textbook that's about 1,200 pages long. How cool is that?
How cool is that? But I think we can take it a step further and functionize our LLM answering feature. I want to boil it down to a single function: I want to just go ask, then a question like "What are the fat soluble vitamins?", and have it be as simple as that. So if you want to jump ahead and try to functionize this yourself, create a function called ask that goes through and repeats this entire pipeline for us: find the relevant resources to begin with, format our prompt automatically with that relevant context, then generate an output from the LLM and return it. Give that a shot and we'll do it together in the next section. So we've got our RAG pipeline working, but there's still one more step: I want to functionize this so it all works in a single function. Let's do it together. Hey, wouldn't it be cool if our RAG pipeline worked from a single function? E.g. you input a query and you get a generated answer, plus optionally the source documents, the context that answer was generated from. That's the whole principle of RAG, so let's make a function to do it. I reckon you might have already had a try at this, and if you have, I applaud you; that's what we're all about, experimenting and trying things out. So ask is going to take in a query, and we'll put in a couple of LLM parameters here: temperature, which can be a float, set by default to 0.7, and max_new_tokens, the maximum number of new tokens we want it to generate, an int set by default to 256. We'll add format_answer_text=True (I'll show you what this looks like in a second) and then return_answer_only=True, so if we want the context back we can change these flags. Let's write a little docstring: takes a query, finds relevant resources/context and generates an answer to the query based on those relevant resources. And this is what the entire notebook has been leading up to, right: creating this function, retrieval augmented generation in one hit. So, get just the scores and indices of the top related results: scores, indices = retrieve_relevant_resources with query=query (we've used this function before, so this is where we leverage all the code we've written previously) and embeddings=embeddings, and I think that's actually all we need, beautiful. Next, create a list of context items: context_items = [pages_and_chunks[i] for i in indices], where these are the indices of our relevant resources and pages_and_chunks is our list of dictionaries with text and resources. Then, add the score to each context item. This enriches our context items, because the items in pages_and_chunks (take a random one, say index 575) don't have the score of how relevant they are to the query. So we update that with for i, item in enumerate(context_items): item["score"] = scores[i], and we'll send that to the CPU, because the score is going to be on the GPU. Now what should we do next? We want to format our prompt. So that was retrieval, and now we want the augmentation step: create the prompt and format it with the context items, so prompt = prompt_formatter.
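Here's a rough sketch of the retrieval and augmentation steps we've written so far, as you might run them in a standalone cell. It assumes the helpers and variables from earlier in the notebook (`retrieve_relevant_resources`, `prompt_formatter`, `embeddings`, `pages_and_chunks`) exist with the signatures we gave them:

```python
query = "What are the fat soluble vitamins?"

# Retrieval: get just the scores and indices of the top related chunks.
scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings)

# Create a list of context items (our chunk dictionaries) from those indices.
context_items = [pages_and_chunks[i] for i in indices]

# Enrich each context item with its similarity score (the scores live on the GPU,
# so bring them back to the CPU first).
for i, item in enumerate(context_items):
    item["score"] = scores[i].cpu()

# Augmentation: format the base prompt with the retrieved context items.
prompt = prompt_formatter(query=query, context_items=context_items)
print(prompt)
```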
Now, one thing you could do with the prompt formatter: we've set the base prompt inside the prompt_formatter function we made above, but you could also save prompts to text files or something like that and edit them there. That way you could run experiments; if you had multiple different prompts you wanted to try out and evaluate, you could save them to different text files, import them programmatically and try a bunch of different prompts. Experiment, experiment, experiment. Then we format the prompt with the context items that we have; all of our functions are coming into play here, I just love it when a pipeline comes together. So now, what do we have to do? Generation. We want to tokenize the prompt. Remember that the prompt_formatter function returns our prompt with the conversation dialogue template already applied, so now we just have to turn it into tokens: input_ids. This is where our tokenizer comes in handy: we tokenize the prompt with return_tensors="pt" for PyTorch format, and we send it to CUDA, because our LLM is on our GPU, so we want our tokenized prompt to go to the GPU as well. Then, generate an output of tokens: outputs = llm_model.generate, and we pass it our input IDs, the temperature, and we could actually put more settings here if we wanted to. Customizing those settings will take a lot of experimenting; I'd highly suggest checking out the Hugging Face GenerationConfig documentation, there's a lot of different stuff you can do with that, check it out for more. For now we go do_sample=True and max_new_tokens=max_new_tokens, and there's our generation done. Now that's going to output tokens, right, so we want to decode the tokens back into text, because we want text to come out: output_text = tokenizer.decode(outputs[0]). There we go, beautiful. Then we format the answer. Our output text is going to include the prompt itself (that's how the Gemma model behaves, it may be different for another model), so as you see here we replace the prompt with nothing, and we can also remove things like the beginning-of-sequence and end-of-sequence tokens to make it look nicer, so it looks like it's just outputting text. So: if format_answer_text, replace the prompt and special tokens, output_text = output_text.replace with the prompt replaced by nothing, then the beginning-of-sequence token replaced with nothing, and the end-of-sequence token replaced with nothing as well. You could probably format the text a little better, but that's a simple version we can do for now. And then, only return the answer without context items: if return_answer_only is True (which is the default, because we only want the answer most of the time) we return output_text; if it's not set to True, we return the output text along with the context items. There we go. Okay, now we have one function to perform retrieval, augmentation and generation. How cool is that? Let's try it out. Hey, our RAG pipeline is coming to life.
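Putting it all together, here's a minimal sketch of the whole `ask` function. It assumes the pieces from earlier in the notebook (`retrieve_relevant_resources`, `prompt_formatter`, `embeddings`, `pages_and_chunks`, `tokenizer`, `llm_model`), and the special tokens stripped at the end (`<bos>`, `<eos>`) are what Gemma uses; other models may differ:

```python
def ask(query: str,
        temperature: float = 0.7,
        max_new_tokens: int = 256,
        format_answer_text: bool = True,
        return_answer_only: bool = True):
    """Takes a query, finds relevant resources/context and generates an answer based on them."""
    # Retrieval: get the scores and indices of the top related chunks.
    scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings)

    # Create a list of context items and enrich each with its similarity score.
    context_items = [pages_and_chunks[i] for i in indices]
    for i, item in enumerate(context_items):
        item["score"] = scores[i].cpu()  # scores live on the GPU, bring them back to CPU

    # Augmentation: format the base prompt with the retrieved context items.
    prompt = prompt_formatter(query=query, context_items=context_items)

    # Generation: tokenize the prompt and send it to the GPU (where the LLM lives).
    input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

    outputs = llm_model.generate(**input_ids,
                                 temperature=temperature,
                                 do_sample=True,
                                 max_new_tokens=max_new_tokens)

    # Decode the output tokens back into text.
    output_text = tokenizer.decode(outputs[0])

    # Optionally strip the prompt and special tokens so only the answer remains.
    if format_answer_text:
        output_text = (output_text.replace(prompt, "")
                                  .replace("<bos>", "")
                                  .replace("<eos>", ""))

    if return_answer_only:
        return output_text

    return output_text, context_items
```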
Oh, invalid syntax, "perhaps you forgot a comma"... a comma in "for i in indices"? I don't need a comma there. You know what, I've probably forgotten something here. I'm going to pause this, find the error and come back. ... You know what, I found the error, and you might have seen it: there were too many i's and n's in there and I'd left out the `in`, so it wasn't a missing comma, it was a missing `in`. There we go, okay, beautiful, we're running. You might have caught that far better than I did, but it took a little bit of troubleshooting. Oh, now we have invalid syntax again... let's see what happens. Ready to ask a query? So query = a random choice from our query list, then we print the query (we're trying out our function here, retrieval augmented generation in one hit), and then call ask, which by default just returns the answer only, so ask(query=query). What do we get? "What are the macronutrients, and what roles do they play in the human body?" Here we go: "Sure, here is the answer to your user query: macronutrients are the nutrients that are needed in large amounts by the human body. There are three classes of macronutrients: carbohydrates, lipids and proteins. These macronutrients are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use." Hmm, is that correct? Maybe, maybe not. "Proteins are macromolecules comprised of chains of subunits called amino acids. They provide structure to bones, muscles and skin and play a role in conducting most of the chemical reactions that take place in the body. Carbohydrates are molecules composed of carbon, hydrogen and oxygen. They provide a ready source of energy for the body and provide structural constituents for the formation of cells. Fats are stored energy for the body." Hmm, I think parts of that are a little bit incorrect, so let's troubleshoot it. By the way, the earlier syntax error was really hard to troubleshoot, and look, I used ChatGPT to find where the problem was ("is there an error in this function?") and it showed me I was missing an `in`. That's a really helpful use case for when your eyes are just glazing over your code. So: "is this answer correct regarding the macronutrients?" It's mostly correct, with a minor inaccuracy about the roles of macronutrients, carbohydrates playing a role in cellular structure; I think this comes from our examples. This is something you'll have to look into: just because our RAG pipeline is generating text with a reference doesn't mean it's always going to be correct. So let's try lowering the temperature. Lowering the temperature will hopefully generate less randomized, less creative text, so in theory it tells the LLM: just take the context and generate text that follows pretty directly from it. "Describe the process of digestion and absorption of nutrients in the human body." "Sure, here is the answer to the query" (we could probably remove that phrase from the output too if we wanted to): "the process of digestion and absorption of nutrients in the human body is a complex and multifaceted process that involves several organs and systems working together. It begins with the mouse" (mouth, not mouse) "where salivary amylase breaks down starch into smaller molecules called monosaccharides... cells then use these nutrients to generate energy or build new cells." So that's not too bad, right? It could probably be improved, but it's working: our retrieval augmented generation pipeline is working locally. This is so cool.
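For reference, the call with a lower temperature looks something like this, assuming `query_list` is the list of example nutrition questions we defined earlier in the notebook:

```python
import random

# Pick a random example question from our query list.
query = random.choice(query_list)
print(f"Query: {query}\n")

# A lower temperature makes the sampling less random, so the answer should
# stick more closely to the retrieved context.
print(ask(query, temperature=0.2))
```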
Name five fiber-containing foods. Do we get this? "Five fiber-containing foods are peas, beans, oats, whole grain foods and flax." Oh my goodness, a retrieval augmented generation pipeline running locally, yes, yes, yes. So what happens if we want to return the context? I'm going to set return_answer_only=False, because I want the context now. It's given us the same query as before, but now we have some context as well, okay, beautiful. So there's the generation: the search happens really quickly, the generation takes a little more time, and there are some steps we could take to optimize that, but I'm probably going to save those for another video, so if you'd like to see that please let me know. So this is the first relevant sentence; do we have a page for that? Page number 1086. I believe we might have already seen that one, so let's try again... but we'll go to 1086 anyway. There we go, oh, maybe we haven't seen that one. Do we have fiber on this page? I just lost the page number there... 1086, okay: "a multitude of diets... fresh fruit... modernizing influences... other ethnic groups have migrated to this diet... simple carbohydrates... good nutrition equates to receiving enough, but not too much, of the macronutrients, proteins, carbohydrates... and water and micronutrients... the phrase 'you are what you eat'... sugary, high fat..." Where's the fiber on this page? Maybe it didn't work too well, maybe I'm missing something, maybe you can see something I didn't, but this is where we'd have to troubleshoot later on. What do we get next? Page 60, is this relevant? See, this is another step, which is probably going to be another extension: how to evaluate RAG pipelines. "The digestive system"... there we go, that's much better... "when you feel hungry, what happens..." So we could make this look a lot prettier, but that's our local RAG pipeline; we've now officially built the system. Evaluating it would be a whole other thing: we've built the system and it works in principle, so what you'd likely do next, now that we've got the outline structure, is refine it a little more so it works better for certain use cases. You could create some good examples and then use those as evaluation metrics and whatnot; that space is also still being worked out. But congratulations, we've built a local RAG pipeline from scratch: we took in a document, a 1,200-page PDF nutrition textbook, we processed it into smaller chunks, we embedded those chunks, we stored those embeddings, and now we have a system where we can ask a query with basically one line of code, and it will find relevant resources, ask an LLM, and return those resources as well as the original query and the generated answer based on them. So with that being said, a little summary. Let's put a summary here: RAG is a powerful technique for generating text based on reference documents. Maybe we now have to call up Telstra and go, hey, I can help out with your system: we could take all of the customer support documents, embed them, store them in a database (such as a torch tensor or a vector database), and then use a model like Gemma 7B to produce generated outputs based on those documents. Then a note on hardware use: use a GPU where possible to accelerate embedding creation and LLM generation (I'll turn this into markdown), and keep in mind the limitations of your local hardware.
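On that hardware point, here's a quick sketch of checking what you're working with locally. The memory thresholds below are rough assumptions for guidance, not exact requirements:

```python
import torch

# Check whether a CUDA GPU is available and roughly how much memory it has.
if torch.cuda.is_available():
    gpu_memory_gb = round(torch.cuda.get_device_properties(0).total_memory / (2**30))
    print(f"GPU: {torch.cuda.get_device_name(0)} | Memory: {gpu_memory_gb} GB")

    # Rough, assumed guidelines for which generation model to try locally.
    if gpu_memory_gb >= 19:
        print("Should be able to run Gemma 7B in float16.")
    elif gpu_memory_gb >= 8:
        print("Consider a 4-bit quantized 7B model, or a smaller model like Gemma 2B.")
    else:
        print("Consider a smaller model, a hosted API, or running in Google Colab.")
else:
    print("No CUDA GPU found, embedding and generation will be much slower on CPU.")
```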
Next point: many open-source embedding models and LLMs are starting to become available, so keep experimenting to find which works best. There are probably a few more things to go over, but we've covered a lot of material, so what I'd say for some practice is that I've got some extensions in the base notebook. If we come down right to the very end (these aren't formatted very well at the moment, but by the time you watch this they'll probably be a bit more polished), there are several different things we could do. For the PDF text: right now we're just focusing on text, but there are things like figures and images in there, and we could probably work out some sort of pipeline to extract those from the PDF and embed them in a certain way; there are a couple of extra resources here for PDF extraction. There's a bunch of different prompting techniques we could try as well, and I've linked some resources for those. Then: what happens when a query comes through that isn't covered by any context in the textbook? That's something we'd have to think about in a workflow if we wanted to deploy this; say we embedded Telstra's support documents and someone starts asking where they can buy a chainsaw, maybe Telstra's support chatbot shouldn't really answer those questions (I'll drop a rough sketch of one way to handle this at the end of this section). Then: we could try another embedding model; we could also try a reranking model on our retrieved resources, so if we had the top 10 resources, could we rerank them in a better way? We could try another LLM, try different prompts, and of course improve the text extraction from the PDF, we've mentioned this. Evaluate: we need a way to evaluate our answers. Right now we've only evaluated them by visually going through them, but you'd probably want to look into how you could create a good resource of example answers and then compare those to your model's outputs. Then we could start to use a vector database or index for a larger setup: we actually saw that our current approach would work for probably a million or so embeddings, so if you have a lot more than that, you probably want a vector database or index. And then there are libraries and frameworks such as LangChain and LlamaIndex that can help do many of the steps we've gone through, but I wanted, on purpose, to build a RAG pipeline from scratch, so now that we know the steps that go into it, we can start to use higher-level frameworks such as LangChain or LlamaIndex that will do those steps for us; we've built a whole RAG pipeline from scratch here that runs locally. And then finally, optimizations for speed, and there are a bunch more of these coming out (actually, not finally, I've got a couple more things after this): if you want to speed up your generation, which is likely always going to be the case, the faster the generation the better the user experience, see these resources, and if you'd like a video on increasing generation speed, please let me know and I can look into that. We could also stream the text output, which would look a lot prettier. Right now, if we just run our model again on "water soluble vitamins", we have to wait for the whole output to come out... so it comes out eventually... oh, there we go, that's actually incorrect: those are the fat soluble vitamins. I think our LLM, Gemma 7B, is taking too much input from our examples here, so maybe we have to modify that in our prompt. I'd say give that a try; that's another thing we could improve on with different prompting techniques.
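Before moving on, as promised, here's a rough sketch of one way you could guard against off-topic queries (the chainsaw problem). We didn't build this in the video; it just checks whether the best similarity score clears a threshold before answering, and the threshold value is a pure assumption you'd tune on real on-topic and off-topic queries:

```python
# Hypothetical guard: refuse to answer when nothing in the textbook is similar
# enough to the query. Assumes retrieve_relevant_resources returns similarity scores.
SIMILARITY_THRESHOLD = 0.3  # assumption, tune this on queries you know are on/off topic

def ask_with_guard(query: str, **kwargs):
    scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings)
    if scores.max().item() < SIMILARITY_THRESHOLD:
        return "Sorry, I couldn't find anything relevant to that in the textbook."
    return ask(query, **kwargs)

print(ask_with_guard("Where can I buy a chainsaw?"))
```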
Okay, back to streaming: we want our output here to stream, so we see it come out token by token; this is what ChatGPT does. So if we go "water soluble vitamins"... there we go, see how it's streaming out as it generates? That looks really cool, so that would be another extension we could try, getting the output to stream. And then finally, a really cool one would be to turn the workflow into an app; Gradio is excellent for this. With Gradio, rather than running in a notebook, we could have an interface: a text box and a submit button, our system does the retrieval augmented generation, and the generated output comes back here (I'll leave a minimal sketch of this below if you want a starting point). So let me know if you try any of these extensions, I'd love to see them, or if you have any other questions, please post them in the discussions or in the YouTube comments, and that way we can all learn together. And if you'd like to see more topics on RAG, such as improved generation speed, or other videos, leave a comment below or in the GitHub discussions and we can talk it out there. But otherwise, we've built a RAG pipeline from scratch, all running on our local GPU, which is super, super exciting. I will see you in the next video.
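If you want a starting point for that Gradio extension, here's a minimal sketch (not something we built in the video) that wraps the `ask` function in a simple interface. It assumes Gradio is installed (`pip install gradio`) and that `ask` is defined as above:

```python
import gradio as gr

def rag_answer(query: str) -> str:
    # Run the full retrieve -> augment -> generate pipeline and return just the answer.
    return ask(query, temperature=0.2, return_answer_only=True)

demo = gr.Interface(
    fn=rag_answer,
    inputs=gr.Textbox(label="Ask the nutrition textbook a question"),
    outputs=gr.Textbox(label="Generated answer"),
    title="Local RAG demo",
)

demo.launch()
```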