The 5 Levels Of Text Splitting For Retrieval

Captions
one of the most effective strategies to improve the performance of your language model applications is to split your large data into smaller chunks the goal is to give the language model only the information that it needs for your task and nothing more this practice is the Art and Science of text splitting it is one of the first and most foundational decisions a language model practitioner will need to make text splitting takes a minute to learn but in this video you're going to learn the five levels of text splitting that squeeze out more performance from your language model applications using the same data that you already have now there's something for everyone in this video for the beginners we're going to start from the very Basics and for the advanced STS I'm going to give you plenty that you're going to want to argue with me on but either way I guarantee you're going to learn something along the way this is going to be a longer video and we're going to cover a lot but that's on purpose I want to take our time and I guarantee that if you make it to the end you're going to have a solid grasp on chunking Theory strategies and resources to go learn more for those that are just joining us my name is Greg and I'm exploring the AI space through the lens of business value you see models are cool stats are cool but I want to find out how businesses will actually be taking advantage of AI and language models this video will be split up into six different sections first we're going to talk about Theory we'll talk about what splitting and chunking are why we need them and why they're important I even made a cool tool called chunk vi.com to help us visualize along the way then we're going to jump into the five levels of text splitting for each level we're going to progressively get more complex and introduce topics along the way for you to consider when you're building your own language model applications for level one we're going to talk about character splitting this is when you split your documents by a static character limit for level two we're going to talk about recursive character text splitting this is when you start with your long document and then recursively go through it and split it by a different list of separators for level three we're going to talk about document specific text splitting so if you have python Docs or JavaScript docs or maybe PDFs with images we're going to include multimodal in this level as well for level four this is where it gets interesting we're going to talk about semantic splitting so the first three levels were all naive ways of splitting these levels focused on the physical positioning and structure of the text chunks these first three levels it's a bit like sorting a library based off the book sizes and shelf space rather than the actual content of the books but in level four here we're not just going to look at where the text sits or its structure instead we're going to start to delve into the what and the why of the text the actual meaning and context of these chunks it's like understanding and categorizing the books by their genre and themes instead and then with level five we're going to talk about a gentic splitting so we're going to look at an experimental method where you actually build an agent-like system that's going to review our text and split it for us and then to finish it off we're going to end with some dessert a bonus level that shows the advanced tactics that start to creep a little Beyond Tech splitting but are going to be important 
for your overall knowledge about how to do retrieval in general my goal isn't to prescribe the best or most powerful method you'll see why that's actually not possible my goal is to expose you to the different strategies and considerations of splitting your own data so you're able to make a more informed decision when you're building this is part of a larger series on retrieval I ofo and if you want to check out more or get the code for this content head over to fullstack retrieval.com and I can go send to lastly I do a lot of workshops with individuals and teams if you your team or your company want to chat live or do a custom Workshop just feel free to reach out so without further Ado let's jump into it first we're going to start off with a theory behind text splitting what is it and why do we even need to do it in the first place you see well applications are better when you give it your own data or maybe your user's data but you can't pass unlimited data to your language model and there's two main reasons for this number one applications have a context limit this is an upper bound on the amount of data that you can actually give to a language model you can see the context windows on open AI websites for their own models and number two language models do better when you increase the signal to noise ratio let's see what Anton co-founder of chroma has to say about this distracting information in the model's context window does tend to measurably destroy the performance of the overall application so instead of giving your language model the kitchen sink and hoping the language model can figure it out you want to prune the fluff from your data whenever possible Now text splitting or chunking is the process of splitting your data into smaller pieces so you can make it optimal for your task and your language model now I really want to emphasize this point the whole goal of splitting your text is to best prepare it for the task that you actually have at hand so rather than starting with hey how should I chunk my data your question should really be what's the optimal way for me to pass the data that my language model needs for my task our goal is not to chunk just for chunking sake our goal is to get the data in a format where it can be retrieved for Value later so let's talk about retrieval in general so in the bigger picture the act of gathering the right information for your language models is called retrieval this is the orchestration of tools and techniques to surface up what your language model actually needs to complete its task let's take a look at where chunking fits into the retrieval process so here we're taking a look at the full stack retrieval process we have everything from your raw data sources to your response here if you want an overview about this entire process head over to fullstack retrieval.com where I do a separate tutorial on this now the important part is we're all going to have our raw data sources down at the bottom here and they eventually need to make it into our knowledge base right however we can't just put our raw data sources we're going to need to chunk them which is what this video is about now right when you do your data loading this is where your chunking strategy is going to come into play how you choose to split up your documents is a very important decision as you go through this you'll see that there isn't one right way to do your chunking strategy or really your retrieval strategy for that matter for example take a look at this tweet from Robert hir 
for those that need a translation, Robert is basically saying that he employs many alternative strategies across his retrieval stack. What works for him may not work for you. The last thing I'll comment on is the topic of evaluations. Evaluations are super important when you're developing your language model applications; you won't know if your performance is improving without rigorous testing. One of the most popular retrieval evaluation frameworks out there is RAGAS, and I encourage you to go check it out. I won't be covering evals today because they're more of a retrieval topic rather than the narrow niche we're covering, plus they're very domain and application specific. If you want my take on evals, head over to fullstackretrieval.com and you'll get a notice when I start to cover them. All right, that's enough talking for now, I finally want to get into some code, so let's move on to level one: character splitting. Before we jump into that, I want to talk about the chunking commandment: your goal is not to chunk for chunking's sake, your goal is to get your data in a format where it can be retrieved for value later. I'm placing so much emphasis on this point because it doesn't matter what your chunking strategy is if it doesn't serve your downstream task. Keep that in mind as we keep going. So, level one, character splitting. This is the most basic form of splitting, and it's when you chunk up your text by a fixed, static character length. Let's talk about what that means. First the pros: it's extremely simple and easy. The cons: it's very rigid and doesn't take into account the structure of your text, and to be honest I don't know anybody that does this in production (let me know if you do, because I'm curious). The two concepts I want to talk about here are chunk size and chunk overlap, and we'll use examples to explain them. Our text: "This is the text I would like to chunk up. It is an example text for this exercise." Cool, we got that. Before we talk about packages that do this automatically, I want to show you how to do it manually first, just so you can appreciate the nuances of how cool some of this stuff is. In order to create our chunks, I'm going to first create an empty list of chunks. My chunk size is going to be 35, which stands for 35 characters, so I count 35 characters in and call that chunk one, the next 35 characters are chunk two, and so on. I create a range whose length is the length of the text up above, and the iteration step is the chunk size, so we skip ahead every 35 characters; I grab each chunk, I append it, and let's see what our chunks are: "This is the text I would like to ch", "unk up. It is an example text for...". First off, congratulations, you just did your first chunking exercise. Do you feel like a language model practitioner yet? I sure do. Let's keep going, because there are a couple of problems with this. "This is the text I would like to ch" — well, it's stuck in the middle of a word, and that's no good. And how are we supposed to know how good this 35-character length is? We could change it to 40, and then all of a sudden the split goes a little bit further, but dang, the same thing happens one more time, and that's giving us a hard time. That's not too good. So yes, it's quick, and yes, it's simple, but we need to fix this problem.
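For reference, the manual loop we just walked through is only a few lines of plain Python. This is a minimal sketch, assuming the example string above:

```python
text = "This is the text I would like to chunk up. It is an example text for this exercise"

chunks = []
chunk_size = 35  # number of characters per chunk

# Step through the string 35 characters at a time and collect each slice
for i in range(0, len(text), chunk_size):
    chunks.append(text[i:i + chunk_size])

print(chunks)
# Note how the first chunk ends mid-word: '...I would like to ch'
```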
Now, before we move on to level two, I want to talk about LangChain's character splitter. Their character splitter is going to do the exact same thing for us, but it's going to be a LangChain one-liner. The way you do that is you initialize a CharacterTextSplitter, you tell it how much of a chunk size you want, you tell it how much of a chunk overlap you want (we'll talk about that in a second), and when you pass a blank or empty string as the separator — we'll talk about that in a second too — that means it's just going to split by character. By default LangChain will strip the whitespace, meaning it removes the spaces on the ends of your chunks. I don't want it to do that quite yet, so I'm going to set strip_whitespace to false. So we've just made our character splitter; now we're actually going to go and split the documents, and I'm going to call create_documents. This create_documents function expects a list, and because our string up above is just a plain old string, I need to wrap it in a list right here. Let's go do that with the text splitter. What we get returned is three chunks, however they look a little bit different than plain strings. The reason why is that they're actually Document objects. In LangChain, a Document is an object that holds a string, but it can also hold metadata, which is important for us to understand when we start doing more advanced techniques. So don't get scared: documents still have our string, it's just held within page_content. "This is the text I would like to ch..." — same thing that we had up above. Cool, that makes sense. So let's talk about overlaps and separators. I'm going to make the character splitter again, it's going to be 35 characters, but this time we're going to have a chunk overlap of 4. What this means is that the tail end of chunk number one is going to overlap a little bit with the head, or the beginning, of chunk number two, and the overlap is going to be 4 characters, so the last four characters of chunk one will be the same as the first four characters of chunk two. Let's see what this looks like. Again, I'm just going to make these, and what we have is "This is the text that I would like to" — we still have the same first split, the chunk size is the same, but check this out: now the first four characters of chunk number two are the same four characters as the end of chunk number one, because we have that chunk overlap. Now, when I was getting ready for this exercise I thought I remembered a tool that visually showed you different chunking techniques with highlights of the different chunks, but I couldn't find it, so I ended up making one, and that tool is called chunkviz.com. This is just a quick snippet of it, but I want to show you while we're on the topic: chunk number one was this first beginning part, then we have the overlap, then chunk number two, then the overlap, and then chunk number three. If you want to try this out for yourself, you can go to chunkviz.com and you'll get a tool where you can input different text and play with the different chunk sizes you get out. So let's go back up and grab the text that we had; I'm going to bring it over here and just replace this, and you can see that right now our chunk size is 1 with no overlap, which doesn't make any sense.
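Before going further with the visualization, here's roughly what the LangChain one-liner just described looks like in code — a sketch assuming a recent langchain release where strip_whitespace is exposed as a keyword argument:

```python
from langchain.text_splitter import CharacterTextSplitter

text = "This is the text I would like to chunk up. It is an example text for this exercise"

# Split purely by character count: empty separator, 35-char chunks, 4-char overlap,
# and keep trailing whitespace so the raw splits are easier to inspect
text_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=35,
    chunk_overlap=4,
    strip_whitespace=False,
)

documents = text_splitter.create_documents([text])
for doc in documents:
    print(repr(doc.page_content))
```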
Back to the visualization: although a chunk size of 1 is visually cool, it won't do any good for us, because we have 83 chunks here — what are you going to do with 83 one-character chunks? Not much. But as you start to increase this number, you can see that the chunks start to get bigger. So we'll take this all the way up, and as we go through I'm going to put it at what we had before, which is 35. So "this is the text that I would like to" — and you see it ends right in the middle of the "ch", just like we had beforehand. Let me zoom in just a little bit more: it ends right at the "ch" like before. Now, if I start to introduce overlap, you can see that we get a little bit of an overlapping section — chunk one still ends here, but now there's the overlap that comes with it. What's cool is you can play with this yourself. Above a certain mark, the chunk size is bigger than the document we have, so it just encapsulates everything. You can go to chunkviz.com and play around with this; we'll take a look at one more of these sections in a minute. Cool, so those are chunk sizes and overlaps. The next thing I want to talk about is separators. Beforehand we just had a blank string as a separator, which means you split by character. However, if we specify another separator — in this case I'm going to use "ch" — well, let's see what that gives us: "this is the text that I would like to", and you see here that the "ch" is missing, and the space is missing too, because I took out the strip_whitespace=False setting, so it's back to the default of true. You can see that this word is supposed to be "chunk up", but it's not anymore, because we removed the "ch". It's not too helpful when we split on "ch"; you could do the letter "e" if you wanted and it would be a little bit different, but either way, unless you know exactly what you're doing, I wouldn't suggest messing around with the separator to try to get better results here. All right, that's the LangChain side of the house; the next one I want to show you is the LlamaIndex side of the house. They have what they call a SentenceSplitter, and I'm also going to use their SimpleDirectoryReader, because this time, instead of just using a static string that I put in the code, we're actually going to load some essays from a directory. I'm going to make the SentenceSplitter, and this time I'm going to have a chunk size of 200 and a chunk overlap of 15, and then I'm going to load up an essay. For my input files I'm just going to load one essay; this data is also in the repo, so if you go and clone the repo you can get it pretty easily. This is going to be a Paul Graham essay — his MIT essay — and we can go check it out and read it, I have it loaded up for you. So let's go ahead and load this. Now we have our documents, but let me show you: if we check how long this is, it's just one big long document, because the entire essay was loaded into this one variable. However, we want to chunk it up, and the way we're going to chunk it up is with our splitter: we're going to call get_nodes_from_documents.
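A sketch of that LlamaIndex flow is below. The import paths assume a recent llama-index release (on older versions these classes live directly under llama_index), and the essay path is a placeholder for wherever you saved the Paul Graham MIT essay from the repo:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# 200-character chunks with a 15-character overlap
splitter = SentenceSplitter(chunk_size=200, chunk_overlap=15)

# Hypothetical path -- point this at the essay file in your copy of the repo
documents = SimpleDirectoryReader(input_files=["data/PGEssays/mit.txt"]).load_data()

nodes = splitter.get_nodes_from_documents(documents)
print(len(nodes))
print(nodes[0].metadata)
print(nodes[0].relationships.keys())
```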
Now you may be asking, hey Greg, wait — what's a node? Well, a node is LlamaIndex's nomenclature for a chunk, or a subsection of a document, and that's what we're going to get here. So now that we have our nodes, I want to take a look at one. As you can see, this node is quite long based off the amount of text that's in there, but there's some really cool information that comes out of the box. First of all, this node has an ID, and we can see that it's a TextNode, because they delineate between node types — so we have a node ID that we can use later and rely on for uniqueness. Then we also have some metadata, so we can tell where it came from, the last modified date, etc. But one of the other parts I like a lot is node relationships. Here we have a relationships key, and we can take a look at the other relationships this node has: we can see the source node it came from, but we can also take a look at the next node — which node actually comes next — and this is really helpful when you start doing some traversing across your documents. Either way, I won't go too far into that one. Well, congratulations, we just finished level number one. Let's head off to level number two: recursive character text splitting. You'll notice that in level one we split by a static chunk size each time — 35 characters by 35 characters. However, there are other chunking mechanisms that will actually look at the physical structure of your text and infer what kind of chunk sizes you should have. So instead of specifying 35 characters, you can say, give me every new line, or give me every double new line, and that's what the recursive character text splitter does. What it's going to do is it has a series of separators, and it recursively goes through your documents: it starts with its first separator and first chunks up by every double new line you have; for any chunks that are still too large after that first pass, it goes to its next separator, which is single new lines, then it goes to spaces, and then it goes to individual characters. So now I don't need to specify 35 characters or 200; I can just pass it my text and it will infer what the structure should be. Now, the cool part about this one is, if you think about how you write text, you're probably going to separate your ideas by paragraphs, and those paragraphs are separated by double new lines. This method takes advantage of that fact, so we can start to be smart about which separators we use and take advantage of how humans naturally write text. All right, so let's check this out. From LangChain's text splitters we're going to use the RecursiveCharacterTextSplitter. Again I'm going to take some text, but this one's going to be a little bit longer than what we had before, and let's put it through our recursive character text splitter with a 65-character limit. We pass it through, and all of a sudden you can see we get a whole bunch of different chunks. I'm not even sure how many — let's check how many we actually have — we have 16 different chunks.
Looking at the chunk boundaries — for example one chunk ends with "one of the most important things I didn't understand about the" — what's cool is that we're ending on words quite often, and that's because words have spaces in between them, and a space is one of the separators that gets tried. So this is cool: we're not splitting in the middle of words anymore. However, we are still splitting in the middle of sentences, and that's not so good, not so fun. One of the ways we can combat that is to increase the chunk size, because our hypothesis is that if we increase the chunk size, we can start to take advantage of the paragraph splits a little bit more. So now what I'm going to do is increase the chunk size to 450, still with a chunk overlap of zero, and let's see what we have: "One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear." Cool — so there's a period here, here's a period, and here's the end of it. Now what's interesting is, if I scroll down to this viz again with the same exact string, those are all three different paragraph breaks. Hm, that's pretty interesting. One of the important things to note here is look how these paragraphs are different lengths. If I used level one with a 35-character split, or any character split, I'd start to cut in the middle of them, but now I can get these paragraphs grouped together, and the hypothesis behind this method is that these paragraphs hold semantically similar information that should be kept together. So the recursive character text splitter is pretty awesome. Let's take a look at what this looks like on chunkviz.com. I'm going to copy this, go back to chunkviz.com, and put this text in there. You can see that if we did the character splitter at 35 characters, we'd have 26 different chunks — we're chunking all over the place, and it doesn't make any sense. But what I'm going to do is scroll down and select the recursive character text splitter. I still have a chunk size of 35, but we're going to increase it. The first thing I want to show you is that as I go between 35 and 36, the first chunk here doesn't switch sizes. That's because it's looking for the space to actually split on, and because there's a space here, it snaps to the nearest word. This is why it's so cool. So I'll keep increasing this and see when it finally does split — and there it goes, it just jumped up to "a degree". Either way, let me save you some time here: I'm going to select the chunk size we wanted, maybe around 450 (I forget what it was), and let me zoom out just a little bit. Look at that — now we're splitting along all three different paragraphs, and we can even increase the size a bit and it doesn't really do much for us. But all of a sudden, if I go too big, well, it's going to chunk the first two paragraphs together, because 493 is around the size of those two combined. Anyway, you can go split this again, and you can get so big that it finally takes over the third paragraph too. All right, that's it for level two, congratulations. If I'm starting a project, the recursive character text splitter is my go-to splitter each time: the ROI for your energy to split up your docs is pretty awesome, it's a one-liner, it goes really quick, and there's no extra processing needed. So if you're looking for a go-to place to start, I recommend level two, the recursive character text splitter.
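Here's roughly what that go-to one-liner looks like, as a sketch where `text` is assumed to hold the multi-paragraph excerpt used above:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Default separators are tried in order: "\n\n", "\n", " ", "" --
# paragraphs first, then lines, then words, then individual characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=0)

docs = text_splitter.create_documents([text])
for doc in docs:
    print(len(doc.page_content), doc.page_content[:60])
```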
Let's move on to level three: document specific splitting. Up until now we've been splitting just regular old prose — we've had some static strings and some Paul Graham essays — but what if you have markdown, what if you have Python docs, what if you have JavaScript docs? There's probably a better way to split those, because we can infer more about the document structure from the special characters within those documents; when you have code, you have code formatting, and we can take advantage of that. The first one I want to look at here is markdown. We still have something that's like the recursive character text splitter, but we have a lot more separators now. The reason this is so cool is, take a look at this first separator: it's a new line and then a pound symbol, which indicates a heading within markdown, and this regex here means one pound symbol repeated between one and six times — so a new line followed by a header, H1 through H6. Why would we do this? Well, headers usually denote what you're going to be talking about, so this is a cool way to try to group similar items together. These are the LangChain splitters; if you have your own different package you might see other splitters, but if you want to see the LangChain side of the house you can head over to their GitHub, and in libs/langchain/langchain/text_splitter.py you can see that they have a markdown language and the separators they actually end up using. All right, so let's go ahead and load up our LangChain markdown splitter. I'm going to do a chunk size of 40, which again is really, really small — my first go-to for chunk sizes is usually anywhere between 2,000 and 6,000 characters, and as context lengths for language models get longer and their performance with long context gets better, you're going to start to increase this a whole lot, because the model can infer what you want. So let's go ahead and do that. Here's some markdown text: "Fun in California", an H2 of "Driving", blah blah blah. Cool, let's split these up, and we can see that the first document is "Fun in California" and "Driving" — what's cool is that it split on these headers: split on a header here, etc., split on a header there, which is nice. That's markdown. You can do this also for Python, but instead of using the markdown splitters you're going to want your own Python splitters. Again, LangChain is going to split on classes, functions, indented functions (these might be methods within your class), and then double new lines, new lines, spaces, and characters. They have a PythonCodeTextSplitter — let's go ahead and run this with a chunk size of 100. We scroll and we can see that this whole class is enveloped within one document, which is cool because that's what we'd want, but then we have "p1 = Person('John')" and that stuff in its own chunk, and we have the ranges right here. I put this over on chunkviz.com too, and you can go ahead and throw this in there: go to the Python splitter, and let's bump this up to 100, since we had it at 100 right here. (In LangChain I couldn't figure out how to turn off strip_whitespace within the JavaScript version I used to make this tool, which is why I'm a little hesitant to show it.) But either way, you can see that we're splitting right there, which is what we'd want. Nice.
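A compact sketch of the markdown and Python splitters just described is below; the sample strings are illustrative stand-ins, not the exact text from the walkthrough:

```python
from langchain.text_splitter import MarkdownTextSplitter, PythonCodeTextSplitter

markdown_text = """
# Fun in California

## Driving

Try driving down Highway 1 to San Diego.

## Food

Make sure to eat a burrito while you're there.
"""

python_text = """
class Person:
    def __init__(self, name):
        self.name = name

p1 = Person("John")
"""

# Splits on "\n#{1,6} " style headers first, then falls back to smaller separators
md_splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)
print(md_splitter.create_documents([markdown_text]))

# Splits on classes and function definitions before falling back to lines/words
py_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
print(py_splitter.create_documents([python_text]))
```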
All right, same thing for JavaScript: we also have a bunch of different separators here. This is very similar, but in this case we're going to use our recursive character text splitter again and specify which language we want it to split by. So here's our text, RecursiveCharacterTextSplitter.from_language this time, and we're going to pass language = Language.JS with a chunk size of 65. Let's run it, and all of a sudden we've split up our JavaScript code as well. Cool. Those are all strings, and those are all pretty easy, because they might just be sitting in .txt files, which is simple enough to work with. However, what if you have PDFs? Everyone loves talking about PDFs, and they especially love talking about pulling tables from PDFs, because there are a lot of old-school industries that still put information inside a PDF. So when it comes to chunking, you're not only going to split text — you want to pull out all the different elements within your documents, and some of those might be tables, pictures, or graphs. Let's take a look at how we do this. The way I'm going to do it is I'm going to load up a PDF right here, and this is just going to be a Salesforce financial PDF — I went over and pulled one of their random PDFs so I could have a table. And I'm going to do it via unstructured. Unstructured is another library; you can go check them out at unstructured.io, "get your data LLM ready". They have some really cool parsers that you can use, which is going to be advantageous when you start to get more complicated data types or your data gets a whole lot messier. For example, if you had a million PDFs that you somehow needed to get into a structured form, unstructured would be who you'd want to work with (not a sponsored mention, at all). So we're going to load up two of their helpers: partition_pdf and elements_to_json. We've got our PDF, and if we take a look at this Salesforce PDF, you can see we have a few kind of notes, then some paragraphs, but then we have this table, and this is where I'm going to place more emphasis. We call partition_pdf, give it the file name, and give it some unstructured settings — just a little bit of config for them. If we load it up and look at what elements it actually found, we find a whole bunch: it looks like mostly narrative text, which is your regular text, but then we have this table, and that's what I want to double-click on. So I'm just going to go grab the fourth-from-last element, which is this table right here — 1, 2, 3, 4 — and let's look at what the HTML looks like. Here we have the HTML, and you might be saying, well Greg, why is the HTML important, why wouldn't you just read it like a table? Well, tables are easy for us to read, but they're not so easy for the language model to read in their raw layout. However, the language model has been trained on HTML tables — not only HTML but also markdown — so in this case I want to pull out the HTML table, because the language model is going to be able to make more sense of it than the raw layout. So when I pass my data to the language model, I'm going to pass the HTML, or you can pass markdown, whatever works for you.
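Below is a sketch of the unstructured calls being described. The filename is a placeholder, and keyword arguments can vary a bit between unstructured releases, but the idea is: partition the PDF with table-structure inference turned on, then read the HTML rendering off each Table element:

```python
from unstructured.partition.pdf import partition_pdf

# Placeholder path -- any PDF containing a table will do
filename = "data/SalesforceFinancial.pdf"

# "hi_res" + infer_table_structure asks unstructured to keep the table layout,
# so each Table element carries an HTML rendering in its metadata
elements = partition_pdf(
    filename=filename,
    strategy="hi_res",
    infer_table_structure=True,
)

tables = [el for el in elements if el.category == "Table"]
print(tables[0].metadata.text_as_html)
```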
And if you want to see what this actually looks like, you can head over to an HTML viewer and see what unstructured actually pulled out, which is pretty cool. Nice, so that's how you do tables within PDFs. But now let's say you have images within PDFs, or maybe you have images elsewhere — how are you going to take advantage of those, how are you going to extract them? Let's take a look at how you do that, and I'm going to use unstructured once more. This time I'm going to use their partition_pdf, same as last time, and here I have a visual fine-tuning paper you can go check out — let's go look at the arXiv page, and I'm going to download the PDF just so you can see it, because there's this wacky photo up at the front. So here's the same one; I'm going to load up that paper and get partition_pdf ready, and this time I'm going to set extract_images_in_pdf equal to true. So now it's going to extract all the parts for me — which is the chunking process — but it's going to treat the images separately, which is nice. Also infer_table_structure, and so on, and an image output directory path; the extracted images are in this repo, so you don't have to re-extract them if you don't want to. Either way, I'm going to load those up, and I don't want to make you wait — this does take a little bit of time — so let me come back to you... one minute later. Awesome, that just finished loading for us, so let's take a look. I know it's kind of small on the screen, but it found what looks like 15 or 16 different images, and those were all extracted from the PDF that I supplied. Now, the interesting thing about this is: how are we going to make those images useful? We would need to take an embedding of them, because we're probably going to do semantic search later. However, embedding models usually don't cross paths between text and images, meaning there are embedding models for images and there are embedding models for text, but generally their vector spaces aren't going to line up, and if you don't have the same model for each one, doing similarity search between the two may give you a hard time. Yes, I know, for all the perfectionists out there, there is something called the CLIP model which will do embeddings for both images and text so you can take advantage of them; however, the tech is still not quite there, and I haven't found the same performance with CLIP that I have with other approaches, so I'm going to show you a different method here. I will note that in the future, when you're watching this, there may be really good models that do both, and if that's true, I would take advantage of those instead of the method I'm showing you. But either way, let's take a look. What I actually want to do is generate a text summary of each image, and then take an embedding of that text summary. Now when I do semantic search, maybe the text summary gets returned for me; if so, I can pass that image to a multimodal LLM, or I can just use the text summary on its own to answer my question or do my task. Cool. The way we're going to do this is we're actually going to use LangChain. We're going to load up ChatOpenAI, and we're going to use the gpt-4-vision-preview model. So I'm going to load that up, and I made a quick function here that's just going to convert the physical file on my local machine to base64, which we can then
go and pass to the language model so string to base 64 now we have this image string let's go take a look at what this looks like just looks like a bunch of gobbly go which that doesn't mean much to me but it will mean something to open Ai and I'm glad that it does all right let's close this let's get that out of there so what I'm going to do is I'm going to use the GPT for vision one again and we're GNA construct a human message this just means it acts as if it's coming from the human content type please give me a summary of the image provided be descriptive and then we're going to pass it an image URL and here we are the URL is we're going to pass in our image U base 64 we had there let's go ahead and pass that over and let's see what openi thinks the image actually is I haven't shown you what the image is but let's look at the summary the image shows a baking tray with pieces of food like a cookies or some baked goods arranged Loosely to resemble the continents on earth as seen from space hm what but do you know there you go yeah that makes sense so now when I do my retrieval process I can either just use this text in lie of the picture if I don't want to work with a multimodel llm or I can do semantic search have this summary get returned and then pass this image over to the language model the llm all right so that seems about right so what I've done in level three here is emphasizing that your chunking strategy really depends on your data types so in this case I was pretty explicit about that and I showed you what it would mean for Python and JavaScript and if you have images but in your industry in your vertical you may have different data formats and you'll want to pick a chunking strategy that is going to adapt to those data formats because remember the ultimate goal is that you want to group similar items together so that you can get them ready and prepared for your language model task in the end now you'll see there that I even made an assumption that you want to group similar items together I'm just saying that because generally you're doing question and answer and generally you want to combine similar items together for context to answer a question however if you're not doing that maybe for some reason you want to combine opposite items together in in which case your trunking strategy be a lot different I don't know of anybody who actually do that but let's get back to the tutorial here all right so now we're moving on to level four semantic chunking now the interesting part about levels 1 through three here is we all took physical positioning into account doesn't it seem kind of weird that we would split up a document with the intention of grouping similar items together we just assume that paragraphs have similar information in there what if they don't what if we have really messy information and doing recursive character text splitting doesn't really do anything for us you know I saw this tweet from lonus we can go and take a look at this one he says weird idea chunk size when doing this when doing retrieval augments generation is an knowing hyper pram and feels naive to turn it into a global constant value I totally agree now he recommends could we train an end to-end chunking model I didn't want to go quite that far because I think there's a little easier step that we could try beforehand and I wanted to do an exploration but now what I'm going to do is I'm going to do an embedding based chunking method it's a little bit more expensive and it's definitely more work and 
it's definitely slower than what we talked about for the first three but it starts to take the meaning and the content of the text into account to make our chunks the analogy I looked up beforehand is imagine the first three levels that's like having a bunch of books and putting them on a bookshelf depending on their size right and the bookshelf size but what if you want to group the books together by genre or by theme or by author well then you actually need to know what the books are about and that's what we're going to try doing level four here all right so when I thought about level four what I wanted to do was obviously semantic chunking and I chose an embedding based way to do this so what I wanted to do was I wanted to take embeddings at certain positions of our document and then I wanted to compare those embeddings together right so if two embeddings are close to each other distance-wise well maybe they're talking about the same thing that's the assumption that we're going to make if they're further from each other that means that they're maybe not talking about the same thing right so what I imagine is I we'd have a big long essay and then with those I'd take an embedding of every single sentence that we have right and then I want to compare those embeddings together now the comparing the embeddings that's going to be the important part and where all the magic is going to be for this and I did two different methods that I wanted to share with you the first one is I did hierarchical that's a mouthful hierarchical clustering with positional reward so my first thought is well you know let's just do a clustering algorithm and let's see which embeddings are clustered together and then let's assume that those are the chunks that we're going to have but one thing I wanted to do was take into account short sentences that appear after a long sentence you know just like that I wanted the you know to be included with that long sentence because it's likely needs to be relevant with it and so I added a little bit of a positional reward so hierarchal FAL clustering generally is just going to be based off of distance but I added a a little uh extra sauce to it and did some positional reward all right this one was okay but it was kind of messy to uh to work with and it wasn't as logical as I wanted it to be and I couldn't really tune this um intuitively like I wanted so I wanted to find something just a little bit easier as an exploration for me so the what I did was is the next method was to find break points between sequential sentences so I got embedding number one of sentence number one and I compared that to sentence number two's embedding and I measured the distance between them and then I got two compared it to three and then three compared it to four and so on and so forth um I do a visual with this and so I guarantee it's going to make more sense in a second we're going to use Paul Graham's essay and what I'm going to do is I'm first going to split all of my different sentences and I'm going to do that just via some rejects with a period a question mark or an explanation point there's likely a lot of better ways to do this don't come at me with that but either way we have 3177 different sentences in this Paul gram essay all right so what I want to do is I want to start adding more information to each one of these sentences so it's like I have a lang chain document but I'm just going to do my own to show you how we're going to do this so instead of having a list of sentences I want to 
have a list of dictionaries of which the sentence is placed in it I'm going to add in the index just CU it's fun why not and let's take a look at these first three after I do that now I have a list of dictionaries and one of the keys is sentence and now we have these different sentences up here CU if I were to go here let me just show you what this looks like the single sentence list I want to do this with this the first three again it's just a list of strings now these list of strings are list of dictionaries all right cool well now what I want to do is I actually want to do some combining of the sentences like I said if I just did sentence one compared to sentence two compared to sentence three it was a little noisy it was kind of all over the place and it didn't tell me much I thought you know what if I combined the sentences so there's a little less movement from each one cuz now what I want to do instead of comparing one to two comparing to three comparing to four Etc I'm going to compare the embedding of sentence 1 2 and 3 combined with sentence 2 3 and four combined then compare that with sentence 3 four and five combined so it's a little bit more of a group I did a just a small little function you could take advantage of here I have a buffer size of one means one sentence before and one sentence afterwards you can do whatever you want go and switch around with this and go play with it I won't go through this code but I've commented it so you can follow along if you want now let's take a look at what that does so here we have our original sentence but now we have our combined sentence all right and this combined sentence is going to be what comes uh before and after it because this is the first one there's nothing before it's only after so um get funded by Y combinator is the sentence of number two nice so we have a combined sentence here which is want to start or startup that's what sentence number one is and then uh something in the grad school and that's what sentence number uh three is cool now that we have those what I want to do is I want to get an embedding of the grouped sentences of this combined sentence key so I'm going to use open Ai embeddings and let's go through this and I'm going to get all the embeddings um which is basically get the combined sentence for X in each one of those sentences and this is going to be we're going to go grab all those which is really nice we have our embeddings now I need to put those embeddings with its proper list all right so sentence uh with the now I'm going to make a new key the combined sentence embeddings and I'm just going to go through and add those and let's go take a look at what that looks like now CU of course this is fun and I like doing this in an iterative nature so we can take one step together at a time all right so sentence want to start a startup here's our combined sentence embedding so now we have this embedding for what's up here all right cool um well now what I want to do is I want to add one more metric to it and I know we keep on going here but hopefully you're still following along I want to add the distance between the first uh sentence and the second group of sentences I want to add that to the first sentence so I can see how big is the jump with the next one all right so what we're going to do is we're going to get the embedding of the current thing we're going to get the embedding of the second thing the second group that it comes with we're going to get the distance we're going to append the distances 
because we're going to do something with this later. And then we're going to get a distance_to_next: how far is the distance between the current embedding and the next one. Let's go ahead and run this, and now that's been added to our sentences, and we have our distances here too, so let's just take a look at the first three distances. Awesome — so the first one is 0.08, which means sentence group number one is 0.08 away from group number two, and group number two is 0.02 away from group number three. Hm, that's kind of interesting: why is group one further from group two than group two is from group three? I don't know, but we're going to do something with this in a second. Let me show you what these sentences look like one more time, just because we're doing this iteratively — oh boy, I added a whole bunch here, there are too many, I should have just done the first three. Okay, we'll go through this; let's scroll all the way down to the bottom, we've got a long way to go. Okay, now we finally have distance_to_next, and because this is the first one, you can see that it's 0.08, which is what we just saw above. Cool, let's close that up. We have our distances here, but now, we're all data people, we're all having fun, we all want to see some visuals — I want to see some visuals, so let's do that. Any data scientists out there will laugh at this, because I've typed "import matplotlib.pyplot as plt" more in my life than I think I ever should have; that is absolute muscle memory for me at this point. All right, so now we plot our distances. Hm, cool. So this is our distance — it looks kind of random, doesn't it? Well, a little bit, but you can see that there's a bit of ebb and flow; over here there's a little bit more distance in between. What that means in English is that for some reason the chunks here are more dissimilar from each other than the chunks further down that are grouped together. But what's really interesting to me is that we have some outliers up at the top — you can see these points at the very top — and that tells me, hm, maybe there are good break points there, because two groups are so dissimilar that they should actually be chunked apart; they shouldn't be together, because if there's a long distance in their embedding space, maybe they're not talking about the same thing. All right, I want to show you this one more time, but let's iteratively build another visualization to further emphasize the point I'm trying to make. The first thing I'm going to do is just plot the distances — that's the exact same thing we had beforehand. The next thing I want to do is a little bit of formatting, don't hate me for this. First I'm going to set a y upper bound, meaning the upper y limit, because as you can see right here there's not enough cushion up top, it's visually too tight, and I want to fix that. Then we're going to set a y limit from zero to that upper bound, which controls how tall the y-axis is, and then an x limit for how wide the x-axis should be, because you see there are these buffers on the sides and I don't want that. Let's get rid of them — and as we go through this, we have more space up top and we got rid of the sides. All right, what's next? Let's see what we have here — I don't even know what's going on, and I wrote this code... I'm just kidding, I do.
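As a quick recap before the breakpoint logic, here's a condensed sketch of the distance computation built up above, assuming `essay_text` holds the Paul Graham essay loaded earlier; the sentence splitting and buffering follow the same idea as the notebook but are simplified here:

```python
import re
from langchain.embeddings import OpenAIEmbeddings
from sklearn.metrics.pairwise import cosine_similarity

# Naive sentence split on ., ?, ! -- same idea as the regex described above
sentences = re.split(r"(?<=[.?!])\s+", essay_text)

# buffer_size = 1: combine each sentence with one neighbor on each side
combined = [" ".join(sentences[max(i - 1, 0): i + 2]) for i in range(len(sentences))]

# Embed all of the combined sentence groups in one batch
embeddings = OpenAIEmbeddings().embed_documents(combined)

# Cosine distance between each group and the group that follows it
distances = [
    1 - cosine_similarity([embeddings[i]], [embeddings[i + 1]])[0][0]
    for i in range(len(embeddings) - 1)
]
print(distances[:3])
```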
Breakpoint percentile threshold: what we need to do is somehow identify which outliers we want to split on. Now, there are a million ways you could go about doing this, and I'm really excited to hear alternative methods that you may have in the back of your mind. For me, what I ended up doing was percentile-based: I wanted to identify the outliers, so using the different distances as a distribution, I wanted to find out which points are in the top 5%, because if a point is in the top 5%, it's probably an outlier for us. So I only want to take these upper distances right here, and the way I'm going to do that is I'm going to specify the percentile I want to take, then use numpy and call np.percentile, passing it my distribution of distances, with the breakpoint percentile threshold set to 95. Then I want to draw a line on the graph showing where that threshold is, so I'm going to draw a horizontal line right across, where the y value is the breakpoint distance threshold — a single number that says, here is the 95th percentile of all these distances. Let's take a look at what that looks like, and here we go: anything above this line is in the 95th percentile, and those will be my outliers, where I'm eventually going to make my chunks. So everything in right here, up until this one point, will be chunk one, then we have chunk two, then chunk three, chunk four, chunk five, and so on from there. Because — one last time, I know I keep repeating this, but I really want to hammer it home — the hypothesis is that if there's a big break point, then the text should be split up at that point, and that's where we're going to end up doing it. All right, then what we're going to do is see how many distances are actually above this line, so I want to get the number of distances above the threshold, meaning the number of break points, and then I'm going to use plt.
text this is just a fancy way to put some text on your visualization let me go ahead and do that and so we can see that we have 17 chunks I put that down in the corner right there H that's kind of interesting all right um then what we're going to do is we're going to get the indices of which points actually are above the breakpoint meaning which are the actual outliers because this break uh this break breakpoint distance threshold this is just a single static number but I need to get a list of numbers to find out where these break points and these chunks actually need to be met so for I IND distances um if x is above the breakpoint distance threshold so I'll get a BN a bunch of indices there that doesn't do anything different for us but then what we're going to do actually let me just do this because I think this actually would be helpful I'm going to look at these indices now what we have here these are 17 different chunks we can look at this it says 16 but that's cuz there's an extra one added to the front there should be a zero right here um or this could be the 317 at the end so what this means is between uh between sentence zero and sentence 23 we want to make our split and make our break all right because at this at number 23 it says Hey here the distance to the next one was quite big so we want to include number 23 on this one okay either way let's go from there let's go do some more uh let's do some fancy colors on here so what I want to do is I want to add some colors I just set my own custom ones right here then what I'm going to do is I'm going to go do uh a vertical span meaning you're gonna have a vertical shading in the background of your um the background of your graph here and you're going to do a start index and an end index the reason why I do a for Loop here is because you add them one at a time but the start index and the end n index will be what is in our indices above thresholds anyway's go through there and you can see here I cheated a little bit let's take away this text we can see that we have our different chunks right here so now we have chunk zero chunk one chunk two blah blah blah and go go all the way through there but of course we want some text on here to really make it more explicit and you can see here chunk blah blah blah blah blah uh this last one was giving me a hard time so I actually had to do just just a little bit of custom code for that one uh that splits it up uh that's just a little bit of a Band-Aid don't don't add me for that one either I was too lazy to figure that one out Okay cool so we go through there so here's all of our different chunks now this wouldn't be a good chart unless we put on some graphics or some titles as well so now we'll do a title A Y Lael and an X label and then there we go hm that's kind of cool uh Paul Graham essay chunks based off embedding breakpoints H that's pretty interesting but uh a good visualization doesn't do anything for us we can't really pass this to the language model and it's not going to know what to do with it so what we're going to do is we're going to actually get the sentences and actually combine them so here's a bunch of code here um I won't go through it too much but the the tldr is that you're going to append all these different pieces in your chunks so like I said beforehand you're going to go from chunk zero to chunk 23 and you're going to combine those and that'll be your first chunk let's go through there then let's take a look at what this actually looks like let's go through this and so we 
have the uh chunk number zero about a month of need phing cycle we had something called a prototype day you might think they wouldn't need any more motivation cool they're working on their cool new idea they have funding for an immediate future and they're playing the long game with the only two outcomes wealth or failure hm you think motivation might be enough so the hypothesis here is that these two are actually getting split up at a spot where there's a big breakpoint with it so it's kind of interesting to see where the uh semantic splitting actually happened there you go and now I want to reemphasize that this isn't perfect of course but I think this is an interesting step towards doing chunking because if I were to think if I'm going to hypothesize out in the future what is chunking going to be like uh well as compute gets better as language models get better there's no way we're going to do physical based chunking anymore unless the um structure of our documents is uh we can make those big assumptions on it we're probably going to do a smart chunking and I think this is a really cool um way to go towards that all right so that's level four again experimental please let me know what you think please give me other ideas for how we could make this a little bit better and this is going to be a little plug if you want to see more of these experimental methods that I do I shared this out on Twitter so um other people got to see this early now let's move on to level five a gentic chunking so if we went off the deep end with level four um we're going into the ocean here we're going into the Mariana Trench and we're going to the very bottom and we're going to do some cool things so my motivation for this side is I asked myself hey Greg what if a human were to do chunking how would I do chunking in the first place and I thought well I would go get myself a piece of scratch paper cuz I can't do all that in my head I'd start at the top of the essay and assume the first part will be in a chunk well because the first little bit it needs to go somewhere we don't have any chunks yet so of course it's going to go in a first chunk then I would keep going down the essay and evaluate if a new sentence or piece of the essay should be a part of the first chunk if not then create a new one then keep doing that all the way down until the S on the essay until we got to the end and I thought to myself wait wait a minute this is pseudo code we can make something like an agent to do this for us now I don't like the word agent quite yet because we're not quite there yet a lot of people like using I think there's a lot of marketing around it so yes I call this agentic chunking but I'm going to call this an agent like system the way that I like to Define what an agent is is there's some decision making that needs to go on in there and you use the language model to make decisions which you're going to have an unbounded path but the language model will help you guide that via your uh decision-making that you do and so so I thought man this is pretty cool I'm going to try this out all right so now let's go into level five and talk about what I found here so one first design decision that I need to make is how do I uh want to give different pieces of my essay to the language model and I thought man well there's still that problem with the short sentences you know so around this time there was a cool paper that came out all about propositions what is a proposition well a proposition is a sentence that can stand on its 
So there's another agent-like system, though it's really just a prompt (we'll go over it in a second), that takes a sentence and pulls out propositions, which are little itty-bitty Legos that can stand on their own. Let's talk about what that means. "Greg went to the park. He likes walking." If you were to pass "he likes walking" as a chunk to the language model, the language model is going to say, who is "he"? However, if we change this into propositions, it makes less sense from a reader's perspective, meaning it doesn't look great for us to read, but it makes a lot more sense to a language model: "Greg went to the park. Greg likes walking." That's pretty interesting. If you want to take a look at more proposition work (and I might do a whole other video on this, let me know if you want me to, because I think it's really cool), LangChain came out with proposition-based retrieval following a new paper by Tong Chen. I haven't met Tong yet, but Tong, I'm a big fan of your work, and if you want to chat I'm very down to talk. In the image they showed, "prior to the restoration work performed between..." blah blah blah, you have this big long paragraph, and then it gets split up into different propositions, and you use those for your retrieval, because these sentences can stand a bit more on their own. Let's go ahead and do it. I'm importing a whole bunch of LangChain stuff here, and I'm not going to go over each piece except for the cool parts. One of the cool parts is the LangChain hub import. Hey Greg, what's the LangChain Hub? I should probably do a whole other video on this too, but LangChain came out with this thing called the LangChain Hub. It lives within their LangSmith suite, and it's just a place where people share prompts. So this person (I don't know who this is, but it looks like it's been viewed a lot) shared "YouTube transcript to article": act as an expert copywriter specializing in... blah blah blah. LangChain will host prompt templates that you can use for your own work, so it's an easy way to share prompt templates. Here we have the whole entire prompt, and here we have an identifier you can use to go grab it, and if you want to grab the prompt yourself you can pull it down just like that. The advantage is that you can share prompts a whole lot easier, and also, if you want a prompt to be updated with the latest and greatest thinking, you can just keep pulling from there, which is nice. So I'm going to pull the wfh/proposal-indexing prompt from the Hub.
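As a rough idea of what that pull looks like in code, here's a minimal sketch, assuming the langchain and langchainhub packages are installed:

```python
# Minimal sketch: pull a shared prompt down from the LangChain Hub by its handle.
from langchain import hub

obj = hub.pull("wfh/proposal-indexing")  # comes back as a chat prompt template
print(obj)  # inspect the instructions and the worked example baked into the prompt
```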
All right, so let's view the proposition prompt, wfh/proposal-indexing. Here's the citation that comes with it, and here's the chat prompt template. I won't go through all of it, but the interesting part is: "split the compound sentence into simple sentences, maintain the original phrasing from the input whenever possible." And they actually give one example: they have an input and then the output they want the language model to produce. Interesting. I just went and copied this code and put it in right here. I'm also going to get my language model going; we're going to use the GPT-4 preview model because I want the long context and good processing power. So within this object is the prompt, and then I create a runnable that combines the prompt and the language model. This is via the LangChain Expression Language, where you can use the pipe operator to combine the two. Next, I tried extracting these propositions via the recommended way, but it was giving me a hard time, so I just made my own extractor. The way I do that is the Pydantic extraction method: I create a Pydantic class, then use create_extraction_chain_pydantic, passing in the sentences schema and the language model. Cool, we have that. Then I create a small function, get_propositions, which says: go get the list of propositions from the text I give you. It takes the runnable, does the extraction, and returns the propositions for us. All right, now I'm going to load Paul Graham's Superlinear essay. Cool, we have that. I'm going to split the essay into paragraphs. Now, this is a design choice. Hey Greg, aren't we just doing chunking again? Not really, because this is a very loose chunk that doesn't matter much; I could pass it a sentence at a time, or I could pass it paragraphs. Let's see how many paragraphs we have: 53 different paragraphs, nice. Let's take a look at one of them, just because I like being super explicit here. Number two: "one of the most important things I didn't know..." etc. We can look at number five too. Cool, those are just paragraphs. Then we go get our propositions, building essay_propositions: I go through each paragraph (just the first five, because there's kind of a lot of data here) and get the propositions. Let me come back to you when this is done... one minute later... awesome, we have our propositions. Let's take a look at a few, again because we're doing this iteratively together. We have 26 propositions, so our five paragraphs resulted in 26 propositions: "the month is October," "the year is 2023," "at that past time I did not understand something about the world." Cool, so now we're starting to pull out individual facts about what each paragraph is about. Lovely. A rough sketch of this whole proposition-extraction step is below.
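Here's a minimal sketch of that pipeline, assuming an OpenAI key is set; the Sentences class, the get_propositions helper, and the superlinear.txt path are my own stand-ins for what was just described, not the exact notebook code:

```python
# Minimal sketch of the proposition-extraction step described above.
from typing import List

from langchain import hub
from langchain.chains import create_extraction_chain_pydantic
from langchain.chat_models import ChatOpenAI
from langchain_core.pydantic_v1 import BaseModel

obj = hub.pull("wfh/proposal-indexing")           # the proposition prompt from the Hub
llm = ChatOpenAI(model="gpt-4-1106-preview")      # long context, good processing power
runnable = obj | llm                              # LangChain Expression Language pipe

class Sentences(BaseModel):
    sentences: List[str]

# A Pydantic-based extractor to pull the list of sentences back out of the model's reply
extraction_chain = create_extraction_chain_pydantic(pydantic_schema=Sentences, llm=llm)

def get_propositions(text: str) -> List[str]:
    # The Hub prompt takes a single "input" variable
    runnable_output = runnable.invoke({"input": text}).content
    return extraction_chain.run(runnable_output)[0].sentences

# Assumed local copy of Paul Graham's Superlinear essay
with open("superlinear.txt") as f:
    essay = f.read()

paragraphs = essay.split("\n\n")

essay_propositions = []
for paragraph in paragraphs[:5]:                  # just the first few to keep cost down
    essay_propositions.extend(get_propositions(paragraph))

print(f"You have {len(essay_propositions)} propositions")
```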
So now what I want to do is use an agent-like system that goes through each of these propositions iteratively and decides: hey, should this be part of a chunk we already have, or should it not? There's no package that I saw that does this for us yet, because it's quite an experimental method, so what I did is I ended up making an agentic chunker. It isn't a package yet, so you can't go import it anywhere, but I'll show you the code that's powering it. Here's how it works. I'm not going to go through it in detail because that's not the point of this tutorial, but I'll go through the high-level pieces. You have ac = AgenticChunker(), and then you have your list of propositions (these could be sentences, but propositions work best because that's how I designed it), you add your propositions to the class, it starts to form chunks for you, and then we pretty print the chunks. All right, let's go through and see how this works. The real magic happens in add_propositions. If we go up to the top here, we're going to add a proposition. Now, if it's your first proposition and your first chunk, then you won't have any chunks yet (chunks is a property of this class), so if the chunk count equals zero, meaning you don't have any yet, create a new chunk. Totally makes sense. Well, what do you want the chunk to be about? Let's go find create_new_chunk. In create_new_chunk we create a chunk ID, which is just a random UUID (or a subset of one), then we get a chunk summary and a chunk title. The reason we do this is that when we add a future proposition we need to know what our chunks are already about, because that tells us whether or not we should add it there. So we have a summary, which carries a lot of good detailed information, and we have a title. get_new_chunk_summary just looks at the propositions currently in the chunk and generates a summary of what that chunk is about, and the chunk title is just a few words that give you a quick glance at what it is. Now, there's a parameter you can set here: do you want to update your summary and title? Because as I was going through this, when I added, say, proposition number one, the chunk was about one thing, but once you add propositions two, three, and four, you may need to update the summary or the title because the chunk is now slightly different. It's as if you had a centroid that's moving just slightly, and you want to capture that essence. And then, for each of the chunks, we have a chunk ID, a list of propositions, a title, a summary, and the chunk index, which is just the order in which the chunk was made. Cool, let's go back to where we were. Okay, so that's what happens if you add your first proposition to your empty list of chunks; then you return, meaning you stop and go no further, because you've already added it to a chunk and it's all good. But let's say that wasn't the case. Well, let's go find a relevant chunk. This part I thought was kind of cool too: a proposition goes in, and you want a chunk to come out, and for that we have a simple little prompt.
The prompt says: determine whether or not the proposition should belong to any of the existing chunks. Then I give it an example and some other good stuff, but what I'm really doing is passing it a string describing what our current chunks look like, and those current chunks are presented as groups of three things: the ID of the chunk, the name of the chunk, and the summary of the chunk. Because what I want is this: if the model deems that, yes, this proposition should be part of an existing chunk, then it should output that chunk's ID for me, and that's how I extract which chunk it should actually join. All right, so we go through there, and let's say a chunk ID does come out, meaning the proposition should be added to an existing chunk. Then I just add that proposition to the chunk. Nice: yes, a chunk was found that it should be part of, so I add the proposition to it. And when you add a proposition, this is where the option I mentioned beforehand comes in: if you want to generate new metadata, it will go generate a new summary and title; if you don't, it doesn't have to. Now let's say a chunk ID didn't come out, meaning you wanted to add the proposition but no matching chunk was found, so you need to make a new one: no chunks found, create a new chunk, which is the same thing we had up above, and you go from there. So that's really the meat of the entire thing. A rough sketch of what a class like this can look like is below.
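Here is a minimal sketch of such an agent-like chunker, assuming an OpenAI key is set. It is not the code from the repo; the prompts are heavily abbreviated and the helper names are my own, but it follows the create-or-append flow just described:

```python
# Minimal sketch of an agent-like chunker: each proposition either joins an existing
# chunk (picked by the LLM) or starts a new one with an LLM-written summary and title.
import uuid
from typing import List, Optional

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)

class AgenticChunker:
    def __init__(self, generate_new_metadata: bool = True):
        self.chunks = {}  # chunk_id -> {"propositions", "title", "summary", "index"}
        self.generate_new_metadata = generate_new_metadata

    def add_propositions(self, propositions: List[str]):
        for proposition in propositions:
            self.add_proposition(proposition)

    def add_proposition(self, proposition: str):
        # First proposition ever: there are no chunks yet, so start one and stop.
        if len(self.chunks) == 0:
            self._create_new_chunk(proposition)
            return
        # Otherwise ask the LLM whether an existing chunk fits.
        chunk_id = self._find_relevant_chunk(proposition)
        if chunk_id:
            self._add_to_chunk(chunk_id, proposition)
        else:
            self._create_new_chunk(proposition)

    def _create_new_chunk(self, proposition: str):
        chunk_id = str(uuid.uuid4())[:5]  # a short random id is enough for a demo
        summary = self._ask(f"Summarize in one sentence what a chunk containing this is about:\n{proposition}")
        title = self._ask(f"Give a 2-4 word title for a chunk with this summary:\n{summary}")
        self.chunks[chunk_id] = {
            "propositions": [proposition],
            "summary": summary,
            "title": title,
            "index": len(self.chunks),
        }

    def _add_to_chunk(self, chunk_id: str, proposition: str):
        chunk = self.chunks[chunk_id]
        chunk["propositions"].append(proposition)
        # Optionally refresh the metadata, since the chunk's "centroid" drifts as it grows.
        if self.generate_new_metadata:
            joined = "\n".join(chunk["propositions"])
            chunk["summary"] = self._ask(f"Summarize in one sentence what this chunk is about:\n{joined}")
            chunk["title"] = self._ask(f"Give a 2-4 word title for a chunk with this summary:\n{chunk['summary']}")

    def _find_relevant_chunk(self, proposition: str) -> Optional[str]:
        outline = "\n".join(
            f"Chunk ID: {cid}\nChunk name: {c['title']}\nChunk summary: {c['summary']}\n"
            for cid, c in self.chunks.items()
        )
        answer = self._ask(
            "Determine whether the proposition should belong to any of the existing chunks.\n"
            "If yes, respond with only that chunk's ID. If none fit, respond with 'No chunks'.\n\n"
            f"Current chunks:\n{outline}\nProposition: {proposition}"
        )
        # Be forgiving about formatting: accept the id if it appears anywhere in the reply.
        for cid in self.chunks:
            if cid in answer:
                return cid
        return None

    def _ask(self, text: str) -> str:
        # One-off LLM call; a plain string input is fine for ChatOpenAI
        return llm.invoke(text).content

    def pretty_print_chunks(self):
        for cid, c in self.chunks.items():
            print(f"Chunk #{c['index']} ({cid}): {c['title']}\n  {c['summary']}")
            for p in c["propositions"]:
                print(f"    - {p}")
```

Using it matches the walkthrough that follows: create the chunker, call add_propositions with the essay propositions, and pretty-print the chunks at the end.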
Now, going back to the repo, let's run this on the essay we had. I instantiate the agentic chunker and add each of the first 20 or so propositions; let me just run the whole thing. Note that I have a lot of print logging; if you find it annoying, just set print_logging to false. Let's step through it. It's adding our first proposition, "the month is October": no chunks found (of course, because you don't have any yet), so it created a new chunk, 51322, called "date and times." Cool, makes sense, that's probably where I'd want this to go. "The year is 2023": chunk found, date and times. Right, because it saw there's already a date-and-times chunk, so it adds this one to it too. "I was a child at some past time": no chunks found, because there's only one chunk at this point and the model doesn't think this belongs to it, so it created a new chunk, "personal history." Nice. "At that past time I did not understand something important about the world": it's adding it to personal history, cool, makes sense. "The important thing I did not understand is the degree to which returns for performance are superlinear": it didn't find any chunks, it doesn't think this is part of date and times or personal history, so it made a new one about performance and returns relationships. Cool. "Teachers and coaches implicitly told us returns were linear": chunk found, it's adding it to the returns chunk, nice. "Teachers and coaches meant well": no chunks... and so on, all the way through. It's kind of interesting, but what I want to show you is an instance where a chunk name was updated. Adding "'you get out what you put in' was heard a thousand times by the speaker": it's adding it to "misconceptions in performance of returns and relationships." Wait, where did that title come from? Oh, it's the same chunk ID, but the name has been updated, because you're actually updating the chunk summary and the chunk title as you go along. So we go all the way through, it does this a whole bunch of times, and I want to see what comes out the other end, so let's pretty print the chunks. Cool, it looks like we have five chunks. Chunk number zero has this chunk ID, and "this chunk contains information about specific dates and..." cool, and here are the propositions that were added to that chunk; this is the content of the chunk that we'd actually pull in. "This chunk contains reflections on someone's childhood..." blah blah blah, okay, cool. Oh dang, this one's big: now we have a bunch of statements pulled from the essay that are all about superlinear returns across different fields. All right, cool. And let's look at the second-to-last one, teachers and coaches, and then "consequences of inferior product quality on business viability and customer base." Nice. So that's kind of interesting: we're starting to group similar items together, so if a question pops up about product quality and business viability, here's the chunk you've got to look at. Now, is this perfect? Not quite yet, because you could get complicated questions where you want pieces from multiple different chunks, but I think it's a really interesting direction to start moving towards. And if we actually want these chunks, because we want to do something interesting with them, maybe properly index them, you can go get them as a list of strings. So there you go, that's agentic chunking. Again, it's slow and it's expensive, but if you bet that language models will get faster and cheaper, which I'm guessing they will, then this type of method starts to come into play. Finally, I want to congratulate you on finishing the five levels of text splitting and chunking. We're almost done, but I wanted to throw in a bonus level. This bonus level is about alternative representations. Much like before with our chunking commandment, we need to think about how we're going to get our data ready for our language model. Chunking is only one part of that story; it's how you actually split your text, but the next step after that is getting embeddings of your text and throwing those into your knowledge base. Well, there are different ways you can get embeddings of your text, and different questions come up, like: should you get embeddings of your raw text, or embeddings of an alternative representation of your raw text? That's what we'll talk about in this bonus level; let's go through it briefly. The first thing is multi-vector indexing. This is when you do a semantic search, but instead of searching over the embeddings of your raw text, you search over something else, like a summary of your raw text or hypothetical questions about your raw text. All right, let's do the first one, summaries, and see how we'd actually do it. We're going to get our Superlinear essay one more time. I'll blow through this part since we already talked about it, but what I'm going to do is get my six chunks from the document, and then I'm going to say:
"Summarize the following documents," and that becomes a chain. Then we run it, but in batch, which is one of the other nice things about the LangChain Expression Language. Now we have all of our summaries, and here's the first one, a summary of our first chunk. What I want to do now is get an embedding of this summary instead of an embedding of the chunk, like I would have with the old method. The way we do this: we get our vector store ready (we're using Chroma today), we get a doc store ready (just the in-memory byte store that LangChain has), and then we create a multi-vector retriever. This is a cool LangChain abstraction where you can do semantic search over one thing but return another thing to your final language model. We're going to go through this quickly; LangChain has a whole tutorial on it, and in fact I have a tutorial on this as well at fullstackretrieval.com, but I want to show you a whole bunch here. We get our summary docs, turning the summaries into proper LangChain documents instead of the plain strings they were, and then here's where the cool part happens: we add them to the vector store, which also computes the embeddings for us; we add the summary docs there, and then we add our normal docs to our normal doc store on the retriever. Then we can go to our retriever and get the relevant documents, and now, with my query, the match happens on the summary embedding rather than the raw document embedding, but the docs passed back to us are the raw documents. So there's a little bit happening under the hood here (there's a short sketch of this setup just below), but again, I encourage you to go to fullstackretrieval.com and check out the tutorial on it. All right, that's one method. The other method you could do is: instead of summaries, I want hypothetical questions. This one's really nice if you anticipate building a Q&A bot, because then you can start to anticipate which questions people are going to ask; the language model will generate questions for you, and the hypothesis is that there's a higher likelihood these questions will end up matching the user's query, so you'll get better document matching. Cool. Another one you can do is a parent document retriever. In this case the hypothesis is that if you subset your document even more, you'll get better semantic search; however, in order to answer whatever question you have, you actually want what's around that small chunk. So yes, the small chunk does good semantic search, but you want the buffer around it, what comes before and after it. Another way to say that is you want the parent document that the small chunk actually comes from.
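Before getting to the parent-document demo, here's a minimal sketch of that summary-based multi-vector setup, assuming an OpenAI key is set, chromadb is installed, and the essay sits in a local superlinear.txt file; the chunk sizes and the prompt wording are placeholders rather than the exact notebook code:

```python
# Minimal sketch: embed summaries for search, but hand the raw chunks back to the LLM.
import uuid

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema import Document
from langchain.storage import InMemoryByteStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

llm = ChatOpenAI()

# 1. Chunk the essay (any splitter and size would do here)
with open("superlinear.txt") as f:
    essay = f.read()
docs = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=0).create_documents([essay])

# 2. Summarize each chunk in batch with the LangChain Expression Language
summary_chain = ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}") | llm
summaries = [m.content for m in summary_chain.batch([{"doc": d.page_content} for d in docs])]

# 3. Vector store holds summary embeddings; the byte store holds the raw chunks, joined by a shared id
id_key = "doc_id"
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
retriever = MultiVectorRetriever(vectorstore=vectorstore, byte_store=InMemoryByteStore(), id_key=id_key)

doc_ids = [str(uuid.uuid4()) for _ in docs]
summary_docs = [Document(page_content=s, metadata={id_key: doc_ids[i]}) for i, s in enumerate(summaries)]
retriever.vectorstore.add_documents(summary_docs)   # search happens over these
retriever.docstore.mset(list(zip(doc_ids, docs)))   # but the raw chunks come back

# 4. The query matches on summary embeddings, yet the original chunks are returned
relevant = retriever.get_relevant_documents("What do teachers get wrong about returns?")
```

The design choice to notice is that the vector store only ever sees the summaries, while the doc store keyed by the same IDs is what hands the raw chunks back to the language model; swapping the summaries for generated hypothetical questions gives you the second variant mentioned above.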
To show the parent-document idea, I'm going to follow a tutorial from LlamaIndex (I also have another tutorial on this at fullstackretrieval.com if you want to check it out). With LlamaIndex we're going to use their hierarchical node parser, and what you do is give it a list of chunk sizes; here they'll produce chunks of 2048, 512, and 128. Let's run it on our Paul Graham essay and see how many nodes we actually get: 119 nodes, because 128 is pretty small for us here. If we take a look at one of these relationships (this is one of the smaller nodes, because it's something at the end, at the 128 size, so you can see these chunks are quite small), and we're just looking at the relationships here, you can get a source, a previous, a next, and then the important part, the parent. So if you do this the LlamaIndex way, yes, you matched on the 128 chunk because that gives good semantic search, but you actually get back the 512 or the 2048. I won't show you how to go all the way through that, because again, that's a whole separate tutorial. Then the last thing I'll show is around graph structures. Sometimes you're going to go over your raw text, and instead of chunking it you actually want to extract a graph structure from it, because there are a lot of entities within your text. I'm going to do that via Diffbot, using the Diffbot graph transformer. I'm going to say: "Greg lives in New York. Greg is friends with Bobby. San Francisco is a great city, but New York is amazing. Greg lives in New York." And let's see what pops out of it. Now we actually have a graph document, and one of our nodes is node ID Greg, type person, properties name Greg. We have another node for Bobby, which is cool; we have another node here which is an entity location, New York; and we have a relationship, which is kind of like an edge, a social relationship between Greg and Bobby. I won't go through all of these, but either way you can start to see how you can use a graph structure to answer questions about a person or about your specific text, though that's a little outside the chunking side of things. Either way, that's actually the rest of the tutorial, and I want to congratulate you on getting through what is quite a long video; I'm not sure exactly how long, but it's probably one of my longer ones yet. I'm excited that you're here. My name is Greg Kamradt, and I'm on a mission to figure out how AI and business are going to overlap in the Venn diagram. This has been a lot of fun. Thank you for joining me, and we will see you later.
Info
Channel: Greg Kamradt (Data Indy)
Views: 33,213
Id: 8OJC21T2SL4
Length: 68min 59sec (4139 seconds)
Published: Mon Jan 08 2024