What's the best Chunk Size for LLM Embeddings?

Video Statistics and Information

Captions
In a recent video I looked at how to do embeddings using Ollama, and of course I got a lot of comments suggesting different sizes for the embedding chunks. I felt I knew what was right, but I was wrong, and as it turns out, so were all the comments. But I'm already getting ahead of myself.

Embedding is the concept of converting a phrase or chunk of text into an array of numbers that represents the semantic meaning of the text. Every embed performed by the same model is going to be an array of the exact same size. This is pretty magical, because determining the similarity of two of these arrays is computationally quite simple. You can do thousands or more comparisons super, super quickly, so quickly that it's kind of mind-boggling, and with the right model designed to do embedding, even that step can be done incredibly fast. In fact, in the video I did that introduced the new feature in Ollama that supported these embedding models, I showed the example of embedding War and Peace. Each 500-word chunk of text took about 40 milliseconds to embed, and there were 1,100 of them in the entire book, so the whole book took much less than a minute. Using a non-embedding model would take closer to an hour, and the results would be mediocre at best. If there is a downside to embedding, it's that if you switch embedding models at some point, you'll probably have to re-embed everything that you've done already, but it's pretty fast, so that shouldn't be too big of a deal.

In my last video, where I started applying embedding to a problem I wanted to solve for myself, I looked at how to identify videos that folks should watch if they have a particular question. Now, this isn't a traditional RAG problem. With RAG, or retrieval augmented generation, we chunk up some source texts and store the embeddings and source text in a vector DB along with a bunch of optional metadata. Then a user asks a question, and that is used to find the relevant chunks and pass the source text and any metadata to the LLM along with the question to come up with a final answer. I didn't want a short answer, though; I wanted a URL for a video that they can watch. So I want to chunk up the transcript, embed it, and when I do a search, have it spit out the video URL to watch. One of the metadata items I was also collecting was the time in the video where I said something, to make it easier for the user to find the right spot in the video.

But what's the right way to do chunking, what are the limits, and why is chunking important? Well, way, way, way back at the dawn of the current trend of LLMs, you know, back six months ago, models had a maximum context size of roughly 2,000 tokens. A token is roughly a word or a common part of a word. If you wanted to pass your entire library of documents to a model, it would take roughly the first 2,000 words and ignore the rest, and there was a good chance that the parts of your library relevant to the question were a tiny, tiny portion of the overall text. So we had to limit the tokens supplied to the model to just the content that is relevant to the question. Embedding and vector databases provide the solution to this problem. But now models have much larger context sizes. Even so, there are still problems. Some models with large context sizes remember the beginning and the end of that context really well but forget everything in the middle. Even if a model does remember, that larger context takes an enormous amount of memory on the system running the model. And even if you have that memory and the model remembers everything, getting your content into that memory takes time.
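To make the embed-and-compare step described above concrete, here is a minimal sketch in TypeScript, runnable with Bun. It asks a locally running Ollama server for embeddings via its /api/embeddings endpoint and compares the two vectors with cosine similarity. The model name (nomic-embed-text) and the sample strings are assumptions for illustration, not taken from the video's code.

```typescript
// Minimal sketch: embed two pieces of text with Ollama's /api/embeddings
// endpoint and compare them with cosine similarity. Assumes Ollama is
// running locally and an embedding model (here nomic-embed-text) is pulled.

const OLLAMA_URL = "http://localhost:11434/api/embeddings";
const MODEL = "nomic-embed-text"; // any embedding model pulled into Ollama

async function embed(text: string): Promise<number[]> {
  const res = await fetch(OLLAMA_URL, {
    method: "POST",
    body: JSON.stringify({ model: MODEL, prompt: text }),
  });
  const data = await res.json();
  return data.embedding; // same length for every call to the same model
}

// Cosine similarity: dot product of the vectors divided by the product
// of their magnitudes. Values closer to 1 mean more similar meaning.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

const question = await embed("What is the best chunk size for embeddings?");
const chunk = await embed("Chunks of 25 to 100 words tend to work well.");
console.log(cosineSimilarity(question, chunk).toFixed(4));
```

Because every embedding from the same model has the same length, the comparison is just a few multiplications and additions per dimension, which is why thousands of comparisons can run so quickly.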
So often, even with larger contexts, embedding is still required, and finding the right content to supply to the model using embeddings is going to be far faster than handing all of your content to the model for it to figure out. You only have to watch a few videos about Google's Gemini, with its seemingly unlimited context, to see that embedding is going to be in our future for a long time to come. So I think I've established that embedding is important.

Okay, so what is chunking? I don't know if that's the official term. I use the word chunk a lot for other things as well, ever since seeing Chunky bars in the store and never quite being able to justify the 25 cents for one square of chocolate in my 12-year-old brain at the time. A chunk of text is some portion of a source text of some length. In this video I want to start looking at what the right length for those chunks is. Is a shorter length better? And how about overlap: is it better to have each chunk be a totally separate portion of the source, or to have some of the words from the previous chunk repeated in the next chunk, to ensure a concept isn't split across chunks?

So I created some code looking at the transcripts for my YouTube videos. All of this code is in the embedding chunk length folder of the video projects repo at github.com/technovangelist/videoprojects. I'm using Bun for this. Bun is an alternative to JavaScript... just kidding. I create a script for every video and I read it word for word; a few videos ago I messed up a line, missed it in the edit, suggested something like that, and got so many comments saying "no, that's not what Bun is." I know what it is, I've been using it for a year, and I hope it can soon overtake Deno with some super important core features. Anyway, if you want to use the code, have Bun installed, which you can find at bun.sh. I plan on converting the simple code to Python after publishing this video, but since getting a working Bun environment is so much easier than the corresponding Python environment, I tend to start there.

There are two main files here. The first splits up the transcripts and creates the embeddings. For overlaps I have 0, 3, 5, 10, 25, 50, 100, and 500, and I make sure that the overlap is less than the chunk length, because a length of five with an overlap of 500 would send it into an endless loop. Then I went through every valid combination, split up the text, created embeddings, and stored them all in JSON files with names that show the length and overlap.
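The first script isn't quoted in the transcript, so here is a rough sketch, under assumptions, of what splitting into word chunks with overlap and saving one JSON file per length/overlap combination could look like. The chunk-length list, the file naming, the transcript path, and the embed() helper are illustrative guesses based on the description above, not the repo's actual code.

```typescript
// Rough sketch of the chunk-and-embed step described above: split a
// transcript into N-word chunks with M words of overlap, embed each chunk
// with Ollama, and save the results to a JSON file named after the settings.
// Lengths, overlaps, file names, and paths are assumptions, not the repo code.

const LENGTHS = [5, 10, 25, 50, 100, 500];        // words per chunk (assumed)
const OVERLAPS = [0, 3, 5, 10, 25, 50, 100, 500]; // words of overlap

function chunkWords(text: string, length: number, overlap: number): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  // step must be positive, otherwise the loop would never advance
  const step = length - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + length).join(" "));
    if (start + length >= words.length) break;
  }
  return chunks;
}

async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  return (await res.json()).embedding;
}

// hypothetical transcript path for illustration
const transcript = await Bun.file("transcripts/some-video.txt").text();

for (const length of LENGTHS) {
  for (const overlap of OVERLAPS) {
    if (overlap >= length) continue; // skip invalid combos (endless loop)
    const chunks = chunkWords(transcript, length, overlap);
    const records = [];
    for (const chunk of chunks) {
      records.push({ chunk, embedding: await embed(chunk) });
    }
    await Bun.write(
      `embeddings-len${length}-ov${overlap}.json`,
      JSON.stringify(records),
    );
  }
}
```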
The second file, 2-search.ts, takes all those files and tries to determine the right chunking settings. First I created an array of questions along with the right answer, where the answer is the script that talks about that question. Then for each question I get the embedding for the question, open each file with the embeddings from my transcripts, and do the similarity search. For the embed files that got the answer right, I find the five chunk settings that answered it best and the five that answered it worst but were still correct, then I spit out the results (a rough sketch of this loop follows after the transcript).

And the results are interesting. When looking for exact phrases that were in the script, the shortest chunks perform best. For more complicated concepts, longer chunks perform better than the shortest, but 100-word chunks were the longest with good results. On their own those results aren't that surprising, though it was interesting that success goes down beyond 100 words: chunks that were 500 words or longer were never among the best matches and often fell into the worst matches. I was surprised how often 25-word and even five-word chunks performed best. The result I was most surprised by was how little overlap helps. In this case there were very few instances where more overlap helped at all, and an overlap of at most three words seemed to be the sweet spot.

Now, it could be that I didn't have enough questions here to be accurate; maybe I should come up with more questions. It might be interesting to see if any one chunk size was more successful than another, maybe awarding points when a chunking strategy was best, second best, and third best, and adding up the scores. I don't think I came up with one length to rule them all. I'll probably stick with 25 to 100 words, but I'm pretty confident there is little reason to go higher than 100. Often the most challenging part of writing any chunking code is dealing with overlap, but maybe I don't need to worry about that as much as I have. And this isn't a traditional RAG solution; maybe the results would be a little different if I was using the content to generate an answer with an LLM. That might be something worth investigating in the future.

I think this was a pretty interesting experiment. If there was any one outcome from it, it's not a specific chunk size or overlap amount, but rather that you should experiment with whatever your solution is, with whatever questions you typically get for your RAG solution, and try the different chunk sizes to see which performs best in your environment. What do you think? Were you surprised by any of the results? Do you see any holes in what I tried to achieve? Let me know in the comments below, or if you have any ideas for future videos, let me know about that too. I had a lot of fun with this one. Now that I have a good chunk size, I want to move on to looking at some vector databases. Thanks so much for being here. Goodbye. [Music] I never have my water here. [Music]
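As referenced above, here is a rough sketch of the search loop that 2-search.ts is described as performing: embed a question, score it against every stored chunk in each length/overlap file, and report which settings matched best. The glob pattern, file names, sample question, and scoring are assumptions for illustration, not the actual repo code.

```typescript
// Rough sketch of the search idea described in the transcript: embed a
// question, score it against every stored chunk in every
// embeddings-len*-ov*.json file, and report which settings matched best.
// File names, the sample question, and the scoring are assumptions.

import { Glob } from "bun";

type ChunkRecord = { chunk: string; embedding: number[] };

async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  return (await res.json()).embedding;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

const question = "How long did it take to embed War and Peace?";
const qEmbedding = await embed(question);

const results: { file: string; score: number; chunk: string }[] = [];

for await (const file of new Glob("embeddings-len*-ov*.json").scan(".")) {
  const records: ChunkRecord[] = await Bun.file(file).json();
  // best-matching chunk for this length/overlap combination
  let best = { score: -1, chunk: "" };
  for (const r of records) {
    const score = cosineSimilarity(qEmbedding, r.embedding);
    if (score > best.score) best = { score, chunk: r.chunk };
  }
  results.push({ file, ...best });
}

// highest-scoring settings first
results.sort((a, b) => b.score - a.score);
for (const r of results.slice(0, 5)) {
  console.log(r.file, r.score.toFixed(4), r.chunk.slice(0, 60));
}
```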
Info
Channel: Matt Williams
Views: 7,949
Keywords: embedding, chunk size, natural language processing, semantic search, semantic search demo, semantic search rag, large language models explained, natural language processing techniques, rag, retrieval augmented generation, typescript, bun.sh, generative ai
Id: 9HbU9Of-Ptw
Length: 10min 45sec (645 seconds)
Published: Fri Mar 15 2024