Ollama 0.1.26 Makes Embedding 100x Better

Video Statistics and Information

Captions
I think we may have what could be the most significant release for Ollama in a long time, which you can find out more about at ollama.com. I'm thinking it may be one of the top five or so most significant new features, but when I watch the Discord for what folks are excited about, I think they're looking at the wrong thing. They saw the top item, the headliner, which is about a new model, Gemma, from Google. But the exciting thing here is just one of the line items down below: support for BERT and Nomic BERT embedding models. This is huge, and it's a foundational feature that allows Ollama to be used in far more places than ever before. It's the feature I begged the team to deliver back in August, and it's finally here.

Embedding is all about creating a vector that represents the semantic meaning of whatever data you provide the model. The most common use case for embedding is RAG search. RAG lets you find content relevant to your question so that you can provide the model with the right inputs to come up with a good answer. Even with the larger context windows of today's latest models, RAG is still incredibly important to keep the model on point and to move with speed.

I just heard a podcast today arguing that RAG isn't very good because it only delivers a few hopefully relevant fragments of the source text: you're essentially scoring the text using embeddings, and that misses the nuance, the themes, the general hierarchy of the knowledge. But that's the "hello world" of RAG. There are so many examples of RAG that use those fragments plus larger sections, plus summaries, plus summaries of summaries, to make what the database delivers more relevant. I even made a code sample in the Ollama repo that shows this using a portion of the collection of the Art Institute of Chicago, and that was back in September.

When you build out a vector database for RAG, you provide a lot of content that first needs to be embedded. That embedding, along with the source text, is stored in the database. When you ask a question, that question is also converted to an embedding. Your question embedding is the same size as all the other embeddings for all your content, so it's actually very easy to mathematically compare the embeddings to find the ones closest, or most similar, to your question. And often a vector database can do that comparison against millions or more other embeddings in way less than a second.
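To make that comparison step a little more concrete, here is a minimal Python sketch of the idea. It assumes Ollama is running locally on its default port and that the /api/embeddings endpoint and the nomic-embed-text model (the one used later in this video) are available; cosine similarity is just one common way to measure closeness, and a real vector database would do this far more efficiently than a plain loop over a few stand-in strings.

```python
# Minimal sketch: embed a question and rank a few stored chunks by similarity.
# Assumes Ollama is running locally and the nomic-embed-text model is pulled.
import math
import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL = "nomic-embed-text"

def embed(text: str) -> list[float]:
    """Ask Ollama for the embedding vector of a piece of text."""
    resp = requests.post(OLLAMA_URL, json={"model": MODEL, "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compare two equal-length vectors; values near 1.0 mean very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stand-ins for content you would normally store in a vector database.
chunks = [
    "Ollama 0.1.26 adds support for BERT and Nomic BERT embedding models.",
    "War and Peace is a novel by Leo Tolstoy, first published in 1869.",
    "RAG retrieves relevant fragments of text to ground a model's answer.",
]
chunk_embeddings = [embed(c) for c in chunks]

question = "What did Ollama add in version 0.1.26?"
q_embedding = embed(question)

# Rank the chunks by how close they are to the question.
ranked = sorted(zip(chunks, chunk_embeddings),
                key=lambda pair: cosine_similarity(q_embedding, pair[1]),
                reverse=True)
for text, emb in ranked:
    print(f"{cosine_similarity(q_embedding, emb):.3f}  {text}")
```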
As a side note, there are a bunch of different vector databases out there that you can use in your applications. I can't really say what's different about them. The main differentiators seem to be around how easy they are to use in Python versus TypeScript, or maybe how easy they are to host outside of an application, or how they serialize the information to disk. I'm not really sure there's a big difference in how fast they can filter information, but there may be a difference in their maximum capacity. I think this would be a good thing for me to look at in a future video, so let me know if that would be interesting to you.

Ollama has supported embedding for a long time, but it only used the regular models like Llama 2 and Mistral and everything else on ollama.com. In the past this would work, but it turns out to be not super accurate and, even worse, really slow. So whenever someone asked, I would usually recommend not using Ollama embeddings and instead using the Hugging Face APIs that you can run locally to do that embedding. But every time I did that, I had to read the same article to figure out how to set it up. It tended to be hard to do, and so most people just went to the OpenAI embedding model instead.

So let's take a look at how to do this with Ollama now that we have version 0.1.26. First, let's try running this on the command line. I'm using curl with a simple phrase, and there's our embedding. Super fast.

But hopefully your documents are a bit more complicated than this. Normally you don't want to embed an entire document all at once. Instead, you need to split it up into chunks and then embed each of those. That way you can supply just the relevant parts of the document to the model. This isn't just about context size; you need to provide info that makes sense for the question. There are a lot of strategies about how large the chunks need to be, how much overlap there should be between them, what you should summarize, and whether you should also summarize topics across multiple sections along the way. But that's a topic for another video.

So here I have some code that splits a text file into 500-word chunks. This is using Bun. Then I'm going to feed it the text of War and Peace, which is about 550,000 words, or about 1,100 of our 500-word chunks. Then I'll loop through each one and embed them using nomic-embed-text. So let's run it. I skipped to the end, and the process took about 50 seconds, which is super impressive. Looking at the logs for Ollama, it looks like each 500-word chunk took about 40 milliseconds to process.

Let's compare that to what Llama 2 would take. I'll just swap out the models and run it. Each chunk takes about 1.4 seconds, so in the time nomic-embed-text did all of War and Peace, Llama 2 did 35 chunks. All of War and Peace would take Llama 2 about 25 minutes. That's a pretty huge difference.

Let's take a look at the code in Python to do the same thing, in case you're interested. Running that takes, well, just about the same amount of time as the JavaScript version: about 50 seconds. Most of the code in both examples is reading in the file and chunking up the text. The actual embed is a single line of code that takes 40-ish milliseconds to run.
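The exact code from the video isn't reproduced here, but a rough Python sketch of the same idea might look like the following. The file name, the 500-word chunk size, and the /api/embeddings request shape are assumptions based on what's described above, not the video's actual source.

```python
# Rough sketch: split a large text file into 500-word chunks and embed each
# one with Ollama's nomic-embed-text model, timing the whole run.
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL = "nomic-embed-text"  # swap in "llama2" to see how much slower it is

def chunk_words(text: str, size: int = 500) -> list[str]:
    """Split text into chunks of roughly `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> list[float]:
    """Get one embedding vector from the local Ollama server."""
    resp = requests.post(OLLAMA_URL, json={"model": MODEL, "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

# "war_and_peace.txt" is a placeholder for whatever large text you want to embed.
with open("war_and_peace.txt", encoding="utf-8") as f:
    chunks = chunk_words(f.read())

print(f"{len(chunks)} chunks to embed")
start = time.time()
embeddings = [embed(chunk) for chunk in chunks]
print(f"embedded {len(embeddings)} chunks in {time.time() - start:.1f} seconds")
```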
I am super excited about what we'll be able to do with embedding now that it runs reliably well in Ollama, and not just reliably, but super fast.

The other features in 0.1.26 include support for Gemma from Google, which I'll take a look at soon. It looks like the team is still working with Google to make the model more reliable in its responses, and there's some cleanup for Windows support, so that's great. Oh, regarding Windows: I saw a comment just now pointing to issues with the way I set up environment variables. The recommendation from the team is to use system variables, but it was suggested that that might not be the right solution. I'll take a look at that and see if I need to make a correction. That video was made when Ollama on Windows had existed for a day, so I wouldn't be surprised if something was off. Anyway, anytime anyone finds a problem with one of my videos, I am super happy to replace it with corrected content. It has to be credible and repeatable, though, so don't just claim that bunnies can fly without some way for me to verify it. But if there really is a way they can fly, I would update the appropriate video right away. Though this may be the first time I've mentioned bunnies in any video of mine, maybe you know what I'm actually talking about there.

Well, thanks so much for watching. I hope you are as excited about Ollama 0.1.26 as I am. Let me know what you think in the comments below; I love the comments, and they've been particularly active recently. As for cadence on videos, I'm still figuring that out. I'm going to experiment with the idea of making a video like this every Monday and Thursday, and then lots of shorts based on that content every day. We'll see how that goes. My wife and daughter are definitely feeling the impact of three a week. Again, thanks so much for being here. Goodbye.
Info
Channel: Matt Williams
Views: 41,618
Keywords: artificial intelligence, llama 2, hugging face, open source, large language model, run ollama locally, custom model
Id: Ml179HQoy9o
Length: 8min 16sec (496 seconds)
Published: Fri Feb 23 2024