Supercharge your Python App with RAG and Ollama in Minutes

Video Statistics and Information

Captions
Recently I created a few videos about embedding. Embedding is a key part of the process of setting up a RAG, or retrieval-augmented generation, system. RAG is good for creating a database where you can ask questions of any documents that you may have. Those documents could be Markdown, or text, or web pages, or PDFs. PDFs are probably the most common file type, even though they're the absolute worst format you could use. A PDF is not designed to make it easy to get text out of itself; in fact, it's often used to make it really hard to get intelligible text out of the file. But anyway, let's build a decent RAG system that you can work with. I'll be using Python in this one, but there's a companion video using TypeScript coming very soon. I'll skip using PDFs because it's such a terrible format. That said, it's important, and I want to see if there's a good PDF-to-text workflow beyond what the useless tools out there like PyPDF and PyMuPDF can do; that'll have to be a different video at some point in the future.

Let's step back a little bit and look at what a basic RAG application includes. The main components are a model that you can ask questions to and a database that stores all the source documents. But it's more than just asking questions to a model: you're providing relevant documentation to the model that will hopefully help answer the question a bit better. You don't want to provide full documents, because they just tend to confuse the model, but rather just the fragments of those documents that are actually relevant. Doing a search in a SQL or document database won't have as good results; you really need a database that supports vector embeddings and some sort of similarity search. So for this video we're going to use Chroma DB. As far as vector databases go, it probably has the fewest features, but that means it's super simple to understand, really fast, and really easy to get up and running with.

So now we need to split up, or chunk, the document. Some folks like to throw around terms like agentic chunking or semantic chunking as an approach; there seems to have been a YouTuber that pushed the idea of that working well, but what I have seen work best is chunking based on the number of sentences. It's simple, it's fast, and it works great. In Python the best option seems to be to use sent_tokenize in the nltk.tokenize package. You pass it some text and you get back a list of sentences. Now, this is a list of English sentences; there is some config you can do to work with other languages, but it doesn't really know what to do if there's YAML or JSON in your text.
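As a minimal sketch of that idea, grouping the tokenized sentences into fixed-size chunks might look like this; the function name and the chunk size of seven sentences are illustrative choices, not the exact ones from the video.

# Minimal sketch: sentence-based chunking with NLTK's sent_tokenize.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # one-time download of the sentence tokenizer data

def chunk_text_by_sentences(text: str, sentences_per_chunk: int = 7) -> list[str]:
    sentences = sent_tokenize(text)  # list of (English) sentences
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]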
With your chunks in hand, you can now embed. Embedding is a process that generates a mathematical representation of the text in the form of an array of numbers. While it's technically possible to create embeddings with any model, you really want to use an embedding model to do it fast and efficiently and to get embeddings that perform well; in fact, I tried using a regular model in the app that we're going to build, and it failed to come up with anything useful. There are three choices when it comes to embedding models in Ollama as of April 2024: nomic-embed-text, mxbai-embed-large, which is a model from Mixedbread, and all-minilm. In some quick testing, nomic and mixedbread did the best job, and mixedbread took about 50% longer to do the embeddings than nomic.

So let's start building the app. You can find the code for this project in the technovangelist videoprojects repo on GitHub. Before going through the code, I want to have a working Chroma DB instance. There are a bunch of ways of doing this, but to keep it simple I will just run this command: chroma run --host localhost --port 8000 --path my_chroma_data. Then, in my import Python file, I'll initialize the Chroma client to connect to that database, delete the collection if it exists, and then create a new collection. Normally you wouldn't go through the deletion process, but I want to start over each time I run it in this example.

I want to pull in articles from a website and be able to ask questions about those articles, so I have a file called sourcedocs.txt that lists each URL, or file on the file system, that I want to embed. You can see that it's a list of articles from the MacRumors website. How I actually download the files isn't all that relevant here, but you can see how to do it in the code in the repo. The output of my readtext function is just the text of the article. Next, I chunk up the text using my chunk-text-by-sentences function that's in the mattsollamatools module; that uses sent_tokenize to create chunks of X number of sentences.

Now, for each chunk, I can embed. Embedding in Ollama is super easy, and even easier when using the Python library: you just call ollama.embeddings and specify the model name and the text. For the model name, I wanted an easy way to change it to test each model, so I have a config file that sets the embed model name and the main model name; I use ConfigParser to read that file and can then just yank out those settings. Then I save the embedding value to the embed variable, and the last step of the import is to add the embedding, the source text, and some metadata to the vector DB. Most vector databases also need a unique ID for each item stored, so I create that from the source file name and the index of the chunk in my list of chunks.

Now my database is populated, so I can perform my search. There's some initialization stuff up front, like reading in the model names from the config and making the connection to the Chroma DB. Then I take the query from the CLI args and create the embedding, and then I just run the query; this is part of the Chroma DB functionality. I can return the top five results, or ten, or any other number, then join all those together into one string, and then put the original query along with those relevant docs into the prompt that goes to the model. Now I can run Ollama generate, passing in the name of my model, the prompt, and that I want to stream the response. Finally, for each part of the streamed response, print out the token. Rough sketches of the config file, the import flow, and the search flow follow below.
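Here's a small sketch of what that config file and the ConfigParser read could look like; the section and key names are assumptions for illustration, not necessarily the exact ones used in the repo.

# Sketch of reading the model names from a config file with ConfigParser.
# The [main] section and the key names here are illustrative assumptions.
#
# config.ini might look like:
#   [main]
#   embedmodel = nomic-embed-text
#   mainmodel = dolphin-mistral
import configparser

config = configparser.ConfigParser()
config.read("config.ini")
embedmodel = config["main"]["embedmodel"]
mainmodel = config["main"]["mainmodel"]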
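Next, a rough sketch of the import flow, assuming the Chroma server started above on localhost:8000; the collection name, the readtext placeholder, and the chunker (reused from the chunking sketch earlier) are stand-ins for the actual repo code.

# import.py (sketch): read source URLs, chunk, embed with Ollama, store in Chroma.
import urllib.request

import chromadb
import ollama

embedmodel = "nomic-embed-text"  # or read it from config.ini as in the config sketch

chroma = chromadb.HttpClient(host="localhost", port=8000)
try:
    chroma.delete_collection("buildrag")  # start fresh on every run, as in the video
except Exception:
    pass  # the collection didn't exist yet
collection = chroma.create_collection("buildrag")

def readtext(url: str) -> str:
    # Crude placeholder: fetch the page and return its raw text.
    # The real project extracts readable article text from the HTML.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="ignore")

with open("sourcedocs.txt") as f:
    sources = [line.strip() for line in f if line.strip()]

for source in sources:
    text = readtext(source)
    chunks = chunk_text_by_sentences(text)  # the sentence chunker sketched earlier
    for index, chunk in enumerate(chunks):
        embed = ollama.embeddings(model=embedmodel, prompt=chunk)["embedding"]
        collection.add(
            ids=[f"{source}-{index}"],  # unique ID: source name plus chunk index
            embeddings=[embed],
            documents=[chunk],
            metadatas=[{"source": source}],
        )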
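And a matching sketch of the search side, assuming the same server and collection name; the prompt wording is just one example of combining the question with the retrieved chunks.

# search.py (sketch): embed the question, find similar chunks, and ask the model.
import sys

import chromadb
import ollama

embedmodel = "nomic-embed-text"  # or read both names from config.ini
mainmodel = "dolphin-mistral"

chroma = chromadb.HttpClient(host="localhost", port=8000)
collection = chroma.get_collection("buildrag")

query = " ".join(sys.argv[1:])  # the question comes from the CLI args
queryembed = ollama.embeddings(model=embedmodel, prompt=query)["embedding"]

# Ask Chroma for the five most similar chunks and join them into one context string.
results = collection.query(query_embeddings=[queryembed], n_results=5)
relevantdocs = "\n".join(results["documents"][0])

prompt = f"{query} - Answer that question using the following text as a resource: {relevantdocs}"

# Stream the model's answer and print each token as it arrives.
for part in ollama.generate(model=mainmodel, prompt=prompt, stream=True):
    print(part["response"], end="", flush=True)
print()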
So let's try it out. I'll set my embed model to be nomic-embed-text and the main model to be dolphin-mistral. I'll run python3 import.py and import all my text; that takes about 20 seconds to run. Then python3 search.py "what happened in Taiwan", and I get a good description of the recent earthquake in Taiwan and how it affects TSMC. Now try searching for "what is the Vision Pro", and we get a good answer about that. Even ask about Personal Voice and we get an answer there. Try switching the main model to gemma:2b and then ask the same question about Taiwan, or try a different model. Want to try out the different embedding models? Go for it.

Hopefully you will now see what's involved in creating a basic RAG application. There's a lot more we can do here. Maybe, since it's a news site, all the most recent info is more relevant, so add the date of the article to the metadata and then sort the results by date, or, if the question specifies a date, filter the search down to only documents added on those dates (there's a small sketch of that filter at the end of these captions). Or maybe your query does a search using the facility on the web page to come up with a list of relevant docs, then imports and embeds the top five results, and then does the similarity search and gets the answer from the model. There's a lot you could do here going forward.

I hope that all makes sense. If you have any questions about this, let me know in the comments below, or join the Discord at discord.gg/ollama. And if you have any ideas for future videos, let me know about those as well. Thanks so much for being here. Goodbye.
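Continuing the search sketch above, the date idea could look roughly like this, assuming a date field was stored in each chunk's metadata at import time:

# Sketch: restrict the similarity search to articles from a specific date,
# assuming each chunk was stored with a "date" entry in its metadata.
results = collection.query(
    query_embeddings=[queryembed],
    n_results=5,
    where={"date": "2024-04-03"},  # Chroma metadata filter (shorthand for {"date": {"$eq": ...}})
)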
Info
Channel: Matt Williams
Views: 24,997
Keywords: artificial intelligence, machine learning, llama 2, ollama rag, large language models python, large language models project, machine learning python
Id: GxLoMquHynY
Length: 9min 41sec (581 seconds)
Published: Fri Apr 05 2024