PostgreSQL as VectorDB - Beginner Tutorial

Video Statistics and Information

Captions
Here you can see that if we use this setup, it's literally twice as fast, and we can also use a database that we're already familiar with to manage our vectors as well as our documents. I think for most of you out there building applications with large language models, using pgvector is actually better than using a dedicated vector database like Pinecone or Weaviate.

I've been working with large language models for about a year now, and I think this image is a good overview of what a typical architecture looks like from a high level. If you're building an app like this and you've dabbled with these models and with retrieval-augmented generation, then this should be familiar. From the start, about one year ago, I used Pinecone as my primary vector database. This was a new concept for me: I had never heard of vector databases or worked with them, and Pinecone was put forward as the vector database for AI applications. So I started working with it and built all my applications for my clients around Pinecone.

What I've noticed over the past couple of weeks, as some of the applications we're building transition from proof of concept to production, is that managing the data that is active in the vector database becomes more and more important. I ran into some issues where I had to reconsider my data pipelines and my vector databases, and that pointed me to this article while I was researching alternative ways of working with these vectors. I saw other people running into issues as well and switching to pgvector, which got me interested, because pgvector is essentially an extension for a PostgreSQL database that lets you store vectors and perform similarity search. I'm already familiar with PostgreSQL: I use DataGrip, and for most of my client projects where we need a database, I have it available in there as either a PostgreSQL database or something similar. So that immediately caught my attention.

Then I thought: okay, but what's the trade-off? If Pinecone is a dedicated vector database, it will probably be a lot faster, right? That's when I found this really interesting comparison done by Supabase, which outlined that pgvector is faster than Pinecone. So then I thought: what's the point of using a dedicated vector database in a separate environment, separate from all of my other projects, versus just using a database that I'm already familiar with? And since LangChain also has a PGVector class which works very well, I decided to start experimenting and see what the pros and cons are.

I have a script over here, which I will share with you as well, where I run various experiments. I'll share my insights, show you first how to work with the Pinecone index, then what the same thing looks like with pgvector, and then I'll show you why I completely switched to pgvector for all of my generative AI projects. To be clear, I'm definitely not an expert on this, so it could very well be that I just don't have enough knowledge about how to work with Pinecone effectively. I'm just going to show you some basic examples that I think will be very relatable, and from a speed and simplicity perspective, pgvector is the clear winner for me.

Alright, let's walk through some examples. I'm going to load up an interactive Python session and load some data. I have an ebook over here, A Christmas Carol, so let's load that up and split it: here you can see we have taken this txt file, a huge ebook, and split it up into documents. Next, we connect to my Pinecone account through the API key in the environment, and then I run a setup step for the demo index.
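The split step above can be illustrated in plain Python. LangChain's text splitters handle this for you; this sketch only shows the idea of overlapping chunks, and the 1000/100 sizes and file name are my own assumptions, not taken from the video.

```python
# Naive illustration of chunking an ebook into overlapping pieces.
# (A real pipeline would use a LangChain text splitter instead.)
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share `overlap` characters."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

book = "x" * 2500  # stand-in for the ebook text loaded from disk
chunks = split_text(book)
```

Each chunk then gets embedded and stored, whichever vector store you use.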
If there's already a demo index it gets loaded; if not, we create one. I already have the demo index over here, so we can take a look at what's going on. If we open the Pinecone data explorer, we can see all of the vectors, the source, and the text of the various chunks that have been put into it. So this is all basic Pinecone stuff.

Now let's create a function that we can call to run a query and perform a similarity search, and also a function that measures how long it takes to run. We can specify a number of times we want to run the function and take the average. So let's see how long it takes to search this Pinecone index: the query is just the first line of the ebook. We run a similarity search, we correctly get the first chunk of the book back based on that first line, and we can see that it took 0.53 seconds. That was the Pinecone setup; I'm going a little fast, but this is just to show you how the experiment is set up.

Now let's move on to the pgvector part. If you want to follow along, check out the repository that I will link in the description. We pick a collection name and a connection string to connect to our database, and then we create the store, similar to how we just created the Pinecone vector store, this time using LangChain's PGVector class. Just by loading this and passing in the connection to our database, I can now come over here and look at what we actually created, and this is the first thing I really like about working with pgvector: we now have two tables, which LangChain creates automatically for you.
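The timing helper described above can be sketched like this; the function name and signature are my own, not taken from the repository.

```python
import time

def average_time(fn, n: int = 5):
    """Call fn() n times and return (average wall-clock seconds, last result)."""
    total = 0.0
    result = None
    for _ in range(n):
        start = time.perf_counter()
        result = fn()
        total += time.perf_counter() - start
    return total / n, result

# Usage against either store would look something like:
#   avg, docs = average_time(lambda: store.similarity_search(query, k=1), n=5)
#   print(f"average over 5 runs: {avg:.2f}s")
```

Averaging over several runs matters here, because single-query latencies vary noticeably from run to run, as the numbers below show.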
There's a collection table with the name we just put in, and an embedding table where you have the collection ID, which maps to the UUID you see over here, the embedding itself, and the original document. So we have a traditional structured SQL database, but it also has the vectors in it, and I'll show you in a bit how effective that is for getting an overview of what kind of data you're dealing with in your application.

But let's continue with the search, perform the similarity search, and see what the difference in speed actually is. Let's run the query again; this is another custom function that runs the similarity search on the new PGVector object we created. We run it and see how long it takes this time: same result, it retrieved the first chunk of the book, and you can see that we're now at 0.29 seconds, so it's almost twice as fast. There is some variation every time you run these, so let me run it once more: we're at 0.23 now. Coming back to the Pinecone one and running it one more time, we're at 0.5 seconds. So from these examples, pgvector is literally twice as fast, and that's what I've consistently seen with the applications I'm building right now, which for me really defeated the purpose of using something dedicated like Pinecone.

Why this is the case is explained to some degree in this document: since you interact with all of these services through APIs, it's not necessarily the implementation and speed of the search algorithm that becomes the bottleneck, but the network connection over which the data is sent. By using an open-source database like PostgreSQL, you can put your database close to where your application is running, which can reduce latency.
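For intuition, the similarity search both stores perform boils down to comparing the query embedding against the stored embeddings. A brute-force plain-Python version of that concept might look like the following; real stores run this inside the database with indexes and their own distance operators, and the vectors and names here are made up for illustration.

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def similarity_search(query_vec, stored: dict, k: int = 1):
    """Return the k stored (score, document) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc) for doc, vec in stored.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]

# Toy example: two stored "chunks" as 3-dimensional embeddings.
stored = {"first chunk": [1.0, 0.0, 0.0], "middle chunk": [0.0, 1.0, 0.0]}
top = similarity_search([0.9, 0.1, 0.0], stored, k=1)
```

With both stores doing essentially this comparison, the remaining difference in response time is dominated by how far the data has to travel.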
Next, where I think pgvector really shines is when we start adding more data. Say we have another txt file: load it, split it into chunks, give it the collection name "two", and run that one more time. We can come back to our database, refresh, and there's another collection in here; refresh the embedding table and notice we were at 102 rows and are now at a lot more rows as we add data. This just gives me a much better overview of what's actually in my application, and in my clients' applications. If I compare that to the view I have in the Pinecone console, I really prefer having a dedicated application I can use to go through my data, to query not only on the embeddings but also on the documents and the collection IDs; I just feel more in control this way. Whereas over here, it starts with a random vector that you get by default, shows you some results, and lets you query by ID, but I don't store IDs; otherwise you would need a separate database, which kind of defeats the purpose, right?

That, I think, summarizes my issues with Pinecone. I am aware it probably has something to do with namespaces, and with configuring everything using namespaces the way I do with collections here; but as I said, I'm looking at it from a speed and simplicity perspective. From the examples over here, pgvector wins on speed, it's open source, and I also just like being able to open the database and see what's going on.

Now, one final thing I want to show you, and this is really powerful, you can steal it from me: I built a custom pgvector service with a custom similarity search with scores, which allows you to use LangChain and perform a similarity search over multiple collections.
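A hypothetical sketch of that multi-collection idea: query each collection's store separately, then merge the hits by score. The wrapper below is my own; the method name mirrors LangChain's `similarity_search_with_score`, and the stand-in stores only exist to show the shape of the call.

```python
def search_all_collections(stores: dict, query: str, k: int = 3):
    """Run a scored similarity search over every collection and return
    the k best (collection, document, score) hits overall."""
    hits = []
    for name, store in stores.items():
        for doc, score in store.similarity_search_with_score(query, k=k):
            hits.append((name, doc, score))
    # pgvector scores are distances, so lower means more similar.
    hits.sort(key=lambda h: h[2])
    return hits[:k]

# Tiny stand-in stores to show the shape of the call:
class _FakeStore:
    def __init__(self, results):
        self.results = results
    def similarity_search_with_score(self, query, k=3):
        return self.results[:k]

stores = {
    "collection_1": _FakeStore([("doc-a", 0.2), ("doc-b", 0.5)]),
    "collection_2": _FakeStore([("doc-c", 0.1)]),
}
top = search_all_collections(stores, "first line of the book", k=2)
```

In a real setup, each entry in `stores` would be a LangChain PGVector store pointed at a different collection in the same database.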
Out of the box you can only perform a similarity search on one collection, but this service lets you do it over all of your data. So we can look at the service, initialize it, run the query, and it will use all of the data that's in here. Why is this useful? Because now you can very easily add more data to the vector database and be in full control. For example, I can use that same service to delete collection one: let's do that, update this, and it's instant; those rows are removed and gone. Now I can delete collection two, and again it's instant. In a similar fashion I can just re-upload that data to the database: we have all of our documents here, run it one more time, and we create the first collection again. So now we're back; I can reload this, and we have all of that data again. It's a very effective way to manage your data.

What I've found is that once you get through the proof of concept with the projects you're working on, it becomes much more about upgrading and improving the data sets than about proving the application. In the beginning, the client, or whoever you're working with, will have a fixed set of documents that you want to put in there; you do some tests and validate that this is a cool idea and that it works. But that is really where the actual testing starts, and where continuous improvement of the data set comes into play. The setup you see here works very effectively for that, and you can even build your own UIs and GUIs around these delete-collection and update-collection functions, which is what I've done using an Azure storage account with some webhooks, to very effectively manage everything that's going on, with one simple query that searches through everything.
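The delete/re-add workflow maps onto the two tables LangChain creates (`langchain_pg_collection` and `langchain_pg_embedding`): embeddings reference their collection's UUID, so removing a collection cascades to its rows. As a plain-Python model of that relationship (the table names come from LangChain's schema; everything else here is illustrative, not the actual service):

```python
import uuid

# In-memory stand-ins for the two tables LangChain creates:
collections = {}  # name -> collection UUID      (langchain_pg_collection)
embeddings = {}   # embedding id -> (collection UUID, document)  (langchain_pg_embedding)

def add_collection(name: str, docs: list) -> None:
    """Create a collection and one embedding row per document."""
    cid = str(uuid.uuid4())
    collections[name] = cid
    for doc in docs:
        embeddings[str(uuid.uuid4())] = (cid, doc)

def delete_collection(name: str) -> None:
    """Remove a collection and, like a SQL cascade, all its embedding rows."""
    cid = collections.pop(name)
    for eid in [e for e, (c, _) in embeddings.items() if c == cid]:
        del embeddings[eid]

add_collection("collection_1", ["chunk a", "chunk b"])
add_collection("collection_2", ["chunk c"])
```

Because deletes are keyed on the collection UUID, removing or re-uploading one collection never touches the others, which is what makes the instant delete/re-add workflow above safe.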
So if you're working on large language model applications right now and you're using a dedicated vector database like Pinecone or Weaviate, I would highly recommend looking into pgvector; at least give it a shot and see if it works for you and makes sense for you.

How do you set up the PostgreSQL database? One option is Supabase, which I really like: you can start for free, you get a managed PostgreSQL database, and you can enable the vector extension at no cost. You can also run it locally like I do right now, or put it on a server wherever you want to deploy it. For most of my client projects I use a managed PostgreSQL database through Microsoft Azure, because that's where I deploy my apps anyway, so it's all hosted there. Those are just some ideas for you to start experimenting with, and it's also worth reading through this article, which I will link as well.

That's really what I wanted to show you in this quick video. It's something I've been working on, and I really use my YouTube channel as a way to share what I encounter in my work and what I learn, and give that back to you. So if that sounds interesting and you want to learn more, make sure to hit that subscribe button, and if you're interested in more lessons I've learned working on generative AI projects, check out this video next.
Info
Channel: Dave Ebbelaar
Views: 11,132
Keywords: data science, python, machine learning, vscode, data analytics, data science tips, data science 2023, artificial intelligence, ai, tutorial, how to, vector database, vector search, data version control, integrated vector database, generative ai, knn search, cost and latency concerns in vector databases, iterate on vectors in your pipeline, ml engineering pipelines, pgvector, pinecone, weaviate, similarity search, integrated vector management
Id: Ff3tJ4pJEa4
Length: 14min 25sec (865 seconds)
Published: Thu Dec 21 2023