FAISS Vector Library with LangChain and OpenAI (Semantic Search)

Video Statistics and Information

Captions
Facebook AI Similarity Search, also known as FAISS: this is a vector library that was developed by the Facebook team. What we're going to be using it for in today's video is to store embeddings. Now, I've covered embeddings in another video, but for this specific use case what we're going to be doing is taking a full document, turning the document into chunks, then turning those chunks into embeddings, and we're going to store these vector embeddings. What's great about being able to store these is we can do something called a similarity search, and I'll be showing in this video how we can search a specific query within our vector library and find specific results within a document. So what we'll be covering specifically within Python code is showing a query example, using the library as a retriever, and then also saving and loading a FAISS index. A lot to cover, but I promise I'm going to break it down and make it as simple as possible. That being said, let's start coding.

All right, so with Google Colab up, we're going to have to pip install a few things. Now, if you're using VS Code you might already have this installed on your computer. Awesome, save yourself some time. With Colab, I don't believe these are all already built in, so you have to install them: langchain, then langchain-openai, langchain-community, and then also faiss-cpu, so f-a-i-s-s, cpu, and then also tiktoken, like that. All right, I'm going to run this. It'll probably take about a minute or so. I will see you in the future.

All right, so now that this has all been pip installed, let's bring in our imports. So from langchain_openai import OpenAI. The community version is going to get deprecated in the future, so make sure you have this over here. From langchain_community.document_loaders we're going to import our TextLoader. Then from langchain.text_splitter import RecursiveCharacterTextSplitter. From langchain_openai import OpenAIEmbeddings, like that. So all this should be pretty familiar. Then here's the new stuff for this video: from langchain_community.vectorstores import FAISS, then from langchain_core, if I can type, .vectorstores import VectorStoreRetriever, and then lastly from langchain.chains import RetrievalQA. And then one other thing, we're going to import os. Hopefully that should be everything in this video. I try to put it at the very top.

What we're going to have to do is set up our OpenAI API key. I do this pretty easily, so os.environ, like this, then OPENAI_API_KEY, and then just put in your OpenAI API key. And don't publish your key if you're going to publish your notebook. I'm just going to really quickly put my key in over here. Your key should start with sk, like this, and you can grab it on the OpenAI website. I covered that in the first video in the series. But make sure you grab your key, you put it here, and you run the cell. I'll be pausing the video here. I'll put my key in, run the cell, delete my key, and then we can keep moving forward.

All right, so my key has been added. Now we're going to start doing some fun stuff, now that the annoying imports are all done. First thing we're going to do is our loader. We're just going to do a TextLoader and keep it pretty basic for this video. So TextLoader, and we can essentially bring in a piece of text. Now what I'm personally going to be bringing in over here is a text document about Metallica. The way I can do that is I can grab this folder over here and throw it in. Now the text that I grabbed on Metallica, I'll just load it up to show you guys: I literally just copied and pasted Metallica's Wikipedia page in here. So it doesn't have a ton of text, but it's still good to chunk this, and we're going to use a recursive text splitter to do that specifically. And you'll see why we want to chunk this too, especially when
we get some of our results from this vector library that we create. But feel free to literally just go to Wikipedia and copy it. I'll show you the article too. So load up Wikipedia (I had that on my second screen), put Metallica over here, and literally just grab all this, copy and paste it into a text document. That's literally all I did. And yeah, I saw Metallica a few years ago. I saw them with Avenged Sevenfold and also Volbeat. One of the bands that got me into metal music. But regardless, we're going to have to put our TextLoader over here. Let's put in our document, so metallica.txt over here like this. Then we can run that, and we have this in our loader.

Now to create our document, you just say documents equals our loader, and then you load this in, so .load(). I covered this again in another video. Put that in over here, and now we can set up our text splitter. And honestly, if you did want to see this document, just to show you really quick, just put documents over here. You could print this out, and this is what I showed you in that text document, right? All this has been put in over here. So a lot of text. Metallica has done a lot of things. I'm personally just going to delete this cell, it takes up a little bit too much space, but that's how we can do that.

Let's do our text splitter now. So text_splitter equals our RecursiveCharacterTextSplitter (thank you for completing that), and we're going to do a chunk size of just 500, nothing too crazy here. So 500. We'll do a chunk overlap of zero. You can mess with these numbers to see if you get better results, but personally, since this is just a demo video and there's really no functionality behind it besides it being a demo, it doesn't really matter too much. And we're going to do length_function equals len, like this, and our recursive text splitter is great. Now we'll build a few more cells. Then we're going to do docs over here, so
docs, and we're going to say that's equal to our text splitter. So grab your text_splitter here, copy and paste it or type it out, and then we'll do .split_documents. I don't even know what I just typed out there, but we'll do split_documents like that. Then we'll throw our documents in there, like this. And if you want to see how all these chunks look, we can do docs[0] over here. I covered this again in more detail in the other video, so watch that if you're not too familiar, but this is what the first doc would look like, our first 500 over here. Then docs[1], docs[2], and this is just off of the recursive text splitter that we set up. And you can also just do len, throw docs in here, and you can see that there are 133 different chunks. I guess, just for clarity's sake, I'll put docs[0] over here just to keep that, but docs[0], and you can see there are 133 different chunks that were created from that Wikipedia article.

All right, so now what we're going to do is set up our embeddings. So embeddings, and I'm going to just use the built-in OpenAI embeddings that we already imported over here, so we can just do OpenAIEmbeddings like this. And remember, this is going to be a vector representation, so each chunk will end up being, let's say, 0.5, 0.1, 0.3, and there's a lot more of these, right? You're not going to only have, what, four here. It's greatly expanded, but just think about how it's going to be built out on that side of things. And this is what we're going to be sending over to our vector library.

So let's set up our library now. What we're going to do is library equals FAISS, so F-A-I-S-S (it should actually be all capitalized like that), dot from_documents, and this makes it super easy. So what we're going to do first is throw in our documents, and we labeled these as docs, so that's going to be our first input. And then our second input is our embeddings. Well, we use these embeddings
over here, the OpenAI embeddings. Throw that in here and this will run. Awesome.

Okay, so what I want to show you first is how we can query this. Essentially we can query this whole document and ask questions about it. So let's set up our first query. We're going to do query one, and our query one is going to be: who replaced Cliff Burton in Metallica? And technically there's only one answer to it. You could say two, because of the different bassists that have been in Metallica, but that will be our first query. So let's find our answer. To find the query answer we can say query_answer equals library.similarity_search, and this is what FAISS is really good for, similarity search. And we can just put our query one in here. So essentially we're taking a look at this vector library and we're doing a similarity search based off of who replaced Cliff Burton. So it's going to analyze this whole Wikipedia article and try to find a chunk that answers this question. So let's put that in over here.

Okay, and to get the results on this, we're going to say print, like this, and we're going to put query_answer, and let's put zero over here. So kind of like where I showed you guys docs over here, the best result is going to be zero, so that's what we're going to put: query_answer[0], and then just put .page_content. I did demo this before and it did work, so I'm crossing my fingers it works again. And let's run this: Burton's death left Metallica's future in doubt. The three remaining members decided Burton would want them to carry on, and with the Burton family's blessing the band sought a replacement. Roughly 40 people, including a childhood friend from Primus, Prong, and Jason Newsted of Flotsam and Jetsam (who by the way still apparently tour), auditioned for the band to fill Burton's spot. Newsted learned Metallica's entire setlist. After the audition, Metallica invited him to Tommy's, which I assume
if this chunk continued (it cuts off here), they would talk about him joining the band, because he did join the band. Robert's now the bassist of Metallica, so technically you could say he replaced Cliff as well, but he ended up replacing Jason. We're going to put query_answer[1] over here and see what the second best result would be, and you'll see why, or kind of how this works, a little bit later. They talk about this over here: although Burton initially declined the offer, by the end of the year he accepted on the condition... this just talks about Burton joining Metallica, I assume, right over here, 1983. Okay, so this was the second best answer, but this obviously was the best answer. It gave us what we specifically needed, almost, right? Probably just a little bit later in the sentence it would say it, but that's where our recursive text splitter decided to split.

So I want to show you the similarity search this time with the scoring behind it, and how it chooses what's number zero or number one. Let's do this. We can say docs_and_scores (I just grabbed this name from the documentation) equals library.similarity_search_with_score, and then we'll just throw our query one in here again. Each of these is going to be assigned a score this time, assuming that it gives us the same results. So what we're going to do this time is take docs_and_scores, put zero over here, and let's see what we specifically get. So same exact thing, right? The Tommy's over here, source metallica.txt, which we talked about, we imported it over here, and this has a score of 0.23. We want to have a score closer to zero; that means we have better results. 0.23. So let's see what docs_and_scores[1] looks like, a little bit behind the scenes of how this works. You can see it's a 0.29, right? And let's see what two would be as well. So kind of a cool way that you could debug and also see specifically what's going on. And then, oh yeah, here we go, this is two. This ends up going past over here, right, and it says San Francisco; Hetfield, Ulrich, and Kirk Hammett decided on Newsted as Burton's rep, or Burton's replacement. His first live performance with Metallica was at the Country Club in California. The members initiated Newsted by tricking him into eating a ball of wasabi. Man, that's kind of funny. It reminds me of when I was in high school and did math competitions. There was a guy on my math team who would literally eat wasabi. We gave him five bucks, he ate a bunch of wasabi, and it did not impact him at all, which is crazy.

But all right, cool, we have that over here. And just a little bit of background information: retrievers don't necessarily need to store documents, but they can retrieve them, and vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well. This is just some text that I grabbed from the documentation. But let's build out our retriever. So retriever, like this, equals library.as_retriever, like this over here. All right, so now we have that. Then we can say over here qa equals RetrievalQA, like that, .from_chain_type. We're going to set our llm equal to OpenAI (you could declare this outside if you want to use a specific model), then we're going to say chain_type equals "stuff" over here, and then lastly we're going to say retriever equals retriever. We're going to do this another time a little bit later in this video, but we should have that now.

Okay, a few more lines of code. Let's set up a new retriever query, just to ask another question. So retriever_query, and we're going to say it equals: what is the most hated Metallica album? Which everyone knows is St. Anger, but let's see if we can grab that from the Wikipedia article. Then we can go over here and say results equals qa.invoke and throw this query in here, and now let's print these results. And it says: what is the most hated Metallica album? I don't know. Well, that's not what I got originally when I ran this, so maybe we'll have to tweak this. Let's say: what Metallica album do fans hate the most? Like I said, this worked originally when I did this, and I tested it a few times, so let's see. Oh, there we go. Query: what Metallica album do fans hate the most? Result: St. Anger, 2003. So it is working now that we slightly modified this query.

All right, we're going to do this one more time, but I just want to show you how we can save our FAISS index so that we don't have to recreate it every time. So let's do that. Pretty easy: library.save_local, like this, then save it as a specific name that you want. I'm just going to name it FAISS index Metallica, like that. All right, and then what I'm going to say for this next line is metallica_saved equals, and then we're going to say this is a load_local,
then just grab our name over here, and then grab the embeddings that we already declared a little bit earlier. Our embeddings, go back over here, are these OpenAI embeddings we set up. So now that's over here. I'm going to reuse some code. We had this code over here for qa, so put that over here. This time we're going to change this up, so instead of retriever we'll put metallica_saved and then .as_retriever, like this, and run that. And then we want to get our results again: results equals qa.invoke with retriever_query. Run that, and if we want to see our final results, let's print results again. What Metallica album do fans hate the most? And the answer is: it is difficult to say which Metallica album fans hate the most, as opinions of albums vary. However, some fans may dislike their eighth album, St. Anger, due to its departure from their traditional sound and mixed reactions from critics. Well, I think most people hated that snare and also the lack of solos.

Hey, you made it to the end of this video. I hope you found some great information in here and you learned something new. If you did, make sure to subscribe to the channel. It's 100% free, but it does signal to YouTube's algorithm that (a) people do enjoy these videos and (b) there's an audience of people, whether data analysts, data scientists, or people just interested in AI, that continue to watch these videos and support the channel. Now if you want to watch other videos on the channel, I upload about two to three times a week as I learn new skills. Right now I'm heavily focusing on the OpenAI and LangChain side of things, and I'm making a full playlist right over here, so make sure to check it for new videos.
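Editor's sketch: the scored similarity search described in the captions (best match at index zero, a score closer to zero meaning a better result) can be illustrated with a small pure-Python stand-in. This is not FAISS or LangChain code. The function names, the three-number vectors, and the chunk labels are all hypothetical, and the squared L2 distance is an assumption chosen because it matches the "lower is better" behaviour shown in the video.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_search_with_score(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 4,
) -> List[Tuple[str, float]]:
    """Return the k chunks closest to the query, best (lowest score) first.

    A brute-force stand-in for a vector library lookup: every stored
    vector is compared with the query, then results are sorted by distance.
    """
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in index]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical three-dimensional "embeddings"; real ones have far more values.
toy_index = [
    ("chunk about Burton's replacement", [0.5, 0.1, 0.3]),
    ("chunk about Burton joining", [0.4, 0.3, 0.1]),
    ("chunk about an album", [0.9, 0.8, 0.7]),
]
toy_query = [0.5, 0.1, 0.2]  # stands in for the embedded question

results = toy_search_with_score(toy_query, toy_index, k=2)
for text, score in results:
    print(f"{score:.2f}  {text}")
```

A real vector library avoids comparing against every stored vector one by one in plain Python, but the ranking idea is the same: embed the query, measure distance to each stored chunk, and read the results from index zero upward.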
Info
Channel: Ryan Nolan Data
Views: 4,004
Keywords: Data Analyst, Data Scientist, Facebook AI Similarity Search, FAISS, Vector Library, LangChain, ai, data scientist
Id: ZCSsIkyCZk4
Length: 19min 59sec (1199 seconds)
Published: Tue Feb 27 2024