Ingesting and Querying Netflix Data using Chroma DB | Step-by-Step Python Tutorial

Video Statistics and Information

Captions
Hello everyone, welcome to another episode of Generative Geek. We are continuing with the Chroma DB series. In the previous video we learned how to install Chroma using Python, how to create persistent clients, how to use OpenAI embeddings, and how to use the OpenAI embedding function and the persistent client to create a collection. We then added a few items to the collection and got them back using the get command. We also retrieved the collection using get_collection, saw how many items were there using the count method, and did a peek at the end to see what the first 10 items look like. We saw that when you do a peek you can also see the embeddings, which you don't see when you do a get. So this is what we have done till now.

In today's episode we are going to take a Netflix data set and use the first thousand rows from it. The data set I'm going to use is the Kaggle data set; the link is available in the description of the video. If you go to the link, all you have to do is download the file. It will give you a zip file; unzip it and get the CSV. That part I have already done, and I have kept netflix_titles.csv ready for processing here.

Here is the plan. We'll first load the data set using pandas. Once it is loaded, we'll create an enriched information string, and I'll talk about why we do that and how we create it. Then, using that string as a document, we will ingest documents. We'll use the things we have learned till now, like the count method and the peek method, to figure out how many documents and collections there are and to look through the first 10. Then we'll view some documents using the get method. We'll update the collection by adding metadata information, which is something we have not touched till now. It will allow us to do more powerful queries, and you will see how we do the metadata updates and then how we retrieve the collection and run query searches using the metadata and the embeddings. Once everything is done, I'll show you how to delete the collection, and that will be the last of it. So this is the agenda for today. Let's get started.

First things first: we'll start by importing the necessary libraries. The netflix_titles.csv file is already available with me. If you still haven't downloaded it, this is the page where you need to go. Click on the download button, which will give you a zip file, unzip it, and put the CSV where your notebook is. So we'll import pandas as pd and then say df = pd.read_csv("netflix_titles.csv"); read_csv is the method you use when you want to read a CSV file using pandas. We'll run this. I'm getting "No module named pandas", so we'll just install pandas. This is a new environment, so the installs were missing. I expected it to take a little while, but it's done now.

We'll quickly look through the data frame to see what kind of data has been loaded. We have show_id, type, title, director, cast, country, date_added and so on. This is the entire CSV. Going back to our scope, we will take the first thousand rows from this data set, but first we'll process it a little using pandas, and once the data set is ready we'll create the enriched information string. Let me show you what I mean. We'll do a df.info() and see what the data set looks like. If you look at it, there are a few missing values: cast and country are missing for certain rows, and even date_added is missing. To make sure this experiment goes all right, I'll just drop any row that is missing any of these column values, with df.dropna(inplace=True). If you now look at df.info(), all of the column values are available and there are 5,332 rows.

I don't want to ingest this entire data set, because that serves no real purpose here. So I'm going to take the first thousand rows: df = df[:1000]. That's now my subset. I'm going to create a new list called netflix_info. It is a blank list for now, and it is where I'm going to store my enriched strings. What I mean by enriched string is this: the data set has show_id, type, title and so on, but when I want to store a document, a combination of the title and the description makes the most sense. Everything else, such as the type, the director, the release year and the rating, becomes metadata for this string.

So I'll loop through the data frame: for index, row in df.iterrows(), which is how you loop through a data frame. I'm going to append to the netflix_info list. I started to append a dictionary with an ID, which is nothing but the show_id, but that is not what I want. Sometimes ChatGPT gives you something and you start believing it's probably right. What I want, and what I was just discussing with you, is the title, a separator, and then the description. Those two things are what I want to add to netflix_info. For the IDs, I use df.index.astype(str).tolist(), so I'm just taking the index values from the data frame and using those as the IDs.

I'll run this. Let me print the first five entries of netflix_info so you can see what the list looks like. Sankofa is the name of the movie, and then the description is on a new line. The Great British Baking Show is another title, a TV show, followed by its description. So this is what each list entry looks like, and each of these enriched strings becomes my document. If I print the first five IDs, I get 7, 8, 9, 12 and 24. Those become my IDs.

Next we'll get the collection. We already have the collection we created yesterday, so I'm going to use the same one. It was called new collection. The way you retrieve a collection is client.get_collection: you pass it the name, which was new collection, and the embedding function, which is openai_ef, the embedding function we created earlier. The challenge is that this notebook is from the previous episode and I have not run its cells again in this session, so let me run everything again so the context is available. I'll leave this one error for now, because it only says the collection already exists; all I wanted was to instantiate the client. The error we got earlier was that it did not know what client is. I have fixed that now, and it has been able to get the collection. I should have assigned it to a variable, so collection = client.get_collection(...).

Using that collection I can call collection.add, which is how you add your documents, with documents=netflix_info. When I run this it complains about existing embeddings called id1 and so on. Why is it saying id1? I should not have done that, but part of learning is that you type something and realize you should not have. That's all right. Instead we'll do collection.upsert with ids=ids and documents=netflix_info. As I told you in the last video, upsert is like an update-or-create: if an ID already exists it forces an update, and if it does not exist it gets created.

Now if I do collection.get and pass ids as the first five IDs, I can see it returned 24, 7, 8, 9 and so on, along with the documents for those first five titles that were pushed. I hope this has been clear so far. We have not done anything new; all we have done is populate the collection. Let's quickly see how many items we have: collection.count() says we have 1,002 items.

Our database now has some information, so as the next step we want to start querying it. I'll form something called query_text, which is a list. Let's say my query is "where the main character is a detective". I'll say result = collection.query(...) and pass query_texts=query_text. What you might know as top_k, how many results you want, is n_results here, so I'll say n_results=5. We want five results back; let's see what comes back. It's giving me some IDs, some distances and some documents. By default it does not return embeddings; it returns documents. The first item is a title called Department: two cops form a task force to take down two mobsters, and so on. Then Brick Mansions, with an undercover police detective; Small Town Crime, where a disgraced ex-cop discovers a dying woman; and I Am All Girls, with a relentless detective.

What is happening behind the scenes, if you remember our vector database and embeddings lectures, is this. We are using an OpenAI embedding function, so the query text is converted into an embedding. Next, the distance is computed between the query embedding and the embeddings present in the collection. You have these two different vectors, and the database computes a cosine distance, a Euclidean distance, or some other distance measure between them. Whichever items have the lowest distance are returned, five in this case. That is what the distances list holds: it indicates closeness in terms of semantic similarity.

We can also pass a list of queries. Here we had one query; let me change this so the list has "where the main character is a detective" and also "where the story is about a young girl". I don't even know what I'm going to get back, so let's see. It needs to return five results per query, and because there are two queries, the IDs are returned for the first query and then for the second, so in total you get 10 documents back. Department and the others are probably for the first query, including In the Cut, "after embarking on an affair with the cop probing the murder..." and so on. "This coming-of-age charmer follows a summer in the life of an 11-year-old girl" is most probably for the second query, along with My Girl 2, "...make surprising discoveries". So if you give a list, you might have any query that comes to mind, like "this is the kind of movie I want to watch now", and you can expect some interesting results back. And this is with only a thousand rows ingested.
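The ingestion and query steps described so far can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the notebook from the video: build_documents and ingest_and_query are hypothetical helper names, rows is assumed to be a list of dicts such as df.to_dict("records") would produce, and collection is assumed to be a Chroma collection obtained with client.get_collection. Only upsert and query are called on it.

```python
from typing import Any, Dict, List


def build_documents(rows: List[Dict[str, Any]]) -> List[str]:
    """Return one 'enriched' string per row: title, newline separator, description."""
    return [f"{row['title']}\n{row['description']}" for row in rows]


def ingest_and_query(
    collection: Any,
    ids: List[str],
    documents: List[str],
    query_texts: List[str],
    n_results: int = 5,
) -> Dict[str, Any]:
    """Upsert the documents, then run one similarity search per query text."""
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    # upsert updates IDs that already exist and creates the ones that do not
    collection.upsert(ids=ids, documents=documents)
    # The result is a dict; each key holds one inner list per query text
    return collection.query(query_texts=query_texts, n_results=n_results)


# Usage sketch (assumes pandas, chromadb and an existing collection, as in the video):
#   df = pd.read_csv("netflix_titles.csv")
#   df.dropna(inplace=True)
#   df = df[:1000]
#   ids = df.index.astype(str).tolist()
#   docs = build_documents(df.to_dict("records"))
#   result = ingest_and_query(collection, ids, docs,
#                             ["where the main character is a detective"])
#   print(result["documents"][0])  # documents matched for the first query
```

Passing the collection in as an argument keeps the sketch independent of how the client and embedding function were set up in the previous episode.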
There were some 7,000 to 8,000 rows in the original data set, and we have ingested only a thousand. Now what I'll do is pick two IDs, 383 and 343, and put them in selected_ids. Next, reference_text is collection.get for these selected IDs. First let me show you what we get back; after that it's very simple Python. I want the documents, so I just add the documents key to it. Then I can say for text in reference_text: print(text). These are the two titles we got back for those two IDs: My Girl, the coming-of-age charmer, and Department.

Once we have the reference text, I want to take these two titles and pass a where clause within the query. We have been forming the query here; I just want to enhance it. So result = collection.query(...) with query_texts=reference_text, which we formed above, and n_results of three. And I want to add a where clause. If you have ever worked with where clauses it's very similar: you just add a dictionary. We'll say where type is Movie, and then print the result. See, I did not get anything back this time. The reason is that there is no metadata available. All we did initially was form the title-and-description string and insert that; we never inserted any metadata.

So the first thing to do when you want more complex queries is update the metadata, and the way to do that is the same way we updated the documents. Let's do the metadata update first and then come back to these queries, where we'll say "I want to find movies about detectives" or "I want to find TV shows about detectives". Those kinds of things become possible once you have updated the metadata.

For the metadata update, the same way we created the netflix_info enriched strings, I'm going to create a new empty list called metadatas. We'll again loop through the data frame with for index, row in df.iterrows(). I don't want the title or the description here. Let's look at the data frame once again so we know which metadata we want to add in the database. For every document there is metadata that can be added, and our where clauses are going to act on it. type is important for us, so type is row["type"]. What else? Country will be interesting, so row["country"]. We can also add release_year. So this is what we want in our metadata. There seems to be an error; okay, fixed. Now the metadata is ready. I'll print the first two entries: the first is a movie where the country is United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia. I don't know, maybe it released in all these countries in one go. What we are trying to do is pass some metadata so that our queries can be enriched; that's my whole goal here. I haven't done a thorough analysis of this data set myself.

Once the metadata is ready, I'll also check len(ids), which is 1,000, and len(metadatas), which is 1,000. They have to be the same length, otherwise the update will fail. Then we just say collection.update(ids=ids, metadatas=metadatas). When we were upserting documents we passed documents=netflix_info, if you remember; now we are updating the metadata. I'll run this. Now let me show you collection.get with ids=selected_ids. Earlier when we looked at these we only had documents and never got any metadata back. Now each document has some metadata associated with it, and this will help us get even better results.

Let's run the earlier query with the where clause again. This time I did not get back an empty result. The where clause was type is Movie, and my query text was the reference text we got earlier, so it is able to go out and find some interesting movies for us. In the result there are the IDs, then the metadata with country and so on, and then the documents part. If we just do print(result.documents)... okay, a dict has nothing called documents. We'll get to that in a while; I don't want to waste your time while I'm debugging.

So for a simple query, "this is my reference text, find the closest matches but only those which are movies", I was able to do it. But let's say I want even more complex matches, like "find me movies where the release year is 2015 or later, where the country is United States, and which closely match the reference text". How do you do that? We'll say result = collection.query(...) on the same collection, and the query text is the same reference text. Look at it this way: our reference text is very limited, with only two strings, but in a production system your reference set can itself be a decent size and then you will get better results. For demonstration purposes this is all right. n_results is three, so we pick the first three results.

Now we add the where clause. The logic is very simple: you pass where, then you create a dictionary. This time I'm going to create a logical operator combination, so we say where with $and, and from there you create a list of the conditions you want to and together. The first condition is that type has to be a movie. Sorry, what I typed first is not the right way; I should have said type with $eq Movie, which is how you define operators here. Next, country with $eq United States. And then release_year. So far we have been doing equals; this time we want a greater-than comparison, and for that I use $gte with 2015. Strictly, $gte means greater than or equal to, so 2015 itself also matches. Our query is formed, so let's print the results. I got an error: query_text should be query_texts. Oops, and it should be result. So this is what we got back.

If you look at the metadata, every result is a movie, and every result has a release year of 2015 or later. You can also see how close they are. The first one: "In this uplifting musical, a troubled teen takes a leap of faith by attending a summer camp". Maybe this is about that young girl. Then there is one where 11-year-old Gunner runs away from home, and one where he's compelled to track down her killer, so that is the detective one, and then a Secret Service agent. For each of the two query strings we are getting back three results. I hope this helped you understand how you can create complex queries.

Similarly, you can do an or operation here. Instead of $and I could take all of these conditions and put $or: any of these matching, I want the result. Let's run this, and you will see that this time some conditions match but not all. In this one the movie condition matched, but the year is 1991. Similarly, there is one where the country is Philippines, not United States. In these ways you can create very complex queries and do more complex retrievals from the database.

Now let's say, hypothetically, you want to delete some of these IDs. You can say collection.delete and pass ids as the list of IDs you want to delete. Run it, and those IDs get deleted. This is how you delete documents from the collection. Just as we passed complex logical queries for retrieval, you can pass the same kind of conditions for deletion as well. And if you want to delete the entire collection, you say client.delete_collection and give it the name, new collection. Run it and the entire thing is gone. If you evaluate collection, the variable is still there, but if you do a count, nothing is there; it says, "I don't even know what you're talking about".

So this is it, guys. I hope you got some sense of Chroma DB. In the next video on Chroma DB we'll do a multimodal search, where we'll ingest images, run embeddings on them, and then do queries like "give me photos of dogs" or "give me photos where there are street signs" or where there's a street festival happening. It'll be an interesting one. Embeddings are a very powerful concept in general, so I really recommend that you watch our embeddings playlist; it will help you get a better understanding of how embeddings work. Thank you so much.
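The metadata update and the filtered queries from this episode can be sketched roughly as below. Treat it as an illustration under assumptions rather than the code typed in the video: build_metadatas, build_where and filtered_query are hypothetical helper names, rows is assumed to be a list of dicts with type, country and release_year keys, the value "Movie" assumes the casing used in the data set's type column, and collection is assumed to be a Chroma collection.

```python
from typing import Any, Dict, List


def build_metadatas(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Pick the columns that the where clauses will filter on."""
    return [
        {
            "type": row["type"],
            "country": row["country"],
            # int() is a precaution so the year is stored as a plain Python int
            "release_year": int(row["release_year"]),
        }
        for row in rows
    ]


def build_where(operator: str, title_type: str, country: str, min_year: int) -> Dict[str, Any]:
    """Combine three conditions under the '$and' or '$or' logical operator."""
    if operator not in ("$and", "$or"):
        raise ValueError("operator must be '$and' or '$or'")
    return {
        operator: [
            {"type": {"$eq": title_type}},
            {"country": {"$eq": country}},
            # $gte is greater than or equal to; $gt would be strictly greater
            {"release_year": {"$gte": min_year}},
        ]
    }


def filtered_query(collection: Any, query_texts: List[str],
                   where: Dict[str, Any], n_results: int = 3) -> List[List[str]]:
    """Run a similarity search restricted to items whose metadata matches where."""
    result = collection.query(query_texts=query_texts, n_results=n_results, where=where)
    # query returns a dict, so index it with a key instead of result.documents
    return result["documents"]


# Usage sketch (collection, client, ids, rows and reference_text as in the video):
#   collection.update(ids=ids, metadatas=build_metadatas(rows))
#   where = build_where("$and", "Movie", "United States", 2015)
#   docs_per_query = filtered_query(collection, reference_text, where)
#   collection.delete(ids=["343", "383"])   # remove individual items
#   client.delete_collection(name=...)      # name is whatever the collection was created with
```

Swapping "$and" for "$or" in build_where gives the looser query shown near the end, where a result only needs to satisfy one of the three conditions.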
Info
Channel: Generative Geek
Views: 216
Id: Hny-KElMIFo
Length: 30min 34sec (1834 seconds)
Published: Mon Apr 29 2024