Building Multi-Modal Search with Vector Databases

Captions
Hi everyone, my name is Diana Chan Morgan and I run all things community here at DeepLearning.AI. Today we are excited to host a workshop about leveraging the power of vector databases like Weaviate, in conjunction with multimodal embedding models, to power at-scale, production-ready applications capable of understanding and searching text, images, audio, and video data. What you can expect to take away from the workshop today: how machine learning models can embed multimodal data, how vector databases like Weaviate enable real-time semantic search, how vector databases can scale the use of these models to billions of objects, and code implementations of any-to-any modality search applications enabled by at-scale multimodal search and retrieval. We will be dropping a link in the chat where you can ask questions and vote on which questions you want answered by the speakers. This session will also be recorded, and the slides will be sent out afterwards.

To introduce our event partner: Weaviate is an open-source vector database. It allows you to store data objects and vector embeddings from your favorite ML models and scale seamlessly into billions of data objects. They just released a short course with us last week with Sebastian, one of our speakers: "Vector Databases: from Embeddings to Applications."

I'd love to introduce our first speaker, Sebastian. Sebastian is the head of DevRel at Weaviate and an expert in vector database technology. He is passionate about helping developers build and productize AI-based applications that take advantage of the latest cutting-edge developments in machine learning. Hey Sebastian, happy to have you here today.

Hey Diana, thanks for having me. I'm super excited about this session.

And our next speaker is Zain. Zain is a senior developer advocate at Weaviate. He's an engineer and data scientist by training who pursued his undergraduate and graduate work at the University of Toronto, building artificially intelligent assistive technologies. He then founded his own company, developing a digital health platform that leveraged machine learning. He's passionate about open-source software, education, community, and machine learning. Hey Zain, thanks for joining.

Hey everybody, super excited to be here and really looking forward to this workshop: getting our hands dirty with multimodal data and performing search over it. Couldn't be more excited to be here with you all.

All right, I'll have Sebastian take it away.

Perfect. We thought that before we even jump into multimodal, it might be good to split this into two parts: first a little intro to vector search and the ideas behind it, to warm you up for the main portion of the session, which Zain will deliver later. Before I get into that, as Diana mentioned, we launched a course with Andrew titled "Vector Databases: from Embeddings to Applications," so please enroll; I'm super proud of working with the DeepLearning team and with Andrew on this, so the more the merrier. Now let me do a quick intro to vectors, vector search, vector databases, and all that. Trust me, I'm going to take just a few minutes of slides, because I prefer to live in the code, and that's going to be my favorite part of the session. But first, what I want to talk about is this.
I've been talking about this for years: in the world, or at least on the internet, everything we do starts with search. Whether you want to listen to music, watch movies, shop, or look for information, all of it starts with search, and we've been doing that for decades. So what's the problem? The thing is that with some of the classical methods there are challenges. With traditional search, if you run a query like "why do airplanes fly," you may get a result that is essentially an airline ad: it matches on the keywords, so technically it should be the right result, but practically it doesn't explain why airplanes fly. We asked for one thing and got something completely different.

This is where semantic search comes to the rescue. It looks at the problem from a different perspective: it's not about matching the words in our content but about matching the meaning, what we mean by the question and what we actually need. With semantic search you are more likely to get a response that, say, comes from NASA and explains that to make air move faster over the top of the wings, such and such happens. That is more of an answer; it's not about matching keywords but about what we're actually looking for. In summary, with a lot of the cool stuff happening in the ML space, we're getting amazing new tools for doing our jobs better.

So how does this work? Let me explain it, maybe not to a five-year-old but to a ten-year-old; a little bit of math might be required, but not too much. The idea is this: we have these amazing machine learning models that can take content of pretty much any type, depending on the model, and if you feed it in, what you get back on the other side is a vector embedding. A vector embedding is basically a bunch of numbers, an array of numbers, that is the machine learning model's way of capturing the meaning behind the original data. If we took our original examples, the two responses, and ran them through the model, we'd get a set of different vector embeddings, each somewhat different. If we then placed those into a multidimensional space, you would notice that the vector embeddings representing similar kinds of data end up close together: in the visualization, a cat and a dog are a lot closer to each other, a chicken is somewhat further away, while things like apples and bananas sit in a completely different part of the space. This is a simplified picture, because we can only imagine spaces in three dimensions, but these embeddings go to a thousand or fifteen hundred or even more dimensions, which means they can capture more information and more meaning.
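As a rough illustration of what "close together in vector space" means, here is a minimal sketch using cosine similarity. The model and the vector values are purely illustrative, not the embeddings used in the workshop.

```python
# Minimal sketch: comparing pretend embedding vectors with cosine similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means same direction (very similar meaning), values near 0 mean unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came out of an embedding model (real embeddings have ~1,000+ dimensions).
vec_cat = np.array([0.9, 0.1, 0.3])
vec_dog = np.array([0.8, 0.2, 0.35])
vec_banana = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(vec_cat, vec_dog))     # high: semantically close
print(cosine_similarity(vec_cat, vec_banana))  # low: semantically far apart
```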
To summarize how the whole thing works, and how a query works: if we take our original query, "why do airplanes fly," and we already have our vector space with our data embedded, we feed that query through the machine learning model and get a vector embedding again; this one represents our query. From there we can map it into our vector space, and the area around where it points is probably where the most likely article or answer to our original question is; very likely we'll find our NASA article there. That's basically how it works, in a nutshell.

So let me show you how all of this works in practice. I have a notebook prepared, and I think we can also share a link to it, so I'd like to walk you through it. I'll be using Weaviate, our open-source vector database, which is pretty awesome, and recently we've been working on a brand-new Python client which changes how we interact with Weaviate. It's still in beta, but I think it's so close to release that it was ready to show you in action, and I would love for you to follow along and code with me as I go through this. Of course, one of the first steps is to install the client; this installs the latest version of the Weaviate client, which is still in beta, which is why we need the 4.* version specifier. Once we have that, there are different ways to deploy Weaviate: you can use Weaviate Embedded, you can self-host (for example, run it with Docker Compose), and we also have a cloud deployment. The cloud deployment isn't compatible with the latest v4 client just yet; that's coming next week, so for today I'm going to focus on running this on my machine. If you want to use Embedded, you can follow along with me; if you prefer Docker Compose, there's already a Docker file in the repo you can kick off if you're fond of Docker, but I'm going to stick to Embedded.

The first thing we need to do is import weaviate; I'm also importing os because I don't want to share my keys with you in plain text. The idea is that we just call something like Weaviate "connect to embedded" or "connect to local." For the purposes of this presentation I'll be using OpenAI and Cohere, so if you want to use one of those you could put the key in as plain text, but it's better to use environment variables, and that should be it. If I run this code (after connecting to my Jupyter notebook), the first time you run it the client will download an embedded Weaviate binary for you and run it locally on your machine, so there are pretty much no installs required as long as you have the Weaviate client. I already have an instance running, and from now on I can interact with Weaviate through this client.
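As a rough sketch of that connection step, here is what it looks like with the v4 Python client that was in beta at the time; exact function and argument names may differ between client versions, and the environment variable names are placeholders for your own keys.

```python
# Minimal connection sketch for the (then beta) v4 Weaviate Python client.
# Call and header names may differ slightly between client versions.
import os
import weaviate

client = weaviate.connect_to_embedded(          # or weaviate.connect_to_local() for Docker
    headers={
        # Forwarded to the vectorizer/generative modules; env var names are placeholders.
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"],
        "X-Cohere-Api-Key": os.environ["COHERE_API_KEY"],
    }
)
print(client.is_ready())   # True once the embedded instance is up
```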
As part of this demo I want to use a couple of datasets. They're super simple: one is just 10 objects, with things like category, question, and answer, some content for us to play with; the other has about a thousand objects, so we can run a second test, and it has some additional properties as well; we'll get to that. If I run this code, we just load the data from a given path and convert it into JSON, and here's a preview of one of the objects. Okay, so we have our data pretty much ready, and what we need to do next is create a new collection. In classical databases a collection would be a table; in our case we call them collections because they're not actually tables underneath. So we call client.collections.create. (Sometimes I need to reload my VS Code and reconnect just to get my tooltips showing, and hopefully you'll be able to see them too.) To create the most basic collection we just have to give it a name, let's call it Questions, and if I create a collection like this it will be one of those collections where you bring your own vectors. But in this case I want to take advantage of Weaviate's module system, which can help you vectorize your data. So I'll add a quick import of the Weaviate classes, which give you a lot of helpers that let you write the code a lot faster, and I want to configure my vectorizer to use text2vec-cohere; Cohere has an amazing multilingual model, and I'm just going to use the default one. Let's call the collection Questions. What happened here? Oh, I was playing with it earlier, so I already have this collection; I should have cleaned up my environment, but that's fine. We can add a very quick check: if a collection called Questions exists, delete it. Let's rerun this, and that creates a brand-new collection for us.
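Here is a sketch of that delete-and-recreate step, continuing with the `client` from the previous snippet; the helper import path reflects the released v4 client and may differ slightly from the beta shown in the talk.

```python
# Recreate the "Questions" collection with a Cohere vectorizer.
from weaviate.classes.config import Configure

if client.collections.exists("Questions"):
    client.collections.delete("Questions")

questions = client.collections.create(
    name="Questions",
    # Objects added later are vectorized automatically by the text2vec-cohere module.
    vectorizer_config=Configure.Vectorizer.text2vec_cohere(),
)
```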
From this moment, in order to add data into Weaviate, I can go to client.collections and get our Questions object; let's call the variable questions. Then I can call questions.data.insert_many, and you remember earlier we loaded this data, so I'm going to grab those 10 objects and insert them straight away, and then something interesting is going to happen. As part of this we insert the data, so let me show you a quick preview, and this gets pretty interesting: with questions.query I can fetch some objects, let's say four, and if I print that response you can see we already get a bunch of objects back; I can then drill into the objects and print the properties of the first one, and this is one of the objects that went in. The other interesting thing, and you're going to like this: often we have to take care of vectorization ourselves, but here the vectors are already there. I can call questions.query.fetch_object_by_id, grab one of the IDs we had from before, say "include vector," and print it, and you can see this is the vector of the first object. The important thing to understand is when we got this vector and how that happened: it happened at insert time. When we inserted these objects, Weaviate added them all to the database but also used Cohere to vectorize them, since that's how we defined the collection; that's the idea.

Now let me show you a different example. I'm going to recreate this collection, and this time I'm going to use OpenAI, so text2vec-openai. Technically, if you wanted, you could even specify a particular model version, something like ada 002, but that's the model we use by default anyway, so I'll skip that for now. We could also add a generative config, which is interesting because that's how retrieval augmented generation works later; we have support for OpenAI, PaLM, and Cohere, and you could also run it on Azure, but in this case I'm just going to use OpenAI, and we can even specify which model we want to use, say GPT-4. So let's recreate this collection; that's a brand-new collection that will work for us. And let's add new data: this time I'll try importing the bigger set, and it's pretty much the same code, it works with 1,000 objects. Technically, if you have much bigger batches, say a million objects, I wouldn't recommend calling insert_many and passing a million objects in one go, because you can run into all sorts of problems, but for a thousand text objects I can just run this. It takes the thousand objects, vectorizes them using my API key, and then those objects are available to query very soon. I can even verify that: we have our Questions collection here again.
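Here is a sketch of that import-and-inspect flow plus the OpenAI/generative variant of the collection. `data_10` and `data_1k` stand in for the two JSON datasets loaded earlier, and the attribute names follow the released v4 client (the beta exposed the vector under the object's metadata), so treat the details as approximate.

```python
from weaviate.classes.config import Configure

# Insert the small dataset and look at what comes back.
questions = client.collections.get("Questions")
result = questions.data.insert_many(data_10)          # data_10: list of property dicts
print(result.uuids)                                   # server-assigned IDs per object

for obj in questions.query.fetch_objects(limit=4).objects:
    print(obj.properties)

# Fetch one object by ID, including the vector created by Cohere at import time.
some_uuid = list(result.uuids.values())[0]
obj = questions.query.fetch_object_by_id(some_uuid, include_vector=True)
print(obj.vector)

# Variant: recreate the collection with an OpenAI vectorizer plus a generative module,
# so we can run retrieval augmented generation against it later.
client.collections.delete("Questions")
questions = client.collections.create(
    name="Questions",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    generative_config=Configure.Generative.openai(model="gpt-4"),
)
questions.data.insert_many(data_1k)                   # the 1,000-object dataset
```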
I can very quickly verify how many objects we have in there; let's print the response, and yes, we have a thousand objects, all vectorized. Okay, so that's the first part: how you get data into Weaviate, and you can perform all sorts of operations on it. With questions.data you can delete objects, insert of course, update an object, or grab one by ID and delete it, and so on; that's the idea.

The second part is querying, and this is where things get even more interesting. I just need to reconnect to the client, because this is a separate Jupyter notebook, and hopefully I'll get all the tooltip support. I'm going to use the same code as before to connect to our Questions collection, and with it I should be able to call query.near_text. What did we have in our data? We could search for something like "death," a bit of a morbid one; let's not look for Abraham Lincoln. We have something about a rhyming pigment, so for now let's just search for "pigment." So, query equals "pigments," and we want to return maybe the first five objects, and then we print the response objects; technically I should do a loop (I was lazy and copied it because that's faster), so all right, let's do a loop: for each item in the response objects, print item.properties, so we can see what we got back. This code, where I spent more time on the printing than on the actual query, shows how you can run a vector search that uses this text as the query. We get the response back; we could also look for "color" and search on that, and you can see a reference to the color green, and we have wax. And this is interesting: I can actually search with a query in a different language, in this case a Polish word, and see what comes back. Oh, we're using OpenAI now, not Cohere anymore, but it still finds information that is sort of relevant, like musical instruments, an adorable wind instrument, and so on; it finds info on all sorts of stuff.

So that's how we can run a vector search using just text. We could also add a filter: for example, find something where "value," which is roughly the number of points, is at least 500. To do that, let's import the Weaviate classes again as wvc (oh, this cell should be Python, not Markdown; fine, I'll add a new code cell). Adding a filter is actually pretty straightforward: we say filters, we're searching on "value," and we want something that is greater than 500, and this time the results we get back all have a value of more than 500.
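A sketch of the near_text query and the filtered variant described above; the `Filter` helper path is from the released v4 client, and the query string matches the example in the talk.

```python
from weaviate.classes.query import Filter

# Plain vector search over the text.
response = questions.query.near_text(query="pigments", limit=5)
for item in response.objects:
    print(item.properties)

# Same search, restricted to objects whose "value" (points) is above 500.
filtered = questions.query.near_text(
    query="pigments",
    limit=5,
    filters=Filter.by_property("value").greater_than(500),
)
for item in filtered.objects:
    print(item.properties["value"], item.properties["question"])
```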
We can also use hybrid search, which is a pretty interesting concept. Instead of near_text we call hybrid, and what hybrid search does is combine a keyword search using BM25 and a vector search into one query. There is an alpha parameter: if alpha is one, we use 100% of the score from the vector search; if it's zero, we use 100% of the keyword search; but maybe we want, say, 70% vector search versus 30% keyword search. In that case we try to match both the keywords and the vectors, and that also runs and performs the search. Hybrid is useful when the model isn't necessarily trained on the type of data you're searching, but you still want to match exactly on something, like serial numbers or other identifiers.

The final thing I want to show you, which is pretty exciting, is retrieval augmented generation. As you saw, when I was creating the OpenAI collection I added GPT-4 as the generative model for that collection, so we can use a different query: generate.near_text is the syntax. We say we want, let's say, four objects, and then we add a single prompt, something like "write a short tweet about" the question, and that should do the trick. When I run this, it works in two steps: first it runs the query and retrieves those four objects, and then it passes the question value together with the rest of the prompt to the model and generates a response. Instead of printing the properties I should have printed "generated" on each item; let's rerun it, it only took six seconds last time, so it shouldn't be too bad. And there you go: this took the original information from our question property and turned it into a tweet, for each of those objects. There's another one called grouped task; instead of generating per object, it generates a single response across all of the results, so we could just say "explain what this content is about," and in this case the response contains "generated" as one global thing. One last code example: it again finds these four objects and then creates a single summary across all of them: "this content is about a collection of questions and answers from the game show Jeopardy; each entry includes the air date of the episode," and it goes on to explain what the content is all about, which is actually pretty awesome.
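And a sketch of the hybrid and generate calls just described; the prompts are paraphrased from the talk, and `{question}` refers to the object property of that name.

```python
# Hybrid search: BM25 keyword score blended with the vector-search score via alpha.
hybrid_response = questions.query.hybrid(
    query="pigments",
    alpha=0.7,            # 1.0 = pure vector search, 0.0 = pure keyword search
    limit=5,
)

# RAG, one generation per retrieved object.
per_object = questions.generate.near_text(
    query="pigments",
    limit=4,
    single_prompt="Write a short tweet about: {question}",
)
for item in per_object.objects:
    print(item.generated)

# RAG, one generation across all retrieved objects.
grouped = questions.generate.near_text(
    query="pigments",
    limit=4,
    grouped_task="Explain what this content is about.",
)
print(grouped.generated)
```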
I hope you were able to follow through. I know I was flying through it, but one of the benefits of watching this later on YouTube is that you can pause at the right moments. I hope this gives you a pretty good basis for the kind of things we can do, and now I can pass the baton to Zain. If you want to learn more about Weaviate, you can go to the Weaviate docs and start with the quick start, because that also takes you through a similar workflow. But that's me for this part of the session.

All right, thank you Sebastian. Let me share my screen now; that should be coming up. Sebastian, can you verify you see that? Yes, I can see "Part 2: Vector Databases." All right, we're in business.

So Sebastian spent the last half an hour talking about how you can get started with vector databases and use them to do semantic search over text, and he even showed you the idea of sending the output of the vector database to a large language model for it to reason over. I was having conversations with people in the chat about the unique applications of vector databases compared to other types of databases, and one of the most unique aspects is exactly what you have on screen here: the fact that they can handle a lot more than text. So in the last 30 minutes we have together, I want to go from understanding and searching text data to doing a lot more than that: understanding audio data, video data, and images, and then searching over them, and if we have time at the end I'll even add a little multimodal generation, calling the new GPT-4 Vision model to understand the images we retrieve from our vector database.

Let's start from fundamentals and see how we can pass more than just text to our vector database. What we really need is a model that understands multiple types of data. What I've got on screen is not just the sentence "a lion is the king of the savannas," but also the image of a lion, maybe the sound of a lion roaring, and a video of lions running around. We need models that can understand each of these modalities and convert them into corresponding vectors, and the trick is that because these input data points are quite similar in meaning (the image of a lion, a lion roaring, a sentence describing a lion), the corresponding vectors should also be similar. This goes back to what Sebastian was talking about: no matter what type of data you pass into the vector database, if it's semantically related, if it has a similar meaning, the data points should be close together in vector space. So the vector for the sentence is quite similar to the vector for the actual image of the lion, which is also similar to the sound of the lion roaring. In order to do this you need models that can understand each one of these modalities, and there are a couple of ways to do it; one way is to have one model per modality and then unify them. Once you have this capability to understand and extract meaning from different modalities, what you essentially have is one unified vector space: regardless of which modality you bring in, if the model has been trained to understand and embed it, your videos, images, audio files, and text files all live together. And this translation from human-understandable versions of data, videos you can watch and audio files you can listen to, into machine-understandable vectors still preserves the meaning behind the data, so the image of a chicken, the video of a chicken, and the sound of a chicken should all be in close proximity in vector space.
This is where things get really exciting, because now you can literally take all the files on your computer, whether they are video, images, audio, or text files, and embed them into the same vector space. Once you can embed all of these files into one unified vector space, you can do any-to-any search. What do I mean by any-to-any search? You can take any of these modalities and turn them into queries: say you want to return media that is similar to the concept encoded in a sentence, or similar to an image, or to a video. All of these can become queries, as Sebastian was showing earlier; they go into your vector space and you perform vector search with these multimodal queries. And not only that: because you already have images, audio, and video in your vector database, what comes out can be any of these modalities as well. You can retrieve audio files, images, video files, and text documents as relevant results for any modality you pass in. This is quite unique, because vector databases represent data through vectors, so it doesn't matter what the original modality of the data was: if the machine learning model understands it, it generates a vector, and the vector database is optimized to search over it.

With that short intro to multimodality, let me show you how you can take what Sebastian taught and use it to build a multimodal search engine using Weaviate. This is part two of the workshop; you can get access to this notebook in the same repository Sebastian shared, in a second folder there. To set this up we're going to be using Docker, so if you don't already have it, download and install Docker; then you go into your terminal, get the Docker Compose file, and compose it, and it's up and running. I've already got this running: I've got the DeepLearning.AI environment up with both of my containers, and we'll see the utility of both of those in a second. That will be as much Docker as you need. There are a couple of other dependencies as well: Sebastian has already set you up with the new Weaviate Python client, and for this notebook you'll also need the OpenAI Python library. We install those and then connect to our locally running version of Weaviate. This code is very similar to what Sebastian showed you: because Docker is running in the background I've got Weaviate running locally on my computer, so I connect to it, verify that it's connected, and get some metadata about the instance that's running. One thing you'll notice is that the module I'm using is multi2vec-bind, and this is particularly important; it goes back to what I was talking about earlier.
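A sketch of that connect-and-inspect step against the Docker instance; the metadata keys shown are what the released client returns and may vary by version.

```python
import weaviate

# Connect to the Weaviate instance started by Docker Compose on this machine.
client = weaviate.connect_to_local()
print(client.is_ready())

meta = client.get_meta()
print(meta["version"])            # Weaviate server version
print(list(meta["modules"]))      # should include "multi2vec-bind"
```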
In order to embed and store media files, you need a model that can translate those media files into vectors, and that's exactly what Weaviate's module system does. multi2vec-bind is one of our modules, and as the documentation shows, it allows you to pass in any of the modalities listed there, most importantly text documents, images, video, and audio, and that's what I'll be showing off today. All you need to do is specify in your Docker Compose file which module you want up and running. We can see that we have the correct module and the correct version of Weaviate, so we're good to go.

The next thing we do, as Sebastian showed, is create the collection. There's a quick check that sees whether the Animals collection is already there and deletes it, so we can recreate it from scratch. This is very similar to Sebastian's code: I create a new collection called Animals, but the code is slightly different because I specify a vectorizer config; I need to tell Weaviate how to vectorize my multimedia data once it's passed in. Here I specify that I want to use the multi2vec-bind module, which is how I can pass in audio, images, and video, and specifically I need to tell it which properties these files will live in: the audio fields live in a property named "audio," the image fields in a property named "image," and the video fields are accessed through the property named "video." This is quite important because, if you think back to how multimodal embeddings work, underneath the hood ImageBind has multiple models, and it needs to know how to take the correct modality of data, route it to the correct embedding model, and generate the embeddings. So we set it up for success by telling it where to find the right types of files so it vectorizes them with the right model. We run that and it creates our collection for us.

We've also got a helper function whose importance you'll understand in a second: any time you pass multimedia, audio, images, or video files into Weaviate, it needs to be base64-encoded, so this helper does that for all of our file types. Then, I've got a repository of images, audio files, and video files, and I'm going to insert them into the collection I just created, in real time. This is quite similar to the code Sebastian showed: I loop through all of my image files, add them to an empty Python list, and for each one I pass the name of the file, the path on my computer where the file resides, and the actual image, base64-encoded. I also tell Weaviate what type of data I'm passing by giving it a mediaType property. Then we take that list and insert the objects. As this is going on, what's happening underneath the hood is that Weaviate gets the image, knows I wanted to use the multi2vec-bind module, and passes that image to the inference container running locally on my laptop.
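Here is a sketch of the collection definition, the base64 helper, and the image import loop. The directory path is a placeholder, and the property names must match the `*_fields` lists so each blob is routed to the right ImageBind sub-model.

```python
import base64
import os
from weaviate.classes.config import Configure

if client.collections.exists("Animals"):
    client.collections.delete("Animals")

animals = client.collections.create(
    name="Animals",
    vectorizer_config=Configure.Vectorizer.multi2vec_bind(
        audio_fields=["audio"],
        image_fields=["image"],
        video_fields=["video"],
    ),
)

def to_base64(path: str) -> str:
    """Weaviate expects media blobs as base64-encoded strings."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_dir = "./source/image"          # placeholder for the image folder used in the talk
image_objects = [
    {
        "name": fname,
        "path": os.path.join(image_dir, fname),
        "image": to_base64(os.path.join(image_dir, fname)),
        "mediaType": "image",
    }
    for fname in os.listdir(image_dir)
]
animals.data.insert_many(image_objects)
```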
It vectorizes all of these images and then returns, telling me I've successfully added all of these objects to Weaviate, along with the unique identifiers for every single object. To be vigilant, I run an aggregate query that shows me whether all of the objects have been added: I've got nine image files, and this aggregate over-all query tells me how many objects are stored in my database right now, and that checks out. So we can keep going: I've now got images in my database, and vectors for those images as well.

Now I'm going to insert audio files. I've got six audio files, everything from birds to apes to roosters, dogs, and cats. I take each one of those files, loop over them, and add them to the database as well. The only thing that's different is that I have to specify the right property: the multi2vec-bind module understands these are audio files because of the name of the property, so I have to use the correct property name here. Outside of that, all I'm doing is looping through my audio files, adding them to a Python list, and then inserting all of them into Weaviate. Behind the scenes this actually uses batching; for a small number of files that's not that relevant, but if you're inserting millions of files it becomes a lot more relevant. I run this, and it goes through the exact same loop, except now it uses a different model, the audio model, to perform inference, create the vectors, and store them in Weaviate. Another quick sanity check: we had nine images and six audio files, so we've got a total of 15 objects in the database, all stored together now.

Now comes the interesting part: video files. I've got six video files in my repository, again of animals. I loop through them and specify that I'm inserting videos now; everything else is the same. I store the name of the video, the path where it's stored, and I tell Weaviate this is a video type, which is mainly for filtering later on when I want to query. As you'll notice, I'm doing something different here: I'm inserting one video at a time instead of inserting them all together with insert_many. The reason is that video files are large and it takes time for the model, especially running on my Mac, to generate the vectors for them; I don't want the database to time out, I want it to know that everything is okay, it just takes longer to create vectors for these files, so I add them in one by one.
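A sketch of the count check and of the audio and video imports; directory paths are placeholders, `to_base64` is the helper defined earlier, and videos are inserted one at a time for the timeout reason given above.

```python
# How many objects so far? (should match the nine images)
print(animals.aggregate.over_all(total_count=True).total_count)

audio_dir = "./source/audio"           # placeholder path
audio_objects = [
    {
        "name": fname,
        "path": os.path.join(audio_dir, fname),
        "audio": to_base64(os.path.join(audio_dir, fname)),
        "mediaType": "audio",
    }
    for fname in os.listdir(audio_dir)
]
animals.data.insert_many(audio_objects)

video_dir = "./source/video"           # placeholder path
for fname in os.listdir(video_dir):
    # One insert per video: embedding video is slow, so avoid large batches.
    animals.data.insert(
        properties={
            "name": fname,
            "path": os.path.join(video_dir, fname),
            "video": to_base64(os.path.join(video_dir, fname)),
            "mediaType": "video",
        }
    )
```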
What's happening underneath the hood is that it takes the video file and passes it through the video model. In ImageBind the video model is actually very similar to the image model; the only difference is that it looks at the video per second, breaks it down into frames, generates vector embeddings for those frames, and then combines them to output one single vector for every video.

While this is happening, I can talk about the next steps. Now that I have my database, I've got vectors for each of these modalities: the video of the meerkat, the digging cat, the playing cat, the audio of these animals, and the images of these animals, all living in the same vector space. As I said, this enables any-to-any search, and that's the really exciting part, because now I can pass in audio files as questions or queries and get back the most semantically relevant data points, whether those are images or video files. That's really easy to do in Weaviate: it's a single query where you pass in the file you want to use as a question and you get back the relevant objects. You can control how many objects you want and which properties you want as well; if I just want the name of the file, the path of the file, and the type of media, I can specify that and it will only give me those fields. We're almost done here, we're just vectorizing the last video. By the way, a couple of words: if you're using this in production, I would recommend running the inference model, the large embedding model, on a machine with plenty of GPU so that inference is much faster, and running Weaviate on a machine with plenty of RAM, because that's what Weaviate needs. As we've been speaking, my six videos have been inserted as well, so now we should have a total of 21 objects: nine images, six audio files, and six video files. A quick sanity check shows 21 objects belonging to multiple modalities, all living in the same vector space.

Now I can print out everything that lives in my vector database. I could loop through and include the unique IDs as well, but I don't really want those, so I'll just print out the names, and as you can see I've got all of these different multimedia files in the database. I've also got a couple of helper functions that let me visualize these multimedia files, so I can play videos and show images directly in the notebook, and another helper function that takes a dictionary and visualizes whatever modality is in there. This is one of the reasons I encoded the mediaType in the database as metadata: I can parse it out and show you image, video, and audio data points accordingly. I'll run these helper functions; we've already got the base64 functions there as well, so that should be fine.
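A sketch of listing everything in the collection; the released v4 client exposes a cursor-based iterator for this, though the beta used in the talk may have done it slightly differently.

```python
# Walk through every object in the collection and print its name and modality.
for obj in animals.iterator():
    print(obj.properties["name"], "-", obj.properties["mediaType"])
```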
Now we get to the searching portion, the any-to-any search, which is probably the most exciting part. I point a variable at the Animals collection and then we can query it, and the interesting part is that because we're using a multimodal module with Weaviate, I can do any type of search: I can search with audio files, images, text, or video files, or, if somebody just brings me a vector and wants to know which objects are close to it, I can also do a near-vector search. I'll demo all the multimodal searches I spoke about earlier.

First, a near-text search, where all I need to do is pass in a natural-language query, as Sebastian was showing earlier. If I search for "dog with stick," it vectorizes this query and returns the three closest objects, and I can plot those out: the most semantically similar object is a video of a dog running with a stick, then an audio file of a dog barking (I'm not sure if you'll be able to hear it, but you can play around with this once you have the notebook), and then an image of a dog. To go from text search to image search, it's as easy as passing in the image you want to use as a question: if my query image is this image of a cat, I convert it to base64, do a near-image search, and ask it to return the name of the closest images, where they're stored, and what type of files they are. Running the query vectorizes the image and shows which objects are close to it: I get back images of cats, and it matches pretty well just based on image similarity. You can even do audio search: this is the audio of a dog barking, which I pass in as a query with near audio, convert to base64, and limit to one so I only get one output, and that output turns out to be another audio file of a dog barking (that one was one second, this one two seconds). And we can do video search: if I've got this video of a meerkat looking around, I can use it as the input query, same steps, convert it to base64 and return the closest objects. This one takes slightly longer because videos take longer to vectorize and embed; the machine learning model running on my computer needs to crunch a lot more numbers to get the vectors. Okay, that's completed, so let me show you what came back: I've got this video of a meerkat (it's all truncated here, but you'll be able to play with it later), an image of the meerkat, very semantically similar, and also an image of a meerkat standing on a log, which is pretty cute.
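A sketch of those any-to-any queries: text, image, and audio as the question (the same near_media call handles video). Query file paths are placeholders, and `to_base64` is the helper defined earlier.

```python
from weaviate.classes.query import NearMediaType

# Text as the query.
res = animals.query.near_text(query="dog with stick", limit=3)
for obj in res.objects:
    print(obj.properties["name"], obj.properties["mediaType"])

# An image as the query.
res = animals.query.near_image(
    near_image=to_base64("./test/test-cat.jpg"),        # placeholder query image
    limit=3,
    return_properties=["name", "path", "mediaType"],
)

# An audio clip as the query; use NearMediaType.VIDEO for a video query.
res = animals.query.near_media(
    media=to_base64("./test/dog-barking.wav"),           # placeholder query audio
    media_type=NearMediaType.AUDIO,
    limit=1,
)
print(res.objects[0].properties)
```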
What I decided to do next was introduce the concept of multimodal RAG. Multimodal RAG is a very new concept; there was a paper about it earlier this year. Sebastian talked about RAG, so let me take two minutes and talk about what multimodal RAG is. How does all of this multimedia searching actually help with large language models? Everybody is interested in large language models; how can it improve them? Well, Sebastian talked about retrieving text and then passing it off as context, but isn't it true that a picture can encode a lot more meaning than text? So the question is: why stop at retrieving just text? I've shown you that you can retrieve audio files, video files, and images, so we can retrieve all of these modalities. Now your RAG workflow looks a little different: you store all of these images, text, and audio files in Weaviate and you retrieve them, and let's say I retrieve an image like that image of the meerkat on the stump; I can pass it to a large multimodal model, a language model that can understand images, and get it to answer questions about that image. That is the basic concept of multimodal RAG: if I'm retrieving different media from my vector database, sandwiching it into the prompt, and getting the model to answer questions about those images, I've performed multimodal RAG.

And that's the very last demo I'm going to show you, probably the most exciting part. We'll retrieve the image of the meerkat standing on a log from Weaviate, pass it to the large multimodal model GPT-4 Vision from OpenAI, get it to output text, and then use DALL-E 3 to repaint and recreate this image, just cuter; maybe I want to get it framed and put up on the wall. So I perform a near-text query describing "meerkat on a log," get back only one object, and I also pass a filter that only returns images, because for now GPT-4 Vision only understands images, not audio files, so I only want to filter for images. I run the query and get back the meerkat image I wanted to pass to GPT-4 Vision. This code I got from OpenAI; I have an API key that gives me access to the GPT-4 Vision preview, and I give it a prompt that says "this is an image of my pet, please give me a cute and vivid description," passing in the base64 image as well in this call to OpenAI. We run this, it makes an API call to OpenAI, and in a second it comes back with a really nice description of the meerkat: "a charming sight, your pet meerkat," and then it describes the meerkat. So this is one application of multimodal RAG: I've given it an image and gotten it to describe the image for me.
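A sketch of that multimodal-RAG step: retrieve one image (and its base64 blob) from Weaviate with a media-type filter, then send it to the GPT-4 Vision preview model through the standard OpenAI chat-completions API. The prompt follows the talk; the model name and property handling are as of late 2023 and may have changed since.

```python
from weaviate.classes.query import Filter
from openai import OpenAI

# Retrieve exactly one image of a meerkat on a log; blob properties must be requested explicitly.
res = animals.query.near_text(
    query="meerkat on a log",
    limit=1,
    filters=Filter.by_property("mediaType").equal("image"),   # GPT-4 Vision only takes images
    return_properties=["name", "path", "image", "mediaType"],
)
retrieved_b64 = res.objects[0].properties["image"]

openai_client = OpenAI()   # reads OPENAI_API_KEY from the environment
vision = openai_client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "This is an image of my pet, please give me a cute and vivid description."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{retrieved_b64}"}},
        ],
    }],
    max_tokens=300,
)
description = vision.choices[0].message.content
print(description)
```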
I'm going to go one step further: I'm now going to take this description and pass it off to a model that can understand text and generate images, and that's DALL-E 3, which was released last week. So I call DALL-E 3 and tell it to create a higher-resolution image based off this description, which is potentially something I can frame and put up on my wall, and then I'll show you what that image looks like. Again, this takes a little longer because it's generating the image. All right, so purely off the text description I've got the meerkat on the log, and I can print this out, get it turned into a poster, and put it up on my wall.
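And a sketch of that final step, feeding the description to DALL-E 3 to generate a new rendition; the prompt wording is illustrative.

```python
# Generate a new, higher-resolution rendition of the retrieved scene from the text description.
image_response = openai_client.images.generate(
    model="dall-e-3",
    prompt=f"A cute, poster-worthy painting of this scene: {description}",
    size="1024x1024",
    n=1,
)
print(image_response.data[0].url)   # URL of the generated image
```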
And this is the power of vector databases used synergistically with generative models that can understand all the modalities you can retrieve from them, and you can do this at scale: here I've shown you a quick toy demo with 21 files, but these could be billions of files that you retrieve and pass off as context. That's what I was super excited to show you all. If you have any questions, we'll take those now; if not, check us out, we think and work a lot on this, it's completely open source, and join our Slack community. Thank you very much, folks.

Absolutely, thank you so much Sebastian and Zain; I'm sure our community learned so much from you, and we really appreciate the thorough workshop you put together. I think we have time to answer maybe two or three questions before we end today. I know Duda and Sebastian hopped in and answered some questions in the live chat, but our number one question right now is: if you were to compare the cost of keyword search versus vector search, how much more expensive is vector search on average?

Interesting. Is that in a monetary sense? That's an interesting question; I don't have a perfect answer. Zain, do you have a good take?

It is a difficult question, but what you have to account for is that when you're doing vector search you have the added component of a machine learning model that you need to perform inference with, and as you saw in the multimodal space, that inference can be very costly: it might require a lot more GPUs and it needs its own infrastructure, so from that standpoint it can be more costly. For keyword search, for classical search, you don't need GPUs or machine learning models, although you wouldn't necessarily use keyword search on an image anyway, unless you do labeling or something. Exactly. So if you're setting up a machine with GPUs that can understand images and video, the added benefit you pay for is the ability to semantically search over any modality you can encode as a vector; it would be very hard to do keyword search over images, video, or audio unless you have some sort of detailed description or metadata that you can then use to filter. But that's a very interesting question, a great one.

Absolutely, and I think the last question for today: how do you keep data current? Are there optimal methods for updating the vector database, for example once a day or nightly, versus updating more frequently?

I would say it depends on the needs of your business. One of the benefits of Weaviate is that it has full CRUD support that is live, so even if your data changes every second you can continuously keep updating it, and the vector index gets updated immediately. So the question is really what your business needs: if you always need the data to be current immediately, say you're running a social media platform, then you need to update constantly; if you're running a local library, maybe once a week is enough. It all depends on what frequency you need, and there's no one perfect answer, but the beauty of Weaviate is that you can very easily keep it current, live; if you want to build a system that continuously sends updates, that's not a problem at all.

Absolutely. I want to once again thank Sebastian, Zain, and Duda for speaking today, answering questions, and helping us learn more about Weaviate and vector databases. Please don't forget to register for their short course; I'm sure you won't be disappointed and will learn even more than in this workshop. We also dropped a community survey in the chat, so we're happy to take topic suggestions for our next event, maybe even another course with Weaviate; we're always open to feedback and suggestions, and we hope to see you next time here with us at DeepLearning.AI. See you soon, thank you, bye. Thanks a lot, thanks for having us.
Info
Channel: DeepLearningAI
Views: 11,856
Id: 3WUobZryyok
Length: 61min 11sec (3671 seconds)
Published: Tue Nov 14 2023