Vectorstores

Captions
In this video we'll do a quick dive on vector stores, also called vector databases or vector embeddings. If you've ever heard the term "embeddings" or "word embeddings", or wondered what that section on OpenAI's website labeled "Embeddings" is all about, then this video is for you. If you've never heard of vector stores or vector embeddings, that's okay, because you've probably interacted with them without even knowing it. Some common consumer-facing applications powered by vector stores are (1) anomaly detection related to your bank account, (2) content recommendations on streaming platforms, and (3) automated content moderation on social media.

There are only two things you need to understand in order to follow along with this video: what a vector is, and what a dot product is.

A vector is a multi-dimensional value, in contrast to a one-dimensional value, which is sometimes referred to as a scalar. If you were to ask me "what's my age again?", I might say 33, and that would be an acceptable response. On the other hand, if you asked me "how would you shoot this basketball so that it goes inside the hoop?", I might answer "I would throw it at this speed, in this direction, and at this angle." Notice how there are multiple dimensions associated with the answer to the second question. This is what vectors represent: vectors are multi-dimensional values. You can represent many things with vectors. For example, let's represent someone's personality with a four-dimensional vector: the first number is how sociable they are, the second how talkative, the third how trusting, and the fourth how confident. Hopefully you now have a good feel for what a vector is.

A dot product is a mathematical formula, just like mx + b = y or a² + b² = c², except that it takes two vectors and spits out a value that tells you how similar or different the two vectors are. The more positive the output of the dot product, the more similar the two vectors are considered to be; the more negative, the more opposite. If the dot product outputs a value of zero, the two vectors are orthogonal: unrelated, neither similar nor different.
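No code for this appears in the video itself, but here is a minimal sketch of the dot product using the personality-vector example above; the names and numbers are made up for illustration:

```python
import numpy as np

# Hypothetical four-dimensional personality vectors:
# [sociable, talkative, trusting, confident], each on a -1..1 scale.
alice = np.array([0.9, 0.8, 0.3, 0.7])
bob   = np.array([0.8, 0.9, 0.2, 0.6])
carol = np.array([-0.9, -0.7, -0.4, -0.8])

# The dot product sums the element-wise products of the two vectors.
print(np.dot(alice, bob))    # 1.92: large positive -> similar personalities
print(np.dot(alice, carol))  # -2.05: negative -> opposite personalities
```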
Okay, the rest of this video will consist of three parts. Part one is a rip-off of Dave Shapiro's video entitled "What the heck are embeddings?" and will lay the ground for the subsequent two parts. Part two is a demo of how to use a vector database, AKA a vector store (shout-out to Chris Johnson for the tips and tricks I'll be showing here). And part three is a very quick rundown of two more advanced concepts related to vector databases (shout-out to Ryan Lambert for prompting me to cover these).

We'll be using Python and an embedding model from OpenAI. ChatGPT and GPT-4 are the OpenAI models most people are familiar with, but other models offered by OpenAI include DALL-E, a model that receives text and spits out images related to said text; Whisper, a model that receives audio and spits out transcriptions of what's being said in the audio; and text-embedding-ada-002, which is what we'll be using: the embedding model that receives text and spits out a 1536-dimensional vector. I looked into what the dimensions of these output vectors mean but didn't find anything; if anyone knows, please let me know. I'm guessing the first dimension means how Western or Eastern the source text is, the second how scientific or unscientific it is, the third how masculine or feminine. I don't know, I'm speculating. Hopefully that gives a gist of what's happening when a source text is passed through one of these embedding models and turned into a 1536-dimensional vector.

Let's send the text "tiger" to text-embedding-ada-002 and see what we get back. Here is the 1536-dimensional vector we were talking about. Now let's send the text "cat" to text-embedding-ada-002: another 1536-dimensional vector. It's hard to tell these vectors apart, but they are different. Now let's send the text "fish": another one of these huge vectors.

Okay, here is where the dot product comes in. Let's calculate the dot product of vectors one and two (tiger and cat) and compare it against the dot product of vectors one and three (tiger and fish). I predict tiger and cat will score higher because they're both of the feline family (if that's the right term); a tiger and a cat seem to me more similar than a tiger and a fish. But let's see what text-embedding-ada-002 has to say about that. So according to text-embedding-ada-002, tiger and fish are more similar than tiger and cat. Maybe that has something to do with cat being a domesticated animal, I don't know. Let's test out Monday, Tuesday, and Wednesday. I predict Monday and Tuesday will score higher than Monday and Wednesday, because Monday and Tuesday are closer together temporally. Let's see if I'm correct. In this case I was: Monday and Tuesday score slightly higher in similarity than Monday and Wednesday. Now you see why these vectors are sometimes referred to as embeddings: it's because the models that generate these vectors embed notions of meaning into them.
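Here's a minimal sketch of that experiment, written against the pre-1.0 openai Python library that was current when this video was published (the client API has since changed); you'd supply your own API key:

```python
import numpy as np
import openai

openai.api_key = "sk-..."  # your own key here

def embed(text: str) -> np.ndarray:
    """Return the 1536-dimensional text-embedding-ada-002 vector for `text`."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

tiger, cat, fish = embed("tiger"), embed("cat"), embed("fish")

# Higher dot product = more similar, per the video's experiment.
print("tiger . cat  =", np.dot(tiger, cat))
print("tiger . fish =", np.dot(tiger, fish))
```

Since ada-002 embeddings come back normalized to length 1, the dot product here is the same thing as cosine similarity.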
I couldn't come up with a way of visualizing 1536 dimensions, but here is a visualization of a collection of three-dimensional vector embeddings. If you look at this region of the vector cloud, or vector space, let's see what these points represent: "graph", "configuration", "filter", "graphics", "vector", "compression". I guess these are IT technologies, I don't know. Let's come over here and see what the concepts in this part of the vector space say: "tanker", "shortwave", "precipitation", "runways", "GRT", "rainfall". Maybe stuff related to water and military equipment, I don't know. Or let's come over here: "minorities", "autonomy", "republics", "provinces", "territories", "reforms". So this is political concepts, maybe. So that is vector embeddings, or word embeddings: the groundwork of vector stores and vector databases.

For this demo we'll take a look at the vector database side of things. Just like SQL databases store uniformly labeled data in tables, and NoSQL databases store objects or dictionaries in collections, vector databases store vectors. At the time of recording in 2023 there are many vector databases available on the market, for example Pinecone, Chroma, or Milvus, and all of them are valid, but in this demo we'll use Chroma simply because it is free. The steps for installing ChromaDB are (1) clone the git repository and (2) build the image in the cloned project. If everything works, you should see a ChromaDB server running in your console.

Two common use cases for vector databases, or vector stores, are similarity search and anomaly detection. From the earlier list of common consumer-facing applications of vector stores, content moderation on social media is an example of similarity search: you compare vectors that represent user data against vectors that represent concepts defined as unacceptable by a given platform, and when you find similarities across a certain threshold, you mark the relevant user data for hiding or deletion. Unusual activity associated with your bank account, on the other hand, is an example of anomaly detection: it involves generating vectors for all the actions you perform against your bank account and raising a flag whenever a particular vector crosses a certain threshold of dissimilarity.

The general technique for doing similarity search or anomaly detection yourself is outlined here:
1. Break up your data into chunks. If your data is text, the chunks would commonly be broken up by paragraph, sentence, or word, for example; you'll need to tune the chunk size to your use case.
2. Generate an embedding vector for each chunk in your data set using an embedding model like text-embedding-ada-002.
3. Convert your query to a vector using the same embedding model used in step 2.
4. If you're performing similarity search, return the chunks associated with the vectors that score highest in similarity to the query vector. If you're performing anomaly detection, compare your query against the vectors for each chunk and raise a flag if the query vector scores below a certain threshold.

Let's take a look at this skills matrix. If we search for "mobile development", we can see we get matches even though the text "mobile development" is nowhere to be found in this document. Now let's search for "AWS": we get matches for various AWS services even though they don't exactly match the string we're looking for. If you look closely, you'll see the top results score lower, which is inconsistent with what we were seeing before with the dot product. This is because the dot product is only one of several formulas for measuring the similarity of two vectors. ChromaDB uses a distance formula instead of the dot product, which is why the best results show a lower number and exact matches show as zero: a lower score means closer in distance, i.e. higher in similarity, and a score of zero means your query is identical to the matched chunk.

Here's a quick overview of the code behind what we're seeing. On line 56 we're breaking up our text into chunks by either spaces or commas; lines 85 to 88 show us generating the embedding vectors for each chunk; and on lines 96 to 99 we're searching for the top 10 results that match our query.
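The video's own code isn't reproduced on this page, but a minimal sketch of the same flow using the chromadb Python client might look like the following. The collection name and skill strings are invented for illustration, and note that Chroma embeds documents with its own default embedding model unless you plug in an OpenAI embedding function:

```python
import chromadb

client = chromadb.Client()  # in-memory; use an HTTP client for a server install
collection = client.create_collection(name="skills_matrix")  # hypothetical name

# Hypothetical chunks from a skills matrix, one per skill.
skills = ["Android", "iOS", "React Native", "EC2", "S3", "Lambda", "PostgreSQL"]
collection.add(documents=skills, ids=[str(i) for i in range(len(skills))])

# The query text need not literally appear in any chunk; nearest vectors win.
results = collection.query(query_texts=["mobile development"], n_results=3)

# Chroma returns distances: lower = closer, 0 = identical to the query.
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"{dist:.4f}  {doc}")
```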
Let's talk about indexing. Indexing involves storing data in a format that speeds up future access or analysis. When dealing with larger data sets, indexing becomes necessary regardless of whether your data is stored in a SQL, NoSQL, or vector database. Here's a visualization of a collection of vectors numbering in the tens of thousands. You can imagine that, when performing similarity search against this collection, you would wait an increasing amount of time for your match as the number of vectors in the space grows into the hundreds of thousands or millions. The workaround is to group your vectors into clusters called Voronoi regions and then calculate, for each region, the vector that is closest to all the others, i.e. the region's centroid. Then, when you perform a search against the collection, you scan the vector space by comparing against only the centroids of each region, not the individual vectors within each region. Once you find the centroid closest to your query, you then compare against all the other vectors in that region to find your closest match. This technique is called inverted file index. While inverted file index may not give you the best match, in practice it achieves a good trade-off between accuracy and speed in cases where speed is more important.

Another approach for speeding up search against a large number of vectors is to approximate the vector space by reducing or compressing the number of dimensions of the vectors. This technique is called product quantization. If my suspicions are correct, vector search plus an indexing technique like inverted file index or product quantization is what's behind the recommendation algorithm used in production at Spotify. Please correct me if I'm wrong.
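To make the inverted-file-index idea concrete, here is a toy sketch (not from the video, and far simpler than a production index): k-means produces the Voronoi regions and their centroids, and the search compares the query against centroids first, then only against the vectors inside the winning region:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.random((50_000, 64)).astype(np.float32)  # toy vector collection

# Build the index: k-means partitions the space into Voronoi regions,
# and each cluster center serves as that region's centroid.
kmeans = KMeans(n_clusters=100, random_state=0, n_init=10).fit(vectors)
centroids = kmeans.cluster_centers_
region_of = kmeans.labels_  # which region each vector belongs to

def ivf_search(query: np.ndarray) -> int:
    # Step 1: scan only the 100 centroids, not all 50,000 vectors.
    nearest_region = np.argmin(np.linalg.norm(centroids - query, axis=1))
    # Step 2: exhaustive search within that single region.
    members = np.where(region_of == nearest_region)[0]
    dists = np.linalg.norm(vectors[members] - query, axis=1)
    return members[np.argmin(dists)]  # index of the (approximate) best match

print(ivf_search(rng.random(64).astype(np.float32)))
```

Because only one region is scanned, a true nearest neighbor sitting just across a region boundary can be missed; that is the accuracy-for-speed trade-off described above, and real systems typically probe several regions to soften it.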
Info
Channel: COMMAND
Views: 172
Id: EaNNRVY_pgU
Length: 14min 44sec (884 seconds)
Published: Fri Jun 30 2023