Are OpenAI Embeddings Better than TF-IDF Vectors?

Video Statistics and Information

Captions
Hello! In this video we are going to walk through a Python script that showcases how to retrieve OpenAI embeddings. We will also run a test of which is better for a document clustering application: OpenAI embeddings or TF-IDF based vectors. Let's get started.

We will be using the openai library; make sure to set your OpenAI API key, which you can get from platform.openai.com.

I wrote two functions here: get_first_n_tokens and get_chunks_from_text. These functions process the text content to make it suitable for OpenAI, because the OpenAI model has certain token limits. In get_first_n_tokens, n is the token limit provided by the calling function. A token is generally about four characters in English, so get_first_n_tokens returns the first 4n characters of the given text. I also do a little pre-processing by removing runs of underscores and runs of spaces from the text. I ended up not using this function, but I am keeping it here so that others who need it can use it.

In this particular program I mostly used get_chunks_from_text. This function creates chunks of text of 4n characters each, so in practice the chunks variable is a list where each chunk has about 4n characters, and I made sure that no words are split between chunks. The function returns the chunks as a list (sketched below).

The function retrieve_embedding_no_chunking receives a text, calls the OpenAI API, and retrieves the embedding for that text. I am using the model text-embedding-ada-002, a second-generation embedding model, to retrieve embeddings from OpenAI. I also ended up not using this particular method, but I am keeping it here so that others can use it.

What I mostly used is the method retrieve_embedding (also sketched below). It calls get_chunks_from_text with the text and whatever token limit is provided, and stores the chunks in a variable. For each chunk it retrieves the embedding from OpenAI and keeps it in a list of embeddings, and then it averages the embeddings of all the chunks to get one overall embedding for the entire text that came in through the text parameter. That means even if you have a large document with more tokens than the token limit, this function can still create an embedding by averaging the embeddings of all the chunks created from that text.

I ran retrieve_embedding on some sample text and checked the length of the embedding: it is 1,536. Whenever we send text using this model, it returns an embedding array of length 1,536.

For the actual experiment we used the 20 Newsgroups dataset, but only two of its categories: talk.politics.misc and rec.sport.baseball. In case you are not familiar with it, 20 Newsgroups is a standard classification and clustering dataset with 20 categories of documents; for my test I am only using documents from the politics and sports categories. Now, why am I only using two categories instead of all 20? Because I did not want to spend much money retrieving the embeddings through OpenAI's API calls. Each API call costs some money; it may be a very small amount, but it is some amount.
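The script itself is only shown on screen in the video, so here is a minimal sketch of what get_chunks_from_text might look like, assuming the ~4-characters-per-token heuristic and the underscore/whitespace cleanup described above (the exact implementation in the video may differ):

```python
import re

def get_chunks_from_text(text: str, n_tokens: int) -> list[str]:
    """Split text into chunks of roughly n_tokens tokens each,
    using the ~4-characters-per-token heuristic, without splitting words."""
    # Light cleanup: collapse runs of underscores and whitespace
    text = re.sub(r"_+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()

    max_chars = 4 * n_tokens  # rough heuristic: ~4 English characters per token
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for word in text.split(" "):
        # +1 accounts for the joining space
        if current and current_len + len(word) + 1 > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(word)
        current_len += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```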
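And a sketch of retrieve_embedding built on top of it. This uses the current openai Python client (version 1.0 or later); the video predates that client, so the original code most likely used the older openai.Embedding.create call. The default token limit of 7,000 is an assumption, chosen to stay under text-embedding-ada-002's 8,191-token input limit:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def retrieve_embedding(text: str, n_tokens: int = 7000) -> list[float]:
    """Embed arbitrarily long text by embedding each chunk and averaging."""
    chunks = get_chunks_from_text(text, n_tokens)
    embeddings = []
    for chunk in chunks:
        resp = client.embeddings.create(model="text-embedding-ada-002", input=chunk)
        embeddings.append(resp.data[0].embedding)
    # Element-wise average across chunks; ada-002 embeddings have 1536 dimensions
    return np.mean(embeddings, axis=0).tolist()

print(len(retrieve_embedding("Some sample text")))  # 1536
```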
For the 1,731 documents in these two categories of the 20 Newsgroups dataset, I spent only a few cents, I think 10 to 15 cents. Note that I used the full documents, which is why it was 10 to 15 cents; if you only use the first chunk of each document, it will be much less.

One suggestion: when you use a commercial API like this, please make sure that you set a usage limit. Then, even if an error in my code keeps sending too many long documents to OpenAI, the loss will be no more than ten dollars. I set this in the usage limit section so that I do not pay more than ten dollars per month; once the API calls reach the usage limit, the calls from my Python code start to receive an exception. Also, whatever dataset you have, retrieve the embeddings just once and save them on your computer. That is what we are going to do in our program today.

There are functions in scikit-learn to fetch the 20 Newsgroups dataset; again, we are only fetching two categories. We keep the text in one list and the labels in another list, remove any empty documents from the text list, and remove the corresponding labels as well. Here I am printing some of the documents just to check that everything is working.

In this part of the code I create a list called data, and in it I keep the i-th document's content and the i-th label, meaning either politics or sports. In this context politics is index 0 and sports is index 1, so the labels are 0 for politics and 1 for sports. Then whatever embedding is retrieved from OpenAI for the i-th document is kept in an embedding variable and eventually added to data along with the i-th document and the i-th label. After this loop executes, the list data holds all the document contents, their labels, and their embeddings; we processed 1,731 documents, and for each document we have its embedding. In this part we save the data variable to a JSON file. After that, we can read everything, whether it is the text, the label, or the embedding list of each document, from the JSON file, so we do not have to make those API calls anymore.

Now I retrieve all the embeddings for all the documents from the JSON file and apply k-means clustering on them with k = 2. Then I compute the adjusted Rand index to find out how similar the predicted clusters are to the labels we have. The adjusted Rand index is quite high, around 0.78.

Now the question is: what if we cluster the documents without using embeddings, using something like a TF-IDF based vector space instead? Let us see. I gather the text content of all the documents in a variable called documents, then I use scikit-learn's feature extraction facilities, the TfidfVectorizer, on all those documents to create TF-IDF vectors. We get 18,807 features, that is, 18,807 unique words. We apply the same k-means algorithm, again with two clusters, but this time on the TF-IDF vectors of all those documents, and we compute the adjusted Rand index. This time the adjusted Rand index is quite low: 0.03. With multiple runs of k-means I have seen it improve a bit, to around 0.05, but never close to the 0.78 that we saw for the clustering with OpenAI's embeddings. In this experiment, OpenAI's embeddings win. The steps above are sketched in the snippets that follow.
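A sketch of the data-collection step described above: fetching the two categories, dropping empty documents, embedding each document with the retrieve_embedding sketch from earlier, and saving everything to a JSON file. The record layout and the file name embeddings.json are assumptions, not taken from the video:

```python
import json
from sklearn.datasets import fetch_20newsgroups

categories = ["talk.politics.misc", "rec.sport.baseball"]
newsgroups = fetch_20newsgroups(subset="all", categories=categories)

# Keep only non-empty documents, together with their labels.
# Note: scikit-learn sorts category names, so check newsgroups.target_names
# for the actual label-to-index mapping.
texts, labels = [], []
for doc, label in zip(newsgroups.data, newsgroups.target):
    if doc.strip():
        texts.append(doc)
        labels.append(int(label))

# One record per document: text, label, and the averaged OpenAI embedding
data = [
    {"text": text, "label": label, "embedding": retrieve_embedding(text)}
    for text, label in zip(texts, labels)
]

# Save once so the paid API calls never have to be repeated
with open("embeddings.json", "w") as f:
    json.dump(data, f)
```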
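Clustering the saved embeddings and scoring against the true labels might then look like this (random_state and n_init are illustrative choices; as the video notes, k-means results vary across runs):

```python
import json
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

with open("embeddings.json") as f:
    data = json.load(f)

embeddings = [record["embedding"] for record in data]
labels = [record["label"] for record in data]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)

# The video reports an adjusted Rand index of about 0.78 here
print(adjusted_rand_score(labels, kmeans.labels_))
```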
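And the TF-IDF baseline, with no stop word removal or stemming, matching the setup described in the video:

```python
import json
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

with open("embeddings.json") as f:
    data = json.load(f)

documents = [record["text"] for record in data]
labels = [record["label"] for record in data]

vectorizer = TfidfVectorizer()  # default settings: no stop words removed, no stemming
tfidf_vectors = vectorizer.fit_transform(documents)  # ~18,807 features in the video

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf_vectors)

# The video reports roughly 0.03, improving to about 0.05 across runs
print(adjusted_rand_score(labels, kmeans.labels_))
```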
It is still questionable that I did not remove stop words or stem the words before applying TF-IDF vectorization. In an ideal experimental setting for TF-IDF, after stop word removal and stemming and then applying TF-IDF vectorization, one could probably also reduce the dimensionality a bit using principal component analysis. Anyway, that would not make TF-IDF based vectors contextual the way embedding-based vectors are. In my experience, embedding-based vector representations are far superior to conventional vector space modeling. OpenAI's embeddings cost money, but the embeddings are really cool. Depending on your project, you can choose which sort of vector space modeling to use. You can also check Hugging Face for free models that generate embeddings, but that is a topic for another time. See you soon in another video!
Info
Channel: Computing For All
Views: 389
Keywords: OpenAI, OpenAI embeddings, TF-IDF, OpenAI Embeddings vs TF-IDF, AI Technology, AI tools and packages, embeddings, vector database, open ai, ai, openai embedding, embedding openai, embedding, vector openai, openai vector, text embedding, openai api, what are embeddings, what is embedding, embedding function, programming, create embeddings, search embeddings, pdf embeddings, text embeddings, creates an embedding vector, embedding vector, openai embeddings python, word embeddings
Id: CTYWhIUhx-4
Length: 10min 42sec (642 seconds)
Published: Mon Aug 28 2023