Sentiment Analysis Project with LLMs | ChatGPT & Gemini API

Captions
In this video we're going to perform sentiment analysis using the Google Gemini and OpenAI ChatGPT APIs in Python. Both platforms offer APIs that are essentially free to try: we have a separate video on how to fetch a free Google Gemini API key from Google's MakerSuite platform, so do check that out, and for the ChatGPT API, OpenAI gives $5 worth of free credits on new accounts, which you can use up here.

Sentiment analysis is one of the most interesting data science and AI problems we all like solving. On our channel, Analytics Vidhya, we ran a three-part series on it: traditional machine learning approaches in part one, embeddings for feature extraction (Word2Vec, GloVe, and fastText) in part two, and deep learning for sentiment classification in part three. Across that series we got classification accuracy of around 85-90%. In this video we're going to classify sentiments in a whole different way: using the power of large language models.

Here's the plan. We'll use the Amazon Alexa customer reviews dataset, which has binary classification labels: positive (label = 1) and negative (label = 0). For the hands-on part we'll first do data preprocessing in Python, and then generate sentiments on the customer reviews using both the Gemini Pro and ChatGPT APIs. For prompting, we'll try out zero-shot as well as few-shot prompting and observe model performance on each. Now let's get started with the hands-on.

We're in our Google Colab environment, connected to a regular runtime; we don't need a GPU for this project. By the way, I'll share a link to this notebook in the description of this video for your quick reference.

First, let's import the dataset, which I've kept in a project folder within my Google Drive. Just so you know, this is project number five in our GenAI hands-on series. Before this we've done several such projects: building a chatbot using the Bard API, working hands-on with open-source models like Falcon-7B and Dolly-3B, building a code explainer using the PaLM API, and recently showing how to run the Mixtral 8x7B model in the free Colab environment. Links to all the previous tutorials are provided in this project's documentation.

Just to reiterate, the dataset for this project is amazon_alexa.tsv. To import it into our Colab environment, we first mount Google Drive, then navigate to the directory where the sentiment analysis project is located; the ls here is just to double-check that everything is in place. We can see our dataset there, so we're good to go. Now let's import the dataset: for this tutorial we're using amazon_alexa.tsv.
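As a rough sketch of the loading step just described (the Drive path is a hypothetical placeholder, and the column names follow the public Amazon Alexa reviews dataset; adjust to your own setup):

```python
# Sketch of loading the TSV into pandas. In Colab you would first mount
# Drive and cd into the project folder, roughly:
#   from google.colab import drive
#   drive.mount('/content/drive')
#   %cd /content/drive/MyDrive/sentiment_analysis_project   # hypothetical path
# Here a tiny inline sample stands in for the real file.
import io
import pandas as pd

sample_tsv = (
    "rating\tdate\tvariation\tverified_reviews\tfeedback\n"
    "5\t31-Jul-18\tCharcoal Fabric\tLove my Echo!\t1\n"
    "1\t31-Jul-18\tCharcoal Fabric\tItem no longer works.\t0\n"
)
# The real call would be: pd.read_csv("amazon_alexa.tsv", sep="\t")
data = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(data.shape)  # (2, 5)
```

The key detail is `sep="\t"`, since the file is tab-separated rather than comma-separated.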
We'll read it into a pandas DataFrame and take a look at the first few rows. As you can see, this dataset contains information about Alexa reviews, and we're particularly interested in two columns: verified_reviews and feedback, the binary sentiment label attached to each customer review. To stay focused on the data of interest, we create a new DataFrame called my_data with just these relevant columns and rename them for clarity. Here are the first few rows of this DataFrame, just to check that everything is on track.

Next, let's look at the distribution of feedback labels, i.e. the counts of positive and negative sentiments across the two classes. As we can see, our data is fairly imbalanced. To tackle this we use downsampling (undersampling): since we're not training a machine learning model here but using a language model for prediction, we don't need tons of data for training. In fact, with zero-shot prompting we need no training data whatsoever; we rely solely on the model's inherent knowledge for predictions, so downsampling is fine in this particular case. To balance the dataset, we count the occurrences of the two labels, positive and negative, and then take a sample from the majority label to match the minority class. As you can see, our dataset is now as balanced as it can get.

As a next step, it's crucial to prepare our data before we move on to predictions, so this next section focuses on data preprocessing, a critical step to ensure our text data is in optimal shape for analysis. First, let's import the necessary library. Then we define a function called clean_text to clean our data: it removes special characters, single characters, and HTML tags, converts text to lowercase, and eliminates extra whitespace too. Let me initialize this function.

The next step is to call clean_text on our verified reviews. We start by extracting the reviews from our balanced DataFrame into a list called reviews, then iterate through each element of the list and call clean_text on it, and finally ingest the cleaned reviews back into our balanced data. Let's have a look: as you can see, the cleaned reviews have been added to the balanced DataFrame.

In this next step we split our data into a 5% training set, a handful of reviews we'll use for few-shot prompting later on, and the remaining 95% as the test set on which we'll ask the language model to predict sentiments, using the results to gauge prediction accuracy. Let's run this as well.

Now we've come to the exciting part, where we set up the Gemini model API. You'll need a Gemini API key for this part; if you don't have one, refer to our earlier video. First things first, let's install the necessary package and import the required libraries. Next we define a utility function, to_markdown, which converts text to Markdown format; let's initialize this code.

At this point, let's configure the Gemini API with our API key. I've stored the Gemini API key in my Colab environment under the name GOOGLE_API_KEY (I also have an OpenAI API key there) and given this notebook access to both keys, so you can do that too; I then retrieve the keys in code using userdata.get. You can also pull a list of the models available through the Gemini API; these are the models we have. In our case we'll be using the Gemini Pro model for our sentiment classification task, so we're declaring that here:
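A minimal sketch of that setup and the zero-shot prompt follows; the package and model names (google-generativeai, "gemini-pro") are taken from the video and may have changed since, and the live call is commented out because it needs an API key:

```python
# Zero-shot sentiment prompt for Gemini, as described above.
import json

reviews = ["love my echo", "item stopped working after two weeks"]

# Reviews go in as JSON; predictions are requested back as JSON.
prompt = (
    "Act as an expert linguist and classify each customer review below as "
    "positive (label 1) or negative (label 0). Return the predictions as "
    "JSON, with one pred_label per review.\n"
    + json.dumps([{"review": r} for r in reviews], indent=2)
)

# The live call would look roughly like this (needs an API key):
# import google.generativeai as genai
# from google.colab import userdata
# genai.configure(api_key=userdata.get("GOOGLE_API_KEY"))
# model = genai.GenerativeModel("gemini-pro")
# response = model.generate_content(prompt)
print("pred_label" in prompt)  # True
```

Asking for JSON output is what makes the response easy to parse back into a DataFrame later.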
the model is gemini-pro. Now it's time to make an API call to generate text, in this case answering the age-old question: what is the meaning of life? As you can see, we're able to generate text from the Gemini model through our API.

Now let's integrate the Gemini API with our sentiment analysis task. For this we first convert a sample of our test set to JSON format. If you recall, we put 95% of our balanced data into the test set; now we're just picking some 20 records from there into this test set sample. Let's do that. We also add a new column called pred_label, where we'll later store the predictions generated by the language model. Next we convert our samples to JSON format with this line of code.

Then comes the prompt we've written, where we tell the language model to act as an expert linguist and help classify customer sentiments into positive and negative labels. We also specify which is which (positive is label 1, negative is label 0) so that the model doesn't confuse itself, and finally we ask the model to generate predictions and return the final output in JSON format. Let's initialize the prompt; it contains the review samples, in JSON, that we're asking the language model to generate predictions on.

Finally, we feed our prompt to the Gemini model's generate_content method for sentiment prediction. It might take a couple of minutes, so be patient. Once we've received the response from the model, we clean up the JSON data from the API response and dump it into a pandas DataFrame; this is the code to do that. These are the cleaned reviews we sent to the model, and these are the predictions generated by the Gemini Pro model. As a last step, we ingest the predicted labels into our original test set sample and compute the accuracy. Here we go: as we can read from the confusion matrix, all 20 reviews are correctly classified into positive and negative labels, so we're getting 100% accuracy with Gemini Pro on a single API call. And there you have it: we've seamlessly integrated the Gemini API into our sentiment analysis workflow. If you're enjoying this tutorial, don't forget to like the video and subscribe to our channel so you see more of our data and tech content in your YouTube feed.

So, we've generated sentiments on customer reviews using the Google Gemini API, and it works like a charm. Next we're going to learn another important aspect of generating sentiments from large language models. Since LLMs have prompt length constraints, there's a limit to how many reviews we can send in a single API call for predictions. When we have sizable data to work with, we go for batch processing: we divide our data into small batches that fit in a single API call, hit the model API multiple times to process all the batches iteratively, and then collate the results. Let's see how to do that in this part of the code. Just for the sake of variety, in this next part I'm using the OpenAI ChatGPT API, so let's dive in.

We start by configuring the OpenAI API. I'm pulling the OpenAI API key from my Colab environment using userdata.get, the way I showed you for Gemini. To test the API I've written this get_completion function; it has two arguments, the prompt and the model, with GPT-3.5 as the default. Now let's feed a prompt to the API and check the response. We're indeed getting a response, so it's working fine.

Now let's move on to batching the API calls for the ChatGPT API. For the sake of time I'm again taking a sample of our test set. Here it is; it
has around 488 records, and I'm picking a handful, 100, from it. Again I add a blank column called pred_label, which I'll use later for dumping the model predictions back in. Then we set the batch size to 50 and divide our data across batches; since we have 100 samples, we get two batches of 50 each.

Next we define this gpt_completion function with these arguments; it's very similar to the function we wrote above for Gemini. Within the function we pass in a batch, the data from the batch (the 50 samples) is converted to JSON, and we've drafted a prompt that has this JSON data ingested into it. Finally we call the OpenAI model with this prompt to generate predictions. Let me initialize this function.

Now we call this completion function on our batches and dump the model responses into a list called responses. Let's run it; again, it might take a minute or two depending on your data, so be patient at this step. Once our predictions are generated, as a last step we clean up the JSON data from the API responses and dump it into a pandas DataFrame we're calling df_total. Let's run it; as you can see, this is how the DataFrame looks. Finally, we ingest the predicted labels back into our original test set (the 100-record sample) and test for accuracy. As you can see, we're getting a healthy 91%.

So, how cool is that on a scale of 1 to 10? We didn't have to develop a machine learning or deep learning model, or worry about text-to-numeric representations, embeddings, and so on. All we did was configure the language model APIs for ChatGPT and Google Gemini and generate predictions within a span of minutes. Do let me know what you think about this in the comment
section below and remember these are the results with zero short prompting can the model do any better if we show it a few examples from our data set with the help of a few short prompting maybe let us find out in this last section of our code So within this section for bashed API calls to chat GPD API with few short again uh this is the test set we have 95% % of the Balan data uh we are picking 100 samples from it just like uh we did in the previous part creating uh two batches of 50 and then we are writing our GPT completion function now in this particular function there is a small change that apart from feeding the batches as argument now we are also feeding a train sample right so this train sample is basically some handful of uh 5 10 uh examplers that we are picking from the training set that that we uh sort of created above which was 5% of the total balance data uh and within this function we are uh uh uh converting uh the uh batch data as well as the uh sample data into Json and then uh we are also sort of having our prompt over here having uh the reviews on which uh predictions are to be generated as well as the sample Json data which is contributing to the few shorts and finally we call the open AI model for predictions within this function now let's initialize this function all right and finally we are calling this gbd completion function on our batches iteratively and within this as you can see from the train set which was 5% of balance data we have picked up some random 10 uh records as sample and then we are feeding that to uh the function as an argument let's run this one now remember the only change we have done from the prev part to hear is that we have uh uh employed few short learning for our uh uh model and uh we should be getting better results with this particular approach let's see what we get all right uh we are done with the generation part on uh the label prediction now let's uh clean up uh the Json response and dump it back to our uh 
original DataFrame, the test set total, and again compute the accuracy. So here are the results: with few-shot prompting we're getting accuracy quite similar to what we got with zero-shot. Do try it out on your end and let me know about your experience; what accuracy level did you get? Let's discuss more in the comment section of this video.

So that's all we had for you in today's video. If you have any questions or face any difficulty with this code, let me know in the comment section below and we'll get back to you. If you want to see more such videos, do subscribe to our channel, and if you'd like us to cover certain topics, share them in the comment section as suggestions and we'll pick them up for you. I'll see you in the next video; till then, bye!
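The batching scheme described above (100 reviews split into two API calls of 50 each) can be sketched as follows; get_completion stands in for the actual OpenAI call, which is stubbed out here so the sketch runs offline:

```python
# Sketch of batched prediction: split the sample into fixed-size chunks,
# call the model once per chunk, and collate the responses.
import json

def make_batches(items, batch_size=50):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def get_completion(prompt):
    # Stub for the real API call (an OpenAI chat completion in the video).
    # It just echoes a dummy label for every review found in the prompt.
    reviews = json.loads(prompt.split("\n", 1)[1])
    return json.dumps([{"review": r["review"], "pred_label": 1} for r in reviews])

reviews = [f"sample review {i}" for i in range(100)]
responses = []
for batch in make_batches(reviews, 50):
    prompt = ("Classify each review as positive (1) or negative (0):\n"
              + json.dumps([{"review": r} for r in batch]))
    responses.append(get_completion(prompt))

print(len(responses))  # 2
```

Keeping the batch size below the model's prompt length limit is the whole point; the loop then collates one response per batch before the labels are merged back into the DataFrame.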
Info
Channel: Analytics Vidhya
Views: 9,923
Keywords: analytics vidhya, data science analytics vidhya, analytics vidhya data science, #datascience, Sentiment Analysis project, Machine Learning Project end to end, Machine Learning project for beginners, data science project with code for beginners, Sentiment Analysis on Amazon Reviews, End To End NLP Project Implementation, sentiment analysis, sentiment analysis tutorial, sentiment analysis using machine learning, sentiment analysis amazon reviews python
Id: 3z0a3Ymwj1k
Length: 17min 49sec (1069 seconds)
Published: Sat Feb 17 2024