Sentiment Analysis in Amazon Comprehend Using the Python API and the Boto3 Package

Video Statistics and Information

Captions
Good morning, YouTube. In this video I'm going to show you how to do sentiment analysis on Amazon Comprehend through the Python API using the boto3 package. The plan is to go over what Amazon Comprehend is, then show you how to get API keys and access, then go over the data requirements and guidelines, and then jump into the code. Use the chapters in YouTube to jump to the place that's most interesting to you. The notebook for this project is on GitHub, and the link is in the description. Let's get started.

Amazon Comprehend is basically a collection of natural language processing models. AWS created and pre-trained those models, and each one performs a specific task. Today I'm going to talk about the sentiment detection model, but there are other models too, such as entity and topic detection. Some of the models are pre-trained only, and some are pre-trained but customizable. Document classification is an example: sentiment is a type of document classification, but you can also create your own custom classifier, for instance classifying documents into invoices versus letters versus something else, in which case you would have to create a pre-labeled dataset and retrain the model.

Let's go over the process of getting access to Amazon Comprehend. This is a quick outline, and the very helpful Amazon guides are linked in the description. The process involves creating an AWS account, which is free for one year. After that you need to create a user with the appropriate permissions to access Comprehend and S3. The user is created in the Identity and Access Management (IAM) part of AWS: search for IAM or find it in the list of AWS services, go to the Users menu, create a new user, and then give that user permissions or add them to a group that has the right permissions. Once the user exists, you can create and download access keys. These are the keys you use to configure your access to Comprehend; they come as a .csv file, and you then use the command-line interface to create a .aws folder in your user folder that holds your keys, which is what AWS uses to authenticate you.

The next thing to do is create a role for your Amazon Comprehend job to access S3. Roles are also in the Identity and Access Management menu; however, another and possibly simpler way to create one is to schedule an Amazon Comprehend analysis job yourself through the web console. You don't need a role if you just want to do real-time analysis, like analyzing one piece of text or a batch of up to 25 documents, but if you want to run an asynchronous scheduled job on AWS, the job needs S3 access. Finally, you create a virtual environment in Python and install the boto3 package; I didn't have any issues with this.

Let's talk about Comprehend costs. With the free AWS tier you get to analyze 50,000 units of text a month, and one unit is about 100 characters. In the example I'm using, every tweet is about two units, and I'm analyzing a thousand tweets, which is not a huge sample, yet that's already 2,000 units, so it eats into your limit pretty quickly. Once you've reached the limit, it's one dollar per 10,000 units. Go to the link in the description and check the most up-to-date pricing on Amazon Comprehend, because Amazon has changed the pricing in the past, or clarified the explanation enough that it looked completely different, so make sure you do that.
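As a quick illustration of the setup described above, here is a minimal sketch of how boto3 picks up the credentials created with the CLI and initializes the two clients used later in the video; the region name is an assumption, so substitute your own.

```python
import boto3

# boto3 reads the access keys from ~/.aws/credentials (created with
# `aws configure`) or from environment variables, so no keys need to
# appear in the notebook itself.
session = boto3.Session(region_name="us-east-1")  # region is an assumption

comprehend = session.client("comprehend")  # sentiment and other NLP models
s3 = session.client("s3")                  # bucket access for asynchronous jobs
```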
Let's talk about data. For this particular project I'm using Twitter data. Twitter data is popular for sentiment analysis: quick little chunks of text that you can analyze very nicely. What you need to remember is that tweets can have line breaks inside them. Usually you save all of your tweets into one big document and send it to AWS to analyze as one document per line, which means one sentiment per line of text. If there are breaks inside a tweet, Amazon will treat the pieces as different tweets, so you have to glue them back together and remove the breaks. Export the file in UTF-8 format. And if you're not analyzing tweets but other documents, up to 5,000 bytes can be analyzed for sentiment at once.

In this project I'm going over three ways to analyze sentiment: two of them are real time and one is asynchronous. They are one document at a time, 25 documents at a time, or copying the data into an S3 bucket and creating an analysis job that runs on its own, taking about 15 minutes to maybe an hour, after which you get your results back in the S3 bucket you specified.

Let's jump into the code. I have saved this notebook on GitHub, so you can download it and get the code yourself rather than relying on screenshots and looking at the screen. As far as the packages go, boto3 is the package that has both an S3 client for uploading and downloading data and the Comprehend client. I'm also using pandas to process the data frames, and json and tarfile to process the output from the scheduled analysis job. The data comes from Twitter: 1,000 tweets that have the word "walmart" in them. I saved them on my GitHub so you can get them there, and I'm re-saving them on my local drive so I can upload them to S3. So I've downloaded the data, and here it is; let's take a look at a slightly larger set. This is the Twitter data I analyzed. Take number 100: "I'll meet you at Walmart on 32nd Street." That's nice. Let's send that to Comprehend to find out what the sentiment of this particular text is.

We've seen it as a piece of text, and now we initialize our boto3 client with the service name comprehend and my region, and then we call the detect-sentiment function, where Text is our text and the language is English. It's pretty simple: those are the only two parameters. This is the output we get back, and it has three main parts. The first is the verdict: what is the sentiment? The answer is neutral. The second is the sentiment score: the breakdown produced by the model that classifies our text into the different sentiment buckets, giving the likelihood of the text being in each bucket, and there are four buckets: positive, negative, neutral, and mixed. Then we get the response metadata, with the request IDs and everything else associated with the call. The sentiment scores are a small dictionary we can pull out of this variable, and the verdict, the sentiment itself, comes from its own field.
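A minimal sketch of that single-document call is shown below; the region and the example tweet text are assumptions, and the response fields match what is described above.

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # region is an assumption

text = "I'll meet you at Walmart on 32nd Street"  # example tweet text
response = comprehend.detect_sentiment(Text=text, LanguageCode="en")

print(response["Sentiment"])        # the verdict, e.g. "NEUTRAL"
print(response["SentimentScore"])   # Positive / Negative / Neutral / Mixed likelihoods
```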
Now let's kick it up a notch and go into real-time batch processing. In a real-time batch you can analyze up to 25 documents, with the same limit of 5,000 bytes each, and you submit them as a list. So I send some of them over here; let's change it up a little. What I'm doing here is defining my text list and then sending the sentiment batch to Amazon Comprehend. Again it's very simple: there are two parameters, one is the text list and the other is the language. Let's take a look at a single record in the set that I sent: "Why is everyone and their mom always at Walmart? Don't you have Zoom meetings to attend or something?" That's on the nose.

Okay, let's take a look at what the results look like. The results have two variables: the result list, where we have the same fields that we had in the single-document call, the model scores as well as the verdict, plus another field, the index, which is the number of the line that we sent. So this one is index zero and it's neutral, and that one is index 23; the index runs from 0 through 24. Then we also have the response metadata attached. Now let's batch-process that and turn it into a nice data frame. Here it is; let's make it a bit longer. There's a mixed one, number three. Let's take a look, I'm very curious, since mixed is the least popular type, so it's interesting what we're getting here: "My books are not great yet, although each is a little better than the last, huh, but I can go to Walmart and buy worse off the shelf, so they're definitely good enough to sell." Okay, yeah, that's a mixed sentiment; interesting. So we got our sentiment batch back, and I could have just loaded it up as is, but let's do this: we parse the sentiment into the data frame and then add the text of the tweets next to it, and there it is.
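Here is a minimal sketch of that batch call and of parsing the result list into a pandas data frame; the region is an assumption, and `tweets` stands in for whatever list of cleaned tweet strings you have loaded.

```python
import boto3
import pandas as pd

comprehend = boto3.client("comprehend", region_name="us-east-1")  # region is an assumption

# up to 25 documents per real-time batch, each under 5,000 bytes
text_list = tweets[:25]  # `tweets` is assumed to be a list of cleaned tweet strings

response = comprehend.batch_detect_sentiment(TextList=text_list, LanguageCode="en")

# every item in ResultList carries an Index pointing back into text_list
df = pd.DataFrame(
    [{"Index": r["Index"], "Sentiment": r["Sentiment"], **r["SentimentScore"]}
     for r in response["ResultList"]]
).sort_values("Index")
df["Text"] = [text_list[i] for i in df["Index"]]
print(df.head())
```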
The next step is to run an asynchronous analysis job. For that we need to use S3 buckets: we upload the data first, then schedule the job, then download the results, unzip them, and process them, so it's quite a bit more involved than processing either a single piece of text or a batch of 25.

I am using the whole file, the 1,000 tweets about Walmart, and I'm specifying where the local file is located and the bucket name. This is important: within your bucket, give your input data and output data different folders, or maybe even different buckets, because you don't want to copy your results into the same folder as the input data. If you were to run another job on that same folder and include the whole folder, Comprehend would treat the results as input data. We initialize our S3 client in boto3 and upload the file into our bucket; note that the folder name is included in the file name, not in the bucket name. Once we've done that, and I've already done it so I'm not going to run it again, we can specify the parameters for a job.

A job has three types of parameters: the input data configuration, the output data configuration, and the role. You can either create the role in Identity and Access Management (IAM) in the AWS Management Console, or you can create it by scheduling an analysis job in Comprehend manually through the web interface. I actually recommend the latter, because it creates the role seamlessly and you don't have to think twice about how to do it. Once that's done, we initialize our Comprehend client with the region name and specify the parameters of the sentiment job. Here I should note that sentiment and entity detection jobs are very similar: if you just swap those words, it works just as seamlessly, and that is in fact true of other types of analysis as well, since most of the jobs take the same parameters, so all you have to do is swap that one statement.

The input data configuration says where the data is and how we define our unit of analysis: one document per line or one document per file. The difference is the level you're looking at: either lines within documents, and you can do lines within multiple documents or even multiple folders of documents, or whole files, in which case it takes each file, and as long as it's under 5,000 bytes it will process it and give you the sentiment. The output data configuration is basically the folder where you want the results to go, plus the access role. The job also takes the language parameter and a job name. I recommend that you do name the job, even though it's an optional parameter, because you don't want a list of unnamed analysis jobs in the Amazon Comprehend web interface where you go to look at the job list; if they're all unnamed, you won't know what was going on in there. What the call returns is a status structure with your job ID, and you can look up all of the parameters associated with that job with describe-sentiment-detection-job, passing the job ID: it gives you the job status and the location of the output file. I've already run these jobs, which is why when I call it, it says the job status is completed, but it takes a few minutes, so you may want to package it into a function that periodically checks the status and finishes when the job is completed.
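The sketch below shows the upload and job-scheduling steps under those parameters; the bucket name, folder layout, file name, and role ARN are placeholders rather than the values used in the video.

```python
import boto3

s3 = boto3.client("s3")
comprehend = boto3.client("comprehend", region_name="us-east-1")  # region is an assumption

bucket = "my-comprehend-bucket"  # placeholder bucket name
# the folder is part of the key, not the bucket name
s3.upload_file("walmart_tweets.txt", bucket, "input/walmart_tweets.txt")

response = comprehend.start_sentiment_detection_job(
    JobName="walmart-tweets-sentiment",  # optional, but naming jobs keeps the console readable
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3Access",  # placeholder role
    InputDataConfig={
        "S3Uri": f"s3://{bucket}/input/",
        "InputFormat": "ONE_DOC_PER_LINE",  # or ONE_DOC_PER_FILE
    },
    OutputDataConfig={"S3Uri": f"s3://{bucket}/output/"},
)
job_id = response["JobId"]

# the job runs asynchronously; poll the status until it is COMPLETED
status = comprehend.describe_sentiment_detection_job(JobId=job_id)
print(status["SentimentDetectionJobProperties"]["JobStatus"])
```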
Now we can use that job ID to get the S3 URI, basically the URL of the output file, which includes the file name. And if you're confused or just want to know what files you have in your bucket, there is a function for that too. To download the data from your bucket, you need the name of the results file, the local folder where you want to put it, and the S3 location, that is, the bucket name and the name of the file itself. Using that, you can download it and then use tarfile to unzip it. Inside the zipped-up archive there is a file called "output" with no extension. I put it into a folder called extracted output; it's basically a temporary file. I extract it there and then use json to load the results from that temporary file. Its name is "output" for any type of job, so there's no point expecting it to be named differently for anything.

Once we get it back, I load it up, and it's a dictionary with a thousand results, which is great because we sent out a thousand tweets. Now let's take a look at one result and see what the structure of the output is. The result has a thousand lines, and each one has the sentiment verdict, the sentiment score from the model, a variable called Line, which says which line it was, running from 0 to 999, and the file name. The file name is very helpful if you're processing multiple files: for example, if you want data for Walmart and Kroger and you save them into different files, you now have that information attached right there.

To process that document we're going to use Line as our index. Notice that the data isn't sorted correctly; that's because it's an asynchronous job and it writes each result as it's processed, so it doesn't have to go in the same order that a batch does. So we sort by the index, and we have 1,000 lines, 0 through 999. Just to make sure that we got our data correctly, we can choose any record, say 250, and take a closer look. We look at the text of the tweet, then do real-time processing of it to get the result, and then look up the result for that same tweet in the output file from the analysis job. "This is a huge shout-out to the Bellevue Walmart Neighborhood Market for awarding somebody with a grant. Special thanks to the store manager, much appreciated." Wow, it's a positive tweet, 99 percent, and yes, that's what we get in both of those results; it seems like they've lined up correctly.

Now we can save the results file to our hard drive, which is obviously important for further analysis. I prefer analysis in Excel, but if you want to do charting in Python, this is one way of doing it. There isn't a lot we can do with this particular analysis, but we can count the different types of sentiment; there's nothing to compare it to, and I might make a subsequent video on fancier analysis. This is just the procedure of getting the sentiment. Let's run that, and this is our output for this file: the negative, neutral, and positive sentiment counts, or we can do percentages. For this particular set of Walmart tweets we had about 11 percent positive, 57 percent neutral, and 31 percent negative.
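Putting those download-and-parse steps together, here is a minimal sketch; it assumes the `job_id` returned in the previous sketch, the region remains an assumption, and any file and folder names are placeholders.

```python
import json
import tarfile
import boto3
import pandas as pd

s3 = boto3.client("s3")
comprehend = boto3.client("comprehend", region_name="us-east-1")  # region is an assumption

# after the job completes, its output S3Uri points at an output.tar.gz archive
job_id = "YOUR-JOB-ID"  # placeholder; use the JobId returned when the job was started
job = comprehend.describe_sentiment_detection_job(JobId=job_id)
output_uri = job["SentimentDetectionJobProperties"]["OutputDataConfig"]["S3Uri"]
bucket, key = output_uri.replace("s3://", "").split("/", 1)
s3.download_file(bucket, key, "output.tar.gz")

# the archive contains a single extensionless file named "output",
# holding one JSON object per analyzed line
with tarfile.open("output.tar.gz", "r:gz") as tar:
    tar.extractall("extracted_output")
with open("extracted_output/output") as f:
    results = [json.loads(line) for line in f]

df = pd.DataFrame(
    [{"Line": r["Line"], "File": r["File"],
      "Sentiment": r["Sentiment"], **r["SentimentScore"]}
     for r in results]
).sort_values("Line")

print(df["Sentiment"].value_counts())                 # counts per sentiment class
print(df["Sentiment"].value_counts(normalize=True))   # shares per sentiment class
```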
That is the result. Thank you for watching, and if you liked this video, give it a thumbs up. If you have any questions or requests for additional things, leave a comment.
Info
Channel: Probabilistically
Views: 693
Rating: 5 out of 5
Keywords: amazon comprehend, aws comprehend, sentiment analysis, aws sentiment analysis, amazon comprehend sentiment, aws comprehend sentiment, aws api, amazon comprehend api, aws comprehend api, aws boto3, amazon comprehend boto3, aws comprehend boto3, aws python, aws comprehend python, amazon comprehend python, sentiment analysis boto3, sentiment analysis aws api, comprehend boto3 api, comprehend boto3 python, twitter sentiment, twitter sentiment analysis
Id: flCGy3p83O8
Length: 24min 21sec (1461 seconds)
Published: Thu May 20 2021