Sentiment Analysis in Python with TextBlob and VADER Sentiment (also Dash p.6)

Captions
What's going on everybody, welcome to a new tutorial, which is also kind of part one in a little miniseries. Really, all we're going to be doing in this tutorial is working through a couple of sentiment analysis libraries — out-of-the-box libraries you can just download and perform sentiment analysis with. We're going to check them both out, see what they have to offer, and see which one we want to go with. If there's some other sentiment analysis library you're curious about testing, you can run through the same exact methodology to analyze it and see what you think.

The two we're going to use are TextBlob and VADER Sentiment; you can just `pip install textblob vaderSentiment`. TextBlob is a natural language processing library built on top of NLTK, so it can do more than just sentiment analysis, whereas VADER Sentiment is just sentiment analysis. If you're curious about either of these, you can learn more about them through the links in the description — I link to the text-based tutorial there.

What I want to do is get straight to working with them and test them against known data. The known data I'm going to use is a set of short movie reviews — a positive file and a negative file. I'll put direct links to them in the description; otherwise, just go to the text-based version of the tutorial, find them, and download them — or use your own data if you have some. You can just right-click, "save as" these reviews, and we'll use them for classification. So the first thing I'm going to do is drag those files over. Cool. Now we can test both libraries against this data, and I like this data set because the reviews are kind of challenging — they're not always clear, and some of them are very sarcastic.
Some of them are just plain confusing — I couldn't have classified them myself — so it's a nice, confusing set, a real challenge to feed to a classifier. It's still interesting to see how well it can do, and not just on the whole set: is it good at filtering out the things it doesn't know? With a lot of classifiers, a benefit you sometimes get is that the classifier itself can report a confidence level when it makes a classification. If we can do that here, it would be nice to see whether, on the really challenging reviews, the classifier is at least good at saying "I don't know the answer, but I'll make a guess" — that kind of thing. Anyway, we'll see what we can find with both of these.

The first one I'm going to start with is TextBlob. We've already got TextBlob imported, and I'm just going to say `analysis = TextBlob(...)` — all you do is pass a string, so: `TextBlob("TextBlob sure looks like it has some interesting features")`. That's it. That converts the string into a TextBlob object, and we can do all kinds of things with that object. One way to figure out everything we can do is to just `dir()` it (let me make some more space here and zoom out). Obviously we get all the little dunder methods, but we can also see things like polarity and sentiment, which we're interested in; we can tokenize it, do word counts, even translate it, do part-of-speech tagging, all that kind of stuff. There's definitely a lot at our disposal.

One thing I thought was pretty cool is the translation. One option is just `print(analysis.translate(to='es'))` — you put the language code in there, and `es` is Spanish. We'll run that, and there you go.
The text has been translated to Spanish. I'm not going to read it out, but that's pretty cool. I don't know how great the translation is — Spanish isn't exactly a language of mine, though I can sort of speak it — but it looks about right, if a slightly dirty translation. Anyway, it's cool that you can do stuff like that, so there's more than just sentiment here.

Another thing we can do is part-of-speech tags: just `analysis.tags` (and let me get rid of the `dir` first). Run that and we get all these tags. I have the full list of tags in the text-based version of the tutorial, so check that out if you want to know them all, because there are quite a few. One reason you'd want these is for things like building noun phrases, but also for figuring out what's being talked about — look for the proper nouns. In this case, "TextBlob" is determined to be the proper noun, and then you can find the other nouns too. So, say this were positive sentiment: we could fairly reliably determine that there's positive sentiment toward TextBlob — and why TextBlob? Because of the features; the features are interesting. You can start to do a lot of really cool analysis that way.

But this tutorial is about sentiment, so let's do `analysis.sentiment` and run that. We get two scores: polarity and subjectivity. I want to make sure I'm right on this, but I believe polarity runs from -1 to +1 — -1 being negative sentiment, +1 being positive — and subjectivity is a degree from 0 to 1, where 0 is very objective and 1 is very subjective. I'm not finding it in the docs just now, but I'm pretty sure that's about correct.
So now let's test these things. The code I've written to do that is very simple, so I'm just going to copy it from the text-based tutorial — I don't see much value in typing it out — and I'll just explain it. From `analysis.sentiment` we saw that it produces those two values, and we can access the two attributes with dot notation, so we say `.polarity` to reference that sentiment number. Then all we're doing is opening the files, reading through them, and if the polarity is above zero, we say, okay, that should be positive — and if we got that right, add one to `correct`, because this is the positive file and everything in it should be positive. Right or wrong, we also do `pos_count += 1`. Then we do the same for the negative file, and at the end we calculate the percentage accuracy.

Let's run that real quick — it should be relatively fast. There we go. In this case I got a positive accuracy of 71%, which is not horrible, but the negative accuracy is 55%. That's above chance, but on a sample size this small we probably can't make that determination yet — and even if it were a solid 55%, that's a little too close to random for my taste.

So what could we do? Well, you could start playing with where the line for "neutral" is — a neutral zone. Rather than saying "above zero," what if we said it needs to be above 0.2, or below 0.2? It looks like maybe we're weighted too far toward the positive, so you could play with that to try to get the two accuracies roughly equal. But as we can see, even with that tiny tweak, it's now heavily weighted toward negative.
You can continue tweaking, but we're clearly not making any real gains by raising that bar. In this case we're at 60% and 68%, and neither of those — nor anything in between — is a number I'd be happy doing sentiment analysis with to any serious degree.

The next thing we could do is take subjectivity into account. Let's return the threshold to zero and ask about subjectivity instead. I'm going to copy and paste this line and say the subjectivity needs to be less than 0.3 — we want to be close to objective — and we'll just throw out everything else, to see whether there's a way to gather sentiment at least on the things we know we're getting right. Let's run that one and see. It looks like negative accuracy goes up significantly, although our sample size has been seriously diminished — down to something like a fifth.

The other option is to go the other way: what if we require it to be more subjective? To me that would seem more dangerous — the more subjective the text, the less accurate we could possibly be. Whoops — what I meant to do was say it needs to be greater than 0.8; that's why we got more samples. Okay, running it again... wait for it... and this time we did somewhat better, which is odd. Clearly it looks like we don't want to be too close to either fully subjective or fully objective. Honestly, I'm having a hard time understanding how we're supposed to value subjectivity, and either way we're simply not getting enough samples, so I don't see how this is going to be of much use. At this point it's getting hard to extract any real value here, so after running through TextBlob, why not run VADER Sentiment instead?
With VADER Sentiment, we're going to build this up pretty similarly. Most of the script stays the same — I'll just copy and paste, and we don't need this part twice. Here we define our analyzer, and then to do basic sentiment with VADER it's something like this — I'll copy it over: `vs = analyzer.polarity_scores("VADER Sentiment looks interesting, I have high hopes!")`, and then we print that out so you can see what to expect from this package. When we run it, we get negative sentiment, neutral sentiment, and positive sentiment, plus this "compound" score.

If we read the VADER Sentiment documentation, it should tell us what each of these means — here we go, the section about the scoring. Basically, they're saying the compound score is your most useful metric if you just want a single number to measure sentiment. But obviously we have a few other things too: the compound score is a combination of the components, and the components might be useful on their own. In this case the result looks pretty clear — we only have positive and neutral sentiment, with no negative sentiment muddying the waters, so we can probably be more confident in that compound score. So even though we could use compound — and we will test compound — it looks like we have a few extra signals at our disposal that can help us detect when something isn't quite right. Anyway, let me clear this out... `analyzer.polarity_scores(...)`... cool, cool, cool. Okay, I think we're all set.
Let me get rid of this down here. So this is the code I wrote for testing VADER sentiment — it's basically the same as what we had for TextBlob. We're just asking: is `vs['compound']` — which they said is the good single metric to work with — greater than zero? If so, congratulations, you got "positive" right. If it's less than or equal to zero, that counts as negative, and so on. Now let's run this script. We get something a little less strangely weighted toward one side or the other, but still not really acceptable — not something I'd personally want to go off of: 69 and 57 is just not that great.

But if we check the documentation, it basically says anything between -0.5 and +0.5 should be called neutral: positive sentiment is a compound above 0.5, and negative sentiment is below -0.5. So let's try that. I've already written it out — it would be a few simple changes anyway, but it's quicker to copy-paste. There's a double import, and in fact I'm going to comment out the library we're not using, which seems sensible. What I've done here is build a `threshold` variable, and then we ask: if the compound is greater than or equal to the threshold, or less than or equal to the negative of the threshold, we use it; otherwise, if it doesn't meet the threshold, we don't care about it — we toss that data out. Let's go... oh no, what have I done? I just wanted to run that... I think it ran... it copied it over; let me just write it one more time. Cool. In this case we get 87% accuracy for positive and 50% for negative. That's pretty good on the positive side, but what if we add another requirement?
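The thresholding itself is plain Python, so here's that rule as a standalone sketch — 0.5 is the cutoff from the VADER docs, and returning `None` means the sample gets tossed:

```python
def classify_compound(vs, threshold=0.5):
    """Label a VADER polarity_scores dict by its compound score alone."""
    if vs["compound"] >= threshold:
        return "positive"
    if vs["compound"] <= -threshold:
        return "negative"
    return None  # inside the neutral zone: throw it out
```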
Specifically, a requirement that we don't want any opposing sentiment in the mix at all. The ±0.5 threshold was basically VADER Sentiment's own suggestion, and honestly — 50% on negative is horrible, and we also tossed away tons of our samples, so that's not good. So now what I want to try is ignoring compound entirely and working with the components instead: for a negative classification, we want more negative than positive, but we also don't want much positive at all, if any.

I'm going to copy over a script that does basically that — again, all of this is in the text-based version; I'm just running through the methodology I used to compare these two. So what we're doing here is: if the negative score is not more than 0.1, and positive minus negative is still greater than or equal to zero, we say the positive classification is correct. Then down here for negative: if the positive score is not greater than 0.1, and positive minus negative is less than or equal to zero, we say the negative is correct as well. Let's run that one — and we can see we've done much, much better.

One problem I see with this immediately, though, is that in theory a review could be classified as both positive and negative, because both conditions use "or equal to" around zero. Let me test that real quick — I didn't really think about it as I was going through. Yep: as we can see, it makes a massive difference if you allow both sides to include the "equal to" case. We could flip the other one around instead, and I was fairly confident we'd see the exact same issue — but luckily, in this case, it actually isn't a problem.
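Here's that per-component rule as a sketch, written with a strict `> 0` on the positive side so a review sitting at exactly zero can't be counted as both positive and negative (the overlap bug just described); note that an all-neutral review still falls through to "negative" under this rule, which matches the `<= 0` convention used here:

```python
def classify_components(vs, opposing_limit=0.1):
    """Label a VADER scores dict using the pos/neg components instead of compound."""
    # positive: hardly any negative signal, and net sentiment strictly positive
    if vs["neg"] <= opposing_limit and vs["pos"] - vs["neg"] > 0:
        return "positive"
    # negative: hardly any positive signal, and net sentiment zero or below
    if vs["pos"] <= opposing_limit and vs["pos"] - vs["neg"] <= 0:
        return "negative"
    return None  # too much opposing sentiment: toss it
```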
So at least in this case we can go with strictly greater than zero for the positive requirement and less than or equal to zero for the negative requirement. Good enough — I'm off to update the text-based tutorial to reflect that, because you don't want to keep making those mistakes.

Now, one thing we haven't done: we didn't give TextBlob the same treatment. All we did with TextBlob was test it with a moving center point — we never gave it a neutral zone. So the next thing I want to do is test TextBlob with a neutral zone. I'll copy and paste this in: now the polarity needs to be greater than or equal to 0.5 for positive, or less than or equal to -0.5 for negative — and within the polarity requirement, one comparison is strictly greater than zero and the other is less than or equal to zero, so there's no overlap like we temporarily had in the other one. Let's run that. Okay — in this one we have 100% accuracy, interestingly enough, though not many samples.

So what if we lower the threshold? Let's try 0.2, for example, and run that. Now we get many, many more samples, but we're still losing quite a few. Let me try one more — 0.1 — and see how many more we gain. We gained quite a few, especially on the positive side; negative is still chopping away quite a few. One more run... okay, still 100% accuracy. So interestingly, the real kicker in this one is the stupid zero: it looks like TextBlob classifies a lot of things as exactly 0.0 sentiment polarity, which was causing a lot of problems. That's why the "less than or equal to" made such a massive difference when we moved it around — especially for TextBlob, less so for VADER Sentiment, it appeared.
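The same neutral-zone idea applied to TextBlob's polarity, as a sketch (0.1 is the last threshold tried in the video; note that the exactly-0.0 reviews always land in the tossed-out middle here, which is precisely the zero problem being described):

```python
def classify_polarity(polarity, threshold=0.1):
    """Label a TextBlob polarity with a symmetric neutral zone around zero."""
    if polarity >= threshold:
        return "positive"
    if polarity <= -threshold:
        return "negative"
    return None  # neutral zone — includes the troublesome exactly-0.0 reviews
```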
As you can see, with these rules we're doing pretty well, and clearly the whole issue was the zeros. So that's basically the two sentiment analysis libraries compared, and it looks to me like TextBlob is slightly better — but the other thing I haven't really talked about is speed. TextBlob right now takes me about 6.7 seconds to run through these two files, which is a little over 10,000 samples in total. A solid 6.7 seconds. Now let me find a decent version of the VADER sentiment script — copy, paste — and we'll run the VADER example: this one runs in 3.3 seconds. Don't forget, though, this is the version where we're careful with the "less than or equal to zero" so there's no overlap, which makes it a fairer comparison. We could keep tweaking this one as well, and we might find the same story with either library.

I think TextBlob is probably slightly more accurate, but VADER Sentiment runs at roughly twice the speed, so it's really going to depend on your needs. To me, as long as we're avoiding that zero mark — either classifying it properly or giving it a neutral classification — TextBlob looks superior. I'm also interested in TextBlob for doing other things besides sentiment — maybe part-of-speech classification or whatever — but I'm planning to feed this into a database, which is why I haven't fully decided which one to use. Honestly, they're very interchangeable, as you've seen; we just need to figure out what rules we want to classify with. Anyway, that's all for now. If you have questions, comments, concerns, whatever, feel free to leave them below; otherwise, I'll see you in the next tutorial.
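For the speed comparison, a simple wall-clock timing sketch like this is all that's needed (the ~6.7s vs ~3.3s figures are from the video's machine, of course, and the wrapper functions named in the comments are hypothetical):

```python
import time

def time_it(label, fn):
    """Run fn() once and report the elapsed wall-clock time."""
    start = time.time()
    fn()
    print(f"{label}: {time.time() - start:.1f} seconds")

# Hypothetical usage, with each function wrapping one full accuracy pass
# over the two review files:
# time_it("TextBlob pass", run_textblob_test)
# time_it("VADER pass", run_vader_test)
```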
Info
Channel: sentdex
Views: 61,545
Rating: 4.9332638 out of 5
Keywords: sentiment, analysis, textblob, vader sentiment, vadersentiment, Dash, Python, programming, tutorials, data visualization, react, GUI, application, data analysis
Id: qTyj2R-wcks
Length: 23min 25sec (1405 seconds)
Published: Tue Feb 27 2018