All right. Hey, everyone. I propose we get started. The first bake-off has begun. You should have gotten an email from Piazza. Uh, I think I'll just go through my posting really quickly, um, in case any questions arise. I think it's straightforward, but let's just make sure everyone is clear on this, because the time window is tight. You just have until Wednesday at 4:30. So here's the procedure. The test data are at this link, uh, so download that, and unzip it, and then move the whole directory into your WORDSIM data directory, whatever you have set for WORDSIM_HOME in the, um, homework notebook. It's probably just data WORDSIM. So move the data in there so your system can find it. That's step 1. And note this: move the directory, not just its contents, unless you wanna have to fiddle with this reader code. Then open your completed homework 1 notebook. You should have already submitted that, but now you're gonna add to it. Um, find the cell near the bottom that begins with, "enter your bake-off assessment code into this cell." Let's call that the bake-off cell. And then you just paste in this blob of code here. It's just two new readers. You can tell that these datasets are the mturk287 dataset, which I think in this literature is thought of as a relatedness dataset, and SimLex-999, which I think is a similarity dataset. So if your own model favored one or the other kind of problem, well, it's evenly matched here, because we're gonna macro-average the score from these two test datasets. And you can see here, I've gathered the two readers into this tuple called bake-off. So that's the next step. Now, let's suppose that the custom model you developed for part, you know, for part four, let's suppose you call that data frame custom_df. Then all you have to do is, in the bake-off cell at the very bottom, enter this code, where custom_df is the model that you're evaluating. Crucially, you have to set the readers argument to bake-off so that you don't use the, um, development data for this. And then optionally, if you feel your model needs to be evaluated with a specific distance function, you should specify that as well; otherwise, it's going to default to cosine, which might be fine for you, but maybe this is an important design choice for your model. So then you just run that bake-off cell, um, and this is important, right? So we're all on the honor code here. Running the cell more than once with different choices for custom_df is against the rules. That is dishonest. The idea is that you have completed development and now, on completely fresh data, you are seeing how that system works. So if you go back and make an adjustment to custom_df and try again, as soon as you do that, you've broken the rules, okay? Uh, we don't have a way to really check this, but that's the way the field works and, you know, progress kind of depends on people being responsible about this. So run that code once. And then, to help us see whether we can reproduce the results that you got, in the cell right below that, this is step five, just enter the macro-average score that you got so that we can, if we rerun your system, compare against that and maybe contact you if there's a big discrepancy or something. Having done that, you just upload the notebook to Canvas [NOISE], um, and if your system, as in homework 1, depends on external code or data, you know, things that won't be in our course environments, then put it in a zip archive and upload that as well, and then you're done.
Um, you have to do that by Wednesday at 4:30. We are going to look at a bunch of the systems and we hope to announce the results next Monday. Certainly, next Wednesday. I think it should be straightforward. I think it should be really rewarding to see what happened with your systems. Any questions or concerns? All right. We have officially begun Bake-off 1. Now we want to switch gears. The topic for this week is supervised sentiment analysis. Um, I'm gonna try to do a bunch of things with this unit. So as you can tell from this overview here, and by the way, the slides are posted, so feel free to download them. And as before, they include a bunch of code snippets that I think will be useful for you in terms of taking the material and turning it into code that you can use for your homework and for your bake-off submission. And I'm- I'm gonna leave some time at the end of the day today to introduce the next bake-off and the homework, um, so that you guys feel like you can get started right away on these problems. What I'm gonna be trying to do for these two lectures is, first of all, introduce sentiment to you in a kind of general way, because I think it can be a wonderful source of projects. There's lots of great data. And I'm gonna argue for you that I think the sentiment challenge is a deep NLU challenge. Some people regard it as kind of superficial, but I think with some serious thought about this, we can see that this is a kind of microcosm for everything in NLU. Then I want to give you some general practical tips that are kind of outside of the code base but illuminating, I think, about how to design effective systems for sentiment, but actually for lots of problems. Then we get down to the nitty-gritty. I'm gonna introduce the Stanford Sentiment Treebank, which is kind of our core dataset for this unit. That should be pretty fast. The main thing that I wanna do with you is make sure you feel comfortable working with this code in 4 here, sst.py, which is a module that is included in the course repo. And it plays the role of vsm.py in the previous unit. It's got like a lot of functions that I'm gonna ask you to use for the homework and for the bake-off. And frankly, also, my claim is that it represents a lot of good practices around designing experiments, especially in situations where you might wanna try lots of systems and lots of variants of systems, which is, you know, par for the course in lots of types of machine learning. Having done that, I think we won't get through all of that today, but I hope to at least get to 4. For Wednesday, we'll talk about methods, and then we're gonna kind of dive into different model choices you might make. So, feature functions you could write if you're dealing with kind of linear classifiers, and then we'll look at RNN classifiers and tree-structured networks, which are more modern and maybe more successful approaches to sentiment. And this is a nice topic because sentiment is great, but it's also a chance for us to kind of get some common ground, build up a foundation around best practices for supervised NLP problems. That is something I'm trying to do in parallel with you. So questions you have about methods, metrics, models, all that stuff should be on the table. We're kind of using sentiment as a chance to introduce all of that, okay? The associated material. This is a small typo, that should be 2, um, but just to review. So this is the core code, and then there are three notebooks. The first one is an overview of the SST.
I'm gonna kind of give a, a look at that today. The second one is more traditional approaches to supervised sentiment or supervised learning in general. I've called that hand-built features because I think that's their characteristic. And then the third one, oh, that's also a typo, I'm embarrassed, that should be SST. The third one is, um, neural approaches. So the RNNs, and the trees, but also kind of intermediate models that use, um, GloVe type representations, distributed representations as the input to traditional linear classifiers. It's a kind of nice progression of ideas. Uh, then the homework 2 and bake-off 2 are paired in that notebook. The core reading is this paper by Socher et al. that introduced the SST, and also introduced a lot of very powerful tree-structured networks for the problem. And then I would suggest this auxiliary read- reading. So if you want to just learn about sentiment, this Pang & Lee thing is great. Uh, it's a kind of compendium of ideas for sentiment analysis. And then Goldberg 2015, that's different. That is a really nice primer for doing deep learning in NLP. Uh, it, it kind of unifies in notation and, and in concepts a lot of the models that you'll encounter. Uh, if you need a further review of kind of more basic supervised learning stuff, then check the website. I provided a few links to different online tutorials and stuff that would be like one level back on linear classifiers and things, which I'm kind of taking for granted in here. Let's start with the conceptual challenge. As I said, that sentiment was a deep problem. Sometimes, people think of it as kind of superficial. I want to claim different. So just as an exercise, let's ask ourselves the question, "Which of the following sentences express sentiment? And what is the sentiment polarity, positive or negative, if any? There was an earthquake in California." It sounds like bad news, maybe unless you're a seismologist who was safely out of the state. Uh, does it have negative sentiment? You know, this is a kind of edge case in the sense that you might- depending on how you phrase the problem to the people doing annotation, they might say yes, they might say no. It's worth reflecting on, right? But to give an answer of, "Yes, this is sentiment and it's negative," you need to bring in a lot of other assumptions. "The team failed to complete the physical challenge." Negative or positive sentiment? What do you think? First of all, is it a sentiment relevant statement or is it neutral? I guess you could ask that primary question. And the secondary question, is it positive or negative? Well, that would probably depend on [NOISE] what the team referred to and what your perspective was. And I feel like if we know that it's another team, this is clearly bad news. Maybe we give that negative sentiment, maybe not, maybe we exempt the whole thing. Again, this could be a point of unclarity. "They said it would be great." This seems clearly to involve some kind of evaluative sentiment, right? But it is arguably not the sentiment necessarily of the person or the- the author or the speaker. So is it positive or negative? It feels like it's positive with respect to some other agent. But as you can see from these continuations, they said it would be great and they were right. That seems clearly positive, right? By- by normal definitions. But they said it would be great and they were wrong, that seems more like negative sentiment, right? 
And the nuance there is that in this case, for the first sentence, we need to know not only that the adjective great was used, but that it was used in the context of this verb of saying, and verbs of saying do not necessarily commit the author to their content, right? They're just reporting other facts. They might create a certain bias, but it's only until we get to the second sentence that we kind of know how to resolve the sentiment here. What about this one, the party fat caps- the party fat-cats are sipping their expensive imported wines. Negative or positive affect? It sounds kinda negative to me. It sounds like it's, you know, playing a role, doing a little bit of mimicry. The challenge here would be that there's a lot of positive words in that sentence. So your system is apt to mis-apprehend the sentiment here as positive, when in fact, the vibe that I get is pretty negative. So again, like perspective is creeping in. What about, oh, you're terrible. I think that would really depend on the context. On the face of it, it sounds quite negative. It sounds like an accusation. But if it's a couple of friends, kind of teasing each other about a joke or something, then it could be endearing, right? And it would be hard to classify, maybe the whole point of this little bit of negativity is to create a positive social bond. In which case, calling it negative is really to miss the point. There's a lot of this stuff, here's to ya, ya bastard. That sounds really negative, but that could be very affectionate in the right kind of context, right? This is a real snippet from a review of the movie, 2001, I'm sure we have some 2001 fans in here. This is a classic movie that is also very long, and let's face it, at times, very boring. [LAUGHTER]. Many consider the masterpiece bewildering, boring, slow-moving or annoying. That has a lot of negative stuff in it. But, you can probably intuit that the author of this sentence likes this movie, like this is probably embedded in a positive review. Not for sure. But that kind of seems to be- it seems like they're building up some kind of expectation that they're going to then kind of thwart, right? So again, very complicated perspective fact. And let's also just round this out by thinking about all the complicated ways that we as human beings can be related to products, and people, and events and so forth. Think about long-suffering fans who like the first album, but feel like all the other ones were sellouts. Their negative affect is gonna have a whole lot of complexity to it that isn't just like I dislike this new album. Or what about bittersweet memories, or hilariously embarrassing moments, right? It seems off to just say that these things are positive or negative, there's just a whole lot of other dimensions here. And so my- my- my summary of that would be positive-negative is one dimension that's important, but there's lots of other aspects to sentiment analysis if you think more generally about kind of affective computing. Makes sense? I guess part of the lesson here is be careful how you define the problem, and the other lesson here is that you could make this problem kind of endlessly deep. And here's another perspective on this. This is actually work that I did with Moritz Sudhof, one of the teaching team awhile ago. This is actually from a network, sorry, a dataset of people transitioning from one mood to another. They would give kinda mood updates on a social network platform. And there are two interesting things about this. 
First of all, the number of kind of moods that people are in is astoundingly high. And also, the ways in which they relate to each other and are differentiated- that's also fascinating. And you start to see really quickly that standard things, like positive-negative, or maybe like emphatic or attenuating, those are just two of many dimensions. And then the other part about this, since it's a transition diagram, is that you can see that people are likely to transition in and around certain subparts of this emotional space, and less likely to transition across to certain other parts. And it kinda gives a picture of the contours of our emotional lives. But it also shows you just how high dimensional this problem could be. Unfortunately, we're going to look at just positive-negative sentiment, but there are datasets out there and ways of thinking about the problem that would embrace much more of this nuance. Final thing about this: one thing you might do is think about applying your sentiment analysis system in the real world. And my pitch to you there would be, you'll encounter many people in the business world who think they want pie charts like this- this is my imagination about what it's like to be a business leader. In Q1, [LAUGHTER] the negative affect was 30, and in Q2, it's 35. Maybe that's an interesting lead, right? To see that this breakdown in your reviews, or whatever, your social media updates, is trending negative. But the thing is, it's very unlikely that that answers the question that they have. Probably, even if this was estimated correctly, what they really want to know is why. And one exciting thing to think about is how you could take an NLU system that could do this basic thing and offer much more of that why question. Because after all, the why answer is latent in these texts that you're analyzing. And your capacity, or your system's capacity, to just label them as positive or negative is just like the thinnest slice of what you could actually be doing with that text. And so, immediately, you could think about branching out and going deeper. And I think we're now at a state in NLU where we could think seriously about answering those why questions. One more thing about this by way of general setup. So we're going to have to be pretty narrow, talking about the Stanford Sentiment Treebank. But there are lots of other kinds of tasks or areas within NLU that are adjacent to sentiment that I think are really interesting to address. We won't have time for all of them, but what I did here is list out a whole bunch that I could think of. And then for each one, I listed a paper that I think would be a great starting point. Obviously, each one of these things has a huge literature associated with it. So I had to be kind of subjective and selective, but what I tried to do is pick papers that either are just great about giving you a picture of the landscape of ideas, or even better, that have associated public data that you could get started on right away. And I've really actually tried to favor papers that have a dataset associated with them. And it's not that I'm saying that these problems are the same as sentiment, I think they're actually very different. Um, even the ones that might superficially look like sentiment, like hate speech. But rather, people tend to think about them together.
And I do think a lot of the methods that we discussed transfer pretty nicely into these new domains, especially if you're willing to do the work of really thinking deeply about what these problems are like, in the way that I'm encouraging us to do for sentiment here. And this is not an exhaustive list, I just thought these are really exciting problems to work on. Questions or comments about that kind of general setup, or about sentiment analysis in general? Yeah. What would be a possible architecture for doing something similar to what you described- like, say, given a review, trying to figure out why the review is positive or negative? Let's say expensive is a category, or, like, good food is another category, for restaurants? I think you're already on the right track. Right. Take this review that has four stars associated with it, and see if we can get down to the level of which aspects are being highlighted. Maybe some of them are negative. There are some datasets that give you that kind of multi-aspect sentiment, but it might be a kind of latent variable that you want to try to induce with some uncertainty. Yeah. It's a great question. How about some general practical tips? I think these are interesting. They don't necessarily apply to the SST because of its unique characteristics, but this is good stuff to know about in general, I claim. First, some selected sentiment datasets that we won't get to look at. Here's a whole mess of them. Um, what I've tried to do actually is favor datasets that are massive, like the, uh, two Amazon ones. One released by Amazon and one by, uh, McAuley, who was a post-doc here. Those are both truly enormous, and offer breakdowns by product and by user, and some other interesting metadata, so that you could use them to do some sentiment analysis that also brings in other kinds of contextual predictors, and you could certainly work at scale. Like if you wanted to build your own word representations, the way we did in the first unit, as the foundation for your project, these datasets would support that. Which I think is very exciting. And RateBeer is also like that. Also really large. Bing Liu has released tons of datasets about sentiment, and they range from just like, you know, 30 million reviews from Amazon that he got somehow, on down to the kind of thing you were just describing, which is like annotations at the level of aspects of cameras. So you could do much more fine-grained work. With Andrew Maas, I released a dataset up here that gets a lot of use. This might be a good kind of development dataset for you. It's pretty big. It's got a mess of reviews that are unlabeled for developing unsupervised representations. And I think it's a pretty easy problem as they go. Um, oh, also, this one is different, so I was involved with this work as well: sentiment and social networks together. I think this is a really exciting extension. And if you've done work on kind of like social network graphs or knowledge graphs, it's really cool to think about combining them with textual sentiment analysis. We did that in that paper there, West et al. And we released a dataset of people who are running for office on Wikipedia. And all those discussions are public, and a lot of them are evaluative. And so we have this network of people, and we also have all these evaluative texts, and we brought them together. And then finally, the SST is the one we're going to look at. Do you have other greatest hits in mind? Datasets you've worked with for sentiment?
This isn't exhaustive, I just thought these were cool ones basically. There are also a lot of sentiment lexica, and these can be useful resources for you. And actually, these can be used for problems related to the SST. So I thought I would just highlight them here. Uh, Bing Liu's Opinion Lexicon is just two word lists: positive and negative. You're gonna work with it on Homework 1 because it's built into NLTK. And it's really nice in the sense that it does a good job of classifying words in a pretty context-independent way, and it's pretty well attuned to the kind of data that you might see on the web. Uh, SentiWordNet is a project of adding sentiment information to WordNet. And so if you're already doing work in WordNet, it's a kind of nice counterpart. Uh, and NLTK has a SentiWordNet reader, which actually I wrote. Um, the MPQA subjectivity lexicon is a kind of classic. And if you follow that link, you'll see that that group released a bunch of datasets related to sentiment. Uh, the Harvard General Inquirer is just a huge spreadsheet of words along the rows, and then lots of different affective and other social dimensions measured along the columns. It's almost like a vector space that's been hand curated. Um, and it's used a lot in the field. LIWC, Linguistic Inquiry and Word Count, is similar. Although you kind of have to pay for LIWC. But, uh, so if you don't want to fork over whatever they're charging, you could just use the Harvard Inquirer. And then, I listed two others. Hamilton et al.: Will Hamilton was a student in the NLP group here, and he released these datasets called SocialSent, and that's really cool because that's not only some lexica, but also some methods for developing context-sensitive sentiment lexicons, so that you could learn, like, the sentiment words that are associated with music, and with cooking, and so forth, and there's lots of nuance there. So I find that very exciting. And then, finally, this last one is just a big human-developed, so human-annotated- it's kind of like an experiment- huge lexicon of scores for different words along a few different affective dimensions, and I've worked very successfully with this final one in the past. I think I won't spend too much time on it, but I do have this slide here that shows, for those classic lexica, um, what their relationships are like, so that you could figure out whether or not it's worth bringing two of them in or you could concentrate on just one. And I've done that by quantifying not only their overlap, but also the number of places where they disagree on sentiment. I think I won't dive into this, but it's there for you as a resource if you decide that you're gonna bring these things into your systems. [NOISE] Let's do a few nuts and bolts things too. The first one I wanted to start with is just tokenization, because I think people tend to default to default tokenizers, and sentiment is an area where you can see that that might be a mistake. So here's my argument for that. I've started with this imaginary tweet. "NLUers: can't wait for the June 9 projects! Yaaaaaay!" Uh, and then it has an emoticon that's gotten garbled and then a link to our class site. So the first thing you might wanna do as a pre-processing step is fix up the HTML entities that are in there, so that it looks better, and so that you can recover this emoticon. But this step here- it's very common to encounter text like this, and you might wanna check to see whether it's worth fixing those up.
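As a concrete illustration of that entity-fixing step, here is a minimal sketch using only the Python standard library. The string is my own stand-in for the kind of garbled tweet described above, not the actual slide example.

```python
import html

# A stand-in tweet with an HTML-escaped emoticon, roughly like the one described above.
raw = "@NLUers: can't wait for the June 9 #projects! YAAAAAAY!!! &gt;:-D http://example.com/nlu"

fixed = html.unescape(raw)   # "&gt;" becomes ">", so the emoticon >:-D is recoverable again
print(fixed)
```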
So if I've done that, then I have this text here. If I apply just a whitespace tokenizer, it does okay, um, in the sense that it has preserved the emoticon, because it was written in a kind of cooperative way over here with whitespace around it. Um, it didn't quite identify the username because it has a colon stuck to it. It did well on these tokens. It didn't do well on the date, um, and it didn't do well on the URL because it left that period attached. So this is maybe not the best. The Treebank tokenizer- this is the one that you will likely encounter other systems using. It's by far the most pervasive tokenizer; just about any large NLP project has this step under the hood. And this is a really meaningful choice because it comes from a different era, and you can see that it has really made a hash of this text. So, it took the username and split off the @. It divided up the contraction, so that the negation is separate. That might be good for you. It might not be. It broke apart the hashtag- and again, hashtags are distinctive on Twitter and certainly meaningful for sentiment, so you might wanna preserve them as distinct tokens. Um, it broke apart all this stuff, maybe that's okay, and it also destroyed the emoticon. It turned that into just a list of punctuation, and then it also destroyed the URL. Um, so using this on text that you're gonna find on Twitter is probably a pretty dangerous move. It's certainly not gonna allow you to find the emoticons, or travel around in the links that people have provided, or do any aggregation by hashtag or username. So this is maybe not the best, and that kind of cues up a bunch of stuff that we might want from a sentiment-aware tokenizer. Right, it should preserve emoticons. It should preserve kind of social media markup. You might also want it to preserve some of the underlying markup. It's meaningful, for example, that people have wrapped some words in the strong tag to indicate that they were in bold. Um, you might wanna do something with people hiding curse words, because those are certainly socially meaningful. You could think about preserving capitalization where it's meaningful, right? It's one thing to write great and another thing to do it in all caps. A more advanced thing would be like regularizing the lengthening, so that when people write Y, and then just hold down the A key, and then Y, to indicate real emotional involvement- where the longer they held down the key, the more involved they are- you know, all those things are just gonna be very sparse tokens. And if you could normalize them to like three repeats, then you might get a meaningful signal from them. And then you could think even further down the line about capturing some multi-word expressions that house sentiment, like out of this world, where none of the component parts are gonna look sentiment-laden, but that n-gram there is certainly gonna carry a lot of information. And NLTK also has this tokenizer here, which, again, I wrote- or at least I wrote the core of it- uh, that does pretty well on these criteria here. So if you're working with social media data, I'd certainly argue for this one over the Treebank one, for example. And here's what you might hope that the sentiment-aware one would do. It would preserve the username and the hashtag. This one also does the date, which can be nice. It gets the URL and the emoticon, and it regularizes this. And I can quantify that a little bit.
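Here is a rough way to see those differences yourself. This is a sketch assuming NLTK is installed; the exact outputs depend on your NLTK version, and the comments just paraphrase the behaviors described above.

```python
from nltk.tokenize import TreebankWordTokenizer, TweetTokenizer

text = "@NLUers: can't wait for the June 9 #projects! YAAAAAAY!!! >:-D http://example.com/nlu"

# Whitespace: keeps the emoticon, but leaves punctuation stuck to the username and the URL.
print(text.split())

# Treebank: tends to split off the @, the hashtag, the contraction, the emoticon, and pieces of the URL.
print(TreebankWordTokenizer().tokenize(text))

# Sentiment-aware: preserves @NLUers, #projects, the emoticon, and the URL;
# reduce_len normalizes the lengthened "YAAAAAAY" down to three repeated characters.
tok = TweetTokenizer(preserve_case=False, reduce_len=True)
print(tok.tokenize(text))
```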
So I'm gonna report on some experiments, and what I've done for these experiments is just take a whole mess of OpenTable reviews and fit a classifier, a simple softmax classifier, to them with different amounts of training data. I'm always testing on 6,000 reviews. But here, it goes from 250 training texts to 6,000. The gray line is whitespace tokenization. The green line is Treebank and the orange is sentiment-aware. And this picture just shows that you get a consistent boost on this sentiment problem from choosing the sentiment-aware tokenizer. And it's especially large where you have relatively little training data, and that makes sense because those are the situations where you want to kind of impose as much of your own bias on the data as you can, assuming it's a good bias, because your system doesn't have much to work with. Whereas, by the time you get to 6,000 reviews, these differences have been minimized. And just to round it out, what I did here to test for robustness is train on OpenTable reviews, again from 250 to 6,000, [NOISE] but I tested out of domain. So I tested on IMDB reviews, and it's the same kind of picture. The performance is more chaotic because of the out-of-domain testing. But I think pretty consistently, it's worthwhile to do the sentiment-aware tokenizing. The orange line basically strictly dominates the others. Make sense? Questions or comments about the tokenization stuff? Maybe I've convinced you. How about stemming? People ask me all the time about whether they should be stemming their data. So stemming is heuristically collapsing words together, by trimming off their ends typically, and the idea is that this is helping you kind of collapse morphological variants. I think that's what people are always imagining- that it will take, like, thinks and thinking and smush them together into think, and you'll get like less sparsity in your data and therefore more clarity. There are three common algorithms for doing this, all in NLTK: the Porter stemmer, the Lancaster stemmer, and WordNet. And my argument for you here is that Porter and Lancaster destroy too many sentiment distinctions for you to wanna use them. Actually, I think that applies outside of sentiment. The WordNet stemmer, on the other hand, doesn't have these problems because it's much more precise, but then again you might think it's not worthwhile when you see exactly what it's doing. Porter stemmer: what I've done here is, you know, the Harvard Inquirer lexicon that I mentioned before, it has categories for positive and negative- it writes them without E's for some reason. And what I've done here is just give you a sample of words that are different according to their Harvard Inquirer sentiment, but that become the same token if you run the Porter stemmer. So defense and defensive become defens, or extravagance and extravagant become extravag, or affection and affectation both become affect. You can see what's happening here. Real sentiment distinctions are being destroyed by this stemmer. Yeah, tolerant and tolerable: toler. Temperance and temper both become temper. But here's the argument for the Lancaster stemmer. I actually think this is even worse. So again, positive and negative words, and when you Lancaster stem them- this is to the point where, like, fill and filth both become the same token, or call and callous, and truth and truant for some reason both become tru.
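If you want to see these collapses for yourself, here is a quick sketch. The word pairs are the ones from the slides above; I'm just running the NLTK stemmers on them and printing whatever they produce.

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Pairs with different Harvard Inquirer sentiment that the Porter stemmer collapses.
for a, b in [("defense", "defensive"), ("extravagance", "extravagant"),
             ("affection", "affectation"), ("tolerant", "tolerable"),
             ("temperance", "temper")]:
    print(a, b, "->", porter.stem(a), porter.stem(b))

# The Lancaster stemmer is even more aggressive.
for a, b in [("fill", "filth"), ("call", "callous"), ("truth", "truant")]:
    print(a, b, "->", lancaster.stem(a), lancaster.stem(b))
```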
I think you're doing real violence to your data quite generally, frankly, by running the stemmers, but for sentiment you're obviously gonna be losing a lot of important information. The WordNet stemmer is different. What the WordNet stemmer is doing is: you need to give it a string and a part of speech, and then it will use its very high-precision lexicon to collapse them down into what you might think of as a base form. So it does do the dream thing for sentiment, which is that, like, exclaims, exclaimed and exclaiming all become the same word. It doesn't do it for noun forms. And then on the flip side, it does collapse all comparative and superlative variants of adjectives down into their base form. So it's doing it with very high precision. If it's not in the lexicon, it's not gonna do anything to your word. Um, I think this is fine to do. It's just quite costly to do it at scale. Um, it's probably not worth it in general for collapsing these if you have sufficient data, and then for sentiment, you might regret actually having combined together happy and happiest, because they're different in terms of sentiment. And just to round this out, I did the same kind of experiments. So this is OpenTable in-domain testing, 250 to 6,000 reviews. The sentiment-aware tokenizer beats both Porter and Lancaster. Basically, the idea there is that if you take the sentiment-aware tokenizer as your kind of default baseline, stemming is only hurting you. Maybe that's convincing. And out-of-domain testing- or did I not include that? It's the same kind of picture when you do out-of-domain testing. Makes sense? Any defenders of stemming wanna chime in? Part-of-speech tagging. Oh, yes. There's a question, actually. When do you prefer to use these kinds of approaches for preprocessing versus character-level models? You know, that's a great question. Um, one thing I'll say, and that you'll see emerge later in the course, is that the move to, first of all, sequence models, because they process all of these things in the context of what came before and maybe after them, and then also analysis down to the character level, has made these decisions in some cases less important, right? So a lot of these newer models can recover from a bad tokenization scheme because of all the contextual modeling that they're doing. Overall, I'd say this is a great development because, whatever you think about my sentiment-aware tokenizer, even it is probably not getting a perfect read on exactly what the units should be. Um, but I would still say that it's probably worth your while to start all these systems in a reasonably good place, um, even if these differences are becoming minimized. Yeah. Er, how would stemming work on misspelled words? [NOISE] How does stemming relate to misspelling- how do the stemmers handle misspelled words? [NOISE] The stemmers are gonna do what they do. When you look at them, they're just basically huge regular expression mappings. So they don't care about misspellings, because they don't actually even care about word identity that much. Um, for newer methods, misspellings- this is kind of a related point.
If you have distributed representations and you have a common misspelling, it's likely to have a very similar representation to the one that's spelled correctly, in which case those systems- one of their selling points is that they gracefully recover from that stuff, and therefore reduce the need to, like, run a spell checker as a preprocessing step. That said, the other side of this- and I think you can see it in these plots; this one makes it especially clear, I think- is that the more data you have, the less these choices are gonna matter, because the more your system is gonna be able to recover from a bad starting point. It's when your data are sparse that these choices really matter. Part-of-speech tagging could help with sentiment. That's my first pitch here, because there are a number of cases in the English lexicon where two words with different sentiment are distinguished only by their part of speech. So arrest as an adjective- like, as in, it's arresting- I guess that's positive, but arrest as a verb, according to the Harvard Inquirer, is negative. Uh, fine, that's a clear case. So, it's a fine idea- that's a positive thing- but to incur a fine, that's the noun version, and that's typically negative. So it might help you to create distinctions by part-of-speech tagging your data, and essentially considering every unigram to be its word form plus its part of speech. That's the first pass here, and that's some evidence for it. But even then, sentiment distinctions transcend part of speech. Here I've got a bunch of cases from SentiWordNet where one and the same word, with the same part of speech, has different sentiment. So mean could be that you're mean, as in, you know, you're not nice to people, but a mean apple pie is a good apple pie. Um, or something could smart- that might be bad, and that means it hurts- but somebody being smart, that's a positive thing. Serious can obviously mean different things depending on the context. That's something I've experienced a lot in my life. A serious problem might be a good one or a bad one depending on your perspective, and so forth. So I don't have an answer here, except to say that even adding as much preprocessing as you can think of is not gonna fully disambiguate the words in a way that aligns with sentiment. That makes sense? I don't know, it's a kind of ambiguous thing here, though: like, is part-of-speech tagging worth it as a preprocessing step for sentiment? It's not so clear to me. That's a kind of empirical question. This is really powerful, though. And again, this is a heuristic thing and an intuition that we'll return to in the context of the SST, and this is just simple negation marking. So the linguistic thing is that you have all these ways of expressing negation, and they obviously kind of flip, intuitively, what the sentiment is. So I didn't enjoy it is probably negative, whereas I enjoyed it is positive. I never enjoy it: negative. No one enjoys it: probably negative. This is an implicit negation here: I have yet to enjoy it. That's, like, it might as well be I don't enjoy it. And I don't think I will enjoy it- that's a case where it's probably negative sentiment, and the negation word is really far from the associated thing that you want to treat as negated, which is the word enjoy.
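Here is a minimal sketch of the kind of heuristic fix for this that comes up next, _NEG scope marking. This is my own simplification- the negator list and punctuation set are just illustrative choices- and NLTK's nltk.sentiment.util.mark_negation implements a similar idea.

```python
# Heuristic negation scope marking (a simplified sketch, not the course code).
NEGATORS = {"not", "no", "never", "n't", "cannot", "nobody", "without"}   # illustrative, partial lexicon
CLAUSE_BREAKERS = {",", ".", "!", "?", ";", ":"}

def mark_negation_scope(tokens):
    out, in_scope = [], False
    for tok in tokens:
        if tok.lower() in NEGATORS:
            out.append(tok)
            in_scope = True                 # start marking everything after the negator
        elif tok in CLAUSE_BREAKERS:
            out.append(tok)
            in_scope = False                # stop at an intuitive clause boundary
        else:
            out.append(tok + "_NEG" if in_scope else tok)
    return out

print(mark_negation_scope("i did n't enjoy it".split()))
# ['i', 'did', "n't", 'enjoy_NEG', 'it_NEG']
```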
Early on in the sentiment analysis literature, a few people proposed a simple heuristic method, which is just basically: as you travel along, if you encounter a word that is a negation according to some lexicon that you've developed- which would include, like, n't, and not, and never, and no one, and so forth, and you have a whole lexicon of them- then you just start marking all the tokens after that negation word with _NEG, up until maybe you hit, like, a comma or a period or some kind of a punctuation mark that tells you, heuristically again, that you've reached the end of the scope of the negation. So what you're doing when you do that appending of _NEG is essentially creating a separate token. You're saying that this word, when it's under negation, is a different word than when it's in a positive context, and that will help your system distinguish enjoy_NEG from enjoy in a positive context, and give your system a big boost, presumably. So that's a really easy preprocessing step that is basically just giving your statistical model a chance to learn that negation is important. So here's kind of what happens. No one enjoys it would mark everything after no with _NEG. For I don't think I will enjoy it, but I might, if you've got your algorithm set to stop at a comma, then it will stop that _NEG marking at the end of what's intuitively a clause boundary there. And that's probably good. And here's a little bit of evidence that this can really help. So here I've got a full comparison. Gray is just whitespace, green is Treebank, orange is the plain sentiment-aware tokenizer, and red is the sentiment-aware tokenizer with _NEG marking. And that gives you a consistent boost all across these levels of training data, from 250 texts to 6,000. And yeah, even out of domain, this is a useful bias to have imposed. It's really giving your system a chance to see that negation is powerful for sentiment. That make sense? Er, after the preprocessing, what model consumes the data? I just used a simple softmax classifier and noted it down here, which is just a pretty good standard linear model. Uh, and that's really it. It's just a bag-of-words classifier. It's just that when I've done the _NEG marking, this bag of words has been kind of annotated in this clever way. Excellent. That was it by way of general stuff. Now we're gonna dive deep on the Stanford Sentiment Treebank. And this is perfect timing, because this will give us a chance to talk about the code itself in sst.py. And then you guys can leave here actually ready to do the homework if you want to, and think about the bake-off. So the SST project- the associated paper is Socher et al. 2013. I was involved with this project. It was tremendously exciting at the time for me. It was the largest crowd-sourcing effort that I had ever been involved with. Um, I remember feeling somewhat nervous that it would even work, because we had people annotating hundreds of thousands of phrases. Um, in retrospect, it looks kind of small. Now Stanford has produced annotated datasets that are vastly larger than this. But still, uh, it's an impressive effort. Um, full code and data release- credit there goes to Richard Socher. I think, um, you can still- if you visit that link, you can play around with the system. It does wonderful visualizations. You can even, um, offer to give it new examples that it will then learn from. Um, and then of course, you can use the code in lots of ways. So, a kind of model for being open about your data and your methods and your results.
Uh, it's a sentence-level corpus. It has about 11,000 sentences, and those sentences are derived from a classic sentiment dataset that was released by Pang and Lee, who were really pioneers in doing sentiment analysis. And actually, this paper won the Test-of-Time award at last year's NAACL. Uh, I think the argument there is that they really did set us off in the direction of thinking seriously about sentiment in its own right, but also as a great test bed for NLU models. So it starts from those Rotten Tomatoes, uh, sentences, which were labeled naturalistically by their authors, because it's a review dataset. But what the SST project did is crowd-source labels, not only for those sentences, but for every single phrase in all of the trees that are contained in those sentences. Uh, it's a five-way label thing- actually, annotators were given a slider bar, and then the labels were extracted. And that was justified on the grounds that, by and large, people picked points on this slider bar that were kind of consistent with the labels that had been provided. So the result is a treebank of sentences that look like this. Here: NLU is enlightening. These are real predictions from the model- all new test cases. I was very impressed at how well this did. Uh, it labeled NLU UNK, but it still got this right. So, is is neutral, NLU is neutral, enlightening is positive, and as a result of enlightening being positive, that projects up the tree essentially. This is an example that I used to motivate things, right? So, they said it would be great. This is really cool. It knows that be great is positive, down here, and it knows that these things kind of aren't contributing sentiment. And the sentiment signal goes pretty strongly up until the top here, but then it's diminished somehow, and the overall sentence is neutral. And that's the prediction that I wanted from this sentence, because I feel like this sentence alone does not tell us about the author's bias. It just reports somebody else's perspective. So, you know, who knows what was happening up here to cause this to emerge, but it did. And then, really cool: they said it would be great, they were wrong. It got that that was negative. And it got that that was negative because it- it must know to project things from the right more strongly. And so, they were wrong over here, with a 1, projected up to the top, even though this remains kind of ambiguously neutral or positive. It didn't quite nail they said it would be great, they were right. I think right must not have a strong enough sentiment signal on its own. Um, but at least it did figure out that this is kind of just neutral here, even though there's positivity kind of weakly expressed on both sides. Anyway, so I was impressed by the system, but I introduced these examples more to show you the latent power of this particular data resource. This is kind of unprecedented, that you would have this many labels and this kind of degree of supervision for the individual examples. Yeah. For the example they said it would be bad, they were wrong- would that be negative? Because they said it would be bad would probably be neutral, and then they were wrong would be negative, if that's where the negative sentiment is coming from. That's my intuition. Well, my intuition as a human is that you're right: they said it would be bad and they were wrong should be positive.
It seems like a very interesting stress test for the system, to see whether it got that right, because then it needs to know not only how to balance these sentiment signals, but kind of how they've come together. [NOISE] There are a few different problems that you can define on the SST. There's a five-way problem that's just using basically the raw labels. They go from very negative, negative, neutral, positive, to very positive. And this is a breakdown for train and dev. There's a test set, which I'll return to, and I've not shown the statistics because I'm kind of trying to keep it out of our view for now. Uh, but it's comparable to dev. And I think this is fine, but there are two things you might keep in mind about this version of the problem. So first, intuitively, if you think about sentiment strength, separating out polarity, then 4 of course is greater than 3, but 0 is greater than 1. I think it's much more natural to think about this as kind of two scales, with neutral unranked with respect to the other two. Maybe that doesn't bother you too much. What should bother you is that, if you fit a standard classifier to this kind of labeled dataset, which has a ranking on its labels, then your classifier will be conservative with respect to how well you're actually doing. Because it will regard a mistake between 0 and 1 as just as severe as a mistake between 0 and 4. Whereas really you might think about how you could get partial credit for being close to the true answer. And there are models that will let you do that, even in the classification context, like involving ordinal regression or ordinal classification. But your model is probably not doing that, and so it's just worth keeping in mind for this problem that it's a little bit strange. There are two versions of the problem that make more sense to me. Uh, the first is this ternary one, which we're gonna make a lot of use of. Here, you just group together 0 and 1 as negative, and 3 and 4 as positive, and 2 in the middle. So you've lost some sentiment distinctions, but at least with regard to the way the labels were given, it seems quite justified to me. And this one is nice, because you keep all of the data that you have available. There is another version of the problem that is discussed a lot in the paper alongside the five-way one. And this is a binary problem, where we simply drop out the neutral category, and then cluster together 0 and 1, and 3 and 4. Perfectly respectable; uh, the only shame here is that, first of all, in the world, there's a lot of neutral sentiment- not everything is sentiment-laden. And the other is that we had to drop out a lot of our data. These statistics here are just for the root-level labels. So they kind of ignore all the labels that are down inside the trees. This is the all-nodes task, where you actually try to predict all of the labels on all the sub-constituents. You get many more training instances, of course, because some of these trees are really big. Uh, but you can define the same three problems in this way. To kind of give an equal footing to all different kinds of models, we're mostly not going to look at this problem. My only regret about that is just that this is one of the more interesting aspects of the SST, that it has all this supervision. And so I do encourage you, for projects and things, to think about how you might model the fullness of this dataset. For the labels, would it ever be like a good idea to have two labels?
One for, like, the intensity and one for the valence? Oh, absolutely, yes. I mean, that's kind of- I mean, the SST doesn't have that kind of multi-way label. But my linguistic intuition, my kind of understanding of the psychology of affect, of affective states, is that they have many dimensions. And you can see that in one of those lexicons that was released by Victor Kuperman, the last one that I listed, which has a few different dimensions that they take as kind of the true substrate for emotional expression. Yeah, I wish we could do it. Um, here, they've all been collapsed into a single scale, though. Yeah, I guess like the verys- having that as a label, whether it's very or not. So 0 and 4 would be plus, um, emphatic, and the rest would be minus emphatic, as a binary problem. And then you also have polarity, which is positive, negative or neutral. I think- oh, yeah, no, I take it back. That's a great idea. That's worth thinking about. Yeah. And certainly as a way of kind of seeing what your model is learning, it's interesting to impose that distinction. Other questions or comments? Great. Okay. Let's dive into the actual code. This is perfect because I can cue you up to start working productively. So, sst.py. You'll have to get used to my interfaces a little bit, but my claim is that having done that, you'll be able to work really productively, because I brought together- I've created a little framework for you that should make you quite nimble in terms of running experiments. First thing, you just have all these readers. Um, so you have train_reader and dev_reader. And then each one of them has this argument class_func. And there are basically three choices. If you leave that argument off, it will give you the five-way problem. If you set class_func to the ternary class function, it will be positive, neutral, negative. And the binary class function will give you just the binary problem. So these are nice prepackaged ways of getting different views on the data. The notebooks explore all these different problems, and for the homework and the bake-off, we're going to do the ternary one, because I- I think it makes a lot of sense. But this is easy, and this is just kind of setup. You could paste this in and follow along if you wanted- sst is the library, the trees are already there for you in your data distribution, and then these are the readers. And maybe the only other thing to mention is that each reader you can think of as yielding tree-score pairs. So I'll show you what the trees are like in a second, and the scores are just strings, as you would expect given the way the labels work. I have a separate slide here on these Tree objects. So they are NLTK Tree objects. Up here, you can see I created one from a string. This is just an illustration of how to do that, and if your notebook is set up properly, it will do this nice thing of displaying them as pretty intuitive trees. So this is NLU is amazing. Down at the bottom here, for subtree in tree.subtrees()- that's a method that will cycle through all the subtrees for you. Just a really useful method to know about. And then up here, these are basic tree, um, components. So tree.label() will give you the root-level label for that tree. And then tree[0] and tree[1] will give you the left and right children of that tree respectively, assuming they exist.
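To make those interfaces concrete, here is a rough sketch. The function names, the class_func value, and the data path are my guesses at what's being described, so check sst.py and the notebooks for the real signatures.

```python
import os
from nltk.tree import Tree
import sst   # the course module under discussion

SST_HOME = os.path.join('data', 'trees')   # assumed location; point this at your data distribution

# Readers yield (tree, score) pairs; class_func controls which version of the problem you get.
train = list(sst.train_reader(SST_HOME, class_func=sst.ternary_class_func))
tree, score = train[0]

# Basic Tree operations discussed above:
t = Tree.fromstring("(4 (2 NLU) (4 (2 is) (4 amazing)))")
print(t.label())            # root-level label: '4'
left, right = t[0], t[1]    # left and right children
for subtree in t.subtrees():    # cycles through every subtree, including the root
    print(subtree.label())
print(t.leaves())           # ['NLU', 'is', 'amazing']
```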
You can see that for this tree over here, the left child node is that subtree rooted at 2 with NLU, and the right subtree is the verb phrase, "is amazing." And then obviously, that's recursive. So tree[1][0] would take you down to (2 is), and so forth. There's a bunch of other stuff you can do with these trees, but in my experience writing feature functions, these are the important components that I needed. If you think about writing a function that's going to crawl around in them, this is kind of the nuts and bolts. Okay. We'll build up this framework in three steps here. The first is this notion of a feature function. And this is a very simple example here. This is a kind of bag-of-words feature function. All the feature functions that you write should take a tree as input. And so if you think back to the readers, right? Readers are yielding tree-score pairs. It's going to operate on the left value of those pairs, the first one. And it needs to return a dictionary, where the values in that dictionary are counts or booleans. Ah, and they can be real-valued as well- so integers, floats, and booleans. And that's kind of the contract, right? You can write any feature function. It can be any function of individual trees, as long as it returns a dictionary. And what I've done here is use tree.leaves() to get just the lexical items, so I'm not really making use of the tree structure, and then Counter here just turns that list into a dictionary where the elements have been counted. So that's a nice, really quick way to do a bag-of-words feature function. And I've shown you- I created that same tree, NLU is amazing, down here. unigrams_phi of that tree yields this dictionary. Does that make sense? This is something that you'll want to make a lot of use of. I'm- I'm assuming that part of your bake-off and part of your homework will be writing interesting feature functions. And they can do whatever you want inside here, as long as it's tree to dictionary. Any questions about that? All right. Good. Next step: model wrappers. So this might look a little redundant at first, but bear with me. You'll wanna write functions that take as their input X-y pairs, where X is your feature matrix and y is your vector of labels. And what those functions should do is fit a model on that data- so a little supervised training data- and then return the fitted model. And that's all. So here, what I've done is logistic regression from scikit; it's a good baseline model. I said I wanted the intercept feature, like the bias feature. I specified the solver and multi_class='auto' just so it wouldn't issue its warnings- it seems to be in the habit now of issuing warnings about those changing or something. Ah, and then, crucially, I fit the model. You have to remember to fit the model in here, and then it gets returned. And again, this might look like just a tedious way of calling the fit method on a model, but as you'll see, by putting wrappers around these things, you can do a lot of other stuff as part of this process without changing your interface. And for more sophisticated models, you might want to do a bunch of sophisticated things before you fit, or as part of fitting. And then, finally, this brings it all together. sst.experiment is like a Swiss army knife for running experiments. Um, I've given it here with all of its default values, so that you can see at a quick glance all the different things that you can do.
But the point is, you just point it to your SST, you give it a feature function and a model wrapper, and that's really all that's required. If you give sst.experiment those three things, then it will train a model and run an assessment on data that's separate from its training data. And it will give you a report like this. Not only will it give you a bunch of information about your experiment, encoded here, but it will also print out a classification report. And that's it, right? So you have to write very little code in order to test your feature function and/or your model wrapper. And then if you want to try out different conditions, you can go in here: you could specify that your assess_reader was the dev reader and assess on the development data. You can change the training size, if you haven't specified the assess_reader and it's doing a random split. You can change the class function. You can even change the metric. Um, you could turn off printing, and I'll return to this vectorize thing a bit later. But that's all kind of if you wanna do more nuanced experiments. The thing to keep in mind is that you can just very quickly test your feature function and your model wrapper. And the other thing I want to say is that you'll see this throughout for this entire unit and, in fact, for many units in this course: when we see these classification reports, we are gonna care mainly about the macro-average F1 score. Um, the reason for that is we have slight class imbalances. Of course, for other problems, the class imbalances can be really large. And the idea behind macro-averaging is that we care equally about all those classes. Now, I think that's justified in NLU- sometimes the smallest classes are the ones we care about the most. Um, micro-averaging would favor the really large classes, as would accuracy and the weighted average. But macro is a kind of good clean picture of how you're performing across all the different classes, despite their size. So keep that in mind as you look at these reports. You'll kind of wanna hill-climb on that value. sst.experiment returns this thing here- I've called it unigrams_softmax_experiment- and that contains a lot of information about your experiment. Again, I think this is part of best practices: when you run an evaluation, you keep as much information as you possibly can about what you did. Every single time in my life that I've decided to leave something out, I've regretted it, because I wanted it later and then I had to, like, retrain the whole system or something. So what I've tried to do is package together all the information that you would need to study your model's performance and to test it on new data. And you can see that listed here. You've got the model, the feature function, the data that it was trained on, and the assessment data. And that's relevant if you did a random split, because this'll be the only way that you could recover exactly what data you were using. Then the predictions, the metric, and the score. And then each one of these train and assess datasets has the feature matrix, the labels, the vectorizer, which I'll return to, and the raw examples before featurization. I think that's really important for kind of error analysis, because if you try to do error analysis on X and y, you're just staring at these high-dimensional feature representations. It's hard to know what's going on. You as a human, unlike a machine learning model, would prefer to read raw examples. And so they're there for you as well.
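Putting those three pieces together, here is a hedged sketch of what a complete run might look like. This is my reading of the interfaces described here- the keyword names, the solver choice, and the data path are assumptions- so treat the notebooks as the canonical version.

```python
import os
from collections import Counter
from sklearn.linear_model import LogisticRegression
import sst

SST_HOME = os.path.join('data', 'trees')   # assumed location, as before

def unigrams_phi(tree):
    """Feature function: tree in, dictionary of leaf counts out."""
    return Counter(tree.leaves())

def fit_softmax_classifier(X, y):
    """Model wrapper: fit on the supervised data and return the fitted model."""
    mod = LogisticRegression(fit_intercept=True, solver='liblinear', multi_class='auto')
    mod.fit(X, y)
    return mod

# Trains, assesses on held-out data, prints a classification report, and
# returns an object with everything about the run (model, feature function, data, predictions, score).
experiment = sst.experiment(
    SST_HOME,
    unigrams_phi,
    fit_softmax_classifier,
    class_func=sst.ternary_class_func,   # the version of the problem used for the homework and bake-off
    assess_reader=sst.dev_reader)        # optional; leave it off for a random train/assess split
```

The macro-average F1 in the printed report is then the number to hill-climb on.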
So this kind of brings it all together. This is, on one slide, a complete experiment testing bag-of-words features with a logistic regression model, right? So set up SST_HOME, phi returns the dictionary counting the leaves, this is that simple fit_model thing, and then sst.experiment just runs that. I didn't show it, but it would print that report and give you all your experiment information. Does that make sense? The idea is: let's do experiments without a lot of copy and paste, without a lot of repeated cells that do basically the same thing. I'm hoping that in a notebook you could use this to conduct a series of experiments and study them without having a huge mess. That's the core of it. Let me say one more thing; actually, we have time for a few more. I want to give you a glimpse of what's happening under the hood, because again, if you move outside of this framework, this is a nice thing to know in terms of best practices. So under the hood, when combining your data with your feature function, I have used scikit-learn's DictVectorizer from sklearn.feature_extraction. And I want to walk you through why I'm doing that, because I think this is a really convenient interface for all kinds of problems, and I've done it here by way of illustration. So imagine my train features up here are two dictionaries, one with a and b and one with b and c, each with the counts associated with those features. If you write a feature function mapping a tree to a dictionary, what the code internally is doing is applying that feature function to all your examples and creating exactly a list like this. So that happens under the hood. But what a machine-learning model needs, or at least what all the scikit-learn models want, is a matrix. They don't operate on dictionaries; they operate on matrices that look just like the vector space models you were building before, right? Strictly numerical data, where the columns are the meaningful units. DictVectorizer maps these lists of dictionaries into such matrices. So I set up the vectorizer here, and vec.fit_transform on that list of dictionaries is the idiom in scikit-learn. X_train here is a matrix. I put it inside a DataFrame just so you could see what had happened, but under the hood it's really operating on a NumPy array. The pandas DataFrame is nice, though, because you can see that the keys in these dictionaries have been mapped to columns, and then for each example, 0 and 1, we have the counts of those features. That's really good because, for example, a failure mode for me before scikit came along was that I would go by hand from dictionaries to NumPy arrays, and I would get confused or have a bug about how the columns aligned with my intuitive features, and then everything would get screwed up. Now I just trust the DictVectorizer to do that. So as a human, I'm very happy to have these dictionaries; I think it's an intuitive interface. But my machine-learning models want this, and DictVectorizer is the bridge. The third thing that's really important about these DictVectorizers, the third problem they solve: suppose I have some test examples. Here the first one is just a with a count of 2, and the second is a, b, and d with their counts. But notice that d is a feature that I never saw in training.
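Before we get to that unseen feature, here is a small sketch of the training-side step just described: a list of feature dictionaries goes in, a matrix comes out, and a pandas DataFrame lets you see how the dictionary keys became columns. The particular dictionaries and counts are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical training examples, already featurized into dictionaries,
# like the ones a feature function would produce.
train_feats = [
    {'a': 1, 'b': 1},
    {'b': 1, 'c': 2}]

vec = DictVectorizer(sparse=False)

# fit_transform learns the feature space (a, b, c) and builds the matrix:
X_train = vec.fit_transform(train_feats)

# Wrap in a DataFrame just to inspect the column/feature correspondence
# (use get_feature_names() instead on older scikit-learn versions):
print(pd.DataFrame(X_train, columns=vec.get_feature_names_out()))
```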
If I call my vectorizer up here and just say transform on those test features, then I get X_test, and notice that it's perfectly aligned with the original space a, b, c, and that d has been dropped. It's not part of my feature representations from my training data; my model would be unable to consume d, and so by calling transform, I have gracefully solved the problem that d would have posed. And it's also doing the work of aligning the columns and so forth. [Student] Yeah, if you had called fit_transform, it would have just made a new column for that entry? Exactly, yeah. It would have had four columns, and I would have lost this correspondence with my original problem. Yeah, that's a great point of contrast. This is so important, because this is the way that you can swiftly go from a trained model and a vectorizer to processing new data in the way your model expects, and get the desired results. And again, this was an area where my code would contain mistakes before, and now it just doesn't, because of this DictVectorizer. It's really good to know about, and it's also good for you to know that this is what's happening under the hood as we take your feature functions and turn them into something that a machine-learning model can consume. Excellent. This has been a lot of material. The next phase of this is I'm going to show you some methods: hyperparameter exploration and classifier comparison. But I propose to save that for next time, because this has been a bunch of material, and I think, having reviewed this section, you're now pretty well set up to start doing the homework if you want. So let me just review that quickly and then we'll wrap up. It's a similar plot to the first homework and bake-off, except now focused on the Stanford Sentiment Treebank. So what I've done in this notebook is set you up and give you a couple of baselines; mainly, I'm trying to document the interfaces for you. I'll do more of that next class, but I've shown you how to fit a softmax baseline and an RNNClassifier and do some error analysis that might help you bootstrap to a better system. And then the homework questions, and there are just three. The first two involve a bit more programming than in the first homework, so they're worth a bit more. But then you have this familiar pattern of developing your original system, and your original system will be the one that you enter into the second bake-off. So that's the overall plot. For the bake-off itself, we're going to focus on the ternary task, as I said. And basically, you're just going to try to do as well as you possibly can on that task. In this case, I don't have to impose very many rules. You know, before, I didn't want you to download external vectors and so forth, but in this case I feel like we can just say: do whatever you think is best in terms of developing a good solution for this problem. If you want to download vectors from the web, that's fine. If you want to download other people's code, that's also fine. The one note that I have added is that it needs to be an original system that you enter. So you can't just download someone's code, retrain it, and enter that; you have to make some kind of meaningful addition or modification to that code. But beyond that, anything goes, except for the fact, of course, that again here we are really on the honor system. For the first bake-off, I could withhold the test data.
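Continuing the sketch above, here is how the test-side step would look. Again, the counts are made up; the point is just that the unseen feature d gets dropped while the columns stay aligned with the training space.

```python
# Hypothetical test examples; note that 'd' never appeared in training.
test_feats = [
    {'a': 2},
    {'a': 1, 'b': 1, 'd': 1}]

# transform (not fit_transform!) keeps the columns aligned with the
# training space (a, b, c) and silently drops the unseen feature 'd'.
X_test = vec.transform(test_feats)
print(pd.DataFrame(X_test, columns=vec.get_feature_names_out()))

# Calling fit_transform here instead would have created a fourth column
# for 'd' and broken the correspondence with the trained model's features.
```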
But in this situation, the test data are just the standard test split, which you all have in your data folder already. So we're completely on the honor system, just like we would be if we were publishing: we do all our development on the dev set and only use the test set for that very final phase, when you actually submit your system. Okay? Yeah, I think that's it. The softmax baseline is the one that I just showed you, and I run the experiment. Here I'm using the dev reader for development, so this is the kind of setup that you'll be defaulting to. Next time I'll show you that there are a few little tweaks you need to make to do deep learning models in this framework, but it's really easy, so I'll show that to you next time. I'm also giving you some tools for error analysis, and then finally the homework problems. The first one is: write an original feature function. The second one is a kind of transition into the world of deep learning. And then the third one is your original system. Does that make sense? All right, it's a bit early, but I propose that we stop here. Next time, I'm going to finish this lecture, and then I hope to leave time for you to do some of your own hacking, so that I'm sure you'll leave here on Wednesday feeling like you can complete the homework. The time window here is tighter, so that's really important.