I want to do a couple of announcements and then have a quick guest lecture from Moritz Sudhof, just because it was irresistible: Moritz has just done a project that bridges us from the VSM unit into the big themes that I tried to introduce last time around sentiment analysis, why you would want to do it, and tasks that are adjacent to it. But first, a couple of quick announcements. First, it's really exciting to see the bake-off submissions coming in; I've been checking them periodically throughout the day. I'm not going to reveal anything about the scores, because we want to reproduce everything, but I will say that I was pretty proud of my own system, I thought I was doing pretty well, and it's pretty clear that I would have gotten about 30th place. Some people have really done better than I did. But anyway, we will have a fuller report after we've done some reproductions next week. The other announcement I wanted to make: my bias for this course is definitely that you all come up with your own projects; I think that's more exciting. But when people write in with project ideas, ongoing things that you might want to join, I can't say no, because some of those things are really exciting as well, and I wanted to mention two today. This first one is from the people at Gridspace, which is a startup in the area that does a lot of speech-to-text work. They have lots of Stanford connections as well; some of you might have encountered them. They're developing some new datasets that are going to be quasi-public, not protected data, so it's pretty free to use, and they're around dialogues, at least part of which are people interacting with artificial agents. If you're interested in that project, they have a Google form that you can fill out, and I think they'll get in touch with you about sharing the data and so forth. This is actually exciting because they're still in the process of getting feedback from people, potentially you, about what kind of labels they should collect and how they should design the dataset, so this is a chance to get in early. The other thing I would say is that I attached a bunch of sample files which they sent me, and the automated_dialog MP4 is hilarious as advertised, so check that out. The other one is from Vinay Chaudhri, who has done this in the past. He has an ongoing project that some students worked with successfully in the winter for CS224N, and this is nicely timed, because the centerpiece of it is relation extraction using distant supervision, which is our topic for next week. So you potentially get a nice confluence of the work we'll be doing and a project that you might develop with Vinay. Again, I've provided his contact info, and I think you should feel free to write to him. Okay. And then, before we dive back into the SST, let me invite Moritz to come up. As I said, this was just a great opportunity, because Moritz is doing this kind of work in industry right now, and the topics are nicely aligned with the things that we've done so far. Thank you. Right. So as we're diving into the next unit, many of you are already starting on your projects, or at least thinking about it.
I want to take five minutes to do a digression into politics, and also to make my case that you should not forget anything that we've learned in the last two weeks, because it's extraordinarily useful for whatever you're going to be doing moving forward. To prove that point, let's talk about politics: a five-minute digression on politics, specifically political polling. I do some work with a Democratic political polling company, and what I learned about the space, which is kind of interesting, is that political polling is actually really bad right now. Most of it still happens on the phone, and you get these really structured questions like, "Do you care more about balancing the budget or income tax? Choose A or B." So the actual information that politicians get back is very sterile and highly issues-based. But what they really want to know is: as a voter, who are you voting for and why? What's driving your decision? How are you thinking about candidates? How are you thinking about the election? What messages are resonating with you? What's top of mind for you? They want to get much more at attitudes, perceptions, opinions. And so the parallels with sentiment are clear: knowing that Beto O'Rourke has, say, 32% of the vote and Elizabeth Warren has 18 is like knowing that one product has a five-star rating and another has 4.5. The question is why. What's the difference between those two? Why do some experiences resonate more than others? Why do some things receive more support than others? So imagine that you are building an NLU system and it needs to answer these questions: What are voters looking for in a candidate? Why are they supporting their stated choice for the election? For this dataset, we're going to be looking at three quick questions from a survey. Respondents were asked what qualities they look for in a president, in a candidate for US Congress or Senate, and in somebody for state or local government, and for each respondent we also have the party identification. So as a baseline you have a few thousand responses of people basically talking about how they're framing this election, who they're supporting and why, and we want to start to understand how we can give politicians richer feedback on what's driving these decisions. I wanted to show you a notebook, but I couldn't get my own computer connected. The only reason I wanted to show you the notebook is that this is literally the only code in it: using our wonderful VSM code, I built my own word-by-word matrix, but everybody knows how to do that. You call these two lines, and suddenly all the rest of the exploration is basically based off of those two lines of code. So I think it does a nice job of showing how quickly you can get started with a totally new and unknown dataset. What I want to do very quickly is look at some of these t-SNE plots. Let's say that we're interested in understanding whether Democrats and Republicans make decisions differently for the election. You can see there are some colors here: I've colored the terms based on whether they are more associated with Democrats or Republicans.
So, green is independent, red is Republican, that's kind of up here. How do I... I'm not Chris, I'm Moritz. Red is up here, blue is more down here. And if you actually zoom in, there are a couple of nice things that happen; I've already annotated it for clarity here. You can see that on the Republican side, people are talking a lot about issues. The words that you can't really read up there are words like defense, fiscal, Constitution; these are all about conservative values, immigration, policies and values. But if you look at where the clusters are more dominated by blue, here you get things like collaborative, cooperative, composed, articulate, charismatic, well-spoken. These are all descriptions of personal characteristics, descriptions of somebody who probably doesn't match how these people are thinking about Trump. So you can already see a clear split between how Democrats are thinking about candidates in the election and how Republicans are. We can also look at this taking advantage of the other structure we had in the dataset, which is that we also know what office they're talking about. We can look at the same thing, this time coloring the words based on whether they're more likely to show up in a comment about presidential, local, or congressional elections. And as you would expect, there's a ton here: the personal qualities, this dark blue, that's all presidential. So personal qualities like temperament, being quick, being unifying, being a negotiator, being classy, those are all super important for president. But if you look more at the area over here, where it's the other colors, for local elections (whoops, sorry Chris), you can see that it's a lot more about issues. We have advocate, accessible, there's health care, affordable housing; these are much more the bread and butter of "I have local policies that I care about that need to get done." I'm going to comment here: great work, Moritz. Oh, wow, thanks Chris. And then the final thing I want to show: so, okay, how long did this take me? All I needed was to take the original data and call two lines of this code and another visualization line, and I feel like we're already starting to get a sense for how we would tackle this problem. We're building an NLU system to answer the question: what are voters looking for, and why are they supporting candidate X over Y? I think one thing we know is that we would have to be sensitive to whether they're talking about issues or personal characteristics; that seems to be a cleavage in our data. And the other thing we might want to understand better is personal qualities, and which qualities are resonating more with which people. So maybe we dig deeper and try to find sub-clusters: this one is about decency and humanity and fairness, but this one is about being well-spoken. So this would be a first step. You're diving into a new, unknown dataset, you want to know what is in it and how you actually solve the higher-level NLU problems you want to solve, and the code that we've all explored so far in the first unit is your best first stop. And even for... I got paid to do this, by the way... even in industry, this sort of code is not just starter code.
It can give you a great handle on a dataset, on a problem. So as you embark on this next unit and your projects, I encourage you never to forget the humble vector space models, and know also that domains like politics, like customer understanding and experience, like employees and their experience, all of these domains need NLU people. They do not have it figured out. So it's up to all of us to help them and do exciting new things for our projects. Thanks. [APPLAUSE] Oh yeah, any questions? You guys are convinced. Oh, there we go. Yeah. Why are you assuming that [inaudible] [BACKGROUND]. Yes, so the question is: somebody answered a survey, do we even care about the data that comes back? Do we trust it? Do we think it means anything to us? The deeper point there is that in the actual application of this analysis, we have data on turnout and previous support for candidates, so there are things we can verify, like, yes, these people vote. But I think the broader question is: when people are telling you about their experience, should you care? Does it matter? And I'm going to make the claim that whenever people are being emotional and expressing opinions, you should care, because they did not need to fill out these surveys or offer these reviews. So my first bias is always that there's something of value here. They had some reason for telling us this, and in aggregate, if a million people tell you something, you can learn something from it. So I'm always optimistic, I guess, to answer that question. Another question? [NOISE] [inaudible]. Absolutely. I think one thing that is true overall is that in politics, for presidential elections in America, style matters, so there's always going to be some of that entering in. But it also speaks to how, in the age of Trump, things are different: the word "orange" showed up a lot in this dataset, and you would not expect that to ever be a relevant thing to say in a presidential survey otherwise. That also speaks to why, when I applied GloVe to the same data, it failed: the words people were using in this context just mean something a lot more specific than in English at large. So there's also kind of PMI coming back to the rescue where GloVe failed. Okay, let's dive back into the sentiment stuff. Just to recap, what we did last time was, I set the stage. I talked about sentiment as a general problem in NLU, and I tried to make the case that it's a very interesting problem even if at first blush it might look kind of simple. Then I gave you some general tips and showed you some evidence that it's worth thinking about preprocessing your data and how you do that thoughtfully. I introduced the Stanford Sentiment Treebank that's up here and highlighted some of its unique properties as a dataset in general and certainly for sentiment. Then we spent a bunch of time walking through the basics of sst.py, and I'm going to show you one screen from that that kind of summarizes that whole unit. The reason that's important is that's what you want to work with productively as you're doing your Homework 2 and Bake-off 2.
And in fact, I'm going to try to get through this material at a leisurely pace, but we'll keep moving forward so that we can have some class time today to make sure that you are all set up and working productively with it. A bunch of the team is available in the classroom, and we did a post on Piazza in case you're remote and want to log in and chat with someone about how things are working, or maybe take a step back and think about regular supervised learning, the kind of stuff that I'm sort of taking for granted in this course; the team is happy to fill in any gaps that you might have. But the point is, since the time window is tight, we want you to get good with the code by the end of the day today, and we're available for that. This slide kind of summarizes the entire framework. What I'm showing here is for simple linear methods, which we're going to explore first, but exactly this same framework works for all the deep learning models that we'll discuss. It's meant to be flexible and modular in a way that will let you run a lot of experiments without making a lot of mistakes, and the long and short of it is: set yourself up to point to the data distribution; your feature functions, which here I've called phi, should always operate on trees and return dictionaries, and let's assume they're all count dictionaries; and you should also have these model wrappers, which take in a supervised dataset, an X, y pair, fit a model, and return that fitted model. Then all you have to do to run an experiment is call sst.experiment, pointing at your data distribution, with that feature function and that model wrapper. That's it, and already from here you can see that there's lots of space to explore. If I wanted to try Naive Bayes or a support vector machine, I would just write a different model wrapper. If I wanted to explore different feature functions, I would just write new functions of trees, and that would be very quick in terms of evaluating. By default, this is going to evaluate on random train/test splits drawn from the training data, and you could periodically test against the dev set to see how well you're doing in reality. That's the rhythm that I'm imagining. I gave you a quick peek under the hood at how this code is designed, and I would encourage you to look yourselves and find out even more about how it works, but it's kind of all based around these DictVectorizers, which I think help us avoid a lot of common coding mistakes when building feature representations of data. That's the quick recap. Are there any questions or comments about it before I dive into the new stuff? Any concerns that might have emerged over the last couple of days?
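For reference, here is roughly what that whole recipe amounts to if you write the pattern out in plain scikit-learn. This is a sketch of the division of labor, not the actual sst.py code, so the helper names here (simple_experiment, unigrams_phi, fit_softmax) are just illustrative:

```python
from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def unigrams_phi(tree):
    """Feature function: map an SST tree to a count dictionary of its leaves."""
    return Counter(tree.leaves())

def fit_softmax(X, y):
    """Model wrapper: fit and return a softmax (logistic regression) classifier."""
    mod = LogisticRegression(solver='liblinear')
    mod.fit(X, y)
    return mod

def simple_experiment(trees, labels, phi=unigrams_phi, train_func=fit_softmax):
    """Featurize, vectorize, split, fit, and score on a random held-out split."""
    feats = [phi(t) for t in trees]
    X = DictVectorizer(sparse=True).fit_transform(feats)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3)
    mod = train_func(X_train, y_train)
    return f1_score(y_test, mod.predict(X_test), average='macro')
```

The real sst.experiment adds conveniences around this (pointing at readers, class relabeling, default scoring), but the split between the feature function and the model wrapper is exactly the one sketched here.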
All right, let's dive in. We have budgeted some time later in the term, when you are in the thick of your projects, to talk a lot about methods and metrics, and we're going to return to the two themes that I'm introducing now in that context, and talk about them even with a little bit of the philosophy behind them. But I want to introduce them now because I think both of them can make you a better experimenter right from the start. Those two things are hyperparameter exploration and classifier comparison. So let's start with the first, which here I've called hyperparameter search, and I want to give you the rationale for it. I'll just walk through this argument. First, some terminology. The parameters of a model are those whose values are learned as part of optimizing the model itself; those are sometimes called weights. And that's mainly what we think of when we think of machine learning: from data, our model has the capacity to set all these parameters in a way that's effective for new data. I'd say equally important are what are called the hyperparameters of the model. These are any settings of the model that are outside of the core optimization process. Some examples: in GloVe or LSA, you have the dimensionality of the representations that you're going to learn. For GloVe you have the learning rate, and you have xmax and alpha as part of the weighting function, which is essentially a preprocessing step that does something to your count matrix. Neither of those values is learned as part of GloVe, but they're super important in terms of what you learn. Also regularization terms, hidden dimensionalities, learning rates, activation functions. You could even go so far as to say that the optimization method itself, the algorithm that you use, is a hyperparameter of your model. There's no end to these things, essentially. If you go on to scikit-learn and just look in the linear model package at things like logistic regression, or if you look at the support vector machines in there, or the Naive Bayes models, they have dozens of hyperparameters, and you can see them codified in the code, because it says: here are the keyword arguments. They have some defaults, but each default is just one of the many values that you could explore for each of those settings.
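For example, in scikit-learn you can list a model's hyperparameters and their default values directly (the exact set depends on your scikit-learn version):

```python
from sklearn.linear_model import LogisticRegression

# Every keyword argument with a default is a hyperparameter you could explore.
print(LogisticRegression().get_params())
# e.g. {'C': 1.0, 'fit_intercept': True, 'penalty': 'l2', 'solver': ..., ...}
```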
This is crucial because, to build a persuasive argument, you need to be thoughtful about your hyperparameters. And the rationale is what I've said here: every model that you evaluate really needs to be put in the best possible light. To round that out: otherwise, in a kind of antagonistic situation, you could appear to have evidence that your favorite model is better than some other one just by strategically picking hyperparameters that favored yours over the other, right? Pick really good settings for your model and ones that you know are kind of degenerate for the other one, and then you say, "Look, my model is better." That's a problem. Hyperparameter search is all about making sure you don't have the opportunity to do that opportunistic selection. The other way to think about this is that in science there is a kind of antagonistic dynamic, and I think it happens primarily in the service of making sure we make progress: you submit your work somewhere and some referee is now evaluating it. The default mode for that referee is to think, "Does this person really have results? Can I trust what this person is saying?" From their perspective, they might think, "Well, they've only shown me two settings; how do I know they didn't pick those settings in a way that would rig the game in their favor?" What that referee is really looking for is evidence that you have done what I've described in point three here, which is to put every model that you evaluate in the best possible light given the data that you have. And what that really implies is that when you evaluate these two models, you do quite extensive hyperparameter search, and you describe not only the space that you explored but also report some statistics about how the models performed: certainly the best-performing model, but maybe other information about average performance and so forth. Once you've done that, if your model wins, we have a lot more confidence, and your referee is going to feel less antagonistic, because they can feel that both of these models were really given a chance. And it doesn't need to be antagonistic; you yourself could be that critic. You're evaluating a bunch of models, you want to enter the bake-off, and you have every incentive to really and truly pick the best model. In that case, you should explore wide ranges of hyperparameters so that you can submit the best entry given the things you're exploring. So I'm encouraging you to do this. Another nice thing I'll say is that if you use the code to do this search, then you can set up a grid of hyperparameters that you want to explore, click run, and then go to the movies or take a hike, and know that that entire time you are working hard, exploring, finding the best possible model. That's the reason for automating this stuff. What I want to talk with you about later, especially in the context of your projects, is bringing some perspective to this. Really, we're talking about infinite spaces of hyperparameters, and you could spend infinite dollars doing these explorations. Someone has to impose a budget on you, and I think we can have a discussion about how best to think about those budgetary constraints: a priori constraining your search space and convincing reviewers that you haven't done anything harmful. But for now, I think what we just want to enable you to do is some exploration of pretty reasonable spaces of parameters for your models, and that's what this is all about. So again, this is a complete experimental setup. What I've done is leave phi alone, it's just a bag-of-words representation, but fit the softmax classifier with cross-validation. Here you start to see the value of having put wrappers around all these models as opposed to just calling fit. What I do in this case is first set up a logistic regression model. I decide that I want to cross-validate over five folds, that's cv=5, and then the really interesting thing is what in scikit-learn is called the param_grid. It's saying: for the intercept bias, I'm going to try the model with and without; for this value C, which is the inverse regularization strength, I'm going to explore a range of values from 0.4 to 3; and I'm also going to explore which kind of regularization I do. If I pick L1, I'm probably going to get a model that favors really sparse feature representations, where a lot of the weight values go to zero, whereas L2 will give me more non-zero weights, kind of evened out. So I'll explore that as well. And I've included in the software for the course, in utils, fit_classifier_with_crossvalidation. You give it your dataset X, y, your base model, your param_grid, and the cv value, and it will find for you, via cross-validation, the best model, so it should be pretty responsible, and then you return that. By default, it's exploring the full grid of values. These spaces grow very quickly; that's why you might want to pick a long movie or a long hike if you've picked a large param_grid. But the point is that your model will churn away at this and return what, in a data-driven way, it has decided is the best model, and then you could enter that into the bake-off or into a subsequent evaluation.
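If you want to see what that utility is doing conceptually, here is the same kind of model wrapper expressed directly with scikit-learn's GridSearchCV (a sketch; utils.fit_classifier_with_crossvalidation itself may differ in its details and in what it reports):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def fit_softmax_with_crossvalidation(X, y):
    """Model wrapper that cross-validates over a small hyperparameter grid
    and returns the refit best model (conceptual sketch of the course utility)."""
    basemod = LogisticRegression(solver='liblinear')
    param_grid = {
        'fit_intercept': [True, False],
        'C': [0.4, 0.6, 0.8, 1.0, 2.0, 3.0],
        'penalty': ['l1', 'l2']}
    crossvalidator = GridSearchCV(basemod, param_grid, cv=5, scoring='f1_macro')
    crossvalidator.fit(X, y)
    print("Best params:", crossvalidator.best_params_)
    return crossvalidator.best_estimator_
```

Because this has the same wrapper signature as before, you can hand it to sst.experiment unchanged, which is the point made next.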
Yeah? [inaudible]. It checks every setting, every combination, so it does True, 0.4, L1, then True, 0.6, L1, and so on for all of the values. That's what it's doing; it's called grid search. I've hardwired grid search in, but it's perfectly reasonable to think about things like random sampling from the space. I think that typically performs basically as well, and then there are more sophisticated packages like scikit-optimize that will try to do more data-driven, model-based exploration of the parameter space. But all of these things come down to the same intuition, which is: explore widely under some budgetary constraints. And this function here, best_mod, I think it actually prints out, yes, when you call it, it prints out the best parameters. So you'll get some feedback, but you also get the model itself, which you can use subsequently. In terms of sst.experiment, nothing changes, and the reason nothing changes is that all of that cross-validation stuff was packed into the model wrapper. Makes sense? All right. So go forth and explore widely. Let your computer run overnight or something like that. The second methodological thing I thought I would introduce now is classifier comparison, and let me set this up again thinking about interacting in the community. Suppose you've assessed a baseline model B and your favorite model M, and your chosen assessment metric favors M. Good news, right? Looks like you won. But is your model M really better? That's a deep question. First of all, if the difference between B and M is clearly of practical significance, like your model is going to save lives and you can show it's going to save 1,000 lives, then maybe you don't need to do a subsequent round of statistical comparison, because maybe it's just clear that we should pick your model. But still, especially in this age of deep learning, you have to ask the subsequent question of whether there's variation in how either of those two models performs, because maybe on one run you saved 1,000 lives, and on another your baseline model was actually saving more. So even in the case of practical significance, you might want to do something further. Now, I've offered two methods for doing something further. The first is to run a Wilcoxon signed-rank test. This is just wisdom that I got from the literature. It's a kind of non-parametric version of the t-test that doesn't assume anything about the distribution of your scores, and it allows you to assess, based on repeated runs of a model, which one is likely to be better. We can talk later about the rationale for choosing precisely that test, but the bottom line is that it's going to be a pretty conservative statistical test that will, especially if you can afford to run the model a lot, give you a pretty robust picture of whether your model is truly better than the other one. That's really good. The only downside is that you have to be in a situation in which you can repeatedly assess your two models B and M. So your dataset has to support that, like a lot of random train/test splits, and your budget has to support it, right? If it takes you a month to optimize B and M, then you're probably not going to be able to do this kind of testing, because you can maybe only afford one or two runs, and really what you want is something like 10 to 20 runs. In those situations, my offer to you is McNemar's test.
So this is applicable in a situation in which you can run the models just once: you get a single confusion matrix from the two models, and the test operates essentially on that fixed set of values. The null hypothesis you're testing is basically: do the two have the same error rate? So this is noisier, and you might have less confidence in it, but I think it's better than nothing when it comes to comparing classifiers. It's certainly better than just looking at the raw numerical values and deciding that the larger one is clearly superior. Does that make sense? And again, if you think about reviewers: when they see just two numbers, they want context. They want to know, well, how big is that difference really? Practical significance is the best answer to give, but these tests will further substantiate it. And in situations in which the differences look small, you might still have a good argument in favor of your model if the two models are very consistent in their behavior, and then you could use these tests to substantiate that even in the face of what might look like a small difference. And I hope I've made that easy. To set it up, what I've done here is fix a bag-of-words feature representation phi, like before, but now I have two model wrappers. One I've called fit_softmax, you might call it fit logistic regression, and the other is fit_naivebayes. This is a classic kind of comparison, especially for sentiment: let's see which is better, logistic regression or Naive Bayes. So you set up those two model wrappers, and then there are two ways you can do this. The first, again, is a kind of Swiss army knife, sst.compare_models. You give it at least one feature function, and if you give only one, it will assume it should use it for both experiments, and at least one train function; here you can see I've filled that out with fit_softmax and fit_naivebayes. The rest of the arguments are just default values that you can probably leave alone, and it will run all the comparisons it needs to in order to run the signed-rank test, which is given here as stats_test. What you get out are the two means for the models, which are just printed, and the p-value. And up here you can see I'm also returning the full vectors of scores, in case you want to plot them to get further information, beyond just the means, about how these two systems compare. One might have chaotic performance that you see when you look at the scores, and another might be more stable. [NOISE] The other test is kind of simpler to run: here I've just run two fixed experiments, softmax and Naive Bayes, and then you can run the McNemar test. Here you just have to make sure that you've run them on the same data. And this is the interface: I just get the actual values from one or the other of these, because they're meant to be the same, and then look at the two sets of predictions. So again, the first approach will take a while to run, because you have to run a lot of experiments, and this one requires running just one experiment.
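In case it helps to see what these tests amount to outside the course wrappers, here is a bare-bones version of both: scipy's Wilcoxon signed-rank test on repeated paired scores, and McNemar's test computed by hand from a single run's predictions. The score vectors are invented purely for illustration, and the course's own interfaces may report things a bit differently:

```python
import numpy as np
from scipy import stats

# Wilcoxon signed-rank test: paired macro-F1 scores for models B and M from
# repeated random train/assess splits (these particular numbers are made up).
scores_b = [0.61, 0.63, 0.60, 0.62, 0.64, 0.61, 0.63, 0.62, 0.60, 0.63]
scores_m = [0.64, 0.65, 0.63, 0.64, 0.66, 0.63, 0.65, 0.64, 0.62, 0.66]
print("Wilcoxon p = {:.4f}".format(stats.wilcoxon(scores_b, scores_m)[1]))

def mcnemar(y_true, preds_b, preds_m):
    """McNemar's test from a single run: the null hypothesis is that the two
    classifiers have the same error rate on these examples."""
    y_true, preds_b, preds_m = map(np.asarray, (y_true, preds_b, preds_m))
    b = int(np.sum((preds_b == y_true) & (preds_m != y_true)))  # only B right
    c = int(np.sum((preds_b != y_true) & (preds_m == y_true)))  # only M right
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)   # with continuity correction
    return stat, stats.chi2.sf(stat, df=1)   # test statistic and p-value
```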
Questions about that? All right. Those are the two methods things that I wanted to cover. Let's dive into the feature representation stuff, then. I'm doing this by way of example, and the idea is that, having seen a few examples, you can start to get creative about how you represent features. So far, what we've done is bag of words. It's kind of small, but what I've done here is just a bag of bigrams, and you could do trigrams and so forth. That's pretty standard, and it's worth trying out. An interesting thing you can do, because we have tree structures, is actually use that structure. What I've called that down here is phi_phrases, and what it does is represent the data as all the words that occur in isolation, plus all the pairs of words that occur as siblings of the same parent, on the assumption that it's interesting that "is amazing" forms a phrase linguistically, but "NLU" and "is" do not. And you can see the differences here: "NLU is", for example, is part of the bigrams, but it is not part of this phrasal representation, because there's no single parent that dominates just "NLU" and "is". If you play that out over the very rich tree structures in the SST, you find that you're getting very different representations than you get from the simple linear pass that bigrams is doing. And you can see down here I've just said the height is less than 3, to get just these local trees, but you could set it higher, and that would be a version of getting higher-level n-grams, because you'd be finding larger and larger chunks of these trees. The other class of features I want to highlight here is negation. Last class, I highlighted a heuristic method that I think is very powerful: just find negation words that are in a lexicon, and then mark all the following tokens with _NEG, to indicate that they are, in some sense, in the scope of that negation word. The problem is that that can be pretty imprecise. You end up depending on things like punctuation, and if the punctuation isn't there, then the _NEG marking just runs on and on, way past what you would think of as any kind of semantic scope for the negation. Because we have tree structures, we can be much more precise. In this example, "The dialogue wasn't very good, but the acting was amazing," it seems clear that the negation is meant to target only "very good". That's hard to keep track of in the linear method, but it's easy with the tree structure, because basically what you want to do is write a feature function that, when it finds a negation, goes to the parent and then marks everything below that parent. That's very close to what linguists think of as the scope of the operator, and it's nice because then all of this clearly unnegated stuff is left alone. It's a very general idea: there are lots of things in language that take scope in this way, that have semantic influences over the things next to them in the tree. We touched on this last time: "They said it was great" is not a speaker commitment to it being great, whatever it is, because of this verb of saying. So you can imagine writing feature functions that also mark things in the scope of verbs like say and claim, and maybe doubt and deny, in some special way, so that they can be treated differently by your model. Here's another one: "It might be successful" is a kind of hedge, very different from "It is successful," and you could capture that by doing some of this scope marking from "might" down into the thing that's next to it. And this is something that's really possible only because we have these trees.
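To make these two ideas concrete, here are rough sketches of a bigrams feature function and a tree-based negation-marking one. The negator list and the exact scope convention are my own simplifications for illustration, not the course's official versions:

```python
from collections import Counter
from nltk.tree import Tree

NEGATORS = {"not", "n't", "no", "never", "nobody", "nothing"}  # illustrative list only

def bigrams_phi(tree):
    """Count dictionary of adjacent leaf pairs: a simple linear pass."""
    leaves = tree.leaves()
    return Counter("{} {}".format(w1, w2) for w1, w2 in zip(leaves, leaves[1:]))

def tree_negation_phi(tree):
    """Bag of words, plus a _NEG-marked copy of every leaf that follows a
    negator under the same parent: a rough stand-in for semantic scope."""
    feats = Counter(tree.leaves())
    for subtree in tree.subtrees():
        kids = list(subtree)
        for i, kid in enumerate(kids):
            kid_leaves = kid.leaves() if isinstance(kid, Tree) else [kid]
            if any(w.lower() in NEGATORS for w in kid_leaves):
                for sib in kids[i + 1:]:
                    sib_leaves = sib.leaves() if isinstance(sib, Tree) else [sib]
                    feats.update(w + "_NEG" for w in sib_leaves)
    return feats
```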
[NOISE] And here are a few other ideas. Obviously, lexicon-derived features; you're going to do some of those for the homework. You could also think about modal adverbs marking their scope, like "It is quite possibly a masterpiece" or "It is totally amazing": those adverbs are doing something special to commitment. Another idea is what the literature calls thwarted expectations. This is the case where I build up what seems to be one kind of evaluation only to really offer another one: "Many consider the movie bewildering, boring, slow-moving, or annoying." That speaker might be building up to a positive endorsement, because of this shift in perspective they're performing, and one signal of that is that they're laying it on thick with all of this negative language. It can happen in both directions: "It was hailed as a brilliant, unprecedented artistic achievement worthy of multiple Oscars, but it was terrible." That kind of review is one your model is going to struggle with, because of all the indicators of positivity balanced by that one negative word. But the imbalance actually might be a signal to your model that somebody is keying into this thwarted-expectations thing. So you could just count all these words, essentially, and look at their ratios. And then even harder, kind of the holy grail for work in sentiment analysis, is getting a really good grip on non-literal language use. If someone says "It's not exactly a masterpiece," it's probably terrible. If they say a movie is 50 hours long, that's hyperbole: they probably don't literally mean it, and they're doing something special socially or emotionally when they pick that hyperbolic expression. Or "the best movie in the history of the universe": that one is hyperbole, and you feel like it could go either way in terms of whether it's positive or sarcastic. There are lots of other ideas; this is just a sample of them. And for that lexicon idea, I showed you some lexicons before; that's a pretty rich set of resources to mine for doing sentiment analysis. I had one other methodological note that I thought I would insert here, because it might be that, for the bake-off for example, you end up writing a lot of feature functions. A bet you could make early on for this bake-off is that you should be using linear models, not deep learning models, because you only have until Monday and it can take a long time to get deep learning models working. That might set you on the path of writing a lot of interesting feature functions of the sort I've just been describing, so here is a methodological note about assessing those feature functions. Suppose you've written a lot of them, and you might want to combine them into a single model, which I'd also encourage. Scikit-learn offers, in its feature_selection package, a bunch of methods for assessing feature functions individually. What you're doing with those tools is essentially assessing, in isolation, how much information they give you about the class label, and that can give you a picture of how much predictive value they have. What I've done here, with this little dataset that I constructed artificially, is try to send you a warning that assessing feature functions in isolation might be a good first-pass heuristic, but it can be dangerous, because in the presence of correlated features you often end up overstating the value of those features. What you're really doing when you run your model is assessing them all in the context of that unified model, and that's something very different from this individual testing. I'll let you think about the example here, but the point is: if I use this chi-squared test in isolation, it looks like at least these two features, X_1 and X_2, are really good to include, and maybe you decide to drop X_3. But the truth of the matter is, if you fit an integrated model, then X_1 alone is the best model, and including X_2 actually degrades the model's performance. And that argues, I think, for something more like a holistic assessment, where you are maybe dropping or adding individual features, but always in the context of the full system you're evaluating, and not so much in isolation the way I've done here.
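If you want to reproduce that trap for yourself, the isolated scoring is only a couple of lines of scikit-learn, and a tiny synthetic dataset (invented here purely for illustration) shows the issue:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)
X = np.column_stack([
    y + rng.poisson(1, size=200),   # an informative count feature
    y + rng.poisson(1, size=200),   # a second, highly correlated near-copy of it
    rng.poisson(1, size=200)])      # pure noise

# Each feature scored in isolation: the two correlated features both look great,
# even though adding the second one buys an integrated model little or nothing.
scores, pvals = chi2(X, y)
print(scores, pvals)

selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support())       # which columns an isolated test would keep
```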
Yeah? For correlated features, is there something simple you can do, maybe some PCA thing, to understand how your features are related before you actually choose which ones to use? I think that's a reasonable response; it's certainly a way of removing those correlations. It can cause some problems at test time, though. If you pick one of those matrix factorization methods and apply it to your training features, then when I give you new examples, you have to be able to apply it reliably to those new examples as well, and that can be problematic because the characteristics of the new test examples might be very different from the training examples. So it's just something to think about in terms of a real deployment. You mean the data distributions are different, or? Yeah: if you do that reduction on your training set, and then I give you a single test example, it might not be so clear what you're supposed to do with that test example in terms of reweighting its feature values. Scikit-learn actually tries to manage this in some cases. I'm not sure what it does for PCA, but if you run TF-IDF, then fit_transform will reweight your feature matrix, and if you then call just transform, I think it uses the IDF values it stored from training and applies those to the new test cases. Oh wait, sorry, I don't know if I'm communicating what I meant, but, for example, constructing a correlation matrix of agreements between your features and how you label the class, and using that as a way to select which ones stay? Oh, so not as a way to change the representations, but rather just to do the selection. I think that could be a good heuristic, and it's certainly telling you something about the entire matrix, which is what I'm pushing here. Yeah? Pretty much the same question; I was going to ask about PCA, and that's pretty much answered. I think you'll get finer-grained information, if you can afford it, by using, well, scikit-learn has these ablation-style functions that will repeatedly run your model having added or removed some of the features, and then you can see their practical significance in the context of the full model, with exactly the scaling that you're going to be using for your task. Final thing about feature representation, and it's a kind of transition into the world of deep learning: what I'm pitching here is using distributed representations as features. So imagine my simple example from the SST is "The Rock rules".
You know, The Rock is an actor. What I can do in this case is just look up all of these words in an embedding space, maybe GloVe, maybe one you built yourself in the first unit. So look them all up. What I need to do next is find a way to combine them into a single fixed-dimensional representation, because all these methods presuppose that all the examples come from the same space, essentially. In this simple mode, what you would do is combine them: you could use, for example, the component-wise sum of the values, or the mean, or the product, or whatever, some function that takes them together into a single representation. And then that function's output, x here, is just the input to your classifier. Whereas before, if we built a bunch of feature functions like the bag-of-words one, your feature representations would have something like 20,000 dimensions and be very sparse, these distributed representations of your data will have 50 to 300 dimensions, depending on what vectors you downloaded or built, and they will be very dense, of course: all the dimensions will be active. So it's a very different perspective on your data. But it's amazing how well these classifiers can perform, given how little information you seem to be feeding them, and also given how much hangs on this combination step, because of course this is a hyperparameter of your model; it's not something your model learns. Yeah? If you had multiple sentences, say paragraphs, what would be your limit for how many pre-trained vectors you would add together before you couldn't really get a signal out of that data? I'd say there's no preset limit. I think it's actually striking how well you can do this with fairly long documents: just sum them up and get a pretty good representation of what's in there. In that situation, my intuition is that you would not want the mean, but rather the sum, which encodes a lot of information about the length of the text, essentially in the magnitude of the dimensions as you add them together. It's kind of amazing how well that can work as an all-purpose rough look at the full text. In terms of running experiments for this, again, I claim it's really easy. Oh, a question in the back here. Go ahead. [inaudible]. I think this is a nice baseline for a sequential model. If I wanted to argue for an RNN that used GloVe inputs, I might use this as my baseline, because what I would be doing there is saying: keeping constant the amount of lexical information that I'm introducing, what precisely is the value of modeling the full sequence? Then you'd be assessing your RNN on how well it did over and above this simple combination function, because the simplest version is summing the vectors and the complicated version is your fancy LSTM. The framework can accommodate these representations. I've given you the full recipe here, and most of it is just building up the embedding space, the GloVe lookup. I have a general-purpose feature function that has too many arguments for the framework, so I define a special-purpose one that builds my GloVe representations. The softmax wrapper is as before, no change there. Then you call the experiment, and the only thing you have to remember, if you have this kind of data coming in, that is, vectors and not dictionaries, is to say vectorize=False. Otherwise, it's going to assume that those vectors coming in are actually dictionaries, and it will do, if it does anything at all, something quite crazy. What you're essentially doing is saying, "Hey, sst.experiment, don't featurize my data using a DictVectorizer, just take it as it comes." And then it all works out. Other questions about that approach?
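Here is that recipe in miniature. The embedding loading and the call into sst.experiment are only sketched (the glove dict and glove_sum_phi names are mine); vectorize=False is the one detail that matters for the framework, as just described:

```python
import numpy as np

# Assume `glove` is a dict mapping words to 50-dimensional numpy arrays,
# loaded however you like (e.g., from a downloaded GloVe file).

def glove_sum_phi(tree, glove, dim=50):
    """Represent an example as the sum of the GloVe vectors of its leaves.
    Unknown words are simply skipped; a random vector is another option."""
    vecs = [glove[w] for w in tree.leaves() if w in glove]
    if not vecs:
        return np.zeros(dim)
    return np.sum(vecs, axis=0)   # np.mean(vecs, axis=0) is the other obvious choice

# In the course framework, remember vectorize=False so that sst.experiment does
# not try to push these dense vectors through a DictVectorizer, e.g. roughly:
#   sst.experiment(..., lambda t: glove_sum_phi(t, glove), fit_softmax,
#                  vectorize=False)
```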
Great, let's do two more things, and this is a foray into deep learning. The first will be RNN classifiers. This is implemented in a few ways in the package. If you want all the details, go to np_rnn_classifier; it gives you the forward pass and also the backpropagation for learning. But I'd actually recommend that you use the Torch RNN classifier or the TensorFlow version, because they're faster and more robust. All of them are drawing on the same basic structure, which I've depicted here. So "The Rock rules" is my example. I look the words up in some embedding space, which could be pre-trained, like GloVe or something you built yourself, or it could be a random embedding. And then the crucial thing is the action of this RNN. It has this hidden layer here, and you do combinations to get hidden states at each time step, and then, in the simplest version, the final time step is the basis for your classification decision. So imagine that by the time I've gotten to h_3, I'm looking at some fixed-dimensional dense representation, and that's the input to a classifier. This little part here is essentially the same as before, where x is the final hidden representation from the RNN, and that's why I think these are a natural pair if you think about assessment. And so this is just the label for the SST sequence, and these are the leaves. People might have seen these models before. It's worth noting one thing about how you do preprocessing that makes these a little bit fiddly. Your examples here are presumably lists of strings; here I've got this tiny vocabulary, so imagine these are all just different words, two examples. In the standard preprocessing flow, you map those into indices, and those indices are into an embedding space that you've built. By default, the code will randomly initialize one of these embedding spaces, but if you feed in your own space, it will use that as the embedding. And so from these indices, you do all the look-ups, and what the model actually processes, of course, is essentially a list of vectors, so the final version of these examples is the list of vectors down here. That's a kind of standard setup for these problems; it's not obligatory, of course. When we look later in the course at contextual word representations, we will skip all of this and just directly look up entire sequences, because the point there is that words can have different initial representations depending on the context they're in, which is something this static embedding look-up cannot capture. But the RNN itself is an exciting development to my mind, because even if I have fixed representations down here, because I'm modeling the full sequence like this, I'm modeling all of these words in the context they occur in. For example, one thing I really like about this as a linguist: I have an intuition that negation is important for sentiment. I showed you a messy heuristic method based on the tokens, and then I showed you a very precise version that depended on trees.
But, you know, the trees might be wrong, or I might just be confused about how scope works in my dataset. So both of those approaches have drawbacks, because neither of them is really responding to information from my labeled dataset. With this RNN, in this sequence of hidden representations, one and the same token at position h_2 could get a different representation depending on whether or not h_1 corresponds to a negation, and that's the sense in which these models, just out of the box, give you the power to see those semantic influences play out at the level of the sequence. And the same thing for might, and for say, and for claim, and for no one, and for everyone: all these things affect the semantic context, and therefore affect the hidden representations associated with the words in the sequence. So it's a lovely holistic analysis of your example, with clearly the potential to capture lots of interesting semantic influences, and I think that's borne out: these are very successful and powerful models for lots of different tasks. I have a note here on LSTMs; I'm not going to dive into this too much. Suffice it to say that regular RNNs, like that NumPy one down here, struggle with long sequences, and it's not a surprise that they do, because the signal they get from the training label up here becomes very diffuse and problematic as the sequences get longer. LSTMs and other mechanisms like them are ways of managing that flow of information in a way that leads to better results, and the Torch and TensorFlow RNNs both have LSTM cells by default. I'm not going to give you an explanation of them here; we don't have that much time, and also, frankly, I feel like I just can't improve on these two blog posts, especially the first one. If you want an intro to LSTMs, I highly recommend it. It uses a lovely kind of visual language for helping you think about what's happening inside these cells and why they work so well; very helpful. And then, finally, here's a code snippet to round this out. Here I'm building a GloVe space for my embedding. The feature function gets kind of trivialized, because all I want to do is return the sequence of words that are the leaves, so I do exactly that; no featurization at that point, because the real action is inside fit_rnn. I decide on the training vocabulary, and I've decided to keep just the top 10,000 words by frequency, with all other words mapped to UNK; you can set that parameter however you want up here. Then I create a little embedding space and feed it in here, and there are lots of other parameters you can fiddle around with. Fit that model to the X, y pair that comes in. From there, it's just sst.experiment as usual, and again, remember to say vectorize=False for these models, so that it doesn't assume it's getting a bunch of count dictionaries; I'm not sure that would even work in this context. And in terms of hyperparameter exploration, you're seeing a glimpse of it here: there are a lot of values you could explore, and maybe the dark side of deep learning is that these things will matter a lot to the final performance of your model.
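Since the vocabulary and embedding setup is the fiddly part, here it is written out in plain code. This is a sketch: the $UNK convention and the random initialization for out-of-vocabulary words are my choices, and the course's RNN classes do the index look-ups for you once you hand them a vocab and an embedding:

```python
from collections import Counter
import numpy as np

def build_vocab_and_embedding(train_trees, glove, dim=50, max_size=10000):
    """Keep the top `max_size` training words by frequency, add an $UNK entry,
    and build an embedding matrix with one row per vocabulary item."""
    counts = Counter(w for tree in train_trees for w in tree.leaves())
    vocab = [w for w, _ in counts.most_common(max_size)] + ["$UNK"]
    embedding = np.array(
        [glove.get(w, np.random.uniform(-0.5, 0.5, dim)) for w in vocab])
    return vocab, embedding

def to_indices(leaves, vocab):
    """Map a sequence of words to vocabulary indices, sending unknowns to $UNK."""
    index = {w: i for i, w in enumerate(vocab)}
    unk = index["$UNK"]
    return [index.get(w, unk) for w in leaves]
```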
Let me do one more thing, and then why don't we just do some coding here in the room? And that's these tree-structured networks. It's a very similar idea to the RNN, except that instead of processing simple sequences, we're going to process the SST trees as they come. So I have that same example, "The Rock rules", and you have some hidden parameters in here that are forming the hidden representations here in orange. It's just that, instead of processing input-to-hidden and then hidden-to-hidden, you take the child nodes and process them together to form the parent's hidden representation, and you do that recursively until you reach the root of the tree, which has a hidden representation, and that's the basis for your classifier. From there, in my code, it's a softmax classifier as usual. The backpropagation is then a process of taking the errors and feeding them down, splitting the gradients apart, and feeding them down like that, and you can see in np_tree_nn exactly how that process works. For the combinations, is it something like matrix products, summations? How do you take the two and put them together? There are lots of options here. [LAUGHTER] That's a great question. My code is just concatenating the two child nodes and feeding them through a simple function with learned weights and a bias. I think that's a good baseline for the space; it's the first one that was explored and the first one everyone tends to set up. And I've mapped out three alternatives, all of them explored by Richard Socher. Matrix-vector, where you represent each word as both a vector and a matrix, and when you do the combination to form the parent, you cross the matrices and the vectors so that you get lots of multiplicative interactions. Very powerful; I think it was too powerful, in the sense that it had too many parameters relative to the data, and that's how Socher et al. motivated this tensor combination here, which takes the basic combination function and extends it with, essentially, a method for combining the concatenation of the two child nodes in every conceivable way, with a higher-order tensor sandwiched in the middle allowing for all those interactions, and then you essentially add that to a function that looks like the basic one. That was very powerful and successful, and when I was showing you examples from the SST project before, that was a model using this combination function. And then, just to round it out with a more recent development: in Tai et al., the nodes that do the combinations are LSTM cells, and the innovation there is that they gather state and gate information from the two child nodes separately, so they can be gated separately, before combining them into a single representation. I highly recommend all these papers if you're interested in this space; it's a nice progression of ideas. And the final thing you could say, which unfortunately we don't get to explore too much here, is that, recall, in the SST you actually get supervision not just at the root node, the way this model is implying, but also at all of the subconstituent nodes, and I have included a model implementation in our class repo that will use that. The guiding intuition is that I have all these points of supervision on shared parameters for this classifier, so I can gather them all together, get lots of information from all of them, and pass that down at each one of the subtrees.
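Going back to the basic combination function for a second, the forward pass is simple enough to write down directly. This is a conceptual sketch that ignores training entirely, with W standing for a dim-by-2*dim weight matrix; np_tree_nn in the course repo is the real implementation:

```python
import numpy as np

def interpret(node, lookup, W, b):
    """Recursively compute a node's hidden representation in a basic TreeNN:
    leaves are embedding look-ups; a branching node concatenates its two
    children's vectors and applies one learned affine map plus a tanh."""
    if isinstance(node, str):                      # a leaf: just look the word up
        return lookup[node]
    kids = [interpret(kid, lookup, W, b) for kid in node]
    if len(kids) == 1:                             # preterminal: pass through
        return kids[0]
    return np.tanh(W @ np.concatenate(kids) + b)   # assumes binarized SST trees

# The vector for the root, interpret(tree, lookup, W, b), is then the input to
# a softmax classifier, and training pushes the classifier's errors back down
# through these same combinations.
```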
So in terms of implementation, once you get your head around how all this information needs to be passed around in the tree, it's pretty straightforward, and then of course you can get really powerful models, coming from the fact that you're getting so much more information about what all the subconstituents mean in terms of sentiment. And then, just to round this out, this is a full picture of how you use sst.experiment with these tree models. I think the only special thing is that the tree models are unusual in assuming that their label is part of the training instance itself. So in the default, the regular tree_nn looks to the root node to find its label, as opposed to having it in a separate vector. That's why it's a little bit funny down here that mod.fit takes only X and not y; y is just there for assessment, and the model operates on X. The reason for that is that, in the context of subtree supervision, I can just gather all of those points of supervision from my training instance and not worry about how they align with some other vector. But I think that's the only gotcha. Other than that, you can use this framework for the full set of models I've introduced here. Are there questions? This is good; it gives us some time. Let me quickly run through what we're imagining for the homework, to get you oriented, and then you can start coding on your own. "Sentiment words alone" is just asking you to use a lexicon to create a filtered bag-of-words feature function, to see how much information about the class label is encoded in an off-the-shelf lexicon. What I'm asking you to do is design that feature function and then use the McNemar test to compare the two models, just to expose you to that interface and get you thinking about classifier comparison. I think that's pretty straightforward. "A more powerful vector-summing baseline" asks you to explore, in a little more detail, the distributed representation approach I showed you as the final phase of thinking about linear classifiers: you're seeing whether, with a model more powerful than a plain linear classifier, one that might uncover lots of interactions among the feature dimensions, you can do better at the problem. And the final one is your custom system, and custom here basically means anything goes. It's okay to download other people's code; of course, I have a huge bias for you taking code that we've provided and doing cool modifications to it to see whether they pay off, but we don't want to place too many rules on designing your original system, and that's what you'll enter into the bake-off. We're evaluating on the ternary class problem for all of this, so you should hill-climb on that. What you should probably do is a lot of testing just within your training data, periodically test against your dev set to get a more honest look at how you're actually generalizing to new data, and never look at the test set, of course. The test set, although it's distributed, is completely out of bounds until the bake-off starts, and then we'll actually run the evaluation, and I'll give you some code to guide you on exactly how to do that. But have in mind that you want a system that does well on the ternary class problem on the test set.
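Just to make the first question concrete, the kind of feature function it is asking about looks roughly like this, with a toy lexicon standing in for whichever real one you choose (this is an illustration, not the expected solution):

```python
from collections import Counter

# Stand-in lexicon; for the homework you would load one of the real sentiment
# lexicons discussed earlier in the unit.
SENTIMENT_LEXICON = {"good", "great", "bad", "terrible"}

def lexicon_unigrams_phi(tree):
    """Bag of words restricted to lexicon entries, so the classifier sees only
    the lexicon-covered vocabulary in each example."""
    return Counter(w.lower() for w in tree.leaves()
                   if w.lower() in SENTIMENT_LEXICON)
```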
The only other thing I wanted to note, and maybe it hasn't reloaded here, is that I just added a note at the bottom asking you to give an informal description of your system, to help the teaching team get oriented on what your code actually does and what your solution is like, so that we can classify the solutions and come to some kind of understanding of what worked and what didn't. Yeah? For these bake-offs, will we go over the best one in class, and why we think it's the best one, or dissect them? That's the plan. We hope to give you really rich reports on these. We're aiming to have the first one on Monday, and I can tell from the submissions that have come in that there will be interesting lessons we can reflect back to you about what worked and what didn't, and I'm hoping that continues. Yeah. Okay. I'm going to stop talking and let you all get coding, and we're here to help.