Lecture 1 | Natural Language Processing with Deep Learning

Reddit Comments

I just finished this class and it was fantastic. Manning and Socher do a great (and charming) job of explaining the intuition behind the various approaches to solving NLP problems with deep learning. However, I felt like a lot of the knowledge I gained beyond the intuition came from working on the homework assignments. I don't know if they are releasing those as well.

u/FragLegs, Apr 04 2017, 5 upvotes

what's new vs. '16?

u/Franck_Dernoncourt, Apr 04 2017, 3 upvotes

Are there subtitles available for download?

u/liubanghoudai24, Apr 04 2017, 1 upvote

Can't wait to watch all these. Thanks for sharing.

u/andrewm4894, Apr 29 2017, 1 upvote

Captions

[MUSIC] Stanford University. >> Okay everyone. We're ready. Okay, well, welcome to CS224N / Linguistics 284. This is kind of amazing. Thank you to everyone who's here that's involved and also the people who don't fit in here and the people who are seeing it online on SCPD. Yeah it's totally amazing the number of people who've signed up to do this class and so in some sense it seems like you don't need any advertisements for why the combination of natural language processing and deep learning is a good thing to learn about. But nonetheless today, this class is really going to give some of that advertisement, so I'm Christopher Manning. So what we're gonna do is I'm gonna start off by saying a bit of stuff about what natural language processing is and what deep learning is, and then after that we'll spend a few minutes on the course logistics. And a word from my co-instructor, Richard. And then, get through some more material on why is language understanding difficult, and then starting to do an intro to deep learning for NLP. So we've gotten off to a rocky start today, cause I guess we started about ten minutes late because of that fire alarm going off. Fortunately, there's actually not a lot of hard content in this first lecture. This first lecture is really to explain what an NLP class is and say some motivational content about how and why deep learning is changing the world. That's going to change immediately on the Thursday lecture because for the Thursday lecture we're gonna start with sort of vectors and derivatives and chain rules and all of that stuff. So you should get mentally prepared for that change of level between the two lectures. Okay, so first of all what is natural language processing? So natural language processing, that's the sort of computer scientist's name for the field. Essentially synonymous with computational linguistics which is sort of the linguist's name for the field.
And so it's in this intersection of computer science and linguistics and artificial intelligence. Where what we're trying to do is get computers to do clever things with human languages to be able to understand and express themselves in human languages the way that human beings do. So natural language processing counts as a part of artificial intelligence. And there are obviously other important parts of artificial intelligence, of doing computer vision, and robotics, and knowledge representation, reasoning and so on. But language has had a very special place in artificial intelligence, and that's because language has been this very distinctive property of human beings, and we think and go about the world largely in terms of language. So lots of creatures around the planet have pretty good vision systems, but human beings are alone for language. And when we think about how we express our ideas and go about doing things, language is largely our tool for thinking and our tool for communication. So it's been one of the key technologies that people have thought about in artificial intelligence and it's the one that we're going to look at today. So our goal is how can we get computers to process or understand human languages in order to perform tasks that are useful. So that could be things like making appointments, or buying things, or it could be more highfalutin goals of sort of, understanding the state of the world. And so this is a space in which there's starting to be a huge amount of commercial activity in various directions, some of it things like making appointments. A lot of it in the direction of question answering. So, luckily for people who do language, the arrival of mobile has just been super, super friendly in that the importance of language has gone way, way higher. And so now really all of the huge tech firms whether it's Siri, Google Assistant, Facebook and Cortana.
But what they're furiously doing is putting out products that use natural language to communicate with users. And that's an extremely compelling thing to do. It's extremely compelling on phones because phones have these dinky little keyboards that are really hard to type things on. And a lot of you guys are very fast at texting, I know that, but really a lot of those problems are much worse for a lot of other people. So it's a lot harder to put in Chinese characters than it is to put in English letters. It's a lot harder if you're elderly. It's a lot harder if you've got low levels of literacy. But then there are also new vistas opening up. So Amazon has had this amazing success with Alexa, which has really shown the utility of having devices that are just ambient in the environment, and that again you can communicate with by talking to them. As a quick shout-out for Apple, I mean, really, we do have Apple to thank for launching Siri. It was, essentially, Apple taking the bet on saying we can turn human language into consumer technology that really did set off this arms race every other company is now engaging in. Okay, I just sort of loosely said meaning. One of the things that we'll talk about more is meaning is a kind of a complex, hard thing and it's hard to know what it means to fully understand meaning. At any rate that's certainly a very tough goal which people refer to as AI-complete and it involves all forms of our understanding of the world. So a lot of the time when we say understand the meaning, we might be happy if we sort of half understood the meaning. And we'll talk about different ways that we can hope to do that. Okay, so one of the other things that we hope that you'll get in this class is sort of a bit of appreciation for human language and what its levels are and how it's processed. Now obviously we're not gonna do a huge amount of that if you really wanna learn a lot about that.
There are lots of classes that you can take in the linguistics department and learn much more about it. But I really hope you can at least sort of get a bit of a high level of understanding. So this is kind of the picture that people traditionally have given for levels of language. So at the beginning there's input. So input would commonly be speech. And then you're doing phonetic and phonological analysis to understand that speech. Though commonly it is also text. And then there's some processing that's done there which has sort of been a bit marginal from a linguistics point of view, OCR, working out the tokenization of the words. But then what we do is go through a series of processing steps where we work out complex words like incomprehensible, it has the in in front and the ible at the end. And that's sort of morphological analysis, the parts of words. And then we try and understand the structure of sentences, that's syntactic analysis. So if I have a sentence like 'I sat on the bench', that 'I' is the subject of the verb 'sat', and the 'on the bench' is the location. Then after that we attempt to do semantic understanding. And that semantic interpretation is working out the meaning of sentences. But simply knowing the meaning of the words of a sentence isn't sufficient to actually really understand human language. A lot is conveyed by the context in which language is used. And so that then leads into areas like pragmatics and discourse processing. So in this class, where we're gonna spend most of our time is in that middle piece of syntactic analysis and semantic interpretation. And that's sort of the bulk of our natural language processing class. We will say a little bit about what's right at the top left of this picture, speech signal analysis. And interestingly, that was actually the first place where deep learning really proved itself as super, super useful for tasks involving human language.
Okay, so applications of Natural Language Processing are now really spreading out thick and fast. And every day you're variously using applications of Natural Language Processing. And they vary on a spectrum. So they vary from very simple ones to much more complex ones. So at the low level, there are things like spell checking, or doing the kind of autocomplete on your phone. So that's a sort of a primitive language understanding task. Variously, when you're doing web searches, your search engine is considering synonyms, and things like that for you. And, well, that's also a language understanding task. But what we are gonna be more interested in is trying to push our language understanding computers up to more complex tasks. So for some of the next level up kind of tasks, we're actually gonna want to have computers look at text information, be it websites, newspapers or whatever. And get the information out of it, to actually understand the text well enough that they know what it's talking about to at least some extent. And so that could be things like extracting particular kinds of information, like products and their prices or people and what jobs they have and things like that. Or it could be doing other related tasks to understanding the document, such as working out the reading level or intended audience of the document. Or whether this tweet is saying something positive or negative about this person, company, band or whatever. And then going even a higher level than that, what we'd like our computers to be able to do is complete whole level language understanding tasks. And some of the prominent tasks of that kind that we're going to talk about are machine translation, going from one human language to another human language. Building spoken dialogue systems, so you can chat to a computer and have a natural conversation, just as you do with human beings.
Or having computers that can actually exploit the knowledge of the world that's available on things like Wikipedia and other sources. And so it could actually just intelligently answer questions for you, like a know-everything human being could. Okay, and we're starting to see a lot of those things actually being used regularly in industry. So every time you're doing a search, in little places, there are bits of natural language processing and natural language understanding happening. So if you're putting in forms of words with endings, your search engine's considering taking them off. If there are spelling errors, they're being corrected. Synonyms are being considered, and things like that. Similarly, when you're being matched for advertisements. But what's really exciting is that we're now starting to see much bigger applications of natural language processing being commercially successful. So in the last few years, there's just been amazing, amazing advances in machine translation that I'll come back to later. There have been amazing advances in speech recognition so that we just now get hugely good performance in speech recognition even on our cell phones. Products like sentiment analysis have become hugely commercially important, right? It depends on your favorite industries but there are lots of Wall Street firms that every hour of the day are scanning news articles looking for sentiment about companies to make buy and sell decisions. And just recently, really over the last 12 months, there's been this huge growth of interest in how to build chatbots and dialog agents for all sorts of interface tasks. And that sort of seems like it's growing to become a huge new industry. Okay, see I'm getting behind already. So in just a couple of minutes, I want to say the corresponding things about deep learning. But before getting into that, let me just say a minute about what's special about human language.
Maybe we'll come back to this, but I think it's interesting to have a sense of this right at the beginning. So there's an important difference between language and most other kinds of things that people think of when they do signal processing and data mining and all of those kinds of things. So for most things, there's just sort of data that's either the world out there. It has some kind of, pick up some visual system for it. Or someone's sort of buying products at the local Safeway. And then someone else is picking up the sales log and saying, let me analyze this and see what I can find, right? So it's just sort of all this random data and then someone's trying to make sense of it. So fundamentally, human language isn't like that. Human language isn't just sort of a massive data exhaust that you're trying to process into something useful. Human language, almost all of it is that there's some human being who actually had some information they wanted to communicate. And they constructed a message to communicate that information to other human beings. So it's actually a deliberate form of sending a particular message to other people. Okay, and an amazing fact about human language is it's this very complex system that somehow two, three, four year old kids amazingly can start to pick it up and use it. So there's something good going on there. Another interesting property of language is that language is actually what you could variously call a discrete, symbolic, or categorical signaling system. So we have words for concepts like rocket or violin. And basically, we're communicating with other people via symbols. There are some tiny exceptions for expressive signaling, so you can distinguish saying, I love it versus I LOVE it. And that sounds stronger. But 99% of the time it's using these symbols to communicate meaning. And presumably, that came about in a sort of EE information theory sense.
Because by having symbols, they're very reliable units that can be signaled reliably over a distance. And so that's an important thing to be aware of, right? Language is symbols. So symbols aren't just some invention of logic or classical AI. But then, when we move beyond that, there's actually something interesting going on. So when human beings communicate with language, although what they're wanting to communicate involves symbols, the way they communicate those symbols is using a continuous substrate. And a really interesting thing about language is you can convey exactly the same message by using different continuous substrates. So commonly, we use voice and so there are audio waves. You can put stuff on a piece of paper and then you have a vision problem. You can also use sign language to communicate. And that's a different kind of continuous substrate. So all of those can be used. But there's sort of a symbol underlying all of those different encodings. Okay, so the picture we have is that the communication medium is continuous. Human languages are a symbol system. And then the interesting part is what happens after that. So the dominant idea in most of the history of philosophy and science and artificial intelligence was to sort of project the symbol system of language into our brains. And think of brains as symbolic processors. But that doesn't actually seem to have any basis in what brains are like. Everything that we know about brains is that they're completely continuous systems as well. And so the interesting idea that's been emerging out of this work in deep learning is to say, no, what we should be doing is also thinking of our brains as having continuous patterns of activation. And so then the picture we have is that we're going from continuous to symbolic, back to continuous every time that we use language. So that's interesting.
It also points out one of the problems of doing language understanding that we'll come back to a lot of times. So in languages we have huge vocabularies. So languages have tens of thousands of words minimum. And really, languages like English with a huge scientific vocabulary, have hundreds of thousands of words in them. It depends how you count. If you start counting up all of the morphological forms, you can argue some languages have an infinite number of words cuz they have productive morphology. But however you count, it means we've got this huge problem of sparsity and that's one of the big problems that we're gonna have to deal with. Okay, now I'll change gears and say a little bit of an intro to deep learning. So deep learning has been this area that has erupted over the sort of this decade. And I mean, it's just been enormously, enormously exciting how deep learning has succeeded and how it has expanded. So really, at the moment it seems like every month you see in the tech news that there's just amazing new improvements that are coming out from deep learning. So one month it's super human computer vision systems, the next month it's machine translation that's vastly improved. The month after that people are working out how to get computers to produce their own artistry that's incredibly realistic. Then the month after that, people are producing new text-to-speech systems that sound amazingly lifelike. I mean, there's just been this sort of huge dynamic of progress. So what is underlying all of that? So, well, as a starting point, deep learning, it's part of machine learning. So in general, it's this idea of how can we get computers to learn stuff automatically, rather than just us having to tell them things and coding by hand in the kind of traditional write computer program to tell it what you want it to do. But deep learning is also profoundly different to the vast majority of what happened in machine learning in the 80s, 90s, and 00s. 
And this central difference is that for most of traditional machine learning, if I call it that. So this is all of the stuff like decision trees, logistic regressions, naive Bayes, support vector machines, and any of those sort of things. Essentially the way that we did things was, what we did was have a human being who looked carefully at a particular problem and worked out what was important in that problem. And then designed features that would be useful features for handling the problem that they would then encode by hand. Normally by writing little bits of Python code or something like that to recognize those features. They're probably a little bit small to read, but over on the right-hand side, these are showing some features for an entity recognition system. Finding person names, company names, and so on in text. And this is just the kind of system I've written myself. So, well, if you want to know whether a word is a company, you'd wanna look whether it was capitalized, so you have a feature like that. It turns out that looking at the words to the left and right would be useful to have features for that. It turns out that looking at substrings of words is useful cause there are kind of common patterns of letter sequences that indicate names of people versus names of companies. So you put in features for substrings. If you see hyphens and things, that's an indicator of some things. You put in a feature for that. So you keep on putting in features and commonly these kind of systems would end up with millions of hand-designed features. And that was essentially how Google search was done until about 2015 as well, right? They liked the word signal rather than feature. But the way you improved Google search was every month some bunch of engineers came up with some new signal. That they could show with an experiment that if you added in these extra features, Google search got a bit better.
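[Editor's note] The hand-designed features described above (capitalization, the words to the left and right, substrings, hyphens) can be sketched in a few lines of Python. This is only an illustration of the general idea, not the actual system shown on the slide: the function name `token_features`, the feature names, and the example sentence are all made up for this note.

```python
def token_features(tokens, i):
    """Hand-designed features for the token at position i (illustrative only)."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<START>"
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>"
    feats = {
        "is_capitalized": word[:1].isupper(),  # capitalization cue
        "has_hyphen": "-" in word,             # hyphen cue
        "prev_word=" + prev_word: True,        # word to the left
        "next_word=" + next_word: True,        # word to the right
    }
    # Character substrings (here 3-grams, with boundary markers) can hint at
    # whether a name looks like a person or a company.
    padded = "<" + word.lower() + ">"
    for j in range(len(padded) - 2):
        feats["substr=" + padded[j:j + 3]] = True
    return feats


# Hypothetical example sentence; position 2 is the token "Hewlett-Packard".
example = "Shares of Hewlett-Packard rose".split()
features = token_features(example, 2)
```

Every new feature is another line that a person has to think of and write, which is how such systems grow to very large feature sets.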
And [INAUDIBLE] a degree and that would get thrown in, and things would get a bit better. But the thing to think about is, well, this was advertised as machine learning, but what was the machine actually learning? It turns out that the machine was learning almost nothing. So the human being was learning a lot about the problem, right? They were looking at the problem hard, doing lots of data analysis, developing theories, and learning a lot about what was important for this property. What was the machine doing? It turns out that the only thing the machine was doing was numeric optimization. So once you had all these signals, what you're then going to be doing was building a linear classifier. Which meant that you were putting a parameter weight in front of each feature. And the machine learning system's job was to adjust those numbers so as to optimize performance. And that's actually something that computers are really good at. Computers are really good at doing numeric optimization and it's something that human beings are actually less good at. Cuz humans, if you say, here are 100 features, put a real number in front of each one to maximize performance. Well, they've got sort of a vague idea but they certainly can't do that as well as a computer can. So that was useful but is doing numeric optimization, is that what machine learning means? It doesn't seem like it should be. Okay, so what we found was that in practice machine learning was sort of 90% human beings working out how to describe data and work out important features. And only sort of 10% the computer running this learning numerical optimization algorithm. Okay, so how does that differ with deep learning? So deep learning is part of this field that's called representation learning. And the idea of representation learning is to say, we can just feed to our computers raw signals from the world, whether that's visual signals or language signals.
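[Editor's note] The "numeric optimization" part of the passage above, putting a weight in front of each hand-built feature and letting the machine adjust those numbers, can be sketched as a small logistic regression trained by gradient descent. This is a minimal sketch under stated assumptions: the toy data, the two feature columns, the learning rate, and the function names are all invented for illustration, and real systems of the kind described would use far more features and a library optimizer.

```python
import math


def train_linear_classifier(X, y, lr=0.5, epochs=200):
    """Fit one weight per feature (plus a bias) by gradient descent on log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - yi                    # gradient of log loss w.r.t. z
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the average gradient.
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Classify as 1 when the weighted sum of features is positive."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z > 0 else 0


# Toy data (made up): columns are [is_capitalized, has_hyphen]; label 1 = "company".
X = [[1, 1], [1, 0], [0, 0], [0, 1], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]
w, b = train_linear_classifier(X, y)
predictions = [predict(w, b, x) for x in X]
```

The human chose the two feature columns; the only thing the loop learns is the three numbers `w[0]`, `w[1]`, and `b`.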
And then the computer can automatically, by itself, come up with good intermediate representations that will allow it to do tasks well. So in some sense, it's gonna be inventing its own features in the same way that in the past the human being was inventing the features. So precisely deep learning, the real meaning of the word deep learning is the argument that you could actually have multiple layers of learned representations. And that you'd be able to outperform other methods of learning by having multiple layers of learned representations. That was where the term deep learning came from. Nowadays, half the time, deep learning just means you're using neural networks. And the other half of the time it means there's some tech reporter writing a story and it's vaguely got to do with intelligent computers and all other bets are off. Okay, [LAUGH] yeah. So with the kind of coincidence where sort of deep learning really means neural networks a lot of the time, we're gonna be part of that. So what we're gonna focus on in this class is different kinds of neural networks. So at the moment, they're clearly the dominant family of ways in which people have reached success in doing deep learning. But it's not the only possible way that you could do it; people have certainly looked at trying to use various other kinds of probabilistic models and other things in deep architectures. And I think there may well be more of that work in the future. What are these neural networks that we are talking about? That's something we'll come back to and talk a lot about both on Thursday and next week. I mean you'll notice a lot of this neural terminology. I mean in some sense if you're kind of coming from a background of statistics or something like that, you could sort of say neural networks, they're kind of nothing really more than stacked logistic regressions or perhaps more generally kinda stacked generalized linear models. And in some sense that's true.
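[Editor's note] The "stacked logistic regressions" view of a neural network can be made concrete with a tiny forward pass. In this sketch every unit is an ordinary logistic regression, and the outputs of one layer become the inputs of the next. The weights below are hand-set for illustration (an assumption of this note, not something from the lecture, and not learned); they are chosen so the two-layer stack computes XOR, which a single logistic regression cannot represent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_layer(inputs, weights, biases):
    """One layer = several logistic regression units applied to the same inputs."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_w, inputs)) + b)
        for unit_w, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Feed the outputs of each layer in as the inputs of the next one."""
    activation = x
    for weights, biases in layers:
        activation = logistic_layer(activation, weights, biases)
    return activation


# Hypothetical hand-set weights: 2 inputs -> 2 hidden units -> 1 output unit.
layers = [
    ([[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0]),  # hidden units act like OR and NAND
    ([[20.0, 20.0]], [-30.0]),                        # output unit acts like AND
]
outputs = {
    (a, b): round(forward([a, b], layers)[0])
    for a in (0, 1)
    for b in (0, 1)
}
```

In a trained network the hidden-layer weights would be found by optimization rather than written by hand, which is the sense in which the intermediate representation is learned.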
There are some connections to neuroscience in some cases, but that's not a big focus of this class at all. But on the other hand, there's something very qualitatively different, that by the kind of architectures that people are building now for these complex stacking of neural unit architectures, you end up with a behavior and a way of thinking and a way of doing things that's just hugely different, than anything that was coming before in earlier statistics. We're not really gonna take a historical approach, we're gonna concentrate on methods that work well right now. If you'd like to read a long history of deep learning, though I'll warn you it's a pretty dry and boring history, there's this very long arxiv paper by Jürgen Schmidhuber that you could look at. Okay, so why is deep learning exciting? So in general our manually designed features tend to be overspecified, incomplete, take a long time to design and validate, and only get you to a certain level of performance at the end of the day. Whereas the learned features are easy to adapt, fast to train, and they can keep on learning so that they get to a better level of performance than we've been able to achieve previously. So, deep learning ends up providing this sort of very flexible, almost universal learning framework which is just great for representing all kinds of information. Linguistic information but also world information or visual information. It can be used in both supervised fashions and unsupervised fashions. The real reason why deep learning is exciting to most people is it has been working. So starting from approximately 2010, there were initial successes where deep learning methods were shown to work far better than any of the traditional machine learning methods that have been used for the last 30 years.
But going even beyond that, what has just been totally stunning is over the last six or seven years, there's just been this amazing ramp in which deep learning methods have been keeping on being improved and getting better at just an amazing speed. Which is actually sort of being, maybe I'm biased, but in the length of my lifetime, I'd actually just say it's unprecedented, in terms of seeing a field that has been progressing quite so quickly in its ability to be sort of rolling out better methods of doing things, month on month. And that's why you're sort of seeing all of this huge industry excitement, new products, and you're all here today. So why has deep learning succeeded so brilliantly? And I mean this is actually a slightly more subtle and in some sense not quite so uplifting a tale. Because when you look at a lot of the key techniques that we use for deep learning were actually invented in the 80s or 90s. They're not new. We're using a lot of stuff that was done in the 80s and 90s. And somehow, they didn't really take off then. So what is the difference? Well it turns out that actually some of the difference, actually maybe quite a lot of the difference, is just that technological advances have happened that make this all possible. So we now have vastly greater amounts of data available because of our online society where just about everything is available as data. And having vast amounts of data really favors deep learning models. In the 80s and 90s, there sort of wasn't really enough compute power to do deep learning well. So having sort of several more decades of compute power has just made it that we can now build systems that work. I mean in particular there's been this amazing confluence that deep learning has proven to be just super well suited to the kind of parallel vector processing that's available now for very little money in GPUs. 
So there's been this sort of marriage between deep learning and GPUs, which has enabled a lot of stuff to have happened. So that's actually quite a lot of what's going on. But it's not the only thing that's going on and it's not the thing that's leading to this sort of things keeping on getting better and better month by month. I mean, people have also come up with better ways of learning intermediate representations. They've come up with much better ways of doing end-to-end joint system learning. They've come up with much better ways of transferring information between domains and between contexts and things. So there are also a lot of new algorithms and algorithmic advances and they're sort of in some sense the more exciting stuff that we're gonna focus on for more of the time. Okay, so really the first big breakthrough in deep learning was in speech recognition. It wasn't as widely heralded as the second big breakthrough in deep learning. But this was really the big one that started. At the University of Toronto, George Dahl working with Geoff Hinton started showing on tiny datasets, that they could do exciting things with deep neural networks for speech recognition. So George Dahl then went off to Microsoft and then fairly shortly after that, another student from Toronto went to Google and they started building big speech recognition systems that use deep learning networks. And speech recognition's a problem that's been worked on for decades by hundreds of people. And there are big companies. And there was this sort of fairly standardized technology of using Gaussian mixture models for the acoustic analysis and hidden Markov models and blah blah blah. Which people have been honing for decades trying to improve a few percent a year. And what they were able to show was by changing from that to using deep learning models for doing speech recognition, that they were immediately able to get just these enormous decreases in word error rate. 
About a 30% decrease in word error rate. Then the second huge example of the success of deep learning, which ended up being a much bigger thing in terms of everybody noticing it, was in the ImageNet computer vision competition. So in 2012 again students of Geoff Hinton at Toronto set about building a computer vision system for doing the ImageNet task of classifying objects into categories. And that was again a task that had been run for several years. And performance seemed fairly stalled with traditional computer vision methods, and by running deep neural networks on GPUs they were able to get an over one-third error reduction in one fell swoop. And that progress has continued through the years, but we won't say a lot on that here. Okay, that's taken me a fair way. So let's stop for a moment and do the logistics, and I'll say more about deep learning and NLP. Okay, so this class is gonna have two instructors. I'm Chris Manning and I'm Stanford faculty, then the other one is Richard, who's the chief scientist of Salesforce, and so I'll let him say a minute or two hello. >> Hi there, great to be here. I guess, just a brief little bit about myself. In 2014, I graduated, I got my PhD here with Chris and Andrew Ng in deep learning for NLP. And then almost became a professor, but then started a little company, built an ad platform, did some research. And then earlier last year, we got acquired by Salesforce, which is how I ended up there. I've been teaching CS224D the last two years and super excited to merge the two classes. >> Okay. >> I think next week, I'll do the two lectures, so you'll see a lot of me. >> [LAUGH] >> I'll do all the boring equations. >> [LAUGH] Okay, and then TAs, we've got many really wonderful, competent, great TAs for this class.
Yeah, so normally I go through all the TAs, but there are sort of so many, both of them and you, that maybe I won't go through them all, but maybe they could all just sort of stand up for a minute if you're a TA in the class. They're all in that corner, okay, [LAUGH] and they're clustered. [LAUGH] Okay, right, yeah, so at this point, I mean, apologies about the room capacity. So the fact of the matter is if this class is being kind of videoed and broadcast, this is sort of the largest SCPD classroom that they record in. So, there's no real choice for this, this is the same reason that this is where 221 is, and this is where 229 is. But it's a shame that there aren't enough seats for everybody, sorry about that. It will be available shortly after each class, also as a video. In general for the other information, look at the website, but there's a couple things that I do just wanna say a little bit about, prerequisites and work to do. So, when it comes down to it, these are the things that you sort of really need to know. And we'll expect you to know, and if you don't know, you should start working out what you don't know and what to do about it very quickly. So the first one is we're gonna do the assignments in Python, so proficiency in Python, there's a tutorial on the website, not hard to learn if you do something else. Essentially, Python has just become the lingua franca of nearly all the deep learning toolkits, so that seems the thing to use. We're gonna do a lot of stuff with calculus and vectors and matrices, so multivariate calculus, linear algebra. It'll start turning up on Thursday and even more next week. Sort of basic probability and statistics, you don't need to know anything fancy about martingales or something, I don't either. But you should know the elements of that stuff. And then we're gonna assume you know some fundamentals of machine learning. So if you've done 221 or 229, that's fine. 
Again, you don't need to know all of that content, but we sort of assume that you've seen loss functions, and you have some idea about how you do optimization with gradient descent and things like that. Okay, so in terms of what we hope to teach, the first thing is an understanding of and ability to use effective modern methods for deep learning. So we'll be covering all the basics, but especially an emphasis on the main methods that are being used in NLP, which is things like recurrent networks, attention, and things like that. Some big picture understanding of human languages and the difficulties in understanding and producing them. And then the third one is essentially the intersection of those two things. So the ability to build systems for important NLP problems. And you guys will be building some of those for the various assignments. So in terms of the work to be done, this is it. So there's gonna be three assignments. There's gonna be a midterm exam. And then at the end, there's this bigger thing where you sort of have a choice between either you can come up with your own exciting world shattering final project and propose it to us. And we gotta make sure every final project has a mentor, which can either be Richard or me, one of the TAs, or someone else who knows stuff about deep learning. Or else, we can give you an exciting project, and so there'll be sort of a default final project, otherwise known as Assignment 4. There's gonna be a final poster session. So every team for the final project, you're gonna have teams up to three for the final project, has to be at the final poster session. Now we thought about having it in our official exam slot, but that was on Friday afternoon, and so we decided people might not like that. So we're gonna have it in the Tuesday early afternoon session, which is when the language class exams are done. So no offense to languages, but we're assuming that none of you are doing first year intensive language classes. 
Or at least, you better find a teammate who isn't. >> [LAUGH] >> Okay, yeah, so we've got some late days. Note that each assignment has to be handed in within three days so we can grade it. Yeah, okay, yeah, so Assignment 1, we're gonna hand out on Thursday, so for that assignment, it's gonna be pure Python, except for using the NumPy library, which is kinda the basic vector and matrices library. And people are gonna do things from scratch, because I think it's a really important educational skill that you've actually done things and gotten it to work from scratch. And you really know for yourself what the derivatives are because you've calculated them. And because you've implemented them, and you've found that you can calculate derivatives and implement them, and the thing does actually learn and work. If you've never done this, the whole thing's gonna seem like black magic ever after. So it's really important to actually work through it by yourself. But nevertheless, one of the things that's been transforming deep learning is that there are now these very good software packages, which actually make it crazily easy to build deep learning models. You can literally take one of these libraries and sort of write 60 lines of Python, and you can be training a state-of-the-art deep learning system that will work super well, providing you've got the data to train it on. And that's sort of actually been an amazing development over the last year or two. And so for Assignments 2 and 3, we're gonna be doing that. In particular, we're gonna be using TensorFlow, which is the Google deep learning library, which is sort of, well, Google's very close to us. But it's also very well engineered and has sort of taken off as the most used library now. But there really are a whole bunch of other good libraries for deep learning. And I mention some of them below. Okay, do people have any questions on class organization? Or anything else up until now, or do I just power on? 
>> [INAUDIBLE] >> Yeah. Okay, so, and something I'm gonna do is repeat all questions, so they'll actually work on the video. So, the question is, how are our assignments gonna be submitted? They're gonna be submitted electronically online, instructions will be on the first assignment. But yeah, everything has to be electronic, what we use is Gradescope for the grading. For written stuff, if you wanna hand write it, you have to scan it for yourself, and submit it online. Any other questions? >> [INAUDIBLE] >> Yeah. So, the question was, are the slides on the website? Yes, they are. The slides were on the website before the class began, and we're gonna try and keep that up all quarter. So, you should just be able to find them, cs224n.stanford.edu. Any other questions, yeah? Yeah, so that was on the logistics, if you're doing Assignment 4. It's partly different, and partly the same, so if you're doing the default Assignment 4, and we'll talk all about final projects in a couple of weeks, you don't have to write a final project proposal, or talk to a mentor, because we've designed the project for you as a starting off point of the project. But on the other hand, otherwise, it's the same. So, it's gonna be an open ended project, in which there are lots of things that you can try to make the system better, and we want you to try, and we want you to be able to report on what are the different exciting things you've tried, whether they did, or didn't make your system better. And so, we will be expecting people doing Assignment 4 to also write up and present a poster on what they've done. Any other questions? Yes, so their question was on whether we're using Piazza. Yes, we're using Piazza for communication. So, we've already set up the Piazza, and we attempted to enroll all the enrolled students, so hopefully if you're an enrolled student, there's somewhere in your junk mailbox, or in one of those places, a copy of a Piazza announcement. Any other questions? 
Okay, 20 some minutes to go. I'll power ahead. Very quickly, why is NLP hard? I think most people, maybe especially computer scientists, going into this just don't understand why NLP is hard. It's just a sequence of words, and they've been dealing with programming languages. And you're just gonna read the sequence of words. Why is this hard? It turns out it's hard for a bunch of reasons, because human languages aren't like programming languages. So, human languages are just all ambiguous. Programming languages are constructed to be unambiguous, that's why they have rules like: an 'else' goes with the nearest 'if', and you have to get the indentation right in Python. Human languages aren't like that, so in human languages, when there's an 'else', you just interpret it with whatever 'if' makes most sense to the hearer. And when we do reference in a programming language, we use variable names like x and y, and this variable. Whereas, in human languages, we say things like this and that and she, and you're just meant to be able to figure out from context who's being talked about. So that's a big problem, but it's perhaps not even the biggest problem. The biggest problem is that humans use language as an efficient communication system. And the way they do that is by not saying most things, right? When you write a program, you say everything that's needed to get it to run. Whereas in a human language, you leave out most of the program, because you think that your listener will be able to work out which code should be there, right? So, it's sorta more like a code snippet on StackOverflow, and the listener is meant to be able to fill in the rest of the program. So, that's how human language gets its efficiency. We kinda actually communicate very fast by human language, right? The rate at which we can speak, it's not 5G communication speeds, right? It's a slow communication channel. But the reason why it works efficiently is we can say minimal messages. 
And our listener fills in all the rest with their world knowledge, common sense knowledge, and contextual knowledge of the situation. And that's the biggest reason why natural language is hard. So, as sort of a profound version of why natural language is hard: I really like this XKCD cartoon, which you definitely can't read, and I can barely read on the computer in front of me. >> [LAUGH] >> But I think if you think about it, it says actually a lot about why natural language understanding is hard. So, there are two women speaking to each other. One says, 'anyway, I could care less,' and the other one says, 'I think you mean you couldn't care less, saying you could care less implies you care to some extent,' and the other one says, 'I don't know,' and then continues. We're these unbelievably complicated beings drifting through a void, trying in vain to connect with one another by blindly flinging words out into the darkness. Every choice of phrasing and spelling and tone and timing carries countless signals and contexts and subtexts and more. And every listener interprets these signals in their own way. Language isn't a formal system, language is glorious chaos. You can never know for sure what any words will mean to anyone. All you can do is try to get better at guessing how your words affect people. So, you have a chance of finding the ones that will make them feel something like you want them to feel. Everything else is pointless. I assume you're giving me tips on how you interpret words, because you want me to feel less alone. If so, then thank you, that means a lot. But if you're just running my sentences past some mental checklist, so you can show off how well you know it, then I could care less. >> [LAUGH] >> And I think if you reflect on this XKCD comic, there's actually a lot of profound content there as to what human language understanding is like, and what the difficulties of it are. 
But that's probably a bit hard to do in detail, so I'm just gonna show you some simple examples for a minute. You get lots of ambiguities, including funny ambiguities, in natural language. So, here are a couple of, here's one of my favorites that came out recently from TIME magazine. The Pope's baby steps on gays, no, that's not how you're meant to interpret this. You're meant to interpret this as the Pope's baby steps on gays. >> [LAUGH] >> Okay. So a question, I mean, why do you get those two interpretations? What is it about human language, and English here, that allows you to have these two interpretations? What are the different things going on? Is anyone game to give an explanation of how we... Okay, yeah, right. I'll repeat the explanation as I go. You started off with saying it was idiomatic, and in some sense, baby steps is sort of a metaphor, an idiom where baby steps is meaning little steps like a baby would take, but I mean, before you even get to that, you can kind of just think a large part of this is just a structural ambiguity, which then governs the rest of it. So, one choice is that you have this noun phrase of the Pope's baby, and then you start interpreting it as a real baby. And then steps is being interpreted as a verb. So, something we find in a lot of languages, including English, is the same word can have fundamentally different roles. Here, in the verbal interpretation, steps would be being used as a verb. But the other reading is, as you said, it's a noun compound, so you can put nouns together, and make noun compounds very freely in English. Computer people do it all the time, right? As soon as you've got something like disk drive enclosure, or network interface hub, or something like that, you're just nailing nouns together to make big nouns. So, you can put together baby and steps as two nouns, and make baby steps as a noun phrase. And then you can make the Pope's baby steps as a larger noun phrase. 
And then you're getting this very different interpretation. But simultaneously, at the same time, you're also changing the meaning of baby. So in one case, the baby was this metaphorical baby, and then in the other one, perhaps counterfactually, it's a literal baby. Let's do at least one more of those. Here's another good fun one. Boy paralyzed after tumor fights back to gain black belt. >> [LAUGH] >> Which is, again, not how you're meant to read it. You're meant to read it as boy, paralyzed after tumor, fights back to gain black belt. So, how could we characterize the ambiguity in that one? [LAUGH] So, someone suggested missing punctuation, and, to some extent, that's true. And to some extent, you can use commas to try and make readings clearer in some cases. But there are lots of places where there are ambiguities in language, where it's just not usual or standard to put in punctuation to disambiguate. And indeed, if you're the kind of computer scientist who feels like you want to start putting matching parentheses around pieces of human language to make the unclear interpretation much clearer, you're not then a typical language user anymore. [LAUGH] >> Okay, anyone else gonna have a go, yeah? Yeah, so, this is sort of, the ambiguities are in the syntax of the sentence. So, you have this 'paralyzed' that could either be the main verb of the sentence, so the boy is paralyzed, and then all of 'after tumor fights back to gain black belt' is then this sort of subordinate clause saying when it happened. And so then the 'tumor' is the subject of 'fights back', or you can have this alternative where 'paralyzed' can also be what's called a passive participle. So, it's introducing a participial phrase of 'paralyzed after tumor'. And so that can then be a modifier of the boy in the same way an adjective can, young boy fights back to gain black belt. It could be boy paralyzed after tumor fights back to gain black belt. 
And then it's the boy that's the subject of fights. Okay, I have on this slide a couple more examples, but I think I won't go through them in detail, since I'm sort of behind as things are going. Okay, so what I wanted to get into a little bit for the last bit of class until my time runs out is to introduce this idea of deep learning and NLP. And so, I mean essentially, this is combining the two things that we've been talking about so far, deep learning and NLP. So, we're going to use the ideas of deep learning, neural networks, representation learning, and we're going to apply them to problems in language understanding, natural language processing. And so, in the last couple of years especially, this is just an area that's sorta really starting to take off, and just for the rest of today's class we'll say a little bit about some of the stuff that's happening, at a very high level, and that'll sort of prepare for Thursday, starting to dive right into the specifics. And so there are different classifications you can look at. So on the one hand, deep learning is being applied to lots of different levels of language, things like speech, words, syntax, semantics. It's been applied to lots of different sorts of tools, algorithms that we use for natural language processing. So, that's things like labeling words for part-of-speech, finding person and organization names, or coming up with syntactic structures of sentences. And then it's been applied to lots of language applications that put a lot of this together. So things that I've mentioned before, like machine translation, sentiment analysis, dialogue agents. And one of the really, really interesting things is that deep learning models have been giving a very unifying method of using the same tools and technologies to understand a lot of these problems. So yes, there are some specifics of different problems. 
But something that's been quite stunning in the development of deep learning is that there's actually been a very small toolbox of key techniques, which have turned out to be just vastly applicable with enormous accuracy to just many, many problems. Which actually includes not only many, many language problems, but also most of the rest of what happens in deep learning, whether it's looking at vision problems, or applying deep learning to any other kind of signal analysis, knowledge representation, or anything, you see these few key tools being used to solve all the problems. And what is somewhat embarrassing for human beings is that typically, they're sort of working super well, much better than the techniques that human beings had previously slaved on for decades developing, without very much customization for different tasks. Okay, so deep learning and language, it all starts off with word meaning, and so this is a very central idea we're gonna develop starting off with the second class. So, what we're gonna do with words is say we're going to represent a word, in particular we're going to represent the meaning of the word, as a vector of real numbers. So here's my vector for the word expect. And so I made that, whatever it is, an 8-dimensional vector, I think, since that was good for my slide. But really, we don't much use vectors that small. So minimally, we might use something like 25-dimensional vectors. Commonly, we might be using something like 300-dimensional vectors. And if we're really going to town because we wanna have the best ever system doing something, we might be using a 1000-dimensional vector or something like that. So when we have vectors for words, that means we're placing words in a high-dimensional vector space. And what we find out is, when we have these methods for learning word vectors from deep learning and place words into these high-dimensional vector spaces, these act as wonderful semantic spaces. 
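To make the idea of words as vectors of real numbers concrete, here is a minimal sketch. The 8-dimensional values below are invented purely for illustration (real vectors are learned from data), and cosine similarity is assumed as the measure of closeness; the lecture does not name a specific measure at this point.

```python
import numpy as np

# Hypothetical 8-dimensional word vectors; the numbers are made up for
# illustration. Learned vectors typically have 25 to 1000 dimensions.
vectors = {
    "expect": np.array([0.3, 0.8, -0.2, -0.1, 0.1, -0.5, 0.4, 0.3]),
    "think":  np.array([0.2, 0.7, -0.1, -0.2, 0.2, -0.4, 0.5, 0.2]),
    "frog":   np.array([-0.6, 0.1, 0.7, 0.5, -0.3, 0.2, -0.1, -0.4]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word, vectors):
    """Return the other word whose vector is most similar to `word`."""
    query = vectors[word]
    candidates = [(w, cosine_similarity(query, v))
                  for w, v in vectors.items() if w != word]
    return max(candidates, key=lambda pair: pair[1])[0]

print(nearest("expect", vectors))
```

With these toy values, "expect" lands closer to "think" than to "frog", which is the kind of clustering by meaning that learned word vectors show at scale.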
So, words with similar meanings will cluster together in the vector space, but actually more than that. We'll find out that there are directions in the vector space that actually tell you about components of meaning. So, one of the problems of human beings is that they're not very good at looking at high-dimensional spaces. So, for the human beings, we always have to project down onto two or three dimensions. And so, in the background, you can see a little bit of a word cloud of a 2D projection of a word vector space, which you can't read at all. But we could sort of start to zoom in on it. And then you get something that's just about readable. So in one part of the space, this is where country words are clustering. And in another part of the space, this is where you're seeing verbs clustering. And you're seeing kind of it's grouping together verbs that mean most similarly. So 'come' and 'go' are very similar, 'say' and 'think' are similar, 'think' and 'expect' are similar. 'Expecting' and 'thinking' are actually similar to 'seeing things' a lot of the time, because people often use see as an analogy for think. Yes? Okay, so the question is, what do the axes in these vector spaces mean? And, in some sense, the glib answer is nothing. So when we learn these vector spaces, well actually we have these 300-D vectors. And they have these axes corresponding to those vectors. And often in practice, we do sort of look at some of those elements along the axes and see if we can interpret them because it's easy to do. But really, there's no particular reason to think that elements of meaning should follow those vector lines. They could be at any other angle in the vector space, and so they don't necessarily mean anything. When we wanna do a 2D projection like this, what we're then using is some method to try and most faithfully get out some of the main meaning from the high dimensional vector space so we can show it to you. 
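One common way to get such a 2D projection is principal components analysis. The sketch below is a minimal NumPy version computed via the singular value decomposition; the random 300-dimensional matrix is only a stand-in for learned word vectors, and `pca_project` is a name chosen here for illustration.

```python
import numpy as np

def pca_project(X, k=2):
    """Project the rows of X onto their top-k principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing
    # singular value (that is, by decreasing explained variance).
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(50, 300))  # stand-in for 50 learned vectors
points_2d = pca_project(word_vectors)
print(points_2d.shape)  # (50, 2)
```

Note that the two retained axes are directions of greatest variance in the data, not the original coordinate axes, and everything in the remaining 298 directions is discarded, which is why such plots can mislead.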
So the simplest method that many of you might have seen before in other places, is doing PCA, doing a principal components analysis. There's another method that we'll get to called t-SNE, which is kind of a non-linear dimensionality reduction which is commonly used. But these are just to try and give human beings some sense of what's going on. And it's important to realize that any of these low dimensional projections can be extremely, extremely misleading, right? Because they are just leaving out a huge amount of the information that's actually in the vector space. Here, I'm just looking at the closest words to the word frog. I'm using the GloVe embeddings that we did at Stanford and we'll talk about more in the next couple of lectures. So frogs and toad are the nearest words, which looks good. But if we then look at these other words that we don't understand, it turns out that they're also names for other pretty kinds of frogs. So these word meaning vectors are a great basis for starting to do things. But I just wanna give you a sense, for the last few minutes, that we can do a lot beyond that. And the surprising thing is we're gonna keep using some of these vectors. So traditionally, if we're looking at complex words like uninterested, we might just think of them as being made up of morphemes as sort of smaller symbols. But what we're gonna do is say, well no. We can also think of parts of words as vectors that represent the meaning of those parts of words. And then what we'll wanna do is build a neural network which can compose the meaning of larger units out of these smaller pieces. That was work that Minh-Thang Luong and Richard did a few years ago at Stanford. Going beyond that, we want to understand the structure of sentences. And so another tool we'll use deep learning for is to make syntactic parsers that find out the structure of sentences. So Danqi Chen who's over there, is one of the TAs for the class. 
So something that she worked on a couple of years ago was doing neural network methods for dependency parsing. And that was hugely successful. And essentially, if you've seen any of the recent Google announcements with their Parsey McParseface and SyntaxNet, essentially what that's using is a more honed and larger version of the technique that Danqi introduced. So once we've got some of the structure of sentences, we then might want to understand the meaning of sentences. And people have worked on the meaning of sentences for decades. And I certainly don't wanna belittle other ways of working out the meaning of sentences. But in terms of doing deep learning for NLP, in this class I also wanna give a sense of how we'll do things differently. So the traditional way of doing things, which is commonly lambda calculus-based semantic theories, is that you're giving meaning functions for individual words by hand. And then there's a careful, logical algebra for how you combine together the meanings of words to get kind of semantic expressions. Which have also sometimes been used for programming languages, where people worked on denotational semantics for programming languages. But that's not what we're gonna do here. What we're gonna do is say, well, if we start off with the meaning of words being vectors, we'll make meanings for phrases which are also vectors. And then we have bigger phrases and sentences also have their meaning being a vector. And if we wanna know what the relationships are between meanings of sentences or between sentences and the world, such as a visual scene, the way we'll do that is we'll try to learn a neural network that can make those decisions for us. Yeah, let's see. So we can use it for all kinds of semantics. This was actually one of the pieces of work that Richard did while he was a PhD student, was doing sentiment analysis. 
And so this was trying to do a much better, careful, real meaning representation and understanding of the positive and negative sentiments of sentences by actually working out which parts of sentences have different meanings. So the sentence is, 'This movie doesn't care about cleverness, wit, or any other kind of intelligent humor,' and the system is actually very accurately able to work out, well there's all of this positive stuff down here, right? There's cleverness, wit, intelligent humor. It's all very positive, and that's the kind of thing a traditional sentiment analysis system would fall apart on, and just say this is a positive sentence. But our neural network system is noticing that there's this 'movie doesn't care' at the beginning and is accurately deciding the overall sentiment for the sentence is negative. Okay, I'm gonna run out of time, so I'll skip a couple of things, but let me just mention two other things that've been super exciting. So there's this enormous excitement now about trying to build chat bots, dialogue agents. Of having speech and language understanding interfaces that humans can use to interact with mobile computers. There's Alexa and other things like that, and I think it's fair to say that the state of the technology at the moment is that speech recognition has made humongous advances, right? So I mean, speech recognition has been going on for decades, and as someone involved with language technology, I'd been claiming to people, from the 1990s, no, speech recognition is really good. We've worked out really good speech recognition systems. But the fact of the matter is they were sorta not very good and real human beings would not use them if they had any choice because the accuracy was just so low. Whereas, in the last few years neural network-based deep learning speech recognition systems have become amazingly good. I think, I mean maybe this isn't true of the young people in this room apart from me. 
But I think a lot of people don't actually realize how good they've gotten. Because I think that there are a lot of people that tried things out in 2012 and decided, they're pretty reasonable, but not fantastic, and haven't really used it since. So I encourage all of you, if you don't regularly use speech recognition, to go home and try saying some things to your phone. And, I think it's now just amazing how well the speech recognition works. But there's a problem. The speech recognition works flawlessly. And then your phone has no idea what you're saying, and so it says, would you like me to Google that for you? So the big problem, and the centerpiece of the kind of stuff that we're working on in this class, is well how can we actually make the natural language understanding equally good? And so that's a big concentration of what we're going to work on. One place that's actually, have any of you played with Google's Inbox program on cell phones? Any of you tried that out? A few of you have. So one cool but very simple example of a deployed deep learning dialogue agent is Google Inbox's Suggested Replies. So you have a recurrent neural network that's going through the message and is then suggesting three replies to your message to send back to the other person. And you know, although there are lots of concerns in that program of sort of privacy and other things, and they're careful how they're doing it, actually often the replies it comes up with are really rather good. If you're looking to cut down on your email load, give Google Inbox a try and you might find that actually you can reply to quite a bit of your email using it. Okay, the one other example I wanted to mention before finishing was Machine Translation. So Machine Translation, this is actually where natural language processing started. It didn't actually start with language understanding in general. Where natural language processing started was, it was the beginning of the Cold War. 
Americans and Russians were alarmed that each other knew too much about something, and they couldn't understand what people were saying. And coming off of the successes of code breaking in World War II, people thought, we can just get our computers to do language translation. And in the early days it worked really terribly, and things started to get a bit better in the 2000s, and I presume you've all seen kind of classic Google Translate, and that sort of half worked. You could sorta get the gist of what it's saying, but it still worked very terribly. Whereas just in the last couple of years, really only starting in 2014, there's then started to be use of end-to-end trained deep learning systems to do machine translation, which is then called neural machine translation. And it's certainly not the case that all the problems in MT are solved, there's still lots of work to do to improve machine translation. But again, this is a case in which just replacing the 200 person years of work on Google Translate with a new deep learning based machine translation system has overnight produced a huge improvement in translation quality. And there was a big long article about that in the New York Times magazine a few weeks ago that you might've seen. And so rather than traditional approaches to translation, we're again just running a big, deep, recurrent neural network, where it starts off reading through a source sentence, generating internal vector representations that represent the sentence so far. And then once it's gone to the end of the sentence, it then starts to generate out words in the translation. So generating words in sequence in the translation is then what's referred to as kind of neural language models, and that is also a key technology that we use in a lot of things that we do. So that's both what's used in the kind of Google Inbox recurrent neural network, and in the generation side of a neural machine translation system. 
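The read-then-generate loop just described can be sketched in a few lines of NumPy. This is an illustrative toy under loud assumptions: the weights are random and untrained, so the output ids are meaningless; the sizes, the simple tanh recurrence, the greedy argmax decoding, and every name here (`encode`, `decode`, `start_id`, `end_id`) are hypothetical choices, not the architecture of any deployed system.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 6, 4  # toy sizes, chosen only for illustration

# Randomly initialized parameters; a real system would learn these end to end.
E = rng.normal(scale=0.1, size=(vocab_size, hidden))      # word embeddings
W_enc = rng.normal(scale=0.1, size=(hidden, hidden))      # encoder recurrence
U_enc = rng.normal(scale=0.1, size=(hidden, hidden))      # encoder input map
W_dec = rng.normal(scale=0.1, size=(hidden, hidden))      # decoder recurrence
U_dec = rng.normal(scale=0.1, size=(hidden, hidden))      # decoder input map
W_out = rng.normal(scale=0.1, size=(vocab_size, hidden))  # scores over words

def encode(source_ids):
    """Read the source sentence; h summarizes the sentence so far."""
    h = np.zeros(hidden)
    for i in source_ids:
        h = np.tanh(W_enc @ h + U_enc @ E[i])
    return h

def decode(h, max_len=5, start_id=0, end_id=1):
    """Generate target word ids one at a time, feeding each back in."""
    output, prev = [], start_id
    for _ in range(max_len):
        h = np.tanh(W_dec @ h + U_dec @ E[prev])
        prev = int(np.argmax(W_out @ h))  # greedy choice of the next word
        if prev == end_id:
            break
        output.append(prev)
    return output

translation = decode(encode([2, 3, 4]))
print(translation)
```

The decoder half on its own, predicting the next word from the words generated so far, is the language-model part that the transcript says is shared between reply suggestion and translation.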
Okay, so I just have one more minute, and I'll try and get us out of here not too late even though we started late. I mean, the final thing I want to say is just sort of to emphasize the fact that the amazing thing that's happening here is it's all vectors, right? We're using this for all representations of language, whether it's sounds, parts of words, words, sentences, conversations, they're all getting turned into these real-valued vectors. And that's something that we'll talk about a lot more. I'll talk about it for word vectors on Thursday and Richard will talk a lot more about the vectors next time. I mean, that's something that appalls many people, but I think it's important to realize it's actually something a lot more subtle than many people realize. You could think that there's no structure in this big long vector of numbers. But equally you could say, well I could reshape that vector and I could turn it into a matrix or a higher order array which we call a tensor. Or I could say different parts of it or directions of it represent different kinds of information. It's actually a very flexible data structure with huge representational capacity and that's what deep learning systems really take advantage of in all that they do. Okay, thanks a lot. >> [APPLAUSE]
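The closing point about reshaping can be shown directly in NumPy. This is a small sketch; the 24-element vector and the particular shapes are arbitrary choices for illustration.

```python
import numpy as np

v = np.arange(24, dtype=float)   # stand-in for a learned 24-dimensional vector

as_matrix = v.reshape(4, 6)      # the same numbers viewed as a 4 x 6 matrix
as_tensor = v.reshape(2, 3, 4)   # or as a 3rd-order array, i.e. a tensor

# Reshaping changes how the numbers are indexed, not the numbers themselves.
assert np.array_equal(as_tensor.ravel(), v)

# Different slices of the vector could be read as carrying different kinds
# of information; the split point here is purely hypothetical.
first_part, second_part = v[:12], v[12:]
print(as_matrix.shape, as_tensor.shape, first_part.shape, second_part.shape)
```

Nothing about the flat vector forces one of these views; the structure comes from how a model chooses to use it.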

Info

Channel: Stanford University School of Engineering

Views: 702,790

Rating: 4.9414349 out of 5

Keywords: Word2Vec, Natural Language Processing, Word Vectors, Singular Value Decomposition, Skip-gram, Continuous Bag of Words, CBOW, Negative Sampling, Hierarchical Softmax

Id: OQQ-W_63UgQ

Length: 71min 41sec (4301 seconds)

Published: Mon Apr 03 2017