Today, we're gonna look at the topic of semantic parsing. This is my favorite topic of the class, and it's one of my favorite topics in NLP altogether. When I was at Google, my team was responsible for semantic query parsing for Google Search and for the Google Assistant. So we applied the semantic parsing techniques that we're gonna look at today to billions of queries every day, in 40 major languages. Now I'm at Apple, and we use very similar techniques for query understanding for Siri. So semantic parsing is highly strategic for Google, and Apple, and Amazon, and Microsoft, and it's also been a very hot topic in academic research over the last five or ten years. Now, semantic parsing is a big topic, and a complex topic. More than any of the other topics that we've covered, it draws on concepts from linguistics and logic. So I can't possibly do the topic justice in just the next hour that we have. But fortunately, if this topic catches your interest, we have an abundance of material on the website that can help you go deeper. We have a series of four codebooks that introduce a simple semantic parsing system called SippyCup. Chris has made a few screencasts that explain the main ideas in very simple and approachable terms. And there's also a terrific paper that Chris co-authored with Percy Liang in 2015 that I think conveys the main ideas of semantic parsing very clearly and very concisely. For today's lecture, I think the best I can do is, first, to give you some high-level motivation for why this is a problem you might care about, why this is an interesting and impactful problem. And second, to describe the standard approach to semantic parsing in very high-level terms. So let me begin by talking a little bit about the motivation for semantic parsing. At this point in the course, you might be asking yourself, "Wait, I thought we were supposed to be doing natural language understanding, but I still don't know how to build C-3PO. I still don't know how to build a robot that can really understand me." The stuff that we've looked at so far — vector space models of meaning, sentiment analysis, relation extraction — seems to capture aspects of meaning, or fragments of meaning. But there's still a lot that's missing. We still don't know how to generate complete, precise representations of the meanings of full sentences. And there are still many aspects of meaning that we don't know how to capture. For example, things like higher-arity relations, or events with multiple participants, or temporal aspects, as in this example: Barack and Michelle Obama got married in 1992 in Chicago. Or logical relationships, as in: no one may enter the building except policemen or firefighters. As an example, try to imagine building a natural language understanding system that could answer logic puzzles like this one. This was actually the very first project that I worked on when I was a young, eager PhD student, before I became disillusioned. This is a type of logic puzzle that appears on the LSAT exam, and it used to appear on the GRE exam as well. This is a typical example, and it's basically a constraint satisfaction problem, a CSP. You have six sculptures which you need to assign to three rooms while respecting some constraints that are given there. The interesting thing about this type of puzzle is that it's difficult for humans and it's difficult for computers.
But what makes it difficult is completely different for humans and for computers. For humans, understanding the language is easy. In fact, the authors — I guess ETS — have taken great pains to ensure that the language is clear and unambiguous. They really wanna avoid any misunderstandings where the human readers just didn't get it. So they work hard to make the language very clear and unambiguous. For humans, the language is easy, but solving CSPs, doing the logic part of it, is hard, and that's what this is supposed to be testing. For computers, it's the other way around. For computers, solving CSPs is easy. In fact, it's trivial. For computers, understanding the language is hard. If we could build a model that would automatically translate this text into formulas of formal logic, then a theorem prover could do the rest. It would be completely trivial for a theorem prover. But that translation — translating this into formal logic — is far more difficult than it might at first appear. To illustrate this, I wanna take a close look at just a few specific challenges that might arise. So our challenge is: build a system that can translate these words into formulas of first-order logic. One challenge is that many words have semantic idiosyncrasies. An example is the word "same". Consider this sentence: "Sculptures D and G must be exhibited in the same room." If you translate this into formal logic, you're probably gonna get something like this. I've glossed over some details here — I didn't worry about existentially quantifying the variables X and Y — but I think it's enough to convey the basic idea. What this says is: if D is in X (so D is a sculpture and X is a room), if D is in room X and G is in room Y, then X and Y must be equal to each other. Okay, so that seems straightforward. But the point I wanna make is that this adjective "same" is a funny kind of adjective. And to make that clear, let me compare it to some other adjectives. Let's instead look at this sentence: "Sculptures D and G must be exhibited in the red room." Think about what the logical form for that would look like. It might look something like this: it starts off the same way — if D is in X and G is in Y, then X is red and Y is red. By the way, I'm using sort of a Lisp-like syntax here rather than standard first-order logic syntax, but it doesn't really matter what the syntax is, and hopefully it's still understandable. You might quibble with my logical form here because I didn't properly account for the semantics of the definite determiner, which is a very rich topic in itself, but I think that's an issue we can safely put to the side; it's not really the main point here. The real point is, when I say "the red room" — or "a red room", maybe that would be better — I'm just asserting something about the rooms that D and G are in. I'm saying whatever room D is in, it's gotta be a red room, and whatever room G is in, it's gotta be a red room as well. And if I look at other adjectives — if I change it to "large room" or "smelly room" — it's gonna have the same kind of semantics. It's gonna be the same mapping into a logical form. Does everybody buy that? Do these logical forms look more or less right, putting aside issues like the definite determiner? Red, and large, and smelly are ordinary adjectives; they're what's called intersective adjectives. They have semantics which are intersective.
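To make the contrast concrete, here is a rough rendering of the two logical forms just described, glossing over quantification just as the lecture does (this is my notation, not the original slide's):

$\mathit{in}(D, x) \land \mathit{in}(G, y) \rightarrow \mathit{red}(x) \land \mathit{red}(y)$   ("the red room": an ordinary intersective adjective)

$\mathit{in}(D, x) \land \mathit{in}(G, y) \rightarrow x = y$   ("the same room": the anaphoric adjective discussed next)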
"Same" is a funny kind of adjective; it's called an anaphoric adjective. That refers to the idea that the semantics of "same" refer back to something earlier in the sentence. The same room: it's not a specific room, or even a room with a specific quality, it's just whatever room D and G are in. Let's say you've successfully built machinery that can do semantic interpretation of this sentence, and this sentence, and this sentence — it can successfully handle ordinary adjectives. That same machinery is gonna break down when it gets to an adjective like "same", which syntactically looks identical but semantically behaves very differently. And there are other examples too. If you're an eager young PhD student, you might think, "Oh, okay, I ran into a problem. No worries, I'll just fix it. I'll just make my model know about 'same'." But then you realize there are more problems like that. There are other anaphoric adjectives, like "different", which have similar properties, and you might say, "Okay, well, I'll just add another epicycle to my model and I can handle that one as well." But then you start to encounter more and more and more quirky semantic phenomena that are not anaphoric adjectives, but have other kinds of idiosyncrasies that you haven't accounted for in advance. Let me give you a different kind of challenge: the challenge of scope ambiguity. To introduce this, I'm gonna use a joke from Groucho Marx. It's not a very good joke, but it is a good example of scope ambiguity. It starts off, "In this country, a woman gives birth every 15 minutes." Okay, I didn't know that, but all right. And then he continues, "Our job is to find that woman and stop her." So this joke hinges on a semantic ambiguity. When you read the first part, you're like, "Okay, every 15 minutes, there is some woman who gives birth." But when you get to the second part, he's giving that first sentence a different reading instead: there is a specific woman who gives birth every 15 minutes. That's the semantic ambiguity. It's a question of whether the universal quantifier, "every", or the existential quantifier, "a", takes wider scope. Because I'm a language nerd, I collect examples of scope ambiguities. That's a perfectly normal thing to do, right? And I have another nice one which has exactly the same structure: "Children, there is a time and place for everything." You can see again it has the existential quantifier and the universal quantifier. So the standard reading is, "For everything, there exists some time and place for that thing." But the alternate reading is, "There is one specific time and place which is the time for everything." Right? And the joke continues, "And it is called college." This is from South Park. I thought that might resonate with you guys because you're at the university. So this is the idea of scope ambiguity. How does it relate to the LSAT puzzle that I just showed you? Well, one part of that LSAT puzzle was this: "No more than three sculptures may be exhibited in any room." This is interesting because there's not just one possible reading here, or even two possible readings — there are at least three possible readings here, which, if you render them as first-order logic, are quite different in terms of their logical form and their logical consequences. But all are plausible readings of this sentence.
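Rendered as first-order formulas (again my notation, not the slide's), the two readings of the Groucho Marx line differ only in which quantifier takes wider scope:

$\forall t\, \exists w\, \mathit{givesBirth}(w, t)$   (every 15 minutes, some woman or other gives birth)

$\exists w\, \forall t\, \mathit{givesBirth}(w, t)$   (there is one specific woman who gives birth every 15 minutes)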
And from a computer's perspective, there is no way to know in advance which one is the correct reading. I think the most obvious one, the one that the authors intended you to get, was this one: there cannot be a room that contains more than three sculptures. But there are some other possible readings here. Can anybody think of another way of reading this sentence — "No more than three sculptures may be exhibited in any room" — that would have a different logical form, that has different logical consequences? Yeah. Do you consider, uh, no room should have more than three sculptures across — you know, like, across time — like you can't have one and then take it out and then bring another one, take it out and then bring another one, so in the end it has had three but [OVERLAPPING]. Oh, that's interesting. That's not one of the ones I was thinking of, but you're touching on something about whether we're talking about a single point in time or over time — whether the assignment of sculptures to rooms could change over time. Yeah. Ah, is it that there cannot be any — no more than three sculptures, as long as they are in a room? As long as they what? They are in a room. So a sculpture is in a room? Yeah. You could read this as, "No more than three sculptures may be exhibited in any room" — like, it doesn't matter what room it is; no more than three sculptures may be exhibited at all, in any room. That reading of this sentence is maybe a little bit less obvious and a little bit less accessible than the first one, but I think there are ways of contextualizing this that make that reading plausible. Are there any differences between the first and the second in this case? Yeah. Yes, there are. Because in the first one, a possible configuration that's consistent with the first sentence is you have two sculptures in this room and two sculptures in that room. But that is not consistent with the second reading, which says no more than three may be exhibited at all. Totally at all. Yeah, exactly. So they have different logical consequences depending on which reading you choose. There's actually a third reading, which I think is even harder to get to, but I claim is still a plausible reading of the first sentence. And that one is: "At most three sculptures have the property that they may be exhibited in any room; for other sculptures, there are restrictions on allowable rooms." So go back to that first sentence now: "No more than three sculptures may be exhibited in any room" — the other ones have restrictions. Right? That's a plausible reading. It's not the most obvious reading, but it's a plausible reading. And from the perspective of a computer algorithm — computers don't have intuitions about which one of these things is more plausible or less plausible. So when you're building a system to map these sentences into logical forms, your model probably needs to account for all of those possibilities. Because in different contexts, any one of them might be the correct interpretation.
Of course, you also hope that your model will be able to make good predictions about which one is more likely and less likely, and a little bit later we'll look at ways of building a scoring model which will help you choose among multiple possible parses, multiple interpretations of a given input. I'm not gonna say anything more about the LSAT puzzle, but I could go on. There are lots more challenges. I'll just mention one other off the top of my head. There's nothing in the problem itself — let me go back to that problem — there's nothing in the problem that says a sculpture can only be in one place at one time. There's nothing that says a sculpture can't be in two rooms at the same time. But if you just map this into first-order logic and then give it to a theorem prover, the theorem prover will happily put one sculpture in two different rooms. Because it's common-sense knowledge that a sculpture can't be in two different places at the same time: a human just fills that in automatically, without even thinking about it. But a computer doesn't know to do that. So if you wanna make this work, you have to also supply all kinds of background knowledge which is left unstated in this problem. We started this problem, as I mentioned, at the beginning of my PhD, and we worked on it for about three months, and the further we got into it, the more we realized we're not gonna solve this problem. And that was like 15 years ago. And as far as I know, this problem remains unsolved today. This is a really hard challenge. But it looks, at first glance, like it should be achievable. Okay, here's a very different kind of example. This is taken from a consulting project. Yeah — a question? Has anyone tried to do this with a more limited scope, using something like a knowledge base for grounding, kind of similar to what we talked about in this lecture? I'm not sure. It's a good intuition, though: a lot of the constraints I was talking about might come from the domain that you're in, and that could be an interesting thing to explore. Thank you. So here's a very different kind of example. This is taken from a consulting project that I worked on when I was in grad school, and I was working with a startup that wanted to build a natural language interface to a travel reservation system. They wanted to be able to understand and respond to natural language descriptions of travel needs like this. The idea is you'd be able to send an email to this service and they would automatically figure out some travel plans for you. So what are some of the challenges to semantic interpretation that you see here? There are lots of them. Yeah — what's an example of ambiguity here? The ambiguity in Oakland or SFO, or Sunday evening or Monday morning — you have to extract all of those and realize they're alternatives filling the same slot. Yeah. One kind of ambiguity here: it refers to Oakland, but it doesn't actually make it clear whether Oakland is an alternative to SFO or an alternative to Boston. Now, as a human, you know that it's an alternative to SFO; it wouldn't make sense to book a flight from SFO to Oakland. Might be fun, though.
But an automated system, of course — that relies on world knowledge which might not be available to an automated system. What else could make this hard? Yeah. I also just noticed the bit later on about the husband taking a different flight — the return flight has to work for both of them. That's right, you have to resolve some anaphora here and figure out what that pronoun refers back to. And there are other kinds of reference resolution here. For example, Sunday evening or Monday morning: we need to figure out that that refers to the Sunday that follows that Friday, presumably, not the Sunday that precedes it, right? So I guess we're talking about Sunday the 14th or Monday the 15th. There's something else weird here, which is that if Friday is the 12th, then Wednesday can't be the 18th. Mathematically, that doesn't work: that Friday and Wednesday are five days apart, but the 12th and the 18th are six days apart. The human made a mistake, and the system has to deal with it somehow. A reasonable thing would be for the system to say, "Did you mean Wednesday the 17th?" But unless you engineer that into the system, it's not gonna be able to do it. What other challenges do you see here? It seems like there's also some extraneous information — like, it's important that they don't want to fly on United, but the model probably doesn't need to care about why. That's right. Or why the husband is staying later. Yeah. So how to focus on what matters and what doesn't matter. Hating their guts is an idiom. So this gives you a flavor of some of the kinds of things that you run into when you try to do complete and precise semantic interpretation on real-world problems. This goal of complete and precise natural language understanding goes way back to the beginning. Last time, Chris mentioned the SHRDLU system, which was developed almost 50 years ago by Terry Winograd when he was a PhD student. He's now a professor here, and he doesn't work on natural language understanding anymore. But this was arguably the birthplace of natural language understanding. SHRDLU is an NLU system that's grounded in a blocks world. It parses the user's input, maps it to a logical form, and then tries to interpret that logical form in its world and answer a question or take an appropriate action. I give a link to a YouTube video here, and it's worth checking out later. These slides, by the way, are linked from the website, and from there you can find your way to the video. It's kind of a fun video to watch. You can say even quite complicated things, like "find a block which is taller than the one you are holding and put it into the box," and it understands and just does the right thing. At the time, people were totally wowed by this. I think even half a century later, this kind of elicits a wow. But certainly at that time, it really had people thinking that human-level NLU was just around the corner. But then that excessive optimism crumbled when later systems tried to deal with more realistic situations and with real-world ambiguity and complexity. Part of the reason this was achievable was that it was this very constrained blocks world that we were operating within.
Another milestone for semantic parsing was the CHAT-80 system. This was developed around 1980 by Fernando Pereira, who's now a research director at Google. It was basically a natural language interface to a database of geographical facts. So it could answer geographic queries like "What's the capital of France?" or "How many countries are in Europe?" — things like that. It was implemented in Prolog, and it used a hand-built grammar and a hand-built lexicon. So there was no machine learning in sight; it was just lots and lots of rules that drove the semantic interpretation. Now, here's something astonishing: it is still possible to run CHAT-80. Even though this code is older than most of you, you can still run it on the Stanford rice machines. If you want to try it yourself, here's a recipe for getting started. It's actually really fun to kick the tires and try it out and see what works and what doesn't work. In previous years, I've actually taken time to do this live in class, which is kind of fun, but it's a little bit cumbersome and I feel like maybe the educational value isn't worth the cost, so I'm not gonna do that this time. But I want to show you some examples of queries that it can handle. So you can ask it: is there more than one country in each continent? By the way, what's the right answer — let's see how good you guys are at geography. No — why not? In Antarctica and Australia. Yeah. Funny thing is, this system doesn't think that Australia is a continent; it thinks that Australasia is a continent. So it does say no, but the reason it says no is because of Antarctica and not because of Australia. What countries border Denmark? It's a pretty easy question. Yeah. Except, remember, it's 1980. [LAUGHTER] Oh, the Soviet Union was there. No. East Germany? West Germany. West Germany, yeah. Again, kinda funny to step into the time machine, back to the Cold War and all that. I'll skip the next two, but this one is really impressive: "Which country bordering the Mediterranean borders a country that is bordered by a country whose population exceeds the population of India?" CHAT-80 knows how to get this question right, even though it's incredibly syntactically complex. Can anybody get this one? All right, I'll let you guys think about that one. It's the country whose population [inaudible] the population [inaudible] — is that Russia or? China. China. So China is the only one? Is that the one? In 1980, I believe. [inaudible] I'll let you think about it. [LAUGHTER] No — I'll let you go try CHAT-80 on the rice machines; you can ask CHAT-80 and it will tell you the answer. By the way, when I show this slide, one of the things I get out of it is trying small variations. If instead of saying "What countries border Denmark?" I said "What countries touch Denmark?" or "What countries adjoin Denmark?", it frequently just says, "I don't understand." You change one word and it just says "I don't understand," because that other word, that other formulation, that other way of asking the question just wasn't in the hand-built grammar. The last example kinda falls in the same bucket: How far is London from Paris?
It feels like a question that has a very similar flavor to all the other questions — just another geographical question — but it turns out that no matter how you ask this question, CHAT-80 just says, "I don't understand." It doesn't know about distances between places. It knows about areas — it'll tell you the total area in square miles of countries south of the Equator and not in Australasia — but it can't tell you about distances between places. So it has two pretty profound limitations. One is that it only knows about certain things. It's restricted to geography, and even within geography, it only knows about certain kinds of geographic facts. That's one restriction. The other restriction is on the phrasing, the allowable phrasing of questions. Even for questions that it knows the answer to, you have to express your question in a certain specific way. Maybe there are a few different phrasings that it can handle, but there are lots of other phrasings that to a human are perfectly reasonable ways to ask, that the system just can't handle. So on the one hand, systems like SHRDLU and CHAT-80 are pretty astonishing because, long before most of you guys were even born, these systems demonstrated precise and complete understanding of even quite complex sentences. And, by the way, they did it while running on hardware that's less powerful than my rice cooker. Admittedly, I have a really fancy rice cooker, [LAUGHTER] but still, it's kind of impressive what they were able to do. But their coverage was extremely narrow — it was limited to one specific domain — and even within that domain, they were very brittle. If you ask a question this way, you get an answer; if you use just slightly different words, nothing. Today, by contrast, we have systems that exhibit vastly greater robustness and very broad coverage, but with very fuzzy and partial understanding. You can ask Google about just about anything and it never says "I don't understand." But it might not exactly answer your question. If you ask Google this question about which country bordering the Mediterranean borders a country that is bordered by a country whose population exceeds the population of India, Google is not gonna tell you the answer — or actually, what it probably will do is link you to my slides from last year, which maybe had the answer. But it doesn't have the complete, precise understanding that's needed to actually answer the question. Where we wanna be is up here, with C-3PO and the pot of gold. We want to build systems that can understand precisely, and completely, and robustly, over broad domains. Now, you might not feel super inspired by examples like that one — what's the population of the largest continent — because, let's be honest, who cares? But the ability to answer these highly structured, database-y types of queries has lots of applications that are more compelling. For example, if you're a policy analyst working on global warming, you wanna be able to answer questions like these. And it's easy to imagine that you have a database that contains all kinds of statistics about carbon emissions. But unless you're a programmer, you probably don't know how to write SQL queries to pull out the information that you need.
So it would be great to have a system where you could just type in questions like this and get the answer. Here's another example. If you grew up in America, there's a good chance that you have an Uncle Barney who is a total freak about baseball trivia. Uncle Barney would love to be able to ask questions like these, but again, he doesn't know how to write database queries. But if you can invent a semantic parsing system, maybe you can make a website and sell subscriptions, and Uncle Barney will pay you 10 bucks a month. And here's an application that has become very important at Google, and Apple, and Amazon: voice commands for getting things done. One thing to notice here is that these queries are not well served by conventional keyword-based search. Just imagine if you asked Google, "How do I get to the Ferry Building by bike?" and Google responded by showing web pages containing those terms. That would be a disappointment, because you don't want web pages containing those terms — you want a map with a blue line on it. And the only way we can draw that map is if we know where you are and we compute a route from where you are to the Ferry Building, and doing that requires understanding your intent in a complete and precise way. So satisfying queries like these requires mapping them into structured, machine-readable representations of meaning that we can pass to a downstream component to take action on. It requires semantic parsing. So the goal of semantic parsing is to do precise, complete interpretation of linguistic inputs and to map those inputs into structured, machine-readable representations of meaning. A really important question is: what kind of semantic representation are we aiming at? What's our target output? There are a lot of different possibilities, depending on what the problem is and what the domain is. If the goal is to facilitate data exploration and analysis, then the right semantic representation might be a database query language like SQL. For a robot control application, you might want a custom-designed procedural language. For interpreting voice commands, you can often get away with a relatively simple meaning representation, which is based around a fixed set of high-level intents that are parameterized by specific arguments. So, for example, if the query is "directions to SF by train," the semantic representation might be something that first indicates that this is a travel query — that's indicating the intent, the type of intent — and then has two parameters. One is a destination parameter, and the value there is a Freebase ID, the Freebase ID for San Francisco. And the second parameter describes the transportation mode, and here the value is an enum value — there are maybe five different transportation modes, and transit is one of them. And so this little machine-readable expression conveys everything that you need to pass to a back-end system, which can actually figure out the directions and draw a map and give it to the user — and similarly for the other kinds of queries here.
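As a concrete illustration of this kind of intent-plus-arguments representation, here's roughly what it might look like written down as a Python literal. The intent name, the Freebase ID, and the enum value here are placeholders standing in for whatever the real schema uses:

# Hypothetical meaning representation for "directions to SF by train"
meaning = ('TravelQuery',
           ('destination', '/m/SAN_FRANCISCO_MID'),  # placeholder for the Freebase ID of San Francisco
           ('mode', 'TRANSIT'))                      # one value from a small enum of transportation modes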
Okay. So to illustrate the central ideas of semantic parsing, we've created a simple semantic parser called SippyCup. Although it's designed for simplicity and readability, it uses methods that are commonly used at Google and Apple for semantic interpretation of user queries. And we produced a series of four codebooks that introduce SippyCup. Unit 0 is a high-level overview of semantic parsing, and then the remaining units demonstrate the application of SippyCup to three different problem domains. Unit 1 focuses on the domain of natural language arithmetic — queries like "two times three plus four" as English words rather than mathematical symbols. Unit 2 focuses on the domain of travel queries — queries like "driving directions to Williamsburg Virginia" — which has pretty obvious applications for assistant products like the Google Assistant and Alexa and Siri. And then finally, Unit 3 focuses on geographical queries — things like "how many states border the largest state" — and this has a very similar flavor to the kinds of queries that CHAT-80 takes on. So Unit 3 illustrates a modern, machine learning based approach to the same problem that the CHAT-80 system took on. By the way, this approach that SippyCup illustrates was pioneered first by Luke Zettlemoyer and his group at UW, and later by Percy Liang and his group here at Stanford. The key elements of this approach are: first, a context-free grammar which defines the possible syntactic structures for the queries. Second, semantic attachments to the CFG rules, which enable bottom-up semantic interpretation. Third, a log-linear scoring model, which is learned from training data. Fourth, phrase annotators for recognizing names of people, locations, dates, times, and so on. And finally, grammar induction — inducing grammars from training data in order to quickly scale into new domains and new languages. Over the following slides, I'll talk about each of these five elements in turn. So let's start with the grammar. The grammar is the core of the semantic parsing system, and it has two parts: a syntactic part and a semantic part. The syntactic part is a fairly conventional context-free grammar. We have terminals like "Google", and "NY", and "me", and "bike", and "car". Then we have non-terminals, which we indicate with the dollar sign — $loc, for example, is a non-terminal. We also have a designated start symbol, which here, by convention, we're calling $ROOT. So every derivation has to bottom out with $ROOT. We're also using a notational convenience here: this question mark indicates that this element is optional. If I wanted to avoid using it, I could instead have two rules here — one that has $ROOT and all this stuff without the parentheses and the question mark, and another rule which has $ROOT and all this stuff except that element. So this is just saying, "This could be here, or it could not be here; it's optional." This particular grammar fragment could be used to parse just a handful of queries. It could be used to parse "route me to New York by car" — that's one possible thing that can match this grammar — or, using a compound location, it could match "route to Google in New York by bike", or a handful of other queries. There are not very many queries which match this very limited fragment, but it gives you a flavor of the kinds of things that are achievable.
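Here is a tiny, self-contained sketch of a grammar fragment like the one just described. The rule names are illustrative (not SippyCup's actual classes), and the optional "me" is simply expanded into two rules rather than using the question-mark shorthand:

from collections import namedtuple

Rule = namedtuple('Rule', ['lhs', 'rhs'])

rules = [
    Rule('$ROOT', ['route', 'me', 'to', '$loc', 'by', '$mode']),
    Rule('$ROOT', ['route', 'to', '$loc', 'by', '$mode']),   # optional "me" expanded out
    Rule('$loc',  ['$loc', 'in', '$loc']),                   # compound location
    Rule('$loc',  ['Google']),
    Rule('$loc',  ['New', 'York']),
    Rule('$mode', ['car']),
    Rule('$mode', ['bike']),
]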
As is typical for grammars of natural language, these grammars are usually non-deterministic. If you've looked at CFGs in the context of programming languages — if you took CS143 — you've probably seen deterministic grammars, where there's only one possible parse, one possible interpretation, for any given input. In linguistics, we typically deal with nondeterministic grammars, where there are multiple possible parses for a given input. And that's important because natural language utterances are very often — I dare say usually — ambiguous. So here's an example of a parse for the input "route me to Google in New York by bike". We recognize "me" as an optional word. We recognize "Google" as a $loc, a location. We recognize "New York" as a $loc. We then recognize $loc in $loc as a compound $loc. We recognize "bike" as a $mode, and then, because we've got a destination and a mode, we can use that last rule to put the whole thing together into a $ROOT, which is the final syntactic production. So this is basically the syntactic parse — one possible syntactic parse — for that input. So far so good. That's the syntactic part of the grammar. Given a grammar and an input query, we can generate all possible parses of the query using dynamic programming, and specifically we use an adaptation of the well-known CYK chart parsing algorithm. If you've ever looked at syntactic parsing before, you may have seen this. Here's how it works. The first thing we do is rewrite the grammar so that all of the rules in the grammar are binary, or maybe unary. Binary means they only have two things on the right-hand side. What we don't want is three or four or five things on the right-hand side; we want at most two things on the right-hand side. Then what we do is consider every span of tokens, every subspan of the whole query, and we do that bottom-up. We work our way up from very small spans, like spans of one, up to larger and larger spans. As we go, for every span, we consider all ways of splitting that span into two parts. And then we consider every grammar rule whose right-hand side can match those two parts — and that's why it's important to make the grammar binary, so that we only have to consider splitting things in two, and not into three or four or five. Every time we have a grammar rule that can potentially match that right-hand side, that tells us that we have a way of making the category which is on the left-hand side of the rule. And so we can record that as a possible interpretation, a possible parse, for that span, and that can be helpful as we work our way up to larger and larger spans, because it can help us build bigger things above that: we can use those categories in trying to interpret larger spans.
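Here's a toy version of the bottom-up chart parsing just described — a minimal sketch, not SippyCup's actual implementation. It assumes the rules are already binarized (at most two items on the right-hand side), treats items starting with '$' as non-terminals and everything else as terminal words, and for simplicity doesn't handle chains of unary rules:

def parse_categories(tokens, rules):
    n = len(tokens)
    chart = {}  # (i, j) -> set of categories derivable over tokens[i:j]

    def matches(item, i, j):
        if item.startswith('$'):                       # non-terminal: look in the chart
            return item in chart.get((i, j), set())
        return j - i == 1 and tokens[i] == item        # terminal: must match one token

    for length in range(1, n + 1):                     # small spans first, then larger ones
        for i in range(n - length + 1):
            j = i + length
            cell = chart.setdefault((i, j), set())
            for lhs, rhs in rules:
                if len(rhs) == 1 and matches(rhs[0], i, j):
                    cell.add(lhs)
                elif len(rhs) == 2:
                    for k in range(i + 1, j):          # every way of splitting the span in two
                        if matches(rhs[0], i, k) and matches(rhs[1], k, j):
                            cell.add(lhs)
    return chart

# A tiny binarized fragment for "Google in New York":
toy_rules = [
    ('$loc', ['Google']),
    ('$loc', ['New', 'York']),
    ('$locPP', ['in', '$loc']),        # helper category from binarizing "$loc -> $loc in $loc"
    ('$loc', ['$loc', '$locPP']),
]
chart = parse_categories('Google in New York'.split(), toy_rules)
print('$loc' in chart[(0, 4)])         # True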
Okay, so that's the syntax. What about the semantic part of the grammar? Every rule in the CFG can come with what's known as a semantic attachment, and I've shown the semantic attachments here in the square brackets in green. You can think of these semantic attachments as little programs that are run when the parser applies the corresponding syntax rule, and their outputs are fragments of our meaning representation. The semantic attachments basically specify how to construct the semantics for the thing on the left-hand side from the semantics for the things on the right-hand side. For Google, this one is very straightforward: the semantics for Google is just gonna be a Freebase ID which means Google. It's the entity Google. And similarly for New York. For the rule $loc → $loc in $loc, this thing says, "The way to construct the semantics for this $loc is to build it up out of the semantics for this $loc and this $loc." If you've already got semantics for this $loc and this $loc — I'm going to call the first one $1 and the second one $2 — then this says, "Just take the semantics for this $loc and the semantics for that $loc, and stuff them into one of these S-expressions with an 'in' at the front." It's a fragment of our meaning representation, and this thing tells us how to build up these pieces into larger and larger pieces. These ones are straightforward: the semantics of a $mode is just one of these enum values. And then the last interesting thing is that this one tells us how to build the semantics for the entire request. It says it's a GetDirectionsRequest — I guess that comes from the fact that it's a "route" — the destination is gonna be whatever the semantics is for this non-terminal, and the mode is gonna be whatever the semantics is for this non-terminal. So these semantic attachments tell me exactly how to construct my semantic representation from smaller pieces, building up to larger and larger pieces. So here's our example parse again, but this time I've added in green the semantic yield associated with each node of the parse tree. By the way, if you're a linguist, this is basically Montague semantics: bottom-up, syntax-driven semantic construction. So first, we construct the semantics for Google and for New York as Freebase IDs, and for bike as an enum value. Then we combine these two to get the semantics for the compound location up there, and finally we combine everything to get the semantics for the GetDirectionsRequest at the top.
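Here's a minimal sketch of what rules with semantic attachments might look like in code, assuming each rule carries a function that builds the parent's semantics from the child semantics ($1, $2 on the slide become positional arguments here). The Freebase IDs are placeholders, not the real mids:

rules_with_semantics = [
    # (lhs, rhs, semantic attachment)
    ('$loc',  ['Google'],             lambda sems: '/m/GOOGLE_MID'),          # placeholder Freebase ID
    ('$loc',  ['New', 'York'],        lambda sems: '/m/NEW_YORK_MID'),        # placeholder Freebase ID
    ('$loc',  ['$loc', 'in', '$loc'], lambda sems: ('in', sems[0], sems[1])), # compound location
    ('$mode', ['bike'],               lambda sems: 'BIKE'),
    ('$ROOT', ['route', 'me', 'to', '$loc', 'by', '$mode'],
              lambda sems: ('GetDirectionsRequest',
                            ('destination', sems[0]),
                            ('mode', sems[1]))),
]

# When the parser applies a rule, it calls the attachment on the semantics of the
# non-terminal children, in order. For "Google in New York" that yields:
#   ('in', '/m/GOOGLE_MID', '/m/NEW_YORK_MID')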
Okay, let me pause there, because that was a lot to absorb. Any questions so far? Okay, let me keep going. So the next question is: how do we recognize names and dates and numbers and things like that? We could do it all in the grammar, and that's just what I did with Google and New York here — I had, directly in my grammar, something that associates the string "Google" with semantics which is a Freebase ID. And I could do that for long, long lists of entities, and I could also do it for dates and numbers and things like that. But if I do that, it's gonna mean adding lots and lots — like millions — of rules to my grammar, which is gonna be really messy and cumbersome and brittle and difficult to maintain. And it's gonna have limited coverage anyway; I can't possibly put every possible date or every possible number into my grammar. So instead, what we're gonna do is leverage special-purpose annotators for phrases that describe entities, locations, names, numbers, dates, times, and things like that. So here's an example. The example query is "reserve Gary Danko with Tom next Friday." Here we're imagining that we have three different annotators that are gonna help us interpret this query, and they basically function like black boxes which run first, before the syntactic machinery gets to work, and hand their results to the syntactic machinery. So first, I imagine that I have a Freebase annotator, and its job is to look for entity mentions, things that Freebase knows about. It recognizes this string, "Gary Danko", and it says, "Oh, I know about that — that's this Freebase entity." And it tells me this is a $restaurant, which it's able to figure out because it has all of Freebase at its disposal: it can look this thing up, figure out that it's a restaurant, and generate that syntactic category. It also generates helpful metadata that can be passed downstream. For example, it has a confidence score that says how sure it is that it's made the right interpretation here, and that confidence score can be passed through and be an input to scoring the interpretation later on. I also mentioned that we have a contact annotator, which recognizes this string "Tom". And it says, "Oh, Tom, I know who that is — it's this user ID, and he has this email." Obviously, the only way that can work is if the contact annotator knows who issued the request. It knows that this request is from a specific user, and that user has a friend named Tom, and this is his user ID. So it needs access to some personal information in order to ground these semantics with a specific user ID. And finally, a date annotator. It recognizes this phrase, "next Friday", and it says, "Oh, I know how to generate semantics for that — next Friday means May 10th, 2019." It was able to do that because it knows when the query was issued and is therefore able to interpret "next Friday" appropriately. So "next Friday" is an indexical expression, but to ground it correctly, I need to know something about when the query was issued. So we have these annotators that can run essentially as black-box modules: they run over the input query, generate hypotheses about how to interpret small spans, and record those hypotheses as inputs to the syntactic machinery that will get rolling afterwards. Okay.
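As a toy illustration of the kind of black-box annotator just described — this date annotator and its interface are entirely made up for illustration — it scans the tokens, and for any span it recognizes it proposes a category, a semantic value, and some metadata that can be seeded into the parser's chart:

import datetime

def date_annotator(tokens, query_time):
    """Recognize 'next Friday' relative to the time the query was issued."""
    annotations = []
    for i in range(len(tokens) - 1):
        if [t.lower() for t in tokens[i:i + 2]] == ['next', 'friday']:
            # the next Friday strictly after the query date (Friday = weekday 4)
            days_ahead = (4 - query_time.weekday() - 1) % 7 + 1
            date = query_time.date() + datetime.timedelta(days=days_ahead)
            annotations.append(((i, i + 2), '$date', date.isoformat(), {'confidence': 0.95}))
    return annotations

tokens = 'reserve Gary Danko with Tom next Friday'.split()
print(date_annotator(tokens, datetime.datetime(2019, 5, 3)))
# [((5, 7), '$date', '2019-05-10', {'confidence': 0.95})]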
A pervasive problem — yes, go ahead. Just going back to that annotator approach: what exactly is the issue it's tackling? When you first pitched the problem, I thought a solution could be to defer the annotation itself — rather than saying location goes to New York, you could say location goes to a span, and then the location rule would invoke a lookup on that span. Rather than running the entity recognition first, you could go to the semantic parser, it would identify the span, and then pass it to the annotator. Is there a reason why we run the annotator first? I think maybe what you're suggesting is that the parsing could happen top-down instead of bottom-up — does that sound right? That could be, yeah. Conceptually, that's definitely possible. If you've taken CS143, for example, you've studied lots of different approaches to parsing: some of them are bottom-up, some are top-down, some are a mix of both, and all of them are conceptually possible. But there can be big differences in the efficiency of parsing, and particularly for natural language, which is so highly ambiguous, top-down parsing turns out to be prohibitively expensive. Bottom-up parsing is way more efficient, because you can eliminate unlikely possibilities early in the process. When I was working on this problem at Google, with the grammars we were working with, it was commonplace for even very pedestrian inputs to have literally thousands of possible parses. In fact, that's the next point I'm gonna turn to: ambiguity. And when there are thousands of possible parses, you need to do pruning fairly early in the process. We would use what's called a beam search: as the process of parsing is happening, we maintain a finite-length list of the most promising possibilities and aggressively prune beyond that list. That was the only way to keep the search for a plausible parse manageable. Just following up on that: when we do the annotations, it does help us to limit the number of possible parses, and so that's one of the big advantages to running the annotator first? Yeah. Thank you. So this problem of ambiguity is pervasive throughout language, and certainly a big challenge for semantic parsing, and ambiguity includes both syntactic ambiguity and semantic ambiguity. Here's an example: the input is "mission bicycle directions". Imagine the user issues this query to Siri or to the Google Assistant. One possible interpretation is that we wanna get to the Mission by bike — I wanna ride my bike to the Mission. That's certainly a possible interpretation. It turns out that there's a bike shop called Mission Bicycle, so another possible interpretation is that I want directions to Mission Bicycle, and I'm not specifying a transportation mode — maybe I want driving directions, but where I wanna go is Mission Bicycle. So here I show these two different parses for this query. In this one, "mission" is the location and "bicycle" is the mode; in this one, "Mission Bicycle" is the location and I'm not specifying a mode. In this example, there are only two possible parses, but as I mentioned a moment ago, in complex domains with rich grammars, it's commonplace that there are tens or hundreds or even thousands of possible parses. So if our grammar supports multiple interpretations, how do we know which one to choose? The answer is: with a scoring function. We need a way to score the different candidates so that we can figure out which one is the most plausible. And one approach to doing that is with a log-linear model to score alternative derivations. So a little bit of terminology here. I'm gonna call the input — that is, the natural language query — X. I'm gonna call the derivation, or parse — the syntactic parse tree with all of its semantic attachments — Z. And I'm gonna use Y to designate the semantic yield, the final semantics; it's basically the semantics that you get at the root node of the parse tree. By the way, Z completely determines Y, but for any given X, you may have lots of candidate Zs and correspondingly lots of candidate Ys. So to build this scoring function, what we're gonna do is first define a feature representation which captures the key characteristics of the input X, the candidate parse Z, and the candidate semantics Y. There's a lot of room for variation in the feature representation, but just to give you a flavor of some commonly used features: your feature representation might include an indicator function which is one just in case the input contains the word "to" and the candidate semantics contains a destination parameter. Intuitively, it makes sense that those two things are likely to go together, and so seeing them together might count as weak evidence.
It's hardly proof, but it might count as weak evidence that this is actually a good parse. Or you might have features — again, indicator features, Boolean features — which capture the occurrence of specific CFG rules or specific categories (the dollar-sign categories) in the candidate parse. The intuition here is that some rules in your grammar might be much more likely or unlikely than others to participate in good parses, in valid parses. You might not be able to anticipate that in advance, but if you have features like these, then you have the opportunity to learn it from data. You can learn from your data that some rules work much better than others. Another kind of feature you might include is one which passes through the confidence score that you got from an annotator: presumably, if the annotator is really confident, that makes it more plausible that a parse which includes that annotation is a good parse. There's lots of room for variation in the feature representation, but once you have a feature representation, the score for a derivation, the score for parse Z, will just be the dot product of the feature vector and a weight vector theta, which is learned from training data. And once you have a score, you can turn the score into a probability in the usual way, by using the softmax function. So the next question is: where do we get this thing from? Where do we get the weight vector theta? This is basically the parameters of our scoring model. Well, we're going to get it from data. We're gonna estimate those model parameters from training data, and we can do it using a variant of the EM algorithm. The reason for using the EM algorithm is that our training data consists of pairs of inputs x and target semantics y, but the training data doesn't include the correct parse trees z. There could be multiple parse trees that yield the same semantics, so we have to treat those Zs as latent variables, and that's what the EM algorithm is good at. Yeah. [inaudible]? Yeah, I'll try to explain it in application to this specific problem. The EM algorithm alternates between an E-step and an M-step — EM stands for Expectation Maximization. In the E-step, what we do here is use the current model parameters theta to parse the inputs in our training data, and for each input x, generate an n-best list of possible parses. So we're just parsing all the inputs using our current model. Then, in the M-step, we're gonna change the weights of the model. Specifically, we change the weights to put more probability mass on the elements of the n-best list that actually generated the correct semantics. In the n-best list, some of them are gonna have good semantics which match the target, and some of them will not, and we want to shift the probability mass towards the ones that match the target semantics. And then we go back to the E-step and do it again: we reparse everything using our updated weights, which could cause the n-best lists to shift around, and we keep going back and forth between the E-step and the M-step. These updates to the model are basically SGD updates — this is kind of like a stochastic gradient descent algorithm — and the weights will adjust gradually over time toward weights that are more and more successful in generating the target parses for our training data. Okay.
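To make this concrete, here's a rough sketch of a log-linear scorer and a simplified update step in the spirit of the E-step/M-step loop just described. The feature names, the structure of the candidate parses, and the update rule are all illustrative simplifications, not the production training code:

import math

def features(x_tokens, rules_used, semantics, annotator_confidence=None):
    """Indicator and count features over the input, the derivation, and its semantics."""
    feats = {}
    feats['to_and_destination'] = float('to' in x_tokens and 'destination' in str(semantics))
    for lhs, rhs in rules_used:                       # one count feature per rule in the parse
        name = 'rule:%s->%s' % (lhs, ' '.join(rhs))
        feats[name] = feats.get(name, 0.0) + 1.0
    if annotator_confidence is not None:              # pass through annotator confidence
        feats['annotator_confidence'] = annotator_confidence
    return feats

def score(feats, theta):
    return sum(theta.get(f, 0.0) * v for f, v in feats.items())

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def update(theta, candidates, target_semantics, lr=0.1):
    """candidates: list of (feats, semantics) for the n-best parses of one input.
    Shift probability mass toward candidates whose semantics match the target."""
    probs = softmax([score(f, theta) for f, _ in candidates])
    good = [i for i, (_, sem) in enumerate(candidates) if sem == target_semantics]
    if not good:
        return
    for i, (feats, _) in enumerate(candidates):
        target_prob = 1.0 / len(good) if i in good else 0.0
        for f, v in feats.items():
            theta[f] = theta.get(f, 0.0) + lr * (target_prob - probs[i]) * v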
So we've got a grammar, it's got syntax rules, it's got semantic attachments, there's ambiguity, we have a scoring function, and we've learned the scoring function from data. The next question is: where did the grammar rules come from? And the answer really depends on what domain you're working in. If it's a small, simple domain, then grammars with a few dozen or a few hundred rules are often enough, and at that scale, it's practical to write grammars manually. You think about the domain, look at lots of example queries, and just manually write down some rules that will cover the set of inputs properly. But that's not the common situation. The common situation is that you're in a large, complex domain, and you need thousands of rules to model the domain well, and at that scale, it's just not feasible to write down all the rules manually. So instead, we want to learn the rules automatically from training data. This is the challenge of grammar induction, and it's where some of the most interesting research work lies. One strategy for grammar induction — it's a very simplistic strategy, but it's illustrated in Unit 1 of the SippyCup codebooks — is simply to generate all possible rules and add them to your grammar, and then to use standard parameter learning to learn weights for each rule, to learn which rules are most likely to contribute to a successful parse. So first: what does "all possible rules" mean? It basically means all ways of combining the symbols of your semantic representation — your language of formal semantics — into syntax rules: every possible right-hand side, every possible left-hand side, the cross product of all those things. And if we add all those rules to our grammar, then we can generate a huge variety of different parses. Most of them are bad, but we can use the weight learning that I talked about on the previous slide to learn from data which ones are good and which ones are bad. This works, but it comes at a heavy cost, because generating all possible rules leads to an exponential blowup in the number of possible parses, and consequently in the time that's required for parsing — and if you work your way through SippyCup Unit 1, you'll see this illustrated very vividly. So more sophisticated approaches to grammar induction look for ways to be far more selective about introducing new grammar rules, and also ways to prune them aggressively when they're not contributing to successful parses, to keep the size of the grammar manageable so that training can still run in a feasible amount of time. So we've talked about two different ways of using training data in this process. One way of using training data is to induce the rules of the grammar, to figure out which syntactic productions should be part of the grammar. The other way of using data is to estimate the parameters of our log-linear scoring model. These two ways of using data are both really important and work hand in hand. So this underscores the importance of data for making this whole thing work. You can't do grammar induction and you can't learn the parameters of your scoring model without training data — and not just a little bit of data. To really make this work, you want massive amounts of data.
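Before moving on to where that data comes from, here's a toy version of the naive "generate all possible rules" strategy described above — enumerate candidate rules as the cross product of a few categories and short right-hand sides built from observed tokens and categories, and rely on the learned weights to sort the good rules from the bad. Purely illustrative, and it blows up quickly even for tiny vocabularies:

from itertools import product

def all_candidate_rules(tokens, categories, max_rhs_len=2):
    vocabulary = sorted(set(tokens)) + list(categories)
    rules = []
    for lhs in categories:
        for length in range(1, max_rhs_len + 1):
            for rhs in product(vocabulary, repeat=length):
                rules.append((lhs, list(rhs)))
    return rules

print(len(all_candidate_rules('two times three'.split(), ['$E', '$Op'])))  # 60 candidate rules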
So organizations that are doing this at scale, like Google and Apple, invest a lot of money in data annotation, using proprietary crowdsourcing platforms similar to Mechanical Turk — paying graders, human annotators, to look at example input queries and write down the target semantics for those queries, so that we can then train machine-learning systems to capture that. However, that's really expensive. Labeling examples with target semantics is slow and laborious, and costs a lot of money. So another really productive direction has been to enable learning from indirect supervision. An idea pioneered by Percy Liang is the idea of learning from denotations. Denotation basically means: if your semantic representation is something that can be executed or evaluated in some way, that execution or evaluation could result in something that's a lot simpler and more intuitive for a human annotator to produce. An example — I should have put an example on the slide — might be, "What's the capital of the largest state in the United States?" The semantics for that might be some kind of complex logical formula that literally says, except in logical language instead of English, "What is the capital of the largest state?" The denotation of that semantics would be the answer to the question, like, Austin. Juneau, yes — I guess, if we're talking about large in terms of area, right? What is the capital of the largest state? Juneau, Alaska. So that's the denotation. If your training data consists of inputs, like "What is the capital of the largest state?", paired with logical forms, we can learn very effectively from that, but it's hard to produce that training data, because it's hard to get human annotators to produce those logical forms. At Google and Apple, we figured out all kinds of tricks to make it a little bit easier for ordinary people to produce logical forms, but it's an intrinsically hard problem. It's much easier to get ordinary human annotators to produce the answers, like "Juneau". And so if we can have training data that says, "What's the capital of the largest state? Juneau," and we can figure out a way to learn from that kind of data, we'll be able to get lots more data, and the benefit of learning from that data. So that's the idea of learning from denotations. This is a really powerful idea, and it's illustrated again in Unit 1 of SippyCup, where the domain is natural language arithmetic; there, the denotations are just the answers to simple arithmetic computations. Okay. So to recap, the key ingredients of this approach to semantic parsing are: CFGs with semantic attachments, log-linear scoring models, annotators, grammar induction, and above all, lots and lots of training data. I hope that's enough of a high-level overview that, if this topic interests you, you'll be able to dive into the SippyCup codebooks, start reading some papers, and get a much more concrete sense of how all this stuff fits together and how it all works. I think I'll stop there for today.