We're six paragraphs in, and at this point it knows, "I've covered the first sentence of the initial paragraph; now it's time to talk about the second sentence of the lead": even more surprising to the researchers was the fact that they spoke English. It completely ignored the speaking-English part until it got to the point in the news article where that would naturally come up, and now it's talking about it. That's the kind of thing these systems used to have real trouble doing.

I've known journalists who can't write that well.

That ties into something I think is fundamental to the way people think about AI. We used to think you had to be really clever to be good at chess; if you could play chess, you had to be really intelligent. Then we realised that playing chess to a superhuman level doesn't require anything we would call intelligence. And it's slightly unsettling to find that writing coherent, plausible news prose apparently also doesn't require general intelligence. If you just learn the statistical relationships between words, but do that really, really well, that seems to be enough.

I mean, obviously any journalist would say that the truth is kind of important in this as well, not just sounding plausible.

You're absolutely right. There are definitely questions about directing this towards producing a specific article that is correct. But the generating of the prose itself apparently requires less philosophical sophistication than we thought. Than many thought, anyway. Ten years ago, people would have had a really hard time believing that something that's just learning from data, learning these relative probabilities, could produce something this coherent. You would expect it to need all sorts of hand-coded conditions and formulas. You would look at this and say, "this must have a database of names and countries and locations and cities," because it's using all that information. But it turns out all of that is already represented in the dataset, because we talk about all of these things.
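Everything above comes down to one loop: feed the model some tokens, get a probability distribution over the next token, sample one, append it, repeat. Here is a minimal sketch of that loop; it uses the Hugging Face transformers port of GPT-2 rather than the paper's original release, and the prompt and sampling settings are made up for illustration (the paper's own samples reportedly used top-k sampling with k = 40).

```python
# Minimal sketch of "predict the next word, sample it, repeat", using
# the Hugging Face `transformers` port of GPT-2 (an assumption; this is
# not the paper's original code). Prompt is illustrative.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Here is a recipe for peppermint chocolate cake:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Top-k sampling; asking for several sequences gives a spread of
# completions, so you can see the general quality for yourself.
outputs = model.generate(
    input_ids,
    max_length=120,
    do_sample=True,
    top_k=40,
    num_return_sequences=3,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```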
Here's a recipe for some kind of peppermint chocolate cake, and it's got a bunch of different completions, so you can just spit these out arbitrarily.

You know those recipe blogs, where you google the recipe for something, you go to the blog, and there are seven paragraphs of "my mother used to make this for me back in our home in Indiana; I always remember sitting out on the porch with my dog; the important thing is, I had an onion on my belt, which was the style at the time"? It's doing that. It's making up different backstories.

Yeah, just backstory. I wonder if anyone has tried to make any of these recipes.

That could be dangerous with this one. This is a recipe for meringue cookies: one and three-quarter cups butter, softened; a cup of sugar; an egg yolk; 3 t of heavy...

Crucial "t" there. What unit is that? Three tons of heavy cream?

That's usually a lowercase t; I don't know what it is. Three tons of heavy cream, let's say. Three and a half to four cups of flour; a pinch of salt; peppermint JoJo topping, which, I have no idea what that is, but "peppermint JoJos" is mentioned in the prompt. Then one and a quarter cups powdered sugar; a cup of chopped pecans; half a cup of finely chopped mint leaves; half a cup of chopped fresh mint; about a half sheet. So it doesn't quite make sense, but it's right on the edge of making sense: we have half a cup of chopped mint leaves and then also half a cup of chopped fresh mint.

And these are all potentially cherry-picked out of a huge number of horrendous ones?

Right, yes. Well, these ones specifically are not. For the unicorn one, and I like this because they say it explicitly, and it's standard practice: they said, "we're going to make a sample for the paper," they generated ten of them, and they picked the one they liked, which was this one. But the recipes here are not cherry-picked at all. That's why they're showing six: they just said, these are the first six that we generated, and here they are. So that gives you a better idea of the general quality of what it's putting out. They're all fairly sensible. Look at this one: it comes in and says "I do not substitute it with something else," blah blah blah. And this one, I don't know if that's right: "here's an image," "please like this on Facebook," and then it goes on, "I found this really cute card with cute little kittens on it," and then the sample cuts off. It's just moved on to the next post in the blog, you know? This is why GPT-2 is so cool to me.

Let's see. The first thing they tested it on is the Children's Book Test. I think it's a cloze test: you have a dataset of children's books, you remove one word, and the system has to predict which word is the correct one. You give it ten words that it might be, something like that, and it has to pick which word fits in the space. So that's a standard kind of language-model task. The one thing they did have to do was run an analysis to check for overlap, and it turns out that one of the children's books is The Jungle Book by Rudyard Kipling, which was in the dataset they trained the thing on, in its entirety. So it just knew that one, and they threw it out, because it's not fair if you've already seen the entire book before. Its performance on that was good: by the time you get up to the very large-scale models, it's scoring, what is that, 89 percent, where humans tend to get 90 to 93 percent. So it's nearly at human level for guessing one missing word from a sentence in a children's book. Pretty good.

LAMBADA is designed to test long-range dependencies, which is what I've been talking about a lot. The task is to predict the final word of sentences which require at least 50 tokens of context for a human to successfully predict. 50 words is a pretty long sentence, so this kind of long-term dependency thing is a standard way of testing language models. It ends up with an accuracy of 63 percent, which is an improvement over the state of the art by four percent. So it's state of the art on LAMBADA without being specifically trained on it, just running in general.
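One plausible way to run a cloze test like this against a pure next-word predictor, sketched below with an invented sentence and invented candidates (the real benchmark supplies ten candidates per blank), is to score each completed sentence by the model's own loss and pick the least surprising:

```python
# Sketch of CBT-style cloze scoring: complete the sentence with each
# candidate word, ask GPT-2 how surprising the result is, and keep the
# least surprising. Sentence and candidates are invented examples.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "Mowgli ran through the jungle until he reached the wide grey"
candidates = ["river", "biscuit", "algebra"]

def loss(text: str) -> float:
    ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        # labels=ids makes the model report its own average
        # next-token cross-entropy over the sequence
        return model(ids, labels=ids).loss.item()

print(min(candidates, key=lambda w: loss(f"{context} {w}")))
```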
Then there's the Winograd schema challenge. I don't know if it's pronounced "Winograd" or "Vinograd"; who knows, maybe it's somebody's name. Whatever. This is about resolving ambiguities, which is especially important in translation tasks and things like that. It's very easy for sentences to be ambiguous in a way that makes translating them very difficult, or even impossible, and I have an example of this. Check this out. Consider a sentence like "the chicken didn't cross the road because it was too..." something, and then we can consider different versions of this sentence. Suppose it's "the chicken didn't cross the road because it was too wide." That's one possible completion. Or you might say "the chicken didn't cross the road because it was too scared," another perfectly sensible sentence. The question is: in one of these, "it" is referring to the chicken, and in the other, "it" is referring to the road.

Can I offer a third one? "Stormy."

Stormy, oh yeah, all right, that's a good one. "Stormy" means "it" is actually neither of the things in the sentence. The other one that's fun is something like "busy." Is it a busy road, or did the chicken just have better things to do than crossing the road? We don't know. I would say probably the road, but this could be a children's book, right? We are running this thing on children's books. The rabbit was too busy; it was late for an important date. Why can't the chicken be busy?

So the point is, suppose we're trying to translate this into a language that genders everything, as many languages do, and maybe "chicken" is a masculine noun and "road" is a feminine noun. Then it has to know what "it" refers to: is this "il" or "elle," or whatever. So the idea of this benchmark is to measure how well a system can resolve ambiguities. If the sentence says "wide" and you're trying to do translation the old-fashioned way, where you're parsing trees and looking things up in dictionaries and so on, this kind of sentence is a nightmare, because you just don't know. The information is not really present in the text; it's not present grammatically. It's present in your understanding of the world, of chickens and of roads. So if it's "it was too wide" and you translate it into something which is the equivalent of the English sentence "the chicken didn't cross the road because the chicken was too wide," you've screwed up. That's a bad translation. But at the same time, there's nothing in the sentence itself that tells you it's wrong. So what you need is the thing we've already seen GPT-2 is super good at, which is pulling in knowledge of the world: knowing that the University of La Paz is going to be near the Andes, or in the Andes, that kind of thing. It's going to know that roads being wide is a thing much more than chickens being wide is a thing, that roads don't get scared, and so on. And again, on this kind of thing it does very well; it beats the state of the art by a lot.
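A sketch of how that can be scored with nothing but a language model: substitute each candidate referent for the ambiguous "it" and keep the reading the model finds more probable. The model name and sentence are illustrative, not the paper's exact evaluation code:

```python
# Sketch of Winograd-style disambiguation by language-model scoring:
# rewrite the sentence with each candidate referent in place of the
# ambiguous "it" and keep the reading the model finds more likely.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def loss(text: str) -> float:
    ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

template = "The chicken didn't cross the road because the {} was too wide."
readings = {ref: loss(template.format(ref)) for ref in ("chicken", "road")}

# Lower loss means a more probable reading; world knowledge
# should favour "road" here.
print(min(readings, key=readings.get))
```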
You can see it on this graph. The way this graph is laid out, by the way, and this is the same in all of them, is that along the bottom is the size of the model. They made four different sizes of model, and these are the same sizes as previous language models, so it makes sense to compare them. The smaller ones do worse than the previous state of the art, but the 762-million-parameter and the 1.5-billion-parameter models go significantly past the state of the art; they're getting around 70 percent.

So the state of the art is the straight line across?

Yes. And the thing that's also kind of fun about some of these graphs is that in some of them the 762-million and the 1.5-billion models end up doing about as well as each other, which means you've hit the limit of, I guess, maybe your dataset or whatever. Whereas in this one there's still growth, which means an even bigger model might be expected to do even better.

Then there's reading comprehension. This is another one where you have some text and you then have to answer questions about that text. The thing that's fun is: how do you do this without modifying your model? It's just a generative model.

So... by the time it's read the text, it's modified itself based upon what you've given it to read? Is that what you mean?

No. What I mean is, the way that GPT-2 works is you give it a sequence of tokens and it gives you a probability distribution for the next token. The type signature of that is totally fine if you're trying to fill in a missing word. But for a test like this, you have to take the challenge you're given and try to express it in terms of this predict-the-next-word setup, because otherwise you're sort of cheating. The whole point is that they're trying to go at this without modifying the system at all. So for reading comprehension, the way they do it is they give the thing the text to be comprehended, and then they give it "Q:" and a question, "A:" and the correct answer to that question, newline, "Q:" and a new question; they give three or four example questions like that, and then "Q:" and the question they actually want answered, then "A:", and let it generate. So they prime it.

I think we have some examples of this. This is how they did the question answering: they gave it these two paragraphs about the Olympic Games torch relay, moving the Olympic torch, some news story, and then a bunch of questions. "Q: What was the theme?" "A: one world one dream," and so on and so on. And then at the end, "Q: Did they climb any mountains?" and then "A:", generate me the next word. So they've used the input text to prime the model: we're doing question-answer pairs now, this is how it works. And the interesting thing is that it actually ends up giving kind of a better answer than the human-generated answers. For the question "did they climb any mountains?", the responses they got from humans were "unknown," "yes," "yes," and "yes," because they do climb mountains. But GPT-2's answer is "Everest." So GPT-2's answer is actually kind of better than the humans': the humans just said yes they did, and the machine-learning system has named the mountain that they climbed. I don't know if that counts as not quite understanding the question, or if it counts as providing a higher-quality answer; it's up for debate. It can do this because it has this ability, with attention, to handle long-range dependencies: it has to look back past all of the previous questions to the actual paragraph and find the relevant information, and it can do that. It performs reasonably well at it. But the thing I love about this is that they have to come up with tricks to get it to actually do the task they're trying to get it to do.
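The priming itself is just string assembly; the model is never modified. Here is a sketch of the Q:/A: format as described, with an abbreviated stand-in passage and invented example pairs rather than the paper's actual text:

```python
# Sketch of the Q:/A: priming trick: the reading-comprehension task is
# expressed purely as text for the model to continue. The passage and
# example answers below are stand-ins, not the paper's actual text.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

passage = ("The Olympic torch relay carried the flame through many "
           "countries and up the slopes of Mount Everest.")
examples = [
    ("What was the theme?", "one world one dream"),
    ("Where did the relay go?", "through many countries"),
]
question = "Did they climb any mountains?"

prompt = passage + "\n"
for q, a in examples:
    prompt += f"Q: {q}\nA: {a}\n"
prompt += f"Q: {question}\nA:"  # now let the model generate the answer

ids = tokenizer.encode(prompt, return_tensors="pt")
out = model.generate(ids, max_length=ids.shape[1] + 10, do_sample=False)
print(tokenizer.decode(out[0][ids.shape[1]:]))
```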
The summarization one is brilliant; I love the way they did this. See if you can guess how. You want to get a summary of a piece of text: how do you get that, given a huge dataset of Reddit content? What you do is you write out the whole long piece of text, then you put a newline, and then you put "TL;DR": too long, didn't read. In this dataset there will be thousands and thousands of examples of long pieces of text followed by a short summary of that text, and in the middle is this string, "TL;DR".

I would love to have been in the room when they thought of that.

So yeah, it's really, really cool, really powerful technology. I like it a lot.
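For concreteness, here is what the TL;DR trick looks like as a sketch, again with the Hugging Face port and a placeholder in place of the article:

```python
# Sketch of the TL;DR trick: append "TL;DR:" after a long text and let
# the model continue. Because web text is full of long posts followed
# by "TL;DR:" and a short summary, the continuation tends to be one.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

article = "..."  # placeholder: the long piece of text to summarise
prompt = article + "\nTL;DR:"

ids = tokenizer.encode(prompt, return_tensors="pt")
out = model.generate(ids, max_length=ids.shape[1] + 60,
                     do_sample=True, top_k=40)
print(tokenizer.decode(out[0][ids.shape[1]:]))  # just the continuation
```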