Exploring foundation models - Session 1

Captions
We have a packed agenda today, so we don't have leisure time to relax at the beginning. Let's start with a brief introduction: my name is Mike Wooldridge, I'm a professor at the University of Oxford and director of foundational AI research at the Alan Turing Institute in London, the UK's national center for AI and data science. Thank you all for attending; it looks like we've got a very exciting day ahead of us. Let's begin with some routine housekeeping. The first rule of the IET is no smoking; the second rule of the IET is no smoking. So no smoking, no vaping, no naughtiness anywhere, please. Toilets are located out of the center door, down the marble staircase: turn right for ladies and turn left for gentlemen. We are not expecting any fire alarms, so if we hear a fire alarm, which would be a constant alarm, it would be a real alarm. The fire exits are to the front, left and rear of this room, so there are many ways out, and the muster point is underneath Waterloo Bridge. Okay, so let's start with some scene setting. This event was originally planned for October last year; we had to cancel the first iteration for a whole bunch of logistical reasons. It was then rescheduled for December, and we had to cancel the December event because of train strikes. And then ChatGPT happened and everything went crazy, and an event at which we would probably have expected 60 or 70 people suddenly needed a lot more capacity. By the way, there is at least one joke in my talk, but I promise you the joke that I'm not going to make is that ChatGPT wrote my presentation. I've had so many calls in the last two months that have started with somebody making that joke; if we agree on one thing today, let's all promise never to make that joke again, because it just isn't funny anymore. So how did we get here? What's it all about? What is real and what is hype? That's what we're trying
to get to the bottom of today. The first thing to say is that we've now been seeing, for at least a decade, very exciting progress in machine learning, driven primarily by progress in neural AI. Machine learning is a broad field with a number of different techniques, but the one that's really taken off and shown promise over the past decade is techniques based on neural networks. The idea of neural networks goes back to the 1940s, to McCulloch and Pitts, who noticed that if you look at a human or animal brain or nervous system under a microscope, you see massive numbers of interconnected nerve cells, neurons, arranged into enormous networks, and what prompted them was the similarity of such networks to electrical circuits. But it's really over the last decade that this approach has become viable on large-scale problems of interest. I think there is agreement that there are basically three drivers behind this progress. Firstly, there have been scientific advances, and we'll talk about what those scientific advances are. But just as important, it turns out that to make this work you need lots of data, and we are in the age of big data: every time we upload a picture of ourselves to social media and carefully label it with our names and the names of our friends and children and dogs, we are feeding machine learning algorithms. And just as important is compute power: to make this work requires enormous amounts of compute power to process that training data. Not always, but bigger can be better, and so the race to scale, the race to grow these systems, began, and we're now in the age of big AI. When I was a PhD student, AI was done on desktop computers that were typically shared with a whole bunch of other students. That age of AI is now over, in exactly the same way that when Rutherford was doing experiments on the structure of the atomic nucleus back in the 1920s and 30s, they did that on a
bench in their lab. Nuclear physics experiments can no longer be done on a bench in your lab; you need big facilities, and we are now in the age of big AI. The term "foundation models" was coined in a report released during one of the lockdowns, in summer 2021, by a group at Stanford University, when they published the paper on the right, "On the Opportunities and Risks of Foundation Models". The lead for that activity, Percy Liang, is going to be one of our speakers later on today. And what is the new idea? It's about scale. The term "foundation models" doesn't mean that this is the foundation of AI, but that these are tools upon which you can build. What we've seen is the release of a succession of systems, of which ChatGPT is just the latest, that are extremely large neural networks, using vast amounts of training data and requiring huge amounts of compute power to train. The key idea here is the following. When I was a student and I was taught about symbolic AI, the big idea of symbolic AI was that intelligence is primarily a problem of knowledge: if you can give a machine the appropriate knowledge for a task, then it can carry that task out for you. So symbolic AI says intelligence is a problem of knowledge. The dream of big AI is that intelligence is a problem of data: if we can get enough appropriate data, with the compute power and the right system architectures, then that will lead us to intelligent behavior. That, I think, is the slogan of big AI: intelligence is primarily a problem of data. The most prominent of the various tools that we've seen are large language models. What large language models do (the term is not tremendously helpful) is completion from prompts, and actually it's ridiculous that this idea works so phenomenally well, because it is such a simple idea. If I
open up my smartphone and start typing a text message to my wife, and I type "I'm going to be", my smartphone will suggest the completion "in the pub" or "late". How is it doing that? Because it's been trained on all the text messages that I've sent: it's got a repository of all of those messages, and it uses some fairly simple statistics (nothing very sophisticated; it's not clear there are any neural networks there, for example) to learn that the likeliest next thing I'm going to type, if I've typed "I'm going to be", is either "late" or "in the pub". That feature is essentially what large language models do. They do the same thing, but on a vastly, vastly larger scale: not just the text messages that you've sent, but vastly larger data. As we've seen, the most prominent of these is ChatGPT, but ChatGPT is really just GPT-3 with a few nicer front ends and a bit more training, so it's GPT-3.5. GPT-3 was released in 2020 and immediately gathered interest because it was clearly a step change in capability over predecessor systems. So GPT-3, for the moment, is the canonical large language model. And what is the scale? There are 175 billion parameters. What that means, roughly speaking, is that the parameters in a neural network are the individual neurons and the connections between them. So this is an extremely large system: it means, for example, that you simply can't hold the data for such a system on a regular desktop computer. What about the training data? Well, the training data is not publicly available, so we don't know the details, but what we seem to know is that it's something like 500 billion words, which is 45 terabytes, or 45 million long novels. Wikipedia was apparently used as training data and made up just three percent of what went into the system. These numbers are really kind of meaningless; they're so large they're beyond human understanding. We can't relate those numbers to
our experiences in the everyday world. But it does immediately tell you one thing: we do not process anything like that amount of text, and yet we come out as very capable readers and speakers and writers. So machine learning has a long way to go before it becomes as efficient as human beings. And to carry out this training, which just means tuning the various parameters, the weights on the edges in the neural network, requires AI supercomputers running for months. This is extremely expensive, and there are also concerns about the amount of CO2 that's being generated if we're running many such systems. So how do they work? Well, here's a classical view of neural networks. Here is our icon on the left, Alan Turing, and imagine that each of the inputs to this neural network is one of the pixels that makes up the image we're seeing on the left. Usually there is some processing, but this is a reasonable approximation. Each of these neurons is trying to recognize a simple pattern on its inputs by doing a simple piece of mathematics, and if it recognizes that pattern, it feeds into the network until, in the end, we get the desired output being produced. That's the classic, stereotypical view of a simple feed-forward neural network. So does this work for large language models? In principle it might: you might be able to do this training with "It is a truth" as the text on the left-hand side, and on the right-hand side we get "universally acknowledged" produced. But no, not really, because the problem is too unconstrained. These large language models certainly do have very large neural structures, but they're not just a big neural network. So what do they look like? Well, the big breakthrough came in 2017 with what are called transformer architectures, which I believe came out of a Google lab, and this is the key picture from the paper
that introduced them, "Attention Is All You Need". You can immediately see there's a lot more structure there; the system is organized in a much more sophisticated way. The two big innovations that fed into these transformer architectures were, firstly, positional encodings: when you're feeding a sentence as input to one of these systems, you're recording not just the word but where in the sentence that word occurs. And secondly, the other big innovation, which is where the title of the paper comes from, is attention mechanisms; I believe GPT-3 has 96 layers of attention mechanisms, which enable the system to focus on the text of interest. We're not going to go into more detail about the structure of these things, but my point is simply that these are not just big neural networks: there's an awful lot of systems engineering and machine learning architecture behind the scenes. So what can they do? Well, firstly, here is a slide which shows some dimensions of intelligence. These are all the kinds of attributes of a fully intelligent individual, or, if we succeed with AI, the kinds of capabilities that an AI system would have. On the right-hand side, the light blue ones, we've got capabilities in the physical world: navigation, mobility, manual dexterity, hand-eye coordination, understanding vision and audio, proprioception (the idea of understanding yourself in a space and where you are in that space), and so on. Then on the left-hand side, the darker blue ones are the mental capabilities: things like common-sense reasoning (we're going to hear more about that later), logical reasoning, problem solving, the ability to do arithmetic and mathematics, recall (the ability to simply remember things), and theory of mind and rational mental states (the ability to understand other agents, their motivations, their beliefs, and so on, and how they relate to oneself). So what do large language models succeed with? Well, they don't exist in the
physical world at all, so none of the physical stuff: they can't make you an omelette or ride a bicycle or do anything physical. I saw an announcement, literally within the last few hours, about trying to hook up large language models to robotics, but at the moment they don't exist in that space at all. They don't do anything in the physical world, and for intelligent agents existing in the world that we exist in, an awful lot of our intelligence is tied up with understanding and working within the physical world. There are some things that they're quite bad at: we know that they're really quite bad at planning and problem solving. There are some things that we're really not sure about, and there's a huge amount of work going on to study, for example, what kind of logical reasoning these systems are capable of. They're not very good at arithmetic; I'll come back to that in a moment. But what they are good at is the thing in the middle: natural language processing. What these systems are really good at is tasks to do with language. So the takeaway from this slide is that large language models are a long way from the end of the road in AI. It's easy to have a conversation with ChatGPT and to think, wow, they've solved the problem. It's a long, long way from being solved; all those crosses on the slide were capabilities that these systems don't have. So large language models are best at language-based tasks. What kind of thing can they do? They can generate high-quality text. They can answer questions about text: you can give them a piece of text and then ask them questions about it. They can summarize text in quite a smart way; they can extract the key points from a text. You can give them a long, boring email of five pages describing the policy for the coffee room in the department, and they will just tell you: this email says tidy up after yourself in the coffee room. They can extract the key points from text
in that way. You can give them multiple texts and ask them to identify discrepancies between those texts, and the commonalities, the main points of agreement. And, and this is one of the unexpected applications of this technology, they're also very good at brainstorming. Traditionally, brainstorming is done by locking a bunch of people in a room with a jug of coffee and a packet of cigarettes, and they're not allowed out until they come up with at least half a dozen ideas for a pitch for a new product or something like that. Large language models can just do that: you can keep pressing the button and they will brainstorm for you. This is a very unexpected benefit, and I think we're going to see a wide range of applications of it in the next few years. So they're best at language-based tasks, and these are the core capabilities, the safest territory for large language models. If you're going to use them, this is the territory where you're really advised to stay; going outside this territory and asking them to do other things, you're on much thinner ice. So what about natural language generation? When GPT-3 was released, the Guardian got hold of it and got it to write an editorial. The prompt is something like: write an editorial for the Guardian trying to convince people that large language models are not a risk to humanity. And this is what they published; this is what GPT-3 produced, with a bit of after-the-fact editing. I'm not going to read it out, but the main point is that this is perfectly coherent, understandable English; it's the kind of text that would be produced by a reasonable graduate-level writer. As far as I'm concerned, the famous Turing test for AI, which asks whether an AI can interact in a way that is indistinguishable from the way a human being interacts, has quietly been passed in the last few years. On the one hand this is kind of a landmark,
but on the other hand it really tells us that the Turing test is not a tremendously significant test of AI. Nevertheless, as far as I'm concerned, that test is now history: it's just been passed quietly in the last couple of years. They apparently have encyclopedic knowledge. Here is a prompt: give a one-paragraph summary of the life and main achievements of Winston Churchill. And it comes out with an extremely coherent, well-written paragraph which summarizes these things. I checked this, and as far as I could see, the facts that it reports are indeed all correct. However, as we have seen over the last week, as the press have got hold of ChatGPT and started playing with it, and Bing has had a limited release with some of these capabilities built in, they get things wrong a lot. If you are using one of these tools, one of the prime lessons is that you simply can't take what they say at face value, and this problem is particularly exacerbated because they're extremely confident and convincing: very often they come back with very clear statements that appear entirely plausible. So caveat emptor: they get things wrong a lot, and you need a great deal of caution. But nevertheless, that capability is really interesting. What else can they do? Well, they seem to be capable of some common-sense reasoning; this is classic AI territory. These prompts are from a test written by a computer scientist, Vaughan Pratt, in the 1990s, for common-sense reasoning in AI. What Vaughan did was come up with about 200 questions to try to explore common-sense reasoning in AI systems. No AI system at the time could even begin to have these questions applied to it; it simply wasn't possible, but he was thinking: if we could, what would we ask? And it gets an awful lot of them wrong. Common-sense reasoning is everyday reasoning; it's not abstract mathematical or logical reasoning, it's just using everyday concepts. So:
Can Tom be taller than himself? No, Tom cannot be taller than himself. Can a sister be taller than her brother? Yes, a sister can be taller than her brother. Can two siblings each be taller than the other? It gets this one wrong. The other one it gets wrong, which is kind of surprising: which was invented first, cars, ships or planes? And it says cars were invented first, followed by planes and ships. A slightly bizarre answer, but nevertheless the fact that it can do a number of these things is quite exciting. It's good at few-shot learning, which is where we give a small number of training examples and the system has to pick up on what we're doing. Here we're trying to train it to do sentiment analysis: we show it a few examples of statements and the sentiment each statement implies. "I hate it when my phone battery dies" is a negative sentiment. Then at the end we give it a prompt, "this new music video is incredible", and the sentiment there, and it picks up on that. So we're training it to do a task with a very small number of examples, which is really quite interesting. Things they're not good at: they're not very good at mathematics. This is a random example; I typed some random numbers: what is 167544 plus 1637? And it came out with 168181, whereas the answer is actually 169181. There's something weird going on there; there are a few PhDs, I think, that need to be done to figure out what's going on. But it can't do arithmetic. Why should it be able to do arithmetic? It's a language model; that's not what its capability is. But for me, this is a really exciting thing: it can write a program to do arithmetic. We ask it for a C program that does the same sum, and this is what it produces. And just to prove that, although I'm approaching 60, I can actually still compile C
programs and run them, I actually checked this, and it's a perfectly coherent and correct C program, which compiled the first time. Alan Turing would have loved this: he would have squealed with pleasure at a system that couldn't solve a problem but could write a program to solve the problem. I think that's absolutely fascinating, and we will see a huge body of work exploring those kinds of capabilities, where, when a language model can't do something, it either writes a program or calls another system, which then feeds results back to the language model, and so on. That's really exciting, and as I say, I just wish Turing could have been here to see this; it would have appealed to him on so many different levels. Although we're not really talking about it much today, these models are able to answer questions about images, and potentially videos, and where this is going is having text and an image where the two are interrelated, so you'll be able to ask questions relating to both. These are the textbook examples from Google's Imagen system, where the prompt is a piece of text and the image is produced. The classic one is the one on the right: a cute corgi lives in a house made out of sushi. The image did not exist before it was produced. These technologies are very exciting, but they raise a whole bunch of issues. As we've already mentioned, they get things wrong a lot, and the term that's come to describe that is hallucinations. They require crazy amounts of compute power: no British university, not even the Alan Turing Institute, has the resources to train one of these models. They require vast amounts of training data, and there are serious questions about whether we will simply run out of data to build ever larger models. Bias is a huge issue: if you're ingesting the whole of the World Wide Web to train your model, there's an awful lot of bias and toxicity out there, and the model is ingesting that. And although the providers of these models try
to stop you using them to create toxic content, there are a thousand people out there on Twitter busy trying to find ways to circumvent those safeguards, and at the moment it's not that hard to do. So they're gullible: they're easily persuaded to come out with statements which are misleading, and so on. They're also prone to injection attacks; we haven't got time to go into that, but it's a really fascinating topic in its own right. I'm just going to flag this one up: I tried this on Stable Diffusion, which is an image generation model, and my prompt was "Oxford professor", and these four images are the images it produced. Now, there's something not quite right about these images, and it isn't the fact that they all look slightly weird and a bit odd; that's normal for Oxford. There's clearly bias in the training data which is being reflected here, and by now I would imagine Stable Diffusion and the other systems will be trying to pick up on that kind of thing. That was the only joke, by the way; that was the one planned joke in the whole talk. So there are clearly huge issues of bias. Looking to the future: foundation models are important. I've been an AI researcher since 1989, when I started my PhD, and my knee-jerk reaction to people getting excited about AI is to say calm down. But actually, I think myself and an awful lot of other AI professors had to recalibrate when we saw what GPT-3 was capable of. GPT-4 is expected in the next couple of months, and I'm fascinated to see what that will do. We're seeing all the other big tech companies now racing to catch up, a race to get this technology out and to get some advantage. Why has it gone viral? Here's an interesting question: why did ChatGPT suddenly go viral? GPT-3 was available for months beforehand, and that didn't go viral outside the AI community. I think there are two
reasons. Number one: anybody in the world with a web browser can interact with the most sophisticated AI system in the world. You can be anywhere; you don't need anything other than a web browser. We've heard about all sorts of advances in AI in the last decade, in all sorts of different directions, but how have they affected anybody's life? For the vast majority of people, they haven't affected them at all, and now, for the first time, you can speak to state-of-the-art AI. And secondly, it feels like the AI we were promised: interacting in very ordinary language feels like exactly the kind of AI that we imagined AI should be. So those are the reasons why it's gone viral. There's a huge surge of creative applications out there; although these are tough times in the tech sector, my colleagues inside the sector say that the success of ChatGPT has prompted a huge flurry of activity. We're going to see a vast number of creative applications, and for the most part, for all the things that I've described, they're not going to be replacing people in jobs; they're just going to be tools. There will probably be an option in Microsoft Word that says tidy up this text, rewrite this text and make it nice and clean, extract the top three bullet points from this text, and so on. So don't think of this as replacing swathes of jobs; it's going to be another tool. This is a spreadsheet for language and for ideas, in exactly the same way, and just as spreadsheets didn't make mathematicians redundant, I think these tools are not going to make people who work with text redundant. What is the Turing doing on foundation models? Well, we launched a program of work around foundation models last summer, and it's about benchmarking; we're going to hear a lot more about benchmarking today. What that means is trying to understand what these models can reliably do and cannot reliably do. But our aspiration, our
dream, is that the UK should have a sovereign capability in this technology. At some point during or shortly after the Second World War, the UK decided we were going to be a nuclear nation; we decided that we needed a strategic capability in the aerospace sector; and so on. I think it's unthinkable that the UK would turn away from the most important AI technology. And we want to democratize it: we want to bring AI back into universities. We want to make the data available and visible for people to scrutinize; we want to have access to the code; we want to be able to do experiments to understand how this technology is working; we want to try to fix issues like bias; and we want to do that experimentation in the open, without having to sign non-disclosure agreements with big tech companies. Okay, so that's what we're doing on foundation models, and that pretty much concludes my talk. We're now ready for the second speaker, and I'm very pleased to introduce Phil Blunsom. Phil, as it happens, is also a professor at Oxford, but for the bulk of his time he works for one of the startups in this new technology, cohere.ai. So Phil, the floor is yours. [Applause] Let's see if this works for me. Okay, we have slides. Well, it's great to be here this morning, and thank you, Mike, for that great introduction. This talk is mostly going to be a sort of overview of how I see things happening in the large language modeling space at the moment, both in terms of the research but also what's going on commercially, because I work as chief scientist for a company that builds big language models and sells them. It's a very exciting time. I've been working on language models, I think, since about 2000, about 20 years, since I started as a student. Most of that time it's been a pretty obscure field; it's
not the sort of thing you could mention to someone, say "I work on language modeling", and have them have any idea what you're talking about. Suddenly, in the last few months, that's not the case: suddenly everyone seems to have played with ChatGPT and have some awareness that there's something going on here, and almost daily now there are news stories relating to large language models in major newspapers. For me this is incredible. This is just taken from yesterday: an article calling out the sort of coming, you could call it, search war between Google and Bing, leveraging these technologies. So it's a very exciting time. I said this talk is going to be high-level and informal, but I'm going to start out formally, partly because I've read many articles that slightly get language modeling wrong, but also to give you a grounding. So what is language modeling, formally? What is it that we actually do? What is a language model? Is it something that predicts the next word? No, it's not; that's not what a language model is. Formally, a language model is something that assigns a distribution to utterances. An utterance is a sentence; it might be a document, a web page, some finite piece of language; it might even be a continuous signal from audio, but mostly we're concerned with text. The little equation in the top left defines what a language model is, and there's a little bit of technical stuff going on there, but what it says is that a language model assigns a probability. W there is your utterance, so let's call it a sentence, and a probability is a number between zero and one. If we took all possible utterances, that's all possible sentences that could be uttered, and there's an infinite number of such sentences, so this is unbounded, and if we summed up all their probabilities, that's what the little sigma star there is, all possible sentences, it would equal one.
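Written out, the defining equation Phil is describing can be sketched as follows, with Σ* standing for the set of all possible finite utterances over the vocabulary Σ (a reconstruction of the slide's "little equation in the top left", not a copy of it):

```latex
% A language model is a probability distribution P over utterances w:
% each utterance gets a probability between zero and one, and the
% probabilities of all possible utterances sum to one.
0 \le P(w) \le 1 \quad \text{for all } w \in \Sigma^{*},
\qquad
\sum_{w \in \Sigma^{*}} P(w) = 1
```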
And that's what a language model is: it's the P there, a probability distribution that satisfies that equation. That doesn't say anything about predicting next words or generating interesting conversations; formally, that's what language modeling is. What a language model does for us is answer the question: given observed training text (Mike referred to vast amounts of training text, but assume we've got some training text from which we can extract some statistics), how probable is this next utterance, another sentence or document? Traditionally it was always sentences; these days we think in terms of documents or web pages. Where these things started out, and the classic applications, were machine translation and speech recognition. The intuition in translation is that a lot of the translation problem is working out what is a probable utterance. In a very simplistic sense, if you want to translate from French to English, and you have a dictionary in which you can look up the French words, once you've done that you need some way to order them into a coherent, probable utterance in English, and that's what a language model does for you. It could take the two sentences "he likes apples" and "apples likes he" and say that the first one is more probable: that's more likely what the output should be. Similarly for speech recognition, and this is a bit like the effect you might have at a party when you mishear something: a speech recognition system has a noisy mapping, and when it's trying to work out whether someone said "he likes apples" or "he licks apples", probabilistically the first one is much more likely, so it will bias towards that. So that's what language models basically give us. Just as Mike referred to the origins of neural networks
back in the 1940s, it's very similar for language modeling. Alan Turing, the namesake of the ATI, together with I. J. Good (on the right there), realized while working on cracking German codes that knowing the probability of a German message would help them a great deal, because they had to map from the encoded text to the decoded text, and knowing what was a probable German message was very useful. So they developed a bunch of statistical tools which in some sense form the origin of language modeling; they really dealt with the question of how to estimate the probability of a word you've never seen before. Bear with me on the formality for a moment or two: how do we get to this whole next-word-prediction thing? Our utterance is a sequence of words, so think of w1, w2, w3, up to the last word wn, and we want a distribution over those. By the chain rule of probability, any joint probability like that can be rewritten as a product of conditional probabilities: the probability of the first word, multiplied by the probability of the second word given the first, the third given the first and second, and so on until we get to the last word. Those two things are equal, so if you can estimate a conditional probability, you can estimate that joint language-modeling probability. The vast majority of language models do this. Not all of them: there are language models that don't use this decomposition, and they have very interesting properties, but the vast majority do, and Transformers, which have become the dominant model for doing this, definitely do. And this is where next-word prediction comes in, because what we're doing in this conditional is asking what the probability of the next word is, given all the ones seen before, and then doing it
again and again. So next-word prediction is one way of getting to language modeling, but it's not the inherent problem; what we're actually doing underneath is estimating these joint probabilities. Doing it via this decomposition gives us a great property, which we call the autoregressive property of these models: we can not only estimate the probability of an utterance, we can actually produce an utterance, left to right if we've defined things that way, one word at a time, by sampling from these conditional distributions. That's exactly what you see ChatGPT doing: it's sampling from these conditional distributions. So that gives us the ability not just to assign probabilities to text but to generate it. Okay, so in this very simple objective of assigning a probability to an utterance, we can start to see a lot of the power of understanding and working with natural language. Suppose we have a prefix, something like "there she built her", and what the little dot there means is a distribution over what could come after it. There's a lot that could come after that utterance: a house, a career, all sorts of things. But if I add some extra context, "Alice went to the beach. There she built a", suddenly that distribution gets a lot narrower: you're probably thinking sandcastle, maybe a boat, but something to do with beaches. By presenting data like this to models and forcing them to create a distribution over what comes next, we can pressure them to learn some of the things that determine this distribution. This seems very simple from a human point of view, but if you dig into it, to work out what should come next you have to realize there's coreference going on here: Alice is referred to in the second sentence by the "she" (anaphora), we've got "beach" in there, and you need to be able to deal with these coreferences to work out that the second
sentence is referring to building something at a beach, and that Alice is doing the building. So by modeling texts like this you start to see how a lot of the deeper structure of language can actually arise. That doesn't mean it will, because we can always trade memorization for generalization. Humans are extremely good at seeing utterances like this and discovering the underlying structure, what we call the syntax and the semantics of this data. We believe humans do actually use a similar sort of conditional distribution (there's lots of good neurolinguistic evidence for that), but they do it in a much more sophisticated way than our language models. We can get at this distribution by discovering that sort of human-like structure in language, or by just seeing a lot of data and memorizing a lot of stuff, and what we're doing in big language models these days is interpolating between those two extremes: these models aren't just memorizing, but they are also not generalizing the way humans do. To continue this theme a little more, things like translation can simply be structured as language modeling: think of the French sentence followed by its English translation and assign a probability to that; try different English translations on the end and ask which is most probable. Similarly for questions: what's the most probable word to come after a question? Probably the answer, and that's why you see this so-called emergent question-answering ability from these models, simply because the answer often comes after a question. And similarly for conversation, if you ask what utterance could come next. As Mike has mentioned, underpinning the models we mostly deal with today is something called the Transformer, from around 2017, but this is the culmination of a long history of development in language models. It started out with Turing and I. J. Good, and it really took
off in the 80s with speech recognition research: researchers, particularly a group at IBM, did great work building language models for speech recognition. That morphed into machine translation in the 2000s, when Google started to offer machine translation as a service, the Google Translate service, which again drove a lot of interest in language modeling for translation. Around the middle of the 2010s, in 2013 and 2014, people started applying deep learning, big neural networks, to the machine translation problem; a group at Montreal came up with the idea of attention, and that culminated in 2017 at Google with the "Attention Is All You Need" paper and the Transformer, which was built to do machine translation, not language modeling in the sense I've described, and was only later repurposed. So there's a long history there, but the Transformer turned out to hit a sweet spot of scalability and modeling power that has enabled this explosion in applications since. As I said, this work was done at Google; an interesting aside is that of all those authors, I think possibly only one is still at Google. Most have left, and this is true of a lot of such research recently. The middle author, Aidan Gomez, was actually one of the founders of Cohere. Okay, let's jump forward and think about where we are today. We've heard about GPT and ChatGPT; how does this all fit together? If I were talking to you last year, I'd probably just be talking about raw language models, base language models, what Mike talked about with GPT-3 and so on, and that's what I'm going to call a base language model, down at the bottom there. But things have changed a lot in the last year, even the last few months, and we're now really seeing a hierarchy, a technology stack, of language modeling emerge, all with different properties, different demands and different costs along the way. So you have your base language models; they're our language
models trained on lots of data from the web. Then, at Cohere, what we call a command model, or an instruction-following model, is a model that's trained on top of the base model with supervision, that is, explicit examples of inputs and outputs, to follow instructions. I'll go into this a bit more, but it gives us a much more usable interface to these models. Once you have that, it's quite straightforward to turn it into a chat model or a dialogue model, and that's what ChatGPT is, at that dialogue level there. But what we're also seeing emerge on top of that is what I call a search and retrieval level, and this is what you're seeing with things like Bing, with startups like You.com, and with what Google is trying to do with Bard: integrating search into these models. So let's start with base models. These are what Mike referred to as GPT-3-style models, models with tens of billions or hundreds of billions of parameters. Just to expand on Mike's numbers, I'd say if you had a 170-billion-parameter model, you'd probably want to train it on more like 2 trillion tokens if you want a decent model; for tens of billions of parameters, you're looking at about a trillion tokens of data from the web. That's scraped web pages (the classic source is Common Crawl, which some of you might be familiar with), but also specialist corpora: Wikipedia, things from Stack Exchange, all these different sub-corpora that help specialize these models. Training these models, coming back to that language-modeling equation, technically means taking all of this data and trying to find parameters of these Transformers that minimize what we call a negative log likelihood, which is the same as saying we maximize the probability. It's expensive to train these models. For tens of billions of parameters (I'm being slightly vague here), you'd probably want about half an exaflop of computation. So if
you think a Google TPU v4 pod is about one exaflop, you need about half of one of those; that's about 4,096 TPU cores, and it'll take you about one to two months of training. So the characteristic of this stage is that it's very expensive and very data-inefficient; on the other hand, the data is cheap, since you can scrape it from the web. My graph on the right is to illustrate that this process is not entirely simple, and just having access to the computation and the money doesn't necessarily get you there. The orange line is a training run: on the y-axis is the negative log likelihood, which is what we're trying to minimize, and on the x-axis is time, probably about a month of training across that axis. That orange line is what we want; that's a nice training run. All those other lines are not nice training runs: you see they start out well and then diverge, and every time you see a discontinuity, that's where an engineer is fiddling with the model, trying to rescue it, because what you're looking at here is, let's say, more than a million dollars' worth of computation, and when it goes wrong you lose all of that. Knowing how to get this to behave like the orange curve and not the other curves is hugely valuable, but as you can see from all these discontinuities, when it doesn't go right there are quite a lot of desperate attempts to fix it; it's a non-trivial process to get these models to run. As an aside, why might this happen? Changes in things we can't control: in this case it was a change to an underlying library. A run that previously worked stopped working because someone had changed an underlying library; the engineers hadn't realized that would happen, saw it after about a month of training, went back, fixed it, and got it to work. So, base
language models: this is our interface for the Cohere language model; on top is the input, on the bottom is the output you get. If we input a question like "who is the prime minister of the UK", you don't actually get the answer; what you get is a list of other, similar-looking questions. This is classic behavior for a raw pre-trained language model: it doesn't inherently know that you were looking for an answer; it might just think you're looking for a list of questions that look similar. You can address this with things like prompt tuning and other techniques if you're a sophisticated user, but the key thing about these base models is that they're just not very easy to work with; from an average consumer's point of view, they're not a great interface. So jump forward to this idea that we can add an extra layer of training, what we call supervised training (or it could be something like reinforcement learning), using a much smaller amount of data: we're going from trillions of tokens to maybe tens of thousands of documents, but much more expensive data, data that humans annotate and curate, so we have to pay for it. The curation looks like the interface on the right: on the top is the input, the prompt, and on the bottom two alternative completions, left and right, and an annotator rates the two relative to each other; in this case you would rate the left one better than the right one. This is a tricky process, and it's not necessarily easy to get good annotations even in this simple example. Those of you who are old enough to be familiar with Bananaman and Bill Oddie might realize that the left one is actually wrong: he wasn't starring as Bananaman, he was actually the voice of Crow, one of the other characters. So there are some subtle issues about how exactly you rate this, but the left
one is closer to the prompt in terms of being roughly 15 words. So in supervised learning we take examples like this, ratings from humans, a much smaller amount of data but paid for, and use them to train models. That gives us something much more usable, much easier for humans to interact with, because you can just type an instruction: "here's an article, summarize it for me", or "write me a poem in this style", or "answer this question". The training is also much faster: again for tens of billions of parameters, training such a model takes about a day, so rather than months we go down to a day. For instance, the model we have we retrain every week, whereas our big underlying model we retrain every couple of months. So now, going back to our interface, if I type in "who is the prime minister of the UK", I get a nice answer, "The prime minister of the UK is Rishi Sunak", exactly what you're expecting. In some sense, I'd say this was the product-market-fit moment for large language models: when we switched from these somewhat difficult-to-use base models to these supervised, trained models, suddenly they were much more usable by an average person. The other thing it does is make them much easier to build on top of. As we've heard a number of times, something like ChatGPT, which looks very sophisticated, is actually a very thin wrapper around an underlying instruction or command model; OpenAI had already built that, and it was straightforward for them to turn it into a conversational model, to very good effect. As Mike said, this had amazing viral growth: again, an article here from the Guardian refers to 100 million users just two months after launch. These conversational models are a very small step up from the supervised models I just mentioned: all you're doing is going from the supervised data of instruction prompt in, output out, to
doing it in a conversational, multi-turn style: you're going from one turn to multiple turns. Again, the amount of data involved is small, thousands to tens of thousands of conversations that you collect from people. Here's an example of a little conversation I had with a model, which illustrates both the really interesting semantics you can have going on in a conversation like this, and the problems of having models trained on a static dataset and never updated. Here I'm asking "who is the prime minister", intentionally ambiguous. The model decides to interpret that as the Canadian prime minister and gives me a nice answer, but I say no, actually I'm interested in the UK, and again it gives a nice answer. But when I ask who preceded Rishi Sunak, the model says Boris Johnson, forgetting about Liz Truss, which I guess is easy to do and understandable. The interesting thing is that when I start asking about Liz, the model makes this interesting jump and assumes I'm talking about the Queen, whom I have to inform the model has unfortunately passed away. The model then misinterprets me as talking about the first Queen Elizabeth and says no, actually, sorry, to be correct, it was the second one, and I say, well, actually that one passed away as well. This is both amusing and illustrative: the models I've talked about up to this point only have access to the static data they were trained on, and that quickly gets out of date, because if you're trying to track UK prime ministers, they change every few months, monarchs die, and so on. So the next level, the last one in my hierarchy, is the search and retrieval level, and this is where you're seeing a lot of action at the moment with Google, Bing, and others. There's also a great startup called You.com that's really been ahead of others here, and they've built a lovely conversational search interface out of
a big language model. Now, just like with ChatGPT, you can have a conversation with such a model, but it doesn't just answer from its weights. What it does is launch a search query; this is an example of using an action or a program, a bit like Mike mentioned, essentially a simple version of running a program. It runs a search, gets back results, and uses those results to inform what it generates. Here it ran a search, and the nice thing is that you can now present the results of that search, which people are very familiar with. On the right is what was returned from the search the model ran; you don't see the query itself, which the model has had to work out from the conversation. It ran a search about UK prime ministers and then generated a nice paragraph about Rishi Sunak. The other interesting thing, if you can make it out (it's a bit fuzzy), is that it does mention Liz Truss, and there's a citation: the model can actually cite its sources and attribute where the information it's producing is coming from. You can see that citation at the bottom, which I think is a news article about the rapid change of UK prime ministers. This is where these models are going. You see a lot of discussion about models not being grounded; this is where models are going in terms of grounding: being able to cite sources, but also to present those sources to you, so you can ask whether it makes sense, whether it's citing the right article. Sometimes that goes awry, and you'll get the answer "Rishi Sunak is the current prime minister" with a citation to the Wikipedia page of Winston Churchill. So it's not perfect, but you can see that this is where we're going in terms of grounding these models in current facts.
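The retrieve-then-generate loop just described can be sketched schematically. Everything below is hypothetical scaffolding (the naive keyword-overlap `retrieve` ranking, the toy corpus, the example URLs), not any vendor's actual API; in a real system the retrieved text would be placed into the language model's prompt rather than templated, and the search would hit a live index.

```python
# Minimal sketch of a search-and-retrieval layer on top of a language
# model: run a search, then use the retrieved document to ground the
# answer and attach a citation.  All names and data here are invented.
def retrieve(query, corpus):
    """Rank documents by naive keyword overlap with the query."""
    q = set(query.lower().split())
    return max(corpus, key=lambda doc: len(q & set(doc["text"].lower().split())))

def answer_with_citation(query, corpus):
    doc = retrieve(query, corpus)            # 1. the model's search action
    # 2. a real system would feed doc["text"] into the LM's prompt;
    #    here we simply template the grounded answer with its source
    return f"{doc['text']} [source: {doc['url']}]"

corpus = [
    {"text": "Rishi Sunak is the current UK prime minister",
     "url": "example.com/uk-pm"},
    {"text": "Justin Trudeau is prime minister of Canada",
     "url": "example.com/ca-pm"},
]

print(answer_with_citation("who is the UK prime minister", corpus))
```

The design point is the one made in the talk: the model's answer is attributable to a retrieved source the user can inspect, rather than coming only from static training data.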
The other thing, of course, is that search is dynamic: you don't need to update the model for it to get the most up-to-date search results, so you don't have to keep retraining it. In the first part of this talk, and in some of the things Mike mentioned, there's been a lot of focus, understandably, on the costs of training big language models, and that's because most of the people doing this have been researchers, even within big tech companies. But things are changing, because these models are being deployed, and when you deploy big machine learning models you suddenly realize that training is not actually the big cost; deploying them is. If these things actually go out into the wild and people use them, someone has to pay for the computation of all those queries. The tweets on the right are from Sam Altman, the CEO of OpenAI, which built ChatGPT; he's referring to the "eye-watering" cost of serving that model, which of course is free, so that's a cost borne entirely by OpenAI. The other thing about ChatGPT, coming back to that very first slide about the search wars between Bing and Google: there's no advertising, and it's not at all clear how to work traditional search advertising into these models. That's an interesting challenge for companies that rely on advertising to pay for their deployment. There's an estimate from an academic, I think from Maryland, of ChatGPT costing maybe about a hundred thousand dollars a day; I've seen estimates as high as six or seven hundred thousand, and I'd suggest it's probably somewhere in that range. Either way, it's a very expensive thing to be giving away for free. So these models are expensive to train, but they're also very expensive to deploy, and if you're going to offer something like a search engine, with very high-throughput traffic, the cost of providing each one of those search results is going to go up, and that means your margins
are going to come down, and that is going to lead to very interesting happenings in that world. Okay, a final comment in this space: how do we actually evaluate these models? In many ways this is one of the biggest open questions and something that needs a lot more research and thinking, and I think Percy will talk about this at length this afternoon, but it's hard: we don't have good evaluations. Within a company we can evaluate against customer data, but that's not something that can be shared across companies or externally. So if you're a customer, how do you compare different models? How do you decide which one's best? Percy will, I suspect, talk about the HELM evaluation that the Stanford researchers came up with, on the right there. This evaluates just one slice of what we use language models for: it evaluates few-shot prompting, which I haven't talked about at all and which is not what ChatGPT does, so it doesn't necessarily show you the way models are used, but it's pretty much the best, or only, open comparison between commercial models that exists at the moment. One obvious takeaway, ignoring the numbers, is that the supervised models are always much better than the base models for these tasks. Okay, I think I have about a minute left, so I'm going to wrap up. To summarize: language modeling has a long history, but it's evolving fast. We're moving on from a time when it was all about who has the biggest models; one thing I haven't said is that there's really a trend away from increasing model size, actually towards decreasing it, a lot of that driven by what I mentioned about the cost of deployment. We're moving to an era of deployed models, and that's changing the incentives. We're getting a much richer hierarchy and a branching out of different models: models that work with code, models that work with images, search, specialization, all of these things. We're going really from the, I guess, development
phase into the commercial, product-market-fit phase. It's a very exciting time, so thank you. I don't think there's actually any time for questions, so I'll pass on. [Applause] You're all very welcome to come and drink some cheap wine with us at the Turing's expense. I'm very pleased to introduce our next speaker, Professor Maria Liakata, Professor in Natural Language Processing from Queen Mary University of London. Maria, over to you. Great, I'm very excited to be here today, and like many of you I've been thinking about where our research priorities in natural language processing should be placed, given the current developments in pre-trained language models. What should we be focusing on as researchers in NLP? I'll talk about some of the challenges faced by pre-trained language models, including linguistic challenges, and about possible mitigation strategies in terms of knowledge enhancement and evaluation. I'll give a very quick overview of pre-trained language models, because Phil and Mike have already said quite a few things about them, and then focus more on the challenges faced by PLMs, even really large ones like GPT-3 and ChatGPT. I'll then go on to some work we've done on knowledge enhancement (us, along with many other people working in this space) and on the evaluation of language models, and conclude with remarks about implications for the future. I'll give you this visual: people have already said a few things about the history of language models, and I think the real thing to take note of is how things have sped up since the advent of deep learning around 2013, and especially the homogenization going on from 2018 onwards. These PLMs are dominating this field, and whether that is a good or a bad thing we can perhaps discuss at the panel later. PLMs, as mentioned previously,
are based on the Transformer network architecture, and this is the well-known diagram of that architecture from the "Attention Is All You Need" paper already mentioned by Phil. Like previous sequence-to-sequence models, they typically have an encoder and a decoder: the encoder creates a hidden representation for a sequence of data, and the decoder uses that hidden representation to create a new output sequence, relying very heavily on the notions of positional encodings and self-attention, especially multi-head attention. The positional encodings, mentioned earlier by Mike, are combined with the input embeddings to keep track of the positions of items in the sequence. The building block of all these PLM Transformers is multi-head attention: in the self-attention mechanism, you have different positions within a sequence, and the mechanism attends to different positions in the sequence to create a representation of it; rather than having this happen in one big go, you have parallel attention layers, which is the multi-head attention, and that makes it much more efficient. This is what has been the really big breakthrough that allowed these really large Transformer models. So we've seen that PLMs are based on Transformers, and some very popular training strategies involve masking and prompting. With masking, as in the example here, you mask words within your input and essentially force the model to learn to predict those particular words. You can have different kinds of masking strategies: it can be random, where some percentage of the data is masked, or it can be more knowledge-driven, which is something I'll talk about a bit later. The model then learns the parameters, the weights, for these kinds of representations, and
then, given a new task, say a sentiment classification task: if the model has seen something like "it's a [terrible] movie in every regard, and [painful] to watch" with those words masked out, then when it sees "no reason to watch" it has some kind of representation of this being bad or terrible, and you can use the language model's representation of that sentence to label it as positive or negative. With prompting, you insert a piece of text into the input examples so that you can formulate a classification task as a masking task: you force the model to tell you. So if you have "no reason to want to watch. It was terrible", you bridge the gap between the knowledge the model has learned (things like: movies can be terrible, and therefore you don't want to watch them) and the fact that you want to classify this as a negative instance. There are many pre-trained language models, and variants keep being created. One example is BERT, which was introduced in 2018; academics still very much use BERT and its variants. It's quite small relative to current models, at 345 million parameters, although still not straightforward for everyone to train, and it's embeddable in new applications; the focus is on the encoder, and it works well for classification tasks. Then you have giants like GPT-3, with 175 billion parameters, offered as an API, with a focus more on the decoder and on generation; and ChatGPT, a smaller model more focused on question answering and interaction. But both GPT-3 and ChatGPT mostly have knowledge only up to 2021. So what are the challenges faced by these PLMs? As Mike quite extensively covered, they can be very impressive these days; they can be very good
at many things, such as translating text between several languages; they're very good at paraphrasing, which I think is their strongest point; at generating short, coherent, fluent summaries from multiple documents; and at creative writing. They do all this by capturing higher-order co-occurrences in the text. So here, as my joke in the talk, I asked ChatGPT to create a short ode to moussaka (I'm hoping you're fans of moussaka like me), and it did an amazing job. What did it actually do? It learned what moussaka is made of, and it produced words that rhyme with the ingredients; it learned that it's a traditional Greek dish and makes references to Greek culture; and, very importantly, it has also learned the structure of an ode. So the way it combines language and linguistic structure is indeed very impressive. I'm going to continue with some examples from ChatGPT, because it's linguistically so impressive and has generated such a stir, and I'm sure all of you are using it, playing with it, and having lots of fun with it in your spare time. But even though it's so good at language, there are actually quite a lot of linguistic problems it's not so good at. In this example, my student asked it about papers in which a particular model, a deep hierarchical variational autoencoder, has been applied to text. That paper actually came out after 2021, so ChatGPT would not know much about it, but it replied by producing three papers: it made up the titles, nicely paraphrasing the original question, and it also decided that the authors should be Asian. So basically there's provenance, factuality, biases, hallucination, all of that together, and I guess what Phil mentioned about grounding models, about having to produce evidence, is really important in this respect. But even with knowledge it has seen: this is now
an example from Romanian history which exists in Wikipedia, so they must have had access to it when training ChatGPT, but it's a very small, very lightly covered historical topic. The question is about when Basarab III ruled, and the model comes up with two sentences, and pretty much everything it says about Basarab III is incorrect. This means the information in the source material was evidently not enough for it to be captured in the model's parameters. With Greek history it seemed to do a little better; obviously that was better covered. Now I'm going to move to something that I think is a lot more serious and can't necessarily be fixed with fine-tuning: inference and complex semantic similarity. In this case I asked ChatGPT about beaches near a small town in Messenia, in the Peloponnese. I said "are there any nice beaches near Petalidi?", and it said yes, there are several nice beaches near Petalidi, some of the most popular ones are such-and-such, and all of these beaches are located within a 30-minute drive from Petalidi, and so on. I know this is factually correct, because I have actually been to at least two of these beaches. Then I pose the task differently. I give it two sentences: one posing my original question, "are there any good beaches near Petalidi?", and a second sentence taking one of the answers it gave me before, this Grani beach: "Grani is very good". I ask: does sentence two provide an answer to sentence one? Interestingly, it says that sentence two does not provide a direct answer to the question asked in sentence one, and the reason it gives is that the beach is not located near Petalidi (it has just told me it's within 30 minutes of Petalidi) but in Messenia; well, both of them are located in Messenia. And then it also tells me it is not an
So I wouldn't say the Turing test is completely solved here; I think it depends on how much we actually probe and see what these models can do. Clearly the model can't infer from the knowledge it already has: it can't get the similarity between the two sentences, between Chrani being an answer it gave me before and it being an answer now, and it doesn't understand that "near" and "a 30-minute drive from" are related. I think this is quite serious. Now, even with summarization, where it can indeed be very impressive, it's not so good at summarizing long documents, and I think Mirella will talk later about long-document summarization. In the example here I'm not able to show you the actual text, because it was sensitive data, and really we shouldn't be working with sensitive data and ChatGPT. It's a timeline of posts someone has made on a mental health forum about how they're feeling, a long sequence of about 30 posts or more. Rather than giving a summary, what the model did was paraphrase each sentence, which is interesting; however, it is not a summary. It is also missing important events and not respecting the order of events, and if you're interested in summarizing such a long timeline of someone's mental health, this is a problem. Another issue when working with sensitive data is that we can't work with PLMs we can't control ourselves, so there are a lot of privacy issues around that. As I mentioned, there are issues with long sequences: generated summaries don't capture the most important events or preserve their temporal order. Temporal robustness is also a problem.
As Phil and Mike mentioned, PLMs are expensive to train, so it's difficult to update them with the latest information. The hallucination of information is an issue, particularly with sensitive data or medical information. Another interesting one, which we tried: if you've told ChatGPT to work in a particular style, like I told it to generate an ode, it's not so good at preserving things like disfluencies. So if you want to generate synthetic data from, say, therapy sessions, or from people with cognitive decline, it's not going to preserve this kind of thing. So there are some quite interesting issues that matter for downstream applications. I'll now talk a little about how we can resolve some of these issues. Obviously there's a lot of work in this direction, particularly on knowledge enhancement, but Phil also mentioned evaluation, and there are different kinds of evaluation that can be done. Knowledge enhancement is about incorporating into PLMs knowledge they may not have; as we saw, for example, the model couldn't see the semantic similarity between two sentences that it should have seen. The incorporation can be implicit, via knowledge-guided masking: rather than masking a random proportion of the text during training, as I described before, you guide the model to mask particular entities, for example from a medical database. You can also have knowledge-related pre-training, where you combine an original pre-training task, for example next-sentence prediction, with another knowledge-related task. And then there's explicit incorporation, which involves either modifying the model input, for example adding rare-word definitions at the end of the input, or storing knowledge for retrieval later, or knowledge fusion.
I will talk more about knowledge fusion. Knowledge fusion can happen on top of the PLM, as a kind of last layer, or within the Transformer, or between Transformer layers. The work we've done in my group has been on top of the PLM and within the Transformer. We evaluate these enhanced models on a task; in particular we look at semantic similarity detection, a framework consisting of a collection of binary text-pair classification tasks which aim to recognize the presence of a predefined semantic relationship between two given texts. This includes things like semantic equivalence, entailment, and question-answer relations. The work I'm talking about here is actually from 2020, so I expect ChatGPT has been trained on all the corpora I mention, but the issues still remain. For paraphrase detection, we consider similarity between short texts ranging from a sentence to a short paragraph, and they can come from sources like the Quora paraphrase detection corpus and SemEval, where there are different kinds of tasks such as answer ranking. For example, given the sentence pair "Which is the best way to learn coding?" and "How do you learn to program?", you want your model to be able to tell you that this is a paraphrase. And here is an example similar to the one I showed you earlier about the beaches near Petalidi: "Are there good beaches in the northern part of Qatar?" and "Fuwairit is very clean." You want these to be recognized as being in a question-answer relation, the second sentence answering the first. In this particular work we decided to enhance BERT, the PLM we were working with, with topic models.
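The pair-classification framing described above can be sketched as data: each instance is a text pair, a predefined relation type, and a binary label. The field names, the relation labels, and the negative example are illustrative assumptions, not the actual corpus format; the two positive pairs are the ones quoted in the talk.

```python
# Each instance: a text pair plus a binary label for a predefined relation.
# Field names and the negative example are hypothetical, for illustration only.
examples = [
    {"s1": "Which is the best way to learn coding?",
     "s2": "How do you learn to program?",
     "relation": "paraphrase", "label": 1},
    {"s1": "Are there good beaches in the northern part of Qatar?",
     "s2": "Fuwairit is very clean.",
     "relation": "question_answer", "label": 1},
    {"s1": "Are there good beaches in the northern part of Qatar?",
     "s2": "The capital of Qatar is Doha.",
     "relation": "question_answer", "label": 0},
]

def relation_counts(data):
    """Tally how many instances each relation type has."""
    counts = {}
    for ex in data:
        counts[ex["relation"]] = counts.get(ex["relation"], 0) + 1
    return counts

print(relation_counts(examples))  # → {'paraphrase': 1, 'question_answer': 2}
```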
Why did we do that? Topic modeling has been useful before for semantic similarity detection, and it has also been useful for domain adaptation. For those of you who don't know, topic models learn topics as probability distributions over a fixed vocabulary. So how do we do this? This was the case of doing the enhancement on top of the PLM, with quite a simple but effective architecture. We encode the two sentences using BERT, and we also combine the sentence representations with topic models for each sentence. Here we used LDA and GSDMM, but any kind of topic model, even more recent neural topic models, could be combined here. We get both document-level and word-level topics, and we basically concatenate them with the pair representation from BERT; this then goes through a softmax layer and produces an answer as to whether the two sentences are in the given semantic similarity relation. We evaluated this on different corpora and against the previous state of the art for those corpora. The first thing to observe is that the topic baseline on its own performs quite well compared to other, more sophisticated models. We then compared BERT without the enhancement against BERT with the enhancement on top. In the plot, the yellow part is the performance of the topics alone, and you can see that BERT performs better than the topics alone, but BERT with the topics performs better still. Then we wanted to understand where the main gains from this enhancement came from.
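The on-top-of-the-PLM architecture just described can be sketched in NumPy: concatenate a BERT pair representation with per-sentence topic distributions, then classify with a linear layer and softmax. All dimensions and the random stand-in vectors are assumptions for illustration; in the real system the pair vector comes from BERT and the topic vectors from LDA or GSDMM, and the weights are trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Stand-ins for the real components (dimensions are illustrative).
bert_pair = rng.normal(size=768)        # BERT representation of the pair
topics_s1 = rng.dirichlet(np.ones(50))  # topic distribution, sentence 1
topics_s2 = rng.dirichlet(np.ones(50))  # topic distribution, sentence 2

# Concatenate BERT features with topic features, then classify.
features = np.concatenate([bert_pair, topics_s1, topics_s2])  # 768 + 50 + 50
W = rng.normal(scale=0.01, size=(2, features.size))           # untrained weights
b = np.zeros(2)
probs = softmax(W @ features + b)   # P(not related), P(related)
print(probs)
```

The point of the design is that the topic features carry domain vocabulary that the contextual pair representation alone may miss.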
We took examples from the output that had been labeled by the model, and we manually identified examples containing named entities, domain-specific words, and non-standard spelling. We can see that most of the gains come from the model now capturing domain-specific words better, so our hypothesis that the topic-modeling enhancement would help with more niche domains was confirmed. There's another question: is it just a matter of fine-tuning for longer? We tried this as well, and we found that the performance of BERT did improve with longer fine-tuning, but it never got better than the BERT with the enhancement. In subsequent work to improve model performance on semantic similarity, we decided to try knowledge enhancement with linguistic information known to have helped with semantic similarity detection. More specifically, we wanted to combine linguistically enriched embeddings, in particular dependency-based and counter-fitted embeddings, with BERT, and this time we decided to modify BERT internally, within the Transformer, rather than on top, hypothesizing that this would give us more flexibility. Indeed, there has been work that uses the multi-head attention of the Transformer to do the injection at that point. We experimented with two different versions. The first is attention injection, where you take the representation from the previous layer as the query, and the keys and values are the new embeddings coming from your dependency-based or counter-fitted embeddings; this is the attention-injection mechanism.
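Both injection mechanisms can be sketched in NumPy. This is an illustrative toy, not the paper's implementation: dimensions, the mean-pooling of the external embeddings, and the exact gating form are assumptions. The essential ideas are that in attention injection the queries come from the previous layer while keys and values come from the external embeddings, and that in gated injection the external embeddings are first aligned to the BERT space by a feed-forward map and then added through a sigmoid gate.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16          # hidden size (illustrative)
n, m = 5, 7     # number of BERT tokens / external knowledge embeddings

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

H = rng.normal(size=(n, d))   # hidden states from the previous BERT layer
E = rng.normal(size=(m, d))   # external embeddings (dependency / counter-fitted)

# Attention injection: Q from the previous layer, K and V from the external
# embeddings, so each token attends over the knowledge vectors.
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
scores = (H @ Wq) @ (E @ Wk).T / np.sqrt(d)      # (n, m)
attn_out = softmax(scores, axis=-1) @ (E @ Wv)   # (n, d)

# Gated injection: align pooled knowledge to the BERT space with a
# feed-forward map, then add it through a learned sigmoid gate.
Wf = rng.normal(scale=0.1, size=(d, d))
aligned = np.tanh(E.mean(axis=0) @ Wf)           # (d,)
Wg = rng.normal(scale=0.1, size=(2 * d, d))
gate_in = np.concatenate([H, np.tile(aligned, (n, 1))], axis=1)  # (n, 2d)
gate = 1 / (1 + np.exp(-(gate_in @ Wg)))         # (n, d), values in (0, 1)
gated_out = H + gate * aligned                   # (n, d)

print(attn_out.shape, gated_out.shape)
```

The parameter-count difference is visible even in the toy: the attention route needs three d-by-d projections, while the gated route needs only the alignment and gate maps, which is part of why it was the greener option.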
The second version is a gated injection mechanism, which involves aligning the external embeddings with the BERT space (the feed-forward layer you see in the figure) and then adding them, through gating, to the output of the current multi-head attention layer. We saw that this indeed helped improve performance on the same corpora we looked at before. What was interesting was that the new gated mechanism performed as well as or better than the attention-based mechanism with far fewer parameters, so essentially it was a much greener option. Interestingly, the on-top-of-the-model enhancement performed better in some cases (I don't have those numbers here, but they are in the paper), but it uses a lot more parameters. So the gated injection was at least as good as the attention injection. Another task that we know is challenging for PLMs is cross-document, cross-domain coreference resolution. What does that mean? It means identifying mentions of the same concepts across distinct documents from different domains. For example, here you have a news article linked to a scientific article: they both talk about a new vaccine for coronavirus, and the challenge is to link the entities across the two documents. You have to find that the mentions of Acme refer to the same thing, that the mRNA-based vaccination and the new vaccine are the same thing, and that SARS-CoV-2 and the coronavirus are the same thing. Why is this so difficult? The vocabulary is very different across domains, even when the documents talk about the same entities, because they address very different audiences. Here's another example, where you have to be able to map labeled flow patterns to data about flow rates, and a support vector machine to "the algorithm", and so on.
So how did we do this evaluation and propose this task? We created a corpus for this kind of cross-document, cross-domain coreference, and indeed Phil mentioned the need for creating different corpora and how such resources help with annotation and so on. I guess the main message to take from this is where we saw the weaknesses lying within the models. We had various baselines, and you can read about them in the paper if you're interested, but the models seem to struggle with cases that are highly context-dependent and where subset relations are not computable: a bear is a carnivore, but not all carnivores are bears. These kinds of evaluations give us an idea of where we want to go in terms of enhancing models. I'll finish now, as I think I'm running out of time. Researchers are currently working to mitigate the limitations of PLMs, both in industry and academia. Apart from the work on explicitly injecting knowledge into PLMs, there is work on combining PLMs with logical structure for inference, and, importantly, work on explaining the models so that they can provide evidence for their decisions or their responses to questions. There is also ongoing work on evaluating these models, on finding out what they know and how they perform on a variety of tasks that require inference or temporal awareness. We are at the moment working on such tasks, injecting temporality and also evaluating on longitudinal tasks. But for this research to be possible, academic researchers will need to be able to work with such large PLMs, closely scrutinize them, have control over what data they are trained on, and be able to intervene in their architecture.
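The subset-relation failure mentioned above (a bear is a carnivore, but not all carnivores are bears) is easy to make concrete: the relation is directional, so entailment holds from the specific term to the general one but not the reverse. A toy illustration, with a made-up two-step hierarchy:

```python
# A toy is-a hierarchy (hypothetical, for illustration). The point is that
# the relation is directional: the specific entails the general, not vice versa.
hypernym = {"bear": "carnivore", "carnivore": "animal"}

def is_a(x, y):
    """True if x is a (transitive) hyponym of y in the toy hierarchy."""
    while x in hypernym:
        x = hypernym[x]
        if x == y:
            return True
    return False

assert is_a("bear", "carnivore")      # a bear is a carnivore
assert is_a("bear", "animal")         # transitively, a bear is an animal
assert not is_a("carnivore", "bear")  # but not all carnivores are bears
```

A model that treats the two directions symmetrically, as the models in the evaluation tended to, will get exactly these cases wrong.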
These models are becoming less and less accessible to academics, with hundreds of billions of parameters, and a lot of us are still working with BERT and RoBERTa and models proposed before 2020. There have to be solutions for academics to continue doing really good work with these models, and for that work to be influential in enhancing model capabilities. Thank you.
Info
Channel: The Alan Turing Institute
Views: 7,031
Keywords: data science, artificial intelligence, big data, machine learning, data ethics, computer science, turing, the alan turing institute
Id: n9OkJBluOa4
Length: 89min 28sec (5368 seconds)
Published: Fri Mar 31 2023