A Neural Network Model That Can Reason - Prof. Christopher Manning

Video Statistics and Information

Reddit Comments

Damn I look good on camera.

👍 10 · u/egrefen · May 31 2018

I'm surprised this hasn't received more attention: an interesting and novel architecture achieving a markedly superior state of the art on an interesting dataset. The leapfrog in data efficiency just drives the point home.

The paper also links to a TF implementation on GitHub; I would be super curious to see how it performs on different datasets, especially the Stanford Question Answering Dataset.

👍 2 · u/Allong12 · Jun 07 2018

Paper here: https://arxiv.org/abs/1803.03067

Interesting work and impressive results! I wonder how optimal this architecture is, or even how informative the ablation analysis really is. It would be nice to systematically explore the space in the way that was done for LSTMs here: http://proceedings.mlr.press/v37/jozefowicz15.pdf

👍 3 · u/mikeross0 · May 31 2018
Captions
Hello, hi. Okay, we're going to start the final session, so I'm pleased to give a short introduction to someone who doesn't really need one: Chris Manning. Chris is a professor of linguistics and computer science at Stanford University, where he completed his PhD, and he's known for his broad contributions across statistical NLP. He wrote the seminal textbook on statistical NLP with Hinrich Schütze, as well as one of the most-used textbooks on information retrieval. Along with his group he has contributed many papers to top-tier conferences, as well as tools that we know and love if you have ever had to parse data, like CoreNLP. Chris, with his students, has also been one of the driving influences behind the revival of interest in deep learning and neural network methods in the NLP community, through papers and tutorials; it was actually through his, Richard's, and Yoshua's deep learning tutorial a few years ago that I got into this field, so he has had an impact on my life. Today we're delighted to have him give the final invited talk of the conference, where he's going to tell us how neural networks can move from just predicting to also reasoning. Please join me in welcoming Chris to the stage.

[Applause]

Okay, thank you so much, Ed. I was a little bit afraid that by Thursday afternoon there wouldn't be too many people here, but it seems like this conference is proving to be a great success all around. Today I'm actually not going to talk so much about natural language processing; instead, what I want to do is discuss how we might design neural networks to do higher-level cognition, that is, reasoning. This talk mainly covers work done together with my student Drew Hudson.

Now, neural network machine learning systems excel on a wide variety of tasks, such as speech recognition and various computer vision tasks, and they've even been pushed with surprising success into areas like robotics and natural language processing, such as machine translation. Nevertheless, by and large these successes are in places where we have stimulus-response tasks: a stimulus comes in and we process it to recognize something, or to translate it, or whatever. These are tasks that are intuitive or instinctive for humans; they aren't tasks that require careful, deliberate thinking. The question, then, is: what if we'd like artificial intelligence systems that can be used for reasoning? What about tasks that require more deliberate, conscious, and explicit thought? Arguably, in the early decades of AI these were the kinds of tasks people thought about most. Those might be reading comprehension tasks, not like some of our current question-answering tasks, but instead like a middle-school reading comprehension task where you actually have to understand the interactions and relations between characters and events; doing common-sense reasoning and problem solving; making explicit plans for a series of actions; or playing games that aren't like Atari games but are the kind of strategic board games many of us enjoy in our spare time. For tasks like these we really need reasoning, or inference.

So the question is: what exactly is reasoning? Well, if you're going to ask what reasoning is, and the answer you're going to give involves neural networks, a really good place to start is the paper by Léon Bottou from 2011, right back in the early days of the current deep learning era.
Bottou discussed how we could enhance deep learning systems with reasoning capabilities from the ground up. He suggested that a good definition of reasoning is "algebraically manipulating previously acquired knowledge in order to answer a new question". He points out that when people say "reasoning", their minds immediately jump to logical inference, but reasoning doesn't necessarily have to be achieved by logical inference. In particular, he suggested that there's a continuity between the algebraically rich systems you have in neural networks and the kind of algebras used in places like logic, and that if we could start down the road of connecting together trainable learning systems in the neural network world, that might lead us along the path to inference. What he wanted to emphasize in particular is that central to reasoning is having a system with composition rules, so that we can guide the combination of modules in a way that lets us recombine information to do new inferences, and recombine modules so that we can address new tasks. What I'd like to talk about today are some ideas as to how we might make further progress along this road to having reasoning done by neural networks.

Now, at the moment the dominant viewpoint in the deep learning community is to seek a learning device that is as empty as possible. The emptiness of the learning device is by and large compensated for by providing enormous amounts of training data to the system, and then often people try to improve performance further by guiding how the device generalizes via data augmentation, which effectively externalizes the desired inductive bias of your system into additional, artificially generated training data. However, I feel that we should not be afraid of good inductive biases; indeed, I think they are the key to producing good machine learners, ones which are capable of learning quickly and effectively. If we look at what's been happening in deep learning, I think most of the biggest breakthroughs have come from building the right kind of inductive biases into models. We can certainly fail by using models with much too rigid a structure, but we can succeed through the use of appropriate but flexible structural priors built into our models as inductive bias. These successes include the early success of convolutional neural networks, the development of attention, and the kind of sequential and vertical gating used in LSTM forget gates and highway networks, all very effective structural biases.

In most of my early work in deep learning I was a strong proponent of tree-structured compositional models. I believe they really give the right inductive bias for capturing the structure of human languages, and not only human languages: there are lots of other places and different problems where a tree structure is a very natural and good inductive bias. These kinds of models, in generic terms, work by taking two vectors, initially word vectors, and building up a vector representation for the phrase. Crucially, that phrase representation lives in exactly the same space, and therefore the computational unit can be applied recursively to build up structures for larger sentences. The result of this composition captures how languages have nested phrases, where the meaning of the whole sentence can be derived by composing together the meanings of the parts.
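(As a rough illustration of that composition unit, here is a minimal sketch: two child vectors are mapped to a parent vector in the same space, so the same unit can apply recursively up the tree. The tanh-of-a-linear-layer form is the standard one for this family of models, but the dimension and random values here are purely illustrative assumptions.)

```python
import numpy as np

d = 50                                       # embedding dimension (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, 2 * d))   # shared composition weights
b = np.zeros(d)

def compose(left, right):
    """Map two child vectors to one parent vector of the same dimension."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# "slow" + "and" -> a phrase vector; that phrase + the next word -> a bigger
# phrase, and so on: the output lives in the same space as the inputs.
v1, v2 = rng.normal(size=d), rng.normal(size=d)
phrase = compose(v1, v2)
assert phrase.shape == (d,)                  # same space, so recursion works
```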
In this sentiment analysis example, the sentence starts off negative, "there are slow and repetitive parts", but then it turns positive, "but it has just enough spice to keep it interesting", and the tree-structured model can understand this structure and work out the overall sentiment of the sentence: it turns positive.

Actually, what Léon Bottou proposed was using exactly the same kind of tree-RNN structure for reasoning. He proposed that the path to universal composition was an association module A that maps two representations into a new one of exactly the same type, augmented by a scoring module R which is used to guide inference, that is, the way in which you choose to combine together your initial axioms and your derived facts. The result of reusing this same composition module is that you build up a proof tree.

Now, on problems with unambiguous compositional structure, an inductive bias of tree structure really does win, and indeed there are at least a couple of papers presented at ICLR which show that. One of them is on programming-language program translation, and the second, by Ed and colleagues (who introduced me), looks at whether neural networks can understand logical entailment. These are clearly both problems with tree structure, and in both cases you get big advantages from using a tree-structured model over something with a simpler inductive bias. On the other hand, in my own field of natural language processing it has proven really challenging to get sustained gains in NLP task performance from tree-structured models, for various reasons: tree-structured models are very GPU-unfriendly compared to simpler models, and their use of symbolic tree structure, where you're making hard decisions, brings sensitivity to parse errors, which are very easy to make with natural languages. So it has been hard to get gains.

In this work I want to pursue a different path to compositional reasoning: it's possible to have compositional reasoning without trees. One alternative is to use an attention model inside a sequential model. In a way you could say this is almost trees by another name, since attention effectively gives us for free any operations where we can feed all the preceding nodes into our composition unit, but we now also get soft weights on the different preceding nodes to decide how much to use these auxiliary arguments, so this is a more flexible and more easily optimizable setup. But it's possible that we don't even have to do that, and in fact the results from the systems I'll show later suggest as much. A second alternative is to think that maybe our sequential neural networks can essentially do what logicians call currying. It's an elementary observation of logic that if you have a multi-argument function f(x, y, z), you can always transform it into another function, the curried function, which takes one argument and returns a function, which takes the next argument and returns yet another function, which takes the third argument and gives you your result. It's perhaps reasonable to assume that we can use sequence models to do precisely this kind of computation, taking arguments one at a time.
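(A toy illustration of the currying idea, with made-up arithmetic rather than anything from the talk: the three-argument function is consumed one argument at a time, the way a sequence model can fold in operands step by step.)

```python
def f(x, y, z):
    return (x + y) * z

# The curried form: each call takes one argument and returns a function
# that waits for the next, until the final argument produces the result.
def curried_f(x):
    return lambda y: (lambda z: (x + y) * z)

assert f(1, 2, 3) == curried_f(1)(2)(3) == 9
```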
Okay, so what is our goal in this work? At the moment, the vast majority of machine learning is building big correlation engines, where you learn any kind of association between what's in the input data and the results you want to assign. Drew's and my goal is a different kind of neural network design, one which explicitly has a structural prior that encourages compositional and transparent multi-step reasoning, but which, on the other hand, avoids the issues of something like tree-recursive neural networks: a model that maintains end-to-end differentiability and easily scales to real-world problems.

So here's the outline of the rest of my talk. The first part explained the desire to move from machine learning to machine reasoning, and I hope you're all convinced by that. In the next bit I'm going to explain a little about the CLEVR task, which is what I'll provide experimental results on today. Then there'll be a big main part about the model we're proposing, Memory, Attention, Composition networks, or MAC nets, and then I'll go on to show some experiments and discussion.

The concrete task I'll be looking at today is the CLEVR data set. It's an artificial visual question answering data set made by Justin Johnson while he was interning at Facebook AI Research. Here's an example of an image and a question: "There is a purple cube that is behind a metal object left of a large ball; what material is the cube?" These are the kind of complex, compositional questions that Justin designed to be a good place to see whether a system can understand and reason. Let's walk through what you have to do. You start with there being a purple cube, and there are a couple of purple cubes. Then it says it's behind a metal object, so there's a metal object and the things behind it. Then, "left of a large ball": there's a large ball and the things left of it. That pretty much narrows it down to one particular purple cube, and then it asks what material the cube is. You probably think, gosh, I don't know what material that is, but it turns out that in the CLEVR data set there are only two materials: things are either metal or rubber, so if they're not shiny they're rubber, and the answer is rubber.

That's the main setup of the problem, but something important to note, down the right-hand column, is that each instance in CLEVR is accompanied by a functional program, which you can think of as something like a database query: it specifies various filter, select, and relation operations which will compute the answer to the question. That's important because some of the other models we'll look at make use of that data.

So the CLEVR data set is another of these artificially generated data sets; it's not a real problem in the world. On the visual side, these are photorealistic images of 3D objects in a few shapes, colors, materials, and sizes, all actually made in Blender, if you have kids that use Blender. It's accompanied by these highly compositional, multi-step questions, which seem good for assessing reasoning, and then, as I mentioned, there are these tree-structured functional program representations. It's actually the case that the language of the questions isn't really natural language: the questions are generated from the functional program representations by rule. So in the basic version of CLEVR we don't have a natural language understanding task; we have an artificial, controlled-language understanding task.
Nevertheless, it requires a variety of reasoning skills, such as transitive inference, counting, and comparison between attributes and amounts. And to make my natural-language-processing self a bit happier, alongside the CLEVR data set there is the CLEVR-Humans data set, which does have genuine human-asked questions, collected via Mechanical Turk. One of the strengths of CLEVR is that, by the construction of the data set, it allows thorough analysis of performance based on question structure and type. More particularly, part of why it was originally designed is that there was a first generation of visual question answering work that looked impressive initially, and then people started to discover that there were a lot of question-conditional biases and opportunities for spurious, superficial reasoning. It seemed that most of the initial VQA systems didn't actually understand much about either the language or the images; you got things like, if the question asked what covered the ground, you didn't have to look at the image, because the answer was always "snow". CLEVR was put together precisely so that would not be the case, so that you actually have to understand the language and the picture to be able to answer correctly.

There have been a number of previous pieces of work on the CLEVR data set, and they basically fall into two groups. The first group is the neural module network approaches. These were pioneered by Jacob Andreas and colleagues, and there's also been an approach pursued by Justin Johnson and colleagues. Essentially, these models treat it like a semantic parsing problem, in NLP terms. All of these models make use of the strong supervision that comes from the functional program, in a two-stage operation: the first stage uses a sequence-to-sequence model to translate the English text back into the functional program, using the strong supervision, and the second stage is the neural module network proper, where you use the functional program to piece together custom neural network pieces, a suite of specialized neural modules for things like selecting, filtering, counting, and comparing, stitched together based on the functional program and then run.

More recently there has been a new class of models which don't use the strong supervision; essentially they're large convnet stacks interleaved with some kind of specialized layer. The first such system was the Relation Network, done at DeepMind. The relation net layer, stuck in between the convolutional layers, takes every pair of pixels in the image, considers a relation between them, and makes predictions based on those binary relations.
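(A minimal sketch of that pairwise-relations idea, as I understand it from the Relation Networks paper: a small network g scores every pair of feature-map cells, conditioned on the question, and f aggregates the summed relations. The layer sizes and names here are illustrative assumptions, not the paper's exact values.)

```python
import torch
import torch.nn as nn

class RelationLayer(nn.Module):
    def __init__(self, obj_dim, q_dim, hidden=256, out=256):
        super().__init__()
        # g scores a (object_i, object_j, question) triple
        self.g = nn.Sequential(nn.Linear(2 * obj_dim + q_dim, hidden),
                               nn.ReLU(), nn.Linear(hidden, hidden))
        # f maps the aggregated relations to an output representation
        self.f = nn.Sequential(nn.Linear(hidden, out), nn.ReLU())

    def forward(self, objects, q):
        # objects: (N, obj_dim) feature-map cells; q: (q_dim,) question vector
        n = objects.size(0)
        oi = objects.unsqueeze(1).expand(n, n, -1)    # object i, repeated
        oj = objects.unsqueeze(0).expand(n, n, -1)    # object j, repeated
        qq = q.expand(n, n, -1)                       # question, broadcast
        pairs = torch.cat([oi, oj, qq], dim=-1)       # all n*n pairs
        return self.f(self.g(pairs).sum(dim=(0, 1)))  # sum relations, then f

layer = RelationLayer(obj_dim=64, q_dim=128)
objs, q = torch.randn(49, 64), torch.randn(128)       # e.g. a 7x7 feature map
print(layer(objs, q).shape)                           # torch.Size([256])
```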
Then, more recently again, from the Université de Montréal, there is the FiLM system. It also sticks layers inside the convolutional stack, but this time they're something like a controllable normalization layer, where processing of the question tilts the normalization that happens between the layers. It's a little bit harder to get an intuitive sense as to why that's a useful thing to do, but it is.

Okay, but now what I want to do is move on and tell you about the memory-attention-composition networks that we've designed. The idea of this network is: how could we custom-design, from the ground up, a neural model which has a better structural prior for problem-solving and reasoning tasks? Building on the success of sequential models, this model is explicitly a sequence model, but it's designed to be constrained in such a way as to encourage it to do a sequence of explicit reasoning steps, each performed by a MAC cell, a memory-attention-composition cell. In particular, as a recurrent network, there is one universal MAC cell used throughout all the steps, but its parameterized design is such that it's versatile and can adapt its behavior in context to perform different operations, so we don't need custom networks for every different kind of reasoning operation we want to do. The network also makes use of self-attention, so it can simulate arbitrarily complex reasoning graphs while maintaining end-to-end differentiability.

Over the next few slides I'll gradually work down into the details. A single MAC cell is meant to do one reasoning step, and to do that it makes use of two recurrent states. First, there's a control state, which represents the step's reasoning operation; the reasoning operation is represented as an attention-based average of a given query, where the query for CLEVR is the question we're asked. Then we also have a memory state: the memory stores retrieved information relevant to the query, accumulated over steps, and it's also represented as an attention-based average, but an attention-based average over what in general I'd like to suggest is a knowledge base. For the work I'm presenting here on CLEVR, our knowledge base for answering questions is simply the image.

Before we get into the details of the MAC cell, we need a knowledge base and a query to feed into the MAC sequence, and there's a preprocessing layer that uses pretty standard neural network methods. The query is processed by a BiLSTM, which yields a sequence of contextual words, the hidden states of the BiLSTM, and an overall query representation, formed in the standard way by concatenating the final forward and backward hidden states. For our so-called knowledge base, that is, the image, we use a completely standard ResNet stack (indeed, the one that Justin Johnson made), taking its output as our knowledge base representation.

In slightly more detail, here's the picture for the MAC net. The first thing to notice is that it has two recurrent states: a control state and a memory state. There's a fundamental idea here, which is to think that maybe there's actually something good in the kind of architecture that has been used throughout the history of computing, where control and memory are separated from each other; at any rate, that's the hypothesis being developed in this work. The control unit computes the next control state, extracting an instruction from the query. Then the read unit retrieves information from the knowledge base, given both the previous memory (it's again working recurrently) and the current control. Finally, the write unit updates and writes the next memory, merging the newly found information with the previous memory. Quickly, in one more level of detail, though this is probably slightly more detail than you can absorb off PowerPoint slides, I'll say a minute or two about how we get down into the details of what things do.
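(Before the cell details, here is a minimal sketch of the query preprocessing described a moment ago, with illustrative dimensions: the contextual words are the BiLSTM's per-token hidden states, and the query vector concatenates the final forward and backward states.)

```python
import torch
import torch.nn as nn

d, hidden = 300, 256                          # word / LSTM dims (assumptions)
lstm = nn.LSTM(d, hidden, bidirectional=True, batch_first=True)

words = torch.randn(1, 12, d)                 # a 12-word question, embedded
context_words, (h_n, _) = lstm(words)         # (1, 12, 2*hidden) contextual words
query = torch.cat([h_n[0], h_n[1]], dim=-1)   # (1, 2*hidden) query vector
```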
Okay, so this is the control unit. The control unit has as its inputs the previous control state, the query (that is, the summary of the question to perform), and the contextual words. First it calculates a time-specific query representation through a linear layer, then it combines that with the preceding control instruction, and at that point it uses a particular architectural choice which we both thought would be very useful for interpretability and which proved to be very successful performance-wise: the calculated representation, cq_i here, is projected down onto the actual words of the query. The way that's done is the standard attention way: we put an attention distribution over the words of the query and then calculate a weighted average of those words.

The read unit is the most complex part of the system. It takes the previous memory and the knowledge base. First, it considers each item in the knowledge base, the outputs of the conv filters, and relates them to the previous memory. This is a facility that makes it easy to do transitive reasoning, because the preceding memory can guide what to get next from the knowledge base. But we don't only want to do transitive reasoning; sometimes we'd like to combine multiple memories together, such as when we do logical operations like "or" and "and", or counting, so we again combine the output of that with direct access to the knowledge base. Then we also want to condition what to do on the control we've calculated, so the control feeds in again, via a Hadamard product, and at the end of that we again do a projection via attention back onto the knowledge base items. This attention-weighted average then becomes our candidate new memory, which we pass forward.

Then there's the write unit. We actually explored a couple of write units of more or less complexity. The simplest write unit takes your newly calculated memory and your previous memory, puts them through a linear layer, and writes that as the new memory, and it turns out that works pretty well. But you might think that sometimes you have a hard amount of thinking to do and sometimes only a little, so it might be useful to have a highway gate, so that once you've finished your reasoning you can just pass through the preceding memory. And we had the idea of using self-attention, where you use all of your memories to help guide the next memory, so that we can build an explicit DAG of inference steps. Concretely, the way we do this uses a kind of key-value memory network idea: we take the attention distribution between preceding control states and use it to access preceding memories.

So that's our MAC cell, and the way we make our whole MAC network is just to apply these cells recurrently, in particular allowing a directed acyclic graph via attention. This gives us a fully differentiable, end-to-end model for reasoning.
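(To make this concrete, here is a simplified, runnable sketch of one MAC cell as described above. The simplifications, no write gating and no self-attention over past memories, and all dimensions are my own assumptions; the authors' TensorFlow implementation linked from the paper is the real reference.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 128  # shared hidden dimension (illustrative)

class MACCell(nn.Module):
    def __init__(self):
        super().__init__()
        self.step_q = nn.Linear(d, d)        # time-specific query transform
        self.cq = nn.Linear(2 * d, d)        # combine query with prev control
        self.ctrl_attn = nn.Linear(d, 1)     # attention over question words
        self.mem_kb = nn.Linear(d, d)        # relate memory to KB items
        self.read_in = nn.Linear(2 * d, d)   # merge direct + memory-aware KB
        self.read_attn = nn.Linear(d, 1)     # attention over KB items
        self.write = nn.Linear(2 * d, d)     # simplest write unit: linear

    def forward(self, c_prev, m_prev, q, words, kb):
        # Control unit: project onto the actual question words via attention.
        cq = self.cq(torch.cat([self.step_q(q), c_prev], -1))
        ca = F.softmax(self.ctrl_attn(cq * words), dim=0)   # (L, 1)
        c = (ca * words).sum(0)                             # new control
        # Read unit: interact each KB item with the previous memory, merge
        # with direct KB access, modulate by the control (Hadamard product),
        # and attend back over the KB items.
        inter = self.mem_kb(m_prev) * kb                    # (N, d)
        info = self.read_in(torch.cat([inter, kb], -1))     # (N, d)
        ra = F.softmax(self.read_attn(c * info), dim=0)     # (N, 1)
        r = (ra * kb).sum(0)                                # retrieved info
        # Write unit: merge retrieved information with the previous memory.
        m = self.write(torch.cat([r, m_prev], -1))
        return c, m

# Usage: L question words, N knowledge-base items, all d-dimensional.
cell = MACCell()
q, words, kb = torch.randn(d), torch.randn(12, d), torch.randn(196, d)
c, m = torch.zeros(d), torch.zeros(d)
for _ in range(8):                           # number of reasoning steps
    c, m = cell(c, m, q, words, kb)
# The final memory and query then feed the answer classifier described next.
logits = nn.Linear(2 * d, 28)(torch.cat([m, q], -1))
```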
The final bit we need is the network output. This is actually very easy for CLEVR: there are only 28 possible answers. There's yes/no, there are colors, materials, comparisons, and numbers, but all of the numbers are single digits, so there are only 28 answers in total, and we simply use a 28-way softmax. It takes in both the memory and the final form of the query; using the query at this point is vital, because the way the MAC net segregates the query from the memory means the memory does not really represent the query, so it has to be combined back in.

Okay, so here are some of the experimental results and a bit of discussion. In the CLEVR data set overall there are 700,000 training examples and 150,000 test examples, with the 28 answers. Because it's a limited space of answers, the most-frequent-answer baseline is not small: there are lots of yes/no questions and metal-versus-rubber questions, and guessing the most frequent answer for each question type gets you to 41%. Using convnets and LSTMs, the standard baseline VQA architecture, does a bit better, but not a lot better: 52%. Jacob Andreas's initial neural module network work did a lot better than that, getting to 83.7%, which seemed to be heading in the usual direction on these tasks, closing in on what was measured in the original paper as human performance, 92.6%. But remember that this is a completely artificial task, with artificially constructed language and deterministic answers, so there's actually no reason we shouldn't be able to get one hundred percent on it (and there's a bogus reason why the humans tend to perform a bit imperfectly, which I can explain later if someone asks). Because of that, in recent work all the action has moved to the 95-to-100-percent range. That started with DeepMind's Relation Networks, which got just over 95 percent; Justin Johnson's second-generation neural module networks get 96.9 percent; FiLM gets 97.7 percent; and, as you're probably expecting, it turns out the MAC net does better again, getting 98.9 percent, more than halving the previous error. Nevertheless, these numbers are getting so high that you might be suspicious as to how much signal is there. I think there is signal, but maybe it's more interesting to look at some of the details.

In basic CLEVR there are five kinds of questions: there are existence questions (does something exist), querying an attribute, comparing attributes between two things, and so on, and all four of these systems get over 97% on those. The harder kinds of questions are the ones involving numbers, comparing numbers and counting. In particular, previous systems had the most difficulty with counting questions: both the Relation Networks and the second-generation neural module networks got about 90% on counting questions, FiLM did rather better, and the MAC net does a lot better again. But even for the MAC net, the class of examples it does least well on is counting questions, and this might make you wonder whether having more of a counting primitive would also be a useful structural prior to consider putting into a reasoning system.

Here are a couple of more interesting results. How fast these models train is a good reflection of whether they have good inductive biases. The LSTM-plus-CNN-plus-spatial-attention model learns very slowly; that's the bottom line. FiLM and the second-generation neural module networks are better, but the MAC net trains much faster still.
In two epochs through the data, it performs as well as the neural module networks do after ten epochs, and I think you can take that as strong evidence that this model really does have good inductive biases that help it learn. Here's another demonstration, looking at a learning curve obtained by restricting the amount of training data. If you train on 700,000 examples, several of the models do very well indeed, but if instead you train on just 70,000 examples, one tenth of the data set, what you find is that all the other good approaches, like FiLM and Justin Johnson's neural module networks, get at most 51.6%, again not really that much more than the most-frequent-answer baseline, whereas MAC is still doing really well: even 70K examples is enough for it to get 86 percent correct.

And then here's real human language, the CLEVR-Humans data set. It consists of 18,000 examples, a small number, of real human language coming from crowdsourcing. The humans were instructed to write questions that would be hard for a smart robot to answer; in particular, they were able to ask any kind of question, which could be semantically quite different from the only five types the systems were trained on. That data set was regarded as too small to actually train on, so the first thing people have done is a blind transfer task, and here are the results from that. The second thing is that there is a corresponding dev set, or adaptation data set, for CLEVR-Humans, so people then adapt on that data set and see how well they do. If you do that, what you notice is that not only does the MAC net win again, but the MAC net actually gains differentially more from doing the adaptation, which again I think reflects the fact that it has good inductive biases, so it can generalize much more quickly on that data.

Okay, here are a couple of pictures. Because the model uses attention both to the query and to the knowledge base, you can get a fairly concrete idea of what the model is doing. This question is "What color is the matte thing to the right of the sphere in front of the tiny blue block?" Now, for the length of the sequence model, you get the best results by running it with a reasonable length, somewhere between 8 and 16, but for showing pictures it's better to have a shorter sequence model, and it still performs fairly well with a short one. What we find is that for the first time step it focuses, in the language, on "tiny blue block", and it finds the tiny blue block in the picture; then it focuses on "the sphere in front of", and it finds the sphere in front of it; then it focuses on "what color is the matte", and it answers correctly, giving the answer purple.

Here's a more complex and harder-to-see example: "How many objects are either small objects behind the tiny metal cylinder or metallic cubes in front of the large green metal object?" I can barely read this myself. Interestingly, what you notice here, and this is quite common in the way it behaves, is that with a longer sequence model it uses the first couple of steps to absorb the semantics: at first it's focusing on "how many" and "small", then it focuses on the fact that there's a disjunction in the condition, and then it starts looking at the large green metal object and looking for metal cubes in front of that, and you can see it has found the large green metal object and is then looking for the metallic cube in front of it.
Then it moves back to the first half, the tiny metal cylinder: at time step five it finds the tiny metal cylinder and then the small objects behind it, and this is enough information that it can successfully count, giving the right answer of four. And here are a couple of CLEVR-Humans examples, just to show the different kind of semantics of the questions being asked there: "What is the shape of the large item mostly occluded by the metallic cube?", "What color is the object that is a different size?", "What color ball is close to the small purple cylinder?". We've got notions like occlusion and "close to" that were nowhere in the original semantics.

So I hope to have shown you that this works pretty well. Let me just quickly emphasize how MAC nets differ from module networks. Neural module networks are also an architecture that speaks to Léon Bottou's vision: they're also composing together pieces to do more complex inferences. But I think this model is advantageous because we make use of one universal and versatile cell that is used across all steps, sharing architecture and parameters; indeed, we did some ablation studies, and you actually get worse performance if you don't share the parameters in a sequence-model manner. This cell can adapt its behavior based on the context in which it's applied. In contrast, neural module networks have a discrete, fixed inventory of rather task-specific, specially designed modules with distinct parameters and even architectures, so they lack this generality and any concept of sharing and universality. Compared with the FiLM model from Montreal, we actually did take a key insight from FiLM: rather than transforming both the query and the knowledge base into the same representation, FiLM proposed an indirect modulation, and we've borrowed that kind of indirect modulation. But I think we really benefit from our close attention to the query and the knowledge base, whereas FiLM's conditional normalization layer is a kind of global tilt applied identically across all image regions, which doesn't allow selective behavior based on features or positions.

So I hope I've shown you this idea that there may be interesting things we can do to push neural networks toward higher-level inference and reasoning, to get back to what artificial intelligence was meant to be about, where we would have thinking going on in our neural network architectures. I've proposed a particular design for doing that, a form of constrained sequence model, separating control and memory and particularly exploiting attention as a good model for reasoning, and I've argued that it has strong compositional reasoning skills and the right kind of inductive bias to do well on these kinds of tasks, while being practical as a generic, fully differentiable, end-to-end model. Now, is this the one and only true answer as to how you could design a neural network for reasoning? I'm sure it is not; I'm sure there are lots of other ways and variants that people could explore. But I think it is an interesting challenge for the community to put more effort into thinking about what a few more different, good building blocks might be that we could use in neural net architectures, rather than being so dependent on such a small vocabulary of ideas, as so much of the work in the field has been in recent years. Thank you.
[Applause] [Music]

Okay, we have time for some questions. Dmitry?

Thank you, great work. It's an interesting idea to separate control and memory, but then what happens is that you decide which steps of reasoning will be performed regardless of what the previous steps of reasoning fetched from the knowledge base. So you decide in advance what reasoning you will do, and you don't interleave it with what you retrieve. Do you think this has some limitations?

That's a great question, and I actually think you're pinpointing one of the deficiencies of the current architecture. It seems like a key thing you'd like to be able to do for reasoning in general: you're searching your knowledge base, you've discovered that a person was born in Poland, and therefore you say, okay, I'm going to take that fact, incorporate it, and start looking for other people in Poland that they might know, or whatever it is. You want to have the memory feeding back into the control to determine future actions, and I think that's absolutely true and something that should be incorporated into a MAC net version two. It happens that for the particular design of CLEVR you don't have to do that, because you are really just following the instructions in the question, but I think that's a key avenue, a deficiency that should be addressed in the future.

Thank you. I think we have time for at least one more question. There's a lot of work on making neural networks explainable or interpretable, and it seems like in reasoning it's important not just to get an answer but also the why of that answer. Do you have any thoughts on making this more explainable?

Partly, and partly not. It was actually one of the hopes in the design of this architecture that it would be a model that was transparent, interpretable, and explainable, and that was initially a motivation for grounding both halves of the model in attention distributions over the query and the knowledge base. It was really then a lucky side effect that we found this actually improved the speed of learning and improved performance as well. So in one sense there are these lovely attention pictures that I can show you, which make things kind of interpretable. On the other hand, as a lot of us know who've looked at attention pictures, it's kind of easy to tell just-so stories about them, and you can never be quite confident exactly what the thing is doing: you can show that there's good localization over time, but it's hard to tell whether it's really computing "in back of" or doing "greater than". So I think there's more work to be done there; I'm not sure I have a specific answer, but clearly people are starting to explore various ideas, such as generating text alongside the models to have them explain what they're doing.

I'm afraid that's all we have time for, but let's thank Chris again for his keynote. [Applause] [Music]
Info
Channel: The Artificial Intelligence Channel
Views: 34,022
Keywords: singularity, ai, artificial intelligence, deep learning, machine learning, deepmind, robots, robotics, self-driving cars, driverless cars
Id: 24AX4qJ7Tts
Length: 46min 28sec (2788 seconds)
Published: Mon May 14 2018