Differentiable Neural Computer (LIVE)

Video Statistics and Information

Reddit Comments

I had heard about it when the original paper came out. It was all a bit over my head. Hopefully this will explain it better.

2 points · u/dranzerfu · May 11 2017
Captions
Hello world, it's Siraj, and welcome to this live session. Today we're going to be talking about the differentiable neural computer. Just take a second to soak in how awesome that name is. This is a really cool model: it came out of DeepMind a few months ago, and it's the successor to the neural Turing machine. Recall from my last weekly video that we talked about meta-learning, learning to learn, the edge of deep learning and the future directions we should move in. This is a relatively complex model, and it's definitely the coolest model I've ever seen. It was really fun studying it, because it gave me so many ideas about directions we should move in and things we can do with it. But let's talk about the problem here, and the problem is: how do we create more general-purpose learning machines? That's the problem they designed this model around. I'm going to show you the demo for this, and then talk about what else we could do with it. Before we do that, go ahead and ask questions while I explain a little more, and I'll look back at the questions through Slack.

So, neural networks are great. We can do a bunch of amazing things with them, but the problem with neural networks is that they're made to focus on a single task, whatever you train them on. You can take a neural network and train it to recognize cats in images, but then you can't take that same network and ask it questions about the London subway, or the best path to take from point A to point B. You just can't do that. But what if you could, and how would you design a system like that? That's what the DNC is: a neural network with an external memory store. It has two parts. You've got the controller, which is a normal neural network; you can use a feed-forward net, a recurrent net, any kind of network you want, and in this example we're going to use a feed-forward network. And then you've got the memory bank, which is an external matrix. The controller interacts with the memory bank by performing a series of read and write operations; we call these read and write heads. So it takes an input, propagates it through the network, and simultaneously it's reading from and writing to the matrix: it writes what it has learned, reads from past time steps what it can use to make the output prediction, and then it outputs the prediction. That's the basic idea.

What we're going to do in this example, and let me see if I have the output somewhere, here's the output, I know you guys like seeing the output, is map between two sets of numbers. They're just one-hot encoded vectors: something like 0 1 0 0 as the input, and then 1 0 0 1 0 0 as the output, and we want to learn the mapping between the two, so that given some input we'll know the output. It's just a binary mapping, which is a very simple use case, and that's what we want here, because the model itself is where the learning should occur.
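As a minimal sketch (my own illustration, not the exact code from the video), this is the kind of toy data being described: random one-hot binary vectors for the input and a second set of one-hot vectors as the target. The sizes are hypothetical.

```python
import numpy as np

seq_len, seq_width = 10, 4          # hypothetical sizes for illustration

def random_one_hot(length, width):
    """Return a (length, width) matrix where each row is a random one-hot vector."""
    seq = np.zeros((length, width), dtype=np.float32)
    seq[np.arange(length), np.random.randint(0, width, size=length)] = 1.0
    return seq

input_data = random_one_hot(seq_len, seq_width)     # rows like [0, 1, 0, 0]
target_output = random_one_hot(seq_len, seq_width)  # rows like [1, 0, 0, 0]
```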
Okay, so this is what it looks like: you have your inputs and your outputs, and it learns the mapping between the two. Like XOR, which we dealt with in my "build a neural network in four minutes" video, same exact idea, except with a way more amazing model. Now let's talk about what DeepMind used it for. They said, okay, let's apply this to the London Underground. The questions are coming in, so let me answer a couple and then go back to what I was just talking about. Question one: does it learn and predict the hyperparameters? No, that's hyperparameter optimization, and it can be added on to this. Question two: can we do image recognition with this? Yes. And one more: what's the difference between this and the neural Turing machine? I'll get to that in a second.

Okay, back to this. What they did was apply it to the London Underground. What do I mean by that? The first thing they did was generate a bunch of random graphs, because it's a graph problem: subway systems are all graphs, they have nodes and they're all connected. They gave it a set of generated graphs with handcrafted inputs, so each graph resembled some generated subway, and each had a set of labels, so it was a supervised learning problem. The labels were the different paths you could take from point A to point B: you could go through Oxford Circus, through Tottenham, and so on. They kept training it on these randomly generated graphs, and then they gave it the actual London Underground graph with its associated pairs, and it learned so that if you gave it two points, point A and point B, it would tell you the optimal path to get there, because it had been trained on that.

But here's where it gets even cooler. You could do that without an external memory store; you could do that with a recurrent network, an LSTM network. What was really cool was that they added something else on top: a question-answering system. So not only did they train it on the London Underground's paths, they also added natural language to it. They trained it first on randomly generated graphs, then on a text database of question-answer pairs, and it learned to associate questions with their answers. And then it associated both, so you could ask it a query in natural language, like "what's the best way to get from point A to point B?", and because it had this external memory store with the previous learnings from the generated graphs, it could apply those to the natural-language questions. So you have two entirely different data types, two entirely different datasets, that this thing was able to train on: it learned to optimize for one dataset, then it learned to optimize for the next, and it could associate between the two, which is the cool part.
And you could extrapolate this kind of thinking to anything, really. You could train it on a set of images and their associated labels, and then on something entirely unrelated, like a question-answer dataset so it learns natural language, and then maybe an audio dataset so it learns the labels for audio. Then you could ask it things like "hey, what kind of sound does this cat make?", and it would see the cat picture and associate a sound with it, because it's got that language. It's this general-purpose idea. Now, it's not perfect, it's not AGI, but it's a step in that direction, which is very cool.

And they called it a computer. Why do they call it a computer? Recall that computers have two parts: a processor and memory. You have your CPU and you've got RAM, random access memory. A little kernel-level talk for a second: what happens at the kernel level every time you're doing anything on a computer is that the RAM preloads a bunch of instructions, and each instruction is fed to the CPU one step at a time. The CPU takes in an instruction, decodes it, executes it, and repeats. This process is called the instruction cycle, and it's the hallmark of how computing works. (And then there's the GPU, but that's a different story; we're talking about the CPU now.) The von Neumann architecture is very famous in computer science, and a lot of computing is based on that idea. But here's the thing, and now this is not DeepMind talking, this is Siraj talking, this is my "what should happen": we can use this as a framework for building hardware as well. It's a computer, but it's all software; there's no hardware associated with it. If we switch our thinking from serially decoding instructions to instead learning from instructions at the kernel level, at the hardware level, we can get some really interesting results. There are people working on this, but I think it's really cool to think about what the successor to the von Neumann architecture is, and also the successor to silicon, what new mediums for computing could be. So there are a lot of very exciting possibilities with just this software architecture.

Let me show you one more thing they did, to keep going with this idea of associating two different data types. They first fed it some associations, like "Jo is the mother of Freya", "Bob is the husband of so-and-so", a bunch of natural-language text associations. Once it had those associations, 49 inputs later, you could ask things like "who is Freya's maternal great-uncle?" Because it's a graph problem, it took this natural language and constructed a graph out of it, and then it's just a graph problem: you traverse the graph to find who Freya's maternal great-uncle is, even though we never explicitly told it who that uncle was. So it can do multiple things: it's not just language, it's also graph construction. And the fact that it's using an external data structure for memory, it's such a simple concept, isn't it? If you think about it, it's one of those intuitive things, like, of course there should be an external memory store, but no one had really tried it before. I mean, we did have the dynamic memory network out of Facebook, and we have the neural Turing machine by DeepMind, but this is a really cool idea, that's what I'm trying to say.
So that's what they did. And yes, neural networks do have memory, but the memory, the weights, is interwoven with the processing. If we detach the memory into a separate component, that's when the results start to get magical, and that's what this is.

Oh, and here's the coolest part: the whole system is differentiable. What do I mean by that? When we backpropagate through our neural networks, we forward propagate, then we take the difference between the prediction and the target, that's our error or loss, and we use that loss to compute the partial derivative with respect to each of the weights, going backwards, and we keep doing that; that's how we update our network. But here the whole thing is differentiable: not just the controller, the network, but also the memory store. So we compute partial derivatives with respect to all of those rows in memory as well. There are actually a lot of parts here, and what we're going to do is go through each part step by step, and I'll talk about how each one works. So get ready, this is going to be amazing; you're going to have your mind blown. We have a lot to go over, so it's going to be a lot of fun.

First, let me answer one of the questions: how is this different from its predecessor, the neural Turing machine? There are several ways it's different, and here's how it's different in the text, but basically it can all be summed up as: there are more memory access methods. It has different, more complex, more optimized ways of interacting with memory than the neural Turing machine. And it added this temporal link matrix. I love the terminology here, I love how DeepMind uses neuroscience terminology (they have actual neuroscientists on their team). You see these arrows pointing to different rows in this memory bank, this memory matrix. The reason that's there is so that the network can know, when it's reading or writing, the order in which things were read from or written to memory. The order helps, because whether it's "who is Freya's maternal great-uncle" or whatever, you sometimes want to know the order, and so this adds an ordering to the read and write heads.

Okay, one more thing before we start looking at the code: I want to talk about attention. We have our controller, we have our read and write heads, and we have our memory bank. The question is: how do we know where in this matrix to write to, where to read from, and the degree to which we should do those things? That's where attention comes into play. We call it attention because it's a way for us to frame how precedence, how importance, how weighting plays into how we read and write. How do we know where to store stuff, basically?
So they added three attention mechanisms. The first one is called content lookup. Think about content-addressable systems: in a content-addressable system, the key tells you what that content's value is. It's the same thing here. We have a read or write head, and it contains some content address, a key, and we have similar content addresses in memory. What we do is find the similarity, the cosine similarity, between the key and all the content addresses to see what's most similar, and then we use the value that's most similar to update our network. So content lookup via a similarity measure is the first one, and we'll go through each of these. The second attention mechanism is the temporal linking: how do we know the order in which things were read from and written to memory, and how do we update our network based on that? And the third is allocating memory for writing. This is dynamic allocation: instead of having some static amount of memory dedicated to writing, we allocate it dynamically, so we erase and then rewrite over it dynamically. You'll understand more when we get to that part. So there are three attention mechanisms here. And I love how they compared the attention mechanisms to the hippocampal CA3 and CA1 synapse regions of the brain, which is super cool. They've done this before, actually: recall from the deep Q-network, the network that could beat all those Atari games, they used something else from the hippocampal region called experience replay, which came directly from neuroscience. DeepMind's papers read a lot like neuroscience papers, except they actually have great results.

Anyway, let's get to the code, and let me answer two more questions before we do. Question one: can we use this DNC for real problems like data association? Absolutely you can, and they did, with the family members in the family trees. You can extrapolate that problem to something entirely different, like finding associations between not just people but ideas, images, different data types, numbers, things like that. And one more: can you explain the memory and head operations in greater detail, still confused? Yes, let me do that as I go down, because that's what the code is, the details of that. At a high level: we have a neural network controller, and we have an external memory store, which is a matrix we define, an N-by-W matrix. We feed our neural network an input, it forward propagates, and at the same time it's reading from and writing to memory. The degree to which we're reading and writing depends on how we structure it, which we're going to go into. It's reading from the memory store, just like it would read its own weights, to make predictions, to compute that series of matrix operations that produce the output prediction. And it's writing to memory just like it would write to its own weights.
Just like how you multiply by each weight as you propagate forward, you're also multiplying by this external memory matrix, and then we differentiate, which is going to be awesome when we get to it. So let's get started with the code. We don't have time to write it all out from scratch, because there's a lot of code and a lot of theory here, so we're going to focus on the theory, and then we'll compile it and run it, and it's going to be awesome.

Oh, there's another question: is it better to have loops in neural networks, or to go without? It's better if that's your use case. What we're using in this example is a feed-forward network, but you could use a recurrent net; in fact, in the paper they use a recurrent net. When would you want to use a recurrent network, when would you want loops, when would you want the state fed back into the input at the next time step? When you have a sequence. When you have any kind of sequential data, you want a recurrent network. But the simplest type of network is a feed-forward net, and that's what we're doing here, because we want to break this down to its bare essentials so we can understand the general architecture. Once we understand the general architecture, we can use it for crazy new use cases that no one has ever done before. We're at the bleeding edge right now, so let's get started.

We're going to define the DNC, the differentiable neural computer, as its own class, and like all classes we initialize it in the init function, so let's go step by step through what we define there. The first thing we define is the input data and the output data, or rather the sizes of both. Recall what the data looks like: a set of binary pairs, and these pairs are randomly initialized. It doesn't matter what they are, but there's a mapping between the input and output data, just as sets, because they're both just a series of ones and zeros, and we want to learn the mapping between the two, so that given some novel input of ones and zeros we can predict the probable output set of ones and zeros. It's a supervised learning problem: we already have a set of pairs, and we want to predict what the label would be. So we have our input size and our output size defined here, and notice these come from the parameters up here, which we pass in when we initialize our DNC later; we're just defining the class right now.

The next part is to define our read and write vector sizes. Notice they're called num_words and word_size, but there are no words here; DeepMind's code had words, so this naming is kind of left over from that. What we can think of these two variables as are the sizes of our read and write vectors, because when we initialize those we use these variables as parameters to define their size, along with a couple of other variables we'll talk about. They're basically constant values that we use to initialize a bunch of variables later on.
These two values are constant: the number of words and the word size. In fact, we use them to set the size of our memory matrix: the memory matrix is num_words by word_size. Then we want to define our heads: how many heads do we want? A head is basically an operation, how many times we want to be reading and writing to memory while we're training our network, and we're just going to say one. We'll have a single head reading and writing to memory at every time step, to keep it simple, but we could have multiple heads.

Then we define our interface size. What is this? Let me go back up here, because I left out one part. We have our input data, we feed it to our controller, it reads and writes, and then it outputs a prediction. But a DNC doesn't just output a prediction; it also outputs an interface vector, and what the interface vector does is define how we're going to interact with the memory bank at the next time step. So it outputs two things: a prediction and what's called an interface vector, and we feed this interface vector back in so that at the next time step we know how to interact with the memory bank. That's what we're doing here: defining the size of that interface vector. And yes, there are about three places in the code where there are magic numbers, and that's just how it is; we could change them and our results might be better. I tried several numbers here and these produced the best convergence, and that's just all of deep learning, right, for all hyperparameters. But we define them using these set values, num_words and word_size, which we use consistently throughout the code.

So: we've defined our input data, the two variables that help us initialize our memory matrix (its length and width), the number of heads we want to read and write with, and the interface size, which is the size of the vector associated with the output. Then we define our controller's input size, which is the size of the input after we flatten it, again using those same two parameters, and the output size, which is the size of the output we defined earlier plus the interface size, because it's one big vector that we split and use later on. Then we define a distribution over both outputs, both the prediction and the interface. And finally we create our memory matrix, this thing up here. It's just a matrix; we define it in one line of code as a set of zeros, num_words by word_size. It's not some pseudo-magical thing, and there are no words; it's just the length by the width, N by W.
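A rough standalone sketch of those sizes, using my own paraphrased names rather than the exact variables in the file. The interface-size formula follows the DNC paper (W read keys per head, strengths, write key and strength, erase and write vectors, free gates, allocation and write gates, three read modes per head); the concrete numbers are only illustrative.

```python
import tensorflow as tf  # TensorFlow 1.x, matching the era of the original code

input_size, output_size = 4, 4      # size of one binary input/output vector
num_words, word_size = 256, 64      # N x W: dimensions of the memory matrix
num_heads = 1                       # a single read head, as in the video

# interface vector size: W*R + 3W + 5R + 3 (keys, strengths, erase/write
# vectors, free gates, alloc gate, write gate, read modes)
interface_size = num_heads * word_size + 3 * word_size + 5 * num_heads + 3

# the controller's real input is the data plus the read vectors from memory,
# and its real output is the prediction plus the interface vector
nn_input_size = num_heads * word_size + input_size
nn_output_size = output_size + interface_size

# the external memory bank itself: just an N x W matrix of zeros
mem_mat = tf.zeros([num_words, word_size])
```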
Okay, we have some more variables here. We've defined our matrix, the size of our heads, our input size, our output size, and our memory matrix size. Now remember, we don't just have an external memory matrix; we have a third matrix. We have our neural network, which we can consider one huge matrix, we have our memory matrix, and we also have this temporal linkage matrix over here. This is how, whenever we're reading and writing to memory, we decide what order to read and write in. Ordering definitely matters, whether it's a graph traversal problem or a natural language problem, because we're going to continually feed it data.

So we have our usage vector, which records which locations have been used so far; it's deciding where in the memory bank we've read and written to before, and we'll use that usage vector later to help define our temporal link matrix, but right now we just initialize it with zeros. Then we have our precedence weight, which represents the degree to which the last location was written to in the previous time step. (Once I get up to this output weight right here, I'll answer questions.) So that's our temporal link matrix, essentially, and we've defined our major components.

Now we've got to define our read and write head weight variables. We only have one head, but we have weights for that head: a set of read weights and a set of write weights. These weights are just matrices, small matrices, but they define the degree to which we're reading and the degree to which we're writing. What do I mean by "the degree to which"? Recall that reading and writing are just matrix operations, just multiplication, and we can tune how much we're multiplying, similar to how we use a learning rate when updating our weights. These read and write weights define how much we multiply the memory matrix by. And remember, the entire thing is differentiable, so everything gets updated. You might be wondering how we know what the read weights or write weights should be, or even the usage weights or the link matrix. We differentiate everything based on the loss between the predicted output and the actual output, and we use that to update all the components, which is amazing if you think about it: it's an end-to-end differentiable system. So we have our read weights, our write weights, and then our read vectors, which are what we actually do the matrix multiplication with: we take our weights times our vectors, and that's how we get our output from the matrix. Then we've got our placeholders, our TensorFlow placeholders, where we feed in our input and output pair, those ones and zeros. We feed them both in, learn the mapping, and then predict the output.
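Continuing in the same spirit, here is a standalone sketch of that per-timestep state. The shapes follow the paper; the exact initial values (zeros versus tiny constants) and variable names are my guess at what the file does, not a copy of it.

```python
import tensorflow as tf

num_words, word_size, num_heads = 256, 64, 1
input_size, output_size, seq_len = 4, 4, 10

usage_vec = tf.fill([num_words, 1], 1e-6)             # which rows have been used
link_mat = tf.zeros([num_words, num_words])            # temporal link matrix (N x N)
precedence_weight = tf.zeros([num_words, 1])           # "last written" degree per row

read_weights = tf.fill([num_words, num_heads], 1e-6)   # where each head reads
write_weights = tf.fill([num_words, 1], 1e-6)          # where the write head writes
read_vecs = tf.fill([num_heads, word_size], 1e-6)      # what was read last step

# the input/output pair we feed in on each run
i_data = tf.placeholder(tf.float32, [seq_len, input_size])
o_data = tf.placeholder(tf.float32, [seq_len, output_size])
```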
So then we define our network. Because this is a super simple use case, one read head and just binary input/output pairs, let's define a two-layer feed-forward network. It's got a set of weights and biases: weight, bias, weight, bias, that's it. Let me make sure you can see everything here; this is a longer line, but it's float32, we've named our weights, standard deviation of 0.1, and we define the size using the input size. This part gets cut off here as well, but just recognize that it's similar; let me make it bigger again. So, we've defined our network. Then we have our output weights: weights for our output. All of these components have weights; the output values have weights, both the interface vector and the output. Why do they have weights? So that we can differentiate. We take the partial derivative with respect to not just our controller's weights but the weights of our outputs, the weights of our heads, the weights of our matrix, and the weights of our temporal link matrix; we take the partial derivative with respect to everything, so even the outputs have weights, and we initialize them randomly using the tf.truncated_normal function. And then we also have an output weight for the read vectors.
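A sketch of that two-layer controller plus the output and interface weights, with the forward pass they end up being used in (the forward pass is described in more detail a bit further down). The hidden size and the placeholder input are my own illustrative choices.

```python
import tensorflow as tf

num_heads, word_size = 1, 64
input_size, output_size, hidden_size = 4, 4, 32
interface_size = num_heads * word_size + 3 * word_size + 5 * num_heads + 3
nn_input_size = num_heads * word_size + input_size
nn_output_size = output_size + interface_size

# two-layer feed-forward controller: weight, bias, weight, bias
W1 = tf.Variable(tf.truncated_normal([nn_input_size, hidden_size], stddev=0.1))
b1 = tf.Variable(tf.zeros([hidden_size]))
W2 = tf.Variable(tf.truncated_normal([hidden_size, nn_output_size], stddev=0.1))
b2 = tf.Variable(tf.zeros([nn_output_size]))

# separate learned weights that map the controller's raw output to the
# prediction and to the interface vector, plus weights for the read vectors
nn_out_weights = tf.Variable(tf.truncated_normal([nn_output_size, output_size], stddev=0.1))
interface_weights = tf.Variable(tf.truncated_normal([nn_output_size, interface_size], stddev=0.1))
read_vecs_out_weight = tf.Variable(tf.truncated_normal([num_heads * word_size, output_size], stddev=0.1))

# forward pass: input (data + flattened read vectors) -> tanh -> tanh -> outputs
x = tf.placeholder(tf.float32, [1, nn_input_size])
l1 = tf.nn.tanh(tf.matmul(x, W1) + b1)
l2 = tf.nn.tanh(tf.matmul(l1, W2) + b2)
output_vec = tf.matmul(l2, nn_out_weights)        # the prediction
interface_vec = tf.matmul(l2, interface_weights)  # how to touch memory next step
```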
Now let me answer some questions, because then we can get to the fun part. Let me see who's here; we've got 408 people, all right. Is it better to have loops in neural networks? I already answered that one. Can you tell me the difference between fine-tuning and transfer learning? Fine-tuning is kind of a vague term; you could think of transfer learning as fine-tuning, in fact you could think of all machine learning as fine-tuning, since we're iteratively improving our model. But transfer learning specifically is when you train a network on one task and then use that pretrained model to learn a different task, so you're transferring the learnings from one task to another. What if the matrix is larger than the amount of RAM you have? That would probably not happen unless you have both a really bad computer and a gigantic model, but if it did, you'd overflow your RAM and your system would notify you with a pop-up, unless it was, I don't know, Debian or something, in which case you're just screwed. Do you use radial basis functions as kernels in neural networks? I haven't seen those used a lot; they're rarely used, and I mostly recall radial basis functions being used with Markov models. Is MIDI data a sequence that can be fed into a looping neural network? I'm trying to create a MIDI generator. Yes; I've got like three videos on that: generate music in TensorFlow, how to build an AI composer, and how to generate music in TensorFlow live. Check out all those videos.

Okay, back to this, back in black. All we did was define all of the components we're going to be using: the controller, the weights for the output, the weights for the interface vector, the one read head we're using and the weights for it, the memory bank, which is a memory matrix, and the temporal link matrix and its associated weights. That's it; that's what we just defined. Now we're going to go right into the step function. Notice I've got two functions up here; each of those is for a different attention mechanism. We have three attention mechanisms for the controller to decide how we're going to deal with this memory bank, how we update it and read from it. But let's go straight into the step function first, and then we'll talk about the details of those helper functions, so we start at a high level and go increasingly lower level.

Here's what happens. The step function runs at every time step; when we build our session at the end, we run this step function continuously. At every time step the controller receives an input vector from the dataset and emits an output vector, but it also receives a set of read vectors from the memory matrix at the previous time step, via the read heads. Then it emits an interface vector that defines its interaction with the memory at the current time step. Notice I'm adding things on iteratively. Remember how I said you have one input and one output, and then that actually you have two outputs, the predicted output and an interface vector that defines how you interact with the memory bank at the next time step? Well, we've not just got two outputs, we've got two inputs. One input is the data itself, and the other is the read vector from the previous time step. Think about that for a second. This is not a recurrent network; the controller is a feed-forward network, and we're not feeding in the state of the controller from the previous time step, but we are feeding in the read vector from the previous time step. So the controller is feed-forward, but as a whole there is recurrence happening. You could think of the differentiable neural computer as a whole as a recurrent network, not in the traditional sense of feeding the network's previous state back in, but in the sense that we feed the read vector from the previous time step, from the external memory store, back into the input. In that way it's recurrent. So that's what happens at every time step: it receives an input vector from the dataset, emits an output vector and an interface vector, and at the next time step the input is the next vector from the dataset plus the read vector, and we repeat that over and over.

So let's go through programmatically what this looks like. The first step is to reshape our input so it fits the size of our network: we've got our input data, and remember we also have the read vectors from the previous time step. We take that input and forward propagate it through the network, and remember it's a two-layer network, so we do a matrix multiplication with the weights and biases, with tanh as the activation function at each layer.
We take that second-layer activation and use it, with another matrix multiplication, to compute two things: this is our output vector, and this is our interface vector. Remember, there are two outputs, both the normal output and the interface vector. So now we have both of our outputs; we've forward propagated, and it's time to use these two vectors to learn and do more things with.

Then we've got this one line. Remember I said there are magic numbers in three parts of this code? We talked about one; this is another. The partition, what is this? The partition is what we're going to use to define our interaction with the memory matrix. Think of it as a list or array with ten parts to it; it's one big vector of indices that we use to convert the interface vector into a set of keys and strengths and vectors. Let me talk about how we split those up, but first, a couple of questions. Can we use neural nets to make something like a PCB layout designer, because the currently available ones are crap? Yes. How would you do that? You'd probably want a generative model, a generative adversarial network. Now, I'm not familiar with the components of how a PCB works, but I assume you can think of it as a graph problem as well, because you have different components and data flowing through the hardware in a certain way, some kind of mapping. Think of it as a graph traversal problem: you feed in all these graphs, it's supervised, you have the correct paths for all the examples, and then given those examples you can generate a new path, which would be a new PCB layout design. That's just one way, but yes, for sure.

Okay, back to this. We want to convert our interface vector into a set of read and write variables, and we use the partition to do that. What do I mean? Let me back up. We have our controller, and it outputs two things: our output vector, which is our prediction, and our interface vector, which is meant for the network to decide how to interact with the memory bank at the next time step. But how do we know exactly how to interact with the memory bank? We use this partition. What the partition does is define sizes for ten different placeholders, and those ten placeholders become these ten variables right here. Before we initialize those variables, we define their sizes, and they all come out of the interface vector. Think of the interface vector as one big vector; we use the partition to split it into the sizes we've predefined, using TensorFlow's dynamic_partition function, and each piece becomes a very important component that we then use to update our memory bank.
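A sketch of what that partition-and-split step could look like. The index layout mirrors the ten pieces described here and in the paper, but the exact constant in the file may differ; the zero interface vector is just a stand-in for the controller output.

```python
import tensorflow as tf

num_heads, word_size = 1, 64
interface_size = num_heads * word_size + 3 * word_size + 5 * num_heads + 3

interface_vec = tf.zeros([1, interface_size])   # stand-in for the controller's interface output

partition = tf.constant(
    [[0] * (num_heads * word_size) +   # read keys
     [1] * num_heads +                 # read strengths
     [2] * word_size +                 # write key
     [3] +                             # write strength
     [4] * word_size +                 # erase vector
     [5] * word_size +                 # write vector
     [6] * num_heads +                 # free gates
     [7] +                             # allocation gate
     [8] +                             # write gate
     [9] * (num_heads * 3)],           # read modes (backward, content, forward)
    dtype=tf.int32)

(read_keys, read_str, write_key, write_str, erase_vec, write_vec,
 free_gates, alloc_gate, write_gate, read_modes) = \
    tf.dynamic_partition(interface_vec, partition, 10)
```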
So we dynamically partition it into the read keys and read strengths, three write variables, three gate variables (which I haven't talked about yet, but I will), and a set of read modes. There are quite a few parts here. Once we have these variables, we define their shapes and sizes. We've initialized them as parts of the interface vector because we're going to update our memory bank using them. So we have our set of read variables and our set of write variables. The keys are used to find the similarity with what we have in our memory bank, so that we can read from whatever is most similar, or maybe whatever is least similar; that's up to the network to learn with its own weights, and it depends on the use case. A lot of these things, like why it chooses one place or another, depend on the data; it learns how to interact with memory. We're just defining the blueprint here, and then it learns how to interact with the memory matrix, though we can definitely improve the blueprint. So the key is like the content address in a dictionary, and then the strength value helps us initialize our read weights. For the write variables it's the same: we have a write key and a write strength that help initialize the write weights, and then we have two vectors, because there are two components to writing: first we erase, then we write. It's like an overwrite: before we write to something, we erase what's already there and then write over that empty space. We initialize these using the sigmoid function and the softplus function.

So now we have our read variables and our write variables, created from our interface vector, which was one of our two outputs, and now we define our gates. What are these? Recall from LSTM networks and GRU networks that we had gates. These gates are just scalar values, single numbers, but we call them gates because they define the degree to which we perform a certain operation. Each LSTM cell has a forget gate and two other gates, and a GRU unit has an update gate and a reset gate, and those gates define the degree to which we perform those operations, whether that's reset or update. And they were differentiable too: whenever we differentiated our LSTM network, we updated those gate values, so it erased and wrote as necessary, however it would best converge. It was up to the network to learn what those gate values should be. Everything is learned here: the write vectors, the gates, the keys, all of it is learned through differentiation, through backpropagation, which is a beautiful property of this architecture. We don't define anything statically except for those two lines of magic numbers I showed you.
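A small sketch of how those raw pieces could be squashed into usable ranges, roughly as described: strengths through softplus so they stay at or above one, the erase vector and gates through sigmoid so they sit between 0 and 1, and the read modes through a softmax so they sum to one (the softmax detail is from the paper). The zero tensors stand in for the partition outputs from the previous sketch.

```python
import tensorflow as tf

num_heads, word_size = 1, 64

read_keys  = tf.reshape(tf.zeros([num_heads * word_size]), [num_heads, word_size])
read_str   = 1 + tf.nn.softplus(tf.zeros([num_heads]))        # read strength >= 1
write_key  = tf.reshape(tf.zeros([word_size]), [1, word_size])
write_str  = 1 + tf.nn.softplus(tf.zeros([1]))                 # write strength >= 1
erase_vec  = tf.nn.sigmoid(tf.zeros([1, word_size]))           # what to erase, in (0, 1)
write_vec  = tf.reshape(tf.zeros([word_size]), [1, word_size]) # what to write
free_gates = tf.nn.sigmoid(tf.zeros([num_heads]))              # free previously read rows?
alloc_gate = tf.nn.sigmoid(tf.zeros([1]))                      # write somewhere new vs. found by content
write_gate = tf.nn.sigmoid(tf.zeros([1]))                      # write at all this step?
read_modes = tf.nn.softmax(tf.reshape(tf.zeros([3 * num_heads]), [num_heads, 3]))
```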
So then we have three gates. The first is the free gate, which defines the degree to which locations read by the read heads will be freed. Then the allocation gate, which is the fraction of writing being allocated to a new location. And then the write gate, which is the amount of information to be written to memory: how much do we want to write? So we have one gate for freeing read locations, one gate for writing, and one gate for dynamic memory allocation. At every time step we're deciding how much memory we want to allocate to a head, whether it's a read or a write; it's not a static allocation, it's dynamic, and it changes at every time step.

Then we have one more variable from our partition, the last one: the read modes. What are these? The read heads can use gates called read modes to switch between content lookup using a read key, and reading out locations either forwards or backwards in the order they were written. The mode tells us how we should read; it's an addition to the read weights that helps define whether we should read in a forward or backward direction, and again, the network learns how to do that. We just define that there should be a direction to reading, create a differentiable value, and it learns to update it through training.

And when I said we were going to dynamically allocate memory, this is what I mean: the usage vector is what helps us dynamically allocate memory. We also have a retention vector, a helper used to calculate the usage vector, and the usage vector is asking: what is available to write to, so that we can then write to it?

Let me do a quick high-level refresher. We had our input; we forward propagated it through the network; we computed our output vectors, both the predicted output and the interface vector; then we partitioned that interface vector, using the partition variable and the dynamic partition function, into a set of parameters with which we can interact with our memory bank at the next time step. We then initialized, flattened, and resized all the read/write vectors and the gates, which control the degree to which we're reading and writing, as well as the read modes, which are the direction we read in. And now we actually perform the reading and the writing.

We'll start with the writing, and then we'll do the reading. For the writing, we compute our usage vector, which we use to dynamically allocate memory, and then we define our set of write weights for the write head: we retrieve the writing allocation weighting, which decides where to write to, and then we define our write weights using both the allocation weighting and the allocation gate. That gives us our write weights: how much space to dedicate to writing, and where to write to.
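A NumPy sketch (illustrative, not the file's exact TensorFlow code) of how that write weighting is put together from the pieces just described: the free gates shrink usage, usage gives an allocation weighting over rows, and the allocation gate and write gate blend that with the content-lookup weighting and scale it. All of the inputs below are random stand-ins.

```python
import numpy as np

num_words = 8                                    # tiny memory for illustration
usage = np.random.rand(num_words) * 0.5          # stand-in for last step's usage
free_gate, alloc_gate, write_gate = 0.9, 0.7, 0.8
prev_read_w = np.random.dirichlet(np.ones(num_words))    # last read weighting
prev_write_w = np.random.dirichlet(np.ones(num_words))   # last write weighting
content_w = np.random.dirichlet(np.ones(num_words))      # from content lookup

# retention: which rows we are *not* freeing, based on the free gate
retention = 1 - free_gate * prev_read_w
# updated usage: rows just written to become more used, freed rows less so
usage = (usage + prev_write_w - usage * prev_write_w) * retention

# allocation weighting: the least-used rows get the most allocation
order = np.argsort(usage)                        # ascending usage
alloc = np.zeros(num_words)
cum = 1.0
for idx in order:
    alloc[idx] = (1 - usage[idx]) * cum
    cum *= usage[idx]

# final write weighting: blend "write somewhere new" with "write where the
# content matched", then scale by how much we want to write at all
write_w = write_gate * (alloc_gate * alloc + (1 - alloc_gate) * content_w)
```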
Then we can update our memory matrix using that. First we erase, using the erase vector, then we write, and both of those operations happen through matrix multiplication. First erase, then write, and that's it for writing. Now we're going to read. (And if you sit through this, I'm going to rap at the end, so get ready; we're almost done, and then I'll answer questions.) I know there are a lot of parts here, but it's an amazing architecture and it's definitely worth looking into.

So we've written to our memory matrix and now we're going to read from it. As well as writing, the controller can read from multiple locations in memory. Memory can be searched based on the content of each location using content lookup, or the associative temporal links can be followed forwards and backwards to recall information in the order it was written, or in reverse. That relies on the temporal links, that matrix of arrows up here; it's essentially an attention mechanism for how we read and write to memory based on what's happened before. So we define our link matrix, that third big matrix I talked about, using the write weights, and then we use the link matrix and our write weights to define our precedence weight. The precedence weight was used to create the link matrix, and then we update it right here: we've used it to define the link matrix, and now we can just update it.

Next we define our three read modes, remember: forward, backward, and content lookup, using matrix multiplications on our read weights. All the modes are differentiable. For lookup, we initialize it using content lookup, and then we can initialize our read weights from those three modes, and then we create our read vector using the read weights and the memory matrix: we multiply them together to get the read vector. Then we return the sum of our output and that product, which is what we've now read from memory, and we feed it back in at the next time step. And this run function right here is essentially just creating the outputs for each of the inputs, using unstack, as a sequence; it just generates our associated output sequence, a small helper function.

Now let me go back and define the other two attention mechanisms. We talked about one of them, the temporal linkage: how we define the order in which we update our memory bank. But there are two more.
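Before those, here is a NumPy sketch of the write and the read just described: erase then write to the memory matrix, update the temporal link matrix and precedence weighting, then blend the three read modes into a read weighting and pull out a read vector. Shapes and values are tiny random stand-ins, not the real code.

```python
import numpy as np

N, W = 8, 5                                      # tiny N x W memory
M = np.random.rand(N, W)                         # memory matrix
write_w = np.random.dirichlet(np.ones(N))        # write weighting (as in the previous sketch)
erase_vec = np.random.rand(W)                    # what to erase in the written rows
write_vec = np.random.rand(W)                    # what to write
prev_read_w = np.random.dirichlet(np.ones(N))    # last step's read weighting
content_w = np.random.dirichlet(np.ones(N))      # content-lookup weighting for reading
read_modes = np.array([0.1, 0.6, 0.3])           # backward, content, forward
link = np.zeros((N, N))
precedence = np.zeros(N)

# 1) write: erase first, then add the new content (both are outer products)
M = M * (1 - np.outer(write_w, erase_vec)) + np.outer(write_w, write_vec)

# 2) temporal link matrix: entry (i, j) means "row i was written right after row j"
link = (1 - write_w[:, None] - write_w[None, :]) * link + np.outer(write_w, precedence)
np.fill_diagonal(link, 0.0)
precedence = (1 - write_w.sum()) * precedence + write_w

# 3) read: follow the links backward or forward, or use content lookup, blended by the read modes
forward_w = link @ prev_read_w
backward_w = link.T @ prev_read_w
read_w = read_modes[0] * backward_w + read_modes[1] * content_w + read_modes[2] * forward_w
read_vec = M.T @ read_w                          # what gets fed back in at the next step
```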
The first is content lookup. It's kind of like recommendation systems, or word vectors, where you're finding the distance between two vectors; we're doing that with our controller and our memory bank. Whenever we're reading from our memory bank, we have a vector here, the read key, and then we have all the rows in the memory bank, and we want to find the similarity between the key and all of them, and then find the one that's most similar, or least similar; it's up to the network to learn what level of similarity it needs. It learns where to read from using this content lookup attention mechanism. So we compute the L2 norm, using the key. Let me read this out: a key vector emitted by the controller is compared to the content of each location in memory according to a similarity measure, and the similarity scores determine a weighting that can be used by the read heads for associative recall, or by the write head to modify an existing vector in memory. So we have a read key, and we compute the L2 norm, which is a standard way of normalizing a vector: the square root of the sum of the squared components. We do that for both the memory matrix and the key, then we perform a matrix multiplication to compute the similarity, and then we squash it with a softmax and return that. That gives us a probability value, a percentage of how similar the key is to each row of the memory matrix. That's our second attention mechanism.

And the third attention mechanism is the actual dynamic allocation of memory as we train the model. What this does is retrieve the writing allocation weighting based on the usage free list, which is the usage vector. The usage of each location is represented as a number between 0 and 1, and a weighting that picks out unused locations is delivered to the write head. It's independent of the size and contents of the memory, which means you can train the DNC on a task using one size of memory and later upgrade to a larger memory. We're not telling it "you should allocate memory of size X"; it decides that for itself, and the memory could be arbitrarily big or small, theoretically even unbounded. So we take the usage vector, sort it in ascending order, compute the cumulative product of its components, and initialize our allocation weights from those values; then for each usage vector we flatten it, add it to the weight matrix, and return the allocation weighting for each row in memory. It's essentially a matrix of allocation values that we can use to decide how best to allocate memory.

So those are our three attention mechanisms. I know it's a lot to take in. Now let's get to our main function and talk about how we actually use this. We've defined everything: our attention mechanisms, our hyperparameters, the components of our differentiable neural computer, and now it's time to actually train this thing.
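Before the training loop, here is a NumPy sketch of that content-lookup step: L2-normalize the key and every memory row, take dot products as the cosine similarity, sharpen by the read or write strength, and softmax into a weighting over rows. (The allocation weighting follows the sorted-usage, cumulative-product loop sketched a bit earlier.) Names and sizes are illustrative.

```python
import numpy as np

def content_lookup(memory, key, strength):
    """memory: (N, W) matrix, key: (W,) vector, strength: scalar >= 1."""
    norm_mem = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    norm_key = key / (np.linalg.norm(key) + 1e-8)
    similarity = norm_mem @ norm_key                   # cosine similarity per row
    scores = strength * similarity                     # sharpen by the strength
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> weighting over rows
    return weights

memory = np.random.rand(8, 5)
key = np.random.rand(5)
print(content_lookup(memory, key, strength=2.0))       # sums to 1 over the 8 rows
```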
We start by defining these parameters, which are used to initialize our data randomly: this random randint call initializes our input data and our output data, which are sets of binary numbers, ones and zeros. Then we initialize our TensorFlow session, initialize a differentiable neural computer, and run it. The run function computes the output: it runs that huge step function we defined, at every time step, for however many iterations we want, and it outputs the prediction, ones and zeros, values like that. Once we have our predicted output and our expected output, we compute the loss between them, and we do that with the sigmoid cross-entropy function; that gives us our loss. Then, and this is use-case dependent, we don't always have to do it, but we're doing it here: we regularize each layer of our controller, which is similar to the L2 norm, and what regularization does is sometimes improve convergence; we use the loss to do that. Once we have that, we add the regularizers in and then optimize with Adam, which is a form of gradient descent. And when we do gradient descent, it's applied to the controller, the memory bank, the head, the three attention mechanisms, and the temporal linkage matrix; everything is differentiable. Once we've done that, we initialize all of our variables, then initialize our input and output data using the variables we defined right up here, and then for each iteration we feed each input/output pair in via the feed_dict and run our session with our loss, our optimizer, and our output, and that gives us our results.

Let me run this again. This is the code, but I literally just took it from a Python file and pasted it into a Jupyter notebook, so let me run it from the command line instead: python dnc.py. We get a bunch of warnings, because there are a lot of deprecations happening, but it runs.
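A compressed sketch of that training loop in TensorFlow 1.x style, as in the original file: sigmoid cross-entropy between predicted and expected output, Adam as the optimizer, and a feed_dict per iteration. The single matrix multiply below is a trivial stand-in for the DNC's unrolled step function, just so the loop runs end to end.

```python
import numpy as np
import tensorflow as tf

seq_len, seq_width, iterations = 10, 4, 1000

graph = tf.Graph()
with graph.as_default():
    i_data = tf.placeholder(tf.float32, [seq_len, seq_width])
    o_data = tf.placeholder(tf.float32, [seq_len, seq_width])

    # stand-in for the DNC's run()/step function producing one output per step
    W = tf.Variable(tf.truncated_normal([seq_width, seq_width], stddev=0.1))
    output = tf.matmul(i_data, W)

    loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=output, labels=o_data))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(iterations):
        # random one-hot input/output pair, as described above
        ins = np.zeros((seq_len, seq_width), dtype=np.float32)
        outs = np.zeros((seq_len, seq_width), dtype=np.float32)
        ins[np.arange(seq_len), np.random.randint(0, seq_width, seq_len)] = 1
        outs[np.arange(seq_len), np.random.randint(0, seq_width, seq_len)] = 1
        l, _, preds = sess.run([loss, optimizer, output],
                               feed_dict={i_data: ins, o_data: outs})
        if i % 100 == 0:
            print(i, l)
```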
Let me see what everyone's saying here. Okay, great. I know, I should write a book; I did write a book, it's called Decentralized Applications, and I'll probably write another one later, but I've got a lot of other things to do first. Other questions: can we run the DNC on a common home PC? Yes, you can. With any kind of computer you can run this, because look at it, it's just binary scalar values; you don't need a GPU, just a CPU, and any small numerical data like this runs easily on a CPU on a home PC. But if you want to apply this to graph traversal problems like DeepMind did, like solving the London Underground shortest path, then you'll want a GPU. You could still do that on a home PC; you don't need a cluster for it. When would you need a cluster? If you're applying this to natural language, if you want to create a question answering system, then you want to do it in the cloud. I have a video on cloud options coming out in two days, so I'll just say: use FloydHub.

So we've got two minutes, and I'm going to spend the rest of this session answering questions, but I also want to play this in the background because it's cool. This is what you can do: associations between unrelated datasets, whether it's natural language, graph traversal, image recognition, speech recognition, speech generation, audio generation, graphics generation; all sorts of different domains can be learned if we have an external memory store that acts as the blueprint for these learning systems, these neural networks. We separate the processing from the memory, and we get results that make our heart sing.

For the last set of questions: do you know if they've invited people for the AI nanodegree? I don't know. Will there be a livestream next week? This is the last livestream for a while. I've been doing livestreams for the past 17 weeks and it's been super fun, but instead I'm going to be doing something similar as pre-recorded videos, me typing things out. It's not going to be live, at least for a while, but you're still going to get amazing content like this; it's just not always going to be live, and I will do live sessions again in the future, don't worry. But this differentiable neural computer is the future; this idea of separating processing and memory is the future. How much data does this need? It depends on your use case. And yes, I definitely need you guys; I'm going to keep having livestreams, just not for the time being.

Okay, let me wrap as well, to end this out, so throw on a beat and I'll rap about something. Thank you guys for staying through the session, you're heroes. I'm very proud of you for staying through this; it's non-trivial, and the fact that you stayed through it makes me very proud to be part of this community we're all in together. So whenever you're ready, go ahead and play that beat. "I love the DNC, I try to do it like I want to be me, I don't care about all these models you see, I only want to use something that takes three different datasets and puts them together, you see. Look, I take a one value, then I converge, the beat stops but it doesn't matter man, I splurge, I burst out into the world like I'm the king, don't stop and look at my machine learning man, it's like Bing, the worst service ever created by Microsoft, oh I just diss Microsoft man, don't look before I get shot." Okay, so that's the rap. Thank you guys for showing up. For now, I've got to go create some more amazing content for you. Thank you for showing up, I love you guys, and thanks for watching.
Info
Channel: Siraj Raval
Views: 44,392
Keywords: differentiable neural computer, neural turing machine, deepmind, google, AI, ML, DL, deep learning, machine learning, artificial intelligence, programming, coding, python, ruby, java, C++, live coding, live programming, research, computer science, data science
Id: r5XKzjTFCZQ
Length: 63min 56sec (3836 seconds)
Published: Wed May 10 2017