Attention in Neural Networks

Video Statistics and Information

Captions
Say I give you the Deep Learning book along with the question: how is convolution equivariant with respect to translation? What would you do to answer this question? Well, one way is to read the entire book and, assuming you remember everything you've read, try to answer the question. But there's a better way. Since it's a question on convolution, I flip to the chapter on convolutional neural networks, find equivariance listed as one of the properties, and read that page, or at least the part of the page that covers it. Which do you think is the faster method? If we read the entire text, as in the first method, answering the question may take us a few weeks, but with the second method the same can be done within a few minutes. That's a very big difference. Furthermore, an answer formed by reading the entire book may be vaguer, as it is based on too much information. What did we do differently here? In the former case we didn't focus on any part of the book specifically, whereas in the latter case we focused our attention on the chapter on convolutional neural networks, and then further focused our attention on the part where the concept of equivariance is explained. This second approach is the exact thought process many of us humans would take; it's quite intuitive.

Given this example scenario, we can now better define attention. Attention mechanisms in neural networks are somewhat similar to attention in humans: they focus in high resolution on certain parts of the input while the rest of the input is in low resolution, or blurred. In this video I'm going to talk about the attention mechanism applied to image inputs.

Let's take a look at visual attention at a higher level. Consider the problem of determining appropriate captions for an input image. Based on the paper "Show, Attend and Tell", this normally consists of two steps: first, encode the image into an internal vector representation h using a convolutional neural network, and then decode h into word vectors signifying the
captions using a recurrent neural network. The problem with this method is that when generating a single word of the caption, the LSTM looks at the entire image representation h every time. This is not very efficient, as we usually generate different words of a caption by looking at different, specific parts of an image. To solve this problem, we create n different non-overlapping sub-regions. Hence h_i would be the internal feature representation used to generate the i-th word; it is not necessarily the representation of the i-th region of the original image (I'll explain this in a bit). For now, the figure on screen is a high-level diagram of attention: when the decoder decides on a caption, for every word it only looks at specific regions of the image, leading to a more accurate description.

Now that's good, but how does it decide exactly which region or regions to consider? This is the crux of the attention mechanism. An attention unit considers all sub-regions and the context as its input, and it outputs the weighted arithmetic mean of these regions. The weighted arithmetic mean is the inner product of the actual values and their probabilities. How are these probabilities and weights determined? They are determined using the context c. The context represents everything the recurrent neural network has output until now.

Let's take a closer look at what happens. We have the input regions y from the convolutional neural net and the context c from the RNN. These inputs are applied to weights, which constitute the learnable parameters of the attention unit; this means the weight vectors update as we get more training data. We apply a tanh activation so that very high values have very small differences between them and are close to one, and very low values also have very small differences and are close to minus one. This leads to a much smoother choice of regions of interest; within each sub-region it is more fine-grained, so to speak. Note that we don't necessarily have to apply a tanh function; we only need to ensure that the regions we output
are relevant to the context. In the simplest form, this similarity can be determined with a simple dot product between the regions y and the context c: the more similar they are, the higher the product. Hence the output is guaranteed to weight the more relevant regions y_i higher. The difference between using the simple inner product and the tanh function is the granularity of the output regions of interest: tanh is more fine-grained, with less choppy, smoother parts of the sub-regions chosen. Regardless of how they are calculated, these scores are then passed through a softmax function, which outputs them as probabilities s. Finally, we take the inner product of this probability vector s and the sub-regions y to get the final output z of relevant regions of the entire image. Understand that the probabilities s correspond to the relevance of the sub-regions y given the context c.

Now, there are two types of attention mechanisms: the first is soft attention, and then we have hard attention. The main difference is that in soft attention, the relevant region z consists of different parts of different sub-regions y, while in hard attention, the relevant region z consists of only one of the regions y. I'll explain them both in detail. The entire mechanism of attention that I described until now is all soft attention: z has relevant parts of different regions. Soft attention is deterministic. So, deterministic: what's that? A system is said to be deterministic if the application of an action a on a state s always leads to the same state s'. A dumb example: you're at a corner of your room at coordinates (0, 0), facing forward. Consider an action a which is moving 5 feet forward. The system is now at a new state with the coordinates (5, 0), still facing forward. No matter how many times you stand at the corner of your room, facing forward, and walk five feet forward, you will always end up 5 feet from the corner and facing forward. Try it; trust me, it works. Hence, the system is
deterministic. Let us apply the same concept to soft attention. Initially we have an image split into a number of regions y, with an input context c; this is our initial state s. On applying soft attention, we end up with a localized image representing the new state s'. These regions of interest are determined from z. The ROIs will always be the same regardless of how many times we execute soft attention with these same inputs, because we consider all the regions y anyway to determine z.

Now consider hard attention. Looking at the architecture, hard attention is very similar to soft attention; however, instead of taking the weighted arithmetic mean of all regions, hard attention considers only one region, chosen randomly. So hard attention is a stochastic process. Now, stochastic: when you hear the word stochastic, think about randomness. In such a stochastic process, performing an action a on a state s may lead to different states every time. A typical example is a board game with dice, like snakes and ladders: the initial state is the position of the players, the action is rolling the dice, and depending on the roll there are multiple possibilities for the next board state. What makes hard attention stochastic is that a region y_i is chosen randomly with probability s_i. This means that the more relevant a region y_i as a whole is to the context, the greater the chance it is chosen for determining the next word of the caption. Using the words of the caption output until now by the RNN, that is h, along with the current regions of interest in the image determined by the attention mechanism, the RNN now tries to predict the next word in the caption.

As far as performance is concerned, in the paper "Show, Attend and Tell", released by the University of Toronto and the University of Montreal, results vary with the dataset: soft and hard attention both perform decently well, with hard attention performing slightly better. This is pretty cool, right? So where else can
we use attention? Attention is not only used for image inputs. For example, neural machine translation (NMT) systems are used to translate one language into another. Words are fed in a sequence to an encoder, one after another, and the sentence is terminated by a specific input word or symbol. Once complete, this special signal initiates the decoder phase, where the translated words are generated. Another cool application is Microsoft's attentional generative adversarial network, or AttnGAN, which can create images from text through natural language processing; it can perform fine-grained tasks like generating parts of an image from a single word in the description. Another application appears in the paper "Teaching Machines to Read and Comprehend", where the authors do the same thing I talked about at the beginning of the video: a recurrent neural network takes some text and a question as input and is made to output an answer.

Here are some things to remember: attention involves focusing in high resolution on certain parts of an input while the rest of the input is in low resolution, or blurred; two types of attention are soft attention and hard attention; soft attention is deterministic, while hard attention is stochastic; and attention can be used for non-image inputs, as in neural machine translation, attention GANs, and answering questions from text.

And that's all I have for you now. Hope you guys got some newfound understanding of attention and its applications. I have left a link to the main paper, "Show, Attend and Tell", along with other papers and blog posts in the description down below. Don't forget to give the video a thumbs up and subscribe for more awesome content. Please subscribe. Please. You did it, right guys?
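The soft-attention pipeline described in the video (score each sub-region against the context with a tanh unit, softmax the scores into probabilities s, then take the weighted arithmetic mean of the sub-regions) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the parameter names and shapes (W_y, W_c, w) are assumptions for the sketch.

```python
import numpy as np

def soft_attention(regions, context, W_y, W_c, w):
    """Soft attention over image sub-regions.

    regions: (n, d) matrix of sub-region features y_i
    context: (c,) context vector from the RNN
    W_y, W_c, w: learnable parameters (fixed here for illustration)
    Returns z, the relevance-weighted mean of the regions, and s, the probabilities.
    """
    # Score each sub-region against the context: e_i = w . tanh(W_y y_i + W_c c)
    e = np.tanh(regions @ W_y.T + context @ W_c.T) @ w
    # Softmax turns the scores into probabilities s_i (relevance of each sub-region)
    s = np.exp(e - e.max())
    s = s / s.sum()
    # Weighted arithmetic mean: z = sum_i s_i * y_i
    z = s @ regions
    return z, s
```

Because every region y_i contributes to z in proportion to s_i, the same inputs always give the same z, which is why soft attention is deterministic.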
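Hard attention, by contrast, replaces the weighted mean with a single random draw: one region y_i is sampled with probability s_i, making the process stochastic. A minimal sketch, assuming the probabilities s have already been computed by the softmax step described above:

```python
import numpy as np

def hard_attention(regions, s, rng=None):
    """Hard attention: sample one sub-region index i with probability s_i.

    regions: (n, d) matrix of sub-region features y_i
    s: (n,) probabilities over the sub-regions (must sum to 1)
    Returns that single region's features and its index.
    """
    rng = rng if rng is not None else np.random.default_rng()
    # Stochastic step: repeated calls with the same inputs may pick different regions
    i = rng.choice(len(regions), p=s)
    # The output z is that single region's features, not a weighted mixture
    return regions[i], i
```

The more relevant a region is to the context (the larger s_i), the more often it is the one selected, which matches the video's description of hard attention.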
Info
Channel: CodeEmporium
Views: 125,489
Keywords: Machine Learning, Data Science, Deep Learning, Attention in neural networks, soft attention, hard attention, attention on images, attention on audio, attention mechanism
Id: W2rWgXJBZhU
Length: 11min 18sec (678 seconds)
Published: Fri Mar 02 2018