Lecture 9 | (1/3) Convolutional Neural Networks

Captions
So, what we're going to be doing: the trailing portion of the last lecture I have not had time to record. Part of it has to do with the fact that we haven't managed to coordinate with MediaTech; we're overbooked, and it turns out I should book my rooms sufficiently far in advance for them to find the time. We'll figure out how to record that final bit and put it out this week. It covers batch normalization and dropout, basically the two key remaining topics.

We're going to move on: we are going to talk about scanning. The story so far: we've seen that multi-layer perceptrons are universal function approximators (Boolean functions, classifiers, regressions) and that they can be trained through variants of gradient descent. The kinds of models we've considered so far look something like this: you give them an input, an image maybe, as a vector. I'm calling it a vector, but in your homeworks you don't necessarily have to arrange your data as a vector; it is conceptually a vector. When we asked you to use context to analyze any particular input vector, you were given recordings, which were actually matrices of vectors, and one of the things you were asked to do was to consider K vectors of context on either side in order to classify any particular vector. The obvious thing to do is to simply stack all of these together into one really long vector and pass that to your classifier. So the perceptrons in the first layer would actually be looking at this entire block of data, but what are they really classifying? The vector in the middle. This is how we actually implemented things, and it is of relevance to today's lecture.

Regardless of how you thought about it, the kinds of models we've seen so far have this structure: they are feed-forward, they look at a single block of data, and they perform classification. You'd be looking at one segment of your input and deciding whether it was class 1, or class 135, or whatever.

But now here's another problem. I'll give you a recording of this kind. As in the case of your homework, it has been converted to a sequence of vectors: a spectrographic representation. The x-axis is time, the y-axis is frequency, and the intensity at any pixel gives you the spectral value at that time and frequency, the energy in the signal at that time and frequency. Now I ask you a simple question: does the signal include the word "welcome"? Using the naive approach, how would we do it? Simple enough: you pass this entire block to a multi-layer perceptron, and you can train the MLP to decide whether this recording has the word "welcome" in it.

But then here's the problem. I have these two recordings, and each of them has the word "welcome", but in different locations. If I pass the second recording to an MLP that has been trained to find "welcome" in the first recording, will it find it in the second one? Why not? Basically, we vectorized. If I think of the entire input as a vector (take my entire spectrogram and unravel it, or just think of one row of it, so you don't even have to consider all the rows), then in one case the signature of "welcome" was out here, and in the other case it was out there. If I think of these as points in some coordinate space, one data point lies on one plane, invoking some set of components, and the other lies somewhere entirely different, invoking a completely different set of components. They are completely disparate, although it's the same pattern. So a network that recognizes the word "welcome" in the first instance is not going to recognize the word in the second instance. What we need is a network that will fire regardless of where the word "welcome" is in the recording.

I can give you another instance of the same problem, this time in two dimensions. Here are all these pictures: how many of them have flowers? Four of them, right, although some would argue that somewhere on that tractor there is a flower; but anyway, four of them have flowers and the rest don't. Now I ask you this question: I'm going to give you images of this kind and ask, does it have a flower? Same thing: if I train a multi-layer perceptron that can detect flowers in the picture on the left, that MLP will not be able to detect flowers in the picture on the right, for the very same reason. In one case it's one set of components that's invoked, and in the other case it's a completely different set of components; the first flower picture represents a flower in a completely different subspace than the second one. So we need a network that will fire regardless of the precise location of the object. In other words, we need shift invariance.

Basically, there are many problems where the location of the pattern is not important, only its presence, and conventional MLPs are sensitive to the actual location. Different locations represent different subspaces, so if I take the pattern and shift it by one position (a movement by one sample in a time signal, or a shift by one pixel in an image), it ends up in a completely different subspace, and something that recognizes the flower in one is not going to recognize the flower in the other. So you need shift invariance: the network should fire even if the pattern is shifted, whether it's the flower or the word "welcome".

So now let me restate the problem. I have a recording of this kind, and I want you to tell me whether the word "welcome" has occurred in it. How would you do it? I'm asking you to detect, not to train. Brute force, right? That's exactly what I could do: I can build an MLP which is designed to detect the word "welcome", and then I can scan. I can just go left to right, look at each location, and determine whether the word "welcome" has occurred there or not. But the real question I'm asking is whether the word "welcome" occurred anywhere in the recording. So I have this collection of decisions, one taken at every position. How can I combine this collection of decisions to tell you whether the word "welcome" has occurred anywhere in the recording? Basically, I can take an OR; or, alternately stated, I can pick the maximum over all of these outputs. If the word "welcome" has occurred in even one location, the maximum is going to be one, or close to one, and the rest are going to be very small. If "welcome" never occurred, the largest value you see will be small. So I can take all of these decisions, put them through a max, and whatever pops out of the max is my classifier output.

Is a max enough? It doesn't even have to be a max. Think about it: the word "welcome" is not very precisely defined in time. Where exactly did I begin saying it, where did it end? It turns out this is not very precisely defined, so you would expect a "welcome" detector to detect the word here with high confidence, and if you shift by a little bit, by one sample, it's still going to kind of find the word "welcome"; it's only when you shift significantly that it won't. So instead of a plain max I can use a softer combination, or I can put a perceptron out there and look at a weighted combination. The weighted combination can say that if three adjacent locations all give me reasonably high scores for "welcome", I'm more likely to fire than if just one location gives me a very high score and everything else gives me zero, because that isolated score is more likely to be an outlier. Or I can do something even more sophisticated, although this is completely unnecessary: I can pass the entire collection of decisions through another MLP. But the whole point is that this is overkill; you wouldn't actually be doing this. The basic idea is simple: I scan and pick the largest value, or I ask whether I get really large values in a small range of locations, and that gives me my output.

So now what do I have? A scanning MLP: I'm scanning with an MLP. The only thing I had to decide was how wide the patch is over which I'm looking for the word "welcome". Then, for t going from 1 to the end, I scan through time; at each location I pull out a segment of my input, pass it through my MLP, and get one decision for every instant; then I put the collection of decisions through my max (or soft combination), and that gives me a single number. Fairly trivial. But this entire operation can be viewed as one giant network, because I'm looking at each segment and passing it through a network, then taking all of the outputs and putting them through the final combination, which is just one more element in my network. The only difference between your standard network and this one is that I have many subnets looking at the input, and all of these subnets are identical. I could have passed the entire input through one MLP, or I can have many copies of the same basic network each looking for a pattern and combine their outputs; both are just an MLP looking at the input, except one has many identical subnets. So you can think of this entire thing as one giant MLP: the different time positions are just like looking over adjacent blocks, and the final max is the final layer of the overall MLP.
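To make the scanning idea concrete, here is a minimal sketch in Python/NumPy. It assumes a hypothetical `mlp` callable that maps one flattened window to a detection score, and illustrative names (`spectrogram`, `welcome_mlp`, `N=30`) that are not the course's actual code; it is a sketch of the idea, not the homework implementation.

```python
import numpy as np

def scan_for_word(x, mlp, N):
    """Scan a recording x (a time-by-frequency array) with an MLP over N-frame windows.

    `mlp` is assumed to map one flattened window to a detection score; the overall
    output is the maximum over all positions, so the detector fires if the word
    occurs anywhere in the recording.
    """
    T = x.shape[0]
    scores = [mlp(x[t:t + N].ravel()) for t in range(T - N + 1)]  # one decision per position
    return max(scores)                                            # fire if it occurred anywhere

# usage sketch (all names are placeholders):
# found = scan_for_word(spectrogram, welcome_mlp, N=30) > 0.5
```

The final `max` is exactly the "pick the largest decision" combination described above; a soft combination or a small MLP over `scores` would slot in at the same place.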
Now, the same thing with images. I have this image, and I have to decide whether there is a flower in it. So I start off with a flower detector, which is an MLP. This flower detector looks at a patch of my image and gives an output, and then I can begin scanning: I keep scanning the image left to right, and at each location I look for a flower. When I've scanned the entire image like so, I get one decision at every location. I can take the entire collection of decisions and put it through a max: if the largest output is high enough, I have found at least one flower in the image, which is basically all I'm trying to do. Again, that could be a max, a softer combination, or an MLP; it doesn't really matter, the idea remains the same: you're scanning with an MLP. Now you're going over the two dimensions of the input, and i and j are the x and y positions. At each location you pull out a portion of the image corresponding to the size of the pattern you're searching for, the flower; that is what the MLP actually looks at; it gets passed through your MLP, you get one decision at every location, and you put those through the max or soft combination. And as before, this entire thing is just one giant network, a giant network in which many of the subnets are identical in structure, and the last element is the final component of the network. It's one giant MLP.

By the way, if you've been looking at the slide numbers, this slide is number 67 and the previous one was 56. What happened? I like to hide slides, simply because I'm not going to have time to go over all of them in class. Please take a look: they cover shortcuts for drawing diagrams to represent these figures, which we will use in the quizzes and in further explanations at various times. So please do take a look at the slides.

Yes, it could be just a perceptron. When I say "softmax" here, a better way to think of it is as a perceptron, where the perceptron looks at a weighted combination: if the weighted combination exceeds a threshold it fires, otherwise it doesn't. It's not the softmax aspect per se; it's the fact that you're looking at weighted combinations, which gives you information like "at least three consecutive positions must have high scores."

OK, so given this, how can I train this network? It's just one giant network, and if it's one giant network, here is the lovely thing: I don't have to go off and train a flower detector separately, or a "welcome" detector separately. I can just give you lots and lots of recordings and say "these ones have 'welcome' in them, these don't." This is an instance of what is called weak labeling; you get away with weak labels. A strong label is where I give you the exact segment of audio and say "this is the word 'welcome'", and you train with those. A weak label is: here is a bag, and somewhere in this bag is a positive instance. Think about it this way: you're trying to build a classifier for pictures of cars, and instead of giving you a picture of a car, I give you a basket that has many, many photographs in it, and I say some of these photographs are cars. Then I give you other baskets that also have many photographs in them, but none of them are cars. I'm not labeling every individual photograph; I'm just giving you these baskets and saying, find me a car detector. That's basically what you have here: you are not labeling every location in the recording; you're just saying "somewhere in this recording there is the word 'welcome'", and that information is sufficient, because we've composed the entire thing as one giant network. Does that make sense, or did I confuse anybody? If it didn't make sense, tell me now.

So anyway, now I can just use straight-up backpropagation. I can give you lots of recordings, some with "welcome", some without, I'll tell you which ones have the word and which ones don't, and you can use backprop to train the network. But there is a constraint, and the constraint is that these are shared-parameter networks. What this means is that when I built this giant network, whether for analyzing the speech or for analyzing the image, you basically had many copies of the same subnet looking at every block of the original input, and they are all identical. This giant network has many subnets that are identical; it is a shared-parameter network. What I mean by "shared parameter" is that you have imposed the structure that this parameter here and this parameter there are going to be identical. That makes it a shared-parameter network, and any gradient update or learning algorithm you use must take into consideration the fact that it is a shared-parameter network.

So how would you go about learning in a shared-parameter network? Consider a simple network with shared weights. This business of sharing parameters is not restricted to things like scanning; you can do it anywhere. I could decide, completely arbitrarily, that in this ad hoc network I want this weight and this weight to have identical values. If I impose this restriction externally, this is a shared-parameter network, because I'm forcing these two weights to always have the same value. In terms of learning, what happens? Both of these weights have the same value, call it w_S. If I perturb w_S a little bit (remember how we did gradient descent: we computed how much a small perturbation in any parameter would change the output), then both weights change, both of them influence the output, and thereby the divergence. So I can draw the influence diagram: the common value w_S influences both weights, and both weights influence the divergence. If I want to compute the derivative of the divergence with respect to the common value that both of these weights share, then, following this influence diagram (which by now you must be very comfortable with, having done this in the quizzes), the derivative of the divergence with respect to the common value is the derivative of the divergence with respect to the first weight, times the derivative of that weight with respect to the common value, plus the derivative of the divergence with respect to the second weight, times the derivative of that weight with respect to the common value. But those factors dw_ij/dw_S are just one, because the value is exactly, identically shared. So those second factors are 1, and the derivative of the divergence with respect to the shared value is simply the sum of the derivatives of the divergence with respect to all of the individual weights that are constrained to have the same value.

(Question.) It doesn't make any difference. When we say networks are universal function approximators, we're speaking of what they are capable of; it doesn't mean that any specific network is going to be able to model any given function. That's a capacity statement, not a statement about individual networks. (Question.) Because we are saying that both of these weights have the same value: literally, the statement is that w_ij^(k) equals w_mn^(l) equals w_S. It's an equality statement; that's what parameter sharing is, and so the derivative is one. (Question.) Nothing changes about the input: as far as the input is concerned, the input is still the entire image, the entire signal. But don't get too carried away by what we've said so far, because we're going to build on this: the actual number of unique parameters is just the size of the smallest block, because that is what is being repeated. This means, as you'll see, that it becomes essentially independent of the size of the input itself. If I'm just scanning, it doesn't matter whether I'm scanning a narrow input or a wide input; the actual number of parameters is set by the width of the block I'm scanning with, and that's about it.

So, going back: what does this mean when I'm running a shared-parameter network? Every one of those squares that I've drawn on the flower picture is being analyzed by the same MLP, and all of their outputs are being passed on to the same final subnetwork: a max, an MLP, or whatever else. Which means that if you look at those red arrows in the first figure, all of them are the same weight, because they are the same edge of the subnet that is scanning the figure. So in the giant MLP, it is the equivalent of saying that all of those edges share a parameter. Likewise, all of the lines colored blue (can you see those?) have the same value, and all of the lines colored green have the same value. In other words, if I want to compute the derivative of the divergence with respect to that red arrow, I have to sum over every copy of that red arrow as I scan the image, adding up the partial derivatives.
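Written out for a set S of tied weights sharing the common value w_S (the index notation here is my own shorthand for "every copy of the shared edge"), the influence-diagram argument above is just:

$$
\frac{d\,Div}{d\,w_S}
= \sum_{(k,i,j)\in S} \frac{d\,Div}{d\,w^{(k)}_{ij}}\,\frac{d\,w^{(k)}_{ij}}{d\,w_S}
= \sum_{(k,i,j)\in S} \frac{d\,Div}{d\,w^{(k)}_{ij}},
\qquad\text{since each}\ \frac{d\,w^{(k)}_{ij}}{d\,w_S}=1 .
$$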
So here is how the actual gradient descent works when you are using a shared-parameter network. For every set S there is a group of weights that are constrained to have the same value, and there may be many such sets. For every set, you compute the derivative of the loss with respect to the common value of that set, and you perform gradient descent with respect to that common value. Zooming in on how you actually compute that derivative: you go over every single parameter within the set, compute the derivative of the divergence with respect to that parameter, and then add up the entire lot. That gives you the derivative of the divergence with respect to the common value for the whole set, and each of those individual derivatives is, of course, computed by backpropagation.
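A minimal sketch of that update rule, assuming a toy dictionary representation of the weights and their per-copy gradients (this is illustrative only and not the course's data structures):

```python
def shared_param_update(weights, grads, shared_sets, lr):
    """One gradient-descent step on a shared-parameter network.

    weights:     dict mapping weight-id -> current value
    grads:       dict mapping weight-id -> dDiv/d(that weight), from backprop
    shared_sets: list of lists of weight-ids constrained to be equal
    lr:          learning rate
    """
    for s in shared_sets:
        # derivative w.r.t. the common value = sum of the per-copy derivatives
        g = sum(grads[w] for w in s)
        new_value = weights[s[0]] - lr * g
        for w in s:                      # every member of the set keeps the same value
            weights[w] = new_value
    return weights
```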
So, the story so far: position-invariant pattern classification can be performed by scanning; in one dimension for sound or other one-dimensional signals, in two dimensions for images, where things can occur anywhere in the image. There is nothing special about two dimensions; this generalizes to three, four, or any number of dimensions, and the basic concept still holds. Scanning is equivalent to composing a large network with repeating subnets: the large network has shared subnets. For learning in scanned networks, the backpropagation rules must be modified to combine gradients from parameters that share the same value, and this principle applies in general to networks with shared parameters. Questions so far? Because we're going to get more complex really quickly; this was the easy bit.

(Question.) Not necessarily. When you compute the derivative, the derivative also depends on the input. At locations where the input doesn't contain the pattern, you'd expect the derivative to be small: if I move this weight a tiny bit (remember, we're speaking of infinitesimal changes), you don't really expect to see a large change in the output. But this is the standard backpropagation rule; we are not changing anything, and it holds for pretty much any pattern. The fact that you have a shared-parameter network doesn't really matter here. (Question.) Consider a general MLP and forget that these are shared parameters: when you perform pattern classification over a specific kind of input, that input has components which are relevant to the class and components which are irrelevant. What do you expect the derivatives with respect to the weights on the irrelevant components to be? Small. That's exactly the same situation here; absolutely nothing changed.

OK, so now let's take a closer look at the scanning, keeping in mind that scanning is like operating on the input with one giant network. The entire MLP operates on every single window of the input, which may be an image, a sound, or whatever else. Within each window, what happens? Consider just the first block: you have four neurons in the first layer, and each computes an output. You have three neurons in the second layer in our example; those three neurons look at the four outputs of the first-layer neurons and compute their own outputs, so you get three outputs. The next layer does the same, and the final neuron looks at the outputs of the neurons in the layer below it. This is the order in which you would perform the computations if you were scanning: within this block you first compute the four first-layer neurons, from those the next three, then the next two, and then the final neuron. Then you shift over one block (which could mean shifting by just one pixel) and repeat the same computation, and so on for every single block until you have covered the entire input. At every block you compute the first-layer neurons, then the second-layer neurons, then the final neuron, and eventually the whole collection of outputs is put through the max. This is the standard scanning.

But now let me change the order in which I do things. Just for the first neuron, I'm going to compute its output on the first block; then, still only for that first neuron, I compute its output on the second block, the third, the fourth, the fifth, all the way to the end. Then, for the remaining three neurons of the first layer, I go back and compute their values on the first block. Now I have all four first-layer outputs for the first block, so from those I can compute the two neurons in the next layer, and then the one in the final layer. Will this change the output I get in the final layer for that block? It will not. The fact that I evaluated the first neuron everywhere beforehand doesn't change what happens in the first column. And if I can do this for the first neuron, I can also do it for the second: I could scan with the first two neurons beforehand and then perform the rest of the computations, or I could take all four first-layer neurons, compute them on every block of the input, then take the first column of those outputs, compute the two second-layer outputs from them, and then compute the final output. That still does not change the final output for the first block. Nothing changed; all that happened is that I changed the order of computation.

That logic carries over to the next layer. I can scan the entire input with the four neurons of the first layer, then scan all blocks with one neuron of the second layer all the way to the end, do the same for the second neuron of the second layer, and then compute the final output for the first block. This is still no different from computing the first block on its own. I can even scan with the final neuron itself and then put the entire thing through the max; will this change the final output? No. I'm still performing the same scanning operation; the only thing that changed is the order of computation.

Is that clear? I want to make sure you see what is going on, and why this seemingly trivial observation matters. Let's go back to our network. When we were scanning, we went across time; at each time instant we pulled out a block and put it through an MLP, and then we put the collection of outputs through the max. Expanding that: at each time instant I went through all the layers of my MLP, and within each layer I went over all of the neurons in that layer. So I had my input, I went over time, and at each location I first computed all the outputs of my first layer, that is, the output of each first-layer neuron; then I went to the next layer and computed the outputs of each of its neurons; and finally I computed the output of the neuron in the last layer. The first layer alone works on a block derived from the input (that's the case l = 1); the rest simply work on the outputs of the previous layer. This is what happens when scanning: the first loop is over time, the second over layers, the third over the neurons; look at the pseudocode.

Now I'm going to flip the order of the loops. Will this change the final output, which is all you're really interested in? It doesn't. Instead of going over time, then over layers, then over neurons, I go over layers, over neurons, and then over time. That is exactly what I did when I showed you the modified computation, and it is the resulting operation we just saw. I can also write the whole thing in vector notation: instead of writing out the loops, pulling out a block, and showing the computation for individual neurons, we use the standard vector notation you've been writing in your code. In that form there are only two loops, one over time and one over layers. I flipped the order, and that's basically what happened.
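Here is a small, self-contained sketch of the two loop orderings for the 1-D case, showing that they give identical results. The sizes and random weights are made-up toy values, and the three dense layers stand in for the lecture's 4-3-2-1 style scanning MLP; this is not the homework code.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, N = 100, 40, 8                     # frames, features per frame, window width (toy sizes)
x = rng.standard_normal((T, D))          # stand-in for a spectrogram

W1 = rng.standard_normal((4, N * D))     # 4 first-layer neurons, each sees one flattened window
W2 = rng.standard_normal((3, 4))         # 3 second-layer neurons
W3 = rng.standard_normal((1, 3))         # final neuron
act = np.tanh

def scan_then_layers():
    """Order 1: at each position, run the whole MLP on that block (time, then layers, then neurons)."""
    scores = []
    for t in range(T - N + 1):
        h1 = act(W1 @ x[t:t + N].ravel())
        h2 = act(W2 @ h1)
        scores.append(act(W3 @ h2))
    return np.max(scores)

def layers_then_scan():
    """Order 2: each layer scans the entire input before the next layer runs (layers, then time)."""
    h1 = np.stack([act(W1 @ x[t:t + N].ravel()) for t in range(T - N + 1)])  # (positions, 4)
    h2 = act(h1 @ W2.T)                  # second layer applied at every position at once
    h3 = act(h2 @ W3.T)                  # final neuron applied at every position
    return np.max(h3)

# The two orderings give exactly the same final (max-combined) output.
assert np.allclose(scan_then_layers(), layers_then_scan())
```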
Same thing in 2-D; nothing really changes. When I'm scanning this picture, I have an MLP: it has an input layer and subsequent layers, and this entire MLP scans the input. But take a look at a single neuron in the first layer: what is it doing? It computes an affine combination of all of the pixels in some patch and then applies an activation to that affine combination; it computes a single value. So instead of computing the entire MLP at each block, I can take this first neuron and compute its output at every location, and if I arrange the outputs in the same order as the original image, that first neuron has now mapped out the entire input and created its own map. I can do this with each of the neurons in the first layer.

Now suppose I want to decide whether this particular block in the middle has a flower or not. All I have to do is pick, from the maps produced by the first-layer neurons, the values corresponding to the outputs they produced when they looked at that particular block of the input, and pass those to the subsequent layers of the MLP. This is exactly the same as the 1-D case; I'm just changing the order of computation. Instead of scanning the entire input with the whole MLP first, I scan the entire input with the neurons of the first layer, and then I pick out the specific outputs at different locations and pass them through the rest of the MLP to get the classification output at each location.

And of course this can be done recursively. The neurons in the second layer can look at the maps produced by the first-layer neurons at the top-left location, and when they do, they make their own decision about whether that top-left box of the input image has a flower or not. Then they can scan: the neurons in the second layer do the same thing and produce their own maps. Finally, if I want to decide whether a particular block has a flower in it, I just pick up the corresponding locations from the maps produced by the second-layer neurons, pass them through the final layer of my MLP, and that gives me a decision. Everybody see what's happening here? This is the two-dimensional analog of what we did in one dimension. The final neuron, too, can simply operate on the maps produced by the second layer, giving an output for every location, which can finally be put through a max or whatever else to decide whether the image has a flower. The slides have a lot more text, but you get the idea. I can either retain those outputs as a map or expand them out and put them through an MLP; it doesn't really matter.

So what happened here is exactly what we did in the one-dimensional case. In the standard scanning you had a large image; you scanned over all x, y positions; at each location you pulled out a slice of the image, ran an MLP on it, and finally performed the max over the outputs of the entire computation. If I write it out explicitly, at each location I go over all of the layers and all of the neurons, performing exactly the same operations. But I can flip the order: instead of going over positions, then layers, then neurons, I go over the layers and the neurons, and then over the positions, and this does not change the output of the network. That is what you get when scanning with an MLP, and it is the final operation we actually perform. I can write the whole thing in vector notation as before, instead of writing things out for every individual neuron. The basic idea was this: instead of scanning over the picture and going through all the layers at each location, I go over the layers, and within each layer I scan the input. Either way I'm scanning the box, so you get an idea of what's happening.

Now I'm going to break this down one step further. Remember from way back: what does an MLP really learn? An MLP is, in effect, decomposing an input into parts. If I were trying to build a classifier that decides whether the input lies in one of these two yellow regions, I would first find the lowest-level lines, then build little pentagons, then combine the pentagons, and so on, and finally take a decision. Same thing if we were building a digit classifier that looked at a pixel grid and decided whether the input was a digit: you'd expect the lowest-layer neurons to pick out the salient features of digits, the next-layer neurons to decide whether there are actual digit-like parts, and then you combine the lot.

So let's carry the same idea over to the scanning network. When we analyzed the input with an MLP, you looked at blocks of the image and analyzed each block using the MLP. The first layer of neurons in your MLP was responsible for extracting all the salient features of the entire block; subsequent layers only operated on these features and figured out whether they occurred in the right combinations. So the first-layer neurons had to capture features of the entire block, and the bigger the block, the more features you expect to need: you'd expect your first layer to have to be very large to capture everything important about it.

But now let me redo this somewhat differently. My first-layer perceptrons are not going to look at the entire block; they are going to look at a sub-block, and they begin scanning, like so. Then what happens? The information about the whole block is no longer represented by a single value in the output map of each first-layer neuron; in my example it's represented by a grid of nine values. So if I want a neuron in the next layer to make a decision about this block, it has to look at that square of outputs in the map produced by the neurons in the first layer.

The yellow dotted square is the block over which I'm looking for the flower; good question, because I'm sure others have the same one. Look at what's happening in that red square: when scanning the input, I have decided on the size of my flower, which is the red box, and I'm scanning. I still want to examine an input the size of that red box to decide whether there is a flower. Previously, the neurons in the first layer looked at that entire box and each produced a single output, and the neurons in the next layer just read those values and operated on them. Now, instead, the neurons in the first layer look at regions of the image the size of the white box, and they scan. So if you want to cover the bigger region, you have to look at nine of these outputs: nine outputs from the map of each of these neurons. Here I have four first-layer neurons, so I have to look at a block of nine outputs from the map of the first neuron, nine from the second, nine from the third, and nine from the fourth, and all of those are combined by the neuron in the second layer before it makes a decision about this one block. It repeats the same process at every position as it scans.

So is everybody clear on what I've done? I've distributed my pattern. Instead of forcing the first-layer neurons to learn all of the features of the input block, I've smeared the representation across two layers: the neurons in the first layer look for smaller patterns, and the neurons in the second layer look for larger patterns while also considering the spatial arrangement. What this means is that any single value in the output map of a second-layer neuron really represents that entire yellow box, not just one small sub-region of it; it is effectively evaluating the whole yellow box, but it is now able to pick up on arrangements of patterns. Look at the four neurons in the first layer: remember, a neuron is really a kind of correlation filter, trying to match its weights to its inputs; every neuron is looking for a pattern. The first-layer neurons look for smaller patterns, and the second-layer neurons look for combinations of the patterns found by the first layer. Say the first-layer neurons find different components of a petal as their basic patterns. A second-layer neuron can now say: the first one must fire in the top-left position, the second in the center, the third in the middle, the fourth in the bottom-right. It can find arrangements in the outputs of the first-layer neurons to decide whether it has captured a feature relevant to the flower in the bigger yellow block. The spatial arrangement is taken care of.

But the interesting thing is that you're still just scanning with a shared-parameter network. Although I have distributed everything, nothing really changed; it's still the same basic structure. All that happened is that there is a little more sharing of parameters. How exactly? Consider a simple example: say the first-layer neurons look at regions the size of the small block, and adjacent blocks don't overlap. Then a single first-layer neuron produces nine outputs within the big block. This is the equivalent of having nine copies of that neuron with identical weights, with the nine copies looking at different regions of the picture. So it's like a shared-parameter network with nine copies of the first neuron, nine of the second, nine of the third, and so on, and the neurons in the second layer look at all of them. There is no sharing of parameters within the second layer for a single block, the way we've set it up, but in the first layer the nine copies of each neuron are identical; each copy is simply looking at a different block of the input.
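A quick way to see the coverage, under the assumption used in the figure (non-overlapping first-layer blocks of L by L pixels):

$$
\underbrace{\tfrac{K}{L}\times\tfrac{K}{L}}_{\text{grid of first-layer outputs}}\ \text{blocks of}\ L\times L\ \text{pixels each}\ \Longrightarrow\ K\times K\ \text{pixels of input covered},
$$

so with the small white box as the first-layer block and K/L = 3, the nine outputs together span exactly the full yellow block.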
Now this logic can be applied recursively. Instead of distributing the pattern over two layers, I can build it over three: the neurons in the first layer look at really small regions and produce their maps; the neurons in the second layer look at sub-regions, so a second-layer neuron covers a somewhat larger box, groups of four first-layer outputs; and the neurons in the third layer look at groups of boxes in the output maps of the second layer. As a result, a neuron in the third layer is actually looking over the entire target box. What I've done is distribute the pattern over three layers, which lets me look for features at a finer resolution. None of this changes the fact that the final classifier just uses the outputs from all the locations, as seen in the final map, and none of it changes the fact that this is still one giant MLP; you simply have shared parameters, and the shared-parameter structure is somewhat more complicated. The overall structure remains. If this is not yet clear, the next quiz should fix it for you, and homework 2 will definitely fix it for you.

So this entire operation can still be viewed as one giant shared-parameter network. It probably makes more sense if you look at it again from the perspective of the pseudocode. The original scanning MLP went over all x, y locations, and within each location it went over the layers and over the neurons; the blue block shows the scanning, the yellow shows the MLP itself. I can flip it, so that you go over the layers, and within each layer you scan the entire input. But unlike the simple case, there are two things I have somewhat magically hidden and not yet expanded. First, there is no necessity for the size of the block you look at within one layer to be the same as the size of the block you look at in the next layer; these can be independent. Second, look at the actual computation itself. Previously I was just reading one vector of inputs before passing it on to the next layer. Here, what you do is read a rectangular block of inputs from the output of the previous layer. At each layer you have a bunch of neurons, and each neuron in a given layer produces a map of outputs. So this neuron has a small set of weights corresponding to the size of the region it looks at in the output maps of the previous layer. The first thing you do is slice out a region the same size as its weights, and then you compute an inner product between the two, which is a component-wise multiply followed by a sum. You slice out the corresponding region from each of the four maps, because you're looking at all of the neurons from the previous layer, and for each map there is a different set of weights. You perform the component-wise multiply for each, and then you add them all up; when you add them all up you get one output, corresponding to this region in the maps of the previous layer, which in turn corresponds to some region of the input image.

(Question.) Yes, this is a component-wise multiply: this value in this location multiplies that value in that location, and so on; the figure is just illustrating the computation. And what is shown here are the weights, not a slice of the input. If I wanted to be absolutely didactic, each of these boxes would point to the corresponding neuron of the next layer, but for this box I have a set of weights exactly the same size as the sliced region. Make sense?

Having performed this computation for one position, you keep sliding: you scan the input and repeat the computation over every block, filling up the entire output map as you go left to right. This entire operation, performing the component-wise multiply between the weights and a block of the input, adding it all up (basically looking for a pattern over the entire collection of maps from the previous layer), and then scanning, is called a convolution. This business of scanning for a pattern is a convolution, and the entire network, as a consequence, is called a convolutional neural network. I can also write the whole thing in vector form: instead of four separate maps, I stack them into one 3-D tensor, and instead of four separate matrices of weights, I think of one 3-D block of weights; you can think of these as forming a cube, and I'm scanning the cube spatially to look for patterns.
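A minimal sketch of that scan-multiply-sum operation for one filter, in NumPy. The shapes, the `activation` default, and the function name are my own illustrative choices; padding and stride are omitted to keep the sketch close to the description above.

```python
import numpy as np

def conv_layer(maps, weights, bias=0.0, activation=np.tanh):
    """Scan a stack of input maps with one filter.

    maps:    (C, H, W)  -- C maps from the previous layer, stacked as a tensor
    weights: (C, L, L)  -- one L x L weight patch per input map
    Returns an output map of shape (H - L + 1, W - L + 1).
    """
    C, H, W = maps.shape
    L = weights.shape[-1]
    out = np.zeros((H - L + 1, W - L + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = maps[:, i:i + L, j:j + L]                        # same region from every map
            out[i, j] = activation(np.sum(patch * weights) + bias)   # componentwise multiply, then sum
    return out

# usage sketch: one output map per second-layer neuron
# map_out = conv_layer(np.stack(first_layer_maps), w_for_this_neuron)
```

Running one such filter per neuron in the layer produces the next stack of maps, which is exactly the "each neuron produces its own map" picture described above.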
So why distribute? Why not just use the basic idea we began with? Because when you distribute things in this manner, the number of parameters can become really, really small. Consider this example. Say I'm looking for patterns in a K x K block of the input, and I have n1 neurons in the first layer, n2 in the second, n3 in the next, and one neuron in the final layer. How many weights do I need for each block? Each first-layer neuron needs K-squared weights, because it looks at K-squared pixels; forget about the plus one for the bias (these are perceptrons, so there is always a bias, but I'm ignoring it for now). So you have K-squared times n1 weights for the first layer; then you have n1 neurons in the first layer and n2 in the second, so n1 times n2 weights for the second layer, and so on. That's the total number of parameters if you do not distribute.

If you distribute the parameters, things change. Say I'm distributing this over two layers, still looking for patterns in a K x K block, but now every first-layer neuron looks at a little L x L region. I have n1 neurons in the first layer, each looking at an L x L region, so the first layer has L-squared times n1 parameters. The second layer has to look at K-over-L by K-over-L of those outputs, because it needs to cover the entire block; so the second layer needs (K/L)-squared times n1 times n2 parameters, since for every pair of neurons it reads an entire block of values from the maps. The subsequent layers are n2 times n3 and so on, as before. So how much did the number of parameters change? You can continue this ad nauseam for any number of layers; go through the arithmetic yourselves, and maybe there'll be one in the quiz, just so you get the idea.

But here is an actual comparison. In this example I'm scanning for flowers with an MLP that has four neurons in the first layer, two in the second, and one in the final layer, and I'm looking for patterns over 16 x 16 pixels. 16 times 16 is 256, and 256 times 4 is 1024 parameters for just the first layer, plus a small number for the remaining layers, so the total number of parameters is 1034. Now instead, say the first layer looks at little 4 x 4 blocks, and the next layer looks at 4 x 4 blocks of the output of the first layer. Because 4 x 4 is 16, the first layer only needs n1 times 16 parameters, and the next layer needs n1 times n2 times 16, because each of its neurons reads a 4 x 4 block from each first-layer map; that's 8 times 16. If you add up the total number of parameters, it comes to roughly 160 weights. So just by distributing the representation over two layers, I have reduced the number of parameters in this case by more than a factor of six. And as you spread it out more and more, the reduction in the number of parameters becomes several orders of magnitude, because you are scanning for much smaller, similar patterns.

(Question.) Yes, and we are not even speaking of pooling; there is no pooling here. All I'm literally saying is this: in one case I had my MLP, its first-layer neurons looked at the entire block and produced a map, so each location in that map represented a full block, and to analyze a full block the next neuron only had to read one value at that location. In the other case I have the same image, but the first-layer neurons look at much smaller blocks and produce their maps, and the neuron in the next layer, instead of reading just one value, reads a group of values from the output of each first-layer neuron, the outputs produced at several adjacent locations, which effectively gives it the same span over the input. Both versions look at the same span; in one case I have distributed the representation over two layers, and in the other case I have kept it in one layer.
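Summarizing the counts in the lecture's notation (biases ignored, as above; the distributed formula assumes non-overlapping L x L first-layer blocks):

Undistributed, with the first layer spanning the whole block:
$$\#\text{weights} \approx K^2 n_1 + n_1 n_2 + n_2 n_3 + \dots$$

Distributed over two layers:
$$\#\text{weights} \approx L^2 n_1 + \left(\tfrac{K}{L}\right)^2 n_1 n_2 + n_2 n_3 + \dots$$

Plugging the undistributed numbers from the example into the first formula ($K=16$, $n_1=4$, $n_2=2$, one output neuron) gives $256\cdot 4 + 4\cdot 2 + 2\cdot 1 = 1034$, matching the figure quoted above.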
in one layer right and then and the reduction in the number of parameters simply because it's a square as you increase the dimensionality that's going to get five you know more and more dramatic right it becomes very it really shrinks okay so basically distribution forces localized patterns and lower layers it's looking for smaller patterns it also did all the results in a large reduction in the number of parameters but the final story remains the same regardless of the distribution we can view the network as scanning the picture as an MLP the only difference is the manner in which the parameters are shared in the MLP okay so the story so far position invariant pattern classification can be performed by scanning the input for target pattern the operations and scanning the input of the full network can be reordered by scanning the input with individual neurons in the first layer and then silently scanning the maps produced by the first layer neurons where neurons in the second layer and so on and the scanning block itself the scanning operation can be distributed over many many layers so here's a different perspective if you think about it now it's exactly the same operation but the entire operation can be redrawn as before as maps of the entire image so I can say that the neurons in the first layer are looking for small patterns and they are scanning the input to produce maps to Reptar to produce maps of what they have found now the neurons and the second layer are scanning the maps produced by the first layer and because they're scanning regions in the maps produced by the first layer scanning larger you're looking at larger regions in the output of the fur of the the larger regions in the input image itself and that is scanning right and then the neurons and the third layers are looking at the regions at regions and the output of the neurons in the second layer and you're scanning but observe that because this is really just an MLP you're not scanning just one of the maps you're scanning all of these maps simultaneously right now this these operations hide this very important fact then the entire operation is basically the same as scanning the input with an MLP so if you do things in any different way that's not going to be the that's not going to be the outcome okay so now and we can have any number of layers in this pattern so there's some terminology each of the scanning neurons is generally called a filter that's because you you're drawing terminology from the signal processing literature you're calling these convolutions they're calling those neurons filters that's really a correlation filter that as we saw earlier and each filter scans for a pattern on the input but exactly what fatter pattern makes any put any any filtered fire that's easy to compute for the neurons in the first layer because you can just look at the pattern of weights and it will tell you what pattern it's firing for but then as you go to the subsequent layers things get a little more complex and it's very very difficult for us to be able to say what pattern the neurons in the second layer what patterns the neurons in this in the higher layers are looking for but the pattern that any neuron looks for is called its receptive field that's just a terminology again and the receptive field for neurons in the first layer are fairly trivial to compute for higher layer neurons the receptive layers receptive fields are non trivial to compute and must be derived so and these patterns will not really be very simple you 
You can't predict them in advance: if I'm looking for flowers, something is going to end up looking for petals, something for sepals. These patterns are learned and somewhat unpredictable. The final layer may feed into an MLP or a single neuron, exactly the same as before.

Now I can make some modifications. When we were scanning the input, we looked at a block and then slid over by some amount. If I slide over by exactly one pixel, the resulting map is approximately the same size as the input itself, except for the width of the patch I'm scanning with, so you maintain size as you scan the image. But instead of shifting by just one pixel (if I'm looking for a flower, shifting by one pixel maybe doesn't give me much new information), I can begin jumping a little more. If I shift by two pixels, the resulting map is half the size of the original input in each direction, so a quarter of the overall size; if I skip by three pixels it's one ninth. You shrink the input. We call this a stride, and the convolution operation can have a stride at any layer; in the pseudocode, as you go from left to right across the image, you jump by that number of pixels. A question from the class: do the filter sizes at different layers have to get bigger or smaller? No. For example, the first layer could use 2 x 2 filters and the next 4 x 4, but the net result still covers 8 x 8. The point is that I want to see 8 x 8; it doesn't matter whether I do it as 8 x 8 and 1 x 1, or 2 x 2 and 4 x 4, or 4 x 4 and 2 x 2, or any other combination. What distributing changes is the manner in which the network represents things: you're able to look for smaller patterns and their arrangements. A sketch of the stride idea follows below.

There is another thing we commonly do, which is to account for jitter in the image. Say I'm trying to detect a flower. What was the point of distribution? If you don't distribute, you're asking directly, "am I finding a flower?" By distributing, you first look for petals, sepals, stamens and other little characteristics of the flower; these are localized, they are not the size of the flower, and you expect the neurons in the first layer to find these localized patterns. Then the neurons in the second layer say, "I found a bunch of petals, a sepal, a stamen, and they all occur in the right arrangement, so I think there's a flower," or whatever pattern represents a flower. You're distributing the manner in which the information is represented. But think about it: if I'm looking for petals and the petal is slightly shifted, if it's jittered, we noticed earlier that jittering the input by even one pixel puts you in a completely different subspace, so the next neuron is going to misfire; it won't find it. So do you really want this level of sensitivity in your ability to detect patterns?
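The stride just changes how far the scan jumps between positions; a stride of S shrinks the output map by roughly a factor of S in each direction. Here is a small, hedged NumPy sketch (again my own illustration) of the same scan with a stride parameter:

```python
import numpy as np

def scan_with_stride(image, weights, stride=1, bias=0.0):
    """Scan an L x L weight patch over the image, jumping `stride` pixels
    at a time; larger strides produce proportionally smaller output maps."""
    H, W = image.shape
    L = weights.shape[0]
    out_h = (H - L) // stride + 1
    out_w = (W - L) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + L, j * stride:j * stride + L]
            out[i, j] = np.sum(patch * weights) + bias
    return out

# With stride 2 the map is about half the size of the input in each direction
# (a quarter of the area); with stride 3 it is about a third in each direction.
```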
Or would you like some degree of robustness to these shifts and variations? We would like a little robustness to small shifts and variations. I want to be able to say, "I found a petal somewhere in this region." A petal is a very localized feature, so I don't want to look for it over a big block; the petal itself is localized, but I want to be able to say that it occurs somewhere in this region. This is exactly the problem we began with in the first place: is there a flower in this picture? You looked for flowers everywhere and then took a max. You can do the same thing here: look for the local pattern within a region and pick the max. This operation is often called max pooling, or max filtering. When you look at the outputs of the neurons in any layer, which form a map, then to represent any specific location you look at a block of values around it and pick the largest. That accounts for jitter, for a little bit of mispositioning of the input, while still allowing the network to fire.

Say the outputs for four consecutive blocks are 1, 1, 5, and 6. That means whatever pattern the neuron is looking for was found with those different confidences in the four regions, but you know that with high confidence it is somewhere in the bigger block; a confidence of 6 is fairly high. So you retain only the highest value, saying "I did find a petal in this region, and the confidence with which I found it was 6." That is max pooling.

Again, the max operation is just a neuron: a bunch of inputs come in, it's like applying a weight of one to each, and the activation is the max. If you've been through the slides for the previous class, the max is just a standard activation; you can compute a derivative, or at least a subgradient, so all the usual operations can be performed. The entire max pooling operation can therefore be thought of as one more layer in the network, where each unit looks at a bunch of inputs with weight one and applies a max activation. A question from the class: are the values normalized? No, they're just numbers. You're simply reporting the highest confidence of having found the pattern in any patch, and that itself could be low. Think of it this way: suppose I have the sequence 1, 1, 7, 6, 1, 1, 3, 4, and I block it up in pairs. The maxes are 1, 7, 1, and 4, so I'm more likely to have found whatever pattern I'm looking for in the second and fourth blocks than in the first and third, and among those, most likely in the second. The information carries over.

So the max operation is just another neuron: instead of applying an activation to the weighted sum of its inputs, it computes the maximum over all its inputs. And just as before, instead of thinking of it as a network component, I can also perform it as a scan: compute the max over the whole input and produce a max map.
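A minimal NumPy sketch of max pooling as just described (my illustration, using the lecture's 1, 1, 7, 6, 1, 1, 3, 4 example): block up the map and keep the largest value in each block.

```python
import numpy as np

def max_pool_1d(values, block=2):
    """Block up a 1-D map and keep the largest value in each block."""
    values = np.asarray(values, dtype=float)
    n = (len(values) // block) * block            # drop any ragged tail
    return values[:n].reshape(-1, block).max(axis=1)

def max_pool_2d(feature_map, block=2):
    """2 x 2 (by default) max pooling over a 2-D map, with stride equal to the block size."""
    H, W = feature_map.shape
    H, W = (H // block) * block, (W // block) * block
    blocks = feature_map[:H, :W].reshape(H // block, block, W // block, block)
    return blocks.max(axis=(1, 3))

print(max_pool_1d([1, 1, 7, 6, 1, 1, 3, 4]))  # -> [1. 7. 1. 4.], as in the lecture's example
```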
And the same idea holds: the max operation is not magic, it just happens to be another layer in your MLP where the weights are fixed and the activation is the max activation. Nothing else changes. If you've defined your network and written your code properly, the implementation will be agnostic to the activation itself, which you should be able to replace with different things, so swapping the max in for a regular activation is not a big deal.

Now, this whole max pooling business really only makes sense if you use strides greater than one; otherwise it rather loses its significance. So typically when we do max pooling we jump by entire blocks rather than sliding by one pixel. The max pooling operation therefore shrinks the input, and because you're picking one value from a collection of values, it's often called a downsampling operation; in fact, any operation that shrinks the size of the output is called a downsampling operation. If you stride with a step size of two, the max operation shrinks the output map by a factor of two in each direction relative to its input.

To finish up: if you want some positional invariance, you scan, then you perform this max pooling operation, which produces some maps, and the next layer operates on the outputs of the max, because the max is just yet another layer in the network. The overall architecture looks something like this: many layers of convolution, each followed by max pooling, and so on. The individual perceptrons at any scanning or convolution layer are called filters; they filter the input image to produce an output image. The individual max operations are also called max pooling or max filters. And this entire network is a convolutional neural network.

In the rest of today's slides I have a little more on the application of the same concept. The illustrative example I've used is images, but the remaining slides illustrate this for signals like sound and time-series data, where the scanning is along one direction, not two. When you apply the basic idea to time-series data, scanning in one direction, it's called a time-delay neural network. Take a look; in time-delay neural networks we generally do not perform max pooling, because it distorts the notion of time, but other than that the basic idea is just the same. We'll continue with the same idea in the next class. Okay.
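To tie the pieces together, here is a small, hedged sketch of the overall architecture just described (convolution layers, each followed by max pooling, feeding a final neuron), written with PyTorch purely as an illustration; the layer sizes and the use of PyTorch are my assumptions, not the lecture's reference implementation. A 1-D analogue using nn.Conv1d without the pooling layers would correspond to the time-delay network mentioned at the end.

```python
import torch
import torch.nn as nn

# Hypothetical layer sizes, chosen only to illustrate the conv -> max-pool -> output pattern.
cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=4, kernel_size=4),  # first-layer filters scan small blocks
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                               # max pooling, jumping by whole blocks
    nn.Conv2d(in_channels=4, out_channels=2, kernel_size=4),   # second layer scans all the pooled maps
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
    nn.LazyLinear(1),                                          # final neuron (or a small MLP) on top
)

x = torch.randn(1, 1, 32, 32)   # a dummy 32 x 32 single-channel "image"
print(cnn(x).shape)             # torch.Size([1, 1])
```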
Info
Channel: Carnegie Mellon University Deep Learning
Views: 5,238
Rating: 4.8032789 out of 5
Id: 2XbZ03D0Sf4
Length: 81min 23sec (4883 seconds)
Published: Wed Sep 25 2019