Encoder Decoder Network - Computerphile

Video Statistics and Information

Captions
So where we left it was that we've got ourselves a fully convolutional network, so it makes no assumptions about the size of the input: the number of parameters just adapts itself depending on the size of the input, which for images, you can imagine, makes quite a lot of sense; they change size quite a lot. But in most other ways it acts exactly like a normal deep network. We've talked about this before in other videos, like the Deep Dream one: the deeper you go into the network, the higher-level the information we have on what's going on, so it's objects and animals and things rather than bits of fur and edges; the shallower we are, the less idea we have of what things are, but also the higher the spatial resolution, because we've basically got the input image size.

That's because of these max pooling layers. Mostly, every time we downsample, what we're doing is taking a small group of pixels and just choosing the best of them, the maximum, and putting that in the output, and that just halves the size of the image; and we halve it again, and halve it again. You can imagine, if you've got an image of 256 by 256, we might repeat this process five or six times until we've got a very small region. It's done for a couple of reasons. One is that we want to be invariant to where things are in an image, which means that if the dog's over to the right, we still want to find it even if it's over to the left, so we don't want to be affected by that. The other issue, quite frankly, is we don't have enough video RAM: we routinely fill up multiple graphics cards, each of which has eleven gigabytes on it, depending on the situation you're looking at. This is only one dimension I've drawn here, but it's actually two-dimensional: if you halve the x and y dimensions, you're actually dividing the amount of memory required for the next layer by four, and then by four again, and then by four again. So actually you save an absolutely massive amount of RAM by spatially downsampling, and without it we'd be stuck with very small networks indeed.

But we've got this problem: yes, we've worked out there's a cat in the image or something like this, but it's very, very small, right? It's only a few pixels by a few pixels, so we've got a rough idea there's something going on here. Maybe we could just balloon it up with a large linear upsampling and just sort of go, "well, that's roughly a cat", but it wouldn't be anything interesting. So I guess the interesting thing happened in 2014, when Jonathan Long proposed a kind of solution to this, which is essentially a smarter upsampling. What we do is essentially reverse this process: basically, we have a sort of an upsample here, which will maybe double the size, and then we look over here and we bring in some of this interesting information as well. Then we upsample again and we go, "all right, this is now the same size as this, so we can bring in some of this information". And when I say bring in, I mean literally add these two together, and we can have convolutional layers here to learn that mapping, so we can take nothing from here or everything from here; it doesn't really matter. And finally we upsample back to the original size and we bring this in here using a sum. Now, what we've actually done is a kind of smart way of making this bigger. I mean, clearly, you've got to kind of try and get your head around it, but these features are very sure what's in the image, but only roughly where it is.
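To make that upsample-and-add idea concrete, here is a minimal sketch of an FCN-style encoder-decoder in PyTorch. The module names, channel counts, and the choice of bilinear upsampling are illustrative assumptions, not the exact architecture from the video or from Long's paper:

```python
# A minimal FCN-style encoder-decoder sketch (illustrative assumptions only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoderDecoder(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # Encoder: each stage convolves, then max-pools, halving H and W,
        # so the memory needed per layer drops by a factor of four each time.
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        # 1x1 convolutions map each encoder stage to class scores, so the
        # skip connections can be combined by literally adding them.
        self.score3 = nn.Conv2d(64, num_classes, 1)
        self.score2 = nn.Conv2d(32, num_classes, 1)
        self.score1 = nn.Conv2d(16, num_classes, 1)

    def forward(self, x):
        f1 = self.enc1(x)              # full resolution, low-level features
        f2 = self.enc2(self.pool(f1))  # 1/2 resolution
        f3 = self.enc3(self.pool(f2))  # 1/4 resolution, high-level features
        # Decoder: upsample, then sum in the higher-resolution skip features.
        s = self.score3(f3)
        s = F.interpolate(s, scale_factor=2, mode='bilinear',
                          align_corners=False) + self.score2(f2)
        s = F.interpolate(s, scale_factor=2, mode='bilinear',
                          align_corners=False) + self.score1(f1)
        return s  # per-pixel class scores at the input resolution
```

Each pooling step halves the width and height as the features get more abstract; the decoder then doubles the resolution back up, adding in the higher-resolution skip features at each step, in the spirit of the "literally add these two together" description above.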
These features are at a much higher pixel resolution; they're much more sure, in some sense, where things are, but not exactly what they are, right? So you could imagine, in an intuitive way, we're saying, "well, this is a cat, and down here we've seen some texture of fur; let's combine them together to outline exactly where the cat is". That's the kind of idea, and you can use this for all kinds of things.

So people have used it for segmentation, or what we call semantic segmentation, which is where you label each pixel with a class depending on what is in that pixel. Traditional segmentation usually meant background and foreground; now semantic segmentation means maybe hundreds of classes. So, for instance, in the image of the scene here it might be you, the table, the computer, the desk, the window, these kinds of things. And there's a huge number of different applications for that kind of thing. On a basic level, you could imagine just trying to find one object specifically in a scene, so just for people: either it's a person or it's background, and we don't care about anything else. Or you could be training this on something like ImageNet, with lots and lots of classes; I mean, there's the MS COCO dataset, for example, that has lots and lots of classes, so you're trying to find airplanes and cars and things. And people do this on street scene segmentation as well, so you could say, look, given this picture of a road: where is the road, where is the pavement, where's a building, where are the road signs? And actually analyze the entire scene, which is obviously really quite powerful.

The other thing is that you don't have to segment the image. Instead of segmenting it, you can just try and find objects. You can say: instead of just outlining where an object is, yes or no, why don't we try and draw a little heat map of where we think it is? Then we can pinpoint objects, so we can say where the two pupils are on a face, or can we draw around someone's face or their nose or their forehead, so that we can fit a model to that. So Aaron was doing this in his network, where he was actually predicting the 3D positional information of a face based just on a picture, and away you go with that. We've also been using it for human pose estimation: where's the right hand, where's the left hand, what pose is this person currently doing? Which, obviously, you can imagine has lots of implications for things like Kinect sensors and interactive games, but also, you know, pedestrian tracking and loads of other examples of things where it might be useful to know what a person is up to. And finally, we're using it, obviously, in plant science to try and count objects and localize objects: where's the disease in this image, can we produce a heat map that shows exactly where it is, where are the ears of wheat in this image, can we count the number of spikelets to get an estimate of how much yield this wheat is producing compared to that wheat? And then we can start to run experiments: you know, these ones are water-stressed; does that mean this one's better? This kind of thing.

So this is called an encoder-decoder, because essentially what we're doing is encoding our spatial information into some kind of features of what's going on in the scene in general: we remove the spatial resolution in exchange for learning more about the scene, and then we bring it back in by finding detail from earlier parts of the network and bringing that in as well, and that's the decoding stage. In some sense this is a little bit like a GAN, in the sense that this is the generator here and this is the discriminator; it's just that you would switch them around. But let's not overcomplicate things.
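As a rough illustration of the two uses just described, here is a hedged sketch of how such a network might be trained, reusing the hypothetical TinyEncoderDecoder from the sketch above; the data is random placeholder data, and the class count and image sizes are made up for the example:

```python
# Hedged training sketch: per-pixel classification vs. heat-map regression.
import torch
import torch.nn.functional as F

model = TinyEncoderDecoder(num_classes=5)      # e.g. road, pavement, sign, ...
images = torch.randn(4, 3, 64, 64)             # dummy batch of images
labels = torch.randint(0, 5, (4, 64, 64))      # one class index per pixel

# Semantic segmentation: cross-entropy applied independently at every pixel.
seg_loss = F.cross_entropy(model(images), labels)

# Localization/counting: regress a one-channel heat map instead of classes;
# peaks mark the objects (pupils, wheat spikelets, disease lesions, ...).
heat_model = TinyEncoderDecoder(num_classes=1)
target_heatmap = torch.rand(4, 1, 64, 64)      # placeholder Gaussian blobs
heat_loss = F.mse_loss(heat_model(images), target_heatmap)
```

For segmentation the loss is just ordinary cross-entropy applied at every pixel; for localization and counting, the same architecture regresses a heat map whose peaks mark the objects.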
And this one lit up, which is maybe paws; and maybe this one lit up because there were a few lines in a row; and this one is sort of furry texture or something, you know. And we're getting lower and lower level as we go through.
Info
Channel: Computerphile
Views: 94,302
Keywords: computers, computerphile, computer, science, Dr Mike Pound, University of Nottingham, Deep Learning, CNN, Convolutional Neural Networks
Id: 1icvxbAoPWc
Length: 6min 19sec (379 seconds)
Published: Wed Jun 13 2018