Ian Goodfellow PhD Defense Presentation

Video Statistics and Information

Captions
[Inaudible introduction by the committee chair.]

All right, so the title of my thesis is "Deep Learning of Representations and its Application to Computer Vision". I'd like to thank all the members of the jury for coming today to hear my defense. For everybody else, I'll try to make this talk as nice for you as possible, but I should warn you that the purpose of the defense is for the committee to evaluate what I've learned and contributed, and not necessarily to tell you exactly how to make all this work for yourself. So throughout the talk I'll refer you to other talks I've given that go into a bit more geeky detail about some of these projects, if any of you want to make use of them at home.

The thesis presentation itself will be a little bit fragmented, because I wrote a thesis by articles, which means that I stapled four articles together and they're only very loosely related to each other, so I apologize for the diminished aesthetic appeal of that kind of presentation. As a basic overview, I'm going to give you a little bit of a summary of deep learning, which is the kind of machine learning that I studied for my PhD, and then I'll present four different articles that I wrote during the course of my time here. All of them are applied to solve computer vision problems; other than that they don't have a lot in common. The first two are about probabilistic models: spike-and-slab sparse coding and multi-prediction deep Boltzmann machines. The third article is about a new kind of activation function for neural networks called maxout, and the last article I present is about how we used convolutional networks to transcribe street numbers at Google during my internship there last year.

So, machine learning is defined as the ability of software to learn from experience. Beyond that it's difficult to define much more formally. There are several different approaches to machine learning that cause different systems to improve on different tasks with respect to different performance measures, and even the kinds of experience can be different. Today I'm mostly going to talk about becoming more accurate at classifying input examples into categories, or more accurate at estimating the probabilities of certain events, and the experiences will all be things like getting to see an image, or getting to see an image associated with a label.

The basic principle that I use for most of the work I'm presenting today is called maximum likelihood estimation. The idea is that you define a model that describes the probability of some event happening, such as an image occurring or an image having a specific label, and this model has parameters associated with it; if you change those parameters, it changes what probabilities the model assigns to different events. The idea behind maximum likelihood estimation is to pick the parameters that maximize the probability of your data set being generated. There are other approaches to machine learning, and some of them work better when you have very few examples, but maximum likelihood estimation is an efficient and very reasonable way of doing machine learning when you have at least a couple thousand examples. The main technique that I use to perform this estimation in most of the projects I'll present today is called gradient descent.
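(Not part of the original talk: a minimal sketch tying together the two ideas just mentioned, maximum likelihood and gradient descent. It fits the mean of a Gaussian with known variance by taking gradient steps on the negative log-likelihood; all names, values, and the choice of model are illustrative.)

```python
# Maximum likelihood estimation of the mean of a Gaussian, done by gradient
# descent on the negative log-likelihood. Toy data and hyperparameters.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)   # the "experience": observed examples

def neg_log_likelihood(mu, x):
    # NLL of N(mu, 1), up to an additive constant
    return 0.5 * np.mean((x - mu) ** 2)

def grad(mu, x):
    # derivative of the NLL with respect to mu
    return np.mean(mu - x)

mu = 0.0                    # initial parameter guess
learning_rate = 0.1
for step in range(200):
    mu -= learning_rate * grad(mu, data)   # move against the derivative

print(mu)   # approaches the sample mean, which is the maximum likelihood estimate
```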
The idea behind gradient descent is that you have some kind of objective function or cost function you want to minimize. Here I'm representing that with the function y = x², and you can minimize this function by looking at the derivatives of the function at every point. Over here the derivatives are positive, so you can move opposite the direction of the derivative to decrease the objective function value. As long as the derivatives of the objective are positive, we keep moving in the negative direction until we reach the minimum, where the derivatives are zero. In multiple dimensions the derivatives tell you which direction to move in, as well as the size of the step; that's a little bit difficult to draw.

Most of the projects I'll show you today involve supervised learning as some component. The idea of supervised learning is that you have some kind of input, such as an image like this photo of a bird, and you have some desired output, like a label saying that the image is of a bird. I'll use the variable X to refer to the inputs and the variable Y to refer to the targets, and in some cases I might also use the variable V to refer to the input, but that should be clear from context.

The main idea behind deep learning is that we often want to learn the way that we represent the data we use. Think about standard machine learning classifiers like support vector machines: if you use a linear support vector machine and try to classify this data set, where all the X's are positive examples and all the blue dots are negative examples, you can't pick any one line through this space that divides all the examples cleanly from each other. But if you change the representation to polar coordinates, then you can easily draw a single horizontal line that divides the two. This is just one example of how changing the representation can make a machine learning algorithm perform better. Obviously there are more sophisticated machine learning algorithms, and there are more sophisticated reasons for changing your representation, other ways that things can become easier besides just introducing linear separability, but this illustrates the basic concept that the representation you use matters a lot for your performance.
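(Not from the talk: a small sketch of the representation-change point just made, on my own toy data rather than the figure from the slides. Points on an inner disc versus an outer ring are not linearly separable in Cartesian coordinates, but a single threshold on the radius separates them in polar coordinates.)

```python
# Change of representation: Cartesian (x, y) -> polar radius r.
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100),    # negative class: inner disc
                        rng.uniform(2.0, 3.0, 100)])   # positive class: outer ring
labels = np.concatenate([np.zeros(100), np.ones(100)])

x = radii * np.cos(angles)
y = radii * np.sin(angles)

# New representation: the radius of each point.
r = np.sqrt(x ** 2 + y ** 2)

# A single horizontal line in the polar representation (r > 1.5) classifies
# everything correctly, which no single line in (x, y) space can do here.
predictions = (r > 1.5).astype(float)
print((predictions == labels).mean())   # 1.0
```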
In deep learning, the specific idea is that we want to learn representations that are themselves composed of simple operations on other representations. By following several different simple transformations of the data, we can arrive at very complex transformations of the data. Here I've drawn an example using images from Matt Zeiler's paper on visualizing the features of convolutional networks. If you have an input photo, it consists of pixels, and it's very difficult for a classifier to associate values of individual pixels with the category of the image; you can't look at one pixel over here and say that this is a picture of a person. But what you can do is extract features at several different locations and have those features become slightly more abstract, with more meaning to them than the original pixels have. The first layer of your network, the visible layer, consists only of these raw pixel values, but the first hidden layer can extract features such as edges by looking for bright pixels, represented by the white lines here, next to dark pixels, represented by the black lines here. By pulling out all the edges across the image, we now have a representation of the image in terms of edges, and that makes it easier for the second layer to look for things like extended contours and corners. Finally, after we have a description of the image in terms of contours and corners, we can combine these descriptions into small object parts, and from a description of the image in terms of object parts it's easy to pull out descriptions of entire objects such as people or cars or animals. So that's the basic idea behind deep learning; the exact way that you learn these features and the exact way they're extracted depends on the kind of algorithm you use, both for training and for inference, so you'll see several different ways of doing that today.

The first paper I'll tell you about is called spike-and-slab sparse coding. The full title of the paper varies a little bit depending on whether you're looking at the conference or the journal version of it. I wrote this paper along with Aaron and Yoshua, who are here on my committee today. The basic idea is that we wanted to make a new feature learning algorithm that we thought would generalize better when you have very few labeled examples available, and it does do that, but not quite as dramatically as we hoped. The main reason we thought we might want this kind of approach is that another kind of sparse coding had been shown by Adam Coates to perform much better than RBMs, even if you use the same features for both architectures: you could learn your features using an RBM and then put them into a sparse coding feature encoder and get better performance. At the same time, Aaron and another grad student here in Montréal had trained a model called the spike-and-slab restricted Boltzmann machine, which has a new kind of sparsity-inducing hidden unit. Our idea was that maybe this new kind of unit for restricted Boltzmann machines could be useful if we introduced it to the world of sparse coding, to get the improved generalization performance that the sparse coding feature encoder had shown on the CIFAR-10 dataset.

The basic way that we train this model is with what's called variational learning. This is necessary because some of the probabilistic computations we need in order to maximize the likelihood of the data under this model are not tractable if you evaluate the log probability itself. What we can do instead is construct a lower bound, called the evidence lower bound, or the negative variational free energy, or the negative Helmholtz free energy. The idea is that this green curve here, represented by the italic L, is our lower bound on the true likelihood, and the true likelihood is a much nastier function. Both of these shapes are fictional; they're a kind of cartoon of how you can think of this process in one dimension. This isn't the real likelihood function or the real lower bound, but by making this line wavy we convey that it's something nasty to compute and difficult to work with, and by making this line smooth we convey that it's something easy to work with.

The way that we actually construct this lower bound is by performing what's called variational inference. The specific thing that's difficult to compute in the spike-and-slab sparse coding model is the probability distribution over the hidden units given the visible units. The way we get around that problem is that we introduce a simpler approximate probability distribution that we call Q, and we look for the Q that is as close as possible to the true posterior distribution over the hidden units. At the point where our Q matched the true posterior, the green curve would touch the red curve; in practice we make assumptions that prevent the bound from ever being tight, but we do try to make it as tight as possible, so that we're working in as close a space as possible to the original learning problem.
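(For reference, here is the bound just described written out in its standard form; the notation is mine, not the slides'.)

```latex
% For visible units v, latent variables (h, s), and any approximate posterior Q:
\log p(v) \;=\;
  \underbrace{\mathbb{E}_{Q}\bigl[\log p(v, h, s)\bigr] + \mathcal{H}[Q]}_{\mathcal{L}(Q),\ \text{the evidence lower bound}}
  \;+\; D_{\mathrm{KL}}\bigl(Q(h, s)\,\|\,p(h, s \mid v)\bigr).
% Since the KL term is nonnegative, \mathcal{L}(Q) \le \log p(v), with equality exactly
% when Q matches the true posterior; making the bound tight and making Q close to
% the posterior are therefore the same problem.
```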
The way that we actually carry out this search for the best Q is by minimizing the KL divergence between Q and the true posterior distribution. Doing this requires both analytical and iterative optimization. What I mean by that is that we can use the Euler-Lagrange equations to solve for what kind of function you need to represent the approximate distribution, but that function is still going to have some parameters in it, and those parameters need to be found by an iterative search. That iterative search can often be very expensive, so what we proposed in this paper was a better way of doing that iterative search.

To explain exactly how it works, I need to tell you a little bit more about the model itself. The spike-and-slab sparse coding model involves variables called h, which are binary; we call those the spike variables. It also involves variables called s, which are real-valued; we call those the slab variables. Every spike variable is paired with a slab variable, and they're multiplied together before you apply a linear transformation to get a reconstruction of the observed data v. The basic idea is that you can think of this as decomposing an image into a set of edges: each spike variable says whether an edge is absent or present, and each slab variable says how strong that edge is in the image, in other words the brightness associated with that component. When we want to do inference in this model, the Euler-Lagrange equations tell us that we get a Gaussian distribution over the slab variables, and that the probabilities of the spike variables are basically just parameters we need to find with an iterative optimization procedure.
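(Not from the talk: a minimal sketch of the generative process just described, in my own notation rather than the paper's exact parameterization. The dictionary, sizes, and noise level are illustrative.)

```python
# Spike-and-slab generative process: binary "spike" h, real-valued "slab" s,
# multiplied together and passed through a linear map to produce the visibles v.
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_visible = 64, 256

W = rng.normal(size=(n_visible, n_hidden))        # dictionary of components ("edges")
h = rng.binomial(1, 0.1, size=n_hidden)           # spikes: is each edge present?
s = rng.normal(size=n_hidden)                     # slabs: how strong is each edge?

# Each spike is paired with a slab and they are multiplied before the linear map.
code = h * s
v = W @ code + 0.1 * rng.normal(size=n_visible)   # noisy visible vector generated from the code

print((code == 0).mean())   # most components are exactly zero: true sparsity
```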
Unfortunately, no one before us had been able to scale this model to a very large size. This is a plot in log space showing the number of training patches on the x-axis and the number of latent variables in the model on the y-axis, and different approaches to this problem had all gotten stuck along this diagonal line: as you try to scale up to larger and larger sizes, you just get stuck. To do good work in object recognition you have to get up here. These blue dots are previous work in object recognition using other kinds of cheaper feature extractors; the little green plus signs are where we got to with our new inference algorithm. Our new inference algorithm was based on moving to the graphics card and designing an update that allowed us to update all of the parameters of the variational distribution in parallel, rather than updating each of them sequentially. Updating them sequentially didn't work well with massively parallel architectures like GPUs, so earlier implementations were forced to run at relatively slow speeds on the CPU and couldn't benefit from updating many variables simultaneously.

Finally, we did get the payoff we were looking for: we did actually get better generalization when few labeled training examples are available for each class. Here we took the CIFAR-10 dataset, and on the x-axis we use different amounts of labels; all the features were learned on the entire dataset. This figure originally came from a paper by Adam Coates, where he compared an encoder that he calls thresholding to sparse coding; those are the red and the blue curves with the confidence intervals around them. To this figure we added this spike-and-slab sparse coding curve, so you can see that on the part of the curve where there aren't very many labeled examples available, we outperform sparse coding every step of the way. We believe this is because the spike variables allow the spike-and-slab sparse coding model to have true sparsity: it can say that certain edges are not present in most of the samples it generates. Traditional sparse coding uses what's called a Laplace prior, and if you generate from that model it almost never generates anything that actually has a true zero in the code. So the higher proportion of exact zeros may regularize more strongly in some way; we don't have an extensive explanation of why this actually works better.
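(A small illustration of the sparsity point just made, not from the talk; the priors and their hyperparameters are my own toy choices. Samples from a Laplace prior essentially never contain exact zeros, while samples from a spike-and-slab prior mostly do.)

```python
# Fraction of exact zeros under a Laplace prior versus a spike-and-slab prior.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

laplace_code = rng.laplace(loc=0.0, scale=1.0, size=n)

spike = rng.binomial(1, 0.1, size=n)    # 10% of units switched on
slab = rng.normal(size=n)
spike_slab_code = spike * slab

print((laplace_code == 0).mean())       # ~0.0: no true zeros
print((spike_slab_code == 0).mean())    # ~0.9: mostly exact zeros
```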
When we go to the full-size datasets, we didn't actually manage to beat the state of the art with this model. We did manage to improve the state of the art with a fixed pooling pattern, where features are extracted all across the image and then you take little grids of them and average the feature values within each grid. Using that, we were able to perform better than sparse coding, which was the previous state of the art. There was a new adaptive pooling method that we weren't really able to reproduce in our own work, and that adaptive pooling method, working with a different feature extractor, held the state of the art on CIFAR-100 at the time we wrote this paper. If we used a pyramid method to try to compete with it, we could get the error bars on our accuracy to overlap with theirs. So overall, spike-and-slab sparse coding didn't push forward the state of the art a whole lot; it just allowed us to work better when there are very few labeled examples available. One thing we did that was pretty exciting was win a transfer learning challenge at NIPS 2011. We had a very small labeled training set (this is the entire labeled training set), and the idea was that you could train the model on the entirety of CIFAR-100, and then for these new classes, like tomatoes and ants and things like that, you got only these examples. We used spike-and-slab sparse coding and this pooling pattern to win the contest with a test set accuracy of 48 percent, even though we had as few as 5 examples for some of the classes.

The next paper I'll present is called multi-prediction deep Boltzmann machines. For this one there's a video available online as well as the paper; the video is associated with the International Conference on Learning Representations 2013, in the workshop track, and it might be a little bit hard to dig up because the final paper actually appeared at NIPS. The basic idea behind this paper was that we wanted to find a better way of training deep Boltzmann machines. Deep Boltzmann machines are a probabilistic model that we find very exciting because they have feedback connections, but the way that you train them is very complicated and hacky, and we think that partly it's just frustrating to deal with that level of complication, and also that it introduces some suboptimality because the procedure is too greedy. This paper was with both my advisors, Aaron and Yoshua, and also Mehdi Mirza, who couldn't be here today because he's in California for an internship.

The typical DBM training algorithm that we wanted to improve consists of four stages if you have two hidden layers, and it grows linearly if you have more layers than that. The first stage is that you train a restricted Boltzmann machine, which is like a little building block of these deep probabilistic models and has just a single layer of hidden variables. You train this one model to represent the input data well, and then you train as many other Boltzmann machines on top of that as you need to build the whole architecture you want. After you've finished training each of the individual models, you glue them together into one big deep probabilistic model, where maybe the first layer represents edges and the second layer represents corners and so on. The cool thing about this model is that the knowledge of the corners can influence the knowledge of the edges; the information doesn't flow purely uphill, it flows both uphill and downhill. After you've glued this model together, you can jointly train the whole model to maximize the log probability of both the pixels you see coming in from below and the labels you see coming in from above, so you're learning both to associate pixels with labels (you learn that pictures of cats are cats) and also how to draw new cats. After you've done this training stage, you can convert this into a specialized classifier by taking the weights you learned in this generative model and using them to define a feed-forward multi-layer perceptron classifier. At the same time, you also delete the class variable from the old model, and you freeze all of the weights in the original model and continue learning. Deleting this class variable doesn't correspond to marginalizing it out; it's just a hack that the original authors found to work very well.

We have a few issues with this training procedure. One is just that it's complicated; it has some crazy hacks in it, like deleting this variable here. The other is that we think training each of the layers in isolation is too greedy and deprives them of the opportunity to cooperate with each other and learn from the start that corners can influence edges and so on. We think that one reason the layer-wise pre-training is necessary for the deep Boltzmann machine to work is that we train the deep Boltzmann machine using sampling-based approximations. The deep Boltzmann machine's probability distribution is defined using two different factors: one has all of your parameters in it and defines how likely each state is relative to the other states, but to get this to be a probability distribution we need to divide by the partition function, which is just a normalization constant that makes your entire distribution add up to one. The problem is that this is often intractable, and in deep Boltzmann machines it is. The way we can approximate it is that, if you look at the derivative of its logarithm in order to maximize the log likelihood, you end up being able to write it in terms of an expectation, over different states, of the gradient of the energy in that state, and you can approximate this expectation by drawing samples from the model. Unfortunately, drawing samples from the model requires Markov chain Monte Carlo, and that has issues with mixing between widely separated modes, among other problems. One thing in particular is that you don't know how many steps of Markov chain updates you need to do before your latest sample represents a fair sample from the model's distribution, so if you take too few steps your gradient will be inaccurate when you train using the persistent contrastive divergence algorithm, which was pretty much the only algorithm that worked for deep Boltzmann machines prior to this paper.
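(For reference, here is the gradient being approximated, written in standard energy-based-model notation; the notation is mine, not the slides'.)

```latex
% With unnormalized probability e^{-E(v, h; \theta)} / Z(\theta):
\frac{\partial}{\partial \theta} \log p(v)
  \;=\; \mathbb{E}_{h \sim p(h \mid v)}\!\left[ -\frac{\partial E(v, h)}{\partial \theta} \right]
  \;-\; \mathbb{E}_{(v', h') \sim p(v', h')}\!\left[ -\frac{\partial E(v', h')}{\partial \theta} \right].
% The second ("negative phase") term is the gradient of \log Z(\theta); persistent
% contrastive divergence estimates it with samples from a Markov chain that is kept
% running across learning steps, which is why taking too few steps between updates
% gives an inaccurate gradient.
```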
We think that one reason deep Boltzmann machines might be so hard to train all at once, rather than layer by layer, is that the number of Markov chain steps you need to mix well between gradient steps changes a lot during the course of learning. If, very early on, the second layer doesn't do much of anything, then you can make pretty big steps on its parameters without damaging your Markov chain, but as it starts to specialize and actually be useful, you can get multiplicative interactions where small changes in the deeper layers modulate the lower layers as well and essentially use that leverage to really change the distribution as a whole. So the mapping between changes in parameter space and changes in KL divergence might be very different as learning proceeds.

Our goal was to make a new learning algorithm where we can throw out all of those problems. Instead of training several different models with several different criteria, we'd like to just train one unified probabilistic model from the start, and be able to classify well with it, evaluate probabilities with it, and so on. The way we did this is with what we call multi-prediction training. We use variational inference to infer the probability distribution over subsets of the variables given the other variables. Here each row corresponds to the learning step on a different example in a minibatch, and we would run all three rows at the same time to get a better estimate of the gradient of this criterion. The black circles represent variables that we observe; the blue circles represent the target variables that we want the model to fill in. So we observe these black variables, we run mean field inference through an architecture that looks kind of like a recurrent network, and at the end we get the model's estimate of these blue variables. Then we use back-propagation to improve the log likelihood of the model's estimate of those variables. This turns out to be similar to a criterion called generalized pseudolikelihood, but we're making a variational approximation to it. One thing worth noting is that the kind of variational approximation we're making here is a little bit less principled than the one we used for spike-and-slab sparse coding: here we don't actually get a lower bound on the generalized log pseudolikelihood, because the partition function makes that very hard to do, but what we can say is that we're finding the closest distribution to the true one that is possible within our approximating family, and then we're shaping that approximate distribution.

There are a few different benefits of this training procedure. One is that the learning rate doesn't really affect how accurate your approximation is. The inaccuracy in your approximation comes from the variational assumptions you make, like the assumption that the different units are independent from each other, but it doesn't matter what your previous steps of learning did. With PCD, you have this persistent chain that sticks around, and if you break that chain by moving too fast, then your approximation will suffer on the subsequent steps. Another thing is that as you continue to learn, because the inference defines the entirety of learning, your learning algorithm will start to make the independence assumptions more and more true as time goes on. With deep Boltzmann machines trained with PCD you partially get that effect, because you have the mean field inference in the positive phase, but the negative phase might undo it to some extent. So the models we learn might be somewhat suboptimal, but they also have a better idea of how to do inference in themselves. I should also mention that similar ideas were applied by Stoyanov et al. in 2011 for other kinds of models that weren't as deep or as powerful as deep Boltzmann machines.
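(Not from the talk: a sketch of the multi-prediction idea for a small binary two-layer DBM. The shapes, initialization, the stand-in training example, and the number of mean-field steps are all my own illustrative choices, and this shows only the forward pass; the actual training procedure backpropagates through these unrolled updates to improve the filled-in values.)

```python
# Observe a random subset of the visibles, run a few mean-field updates to fill in
# the rest, and score the model's guesses for the held-out units.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_v, n_h1, n_h2 = 784, 500, 500
W1 = 0.01 * rng.normal(size=(n_v, n_h1))
W2 = 0.01 * rng.normal(size=(n_h1, n_h2))
b_v, b1, b2 = np.zeros(n_v), np.zeros(n_h1), np.zeros(n_h2)

v_true = rng.binomial(1, 0.5, size=n_v).astype(float)   # stand-in training example
observed = rng.random(n_v) < 0.5                        # random subset is observed
missing = ~observed

# Mean-field state: observed visibles are clamped, everything else starts at 0.5.
q_v = np.where(observed, v_true, 0.5)
q_h1 = np.full(n_h1, 0.5)
q_h2 = np.full(n_h2, 0.5)

for _ in range(10):                                     # unrolled mean-field updates
    q_h1 = sigmoid(q_v @ W1 + q_h2 @ W2.T + b1)
    q_h2 = sigmoid(q_h1 @ W2 + b2)
    q_v[missing] = sigmoid(q_h1 @ W1.T + b_v)[missing]  # fill in the held-out units

# Cross-entropy of the model's guesses for the variables it had to predict; this is
# the quantity that gets backpropagated through the unrolled updates during training.
eps = 1e-7
loss = -np.mean(v_true[missing] * np.log(q_v[missing] + eps)
                + (1 - v_true[missing]) * np.log(1 - q_v[missing] + eps))
print(loss)
```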
There's also a trick we can do to reduce the cost of mean field inference, which is that you just reduce the number of steps, divide the input by two, and treat everything as observed. This is a trick for inferring the labels given the inputs, and the math behind it is not really sound, but it follows the same idea that inference in the dropout algorithm uses, and it seems to work well in practice. Over here you can see that we actually get better accuracy with the multi-inference trick than with true mean field inference, and that comes from the fact that if you do multi-inference you're averaging together several different recurrent nets that all solve similar optimization problems, so if there are small imperfections in the way they've been fit, you can wash those imperfections out.

Our results are mostly on the MNIST dataset. The original DBM got 95 mistakes on the test set; if we use the same setup as they did, we get 99 mistakes. This is where you have the extra MLP on top of the deep Boltzmann machine. If we just use inference in the model itself, and allow ourselves to make the model a bit bigger than they did, we can get down to 91 mistakes, and that's an average over five folds or something like that; if you run it multiple times you'll sometimes get down to mistakes in the 70s, but that's not a reliable outcome. We also looked at the variation across different hyperparameters. These box-and-whisker plots show the error of a few different methods, in log space on the y-axis. The other method is centering-based training of deep Boltzmann machines, which is another way of training deep Boltzmann machines without having to do layer-wise pre-training. The people who introduced centered deep Boltzmann machines didn't write much about classification, so we ran the classification experiments to see how well centering works for classification, and it doesn't seem to work very well for that task in particular. Multi-prediction deep Boltzmann machines reach a much lower error rate and are also a lot more robust to the choice of hyperparameters. Centering does work a lot better for the log-likelihood of the pixel values, though; we don't really understand exactly why centering fails for classification while multi-prediction falls short for density estimation. Overall, we accomplished the goals we wanted to: we got one model that can be trained in a single stage, and it uses the same unified probabilistic model both for filling in missing values and for classification.

The next paper I'll show you was a collaboration with about half the lab; it's called maxout networks. For people who want to see more detail, there is a talk about this at ICML 2013, and the videos from that are online as well as the paper. The basic idea behind maxout is that we wanted to introduce a new kind of activation function for neural networks. The way neural networks usually work, for feed-forward networks, is that you have some sort of input vector that we'll call v, and you have a bunch of weights defined by a big weight matrix; you do a matrix multiplication to go from v to z (usually there are also some biases added in here, but I'll leave them out). Each row of the matrix, which defines one of your z output values, is called a weight vector, and each of the individual z values is fed through an activation function to output the final value of your hidden layer. This activation function is usually element-wise, meaning we apply the same function to z1 to get h1 as we apply to z2 to get h2; we never look at more than one z value at a time.
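(Not from the talk: a minimal sketch of the standard layer just described; the dimensions and the choice of sigmoid are illustrative.)

```python
# Standard feed-forward layer: z = W v + b, then an element-wise activation.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 8, 4
W = rng.normal(size=(n_out, n_in))
b = np.zeros(n_out)
v = rng.normal(size=n_in)

z = W @ v + b                    # each row of W is one unit's weight vector
h = 1.0 / (1.0 + np.exp(-z))     # element-wise activation (here, a logistic sigmoid)
print(h)
```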
One of the most traditional activation functions is the logistic sigmoid. On the x-axis we've got the z value coming in, and on the y-axis we've got the h value coming out: when z is very negative the output is close to zero, and when z is very positive the output is close to one. That makes it easy to interpret this value as saying whether some feature is absent or present. When we use things like gradient descent, it's very easy to make fast progress going downhill if you land on the steep part of this curve; if you land over here, you don't make any progress. A newer hidden unit, introduced by Jarrett et al. in 2009 and popularized by Glorot et al. in 2011, is the rectified linear activation function. It doesn't saturate on the right, it just goes up in a straight line forever, and on the left side it's flat. You can make much better progress on this kind of gradient than on the gradient of the logistic sigmoid activation function, but you can still get stuck to the left of zero.

One way of dealing with that is to make a new kind of activation function that looks at two different z values at the same time, and specifically our activation function just takes the max of those values. Other people had previously done things where they take the max over groups of different values, but they tended to have some other kind of nonlinearity in between the z and the max. The idea behind this project was to see how well you can do using only the max itself, because that lets you essentially learn the activation function. So here we've learned to do the absolute value rectification function, which is a popular activation function, especially in the first layer of convolutional networks for vision. The difference is that rather than being driven by a single input z value, we've got two incoming z values, so the x-axis is now a single input value that drives the z's. If we land over here, where the green line is maximal, then we can update the parameters for the green line; if we land over here, where the blue line is maximal, then we can update the parameters for the blue line. There is nowhere you can land where you get stuck, so learning can proceed much faster in this kind of architecture.
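(Not from the talk: a sketch of a maxout layer with two filters per unit, matching the description above; the sizes are illustrative. Each hidden unit takes the max over its own group of linear responses, so the shape of the unit's activation function is learned.)

```python
# Maxout layer: max over learned linear "pieces" instead of an element-wise nonlinearity.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_units, n_pieces = 8, 4, 2

W = rng.normal(size=(n_units, n_pieces, n_in))   # two weight vectors per unit
b = np.zeros((n_units, n_pieces))
v = rng.normal(size=n_in)

z = W @ v + b          # shape (n_units, n_pieces): all the linear responses
h = z.max(axis=1)      # max across the pieces of each unit

# For comparison, a rectified linear unit is max(z, 0); maxout drops the fixed zero
# and takes the max over learned linear functions, so whichever piece is maximal
# receives gradient and learning never gets stuck on a flat region.
print(h)
```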
If we compare maxout to various sizes of rectifier networks, it's a little bit tricky to make the comparison fair, because there are different numbers of parameters and different numbers of units. This blue network here represents the maxout network; we've got two different filters per unit. One thing we can do is compare to a rectified linear network that has exactly the same architecture but takes the max against zero to get the rectified linear functionality. This has the same number of parameters, the same number of units, and so on, but it's kind of unusual because you don't normally do cross-channel pooling in a rectifier net, so that might be unfair to rectifiers. To make it more fair, we tried two other versions of rectifiers: one with the same number of outputs per layer, which has fewer parameters than the maxout net, and another with the same number of filters per layer, which has more outputs and a lot more parameters overall than the maxout net. That one is a bit unfair to maxout, but we wanted to make sure we gave rectifiers a fair chance. So we instantiated all four of these different architectures, did a random hyperparameter search over all of them, and plotted the best example of each of them on the CIFAR-10 dataset. You can see that maxout is the dark blue curve. If you add the zero to the max, so that your units are rectified linear units rather than just linear responses, you do quite a bit worse. If you make it smaller and get rid of the zeros, you do even worse. If you make it really big, which makes it more expensive to train than the maxout net, it actually keeps up with maxout for quite a while, but then it gets stuck. So it seems like maxout is a way of getting a lot of filters into the network very efficiently, and also of allowing the gradient to flow through very fast. There are a few other advantages to maxout that I won't get into in detail here; it also works very well with the dropout algorithm, but due to limited time I'm not going to present that part in detail.

Maxout has been applied to several different problems since we first published it. It's been especially popular in the speech community, which is surprising because we used it exclusively for vision, but there have been at least four speech papers that I'm aware of using maxout. One of Yoshua's students also used it to do matchmaking for multiplayer games. It's been used by another lab for text detection, which means just finding where text is in a photograph, and at McGill for text transcription, which means actually reading the letters out of the text after you've found it. In Yoshua's lab it's been used for making optimization problems easier, by converting neural nets to maxout when you want the optimization to not be as difficult. It's also been used in recurrent networks, and in some online challenges it's been used for whale call detection and black-box classification; black-box classification means you're given data, you're not told what the data means at all, but you're asked to classify based on the patterns in the training set.

The last paper I'll show you is about my internship at Google last year. This is with the Street Smart team at Google, and it's about how to transcribe the address numbers on buildings seen by the Street View car. The basic idea is that we have sequences of numbers coming in, so this is a little different from the object recognition work I showed earlier, where you have a bird and you want to say it's a bird; here we've got multiple numbers, and we want to say what category they're in, but we also want to get those categories listed in the right order, so that's a bit trickier. The other thing is that we don't have very good detections of them: earlier I showed you a nicely cropped picture of a bird, where the bird filled the whole image, but here we don't know exactly how big the numbers are going to be or exactly where in the image they're going to appear. So we want a single neural network that can both find the numbers and transcribe them.

The way we did this is with a convolutional network where we assume there is a bound on the number of output digits. We said there are going to be at most five digits, and we have one output that encodes the length, from one to five; it can optionally also say "it's longer than five, I don't know how to transcribe it", and that means we can have a human being transcribe that one. For Street View that's acceptable, because it means we can save a lot of money on hiring human operators: we now hire them only to transcribe the street numbers that are of length six or greater. If you wanted to make a robot that autonomously drives around and finds houses based on their address numbers, then you would need some way of handling longer sequences, but we sidestepped that problem for this application. Each of the digit classifiers responds to just a single position in the sequence: this one tells you what the first number is, this one tells you the second number, this one tells you the third number, and so on. All of these different classifiers are fully connected to a set of features extracted by a convolutional network, and we train the entire thing with backpropagation. That means that if the network wants to spend most of its effort learning features that help it predict the length, it'll do that, and if it wants to devote most of the features to the first digit, it'll do that instead; we don't need to spend a lot of time hand-designing most of the system.
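(Not from the talk: a sketch of the output structure just described. All sizes and the stand-in feature vector are illustrative; in the real system the shared features come from a deep convolutional network.)

```python
# One length head plus five per-position digit heads, all sharing the same features.
import numpy as np

rng = np.random.default_rng(0)
softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()

n_features = 128
features = rng.normal(size=n_features)        # stand-in for the conv-net features

# Length head: lengths 1..5 plus a "longer than 5" class.
W_len = rng.normal(size=(6, n_features))
p_length = softmax(W_len @ features)

# One digit head (10 classes) per position, up to the maximum of 5 digits.
W_digits = rng.normal(size=(5, 10, n_features))
p_digits = [softmax(W_digits[i] @ features) for i in range(5)]

print(p_length.shape, [p.shape for p in p_digits])
```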
The way we train this is that we just write down a log likelihood for the sequence as a whole. The way the log likelihood looks is that you evaluate the log likelihood of the length, and then you sum, over the first L positions, the log likelihoods of the labeled digits, where L is the true length in the data set. So at train time you don't need to rely on the model's estimate of the length to figure out how long the sequence is; you can just use the label in the data set, and you simply don't backpropagate anything for the digit positions that aren't used. Here, only the first three digits are being used. At test time we actually do need to figure out the length, and that is just slightly complicated: you can't just ask the length predictor how long the sequence is, because that disregards the probability of the sequence as a whole and looks at only that individual element. If you want to look at the likelihood of the whole sequence, you need to scan through each of the individual digit predictors and see how much probability they assign to their maximal class. If you have a digit classifier that's really not confident, it will assign probability around 0.1 to everything, and that might be a sign that you shouldn't try to output a digit in that position, because you've got to commit to a bet on a single sequence. So if your length predictor only slightly prefers one length over another, the digit predictors might push it over the edge. It turns out that you don't really have to do this if all you care about is making a map: when you make a map, we throw out the lowest-confidence predictions anyway and just have human operators transcribe those. If you care about evaluating the total accuracy across the entire data set, then you do gain a few percentage points of accuracy by doing this optimization, but the only examples where it changes the answer are the ones where we have low confidence anyway, the ones that are usually thrown out in the commercial system.
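(For reference, here are the training criterion and the test-time decision just described, written out; the notation is mine.)

```latex
% For an image x with labeled length L and digits s_1, ..., s_L:
\log p(S \mid x) \;=\; \log p(L \mid x) \;+\; \sum_{i=1}^{L} \log p(s_i \mid x),
% and at test time the length is chosen to account for the whole sequence,
\hat{L} \;=\; \arg\max_{\ell} \left[ \log p(\ell \mid x) + \sum_{i=1}^{\ell} \max_{k} \log p(s_i = k \mid x) \right],
\qquad \hat{s}_i \;=\; \arg\max_{k}\, p(s_i = k \mid x).
% Positions beyond the labeled length contribute nothing at training time, which is
% the "don't backpropagate anything for the unused digits" rule mentioned above.
```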
On the public Street View House Numbers dataset, we set the state of the art both on individual characters and on the full sequence; nobody else had published on the full sequence before. We also have an internal private dataset that's much harder, with things like numbers spray-painted on walls in countries that can't afford mass-produced molded plastic house numbers, and so on. On that dataset we only get 91% accuracy. Human accuracy is about... actually, I think we matched human accuracy on the public SVHN. Sorry, our coverage at human accuracy is 89%, so we can transcribe 89% of the data with the same accuracy as a human, and the system knows when it's going to be inaccurate, so we can throw out the remaining examples where our accuracy would drop below the human level.

Here are some examples of numbers that we transcribed; these are specifically ones it got wrong, to give you an idea of how hard the internal private dataset is. Here we transcribed this as 1180. This 7 could look like a 1 in isolation, but as a human being, when you see it in context with the other 1, you realize that they probably did not change font in the middle of the address number, so you can infer that it's most likely a 7; our system isn't smart enough to have figured out the law of conservation of fonts. Over here is one that the human operator actually got wrong as well. I actually can't read it: you can see that a number fell off, and because the box is less dirty where the number once was, you can tell what the number used to be. I think it's a 3, but that's from memory; on the projector I can't actually read it. Our human operator transcribed it as 184 and just ignored the gap, and our convolutional net guessed that the answer was 1844 and got it right. One thing that can also confuse it is large gaps between numbers. Here the correct answer is 239; our text detector was actually a little bit high and we cropped the 9 off, but the convolutional net didn't even get the 3, because the gap before the 3 was big enough that it decided they were separate pieces of text. Finally, our interface for working with this system is not really the greatest, because sometimes we're asking the convolutional net to do magic: if we show it a large square with several numbers in it, it's pretty hard for it to know that it was meant to transcribe the street number down here and that this 500 up here is part of an ad.

One thing we found that was pretty cool is that this network gets smarter and better at this task with depth. Something that guided our development of this system is something Yann said on Google+, actually. Someone was asking why it's interesting to study architectures with depth greater than one, and Yann said: because it's interesting to study programs that can execute more than one sequential instruction. You can think of each layer as executing instructions in parallel, but when you have more layers you can execute instructions in sequence. This is a little bit of a different perspective than the representation learning perspective we used in the other work: it says that some of our features might not be factors of variation that are driving the data; some of them might just be little bits of bookkeeping that the program is using while it proceeds through the segmentation. So we have both factors of variation and little index variables that help us parse better and better as the network gets deeper. As a control experiment, we tried running networks with more and more parameters that didn't get any deeper, and they just overfit. That seems to suggest that depth is a nice prior, which is one of Yoshua's perspectives: a lot of why depth is helpful is that we are telling the system that the function it learns ought to look like several recursively applied functions, and not just something that can be done massively in parallel right at the start.
As a broad conclusion to the thesis, I would say that the last four years have been very exciting and a time of a lot of change in deep learning. When I signed up, everything was all about unsupervised learning, and we did indeed confirm that unsupervised learning can help when there's very little labeled data available. But unsupervised learning is not yet at the point where we can really solve AI-level tasks with it; it could get there someday, especially in combination with other techniques, but it's not there right now. Another thing we found is that some of the work we developed in the course of working on unsupervised learning is actually useful in the short term even if it doesn't lead to better classification: it's very good for things like filling in missing values, as the MP-DBM is able to do, or classifying in the presence of missing inputs. And finally, we found that there is a very effective way of getting large, deep neural networks to perform very well on human-level AI tasks, and that is to use lots of implicit ensembling, as provided by the dropout algorithm and, to some extent, by the multi-prediction algorithm, and also to just use a whole lot of data, like we had for the Street View project. This suggests that over the next five years we'll hopefully learn how to do well without needing so much data, and we'll be able to leverage the unsupervised learning techniques that have somewhat fallen out of fashion but haven't yet fully reached their potential. And that concludes my presentation, so I'm ready for questions.
Info
Channel: nouiz
Views: 147,830
Rating: 4.939651 out of 5
Keywords: PhD (Degree), Machine Learning (Software Genre)
Id: ckoD_bE8Bhs
Length: 45min 21sec (2721 seconds)
Published: Wed Sep 03 2014