History of Bayesian Neural Networks (Keynote talk)

Captions
Okay, great, everybody can hear me. So I'm going to give a history of Bayesian neural networks, and I wanted to dedicate this talk to my friend and colleague David MacKay, who is probably the smartest person I've ever met, who sadly died this year, and who was a pioneer of this field. I'm really delighted that Ryan Adams is giving a tribute to David, talking about his contributions to the field, which were incredible and numerous.

Okay, so I'm apparently an old-timer now, and that means I give talks about history. I have 30 minutes to talk about 30 years, so I'm going to go pretty quickly. All right, let's dive into it. Thirty years ago, back in the 1980s, there was a huge wave of excitement in neural networks. It was amazing. I was a teenager at the time and I was already interested in this field. Boltzmann machines came out in 1985, the backprop paper that appeared in Nature came out in '86, and these lovely PDP volumes came out in 1987. Who remembers, or knows, the PDP volumes? How sad. These were classics; this is what launched the field in the 1980s, and you can go back in there and find a lot of brilliant ideas. Unfortunately, if you search for "deep learning" you won't find this stuff, because it used to be called connectionism. And NIPS was the main conference, which launched in 1987. At that time I was an undergraduate, and I worked on recurrent neural nets for natural language parsing, which seemed like an amazing and crazy idea. Of course it didn't work: we had hardly any data and hardly any compute power. So that was the 1980s.

Well, the talk is about Bayesian neural networks, so I just want to set some standard terminology. We don't need to describe what a neural network is, but I'm going to do it anyway, because I'm going to do it my way. A neural network is a parameterised function. Like in any machine learning problem, you have some data; let's take X's and Y's. By neural network I mean a multi-layer perceptron and variations of it, feed-forward neural networks; of course we could talk about other versions. Any parameterised function has parameters, let's call them theta, and the parameters correspond to the weights and biases in the neural network. A feed-forward neural network models a distribution over some outputs given some inputs and theta, and generally it's a nonlinear function of both theta, the parameters, and X, the inputs. For binary classification, for example, with no hidden layer this would just be called logistic regression: some linear combination of the inputs with the weights, passed through some nonlinearity like the logistic (it could be a probit or something like that). Of course you can vary all these things: you can vary the nonlinearities and make them ReLUs or whatever you want, and you can stack these functions together, so multi-layer or deep neural networks are just models of some overall function expressed as compositions of functions; this, for example, would be called a two-layer network. Usually what people do is train these networks to maximize some likelihood or some penalized likelihood, which you can call an error function if you don't like the word likelihood, and they tend to use variants of stochastic gradient descent to do that. So that's the neural network.
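To pin down that notation, here is roughly what those pieces look like written out (a sketch in the notation just described, not the slide's exact formulas):

\[
P(y = 1 \mid x, \theta) \;=\; \sigma(\theta^{\top} x), \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
\quad \text{(logistic regression: no hidden layer)}
\]
\[
f(x) \;=\; \sigma\!\big(W^{(2)}\, \sigma(W^{(1)} x + b^{(1)}) + b^{(2)}\big)
\quad \text{(a two-layer network: a composition of functions)}
\]
\[
\hat{\theta} \;=\; \arg\max_{\theta} \sum_{n} \log P(y_n \mid x_n, \theta)
\quad \text{(maximum likelihood training, e.g. by SGD)}
\]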
Well, let's fast-forward to the present for a minute: what's deep learning? Deep learning is a lot of things, but to a one-slide approximation, deep learning systems are neural network models similar to the ones we had in the '80s and '90s, with a few important differences. One is that we've had some good ideas since the '80s, architectural and algorithmic innovations: having many layers, using ReLUs, better initialization and learning rates, dropout, LSTMs, etc.; the list gets longer every time you look at arXiv. We also have, importantly, vastly larger datasets, web-scale data, which we didn't have back then; vastly greater compute power, including GPUs and the cloud; much, much better software for dealing with this stuff, which is great and generally open; and a huge amount of investment and media hype. That's the difference between now and the 1980s.

Okay, so the talk is the history of Bayesian neural networks. I've talked about neural networks and a little bit about deep learning. Let's step back a little and ask: what are the limitations of deep learning? Is it the panacea that's going to solve everything in life? Neural networks and deep learning methods are amazing, and they give great results on many benchmark tasks, but they are generally very data-hungry, often needing millions of examples; very compute-intensive to train and deploy; poor at representing uncertainty, something I'll come back to in a few minutes; easily fooled by adversarial examples; finicky to optimize, requiring a lot of expertise; and generally uninterpretable black boxes, lacking in transparency and difficult to trust. I love them, they're great, but we have to acknowledge that there are a lot of issues, and fortunately, for every one of these bullet points there are people at this conference working on it; I'm not saying nobody's working on these things, we're all working on them.

Okay, so the talk, and the whole workshop, is about Bayesian deep learning. What do I mean by Bayesian? Let's be clear: neural networks have parameters, and when we talk about Bayesian neural networks we talk about dealing with all sources of parameter uncertainty (the parameters are uncertain given the data), and also potentially dealing with structural uncertainty: how many hidden units, how many layers, etc. So the parameters are the weights, and the structure is the choice of architecture, the number of hidden units, the choice of activation functions, and so on. That's what Bayesian means; I'll go through a couple of tutorial slides just in case. Bayes' rule is just a way of updating our beliefs about hypotheses given data. It says that before we observe data, we should express our knowledge, or state of knowledge, about hypotheses through a probability distribution; that's the best way to represent uncertainty, and before we have data we had better have some uncertainty, so let's represent it. That's called the prior. Then any hypothesis can be evaluated by scoring the probability that the data would have under that hypothesis; that's called the likelihood. And Bayes' rule, which follows from the basic rules of probability theory, says that the way you combine the prior and the likelihood is to multiply them and renormalize over the space of plausible hypotheses you're considering. And that's learning, learning from data. When I talk about hypotheses, I mean anything that's uncertain, and when I talk about data, I mean only the measured things. In machine learning, the only thing that's measured is the data; everything else is uncertain: the parameters, the structure of the model, and so on.
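In symbols, with \(h\) a hypothesis and \(\mathcal{D}\) the data (my shorthand, not necessarily the slide's), the rule just described is:

\[
P(h \mid \mathcal{D}) \;=\; \frac{P(\mathcal{D} \mid h)\, P(h)}{\sum_{h'} P(\mathcal{D} \mid h')\, P(h')}
\]

prior times likelihood, renormalized over the hypotheses under consideration.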
So the Bayesian method tries to be honest about the fact that we have uncertainty about all of these things. This is my obligatory one slide on Bayesian machine learning, as backup. Essentially, even Bayes' rule is not a separate principle; it's a corollary of two more fundamental rules, the sum rule and the product rule of probability. Everything follows from the sum rule and the product rule. Instead of X's and Y's, we write it in terms of the symbols we care about: say theta for the parameters of a model, m for a model class, and D for the data we might observe. Then learning is just one simple expression: apply Bayes' rule to your state of knowledge, which was the prior, once you observe the data, and you get your posterior state of knowledge, the posterior over parameters given the data. That's your prior for the next bit of data you observe; there's nothing really different between a prior and a posterior, they're just states of uncertainty. If you want to make predictions, the sum and product rules tell you that you should take the predictions for every parameter value and average those predictions with respect to the posterior probability of the parameters given the data, so Bayesian prediction is naturally an ensemble method. And you don't have to come up with any new principles for comparing models: you just apply Bayes' rule at the level of model classes. So that's Bayesian machine learning in one slide. It's incredibly theoretically straightforward; it's hard to pick holes in it unless you don't like priors, and we can talk about that later if you want.
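Written out, that one slide is (with \(\theta\) the parameters, \(m\) the model class, \(\mathcal{D}\) the data; these are the standard formulas, sketched here for reference):

Learning:
\[
P(\theta \mid \mathcal{D}, m) \;=\; \frac{P(\mathcal{D} \mid \theta, m)\, P(\theta \mid m)}{P(\mathcal{D} \mid m)}
\]

Prediction (the ensemble average):
\[
P(x \mid \mathcal{D}, m) \;=\; \int P(x \mid \theta, \mathcal{D}, m)\, P(\theta \mid \mathcal{D}, m)\, d\theta
\]

Model comparison (Bayes' rule one level up):
\[
P(m \mid \mathcal{D}) \;\propto\; P(\mathcal{D} \mid m)\, P(m)
\]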
Okay, so why should we care? Because we should care about having calibrated model and prediction uncertainty. If we're going to deploy these things in the real world, they had better know when they don't know, and they had better be able to tell us when they don't know, because if decisions are going to be based on the outputs of our machine learning methods and those outputs are overconfident, we're going to have big problems on our hands. That's one of the reasons: calibrated uncertainties. Of course there's a lot more to machine learning than calibrated uncertainties, but we'll get to that in a minute. The other reason I find very appealing is that this framework, these equations on the previous slide, gives you an automatic way of capturing model complexity control and doing structure learning. If you're trying to search over a space of possible models for a good model of the data, and you want to compare models that are very different from each other, then the Bayesian Occam's razor, which just follows from these basic rules, tells you a way of doing that. Here's a very nice figure from Yarin Gal's thesis, "Uncertainty in Deep Learning", which has just been submitted; it's a nice way to represent what prediction uncertainty might look like. I have some data and I'm learning some function; remember, neural nets are just ways of learning functions. Where I have data, I know what the function looks like; where I don't have data, I had better have a good representation of my uncertainty, and that's what this shows you can do. I also wanted to thank Yarin, because he dug up some very interesting papers in the history of neural nets that I didn't even know about. So that's why we should care.

I also want to make a simple note about models versus algorithms. Historically, NIPS used to have tracks; they weren't parallel tracks, just tracks you submitted to, and the biggest track by far was always "algorithms and architectures". I like that term because it separates what I would call models from algorithms. Models are the things I've listed here, just some random examples: convnets, latent Dirichlet allocation, recurrent neural nets, HMMs, Boltzmann machines. These are models; these are representations of data. We had better cleanly separate models from algorithms; examples of algorithms, like SGD, conjugate gradients, MCMC, etc., are listed in the other column here. Algorithms target some kind of learning: either finding some optimal parameters theta-star, like maximum likelihood or MAP algorithms, or finding the posterior distribution, the probability of the parameters given the data, like MCMC, variational Bayes, etc. The important things to remember are, first, that these algorithms are often not so different from each other (the ones that do optimization and the ones that infer posteriors are not all that different), and second, that "Bayesian" belongs firmly in the algorithms column. I can be Bayesian about any of these models I want: take any well-defined model and you can treat it in a Bayesian manner. (Well-defined means it has to assign probabilities to the data in some way, and basically all neural nets do that.) So, models versus algorithms: you can mix and match these as you please.

All right, so now I've defined Bayesian and I've defined neural network. What's a Bayesian neural network? It should be pretty obvious by now. Here is a neural network: a representation of functions with some parameters we call theta, which are the weights of the neural net. A Bayesian neural network just means treating those parameters in a Bayesian manner. Before you observe the data, you have a prior over the parameters, which might have some hyperparameters controlling things like the width of the prior; that represents sensible ranges of weights you're willing to consider for your data (it's a soft way of representing ranges, is how I like to think about it). Once you observe the data, you get the posterior through some algorithm. And if you want to make predictions, it's a little bit more expensive: you don't choose a single parameter value, you're supposed to average over parameter values to make your predictions, in some approximate way. That's all a Bayesian neural network is.
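Since the talk keeps returning to this predict-by-averaging step, here is a minimal sketch of it in Python (hypothetical helper names; `posterior_samples` stands in for whatever your approximate inference produced, whether Laplace, MCMC, or variational):

```python
import numpy as np

def forward_pass(x, theta):
    """A small two-layer network with a logistic output; theta = (W1, b1, W2, b2)."""
    W1, b1, W2, b2 = theta
    h = np.tanh(W1 @ x + b1)                      # hidden layer
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # P(y=1 | x, theta)

def bayesian_predict(x, posterior_samples):
    """Monte Carlo approximation of P(y=1 | x, D): average the network's
    prediction over weight samples theta ~ P(theta | D)."""
    return np.mean([forward_pass(x, theta) for theta in posterior_samples], axis=0)
```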
Now let's talk about the history, which is where it starts getting fun, and this stuff literally goes back 29 years, at least. Here is a lovely paper by Denker et al., with lots of great authors; people like Sara Solla and John Hopfield are on there as well. It's called "Large Automatic Learning, Rule Extraction, and Generalization", in a journal called Complex Systems. If you look at this paper carefully, it basically talks about choosing probability distributions over the weights of what is essentially a neural network, and I love this sentence: "we can map the weight space onto function space". You can think about the distribution over functions represented by a neural network, which is a more intuitive thing to think about than the distribution over weights, the weights usually being kind of an arbitrary nuisance parameter. And then it basically talks about integrating over weight space, and so on and so forth. It's an interesting early paper: not exactly doing the stuff we now call Bayesian neural nets, but it definitely talks about this stuff, in 1987.

Okay, now we move on to '89. Note the affiliations: a lot of this stuff was going on at AT&T, and John Hopfield was also at Caltech; again, I think it was kind of an AT&T-heavy group. Here's a paper from '89, "Consistent inference of probabilities in layered networks: predictions and generalization", and here we have it, there it is: Bayes' rule applied to neural nets. And if you look at section 4, here is an interesting application of what they do; they say it is to demonstrate "the utility of the average prediction error for determining a sufficient size of the training set, as well as selecting the optimal architecture of the network". So in 1989 they're using Bayesian inference to try to figure out optimal architectures of neural networks.

Then there's Denker and this other guy you might have heard of, Yann LeCun. Next time you see Yann, say, "Yann, I didn't know you were a Bayesian"; he'll like that. Yann has a classic paper on Bayesian inference with John Denker, who is the first author. Essentially, the contribution of this paper is to figure out algorithmically how to compute the posterior over the parameters, and the way they do it is they optimize the parameters and then compute the second derivatives of the log likelihood with respect to the parameters, which is the Hessian; the diagonal elements of the Hessian are these h_ii, and that gives an estimate of the posterior. That's called the Laplace approximation; this is a diagonal version of the Laplace approximation for neural nets. And also, in 1991, there is a very lovely paper by Wray Buntine and Andreas Weigend which is likewise doing very clearly expressed Bayesian inference on neural networks.
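Spelled out, the approximation being computed there is the standard Laplace approximation (a sketch in generic notation): find a mode \(\theta^{*}\) of the log posterior, take a second-order Taylor expansion around it, and read off a Gaussian:

\[
\ln P(\theta \mid \mathcal{D}) \;\approx\; \ln P(\theta^{*} \mid \mathcal{D})
- \tfrac{1}{2} (\theta - \theta^{*})^{\top} A\, (\theta - \theta^{*}),
\qquad
A = -\nabla\nabla \ln P(\theta \mid \mathcal{D})\big|_{\theta^{*}}
\]
\[
P(\theta \mid \mathcal{D}) \;\approx\; \mathcal{N}\!\big(\theta^{*},\, A^{-1}\big)
\]

The diagonal version keeps only the \(h_{ii}\), the diagonal entries of the Hessian, so each weight gets its own independent Gaussian.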
Okay, so now we're entering the golden era of Bayesian neural networks. Around 1992 there was a whole series of papers that David MacKay wrote, coming out of his thesis, and this one is one of the most beautiful: "A Practical Bayesian Framework for Backpropagation Networks". He's really taking this stuff and trying to make it very practical, and I think the abstract itself is lovely to read: a quantitative and practical Bayesian framework is described for learning of mappings in feed-forward networks. The framework makes possible: (1) objective comparisons between solutions using alternative network architectures, so architecture selection; (2) objective stopping rules for network pruning and growing procedures; (3) objective choice of the magnitude and type of weight decay terms or additive regularizers, so all that ad hoc regularization worry is subsumed by thinking within the Bayesian framework; (4) a measure of the effective number of well-determined parameters; (5) quantified estimates of the error bars on network parameters and on network outputs, so prediction errors, as I showed you in Yarin's figure; (6) objective comparisons with alternative learning methods, like splines and radial basis functions; and so on and so forth. A beautiful paper, as fresh today as it was then; if you want to go read one paper, maybe this should be the one.

All right, so now things were getting serious. This is NIPS 1992, my first NIPS; the conference was in December 1992 and the proceedings came out in 1993. Here is a paper by Radford Neal from NIPS 1992 called "Bayesian Learning via Stochastic Dynamics"; again, an absolutely beautiful, brilliant paper. So what does this paper do? Well, it's the first MCMC, Markov chain Monte Carlo, sampling algorithm for Bayesian neural networks. David MacKay did the Laplace approximation very elegantly, and Radford Neal, at around the same time, was doing Markov chain Monte Carlo. In fact, he didn't do just any old kind of Markov chain Monte Carlo; he did something very sophisticated. He used what he called hybrid Monte Carlo, what we now call Hamiltonian Monte Carlo, HMC, which is a sophisticated Markov chain Monte Carlo method that uses gradients to sample efficiently. Here is how this stuff works: you have a neural network, you do backprop on it, which gives you the gradient, and now instead of just optimizing with those gradients, you follow them in a particular way according to the HMC algorithm, occasionally rejecting moves given by those gradient computations, and that samples from the posterior over the weights. An amazing, elegant paper. He was writing his thesis at the time; this is the cover page of his thesis, from 1995. Along with doing hybrid Monte Carlo for Bayesian neural networks, the thesis also establishes a link between Bayesian neural networks and Gaussian processes, which the machine learning community basically didn't know about; he introduced Gaussian processes to the community as a way of thinking about and simplifying Bayesian neural networks. He also describes automatic relevance determination. I'm going to spend a couple of slides on what these things are, because they're useful concepts.

Gaussian processes are a way of representing unknown functions. Imagine you're doing function approximation from X to Y, and you want to learn that function from some data, like these magenta dots. Instead of fitting a function with some parameters, we're going to be what's called nonparametric: we consider a probability distribution over the space of functions, and our choice is going to be a Gaussian process, which defines a distribution over the space of functions. Then we can apply Bayes' rule, written loosely here, over the space of functions. A Gaussian process is just an infinite-dimensional generalization of a Gaussian distribution, and we can use Gaussian processes for regression, classification, ranking, dimensionality reduction, etc.; in fact, for any task in which you have an unknown function.

So that's what Gaussian processes are. Let's describe some of the relationships. As many of you know, I like drawing these cubes, so here's one way of thinking about Gaussian processes. You can start with linear regression and apply various operations to it to get other models. You can kernelize linear regression; kernelizing means mapping your features into some high-dimensional feature space and doing the linear task in that feature space (those are the orange arrows). The blue arrows take a model and treat it in a Bayesian manner; remember, any model can be treated in a Bayesian manner, which just means doing inference over the distribution of the parameters instead of just fitting the parameters. The magenta arrows take a regression model and turn it into a classification model by assuming a different type of output, discrete instead of continuous. So you have linear regression, then Bayesian linear regression, and kernelizing Bayesian linear regression gives you Gaussian process regression.
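As a sketch of what that Bayes' rule over functions buys you in the regression case (the standard GP equations, for reference, not the slide's): with prior \(f \sim \mathcal{GP}(0, k)\) and Gaussian observation noise \(\sigma^2\), the posterior over the function at a test input \(x_*\) is Gaussian in closed form:

\[
\bar{f}(x_*) = k_*^{\top} (K + \sigma^2 I)^{-1} \mathbf{y},
\qquad
\mathbb{V}[f(x_*)] = k(x_*, x_*) - k_*^{\top} (K + \sigma^2 I)^{-1} k_*
\]

where \(K_{ij} = k(x_i, x_j)\) and \((k_*)_i = k(x_i, x_*)\). The variance term is exactly the "I'd better know what I don't know where I don't have data" part.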
Support vector machines live here: they're essentially kernelized linear classification, and Gaussian process classification is sort of a Bayesian analogue of support vector machine classification. Okay, so that's how Gaussian processes are related to linear regression, SVMs, etc. How are they related to neural networks? Well, this is what Radford Neal showed in 1994. He considered a neural network with one hidden layer; people were worrying, "how do I choose my number of hidden units?" (people seem to still worry about that). He said: take the limit of infinitely many hidden units in that one hidden layer, and treat the neural network in a Bayesian manner, so give Gaussian priors to the weights (we could give other priors, as long as they're independent, but Gaussian is an easy case), and scale the widths of those Gaussians in a way that lets us take the limit sensibly. Then this is exactly a Gaussian process, and different choices of the nonlinearity give you different kernels: there's the neural-net kernel, there's a kernel corresponding to ReLU, and so on for all these different things. Radford also went on to talk about limitations of Gaussian processes (again, his thesis is beautiful to read) and about what would happen if you had deep networks; he even talks about infinitely deep neural networks.

Another thing I said Radford introduced in his thesis, which was actually joint work with David MacKay, is the idea of automatic relevance determination, or ARD. If we have a Bayesian neural network, we might want to do something called feature selection: we might want to say which of the input features are relevant and which are irrelevant, and try to determine that automatically. That's what automatic relevance determination does. The way ARD works is that you take the weights of the neural network and say there is a hyperparameter alpha_d corresponding to the outgoing weights of each input unit d. What alpha_d is, is the precision of the Gaussian in the prior, the inverse squared width of the Gaussian. Now we do a thought experiment. If alpha_d goes to infinity, the precision going to infinity means the variance of the Gaussian goes to zero, and if the prior on the weights has zero mean and zero variance, then all of those outgoing weights get zeroed out; they disappear, so input d is irrelevant. On the other hand, if alpha_d is some small, finite, not infinite number, then you have some finite variance, the weights coming out of input d can vary, and feature d is relevant. The cleverness here is that if you do Bayesian inference over the weights, you can actually optimize the alphas (this is sometimes called type II maximum likelihood, or level-two maximum likelihood), and doing that optimizes something that figures out, from the data, which inputs are relevant and which are irrelevant.
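In symbols, the ARD construction is roughly this (a sketch): each input \(d\) gets its own precision \(\alpha_d\) on its outgoing weights,

\[
P(w_{dj} \mid \alpha_d) \;=\; \mathcal{N}\!\big(0,\; \alpha_d^{-1}\big) \quad \text{for all units } j,
\]

and the \(\alpha_d\) are fit by optimizing the marginal likelihood (type II maximum likelihood):

\[
\hat{\alpha} \;=\; \arg\max_{\alpha}\; \int P(\mathcal{D} \mid \mathbf{w})\, P(\mathbf{w} \mid \alpha)\, d\mathbf{w}.
\]

As \(\alpha_d \to \infty\), the prior variance \(\alpha_d^{-1} \to 0\), the outgoing weights of input \(d\) are pinned to zero, and the feature drops out.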
You can then apply that reasoning at every layer of your network, not just the input layer, and get the system to automatically cancel out the roles of individual hidden units and make them disappear from your network. So this is a way of doing feature and architecture selection in a Bayesian framework for neural networks. Interestingly, this idea got ported over to Gaussian processes, so people in the GP community now know about automatic relevance determination, but it came out of neural networks.

Okay, so what about variational learning? We've talked about Laplace approximations, which are these Gaussian, quadratic things, and we've talked about MCMC, which Radford Neal introduced. What about variational Bayesian learning of neural networks? Where would you go to find the earliest reference to that? Here we have a paper that, if you read it, doesn't use the word "variational", and I don't think it even talks about KL divergences (I'm not sure, maybe it does). It's a paper at COLT 1993 called "Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights", and in modern language, what it actually does is derive a diagonal Gaussian variational approximation to the Bayesian neural network weights; it's just couched in an information-theoretic, MDL language. (The beauty of Bayesian methods, by the way, is that you can talk about them from a minimum description length, information theory perspective, etc., and it's all equivalent.) This idea was then generalized in '98, I'm jumping forward a little bit, by Barber and Bishop, who in this case called it "ensemble learning" in Bayesian neural networks; ensemble learning is another synonym for variational approximations that you might want to look up. That paper does full-covariance Gaussian variational approximations to Bayesian neural networks. I'm being told to accelerate. So here is the abstract of the Hinton and van Camp paper; if you read it, it doesn't sound like it's doing variational inference, but that's what it is, so read it carefully. Just as a reminder: the target of Bayesian inference is the posterior over the weights. MCMC generates a chain of samples such that those samples converge to the distribution P(theta | D); variational Bayes finds an approximation q(theta) that minimizes the KL divergence from q(theta) to the posterior.
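In modern notation, that reminder reads (a sketch):

\[
q^{*}(\theta) \;=\; \arg\min_{q}\; \mathrm{KL}\big(q(\theta)\,\|\,P(\theta \mid \mathcal{D})\big)
\]

which, since \(\ln P(\mathcal{D})\) is constant in \(q\), is equivalent to maximizing the lower bound

\[
\mathcal{L}(q) \;=\; \mathbb{E}_{q}\big[\ln P(\mathcal{D} \mid \theta)\big]
- \mathrm{KL}\big(q(\theta)\,\|\,P(\theta)\big) \;\le\; \ln P(\mathcal{D}).
\]

Hinton and van Camp's diagonal Gaussian \(q\) and Barber and Bishop's full-covariance Gaussian \(q\) are both instances of this.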
All right, here's an interesting aside; it's a really important aside. Right around the same time, in 1991, Radford Neal wrote another paper called "Connectionist Learning of Belief Networks", and what this paper does is explicitly link feed-forward neural networks, which were also called connectionist networks, and graphical models, belief networks in this case. What it says is: if you have a feed-forward neural network with sigmoid activation functions, then instead of thinking of those units as deterministic, think of them as binary random variables. That thing is called a sigmoid belief network, and you can do learning in it by Gibbs sampling. Again, incredibly beautiful, very nice, and definitely worth knowing about. For example, we've extended this line of work with a Bayesian nonparametric version of this model, which samples over the number of hidden units, the number of layers, and the types of hidden units in these belief networks; in that paper we are trying to learn the structure of infinitely wide and infinitely deep belief networks.

So here is another cube. If you're looking at probabilistic generative models, the things people might know here in terms of VAEs and so on, you should probably start with factor analysis, which is over 100 years old, and then think about the different operations you could do to it: make undirected versions, or deep versions. Again, "deep" is a synonym; go back a few years and it was "hierarchical", and there are a whole bunch of papers that talk about hierarchical generative models. In fact, I have a paper with Geoff Hinton where we talk about rectified Gaussian belief networks, which are hierarchical generative models with rectified Gaussians; nowadays that same paper would probably be called something like ReLU networks, because that's exactly what it was. And then you can have deep Boltzmann machines, etc., etc. I should probably accelerate, but: deep and hierarchical are synonymous.

Okay, we're coming closer to the present. I've skipped over many papers, and I apologize to the lots of people who've done great work; I'll mention some at the end. I wanted to highlight a particular paper by Max Welling and Yee Whye Teh called "Bayesian Learning via Stochastic Gradient Langevin Dynamics". It combines SGD, which we all know and love, with the Langevin dynamics form of MCMC that Radford Neal introduced to the machine learning community, to get a highly scalable approximate MCMC algorithm based on mini-batch SGD. I just want you to look at equation 4 of this paper. What is it? It's an update rule for your weights that says: take the derivative of the log likelihood of your mini-batch plus the log prior, with some learning rate epsilon; that's just SGD. And what do you do to that SGD? You add some Gaussian noise. That's not even a line's change in your code; that's a few characters' change in your SGD code, and it turns your SGD code into an approximate MCMC algorithm that does Bayesian inference. That's what I mean when I say optimization and Bayesian inference don't have to be that different.
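Here is that "few characters" change written out as a minimal sketch in Python (hypothetical function names; the gradient callables and the step-size schedule are placeholders, in the spirit of equation 4 of the paper):

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik_batch, N, n, eps, rng):
    """One stochastic gradient Langevin dynamics update:
    an SGD step on the log posterior, plus injected Gaussian noise."""
    # Stochastic gradient of the log posterior: log-prior gradient plus the
    # mini-batch log-likelihood gradient rescaled to the full dataset size N.
    g = grad_log_prior(theta) + (N / n) * grad_log_lik_batch(theta)
    # Plain SGD would stop here:  theta + 0.5 * eps * g
    # The few extra characters that make it approximate MCMC:
    noise = rng.normal(0.0, np.sqrt(eps), size=theta.shape)
    return theta + 0.5 * eps * g + noise
```

With the step size eps decayed over iterations as the paper prescribes, the later iterates behave like approximate samples from the posterior rather than a single point estimate.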
There is a whole bunch of papers that have come out in this revival of Bayesian neural networks, a lot of great stuff over the last few years. There's a lot of fun we're all having, and I'm glad to see so many people in this workshop as well; it's hard to call it a workshop when it's in a room this big.

So I'm going to wrap up by reminding people why we need probabilities. I think the probabilistic modelling approach is essential for a lot of problems in learning and intelligence where we need to carefully represent uncertainty: problems like forecasting; decision making; learning from limited, noisy, and missing data; learning complex personalised models; data compression (all of data compression is probabilistic modelling); automatic model discovery; scientific modelling; experiment design; active learning; etc. All of these are fundamentally reliant on estimates of probabilities, so I don't think probabilities are going anywhere; they're here to stay with us.

To conclude: probabilistic modelling offers a general framework for building systems that learn from data. The advantages include better estimates of uncertainty, automatic ways of learning structure and of fighting overfitting, and a very nice foundation. The disadvantages include higher computational and memory cost, but that depends on your approximate inference algorithm: you can write down your probabilistic framework and then approximate it away until you get something that is about as fast as running SGD on your neural network, and we have approximations like that already. Bayesian neural nets have a long history, and they're undergoing a tremendous wave of revival, so it's quite exciting. The paper I have at the bottom is not about Bayesian neural nets; it's an overview of probabilistic machine learning, with some of these arguments spelled out and some interesting directions. So that's it; happy to take questions if there's time.

[Audience] Thanks for the great talk, and thanks also for mentioning the importance of David MacKay in inspiring a lot of work in this area. I'm also a great proponent of the separation of the model and the algorithm, but since this is a small workshop, I think we can have a lively philosophical debate. I'd like to take issue with your firm placement of Bayesian inference on the algorithm side. As I think you would agree (in fact, I'm actually quoting you), Bayesian inference is just the calculus of uncertainty, and in that sense it's far more fundamental than either a model or an algorithm; it's like mathematics, it imbues everything. I think it's sort of cheapening it to associate it firmly with the algorithms.

[Speaker] You're right, and thanks for correcting me by quoting me. In a sense, Bayesian inference isn't an algorithm any more than calculus is an algorithm. Calculus is the mathematical language of rates of change, you could say, and probability theory is the mathematical language of uncertainty, and Bayesian methods are methods that try to adhere to the rules of probability theory; there is a whole bunch of algorithms subsumed by that. Thanks.

[Audience] I'm Matthias. Thanks again, wonderful talk. I noticed at the beginning of your talk you mentioned two of the reasons deep learning took off so much: the availability of huge amounts of computation, and amazing software. I'm just wondering whether you can comment on whether the Bayesian crowd is taking advantage of that, and what we can expect in that direction.

[Speaker] That's a great question. I think the Bayesian crowd has not taken advantage of those things as much as it could have. There's good software for Gaussian processes, for example, but not at industry strength, let's say, although there are recent efforts; for example, we've recently implemented Gaussian processes in TensorFlow, so now any advances in TensorFlow on the software and hardware side can be used for Gaussian processes. The other area that I think is very interesting, and that I'm certainly excited about, is probabilistic programming, where the aim is to abstract away the inference algorithm so that people can get on with writing down their model, just as in the deep learning community people basically wire up their deep learning architecture and then call an optimizer to learn it. That makes their lives easier because they don't have to re-derive things, and we can do the same thing in probabilistic modelling with probabilistic programming; all of those computation graphs, automatic differentiation, all of that can be used for probabilistic modelling as well.

[Audience] To follow up on that, and this is a shameless plug: the Edward system, developed by David Blei's lab, which we at Google are collaborating with them on, is open source, builds on top of TensorFlow and Theano, implements almost every algorithm you've heard of today, and is highly scalable. You can find Edward
at edwardlib.org.

[Speaker] It's good stuff, yeah. [Applause]
Info
Channel: Bayesian Deep Learning Workshop NIPS 2016
Views: 38,658
Keywords: Deep Learning, Bayesian, Bayesian Deep Learning, BDL, NIPS 2016
Id: FD8l2vPU5FY
Length: 40min 24sec (2424 seconds)
Published: Wed Mar 08 2017