Variational Autoencoders - EXPLAINED!

Captions
Over the last decade, deep learning has taken the field of AI by storm. Using neural networks, we can now solve a host of problems. To name a few: one problem we can solve is object detection. Feed the network an image and it will identify the locations of important objects in that image. Another problem we can solve is language translation: feed a neural network an English sentence and it'll spit out the equivalent in French. Another problem we can solve is audio classification: feed the neural network a sound wave and it will determine the object that produced that sound. So if it hears a bark, it spits out "dog," and if it hears a meow, it spits out "cat."

You can see these problems are quite different; they have completely different input and output variables. However, all of them have one thing in common: in every case, the neural network processes the input sample and spits out some result that gives us additional information about the input. Take the case of object detection: we give an input image, and after the network processes it, we know what objects are present and where they are located in the image. That's additional information. In the language translation case, we give an input sentence in English, and after the network processes it, we know how to say the same sentence in another language like French. That's additional information too. In the audio classification case, we feed an audio sample as input, and after the network processes it, we know what animal made that sound. The identity of that animal is, again, additional information.

However, there is a category of networks that are a bit different, in the sense that they don't merely provide additional information about some input sample; they also try to create, or generate, a sample image, audio, or text themselves. This class of neural networks is called generative models, appropriately named. In this video, we're going to go through a particular type of generative model called a variational autoencoder, or VAE. The explanation will be twofold. I'll start with an easy-to-understand intuition on VAEs, and once we have a firm understanding of them, we'll compare them to another type of generative model that has been hogging the spotlight recently: generative adversarial networks, or GANs. Techie or not, you'll be walking out with newfound knowledge of generative modeling and variational autoencoders. I'm also going to throw in some technical jargon for you extra-curious viewers. This is CodeEmporium, so let's get started.

Let's start out with a broad concept: generative modeling. Generative models are also just neural networks themselves. Normal neural network models usually take some sample as input, and this sample is raw data; it could be an image, text, or audio. Generative models, on the other hand, produce a sample as an output. Because of this flip, I think you can see how and why this is so interesting; there is so much potential in this technology. For example, you can train a model to understand how dogs work by feeding it hundreds of dog images. Then, at test time, we can just ask the model for an image and it'll spit out a dog image. The cool thing is that every time we ask our model to generate a dog, it'll generate a different dog. So you can create an unlimited gallery of your favorite animal. Doggos! Sweet!

But what does this generative model black box look like? Let's take a look at the variational autoencoder as an example. As mentioned before, variational autoencoders are a type of generative model. They are based on another type of architecture called autoencoders. These autoencoders consist of two parts: an encoder and a decoder. The encoder takes an input sample and converts its information into some vector, basically a set of numbers, and the decoder takes this vector and expands it out to reconstruct the input sample.
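To make that concrete, here is a minimal sketch of an autoencoder in PyTorch. This is not code from the video; the layer sizes and dimensions are arbitrary assumptions for illustration:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder: the encoder compresses the input to a small
    vector, and the decoder expands that vector back into a reconstruction."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),               # the hidden vector
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),  # the reconstruction
        )

    def forward(self, x):
        z = self.encoder(x)     # vector representation of the input
        return self.decoder(z)  # attempt to rebuild the input from z
```

Training such a model just means minimizing a reconstruction loss, for example `nn.MSELoss()(model(x), x)`.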
Now you may be thinking: why are we doing this? What is the point of trying to generate an output that is the same as the input? And the answer to that is: there is no point. When using autoencoders, we don't tend to care about the output itself, but rather the vector constructed in the middle. This vector is important because it is a representation of the input image or audio, and it's in a form that the computer understands.

So, another question: what is so great about this vector? On its own, I'd say the vector has limited use, but we can feed it to complex architectures to solve some really cool problems. Here's an example of a paper that uses autoencoders to infer the location of an individual based on his or her tweet. The architecture they use consists of three stacked autoencoders to represent the input text from the tweet. This is then piped to two output layers: one is used to determine the state in the United States where the tweet was made, and the other is used to estimate the latitude and longitude of the user when the tweet was made. I'll link the paper below in case you're extra curious. This is just one of the many interesting examples of what you can actually do with these autoencoders.

However, something we cannot do with autoencoders is generate data. Now why is this the case? Let's go back to the autoencoder architecture. It consists of an encoder and a decoder. During training, we feed the images as input and make the model learn the encoder and decoder parameters required to reconstruct the image. At test time, we only need the decoder part, because this is the part that generates the image. To do this, we need to input some vector. However, we have no idea about the nature of this vector. If we just give it some random values, more likely than not we will end up with an image that looks like garbage, so that's pointless. We need some method to determine this hidden vector.

Here's some more intuition. The idea behind determining this vector is sampling from a distribution. I'll explain the basic concepts of sampling and distribution, but I'll also translate them into more technical terms for those of you who are more advanced in probability theory. So, distribution and sampling. Think of a distribution as a pool: a pool of numbers, of vectors. Consider the case where we want to build a generative model to generate different animals. To accomplish this, our generative model needs to learn to create a pool for cats, a pool for dogs, and another pool for giraffes, like so. When I say the dog pool, I don't actually mean a pool that consists of dog images; instead, it consists of vector representations of these images, and they are only understood by the computer. So, in a nutshell, think of a distribution as a pool of vectors. Now, on to sampling. Sampling is a verb: in plain English, sampling means just closing your eyes, reaching into a pool, and picking one vector. If you know where the pool is, then you can go to the pool and randomly pick a vector. So when we say "I sample from the distribution of dog images," it's equivalent to saying that we picked a random vector from the dog pool. Pretty simple!
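In code, the pool-and-sampling intuition might look something like this rough sketch. Modeling the hypothetical "dog pool" as a Gaussian in a 32-dimensional latent space is an assumption made purely for illustration:

```python
import torch

# A "pool" is a distribution over latent vectors. Here the hypothetical
# dog pool is a 32-dimensional Gaussian with some center and spread.
dog_pool = torch.distributions.Normal(
    loc=torch.zeros(32),    # center of the pool
    scale=torch.ones(32),   # how spread out the pool is
)

# "Sampling" = closing your eyes and grabbing one vector from the pool.
dog_vector = dog_pool.sample()
print(dog_vector.shape)  # torch.Size([32])
```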
Now, the problem with regular autoencoders is that we as human beings don't really know where these pools are. Imagine this box represents all possible values for the hidden vector. The cat pool can be here, the dog pool can be here, and the giraffe pool can be somewhere here. Each of these pools is learned by the model during training: when we feed it hundreds of images of animals, our model finds patterns linking similar dogs, cats, and giraffes, and comes up with these pools. Now, these pools, or more technically these distributions, are learned internally by the autoencoder, but there is no way for humans to know about these pools and make use of them for generating images. At test time, we are basically sampling from a random distribution. In other words, it's equivalent to blindfolding ourselves and picking a value from this huge box that contains valid vectors in only a few specific locations and garbage vectors everywhere else. There is a very high chance that we'll pick an irrelevant garbage vector, from which we get an irrelevant garbage output accordingly. So, the big takeaway: we cannot generate dog images with an autoencoder, because we don't know how to assign values to the vector during the generation phase.

We clearly have a problem here. But what if we did know where to pick these vectors from? That would solve our problem, right? Variational autoencoders do just that. We first define a region; that is, we constrain the region from which we want to pick the vectors. Within this region, the goal of the variational autoencoder is to find the pools: the dog pool, the cat pool, and the giraffe pool. This is done during the training phase. During the testing phase, all we need to do to generate an image is randomly sample a vector from this known region and then pass it to the generator part of our variational autoencoder. This will generate an image. A neat property of this region is that it's continuous, so we can alter some values in the vector and still get valid-looking images. Say we train a variational autoencoder to generate handwritten digits from 0 through 9. The VAE will learn the pools such that they are within a defined region. These pools will represent the ten digits from 0 to 9, so it will have to learn ten pools. The region in which these pools are learned is continuous, so I can randomly sample a vector from this continuous region and change its values ever so slightly. The results of changing this vector actually lead to very trippy, psychedelic-looking generated images when they're placed next to each other. This is the simple intuition behind variational autoencoders. If you understood this, then congrats!
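As a hedged sketch of what that test-time generation step looks like: once a VAE is trained so that its pools lie near a standard normal distribution, generating an image is just sampling from that known region and decoding. The decoder below is an untrained stand-in with the same shape as the earlier sketch; a real VAE would load trained weights:

```python
import torch
import torch.nn as nn

latent_dim = 32

# Stand-in decoder for illustration only; a trained VAE decoder would
# have learned weights that map latent vectors to real-looking images.
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)

# The "known region": a standard normal distribution over latent vectors.
z = torch.randn(1, latent_dim)
image = decoder(z)  # decode the sampled vector into a (flattened) image

# Because the region is continuous, nudging z slightly still yields a
# valid-looking image -- the source of those trippy interpolations.
z_nudged = z + 0.05 * torch.randn_like(z)
neighbor = decoder(z_nudged)
```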
Now let's revise some differences between regular autoencoders and variational autoencoders, just to make sure you have a clear understanding of what each does from a more technical perspective. First of all, why does each exist? The goal of a regular autoencoder is to learn a hidden representation of the input, while a variational autoencoder, although it also learns a hidden representation of the input, is additionally used to generate new information. Regular autoencoders cannot generate new data. Here's another question: what are they optimizing? Regular autoencoders learn to transform an input into some vector by minimizing the reconstruction loss. During training, an autoencoder makes sure that what is thrown into it is also spit out; in other words, it tries to minimize the difference between the original and the reconstructed images. Hence, it seeks to minimize the reconstruction loss. Variational autoencoders, on the other hand, generate images by minimizing the sum of the reconstruction loss and a latent loss. The reconstruction loss is the same as what we defined for autoencoders. With the latent loss, we ensure that all the pools learned by the network are within the region we defined earlier. For more technical context: we assume the pools follow a normal, or Gaussian, distribution; hence, during test time, vectors are actually sampled from the mixture of these Gaussians.
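The two losses can be written down quite compactly. Below is a minimal sketch, assuming the standard formulation in which the encoder outputs a mean and log-variance for a Gaussian over the latent vector; it also includes the reparameterization trick that gets mentioned later in the video:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var):
    """Sum of the two VAE losses: reconstruction + latent (KL)."""
    # Reconstruction loss: what goes in should also come out.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Latent loss: KL divergence that keeps each learned "pool" close to
    # a standard normal, i.e. inside the region we sample from at test time.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

def sample_z(mu, log_var):
    """Reparameterization trick: sample z in a way that lets gradients
    still flow back through mu and log_var during training."""
    eps = torch.randn_like(mu)
    return mu + eps * torch.exp(0.5 * log_var)
```

Note how the two terms pull in different directions: the reconstruction term wants the latent vectors spread out enough to tell inputs apart, while the KL term squeezes them toward one shared region.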
Now that we have a clear understanding of VAEs, let's see how they compare with a more famous generative model: generative adversarial networks. First off, how do they learn to generate data? Variational autoencoders have two losses to optimize. The first is the reconstruction loss: what goes into the network is also spit out, with as little difference as possible. The second is the latent loss: making sure the latent vector takes only a specific set of values, so that we know which region to sample this vector from. By optimizing these two losses, our variational autoencoder learns to generate images. Generative adversarial networks, or GANs, work a little differently. Just as the VAE has an encoder-decoder architecture, GANs also have two components: a generator and a discriminator. The generator is responsible for generating images, and the discriminator determines whether a given image is real or fake; by fake, I mean actually created by the generator. The generator and discriminator play a minimax game, where one tries to outperform the other. The generator tries to generate an image that fools the discriminator, making it think that the image is real, and the discriminator tries to correctly distinguish between real and fake images, catching the generator in the act. If one of them messes up, its parameters are slightly tweaked to improve performance. While looking at thousands of images during training, the generator and discriminator networks improve each other, until the generator becomes proficient at generating animal images and the discriminator becomes proficient at telling real images from the fakes made by the generator. Then, at test time, we can just use the generator to spit out the images that we need.

Another aspect on which we can compare GANs and VAEs is stability during training. Training GANs involves finding something called a Nash equilibrium: a point in the game between the generator and discriminator where the game is set to terminate, an end-of-game point. However, there is no concrete algorithm to actually determine this equilibrium yet. On the other hand, VAEs offer a closed-form objective, and by closed form I mean that there is a nice little formula we can use to determine the end of the training phase in variational autoencoders.

Now here's a third aspect on which we can compare VAEs and GANs: how good are the generated images? VAEs work very well in theory, but they tend to generate blurry images. You can mostly attribute this to the fact that VAEs optimize two factors during the training phase: the reconstruction loss, making sure that the output is as close to the input as possible, and the latent loss, making sure that the latent vector can only take a fixed range of values. These two factors often counter each other; there's a trade-off, so the middle ground usually leads to blurry image generation. GAN training, on the other hand, is more empirical and optimized by way of trial and error; they just work. You can write down the losses theoretically, but most of the intuition is based on the fact that we had the results before the actual theory. For simple spatial data like images, GANs produce really high-quality results. I made a video on the evolution of GANs since their inception in 2014, so be sure to check that out after this one. And that's a brief comparison with GANs.

There are certainly deeper concepts that I didn't cover, such as the need for the reparameterization trick in variational autoencoders, or explicitly deriving the two losses of a variational autoencoder, the reconstruction and latent losses. However, there are plenty of good blog posts out there outlining these concepts, and I've linked some of those resources below. I hope you got the basic intuition of variational autoencoders, so that you can now more easily understand any learning resource you pick up from here on. I may make a more mathy, technical video on variational autoencoders later if most of you guys request it, but I'll leave it at this for now. Thank you guys so much for watching. Subscribe to CodeEmporium and CS Dojo for more videos on machine learning, deep learning, and artificial intelligence. See you in the next one. Buh-bye!
Info
Channel: CodeEmporium
Views: 56,244
Keywords: Machine Learning, Deep Learning, Data Science, Artificial Intelligence, Neural Network, vae, autoencoder, variational autoencoder, gan, generative adversarial network, vae networks, vae explained, gan explained, generative model
Id: fcvYpzHmhvA
Length: 17min 36sec (1056 seconds)
Published: Mon Jun 17 2019