Watching Neural Networks Learn

Video Statistics and Information

Captions
You are currently watching a neural network learn. About a year ago I made a video about how neural networks can learn almost anything, and this is because they are universal function approximators.

Why is that so important? Well, you might as well ask why functions are important. They are important because functions describe the world. Everything is described by functions. That's right: functions describe the sound of my voice on your eardrum — function. The light hitting your eyeballs right now — function. Different classes, different areas in mathematics, study different kinds of functions: high school math studies second-degree, one-variable polynomials; calculus studies smooth, one-variable functions; and it goes on and on. Functions describe the world. Yes, correct, thanks Thomas. He gets a little excited, but he's right. The world can fundamentally be described with numbers and relationships between numbers. We call those relationships functions, and with functions we can understand, model, and predict the world around us.

The goal of artificial intelligence is to write programs that can also understand, model, and predict the world, or rather, to have them write themselves. So they must be able to build their own functions. That is the point of function approximation, and that is what neural networks do: they are function-building machines.

In this video I want to expand on the ideas of my previous video by watching actual neural networks learn strange shapes in strange spaces. Here we will encounter some very difficult challenges, discover the limitations of neural networks, and explore other methods from machine learning and mathematics to approach this open problem. Now, I am a programmer, not a mathematician, and to be honest I kind of hate math. I've always found it difficult and intimidating, but that's a bad attitude, because math is unavoidably useful and occasionally beautiful. I'll do my best to keep things simple and accurate for an audience like me, but know that I'm going to have to brush over a lot of things and I'm going to be pretty informal.

I recommend you watch my previous video, but to summarize: functions are input-output machines. They take an input set of numbers and output a corresponding set of numbers, and the function defines the relationship between those numbers. The particular problem that neural networks solve is when we don't know the definition of the function we're trying to approximate. Instead, we have a sample of data points from that function, inputs and outputs. This is our data set. We must approximate a function that fits these data points and allows us to accurately predict outputs given inputs that are not in our data set. This process is also called curve fitting, and you can see why. Now, this is not some handcrafted animation; it is an actual neural network attempting to fit the curve to the data, and it does so by sort of bending the line into shape. This process is generalizable such that it can fit the curve to any data set, and thus construct any function. This makes it a universal function approximator. The network itself is also a function, and should approximate some unknown target function.

The particular neural architecture we're dealing with in this video is called a fully connected feedforward network. Its inputs and outputs are sometimes called features and predictions, and they take the form of vectors: arrays of numbers. The overall function is made up of lots of simple functions called neurons, which take many inputs but produce only one output. Each input is multiplied by its own weight and added up, along with one extra weight called a bias. Let's rewrite this weighted sum with some linear algebra: we can put our inputs into a vector, with an extra 1 for the bias, and our weights into another vector, and then take what is called the dot product. Let's just make up some example values. To take the dot product, we multiply each input by each weight and then add them all up. Finally, this dot product is passed to a very simple activation function, in this case a ReLU, which here returns zero.
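As a minimal sketch of that computation, here is a single neuron in Python. The numbers are made-up example values of my own, not the ones shown in the video:

```python
import numpy as np

def relu(x):
    # ReLU: negative values become zero, positive values pass through
    return np.maximum(0.0, x)

# Hypothetical example values: three inputs plus an extra 1 for the bias.
inputs  = np.array([2.0, -1.0, 0.5, 1.0])   # last entry is the constant 1 for the bias
weights = np.array([-0.5, 0.8, 1.2, -0.3])  # last entry is the bias weight

weighted_sum = np.dot(inputs, weights)  # multiply pairwise, then add everything up
output = relu(weighted_sum)             # a single neuron produces a single output
print(weighted_sum, output)             # -1.5 -> 0.0, i.e. this neuron returns zero
```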
We could use a different activation function, but a ReLU looks like this. The activation function defines the neuron's mathematical shape, while the weights shift and squeeze and stretch that shape. We feed the original inputs of our network to a layer of neurons, each with its own learned weights and each with its own output value. We stack these outputs together into a vector, and then feed that output vector as inputs to the next layer, and the next, and the next, until we get the final output of the network. Each neuron is responsible for learning its own little piece, or feature, of the overall function, and by combining many neurons we can build an ever more intricate function. With an infinite number of neurons, we can provably build any function.

The values of the weights, or parameters, are discovered through the training process. We give the network inputs from our data set and ask it to predict the correct outputs, over and over and over. The goal is to minimize the network's error, or loss, which is some measurement of the difference between the predicted outputs and the true outputs. Over time the network should do better and better as the loss goes down. The algorithm for this is called backpropagation, and I am, again, not going to explain it in this video. I'll make a video on it eventually, I promise; it's a pretty magical algorithm.

However, this is a baby problem. What about functions with more than just one input or output, that is to say, higher-dimensional problems? The dimensionality of a vector is defined by the number of numbers in that vector. For a higher-dimensional problem, let's try to learn an image. The input vector is the row-column coordinates of a pixel, and the output vector is the value of the pixel itself. In math-speak, we would say that this function maps from R² to R¹. Our data set is all of the pixels in an image; let's use this unhappy man as an example. A pixel value of 0 is black and 1 is white, although I'm going to use different color schemes because it's pretty. As we train, we take snapshots of the learned function as the approximation improves. That's what you're seeing now, and that's what you saw at the beginning of this video. But to clarify, this image is not a single output from the network; rather, every individual pixel is a single output. We are looking at the entire function all at once, and we can do this because it is very low-dimensional.

You'll also notice that the learning seems to slow down; it's not changing as abruptly as it was at the beginning. This is because we periodically reduce the learning rate, a parameter that controls how much our training algorithm alters the current function. This allows it to progressively refine details. Now, even though our neural network should theoretically be able to learn any function, there are things we can do to practically improve the approximation and optimize the learning process. For instance, one thing I'm doing here is normalizing the row-column inputs, which means I'm moving the values from a range of 0 to 1400 to the range of -1 to 1. I do this with a simple linear transformation that shifts and scales the values. The -1 to 1 range is easier for the network to deal with because it's smaller and centered at zero.
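Here is a sketch of how that coordinate-to-pixel data set and normalization might be built, assuming a grayscale image loaded as a 2D array with values in [0, 1]; the function and variable names are mine, not from the video:

```python
import numpy as np

def make_image_dataset(image):
    """Turn a grayscale image (2D array, values in [0, 1]) into (row, col) -> pixel
    training pairs, with the coordinates normalized into the range [-1, 1]."""
    rows, cols = image.shape
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    coords = np.stack([r.ravel(), c.ravel()], axis=1).astype(np.float32)

    # Simple linear transformation: shift and scale each coordinate
    # from [0, size-1] to [-1, 1].
    coords[:, 0] = coords[:, 0] / (rows - 1) * 2.0 - 1.0
    coords[:, 1] = coords[:, 1] / (cols - 1) * 2.0 - 1.0

    targets = image.ravel().astype(np.float32).reshape(-1, 1)  # one pixel value per coordinate
    return coords, targets
```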
Another trick is that I'm not using a ReLU as my activation function, but rather something called a leaky ReLU. A leaky ReLU can output negative values while still being non-linear, and it has been shown to generally improve performance, so I'm using a leaky ReLU in all of my layers except for the last one. Because the final output is a pixel value, it needs to be between 0 and 1. To enforce this in the final layer, we could use a sigmoid activation function, which squishes its inputs between 0 and 1. Except there is a different squishing function called tanh, which squishes its inputs between -1 and 1; I can then normalize those outputs into the final range of 0 to 1. Why go through the trouble? Well, tanh just tends to work better than sigmoid. Intuitively, this is because tanh is centered at zero and plays much nicer with backpropagation, but ultimately the reasoning doesn't matter as much as the results. Both networks here are theoretically universal function approximators, but practically one works much better than the other. This can be measured empirically by calculating and comparing the error rates of both networks. I think of this as the science of math, where we must test our ideas and validate them with evidence rather than providing formal proofs. It'd be great if we could do both, but that is not always possible, and it is often much easier to just try and see what happens. And that's my kind of math.
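A minimal sketch of that kind of architecture in PyTorch might look like the following; the layer sizes here are placeholders of my own choosing, not the ones used in the video:

```python
import torch
import torch.nn as nn

# Hidden layers use LeakyReLU; the final layer uses tanh, whose [-1, 1] output
# is then rescaled into the [0, 1] range of a pixel value.
model = nn.Sequential(
    nn.Linear(2, 128), nn.LeakyReLU(),
    nn.Linear(128, 128), nn.LeakyReLU(),
    nn.Linear(128, 128), nn.LeakyReLU(),
    nn.Linear(128, 1), nn.Tanh(),
)

def predict_pixel(coords):
    # coords: tensor of shape (N, 2), already normalized to [-1, 1]
    return (model(coords) + 1.0) / 2.0  # rescale tanh output from [-1, 1] to [0, 1]
```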
Let's make it harder. Here we have a function that takes two inputs, u and v, and produces three outputs, x, y, and z. It's a parametric surface function, and we'll use the equation for a sphere. We can learn it the same way as before: take a random sample of points across the surface of the sphere and ask our network to approximate it. Now, this is clearly a very silly way to make a sphere, but the network is trying its best to sort of wrap the surface around the sphere to fit the data points. I hope this also gives you a better view of what a parametric surface is: it takes a flat 2D sheet and contorts it in 3D space according to some function. This does okay, though it never quite closes up around the poles.

For a real challenge, let's try this beautiful spiral shell surface. I got the equation for this from a wonderful little website that lets you play with all kinds of shell surfaces. See what I mean when I say that functions describe the world? Anyway, let's sample some points across the spiral surface and start learning. Well, it's working, but clearly we're having some trouble here. I'm using a fairly big neural network, but this is a complicated shape, and it seems to be getting a little bit confused. We'll come back to this one.

We can also make the problem harder not by increasing dimensionality, but by increasing the complexity of the function itself. Let's use the Mandelbrot set, an infinitely complex fractal. We can simply define a Mandelbrot function as taking two real-valued inputs and producing one output, the same dimensionality as the images we learned earlier. I've defined my Mandelbrot function to output a value between 0 and 1, where 1 is in the Mandelbrot set and anything less than 1 is not. Under the hood it's iteratively operating on complex numbers, and I added some stuff to output smooth values between 0 and 1, but I'm not going to explain it much more than that. After all, a neural network doesn't know the function definition either, and it shouldn't matter; it should be able to approximate it all the same. The data set here is randomized points drawn uniformly from this range.

Now, this has actually been a pet project of mine for some time, and I've made several videos trying this exact experiment over the years. I hope you can see why it's interesting: despite being so low-dimensional, the Mandelbrot function is infinitely complex, literally made with complex numbers, and is uniquely difficult to approximate. You can just keep fitting and fitting and fitting the function, and you will always come up short. You could do this with any fractal; I just use the Mandelbrot set because it's so well known. So after training for a while, we've made some progress, but clearly we're still missing an infinite amount of detail. I've gotten this to look better in the past, but I'm not going to waste any more time training this network; there are better ways of doing this.

Are there different methods for approximating functions besides neural networks? Yes, many, actually. There are always many ways to solve the same problem, though some ways are better than others. Another mathematical tool we can use is called the Taylor series. This is an infinite sum of a sequence of polynomial terms: x plus x squared plus x cubed plus x to the fourth, up to x to the n, where n is the order of the series. Each of these terms is multiplied by its own value, called a coefficient, and each coefficient controls how much that individual term affects the overall function. Given some target function, by choosing the right coefficients we can approximate that target function around a specific point, in this case zero. The approximation gets better the more terms we add, and an infinite sum of terms is exactly equivalent to the target function. If we know the target function, we can actually derive the exact coefficients, using a general formula to calculate the coefficient for each term. But of course, in our particular problem we don't know the function; we only have a sample of data points. So how do we find the coefficients?

Well, do you see anything familiar in this weighted sum of terms? We can put all of the x-to-the-n terms into an inputs vector, put all of the coefficients into a weights vector, and then take the dot product: a weighted sum. The Taylor series is effectively a single-layer neural network, but one where we compute a bunch of additional inputs: x squared, x cubed, and so on. We'll call these additional inputs Taylor features. We can then learn the coefficients, or weights, with backpropagation. Of course, we can only compute a finite number of these, the partial Taylor series up to some order, but the higher the order, the better it should do.
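A small sketch of what computing those Taylor features might look like; the naming is mine, and the video doesn't show its actual code:

```python
import numpy as np

def taylor_features(x, order):
    """Given inputs x of shape (N, 1), return [x, x^2, ..., x^order] as extra features."""
    return np.concatenate([x ** n for n in range(1, order + 1)], axis=1)

# These features can be fed to a single linear layer (learning one coefficient per term),
# or appended to the inputs of a full multi-layer network.
x = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
print(taylor_features(x, order=4).shape)  # (5, 4)
```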
Let's use this simple Taylor network to learn a function, using eight orders of the Taylor series. Here's our data set, and here's the approximation. That's not great. Polynomials are pretty touchy, as their values can explode very quickly, so I think backpropagation has a tough time finding the right coefficients. But we can do better: rather than using a single-layer network, let's just give these Taylor features to a full multi-layered network. Let's give it a shot. It's a bit wonky, but this performs much better. This trick of computing additional features to feed to the network is a well-known and commonly used one. Intuitively, it's like giving the network different kinds of mathematical building blocks to build a more diverse, complex function. Let's try this on an image data set. Well, that's pretty good; it's learning, but it doesn't seem to work any better than just using a good old-fashioned neural network.

The Taylor series is made to approximate a function around a single given point, while we want to approximate within a given range of points. A better tool for this is the Fourier series. The Fourier series acts very much like the Taylor series, but is an infinite sum of sines and cosines. Each order n of the series is made up of sine of nx plus cosine of nx, and each sine and cosine is multiplied by its own coefficient, again controlling how much that term affects the overall function. The n, that inner multiplier value, controls the frequency of each wave function: the higher the frequency, the more hills the curve has. By combining weighted waves of different frequencies, we can approximate a function within a range of 2π, one full period. Again, if we know the function, we can compute the weights; and even if we don't, we could use something called the discrete Fourier transform, which is really cool, but we're not dealing with it in this video.

I hope you see where I'm going with this. Let's just jump ahead and do what we did before: compute a bunch of terms of the Fourier series and feed them to a multi-layer network as additional inputs, Fourier features. Note that we have twice as many Fourier features as Taylor features, since we have a sine and a cosine. Let's try it on this data set. This works pretty well; it's a little wavy, but not too shabby. Note that for this to work, we need to normalize our inputs between -π and +π, one full period. Let's try this on an image. That looks strange at first, almost like static coming into focus, but it works, and it works really well. If we compare it to networks of the same size trained for the same amount of time, we can see the Fourier network learns much better and faster than the network without Fourier features, or the one with Taylor features. Just look at the level of detail in those curly locks; you can hardly tell the difference from the real image.

Now, I've glossed over a very important detail: the example Fourier series I gave had one input, while this function has two inputs. To handle this properly, we have to use the two-dimensional Fourier series, one that takes an input of x and y. What do we do with that extra y? Here are the terms for the 2D Fourier series up to two orders. We are now multiplying the x and y terms together, and we end up with sine x cosine y, sine x sine y, cosine x cosine y, and cosine x sine y: every combination of sine and cosine, of x and y. Not only that, we also have every combination of frequencies, that inner multiplier, so sine 2x times cosine 1y, and so on and so forth. Here's up to three orders, now four. That is a lot of terms: we have to calculate this many terms per order, and this number grows very quickly as we increase the order, much faster than it would for the 1D series. And this is just for a baby 2D input; for a 3D, 4D, 5D input, forget it. The number of computations needed for higher-dimensional Fourier series explodes as we increase the dimensionality of our inputs. We have encountered the curse of dimensionality: lots of methods of function approximation and machine learning break down as dimensionality grows. These methods might work well on low-dimensional problems, but they become computationally impractical or impossible for higher-dimensional problems. Neural networks, by contrast, handle the dimensionality problem very well; comparatively, it is trivial to add additional dimensions.

But we don't need to use the 2D Fourier series. We can just treat each input as its own independent variable and compute 1D Fourier features for each input. This is less theoretically sound, but much more practical to compute.
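Here is a sketch of those per-input 1D Fourier features, assuming the inputs are already normalized to [-π, π]; this is my own implementation of the idea, not code from the video:

```python
import numpy as np

def fourier_features(x, order):
    """Per-input 1D Fourier features.
    x: array of shape (N, d), each value normalized to [-pi, pi].
    Returns an (N, d * 2 * order) array of [sin(n*x), cos(n*x)] for n = 1..order,
    computed independently for every input dimension."""
    feats = []
    for n in range(1, order + 1):
        feats.append(np.sin(n * x))
        feats.append(np.cos(n * x))
    return np.concatenate(feats, axis=1)

# Example: 2D inputs (like pixel coordinates) with 8 orders -> 2 * 2 * 8 = 32 extra features,
# which would be appended to the network's original inputs.
coords = np.random.uniform(-np.pi, np.pi, size=(10, 2))
print(fourier_features(coords, order=8).shape)  # (10, 32)
```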
It's still a lot of additional features, but it's manageable, and it's worth it: it drastically improves performance. That's what I've been using to get these image approximations. It really shouldn't be surprising that Fourier features help so much here, since the Fourier series and transform are used to compress images; it's how the JPEG compression algorithm works. It turns out lots of things can be represented as combinations of waves.

So let's apply it to our Mandelbrot data set. Again, it looks a little weird, but it is definitely capturing more detail than the previous attempt. Well, that's fun to watch, but let's evaluate. For comparison, here is the real Mandelbrot set. Actually, no, this is not the real Mandelbrot set; it is an approximation from our Fourier network. You might be able to tell if you're on a 4K monitor, especially when I zoom in. This network was given 256 orders of the Fourier series, which means 1024 extra Fourier features being fed to the network, and the network itself is pretty damn big. When we really zoom in, it becomes very obvious that this is not the real deal; it is still missing an infinite amount of detail. Nonetheless, I am blown away by the quality of the Fourier network's approximation. Fourier features are of course not my idea; they come from this paper, which was suggested by a Reddit commenter who I think may actually have been a co-author. I'm still missing details from it, but adding Fourier features was one of, if not the, most effective improvements to the approximation I've applied, and it was really surprising. Returning to the tricky spiral shell surface, we can see that our Fourier network does way better than our previous attempt, although the target function is literally defined with sines and cosines, so of course it will do well.

So if Fourier features help so much, why don't we use them more often? They hardly ever show up in real-world neural networks. To state the obvious, all of the approximations in this video so far are completely useless: we know the functions and the images, so we don't need a massive neural network to approximate them. But I hope you can see that we're not studying the functions, we're studying the methods of approximation. Because these toy problems are so low-dimensional, we can visualize them and hopefully gain insights that will carry over into higher-dimensional problems.

So let's bring it back to Earth with a real problem that uses real data. This is the MNIST data set: images of hand-drawn numbers and their labels. Our input is an entire image flattened out into a vector, and our output is a vector of 10 values representing a label as to which number, 0 through 9, is in the image. There is some unknown function that describes the relationship between an image and its label, and that's what we're trying to discover. Even for tiny 28-by-28 black-and-white images, that is a 784-dimensional input. That is a lot, and this is still a very simple problem; for real-world problems, we must address the curse of dimensionality. Our method must be able to handle huge-dimensional inputs and outputs. We also can't visualize the entire approximation all at once as before; any idea what a 700-dimensional space looks like? But a normal neural network can handle this problem just fine; it's pretty trivial. We can evaluate it by measuring the accuracy of its predictions on images from the data set that it did not see during training; we'll call this evaluation accuracy, and a small network does pretty well.

What if we use Fourier features on this problem, say up to eight orders? Well, it does do a little better, but we're adding a lot of additional features: for only eight orders, we're computing a total of 13,328 input features, which is a lot more than 784, and it's only two percent more accurate.
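That count appears consistent with computing a sine and a cosine per order for each of the 784 inputs and keeping the original pixels; the same bookkeeping gives the 1024 extra features quoted earlier for the 2-input Mandelbrot network at 256 orders. A quick check of my reading of those numbers:

```python
# Extra Fourier features = inputs * 2 (sin and cos) * orders, plus the original inputs.
mnist_inputs, mnist_orders = 784, 8
print(mnist_inputs + mnist_inputs * 2 * mnist_orders)  # 13328 total input features

mandelbrot_inputs, mandelbrot_orders = 2, 256
print(mandelbrot_inputs * 2 * mandelbrot_orders)       # 1024 extra Fourier features
```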
When we use 32 orders of the Fourier series, it actually seems to harm performance, and at 64 orders it's downright ruinous. This may be due to something called overfitting, where our approximation learns the data really well, too well, but fails to learn the underlying function. Usually this is a product of not having enough data, but our Fourier network seems to be especially prone to it. This seems consistent with the conclusions of the paper I mentioned earlier, and ultimately our Fourier network seems to be very good for low-dimensional problems, but not very good for high-dimensional problems. No single architecture, model, or method is the best fit for all tasks; indeed, there are all kinds of problems that require different approaches than the ones discussed here. Now, I'd be surprised if the Fourier series didn't have more to teach us about machine learning, but this is where I'll leave it.

I hope this video has helped you appreciate what function approximation is and why it's useful, and maybe sparked your imagination with some alternative perspectives. Neural networks are a kind of mathematical clay that can be molded into arbitrary shapes for arbitrary purposes. I want to finish by opening up the Mandelbrot approximation problem as a fun challenge for anyone who's interested: how precisely and deeply can you approximate the Mandelbrot set, given only a random sample of points? There are probably a million things that could be done to improve on my approximation, and the internet is much smarter than I am. The only rule is that your solution must still be a universal function approximator, meaning it could still learn any other data set of any dimensionality. This is just for fun, but potentially solutions to this toy problem could have uses in the real world. There is no reason to think that we've found the best way of doing this, and there may be far better solutions waiting to be discovered. Thanks for watching.
Info
Channel: Emergent Garden
Views: 1,168,933
Id: TkwXa7Cvfr8
Length: 25min 27sec (1527 seconds)
Published: Thu Aug 17 2023