The moment we stopped understanding AI [AlexNet]

Video Statistics and Information

Captions
This is an activation atlas. It gives us a glimpse into the high-dimensional embedding spaces modern AI models use to organize and make sense of the world. The first model to really see the world like this, AlexNet, was published in 2012 in an eight-page paper that shocked the computer vision community by showing that an old AI idea would work unbelievably well when scaled. The paper's second author, Ilya Sutskever, would go on to co-found OpenAI, where he and the OpenAI team would massively scale up this idea again to create ChatGPT. This video is sponsored by KiwiCo; more on them later.

If you look under the hood of ChatGPT, you won't find any obvious signs of intelligence. Instead, you'll find layer after layer of compute blocks called transformers; this is what the T in GPT stands for. Each transformer performs a set of fixed matrix operations on an input matrix of data and typically returns an output matrix of the same size. To figure out what it's going to say next, ChatGPT breaks apart what you ask it into words and word fragments, maps each of these to a vector, and stacks all of these vectors together into a matrix. This matrix is then passed into the first transformer block, which returns a new matrix of the same size. This operation is then repeated again and again: 96 times in ChatGPT-3.5, and reportedly 120 times in ChatGPT-4.

Now here's the absurd part: with a few caveats, the next word or word fragment that ChatGPT says back to you is literally just the last column of its final output matrix, mapped from a vector back to text. To formulate a full response, this new word or word fragment is appended to the end of the original input, and this new, slightly longer text is fed back into the input of ChatGPT. This process is repeated again and again, with one new column added to the input matrix each time, until the model's output returns a special stop word fragment. And that is it: one matrix multiply after another, GPT slowly morphs the input you give it into the output it returns.

Where is the intelligence? How is it that these 100 or so blocks of dumb compute are able to write essays, translate language, summarize books, solve math problems, explain complex concepts, or even write the next line of this script? The answer lies in the vast amounts of data these models are trained on. Okay, pretty good, but not quite what I wanted to say next.

The AlexNet paper is significant because it marks the first time we really see layers of compute blocks like this learning to do unbelievable things: an AI tipping point toward high performance and scale, and away from explainability. While ChatGPT is trained to predict the next word fragment given some text, AlexNet is trained to predict a label given an image. The input image to AlexNet is represented as a three-dimensional matrix, or tensor, of RGB intensity values, and the output is a single vector of length 1,000, where each entry corresponds to AlexNet's predicted probability that the input image belongs to one of the 1,000 classes in the ImageNet dataset: things like tabby cats, German Shepherds, hot dogs, toasters, and aircraft carriers. Just like ChatGPT today, AlexNet was somehow magically able to map the inputs we give it into the outputs we wanted, using layer after layer of compute blocks, after training on a large dataset.

One nice thing about vision models, however, is that it's easier to poke around under the hood and get some idea of what the model has learned. One of the first under-the-hood insights that Krizhevsky, Sutskever, and Hinton show in the AlexNet paper is that the model has learned some very interesting visual patterns in its first layer.
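To make the append-and-repeat loop described above concrete, here is a minimal, hypothetical Python sketch. Nothing here is ChatGPT's actual code: `toy_model` is a made-up stand-in for the stack of transformer blocks, and the token ids and stop token are invented for illustration; only the shape of the loop matches the description.

```python
# Toy stand-in for the autoregressive next-token loop described above.

STOP = -1  # hypothetical "stop" word fragment


def toy_model(token_ids):
    """Pretend transformer stack: returns the id of the next word fragment.

    A real model would return the last column of its final output matrix,
    scored over every possible fragment; here we replay a canned sequence.
    """
    canned = [7, 3, 9, STOP]
    return canned[len(token_ids) - 3]  # 3 = length of the toy prompt below


def generate(prompt_ids, model, stop_token=STOP):
    ids = list(prompt_ids)          # one "column" per word fragment
    while True:
        next_id = model(ids)        # predict the next fragment
        if next_id == stop_token:   # special stop fragment ends the response
            break
        ids.append(next_id)         # feed the slightly longer sequence back in
    return ids


print(generate([1, 2, 3], toy_model))  # -> [1, 2, 3, 7, 3, 9]
```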
The first five layers of AlexNet are all convolutional blocks, first developed in the late 1980s to classify handwritten digits, and they can be understood as a special case of the transformer blocks in ChatGPT and other large language models. In convolutional blocks, the input image tensor is transformed by sliding a much smaller tensor of learned weight values, called a kernel, across the image and, at each location, computing the dot product between the image and the kernel. Here it's helpful to think of the dot product as a similarity score: the more similar a given patch of the image and the kernel are, the higher the resulting dot product will be.

AlexNet uses 96 individual kernels in its first layer, each of dimension 11 by 11 by 3, so conveniently we can visualize them as little RGB images. These images give us a nice idea of how the first layer of AlexNet sees the image. The upper kernels in this figure show where AlexNet has clearly learned to detect edges, or rapid changes from light to dark, at various angles; images with similar patterns will generate high dot products with these kernels. Below, we see where AlexNet has learned to detect blobs of various colors. These kernels are all initialized as random numbers, and the patterns we're looking at are completely learned from data.

Sliding each of our 96 kernels over the input image and computing the dot product at each location produces a new set of 96 matrices, sometimes called activation maps. Conveniently, we can view these as images as well. The activation maps show us which parts of an image, if any, match a given kernel well. If I hold up something visually similar to a given kernel, we see high activation in that part of the activation map; notice that it goes away when I rotate the pattern by 90 degrees, because the image and kernel are no longer aligned. You can also see various activation maps picking up edges and other low-level features in our image.

Of course, finding edges and color blobs in images is still far removed from recognizing complex concepts like German Shepherds or aircraft carriers. What's astounding about deep neural networks like AlexNet and ChatGPT is that, from here, all we do is repeat the same operation again, just with a different set of learned weights. For AlexNet, this means that these 96 activation maps are stacked together into a tensor that becomes the input to the exact same type of convolutional compute block, the second overall layer in the model. We can make our activations easier to see by removing the values close to zero.

Unfortunately, in our second layer we can't learn much by simply visualizing the weight values in the kernels themselves. The first issue is that we just can't see enough colors: the depth of the kernel has to match the depth of the incoming data. In the first layer of AlexNet, the depth of the incoming data is just three, because the model takes in color images with red, green, and blue channels. However, since the first layer computes 96 separate activation maps, the computation in the second layer of AlexNet is like processing images with 96 separate color channels. The second factor that makes what's happening in the second layer of AlexNet more difficult to visualize is that the dot products are really taking weighted combinations of the computations in the first layer; we need some way to visualize how the layers are working together. A simple way to see what's going on is to try to find parts of various images that strongly activate the outputs of the second layer. For example, this activation map appears to be putting together edge detectors to form basic corners.
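As a concrete illustration of the sliding dot product described above, here is a short NumPy sketch: one random 11 by 11 by 3 kernel, standing in for a learned one, slid across a random RGB image with stride 4 to produce a single activation map. AlexNet's first layer does this with 96 learned kernels at once; the image size, kernel values, and image values here are all stand-ins.

```python
import numpy as np

# Sliding-window dot product: one 11x11x3 kernel moved across an RGB image
# with stride 4, producing one activation map. AlexNet's first layer repeats
# this with 96 learned kernels; this kernel and image are random stand-ins.

rng = np.random.default_rng(0)
image = rng.random((227, 227, 3))        # height x width x RGB (stand-in)
kernel = rng.random((11, 11, 3))         # learned weights in the real model
stride = 4

out = (227 - 11) // stride + 1           # 55 positions per side
activation_map = np.zeros((out, out))
for i in range(out):
    for j in range(out):
        patch = image[i * stride:i * stride + 11, j * stride:j * stride + 11, :]
        # Dot product as a similarity score between the patch and the kernel.
        activation_map[i, j] = np.sum(patch * kernel)

print(activation_map.shape)              # (55, 55)
```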
Remarkably, as we move deeper into AlexNet, strong activations correspond to higher- and higher-level concepts. By the time we reach the fifth layer, we have activation maps that respond very strongly to faces and other high-level concepts. And what's incredible here is that no one explicitly told AlexNet what a face is. All AlexNet had to learn from were the images and labels in the ImageNet dataset, which does not contain a person or a face class. AlexNet was able to learn completely on its own both that faces are important and how to recognize them.

To better understand what a given kernel in AlexNet has learned, we can also look at the examples in the training dataset that give the highest activation values for that kernel; for our face kernel, not surprisingly, we find examples that contain people. Finally, there's this really interesting technique called feature visualization, where we can generate synthetic images that are optimized to maximize a given activation. These synthetic images give us another way to see what a specific activation layer is looking for.

By the time we reach the final layer of AlexNet, our image has been processed into a vector of length 4,096. The final layer performs one last matrix computation on this vector to create a final output vector of length 1,000, with one entry for each of the classes in the ImageNet dataset. Krizhevsky, Sutskever, and Hinton noticed that this second-to-last-layer vector demonstrated some very interesting properties. One way to think about this vector is as a point in 4,096-dimensional space: each image we pass into the model is effectively mapped to a point in this space, and all we have to do is stop one layer early and grab this vector. Just as we can measure the distance between two points in 2D space, we can also measure the distance between points, or images, in this high-dimensional space.

Hinton's team ran a simple experiment where they took a test image from the ImageNet dataset, computed its corresponding vector, and then searched for the other images in the ImageNet dataset that were closest, the nearest neighbors, to the test image in this high-dimensional space. Remarkably, the nearest-neighbor images showed highly similar concepts to the test images. In figure four from the AlexNet paper, we see an example where an elephant test image yields nearest neighbors that are all elephants. What's interesting here, too, is that the pixel values themselves between these images are very different. AlexNet really has learned high-dimensional representations of data where similar concepts are physically close. This high-dimensional space is often called a latent or embedding space.

In the years following the AlexNet paper, it was shown that not only distance but directionality in some of these embedding spaces is meaningful. The demos you see where faces are age- or gender-shifted often work by first mapping an image to a vector in an embedding space, then literally moving this point in the age or gender direction in that embedding space, and then mapping the modified vector back to an image.
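Here is a minimal NumPy sketch of the nearest-neighbor experiment described above. The 4,096-dimensional embeddings are random stand-ins; in the real experiment, each row would be the second-to-last-layer vector AlexNet produces for an ImageNet image.

```python
import numpy as np

# Nearest neighbors in a 4,096-dimensional embedding space. Each row stands in
# for the penultimate-layer vector of one image; real vectors would come from
# running images through the network and stopping one layer early.

rng = np.random.default_rng(0)
embeddings = rng.random((1_000, 4096))   # one 4,096-d vector per image
query = embeddings[42]                   # the "test image" vector

# Euclidean distance from the test image to every image in the set.
distances = np.linalg.norm(embeddings - query, axis=1)
nearest = np.argsort(distances)[1:6]     # drop the first entry: the query itself
print(nearest)                           # indices of the 5 nearest-neighbor images
```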
Before we get into activation atlases, which give us an amazing way to visualize these embedding spaces, please take a moment to consider whether this video's sponsor is something that you or someone in your life would enjoy. I was genuinely really excited to work with this company; they make incredibly thoughtful educational products, and by using the link in the description below you're really helping me make more of these videos.

This video's sponsor is KiwiCo. They make these fun and super well-designed educational crates for kids of all ages. They have nine different monthly subscription lines to choose from, focused on different areas of STEAM, and you can also buy individual crates, which are great for trying out KiwiCo and make amazing gifts. Growing up, I was constantly building; here I am building a tower outside my house to my second-story bedroom. I was obsessed with electronics and would have absolutely loved projects like this pencil sharpener from the Eureka Crate line, which is focused on science and engineering. I really believe that this type of hands-on, self-driven learning is magical. When I really think about my own education, it's the times that I've been fully absorbed in projects like this that I learned the most, and now that I'm a dad, I really want my kids to have the same kind of experiences. KiwiCo really does an amazing job boxing up start-to-finish projects like this. My daughter just got the Panda Crate for fine motor skills; it includes these special crayons specifically designed to help her learn different ways of grasping. You can see her here insisting that she gets to bring them in the car with us. Huge thanks to KiwiCo for sponsoring this video; use the discount code Welch Labs for 50% off your first month of a subscription. Now, back to AlexNet.

There's some really amazing work that combines the synthetic images that maximize a given set of activations with a two-dimensional projection, or flattening out, of the embedding space to make these incredible visualizations called activation atlases. Neighbors on the activation atlas are generally close in the embedding space and show similar concepts the model has learned. We're getting a peek into how deep neural networks organize the visual world. Looking at the synthetic images that most activate neighborhoods of neurons, we can visually walk through the embedding space of the model, seeing it make smooth visual transitions from concepts like zebras to tigers to leopards to rabbits. Moving to the middle layers of the model, we can see less fully formed but still meaningful concepts; moving along this path amazingly correlates with the number and size of pieces of fruit in an image.

The same principle applies in large language models: words and word fragments are mapped to vectors in an embedding space where words with similar meanings are close to each other, and directions in the embedding space are sometimes semantically meaningful. There's some incredible, very recent work from the team at Anthropic that shows how sets of activations can be mapped to concepts in language. These results can help us better understand how LLMs work and can be used to modify model behavior: after clamping a set of activations that correspond to the concept "Golden Gate Bridge" to a high value, the LLM the team was experimenting with began to identify itself as the Golden Gate Bridge.

AlexNet won the ImageNet Large Scale Visual Recognition Challenge by a wide margin in 2012, the third year the challenge was run. In prior years, the winning teams used approaches that, under the hood, look much more like what you might expect to find in an intelligent system. The 2011 winner used a complex set of very different algorithms, starting with an algorithm called SIFT, which is composed of specialized image analysis techniques developed by experts over many years of research. AlexNet, in contrast, is an implementation of a much older AI idea, an artificial neural network, where the behavior of the algorithm is almost entirely learned from data.
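As a rough illustration of the two-dimensional "flattening out" of the embedding space described above, here is a NumPy sketch that projects high-dimensional activation vectors onto their top two principal components. The published activation atlases combine feature visualization with nonlinear dimensionality reduction rather than plain PCA, and the vectors here are random stand-ins for real activations.

```python
import numpy as np

# Simple stand-in for the 2-D layout step behind activation atlases: project
# high-dimensional activation vectors onto their top two principal components.
# (Real atlases use nonlinear reduction plus feature visualization.)

rng = np.random.default_rng(0)
activations = rng.random((2_000, 512))   # stand-ins for activation vectors

# Center the data, then take the top two directions of variance via SVD.
centered = activations - activations.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T          # (2000, 2) layout coordinates

print(coords_2d.shape)                   # (2000, 2)
```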
The dot product operation between the data and a set of weights was originally proposed by McCulloch and Pitts in the 1940s as a dramatically oversimplified model of the neurons in our brain. In the second half of each transformer block in ChatGPT, and at the end of AlexNet, you'll find a multi-layer perceptron. The perceptron is a learning algorithm, and a physical machine, from the 1950s that uses McCulloch-Pitts neurons and can learn to perform basic shape recognition tasks. Back in the 1980s, a younger Geoff Hinton and his collaborators at Carnegie Mellon showed how to train multiple layers of these perceptrons using a multivariate calculus technique called backpropagation. These models, a couple of layers deep, were remarkably pretty good at driving cars. In the 1990s, Yann LeCun, now Chief AI Scientist at Meta, was able to train five-layer-deep models to recognize handwritten digits.

Despite the intermittent successes of artificial neural networks over the years, this approach was hardly the accepted way to do AI right up until the publication of AlexNet. If this was obviously the way to build intelligent systems, we would have done it decades earlier. As Ian Goodfellow writes in his excellent deep learning book, at this point deep networks were generally believed to be very difficult to train; we now know that algorithms that have existed since the 1980s work quite well, but this was not apparent circa 2006. The issue is perhaps simply that these algorithms were too computationally costly to allow much experimentation with the hardware available at the time.

The key difference in 2012 was simply scale: scale of data and scale of compute. The ImageNet dataset was the largest labeled dataset of its kind to date, with over 1.3 million images, and thanks to NVIDIA GPUs, in 2012 Hinton's team had access to roughly 10,000 times more compute power than Yann LeCun had 15 years before. LeCun's LeNet-5 model had around 60,000 learnable parameters; AlexNet increased this a thousand-fold, to around 60 million parameters; and today ChatGPT has well over a trillion parameters, making it over 10,000 times larger than AlexNet. This mind-boggling scale is the hallmark of the third wave of AI we find ourselves in today, driving both these models' performance and the fundamental difficulty in understanding how they are able to do what they do.

It's amazing that we can figure out that AlexNet learns representations of faces, and that large language models learn representations of concepts like the Golden Gate Bridge, but there are many, many more concepts these models learn that we don't even have words for. Activation atlases are beautiful and fascinating, but they are very low-dimensional projections of very high-dimensional spaces, where our spatial reasoning abilities often fall apart. It's notoriously difficult to predict where AI will go next. Almost no one expected that the neural networks of the 80s and 90s, scaled up by three or four orders of magnitude, would yield AlexNet, and it was almost impossible to predict that a generalization of the compute blocks in AlexNet, scaled up by four orders of magnitude, would yield ChatGPT. Maybe the next AI breakthrough is just another three to four orders of magnitude of scale away, or maybe some mostly forgotten approach to AI will resurface, as AlexNet did in 2012. We'll have to wait and see.
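For reference, here is a minimal NumPy sketch of the idea described above: a single McCulloch-Pitts-style neuron, a dot product between inputs and weights followed by a threshold, trained with the classic perceptron update rule. The tiny AND-gate dataset is purely illustrative, not anything from the video.

```python
import numpy as np

# One McCulloch-Pitts-style neuron: a dot product between inputs and weights,
# thresholded to 0 or 1, trained with the classic perceptron update rule.
# The AND gate below is just an illustrative toy dataset.

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                    # logical AND

w = np.zeros(2)
b = 0.0
for _ in range(10):                           # a few passes over the data
    for x, target in zip(X, y):
        pred = 1 if np.dot(w, x) + b > 0 else 0
        w += (target - pred) * x              # nudge weights toward the target
        b += (target - pred)

print([1 if np.dot(w, x) + b > 0 else 0 for x in X])   # [0, 0, 0, 1]
```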
Are you mad that I called the blocks of compute dumb?
Not at all. Describing the compute blocks as dumb highlights the impressive nature of how simple operations can combine to produce intelligent behavior; it's a great way to emphasize the power of the underlying algorithms and training data.
Info
Channel: Welch Labs
Views: 470,330
Id: UZDiGooFs54
Length: 17min 38sec (1058 seconds)
Published: Mon Jul 01 2024