If LLMs are text models, how do they generate images? (Transformers + VQVAE explained)

Captions
When you ask a question to a large language model, it is really good at generating long sequences of coherent text. That's why they are large language models, after all: they have been trained to learn language, the probability of different permutations of words co-occurring in a sequence. How can we take a system like that, one that outputs words, and turn it into something that can also output images? Because images cannot be expressed in this language, right?

Last week, DeepMind introduced their brand new multimodal LLM and blew everyone's mind. Gemini can input and output text like most other LLMs, but it can also view and understand images, listen to audio, and, most impressively, generate images on its own. "Okay, now I see blue and pink yarn. How about a pig with blue ears? Or an octopus? Or a bunny with a pink nose?" The Gemini paper, or technical report, doesn't go into much detail about the exact architecture of their network, because it is 2023 and the techniques behind all of the major LLM breakthroughs from these big companies are shrouded in secrecy. I miss the old days when we got all the information we needed to understand what the researchers actually did and learn from it. Despite all of that, they did provide some breadcrumbs in their text and references to form an informed judgment about what might be going on behind the scenes with this model. This video is about the algorithms and ideas behind multimodal LLMs and the ingenious method through which they not only understand but can also generate new images.

If you are aware of how LLMs generate text, you know that when you prompt them with something like "what color are roses", they process the prompt and then generate a reply one token at a time, like "roses are red". The generated token at each step goes back into the model, and the next token is then output, until the entire sequence is complete. This method works easily for text because LLMs essentially act as a multiclass classifier at each time step, picking the probability of each word from their vocabulary given the context. But to generate an image, we would have to generate pixels, and each pixel carries a red, green, and blue value if it's a colored image. So if you're generating a 256x256 RGB image pixel by pixel, the LLM has to generate 196,608 independent values, which is crazy and intractable. Clearly, pixel by pixel is not the way to go here; we need to completely change how we think about images.

Since LLMs are good at generating words or subwords of a language like, say, English, it would be nice if we could reimagine this whole image generation problem as another language generation problem, kind of like Egyptian hieroglyphs or Mayan glyphs: ancient writing systems, predating modern scripts and vocabularies, that used images and symbolism to represent words or sounds. But how do we create such a language, a representation space for the billions of images we have access to in our universe? Clearly we can't do this manually, so machine learning comes to the rescue. We need an ML model that can learn to map between images and their coded sequences. This is not an easy task, but deep learning researchers have found a remarkable, almost genius way to make it happen, and this video is going to cover that. The answer lies in a classic computer vision neural network architecture: the variational autoencoder, or more specifically the vector quantized variational autoencoder, the VQ-VAE.
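To make the token-by-token idea concrete, here is a minimal sketch of greedy autoregressive decoding with a Hugging Face-style causal language model. The gpt2 checkpoint and the bare-bones loop are purely illustrative assumptions, not anything Gemini is claimed to use.

```python
# Minimal sketch of autoregressive, token-by-token text generation.
# The gpt2 model is an illustrative stand-in for "an LLM".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

tokens = tokenizer("Roses are", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):                             # generate 5 new tokens, one at a time
        logits = model(tokens).logits              # (1, seq_len, vocab_size)
        next_token = logits[:, -1].argmax(dim=-1)  # a classifier over the whole vocabulary
        tokens = torch.cat([tokens, next_token[:, None]], dim=1)

print(tokenizer.decode(tokens[0]))
```

Generating a 256x256 RGB image the same way would mean 256 * 256 * 3 = 196,608 of these steps, which is why the rest of the video replaces raw pixels with a much shorter sequence of discrete image tokens.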
Let's get into it. The autoencoder was developed as a neural network that can compress and extract features from images. Let's say we have an RGB image of shape 256x256 with three channels. The autoencoder takes an image like this and passes it through a neural network called the encoder; it can be a CNN-based architecture or a vision transformer. The encoder compresses the input image into a lower-dimensional space, also called the latent space. For example, it can be a 32x32x16 tensor; you can imagine this tensor as a collection of 32x32 vectors, where each vector is of size 16. The compression from the original image to this latent space is about 8%, because 32 * 32 * 16 divided by 256 * 256 * 3 is around 8%. Next, another neural network, called the decoder, takes this latent tensor as input and outputs a new image of shape 256x256x3, the same shape as the original image. The network is trained to reconstruct the original image from this latent space: take an image, pass it through the encoder to compress it, and have the decoder regenerate the original image. The loss is the reconstruction loss between the output image and the original image. That's what the autoencoder does.
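Below is a minimal PyTorch sketch of such an autoencoder, compressing a 256x256x3 image into a 32x32x16 latent tensor and training against a reconstruction loss. The specific layer counts and channel sizes are assumptions for illustration, not the architecture of any particular paper.

```python
# A minimal convolutional autoencoder sketch: 256x256x3 -> 32x32x16 -> 256x256x3.
# Layer and channel sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Three stride-2 convs: 256 -> 128 -> 64 -> 32 spatially, 16 latent channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 16, 4, stride=2, padding=1),
        )
        # Mirror the encoder with transposed convs to get back to 256x256x3.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)              # (B, 16, 32, 32): the latent "fingerprint"
        return self.decoder(z), z

model = AutoEncoder()
x = torch.randn(1, 3, 256, 256)          # a fake RGB image
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction loss
print(z.shape, recon.shape, loss.item())
```

The latent tensor holds 32 * 32 * 16 = 16,384 values versus 196,608 in the original image, which is the roughly 8% compression mentioned above.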
There's an old saying in AI that compression is intelligence. By learning to compress millions of images, the autoencoder learns which properties of an image are most important to retain in the latent representation and which properties are high-frequency noise that can be omitted. What's more, the autoencoder is a completely self-supervised, or unsupervised, process: it does not require any manual labeling, and all of the features are extracted from the data itself.

Let's say you have an image of a tree. If you pass this tree through your encoder, you get a latent embedding; assume it has 100 dimensions. An embedding is basically a representation of your original image, and you can think of it as a point in a high-dimensional space. If you consider this as your 100-dimensional space, the embedding maps to a specific point in it. Now imagine you had a different image of a different tree. Its embedding is also a 100-dimensional vector. It won't be the same as the embedding of the original tree, but if you map it into the embedding space, you can expect it to land somewhere close to the original tree's embedding, because they are both trees. On the other hand, if you took a car, derived its embedding, and mapped it into the latent space, it would probably land far away from the trees. This is the effect of compression: similar images map close together in the latent space and different images map farther apart. It is similar to arranging books on a bookshelf, where we put books of the same genre on the same shelf so it's easier to remember where we kept them. This brings us to our next interesting question: if we know that in this area of the latent space we can find images of trees, can we randomly sample a new point in this space and pass it through our decoder to output a new image of a tree? Because if we can do that, we will effectively have created a generative model from an autoencoder: we would just randomly generate a point in latent space and use that randomly generated point to produce a brand new image.

This poses a question: if we can convert images into fingerprints using the encoder, can we also use the decoder to convert randomly generated fingerprints into randomly generated new images? It turns out yes, we can. Techniques like the variational autoencoder do exactly that, turning plain autoencoders into generative image models. Without getting into specifics, they train autoencoders so that later on we can generate images by passing in randomly generated latent embeddings. The benefits of autoencoders do not stop at compression, embeddings, and generating new images. If you have seen my video on latent space exploration, you might remember that we can also do other cool stuff, like finding similar images in a database, interpolating between two images by slowly adjusting the latent vectors, learning about the most dominant trends and biases in our data, and even manipulating images, like adding sunglasses or making someone smile.

Now, we have managed to compress our image by about 92 percent by representing it with a 32x32x16 tensor, but that is still 16,384 unique values representing one image. The problem is that the latent embeddings are continuous, meaning each value in the fingerprint can take any real number. This makes them intractable for an LLM, because an LLM cannot output continuous latent embeddings. LLMs are good at outputting a sequence of tokens, so we need to discretize, or quantize, our latent space and create a new vocabulary of frequently occurring symbols that expresses the contents of the image. This is where the final evolution of the autoencoder that we're going to talk about today, the vector quantized variational autoencoder, or VQ-VAE, comes in.

For most people, this might be the most technically advanced concept discussed in this video, so pay attention. The VQ-VAE makes two major changes to the autoencoder architecture. First, it separately trains a discrete set of embedding vectors that form the vocabulary of our new image-based language. This list is called the codebook, and it consists of a set of learnable discrete embeddings called code words. Let's say we have eight code words, each of dimension 16; in practice these numbers are much larger than 8 and 16, but it's good for an illustration. Imagine these code word embeddings as red dots in a vector space. Second, when the encoder outputs its latent tensor, each of its vectors gets snapped to the nearest code word. Take the top-right vector and map it into the embedding space: we find the closest red code word embedding to it, and in this case the third embedding is the closest, so we replace the entire top-right vector with the integer index 3. Maybe for the next vector in the tensor the nearest codebook embedding is index 2, so we replace that with 2. We repeat this over the whole feature map, replacing each vector with the integer index of the code word nearest to it. The resultant 32x32 integer map is what gets passed to the decoder. The decoder first converts this map back into the 32x32x16 latent space by fetching the corresponding vectors from the codebook, and then proceeds to generate the new image. That's it; that is the entire VQ-VAE forward pass, and a small sketch of the quantization step follows below. The main thing the VQ-VAE achieves is to clamp the encoder's output so that it only contains vectors from the codebook, our image vocabulary.
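Here is a rough sketch of that nearest-code-word quantization step, using the toy codebook of 8 code words of dimension 16 from the example. A real VQ-VAE also needs a straight-through gradient estimator and codebook/commitment losses to train, which are omitted here.

```python
# Sketch of vector quantization: replace each 16-dim latent vector with the
# index of its nearest codebook entry, then look the vectors back up for the decoder.
import torch

num_codewords, dim = 8, 16                     # toy sizes from the example above
codebook = torch.randn(num_codewords, dim)     # learnable parameters in a real VQ-VAE

z = torch.randn(32, 32, dim)                   # pretend encoder output
flat = z.reshape(-1, dim)                      # (1024, 16)

dists = torch.cdist(flat, codebook)            # L2 distance to every code word, (1024, 8)
indices = dists.argmin(dim=1)                  # nearest code word per latent vector
code_grid = indices.reshape(32, 32)            # the 32x32 map of integer symbols

# The decoder first looks the indices back up in the codebook...
z_quantized = codebook[indices].reshape(32, 32, dim)
print(code_grid.shape, z_quantized.shape)      # torch.Size([32, 32]) torch.Size([32, 32, 16])
```

The 32x32 integer map is the "sentence" of image symbols that the rest of the video treats as a language.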
So, in a sense, coming back to our 2D latent space example where every codebook embedding has its own position in the space: as the code words train, they shift around inside this latent space, optimizing themselves to best capture the different semantic information in the input dataset. That's why it's called vector quantization: each of these code word embeddings controls a different portion of the latent space, quantizing, or discretizing, the continuous latent space into a set of bins characterized by the code words themselves. You can see this behavior in the Voronoi diagram right here, how the code word embeddings shift during training to control different territories in the embedding space; they're quite trippy to look at. Intuitively, these code words capture different semantic meanings inside the image, and the grid captures how those semantic meanings are spatially arranged in the original image. In other words, if we generate random grids and pass them through the decoder, it should be able to generate new images according to the spatial and semantic code we input to it. And because we still have the encoder-decoder architecture, we can convert back and forth between images and their codified representations. We have replaced the continuous fingerprints with a sequence of symbols, like we wanted.

There are also other works that derive from the VQ-VAE, like the vector quantized GAN (VQ-GAN), which replaces the VAE architecture with a GAN, a generative adversarial network, to train this codebook. Gemini cites two papers in their network architecture section: the original DALL-E paper from OpenAI, which used a VQ-VAE, and Google's own Parti model, which used a VQ-GAN. So there's a good chance that Gemini also uses a VQ-GAN. To me, that's more of an implementation decision and architectural design choice; they are both means to the same end. Hopefully I've given you some intuition about how VQ-VAEs work; feel free to read up on how VQ-GANs work, which are similar, just a GAN variant of the VAE architecture.

So now we have arrived at a point where a single image can be represented as a sequence of discrete codes. Deep learning research has taught us that if you throw a large amount of data at a huge neural network and train it for long enough, you're going to get good results. For perspective on how huge these models can be, OpenAI's DALL-E from 2021 trained 8,192 codebook tokens with a 32x32 feature map, and in 2023 that number can easily be doubled by a corporation like Google. A 32x32 grid where each cell can take one of 8,192 values means that, at least theoretically, the decoder can generate up to 8,192^1,024, roughly 2^13,312, unique images. It's just massive. The hard work is now done, and it's time to reap its fruits: the next part of this video discusses how we can use the VQ-VAE to train an LLM to generate images.

Let's say we want the neural network to learn the following sequence: "roses are red", then an image of a rose. The text parts of the sequence are pretty simple to encode; they are drawn from the LLM's text vocabulary and then embedded using the word embeddings plus positional embeddings. The images, however, first go through our VQ-VAE encoder to get their coded sequence. These image tokens are then encoded directly using the codebook embeddings, and their positional encodings are added to them. Now we have one sequence of embeddings, and we can train the model using next-token prediction, as sketched below. The model automatically learns in this unified space that contains both the word embeddings and the image codebook embeddings.
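As a hedged sketch of what that unified training sequence might look like: text tokens come from the LLM's text vocabulary, image tokens from the VQ-VAE codebook shifted into their own ID range, and the model is trained with ordinary next-token cross-entropy. The vocabulary sizes, the tiny transformer layer, and the shared embedding table are illustrative assumptions, not Gemini's actual setup.

```python
# Illustrative sketch: build one training sequence mixing text tokens and
# VQ-VAE image codes, then compute a standard next-token prediction loss.
import torch
import torch.nn as nn

text_vocab = 32_000                      # assumed text vocabulary size
image_vocab = 8_192                      # VQ-VAE codebook size, as in DALL-E
vocab = text_vocab + image_vocab         # image codes get their own ID range

text_tokens = torch.randint(0, text_vocab, (3,))        # stand-in for "roses are red"
image_codes = torch.randint(0, image_vocab, (1024,))    # a 32x32 grid of VQ-VAE codes
sequence = torch.cat([text_tokens, image_codes + text_vocab]).unsqueeze(0)

embed = nn.Embedding(vocab, 256)         # one embedding table over both "languages"
pos = nn.Embedding(sequence.shape[1], 256)
block = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)  # stand-in LLM
lm_head = nn.Linear(256, vocab)

x = embed(sequence) + pos(torch.arange(sequence.shape[1]))[None]
mask = nn.Transformer.generate_square_subsequent_mask(sequence.shape[1])
logits = lm_head(block(x, src_mask=mask))

# Next-token prediction: each position predicts the token that follows it.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab), sequence[:, 1:].reshape(-1)
)
print(loss.item())
```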
During inference, we can input "roses are red" to the model, and it will start generating the image codes one by one. Once we have the full list of image tokens needed to fill our 32x32 grid, we input that grid to the decoder and generate the rose image. Also note that this method can easily be augmented to support video inputs. For example, the Gemini paper writes that video understanding is accomplished by encoding the video as a sequence of frames: basically, input the video as a sequence of images, each image with its own image tokens, with presumably a separator before the next frame.

In the Gemini paper, the folks at DeepMind trained on millions, perhaps billions, of examples of raw multimodal data acquired from the web. They apply a lot of quality filters to all of the datasets, using both heuristic rules and model-based classifiers, and perform a lot of safety filtering to remove harmful content they don't want the model to see during training. After pre-training Gemini on next-token prediction, they evaluate the models manually and generate new data for instruction tuning with supervised fine-tuning and reinforcement learning from human feedback. They write that data quality is more important than data quantity when it comes to instruction tuning, especially for large language models.

Multimodal LLMs, and especially those that can generate images too, are a huge step towards better and more useful AI. If you want to learn more about the history of multimodal models, I have a video that goes over all of the basics, building from first principles up to the modern LLM-based multimodal models we have today, so feel free to check that one out. In this video we learned how, using the VQ-VAE, we are able to train a language of discrete image tokens that can be used to train LLMs and generate images in conjunction with text. Researchers have found a way that works well, and now the main things between us and good AI are high-quality data, high-quality supervision, some guardrailing, and some transparency and honesty from those who develop these LLMs. Thanks for watching. You're magnificent. Don't forget to subscribe, because you're going to love the next video. Bye. Cool, that's cool. I think I did it too. I think I finished it.
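To round out the captions above, here is a sketch of that inference loop: the LLM samples 1,024 image codes one at a time, and the VQ-VAE decoder turns the resulting 32x32 grid back into pixels. The llm, codebook, and decoder objects are hypothetical stand-ins for models trained as described, and the shared-vocabulary offset mirrors the training sketch above.

```python
# Hypothetical inference loop: prompt with text, sample image codes, decode to pixels.
import torch

@torch.no_grad()
def generate_image(llm, decoder, codebook, prompt_ids, text_vocab=32_000):
    """Autoregressively sample 32x32 image codes, then decode them to an image."""
    tokens = prompt_ids                                  # e.g. the ids for "roses are red"
    for _ in range(32 * 32):                             # one code per grid cell
        logits = llm(tokens).logits[:, -1]               # next-token distribution
        image_logits = logits[:, text_vocab:]            # restrict to the image-code range
        next_code = torch.multinomial(image_logits.softmax(dim=-1), 1)
        tokens = torch.cat([tokens, next_code + text_vocab], dim=1)

    codes = (tokens[:, -32 * 32:] - text_vocab).reshape(1, 32, 32)
    z = codebook[codes].permute(0, 3, 1, 2)              # (1, 16, 32, 32) latent grid
    return decoder(z)                                    # decoded RGB image
```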
Info
Channel: Neural Breakdown with AVB
Views: 3,848
Keywords: machine learning, ai, deep learning
Id: EzDsrEvdgNQ
Length: 17min 36sec (1056 seconds)
Published: Wed Dec 20 2023