The Platonic Representation Hypothesis

Captions
PHILLIP ISOLA: Thank you, [? Sherry, ?] yes. OK, so, yeah, thanks for letting me take the time, despite the fact that I'm, I guess, inviting myself. But I get to share a little bit of my work, too. So this is work that we are publishing at ICML. It's a position paper, so it's a little different than some of the talks I would normally give. It's a little bit more opinionated. But I hope that this is a good audience for that, to get some interesting feedback and discussion going. This is work with Jacob, or Minyoung, and Tongzhou, who are two of my students. They're actually both here. Is Jacob here somewhere? Yeah, in the back, and Tongzhou is here. And then additionally, Brian Cheung was another author on this work. And this is "The Platonic Representation Hypothesis." So it actually is teed up perfectly from [? Shimon's ?] last question last night, which is, why are we seeing different methods converge on similar representations? Why is it that somebody born blind can learn a similar representation of the world to somebody that is seeing? And here is going to be some of my thoughts on that question. But we didn't plan that. I didn't coordinate with [? Shimon ?] on this. So one of my favorite papers from the last 10 years is this one. This is some work from [? Agita ?] and Antonio and others. And what they found is that if you train a deep net to classify a scene, then what happens is you get intermediate neurons, labeled A and B, that if you evaluate them on a bunch of other images, it looks like these neurons are acting as object detectors. So object detectors emerge as an intermediate representation for solving the scene recognition problem, which makes sense. This is such a cool paper because it's one of the first times that I think we saw that these are not just black boxes. They actually have some kind of interpretable structure inside. So what you're seeing on the right here is that there is a neuron on some layer of a scene detection network, a scene classification network, that will fire whenever it sees a dog face. So it's like, to decide if this is outdoors, I might be looking for dog faces. There's another neuron that fires whenever it sees a robin. And of course, Antonio showed you more of this the other day. There's more recent results like that. OK, so we found this really interesting. And we did some work a little bit later with Alyosha and Richard Zhang, in which we were solving a completely different problem, which is image colorization. So take a black-and-white photo and try to predict the colors. And you can ask the same question. What are the internal units sensitive to? What do the neurons react to? And you can think for yourself. A lot of you all have seen work like this before. It should be some features about color, some low-level texture stuff, but what is it going to be? What is this neuron A going to be sensitive to? Dog faces. OK, you learn a detector for dog faces, whether you train it to do scene recognition or colorization, two very different problems. And you also get flower detectors. So you get a lot of different units that detect different types of things that might be nameable, might be semantic. So this is a story that you've seen before. I think many of you have. This is textbook knowledge by now. These are figures from our textbook. So I can-- I get to define what is textbook knowledge. OK, they really are. You can look at page 300 something.
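To make the "look at what an intermediate unit responds to" procedure concrete, here is a minimal sketch in the spirit of that kind of analysis. It is not the original Network Dissection pipeline; the model, layer choice, and channel index are all placeholder assumptions, and it assumes a recent torchvision.

```python
# Sketch: rank images by how strongly they drive one intermediate channel of a
# pretrained network. If the top-ranked images share something nameable (dog faces,
# flowers, ...), the unit behaves like a detector for that concept.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

acts = {}
def hook(_module, _inputs, output):
    # record the activations of one intermediate layer on each forward pass
    acts["feat"] = output.detach()

model.layer3.register_forward_hook(hook)

def channel_score(img_path: str, channel: int = 42) -> float:
    """Mean spatial activation of `channel` for one image (higher = unit fires more)."""
    x = preprocess(Image.open(img_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        model(x)
    return acts["feat"][0, channel].mean().item()

# Usage sketch over a folder of image paths:
#   top_images = sorted(paths, key=channel_score, reverse=True)[:16]
```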
But it's led me-- and I think many people have had similar-- have stated similar hypotheses. But it's led me to this basic hypothesis, that different neural networks that are trained in very different ways, with very different architectures seem to be converging to a similar or-- the strong version of the hypothesis is they're converging to the same way of representing the world. Maybe that's a hypothetical endpoint. It'll be the same, and they're not quite there yet. So let's unpack this a little bit. Oh, OK, so let's unpack this a little bit. But I know what you're thinking. I know at least one of you is thinking. Where's-- Alyosha is here? OK, maybe not. Well, I can use my model of Alyosha. I know what he's thinking. Yeah, of course, they converge to the same thing because it's all about the data. That scene recognition system was trained on ImageNet. And-- that's actually not true. It was trained on a different data set. But it's trained on a data set with a lot of photos of dogs. And the colorization network was trained on ImageNet, which is a lot of photos of dogs. Of course, they both learned similar detectors. And that's not what I'm going to say. That's not what I'm going to argue. So it could be that there's something else common between these different systems. It could be the architecture. Dan talked a little bit about this a few days ago. It could be that it is the optimization process. Andrew was talking about some optimization properties that might lead to convergence. It could be the people. Maybe-- we're all talking. We're all sharing ideas. We're all going to converge sociologically to the same types of ways of representing the world. But what I'm going to argue, essentially, is that it's none of these. So what's left? What is common between all these systems? It's the world. These systems are all trained on this same universe, this same reality, this same Earth. And it's kind of similar to it's all about the data. But it's that the data is an intermediary to the world. And you could have data that's superficially very different. But if it's still data that samples from a similar world, then it will lead to similar representations. So that's going to be the rough argument. OK, so outline of the talk will be, first, I will share some evidence of convergence between different models trained in different ways. Then I will talk a little bit about what we might be converging to. Is there an endpoint to this? And finally, I'll talk about limitations and implications of this idea. And again, there's a lot of interesting implications and a lot of really important limitations. So I hope that sparks debate. OK, so evidence of convergence. OK, I already showed you that scene classifiers and colorization networks learn some similar units. There's a lot more papers along those lines. Maybe the first and most famous examples are results like this from Hubel and Wiesel found that there are Gabor-like or oriented line detectors inside the cat cortex. And Bruno and others have shown that there are simple statistical models that also result in these Gabor filters being the natural way of processing visual data. And of course, we also have seen that in neural networks, the first layer, you always get these edge detectors, these Gabor-like filters. These are the filters on the first layer of AlexNet. So that's some kind of commonality. These-- different systems, brains, cats, and so forth, are converging to similar first-layer representation. 
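For readers who want to look at those first-layer filters themselves, here is a small sketch, assuming a torchvision-style pretrained AlexNet is available; the plotting details are incidental.

```python
# Sketch: display the 64 first-layer (conv1) filters of a pretrained AlexNet.
# These tend to look like oriented, Gabor-like edge filters and color blobs.
import torchvision.models as models
import matplotlib.pyplot as plt

alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
w = alexnet.features[0].weight.data.clone()      # shape: (64, 3, 11, 11)
w = (w - w.min()) / (w.max() - w.min())          # rescale to [0, 1] for display

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(w[i].permute(1, 2, 0).numpy())     # channels last for imshow
    ax.axis("off")
plt.suptitle("AlexNet conv1 filters")
plt.show()
```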
And over the years, people have built that up and looked at second layer and third layer and seen commonalities. Now, I'll jump forward to a very recent version of that idea, which is work from Amil. Amil here? OK, Amil is over there-- Amil and Alyosha on what they called Rosetta neurons. And this is super cool. This is showing in each column a different neural network, a different vision network or graphics network. And the networks are processing this image of the cat. And in each row, they find-- they're highlighting a neuron, which seems to be shared between all of these networks. So in StyleGAN, there exists a neuron that fires for the Santa hat. And in ResNet50, there also exists a neuron that selectively fires for the Santa hat. So it's a filter. So that's why we're seeing a heat map. And that heat map is high for the same region. And these networks are not trained together. These were entirely different neural networks trained on different data. So they called them Rosetta neurons because it's like a common language that's discovered across these different models. And they were a little more conservative than we are. So they said they found 20 or so, and it doesn't explain all the variants. But I want to say that this is like the start of some kind of convergence, which may go further. So there's a lot more evidence along these lines. But I want to get to some of our new results and new experiments, where we looked at this in a little bit more detail. So we have some level of similarity between different neural networks, different architectures, different data sets. And I think that's something that many people in the field have remarked on over the last decade. But for this to be actually converging to some optimal or platonic representation, we've got to show that the level of similarity is increasing over time or increasing over performance or scale in some sense. So that's going to be the next question. Is this increasing? And if so, that suggests that there is a convergent trend going on. So I want to now set up a little bit of notation to formalize what I mean by representation and what I mean by representational alignment and convergence to get us on the same page. So I'm not saying that all models are converging in all senses. I'm actually meaning it in a fairly particular sense. So what I mean is we're going to restrict our attention to representations that are vector embeddings, so mappings from some data like images to a vector. It's not everything, but this is a very general class of representations. And we're going to characterize a representation only in terms of its kernel. So this is actually a very common way of characterizing a representation. This tells me, how does the representation measure distance between different data points? So the kernel of a vision system evaluated over this set of images will look like a matrix. It will be saying, my representation of this person's face is similar to my representation of that person's face and very different from my representation of this house. So the kernel is an inner product between the embeddings between one image and another image or one data point and another data point, evaluated over all the pairs of data points. So kernels are really important in understanding representations. Kernel methods, kernel machines make use of this structure for learning. In neuroscience, there's a lot of literature on representational dissimilarity matrices. That's a kernel. This is from a neuroscience paper. 
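A minimal sketch of the kernel being described, with a placeholder encoder f; this is an illustration of the definition, not the paper's exact code.

```python
# Sketch: a representation characterized by its kernel K[i, j] = <f(x_i), f(x_j)>,
# i.e. how the representation measures similarity between pairs of data points
# (the same object as a representational (dis)similarity matrix in neuroscience).
import torch
import torch.nn.functional as F

def kernel_matrix(embeddings: torch.Tensor, normalize: bool = True) -> torch.Tensor:
    """embeddings: (n, d) matrix of vector representations, one row per data point.
    Returns the (n, n) matrix of pairwise inner products (cosine similarities if
    `normalize` is True)."""
    if normalize:
        embeddings = F.normalize(embeddings, dim=-1)
    return embeddings @ embeddings.T

# Usage sketch, with a hypothetical encoder f and a batch of inputs x:
#   K_vision = kernel_matrix(f(x))
```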
So kernels are a fundamental structure for understanding the properties of a representation. And that's all we're going to talk about today. So in order to know if two representations are the same in the way that they represent distance, or the same in their kernel structure, we need a kernel alignment metric. And that just means I'm going to take the kernel for one representation and look at the distance between that kernel and the kernel for another representation. And that's my measure of how similar the two representations are. There's a lot of kernel alignment metrics. I'm not going to go into the details of them. But just think of it as you create this kernel matrix from one neural net. You create it for another neural net, evaluate it on the same data points, and you somehow measure the distance. So let's do an experiment now. We're first going to look at whether different vision models, different vision neural networks are increasing in their kernel alignment over time, or in particular, over performance. As they get better, which also as the years pass they get better, are they becoming more similar in terms of their kernels, how they represent the world in their internal activations? So two hypotheses we can put out here. So one is that, no, they're not. There are many different ways that you can represent the world that are all good. There's not just one way of solving the problem of vision. That's one hypothesis. And two is, actually, too bad, no. All strong models are alike. This is the Anna Karenina setting. That's a term that was coined by Bansal, et al., so it's not our own term. But I really like it. It's the idea that it could be the case that all strong representations are somehow alike. They have something in common because all happy families are alike. It's the same idea that there's a lot of ways you could go wrong. But if you are right in all properties, then that's a set of constraints that forces you to be alike. Dan Yamins and I think Rosa have called this the contravariance principle. So it's also quite related to that. There's a lot of people in the audience that have studied this. So it's just great to be here. So let's run our own experiment on that. We're going to take 78 vision models. What I want to emphasize here is that these models are all trained in different ways. So some are ResNets. Some are transformers. Some are trained on ImageNet. Some are self-supervised systems. Some are trained on other data sets. So different training data, different objectives, different architectures. And I'm going to bucket these different vision systems, these different visual representations by their performance on this benchmark of general visual competence. This is the VTAB benchmark. So this is our proxy for, are you just a good general-purpose visual representation? And so this is used to evaluate whether I have learned good features in computer vision. So on the x-axis is going to be different bins of performance. So the first bin will be visual representations that don't solve many VTAB tasks. And the last bin will be visual representations that do solve a lot of VTAB tasks. And the y-axis is going to be the kernel alignment, the average kernel alignment between items in each bin. So here's the result. So systems that are really good at a lot of different tasks are all quite similar. And systems that are only good at one task or not very good at all are all quite dissimilar. So it kind of makes sense. You might be thinking, yeah, of course.
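One concrete metric in this family, and the flavor referred to later in the talk (the fraction of nearest neighbors that two kernels share), might look like the following sketch; it is not the exact implementation behind these plots.

```python
# Sketch: mutual nearest-neighbor alignment between two kernels computed over the
# SAME n data points. For each point, compare its k nearest neighbors under kernel 1
# with its k nearest neighbors under kernel 2; 1.0 means full agreement on local
# structure, ~0 means the two representations organize the data very differently.
import torch

def mutual_knn_alignment(K1: torch.Tensor, K2: torch.Tensor, k: int = 10) -> float:
    n = K1.shape[0]
    eye = torch.eye(n, dtype=torch.bool)
    # exclude self-similarity so a point is never its own neighbor
    K1 = K1.masked_fill(eye, float("-inf"))
    K2 = K2.masked_fill(eye, float("-inf"))
    nn1 = K1.topk(k, dim=-1).indices          # (n, k) neighbor indices under model 1
    nn2 = K2.topk(k, dim=-1).indices          # (n, k) neighbor indices under model 2
    shared = sum(len(set(a.tolist()) & set(b.tolist())) for a, b in zip(nn1, nn2))
    return shared / (n * k)

# Usage sketch, given two kernel matrices built as above:
#   score = mutual_knn_alignment(kernel_matrix(f_a(x)), kernel_matrix(f_b(x)))
```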
If a system-- if two systems are good at the same set of things, then of course, they must have similar internal structures, similar representations. I don't think it has to be the case, but it's reasonable to suppose that. You can construct worlds in which that wouldn't be true. But it's not too surprising. And it's the Anna Karenina scenario. All strong representations in these experiments are alike. So here's a t-SNE plot or a UMAP. It's similar to a UMAP plot of that result. Again, here are different vision networks. And notice that regardless of whether you're contrastive or your CLIP or your classification, different architectures, different objectives, the-- what's causing two representations to be similar is their performance, not their architecture. It could have been that all of the contrastive methods cluster together separately from all of the non-contrastive methods. But that wasn't the case. Performance is what is dominating the clustering of these representations. So I think that a lot of people would have expected that. As vision systems become more general purpose, stronger at more tasks, they become more aligned in how they represent the world. The next experiment is going to be-- well, to me, it was a little more surprising. So now we're going to ask, is the same happening between two modalities? So as language models get bigger and better, do they become more and more alike to vision models? This is a little bit weird, right? So a few hypotheses. Hypothesis one is that, no. If you do better and better at next-token prediction, next-word prediction, you're going to become really good at language, at syntax, at low-level, superficial properties of language. And you're going to probably not be something that's generally useful for other domains. It's just you're going to be a super specialist on language. Hypothesis two, maybe not. Better language models are just better intelligent representations of the world. And they're also better vision models. I'll tell you exactly what-- how we measure that. And maybe the strong form of hypothesis two is, the best vision model is the best language model. This is going to be too strong. But we'll put that there for, yeah, maybe. OK, so I have to tell you how we can measure whether or not a vision model represents the world in a similar way to a language model. Again, we're going to use kernel alignment. But it's cross-modal kernel alignment. So here are some images. And this box is representation space. And I'm imagining a neural network in which the apple and orange, according to the vision system, have a similar representation. They're nearby in representation space. And the apple and the elephant are far apart. And we can also embed the corresponding words for those items into representation space of a language model. And what we're asking is whether the similarity in the language representation matches the similarity in the visual representation for the corresponding image that matches that text. So if the representations are converging, we'd say that the similarity according to a language model of the word "apple" and "orange" is roughly the same as the similarity according to a vision model of an image of the apple and an image of the orange. So we have to have paired data to evaluate this. The models in this section are all trained without paired data. So they are vision models trained only on images. And we're going to measure their similarity in representation space to language models trained only on language. 
But we're going to evaluate the kernel alignment using paired data to be able to ask, does the vision system embed these two photos of Yosemite close to each other in a way that-- and does the language model also embed these two sentences that are captions about Yosemite close to each other? So here's the main result. Here's the experiment. We took 11 language models, 5 vision models, these vision transformers. We measure on the x-axis the performance of the language model at language modeling, so the performance of the language model at next-word prediction. And we measure on the y-axis the kernel alignment between the language model at each of these points and a vision model, DINOv2, which is trained self-supervised only on images, no language used at all. OK, so here's the result. So as a language model, like Llama, for example, up here becomes better and better at next-word prediction, its kernel becomes more and more alike to the DINO kernel. And as DINO-- the different colors are different sizes of DINO. So the biggest DINO model is the most aligned with the language models. It goes both ways. Bigger, better vision models have more and more similar kernels to bigger, better language models. So we have some metric I didn't describe fully. We only got up to 0.6 on that metric. But the point is the trend. The trend is going up. It might [INAUDIBLE] off. We'll see what happens. We did this for a bunch of different language models, a bunch of different vision models. One thing I want to point out is, OK, I actually lied to you. I said that we were only looking at pure language and pure vision models. We did look at one VLM, one model clip that's trained to align images and text. And you would expect that a model trained to align images and text will end up with a similar kernel because it's trained to have the same kernel between the vision encoder and the text encoder. But CLIP is not actually-- it's only marginally more aligned with Llama, with a language model, with language models in general, than DINO is aligned with language models. So DINO has almost the same alignment, in this kernel-alignment sense, with language models as CLIP does, despite that CLIP is trained to be aligned with language models, which is interesting. Was there a question or no? AUDIENCE: [INAUDIBLE] PHILLIP ISOLA: Yes, Bill. AUDIENCE: Do you think this would work for audition? PHILLIP ISOLA: Yes. [INAUDIBLE] AUDIENCE: [INAUDIBLE] sound the way as an apple sound the same? PHILLIP ISOLA: Yes, maybe. The hypothesis is that, yes, it will work. Me personally, I don't know. But the hypothesis doesn't represent my exact belief. It's like we're stating a hypothesis. But, yes, the hypothesis is it will. And I'd love to hear Josh's thoughts on that at some point. Yeah, Dan. AUDIENCE: [INAUDIBLE] brief clarification. The reason you're saying that CLIP is only marginally better is because it's at 0.2 versus the thing from the previous slide? Does that have [INAUDIBLE]? PHILLIP ISOLA: Well, 0.6. Yeah, that was the point, yeah? AUDIENCE: OK. PHILLIP ISOLA: Numbers are still kind of low. So who knows what will happen as they go up to 1. Blake. AUDIENCE: But to push on that a bit-- and I guess this gets to what you were saying, though. 
Given what you said about weaker models or more specialized models, if you weren't training on this very general next-token prediction thing, but say you're doing something really specific, like you trained your language model just to always highlight the word "dog" for me, I imagine this wouldn't hold. PHILLIP ISOLA: I think you're right. AUDIENCE: [INAUDIBLE] highlighting "dog" would not correlate with its match to visual models. PHILLIP ISOLA: Yeah, I think you're right. I'm not going to talk about that explanation. That is something we talk about in the paper. We call it the multitask hypothesis. Train on more tasks, get more convergence. But it's also basically the same as the contravariance principle. So Dan and Rosa have already articulated that idea. But I think it's-- I think that's important. OK, but let me go on. We have a few reasons why we think this might be happening. I'm happy to discuss more offline. I want to talk about where all of this might be heading. And this is the most hypothesisy part of this hypothesis, I suppose, is this is not proven. But the picture that I have in mind, that we have in mind is something like Plato's cave. So I think most of you probably know the allegory of the cave, this idea that there's prisoners in the cave whose only experience of the outside world is the shadows on the cave wall. But they somehow infer that there is a world out there. And Plato made that as an allegory about our own experience. We only experience data. We don't actually have any access to true physical state. And he says, maybe metaphysically, there isn't even a true state. There's just ideal latent variables, ideal forms behind it all. I'm not going to make any kind of metaphysical argument. It's just an analogy. But the picture that we have in mind is that, yeah, there is a world out there. There is some data-generating process, some causal variable z. And you can observe that world in different ways, multi-view learning. You can look at it through images. You can caption the images. You can potentially get to that same sentence via a different set of sensors, maybe via touch or a different camera from a different angle. So there's a lot of different ways of viewing the world. But if I learn a representation of any of these ways of viewing the world, well, because they're generated by the same causal process by the same world behind it all, they should somehow become alike. That's the basic idea for what we have, for why we think ultimately there is this convergence. It's a very general idea, an idea that's been stated many times in various ways. But I think it's still a powerful idea to investigate. OK, so I want to-- so that's the general idea. Now, I want to tell you about one particular toy mathematical model in which you will expect to get this type of convergence. But I'm going to emphasize, this is just one particular mathematical formalization of this idea. I think the idea could be explored much more broadly. So the mathematical model that we have that would exhibit this type of convergence between an image embedding and a language embedding goes as follows. You have a world that consists of discrete events Z. They're sampled from some unknown distribution P of Z. All observations, all learning signal is mediated via observation functions, which we are going to assume are bijective functions. That's a huge simplification. So they contain all the information in the observation function that is contained in the latent variable Z. 
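A tiny numerical rendering of this toy world, anticipating the contrastive-learning result described next: discrete events, bijective observation functions (modeled here simply as relabelings, which is an illustrative assumption), and the pointwise mutual information kernel that an idealized contrastive learner would recover.

```python
# Sketch: in a discrete world with bijective observations, the PMI kernel over
# observations carries the same structure as the PMI kernel over the underlying events,
# so different "modalities" end up sharing one kernel.
import numpy as np

rng = np.random.default_rng(0)
n_events = 5

joint = rng.random((n_events, n_events))
joint = joint + joint.T                      # symmetric co-occurrence "counts"
joint /= joint.sum()                         # joint distribution P(z_a, z_b)
p = joint.sum(axis=1)                        # marginal P(z)

# Pointwise mutual information kernel over events: log P(z_a, z_b) / (P(z_a) P(z_b))
pmi = np.log(joint / np.outer(p, p))

# Two bijective observation functions, modeled as two different relabelings of events
# (stand-ins for "render as an image" vs. "render as a caption").
perm_vision = rng.permutation(n_events)
perm_text = rng.permutation(n_events)
pmi_vision = pmi[np.ix_(perm_vision, perm_vision)]
pmi_text = pmi[np.ix_(perm_text, perm_text)]

# Each observed kernel is just the event-level PMI with rows/columns relabeled,
# so both modalities encode the same similarity structure.
assert np.allclose(np.sort(pmi_vision, axis=None), np.sort(pmi, axis=None))
assert np.allclose(np.sort(pmi_text, axis=None), np.sort(pmi, axis=None))
```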
And in this world, we're going to model co-occurrences. We're going to use a contrastive learning method that will basically say, two things that co-occur are a positive pair. Two things that don't co-occur are a negative pair. Align the positives. Align two things that are co-occurring. Push apart two things that are not co-occurring. And this is the standard setup for contrastive learning. So this is not far off from the types of learners that are popular. Like contrastive language models might try to align two co-occurring words. And contrastive image models might try to align two co-occurring image patches. So in that particular world, with discrete random variables and bijective observation functions, if you train a contrastive learner with a noise contrastive estimation objective, you can prove-- and this is not a new result, but you can show that the-- this will converge to the pointwise mutual information function. So the intuition for that is that what contrastive learners are trying to do is they're trying to classify between positives, co-occurring items, and negatives, non-co-occurring items, and trying to learn an embedding of the data in which the inner product, or the similarity between the embedding vectors, is proportional to that probability ratio. And that probability ratio is just how much more often do the two items co-occur together divided by the product of the marginals. How often would you expect, by chance, them to co-occur? And so basically, if I learn an embedding f, such that its inner product with other items is equal to this ratio, then the kernel that it will arrive at, the way that this representation measures similarity is going to be this pointwise mutual information function, the joint probability divided by the chance rate of those things co-occurring. This PMI function has been a favorite of mine for a long time. So I'm just-- I'm always trying to sneak it in. But the rough picture is that contrastive learning boils down to finding an embedding in which similarity equals co-occurrence rate or pointwise mutual information, which is like normalized co-occurrence rate. So this is a little bit-- maybe if you don't like the Plato analogy, this is the Wittgenstein. And meaning is used-- meaning derives from the rate of co-occurrence. And so the reason why the apple and the orange are nearby in representation space is because those two things co-occur in kitchens. And elephants don't tend to co-occur with those items as much. And one of the interesting things about this model is that if these observation functions are bijective, then not only do you converge to the co-occurrence of the observations, you converge to the co-occurrence of the underlying events because bijective observation functions on discrete random variables preserve probability. And so all these things work out to be the same. So what that boils down to saying is that the different views that satisfy these properties will converge to the same kernel. So the language representation and the visual representation will converge. So a little bit of a mix of Plato and Wittgenstein. Maybe I should have chosen more recent researchers. But I went with those. So that's just one mathematical model of what might be going on here. I think I'll skip that small example. And I want to talk about limitations and implications because I think those are maybe the most interesting to get into. So again, I think I know what a lot of you might be thinking. Hold on. An image is not equivalent to a sentence. 
Again, if I go and see the total solar eclipse a month ago, that experience, I just can't describe it. It's ineffable, right? There's no words to describe it. Or if I am writing an essay and I talk about this concept of free speech, this is an abstract concept. There's no image that captures that concept. So I think this is a really valid criticism. Different modalities aren't actually all bijective with some underlying representation. They might have different-- fundamentally different information. So I don't quite know how to fully resolve this. The empirical evidence suggests we're seeing some convergence, despite that this might be true. But this is a real limitation. Mathematically, in these cases, we don't have that-- we don't satisfy that mathematical model I gave you. The observation function is lossy or abstract or partial. It's not a complete representation of the underlying world. But despite this, we do see some interesting convergence. I think one example is that CLIP is trained to reduce image representations to just being captions. It's trained to align visual representations with language representations. And yet we love it in computer vision. It works really, really well, despite that it's trained to throw away everything about vision other than language. OK, so maybe language is actually closer to a complete representation of what we care about in vision than we thought. But it is a limitation. Another limitation-- oh, actually, probing that limitation a little bit more, so one implication is that the more lossy and incomplete is your observation in language, the more it might not match what is in the image. Because a single word is not going to fully describe an image. And a caption might only partially describe an image. But we did an experiment, which is, well, what about 1,000 words? So we varied the number of words in captions and measured the alignment between the captions and the visual data. We only went up to 30 words, not 1,000 words. But you can see the trend goes up. So the kernel alignment between rich captions with the corresponding images is higher than the kernel alignment between just a single word or a very partial caption. So it kind of makes sense. It's kind of consistent with this idea that as you get closer and closer to complete bijective observations, you will get better and better alignment with the vision modality. Imperfect alignment-- so I told you that on our metrics, we haven't explained all the variance by any means. There's a lot more variance left to explain. So this is an open challenge. And one other thing for the people who are deeper into this field of representational alignment is, technically, we're not seeing global structure alignment. We're seeing local structure alignment. So if you're familiar with the CKA metric-- this is for the people that are really experts in this area-- we're actually not seeing increasing CKA alignment. This is a measure of global structure. We're seeing-- we only see this alignment when we look at local nearest neighbor structure. So this is a detail to the analysis. But I'm happy to talk more about that. [INAUDIBLE], yeah. AUDIENCE: Yeah, Phillip, appreciate the y-axis misalignment [INAUDIBLE]. Can you say what it is for a random model? What's the floor? What's the kernel of a randomly initialized model? [INAUDIBLE] PHILLIP ISOLA: Right. So for random model, I think it's roughly 0 on this metric. Tongzhou, is that right? TONGZHOU WANG: For a purely random process, not the [INAUDIBLE] network. 
PHILLIP ISOLA: No, random network, though. AUDIENCE: Random parameters, but up the data points, still encoding them as a kernel. TONGZHOU WANG: I don't have that number. But that's a purely random process. Without network [INAUDIBLE] bias. It's much, much lower [INAUDIBLE]. AUDIENCE: I think-- but-- PHILLIP ISOLA: Yeah, that's a good baseline. We should come back to that. Yeah, I don't know the exact number. I think it's quite a bit lower, though. So another thing that you might have in mind is, OK, yeah, I buy this convergence. That's-- empirically, that seems to be going on, but not because they're converging to some platonic reality. These are [? BS ?] machines. They're converging to just dumb models. OK, so valid. It could be maybe all these models are converging, but not to a good representation of the world, but to just being these superficial, stochastic parrots, maybe. So it could be that there are fundamental limitations. We're all using transformers. We're all doing next token prediction. And this is just flawed. That's an option, too. Maybe it's socio-technical biases. We all chat, and I tell you, yeah, transformers are amazing. They're converging. Then you go and use them, and it increases the convergence. So maybe don't take that lesson. OK, so there's a lot of limitations. There's some other ones we discuss in the paper. I think there's also really interesting implications. And I want to end with the implications. So one implication is that there's this complementarity between all of these different sensory modalities. And we've seen this a lot. We've talked a lot about this. Again, [? Shimon ?] was posing the question yesterday that if I want to train a vision model, well, this implies that I should be able to use language data, too, because the underlying kernel, the underlying structure, if it's really shared between them, then I can get there via multiple paths. And I might as well use all the paths available to me to increase the rate at which I get to that representation. And I think an interesting experiment could be it should be the case that to train a vision model, there's value to training it on a word. A word should be worth n pixels to a vision model. And a pixel should be worth m words to a language model. If I train-- I'm going to train a language model, I should train it on pixels, too. That should help my performance. And there's some evidence. People are starting to do that. There's some evidence of this. But I think it's only-- in LLM land, it's only a little bit explored. Most LLMs are only trained on language data. But this implies you should train them on other types of data, too. And they should get better at language modeling. And there's some evidence-- I mean, this is from the GPT-4v blog post. So it's not-- who knows if it's replicable. But they say that if you train GPT-4 jointly with vision model-- so GPT-4v has joint vision and language model-- you do better at pure language-reasoning tasks than if you don't use vision. So there's some transfer between the modalities. Another interesting implication is that it should somehow be relatively easy to translate or convert between different representational formats for different modalities if they're all converging to the same representation. So, for example, it should be relatively easy to translate between images and text if their representations are the same. When you train a representation of text and images and they converge to the same representation, it will act like a bridge. 
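A minimal sketch of what such a bridge might look like in practice: fitting a small ridge-regression map from one embedding space to the other using a handful of paired examples. The encoders and variable names are placeholders, not the method from the talk.

```python
# Sketch: if vision and language embeddings already share kernel structure, a simple
# linear map learned from a little paired data can act as the "bridge" between them.
import numpy as np

def fit_linear_bridge(V: np.ndarray, T: np.ndarray, ridge: float = 1e-3) -> np.ndarray:
    """V: (n, d_v) vision embeddings; T: (n, d_t) text embeddings of the SAME n items.
    Returns W of shape (d_v, d_t) minimizing ||V W - T||^2 + ridge * ||W||^2."""
    d_v = V.shape[1]
    return np.linalg.solve(V.T @ V + ridge * np.eye(d_v), V.T @ T)

def translate(v_embedding: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Map a vision embedding into the text embedding space via the learned bridge."""
    return v_embedding @ W

# Usage sketch: with a few hundred paired (image, caption) embeddings,
#   W = fit_linear_bridge(V_pairs, T_pairs)
# then caption retrieval for new images happens in the text space:
#   scores = translate(V_new, W) @ T_bank.T
```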
And maybe you'll only need a little bit of data to find the mapping from the visual representation to the text representation. And this is something I think that we also see in practice to some degree. There's a lot of success of unpaired translation, of translation between modalities. There's some interesting work from Sompolinsky, et al., or Sorscher, et al., showing that you can do unpaired translation between images and text to a certain degree. And the idea is basically the same, that if the representations that you get in each modality are the same, then you just have to align those two representations up to some kind of permutation or some rotation. And this is just another old philosophical question, this Molyneux's problem maybe you've heard about. Imagine that a child is born blind. They only know how to discriminate shapes by touch. And then they are given sight. Would they immediately be able to discriminate shapes by sight? And I think an implication of this hypothesis, a postdiction of this hypothesis would be that, well, not immediately. But it shouldn't-- it should be relatively easy to take your representation learned from touch, or maybe from language in this case, and use it as a target to learn a representation for vision. So you still have to learn this arrow going from the new modality, the eyesight you've been given. But you already have the kernel. And half the battle, or maybe a lot of the battle was learning this kernel structure. So you already have it. And you just have to learn how to map the new modality into it. And indeed, there's a lot of interesting work. I think we heard a bit about it on one of the early-- the first days of this workshop, that people have tried this now. They've done things like Pawan Sinha did this Project Prakash, where he gave-- where they had surgeons that would operate on children who had cataracts. They get sight for the first time. And it doesn't take them very long to understand images. Now, they don't have it immediately. But it's not very long. OK, so I think there's a lot more to discuss. I'll end with the final implication being that if there is some endpoint to all this, if these things are heading toward something, we should work to characterize it. And think it's a great challenge. I hope that we can get a better idea of what that model is. Or if this is just not true at all, we should prove that and show that, too. So I will thank my co-authors and funding agencies. [APPLAUSE] So I'm also moderating. So I'll say five minutes for questions. OK, [? Shimon. ?] AUDIENCE: So one thing is an alternative or maybe a similar view, if the convergence may be related to finding the close-- the simplest program that generates [? beyond ?] the observation. And this was-- it would lead-- it tends to find the latent variables that actually generated the observations. And the exact observation, different system would get some different observation of the same system. They will end up with recovering the same latent variables. And it will not depend on the computer that made it and so on. So maybe there is a direction of whatever we do is we're eventually recovering the correct latent variables and so on and so forth. So anyway, this general direction. PHILLIP ISOLA: Yeah, I find that really compelling. So we had a section I skipped, which we called the Simplicity Bias Hypothesis, which is roughly that there are many ways of fitting whatever data you have. 
But under the pressure to find the simplest, you will get more convergence than if you don't have that pressure. And, yeah, maybe the simplest program is somehow we're working our way toward that simplest program via regularization, implicit and explicit. I think it's speculation. But it's interesting to consider that. AUDIENCE: [INAUDIBLE] on some data, some recent data or some arguments that deep networks actually have bias towards-- PHILLIP ISOLA: Yes. AUDIENCE: --the lowest complexity and so on, in some way, and so on. So maybe all of this-- PHILLIP ISOLA: I think it all fits together. And actually, Jacob, one of the authors, had one of those papers on deep nets have the bias toward the simplest structure. But I think there's still a lot of open questions there, yeah. [? Andrea. ?] AUDIENCE: Just a comment. It's on your [? projecting ?] assumption. Maybe you're winning a bit at the moment because you're using networks that have been trained for recognition, whereas if you had-- say you had a network that had just been trained to do reconstruction, 3D reconstruction, maybe this wouldn't work so well. PHILLIP ISOLA: Yeah. AUDIENCE: So it's just possible you're seeing something because of that. PHILLIP ISOLA: It could be. I think some-- are-- I don't know. [INAUDIBLE], are some of the models that we have trained for reconstruction rather than recognition? AUDIENCE: There are maybe some that [INAUDIBLE]. PHILLIP ISOLA: OK, yeah, so we have MAEs in there. But they actually are kind of an outlier. So that's a little bit of a violation, yeah. AUDIENCE: [INAUDIBLE] can we trained through 3D [? depth ?] prediction? So [INAUDIBLE] it's not completely solid. But it's just [INAUDIBLE] you're seeing something like that, maybe? PHILLIP ISOLA: Yeah, I think that's a good point. And maybe, yeah, contrastive-- it's like instance discrimination and classification, these might all be more alike than we really realize. Even though I said they're different objectives, they're maybe not that different, yeah. Yeah, that's a good point. Let's go to the back. AUDIENCE: Yes, I guess this is motivated by the possibility that the representations learned by these models don't necessarily align with representations that humans have or of underlying true reality. So I guess it seems like the data that these models are trained on is curated in some way. Like the statistics, the text on the internet, the images on the internet don't think-- they selectively pick out remarkable things about the world. It's not necessarily a random sample over all possible data that you could observe. So I guess taking that into account, what is the implication of this curation for representations that come downstream of that? PHILLIP ISOLA: I think that's a great question. So the data sets that we use to train these models are different in a lot of ways. But they share one thing in common, which is they're all internet data sets. They're all photos and captions and texts downloaded from the internet. And that might be a very biased and curated type of data. If you just had a robot on Mars, it might come up with a very different representation. So I think that's an open question. I guess the strong form of the hypothesis is that, no, the robot on Mars will find the same representation because it's more about physics and the underlying, like, you know, F equals ma. But I think I would believe more in some data distribution properties do really matter. AUDIENCE: [INAUDIBLE] NASA would upload the data to the internet. 
PHILLIP ISOLA: NASA will upload the data to the internet, yeah. OK, Dan, here. AUDIENCE: Yeah, so I guess my very minor comment, which is at some point you said that maybe the strong form of the hypothesis is that the best vision model is the best language model. But maybe is what you're saying that the-- some late layer of the best vision model has the same representation as some [INAUDIBLE] layer? Because it's not-- the models are the same. I mean, obviously they're computing things. PHILLIP ISOLA: Yeah, just the kernels align at some layer of the models is maybe what the statement would be. AUDIENCE: But you could ask whether like the intermediate or early layers look similar. And if they come to look very similar in between, that would be even more surprising, right? If an intermediate layer of a language model came to look like an intermediate layer of a vision model, that would be really unusual. PHILLIP ISOLA: That would be-- yeah, so we do a search over multiple layers and take the average or the max, depending on how we measure things. So we don't really see this layer-by-layer sequence being matched between the two models. But it's more like somewhere in both of the networks the kernels align. AUDIENCE: Right. So the hypothesis that they're representing the same thing is reasonable. But it would be really surprising if it turned out-- PHILLIP ISOLA: I agree. AUDIENCE: Yeah. PHILLIP ISOLA: That would be even stronger. AUDIENCE: [INAUDIBLE] tracking it. PHILLIP ISOLA: Yeah, we haven't seen that yet. Yeah, [? Leslie. ?] AUDIENCE: [? I ?] [? guess ?] another question is something like, if you trained on very different-- supposedly very different data distribution, not so much Mars because that's the world's pictures [INAUDIBLE] whatever, but just only spatial transcriptomics data, tons and tons of spatial transcriptomics data, would you expect it to be the same asymptote? Or would you expect it-- yeah, maybe it goes up for a while. And then it asymptotes lower. PHILLIP ISOLA: I think every-- OK, so one of the assumptions to this model, the mathematical version of the model, is that you train on the same distribution of events, not the same data, but the same distribution over underlying events that generate the data. And the more you violate that, the more I think this might not be true. But I expect that for different problems and domains, there'll be a percent-- a degree to which this is true. Yeah, so I don't know. Leslie? AUDIENCE: Suppose this alignment, should we think of it as being marginally very weak or very strong? Or how do you think of it quantitatively? PHILLIP ISOLA: Right. Is-- just on the scale of 0 to 1, like, it's point-- AUDIENCE: [INAUDIBLE] [? extremely ?] strong? PHILLIP ISOLA: I think it's fairly strong. So the meaning of that number is the percent of nearest neighbors that are in common between the two kernels on average. So if I take an item and I find its nearest neighbors under one kernel, and I take an item and I find its nearest neighbors under another kernel, it says, about 1 out of 5 nearest neighbors against a dictionary of 1,000 possible neighbors is shared. OK, I think that we should take the rest of the discussion to the break. So I'm happy to chat more. But I also need to moderate and move things forward. So I'm going to say, we'll come back at-- [? Sherry, ?] what time are we coming back? Is it 3:45? 3:45. Yeah, come back at 3:45. Thank you. [APPLAUSE] [SIDE CONVERSATIONS]
Info
Channel: Simons Institute
Views: 3,172
Keywords: Simons Institute, theoretical computer science, UC Berkeley, Computer Science, Theory of Computation, Theory of Computing, Understanding Lower-Level Intelligence from AI; Psychology; and Neuroscience Perspectives, Phillip Isola
Id: 1_xH2mUFpZw
Length: 44min 27sec (2667 seconds)
Published: Tue Jun 18 2024