PHILLIP ISOLA: Thank
you, [? Sherry, ?] yes. OK, so, yeah, thanks for
letting me take the time, despite the fact that I'm,
I guess, inviting myself. But I get to share a
little bit of my work, too. So this is work that we
are publishing at ICML. It's a position paper,
so it's a little different than some of the
talks I would normally give. It's a little bit
more opinionated. But I hope that this is
a good audience for that, to get some interesting
feedback and discussion going. This is work with Jacob,
also known as Minyoung, and Tongzhou, who are two of my students. They're actually both here. Is Jacob here somewhere? Yeah, in the back,
and Tongzhou is here. And then additionally,
Brian Cheung was another author on this work. And this is "The Platonic
Representation Hypothesis." So it actually is
teed up perfectly from [? Shimon's ?] last
question last night, which is, why are we seeing
different methods converge on similar representations? Why is it that
somebody born blind can learn a similar
representation of the world to somebody who is seeing? And here are going to be some of
my thoughts on that question. But we didn't plan that. I didn't coordinate with
[? Shimon ?] on this. So one of my favorite papers
from the last 10 years is this one. This is some work
from [? Agita ?] and Antonio and others. And what they found is that
if you train a deep net to classify a scene,
then what happens is you get intermediate
neurons, labeled A and B, that if you evaluate them
on a bunch of other images, it looks like these neurons
are acting as object detectors. So object detectors emerge as
an intermediate representation for solving the
scene recognition problem, which makes sense. This is such a cool
paper because it's one of the first times that I
think we saw that these are not just black boxes. They actually have some kind of
interpretable structure inside. So what you're seeing
on the right here is that there is a neuron on
some layer of a scene detection network, a scene classification
network that will fire whenever it sees a dog face. So it's like to decide
if this is outdoors, I might be looking
for dog faces. There's another
neuron that fires whenever it sees a robin hood. And of course, Antonio showed
you more of this the other day. There's more recent
results like that. OK, so we found this
really interesting. And we did some work
a little bit later with Alyosha and
Richard Zhang, in which we were solving a completely
different problem, which is image colorization. So take a black-and-white photo
and try to predict the colors. And you can ask
the same question. What are the internal
units sensitive to? What do the neurons react to? And you can think for yourself. A lot of you all have seen
work like this before. But it should be some
features about color, some low-level texture stuff,
but what is it going to be? What is this neuron A
going to be sensitive to? Dog faces. OK, you learn a
detector for dog faces, whether you train it to do scene
recognition or colorization, two very different problems. And you also get
flower detectors. So you get a lot of
different units that detect different types of
things that we might-- might be nameable, might be semantic. So this is a story that it's-- you've seen it before. I think many of you have. This is textbook
knowledge by now. These are figures
from our textbook. So I can-- I get to define,
what is textbook knowledge? OK, they really are. You can look at
page 300 something. But it's led me-- and I think many people
have had similar-- have stated similar hypotheses. But it's led me to
this basic hypothesis, that different
neural networks that are trained in very
different ways, with very different
architectures seem to be converging
to a similar or-- the strong version
of the hypothesis is they're converging to
the same way of representing the world. Maybe that's a
hypothetical endpoint. It'll be the same, and
they're not quite there yet. So let's unpack
this a little bit. But I know what you're thinking. I know what at least one
of you is thinking. Where's-- is Alyosha here? OK, maybe not. Well, I can use my
model of Alyosha. I know what he's thinking. Yeah, of course, they
converge to the same thing because it's all about the data. That scene recognition system
was trained on ImageNet. And-- that's actually not true. It was trained on a
different data set. But it's trained on a data set
with a lot of photos of dogs. And the colorization network
was trained on ImageNet, which is a lot of photos of dogs. Of course, they both
learned similar detectors. And that's not what
I'm going to say. That's not what
I'm going to argue. So it could be that there's
something else common between these different systems. It could be the architecture. Dan talked a little bit
about this a few days ago. It could be that it is
the optimization process. Andrew was talking about
some optimization properties that might lead to convergence. It could be the people. Maybe-- we're all talking. We're all sharing ideas. We're all going to
converge sociologically to the same types of ways
of representing the world. But what I'm going to
argue, essentially, is that it's none of these. So what's left? What is common between
all these systems? It's the world. These systems are all trained
on this same universe, this same reality,
this same Earth. And it's kind of similar
to it's all about the data. But it's that the data is an
intermediary to the world. And you could have data that's
superficially very different. But if it's still data that
samples from a similar world, then it will lead to
similar representations. So that's going to be
the rough argument. OK, so outline of
the talk will be, first, I will
share some evidence of convergence between
different models trained in different ways. Then I will talk a
little bit about what we might be converging to. Is there an endpoint to this? And finally, I'll talk about
limitations and implications of this idea. And again, there's a lot
of interesting implications and a lot of really
important limitations. So I hope that sparks debate. OK, so evidence of convergence. OK, I already showed you
that scene classifiers and colorization networks
learn some similar units. There's a lot more
papers along those lines. Maybe the first and
most famous examples are results like this
from Hubel and Wiesel, who found that there are Gabor-like
or oriented line detectors inside the cat cortex. And Bruno and others
have shown that there are simple statistical models
that also result in these Gabor filters being the natural way
of processing visual data. And of course, we also have
seen that in neural networks, the first layer, you always
get these edge detectors, these Gabor-like filters. These are the filters on
the first layer of AlexNet. So that's some kind
of commonality. These-- different systems,
brains, cats, and so forth, are converging to similar
first-layer representations. And over the years,
people have built that up and looked at second
layer and third layer and seen commonalities. Now, I'll jump forward to a very
recent version of that idea, which is work from Amil. Amil here? OK, Amil is over there-- Amil and Alyosha on what
they called Rosetta neurons. And this is super cool. This is showing in each column
a different neural network, a different vision network
or graphics network. And the networks are processing
this image of the cat. And in each row,
they find-- they're highlighting a neuron,
which seems to be shared between all of these networks. So in StyleGAN,
there exists a neuron that fires for the Santa hat. And in ResNet50, there also
exists a neuron that selectively fires for the Santa hat. So it's a filter. So that's why we're
seeing a heat map. And that heat map is
high for the same region. And these networks are
not trained together. These were entirely
different neural networks trained on different data. So they called them
Rosetta neurons because it's like a common
language that's discovered across these different models. And they were a little more
conservative than we are. So they said they
found 20 or so, and it doesn't explain
all the variance. But I want to say that
this is like the start of some kind of convergence,
which may go further. So there's a lot more
evidence along these lines. But I want to get to
some of our new results and new experiments, where we
looked at this in a little bit more detail. So we have some
level of similarity between different
neural networks, different architectures,
different data sets. And I think that's something
that many people in the field have remarked on
over the last decade. But for this to be
actually converging to some optimal or
platonic representation, we've got to show that
the level of similarity is increasing over time or
increasing over performance or scale in some sense. So that's going to
be the next question. Is this increasing? And if so, that
suggests that there is a convergent trend going on. So I want to now set up
a little bit of notation to formalize what I
mean by representation and what I mean by
representational alignment and convergence to get
us on the same page. So I'm not saying
that all models are converging in all senses. I'm actually meaning it in
a fairly particular sense. So what I mean is we're going
to restrict our attention to representations that
are vector embeddings, so mappings from some data
like images to a vector. It's not everything, but
this is a very general class of representations. And we're going to characterize
a representation only in terms of its kernel. So this is actually
a very common way of characterizing
a representation. This tells me, how does the
representation measure distance between different data points? So the kernel of a
vision system evaluated over this set of images
will look like a matrix. It will be saying,
my representation of this person's face is
similar to my representation of that person's face
and very different from my representation
of this house. So the kernel is
an inner product between the embeddings between
one image and another image or one data point and another
data point, evaluated over all the pairs of data points. So kernels are really
important in understanding representations. Kernel methods, kernel machines
make use of this structure for learning. In neuroscience, there's
a lot of literature on representational
dissimilarity matrices. That's a kernel. This is from a
neuroscience paper. So kernels are a
fundamental structure for understanding the
properties of a representation. And that's all we're
going to talk about today. So in order to know if
two representations are the same in the way
that they represent distance, or they-- the same
in their kernel structure, we need a kernel
alignment metric. And that just means I'm going
to take the kernel for one representation and look at the
distance between that kernel and the kernel for
another representation. And that's my measure of how
similar the two representations are. There's a lot of kernel
alignment metrics. I'm not going to go into
the details of them. But just think of it as you
create this kernel matrix from one neural net. You create it for
another neural net, evaluate it on the
same data points, and you somehow
measure the distance.
if different vision models, different vision neural networks
are increasing in their kernel alignment over time, or in
particular, over performance. As they get better, which they also do as the years pass, are
they becoming more similar in terms of their kernels,
how they represent the world in their internal activations? So two hypotheses
we can put out here. So one is that, no, they're not. There are many different
ways that you can represent the world that are all good. There's not just one way of
solving the problem of vision. That's one hypothesis. And two is, actually,
too bad, no. All strong models are alike. This is the Anna
Karenina setting. That's a term that was
coined by Bansal, et al., so it's not our own term. But I really like it. It's the idea that
it could be the case that all strong representations
are somehow alike. They have something in
common because it's all happy families are alike. It's the same idea
that there's a lot of ways you could go wrong. But if you are right
in all properties, then that's a set of constraints
that forces you to be alike. Dan Yamins and I
think Rosa have called this the contravariance
principle. So it's also quite
related to that. There's a lot of people in the
audience that have studied this. So it's just great to be here. So let's run our own
experiment on that. We're going to take
78 vision models. What I want to emphasize here
is that these models are all trained in different ways. So some are ResNets. Some are transformers. Some are trained on ImageNet. Some are
self-supervised systems. Some are trained
on other data sets. So different training
data, different objectives, different architectures. And I'm going to bucket
these different vision systems, these different
visual representations by their performance
on this benchmark of general visual competence. This is the VTAB benchmark. So this is our
proxy for, are you just a good general-purpose
visual representation? And so this is used as-- to evaluate if I
have learned good features in computer vision. So on the x-axis is going to be
different bins of performance. So the first bin will be
visual representations that don't solve many VTAB tasks. And the last bin will be
visual representations that do solve a lot of VTAB tasks. And the y-axis is
going to be the kernel alignment, the average kernel
alignment between items in each bin. So here's the result.
So systems that are really good at a
lot of different tasks are all quite similar. And systems that are only good
at one task or not very good at all are all quite dissimilar. So it kind of makes sense. You might be thinking,
yeah, of course. If a system-- if two systems are
good at the same set of things, then of course, they must have
similar internal structures, similar representations. I don't think it
has to be the case, but it's reasonable
to suppose that. You can construct worlds in
which that wouldn't be true. But it's not too surprising. And it's the Anna
Karenina scenario. All strong representations in
these experiments are alike. So here's a t-SNE
plot or a UMAP. It's similar to a UMAP plot
of that result. Again, here are different vision networks. And notice that regardless
of whether you're contrastive or your CLIP or
your classification, different architectures,
different objectives, the-- what's causing two
representations to be similar is their performance,
not their architecture. It could have been that all
of the contrastive methods cluster together
separately from all of the non-contrastive methods. But that wasn't the case. Performance is what is
dominating the clustering of these representations. So I think that a lot of people
would have expected that. As vision systems become more
general purpose, stronger at more tasks, they
become more aligned in how they represent the world. The next experiment
is going to be-- well, to me, it was a
little more surprising. So now we're going to
ask, is the same happening between two modalities? So as language models
get bigger and better, do they become more and
more alike to vision models? This is a little
bit weird, right? So a few hypotheses. Hypothesis one is that, no. If you do better and better
at next-token prediction, next-word prediction,
you're going to become really good
at language, at syntax, at low-level, superficial
properties of language. And you're going to probably not
be something that's generally useful for other domains. It's just you're going to be a
super specialist on language. Hypothesis two, maybe not. Better language
models are just better intelligent representations
of the world. And they're also
better vision models. I'll tell you exactly
what-- how we measure that. And maybe the strong
form of hypothesis two is, the best vision model
is the best language model. This is going to be too strong. But we'll put that
there for, yeah, maybe. OK, so I have to tell you how
we can measure whether or not a vision model represents
the world in a similar way to a language model. Again, we're going to
use kernel alignment. But it's cross-modal
kernel alignment. So here are some images. And this box is
representation space. And I'm imagining
a neural network in which the apple and orange,
according to the vision system, have a similar representation. They're nearby in
representation space. And the apple and the
elephant are far apart. And we can also embed
the corresponding words for those items into
representation space of a language model. And what we're asking is whether
the similarity in the language representation
matches the similarity in the visual representation
for the corresponding image that matches that text. So if the representations
are converging, we'd say that the similarity
according to a language model of the word "apple" and
"orange" is roughly the same as the similarity according
to a vision model of an image of the apple and an
image of the orange. So we have to have paired
data to evaluate this. The models in this section are
all trained without paired data. So they are vision models
trained only on images. And we're going to measure their
similarity in representation space to language models
trained only on language. But we're going to evaluate
the kernel alignment using paired data to be able to
ask, does the vision system embed these two
photos of Yosemite close to each other
in a way that-- and does the language model also
embed these two sentences that are captions about Yosemite
close to each other?
Here's the experiment. We took 11 language models,
5 vision models, these vision transformers. We measure on the
x-axis the performance of the language model
at language modeling, so the performance
of the language model at next-word prediction. And we measure on the
y-axis the kernel alignment between the language
model at each of these points
and a vision model, DINOv2, which is trained
self-supervised only on images, no language used at all. OK, so here's the result. So as
a language model, like Llama, for example, up here
becomes better and better at next-word prediction, its
kernel becomes more and more alike to the DINO kernel. And as DINO-- the
different colors are different sizes of DINO. So the biggest DINO model is the
most aligned with the language models. It goes both ways. Bigger, better vision models
have more and more similar kernels to bigger,
better language models. So we have some metric
I didn't describe fully. We only got up to
0.6 on that metric. But the point is the trend. The trend is going up. It might [INAUDIBLE] off. We'll see what happens. We did this for a bunch
of different language models, a bunch of
different vision models. One thing I want to point out
is, OK, I actually lied to you. I said that we were only looking
at pure language and pure vision models. We did look at one
VLM, one model, CLIP, that's trained to
align images and text. And you would expect
that a model trained to align images and text will
end up with a similar kernel because it's trained to have the
same kernel between the vision encoder and the text encoder. But CLIP is not actually--
it's only marginally more aligned with Llama,
with a language model, with language models in
general, than DINO is aligned with language models. So DINO has almost
the same alignment, in this kernel-alignment
sense, with language models as CLIP does,
despite that CLIP is trained to be aligned with
language models, which is interesting. Was there a question or no? AUDIENCE: [INAUDIBLE] PHILLIP ISOLA: Yes, Bill. AUDIENCE: Do you think this
would work for audition? PHILLIP ISOLA: Yes. [INAUDIBLE] AUDIENCE: [INAUDIBLE] sound the
way as an apple sound the same? PHILLIP ISOLA: Yes, maybe. The hypothesis is that,
yes, it will work. Me personally, I don't know. But the hypothesis doesn't
represent my exact belief. It's like we're
stating a hypothesis. But, yes, the
hypothesis is it will. And I'd love to hear Josh's
thoughts on that at some point. Yeah, Dan. AUDIENCE: [INAUDIBLE]
brief clarification. The reason you're saying that
CLIP is only marginally better is because it's at
0.2 versus the thing from the previous slide? Does that have [INAUDIBLE]? PHILLIP ISOLA: Well, 0.6. Yeah, that was the point, yeah? AUDIENCE: OK. PHILLIP ISOLA: Numbers
are still kind of low. So who knows what will
happen as they go up to 1. Blake. AUDIENCE: But to push on
that a bit-- and I guess this gets to what you
were saying, though. Given what you said about weaker
models or more specialized models, if you weren't training
on this very general next-token prediction thing, but say you're
doing something really specific, like you trained your language
model just to always highlight the word "dog" for me, I
imagine this wouldn't hold. PHILLIP ISOLA: I
think you're right. AUDIENCE: [INAUDIBLE]
highlighting "dog" would not correlate with
its match to visual models. PHILLIP ISOLA: Yeah,
I think you're right. I'm not going to talk
about that explanation. That is something we
talk about in the paper. We call it the
multitask hypothesis. Train on more tasks,
get more convergence. But it's also basically the same
as the contravariance principle. So Dan and Rosa have already
articulated that idea. But I think it's-- I think that's important. OK, but let me go on. We have a few reasons why we
think this might be happening. I'm happy to discuss
more offline. I want to talk about where
all of this might be heading. And this is the most hypothesisy
part of this hypothesis, I suppose, is this
is not proven. But the picture that I have
in mind, that we have in mind is something like Plato's cave. So I think most of
you probably know the allegory of
the cave, this idea that there's prisoners
in the cave whose only experience of
the outside world is the shadows on the cave wall. But they somehow infer that
there is a world out there. And Plato made
that as an allegory about our own experience. We only experience data. We don't actually have any
access to true physical state. And he says, maybe
metaphysically, there isn't even a true state. There's just ideal
latent variables, ideal forms behind it all. I'm not going to make any
kind of metaphysical argument. It's just an analogy. But the picture
that we have in mind is that, yeah, there
is a world out there. There is some data-generating
process, some causal variable z. And you can observe that
world in different ways, multi-view learning. You can look at
it through images. You can caption the images. You can potentially get
to that same sentence via a different set
of sensors, maybe via touch or a different
camera from a different angle. So there's a lot of different
ways of viewing the world. But if I learn a
representation of any of these ways of
viewing the world, well, because they're generated
by the same causal process by the same world
behind it all, they should somehow become alike. That's the basic idea
for what we have, for why we think ultimately
there is this convergence. It's a very general
idea, an idea that's been stated many
times in various ways. But I think it's still a
powerful idea to investigate. OK, so I want to-- so that's the general idea. Now, I want to tell you
about one particular toy mathematical model
in which you will expect to get this
type of convergence. But I'm going to
emphasize, this is just one particular mathematical
formalization of this idea. I think the idea could be
explored much more broadly. So the mathematical
model that we have that would exhibit
this type of convergence between an image embedding
and a language embedding goes as follows. You have a world that
consists of discrete events Z. They're sampled from some
unknown distribution P of Z. All observations,
all learning signal is mediated via observation
functions, which we are going to assume
are bijective functions. That's a huge simplification. So they contain
all the information in the observation
function that is contained in the latent variable Z. And in this world, we're
going to model co-occurrences. We're going to use a
contrastive learning method that will basically say, two
things that co-occur are a positive pair. Two things that don't
co-occur are a negative pair. Align the positives. Align two things that
are co-occurring. Push apart two things
that are not co-occurring. And this is the standard setup
for contrastive learning. So this is not far
off from the types of learners that are popular. Like contrastive
language models might try to align two
co-occurring words. And contrastive
image models might try to align two
co-occurring image patches. So in that particular world,
with discrete random variables and bijective
observation functions, if you train a contrastive
learner with a noise contrastive estimation objective,
you can prove-- and this is not a new result,
but you can show that the-- this will converge to the
pointwise mutual information function. So the intuition for that is
that what contrastive learners are trying to do
is they're trying to classify between
positives, co-occurring items, and negatives,
non-co-occurring items, and trying to learn an
embedding of the data in which the inner product, or the
similarity between the embedding vectors, is proportional
to that probability ratio. And that probability ratio
is just how much more often do the two items
co-occur together divided by the product of the marginals. How often would you expect,
by chance, them to co-occur? And so basically, if
I learn an embedding f, such that its inner
product with other items is equal to this ratio, then the
kernel that it will arrive at, the way that this representation
measures similarity is going to be this
pointwise mutual information function, the joint
probability divided by the chance rate of
those things co-occurring.
been a favorite of mine for a long time. So I'm just-- I'm always
trying to sneak it in. But the rough picture is that
contrastive learning boils down to finding an embedding
in which similarity equals co-occurrence rate or pointwise
mutual information, which is like normalized
co-occurrence rate. So this is a little
bit-- maybe if you don't like the Plato analogy,
this is the Wittgenstein. And meaning is used--
meaning derives from the rate of co-occurrence. And so the reason why
the apple and the orange are nearby in
representation space is because those two things
co-occur in kitchens. And elephants don't
tend to co-occur with those items as much. And one of the interesting
things about this model is that if these observation
functions are bijective, then not only do you converge
to the co-occurrence of the observations,
you converge to the co-occurrence of
the underlying events because bijective
observation functions on discrete random variables
preserve probability. And so all these things
work out to be the same. So what that boils
down to saying is that the different views
that satisfy these properties will converge to
the same kernel. So the language representation
and the visual representation will converge.
of Plato and Wittgenstein. Maybe I should have chosen
more recent researchers. But I went with those. So that's just one
mathematical model of what might be going on here. I think I'll skip
that small example. And I want to talk about
limitations and implications because I think those
are maybe the most interesting to get into. So again, I think I know what
a lot of you might be thinking. Hold on. An image is not
equivalent to a sentence. Again, when I went to see the
total solar eclipse a month ago, that experience, I
just can't describe it. It's ineffable, right? There's no words to describe it. Or if I am writing
an essay and I talk about this
concept of free speech, this is an abstract concept. There's no image that
captures that concept. So I think this is a
really valid criticism. Different modalities
aren't actually all bijective with some
underlying representation. They might have different--
fundamentally different information. So I don't quite know how
to fully resolve this. The empirical evidence
suggests we're seeing some convergence,
despite that this might be true. But this is a real limitation. Mathematically,
in these cases, we don't have that-- we don't
satisfy that mathematical model I gave you. The observation function is
lossy or abstract or partial. It's not a complete
representation of the underlying world. But despite this, we do see
some interesting convergence. I think one example
is that CLIP is trained to reduce image
representations to just being captions. It's trained to align
visual representations with language representations. And yet we love it
in computer vision. It works really, really
well, despite that it's trained to throw away
everything about vision other than language. OK, so maybe
language is actually closer to a complete
representation of what we care about in
vision than we thought. But it is a limitation. Another limitation--
oh, actually, probing that limitation
a little bit more, so one implication is that
the more lossy and incomplete is your observation
in language, the more it might not match
what is in the image. Because a single
word is not going to fully describe an image. And a caption might only
partially describe an image. But we did an experiment,
which is, well, what about 1,000 words? So we varied the number
of words in captions and measured the alignment
between the captions and the visual data. We only went up to 30
words, not 1,000 words. But you can see
the trend goes up. So the kernel alignment
between rich captions with the corresponding
images is higher than the kernel alignment
between just a single word or a very partial caption. So it kind of makes sense. It's kind of consistent
with this idea that as you get closer
and closer to complete bijective observations, you will
get better and better alignment with the vision modality. Imperfect alignment-- so I
told you that on our metrics, we haven't explained all
the variance by any means. There's a lot more
variance left to explain. So this is an open challenge. And one other thing
for the people who are deeper into this field
of representational alignment is, technically, we're not
seeing global structure alignment. We're seeing local
structure alignment. So if you're familiar
with the CKA metric-- this is for the people that are
really experts in this area-- we're actually not seeing
increasing CKA alignment. This is a measure
of global structure. We're seeing-- we only see this
alignment when we look at local nearest neighbor structure. So this is a detail
to the analysis. But I'm happy to
talk more about that. [INAUDIBLE], yeah. AUDIENCE: Yeah,
Phillip, appreciate the y-axis misalignment
[INAUDIBLE]. Can you say what it
is for a random model? What's the floor? What's the kernel of a
randomly initialized model? [INAUDIBLE] PHILLIP ISOLA: Right. So for random model, I think
it's roughly 0 on this metric. Tongzhou, is that right? TONGZHOU WANG: For a
purely random process, not the [INAUDIBLE] network. PHILLIP ISOLA: No,
random network, though. AUDIENCE: Random parameters,
but up the data points, still encoding them as a kernel. TONGZHOU WANG: I don't
have that number. But that's a purely
random process. Without network
[INAUDIBLE] bias. It's much, much
lower [INAUDIBLE]. AUDIENCE: I think-- but-- PHILLIP ISOLA: Yeah,
that's a good baseline. We should come back to that. Yeah, I don't know
the exact number. I think it's quite
a bit lower, though. So another thing that
you might have in mind is, OK, yeah, I buy
this convergence. That's-- empirically, that
seems to be going on, but not because they're converging
to some platonic reality. These are [? BS ?] machines. They're converging
to just dumb models. OK, so valid. It could be maybe all these
models are converging, but not to a good representation
of the world, but to just being these superficial,
stochastic parrots, maybe. So it could be that there
are fundamental limitations. We're all using transformers. We're all doing next
token prediction. And this is just flawed. That's an option, too. Maybe it's
socio-technical biases. We all chat, and I tell you,
yeah, transformers are amazing. They're converging. Then you go and use them, and
it increases the convergence. So maybe don't take that lesson. OK, so there's a
lot of limitations. There's some other ones
we discuss in the paper. I think there's also really
interesting implications. And I want to end
with the implications. So one implication
is that there's this complementarity between
all of these different sensory modalities. And we've seen this a lot. We've talked a lot about this. Again, [? Shimon ?] was
posing the question yesterday that if I want to
train a vision model, well, this implies
that I should be able to use language data,
too, because the underlying kernel, the
underlying structure, if it's really
shared between them, then I can get there
via multiple paths. And I might as well use all
the paths available to me to increase the rate at which
I get to that representation. And I think an
interesting experiment could be it should be the case
that to train a vision model, there's value to
training it on a word. A word should be worth n
pixels to a vision model. And a pixel should be worth
m words to a language model. If I train-- I'm going to train
a language model, I should train it
on pixels, too. That should help my performance. And there's some evidence. People are starting to do that. There's some evidence of this. But I think it's only-- in LLM land, it's only
a little bit explored. Most LLMs are only
trained on language data. But this implies you
should train them on other types of data, too. And they should get better
at language modeling. And there's some evidence-- I mean, this is from
the GPT-4v blog post. So it's not-- who knows
if it's replicable. But they say that if you train
GPT-4 jointly with vision model-- so GPT-4v has joint
vision and language model-- you do better at pure
language-reasoning tasks than if you don't use vision. So there's some transfer
between the modalities. Another interesting implication
is that it should somehow be relatively easy to
translate or convert between different
representational formats for different modalities
if they're all converging to the same representation. So, for example, it
should be relatively easy to translate
between images and text if their representations
are the same. When you train a representation
of text and images and they converge to
the same representation, it will act like a bridge. And maybe you'll only
need a little bit of data to find the mapping from
the visual representation to the text representation. And this is something
I think that we also see in practice to some degree. There's a lot of success
of unpaired translation, of translation
between modalities. There's some interesting
work from Sompolinsky, et al., or Sorscher,
et al., showing that you can do unpaired
translation between images and text to a certain degree. And the idea is
basically the same, that if the representations
that you get in each modality are the same, then you just
have to align those two representations up to some kind
of permutation or some rotation.
old philosophical question, this Molyneux's problem
maybe you've heard about. Imagine that a
child is born blind. They only know how to
discriminate shapes by touch. And then they are given sight. Would they immediately be able
to discriminate shapes by sight? And I think an implication of
this hypothesis, a postdiction of this hypothesis would be
that, well, not immediately. But it shouldn't-- it should
be relatively easy to take your representation
learned from touch, or maybe from
language in this case, and use it as a target to learn
a representation for vision. So you still have to
learn this arrow going from the new modality, the
eyesight you've been given. But you already have the kernel. And half the battle, or
maybe a lot of the battle was learning this
kernel structure. So you already have it. And you just have to learn how
to map the new modality into it. And indeed, there's a
lot of interesting work. I think we heard a bit about
it on one of the early-- the first days of this workshop,
that people have tried this now. They've done things like
Pawan Sinha did this Project Prakash, where he gave-- where they had surgeons
that would operate on children who had cataracts. They get sight for
the first time. And it doesn't take them very
long to understand images. Now, they don't
have it immediately. But it's not very long. OK, so I think there's
a lot more to discuss. I'll end with the
final implication being that if there is
some endpoint to all this, if these things are
heading toward something, we should work to
characterize it. And I think it's a
great challenge. I hope that we can get a better
idea of what that model is. Or if this is just
not true at all, we should prove that
and show that, too. So I will thank my co-authors
and funding agencies. [APPLAUSE] So I'm also moderating. So I'll say five
minutes for questions. OK, [? Shimon. ?] AUDIENCE: So one thing is
an alternative or maybe a similar view, if
the convergence may be related to finding
the close-- the simplest program that generates
[? beyond ?] the observation. And this was-- it
would lead-- it tends to find the latent
variables that actually generated the observations. And the exact observation,
different system would get some different
observation of the same system. They will end up with recovering
the same latent variables. And it will not depend on the
computer that made it and so on. So maybe there is a
direction of whatever we do is we're
eventually recovering the correct latent variables
and so on and so forth. So anyway, this
general direction. PHILLIP ISOLA: Yeah, I find
that really compelling. So we had a section
I skipped, which we called the Simplicity Bias
Hypothesis, which is roughly that there are many ways of
fitting whatever data you have. But under the pressure
to find the simplest, you will get more
convergence than if you don't have that pressure. And, yeah, maybe
the simplest program is somehow we're working
our way toward that simplest program via regularization,
implicit and explicit. I think it's speculation. But it's interesting
to consider that. AUDIENCE: [INAUDIBLE]
on some data, some recent data or some
arguments that deep networks actually have bias towards-- PHILLIP ISOLA: Yes. AUDIENCE: --the lowest
complexity and so on, in some way, and so on. So maybe all of this-- PHILLIP ISOLA: I think
it all fits together. And actually, Jacob, one of the
authors, had one of those papers on deep nets have the bias
toward the simplest structure. But I think there's still a lot
of open questions there, yeah. [? Andrea. ?] AUDIENCE: Just a comment. It's on your
[? projecting ?] assumption. Maybe you're winning a bit at
the moment because you're using networks that have been
trained for recognition, whereas if you had-- say you
had a network that had just been trained to do reconstruction,
3D reconstruction, maybe this wouldn't
work so well. PHILLIP ISOLA: Yeah. AUDIENCE: So it's just
possible you're seeing something because of that. PHILLIP ISOLA: It could be. I think some-- are-- I don't know. [INAUDIBLE], are
some of the models that we have trained
for reconstruction rather than recognition? AUDIENCE: There are maybe
some that [INAUDIBLE]. PHILLIP ISOLA: OK, yeah,
so we have MAEs in there. But they actually are
kind of an outlier. So that's a little bit
of a violation, yeah. AUDIENCE: [INAUDIBLE] can we
trained through 3D [? depth ?] prediction? So [INAUDIBLE] it's
not completely solid. But it's just [INAUDIBLE] you're
seeing something like that, maybe? PHILLIP ISOLA: Yeah, I
think that's a good point. And maybe, yeah,
contrastive-- it's like instance discrimination
and classification, these might all be more
alike than we really realize. Even though I said they're
different objectives, they're maybe not
that different, yeah. Yeah, that's a good point. Let's go to the back. AUDIENCE: Yes, I guess this is
motivated by the possibility that the representations
learned by these models don't necessarily align with
representations that humans have or of underlying true reality. So I guess it seems like the
data that these models are trained on is
curated in some way. Like the statistics, the
text on the internet, the images on the internet
don't think-- they selectively pick out remarkable
things about the world. It's not necessarily a random
sample over all possible data that you could observe. So I guess taking
that into account, what is the implication of this
curation for representations that come downstream of that? PHILLIP ISOLA: I think
that's a great question. So the data sets that we use
to train these models are different in a lot of ways. But they share one thing
in common, which is they're all internet data sets. They're all photos
and captions and texts downloaded from the internet. And that might be a very biased
and curated type of data. If you just had a
robot on Mars, it might come up with a very
different representation. So I think that's
an open question. I guess the strong
form of the hypothesis is that, no, the
robot on Mars will find the same
representation because it's more about physics and the
underlying, like, you know, F equals ma. But I think I would believe
more in some data distribution properties do really matter. AUDIENCE: [INAUDIBLE] NASA would
upload the data to the internet. PHILLIP ISOLA: NASA will upload
the data to the internet, yeah. OK, Dan, here. AUDIENCE: Yeah, so I guess
my very minor comment, which is at some point you said
that maybe the strong form of the hypothesis is that the
best vision model is the best language model. But maybe is what
you're saying that the-- some late layer of
the best vision model has the same representation
as some [INAUDIBLE] layer? Because it's not-- the
models are the same. I mean, obviously
they're computing things. PHILLIP ISOLA: Yeah,
just the kernels align at some layer of
the models is maybe what the statement would be. AUDIENCE: But you could ask
whether like the intermediate or early layers look similar. And if they come to look
very similar in between, that would be even
more surprising, right? If an intermediate
layer came to look-- of a language model came to look
like an intermediate language layer of a model, that
would be really unusual. PHILLIP ISOLA: That
would be-- yeah, so we a search over
multiple layers and take the average
or the max, depending on how we measure things. So we don't really see this
layer-by-layer sequence being matched between the two models. But it's more like somewhere
in both of the networks the kernels align. AUDIENCE: Right. So the hypothesis that they're
representing the same thing is reasonable. But it would be really
surprising if it turned out-- PHILLIP ISOLA: I agree. AUDIENCE: Yeah. PHILLIP ISOLA: That
would be even stronger. AUDIENCE: [INAUDIBLE]
tracking it. PHILLIP ISOLA: Yeah, we
haven't seen that yet. Yeah, [? Leslie. ?] AUDIENCE: [? I ?] [? guess ?]
another question is something like, if you trained
on very different-- supposedly very different
data distribution, not so much Mars because that's
the world's pictures [INAUDIBLE] whatever, but just only
spatial transcriptomics data, tons and tons of spatial
transcriptomics data, would you expect it to
be the same asymptote? Or would you expect it-- yeah,
maybe it goes up for a while. And then it asymptotes lower. PHILLIP ISOLA: I think every-- OK, so one of the
assumptions to this model, the mathematical
version of the model, is that you train on the
same distribution of events, not the same data, but
the same distribution over underlying events
that generate the data. And the more you
violate that, the more I think this might not be true. But I expect that for
different problems and domains, there'll be a percent-- a
degree to which this is true. Yeah, so I don't know. Leslie? AUDIENCE: Suppose
this alignment, should we think of it
as being marginally very weak or very strong? Or how do you think
of it quantitatively? PHILLIP ISOLA: Right. Is-- just on the scale of
0 to 1, like, it's point-- AUDIENCE: [INAUDIBLE]
[? extremely ?] strong? PHILLIP ISOLA: I think
it's fairly strong. So the meaning of that number is
the percent of nearest neighbors that are in common between
the two kernels on average. So if I take an item and I find
its nearest neighbors in one-- under one kernel, and I take
an item and I find its nearest neighbors under another
kernel, it says, about 1 out of 5 nearest
neighbors against a dictionary of 1,000 possible
neighbors is shared. OK, I think that we should
take the rest of the discussion to the break. So I'm happy to chat more. But I also need to moderate
and move things forward. So I'm going to say,
we'll come back at-- [? Sherry, ?] what time
are we coming back? Is it 3:45. 3:45. Yeah, come back at 3:45. Thank you. [APPLAUSE] [SIDE CONVERSATIONS]