Thank you very much. It's a great opportunity, I think, for us
to hear from three of the founders of this field in a slightly more candid format, perhaps,
than what we have heard so far. So to get us started, I know we've heard about
their research interests today, but I thought it'd be interesting to ask each of them to
introduce their neighbour: tell us who they are and what they think is their greatest achievement. Yoshua, I will let you start. All right. So here's Yann LeCun. I met him when I was doing my master's, and
he was coming to do his postdoc with Geoff, and he came to Montreal as a stopover going to Toronto. And immediately it was clear that we had
shared interests. And then later, Yann invited me to come to
Bell Labs, where I learned so many exciting things, including convolutional nets. And there you go — it's still the hot thing today. And that was over 30 years ago. Yeah. And could you please introduce Geoff? Oh, that's interesting. I'll be historical as well. So when I was a young student, an undergraduate, I started getting interested in machine learning and neural nets, and I read the entire literature from the 60s, realized that it basically stopped around 1970, and there was nothing after that whatsoever. And then I got connected with people in France who were interested in what they called at the time automata networks, and they showed me a badly photocopied preprint of a paper entitled "Optimal Perceptual Inference". This was a code word for Boltzmann machines: this was Geoff and Terry Sejnowski's paper on Boltzmann machines. This was the first paper I ever saw that actually
mentioned hidden units. You know, I was already interested in multi-layer networks back then, because it was clear from the literature that people were after a learning algorithm for multi-layer units, and I had kind of started playing with things a little bit like backprop, or what we would now call target prop, actually. And, you know, nobody understood what I was talking about. And so I said, you know, these are the people I need to meet — one of those two authors, Terry Sejnowski or Geoff Hinton. And I met Terry Sejnowski first. That was in March 1985; he came to a workshop in France, and I told him what I was working on. At the time, I think he
had already started working on NetTalk, but he didn't say a word, and then he went back to the U.S. and told Geoff there is this kid in France working on the same stuff we're doing. And then Geoff was invited to a conference in France a few months later, in June 1985 — Cognitiva '85 — where he gave a keynote on Boltzmann machines. And he had somehow read my paper, written in bad French — because I wrote badly in both English and French — in the proceedings,
and figured out it was kind of similar to what he was doing, and then, you know, maybe connected it with the stuff Terry had told him. And so he kind of sought me out and we had lunch together — we had couscous — and we figured out we were really interested in the same questions and had the same philosophy about things. And it clicked right away. And then I guess he invited me to Carnegie Mellon for the summer school in 1986, and then in 1987. So this was the one person in the world I wanted to meet most, and it was a dream come true when we did meet. Thank you. Thank you.

And now it's your turn. OK. I was just thinking — I think I might have
been the external examiner for your thesis, Yoshua, but I can't remember. I would have been a good external examiner for it, but I can't actually remember if I was. Yes. OK, there you go — that's it, I was the external examiner for his master's. OK, so there weren't that many people doing this. No, no, no. It was a very nice thesis — it passed — and it was applying neural nets to trying to do speech recognition, which was clearly hopeless. But he did pretty well despite that. So I've known Yoshua for a long time. He's the only person I've published with where
all of our joint papers have over 4,000 citations. So yes, Yoshua and I have both done well in Canada, and we've done well in Canada because Canada supports basic research. So I think of Yoshua as basically a sort of beacon for the research policy of NSERC, which is to find good people and support them, and not to expect them to do what they said they'd do, but to just let them get on and do interesting research. Yoshua is now at the stage of his career that I wish I was at now — things are happening very fast now. There are lots of new ideas, many of them coming out of Yoshua's lab; I just can't keep up with them anymore. I think he has three or four hundred graduate students. That's what it feels like, anyway. That's what it feels like. So there's a couple of arXiv papers a week,
and, yeah... I've been particularly impressed by the work on attention. My feeling was that, you know, I was the oldest one, Yann is the second oldest, and Yoshua is the third oldest, and Yoshua had some catching up to do. And unfortunately, I think he's caught up — particularly with the work on attention for machine translation. I think that's made the kind of impact that Yann made with convolutional nets, and Terry and I made with Boltzmann machines. So that's all I have to say, I guess.

Thank you very much. There have been a few opportunities, I think,
to, you know, mention work that has been done in the past. Many previous results are still really influential even today, and we didn't realize until many years later the influence of these results. And I'm curious to hear from you: what's the difference between doing neural network research in 1985 or 1995 versus today? In what ways has that changed the practice of it?

We could actually focus on doing research
without all the noise that is currently happening around AI. So it's a totally different environment. And of course, there are interesting parallels between the time when we met each other, around the mid-eighties, when neural nets were still marginal compared to traditional AI, and what happened maybe five or ten years ago when deep learning and neural nets came back. There was also a period in the early 90s when neural nets were very hot and there was a lot of hype and many companies were trying to exploit it. So there are some parallels with now. But I guess the huge difference is that it now really works.

I think it was working back then too, but a lot of the applications went underground, so to speak. Yeah, I mean, you know, very often you see
this in the history of technology, and in AI in particular, broadly speaking. People were working on the perceptron and, you know, Adaline, back in the 50s and early 60s. By the late 60s, when people became convinced that that was not a viable path towards really building intelligent machines, they started just changing the names of the techniques they were using. It became adaptive filters. And, you know, those have had huge practical consequences. Back in the old days when people had modems, you would turn on your modem and it would make this horrible noise. That's actually a perceptron-type algorithm running on a pseudo-random pattern to do echo cancellation. Or maybe they're too young to remember that. Yeah, that's right, most of them are. But, you know, modern modems actually do this, and so do antennas: the fact that sometimes you get a few bars on your cell phone and sometimes it disappears is because there is an adaptive antenna that focuses the beam on your cell phone. And that's an adaptive algorithm that's very similar to Adaline or, you know, the old algorithms from the 50s. So they changed names, really.
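(Editor's aside, for readers who never dialed up: a minimal sketch of the kind of adaptive algorithm being described here — an LMS/Adaline-style filter learning an echo path from a known pseudo-random training signal. The toy channel, step size, and filter length below are assumptions for illustration, not details from the talk.)

import numpy as np

rng = np.random.default_rng(0)

# Toy echo channel (assumed coefficients): the line feeds back a delayed,
# attenuated copy of whatever we transmit.
true_echo = np.array([0.6, -0.3, 0.1])

x = rng.choice([-1.0, 1.0], size=2000)   # known pseudo-random training pattern
w = np.zeros(3)                          # adaptive filter taps, LMS/Adaline style
mu = 0.01                                # step size

for n in range(3, len(x)):
    window = x[n-3:n][::-1]              # last few transmitted samples
    echo = true_echo @ window            # what the line actually returns
    estimate = w @ window                # what our filter predicts
    error = echo - estimate
    w += mu * error * window             # LMS update: nudge taps toward the echo

print(np.round(w, 3))   # converges near [0.6, -0.3, 0.1], so the echo can be subtracted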
And you see the same thing with AI: it used to be that A*, the path-finding algorithm, was part of AI — it's still in the AI textbooks — but we don't see it as AI any more; it's just an algorithm, right? Same with tree search for chess. So the same phenomenon occurred with neural nets in the 90s, where a bunch of companies were founded around the idea of using neural nets for a particular application — like credit rating, or pollution control, or control of car engines, whatever. And these were used, but they went underground. They weren't really viewed as big things anymore, but they were used. Same with our check recognition system with convolutional nets: it was very widely used, but not really seen as the path to AI. So I think you see waves of that. So it could be that with the current wave of deep learning,
what's certainly going to happen is that a lot of the techniques being used today are going to disseminate widely in the economy and are going to influence a lot of different things. And it's probably going to go underground after a while, unless we find the next step — you know, the next step in progress — that will keep this set of techniques in the eyes of everybody. But if we don't find it, those techniques will just be, you know, part of the toolbox. What he said. Yeah, I also find it much harder to work on
deep learning now than just five or six years ago, because five or six years ago you could come up with some, you know, pretty obvious idea, try it out, and it would work. You would have a hard time publishing the paper, but you could put it on arXiv or something. Whereas now everybody is working on it, so it's much harder to actually be innovative. And so that's why some of us are shifting our interest to the next step, really.

So perhaps I have a question related to this. You all have a large number of papers, some of them with several thousands of citations.
Amongst your set of papers, is there one gem in there that you feel nobody is reading, but that you really feel is a seminal contribution, and this crowd will be the first to read it? Let's start with that. He has a paper... he has a lot of citations, and there is a paper... Now, the correct strategy here is to figure out your h-number — no, figure out the paper just below your h-number and tell everybody to read that one, because then none of you have to worry about h-numbers anymore. I think there is a paper that I wrote in about 2008 or nine,
or something like that. It's about using matrices to model relationships — matrices are quite like capsules — and using matrices also to model concepts. So you give it triples, and from the first two terms of a triple it has to predict the third. I had done a lot of work around the year 2000 on what I called linear relational embedding, which was basically early work on learning embeddings — which people in conventional AI thought was rubbish. Actually, I have a wonderful review of that work which says: Hinton has been working on this for the last seven years, it only has one non-self citation, it's time to move on. That was from AAAI — not that I remember such things. I still don't know who wrote that review. So anyway, the idea was: instead of using vectors
for concepts and matrices for relations — which has the problem that if you've got a 100-component vector, you need a 10,000-component matrix — you use matrices for both. And that has a big advantage: now relations are the same kinds of things as objects, so you can do relations of relations. We tried it on a tiny problem. We taught it things like: three and "plus two" makes five — so the relation is "plus two". We taught it a bunch of facts like that, and then we taught it that "two" and "plus" makes "plus two". So the output it actually had to produce — and it had never seen the combination of "two" and "plus" before — was a matrix for "plus two", which it had to create. And then we showed that that matrix actually worked: if you gave it seven and this new matrix for "plus two", it would give you nine.
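(Editor's aside: a minimal, hand-constructed sketch of why "matrices for both concepts and relations" composes so nicely. These are toy shift matrices chosen by hand, not the learned embeddings of the paper being described — just an illustration that, once everything is a matrix, applying a relation and composing relations are both matrix multiplication.)

import numpy as np

N = 12  # work modulo 12 so every "plus k" relation stays in range

# Represent the number n as the n-th power of a cyclic shift matrix S.
# The relation "plus k" gets exactly the same kind of representation, S^k,
# so relations and concepts are the same sort of object.
S = np.roll(np.eye(N, dtype=int), 1, axis=0)   # basic shift: "plus one"

def embed(n):
    """Matrix embedding of the concept n, or the relation 'plus n'."""
    return np.linalg.matrix_power(S, n % N)

def decode(M):
    """Recover which number / relation a matrix represents."""
    return next(n for n in range(N) if np.array_equal(M, embed(n)))

plus_two = embed(2)          # the relation "plus two" is itself a matrix
seven = embed(7)

print(decode(seven @ plus_two))       # applying "plus two" to seven -> 9
print(decode(embed(3) @ embed(2)))    # three composed with "plus two" -> 5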
And this paper got absolutely slaughtered by AAAI: out of ten, it got a two, a two and a three, I think. So then it went to Cognitive Science and got much the same treatment. I mean, they didn't like it, except that the editor of Cognitive Science, John Anderson, said: if I understood the paper right, it's amazing, but I don't think our readers will be interested. And I still think there's a very interesting idea there, of using matrices for objects instead of vectors so you can operate on the matrices. And that's exactly what's happening in capsules. And I suggest everybody go and read that paper and see if we can get my h-number up. So that's your prescription for tonight.

So I have one as well. It's a paper that's very low in citations,
and in fact I didn't even try to submit it to one of the good places, because I knew it would be rejected. It's called something like "cultural learning". And it's about this idea that in order to learn these high-level abstractions, which is sort of my obsession, we need guidance from other humans — we humans need that — and we might need the same kind of thing for machines. And then the question is: well, where did these other people get that information about the world in the first place? The answer, of course, is cultural evolution — we pass on these notions about the world through education, talking to our children, and so on. And so the question is, how can we use these kinds of ideas to train not just a single learner, but a whole collection of learning agents? It combines some of the ideas that I worked
on a little bit earlier called curriculum learning, where one agent teaches another, going from easy examples to harder examples and from low-level concepts to high-level concepts. So I think these kinds of ideas... I initially proposed this to Google as a project to be able to parallelize training more efficiently, because current distributed training is stuck with a very big problem of having to communicate lots and lots of weights, or lots and lots of activations, between all the nodes. But if the only thing you need to communicate is this very small pipe of high-level abstract stuff — language-like things — then we might be able to have a collection of agents learning together in a parallel way, and we could parallelize with as many as a million agents, as we do with humans.
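(Editor's aside: a rough back-of-the-envelope illustration of the bandwidth argument being made here. The model size and message length are assumed numbers chosen for illustration, not figures from the talk.)

# Compare the bytes a worker ships per synchronization when exchanging raw
# parameters versus a short, language-like message of high-level symbols.
n_params = 100_000_000          # an assumed 100M-parameter model
bytes_per_float = 4             # float32
weight_sync = n_params * bytes_per_float

tokens_per_message = 50         # a short "language-like" communication
bytes_per_token = 2             # e.g. a vocabulary index
symbolic_sync = tokens_per_message * bytes_per_token

print(f"weight sync:   {weight_sync / 1e6:.0f} MB")
print(f"symbolic sync: {symbolic_sync} bytes")
print(f"ratio: ~{weight_sync // symbolic_sync:,}x less to communicate")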
OK, yeah, so I have a gem somewhere too — a couple more ideas and papers. I mean, I can refer to, of course, one of your papers. So there is one paper that we actually never got published anywhere, because it got rejected two or three times. The title of it was predictive sparse decomposition, and it was about sparse autoencoders. This was back in the days when we were interested in unsupervised layer-wise training, where we were trying to find ways to train a sparse autoencoder, or an autoencoder of some kind, to find slightly higher-level representations of the things we were training it on. And the problem, when you train an autoencoder, is that you don't want it to just learn the identity function, because if it learns the identity function it doesn't do anything for you. You want it to reconstruct the stuff you train it on, but you want it not to reconstruct anything else — and that's the hard part. And so one way we found to do this was to
make the middle layer sparse, for example — and Yoshua came up with half a dozen ideas for different ways to do this, denoising autoencoders and stuff like that. We came up with different ideas. Also, Geoff at the time came up with contrastive divergence, which is yet another way of doing the same thing. And the idea of predictive sparse decomposition is that you have an encoder and a decoder, but the output of the encoder does not directly become the input of the decoder. In fact they are decoupled, and there is a cost function that measures how different the input of the decoder is from the output of the encoder. And there is some sort of energy-based version of this to make it work, and it connects with some stuff that Yoshua later worked on called target prop, which I also worked on many years ago, where instead of propagating gradients through a neural net, you propagate targets.
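(Editor's aside: a minimal numerical sketch of the decoupled encoder/decoder objective described above — assumed toy dimensions, random weights, an L1 sparsity term, and a crude finite-difference inference loop. It is meant only to show where the coupling cost sits between the encoder output and the free code fed to the decoder, not to reproduce the actual method or training procedure from that paper.)

import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and randomly initialized parameters (assumed, for illustration):
d_in, d_code = 16, 8
W_enc = rng.normal(scale=0.1, size=(d_code, d_in))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(d_in, d_code))   # decoder weights

def loss(x, z, lam=0.1, alpha=1.0):
    """Decoupled objective: the code z fed to the decoder is a free variable,
    tied to the encoder output only through a penalty term."""
    recon = np.sum((x - W_dec @ z) ** 2)                       # decoder must reconstruct x from z
    sparsity = lam * np.sum(np.abs(z))                         # keep the code sparse
    coupling = alpha * np.sum((z - np.tanh(W_enc @ x)) ** 2)   # encoder should predict z
    return recon + sparsity + coupling

x = rng.normal(size=d_in)
z = np.tanh(W_enc @ x).copy()   # initialize the free code at the encoder output

# Crude inference: nudge z by finite-difference gradient descent on the loss.
for _ in range(200):
    grad = np.array([(loss(x, z + 1e-4 * e) - loss(x, z - 1e-4 * e)) / 2e-4
                     for e in np.eye(d_code)])
    z -= 0.05 * grad

print("loss after inference:", round(float(loss(x, z)), 3))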
So I think that's an idea that has legs, but it has not been exploited or implemented in the right way, or not really explained in the right way — we haven't found a situation where it actually works, kind of like capsules for a long time. You know, it's obviously a good idea, but we haven't really found the thing that makes it click. So that's one idea.

There's another idea. This paper maybe didn't get a lot of citations, but all of my students at the time read it. Yeah, it was something we were really paying a lot of attention to within a small community. So this paper actually got something like 70 or 80 citations without ever being published — it's only an arXiv paper. I think it's more than that now. Yeah. I mean, arXiv — as you know, I am a big proponent of open publication, for reasons like this among others. There's another set of ideas I think are really interesting, and I wouldn't say they don't gather a lot
of interest — in fact I think the interest in this is growing — and that is the generalization of neural nets to graph data. So you can think of a convolutional net on an image this way: an image is a function on a regular grid graph, but you can actually define convolutions on irregular graphs. And that opens the door to all kinds of different ways of thinking about neural nets. And this is something that someone who, I think, is in the audience or is going to give a talk tomorrow has been working on.
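(Editor's aside: a minimal sketch of one common way to define a convolution-like operation on an irregular graph — a GCN-style propagation rule with added self-loops and symmetric normalization. This particular formulation is an assumption chosen for illustration, not necessarily the one being referred to in the talk.)

import numpy as np

rng = np.random.default_rng(0)

# A small irregular graph: 5 nodes, edges given as an adjacency matrix.
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)

H = rng.normal(size=(5, 3))   # node features (5 nodes, 3 features each)
W = rng.normal(size=(3, 4))   # learnable weights of one graph-conv layer

# GCN-style propagation: add self-loops, symmetrically normalize, then mix
# neighbour features and apply the shared weight matrix -- the graph analogue
# of sliding a shared filter over a regular image grid.
A_hat = A + np.eye(5)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
H_next = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)  # ReLU

print(H_next.shape)   # (5, 4): a new 4-dimensional feature for every node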
And the third idea, which also has to do with graphs, is something I worked on many years ago in a very, very highly cited paper — my most cited paper, in fact. But people cite it because of the convolutional nets; there is a second part to that paper that nobody has ever read. Oh, no — you did, of course; your students had to read the second half. OK, so basically there's a combination of ideas there, but the main one is that the objects manipulated by a neural net do not need to be things like vectors — they can be more complex, structured objects like graphs. And I think it's, again, an idea that can be very powerful if people figure out how to use it properly — and you can be credited for much of the theory behind it. Yeah, absolutely.

So as you progress in your careers,
it seems you agree on more and more things. We could call you the three amigos of deep learning. And I'm curious to know: are there things on which you strongly disagree, things on which you just do not see eye to eye? Yes — Quebec separatism. Politics getting in the way, right. And American politics... though we agree on American politics right now, more and more. Yeah, OK. Well, I think we disagree, perhaps — I mean, we don't disagree on major things, but we sometimes have different approaches to problems, or to how to go about them. You know, there was a time when Geoff was completely enamored with anything that had to do with free energies and variational inference. And, you know, at that time, Yann wanted to know nothing about probability — he called us the probability police. Yes. Yes. Um.