Yoshua Bengio, Yann LeCun & Geoffrey Hinton AI Keynote (Part 1)

Captions
Thank you very much. It's a great opportunity, I think, for us to hear from three of the founders of this field in a slightly more candid format than what we have heard so far. To get us started, I know we've heard about their research interests today, but I thought it would be interesting to ask each of them to introduce their neighbour: tell us who they are and what they think is their greatest achievement. Yoshua, I will let you start.

All right. So here's Yann LeCun. I met him when I was doing my master's, and he was coming to do his postdoc with Geoff. He came through Montreal as a stopover on his way to Toronto, and immediately it was clear that we had shared interests. Later, Yann invited me to come to Bell Labs, where I learned so many exciting things, including convolutional nets. And there you go, it's still the hot thing today. And that was over 30 years ago.

Yeah. And could you please introduce Geoff?

Oh, that's interesting. I'll be historical as well. When I was a young undergraduate student, I started getting interested in machine learning and neural nets, and I read the entire literature from the '60s and realized that it stopped around 1970 and there was essentially nothing after that. Then I got connected with people in France who were interested in what they called at the time automata networks, and they showed me a badly photocopied preprint of a paper entitled "Optimal Perceptual Inference." This was a code word for Boltzmann machines. So this was Geoff Hinton and Terry Sejnowski's paper on Boltzmann machines. It was the first paper I ever saw that actually mentioned hidden units. I was already interested in this back then, because it was clear from the literature that people were after learning algorithms for hidden units, and I had started playing with things a little bit like backprop, or what we would now call target prop, actually. And nobody understood what I was talking about. So I said: this is the person I need to meet, one of those two authors, Terry Sejnowski or Geoff Hinton.

I met Terry Sejnowski first. That was in March 1985; he came to a workshop in France, and I told him what I was working on. At the time, I think he had already started working on NETtalk, but he didn't say a word about it, and then he went back to the U.S. and told Geoff there was this kid in France working on the same stuff we're doing. Then Geoff was invited to a conference in France a few months later, in June 1985 — Cognitiva '85 — where he gave a keynote on Boltzmann machines. He had somehow read my paper in the proceedings, written in bad French (because I wrote badly in both English and French), figured out it was kind of similar to what he was doing, and maybe connected it with the stuff Terry had told him. So he kind of sought me out and we had lunch together — we had couscous — and we figured out we were really interested in the same questions and had the same philosophy about things. It clicked right away. Then I guess he invited me to Carnegie Mellon for the summer school in 1986 and then 1987. So this was the one person in the world I wanted to meet most, and it was a dream come true when we did meet.

Thank you. Thank you.

OK. I was just thinking, and I think I might have been the external examiner for Yoshua's thesis, but I can't remember. I would have been a good external examiner for it, but I can't actually remember if I was.
Yes, OK. There you go. Actually, I was the external examiner for his master's.

OK, so there weren't that many people doing this.

No, no. It was a very nice thesis — I passed it — and it was applying neural nets to trying to do speech recognition, which was clearly hopeless. But he did pretty well despite that. So I've known Yoshua for a long time. He's the only person I've published with where all of our joint papers have over 4,000 citations. Yoshua and I have both done our work in Canada, and we've done it in Canada because Canada supports basic research. So I think of Yoshua as basically a beacon for the research policy of the Canadian funding agencies, which is to find good people and support them, and not to expect them to do what they said they'd do, but to just get on and do interesting research. Yoshua is now at the stage of his career that I wish I was at now: things are happening very fast, there are lots of new ideas, and many of them are coming out of Yoshua's lab. I just can't keep up with them anymore. I think he has three or four hundred graduate students — that's what it feels like, anyway — so there are a couple of arXiv papers a week. I've been particularly impressed by the work on attention. My feeling was that I was the oldest, Yann was the second oldest, and Yoshua was the third oldest, and Yoshua had some catching up to do — and unfortunately, I think he's caught up, particularly with the work on attention for machine translation. I think that's made the kind of impact that Yann made with convolutional nets, and that Terry and I made with Boltzmann machines. So that's what I have to say, I guess.

Thank you very much. There have been a few opportunities, I think, to mention work that has been done in the past — many previous results have been really influential even today, and we didn't realize until many years later the influence of these results. I'm curious to hear from you: what's the difference between doing neural network research in 1985 or 1995 versus today? In what way has the practice of it changed?

We could actually focus on doing research without all the noise that is currently happening around AI. So it's a totally different environment. And of course, there are interesting parallels between the time when we met each other, around the mid-eighties, when neural nets were still marginal compared to traditional AI, and what happened maybe five or ten years ago when deep learning and neural nets came back. There was also a period in the early '90s when neural nets were very hot, there was a lot of hype, and many companies were trying to exploit it. So there are some parallels with now. But I guess the huge difference is that now it really works.

Well, it was working back then too, but a lot of the applications went underground. Yeah, I mean, you see this very often in the history of technology, broadly speaking. The people who were working on the Perceptron and Adaline back in the '50s and early '60s — by the late '60s, when people became convinced that this was not a viable path towards really building intelligent machines, they started just changing the names of the techniques they were using. It became adaptive filters. And those had huge practical consequences.
When you go back to the old days when people had modems, you would turn on your modem and it would make this horrible noise. That was actually a Perceptron-style algorithm running on a pseudo-random pattern to do echo cancellation. Maybe they're too young to remember that. Yeah, most of them are. But modern modems still do this, and adaptive antennas too: the fact that sometimes you get a few bars on your cell phone and sometimes the signal disappears is because there is an adaptive antenna that focuses the beam on your cell phone, and that's an adaptive algorithm very similar to Adaline and the old algorithms from the '50s. So they changed names, really. And you see this in AI as well: path-finding algorithms used to be a part of AI — they're still in the textbooks — but we don't see them as AI anymore, they're just algorithms. Same with tree exploration for chess. The same phenomenon occurred with neural nets in the '90s, when a bunch of companies were founded around the idea of using neural nets for a particular application, like credit rating or pollution control or engine control for cars, whatever. And these were used, but they went underground: they weren't really viewed as big things anymore, but they were used. Same with our check recognition system with convolutional nets: it was very widely used, but it wasn't really seen as the path to AI. So I think you see waves of that. It could be the same with the current wave of deep learning: what's certainly going to happen is that a lot of the techniques being used today are going to disseminate widely in the economy and are going to influence a lot of different things, and it's probably going to go underground after a while, unless we find the next step in progress that will keep this set of techniques in the eyes of everybody. But if we don't find it, those techniques will just be part of the toolbox.

What he said. Yeah, I also find it much harder to work on deep learning now than just five or six years ago, because five or six years ago you could come up with some pretty obvious idea, try it out, and it would work. You might have a hard time publishing the paper, but you could put it on arXiv or something. Whereas now everybody is working on it, so it's much harder to actually be innovative. And that's why some of us are shifting our interest to the next step, really.

So perhaps I have a question related to this. You all have a large number of papers, some of them with several thousands of citations. Amongst your set of papers, is there one gem in there that you feel nobody is reading, but that you really feel is a seminal contribution — and this crowd will be the first to read it? Let's start with that.

Now, the correct strategy here is to figure out your h-index, find the paper just below your h-index, and tell everybody to read that one, because then you don't have to worry about your h-index anymore. I think there is a paper that I wrote in about 2008 or 2009, or something like that.
It uses matrices to model relationships, and matrices — quite like capsules — to model concepts as well. You give it triples, and from the first two terms of a triple it has to predict the third. I had done a lot of work around the year 2000 on what I called linear relational embedding, which was basically early work on learning embeddings, and which people at the time thought was rubbish. Actually, I have a wonderful review of that work which says: Hinton has been working on this for the last seven years; it has only one non-self citation; it's time to move on. That was from AAAI. I don't remember such things. I still don't know who wrote that review.

So anyway, the idea was that instead of using vectors for concepts and matrices for relations — which has the problem that if you've got a 100-component vector, you need a 10,000-component matrix — you use matrices for both. And that has the big advantage that relations are now the same kind of thing as objects, so you can do relations of relations, which is otherwise tricky. So we taught it things like "three, plus two, makes five" — so the relation is "plus two." We taught it a bunch of facts like that, and then we taught it "two and plus makes plus two." So the output it had to produce was the matrix for "plus two," and it had never seen the combination of "two" and "plus" before, so it had to create a matrix for "plus two." And then we showed that that matrix actually worked: if you gave it seven and this new matrix for "plus two," it would give you nine.

This paper got absolutely slaughtered by the reviewers — out of ten, it got a two, a two and a three, I think. So I sent it to Cognitive Science, and they didn't like it either, except that the editor of Cognitive Science, John Anderson, said: if I understood the paper right, it's amazing, but I don't think our readers will be interested. And I still think there's a very interesting idea there, of using matrices for objects instead of vectors so you can operate on the matrices — and that's exactly what's happening in capsules. So I suggest everybody go and read that paper and see if we can get its citations up.
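For readers who want to see the idea Hinton describes above in concrete form, here is a minimal sketch. It is not the original model: the toy vocabulary, matrix sizes, and training setup are assumptions made purely for illustration, and to keep it tiny the number matrices are hand-coded (only the relation matrices are learned), whereas the original work learned both. The point it illustrates is just that concepts and relations are the same kind of object (matrices), so a relation can be applied to a concept, or to another relation, by matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6  # size of the d x d embedding matrices (arbitrary for this sketch)

# Concept matrices for the numbers 0..9, hand-coded as powers of one
# random rotation so that an additive structure exists to be found.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))          # a random rotation
number = {n: np.linalg.matrix_power(Q, n) for n in range(10)}

# Relation matrices start random and are learned from triples.
relation = {r: rng.normal(scale=0.1, size=(d, d)) for r in ("plus_1", "plus_2")}

# Training facts (number, relation, result), e.g. "3 plus_2 makes 5".
# range(7) keeps the pair (7, plus_2) held out of training.
facts = [(a, f"plus_{k}", a + k) for k in (1, 2) for a in range(7)]

# Plain gradient descent on || number[a] @ relation[r] - number[b] ||_F^2.
lr = 0.05
for _ in range(500):
    for a, r, b in facts:
        diff = number[a] @ relation[r] - number[b]
        relation[r] -= lr * (number[a].T @ diff)

def apply(mat_a, mat_r):
    """Compose two matrices and return the nearest known number."""
    target = mat_a @ mat_r
    return min(number, key=lambda n: np.linalg.norm(number[n] - target))

# The learned "plus_2" generalises to a pair it was never trained on:
print(apply(number[7], relation["plus_2"]))            # expected: 9

# Relations are the same kind of object as concepts, so a relation can be
# applied to a relation: plus_1 composed with plus_1 is close to plus_2.
print(np.allclose(relation["plus_1"] @ relation["plus_1"],
                  relation["plus_2"], atol=1e-2))      # expected: True
```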
So that's your prescription for tonight.

So I have one too. It's a paper that's very low in citations, and in fact I didn't even try to submit it to one of the good venues because I knew it would be rejected. It's called something like "cultural learning." It's about the idea that in order to learn these high-level abstractions — which is sort of my obsession — we need guidance from other humans. Humans need that, and we might need the same kind of thing for machines. Then the question is: where did these other people get that information about the world in the first place? The answer, of course, is cultural evolution: we pass on these notions about the world through education, through talking to our children, and so on. So the question is how we can use these kinds of ideas to train not just a single learner but a whole collection of learning agents, combining some of the ideas I had worked on a little earlier, called curriculum learning, where one agent teaches another, going from easy examples to harder examples and from low-level concepts to high-level concepts.

So I think these kinds of ideas — I initially proposed this to Google as a project to parallelize training more efficiently, because current distributed training is stuck with the very big problem of having to communicate lots and lots of weights, or lots and lots of activations, between all the nodes. But if the only thing you need to communicate is a very small pipe of high-level, abstract, language-like stuff, then we might be able to have a collection of agents learning together in parallel — and we could parallelize across as many as a million agents, as we do with humans.

OK, yeah, so I have a gem somewhere too — a couple of ideas and papers, really. There's one paper that we actually never got published anywhere, because it got rejected two or three times; the title of it was "Predictive Sparse Decomposition," and it was about sparse autoencoders. This was back in the days when we were interested in unsupervised layer-wise training, when we were trying to find ways to train a sparse coder or an autoencoder of some kind to find slightly higher-level representations of the things we were training it on. The problem when you train an autoencoder is that you don't want it to learn the identity function, because if it learns the identity function it doesn't do anything useful for you. You want it to reconstruct the stuff you train it on, and you want it to not reconstruct anything else — and that's the hard part. One way we found to do this was to make the middle layer sparse, for example, and Yoshua came up with half a dozen different ways to do this, denoising autoencoders and things like that; we came up with different ideas, and Geoff at the time came up with contrastive divergence, which is yet another way of doing the same thing.

The idea of predictive sparse decomposition is that you have an encoder, but the output of the encoder does not directly become the input of the decoder: they are decoupled, and there is a cost function penalizing the input of the decoder for being different from the output of the encoder. There is an energy-based version of this that makes it work, and it connects with some stuff that Yoshua later worked on called target propagation — which I also worked on many years ago — where instead of propagating gradients through a neural net, you propagate targets. So I think that's an idea that has legs, but it has not been exploited or implemented in the right way, or not really explained in the right way; we haven't found the situation where it really works — kind of like capsules for a long time: it's obviously a good idea, but we haven't really found the thing that makes it click. So that's one idea.

This paper maybe didn't get a lot of citations, but all of my students at the time read it. Yeah, it was something we were really paying a lot of attention to, within a small community. So this paper actually got something like 70 or 80 citations without ever being published — it's only an arXiv paper. I think it's more than that now. Anyway, arXiv — as you know, we are big proponents of open publication, for reasons like this among others.
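As a rough illustration of the decoupling LeCun describes — an encoder, a decoder, and a cost that ties the decoder's input (a free sparse code) to the encoder's output — here is a minimal sketch in the spirit of predictive sparse decomposition. It is not the published PSD model: the layer sizes, step sizes, and toy data are placeholders, the encoder is a plain linear map, and the inference loop is a deliberately simplistic ISTA-style iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 16, 32            # input and (overcomplete) code sizes -- arbitrary
lam, alpha, lr = 0.1, 1.0, 0.01  # sparsity weight, prediction weight, learning rate

# Decoder dictionary D and a simple linear encoder (W, b).
D = rng.normal(scale=0.1, size=(n_in, n_code))
W = rng.normal(scale=0.1, size=(n_code, n_in))
b = np.zeros(n_code)

def soft(v, t):
    """Soft-thresholding: the proximal step for the L1 sparsity penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def energy(x, z):
    """E(x, z) = reconstruction error + sparsity + encoder-prediction error."""
    return (np.sum((x - D @ z) ** 2)
            + lam * np.sum(np.abs(z))
            + alpha * np.sum((z - (W @ x + b)) ** 2))

# Toy data: low-dimensional structure embedded in n_in dimensions.
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, n_in)) / 8.0

for x in X:
    # 1) Inference: find a sparse code z that lowers the energy,
    #    starting from the encoder's own prediction.
    z = W @ x + b
    for _ in range(20):
        grad = -2 * D.T @ (x - D @ z) + 2 * alpha * (z - (W @ x + b))
        z = soft(z - 0.05 * grad, 0.05 * lam)
    # 2) Learning: the decoder learns to reconstruct x from z, and the
    #    encoder learns to predict z directly -- the decoupling described above.
    D += lr * np.outer(x - D @ z, z)
    D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-8)  # unit-norm columns
    pred = W @ x + b
    W += lr * np.outer(z - pred, x)
    b += lr * (z - pred)

# After training, the encoder alone gives a cheap feed-forward approximation
# of the sparse code, which was the point of predictive-sparse-decomposition-style models.
x0 = X[0]
print(round(energy(x0, W @ x0 + b), 4))
```

The key design point the sketch tries to capture is that the code z is a free variable of the energy rather than being defined as the encoder output, with a penalty pulling the two together.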
There's another set of ideas that I think are really interesting. I wouldn't say they don't gather interest — I think interest in them is growing — and that's the generalization of neural nets to graph data. You can think of a convolutional net as operating on an image, and you can think of an image as a function on a regular grid graph, but you can actually define convolutions on irregular graphs. And that opens the door to all kinds of different ways of thinking about neural nets. This is something that someone who I think is in the audience, or is going to give a talk tomorrow, has been working on.

And a third idea, also to do with graphs, is something Yoshua and I worked on many years ago in a very highly cited paper — in fact our most cited paper — but people cite it because of the convolutional nets; there's a second part to that paper that nobody has ever read.

Oh no, I did, of course. And my students had to read the second half.

OK. Basically there's a combination of ideas there, but the main one is that the objects manipulated by a neural net do not need to be things like vectors; they can be more complex, structured objects like graphs. And I think it's, again, an idea that could go much further if people figure out how to use it properly.

Yeah, absolutely.

So as you progress in your careers, it seems you agree on more and more things — we could call you the three amigos of deep learning. I'm curious to know: are there things on which you strongly disagree, things you just do not see eye to eye on?

Yes: Quebec separatism. Politics getting in the way. Right. And American politics — though we agree on American politics more and more right now. Yeah, OK. Well, I think we disagree — I mean, we don't disagree on major things, but we sometimes have different approaches to problems, or different ways of tackling them. You know, there was a time when Geoff was completely enamored with anything that had to do with free energies and variational inference, and at that time Yann wanted to know nothing about probability — he called us the probability police. Yes. Yes.
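Earlier in the exchange, LeCun notes that a convolutional net can be seen as operating on a regular grid graph, and that convolution can be generalized to irregular graphs. Below is a minimal sketch of one common formulation of that idea: a propagation layer that averages each node's features with its neighbours' and applies a weight matrix shared by all nodes. It is an illustration of the general concept only, not the specific graph models referred to in the panel; the graph, feature sizes, and two-layer stack are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_conv(A, X, W):
    """One graph-convolution layer: average each node's features with its
    neighbours' (adjacency matrix plus self-loops), then apply a weight
    matrix shared by all nodes -- the graph analogue of sliding the same
    filter over every pixel of an image."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)     # node degrees
    H = (A_hat / deg) @ X                      # neighbourhood averaging
    return np.maximum(H @ W, 0.0)              # shared weights + ReLU

# A small irregular graph: 5 nodes, edges given as an adjacency matrix.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

X = rng.normal(size=(5, 8))        # 8 input features per node
W1 = rng.normal(size=(8, 16))      # layer weights, shared across nodes
W2 = rng.normal(size=(16, 4))

H = graph_conv(A, graph_conv(A, X, W1), W2)   # two rounds of propagation
print(H.shape)                                # (5, 4): 4 output features per node
```

An image corresponds to the special case where A is the regular pixel lattice, which is the connection to ordinary convolutional nets that the panel alludes to.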
Info
Channel: RE•WORK
Views: 12,834
Keywords: technology, science, future, tech, ai, artificial intelligence, emerging technology, rework, re-work, yoshua bengio, yann lecun, geoffrey hinton, panel of pioneers, godfathers of deep learning, deep learning, deep learning summit, google brain, microsoft, facebook, facebook ai research, montreal, canada, neural networks, ai ecosystem, yt:cc=on
Id: 1KVU7M6MfBA
Length: 23min 21sec (1401 seconds)
Published: Thu May 16 2019