Filling the Gap in Large Language Models | Yann LeCun | Eye on AI #116

Captions
Hi, I'm Craig Smith, and this is Eye on AI. This week I talked to Yann LeCun, one of the seminal figures in deep learning development and a longtime proponent of self-supervised learning. Yann spoke about what's missing in large language models and about his new joint embedding predictive architecture, which may be a step toward filling that gap. He also talked about his theory of consciousness and the potential for AI systems to someday exhibit the features of consciousness. It's a fascinating conversation that I hope you'll enjoy.

Okay, so Yann, it's great to see you again.

Good to see you again.

I wanted to talk to you about where you've gone with self-supervised learning since we last spoke. In particular, I'm interested in how it relates to large language models, because large language models really came on stream since we spoke. And in fact, in your talk about JEPA, which is joint embedding... there you go, thank you... you mentioned that large language models lack a world model. So I wanted to talk first about where you've gone with self-supervised learning and where this latest paper stands in your trajectory. But to start, if you could just introduce yourself, and we'll go from there.

Okay. So my name is Yann LeCun (or Yann LeCun, if you want to say it the English way), and I'm a professor at New York University, at the Courant Institute and the Center for Data Science. I'm also the chief AI scientist at FAIR, which is the Fundamental AI Research lab (that's what FAIR stands for) at Meta, the old Facebook.

So tell me about where you've gone with self-supervised learning, how the joint embedding predictive architecture fits into your research, and then if you could talk about how that relates to what's lacking in large language models.

Okay. Self-supervised learning has basically brought about a revolution in natural language processing, because it is used for pre-training transformer architectures. And the fact that we use transformer architectures for that is somewhat
orthogonal to the fact that we do self-supervised learning. The way those systems are trained is that you take a piece of text, you remove some of the words, you replace them with blank markers, and then you train a very large neural net to predict the words that are missing. That's the pre-training phase. In the process of training itself to do this, the system learns good representations of text that you can then use as input to a subsequent downstream task, I don't know, translation or hate speech detection or something like that. So that's been a true revolution over the last three or four years, including in very practical applications: every top-performing content moderation system on Facebook, Google, YouTube, etc. uses this kind of technique, and there are all kinds of other applications of it too.

Now, large language models are partially this, but also the idea that you can train those things to just predict the next word in a text. If you do that, you can have those systems generate text continuously. So there are a few issues with this. First of all, those things are what are called generative models, in the sense that they predict the information that is missing, the words in this case. And the problem with generative models is that it's very difficult to represent uncertain predictions. In the case of words it's easy, because we just have the system produce essentially what amounts to a score or probability for every word in the dictionary. So if a word is missing in a sentence like "the [blank] chases the mouse in the kitchen," I can tell you it's probably a cat. It could be a dog, but it's probably a cat, right? So you have some probability distribution over all words in the dictionary, and you can handle uncertainty in the prediction this way. But then what if you want to apply this to, let's say, video? You show a video to the system, you remove some of the frames in that video, and you train it to predict the frames that are
missing, for example to predict what comes next in a video. And that doesn't work. It doesn't work because it's very difficult to train a system to predict an image, a whole image. We have techniques for generating images, but for actually predicting good images that could fit in the video, it doesn't work very well. Or if it works, it doesn't produce internal representations that are particularly good for a downstream task like object recognition or something of that type. So attempting to transfer those SSL methods that are successful in NLP into the realm of images has not been a big success. It's been somewhat of a success in audio. But really the only thing that works in the domain of images is those joint embedding architectures where, instead of predicting the image, you predict a representation of the image. So you feed, let's say, one view of a scene to the system, and you run it through some neural net that computes a representation of it. Then you take a different view of the same scene, you run it through the same network, which produces another representation, and you train the system in such a way that those two representations are as close to each other as possible. The only thing the systems can agree on is the content of the image, so they end up encoding the content of the image independently of the viewpoint. The difficulty in making this work is to make sure that when you show two different images, it will produce different representations, so that they are informative about the inputs and your system didn't collapse and just always produce the same representation for everything. But that's the reason why the generative architectures that have been successful in NLP aren't working so well for images: it's their inability to represent complicated uncertainties, if you want.

So that's for training a system with SSL to learn representations of data. But what I've been proposing to do in the position
paper I published a few months ago is the idea that we should use SSL to get machines to learn predictive world models, so basically to predict how the world is going to evolve: predict the continuation of a video, for example, and possibly predict how it's going to evolve as a consequence of an action that an intelligent agent might take. Because if we have such a world model in an agent, the agent, being capable of predicting what's going to happen as a consequence of its actions, will be able to plan complex sequences of actions to arrive at a particular goal. And that's what's missing from pretty much all the AI systems that everybody has been working on or has been talking about loudly, except for a few people who are working on robotics, where it's absolutely necessary. So some of the interesting work there comes out of the robotics community, the machine learning and robotics side, because there you need to have this capability for planning.

And the work that you've been doing, is it possible to build that into a large language model, or is it incompatible with the architecture of large language models?

It is compatible with large language models, and in fact it might solve some of the problems that we're observing with large language models. One problem with large language models is that when you use them to generate text, you initialize them with a prompt: you type an initial segment of text, which could be in the form of a question or something, and then you hope that it will generate a consistent answer to that text. The problem with that is that those systems generate text that sounds fine grammatically and semantically, but they sometimes make very stupid mistakes. Those mistakes are due to two things. The first is that, to generate that text, they don't really have any objective other than satisfying a sort of statistical consistency with the prompt that was typed. So there's no way to control the type of
answer they will produce, at least no direct way, if you want. That's the first problem. The second problem, which is much more acute, is the fact that those large language models have no idea of the underlying reality that language describes. So there is a limit to how smart they can be and how accurate they can be, because they have no experience of the real world, which is really the underlying reality of language. Their understanding of reality is extremely superficial and only contained in whatever is contained in the language they've been trained on, and that's very shallow. Most of human knowledge is completely non-linguistic. It's very difficult for us to realize that's the case, but most of what we learn has nothing to do with language. Language is built on top of a massive amount of background knowledge that we all have in common, that we call common sense. Those machines don't have that, but a cat has it, a dog has it. So we're able to reproduce some of the linguistic abilities of humans without having all the basics that a cat or a dog has about how the world works, and that's why those systems fail, essentially.

So I think what we would need is an ability for machines to learn how the world works by observation, in the manner that babies and infants and young animals accumulate all the background knowledge about the world that constitutes the basis of common sense, if you want, and then use this world model as a tool for planning sequences of actions to arrive at a goal. Setting goals is also an ability that humans and many animals have: setting subgoals for arriving at an overall goal, and then planning sequences of actions to satisfy those goals. Those language models don't have any of that. They don't have an understanding of the underlying world. They don't have a capability for planning. They don't have goals; they can't set themselves goals other than through a typed prompt, which is a very weak way.

Where
are you in your experimentation with this JEPA architecture?

Pretty early. We have forms of it, simplified forms that we call joint embedding architectures, without the P, without the predictive, and they work quite well for learning representations of images. You take an image, you distort it a little bit, and you train a neural net to produce essentially almost identical representations for those two distorted versions of the same image. Then you have some mechanism for making sure that it produces different representations for different images. That works really well. We have simple forms of JEPA, the predictive version, where the representation of one image is predicted from the representation of the other one. One version of this was actually presented at NeurIPS; it's called VICRegL, L for local, and it works very well for training a neural net to learn representations that are good for image segmentation, for example. But we're still working on a recipe, if you want, for a system that would be able to learn the properties of the world by watching videos, understanding, for example, very basic concepts like the fact that the world is three-dimensional. The system could discover that the world is three-dimensional by being shown video from a moving camera: the best way to explain how the view of the world changes as the camera moves is that every pixel has a depth that explains parallax motion, etc. Once that concept is learned, the notion of objects and occlusion, objects being in front of others, naturally emerges, because objects are parts of the image that move together with parallax motion, at least inanimate objects. Animate objects are objects that move by themselves, so there could also be a natural distinction there. This ability to spontaneously form categories, babies do this at the age of a few months. They do it without having names for anything, right? They can tell a car from a bicycle, a chair, a table, a tree, etc. And then on top of this
you can build notions of intuitive physics: the fact that objects that are not supported will fall, for example. Babies learn this at the age of nine months, roughly, so it's pretty late. And inertia, things of that type. Then, after you've acquired this basic background knowledge about how the world works, you have a pretty good ability to predict, and you can also perhaps predict the consequences of your actions when you start acting in the world. That gives you the ability to plan; perhaps it gives you some basis for common sense. So that's the progression that we need to follow. We don't know how to do any of this yet. We don't have a good recipe for training a system to predict what's going to happen in a video, for example, with any degree of usefulness.

Just for the training portion, how much data would you need? It seems to me you would need a tremendous amount of data.

We need a couple of hours of Instagram or YouTube; that would be enough. Really, the amount of raw video data that's available is incredibly large. If you think about, let's say, a five-year-old child, and let's imagine that this five-year-old child can usefully analyze a visual percept maybe 10 times a second, okay, so that's 10 frames per second. If you count how many seconds there are in five years, it's something like 80 million. So the child has seen 800 million frames, right, or something like that. Yeah, it's an approximation. It's not that much data. We can have that tomorrow by just recording, like saving YouTube videos or something. So I don't think it's an issue of data. I think it's more an issue of architecture, training paradigm, principles, mathematics, and the principles on which to base this. One thing I've said is, if you want to solve that problem, abandon five major pillars of machine learning, one of which is those generative models, and replace them with those joint embedding architectures. A lot of people in vision are already convinced of that. Then to
abandon the idea of doing probabilistic modeling: we're not going to be able to usefully represent the probability of the continuation of a video conditioned on what we have already observed. We have to be less ambitious about our mathematical framework, if you want. So I've been advocating for many years to use something called energy-based models, which is a weaker form of modeling under uncertainty, if you want. Then there is another concept that has been popular for training joint embedding architectures over the last few years, which I had the first paper on in the early '90s, actually, on something called Siamese networks. It's called contrastive learning, and I'm actually advocating against that too. So you have to get used to this idea that once in a while you have to come up with new ideas, and it's going to be very difficult to convince people who are very attached to those ideas to abandon them, but I think it's time for that to happen.

Once you've trained one of these networks and you've established a world model, how do you transfer that to the equivalent of a large language model? One of the things that's fascinating about the development of LLMs in the last couple of years is that they're now multimodal; they're not purely text and language. So how do you combine these two ideas? Or can you? Or do you need to?

Yeah, so there are two or three different questions in that one question. One of them is: can we usefully transform existing language models, whose purpose is only to produce text, in such a way that they can do planning and have objectives and things like that? The answer is yes, that's probably fairly simple to do. Can we train a language model purely on language and expect it to understand the underlying reality? The answer is no. In fact, I have a paper on this in, of all places, a philosophy magazine called Noema, which I co-wrote with a card-carrying philosopher who is a postdoc here at NYU, where we say that there is a limit to what we can do
with this, because most of human knowledge is non-linguistic, and if we only train systems on language, they will have a very superficial understanding of what they're talking about. So if you want systems that are robust and work, we need them to be grounded in reality. It's an old debate, whether AI should be grounded or not. The approach that some people have taken at the moment is to basically turn everything, including images and audio, into text or something similar to text. So you take an image, you cut it into little squares, you turn those squares into vectors, that's called tokenization, and now an image is just a sequence of tokens, the way text is a sequence of words. You do this with everything and you get those multimodal systems, and they do something. Okay, it's not clear that's the right approach long term, but they do something. I think the ingredient that is missing there is the fact that, if we're dealing with continuous-type data like video, we should use the joint embedding architecture, not the generative architectures that large language models currently use. First of all, I don't think we should tokenize them, because a lot gets lost in translation when we tokenize images and videos. There's also a problem, which is that those systems don't scale very well with the number of tokens you feed them. It works when you have text and you need a context to predict the next word that is maybe the last 4,000 words; that's fine. But 4,000 tokens for an image or video is tiny; you need way more than that, and those systems scale poorly with the number of tokens we feed them. So we're going to need a lot of new innovations in architectures there. My guess is that we can't do it with generative models; we will have to do joint embedding.
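
To make the tokenization step described above concrete, here is a minimal sketch in plain Python of cutting an image into little squares and turning each square into a vector, plus a rough count of how many tokens that produces. The function names `patchify` and `token_count`, the 224x224 image size, the 16x16 patch size, and the 100-frame clip are illustrative assumptions, not figures from the conversation; only the roughly 4,000-token context comes from the discussion. Real systems also pass each flattened square through a learned projection, which is omitted here.

```python
def patchify(image, patch):
    """Cut an H x W image (a list of rows of pixel values) into
    non-overlapping patch x patch squares and flatten each square
    into one vector. Each vector plays the role of one "token"."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Read the square row by row into a single flat vector.
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def token_count(height, width, patch, frames=1):
    """Number of tokens produced for `frames` images of the given size."""
    return (height // patch) * (width // patch) * frames


# A tiny 4x4 "image" cut into 2x2 squares gives 4 tokens of 4 values each.
tiny = [[row * 4 + col for col in range(4)] for row in range(4)]
print(patchify(tiny, 2))

# Assumed sizes: one 224x224 image with 16x16 patches, then 100 such frames.
print(token_count(224, 224, 16))
print(token_count(224, 224, 16, frames=100))
```

Under these assumed sizes a single image already takes a couple of hundred tokens, and a 100-frame clip takes several times the 4,000-token context mentioned above, which is one way to read the remark that 4,000 tokens is tiny for video.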
How does a computer recognize an image without tokenization?

So convolutional nets, for example, don't tokenize. They take an image as pixels, they extract local features, they detect local motifs on different windows of the image that overlap, and then those motifs get combined into other, slightly less local motifs. It's a kind of hierarchy where representations of larger and larger parts of the image are constructed as you go up in the layers. But there's no point where you cut the image into squares and turn them into individual vectors; it's more progressive. So there's been a bit of a back-and-forth competition between the transformer architectures, which tend to rely on this tokenization, and convolutional nets, which don't, or do it in different ways. My guess is that ultimately the best solution will be a combination of the two, where the first two layers are more like convolutional nets, exploiting the structure of images and video, certainly, and then by the time you get up several layers, the representation is more object-based, and there you have an advantage in using those transformers. But currently the image transformers basically only have one layer of convolutions at the bottom, and I think it's a bit of a waste, and it doesn't scale very well when you want to apply them to video.

On the timeline, this is all moving very fast, it's really very fast, yeah. But how long do you think before you'll be able to scale this new architecture?

It's not just scale; it's actually coming up with a good recipe that works, that would allow us to just plug a large neural net, or a small neural net, into YouTube and have it learn how the world works by watching video. We don't have that recipe. We probably don't have the architecture, other than some vague idea, which I call hierarchical JEPA. But there are a lot of details that we haven't figured out. There are probably failure modes that we haven't yet encountered that we need
to find solutions for. So I can't give you a recipe, and I can't tell you if we'll come up with a recipe in the next six months, a year, two years, five years, ten years. It could be quick, or it could be much more difficult than we think. But I think we're on the right path in searching for a solution in that direction. Once we come up with a good recipe, it will open the door to a new breed of AI systems, essentially, that can plan, that can reason, and that will be much more capable of having some level of common sense, perhaps, and will have forms of intelligence that are more similar to what we observe in animals and humans.

Your work is inspired by the cognitive processes of the brain.

Yeah.

And that process of perception and then forming a world model, is that confirmed in neuroscience?

It's a hypothesis that is based on some evidence from both neuroscience and cognitive science. What I showed is a proposal for what's called a cognitive architecture, which is some sort of modular architecture that would be capable of things like planning and reasoning, capabilities that we observe in animals and humans and that most current AI systems, except for a few robotics systems, don't have. So I think that's important in that respect. But it's more of an inspiration, really, than a direct copy. I'm interested in understanding the principles behind intelligence, but I would be perfectly happy to come up with some procedure that uses backprop at the lower level but at a higher level does something different from supervised learning or something like that, which is why I work on self-supervised learning. So I'm not necessarily convinced that the path towards satisfying the goal I was talking about, of learning world models, etc., necessarily goes through finding biologically plausible learning procedures.

What did you think of the forward-forward algorithm, and were you involved in that research?

I was not involved, although I've
thought about things that are somewhat similar for many decades, very few of which are actually published. It's in the direct line of a series of works that Geoff has been very passionate about for 40 years, new learning procedures of different types, basically local learning rules that can train fairly complex neural nets to learn good representations and things like that. He started with the Boltzmann machine, which was a really interesting concept that turned out to be somewhat impractical, but a very interesting concept that a lot of people studied. Backprop, of course, he and I both had a hand in developing. Something I also worked on, simultaneously with backprop in the 1980s, is called target prop, which is an attempt at making backprop more local by computing a virtual target for every neuron in a large neural net that can be locally optimized. Unfortunately, the way to compute this target is [unclear], and I haven't worked on this particular type of procedure for a long time, but Yoshua Bengio has published a few papers on this over the last 10 years or so. Yoshua, Geoff and I, when we started the deep learning conspiracy in the early 2000s to renew the interest of the community in deep learning, focused largely on forms of local self-supervised learning methods. In Geoff's case, that was focused on restricted Boltzmann machines. Yoshua settled on something called denoising autoencoders, which is the basis for a lot of the large language model type training that we're using today. I was focusing more on what's called sparse autoencoders. So these are different ways of training a layer, if you want, in a neural net to learn something useful without it being focused on any particular task, so you don't need labeled data. A lot of that work has been put aside a little bit by the incredible success of just pure supervised learning with very deep models. We found ways to train very large neural nets with very many layers
with just backprop, and so we put those techniques on the side. Geoff basically is coming back to them, and I'm coming back to them in a different form, a little bit, with this JEPA architecture. He also had ideas in the past, something called recirculation, and a lot of infomax methods, which actually JEPA uses; the idea is sort of similar. He's a very productive source of ideas that sometimes seem to come out of left field, where the community pays attention and then doesn't quite figure it out right away, and then it takes a few years for those things to disseminate, and sometimes they don't.

Just a minute. Yeah, hello? I'm recording right now. I'll answer when I get back. Okay, great, thanks very much. Bye-bye. Sorry about that.

There was a very interesting talk by David Chalmers. At some level it was not a very serious talk, because everyone knows, as you described earlier, that large language models are not reasoning, they don't have common sense.

He doesn't claim that they do.

No, that's right. But what you're describing with this JEPA architecture, if you could develop a large language model that is based on a world model...

Probably at first it would not be based on language, really; it would be based on visual perception, maybe audio perception. If you have a machine that can do what a cat does, you don't need language. Language can be put on top of this. To some extent language is easy, which is why we have those large language models and we don't have systems that have learned how the world works.

Yeah, but let's say that you build this world model and you put language on top of it so that you can interrogate it, communicate with it. Does that take you a step toward what Chalmers was talking about? I don't want to get into the theory of consciousness, but at least an AI model that would exhibit a lot of the features of consciousness.

David actually has two different definitions, for sentience and consciousness. You
can have sentience without consciousness. So simple animals are sentient in the sense that they have experience and emotions and drives and things like that, but may not have the type of consciousness that we think we have, okay, or at least the illusion of consciousness that we think we have. So sentience, I think, can be achieved by the type of architecture I propose, if we can make it work, which is a big if. The reason I think that is that what those systems will be able to do is have objectives that they need to satisfy, think of them as drives, and having the system compute those drives, which would basically be predictions of the outcome of a situation or of a sequence of actions that the agent might take, basically those would be indistinguishable from emotions. So if you are in a situation where you can take a sequence of actions to arrive at a result, and the outcome that you're predicting is terrible, it results in your destruction, okay, that creates fear. You're trying to figure out: is there another sequence of actions I can take that would not result in the same outcome? If you make those predictions but there's a huge uncertainty in the prediction, one possible outcome of which is that you get destroyed, it creates even more fear. And on the contrary, if the outcome is going to be good, then it's more like elation, right? So those are long-term predictions of outcomes, which systems that use the architecture I'm proposing will have, I think. So they will have some level of experience, and they will have emotions that will drive their behavior, because they will be able to anticipate outcomes and perhaps act on them.

Now, consciousness is a different story. My folk theory of consciousness, which I've talked to David about, thinking he was going to tell me I'm crazy, but he said no, actually that overlaps with some pretty common theories of consciousness among philosophers, is the idea that we have essentially a single world model in our head, somewhere in our
prefrontal cortex, and that world model is configurable to the situation we're facing at the moment. So we're configuring our brain, including our world model, for solving the problem, you know, satisfying the objective that we've currently set for ourselves. And because we only have a single world model engine, we can only solve one such task at any one time. This is a characteristic of humans and many animals: when we focus on a task, we can't do anything else. We can do subconscious tasks simultaneously, but we can only do one conscious, deliberate task at any one time, and it's because we have a single world model engine.

Now, why would evolution build us in such a way that we have a single world model engine? There are two reasons for this. One reason is that a single world model engine can be configured for the situation at hand, but only the part that changes from one situation to another, and so it can share knowledge between different situations. The physics of the world doesn't change whether you are building a table or trying to jump over a river or something, so your basic knowledge about how the world works doesn't need to be reconfigured; it's only the part that depends on the situation at hand. So that's one reason. The second reason is that if we had multiple models of the world, they would have to be individually less powerful, because you have to fit them all within your brain, and that's a limited size. So I think that's probably the reason why we only have one. And if you have only one world model that needs to be configured for the situation at hand, you need some sort of meta-module that configures it, that figures out: what situation am I in, what subgoal should I set myself, and how should I configure the rest of my brain to solve that problem? That module would have to be able to observe the state and capabilities, would have to have a model of the rest of itself, of the agent, and that perhaps is something that gives us the illusion of
consciousness. So I must say this is very speculative, okay? I'm not saying this is exactly what happens, but it fits with a few things that we know about consciousness.

You were saying that this architecture is inspired by cognitive science or neuroscience. How much do you think your work, Geoff's work, other people's work at the leading edge of deep learning or machine learning research is informing neuroscience? Or is it more the other way around? Certainly in the beginning it was the other way around, but at this point it seems that there's a lot of information that is reflecting back to those fields.

It's always been a bit of a feedback loop. New concepts in machine learning have driven people in neuroscience and cognitive science to use computational models, if you want, of whatever they're studying. Many of my colleagues, my favorite colleagues, work on this; the whole field of computational neuroscience basically is built around this. What we're seeing today is a big influence, or rather a wide use, of deep learning models such as convolutional nets and transformers as explanatory models of what goes on in the visual cortex, for example. There are people who, for a number of years now, have done fMRI experiments where they show the same image to a subject in the fMRI machine and to a convolutional net, and then try to explain the variance they observe in the activity of various areas of the brain with the activity that is observed in the corresponding neural net. What comes out of these studies is that the notion of multi-layer hierarchy that we have in convolutional nets matches the type of hierarchy that we observe in at least the ventral pathway of the visual system. So V1 corresponds to the first two layers of the convolutional net, V2 to some of the following layers, V4 more, and then the inferotemporal cortex to the top layers. They are the best explanation of each other, if you try to do the matching, right? One of my colleagues at FAIR
Paris, who has a dual affiliation also with [unclear] in Paris, has done the same type of experiment using transformer architectures, language models essentially, observing the brain activity of people who are listening to stories and attempting to understand the story so that they can answer questions about it or give a summary of it. There the matching is not that great, in the sense that there is some sort of correspondence between the type of activity you observe in those large transformers and the type of activity you observe in the brain, but the hierarchy is not nearly as clear. What is clear is that the brain is capable of making much longer-term predictions than those language models are capable of today. So that begs the question of what we are missing in terms of architecture. To some extent it jibes with the idea that the models we should have should build hierarchical representations of the percept at different levels of abstraction, so that the highest levels of abstraction are able to make long-term predictions that perhaps are less accurate than the lower levels, but longer-term. We don't seem to have that in current models.

I had a question I wanted to ask you since our last conversation. You have a lot of things going on: you teach, you have your role at Facebook, your role I think at CVPR. How do you work on this? Do you have, like, three days a week or two hours a day where you're just focused? And are you tinkering with code or with diagrams, or is it in iterations with some of your graduate students? Or is this something where it's kind of always in your mind, and you're in the shower and you think, yeah, that might work? I'm just curious how [unclear].

Oh, okay. So first of all, one thing to understand is that my position at Meta, at FAIR, is not a position of management. I don't manage anything, okay? I'm chief scientist, which means I try to inspire others to work on things that I think are promising, and I advise several
projects that I'm not personally involved in. I work on strategy and orientations and things like this, but I don't do day-to-day management. I'm very thankful that Joelle Pineau is doing this for FAIR and doing a very good job. I'm not very good at it either, so it's probably better if I don't do it. So that allows me to spend quite a bit of time on research itself. I don't have a group of engineers and scientists working with me; I have a group of more junior people working with me, students and postdocs, both at FAIR and at NYU, both in New York and in Paris. And working with students and postdocs is wonderful because they're fearless, they're very creative, and many of them have amazing talents in theoretical abilities or implementation abilities or a knack for making things work. And so what happens very often is either one of them will come up with an idea whose results surprise me, and I say I was thinking about this all wrong, and that's the best thing that can happen. Or sometimes I come up with an idea and it turns out to work, which is great, usually not in the form that I formulated it; normally there are a lot of contributions that have to be brought to an idea to make it work. And then what's happened also quite a bit in the last few years is I come up with an idea that I'm sure is going to work, and a few students and postdocs try to make it work, and they come back to me and say, oh sorry, it doesn't work, and here is the failure mode. Oh yeah, we should have thought about this. Okay, so here's a new idea to get around this problem. So for example, several years ago I was advocating for the use of generative models with latent variables to handle uncertainty, and I completely changed my mind about this. Now I'm advocating for those joint embedding architectures that do not actually predict. And I more or less invented those contrastive methods that a lot of people are talking about and using at this point, and I've been arguing against
them now, in favor of those methods such as VICReg or Barlow Twins that basically, instead of using contrastive methods, try to maximize the information content of representations. And the idea of information maximization I'd known about for decades, because Geoff was working on this in the 1980s when I was a postdoc with him, and he abandoned the idea pretty much. He had a couple of papers with one of his students, Sue Becker, in the early 90s that showed that it could work, but only in sort of small dimension, and he pretty much abandoned it. And the reason he abandoned it is because of a major flaw with those methods, due to the fact that we don't have any good measures of information content, or the measures that we have are upper bounds, not lower bounds, so we can't try to maximize information content very well. And so I never thought that those methods could ever work, because of my experience with that. And one of my postdocs, Stéphane Deny, actually kind of revived the idea and showed that it worked; that was the Barlow Twins paper. And so that changed my mind, and so now we had a new tool, information maximization, applied to joint embedding architectures, and came up with an improvement of it called VICReg, and now we're working on that. But there are other ideas we're working on to solve the same problem with other groups of people at the moment, which probably will come out in the next few months. So again, we don't have a perfect recipe yet, and we're looking for one, and hopefully one of the things that we are working on will stick.

Yeah. Are you coding models and then training them and running them, or are you conceptualizing and turning it over to someone else?

So it's mostly conceptualizing and mostly letting the students and postdocs do the implementation, although I do a little bit of coding myself, but not enough to my taste. I wish I could do more. I have a lot of postdocs and students, and so I have to devote a sufficient
amount of my time to interact with them, sure, and then give them some breathing room to do the work they do best. And so it's an interesting question, because that question was asked to Geoff, yeah, after his talk, right? And he said he was using MATLAB, and he said you have to do those things yourself, because if you give a project to a student and the project comes back saying it doesn't work, you don't know if it's because there is a conceptual problem with the idea or whether it's just some stupid detail that wasn't done right. And so when I'm faced with this, that's when I start looking at the code and perhaps experimenting with it myself. Or I get multiple students to collaborate on the project so that if one makes an error, perhaps the other one will detect what it is. I love coding; I just don't do as much as I'd like.

This JEPA, or the forward-forward: things have moved so quickly. You think back to when Transformers were introduced, or at least the attention mechanism, and that kind of shifted the field. It's difficult for an outsider to judge. When I hear the JEPA talk, is this one of those moments that, wow, this idea is going to transform the field? Or have you been through many of these moments, and they contribute to some extent but they're not the answer to shift the paradigm?

It's hard to tell at first, but whenever I keep pursuing an idea and promote it, it's because I have a good hunch that it's going to have a relatively big impact. And it was easy for me to do before I was as famous as I am now, because I wasn't listened to that much, so I could make some claim. Now I have to be careful what I claim because a lot of people listen to me. And it's the same issue with Geoff. So Geoff, for example, a few years ago was promoting this idea of capsules, and everybody was thinking this is going to be a big thing, and people started working on it. It turns out it's very hard to make it work,
and it didn't have the impact that many people thought it would have, including Geoff. It turned out to be limited by implementation issues and stuff like that. The underlying idea behind it is good, but very often the practical side of it kills it. It was the case also with Boltzmann machines: conceptually super interesting, they just don't work that well. They don't scale very well, they're very slow to train, but conceptually it's a very interesting idea that everybody should know about. So there's a lot of those ideas that are conceptual, some mental objects that allow us to think differently about what we do, but they may not actually have that much practical impact. For forward-forward, we don't know yet. It could be like the wake-sleep algorithm that Geoff talked about 20 years ago or something, or it could be the new backprop. We don't know. Or the new target prop, which is interesting but not really mainstream, because it has some advantages in some situations, but it's not like it brings you an improved performance on some standard benchmark that people are interested in, so it doesn't have the right appeal perhaps. So it's hard to figure out. But what I can tell you is that if we figure out how to train one of those JEPA-style architectures from video, and the representations that it learns are good, and the predictive model that it learns is good, this is going to open the door to a new breed of AI systems. Yeah, no doubt about that. It's exciting, the speed at which things have been moving, in particular in the last three years. About Transformers and the history of Transformers, one thing I want to say is that we see the most visible progress, but we don't realize how much of a history there was behind it, and even the people who actually came up with some of those ideas don't realize that their ideas actually had roots in other things. For example, back in the 90s people were already working on things that we
could now call mixture of experts, and also multiplicative interactions, which at the time were called sigma-pi networks or things like that. So this is the idea that instead of having two variables that you add together with weights, you multiply them, and then you have a weight, or you have weights before you multiply, it doesn't matter. This idea goes back a very long time, to the 1980s. And then you had ideas of linearly combining multiple inputs with weights that are between 0 and 1, sum to one, and are data dependent. Now we call this attention, but this is a circuit that was used in mixture-of-experts models back in the early 90s also, right? So the idea is old. Then there were ideas of neural networks that have separate modules for computation and memory, two separate modules, right? So one module that is a classical neural net, and the output of that module would be an address into an associative memory that itself would be a different type of neural net. And those different types of neural net associative memories use what we now call attention: they compute the similarity, or the dot product, between a query vector and a bunch of key vectors, then they normalize them so they sum to one, and then the output of the memory is a weighted sum of the value vectors. There's a series of papers by my colleagues in the early days of FAIR, actually in 2014-15: one called memory network, one called end-to-end memory network, one called stack-augmented memory network, another one called key-value memory network, and then a whole bunch of things. Those use associative memories that basically are the basic modules used inside of Transformers. And then attention mechanisms like this were popularized around 2015 by a paper from Yoshua Bengio's group at Mila, which demonstrated that they are extremely powerful for doing things like language translation in NLP, and that really started the craze on attention. And so you combine all those ideas and
you get a Transformer that uses something called self-attention, where the input tokens are used both as queries and keys in an associative memory, very much like a memory network. And then you view this as a layer, if you want: you put several of those in a layer, and then you stack those layers, and that's what the Transformer is. The filiation is not obvious, but there is one. Those ideas have been around and people have been talking about them, with similar work also around 2015-16 from DeepMind called the Neural Turing Machine or Differentiable Neural Computer, those ideas that you have a separate module for computation and another one for memory. There's a paper by Sepp Hochreiter and his group also on neural nets that have a separate associative-memory-type system. They are the same type of thing. I think this idea is very powerful. A big advantage of Transformers is that, the same way convolutional nets are equivariant to shifts, so if you shift the input of a convolutional net the output also shifts but otherwise stays unchanged, in a Transformer, if you permute the input tokens, the output tokens get permuted the same way but are otherwise unchanged. So convnets are equivariant to shifts, Transformers are equivariant to permutation, and a combination of the two is great, which is why I think the combination of convnets at the low level and Transformers at the top, for natural input data like images and video, is a winning combination.

Because there's a combinatorial effect as the field progresses, all of these ideas create a cascade of new ideas. Is that why the field is speeding up?

It's not the only reason. There's a number of reasons. One of the reasons is that you build on each other's ideas, et cetera, which of course is the hallmark of science in general, also art. But there is a number of characteristics, I think, that help that to a large extent. One in particular is the fact that most research work in this area now comes with code that other people can use and
build upon, right? So the habit of distributing your code in open source, I think, is an enormous contributor to the acceleration of progress. The other one is the availability of the most sophisticated tools, like PyTorch for example, or TensorFlow or JAX or things like that, where researchers can build on top of each other's code bases, basically, to come up with really complex concepts. And all of this is permitted by the fact that some of the main contributors to those ideas that are from industry don't seem to be too obsessive-compulsive about IP protection. So Meta in particular is very open: we may occasionally file patents, but we're not going to sue you for infringing them unless you sue us. Google has a similar policy. You don't see this much from companies that tend to be a little more secretive about their research, like Apple and Amazon.

Although I just talked to Samy; he's trying to implement that openness.

More power to him. Good luck. It's a cultural change for a company like Apple, so this is not a battle I want to fight, but if he can win it, good for him. It's a very difficult battle. Also, I think another contributor is that there are real practical commercial applications of all of this. They're not just imagined, they are real, and so that creates a market, and that increases the size of the community, and so that creates more appeal for new ideas, right? More outlets, if you want, for new ideas.

Do you think that this hockey-stick curve is going to continue for a while, or do you think we'll hit a plateau?

It's difficult to say. Nothing looks more like an exponential than the beginning of a sigmoid, so every natural process has to saturate at some point. The question is when, and I don't see any obvious wall that is being hit by AI research at the moment. It's quite the opposite: there seems to be an acceleration, in fact, of progress. And there's no question that we need new concepts and new ideas; in
fact, that's the purpose of my research at the moment, because I think there are limitations to current approaches. So this is not to say that we just need to scale up deep learning and turn the crank and we'll get to human-level intelligence. I don't believe that. I don't believe that it's just a matter of making reinforcement learning more efficient; I don't think that's possible with the current way reinforcement learning is formulated. And we're not going to get there with supervised learning either. I think we definitely need new innovative concepts. But I don't see any slowdown yet. I don't see any people turning away from AI thinking it's obviously not going to work, blah blah blah, despite the screams of various critics, right? But to some extent at the moment they are fighting a rearguard battle, because they plant a flag, they say you're never going to be able to do this, and then it turns out you can do this, so they plant the flag a little further down and then, ha ha, now you're not going to be able to do this, right?

So exciting, yeah. Okay, my last question: are you still doing music?

I am.

And are you still building instruments, electronic instruments?

Yes, I'm in the process of designing a new one.

Wow. Okay, maybe, I think I said this last time, maybe I could get some recordings and put them into the podcast or something.

All right. I probably told you I'm not such a great performer, and I'm probably better at conceptualizing and building those instruments than playing them, but yeah, as far as possible.

[Music]

That's it for this episode. I want to thank Yann for his time. If you want to read a transcript of today's conversation, you can find one on our website, eye-on.ai, that's e-y-e-hyphen-o-n dot a-i. Feel free to drop us a line with comments or suggestions at craig@eye-on.ai, that's c-r-a-i-g at e-y-e-hyphen-o-n dot a-i. And remember, the singularity may not be near, but AI is about to change your world, so pay attention. Thank you.
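Editor's sketch: the associative-memory read described in the conversation (compare a query with a set of keys, normalize the similarities so they sum to one, return the weighted sum of the values) and the claim that permuting a Transformer's input tokens permutes its outputs the same way can both be checked in a few lines. This is only a minimal illustration, not the implementation from any of the papers mentioned. It assumes plain dot-product similarity with a softmax, no learned projections, no scaling and no positional encodings, and the names softmax, attend and self_attention are chosen here for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Associative-memory read: weights come from query-key similarity,
    and the output is the weighted sum of the value vectors."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def self_attention(tokens):
    """Every token acts as query, key and value (assumption: no learned
    projections and no positional encodings)."""
    return [attend(t, tokens, tokens) for t in tokens]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order = [2, 0, 1]  # an arbitrary permutation of the token positions

out = self_attention(tokens)
out_permuted = self_attention([tokens[i] for i in order])

# Permuting the inputs should permute the outputs identically, up to
# floating-point rounding, and change nothing else.
equivariant = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(out_permuted, [out[i] for i in order])
    for a, b in zip(row_a, row_b)
)
print(equivariant)
```

Practical Transformers add learned query, key and value projections and usually positional information, so the permutation property above holds for the bare attention layer rather than for a full model with position encodings.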
Info
Channel: Eye on AI
Views: 20,549
Id: mBjPyte2ZZo
Length: 58min 45sec (3525 seconds)
Published: Tue Feb 14 2023