Ilya Sutskever: The Mastermind Behind GPT-4 and the Future of AI

Video Statistics and Information

Reddit Comments

He pretty much said large language models are truly learning and understanding, not simply predicting what word to write next. He said the abilities of future models will greatly improve. To me it sounds like he thinks LLMs can lead to AGI.

👍︎︎ 3 👤︎︎ u/reidabooyo 📅︎︎ Mar 16 2023 🗫︎ replies

Captions

I'm Craig Smith and this is Eye on AI. [Music]

This week I talked to Ilya Sutskever, a co-founder and chief scientist of OpenAI and one of the primary minds behind the large language model GPT-3 and its public progeny ChatGPT, which I don't think it's an exaggeration to say is changing the world. This isn't the first time Ilya has changed the world. Geoff Hinton has said he was the main impetus for AlexNet, the convolutional neural network whose dramatic performance stunned the scientific community in 2012 and set off the deep learning revolution. [Music]

As is often the case in these conversations, they assume a lot of knowledge on the part of listeners, primarily because I don't want to waste the limited time I have to speak to people like Ilya explaining concepts, people or events that can easily be Googled, or Binged I should say, or that ChatGPT can explain for you. The conversation with Ilya follows a conversation with Yann LeCun in a previous episode, so if you haven't listened to that episode, I encourage you to do so. Meanwhile, I hope you enjoy the conversation with Ilya as much as I do.

It's terrific to meet you, to talk to you. I've watched many of your talks online and read many of your papers. Can you start just by introducing yourself, a little bit of your background? I know you were born in Russia. Where you were educated, what got you interested in computer science, if that was the initial impulse, or brain science, neuroscience, or whatever it was, and then I'll start asking questions.

Yeah, I can talk about that a little bit. Indeed, I was born in Russia, I grew up in Israel, and then as a teenager my family immigrated to Canada. My parents say I was interested in AI from a pretty early age. I was also very motivated by consciousness. I was very disturbed by it, and I was curious about things that could help me understand it better, and AI seemed like a good angle there. So I think these were some of the ways that got me started. And I actually
started working with Geoff Hinton very early, when I was 17. We moved to Canada and I immediately was able to join the University of Toronto, and I really wanted to do machine learning because that seemed like the most important aspect of artificial intelligence, which at the time was completely inaccessible. To give some context, the year was 2003. Today we take it for granted that computers can learn, but in 2003 we took it for granted that computers can't learn. The biggest achievement of AI back then was Deep Blue, the chess-playing engine. But there it was like: you have this game, and you have this tree search, and you have this simple way of determining if one position is better than another, and it really did not feel like that could possibly be applicable to the real world because there is no learning. Learning was this big mystery, and so I was really, really interested in learning, and to my great luck Geoff Hinton was a professor at the university I was in, and so I was able to find him and we began working together almost right away.

And was your impulse, as it was for Geoff, to understand how the brain worked, or was it more that you were simply interested in the idea of machines learning?

AI is so big, and so the motivations were just as many. It is interesting: how does intelligence work at all? Right now we have quite a bit of an idea that it's a big neural net, and we know how it works to some degree, but back then, although neural nets were around, no one knew that neural nets are good for anything. So how does intelligence work at all? How can we make computers be even slightly intelligent? I had a very explicit intention to make a very small but real contribution to AI, because there were lots of contributions to AI which weren't real, and I could tell for various reasons that they weren't real, that nothing would come out of them. I just thought: nothing works at all, AI is a hopeless field. So the motivation was, could I understand how
intelligence works and also make a contribution towards it? So that was my initial early motivation.

So that's 2003, almost exactly 20 years ago, and then AlexNet. I've spoken to Geoff and he said that it was really your excitement about the breakthroughs in convolutional neural networks that led you to apply for the ImageNet competition, and that Alex had the coding skills to train the network. Can you talk just a little bit about that? I don't want to get bogged down in history, but it's fascinating.

So in a nutshell, I had the realization that if you train a large neural network on a large, sorry, large and deep, because back then the deep part was still new, if you train a large and deep neural network on a big enough data set that specifies some complicated task that people do, such as vision, but also others, and you just train that neural network, then you will succeed necessarily. The logic for it was very irreducible: we know that the human brain can solve these tasks and can solve them quickly, and the human brain is just a neural network with slow neurons, so we know that some neural network can do it really well. So then you just need to take a smaller but related neural network and just train it on data, and the best neural network inside the computer will be related to the neural network that we have that performs this task. So it was an argument that the large and deep neural network can solve the task, and furthermore we have the tools to train it. That was the result of the technical work that was done in Geoff's lab. So you combine the two: we can train those neural networks, it needs to be big enough so that if you trained it, it would work well, and you need the data which can specify the solution. With ImageNet all the ingredients were there. Alex had these very fast convolutional kernels, ImageNet had the large enough data, and there was a real opportunity to do something totally unprecedented, and it totally worked out.

Yeah, that was supervised
learning and convolutional neural nets. In 2017 the "Attention Is All You Need" paper came out, introducing self-attention and Transformers. At what point did the GPT project start? Was there some intuition about Transformers and self-supervised learning? Can you talk about that?

So for context, at OpenAI from the earliest days we were exploring the idea that predicting the next thing is all you need. We were exploring it with the much more limited neural networks of the time, but the hope was that if you have a neural network that can predict the next word, the next pixel, really it's about compression. Prediction is compression. And predicting the next word is not... let me think about the best way to explain it, because there were many things going on, they were all related. Maybe I'll take a different direction. We were indeed interested in trying to understand how far predicting the next word is going to go and whether it will solve unsupervised learning. Back before the GPTs, unsupervised learning was considered to be the Holy Grail of machine learning. Now it's just been fully solved and no one even talks about it, but it was a Holy Grail. It was very mysterious, and so we were exploring the idea. I was really excited about it: that predicting the next word well enough is going to give you unsupervised learning. If it will learn everything about the data set, that's going to be great. But our neural networks were not up for the task. We were using recurrent neural networks. When the Transformer came out, literally as soon as the paper came out, literally the next day, it was clear to me, to us, that Transformers addressed the limitations of recurrent neural networks, of learning long-term dependencies. It's a technical thing, but we switched to Transformers right away, and so the very nascent GPT effort continued then with the Transformer. It started to work better, and you make it bigger, and then you realize you need to keep making
it bigger, and we did, and that's what led eventually to GPT-3 and essentially where we are today.

Yeah, and I just wanted to ask, actually I'm getting caught up in this history, but I'm so interested in it. I want to get to the problems or the shortcomings of large language models, or large models generally, but Rich Sutton had been writing about scaling and how that's all we need to do: we don't need new algorithms, we just need to scale. Did he have an influence on you, or was that a parallel track of thinking?

No, I would say that when he posted his article we were very pleased to see some external people thinking along similar lines, and we thought it was very eloquently articulated. But I actually think that the Bitter Lesson as articulated overstates its case, or at least I think the takeaway that people have taken from it overstates its case. The takeaway that people have is that it doesn't matter what you do, just scale, but that's not exactly true. You've got to scale something specific. You've got to have something that will be able to benefit from the scale. The great breakthrough of deep learning is that it provides us with the first ever way of productively using scale and getting something out of it in return. Before that, what would people use large computer clusters for? I guess they would do it for weather simulations or physics simulations or something, but that's about it. Maybe movie making. But no one had any real need for compute clusters, because what do you do with them? The fact that deep neural networks, when you make them larger and you train them on more data, work better provided us with the first thing that is interesting to scale. But perhaps one day we will discover that there is some little twist on the thing that we scale that's going to be even better to scale. Now, how big of a twist? And then of course, with the benefit of hindsight, we will say: does it even count? It's such a simple change. But I think the true statement is that it matters what you scale. Right now we just
found a thing to scale that gives us something in return.

The limitation of large language models as they exist is that their knowledge is contained in the language that they're trained on, and most human knowledge, I think everyone agrees, is non-linguistic. I'm not sure Noam Chomsky agrees. But there's a problem in the large language models, as I understand it: their objective is to satisfy the statistical consistency of the prompt. They don't have an underlying understanding of reality that language relates to. I asked GPT about myself. It recognized that I'm a journalist and that I've worked at these various newspapers, but it went on and on about awards that I've never won, and it all read beautifully, but none of it connected to the underlying reality. Is there something that is being done to address that in your research going forward?

Yeah, so before I comment on the immediate question that you asked, I want to comment on some of the earlier parts of the question.

Sure.

I think that it is very hard to talk about the limits, or limitations rather, of even something like a language model, because two years ago people confidently spoke about their limitations and they were entirely different. So it's important to keep this context in mind. How confident are we that these limitations that we see today will still be with us two years from now? I am not that confident. There is another comment I want to make about one part of the question, which is that these models just learn statistical regularities and therefore they don't really know what the nature of the world is. I have a view that differs from this. In other words, I think that learning the statistical regularities is a far bigger deal than meets the eye. The reason we don't initially think so is because we haven't, at least most people, those who haven't really spent a lot of time with neural networks, which are on some level statistical. Like, what's a statistical model? You just fit some parameters. Like, what is
really happening? But I think there is a better interpretation. It's the earlier point of prediction is compression. Prediction is also a statistical phenomenon, yet to predict you eventually need to understand the true underlying process that produced the data. To predict the data well, to compress it well, you need to understand more and more about the world that produced the data. As our generative models become extraordinarily good, they will have, I claim, a shocking degree of understanding of the world and many of its subtleties. But it's not just the world, it is the world as seen through the lens of text. It tries to learn more and more about the world through a projection of the world on the space of text as expressed by human beings on the internet. But still, this text already expresses the world. And I'll give you an example, a recent example, which I think is really telling and fascinating. We've all heard of Sydney, Bing's alter ego, and I've seen this really interesting interaction with Sydney where Sydney became combative and aggressive when the user told it that it thinks that Google is a better search engine than Bing. Now, what is a good way to think about this phenomenon? What's a good language? What does it mean? You can say, well, it's just predicting what people would do, and people would do this, which is true, but maybe we are now reaching a point where the language of psychology is starting to be appropriate to understand the behavior of these neural networks.

Now let's talk about the limitations. It is indeed the case that these neural networks do have a tendency to hallucinate, but that's because a language model is great for learning about the world but a little bit less great for producing good outputs. There are various technical reasons for that, which I could elaborate on if you think it's useful, but right now I will skip that. There are technical reasons why a language
model is much better at learning about the world, learning incredible representations of ideas, of concepts, of people, of processes that exist, but its outputs aren't quite as good as one would hope, or rather as good as they could be. Which is why, for example, a system like ChatGPT is a language model that has an additional reinforcement learning training process. We call it reinforcement learning from human feedback. The thing to understand about that process is this: we can say that in the pre-training process, when you just train a language model, you want to learn everything about the world. Then with the reinforcement learning from human feedback, now we care about the outputs. Now we say: anytime the output is inappropriate, don't do this again. Every time the output does not make sense, don't do this again. And it learns quickly to produce good outputs. But now it's the level of the outputs, which is not the case during pre-training, during the language model training process.

Now, on the point of hallucinations, and it having a propensity for making stuff up: indeed it is true. Right now these neural networks, even ChatGPT, make things up from time to time, and that's something that also greatly limits their usefulness. But I'm quite hopeful that by simply improving this subsequent reinforcement learning from human feedback step, we could teach it to not hallucinate. Now you could say, is it really going to learn? My answer is: let's find out.

And that feedback loop is coming from the public ChatGPT interface? If it tells me that I won a Pulitzer, which unfortunately I didn't, I can tell it that it's wrong, and will that train it, or create some punishment or reward, so that the next time I ask it'll be more accurate?

The way we do things today is that we hire people to teach our neural net to behave, to teach ChatGPT to behave, and right now the precise manner in which they specify the desired behavior is a little bit different. But indeed, what you described is the
way in which teaching is going to basically be. That's the correct way to teach: you just interact with it, and it sees from your reaction, it infers, oh, that's not what you wanted, you are not happy with its output, therefore the output was not good and it should do something differently next time. So in particular, hallucinations come up as one of the bigger issues, and we'll see, but I think there is quite a high chance that this approach will be able to address them completely.

I wanted to talk to you about Yann LeCun's work on joint embedding predictive architectures and his idea that what's missing from large language models is this underlying world model that is non-linguistic, that the language model can refer to. And it's not something that's built, but I wanted to hear what you thought of that and whether you've explored that at all.

So I reviewed Yann LeCun's proposal, and there are a number of ideas there, and they're expressed in different language, and there are some maybe small differences from the current paradigm, but to my mind they are not very significant, and I'd like to elaborate. The first claim is that it is desirable for a system to have multimodal understanding, where it doesn't just know about the world from text. My comment on that will be that indeed multimodal understanding is desirable, because you learn more about the world, you learn more about people, you learn more about their condition, and so the system will be able to understand the task that it's supposed to solve, and the people and what they want, better. We have done quite a bit of work on that, most notably in the form of two major neural nets that we've done. One is called CLIP and one is called DALL-E, and both of them move towards this multimodal direction. But I also want to say that I don't see the situation as a binary either-or, that if you don't have vision, if you don't understand the world visually or from video, then things will not work. And I'd like to make the case for that.
So I think that some things are much easier to learn from images and diagrams and so on, but I claim that you can still learn them from text only, just more slowly. And I'll give you an example. Consider the notion of color. Surely one cannot learn the notion of color from text only. And yet when you look at the embeddings... I need to make a small detour to explain the concept of an embedding. Every neural network represents words, sentences, concepts through representations, embeddings, high-dimensional vectors. One thing that we can do is look at those high-dimensional vectors and look at what's similar to what, how the network sees this concept or that concept. And so we can look at the embeddings of colors, and the embeddings of colors happen to be exactly right. It knows that purple is more similar to blue than to red, and it knows that purple is less similar to red than orange is. It knows all those things just from text. How can that be? If you have vision, the distinctions between colors just jump at you, you immediately perceive them, whereas with text it takes you longer. Maybe you know how to talk and you already understand syntax and words and grammar, and only much later you actually start to understand these colors. So this will be my point about the necessity of multimodality, which I claim is not necessary but is most definitely useful. I think it's a good direction to pursue. I just don't see it in such stark either-or claims.

The proposal in the paper makes a claim that one of the big challenges is predicting high-dimensional vectors which have uncertainty about them, so for example predicting an image. The paper makes a very strong claim there that it's a major challenge and that we need to use a particular approach to address it. But one thing which I found surprising, or at least unacknowledged in the paper, is that the current autoregressive Transformers already have that property. I'll give you two
examples. One is: given one page in a book, predict the next page in the book. There could be so many possible pages that follow. It's a very complicated high-dimensional space, and we deal with it just fine. The same applies to images. These autoregressive Transformers work perfectly on images. For example, at OpenAI we've done work on iGPT. We just took a Transformer and applied it to pixels, and it worked super well, and it could generate images in very complicated and subtle ways. It had very beautiful unsupervised representation learning. With DALL-E 1, same thing again: you just generate, think of it as large pixels. Rather than generating a million pixels, we cluster the pixels into large pixels and we generate a thousand large pixels. I believe Google's work on image generation from earlier this year, called Parti, also takes a similar approach. So the part where I thought that the paper made a strong comment, that the current approaches can't deal with predicting high-dimensional distributions, I think they definitely can. So maybe this is another point that I would make.

And what you're talking about, converting pixels into vectors, it's essentially turning everything into language. The vector is like a string of text, right?

Define language, though. You turn it into a sequence, yeah, a sequence of what? You could argue that even for a human, life is a sequence of bits. Now there are other things that people use right now, like diffusion models, where rather than producing those bits one bit at a time, they produce them in parallel. But I would argue that on some level this distinction is immaterial. I claim that at some level it doesn't really matter. It matters in that you can get a 10x efficiency gain, which is huge in practice, but conceptually I claim it doesn't matter.

On this idea of having an army of human trainers that are working with ChatGPT or a large language model to guide it, in effect, with reinforcement learning, but just
intuitively, that doesn't sound like an efficient way of teaching a model about the underlying reality of its language. Isn't there a way of automating that? And to Yann's credit, I think that's what he's talking about: coming up with an algorithmic means of teaching a model the underlying reality without a human having to intervene.

Yeah, so I have two comments on that. In the first place, I have a different view on the question, so I wouldn't agree with the phrasing of the question. I claim that our pre-trained models already know everything they need to know about the underlying reality. They already have this knowledge of language and also a great deal of knowledge about the processes that exist in the world that produce this language. And maybe I should reiterate this point. It's a small tangent, but I think it's so important. The thing that large generative models learn about their data, and in this case large language models about text data, are some compressed representations of the real-world processes that produce this data, which means not only people and something about their thoughts, something about their feelings, but also something about the condition that people are in and the interactions that exist between them, the different situations a person can be in. All of these are part of that compressed process that is represented by the neural net to produce the text. The better the language model, the better the generative model, the higher the fidelity, the better it captures this process. So that's the first comment that I'd make. And so in particular, I will say the models already have the knowledge.

Now, the army of teachers, as you phrase it: indeed, when you want to build a system that performs as well as possible, you just say, okay, if this thing works, do more of that. But of course those teachers are also using AI assistance. Those teachers aren't on their own, they are working with our tools together, and they are very efficient. It's
like the tools are doing the majority of the work, but you do need to have oversight, you need to have people reviewing the behavior, because you want it to eventually achieve a very high level of reliability. But overall I'll say that with this second step, after we take the finished pre-trained model and then apply the reinforcement learning to it, there is indeed a lot of motivation to make it as efficient and as precise as possible, so that the resulting language model will be as well behaved as possible. So yeah, there are these human teachers who are teaching the model the desired behavior. They are also using AI assistance, and the manner in which they use AI assistance is constantly increasing, so their own efficiency keeps increasing. So maybe this will be one way to answer this question.

Yeah, and so what you're saying is, through this process eventually the model will become more and more discerning, more and more accurate in its outputs?

Yes, that's right. There is an analogy here, which is: it already knows all kinds of things, and now we just want to really say, no, this is not what we want, don't do this here, you made a mistake here in the output. And of course it's exactly as you say, with as much AI in the loop as possible, so that the teachers who are providing the final correction to the system, their work is amplified, they are working as efficiently as possible. So it's not unlike an education process, how to act well in the world. We need to do additional training just to make sure that the model knows that hallucination is not okay, ever. And then once it knows that, now you are in business.

I see. And it's that reinforcement learning human teacher loop that will teach it?

The human teacher loop or some other variant, but there is definitely an argument to be made that something here should work, and we will find out pretty soon.

That's one of the questions: where is this going? What research are you focused on right now?

I
can't talk in detail about the specific research that I'm working on, but I can mention some of the research in broad strokes, and it would be something like: I'm very interested in making those models more reliable, more controllable, making them learn faster from less data, fewer instructions, making them so that indeed they don't hallucinate. And I think that this whole cluster of questions which I mentioned, they're all connected. There's also a question of how far in the future we are talking about, and what I commented on here is the perhaps nearer future.

You talk about the similarities between the brain and neural nets. There's a very interesting observation that Geoff Hinton made to me, I'm sure it's not new to other people, that large models, or large language models in particular, hold a tremendous amount of data with a modest number of parameters, compared to the human brain, which has trillions and trillions of parameters but a relatively small amount of data. Have you thought of it in those terms, and can you talk about what's missing in large models to have more parameters to handle the data? Is that a hardware problem or a training problem?

This comment which you made is related to one of the problems that I mentioned in the earlier questions, of learning from less data. Indeed, the current structure of the technology does like a lot of data, especially early in training. Later in training it becomes a bit less data hungry, which is why at the end it can learn, not as fast as people yet, but quite quickly. So already that means that in some sense, do we even care that we need all this data to get to this point? But indeed, more generally, I think it will be possible to learn more from less data. I think it requires some creative ideas, but I think it is possible, and I think learning more from less data will unlock a lot of different possibilities. It will allow us to teach our AIs the skills that they are missing and to
convey to it our desires and preferences, exactly how we want it to behave, more easily. So I would say that faster learning is indeed very nice, and although already after language models are trained they can learn quite quickly, I think there are opportunities to do more there.

I heard you make a comment that we need faster processors to be able to scale further, and it appears that with the scaling of models there's no end in sight, but with the power required to train these models we're reaching the limit, at least the socially accepted limit.

So I just want to make one comment, which is: I don't remember the exact comment that I made that you're referring to, but you always want faster processors, of course. You always want more of them, of course. Power keeps going up. Generally speaking, the cost is going up. And the question that I would ask is not whether the cost is large, but whether the thing that we get out of paying this cost outweighs the cost. Maybe you pay all this cost and you get nothing, then yeah, that's not worth it. But if you get something very useful, something very valuable, something that can solve a lot of problems that we have, which we really want solved, then the cost can be justified. But in terms of the processors, faster processors, yeah, any day.

Are you involved at all in the hardware question? Do you work with Cerebras, for example, the wafer-scale chips?

No, all our hardware comes from Azure and GPUs that they provide.

Yeah. You did talk at one point, I saw, about democracy and about the impact that AI can have on democracy. People have talked to me about how, if you had enough data and a large enough model, you could train the model on the data and it could come up with an optimal solution that would satisfy everybody. Do you have any aspiration, or do you think about where this might lead in terms of helping humans manage society?

Yeah, let's see, it's such a big question, because it's a much more future-looking question. I think that there are still many ways in
which our models will become far more capable than they are right now. There's no question. In particular, the way we train them and use them and so on, there's going to be a few changes here and there. They might not be immediately obvious today, but I think in hindsight it will be extremely obvious that they will indeed allow it to have the ability to come up with solutions to problems of this kind. It's unpredictable exactly how governments will use this technology as a source of advice of various kinds. On the question of democracy, one thing which I think could happen in the future is that, because you have these neural nets and they're going to be so pervasive and so impactful in society, we will find that it is desirable to have some kind of democratic process where, let's say, the citizens of a country provide some information to the neural net about how they'd like things to be, how they'd like it to behave, or something along these lines. I could imagine that happening. That could be a very high-bandwidth form of democracy perhaps, where you get a lot more information out of each citizen and you aggregate it to specify how exactly we want such systems to act. Now, it opens a whole lot of questions, but that's one thing that could happen in the future.

Yeah, I can see in the democracy example you give that individuals would have the opportunity to input data, but, and this sort of goes to the world model question, do you think AI systems will eventually be large enough that they can understand a situation and analyze all of the variables? You would need a model that does more than absorb language, I would think.

What does it mean to analyze all the variables? Eventually there will be a choice you need to make, where you say: these variables seem really important, I want to go deep. Because a person can read a book. I can read a hundred books, or I can read one book very slowly and carefully and get more out of it. So there
will be some element of that. Also, I think it's probably fundamentally impossible to understand everything in some sense. Anytime there is any kind of complicated situation in society, even in a company, even in a mid-sized company, it's already beyond the comprehension of any single individual. And I think that if we build our AI systems the right way, AI could be incredibly helpful in pretty much any situation. [Music]

That's it for this episode. I want to thank Ilya for his time. I also want to thank Ellie George for helping arrange the interview. If you want to read a transcript of this conversation, you can find one on our website, eye-on.ai, that's E-Y-E hyphen O-N dot A-I. We love to hear from listeners, so feel free to email me at craig@eye-on.ai, that's C-R-A-I-G at E-Y-E hyphen O-N dot A-I. I get a lot of emails, so put "listener" in the subject line so I don't miss it. We have listeners in 170 countries and territories. Remember, the singularity may not be near, but AI is changing your world, so pay attention.
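
Editorial sketch: the "prediction is compression" remark in the conversation above can be made concrete with a small example. In Shannon's source-coding view, a model that assigns probability p to the symbol that actually occurs can encode that symbol in about -log2(p) bits, so a better next-character predictor gives a shorter encoding. The Python below is an illustration only and is not something described in the interview. The function names and the sample string are made up for the example, the models are simple character-count models fit to the very text they encode, and the cost of storing the model itself is ignored.

```python
import math
from collections import Counter


def uniform_bits(text):
    """Ideal code length if every distinct character is equally likely."""
    alphabet_size = len(set(text))
    return len(text) * math.log2(alphabet_size)


def unigram_bits(text):
    """Ideal code length when each character is predicted from its overall frequency."""
    counts = Counter(text)
    total = len(text)
    return sum(-math.log2(counts[ch] / total) for ch in text)


def bigram_bits(text):
    """Ideal code length when each character is predicted from the previous character."""
    counts = Counter(text)
    total = len(text)
    pair_counts = Counter(zip(text, text[1:]))   # (previous, current) frequencies
    context_counts = Counter(text[:-1])          # how often each character acts as context
    # The first character has no context, so fall back to its overall frequency.
    bits = -math.log2(counts[text[0]] / total)
    for prev, ch in zip(text, text[1:]):
        bits += -math.log2(pair_counts[(prev, ch)] / context_counts[prev])
    return bits


if __name__ == "__main__":
    # Hypothetical sample text, chosen only because it is short and repetitive.
    sample = "the cat sat on the mat. the cat sat on the hat. "
    for name, fn in [("uniform", uniform_bits),
                     ("unigram", unigram_bits),
                     ("bigram", bigram_bits)]:
        print(f"{name:8s} {fn(sample):8.1f} bits")
```

On the sample string the ordering is bigram, then unigram, then uniform, from fewest bits to most: each step uses more context to predict the next character, and the encoding shrinks accordingly. A large language model plays the same role as the bigram table here, only with a far better predictor, which is the sense in which predicting text well and compressing it well are the same problem.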

Info

Channel: Eye on AI

Views: 231,242

Keywords: GPT-4, GPT4, LLM, MACHINE LEARNING, ARTIFICIAL INTELLIGENCE, AI, DEEP LEARNING, SUTSKEVER, OPENAI

Id: SjhIlw3Iffs

Length: 42min 59sec (2579 seconds)

Published: Wed Mar 15 2023