Season 1 Ep. 22 Ilya Sutskever | The Robot Brains Podcast

Captions
Today here with me is Ilya Sutskever. Ilya is co-founder and chief scientist of OpenAI. As a PhD student at Toronto, Ilya was one of the authors on the 2012 AlexNet paper that completely changed the field of AI, having pretty much everyone switch from traditional approaches to deep learning, resulting in the avalanche of AI breakthroughs we've seen over the past 10 years. After the AlexNet breakthrough in computer vision, at Google, among many other breakthroughs, Ilya showed that neural networks are unexpectedly great at machine translation; at least at the time that was unexpected, and it has long since become the norm to use neural nets for machine translation. In late 2015 Ilya left Google to co-found OpenAI, where he's chief scientist. Some of his breakthroughs there include GPT, CLIP, DALL-E, and Codex, all of which I hope we'll be talking about. Any single one of the six breakthroughs I just mentioned would make for an illustrious career, and Ilya has many more than those six. In fact, Ilya's works, less than 10 years out of his PhD, are cited over 250,000 times, reflective of his absolutely mind-blowing influence on the field. Ilya, I have to say, I really miss our days together at OpenAI, where every day you were such a source of inspiration and creativity. So happy to have you here on the podcast. Welcome.

Thank you, Pieter, and thank you for the introduction. I'm very happy to be on the podcast.

Well, I'm so glad we finally get to chat again. We used to spend so much time together, and I'm really excited to catch up on all the things you've been up to the last few years. But first I want to step back a little bit to what many believe, and definitely I believe, is the defining moment of the modern AI era, which is the ImageNet breakthrough in 2012.
It's the moment where a neural network beat all past approaches to computer vision by a very large margin, and of course you were one of the people making that happen. So I'm really curious, from your perspective, how did that come about? Everybody else was working on different approaches to computer vision, and there you are working on neural nets for computer vision, and then you drastically outperform everyone. How did you even decide to do this?

Yeah, I'd say that what led me to this result was a set of realizations over a time period of a number of years, which I'll describe to you. I think the first pivotal moment was when James Martens wrote a paper called "Deep Learning via Hessian-Free Optimization," and that was the first time anyone had shown that you can train deep networks end-to-end from supervised data. For some context, back in those days everybody knew that you cannot train deep networks, it cannot be done, backpropagation is too weak, you need to do some kind of pre-training of some sort, and then maybe you'll get somewhere. But if it is the case that you can't train them end to end, then what can they do? And there is one more piece of context that's really important. Today we take deep learning for granted: of course a large neural network is what you need, you shove data into it and you'll get amazing results, everyone knows that, every child knows that. How can it be that we did not know that? How could such an obvious thing not be known? Well, people were really focused on machine learning models where they could prove that there is an algorithm which can perfectly train them. But whenever you put this condition on yourself and require a simple, elegant mathematical proof, you really end up restricting the power of your model. In contrast, the fundamental thing about neural networks is that they are basically little computers: little parallel computers that are no longer so little anymore. They can be as little or as large as you want, but basically it is a computer, a parallel computer, and when you train a neural network, you program this computer with the backpropagation algorithm. So the thing that really clicked for me, when I saw these results with the Hessian-free optimizer, was the realization: wait a second, we can actually program those things now. It's no longer the case that, aspirationally, maybe someone could train those things but it's obviously impossible, local minima will get you. No, you can train a neural net. Then the second realization is that human vision is fast. It takes several hundred milliseconds at most to recognize something, and yet our neurons are slow, so that means you don't even need that many layers to get respectable vision. So what does that mean? It means that if you have a neural network which is pretty large, then there exist some parameters which achieve good results on vision, if only there was a data set we could train from. Then ImageNet came up, and then the GPUs came up, and I thought, this has to happen. At some point I had a conversation with Alex Krizhevsky, in which he said that he had GPU code which could train a small convnet to get respectable results on CIFAR-10 in 60 seconds, and I was like, oh my god, so let's do this on ImageNet, it's going to crush everything. And that's how it happened, that's how it came to be.

I love the backstory here, Ilya, and it reminds me a lot of our days at OpenAI, where many things to you just look unavoidable, just so clearly they have to be that way. I remember the first time you articulated to me that a neural net is just a computer program, and this was several years before even Karpathy started talking about software 2.0 being, you know, programming with neural nets, and it's just
parallel and serial compute. It's really amazing that you saw this even before there was real success in neural nets. When did you realize it was actually working on ImageNet? What was that like?

I mean, I had very little doubt that it would work, but, you know, at this point Alex was training the neural net and the results were getting better week after week, and that's about it. But I felt like the big risk from my perspective was: do we have the ability to utilize the GPUs well enough to train a big enough neural network? You know, there's no such thing as "big enough"; it's more like an interestingly large neural network. It has to be a neural network that is large enough to be interesting, whereas all previous neural networks were small. If you're going to have something which is way larger than anything before, then it should do much better than anything anyone's ever seen. Of course we are far beyond that now, our computers are faster and the networks are larger, but the goal then was just to go as far as possible with the hardware we had. That was the risk, and fortunately Alex had the kernels that eliminated that risk.

Right, that's a very good point. I mean, today you put something in PyTorch, TensorFlow, whatever your favorite framework is, and you can train a neural network. Back then you actually had to build some pretty specialized tools yourself to make this all run. Now, as that breakthrough happens, I'm curious what you were thinking next. You probably knew about this breakthrough before everybody else in the world, because you had the results before the public workshop. So before everybody else in the world even knew that neural nets were going to be the new state of the art and a new way of doing computer vision, you already knew that. Where was your mind going at that point?

So I think there were two things I was thinking about. My belief has been that we'd proven that neural nets can solve problems that human beings can solve in a short amount of time, because with that result we'd proven that we can train neural nets with modest numbers of layers. I thought making the neural networks wider will be pretty straightforward; making them deeper is going to be harder. So I thought, okay, depth is how you solve problems that require a lot of thinking, so can we find some other interesting problems that don't require a lot of thinking? I actually was thinking a little bit about reinforcement learning, but the other problems were problems in language that people can understand quickly as well. With language you also have the property that you don't need to spend a lot of time thinking about what was said exactly; sometimes you do, but often you don't. So, problems in language. Translation was the preeminent problem in language at the time, and that's why I was wondering if you could do something there. Another thing I was thinking about was actually Go as well. I was thinking that using a convnet could potentially provide very good intuition for the non-neural-network Go-playing systems that existed back then.

Can you say a bit more about those Go systems, and how a neural network could, and actually has, changed how that's done?

I mean, basically, before deep learning, anything you had to do in AI involved some kind of search procedure with some kind of hard-coded heuristics, where really experienced engineers spent a lot of time thinking really hard about exactly under what conditions they should continue something or discontinue something or expend resources, and they just spent all their time trying to figure out those heuristics. But the neural network is
formalized intuition. It is actually intuition; it gives you the kind of expert gut feel. I read that an expert player in any game can just look at the situation and instantly get a really strong gut feel, it's either this or that, and then they spend all their time thinking about which of those two it is. So I said, great, the neural network should have absolutely no trouble with that, if you buy the theory that we can replicate functions that humans can do in a short amount of time, like less than a second. And it felt like, okay, in the case of something like Go, which was a big unsolved problem back then, we should be able to do that.

Back at the time, Ilya, the first time I heard that maybe you'd use a convnet for Go, my naive reaction, obviously naive because it clearly succeeded, was that convnets are famous for translation invariance, and there's no way we want to be translation invariant on the board of Go, because it really matters whether a pattern is in one place or another place. But obviously that didn't stop the convnets from succeeding and capturing the patterns nevertheless.

Yeah, I mean, that's again the power of the parallel computer. Can you imagine programming a convnet to do the right thing? Well, it's a little bit hard to imagine, so it's true that that part may have been a small leap of faith. And maybe to close the loop on Go: my interest in Go ended up with me participating in the AlphaGo paper as well, in a modest way. I had an intern, Chris Maddison, and we wanted to apply supervised convnets to Go, and at the same time Google acquired DeepMind and all the DeepMind folks visited Google. So we spoke with David Silver and Aja Huang, and it seemed like a cool project to try out. But then DeepMind really put a lot of effort behind it, and they had a fantastic execution on this project.

Yeah, I think while the ImageNet moment is the moment most AI researchers saw the coming of age of deep learning and a whole new era starting, AlphaGo is probably the moment most of the world saw that AI is now capable of something very different from what was possible before. It's interesting, though, because while most of the world focused on that, around the same time a New York Times article came out saying that something very fundamental had been happening in natural language processing, which you alluded to: the whole Google Translate system had been revamped with neural networks. A lot of people at the time thought of neural nets as pattern recognition, and patterns suggest continuous signals like speech or visual signals, whereas language is discrete. So I'm really curious about that. How do you make the leap from continuous signals, where neural nets seemed to many people a natural fit, to language, which most people would look at as discrete symbols and very different?

Yeah, so I think that leap is very natural if you believe relatively strongly that biological neurons and artificial neurons are not that different, because then you can say, okay, let's think of the single best professional translator in the world, someone who is extremely fluent in both languages. That person could probably translate language almost instantly. So there exists some neural network with a relatively small number of layers in that person's mind that can do this task. If we have a neural network inside our computer which might be a little bit smaller, and it's trained on a lot of input-output examples, we already know that we will succeed in finding the neural net that solves the problem. So the existence of one such really good instantaneous translator is proof that a neural network can do it. Now, it's a large neural network; our
brains are quite large, but maybe you can take a leap of faith and say, well, maybe our digital neurons, we can train them a little bit more, and maybe they're a little bit less noisy, and maybe it will still work out. Of course, the neural networks are still not at the level of a really amazing human translator, so there's a gap, but that was the chain of reasoning: humans can do it quickly, biological neurons are not unlike artificial neurons, so why can't the neural network do it? Let's find out.

With your collaborators at Google, you invented the modern way of doing machine translation with neural networks, which is really amazing. Can you say a little bit more about how that works?

All you need is a large neural network with some way of ingesting some representations of words. So what does "representation" mean? It's a word we use in AI often. A representation is basically this: you have the letter "a," or the word "cat"; how do you present it to the computer, to the neural network? You basically just need to agree with yourself that you're going to create some kind of mapping between the words, or the letters, and some kind of signals that happen to be in a format the neural net can accept. You just design this dictionary once and feed those signals to the neural net. Then you need some way for the neural network to ingest those signals one at a time, and then it emits the words of the translation one at a time. That's literally it. It's called the autoregressive modeling approach, and it's quite popular right now, but not because it's necessarily special; it's just convenient. The neural networks do all the work. The neural networks figure out how to build up their inner machinery, how to build up their neurons, so that they will correctly interpret the words as they come in one at a time, and then somehow break them into little pieces and transform them, and then do exactly the right orchestrated dance to output the correct words one at a time. It's probably possible to design other neural networks with other ways of ingesting the words, and people are exploring this right now. If you follow ML Twitter you may have seen phrases like "diffusion models"; maybe those will be able to ingest the words in parallel, then do some sequential work, and then output them in parallel. It doesn't actually matter. What matters is that you present the words to the neural net somehow, and you have some way for the neural net to output the words of the target. That's what matters.

Yeah, to me it was a very big surprise at the time that it worked so well for language. I was 100% certain that it would work great for anything continuous, and then all of a sudden the sequence-to-sequence models that you pioneered made it, okay, well, I guess now it's going to work for everything, because if it can work for language, what's left in terms of signals we work with?

Now, you of course didn't start working on neural nets from the day you were born, and I'm really curious: where did you grow up, and how did that lead to you ending up being an AI researcher?

Yeah, so I was born in Russia, I grew up in Israel, and I moved to Canada when I was 16.
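The autoregressive translation recipe Ilya described a moment ago, agree on a dictionary mapping words to signals once, feed the source in, then emit target words one at a time, each conditioned on the words emitted so far, can be sketched in a few lines of Python. Everything here is invented for illustration: the toy vocabularies, the example sentence, and `next_token_scores`, which is a hard-coded stub standing in for the trained network that a real system would learn from data.

```python
# Toy sketch of autoregressive translation decoding (greedy).
# The vocabularies and the scoring stub are hypothetical stand-ins.

SRC_VOCAB = {"<s>": 0, "the": 1, "cat": 2, "sits": 3, "</s>": 4}
TGT_VOCAB = {"<s>": 0, "le": 1, "chat": 2, "est": 3, "assis": 4, "</s>": 5}
ID_TO_TGT = {i: w for w, i in TGT_VOCAB.items()}

def encode(words, vocab):
    """The agreed-upon dictionary: words in, signals (integer ids) out."""
    return [vocab[w] for w in words]

def next_token_scores(src_ids, prefix_ids):
    """Stand-in for the trained network: a score per target token.
    A real model computes this from src_ids and the prefix; here we
    hard-code one canned translation so the loop is runnable."""
    canned = [1, 2, 3, 4, 5]          # ids for "le chat est assis </s>"
    step = len(prefix_ids) - 1        # prefix starts with <s>
    scores = [0.0] * len(TGT_VOCAB)
    scores[canned[min(step, len(canned) - 1)]] = 1.0
    return scores

def translate(src_words, max_len=10):
    """Emit target words one at a time, conditioned on words so far."""
    src_ids = encode(src_words, SRC_VOCAB)
    prefix = [TGT_VOCAB["<s>"]]
    out = []
    for _ in range(max_len):
        scores = next_token_scores(src_ids, prefix)
        tok = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        if tok == TGT_VOCAB["</s>"]:
            break
        out.append(ID_TO_TGT[tok])
        prefix.append(tok)            # feed the emitted word back in
    return out

print(translate(["the", "cat", "sits"]))  # → ['le', 'chat', 'est', 'assis']
```

Greedy picking is the simplest decoding choice; real translation systems run the same loop with beam search or sampling, but the structure, one output token at a time fed back into the model, is exactly the autoregressive approach described above.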
According to my parents, I'd been talking about AI at a relatively early age, and I definitely remember at some point thinking about AI and reading about this whole business of playing chess using brute force. It seemed like, yeah, you could do the chess stuff, no problem, but the learning stuff, that's where the real meat of AI is. That's why AI is so terrible: because it doesn't learn, and humans learn all the time. So, can we do any learning at all? When my family moved to Canada, to Toronto, and I entered the University of Toronto, I sought out the learning professors, and that's how I found Geoff Hinton. The other thing is that he was into training neural networks, and neural networks seemed like a much more promising direction than the other approaches, because they didn't have obvious computational limitations like things like decision trees, a phrase that was popular back in the day.

Now Geoff, of course, has a very long history working in AI, and especially neural networks and deep learning: coming out of England, coming to the US, then moving to Canada. His move to Canada in some sense helped spark the beginning of the new AI era in Canada, of all places. You were there at the same time, which is really interesting. I'm kind of curious: do you think there's any reason your parents decided to go to Toronto, the place where both you and Geoff ended up, and Alex too? I mean, the three of you were there together to make that happen.

I think it's a bit of a happy coincidence. I think it has to do with the way immigration works. It is a fact that it is quite a bit easier to immigrate into Canada, and if you immigrate into Canada, Toronto is perhaps the most appealing city to settle in.

Now that coincidence brings you to the University of Toronto, and you find Geoff Hinton working on neural networks. But I've got to imagine, when you looked into his history, you must have noticed he'd been working on it for 30 or 40 years. Was there any moment you thought, well, if it doesn't work after 30 or 40 years, it's not going to work now either?

I see what you're saying, but my motivation was different. I had a very explicit motivation to make even a very, very small but meaningful contribution to AI, to learning, because I thought learning didn't work at all. If it worked just a little bit better because I was there, I would declare it a success. That was my goal.

And do you remember anything from your first meetings with Geoff? How was that?

I mean, I was a third-year undergrad when I met him for the first time, and I thought it was great. My major in undergrad was math, but the thing about math is that math is very hard, and all the really talented people go into math. So one of the things I thought was great about machine learning is that not only is it the real thing, but also all the really clever people were going into math and physics instead, so I was very pleased about that.

What I remember from actually reading Cade Metz's book, and this is possibly my favorite anecdote from it, is Geoff telling the story about him meeting you, Ilya. Here's how the book tells the story; maybe you've read it, maybe not. Essentially the book says: there's Geoff, and this young student comes in, Ilya, still an undergrad, and Geoff gives you a paper, and you go read it, and you come back, and you tell him, "I don't understand it." And Geoff's like, oh, that's okay, you're still an undergrad, what don't you understand, I can explain it to you. And essentially you say, "Actually, I don't understand why they don't automate the whole process of learning; it's still too much hand-holding. I understand the paper, I just don't understand why they're doing it that way." And he's like, okay, wow, this is interesting, and he gives you
another paper, and again you go read it and come back, so goes the story, and you say, "Oh, I don't understand this one either." And Geoff's like, what don't you understand about this one, I'm happy to explain. And you go, "I don't understand why they train a separate neural network for every application. Why can't we train one gigantic network for everything? It should help to be trained jointly." To me, that reminds me a lot of our times at OpenAI, where it always felt like you were already thinking several steps into the future of how things are going to shape up, just from the evidence we have today, of how it really should be several years down the line. At least, according to the book, that's how Geoff remembers the first two meetings with you.

Yeah, I mean, something like this did happen, it's true. The field of AI back then, when I was starting out, was not a hopeful field. It was a field of desolation and despair. No one was making any progress at all, and it was not clear if progress was even possible. So what do you do when you're in this situation? You're walking down this path, the most important path, but you have no idea how long it is, you have no idea how hard it's going to be. What would be a reasonable goal in this case? Well, the goal which I chose was: can I make one useful step? That was my explicit motivation, at least for quite a while, before it became clear that the path was going to get a lot steeper and a lot more rapid, and ambitions grew very rapidly. But at first, when there was no gradient, the goal was just to make any step at all, anything useful; that would be meaningful progress towards AI.

I think that's really intriguing, actually, because I think that's what drives a lot of researchers: to just find a way to make some progress, not knowing ahead of time how far you can get, but just being so excited about the topic that you want to find a way to at least make some progress and then keep going. It's of course very interesting in your case that the whole thing then switched from slow progress to ever faster progress, all of a sudden, thanks to the thing you were using to make that bit of progress, which turned out to open up the floodgates for massive progress.

Now, you start in Canada, your PhD research completely changes the field, you start a company that gets acquired by Google, and you're at Google. Then the big thing, and also the moment our paths start crossing, or are about to cross: you're on a roll at Google, you're doing some of the most amazing pioneering work in AI, you're clearly in an amazing situation where you are doing some of the best work happening in the world, and you decide to change your situation. How did that come about?

I remember being at Google and feeling really comfortable and also really restless. I think two factors contributed to that. One is that somehow I could look 10 years into the future, and I had a little bit too much clarity about how things would look, and I didn't enjoy that very much. But there was another thing as well, and that's the experience of seeing DeepMind work on AlphaGo. It was very inspiring, and I thought it was a sign of things to come, that the field was starting to mature. Up until that point, all progress in AI had been driven by individual researchers working on small projects, maybe small groups of researchers with some advice from their professors and maybe some other collaborators, but usually it would be small groups. Most of the work would be idea-heavy, and then there would be some kind of effort on the
engineering execution to prove that the idea is valid. But I felt that AlphaGo was a little different. It showed me that the engineering is critical, and that in fact the field would change and become the engineering field that it is today, because the tools were getting very solid, and the question then becomes: okay, how do you really train those networks, how do you debug them, how do you set up the distributed training? It's a lot of work, and the stack is quite deep. I felt that the culture at Google was very similar to the academic culture, which is really good for generating radical, novel ideas, and in fact Google has generated a lot of radical and revolutionary ideas in AI over the years, most notably the transformer from the past few years. But I felt that that's not going to be the whole of progress in AI; it's now only a part of progress in AI. If you think of it as a body, you can say you need the muscles and the skeleton and the nervous system, and if you only have one of them, it's amazing, but the whole thing won't really move; you need all the things together. So I had a vague feeling that it would be really nice if there was some kind of company which would have these elements together, but I didn't know how to do it, I didn't have any path to it, I was just daydreaming about it. And then at some point I got an email from Sam Altman saying, hey, let's get dinner with some cool people. I said sure, and I showed up, and Greg Brockman was there, and Elon Musk was there, and a few others, and we just chatted about wouldn't it be nice to start a new AI lab. I found that the time was right, because I'd been thinking the same thoughts independently, and I really wanted it to be engineering-heavy. And seeing that Elon was going to be involved, I thought, well, I can't imagine a better person from whom to learn the big-engineering-project side of things. So I think this was the genesis. There is more to it, but I think that was the real genesis of OpenAI from my perspective: I was daydreaming about something, and then one day I woke up with this email, and the daydream came true.

What you're really saying there is that there's a group of very highly accomplished and ambitious people who are in some sense aligned with your dream and want to make this happen together. But all that gets you is essentially some paperwork saying that a new company now exists, and maybe some money to get going. You actually still need to decide what to do with those resources and with your time. I'm kind of curious, at the beginning of OpenAI, what was going on in your mind in terms of how to shape this up? Obviously it's been a massive success, but I'm really curious about the beginning part and how that played out for you.

The beginning part, I would describe as a whole lot of stress. It wasn't exactly clear how to get going right away. There was only clarity about a few things: there needed to be some kind of large project, and I also was excited about the idea that maybe if you can predict really well, you make progress on unsupervised learning. But beyond that, it wasn't clear what to do. So we tried a whole lot of different things, and then we decided that maybe it would be good to solve a difficult computer game, Dota. This is where Greg just showed his strength: he took on this project even though it seemed really impossible, like genuinely impossible, and just went for it. And somehow it worked in the most stereotypical deep learning way, where the simplest method that we tried just ended
up working: the simplest policy gradient method. As we kept scaling it up, it just never stopped improving with more scale and more training.

Just to double-click on that for a moment, I don't think everybody knows what Dota is. Can you say a bit about that? And I mean, I fully agree it's very surprising that the simplest approach ultimately worked on such a hard problem.

So, for some context, the state of the field back then: if you look at reinforcement learning in particular, DeepMind had made some very exciting progress, first by training a neural net with reinforcement learning to play simple computer games, and the reaction was, okay, that's exciting and interesting and kind of cool, but what else can you do? Then AlphaGo happened, and opinion shifted: okay, reinforcement learning maybe can do some things, but, you know, Go. It's funny, by the way: Go used to look like this impossible game, and now everyone says, oh, such a simple game, the board is so small. Our perceptions shift quickly. But then DeepMind was talking about how StarCraft is the next logical step up from Go, and that made a lot of sense to me as well. It seemed like a much harder game, not necessarily for a person playing it, but for our tools it seemed harder, because it has a lot more moving parts, it's much more chaotic, and it's a real-time strategy game. We thought it would be nice to have our own twist on it and to try to make a bot which can play Dota. Dota is another real-time strategy game that's really popular. I believe it definitely had, and I don't know if it still has, the largest annual prize pool of any professional esports game. It has a very vibrant, very strong professional scene; people dedicate their lives to playing this game. It's a game of reflex and strategy and instinct, a lot of things happen, and you don't get to see the whole game. The point is that it definitely felt like a grand challenge for reinforcement learning at that time. And as for our opinion about the tools of reinforcement learning, let's put it this way: the grand challenge felt like it was up here, and the field's opinion about the tools and their ability to solve a problem like this was way down here. There was a huge mismatch. So when we started working on it, we thought, oh yeah, we're going to need to develop all kinds of crazy planning methods and hierarchical reinforcement learning methods and whatnot, but let's just get a baseline, let's just see where the baseline breaks. And the baseline just didn't break; it just kept improving all the time. What would happen over the course of this project is that we would have public demonstrations of our progress: as we reached different milestones of performance, we would have some kind of public exhibition game against professionals of different levels of accomplishment. At first we had a public exhibition game against retired professionals, then against acting professionals, and then finally we had a game against the strongest professionals, and we defeated them. The interesting thing is that at each step, very knowledgeable experts in ML would come out on Twitter and say, well, that was really cool, a great success for reinforcement learning, but obviously the next step will require the explicit planning thing or the hierarchy thing. And somehow it did not. That was very important for us; I felt like it really proved to us that we can do large projects.

I remember, I was not part of this project, just to be clear, but I was there at OpenAI when it was all happening, working on other projects, and I remember being very, very surprised that no explicit structure was needed. Well, in my mind, obviously, but maybe you
know, it's not even true, but in my mind there is this large LSTM neural network that maybe, somehow, through backpropagation actually internalized the structure that I thought we would have to put in explicitly, and maybe the neural network was able to just absorb that intuition through backpropagation without the need to hard-code it. That was really intriguing to me, because it seemed like, wow, a lot of intuitions might be better provided through data than through hard coding, which is a very common theme in all of deep learning, but in reinforcement learning at the time it wasn't that strongly believed yet, until that result came out. Yeah, I agree with your assessment. I like to think that this result changed the field's view, at least a little bit, about the capability of simple reinforcement learning. Now, to be fair, you still need quite a hefty amount of experience to get a very strong result on such a game. So I would say: if you have the ability to generate a very large amount of experience against some kind of simulator, then this style of reinforcement learning can be extremely successful. In fact, another important result in OpenAI's history was to use the same exact approach to train a robot to solve the Rubik's Cube. A physical robot hand actually solved a physical Rubik's Cube, and it was quite a challenging project. The training was done entirely in simulation, and the simulation was designed to be extra hard, requiring the neural net to be very adaptive, so that when you give it the real, physical robot it will still succeed. But at its core it was the same exact approach as the one we used in the Dota project, which was very large-scale reinforcement learning. In fact, it was even the same code. So that was a case where we had one general technique we could reuse.
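To make "the simplest policy gradient method" concrete, here is a toy REINFORCE loop on a two-armed bandit. This is only a sketch of the core idea (sample an action, then nudge its log-probability in proportion to the reward); OpenAI Five actually used a scaled-up PPO, and everything here, the environment, the learning rate, the arm payoffs, is made up for illustration.

```python
import math
import random

# Toy sketch of the simplest policy gradient (REINFORCE) on a 2-armed bandit.
# The point is only to show the mechanism: score sampled actions by reward
# and push up their log-probabilities. Real systems (e.g. OpenAI Five) use
# refinements like PPO, baselines, and enormous batch sizes.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reward(action):
    # Hypothetical environment: arm 1 pays off more on average.
    return random.gauss(1.0 if action == 1 else 0.0, 0.1)

random.seed(0)
logits = [0.0, 0.0]  # the "policy": a softmax over two action logits
lr = 0.1
for _ in range(2000):
    probs = softmax(logits)
    a = 0 if random.random() < probs[0] else 1
    r = reward(a)
    # gradient of log pi(a) w.r.t. the logits is (one_hot(a) - probs)
    for i in range(2):
        logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])

print(softmax(logits))  # the policy should now strongly prefer arm 1
```

Nothing beyond this gradient estimator is needed for the mechanism to work; scale, not algorithmic sophistication, is what carried it in the Dota project.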
These general, powerful results were ones you were able to use in more than one place. And that was reinforcement learning; I know that right now there's other reinforcement learning happening at OpenAI, in the context of language, actually. Before we get to that, and I'm really curious about it: language modeling, GPT, is probably the most visible thing in recent years in the public eye of what AI is capable of. OpenAI produced these GPT generations of models that can complete articles in very credible ways, and it's been very surprising how capable they are. So what I'm really curious about, again, is that you decided, not alone but together with collaborators at OpenAI, that the time was right to go down this path of building language models. What was it for you that made you believe this was the thing to start doing? Yeah, so from my side, a really important thing is that I was really interested in unsupervised learning. For context, the results that we spoke about earlier, about vision, and even about Go and Dota, and translation, are all cases where you train a neural network by presenting it with inputs and desired outputs. You have a typical input, a sentence, an image, something, and you have the desired output. You run the neural network, you compare the predicted output with the desired output, and then you change the neural network to reduce this error. You just do that a lot, and that's how learning works. It's completely intuitive that if you do this, the neural network will succeed. I should say maybe not completely intuitive, but definitely pretty intuitive today, because you say, hey, here is my input, here's my desired output, don't
make the mistakes, and eventually the mistakes will go away. It is something where you can at least have a reasonably strong intuition about why it should work, why supervised learning works, and likewise reinforcement learning. In contrast, at least in my mind, unsupervised learning is much more mysterious. Now, what is unsupervised learning exactly? It's the idea that you can understand the world, whatever that means, by simply observing it, without there being a teacher that tells you what the desired behavior should be. So there is a pretty obvious question, which is: how could that possibly work? What was the prevailing thinking? The prevailing thinking was that maybe you have some kind of task: you take your input, your observation, an image, let's say, and you ask the neural network to transform it in some way and then produce the same image back. But why would that be a good thing for the task you care about? Is there some mathematical reason for it? I found it very unsatisfying. It felt to me like there was no good mathematical basis for unsupervised learning at all whatsoever, and I was really bothered by it. After a lot of thinking, I developed the belief that if you predict the next bit really well, you should get really good unsupervised learning. The idea is that if you can predict the next bit really well, then you have extracted all the meaningful information that exists in the signal, and therefore the model should have a representation of all the concepts. In the context of language modeling it's very intuitive: if you can predict the next word moderately accurately, maybe the model will know that words are just clusters of characters separated by spaces. If you predict better, you might know
that there is a vocabulary, but you won't be good at syntax. If you improve your prediction even further, you'll get better at the syntax as well, and you'll be producing syntactically plausible mumbo jumbo. But if you improve your prediction even further, necessarily the semantics have to start kicking in. I felt that the same argument could be made about predicting pixels as well. So at some point I started to believe that maybe doing a really good job on prediction would give us unsupervised learning, which back then felt like a grand challenge. That's another interesting thing: now everyone knows that unsupervised learning just works, but not that long ago it seemed like this completely intractable thing. So anyway, to come back to the story of how the GPTs were created: I'd say the first project that really was a step in this direction was led by Alec Radford, who is an important hero of the GPT saga. We trained an LSTM to predict the next character on Amazon reviews of products, and we discovered that this LSTM has a neuron which corresponds to sentiment. In other words, if you are reading a review which is positive, the sentiment neuron will fire, and if you're reading a review which is negative, the sentiment neuron will not fire. That was interesting, and it felt to us like it validated the conjecture: yes, of course, eventually, to predict what comes next really well, you need to discover the truth about the data. Then the transformer came out, and when we saw the transformer it got us really excited, because we had been really struggling. We believed that long-term dependencies were really important, and the transformer had a very clean, elegant, and compute-efficient answer to long-term dependencies. For context, the transformer is this neural network architecture, and in some sense it's just really good. A little bit more technically: so we
discussed that these neural networks are deep in some way, and it's been the case until relatively recently that it was pretty hard to train deep neural networks. With previous neural networks for modeling sequences of language, the longer the sequence was, the deeper the network would get and the harder it would be to train. But the transformer decoupled the depth of the network from the length of the sequence, so you could have a transformer of manageable depth with very long sequences. That was exciting, and this investigation led to GPT-1. Then we continued to believe in scale, and that led to GPT-2 and GPT-3. Here I really want to call out Dario Amodei, who really believed that if we were to scale up the GPTs it would be the most amazing thing ever, and that's how we got GPT-3. When GPT-3 came out, I think what was so exciting to the entire community was that it wasn't just something that could complete text when you start with a prompt, saying, oh, this is likely your next sentence. It could complete all kinds of things: people would write web pages, even write some very basic code that gets completed with GPT-3. They would be able to prompt it, and that really intrigued me, this notion of prompting, where you have this gigantic model that's trained on I don't know how much text, but somehow, when you then briefly feed it a little bit of extra text in the moment, you can actually prime it to start doing something that you want it to do. Can you say a bit more about that? Where did that come from, and how does that work, do you think? So what is a language model exactly? You just have a neural network that takes some text and tries to output an educated guess of what the next word might be. It might say, you know, it's 30% the word "the", some kind of guess of probabilities
of what the words might be. Then you can pick a word according to the probabilities that the neural net outputs, commit to it, and ask the neural net to predict the next word again, and again, and again. Now, we know that real text is, in some sense, very responsive to its beginning; text has a lot of very complicated structure. If you read a document which says, "the document below will describe a list of questions that were given in the MIT entrance exam in 1900", and I just made that up, then I strongly expect that there will in fact be ten or so math questions, of the kind of math that was usual in exams in the 1900s. If the model is good enough, it should actually do that. Now, how good is good enough? That's a little bit of a qualitative statement, but if it is definitely good enough, it should be able to do it. So then you train GPT-3 and you see, can it actually do it? Sometimes it cannot, but very often it is indeed very responsive to whatever text you give it, because to predict what comes next well enough, you need to really understand the text you're given. And I think this is, in some way, the centrality of prediction: good enough prediction gives you everything you could ever dream about. Now, one of the things that also stood out to me with GPT is that it's a major research breakthrough, but it also feels very practical. Whenever I'm typing something, I know what I want to type next, it's already in my head, but I still have to type it; with GPT, GPT-2 onwards probably, it could complete it fairly accurately. So it seemed very different in that sense from, for example, the Rubik's Cube breakthrough or the Dota breakthrough, which were fundamental research breakthroughs.
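As an aside, the sampling loop described above (the model outputs a probability distribution over the next word, you pick one, commit, and ask again) can be sketched in a few lines. Here a bigram count table over a made-up corpus stands in for the model; a real GPT replaces the table with a large transformer, but the loop around it is the same.

```python
import random

# Toy illustration of autoregressive sampling: get a distribution over the
# next word, sample one, commit to it, and predict again. The "model" here
# is just bigram counts from a tiny made-up corpus.

corpus = "the cat sat on the mat the cat ate".split()
table = {}
for a, b in zip(corpus, corpus[1:]):
    table.setdefault(a, []).append(b)

def next_word_distribution(word):
    followers = table.get(word, corpus)  # fall back to unigram counts
    return {w: followers.count(w) / len(followers) for w in set(followers)}

random.seed(1)
text = ["the"]  # the "prompt"
for _ in range(5):
    dist = next_word_distribution(text[-1])
    words, weights = zip(*dist.items())
    text.append(random.choices(words, weights=weights)[0])

print(" ".join(text))
```

Prompting falls out of the same loop: whatever you seed `text` with conditions every subsequent prediction, which is why a good enough model is so responsive to its beginning.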
With those, it was hard to dream of the direct applications, but here with GPT it was so easy to dream of so many applications. And I'm curious: in your own evolution on things, when GPT started working, did you start thinking about applications, or, more generally, did the people around you at OpenAI start thinking about applications? What was going on? Yeah, we were definitely excited about the potential applications. We were so excited about them that we built a whole API product around GPT-3, so that people could go and build their new, convenient, and sometimes unprecedented applications in language. Maybe another way of looking at what's happening is that AI is just continuing to get more and more capable, and it can sometimes be tricky to tell whether a particular research advance is real or not. Suppose you have some cool demo of something: what do you make of it? It can be hard to understand the magnitude of the advance, especially if you don't know how similar the demo is to the training data, for example. But if you have a product that's useful, then the advance is real. And I feel that the field has matured so much that you no longer need to rely on demos, or even benchmarks, as the only indicators of progress; usefulness is the truest indicator of progress. So I think this is a good sign for GPT-3, for sure, and yes, we were excited about the applications, and people are using GPT-3 all the time right now. Are there any uses that you've seen, that you're able to share, of the applications being built? There are plenty of applications. I remember seeing something that helps you write a resume, and something that helps improve your emails, I think I've seen something like this, I don't remember, but they all have this kind of flavor. I know that there are a lot of
users; unfortunately, I don't remember specific examples off the top of my head. This is jumping ahead a little bit in the progression of the research trajectory you've gone through at OpenAI, but maybe the biggest application, of course, and maybe it's not called GPT anymore, it's called Codex, but it's very similar: it's a system that can help you write programs. Can you say a bit about that? I'm curious, is it just like GPT but trained on GitHub code instead of text, or are there some differences? So the system that we described in the paper is essentially a GPT trained on code. It's that simple. The thing that's interesting about it is that it works as well as it does. You could say, what have you even done? You've done nothing, you just took a large neural net and you trained it on code from GitHub. But the result is not bad at all: it can solve real coding problems much better than I think most people would have expected. And again, this comes back to the power of deep learning, the power of these neural nets; they don't care what problem they solve. You could say, well, people can code, so an AI can as well, and if you believe that a biological neuron is not very different from an artificial one, then it's not an unreasonable belief at all. So then the question becomes, what's the training data? Predicting GitHub is not exactly the same as coding, so maybe it won't quite do the right thing, but it turned out to be good enough, and it turned out to be very useful, especially in situations where you have a library which you don't know. Because it has read all of GitHub, it has such familiarity with all the major libraries, and if you don't know a library but you just write a comment, "use this library to do X", you get code which is often going to be correct or pretty close. Then you have something to work with; you edit it a little bit and you have something working. But yeah, it's
just GPT trained to predict code pretty well. I think in many ways it's really mind-blowing in terms of potential societal impact, because if I think about the way we create impact in the world as people, we're often sitting behind a computer, typing things, whether it's typing emails, or writing up documents on work we've been doing, or writing code. This could really accelerate anybody's work and the kind of things we could do in one day. I don't know if we're already seeing metrics for this, but I would imagine that, if not now then in the next generation, and I'm curious about your thinking, what kind of productivity can we expect for people thanks to these tools? So I'd say that in the near term, productivity will continue to increase gradually. I think that as time goes by and the capability of AI systems increases, productivity will increase absolutely dramatically; I feel very confident that we will witness dramatic increases in productivity. Eventually, in the long term, a day will come when the world will be kind of like this: the AI is doing all the work, and that work is given to people to enjoy. That's what I think the long-term future will hopefully be like. So in the medium term there will be amazing productivity increases, and in the long-term future it's going to be like infinite productivity, or fully automated productivity. Now, one of the things that people think about a lot in that context: when you give an AI a lot of productivity, it had better be productive doing the right thing, and better not be productive, I don't know, blowing something up by mistake and so forth, or just misunderstanding what it's supposed to be doing. In that sense, I've been really curious about this project at OpenAI where reinforcement learning is combined with GPT. Can you say a bit more about that? Let me take a step back. We have these AI systems
that are becoming more and more powerful, and a great deal of their power is coming from us training them on very large datasets of which we have only an intuitive understanding. So they learn all kinds of things, and then they act in ways which we can inspect but perhaps not fully understand. Now, for these large language models, we do have some ability to control them through the prompt, and in fact the better the language model gets, the more controllable it becomes through the prompt. But we want more: we want our models to do exactly what we want, or to act as close to what we want as possible. So we had this project, indeed, that you alluded to, of training these language models with reinforcement learning from human feedback, where now you do reinforcement learning not against a simulator but against human judges who tell you whether the output was desirable or undesirable. If you think about it, this reinforcement learning environment is really exciting. You could even argue that reinforcement learning had maybe slowed down a little bit because there weren't really cool environments where you could do it, but doing reinforcement learning with language models and with people opens such a powerful vista; you can do so many things there. And what we've shown is what these large neural networks, these large GPT models, can do when you do reinforcement learning from these teachers, essentially. I should also mention a small technicality, a technical point for the ML-focused subset of the audience: in reinforcement learning, you're usually providing a reward, good or bad, but the way we do it with reinforcement learning from human feedback is that the teacher looks at two outputs from the model and says which one is better, because it's an easier task. It's an easier task to compare two things than to
say whether one thing is good or bad in absolute terms. Then we do a little bit of machine learning in order to create a reward model out of those comparisons, and then use this reward model to train the neural net. This is a pretty simple, efficient thing to do, and you obtain a very fine-grained way of controlling the behavior of these neural networks, of these language models. We've been using it quite a bit: recently we've been training these instruction-following models, which people can actually use through the OpenAI API. With GPT-3, the model is just trained on the internet, so you need to be quite clever about specifying your prompt to kind of coax the model into doing what you want, providing some examples, whereas the instruction-following model has been trained in this way to literally do what we tell it to. There's a word for this, which I think is known in some subsets of the machine learning community but not in all of it: it's called alignment. This is an attempt to align the model, so that the model, with its great power and unclear capabilities, will in fact be trained and incentivized to literally do what you want. With the instruction-following model you just tell it what you want, do X, write Y, modify Z, and it will do it, so it's really convenient to use. This is an example of the technique of reinforcement learning from human feedback in practice. Moving forward, of course, you want to learn from teachers in all kinds of ways, and you want to use machine learning not just to have people provide supervised examples or rewards; you really want to have a conversation, where you ask exactly the right question to learn the information that you need to understand the concept. That's how things will be in the future, but right now this approach has been used fairly successfully to make our GPT models more aligned than they are naturally.
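The comparison step described above is typically turned into a reward with a Bradley-Terry style logistic loss: fit a score r(x) so that sigmoid(r(preferred) - r(other)) is close to 1. Here is a minimal sketch of that idea, with a single made-up numeric feature standing in for a text output; real reward models are neural networks over text, and none of this is OpenAI's actual code.

```python
import math
import random

# Sketch of learning a reward model from pairwise comparisons: the "teacher"
# says which of two outputs is better, and we fit a scalar reward so that
# the preferred output scores higher (a Bradley-Terry style logistic loss).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: each output is one feature; the hidden teacher always
# prefers the output with the larger feature value.
random.seed(0)
pairs = []
for _ in range(500):
    a, b = random.random(), random.random()
    preferred, other = (a, b) if a > b else (b, a)
    pairs.append((preferred, other))

w = 0.0   # reward model: r(x) = w * x
lr = 0.5
for preferred, other in pairs * 10:
    p = sigmoid(w * preferred - w * other)  # P(preferred beats other)
    # gradient ascent on log p with respect to w
    w += lr * (1.0 - p) * (preferred - other)

print(w)  # w should come out clearly positive: higher feature, higher reward
```

Once fitted, the reward model stands in for the human judges, and the language model is then trained with ordinary reinforcement learning against it.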
And when you say aligned, as I understand it, you can also align them in a personalized way, so aligned to a specific person's preferences? Like, I could teach it to follow my preferences and you could have a different one? The answer is definitely yes. The specific model that I mentioned, the instruction-following model, is a single model, and it's been aligned; we say it's aligned, which is a way of saying that it's been trained and incentivized to follow the instructions it's given. So it's an interface, and a very convenient interface. Of course, with these neural nets anything is possible: they can do whatever you want, you can train them in literally any way you want, and you can personalize them in arbitrary ways. You could say, okay, for this user we do this, for that user we do that, and the user can be specified with a paragraph, or maybe with some of their past actions. So almost anything is possible. Now, when you say almost anything is possible, that also reminds me of a lot of our past conversations; it always seems like there are no limits to your imagination of what might be possible, and of angles to try to get there. And maybe one of the other most surprising recent results is this: traditionally, a lot of work in computer vision, in language processing, and in reinforcement learning happened in almost separate research arenas, but then recently you, together with collaborators at OpenAI, released the CLIP and DALL-E models that bring language and vision together, in some sense, into the same network, to really have a single network that can handle both at the same time. Again, I'm curious, how did you come to conclude, okay, this is the direction we should push, that maybe it becomes possible now to have this combined model that can handle both vision and language in the same model and effectively translate
between them as desired? Well, I think the underlying motivation here is that it seems implausible that the neural networks of the future will not have both vision and language, and that was the motivation to begin thinking in this direction. As to whether this should be possible, at least in my view there was plenty of evidence that neural networks should just succeed at this task if you make them large and you have an appropriate dataset. If they can generate language like they do, why can't they generate the language of images, or go in the other direction as well? So maybe it's good to think of it as an exploration of training neural networks on both images and text. With DALL-E, for context: DALL-E is literally a GPT-3 that is trained on text followed by almost a textual representation of an image; you use tokens to represent the image, so that from the perspective of the model it's just some kind of funky language. You can train GPT on English text and French text, it doesn't care, so what if you just had a different language which combined a human language with the language of images? That's DALL-E, and it worked exactly as you'd expect, and it was still a lot of fun to see a neural network generate images like it did. With CLIP, it was an exploration in the opposite direction: can a neural network learn to see using a lot of loose natural-language supervision? Can it learn a huge variety of visual concepts, and can it do so in a way that's very robust? I think the robustness point is especially important in my eyes, and let me explain what I mean by robustness. There is one thing which I think is especially notable and unsatisfying in neural networks for vision, which is that they make mistakes that a human being would never make. So we
spoke earlier about the ImageNet dataset and about training neural networks to recognize the images in this dataset, and you'd have neural nets which achieved superhuman performance on this dataset. Then you'd put one on your phone and start taking photos, and it would make all these disappointing mistakes. What's going on? It turns out that there are all kinds of peculiarities in this dataset which are hard to notice if you don't pay close attention. People have built all kinds of test sets with the same objects but maybe at unusual angles or in a different presentation, on which the ImageNet-trained neural network just failed. But the CLIP neural network, which was trained on this vast and loosely labeled data of images with text from the internet, was able to do well on all these variants of images; it was much more robust to the presentation of the visual concept. I think this kind of robustness is very important, because when it comes to human vision, you know, a third of our brain is dedicated to vision, and our vision is unbelievably good. I feel like this is a step towards making neural nets a little bit more robust, making their capability a little bit more in line with the capability of our own vision. Now, you say ImageNet versus the CLIP dataset, and the CLIP dataset is a lot larger. How much larger is it, what's the difference in size between those? Like hundreds of times larger, and it has open-ended categories, because the categories are just free-form text. But it's really not just the size, it's also the coverage and the variety: the data needs to be diverse, it needs to have a lot of stuff in it. If a dataset is narrow, it will hurt the neural network. When I look back at the last nine-ish years, since the ImageNet breakthrough, it seems like year after year there are new breakthroughs, new capabilities that didn't exist
before, many of them thanks to you, Ilya, and your collaborators. And I'm kind of curious: looking back at the last nine years and projecting forward, are there some things that you are particularly excited about, that we can't get to today, but that you're hopeful might become feasible in the next few years? Yeah, so I'd say that there is a sense in which the deep learning saga is actually a lot older than the past nine years. It's funny if you read some of the statements made by Rosenblatt, I think in the sixties. Rosenblatt invented the perceptron, which was one of the first neural networks that could learn something interesting on a real computer; it would learn some image classification. Then Rosenblatt went to the New York Times and said, one day a neural network will see and hear and translate and be conscious of itself and be your friend, something like this. He was trying to raise money to build increasingly larger computers, and he had academic detractors who didn't like the way funding was, in their minds, misallocated, and that led to the first major neural network winter. I think these ideas were always there in the background; it's just that the environment wasn't ready, because you needed both the data and the compute, and as soon as the data and the compute became ready, you were able to jump on this opportunity and materialize the progress. I fully expect that progress will continue. I think that we will have far more capable neural networks. I don't want to be too specific about what exactly may happen, because it's hard to predict those things, but one thing which would be nice is to see our neural networks being even more reliable than they are, so reliable that you can really trust their output, and when they don't know something, they'll just tell you and
maybe ask for verification. I think that would be quite impactful. I think they will be taking a lot more action than they are right now; our neural networks are still quite inert and passive, and they'll be much more useful, their usefulness will continue to grow. And I'm totally certain that we will need some kind of new ideas, even if those new ideas may have the form of looking at things differently from the way we're looking at them right now. I would argue that a lot of the major progress in deep learning has this form. For example, the most recent progress in unsupervised learning: what was done, what's different? We just trained larger language models, but language models existed in the past; we just realized that they were the right thing all along. So I think there will be more realizations like this, where things that are right in front of our noses are actually far more powerful and far more capable than we expected. And again, I do expect that the capability of these systems will continue to increase, they will become increasingly more impactful in the world, and it will become a much greater topic of conversation. I think we will see unbelievable, truly unbelievable applications, incredible, positive, very transformative applications. We could imagine lots of them with very powerful AI. And eventually I really do think that we'll be in a world where the AI does the work and we, the people, enjoy that work and use it to our benefit and enjoyment. That's part of the reason OpenAI is a capped-profit company, where after we return our obligations to our investors, we turn back into a nonprofit, so that we can help materialize this future vision where you have this useful AI that's doing all the work and all the people get to enjoy it. And that's really beautiful. I like the model you have there
because it essentially reflects, in some sense, the vision that the benefits of really capable AI could be unlimited, and it's not great to concentrate an unlimited benefit in a very small group of people, because that's just not good for the rest of the world. So I love the model you have there. One of the things that ties into this, Ilya, is that maybe AI is also becoming more expensive. A lot of people talk about how, in training models, you want a bigger model, because that's going to be more capable, but then you need the resources to train those bigger models. I'm really curious about your thinking on that: is it just going to be the more money, the bigger the model, the more capable, or is it possible that the future is different? So, there is a huge amount of incentive to increase the efficiency of our models and to find ways to do more with less, and this incentive is very strong and affects everyone in the field. I fully expect that in the future we'll be able to do much more at a fraction of the cost we pay right now. I think that's just going to happen for sure: the cost of hardware will drop, and methods will become more efficient in all sorts of ways; there are multiple dimensions of efficiency that our models could utilize. At the same time, I also think it is true that bigger models will always be better; I think that's just a fact of life. And I expect there will be almost a kind of power law of different models doing different things. You'll have very powerful models in small numbers that are used for certain tasks, and you'll have many more smaller models that are still hugely useful, and then you'll have even more models which are smaller and more specialized. So you have this kind of continuum of size and specialization, and it's going to be an ecosystem, not unlike how in nature there are animals that will occupy any niche. And so I expect that
the same thing will happen with compute: for every level of compute there will be some optimal way of using it, and people will find that way and create very interesting applications.

I love your vision, Ilya. I think we actually covered a tremendous amount already, and I'm really intrigued by everything we covered, but there's one question that's really still on my mind that I'm hoping we can get through, which is this: Ilya, you've been behind a lot of the breakthroughs in AI in the last 10 years, actually even a bit before that, and I'm just kind of curious, what does your day look like? What do you think are some habits, or things in your schedule, or things you do that help you be creative and productive?

It's hard to give useful blanket advice like this, but maybe two answers: protecting my time, and just trying really hard. You know, I don't think there is an easy way. You just gotta embrace the suffering and push through it, because pushing those walls is where the good stuff is found.

Now, when you say protecting your time, which really resonates of course, then you get to choose how you fill it in. And I'm kind of curious, if you just look at, let's say, the last week or the week before, in that protected time, what are you doing? Are you going on walks? Are you reading papers? Are you brainstorming with people? What's going on?

Yeah, I'd say, mostly in my case, it would be not necessarily going on walks, but lots of solitary work. And there are people with whom I have very intense research conversations, which are very important. I think those are the main things I do.

I do know that you're also an artist, or aspiring artist, or whatever we want to call it, at the same time. Do you think that plays a role at all in boosting your creativity?

I mean, I'm sure it doesn't hurt. It's hard to know with
these things, obviously, but yeah, I think it can only help.

Well, Ilya, it's so wonderful to have had this chance to chat with you. It's been way too long since we've had a chance to catch up, and this has been so good, to get to know you even better than before. Thank you so much for making the time.

Thank you, Pieter, it was a great pleasure being on the podcast. [Music]
Info
Channel: The Robot Brains Podcast
Views: 118,125
Keywords: The Robot Brains Podcast, Podcast, AI, Robots, Robotics, Artificial Intelligence
Id: fCoavgGZ64Y
Length: 78min 56sec (4736 seconds)
Published: Tue Sep 21 2021