Ilya Sutskever | The era of AI will come in full swing | GPT-4o is the most powerful AI model

Video Statistics and Information

Captions
I think probably two years before that, it became clear to me that supervised learning is what was going to get us traction, and I can explain precisely why. It wasn't just an intuition; it was, I would argue, an irrefutable argument, which went like this: if your neural network is deep and large, then it can be configured to solve a hard task. Those are the key words: deep and large. People weren't looking at large neural networks; people were maybe studying a little bit of depth in neural networks, but most of the machine learning field wasn't even looking at neural networks at all. They were looking at all kinds of Bayesian models and kernel methods, which are theoretically elegant methods that have the property that they actually can't represent a good solution no matter how you configure them, whereas a large and deep neural network can represent a good solution to the problem. To find that good solution you need a big dataset, and a lot of compute to actually do the work.

We had also worked on optimization for a little bit. It was clear that optimization was a bottleneck, and there was a breakthrough by another grad student in Geoff Hinton's lab, James Martens, who came up with an optimization method, different from the ones we're using now, some second-order method. The point about it is that it proved we could train those neural networks, because before that we didn't even know we could train them. So if you can train them, you make it big, you find the data, and you will succeed.

Then the next question is: what data? Back then the ImageNet dataset seemed unbelievably difficult, but it was clear that if you were to train a large convolutional neural network on that dataset, it must succeed, if you just could have the compute.

And right at that time GPUs came out. You and I, our history and our paths intersected, and somehow you had the observation that a GPU, and at that time we were a couple of generations into CUDA GPUs, I think it was the GTX 580 generation, you had the insight that the GPU could actually be useful for training your neural network models. How did that day start? You never told me about that moment.

Yeah. So the GPUs appeared in our Toronto lab thanks to Geoff, and he said, we got these GPUs, we should try them. We started trying and experimenting with them, and it was a lot of fun, but it was unclear what to use them for exactly, where you were going to get the real traction. But then, with the existence of the ImageNet dataset, it became very clear that the convolutional neural network is such a great fit for the GPU that it should be possible to make it go unbelievably fast, and therefore train something which would be completely unprecedented in terms of its size. And that's how it happened. Very fortunately, Alex Krizhevsky really loved programming the GPU, and he was able to program really fast convolutional kernels, and then train the neural net on the ImageNet dataset, and that led to the result.

It shocked the world. It broke the record of computer vision by such a wide margin that it was a clear discontinuity.
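To make the fit between convolutional networks and GPUs concrete, here is a minimal PyTorch sketch (my illustration, not the original cuda-convnet or AlexNet code): the same small AlexNet-style stack runs unchanged on CPU or GPU, and a single .to(device) call moves it onto the accelerator, where the convolution kernels execute massively in parallel.

```python
# Illustrative sketch only: a tiny AlexNet-style network moved to the GPU.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(                                   # small AlexNet-style stack
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(192, 1000),                                # 1000 ImageNet classes
).to(device)

images = torch.randn(128, 3, 224, 224, device=device)   # dummy ImageNet-sized batch
logits = model(images)                                   # one forward pass on the device
print(logits.shape)                                      # torch.Size([128, 1000])
```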
Yes, and I would say there is another bit of context there. When we say we broke the record, I think there's a different way to phrase it: that dataset was so obviously hard and so obviously out of reach of anything. People were making progress with some classical techniques, and they were actually doing something, but this thing was so much better, on a dataset which was so obviously hard. It wasn't just some competition, it wasn't an average benchmark; it was so obviously difficult, so obviously out of reach, and so obviously with the property that if you did a good job, that would be amazing.

The big bang of AI. Fast forward to now: you came out to the Valley, you started OpenAI with some friends, and you're the chief scientist now. What was the initial idea about what to work on at OpenAI? Because you worked on several things, and some of the trails of invention and work, you can see, led up to the ChatGPT moment. What was the initial inspiration? How would you approach intelligence from that moment, and what led to this?

Yeah. Obviously when we started it wasn't 100% clear how to proceed, and the field was also very different compared to the way it is right now. Right now we're already used to these amazing artifacts, these amazing neural nets that are doing incredible things, and everyone is so excited. But back in 2015 and early 2016, when we were starting out, the whole thing seemed pretty crazy. There were so many fewer researchers, maybe between a hundred and a thousand times fewer people in the field compared to now. Back then you had maybe a hundred people, most of them working at Google and DeepMind, and that was that. Then there were people picking up the skills, but it was still very scarce, very rare.

We had two big initial ideas at the start of OpenAI that had a lot of staying power, and they have stayed with us to this day. I'll describe them now. The first big idea, one which I was especially excited about very early on, is the idea of unsupervised learning through compression. Some context: today we take it for granted that unsupervised learning is this easy thing, you just pre-train on everything and it all works exactly as you'd expect. In 2016, unsupervised learning was an unsolved problem in machine learning that no one had any insight into, any clue as to what to do. Yann LeCun would go around giving talks saying that unsupervised learning was this grand challenge. And I really believed that really good compression of the data would lead to unsupervised learning.

Now, compression is not language that's commonly used to describe what is really being done, until recently, when it suddenly became apparent to many people that those GPTs actually compress the training data. You may recall the Ted Chiang New York Times article which also alluded to this. But there is a real mathematical sense in which training these autoregressive generative models compresses the data, and intuitively you can see why that should work: if you compress the data really well, you must extract all the hidden secrets which exist in it. Therefore that is the key. So that was the first idea we were really excited about, and it led to quite a few works at OpenAI, including the sentiment neuron, which I'll mention very briefly.
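As a back-of-the-envelope illustration of the compression view (my sketch, not something stated in the conversation): the total negative log2-probability a next-token model assigns to a sequence is the number of bits an entropy coder would need to encode that sequence using the model's predictions, so a better predictor literally compresses the same text into fewer bits.

```python
# Illustrative sketch of "prediction = compression": the cross-entropy (in bits)
# a next-token model assigns to a sequence equals the code length an entropy
# coder would need when driven by the model's probabilities.
import math

def compressed_bits(token_probs):
    """token_probs: probability the model gave to each observed next token."""
    return sum(-math.log2(p) for p in token_probs)

# Hypothetical probabilities for the same 8-token sequence under two models.
weak_model   = [0.05, 0.10, 0.08, 0.05, 0.12, 0.06, 0.09, 0.07]
strong_model = [0.40, 0.55, 0.30, 0.25, 0.60, 0.35, 0.50, 0.45]

print(f"weak model:   {compressed_bits(weak_model):.1f} bits")    # ~30 bits
print(f"strong model: {compressed_bits(strong_model):.1f} bits")  # ~10 bits
# The better predictor compresses the same text into fewer bits, which is the
# sense in which training an autoregressive model "compresses" its data.
```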
This work might not be well known outside of the machine learning field, but it was very influential, especially in our thinking. The result was that when you train a neural network, and back then it was not a Transformer, it was before the Transformer, a small recurrent neural network, an LSTM.

Sequence work you've done, I mean, some of the work that you've done yourself.

Yeah, the same LSTM with a few twists, trained to predict the next token in Amazon reviews, the next character. And we discovered that if you predict the next character well enough, there will be a neuron inside that LSTM that corresponds to its sentiment. That was really cool, because it showed some traction for unsupervised learning, and it validated the idea that really good next-character prediction, next-something prediction, compression, has the property that it discovers the secrets in the data. That's what we see with these GPT models: you train them, and people say it's just statistical correlation, but at this point it should be so clear to anyone.

That observation also intuitively opened up for me the whole world of where to get the data for unsupervised learning, because I do have a whole lot of data: if I can just make you predict the next character, I know what the ground truth is, I know what the answer is, and I can train a neural network model with that. So that observation, and masking, and other approaches, opened my mind about where the world would get all the data for unsupervised learning.

Well, I would phrase it a little differently. I would say that with unsupervised learning, the hard part has been less about where you get the data from, though that part is there as well, especially now. It was more about why you should do it in the first place, why you should bother. The hard part was to realize that training these neural nets to predict the next token is a worthwhile goal at all.

That it would learn a representation, that it would be able to understand grammar and so on.

That's right, that it would be useful. But it just wasn't obvious, so people weren't doing it. The sentiment neuron work, and I want to call out Alec Radford as the person who was really responsible for many of the advances there, this was before GPT-1; it was the precursor to GPT-1, and it influenced our thinking a lot. Then the Transformer came out, and we immediately went, oh my god, this is the thing. And we trained GPT-1.

Now, along the way you've always believed that scaling will improve the performance of these models: larger networks, deeper networks, more training data. There was a very important paper that OpenAI wrote about the scaling laws and the relationship between loss and the size of the model and the size of the dataset. When Transformers came out, it gave us the opportunity to train very, very large models in a very reasonable amount of time. But with the intuition about the scaling laws of model and data size, and your journey of GPT-1, 2, 3, which came first: did you see the evidence of GPT-1 through 3 first, or was there the intuition about the scaling law first?

The intuition. The way I'd phrase it is that I had a very strong belief that bigger is better, and that one of the goals we had at OpenAI was to figure out how to use scale correctly.
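As a toy illustration of the kind of loss-versus-size relationship the scaling-laws work described (synthetic numbers and an assumed power-law form, not OpenAI's actual measurements): if loss falls as a power law in model size, the exponent drops out of a straight-line fit in log-log space.

```python
# Toy scaling-law fit on synthetic data (not OpenAI's measurements):
# assume loss follows L(N) = a * N**(-alpha) + L_inf in parameter count N,
# and recover alpha with a log-log linear fit on the reducible loss.
import numpy as np

L_inf, a, alpha = 1.7, 400.0, 0.08                 # assumed "true" constants
N = np.array([1e6, 1e7, 1e8, 1e9, 1e10])           # model sizes (parameters)
loss = a * N ** (-alpha) + L_inf                   # synthetic, noise-free losses

# Fit: log(L - L_inf) = log(a) - alpha * log(N)
slope, intercept = np.polyfit(np.log(N), np.log(loss - L_inf), 1)
print(f"recovered alpha = {-slope:.3f}, a = {np.exp(intercept):.1f}")
# -> alpha = 0.080, a = 400.0: bigger models give predictably lower loss.
```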
There was a lot of belief at OpenAI about scale from the very beginning. The question was what to use it for precisely. Right now we're talking about the GPTs, but there's another very important line of work which I haven't mentioned, the second big idea, and I think now is a good time to make a detour: reinforcement learning. That clearly seemed important as well; what do you do with it?

The first really big project done inside OpenAI was our effort at solving a real-time strategy game. For context, a real-time strategy game is like a competitive sport: you need to be smart, you need to have a quick reaction time, there's teamwork, and you're competing against another team. It's pretty involved, and there is a whole competitive league for that game. The game is called Dota 2. We trained a reinforcement learning agent to play against itself, with the goal of reaching a level where it could compete against the best players in the world. That was a major undertaking as well, and it was a very different line: it was reinforcement learning.

Yeah, I remember the day you announced that work. This is, by the way, what I was asking about earlier: there's a large body of work that has come out of OpenAI, and some of it seemed like detours. But in fact, as you're explaining now, they may have seemed like detours, yet they really led up to some of the important work we're now talking about: ChatGPT.

Yeah, there has been real convergence, where the GPTs produce the foundation, and reinforcement learning from Dota morphed into reinforcement learning from human feedback, and that combination gave us ChatGPT.

You know, there's a misunderstanding that ChatGPT is, in itself, just one giant large language model. There's a system around it that's fairly complicated. Could you explain briefly for the audience the fine-tuning of it, the reinforcement learning of it, the various surrounding systems that allow you to keep it on rails and give it knowledge, and so on?

Yeah, I can. The way to think about it is that when we train a large neural network to accurately predict the next word in lots of different texts from the internet, what we are doing is learning a world model. It may look on the surface like we are just learning statistical correlations in text, but it turns out that to learn the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world: there is a world out there, and it has a projection onto this text. So what the neural network is learning is more and more aspects of the world, of people, of the human condition, their hopes, dreams, and motivations, their interactions, and the situations we are in. The neural network learns a compressed, abstract, usable representation of that. This is what's being learned from accurately predicting the next word. And furthermore, the more accurate you are at predicting the next word, the higher the fidelity, the more resolution you get in this process. That's what the pre-training stage does.
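A minimal sketch of that pre-training objective (a deliberate simplification: a tiny embedding-plus-linear model stands in for a large Transformer): shift the tokens by one position and minimize cross-entropy on the next-token prediction.

```python
# Minimal next-token pre-training step (illustrative; real GPTs are large
# Transformers, here a tiny embedding + linear "model" stands in for one).
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, vocab_size, (8, 128))   # dummy batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 from token t

logits = model(inputs)                            # (batch, seq, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()
optimizer.step()
print(f"next-token cross-entropy: {loss.item():.2f}")  # roughly ln(1000) ~ 6.9 at init
```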
But what pre-training does not do is specify the desired behavior that we wish our neural network to exhibit. You see, what a language model really tries to do is answer the following question: if I had some random piece of text on the internet, which starts with some prefix, some prompt, what will it complete to? It's as if you just randomly landed on some text from the internet. But this is different from: I want to have an assistant which will be truthful, which will be helpful, which will follow certain rules and not violate them. That requires additional training.
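As a hedged sketch of one piece of that additional training (the standard preference-based reward-modelling step commonly used for RLHF; illustrative code, not OpenAI's): human labellers pick the better of two completions, and a reward model is trained so the preferred completion scores higher. That reward then drives the reinforcement learning stage that turns the base next-word predictor into an assistant.

```python
# Sketch of the preference step behind RLHF (illustrative, not OpenAI's code):
# a reward model scores completions, and the Bradley-Terry loss pushes the
# human-preferred completion to score higher than the rejected one.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Stand-in reward model: embeds a completion and outputs a scalar score."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, token_ids):                 # (batch, seq) -> (batch,)
        return self.score(self.embed(token_ids).mean(dim=1)).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

chosen   = torch.randint(0, 1000, (4, 32))        # completions labellers preferred
rejected = torch.randint(0, 1000, (4, 32))        # completions labellers rejected

loss = -nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
loss.backward()
optimizer.step()
# The trained reward model then serves as the objective for the RL stage
# (e.g. PPO) that steers the base model toward helpful, rule-following behavior.
```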
Info
Channel: Me&ChatGPT
Views: 10,284
Keywords: AI, GPT, GPT-4o, Ilya Sutskever, chatgpt
Id: f96Xrg4uNik
Length: 17min 49sec (1069 seconds)
Published: Sat May 25 2024