GPT-3 bottleneck is training data | François Chollet and Lex Fridman

Captions
Lex Fridman: GPT-3, similar to GPT-2, has captivated some part of the imagination of the public. There's a bunch of hype of different kinds, and I would say it's emergent; it's not artificially manufactured, people just get excited for some strange reason. In the case of GPT-3, which is funny, I believe there was a couple of months' delay from release to hype. Maybe I'm not historically correct on that, but it feels like there was a bit of a lack of hype and then a phase shift into hype. Nevertheless, there's a bunch of cool applications that seem to captivate the imagination of the public about what this language model, trained in an unsupervised way without any fine-tuning, is able to achieve. So what do you make of that? What are your thoughts about GPT-3?

François Chollet: I think what's interesting about GPT-3 is the idea that it may be able to learn new tasks after just being shown a few examples. If it's actually capable of doing that, that's novel, that's very interesting, and that's something we should investigate. That said, I'm not entirely convinced that we have shown it's capable of doing that. Given the amount of data the model is trained on, it's very likely that what it's actually doing is pattern matching: it matches a new task you give it against tasks it has been exposed to in its training data. It's just recognizing the task, rather than developing a model of the task.

Lex Fridman: To interrupt, there's a parallel to what you've said before, which is that it's possible to see GPT-3's prompt as a kind of SQL query into this thing that it has learned, similar to your earlier point that language is used to query memory. So is it possible that a neural network is a giant memorization machine, but if it gets sufficiently giant, it memorizes sufficiently large amounts of things about the world and becomes more intelligent, becomes a querying machine?

François Chollet: I think it's possible that a significant chunk of intelligence is this giant associative memory. I definitely don't believe that intelligence is just a giant associative memory, but it may well be a big component.
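A minimal sketch of the few-shot setup being discussed, assuming a generic text-completion API; the `complete` function below is a hypothetical stand-in, not any particular vendor's client:

```python
def complete(prompt: str) -> str:
    # Stub so the sketch runs end-to-end; a real system would call a
    # language-model completion API here.
    return " pain"

# Three examples "demonstrate" English-to-French translation...
prompt = (
    "English: cheese\nFrench: fromage\n"
    "English: house\nFrench: maison\n"
    "English: dog\nFrench: chien\n"
    "English: bread\nFrench:"
)

# ...but a model pretrained on web-scale text has seen countless
# translation pairs, so a plausible completion ("pain") is equally
# consistent with *recognizing* the task from training data as with
# *learning* the task from three examples -- Chollet's point above.
print(complete(prompt))
```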
Lex Fridman: So do you think GPT-4, 5, GPT-10 will eventually... where's the ceiling? Do you think it'll be able to reason? No, that's a bad question. "What is the ceiling?" is the better question: how well is it going to scale, how good is GPT-N going to be?

François Chollet: I believe GPT-N is going to improve on the strength of GPT-2 and GPT-3, which is that it will be able to generate ever more plausible text in context. If you train a bigger model on more data, your text will be increasingly context-aware and increasingly plausible, in the same way that GPT-3 is much better at generating plausible text than GPT-2 was. That said, I don't think just scaling up the model to more transformer layers and more training data is going to address the flaw of GPT-3, which is that it can generate plausible text, but that text is not constrained by anything other than plausibility. In particular, it's not constrained by factualness or even consistency, which is why it's very easy to get GPT-3 to generate statements that are factually untrue, or even self-contradictory: its only goal is plausibility, and it has no other constraints. It's not constrained to be self-consistent, for instance. For this reason, one thing I found very interesting with GPT-3 is that you can prime the answer it gives you by asking the question in a specific way, because it's very responsive to the way you ask the question, since it has no understanding of the content of the question. If you ask the same question in two different ways that are adversarially engineered to produce a certain answer, you will get two different, contradictory answers.

Lex Fridman: So it's very susceptible to adversarial attacks, essentially.

François Chollet: Potentially, yes. In general, the problem with these generative models is that they're very good at generating plausible text, but that's just not enough. One avenue that I think would be very interesting for making progress is to make it possible to write programs over the latent space that these models operate on. You would rely on these self-supervised models to generate a sort of pool of knowledge and concepts and common sense, and then you would write explicit reasoning programs over it. The current problem with GPT-3 is that it can be quite difficult to get it to do what you want. If you want to turn GPT-3 into a product, you need to put constraints on it, you need to force it to obey certain rules, so you need a way to program it explicitly.
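One way to read the "programs over the latent space" idea is the toy sketch below: a frozen encoder supplies embeddings as a pool of knowledge, and an explicit, hand-written program does retrieval and constraint-checking on top. The `embed` function here is a hashing stand-in with no real semantics, purely so the sketch runs; a real system would use a pretrained self-supervised encoder:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in for a pretrained encoder: deterministic within a run,
    # but carries no actual meaning (illustrative assumption).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# The "pool of knowledge and concepts" the generative model would supply.
knowledge = [
    "water boils at 100 C",
    "Paris is the capital of France",
    "GPT-3 has 175 billion parameters",
]
pool = {fact: embed(fact) for fact in knowledge}

def retrieve(query: str) -> str:
    # Explicit program step 1: similarity search over the latent pool.
    q = embed(query)
    return max(pool, key=lambda fact: float(q @ pool[fact]))

def answer(query: str) -> str:
    # Explicit program step 2: a hard constraint a plausibility-only
    # generator would not enforce -- only return facts in the pool.
    return retrieve(query)

# With a real encoder this would return the Paris fact; the hashing
# stand-in retrieves an arbitrary entry.
print(answer("capital of France?"))
```

The design point is that plausibility comes from the learned latent space, while factualness and consistency are enforced by the explicit program wrapped around it.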
Lex Fridman: If you look at its ability to do program synthesis, it generates, like you said, something that's plausible.

François Chollet: Yes, if you try to make it generate programs, it will perform well for any program it has seen in its training data. But because program space is not interpolative, it's not going to be able to generalize to problems it hasn't seen before.

Lex Fridman: This is sort of an absurd but, I think, useful intuition builder: GPT-3 has 175 billion parameters; the human brain has about a thousand times that, or more, in terms of number of synapses. Obviously these are very different kinds of things, but there is some degree of similarity. What do you think GPT will look like when it has 100 trillion parameters? Do you think our conversation might be different in nature, given that you've criticized GPT-3 very effectively just now?

François Chollet: No, I don't think so. To begin with, the bottleneck with scaling GPT models, generative pre-trained transformer models, is not going to be the size of the model or how long it takes to train it. The bottleneck is going to be the training data, because OpenAI is already training GPT-3 on a crawl of basically the entire web, and that's a lot of data. You could imagine training on more data than that, Google could train on more data than that, but it would still be only incrementally more data. I don't recall exactly how much more data GPT-3 was trained on compared to GPT-2, but it's probably at least 100x, maybe even 1,000x; I don't have the exact number. You're not going to be able to train the model on 100x more data than you're already using.

Lex Fridman: That's brilliant. It's easier to think of compute as a bottleneck and then argue that we can remove that bottleneck.

François Chollet: We can remove the compute bottleneck; I don't think it's a big problem. If you look at the pace at which we've improved the efficiency of deep learning models in the past few years, I'm not worried about training-time bottlenecks or model-size bottlenecks. The bottleneck in the case of these generative transformer models is absolutely the training data.

Lex Fridman: What about the quality of the data?

François Chollet: The quality of the data is an interesting point. If you're going to use these models in real products, you want to feed them data that's as high-quality, as factual, and, I would say, as unbiased as possible. Of course, there's not really such a thing as unbiased data in the first place, but you probably don't want to train it on Reddit, for instance; that sounds like a bad plan. From my personal experience working with large-scale deep learning models: at some point I was working on a model at Google that was trained on about 350 million labeled images. It was an image classification model, and that's a lot of images, probably most publicly available images on the web at the time. It was a very noisy dataset, because the labels were not originally annotated by hand by humans; they were automatically derived from things like tags on social media, or keywords on the same page the image was found on, and so on. It was very noisy, and it turned out that if you train on more of the noisy data, you get an incrementally better model but very quickly hit diminishing returns. On the other hand, if you train on a smaller dataset with higher-quality annotations, annotations actually made by humans, you get a better model, and it also takes less time to train.

Lex Fridman: That's fascinating. With self-supervised learning, is there a way to get better at doing the automated labeling?

François Chollet: Yes, you can enrich or refine your labels in an automated way. That's correct.
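A small sketch of the kind of automated label refinement mentioned here: train on the noisy labels, then overwrite labels the model confidently disagrees with. The synthetic dataset, 20% noise rate, and 0.9 confidence threshold are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
true_y = (X[:, 0] > 0).astype(int)   # ground truth (unknown in practice)

# Simulate noisy web-derived labels: 20% are flipped, as with labels
# scraped from tags or surrounding page text.
flipped = rng.random(1000) < 0.20
noisy_y = np.where(flipped, 1 - true_y, true_y)

# Train on the noisy labels, then score every example.
model = LogisticRegression().fit(X, noisy_y)
proba = model.predict_proba(X)[:, 1]

# Refine: where the model is confident and disagrees with the given
# label, trust the model and rewrite the label.
refined_y = noisy_y.copy()
refined_y[(proba > 0.9) & (noisy_y == 0)] = 1
refined_y[(proba < 0.1) & (noisy_y == 1)] = 0

print("labels corrected:", int((refined_y != noisy_y).sum()))
print("agreement with truth:", float((refined_y == true_y).mean()))
```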
Lex Fridman: Do you have hope for, I don't know if you're familiar with it, the idea of the semantic web? For people who are not familiar, it's the idea of being able to attach semantic meaning to the words on the internet, the sentences, the paragraphs; to convert information on the internet, or some fraction of it, into something interpretable by machines. That was the dream of the semantic web papers in the 90s: the internet is full of rich, exciting information, even just looking at Wikipedia, and we should be able to use it as data for machines. But that information is not really in a format that's available to machines.

François Chollet: No, I don't think the semantic web will ever work, simply because it would be a lot of work to provide that information in structured form, and there is not really any incentive for anyone to do that work. I think the way forward to make the knowledge on the web available to machines is actually something closer to unsupervised deep learning. GPT-3 is actually a bigger step in the direction of making the knowledge of the web available to machines than the semantic web was.

Lex Fridman: Perhaps in a human-centric sense, it feels like GPT-3 hasn't learned anything that could be used to reason. But that might be just the early days.

François Chollet: I think that's correct. The forms of reasoning you see it perform are basically just reproductions of patterns it has seen in its training data. Of course, if you're trained on the entire web, you can produce an illusion of reasoning in many different situations, but it will break down if it's presented with a novel situation.

Lex Fridman: That's the open question: the difference between the illusion of reasoning and actual reasoning.

François Chollet: Yes, the power to adapt to something that is genuinely new. Because the thing is, even if you could train on every bit of data ever generated in the history of humanity, that model would be capable of anticipating many different possible situations, but it remains that the future is going to be something different. For instance, if you train a GPT-3 model on data from the year 2002 and then use it today, it's going to be missing many things: it's going to be missing many common-sense facts about the world, it's even going to be missing vocabulary, and so on.

Lex Fridman: It's interesting that GPT-3 doesn't even have, I think, any information about the coronavirus.

François Chollet: Yes. You can tell that a system is intelligent when it's capable of adapting. So intelligence is going to require some amount of continuous learning, but it's also going to require some amount of improvisation. It's not enough to assume that what you're going to be asked to do is something you've seen before, or something that is a simple interpolation of things you've seen before. In fact, that model breaks down even for tasks that look relatively simple from a distance, like L5 self-driving, for instance. Google had a paper a couple of years back showing that something like 30 million different road situations were actually completely insufficient to train a driving model; it wasn't even L2. And that's a lot of data, a lot more than the 20 or 30 hours of driving a human needs to learn to drive, given the knowledge they've already accumulated.
Info
Channel: Lex Clips
Views: 26,951
Keywords: françois chollet, artificial intelligence, ai, ai podcast, artificial intelligence podcast, lex clips, lex fridman, lex friedman, joe rogan, elon musk, lex podcast, lex mit, lex ai, mit ai, ai podcast clips, ai clips, deep learning, machine learning, computer science, engineering, physics, science, tech, technology, tech podcast, physics podcast, mathematics, math, math podcast, friedman, consciousness, philosophy, turing, einstein
Id: UyPMS3Gvdko
Length: 15min 19sec (919 seconds)
Published: Tue Sep 01 2020