Is ChatGPT A Step Toward Human-Level AI? — With Yann LeCun, Meta Chief AI Scientist

Captions
Alex Kantrowitz: Yann, welcome to the show.

Yann LeCun: Awesome to be here.

Alex: Great to have you back. This time we get to do it one on one, which is nice — it was great having you and Daniel Kahneman on. You and I have been speaking for a long time, since 2015 or 2016, about the possibilities for artificial intelligence, and there was a quick moment where everyone started talking about crypto, but now I think we're focused on the right stuff again.

Yann: It was a distraction.

Alex: So let's talk a little bit about the wave of generative artificial intelligence we're seeing, and how that might relate to general intelligence. When we were talking for the first time, we were talking about how we can make a machine that thinks like a human, and you told me: there's going to be a time when they start to predict, and if they can predict, they can plan, and that's how we're going to get close to artificial intelligence. And I said, okay, that sounds nice, but that's never going to happen — at least that's what I was thinking. Now we're starting to chat with some of these advances, like ChatGPT, and I'm starting to think, okay, maybe it wasn't that far off. So where are we now in the pursuit of artificial general intelligence? Is this a big step toward that, or is it again maybe a distraction?

Yann: The short answer is that it's not a particularly big step toward human-level intelligence. I don't like the term AGI — artificial general intelligence — because human intelligence is very specialized, so designating the type of intelligence we observe in humans as "general" is a complete misnomer. I know it's maybe a cheaper sell, but I want to make the point that human intelligence is actually very specialized. So no. First of all, from the scientific point of view, ChatGPT is not a particularly interesting scientific advance. Of course, we don't know all the details, because OpenAI has not published any. For a company called OpenAI, it's the least open you can imagine. They started out wanting to be open, and then they realized they couldn't fund their research unless they went somewhat secretive.

Alex: Can we pause on that? Because that's interesting. What does that say about the AI research world — that if you wanted to start out open, you couldn't; you had to go for-profit, and now they're capped-profit? It's interesting that it's impossible to fund this type of research without that. Does something structurally need to change because of that?

Yann: Well, there are a lot of non-profit AI research organizations — they're called universities. But there are also other non-profits, like the Allen Institute for AI in Seattle, for example. OpenAI initially was non-profit and then switched to for-profit, and originally published everything they were doing and now basically keep everything secret. They've become sort of a contract research house for Microsoft, to some extent, because the funding model is not clear. The reverse phenomenon occurred at Google: when I created FAIR nine years ago, we had a big drumroll about the fact that we were going to be completely open, and we still are — we're holding that line — and as a result it caused Google Brain at
the time to become much more open than they were, because that's what the scientists wanted. If you tell a scientist, "You can come work for us, but you can't say a word about what you're doing," you're killing their career. And open research is much more efficient: you just get more out of it, the results are more reliable, you attract better people, and you have a bigger intellectual impact, which means people have more respect for you and want to work for you. So there are a lot of advantages to this, but there has to be an economic model, and the only one I know outside of universities and philanthropy is an industrial research lab inside a large company — one that is profitable and sufficiently well established in its market that it can think long term and invest in fundamental research. That's the case for certain corners of Google, that's the case for FAIR at Meta, and not quite for DeepMind. DeepMind is interesting because they started out as a startup, and you absolutely cannot do research in a startup — you just don't have the funds or the ability to wait long term. You can do it for two or three years, but then you basically have to focus your entire attention on building products, getting revenue, and getting the company to survive. What allowed them to do what they're doing is that they got bought by Google. But even then their economic model was not obvious, because they were sort of an ivory tower separated from Google — to some extent they still are — and Google had the foresight to fund them regardless of whether they were producing something useful. In the current context of efficiency and money-saving in the tech industry, though, that model might have to change. It's not clear that the economic return Google has gotten from DeepMind after nine or ten years is worth the investment, so they're banking on a bigger, longer-term payoff. For FAIR, the business case is very clear: FAIR has had a huge impact on the company, mostly indirectly, through other groups, because FAIR does not work on products. Anyway — long digression, sorry. OpenAI could not keep doing what they were doing unless they went commercial, essentially, and that also caused them to make wide promises. What they have to do to raise enough money from Microsoft and others is make very flashy demos, and they're very good at that. Their technology is not particularly innovative from the scientific point of view, but it's very well engineered: they put together a large, scaled-up system trained on very well curated data. They know what they're doing. But in terms of scientific advances, there's not much.

Alex: Okay. And by the way, thank you for that little diversion — I think it's good we went into that discussion of how this stuff gets funded and where it goes. But going back to our discussion of the march
toward general intel— sorry, I'll use your term: human-level intelligence. [Laughter] From a layperson's point of view, it does feel like: okay, now I'm talking to AI; now AI understands what I think and can actually draw it; now it can take my voice and start talking on its own. So why isn't that a step toward intelligence?

Yann: Because the understanding those current systems have of the underlying reality that language expresses is extremely shallow. Those systems have only been trained with text — a huge amount of text — so they can regurgitate text they've seen and interpolate for new situations; they can even produce code and things like that. But they do not understand; they have no knowledge of the underlying reality. They've never had any contact with the physical world. If I take a piece of paper — let me look for a piece of paper — and I hold it like this, and I tell you I'm going to lift my hand from one side, you can predict exactly what's going to happen. For listeners: the paper is being held horizontally with two hands, and when one hand moves away, that part of the paper droops because of gravity. And you know exactly how it looks, because you know the properties of paper. This type of knowledge, which all of us learned in the first few months of life — none of those systems have any of it.

Alex: But I could chat with ChatGPT and say: what happens if I'm holding a paper with two hands and I let go with one? And it will tell you it will droop. It just won't understand it?

Yann: It might actually not tell you. Whether it tells you or not depends on whether similar situations appeared in the text it was trained on. And I can come up with a huge stack of similar situations, each of which will not have been described in any text. So the question you want to ask is: how much of human knowledge is present and described in text? My answer is: a tiny portion. Most of human knowledge is not actually language-related. Do you do carpentry — build a piece of furniture? Your ability to predict what the piece of furniture is going to look like as a consequence of how you build it is completely non-linguistic. Everything that involves any kind of motor control, planning, things like that — there's basically no text that captures it. Now think about the entire collection of knowledge in every animal. It's obviously non-linguistic, because they don't have language — or at least not human-type language; a few species like dolphins have something. Dogs and cats know a lot about how the world works, and humans have that knowledge too, to some extent — not to the same degree in every domain, because we're all specialized — but none of that knowledge is captured by any current AI system. And that's a lot.

Alex: Okay, let's build on that. I'm just going to read the response. I asked ChatGPT: "If I'm holding a paper horizontally with two hands and let go with one hand, what will happen?" Now, I'm not going to say that you're wrong — obviously you're
right — but I'm going to read it for the sake of discussion. ChatGPT responds: "If you are holding a paper horizontally with two hands and let go with one hand, the paper will tilt or rotate in the direction of the hand that is no longer holding it, due to the unbalanced forces acting on the paper. If the paper was initially still, it will also move in the direction of the hand that let go, due to the force of gravity acting on it. If the paper was moving in a certain direction before you let go, it will continue in that direction but may also be affected by air resistance and other external factors." That's a pretty credible response.

Yann: It sounds correct, and it's completely wrong. It's actually the exact opposite of what happens: the paper doesn't move toward the hand that let go — it's the part that you let go that droops.

Alex: And this is saying the opposite. Oh, that's true. Yeah.

Yann: Right. So it sounds correct, it's grammatically correct, and the general theme is correct, because there probably were descriptions of similar situations in the data the system was trained on, which it sort of memorized. It tries to adapt the text so that it's relevant to the current situation, but it gets it completely wrong. And it gets wrong things like comparing numbers. You can give it a prompt that says, "You know for a fact that seven is smaller than five" — or, with bigger numbers, "250 is smaller than 196" — and then start telling a story with numbers, and it will assume that what you prompted it with is right, even though it's false. And you don't even have to do this — there are a lot of situations where the system will just say things that are not correct. Now, why is that? Because large language models are trained to predict the next word in a text. They train on enormous amounts of text, and they have enormous amounts of memory, but they basically probabilistically generate the next word, then inject that word into their context of previous words — or the prompt — and generate the next word again, and re-inject that into the input, and so on. There are various ways to do this more efficiently, but that's the basic idea.
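To make that loop concrete, here is a minimal sketch in Python of autoregressive sampling. The `model` and `tokenizer` objects are stand-ins — any callable returning next-token logits and any encode/decode pair would do — so this illustrates the general technique, not OpenAI's actual implementation:

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=50, temperature=1.0):
    # `model`: stand-in mapping a (1, seq_len) id tensor to (1, seq_len, vocab) logits.
    # `tokenizer`: stand-in with encode() -> list[int] and decode(list[int]) -> str.
    ids = tokenizer.encode(prompt)                    # the context starts as the prompt
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))[0, -1]    # scores over the whole vocabulary
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, 1).item()  # probabilistically pick one word
        ids.append(next_id)                           # re-inject it into the context
    return tokenizer.decode(ids)
```

Note that nothing in the loop looks ahead: each word is chosen from a local probability distribution, which is exactly the point LeCun makes next — there is no objective being satisfied and no plan.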
Now, the issue with this is that there is no way to specify a task the system has to accomplish, other than specifying it inside the prompt, which is a very circuitous, inefficient, and complicated way of specifying a task. It's not very controllable. That's the first thing. The second thing is that the system is not optimizing an objective — it's not trying to satisfy an objective. It's just generating one word after another, and because it's generating one word after another, it's not doing any planning. It's not planning to tell a story, or an answer that follows an outline, or a set of facts. And it has no capability of generating commands for a tool — say, a calculator, or a physics simulator that could have simulated that piece of paper, observed the result, and then told you what the result was. That's what we do in our heads when this kind of situation is described to us: we have our own internal mental simulator, and because we've learned how the world works, we can simulate what goes on and then describe the result. LLMs do not have that. They don't have any internal model of the world that allows them to predict. And in addition to this, when the system produces a statement, you would like to be able to verify that the statement is factually correct and doesn't break any kind of logic — compared, say, to another statement made before — and there is no way in the current architecture of those systems to do this. So until we build systems that have some internal model of the world that allows them to simulate the world; some way of generating actions on the world; the ability to use tools like a calculator, or to interrogate a database or a search engine; an objective they have to satisfy for the task we ask them to accomplish; and a method by which they can plan an answer that satisfies the objective and is factually correct, depending on the desired behavior — and that perhaps interrogates the right sources of information — we're not going to have anything resembling human-level intelligence.

Alex: Okay. I definitely want to get to the type of research and the models that might get us there, but first I want to talk a little bit about the hallucination ChatGPT just had in my interaction with it, because hallucination is definitely a big issue. I'll be honest — and this is embarrassing to admit as a journalist — but as I read it, I was ready to believe it, because here was AI answering a question with a somewhat plausible answer, stating it so confidently. That is an issue with these models, right? They do hallucinate. That's probably why we haven't seen Google bring it into search.

Yann: That's why you haven't seen any systems of this type from either Google or Meta, despite the fact that they have the technology. We have to realize that most of the underlying techniques used in ChatGPT were invented at Google and Meta, and the whole thing has been built with PyTorch, which was made by Meta — it's not owned by Meta anymore, but it originated there. For example, it uses large Transformer architectures; those were originally invented at Google. Transformers themselves use something called associative memory — it's called self-attention, but it's basically the same principle — and those ideas were proposed by Meta many years ago. The self-supervised pre-training by removing words — those are techniques that go back to the 1990s, or even the 1980s in some cases; they were popularized by the BERT-style language models, again proposed at Google. And there's a very active dialogue-systems research group at Meta that has proposed lots and lots of methods, which, whether they say it or not, OpenAI must have been influenced by. And the technique ChatGPT uses, now that it's available — reinforcement learning from human feedback, RLHF — was actually proposed by DeepMind. So they've done a good job at integrating a lot of things that have been proposed in the literature, and at engineering a
system that produces an impressive demo. And they have to produce impressive demos, because that's their economic model; that's how they're going to raise money from Microsoft and others. Whereas if you are Meta or Google, you could think about putting out a system of this type that is going to spew nonsense, and because you are a large company, you have a lot to lose from people making fun of you for that — and it's not clear what the benefits are. So we're still working on those things to make them useful.

Alex: But Meta did put out a system of its own: Galactica. Talk us through what happened there, because this was a system that was supposed to summarize scientific literature and do lots of other cool things. It comes out, and three days later it goes back behind closed doors.

Yann: That's right. Well, there was a previous system, also, called BlenderBot — and there's another story about that.

Alex: Right, that's the thing that started talking about how Mark Zuckerberg is a sort of money-hungry capitalist.

Yann: I mean, it was just reflecting what it was trained on in the press.

Alex: Right. You didn't get called into Zuck's office — "hey, what are you guys telling it about me?"

Yann: No, no. It just trained on the general conversations you read in the media, and in the media Mark Zuckerberg is very often painted as some sort of money-hungry bad guy — which he is not at all, but that's the way he's painted. So let's start with BlenderBot. BlenderBot was put out several months ago, and it's a dialogue system, a bit like ChatGPT. It's designed to be entertaining more than anything else, and it's capable of having multiple personalities — it can talk in several styles and things like that. It has provisions to somewhat verify factual correctness, although not particularly well developed, but it does have a lot of guardrails and systems to prevent it from saying things that might be offensive, or objectionable, or even controversial. If you try to get it to talk about anything related to politics or religion, it will change the topic — it won't tell you "I don't want to talk about this," it will just change the topic. So people thought this thing was really stupid and boring, because it doesn't want to talk about anything controversial or fun, which is exactly what everybody wants to talk about, and it's frustrating because it would change the topic any time you tried. So it was not nearly as convincing. You could say that the reason it was not that impressive, in the end, was that it was made to be safe, essentially — and if it's too safe, it's boring. Now let's go to Galactica. Galactica is a different animal. It's also a large language model, and that large language model has been trained on the entire scientific literature — something like millions of scientific papers. Its purpose is to help scientists write papers. It's not going to write a scientific paper for you, and it's not going to answer scientific questions — you could try to use it
for that, but sometimes it will make stuff up. It's designed to be, essentially, a predictive keyboard on steroids. You start typing a paragraph about something, and it will complete the text — the entire paragraph. It will insert relevant citations: if you write "the state of the art in object recognition on the ImageNet database is," it will find the correct reference; it will actually build a table of results with links to the references, and things like that. But in the same way that driving-assistance systems for cars are just that — driving assistance — this is just writing assistance. In the end, your hands have to be on the wheel at all times; you are responsible for the finished text. It's a tool that helps you write more efficiently, particularly if you are not a native English speaker — which most scientists aren't.

Alex: Right. I even use ChatGPT that way: I put in the beginning of a paragraph and say, hey, which ways could this go — understanding that it might not be accurate.

Yann: And that's the way you should treat it, really: as a predictive keyboard on steroids, something that just helps you write. It's not going to invent new things, answer questions, do science, and so on. So what happened was that when we put out Galactica, people tried to break it. People who are not scientists didn't understand what it was going to be used for, and they would prompt it with things like, "What are the benefits of eating crushed glass?" That's a leading question, of course, so the system makes up a story about why it's good to eat crushed glass, and then the reaction on Twitter was, "Oh my god, people are going to eat crushed glass because they're going to listen to it!" — which is insane. People are not that dumb.

Alex: Well, I think you overestimate people a little bit — but sorry, continue.

Yann: There might be a tiny proportion, but Galactica ultimately was designed to be integrated into the tools scientists use to write papers. And then others, more seriously, said, "Oh, this is going to destroy scientific publication, because now anybody can generate a nice-sounding scientific paper and submit it to a conference, and this will completely flood and overwhelm the reviewing system we have in science — the star system of science." I thought that was a completely ridiculous argument, because the reason you might want to submit a paper is to prop up your CV, so you have to put your name on it — otherwise what's the point? And if you put your name on garbage, it's bad for you; it's bad for your career. If you send a hundred papers that are complete nonsense to a conference with your name on them, that will certainly not be good for your career — absolutely not. So I think this kind of knee-jerk reaction was completely unwarranted, and it really mirrors a lot of knee-jerk reactions that have happened in the past when new tools or new communication technologies appeared, each of which was supposedly going to destroy society. I think it's the same kind of knee-jerk reaction we're
observing with AI today. This is not to say that there is no danger, but it's not the horrible thing people make it out to be.

Alex: So then why not keep it up?

Yann: What happened was that the team that worked on it — within FAIR, it's called Papers with Code — was so distraught by the reaction that they just couldn't take it. They said, "We're just going to take it down." This was not a high-level decision; it was not a decision by the communications department or by management. It was them — a small team of a few people — and they felt really good about what they had produced. They wrote a very long paper, and they open-sourced their code. They took down the demo, but they left the code, so anybody can download it and run it on their own computer.

Alex: Interesting. And this is what we're going to see — I just wrote this story, and we've talked about it on the podcast — the battle over AI ethics and the applications of AI is going to be extremely intense, and I think we're starting to see some of that. Okay, so we've covered hallucination — good, we got to that. Let's talk about the type of models that you think can get us there. You talk about our move to artificial — sorry, to human-level — intelligence needing an understanding of the environment, of things that can't be expressed with words. Now, when you tell me that, I'm back to where I was in the early days of our conversations, saying there's no way technology is going to be able to do that. But it sounds like you think there is a chance it can. How do we get there, and what type of advances are we seeing today that might lead us to think we have a chance?

Yann: I actually wrote a position paper about this. It's rather long, but the intro is easily readable by non-specialists. The title is "A Path Towards Autonomous Machine Intelligence," and in it I lay out a plan — a path forward — to address those questions: to build AI systems that are capable of planning, whose behavior is controlled by objectives that need to be satisfied, which can be specified by hand or learned — things like factual correctness, "answer this question," "don't spew offensive stuff" — and that have internal models of the world, or of the thing they are interacting with, which could be a person if they are dialogue systems. When you're talking to someone, you need to have a mental model of what that person knows and can understand. If you speak in terms the person cannot understand, the communication fails. When you talk to someone, you have something to say, and you may have to give them some background information, depending on what you think they know, and take them to the point where you think they've absorbed the information that would be useful to them. So to talk to someone, you need an internal model of that person and of how they will react to what you tell them. If it's another type of agent — one that generates actions in the physical world, or even in the digital world, like a domestic robot that needs to cook or fill up the dishwasher — that system needs an internal model of the world that allows it to predict what the state of the world is going to be as a consequence of its own actions, because that's what you need to be able to plan. If you want to plan a sequence of actions to arrive at a goal, you need to be able to imagine: if I take this action, this will happen; if I take that action, that will happen; and so on. Then you can optimize your sequence of actions so that the resulting sequence of states the world will follow satisfies your ultimate goal.
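As a toy illustration of what "imagine, then optimize" means computationally, here is a sketch of the simplest possible planner over a learned world model — random-shooting model-predictive control. The `world_model` and `cost` functions are hypothetical stand-ins, and this naive search is just one of many planning strategies, not the specific architecture proposed in the position paper:

```python
import torch

def plan(world_model, cost, state, horizon=10, n_candidates=256, action_dim=2):
    # Imagine many candidate action sequences and keep the one whose
    # predicted sequence of world states best satisfies the objective.
    best_actions, best_total = None, float("inf")
    for _ in range(n_candidates):
        actions = torch.randn(horizon, action_dim)  # one candidate plan
        s, total = state, 0.0
        for a in actions:
            s = world_model(s, a)    # "if I take this action, this will happen"
            total += float(cost(s))  # score the imagined consequence
        if total < best_total:
            best_actions, best_total = actions, total
    return best_actions              # execute the first action(s), then re-plan
```

The crucial ingredient is the `world_model` itself — which is exactly the hard part LeCun turns to next.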
The issue with this is: how do we get a machine to learn models of the world? Baby humans and animals do this in the first few months of life, mostly by observation, and they acquire a huge amount of background knowledge about the world basically just by observing. This is the type of learning we need to reproduce in machines, and we currently cannot — except for simple things like text, and you have to understand that text is actually simple compared to the real world. So how do you do it? Well, there's one way not to do it, which is the way large language models are trained. The way you pre-train a large language model is this: you take a long piece of text — a thousand words or so, typically — and you remove some of the words. You blank them out, replace them with a blank marker, substitute another word, or apply various corruptions. Then you train some gigantic neural net to predict the words that are missing. This is called self-supervised learning, and this particular instance of it is what's called a generative model — generative in a much more general sense than the usual colloquial use of the term: it's generative in the sense that it produces signals of the same kind as the input. It tries to fill in the blanks; it generates the missing information, if you want. This particular instance is called a denoising autoencoder — an autoencoder because you give it an input that is corrupted and ask it to produce the clean version of that input. That concept goes back to the 1980s — nothing new there — except that applying this idea of the denoising autoencoder to text, using a Transformer architecture — those very large neural nets we use nowadays, with dozens of layers and tens or even hundreds of billions of parameters — works amazingly, surprisingly well. In the process of being trained to fill in the blanks, those models learn to represent language as a series of numbers that represent, basically, the meaning of the input sentence — to the extent that this meaning is useful for filling in the blanks. So inevitably, by being trained to do this, those systems understand a little bit about the real world — but not much; it's very superficial. If I train the system with sentences of the type "the cat chases the ___ in the kitchen," the blank can only be a few things: a mouse, an insect of some type, maybe a laser spot that someone is playing with. There are only a few options, and those systems are trained to produce, essentially, a long list of numbers that are scores for each possible word in the dictionary — how likely each is to appear at that location. And to do a good job at this, the system has to understand a little bit about what a kitchen is, what a cat is, and that cats chase mice. So they learn that — but it's very superficial.
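Here is what that pre-training objective looks like in a minimal PyTorch sketch — corrupt the text by blanking out words, then train the net to put the missing words back. The `encoder` is a stand-in for any network (in practice a Transformer) that outputs a score for every dictionary word at every position; the mask id and masking rate are illustrative choices:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id of the special "blank" marker token

def denoising_lm_loss(encoder, tokens, mask_prob=0.15):
    # `tokens`: (batch, seq_len) tensor of word ids from clean text.
    corrupted = tokens.clone()
    blanks = torch.rand(tokens.shape) < mask_prob  # choose words to blank out
    corrupted[blanks] = MASK_ID                    # corrupt the input
    logits = encoder(corrupted)                    # (batch, seq_len, vocab) scores
    # Train only on the blanked positions: predict the words that were removed.
    return F.cross_entropy(logits[blanks], tokens[blanks])
```

After training, it's the encoder's internal activations — not the word scores — that get reused, which is exactly the "chop off the top layers" step described next.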
Then, once you've trained the system to do this, you chop off the top layers, and you can use the internal representation as input to a subsequent downstream task that you train supervised — for example, translation, or hate-speech detection. This technique has become completely mainstream in natural language processing — so much so that companies like Meta and Alphabet have deployed these things massively across their various services for things like content moderation. Hate-speech detection, for example, has made enormous progress over the last three or four years, and that is essentially entirely due to the fact that we are now using those techniques. We can do hate-speech detection in multiple languages with basically a single model, and we don't have to train it with a lot of data in each language — because we don't have a lot of data in every language. We have a lot of data in English, and in French and German and so on, and maybe in the main languages of India, but India has an enormously large number of languages that people use, including online, in local dialects. So how do you make hate-speech detection work in 500 languages — and even 500 would not cover everything? You need those techniques, and this has made a huge amount of progress. Now, one thing you can do with those systems is fine-tune them to just predict the next word — and that's what large language models are.

Coming back to this question of planning: last time we talked with Danny Kahneman — which was an amazing discussion — Danny is famous for his book "Thinking, Fast and Slow" and for the idea that we have basically two ways of acting: System 1 and System 2. System 1 corresponds to tasks that you accomplish subconsciously. You don't need to plan anything; it's completely reactive. Think about being an experienced driver on the highway: you're barely paying attention, you're not calling on your high-level cognitive functions, and you can drive more or less automatically. Anything that involves a real-time reaction — playing tennis, say — is mostly subconscious; you just don't have time to plan, so it has to be built into your muscle memory, as we call it. Even complex tasks like chess: if you're a chess grandmaster playing a simultaneous exhibition against 50 bad players like me, you don't have to think — you can just look at the board and immediately play. You don't have to plan anything, because it's so easy for you; it has become a subconscious task. But all of those tasks, when you're learning them and you're not yet very good at them, require your entire cognitive resources. When you're learning to drive, you pay attention to everything, you imagine all kinds of catastrophe scenarios, you drive slowly, and you're using your prefrontal cortex — your model of the world — which tells you: I'm driving next to a cliff, and I know that if I turn my steering wheel to the right, I'm going to fall off the cliff, and nothing good is going to happen. You don't need to try it to know that something bad is
going to happen, because you have this model that you've been building in your head — for the last 17 years, if you're a 17-year-old. That model of the world allows you to predict the consequences of your actions, and it allows you to learn any new skill extremely quickly. Same with chess: if you're a bad chess player, you'll have to think for 15 minutes or more when you play against a challenging opponent, and plan all kinds of strategies. So what characterizes intelligence is, first of all, the ability to predict, and then the ability to use those predictions as a tool for planning, by predicting the consequences of actions you might take. Prediction is the essence of intelligence. Now here's the problem: the denoising autoencoder that we use to pre-train natural language processing systems works for text, but it doesn't work for anything else — it doesn't really work for things like video or images. The natural idea is: you take an image, you block out some pieces of it, and you train some system to predict the missing pieces — to reconstruct the missing parts.

Alex: Huh — isn't that how DALL-E works?

Yann: Not really, no. DALL-E 2 uses a diffusion model, which is a slightly different idea. If you do it the way I just described, there are one or two models that sort of work — one is called MAE, by my colleagues at FAIR, which stands for masked autoencoder — but it doesn't work as well as other techniques, and those other techniques are not generative models. They are models that do not attempt to reconstruct the missing information in the image itself; they attempt to reconstruct a representation of that image — an internal representation. I call those techniques joint embedding architectures. Essentially, in the generative architecture, you have an encoder that produces a representation of the input — whether it's an image, text, video, whatever — and then you try to reconstruct, through a decoder, the uncorrupted version of the input, the input having been corrupted. In a joint embedding architecture, you have two encoders: one encoder sees the complete version of the input; the other encoder sees a corrupted or distorted version of some kind. You run the input through both encoders, and then you ask the encoder that sees the corrupted input: can you predict the representation of the full input? You're not trying to reproduce all the details, and that makes sense for images and video. Let's imagine a scenario with video: I have a complete video clip, and I mask the last half of it. The corrupted version of the clip is just its first half; the rest is invisible. The complete version, of course, is the full clip. You run both through neural nets that produce representations of the video clip, and then you train the first net to predict the representation the second one has produced. Implicitly, what it has to do is predict the rest of the video clip — in representation space.
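A minimal sketch of that two-encoder setup, with hypothetical module names, might look like the following. One caveat: real joint-embedding methods add extra machinery — for example, asymmetries between the two branches — to stop both encoders from collapsing to a constant output; this toy version omits all of that:

```python
import torch
import torch.nn.functional as F

def joint_embedding_loss(encoder_full, encoder_masked, predictor, clip):
    # `clip`: (batch, n_frames, ...) video tensor. The "corrupted" view is
    # simply the first half of the frames; the rest stays invisible.
    first_half = clip[:, : clip.shape[1] // 2]
    with torch.no_grad():                         # target branch, held fixed here
        target = encoder_full(clip)               # representation of the FULL clip
    pred = predictor(encoder_masked(first_half))  # predict it from the half clip
    # The loss lives in representation space: no pixels are reconstructed,
    # so irrelevant, unpredictable detail never has to be modeled.
    return F.mse_loss(pred, target)
```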
Now, why is that better than just reconstructing the missing piece of the video clip — just predicting the pixels? The reason is that there is an infinite number of things that can happen after a particular video clip, and we don't know how to represent a distribution over all the things that could possibly happen — we cannot do it in pixel space. For example, you're seeing a green screen behind me right now; you're not seeing the back of my head. If I start rotating my head, you might be able to predict what the back of my head will look like — and then, when you see it, you might be surprised. Maybe I have a small ponytail. (I don't.) There's an infinite number of variations of what could be possible there, and I could decide to change the action and move my head in a particular way. So you can't just predict every detail of what's going to happen. This is very important if you want to do things like build self-driving cars: to drive safely, you'd like to be able to predict what the cars around you are going to do, or what pedestrians are going to do. You see a kid on the sidewalk and a soccer ball crossing the road — good chance the kid is going to run after it, so you slow down. We have these interesting models of the world that allow us to predict and then, as a consequence, plan safe actions. But that same street where the kid is with the soccer ball is lined with trees, and there's a pond behind the trees, and it's a windy day, so all the leaves are moving and there are ripples on the pond — and you don't want to spend any resources predicting all of those details, which are essentially unpredictable. That's why generative models essentially fail when you train them on images: there are just so many details to predict that the system gets completely lost predicting all kinds of irrelevant stuff, which we call noise — but how would the system know? So one of the main things I'm proposing in that piece is to abandon those generative models and focus on those joint embedding architectures — getting models to learn predictive models of the world.

Alex: Okay, and maybe that's what gets us there. Let's take a quick break here, and then come back for about ten more minutes. On the show we have Yann LeCun, the VP and Chief AI Scientist at Facebook, known as the father of deep learning — so plenty to talk about. We'll be back right after the break.

[Break]

Alex: And we're back here with Yann LeCun, the VP and Chief AI Scientist at Facebook — Facebook, Meta, I don't know. Is it the same?

Yann: Meta. It's called Meta. Actually, I'm also a professor at NYU.

Alex: Right, of course — can't forget that. So let's go one level deeper about this in the time that we have left. Facebook, I know, released an application — or at least has it internally — where you can type a sentence and it will make, not an image like DALL-E, but a small video that actually resembles that sentence. Is that taking us closer to this ability to predict and understand the world that you're talking about?

Yann: Yes and no. There are two little demo applications. One is called Make-A-Scene, and it's a bit similar to DALL-E: you type a description of an image, and it produces an image. It's based on slightly different ideas, and it actually came before DALL-E 2. The main author of DALL-E, by the way, is
Aditya Ramesh, a brilliant young guy who was an undergraduate student with me and did some research projects in my lab before going to OpenAI as an intern and then being hired as a scientist. So there's this thing called Make-A-Scene that produces still images, and then there's a similar thing called Make-A-Video that produces short video clips. There are systems that can do this now from Google as well.

Alex: Right, but those aren't really publicly available. You can't use them — I saw the blog post and thought, oh, I want to use this, and then I realized it wasn't public. I imagine the problems you had with Galactica would pale in comparison to whatever happens with this movie maker, which I think people would pay even more attention to.

Yann: No, the main issue is things like: where does the training data come from? Before we can roll those out, they have to be trained on data that is publicly acceptable, and all that. There's a similar system also in the works — there have been some publications on it — that can generate audio, so we can generate sound effects from a textual description, or simultaneously with the video. There are also systems from FAIR that produce music; they can do music continuation, so it's like a large language model, but with music: you feed it audio, and it just continues.

Alex: Interesting.

Yann: So you're going to see more and more of those things, and Meta is really interested in them — in creative aids — because everyone who is on an online service, whether it's a social network like Facebook or Instagram or whether it's going to be the metaverse, is going to need easy ways to generate content and be creative, without necessarily being technically astute enough in terms of art to do all of that. Being able to be creative and generate content, I think, is something that's very, very important, so you're going to see things like that popping up in Meta's services: generating images and effects, modifying images, generating video, music, sound effects, 3D models — which of course are important for the metaverse — and eventually all the way up to intelligent virtual characters and things like that.

Alex: Really? Okay, so Meta is going to go all in on this stuff?

Yann: Oh, totally. Yeah.

Alex: I want to ask you a question about it, because — songs, videos, all this stuff — there's a musician and songwriter, Nick Cave, who did this post about DALL-E — I mean, not DALL-E — ChatGPT writing songs, and the authenticity of all of it. There's been this debate: is this going to take everybody's job, or is it too soulless and dumb to actually do this? This is the argument that this AI really can't do what humans do. So Nick writes: "ChatGPT may be able to write a speech or an essay or a sermon or an obituary, but it cannot create a genuine song. It could perhaps in time create a song that is, on the surface, indistinguishable from an original, but it will always be a replication, a kind of burlesque. Songs arise out of suffering, by which I mean they are predicated upon the complex, internal human struggle of creation, and, well, as far as I know, algorithms don't feel." So I'm curious what you think about that, from the perspective
of a songwriter — and given that this stuff actually produces songs.

Yann: Yeah, there are a lot of questions around this. So, first of all, it is true that current AI systems don't feel. But that's not going to last much longer. If we have systems of the type I was describing before — that have objectives and that plan actions to optimize objectives — they will have the ability to predict the outcome of a situation. Which means that if they are able to satisfy their goal in a given situation, they're going to feel the equivalent of elation. If they can predict that they're not going to be able to satisfy their goal, they're going to be disappointed. If they predict that the situation may be dangerous, for either themselves or the person they're interacting with, they might actually have the equivalent of fear. So if you have autonomous AI systems that work by optimizing objectives and have the ability to predict, they're going to have emotions — it's inseparable from autonomous intelligence.

Alex: Fascinating. So maybe we'll have an AI that tries to fulfill some objective, fails miserably, and writes a song about it.

Yann: And that will be okay. But a big part of art — it's certainly true for songwriting, but also for novels, and for jazz improvisation, and all kinds of things — is basically a vehicle to communicate human emotions. And Nick Cave has a point, which is that there is nothing to communicate unless you have emotions that the listener or reader can relate to. And that really depends on the person. I'm totally taken by good jazz improvisation, but the lyrics of songs, in my case, do nothing to me — we have different ways of reacting to different things. So he's right in the sense that if you want the genuine experience of art — of communicating human experience or human emotions — it has to come from another human. Even if machines eventually have emotions, those are going to be very different from humans', so it's not going to replace this type of genuine art. Project yourself back a couple hundred years: if you wanted to buy a salad bowl or something, it would be handmade — a potter would make it by hand and bake it — and as a consequence it would be relatively expensive for most people. Then came industrialization, and it became possible to produce ceramic bowls for pennies, in large quantities. Did that kill the whole idea of making bowls by hand? No — in fact, handmade objects became more expensive as a result.

Alex: Right. And that leads me to a question I have to ask you, which is what everybody's asking about: the training data that these things are using. In fact, I put out a tweet asking, "Hey, I'm going to talk about generative AI — so we'll end with this generative-AI ethics question — what would you like to know?" Everybody wanted to know how the people whose work these programs are trained on are actually going to end up being compensated, and whether we should even allow them to train on stuff that people have created without their consent. Here are two of those comments. Roger McNamee, early Facebook investor, now critic — and he didn't know this was going to be a conversation with someone at Facebook —
says: "Why should it be legal for computer scientists, entrepreneurs, and investors to profit from unlicensed exploitation of the works of creative humans?" And we had one more, from Marshall Morrow, who said: "The inevitability of intrusive tech is a uniquely American phenomenon" — I don't know why American, but anyway — "They ask you to post baby pictures and are now training lucrative AI engines with your images. This was not disclosed at the time. Pay me now." So when you think about that, what's your reaction?

Yann: It's going to be a debate for society to figure out, because I don't think the answer is totally clear. For example, the invention of photography shrank the market for painted portraits by a lot. It's not as if portraiture disappeared, but it certainly reduced the market for it. Recorded music reduced the market for performing musicians. And in every instance of those things, there were collectives of artists saying, "We have to stop this because it's going to kill our business" — and they were universally unsuccessful. You're not going to stop technology. Now, the question is a legal one. If you assume the current legal interpretation of copyright, then you cannot let those machines plagiarize. If you use a generative model that's been trained on whatever, and — regardless of the process — it produces a piece of art that is too similar to an existing one, the artist who produced the existing one is entitled to sue the person distributing the new piece and ask for compensation. If it's a copy, there are grounds for suing even if you give it away for free. But what if that generated piece of art is in the same style as a known artist, yet is not the same piece — not the same painting or whatever? That's where things become complicated, because human artists are absolutely authorized to get inspired by, and to some extent copy, someone else's style. That happens all the time in art. So would it make sense to apply a different rule for so-called artificial artists — they can get inspired by whatever they've been trained on, but they cannot just reproduce it? That would be a perfectly fine thing to do. I imagine a lot of artists would not be happy with it, but it's a definite possibility. Now, perhaps what you might want to do is what the early internet put in place: you can put up a little file, robots.txt, that tells search engines and crawlers, "Do not use my content" — you can't index it, you can't use it for anything. So if you're an artist and you don't want your content to be used, lobby for that kind of policy to be respected by crawling algorithms. And then the next step could be: "If you use my artistic production as training data, you owe me money" — and I think that's a bit extreme. So the answer is probably somewhere in between.
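For reference, the mechanism he's alluding to is a plain-text robots.txt file served at a site's root. A hypothetical sketch of extending it to training crawlers might look like this — the "AITrainingBot" user-agent name is made up, since at the time of this conversation no standard opt-out token for AI training existed:

```
# robots.txt — asks compliant crawlers to keep out.
User-agent: AITrainingBot   # hypothetical name for a dataset-building crawler
Disallow: /

User-agent: *               # all other crawlers
Disallow: /my-artwork/
```

Note that robots.txt is purely advisory: it works only if crawler operators choose, or are required, to respect it — which is exactly the policy question being debated here.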
It's not totally clear. And it's not for people like me to decide — I have no particular legitimacy to tell people what the right thing to do is. I'm a scientist, an AI scientist. I think this is for society at large to decide, through the usual democratic process. But you have to be careful. In one of the remarks you mentioned, it was said: why should a large tech company in California profit from my art — me, a poor artist? That's the wrong way to frame this, because within a year or two, any teenager in their parents' basement is going to be able to do this. You don't want to use the current bad press the tech industry has to motivate people to go after this, because you're going to lose if you do. The big tech companies are just the first ones to have the technology, but eventually everybody is going to be able to use this kind of stuff, train their own models, and crawl the internet — countless startups and young programmers.

Alex: Yeah. And it is a point — I made this point in my most recent newsletter — that with AI we're not going to see the centralization, I'm sure you agree, that we saw with communication online, with the Facebooks and the Twitters; AI is going to be much more distributed. The last thing I'll say is that I'm so torn about this, because I have been plagiarized by someone feeding my words into a generative AI system and then publishing them as their own — that's already happened to me. And yet I just love these systems: I love using DALL-E to illustrate my stories, and I love speaking with ChatGPT — I have it on speed dial; it's so fun to speak with. Okay, that will do it for us — we're unfortunately out of time. Yann, I feel like we could speak for hours each time, but that's the end of the show. Thank you so much for coming on. We'll have to do it again soon.

Yann: Well, thanks for having me. It was a pleasure.
Info
Channel: Alex Kantrowitz
Views: 11,130
Id: ULbpPHjiSBg
Length: 61min 7sec (3667 seconds)
Published: Wed Jan 25 2023