The Long Story of How Neural Nets Got to Where They Are: A Conversation with Terry Sejnowski

Captions
Okay, hi everyone. Welcome to a discussion I'm going to have here with Terry Sejnowski about the long story of neural nets and how they got to where they are now. So I guess, Terry, I'm curious to begin with: when did people first realize that there were neurons in the brain?

Well, the idea of the neuron as the unit of the brain goes back to Cajal. He was a Spanish neuroanatomist, around 1900, and he used something called the Golgi stain, which is a reduced-silver method, a bit like photography, that stained neurons in their entirety but only about one in a hundred of them. So that produced beautiful fills, and he drew them, and he was spectacularly successful. At that point the neuron was considered the unit; that was called the neuron doctrine.

But before that, what did people think the brain was made of? There were anatomists before Ramón y Cajal. What did they think the brain was?

Oh yes. Well, there were theories that go back to Descartes, who thought it was pneumatic. They knew about these processes that look like wires, but they thought they were little tubes that carried fluid, and they are indeed, but the signal was not the pressure. So they knew about them, and they probably had various theories over the years, but I think it wasn't until Galvani and others discovered that electricity was being carried by these nerves that the modern era really started, the electrophysiological approach to understanding the signals.

But people obviously knew, from Volta and the frogs' legs and things like that, that muscles had to do with electricity. How did they figure out that the things in the brain had to do with electricity?

Well, let's see. I can't put my finger on a specific experiment, but I would say it was not much of a leap from the stimulation experiments: they stimulated the nerves to the muscle, so they had the idea that this was coming from the brain, a signal transported along the nerve. But I think it wasn't until Hans Berger, probably, who was able to record EEG signals from the scalp and showed that they were modulated by your level of arousal and by what you were thinking. It was a microvolt electrical signal, but it was a clear indication that there are electrical signals flying around inside the brain.

So what was that, the 1920s or something?

No, no, earlier, around 1910 or 1912.

So it was around the same time as electrocardiography was being invented, is that right?

Well, that was EEG, electroencephalography.

That's right, but I mean the heart version of that, the EKG.

Oh, electrocardiography, yes. That was probably about the same time. I think you needed amplifiers that were sensitive enough, and once you had those you could record all over the place.

Right. So, okay, people decided there were all these kind of wire-like things in the brain. After Ramón y Cajal, was there a huge development of people dissecting different parts of the brain and figuring out lots of detail, or not so much?
Oh yes, that was a big business, because there are a lot of parts to the brain and there are many species you could look into, and Cajal was one of the leaders in his lifetime. He wrote the book that is still used today, because it's not just scientifically interesting, it's artistically interesting. What he would do is look through his microscope all day (no photography there) and in the evening he would sit down and draw what he saw: the shapes and sizes and the way the neurons are connected. They're beautiful, they're just beautiful art, and it's withstood the test of time in the sense that we still refer to it. I often show one of his pictures at the beginning of my talks, just as a nice way to honor our foundations, and also because they're very pretty drawings.

So why weren't they using cameras, by the way? Photography certainly existed at that time.

Well, you'd have to ask Cajal about that. My guess is that it was easier for him to draw than to worry about photography and trying to keep track of things. I think daguerreotypes had been developed by then, but those required very high illumination, and I'm not sure they had the resolution that you needed at that point.

Right. But okay, so then we're getting to the 1920s and so on. There were electromechanical devices that people were building to do machinery kinds of things, and there were people investigating the brain. When did those connect?

I would say that the foundation of modern electrophysiology was laid in that period, in the '20s and '30s. The types of experiments people would do would be to stimulate a part of the brain and then see what would happen, and you could do that in animals, and even in some cases in humans. Actually, most of what we know about localization of function came from war wounds in World War I, where people got shot and part of their brain was taken away. A shell takes away the posterior part of the brain, say, and you're blind, and lo and behold, they were able to map out the different parts of the cortex that way. Now, that was anatomy. But to be able to stimulate and record really required the development of other techniques. For example, they had, of all things, a smoked drum that they used for recording electrical signals, before photography was used routinely. I have to say I haven't read that literature in detail, but certainly by the '40s things were going full blast. People could record; they had a really good idea about what were called action potentials. There's a famous series of papers that Hodgkin and Huxley published in 1952, which is now the classic explanation for the ionic basis of the action potential. So I would say that between 1930 and 1950 there was a golden period when the new techniques were being developed. And by the way, as you know, in science everything depends on the techniques that are available: the instrumentation, the way you take the data and analyze it, and that just changes with every generation.
Right. So, okay, a big moment for neural nets: 1943, McCulloch and Pitts. At least that's my impression of the big moment for neural nets. I don't know all the dates, but did you ever meet McCulloch?

No, I never met him. I've written an obituary for him, but I never met him. I know people who did meet him. He was at MIT at the time, and he was a very interesting character. He had a medical background, but he decided he wanted to go into understanding the brain with a more basic approach, and he was charismatic; he attracted a lot of really interesting, really smart people around him. Pitts was a mathematical genius, but he had psychological problems, and McCulloch was able to buffer him from the vicissitudes of life. Together they published this paper that, as you say, did have a lot of impact. What they showed was this. What was known about neurons at the time was that they have a lot of inputs and a single output, which is either a spike or not, so it's something like a binary event, although not really, because the timing is continuous. So the model was a simple summation of inputs (which we now know is not how real neurons work; it isn't linear) and a single output. What they showed was that with that simple McCulloch-Pitts unit you can construct logic, AND gates and OR gates, and since you can do that, you can do universal computing. That was the thrust, and I think it had an impact primarily on the mathematicians. But it did, in a sense, introduce the idea that the brain might compute, something like a computer; maybe not exactly, but that what it was doing was computing.
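To make the McCulloch-Pitts unit concrete (a weighted sum of binary inputs compared against a threshold, from which AND, OR and NOT gates, and hence arbitrary logic, can be assembled), here is a minimal Python sketch. The particular weights and thresholds are just one conventional textbook choice, not anything taken from the conversation.

```python
# A minimal sketch of a McCulloch-Pitts style unit, assuming the usual
# textbook formulation: binary inputs, fixed weights, a threshold, and a
# binary (fire / don't fire) output. Weights and thresholds are illustrative.

def mp_unit(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Logic gates built from single units, in the spirit of the 1943 construction:
AND = lambda x1, x2: mp_unit([x1, x2], [1, 1], threshold=2)
OR  = lambda x1, x2: mp_unit([x1, x2], [1, 1], threshold=1)
NOT = lambda x:      mp_unit([x],      [-1],   threshold=0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "NOT a:", NOT(a))
```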
So, about the origin of that effort: McCulloch, I think, had been a psychiatrist, and I believe they were in Chicago when they wrote that paper. And it's a paper whose only references are to things like Russell and Whitehead's Principia Mathematica; I don't think it has any physiological references. It presents itself as something a bit like a Turing-machine kind of account: this is how we might imagine brains work.

That's true, and that's why I said it was written for a computational, more mathematically oriented audience. But there was another paper he wrote which was much more influential on the biologists, whom we now call neuroscientists (that's a relatively recent term). Let's see if I can remember the title... "What the Frog's Eye Tells the Frog's Brain."

That was the Jerry Lettvin thing, right?

That was with Lettvin, yes. Oh, that was later, that's true. But I guess what I'm saying is that his influence on brain science was really that paper; it had a much larger influence there than the earlier one.

My impression, if I remember the dates right (and I might have them wrong), is that the Lettvin, McCulloch and Pitts "What the Frog's Eye Tells the Frog's Brain" was at the end of the '50s.

Yes, it was probably around then, into the '50s, I think. Maybe; I'll have to check. But that was still in the early days we're talking about. No, it had to be earlier than that. Well, we could look it up, but I think it was maybe the early '50s, or the '40s.

But the history... because there were other things that had happened by that time. Alan Turing's... well, '43 is the date of the original McCulloch-Pitts paper. Oh, you were talking about the Lettvin paper; right, okay. But then there was a whole development, as I understand it: there was Alan Turing's 1936 "this is what a computer might be like," and then the original 1943 McCulloch-Pitts paper was in a sense leveraging the Turing idea to say, we can make these neuron-like things that can be Turing machines. Okay, I just looked up the date, and it says 1940.

What, "What the Frog's Eye Tells the Frog's Brain"? Really?

I'm just looking at it, and it says Proceedings of the IRE, 1940... no, wait a second, I don't think that's right. Oh, that was the page number, sorry. It's 1959. You're right, you're absolutely right. Okay, I stand corrected.

But so, before that time, starting in the '40s, it seems like several things were happening. On the more artificial-device side there was the McCulloch-Pitts paper, and then people like John von Neumann got into the picture, Wiener got into the picture, and von Neumann was writing books like The Computer and the Brain and so on. At that point there seemed to be a whole development of people saying, we're going to make electrical devices and they're going to be like brains, and people started talking about giant electronic brains and so on. In some sense I think neural nets were perhaps the original vision for what computers might be like.

Well, that's an interesting observation, because if you think about it, computers came around relatively late, digital computers anyway. And a lot of people were also thinking along analog lines, including Frank Rosenblatt, who came up with the perceptron. That was an analog device; he actually, physically built an analog device that was a model of a single neuron. You know, it's interesting: the interest in building brains, or something that works like the brain, has waxed and waned for many decades, with a period of about 20 or 30 years, and that era was a high period, the '50s, maybe a little before and a little after. 1959, actually, was also when Rosenblatt published his perceptron paper.

One other famous data point was the Dartmouth conference, 1956.

Yes, that's called the birth of artificial intelligence. Actually, what it's the birth of is that phrase, because people had been thinking along those lines for decades before; it wasn't as if it was suddenly a great idea. I think what really kicked it off was that by that time they did have digital computers, they were writing programs, and they were able to write programs that could do interesting things
like proving theorems and playing games and that sort of thing. And they figured, wow, if we have a computer that can prove theorems, which is the highest level of human cognitive achievement, mathematics, then certainly it should be able to do other forms of intelligence, and maybe even build robots and so forth. And that was based on the false intuition that problems like vision and motor control are easy. Why? Because they're effortless; we don't have to put much cognitive effort into them, and everybody can do them. You don't have to be a genius; in fact, people who do sports are generally not as intellectual as people who don't. That era was really an exciting one, but it petered out, I would say, when they began to realize how difficult the problems were, and despite the enormous amount of money spent by the government, DARPA and ONR and other agencies, the returns were very meager in the '60s and '70s.

My impression was that machine translation, for instance, was a big effort that was an almost complete failure at that time.

My favorite story along those lines is that they had a translator from English to Russian and from Russian back to English, and the phrase they tried to translate was "the spirit is willing but the flesh is weak," and it came back as "good wine but rotten meat."

But how were they even doing those translations? Was it essentially a textual-replacement type of approach?

Well, I have no idea what the actual approach was, but they must have had some way of thinking about syntax, because Chomsky was using that as the basis of language in that era.

I don't think he was that much involved in the machine translation, though.

No, he wasn't. But you couldn't think about language without thinking about Chomsky in that era. And for another reason: people quickly realized that if you just replace word by word, you get gibberish. You have to get the right word order, and things like number agreement have to be handled somehow. But the thing that was completely missing, that they didn't understand at the time, was that more important even than word order is syntax... no, I'm sorry, semantics. Semantics, meaning. And that's really why modern neural networks that do language translation are able to do it: by being able to embed sentences and words in a high-dimensional space in which neighborhoods of words share semantics.

It is kind of interesting that one of the very early attempts in AI was machine translation, and the technology that has now brought us our friend ChatGPT is that same kind of technology.

Yes, that's an interesting parallel, and I hadn't thought about it. I think there is something magical about language that really influences humans. By the way, I also contributed in a small way. Back in the '80s I had a program called NETtalk, which by today's standards was a very tiny network: it had about 300 units and about 20,000 parameters, weights between the units, which by today's standards is so tiny it's embarrassing by comparison. We trained it to take in a window of seven letters and to assign
the sound, the phoneme, to the middle letter, and we trained it up on a bunch of words, dictionaries and transcriptions, and it did a credible job. We actually recorded it, through a machine that turned the phonemes into sounds, a DECtalk.

Yes, I remember that. It was very nice.

And you could actually hear the thing as it learned, and it was mesmerizing. In fact it got so much publicity that I was on The Today Show demonstrating NETtalk to the world, and even to this day people come up to me and say, "I listened to that program, and it was the first time I knew anything about neural networks; I can still remember it." It was one of those moments when you suddenly see something you didn't expect.

It seemed very convincing in the way it went from babbling like a young child to talking like a more adult character.

It did do that, and we now know why. What it did, when it was learning, was pick up the biggest regularities first, and the most regular assignment is vowels versus consonants. So it would alternate between a consonant and a vowel; it might be the wrong vowel, but it would be a vowel, and that's babbling (I remember that from the tapes). Then it went through a period where it got the small words right, and then another period where it got the bigger words right, though it would often have one of the sounds wrong, and eventually you could understand it. It sounded just like a little kid talking.

And let me explain why that's important, even today. When critics talk about these networks, they say, oh, they're stochastic parrots. There was nothing stochastic about NETtalk; it was all deterministic. And it wasn't a parrot, because there's no way you could get this just by memorizing: English is so irregular in the correspondences between letters and sounds. There are rules, but there are always exceptions, and if you keep writing them down you get a book with 300 pages. Yet the same architecture, with very few weights, a few tens of thousands, was able to pull out both the regularities and the exceptions, and it did it with a learning algorithm rather than by having linguists sit there and write down rules. And the other thing is that it has to generalize, because we can give it new words. I remember giving it "Jabberwocky," the famous poem with nonsense words in it: "'Twas brillig, and the slithy toves did gyre and gimble in the wabe." It did an incredible job. Maybe not exactly the way I pronounce it, but it was English-level pronunciation that you could understand.

Right, I remember it was a very cool demo, and thank you for publishing the paper about it in my then very young Complex Systems; it was in one of the first issues.

Yes, right.
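For readers who want a feel for the kind of architecture being described, here is a toy NETtalk-style network in Python. It is not the original code or data: the alphabet encoding, the made-up phoneme labels, and the layer sizes are illustrative assumptions, chosen only so the parameter count lands near the "about 20,000 weights" mentioned above, and the back-propagation training loop is omitted.

```python
import numpy as np

# A toy NETtalk-style network (not the original): a seven-letter window of
# text, one-hot encoded, feeds a single hidden layer; the output is a score
# for each candidate phoneme of the middle letter.

ALPHABET = "abcdefghijklmnopqrstuvwxyz _"      # 28 symbols; '_' pads, ' ' separates words
PHONEMES = ["k", "ae", "t", "s", "ih", "-"]     # illustrative phoneme labels only

WINDOW, N_IN = 7, 7 * len(ALPHABET)             # 7 * 28 = 196 input units
N_HID, N_OUT = 100, len(PHONEMES)

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (N_HID, N_IN))          # input  -> hidden weights
W2 = rng.normal(0, 0.1, (N_OUT, N_HID))         # hidden -> output weights
print("weights:", W1.size + W2.size)            # roughly 20,000 parameters

def encode(window):
    """One-hot encode a 7-character window into a single input vector."""
    x = np.zeros(N_IN)
    for i, ch in enumerate(window):
        x[i * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
    return x

def forward(window):
    h = np.tanh(W1 @ encode(window))             # hidden activations
    scores = W2 @ h                              # one score per phoneme
    return PHONEMES[int(np.argmax(scores))]      # predicted phoneme for the middle letter

print(forward("the_cat"))                        # untrained, so the output is arbitrary
```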
But let's go back a bit, because I'm curious about the whole question of learning. We had McCulloch-Pitts in 1943: this is what a neuron might be, what an idealized artificial neuron might be like. But that was more about making logic out of these things; you could make a Turing-machine-type thing out of them. Then there was the question of, well, in brains one of the important features is that they learn, and I guess at some point people like Hebb and so on were starting to talk about how those artificial neurons would be able to do learning. But by that point, had people had the idea that the synapses, the connections between neurons, were the places where memory lived? How, and when, did people conclude that?

Okay, so synapses go back to Cajal, whom we mentioned earlier. By the way, the synapse is below the resolution of the light microscope, which is about half a micron, so it was only conjectured that that's the junction where the signal goes across. In fact there was a big disagreement between Cajal and Golgi: Golgi thought there were direct electrical connections between the two neurons. They hated each other; they were at odds scientifically, and ironically they won the Nobel Prize together, but they weren't on speaking terms. And, as is very common in neuroscience, they were both right, or both wrong: there are electrical connections, called gap junctions, electrical synapses, and there are chemical synapses, and both are important in different parts of the brain, in different neurons. But the idea that memory lives in the synapses came a little later. Hebbian plasticity is probably the most famous version, but there were people before that, Konorski and others, who had similar ideas. I think the synapse has an allure that is a little overrated, frankly. There are a lot of them, and there's no doubt that they change in strength; neuroscience has really worked out a lot of the details. But it turns out that many other things are also plastic, in particular the intrinsic properties of neurons.

So what are they?

Well, first there's the threshold, the excitability, which can be varied. There are a lot of ion channels in the different parts of the dendrites (the receiving area) and in the spike-initiation zone and the soma, and all of those are also plastic; they can be modulated up and down. When neuroscientists talk about plasticity, they're really talking about plasticity of the circuit, not just the synapses. A beautiful example is the stomatogastric ganglion of the lobster, which has only a few dozen neurons in it and controls the rhythms of the stomach. There are two rhythms, very slow ones, around one to ten seconds, a pyloric rhythm and a gastric rhythm, and it turns out there are neuromodulators that modulate the intrinsic properties and the synaptic strengths and can take you from one to the other. So the very same network can produce two very different behaviors, and in fact in our brains there are dozens of neuromodulators that do the same thing, on a shorter time scale: we're talking seconds, minutes, hours, rather than days and weeks and months.

So when these signals are propagating through the dendrites and so on, you're saying that in addition to the synaptic weights, the strength of a connection between neurons, there are different resistivities or something like that in different parts of the dendritic tree?

Different voltage dependences, which translate into
differences in excitability, differences in integration time constants, differences in the dynamics. Even within the synapse, for example, there are biochemical reactions taking place that keep track of the history of the signals that have passed through recently; it's called a trace, and short-term memory is the term psychologists use. Some of those biochemical reactions can modify the size of the synapse, which makes the change more permanent, and can change which ions flow through and how much. Calcium ions, for example, are very important, and those ion channels are very highly regulated. So nature has taken computing down to the molecular level. That's why it's so efficient and so miniaturized: your brain runs on less than 20 watts of power, and just think of everything you can do that digital computers can't.

Right, but we're skipping to the future, and we need to go back to the past to get this narrative. Still: if we think about artificial neural nets, which have these weight matrices and biases and so on, is the thing you're describing, the characteristics of the whole neuron and its whole dendritic tree, captured by that or not? Or is there some extra term that should be in the future of artificial neural nets to capture the kinds of things you're now finding in actual neurons?

Well, it depends on what you're trying to do. Certainly if you're a scientist trying to understand the brain, then yes, all those details are important. But if you're just trying to extract principles, then there are some principles that are absolutely essential, and I think one of them is the wide range of time scales. In artificial networks there are only two time scales. There's fast inference, which goes up and down very quickly, and then there's training, back prop being the popular training algorithm that adjusts the weights in the network, and that's very, very slow; it takes a long time to train these networks up. So that's the longer time scale. But in the brain there's everything in between. There's an incredible range: working memory, all sorts of traces. In fact, and this is something psychologists have studied, and it's very surprising, you can look at a thousand pictures over a few hours, come back the next day, be asked "have you seen this one before?", and you're pretty good at saying yes or no. How could you do that? It must be that these things left a trace; an enormous amount of information came in. You couldn't recall them all (you might recall a few), but interestingly you can tell whether you've seen one before. And here's something I think is even more amazing: when I ask you a question, within a second you know whether or not you know the answer. In a traditional computer you'd have to search through the whole memory bank; you'd have to have an
enormously efficient algorithm to do that. Which means that somehow the brain uses all this machinery, at all these different time scales, to solve problems like that very quickly, and survival depends on it.

But in a neural net, the information is sufficiently distributed, and the operation of the net is such, that the same phenomenon shows up: you're not searching through anything, you're just running the network. It's only in the more traditional digital-computer metaphor that you'd say these things are arranged in a file system or a database or something.

Right. Okay, so here's what it's like, and this is an analogy, so it's not perfect. It also gets to the point you were raising about the difference between logic and what networks are good at. Networks are really good at pattern recognition: one pass through and you've got an answer. That's inference. So here's the kind of computer the brain is: it's like a digital computer in which the instructions are things like "recognize the object in 100 milliseconds," bang, it does it, that's one instruction; or "construct a sentence in one second," and you just spew out a bunch of words. A lot of it has been automatized so that it's very efficient and very fast, and you're using these macro-instructions that can solve enormously complex computational problems in a flash. There's a lot of evolutionary pressure for that, because if you can't recognize a tiger, you're toast.

Well, let's come back to that. Let's go back: we were talking about how people got from the McCulloch-Pitts idea, which was essentially about making logic out of something that seemed to be like the neurons people saw in brains, to the idea of how learning would work. And that has a much more complicated history, as I understand it, in the sense that the neural net that's in ChatGPT, for example, is extremely similar to a McCulloch-Pitts neural net, but the method of training is something that took quite a while, and I think you were quite involved in figuring out how that would work.

You're absolutely right that learning was really the secret sauce that made neural networks able to solve all these problems. Of course many other factors go into it, computing certainly. But what's magical about learning... psychologists have been studying learning for hundreds of years, so it's not a mystery that humans learn; the question was how they learn, what the mechanism is. We already talked about Donald Hebb, so let me tell you a story about Donald Hebb that most people, even neuroscientists, aren't aware of.

I only know about him conceptually. Donald Hebb was a psychologist, is that right?

He was a psychologist at McGill, and he wrote this book, The Organization of Behavior, which was very influential, not because of the whole book but because of one phrase in it: that if two neurons are activated simultaneously, then the connection between them should be strengthened. It's called the Hebb synapse. Coincidence of firing leads to
an increase in strength: that was the idea. And in fact, when electrophysiologists actually did that experiment, that's what they discovered, though it turns out to be much more complicated. But let me tell you the story. I was asked to write a review of that book, let's see, probably around 1980, and it was published in 1949, so that's many years later. So I actually read the book, and I had always thought of this as an algorithm for associative memory, because you associate an input with an output, and that's the essence of associative memory (a bit more complicated when you have interactions in the network). But no. Behaviorism was very dominant at the time, and he hated behaviorism: the idea that you can use Pavlovian, classical conditioning, and also operant conditioning, to train animals to do a repetitive task, to recognize objects, and so on. They were using it on humans too; humans also show classical conditioning, as do all animals, including flies and other invertebrates. So I was completely wrong. Here's the problem he thought he was solving: the persistence of activity. If you have a feedforward network, the information flies through and then it's gone. But how is it that you can remember something that happened a few minutes ago, or an hour ago? There has to be some way for the activity that came in to persist, for short-term memory in particular, and the only way he could think of doing it was with some kind of recurrent connections that would circulate activity. And then he asked, how could you learn something new? It can't just be a delay line; there has to be some way that you strengthen synapses. So he had the idea that if you strengthen all the synapses along a line, as the activity goes through a chain of neurons, and especially around a loop, that would maintain the activity that came in for a longer time. And actually, recurrent networks are exactly that: they hold on to information given at the beginning of a sequence and combine it with information coming in at the end of the sequence, like sequences of words. Now, he didn't say anything about when you would decrease the strength of a synapse, and obviously if it keeps increasing you're going to saturate. Interestingly, what people later discovered is that what matters is not that the two spikes come in at exactly the same time, but whether one comes in a little before or after the other, within about plus or minus 10 milliseconds. If the presynaptic input comes in before the output, within a window of about 10 milliseconds, you increase the strength; but if the input comes in 10 milliseconds after the output, you decrease the strength. And if you think about it for a second, what's implicit here is the concept of causality: if the input comes in before, it could have contributed to the output, but if it came in afterwards, it certainly couldn't. And that's the essence of persistence: you have persistence of a causal chain.

So that's an actual mechanism that's observed in neurons?

Yes, that's been established in the cortex, in the hippocampus, and in the superior colliculus.
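Here is a toy sketch of the two rules just described: the classic Hebb rule (coincident activity strengthens a connection) and the spike-timing-dependent version, where a presynaptic spike arriving within roughly 10 ms before the postsynaptic spike strengthens the synapse and one arriving just after weakens it. The learning rate and window constants are made up for illustration.

```python
# Toy plasticity rules with made-up constants, not a biophysical model.

def hebb_update(w, pre, post, lr=0.01):
    """Classic Hebb rule: dw is proportional to the product of pre- and post-activity."""
    return w + lr * pre * post

def stdp_update(w, t_pre_ms, t_post_ms, lr=0.01, window_ms=10.0):
    """Timing-dependent rule: the sign of the change depends on which spike came first."""
    dt = t_post_ms - t_pre_ms                 # positive when pre leads post
    if 0 < dt <= window_ms:
        return w + lr                         # pre before post: potentiate (could be causal)
    if -window_ms <= dt < 0:
        return w - lr                         # pre after post: depress (cannot be causal)
    return w                                  # outside the window: no change

w = 0.5
w = hebb_update(w, pre=1.0, post=1.0)                 # both active together -> w grows
w = stdp_update(w, t_pre_ms=12.0, t_post_ms=18.0)     # pre leads by 6 ms -> potentiate
w = stdp_update(w, t_pre_ms=25.0, t_post_ms=18.0)     # pre lags by 7 ms  -> depress
print(round(w, 3))
```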
It's a universal mechanism, though not the only one, by the way; there are many other forms of plasticity, and I gave you a couple of examples earlier. Even with synapses: in the hippocampus, for example, there was a beautiful paper published just a few years ago, four or five years ago, by Jeff Magee, showing that there's another form of plasticity with a window of one second. It's involved when you're creating what's called a place cell in a rat: the rat is running around, you stimulate a pathway, and suddenly this neuron becomes active whenever the rat comes back to that location. It's called a place field.

So going back to Hebb: what you're saying is that Hebb imagined you would learn things by recirculating information through this neural net, and that if it came through again and looked like what it had before, then that weight would somehow be increased. Was that the idea?

Well, at every point along the chain you increase the strength, so when the activity comes back to the beginning it activates the neuron again; it's like a chain reaction. A very simple idea. In modern networks, with a lot more connections, it's not a simple chain, but in a recurrent network the activity reverberates; there's an analogy with some physics.

But did that idea from Hebb grow into some big chain of developments? Because by the time one was talking about things like the perceptron and Rosenblatt, that wasn't a learning system, that was a pure...

No, no. But okay, now here's a really interesting chapter in neural-network history which people aren't always aware of, and I doubt you're aware of it, but it's very important. There was a period in the '60s and '70s, especially the '70s, in Japan, when a couple of Japanese researchers started actually building networks that were precursors of the ones we have today. For example, we have something called convolutional neural networks, which are used for object recognition and vision; they're feedforward, and they have the architecture you see in the visual cortex. Well, Fukushima in Japan, who was working for the television company at the time, I think, had the idea of just building it, and it even had Hebbian plasticity. Now, he didn't get very far, because the computers then were incredibly slow, but he had the concept, and he was given the Franklin Institute prize a couple of years ago in recognition of his pioneering work.

And another Japanese researcher... I remember only one keyword, the Neocognitron. Is that it?

That's it, that's the one I was just talking about.

I only got one word, but yes.

Well, you have half the battle: you can Google it and find out a lot more. The other person I was going to mention is Shun'ichi Amari, who was a mathematician, and he developed networks that are kind of precursors of the Hopfield network. They weren't quite the same (the Hopfield network is characterized by symmetric connections), but he had the idea that you could get pattern recognition out of them. And then there's a whole family of associative networks that came out of Steinbuch in Germany, and others.
But those have been superseded in terms of their architectures and learning algorithms.

Okay, going back again: so there was Hebb, and then the next big thing seemed to be the perceptron and Frank Rosenblatt, and that was what, the late 1950s?

1959 was when he had his book, and that was when he got a lot of attention. It was a competing approach to artificial intelligence, and he had some demonstrations that were very interesting. He actually built a giant analog computer with potentiometers for the weights, and he had visual inputs he could give it, with photodiodes or some such.

In modern terms, this would be a one-layer neural network? This was a single layer of weights?

One layer of weights, you're right. And there is a class of problems for which you can actually come up with good solutions that way; they're called linear predicates. In the space of inputs you can draw a plane through the examples, and if the positive and negative examples are on different sides, then you can discriminate between them, and it will generalize to new inputs.

So in a sense we've gone from those one-layer neural nets to ChatGPT, which is maybe a 400-layer neural net or something.

That's right; they're called "deep" now. So there was a period in history (I told you that interest waxes and wanes) when, in the early '60s, there was a resurgence of interest both in artificial neural networks, with Frank Rosenblatt, and in artificial intelligence by writing programs for computers and so forth. And Marvin Minsky and Seymour Papert wrote a really sophisticated mathematical monograph called Perceptrons, where they did a really thorough analysis of what I just told you about, linear predicates, and they showed that most interesting problems don't fall into that category, so it's not a very good starting point. And at the end of all this beautiful math they offered the speculation that, in their view, there would never be a generalization of the perceptron learning algorithm. The key was that you could learn from examples; that's what Rosenblatt really did. He added the learning algorithm, which was very Hebbian, actually.

What was his learning algorithm?

Okay, here's what you do. You give it an input and take the output, and then you compare it with what it should be: is it a cat or not, yes or no. If it's right, you don't do anything. If it's wrong, you use that error to update all the weights, so that the next time, the weights that should have pushed it above threshold get stronger and the ones that pushed it the wrong way get weaker.

How do you know which ones to change?

There's a gradient: you just ask how the error changes if you change each weight. So it's exactly the same story as back prop; it's an error gradient.
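Here is a minimal sketch of the perceptron learning rule as just described, run on a toy linearly separable problem (an AND gate). The learning rate, the data, and the epoch count are illustrative choices, not anything from the conversation.

```python
import numpy as np

# Perceptron rule: threshold the weighted sum, compare with the target,
# do nothing when correct, and nudge the weights toward the target when wrong.

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])                     # AND of the two inputs (linearly separable)

w = np.zeros(2)                                # one layer of weights
b = 0.0                                        # bias (threshold)
lr = 0.1

for epoch in range(20):
    for xi, target in zip(X, y):
        out = 1 if xi @ w + b > 0 else 0       # thresholded output
        err = target - out                     # zero when the prediction is right
        w += lr * err * xi                     # strengthen weights that should have fired it
        b += lr * err

print(w, b, [1 if xi @ w + b > 0 else 0 for xi in X])   # converges to [0, 0, 0, 1]
```

For a linearly separable problem like this one, the perceptron convergence theorem guarantees the loop above finds a separating plane no matter where the weights start, which is the "convexity" point made next.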
So Frank Rosenblatt figured that out? He was the person who...?

Yes, that was the beauty: that's the perceptron learning algorithm. He figured out gradient descent. Now, it was a convex problem, which means that it didn't matter where you started with the weights; gradient descent was guaranteed to take you to the bottom, which would be the best set of weights for solving that problem, if it was solvable.

What was Frank Rosenblatt's background? How did he get into that?

He was a psychologist, but he had very broad interests, in a lot of things in AI and electrical engineering, and he had a lab where he did experiments and built things. He was what you'd call a polymath today.

Right. So this was a time when there was early interest in machine vision for military applications, for optical character recognition, things like that, if I understand correctly.

Well, that's always been a practical problem. In fact, one of the first successes of neural networks was handwritten digit classification for the post office; that's Yann LeCun, by the way. But I want to finish the story about Marvin Minsky and Papert, because it really had a huge impact. They speculated that there would be no generalization of that learning algorithm to multi-layer networks, and that's where Geoffrey Hinton and I started. In other words, if you talked to anybody in AI or engineering, what they would tell you was: forget it, you're working on a dead end, and we have mathematical proof of that. There was no mathematical proof; it was just buried in this very complex mathematical book, and it was purely speculative; there was no theory behind it. But in any case, Geoff and I started from a Hopfield network and "heated it up," and we discovered, both of us together, what we called the Boltzmann machine, because it was fluctuating and you could use physics to understand the equilibrium states. We showed that in equilibrium there was a learning algorithm for as many layers as you want. In fact it's really beautiful, because it was a local algorithm: unlike back prop, which requires global information about all the gradients and all the weights, it could all be done with just local correlations between pairs of neurons. So it was Hebbian plasticity. Boltzmann machine: we'd solved the problem of multi-layer learning; we thought we'd figured out how the brain works. It turned out to be a little bit more complicated.

I remember that paper. That was about '85, is that right?

Yes, exactly. The learning paper came out in Cognitive Science in 1985.
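As a very rough sketch of the form of the Boltzmann machine learning rule (each weight changes according to a purely local difference of pair correlations, measured with the data clamped versus with the network running free), here is a drastically simplified version in Python. It is fully visible, with no hidden units (which were the real point of the 1985 paper), and tiny enough that the model correlations are computed exactly by enumerating states rather than by sampling an equilibrium distribution; the patterns and constants are made up.

```python
import itertools
import numpy as np

# Fully visible toy Boltzmann machine. The point is only the shape of the rule:
# dW_ij is proportional to <s_i s_j>_data - <s_i s_j>_model, a local quantity.

N = 4
data = np.array([[1, 1, 0, 0],
                 [0, 0, 1, 1]], dtype=float)        # two toy binary patterns to learn

W = np.zeros((N, N))                                # symmetric weights, zero diagonal
b = np.zeros(N)                                     # biases
states = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)

def model_probs(W, b):
    """Boltzmann distribution over all 2^N states: p(s) proportional to exp(s'Ws/2 + b's)."""
    energies = -0.5 * np.einsum('si,ij,sj->s', states, W, states) - states @ b
    p = np.exp(-energies)
    return p / p.sum()

lr = 0.1
for step in range(2000):
    p = model_probs(W, b)
    corr_data = data.T @ data / len(data)           # <s_i s_j> with the patterns clamped
    corr_model = states.T @ (states * p[:, None])   # <s_i s_j> with the network running free
    dW = corr_data - corr_model
    np.fill_diagonal(dW, 0.0)
    W += lr * dW
    b += lr * (data.mean(axis=0) - p @ states)

print(np.round(model_probs(W, b), 2))               # probability mass concentrates on the two patterns
```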
But okay, I'm still back at the time of Perceptrons. Marvin Minsky always used to claim that he didn't mean to kill the field by writing that book. You must have talked to him about it.

Well, you talked to Marvin a lot; you knew him probably better than I do. But I actually have first-hand evidence on this. At a meeting held in 2006 at Dartmouth, the fiftieth anniversary of the 1956 meeting on AI, there were about six or seven people from the original meeting. This is 50 years later, so these were old men now, and Marvin was there. At the end of the meeting there was a banquet, and at the banquet each of them got up and said what they thought about the meeting (it was a very interesting meeting, looking forward and backward at the future of AI and where we are now), and then they asked for questions. So I got up and asked Marvin a question. I said, "There are people in neural networks who consider you to be the devil, because you held back the field for decades. Are you the devil?" And I had never seen this before: he suddenly got extremely animated and launched into this long tirade about how we didn't understand anything about scaling or complexity, and blah blah blah, and that it doesn't do parity, and so forth. I had clearly pressed a button. Finally I had to stop him, because it was getting embarrassing, and I said, "Dr. Minsky, I asked you a binary question. Yes or no: are you the devil?" And he spluttered; he didn't know what to say, and he finally said, "Yes, I'm the devil."

Okay, that's very Marvin.

Actually, I've come to moderate my view, in the sense that I don't think his book was the only reason. It may have contributed, but I think the real reason was that in order to make progress you needed a lot more computing, and they didn't have it back then.

But one of the ironies is that Marvin, when he was a graduate student at Princeton, had actually made a physical electronic neural-net machine. The SNARC, or something; was that what it was called? I don't quite remember, but it sounds like something out of Lewis Carroll.

Right, I think it could have been. I've heard that story, and I don't think anything was published, but he built these little networks out of parts, out of electrical components. And I wonder about the psychology there, because he clearly must have been very intrigued, and I wonder if he got burned or something, so that he turned his back on it. I don't know. But he wrote his thesis on this in the math department, and they didn't know what to make of it, because nobody in the department knew anything about neural networks, so they sent it over to the Institute for Advanced Study, and the mathematicians there only talk to God, right? They sent back the following cryptic comment: that it may not be mathematics, but someday it will be. And they were right, absolutely right.

I wonder who wrote that.
I think von Neumann had died by that point. Von Neumann would have been somebody like that, because he was pretty involved in this whole brains-versus-computers business, et cetera, although I'm not sure he ever figured it out. He had a whole analysis of probabilistic logics and so on, because he was convinced that error rates were a critical feature of brains versus computers, but I don't think he ever worked on neural nets.

I don't think he ever worked on them either; I think he had read the McCulloch-Pitts paper. He gave the Silliman Lectures at Yale, and he raised an interesting problem, which is: how do you prevent imprecision, noise, error, from propagating? He came up with redundancy as one answer. But it's still a very interesting question: how is it that the brain is so resistant, not just to a little bit of noise, but to having a big piece scooped out? As long as it's not a critical area that controls your breathing or something, you're okay. And I'm sure the same would work with ChatGPT: if you scooped out a billion weights, I'm sure it would do just fine.

I think so too.

And that's the point. It was called graceful degradation, and I think it has to do not so much with redundancy, which usually means having identical replicates that you average over (that might work), but with multiple pathways: there are many ways to solve the same problem, and the brain has the advantage of converging inputs. The other major difference is that the brain is probabilistic. It has to keep track of evidence, and a lot of it comes in from different sensory systems, vision, audition, and so forth, and it has to be combined to come up with the best guess about, say, which word was just said. People who are going deaf, by the way: it helps if they're looking at the speaker's face, because they can do lip reading; there's a lot of information about the sound in the lips. But how do you combine that? You have to combine those two senses, and you have to use probabilities in order to eventually make the right decision. And by the way, that's the way the Transformers work. They don't just pop out a word because it's the only word; they assign probabilities to all the words, and the ones with the highest probabilities are the ones used in the next loop, for the next word, for the subsequent output.
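To illustrate just the outer loop being described (the model assigns a probability to every word in its vocabulary, one word is chosen, and it is fed back in as context for the next step), here is a toy Python sketch. The tiny bigram table is a stand-in for the trained network; in a real Transformer the probabilities come from attention layers over the whole context, and the vocabulary and numbers here are made up.

```python
import numpy as np

# Autoregressive generation loop with a stand-in probability model.

VOCAB = ["the", "spirit", "is", "willing", "flesh", "weak", "."]
# next_word_probs[i][j] = P(word j | previous word i)  (made-up numbers, rows sum to 1)
next_word_probs = np.array([
    [0.00, 0.45, 0.00, 0.00, 0.45, 0.00, 0.10],   # after "the"
    [0.00, 0.00, 0.90, 0.00, 0.00, 0.00, 0.10],   # after "spirit"
    [0.00, 0.00, 0.00, 0.50, 0.00, 0.40, 0.10],   # after "is"
    [0.70, 0.00, 0.00, 0.00, 0.00, 0.00, 0.30],   # after "willing"
    [0.00, 0.00, 0.90, 0.00, 0.00, 0.00, 0.10],   # after "flesh"
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],   # after "weak"
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],   # after "."
])

rng = np.random.default_rng(0)
context = ["the"]
while context[-1] != "." and len(context) < 12:
    probs = next_word_probs[VOCAB.index(context[-1])]   # a distribution over the whole vocabulary
    nxt = rng.choice(len(VOCAB), p=probs)               # sample (greedy argmax is the other option)
    context.append(VOCAB[nxt])                          # feed the choice back in for the next step

print(" ".join(context))
```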
So, going back to the history for a minute. At some moment the physicists got involved. Sometime around maybe the late '50s or early '60s there started to be physicists who said, you know, these neural nets are a little bit like spin systems: the neurons are a little bit like spins, up and down, in magnetic systems, and analysis can be done that way. At least that's my impression. There was a fellow, Caianiello or something, whom I remember running across at some point.

Yes, there's a whole cast of characters from that era. Caianiello was one of them, and there was another guy who was at Irvine; I've forgotten his name now...

Gordon Shaw?

Gordon Shaw, you've got it. And you're right, they had this intuition that you should be able to use statistical physics to understand it. But these were isolated people, publishing in obscure places, and I would say they had little or no impact, because they weren't part of a community; there wasn't a community. It was basically a lot of isolated, very smart people going in different directions, and it wasn't really adding up. And the other thing, of course, is that the neuroscientists couldn't understand anything they were saying, because they didn't have the background, so they had no impact on the neuroscientists. They may have had some impact on some of their colleagues, but their colleagues didn't know anything about the brain, so they couldn't help them.

What problem did they think they were solving?

Okay. I knew Gordon Shaw, in fact, and I talked to him a little bit about this. He really felt there was some kind of collective phenomenon going on in the brain, something like a phase transition, that allowed many neurons to work together collectively, and he used analogies from physics as a way of understanding that, like the phase transition in magnetism and so forth. There are still people who think along those lines, about phase transitions. But my first experience with this group was a meeting I attended in 1979, here in La Jolla at UCSD, organized by Geoff Hinton. He was a postdoc here at the time, working with Dave Rumelhart; these were cognitive psychologists, Jay McClelland and Dave Rumelhart, and the PDP group was here in the '80s. He organized a meeting called "Parallel Models of Associative Memory," and associative memory was probably the best-established neural network model at the time. The idea is that you have inputs and outputs and you try to store patterns in the weights, and you can ask questions like how many patterns you can store, how much interference there is between them, and so forth. He invited a bunch of these isolated people: Don Geman from statistics; Teuvo Kohonen from Finland, who had a way of using collective neural networks to do categorization; Jerry Feldman, a computer scientist from Rochester; and others. So we're talking about people from many different departments. I was coming from physics; I was actually a postdoc at the time in the Harvard neurobiology department, working with Steve Kuffler, so I was making a transition to neuroscience.

But you'd already been involved with neural nets by that time. I was just looking, and I found that 1975 thing of yours about neural nets.

You're right about that. I had already been bitten by the neural network bug before I got my PhD, and I did that primarily on my own, though I had a lot of help from John Hopfield, who was at Princeton at the time.

But John Hopfield was a condensed-matter physicist who came out of very much that physics tradition, though a different tradition from, say, the
Gordon Shaw world and so on, is that right?

Yes. Well, he was actually a biophysicist by then. He had done earlier work on some condensed-matter problems, but by that point he was working on more microscopic biophysics, things like hemoglobin, which is an interesting problem; there may be something like a phase transition there, when you bind oxygen and when you give it up. And he'd had a very influential idea about error correction: when DNA, when RNA, is being translated into proteins, there's a certain error rate, maybe one part in a thousand or something, and that's not good enough, because you'd get all kinds of mutations in the proteins. He realized that you could fix it with error correction, but it would take energy and it has to be irreversible, otherwise it could go backwards. That turned out to be true, and it was a very important insight that came from physics; the experiments were done subsequently, at Bell Labs interestingly, where he spent time as a consultant (his academic appointment was at Princeton). So he was in that tradition of thinking about biological problems, and he got bitten by the neuroscience bug. It was going around; a lot of people were infected, and I was one of them. But unlike me, he had access to all these meetings, because the neuroscientists thought, here's somebody really smart, we should invite him. There was a whole series of meetings of the Neurosciences Research Program in Boston; the NRP brought in people from different areas of neuroscience to come together. For example, there was one on neural coding; Ted Bullock was the person who put that one together, very prescient. John would go to these meetings and come back and tell me about them, and I would hear all about the wonderful, interesting things happening in neuroscience, and he got more and more interested in trying to figure out how the brain works. And that led, for him, to the 1982 PNAS paper, which was the Hopfield network.

That's right, and it was a huge influence, first on the physicists. He was able to bring into neuroscience people like Haim Sompolinsky, who is now a major figure in computational neuroscience.

Yes. I remember he was at Caltech at that point, and so was I, and I heard about that network idea probably in 1981 or something like that. The frustrating thing for me is that I never managed to actually reproduce those results. I wrote programs that should have done what his paper said, and they never worked.

Stephen, you had a bug in your program, because it works for everybody else.

Really?

In other words... oh, wait a second. Okay, there's actually a subtlety here, a very interesting subtlety. It would only work if you did the updates asynchronously.

Yes, yes.

If you try to do it synchronously, it goes into limit cycles and all sorts of other things. But if you do it asynchronously, it is guaranteed, literally mathematically guaranteed, to converge.
But as I recall, what I tried to do was to map out — I mean, the big idea is that you get these attractors: there'll be a bunch of possible inputs and they all evolve to a single output.

Multiple local minima that were point attractors, and they could be used for doing what's called completion: you give it half of a vector, which puts you close to the basin, and then it completes the vector — it gives you the entire output. That's why it's called completion.

Because I'm a good archivist, I'm sure I still have my actual program from that time; I need to go find it.

I'm not accusing you of making a mistake in your program; I'm just saying you must have been putting in the wrong inputs or something — it could even be that you were looking at the wrong output.

Okay, but the point is that this was essentially a mathematical structure — not particularly intended at the time, I think, to have a direct correspondence with things in the brain — but a mathematical structure that would let you do this thing of going from many possible inputs — many possible things that might be like a letter A, many possible things that might be like a letter B — to a single point attractor: it's an A, or it's a B.

Right, and the reason that was an important step forward is that up to that time, the people working on these associative networks worked with linear ones — linear summations and then a threshold — and Hopfield showed that you could do the same thing in a highly nonlinear network: you could store patterns, and you could actually estimate the capacity very nicely. It had a lot of nice properties, and like I say, it led to the Boltzmann machine and a really nice learning algorithm. So it was a conceptual breakthrough.

Right, I remember this capacity question — how many vectors can you store before the thing breaks down? How many attractors fit in your space, so to speak, how many distinct ones?

Yes, that was one of the things you could actually show analytically; I think it was something like 0.14 N, where N is the number of neurons.

Okay, so this happened. And one feature of that network is that it's very recurrent, in the sense that you put in an input and it keeps on doing more stuff to that same input; it's not one of those feed-forward, ripple-through type things.

That's what made it impressive: he harnessed the nonlinearity to do something interesting.

Okay, so what happened next? That was a single network where you kept recirculating through the network, as opposed to multiple layers of a perceptron-like network.

That's right. And because it was kind of limited — there were no hidden units in it — it was very difficult to build on, although a lot of people were thinking about it. For example, recently I've gotten very interested in learning sequences, and there were people who would put in asymmetric connections that would allow the network to settle into one local minimum and then jump to another.
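For concreteness, here is a minimal sketch of the kind of network being described — Hebbian (outer-product) storage, asynchronous updates of ±1 units, and completion from a corrupted probe. It is an illustration in plain NumPy under those assumptions, not Hopfield's original code; the 0.14 N figure in the comments is the classic capacity estimate mentioned above.

```python
# Sketch of a Hopfield-style associative memory: Hebbian storage plus
# asynchronous updates, which keep the energy from ever increasing.
import numpy as np

rng = np.random.default_rng(0)

def store(patterns):
    """Hebbian (outer-product) storage of +/-1 patterns, zero diagonal."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0.0)
    return W / n

def recall(W, probe, sweeps=20):
    """Asynchronous updates: one randomly chosen unit at a time."""
    s = probe.copy()
    n = len(s)
    for _ in range(sweeps * n):
        i = rng.integers(n)
        s[i] = 1 if W[i] @ s >= 0 else -1
    return s

n, k = 200, 10                               # 10 patterns in 200 units, well under ~0.14*n
patterns = rng.choice([-1, 1], size=(k, n))
W = store(patterns)

probe = patterns[0].copy()
probe[: n // 2] = rng.choice([-1, 1], size=n // 2)   # corrupt half the vector
print("overlap after completion:", recall(W, probe) @ patterns[0] / n)  # close to 1.0
```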
People you probably know — David Kleinfeld, I think, was one of them here. There was a tremendous amount of activity; it was really exciting. And the Neural Information Processing Systems meeting brought together the physicists with the neuroscientists and the computer scientists and the statisticians and people studying computer vision and speech recognition — they all came to these meetings, which were organized originally by Ed Posner at Caltech. Then he died tragically, and I became the president of the foundation that ran it, and I have been ever since. What was exciting back then was just being able to meet people from all those different fields, and they all had the same problem: the tools and techniques available at the time didn't let them make progress with vision or speech or robotics. They thought there had to be another algorithm or architecture that would do it, and neural networks were very promising because there were learning algorithms — you might be able to learn the solution. The problem was that it was a Tower of Babel: everybody would be speaking in their own jargon, and it was hard for other people to understand what they were saying. The physicists would talk about all these equations and attractors and so forth, and the neuroscientists couldn't figure out what that was; then the neuroscientists would come up and talk about all these Greek words for different parts of the brain — the lateral geniculate nucleus — it was all Greek, impenetrable. The only ones who really got through, I think, were the engineers, because they actually built things and could demonstrate them: look what I'm building, here's what it's supposed to do. They had their own language too, but they would at least be able to explain what the goal was and how they did it.

You didn't mention the mathematicians — the attractor concept was really one from mathematicians, not so much from physicists.

The mathematicians who came were mainly statisticians; there were some pure mathematicians, dynamical systems people, but not as many. In retrospect, I think that's one of the things that was missing: we didn't appreciate how important dynamics was going to be for understanding the trajectories in these very large-scale networks.

Although the truth is, dynamics is not — I mean, in ChatGPT, for example, it is a feed-forward network; there isn't really dynamics.

Well, you'd be surprised. First of all, there's a loop that takes you from the output back to the input, and it circulates. That's dynamics.

It's an outer loop, so to speak.

It's an outer loop, but that outer loop is found in the brain too, between the cortex and the basal ganglia. That's how I'm able to speak to you in a sequence of words: my basal ganglia has learned how to do that automatically, without my having to think about it.
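A schematic of that outer loop, for anyone who wants it spelled out: the network itself is a single feed-forward pass, but each generated token is appended to the context and fed back in, which is where the dynamics comes from. The function `next_token_distribution` below is a hypothetical stand-in for a real model, not any actual API.

```python
# Autoregressive "outer loop": feed-forward passes strung together by
# recirculating each output token into the next input.
def generate(prompt_tokens, next_token_distribution, steps=50, end_token=None):
    tokens = list(prompt_tokens)
    for _ in range(steps):
        probs = next_token_distribution(tokens)   # one feed-forward pass
        nxt = max(probs, key=probs.get)           # greedy choice, for simplicity
        tokens.append(nxt)                        # output recirculated as input
        if nxt == end_token:
            break
    return tokens

# toy stand-in "model": keeps predicting "words" until the context is long enough
def toy_model(tokens):
    return {"words": 0.6, "end": 0.4} if len(tokens) < 6 else {"end": 1.0}

print(generate(["my", "basal", "ganglia", "produces"], toy_model, end_token="end"))
```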
Okay, but back to the 80s — we're heading for your contribution with Boltzmann machines and so on. I think one precursor to that must have been the simulated annealing idea: you're trying to find the minimum of this kind of mountainous surface, you're using gradient descent, and you typically get stuck in the mountain lakes, so to speak. And then there was this idea, from around 1980, that if you jiggle things around — like you do when you're trying to make something crystallize, in annealing — there's more likelihood that you won't get stuck.

You just brought back a vivid memory. Geoff and I were collaborating at the time, and our view was that if you connected up the network properly, with the right connections between different features — like we saw in the visual system — we could make progress with vision. And then the question is, can you learn those connections? I had just read a paper by Scott Kirkpatrick on simulated annealing — he was at IBM's Thomas J. Watson lab at the time — and I said to Geoff: you know, the Hopfield network is running at zero temperature. You always pick the lowest energy — you either flip or you don't, depending on which has the lower energy — so you're always going downhill in energy; that's why you can prove it has to converge eventually. And I said, let's just heat it up. I remember that, because once we had that, we were able to start thinking about avoiding local minima: you start at a high temperature and then gradually reduce it, and if you do it slowly enough, you can prove that you'll eventually find the global minimum. But you have to do it really slowly.

The claim is that you can prove that?

Well, you have to go in logarithmic time — that was the hook: you have to go really slowly. And it took longer and longer the bigger the network, because the whole thing has to come to equilibrium — all the parts have to trade information, activity patterns.

I see — this is specifically for a neural net.

That's right. But what we discovered — and for me it was a shock, because that's not what we were looking for — is that something magical happens at equilibrium: suddenly, with just Hebbian learning, you can reduce a loss function — the loss between the input and the output, if you're training it to do some kind of categorization, or really any input-to-output function. You could minimize that over time, and you can show that if there exists a network that can do that transformation or categorization, then the Boltzmann machine can find it — and it could have an arbitrary number of hidden layers. It just took an enormously long time to come to equilibrium and then to compute all the statistics: you have to compute the average correlations over many, many time intervals.

But is that happening because it's basically ergodic — it will eventually visit every state, and so one of those states...

That's the assumption you have to make to conclude that it's going to find the solution. It's just a very slow process. But I still think it's much more beautiful than backprop. Backprop is kind of a statistician's nightmare: you're twiddling all these parameters, and there's no real theory behind it — it's just taking gradients.
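A minimal sketch of what "heating it up" means, assuming the same ±1 states and weight-matrix conventions as the Hopfield sketch earlier: at temperature T the unit update becomes probabilistic, and T is lowered on the (painfully slow) logarithmic schedule mentioned above. This shows only the annealed sampling; the Boltzmann machine learning rule itself — the Hebbian comparison of correlations with the data clamped versus free-running — is not shown.

```python
# Finite-temperature (stochastic) version of the asynchronous Hopfield update,
# with a slow logarithmic cooling schedule. At T -> 0 it reduces to the
# deterministic "always go downhill in energy" rule.
import numpy as np

rng = np.random.default_rng(1)

def anneal(W, state, T0=2.0, sweeps=200):
    s = state.copy()
    n = len(s)
    for k in range(1, sweeps + 1):
        T = T0 / np.log(1 + k)                   # logarithmic cooling
        for _ in range(n):                       # one asynchronous sweep
            i = rng.integers(n)
            h = W[i] @ s                         # local field on unit i
            p_up = 1.0 / (1.0 + np.exp(-2.0 * h / T))
            s[i] = 1 if rng.random() < p_up else -1
    return s
# Drop this in place of recall() in the earlier sketch to escape shallow local minima.
```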
But it works — you could argue it's the great savior of multivariate calculus.

Oh, no — we used backprop. When that came in, it actually saved NetTalk, because the Boltzmann machine was taking too long, but as soon as we put backprop in, boy, it solved it overnight.

So what was the origin of backprop?

I remember I was visiting Geoff in Pittsburgh, and he told me that Dave Rumelhart had this idea about taking gradients — the partial derivatives of the loss with respect to each weight in the network — and using the chain rule. And I said, well, that's interesting, but it's not as nice as the Boltzmann machine, which was able to do that too.

That's a very mathematical idea for a psychologist to have.

He was a really unusually broad psychologist. He was working at the time on language, trying to understand — with Jay McClelland — something about how words work together, and he had the idea that you could use a network with constraints between different words that could settle, by relaxation, into some kind of interpretation of which words went together — noun phrases and things. In a sense he was beginning to think about dynamics, and I think that's what led him to this backprop idea. And it was interesting: the PDP books came out in 1986, which was when the first backprop paper came out in Nature, so they had already had it for a few years — and I was already using it on NetTalk when they started talking about back propagation.

Was it introduced as a kind of applied-mathematics idea, or what?

Well, by the way, the actual algorithm goes way back — a lot of people had it. In fact there's a control theory book, going back to I think the 1940s, where they posited that you could solve classical control problems by taking gradients in a large network like this. The problem was that people just couldn't simulate it — the computers couldn't take it — so nobody had applied it to anything practical. So Dave, along with Geoff, applied it to these networks we were building — we were simulating all these Boltzmann machine networks — and he said, well, I'll just take one of these networks and apply backprop to it, and boy, was it efficient. Like I said, we used it for NetTalk — which, by the way, was a summer research project with a graduate student from Princeton.

So what were the other practical things going on at that time? There was NetTalk — what were the other things where people were actually using neural nets and doing stuff they could demo?

Well, I gave you an example earlier about lip reading: I had a graduate student from electrical engineering who took the spectrogram from the auditory signal, mapped the face — the lips — onto the spectrogram, combined them, and showed that you could improve anybody's speech recognition system that way.

Also, as I recall, there were the optical character recognition people — though that was much more of a practical, engineering kind of thing.
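A minimal illustration of the chain-rule idea attributed to Rumelhart above: a tiny two-layer network trained by following the gradient of a squared-error loss with respect to every weight, pushed backwards layer by layer in plain NumPy. A toy under those assumptions — not NetTalk or any historical code.

```python
# Backpropagation by hand: forward pass, then chain rule backwards through
# each layer to get dLoss/dW, then a gradient-descent step on the weights.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 3))                     # toy inputs
y = np.sin(X.sum(axis=1, keepdims=True))         # toy targets

W1 = rng.normal(size=(3, 8)) * 0.5
W2 = rng.normal(size=(8, 1)) * 0.5
lr = 0.1
for step in range(500):
    h = np.tanh(X @ W1)                          # forward pass
    out = h @ W2
    err = out - y                                # derivative of squared error w.r.t. out
    gW2 = h.T @ err / len(X)                     # chain rule: output layer
    gh = (err @ W2.T) * (1 - h ** 2)             # chain rule: back through tanh
    gW1 = X.T @ gh / len(X)                      # chain rule: input layer
    W1 -= lr * gW1                               # gradient descent on every weight
    W2 -= lr * gW2
print("final mean squared error:", float(np.mean((np.tanh(X @ W1) @ W2 - y) ** 2)))
```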
Yes, that's right — there were a bunch of engineers who particularly liked this and applied it to a lot of things. By the way, there were earlier people, like Bernard Widrow, who were using simple networks for equalization in modems — how to shift the frequency bands away from the noise — so there were a lot of these small engineering applications, non-trivial ones by the way, that were already there and in use; this was basically scaling up what they had already done. And you know, success has a thousand fathers: we can go back and see how many people had pieces of the puzzle that worked under special circumstances or for a particular problem, and it wasn't until all the pieces were put together — backprop, together with the architectures and the deep networks — that we saw what it was going to look like. It took twenty years, by the way, to go from the concept — the learning algorithm — to something that could solve a really important practical problem, because you needed a lot more computing, a lot more data, and much better algorithms.

But when people were doing optical character recognition, they had been using image-processing kinds of things — saying, when there are these characteristics it must be an A, when there are those it must be a B — and that had turned into a neural net story suddenly, by the late 80s, I think.

Yes, that's right. The traditional way you solved a problem in computer vision was by hand-crafting features that were specific to specific objects — features that were invariant under rotation and under projecting the object onto the two-dimensional image — and that was very, very slow, because it required crafty hand-engineering and also a lot of compute. You also have to worry about whether the feature is present in other objects, because it will be, so you need a combination of features, and then the question is how you ultimately make the decision. In a sense it's like building a neural network by hand, except that it's very labor-intensive and takes a long time. The beauty of learning is, first of all, that it will automatically figure out what features are needed at the lowest levels, then combine them, and then solve the invariance problem at a higher layer. But the real beauty of network learning is that you don't have to be a domain expert — you just have to have lots of data, number one. Number two, the very same architecture, with little twiddles in the number of layers, the number of units per layer — the hyperparameters — can be applied to any problem. In other words, learning is a universal method, and it trades off computing for labor. Computing is now dirt cheap compared to what it was back then, so learning is the preferred method. Model-based solutions are still important and useful for small problems, by the way, but they don't scale very well.
Well, I remember, probably around 2010, talking to you — I was interested in curating machine-vision methods for common kinds of objects. That was when things like the Viola-Jones method for finding faces and so on was around — an image-processing-based idea — and I was really close to putting a lot of effort into curating a hundred different kinds of objects and figuring out what —

Well, you were saved by deep learning, in 2012 or so.

That was the turning point, yes — the famous ImageNet NeurIPS poster presentation by Geoff and a couple of his graduate students. By the way, Paul Viola, of the Viola-Jones face detector — which just detected faces, and that was a difficult problem: you have the image, where are the faces? — got his PhD from MIT working in my lab in the 90s. He came to my lab because no one else at MIT was working on neural networks.

Huh, okay. But was that a neural-net-based thing in the end?

A lot of the concepts that went into it — he used entropy and a bunch of other concepts that came from neural networks. No, it was a neural network, and it clearly had some labor in it, in terms of how to package it into an engineering solution, but he was in my lab — Tony Bell was there at the same time; they were in the same room, talking to each other. Tony Bell used entropy to create an algorithm for independent component analysis, and Paul created an entropy-based algorithm for face detection.

All right, back to the history. We're in the mid-1980s: back propagation gets invented, and there are a few early applications of neural nets, and then my impression is that not that much happened — the big breakthrough was 2011, 2012. What was happening in the period from the mid-1980s to that time?

What happened was that although they looked promising, the networks were too small to solve a really difficult, complex problem like image recognition, and the databases weren't big enough. So what had to happen was a lot of exploration of network architectures, and of ways of eking out a little bit better performance. I'll give you one example. Every year, pretty much like clockwork, Geoff would call me up and say, "Terry, I've figured out how the brain works" — he would tell that to other people too; we compared stories. I remember once he called up and said, "We can get ten percent better performance out of back propagation," and I said, "Patent it, because that's going to be valuable someday." It turned out to be something called dropout, and here's how it works — which is really interesting, because it turns out to be what we see in the brain too. Instead of assuming the weights are fixed, on every batch of inputs you train on — every epoch, as they're called — you drop out half of the units in the network, you just cancel them to zero, and then you do backprop; on the next batch you wipe out a different random set. That improves the performance and the learning process by ten percent, which is amazing, and there are a lot of reasons for it. Interestingly, it turns out that the synapses in your cortex — and you've got a lot of them, something like a million billion — are probabilistic: on any given input, the probability that there's any output at all is only about 10 or 20 percent.
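A minimal sketch of the dropout trick just described, at the level of hidden units: a fresh random half of the units is silenced on each training batch, and nothing is dropped at test time. The rescaling by 1/(1-p) is the standard "inverted dropout" convention, an assumption added here rather than something from the conversation.

```python
# Dropout: on every training batch, zero out a random half of the hidden units
# and backprop through what remains; use the full network at test time.
import numpy as np

rng = np.random.default_rng(3)

def forward_with_dropout(X, W1, W2, p_drop=0.5, train=True):
    h = np.maximum(0.0, X @ W1)                  # hidden layer (ReLU)
    if train:
        mask = rng.random(h.shape) >= p_drop     # a different random mask each call
        h = h * mask / (1.0 - p_drop)            # rescale so expected activity matches
    return h @ W2

X = rng.normal(size=(32, 10))
W1, W2 = rng.normal(size=(10, 20)), rng.normal(size=(20, 5))
out_train = forward_with_dropout(X, W1, W2, train=True)     # half the units silenced
out_test = forward_with_dropout(X, W1, W2, train=False)     # deterministic at test time
```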
So, again, the brain is filled with these tweaks, these little improvements on the algorithms, that make everything much more efficient, and we are gradually discovering them as we explore these network models. It took decades of basically brute force, trying out different things. I'll give you another example: Yann LeCun's convolutional network. There were a lot of things he had to do to make it work. One of them was something called pooling: he took a lot of feature detectors — say, for vertical edges — pooled them across space, and used that to create a little bit of spatial invariance for the next layer. Well, that's called a complex cell in the visual cortex, and it's something Hubel and Wiesel discovered back in '59 — or '62 was the classic paper, in the Journal of Physiology. What happened was that Yann would systematically discover things like that. Like normalization: before you pass the activity on to the next layer, why not normalize it so it all has the same average level, so that you're not saturating as you go up — and again, that's what inhibitory neurons, negative feedback, do in the cortex. So basically Yann converged, through trial and error with many different things, on the architecture of the visual cortex. The difference is that the neuroscientists knew those features were there, but they didn't know why — they didn't know computationally what a complex cell was doing, or what the inhibitory cells were doing — whereas Yann converged on those particular features for computational reasons. And now we can go back and look at the cortex and ask: what else is there that we don't understand that might also be useful? So there was a back-and-forth going on, back then, between the computational neural networks and neuroscience, and that's continued — it's actually blossomed within the last five or ten years; there's been a tremendous flow of ideas and algorithms in both directions.

But things like dropout, for example — that wasn't invented for neurophysiological reasons; that was invented as a kind of algorithmic hack.

Oh, exactly — though Geoff is brilliant; he's a genius at coming up with brilliant hacks, and often they're ones where nature got there first.

Okay, so you now must have the laundry list of the 25 hacks nature has discovered, of which the Geoff Hintons and Yann LeCuns of this world have only discovered 10 or something.

I would say it's less than that — there are many, many others. And I say that because when I look at the brain, what I see is not one giant network; I see dozens and dozens of different kinds of networks, in different brain areas, solving different problems very efficiently. We're just scratching the surface right now in terms of the principles we can extract from the brain. I think the brain will be a source of inspiration for decades to come.
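For concreteness, here is a toy illustration of the two tricks attributed to LeCun's convolutional nets above — pooling for a little spatial invariance, and normalization to keep average activity at a similar level before the next layer. The specific functions are simplified stand-ins chosen for this sketch, not LeNet's actual operations.

```python
# Pooling and normalization on a single 2-D feature map.
import numpy as np

def max_pool2x2(fmap):
    """2x2 max pooling over an (H, W) feature map with even H and W:
    nearby detections are merged, giving a bit of tolerance to position."""
    H, W = fmap.shape
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def normalize(fmap, eps=1e-6):
    """Divisive normalization: rescale so the map has unit average magnitude,
    keeping activity from saturating as it is passed up to the next layer."""
    return fmap / (np.mean(np.abs(fmap)) + eps)

fmap = np.random.default_rng(4).random((8, 8))       # pretend edge-detector responses
print(normalize(max_pool2x2(fmap)).shape)            # (4, 4): coarser, rescaled map
```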
But by the way — when you were doing NetTalk, was that on a VAX-type computer?

It was a Ridge computer, which was a knockoff of a VAX — an impressive 32-bit machine the size of a washing machine — and because I had a Presidential Young Investigator award I got a 50 percent discount, so I bought two. I had more computer power in my lab than the entire computer science department running one of the 780s.

Right, I remember — I had one of those Ridge computers at the Institute for Advanced Study; it was that plus some workstations. The Ridge was a very strange creature — I think it had some RISC-architecture technology in it, but it was really pretty primitive. Anyway, so that was the machine. But you're saying that in this period from the mid-80s up until 2010 or so — I mean, I would hear about these neural-net innovations, and I collected all these papers about neural nets over the course of decades, and I never really did anything with them; it never seemed to be going anywhere. But there was a group of people that continued to pursue these things, and it was a fairly small group — it wasn't accumulating a lot more people.

I know all of them, because I was one of them. Like I said, the NeurIPS meetings continued for, you know, thirty years through that period, but the core people were part of a group funded by the Canadian Institute for Advanced Research — CIFAR, it's now called.

Is that why Geoff Hinton went to Canada?

No — well, maybe it was part of the reason, but no. The main source of support for computer science at that time was DARPA, and Geoff was at Carnegie Mellon and felt awkward taking military money, so he went up to Toronto.

Was there a particular person in Canada who really believed in this kind of direction?

Not really. But the CIFAR group was a very interesting group: you would put together a project and apply for support, and it wasn't enough to do a big DARPA-scale project — it gave you some salary relief and maybe a postdoc. There was a core of about ten people — Yann LeCun, Yoshua Bengio, and a bunch of others — and I was on the scientific advisory board, so I went to all their meetings and saw this happen over a period of ten — actually it went on for twenty — years. It was a group that got together regularly; we had summer schools. Originally it was focused on vision, when Geoff was the director, and then when Yoshua took over it was focused on language. And Geoff, by the way, spent most of that time working on Boltzmann machines, because he really felt that that was where the real breakthrough was going to occur — it had a beautiful theory behind it; it wasn't just an engineering hack. But the dividing line between the old neural-network performance and the new came with the GPUs, because the GPU architecture — built, as you know, for doing graphics, which is large matrix operations on vectors, with multiple cores — meant that within the space of a few years you could jump one or two orders of magnitude in the size of the network you could train.
And that suddenly broke through the threshold for actually solving world-class problems in vision, speech recognition, language, and so forth. It was a phase transition, and here's why. ImageNet was a huge database — tens of millions of images with tens of thousands of categories, even obscure ones, all kinds of different birds and flowers and so forth that most humans probably couldn't name.

And lots of odd ones — I looked at it; it was a curiously curated thing.

Yes, but it didn't matter what the categories were, because what was needed was just something that would scale everything up. The progress being made in the field with the labor-intensive methods we discussed earlier was maybe a one percent decrease in error per year, and the error was huge. So what happened was that, using convolutional neural networks, these two grad students in Geoff's lab lowered the error on ImageNet by 20 percent — which was like 20 years of research overnight, as far as computer vision was concerned. And the way the community works is all with benchmarks: if you can do better than the existing algorithms, they can't compete. So things literally turned around within maybe a year or so, until everybody was using these CNNs.

And this was the AlexNet thing, is that right?

Yes, that's right — Alex Krizhevsky was one of those students.

I think there's a story that's told that they just left it training for a month and forgot about it, and then came back and found it had been successful. Is that story true, or is it apocryphal?

You'd have to ask Geoff — I've never heard that story, so it sounds like it could be apocryphal. Speaking of apocryphal stories: I once heard that the first DARPA grant to MIT was to build a robot that played ping-pong, and that when they got it they discovered they had forgotten to ask for money to write the vision program. I had always thought that was apocryphal, but at the very same meeting at Dartmouth where I met Marvin Minsky, I asked him — I said I'd heard the story and wanted to find out whether it was true. He said, "You're wrong — we did not give it to a graduate student as a summer project. We gave it to undergraduates."

To solve the vision part of it?

That's right. And I actually found an AI memo to that effect.

But by the way, had DARPA supported a lot of that neural-net stuff? I don't remember DARPA being deeply involved in neural nets.

No — DARPA was funding Marvin Minsky's AI Lab to build a robot that could play ping-pong, and it had nothing to do with neural networks. They were going to build a robot, they were going to write programs that did vision, and they were going to assign the computer vision program to students as a summer project — that's how naive they were about the complexity of the problem.
And that was probably — '56, '60 — probably 1960 or so. And just think about it: even in the 80s people were still struggling with computer vision.

Sure, but that was a different tradition — the symbolic tradition of AI — which is a long discussion, probably not for now. But coming back to GPUs: when did GPUs come on the scene? That must have been, let's see, around 2000 or so, maybe a little later. So was there then a ten-year period before the breakthrough work on neural nets with GPUs?

Oh no, no — I'm sorry, I was thinking of about 2010. People picked them up very early. I remember Yann LeCun, at one of the CIFAR meetings, saying that he had a grad student who was hacking on them and it looked very promising, and as soon as they got that up and running, everybody in neural networks realized this was exactly the right architecture for the problems we had. And there's CUDA, a programming language you could use, so you didn't have to microcode it — it was very efficient.

So the sequence was: the ImageNet breakthrough —

Actually, that was the big public breakthrough, but there was a breakthrough earlier at Microsoft. Geoff sent one of his graduate students there to help them with speech recognition, and apparently it worked much more effectively than their program, which was based on hidden Markov models — the Baum-Welch algorithm.

So when was Yann LeCun's LeNet work?

That was actually much earlier — that's when he was at Bell Labs, at Holmdel. There was a group there — Larry Jackel and a bunch of others, Sara Solla — a really powerful group of engineers who were solving these practical problems, and Yann was one of them; Isabelle Guyon was there at the time.

So this was handwritten digit recognition?

That's right. They had this database — MNIST — that came from a post office, I think in Rochester, New York: thousands and thousands of zip codes, something like 700 pixels each. The way the letters were being routed at that point, people were keying them in: a human would see the zip code, and then the data on the actual digits would be put on the envelope in invisible ink or something.

They'd have to, because a human being was the only image-processing system that could do it at the time.

Right. And I think there were competing systems, which weren't as accurate.

So how had those guys at Bell Labs gotten into neural nets?

It's interesting. Larry Jackel, for example, worked on chips — he built chips — and there was a whole group of people at Caltech and at Holmdel and a bunch of other places who realized that you could speed things up even further with special-purpose hardware, so they were designing chips that could do these things.

Was that a kind of Carver Mead-related enterprise, or was his effort a separate branch?
Oh, Carver Mead — no, no, Carver was the neuromorphic side. Actually I'm one of the founders of the Telluride neuromorphic engineering workshop, so I know everybody. Caltech was one of the hotbeds for the neural network people — in fact Demetri Psaltis was there, trying to build the first generation of optical neural networks, and now that's making a comeback, because it's actually a very practical solution to a lot of these problems.

When was that?

That was in the 80s — probably around the time of the first NeurIPS meeting, in '87, or after. I left Caltech in '83.

So you missed that.

And the CNS program — they set up a Computation and Neural Systems program, jointly with biology, and they had a lot of people building things, chips and so on. Caltech was way ahead of a lot of other engineering institutions, which were still back in the dark ages: they didn't understand the importance of learning, they didn't understand multi-layer networks.

But the group at Bell Labs at Holmdel — where had they come from? I think Yann LeCun was a physicist originally; is that correct?

No, I think he comes more from an engineering background; Larry Jackel was a physicist. Well, I might be wrong, but I thought Yann came from computer science. He had a version of backprop that he came up with on his own for his thesis, and I remember I met him at Les Houches, which is a physics summer school. It was wonderful — that poor guy came from a good school but he was isolated, with nobody around him at all to talk to, and he came to this workshop and he was so happy to talk to somebody.

What was the workshop about? Because that's a very physicist organization.

It was on neural networks — this was, I think, already after the Hopfield network had come into vogue, and it was physicists. And I have to say, I think the physicists had the right way of thinking about the problem, but here's what was missing in physics: you solve a problem in physics by writing down the Hamiltonian and solving the equations. You have a Hamiltonian for the hydrogen atom, and you solve it with quantum mechanics. But no one in physics — at least not mainline physics — thought about this: what if you are able to change the Hamiltonian? What if you have power over the Hamiltonian — what could you build? That's a different mindset, because that's what learning is all about: how you can implant the complexity of the world into your brain through learning. It's like letting nature adapt to the environment.

That's an interesting interpretation. Obviously the physicists around the early 1980s were all interested in spin glasses, and you have this whole pattern of different weights, effectively, in a spin glass. So your claim is that they basically never really imagined the idea of having control over them — I mean, even the early work was about finding energy minima in spin glasses, right?
That's right. The spin-glass people would study random networks, and you could prove analytic results about what the landscapes look like. And actually, one of Hopfield's contributions was to show how you put the local minima where you want them: it turns out you do it with Hebbian learning — you take the pattern vector, you take its outer product with itself, and you add that into the weights. That was the original Hopfield network, if I remember — that's how Hopfield created the local minima, the attractor states.

That's an interesting perspective — that the thing missed by the physicists was that you could change the weights, so to speak.

Right — at your will.

So, okay, we get to 2010 or so, GPUs come in, and one starts to be able to do all sorts of things. You're basically saying that, given GPUs, you could make the old neural net ideas work — is that fair?

You could jump. Okay, so Moore's law doubles every 18 months, right?

Hasn't it stopped a while ago? It hasn't been working for a while.

Okay, but let's just say it doubles every year. To get a factor of a hundred you'd have to do two to the — give or take — six or seven, so that's like six or seven years; and with the 18-month Moore's law it's more like a decade. So it's like jumping a decade into the future, and that is a major leap to a larger scale. And by the way, we don't really quite understand what determines — well, speech recognition, vision, and language all made this transition within about five years of each other, less than ten; in other words, it happened at different times.

They happened at different times because people were working on the problems at different times, and they didn't all have the same access to the tools that Google had and so forth — certainly for the Transformers.

But the reality, I think, is that it may be something similar to what happened in nature: very, very few things scale up — very few algorithms scale up the way brains have, and the way these large language models and deep networks have. If you look at the size of primate brains, they're on average much bigger per body weight, and the cortex has expanded even faster, in terms of the number of neurons and the connectivity within the cortex — in fact, in the case of the human brain, to fit it into a reasonable volume you have to have these convolutions; you're squeezing it in. And if you look at the cortex, although there are differences from area to area, it uses the same six layers and the same kinds of inputs and outputs everywhere — so more is better; something was good about that architecture. That wasn't the case with a lot of other parts of the brain, like the amygdala or the hippocampus — the hippocampus just stayed the same, it didn't expand, even though it's really important. But as you add more computing into the cortex, at some point you start getting capabilities that weren't there before, like language processing.
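The arithmetic behind "jumping a decade into the future," spelled out with the round numbers used in the exchange above:

```python
# A ~100x jump in usable compute, expressed in Moore's-law doublings.
import math

speedup = 100
doublings = math.log2(speedup)                                   # about 6.6 doublings
print(doublings * 1.5, "years at one doubling per 18 months")    # ~10 years
print(doublings * 1.0, "years at one doubling per year")         # ~6.6 years
```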
In computer science I've learned that you have to look at how each algorithm scales. The worst case is NP-complete, where it's exponential, but even polynomial kills you at a certain point, because it eventually uses up all the resources. The beauty of these networks is that they scale as N, the number of weights — it's linear.

Well, not yet — the learning part, with back propagation, is more like an N-squared story.

No, I'm talking about running it in real time — just the inference part.

Okay.

And that's really all that matters in terms of using it to solve problems. The offline learning — you're right, it takes an enormous amount of energy and so forth, and eventually we'll figure out how to do it online — lifelong learning, they call it — but not yet. But there's more to it: if you build special-purpose hardware — and nature built the hardware — so that you take an order-N problem with N weights and actually build all the weights physically, it becomes an order-one problem. You can run it in real time; you don't have to simulate it, it just runs. Very, very few algorithms can scale that way, and now there's a huge push for building special-purpose machine learning hardware — digital machine learning — and neuromorphic engineering takes it one step further, down to building special-purpose networks based on the same biophysics your neurons are based on.

So in the actual history of neural networks there's 2012, big progress in image recognition; then there's speech-to-text, which must have been four or five years ago now; and now large language models and so on. I'm curious whether that recapitulates anything you see in the actual history of biological evolution.

Well, you're right that the transition for Transformers came about ten years later — not quite; it was about three years ago, so less than ten — and it required a much bigger network than ImageNet and AlexNet, or than speech recognition. So you're right, it wasn't all at the same time; I was exaggerating. It occurred over a period when things were expanding very rapidly in terms of the sizes of the networks, primarily at places like Google and Microsoft. And for all we know it's going to continue — it's not going to stop, because you just add more of these GPUs, or now the TPUs at Google and other special-purpose chips, which are going to make it even faster and cheaper. And I did this calculation: there are 175 billion weights in GPT-3, right? How much cortex does that correspond to? It corresponds to the number of synapses under about a square centimeter of cortex, and you have about a thousand square centimeters of cortex in your brain.

Wait a minute — so there are about 100 billion neurons in the brain, and each neuron has what, 10,000 connections or something?

Roughly 10^14 synapses in total.

Hmm — 10^11 neurons and about 10^4 connections each: that's more like 10^15.

Okay, it's not quite that high — it's between 10^14 and 10^15. So that's something like a million billion synapses. And the networks are now up to, I think, a trillion weights — 10^12 — so they're still a factor of about a thousand away, just in raw storage capacity. Although a synapse is much more sophisticated than a single weight — as I was telling you, its complexity is more like a dynamical system — so you have to add another factor of ten or something.
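An order-of-magnitude check of the numbers in this exchange, using the round figures from the conversation rather than precise anatomy:

```python
# Weights versus synapses, in round numbers.
neurons = 1e11                      # ~10^11 neurons
synapses_each = 1e4                 # ~10^4 connections per neuron
total_synapses = neurons * synapses_each        # ~10^15 ("not quite that high": 10^14-10^15)

gpt3_weights = 175e9                # GPT-3
trillion_weight_model = 1e12        # the largest networks mentioned
print(total_synapses / trillion_weight_model)   # ~1000x more synapses than weights
print(total_synapses / 1000)                    # rough synapses per cm^2 of ~1000 cm^2 of cortex
```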
On the other hand, the clock speed is an awful lot lower in a brain than it is in a computer.

Ah — now, is that a bug or a feature? Here's the thing: digital computers go really fast, and they have buffers, and they can do things faster than we can. But if you have a network that solves problems in real time, you don't need those buffers, or only very little — you just have a bit of working memory. The idea is that the time scale for processing in the brain matches the time scale of the world; you don't have to go faster than that.

Well, "the time scale of the world" is defined by the time scales we think about. There are plenty of things in physics that happen on much shorter time scales than the ones humans are aware of — the speed of light is very fast.

Oh, sure — and things are happening at synapses at the microsecond level. But I'm talking about the behavior of the whole brain: 100 milliseconds to recognize something, with a lot of things flying around inside to do that.

But even that — we're talking milliseconds: the fast synapses have time constants of a millisecond, and some are much slower.

Okay, but the world works on a millisecond time scale. You catch a ball, and you have to judge things on a millisecond time scale.

I think that's a self-fulfilling loop — it works on millisecond time scales because that's the time scale of our thought processes.

All I'm saying is that it is possible for nature to go much faster, but at great expense. For example, the giant axon in the squid is a millimeter in diameter — the larger the diameter, the faster the action potential travels — because the squid has an escape response: when it's in danger it suddenly activates that nerve, releases a jet of water, and scoots off in a random direction faster than whatever is after it — fish, whales, whatever. And there are other things — for example, your ability to detect the location of a sound. It turns out you can do that because nature can compare time differences between the two ears in the microsecond range. So if nature needs to do that, it can — it can engineer it with the existing circuits — but it only does it when it's really pushed to the wall, where having that capability can save your life or give you something that's important for your survival.
So I think it's a beautiful thing that it works in real time — that's what engineers do too. And again, you can do those fancy fast things, but at great expense, in terms of volume, the amount of material you need, and the energy you have to put in. Nature has taken the low road and figured out the least amount of energy it can get away with: if the problem can be solved with milliseconds, go with milliseconds — or muscles, which are even slower.

You could summarize that as: who needs computers when you have brains?

I didn't say that, but I agree.

So take ChatGPT — what's the next stage? If what we have right now, with a trillion weights, corresponds to roughly a square centimeter of cortex, how big is the total area of the cortex?

About a thousand square centimeters.

So about the size of a dinner napkin. Do other creatures have bigger ones?

They do, but they also have much larger body weights. Brain size scales with roughly the two-thirds power of body weight, because that tracks the surface area of the body, which is where the receptors are — and everything scales according to that, to a first approximation.

I see — so a whale has a lot of nerves coming in from all of its different flippers and so on, and that has to find a place somewhere.

That's the first approximation, and like I say there are exceptions: the primates still scale with body weight, but on a slightly different line in that space.

But okay — to go back to your question, which is where we're headed and what's missing. First of all, I have an arXiv paper coming out next month in Neural Computation called "Large Language Models and the Reverse Turing Test," and if you're curious about the reverse Turing test, take a look — Stephen, you'd pass the reverse Turing test.

Wait a minute — what is the reverse Turing test?

Well, it turns out that different people come away from these models with different conclusions. Some people say, this is amazing, I can't believe it; other people say, it makes mistakes, it's terrible — not like humans, who of course never make mistakes. In the paper I actually go through three different dialogues with three different people, who range in their conclusions about whether they're talking to something incredibly intelligent or something incredibly dumb, with everything in between — there's a huge debate going on; it's reached the public, and people are going off in all directions. But the reverse Turing test is the following. I came to the conclusion that what you get out of it is what you put into it: if you ask intelligent questions, you get intelligent answers, and if you ask stupid questions, you get stupid answers. The reason is that GPT-3 has access to this whole database, which includes the world's smartest people and the world's stupidest people, and it takes on whatever persona it's presented with.

You're saying that if your prompt is characteristic of one type of person, then you're going to be sampling that part of its training set.
Right — it will, in effect, look through its training set and say: is there a person like this out there? Yes, there are a lot of people out there saying stupid things, so that must be what I'm expected to do — to be a stupid person.

Alternatively, it's kind of like an echo chamber: you're saying it's reflecting — mirroring — the quality of your prompt.

It mirrors the answer back with the same level of complexity.

Or — given that it's been able to sample this ocean of what's out there on the web and in books and so on, and assuming you are somewhere in that ocean, then the fish that are around that part of the ocean are what you'll get back from the prompt you give. If you're somewhere that isn't even in the ocean, it's a different story.

Exactly — that's a wonderful metaphor. That's the idea: you have this ocean filled with all kinds of different creatures — really smart ones, really stupid ones, and everything in between — and when GPT gets a query it asks, where in the ocean am I going to go? Where are there discussions like this that I've experienced before? And if it's a really interesting, deep question, where a lot of people have had deep thoughts, it will go there and mine that part of the ocean for its responses.

But I'm curious — apart from its outer loop, it is a feed-forward network, so there are severe limitations on what it can be expected to compute. And if you talk about learning as a universal meta-algorithm, so to speak: one of the things is that we're not particularly good at it, but we do manage to learn things that are non-trivial looping procedures. The learning methods used for GPT-3 and so on are, as we've just been discussing, very classical learning methods in the history of neural networks — so how does one learn loops? How does one learn things that are more complicated than feed-forward kinds of things?

Okay — you brought up the issue of what it is the Transformer is learning. It's basically taught to predict the next word in a sentence; that's it. And once it has created the internal representations — which take advantage, by the way, of anything possible: the semantics, the syntax, all the relationships in terms of higher-order statistics that might be relevant in the text it encounters — then it can be used as a generative network: it can generate new text that is really original. It's not copying — people think, oh, it's cutting and pasting from all those things — no, it would never be able to do that, never in a million years. These networks generalize from the examples they're given, and they're given examples that may never have occurred before anywhere in the world.

Right, but the fact is, if you give it a big piece of a well-known document, it will continue verbatim, because it will have enough information encoded —
— it will basically have successfully encoded in its weights that precise statistical continuation.

Well, I would be surprised if it were perfectly verbatim, but yes, it would be able to replicate something that was well established, well known.

But I'm pretty sure that if you just take a random sentence from a random blog on the internet —

No, I think you're right. I think what happens is that if you're out in the corners of the distribution, where there's very little support and there's only one thing out there, what it produces is going to look very much like that thing; but if you're in a place where there's quite a high population of things, it's going to do all kinds of different things.

That's my intuition too.

It has to compress all the data that's in the high-density regions. And by the way, it's not good at extrapolating — that's one of the problems: how do you go beyond the text you have and deal with out-of-distribution inputs, as it's called? That's a very hot topic in machine learning right now.

But the number of weights is very comparable to the number of tokens of training data.

Well, that's another one of the things we were told back in the 80s: that we were underpowered, that the network would just memorize. Which didn't actually happen, even though we had relatively small databases — we had regularization techniques like weight decay, which helped. But now people have actually found rules for how much data you need, and it's much, much less than the complexity equations from statistics would have predicted — those were based on wrong assumptions about high-dimensional spaces.

Well, my impression is that if you're trying to learn some wiggly mathematical function, you might actually need quite a lot of examples. I suspect there's more regularity in language than people knew, and that's why it's possible to represent it this way.

That's exactly right, and that's what these networks are designed to do: pick up regularities that may not be apparent, which let you compress and then interpolate within the function you have. And the reason deep learning — having many layers — is useful is that the way you do that is by starting with what's happening locally, on small pieces of the data coming in, and then gradually building up larger, more abstract representations as you go up, so it becomes easier to generalize at the higher layers.

Right. But if you stick a probe inside a deep learning network and ask what's going on in there, you get all kinds of complicated stuff coming out — and if you stick a probe inside a brain and ask what's going on, you also get all kinds of complicated stuff coming out.

That's right.

In either case, do you think there can be a kind of human-understandable theory of what you're seeing?

It's too early to tell, but I know that some really smart mathematicians are working on that problem as we speak.
And we already have an indication that there is going to be real insight coming from brain recordings. We just went through the numbers: the brain has billions and billions of synapses and so forth, so you might think that you somehow need all of that high-dimensional space at the same time to be able to process. But we have now been able to sample from, say, a hundred thousand neurons routinely, at the same time, in multiple brain areas, and what we've discovered is that if you have an animal doing a relatively simple discrimination task, making a decision about whether to go right or left, which for an animal is nonetheless important, and they learn how to do it, then in all the cases we have studied so far it looks as if the part of the space of activity over all the neurons that is actually carrying the information you need is a very low-dimensional subspace out of the millions and billions of dimensions out there. On any given trial you're typically only using four to six dimensions. So embedded in this very high-dimensional space are these little tunnels that carry your behavior along, and a lot of them are very well greased because you use them over and over again; those are the motor activities that you learn, for example when you play tennis, and that's why you get very accurate, because you can be very precise within that tunnel. But the actual geometry of that space is very difficult to understand, and I'm working with a Fields medalist, Stanislav Smirnov, who has actually gotten interested in that problem; he is very, very smart. Yes, I do know him; one of his other activities is studying pigmentation patterns on various kinds of creatures. He was here just a few weeks ago and gave us a beautiful talk on that; it was a cover article in Nature, a lizard that gets its spots. Right, using cellular automata and things to study it; it's been on my list to drop in in Zurich and see the vivarium, and the actual creatures so to speak. Yes, it's fascinating; the whole story is just amazing. But that's biology. What is not obvious to me is that when you say it's a low-dimensional space, by even describing it in terms of variables and dimensions you're assuming a certain kind of mathematical structure, which is far from obvious, to me at least, really exists there. In other words, if I started talking about Turing machines and what dimensional spaces they live on, it would be a very bizarre way of talking about those essentially computational processes, so it's an interesting claim that you can think about these things in terms of geometrical, continuous mathematics. Well, it was a great surprise for me and for many, many others when people started reporting this. Dave Tank did these experiments at Princeton, and it took an enormous amount of computer time: you record from a hundred thousand neurons, and what you do is reduce the dimensionality of that very high-dimensional space.
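A minimal sketch of the kind of dimensionality reduction being described, using scikit-learn's IncrementalPCA on simulated population activity; the data here are synthetic stand-ins with a known low-dimensional structure, not real recordings, and all the sizes are arbitrary.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)

# Synthetic stand-in for a population recording: many neurons whose
# activity is secretly driven by just a few latent "task" variables.
n_neurons, n_timepoints, n_latent = 1000, 5000, 5
latents = rng.standard_normal((n_timepoints, n_latent))      # low-D trajectory
mixing = rng.standard_normal((n_latent, n_neurons))          # how latents drive neurons
activity = latents @ mixing + 0.1 * rng.standard_normal((n_timepoints, n_neurons))

# Incremental PCA processes the recording in chunks, which matters when the
# real data (say, 100,000 neurons over hours) won't fit in memory at once.
ipca = IncrementalPCA(n_components=10, batch_size=500)
for start in range(0, n_timepoints, 500):
    ipca.partial_fit(activity[start:start + 500])

trajectory = ipca.transform(activity)   # population activity in the reduced space
print("variance explained by the first 10 components:",
      ipca.explained_variance_ratio_.round(3))
# With this synthetic data, about 5 components capture nearly everything,
# echoing the "four to six dimensions per trial" observation above.
```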
Then, as this population shifts in the actual brain, you follow it in this transformed, low-dimensional space as you go through. You have to follow it and do incremental PCAs and so forth; it's very crude, but people are now finding much, much better algorithms to do it. One of the big ironies of machine learning is that the field, NeurIPS and all, was originally inspired by the brain, and now all those algorithms are being used to analyze data from the brain. So we're using machine learning to look at the geometry of these spaces, and it's very difficult to have any intuition about what's happening in them; you really have to use these machine learning tools. Right; I looked at a dimension reduction of how GPT-3 completes sentences, asking whether there's a sort of semantic law of motion you can get from how it moves through that space as it completes things in language. It's a bit of a mess so far, but that doesn't mean there isn't something to be found. It's a mess if you expect it to be something you can imagine. We're really very unimaginative creatures: we live in a three-dimensional space with time, while the brain is living in billions of dimensions, and the properties of high-dimensional spaces are very counterintuitive; a lot of these theorems don't make intuitive sense. In fact the 2012 breakthrough, as I understand it, was in a sense a story of this: people weren't sure whether gradient descent would find that minimum-loss configuration, and if somebody had said, let's go to a higher-dimensional space, most people like me would have said it's going to be more difficult in a higher-dimensional space. But that turned out not to be true. Just the opposite: the bigger the space, the easier it is to find a solution. And what's happening now is very interesting, because once you have that solution you can take the output from this very large network and train up a smaller network to give you the same output; you can distill it, it's called distillation (a minimal sketch appears just after this exchange). But you need the big network to find the solution in the first place, because if you start from the small network you'll never make it; you don't have enough degrees of freedom. And you're also getting training data out of it, and so on; it's the nature of how you need the extra dimensions in order to explore the space efficiently. The other big difference, by the way, is that people back in the 80s were telling us it would never work because you get these local minima, that gradient descent in non-convex spaces is just a non-starter, and that we would be overfitting because we didn't have enough data. That was coming from classical statistics, where you have small models and you're trying to find the optimal set of parameters for that model, a single set of parameters. But if you just say, I don't care how many parameters there are or what their values are, I just want to solve the problem, that throws you into a much higher dimension.
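The distillation step mentioned above can be sketched in a few lines. This is a generic teacher-student setup in PyTorch, with toy layer sizes and random inputs standing in for a real trained network and corpus, not anyone's actual pipeline; the temperature value and architecture are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for the "big network that found the solution" and the small
# network we want to distill it into.
teacher = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 10))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # temperature: softens the teacher's outputs so they carry more information

for step in range(1000):
    x = torch.randn(64, 20)                      # unlabeled inputs are enough
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x) / T, dim=-1)
    student_logprobs = F.log_softmax(student(x) / T, dim=-1)
    # Match the student's distribution to the teacher's softened distribution.
    loss = F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The design point is the one made in the conversation: the student only has to imitate a solution that already exists, which is a much easier optimization problem than finding that solution from scratch in a small weight space.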
And if you repeat the same learning process over and over, you get different networks which have roughly the same performance but which are all different in terms of the weights between the units; you can't map one into the other. That means there's a degeneracy in the solutions: there are many, many possible solutions depending on where you start in that weight space. And that's wonderful, because it means that the bigger the space, the faster you're going to get to one of those locations. Right, but with classical statistics, the failure of people to estimate these things, all these questions about a problem you're trying to solve, maybe playing some game, maybe generating language: the thing that was always confusing to me is that one didn't know how hard those problems were. I remember Gerry Tesauro, you remember him; he worked in the center I had at the university, and he was at the Institute, I think, with you, wasn't he? Yes, he was. He was supposed to work with me, and he came, and I never understood the work he was doing; he worked with you, which is great, but I never understood what he was doing. Well, he did something really amazing; I'll tell you, it's very simple. We wanted to pick a game for which we could show that this approach could outdo humans, because that's how you get people's attention, and we picked backgammon. Backgammon is interesting because it has a very high fan-out ratio and it's probabilistic, because you throw dice. We got a large data set from books, from games that had been published, and we brought in people who were master's level, not grandmaster, but good enthusiasts who had played many, many games; we accumulated a big database and we trained up a feedforward network. Again, we didn't have a large network, so it had only one layer, I think, or maybe two, but it played surprisingly well: not at a high level, but at a good amateur level. It was playing the game, and it had learned how to play it on its own. Now here's what happened next. That's when he was at the Institute; his first job after that was at IBM's Thomas J. Watson lab, and it was there that he made a brilliant shift. We had trained the network to take the board and predict the value of that board, so that you knew which move was going to be the best move. What he did was have the network play itself over and over again. At the beginning it's just random, and eventually one side wins the game just by random chance, but every time one side wins, all the weights in that value function are updated, and it keeps going back and forth. He used temporal-difference learning, which is a reinforcement-learning algorithm, for updating the weights, with backprop for the value function, and ultimately the whole thing got better and better and better, to the point where there was nobody there who could beat it.
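A toy version of that self-play loop, temporal-difference updates to a value function with moves chosen greedily against it, might look like the following. The game here is a trivial dice race, not backgammon, the value function is a lookup table rather than the neural network (trained by backprop) that TD-Gammon used, and all the constants are arbitrary; it is only meant to show the shape of the algorithm.

```python
import random
from collections import defaultdict

# Toy stand-in for the backgammon setup: a dice race to square 10.
# Each turn you roll two dice and choose ONE of them to advance by.
# V[(my_pos, opp_pos)] estimates the probability that the player to move wins.
GOAL, ALPHA = 10, 0.1
V = defaultdict(lambda: 0.5)

def afterstate_value(my_pos, opp_pos, step):
    new_pos = min(my_pos + step, GOAL)
    if new_pos == GOAL:
        return 1.0                      # winning move
    return 1.0 - V[(opp_pos, new_pos)]  # opponent moves next

def play_one_game():
    pos, player, history = [0, 0], 0, []
    while True:
        state = (pos[player], pos[1 - player])
        history.append((player, state))
        d1, d2 = random.randint(1, 6), random.randint(1, 6)
        # Greedy self-play: pick the die whose afterstate looks best under V.
        step = max((d1, d2), key=lambda d: afterstate_value(*state, d))
        pos[player] = min(pos[player] + step, GOAL)
        if pos[player] == GOAL:
            return history, player
        player = 1 - player

def td_update(history, winner):
    # TD-style sweep: nudge each visited state toward the value of the next
    # state this player saw, with the final outcome anchoring the end.
    for i, (player, state) in enumerate(history):
        if i + 2 < len(history):
            target = V[history[i + 2][1]]
        else:
            target = 1.0 if player == winner else 0.0
        V[state] += ALPHA * (target - V[state])

for _ in range(20000):
    hist, winner = play_one_game()
    td_update(hist, winner)

print("Estimated win probability for the first player:", round(V[(0, 0)], 2))
```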
So he had to bring in a grandmaster from New York, Robertie, I think, is his name, who wrote books on the best moves for backgammon players, and he came in and played it. He said, first of all, this is better than any program I've ever played before; it's really playing at a very high level. It won a significant number of the games, I don't know exactly, maybe 34 or something, and he said it made moves that I've never seen anybody make before. He took them back to New York and played them out, and it turned out they were better than the human moves. So it was creative; it was actually able to discover new moves that no human had ever thought of before. And then, because the computers were getting faster and it was playing more and more games, a million games, two million games, he came back and said this has now reached not just grandmaster level, it's probably at world-champion level, and if it played the world champion it might even win. So that was an achievement, and it was overshadowed when Go, using exactly the same self-play and reinforcement learning but much more computing power, was able to beat the world champion at Go; that got the world's attention. I think Gerry Tesauro deserves much more credit for doing the equivalent thing with the computing power that was available to him at the time; he basically figured out how to make progress on games, and it's really played out well: as the networks get bigger, with more computing, you can play more complex games. In fact it got to the point where DeepMind was able to bootstrap itself up without even knowing the rules of the game; I think it reached a level where it beat AlphaGo something like a hundred games to zero. That one was called AlphaZero, actually. Right. For me it was interesting: I chalk up my complete failure to understand the significance of Gerry Tesauro's work back in the 1987-ish time frame to the fact that I had no idea how difficult playing backgammon is as a computational problem. I remember happening to see the first public demo of DeepMind playing 8-bit video games; it was a fairly small event, and somebody turned to me and said, well, how impressed are you by this, and I said, I have no idea, because I don't know how hard this problem is. Which is the same kind of statement the statistics folks were making to you: if you do all this stuff you're going to overfit, and so on. I think the difficulty is that for these human-like tasks it's very hard, at least so far as I know, to really have an estimate of how difficult the task is. We have measures of difficulty for combinatorial problems, computational complexity theory, NP problems, NP-completeness, all those kinds of things, but for these more human-like problems it's not obvious what the human version of computational complexity theory is. Well, in some sense, if you have a network that solves the problem, you know the complexity can't be more than that, or than whatever resources you had to use. The question is what "solve the problem" means, because for example with ChatGPT, if the problem is to write a high-school-level essay, that's again a very fuzzy threshold for what it means to solve the problem. There you go; in other words, this is always the problem in AI:
as soon as you solve a problem that couldn't be solved before, people say, well, that's not AI, that's not general intelligence. Eventually they'll stop saying that; we've been through so many iterations of that in the last fifty years that I have to believe it's eventually right. But your point is, what's the metric, and how do we know how complex problems are? I think we're beginning to understand them, but the metric we're going to use might have to do with information theory or some... Sorry, but I don't think so. This is presumably the mistake people made in thinking about language: they were thinking about Shannon-style, Markov-chain-type models, and that's not really a good representation of all the possible sentences one can say; of all the combinations, only some are meaningful. Okay, I think you're right, and here's the insight. If you look at Go, it played itself a hundred million times, probably, who knows, and if you add up all the game positions in a hundred million games it's maybe a hundred billion, I don't know; but the total number of possible board positions is some googol-plus, ten to the ten to the tenth or something. It's so gigantic that the number of training examples it was given was infinitesimal, measure zero on the set of all possible positions. But the point is that under all the circumstances in which humans play, those random positions never occur, would never have occurred; so somehow in this very small subset there are regularities that represent the true complexity of the problem, not all the possible game positions. And intuitively it's the same thing if I look at all possible images: if you have a megapixel and each pixel has ten gray levels, that's ten to the millionth possible images. Ten to the million, roughly, yes. It's a huge number, and the fact is that the images that actually occur in the real world are a very small fraction of all of those. Right; the actual physics that we perceive is a tiny fraction of all the possible images; the images, the physics, they're all tiny slices of what could be. And characterizing those slices, characterizing the contours of what is actually possible... That's right, and that's geometry: the geometry of all the images from the real world. Figuring out what's common to all of those images is what learning does for you; it pulls out something that is essential about real images, from physics, generally. The very assumption that there is geometry assumes things about space and all those kinds of things. We should probably... I'm being reminded we've been yakking on; we almost made it through the whole history of the last hundred years or so. But you haven't asked me about the future! Well, maybe we'll have to do that another time. We will. Okay. I think there might be people here with other questions. There's a question here about Dennis Gabor's 1959... what was that, a '59 paper, was it?
Yes: "Electronic Inventions and Their Impact on Civilization", the first proposal of random circuits with weight tuning to learn a black-box function. I think Gabor had a bunch of stuff that came from the optical holography tradition. That's a very interesting paper, and also very relevant, because remember I told you that the original perceptron, Frank Rosenblatt's perceptron, could only do linearly separable functions. Well, it turns out he knew that, and so what he did was put another layer of weights between the input and the perceptron units and make them random, random projections, and that allowed it to solve more complex problems. What he did there is what we now call compressed sensing: you can map the original space into a different space in which, because the features you've created are random features, those random features can now be weighted in a way that solves a more complex problem. And there's a beautiful theory behind this, the Johnson-Lindenstrauss theorem, very beautiful, which underlies all of these compressed-sensing algorithms, and interestingly there's evidence that it's used in the brain, because there are parts of the brain that have that character in terms of random projections. Huh, okay, so the thing grows randomly, and that randomness means it's kind of seeing through random glasses, so to speak. The thing I don't think Gabor had was the idea of learning algorithms. The closest thing that comes to mind is something called reservoir computing, where you start with a random network that has random activity patterns and you train only the output layer to do some function; if you want to replicate some function you can do that, and it's amazing how well it works (a small sketch of this random-features idea follows this exchange). So what would be our expectation if I walked around wearing randomizing glasses, where every pixel of input is randomized before it gets to my eyes; would I be able to untangle that? Well, your visual system makes the assumption that it's not random, and at the very first layer that projects to the brain, the ganglion cells are taking averages over small regions of the visual field, so if the pixels that should be next to each other are not, I think you'd end up with nothing; I think you'd just end up with garbage. Right, well, maybe if you did it early, you know, in early development. Okay, modulo learning; in other words it might be learned: you might learn something at the highest level, where it looks random at the input but once you get to a higher level, where the receptive fields are much bigger, maybe it's able to pick things up. That's possible. You should do the experiment: put on randomizing glasses and see how long it takes. Right, let's see how that works. People have done the experiment where they flip the image and turn it upside down, and that screws people up, but interestingly you can adapt to it.
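As a small illustration of that random-projection idea, in the spirit of Rosenblatt's random layer and of reservoir computing (this static version is closer to what is now often called random features), here is a hedged sketch: the ring-shaped toy data, the feature count, and the ridge penalty are all arbitrary choices for illustration, not anything from the conversation.

```python
import numpy as np

rng = np.random.default_rng(0)

# A problem a single-layer perceptron can't solve: two concentric rings,
# which are not linearly separable in the raw 2-D input.
n = 400
radius = np.r_[rng.uniform(0.0, 1.0, n // 2), rng.uniform(1.5, 2.5, n // 2)]
angle = rng.uniform(0, 2 * np.pi, n)
X = np.c_[radius * np.cos(angle), radius * np.sin(angle)]
y = np.r_[np.zeros(n // 2), np.ones(n // 2)]

# Rosenblatt-style fix: a FIXED layer of random projections followed by a
# nonlinearity. Only the readout weights on top of these random features
# are trained (here by ridge-regularized least squares).
n_features = 200
W, b = rng.standard_normal((2, n_features)), rng.uniform(-1, 1, n_features)
H = np.tanh(X @ W + b)                      # random nonlinear features

lam = 1e-2
readout = np.linalg.solve(H.T @ H + lam * np.eye(n_features), H.T @ y)

accuracy = np.mean(((H @ readout) > 0.5) == y)
print(f"training accuracy with random features + linear readout: {accuracy:.2f}")
```

The random layer is never trained; it only scatters the inputs into a higher-dimensional space where a simple linear readout can do the job, which is the point being made about both the perceptron fix and reservoir computing.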
You have a couple of other questions here. One is about Robert Hecht-Nielsen's work, which I never understood; I knew him, and I'm sure you did too. Yes, he was in the engineering department here at UCSD, and he also had a company, HNC, which was an early neural-network company doing credit-card fraud detection. Right, that was a big early AI application; in fact my first company ended up, after I was no longer much involved with it, building what became one of the standard systems for that. He was an early adopter, but he was also very charismatic, and he had a big influence on the early neural-network movement because he was so enthusiastic, a true believer. So what kind of neural nets was he using? I never quite got that straight; it was always a bit mysterious. Well, if you pressed him he would say this is under patent, or he has to patent it, or something. I've often wondered whether there was something there that I just didn't understand. I've known several people who actually did have something but couldn't explain it to me, yet were able to make things work. Mark Tilden, who builds robots for WowWee, is an engineer who builds things out of parts that shouldn't work but do: motors, for example, that change resistance when they run into a block and can't turn, and that change sets the robot off in a different direction. It's not programmed; it just happens because the motor's resistance has changed and the circuit flips. But that's another era, another area. If anybody out there has actually read Hecht-Nielsen's papers and understands them, I'd love to talk to them, because I've never had enough energy to do it myself. There's a question here about the Human Brain Project; there are various brain-simulation projects, and I'm not quite sure which one this is. That was the European Union project, a billion euros over ten years, and it was the brainchild of Henry Markram, an old friend of mine, a very good neurophysiologist, but also very charismatic, who convinced the European Union that in ten years he was going to build a human brain. I remember meeting these people, and they were talking about simulating a piece of a mouse hippocampus or some such thing, which was not very interesting. That was their first step; they wanted to start small, but they eventually wanted to create a simulation of the whole cortex. Okay, so how far did they get? There's a movie out, a documentary filmed over time, which shows how when they started they had these grand goals and then gradually began to realize how difficult the problem was. Henry started out with the Blue Brain Project, which was to simulate neural circuits, and his idea was, I just scale this up: I've got it working for the small network, I just have to scale it up and I can create a network of the whole brain. My response was: Henry, let's just suppose, and it's a difficult problem, but suppose you solve it; suppose you build it, and he believes in very, very accurate reconstructions of real neurons, not just units with no internal structure.
Suppose you do that, and you turn it on, and suppose that, lo and behold, it works and you can talk to it and so forth: what have you learned? You haven't learned anything; it's as mysterious as the brain. Right, is that really the right goal? I remember when people were doing artificial evolution back in the early 80s, and they would observe, you know, this is a simulation of some rainforest system or some such thing, but it was an incredibly complicated setup; same thing: you can observe it, but it's not obvious how you extract the scientific narrative from it. Let's see, the questions here. Zav is asking: Terry mentioned that current models are about three or four orders of magnitude less compute than the brain; how much algorithmic development, beyond back-propagation and so on, does he estimate we still need? In other words, I think the question being asked is: do we know the correct algorithms, and it's just a matter of scaling up the amount of compute, or do we not yet know the correct algorithms to reproduce the important aspects of the brain? So my view is that there are many, many learning algorithms out there, and if you give them the constraints that we have in the brain, in terms of the units and the connectivity patterns, they will probably all ultimately converge to the same small set of classes of solutions; backprop basically just takes us there faster. In other words, humans require a whole development of the brain that goes through many stages and takes decades of experience to produce an adult brain, and learning is going on during that entire time; and you can continue to learn even as an adult, maybe not as fast or as much. So my goal is to figure out how that is done, even though it's slower and may take more resources, because if we understand that, then I think we'll actually have a chance to accomplish Henry Markram's goal. And then it really will get interesting, because instead of feeding data in with a shovel, we're going to have these very large-scale networks out in the world, absorbing data from the world with vision and audition just as we do, and the brain, the internal model, will develop just as ours does, if we've understood it properly. That's where we'll end up if you really want to understand how humans work, because we should be able to analyze that process in the computer much better than we can in the brain: we don't have access to all the activity in all the neurons of a whole human brain, but we do have access to all the units in GPT-3, so in principle, if we can't figure out how that works, it's going to be really tough; but I think it should be possible. Well, the question is whether there is a narrative scientific story of how it works, because just saying you run the program and that's how it works is not what we've become used to as success in science. I think it's too early to know whether there is a simple narrative.
There are some people... I remember even when I was young, people were saying the brain is too complicated for the brain to understand itself, that it's not possible. I've forgotten what the logic was; it had something to do with self-reference: the complexity needed to understand it is much greater than the brain itself, so it's not going to fit in the brain. I think that's a bunch of hooey. And Stephen, you've written very eloquently on this, the idea of physical laws being a compression of all of the complexity in the world. Right, but look, I think the issue is, and we're now getting off into quite a different topic, and we probably should wrap up in a moment, so we'll reserve that for another time, along with talking about the future. Let's see, there's a question here from "it's all gone wrong" about how neural nets can solve problems in ways that aren't intuitive to humans, and ending up with all kinds of issues. I think one of the things that is interesting and surprising to me is that you make these neural-net systems and you ask them to extrapolate, or even interpolate, and the fact that they do that in a human-like way is not obvious; it's telling one something scientifically quite important. The fact that it extrapolates a cat image as being a cat image, more or less the way humans do, is quite non-trivial. Presumably the reason it works that way is that architecturally the same type of thing is going on in the brain as in the neural net; and for the aliens, so to speak, whose version of optical character recognition might be very different from ours, their notion of what counts as a small change might be very different too. Yes; there was this wonderful movie in which aliens came and we couldn't communicate with them because they were using some kind of cloud-like things. Inception, I think? No, not Inception, what was it called... Arrival, yes. They had a sort of visual, nonlinear kind of communication. Right, it was conceptually alien, literally alien, even if it was actually made in Photoshop; it was very nicely drawn. I think we have a number of questions here about future kinds of things, and I think we should probably reserve the future for another time, because we could go on talking about the future forever; the future is limitless, whereas history at least has a finite bound, and even so we've only touched on it. By the way, this is something that I lived through, and you might be getting a slightly different, or maybe very different, account from someone else, but I think we're converging. What you just brought up, the things we didn't anticipate: I did not in any way anticipate what these neural networks would be able to do; I had no idea how they would extrapolate and scale. Nor did I; it was a big surprise. And in fact it continues to surprise me.
There are new things coming up every year; it seems like there's some breakthrough every year, but we don't know; it's like we're in the middle of it, and we'll see. Right, but you know, I didn't really believe in neural nets for a long time. The things people were writing just seemed very complicated and very arbitrary, all these different kinds of methods, and it didn't seem to be a coherent scientific set of things to be talking about. So the fact that the engineering has suddenly started to work has been surprising, and I'm curious, for somebody like you: the thread of belief in your case is probably linked to the fact that you're interested in how brains work; had you been like me, purely interested in the technological side of it, maybe you wouldn't have continued to be a believer. Okay, here is my belief and my intuition, which I think Geoff Hinton shared: the only existence proof that any of these problems, vision, intelligence, could be solved was the fact that nature had somehow solved them. So why not look under the hood and try to extract some general principles, massive parallelism, high connectivity, learning, basic principles, and see how far we can get? That seemed obvious to me back then; of course I didn't know how long it would take, but it was just a matter of believing in nature. But there's a certain sense in which there's molecular-scale stuff going on in the brain; people say, oh, we can only explain brains by thinking about quantum mechanics, all these kinds of things; you don't know how low you're going to have to go. And I think one of the things that's really very interesting about ChatGPT is that it tells you that a large chunk of language, and language-related thinking, really can be done at this pure computational level; there's no mysterious infrastructure underneath. Right, and I think that's an interesting observation. Language is very recent, only within the last hundred thousand years, so a language organ couldn't have evolved, which is what Chomsky believes; it had to have used all the machinery that was already there, maybe using it in a different way, and it's all about sensory-motor loops and control. I already told you about the basal ganglia playing a role in sequences of actions, so the brain was already set up for it, and because language maps so well onto the machinery that was already there, language had to be using principles similar to the ones that, say, vision or auditory processing were using. Right. By the way, is it the case that the timing of words coming out, if you look at this cortex-basal ganglia loop, is something you can measure, the time it takes? Yes, it takes about a hundred milliseconds to go from the cortex to the basal ganglia and back to the cortex again. And how many words do we say in a hundred milliseconds? We say a word or two in a hundred milliseconds; it depends on the words, but I'm thinking it's about ten words a second.
I think that's pretty fast. So you're saying every time we do one of those loops we make a word? No, I don't think it's a word; it could be just a phoneme, maybe just a sound, like a token in ChatGPT. Like it's making one token, just like it does. Right, maybe at that level. You think it's plausible that the outer loop of human language is that loop, the same type of thing? I can't imagine it being anything else; it's what we use for doing things like sports, and very complex thinking. Okay, so let's assume you're right, and I'm sure it's true, that language is simply recruiting existing machinery of the brain. The question is this: you say language was invented in the last hundred thousand years, and one could discuss whether other critters have some form of language, but let's assume it was invented fairly recently, in us so to speak. What other things could we recruit from other parts of the brain to have quite different modalities of action? Language has proved pretty useful as a way of communicating abstract ideas from generation to generation, so the question would be, if we try to imagine the aliens that have the same basic brain architecture but have recruited, have invented, two more general ideas like that, do we have a clue what those general ideas might be? Well, now you're talking about the future. But here's something to keep in mind: language is imperfect, it's very slippery. These concepts like intelligence or understanding that people are arguing about, saying, oh, GPT-3 can't possibly understand what it's talking about, it just sounds like it understands what it's talking about: what does understanding mean? How do you define it, how do you measure it? Well, for me, because I've spent my life building computational language, it's this: you have this thing; can it be evaluated, can you compute from it? That is an operational definition of having a representation of something from which you can actually compute. That's not what we do in ordinary language, where we're just building pieces on top of what we have said; but if you're thinking about computational language, where you've actually taken the utterance and converted it into something the computer can go off and execute, that is an operational definition of understanding. It's not the one we mostly use, but it is an operational definition. That's a very good one, and there are probably many other definitions that other people use; that's why they get into arguments, because people have different definitions when they're trying to understand something themselves. Well, I think the point you're making, in effect, is that what we do in computational language, the way we evaluate and execute things from it, is something brains actually don't do. And I suppose that's an answer to the question of, if we invented language in the last hundred thousand years, what else could we invent; and the question is whether the architecture of the brain, the way the architecture of computers does, supports executing computational languages.
The architecture of the brain does support that; but yes, now we are indeed talking about the future, and we could go on for many more hours, so let's plan to get together again sometime. Yes, indeed. And did we miss anything, in terms of the history? You talk about other perspectives, but from what I know of this history, some of which I know through you and some independently, I think we hit most of the big topics that I'm aware of. There are many, many side pathways, and many, many more people who should be given credit. I have to say I've focused on people I know and people I've worked with, but there are so many other people out there; in fact I can't keep up with the literature anymore, and what's coming out on the arXiv is just overwhelming. Right; I mean, this is a field which has gone from a tiny group of people during its lean period, from tens of people, to maybe a hundred thousand people or something. I was talking to Eugene Wigner once, who taught me statistical mechanics when I was at Princeton, and I asked him what it was like to be there in Europe when quantum mechanics was being discovered. He said, well, first of all, back then physicists didn't have the kind of prestige they have now; very few jobs were available at universities, it wasn't very lucrative, and his father told him he should get a job in chemical engineering, which he did. And he said the number of theoretical physicists back then was so small that you could know them all; you would write letters back and forth, and that was how the whole thing got off the ground. Maybe that's the way all these things get off the ground: you don't start with a mass of people, you start with a small group that actually believes in what they're doing and is willing to give up prestige, or money, to do it. I think this is a weird case, a bit unusual, because as we were mentioning earlier there were all these separate little pockets, and I think that's not so common; in many of these things the tree grows from one trunk, so to speak. Ah, that's an interesting observation. In quantum mechanics there weren't multiple trunks; it was not as if there were different pockets of things going on, it was pretty much just one. But with neural nets, what's strange is that the original McCulloch-Pitts-style initiation is the seed from which all the other stuff grew, and yet there were all these separate branches: the same seed, but not a single trunk from it. I think that's unusual, and the time scale from the initial seed to fruition is also extremely long, many generations. Okay, so one possible explanation for that, and I agree it was very diverse, is that you have a much bigger search space to explore.
Many groups were going off and searching different parts of computation space, and so forth, and it took a long time for all of those to be unified, at large scale, into what's called machine learning. For a long time NeurIPS was the biggest and most important machine learning conference; now it's been rebranded as an AI meeting, although what we're talking about is still neural networks, which is what we started with. But it's gotten much richer and much more sophisticated in terms of the computational understanding of graphical models and things. Still, it is so bizarre that the original neural-net idea from, whatever it is, 85 years ago or something, is still what's running inside ChatGPT: lots and lots of copies of it. Yes, that's shocking, shocking that it could have been built from simple ideas that were floating around back then and coalescing. But by the way, the amount of resources that went into these large language models is staggering. Yes, that's a whole different thing: ten million dollars to train a single network. Oh my goodness. Right. We'll see; maybe the brain has figured out much better ways to do it. All right, well, we should wrap up here. Thank you all, and thank you, Terry. You're welcome; it was an interesting conversation. And thanks to the folks who've been listening here; we shall see you all another time, another time when we meet and talk about the future. Back to the Future, indeed.
Info
Channel: Wolfram
Views: 37,402
Id: XKC-4Tosdd8
Length: 185min 39sec (11139 seconds)
Published: Tue Feb 14 2023