Greg Yang | Large N Limits: Random Matrices & Neural Networks | The Cartesian Cafe w/ Timothy Nguyen

Captions
all right, welcome everyone to the Cartesian Cafe. We're very lucky to have Greg Yang with us here today. Greg Yang is a mathematician and AI researcher at Microsoft Research who, for the past several years, has done some incredibly original theoretical work in the understanding of large artificial neural networks. Greg received his bachelor's in mathematics from Harvard University in 2018, and while there won the Hoopes Prize for best undergraduate thesis. He also received an honorable mention for the Morgan Prize for outstanding research in mathematics by an undergraduate student in 2018, and was an invited speaker at the International Congress of Chinese Mathematicians in 2019. Welcome, Greg, how are you doing today? Thanks Tim, it's a great opportunity to be here. Great, I'm very happy to have you here. So we were lucky to catch up earlier in the year over ramen in SF when you were visiting the Bay Area, and we had quite some interesting conversations. Why don't you explain to our audience how you got into math and some of the hiatuses you had along the way? Yeah, sure. So growing up I just did a lot of math. Before I was 12 I was in China — I was born in China and grew up there for some time — and when I went to elementary school there, my mom would buy me these Olympiad mathematics books and just give them to me to read and solve problems on my own. When I came over to the States I also participated in math competitions, so I kind of had a history with math even before college. Then when I went to Harvard I took the Math 55 class my freshman year, which some of the audience might know as this infamous freshman math class — supposedly the hardest in the United States or something; it has its own Wikipedia page. I guess it got its fame from people like Bill Gates taking it and saying, okay, I took it and realized I'm not that good at math. But that was pretty fun. So I did math at Harvard through my freshman and sophomore years, and then after that I actually decided to take a leave of absence from Harvard. At the time it was because I wanted to become a DJ and a producer of electronic music. So there's a parallel history here, which is that in high school I got into producing music — at the time I was doing more rock music and such — but as I got to college I had more contact with electronic music, and over time I made my own tracks. At that point in time I just wanted to try something new. It's not a Harvard thing; it's just that you've been in that hamster wheel of life for the past 18, 20 years, and it feels kind of boring, right? Let's try something new, something unexpected, something surprising. And I was really getting into the music at the time too. So I just decided to take time off, and this of course also leveraged Harvard's really generous policy on taking a leave of absence. I'm not sure how far back this policy dates, but I wouldn't be surprised if
this was also in place when Bill Gates and Mark Zuckerberg took time off. Essentially you can take any amount of time off and come back any time you want, which is really nice because you don't have to worry — you have a safety net, in some sense. You can try something really risky, and if it doesn't work out, just come back to school and finish up. So I decided I was going to take some time off and make some sick beats, try to play some big shows, stuff like that. So around 2012 or so I went home for a bit, also stayed in the Boston area for a bit, and worked on music. That was kind of my music career at the time. But the unintentional side effect of all this is that — I mentioned this hamster wheel of life — it gave me a break from the hamster wheel, and I really started to understand myself more. When you're in that hamster wheel you're always pedaling and you don't have time to stop and think, but with that break, the unintentional positive side effect was that I finally had time to understand myself: what I'm really interested in, what I like, how my brain and my body work, things like that. I read a bunch of books — quantum mechanics, physics, mathematics, a bunch of random things. So in summary, this period — I was about 20 — really changed who I am, in the sense that the me today is essentially the same person as the person after I took my leave of absence, but that person is a delta jump away from the person a year before. In terms of my self-identity, that period was actually quite crucial. Just a quick question: did you have a lack of clarity going into Harvard, or did that emerge once you matriculated? I wouldn't say it was a lack of clarity; I was just like everybody else. I'm sure most people, when they go into college, don't really know what they're doing. They know they're good at doing things, which is why they got into Harvard, or Stanford, these kinds of schools, but I doubt most people have a very clear vision of what they want to do — other than maybe the pre-med people, who are set on spending the next ten-plus years on medicine. I don't think I was more unclear than other people; it's just that that period really gave me this clarity, this focus, this purpose in life. I haven't gotten to this yet, but essentially during that period I realized a couple of things, and one was that I really wanted to make AGI happen — strong AI. Wow, and this is 2012,
when you had these ideas? OK, that's pretty early to have those thoughts. Well, that was when, I guess, deep learning was starting to blow up, but I wasn't aware of deep learning at the time. At the time a lot of these futuristic things really appealed to me. Making AGI is one thing: making something that is much, much smarter than yourself sounds like such an attractive idea that I just decided, okay, this is something I want to make happen, or at least try. Among other things, something else I found really interesting was, you know, is it possible to make humans immortal? I want to live for a long time and see how the future goes, but that doesn't sound like something where I could be a big X factor in that line of research, because it involves a lot of experimentation with biological materials, and there's a lot of red tape around it. It doesn't seem like it needs a lot of math, either. OK — I didn't actually realize that your thoughts were so all over the place, in some sense, because my impression of people going to Harvard and taking the hardest math track is that they were stellar math students when they were applying to Harvard and they just keep going in that line. So to what extent — I guess you were good at math when you applied — did you even know you wanted to study math when you applied to Harvard? I would say your characterization is by and large not correct. In absolute numbers, there are definitely — I think there's some kind of psychological bias, because when you look at people who win the Fields Medal or whatever, they would have gone through that track, but they're only a small number out of the total number of people who take math courses at Harvard and major in math. By and large, people majoring — or concentrating, which is the Harvard terminology — in math go into other fields: finance is a popular one, tech is a popular one. Only a very small number of people want, and have the capability, to do a PhD and follow the math career, because, to be completely fair, it's a very hard career. Oh sure, sure — sorry, maybe we're talking past each other. I meant at the high-school-to-undergrad level: Harvard is an elite institution with an elite math program, and my impression is that if you're going to go that route, then you had some certainty about your interest in math. But in your case, because you ended up taking time off and had all these thoughts about artificial intelligence and aging, it strikes me as a bit unusual, that's all. So I was trying to figure out: did you already have all those interests before going to Harvard, or did you go to Harvard and then rethink a little bit? Yeah, I think going to Harvard I was open to everything. I wasn't that set on doing math for life or anything like that. I knew math is a powerful
tool, but it definitely wasn't that I was set on being a professional mathematician. And I don't feel like most other people in the math department are the way you're describing, either; I feel like they're more like me. Maybe just to clarify: at Harvard, when you enter — at least at the time, and I think it's still the same now — you don't actually choose a major. They let you explore for about a year and a half before they ask you to decide on a major. So you might be accepted based on your math capabilities, but you're not forced to choose a path until you've explored for a sufficient amount of time. So the people I interacted with were all good at math, and some were definitely thinking about the professional-mathematician route, but most people were thinking about all kinds of different opportunities in life. I see. So actually, if I recall, you had two hiatuses, right? One to go further into your music career, and another one for math. Let's wrap up the first one: you wanted to go into your music career — where did that end? Yeah, so I essentially produced a bunch of tracks, and then, like I said, I read more and more about artificial intelligence and started spending more time on that, so I ended up slowly decreasing the amount of time I spent on music. I did put out some nice tracks, though. Are they publicly available? They are, but I need to find them. OK, all right, we could post that later. Great. But it's not like I played Coachella or anything; it was a very, very tiny career in music. But during that time, yeah, I realized I wanted to make AGI happen, and there were two other realizations that are kind of synergistic with this. The other thing is, before, I knew I was good at math, but after taking a break I really realized that I love it — I fucking love it — so you can censor this out if you want. No man, you're not the first person to cuss on here. OK, OK. But yeah, I love math. In psychology they talk about intrinsic and extrinsic motivation, or reward, and this is a case where, when you're in a hamster wheel for so long, extrinsic reward overpowers your intrinsic motivation for a long time; but taking a break — and not worrying about what anybody else thinks or what society demands of you — really brings out that intrinsic motivation. So I realized I really love this. That's the second realization, after the AGI thing. And the third thing that ties these together is that I also recognized that math is really the fundamental language for essentially anything in science, and very likely, to make good advances in anything, including AI, you need a very solid foundation in math. And I learn very quickly as well, so I essentially decided to read everything in math and
at least know all the basics of the major branches of math. So was this during your second hiatus, or the first? Oh, this already started pretty early on, around 2012-ish. So I decided to just read everything from the beginning, in a sense: I'm going to start with set theory, read set theory and understand all the basics — the axioms and whatnot — and then go up from there: the basics like algebra, then abstract algebra, and so on and so forth, geometry, analysis — you just build your way up, all the way to category theory, harmonic analysis, whatever — all the main branches of math we know about today — just to gain a good foundation so that I don't have blind spots, so to speak. It's kind of excessive in some sense, because when you want to do research in any particular field, you don't need all that other extraneous knowledge, perhaps. But in my view there's a lot of synergy between all these different branches of math, and seeing everything that humanity knows — which I don't claim to have done, but I try to go toward that goal — really lets you see things from an eagle-eye point of view. As a comparison, I think it's not inaccurate to say that a lot of people in, say, machine learning would be kind of afraid, or would stop in their tracks, if they saw complicated math; they'd be turned away, in some sense, and that prevents you from doing things you could do if you had more knowledge. So, just as a post-hoc evaluation of my choices: I normally don't have this problem. When I do research, if I see complicated math, I just look at it, and I'll be able to understand it much better than if I didn't have the background. Sure. OK, so you had some self-study, you went back to Harvard, and then you took some time off again? Yeah, right. So I went back to Harvard for a semester, just because I wanted to see the friends I'd made at Harvard before they finished up — this would have been my last semester if I had continued without a break — and then I took some time off again, to really drill through this goal of reading all the foundations. So the first time I took a year and a half off and went back for a semester, and then the second time I took two years off, and this was purely — there's no music in this case — purely reading and learning as much as possible, as fast as possible. That's interesting, because you're at Harvard, one of the most elite institutions; clearly you have much to learn from your top-notch professors and peers, and nevertheless you took two years off for self-study. Can you explain your motivations for doing that? Yeah, right. The way I see it, which I think is not that extraordinary, is that for really foundational material there are really good textbooks that are battle-
tested over years and years of teaching by professors and of people reading them, and on average they're probably much better quality than some professor's instruction on a particular day. There are different factors to this: one, if you can read very fast, you can read books much faster than listening to verbal instruction. Second, an instructor is a person, so there's variance in teaching quality, even if they're very good. And, to be honest, I'm sure a lot of people have this experience: when I take math courses, a lot of the time I skip the lectures and just read the book anyway — it doesn't really make a difference. So, in some sense, you're naturally an autodidact. There are many reasons why one could be an autodidact, but basically you don't have a problem learning on your own — I have this quality as well — and somehow, when you have the motivation, you can go at your own rate and customize to yourself, in a way that a teacher who has to cater to an entire classroom can't. Right, maybe that's the point. And this is especially true for all the knowledge we've already had for many years. I wasn't learning super cutting-edge stuff; for that — if you want to learn the most advanced stuff in, say, higher topos theory today — you probably need to talk to Jacob Lurie or something. But if you want to know things we've known for 50 years, there's no particular reason you can't just read books. OK, fair enough. Anyway, so you took your two years off, came back, and I guess you were probably well ahead of your peers, presumably — and then what happened? Sorry, what happened after what? You took two years off; now you're this math buff, I guess, after studying on your own for two years; you come back, and then what happens? Yeah, right. So at the time there was no ambitious plan for the next ten years or anything; I just knew this was probably valuable, so let me spend time reading math, and whatever happens next will just happen — I'll see where it goes. But the way it worked out in the end is that I came back to school — I was in the math department at the time; I declared my major to be math — and I was randomly assigned an academic advisor, which for undergraduates usually means they sign some papers and don't talk to you for the rest of it. But I got assigned to Shing-Tung Yau, who people must be familiar with from Calabi-Yau manifolds. From this I didn't expect too much — I had no expectation of talking to him much — but somehow we did get to talking, and over multiple interactions he learned about some of the papers I wrote when I was on my
time off. In particular, I wrote this one paper on a surprising connection between algebraic topology and computational learning theory. In a gist, without going into too much detail: the VC dimension, which is a very fundamental notion of the hardness or complexity of a function class — in terms of learning, how hard it is to learn a hidden function — is captured by the homology of a natural space associated to the function class. OK, so that's a one-sentence summary; but going back to the story: he really liked that paper. He never explained why, but he just started inviting me to parties and events and things like that — I figured, OK, he must like my papers, I guess — and he arranged for me to meet visiting researchers and so on. So we became really good friends, and then, toward the end of the semester, when I was trying to figure out what I should do next, we had a conversation. I told him I was interviewing at Google at the time, and he said, oh no, no, you should go to Microsoft. The next day I got an email from Harry Shum, who was the head of AI and Research — one of the leadership below the CEO — and he asked me to chat with Jennifer Chayes, who was the head of the New England lab of MSR, Microsoft Research. So we had a conversation, me and Jennifer, and she also put me in touch with Mike Freedman, who is another big name — he's a Fields Medalist who worked on the Poincaré conjecture — and he'd been working at Microsoft for some time on topological quantum computing. When we chatted, I think we really resonated, because in my heart I'm kind of a topologist — definitely the paper Yau liked was on topology and computational learning theory — and of course Mike himself is somebody who won wide recognition for his topological work and is now working on the intersection of topology and quantum computing. So I think we really resonated, and at the end of the conversation he went back to Harry and said, just hire this guy, don't ask me questions. So I got a call a couple of days later from Harry, and Harry was like, you know, I've got two Fields Medalists telling me to hire you; if I don't hire you, then... right? So that's essentially how I got into MSR, around 2017-ish. Very nice, very nice. OK, great — I think that's a great story, and that takes us to what you've been doing since then. So why don't we get straight to it. Today we're going to be talking about your body of work known as Tensor Programs, which right now spans five very technical papers. The way I wanted to organize this conversation — well, first of all, I invited you on selfishly, essentially because I wanted to understand your work better, but of course, to make this maximally worthwhile and enlightening, I thought it'd be good to put it in a greater context, and we'll see if this succeeds or not. But basically, let's just
take a step back. You're an AI researcher, and right now there's an AI revolution going on, right as we talk: ChatGPT has been all the rage, and before that Stable Diffusion and many other things, and people are very aware now of the power of AI. Part of it is that models are getting larger, and part of what your theory does is to provide a theoretical grounding, an analysis, of what happens when neural networks get larger, in a very specific sense. A lot of your work builds on studying random matrices and nonlinear functions of them. So the way I wanted to organize this discussion is to think about your body of work in the context of this concept of letting objects get larger: you have random objects of a certain size, and as that size gets larger, things can simplify. That's the big idea that I think we can nest your work in, and I think that will be helpful, because people will have seen things like the central limit theorem, and then things like random matrix theory, and then we'll eventually get to your work. How does that sound? Yeah, that sounds great. OK, great. I spoke a bit; let me actually repeat that in writing. So there's a capital N, which I will let go to infinity, and in this limit, this large-N limit, let's say, things simplify. That's the overall concept of what we're going to be talking about, and the way we're going to pursue this route is the following. First, we'll have a warm-up where we review this large-N-limit concept in the context of the law of large numbers and the central limit theorem. Here we think of N as the number of samples: these are classical results about what happens when you sample repeatedly from a distribution and take certain averages, or scaling limits thereof. Then we'll go over random matrix theory, which is basically an N-by-N version of this: you have a random matrix whose entries are random, it's N by N, you let N get large, and some very nice results arise from looking at the eigenvalue distribution of these random matrices. Then we'll look at tensor programs. And correct me if I'm wrong, but as I was thinking about this, I think tensor programs could be thought of as some kind of nonlinear, compositional central limit theorem — does that sound accurate to you, or is that an unfair summary? I'd say it's more like a nonlinear, compositional law of large numbers. Ah, OK, sure — or law of large numbers, let's say. And then what we're going to do is use your tensor programs to get new proofs of random matrix theory results, which will validate, in some sense, the value of tensor programs; and of course the real value of tensor programs is its implications for neural network theory. So here we have an N-by-N matrix, and here you could think of N as the size, or the width, of the neural network. The common unifying theme is that in these different settings you have a random object: you have some
random variable, you have a random matrix, you have a neural network with some random weights, and as the size goes to infinity the analysis becomes tractable. Yeah. Actually, I realize I kind of took over a little bit — I said all this stuff, but I should have let you describe what tensor programs are. Do you want to give your TL;DR of what tensor programs are, and how you came about this concept? Yeah. Maybe in a sentence — OK, maybe not one sentence, but two — there are two parts. The first part is a set of rules for expressing computation, which is essentially matrix multiplication and entrywise nonlinear functions. Roughly speaking, anything you can write in PyTorch or TensorFlow — deep learning computational frameworks — can be canonically compiled down to this kind of lower-level language. So the first part is a formalization of the rules of computation. It's kind of like a Turing machine: just as the Turing machine formalizes the rules of general computation, this formalizes the rules of computation for deep learning. The second part is the law-of-large-numbers, or central-limit, aspect of this, which is that when you randomly sample the matrices in such programs, you can obtain really general insights about what happens when the sizes of the matrices go to infinity. Concretely, you can implement random matrix theory inside this framework, in which case the sizes of the matrices are just the sizes of the matrices in random matrix theory; you can also implement all of deep learning inside this language, in which case the sizes of the matrices in the program correspond to the widths of the neural networks. But it's a very flexible language, just as Turing machines are very flexible in terms of expressing computation. So that's the rough gist of what tensor programs are. I can also talk briefly about the history. Roughly speaking, when I joined Microsoft in 2017 I was already exploring the behavior of random neural networks, at so-called initialization. What that means is that the neural network just has random weights and you haven't really done anything with it, and you're trying to understand its behavior. There's a very well-known behavior of such networks from twenty, thirty years ago, known as the neural network–Gaussian process correspondence. In other words, a random neural network — in the sense that the weights are randomly sampled — can be shown to be a Gaussian process in the limit as the width of the network, the number of neurons per layer, goes to infinity. This is a well-known result from a while ago for a one-hidden-layer neural network.
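To make this correspondence concrete, here is a minimal numpy sketch — an illustration under assumed conventions (tanh nonlinearity, 1/sqrt(width) output scaling), not code from the Tensor Programs papers. It samples many independently initialized one-hidden-layer networks at two fixed inputs and compares the empirical covariance of the outputs, across initializations, with a Monte Carlo estimate of the limiting Gaussian-process kernel.

```python
import numpy as np

# Sketch of the NN-GP correspondence for a one-hidden-layer network
#   f(x) = (1/sqrt(n)) * v . phi(W x),  with W_ij, v_j ~ N(0, 1) and phi = tanh.
# Across random initializations, (f(x1), f(x2)) approaches a centered Gaussian
# with kernel K(x, x') = E_w[ phi(w.x) phi(w.x') ] as the width n grows.

rng = np.random.default_rng(0)
phi = np.tanh                      # any reasonable nonlinearity
d, n, trials = 3, 2000, 5000       # input dim, width, number of random inits
x1 = rng.standard_normal(d)
x2 = rng.standard_normal(d)

outs = np.zeros((trials, 2))
for t in range(trials):
    W = rng.standard_normal((n, d))            # hidden weights
    v = rng.standard_normal(n)                 # output weights
    outs[t, 0] = v @ phi(W @ x1) / np.sqrt(n)
    outs[t, 1] = v @ phi(W @ x2) / np.sqrt(n)

# Empirical covariance of the network outputs across initializations
print("empirical cov:\n", np.cov(outs.T))

# Monte Carlo estimate of the limiting GP kernel K(x, x') = E_w[phi(w.x) phi(w.x')]
w = rng.standard_normal((200_000, d))
K11 = np.mean(phi(w @ x1) ** 2)
K12 = np.mean(phi(w @ x1) * phi(w @ x2))
K22 = np.mean(phi(w @ x2) ** 2)
print("limiting kernel:\n", np.array([[K11, K12], [K12, K22]]))
```

As the width n grows, the two printed matrices should agree more and more closely, and the joint distribution of the two outputs looks increasingly Gaussian.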
Then, when I was starting to do research, there was more activity in that area in terms of deep networks, and at the time most of the work was done by physicists, essentially, so they were kind of flippant — non-rigorous — about the ways they manipulated some of the equations, and as a mathematician I wasn't super satisfied with that. My motivation at the time was to understand: can I convince myself that this is all good, that the assumptions they made in these physics-style calculations are OK in some circumstances? Driven by that motivation, I realized that all these different things you do for the different architectures people were writing about can actually be distilled into this simple form, this low-level language — essentially tensor programs — which has two instructions: one is matrix multiplication, and the other is these entrywise nonlinearities. As long as you can express your computation in these two forms — and you can express ResNets in this form, you can express Transformers in this form — you can use a kind of mechanical calculation to compute the behavior of the infinitely large neural network. So that's how I came about it: driven by the motivation to understand the behavior of large neural networks, and especially to convince myself that the physics way of doing things is actually correct in some circumstances. OK, great. All right, why don't we just dive in, because we've waxed poetic for long enough. Greg, why don't you explain the law of large numbers? Yeah, OK. So the most trivial case of the law of large numbers — let me just write down "law of large numbers." Right, so — maybe before saying exactly what it is: in a lot of situations you add up a lot of things, and you sometimes want to understand how this behaves when you add up a lot of things together. The most trivial case is: suppose you buy one orange every day, and you do it for 100 days straight. How many oranges are you going to have? Obviously, you just do one plus one plus one, and so on, and there are a hundred of these, so it's a hundred. In general, if instead of 100 you have n of these, then the total is n; in other words, if you divide by n, you just get one, which is what you're adding every day. This is entirely trivial — if you're in kindergarten you know this. So the law of large numbers is, in essence, just saying the same thing in a more complicated case, where, say, you flip a coin every day, and if the coin comes up heads you buy two oranges, and if it's tails you buy zero oranges. Now, instead of buying exactly one orange every day, it's a stochastic thing, a random thing, and on average you're buying one orange every day. So now you can ask: OK, if I have — I'll call this X, for the number of oranges — on day one and day two,
and so on and so forth — so this is the number of oranges on day 100, for example — and I do the same thing, I divide by n, then the law of large numbers says you can ignore all the noise: the answer, when n is large, will essentially be the same as if you were just buying one orange every day, where one is the expected number of oranges you buy each day. Again, the condition here is that it's two oranges or zero oranges with probability one half each. Well, I guess you didn't really mean it literally equals one — it equals one with some... Yeah, so let me say it this way: as n goes to infinity, where the n here is the number of days, not necessarily 100. In other words, the law of large numbers says that if you're averaging things — adding 100 things and dividing by 100 — then when that hundred becomes a thousand, or ten thousand, you can forget about all the noise in each random variable; you can just look at the mean, and that's the only thing that matters. Maybe I'm being a little pedantic, but it's not so much noise as variance, right? Because it's zero oranges or two oranges — I'm not sure what to think of as noise — but the point is that if you average over those two outcomes with respect to the underlying probability measure, which we said was one half for each case, then you get the one-orange expectation. The thing is that there are fluctuations, right? There is one run in which you actually buy two oranges every day, and another run in which you end up buying no oranges every day, but those are relatively rare, and they become more rare as n goes to infinity. So basically the fluctuations — the variance — go to zero as n gets large. Yeah, that's right. So, if you want to formalize this, the rigorous version of the law of large numbers is that if X1, X2, and so on are all i.i.d. — in case people don't know, that means independent and identically distributed — well, we'll assume our readers know what that is, but basically you can think of the Xi's as independent coin flips, where the coin has the same distribution every time you flip it. Yeah, that's the quintessential example. If they're all i.i.d., then taking the sum of the sequence and averaging it converges to the expectation of any one of them as n goes to infinity. And if you want to be more pedantic, you can say exactly what this notion of convergence is — there are several, for example almost sure convergence, which gives the strong law of large numbers. But that's pretty much it: everybody has some intrinsic feeling for this kind of thing — when you average a large number of things, you get the mean.
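A quick simulation of the coin-flip orange example above (a minimal sketch in numpy, with the 0-or-2 orange variable assumed as described): the empirical average settles toward the mean of 1 as n grows, with fluctuations shrinking like 1/sqrt(n).

```python
import numpy as np

# Law of large numbers, coin-flip version: X_i = 2 oranges with prob 1/2, else 0.
# The empirical average (X_1 + ... + X_n) / n should approach E[X_1] = 1.
rng = np.random.default_rng(0)
for n in [10, 1_000, 100_000, 10_000_000]:
    x = 2 * rng.integers(0, 2, size=n)   # i.i.d. samples in {0, 2}
    print(n, x.mean())                   # fluctuations around 1 shrink like 1/sqrt(n)
```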
Good, yeah. It would be too far afield for us to go into the different notions of convergence, but intuitively it means that with near certainty — or certainty in a sense that can't quite be captured in the English language — when you flip your coin, when you sample all these variables i.i.d. and average them, you will get the expected value. It's sort of like: if you pick a random real number, with probability one you're going to get an irrational number rather than a rational number. Technically you could get a rational number, but it's just never going to happen, in some formal sense. Again, the English language is too coarse to capture the notion of measure zero, but you can think of it that way. Yeah, and I think you can also be more quantitative: you can say that for any finite but large n, the probability that the empirical average is very far from the expectation is small. Exactly — it's like a calculus, epsilon-delta way of making this rigorous. OK, anyway, great — so this is the law of large numbers. Did you want to say anything else about it, or should we go on to the central limit theorem and how it's a different kind of large-n limit? Maybe I'll just say that eventually, when we get to tensor programs, the key theorem — the Master Theorem, which is the name of the one theorem you need to know in that theory — is formulated in this way: when you take an average of certain things that are actually not i.i.d. — in fact, in general they're correlated — it tells you what the limit of that average is. So it's formulated in a way most akin to a law of large numbers: you average a bunch of things that are correlated; how should you think about this average? Should you assume the correlations are just zero, or do they cause the limit to differ? This is one of the most important consequences of tensor programs: it tells you, when correlations arise from this kind of matrix multiplication and entrywise nonlinearity, how you should think about the average. Just to make this more clear — let's not get too deep right now, but I want to unpack what you said and then move on; it'll make more sense when we come back. Basically, the key thing in the law of large numbers is this i.i.d. assumption, and in tensor programs we're going to replace that: the Xi's won't be i.i.d.; they'll be non-trivial functions of the previous ones. That's what we meant earlier by a nonlinear, compositional central limit theorem — or law of large numbers: the Xi's can be nonlinear functions of each other, and that's where the heavy lifting comes in. Yeah, that's right. And the Xi's are basically the hidden units in a neural network — that's the point. Yeah. Actually, it's great that we covered it this way, because already here we can get a feel for what tensor programs are about. That's right. Great. OK, are we ready for the central limit theorem? Yeah, let's do that. So the central limit theorem is a different theorem from the law of large numbers, but it's very intimately related. In the scenario we had before, we had these random
variables X1, X2, and so on, all i.i.d. — but now assume the mean is zero. If the expectation of X1 is zero, then the law of large numbers says that this average over n things goes to zero in the limit as n goes to infinity, because we've assumed the expectation is zero. So in this scenario, when you average random variables with zero expectation, you get zero in the limit of a large number of samples. That's a consequence of the law of large numbers, but it's also somewhat uninteresting when things just go to zero. The central limit theorem tells you what the right scaling is to obtain a non-trivial result, and what that non-trivial result is. The central limit theorem asserts the following: if, instead of dividing by n, you divide by the square root of n, then taking the limit gives you a Gaussian distribution, N(0, sigma squared), where sigma squared equals the expectation of X1 squared — the variance of X1. So what does this imply? One way to think about it, in connection with the law of large numbers, is — actually, let me say the expectation of X1 equals mu, which may be nonzero, and the variance of X1 equals sigma squared; this is a more general scenario than the one we just assumed. Back to the situation: if you're adding n things, each an independent copy of this random variable with mean mu and variance sigma squared, then you should expect the sum to look something like n times mu, plus square root of n times sigma times a Gaussian. It's like an asymptotic expansion, if you're familiar with that from physics, of how you should think about a sum: the dominant term is the mean, which scales linearly with the number of samples n, and the subleading term is proportional to a Gaussian with scale square root of n times sigma. There are lower-order terms as well, but the top two terms are the most famous ones: the first term is the law of large numbers, and the second term is the central limit theorem. I see. I was going to mention one useful fact to recall — readers who've gotten this far will know it, but it's good to recall — which is that if X and Y are independent, then the variance of X plus Y is the variance of X plus the variance of Y; and also, the variance of c times X is c squared times the variance of X. That justifies your approximation here in some sense: the first term is just linearity of expectation, and the second term essentially follows from these facts, because if you sum n independent things, the variance is n times the variance of the individual random variable — and that's how you get the second term: multiplying by square root of n times sigma changes the variance of the standard normal by the square of that, which is n times sigma squared.
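A numerical illustration of the two scalings in this expansion (again a minimal sketch, reusing the 0-or-2 orange variable, so mu = 1 and sigma = 1): dividing the centered sum by n sends it to zero, the law-of-large-numbers behavior, while dividing by sqrt(n) leaves an order-one, Gaussian-looking fluctuation.

```python
import numpy as np

# X_i in {0, 2} with prob 1/2 each: mean mu = 1, variance sigma^2 = 1.
# S_n = X_1 + ... + X_n is just 2 * Binomial(n, 1/2), so sample it directly.
rng = np.random.default_rng(0)
n, trials = 100_000, 200_000
s = 2 * rng.binomial(n, 0.5, size=trials)

lln = (s - n) / n                 # (S_n - n*mu)/n       -> concentrates at 0
clt = (s - n) / np.sqrt(n)        # (S_n - n*mu)/sqrt(n) -> approx N(0, sigma^2 = 1)

print("std of (S_n - n*mu)/n       :", lln.std())   # ~ 1/sqrt(n), tiny
print("std of (S_n - n*mu)/sqrt(n) :", clt.std())   # ~ sigma = 1
print("4th moment of the sqrt(n) version (Gaussian value is 3):", np.mean(clt ** 4))
```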
Right. So here I want to point out one basic intuition that will come back later. It's a very, very basic intuition — not the key intuition behind tensor programs, but the kind of basic thing that benefits everybody who understands it — which is that when you add random things together, if the fluctuations are independent, they cannot, in some sense, conspire to push in the same direction, because the fluctuations carry no information about each other: the fluctuation of X1 doesn't know anything about the fluctuation of Xn. As such, they cannot conspire to push in the same direction. So when you add n things together, instead of scaling like n — which is what happens if they can conspire to push in the same direction — since they don't know about each other, you can only expect them to push away from zero by something smaller than n, and the basic fact of how variances add implies that this smaller number is something like the square root of n. Exactly — a very basic intuition that everybody should understand. So basically the law of large numbers says the mean is the only thing that can conspire to push in the same direction, and the central limit theorem says that if things cannot conspire to push in the same direction, you get square-root-of-n behavior. Exactly, and that follows from the facts I just wrote: when you add n independent things, the variance grows like n, but if you add n copies of the same thing — the second fact, the variance of c times X with c equal to n — you get n squared, an extra factor of n. So the second case is the conspiracy case and the first is the independent case. Here I do want to point out that whether the mean is zero or not gives you two canonical behaviors. When the mean is nonzero, the correct way of scaling the sum of n things is, in some sense, to divide by n, and then you get a non-zero, non-infinite object — the expectation of the random variable. But when the mean is zero, the right way of scaling to get a non-trivial limit is to divide by the square root of n. This change in the right scaling behavior, depending on the characteristics of the random variable — this kind of thought process will come back later when we talk about how tensor programs apply to training large neural networks, because in that case there are also different ways of scaling your neural network, kind of like the n or square root of n here, depending on the scenario you're in, and each could be correct or incorrect in different scenarios. One of the main contributions of tensor programs is to clarify, and in fact derive, the correct scaling
in general, so that you have nice payoffs when you train large neural networks — scalings that, from a more empirical perspective, would be really hard to guess. Indeed, indeed. So Greg, as we were discussing before, it's instructive to go over a proof, or a sketch of a proof, of the central limit theorem, because these ideas will show up again later, in particular in the random matrix theory part. Right, OK, let's do that. There are many different proofs of this, but one way to go about it is to first use the fact that two distributions, say P and Q, are equal if and only if their moments are equal — where by moments I mean the expectations of X to the k-th power, with X drawn from P or from Q; these are P's moments and Q's moments, for the different k's. If this holds for all k greater than or equal to one, then the distributions are equal, under some regularity assumptions. Sure — maybe you were going to say this already, but I think the point here is that the space of polynomials is dense in the space of continuous functions, so if you can show agreement for all the moments, you can show it for any continuous function; and, intuitively, if two probability distributions agree on all continuous functions when you integrate against them, then they have to be the same. I think that's fairly intuitive. Yep, exactly. So, back to the CLT: the point is that it suffices to show that when you add these n things, divide by the square root of n, take the k-th power, and take the expectation, this converges to the expectation of the k-th power of X drawn from a Gaussian. I'm also going to assume the variance is one from now on. That's fine, because you can always rescale it to be so. Sure. So we've reduced the problem to showing a specific convergence of expectations of powers of this sum. That's the first step: we've reduced our problem to showing convergence of specific deterministic quantities. The second step is to actually compute these quantities — these powers — take expectations, and see what we expect them to be. In general, when you take a sum — I'm going to use indices alpha and beta here — when you take the sum of the X_alpha and raise it to the k-th power, you can expand this as the sum of X_{alpha_1} times ... times X_{alpha_k} over all sequences alpha_1 through alpha_k, where each index ranges from one to n. And when we take the expectation, by linearity of expectation you can push the expectation inside the sum. Great. So now the goal is to understand how this sum behaves as n becomes large while k stays fixed. Just remember this, psychologically: the number of indices is fixed in our scenario, but their range becomes larger and larger.
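For reference, the two steps just described can be written out as follows (a standard rendering of the same statements, with variance one):

```latex
% Step 1 (moments suffice): it is enough to show, for every fixed k >= 1,
\[
  \mathbb{E}\!\left[\Big(\tfrac{X_1+\cdots+X_n}{\sqrt{n}}\Big)^{\!k}\right]
  \;\xrightarrow[n\to\infty]{}\; \mathbb{E}\big[Z^k\big], \qquad Z\sim\mathcal{N}(0,1).
\]
% Step 2 (expand and use linearity of expectation):
\[
  \mathbb{E}\!\left[\Big(\tfrac{1}{\sqrt{n}}\sum_{\alpha=1}^{n} X_\alpha\Big)^{\!k}\right]
  = \frac{1}{n^{k/2}}\sum_{\alpha_1,\dots,\alpha_k=1}^{n}
    \mathbb{E}\big[X_{\alpha_1}X_{\alpha_2}\cdots X_{\alpha_k}\big],
\]
% with k held fixed while the range n of each index grows.
```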
So now the next step is to observe that, because we've assumed the expectation of X_alpha is zero, these expectations of products of X_alphas are going to be zero if any X_alpha appears only once in the product. For example, the expectation of X1 times X2 squared times X3 cubed is going to be zero, but the expectation of X1 squared times X2 squared is, in general, not going to be zero. Right — the problem is that this guy has a power of one, and you're using the fact that, by independence, the expectation of the product is the product of the expectations. Yeah, exactly — this is true by the independence assumption; in general you can factorize over the disjoint indices. So what this implies is that, in this big sum, we only need to worry about the summands in which each alpha appears at least twice. I see where this is going now — I didn't study this proof in preparation for our talk, but I think I see where this is going; you get a nice proof this way. OK, I don't want to steal the show. Right — each alpha that appears must appear at least two times. So this is our first step, eliminating some of the fluff we don't have to worry about, and the next step is to understand which component of the sum dominates when n becomes large. For example, in these sums you can have the case where every exponent is two — X1 squared times X2 squared, up to X_{k/2} squared; you can have the case where every exponent is three — X1 cubed times X2 cubed, up to X_{k/3} cubed; or you can have a mixture, like X1 squared times X2 cubed, and so on. The point is that when you count the number of terms of each kind, you notice that the kind where every exponent is exactly two dominates the total number of terms as n becomes large. In other words, you can forget about all the other kinds of terms as n goes to infinity, because they contribute less and less. So we can say that the sum is asymptotically equal to the case where each alpha_i appears exactly two times — let me write it another way, because in this case it's simpler: the sum of X_{alpha_1} squared times ... times X_{alpha_{k/2}} squared over all alpha_1 through alpha_{k/2}. Let's back up a little bit. I think the easiest way to see what you're saying is to treat it as a counting problem. In the first kind of term you have the fewest constraints, in the sense that you have the lowest possible non-trivial power, which is two, because we don't want powers of one. In this case, the number of such terms
is basically: you have to choose k/2 indices from the n available, and once you've chosen those k/2 indices, you've already constrained them all to have power two. So the number of such terms is just n choose k/2, whereas, as you just wrote, if they were all cubes it would be n choose k/3, which is a smaller power of n. By the same logic, if you had higher powers, each extra power is like another codimension, in some sense, in the space of all possibilities: there's a leading term, and everything else has positive codimension, so only the lowest non-trivial power survives. Yeah, that's right. Great — so now, I'm going to stay on this page and erase all of this — our mission is to calculate exactly what this sum is. First of all, by symmetry you can evaluate each of these expectations, and by independence the expectation factors into the expectation of each square of the X_alphas; and because the variance of each X is assumed to be one, each factor is just one. So the correct combinatorics is the counting: you choose which k/2 indices appear — that's the n choose k/2 factor, which grows like n to the k/2 — times the number of ways of matching the k factors up into those pairs, which is the multinomial-type count of choosing pairs, k factorial over two to the k/2 times (k/2) factorial. That second factor counts how many ways things can coincide, and it equals (k-1) double factorial; the n to the k/2 then cancels against the normalization from the square root of n. So what we just showed is that the expectation of ((X1 + ... + Xn) divided by square root of n) to the k-th power converges to — you said it was (k-1) double factorial? Yeah — the way I use the double factorial is a bit different, but in any case, what it means here is the product of all the odd numbers up to k-1. Exactly. And the point is that this is also the same as the expectation of Z to the k, where Z is distributed as a standard Gaussian, for k even — is that right? Yeah, and for k odd it's zero. Of course. Great, so this shows the convergence — and if you want to get technical, the notion here is convergence in distribution. Yeah, sure — we're not that fancy. OK, great.
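A Monte Carlo sanity check of this moment computation (a sketch, using plus-or-minus-one coin flips so the mean is 0 and the variance is 1): the empirical k-th moments of the normalized sum should approach the Gaussian moments, (k-1)!! for even k and 0 for odd k.

```python
import numpy as np

# Check E[((X_1+...+X_n)/sqrt(n))^k] -> (k-1)!! = E[Z^k], Z ~ N(0,1), for even k.
# Take X_i = +/-1 fair coin flips (mean 0, variance 1); their sum is
# S_n = 2*Binomial(n, 1/2) - n, so we can sample S_n directly.
rng = np.random.default_rng(0)
n, trials = 10_000, 1_000_000
z = (2 * rng.binomial(n, 0.5, size=trials) - n) / np.sqrt(n)

def double_factorial(m):
    # (k-1)!! = (k-1)(k-3)...3*1, with the convention (-1)!! = 1
    return 1 if m <= 0 else m * double_factorial(m - 2)

for k in [1, 2, 3, 4, 6, 8]:
    print(f"k={k}:  empirical moment {np.mean(z ** k):8.3f}"
          f"   Gaussian value {0 if k % 2 else double_factorial(k - 1)}")
```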
The analogy between the random matrix story and the central limit theorem will become clearer as we try to apply the same moment method and the same ideas there, and we will see that the classical way of understanding these things looks very much like what we did for the central limit theorem. Using tensor programs, though, is a very different way of thinking about things, and from there we can use that as a jumping-off point to see how this different way of thinking about random matrices translates really well to neural networks. So when we go to random matrix theory, the thing we care about, analogous to the central limit scenario, is understanding the eigenvalue distribution of a large random matrix. The setup is typically something like this. I'm going to draw a matrix that is n by n; a very common setting is that this matrix, call it A, is symmetric, and A's entries are iid subject to that constraint. What this means is that the upper triangular portion is iid, and the lower triangular portion is exactly the mirror image of it; that's what I mean by iid subject to the symmetry constraint. Say the entries are Gaussian with mean zero and variance one, or something like that. The classical question asked in random matrix theory is: what does the eigenvalue distribution of A look like as n goes to infinity? That's one of the key questions. And just to note, because the matrix is symmetric, by the usual linear algebra theorems the eigenvalues are real, so for finite matrices we're essentially going to look at a histogram of eigenvalues on the real line, say from negative one to one or something. And you're going to rescale the eigenvalues, right? Maybe you should say that. Yeah, maybe let me not be very specific about the scale here. Let's just say we're going to look at a histogram on the real line: for any finite matrix you can always look at the distribution as a histogram, and when the matrix becomes infinitely large you have an infinite number of eigenvalues, and that forms a continuous distribution. I guess when you say histogram: with probability one the matrix has distinct eigenvalues, so what you're actually doing is binning, and within those bins you can have multiplicity, right? That's right, I'm binning here; if you don't bin, you just get a bunch of deltas. Exactly; I just wanted to clarify that the reason you have a histogram is that you've binned. Okay, let's keep going. Great. So how do we answer this question? Matrices are much more complicated than a sequence of numbers.
But it turns out you can use some of the same tricks that we used just now for the central limit theorem, namely the moment method. The point here is that if you look at the distribution, say p(A), defined to be the distribution of the eigenvalues of A, then again by the usual linear algebra theorems, the expectation of X^k, where X is drawn from p(A), is equal to the trace of A^k divided by n, where n is the size of the matrix; again the picture is that A is n by n. And this is true for any deterministic matrix; it doesn't even have to be random. Sorry, don't you have to rescale A somehow so that you actually get a limiting distribution? Oh, this is just for finite matrices; n is finite right now. Okay, so this is true for any finite matrix, where p(A) is the empirical measure on the eigenvalues, is that right? Yeah. Let's just write that out, because we're going fairly deep here: p(A) is what's called the empirical measure of the eigenvalues of A, which means it's the measure consisting of 1/n times a Dirac delta at each eigenvalue; the 1/n is there so that it's a probability measure, and you count one delta for each eigenvalue. That's right. Okay, good. So the point is that you can get at a gross characteristic, the moments of the eigenvalue distribution, by taking powers of the matrix A and then taking the trace: the traces of powers of A directly give you the moments of the empirical distribution of the eigenvalues. That's how we're going to use the moment method, through traces of powers of the matrix. In the non-matrix case we had the central limit theorem, which says that the distribution of a sum of independent, zero-mean things becomes Gaussian when the number of things is large. Here we're asking what the eigenvalue distribution looks like for a large matrix, and the answer is not a Gaussian distribution, as you might guess coming from the central limit perspective; the limit is actually a semicircle distribution. Let me write this down. The CLT says the distribution of the scaled sum converges to a Gaussian with zero mean and some variance, whereas the semicircle law says that if you have an n-by-n random matrix like this and you take its eigenvalues, then when n is large the shape of their distribution is a semicircle centered at zero. So the limiting distribution looks different. Sorry, but here don't you need to divide by some power of n to get a limit? I didn't put a scale here; I just said semicircle shaped, which is still true, it's just that the scale goes to infinity if you don't rescale. Sure, fair enough. The point I want to make is just that the shape is different.
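(Here is a small numerical illustration of the two facts just stated, with Gaussian entries chosen purely for convenience: the K-th moment of the empirical eigenvalue measure equals tr(A^K)/n for any symmetric matrix, and for a large random symmetric matrix the eigenvalue histogram traces out a semicircle. The entries are rescaled by 1/sqrt(n), anticipating the scaling discussed a bit later, so that the eigenvalues stay bounded.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# symmetric matrix: iid Gaussian entries on and above the diagonal, mirrored below
G = rng.normal(0, 1.0 / np.sqrt(n), size=(n, n))
A = np.triu(G) + np.triu(G, 1).T

eigs = np.linalg.eigvalsh(A)          # real, since A is symmetric

# moments of the empirical eigenvalue measure vs. tr(A^K) / n
for K in (1, 2, 3, 4):
    print(K, np.mean(eigs ** K), np.trace(np.linalg.matrix_power(A, K)) / n)

# binned eigenvalues: the counts rise and fall like a semicircle
counts, _ = np.histogram(eigs, bins=20)
print(counts)
```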
But even though the shape is different, you can still use the moment method to get what you want, and I'll briefly sketch how the argument goes. Just like in the central limit case, to prove the semicircle law you want to show that for this random matrix, random as before, the expectation over A of the trace of A^k, divided by n and additionally scaled by n^(k/2), converges to the expectation of X^k where X is drawn from the semicircle distribution. And just like before, in the CLT case, this trace of A^k can be expanded as a sum. Before, it was very simple, we were expanding a power of a sum of scalars; now we're expanding powers of the matrix and taking the trace, but you can still expand it, and essentially you get a sum over indices alpha_1, ..., alpha_k of products of entries of A: A_{alpha_1 alpha_2} times A_{alpha_2 alpha_3}, and so forth, ending with A_{alpha_k alpha_1}, so the thing loops back. Then you take the expectation of this, and now it looks somewhat familiar from before. The task is to figure out which terms are dominant. As before, because the entries have zero mean, if any A_{alpha_i alpha_{i+1}} appears exactly once then the whole term vanishes, so the only interesting terms are products where each A_{alpha_i alpha_{i+1}} appears two or more times, and just as before the dominant terms are exactly those where each appears exactly twice. Whereas before this was a fairly easy combinatorial count, which you can still get wrong if you're not experienced, here the combinatorial description becomes graphical: you can think of the sequence alpha_1, alpha_2, and so on as choices of vertices on a graph. The graph is the complete graph on n vertices, each alpha_i corresponds to one of those vertices, and each A_{alpha_1 alpha_2} corresponds to an edge between alpha_1 and alpha_2. The fact that the last factor in the product is A_{alpha_k alpha_1}, looping back, means the structures you're looking at are walks through the graph that eventually return to where they started; in other words, you're looking at cycles in this graph. And because we observed that each A_{alpha_i alpha_{i+1}} has to appear exactly twice for the dominant terms, you're looking at cycles that go through each edge exactly twice. So, using a different color, an example of the kind of cycle you really care about could look like this: it loops out, comes back here, and loops back again. This is a particular instance of a cycle in which each edge appears exactly twice.
From there it's a counting problem: how many such cycles are there? It turns out this reduces to something well known from combinatorics, the Catalan numbers, Catalan like the region of Spain, and from there you're essentially done, because the Catalan numbers are precisely the moments of the semicircle distribution. Oh, I see, they are exactly the moments of the semicircle distribution. Let's just take a step back: the right-hand side, the semicircle law, means the probability density is proportional to the square root of 1 minus x squared, I assume, right? Yeah, right. So if I worked hard enough I could compute the moments of that distribution by hand, just doing some substitutions, pretty straightforward, is that right? Yeah, you can reduce it to a recurrence equation, and that's exactly the recurrence of the Catalan numbers. Ah, okay, cool, because there are many definitions of the Catalan numbers; I've never seen them defined in terms of moments of the semicircle law, I usually think of the more combinatorial definition. But this construction essentially gives you a proof of that fact as well, because anything that satisfies the recurrence of the Catalan numbers is the Catalan numbers, and you just show that the right-hand side satisfies that recurrence. Okay, great, I'm satisfied. So the point of all this is that for both the central limit theorem and the semicircle law for random matrices, the moment method is tractable enough that you can do it by hand, essentially. Yeah. Let me also fill in a detail we skipped, which is the right scaling of the matrix to get a bounded distribution in the end; in other words, what should the scale of the entries be? Earlier we were saying the entries are standard Gaussians with variance one, but the correct scaling, if you want the eigenvalues to land in a bounded range, think something like negative one to one, I forget the exact endpoints, it might be negative two to two, is for the entries to be Gaussian with zero mean and variance 1/n. In other words, the typical size of each entry is one over the square root of n: the standard deviation is 1/sqrt(n), the variance is 1/n. That turns out to be the right scaling to get a finite, nonzero limiting distribution of the eigenvalues. And that already came from what you wrote here, because if you absorb a square root of n from the denominator into A, that gives you this variance of 1/n. Yeah, in essence it's analogous to the central limit theorem, where there's a square root of n in the denominator; here we're essentially dividing the matrix by the square root of n. Okay, great. All right, Greg, so at long last let's get to your tensor programs work. Why don't you tell us what it's all about? Right, maybe let me start from the random matrix angle and motivate it by continuing from this moment method discussion.
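(A minimal numerical illustration of the last two points, again with Gaussian entries as an arbitrary choice: with entry variance 1/n the limiting semicircle is supported on [-2, 2], and its 2m-th moment is the m-th Catalan number.)

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)
n = 2000
G = rng.normal(0, 1.0 / np.sqrt(n), size=(n, n))   # entries with variance 1/n
A = np.triu(G) + np.triu(G, 1).T                   # symmetric

eigs = np.linalg.eigvalsh(A)

def catalan(m):
    return comb(2 * m, m) // (m + 1)

# even moments of the eigenvalue distribution vs. Catalan numbers
for K in (2, 4, 6, 8):
    print(K, np.mean(eigs ** K), catalan(K // 2))
```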
Let me use a different color, a white color. So earlier we were talking about how, to compute the moments of the eigenvalue distribution, it suffices to compute the k-th power of the matrix and its trace. The point I want to make now is that this quantity can be written as a series of matrix-vector multiplications, and the point of doing this, going forward, is that this kind of computation can be abstracted into a more general framework of tensor programs; I'll talk about that framework and then specialize back to this case at the end. So first of all, let me express the trace as an expectation, which is a trivial but very useful trick: the trace of A^k is the same as the expectation over v of v transpose times A^k times v, where v is a standard Gaussian vector. Here I just assume A is any deterministic matrix, and this works for the trace of A^k or of any other matrix. The expression is a scalar, because we're hitting the matrix with vectors on both sides, and when you take the expectation, essentially the only contributions come from the diagonal of A^k, so that gives you the trace. So this is an entirely trivial trick, but what it allows you to do is write this as a series of matrix multiplications. I'll define x^1 = A times v, then x^2 = A times x^1, and so on and so forth, so x^k is defined to be A times x^{k-1}. In effect, x^k is just A^k times v, and the trace of A^k is equal to the expectation over v of the inner product of x^k with v. Also, just to note, I'm using the notation where the superscript is an index, not a power; depending on where you come from this can trip you up, but in more physics-adjacent areas it's pretty common. So on A^k it's a power, and on x^k it's an index, just to be clear. Again, everything I've done here is completely trivial: I rewrote the trace as an expectation, and then rewrote that particular expectation as a series of matrix multiplications followed by an inner product at the end. The significance is that you can reason iteratively about each of these matrix multiplications in a principled way that allows you to calculate the final inner-product expectation very easily. And this is the core insight behind tensor programs: when you do matrix multiplications and entrywise nonlinearities, you can keep track of certain information about the vectors obtained from those matrix multiplications and entrywise nonlinearities, how they evolve through the operations, in the limit where the matrix and vector sizes become large.
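(Here is that trick in code: a minimal sketch, with a Gaussian test matrix as an arbitrary choice, of the identity tr(A^K) = E_v[<x^K, v>], where x^0 = v is a standard Gaussian probe vector and x^i = A x^{i-1}.)

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, trials = 300, 4, 2000
G = rng.normal(0, 1.0 / np.sqrt(n), size=(n, n))
A = np.triu(G) + np.triu(G, 1).T

exact = np.trace(np.linalg.matrix_power(A, K))

estimates = []
for _ in range(trials):
    v = rng.normal(size=n)      # x^0 = v, a standard Gaussian vector
    x = v.copy()
    for _ in range(K):          # x^i = A x^{i-1}
        x = A @ x
    estimates.append(x @ v)     # <x^K, v> is an unbiased estimate of tr(A^K)

print(exact, np.mean(estimates))
```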
So in this particular case, the gist is that every x^i has roughly iid entries as n goes to infinity. I haven't defined "roughly iid"; I just want to appeal to the intuition that the entries become roughly independent and identically distributed as n, the size of the vectors and matrices, goes to infinity. And this is assuming, for example, that A has iid Gaussian entries. And we're dropping the assumption that A is symmetric now, right, all the entries are independent? For this scenario A can be symmetric, and I'll get to the more general case where A is not symmetric, but yes, you can take A to have iid normal entries. Okay, sure. I mean, in the case of neural networks A won't be symmetric, but for now let's keep it symmetric. Right, I want to stay in this setting of symmetric matrices for the semicircle law. I was going to say, if all the entries were independent Gaussians, then I think it would be automatic that the entries of these vectors are identically distributed, but not independent, right? Yeah, by symmetry they will be identically distributed, but you don't know whether they're independent. For the non-symmetric case they're identically distributed; I'm trying to think whether in the symmetric case they're also identically distributed. I think so: if you permute the entries, you shouldn't change the distribution. Right, okay. So that's the first point, that the vectors x^i have roughly iid entries as n goes to infinity. The second point is that we can track the distributions of the entries of the x^i through a certain calculus, which I'll maybe briefly cover next. But roughly speaking, let me jump ahead a little and come back to fill in the middle: the eventual conclusion is that you can associate a random variable Z^{x^i} to each x^i, such that the entries of x^i look like iid samples from Z^{x^i}. I see. And there exists a set of rules for computing Z^{x^i} from Z^{x^{i-1}} down to Z^{x^1}, and, I guess, Z^{x^0}, where x^0 is defined to be just v. Yep; recall from the previous slide that x^1 is A times v, so x^0 can naturally be defined to be v. And v was the Gaussian with identity covariance, the isotropic standard version. Yep. Okay, so the punchline here is that the thing we care about in the end was this expectation of the inner product of x^k and v, or, given this definition, of x^k and x^0, and actually we need to scale it by 1/n. Based on the intuition that each entry of x^k and x^0 is distributed like Z^{x^k} and Z^{x^0} respectively, this is going to be pretty much the same as, well, let me just write it out: it's equal to the expectation of (1/n) times the sum over alpha of x^k_alpha times x^0_alpha, and our intuition is that for each alpha these two entries are coming
from their respective distributions as iid samples, so the whole average should look like an expectation. I guess it wasn't obvious: alpha is the coordinate, so subscripts are coordinates and the superscripts label the vectors? Yeah, superscripts correspond to different vectors, and subscripts correspond to the components of those vectors. That's right. So this roughly corresponds to a law-of-large-numbers kind of intuition: if you believe that x^k_alpha comes from Z^{x^k} and x^0_alpha comes from Z^{x^0}, then this should look like the expectation of the product of those two random variables whenever n is large. So that's what we expect. Let me back up, actually. It seems like this way of bookkeeping is a refinement of the moment method, in the sense that the moment method, at least the way we discussed it, doesn't reuse any computation: for every k you have to compute the moment from scratch. Here you're doing it step by step, or in a neural network setting, layer by layer, and you're inductively keeping track of the distributions, right? Yeah, you can think of it like that. To different people "moment method" means different things, but the way we phrased it, the moment method is just the overarching technique for showing that two distributions are equal if their moments match, and the traditional way of using it is to expand out some sum and try to see what the leading term is. In the random matrix case, for example, we had to count cycles that contain each edge exactly twice; that's a combinatorial problem that, a priori, you have to redo for every power. Right, exactly. Now, if you really unpack all the proofs: eventually you do get to the Catalan numbers, which are recursively defined, so in some sense, if you keep unwinding the proof, there is some kind of recurrence somewhere in the end, but it's very implicit. Whereas here it's very explicit: once you have this computation written down in this form, you can just feed it to a computer, and the computer can essentially calculate what the Z's are; everything is automatic at that point. The more traditional methods are a bit more ad hoc: there are some combinatorial problems you can do explicitly, and others you can't. So the master theorem in this case says that for any, let's say, smooth enough (which in fact is a very mild condition) function phi from R^{k+1} to R, the following empirical average converges: take (1/n) times the sum over alpha of phi of x^0_alpha (I'll do plus one because I want to include x^0, which is v), x^1_alpha, and so on up to x^k_alpha. This thing is a scalar; I'm adding n numbers together and dividing by n, and phi is applied to the individual components, coordinate by coordinate across the vectors, right? Yeah, that's right. And this converges to the expectation of phi of Z^{x^0} through Z^{x^k}. So remember, these are the random variables I've associated with each vector, whose construction I haven't given you.
I've just told you that such a construction exists. And in general, of course, these random variables are correlated with each other; they're not independent. This generally follows from the structure of the program: in general programs these random variables are not independent, which reflects the fact that there are correlations between x^0 and x^k, or x^1 and x^k, and so on. When you take an inner product, when you sum the products of entries, that corresponds to taking the expectation of the product of the two random variables, and these are usually non-vanishing. So this is called the master theorem; what I've written is a specific instance for the computation we're doing here, in the interest of the semicircle law and the moment method. The reason this is really powerful is that when you have nonlinearities, for example in the case of neural networks, you can calculate much, much more interesting things than just a linear random matrix ensemble. Let's take a step back, because this is going to take a while for people to parse; I can see what's going on because I've thought about it for a while. So far we've only defined these x^0 through x^k in the case where you apply a matrix A over and over again; after this you can tell us the more general class of transformations that can take you from one x to the next. But for the time being, the point is the following: you have a sequence of vectors of size n, and for any finite n the entries are not iid; for each x^i the components are identically distributed but not independent, but they get more and more independent as n goes to infinity. And the way that statement is encapsulated by the master theorem is this: on the left-hand side you average over the components, which is what the sum over alpha is doing, and if those components were truly independent, that would correspond to taking iid samples from a common distribution, which is to say you're approximately averaging over the underlying probability distribution, and that's the right-hand side. Here's what I'm trying to say: for example, the expectation of f(X), where X is a standard Gaussian, is approximated by the empirical average over iid samples X_alpha, and it's the iid-ness of the samples that makes the empirical average converge to the expectation.
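(A tiny illustration of that point, with f = cos as an arbitrary test function: averaging f over iid Gaussian samples recovers the expectation E[cos(Z)] = e^(-1/2), about 0.607, while averaging over perfectly correlated "samples" just evaluates f at one random point.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
f = np.cos                              # E[cos(Z)] = exp(-1/2) ~ 0.6065 for Z ~ N(0,1)

iid = rng.normal(size=n)                # truly iid coordinates
print(np.mean(f(iid)))                  # ~ 0.6065

degenerate = np.full(n, rng.normal())   # all coordinates identical (perfectly correlated)
print(np.mean(f(degenerate)))           # just f at one random point; says nothing about E[f]
```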
In the other extreme case, where all the X_alpha conspire to be correlated, in the most degenerate case they're actually all identical, perfectly correlated, so as soon as you know one X_alpha you know all the rest, the sum just collapses to the evaluation of f at a single point, which is very far from being an average over the distribution. So all I'm trying to say is: look at the right-hand side for k = 0, and the general k just generalizes what I'm saying. The fact that there's an expectation on the right-hand side is the statement that the components on the left-hand side are becoming more and more independent as n goes to infinity. Yeah, exactly. In other words, the master theorem is essentially a recapitulation of what I was saying earlier: the Z^{x^i} encapsulate the distributions under which the entries of x^i look iid. The master theorem says the entries of all these vectors look iid when n is large, and the way we see this is through the lens of the law of large numbers: if the entries really were iid over alpha, this would be a trivial consequence of the law of large numbers. Of course, in our situation they're not actually iid, they're correlated in somewhat complex ways, and the master theorem says that as long as we only do law-of-large-numbers-style things, averaging, then they look just like iid things; you can just pretend they are. Or, another way of saying it, the correlations vanish, in this sense, as n gets large. Yeah, that's right. Okay, that's great. So then it's kind of an exercise to, actually, I don't know what the next step is: do you want to revisit the RMT proof from this master theorem, or do you want to go to nonlinearities now? I can just mention that if you instantiate the things I skipped over, then the expectations you calculate using the Z random variables are the same as the ones you would calculate by more conventional means, so that gives you a proof of the semicircle law, and then move on. Okay, maybe we can just state it in words. Let me erase what I wrote. Basically, it sounds like a corollary: you can compute moments from random matrix theory via the formulas for computing the Z's, because the left-hand side, based on what we just talked about on the previous slide, is exactly the RMT moment you want to compute, just by a change of variables. Here phi is a function of k+1 variables, but in this case it's actually a function of only the first variable and the last variable; it can be independent of all the intermediate ones, right? That's right, that's legitimate. So you choose phi to be the product of the x^0 and x^k coordinates, the inner product once you average, and that gives you the RMT moment you care about; then you just have to understand what the right-hand side is saying, and part of the tensor programs theory is that there's a way of computing the Z's, so you compute that expectation and you get the answer you expect. Right. So just to make it really clear, this is how the tensor programs master theorem rederives the random matrix theory moment computations.
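(As a concrete check of the master theorem in the simplest possible program, x^0 = v and x^1 = A v with A a symmetric Gaussian matrix independent of v, the associated Z's work out to be independent standard Gaussians in this one-step case; this is the kind of fact the omitted calculus delivers. So the coordinate-wise average of any reasonable test function should approach the corresponding Gaussian expectation. The particular phi below is an arbitrary illustrative choice.)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
G = rng.normal(0, 1.0 / np.sqrt(n), size=(n, n))
A = np.triu(G) + np.triu(G, 1).T        # symmetric, entries with variance 1/n
v = rng.normal(size=n)                  # x^0 = v

x0, x1 = v, A @ v                       # x^1 = A v

def phi(a, b):                          # an arbitrary smooth test function of two coordinates
    return np.cos(a) * b ** 2

lhs = np.mean(phi(x0, x1))              # (1/n) * sum_alpha phi(x^0_alpha, x^1_alpha)

# master-theorem prediction: E[phi(Z0, Z1)] with Z0, Z1 independent N(0,1)
Z = rng.normal(size=(2, 1_000_000))
rhs = np.mean(phi(Z[0], Z[1]))
print(lhs, rhs)                         # both ~ exp(-1/2) ~ 0.6065
```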
So far we've only illustrated the case where you obtain the x's from repeated application of a matrix. Can you now state what a tensor program is in a more general setup, so that we can see how it applies to neural networks? Right, I can write down a more general formulation. Roughly speaking, in the most basic version of a tensor program you have some initial objects. The initial objects include some vectors in R^n; for example, in our previous example, v is an initial vector, and you should expect these to be sampled like standard Gaussians, so by definition these vectors have iid entries. You also have matrices, say W, which are n by n (capital N, I guess, to be consistent with the notation), and you should expect these to have Gaussian-like entries with variance 1/n and mean zero. So that's the most basic version; there's a more advanced version with a bit more data, but essentially, given these objects, you can create new vectors via two instructions. One is matrix multiplication, MatMul: if you have W, and you have x, where x can be either an initial vector or a vector that has been generated using these instructions, then you can form W times x, which is also a vector in R^n. The other instruction is Nonlin: suppose you have a bunch of vectors, again either initial vectors or vectors you've generated previously using one of these two instructions, say x^1 through x^k; then you can generate phi of (x^1, ..., x^k), where, using deep learning notation, phi has signature R^k to R, and the meaning of this notation is that if we define y to be the result, then each entry y_alpha is defined to be phi of (x^1_alpha, ..., x^k_alpha). So phi is applied entrywise across the vectors. In machine learning, k is typically one, because you apply the nonlinearity just to a single layer, but you're allowing a nonlinearity to act across multiple vectors, essentially, though always coordinate-wise. Yeah. When people say "entrywise," most people think of ReLU or the hyperbolic tangent, something like that, but you can have more general things, and in fact people do use them. If you know about batch normalization, it's a multivariate nonlinearity in some sense, where x^1 through x^k are the activations from different elements of the batch, 1 through k being the batch indices, for example; you can think of batch normalization this way. So that's just to say people do use these more general kinds of entrywise nonlinearities. So again, you have some initial objects, vectors and matrices, and then you can generate new vectors using one of these two instructions, and that's pretty much it for tensor programs, essentially just these two instructions.
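(Here is a purely illustrative sketch of that framework in code. The function names init_vector, init_matrix, matmul, and nonlin are hypothetical labels of mine, not anything from the papers; the point is only that a program is built from standard-Gaussian initial vectors, variance-1/n initial matrices, and the two instructions, here applied to an iterated MatMul chain like the earlier x^i = A x^{i-1} example.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# initial objects: standard-Gaussian vectors and matrices with N(0, 1/n) entries
def init_vector():
    return rng.normal(size=n)

def init_matrix():
    return rng.normal(0, 1.0 / np.sqrt(n), size=(n, n))

# the two instructions for producing new program vectors
def matmul(W, x):
    return W @ x                        # MatMul: matrix times program vector

def nonlin(phi, *xs):
    return phi(*xs)                     # Nonlin: phi applied coordinate-wise across vectors

# example program: x^1 = A v, x^2 = A x^1, ...
A = init_matrix()
v = init_vector()
xs = [v]
for _ in range(4):
    xs.append(matmul(A, xs[-1]))

# a Nonlin step acting on two vectors at once, coordinate-wise
y = nonlin(lambda a, b: np.tanh(a + b), xs[0], xs[1])
print(len(xs), y.shape)                 # 5 program vectors so far; y is another vector in R^n
```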
And what we had before, if you go back to the previous slide about the master theorem, applies exactly as before, where x^0 through x^k are now all the vectors in the program: x^0, for example, can be an initial vector and the rest can be generated, which is exactly the situation in the semicircle law example we just did. The master theorem holds exactly as written when x^0 through x^k are all the vectors in any program. And again, we've omitted the exact way the Z^{x^i} are constructed, and of course that detail is the key behind any actual calculation you'd do, but the spirit of this is: when you multiply a matrix by a vector, the resulting vector will typically have some correlation between its entries, but the master theorem says you can always reason about them as if they had iid entries, as the size goes to infinity. That's the spirit of the theorem. The key thing you really need to elucidate when you actually do a computation is the construction of these Z's, and, relatedly, how you compute the new Z's when you produce a new vector from the old ones. I see, okay. So we've just described a much more general tensor program, where we allow these nonlinearities, and the master theorem, if we go back one slide, applies to this as well. Although, actually, I guess we have some redundant notation: there's the phi for the test function in the master theorem, and then there's the phi for the nonlinearity; maybe we should call one of them a different letter? I mean, phi is just a generic symbol. Okay, fine. So there's the phi of the master theorem, which is your choice of function by which to evaluate things, and then there's the phi of slide 15, which is the symbol for a nonlinear map that you can apply at any step when you use the Nonlin rule. The point I was trying to make is that once we have these rules for generating sequences of vectors, the master theorem on the previous slide applies, and you get the same asymptotic iid statement. Yes, right. Okay, so I think, to round out our discussion, because we've been talking for a while now, why don't we specialize to the case of a feedforward neural network, which is a particular kind of tensor program, and then we can end with one of your more recent applications of tensor programs, the so-called abc-parameterizations, where tensor programs are used to show that there's a family of very interesting limits of neural networks, and in particular of the dynamics of neural networks. Does that sound like a plan? Yeah, sounds good. Before that, let me just briefly remark that there's a slight discrepancy between what I've written here for tensor programs and the semicircle law application, where we assumed the matrix was symmetric. I just want to note that this is easy to take into account: you can write a symmetric matrix as W plus W transpose, where W itself is not symmetric, so you can always symmetrize, and you have to take the symmetrization into account when you express the program, but everything works.
And for those who are going to look at your papers: your first paper had tensor programs with just these W matrices, and your follow-up work included transposes, right, because W transpose is not independent of W, so you had to extend your framework to handle that. But those are the details of your papers. Right, so in particular, here I can do either W times x or W transpose times x, and the W can be reused. Sure. And this is where the power comes from, because in a lot of computations, especially those involved in neural networks, and also in the semicircle case, you keep reusing the same matrix over and over again, and that creates a lot of correlation between the vectors. Keeping track of those correlations is a very complex task if you don't know how to approach it, but the master theorem, and I've omitted a lot of details, essentially allows you to do it in a mechanical way, which makes this really complicated mess very clean. But just to double-check: in your first paper, I thought every time you applied a W it had to be a fresh new iid one, and it was later that you allowed reuse; is that right? No, in the first paper reuse was already allowed; you just cannot do transposes. You can use W or W transpose consistently, but you cannot use both. That's the first paper, and it makes things a bit cleaner in how you phrase the result, and for the Gaussian process applications that's usually all you need. When you add transposes there are a lot of complications, but it's more powerful and allows you to do more things. Okay, just wanted to clarify. So let's now fast-forward to neural networks and abc. Yeah. So first of all, I just want to write down what a neural network is and make it clear that this framework can indeed express the kinds of computations in deep learning that people care about. Usually in deep learning you have a neural network function, which I'll call f, and I typically use xi (the Greek letter) for the input of the neural network; different people have different choices, but such is a mathematician. It's a bit uncommon in deep learning, but it is what it is; people should be happy to expand their horizon of Greek letters. So a typical neural network is just a composition of linear maps and entrywise nonlinear functions. In this case, a simple three-layer neural network can be written as f(xi) = v transpose phi(W phi(U xi)). If xi is in R^d, then U is n by d, W is n by n, and v, for a scalar-output function, is n by 1, since I'm taking its transpose. Can you write it like that, because you want the large dimension n to be on the inside? Yeah. And just to recall, for the mathematicians, phi is applied entrywise. Okay, so I want to at least make the point that this neural network computation can be expressed as a program, and just for simplicity I'm going to assume the input dimension d is equal to one; the general case follows pretty straightforwardly, but for a first pass it helps to simplify.
So in this case xi is just a scalar. Let's just back up: this is what you'd call a three-layer feedforward neural network, also called an MLP, a multilayer perceptron, right? Yeah. Okay, let me just write that down so people have the lingo: this is a feedforward, or MLP, neural network. All right, go ahead. So, to express the forward computation of this function, we can do the following. First of all, in the program you need to specify the initial objects: the initial vectors are u and v, which in our case are literally in R^n (I guess we use big N here), and then there's the initial matrix W, which is n by n. Sorry, don't you mean U times xi? Oh no, in this computation xi is a constant. I see; so you distinguish between vectors and matrices: a matrix has to be n by n, whereas a vector is n by something that doesn't depend on n, like n by d or n by 1, and that's why u and v count as vectors. Yeah, you can think of it like that. It doesn't matter a whole lot here, because xi is not random; we're going to fix xi. Sure, I'm just saying u and v are matrices as well, in a sense; it's just that you distinguish vectors from matrices because a matrix, for a tensor program, is something that's n by n. That's all. Yeah, that's right: a matrix is something with two dimensions going to infinity, a vector has one dimension going to infinity. Yes, that's all I was trying to say. Okay, good. So then, to compute: call the result of U times xi h^1, apply phi to get x^1, apply W to get h^2, apply phi to get x^2, and then applying v gives you the output of the network. I'm just going to express everything up to x^2. The contraction with v you can handle in multiple ways: you can express it in a slightly more general tensor program, or, in this case, you can just express it as part of the master theorem, because contracting v with x^2 is a sum of products of entries, which can be expressed in the statement of the master theorem. So I'll just express everything up to x^2. Changing color back to white: h^1 is defined to be phi_1(u), where phi_1 is just defined to multiply each entry by xi, so h^1 is u times xi. Then x^1 is equal to phi_2(h^1), where phi_2 is literally the phi from the definition of the neural network, applied entrywise, as before. So these are all Nonlin operations; h^1, via phi_1, was actually linear, but there's no need for it to be. In what sense, why are you calling them phi_1 and phi_2; what are the subscripts on the phi? They're the nonlinearities applied in the Nonlin instruction. Remember we have two instructions for generating new vectors; go back to page 15, there's a phi there on page 15,
if you can see where I'm scribbling. Yeah, but why are you calling them phi sub 1 and phi sub 2? They're just different nonlinearities I'm defining, put into the form of the Nonlin instruction. Oh, I see what you're saying; phi_1 is just multiplication by xi and phi_2 is just the application of phi. Got it, okay. So, continuing on to define h^2: now h^2 is W times x^1, and this is the first time we're using the MatMul instruction, because W, as we specified, is a matrix. And finally, x^2 is equal to phi(h^2), which is again a Nonlin operation. So it's fairly obvious that the things typically occurring in neural networks can be expressed this way. For people who are more versed in deep learning, more familiar with advanced architectures like convolutional networks, residual networks, batch normalization, Transformers and self-attention, all of these can be expressed this way too; it just gets a bit more complicated, and you can look at the papers, where explicit examples are written down. But again, the summary and the gist is that I wanted to give an example of how neural network computations can be encapsulated in this general framework, this language. And the punchline is that once you can express all these things in this language, you can apply the master theorem on page 14, and again, I've omitted certain construction details about what the Z's are, but you can reason iteratively through the program to understand the behavior when the network becomes large. Okay, so now we can get into some more advanced topics about why knowledge of how large networks behave gives you a lot of power in practice, why we care about this, especially in the age of GPT and BERT, as neural networks get larger and larger and just get so much better, beyond people's expectations. Let me make one comment based on what you just said. What you just did here was express this MLP as a tensor program, and if you follow through the master theorem, maybe with one or two more steps because of this v transpose, I think what you're going to get is that f is a Gaussian process, and this is how you get the NNGP, if you just follow your nose through the master theorem, is that right? Yes. So, roughly speaking, if you scale this last contraction in the correct way, then the resulting distribution of f, where the randomness comes from the sampling of u, W, and v, becomes asymptotically a Gaussian process. Yep. And this fact has a very long history, dating back to the 1990s with Radford Neal for simple neural networks, and around 2017 there were more works involving deeper neural networks. But every time, say, a new architecture comes out, you have to do a lot of things manually and spend a lot of time checking whether the result actually holds in that case too, so it's a very arduous process to manually prove all these cases.
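(Here is the three-layer MLP written out as exactly those program steps, a minimal sketch with tanh standing in for phi and all names being illustrative, followed by a crude empirical look at the Gaussian-process claim: sampling the output at one fixed input over many random initializations and reporting its mean and variance. The 1/sqrt(n) factor on the readout is the "correct scaling of the last contraction" just mentioned.)

```python
import numpy as np

n, xi = 400, 0.7          # width and a fixed scalar input (d = 1)
phi = np.tanh

def network_output(seed):
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                                  # initial vector
    v = rng.normal(size=n)                                  # initial vector
    W = rng.normal(0, 1.0 / np.sqrt(n), size=(n, n))        # initial matrix

    h1 = u * xi               # Nonlin with phi_1: multiply each entry by xi
    x1 = phi(h1)              # Nonlin with phi_2: entrywise phi
    h2 = W @ x1               # MatMul
    x2 = phi(h2)              # Nonlin
    return (v @ x2) / np.sqrt(n)   # readout, scaled so the output stays O(1)

samples = np.array([network_output(s) for s in range(1000)])
print(samples.mean(), samples.var())   # mean ~ 0; the variance is the NNGP kernel value at (xi, xi)
```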
Actually, if I recall, at least for MLPs, and maybe even for convolutional networks, it's not that arduous, in the sense that if you look at those proofs, what they implicitly assume is that you let the layers go to infinity one by one, from input to output, so you're basically applying the central limit theorem successively, and things aren't too hard. What's difficult is letting all the layers go to infinity at the same rate; that's the difficult part, and that's where your work fills in the gap, essentially, right? Yeah, and also, for example, when you look at RNNs, recurrent neural networks, the weights are tied across time steps, so it doesn't make any sense to let each "layer" go to infinity one at a time, because they're the same matrix. I see. So when you're doing a lot of weight sharing, that's not a feasible thing to do. And also, just to remark, going further than this: so far we've just been talking about how, for different architectures, you can express the architecture in this kind of computation. But the much more powerful thing to do, which we'll do in a second, is to look at the entire computation graph of training. So not just expressing a single forward pass of the architecture, but unrolling gradient descent into one very big computation graph and then understanding the behavior of the network after training. This is really one of the most powerful things you can do with tensor programs, because you can unroll any kind of iterative computation you care about into this format. And when you want to unroll, for example, gradient descent, you must encounter shared weights, because you need to use W in the forward pass and W transpose in the backward pass, and you iterate this many, many times over the course of gradient descent. That kind of weight-sharing pattern in the computation essentially prevents you from taking one layer at a time to infinity, because you're sharing so much weight within the computation. Yeah, fair enough. Okay, that aside took us maybe a little far, but let's go back: where did you want to go next? Yeah, so I can talk a bit about how this foundation lets us understand the different large-width behaviors of neural networks, and what the significance is in the age of large neural networks. I want to make an analogy with things in mathematics, in case mathematicians are watching this video. Traditionally, you might think a neural network has a single behavior when you take the width to infinity, and usually that's something associated with a kernel or a Gaussian process, these kinds of things. But it turns out this is not true: there is no single behavior for large networks. In fact, there's a very large space of possibilities for how large networks behave, depending on how you take certain hyperparameters,
that is, how you scale those hyperparameters as the width becomes large. This is kind of similar to certain things we know from mathematics. As an example, if you come from a more algebraic background, you know that while the rational numbers have a very natural completion in the real numbers, there are in fact different completions of the rationals into other fields, namely the p-adic fields for p any prime number. A very powerful result in this area is the classification of all fields that are completions of the rationals, which turn out to be essentially the reals plus all the p-adic fields. From there we have very powerful local-global results, where you can relate the solvability of certain equations over the rationals to their simultaneous solvability over the reals and all the p-adics, all the possible completions. And this is a very powerful basis from which, for example, the Langlands program extends. Just like this, when we look at infinite-width neural networks, there are different ways of taking those limits, just as there are different ways to complete the rational numbers, and in a sense, understanding the behavior of a finite neural network is roughly equivalent to understanding the behavior of the different limits of infinite-width networks. From this perspective it becomes very natural to ask: what are all the possible infinite-width limits of these large networks? This is a very different perspective from the more traditional one, which says that when you take the width to infinity some kind of kernel behavior happens, and because the kernel behavior is bad, we don't like infinite-width networks. In fact, there are different behaviors, and as we'll see, there's one behavior we really, really like, which gives us some really powerful technology in the context of huge neural networks like GPT. Right. So now let me set the stage for how we view the space of different limits. For each of these parameter tensors, there are two numbers you need to specify, at the very least, to specify a training procedure. One is the so-called initialization variance: supposing that you sample these tensors with iid Gaussian entries with zero mean, what is the variance you should specify? The second is the learning rate. For those viewers who haven't seen gradient descent before: it's an iterative algorithm which just says that if you want some loss, some function of the neural network f, to become smaller and smaller, then you should update the parameters like so: u gets updated to u minus the learning rate eta (maybe let me just write "learning rate" instead of eta) times the gradient of the loss with respect to u, and likewise for the other parameters. There's a lot of literature on how to set this learning rate; in the convex
literature it's a very old topic, but there's much less understanding of it, at least of learning rates, in the non-convex setting where you're optimizing neural networks. In any case, this learning rate is something we need to set for each of u, W, and v. So when we talk about large-n limits, actually, let me back up and introduce some notation, which is quite standard, so it's easier to talk about. Let theta be the set of all parameters; in this case theta is (u, W, v), the neural network parameters. So we have a random neural network function that depends on this random parameter theta, and for each loss function L, where L is the loss you're trying to minimize, that gives you gradient descent dynamics. In other words, you can run this gradient descent step by step, and that gives you a discrete sequence of neural networks obtained by updating sequentially under this gradient update rule. And the question you're trying to ask, we haven't seen the question yet, is: what kinds of scaling limits exist for the dynamics of f? By the dynamics of f, I mean it's not just that the limiting f exists at initialization, but that the entire gradient descent trajectory must also have a limit, right? Yeah, that's right. Okay, great. And the different behaviors of these scaling limits depend on how you scale the initialization variance and the learning rate for each of u, W, and v, right? Right. So for example, one way you can scale things is to say: I want the initialization variance of v to be, let me erase this and abbreviate so things line up, n to the negative one, and n to the negative one for W, and n to the negative one for u, and the learning rate to be just one. And just to clarify, I only mean how these things should scale with n; I don't mean setting them to exactly those values. The initialization variance should scale like 1/n, and there can be constants in front. So when you specify such a scaling relation, in powers of n, for these hyperparameters for u, W, and v, any such choice specifies one way to take the limit as n goes to infinity, and that gives you one set of behaviors. It will turn out, for example, that when you specify these particular hyperparameters, the network will actually blow up to infinity after one step of training. So in general, or just to elaborate on "after one step": basically, even if the limit exists at time zero, the limit does not exist at time equals one. Yeah, that's right. So in general, in this particular case, there are six numbers you need to specify: using the notation from the abc-parameterization, I'll write the initialization variance scaling as n to the negative b and the learning rate scaling as n to the negative c, so there are six numbers, a b and a c for each of u, W, and v.
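(To make those six numbers concrete, here is a minimal, purely illustrative sketch; the names, the tanh nonlinearity, and the squared-loss choice are mine, not from the paper. Each weight tensor is initialized with variance n^(-b) and paired with its own learning rate n^(-c), and one gradient step is taken with those per-tensor learning rates.)

```python
import numpy as np

def init(n, b, rng):
    """Sample (u, W, v) with initialization variances n^{-b['u']}, n^{-b['w']}, n^{-b['v']}."""
    return {
        "u": rng.normal(0, n ** (-b["u"] / 2), size=n),
        "w": rng.normal(0, n ** (-b["w"] / 2), size=(n, n)),
        "v": rng.normal(0, n ** (-b["v"] / 2), size=n),
    }

def forward(p, xi):
    h1 = p["u"] * xi
    x1 = np.tanh(h1)
    h2 = p["w"] @ x1
    x2 = np.tanh(h2)
    return p["v"] @ x2, (h1, x1, h2, x2)

def gd_step(p, n, c, xi, y):
    """One gradient-descent step on the loss 0.5 * (f - y)^2, learning rate n^{-c[k]} per tensor."""
    f, (h1, x1, h2, x2) = forward(p, xi)
    df = f - y
    dv = df * x2
    dh2 = df * p["v"] * (1 - x2 ** 2)       # backprop through the last tanh
    dW = np.outer(dh2, x1)
    dh1 = (p["w"].T @ dh2) * (1 - x1 ** 2)  # backprop through the first tanh
    du = dh1 * xi
    p["u"] -= n ** (-c["u"]) * du
    p["w"] -= n ** (-c["w"]) * dW
    p["v"] -= n ** (-c["v"]) * dv
    return f

n, rng = 256, np.random.default_rng(0)
b = {"u": 1.0, "w": 1.0, "v": 1.0}   # the example above: initialization variance ~ 1/n everywhere
c = {"u": 0.0, "w": 0.0, "v": 0.0}   # and learning rate ~ 1 for every tensor
p = init(n, b, rng)
print(gd_step(p, n, c, xi=0.5, y=1.0))   # f before the step
print(forward(p, 0.5)[0])                # f after one step under this choice of b's and c's
```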
Each way of specifying those numbers gives you a set of behaviors when you take n to infinity. And to connect this to everything we talked about before: the reason we can understand this whole diverse set of behaviors as n goes to infinity is that we have the Tensor Programs machinery, where for any choice of b's and c's we can express the computation as a tensor program and take the limit. Again, it's automatic, because you have the Master Theorem, so you can just mechanically calculate what the behavior is in the infinite-n limit.

Okay, so now the question is: what does this landscape of limits look like? Again, using the analogy with the p-adics, we know the completions of the rationals consist of the p-adics for all primes p, plus the real field. What's the corresponding picture for neural networks, when we vary the b's and c's? The b's and c's here are roughly analogous to the p in the p-adics, where p ranges over the primes, except here b and c can be any real numbers. Roughly speaking, in this particular case it's a six-dimensional space, and you can actually partition that space and classify what the partition looks like.

So that's what this picture is. It's a six-dimensional space, but obviously we can't draw six dimensions, so instead I'm projecting it down to something two-dimensional using a nonlinear projection. I'm distorting things and taking some quotients, but I do want to carry across the most important features. So this is the six-dimensional space of (b_u, b_w, b_v, c_u, c_w, c_v), basically R^6, drawn in two dimensions.

The first point is that most of the space gives pretty uninteresting limits: if you throw a random dot into R^6 according to the Lebesgue measure, whatever that means here, you get one of two uninteresting behaviors. That's this full-measure background in the picture: either training blows up in the limit, or training gets stuck at initialization, in other words gradient descent doesn't change the function at all. Neither of these is interesting: if it blows up you don't get a well-defined function, and if it gets stuck at initialization you haven't learned anything from the data.

Then it turns out there's a codimension-one set (I believe it's codimension one; I forget the exact codimension) in the middle of this sea of uninteresting things, where you actually get nontrivial behavior. I've drawn it as a quadrilateral; in general it's a higher-dimensional polyhedron. The salient feature is this: all of the points in the interior, together with the points on the lower boundary here (maybe let me use a different color), so the lower boundary plus the interior, are going to be in what's called the kernel regime.
What that means is that in the limit the neural network evolves with very simple behavior, in the sense that the function f at time t of the gradient descent evolution (using the gradient descent algorithm we talked about on the last slide) evolves like f_t = f_{t-1} minus the learning rate times K applied to (f_{t-1} minus the training targets), where K is a kernel. This is the simplest case, when you use the square loss: then it's literally a linear equation, with K a linear operator acting on f. This K will change as you traverse the region I colored in: if you look at the limit determined by one choice of b's and c's, another limit determined by a second choice, and another from a third, then in general they all satisfy a linear equation of this form, assuming square loss, but the kernels can differ; K can vary between them.

For this linear equation, do we have explicit characterizations of K? So for the NTK (and the NTK limit is actually a vertex of this polyhedron), and in fact for any of these points, you can calculate exactly what the kernel is, based on the architecture of the neural network and what the b's and c's are; it's a recursive calculation based on the program structure. I see, because the NTK has this very special form of being the gradient of the function dotted with itself, but it also has a recursive structure, and you're saying the most general kernel is just defined by that recursive structure; it might not be as simple as grad f dot grad f. Right: the NTK has a very simple characterization in terms of the function itself, but in general it's something more recursive. In general it does still look something like grad f dot grad f, but depending on what the b's and c's are, you effectively zero out part of the gradient, for example. Okay, sure, I understand. Great.

So from a mathematical perspective this kernel regime is very nice, because a priori we have no idea how the function evolves under gradient descent in the non-convex setting (this is the problem all the convex optimization theorists vex over), but in this limit it becomes a convex problem, a linear problem even, and you can obtain a lot of results about the optimization trajectory of such functions in these kernel limits.
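To see the shape of that linear dynamic numerically, here is a small sketch of kernel gradient descent on a square loss with a made-up kernel Gram matrix; it illustrates the form of the equation, not any particular infinite-width kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up positive semi-definite kernel Gram matrix on 5 training points.
A = rng.normal(size=(5, 5))
K = A @ A.T / 5.0

y = rng.normal(size=5)   # training targets
f = np.zeros(5)          # function values at the training points, t = 0
lr = 0.1

for t in range(500):
    # Linear kernel dynamics under square loss:
    # f_t = f_{t-1} - lr * K (f_{t-1} - y)
    f = f - lr * K @ (f - y)

print(np.abs(f - y).max())  # the residual shrinks along directions K can reach
```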
Mathematically that all looks great. But, and this is a very, very big but, the problem is that it does not exhibit the behavior we know as feature learning. What that means is the following: if you go back a slide, to slide 17, and look at the equation for f, then the lack of feature learning roughly means that if x_2 applied to an input xi at initialization time, so without training, without seeing the data, is equal to x_2(xi) after training, then we say that no feature is learned for input xi; and "no feature learning" means this is true for every input xi. I think the more proper way of saying that, because x_2 is a vector of n coordinates and n is going to infinity, is that entrywise the change in the coordinates is little-o of 1; essentially it's going to zero. Yeah, roughly; that's the more pedantic way of saying it. Essentially, if the change in the entries of x_2 is much smaller than the entries themselves at initialization, then essentially no features are learned.

And I think this is something people might have a hard time understanding: the function f can still change even though there's no feature learning, because the vectors are getting bigger. Even though the coordinates themselves change less and less, you can still have an overall effect, because when you multiply by a large weight matrix (n by n, with n large) all those small diffs can add up to an O(1) diff at the very end of the function. So there are really two different diffs going on: whether the function f_theta(xi) changes, versus whether the individual neurons or coordinates change; and even though the neuron-level changes go to zero as n goes to infinity, the function overall still changes, just because there are so many neurons. Right, right.

And like I said, most of the space, most of the limits in this island in the middle of the picture, correspond to these kernel limits, which mathematically are very nice, because you have simplified the behavior of a complex neural network, but which have the very tragic property of not being able to learn features. I think from a pure mathematics perspective it's hard to appreciate how important this is in real settings, but all the things we love from the empirical success of deep learning today, ImageNet, BERT, GPT, all that success comes from the fact that these networks learn really, really good representations of pictures or language. This x_2 from the previous slide is the network's internal representation of, say, a picture of a cat, if xi is a picture of a cat; or xi could be a sentence, and then x_2 is an internal representation of the semantics of the sentence. If you do not have feature learning, then this representation is nonsense: it doesn't know anything about the data at all; it's the same as if you had just blindly chosen a random representation. So this is really bad behavior: no feature learning.
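One way to phrase that diagnostic in code: compare a hidden representation before and after training, coordinate by coordinate, relative to its initial scale. A hedged toy sketch, where the "trained" vector is a stand-in for an actual training run and the entrywise-small update is put in by hand:

```python
import numpy as np

def relative_feature_change(x2_init, x2_trained, eps=1e-8):
    """Typical per-coordinate change of the representation, relative to its
    per-coordinate scale at initialization. 'No feature learning' in the
    limit means this ratio goes to zero as the width n grows."""
    change = np.sqrt(np.mean((x2_trained - x2_init) ** 2))
    scale = np.sqrt(np.mean(x2_init ** 2)) + eps
    return change / scale

rng = np.random.default_rng(0)
n = 4096
x2_init = rng.normal(size=n)                   # representation at initialization
x2_trained = x2_init + rng.normal(size=n) / n  # toy entrywise-o(1) update
print(relative_feature_change(x2_init, x2_trained))  # small => kernel-like limit
```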
So if you just look at the picture like this, it sounds like infinite-width neural networks are simply bad, because almost anything you pick here does not do feature learning. But if you squint very closely, the last part I haven't talked about, the upper boundary of this polyhedron, actually does exhibit feature learning, in the sense that the x_2 from the previous slide does not get stuck at its initialization value; it does evolve. And if you squint even more closely, you'll see that the upper vertex here is maximally feature learning, in a sense that can be precisely formulated: roughly speaking, every parameter that can learn features will learn features; anything that can move away from initialization will move away from initialization.

I see. My guess, just thinking about why you drew it as a single point: the way you drew this picture, the kernel regime consists of the interior of this polytope and some of the faces, and since it's a high-dimensional polytope you have faces of many codimensions, and higher codimension means more layers are learning features; the point at which you've used up all your codimension, so that you're at a single vertex, is where all your layers are learning, and that's why it's a single point. Yeah, that's right. Okay, great, that makes sense.

I think we did get to everything you wanted to say, but I want to rewind a bit, because you made a comment I wanted to push back on and didn't want to interrupt. A lot of things in machine learning are quite poetic, philosophical, or subjective, and I think this is one of them: you said that no feature learning is bad, but I want to push back a little, because that might be too strong a statement. There are regimes in which the NTK does as well as or even better than neural networks, especially the low-data regime. So I want to qualify "bad" a bit: it's not that there's something fundamentally bad about no feature learning; it's that, in practice, the gains we've seen from very large neural networks have come because they do feature learning. On the other hand, kernel methods do well where kernel methods do well; it's just that they don't scale. If the kernel computation weren't a bottleneck, it's unclear what would happen if you fit a kernel to millions or billions of images or pieces of text; we just can't do that computation. So I want to push back in that sense.

Sure. Definitely, in low-data regimes, where essentially you need to specify some kind of inductive bias, kernels are one way to specify that bias, and they could work better than neural networks. And if kernel methods work better than neural networks there, it's potentially also true that even without feature learning a kernel limit will work better than the finite neural network, whatever that means. Though I don't really believe that, even if you could scale kernel methods and compute them, they would do better than deep learning in the large-data regime; that's my feeling, and we have fairly concrete results on this in these meta-learning kinds of settings. Yeah, there's a whole cottage industry of work showing how kernels and neural networks differ; I'm not up on that literature, but
yeah, yeah. Okay, so maybe I'll finish up by talking about how this maximal feature learning limit translates into really big gains in the empirical performance of neural networks, allowing us to do things we couldn't do before.

So let me come back to some problems with empirical deep learning. Today, the best neural networks are essentially the largest neural networks trained on a lot of data, and the advantage is so apparent (larger networks are better) that to train the best model you want to throw all your resources at training that one model. For example, when companies like OpenAI, DeepMind, or Google are confident that a method should be scaled up, they throw all their compute at training it as one model. And when you do that, you essentially have one shot to make it succeed. A lot of things can break; you've put all your eggs in one basket; your experiment can break for very mysterious reasons, because almost by definition you have never done an experiment at that scale, since you're literally throwing all your compute at it, whereas during the experimental phase you use smaller amounts of compute. So when problems occur at that large scale, it's very costly in terms of computation, GPU hours, energy, and also the man-hours spent trying to fix those problems. In particular, a lot of these problems come from things like bad choices of hyperparameters, which can make the network diverge during training for no apparent reason. The way people do it now, for things like the learning rate and the initialization scales, is essentially to guess: roughly speaking, they use something similar to what they used for smaller models. That leads to a lot of problems with these large models: things just break, or the large model ends up worse than your smaller models, in which case you've spent a lot of money for nothing.

So it's important to be able to predict: if you extrapolate your compute to extraordinary amounts, what are the right hyperparameters to use for that large model training run? It turns out that this maximal feature learning limit on page 18, which is essentially a choice of b's and c's, which themselves are recipes for how to scale your hyperparameters as the model size changes, has the very useful property that, as you scale your neural network larger and larger, if you follow the recipe given by those b's and c's, then you can start with a set of hyperparameters (learning rates and initializations) that are optimal for a small model and scale them according to the b's and c's. For example, if a b is one, meaning inverse scaling in n, then you halve that number whenever you double the width. If you scale them like that, you're guaranteed that things stay approximately optimal as n goes to infinity. So no matter how big you scale your model, you have some kind of guarantee that you're never going to be far away from the optimal hyperparameters you would get if you tuned all the hyperparameters directly on the large model.
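The practical recipe, very roughly: tune the hyperparameters on a small-width proxy model, then carry the tuned values over to the large model by rescaling them with the width according to the chosen b's and c's. Here is a hedged sketch of that bookkeeping; the function name, values, and exponents are illustrative placeholders, not the actual muP scaling table:

```python
def transfer_hyperparams(tuned, base_width, target_width, exponents):
    """Rescale hyperparameters tuned at base_width to target_width.
    exponents[name] is the power of width the hyperparameter scales with;
    e.g. -1.0 means 'halve it when you double the width'."""
    ratio = target_width / base_width
    return {name: value * ratio ** exponents[name]
            for name, value in tuned.items()}

# Tuned on a width-256 proxy model (all numbers are made up for illustration):
tuned = {"lr_w": 3e-3, "init_std_w": 2e-2}
exponents = {"lr_w": -1.0, "init_std_w": -0.5}
large = transfer_hyperparams(tuned, base_width=256, target_width=8192,
                             exponents=exponents)
print(large)
```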
Sure, I mean the point is this: having a limit is a stability result; it's saying that as you let n get large, something is converging to something. And this maximal update, right, the maximal update parametrization, a.k.a. muP, which is your fifth and most recent paper: this maximal feature learning, maximal update limit is such that, because of the stability guarantees of this convergence result, you're able to transfer hyperparameters in a theoretically grounded way. Yeah, exactly. And it's the unique one that can do this in the general case, which, if you think about it, is kind of obvious, because the optimal hyperparameters scale in only one way: you figure out how the optimal hyperparameters scale, and this is the one parametrization that matches it, so no other way of scaling can do it.

So if you tried to use any of these other green points, why can't you transfer hyperparameters? What I'm saying is just that if anything can do it, if there exists an optimal way of scaling the hyperparameters, then no other way of scaling is optimal, essentially by definition, because there's only one way to do it. Sorry, I'm not following: all these green points have infinite-width limits, they have some feature learning, can't you also transfer there? No, they can't, because, okay, the concrete answer is something like this: any green point other than the muP point essentially differs from muP in that some effective learning rate goes to zero as the width goes to infinity. But the optimal value for that learning rate is probably not actually zero in the limit, so to compensate, in that parametrization your nominal learning rate would have to go to infinity, to make up for the fact that it goes to zero in that parametrization. I see, I see. So basically the point is that you want both the feature learning and the hyperparameters to converge; if one goes to zero the other has to go to infinity, and you don't want that situation. Yeah. Really the simple thing I want to say is: if you know the optimal hyperparameters scale one way, and you then choose a parametrization in which a fixed hyperparameter corresponds to something going to infinity or to zero, then obviously that's not the right way of doing it. Does that make sense? There can only be one way to scale: if you know a particular scaling is correct, that it preserves optimality, then no other scaling can do the same thing; it's unique in this property. Sure, sure, optimality is unique. Right. I was just thinking of something really dumb, which is that if you try to do the law of large numbers with n squared instead of n, then your learning rate, which could be thought of as the coefficient in front of that quotient, could
blow up by being of order n, so that the combination comes back to order one, but that doesn't count; that's cheating, right? Right, yeah, you don't want that. Okay, I think we understand each other. Anyway, I think this is probably a good place to stop; we've talked for quite a while now. This was a lot of fun, Greg, thank you so much for your time. This is very deep and beautiful mathematics, and part of the reason I wanted you on the podcast is that, honestly, while your work is certainly recognized, I think it deserves to be known more widely. I don't know how many people in the pure math community have come across it; of course, you're an AI researcher in industry, you don't even have a PhD, so maybe academics haven't picked it up. But I hope it gets developed further and comes to be appreciated by both mathematicians and machine learning practitioners. Yeah, for sure, for sure. I appreciate the opportunity to talk it out with you. Right, thank you. All right, yeah, thanks, man.
Info
Channel: Timothy Nguyen
Views: 39,105
Id: 1aXOXHA7Jcw
Length: 181min 27sec (10887 seconds)
Published: Wed Jan 04 2023