Approximation of the entries of a random orthogonal matrix... - Kathryn Lockwood

Captions
I'm interested in random matrices, in particular random orthogonal matrices. I figured I'd start by giving the necessary background, state some classical results, then get to some more recent results, and hopefully prove some things.

First, let's write down what we mean by an orthogonal matrix. An n x n matrix U over R is orthogonal if U U^T = U^T U = I_n, the n x n identity matrix. The set of all n x n orthogonal matrices over R is the orthogonal group, and we denote it by a bold O(n) to indicate the dimension. A few facts about the orthogonal group will come in handy. First, the columns of a matrix U in O(n) form an orthonormal basis for R^n. Second, an orthogonal matrix acts as a linear isometry on R^n; that is, if U is in O(n), then <Uv, Uw> = <v, w> for all v, w in R^n. Lastly, the orthogonal group is a compact Lie group: it is a group as well as a compact manifold.

So here is our group of orthogonal matrices, and we'd like to know what it means to take a random element from this group. Here's what we would expect: if U is a uniform random element of the group and we fix another matrix M in O(n), then we'd like the following equalities in distribution: MU, UM, and U should all have the same distribution. What this says is that the distribution of our random matrix U is a translation-invariant probability measure: if mu is that measure, then for a fixed M in the orthogonal group and any measurable subset A of the orthogonal group, mu(MA) = mu(AM) = mu(A). And there is only one measure that does this. This is a theorem: there is a unique translation-invariant probability measure on the orthogonal group, and we call it Haar measure.

[Audience: so you're saying this is what you'd expect a uniform random element to do?] Exactly. This is what I expect, and in fact the measure that does it is Haar measure. So when I take a uniform random element, I am exactly choosing an orthogonal matrix distributed according to Haar measure; that's what it means to take a random element from the group.

I'd like to describe Haar measure so we understand it better, and I'll go through two constructions (there are many). The first construction is a column-by-column approach: I start with an empty matrix and fill it in a single column at a time. Choose the first column u_1 uniformly from the unit sphere S^{n-1} in R^n, so ||u_1|| = 1. To choose the second column, choose u_2 uniformly according to surface area measure on the set {x in R^n : ||x|| = 1, <x, u_1> = 0}, the intersection of the unit sphere with the hyperplane orthogonal to u_1. That gives the second column. Continue this process, choosing each column uniformly according to surface area measure from the unit sphere of all vectors orthogonal to the previously chosen columns. At the end of this process I have a matrix U with these columns, and U is orthogonal by construction.
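As a concrete illustration of the column-by-column construction, here is a minimal numerical sketch (not from the talk). It uses the standard fact that projecting a standard Gaussian vector onto a subspace and normalizing gives a uniform point on that subspace's unit sphere; the function name is just an illustrative choice.

```python
import numpy as np

def haar_orthogonal_column_by_column(n, rng=None):
    """Sample an n x n orthogonal matrix column by column.

    Each new column is uniform on the unit sphere of the orthogonal
    complement of the previously chosen columns, mirroring the
    construction described above.
    """
    rng = np.random.default_rng(rng)
    U = np.zeros((n, n))
    for k in range(n):
        v = rng.standard_normal(n)          # rotation-invariant direction
        v -= U[:, :k] @ (U[:, :k].T @ v)    # project onto the orthogonal complement
        U[:, k] = v / np.linalg.norm(v)     # normalize: uniform on that unit sphere
    return U

if __name__ == "__main__":
    U = haar_orthogonal_column_by_column(5, rng=0)
    print(np.allclose(U.T @ U, np.eye(5)))  # True: the columns are orthonormal
```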
To show that U is Haar distributed, I just need to see that its distribution is translation invariant; that is, I need to check that multiplying by a matrix M in the orthogonal group doesn't change anything. So fix M in O(n). Then MU is just M times the matrix that's been constructed, and it has columns Mu_1, ..., Mu_n. I can think of MU as being built column by column as well: the first column Mu_1 is obtained by choosing u_1 uniformly from the sphere as before and then multiplying by M. By the second fact above, M is orthogonal, so it acts as a linear isometry and therefore preserves surface area measure on the sphere. And if I look at the inner product of the second column Mu_2 with Mu_1, again because M is a linear isometry this is just <u_2, u_1>, which was zero by construction. So the columns of MU are again orthogonal and built by the same recipe, and continuing this argument, MU has the same distribution as U. So I have the translation invariance I needed, and hence Haar measure. One last thing to note: I only handled multiplication on the left. That's where the compact Lie group comes in, because on a compact Lie group left translation invariance implies right translation invariance and vice versa, so it is enough to check multiplication on one side.

In this construction I took an orthogonal matrix and then showed its distribution was translation invariant. I'd also like a construction where I start with a matrix that I know has a translation-invariant distribution and then orthogonalize it instead. For that I need a definition, the Frobenius norm. If A and B are n x n matrices over R, the Hilbert-Schmidt inner product is <A, B>_HS = Tr(A B^T), and the induced norm, ||A||_HS = sqrt(<A, A>_HS), is the Frobenius norm, sometimes called the Hilbert-Schmidt norm.
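The linear-isometry fact in the Hilbert-Schmidt setting, which gets used in the next construction, is easy to check numerically. A minimal sketch (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# Any orthogonal M works for this check; here one is produced via QR.
M, _ = np.linalg.qr(rng.standard_normal((n, n)))

A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

hs = lambda X, Y: np.trace(X @ Y.T)   # Hilbert-Schmidt inner product <X, Y>_HS

# Multiplication by an orthogonal matrix preserves <.,.>_HS, hence the Frobenius norm.
print(np.isclose(hs(M @ A, M @ B), hs(A, B)))                              # True
print(np.isclose(np.linalg.norm(M @ A, 'fro'), np.linalg.norm(A, 'fro')))  # True
```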
Now for the second construction of Haar measure, a Gaussian Gram-Schmidt approach. The idea: again I take an empty matrix, and this time I fill it with i.i.d. standard Gaussian random variables. So I fill an n x n matrix X with independent, identically distributed standard Gaussians, denoted {x_ij}. It's pretty fast to check that this matrix has a translation-invariant distribution. We know the density of X with respect to the product measure prod dx_ij: it is (2 pi)^{-n^2/2} exp(-||x||_HS^2 / 2), where ||.||_HS is the Frobenius norm. Now check that multiplying by a matrix M in O(n) leaves the distribution unchanged. If Y = MX, the joint density of the entries of Y is |det M^{-1}| (2 pi)^{-n^2/2} exp(-||M^{-1} y||_HS^2 / 2). Because M is orthogonal, |det M^{-1}| = 1, and because M^{-1} acts as a linear isometry it doesn't change the Hilbert-Schmidt norm, so this is exactly the same as the density of the entries of X, and Y = MX has the same distribution as X.

But I don't yet have Haar measure, because this matrix is not orthogonal. So I do what we always do when we want to orthogonalize: I use the Gram-Schmidt algorithm. All I need to check now is that in applying Gram-Schmidt I'm not losing the translation invariance that I really need, and the way to check that is to show that multiplication by an orthogonal matrix commutes with applying the Gram-Schmidt algorithm. Let x_i denote the columns of X. First apply Gram-Schmidt and then multiply by M: for instance, in the first step I replace the second column x_2 with x_2 - <x_1, x_2> x_1, and if I multiply afterwards by M I end up with a second column M x_2 - <x_1, x_2> M x_1, and similarly at all other steps of the algorithm. Now do this in the reverse order: if instead I multiply by M and then apply Gram-Schmidt, my matrix now has columns M x_i, so at the same step of the algorithm I replace the second column M x_2 with M x_2 - <M x_1, M x_2> M x_1. Again because M acts as a linear isometry, I can drop the M's inside the inner product. The key is that I end up with exactly the same matrix in the end, so I haven't lost the translation invariance of the distribution.

So this gives us an idea of what we mean by a uniform random orthogonal matrix: we mean something distributed according to Haar measure, which we can construct in either of these ways.
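The Gaussian Gram-Schmidt construction translates almost directly into code. Below is a minimal sketch (not from the talk): numpy's QR factorization uses Householder reflections rather than classical Gram-Schmidt, but after flipping signs so that the diagonal of R is positive, the orthogonal factor agrees with what Gram-Schmidt applied to the Gaussian matrix would produce.

```python
import numpy as np

def haar_orthogonal_qr(n, rng=None):
    """Sample a Haar-distributed n x n orthogonal matrix via Gaussian + QR.

    X is filled with i.i.d. standard Gaussians; orthogonalizing its columns
    (Gram-Schmidt, realized here through a QR factorization plus a sign
    correction) yields a Haar-distributed orthogonal matrix.
    """
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(X)
    Q = Q * np.sign(np.diag(R))   # force diag(R) > 0 so Q matches Gram-Schmidt
    return Q

if __name__ == "__main__":
    U = haar_orthogonal_qr(4, rng=0)
    print(np.allclose(U @ U.T, np.eye(4)))  # True: U is orthogonal
```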
Alright, good. So the next thing we can ask is: how can we approximate the entries of such a Haar-distributed orthogonal matrix? First, the asymptotic distribution of the individual entries of such a matrix. This is a classical result, due to Borel in 1906. Let X = (X_1, ..., X_n) be a uniform random point on the sphere S^{n-1}. Then the probability that sqrt(n) times the first coordinate of X is at most t converges, as n goes to infinity, to (1/sqrt(2 pi)) times the integral from minus infinity to t of e^{-x^2/2} dx. That is, sqrt(n) times the first coordinate of a random point on the sphere converges weakly to a standard Gaussian random variable as n goes to infinity. As an immediate corollary, if for each n we let U_n be a random orthogonal matrix, then the sequence sqrt(n) [U_n]_{1,1} of rescaled (1,1) entries converges weakly to the Gaussian distribution as n goes to infinity. This follows from the first construction of Haar measure, where we chose the first column uniformly from the unit sphere: Borel's result on the sphere gives the result for the matrix. So the individual entries of a random orthogonal matrix are essentially Gaussian.
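A quick way to see Borel's theorem numerically (a minimal sketch, not from the talk): sample many Haar orthogonal matrices with a QR-based sampler like the one above, rescale the (1,1) entry by sqrt(n), and compare the empirical distribution with the standard normal CDF.

```python
import math
import numpy as np

def haar_orthogonal(n, rng):
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(2)
n, samples = 50, 5000
scaled = np.array([math.sqrt(n) * haar_orthogonal(n, rng)[0, 0]
                   for _ in range(samples)])

# Compare the empirical CDF of sqrt(n) * u_11 with the standard normal CDF.
for t in (-1.0, 0.0, 1.0):
    empirical = np.mean(scaled <= t)
    gaussian = 0.5 * (1 + math.erf(t / math.sqrt(2)))
    print(f"t={t:+.1f}  empirical={empirical:.3f}  Phi(t)={gaussian:.3f}")
```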
We'd like to extend this result to higher dimensions, and to do so we need a way of quantifying the distance between two probability measures. The notion of distance I'll use is total variation. If mu and nu are Borel probability measures on some metric space X, the total variation distance between mu and nu, denoted ||mu - nu||_TV, is defined to be 2 times the supremum of |mu(A) - nu(A)|, where the supremum is taken over all Borel measurable subsets A of X.

Using this notion of distance, Diaconis and Freedman greatly strengthened Borel's result about 80 years later. Again take X to be a uniform random vector on the rescaled sphere sqrt(n) S^{n-1}, let n be at least 5 and 1 <= k <= n - 4, and let Z be a standard Gaussian random vector in R^k. Then the total variation distance between the distribution of the first k coordinates of X and the distribution of Z is at most 2(k + 3)/(n - k - 3). What this gives us is that we can simultaneously approximate k coordinates of our point on the sphere by k i.i.d. Gaussian random variables, in total variation distance, as long as k is o(n). In the matrix context, it says that as long as k = o(n), k entries from the same row or column of a random U in the orthogonal group can be simultaneously approximated, again in total variation distance, by Gaussians.

This led Diaconis to ask: how many entries can we really approximate simultaneously? Diaconis himself, along with Eaton and Lauritzen, provided a partial answer in the square case, but Jiang gave the big result there. What Jiang showed is that a p_n by q_n principal submatrix of a random orthogonal matrix is close in total variation distance to a Gaussian matrix when the dimensions p_n and q_n are both o(sqrt(n)); so I can take an o(sqrt(n)) by o(sqrt(n)) submatrix and approximate it simultaneously. And he showed this is sharp, in the sense that if x, y > 0 and p_n is asymptotically x sqrt(n) and q_n is asymptotically y sqrt(n), then the total variation distance does not go to zero. So in fact that is the maximal square submatrix for which we can do the approximation.

Putting Jiang's result on square o(sqrt(n)) by o(sqrt(n)) blocks together with the original Diaconis-Freedman result on o(n) entries from a single row or column indicates that I should be able to take a submatrix whose dimensions have product o(n). This is the new result, due to myself and my adviser Elizabeth Meckes. Let Z_n be the p_n by q_n upper-left block of a random matrix U uniformly distributed on the orthogonal group, that is, distributed according to Haar measure, and let X_n be a p_n by q_n matrix of i.i.d. standard Gaussians. Then as long as the product of the dimensions p_n q_n is o(n), the total variation distance between the joint distribution of the entries of sqrt(n) Z_n and the joint distribution of the entries of X_n tends to zero as n goes to infinity. This is a nice result in that it recovers Jiang's result, the Diaconis-Freedman result, and all the intermediate cases. Notice that if q_n / p_n tends to some finite eta, then we're back in Jiang's case: p_n and q_n must both be o(sqrt(n)), there's not a whole lot of new information, and we can essentially recreate his proof. The interesting case is when q_n / p_n tends to zero, so p_n is going to infinity (it could be almost like n) and q_n is staying bounded. What you end up needing there is to look at covariances of traces of powers of Wishart matrices, which I'll do in a second. In Jiang's case you only need to go up to powers one or two, which are easy to compute by hand, but as soon as you let the powers of these matrices grow with n they get very difficult to compute, so one of the main things that had to be done was figuring out what those covariances look like.
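To get a feel for the kind of quantity involved, here is a minimal numerical sketch (not from the talk) of the simplest such covariance. With X a p by q matrix of i.i.d. standard Gaussians, Tr(X^T X) is the sum of the squares of all pq entries, i.e. a sum of pq independent chi-squared(1) variables, so its variance is exactly 2pq; the Monte Carlo estimate should land near that value.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, samples = 200, 5, 20000

# Tr(X^T X) is the sum of squares of all entries of X.
traces = np.array([np.sum(rng.standard_normal((p, q)) ** 2)
                   for _ in range(samples)])

print("Monte Carlo Var(Tr(X^T X)):", traces.var())
print("Exact value 2*p*q:         ", 2 * p * q)
```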
So I thought maybe the most interesting part to look at would be the proof of this little lemma. Here is the lemma we'll need: again I'm assuming q/p tends to zero. Then the covariance of Tr((X^T X)^h) and Tr((X^T X)^k) is equal to 2 h k p^{h+k-1} q plus an error term Delta_{h,k}, where the error term is bounded by a constant, depending on h and k, times p^{h+k-2} q^2. Really, the important part for us was that these leading terms cancel with other things, and that the error term is small enough that when we plug it into the rest of the argument, the total variation distance goes to zero as we need. I thought I'd do a little bit of the proof here, since it goes into graph theory.

The idea: we can write the trace Tr((X^T X)^h) as a sum over indices i_1, ..., i_h between 1 and p and j_1, ..., j_h between 1 and q of the products x_{i_1 j_1} x_{i_1 j_2} x_{i_2 j_2} x_{i_2 j_3} ... x_{i_h j_h} x_{i_h j_1}. So I'm writing the trace as a sum of products of these Gaussian random variables. Now I rewrite this as a sum over G of X_G, where G is what we call an S-graph. To define an S-graph, plot the i_k on a top line and the j_k on a bottom line, then draw h up edges from j_k to i_k and h down edges from i_k to j_{k+1}, so that each edge of the S-graph corresponds to one of the random variables, and the product of all the random variables corresponds to the S-graph. Then I can count S-graphs instead of counting index tuples. Let's do a picture to get a feel for this: I plot the i's, which don't have to be distinct, so maybe i_2 overlaps with i_4; then I plot the j's on the bottom, maybe j_1 and j_2 overlap and j_3 and j_4 overlap; then I draw up edges from j_k to i_k, so for instance from j_4 to i_4, and down edges from i_k to j_{k+1}: i_1 goes to j_2, i_2 goes to j_3, i_3 to j_4, and i_4 to j_1.

Now go back to the covariance we're interested in. It equals the expected value of the product minus the product of the expected values, E[Tr((X^T X)^h) Tr((X^T X)^k)] - E[Tr((X^T X)^h)] E[Tr((X^T X)^k)], and I can think of this as the sum over all S-graphs G and K of E[X_{G union K}] - E[X_G] E[X_K]. So what we need to do is figure out how the difference of expected values can behave, and count the number of graphs in each class. We end up with two dominating terms.

Recall the moments of a standard Gaussian: E[Z^m] = (m - 1)!! if m is even, and 0 otherwise. So first, any of these products in which some variable appears to an odd power has expected value zero, and I can go ahead and throw away all graphs with edges of odd multiplicity. My first main term is the case where both individual graphs have only double edges, so E[X_G] = E[X_K] = 1, and in the union, when I overlap two graphs that each have only double edges, I get one edge of multiplicity four. The picture: maybe my G looks like this with all double edges, and my K looks like that, also with all double edges; in the union I have to choose some edge in each to overlap, and I get one edge of multiplicity four. The union then has expected value 3, coming from that one multiplicity-four edge, and the difference gives me 2.
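Writing out that last computation as a displayed equation, using only the Gaussian moment formula quoted above:

```latex
\[
\mathbb{E}[Z^{2m}] = (2m-1)!!, \qquad \mathbb{E}[Z^{2m+1}] = 0,
\]
so for two S-graphs $G$ and $K$ consisting entirely of double edges,
\[
\mathbb{E}[X_G] = \mathbb{E}[X_K] = \bigl(\mathbb{E}[Z^2]\bigr)^{\#\,\text{edges}} = 1,
\]
while if their union overlaps in exactly one edge, that edge has multiplicity four and
\[
\mathbb{E}[X_{G\cup K}] - \mathbb{E}[X_G]\,\mathbb{E}[X_K]
  = \mathbb{E}[Z^4]\cdot 1 - 1 \cdot 1 = 3 - 1 = 2 .
\]
```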
The last thing I need to do is count up the number of ways I can construct graphs that look like this. We have 2h edges to begin with, so if I have only double edges I have h distinct edges. Then I choose how many vertices go on top and how many on the bottom: for r from 0 to h - 1, I can choose r + 1 bottom vertices (the j's) and h - r top vertices, so with h distinct edges I have h + 1 distinct vertices in total. Then we count the number of isomorphism classes. Here, two S-graphs are isomorphic if one can be converted to the other by permuting 1 through p on the top line and 1 through q on the bottom line; that is, two S-graphs are isomorphic if I'm just changing the labelings of the vertices. To count the isomorphism classes I count the number of ways I can lay down the edges: I count all the possible ways to draw up and down edges on those h + 1 vertices, which looks like (h choose r)(h - 1 choose r), and subtract off the ones that don't give an S-graph; subtracting off the bad ones leaves (1/(r + 1))(h choose r)(h - 1 choose r) isomorphism classes. Then the number of graphs per class is the number of ways I can label the edges, or equivalently the vertices: I can label the first top vertex in p ways, then I have p - 1 choices for the next top vertex, and so on down to p - h + r + 1, and the same on the bottom, q choices for the first bottom vertex down to q - r. So this is at most p^{h-r} q^{r+1}. Finally, this constructs one of my graphs; the K graph is constructed in exactly the same way, only now it must share at least one vertex with the first graph, so there is one fewer new vertex to play with, and you also need to choose the way in which the two graphs overlap, choosing that edge of multiplicity four, and there are hk ways to pick the overlap: h choices of edge in the first graph and k choices of edge in the second. Putting this all together, I get a term 2hk times a double sum over r from 0 to h - 1 and s from 0 to k - 1 of all of these factors, with the K graph contributing k - s - 1 top vertices (one fewer vertex) and the corresponding binomial coefficients.

That takes care of the first case. The second I'll do much faster and just explain what the possibility is. The other possibility in counting these graphs is that both individual graphs have expected value zero, so E[X_G] = E[X_K] = 0, but the union has nonzero expected value. The way this can happen is if I have some cycle of single edges, or edges of odd multiplicity, lying in the one graph that exactly overlaps with the same cycle in the other graph, so that when I lay the two over each other everything has even multiplicity and the expected value is no longer zero; and we can have other stuff going on too, that's fine. In this case you go through the same kind of counting process, then you extend to graphs with six single edges and so on, so you get a couple more big sums like this. The last thing you can finally do is this: all of the above holds for any p and q we like, as long as the product is o(n); then under the special condition that q/p tends to zero you can start to bound the sums and bound the difference of expected values, and that's how you end up with that nice formula I had over there for the covariances. Good.

Alright, maybe the last thing I wanted to do is state one more result. The original question asked by Diaconis didn't put any stipulation on having to take the entries together as a sub-block; there was no such thing in the original question, and these proofs all rely heavily on that geometry. But if we allow ourselves to weaken the notion of distance, then we can actually approximate any o(n) entries we like from the random matrix. So here is one last result, due to Chatterjee and Meckes. Take A_1, ..., A_k to be n x n matrices over R such that Tr(A_i A_j^T) = n delta_{ij}, let U be a random orthogonal matrix, and let X be the random vector (Tr(A_1 U), ..., Tr(A_k U)) in R^k; again let Z be a standard Gaussian random vector in R^k. Then the L^1 Kantorovich distance, which is topologically weaker than total variation, between X and Z is at most sqrt(2) k / (n - 1). In particular, if we take the A's to be the matrices sqrt(n) E_{ij}, where E_{ij} is the matrix with a 1 in the (i,j) entry and zeros everywhere else, then this gives us that, in this weaker notion of distance, we can choose any o(n) entries out of the matrix we like. And so this answers the original question, in a coordinate-free way.

[Applause]
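To spell out that "in particular" step as a short worked computation (not stated this explicitly in the talk): with A_{(i,j)} = sqrt(n) E_{ij},

```latex
\[
\operatorname{Tr}\!\bigl(A_{(i,j)} A_{(k,l)}^{T}\bigr)
  = n\,\operatorname{Tr}\!\bigl(E_{ij}E_{lk}\bigr)
  = n\,\delta_{jl}\,\operatorname{Tr}\!\bigl(E_{ik}\bigr)
  = n\,\delta_{ik}\,\delta_{jl},
\]
so these matrices satisfy the normalization in the theorem, and
\[
\operatorname{Tr}\!\bigl(A_{(i,j)}\,U\bigr)
  = \sqrt{n}\,\operatorname{Tr}\!\bigl(E_{ij}U\bigr)
  = \sqrt{n}\,u_{ji},
\]
so the vector $X$ is exactly a collection of $k$ rescaled entries of $U$, taken from wherever in the matrix we like.
```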
Info
Channel: Institute for Advanced Study
Views: 7,095
Rating: 4.9550562 out of 5
Id: gxqYj-70O0M
Length: 42min 58sec (2578 seconds)
Published: Thu Jun 01 2017