Tamara G. Kolda: "Tensor Decomposition"

Reddit Comments

Looks promising thanks for the resource!

👍 1 · u/Willingo · May 12 2018
Captions
I think we're ready to begin with our next speaker. I'm Jim Crowley, the executive director of SIAM, and it gives me great pleasure to introduce our next speaker, the SIAM Invited Lecturer. For people who are familiar with Sandia National Laboratories, you will know that Sandia has many outstanding mathematicians and computer scientists, and our next speaker, Tammy Kolda, is among the best there. She's a distinguished member of the technical staff at Sandia National Laboratories in Livermore. She has many awards for her research, including three best paper awards, and she received the prestigious Presidential Early Career Award for Scientists and Engineers. I should note that she's very active in SIAM: she has served as a journal editor on two SIAM journals, she's been a co-chair for several conferences, including the Conference on Computational Science and Engineering and the SIAM Annual Meeting, she has served as chair of an activity group, she's been elected a SIAM Fellow, and now she serves on the SIAM Board of Trustees. Despite all of these service activities, she has been a very active researcher, and her work spans several areas including data mining, optimization, nonlinear solvers, parallel computing and software, multilinear algebra, and tensor decomposition, which is the subject of today's talk. So I introduce Tammy Kolda. [Applause]

Thanks, Jim, for a lovely introduction, and thanks, everybody, for coming today. I'm going to talk to you about tensor decompositions, a mathematical tool for data analysis. I want to acknowledge first my collaborators whose work will be featured today: Alex Williams from Stanford University, who was a summer intern; Grey Ballard, who was a summer intern and then a postdoc and is now a professor at Wake Forest; Casey Battaglino, a student intern from Georgia Tech who is now a postdoc at Sandia; David Hong, another student intern, from the University of Michigan; and Cliff Anderson-Bergman, one of my colleagues. Without their work you wouldn't be seeing what you're seeing today. Of course I also want to acknowledge all the other collaborators I've worked with whose work is not featured today, but some are here.

So what are we going to talk about? We're going to talk about tensors. What is a tensor? A tensor is just a D-way array for some value of D. We know some tensors already: a vector is a tensor of order D equals one, and a matrix is a tensor of order D equals two, so those are easy. When we get to D being three or higher, it gets more interesting. I can't even make a super good picture, so we have all these dots arranged into a 3-D structure; that's a tensor of order three. Then you have to use your mathematical imagination for a tensor of order four or higher. It gets a lot harder for me to draw, but you understand what I'm talking about. So that is a tensor.

We're going to talk about using tensors to understand data; the mathematical tool for data analysis is tensor decomposition. What do I mean here? We're going to try to take this tensor object, this D-way array, and break it up into meaningful parts — I'll tell you soon what those meaningful parts are. The mathematics will play roles in defining how to measure how well we've approximated our original data and in creating good algorithms for computing these decompositions. And then we'll give lots of examples — the official advice for giving a talk here said to give examples, so you will get lots of them — and we'll show how you can use this to understand your data better, to visualize it, and so on.
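As a minimal illustration of the "D-way array" idea (in NumPy, which the talk itself does not use): the order D is simply the number of indices of the array.

```python
import numpy as np

# Order-1 tensor: a vector of length 4
v = np.arange(4.0)                    # shape (4,)

# Order-2 tensor: a 3-by-4 matrix
A = np.arange(12.0).reshape(3, 4)     # shape (3, 4)

# Order-3 tensor: a 2-by-3-by-4 array, like the box of dots on the slide
X = np.arange(24.0).reshape(2, 3, 4)  # shape (2, 3, 4)

print(v.ndim, A.ndim, X.ndim)         # 1 2 3 -- the order D of each tensor
```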
There are lots of related concepts in matrix land. If you've heard of principal component analysis from the statisticians, this is a higher-order version of that. There's the singular value decomposition for the linear algebra community, independent component analysis, nonnegative matrix factorization, sparse factorization, matrix completion — all of these ideas and many more have been extended into tensor land.

So we have this idea of a tensor decomposition. We talked about breaking things into parts, so what are these parts? They are rank-one tensors, which are just outer products of vectors. For matrices you know these: you take two vectors, say an a of size m and a b of size n, and form their outer product, which we'll denote with a circle, and you get an m-by-n matrix. So we have m times n numbers, but they are described by only m plus n pieces of data, and the (i,j) entry of this rank-one matrix is just a_i times b_j. For tensors the idea is the same. In the three-way case we now have an m-vector, an n-vector, and a p-vector; their outer product is again denoted with circles, and we get an m-by-n-by-p object, so something with m times n times p entries is described by only m plus n plus p values. We write this as this funny-looking chicken foot, as I always call it: this funny-looking chicken foot is a rank-one tensor for D equals 3. For the D-way case it gets ugly — we have vectors a_1 through a_D and take their outer product, and the notation gets quite messy — so for your benefit today I'm going to stick to the three-way case with a, b, c, so we don't run out of letters and you don't have to squint at too many subscripts, but it all extends to the D-way case.

Let's go back and think about matrix decomposition; some of you may have seen this before. We have some data and we want to find a low-rank model, meaning a sum of a few of these outer products that describes the data well. Each entry of our matrix X, our observed data, is approximated by some M, which is the sum of entries from these rank-one matrices. We can write that in this complicated index notation, and we can write it in matrix notation: X is approximated by M, which is a sum of outer products of vectors. Typically in matrix world it's written as A B-transpose, but I'm going to use different notation, because A B-transpose does not generalize nicely to higher dimensions; we'll just write it as A comma B in this double-bracket notation. As our error metric we use the sum of squared errors — we'll talk about other possibilities later — so we take the difference at every single entry, square those, and sum them, and that's what we minimize to measure how well our model approximates our data.

Jill talked about Google searches; I will talk about Google searches too. I searched for "low rank structure"; this turns up over 5 million web pages and over 100,000 Google Scholar references. So finding low-rank matrix structure in the world is eerily useful, and quite surprising in its utility. If we go to tensors, the three-way case is not much different: now we've just added this third vector c to each of our rank-one tensors, and that's the only difference. We'll call each of these rank-one pieces a component, by the way; we often look at the individual components, because those are our building blocks. Then we can write this in tensor notation, and it's the same basic idea: the things we're looking for are the factor matrices A, B, and C that describe our data. We're still using the sum of squared errors; the only difference now is that we're summing over this three-way object.
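As a minimal NumPy sketch of these definitions, following the index conventions above (the talk itself shows no code): the rank-one outer product a ∘ b ∘ c, the rank-R model [[A, B, C]] built from the columns of the factor matrices, and the sum-of-squared-errors fit.

```python
import numpy as np

def rank_one(a, b, c):
    """Outer product a o b o c: entry (i,j,k) is a[i] * b[j] * c[k]."""
    return np.einsum('i,j,k->ijk', a, b, c)

def cp_model(A, B, C):
    """Rank-R CP model [[A, B, C]]: the sum of R rank-one tensors built from
    the columns of the factor matrices A (m x R), B (n x R), C (p x R)."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def sq_error(X, M):
    """Sum of squared errors between data X and model M."""
    return np.sum((X - M) ** 2)

# Tiny demonstration: a 4 x 5 x 6 tensor that is exactly rank 2.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))
B = rng.standard_normal((5, 2))
C = rng.standard_normal((6, 2))
X = cp_model(A, B, C)
print(sq_error(X, cp_model(A, B, C)))   # 0.0 -- the model reproduces its own data
```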
Tensors, while newer, have the potential to be an even more powerful tool for data analysis, because now it's higher-order structure that you're looking for. In fact, on Google there are already over half a million results for low-rank tensor structure, and on Google Scholar there are over 14,000 papers, so this is an area that's picking up steam.

I thought it might be interesting — I keep saying "CP tensor factorization" — to ask what CP is. The answer depends a little on who you ask, so I'll give you a quick romp through the history. What's really interesting to me is that this was invented in 1927 by a fellow named Frank Hitchcock. He was an MIT professor, and he wrote a very nice paper that is still quite readable today — sometimes older papers don't translate well to the modern age, but this one translates quite well — and on the first page he has an equation that is clearly recognizable to those in the field as the CP tensor decomposition. He was, among other things, also the adviser of Claude Shannon of information theory fame. The paper gets a lot of citations on Google Scholar — I can't remember the exact number — but it's only Hitchcock's second most cited paper, although perhaps that will change as these methods become more popular.

So Hitchcock did this work in 1927, and it was more or less forgotten for a while, at least by everyone except a small subgroup. Then it was reinvented in other communities, twice, in 1970. Carroll and Chang at Bell Labs called it canonical decomposition, so that was CANDECOMP — this awesome name — and Richard Harshman, at the same time, invented it as well and called it parallel factor analysis, or PARAFAC. For many years people would write papers calling it one or the other; when I first started in this field I did not realize right away that the two names meant the same thing, so that was quite the revelation. In 2000 it was suggested that we compromise and just call the thing CP, to acknowledge both of these early papers — again not knowing about Hitchcock. Then around 2010 folks started calling it the canonical polyadic decomposition. I don't know who was first, but it appeared in several papers starting in 2010, by Pierre Comon, Lieven De Lathauwer, and others. They cleverly reverse-engineered CP to mean "canonical polyadic" — "polyadic" goes back to that nice paper by Hitchcock, which talks about polyadics — and I thought that was a nice reinvention of the term.

I was also interested, especially while preparing this talk, to figure out who J. J. Chang was. Hitchcock has a Wikipedia page, and both Carroll and Harshman had very nice articles written about them after they died, but J. J. Chang was a bit of a mystery. Gigi was a female member of the technical staff at Bell Labs. At that time women were not hired into the same technical roles as men, although they were quite welcome to be the programmers, and her role as a programmer is widely acknowledged, because in 1970 it was not unusual for that to be considered the work of women — funny, I still love programming today, but now maybe it's less thought of that way.
So I was able to reach out to some people at Bell Labs, including Fan Chung Graham, and actually get in touch with people she worked with, with her former husband, with her niece, and even with her daughter, and I was able to find out a little more about Gigi. People have only wonderful things to say about her. I'll just give you the facts for now, but perhaps we'll be able to share more in the future. She was born in 1927 and died in 2007. She emigrated to the U.S. from China and was a really prolific researcher: she has quite a few papers to her name, including several single-authored papers about her software. From her daughter I got this really nice picture of her working with Doug Carroll back when they were working together in the 70s. So this is a nice story of a bit of a hidden figure, at least to me, in the tensor world.

Now I'm going to motivate this idea of CP as a tool for data analysis. Alex Williams is a graduate student at Stanford in a neuroscience group who came to work with me. His group — the group of Mark Schnitzer at Stanford — is collecting some really novel data based on a device they invented that can measure the activity of multiple neurons at once in the brain of a mouse as it moves around a maze. At each moment in time they get an image something like what you see here, and through no small amount of work they reduce it to this vector. I actually asked them how hard it was to preprocess the data, and they said about two years — two years of graduate student and postdoc work to turn the measured images into this lovely set of numbers, and that's what we start with. For 300 neurons they have an activation level for each neuron, recorded over time as the mouse moves through the maze, so we have a neuron-by-time matrix with the activation level of each neuron at each time. That is one trial of the mouse going through the maze. Then they do that over and over again, 600 times, because they want to understand learning. And so the question becomes: how do you analyze this type of data?

Let me tell you a little more about the experiments they did. The mouse starts at either the west or the east side of this maze. There's a wall in the middle, so when the mouse reaches the middle it has to make a turn, and if it makes the correct turn it gets a treat. At first it should always turn south, whether it starts in the east or the west. The mouse goes along and does that for a bunch of trials, and then they change the rules — and isn't that just how life goes, the rules change in the middle of your life and you have to switch your behavior. Under the new rules nothing changes if the mouse starts in the west, but if it starts in the east it now needs to turn north to get its treat. Then they change the rules again, and the mouse has to figure out the new rules to get its treat. So that's the setup.
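As a small sketch of how such data might be arranged into a three-way tensor: the neuron and trial counts follow the description above, while the number of time points and the random values are placeholders.

```python
import numpy as np

# Stand-in for the neuroscience data described above: one neuron-by-time
# matrix of activations per trial, stacked into a neurons x time x trials tensor.
n_neurons, n_time, n_trials = 300, 120, 600
rng = np.random.default_rng(1)

trial_matrices = [rng.random((n_neurons, n_time)) for _ in range(n_trials)]  # placeholder activations
X = np.stack(trial_matrices, axis=2)     # shape (300, 120, 600)

print(X.shape)   # each slice X[:, :, k] is one trip of the mouse through the maze
```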
We get something out of the tensor decomposition that looks like this chart — and now you're thinking you should have sat closer to the front; feel free to move up. What is this thing? We compute this tensor decomposition, and let's just look at the second component. The first plot, over on the left, is the activation of each neuron, which corresponds to the vector in mode one. Next we see a profile in time, which corresponds to the activity over time — and notice that this curve is smooth; we did nothing to make it smooth, it's just inherent in the data. Last is this funny business with the different trials, which are coded with different markers to indicate what's going on: orange if the mouse started in the east, green if it started in the west (those are roughly equally spread out); a circle if it ended in the south, which is where it ended most of the time, and a square if it ended in the north; and filled in if it got its treat, empty if it didn't.

So do we see anything here? Let's zoom in on this third factor. In time it's happening toward the end, and there's a very clear separation — I can draw a line between the filled-in markers and the not-filled-in markers — so my guess is that these are the neurons that have to do with whether or not the mouse got its treat. I call those the happy neurons. If we look at the second-to-last one, it's toward the beginning of time, and in fact we see a very clear separation between the orange and the green, which corresponds to the starting position, so these are the neurons that somehow understand where the mouse started. If we look at the last one, this is also toward the end, and it's a little harder to see the shapes, but the tops are the squares and the bottoms are the circles, whether or not they're filled in — the squares, where the mouse ended in the north, are always on top, and the circles are on the bottom — so this is the set of neurons that understands where the mouse ended. This is very interesting: the tensor decomposition knew nothing about the conditions of the trials, and yet somehow it saw this underlying behavior in the low-rank structure. That is the utility of these methods, and the neuroscience people are extremely excited; there's a paper about it on bioRxiv, and the references are in the slides.

So how do we actually compute this thing? Let me say a few words about that. We want to fit this CP model: we are given our data X, we want to build a model, and the model has the factor matrices A, B, and C that we want to find. If we write it out directly we get the equation at the bottom that we want to minimize — we want to find the A, B, and C that minimize it — and that's kind of ugly looking, but we can do some tricks. First, a little about the problem itself. If we want to determine the number of components, the R, exactly for a given tensor — even a very small tensor — that is NP-hard, and in many cases it actually seems to be impossible: there's an infamous 9-by-9-by-9 tensor whose rank we don't know. You might say, well, I don't care — and I often say this — I just need an approximation. But even that can be ill-posed unless you put some conditions on it, so we make sure to add those. Also, the approximations you get are not nested: the best rank-one approximation may have nothing to do with the best rank-two, which may have nothing to do with the best rank-three. This just means more computation — you have to do it for every rank you're interested in, and especially when you don't know the rank, you usually want to try a few.
One key thing, especially for linear algebra folks who are used to the singular value decomposition or principal component analysis: the A, B, and C that you get are not orthogonal. This is a bit unusual. There are a lot of papers that make orthogonality assumptions, and there are some applications where that holds, but most of the time with real-world data we cannot assume that the vectors comprising these factor matrices are orthogonal — and I just think that makes it a more interesting problem. One benefit you get, which you don't have with matrices where you impose an orthogonality constraint, is that these factor matrices are unique up to certain trivial ambiguities: you can always swap two of the components, which is just a reordering, and you can scale an a by two and the corresponding b by one half and it's the same thing. But up to these trivial ambiguities of scaling and permutation, the factor matrices are often unique, especially when the number of components is small compared to the size of the tensor. That makes the results much more interpretable when you're analyzing data. Principal component analysis, for those of you who have used it for data analysis, can be very hard to interpret because you can rotate the factors and still get an equivalent solution; here you cannot.

So how do we actually fit this model? We use a method called alternating least squares. If we stare at this equation and think about what's going on with respect to A, with B and C fixed, we can convince ourselves it's just a linear least squares problem, which makes it nice and easy to solve. We can do the same for B, fixing A and C, and the same for C, fixing A and B, and then we keep cycling until we reach convergence. That's the alternating least squares algorithm. The function value is monotonically decreasing throughout, so you simply continue until you cease to see improvement in your fit function. This is a non-convex problem — those are always a little scary — but every subproblem is convex, and moreover each is exactly solvable because it's a least squares problem.

Let's look in a little more detail at that least squares problem. I'll write it in somewhat ugly notation — I don't require you to know the notation, and I'll also show it in pictures. We have this matrix over on the left: we took the three-way array and rearranged it as a two-way array — you can rearrange it however you want — and this is called the unfolding of the tensor into a matrix, or the matricization. We put the first mode along the rows and everything else along the columns. We want to solve for the factor matrix A, and then we have this combination of B and C — this funny thing — which is called the Khatri-Rao product. It doesn't matter exactly what that is — it's Kronecker products along the rows — but those are details; just think of everything as being shoved into that matrix. This is a least squares problem, but it's backwards and sideways from what we normally see: the matrix of the least squares problem is on the right and the right-hand side is on the left, in case it doesn't look familiar. The big thing to notice is that it's a short and very wide matrix — or, if we flip it, a tall and skinny matrix.
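Here is a minimal NumPy sketch of that alternating least squares scheme, assuming the mode-1/2/3 unfolding conventions just described. It is only an illustration, not the Tensor Toolbox implementation: it omits normalization, convergence checks, and the efficiency tricks discussed below.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker (Khatri-Rao) product of B (n x R) and C (p x R),
    giving an (n*p) x R matrix whose row j*p + k holds B[j, :] * C[k, :]."""
    n, R = B.shape
    p, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(n * p, R)

def cp_als(X, R, iters=50, seed=0):
    """Plain alternating least squares for a 3-way CP model [[A, B, C]]."""
    m, n, p = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, R))
    B = rng.standard_normal((n, R))
    C = rng.standard_normal((p, R))
    for _ in range(iters):
        # Solve for one factor at a time with the other two fixed;
        # each subproblem is an ordinary linear least-squares problem.
        X1 = X.reshape(m, n * p)                             # mode-1 unfolding
        A = np.linalg.lstsq(khatri_rao(B, C), X1.T, rcond=None)[0].T
        X2 = np.transpose(X, (1, 0, 2)).reshape(n, m * p)    # mode-2 unfolding
        B = np.linalg.lstsq(khatri_rao(A, C), X2.T, rcond=None)[0].T
        X3 = np.transpose(X, (2, 0, 1)).reshape(p, m * n)    # mode-3 unfolding
        C = np.linalg.lstsq(khatri_rao(A, B), X3.T, rcond=None)[0].T
    return A, B, C
```

For example, running `cp_als(X, R=2)` on the exactly rank-2 tensor `X` built in the earlier sketch typically drives `sq_error(X, cp_model(A, B, C))` to nearly zero.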
How bad is it? For an n-by-n-by-n problem, that matrix is R by n-squared, where R is the number of components. That's not so bad — just n squared — but for a D-way tensor it's R by n to the D minus 1, so it gets very wide very fast, and it's a very overdetermined problem. Maybe there's something we can do about that.

One thing we can do is randomization, and this is a very powerful tool that's gaining a lot of steam. The idea in this case is to pick a random subset of the columns of the matrix, together with the corresponding entries of the right-hand side (which is on the left), and solve the smaller problem. We can express that with a sampling operator. Now, if you do this naively it doesn't look so good — as always in algorithms, the devil is in the details of how you implement things — but I want to stress one of the neat things here: we have very strong mathematics telling us how good the solution is. This is not just a heuristic; we can bound the quality of the solution based on properties of the original matrix and the number of samples we take, and that makes it very powerful.

There are a few details. We don't actually create the unfolded matrix in computer memory; we keep the data in its tensor form and pull out only the few elements we need. Similarly, we never actually form this funny Khatri-Rao matrix — forming it costs computation and memory movement — so we avoid that and pull out only the pieces we need. In computing these days, the thing you worry about almost more than how many operations you do is how much memory you move around, so we avoid that as much as possible. There's also a big detail I'm not going to go into today: your matrix has to have certain properties for this to work — it has to be suitably incoherent. The thing to know is that you can always fix that up by applying a Johnson-Lindenstrauss transform. Doing that directly would be slow, but there are these magical things called fast Johnson-Lindenstrauss transforms, which involve an FFT or some other fast transform and a random diagonal sign matrix, and then you are in business. So we add what we call mixing — mixing everything up so it's incoherent — and then the theory about how accurate the solutions are is guaranteed to hold.

It turned out this worked great — in fact it was so great that everything became fast and we then discovered a new sticking point, which was checking convergence. Whoops, didn't think about that one. So what could we do? We take another randomization idea, but with a different approach, to check whether the function value has stopped improving. It's a very simple, old-fashioned idea: just take a poll, like polling voters on election day. We look at a few randomly sampled entries, and then we can bound how accurate our estimate is using Chernoff-Hoeffding theory, so we actually know how many samples to take and have some idea of how accurate our answer is. We compute an f-hat, an estimate of the original function value, scaling things up by the sampling fraction so it's on the same scale as the full sum, since we've only looked at a few entries — and we can make a great estimate.
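A toy sketch of the two sampling ideas just described — solving the least-squares subproblem on a random subset of columns, and "polling" a few entries to estimate the fit. Caveats: it skips the mixing (fast Johnson-Lindenstrauss) step the theory relies on, and it takes the unfolded matrix and Khatri-Rao product as explicit inputs for clarity, whereas in practice one never forms them.

```python
import numpy as np

def sampled_ls_factor(X1, KR, num_samples, rng):
    """Solve the mode-1 least-squares subproblem using only a random subset of
    the columns of the unfolded tensor X1 (m x n*p) and the matching rows of
    the Khatri-Rao matrix KR (n*p x R).  No mixing step is applied here."""
    idx = rng.choice(X1.shape[1], size=num_samples, replace=False)
    return np.linalg.lstsq(KR[idx, :], X1[:, idx].T, rcond=None)[0].T

def estimate_sq_error(X, M, num_samples, rng):
    """Estimate the sum of squared errors by 'polling' a few randomly chosen
    entries and rescaling by (total entries / samples)."""
    flat_idx = rng.choice(X.size, size=num_samples, replace=False)
    diffs = X.reshape(-1)[flat_idx] - M.reshape(-1)[flat_idx]
    return np.sum(diffs ** 2) * (X.size / num_samples)
```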
How good is it in practice? We use, say, 16,000 samples on five-way tensors with sizes ranging from 5^5 up to 60^5. Even in the smallest case, 5^5, we're using less than 1% of the data with 16,000 samples, we're always accurate to a relative accuracy of at least 10^-3, and we get speedups on these relatively small tensors of up to 300x. So this is a nice idea, maybe even independent of everything else we've talked about: you can check the value of a function by sampling just a few entries.

Let me show you another example of this applied to real data. We were motivated because sometimes our data sets are somewhat large. There's a hazardous gas data set, where the experimenters study gases released in a room — if there were a hazardous gas in this room right now, you would want to know — so they release a gas, measure it flowing over sensors in time, and see whether they can detect what's going on. We have 71 sensors — actually 72, arranged in a 9-by-4-by-2 grid, but one of the sensors was bad so we didn't use it — and 5,000 points in time, downsampled from the original data. They run the experiment at 5 different temperatures to make sure it works in all cases, and the folks who did this ran 140 experiments. All this data is just a big pile of numbers — about two gigabytes — and we want to see if we can make sense of it, but that's rather big. Our regular method takes over a minute, and that is getting close to the limits of what we can do. With the randomized method we get at least a 2x speedup — and as the data set gets bigger the factor is much higher — so it takes less than half a minute.

And what do we get? The answers from the two methods are indistinguishable, so I'll just show you the ones from the randomized method. We get these sensor factors, and you can see some patterning that reflects the 9-by-4-by-2 arrangement, even though we didn't tell the tensor analysis about it. You get these profiles in time, and several of them don't pick up until partway through; that's because they don't release the gas at the beginning of the run — they wait a little bit — so we can see that behavior. There are different things going on with the temperature, and then we have the different experiments. The top component is just a background condition to me, because it's constant and doesn't depend on when you release the gas, but then other things start happening. The experiments factor is really a matrix of size 140 by 7 — 140 experiments, 7 components — so we can throw it into a PCA projection: we project it down to 2-D and see if that helps us make sense of what we're seeing, because there's a lot going on. We get these blobs — we could play psychologist and ask what you see in these blobs — and when we label which gases were released, there is a very clear separation of the different gases into the appropriate blobs. So even if you didn't know what your data was — which happens more often than you might think — you would see that there is this underlying structure.
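A small sketch of that last step — projecting the experiments-mode factor matrix (140 experiments by 7 components, per the description above) onto two principal components for plotting. The random matrix is only a stand-in for the real factor.

```python
import numpy as np

def pca_2d(F):
    """Project the rows of a factor matrix F (experiments x components)
    onto their top two principal components for a 2-D scatter plot."""
    Fc = F - F.mean(axis=0)                 # center each column
    U, s, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Fc @ Vt[:2].T                    # 2-D coordinates of each experiment

rng = np.random.default_rng(2)
C_factor = rng.random((140, 7))             # stand-in for the experiments-mode factor
coords = pca_2d(C_factor)
print(coords.shape)                         # (140, 2) -- one point per experiment
```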
One thing that had bothered me for many years was this question of why squared error — it seemed like there are other metrics to use, and in fact in many of our applications people were coming to us with data that did not look like continuous Gaussian data at all, so we had to start asking what else we could do. So we looked at the goodness-of-fit criterion. We have this squared error; maybe we could replace it. It's really just an arbitrary function evaluated at every element — a little function of the data and the model estimate at that point — and maybe we could swap that function out: instead of (x - m)^2 we could use something else.

Where does (x - m)^2 even come from? What you often hear when you talk to people who do this is: I have data that has low-rank structure plus Gaussian noise. What they really mean is that their noise, their epsilon, is normally distributed with mean 0 and standard deviation sigma. You can write this a different way: for every point, the model we're building is an estimate of the mean value at that point, and there's some standard deviation sigma which is assumed constant throughout. What this says is that we have these little curves — our estimate of the mean — and we think our data lies somewhere in the narrow region defined by sigma. We can describe the likelihood of observing an x given a mean mu using the probability density function in terms of mu and sigma. Suppose we take our model to estimate that mean: at every (i,j,k) the model estimates the mean, the model is required to have low-rank structure, and we plug this into the probability density function at every single point and multiply all the likelihoods together to get the overall likelihood. You don't work directly with that — it's the product of a lot of annoyingly small numbers, and those do not compute — so instead you take the log, which is monotonic and swaps the product for a sum, and then by convention you take the negative. So instead of maximizing the likelihood we minimize the negative log likelihood, which is this formula here. If we stare at it a while and note that sigma is constant, we can throw sigma out, and we're back to our sum of squared errors that we know and love.

There are good reasons to love it, but sometimes it's not appropriate. For example, what if your data is nonnegative, so all your x_ijk are greater than or equal to zero? Gaussian feels a bit wrong, because you could have a mean of, say, 0.1, and then it's very easy to produce negative entries. There are many other distributions, but one I like these days is the Rayleigh distribution. It looks like what you see on the right: the data is distributed in this hump that can't go past zero on the left. Its probability density function is written in terms of a sigma — a different but related sigma to the one in the normal distribution, but that's just how the statisticians write it — and we can say our data x_ijk is Rayleigh distributed with the parameter defined by m_ijk, which plays the role of the sigma here. We minimize the negative log likelihood, and if we remember that the x is constant, so it doesn't affect the optimization over M, we get this funny-looking goodness-of-fit function that we want to minimize.
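As a sketch of these two goodness-of-fit functions, written elementwise as in the talk. For the Rayleigh case I take the model value m as the scale parameter and drop the terms that depend only on the data; that is one reasonable reading of the formula described above, not necessarily the exact parameterization used in the speaker's software.

```python
import numpy as np

def gaussian_loss(X, M):
    """Sum of squared errors: the negative log likelihood for Gaussian noise
    with constant sigma, up to constants that do not depend on the model M."""
    return np.sum((X - M) ** 2)

def rayleigh_loss(X, M, eps=1e-10):
    """Elementwise negative log likelihood when each x is Rayleigh distributed
    with scale parameter m (constants depending only on X are dropped).
    Requires M > 0, which holds if the factors A, B, C are kept positive."""
    M = np.maximum(M, eps)                  # guard against zeros in the model
    return np.sum(2.0 * np.log(M) + X ** 2 / (2.0 * M ** 2))
```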
This requires that the model be positive, but that's easy to enforce: if we choose A, B, and C to be positive, then the model, which just multiplies them together, is positive too. One thing to know is that the expected value of x is not exactly m — it's a scaling of m. We could fold that scaling into the probability density function, but it makes no difference; it's just something to be aware of.

What if your data is even more interesting — not even continuous but discrete? What if you have binary data, just zeros and ones? This happens a lot; we'll give an example with a graph. What do you do in that case? Let's talk about probabilities for a minute — those of you who have gone to Vegas may go to sleep now, but most of us don't spend much time in Vegas, so let me tell you. You have a probability between 0 and 1 of something happening, and you can equivalently talk about the odds of it happening: if you know the probability, the odds are the probability divided by one minus the probability, and if you have the odds, the probability is the odds over one plus the odds, so you can easily go back and forth. The likelihood of seeing a 1 is the probability raised to the power x, and the likelihood of seeing a 0 is one minus the probability raised to the power 1 minus x — that's the easy way to write it — but I'm going to write it in terms of the odds, which gives the funny-looking expression on the left: r over 1 plus r, to the power x, times 1 over 1 plus r, to the power 1 minus x. It looks a little uglier, but trust me — well, it's an opinion — I think it's better; you may not. So m_ijk is the odds of observing x_ijk equals 1: we say x_ijk is Bernoulli distributed with the probability given as a function of these odds, and the expected value of x is the model value over one plus the model value. We plug everything into this probability mass function — it's called a PMF because the data is discrete — and we get the objective function here, and we require the model to be nonnegative, which as I said before is easy. Why is this easier than working directly with probabilities between 0 and 1? Because that would introduce a very ugly nonlinear constraint that we don't want to deal with, so working with the odds makes our lives much easier, and it seems a little more interpretable to me as well.

So we have this idea of a generalized CP model where we can switch out different objective functions. I don't want to judge your objective function — use the one you like. I work with a lot of analysts who have favorite objective functions, and we want to give them a flexible, simple framework so they can use whichever they want. We have the three we just discussed, and there are others — for example one based on Poisson likelihoods for count data — and similar ideas have been introduced in the matrix world. I won't go into the details of the algorithms; I'll just mention a few notes. This can be solved as before with an alternating approach, but because the subproblems are no longer closed-form there's less benefit in doing it that way, so you can also do what we call all-at-once optimization — and especially when you have missing data, which we often have, that is something you generally want to consider. In that case, instead of summing over every element, you sum only over the elements you know; we denote that set as Omega.
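Continuing the same sketch, here is the Bernoulli loss in the odds parameterization just described, together with a mask-based version of the objective that sums only over the observed entries Omega. The helper names are mine.

```python
import numpy as np

def bernoulli_odds_loss(X, M, eps=1e-10):
    """Negative log likelihood for binary data where m is the odds that x = 1,
    i.e. P(x = 1) = m / (1 + m).  Per entry this is log(1 + m) - x*log(m).
    Requires M >= 0."""
    return np.sum(np.log(1.0 + M) - X * np.log(M + eps))

def masked_loss(X, M, omega, loss):
    """Generalized objective summed only over the observed entries, where
    omega is a boolean mask that is True wherever x_ijk is known."""
    return loss(X[omega], M[omega])
```

For instance, `masked_loss(X, M, omega, bernoulli_odds_loss)` evaluates the Bernoulli objective over only the known entries, and the same `masked_loss` wrapper works with the Gaussian and Rayleigh losses sketched earlier.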
The gradient has a very elegant form — in tensor language we say it has an MTTKRP, a matricized tensor times Khatri-Rao product. You don't even want to know what that is, but if you do, come talk to me afterwards; it's a key kernel in tensor computations, and it's neat that it shows up again in these generalized forms. Missing data turns out to be very easy to handle in this sum form. One interesting piece is that missing data introduces zeros into part of the gradient computation, which means we get sparsity in the gradient. That is something we can exploit for very large-scale data: if we use, say, a stochastic optimization method, the sparsity in the gradient computation makes it much faster to compute, and we get very nice computations. This is all very elegant, but I won't go into the details here; I'll just give you a few examples.

Let's go back to that mouse data. That was actually nonnegative data, so we put it through the nonnegative version and got something different. Whether it's better is really a matter of opinion. The folks we were working with, the neuroscientists, did not like negative values — and I have to tell you, I sympathize; I don't really like interpreting negative values either — but in some ways the original was easier. Here we get a different set of components: if we look at the third component, that's starting in the east, and another component captures starting in the west — before, that was all in one piece, and now it's in two. That's good or bad depending on your point of view, but we don't judge the choice of fitting function. You can see that the south and north turns became the first and last components — north didn't happen very often, so it has lower power — and the incorrect trials got split as well. So it's a different way of looking at your data, but it pulls out most of the same features.

For the gas data, one thing I didn't tell you is that there are actually a lot of missing entries — these sensors are flaky beasts — so earlier we had done some tricky stuff to fill in those missing entries, which you don't normally want to do. With this method we could handle the missing data directly, and the data also happens to be nonnegative, so we used the Rayleigh distribution again. We get results similar to before, but now everything is nonnegative, which makes it a little easier for an analyst to interpret.

What about graph data? We had a data set that comes from a chat network of students at UC Irvine. We arranged it as a four-way tensor: senders and receivers in the chat, then messages collected by hour of the day, over a series of about 200 days. So we got a very sparse tensor with about 15,000 nonzeros, and we wanted to analyze what is happening. There's a 1 if student A messaged student B during hour Y of day Z. We use the Boolean odds function I talked about before — for Boolean data we estimate the odds of seeing a 1 — and I'm going to do a rank-12 decomposition.
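As an illustration of how such chat data might be arranged, here is a sketch that builds the coordinate list of a sparse binary sender × receiver × hour-of-day × day tensor from hypothetical (sender, receiver, timestamp) events. The event values and field layout are made up; a real sparse tensor library would take a list like this as its input.

```python
import numpy as np

# Hypothetical chat events: (sender id, receiver id, unix timestamp in seconds).
events = [(3, 17, 1_200_000_000), (17, 3, 1_200_003_600), (5, 9, 1_200_090_000)]

start = min(t for _, _, t in events)
coords = set()
for sender, receiver, t in events:
    hour = (t // 3600) % 24                 # hour of day, 0-23
    day = (t - start) // 86400              # day index relative to the first event
    coords.add((sender, receiver, hour, day))

# Coordinate (COO) representation of the sparse binary 4-way tensor:
# one (i, j, k, l) tuple per nonzero, value 1 meaning "i messaged j in hour k of day l".
subs = np.array(sorted(coords))
vals = np.ones(len(subs))
print(subs.shape, vals.shape)
```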
We see some structure here that seems reasonable. We've ordered the senders according to the component in which they have the largest magnitude, and the receivers are ordered the same way, so you can see some symmetry — groups talking amongst themselves within components. You can see things happening in time: some of these were clearly about limited-term projects that lasted a few days or a couple of weeks, and some things lasted longer. There is also a break in the days partway through, and when we went back and looked, it was a semester break at the university. So you can really see what's happening in your data.

I think that's all for the examples. Before I get to my conclusions, I'm going to give one ad from our sponsor, which is also an ad from me. SIAM is going to be starting a new journal — this was just approved last month, and we're in the process of putting together the advertising material that will go online, so you get to hear about it here first. It will be the SIAM Journal on the Mathematics of Data Science, launching sometime in spring 2018. The focus will be on the role of mathematics in data science as complemented by the fields you know are very important: computer science, statistics, signal processing, and network science. I will be the editor-in-chief; we've lined up a set of great section editors and are putting together an editorial board. I hope you will keep an eye out for it and think about submitting papers and encouraging your colleagues to do likewise.

I'll close by acknowledging my collaborators again and summarizing the benefits of CP tensor decomposition for data analysis. It's a very useful tool for real-world data analysis: it gives you an idea of the latent factors in your data — things that can explain your data — and it can also be used for dimensionality reduction, feeding into subsequent analyses like the PCA visualization we showed. Randomized methods are becoming a big, important tool for us because of the size of the data sets that need to be analyzed, and there's a lot of interesting mathematics, statistics, applied math, and computational work there. We also talked about flexing the way you measure the fit to handle different data types, and I'll say that if you get the opportunity to work with real data, it can be ugly in so many ways, but you need to understand how it's distributed, and measuring how your model fits that data is an important part of data analysis. There are many, many open math problems of all shapes and sizes — we use pure theory, we use algorithms and computation — it's all useful in the world of data science. I'll close with a few links: there's a thing called the Tensor Toolbox for MATLAB, where you can play around with tensors on your own data, and we even have parallel implementations in progress. You can always contact me by going to my web page and getting my email. Thanks again to SIAM for inviting me, and to all of you for coming today. [Applause]

We have a few minutes for questions. If someone wants to ask a question, I ask that you come forward to one of the microphones in the aisles toward the front.

Question: Back on that slide about the convergence check — you were talking about a bottleneck there, and you said you decided to stick with a random subset of elements for the function evaluation. My question is: on each iteration, was it the same subset, or did you resample?

Answer: So the question is about the subsample of elements we use to estimate whether we've reached convergence — is it the same subsample each time? I did not say, but actually yes: for that we use the same subsample every time. For the least squares problems we use a different set of random columns every time, but we kept the convergence-check sample fixed because it was just much easier to do.
Question: You mentioned that you could evaluate the probability of convergence using Chernoff bounds. If you use different subsets every time, is that still applicable?

Answer: It's applicable, but there's enough variance that it becomes difficult to be really confident about convergence. In these randomized methods — whether it's matrix sketching used in this iterative context or stochastic optimization methods — checking when you've converged is probably the most difficult part.

Question: We get a lot of weights from training neural networks in deep learning. Do you think your method could be applied to understanding the structure of the weights obtained from deep learning, and are you aware of any work in that direction?

Answer: You're asking whether we can use tensor analysis to understand the networks that come out of deep learning — the weight tensors or matrices of a neural network. I'm not aware of specific work off the top of my head — it's on my reading list — but there are a lot of connections between tensor analysis and deep neural networks, and folks are looking into that; I'm just not fully up to date on everything happening right now.

Question: When you try to solve the generalized CP problem, do you think there are opportunities for Bayesian analysis?

Answer: For generalized CP, is there opportunity for Bayesian analysis? Absolutely — that would be great. Bayesian analysis does tend to be a bit more expensive, but I think there are definitely opportunities there.

Question: Great talk. I'm working on networks and hypergraphs, which can be represented as tensors, and I'm mainly using the Tucker decomposition. Why did you choose CP rather than Tucker, and can you generalize or apply everything here to the Tucker decomposition?

Answer: I really like the Tucker decomposition — I didn't talk about it today, but I often use it for compression of data. I find it harder to use when you're trying to interpret the components, because it's not unique anymore: Tucker has a rotational ambiguity that you don't have in CP. On the other hand, CP is harder to compute, and there are the various theoretical problems I mentioned that don't exist for Tucker. So people use either one, but they're not interchangeable, and I would encourage you to try the other, just for fun.

Question: For matrices we have sparse matrices that exploit sparsity in the data. Do you have sparse tensors?

Answer: Yes, there are sparse tensors now. That was actually one of the first things I worked on, because I was surprised there were no sparse data formats for tensors, which made working with large-scale sparse tensors impossible. There is now a big interest in what the right data structures are for sparse tensors, and for the key kernels that have been identified — like the MTTKRP I mentioned — people, including our group, are researching the best data structures for these cases. It's an open problem.

Moderator: If anyone has further questions, I encourage you to talk to Tammy after the lecture. Let's thank her once again for an informative and interesting talk.
Info
Channel: Joint Mathematics Meetings
Views: 16,473
Rating: 4.9468083 out of 5
Keywords: mathematics, Society for Industrial and Applied Mathematics, Tamara Kolda, Tensor Decomposition, Data Analysis, 2018 Joint Mathematics Meetings, Joint Mathematics Meetings, 2018 JMM
Id: L8uT6hgMt00
Length: 47min 23sec (2843 seconds)
Published: Fri Apr 06 2018