New Directions for Tensor Networks: Machine Learning and Quantum Computing III

Captions
It's good to see you all again. Today we'll be following up on the rest of the material I started yesterday about introducing machine learning. We'll go quickly through that and then go into applications of tensor networks to machine learning, which is a new area I'm trying to launch, so it's definitely experimental, if I can use that word, but I'm hoping you'll find it interesting and useful.

First of all, yesterday some people requested that I post the slides ahead of time, so I wanted to put a link up. There was also an email that went around with this link, so if you have the email you'll have it; otherwise I'll wait a second if you want to copy it down, or take a picture, which is probably the fastest way. It's on the conference website as well. Anybody need more time to write that link down? OK, good, we'll get started.

So yesterday I introduced tensor networks, mostly talking about matrix product states, though I did touch on PEPS and MERA as well just to let you know about those. Those are still kind of future areas for tensor networks, I would say; I wouldn't start there if I were you. I was talking yesterday to someone who's working on PEPS, and it's still a very tricky topic, but I think it's coming along rapidly, so we'll see big improvements in that area. MERA is also very interesting, and I think we'll see more uses of it as the field builds. Matrix product states are still the workhorse of this field, so I talked about them and computations with matrix product states, how you can do things efficiently with them that would be intractable otherwise, and we started talking about machine learning, so let's continue with that today.

This is just a review from yesterday: I was summarizing the three kinds of machine learning tasks. They don't all have to fit into these categories, but it's helpful to have some categories. They were supervised learning, where you have labeled data; unsupervised learning, which covers all the things you can do with unlabeled data; and reinforcement learning, which I just wanted to mention for completeness because it's a third big area that's quite hot in machine learning. Remember, this has to do with how much information you have about your data up front. Supervised learning is the idea that you're given data with labels, and then you're supposed to produce a function that can assign the correct label to further data you didn't train on. Unsupervised learning is more like: you have some notion of a probability of seeing the different examples in the data, and you try to fit that probability distribution just from the examples. You don't have the distribution, but you imagine it's there, and you try to make a function that approximates it. You can also do things like clustering and making reduced representations; if we have time at the end I'm going to talk about one way to make reduced representations of data with tensor networks, which I think is a pretty interesting area for them.

So now to the slides we didn't get to yesterday. To summarize the approaches in machine learning, the idea is that instead of the solution to a problem being a formula that you write down, or some kind of logical code that you program,
this kind of if-then logic, some pristine piece of code you produce, in machine learning the solution to your problem is viewed as just being some function. You say: in the space of all functions there exists a function that would solve my problem, and I just have to go find it, or something near it. What you do instead of looking at the space of all functions, which would be way too big to feasibly work with, is you parameterize very flexible functions. Just about any functions will do, as long as they're extremely flexible, have lots of adjustable parameters, and hopefully other good properties as well. That last part is still something machine learning is trying to figure out: is it really about limiting the functions in the right way, or is it more about their flexibility, or something more subtle? There's this topic in machine learning called inductive bias, which in some sense is just the question of which functions are the right functions, but what does that mean? They're still trying to figure that out. One example could be that we know in physics that wave functions have a kind of hierarchical structure imparted on them by RG ideas, so we know some kind of hierarchical structure is the right inductive bias for many-body wave functions. People are still looking for the analog of that in machine learning, I would say.

This point about picking very flexible functions, just about any flexible functions, is also where I'd say it's a bit different from fitting, because some people will say machine learning is just fitting. To me, fitting implies something more like knowing precisely why you're justified in using a certain function. This may be a bit of a reach, but when fitting you can say: well, this is some curve where something is going to zero, so there's a small parameter, so I'm justified in expanding in a Taylor series, and then I just need to fit the first few terms of a Taylor series. You have some kind of theory like that. Here you're not really trying to justify your choice of functions to that level; you're picking convenient functions, things you can actually optimize over, whether or not they're exactly the right thing. Then how do you really pick the function? You say: pick a bunch of functions I can actually work with; there might be a bunch that all do roughly equally well at representing the true function; and of all the ones that do basically equally well, you prefer the simplest one. That's the idea that leads to generalization: out of all the functions that account for the data I've seen so far, perhaps if I go with the simplest one, it will continue to work on data I haven't seen.

[In response to a question:] Sure. I wouldn't want to stress that point too much. It's just that I would say maybe there's more to machine learning than just fitting the data that you have. Maybe it's related to your point: to me, when you're fitting, it's somewhat more in the realm of really wanting to go through the points that you have, or of having some knowledge of the correct function class. Here it's a little different: you're thinking more statistically,
and you're not as worried about whether you have the correct function class, because you're going to fix that anyway by also taking into account that you have limited data, so there's also some kind of regularization. That's what this point is getting toward. So I wouldn't want to say I can really define fitting; it was more motivational. [Another question:] Oh no, there certainly is such a thing as overfitting. That's where you take your data, or your function class, too seriously, and you say the function must go through all the data that I have. But that's not correct unless you have all the data and the correct function class; unless both of those are true, it's overfitting. Good question. You could talk for days about the philosophy of machine learning, but that's just a little flavor.

OK, so now to some more detailed things that are really important to know: what is the field of machine learning actually doing, what do people actually do? This is very helpful for understanding where different ideas you hear about sit in the history of thinking in that field. It has to do with the kinds of model classes people study. The most important one for learning about this field is the linear model, because it's something you should always use as a tool, it's good to know and think about, and it's very simple. Then two other very important ones are the class I would call kernel learning models, for which support vector machines is another name you'll hear; basically an SVM is a special case where you make a particular choice of cost function as well as model type, and more generally you can just talk about kernel learning. And finally, neural networks. I'll say a bit about all of these.

The linear model is just what you'd expect: fitting a line, or really some kind of hyperplane. All you do is say: my data comes in the form of some kind of vector, either a vector you're given or one you prepare. Say it's images; then x is just a vector whose entries are all the pixels, say grayscale values between 0 and 1, or between 0 and 255, something like that. Then your model is that you take the input as it is and dot it into some unknown vector of weights w, possibly plus a constant, and you just need to figure out the best w for whatever task you're trying to do.

This can actually work extremely well. Here's an example from physics where it works very well: the so-called linear prediction method. It ties back a little to DMRG and tensor networks, but it could be used with other methods as well. The idea is that you have some method that can generate very accurate data for a time-dependent process, here some kind of quench, and you measure, say, the expectation value of S^z as a function of time. It's decaying, but you reach some time you can't afford to go much past. So you divide your data into training data back here and some testing or validation data here, and then you just do linear fitting.
You say: let me suppose there's a model that I can feed some sequence of my points into and have it predict the next one, and then I can run that forward by taking the predicted point, using it as a new input, and walking forward. That's what these authors did, and you can see the solid circles are the results of their linear prediction model, and it sits right on top of the exact results; they picked a model where they could actually get the exact results to all times, just as a check. So that's pretty interesting, and they have a theory in the paper about why it works so well. I think one of the points is that they use complex weights, so the model can capture oscillatory and decaying terms and superpositions of those. Anyway, the point is that this can be a very powerful model.

How would you actually train this linear model? I wanted to put this slide in because if you're going to discuss how training works in machine learning, picking the easiest model makes it the easiest to explain. Remember, for supervised learning (you can use a linear model for other things too, but let's take that case) you have a training set with inputs x_J and ideal outputs y_J, the thing you'd want the model to give. Then you minimize a cost function that penalizes the function for having outputs different from the ideal ones. Let's take the outputs to be +1 if x_J is an A and -1 if x_J is a B, and we want to tweak f until we've minimized this as much as possible. We may not be able to get it to zero, and probably we don't even want to. So how do we do that for the linear model? You just plug the linear classifier in here. I dropped the constant term, but it's not so important; you can absorb it into x anyway by always having one of the components of x be the number one, making x one dimension bigger if you want. So you put the linear model into this cost function (I put in a factor of one-half for convenience, it doesn't matter), then you take the gradient with respect to the weights. Say we take the derivative with respect to w_n: that pulls the n-th component of x out front, the two comes down, and we have this term inside. Then you update all the weights with the negative gradient times some empirical small step size. You can see this is efficient to compute: you input all your training data, you get this number, you multiply it by each component of that vector and add these up, and you get one number for each w_n. Then you take that number, multiply it by some alpha that's like 0.001 or something you just make up, and you do this update a bunch of times. That's it. If you want to get a little fancier you can do things like decaying the alpha, having your program follow a schedule where alpha gets smaller and smaller, but for this kind of simple model it usually already works just to take alpha to be something small, and you can mess around with different sizes of alpha to see what gives you good results.
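Here is a minimal sketch of that gradient-descent update in Python/NumPy. The toy data, labels, number of steps, and step size alpha are all made up for illustration; this is plain gradient descent, not the conjugate-gradient code from the demo that follows.

```python
import numpy as np

# Toy training set: rows of X are inputs x_J; y_J = +1 for class A, -1 for class B.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 200 examples, 10 features (made up)
y = np.sign(X[:, 0] + 0.5 * X[:, 1])     # a simple rule the model should recover

X = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb the constant: append a feature equal to 1
w = np.zeros(X.shape[1])                      # one weight per (augmented) feature
alpha = 0.001                                 # small empirical step size

for step in range(2000):
    residual = X @ w - y                 # f(x_J) - y_J for every training example
    grad = X.T @ residual                # gradient of (1/2) sum_J (f(x_J) - y_J)^2
    w -= alpha * grad                    # gradient-descent update

print("training accuracy:", np.mean(np.sign(X @ w) == y))
```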
So let me just show you a quick demo of that. This uses the MNIST data set; let me jump ahead and show you what that looks like. MNIST is a bunch of handwriting samples from the post office, and you can train a model to do pretty well at recognizing them. These are actual MNIST digits; the NIST part stands for the National Institute of Standards and Technology, which sponsored this data set of scans of actual handwritten digits from the post office. This is what you train on, and the idea is that you want to get good at recognizing handwritten digits when you go over to the test set, which consists of similar digits.

So this is a little linear classifier code; I'll show you the code, it's actually not that long. This part is just the thing that holds the data, which I won't go into. This is a little conjugate gradient routine that I wrote; it's a bit fancier than regular gradient descent. Here's something that reads in the data and wires up a bunch of vectors, then it throws everything into the conjugate gradient (that's this line here), and then it evaluates the performance. That's it; it's not much code. There's something at the end that turns it into a matrix product state. Let me show you what it looks like when it runs: there it is reading the data, then you see the cost being printed out, going down and down and down, and then it can recognize 98% of the digits correctly, only getting about one thousand two hundred wrong out of 60,000. So the linear model can do pretty well, although this is considered a rather easy data set. Then you try it on the testing set and it does almost as well. That's remarkable: this very simple model can already get you something like 98% on a handwriting recognition task. That's already enough to use for a business or something like that; you could have a human come along and check the cases it wasn't sure about, but you'd already get 98% of them right most of the time. I think that's pretty interesting, and it's code you could write yourself in Python or Julia and try out.

[Question from the audience about whether the model can tell you how confident it is:] That's a good point. I'm not sure; it depends on how you define that. Actually, I take that back: you could always at least look at the outputs for the different labels. I believe here there are different weights, one set for each label, one recognizing this label, that label, and so on, and you could see whether the outputs are almost tied or whether one is clearly much bigger than the others. So you could at least do that. But how much bigger should it be? That gets into needing some theory, and testing on validation sets, to figure out the correct thresholds. But yes, I believe you could do something reasonable already with this simple model.

OK, so that was just to show you that you can do a lot with the linear model, and it's a great way to teach yourself the basics; I wanted to show you what it actually looks like to run a code that does these kinds of things.
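To make that last answer concrete, here is a minimal sketch of the "almost tied" check, assuming a hypothetical weight matrix W with one row of weights per label. The margin threshold is made up and in practice would need to be tuned on a validation set.

```python
import numpy as np

def predict_with_confidence(W, x, margin_threshold=0.1):
    """W: (num_labels, num_features) weights, one row per label; x: one input vector.
    Returns the predicted label and a flag for when the top two scores are nearly tied.
    The threshold is a made-up placeholder; tune it on a validation set."""
    scores = W @ x                       # one output per label
    top_two = np.sort(scores)[-2:]       # the two largest outputs
    margin = top_two[1] - top_two[0]     # gap between best and runner-up
    unsure = margin < margin_threshold   # nearly tied -> flag for a human to check
    return int(np.argmax(scores)), unsure
```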
OK, so that's the linear model. Basically, I would say: always use the linear model. If you're going to do any machine learning research, always use the linear model somewhere; just start with it and see how it does. You may be surprised how well it does, and you may learn something about your data set; you may find out you didn't prepare your data properly. It can be very helpful. But when you need better performance, you have to move to a more complicated model class.

The first one I'll talk about, and the one that ties into tensor networks the most, is kernel learning, and it's actually rather simple. Kernel learning is like saying: that linear classifier doesn't cut it, so let's beef it up in some way and make it more capable. Say we want to separate data into two classes, and the data looks like this, with the blue points here and the red points here, but we're naive about it and want to use a linear classifier; it may be insufficient. This is the classic XOR pattern, where you can't separate reds from blues with a single line. It's just not possible for this data set, because it's like high-dimensional information embedded in a low-dimensional space; that's one way to look at it. So you try to resolve the problem by saying: let's not just apply a linear classifier, let's first augment the data with extra features. Another way to say it is: let's apply some arbitrary nonlinear map to the data, and have the model consume not x but this thing Phi(x), which I'll say more about. You basically map the data up into a higher-dimensional space, so you prepare your data so that you start with larger vectors, and now you have a higher-dimensional space to work in. Now you try a linear classifier and you're more likely to succeed, because there are more dimensions to move in, more ways to draw a plane that cuts the data into the right classes.

So the idea of kernel learning is really simple: you do something complicated to the data, but you only have to do it once, at the beginning; it's like a constant. The thing you actually spend all your time adjusting, the weights, enters linearly. I would say this class of models is defined by the fact that the weights enter linearly; that's the important thing about it. Everything else is nonlinear and complicated. I haven't even said what Phi(x) is, and the point is that you get to pick what Phi(x) is. So it's a linear classifier in a higher-dimensional, fictitious space you could call feature space, and feature space is up to you. Some of the research is about how you should construct feature space out of the data so that this can succeed and perform well.

Let me give you an example of what's called a feature map, the Phi map that takes you from the native space of the data up into feature space. A simple example: say our native data is just three-dimensional, so these could be coordinates of things in real space or something.
But then let's say these points are really entangled with each other in real space, in some kind of complicated swirl, and we still want to disentangle them from each other. We can give our model a better chance to do that by making a larger vector out of these components. We can say this larger vector Phi(x) will contain the number one, so that we get a free constant in our model; it will contain the original vector x; but it will also contain the pairs x1 x2, x1 x3, x2 x3. Now if we apply a linear classifier to this, we can have extra terms in our model: if a coefficient of W picks out one of these terms, we get nonlinear combinations of the components of x. I think you can all see why that would help: basically, our model can now be sensitive to correlations in the data, not just looking at each component separately. That's the important thing. You can make all kinds of other complicated feature maps. One way to think about feature maps is that a feature map imposes a new distance measure on the data, different from the original one, and you can pick different distance measures; I'll say one thing about that on the next slide, and a classic choice is a kind of weighted Gaussian distance measure, something like that.
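Here is a minimal NumPy sketch of that kind of feature map, using the two-dimensional XOR-style example from a moment ago: the raw points are not separable by a single line, but after adding a constant and the pairwise product x1*x2 as features, a linear classifier in feature space separates them exactly. The data and feature choice are made up for illustration.

```python
import numpy as np

# XOR-style data: one class in two opposite corners, the other class in the other two.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1., 1., -1., -1.])      # +1 vs -1; no single line separates these

def feature_map(x):
    """Phi(x): a constant, the original components, and the pairwise product."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2])

Phi = np.array([feature_map(x) for x in X])   # data lifted into 4-d feature space

# Least-squares fit of a linear classifier in feature space (exact here, since it's tiny).
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.sign(Phi @ w))   # reproduces y: separable once the x1*x2 feature is included
```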
A few more technical notes about kernel learning. It's also called support vector machines, but that's really only technically right when you use a certain kind of cost function, I think it's called the hinge loss; so if you were wondering about that term, it basically means the model from the previous slide with a specific choice of cost function, training, and regularization. The name kernel learning comes from the fact that, at least in most of the classic literature from the 90s and in the introductory review articles you'll read, typically you can't afford to work with your feature map directly. The feature map is just something in your mind; what you actually work with is the dot product of the feature map with different things plugged into it, and that's called the kernel matrix. The idea is that maybe the distances, these numbers K_ij in feature space, are easy to write down even if the actual vectors defining feature space are not efficient to write down or work with. The key example is a kernel like K(x_i, x_j) = exp(-alpha |x_i - x_j|^2). That's a very common and popular kernel, called the radial basis function kernel; I don't know exactly what the name means, maybe something about the way Gaussians look, but it's very successful. The idea is that this implicitly defines Phi: you can work out what Phi is basically by doing a Taylor expansion of this formula. People in kernel learning like to point out that for this kernel, Phi is actually infinite dimensional; of course many of its coefficients are weighted by one over various factorials, but still, in some formal sense it's an infinite-dimensional feature space. Really, though, all that is saying is that you just change your measure of distance, and then you train a model with this new measure of distance instead of the original one. When we get to the tensor network material, the idea there is that we will work with the feature map directly, and we'll pick one that lets you use tensor networks; that's the idea.

One other thing we might use later in the talk, if there's time, is an interesting fact called the representer theorem, which says that in this type of machine learning you actually know something about what the ideal weights look like: they are a linear combination of the feature-mapped training data. The idea is that you take your training data, blow it up into this higher-dimensional space, and the ideal weights live in the support of these training-data feature vectors. So even though it's a super-high-dimensional space, there's a much lower-dimensional subspace where all the data lives, and the weights live there too; you just have to find these alphas, and there are far fewer alphas than weights. One way people typically work in this field is to immediately use this fact and work with the alphas for the rest of training; that's called the dual representation of the model, for those who want to know. Again, with the tensor network approach I'll talk about later, the idea is that that approach doesn't really scale, and instead we'll work directly with the weights, but the weights will be in the form of a tensor network, so we can actually do that. If you tried working with the weights directly in the normal way, they would be way too big; it would be like working directly with a many-body wave function. So the dual approach is analogous to representing a many-body wave function by, say, summing Slater determinants, whereas I'll be talking about a different approach.

Kernel learning really is still very popular among practitioners; a lot of people in academic settings use it because it has real advantages. In many cases you can just immediately compute the exact solution to your model, as long as your training set is not too big: if it's a few thousand examples or less, you can compute the ideal weights just by diagonalizing some matrices, and that's it. So it works really well, but it doesn't scale that well: done naively it scales as the cube of the training set size, and you can get that down to N squared or even N, but you start having to use fancy tricks. So it's not as popular anymore with engineers, by which I mean people in Silicon Valley, because they're all about big data, and big data means N is in the millions or effectively infinite, and you just can't afford N-cubed scaling. That's one big reason why people have switched to neural networks.
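Before moving on, here is a minimal NumPy sketch of the kernel ideas just described: the radial basis function kernel, the N-by-N kernel matrix, and solving for the dual coefficients (the alphas) instead of the weights. This is kernel ridge regression with a squared-error cost rather than the hinge loss; the data, the kernel width alpha, and the regularization strength are made up for illustration, and the linear solve is the step with the cubic scaling in the training set size.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(100, 2))            # N = 100 training inputs (made up)
y_train = np.sin(3 * X_train[:, 0]) * X_train[:, 1]    # some target to fit

def rbf_kernel(A, B, alpha=2.0):
    """K_ij = exp(-alpha * |a_i - b_j|^2); alpha is a made-up width parameter."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-alpha * sq_dists)

# Representer theorem: the optimal weights are a combination of feature-mapped training
# points, so we solve directly for one dual coefficient per training example.
lam = 1e-3                                    # regularization strength (made up)
K = rbf_kernel(X_train, X_train)              # N x N kernel matrix
dual_coeffs = np.linalg.solve(K + lam * np.eye(len(K)), y_train)   # the O(N^3) step

# Predictions on new points need only kernel evaluations against the training set.
X_test = rng.uniform(-1, 1, size=(5, 2))
print(rbf_kernel(X_test, X_train) @ dual_coeffs)
```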
OK, so now let's switch to neural networks for just a few slides, to give you some background on those. Neural networks are the current darling of machine learning; everybody in Silicon Valley is using these, as you know. They're notated diagrammatically very often, but I want to emphasize that these are not the tensor diagrams I used yesterday and will use later; these diagrams are a bit more empirical in what they mean, so let me explain.

What is a neural network? Basically, it's a complicated class of functions that you define by how you compute them. You take the input x (here I'm making it three-dimensional just so it fits on the slide, but really it could be a hundred-dimensional or more) and you multiply it by some rectangular matrix; that's called the first weight layer. That rectangular matrix is usually notated by a bunch of lines, and the lines are just the coefficients of the matrix. You can see this is going toward two numbers here, so there will be two outputs, which means this is a 3-by-2 matrix W1, and that line is the component connecting row index 1 to column index 2, the (1,2) element, and that one is the (1,1) element. Each line is one of the elements of the matrix; there's a little number riding on each of those lines. So you multiply by a rectangular matrix and get, in this case, two numbers. Then you take each of those numbers and plug it into a nonlinear function, and which nonlinear function you pick is up to you; different ones are popular at different times in the literature. Right now it's the ones called ReLUs that are popular; before that it was these things called sigmoids, which look kind of like, I don't know, a Fermi-Dirac distribution? Basically some kind of rounded step function. You put the numbers in pointwise, and the resulting numbers that come out are the neurons, by the way. Then you do another weight layer, multiplying by another rectangular matrix, and eventually the rectangular matrices get smaller and smaller and you get a small number of outputs, one or two or maybe ten, and that's how you evaluate the model. Usually you use a different nonlinearity at the very end; a very common one is to put the outputs into exponentials, which is called a softmax output. That's basically all a neural network is. The nonlinearities are called neurons, and sometimes they have adjustable parameters too, sometimes they don't. Other neurons people use include tanh functions, or this thing called the ReLU, which is nothing but zero and then a line; that's all a ReLU is. The terminology "deep" just refers to any neural net with more than one weight matrix, or weight layer, so this is already a deep neural network in that sense. The number of neurons in each layer is somewhat ad hoc and arbitrary, but there's a theorem that says if you have enough of them you can represent any function whatsoever. That's one very powerful thing about them: you can always add more neurons and represent anything. And empirically they just work very, very well, especially for image data, and one reason they work well for image data is that you do additional tricks.
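Before getting to those tricks, here is a minimal NumPy sketch of the plain forward pass just described: a small input, two weight layers with ReLU neurons in between, and a softmax at the end. All the sizes and the random weights are made up for illustration; a real network would be trained, and usually has bias terms too.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(z):
    return np.maximum(0.0, z)          # zero, then a line: the ReLU neuron

def softmax(z):
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

# Weight layers: rectangular matrices, getting smaller toward the output.
W1 = rng.normal(size=(3, 4))           # first weight layer: 3 inputs -> 4 neurons
W2 = rng.normal(size=(4, 2))           # second weight layer: 4 neurons -> 2 outputs

def forward(x):
    h = relu(x @ W1)                   # multiply by a weight layer, apply neurons pointwise
    return softmax(h @ W2)             # final weight layer, then softmax output

x = np.array([0.2, 0.7, 0.1])          # a made-up 3-dimensional input
print(forward(x))                      # two output "probabilities"
```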
One of the key tricks is a thing called convolutional layers. These are where you make the second or third layer of the network sensitive only to small patches of the previous layer, and you scan these patches through the data coming from the previous layer, and that defines the numbers coming out of the next layer. It looks a little like a renormalization group process, where you do some kind of blocking and then scan through, except they're overlapping blocks that you scan through the data. That has worked empirically very well for image data. Yesterday I mentioned the ImageNet paper, the one from 2012, where they just blew the doors off the performance on this really challenging image classification task; I believe Hinton was involved in that one, and the other people you'll hear associated with deep learning are his colleagues Yann LeCun and Yoshua Bengio, so I thought I'd mention them: LeCun was recently the director of Facebook AI, Bengio has a big group in Montreal, and Hinton goes between the Vector Institute and Google.

Briefly, let me mention that there are other model types in machine learning worth knowing about, though I won't say much about them. One is called graphical models; these are kind of similar to tensor networks from my point of view, but they came up in the machine learning and statistics field, and they're different from tensor networks in that the output of the model is always interpreted as a probability, an actual probability rather than an amplitude, and they're restricted to have non-negative parameters everywhere throughout the model, so you can't use things like the SVD when you work with them the way you do with tensor networks. Another class is Boltzmann machines; many of you probably know about those from applications in physics. A Boltzmann machine is basically nothing but a random-bond classical Ising model, but they're very powerful and have a lot of interesting theory behind them. And there are decision trees, where you basically make a bunch of forking paths: if this criterion is met I go this way, otherwise I go that way. These actually do very well empirically on competitions and on various challenging tasks.

Another thing I wanted to mention: I've gone to some of these big machine learning conferences and met people in the field, and it's interesting to see how the field is constructed. There are different sub-communities. One of them is very academic, and the papers often involve lots of theorems; sometimes I think they even prove things that don't really require proof, they just like to prove things, though some of the theorems are very deep too. There's another community that's very engineering oriented, and these two communities kind of drive each other crazy. The engineering community gets some incredible results, but the developments are often a bit faddish: papers will start by saying, we switch to using this kind of neuron because it's popular, or because this famous person uses it, and the other community will criticize them, but then they can't really argue with the results. So it's interesting: a lot of good things are happening, but from different perspectives.
One thing that's very important to know if you interact with this field is that conference talks and posters are valued much more highly than journal articles, which is the opposite of physics. If you actually get into a conference proceedings, that's the very best thing, like getting a Nature paper; if you put your paper into the Journal of Machine Learning Research, that's like a PRB or something. That's how the field works, and it's important to know, otherwise you might get it backwards when you try to interact with them. It's also interesting to see how strong their industry ties are. For example, in a lot of machine learning groups, professors have a hard time keeping students all the way through a PhD, because the students get poached with some kind of six-figure salary halfway through. It's actually a problem for them, and eventually the professors give up and leave too, for even bigger salaries. Anyway, it's a very interesting field, and it has a lot of really deep people in it who know a ton of interesting math and statistics, so I think it's very much worth talking to people in this field; I just wanted to encourage you to interact with them.

I already posted the slides, so I encourage you to take a closer look at some of these resources; some of them are extremely good. This one in particular, the lectures by this professor at Caltech, Yaser Abu-Mostafa (maybe some of you are at Caltech and can go take a course from him) is all on YouTube, and he's extremely entertaining; he always puts jokes into his slides. It's really good, so check that out. There's also a nice article, and I should have updated this slide because I forget the exact title, but it has a great title: Pankaj Mehta and David Schwab's big review article about machine learning aimed at physicists, which I think is called something like "A high-bias, low-variance introduction to machine learning for physicists." And there are all kinds of examples and excellent blogs online that you can mess with yourself to teach yourself the techniques.

OK, great. So now let's switch gears and talk about using tensor networks in machine learning; this will tie back into the introductory material of yesterday and today. This is work with David Schwab, who is now a professor at CUNY in New York; he does statistics and neuroscience but is very interested in machine learning generally, and I wanted to credit him for helping me get into this field. The motivation for putting tensor networks into machine learning at all is that tensor network methods are very powerful. They have really excellent algorithms for optimization; I mentioned DMRG yesterday, for example, and there are many others, and these give extremely good results. This slide shows a recent paper, just from this year, where a group from Taiwan was studying predictions of the famous Kane-Fisher scaling of transport through 1D systems with a weak link. They were looking at very long-distance behavior in correlation functions and checking the predictions of this theory, which is based on RG, and getting really excellent agreement, and that takes very precise calculations to do. So I just wanted to point out that tensor networks are very powerful.
They're also highly interpretable, because everything inside a tensor network is linear, so it's relatively easy to analyze the internal structure of what's going on. Not really easy, but you can do it, versus other approaches where it's much harder. This is an example of ongoing work by the group of Frank Verstraete and various co-workers: they've been recasting a lot of what we know about 2D topologically ordered systems in terms of PEPS, these 2D tensor networks. They've arrived at understandings of familiar concepts using unusual tools, thinking of Wilson loops as living inside the PEPS, where they're much cleaner and take the form of MPOs, matrix product operators, which are basically like matrix product states but with two lines coming out of every tensor instead of one. That gives a really clean understanding of what the topological spin of an anyon is and how you could think about calculating it, and of the different conditions that define different types of topological order. That kind of understanding comes from the linear structure of a tensor network.

Finally, tensor networks are widely applicable. Here's an example of a method that's been known for a while now, since 2007, called TRG, and it actually has nothing to do with quantum mechanics per se; it's a method for classical systems. The idea is that you write the partition function of a classical model as a tensor network, but one that's difficult to contract. If you look at this network, every blue dot is a tensor; it's basically like the transfer matrix trick for a 1D system, except these are like 2D transfer tensors. You can think of the spins as being the indices (if the index is 1 the spin is up, if it's 2 the spin is down), and all the tensors in between do is mediate between the spins and apply the correct Boltzmann weight. Then what you do is basically explode this network out into all these little shards by doing singular value decompositions, then you regroup the shards into larger tensors, and you do this over and over again. You can see the loop: you get a new network of these larger tensors, you put that back in, and you do it again and again. You get very precise results; you can compare to predictions of conformal field theory and get critical exponents to many digits very accurately, and you can see the RG fixed point emerge just by looking at the elements of the tensors. So you can do a lot, and it's all just for classical systems.
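To make the TRG picture a bit more concrete, here is a minimal NumPy sketch of just the first ingredients: building the four-index tensor whose contraction over a square lattice gives the 2D classical Ising partition function, and then "exploding" it into two three-index shards with a truncated SVD. The full algorithm then regroups shards into coarser tensors and iterates; the inverse temperature and the cutoff chi here are made up for illustration.

```python
import numpy as np

beta = 0.5                                     # inverse temperature (made up)
Q = np.array([[np.exp(beta),  np.exp(-beta)],
              [np.exp(-beta), np.exp(beta)]])  # Boltzmann weight for one Ising bond

# Symmetric square root of Q, so each bond weight is shared between its two sites.
evals, evecs = np.linalg.eigh(Q)
M = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

# Four-index site tensor A[u, l, d, r]: contracting these over the lattice gives Z.
A = np.einsum('su,sl,sd,sr->uldr', M, M, M, M)

# One "explosion": group (u, l) vs (d, r), SVD, keep the chi largest singular values,
# giving two three-index shards S1 and S2.
chi = 2
Amat = A.reshape(2 * 2, 2 * 2)
U, s, Vh = np.linalg.svd(Amat)
keep = min(chi, int(np.sum(s > 1e-12)))
S1 = (U[:, :keep] * np.sqrt(s[:keep])).reshape(2, 2, keep)            # shard (u, l, new bond)
S2 = (np.sqrt(s[:keep])[:, None] * Vh[:keep, :]).reshape(keep, 2, 2)  # shard (new bond, d, r)

# Check the split reproduces A (up to truncation error).
A_approx = np.einsum('ulm,mdr->uldr', S1, S2)
print("max error of split:", np.max(np.abs(A - A_approx)))
```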
So knowing all this, you would say: basically, quantum wave functions are just one example of a large tensor, and so are transfer matrices of classical systems; they're all just large tensors that happen to occur in physics. But tensor networks are really something bigger than physics; they're a general math technique that just happened to be invented by a bunch of physicists. So is there really nothing more to apply them to than wave functions and some classical systems? Could those be the only two applications of tensor networks forever, or are there more? Of course there are more.

Some other background, switching to a slightly different motivation: machine learning has actually been benefiting from physics ideas for a long time, kind of forever. Boltzmann machines are nothing but a disordered Ising model applied to data, and ideas like the renormalization group come up all the time in machine learning, in things like deep networks and autoencoders, and in what's called PCA, which is nothing but the SVD, which you can do hierarchically in a way that's very similar to the renormalization group. But this is 1920s and 1970s physics. Have we done nothing new since then? There's actually been a lot since the seventies, and we can try to see whether some of it has a machine learning analog as well. One of those things is tensor networks.

Even more specifically, I was looking at my friend Juan Carrasquilla's notes when he was working on this stuff, using convolutional neural networks to recognize phases of matter, and I kept asking him: is that a MERA tensor network? And he'd say, no, it's a convolutional neural network. And I'd say, yeah, but look, those mixing layers remind me of disentanglers, and those pooling layers remind me of the isometries, the RG part of a MERA. And he'd say no. But I wanted to use a MERA somewhere in machine learning, so I thought: OK, I've got to use a tensor network in machine learning. I'll skip this slide; too much introduction. So that made me want to know: are tensor networks useful for machine learning? And the answer is yes. If you use tensor networks in machine learning you realize a lot of benefits already, and there are others I'm still working on but am confident will come.

One benefit is that models that would typically scale like the cube or the square of the training set size can be trained with just linear scaling in the training set, and maybe even better. Another is adaptive weights: as you train the model, the number of free parameters can automatically grow and shrink as needed to adjust to the data, which I find very interesting. You can also learn data features (I'll have a lot more to say about this), by which I mean you can preprocess the data using tensor networks in a way that extracts only the information useful for learning and discards information that isn't helpful. Then there are future benefits, some already being realized, some in the works, some more speculative. These include better theory and interpretability: really trying to understand what a model has learned, what it's sensitive to, which inputs it actually cares about, which inputs could fool it, which inputs it doesn't even see at all. I think these questions can be tackled very handily with tensor network tools. There are better algorithms, of which I already have some examples with more on the way. And then there's quantum computing, which I'll say a bit about: the point is that tensor networks just are quantum circuits already, so if you can get a tensor network to do machine learning, you already have a quantum circuit in hand that can do machine learning.
In a way that's almost the enemy of quantum machine learning, because the more I can do classically with tensor networks, the fewer things we actually need a quantum computer for. But in the end I think these things are friends: you can push a tensor network up to a certain point and get a certain performance, but then you might run out of resources, and at that point you can switch over to a quantum computer. From that point of view, a quantum computer is kind of like a special GPU whose job is to contract big tensor networks.

[Question: does it have any nonlinearity?] No, tensor networks don't have any nonlinearity; they're entirely linear, that's right. I was waiting for someone to ask this; it's a good question and may come up again later, but let me briefly address it. It's true that the power of neural networks comes from nonlinearity, but the question is: compared to what? Compared to a neural network without any nonlinear pieces, which would just be a matrix of the size of your data; if your data is 100-dimensional, it'd be something like a 100-by-10 matrix, which is not very powerful. When we apply a tensor network, it will also be like a big rectangular matrix, but it will be acting in an exponentially larger space. That's the difference: you first map the data up into an exponentially bigger space than the one the neural network acts on, but you can actually work in that space by using tensor networks, and you get a lot of power that way. You load all the power up front and then deal with it linearly, whereas in a neural network you build up the complexity by putting nonlinearities all throughout the model. In the tensor network you do it all at the beginning. Good question.

Let me say, first of all, that I'm not the first person to think about using tensor networks somewhere in machine learning. There's a bunch of interesting work that's already been done, going back to around 2014, by different applied math groups and also by some people who used to work in physics. Some of this work includes compressing the weights of neural nets, which I'll say more about on the next slide, but also things like large-scale PCA. PCA is basically taking your data, putting it in a big matrix, and trying to diagonalize that matrix, and if your data matrix is extremely big you can use tensor networks to help with that; there's a nice paper about doing so. There's also work putting tensor networks into other kinds of machine learning models, like Gaussian processes, and using them for feature extraction tasks. So there is this other work, and it's in the slides, which you can download.

Let me highlight one of these works, just to show you something pretty different from what I'll be talking about and to give you a different flavor of what you can do with tensor networks in machine learning. The idea is: say you do want to use neural networks, but you want really big, expressive ones, a really wide model, meaning the middle layer has a ton of neurons, say thousands of neurons for some reason.
You couldn't afford to do that normally, because you would have a really giant weight layer. So this group, Novikov et al., had the idea: why don't we approximate this weight layer by a kind of tensor network? You can think of it as a matrix product state, but where every tensor has a leg going up and a leg going down instead of just one; this is called a matrix product operator, and it's very similar to a matrix product state. The idea is: what if the weight layer could be approximated by that? You just assume it can, put it in, and then train the weight layer in this form rather than in its original dense form. With this technology the group was able to train a neural network with something like close to 300 thousand neurons in the middle, compressing it by something like a factor of 80 while only getting within 1% of the very best results for that same data set. So they lost a little bit of performance but got a huge compression. That's interesting, but I looked at this work and thought: I would prefer a framework where the tensor network is the entire model, rather than being a piece of the model or a trick to compress a piece of some other model. I think the tensor network can do the whole job.

Here are some motivations, some sillier, some more serious, for why. One motivation is: could a natural image really be more complex than a quantum many-body wave function? A silly way to say it is that images are coming from quantum mechanics after all, so how could they be more complicated? That's a light motivation, but basically many-body wave functions are the most complicated things we know about, and tensor networks can handle them. More seriously, we can import really interesting ideas from physics and maybe get some mileage in a different field by bringing them in. Best of all, we can export those ideas: maybe a few years from now they'll come back in a different form, and we can learn from the machine learning community, who are very good at applied math and optimization, how to use these tools better in physics. That could be very interesting.

OK, so now to be a bit more concrete about how to make a framework that does machine learning with a tensor network at center stage. Say we have some inputs x, which are vectors of numbers normalized to go from 0 to 1, and they could represent the pixels of grayscale images like this handwriting data set; it's an easy thing to think about. Then let's propose the following model. I'll credit here another paper, by Novikov and co-workers, that had the same idea the same week as us, actually; I like how they explained the idea, because the model is a little cleaner the way they wrote it down. They said: formally this is going to be one of these kernel learning models, but put that on hold for a second, because it's easier to see it in this form. The model is rather different from a neural network; it's actually just a high-dimensional polynomial, a polynomial where we label all the terms by binary strings.
So we have all these numbers s_1, s_2, s_3, up to s_N. If they're all 0, we don't get any x's at all, just a constant. If s_1 = 1 and none of the others are, we get x_1 by itself; with s_2 = 1 or s_3 = 1 we get each of the other x's separately; or we can get pairs, with s_1 = 1 and s_2 = 1, or s_1 = 1 and s_3 = 1, and triples, and so on. So it's all the different products of the x's we can make, with each x appearing at most once. The whole point of writing it this way is that the coefficients have to carry the same labels as the terms (that has to be the case), but then they accidentally end up looking like a many-body wave function. They're not a many-body wave function, but they have the same structure: a large tensor with a huge number of indices, each of which ranges over a small number of values, so the indices look like spin-1/2 spins and there are N of them. It should be pretty clear that this is a fairly expressive model; it has very high-order terms in the polynomial, so it can fit a lot of things, and it's not too hard to extend it to include squares or cubes of the individual x's as well. This is an idea that was touched on by a few different people; we came up with something very similar.

Here's the entire model written out for the case of a three-dimensional input. Now I'm being a bit more formal and saying the model is of the form W dot Phi(x): the Phi is basically defined by this part, and W is just the weights. If you write it out, it's interesting to see what you get. The first line is the linear classifier, so you get the linear classifier for free, and the rest is just stuff that can do better; you're guaranteed not to do any worse than the linear classifier. As we saw with the handwritten digits, that already gives you up to something like 98% performance, and then these other terms help you get all the way up to 99.9% or whatever. There would be a lot more of them for a higher-dimensional input, since the number of terms grows exponentially with the dimension of the input, so we're going to need some tools to work with such a huge number of weights.

We can generalize this in a lot of ways, and one way, closer to the paper I wrote with David Schwab, is to say we don't actually want powers of the x's per se; we just want a set of nonlinear functions applied to the x's individually. We take the components of x and pointwise evaluate them in these nonlinearities (this is a bit like neurons, though it's different), and then we have exponentially many different combinations of these functions; we take all the different products and add them up with unknown coefficients, and that's our model. The goal is to find and train those coefficients. We can represent a lot of really complicated functions this way, because I haven't even said on this slide what the phi's are; they could be very complicated functions.
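Just to have it in one place, here is the model written out in formulas, reconstructed from the description above (the notation may differ slightly from the slides):

    f(x) = \sum_{s_1=0}^{1} \cdots \sum_{s_N=0}^{1} W_{s_1 s_2 \cdots s_N} \, x_1^{s_1} x_2^{s_2} \cdots x_N^{s_N} = W \cdot \Phi(x),

which for a three-dimensional input expands to

    f(x) = W_{000} + W_{100} x_1 + W_{010} x_2 + W_{001} x_3 + W_{110} x_1 x_2 + W_{101} x_1 x_3 + W_{011} x_2 x_3 + W_{111} x_1 x_2 x_3,

and in the generalized form with local functions \phi^{s}(x),

    f(x) = \sum_{s_1 \cdots s_N} W_{s_1 \cdots s_N} \, \phi^{s_1}(x_1) \, \phi^{s_2}(x_2) \cdots \phi^{s_N}(x_N).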
Down spin here I'm writing it as a Y vector and an X vector and so maybe a gray is like something in between it's like a superposition of up and down if you want or you could just say form it's just mapping one-dimensional number to a two-dimensional normalized vector that's that's all it is right but this actually leads to a very powerful model and then the idea is to say that the total feature map the thing that map's this data from this original dimensional space in dimensional space into this feature space which will actually be exponentially bigger to to the n-dimensional is a tensor product of all these functions and all I mean is just a just a product basically but it's a tensor product if you think of this as defining a tensor within indices so this is nothing but a product state wave function if you if it's think of it as a wave function it's just saying take the data represented by product States and it's just formally a vector in a 2 to the N dimensional space but it's a very easy one to represent of course you can usually store and represent this on a computer even though you can't represent most things in that space efficiently you know in a naive way so let me write this a few different ways but we all know what product states are but basically it's just saying ok take the data formally write it as this tensor product of these two component vectors that's how that looks in tensor diagram notation it's just outer product of a bunch of two component vectors again what was the point of doing that the point is that let's say we have the simplest case where the function outputs a number that be the case that you could say due to class you know binary classification a or B in that case then we want the output to be a scalar so we better have the weights have the same number of indices as as the representation of the data so there are the weights dotting in with the data to make a number that's our function and so then the upshot is that the weights again look like a mini body wave function formally speaking there's just a big tensor with lots of indices okay so then we can use tensor network tools so it's it's something that's set up to be able to use tensor networks well on these now there's a lot you know riding on whether this will actually work well or not some of you who are aware of these ideas of the area law and this idea which I didn't get to touch on that much but I think that we've talked about a lot more in all these lectures later I think there was some earlier in the school might wonder you know is this is this tensor have anything to do with the wave function maybe this thing is highly entangled from the point if you think of it as a wave function maybe it's not at all like a ground state so maybe it's not at all efficient to represent it by a tensor network and that might be the case but instead of trying to prove that or disprove that the idea we had was let's just try it and to see if it works and it did and now I have some arguments I can show you two about why we think it worked but it may not always work it may depend a lot on the task and on the data but there's been some encouraging results so far and I'll show you some more by other groups confirming that the idea is that the main approximation or the main further assumption we'll make about the model is that it's reasonable to compress this tensor by some kind of tensor network most often a matrix product state but it could also in the future be a peps tensor network or a tree tensor network or a Mira 
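To make that concrete, here is a small sketch of evaluating f(x) = W · Φ(x) when the weights W are stored as an MPS. The cos/sin local map below is just one example of such a two-component map, and the random MPS stands in for trained weights.

```python
import numpy as np

def local_map(x):
    # x in [0, 1]  ->  normalized two-component vector (an "up/down" mixture)
    return np.array([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)])

def mps_evaluate(mps, x):
    """mps: list of tensors A[j] with shape (chi_left, 2, chi_right);
    the boundary bond dimensions are 1. Returns the scalar f(x) = W . Phi(x),
    contracting the product state into the MPS one site at a time."""
    env = np.ones((1,))                                  # left boundary environment
    for A, xj in zip(mps, x):
        site = np.tensordot(local_map(xj), A, axes=([0], [1]))   # (chi_l, chi_r)
        env = env @ site
    return env[0]

# toy usage: random MPS weights over N=6 "pixels" with bond dimension 4
N, chi = 6, 4
dims = [1] + [chi] * (N - 1) + [1]
mps = [np.random.randn(dims[j], 2, dims[j + 1]) * 0.1 for j in range(N)]
x = np.random.rand(N)
print(mps_evaluate(mps, x))
```

The cost of this contraction is linear in N, which is the whole point of storing W as a tensor network instead of as an exponentially large dense tensor.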
OK, and what do you gain from this? Right away you gain some interesting things. One is that you can import ideas from physics into this area: you can take an algorithm very similar to DMRG and optimize the model that way. What do you get from doing that? Two things: linear scaling in everything — linear scaling in the dimension of your input, and linear scaling in your training set size — which is already much better than you typically get for models of this class, the kernel learning class where the weights enter linearly. The idea is that here you can work directly with the weights rather than going through all the kernel tricks, which I didn't explain in great detail but which don't scale very well. You get linear scaling because what you do is attach each element of your training set to your weights and calculate a gradient — I'll show how the gradient works on the next slide — so let me just show how the algorithm looks at a very high level.

You say: I want to optimize this matrix product state of weights, and the way I'll do that is to merge two of the tensors together — this is called two-site DMRG when you merge two tensors. The idea is that this lets you optimize in a slightly enlarged space, hopefully find your way to a better solution, and adapt the bond, which I'll show in a second. So you freeze the other tensors, and then you just calculate updates on this merged tensor for a while — compute a gradient and apply it, or use some other optimization — and improve it. When you're happy with it, you say that tensor is done, and you use a singular value decomposition to break it into two pieces. When you do, you expose that bond again, and the bond can now be bigger than it was before, or smaller — so it adapts automatically to solve the problem better and pull in more weights as needed. [Question: is that like a hyperparameter?] Yes, it is like a hyperparameter, that's right. The difference here is that it gets picked by the algorithm: instead of you, the human, picking it, the algorithm picks it, and you control it through another hyperparameter. For those who don't know the term, by the way, machine learning people use "hyperparameter" to mean basically an external knob on the algorithm. So you instead pick a cutoff — what's the threshold size of singular values that I consider large versus small — and that automatically determines the bond dimension. Does that answer your question? [Question about taking the bond dimension to be big.] Yes, it actually does: if you take the bond dimension sufficiently big you can represent any tensor, so you can represent any model in this class. That doesn't mean you have the right class of model — that's a bit of a separate story, about whether you picked the correct feature map — though I think tensor networks, later down the line, have something to say about that as well.
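Here is a minimal sketch of that splitting step: after a merged two-site tensor has been optimized (not shown), an SVD across the middle re-exposes the bond, and keeping only singular values above a cutoff lets the bond dimension grow or shrink automatically. The cutoff value here is just illustrative.

```python
import numpy as np

def split_two_site(B, cutoff=1e-8, max_bond=None):
    """Split an optimized merged tensor B (chi_l x 2 x 2 x chi_r) back into
    two MPS tensors, keeping only singular values above a relative cutoff."""
    chi_l, d1, d2, chi_r = B.shape
    M = B.reshape(chi_l * d1, d2 * chi_r)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    keep = max(1, int(np.sum(S > cutoff * S[0])))        # adaptive bond dimension
    if max_bond is not None:
        keep = min(keep, max_bond)
    A_left = U[:, :keep].reshape(chi_l, d1, keep)
    A_right = (S[:keep, None] * Vt[:keep]).reshape(keep, d2, chi_r)
    return A_left, A_right

# toy usage
B = np.random.randn(4, 2, 2, 4)
A1, A2 = split_two_site(B, cutoff=1e-3)
print(A1.shape, A2.shape)
```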
I'll show you later, if there's time — there may not be enough — that you can do an interesting thing where you start with an overly expressive feature map and use tensor networks to automatically prune it back down. So you can essentially overpower your model and then let the adaptivity of the tensor network, on its own, figure out how to prune it back; in the end you've essentially included every possible model and let the data steer you toward the subspace of models that the data actually demands to solve the problem. That's ongoing work, but I think these things can be tackled. What I can definitely say is that within a fixed model class — a fixed choice of how you map the data to product states — there always exists some large enough bond dimension such that this MPS can represent any weight tensor whatsoever. That's for sure true. But whether that's enough to do the job you actually want to do is a separate, and definitely important, question that you're raising; you have to consider it separately, and it's up to the model choice.

[Question about regularization and overfitting.] Yeah, that's a good question. One thing you can always do is apply the classic approaches people already use in machine learning, which are penalty terms in the cost function — that's the main thing people do, besides model selection, i.e. choosing how you do this map. I didn't even say what cost function I'm using: what I use most of the time is just a squared-error cost function plus an L2 penalty — a norm penalty that penalizes W for having too big a norm. Just do that; it works. But there are other interesting things too. One that ties back into kernel learning is the choice of feature map, which is itself a kind of regularization: if you pick it to be overly expressive you might overfit, and if it's not that expressive you're protected from overfitting but you also might not do that well. And one other thing that's yet to be explored much, but that I think is very interesting, is the role of the tensor network itself in regularization. That's not something that has been considered before, because these models haven't been looked at before, but the fact that there's a bond dimension, the fact that the tensor is factorized, certainly has implications for regularization. What those implications are is still to be worked out. We know it has to regularize in the sense that if the bond dimension is small there are definitely fewer parameters, so that's already a kind of regularization, but what the regularization is doing and what statistical properties it has still needs to be thought about. Does that answer your question?

So there are all these ideas about regularization. The thing I said briefly on one slide — of all the functions that could fit your data, pick the simplest one — has an analogue here: we control the norm of W as we train it (that's implicit, not on the slide), and, more explicitly, we don't let this bond dimension grow too much. We keep it as small as we can while still solving the problem at hand, because if we let it grow and grow and grow we could fit anything, and we might overfit.
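For reference, the cost function I just described, written out — squared error over the training set plus a norm penalty on the weights, with λ the knob that sets the penalty strength:

```latex
C(W) \;=\; \sum_{n=1}^{N_{\mathrm{train}}} \big( W \cdot \Phi(x_n) - y_n \big)^2 \;+\; \lambda\, \lVert W \rVert^2
```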
OK, so the algorithm continues: you merge the next two tensors, optimize them together while freezing the other ones, break them apart, and so on down the line. That's equivalent to DMRG except for what you do to that pair of tensors when you optimize it: in DMRG you solve an eigenvalue problem, and here you solve whatever problem is appropriate to what you're doing — so for supervised learning you do some kind of gradient or conjugate gradient update instead of an eigenvalue problem. Otherwise it's basically DMRG.

So what about this sub-problem — how do you actually do the gradient? Just one quick slide on that, to give you the flavor. Say we want to update that merged tensor, where we've contracted two tensors over their shared bond, and we freeze the others — they're just bystanders for this part of the algorithm. You fix one of the training-data product states (you do this for every single one, but this is just an example for a particular one) and take the derivative with respect to B. Because everything here is linear, the derivative is very easy: you just remove B. That's the derivative of the function, with x plugged in, with respect to B. Then you can contract these "wings" — the frozen parts of the MPS — over those lines, and you get a tensor with one index on the left and one on the right, plus the two raw data vectors for those two sites, or those two pixels, or whatever they are. That's the gradient. Then you just add it in with a small learning rate α, and you do that for every element of your training set, and at every bond. If you do conjugate gradient there are some more steps, but basically this is the idea. You keep improving this tensor B to get a new tensor B′ with better performance; when you're done on that bond, you break it apart and go on to the next one, back and forth. Any questions about that? I know that was fast, but it gives you an idea of how these steps work.
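Here is a rough sketch of that gradient step, assuming the squared-error cost from before. It takes as given the "wings" — the left and right environments already contracted with the frozen tensors — and the two local feature vectors for each training example, which is exactly the data described on the slide.

```python
import numpy as np

def two_site_gradient_step(B, L, P1, P2, R, y, alpha=0.01):
    """B: merged tensor (chi_l, 2, 2, chi_r); L: (n, chi_l) left environments;
    P1, P2: (n, 2) local feature vectors; R: (n, chi_r) right environments;
    y: (n,) target outputs. Returns the updated tensor B'."""
    # model outputs f_n = full contraction of B with (L_n, p1_n, p2_n, R_n)
    f = np.einsum('abcd,na,nb,nc,nd->n', B, L, P1, P2, R)
    # squared-error gradient: sum_n (y_n - f_n) * outer(L_n, p1_n, p2_n, R_n)
    grad = np.einsum('n,na,nb,nc,nd->abcd', y - f, L, P1, P2, R)
    return B + alpha * grad

# toy usage with made-up shapes
n, chi = 20, 4
B = np.random.randn(chi, 2, 2, chi) * 0.1
L, R = np.random.randn(n, chi), np.random.randn(n, chi)
P1, P2 = np.random.rand(n, 2), np.random.rand(n, 2)
y = np.random.choice([-1.0, 1.0], size=n)
B_new = two_site_gradient_step(B, L, P1, P2, R, y)
```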
So now, why should this idea work at all? One thing you could say is: OK, neural networks work really well, but they have this layered structure and they're very complicated; tensor networks are good at representing ground states, but maybe the tensors that occur in machine learning models — the ones that are good for classifying data — are nothing like ground states. Maybe they should be very highly entangled, so why should this work? Here's one slide just to argue that these things at least have a shot at working, and the reason is very simple: they can always represent the linear classifier exactly with bond dimension two. Any MPS with bond dimension two — meaning the matrix dimension is two — can represent the linear classifier model which, as I showed you in that quick demo, gets about 98% on this handwriting example. So you're always guaranteed a floor of performance at the linear-classifier level, and by growing the bond dimension beyond two you can go much beyond that.

That's the idea. Here I'm showing an exact construction of how you can encode a linear classifier into an MPS. I won't unpack it much, other than to say that if you remember the end of the first lecture yesterday, this looks a lot like a single-particle wave function — where I had φ₁ on site one with the vacuum everywhere else, then φ₂ on site two with the vacuum elsewhere, then φ₃, and so on. It's actually the exact same construction. The idea is you can imagine a single particle moving through the system: it passes over every pixel without looking, until finally it gets to the pixel it wants to look at, and it looks at that pixel and applies a weight — say V₇ on pixel number seven. So you can think of the wave function as living in the single-particle space, and basically a single-particle wave function is a linear classifier. If you have more particles, it's like having a superposition of multiple linear classifiers that can run around and take products of the things they see. That's just a motivational slide to say you're always guaranteed to do at least as well as a linear classifier, and more.

[Question about richer local feature maps.] Yeah, we've looked at that a little bit, and it's interesting to think about. The feature maps I showed were all kind of spin-1/2, in the sense that you only get to see two pieces of information about every pixel, say, if it's an image. You could certainly expand this: going back to that form of the model where it was x to the power 0 or x to the power 1, instead of having just one x on every site you could have x, x², x³, and keep going, or you could have cosines and sines with different k's inside — more complicated components. That's worth looking into, but you get into the danger of overfitting, so then you have to work harder to regularize the model, by penalizing the norm more heavily or by other strategies. That's the trade-off.

One other thing you can do is extend this idea to multiple outputs. Neural networks have multiple outputs at the end — they can output multiple labels — and you can do the same thing here by basically sticking an extra index on the tensor network. It's very natural for tensor networks with a tree structure, like MERA, because you have a top and the top can have an index sticking out of it; MPS is a little more awkward-looking, but you can put an index somewhere in the network. The idea is that the data comes in at the bottom, the MPS evaluates it, and the result is a vector over this label index ℓ; you look at the outputs, and the biggest one, for example, could mean that's the correct label, depending on how you train things. What's interesting is that you get this idea of feature sharing: if this were just a giant tensor of weights, there would be totally different weights for every label you want to recognize, but once you factorize the weights you can take all of these tensors to be the same for all the different labels, and only this central tensor differs between labels. So you're basically processing most of the data into a kind of common set of reduced features that get fed into this middle tensor, and those reduced features are what lead to the classification.
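A sketch of that multi-output version: one tensor in the chain carries an extra label index, so contracting the data product state through the whole MPS leaves a vector of scores, one per label. Where the label index sits, and the random weights, are just illustrative choices here.

```python
import numpy as np

def mps_classify(mps, label_pos, x, local_map):
    """Contract the product state Phi(x) into an MPS whose tensor at
    label_pos has an extra label index; return the vector of label scores."""
    env = np.ones((1,))
    scores = None
    for j, (A, xj) in enumerate(zip(mps, x)):
        phi = local_map(xj)
        if j == label_pos:
            # A has shape (chi_l, 2, n_labels, chi_r)
            site = np.einsum('s,aslb->alb', phi, A)
            scores = np.einsum('a,alb->lb', env, site)   # carry label index along
        elif scores is None:
            env = env @ np.einsum('s,asb->ab', phi, A)
        else:
            scores = scores @ np.einsum('s,asb->ab', phi, A)
    return scores[:, 0]                                  # (n_labels,) scores

# toy usage: N=6 pixels, 10 labels, bond dimension 4, label index on site 3
local_map = lambda x: np.array([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)])
N, chi, n_labels, pos = 6, 4, 10, 3
dims = [1] + [chi] * (N - 1) + [1]
mps = []
for j in range(N):
    if j == pos:
        mps.append(np.random.randn(dims[j], 2, n_labels, dims[j + 1]) * 0.1)
    else:
        mps.append(np.random.randn(dims[j], 2, dims[j + 1]) * 0.1)
print(mps_classify(mps, pos, np.random.rand(N), local_map).argmax())
```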
OK, so does it actually work? Yes. Here's the experiment we did back in 2016 on this handwriting classification task. We took the 60,000 training images and unrolled them into 1D, just so we could use a matrix product state. Some of you — your hackles may go up, and you may ask why you would take this 2D image data and use a matrix product state. The idea was to do the simplest thing first, because if the simplest thing works, then the more complicated approaches might work even better. So we just rasterized: snake through the image, unroll it into a 1D chain, and make one of these matrix product state models. We represent all the pixels as a product state, attach it to this MPS, train the MPS, and we get 99.9% accuracy on the training data, which drops only to 99% on the test data — only 97 wrong out of 10,000. We were very happy with that, and we got into the so-called NIPS conference, the big machine learning conference, which was a lot of fun.

What I'm excited about in this topic, though, isn't really the performance; it's more the ideas I'll mention in a minute. This handwriting dataset in particular is pretty saturated: a lot of things now get 99%, and it's getting to where going past 99% — or whatever number, 99.5% — isn't even that meaningful, because at some point it starts turning into what's called snooping. Even if you're not snooping by literally training your model on the test data, you're kind of publication snooping, in the sense that you try a bunch of ideas until one of them works and only publish the one that works — sort of gradient descent on papers. After a while it stops being a meaningful test; it doesn't tell you much about which models are really good, it's more like survivorship bias if you only publish the winners. For us it was meaningful because we just wanted to know whether this could work at all: could we do something as crazy as take a 1D tensor network, scan it over a 2D thing, and see reasonable results? If we had not been able to match the other state-of-the-art approaches, it wouldn't have been as clear that this was worth pursuing. But the more interesting thing isn't to beat anything else; it's to roughly tie other approaches while having a lot more to say about algorithms and interpretability. That's the more interesting side, I think.

Since then I've been encouraged to see other interest in this idea — some people were already thinking about it before us, some got interested afterward — and there have been some nice papers coming out of purely machine learning groups. Three of these papers are from one group at Hebrew University that's just thinking about machine learning, just in the CS department. They have one paper I find kind of interesting, called "Deep Learning and Quantum Entanglement: Fundamental Connections." The idea is that maybe concepts physicists have been talking about can be broadened and used to understand other things: physicists have been talking about entanglement, and maybe that can be generalized into these notions of tensor rank — instead of just thinking about matrix rank, you can think about higher-dimensional notions of rank, the way we think about all these bonds of a tensor network. We've learned a lot in physics by thinking that way, so they're asking whether we can export these ideas to machine learning in some useful way. The ideas have also been extended to generative modeling, which I'll touch on a bit — this means not just recognizing data but generating new data that's similar to the training data, which was done very successfully by some groups — and to further work on supervised learning, i.e. assigning labels, with different kinds of architectures. Let me talk about some of those works in a few slides.

First let me also mention that there are even some companies starting to use this technology. One company is actually more interested in drug discovery, but their co-founder, Vid Stojevic, came out of a physics group at University College London and was in a tensor network group before that, so he's using tensor networks within their company — you can see the name has "tensor" in it. I'm not sure if they're totally wedded to using it, but that's that company. Another very interesting company, right up the street from us in New York City, is called Tunnel. This is an ambitious mathematician named John Terilla, who is on leave from CUNY right now to start this startup, where they're training matrix product states to recognize language — using matrix product state models to do natural language processing — and they're having some success with that, so it's interesting to talk to them.

OK, so let me highlight some of the works I mentioned, just to show you the different directions you can take these ideas — maybe they'll inspire some of you to do interesting things. One of them — I don't have an updated reference on the slide, but this paper is now in Physical Review X — uses wave-function-like ideas to do generative modeling, and the wave function parameterization they picked is tensor networks; you could do the same with other ansätze or wave-function parameterizations too. The idea was to ask: could we do state-of-the-art generative modeling using what are called Born machines? These are basically models where you have some function, but you square it to get the probability. That has a lot of benefits. It sounds like a simple thing, and it's not something that would be obvious to do for any purely mathematical reason — it seems like a quirky quantum physics motivation — but once you actually do it you realize all these benefits. Some of them are very simple: one is that you don't have to worry about whether your model has minus signs or complex numbers in it, because the square cleans that up for you. That sounds like a silly problem, but it actually dogs a lot of statistics and machine learning: people have to tread very carefully with their optimization algorithms so as never to introduce minus signs, and there are all these steps that would be the optimal step to take next but have to be avoided because they would introduce minus signs, so a suboptimal step gets taken instead. This group is saying: why don't we just square the model? Then we can use all those optimal algorithms directly, with no problem.
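Written out, the Born machine idea is simply the following, where Ψ is whatever parameterization you like — for a tensor network model it would be Ψ(x) = W · Φ(x) — and Z is the normalization:

```latex
p(x) \;=\; \frac{\lvert \Psi(x) \rvert^{2}}{Z}, \qquad Z \;=\; \sum_{x} \lvert \Psi(x) \rvert^{2}
```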
One of the things they also do is take advantage of a certain property that matrix product states have: when you think of them as probability distributions, you can do what's called perfect sampling. That means you can take one sample, and the next sample will be totally uncorrelated with the previous one, except through the correlations of the probability distribution itself. There's no autocorrelation — it's not a Markov chain at all — so you just draw independent samples from the model as much as you want. That's very important in machine learning, because you very often encounter multimodal probability distributions, where there's one peak over here and another peak far away, and this gets around that directly. So once they train their model to generate digits, they can just sample from it: the first sample might be a nine, the next a three, the next two might be nines, the next a seven. If you take something like a Boltzmann machine and try to do block Gibbs sampling, you'll get a 9, then a 9, then a 9 — like ten 9s in a row — then maybe a 2 for a while, then a 1 for a while. Here you just jump around to every digit totally randomly, which is very useful for the tasks people actually want to do.

Another paper that had some aspects I found interesting was by a group that studied supervised learning using hierarchical, tree tensor networks. What I thought was the neatest thing in the paper was that they looked at what the data looks like in its original form, represented by product states, using some distance measure and a low-dimensional embedding of the distances — that's layer zero — and then at layer one, two, three, four, five, six. You see that at the top, what the model has learned to do is disentangle the data: it has pulled the data apart into two clusters, so that at the very top it's ready to be labeled — those are A's, those are B's. It gives some insight into what the model is doing, and I'd like to see more work along these lines. I think there's a lot more one could do to understand precisely what each of these tensors is doing — more so than for, say, a neural network — because these are all linear maps, and because they're linear there's a lot of analysis one can do. In fact it's even better than that: they're not just linear, they're isometric maps, meaning they're unitary in one direction. If you call one of them U, it has the property that U†U is the identity, so each one can be interpreted as nothing but a rotation plus a projection. You can ask questions like: if I pick that tensor, what subspace does it project onto, and what subspace does it project out? Then you could know that there are certain inputs the model at the top simply cannot see whatsoever, and other inputs it is very sensitive to, and you could study all of this very carefully. I think that's where this could be pushed a lot more.
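Here is a tiny numerical illustration of that isometry property for a single coarse-graining tensor — built from a random matrix, just for the example:

```python
import numpy as np

# A coarse-graining tensor mapping two incoming indices (dim 2 each) to one
# outgoing index (dim 2) can be viewed as a 4 x 2 matrix U with U^dagger U = I,
# i.e. a rotation followed by a projection onto a 2-dimensional subspace.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 2))
U, _, Vt = np.linalg.svd(M, full_matrices=False)
iso = U @ Vt                                   # a 4 x 2 isometry
print(np.allclose(iso.T @ iso, np.eye(2)))     # U^dagger U = identity
proj = iso @ iso.T                             # projector onto the kept subspace
print(np.allclose(proj @ proj, proj))          # idempotent, as a projector should be
```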
This is another interesting paper by that machine learning group at Hebrew University. They studied tree tensor networks as well, and they were arguing that different choices of what they call the ranks — what we've been calling the bond dimensions — of these tensor networks give different so-called inductive biases, meaning they're better or worse suited to different kinds of data. They set up two tasks, a "global" task and a "local" task, where digits are randomly moved around in the plane: in one case the digits are very large and take up a lot of the field of view, in the other they're very small. They found that in one case a model whose bond dimensions start out small and get bigger works better, and in the other case a model whose bond dimensions start big and get smaller works better. So just by choosing different bond dimensions, the models can be better or worse suited to different tasks. What they didn't mention in the paper, and what's nice to know, is that this can all be done automatically: instead of having to pick the bond dimensions to be larger at the bottom or smaller at the top, or vice versa, you can just run DMRG on a tree straightforwardly and it will figure out the bond dimensions for you. Tied with this paper, that means DMRG is figuring out the correct architecture for the task — that's what you could conclude, or at least suspect could happen, and it should be tried.

Something a bit different that I quite liked was a paper just from this year using a generalization of matrix product states — the matrix product operators I've mentioned a few times now, where there are two legs on every tensor, going up and down, not just one — for what's called sequence-to-sequence modeling. That's where you input one sequence, like a time sequence or a word sequence, and you want the output to be another sequence: the input could be a question and the output an answer — like talking to Siri — or the input could be historical stock price data and the output future prices, something like that. What they did is assume the model is an MPO tensor network, a double-leg MPS with legs going up and down, and train it by attaching inputs to the bottom and outputs to the top, both in the form of product states, and maximizing the overlap: when you put the correct output on top for a given input on the bottom, you want that overlap to be big. Then, when you use the model, you put new inputs on the bottom, and the output looks like an MPS — an entangled output — which this group then just thinned back down to a product state and declared to be the output. They got really good results: they were able to beat state-of-the-art recurrent neural network approaches, even when they partnered with practitioners from that field who tried to help beat them.

What I think they should have done, though — and I mentioned this to them — is, instead of thinning that output MPS back down to a product state so it can be interpreted as one of their training outputs, they should have sampled from it. Think of the output as a probability distribution, so a single input can have multiple outputs with different probabilities. That would be really interesting: you ask a question and get different answers depending on what samples you take, maybe slightly different wordings of an answer. It should be investigated more.

Here's something I may have time to talk about in more detail at the end — we'll see how the time goes. This is a paper I wrote in January called "Learning Relevant Features of Data with Hierarchical Tensor Networks," and the idea is: can we do things rather different from supervised learning with tensor networks? Instead of just saying, here's a model with some weights and we train the weights, can we take other interesting ideas from physics and import them whole cloth into machine learning? One of those ideas is real-space RG — principled, reversible real-space RG with tensor networks. The idea is that I take the data as a product state, just as in the earlier supervised learning slides, but instead of doing supervised learning on it, I bundle the data up into something like a many-body density matrix. If I have time I'll talk more about it, but you can actually justify this entirely outside of physics, because this object is what you could call the feature-space covariance matrix, and it's related to the technique called PCA. So you can justify the construction totally independently of any physics motivation; from the point of view of physics it looks like a many-body density matrix of a mixed state, a sum of outer products of product states, but again it can be justified purely mathematically. Then what you do is a real-space RG that coarse-grains this density matrix in a way that generates the layers of a tree tensor network. A tree tensor network is basically this construction where you have three-index tensors — two indices attaching to the physical sites and one going up — at the first level; then you treat the outgoing indices as new physical sites, put another layer of tensors on top, and keep going. So you build this tree tensor network, and you can think of it as taking the raw data mapped into feature space, coarse-graining it some number of steps, and ending up with a reduced representation of the data at the top. Then you can do whatever you want with that reduced representation. What I mostly did was put an MPS on top and train that MPS to do supervised learning — but everything in the middle was trained without any knowledge of the labels. It's automatically just studying the data, projecting out directions in feature space that were not useful for understanding the data and keeping only the directions that were useful — that's why I said "relevant features" — and then using those relevant features to do supervised learning.
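Here is a rough sketch of one such coarse-graining step: for a pair of neighboring sites, average the outer products of their feature vectors over the training set (with normalized local maps this is the reduced density matrix / feature-space covariance on that pair), then keep the top eigenvectors as the isometry for that pair. It's a local version of the idea, with made-up data standing in for a real training set.

```python
import numpy as np

def pair_isometry(phi1, phi2, n_keep):
    """phi1, phi2: (n_samples, 2) arrays of local feature vectors for two
    neighboring sites. Returns a (4, n_keep) isometry spanning the dominant
    directions of the pair's feature-space covariance."""
    pair = np.einsum('ns,nt->nst', phi1, phi2).reshape(len(phi1), 4)
    rho = pair.T @ pair / len(pair)        # 4 x 4 covariance / density matrix
    evals, evecs = np.linalg.eigh(rho)     # eigenvalues in ascending order
    return evecs[:, -n_keep:]              # keep the largest ones

# toy usage: coarse-grain a pair of "pixels" from 4 dimensions down to 2
local_map = lambda x: np.array([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)])
X = np.random.rand(500, 2)                 # 500 samples of two pixels
phi1 = np.array([local_map(x) for x in X[:, 0]])
phi2 = np.array([local_map(x) for x in X[:, 1]])
U = pair_isometry(phi1, phi2, n_keep=2)
print(U.shape, np.allclose(U.T @ U, np.eye(2)))
```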
With this approach I got very good performance on a much more challenging task than that handwriting task, another task called Fashion-MNIST, where you actually have to recognize images of clothing. Most of the state-of-the-art methods get around 89%, and I got 89% as well, so it's very comparable to things like convolutional neural networks and decision trees and other methods. That was encouraging, and there are some other little tricks I put in as well.

One other paper kind of builds off this Fashion-MNIST benchmark. Typical good performance is 89% — if you look at most of the methods, they're all in that pack — but there are one or two outliers that do better, and one of those outliers is another tensor network, actually. This is a very recent paper from the group of Ignacio Cirac, with Ivan Glasser as the lead author, and they studied generalizations of tensor networks known as string-bond states and entangled plaquette states. Some of these, by the way, get very close to Boltzmann machines — some are very similar to Boltzmann machines — but the way they think about them is as groups of tensor networks that can be folded together, and by using stochastic optimization on them they were able to tackle these very challenging-to-train tensor networks. These are really starting to approach training things like PEPS, even, and they got state-of-the-art performance on this challenging Fashion-MNIST recognition task. The only other model I know of that beats theirs is one particular deep neural network with something like 100 layers, the thing called GoogLeNet, which gets about 93%; theirs gets 92.3%, so it's very close. Another thing I liked in this paper is that they actually did some learning of the local features from the data: they let the data tell them how to map it into a product state, based on what gives the best performance. Interestingly, what they found is very, very close to the map that David Schwab and I picked in the 2016 paper. So that's the map we picked, and that's what the data wants — somehow we had already guessed a good choice, and the data wants that choice anyway. I just thought that was kind of fun.

OK, so that was a sampling of different works people have done. Any questions about those before I switch gears to quantum computing? [Question about the normalization of the MPS.] I actually just let the norm float. It is an MPS, but it's not a wave function per se, so it doesn't have to be normalized for the case of supervised learning. If you're doing generative modeling you might want to normalize it, because that's like setting the total probability to one, but for supervised learning it's just empirical weights that are there to dot with your data and give you some output, so the norm can be different from one. What we did in practice, though, is keep the norm from floating too much, because what it likes to do is blow up — it basically just grows and grows the more steps you do, to huge numbers — and that isn't doing anything for performance; it's just an artifact of the training. So we put a weight penalty into the cost function: if the norm gets too big, it starts to increase the cost function, and that goes into the gradient as well.

[Follow-up question.] Oh yeah, in that example, that's right: at every step you can always work within the norm-one manifold — you can make your gradient orthogonal to whatever tensor you're working with, and then it won't change the norm, at least to leading order; it changes it a little, but you can correct that by hand. There are other things you could do too: you could let the norm drift and then normalize at the end if you want, and that might also work. So it's not the easiest thing to handle — it is a little bit of a headache — but there are good ways to handle it. Good question.

[Another question.] Hmm, that's a really good question, thanks. That question has some important aspects, in the sense that although I'm making these analogies to physics — and we want to make them, because we're used to tensor networks being wave functions or other such things — one key thing to keep in mind is that some of the machine learning setting is much easier than physics, and that's good. It means that things that have only been working OK in physics might be super powerful in machine learning. Take PEPS, for example: PEPS tensor networks are a bit challenging to use in physics, and part of the problem is that you're trying to optimize with a Hamiltonian that has to go in between two copies of your PEPS. But in the supervised learning setting, all you want to do is attach a product state onto the PEPS, and that eliminates the physical indices. Let me draw this. You have a PEPS, which I'll just draw as a grid — I won't draw the circles, but everywhere the lines cross it's understood there's a tensor — with physical indices coming up; that's just a weight layer. Then I attach a product state onto it — this may not be exactly your question, but I wanted to mention it — and once I do, I just have a single-layer tensor network. It looks like one of those classical partition-function networks I mentioned briefly at the beginning: it's just a number I have to calculate, and we have all these different strategies for calculating it. One way is to view one boundary row as an MPS and the next rows as MPOs, matrix product operators, and just multiply this times that and march it through, and you get the number. So there are things that are much easier to do in machine learning with, say, PEPS than they would be in physics.

Now, I think your question was more about what the analogy is between what's going on here and something in physics. It's a good question, but maybe the best thing to say is that there doesn't have to be that much of an analogy — it's more about repurposing a powerful tool from physics. On one level it's just different: finding the ground state of a Hamiltonian is an eigenvalue problem, and these are more like some minimization problem. It's just a different problem. Of course you could minimize something like an energy in some sense, but there isn't a Hamiltonian here per se — though there are structures like in physics. There is something like a density matrix in some cases, or you can make up something like a density matrix, as I mentioned, and it can be perfectly justified. And this does get into some interesting things. When I say that tensor networks have a lot of interesting theory — I haven't had time in these lectures to go into it — one of the bits of theory is that tensor networks let you construct what are called parent Hamiltonians. The idea is: you give me a tensor network, and I can constructively produce a Hamiltonian for which that tensor network is the ground state — and I can make it a frustration-free Hamiltonian. So if I trained one of these tensor networks for some machine learning task, I could construct a Hamiltonian for which it is the ground state and look at that Hamiltonian. Maybe I would learn something from doing that — I'm not sure, but it's interesting to think about.

Maybe the most important analogy to physics, though, is interpretability: we could use physics ideas to look into what these tensor networks are doing. One key example — again, I didn't have time to get into all of this, but I think I mentioned it to a few people, maybe yesterday — is that if you have a translation-invariant MPS that you've trained (this is coming from the physics literature, for an MPS that goes on forever), you can evaluate the correlation length precisely. You pick out one of the MPS tensors and build this thing called the MPS transfer matrix from it; then you diagonalize that object and look for its dominant eigenvectors — call them v_j, with eigenvalues λ_j. You can think of this as something like the result of putting an operator between two copies of the MPS and rolling it up into a blob, and then asking how that blob transports through the MPS in some simple way. From those numbers λ_j you can actually bound the correlation length of the MPS. It would be very interesting to do that for data: take some kind of time series — historical temperatures, or prices, or language, something with a 1D character — train a translation-invariant MPS to model it, whether for generative modeling or supervised learning, and then extract the correlation length from the MPS directly. You'd have learned something about the data that would otherwise be pretty hard to calculate. So that's one kind of physics technique you could use, and I think there are a lot of things like that — I think it could be interesting.
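Here is a sketch of that transfer-matrix calculation for a translation-invariant MPS tensor A; the random tensor at the end just stands in for a trained one.

```python
import numpy as np

def mps_correlation_length(A):
    """A: translation-invariant MPS tensor of shape (chi, d, chi).
    Build the transfer matrix E = sum_s A^s (x) conj(A^s), diagonalize it,
    and estimate the correlation length from its two largest eigenvalues."""
    chi = A.shape[0]
    E = np.einsum('asb,csd->acbd', A, np.conj(A)).reshape(chi * chi, chi * chi)
    lam = np.linalg.eigvals(E)
    lam = lam[np.argsort(-np.abs(lam))]          # sort by magnitude, descending
    return -1.0 / np.log(np.abs(lam[1] / lam[0]))

# toy usage: a random bond-dimension-8, physical-dimension-2 MPS tensor
A = np.random.randn(8, 2, 8)
print(mps_correlation_length(A))
```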
author on our paper bill Huggins who kind of pushed me to work on this I was pretty reluctant because I was like I don't know much about quantum computing you know what's going on there and he said let's just like look and then around the same time that we published this group at University College London also published very related work so it's nice to see that that this idea was sort of in the air also I'm not sure if I haven't on the slide but I wanted to mention I might have a slide later it was really directly inspired by a work by Brian Swingle and Isaac Kim who had some had some other ideas about using tensor networks for quantum computing their ideas are very similar to the ones that I'm going to show because you know ours were inspired by theirs but their idea was to use tensor networks to find ground States on a quantum computer and their their idea was basically that if you proposed the ground state to be in the form of a mirror you can then exploit the form of a mirror to write better quantum algorithms so I encourage you to take a look at their paper and I'll try to provide a reference so you can email me and I can send you the reference but then I was thinking that okay that's great but I think machine learning is an even better fit for reasons that I'll say but basically some of those reasons include that quantum computers are very noisy and machine learning is very tolerant of noise also it's just depressing application that people would want to use quantum computers for okay so a bit of background what is a quantum computer this may not be the most authoritative definition of it but just for our purposes so what I'm thinking about is gate based quantum computing there's other schemes too there's things like anneal errs and other things but I'm thinking of gate base so what you can think of a quantum computer as being is just a set of coherent qubits just some collection of qubits which right here are these circles for which one can do the following things efficiently prepare certain initial States so you can say prepare them into some specific product States so here it could be that they're all in the zero State that's really what those are zeroes and also to which you can apply unitary operations usually these take the form of 1 & 2 qubit unitary operations usually the one qubit ones are very flexible like you can just apply arbitrary one qubit operations to qubit are more limited usually it's just like something like controlled Z or Hadamard or something like that these very specific two qubit you know Terry's but it depends a lot on the hardware which ones you can do and I say usually because it's actually very interesting to me to find out talking to some other people in these groups in these labs that they actually can do multi qubit unit Ares but they don't always advertise this that much so I actually talked to one of the IBM experimental honey Peck about this and she was mentioning this one unit area that she has where they they basically blast all of their their superconducting qubits with microwaves and this this has some kind of effect on all the qubits at once it basically does some kind of different applies a different phase to each one but it depends on the particular qubit what phase you get and and I went up to her after her talk and said that's awesome that you can do that and she started apologizing and saying like oh I'm so sorry you know it's not a two qubit unitary she's like theorists only want two qubit unit Ares and I'm like no no no no like I like 
I want more than two qubits actually and you'll see why in in a few minutes so that's something by the way just for people who are thinking about this field is that we should really be pressing the experimentalist to be to tell us about all the unit areas that they can do pass to qubits and we should try to use all that capability so the last thing is performing measurements and this is interesting because some are some hardware well let me let me go ahead and go through the slide so this is the qubits they're usually notated in time as notated with these lines it's kind of like a musical it stands or something and then you put the unit Aries on in time you kind of do unitary operations one qubit two qubit in some pattern then you do measurements and on some hardware you can actually measure just a single qubit or a subset of qubits without disturbing the others at least in principle but some hardware you have to measure all the qubits it's like all or nothing so you just measure them all or you don't so some companies are working harder to try to put in this single qubit measurement which is something I'd like to have so it'd be nice when this comes online more but it depends on the particular hardware okay so now to the idea of doing machine learning with common computing this is an idea that got kind of a went into an interesting direction just this year so there'd been some works in past years about doing this but all of a sudden kind of starting in January in February a bunch of groups all had the same idea basically all at once which happens a lot in physics so one of these ideas was or first of all let me say what's the broader picture here these two slides will be about this this broad idea that Maria sholde called circuit centric machine learning and the idea there isn't saying instead of thinking about how do people do machine learning classically then let's just try to imitate a quantum computer the thinking here is let's let's just let a quantum computer do whatever it does best and then try to let that do machine learning so it's like you just let the quantum circuit be the model so then there's two modes of doing this one of them is supervised learning or discriminative learning and the idea is you say okay here we have data which we will represent as a product state so that should sound familiar from the earlier slides with the tensor network thing but this was just some other groups thinking about this so this is a group at Google Farhi and Evan and this is a group in Toronto and they were saying prepare the data as a product state then have some kind of quantum circuit that acts on to that product state and here I've been very vague as to what quantum circuit it is and some of these papers also are equally pretty vague about it so the idea is this whole bubble here is just a giant quantum circuit it's one big unitary but in practice it would actually have to be lots of smaller unit Ares acting you know one after another then whatever this big unitary is let's go and find this unitary that's our model that's like our adjustable weights is this unitary once we found this unitary that's our model we apply it then we just measure one of the output qubits I mean we could measure them all but we only care about one of them and then the output tells us what class the data is in so it's either class 0 or class 1 so that's like a or B that's it and if you want more labels you can measure two qubits or three qubits and you'll get exponentially many labels to pick from so that's that 
that's that idea so we had that same idea but we were still working on our paper at the time then there's another mode which is basically flipping this around it's generative modeling so let's say we have a model we want to generate data that's similar to some other data that we've seen so this was proposed back in December and also in this group by this group of benedetti at all in January they said ok start with qubits that are just prepared in some arbitrary reference state like all zeroes I'm using triangles here for some reason but that just means zero state now apply some other trained unitary that we trained somehow and I'll tell you in later slides how you could train it I could tell you a bit more about how this group trained it train some big unitary which could be an entire circuit acts on this reference state then again what we do at the end is we measure the qubits but here we don't just care about one or two of them we care about all of them and the output is a binary string it's you know zero zero ones your one or something and then that's our that's one sample of our model so we sample basically from the model and that could generate things like images that look like some other images that we train from or something like that so this is really you know you know this is really basically quantum mechanics it's just saying let's prepare a wavefunction and then collapse it and then but then use that for machine learning somehow yeah I'll definitely lead to that how do you do the updates but basically you just this is this is like the part of evaluating your function but I mean you can evaluate different functions to see how they do so then the question is can you do that in some smart way so do I have to search over all the different unit Ares or can I follow a gradient or something and get there faster so I'll mention a little bit about how to do that but we can say more about it so if your question if you still have that question in a few slides then let me know but I will say a bit about that okay but there's even even before we get into the issue a little bit of of what circuits should go here or maybe that's the way I'm gonna say but but there's issues with these proposals right so I like these proposals because some of the earlier proposals that I'm not going to talk about much we're seeing other things like saying like let's put a neural network on to a cloning computer but I'm not sure if a quantum computer that might work great but I'm not sure for myself if a quantum computer really wants to be a neural network a quantum computer just wants to be a quantum computer right so this these ideas are saying let's just let a quantum computer do what it does which is prepare state supply unit Ares and do measurements but let's just try to make those unit Ares and those measurements accomplish interesting tasks you know that's the idea however the way these proposals were framed these four papers that I mentioned they left some things unsaid or on the table or the less than ideal things that they even acknowledged in their papers one of these less than ideal things is that how do we parameterize these circuits what numbers should go in here we can't really necessarily make a completely arbitrary in qubit unitary right that's not feasible so we have to break this up into some other more specific unit areas either two qubit or these very special multi qubit ones that I mentioned they have in the in the labs and if you do try to just say let it be a random circuit you find 
massive problems with that so another paper by the Google group came out in March and they said if you actually tried to train a random circuit for machine learning you'd find that the gradient we very very very flat you're basically on this infinitely flat plateau and in parameter space and you can't go anywhere so it's really bad so you need some kind of smart circuit some better circuit to put here you probably know what I'm gonna say that should be and then the other problem with it is that it just needs too many qubits especially this generative approach so the idea is let's say I want to generate realistic images or something so I want to output pixels but I might want to output you know an image has like thousands of pixels so am I gonna have to have thousands of qubits to do that that's not gonna really fly right we're not gonna have thousands of coherent qubits for a while so we need to do something with we can do with tens of qubits in the New York term okay so now in comes tensor networks so the thinking here is that tensor networks are equivalent they're one-to-one with quantum circuits so if I take a particular tensor network this is a tree tensor network I can always read it as a quantum circuit and this is not just analogy this is a precise map so what I can do is I can say these lines at the bottom let's say these are spin 1/2 spins ok that's already a qubit that's just two state two level system then if I merge them by this tensor into a 4 dimensional index I can always think of that as a product of two two dimensional spaces I can always just factorize that so that's these two lines inside or these two lines going up here continuing up to this top level that's like this index so there's always some mapping between any tensor network and a quantum circuit you can do especially these ones that are trees like MPS's or a special case of trees so here's in fact of the quantum circuit for a particular matrix product State so if I have a bond invention for matrix product state that means you know the matrices here that connect one to another to another have an index size of four I can always write that as this circuit and this is a totally faithful mapping one to the other and the idea is that that's the first tensor that's the second that's that that's the next you know and the idea is that you see that for any external line here any physical line it just goes and touches one of the tensors and this tensor is joined to that's this one by two lines so that's a dimension of four and two lines to this one that's the dimension of four so you can see here it is with two lines going to left lines going to the right and one connecting to the physical space it's just like here you know dimension for space dimension for space dimension to space so you can see that that's an FPS hopefully if you kind of learn how to read it that way so that's an NPS as a quantum circuit so that suggests that so we already know that you know matrix product states are useful for machine learning so I already mentioned all these slides with all these other works people have done one of them was this group from Beijing that actually trained a matrix product state sampled from it and got good quality images out they actually would got good quality handwriting samples out from a matrix product state so you could do the exact same thing here so you could say well that's just interpret the matrix product say as a quantum circuit it could be that exact same matrix product state that group trained sample from a 
So the idea is that this gives you some motivation for how you could pick your circuit: instead of picking some messy random circuit, you could just pick a matrix product state circuit, and you know it will do well, because we already have these classical experiments showing that. And best of all, you can connect with those classical approaches — by classical I mean all the earlier slides where people are using regular computers to train these. You can actually optimize a matrix product state, or some other tensor network, as far as you can push it classically, then take those parameters and just put them onto a quantum computer — you're guaranteed to get at least the same performance, apart from important issues about the noisiness of the computer — and then continue to optimize it more on the quantum computer. Yes — well, because you may not be happy with the performance you're getting: you might get up to a certain level of performance and say, well, I wish I had more resources, I wish I could do a larger MPS, and by going onto a quantum computer, in principle, you can. [Question about the gate-generation process.] Oh, you mean like applying the unitaries? Yes — sorry — yes, the advantage is that you can manipulate very, very high-dimensional spaces directly. I'll get to this in a minute, but the idea is that this is a bond dimension four MPS because two qubits go through; if three qubits went through, it would be bond dimension eight, four qubits would be sixteen, so it's growing exponentially with however many qubits I let go between this unitary and that one. So if I can let something like sixteen qubits pass between this set of operations and, as I transition in time, that set of operations, then I can parameterize a matrix product state whose bond dimension is in the tens or hundreds of thousands — way out of bounds for what we can do classically. So — and it's helpful that you're asking this question, thank you — the idea is that here this is just a bond dimension four MPS, but if we can work with something like 16 or 17 or 18 qubits, which are the numbers people are working with, we can work with matrix product states with bond dimensions of hundreds of thousands, in principle. So we could have extremely powerful models — that's the idea — and I'll show you in a minute how that can actually work with even a small number of qubits. So that's the idea for a generative model based on matrix product states, but written in a form that is nothing but a program that can be run on a quantum computer: if I tell you exactly what these unitaries are, you can carry this out on quantum hardware. I'm glossing over some details, namely that you don't have three-qubit unitaries, but there are ways to so-called compile them into two-qubit unitaries, meaning find two- and one-qubit unitaries that approximate these three-qubit ones; or you can actually use those multi-qubit unitary capabilities; or you can just not propose this model and instead propose one that only has two-qubit unitaries, and train it that way. Okay. You can also do discriminative learning by just flipping it around: you prepare your data as a product state, then you apply another tensor-network-motivated circuit, or a tensor-network-equivalent circuit, and then you read off one of the designated qubits as the label — you measure it, and you do that a few times because it's probabilistic, to estimate which label the model really wants most often, and that's your output.
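The scaling being referred to is that letting k qubits cross between successive blocks of gates gives the circuit the expressive power of an MPS with bond dimension chi = 2**k. The short loop below just tabulates that, together with a rough storage estimate (my own back-of-the-envelope arithmetic, not a figure from the talk) for one tensor of the equivalent classical MPS.

```python
# chi = 2**k when k qubits pass between successive gate blocks; the storage
# column is a rough classical cost for one MPS tensor of shape (chi, d=2, chi)
# in complex double precision (16 bytes per entry).
for k in [2, 4, 8, 16, 17, 18]:
    chi = 2 ** k
    entries = chi * 2 * chi
    gigabytes = entries * 16 / 1e9
    print(f"{k:2d} qubits crossing -> chi = {chi:6d}, ~{gigabytes:10.3f} GB per tensor")
```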
So how do you actually train it? This will speak a bit more to your question. The idea is that you take a given set of gates that you think are going to do a good job — these could be initialized from some classical pre-training, like the stuff I've been doing, or you just start with some random gates or something. Then you run the program multiple times on your data, also to estimate the output well — you have to get a good idea of what the actual output is, because it's probabilistic. Now you have an idea of how the model performs, and you take that output and feed the result to a classical algorithm, whose job is simply to propose new gates. It's that simple. This is the idea behind things like QAOA or the variational quantum eigensolver — the idea of hybrid quantum-classical algorithms, which is very popular right now in quantum computing: wait, we don't have to do everything on the quantum computer; we can just think of the quantum computer as something that's good at doing one part of the algorithm, and all the other parts can be done classically without too much cost. I mean, it sounds very simple — obviously there are some steps I'm leaving out here — but it really can be that simple. So one thing you can do — let's see, did I put a slide on this? Okay, not the thing I want to say right now — one thing people do is very, very simple; it's really as simple as this. It's called gradient-free optimization, and the idea is that you really do just propose a circuit, try it, then propose another circuit, try that, and so on, and just try to prune the circuits that perform poorly and advance the ones that perform well. These are things like genetic algorithms or particle swarm algorithms: you have a whole population of circuits, you evaluate all of them and see how they did, the ones that did well advance to the next step, the ones that did poorly you kill, and then maybe you crossbreed the ones that did well in some way. That's the idea — that's why they're called genetic or evolutionary algorithms: it's like evolution. You can actually do that, and it works. This paper I mentioned earlier by Benedetti and Perdomo-Ortiz — that's what they did: they have this Python particle swarm package, which they didn't write, they just used it; they input these unitary angle parameters into the particle swarm optimization, which proposes lots of circuits, and they kept testing circuits and re-optimizing, and they got very good performance. So you can do this, but there are other things you can do too: you can actually stick in specially designed unitaries that change the meaning of the output of the model, so that instead of the output being the label, the output is the gradient with respect to a particular unitary. So you can actually get the gradient out by putting specially designed unitaries in and estimating the output of certain qubits — that gives you the gradient, in another scheme. So you can do various things to train these; it's pretty interesting. I'll mention one other one on the next slide.
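As an illustration of the gradient-free, population-style training loop just described, here is a toy evolutionary optimizer over gate angles. Everything here — the function names, the crossover and mutation rules, the hyperparameters — is my own sketch, not the particle swarm package mentioned in the talk; the black-box evaluate(angles) stands in for running the circuit many times and scoring the measurement statistics.

```python
import numpy as np

def evolve_circuit_angles(evaluate, n_params, pop_size=20, n_generations=50,
                          keep=5, sigma=0.1, rng=None):
    """Toy gradient-free ('evolutionary') optimizer over circuit angles.

    evaluate(angles) -> score (higher is better). On hardware this would mean
    running the circuit with those gate angles many times and estimating a
    score from the measurement statistics; here it is any Python function.
    """
    rng = np.random.default_rng() if rng is None else rng
    pop = rng.uniform(0, 2 * np.pi, size=(pop_size, n_params))
    for _ in range(n_generations):
        scores = np.array([evaluate(angles) for angles in pop])
        parents = pop[np.argsort(scores)[-keep:]]           # keep the best circuits
        children = []
        while len(children) < pop_size - keep:
            a, b = parents[rng.integers(keep, size=2)]      # pick two parents
            mask = rng.random(n_params) < 0.5               # crossover
            child = np.where(mask, a, b) + sigma * rng.standard_normal(n_params)
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    scores = np.array([evaluate(angles) for angles in pop])
    return pop[np.argmax(scores)]

# toy usage: a fake score landscape standing in for circuit evaluation
best = evolve_circuit_angles(lambda th: -np.sum(np.sin(th - 1.0) ** 2), n_params=8)
```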
So, heroically, Bill, my colleague at Berkeley, took this on. He took on the very challenging task of training a discriminative machine learning model as a quantum circuit, while restricting only to operations you could actually do on quantum hardware. This is tough to do, because he didn't actually calculate gradients — as I mentioned, you can, but it takes special techniques. He did something simpler, but harder to get working, called SPSA, which I'll describe on the next slide; first let me show you the results he got. This shows how the algorithm performs when you run up to 5,000 steps. He kept looking at the test set — which you're not really supposed to do — but he didn't use it as training data; he just wanted to see how the model was actually going to perform. So he's training on the training data, and he keeps taking a peek at the test set to see whether it's going to work, and it just keeps working better and better and better, up to very high accuracy. And here it is over the last few tens of thousands of steps — lots of steps, because this is a very tricky training technique — and he's getting up to 99% performance. This is distinguishing two kinds of handwritten digits, one from another, using this tree tensor network applied to the data. So what algorithm did he use? He used this thing called the SPSA algorithm, and it's very simple. It says: go through the circuit and pick one angle that you want to improve — okay, I'm going to pick angle number three from unitary number 80, or something. Then, how am I going to improve that angle? I'm going to prepare two new circuits, one where that angle is slightly bigger and one where it's slightly smaller, evaluate both circuits and see which one does better, and then either just pick the better of the two, or move a little bit in the direction — whichever way the angle increased or decreased — with some empirical step size, which could maybe be a random step size or something. So you take these tiny steps, carefully, one parameter at a time, and that's SPSA. You can do this on quantum hardware, because all it requires is that you can evaluate whatever circuit you're testing — that's all. So he did it that way and got this good performance. Another thing we worked on was the idea that these circuits might be robust to noise. It could be that they're robust to noise because of the way we trained them, but we think it also has to do with the tensor network structure we picked: because it has this tree structure, even if one branch of the tree gets corrupted by some kind of random noise, other branches can go to the top and still give a good output. So the idea is that we were feeding the data through a tree — I should have drawn the tree on one of these slides, but the data goes up: here's the data as a product state, and it's fed upwards through a tree tensor network — and then there's some kind of noisy map that gets applied to the qubits, simulating actual noise that could be present in a quantum device. You apply this noisy map to the qubits, and it has the effect of corrupting the information in the qubits going up. This is basically increasing noise, and the red means things that are dangerously close to being misclassified, and the gray means they have been misclassified.
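Here is a minimal sketch of the one-angle-at-a-time, perturb-and-compare update as it is described in the talk (where it's referred to as SPSA). The epsilon, step size, and jitter are placeholder values of my own, and evaluate(angles) again stands in for estimating the circuit's cost from repeated measurements; treat this as a simplified sketch rather than a textbook implementation.

```python
import numpy as np

def improve_one_angle(angles, evaluate, index, eps=0.05, step=0.02, rng=None):
    """One perturb-and-compare update on a single gate angle.

    evaluate(angles) -> estimated cost (lower is better). Try the chosen angle
    slightly bigger and slightly smaller, then move a small, jittered step
    toward whichever did better.
    """
    rng = np.random.default_rng() if rng is None else rng
    up, down = angles.copy(), angles.copy()
    up[index] += eps
    down[index] -= eps
    cost_up, cost_down = evaluate(up), evaluate(down)
    direction = -1.0 if cost_down < cost_up else 1.0
    new = angles.copy()
    new[index] += direction * step * (0.5 + rng.random())   # empirical, jittered step
    return new

# toy usage: sweep over the angles one at a time, many times
angles = np.random.default_rng(0).uniform(0, 2 * np.pi, size=12)
cost = lambda th: np.sum((np.cos(th) - 0.3) ** 2)           # stand-in for circuit cost
for sweep in range(200):
    for i in range(len(angles)):
        angles = improve_one_angle(angles, cost, i)
```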
So you can see that for quite a lot of noise — this is a fairly realistic level of noise, and this is getting even bigger than a realistic level — you can actually still recover the right labels. These red ones are things that tend to sometimes give the wrong label, but they can still give the correct label if you just poll the output qubit enough times to really pin down what the right distribution is; if they're gray, it means they just give the wrong labels, so no matter how many times you measure, you're going to get the wrong thing. But it held on for a very long time, and it only really starts to get corrupted down here. So that was encouraging to see. In the last few minutes, let me just mention a few more ideas we had. One idea I'm pretty excited about is this: what can tensor networks bring to machine learning, to quantum machine learning, besides maybe a better choice of circuit and some of these nice properties like robustness to noise? Is there anything bigger they can bring, other than "I like this circuit, someone else likes that circuit"? Well, they can bring some pretty big things, I think, because you can actually get around the bottleneck of having only a very small number of qubits in near-term hardware. This is an actual quantum computer — one of the IBM devices — and it has just five qubits; the point is that the really high-quality devices right now don't have a lot of qubits. There are some that do have a lot of qubits, but the quality is maybe not as high, right? So can we still do interesting things with a small number of qubits? You can, if you use tensor network thinking to do it. So let me show you how to use tensor network ideas to sample a very large output, much larger than the number of qubits. Here's how it works. Let's say I have four qubits — that's all I have, that's all I get to use. What I do is entangle those qubits with a bunch of unitaries — this blob means a bunch of two- and one-qubit unitaries that collectively act like a four-qubit unitary. Then, after that round of entangling, of applying unitaries, I just measure one of the qubits: I collapse that one and record what I get, and I say, okay, I got zero, or maybe one. Now the other ones are still, in principle, entangled with each other, so they go on in time. Now I re-prepare the one that I measured and start entangling them all again — this unitary could be the same as that one, or it could be different; interesting things can happen either way. Now I measure that first one again, record the result, then reset it, entangle, measure, reset, entangle, and go on, one more and one more. In principle I can do this forever — of course, in practice I have a coherence time and it's going to decohere — but what I'm doing is trading circuit width for circuit depth, that's the idea. I can record all these outputs, and you see that here I have six outputs for only four qubits, so I can get many more outputs than I have qubits. And what am I really doing here? Are these going to be interesting outputs, and will this be an interesting way to get outputs — what am I expecting to see, what kind of properties can I generate in this output data? Well, the way I can understand it is to think of the qubits that I don't measure as some kind of hidden state, some kind of virtual space.
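A minimal classical simulation of this measure-and-reset loop might look like the following; the single dense "entangler" matrix, the qubit-ordering convention, and the reset-to-|0> bookkeeping are my own simplifications of the scheme described in the talk.

```python
import numpy as np

def sample_long_string(entangler, n_qubits, n_outputs, rng=None):
    """Measure-and-reset sampling: get a bit string longer than the qubit register.

    entangler: a (2**n, 2**n) unitary applied before each measurement, standing in
    for the block of one- and two-qubit gates (it could differ from step to step).
    Each step: entangle, measure qubit 0, record the bit, reset qubit 0 to |0>,
    and carry the remaining qubits forward as the 'hidden state'.
    """
    rng = np.random.default_rng() if rng is None else rng
    dim = 2 ** n_qubits
    psi = np.zeros(dim, dtype=complex)
    psi[0] = 1.0                                  # start in |00...0>
    bits = []
    for _ in range(n_outputs):
        psi = entangler @ psi                     # entangle the register
        blocks = psi.reshape(2, dim // 2)         # axis 0 = measured qubit (most significant bit)
        p0 = np.sum(np.abs(blocks[0]) ** 2)
        b = int(rng.random() > p0)                # sample the measured qubit
        bits.append(b)
        collapsed = blocks[b] / np.sqrt(p0 if b == 0 else 1 - p0)
        psi = np.zeros(dim, dtype=complex)
        psi.reshape(2, dim // 2)[0] = collapsed   # reset the measured qubit to |0>
    return bits

# toy usage: a fixed Haar-random entangler on 4 qubits, 6 output bits
n = 4
A = np.random.randn(2**n, 2**n) + 1j * np.random.randn(2**n, 2**n)
Q, _ = np.linalg.qr(A)
print(sample_long_string(Q, n, 6))                # e.g. [0, 1, 1, 0, 1, 0]
```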
That hidden-state size will grow exponentially in the number of qubits that I have. This is an eight-dimensional space, but if I add one more qubit it becomes a sixteen-dimensional space, and so on, so if I had 17 or 18 qubits this would be a really big hidden space that's transmitting the information about what measurement I got here to the next one, to the next one, to the next one. So you can think of it as a flow of information: once I collapse this one, that tells me something about the state of these other three, which will influence that one, and so on and so on. And we can connect this to tensor networks as follows. That's just some weird circuit, but if you interpret these circuit elements as tensors — which you can — and absorb anything fixed, these little fixed starting vectors and reset vectors, into those tensors, and then just reshape them — all I'm doing from this picture to this one to this one is redrawing, there's no change here — then, redrawing collections of indices as one big index, by the very last line that is a matrix product state. So you can actually show that this process of sampling, where I take four qubits and sample, sample, sample, sample, and then finally at the end measure them all, is precisely equivalent to preparing a matrix product state and sampling from it. So if I can understand what spaces matrix product states cover — what kinds of samples I can expect to get from a matrix product state — that's the exact same space of samples I'll be getting from this algorithm. That's the idea, and if I can use more and more qubits, I can sample from matrix product states of increasingly large bond dimension. So that's an idea we're pursuing right now. The last slide is to say that these ideas have been tried on actual quantum hardware. This is from another paper that came out around the same time as ours, by a group in London; they tried a little toy dataset called the Iris dataset and got very reasonable results. This is the circuit they used — it looks like a tree — and they actually carried it out on an IBM computer. So it's really only been done on toy data so far, but I think these are interesting ideas that might keep pushing what we can do with quantum hardware. So, okay — I had this other part of the talk prepared in case there was time, but there isn't, and that's okay. If you want to check out this other part, I'll post the slides; take a look at the paper, it'll be in the slides. This is the work I did about turning data into a density matrix and growing a tree tensor network — that's what the density matrix looks like after you explode it into a tree tensor network. You can actually calculate that: you basically diagonalize this exponentially big matrix that way, and then you can put things on top to do machine learning and get good performance, and use a partial number of layers and so on, and get state-of-the-art performance on these challenging tasks. I was trying to get to the conclusion slide — all right, thanks for your attention. [Applause]
Info
Channel: ICTP Condensed Matter and Statistical Physics
Views: 2,242
Rating: 5 out of 5
Keywords: ICTP, Abdus Salam International Centre for theoretical Physics, Condensed Matter, Statistical physics, Statistical Mechanics, Numerical methods, Collective Behaviour, Coherent dynamics, Quantum Matter, Topological quantum matter, entanglement, phase transitions
Id: kzbVHT36aLE
Length: 116min 51sec (7011 seconds)
Published: Tue Sep 11 2018