Can Machine Learning replace Signal Processing? - Prof. Nathan Intrator

Video Statistics and Information

Captions
Thank you. I'd like to thank Professor Muchnik and the other organizers for inviting me; it's a pleasure. I'm going to try to offer a somewhat unconventional view of some of the basics of neural networks. We are reviving neural nets from the '90s, and it's a good idea to remember what was there. Part of this will be done with the help of non-conventional colleagues, these animals, including the conductor. Each of these animals performs computation far better than we can, better than our machine learning or signal processing can, so it makes sense to try to understand what it is they are doing. I put an emphasis on their representation, to make the point that it is all about a sophisticated representation which makes the subsequent calculation very simple.

Let's get into it. For me it basically starts with Geoff Hinton in 1987. That was a paper where he already described several methods for training neural networks and started talking about ways to impose interesting structure on them. Around the same time he wrote a second paper, and both of these papers really talked about, first, autoencoding, and second, training for some kind of invariance. That was intriguing, and a few years later I realized that this idea of putting a bias into the hidden units, into the weights of the hidden units, can be taken in very many directions. Two years before that I had discovered a certain unsupervised learning rule for synaptic modification. That was really a neuroscience kind of analysis: I did a mathematical analysis and showed that a certain rule of synaptic modification actually performs what is called in statistics exploratory projection pursuit, a search for projections whose distribution is far from Gaussian. I defined a specific way of being far from Gaussian, namely having a dip at the center, where the mass of a Gaussian would be, basically splitting it into two or more Gaussians. Then I wanted to take that into back-propagation, and I introduced the idea of changing the weights of the hidden layer by gradient descent on a combination of terms: not just minimization of the error, and not just L2 or L1 regularization, weight decay and so on, but also imposing actual structure, for example this multi-modality, this exploratory projection pursuit.

At that time Gluck presented at NIPS a very interesting model of the hippocampus. It had the same autoencoder we talked about, but it also had something else: additional output units that try to learn some structure in the data, or basically perform classification. That intrigued me very much, and I took it into face recognition, which I was doing then like a lot of researchers. I realized that face recognition where you take beautiful pictures against a nice background, when everything is clean, is one thing, but real face recognition involves people moving around, some blur between them and the person watching them, and so on. So I wanted to do the following: take a back-propagation network that is trying to learn faces and do two tasks at the same time, autoencoding, namely trying to reconstruct the original face, and at the same time, of course, classification.
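As a rough illustration of this dual-task idea, here is a minimal sketch, not the speaker's original code: a single shared hidden layer trained jointly to reconstruct a clean image and to classify it. The framework (PyTorch), the layer sizes, the additive-noise degradation (which anticipates the degraded-input training described next), and the loss weight `alpha` are all my own illustrative assumptions.

```python
# Minimal sketch (not the original implementation): one shared hidden layer,
# trained jointly on reconstruction and classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualTaskNet(nn.Module):
    def __init__(self, dim=32 * 32, hidden=256, n_classes=10):
        super().__init__()
        self.encoder = nn.Linear(dim, hidden)            # shared hidden representation
        self.decoder = nn.Linear(hidden, dim)            # reconstruction (autoencoder) head
        self.classifier = nn.Linear(hidden, n_classes)   # classification head

    def forward(self, x):
        h = torch.tanh(self.encoder(x))
        return self.decoder(h), self.classifier(h)

def training_step(model, optimizer, clean, labels, alpha=0.5, noise_std=0.3):
    # The input may be a degraded copy (here: additive Gaussian noise), but the
    # reconstruction target is always the clean image.
    degraded = clean + noise_std * torch.randn_like(clean)
    recon, logits = model(degraded)
    loss = alpha * F.mse_loss(recon, clean) + (1 - alpha) * F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = DualTaskNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clean_batch = torch.rand(64, 32 * 32)                    # stand-in for face images
labels = torch.randint(0, 10, (64,))
print(training_step(model, opt, clean_batch, labels))
```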
So again, the point was to impose a representation on the hidden units that was much more informative for classification, because at the time it was easy to show that some of the experiments that had been done were actually discovering the background of the faces, and some were concentrating on the hairstyle, basically picking up whatever feature happened to be in the picture, and we wanted to improve on that. That was some improvement, but what came next was much more interesting. We realized that we could actually synthesize, to use the word that Laura mentioned yesterday, new data out of the same data. We started degrading the images, training the network on the degraded images, forcing it to try to reconstruct the original images and at the same time do the classification. The degradation became intense: all sorts of blurring mechanisms, by which I mean Gaussian blur, out-of-focus blur, blind deconvolution, added noise, and every other kind of image degradation you can think of. Because we were training the network to reconstruct the original images, the weights we created in the hidden units, convolutional weights, were orthogonal to, or relatively independent of, this degradation process. The classification performance on such degraded images was much, much better.

Could we do more with this? Absolutely. We realized that we could add some kind of recurrence: take a degraded image, run it through the network, take the output, and feed it into the network again, and usually in about two or three iterations we got an even better classification result. Again, although I was training to reconstruct the original image, the real goal was the classification, motivated by the fact that when we recognize someone who is partially occluded or blurred, we don't try to reconstruct, we just classify. Partial occlusion worked as well: we were able to show that this idea improved performance under partial occlusion too. So the whole idea of the autoencoder was to improve the internal representation for these various tasks.

Let me shift gears and talk about ensemble averaging. Ensemble averaging was very common for quite a while and was very successful; then people sort of stopped talking about it, and I hope to persuade you that it is really crucial. There are two questions when you do ensemble averaging. The first is how to train each of the experts when we know we are going to average them: do we train each expert to be optimal by itself, or do we take into account the fact that it is going to be part of an ensemble? The second, practical question is what happens when we take the wrong model for specific data; as I mentioned yesterday, in a high-dimensional space we cannot see anything, we cannot really know, so we are very likely to start with the wrong model. Let me give just two slides with some equations, something very simple. It all has to do with the bias-variance decomposition, not of a single expert but of an ensemble of experts. I define the ensemble to be the simple average of the Q experts, plug that into the decomposition, get an equation for each of the two parts, and when I plug these back I get the following: if I define gamma to be the maximum variance of a single expert plus (Q - 1) times the maximum covariance between the errors of two experts, then the variance of the ensemble is smaller than (1/Q) times gamma, where gamma reflects the variance of each expert as trained under these conditions. In the best case, when the errors are independent, the ensemble variance is 1/Q times the variance of a single expert; in the worst case it is simply the variance of the original members of the ensemble.
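The bound as I reconstruct it from the spoken description (this is the standard variance bound for a simple average of correlated experts; the exact notation on the speaker's slide may differ):

```latex
% Ensemble as the simple average of Q experts
\bar{f}(x) = \frac{1}{Q}\sum_{i=1}^{Q} f_i(x)

% Variance of the average in terms of individual variances and covariances
\operatorname{Var}\!\bigl(\bar{f}\bigr)
  = \frac{1}{Q^2}\sum_{i=1}^{Q}\operatorname{Var}(f_i)
  + \frac{1}{Q^2}\sum_{i\neq j}\operatorname{Cov}(f_i, f_j)

% With  \gamma = \max_i \operatorname{Var}(f_i) + (Q-1)\max_{i\neq j}\operatorname{Cov}(f_i, f_j):
\operatorname{Var}\!\bigl(\bar{f}\bigr) \;\le\; \frac{\gamma}{Q}

% Independent errors:  bound \approx \tfrac{1}{Q}\max_i \operatorname{Var}(f_i)
% Fully correlated:    bound \approx \max_i \operatorname{Var}(f_i)   (no gain from averaging)
```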
So with this in mind, notice that there is a very simple bound here. The question is: is it really useful, is it really practical? I'm going to demonstrate it on a problem that has been studied quite a bit and is highly nonlinear, the two-spirals problem. Many papers were written indicating that back-propagation cannot solve this problem, and that makes a lot of sense. Let me explain the problem for those who cannot see the X's and O's here: the task is to separate the X's from the O's, that is, to create the nonlinear boundary between them. Because the radius keeps changing, this is a highly nonlinear case both in polar coordinates (r, theta) and, of course, in Euclidean space, and the question was how to do it. This is a very non-trivial way to do it, and the student who worked on it was frustrated for a long time, asking me "is it really going to work?", because it didn't seem to make sense. What did I do here? I injected noise. The data set was, I think, 192 patterns, so you couldn't really learn the whole surface, you just had a few examples. The idea was, again, to synthesize more data, but in this case by adding a lot of noise, so much noise that X's and O's actually crossed the boundary, so I was effectively training on some wrong labels. The question was whether that could do anything useful; it seemed crazy.

Look at the error surface (this was with another student), look at what happens to the error surface as you train. This axis is time, and each curve represents a larger and larger ensemble: the top one is Q = 1, then Q = 4, and so on. There are two architectures here, and for Q = 1 the error of the first architecture is much smaller than that of the second, so if I were training just a single expert I would choose the first architecture. If you look carefully, the minimum is shifted, because when we ensemble we reduce only the variance portion of the error and do not touch the bias portion; since the optimum is where the variance and the bias are balanced, it shifts to a point where the variance of a single expert would be much higher but the variance of the ensemble is low. What is interesting is that if you take the minimum point of each of these ensembles and put them on a graph, you get a straight line, just as the equation on the previous slide predicted. As I mentioned, one architecture starts with a very high error; the axis here is 1/Q, so Q = 1 is at one end and large Q approaches zero, and what we see is that the slope is what really matters: when the slope of one architecture is larger, eventually the lines cross and that architecture produces the better result. Non-trivial. And apparently you don't have to train a lot of networks to find this out, because you can just take the first two points, draw the line, and determine which architecture is going to be better.
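A small illustration of that last point, with made-up numbers of my own: since the ensemble error at its optimum is roughly linear in 1/Q, two measured points per architecture are enough to fit the line and extrapolate to large ensembles. The error values below are hypothetical.

```python
# Hypothetical illustration: fit error vs. 1/Q from two ensemble sizes and
# extrapolate to decide which architecture wins for large ensembles.
import numpy as np

def fit_line(q_values, errors):
    """Fit error = slope * (1/Q) + intercept (intercept ~ bias, slope ~ variance term)."""
    x = 1.0 / np.asarray(q_values, dtype=float)
    slope, intercept = np.polyfit(x, np.asarray(errors, dtype=float), deg=1)
    return slope, intercept

def predicted_error(slope, intercept, q):
    return slope / q + intercept

# Made-up measured optima for Q = 1 and Q = 2 for two architectures.
arch_a = fit_line([1, 2], [0.20, 0.15])   # lower error alone, smaller slope
arch_b = fit_line([1, 2], [0.30, 0.19])   # higher error alone, larger slope

for q in (1, 5, 40):
    print(q, predicted_error(*arch_a, q), predicted_error(*arch_b, q))
# For large Q the line with the larger slope but smaller intercept crosses below.
```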
So let's take this back to the problem I was talking about, the spiral problem. What is nice about this problem is that it is two-dimensional, so you can see the error surface very clearly, which makes it a nice demonstration of what is going on. What we see here, going from left to right, is the result of an ensemble, in this case of five large networks, as we increase the level of the injected noise; I showed one slide earlier of the noise level that turned out to be optimal. When we increase the noise we do get to a region where the ensemble appears to be slightly better. Of course regularization should be added, so here we see the same picture with regularization, giving a slightly smoother result, and when we take this to an ensemble of forty members we actually recover the error surface very nicely. So what we have here is an ensemble with no noise injection, noise injection with no regularization, regularization with no noise injection, and regularization and noise injection together. What did we do? We increased the noisiness, the variance, of each of the experts by injecting this noise, but we gained something in return: independence between the errors of the experts, and that independence then improved the performance, reduced the error, of the ensemble.
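A toy sketch of the recipe just described. The spiral construction, the library (scikit-learn), and all hyperparameters here are my own illustrative choices, not the values from the talk:

```python
# Toy version of noise-injected ensemble training on the two-spirals problem.
import numpy as np
from sklearn.neural_network import MLPClassifier

def two_spirals(n=97, rng=None):
    rng = rng or np.random.default_rng(0)
    t = np.sqrt(rng.uniform(0, 1, n)) * 3 * np.pi
    spiral = np.column_stack([t * np.cos(t), t * np.sin(t)])
    X = np.vstack([spiral, -spiral])
    y = np.hstack([np.zeros(n), np.ones(n)])
    return X, y

rng = np.random.default_rng(42)
X_train, y_train = two_spirals(rng=rng)                    # ~194 clean training points
X_test, y_test = two_spirals(rng=np.random.default_rng(7))

Q, noise_level = 5, 1.0                                    # ensemble size, injected noise (illustrative)
experts = []
for q in range(Q):
    # Each expert sees a freshly noise-corrupted copy of the same data,
    # which decorrelates the experts' errors.
    Xq = X_train + noise_level * rng.normal(size=X_train.shape)
    net = MLPClassifier(hidden_layer_sizes=(40, 40), max_iter=2000, random_state=q)
    experts.append(net.fit(Xq, y_train))

# Simple average of the experts' predicted probabilities.
avg_prob = np.mean([e.predict_proba(X_test)[:, 1] for e in experts], axis=0)
print("ensemble accuracy:", np.mean((avg_prob > 0.5) == y_test))
```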
So the question is: are we the only ones doing this? The answer is no, and we now shift to animals, and soon to more serious signal processing. This little animal, the sonar bat — there are about eight hundred species of bats using sonar to catch their prey — uses about two mealworms a day for energy. It has fewer neurons than the number of transistors in a Pentium, and the amount of real-time computation it does is something we cannot match: hook together any cloud you like and we still cannot reach the accuracy of this animal. So obviously it makes sense to study it. And this one, the dolphin, has a much more sophisticated sonar and a very sophisticated way of acquiring the data; if we could produce an ultrasound with the resolution of a dolphin's sonar, we would probably be able to detect cancer and fetal problems much, much earlier. So there is a very strong incentive to study this animal, and it is in fact studied quite a bit: there is a place where these dolphins swim with microphones attached to their heads, with GPS, with every possible thing you can think of, so you see exactly the signal they send and record exactly the signal they receive. It is a black box, and you try to understand this black box. Let's see if we can understand it with the help of signal processing.

The Heisenberg uncertainty principle basically says that you cannot localize simultaneously in time and in frequency beyond a certain constant, which makes sense, and I believe you will agree with this simple example: if the bandwidth is zero, namely it is a fixed frequency, then of course you cannot localize in time at all, and as you increase the bandwidth — a delta function has infinite bandwidth — you can localize better and better in time.

A series of papers that we wrote, some of them with Jim Simmons of Brown University, one of the most famous bat researchers, revealed something very interesting. (How much time do I have? Ten? OK.) This is the signal the bat sends: a chirp, going from high frequency to low frequency. When the bat analyzes the returning signal, it actually analyzes it in different bandwidths separately. We know from Woodward's theory from 1953 that the optimal matched filter should be the exact signal itself, so the bat is doing something that Woodward, and a few thousand papers written after him, would say is wrong, because each expert by itself, analyzing only part of the bandwidth, produces a much higher error. For those of you who are interested, it is fairly easy to see what is going on. This is the cross-correlation function, magnified at the top. Higher frequency produces a smaller error: noise on the y-axis translates into temporal noise, into the ability to estimate on the x-axis, so the sharper peak of course gives higher accuracy. But when the noise is such that the estimate crosses over to what are called the side lobes — this regime is called the coherent receiver, and that one the semi-coherent receiver — the error suddenly jumps quite a bit, because the side lobes may be very far away, and the filter that was very accurate here has the highest side lobes and is therefore more sensitive to noise. This non-trivial behaviour of the cross-correlation function kept us puzzled, and if you look at the distribution of the error, it is as far from Gaussian as you can imagine: basically a uniform distribution plus a delta function, so when you try to average multiple experts you gain nothing. Nevertheless, pretty much all sonar animals perform this kind of multiple observation and averaging. To cut a long story short, by simulating exactly what the sonar bat does, namely analyzing each bandwidth with a separate expert and then ensembling them, we were able to improve on what is called the Woodward equation. The picture is this: one axis is SNR and the other is performance, and as the SNR goes down, namely the noise becomes stronger relative to the signal, there is a break from the coherent receiver to the semi-coherent receiver and the error jumps by a large factor. The whole idea is to push that break point as far as possible, to stay in the coherent regime for as long as possible, and we were able to show that this is possible by splitting the bandwidth. I will not go into it further; the point is that we could then combine such receivers to improve performance, based on the observations from the bat.
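To make the matched-filter part concrete, here is a small, self-contained sketch (my own toy construction, not the speaker's simulation): a chirp echo is detected by cross-correlating with the full template, the Woodward-optimal matched filter, and with a band-limited sub-band template standing in for one of the "experts" whose estimates the animal appears to average. All parameters are illustrative.

```python
# Toy matched-filter demonstration: full-band chirp template vs. a sub-band template.
import numpy as np
from scipy.signal import chirp, butter, filtfilt

fs = 250_000                                    # sample rate (Hz), illustrative
t = np.arange(0, 0.003, 1 / fs)                 # 3 ms call
call = chirp(t, f0=80_000, f1=30_000, t1=t[-1], method="linear")  # downward chirp

# Received echo: delayed call plus noise.
rng = np.random.default_rng(0)
delay = 400                                     # samples
received = np.concatenate([np.zeros(delay), call]) + 0.5 * rng.normal(size=delay + len(call))

def correlate_and_peak(template, signal):
    corr = np.correlate(signal, template, mode="valid")
    return int(np.argmax(np.abs(corr)))

# Full-band matched filter (template = the signal itself).
full_peak = correlate_and_peak(call, received)

# Sub-band "expert": template restricted to the 60-80 kHz part of the band.
b, a = butter(4, [60_000, 80_000], btype="band", fs=fs)
sub_peak = correlate_and_peak(filtfilt(b, a, call), received)

print("true delay:", delay, "full-band estimate:", full_peak, "sub-band estimate:", sub_peak)
# Averaging several such sub-band estimates is the ensembling step described above.
```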
Let's move to another animal: the mole rat. It uses infrasound, lives underground, bangs its head against the tunnel wall, receives the returns, and builds its three-dimensional representation of the surroundings from that. What is very interesting is this: if you were telling yourself that the amazing performance of the bat is probably the result of millions of years of evolution, this animal tells us the opposite. The mole rat is born with its eyes connected to the visual cortex, and since it lives underground, within about four months the eyes become disconnected from the visual cortex and the auditory cortex actually invades the visual cortex. So within the very short early lifetime of this animal it discovers that the eyes are not producing any stimulation, and with the same network, the same architecture, it forms a totally different computational engine to compute the three-dimensional representation. For those of you familiar with radar, this mole rat is doing what is called synthetic aperture, which we do with radars and thought we had invented; no, the mole rat does it, the bat does it, the dolphin doesn't need to.

So what did we do here? Let me quickly jump to the technicalities again. We wanted to demonstrate the same idea of multiple observations with this animal, but now the bandwidth is very low, because this is infrasound, so you cannot do the trick the bat is doing and split the band. It turned out that we should analyze the returning signal not with a single matched filter but, since we wanted to retain the bandwidth and the only other parameter left to change was the phase, with an array of mismatched, phase-shifted filters. This leads to a very interesting theory of biased estimators; it took about six papers to nail it down in various directions. The point is that the cross-correlation function now has non-symmetric side lobes, and with machine learning one can do something very interesting with that. The bottom line is that with this infrasound we were again able to reduce the error by using an array of experts, each one of which is not optimal on its own. So I believe that if animals are doing it, we have to go back and do it ourselves.
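Again purely as an illustration of the idea, and my own construction rather than the estimator developed in those papers: an array of phase-shifted copies of the template, each a deliberately mismatched correlator, whose delay estimates are then combined. All parameters are illustrative.

```python
# Toy phase-shifted (mismatched) filter bank for a narrowband infrasound pulse.
import numpy as np

fs = 2000                                        # sample rate (Hz), illustrative
t = np.arange(0, 0.5, 1 / fs)
f0 = 60                                          # narrowband pulse frequency, illustrative
envelope = np.hanning(len(t))
template = envelope * np.cos(2 * np.pi * f0 * t)

rng = np.random.default_rng(1)
delay = 300
received = np.concatenate([np.zeros(delay), template]) + 0.8 * rng.normal(size=delay + len(t))

def phase_shifted(phase):
    """Same envelope and frequency, deliberately wrong carrier phase."""
    return envelope * np.cos(2 * np.pi * f0 * t + phase)

phases = np.linspace(0, np.pi, 8, endpoint=False)    # array of mismatched experts
estimates = []
for phi in phases:
    corr = np.correlate(received, phase_shifted(phi), mode="valid")
    estimates.append(int(np.argmax(np.abs(corr))))

# Each expert is biased in its own direction; combining them reduces the error.
print("true delay:", delay, "expert estimates:", estimates, "combined:", int(np.median(estimates)))
```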
Everything I have described so far fits a shallow, single-hidden-layer network, and of course multiple hidden layers as well. We took it further by adding several factors which I am not going to get into, and we were able to show that a certain accuracy can be achieved with much reduced energy and an optimized number of pings; by the way, the dolphin sends between 60 and 200 pings to explore a given target, so the number of pings varies, and we were able to demonstrate that there is an algorithm to optimize it.

I am running out of time, so let me just mention another colleague. Why am I mentioning the elephant? The elephant has seismic sensors in its legs and is actually able to predict earthquakes. It also communicates with others this way: if there is a fire in the forest, elephants about five kilometres away will start running away from it. An amazing animal, and it reminds us that the most sophisticated signal processing, at least to my mind, was invented in France in the '80s to analyze seismic data. Seismic data is considered very difficult, and obviously these elephants are able to resolve it. I will just touch on two things related to this. Remember the conductor among the colleagues I showed at the beginning: what is so special about an orchestra for us? It is again a very difficult signal processing task that we do not know how to solve. We can sit in a concert hall, hear the whole orchestra with a single ear, and then decompose the different instruments in our brain; we do not know how to do that in signal processing. We actually use what we call the color of the sound, the timbre, and we are very sophisticated at it. This is something I have been trying to do in the last few years, both for earthquakes and for EEG data analysis. I will mention just one example: this is the Fukushima earthquake from a few years ago, and when I pass the data through this machinery of harmonic analysis and machine learning, I can find features indicative of abnormalities related to the earthquake. In the Fukushima event, between 14 and 3 hours before the event there was an abnormality that indicated a coming earthquake, and the same happened in the aftershocks. This has now become a big project in which we analyze different areas of the world to determine where we can, and where we cannot, predict earthquakes.

Skipping the earthquakes, maybe I will just say one word about this. As you see here, there are a lot of microphones; as I mentioned, we do not need all of them when we analyze sound, and the question is whether we need that many electrodes when we analyze EEG. EEG is an interesting signal, because the real seminal work on EEG was done in 1924 with prehistoric signal processing tools: Hans Berger was able to show that certain frequencies are associated with certain activities in the brain, and this work was so groundbreaking that people did not really try to see whether there is much more that can be done. That is really what I am doing these days: trying to remove the need for something like this and to analyze the brain with just one ear, so to speak, or with two electrodes, based on the principles I mentioned. I am not going to go into it here.

We were talking about advanced signal processing, and this was mentioned yesterday: there was a panel a year ago at the Technion with two of the top people in signal processing, Raphy Coifman from Yale and Stéphane Mallat, together with several others more familiar with deep learning. First of all, I certainly recommend watching it, it was very interesting, but let me highlight two things before I finish. Coifman, who is pretty much the authority to say so, said that imposing structure that has to do with harmonic analysis is very difficult, and he basically does not believe it is going to be easy to do with deep networks in the next several years; even modelling the reverberations occurring in this room right now while I speak is a very sophisticated task that relies on a lot of structure. And Mallat, who has started looking into deep architectures, basically said that the fast Fourier transform, and even wavelet packet analysis, are themselves deep architectures: you can present them as layers of convolutional networks, so the only difference is basically how you impose the weights in those units, and that is an interesting approach.

So let me summarize. Signal processing theory tells us that a single expert is not optimal, and I hope I was able to persuade you of that. Training for invariance can go a long way, and one can do a lot more with it. Advances in brain research will improve AI, there is no doubt about that, and my hope is that advances in AI and signal processing will improve the way we research the brain. Thank you very much.
Info
Channel: Компьютерные науки
Views: 10,401
Keywords: Яндекс, Yandex, Machine Learning, Prospects and Applications, ШАД, SHAD, Can Machine Learning replace Signal Processing?
Id: jgdTQqCxMQ0
Length: 35min 0sec (2100 seconds)
Published: Fri Oct 23 2015