Emily Fox: "Interpretable Neural Network Models for Granger Causality Discovery"

Video Statistics and Information

Captions
Okay, so in this talk I'm going to discuss some of the challenges we face in deploying deep learning in time series analysis. In particular, we're going to think about moving beyond prediction in large data sets, and in one case in particular we're going to talk about learning the structure of interaction between the individual time series, and think about how we can use neural nets with specific architectures, combined with sparsity-inducing penalties, to start learning these types of interpretable structures.

So I think the first thing that's important to hammer home is the fact that time series data appear everywhere, and the statement is increasingly true based on the development of new recording devices and other innovations. For example, this picture here might represent streams of posts or views of users on these platforms, or purchase histories of users on these e-commerce sites. There's also been a lot of development of wearable devices that provide us with really interesting streams of activity data, and in the field of health care there have been developments like electronic health records and different monitoring devices that allow us to start assessing the status of a patient's health over time.

But until recently, machine learning has for the most part really ignored time series, and a question is why. Well, it's hard. If we think about the traditional dynamical models that we use in these cases, the number of parameters tends to grow really rapidly with the number of observed time series as well as with the complexity of the dynamics that we wish to capture, so what this means is we need that much more data to start learning efficiently. Likewise, the algorithms associated with these dynamical models tend to be much more computationally intensive per data point, so that means we need even more compute power. And finally, the theoretical results that we like to establish tend to be much harder to derive when there are dependencies present between our different observations.

Okay, well, what should we do? Should we just ignore these types of data, or treat the observations as i.i.d.? To see why that might not be a good thing to do, let's think about what I'm faced with every week: I go to choose which restaurant my family's going to go to, because my son always likes to eat either pizza or sushi. If I guess at random which restaurant he wants to go to, I'm going to get it right about 50% of the time, but if I take advantage of the fact that if we went to pizza last week he's very likely to want to go to sushi this week, or vice versa, I can start getting much higher prediction accuracy. So hopefully this has motivated you to care about time series. The key point here is that if we leverage the temporal dependencies present in our data, this can be quite useful in a wide variety of different settings.

But really, the analysis of time series or sequential data is all the rage right now, so what's changed? Well, it's really the combination of massive web-scale time series and sequential data, combined of course with large compute power and important developments in deep learning, that's led to a number of success stories in many different areas. In particular, if we think about RNNs and different variants like LSTMs, GRUs, WaveNet, and seq2seq, these really underlie a lot of the success stories we've had in reinforcement learning, speech generation, machine translation, speech recognition (which we just heard a talk about), other aspects of NLP, analysis of medical records, and the list goes on and on.
But these success stories really rely on three critical components. One is the fact that we have this massive data, and in these cases we often think of having lots and lots of replicated time series: lots of correspondence data, or lots of trials of a robot navigating every part of a maze, or lots of transcribed audio. But in a lot of applications, of course, that isn't the case, especially if we think about scientific applications. And there are a lot of settings where we might actually have lots of data, but not lots of data for the question that we're interested in answering. For example, imagine we're Amazon and we want to do demand forecasting in order to stock inventory. Well, we have lots and lots of purchase histories for all the products in our inventory, but what if we get a new product and we want to forecast demand for that product? This is an example where we have no previous history, so we have limited data for the question of interest.

Another really critical component to success in the cases that we've seen is what I'll call manageable contextual memory. RNNs are touted as having this really powerful ability to capture a rich history of the sequential input, and that's indeed true, and to leverage that in forming predictions. But the big success stories we've seen are cases that really have a lot of what I think of as structure in the input that's being seen. For example, we've seen this structure in the maze before, because we have this massive data set, and out of that it's pretty easy to assume that we've seen these words used in this context before, or we've seen a patient with these types of symptoms and these types of test results. But that's not universally the case. If we think about forecasting a really complicated weather system, we might be observing air temperature, dew point, relative humidity, and the list goes on; this is an actual data set that's collected, and this is a really, really complicated nonlinear process. To think about learning in this context, you just need an enormous, enormous, enormous (I don't know how many times to say that word) amount of data to start learning these nonlinear relationships. So if we throw this data into an RNN as a black box, we wouldn't expect to get the kind of performance that I think we could get if we leverage prior information about the structured relationships between these really complicated variables that form the input.

And the last thing that I think has been really critical to success is the fact that in these cases there's a clear prediction objective: for example, word error rate for speech recognition, BLEU score for machine translation, or the reward function in reinforcement learning.

So in this talk, what we're going to do is think about moving beyond prediction on big data, and we're going to talk about other really important time series analysis tasks. One of them is characterizing the actual dynamics of the process that we're observing. Another one, which I think is really important, is how to efficiently share information in limited data scenarios. One that we're going to dig into in detail in this talk is learning interpretable structures of interactions amongst our observed time series, and we're also going to quickly touch on this idea of non-stationary processes and other issues like measurement biases.
These are all cases where I think deep learning has a lot of potential for impact, so let's talk a little bit more about what that might be.

Let's start with this idea of characterizing the dynamics of the observed process. A really classical approach used in time series analysis here is spectral analysis: analyzing the frequency content of our observed signals, or maybe doing a time-frequency analysis. This allows us to start asking questions about properties of the process like local stationarity and other things like this. Just to show one little example of where spectral analysis can be really important: in neuroimaging data, there are known frequency bands of particular interest. One data set that my group has spent a lot of time looking at is MEG, where a person sits in a chair wearing a helmet of spatially distributed sensors that record the magnetic field induced by underlying brain activity, and what we get out is this really rich description of brain activations over time. Our recent focus has been on understanding which networks are activated in the brain in response to different stimuli, and one of the cool things about taking a spectral approach is that in the frequency domain it's really straightforward to define things like conditional independencies between these individual time courses.

Okay, well, let's look at a totally different task of trying to characterize the dynamics of human motion. Here this person is wearing a mocap suit, and if we look at the recordings coming out of that suit, here's just a little snippet of what they look like. Overall it looks like a really complex dynamical process; you'd imagine having to use a really complicated dynamical model to capture those dynamics. But really, you can think of this process as switches between a set of simpler dynamic behaviors: the person's doing some jumping jacks, then arm circles, then knee raises, and so on. What we're interested in doing is learning these underlying simpler dynamic behaviors, and then the switching process between these behaviors, to characterize the overall complex process. We developed a fully unsupervised method for doing this and analyzed a large collection of motion capture videos, and I'm just showing some of the underlying simpler behaviors that we discovered in these processes.

This idea of taking a complex dynamical process and describing it as switches between a set of simpler dynamic behaviors is useful in a variety of different domains. Some that my group has looked at are automatically parsing EEG recordings, speech segmentation, detecting changes in the volatility of financial indices, and segmenting a human chromatin sequence. But the key take-home message here is that these are really useful representational structures for characterizing the dynamics of the process that we're observing, and I think there are a lot of opportunities for embedding, or completely deploying, deep learning techniques within these representational structures to get at nicer and more descriptive formulations of these processes than classical approaches may have allowed, but in a way that allows us to extract more meaningful and interpretable information from the learned processes.
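To make the switching idea concrete, here is one generic way such a model can be written down; this is a sketch in my own notation (the actual models in this line of work add further structure, such as nonparametric priors), where a discrete behavior label z_t evolves as a Markov chain and selects which simple autoregressive dynamic generates the next observation:

```latex
z_t \mid z_{t-1} \sim \pi_{z_{t-1}},
\qquad
x_t = \sum_{k=1}^{K} A_k^{(z_t)}\, x_{t-k} + e_t,
\qquad
e_t \sim \mathcal{N}\!\left(0, \Sigma^{(z_t)}\right),
```

so each value of the label indexes one of the simpler behaviors (jumping jacks, arm circles, and so on), and learning recovers both the per-behavior dynamics and the switching process.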
Okay, so the next thing that I want to think about is how to efficiently share information in limited data scenarios. Let's go back to this idea that we're Amazon and we want to do demand forecasting for the products that we stock. Maybe we have skis and ski jackets that are purchased more in winter months, roasting pans purchased close to holidays, running shoes in the spring, and car seats might not have much of a seasonal trend at all. The idea, though, is that we have purchase histories for every single product that we carry, and we'd like to form long-range forecasts, for example the next year of demand for a given product, based on the histories we have for those products. We'd also like to do cold-start forecasting: we have a new product, like we talked about before, and we want to forecast its demand when we have no history at all. So this is an example of a prediction task with large data, and it sounds like a classic deep learning scenario, but remember that there's not much data for the question of interest, especially in the cold-start or warm-start scenario.

One method that we looked at was a low-rank description of a specific data representation that lets us leverage repeated patterns we see across years and across products, combined with side information like product descriptions that are fed through a regression component, where for that regression component we use a neural network as our function approximator. This is just a small example of how you can use deep learning within a structured context, bringing the power of these methods into more limited data scenarios.

Using this approach, we analyzed some Wikipedia data, treating it as a proxy for product demand; it's really page traffic counts across a six-year period, and the metadata we have are the article summaries, just the first few paragraphs of text for each article. I'm going to show some of the results qualitatively. Here is a long-range forecast where we have past years of the article "Apollo" and we want to forecast an entire year of page traffic; here we're showing what the true page traffic was, and here is our prediction. We're able to capture this weekly oscillatory structure as well as the more global trend, including the dip down in the summer months, which might be due to schoolchildren looking up this article less often. Here's another article, "Economics," also capturing the weekly oscillation but a very different global trend. I just want to pause for a second and mention that seasonality is something that's also really important in time series analysis, and I think there are ways it could be built in when we're thinking about deep learning in certain contexts, so that this kind of structure is learned efficiently.

But let's turn to the cold-start forecasts that we'd like to make. This is an article with no past history in our training data, and we see that we've been able to forecast the page traffic in this case. Here's an article where we don't do so well: the NCAA Men's Division I basketball championship, where there's an unpredictable spike of activity that our method is not able to capture. We do get a bump up around this time period, and that's probably due to March Madness that we might have seen in other related articles.
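As a rough illustration of how the pieces above might fit together, here is a minimal sketch of my own (an assumed wiring, not the exact model from the talk): the demand or page-traffic matrix gets a low-rank decomposition whose per-product factors are predicted from product metadata by a small neural network, so a brand-new product can be forecast from its description alone.

```python
# Minimal sketch (assumed wiring, not the exact model from the talk): low-rank
# seasonal structure shared across products, with product factors regressed from
# text features by a small neural network to handle cold-start forecasting.
import torch
import torch.nn as nn

n_weeks, rank, n_text_features = 52, 10, 300

# Shared low-rank basis of weekly curves, capturing patterns repeated across
# products and years (e.g. weekly oscillation, seasonal dips).
V = nn.Parameter(0.1 * torch.randn(rank, n_weeks))

# Regression component: maps a product's text features to its low-rank factors.
text_to_factors = nn.Sequential(
    nn.Linear(n_text_features, 64),
    nn.ReLU(),
    nn.Linear(64, rank),
)

def forecast(text_features: torch.Tensor) -> torch.Tensor:
    """Forecast a year of demand from metadata alone (works for cold-start items)."""
    U = text_to_factors(text_features)    # (batch, rank) product-specific factors
    return U @ V                          # (batch, n_weeks) predicted demand curves

# Training (not shown) would fit V and the regression network by minimizing squared
# error against observed histories for products that do have past data.
```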
Okay, so now I'd like to turn to something that takes us away from prediction. Up to this point, in the product forecasting case, we've been talking about a forecasting task, but let's think about something which is actually a collaboration that I've had with Zillow over the past few years, where the goal is to model a local-level housing index. In particular, our goal is to estimate housing value at a really local level, like a census tract, and how that value changes over time. The challenge is that the data are spatiotemporally really, really sparse. Again, this is an example where collectively in our data set there are hundreds of thousands of house sales, but for our question of interest, which is spatiotemporally localized, there isn't much data: more than 10% of census tracts have fewer than one house sale on average per month, and more than 40% have fewer than three house sales. We can look at this visually: here's one census tract with lots of observations, where each dot is a different house sale, and it doesn't look that challenging to estimate; the red curve is our estimate of the latent value in this tract, with some uncertainty about it. But here's a census tract with only four house sales over a 17-year period. How can we start to think about doing this in this truly limited data scenario?

What we proposed was a method that discovers clusters of correlated time series, so that if neighborhoods are discovered to be within the same cluster, they share information with one another and improve the robustness of our estimates. We applied this method to some City of Seattle data, and these are some qualitative results that have a lot of interpretability, which you can read more about in the paper. But we can also look at things more quantitatively. I said prediction is not our goal, and it isn't, but we can use it as a proxy to assess the quality of the index that we're trying to form. What we do is hold out a set of individual house sales, and then we compare our housing index to the industry-standard Case-Shiller index, using each within a house-level prediction model and comparing performance; that should get at the difference in our ability to estimate the underlying value. This is our percent improvement over Case-Shiller on these held-out predictions, breaking the analysis into the 5% of census tracts with the most observations all the way down to the 5% with the fewest observations. We have improvements across the board, but not surprisingly, the most significant improvement from this clustering approach is for census tracts that have few observations, where they can really benefit from the sharing of information; it's not so challenging to analyze the dense census tracts without that type of sharing.

So just to step back and recap: what we're talking about are structures that allow us to leverage our limited available data efficiently. This is something my group has thought quite a bit about over the past years: clusters and hierarchies of time series, sparse directed interactions, low-dimensional embeddings like the low-rank structures we talked about in the cold-start challenge, and even the switching model that we talked about earlier is a way of sharing information across time between behaviors we see again and again. I think these are really important structures, and there are others out there, for embedding the power of deep learning approaches, going beyond simple linear Gaussian dynamical processes, but in ways that allow us to start deploying these methods in limited data scenarios.
Okay, so now I just briefly want to touch on this idea of non-stationarity and measurement bias. This is another collaboration we've had with Zillow, where the goal is to analyze the dynamics of homelessness. It's another very, very data-scarce situation: we get counts of homelessness on a single night, where volunteers go out with clipboards and literally try to count the homeless population. The count method varies from metro to metro, and it varies over time as they think of better strategies to use. Of course, we get better counts of the homeless who are in shelters, since it's easier to count them than those who are unsheltered on the streets, and the percentage of homeless who are sheltered varies quite dramatically between metros; Los Angeles is an example of a highly unsheltered population. The net result is that there's significant measurement bias. What this means is that we can't treat the observations directly and naively as observations of the homeless population in the same way across time and between metros, and even for any one of them, it's not a very direct measurement of the actual homeless population; there's uncertainty about that in a very interesting way. So what we did was devise a Bayesian dynamical model that accounts for this imperfect measurement mechanism, as well as changes in count quality over time. Another really critical component of our analysis was capturing the non-stationary dynamics of the total population, not the homeless population but just the total population in a metro. Using this model, one thing we can do is form a year-ahead forecast of the homeless population, and we're showing that for San Francisco here. But the key thing I want to emphasize is that the challenges faced in this data set are endemic to lots and lots of data sets: sparse data situations, which we've talked about quite a bit; non-stationary processes, which are really, really common in time series and for which there isn't yet a principled deep learning approach; and also this really important idea of measurement biases. If we just take the data and shove it into an RNN, I would have some concerns about the inferences that we draw from that analysis.

Okay, so that's kind of food for thought. At this point we've covered three different topics at a high level, and for the rest of the talk I want to dig into one last question at a much deeper level, which is learning the structure of interactions between our observed time series. Why might we care about this in the first place? In a lot of scientific applications, this structure of interactions is the thing of interest. For example, if we go back to understanding these networks in the brain, our goal is to understand what these networks are, how they differ in response to different stimuli or different tasks, and how they differ between clinical populations, like those with schizophrenia or autism; having an interpretable structure is really critical to those types of questions. Or if we're in biology, maybe we're interested in a gene regulatory network, where molecular regulators interact with one another, and there's interest in characterizing what those interactions are. And, oh, there's supposed to be a movie here; okay, technology. Well, pretend there is a movie of basketball being played, with players interacting on a court.
Oh, there it is. Okay, so there are also lots of scenarios outside science and medicine, and at least this gives you a better sense of what I'm asking you to imagine. There are lots of cases where we're interested in interactions between people, or objects, or people and an object, in this case between players and the ball. Let's think about this basketball example because it's very intuitive. Here we're interested in discovering directed interactions between the players and the ball, and just as an example, maybe one thing we'd be able to infer is that the position of the point guard at time t is highly influential on the position of the ball at time t+1. These are the kinds of statements that we'd like to be able to infer.

To start getting at these questions, we can use something called Granger causality selection. A classical approach to Granger causality selection assumes that you have a linear model. Here I'm showing things for just two time series, and we're saying that the value of the time series at time t is just a linear function of two lagged values of those series, where the linear relationships are captured by these lag matrices A(1) and A(2), and there's some additional noise added. Here I'm writing the equation more generically for an arbitrary number of time series, assuming some K different lags. We say that series i does not Granger-cause series j if and only if the (j,i) entry of this A matrix is zero across all lags. What we're saying is that, in the case I just showed, series i is irrelevant in forming the prediction of series j. In pictures, if you're used to thinking about graphical models, we're talking about all directed interactions from series i to series j, or series j to series i, and asking which of these directed interactions are relevant for the evolution of the process.

A modern approach to Granger causality selection that's pretty popular is a penalized log-likelihood approach: we maximize our log-likelihood over all possible lag matrices, to explain the data well, and then add a penalty term, where the specific penalty encourages structured zeros so that the zeros appear across all of our lags. More formally, in a lot of cases this boils down to minimizing our reconstruction error and then applying what's called a group lasso penalty, which is an example of a penalty that encourages an entire set of parameters to go to zero together.

So what's the issue with this linear approach? Well, of course, in a lot of scenarios, such as the ones I've shown here, the dynamics are almost assuredly nonlinear. So what can we do? What we're going to do is, for every time series, introduce a nonlinear map of the past values of all series to that particular series, and we still have an additive noise term. Then we can say that series i is not Granger-causal of series j if that nonlinear mapping is invariant to the past values of series i. In our specification, we define these nonlinear mappings using neural networks and then penalize the weights of the neural networks to be able to identify these invariances.
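For reference, here are the two formulations just described, written out as equations; this is a sketch in my own notation, with x_t in R^p, maximum lag K, and penalty weight lambda. The linear model and its group-lasso estimator:

```latex
x_t = \sum_{k=1}^{K} A^{(k)} x_{t-k} + e_t,
\qquad
\min_{A^{(1)},\dots,A^{(K)}}
\sum_{t} \Big\| x_t - \sum_{k=1}^{K} A^{(k)} x_{t-k} \Big\|_2^2
\;+\; \lambda \sum_{i,j} \Big\| \big(A^{(1)}_{ji}, \dots, A^{(K)}_{ji}\big) \Big\|_2,
```

with series i not Granger-causing series j exactly when A^{(k)}_{ji} = 0 for every lag k. The nonlinear version replaces the linear map for each series j with a function g_j,

```latex
x_{j,t} = g_j\big(x_{t-1}, \dots, x_{t-K}\big) + e_{j,t},
```

and series i does not Granger-cause series j when g_j is invariant to the past values of series i.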
The most straightforward approach, the one you'd probably think of first, is to just take the K past lag values of the time series and shove them through an MLP that predicts the full set of outputs; here I'm imagining we have three different time series. But because of this tangled set of interconnections between all the hidden units, it's really difficult to infer these statements of Granger non-causality. Another issue is that all of the time series have to rely on the same set of lags in this specification. So instead, the first thing we do is take a neural-network-based approach for each time series separately: we take the lags of all of our time series as the input and predict an individual series i, so this is looking specifically at the function g_i, and then we take a penalized likelihood approach like we discussed before; I'll dig into this in more detail.

Our MLP specification is very standard up to this point; we just assume a linear output decoder. But for the layer-one hidden values, we write things in terms of lag-specific weight matrices. There's still the issue of trying to disentangle the effects of specific past time series values on our series i, so we still haven't gotten at Granger non-causality. What we do for that is very straightforward: we regroup the inputs by which past series they were associated with, so we take all lags of series j together, and we place group-wise penalties on the weights from that input to our hidden layer. Then we can say that series j does not Granger-cause series i if its weights are zero: if I knock out all of those weights, then clearly this second time series is not influential in forming the prediction of series i. In math, here it is; it looks very familiar from what we had before, where we have our reconstruction error, though now the nonlinear mapping is defined by our MLP, and then we have a group lasso penalty on the decomposed weights of the MLP, where all the weights at all lags specific to series j are encouraged to go to zero together.

I just want to spend a second on an algorithmic note about how we optimize this penalized MLP. Often the focus in deep learning is on prediction error, where we can get away with optimizing approximately, for example using SGD, but we actually care about the zeros of the solution, because we're trying to get at interpretability, so it's really important to get very close to a stationary point of this non-convex objective. For this, we use a proximal gradient descent algorithm with line search. This is actually a batch method, but remember, we're in these small data scenarios.

Here are some simulated data results, where we simulate from a nonlinear system, in particular Lorenz-96. What we're showing here is area under the ROC curve, which measures our performance in detecting true versus false edges (higher is better), and we're able to recover these networks quite well as the length of the series increases; in case you can't read it, the length shown in parentheses is 1,200, not a very long series. One nice thing is that even if we simulate data from a linear vector autoregressive process, even though our framework has the capability of capturing nonlinear interactions, we can still recover the network structure in this linear setting.
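Here is a small sketch of what such a per-series penalized MLP and its proximal update could look like; this is an assumed PyTorch rendering of the general recipe (the names, layer sizes, and input ordering are mine), not the authors' code:

```python
# Sketch (assumed PyTorch implementation) of one component-wise MLP: it predicts a
# single target series from K lags of all p series, and a group-lasso proximal step
# jointly zeros out all input weights tied to a given source series.
import torch
import torch.nn as nn

class ComponentMLP(nn.Module):
    def __init__(self, p: int, K: int, hidden: int = 32):
        super().__init__()
        self.p, self.K = p, K
        # Input is ordered series-major: [series 1 lags 1..K, series 2 lags 1..K, ...].
        self.layer1 = nn.Linear(p * K, hidden)   # grouped input weights
        self.layer2 = nn.Linear(hidden, 1)       # linear output decoder

    def forward(self, x_lags: torch.Tensor) -> torch.Tensor:   # x_lags: (batch, p*K)
        return self.layer2(torch.relu(self.layer1(x_lags)))

def prox_group_lasso(model: ComponentMLP, lam: float, step: float) -> None:
    """Soft-threshold the input weights, one group per source series j.
    A group driven exactly to zero means series j is estimated not to
    Granger-cause the target series."""
    with torch.no_grad():
        W = model.layer1.weight.view(-1, model.p, model.K)   # (hidden, p, K)
        norms = W.norm(dim=(0, 2), keepdim=True)             # one norm per source series
        shrink = torch.clamp(1.0 - step * lam / (norms + 1e-12), min=0.0)
        model.layer1.weight.copy_((W * shrink).view(-1, model.p * model.K))

# A full fit would alternate a batch gradient step on the squared prediction error
# with this proximal step, using line search on the step size as described above.
```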
One other thing I want to very briefly mention is that we can swap out this penalty for other latest-and-greatest penalties. One thing we can do is swap in what's called a hierarchical group lasso penalty, and with the specification I'm showing here, not only are we doing Granger causality selection, but we're also doing lag selection, so each of our time series can now depend on its own set of lags, and that set of lags is learned. In the interest of time, this is just showing that, yes, we can recover this kind of structure with the lag selection.

What I'd like to spend a little more time discussing is that, so far, everything we've done is just for an MLP, but we can also apply the same kind of ideas within the context of RNNs, where we hope to capture long-range dependencies between our series via these nonlinear hidden variables. Here's a generic RNN specification, where we introduce a hidden state h_t that captures the historical context of the input sequence and evolves according to some nonlinear function f, depending on the specified architecture; again, we just assume a linear output layer for simplicity. But we're going to focus on an LSTM specification, which also introduces a cell state c_t in addition to the hidden state h_t. Here are the details of the specification, which are really pretty irrelevant to this talk; the main thing I want to emphasize is that the forget, input, and output gates control how the cell state is updated and transferred to the hidden state used for prediction. So, if you don't know about LSTMs, that's sadly all you need to know. What you need to know for this talk is that the effect of the inputs on our predictions is determined by the input weight matrices that appear here. So let's concatenate all these matrices into a big W matrix; that big W matrix jointly captures the effect from input to prediction. Now, once again, we can say that series j does not Granger-cause series i if the j-th column of this weight matrix is zero. So, again, this equation, which you're going to get bored of, but that's part of the cool thing, since you can apply it in lots of different contexts: you minimize your reconstruction error, where now the nonlinear map is through your LSTM, and you have a group lasso penalty on the j-th column of the weight matrix, encouraging that entire column to go to zero.

We applied this penalized LSTM to the DREAM3 challenge data set. This is a really difficult nonlinear data set used to benchmark Granger causality detection; it consists of simulated gene expression and regulation dynamics for two E. coli networks and three yeast networks. Each one consists of a hundred different series; we get 46 replicates of each, but only 21 time points, so it's a very, very small data set. I want to note that the network structure is very different between these five examples, and the structures used for these simulations, as well as the dynamics, are based on currently established knowledge about these gene regulatory networks, so this really is supposed to simulate a real data example. So again, we're going to look at area under the ROC curve for these five different networks.
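And here is the analogous sketch for the LSTM variant, again an assumed PyTorch wiring of my own rather than the authors' code; in nn.LSTM the input weights of the four gates are stacked in weight_ih_l0, so column j collects every path from input series j into the hidden state:

```python
# Sketch (assumed PyTorch wiring) of the penalized-LSTM idea: one LSTM per target
# series, with a group-lasso proximal step on the columns of the stacked input
# weight matrix; a zeroed column j encodes "series j does not Granger-cause
# this target."
import torch
import torch.nn as nn

class ComponentLSTM(nn.Module):
    def __init__(self, p: int, hidden: int = 16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=p, hidden_size=hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, 1)      # linear output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, T, p)
        h, _ = self.lstm(x)
        return self.decoder(h)                   # one-step-ahead predictions

def prox_input_columns(model: ComponentLSTM, lam: float, step: float) -> None:
    """Jointly soft-threshold, per input series, the stacked input weights of the
    four gates; a column driven to zero removes series j's influence entirely."""
    with torch.no_grad():
        W = model.lstm.weight_ih_l0              # shape (4 * hidden, p)
        norms = W.norm(dim=0, keepdim=True)      # (1, p), one norm per input series
        W.mul_(torch.clamp(1.0 - step * lam / (norms + 1e-12), min=0.0))
```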
Here's the performance of our penalized MLP, and here's the performance of our LSTM. As is often the story in the deep learning community, the long-range dependencies that we can get out of the LSTM are useful in this context. This blue bar is a 2015 gold-standard method for nonlinear Granger causality selection, so we really do see the benefits of deploying deep learning here. And here's a comparison to the penalized linear model, and here's a dynamic Bayesian network based approach. The take-home message is: if you think about 46 replicates of 21 time points, I mean, who would think about doing deep learning there? That's crazy, right? And we're not doing prediction; these performance numbers are about learned structure, about edge detection performance. So we're getting interpretability, and we get it by having this structured representation with these particular sparsity-inducing penalties, which provide regularization in the limited data scenario as well as getting us to interpretability; they really are quite powerful.

I also wanted to present what's basically a cute example that's a little more intuitive, so you can start visualizing some of these things. We decided to look at learning the interactions between joint angles and body position in the motion capture data that I showed earlier. We're looking at six different videos, 56 dimensions of joint angles and body position, and across this collection there's the following set of behaviors: instances of jumping jacks, side twists, knee raises, squats, arm circles, various versions of toe touches and punches, and running in place. Here are our learned interactions for increasing sparsity-inducing penalties, and if you stare at this (I think it's probably quite hard to see), there's some very intuitive structure: for example, these connections between the knees, one knee leading the other and then vice versa; there are also connections between hands and toes for the toe touches, and lots of other structure I could talk about if you're interested.

So, what we've talked about are ways of forming structured representations of MLPs or recurrent neural nets and placing sparsity-inducing penalties on them to get at these statements of interpretability. In these cases we're doing Granger causality selection on encoding, but we can actually do the same kind of thing with Granger causality selection on decoding, where we take the output to be a linear combination of learned nonlinear features of our time series; we could have a whole separate talk on that, but it's very straightforward to do there as well. Likewise, you can plop in your favorite components and use exactly the same ideas with other more recent developments. But the key idea is thinking about structured representations and adding regularization; in this context we used sparsity-inducing penalties for interpretability, and this really helps us in these limited data scenarios.

So, to summarize: deep learning really does offer tremendous opportunities for modeling complex dynamics, going well beyond what traditional approaches allow with linear, Gaussian, stationarity assumptions. But what I really want to emphasize, and I hope the message has come through, is that time series problems are much vaster than the types of really big success stories we've seen so far, namely prediction on large corpora.
There are lots of opportunities for other success stories if we go back to thinking about traditional time series representations and methods for coping with some of the challenges we face in this context. And of course, I want to thank the students and postdocs who did all the hard work on these projects; these are the students and postdocs who were on the Granger causality project. Okay, thank you.
Info
Channel: Institute for Pure & Applied Mathematics (IPAM)
Views: 3,353
Keywords: ucla, ipam, math, mathematics, cs, computer science, deep learning, machine learning, emily fox, neural nets, neural networks, granger causality
Id: as3RpspRb88
Length: 39min 34sec (2374 seconds)
Published: Fri Feb 16 2018