Emily Fox - Flexibility, Interpretability, and Scalability in Time Series Modeling

Video Statistics and Information

Captions
All right, so it's a pleasure to introduce our fourth and final plenary session of the workshop: our very own Emily Fox, from the University of Washington's computer science and statistics departments. It's really exciting to hear about the great work she's doing.

So most of the success we hear about in machine learning these days involves really massive training datasets, but a lot of the problems we might wish to solve simply don't fit that bill. This talk focuses primarily on learning from time series; I'll discuss some of the open challenges as well as some approaches for handling more limited data scenarios and for getting at interpretability.

The first thing I think is important is that time series are everywhere, and that statement is increasingly true with the development of new devices and platforms. The time series pictured here might represent streams from finance or commerce platforms; there have also been a lot of interesting advances in wearable devices that provide really interesting streams of activity data; and in health care, advances like electronic health records and different monitoring devices allow us to start assessing a patient's health status over time. But for the most part, until recently, machine learning has largely ignored time series. Why? Well, it's hard. Typically the number of parameters associated with the dynamical models we'd use for these data tends to grow really rapidly with the number of time series that are evolving as well as the complexity of the dynamics captured, so we need that much more data to learn effectively. Likewise, the algorithms associated with these models tend to be more computationally intensive per data point, so we need that much more compute power. And finally, a lot of the theoretical tools are incomplete, and our algorithms are really challenged in these settings. But capturing dynamics matters: if someone is deciding between pizza and sushi and I just guess at random which they want, I'll be right about 50% of the time; but if I leverage the fact that we just had pizza, so they're very likely to want sushi, or vice versa, I can get much better predictive accuracy, maybe something like 90%. So capturing temporal structure is actually really important.

Really, though, the analysis of time series, or more generally sequential data, is all the rage right now. Why? It's really the confluence of a number of factors: the presence of massive web-scale time series, combined with large compute resources and advances in deep learning, has led to a number of success stories. If we think about RNNs and their many variants, there are real successes we've seen, including speech recognition, machine translation, NLP analysis of medical records, and so on. But the success in these cases really relies on three critical components. One is the fact that we have this massive data in the first place: we're thinking about having lots of replicated time series, so maybe we have lots of text corpora, or lots of examples of a robot navigating a particular domain, or lots of transcribed audio. But of course in most applications we don't have that, especially scientific applications. For example, imagine you're trying to infer functional networks in the brain. With these neuroimaging modalities it's very costly to collect data, so we typically only get a few scans per person; in addition, there is substantial subject-to-subject variability, so we can't just pool all this data together in a straightforward way.
Or sometimes we actually have lots of data, but not much data for the question of interest. In this case, imagine we're Amazon and we're trying to forecast the demand of every product in our inventory. We have purchase histories for all these products, but then we get a brand-new product; that's an example of a question where there's limited data for that particular product, and how do you think about forecasting its demand? Or what if we're trying to detect a rare disease? In that case it's really important to focus the modeling on the tails of the distribution, where we have limited data.

Another really critical component of the success stories we've had is having what I call manageable contextual memory. RNNs are regarded as having this really powerful ability to capture rich historical context of the input sequence and to use that context to form predictions. That is indeed true in cases where there's a lot of structure to the input sequence: in a massive dataset it's very likely that we've seen similar structure before, or we've seen these types of words used in this context before, or we've seen a patient with these types of test results before for common illnesses. But of course this is not generically the case. What if we're trying to forecast a really complicated weather system, where we're measuring a whole bunch of different variables: air temperature, dew point, humidity, and so on? This is an actual dataset that was collected, and of course it's a really complicated, noisy, nonlinear dynamical system, so it would take an enormous amount of data to learn the interactions between these variables and have a good predictive model. In contrast, if we can leverage some of our prior beliefs about the relationships between these variables, I think we could do much better. Another big challenge is the one we face when there are nonstationarities: changes in the relevant context for the predictions we're trying to form. Imagine a patient who just underwent orthopedic surgery: all of a sudden the relevant context that was useful for describing that patient's activity level has changed, and as the patient recovers from surgery there's likewise a gradual change in what the relevant context is. Or think of somebody with an illness who is getting sicker and sicker.

The last really critical challenge I want to mention is the fact that in the big success stories we've seen, there's always a very clear prediction task and an objective you can optimize: there's word error rate in speech recognition, BLEU score in machine translation, or the reward function you can write down for reinforcement learning. But what if we have few and noisy labels available for our task? What if our goal isn't one of prediction in the first place, or there's no clear prediction metric we can write down? For example, in that neuroimaging study, the goal was to understand these networks in the brain, to extract interpretable information; it's a structure learning task, in essence, rather than prediction. So in this talk we're going to think about moving beyond prediction on large datasets.
We'll talk about some other really important time series analysis tasks. Some of the things we're going to cover are learning interpretable structures of interaction between our observed time series, how to efficiently share information in limited data scenarios, and then we'll touch upon ideas of nonstationarity and measurement biases.

OK, let's begin. Let's talk about some of the work my group has done on learning structured and sparse neural network models that get at notions of interpretability and allow us to better handle more limited data scenarios. We're going to discuss this in the context of two different case studies: one is deep generative modeling, and the other is learning Granger causal interactions amongst nonlinear dynamical processes.

To start, imagine that we want to infer functionally connected regions in the brain: which regions co-activate in response to certain stimuli. Typically in these analyses the cortex is divided into regions of interest that consist of sets of highly spatially correlated signals, and the number of signals in each group varies depending on which group we're looking at. For this type of analysis, where you have groups of observations and you're trying to understand relationships between the groups, you might think of using factor analysis methods. These methods tend to be fairly interpretable and can be made to handle limited data, but traditional techniques are not very flexible; they have limited representational power. So instead, in order to capture the complex nonlinear relationships between these different regions, one might think about using some of the latest and greatest deep generative modeling techniques. However, although these methods are really flexible, they are very hard to interpret, and it's a struggle to train them unless you have lots of data.

So what can we do? Can we leverage some of the flexibility of deep generative models while maintaining the parsimony of these more traditional factor analysis methods? What we're going to think about doing here is leveraging what's called inductive bias to help with the sample complexity issue, where the idea is to incorporate known structure of the data into the model. The notion of structure we're going to use here is the fact that our high-dimensional observations can be decomposed into groups of highly correlated variables. We talked about this in the context of the neuroimaging application, but we see this type of structure in lots of datasets: for example, a collection of financial indices, each with an associated asset class. Or, as a running example that I'll use in this talk because it's easy to visualize, imagine we're trying to synthesize human body pose measurements: every joint in the body is a collection of joint-angle measurements that are highly correlated, and in this case it's really simple to see how deep generative models will help capture the nonlinear manifold that human body poses live on.

What we're going to do in order to incorporate this group structure is define group-specific generators in these deep generative models, and to further parsimony as well as to help with interpretability, we're going to encourage a set of sparse relationships between these group-specific generators.
We're going to describe all of this in the context of variational autoencoders (VAEs), which are a particular class of deep generative models. These consist of an encoder and a decoder, and you can think of the decoder as a map from a low-dimensional latent code to your complex high-dimensional observations. In particular, the VAE assumes that the distribution on latent codes is just a standard normal, and then the decoder defines a set of neural network layers that are used to define a conditional mean and diagonal covariance, which define the conditional distribution on your complex high-dimensional observations. So generatively, you think about sampling a low-dimensional latent code (in this visualization, just a 2D vector), then passing that vector through the decoder, which is these neural network layers, and that defines the conditional distribution from which it's straightforward to sample. That's the generative process as defined here. One thing to note is the fact that all the dimensions of the latent code (just two in this case, but it could obviously be higher dimensional) get entangled through those neural network layers, so it's really hard to understand how different dimensions of the latent code influence different aspects of the high-dimensional observation.

Just to complete the description of the VAE a little bit more: because the marginal likelihood is intractable, the VAE optimizes a variational objective, where the variational approximation is just a Gaussian distribution with mean and diagonal covariance defined by separate neural network layers. This stage is called the encoder, so you can think of it as a map from our complex high-dimensional observations to a distribution on our low-dimensional latent code: it embeds our high-dimensional observations in this latent space. We pass an observation through the encoder and that defines a distribution on latent codes, and the VAE jointly trains the neural network layers of the encoder and decoder. That's the convention that's on this slide. From here you can see that this model is really akin to a nonlinear latent factor model: you have your latent factors, but the mapping to your high-dimensional space, instead of being a linear model with a factor loadings (lambda) matrix, is a nonlinear mapping.

[Audience question about choosing the latent dimensionality]: Of course, analogously to linear factor models, there's the question of trying to figure out what the latent dimension is, and that alone is a challenging problem. Here people tend to just specify some dimensionality, and there's some robustness to it; it's still an art, not a science. But I'll come back and show you how, through the sparsity, we're going to get help with that issue, where you can basically consider a somewhat overcomplete set of latent dimensions and let the sparsity prune what's unnecessary. It's an important question for sure.
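For readers following along without the slides, here is a minimal sketch of the VAE setup just described: an encoder mapping an observation to the mean and diagonal covariance of a Gaussian over a low-dimensional latent code, a decoder mapping a latent code back to the observation space, and the ELBO that trains both jointly. It is written in PyTorch; the layer sizes, the tanh nonlinearities, and the 62-dimensional observation are illustrative assumptions, not the architecture used in the talk.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE sketch: encoder gives (mu, log-variance) of a Gaussian over
    a low-dimensional latent code; decoder gives the mean of a diagonal
    Gaussian over the high-dimensional observation. Sizes are illustrative."""
    def __init__(self, x_dim=62, z_dim=2, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh())
        self.enc_mu = nn.Linear(hidden, z_dim)
        self.enc_logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: sample z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def elbo(x, x_mean, mu, logvar):
    # Gaussian reconstruction term (unit observation noise assumed) plus the
    # analytic KL between q(z|x) = N(mu, diag(exp(logvar))) and N(0, I).
    recon = -0.5 * ((x - x_mean) ** 2).sum(dim=1)
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1)
    return (recon - kl).mean()
```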
OK, so what we're going to try to address here is this issue of disentangling the effects of the different dimensions of the latent code on your high-dimensional observation. To do this, what we're going to do is define these group-specific decoders: we have one for the elbow, one for the knee, and so on. Remember, even though each group is a joint, it's actually a vector of observations, because it's a vector of different angles. All these decoders and the encoder are coupled, however, because they share the same latent space and they're jointly trained. We refer to this model as the oi-VAE, for output-interpretable VAE. To make the model interpretable, what we do is add a penalty on the weights that map from the dimensions of the latent code to each one of the group-specific neural network layers. In particular, that allows us to learn that maybe this first dimension is important for controlling joint motion of the right elbow and the left knee but is irrelevant for describing the motion of the neck, whereas the second dimension describes motion of the neck, which happens separately from the elbows and knees. I won't go through the details of what the penalty looks like, but it's a structured sparsity-inducing penalty that looks a lot like a group lasso.

What we've talked about so far is just structuring the decoder, but you can also structure the encoder as well. This allows for a more efficient flow of information, which I'll show on the next slide, but another thing it allows us to do is address two really important challenges with training VAEs. One is the fact that VAEs assume you have complete observations at training time; but what if you have some missing values, say a sensor drops out? By structuring the encoder we're going to be able to handle missing data more robustly. This structure also allows us to handle multimodal data sources: all we need is a method for combining the latent representations that each of these group-specific encoders produces. There are more details in the paper, but what I'll focus on is just the high-level structure: if you remember, in the standard VAE, as well as in the way I've described things so far, every dimension of the observation vector informs every dimension of the latent code. What we want is for a group of observations to only inform the dimensions of the latent code that are used to generate that group; there's no point in a group informing dimensions that aren't used to generate it. That's where we get the more efficient flow of information, and it also provides a framework that is much more modular and interpretable.
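A sketch of the group-structured decoder idea in the same PyTorch style: one small decoder per group (e.g. per joint), all reading the shared latent code, with a group-lasso-style penalty on the latent-to-group weights so that whole columns can be driven toward zero. This is only an illustration of the structure described above; the actual oi-VAE uses a particular hierarchical sparsity prior and optimization scheme detailed in the paper, and the sizes and penalty weight here are placeholders.

```python
import torch.nn as nn

class GroupDecoder(nn.Module):
    """One decoder per group (e.g. per joint). The first linear layer maps the
    shared latent code into each group's network; a group-lasso-style penalty
    on those latent-to-group weights prunes which latent dimensions drive
    which group."""
    def __init__(self, z_dim, group_dims, hidden=32):
        super().__init__()
        self.latent_to_group = nn.ModuleList(
            [nn.Linear(z_dim, hidden, bias=False) for _ in group_dims])
        self.group_nets = nn.ModuleList(
            [nn.Sequential(nn.Tanh(), nn.Linear(hidden, d)) for d in group_dims])

    def forward(self, z):
        # Returns one reconstruction per group, all driven by the same z.
        return [net(lin(z)) for lin, net in zip(self.latent_to_group, self.group_nets)]

def group_sparsity_penalty(decoder, lam=1.0):
    # Sum over (group g, latent dimension j) of the Euclidean norm of the
    # weight column mapping latent dim j into group g; zeroing a whole
    # column removes dim j's influence on group g.
    penalty = 0.0
    for lin in decoder.latent_to_group:
        penalty = penalty + lin.weight.norm(dim=0).sum()
    return lam * penalty
```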
OK, so now let's look at some data analysis. To begin, we're going to look at synthesizing human body pose measurements, where our training data consists of 10 very short videos of a person walking; this is actually really limited data for training something like a VAE. One thing we can look at is the learned weight matrix between each of the dimensions of our latent code, which are the different columns of this matrix, and each group of observations, every joint in the human body, which are the rows. This is a sparse matrix, so it's interpretable; and I want to say that with the standard VAE you can't even start to show something like this, so there's no comparison I can offer. One thing you can ask, for example, is: for every joint, which dimensions of the latent code control its motion? Or you can flip that around and ask, for a given latent dimension, which joints does it control? For example, what we're showing in this table is, for each of the 16 latent dimensions, the top three joints it controls. The one I've highlighted here controls the right lower arm, right wrist, and right upper arm; or, visualizing them here, you can just step through them: this first one controls aspects of the leg, the next one aspects of the head and neck. So it's learning these groups of joints that have correlated behavior in the motion.

You might think that this gain in interpretability comes at a loss of flexibility of representation, but what we show in the paper, and I won't go through the details, is that you actually get better held-out log likelihood on unseen sequences when you're training in these limited data scenarios, and it converges to the same performance as you get more and more training data. What I'll show here is just the quality of our generator: for this network that we jointly trained, we're going to throw away the encoder, go back to sampling from our standard normal distribution, and pass those samples through the decoder, so this just shows what the generator has learned. Here's one sample from the standard VAE, here's another, and another, and another; my students like to call these samples from the Ministry of Silly Walks. And here are samples from the oi-VAE. Of course I'm just showing four here, but there's a much larger set of samples you can look at in the paper, and the story is consistent: it's much more representative of what human motion looks like. The point here is that in these limited data scenarios, the fact that this model includes the structure of what these groups are, and then learns sparse interactions between these groups, is basically providing a lot of regularization and focusing the learning on a lower-dimensional space.

I also want to show you a little bit about robustness to missing data, and for this we use the same motion capture data but change the setup a little: we're going to treat every limb as a group, hold out 10% of the limb measurements in these videos, and then do something called conditional sampling, where we condition on one group of observations, which we choose to be the core body position (that's shown with these orange dots), and our goal is to impute all these other limbs. So we do this conditioned on those orange dots, and this is what we get. For a while we actually thought there was a bug, because the imputation looked too good; but if you think about it, human motion really does live on a very low-dimensional manifold, and just these small changes in core body position give you a lot of information about where everything else is, so it's not totally unbelievable.

Now I want to return to the example that really motivated this model: accounting for these functionally connected regions in the brain, where we're looking at MEG data. A person sits in this chair and has this helmet of spatially distributed sensors that provide recordings of brain activations over time. Remember that our observations are defined in terms of these groups, these regions of interest, each with a set of correlated signals, and we want to understand the relationships between these different groups. So we can look at the same weights matrix that we looked at before.
Here, what we see is that different dimensions of the latent code correspond to known networks of the brain: this dimension maps onto what's called the dorsal attention network, this one onto another well-known network, and there are some other matches as well. Of course there are some dimensions that don't correspond to networks we know, and you could use this as a tool to hypothesize networks. For me, the interesting thing here is that this is a much, much noisier setting, and we're still able to extract interpretable information.

OK, so now we're going to turn to exactly the same idea of placing structured sparsity-inducing penalties on layers of weights of these neural networks, but in the context of an explicit dynamical model, where the goal is to infer Granger causality statements. In particular, we're interested in whether time series i is predictive of series j, or j of i. Here we're showing things just for two different time series, but we're interested in large networks of time series, and across all these sets of interactions the question is which ones are informative of the evolution of the process. Why are interactions important to study in the first place? Well, in a lot of applications, especially scientific ones like the ones we've discussed so far, it's the direct question of interest. When we think about functional connectivity networks: so far I've talked about one definition of functional connectivity; another is to think about directed interactions between regions, and that's another notion of functional connectivity we're interested in. Or in biology, maybe you're interested in gene regulatory networks and how that molecular structure drives a particular response.

The classical approach to Granger causality estimation is to fit a single linear model, a vector autoregression, to all the series. Then we can say that series i does not Granger-cause series j if the (j, i) entry is zero across all of the lag matrices. What's the issue with this linear approach? Of course, in the settings we've talked about here, as well as many others, the dynamics are almost assuredly not linear, and if we assume linearity it can lead to inconsistent estimation of the edges in this Granger causality graph. So what we're going to do instead is introduce a nonlinear mapping for each time series that takes the history of all our time series as input, plus an additive noise term, and then we're going to say that series i does not Granger-cause series j if series j's nonlinear mapping is invariant to the history of series i. For our specifications, we introduce neural networks to define these nonlinear mappings, with architectures that allow us to place sparsity-inducing penalties in order to infer these Granger causality statements. As a simple example of what we can do, we can look at a multilayer perceptron where we take as input the history of all of our time series (here we're showing three of them) and output a particular series i, and the trick is that we arrange these inputs so that we group together all the history of a particular series. Then, if we place a group-wise sparsity-inducing penalty, like a group lasso penalty, on those input weights, that allows us to infer exactly these Granger causality statements: all of the weights for series two being driven to zero says that the history of series two is irrelevant in forming our prediction of series i.
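A sketch of the MLP component just described, assuming a lag-window input: one network per output series, with the first-layer weights grouped by candidate input series, a group-lasso penalty on each group, and the inferred Granger-causal parents read off from which groups survive. The class name, lag length, and thresholding are illustrative choices of this write-up, and the actual method uses a proximal optimization scheme to drive groups exactly to zero, which is not shown here.

```python
import torch
import torch.nn as nn

class GrangerMLP(nn.Module):
    """Predict series i from the lagged history of all p series. The first
    layer's weights are grouped by input series; driving a whole group to
    zero says that series' history is irrelevant for predicting series i."""
    def __init__(self, p, lag, hidden=16):
        super().__init__()
        self.p, self.lag = p, lag
        self.inp = nn.Linear(p * lag, hidden)   # history of all series, flattened
        self.out = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x_hist):                  # x_hist: (batch, p * lag)
        return self.out(self.inp(x_hist))

    def group_norms(self):
        # Euclidean norm of the input-weight group for each candidate series.
        W = self.inp.weight.view(-1, self.p, self.lag)   # (hidden, p, lag)
        return torch.sqrt((W ** 2).sum(dim=(0, 2)))

    def group_penalty(self, lam=1.0):
        # Group-lasso penalty added to the prediction loss during training.
        return lam * self.group_norms().sum()

    def granger_parents(self, tol=1e-3):
        # Series j is estimated to Granger-cause this output series only if
        # its weight group has not been driven (numerically) to zero.
        return self.group_norms() > tol
```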
We can do this in the context not just of MLPs but also recurrent neural networks, and we can also formulate things not just on the input side, as we've talked about, but also on the decoding side; the key thing is identifying groups of weights where we can place penalties on those groups, and driving an entire group to zero infers the structure of interaction between the series.

So let's analyze what this method does on the DREAM challenge dataset, which is used to benchmark gene regulatory network inference methods. The dataset consists of simulated gene expression and regulation dynamics for five different networks, E. coli and yeast, and these simulated networks have a hundred time series each, so hundred-dimensional networks, and we only get 46 replicates for each. That's really, really little data compared to the complexity of these dynamics and the dimensionality of these networks; we have very few training observations. As a performance metric, we look at area under the ROC curve, which shows our ability to detect true versus false edges in a network. This is not a measure of predictive performance; it's really getting at how well we're doing at the structure learning task. We compare the performance of our multilayer perceptron approach, and the same penalty applied to an LSTM, which is an example of an RNN, against a 2015 gold-standard nonlinear Granger causality method, a linear approach, and another standard baseline, and what we see is that there are definite gains from using these neural network approaches. This was really interesting to me, because if you had asked me some years ago whether I would use neural networks in this setting, I'd have said no way: if you have limited data and you're trying to do structure learning, that's not what neural networks are known for. But we see that they actually can provide a lot of promise in describing flexible dynamics, in cases where you can encourage a lot of sparsity.

So now we're going to build on this same idea of studying interactions in time series, but in the context of Bayesian dynamical models, where not only are the interactions between the time series interpretable by construction, but these models are also going to produce uncertainty estimates. In this section we'll also touch upon ideas of nonstationarity and handling measurement biases.

To start, let me describe a collaboration that my group has had with Zillow over the years, where the goal is to estimate the value of housing at a really local level. In particular, we're going to look at census tracts and try to estimate the value in each census tract and how that value changes over time. The challenge, however, is the fact that the data are spatiotemporally really, really sparse. To get a sense of this: in the City of Seattle, more than 40% of census tracts have fewer than three house sales on average per month, and more than 10% have even fewer. We can look at this a bit more qualitatively: here's one census tract where we have lots of observations, and each dot is a different house sale.
Our goal is to estimate the latent value in that census tract over time, as well as a band of uncertainty about it. That doesn't look terribly challenging here, but here's one census tract where, over this 17-year period, we only have four house sales. How do we hope to do anything there? The idea we're going to explore is to discover clusters of correlated census tracts; in particular, the goal is to discover census tracts whose latent price dynamics are coordinated. If we can discover this structure, we can pool and share information between these different regions to improve the robustness of our estimates.

Here's how we model it: we take each census tract and model its price dynamics using a state-space model, where at some time points we have just a single house sale, at some we have multiple, and at many time points we have no house sale at all. A little more explicitly, the model consists of a latent linear autoregressive process on the latent price dynamics (a very simple process, for simplicity); we also model a global trend, what's happening at the Seattle level, and we also capture seasonality. Then we assume that each house sale is just a noisy observation of the latent value in that tract, corrected for house-level features. Remember, we don't have just a single census tract, we have a whole collection of them, and each census tract's dynamical model has innovations that are driving the dynamics of that tract. Instead of modeling those innovations as independent between census tracts, which would imply that each census tract's value evolves independently, we stack these innovations up into a big P-dimensional vector (we have P different census tracts) and assume a joint Gaussian. Then our clustering task just boils down to discovering block structure in this big covariance matrix Sigma. What we mean is that if census tracts are learned to be in the same block, they have coordinated innovations, and if they're in separate blocks, they're independent. If you pass this innovation structure through the dynamical model, you get exactly these clusters of correlated latent price dynamics. In order to learn this clustering structure, we use a latent factor model combined with Bayesian nonparametric methods; I won't go through the details, but it allows us to learn how many clusters there are as well as what the cluster assignments are.

We applied this model to analyzing some City of Seattle data, with 144 census tracts. [Audience question about why not use a spatial model]: Yes, the reason this is very different from a standard spatial process is the way census tracts are defined; it means that neighboring census tracts can actually be fairly different, so you get a lot of spatial heterogeneity, and this is fairly well studied in the literature. You really don't want to impose a spatial model. You could build one, and it would add a lot of complexity, and there is some spatial structure that you do see, but it would blur a lot of really critical features; for example, some neighboring census tracts are totally different in how they behave, and a spatial model is going to smooth across that. We found that you do get spatial smoothness where it exists, which of course is what you want.
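To make the structure concrete, here is a small NumPy simulation sketch of the kind of model described: per-tract latent AR(1) price dynamics whose innovations are jointly Gaussian with a block (cluster) covariance, observed only through sparse, noisy house sales. All numbers and the fixed cluster assignments are placeholders; the real model additionally has a global trend, seasonality, house-level features, and a Bayesian nonparametric prior that learns the clustering rather than fixing it.

```python
import numpy as np

rng = np.random.default_rng(0)
P, T = 6, 120                              # tracts, months (illustrative sizes)
clusters = np.array([0, 0, 0, 1, 1, 2])    # assumed block assignments

# Block-structured innovation covariance: tracts in the same cluster share
# correlated innovations; tracts in different clusters are independent.
Sigma = np.eye(P) * 0.2
for c in np.unique(clusters):
    idx = np.where(clusters == c)[0]
    Sigma[np.ix_(idx, idx)] += 0.8

a = 0.98                                   # AR(1) coefficient on latent log-value
latent = np.zeros((T, P))
for t in range(1, T):
    latent[t] = a * latent[t - 1] + rng.multivariate_normal(np.zeros(P), Sigma)

# Sparse, noisy observations: most tract-months have no house sale at all.
sale_mask = rng.random((T, P)) < 0.15
sales = np.where(sale_mask, latent + 0.5 * rng.standard_normal((T, P)), np.nan)
print(f"{np.isnan(sales).mean():.0%} of tract-months have no observed sale")
```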
Across the 144 census tracts, we learned that there are 16 different clusters, which we color here, and each panel shows the latent price index for the tracts in that cluster over time. We see interesting structure: for example, this red cluster is downtown, the region with the largest boom-and-bust cycle over this period. But you can also analyze things more quantitatively, and what we do here is form held-out house price predictions, predicting the prices of individual houses. That's not our goal, but we use it as a proxy for assessing the quality of our housing index, and we compare our performance to the industry-standard Case-Shiller index. What we're showing in this plot is the percent improvement of our predictions over Case-Shiller, where we've broken our analyses down from the 5% of census tracts with the most observations all the way to the 5% with the fewest. Not surprisingly, the largest gains are for the census tracts with the fewest observations, because that's where the sharing of information matters most. We then decided to push our method even further and analyze even finer-scale regions, and what we found is that even though the data scarcity challenge is much larger in that setting, we actually ended up with roughly a 5 percent improvement in predictive performance, just because of how spatially heterogeneous housing is. In contrast, the performance of existing methods gets worse and worse as you go below the city level, because they can't handle this data scarcity. [Audience question about whether houses within a region are clustered individually]: We assume that within each unit that's given to us there's one behavior you're trying to find; we don't re-segment the regions themselves.

OK, so this is another collaboration that my group has had with Zillow, where the goal is to study the dynamics of homelessness. One of the goals is to assess how changes in rent levels affect the size of the homeless population, and we also want to be able to produce uncertainty estimates, because that's really critical to decision-makers. This is another really, really messy data situation: every year, volunteers go out with clipboards and literally try to count the homeless population, and the methods they use vary between metros; they also vary over time, as people come up with what they think are better counting methods. Another thing that makes the analysis hard is the fact that it's of course easier to count the homeless who are in shelters than those who are out on the streets, and the fraction of the population that is sheltered varies significantly between different metros. So the take-home message is that there's clearly measurement bias; we can't straightforwardly treat these homeless counts as counts of the actual homeless population.

I'll describe the model at a high level. We introduce a dynamical process to model the nonstationary dynamics of the total adult population in any given metro, and we assume that the census counts are noisy observations of that total population. Then we introduce another latent process to model the log odds of homelessness, which is regressed on the Zillow Rent Index. We then assume that the total homeless population, which, very critically, we assume is unobserved, is a noisy function of the total population as well as the log odds of homelessness. Finally, we introduce a count accuracy that's informed by metro-level information for each of these different metros, and our second source of counts, beyond the census data, are the annual homeless counts, which we take to be a noisy function of the total homeless population and this count accuracy. [Pointing at the graphical model]: this is the total metro population, these are the observed census counts, this is the log odds of being homeless, this is the unobserved total homeless population, and this is what actually gets counted.
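A generative sketch, in the same NumPy style, of the structure just described for a single metro: latent total population observed through census counts, log odds of homelessness regressed on a rent index, an unobserved total homeless population, and an observed count that only sees a fraction of it. The functional forms, noise scales, rent values, and the 0.7 count accuracy are placeholders of this write-up, not the model's actual specification.

```python
import numpy as np

rng = np.random.default_rng(1)
years = 8
rent_index = np.linspace(1500, 2200, years)           # placeholder rent index values

# Latent total adult population with slow (nonstationary) drift,
# observed noisily through census estimates.
total_pop = 4_000_000 * np.exp(np.cumsum(rng.normal(0.01, 0.005, years)))
census_obs = rng.normal(total_pop, 0.01 * total_pop)

# Log odds of homelessness regressed on rent (coefficients are placeholders),
# plus a slowly drifting latent component.
log_odds = -7.0 + 0.0015 * (rent_index - rent_index.mean()) \
           + np.cumsum(rng.normal(0, 0.05, years))
rate = 1.0 / (1.0 + np.exp(-log_odds))

# The true homeless population is unobserved; the annual point-in-time count
# only sees a metro-specific fraction of it (the "count accuracy").
homeless_total = rng.poisson(total_pop * rate)
count_accuracy = 0.7                                   # assumed, varied in sensitivity analyses
homeless_count = rng.binomial(homeless_total, count_accuracy)
```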
In contrast to past methods, which were directly modeling the count data, we're very deliberately treating the true number of homeless as missing data, and we're able to form year-over-year forecasting comparisons because we've introduced these dynamic processes. And even though I'm not showing it here, this is a hierarchical Bayesian model that allows us to share information between metros to help deal with the limited sample size.

One thing we asked is: after adjusting for the dynamics of count accuracy and total population, is the rate of homelessness increasing? Here we're showing what we inferred for each of these different metros, as well as our uncertainty about that estimate, and we identified some metros, like New York, LA, San Francisco, and Seattle, as what I'd call a state of emergency, whereas others are status quo, and some appear to be making progress. Of course we don't know why; maybe homeless individuals in other metros are moving to New York. But overall, this type of information is really useful for policymaking. [Audience question about the count accuracy]: Yes, these are assumptions; we don't know the actual count accuracy, and this is something you can vary in the analysis. If you're going to do resource allocation, you want to know how sensitive your inference is to what you assume the accuracy to be, in terms of what decisions you might make. You'll see for LA in a bit why that difference matters.

Another thing we wanted to do was understand the relationship between rent and homelessness. [Audience question about whether this model shares the housing model's clustering structure]: This is a totally separate model from the housing one; it doesn't have that kind of clustering structure. It shares information between metros in the sense of having a hierarchical prior that influences multiple parameters, but it doesn't have the explicit structure you're referring to. OK, so as we change the Zillow Rent Index, that assessment of what rent levels are, what can we expect the size of the homeless population to be? We're showing our posterior mean and 95% credible intervals, here just for New York and for LA, and we're showing these estimates both for where we expect the total homeless population and the counted population to be. What we see is that for both New York and LA, we expect that as we increase rent we're going to see an increase in the size of the homeless population; but interestingly, though not so surprisingly, in LA what our model predicts is different for the total number than for the counted number, and that's because LA has a much larger unsheltered population. This is where the assumptions about count accuracy have a pretty big influence on what's inferred. But what we actually found, if you look across all metros, is that you have these really wide uncertainty intervals, and there's typically only a weak relationship between rent and the size of the homeless population.
That's not so surprising given how noisy the data collection process is and how limited the data we have available are. In contrast, past methods were really overly confident in the inferences they were drawing, because they were treating the counted number of homeless as the actual homeless population; they were ignoring the noise in the homeless and census count processes. Stepping back, this issue of having measurement bias and nonstationarity is endemic to many, many datasets. If we think about this measurement bias issue in particular: if we just took these homeless counts and shoved them into an RNN, I would have a lot of issues with the inferences that might be drawn. You have to think very carefully about what the data are and are not telling you about the particular question of interest.

And if we go back to this idea of efficiently sharing information, what we've talked about so far are structures that allow you to more efficiently share information between data streams: we talked about clusters and hierarchies of time series, sparse networks of interaction, and low-dimensional embeddings, and these are different structures that my group has sought out in different contexts. I'm going to go over this next part very quickly, because there were very good questions and I don't want to eat too much into your poster session, but there's another way we can think about sharing information in our time series, and that is switching between a shared set of dynamic behaviors; here we're thinking about sharing information across time, if we see the same behaviors appearing again and again. In the interest of time I'm actually going to skip this part, so come talk to me afterwards. But the point is this: for many complex dynamical processes, you can describe the dynamics as switches between a set of simpler dynamic behaviors, using things akin to hidden Markov models. Some examples my group has looked at are automatically parsing EEG recordings, speaker diarization, detecting regimes of volatility in stock indices, and segmenting the human chromatin sequence. Overall, what I wanted to mention is the fact that these are really useful representational frameworks, and there are a lot of opportunities to combine flexible neural-network-based components with these types of modeling frameworks, allowing us to go beyond assumptions of Gaussianity and linearity while still handling more limited data scenarios and extracting more meaningful information about the dynamical processes.

OK, so in the very last part of the talk, what I want to touch upon is a really critical challenge: how we think about scaling up inference in the types of dynamical models we've talked about. Just to motivate this, let's look at some intracranial EEG recordings. This is actually just a snippet of a really long time series from a patient who had seizures, and there were many episodes of interest in this very long recording. I mentioned very briefly a couple of slides ago that we had a project where we were trying to automatically parse these recordings into outputs that were interpretable for neurologists to quickly skim and act upon.
The model that underlies that uses a state-space model, just like the housing application, or the homelessness study, or the motion capture application that I quickly clicked through. State-space models are a very broad class of dynamical models, and people use them for many purposes, like segmentation, smoothing, filtering, and forecasting. When we're thinking about learning the parameters in these models, a really critical term to look at is the log marginal likelihood, but note that when we marginalize over the latent state sequence, we induce really long-range dependencies between our observations. Typically, learning algorithms iterate between computing the latent state sequence and updating the model parameters.

We're going to focus on learning algorithms that compute the gradient of this log marginal likelihood, which is also called the score function. Using Fisher's identity, you can rewrite it as this expectation here, where the key thing to note is that the expectation is conditioned on the full observation sequence. You can actually compute that expectation efficiently using a dynamic programming routine; there are different words for it in different literatures, but call it a forward-backward algorithm: you propagate information forwards, propagate information backwards, and combine these messages to form smoothed estimates of the state at every time point (this is Kalman filtering and smoothing in other models and literatures). Then you use these smoothed state estimates to do an update, and we're going to think about doing a gradient-based update of our model parameters, and then you iterate. Well, the issue with the forward-backward routine is that its complexity grows linearly with the length of the time series. That doesn't sound too bad, but what if you have millions of observations? It starts getting prohibitive to iterate again and again over millions of observations.

So can we use stochastic gradients instead of the standard gradients? What if I just grab out a little subsequence, do a forward-backward pass just locally on that subsequence, use those smoothed estimates as a noisy gradient, update my model parameters with a stochastic gradient step, grab another subsequence, and so on? [Audience question: how do you initialize the messages at the ends of that piece?] Imagine initializing from your current estimate of the stationary distribution; people do all sorts of things here. And people do this all the time, and it seems reasonable; mostly in the literature people apply stochastic gradients and things seem to work out just fine. But of course the issue with this is that you're not propagating the information you should have in that forward-backward pass from outside that little box; you're breaking critical dependencies in the chain. What this does is introduce bias, and that turns out to be pretty important. So what we're going to propose is to account for those dependencies by leveraging the memory decay of the process, so that we can still act with a provably good approximation to what we should have been doing. To start the derivation, let's go back to the score function.
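For reference, the identity being invoked here is Fisher's identity, which writes the score of the marginal likelihood as a smoothed (posterior) expectation of the complete-data score; the notation below, with latent states x_{1:T} and observations y_{1:T}, is that of this write-up rather than the slides:

\[
\nabla_\theta \log p(y_{1:T} \mid \theta)
  = \mathbb{E}_{p(x_{1:T} \mid y_{1:T}, \theta)}\!\left[ \nabla_\theta \log p(x_{1:T}, y_{1:T} \mid \theta) \right].
\]

Computing the outer expectation requires the smoothed posterior over the latent states given the full observation sequence, which is exactly what the forward-backward (or Kalman smoothing) pass provides.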
Using Fisher's identity, we can rewrite it as follows, but remember that the expectation is over the whole state sequence conditioned on all the observations. The naive gradient update that I showed in pictures on the last slide subsamples a subsequence of observations; it takes the observations in that box, but critically it also bases the expectation, the forward-backward pass, just on that box. So not only is the gradient information just what's in the box, your forward-backward pass is as well, and you can see where the bias comes in. You could get an unbiased estimator by going back to the full forward-backward: the update would be based on the subsequence, but you'd still have to do the full message passing, which was the computational problem in the first place. So what we're going to propose is something between these two extremes that controls the accuracy of the approximation. In particular, our gradient update is based on information within our subsequence, shown as this black box here and here in the equation, but when we're computing our expectation we use what we call a buffered subsequence, shown as this red box here, where the high-level idea is that we hope the observations within that buffer account for enough of the memory of the process that what's happening outside the buffer doesn't matter much.

I'll just say this very quickly, and I'm happy to talk more about the details, but what the theorem says is that, under some assumptions, the difference in our expected score function when we do the forward-backward pass on the full observation sequence versus on just the buffered subsequence (where the gradient in both cases is computed just over the subsequence) is upper bounded as follows, where we see geometric decay in the length of the buffer, and the bound is constant with respect to the length of the subsequence; the intuition is that the error really accumulates at the endpoints. The take-home message is that you can actually get away with fairly short buffer lengths in practice and have a pretty good approximation to what you would have gotten on the full sequence. Just a few pictures on this: this is a simulated dynamical system, and we're showing one margin of it. We're doing a full Bayesian analysis, so this is the exact posterior over the state sequence over time, and then what we're showing here is grabbing out a subsequence and doing the naive thing, forward-backward only locally, compared to that posterior, and you can see where the errors are large. But if you do this buffering, and remember the gradient update is still just based on what happens in the subsequence, only the message passing looks at a longer window, then as we grow the buffer length we very quickly get a good approximation locally, and in these pictures you can see that the errors tend to accumulate mostly at the endpoints.
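A sketch of the buffering idea for a discrete-state model, assuming the smoothed marginals are what feed the gradient: run forward-backward on a window that extends the sampled subsequence by a buffer on each side, initialize the window edges from the current estimate of the stationary distribution (as discussed above), and keep only the core subsequence's marginals. Function names and interfaces are those of this write-up, not the paper's implementation.

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(loglik, logA, logpi):
    """Smoothed state marginals for a discrete-state chain.
    loglik: (L, K) per-time log observation likelihoods; logA: (K, K) log
    transition matrix; logpi: (K,) log distribution used at the window start
    (e.g. the current parameters' stationary distribution)."""
    L, K = loglik.shape
    alpha = np.zeros((L, K))
    beta = np.zeros((L, K))
    alpha[0] = logpi + loglik[0]
    for t in range(1, L):
        alpha[t] = loglik[t] + logsumexp(alpha[t - 1][:, None] + logA, axis=0)
    for t in range(L - 2, -1, -1):
        beta[t] = logsumexp(logA + (loglik[t + 1] + beta[t + 1])[None, :], axis=1)
    post = alpha + beta
    return np.exp(post - logsumexp(post, axis=1, keepdims=True))

def buffered_smoothed_marginals(loglik_full, logA, logpi, start, length, buffer):
    """Run forward-backward on the buffered window around the subsequence,
    but return only the core subsequence's marginals: the buffers absorb
    the edge effects of cutting the chain."""
    lo = max(0, start - buffer)
    hi = min(len(loglik_full), start + length + buffer)
    gamma = forward_backward(loglik_full[lo:hi], logA, logpi)
    return gamma[start - lo : start - lo + length]

# These core marginals would then feed a noisy estimate of the score (via
# Fisher's identity), scaled by T / length, for a stochastic gradient or
# SG-MCMC update of the model parameters.
```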
OK, the very last thing I promised was applying this idea to analyzing that intracranial EEG data, which is really neat data collected from a dog that has seizures. The dog was implanted with sixteen different electrodes and had multiple weeks of recordings, and during that period it experienced about 90 seizures. We grabbed out four minutes around each of these seizures, and if you aggregate over the 16 channels and the seizures, that leads to 70 million time points that we're analyzing with this approach, where we apply our buffered gradient-based updates within the context of what are called stochastic gradient MCMC algorithms. The model we use here is an autoregressive hidden Markov model for the dynamics of each channel. What we're going to do is train on 90 percent of the seizure data and then look at held-out log likelihood on the remaining 10 percent, where we're dividing across seizures. So this plot is held-out log likelihood versus compute time, and what you see is that if you do a standard MCMC scheme, you get this slow learning curve. You might say, oh, that's taking way too long, I'm just going to train on a subset of my seizures and still evaluate on the same held-out seizures; of course that's faster, but not surprisingly you don't get as good a performance when you're throwing out data. The other line is our buffered stochastic gradient MCMC scheme, and you see that we converge in roughly an hour, in contrast to the baseline taking roughly a week. We're also showing what an example segmentation looks like.

OK, I'll stop there and just wrap up by saying that I think there are a lot of opportunities for using neural network models to model complex dynamical processes and move us beyond our standard linear, Gaussian assumptions, and I think the problems of interest are much vaster than just prediction using massive datasets. We saw in particular here how we can think about applying structured sparsity-inducing penalties on the layers of these networks to help handle very limited datasets and get notions of interpretability, but this is of course just the start of what can be done here. So I'll just wrap up by thanking the students and postdocs who did all the real work on these projects. [Applause]

[Audience question about whether the inferred structures should be read as hypotheses]: Right, because we're doing structure learning, we don't have to get the weights exactly right; the claim with the model is that we recover the right structure, and that's a simpler problem. [Audience question about the motion capture groups]: There were five different groups: each arm, each leg, and the core. And I think the main point is to highlight that you capture very well what the correlations are between the dimensions of this high-dimensional vector; we keep the dimensionality, so within each group you still have every joint angle. [Audience question about the depth of the decoders]: It actually doesn't have just one layer, it has many layers, but for both models it doesn't matter how many layers you have: the sparsity is on just one layer, but that one layer knocks out all of the influence flowing down from a latent dimension. There is an issue that you could imagine shrinking the influence of that first layer and then increasing the influence of subsequent layers to compensate, so that it looks shrunk when it really isn't; those are details I didn't go into in the talk, but we enforce identifiability through the prior specification, so you do get to identify whether the structure through that first layer is knocked out or not. Did I answer your question?
[Audience question about the housing clusters]: When you had the different tracts and there were clusters, it wasn't always obvious why they'd be connected. Did those tracts just happen to have statistics that match, so you could group them, or were you able to look and say, oh, each of those has some common underlying feature, to validate whether it's a good clustering or not? [Answer]: So we're not passing any other covariates into the clustering; it doesn't take features of the neighborhoods or that kind of information into account. We're just clustering on the house sales prices, and learning what the clustering on the latent dynamics looks like. But for sure, if you have one thing that's driving the dynamics of many neighborhoods, say one real estate agent controls the whole market in all four of those neighborhoods, that is exactly why those things would cluster together; we just don't have a direct way to validate that. We did do some qualitative digging into whether what was coming out of the clustering makes sense, and beyond the downtown cluster that I pointed out, which is in the paper, the grad student at the time noticed that one cluster grouped together other low-income neighborhoods, which matched her expectations. But beyond that, it's hard to validate whether it's a good clustering or not; the point was really to form more robust estimates of the housing index. So, going back to the earlier answer: that question was one of structure, not of having a good predictive model; here the question is the opposite, having a good estimate of the index, and the structure is helpful in making that more robust, but it's not the actual structure that we care about. And we validated that through a couple of methods, one being held-out house price predictions; we also compared to other index-based measures and showed that our method was closer than these other indices. [Applause]
Info
Channel: Physics Informed Machine Learning
Views: 1,974
Keywords: time series modeling, machine learning, data-driven discovery, deep learning, artificial intelligence, neural networks, reduced order models, dynamical systems, physics
Id: LkoriFtcRss
Length: 71min 13sec (4273 seconds)
Published: Wed Jul 17 2019