Beyond Prediction on Big Data: Interpretable Models for Complex Time Series

Captions
So most of the success stories we hear about in machine learning these days involve a clear prediction goal combined with a massive training data set, but a lot of the problems we wish to solve simply don't fit this bill. In this talk I'm going to focus on learning from time series and discuss some of the open challenges, as well as some paths forward for handling more limited data scenarios and getting at notions of interpretability. It's going to be a slightly more pessimistic talk than the last one we just heard about deep learning and all it can do — it is an amazing tool, but there are some important limitations we need to think about.

The first thing I think is important to hammer home is that time series data appear everywhere, and that statement is increasingly true given the development of new recording devices and platforms. This picture might represent anything from streams of posts or views of users on these platforms to purchase histories of users on e-commerce sites. There has also been the development of many wearable devices that provide interesting streams of activity data, and in healthcare there have been advances like electronic health records and monitoring devices that allow us to start assessing a patient's health status over time.

For the most part, though, machine learning has really ignored time series, and a question is why. Well, it's hard. Typically the number of parameters associated with our models grows very rapidly with the number of series we're observing, or with the complexity of the dynamics we wish to capture, so we need that much more data to learn effectively. Likewise, the algorithms associated with these models tend to be much more computationally intensive per data point, so we need that much more compute power. And finally, the theoretical results we like to establish tend to assume independent observations and are hard to extend to these time-dependent scenarios.

So what should we do — give up on time series and treat the data as if they were independent? To see why that might not be a good idea, think about a challenge I face every week when I go to book a restaurant for my family: my son always wants either pizza or sushi. If I just guess at random which restaurant he's going to want, I'll get it right about 50% of the time. But if I take into account the fact that if we just had sushi he's very likely to want pizza, or vice versa, I can get much better prediction accuracy, maybe something like 90%. So hopefully I've convinced you of the importance of dynamical modeling — of capturing dependencies over time in our observations.

The analysis of time series, or sequential data more generally, is all the rage right now, and that's due to a confluence of factors: the presence of massive web-scale time series and sequential data sources, combined with huge compute power, as well as advances in deep learning that have led to a number of success stories. Recurrent neural networks and variants like LSTMs, GRUs, WaveNet, and seq2seq architectures underlie successes in reinforcement learning, speech generation, machine translation, speech recognition, NLP, healthcare — the list goes on and on.
But there are really three critical components to these success stories. One is the presence of massive time series data, often in the form of lots and lots of replicated series — for example, lots of correspondence data, lots of examples of a robot navigating every part of a maze, or lots of transcribed audio. In a lot of applications, especially scientific applications, that isn't the case. Or sometimes we have lots of data but not for our particular question of interest: imagine we're Amazon and we're trying to forecast the next year of demand for every product in our inventory. We have tons and tons of purchase histories, but what if we get a new product — how do we forecast its demand? That's an example of having very little data for the particular question of interest.

The second critical component is something I call manageable contextual memory. Recurrent neural networks are touted as having a really powerful ability to capture rich historical context from a sequential input, and that's indeed true, but the big success stories are in cases where there's a lot of structure to that input sequence. Out of the massive data set we have available, it's very likely I've seen this structure in a maze before, or I've seen words used in this or a similar context before, or I've seen a patient with these types of symptoms and test results. But what if you're trying to forecast a really complicated weather system, observing a whole bunch of variables in a very nonlinear, noisy process? To learn effectively you just need an enormous, enormous amount of data, and if you simply threw that data at an RNN, I don't think you'd get the kind of performance you'd hope for if you instead leveraged prior beliefs about structured relationships between the variables you're observing. The last critical component is having a clear prediction objective — for example, word error rate for speech recognition, BLEU score for machine translation, and the reward function in reinforcement learning.

So in this talk we're going to move beyond prediction tasks on these really massive data sets and talk about a number of other important time series analysis tasks. One is characterizing the underlying dynamics of the process we're observing. Another is efficiently sharing information in more limited data scenarios. And the last one, which we'll spend the most time on, is learning interpretable structures of interactions among the different series we're observing.

Let's start with the goal of characterizing underlying dynamics, and for this let's think about observing and trying to characterize human motion. Here we have motion capture data: people wear suits with sensors that provide recordings of joint angles over time. The recordings look like a really complicated process, and you'd imagine having to use a really complex dynamical model to capture its evolution. But really you can describe that process as switches between a set of simpler dynamic behaviors, like the person doing jumping jacks, arm circles, side twists, and so on. The questions then are: what are the dynamics describing each of these simpler behaviors, how many behaviors are present in the recording, what if new behaviors appear over time, and what is the switching pattern between behaviors? We developed a fully unsupervised technique — so unsupervised learning is possible, you don't always need training labels. Here is an example of some of the behaviors we discovered in a large collection of motion capture videos analyzed with this model; all the labels you see were added post facto. The high-level idea — taking complex dynamics and describing them as switches between a set of simpler dynamic behaviors — appears in lots of different contexts. Some my group has analyzed include automatically parsing EEG recordings, segmenting conference audio, detecting regime changes in the volatility of financial indices, and segmenting a human chromatin sequence. Overall this framework, among many other traditional time series analysis frameworks, is very powerful for describing these underlying processes, and I think there are ways deep learning methods can be combined with these frameworks to get more flexible descriptions of individual behaviors while still allowing more meaningful, interpretable information to be extracted.
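To make the segmentation idea concrete, here is a minimal sketch using a plain Gaussian hidden Markov model from the hmmlearn library. This is a simplified stand-in for the switching dynamical models described in the talk (which are Bayesian nonparametric switching models, not a vanilla HMM); the data are synthetic and the hyperparameters are arbitrary.

```python
# Minimal sketch: segmenting a multivariate time series into recurring
# "behaviors" with a Gaussian HMM (hmmlearn). A simplified stand-in for the
# talk's switching dynamical models, not the actual method.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

# Toy "joint angle" recording: three regimes with different means and noise.
segments = [rng.normal(loc=m, scale=s, size=(200, 4))
            for m, s in [(0.0, 0.3), (2.0, 0.1), (-1.0, 0.5)]]
X = np.vstack(segments)                      # shape (600, 4)

# Fit an HMM with more states than we expect; unused states get little mass.
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=200)
model.fit(X)

states = model.predict(X)                    # per-timestep behavior label
change_points = np.flatnonzero(np.diff(states)) + 1
print("inferred switch points:", change_points[:10])
```

After fitting, the per-timestep state sequence plays the role of the behavior labels in the talk (jumping jacks, arm circles, and so on), which would then be named by inspection after the fact.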
What I'd like to talk about now is efficiently sharing information in limited data scenarios. Let's return to this example of being Amazon and trying to forecast demand for products. We might have skis and ski jackets that are purchased more in winter months, roasting pans purchased around holidays, running shoes in the spring, and car seats that might not have much of a seasonal trend at all. The question is how to forecast the next year of demand for each of these products in order to stock the inventory — and a really critical question is what to do with a new product. Again, this is a prediction task where we don't have a lot of data for our particular question of interest.

The model we developed leverages repeated patterns seen both across years and across products, and combines that structure with a regression component that builds in side information we might have — for example, features of the text in product descriptions — and for this regression component we used a neural network as the function approximator. It's a simple example of how neural networks can be built into structured models to give us flexibility in the context of more limited data scenarios.

We used this model to analyze some Wikipedia data, looking at page traffic over time as a proxy for product demand (we didn't actually have data from Amazon), with the side information being text features of the article summaries. I'll just show a few qualitative results. Here's one article where we have past years of its traffic in our training set, and this is a held-out year: this is the true page traffic over the year, and this is our forecast. You can see we're able to capture the weekly oscillatory structure as well as the more global structure by leveraging the repeated patterns seen both across this article and across related articles. Here's another article where we're doing what's called a cold-start forecast — we have no past history of this article in our training set — but again, by leveraging relationships with similar articles, we capture the weekly oscillatory structure as well as the more global trends.
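Here is a minimal sketch of the flavor of that model: a learned seasonal profile shared across items plus a small neural network on item side information (e.g., text-derived features). This is a hypothetical simplification for illustration, not the actual forecasting model from the talk; all names and dimensions below are made up.

```python
# Sketch of "structured seasonality + neural regression": a shared weekly
# seasonal effect plus an MLP on item features. Hypothetical setup.
import torch
import torch.nn as nn

class SeasonalDemandModel(nn.Module):
    def __init__(self, n_weeks=52, feat_dim=16, hidden=32):
        super().__init__()
        # One seasonal effect per week of year, shared across all items.
        self.seasonal = nn.Embedding(n_weeks, 1)
        # Neural regression on item side information handles item-level
        # shifts, which is what enables cold-start forecasts for new items.
        self.item_net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, week_idx, item_feats):
        return (self.seasonal(week_idx).squeeze(-1)
                + self.item_net(item_feats).squeeze(-1))

model = SeasonalDemandModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Toy batch: (week of year, item feature vector, observed demand).
week, feats, demand = torch.randint(0, 52, (64,)), torch.randn(64, 16), torch.randn(64)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(week, feats), demand)
    loss.backward()
    opt.step()

# Cold start: a brand-new item only needs its feature vector.
new_item = torch.randn(1, 16)
forecast = model(torch.arange(52), new_item.repeat(52, 1))   # 52-week forecast
```

The design point is the one made in the talk: the structured (seasonal) component carries the repeated patterns shared across items, while the neural regression on side information lets a new item borrow that structure even with no purchase history of its own.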
I promised we were going to move away from prediction tasks, and we will. The first example is a collaboration my group has had with Zillow, where the goal is to model a local-level housing index: assessing the value of housing at a local level, like a census tract, and how that value changes over time. The challenge is that the data are spatiotemporally really, really sparse. To get a sense of this, more than 10% of census tracts have fewer than one house sale per month on average, and more than 40% have fewer than three. We can look at this spatially: here's one census tract, plotting price over time, where each dot is a house sale. This tract has tons of sales, and our goal is to estimate the red curve — the latent value in that tract over time — along with a band of uncertainty. It doesn't look too hard in that scenario. But here's another census tract, hard to even see, that has four house sales over a 17-year period. How do you hope to analyze things in these truly data-scarce situations?

The approach we take is to discover clusters of neighborhoods that have correlated latent price dynamics. Maybe we discover that these neighborhoods all move together, and those neighborhoods move together but differently from the first set, and so on. If we can discover this structure, it allows us to share information between related neighborhoods and thus improve the robustness of our estimates. We used this model to analyze City of Seattle data: here's the city broken into census tracts, with the color of each tract being our inferred cluster label. We discovered 16 clusters, and for each of those 16 clusters I'm showing the average value in that cluster over time. There's a lot of intuitive structure — for example, here's the downtown region of Seattle, which had the largest bust-and-boom cycle over this period — and there's more structure you can read about in the paper.

We can also do a more quantitative analysis by predicting held-out house sales. I said we're not interested in prediction per se, but we can use it as a proxy for assessing the quality of the underlying index. We compare our index to the industry-standard Case-Shiller index, looking at the percent improvement of our predictions over those formed using Case-Shiller, dividing the analysis from the five percent of census tracts with the most observations all the way down to the five percent with the fewest. Across the board we see improvements, but not surprisingly the largest improvements are for the census tracts with the fewest observations, because that's exactly where sharing information is so critical.
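The following toy sketch illustrates only the information-sharing intuition — cluster tracts whose price series move together, then pool observations within a cluster — not the actual Bayesian model used in the Zillow work. The simulation, the crude imputation, and the k-means step are all stand-ins chosen for brevity.

```python
# Toy illustration: cluster census tracts with correlated price movements,
# then estimate a pooled trend per cluster so data-poor tracts borrow strength.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_tracts, n_months, n_true_clusters = 60, 120, 4

# Simulate: each tract follows one of a few shared latent trends plus noise,
# and most monthly observations are missing (few sales per tract).
trends = np.cumsum(rng.normal(size=(n_true_clusters, n_months)), axis=1)
labels_true = rng.integers(0, n_true_clusters, size=n_tracts)
prices = trends[labels_true] + rng.normal(scale=2.0, size=(n_tracts, n_months))
prices[rng.random((n_tracts, n_months)) < 0.8] = np.nan   # ~80% months unobserved

# Crude imputation just so we can cluster; the real model handles missingness.
col_means = np.nanmean(prices, axis=0)
filled = np.where(np.isnan(prices), col_means, prices)
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(filled)

# Pooled index: average the observed sales of all tracts in a cluster
# (months with no sales anywhere in a cluster simply stay NaN here).
cluster_index = np.array([np.nanmean(prices[clusters == k], axis=0)
                          for k in range(4)])
print(cluster_index.shape)   # (4, 120): one pooled trend per cluster
```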
So, to step back: what we're talking about are structures for efficiently sharing information, something my group has thought a lot about over the years — clusters and hierarchies of time series, sparse directed interactions, low-dimensional embeddings. The long-range forecasting of product demand we just discussed is really about low-dimensional structure living in the data, and even the method we talked about at the beginning, switching between behaviors, is a way of sharing information across time as behaviors appear again and again. Once again, the story is that these are really powerful frameworks, and there are a lot of ways — some of which my group has exploited — of combining deep learning techniques with these frameworks so that we can handle more limited data scenarios.

The last part I want to dig into in more depth is learning interpretable structures of interactions, and we'll discuss this in the context of deep generative models. The goal of deep generative models is to synthesize data. For example, maybe we have a bunch of images of handwritten digits — the famous MNIST data set — and we'd like to synthesize new images of handwritten digits that cover the space of all the things people might write. In the last talk we heard that the US Postal Service has been using handwritten digit recognition; that might be one reason to want these kinds of training examples for those algorithms.

Deep generative models take a probabilistic approach, defining a distribution over these complex observations, and these methods are really powerful. One example, from a team at Apple: they used deep generative models to synthesize images of eyes, combined them with an unlabeled data set of real eye images, and used those to refine their synthetic images — that paper won the best paper award at the prestigious CVPR conference in 2017. A team at the Allen Institute for Cell Science used these kinds of methods to predict the location of cellular structures throughout the cell cycle, where the training data only included examples of where the cell envelope, the nucleus, and an individual cellular structure were at a given time. So this is really cool stuff. The challenge in defining these methods is that you need a very flexible distribution to describe these complex observations, but also one you can learn, and one you can sample from in order to simulate or synthesize data sets. Techniques proposed to do this include variational autoencoders and generative adversarial networks, and many, many variants.

We're going to focus on variational autoencoders, or VAEs. A VAE defines what's called an encoder, a neural network mapping from a data point to a latent code — a latent embedding of that data point. We take our high-dimensional, complex observation, like a handwritten digit, push it through the encoder, and it defines a low-dimensional embedding — really a distribution on a low-dimensional space, but you can ignore that part. The other component is a decoder, again defined by a neural network, that maps from the low-dimensional latent space back to the high-dimensional observations. In particular, we sample from this low-dimensional distribution, pass the sample through the decoder, and that synthesizes a high-dimensional observation. The VAE jointly trains this encoder and decoder to optimize something called a variational objective. Really, what we can think of doing is taking all of our training data, pushing it through the encoder, sampling from the latent space to synthesize a whole bunch of data points — images, in this case — looking at the differences between our real data and the synthetic data we generated, and using those differences to drive updates to the neural network parameters underlying the encoder and decoder; then we iterate this process, hoping to generate data closer and closer to our real data set. There's a lot of neural network magic buried under the hood, but that's the high-level idea.
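For readers who want to see the moving parts, here is a minimal, textbook-style VAE sketch: an encoder producing a latent distribution, a reparameterized sample, a decoder, and the variational objective (reconstruction term plus KL to the prior). It is a generic illustration, not the specific models discussed in the talk; the architecture sizes and the random "images" are placeholders.

```python
# Minimal VAE sketch: encoder -> latent distribution -> decoder, trained on
# the variational objective (reconstruction + KL to a standard normal prior).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=2, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def elbo_loss(x, x_recon, mu, logvar):
    recon = F.binary_cross_entropy_with_logits(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 784)                     # stand-in for a batch of images
for _ in range(10):
    opt.zero_grad()
    loss = elbo_loss(x, *model(x))
    loss.backward()
    opt.step()

# Synthesis: sample a latent code from the prior and decode it.
samples = torch.sigmoid(model.dec(torch.randn(16, 2)))
```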
VAEs are really cool, but there are a number of limitations. One is that you need a tremendous amount of data to learn good generative models, good synthesizers. Another is that it's really hard to interpret what's going on — for example, it's really hard to understand how changes in the latent code, that low-dimensional space, influence changes in the really high-dimensional observation space. Imagine we're trying to synthesize human body poses: what if we just wanted to move an arm? If we took a new sample from our VAE, we'd almost surely get a completely different pose altogether.

In lots of cases — most of the time, really, when we're working with complex, high-dimensional observations — we can think of those observations as collections of groups of highly correlated random variables. In human motion, each joint is a collection of highly correlated angle measurements. In neuroimaging data, we tend to do something called a region-of-interest analysis, where cortically localized signals that are highly correlated are grouped together. In finance, we have a whole bunch of indices associated with different assets that tend to be very correlated. So the question is: how do changes in that low-dimensional latent code influence these groups of highly correlated observations?

To see why the VAE can't address this question, let's revisit what it's doing. Think about embedding frames of a video of a person walking: we take each frame, push it through the encoder, get a distribution on latent codes, and sample a particular latent code — in this example just a two-dimensional vector. Then the neural network defining the decoder entangles both of those dimensions in a really complicated way to synthesize the high-dimensional observation, so there's no sense of how one dimension versus another influences the overall observation. Instead, what we propose is defining a separate decoder for each group of observations — for example, one decoder for the neck, one for the elbow, one for the knee, and so on. Even though these are separate decoders, they're all coupled, because they're jointly trained together with the encoder and they share the same latent space. We refer to this model as the OI-VAE, the output interpretable VAE — and I have to be honest, the acronym came before the name, because there have been a million different variants of GANs and VAEs proposed, each with its own cute little name, and we felt jaded and didn't want to contribute to all that, so we said "oy vey," it stuck, and then we had to make it "output interpretable." To actually make it interpretable, we add a sparsity-inducing penalty on the weights that map from the latent code to each of these group-specific decoders. That allows us to infer, for example, that the first latent dimension is only relevant for describing the motion of the elbow and knees but not the neck, and maybe the second dimension is really important for describing what's going on with the neck but not the elbow and knees.
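The sketch below shows the decoder-side structure described above: one small decoder per group of observations, each reading the shared latent code through its own latent-to-group weight matrix, with a group-sparsity penalty over the columns of those matrices so that each latent dimension drives only a few groups. It is a simplified illustration of the idea as described in the talk, not the OI-VAE paper's exact architecture or objective; the group names and dimensions are hypothetical.

```python
# Sketch of the group-decoder idea: separate decoders per group, coupled by a
# shared latent code, with a group-sparsity penalty on the latent-to-group maps.
import torch
import torch.nn as nn

class GroupDecoder(nn.Module):
    def __init__(self, z_dim, group_dim, hidden=32):
        super().__init__()
        self.W = nn.Linear(z_dim, hidden, bias=False)   # latent -> group map
        self.net = nn.Sequential(nn.ReLU(), nn.Linear(hidden, group_dim))

    def forward(self, z):
        return self.net(self.W(z))

z_dim, group_dims = 4, [3, 3, 2]             # e.g., three joints as groups
decoders = nn.ModuleList(GroupDecoder(z_dim, d) for d in group_dims)

def decode(z):
    # Concatenate the per-group reconstructions into the full observation.
    return torch.cat([dec(z) for dec in decoders], dim=-1)

def group_sparsity_penalty(lam=0.1):
    # For each group decoder, penalize the L2 norm of the column of W
    # associated with each latent dimension (a group-lasso-style penalty).
    penalty = 0.0
    for dec in decoders:
        penalty = penalty + dec.W.weight.norm(dim=0).sum()  # per-latent-dim norms
    return lam * penalty

# Training would add group_sparsity_penalty() to the usual VAE objective;
# after training, dec.W.weight.norm(dim=0) tells you which latent dimensions
# a given group actually uses (a near-zero norm means no influence).
z = torch.randn(8, z_dim)
x_recon = decode(z)                          # shape (8, sum(group_dims))
```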
I'll show some results from training this OI-VAE on just 10 short videos of a person walking — a really limited data setting. We can look at the weights matrix that maps what's going on in the different latent dimensions to each of our groups of observations, each joint in the human body. It's a sparse matrix, so it's interpretable, and we can ask questions like: for a given joint, which latent dimensions control its motion, or, for a given latent dimension, which joints does it control? In this table, for every latent dimension we show the top three joints it controls, and it's very interpretable. For example, this one corresponds to the right lower arm, right wrist, and right upper arm, which I'm showing here, and you can walk through every dimension: this first dimension controls aspects of the left lower leg, and this one controls aspects of the head and neck. And this is all learned — the model is learning this structure, these systems of interaction, from the data.

You might think that this gain in interpretability comes at a loss of flexibility of representation, but we show in the paper that that's not the case, especially in limited data scenarios. I'll just show some qualitative examples: think about sampling a latent code at random and passing it through the decoder. First, results from the trained vanilla VAE — my students like to call these samples from the Ministry of Silly Walks, because, for example, this person's leg looks like it's broken; I don't think I've ever hit that pose in human motion. Here are samples from our OI-VAE, which look much more representative of human motion. These are just a few samples, but you can see a much larger set in the paper. The same story holds: by capturing these groups of interactions and leveraging the important relationships that describe these high-dimensional observations, we're better utilizing our available data.

We also applied this to neuroimaging data, where a person sits in a chair with sensors that provide recordings of brain activations over time, and we do a region-of-interest analysis. Here we're interested in defining networks in the brain — understanding which regions of interest work together in response to different stimuli. We can again look at the weights matrix, and what we see is that the different latent dimensions control regions of interest that comprise known networks in the brain, which is cool.
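The "which joints does each latent dimension control" table described above can be read straight off the learned group-sparse weights. Here is a standalone toy sketch of that post-hoc analysis; the weights, group names, and sparsity pattern below are simulated placeholders rather than learned values.

```python
# Reading off the interpretation: given per-group latent-to-group weight
# matrices (columns indexed by latent dimension), rank the groups each
# latent dimension drives. Toy example with randomly generated "weights".
import numpy as np

rng = np.random.default_rng(2)
group_names = ["neck", "right_elbow", "left_knee", "head"]   # hypothetical
z_dim, hidden = 4, 16

# Pretend these were learned under a group-sparsity penalty, so many columns
# are exactly zero.
W = {g: rng.normal(size=(hidden, z_dim)) * (rng.random(z_dim) < 0.4)
     for g in group_names}

# Influence of latent dimension j on group g = column norm of W[g][:, j].
influence = np.array([[np.linalg.norm(W[g][:, j]) for g in group_names]
                      for j in range(z_dim)])                # (z_dim, n_groups)

for j in range(z_dim):
    top = np.argsort(influence[j])[::-1][:3]
    print(f"latent dim {j}: " +
          ", ".join(f"{group_names[i]} ({influence[j, i]:.2f})" for i in top))
```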
So far we've talked about structuring the decoder in a VAE — but what about doing the same thing on the encoder side? You have to think carefully about how to merge these different encodings into the latent embedding, and I won't go through the details — you can read about them if you're interested — but I do want to tell you what it allows you to do. As Paige mentioned in the last talk, a lot of our data sets are very messy: they have a lot of missing data, and VAEs just choke on missing data. Here, though, we can have an entire missing group of observations — maybe the elbow sensor goes out — and we can still do our training and testing. Likewise, we can also handle multimodal data sources: we can run VAEs on things like video, audio, and text together, treating these as different groups.

I'll quickly show some results. Again we trained on these motion capture videos, and in training, just to make things harder, we randomly removed whole limbs — so lots of missing data. At test time we only give the model the observations highlighted in orange, observations of the person's body core, and ask our VAE to synthesize everything else. You're seeing the truth in gray (or blue), and here's what we synthesized — it matches really closely. Just from information about where a person's head, neck, and core are, you can synthesize all the limbs. Another, separate example is multi-view data, for example different views of a person's face. Here we held out one view, passed in the six other views, and asked the VAE to synthesize the held-out view. Each image you're seeing here is a synthetic image — hold out one view, pass in the other six, synthesize the image — and these are really realistic-looking synthesized images, in a very limited data setting actually. So again, very cool stuff.

In the last two minutes I'm going to talk about exactly the same idea, but in the context of an explicit dynamical model. In particular, we're interested in directed interactions in time series: we want to be able to make statements like "time series i is influencing time series j at some lag," and we'd like to discover that kind of structure from the data — and the really key thing is to do this in nonlinear, complex time series. I won't go through the model, but it has a very similar flavor to what we talked about with the OI-VAE: we take neural networks, consider structured representations with sparsity-inducing penalties, and if you shrink away the relevant weights, what you're able to say is that time series i does not influence time series j, because its weights got shrunk to zero. There are some details, but it's the same kind of idea. I just want to mention some of our results. We looked at the DREAM3 challenge data set, a benchmark data set for Granger causality techniques. It's simulated gene regulatory network data — very nonlinear — with five networks, two E. coli and three yeast, each a hundred-dimensional network with really complicated nonlinear dynamics, and we only get 46 replicates of 21 time points, which is insanely little data for this kind of problem.
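Here is a minimal sketch of the neural Granger causality idea as described above: for each target series, fit a small network on lagged values of all series, with a group penalty over the first-layer weights grouped by input series, so that an input series whose weight group shrinks to (near) zero is inferred not to influence the target. This is a simplified illustration on toy data; the published approach drives weight groups exactly to zero, whereas plain Adam with a penalty, as below, only shrinks them.

```python
# Sketch of neural Granger causality: component-wise network with a
# group-sparsity penalty on the input weights, grouped by input series.
import torch
import torch.nn as nn

n_series, lag, hidden, n = 5, 3, 16, 256

class ComponentMLP(nn.Module):
    """Predicts one target series from the lagged history of all series."""
    def __init__(self):
        super().__init__()
        self.inp = nn.Linear(n_series * lag, hidden)   # weights we sparsify
        self.out = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x_lagged):                       # (batch, n_series * lag)
        return self.out(self.inp(x_lagged)).squeeze(-1)

    def input_group_norms(self):
        # Group the first-layer weights by which input series they touch.
        W = self.inp.weight.view(hidden, n_series, lag)
        return W.pow(2).sum(dim=(0, 2)).sqrt()         # one norm per input series

model = ComponentMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Toy data: the target depends only on series 0 and 2.
X = torch.randn(n, n_series, lag)
y = X[:, 0, -1] - 0.5 * X[:, 2, -1] + 0.1 * torch.randn(n)

lam = 0.05
for _ in range(500):
    opt.zero_grad()
    pred = model(X.reshape(n, -1))
    loss = nn.functional.mse_loss(pred, y) + lam * model.input_group_norms().sum()
    loss.backward()
    opt.step()

# Small group norms suggest "does not Granger-cause the target series".
print(model.input_group_norms().detach())
```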
Here are some results in terms of our ability to do structure learning — to detect true versus false edges — using a metric called area under the ROC curve. Here's our performance using a multilayer perceptron, here's our performance using an LSTM, a popular RNN, here's the performance of a past nonlinear gold-standard Granger causality method, here's a linear model, and here's something called a dynamic Bayesian network. What you see is that across the board we get significant gains using these neural network approaches. I would not have thought about using neural networks for this kind of problem a couple of years ago: first, it's structure learning, not prediction — I'd never seen that before — and it's really, really limited data. So it's a very non-traditional use, but you see some important gains there.

In summary, I do think there are really tremendous opportunities for deep learning. It allows us to go beyond simple assumptions like linearity, Gaussianity, and so on. But the problems we're interested in are much vaster than the prediction tasks on massive data sets at the companies that have been driving this deep learning revolution. We've shown two examples, the OI-VAE and neural Granger causality, of using structured representations of neural networks with sparsity-inducing penalties that allow us to get at notions of interpretability and handle limited data scenarios, and I really think this is just the tip of the iceberg of what can be done to handle a much vaster set of problems. I also want to thank the students and postdocs involved in this work, especially those on the OI-VAE and neural Granger projects. And if you want to learn more about machine learning, a plug for the Coursera machine learning specialization that I co-developed with Carlos Guestrin, who you might have heard from this morning. Thank you.

[Audience question about how the University of Washington has changed, with a wave of new professors hired over the last five or six years.] Yeah, it's interesting. While industry has exploded with practical machine learning and deep learning, there's obviously a lot still going on on campus, and there's maybe a little bit of a backlash — a "let's go back to theory." UW has hired a lot of really fantastic people on the more theoretical side of ML, which is really critical. Traditionally in machine learning it was an absolute must to be able to make statements about the properties of what you were doing — to give some kind of guarantees about how it's going to perform as you get more data, or as you let your algorithm run longer — and we've kind of lost that a little bit. We're in this gung-ho phase, which is fun, and you can do cool things, but a lot of us on the research side are trying to slow down a little and think about what we can do, what we can't do, and what we can say about what we can do. There are a lot of things we can do without knowing why we can do them, and I think a lot of us are thinking in that space.

[Answering a question about the Zillow collaboration:] Exactly — they heard me give a talk and then reached out, and Zillow and I have been working together basically since I got to UW, on a number of projects. I had to cut one for time: we had a really interesting paper on analyzing the dynamics of homelessness, and there's a whole slew of different things you can think about. But yeah, that's how these things happen.
[Audience question:] My question is, have you combined any of these neural net methods with traditional time series forecasting methods, like ARIMA, exponential smoothing, or seasonal decomposition? — Yes, we have a paper in the works. To half-answer your question: the forecasting model that's picking up on seasonal patterns isn't actually an explicitly seasonal model, but that's exactly what it's trying to capture, seasonal trends. We have two related projects right now that use these more traditional processes, and this is something that's really interesting to think about. ARIMA, if you aren't familiar, captures long memory in processes. RNNs are supposed to capture long memory too, but it's a very different notion — it's really not long memory, it's more like contextual memory — and attention mechanisms take that to another level in terms of context and the relevance of information. So we're doing a lot of studies trying to compare and combine these different ideas; I think it's really important to better understand what's going on there.

[Audience question about the compute power needed to pull off something like this.] I have two answers. First, none of these things were run on GPUs — these were all CPU-based analyses — with one exception, the face experiments, because there we used some pre-trained models for synthesizing images of faces. We're looking at really limited data situations. If you go out into industry with your massive data set and you're training a model yourself, rather than grabbing one off the shelf and fine-tuning the last few layers, you need a massive, massive amount of compute. What we're doing uses an embarrassingly small amount of compute resources, and that's kind of cool — it's really taking these methods out of their traditional comfort zone, where they were developed, and showing you can do these more traditional things with them. That said, these methods are very compute hungry, and something that wasn't mentioned — unless I fell asleep for a second — is tuning parameters. Oh my goodness, tuning parameters: 99.999% of machine learning time is spent on tuning parameters, and that is super compute-greedy. So it's hard to say absolutely how much, but I can tell you these are all local analyses. We do do things in the cloud, and we use AWS quite a bit, but not at the scale of the massive training sets other people are working with.

— Emily Fox, thank you very much. — Thank you. [Applause]
Info
Channel: GeekWire
Views: 997
Id: sxgeisddYj8
Length: 36min 23sec (2183 seconds)
Published: Fri Aug 10 2018